AEGIS: Exploring the Limit of World Knowledge Capabilities for Unified Multimodal Models

Official code for the paper "AEGIS: Exploring the Limit of World Knowledge Capabilities for Unified Multimodal Models".

The capability of Unified Multimodal Models (UMMs) to apply world knowledge across diverse tasks remains a critical, unresolved challenge. Existing benchmarks fall short, offering only siloed, single-task evaluations with limited diagnostic power. To bridge this gap, we propose AEGIS (Assessing Editing, Generation, Interpretation-Understanding for Super-intelligence), a comprehensive multi-task benchmark covering visual understanding, generation, editing, and interleaved generation. AEGIS comprises 1,050 challenging, manually-annotated questions spanning 21 topics (including STEM, humanities, daily life) and 6 reasoning types. To concretely evaluate UMM performance without ambiguous metrics, we propose Deterministic Checklist-based Evaluation (DCE), a protocol utilizing atomic "Y/N" judgments. Our extensive experiments reveal that most UMMs exhibit severe world knowledge deficits and performance degrades significantly with complex reasoning.

[🆕 Blog] [📜 ArXiv Paper] [🤗 HF Datasets]

📖 Highlights

The main contributions of this work are as follows:

Comprehensive Multi-Task Benchmark: Assesses Visual Understanding, Generation, Editing, and Interleaved Generation simultaneously.
Extensive Knowledge Coverage: 1,050 questions across 21 topics (STEM, Humanities, Daily Life) and 6 reasoning types.
Deterministic Evaluation (DCE): A novel checklist-based protocol that replaces ambiguous scores with atomic "Yes/No" judgments for reliability.
In-depth Diagnosis: Reveals severe world knowledge deficits in SOTA UMMs and the impact of reasoning complexity.
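
To make the checklist idea concrete, the following is a minimal sketch of how atomic "Y/N" judgments could be turned into a single number. It assumes every checklist item is weighted equally and that the score is the fraction of "Y" answers; the function name `checklist_score` is hypothetical, and the paper's actual aggregation may differ.

```python
from typing import Iterable


def checklist_score(judgments: Iterable[str]) -> float:
    """Return the fraction of checklist items judged "Y".

    Assumption: items are weighted equally; the paper may aggregate differently.
    """
    answers = [j.strip().upper() for j in judgments]
    if not answers:
        raise ValueError("checklist must contain at least one judgment")
    invalid = [a for a in answers if a not in {"Y", "N"}]
    if invalid:
        raise ValueError(f"expected only 'Y' or 'N', got {invalid}")
    return sum(a == "Y" for a in answers) / len(answers)


# Hypothetical checklist for one model output: three atomic checks.
example = ["Y", "N", "Y"]
print(f"{checklist_score(example):.3f}")  # prints 0.667
```

Because each judgment is binary, two evaluation runs can be compared item by item rather than only by their final scores.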

📊 Evaluation Results

We evaluated various Unified Multimodal Models (UMMs) and Single-Task Generative Models. Below is the overall performance comparison.

🛠️ Installation

We use Gemini-2.5-Pro as the evaluator. The following commands create a conda environment and install the Gemini and OpenAI API clients along with the other dependencies.

conda create -n aegis python=3.11
conda activate aegis
pip install google-genai openai numpy tqdm Pillow
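
After installation, a quick check that the packages can be located may save a failed run later. The snippet below is a sketch, not part of the repository; `missing_modules` is a hypothetical helper, and the import names (`google.genai` for google-genai, `PIL` for Pillow) are the usual ones for these packages.

```python
import importlib.util

# Import names for the packages installed above (Pillow is imported as PIL).
REQUIRED = ["google.genai", "openai", "numpy", "tqdm", "PIL"]


def missing_modules(names):
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. "google") is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


print(missing_modules(REQUIRED) or "all dependencies found")
```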

🤗 Data Preparation

We provide the manually annotated data on HF Datasets. After downloading, organize it under the data directory as follows:

aegis/
└── data/
    ├── annotation.json
    └── images/
        ├── activity/
        ├── anime/
        ├── ...
        ├── physics/
        └── traffic/
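
The layout above can be sanity-checked before running inference. The sketch below only verifies that `annotation.json` and at least one topic folder under `images/` exist; it makes no assumption about the annotation schema, and `check_layout` is a hypothetical helper rather than a script shipped with the repository.

```python
from pathlib import Path


def check_layout(root: str = "data") -> list[str]:
    """Return a list of problems found under the data directory (empty if none)."""
    base = Path(root)
    problems = []
    if not (base / "annotation.json").is_file():
        problems.append(f"missing file: {base / 'annotation.json'}")
    images = base / "images"
    if not images.is_dir():
        problems.append(f"missing directory: {images}")
    elif not any(p.is_dir() for p in images.iterdir()):
        problems.append(f"no topic subdirectories in: {images}")
    return problems


if __name__ == "__main__":
    issues = check_layout()
    print("layout OK" if not issues else "\n".join(issues))
```

Run it from the repository root so that the default `data` path resolves to `aegis/data`.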

🔍 Evaluation

Evaluating a model takes three steps: run inference, judge the outputs with Gemini, and calculate the score.

Inference

# Example: run Gemini and GPT-4o on the understanding task
GEMINI_API_KEY="xxx" GEMINI_BASE_URL="xxx" python gemini/understanding.py --output-dir output-gemini
OPENAI_API_KEY="xxx" OPENAI_BASE_URL="xxx" python gpt/understanding.py --output-dir output-4o

Evaluation

# Example: the understanding task; Gemini judges both sets of outputs
GEMINI_API_KEY="xxx" GEMINI_BASE_URL="xxx" python eval/understanding.py --output-dir output-gemini
GEMINI_API_KEY="xxx" GEMINI_BASE_URL="xxx" python eval/understanding.py --output-dir output-4o

Calculate Score

# Example: the understanding task
python eval_score/understanding.py --input-json output-gemini/understanding_eval/gemini-2.5-pro_results.json
python eval_score/understanding.py --input-json output-4o/understanding_eval/gemini-2.5-pro_results.json
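
The three steps follow a regular naming pattern, which the sketch below uses to assemble the commands for one model. It only builds and prints the commands; it does not run them or set the API keys. The helper `pipeline_commands` is hypothetical, and the assumption that tasks other than `understanding` use the same script and result-file naming is not confirmed by this README.

```python
import shlex


def pipeline_commands(task: str, inference_dir: str, output_dir: str,
                      judge: str = "gemini-2.5-pro") -> list[list[str]]:
    """Build the inference, evaluation, and scoring commands for one task."""
    # Result path pattern taken from the "Calculate Score" example above.
    results = f"{output_dir}/{task}_eval/{judge}_results.json"
    return [
        ["python", f"{inference_dir}/{task}.py", "--output-dir", output_dir],
        ["python", f"eval/{task}.py", "--output-dir", output_dir],
        ["python", f"eval_score/{task}.py", "--input-json", results],
    ]


for cmd in pipeline_commands("understanding", "gemini", "output-gemini"):
    print(shlex.join(cmd))
```

Each command list can be passed to `subprocess.run` once `GEMINI_API_KEY` and `GEMINI_BASE_URL` (or the OpenAI equivalents for the `gpt` scripts) are set in the environment.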

🎫 License

This project is released under the MIT License.

📧 Contact

If you have any questions or suggestions, please feel free to contact us at [email protected].

📜 Citation

If you find this work helpful in your research, please consider citing:

@misc{aegis,
      title={AEGIS: Exploring the Limit of World Knowledge Capabilities for Unified Multimodal Models},
      author={Jintao Lin and Bowen Dong and Weikang Shi and Chenyang Lei and Suiyun Zhang and Rui Liu and Xihui Liu},
      year={2026},
      eprint={2601.00561},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.00561}, 
}
