| [ArXiv] | [🤗HuggingFace] |
🌟 Any contributions via PRs, issues, emails or other methods are greatly appreciated.
- 🎖️ Our work has been accepted by AAAI 2025!
- 🔥 We have released the benchmark on [🤗HuggingFace].
- 🔥 The paper is also available on [ArXiv].
Large Vision-Language Models (LVLMs) have recently demonstrated amazing success in multi-modal tasks, including advancements in Multi-modal Chain-of-Thought (MCoT) reasoning. Despite these successes, current benchmarks still follow a traditional paradigm with multi-modal input and text-modal output, which leads to significant drawbacks such as missing visual operations and vague expressions. Motivated by this, we introduce a novel Chain of Multi-modal Thought (CoMT) benchmark to address these limitations. Different from the traditional MCoT benchmark, CoMT requires both multi-modal input and multi-modal reasoning output, aiming to mimic human-like reasoning that inherently integrates visual operation. Specifically, CoMT consists of four categories: (1) Visual Creation, (2) Visual Deletion, (3) Visual Update, and (4) Visual Selection to comprehensively explore complex visual operations and concise expression in real scenarios. We evaluate various LVLMs and strategies on CoMT, revealing some key insights into the capabilities and limitations of the current approaches. We hope that CoMT can inspire more research on introducing multi-modal generation into the reasoning process.
CoMT is organized as follows:
comt
├── data.jsonl          # The data file of CoMT.
├── images              # Images of CoMT.
│   ├── creation        # Visual Creation task.
│   ├── deletion        # Visual Deletion task.
│   ├── update          # Visual Update task.
│   └── selection       # Visual Selection task.
│       └── original    # Original image before stitching.
Each line in data.jsonl follows the format below:
{
"id": "[ID]",
"question": "[QUESTION]",
"option": ["[OPTION1]", "[OPTION2]", ...],
"image": ["[IMAGE0:IMAGE-ID0]", "[IMAGE1:IMAGE-ID1]", ...],
"rationale": "[RATIONALE]",
"answer": "A/B/C/D",
"type": "[TYPE]", // the task type, like creation, deletion, ...
"annotations": "[ANNOTATIONS]" // grounding coordinates or tangram annotations, etc
}
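As a minimal sketch of reading this format in Python: the helper names `load_comt` and `count_by_type` are hypothetical and not part of the repository. The sketch assumes each line of the actual file is plain JSON, without the `//` comments used above to annotate the schema.

```python
import json
from collections import Counter
from pathlib import Path


def load_comt(path):
    """Read a JSONL file and return its records as a list of dicts.

    Hypothetical helper: assumes one plain JSON object per line and
    skips blank lines.
    """
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def count_by_type(records):
    """Count records per task type, using the "type" field of each record."""
    return Counter(record["type"] for record in records)


# Example usage (path assumed from the directory layout):
# records = load_comt("comt/data.jsonl")
# print(count_by_type(records))
```

Counting by "type" is a quick way to check that all four task categories were loaded before running a model on them.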
We provide evaluate.py to evaluate your own results:
python evaluate.py --data_path [COMT_PATH] \
--metric_path [JSONL_PATH]
Each line of the JSONL file at [JSONL_PATH] must follow the format below:
{
"id": "[ID]",
"response": "xxx ANSWER: (A)"
}
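A results file in this shape could be produced along the following lines. This is a sketch under assumptions: `format_response` and `write_results` are hypothetical helpers, and the `ANSWER: (X)` suffix simply mirrors the example above. How evaluate.py actually parses the response is not specified here, so check the script before relying on this exact form.

```python
import json


def format_response(reasoning, choice):
    """Append an answer marker like the README example to free-form reasoning.

    Hypothetical helper: `choice` is an option letter such as "A".
    """
    return f"{reasoning.strip()} ANSWER: ({choice})"


def write_results(predictions, path):
    """Write one {"id", "response"} JSON object per line.

    `predictions` is an iterable of (id, response) pairs.
    """
    with open(path, "w", encoding="utf-8") as f:
        for sample_id, response in predictions:
            record = {"id": sample_id, "response": response}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Example usage (file name is an assumption):
# write_results([("[ID]", format_response("model reasoning", "A"))], "results.jsonl")
```

Using `json.dumps` for each line avoids hand-written JSON errors such as trailing commas or unescaped quotes in model output.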
If you find this project useful for your research, please consider citing the following paper:
@inproceedings{cheng2025comt,
title={CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models},
author={Cheng, Zihui and Chen, Qiguang and Zhang, Jin and Fei, Hao and Feng, Xiaocheng and Che, Wanxiang and Li, Min and Qin, Libo},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={39},
year={2025}
}
Please create GitHub issues here or email Zihui Cheng, Qiguang Chen, or Libo Qin if you have any questions or suggestions.