CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models

🌟 Any contributions via PRs, issues, emails or other methods are greatly appreciated.

🔥News

🎖️ Our work has been accepted by AAAI 2025!
🔥 We have released the benchmark on [🤗HuggingFace].
🔥 The paper is also available on [ArXiv].

💡 Motivation

Large Vision-Language Models (LVLMs) have recently demonstrated remarkable success in multi-modal tasks, including advances in Multi-modal Chain-of-Thought (MCoT) reasoning. Despite these successes, current benchmarks still follow a traditional paradigm of multi-modal input and text-only output, which leads to significant drawbacks such as missing visual operations and vague expressions. Motivated by this, we introduce a novel Chain of Multi-modal Thought (CoMT) benchmark to address these limitations. Unlike traditional MCoT benchmarks, CoMT requires both multi-modal input and multi-modal reasoning output, aiming to mimic human-like reasoning that inherently integrates visual operations. Specifically, CoMT consists of four categories: (1) Visual Creation, (2) Visual Deletion, (3) Visual Update, and (4) Visual Selection, which together explore complex visual operations and concise expression in real scenarios. We evaluate various LVLMs and strategies on CoMT, revealing key insights into the capabilities and limitations of current approaches. We hope that CoMT can inspire more research on introducing multi-modal generation into the reasoning process.

🎯 Dataset

CoMT is organized as follows:

comt
├── data.jsonl           # The data file of CoMT.
└── images               # Images of CoMT.
    ├── creation         # Visual Creation task.
    ├── deletion         # Visual Deletion task.
    ├── update           # Visual Update task.
    └── selection        # Visual Selection task.
        └── original     # Original image before stitching.

Each line in data.jsonl follows the format below:

{
  "id": "[ID]",
  "question": "[QUESTION]",
  "option": ["[OPTION1]", "[OPTION2]", ...],
  "image": ["[IMAGE0:IMAGE-ID0]", "[IMAGE1:IMAGE-ID1]", ...],
  "rationale": "[RATIONALE]",
  "answer": "A/B/C/D",
  "type": "[TYPE]", // the task type, like creation, deletion, ...
  "annotations": "[ANNOTATIONS]" // grounding coordinates, tangram annotations, etc.
}
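As a minimal sketch of how the file might be read, the snippet below loads data.jsonl and counts samples per task type. It assumes each line is a plain JSON object (the `//` comments above are documentation only and are not expected in the file); the helper names `load_comt` and `count_by_type` are hypothetical and not part of this repository.

```python
import json
from collections import Counter
from pathlib import Path


def load_comt(data_path):
    """Read a JSONL file and return one dict per non-empty line.

    Assumption: every non-empty line is a standalone JSON object.
    """
    records = []
    with Path(data_path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def count_by_type(records):
    """Count samples per task type using the "type" field."""
    return Counter(record["type"] for record in records)
```

For example, `count_by_type(load_comt("comt/data.jsonl"))` would give the number of samples in each of the task types, assuming the dataset was extracted to `comt/`.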

🚀 Evaluation

We provide evaluate.py to evaluate your own results:

python evaluate.py --data_path [COMT_PATH] \
                   --metric_path [JSONL_PATH]

Each line of the JSONL file at [JSONL_PATH] must follow the format below:

{
  "id": "[ID]",
  "response": "xxx ANSWER: (A)"
}
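The following is a hedged sketch of one way to produce a results file in this format. The helpers `format_response`, `extract_choice`, and `write_results` are hypothetical names, and the regular expression is only an assumption about how the `ANSWER: (A)` suffix could be checked locally; how evaluate.py actually parses responses is not shown here.

```python
import json
import re

# Assumed pattern for the "ANSWER: (X)" suffix; evaluate.py may parse differently.
ANSWER_RE = re.compile(r"ANSWER:\s*\(([A-D])\)")


def format_response(reasoning, choice):
    """Append the final choice to the model's reasoning text."""
    if choice not in ("A", "B", "C", "D"):
        raise ValueError(f"unexpected choice: {choice!r}")
    return f"{reasoning} ANSWER: ({choice})"


def extract_choice(response):
    """Return the last 'ANSWER: (X)' letter in a response, or None if absent."""
    matches = ANSWER_RE.findall(response)
    return matches[-1] if matches else None


def write_results(results, out_path):
    """Write (id, response) pairs as one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for sample_id, response in results:
            record = {"id": sample_id, "response": response}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Running `extract_choice` over your own responses before calling evaluate.py is a cheap way to spot outputs that are missing the final answer marker.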

✒️ Reference

If you find this project useful for your research, please consider citing the following paper:

@inproceedings{cheng2025comt,
  title={CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models},
  author={Cheng, Zihui and Chen, Qiguang and Zhang, Jin and Fei, Hao and Feng, Xiaocheng and Che, Wanxiang and Li, Min and Qin, Libo},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={39},
  year={2025}
}

📲 Contact

Please create GitHub issues here or email Zihui Cheng, Qiguang Chen, or Libo Qin if you have any questions or suggestions.
