
code for "CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models"


czhhzc/CoMT


CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models


| [ArXiv] | [🤗HuggingFace] |

🌟 Any contributions via PRs, issues, emails or other methods are greatly appreciated.

🔥News

  • 🎖️ Our work has been accepted by AAAI 2025!
  • 🔥 We have released the benchmark on [🤗HuggingFace].
  • 🔥 The paper is also available on [ArXiv].

💡 Motivation

Large Vision-Language Models (LVLMs) have recently demonstrated amazing success in multi-modal tasks, including advancements in Multi-modal Chain-of-Thought (MCoT) reasoning. Despite these successes, current benchmarks still follow a traditional paradigm with multi-modal input and text-modal output, which leads to significant drawbacks such as missing visual operations and vague expressions. Motivated by this, we introduce a novel Chain of Multi-modal Thought (CoMT) benchmark to address these limitations. Different from the traditional MCoT benchmark, CoMT requires both multi-modal input and multi-modal reasoning output, aiming to mimic human-like reasoning that inherently integrates visual operation. Specifically, CoMT consists of four categories: (1) Visual Creation, (2) Visual Deletion, (3) Visual Update, and (4) Visual Selection to comprehensively explore complex visual operations and concise expression in real scenarios. We evaluate various LVLMs and strategies on CoMT, revealing some key insights into the capabilities and limitations of the current approaches. We hope that CoMT can inspire more research on introducing multi-modal generation into the reasoning process.

🎯 Dataset

The CoMT dataset is organized as follows:

comt
├── data.jsonl            # The data file of CoMT.
└── images                # Images of CoMT.
    ├── creation          # Visual Creation task.
    ├── deletion          # Visual Deletion task.
    ├── update            # Visual Update task.
    └── selection         # Visual Selection task.
        └── original      # Original images before stitching.

Each line in data.jsonl follows the format below:

{
  "id": "[ID]",
  "question": "[QUESTION]",
  "option": ["[OPTION1]", "[OPTION2]", ...],
  "image": ["[IMAGE0:IMAGE-ID0]", "[IMAGE1:IMAGE-ID1]", ...],
  "rationale": "[RATIONALE]",
  "answer": "A/B/C/D",
  "type": "[TYPE]", // the task type, like creation, deletion, ...
  "annotations": "[ANNOTATIONS]" // grounding coordinates or tangram annotations, etc
}
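As a minimal sketch of reading this format, the snippet below loads a JSON Lines file and reports which of the documented keys a record lacks. The helper names `load_jsonl`, `missing_keys`, and `count_by_type` are hypothetical and not part of this repository. It assumes that the `//` comments in the format above are explanatory only and do not appear in the actual file, and that each non-empty line is one standalone JSON object.

```python
import json
from collections import Counter

# Keys listed in the record format above.
EXPECTED_KEYS = {
    "id", "question", "option", "image",
    "rationale", "answer", "type", "annotations",
}


def load_jsonl(path):
    """Return a list of dicts, one per non-empty line of a JSON Lines file."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def missing_keys(record):
    """Return the documented keys that are absent from a record, sorted."""
    return sorted(EXPECTED_KEYS - record.keys())


def count_by_type(records):
    """Count records per task type (e.g. creation, deletion, ...)."""
    return Counter(r.get("type") for r in records)
```

For example, `count_by_type(load_jsonl("comt/data.jsonl"))` would give the number of examples per task, assuming the dataset has been downloaded into a local `comt` directory laid out as shown earlier.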

🚀 Evaluation

We provide evaluate.py to evaluate your own results:

python evaluate.py --data_path [COMT_PATH] \
                   --metric_path [JSONL_PATH]

Each line of the JSONL file passed to --metric_path must follow this format:

{
  "id": "[ID]",
  "response": "xxx ANSWER: (A)"
}
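The following is a rough sketch of producing such a results file from your own model outputs. The helper names `write_results` and `extract_choice` are hypothetical. How evaluate.py parses the `response` field is not described here, so the regular expression below is only an assumed local sanity check modeled on the `ANSWER: (A)` example, not the script's actual logic.

```python
import json
import re

# Assumption: responses end with a marker shaped like "ANSWER: (A)",
# as in the example above. evaluate.py may parse responses differently.
ANSWER_RE = re.compile(r"ANSWER:\s*\(([A-D])\)")


def extract_choice(response):
    """Return the last option letter found in a response, or None."""
    matches = ANSWER_RE.findall(response)
    return matches[-1] if matches else None


def write_results(results, out_path):
    """Write (id, response) pairs as one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for example_id, response in results:
            row = {"id": example_id, "response": response}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Running `extract_choice` over your responses before evaluation can flag outputs that lack an answer marker; the resulting file path is what you would pass as `--metric_path`.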

✒️ Reference

If you find this project useful for your research, please consider citing the following paper:

@inproceedings{cheng2025comt,
  title={CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models},
  author={Cheng, Zihui and Chen, Qiguang and Zhang, Jin and Fei, Hao and Feng, Xiaocheng and Che, Wanxiang and Li, Min and Qin, Libo},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={39},
  year={2025}
}

📲 Contact

If you have any questions or suggestions, please create a GitHub issue here or email Zihui Cheng, Qiguang Chen, or Libo Qin.
