This is the official implementation of our paper "QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension"


MAC-AutoML/QuoTA


QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension

Arxiv

😮 Highlights

  • We design a versatile plug-and-play pipeline for existing LVLMs: QuoTA provides a training-free solution applicable to diverse LVLMs, enhancing long video understanding by assigning visual tokens based on text-instruction (query) relevance. This offers a more elegant and direct methodology than conventional attention-based analysis techniques.
  • We propose CoT-driven query decoupling for query-oriented frame scoring: QuoTA employs Chain-of-Thought reasoning to decouple the query into a specifically designed question, enabling high-quality scoring of video frames.
  • QuoTA sets a new state-of-the-art: integrating QuoTA with LLaVA-Video-7B yields a 3.2% average performance improvement across six benchmarks, achieving the best results on five of them, including Video-MME and MLVU, among 7B LVLMs.
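As a rough illustration of the first highlight (not the paper's actual implementation), query-oriented token assignment can be sketched as distributing a fixed visual-token budget across frames in proportion to hypothetical query-relevance scores:

```python
def assign_tokens(frame_scores, total_budget):
    """Allocate a fixed visual-token budget across frames in proportion
    to their query-relevance scores (a simplified sketch, not QuoTA's
    exact allocation scheme)."""
    total = sum(frame_scores)
    if total == 0:
        # No frame is relevant: fall back to a uniform split.
        return [total_budget // len(frame_scores)] * len(frame_scores)
    # Proportional allocation, floored to integers.
    alloc = [int(total_budget * s / total) for s in frame_scores]
    # Hand leftover tokens (lost to flooring) to the highest-scoring frames.
    leftover = total_budget - sum(alloc)
    order = sorted(range(len(frame_scores)),
                   key=lambda i: frame_scores[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc
```

Frames judged more relevant to the query receive more of the token budget, while irrelevant frames are compressed aggressively.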

🔨 Usage

This repo is built upon LLaVA-NeXT:

  • Step 1: Clone LLaVA-NeXT and build its conda environment, then install the following packages in the llava env:
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # Enable PEP 660 support.
pip install -e ".[train]"
# install the qwen toolkit
pip install qwen-vl-utils
  • Step 2: Replace LLaVA-NeXT/llava/model/llava_arch.py with our core/llava_arch.py.

  • Step 3: Copy core/merge.py to LLaVA-NeXT/llava/model/.

  • Step 4: Move our code (tools/ and quota_pipeline.py) to the root directory of LLaVA-NeXT.

  • Step 5: You can now run our pipeline built upon LLaVA-Video-7B by:

python quota_pipeline.py
  • Note that you can also use our pipeline for other LVLMs.
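For concreteness, Steps 2–4 can be done from the shell. The layout below (this repo and LLaVA-NeXT cloned as sibling directories, commands run from inside the QuoTA checkout) is an assumption; adjust the paths to your setup:

```shell
# Assumed layout: QuoTA/ and LLaVA-NeXT/ are sibling directories;
# run these commands from inside the QuoTA checkout.
cp core/llava_arch.py ../LLaVA-NeXT/llava/model/llava_arch.py  # Step 2: replace
cp core/merge.py ../LLaVA-NeXT/llava/model/merge.py            # Step 3: copy
cp -r tools quota_pipeline.py ../LLaVA-NeXT/                   # Step 4: move code
```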

✏️ Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝:

@article{luo2025quota,
  title={QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension},
  author={Luo, Yongdong and Chen, Wang and Zheng, Xiawu and Huang, Weizhong and Yin, Shukang and Lin, Haojia and Fu, Chaoyou and Huang, Jinfa and Ji, Jiayi and Luo, Jiebo and others},
  journal={arXiv preprint arXiv:2503.08689},
  year={2025}
}
