LinVT: Empower Your Image-level Large Language Model to Understand Videos

News

[2024/12/09] πŸ”₯ Our paper is out! We have released it on arXiv. Please refer to the paper for more details.

Our method achieves top rankings on the leaderboards below with only a 7B-size model:


Leaderboards

VideoVista

MLVU

Model Architecture

πŸ“– Abstract

Large Language Models (LLMs) have been widely used in various tasks, motivating us to develop an LLM-based assistant for videos. Instead of training from scratch, we propose a module that transforms arbitrary well-trained image-based LLMs into video-LLMs after training on video data. To better adapt image-LLMs for processing videos, we introduce two design principles: linear transformation, to preserve the original visual-language alignment, and representative information condensation, to distill redundant video content. Guided by these principles, we propose the Linear Video Tokenizer (LinVT), which enables existing image-LLMs to understand videos. We benchmark LinVT with six recent visual LLMs (Blip-3, Molmo, Mipha, InternVL2, Qwen2-VL and Aquila), showcasing the high compatibility of LinVT. Extensive experiments demonstrate the effectiveness of LinVT in multi-modal video understanding while preserving the original image-comprehension capabilities.
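To make the two design principles concrete, here is a minimal, hypothetical PyTorch sketch of a score-and-condense tokenizer. The class name, shapes, and scoring scheme are our assumptions for illustration, not the released LinVT implementation: frame tokens are scored by a learned linear layer, and the top-scoring tokens are kept as a weighted linear combination, so the output stays in the same token space the image-LLM was aligned to.

```python
import torch
import torch.nn as nn

class LinearVideoTokenizer(nn.Module):
    """Hypothetical sketch: condense many frame tokens into a few video tokens
    using only linear operations on the token features themselves."""

    def __init__(self, dim: int, num_out_tokens: int):
        super().__init__()
        # A single linear layer scores each token's importance.
        self.score = nn.Linear(dim, 1)
        self.num_out_tokens = num_out_tokens

    def forward(self, video_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (T*N, D) tokens flattened across T frames.
        weights = torch.softmax(self.score(video_tokens).squeeze(-1), dim=0)
        # Keep the top-k most representative tokens (information condensation).
        idx = torch.topk(weights, self.num_out_tokens).indices
        # Re-weighting is linear in the token features, so the selected tokens
        # remain in the image-LLM's original visual-language embedding space.
        return video_tokens[idx] * weights[idx].unsqueeze(-1)
```

A usage sketch: with 8 frames of 196 tokens each, `LinearVideoTokenizer(dim, 16)` would reduce 1568 input tokens to 16 video tokens of the same dimensionality before they are fed to the LLM.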

Installation

Install required packages.

```shell
conda create -n LinVT python=3.10.13
conda activate LinVT
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 -c pytorch -c conda-forge -y
pip install -r requirements.txt
```

Model weights

Coming soon.

Inference

```shell
sh evaluate.sh [model_weight] [task_name] --dynamic
```

Training

Coming soon.

Citation

If you find this repository useful, please consider giving a star ⭐ and citation:

@article{gao2024linvt,
  title={LinVT: Empower Your Image-level Large Language Model to Understand Videos},
  author={Gao, Lishuai and Zhong, Yujie and Zeng, Yingsen and Tan, Haoxian and Li, Dengjie and Zhao, Zheng},
  journal={arXiv preprint arXiv:2412.05185},
  year={2024}
}
