HaploVL is a multi-modal foundation model that provides unified understanding of text, image, and video inputs through a single transformer architecture.
This repository contains the PyTorch implementation, model weights, and training code for HaploVL.
🌟 Unified Architecture: Single transformer model supporting early fusion of multi-modal inputs and auto-regressive response generation
🌟 Efficient Training: Optimized training recipe leveraging pre-trained knowledge with reduced resource consumption
🌟 Scalable Design: Flexible framework supporting both Ascend NPU and GPU environments
🌟 Extended Capabilities: Native support for multi-image understanding and video processing
# Option 1: install directly from GitHub
pip install git+https://github.com/Tencent/HaploVLM.git
# Option 2: clone the repository and install in editable mode
git clone https://github.com/Tencent/HaploVLM.git
cd HaploVLM
pip install -e . -v
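A quick way to confirm the installation is to import the classes used in the usage example below:

# Sanity check: these imports should succeed after either install option
from haplo import HaploProcessor, HaploForConditionalGeneration
print(HaploProcessor.__name__, HaploForConditionalGeneration.__name__)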
Basic usage example:
import torch

from haplo import HaploProcessor, HaploForConditionalGeneration

# Load the processor and pre-trained model weights
processor = HaploProcessor.from_pretrained('stevengrove/Haplo-7B-Pro')
model = HaploForConditionalGeneration.from_pretrained(
    'stevengrove/Haplo-7B-Pro',
    torch_dtype=torch.bfloat16
).to('cuda')

# Build a chat-style conversation that mixes text and an image
conversation = [
    {'role': 'user', 'content': [
        {'type': 'text', 'text': 'Describe this image.'},
        {'type': 'image', 'path': 'assets/example-image.png'}
    ]}
]

# Apply the chat template, generate a response, and decode it
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors='pt'
).to('cuda')
outputs = model.generate(inputs)
print(processor.decode(outputs[0]))
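For multi-image inputs (see the capabilities table below), the same chat-template interface should apply. The snippet below is a minimal sketch assuming additional `image` entries follow the same schema as above; `assets/example-image-2.png` is a hypothetical path used only for illustration.

# Minimal sketch: extra 'image' entries are assumed to follow the same schema as above;
# 'assets/example-image-2.png' is a hypothetical second image path.
conversation = [
    {'role': 'user', 'content': [
        {'type': 'text', 'text': 'What differs between these two images?'},
        {'type': 'image', 'path': 'assets/example-image.png'},
        {'type': 'image', 'path': 'assets/example-image-2.png'}
    ]}
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors='pt'
).to('cuda')
print(processor.decode(model.generate(inputs)[0]))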
Launch an interactive demo:
python demo/demo.py \
-m "stevengrove/Haplo-7B-Pro-Video" \
--server-port 8080 \
--device cuda \
--dtype bfloat16
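The command above serves the video checkpoint; the image checkpoint used earlier (stevengrove/Haplo-7B-Pro) should work the same way via -m. Once the server starts, the demo is reachable in a browser on the port set by --server-port (here, http://localhost:8080).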
Multi-Modal Capabilities
Category | Example |
---|---|
Single Image Understanding | ![]() |
Multi-Image Understanding | ![]() |
Video Understanding | ![]() |
@article{HaploVL,
  title={HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding},
  author={Yang, Rui and Song, Lin and Xiao, Yicheng and Huang, Runhui and Ge, Yixiao and Shan, Ying and Zhao, Hengshuang},
  journal={arXiv preprint arXiv:2503.14694},
  year={2025}
}