📄 Paper Link | 📥 Model Download | ⚡ Quick Start | 📜 License
Orthus is a unified multimodal model that handles discrete text tokens and lossless continuous image features under the AR modeling principle. Unlike prior work, Orthus is the first to enjoy all three of the following benefits simultaneously:
- Unified modeling of AR and Diffusion within a single transformer
- Lossless visual signal for understanding and generation
- Seamless cross-modality learning and mixed-modality generation
Orthus excels at generating images from textual prompts, answering questions about visual inputs, and even crafting lengthy interleaved image-text content. Here are some examples generated by Orthus.
Given images and texts, Orthus embeds both discrete text tokens (from a tokenizer) and continuous patch-wise image features (from a vision encoder) into the same representation space, where an AR transformer backbone models inter- and intra-modality dependencies and produces a sequence of output vectors. These vectors are routed to modality-specific heads: a language modeling head predicts the next text token categorically, while a diffusion head predicts the next image patch feature through conditional diffusion modeling. During inference, Orthus autoregressively predicts either the next text token or the next image patch, as indicated by special transition tokens.
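To make the head routing concrete, below is a minimal PyTorch-style sketch of how the backbone's output vectors could be dispatched to modality-specific heads. The module and names here (`ModalityRouter`, `lm_head`, `diffusion_head`, `is_image_patch`) are illustrative assumptions for exposition, not the actual Orthus implementation.

```python
# Illustrative sketch only -- names, shapes, and head designs are assumptions, not Orthus's code.
import torch
import torch.nn as nn

class ModalityRouter(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, patch_dim: int):
        super().__init__()
        # Language modeling head: categorical prediction of the next text token.
        self.lm_head = nn.Linear(hidden_size, vocab_size)
        # Diffusion head (hypothetical MLP): predicts the noise on the next image
        # patch feature, conditioned on the backbone's output vector.
        self.diffusion_head = nn.Sequential(
            nn.Linear(hidden_size + patch_dim, hidden_size),
            nn.SiLU(),
            nn.Linear(hidden_size, patch_dim),
        )

    def forward(self, hidden, noisy_patch, is_image_patch):
        # hidden:         [seq_len, hidden_size] backbone output vectors
        # noisy_patch:    [seq_len, patch_dim]   noised next-patch features (training time)
        # is_image_patch: [seq_len] bool mask for positions whose next element is an image patch
        text_logits = self.lm_head(hidden[~is_image_patch])
        noise_pred = self.diffusion_head(
            torch.cat([hidden[is_image_patch], noisy_patch[is_image_patch]], dim=-1)
        )
        return text_logits, noise_pred
```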
- Compared with fully AR models (left), Orthus (right) adopts continuous image features, eliminating information loss caused by VQ.
- Compared with AR-diffusion mixed models (middle), Orthus (right) disentangles diffusion from the transformer backbone, avoiding noise disturbance to visual understanding and characterizing the correlation between modalities with straightforward AR modeling; a sketch of this decoupled denoising objective is given below.
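As a rough illustration of the decoupling, here is a minimal sketch of a per-patch conditional denoising objective built with diffusers' `DDPMScheduler` (already in the dependency list): noise is added only to the target patch feature fed to the head, never to the backbone's inputs. The `diffusion_head` callable and tensor names are hypothetical, and the actual Orthus training loop may differ (for instance, a real head would also condition on the timestep).

```python
# Minimal sketch of a per-patch conditional denoising loss (assumptions, not Orthus's code).
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)

def diffusion_head_loss(diffusion_head, hidden, clean_patch):
    # hidden:      [n, hidden_size] backbone outputs at image-patch positions
    # clean_patch: [n, patch_dim]   lossless continuous features of the next patches
    noise = torch.randn_like(clean_patch)
    timesteps = torch.randint(0, scheduler.config.num_train_timesteps, (clean_patch.shape[0],))
    noisy_patch = scheduler.add_noise(clean_patch, noise, timesteps)
    # Only the head sees the noised patch (plus the conditioning vector); the
    # transformer backbone itself never receives noised inputs.
    # A real head would also take a timestep embedding; omitted here for brevity.
    noise_pred = diffusion_head(torch.cat([hidden, noisy_patch], dim=-1))
    return F.mse_loss(noise_pred, noise)
```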
We release Orthus to the public, aiming to support a wider spectrum of research within the academic community. Commercial usage is permitted under the terms outlined in the license.
| Model name | Total parameters | Download |
|---|---|---|
| Orthus-base | 7B | 🤗 Hugging Face |
| Orthus-instruct | 7B | 🤗 Hugging Face |
- Environment setup:
```bash
conda create -n orthus python=3.11
conda activate orthus
```
- Clone this repository and build from source:
```bash
git clone https://github.com/karrykkk/Orthus.git
cd Orthus
```
- Install dependencies:
```bash
pip install torch==2.0.1 accelerate==1.0.0 diffusers==0.31.0 numpy==1.26.4 flash-attn==2.6.3
pip install torchvision==0.15.2
# Install the bundled transformers fork in editable mode
cd transformers
pip install -e .
```
For Multimodal Understanding, try
```bash
CUDA_VISIBLE_DEVICES=0 python inference/multimodal_understanding.py
```
For Text-to-image Generation, try
```bash
CUDA_VISIBLE_DEVICES=0 python inference/t2i_generation.py
```
For Interleaved image-text Generation, try
```bash
CUDA_VISIBLE_DEVICES=0 python inference/interleave_generation.py
```
We sincerely thank the developers of the open-source models Chameleon and Anole, as well as the repositories accelerate, diffusers, and this PR of transformers, as our work is built heavily upon these resources.
This is the official project repository for the following paper. If you find this repository helpful, please kindly cite:
```bibtex
@article{kou2024orthus,
  title={Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads},
  author={Kou, Siqi and Jin, Jiachun and Liu, Chang and Ma, Ye and Jia, Jian and Chen, Quan and Jiang, Peng and Deng, Zhijie},
  journal={arXiv preprint arXiv:2412.00127},
  year={2024}
}
```