V²Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation
V²Flow introduces an advanced vector-quantized image tokenizer designed to seamlessly integrate visual tokenization with existing large language model (LLM) vocabularies. By aligning structural representations and latent distributions between image tokens and textual tokens, V²Flow enables effective autoregressive image generation leveraging pre-trained LLMs.
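The core idea of sharing one vocabulary between visual and textual tokens can be sketched as follows. This is an illustrative toy, not the V²Flow implementation: the vocabulary sizes, function names, and offset-based mapping are all assumptions made for the example.

```python
# Hypothetical sketch: image tokens live in the same id space as the LLM's
# text vocabulary, so a single autoregressive model can emit text and image
# tokens interchangeably. All names and sizes below are illustrative.

TEXT_VOCAB_SIZE = 32000      # e.g. a LLaMA-style text vocabulary (assumed)
IMAGE_CODEBOOK_SIZE = 8192   # size of the visual quantizer's codebook (assumed)

def image_code_to_llm_id(code: int) -> int:
    """Map a visual codebook index into the extended LLM vocabulary."""
    assert 0 <= code < IMAGE_CODEBOOK_SIZE
    return TEXT_VOCAB_SIZE + code

def llm_id_to_image_code(token_id: int) -> int:
    """Inverse mapping: recover the codebook index from an LLM token id."""
    assert TEXT_VOCAB_SIZE <= token_id < TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE
    return token_id - TEXT_VOCAB_SIZE

# A quantized image becomes a sequence of LLM token ids, which the
# autoregressive model can be trained to predict like ordinary text.
image_codes = [5, 120, 8191]
llm_ids = [image_code_to_llm_id(c) for c in image_codes]
assert [llm_id_to_image_code(t) for t in llm_ids] == image_codes
```

In practice this offset mapping is only the bookkeeping side; the paper's contribution is aligning the *distributions* of the two token families so a pre-trained LLM can model visual tokens effectively.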
- 2025-03-30: Release of the complete training and inference codebase for V²Flow. Pretrained models (1024x1024 and 512x512 resolutions) will be available shortly.
- 2025-03-10: V²Flow is released on arXiv.
The complete data preparation, training, and inference instructions for the V²Flow tokenizer can be found here.
- V²Flow tokenizer
- Training and inference code
- Checkpoints
- Gradio Demo
- V²Flow+LLaMA for Autoregressive Visual Generation
- Training and inference code
- Checkpoints
- Gradio Demo
We thank the authors of MAR, LLaVA, and VideoLLaMA for their great work.