Official code for MQuant: Unleashing the Inference Potential of Multimodal Large Language Models via Full Static Quantization. (Paper)
- MQuant is the first full static quantization solution for multimodal large language models (MLLMs), applicable to 5 mainstream MLLMs.
- MQuant proposes Modality-Specific Static Quantization (MSQ) to significantly reduce Time-to-First-Token (TTFT) and Rotation Magnitude Suppression (RMS) to mitigate weight outliers (see the sketch after this list).
- MQuant achieves near-floating-point accuracy (<1% degradation) while reducing inference latency by up to 30% on 5 mainstream MLLMs (Qwen-VL/Intern-VL/Qwen2-VL/GLM-4V/MiniCPM-V) under the W4A8 setting.
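For intuition, here is a minimal sketch of the MSQ idea: visual and text tokens are quantized with separate precomputed (static) per-tensor scales, so neither modality's activation range is clipped by the other's. The function names and the max-based calibration below are our illustrative assumptions, not the released implementation.

```python
import torch

def per_tensor_scale(x: torch.Tensor, n_bits: int = 8) -> float:
    # Symmetric per-tensor scale computed once from offline calibration data.
    qmax = 2 ** (n_bits - 1) - 1
    return x.abs().max().item() / qmax

def fake_quantize(x: torch.Tensor, scale: float, n_bits: int = 8) -> torch.Tensor:
    # Quantize-dequantize with a fixed (static) scale: no runtime statistics.
    qmax = 2 ** (n_bits - 1) - 1
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

def msq_quantize(tokens, is_visual, scale_v, scale_t):
    # Modality-specific static quantization (sketch): route each token
    # through the static scale calibrated for its own modality.
    out = torch.empty_like(tokens)
    out[is_visual] = fake_quantize(tokens[is_visual], scale_v)
    out[~is_visual] = fake_quantize(tokens[~is_visual], scale_t)
    return out

# Offline: calibrate one scale per modality (visual activations often span a
# much wider range, so a single shared scale would crush the text range).
calib_visual, calib_text = torch.randn(64, 16) * 5.0, torch.randn(64, 16)
scale_v, scale_t = per_tensor_scale(calib_visual), per_tensor_scale(calib_text)

# Online: apply the precomputed scales to a mixed token sequence.
tokens = torch.cat([calib_visual[:4], calib_text[:4]])
is_visual = torch.tensor([True] * 4 + [False] * 4)
dequantized = msq_quantize(tokens, is_visual, scale_v, scale_t)
```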
- Release the quantization code for other MLLMs
- Release the quantization code for Qwen-VL
- Release the core code after the paper is accepted
- Update the acknowledgements
- Release the paper link
Any questions or suggestions are welcome! Jiangyong Yu: jiangyongyufocus@gmail.com, Dawei Yang: dawei.yang@houmo.ai, Sifan Zhou: sifanjay@gmail.com
Recently, multimodal large language models (MLLMs) have garnered widespread attention due to their ability to perceive and understand multimodal signals. However, their large parameter sizes and substantial computational demands severely hinder their practical deployment and application. While quantization is an effective way to reduce model size and inference latency, its application to MLLMs remains underexplored. In this paper, we conduct an in-depth analysis of MLLM quantization and identify several challenges: slow inference speed caused by visual tokens, distributional differences across modalities, and performance degradation caused by visual outlier clipping. To address these challenges, we propose MQuant, a quantization framework tailored for MLLMs. Specifically, 1) we design Modality-Specific Quantization (MSQ) and Attention-Invariant Flexible Switching (AIFS) to support per-tensor static quantization and facilitate efficient inference; 2) we introduce a unified LayerNorm-to-RMSNorm transformation, achieving seamless integration of the MLLM vision encoder with Hadamard rotation; 3) we propose Rotation Magnitude Suppression (RMS) to mitigate outliers introduced by Hadamard rotation. Experiments conducted on five mainstream MLLMs demonstrate the superior performance and broad applicability of MQuant. For example, it maintains around 98% of floating-point accuracy under the W4A8 setting. To the best of our knowledge, MQuant is the first quantization solution for MLLMs, paving the way for future advancements in their application.
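To make the LayerNorm-to-RMSNorm transformation above concrete: mean subtraction is a linear map, so it can be folded into the linear layer that produces the normalized input, leaving an RMSNorm that commutes with Hadamard rotation. The sketch below is our own minimal illustration (the function name is ours, and the LayerNorm affine parameters are ignored; a full transformation would also fold them into the following layer).

```python
import torch

@torch.no_grad()
def fold_mean_into_linear(linear: torch.nn.Linear) -> None:
    # LayerNorm(x) equals RMSNorm(x - mean(x)) up to the affine terms, and
    # x - mean(x) = (I - 11^T/d) x is linear, so the centering can be
    # absorbed into the layer producing x: subtract the per-column mean of
    # the weight (and the mean of the bias) to make the output zero-mean.
    linear.weight -= linear.weight.mean(dim=0, keepdim=True)
    if linear.bias is not None:
        linear.bias -= linear.bias.mean()

# Sanity check: the folded layer outputs the centered activations directly.
lin = torch.nn.Linear(8, 8)
x = torch.randn(3, 8)
centered_ref = lin(x) - lin(x).mean(dim=-1, keepdim=True)
fold_mean_into_linear(lin)
assert torch.allclose(lin(x), centered_ref, atol=1e-5)
```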
MQuant is released under the MIT license (see LICENSE).
If you find our paper or code helpful, please consider citing our work:
@misc{yu2025mquantunleashinginferencepotential,
      title={MQuant: Unleashing the Inference Potential of Multimodal Large Language Models via Full Static Quantization},
      author={JiangYong Yu and Sifan Zhou and Dawei Yang and Shuo Wang and Shuoyu Li and Xing Hu and Chen Xu and Zukang Xu and Changyong Shu and Zhihang Yuan},
      year={2025},
      eprint={2502.00425},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.00425},
}