
[TMM under review] This is the PyTorch code for our paper "Visual Position Prompt for MLLM based Visual Grounding".


WayneTomas/VPP-LLaVA


1Nanjing University of Science and Technology; 2Shanghai Artificial Intelligence Laboratory
✉ Corresponding Author
* This work was done during his internship at Shanghai Artificial Intelligence Laboratory.

Updates

  • 19 Mar, 2025: 💥💥 Our paper "Visual Position Prompt for MLLM based Visual Grounding" has been submitted to IEEE Transactions on Multimedia (TMM).

Although Multimodal Large Language Models (MLLMs) excel at various image-related tasks, they encounter challenges in precisely aligning coordinates with spatial information within images, particularly in position-aware tasks such as visual grounding. This limitation arises from two key factors. First, MLLMs lack explicit spatial references, making it difficult to associate textual descriptions with precise image locations. Second, their feature extraction processes prioritize global context over fine-grained spatial details, leading to weak localization capability. To address these issues, we introduce VPP-LLaVA, an MLLM equipped with a Visual Position Prompt (VPP) to improve its grounding capability. VPP-LLaVA integrates two complementary mechanisms. The global VPP overlays learnable, axis-like embeddings onto the input image to provide structured spatial cues. The local VPP focuses on fine-grained localization by incorporating position-aware queries, which suggest probable object locations. We also introduce VPP-SFT, a dataset of 0.6M samples that consolidates high-quality visual grounding data into a compact format for efficient model training. Training on this dataset with VPP enhances the model's performance, achieving state-of-the-art results on standard grounding benchmarks despite using fewer training samples than other MLLMs such as MiniGPT-v2, which rely on much larger datasets ($\sim$21M samples). The code and VPP-SFT dataset will be available at https://github.com/WayneTomas/VPP-LLaVA upon acceptance.
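To make the global VPP idea above concrete, here is a minimal, hypothetical PyTorch sketch of overlaying learnable, axis-like embeddings on a grid of visual patch features. Module and parameter names such as `GlobalVPP`, `grid_size`, and `hidden_dim` are illustrative assumptions and do not reflect the released VPP-LLaVA implementation; the local VPP (position-aware queries) is omitted here.

```python
# Minimal sketch of the global Visual Position Prompt (VPP) concept.
# NOTE: all names and shapes are hypothetical; see the released code for the
# actual VPP-LLaVA implementation.
import torch
import torch.nn as nn


class GlobalVPP(nn.Module):
    """Overlay learnable, axis-like position embeddings on a patch-feature grid."""

    def __init__(self, hidden_dim: int, grid_size: int = 24):
        super().__init__()
        # One learnable embedding per row and per column, acting like axis ticks.
        self.row_embed = nn.Parameter(torch.zeros(grid_size, hidden_dim))
        self.col_embed = nn.Parameter(torch.zeros(grid_size, hidden_dim))
        nn.init.trunc_normal_(self.row_embed, std=0.02)
        nn.init.trunc_normal_(self.col_embed, std=0.02)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, grid_size * grid_size, hidden_dim) from the vision encoder.
        b, n, d = patch_feats.shape
        g = int(n ** 0.5)
        # Combine row/column embeddings into a (g, g, d) grid and add it to the
        # visual features as a structured spatial cue.
        axis_prompt = self.row_embed[:g, None, :] + self.col_embed[None, :g, :]
        return patch_feats + axis_prompt.reshape(1, g * g, d)


if __name__ == "__main__":
    feats = torch.randn(2, 24 * 24, 1024)          # dummy CLIP-like patch features
    prompted = GlobalVPP(hidden_dim=1024)(feats)   # same shape, now position-aware
    print(prompted.shape)                          # torch.Size([2, 576, 1024])
```

The prompted features would then be projected and fed to the LLM as in a standard LLaVA-style pipeline; the sketch only illustrates where the structured spatial cue is injected.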

Cite

@misc{tang2025visualpositionpromptmllm,
      title={Visual Position Prompt for MLLM based Visual Grounding}, 
      author={Wei Tang and Yanpeng Sun and Qinying Gu and Zechao Li},
      year={2025},
      eprint={2503.15426},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.15426}, 
}
Paper link: https://arxiv.org/abs/2503.15426
