LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning

This repo contains the code and data for LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning. LLaVE is a series of powerful unified multimodal embedding models that accept inputs combining text and images, and even video.

Release Notes

  • [2025/03/10] 🔥 We are excited to release LLaVE-0.5B, LLaVE-2B, and LLaVE-7B. The paper, models, and inference code are now publicly available.

MMEB Leaderboard

We achieved the top ranking on the MMEB leaderboard using only a small amount of training data.


Model Performance

LLaVE-7B achieves state-of-the-art performance on MMEB using only 662K training pairs.

Although LLaVE is trained only on image-text data, it generalizes to text-video retrieval tasks in a zero-shot manner and achieves strong performance, demonstrating its remarkable potential for transfer to other embedding tasks.

Models & Scripts

Installation

1. Clone this repository and navigate to the LLaVE folder:

git clone https://github.com/DeepLearnXMU/LLaVE
cd LLaVE

2. Install the inference package:

conda create -n llave python=3.10 -y
conda activate llave
pip install --upgrade pip  # Enable PEP 660 support.
pip install -e ".[train]"
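
Once the environment is set up, a quick sanity check can confirm that the package and model weights are accessible. This is a minimal sketch; it simply loads the smallest checkpoint, zhibinlan/LLaVE-0.5B, the same way the Quick Start below does.

# Sanity check (a sketch): loads the 0.5B checkpoint exactly as in the Quick Start below.
import torch
from llava.model.builder import load_pretrained_model

tokenizer, model, image_processor, max_length = load_pretrained_model(
    "zhibinlan/LLaVE-0.5B", None, "llava_qwen", device_map="auto")
model.eval()
print("CUDA available:", torch.cuda.is_available())
print("Loaded LLaVE-0.5B; max sequence length:", max_length)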

Quick Start

import torch
import copy
from PIL import Image
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, process_images

pretrained = "zhibinlan/LLaVE-0.5B"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # Add any other arguments you want to pass via llava_model_args
model.eval()

# Image + Text -> Text
image = Image.open("figures/example.jpg")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]
conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models

question = DEFAULT_IMAGE_TOKEN + " Represent the given image with the following question: What is in the image"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], "\n")
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
attention_mask = input_ids.ne(tokenizer.pad_token_id)
image_sizes = [image.size]
query_embed = model.encode_multimodal_embeddings(input_ids, attention_mask=attention_mask, images=image_tensor, image_sizes=image_sizes)

target_string = "A cat and a dog"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], target_string)
conv.append_message(conv.roles[1], "\n")
target_string = conv.get_prompt()
target_input_ids = tokenizer(target_string, return_tensors="pt").input_ids.to(device)
attention_mask = target_input_ids.ne(tokenizer.pad_token_id)
target_embed = model.encode_multimodal_embeddings(target_input_ids, attention_mask=attention_mask)

print("A cat and a dog similarity score: ", query_embed @ target_embed.T)
# 0.5B: A cat and a dog similarity score: tensor([[0.4802]])

neg_string = "A cat and a tiger"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], neg_string)
conv.append_message(conv.roles[1], "\n")
neg_string = conv.get_prompt()
neg_input_ids = tokenizer(neg_string, return_tensors="pt").input_ids.to(device)
attention_mask = neg_input_ids.ne(tokenizer.pad_token_id)
neg_embed = model.encode_multimodal_embeddings(neg_input_ids, attention_mask=attention_mask)
print("A cat and a tiger similarity score: ", query_embed @ neg_embed.T)
# 0.5B: A cat and a tiger similarity score: tensor([[0.3413]])
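
Building on the snippet above, the same API can score several candidate captions against one query and rank them by dot-product similarity. This is a minimal sketch that assumes the variables from the Quick Start (model, tokenizer, conv_templates, query_embed, device) are still in scope; encode_candidate is a helper introduced here for illustration.

import copy
import torch

def encode_candidate(text, conv_template="qwen_1_5"):
    # Illustrative helper: wraps a candidate caption in the chat template and
    # embeds it via the text-only path shown above.
    conv = copy.deepcopy(conv_templates[conv_template])
    conv.append_message(conv.roles[0], text)
    conv.append_message(conv.roles[1], "\n")
    input_ids = tokenizer(conv.get_prompt(), return_tensors="pt").input_ids.to(device)
    attention_mask = input_ids.ne(tokenizer.pad_token_id)
    return model.encode_multimodal_embeddings(input_ids, attention_mask=attention_mask)

candidates = ["A cat and a dog", "A cat and a tiger", "Two dogs playing in the snow"]
cand_embeds = torch.cat([encode_candidate(c) for c in candidates], dim=0)
scores = (query_embed @ cand_embeds.T).squeeze(0)  # one similarity score per candidate
for caption, score in sorted(zip(candidates, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.4f}  {caption}")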

MMEB Inference & Evaluation

Download the image zip file from Hugging Face:

wget https://huggingface.co/datasets/TIGER-Lab/MMEB-eval/resolve/main/images.zip
unzip images.zip -d eval_images/

Run the following script to evaluate on MMEB:

prefix="your code dir"
PROMPT_VERSION="qwen_1_5"
RUN_NAME="zhibinlan/LLaVE-2B"
python3 $prefix/LLaVE/llava/eval/model_embed.py \
    --model_name_or_path $RUN_NAME \
    --version $PROMPT_VERSION \
    --dataset_name TIGER-Lab/MMEB-eval \
    --image_folder $prefix/MMEB-eval/eval_images/ \
    --encode_output_path $prefix/outputs/$RUN_NAME \
    --subset_name ImageNet-1K HatefulMemes SUN397 N24News VOC2007 OK-VQA A-OKVQA DocVQA InfographicsVQA ChartQA Visual7W VisDial CIRR NIGHTS WebQA VisualNews_i2t VisualNews_t2i MSCOCO_t2i MSCOCO_i2t MSCOCO Place365 ImageNet-A ImageNet-R ObjectNet Country211 ScienceQA GQA TextVQA VizWiz FashionIQ Wiki-SS-NQ OVEN EDIS RefCOCO Visual7W-Pointing RefCOCO-Matching \
    --dataset_split test --per_device_eval_batch_size 4 \
    --dataloader_num_workers 4 \
    --normalize
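
The script writes the encoded embeddings to --encode_output_path. Whatever the exact on-disk format, MMEB-style retrieval accuracy reduces to checking whether each query's highest-scoring candidate is the correct one. The sketch below illustrates that metric on placeholder tensors; the shapes and the assumption that the i-th target is the ground truth for the i-th query are illustrative only.

import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for the outputs of model_embed.py; shapes are illustrative.
query_embeds = F.normalize(torch.randn(100, 1024), dim=-1)
target_embeds = F.normalize(torch.randn(100, 1024), dim=-1)

scores = query_embeds @ target_embeds.T        # cosine similarity (embeddings are L2-normalized)
predictions = scores.argmax(dim=-1)            # best-scoring target for each query
labels = torch.arange(query_embeds.size(0))    # assume the i-th target matches the i-th query
hit_at_1 = (predictions == labels).float().mean().item()
print(f"hit@1: {hit_at_1:.4f}")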

Zero-shot Video-text Retrieval

Run the following script to evaluate zero-shot video-text retrieval. (The current code only supports single-GPU inference for this task.)

export CUDA_VISIBLE_DEVICES=0
PROMPT_VERSION="qwen_1_5"
RUN_NAME="zhibinlan/LLaVE-7B"

python3 -m torch.distributed.launch --nproc_per_node=1 \
    $prefix/LLaVE/CLIP4Clip/main_task_retrieval.py \
    --model_name_or_path $prefix/checkpoints/$RUN_NAME \
    --version $PROMPT_VERSION \
    --do_eval \
    --data_path $prefix/dataset/MSVD/msvd_data \
    --features_path $prefix/dataset/MSVD/YouTubeClips \
    --output_dir $prefix/outputs/MSVD/$RUN_NAME \
    --datatype msvd \
    --batch_size_val 2
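
The CLIP4Clip-based script above handles video loading and evaluation end to end. To build intuition for how an image-text embedding model can represent a video, the sketch below uniformly samples a few pre-extracted frames, embeds each one through the image path from the Quick Start, and mean-pools the frame embeddings. This is an illustrative assumption, not necessarily the exact strategy used in main_task_retrieval.py; it reuses tokenizer, model, image_processor, device, and target_embed from the Quick Start, and frame_paths is a placeholder list of frame images.

import copy
import torch
from PIL import Image
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from llava.mm_utils import tokenizer_image_token, process_images

def embed_frame(frame, prompt_text="Represent the given frame of a video."):
    # Embeds one frame through the image path from the Quick Start;
    # the prompt wording is illustrative, not the one used by the evaluation script.
    conv = copy.deepcopy(conv_templates["qwen_1_5"])
    conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + " " + prompt_text)
    conv.append_message(conv.roles[1], "\n")
    input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
    attention_mask = input_ids.ne(tokenizer.pad_token_id)
    image_tensor = process_images([frame], image_processor, model.config)
    image_tensor = [t.to(dtype=torch.float16, device=device) for t in image_tensor]
    return model.encode_multimodal_embeddings(input_ids, attention_mask=attention_mask, images=image_tensor, image_sizes=[frame.size])

# frame_paths is a placeholder for frames uniformly sampled from one video clip.
frame_paths = ["frames/clip0_000.jpg", "frames/clip0_008.jpg", "frames/clip0_016.jpg"]
frame_embeds = torch.cat([embed_frame(Image.open(p)) for p in frame_paths], dim=0)
video_embed = frame_embeds.mean(dim=0, keepdim=True)  # mean-pool per-frame embeddings

print("caption similarity:", video_embed @ target_embed.T)  # target_embed from the Quick Start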

Acknowledgement

  • We have adapted code from LLaVA-NeXT, which is a training framework for a family of open large multimodal models.
  • We used data from VLM2Vec, which includes 36 datasets.

Citation

@article{lan2025llave,
  title={LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning},
  author={Lan, Zhibin and Niu, Liqiang and Meng, Fandong and Zhou, Jie and Su, Jinsong},
  journal={arXiv preprint arXiv:2503.04812},
  year={2025}
}
