[NAACL 2025] VividMed

This is the official repository for the paper VividMed: Vision Language Model with Versatile Visual Grounding for Medicine.

arXiv preprint

Environment Setup

git clone --recursive https://github.com/function2-llx/MMMM.git
cd MMMM
mamba env create -f environment.yaml
mamba activate mmmm
BUILD_MONAI=1 pip install --no-build-isolation -e third-party/LuoLib/third-party/MONAI
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo \
"export PYTHONPATH=$PWD:$PYTHONPATH
export BUILD_MONAI=1" \
>> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
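After re-activating the environment (so that the hook above takes effect), a quick sanity check along these lines confirms that the environment variables and the locally built MONAI are picked up (a minimal sketch, not part of the repository):

import os
import monai

# Both variables are set by the activation hook written above; they should be
# present after re-activating the mmmm environment.
assert os.environ.get('BUILD_MONAI') == '1'
print('PYTHONPATH =', os.environ.get('PYTHONPATH'))
# Importing MONAI verifies that the editable install from third-party/ works.
print('MONAI', monai.__version__)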

Data Preparation

Download the datasets (MIMIC-CXR, CT-RATE, etc.) and extract them to data/origin/<data type>/<dataset name>, where <data type> is local for image datasets with localized annotations (bounding boxes, segmentation masks) or vision-language for VQA and radiology report datasets.

Then run the pre-processing script for each dataset. For instance, for MIMIC-CXR, execute scripts/data/vl/MIMIC-CXR/MIMIC-CXR.py. After pre-processing finishes, the processed data are placed under data/processed/vision-language/MIMIC-CXR, where <split>.json lists the data items for each split.
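The exact schema of <split>.json is defined by each dataset's pre-processing script, so the quickest way to see which fields a dataset provides is to load one split and inspect an entry. A minimal sketch, assuming the MIMIC-CXR example above and a train split:

import json
from pathlib import Path

split_path = Path('data/processed/vision-language/MIMIC-CXR/train.json')
items = json.loads(split_path.read_text())

# Print one entry to see the fields (image paths, report text, etc.) this
# dataset provides; the schema varies across datasets.
example = items[0] if isinstance(items, list) else next(iter(items.values()))
print(len(items), 'items in', split_path.name)
print(example)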

Training

THUDM/cogvlm-chat-hf is used as the base VLM.

Example commands for the three-stage training of VividMed are given below; adjust the batch size and gradient accumulation steps to your number of devices. Note that we disable torch.compile due to a compatibility issue with our dependencies; enable it if you are ready to address it.

# Stage 1: Visual Grounding Pre-training
python scripts/cli.py fit -c conf/phase-vg/fit.yaml --compile false --data.dataloader.train_batch_size ... --trainer.accumulate_grad_batches ... --seed_everything $RANDOM --model.freeze_sam false --model.freeze_isam false
# Stage 2: Medical Visual Instruction Tuning
python scripts/cli.py fit -c conf/phase-vlm/fit.yaml --compile false --data.dataloader.train_batch_size ... --trainer.accumulate_grad_batches ... --seed_everything $RANDOM
# Stage 3: Alignment (grounded report generation)
python scripts/cli.py fit -c conf/phase-grg/fit.yaml --compile false --data.dataloader.train_batch_size ... --trainer.accumulate_grad_batches ... --seed_everything $RANDOM --model.freeze_sam false --model.freeze_isam false

Grounded Reports Construction

We demonstrate how to follow our proposed pipeline to construct visually grounded reports, i.e., textual reports accompanied by localized annotations. Results are saved under data/processed/visual-grounding.

Key Phrases Identification & Positive Targets Filtering

First, instruct an LLM (Meta Llama 3 70B in our case) to identify key phrases in the report text that correspond to anatomical structures or abnormality findings in the images. Then, instruct the LLM to keep only the positive targets from the output of the previous step.

Both steps are performed by the script at scripts/data/vg/tag.py, which uses the vLLM engine for efficient inference (thanks a lot!).
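The snippet below is only an illustrative sketch of what such batched inference looks like with vLLM; the actual prompts, chat templating, and output parsing used in the paper live in scripts/data/vg/tag.py, and the prompt text and tensor_parallel_size shown here are placeholders.

from vllm import LLM, SamplingParams

# Llama 3 70B needs several GPUs; tensor_parallel_size=4 is an assumption.
llm = LLM(model='meta-llama/Meta-Llama-3-70B-Instruct', tensor_parallel_size=4)
sampling = SamplingParams(temperature=0.0, max_tokens=256)

reports = ['No acute cardiopulmonary process. Mild cardiomegaly.']  # example report text
prompts = [
    'List the phrases in the following radiology report that refer to '
    'anatomical structures or abnormality findings:\n' + report
    for report in reports
]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)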

Localized Annotations Generation

After the phrases to be grounded have been identified from the report text, use pre-trained models to generate the corresponding pseudo-labels.

For CT-RATE, execute scripts/data/vg/CT-RATE/sat/inference.py to generate segmentation masks of anatomical structures, using the pre-trained SAT model.

For MIMIC-CXR, we train a DINO detector on the VinDr-CXR dataset with the detrex framework and use the bounding boxes it generates for disease findings. The inference script is at scripts/data/vg/MIMIC-CXR/detrex/tools/MIMIC-CXR-vg/infer.py.

Pre-trained Checkpoints

We have released the checkpoint for Stage 2 (see the release page). The checkpoint for Stage 3 will be released soon.

Our released checkpoints are PEFT LoRA adapters. Please follow the official PEFT instructions to load the adapter and merge it with the base model.
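A minimal sketch of that workflow is shown below; the adapter path is a placeholder for wherever you extracted the released checkpoint, and VividMed's additional grounding modules may require the repository's own loading code rather than a plain merge.

from transformers import AutoModelForCausalLM
from peft import PeftModel

# CogVLM ships custom modeling code, hence trust_remote_code=True.
base = AutoModelForCausalLM.from_pretrained('THUDM/cogvlm-chat-hf', trust_remote_code=True)
# Path to the adapter extracted from the release page (placeholder).
model = PeftModel.from_pretrained(base, 'path/to/downloaded/adapter')
model = model.merge_and_unload()  # fold the LoRA weights into the base model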

Downstream Tasks

We provide scripts for fine-tuning models on downstream medical VLM tasks (medical VQA, report generation). See the CLI script at scripts/finetune/cli.py for more details.
