GitHub - Row11n/Prova: [AAAI-25] Official repository of "Comprehensive Multi-Modal Prototypes are Simple and Effective Classifiers for Vast-Vocabulary Object Detection"

Comprehensive Multi-Modal Prototypes are Simple and Effective Classifiers
for Vast-Vocabulary Object Detection

Yitong Chen¹,²*, Wenhao Yao¹*, Lingchen Meng¹*, Sihong Wu¹, Zuxuan Wu¹,²†, Yu-Gang Jiang¹

¹ Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University,
² Shanghai Innovation Institute

* Equal contributions; † Corresponding author.

[Paper AAAI-25] [Checkpoints]

Introduction

We introduce Prova, a multi-modal prototype classifier for vast-vocabulary object detection. Prova extracts comprehensive multi-modal prototypes and uses them to initialize alignment classifiers, addressing the recognition failures that arise when the vocabulary is very large. On V3Det, this simple method improves one-stage, two-stage, and DETR-based detectors in both supervised and open-vocabulary settings while adding only projection layers. In the supervised setting, Prova improves Faster R-CNN, FCOS, and DINO by 3.3, 6.2, and 2.9 AP, respectively. In the open-vocabulary setting, Prova achieves a new state of the art with 32.8 base AP and 11.0 novel AP, gains of 2.6 and 4.3 over previous methods.
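
As a rough illustration of prototype-based classification, and not the repository's implementation, the sketch below scores one region feature against per-class prototypes by cosine similarity and adds the visual and textual scores. The function names, the toy vectors, the omission of the learned projection layers, and the choice to sum the two modalities are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def prototype_logits(feature, prototypes):
    """One similarity score per class prototype."""
    return [cosine(feature, p) for p in prototypes]

def classify(feature, visual_protos, textual_protos):
    """Sum visual and textual scores (an assumed fusion rule) and pick the best class."""
    vis = prototype_logits(feature, visual_protos)
    txt = prototype_logits(feature, textual_protos)
    scores = [v + t for v, t in zip(vis, txt)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy example: 3 classes, 2-D features (hypothetical values).
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
textual = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
pred, scores = classify([0.2, 1.0], visual, textual)
print(pred)
```

In a real detector the prototypes would serve as the initial classifier weights and be trained further together with the projection layers; here they stay fixed to keep the example short.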

Cookbook

Prova is an additional prototype-based classification head and is easy to implement in any codebase. We release the training code for DINO-Prova on V3Det, built upon the RichSem codebase.

Install

conda init
conda create -n Prova python=3.8 -y
conda activate Prova
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

# install packages
pip install numpy==1.21.6
pip install scipy termcolor addict yapf==0.40.0 timm==0.5.4 lvis pycocotools ftfy regex PyWavelets mmengine
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
cd models/richsem/ops
python setup.py build install --user
cd ../../..

Multi-Modal Prototypes

The prototypes should be organized as:

Prova
  └── prototypes
      ├── visual_prototypes_v3det.pt
      ├── textual_prototypes_v3det.pt
      └── visual_prototypes_v3det_ovd.pt
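
Before launching a run, it may help to confirm that the three prototype files are in place. The sketch below only checks for their presence under `prototypes/`; the helper name `missing_prototypes` is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names taken from the layout above.
EXPECTED_PROTOTYPES = (
    "visual_prototypes_v3det.pt",
    "textual_prototypes_v3det.pt",
    "visual_prototypes_v3det_ovd.pt",
)

def missing_prototypes(root="."):
    """Return the expected prototype files that are absent under <root>/prototypes."""
    proto_dir = Path(root) / "prototypes"
    return [name for name in EXPECTED_PROTOTYPES if not (proto_dir / name).is_file()]

if __name__ == "__main__":
    missing = missing_prototypes(".")
    print("all prototypes found" if not missing else f"missing: {missing}")
```

Run it from the repository root, or pass the path to the `Prova` directory as `root`.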

Data

The ImageNet dataset and V3Det dataset should be organized as:

Prova
  └── DATASET
      ├── V3Det_ImageNet21k_Cls_100/
      └── v3det
          ├── annotations/
          │      ├── category_name_13204_v3det_2023_v1.txt
          │      └── v3det_2023_v1_category_tree.json
          ├── images/
          ├── test/
          ├── v3det_2023_v1_category_tree.json
          ├── v3det_2023_v1_train_ovd_base.json
          ├── v3det_2023_v1_train.json
          ├── v3det_2023_v1_val.json
          └── v3det_2023_v1_val_tiny.json

v3det_2023_v1_val_tiny.json is a subset of the V3Det val set, used to speed up validation.
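
How the tiny split was produced is not stated here. If you need a comparable subset of your own, the sketch below samples a fixed number of images and keeps only their annotations. It assumes COCO-style keys (`images`, `annotations`, `categories`, `image_id`); the function name, sample size, and seed are hypothetical.

```python
import random

def make_tiny_split(coco, num_images, seed=0):
    """Return a COCO-style dict restricted to a random sample of images."""
    rng = random.Random(seed)
    images = rng.sample(coco["images"], min(num_images, len(coco["images"])))
    keep = {img["id"] for img in images}
    annotations = [a for a in coco["annotations"] if a["image_id"] in keep]
    # Categories are kept whole so class indices stay aligned with the full set.
    return {"images": images, "annotations": annotations, "categories": coco["categories"]}

# Toy input (hypothetical); for real use, json.load the full validation file instead.
full = {
    "images": [{"id": i} for i in range(10)],
    "annotations": [{"id": i, "image_id": i % 10, "category_id": 1} for i in range(30)],
    "categories": [{"id": 1, "name": "example"}],
}
tiny = make_tiny_split(full, num_images=3)
print(len(tiny["images"]), len(tiny["annotations"]))
```

Fixing the seed keeps the subset reproducible, so validation numbers from different runs remain comparable.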

Training

Train DINO-Prova in the supervised setting without ImageNet:

bash scripts/prova_dist.sh 8 --output_dir your/output_dir -c config/Prova/prova_r50_1k_v3det.py --dataset_file v3det --data_path DATASET/v3det

Train DINO-Prova in the open-vocabulary setting with ImageNet:

bash scripts/prova_dist.sh 8 --output_dir your/output_dir -c config/Prova/prova_r50_22k_v3det_ovd_w_inet --dataset_file v3det --data_path DATASET/v3det

Testing

Test DINO-Prova with 8 GPUs:

bash scripts/prova_dist.sh 8 --output_dir your/output_dir/full_eval -c config/Prova/prova_r50_1k_v3det.py --dataset_file v3det_full --data_path DATASET/v3det --test --resume your/checkpoint.pth

python evaluation/eval_v3det.py your/output_dir/full_eval/bbox_pred.json | tee -a your/output_dir/full_eval/result.txt

Models

Prova on Supervised V3Det

Model	Backbone	Epochs	$AP$	$AP_{50}$	$AP_{75}$	Config	Download
DINO	RN50	24	33.5	37.7	35.0	-	-
DINO-Prova	RN50	24	36.4	41.3	38.1	config	model
DINO	SwinBase	24	42.0	46.8	43.9	-	-
DINO-Prova	SwinBase	24	44.5	49.9	46.6	config	-
DINO-Prova-22K	SwinBase-22k	24	50.3	56.1	52.6	config	model
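
The gain of DINO-Prova over the DINO baseline can be read off the rows above. The short sketch below computes it per backbone from the table values; the dictionary layout and the `gain` helper are illustrative only.

```python
# (model, backbone) -> AP, copied from the table above.
ap = {
    ("DINO", "RN50"): 33.5,
    ("DINO-Prova", "RN50"): 36.4,
    ("DINO", "SwinBase"): 42.0,
    ("DINO-Prova", "SwinBase"): 44.5,
}

def gain(backbone):
    """AP improvement of DINO-Prova over DINO for one backbone."""
    return round(ap[("DINO-Prova", backbone)] - ap[("DINO", backbone)], 1)

for backbone in ("RN50", "SwinBase"):
    print(backbone, gain(backbone))
```

The RN50 row gives a 2.9 AP gain and the SwinBase row a 2.5 AP gain.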

Prova on Open-Vocabulary V3Det

Model	Backbone	Epochs	$AP_{base}$	$AP_{novel}$	$AP_{final}$	Config	Download
DINO*	R50-22k	24	19.0	1.9	10.5	-	-
DINO-Prova	R50-22k	24	31.4	9.5	20.5	config	-
DINO-Prova-22K	R50-22k	24	32.8	11.0	21.9	config	model
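
This README does not define $AP_{final}$. The rows above are consistent with it being the arithmetic mean of $AP_{base}$ and $AP_{novel}$, up to rounding. The check below is an observation about the reported numbers, not a statement about how the evaluation script computes them; the helper name and tolerance are assumptions.

```python
# (AP_base, AP_novel, AP_final) per row, copied from the table above.
rows = {
    "DINO*": (19.0, 1.9, 10.5),
    "DINO-Prova": (31.4, 9.5, 20.5),
    "DINO-Prova-22K": (32.8, 11.0, 21.9),
}

def matches_mean(base, novel, final, tol=0.051):
    """True if `final` equals the mean of base and novel within one rounding step."""
    return abs((base + novel) / 2 - final) <= tol

for name, (base, novel, final) in rows.items():
    print(name, matches_mean(base, novel, final))
```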

Acknowledgement

Thanks to these excellent open-source projects:

Citation

If you find this project useful for your research, please use the following BibTeX entry.

@inproceedings{prova2025,
  title={Comprehensive Multi-Modal Prototypes are Simple and Effective Classifiers for Vast-Vocabulary Object Detection},
  author={Chen, Yitong and Yao, Wenhao and Meng, Lingchen and Wu, Sihong and Wu, Zuxuan and Jiang, Yu-Gang},
  booktitle={AAAI},
  year={2025}
}

