Prepare the data following MMDetection. Obtain the json files for OV-COCO from GoogleDrive and put them under `data/coco/wusize`.
The data structure looks like:

```text
checkpoints/
├── clip_vitb32.pth
├── res50_fpn_soco_star_400.pth
data/
├── coco
│   ├── annotations
│   │   ├── instances_{train,val}2017.json
│   ├── wusize
│   │   ├── instances_train2017_base.json
│   │   ├── instances_val2017_base.json
│   │   ├── instances_val2017_novel.json
│   │   ├── captions_train2017_tags_allcaps.json
│   ├── train2017
│   ├── val2017
│   ├── test2017
```
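Before training, it can be worth checking that the expected files are actually in place. The snippet below is only an illustrative sanity check that mirrors the layout listed above; it is not part of the repo's tooling.

```python
from pathlib import Path

# Files expected by the OV-COCO configs, mirroring the layout above.
expected = [
    "checkpoints/clip_vitb32.pth",
    "checkpoints/res50_fpn_soco_star_400.pth",
    "data/coco/annotations/instances_train2017.json",
    "data/coco/annotations/instances_val2017.json",
    "data/coco/wusize/instances_train2017_base.json",
    "data/coco/wusize/instances_val2017_base.json",
    "data/coco/wusize/instances_val2017_novel.json",
    "data/coco/wusize/captions_train2017_tags_allcaps.json",
]

missing = [p for p in expected if not Path(p).exists()]
if missing:
    print("Missing files:\n" + "\n".join(missing))
else:
    print("All expected files are in place.")
```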
Otherwise, generate the json files using the following scripts:

```bash
python tools/pre_processors/keep_coco_base.py \
    --json_path data/coco/annotations/instances_train2017.json \
    --out_path data/coco/wusize/instances_train2017_base.json

python tools/pre_processors/keep_coco_base.py \
    --json_path data/coco/annotations/instances_val2017.json \
    --out_path data/coco/wusize/instances_val2017_base.json

python tools/pre_processors/keep_coco_novel.py \
    --json_path data/coco/annotations/instances_val2017.json \
    --out_path data/coco/wusize/instances_val2017_novel.json
```
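For reference, these pre-processing scripts essentially filter the standard COCO annotation files down to a category subset (OV-COCO keeps 48 base and 17 novel categories). The sketch below is not the repo's implementation; it only illustrates the filtering, assuming you supply the list of category names to keep.

```python
import json

def keep_categories(json_path, out_path, keep_names):
    """Keep only the categories in `keep_names` and the annotations that use them."""
    with open(json_path) as f:
        coco = json.load(f)

    keep_cats = [c for c in coco["categories"] if c["name"] in keep_names]
    keep_ids = {c["id"] for c in keep_cats}

    coco["categories"] = keep_cats
    coco["annotations"] = [a for a in coco["annotations"]
                           if a["category_id"] in keep_ids]

    with open(out_path, "w") as f:
        json.dump(coco, f)

# Hypothetical usage; BASE_CATEGORY_NAMES would be the OV-COCO base split:
# keep_categories("data/coco/annotations/instances_val2017.json",
#                 "data/coco/wusize/instances_val2017_base.json",
#                 BASE_CATEGORY_NAMES)
```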
The json file for caption supervision, `captions_train2017_tags_allcaps.json`, is obtained following Detic. Put it under `data/coco/wusize`.
As training on COCO tends to overfit to the base categories, we use the output of the last attention layer for classification. Generate the class embeddings with:

```bash
python tools/hand_craft_prompt.py --model_version ViT-B/32 --ann data/coco/annotations/instances_val2017.json \
    --out_path data/metadata/coco_clip_hand_craft.npy --dataset coco
```

The generated file `data/metadata/coco_clip_hand_craft_attn12.npy` is used for training and testing.
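For intuition, hand-crafted class embeddings are typically built by filling each category name into a set of prompt templates, encoding the prompts with CLIP's text encoder, and averaging the normalized features. The sketch below is a minimal illustration using the openai `clip` package with a toy template list and a made-up output filename; the repo's `tools/hand_craft_prompt.py` handles the full template set and the per-attention-layer (`attn12`) outputs.

```python
import clip          # https://github.com/openai/CLIP
import numpy as np
import torch

# Toy template list for illustration; the actual script uses a larger set.
templates = ["a photo of a {}.", "a photo of the {}."]
class_names = ["person", "bicycle", "car"]  # in practice, read from the annotation file

model, _ = clip.load("ViT-B/32", device="cpu")
model.eval()

embeddings = []
with torch.no_grad():
    for name in class_names:
        tokens = clip.tokenize([t.format(name) for t in templates])
        text_feat = model.encode_text(tokens)                     # [num_templates, embed_dim]
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        mean_feat = text_feat.mean(dim=0)
        embeddings.append(mean_feat / mean_feat.norm())

# Hypothetical output path, to avoid confusion with the files the real script produces.
np.save("data/metadata/coco_clip_hand_craft_example.npy",
        torch.stack(embeddings).numpy())
```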
The implementation based on MMDet 3.x achieves better results than those reported in the paper.
|           | Backbone | Method | Supervision  | Novel AP50 | Config | Download     |
|-----------|----------|--------|--------------|------------|--------|--------------|
| Paper     | R-50-FPN | BARON  | CLIP         | 34.0       | -      | -            |
| This Repo | R-50-FPN | BARON  | CLIP         | 34.6       | config | model \| log |
| Paper     | R-50-C4  | BARON  | COCO Caption | 33.1       | -      | -            |
| This Repo | R-50-C4  | BARON  | COCO Caption | 35.1       | config | model \| log |
| This Repo | R-50-C4  | BARON  | CLIP         | 34.0       | config | model \| log |
To test the models, run:

```bash
GPUS=8 GPUS_PER_NODE=8 CPUS_PER_TASK=12 bash tools/slurm_test.sh PARTITION test \
    path/to/the/cfg/file path/to/the/checkpoint
```
Train the detector based on FasterRCNN+ResNet50+FPN with SyncBN, initialized from the SOCO pre-trained model. Obtain the SOCO pre-trained model from GoogleDrive and put it under `checkpoints`.

```bash
GPUS=16 GPUS_PER_NODE=8 CPUS_PER_TASK=12 bash tools/slurm_train.sh PARTITION train \
    configs/baron/ov_coco/baron_kd_faster_rcnn_r50_fpn_syncbn_90kx2.py \
    path/to/save/logs/and/checkpoints
```
We can also train a detector based on FasterRCNN+ResNet50C4:

```bash
GPUS=8 GPUS_PER_NODE=8 CPUS_PER_TASK=12 bash tools/slurm_train.sh PARTITION train \
    configs/baron/ov_coco/baron_kd_faster_rcnn_r50_c4_90k.py \
    path/to/save/logs/and/checkpoints
```
To train the FasterRCNN+ResNet50C4 detector with COCO caption supervision, run:

```bash
GPUS=8 GPUS_PER_NODE=8 CPUS_PER_TASK=12 bash tools/slurm_train.sh PARTITION train \
    configs/baron/ov_coco/baron_caption_faster_rcnn_r50_caffe_c4_90k.py \
    path/to/save/logs/and/checkpoints
```