Training/evaluation progress can be monitored via the command line as well as Visdom. For the latter, a Visdom server must be running at `VISDOM_PORT=8090` and `VISDOM_SERVER=http://localhost`. To deactivate Visdom logging, set `VISDOM_ON=False`.
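As a sketch, a Visdom server matching the defaults above can be started before launching training (this assumes the `visdom` Python package is installed; the invocation is Visdom's standard server entry point, not something specific to this repo):

```shell
# Start a Visdom server on the port expected by the default config
python -m visdom.server -port 8090
# The dashboard is then reachable at http://localhost:8090
```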
We provide configuration files under `configs/deformable_mask_head/` and `configs/devis/` to train the mask head and DeVIS, respectively. To launch a training run, simply specify the number of GPUs with `--nproc_per_node` and the corresponding config file after `--config-file`.
For instance, the command for training the YT-VIS 2019 model with 4 GPUs is as follows:

```shell
torchrun --nproc_per_node=4 main.py --config-file configs/devis/YT-19/devis_R_50_YT-19.yaml
```
Users can also override config file parameters by passing a new KEY VALUE pair. For instance, to double the default learning rate:

```shell
torchrun --nproc_per_node=4 main.py --config-file configs/devis/YT-19/devis_R_50_YT-19.yaml SOLVER.BASE_LR 0.0002
```
Dataset | Backbone | AP | Total batch size | Training GPU hours* | Max GPU memory | URL |
---|---|---|---|---|---|---|
COCO | R50 | 38.0 | 14 | 345 | 27GB | config model |
COCO | R101 | 39.9 | 14 | 260 | 32GB | config model |
COCO | SwinL | 45.2 | 7 | 470 | 26GB | config model |
YouTube-VIS 19 | R50 | 44.4 | 4 | 120 | 18GB | config log model |
YouTube-VIS 19 | SwinL | 57.1 | 4 | 220 | 37GB | config log model |
YouTube-VIS 21 | R50 | 43.1 | 4 | 200 | 24GB | config log model |
YouTube-VIS 21** | SwinL | 54.4 | 4 | 305 | 40GB | config log model |
OVIS | R50 | 23.7 | 4 | 145 | 24GB | config log model |
OVIS | SwinL | 35.5 | 4 | 204 | 38GB | config log model |
** We used the following train set, which removes the 2 most crowded train videos, in order to fit GPU memory.
We also provide configuration files to run all the ablation studies presented in Table 1:
Method | Clip size | K_temp | Feature scales | AP | Training GPU hours* | Max GPU memory | URL |
---|---|---|---|---|---|---|---|
Deformable VisTR | 36 | 4 | 1 | 34.2 | 190 | 10GB | config |
Deformable VisTR | 36 | 0 | 1 | 35.3 | 150 | 7GB | config |
Deformable VisTR | 6 | 0 | 1 | 32.4 | 40 | 2GB | config |
DeVIS | 6 | 4 | 1 | 34.0 | 46 | 3GB | config |
+increase spatial inputs | 6 | 4 | 4 | 35.9 | 104 | 15GB | config |
+instance aware obj. queries | 6 | 4 | 4 | 37.0 | 115 | 15GB | config |
+multi-scale mask head | 6 | 4 | 4 | 40.2 | 128 | 16GB | config |
+multi-cue clip tracking | 6 | 4 | 4 | 41.9 | --- | --- | config |
+aux. loss weighting | 6 | 4 | 4 | 44.0 | 128 | 16GB | config |
* Training GPU hours measured on an RTX A6000 GPU.
We support evaluation during training for VIS datasets, even though ground-truth annotations are not available.
Results will be saved to the `TEST.SAVE_PATH` folder, created inside `OUTPUT_DIR`. Users can set `EVAL_PERIOD` to select the interval between validations (0 disables it). Additionally, `START_EVAL_EPOCH` allows selecting the epoch from which `EVAL_PERIOD` starts being considered, which is useful for skipping the first epochs.
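As a sketch, the evaluation settings above can be overridden on the command line like any other KEY VALUE pair. Note that the exact key paths (e.g. whether `EVAL_PERIOD` is a top-level key or nested under another section) are an assumption here and should be checked against the provided YAML config files:

```shell
# Validate every 2 epochs, but only from epoch 6 onward
# (key names assumed from the options described above)
torchrun --nproc_per_node=4 main.py \
  --config-file configs/devis/YT-19/devis_R_50_YT-19.yaml \
  EVAL_PERIOD 2 START_EVAL_EPOCH 6
```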