UniSim: Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics

The key contributions and findings of our work are as follows:

  • We introduce UniSim-Bench, a comprehensive benchmark spanning 7 multi-modal perceptual similarity tasks and encompassing 25 datasets.

  • Our evaluation demonstrates that while general-purpose models perform reasonably well on average, they often fall short compared to specialized models on specific tasks.

  • In contrast, metrics fine-tuned for individual tasks show strong performance but fail to generalize effectively to unseen, yet related, tasks.

  • To address this gap, we propose UniSim, a family of multi-task perceptual similarity metrics designed as a first step toward a unified framework for perceptual similarity.

  • UniSim leverages fine-tuning of both encoder-based and generative vision-language models on a subset of tasks from UniSim-Bench, achieving the highest average performance. Notably, it even surpasses task-specific models in certain cases.

  • Despite these advancements, our findings reveal that the models continue to struggle with generalization to unseen tasks, underscoring the persistent challenge of developing a robust, unified perceptual similarity metric that aligns with human notions of similarity.

Dataset

We collect and adapt multiple existing datasets for each task, as detailed in the table below. A subset of the datasets from the Core 2AFC Tasks is then used to train our UniSim models (see below), while the others are held out for evaluation only. Moreover, all datasets from the OOD Generalization Tasks are used at test time only. The UniSim-Bench data can be found on HuggingFace.
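
For example, individual benchmark datasets can be pulled with the HuggingFace datasets library. The snippet below is only a sketch: the repository id is a placeholder, not the actual UniSim-Bench identifier, so replace it with the one listed on the HuggingFace page.

from datasets import load_dataset

# Placeholder repository id; substitute the actual UniSim-Bench identifier
# from its HuggingFace page.
bench = load_dataset('ORG/UniSim-Bench', split='test')
print(bench[0])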

Checkpoints

We train three versions of our multi-task perceptual metric UniSim; the corresponding model names, used in the Quick Start below, are unisim_vit_b_32, unisim_vit_l_14, and unisim_ll_n_0.5.

Quick Start

  • Loading models:
from models import load_unisim_models

# Load a UniSim checkpoint onto the given device.
load_unisim_models(model_name, model_path, device, cache_dir)

model_name should be chosen from ['unisim_vit_b_32', 'unisim_vit_l_14', 'unisim_ll_n_0.5'].
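
For instance, loading the unisim_vit_b_32 model might look as follows; the checkpoint path below is a placeholder, not a released path:

model_name = 'unisim_vit_b_32'
model_path = 'path/to/model'   # placeholder: point this to the downloaded checkpoint
device = 'cuda:0'
cache_dir = './'

load_unisim_models(model_name, model_path, device, cache_dir)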

  • Quick use of the UniSim metrics:
from models import get_unisim_metric

# Compute the prediction of a UniSim metric for the given task, images, and texts.
get_unisim_metric(model_name, model_path, task_type, images, texts, device, cache_dir)

Here is an example:

model_name = 'unisim_ll_n_0.5'
model_path = 'path/to/model'
device = 'cuda:0'
cache_dir = './'

# Img_2AFC: given a reference image and two candidates, the metric predicts
# which candidate is perceptually closer to the reference.
images = ['/uni_data/nights/ref/000/002.png',
          '/uni_data/nights/distort/000/002_0.png',
          '/uni_data/nights/distort/000/002_1.png']
texts = []  # no text inputs are needed for the image-only 2AFC task
task_type = 'Img_2AFC'
pred = get_unisim_metric(model_name, model_path, task_type, images, texts, device, cache_dir)

print(f'Task: {task_type}, Pred: {pred}')
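
To score multiple instances, you can simply loop over get_unisim_metric. The snippet below is only a sketch: triplets is a hypothetical list in which each element holds the reference and the two candidate image paths for one Img_2AFC instance.

# Hypothetical list of [reference, candidate_0, candidate_1] path triplets.
triplets = [
    ['/uni_data/nights/ref/000/002.png',
     '/uni_data/nights/distort/000/002_0.png',
     '/uni_data/nights/distort/000/002_1.png'],
]

for images in triplets:
    pred = get_unisim_metric(model_name, model_path, 'Img_2AFC', images, [], device, cache_dir)
    print(f'Img_2AFC prediction: {pred}')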

  • Evaluation scripts for other metrics:
bash scripts/eval_encoder.sh 
bash scripts/eval_lmm.sh

Acknowledgement

This work leverages the code and resources from the OpenCLIP and LLaVA-Next repositories. Moreover, we use several existing datasets, whose details and references can be found in our paper.

We thank the authors of these repositories and datasets for making their work publicly available and contributing to the research community.

Citation

If you use our code or models, please consider citing our work using the following BibTeX entry:

@article{ghazanfari2024towards,
  title={Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics},
  author={Ghazanfari, Sara and Garg, Siddharth and Flammarion, Nicolas and Krishnamurthy, Prashanth and Khorrami, Farshad and Croce, Francesco},
  journal={arXiv preprint arXiv:2412.10594},
  year={2024}
}