This repository contains the official implementation of the text-to-image part of ContextDiff. For simplicity, we only provide sample code for the COCO dataset; you can switch to any dataset to apply our method.
Environment Setup
git clone https://github.com/YangLing0818/ContextDiff.git
cd ContextDiff
conda create -n ContextDiff python==3.8
conda activate ContextDiff
pip install -r requirements.txt
cd ContextDiff_image
pip install git+https://github.com/openai/CLIP.git
pip install git+https://github.com/huggingface/diffusers
Download Model Weights
Here we choose Stable Diffusion as our diffusion backbone. You can download the model weights using our download.py in the 'ckpt/' folder:
cd ckpt
python download.py
wget "https://openaipublic.azureedge.net/clip/models/8fa8567bab74a42d41c5915025a8e4538c3bdbe8804a470a72f30b0d94fab599/RN101.pt"
wget "https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt"
wget "https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt"
cd ..
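To sanity-check the downloaded CLIP checkpoints, you can load one with the CLIP package installed above. This is a minimal sketch; clip.load accepts a path to a checkpoint file, and the 512-dimensional output is specific to ViT-B/16.

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# clip.load accepts either a model name or a path to a downloaded checkpoint
model, preprocess = clip.load("ckpt/ViT-B-16.pt", device=device)

# Encode a dummy prompt to confirm the text encoder works
tokens = clip.tokenize(["a photo of a dog"]).to(device)
with torch.no_grad():
    text_features = model.encode_text(tokens)
print(text_features.shape)  # torch.Size([1, 512]) for ViT-B/16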
Download Datasets
cd dataset
wget http://images.cocodataset.org/zips/train2017.zip
unzip train2017.zip
cd ..
python process_img.py --src=./dataset/train2017 --size=512 --dest=./dataset/train2017
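The command above resizes the COCO images to 512x512 (note that --src and --dest point to the same folder, so the images are processed in place). If you apply our method to your own dataset, a minimal preprocessing script in the same spirit could look like the following sketch. It is illustrative only, not the repo's process_img.py, and assumes a flat folder of .jpg files processed with a shorter-side resize followed by a center crop.

import argparse
from pathlib import Path
from PIL import Image

parser = argparse.ArgumentParser()
parser.add_argument("--src", required=True)
parser.add_argument("--dest", required=True)
parser.add_argument("--size", type=int, default=512)
args = parser.parse_args()

dest = Path(args.dest)
dest.mkdir(parents=True, exist_ok=True)

for path in sorted(Path(args.src).glob("*.jpg")):
    img = Image.open(path).convert("RGB")
    # Resize the shorter side to `size`, then center-crop to size x size
    scale = args.size / min(img.size)
    img = img.resize((round(img.width * scale), round(img.height * scale)), Image.BICUBIC)
    left = (img.width - args.size) // 2
    top = (img.height - args.size) // 2
    img = img.crop((left, top, left + args.size, top + args.size))
    img.save(dest / path.name)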
Train Context-Aware Adapter
CUDA_VISIBLE_DEVICES=0 python train_adapter.py --train_data_dir './dataset/train2017' --mixed_precision 'fp16' --output_dir 'output/' --train_batch_size 64 --num_train_epochs 20 --checkpointing_steps 10000 --t5_model 'path to text encoders'
You can check the code for details, and choose hyper-parameters based on your device.
Finetune Diffusion Model with Context-Aware Adapter
CUDA_VISIBLE_DEVICES=0 python finetune_diffusion.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-2-1-base" \
  --train_data_dir=./dataset/train2017 \
  --use_ema --resolution=512 --center_crop --random_flip \
  --train_batch_size=32 --gradient_accumulation_steps=1 --gradient_checkpointing \
  --max_train_steps=50000 --checkpointing_steps=10000 \
  --learning_rate=2e-05 --max_grad_norm=1 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="./output"
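After fine-tuning, assuming finetune_diffusion.py saves the weights in the standard diffusers pipeline layout under --output_dir (the convention followed by the diffusers training scripts; check the code if your output looks different), you can sample from the result in the usual way:

import torch
from diffusers import StableDiffusionPipeline

# "./output" is the --output_dir used above; assumes diffusers pipeline format
pipe = StableDiffusionPipeline.from_pretrained("./output", torch_dtype=torch.float16).to("cuda")

image = pipe("a photograph of an astronaut riding a horse", num_inference_steps=50).images[0]
image.save("sample.png")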
The values for '--mean_path' and '--std_path' in the code are generated from the dataset embeddings. You can use a clustering method such as GMM to obtain the mean and std of your dataset. This helps accelerate training convergence by moving the denoising starting point from a pure Gaussian distribution toward the image distribution of the training dataset. Alternatively, you can directly use means and variances that correspond to an isotropic Gaussian distribution.
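As an illustration, one way to produce such statistics is to fit a diagonal-covariance GMM over precomputed image embeddings with scikit-learn and save the resulting mean and std. This is only a sketch: the embedding source and the exact file format expected by '--mean_path'/'--std_path' should be checked against the code, and the embedding file name below is hypothetical.

import numpy as np
from sklearn.mixture import GaussianMixture

# (N, D) array of image embeddings computed over the training set
# (hypothetical file; how the embeddings are produced depends on your setup)
embeddings = np.load("dataset/train2017_embeddings.npy")

# A single diagonal Gaussian component; increase n_components for a richer GMM
gmm = GaussianMixture(n_components=1, covariance_type="diag").fit(embeddings)
mean = gmm.means_[0]                  # shape (D,)
std = np.sqrt(gmm.covariances_[0])    # per-dimension std from the diagonal covariance

np.save("ckpt/mean.npy", mean)  # candidate values for --mean_path / --std_path
np.save("ckpt/std.npy", std)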
Citation
@inproceedings{yang2024crossmodal,
  title={Cross-Modal Contextualized Diffusion Models for Text-Guided Visual Generation and Editing},
  author={Ling Yang and Zhilong Zhang and Zhaochen Yu and Jingwei Liu and Minkai Xu and Stefano Ermon and Bin CUI},
  booktitle={International Conference on Learning Representations},
  year={2024}
}