Repository for code used to run evaluation of Channel Agnostic ViT

GSK-AI/campfire

Campfire: Channel Agnostic Morphological Profiling From Imaging under a Range of Experimental settings

Accompanying code for Out-of-distribution evaluations of channel agnostic masked autoencoders in fluorescence microscopy

Authors: Christian John Hurry, Jinjie Zhang, Olubukola Ishola, Emma Slade, Cuong Q. Nguyen

Instructions

Cloning and setting up your environment

git clone https://github.com/GSK-AI/campfire
cd campfire
conda env create --name campfire --file env.yaml
source activate campfire
source .env
pip install -e .

Downloading checkpoints and data

The model checkpoint for Campfire can be found in the latest release.

Run tests

To run unit tests, run the following with access to a GPU node:

cd campfire
source activate campfire
python run_tests.py 

Use

1. Access to CAMPFIRE for researchers

Although we do not provide code for the pretraining of Campfire, we provide code for the model architecture, and the model checkpoint for testing and further use.

1.1. Model Classes

Campfire is a masked autoencoder. A model of this type can be instantiated as follows:

from masked_autoencoder.model import MaskedAutoencoder

model = MaskedAutoencoder(
    img_size=112,
    patch_size=14,
    in_chans=5,
    mask_ratio=0.8,
)

The parameters specified here are the minimum set required to instantiate the MaskedAutoencoder class.

When called, a MaskedAutoencoder instance expects a dictionary with at least two keys: inputs and inputs_channel_idx.

  • inputs: an image tensor with dimensions (batch_size, number_of_channels, image_height, image_width).
  • inputs_channel_idx: a tensor with dimensions (batch_size, number_of_channels) that indicates which channel type occupies each channel position. A dataset may contain many 3-channel images whose 3 channels differ in type from image to image. The element at (0,0) is an integer in the range [0, num_channels-1] identifying the channel type of the first channel of the first sample in the batch.
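As a minimal sketch of how the values in inputs_channel_idx might be assembled, the example below maps channel names to integer indices for each sample in a batch. The channel names, their index assignment, and the helper build_channel_idx are hypothetical and are not taken from the repository; the index assignment Campfire actually expects should be checked against the code and checkpoint.

```python
# Hypothetical channel vocabulary: these names and numbers are placeholders,
# not the assignment used by the Campfire checkpoint.
CHANNEL_TO_INDEX = {"DNA": 0, "ER": 1, "RNA": 2, "AGP": 3, "Mito": 4}


def build_channel_idx(batch_channel_names):
    """Map each sample's channel names to integer indices.

    Returns a nested list of shape (batch_size, number_of_channels).
    Every sample in the batch must have the same number of channels.
    """
    n_channels = len(batch_channel_names[0])
    rows = []
    for names in batch_channel_names:
        if len(names) != n_channels:
            raise ValueError("all samples in a batch need the same number of channels")
        rows.append([CHANNEL_TO_INDEX[name] for name in names])
    return rows


# Two 3-channel images whose channels differ in type.
channel_idx = build_channel_idx([
    ["DNA", "ER", "Mito"],
    ["DNA", "RNA", "AGP"],
])
print(channel_idx)  # [[0, 1, 4], [0, 2, 3]]
```

Wrapping the result in torch.tensor(channel_idx) gives a (2, 3) integer tensor that could be stored under the inputs_channel_idx key, alongside a (2, 3, 112, 112) image tensor under inputs.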

Example input data can be loaded and passed through the masked autoencoder via:

from masked_autoencoder.test_data import TEST_DATA

embeddings = model(TEST_DATA, embed=True)['embeddings']

1.2. Loading Campfire

To load the pretrained Campfire model, run the following:

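import torch

from masked_autoencoder.model import MaskedAutoencoder

# Path to a local copy of the checkpoint from the latest release
checkpoint_path = "/path/to/local/copy/of/checkpoint"

checkpoint = torch.load(checkpoint_path)

model = MaskedAutoencoder(
    img_size=112,
    patch_size=14,
    in_chans=5,
    mask_ratio=0.8,
)

# Restore the pretrained weights
model.load_state_dict(checkpoint['state_dict'])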

2. How to use the code to pretrain and evaluate a channel-agnostic model

To reproduce the experiments in our manuscript, we provide code to generate the data splits used in training and evaluating our model, along with code to run the downstream evaluations. Although we do not provide the code to train Campfire, this makes it possible to train a model on the same subset of the JUMP-CP dataset and, more importantly, to evaluate a model on the same evaluation sets as in our manuscript.

2.1. Generating training, validation, and held-out splits

To train and evaluate a model using the same criteria as in our manuscript, it is necessary to assign the wells of the TARGET2 and COMPOUND plates of Source 3 of the JUMP-CP dataset to training and evaluation splits. To do so, update the paths and download the relevant metadata mentioned in example_configs/controls_config.yaml, then run the following command:

python create_controls.py -c example_configs/controls_config.yaml

This will generate a .csv file for each of the TARGET2 and COMPOUND plates in Source 3. These .csv files are plate maps: each entry describes a well's position on the 384-well plate, the compound with which it was stimulated, and the split to which it has been assigned. Using the provided config will generate the same set of data splits used to train Campfire.

2.2. Model pretraining and inference

We do not provide code to run model pretraining. To keep comparisons with Campfire consistent, models should be evaluated using the test and held-out splits defined in the control csvs. For researchers who wish to compare models trained on identical datasets, Campfire was pretrained using all wells assigned to the train split in the control csvs generated above.

The downstream evaluations in the following sections assume that a model has already been pretrained. All that is needed to run them are the control csvs and an additional csv containing the model's single cell embeddings for all wells in the TARGET2 plates. This csv file should contain columns for the ROW and COLUMN of the well and the PLATE BARCODE of the plate from which the well is derived and, for a model with embedding dimension D, D columns named EMBEDDING_LAYER_NAME_0,...,EMBEDDING_LAYER_NAME_D-1.
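The layout described above can be sketched with Python's standard csv module. This is an illustration under assumptions, not the repository's own export code: the exact metadata column spellings (ROW, COLUMN, PLATE_BARCODE), the helper names, and the toy values are all hypothetical, and the example configs should be consulted for the names the pipelines actually read.

```python
import csv
import io


def embedding_header(dim, prefix="EMBEDDING_LAYER_NAME"):
    """Column names for a table holding D-dimensional embeddings.

    The metadata column spellings below are assumptions for illustration.
    """
    return ["ROW", "COLUMN", "PLATE_BARCODE"] + [f"{prefix}_{i}" for i in range(dim)]


def write_embeddings(handle, records, dim):
    """Write one row per single cell: well position, plate barcode, embedding."""
    writer = csv.writer(handle)
    writer.writerow(embedding_header(dim))
    for row, column, barcode, embedding in records:
        if len(embedding) != dim:
            raise ValueError(f"expected {dim} values, got {len(embedding)}")
        writer.writerow([row, column, barcode, *embedding])


# Toy records with made-up values and a 4-dimensional embedding.
records = [
    ("A", 1, "PLATE_0001", [0.1, 0.2, 0.3, 0.4]),
    ("A", 2, "PLATE_0001", [0.5, 0.6, 0.7, 0.8]),
]
buffer = io.StringIO()
write_embeddings(buffer, records, dim=4)

# Read the table back to confirm the header and row count.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
```

Passing an open file handle instead of the in-memory buffer would write the same table to disk.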

2.3. Given a set of embeddings, run linear probing, predicting 1-of-9 controls

In our manuscript, we evaluate models for fluorescence microscopy by training a linear layer on single cell embeddings to predict 1-of-9 control compounds. To evaluate a new model, pretrained on some other dataset or perhaps on a subset of the JUMP-CP dataset not included in our evaluation splits, alter the config file example_configs/controls_task_config.yaml so that it points to the directory containing the control csv files and to the csv file containing the single cell embeddings for the TARGET2 plates in Source 3.

Then run

python runners/run_linear_pipeline.py

which will trigger a pipeline that samples single cell embeddings from each well of the TARGET2 plates, assigns them to data splits, and then trains several linear layers using different subsets of the training set. The output folder will contain the performance metrics on the in-distribution and out-of-distribution test sets for the model specified by the config file.

2.4. Given a set of embeddings, run linear probing, predicting 1-of-60 held-out compounds

To evaluate models on images of cells treated with out-of-distribution compounds, we train a linear layer on single cell embeddings to predict 1-of-60 compounds held out of model pretraining (as specified in the control csv files generated earlier). First adapt the config file example_configs/held_out_compound_task_config.yaml to include the paths to the control csv and model embedding csv files. After updating the config file, run

python runners/run_held_out_pipeline.py

to trigger a pipeline that takes single cell embeddings from the TARGET2 plates, samples 30 embeddings from each well, and trains 5 linear layers on 5 subsets of the training data via cross-validation. The output is the performance metrics on the in-distribution and out-of-distribution test sets for the model specified by the config file.
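The sampling and fold construction described above can be sketched in plain Python. This is not the code in runners/run_held_out_pipeline.py: the function names, the choice to build folds at the well level, and the toy data are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def sample_per_well(cells, n_per_well=30, seed=0):
    """Sample up to n_per_well embeddings from each well.

    `cells` is a list of (well_id, embedding) pairs.
    """
    rng = random.Random(seed)
    by_well = defaultdict(list)
    for well_id, embedding in cells:
        by_well[well_id].append(embedding)
    sampled = {}
    for well_id, embeddings in by_well.items():
        k = min(n_per_well, len(embeddings))
        sampled[well_id] = rng.sample(embeddings, k)
    return sampled


def k_fold_wells(well_ids, n_folds=5, seed=0):
    """Split wells into n_folds disjoint validation folds.

    Yields (train_wells, val_wells); each well is held out exactly once.
    Splitting by well rather than by cell is an assumption of this sketch.
    """
    rng = random.Random(seed)
    shuffled = sorted(well_ids)
    rng.shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for i, val in enumerate(folds):
        train = [w for j, fold in enumerate(folds) if j != i for w in fold]
        yield train, val


# Toy data: 10 wells with 40 two-dimensional "embeddings" each.
cells = [(f"well_{w}", [float(w), float(c)]) for w in range(10) for c in range(40)]
sampled = sample_per_well(cells)
splits = list(k_fold_wells(sampled.keys()))
```

Each of the 5 (train, validation) pairs would then be used to fit one linear layer, giving 5 sets of metrics.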

2.5. Given a set of embeddings, plot top-2 principal components

In our manuscript, we finetune Campfire on a macrophage dataset, and analyse the principal components of the model embeddings. To produce similar plots, update the config example_configs/finetuning_configs.yaml with paths to the relevant files described within, containing model embeddings and targets, and run the following:

python modelling/well_embedding_pca.py -c example_configs/finetuning_configs.yaml

2.6. Given a set of embeddings, compute and plot Z' scores for embeddings from different targets

In our manuscript, we finetune Campfire on a macrophage dataset, and compute the Z' score for each pair of targets, demonstrating the extent to which embeddings from each target are distinct. To produce similar plots, update the config example_configs/finetuning_configs.yaml with paths to the relevant files described within, containing model embeddings and targets, and run the following:

python modelling/embedding_zprime_score.py -c example_configs/finetuning_configs.yaml
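For readers unfamiliar with the metric, the sketch below computes the standard Z' factor, Z' = 1 - 3(sd_a + sd_b) / |mean_a - mean_b|, for two groups of scalar values. How modelling/embedding_zprime_score.py reduces multi-dimensional embeddings to scalars is not described here, so the projection onto the line joining the two group means is an assumption of this example, as are the function names and the toy data.

```python
import statistics


def z_prime(values_a, values_b):
    """Z' factor for two groups of scalar readouts.

    Values close to 1 indicate well-separated groups; values at or
    below 0 indicate that the groups overlap.
    """
    mean_a, mean_b = statistics.mean(values_a), statistics.mean(values_b)
    sd_a, sd_b = statistics.stdev(values_a), statistics.stdev(values_b)
    gap = abs(mean_a - mean_b)
    if gap == 0:
        return float("-inf")
    return 1.0 - 3.0 * (sd_a + sd_b) / gap


def project_on_mean_difference(group_a, group_b):
    """Reduce embeddings to scalars by projecting onto the line joining the group means.

    This reduction is an assumption of the sketch, not the repository's method.
    """
    dim = len(group_a[0])
    mean_a = [statistics.mean(v[i] for v in group_a) for i in range(dim)]
    mean_b = [statistics.mean(v[i] for v in group_b) for i in range(dim)]
    direction = [b - a for a, b in zip(mean_a, mean_b)]
    norm = sum(d * d for d in direction) ** 0.5
    if norm == 0:
        raise ValueError("group means coincide; no projection direction")
    unit = [d / norm for d in direction]

    def project(vector):
        return sum(x * u for x, u in zip(vector, unit))

    return [project(v) for v in group_a], [project(v) for v in group_b]


# Toy embeddings for two clearly separated targets.
group_a = [[0.0, 0.0], [0.2, 0.0], [0.1, 0.1], [0.1, -0.1]]
group_b = [[10.0, 0.0], [10.2, 0.0], [10.1, 0.1], [10.1, -0.1]]
proj_a, proj_b = project_on_mean_difference(group_a, group_b)
score = z_prime(proj_a, proj_b)
```

Repeating this for every pair of targets would give a matrix of pairwise scores of the kind the plots summarise.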
