Campfire: Channel Agnostic Morphological Profiling From Imaging under a Range of Experimental Settings
Accompanying code for "Out-of-distribution evaluations of channel agnostic masked autoencoders in fluorescence microscopy"
Authors: Christian John Hurry, Jinjie Zhang, Olubukola Ishola, Emma Slade, Cuong Q. Nguyen
git clone https://github.com/GSK-AI/campfire
cd campfire
conda env create --name campfire --file env.yaml
source activate campfire
source .env
pip install -e .
The model checkpoint for Campfire can be found in the latest release.
To run the unit tests, run the following on a node with GPU access:
cd campfire
source activate campfire
python run_tests.py
Although we do not provide code for pretraining Campfire, we provide the model architecture and a model checkpoint for testing and further use.
Campfire is a masked autoencoder. A model of this type can be instantiated as follows:
from masked_autoencoder.model import MaskedAutoencoder

model = MaskedAutoencoder(
    img_size=112,    # height/width of the input images, in pixels
    patch_size=14,   # height/width of each square patch
    in_chans=5,      # total number of channel types in the dataset
    mask_ratio=0.8,  # fraction of patches masked during training
)
The parameters specified here are the minimum set required to instantiate the MaskedAutoencoder class.
The MaskedAutoencoder class expects a dictionary with at least two keys: inputs and inputs_channel_idx.
inputs: image tensors of shape (batch_size, number_of_channels, image_height, image_width).
inputs_channel_idx: tensors of shape (batch_size, number_of_channels) indicating which channel type is assigned to each channel index. A dataset may contain many 3-channel images whose three channels differ in type between images; element (0,0) of inputs_channel_idx is an integer in the range [0, num_channels-1], indicating that the first channel of the first sample in the batch corresponds to that channel type.
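For illustration, a minimal input dictionary could be constructed as follows (the shapes and channel indices below are arbitrary assumptions; TEST_DATA, used next, provides a ready-made example):

import torch

batch = {
    # two 3-channel images of size 112x112
    "inputs": torch.randn(2, 3, 112, 112),
    # channel type of each of the 3 channels, per image; values lie in [0, in_chans-1]
    "inputs_channel_idx": torch.tensor([[0, 2, 4], [1, 2, 3]]),
}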
Example input data can be loaded and fed forward through the masked autoencoder via
from masked_autoencoder.test_data import TEST_DATA
embeddings = model(TEST_DATA, embed=True)['embeddings']
To load the pretrained Campfire model, run the following:
import torch
from masked_autoencoder.model import MaskedAutoencoder

checkpoint_path = "/path/to/local/copy/of/checkpoint"
checkpoint = torch.load(checkpoint_path)
model = MaskedAutoencoder(
    img_size=112,
    patch_size=14,
    in_chans=5,
    mask_ratio=0.8,
)
model.load_state_dict(checkpoint['state_dict'])
To reproduce the experiments in our manuscript, we provide code to generate the data splits used to train and evaluate our model, as well as code to run the downstream evaluations. Although we do not provide the code to train Campfire, this makes it possible to train a model on the same subset of the JUMP-CP dataset and, more importantly, to evaluate a model with the same evaluation sets as in our manuscript.
To train and evaluate a model using the same criteria as in our manuscript, the wells of the TARGET2 and COMPOUND plates of Source 3 of the JUMP-CP dataset must be assigned to the various training and evaluation splits. To do so, update the paths and download the relevant metadata described in example_configs/controls_config.yaml, and then run the following command
python create_controls.py -c example_configs/controls_config.yaml
This will generate a .csv file for each of the TARGET2 and COMPOUND plates in Source 3. These .csv files are plate maps: each row describes a well's position on the 384-well plate, the compound with which it was stimulated, and the split to which it has been assigned. Using the provided config will generate the same data splits used to train Campfire.
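For illustration, a few rows of such a plate map might look like the following (the column names and split labels here are assumptions; the generated files define the actual schema):

ROW,COLUMN,COMPOUND,SPLIT
A,1,DMSO,train
A,2,compound_A,test
B,1,compound_B,held_out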
We do not provide code to run model pretraining. To keep comparisons with Campfire consistent, models should be evaluated using the test and held-out splits defined in the controls. For researchers who wish to compare models trained on identical datasets: Campfire was pretrained using all wells assigned to the train split in the control csvs generated above.
The downstream evaluations described in the next sections assume a pretrained model. All that is needed to run them are the control csvs and an additional csv containing the model's single-cell embeddings for all wells in the TARGET2 plates. This csv file should contain columns for the ROW and COLUMN of the well and the PLATE BARCODE of the plate from which the well is derived; for a model with embedding dimension D, it should also have D columns named EMBEDDING_LAYER_NAME_0, ..., EMBEDDING_LAYER_NAME_D-1.
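As a rough sketch, such a file could be assembled like this (the exact column names, e.g. PLATE_BARCODE, and all values below are illustrative assumptions):

import numpy as np
import pandas as pd

D = 4  # embedding dimension of your model (illustrative)
df = pd.DataFrame({
    "ROW": ["A", "A"],
    "COLUMN": [1, 2],
    "PLATE_BARCODE": ["PLATE_1", "PLATE_1"],
})
embeddings = np.random.rand(2, D)  # stand-in for real single-cell embeddings
for i in range(D):
    df[f"EMBEDDING_LAYER_NAME_{i}"] = embeddings[:, i]
df.to_csv("target2_single_cell_embeddings.csv", index=False)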
In our manuscript, we evaluate models for fluorescence microscopy by training a linear layer on single-cell embeddings to predict 1-of-9 control compounds. To evaluate a new model, pretrained on some other dataset or on a subset of the JUMP-CP dataset not included in our evaluation splits, one can alter the config file example_configs/controls_task_config.yaml so that it provides the paths to the control csv file and to the csv file containing the single-cell embeddings for the TARGET2 plates in Source 3.
Then one must run
python runners/run_linear_pipeline.py
which will trigger a pipeline that samples single-cell embeddings from each well of the TARGET2 plates, assigns them to data splits, and then trains several linear layers using different subsets of the training set. The output folder will contain the performance metrics on the in-distribution and out-of-distribution test sets for the model specified by the config file.
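For intuition, the core of this evaluation reduces to the following sketch, using random stand-in data and a logistic-regression classifier as a stand-in for a trained linear layer (the pipeline itself handles well-level sampling, splitting, and metric reporting):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# stand-in single-cell embeddings (384-dim, assumed) and 1-of-9 compound labels
X_train, y_train = np.random.rand(200, 384), np.random.randint(0, 9, 200)
X_test, y_test = np.random.rand(50, 384), np.random.randint(0, 9, 50)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(accuracy_score(y_test, probe.predict(X_test)))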
To evaluate models on images of cells treated with out-of-distribution compounds, we train a linear layer on single-cell embeddings to predict 1-of-60 compounds held out of model pretraining (as specified in the control csv files generated earlier). First adapt the config file example_configs/held_out_compound_task_config.yaml to include the paths to the control csv and model embedding csv files. After setting the config file, run
python runners/run_held_out_pipeline.py
to trigger a pipeline that takes single-cell embeddings from the TARGET2 plates, samples 30 embeddings from each well, and trains 5 linear layers on 5 subsets of the training data via cross-fold validation. The output is the performance metrics on the in-distribution and out-of-distribution test sets for the model specified by the config file.
In our manuscript, we finetune Campfire on a macrophage dataset and analyse the principal components of the model embeddings. To produce similar plots, update the config example_configs/finetuning_configs.yaml with paths to the relevant files described within, containing model embeddings and targets, and run the following:
python modelling/well_embedding_pca.py -c example_configs/finetuning_configs.yaml
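For reference, the underlying analysis amounts to something like this sketch (shapes are assumptions; modelling/well_embedding_pca.py implements the version used in the manuscript):

import numpy as np
from sklearn.decomposition import PCA

# stand-in well-level embeddings: (number_of_wells, embedding_dim)
well_embeddings = np.random.rand(384, 256)
pcs = PCA(n_components=2).fit_transform(well_embeddings)  # components to plot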
In our manuscript, we finetune Campfire on a macrophage dataset and compute the z-prime score for each pair of targets, demonstrating the extent to which embeddings from each target are distinct. To produce similar plots, update the config example_configs/finetuning_configs.yaml with paths to the relevant files described within, containing model embeddings and targets, and run the following:
python modelling/embedding_zprime_score.py -c example_configs/finetuning_configs.yaml
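For reference, the standard z-prime (Z') factor between two groups of one-dimensional readouts is Z' = 1 - 3 * (std_1 + std_2) / |mean_1 - mean_2|. A minimal sketch follows; how the score is applied to multidimensional embeddings in modelling/embedding_zprime_score.py may differ:

import numpy as np

def z_prime(a: np.ndarray, b: np.ndarray) -> float:
    # Z' = 1 - 3 * (std_a + std_b) / |mean_a - mean_b|
    # values close to 1 indicate the two groups are well separated
    return 1.0 - 3.0 * (a.std() + b.std()) / abs(a.mean() - b.mean())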