Bioinformatics Institute spring project, year 2022. Student: Velikonivtsev Fyodor Supervisors: Tolstoganov Ivan, Korobeynikov Anton
Goal: discover properties, opportunities and perspectives of clustering metagenome Hi-C data by graph deep learning methods.
Objectives:
- Develop core understanding of problem´s crucial concepts and current advances
- Apply effective GNN clustering models - DMoN, GraphMB
- Apply VAE model - VAMB
- Create interface & modify models API for correctly processing Hi-C data formats
- Explore models´ hyperparameters space and compare results
- Compare tools efficiency with VAMB and Bin3C as baselines using AMBER & CheckM
Bin was considered as HQ (high-quality metagenome-assembled genome) if it had >95% completeness and <5% Contamination 3 datasets have been used:
- Zymo dataset (supervised) [6625 contigs, 76799 Hi-C links]
- IC9 dataset (unsupervised) size [165712 contigs, 1150887 Hi-C links]
- CAMI AIRWAYS synthetic dataset (supervised) [728682 contigs, 70405 Hi-C links]
- Zymo dataset - 0 HQ genomes
- Wase taken out of experiment
- Restored 7-8/10 HQ MAGs vs. VAMB’s 10/10 vs bin3C`s (non-DL tool for clustering Hi-C data) 6/10 (a)
- Restored 98/600 HQ MAGs vs. VAMB’s 93/600 in CAMI AIRWAYS (b)
- Restored 12 HQ bins vs VAMB’s 12 in IC9 dataset (c)
- DMoN strongly depends from input node features and doesn’t support edge features, therefore it is incompatible for contact map clustering
- GraphMB showed better results on larger dataset where VAMB isn’t enough - it shows that some new information from contact graph was used to resolve bins
- GraphMB relies on the effective VAMB workflow - this explains close results on small Zymo dataset and its close to VAMB result on such small datasets
- GraphMB showed its competitive effectiveness in the problem of Hi-C contact map clustering
- Nissen, J.N., Johansen, J., Allesøe, R.L. et al. Improved metagenome binning and assembly using deep variational autoencoders. Nat Biotechnol 39, 555–560 (2021). https://doi.org/10.1038/s41587-020-00777-4
- Metagenomic binning with assembly graph embeddings. Andre Lamurias, Mantas Sereika, Mads Albertsen, Katja Hose, Thomas Dyhre Nielsen. bioRxiv 2022.02.25.481923; doi: https://doi.org/10.1101/2022.02.25.481923
- Tsitsulin, Anton & Palowitch, John & Perozzi, Bryan & Müller, Emmanuel. (2020). Graph Clustering with Graph Neural Networks.
- Fernando Meyer, Peter Hofmann, Peter Belmann, Ruben Garrido-Oter, Adrian Fritz, Alexander Sczyrba, Alice C McHardy, AMBER: Assessment of Metagenome BinnERs, GigaScience, Volume 7, Issue 6, June 2018, giy069, https://doi.org/10.1093/gigascience/giy069
- Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015 Jul;25(7):1043-55. doi: 10.1101/gr.186072.114. Epub 2015 May 14. PMID: 25977477; PMCID: PMC4484387.
- DeMaere, M., Darling, A. bin3C: exploiting Hi-C sequencing data to accurately resolve metagenome-assembled genomes. Genome Biol 20, 46 (2019). https://doi.org/10.1186/s13059-019-1643-1
This repository contains tools for:
- Correctly preprocessing Hi-C contact map and related assembly graph with scaffolds to create compatible input formats for DMoN & GraphMB -
preprocess_files.py
- Correctly transforming ground truth labels (if any) to AMBER specific input format & post-clustering output modification for VAMB output -
vamb2amber.py
- Basic exploratory data analysis by providing length distribution for subsets of contigs with Hi-C links and without Hi-C links - might be helpful in choosing minimal length threshold while clustering -
preprocess_files.py
- Conveniently comparing CheckM quality assessment summaries of several binning runs by providing metrics plot for each of the runs (example can be seen at plot (c)) -
compare_checkm_results.py
- Desired tested GNN - DMoN, GraphMB
- QC tools - AMBER, CHECKM
- Python 3.8+
- Python packages (see installation for details)
- UNIX command line
- GPU is preferrable
- CPU - any, multicore is preferrable
- RAM - 16 Gb
- Disk space - depends on data, + 2 Gb for CheckM database
- Install AMBER, CheckM into separate environments
- Install desired tool into a separate environment (crucial for GraphMB and AMBER) (GraphMB - my modification, DMoN - my modification, VAMB - my modification (minimal modified tool))
- Install python packages:
pip install -U numpy scipy pandas sklearn tqdm plotly kaleido
- Clone repository and add it to PATH (e.g. for bash):
git clone https://github.com/Abusagit/GNN_plus_HiC.git && cd GNN_plus_HiC && echo 'export PATH="your-dir:$PATH"' >> ~/.bashrc && source ~/.bashrc
- Preprocess data for given GNN (e.g. GraphMB: transform
contact_map.tsv, scaffolds.fasta
):
py preprocess_files.py --graphmb -c contact_map.tsv --scaling log -f scaffolds.fasta --mimic-jgi -o graphmb_input/
- Run GNN:
graphmb --assembly graphmb_input/ [--other-paraneters-for-graphmb]
- Run AMBER or CheckM
- You could also run VAMB - it produces clustering output incompatible for AMBER work but suitable for CheckM. You can use the followeing:
py vamb2amber.py -i amber_result.tsv -g golden_standard_with_amber_format.tsv -o outdir/vamb_for_amber.tsv
- In the case of having labels you can directly compare binning results by AMBER - it provides comprehensive visualization plots. However, this is not the case for CheckM multiple study - here you can use
compare_checkm_results.py
and compare joint distribution of completeness and purity metrics among HQ genomes inm
binning results (any number starting from 1):
py compare_checkm_results.py -i [checkm_input_1 ... checkm_input_m] --labels [label_1 ... label_m] --min-completeness 0.95 --min-purity 0.95