snpit

Code style: black

Whole genome SNP based identification of members of the Mycobacterium tuberculosis complex. Based on code originally written by Samuel Lipworth and turned into a package by Philip Fowler and Michael Hall.

snpit allows rapid mycobacterial speciation of VCF files aligned to NC000962 (H37Rv) and of FAST(A/Q) files.

For more information, please see the article:

Lipworth S, Jajou R, de Neeling A, et al. SNP-IT Tool for Identifying Subspecies and Associated Lineages of Mycobacterium tuberculosis Complex. Emerging Infectious Diseases. 2019;25(3):482-488. doi:10.3201/eid2503.180894.

Please email samuel.lipworth@medsci.ox.ac.uk with any queries.

Installation

snpit requires Python 3.5 or later.

PyPI

# not yet set up

Conda

# not yet set up

Locally

There are two ways to install locally: into your user Python packages, or into a virtual environment (recommended).

First clone the repository on your local machine and move into the directory.

git clone https://github.com/philipwfowler/snpit.git
cd snpit

Virtual environment [recommended]

# install snpit and dependencies
make install
# make sure it is working
make test
# get the command to activate the environment
make activate
# activate the environment with the output from the above command
# start using snpit
snpit --help
# when you are done, exit the environment
deactivate

Without virtual environment

Note: We strongly encourage using a virtual environment if you are installing locally.

python3 setup.py install --user
# make sure it is working
python3 setup.py test

Usage

VCF input and print result to screen

snpit --input in.vcf

Note: You do not need to specify anything special if your file is multi-sample.

FASTA input and write result to file

snpit --input in.fa --output out.tsv

VCF input and only use records that have PASS in the FILTER field

snpit -i in.vcf --filter -o out.tsv
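To illustrate what restricting to PASS records means, here is a minimal sketch in plain Python. It is not snpit's own code: the function name `passing_records` and the example records are hypothetical, and it assumes that only a literal `PASS` in the seventh (FILTER) column counts as passing.

```python
def passing_records(vcf_lines):
    """Yield the tab-split fields of data lines whose FILTER column is PASS."""
    for line in vcf_lines:
        # Skip meta-information (##...) and the column header (#CHROM...).
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
        if len(fields) > 6 and fields[6] == "PASS":
            yield fields


# Hypothetical records, for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "NC000962\t1000\t.\tA\tG\t50\tPASS\t.\n",
    "NC000962\t2000\t.\tC\tT\t12\tLowQual\t.\n",
    "NC000962\t3000\t.\tG\tA\t60\tPASS\t.\n",
]

kept_positions = [fields[1] for fields in passing_records(example_vcf)]
print(kept_positions)
```

Records that fail the filter are simply ignored, so they cannot contribute to a lineage call.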

Filtering VCF based on STATUS field

STATUS is a custom field that has been used in some CRyPTIC pipelines. It acts as a finer-grained version of the FILTER column: at a given position, some samples may pass while others do not.

snpit -i in.vcf --status -o out.tsv

Increase threshold for calling a lineage

snpit -i in.vcf --threshold 95

The threshold is the percentage of a lineage's known marker positions that must be exceeded in your sample before that lineage is called. The default is 10.
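The arithmetic behind the threshold can be sketched as follows. This is an illustrative sketch under stated assumptions, not snpit's implementation: the function names, the dictionary layout (position mapped to allele), the example positions, and the choice to report only the best-matching lineage are all hypothetical.

```python
def percentage_shared(sample_calls, lineage_markers):
    """Percentage of a lineage's marker positions where the sample has the marker allele."""
    if not lineage_markers:
        return 0.0
    hits = sum(
        1
        for position, allele in lineage_markers.items()
        if sample_calls.get(position) == allele
    )
    return 100.0 * hits / len(lineage_markers)


def call_lineage(sample_calls, library, threshold=10.0):
    """Return (name, percentage) for the best lineage strictly above threshold, else None."""
    best = None
    for name, markers in library.items():
        pct = percentage_shared(sample_calls, markers)
        if pct > threshold and (best is None or pct > best[1]):
            best = (name, pct)
    return best


# Hypothetical marker library and sample calls, for illustration only.
library = {
    "lineage_a": {100: "A", 200: "C", 300: "G", 400: "T"},
    "lineage_b": {500: "A", 600: "C"},
}
sample = {100: "A", 200: "C", 300: "G", 400: "G", 500: "A"}

print(call_lineage(sample, library))                # 3 of 4 lineage_a markers match
print(call_lineage(sample, library, threshold=95))  # nothing exceeds 95%
```

Raising the threshold trades sensitivity for confidence: with `--threshold 95`, a sample matching only three of four markers for a lineage would not be assigned to it.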

Full usage

To see the full usage/help menu for snpit, run:

snpit --help
usage: snpit [-h] -i INPUT [-o OUTPUT] [--threshold THRESHOLD] [--filter]
             [--status] [-v]

Whole genome SNP based identification of members of the Mycobacterium
tuberculosis complex. SNP-IT allows rapid Mycobacterial speciation of VCF
files aligned to NC000962 (H37Rv).

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Path to the VCF or FAST(A/Q) file to read and
                        classify. File can be multi-sample and/or compressed.
  -o OUTPUT, --output OUTPUT
                        Path to output results to. Default is STDOUT (-).
  --threshold THRESHOLD
                        The percentage of snps above which a sample is
                        considered to belong to a lineage. [10.0]
  --filter              Whether to adhere to the FILTER column.
  --status              Whether to adhere to the STATUS column. This is a
                        custom field that gives more fine-grained control over
                        whether a sample passes a user-defined filtering
                        criterion, even if the record has PASS in FILTER.
  -v, --version         Show the program's version number and exit.

Output format

The output is a tab-delimited file with a header row.

Sample  Species Lineage Sublineage      Name    Percentage
sample1 M. tuberculosis Lineage 2       N/A     beijing 91.78
sample2 M. tuberculosis Lineage 2       N/A     beijing 97.37
sample3 M. tuberculosis Lineage 4       Haarlem haarlem 100.0

From left to right, the columns are:

  • Sample - the name of the sample. This is taken from the sample column heading in the VCF or from the FAST(A/Q) header.
  • Species - the species of the call.
  • Lineage - the lineage of the call (if the species is M. tuberculosis).
  • Sublineage - the sublineage of the call (if applicable).
  • Name - the name of the file in the lib/ directory from which the marker variants for this call were taken. In some cases this is also the common name of the lineage.
  • Percentage - the percentage of the call's marker variants found in the sample.
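Because the output is tab-delimited with a header, it can be read with Python's standard `csv` module. The sketch below is an assumption-laden example rather than part of snpit: the helper names are hypothetical, the embedded rows are adapted from the example output above with the column boundaries inferred from the header, and it assumes the Percentage column always holds a number.

```python
import csv
import io


def read_snpit_output(handle):
    """Read tab-delimited snpit output into a list of dicts keyed by the header."""
    return list(csv.DictReader(handle, delimiter="\t"))


def confident_calls(rows, minimum):
    """Return the sample names whose Percentage is at least `minimum`."""
    return [row["Sample"] for row in rows if float(row["Percentage"]) >= minimum]


# Rows adapted from the example output; real use would open out.tsv instead.
example = (
    "Sample\tSpecies\tLineage\tSublineage\tName\tPercentage\n"
    "sample1\tM. tuberculosis\tLineage 2\tN/A\tbeijing\t91.78\n"
    "sample3\tM. tuberculosis\tLineage 4\tHaarlem\thaarlem\t100.0\n"
)

rows = read_snpit_output(io.StringIO(example))
print(confident_calls(rows, 95.0))
```

To read a real results file, pass an open file handle instead, for example `open("out.tsv", newline="")`.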

Contributing

We welcome any contributions. Firstly, fork this repository and clone it locally.

Next, set up pipenv for the project:

make init
make install
make test

If you wish to put in a pull request to the main repository, please write a thorough description of the changes you have made.

Code style

This project uses the black code formatter. Please ensure any code you wish to merge has been formatted accordingly by running:

make lint
