Here we provide a R Notebook template to generate a sequences variants table from FASTQ files generated by Paired-End Illumina sequencing. For maximal reproducibility, we also provide a R project file and its associated packrat packages managment directory.
For an example of report generated from the R Notebook, see example.nb.html.
Note: Commands described in this documentation assume that you are using a Unix CLI.
- Git clone URL: https://github.com/chuvpne/dada2-pipeline.git
- Documentation: https://github.com/chuvpne/dada2-pipeline
- DADA2: https://benjjneb.github.io/dada2
- illumina-utils: https://github.com/merenlab/illumina-utils
- docker image with pre-installed environment: https://hub.docker.com/r/chuvpne/pne-docker
The minimal requirements are listed below then further detailed in this section:
- R 3.6.0: https://www.r-project.org
- R packrat package: https://cran.r-project.org/web/packages/packrat
- RStudio: https://www.rstudio.com
- GNU parallel: https://www.gnu.org/software/parallel
- illumina-utils: https://github.com/merenlab/illumina-utils
- git: https://git-scm.com/
Note that additional system requirements as well as dependencies may be necessary for the installation of some R packages.
For maximal reproducibility and easy deployement across machines and platforms, you can open the R Notebook template using RStudio server running on a docker container. A suitable docker image satisfying all system requirements for using the provided R Notebook template is available at https://hub.docker.com/r/chuvpne/pne-docker. Instructions are further provided below in the Working on docker section.
The DADA2 pipeline comes as a R package. Make sure that R (3.6.0 or higher) is installed before starting.
R packages are managed using packrat. This ensures maximal reproducibility and portability of the analysis by using an encapsulated, version controlled, installation of R packages instead of any system-level R packages installation.
Make sure that the packrat R package is installed before starting:
# R
> install.packages("packrat")
The dada2-pipeline.Rmd
file provided in this repository is a R Notebook. As such, it includes both code lines (chunks) and text and can be exported into html and pdf files.
The prefered way of using the dada2-pipeline.Rmd
file is with the RStudio Integrated Development Environment (IDE). Make sure you have RStudio installed before starting.
GNU parallel is a shell tool to execute jobs in parallel. Some chunks in the dada2-pipeline.Rmd
R Notebook are written in bash and use GNU parallel to run commands in parallel and reduce computation time.
Make sure you have GNU parallel installed before starting.
On debian linux, you can install GNU parallel using the apt-get
package manager command:
$ apt-get install parallel
You can find the documentation on GNU parallel here. There is also an intro video available here and examples there.
If needed, use the man
command:
$ man parallel
Illumina-utils is a small FASTQ files-processing library.
Some chunks in the dada2-pipeline.Rmd
R Notebook use it to demultiplex FASTQ files.
Make sure illumina-utils is installed before starting. Installation instructions are available at https://github.com/merenlab/illumina-utils/.
Best is to create a virtual environment for illumina-utils using virtualenv. On debian linux, you can install it using Python package manager pip
command:
$ pip install virtualenv
$ virtualenv -p python3 ~/illumina-utils # important to specify Python3
$ source ~/illumina-utils/bin/activate
$ pip install illumina-utils
To activate/deactivate your illumina-utils virtual environment run:
$ source ~/illumina-utils/bin/activate
$ deactivate
This repository is managed using the git distributed version control system. You can get a local copy of it using the git clone
command.
If not done yet, install git.
On debian linux, you can install git using the apt-get
package manager command:
$ apt-get install git
Before starting, make sure you have the five files listed below:
R1.fastq.gz
: FASTQ file for the forward readR2.fastq.gz
: FASTQ file for the reverse readIndex.fastq.gz
: FASTQ file for the index readbarcode_to_sample.txt
: A text file mapping index barcodes to samples- A DADA2-formatted reference database: see https://benjjneb.github.io/dada2/training.html. Per example, Silva version 132:
silva_nr_v132_train_set.fa.gz
andsilva_species_assignment_v132.fa.gz
.
It is assumed that the FASTQ files were archived using gzip.
Please pay particular attention to the format of the files. The following points are critical:
- FASTQ files must use the Phred+33 quality score format and sequences headers (
@
lines) must fit the standard format of the CASAVA 1.8 output:
@EAS139:136:FC706VJ:2:2104:15343:197393 1:N:0:0
- The
barcode_to_sample.txt
file must contain two tab-delimited columns: the first for samples names and the second for samples barcodes as shown here. Avoid special characters.
If multiple sequencing runs have to be analyzed together, create a directory for each run and place the respective FASTQ and barcode_to_sample_[runNN].txt
files inside.
Get a copy of this repository and set it as your working directory:
$ git clone https://github.com/chuvpne/dada2-pipeline.git my_project_dir
$ cd my_project_dir
If not done yet, get a copy of the DADA2-formatted reference database of your choice at https://benjjneb.github.io/dada2/training.html. We recommend to store it in a directory dedicated to databases instead of keeping it inside the main project directory.
Per example, create a db
directory in your home directory and download the SILVA version 132 reference database into it (Assuming wget is installed):
$ mkdir $HOME/db
$ wget https://zenodo.org/record/1172783/files/silva_nr_v132_train_set.fa.gz -P $HOME/db/
$ wget https://zenodo.org/record/1172783/files/silva_species_assignment_v132.fa.gz -P $HOME/db/
Place your FASTQ files and the barcode_to_sample.txt
file in a directory, then place this directory within the directory named run_data
.
The final directory structure should look like:
my_project_dir ├── run_data | └── run01 | ├── R1.fastq.gz | ├── R2.fastq.gz | ├── Index.fastq.gz | └── barcode_to_sample.txt ├── packrat | └── ... ├── dada2-pipeline.Rmd ├── dada2-pipeline.Rproj ├── LICENSE.txt └── README.md
If multiple sequencing runs have to be analyzed together, place each run directory inside the directory named run_data
.
In this case, the final directory structure should look like:
my_project_dir ├── run_data | ├── run01 | | ├── R1.fastq.gz | | ├── R2.fastq.gz | | ├── Index.fastq.gz | | └── barcode_to_sample.txt | ├── run02 | | ├── R1.fastq.gz | | ├── R2.fastq.gz | | ├── Index.fastq.gz | | └── barcode_to_sample.txt . . . . . . | └── runNN | ├── R1.fastq.gz | ├── R2.fastq.gz | ├── Index.fastq.gz | └── barcode_to_sample.txt ├── packrat | └── ... ├── dada2-pipeline.Rmd ├── dada2-pipeline.Rproj ├── LICENSE.txt └── README.md
-
Load the
dada2-pipeline.Rproj
R project file in RStudio. -
Open the
dada2-pipeline.Rmd
R Notebook template in RStudio and follow the instructions in the text and comments. At the end of the pipeline, inital files and results are archived into two separate archives. -
Store archives in a safe place!
A docker image satisfying all system requirements for using the provided R Notebook template is available at https://hub.docker.com/r/chuvpne/pne-docker. See https://github.com/chuvpne/pne-docker for the documentation. This docker image comes with pre-installed R, RStudio server and illumina-utils. The R packrat package is also pre-installed and all system requirements for the installation and usage of the R dada2 package are met.
The DADA2-formatted reference database must be mounted on the docker container.
To do that, add the following option to the docker run
command documented at https://github.com/chuvpne/pne-docker:
-v /path/to/train_set.fa.gz/directory:/home/$USER/db:ro
Replace /path/to/train_set.fa.gz/directory
by the right path to the directory containing the reference database. Per example: $HOME/db/silva
.
If you used this repository in a publication, please mention its url. Per example:
The implementation of the DADA2 pipeline used to process FASTQ files is available at https://github.com/chuvpne/dada2-pipeline.
In addition, you may cite the tools used by this pipeline:
-
DADA2: Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP (2016). "DADA2: High-resolution sample inference from Illumina amplicon data." Nature Methods, 13, 581-583. doi: 10.1038/nmeth.3869.
-
illumina-utils: Eren AM, Vineis JH, Morrison HG, Sogin ML (2013). "A Filtering Method to Generate High Quality Short Reads Using Illumina Paired-End Technology." PLOS ONE, 8(6). doi: 10.1371/journal.pone.0066643.
- Copyright (c) 2018 Service de Pneumologie, Centre Hospitalier Universitaire Vaudois (CHUV), Switzerland and Monash University, Melbourne, Australia
- License: The R Notebook template (.Rmd) is provided under the MIT license (See LICENSE.txt for details)
- Authors: A. Rapin, C. Pattaroni, B.J. Marsland
Thank you for your interest in contributing. Get started here.