docs/customisation/databases.md

## Introduction

Viralgenie uses a multitude of databases to analyze reads, contigs, and consensus constructs. The default databases will be sufficient in most cases, but there are always exceptions. This document will guide you towards the right documentation for creating your custom databases.

!!! Tip
    Keep an eye out for [nf-core createtaxdb](https://nf-co.re/createtaxdb/), as it can be used to customize the main databases, but the pipeline is still under development.

## Reference pool

The reference pool dataset is used to identify potential references for scaffolding. It is a fasta file that is used to build a blast database within the pipeline. The default database is the [clustered Reference Viral DataBase (C-RVDB)](https://rvdb.dbi.udel.edu/), a database built for enhancing virus detection using high-throughput/next-generation sequencing (HTS/NGS) technologies. An alternative reference pool is the [Virosaurus](https://viralzone.expasy.org/8676) database, a manually curated database of viral genomes.

Any nucleotide fasta file will do. Specify it with the parameter `--reference_pool`.
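As a sketch, a custom pool can be as simple as concatenating your own genomes onto an existing collection (the file names and toy sequences below are placeholders; in practice the inputs would be e.g. a C-RVDB download plus in-house genomes):

```shell
# Toy example: build a combined reference pool from two fasta files.
printf '>rvdb_seq example\nACGTACGTACGT\n' > rvdb_subset.fasta
printf '>lab_isolate_1 example\nTTGGCCAATTGG\n' > my_isolates.fasta

cat rvdb_subset.fasta my_isolates.fasta > custom_reference_pool.fasta

# The combined file is then passed to the pipeline via
#   --reference_pool custom_reference_pool.fasta
grep -c '^>' custom_reference_pool.fasta   # -> 2 sequences in the pool
```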
## Kaiju

The Kaiju database is used to classify the reads and intermediate contigs into taxonomic groups. The default database is the RVDB-prot pre-built database from Kaiju.

A number of Kaiju pre-built indexes for reference datasets are maintained by the developers of Kaiju and made available on the [Kaiju website](https://bioinformatics-centre.github.io/kaiju/downloads.html).
To build a Kaiju database, you need three components: a FASTA file with the protein sequences, the NCBI taxonomy dump files, and the set of uppercase characters of the standard 20 amino acids you wish to include.
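These three components come together in Kaiju's own indexing tools; a typical invocation, sketched from the Kaiju documentation with `proteins.faa` as a placeholder input, looks like:

```bash
# Build the Burrows-Wheeler transform from the protein fasta,
# keeping only the uppercase standard amino-acid alphabet:
kaiju-mkbwt -n 4 -a ACDEFGHIKLMNPQRSTVWY -o proteins proteins.faa

# Convert the BWT into the FM-index that Kaiju reads at runtime:
kaiju-mkfmi proteins
```

Together with the taxonomy dump files (`nodes.dmp` and `names.dmp`), the resulting `proteins.fmi` is what Kaiju needs to classify.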
!!! Warning
    The headers of the protein fasta file must be numeric NCBI taxon identifiers of the protein sequences.
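A minimal sketch of rewriting headers to numeric taxon identifiers (the accession-to-taxid map and input fasta below are toy data; for real data, NCBI's `prot.accession2taxid` file can provide the mapping):

```shell
# Toy inputs: a protein fasta with accession headers and an
# accession -> NCBI taxid mapping (tab-separated).
printf '>WP_000001.1 demo protein A\nMKVL\n>WP_000002.1 demo protein B\nMLAG\n' > proteins.faa
printf 'WP_000001.1\t10239\nWP_000002.1\t2697049\n' > acc2taxid.tsv

# Replace each fasta header with its numeric taxon identifier.
awk 'NR==FNR { tax[$1] = $2; next }
     /^>/    { print ">" tax[substr($1, 2)]; next }
             { print }' acc2taxid.tsv proteins.faa > proteins.taxid.faa

head -n 1 proteins.taxid.faa   # -> >10239
```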

You can then add your FASTA files with the following build command.

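A sketch of the add step, using Kraken2's standard `--add-to-library` flag (`<YOUR_DB_NAME>` and the fasta path are placeholders):

```bash
# Add a fasta file to the library of a custom Kraken2 database.
# Sequence headers must carry NCBI taxids (e.g. `>seq1|kraken:taxid|9606`)
# or be accessions known to the downloaded taxonomy.
kraken2-build --add-to-library my_genome.fna --db <YOUR_DB_NAME>
```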
You can repeat this step multiple times to iteratively add more genomes prior to building.
Once all genomes are added to the library, you can build the database (and optionally clean it up):
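A typical build-and-clean invocation (a sketch; `<YOUR_DB_NAME>` is a placeholder):

```bash
# Build the database from everything in the library:
kraken2-build --build --db <YOUR_DB_NAME>

# Optionally remove intermediate files, keeping only the *.k2d files:
kraken2-build --clean --db <YOUR_DB_NAME>
```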

You can then add the `<YOUR_DB_NAME>/` path to your nf-core/taxprofiler database sheet. The built database directory contains, among others:

- `hash.k2d`
- `taxo.k2d`

You can follow the Kraken2 [tutorial](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#custom-databases) for a more detailed description.
### Host read removal
Viralgenie uses Kraken2 to remove contaminated reads.
!!! info
    The reason why we use Kraken2 for host removal over regular read mappers is nicely explained in the following papers:

    * [The human “contaminome”: bacterial, viral, and computational contamination in whole genome sequences from 1000 families](https://www.nature.com/articles/s41598-022-13269-z)
    * [Reconstruction of the personal information from human genome reads in gut metagenome sequencing data](https://www.nature.com/articles/s41564-023-01381-3)
The contamination database is likely the largest database. The default database is deliberately kept small to save storage for end users, but it is not optimal. I would recommend creating a database consisting of the libraries `human, archaea, bacteria`, which will be more than 200GB in size. Additionally, it's good practice to include the DNA & RNA of the host of origin if known (e.g. mice, ticks, mosquitoes, ...). Add it as described above.
Set it with the parameter `--host_k2_db`.
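As a sketch of building the recommended contamination database with the standard `kraken2-build` steps (the mouse genome file is a placeholder for your host of interest):

```bash
kraken2-build --download-taxonomy --db contam_db
kraken2-build --download-library human    --db contam_db
kraken2-build --download-library archaea  --db contam_db
kraken2-build --download-library bacteria --db contam_db

# Optionally add host genome/transcriptome sequences:
kraken2-build --add-to-library mouse_genome.fna --db contam_db

kraken2-build --build --db contam_db
```

The pipeline would then be run with `--host_k2_db contam_db`.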
### Viral Diversity with Kraken2

The metagenomic diversity estimated with Kraken2 is based on the viral RefSeq database, which can fall short if you expect the species within your sample to have a large amount of diversity, e.g. below 80% ANI ([quasi-species](https://link.springer.com/chapter/10.1007/978-3-642-77011-1_1)). To resolve this, it's better to create a database that contains a wider species diversity than only one genome per species. Databases with this wider diversity are [Virosaurus](https://viralzone.expasy.org/8676) or the [RVDB](https://rvdb.dbi.udel.edu/home), which can increase the accuracy of Kraken2. Add it as described above.

Set it with the parameter `--kraken2_db`.
## Annotation sequences
Identifying the species and the segment of the final genome constructs is done with a tblastx search (using MMseqs2) against an annotated sequence dataset. By default, this dataset is [Virosaurus](https://viralzone.expasy.org/8676), as it contains a good representation of the viral genomes and is annotated.
This annotation database can be specified using `--annotation_db`

In case [Virosaurus](https://viralzone.expasy.org/8676) does not suffice your needs, you can create your own annotation dataset.

An easy-to-use public database with a lot of metadata is [BV-BRC](https://www.bv-brc.org/). Sequences can be extracted using their [CLI-tool](https://www.bv-brc.org/docs/cli_tutorial/index.html) and linked to their [metadata](https://www.bv-brc.org/docs/cli_tutorial/cli_getting_started.html#the-bv-brc-database).

Here we select all viral genomes that are reference genomes and not lab reassortments, and we add metadata attributes to the output.
> This is an example; if you need a more elaborate dataset than Virosaurus, be more inclusive towards your taxa of interest and include more metadata attributes.
Any attribute can be downloaded and will be added to the final report if the formatting remains the same.
For a complete list of attributes, see `p3-all-genomes --fields` or read their [manual](https://www.bv-brc.org/docs/cli_tutorial/cli_getting_started.html).
Next, the metadata and the genomic data are combined into a single fasta file where the metadata fields are stored in the fasta comment as `key1="value1"|key2="value2"|...`, using the following Python code.
```python
import pandas as pd

# ... (read the metadata table and the downloaded sequences,
#      then write the combined, annotated fasta) ...

with open("bv-brc-refvirus-anno.fasta", "w") as f:
    ...
```

- `refseq-virus.fasta`
- `refseq-virus-anno.txt`
- `bv-brc-refvirus-anno.fasta`