Commit e23def3

Fixing a lot more spelling mistakes

1 parent c90297d · commit e23def3

4 files changed: +113 -147 lines


docs/customisation/databases.md (+14 -48)
@@ -2,23 +2,23 @@
 
 ## Introduction
 
-Viralgenie uses a multitude of databases in order to analyse reads, contigs and consensus constructs. The default databases will be sufficient in most cases but there are always exceptions. This document will guide you towards the right documentation location for creating your custom databases.
+Viralgenie uses a multitude of databases in order to analyze reads, contigs, and consensus constructs. The default databases will be sufficient in most cases but there are always exceptions. This document will guide you towards the right documentation location for creating your custom databases.
 
 !!! Tip
     Keep an eye out for [nf-core createtaxdb](https://nf-co.re/createtaxdb/) as it can be used for the customization of the main databases but the pipeline is still under development.
 
 ## Reference pool
 
-The reference pool dataset is used to identify potential references for scaffolding. It's a fasta file that will be used to make a blast database within the pipeline. The default database is the [clustered Reference Viral DataBase (C-RVDB)](https://rvdb.dbi.udel.edu/) a database that was build for enhancing virus detection using high-throughput/next-generation sequencing (HTS/NGS) technologies. An alternative reference pool is the [Virosaurus](https://viralzone.expasy.org/8676) database which is a manually curated database of viral genomes.
+The reference pool dataset is used to identify potential references for scaffolding. It's a fasta file that will be used to make a blast database within the pipeline. The default database is the [clustered Reference Viral DataBase (C-RVDB)](https://rvdb.dbi.udel.edu/) a database that was built for enhancing virus detection using high-throughput/next-generation sequencing (HTS/NGS) technologies. An alternative reference pool is the [Virosaurus](https://viralzone.expasy.org/8676) database which is a manually curated database of viral genomes.
 
 Any nucleotide fasta file will do. Specify it with the parameter `--reference_pool`.
 
 ## Kaiju
 
-The kaiju database will be used to classify the reads and intermediate contigs in taxonomic groups. The default database is the RVDB-prot pre-built database from Kaiju.
+The Kaiju database will be used to classify the reads and intermediate contigs in taxonomic groups. The default database is the RVDB-prot pre-built database from Kaiju.
 
-A number of Kaiju pre-built indexes for reference datasets are maintained by the the developers of Kaiju and made available on the [Kaiju website](https://bioinformatics-centre.github.io/kaiju/downloads.html).
-To build a kaiju database, you need three components: a FASTA file with the protein sequences, the NCBI taxonomy dump files, and you need to define the uppercase characters of the standard 20 amino acids you wish to include.
+A number of Kaiju pre-built indexes for reference datasets are maintained by the developers of Kaiju and made available on the [Kaiju website](https://bioinformatics-centre.github.io/kaiju/downloads.html).
+To build a Kaiju database, you need three components: a FASTA file with the protein sequences, the NCBI taxonomy dump files, and you need to define the uppercase characters of the standard 20 amino acids you wish to include.
 
 !!! Warning
     The headers of the protein fasta file must be numeric NCBI taxon identifiers of the protein sequences.
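The header rule in the warning above is easy to violate when pooling protein sets from several sources, so it is worth checking before building. A minimal Python sketch of such a pre-build sanity check (the helper name and example records are hypothetical, and a plain digit test is a simplification of what Kaiju accepts):

```python
# Minimal sketch: flag FASTA headers that are not purely numeric NCBI
# taxon identifiers, as the Kaiju database build requires.
# The example records below are illustrative, not real sequences.

def invalid_kaiju_headers(fasta_lines):
    """Return headers that are not purely numeric taxon IDs."""
    bad = []
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if not header.isdigit():
                bad.append(header)
    return bad

records = [
    ">10279",                # numeric taxon ID: OK
    "MASNETLRLPG",
    ">NC_001731 Molluscum",  # accession, not a taxon ID: flagged
    "MKVLAACTST",
]
print(invalid_kaiju_headers(records))  # → ['NC_001731 Molluscum']
```

Running a check like this against each source file before concatenating them saves a failed (and slow) index build later.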
@@ -72,7 +72,7 @@ You can then add your FASTA files with the following build command.
 kraken2-build --add-to-library *.fna --db <YOUR_DB_NAME>
 ```
 
-You can repeat this step multiple times to iteratively add more genomes prior building.
+You can repeat this step multiple times to iteratively add more genomes prior to building.
 
 Once all genomes are added to the library, you can build the database (and optionally clean it up):
 
@@ -89,32 +89,31 @@ You can then add the `<YOUR_DB_NAME>/` path to your nf-core/taxprofiler database
 - `hash.k2d`
 - `taxo.k2d`
 
-
 You can follow the Kraken2 [tutorial](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#custom-databases) for a more detailed description.
 
 ### Host read removal
 
-Viralgenie uses kraken2 to remove contaminated reads.
+Viralgenie uses Kraken2 to remove contaminated reads.
 
 !!! info
     The reason why we use Kraken2 for host removal over regular read mappers is nicely explained in the following papers:
 
     * [The human “contaminome”: bacterial, viral, and computational contamination in whole genome sequences from 1000 families](https://www.nature.com/articles/s41598-022-13269-z)
     * [Reconstruction of the personal information from human genome reads in gut metagenome sequencing data](https://www.nature.com/articles/s41564-023-01381-3)
 
-The contamination database is likely the largest database. The default databases is made small explicitly made smaller to save storage for end users but is not optimal. I would recommend to create a database consisting of the libraries `human, archea, bacteria` which will be more then 200GB in size. Additionally, it's good practice to include DNA & RNA of the host of origin if known (i.e. mice, ticks, mosquito, ... ). Add it as described above.
+The contamination database is likely the largest database. The default database is explicitly made small to save storage for end users but is not optimal. I would recommend creating a database consisting of the libraries `human, archaea, bacteria`, which will be more than 200GB in size. Additionally, it's good practice to include DNA & RNA of the host of origin if known (i.e. mice, ticks, mosquito, ... ). Add it as described above.
 
 Set it with the variable `--host_k2_db`
 
 ### Viral Diversity with Kraken2
 
-The metagenomic diveristy estimated with kraken2 is based on the viral refseq database which can cut short if you expect your the species within your sample to have a large amount of diversity eg below 80% ANI ([quasi-species](https://link.springer.com/chapter/10.1007/978-3-642-77011-1_1)). To resolve this it's better to create a database that contains a wider species diversity then only one genome per species. Databases that have this wider diversity is [Virosaurus](https://viralzone.expasy.org/8676) or the [RVDB](https://rvdb.dbi.udel.edu/home) which can increase the accuracy of kraken2. Add it as described above.
+The metagenomic diversity estimated with Kraken2 is based on the viral refseq database, which can fall short if you expect the species within your sample to have a large amount of diversity, e.g. below 80% ANI ([quasi-species](https://link.springer.com/chapter/10.1007/978-3-642-77011-1_1)). To resolve this it's better to create a database that contains a wider species diversity than only one genome per species. Databases that have this wider diversity are [Virosaurus](https://viralzone.expasy.org/8676) or the [RVDB](https://rvdb.dbi.udel.edu/home), which can increase the accuracy of Kraken2. Add it as described above.
 
 Set it with the variable `--kraken2_db`
 
 ## Annotation sequences
 
-Identifying the species and the segment of the final genome constructs is done based on a tblastx search (with MMSEQS) to a annotated sequencing dataset. This dataset is by default the [Virosaurus](https://viralzone.expasy.org/8676) as it contains a good representation of the viral genomes and is annotated.
+Identifying the species and the segment of the final genome constructs is done based on a tblastx search (with MMSEQS) to an annotated sequencing dataset. This dataset is by default the [Virosaurus](https://viralzone.expasy.org/8676) as it contains a good representation of the viral genomes and is annotated.
 
 This annotation database can be specified using `--annotation_db`
 

@@ -127,23 +126,23 @@ In case [Virosaurus](https://viralzone.expasy.org/8676) does not suffice your ne
 >NC_001731; usual name=Molluscum contagiosum virus; clinical level=SPECIES; clinical typing=unknown; species=Molluscum contagiosum virus; taxid=10279; acronym=MOCV; nucleic acid=DNA; circular=N; segment=N/A; host=Human,Vertebrate;
 ```
 
-An easy to use public database with a lot of metadata is [BV-BRC](https://www.bv-brc.org/). Sequences can be extracted using their [CLI-tool](https://www.bv-brc.org/docs/cli_tutorial/index.html) and linked to their [metadata](https://www.bv-brc.org/docs/cli_tutorial/cli_getting_started.html#the-bv-brc-database)
+An easy-to-use public database with a lot of metadata is [BV-BRC](https://www.bv-brc.org/). Sequences can be extracted using their [CLI-tool](https://www.bv-brc.org/docs/cli_tutorial/index.html) and linked to their [metadata](https://www.bv-brc.org/docs/cli_tutorial/cli_getting_started.html#the-bv-brc-database)
 
 Here we select all viral genomes that are not lab reassortments and are reference genomes and add metadata attributes to the output.
-> This is an example, in case you need to have a more elaborate dataset then virosaurus, be more inclusive towards your taxa of interest and include more metadata attributes.
+> This is an example, in case you need to have a more elaborate dataset than Virosaurus, be more inclusive towards your taxa of interest and include more metadata attributes.
 
 ```bash
 # download annotation metadata +/- 5s
 p3-all-genomes --eq superkingdom,Viruses --eq reference_genome,Reference --ne host_common_name,'Lab reassortment' --attr genome_id,species,segment,genome_name,genome_length,host_common_name,genbank_accessions,taxon_id > all-virus-anno.txt
-# download genome data, done seperatly as it takes much longer to query +/- 1 hour
+# download genome data, done separately as it takes much longer to query +/- 1 hour
 p3-all-genomes --eq superkingdom,Viruses --eq reference_genome,Reference --ne host_common_name,'Lab reassortment' | p3-get-genome-contigs --attr sequence > all-virus.fasta
 ```
 
 !!! Tip
     Any attribute can be downloaded and will be added to the final report if the formatting remains the same.
     For a complete list of attributes see `p3-all-genomes --fields` or read their [manual](https://www.bv-brc.org/docs/cli_tutorial/cli_getting_started.html)
 
-Next, the metadata and the genomic data is combined into a single fasta file where the metada fields are stored in the fasta comment as `key1="value1"|key2="value2"|...` using the following python code.
+Next, the metadata and the genomic data are combined into a single fasta file where the metadata fields are stored in the fasta comment as `key1="value1"|key2="value2"|...` using the following python code.
 
 ```python
 import pandas as pd
@@ -184,36 +183,3 @@ with open("bv-brc-refvirus-anno.fasta", "w") as f:
 - `refseq-virus.fasta`
 - `refseq-virus-anno.txt`
 - `bv-brc-refvirus-anno.fasta`
-
-
-<!-- ### Bracken
-
-Bracken does not require an independent database but rather builds upon Kraken2 databases. [The pre-built Kraken2 databases hosted by Ben Langmead](https://benlangmead.github.io/aws-indexes/k2) already contain the required files to run Bracken.
-
-However, to build custom databases, you will need a Kraken2 database, the (average) read lengths (in bp) of your sequencing experiment, the K-mer size used to build the Kraken2 database, and Kraken2 available on your machine.
-
-```bash
-bracken-build -d <KRAKEN_DB_DIR> -k <KRAKEN_DB_KMER_LENGTH> -l <READLENGTH>
-```
-
-> [!Tip]
-> You can speed up database construction by supplying the threads parameter (`-t`).
-
-> [!Tip]
-> If you do not have Kraken2 in your `$PATH` you can point to the binary with `-x /<path>/<to>/kraken2`.
-
-<details markdown="1">
-<summary>Expected files in database directory</summary>
-
-- `bracken`
-- `hash.k2d`
-- `opts.k2d`
-- `taxo.k2d`
-- `database100mers.kmer_distrib`
-- `database150mers.kmer_distrib`
-
-</details>
-
-You can follow Bracken [tutorial](https://ccb.jhu.edu/software/bracken/index.shtml?t=manual) for more information.
-
-Set it with the variable `--bracken_db` -->
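As a side note on the `key1="value1"|key2="value2"|...` comment format this file's annotation section describes: the idea can be sketched as a short, self-contained Python snippet. The helper name and the example metadata are illustrative, not part of the pipeline's actual code:

```python
# Minimal sketch: emit one FASTA record whose description carries metadata
# fields as key1="value1"|key2="value2"|..., the comment format described
# for the annotation database. The record below is illustrative.

def fasta_record(seq_id, metadata, sequence, width=60):
    """Format one FASTA record with metadata packed into the comment."""
    comment = "|".join(f'{key}="{value}"' for key, value in metadata.items())
    lines = [f">{seq_id} {comment}"]
    # wrap the sequence at a fixed line width
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)

meta = {"species": "Molluscum contagiosum virus", "taxid": "10279", "segment": "N/A"}
print(fasta_record("NC_001731", meta, "ATGC" * 40).splitlines()[0])
# → >NC_001731 species="Molluscum contagiosum virus"|taxid="10279"|segment="N/A"
```

Keeping every metadata field quoted and pipe-separated in the header comment is what lets the downstream report recover arbitrary attributes without a separate lookup table.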
