docs/customisation/databases.md

## Introduction

Viralgenie uses a multitude of databases to analyze reads, contigs, and consensus constructs. The default databases will be sufficient in most cases, but there are always exceptions. This document will guide you towards the right documentation for creating your custom databases.

!!! Tip
    Keep an eye out for [nf-core createtaxdb](https://nf-co.re/createtaxdb/), as it can be used to customize the main databases, but the pipeline is still under development.

## Reference pool

The reference pool dataset is used to identify potential references for scaffolding. It is a fasta file that is used to build a blast database within the pipeline. The default database is the [clustered Reference Viral DataBase (C-RVDB)](https://rvdb.dbi.udel.edu/), a database built for enhancing virus detection using high-throughput/next-generation sequencing (HTS/NGS) technologies. An alternative reference pool is the [Virosaurus](https://viralzone.expasy.org/8676) database, a manually curated database of viral genomes.

Any nucleotide fasta file will do. Specify it with the parameter `--reference_pool`.
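As a sketch, a custom pool can be as simple as concatenating your own genomes onto an existing collection (the file names and toy sequences below are placeholders; in practice the inputs would be e.g. a C-RVDB download plus in-house genomes):

```shell
# Toy example: build a combined reference pool from two fasta files.
printf '>rvdb_seq example\nACGTACGTACGT\n' > rvdb_subset.fasta
printf '>lab_isolate_1 example\nTTGGCCAATTGG\n' > my_isolates.fasta

cat rvdb_subset.fasta my_isolates.fasta > custom_reference_pool.fasta

# The combined file is then passed to the pipeline via
#   --reference_pool custom_reference_pool.fasta
grep -c '^>' custom_reference_pool.fasta   # -> 2 sequences in the pool
```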
## Kaiju

The Kaiju database is used to classify the reads and intermediate contigs into taxonomic groups. The default database is the RVDB-prot pre-built database from Kaiju.

A number of Kaiju pre-built indexes for reference datasets are maintained by the developers of Kaiju and made available on the [Kaiju website](https://bioinformatics-centre.github.io/kaiju/downloads.html).
To build a Kaiju database, you need three components: a FASTA file with the protein sequences, the NCBI taxonomy dump files, and the set of uppercase characters of the standard 20 amino acids you wish to include.
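These three components come together in Kaiju's own indexing tools; a typical invocation, sketched from the Kaiju documentation with `proteins.faa` as a placeholder input, looks like:

```bash
# Build the Burrows-Wheeler transform from the protein fasta,
# keeping only the uppercase standard amino-acid alphabet:
kaiju-mkbwt -n 4 -a ACDEFGHIKLMNPQRSTVWY -o proteins proteins.faa

# Convert the BWT into the FM-index that Kaiju reads at runtime:
kaiju-mkfmi proteins
```

Together with the taxonomy dump files (`nodes.dmp` and `names.dmp`), the resulting `proteins.fmi` is what Kaiju needs to classify.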
!!! Warning
    The headers of the protein fasta file must be numeric NCBI taxon identifiers of the protein sequences.
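A minimal sketch of rewriting headers to numeric taxon identifiers (the accession-to-taxid map and input fasta below are toy data; for real data, NCBI's `prot.accession2taxid` file can provide the mapping):

```shell
# Toy inputs: a protein fasta with accession headers and an
# accession -> NCBI taxid mapping (tab-separated).
printf '>WP_000001.1 demo protein A\nMKVL\n>WP_000002.1 demo protein B\nMLAG\n' > proteins.faa
printf 'WP_000001.1\t10239\nWP_000002.1\t2697049\n' > acc2taxid.tsv

# Replace each fasta header with its numeric taxon identifier.
awk 'NR==FNR { tax[$1] = $2; next }
     /^>/    { print ">" tax[substr($1, 2)]; next }
             { print }' acc2taxid.tsv proteins.faa > proteins.taxid.faa

head -n 1 proteins.taxid.faa   # -> >10239
```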

You can then add your FASTA files with the following build command.

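A sketch of the add step, using Kraken2's standard `--add-to-library` flag (`<YOUR_DB_NAME>` and the fasta path are placeholders):

```bash
# Add a fasta file to the library of a custom Kraken2 database.
# Sequence headers must carry NCBI taxids (e.g. `>seq1|kraken:taxid|9606`)
# or be accessions known to the downloaded taxonomy.
kraken2-build --add-to-library my_genome.fna --db <YOUR_DB_NAME>
```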
You can repeat this step multiple times to iteratively add more genomes prior to building.
Once all genomes are added to the library, you can build the database (and optionally clean it up):
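A typical build-and-clean invocation (a sketch; `<YOUR_DB_NAME>` is a placeholder):

```bash
# Build the database from everything in the library:
kraken2-build --build --db <YOUR_DB_NAME>

# Optionally remove intermediate files, keeping only the *.k2d files:
kraken2-build --clean --db <YOUR_DB_NAME>
```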

You can then add the `<YOUR_DB_NAME>/` path to your nf-core/taxprofiler database sheet. The built database directory contains, among others:

- `hash.k2d`
- `taxo.k2d`

You can follow the Kraken2 [tutorial](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#custom-databases) for a more detailed description.
### Host read removal
Viralgenie uses Kraken2 to remove contaminated reads.
!!! info
    The reason why we use Kraken2 for host removal over regular read mappers is nicely explained in the following papers:

    * [The human “contaminome”: bacterial, viral, and computational contamination in whole genome sequences from 1000 families](https://www.nature.com/articles/s41598-022-13269-z)
    * [Reconstruction of the personal information from human genome reads in gut metagenome sequencing data](https://www.nature.com/articles/s41564-023-01381-3)
The contamination database is likely the largest database. The default database is deliberately kept small to save storage for end users, but it is not optimal. I would recommend creating a database consisting of the libraries `human, archaea, bacteria`, which will be more than 200GB in size. Additionally, it's good practice to include the DNA & RNA of the host of origin if known (e.g. mice, ticks, mosquitoes, ...). Add it as described above.
Set it with the parameter `--host_k2_db`.
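As a sketch of building the recommended contamination database with the standard `kraken2-build` steps (the mouse genome file is a placeholder for your host of interest):

```bash
kraken2-build --download-taxonomy --db contam_db
kraken2-build --download-library human    --db contam_db
kraken2-build --download-library archaea  --db contam_db
kraken2-build --download-library bacteria --db contam_db

# Optionally add host genome/transcriptome sequences:
kraken2-build --add-to-library mouse_genome.fna --db contam_db

kraken2-build --build --db contam_db
```

The pipeline would then be run with `--host_k2_db contam_db`.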
### Viral Diversity with Kraken2

The metagenomic diversity estimated with Kraken2 is based on the viral RefSeq database, which can fall short if you expect the species within your sample to have a large amount of diversity, e.g. below 80% ANI ([quasi-species](https://link.springer.com/chapter/10.1007/978-3-642-77011-1_1)). To resolve this, it's better to create a database that contains a wider species diversity than only one genome per species. Databases with this wider diversity are [Virosaurus](https://viralzone.expasy.org/8676) or the [RVDB](https://rvdb.dbi.udel.edu/home), which can increase the accuracy of Kraken2. Add it as described above.

Set it with the parameter `--kraken2_db`.
## Annotation sequences
Identifying the species and the segment of the final genome constructs is done with a tblastx search (using MMseqs2) against an annotated sequence dataset. By default, this dataset is [Virosaurus](https://viralzone.expasy.org/8676), as it contains a good representation of the viral genomes and is annotated.
This annotation database can be specified using `--annotation_db`

In case [Virosaurus](https://viralzone.expasy.org/8676) does not suffice your needs, you can create your own annotation dataset.

An easy-to-use public database with a lot of metadata is [BV-BRC](https://www.bv-brc.org/). Sequences can be extracted using their [CLI-tool](https://www.bv-brc.org/docs/cli_tutorial/index.html) and linked to their [metadata](https://www.bv-brc.org/docs/cli_tutorial/cli_getting_started.html#the-bv-brc-database).

Here we select all viral genomes that are reference genomes and not lab reassortments, and we add metadata attributes to the output.
> This is an example; if you need a more elaborate dataset than Virosaurus, be more inclusive towards your taxa of interest and include more metadata attributes.
Any attribute can be downloaded and will be added to the final report if the formatting remains the same.
For a complete list of attributes, see `p3-all-genomes --fields` or read their [manual](https://www.bv-brc.org/docs/cli_tutorial/cli_getting_started.html).
Next, the metadata and the genomic data are combined into a single fasta file where the metadata fields are stored in the fasta comment as `key1="value1"|key2="value2"|...`, using the following Python code.
```python
import pandas as pd

# ... (read the metadata table and the downloaded sequences,
#      then write the combined, annotated fasta) ...

with open("bv-brc-refvirus-anno.fasta", "w") as f:
    ...
```

- `refseq-virus.fasta`
- `refseq-virus-anno.txt`
- `bv-brc-refvirus-anno.fasta`