@@ -17,28 +21,48 @@ Viralgenie offers an elaborate workflow for the assembly and polishing of viral
The consensus genomes of all clusters are then sent to the [variant analysis & iterative refinement](variant_and_refinement.md) step.
-## De-novo Assembly
+## 1. De-novo Assembly
+
Three assemblers are used, [SPAdes](http://cab.spbu.ru/software/spades/), [Megahit](https://github.com/voutcn/megahit), and [Trinity](https://github.com/trinityrnaseq/trinityrnaseq). The resulting contigs of all specified assemblers are combined and processed further together.
> Modify the SPAdes mode with `--spades_mode [default: rnaviral]` and supply specific parameters with `--spades_yml` or an HMM model with `--spades_hmm`.
> Specify the assemblers to use with the `--assemblers` parameter where the assemblers are separated with a ','. The default is `spades,megahit,trinity`.
-Contigs can be extended using [SSPACE Basic](https://github.com/nsoranzo/sspace_basic) with the `--skip_sspace_basic false` parameter. SSPACE is a tool for scaffolding contigs using paired-end reads. It is modified from SSAKE assembler and has the feature of extending contigs using reads that are unmappable in the contig assembly step.
-
Low complexity contigs can be filtered out using prinseq++ with the `--skip_contig_prinseq false` parameter. Complexity filtering is primarily a run-time optimisation step. Low-complexity sequences are defined as having commonly found stretches of nucleotides with limited information content (e.g. the dinucleotide repeat CACACACACA). Such sequences can produce a large number of high-scoring but biologically insignificant results in database searches. Removing these contigs therefore saves computational time and resources.
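To make "limited information content" concrete, the following is a minimal Python sketch that scores a sequence by the Shannon entropy of its overlapping trinucleotides. It is an illustration only and not how prinseq++ is implemented; the helper name `kmer_entropy` and the choice of `k = 3` are assumptions made for this example.

```python
import math
from collections import Counter


def kmer_entropy(seq: str, k: int = 3) -> float:
    """Shannon entropy (in bits) of the overlapping k-mers in `seq`.

    Hypothetical helper for illustration; not part of prinseq++ or viralgenie.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    total = len(kmers)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# The dinucleotide repeat from the text only ever yields two distinct 3-mers.
low = kmer_entropy("CACACACACA")
# An arbitrary, more varied sequence yields many distinct 3-mers.
high = kmer_entropy("ATGCGTACCTGA")
print(low, high)
```

A filter built on such a score would drop sequences whose entropy falls under a chosen cut-off; where that cut-off should sit is a tuning decision this sketch does not address.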
-## Reference Matching
+## 2. Extension
+
+Contigs can be extended using [SSPACE Basic](https://github.com/nsoranzo/sspace_basic) with the `--skip_sspace_basic false` parameter. SSPACE is a tool for scaffolding contigs using paired-end reads. It is derived from the SSAKE assembler and can extend contigs using reads that could not be mapped in the contig assembly step. To get the most out of it, consider specifying the arguments `--read_distance`, `--read_distance_sd`, and `--read_orientation`. For more information on these arguments, see the [parameters assembly section](../parameters.md#assembly).
+
+> The extension of contigs is run by default; to skip this step, use `--skip_sspace_basic`.
+
+## 3. Coverage calculation
+
+Processed reads are mapped back against the contigs to determine the number of reads mapping to each contig. This is done with [`BowTie2`](http://bowtie-bio.sourceforge.net/bowtie2/), [`BWAmem2`](https://github.com/bwa-mem2/bwa-mem2) or [`BWA`](https://github.com/lh3/bwa). This step is used downstream to remove contig clusters that have little to no coverage.
+
+> Specify the mapper to use with the `--mapper` parameter. The default is [`BWAmem2`](https://github.com/bwa-mem2/bwa-mem2). To skip contig filtering, specify `--perc_reads_contig 0`.
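As a rough illustration of what "the number of reads mapping to each contig" turns into downstream, the sketch below converts per-contig read counts into percentages of all reads. The function name, the input dictionary, and the use of the total read count as the denominator are assumptions for this example, not the pipeline's actual code.

```python
def percent_reads_per_contig(mapped_reads: dict[str, int], total_reads: int) -> dict[str, float]:
    """Express per-contig mapped read counts as a percentage of all reads.

    Hypothetical helper; assumes `mapped_reads` was already extracted from
    the mapper output and that `total_reads` is the number of processed reads.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return {contig: 100.0 * n / total_reads for contig, n in mapped_reads.items()}


# Made-up counts for three contigs out of 1000 processed reads.
counts = {"contig_1": 600, "contig_2": 50, "contig_3": 0}
percentages = percent_reads_per_contig(counts, total_reads=1000)
print(percentages)
```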
+
+## 4. Reference Matching
+
The newly assembled contigs are compared to a reference sequence pool (`--reference_pool`) using a [BLASTn search](https://www.ncbi.nlm.nih.gov/books/NBK153387/). This process not only helps annotate the contigs but also assists in linking together sets of contigs that are distant within a single genome. Essentially, it aids in identifying contigs belonging to the same genomic segment and choosing the right reference for scaffolding purposes.
The top 5 hits for each contig are combined with the de novo contigs and sent to the clustering step.
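Selecting the top hits per contig can be sketched as follows. This is a minimal Python illustration, assuming the BLASTn results have already been parsed into `(contig, reference, bitscore)` tuples and that hits are ranked by bitscore; both the tuple layout and the ranking criterion are assumptions, not something the text above specifies.

```python
from collections import defaultdict


def top_hits(hits, n=5):
    """Return the `n` best-scoring reference IDs for every contig.

    `hits` is an iterable of (contig_id, reference_id, bitscore) tuples
    (hypothetical layout chosen for this example).
    """
    by_contig = defaultdict(list)
    for contig, reference, bitscore in hits:
        by_contig[contig].append((bitscore, reference))
    return {
        contig: [ref for _, ref in sorted(scored, key=lambda s: s[0], reverse=True)[:n]]
        for contig, scored in by_contig.items()
    }


# Made-up hits: seven references for contig c1, one for contig c2.
hits = [("c1", f"ref{i}", float(i)) for i in range(1, 8)] + [("c2", "refA", 50.0)]
best = top_hits(hits)
print(best)
```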
> The reference pool can be specified with the `--reference_pool` parameter. The default is the latest clustered [Reference Viral DataBase (RVDB)](https://rvdb.dbi.udel.edu/).
-## Clustering
+## 5. Taxonomy guided Clustering
-The clustering workflow of contigs consists out of 2 steps, the [pre-clustering](#pre-clustering) and [actual clustering](#actual-clustering). Here contigs are first separated based on identified taxonomy-id ([Kraken2](https://ccb.jhu.edu/software/kraken2/), [Kaiju](https://kaiju.binf.ku.dk/)) and are subsequently clustered further to identify genome segments.
+The clustering workflow of contigs consists of two steps, the [pre-clustering using taxonomy](#51-pre-clustering-using-taxonomy) and
+[actual clustering on nucleotide similarity](#52-actual-clustering-on-nucloetide-similarity). The taxonomy guided clustering is used to separate contigs based on taxonomy and nucleotide similarity.
-### Pre-clustering
+```mermaid
+graph LR;
+A[Contigs] --> B["`**Pre-clustering**`"];
+B --> C["`**Actual clustering**`"];
+```
+
+
+### 5.1 Pre-clustering using taxonomy
The contigs along with their references have their taxonomy assigned using [Kraken2](https://ccb.jhu.edu/software/kraken2/) and [Kaiju](https://kaiju.binf.ku.dk/).
@@ -58,7 +82,7 @@ graph LR;
E --> F["Taxon simplification"];
```
-!!! Tip annotate "Having very complex metagenomes"
+!!! Tip annotate "Having complex metagenomic samples?"
The pre-clustering step can be used to simplify the taxonomy of the contigs. [NCBI's taxonomy browser](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi) can help you identify taxon IDs for simplification. The simplification can be done in several ways:
- Make sure your contamination database is up to date and removes the relevant taxa.
> The pre-clustering step will be run by default but can be skipped with the argument `--skip_preclustering`. Specify which classifier to use with the `--precluster_classifiers` parameter. The default is `kaiju,kraken2`. Contig taxon filtering is still enabled despite not having to solve for inconsistencies if only Kaiju or Kraken2 is run.
-### Actual clustering
+### 5.2 Actual clustering on nucloetide similarity
The clustering is performed with one of the following tools:
@@ -111,18 +137,33 @@ These methods all come with their own advantages and disadvantages. For example,
!!! Tip
When pre-clustering is performed, it is recommended to set a lower identity_threshold (60-70% ANI) as the new goal becomes to separate genome segments within the same bin.
-> The clustering method can be specified with the `--clustering_method` parameter. The default is `mash`.
+> The clustering method can be specified with the `--clustering_method` parameter. The default is `cdhitest`.
> The network clustering method for `mash` can be specified with the `--network_clustering` parameter. The default is `connected_components`; the alternative is [`leiden`](https://www.nature.com/articles/s41598-019-41695-z).
-> The similarity threshold can be specified with the `--similarity_threshold` parameter. The default is `0.6`. However, for cdhit the default is `0.8` which is its minimum value.
+> The similarity threshold can be specified with the `--similarity_threshold` parameter. The default is `0.85`.
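The idea behind `connected_components` network clustering with a similarity threshold can be pictured as follows: link every pair of sequences whose similarity reaches the threshold, then treat each connected group as one cluster. The sketch below is a minimal union-find illustration in Python. It assumes pairwise similarities have already been computed and are passed in as a dictionary, which is a hypothetical input format and not the pipeline's own implementation.

```python
def connected_components(nodes, similarities, threshold):
    """Group `nodes` into clusters linked by similarities >= `threshold`.

    `similarities` maps (node_a, node_b) pairs to a similarity in [0, 1]
    (hypothetical layout chosen for this example).
    """
    parent = {node: node for node in nodes}

    def find(x):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), similarity in similarities.items():
        if similarity >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for node in nodes:
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())


# Made-up similarities: a-b and b-c pass the threshold, c-d does not.
sims = {("a", "b"): 0.90, ("b", "c"): 0.87, ("c", "d"): 0.40}
clusters = connected_components(["a", "b", "c", "d"], sims, threshold=0.85)
print(clusters)
```

Note that `a` and `c` end up in the same cluster without being directly compared, because both are linked to `b`. Chaining of this kind is a general property of connected-component clustering.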
+
+
+## 6. Coverage filtering
+
+The coverage of the contigs is calculated using the same method as in the [coverage calculation step](#3-coverage-calculation). A cumulative sum is taken across the contigs from every assembler. If any of these cumulative sums is above the specified `--perc_reads_contig` parameter, the contig is kept. If all cumulative sums are below the specified parameter, the contig is removed.
+
+!!! Info annotate "Show me an example of how it works"
+If the `--perc_reads_contig` is set to `5`, the cumulative sum of the contigs from every assembler is calculated. For example:
+
+- Cluster 1: the cumulative sum of the contigs from SPAdes is 0.6, Megahit is 0.5, the cluster is kept.
+- Cluster 2: the cumulative sum of the contigs from SPAdes is 0.1, Megahit is 0.1, the cluster is removed.
+- Cluster 3: the cumulative sum of the contigs from SPAdes is 0.5, Megahit is 0, the cluster is kept.
+
+> The default is `5` and can be specified with the `--perc_reads_contig` parameter.
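The keep-or-remove rule described in this section can be sketched in a few lines of Python. This is a hedged illustration, not the pipeline's code: the function name is hypothetical, the per-assembler values are made-up numbers expressed in the same unit as the threshold, and treating a value equal to the threshold as "kept" is an assumption (chosen so that a threshold of `0` keeps everything).

```python
def keep_cluster(cumulative_by_assembler: dict[str, float], perc_reads_contig: float) -> bool:
    """Keep a cluster if at least one assembler's cumulative sum reaches the threshold.

    Hypothetical helper; ties (value == threshold) are kept by assumption.
    """
    return any(value >= perc_reads_contig for value in cumulative_by_assembler.values())


# Made-up cumulative sums per assembler, in the same unit as the threshold.
clusters = {
    "cluster_1": {"spades": 6.0, "megahit": 5.5},  # both assemblers pass
    "cluster_2": {"spades": 1.0, "megahit": 1.0},  # neither assembler passes
    "cluster_3": {"spades": 5.0, "megahit": 0.0},  # one assembler is enough
}
kept = [name for name, sums in clusters.items() if keep_cluster(sums, perc_reads_contig=5)]
print(kept)
```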
+
-## Scaffolding
+## 7. Scaffolding
-After classifying all contigs and their top BLAST hits into distinct clusters or bins, the contigs are then scaffolded to the centroid of each bin. Any external references that are not centroids of the cluster are subsequently removed to prevent further bias. All members of the cluster are consequently mapped towards their centroid with [Minimap2](https://github.com/lh3/minimap2) and conensus is called using [iVar-consensus](https://andersen-lab.github.io/ivar/html/manualpage.html).
+After classifying all contigs and their top BLAST hits into distinct clusters or bins, the contigs are then scaffolded to the centroid of each bin. Any external references that are not centroids of the cluster are subsequently removed to prevent further bias. All members of the cluster are consequently mapped towards their centroid with [Minimap2](https://github.com/lh3/minimap2) and consensus is called using [iVar-consensus](https://andersen-lab.github.io/ivar/html/manualpage.html).
-## Annotation with Reference
+## 8. Annotation with Reference
Regions with 0-depth coverage are annotated with the reference sequence. This is done with a [custom script](https://github.com/Joon-Klaps/viralgenie/blob/dev/bin/lowcov_to_reference.py) that uses the coverage of the de novo contigs towards the reference sequence to identify regions with 0-depth coverage. The reference sequence is then used to fill in these regions.
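The principle can be illustrated with a short Python sketch. It is not the linked script: the function name is hypothetical, and it assumes the consensus, the reference, and the per-position depth are already in the same coordinate system and of equal length.

```python
def fill_zero_depth(consensus: str, reference: str, depth: list[int]) -> str:
    """Replace every consensus base at a 0-depth position with the reference base.

    Hypothetical helper; assumes all three inputs share one coordinate system.
    """
    if not (len(consensus) == len(reference) == len(depth)):
        raise ValueError("consensus, reference and depth must have the same length")
    return "".join(
        ref_base if d == 0 else cons_base
        for cons_base, ref_base, d in zip(consensus, reference, depth)
    )


# Made-up example: positions 3 and 4 have no coverage and are taken from the reference.
filled = fill_zero_depth("ACNNGT", "ACTTGA", [3, 2, 0, 0, 5, 1])
print(filled)
```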
nextflow_schema.json
@@ -714,13 +714,6 @@
"description": "Skip creating an alignment of each the collapsed clusters and each iterative step",
"fa_icon": "fas fa-forward"
},
-"contig_filter_level": {
-"type": "string",
-"default": "normal",
-"fa_icon": "fas fa-filter",
-"description": "Specify how strict the filtering should be.",
-"help_text": "none - don't filter the report of final contigs (not recomended)\nnormal - display only the consensus of the last iteration that had hits in the annotation set\nstrict - Group results by identified species & segment & lineage (if available) to report only the top hits"