README.md (+1 -1)
@@ -59,7 +59,7 @@
 10. [Optional] Remove clusters with low read coverage. `bin/extract_clusters.py`
 11. Scaffolding of contigs to centroid ([`Minimap2`](https://github.com/lh3/minimap2), [`iVar-consensus`](https://andersen-lab.github.io/ivar/html/manualpage.html))
 12. [Optional] Annotate 0-depth regions with external reference `bin/lowcov_to_reference.py`.
-13. [Optional] Select best reference from `--mapping_constrains`:
+13. [Optional] Select best reference from `--mapping_constraints`:
 - [`Mash sketch`](https://github.com/marbl/Mash)
 - [`Mash screen`](https://github.com/marbl/Mash)
 14. Mapping filtered reads to supercontig and mapping constrains ([`BowTie2`](http://bowtie-bio.sourceforge.net/bowtie2/), [`BWAmem2`](https://github.com/bwa-mem2/bwa-mem2) and [`BWA`](https://github.com/lh3/bwa))
help="Mapping constrains file containing information on the sequences that need to be used for mapping against the samples, supported formats: '.csv', '.tsv', '.yaml', '.yml'",
docs/parameters.md (+1 -1)
@@ -133,7 +133,7 @@ Parameters relating to the analysis of variants associated to contigs and scaffo
 |-----------|-----------|-----------|
 |`skip_variant_calling`| Skip the analysis of variants for the external reference or contigs ||
 |`mapper`| Define which mapping tool needs to be used when mapping reads to reference | bwamem2 |
-|`mapping_constrains`| Sequence to use as a mapping reference instead of the de novo contigs or scaffolds ||
+|`mapping_constraints`| Sequence to use as a mapping reference instead of the de novo contigs or scaffolds ||
 |`deduplicate`| deduplicate the reads <details><summary>Help</summary><small>If used with umi's, `umi tools` will be used to group and call consensus of each indiual read group. If not used with umi's use `PicardsMarkDuplicates`. </small></details>| True |
 |`variant_caller`| Define the variant caller to use: 'ivar' or 'bcftools' | ivar |
 |`consensus_caller`| consensus tool used for calling new consensus in final iteration | ivar |
-> The overal workflow of creating reference assisted assemblies can be skipped with the argument `--skip_assembly`. See the [parameters assembly section](../parameters.md#assembly) for all relevant arguments to control the assembly steps.
+> The overall workflow of creating reference assisted assemblies can be skipped with the argument `--skip_assembly`. See the [parameters assembly section](../parameters.md#assembly) for all relevant arguments to control the assembly steps.

 > The overall refinement of contigs can be skipped with the argument `--skip_polishing`. See the [parameters polishing section](../parameters.md#polishing) for all relevant arguments to control the polishing steps.

-The consensus genome of all clusters are then send to the [variant analysis & iterative refinement](variant_and_refinement.md) step.
+The consensus genome of all clusters are then sent to the [variant analysis & iterative refinement](variant_and_refinement.md) step.

 ## 1. De-novo Assembly

@@ -34,7 +33,7 @@ Low complexity contigs can be filtered out using prinseq++ with the `--skip_cont

 Contigs can be extended using [SSPACE Basic](https://github.com/nsoranzo/sspace_basic) with the `--skip_sspace_basic false` parameter. SSPACE is a tool for scaffolding contigs using paired-end reads. It is modified from SSAKE assembler and has the feature of extending contigs using reads that are unmappable in the contig assembly step. To maximize its efficiency, consider specifying the arguments `--read_distance`, `--read_distance_sd`, and `--read_orientation`. For more information on these arguments, see the [parameters assembly section](../parameters.md#assembly).

-> The extension of contigs is ran by default, to skip this step, use `--skip_sspace_basic`.
+> The extension of contigs is run by default, to skip this step, use `--skip_sspace_basic`.

 ## 3. Coverage calculation

@@ -44,24 +43,23 @@ Processed reads are mapped back against the contigs to determine the number of r

 ## 4. Reference Matching

-The newly assembled contigs are compared to a reference sequence pool (--reference_pool) using a [BLASTn search](https://www.ncbi.nlm.nih.gov/books/NBK153387/). This process not only helps annotate the contigs but also assists in linking together sets of contigs that are distant within a single genome. Essentially, it aids in identifying contigs belonging to the same genomic segment and choosing the right reference for scaffolding purposes.
+The newly assembled contigs are compared to a reference sequence pool (`--reference_pool`) using a [BLASTn search](https://www.ncbi.nlm.nih.gov/books/NBK153387/). This process not only helps annotate the contigs but also assists in linking together sets of contigs that are distant within a single genome. Essentially, it aids in identifying contigs belonging to the same genomic segment and choosing the right reference for scaffolding purposes.

-The top 5 hits for each contig are combined with the denovo contigs and send to the clustering step.
+The top 5 hits for each contig are combined with the de novo contigs and sent to the clustering step.

 > The reference pool can be specified with the `--reference_pool` parameter. The default is the latest clustered [Reference Viral DataBase (RVDB)](https://rvdb.dbi.udel.edu/).

 ## 5. Taxonomy guided Clustering

-The clustering workflow of contigs consists out of 2 steps, the [pre-clustering using taxonomy](#51-pre-clustering-using-taxonomy) and
-[actual clustering on nucleotide similarity](#52-actual-clustering-on-nucloetide-similarity). The taxonomy guided clustering is used to separate contigs based on taxonomy and nucleotide similarity.
+The clustering workflow of contigs consists of 2 steps, the [pre-clustering using taxonomy](#51-pre-clustering-using-taxonomy) and
+[actual clustering on nucleotide similarity](#52-actual-clustering-on-nucleotide-similarity). The taxonomy guided clustering is used to separate contigs based on taxonomy and nucleotide similarity.

 ```mermaid
 graph LR;
 A[Contigs] --> B["`**Pre-clustering**`"];
 B --> C["`**Actual clustering**`"];
 ```

-
 ### 5.1 Pre-clustering using taxonomy

 The contigs along with their references have their taxonomy assigned using [Kraken2](https://ccb.jhu.edu/software/kraken2/) and [Kaiju](https://kaiju.binf.ku.dk/).
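
The reference-matching step in the hunk above keeps the top 5 hits for each contig. A minimal sketch of that selection follows. It assumes a simplified three-column table (query, subject, bitscore) and ranks hits by bitscore; the column layout, the ranking criterion, the sample data, and the `top_hits` helper are all hypothetical and not taken from the pipeline.

```python
import csv
import io
from collections import defaultdict

# Hypothetical, simplified BLAST-like table: query, subject, bitscore.
BLAST_TSV = """contig_1\tref_C\t700
contig_1\tref_A\t950
contig_1\tref_F\t100
contig_1\tref_B\t900
contig_1\tref_E\t500
contig_1\tref_D\t600
contig_2\tref_A\t300
contig_2\tref_G\t800
"""


def top_hits(rows, n=5):
    """Return the n best-scoring subjects per query (assumed ranking: bitscore)."""
    hits = defaultdict(list)
    for row in rows:
        if not row:  # skip empty lines
            continue
        query, subject, bitscore = row
        hits[query].append((float(bitscore), subject))
    return {
        query: [
            subject
            for _, subject in sorted(scored, key=lambda h: h[0], reverse=True)[:n]
        ]
        for query, scored in hits.items()
    }


rows = list(csv.reader(io.StringIO(BLAST_TSV), delimiter="\t"))
result = top_hits(rows)
print(result["contig_1"])  # ['ref_A', 'ref_B', 'ref_C', 'ref_D', 'ref_E']
print(result["contig_2"])  # ['ref_G', 'ref_A']
```

The selected references would then be pooled with the de novo contigs before clustering, as the documentation describes.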
@@ -70,7 +68,7 @@ The contigs along with their references have their taxonomy assigned using [Krak
-As Kajiu and Kraken2 can have different taxonomic assignments, an additional step is performed to resolve potential inconsistencies in taxonomy and to identify the taxonomy of the contigs. This is done with a custom script that is based on `KrakenTools extract_kraken_reads.py` and `kaiju-Merge-Outputs`.
+As Kaiju and Kraken2 can have different taxonomic assignments, an additional step is performed to resolve potential inconsistencies in taxonomy and to identify the taxonomy of the contigs. This is done with a custom script that is based on `KrakenTools extract_kraken_reads.py` and `kaiju-Merge-Outputs`.

 ```mermaid
 graph LR;
@@ -94,7 +92,7 @@ graph LR;

 1. Options here are 'species', 'genus', 'family', 'order', 'class', 'phylum', 'kingdom' or 'superkingdom'.

-2. `--precluster_include_childeren`__"genus1"__ :
+2. `--precluster_include_children`__"genus1"__ :

 ```mermaid
 graph TD;
@@ -118,9 +116,9 @@ graph LR;
 ```
 Dotted lines represent exclusion of taxa.

-> The pre-clustering step will be run by default but can be skipped with the argument `--skip_preclustering`. Specify which classifier to use with `--precluster_classifiers` parameter. The default is `kaiju,kraken2`. Contig taxon filtering is still enabled despite not having to solve for inconsistencies if only Kaiju or Kraken2 is ran.
+> The pre-clustering step will be run by default but can be skipped with the argument `--skip_preclustering`. Specify which classifier to use with `--precluster_classifiers` parameter. The default is `kaiju,kraken2`. Contig taxon filtering is still enabled despite not having to solve for inconsistencies if only Kaiju or Kraken2 is run.

-### 5.2 Actual clustering on nucloetide similarity
+### 5.2 Actual clustering on nucleotide similarity

 The clustering is performed with one of the following tools:

@@ -131,7 +129,6 @@ The clustering is performed with one of the following tools:
 These methods all come with their own advantages and disadvantages. For example, cdhitest is very fast but cannot be used for large viruses >10Mb and similarity threshold cannot go below 80% which is not preferable for highly diverse RNA viruses. Vsearch is slower but accurate. Mmseqs-linclust is the fastest but tends to create a large amount of bins. Mmseqs-cluster is slower but can handle larger datasets and is more accurate. vRhyme is a new method that is still under development but has shown promising results but can sometimes not output any bins when segments are small. Mash is a very fast comparison method is linked with a custom script that identifies communities within a network.

 !!! Tip
@@ -143,28 +140,25 @@ These methods all come with their own advantages and disadvantages. For example,

 > The similarity threshold can be specified with the `--similarity_threshold` parameter. The default is `0.85`.

-
 ## 6. Coverage filtering

-The coverage of the contigs is calculated using the same method as in the [coverage calculation step](#3-coverage-calculation). A cumulative sum is taken across the contigs from every assembler. If these cumulative sums is above the specified `--perc_reads_contig` parameter, the contig is kept. If all cumulative sums is below the specified parameter, the contig is removed.
+The coverage of the contigs is calculated using the same method as in the [coverage calculation step](#3-coverage-calculation). A cumulative sum is taken across the contigs from every assembler. If these cumulative sums are above the specified `--perc_reads_contig` parameter, the contig is kept. If all cumulative sums are below the specified parameter, the contig is removed.

 !!! Info annotate "Show me an example how it works"
 If the `--perc_reads_contig` is set to `5`, the cumulative sum of the contigs from every assembler is calculated. For example:

-- Cluster 1: the cumulative sum of the contigs from SPAdes is 0.6, Megahit is 0.5, the cluster is kept.
+- Cluster 1: the cumulative sum of the contigs from SPAdes is 0.6, Megahit is 0.5, the cluster is kept.
 - Cluster 2: the cumulative sum of the contigs from SPAdes is 0.1, Megahit is 0.1, the cluster is removed.
 - Cluster 3: the cumulative sum of the contigs from SPAdes is 0.5, Megahit is 0, the cluster is kept.

 > The default is `5` and can be specified with the `--perc_reads_contig` parameter.

-
 ## 7. Scaffolding

 After classifying all contigs and their top BLAST hits into distinct clusters or bins, the contigs are then scaffolded to the centroid of each bin. Any external references that are not centroids of the cluster are subsequently removed to prevent further bias. All members of the cluster are consequently mapped towards their centroid with [Minimap2](https://github.com/lh3/minimap2) and consensus is called using [iVar-consensus](https://andersen-lab.github.io/ivar/html/manualpage.html).

-
 ## 8. Annotation with Reference

-Regions with 0-depth coverage are annotated with the reference sequence. This is done with a [custom script](https://github.com/Joon-Klaps/viralgenie/blob/dev/bin/lowcov_to_reference.py) that uses the coverage of the denovo contigs towards the reference sequence to identify regions with 0-depth coverage. The reference sequence is then annotated to these regions.
+Regions with 0-depth coverage are annotated with the reference sequence. This is done with a [custom script](https://github.com/Joon-Klaps/viralgenie/blob/dev/bin/lowcov_to_reference.py) that uses the coverage of the de novo contigs towards the reference sequence to identify regions with 0-depth coverage. The reference sequence is then annotated to these regions.

 > This step can be skipped using `--skip_hybrid_consensus` parameter.
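
The keep/remove rule described under "6. Coverage filtering" in the hunk above can be sketched as follows. This is a minimal illustration, not the pipeline's implementation: it assumes the threshold and the per-assembler cumulative sums share the same unit, treats "above" as a strict comparison, and uses made-up values and a hypothetical `keep_cluster` helper rather than the numbers from the documentation's example.

```python
from typing import Dict


def keep_cluster(cumulative_by_assembler: Dict[str, float], threshold: float) -> bool:
    """Keep a cluster if at least one assembler's cumulative sum exceeds the threshold.

    Assumption: a cluster is removed only when every assembler's sum is at or
    below the threshold (the source does not say how ties are handled).
    """
    return any(value > threshold for value in cumulative_by_assembler.values())


# Hypothetical cumulative sums per assembler, in the same unit as the threshold.
clusters = {
    "cluster_1": {"spades": 12.0, "megahit": 9.5},  # both above -> kept
    "cluster_2": {"spades": 1.0, "megahit": 2.0},   # all below -> removed
    "cluster_3": {"spades": 7.5, "megahit": 0.0},   # one above -> kept
}

threshold = 5.0  # stands in for the value passed via --perc_reads_contig
kept = [name for name, sums in clusters.items() if keep_cluster(sums, threshold)]
print(kept)  # ['cluster_1', 'cluster_3']
```

Using `any` mirrors the documented behaviour that a single assembler with enough support is sufficient to retain the cluster.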