Commit e2981a5

add numbers on assembly & polishing workflow
1 parent 26e6b9d commit e2981a5

9 files changed, +101 -71 lines changed

.github/workflows/build-docs.yml

+2-2
@@ -40,11 +40,11 @@ jobs:
 - name: Install Nextflow
 uses: nf-core/setup-nextflow@v2
 - name: Build parameter docs
-run: nf-core schema docs --format markdown --columns parameter,description,default --output docs/parameters.md --force
+run: nf-core pipelines schema docs --format markdown --columns parameter,description,default --output docs/parameters.md --force
 - name: Read parameter tip
 id: read_tip
 run: |
-tip=$(cat docs/parameter_tip.md)
+tip=$(cat docs/template/parameter_tip.md)
 echo "tip_content=$tip" >> $GITHUB_ENV
 - name: Append content to parameters.md
 run: |

bin/custom_multiqc.py

-8
@@ -121,14 +121,6 @@ def file_choices(choices, fname):
 help="Checkv summary files for each sample",
 type=Path,
 )
-parser.add_argument(
-"--filter_level",
-metavar="FILTER LEVEL",
-choices=["normal", "strict", "none"],
-default="normal",
-type=str,
-help="Specify how strict the filtering should be, default is normal.",
-)

 parser.add_argument(
 "--clusters_files",

conf/modules.config

-1
@@ -1668,7 +1668,6 @@ process {
 withName: CUSTOM_MULTIQC {
 ext.args = [
 params.save_intermediate_polishing ? "--save_intermediate" : '',
-params.contig_filter_level ? "--filter_level ${params.contig_filter_level}" : '',
 ].join(' ').trim()
 publishDir = [
 [

docs/images/assembly_polishing.png

13.5 KB

docs/parameters.md

+38-32
Large diffs are not rendered by default.

docs/template/parameter_tip.md

+1-1
@@ -1,5 +1,5 @@
 !!! Tip "Need Something more interactive?"
 Use [`nf-core launch`](https://nf-co.re/tools#launch-a-pipeline) to interactivly set your parameters:
 ```console
-nf-core launch Joon-Klaps/viralgenie
+nf-core pipelines launch Joon-Klaps/viralgenie
 ```

docs/workflow/assembly_polishing.md

+60-19
@@ -3,11 +3,15 @@

 Viralgenie offers an elaborate workflow for the assembly and polishing of viral genomes:

-- [Assembly](#de-novo-assembly): combining the results of multiple assemblers.
-- [Reference Matching](#reference-matching): comparing the newly assembled contigs to a reference sequence pool.
-- [Clustering](#clustering): clustering the contigs based on taxonomy and similarity.
-- [Scaffolding](#scaffolding): scaffolding the contigs to the centroid of each bin.
-- [Annotation with Reference](#annotation-with-reference): annotating regions with 0-depth coverage with the reference sequence.
+1. [Assembly](#1-de-novo-assembly): combining the results of multiple assemblers.
+1. [Extension](#2-extension): extending contigs using paired-end reads.
+1. [Coverage calculation](#3-coverage-calculation): mapping reads back to the contigs to determine coverage.
+1. [Reference Matching](#4-reference-matching): comparing contigs to a reference sequence pool.
+1. [Taxonomy guided Clustering](#5-taxonomy-guided-clustering): clustering contigs based on taxonomy and nucleotide similarity.
+- [Pre-clustering](#51-pre-clustering-using-taxonomy): separating contigs based on identified taxonomy-id.
+- [Actual clustering](#52-actual-clustering-on-nucloetide-similarity): clustering contigs based on nucleotide similarity.
+1. [Scaffolding](#scaffolding): scaffolding the contigs to the centroid of each bin.
+1. [Annotation with Reference](#annotation-with-reference): annotating regions with 0-depth coverage with the reference sequence.

 ![assembly_polishing](../images/assembly_polishing.png)

@@ -17,28 +21,48 @@ Viralgenie offers an elaborate workflow for the assembly and polishing of viral

 The consensus genome of all clusters are then send to the [variant analysis & iterative refinement](variant_and_refinement.md) step.

-## De-novo Assembly
+## 1. De-novo Assembly
+
 Three assemblers are used, [SPAdes](http://cab.spbu.ru/software/spades/), [Megahit](https://github.com/voutcn/megahit), and [Trinity](https://github.com/trinityrnaseq/trinityrnaseq). The resulting contigs of all specified assemblers, are combined and processed further together.
 > Modify the spades mode with `--spades_mode [default: rnaviral]` and supply specific params with `--spades_yml` or a hmm model with `--spades_hmm`.

 > Specify the assemblers to use with the `--assemblers` parameter where the assemblers are separated with a ','. The default is `spades,megahit,trinity`.

-Contigs can be extended using [SSPACE Basic](https://github.com/nsoranzo/sspace_basic) with the `--skip_sspace_basic false` parameter. SSPACE is a tool for scaffolding contigs using paired-end reads. It is modified from SSAKE assembler and has the feature of extending contigs using reads that are unmappable in the contig assembly step.
-
 Low complexity contigs can be filtered out using prinseq++ with the `--skip_contig_prinseq false` parameter. Complexity filtering is primarily a run-time optimisation step. Low-complexity sequences are defined as having commonly found stretches of nucleotides with limited information content (e.g. the dinucleotide repeat CACACACACA). Such sequences can produce a large number of high-scoring but biologically insignificant results in database searches. Removing these reads therefore saves computational time and resources.

-## Reference Matching
+## 2. Extension
+
+Contigs can be extended using [SSPACE Basic](https://github.com/nsoranzo/sspace_basic) with the `--skip_sspace_basic false` parameter. SSPACE is a tool for scaffolding contigs using paired-end reads. It is modified from SSAKE assembler and has the feature of extending contigs using reads that are unmappable in the contig assembly step. To maximize its efficiency, consider specifying the arguments `--read_distance`, `--read_distance_sd`, and `--read_orientation`. For more information on these arguments, see the [parameters assembly section](../parameters.md#assembly).
+
+> The extension of contigs is ran by default, to skip this step, use `--skip_sspace_basic`.
+
+## 3. Coverage calculation
+
+Processed reads are mapped back against the contigs to determine the number of reads mapping towards each contig. This is done with [`BowTie2`](http://bowtie-bio.sourceforge.net/bowtie2/),[`BWAmem2`](https://github.com/bwa-mem2/bwa-mem2) or [`BWA`](https://github.com/lh3/bwa). This step is used to remove contig clusters that have little to no coverage downstream.
+
+> Specify the mapper to use with the `--mapper` parameter. The default is [`BWAmem2`](https://github.com/bwa-mem2/bwa-mem2). To skip contig filtering specify `--perc_reads_contig 0`.
+
+## 4. Reference Matching
+
 The newly assembled contigs are compared to a reference sequence pool (--reference_pool) using a [BLASTn search](https://www.ncbi.nlm.nih.gov/books/NBK153387/). This process not only helps annotate the contigs but also assists in linking together sets of contigs that are distant within a single genome. Essentially, it aids in identifying contigs belonging to the same genomic segment and choosing the right reference for scaffolding purposes.

 The top 5 hits for each contig are combined with the denovo contigs and send to the clustering step.

 > The reference pool can be specified with the `--reference_pool` parameter. The default is the latest clustered [Reference Viral DataBase (RVDB)](https://rvdb.dbi.udel.edu/).

-## Clustering
+## 5. Taxonomy guided Clustering

-The clustering workflow of contigs consists out of 2 steps, the [pre-clustering](#pre-clustering) and [actual clustering](#actual-clustering). Here contigs are first separated based on identified taxonomy-id ([Kraken2](https://ccb.jhu.edu/software/kraken2/), [Kaiju](https://kaiju.binf.ku.dk/)) and are subsequently clustered further to identify genome segments.
+The clustering workflow of contigs consists out of 2 steps, the [pre-clustering using taxonomy](#51-pre-clustering-using-taxonomy) and
+[actual clustering on nucleotide similarity](#52-actual-clustering-on-nucloetide-similarity). The taxonomy guided clustering is used to separate contigs based on taxonomy and nucleotide similarity.

-### Pre-clustering
+```mermaid
+graph LR;
+A[Contigs] --> B["`**Pre-clustering**`"];
+B --> C["`**Actual clustering**`"];
+```
+
+
+### 5.1 Pre-clustering using taxonomy

 The contigs along with their references have their taxonomy assigned using [Kraken2](https://ccb.jhu.edu/software/kraken2/) and [Kaiju](https://kaiju.binf.ku.dk/).

@@ -58,7 +82,7 @@ graph LR;
 E --> F["Taxon simplification"];
 ```

-!!! Tip annotate "Having very complex metagenomes"
+!!! Tip annotate "Having complex metagenomic samples?"
 The pre-clustering step can be used to simplify the taxonomy of the contigs, let [NCBI's taxonomy browser](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi) help you identify taxon-id's for simplification. The simplification can be done in several ways:

 - Make sure your contamination database is up to date and removes the relevant taxa.
@@ -80,6 +104,7 @@ graph LR;
 B --- E[species2];
 C -.- F[species3];
 ```
+Dotted lines represent exclusion of taxa.

 3. `--precluster_include_parents` __"species3"__ :

@@ -91,10 +116,11 @@ graph LR;
 B -.- E[species2]
 C --- F[species3]
 ```
+Dotted lines represent exclusion of taxa.

 > The pre-clustering step will be run by default but can be skipped with the argument `--skip_preclustering`. Specify which classifier to use with `--precluster_classifiers` parameter. The default is `kaiju,kraken2`. Contig taxon filtering is still enabled despite not having to solve for inconsistencies if only Kaiju or Kraken2 is ran.

-### Actual clustering
+### 5.2 Actual clustering on nucloetide similarity

 The clustering is performed with one of the following tools:

@@ -111,18 +137,33 @@ These methods all come with their own advantages and disadvantages. For example,
 !!! Tip
 When pre-clustering is performed, it is recommended to set a lower identity_threshold (60-70% ANI) as the new goal becomes to separate genome segments within the same bin.

-> The clustering method can be specified with the `--clustering_method` parameter. The default is `mash`.
+> The clustering method can be specified with the `--clustering_method` parameter. The default is `cdhitest`.

 > The network clustering method for `mash` can be specified with the `--network_clustering` parameter. The default is `connected_components`, alternative is [`leiden`](https://www.nature.com/articles/s41598-019-41695-z).

-> The similarity threshold can be specified with the `--similarity_threshold` parameter. The default is `0.6`. However, for cdhit the default is `0.8` which is its minimum value.
+> The similarity threshold can be specified with the `--similarity_threshold` parameter. The default is `0.85`.
+
+
+## 6. Coverage filtering
+
+The coverage of the contigs is calculated using the same method as in the [coverage calculation step](#3-coverage-calculation). A cumulative sum is taken across the contigs from every assembler. If these cumulative sums is above the specified `--perc_reads_contig` parameter, the contig is kept. If all cumulative sums is below the specified parameter, the contig is removed.
+
+!!! Info annotate "Show me an example how it works"
+If the `--perc_reads_contig` is set to `5`, the cumulative sum of the contigs from every assembler is calculated. For example:
+
+- Cluster 1: the cumulative sum of the contigs from SPAdes is 0.6, Megahit is 0.5, the cluster is kept.
+- Cluster 2: the cumulative sum of the contigs from SPAdes is 0.1, Megahit is 0.1, the cluster is removed.
+- Cluster 3: the cumulative sum of the contigs from SPAdes is 0.5, Megahit is 0, the cluster is kept.
+
+> The default is `5` and can be specified with the `--perc_reads_contig` parameter.
+

-## Scaffolding
+## 7. Scaffolding

-After classifying all contigs and their top BLAST hits into distinct clusters or bins, the contigs are then scaffolded to the centroid of each bin. Any external references that are not centroids of the cluster are subsequently removed to prevent further bias. All members of the cluster are consequently mapped towards their centroid with [Minimap2](https://github.com/lh3/minimap2) and conensus is called using [iVar-consensus](https://andersen-lab.github.io/ivar/html/manualpage.html).
+After classifying all contigs and their top BLAST hits into distinct clusters or bins, the contigs are then scaffolded to the centroid of each bin. Any external references that are not centroids of the cluster are subsequently removed to prevent further bias. All members of the cluster are consequently mapped towards their centroid with [Minimap2](https://github.com/lh3/minimap2) and consensus is called using [iVar-consensus](https://andersen-lab.github.io/ivar/html/manualpage.html).


-## Annotation with Reference
+## 8. Annotation with Reference

 Regions with 0-depth coverage are annotated with the reference sequence. This is done with a [custom script](https://github.com/Joon-Klaps/viralgenie/blob/dev/bin/lowcov_to_reference.py) that uses the coverage of the denovo contigs towards the reference sequence to identify regions with 0-depth coverage. The reference sequence is then annotated to these regions.
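
Note on the coverage filtering rule added under "## 6. Coverage filtering" in this file: the keep/remove decision it describes can be sketched in Python, the language of the repository's `bin/` scripts. This is only a minimal sketch, not the pipeline's implementation. The function name `keep_cluster`, the dictionary layout, and the sample values are hypothetical. The sums are assumed to be percentages on the same scale as `--perc_reads_contig`, and treating a value equal to the threshold as passing is also an assumption (it makes a threshold of `0` keep everything, in line with the documented way to skip filtering).

```python
from typing import Dict


def keep_cluster(cumulative_pct: Dict[str, float], perc_reads_contig: float = 5.0) -> bool:
    """Keep a cluster if at least one assembler's cumulative sum reaches the threshold.

    Assumption: values are percentages of reads on the same scale as
    --perc_reads_contig. Names are hypothetical, not pipeline code.
    """
    return any(pct >= perc_reads_contig for pct in cumulative_pct.values())


# Hypothetical per-assembler cumulative read percentages for three clusters.
clusters = {
    "cluster_1": {"spades": 60.0, "megahit": 50.0},  # both reach 5 -> kept
    "cluster_2": {"spades": 1.0, "megahit": 1.0},    # all below 5 -> removed
    "cluster_3": {"spades": 50.0, "megahit": 0.0},   # one reaches 5 -> kept
}

kept = {name: keep_cluster(sums) for name, sums in clusters.items()}
print(kept)
```

A cluster survives as soon as a single assembler contributes enough reads, which is why the third cluster is kept even though one assembler contributes nothing.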

nextflow.config

-1
@@ -150,7 +150,6 @@ params {
 skip_alignment_qc = false
 annotation_db = "ftp://ftp.expasy.org/databases/viralzone/2020_4/virosaurus90_vertebrate-20200330.fas.gz"
 skip_annotation = false
-contig_filter_level = "normal"
 mmseqs_searchtype = 4

 // Global

nextflow_schema.json

-7
@@ -714,13 +714,6 @@
 "description": "Skip creating an alignment of each the collapsed clusters and each iterative step",
 "fa_icon": "fas fa-forward"
 },
-"contig_filter_level": {
-"type": "string",
-"default": "normal",
-"fa_icon": "fas fa-filter",
-"description": "Specify how strict the filtering should be.",
-"help_text": "none - don't filter the report of final contigs (not recomended)\nnormal - display only the consensus of the last iteration that had hits in the annotation set\nstrict - Group results by identified species & segment & lineage (if available) to report only the top hits"
-},
 "mmseqs_searchtype": {
 "type": "integer",
 "default": 4,
