add numbers on assembly & polishing workflow

Joon-Klaps · Joon-Klaps · commit e2981a5fa6ef · 2025-01-06T16:46:42.000Z
diff --git a/.github/workflows/build-docs.yml b/.github/workflows/build-docs.yml
@@ -40,11 +40,11 @@ jobs:
       - name: Install Nextflow
         uses: nf-core/setup-nextflow@v2
       - name: Build parameter docs
-        run: nf-core schema docs --format markdown --columns parameter,description,default --output docs/parameters.md --force
+        run: nf-core pipelines schema docs --format markdown --columns parameter,description,default --output docs/parameters.md --force
       - name: Read parameter tip
         id: read_tip
         run: |
-          tip=$(cat docs/parameter_tip.md)
+          tip=$(cat docs/template/parameter_tip.md)
           echo "tip_content=$tip" >> $GITHUB_ENV
       - name: Append content to parameters.md
         run: |
diff --git a/bin/custom_multiqc.py b/bin/custom_multiqc.py
@@ -121,14 +121,6 @@ def file_choices(choices, fname):
         help="Checkv summary files for each sample",
         type=Path,
     )
-    parser.add_argument(
-        "--filter_level",
-        metavar="FILTER LEVEL",
-        choices=["normal", "strict", "none"],
-        default="normal",
-        type=str,
-        help="Specify how strict the filtering should be, default is normal.",
-    )
 
     parser.add_argument(
         "--clusters_files",
diff --git a/conf/modules.config b/conf/modules.config
@@ -1668,7 +1668,6 @@ process {
     withName: CUSTOM_MULTIQC {
         ext.args   = [
                 params.save_intermediate_polishing ? "--save_intermediate" : '',
-                params.contig_filter_level ? "--filter_level ${params.contig_filter_level}" : '',
             ].join(' ').trim()
         publishDir = [
             [
diff --git a/docs/images/assembly_polishing.png b/docs/images/assembly_polishing.png
diff --git a/docs/parameters.md b/docs/parameters.md
diff --git a/docs/template/parameter_tip.md b/docs/template/parameter_tip.md
@@ -1,5 +1,5 @@
 !!! Tip "Need Something more interactive?"
     Use [`nf-core launch`](https://nf-co.re/tools#launch-a-pipeline) to interactively set your parameters:
     ```console
-    nf-core launch Joon-Klaps/viralgenie
+    nf-core pipelines launch Joon-Klaps/viralgenie
     ```
diff --git a/docs/workflow/assembly_polishing.md b/docs/workflow/assembly_polishing.md
@@ -3,11 +3,15 @@
 
 Viralgenie offers an elaborate workflow for the assembly and polishing of viral genomes:
 
-- [Assembly](#de-novo-assembly): combining the results of multiple assemblers.
-- [Reference Matching](#reference-matching): comparing the newly assembled contigs to a reference sequence pool.
-- [Clustering](#clustering): clustering the contigs based on taxonomy and similarity.
-- [Scaffolding](#scaffolding): scaffolding the contigs to the centroid of each bin.
-- [Annotation with Reference](#annotation-with-reference): annotating regions with 0-depth coverage with the reference sequence.
+1. [Assembly](#1-de-novo-assembly): combining the results of multiple assemblers.
+1. [Extension](#2-extension): extending contigs using paired-end reads.
+1. [Coverage calculation](#3-coverage-calculation): mapping reads back to the contigs to determine coverage.
+1. [Reference Matching](#4-reference-matching): comparing contigs to a reference sequence pool.
+1. [Taxonomy guided Clustering](#5-taxonomy-guided-clustering): clustering contigs based on taxonomy and nucleotide similarity.
+    - [Pre-clustering](#51-pre-clustering-using-taxonomy): separating contigs based on identified taxonomy-id.
+    - [Actual clustering](#52-actual-clustering-on-nucloetide-similarity): clustering contigs based on nucleotide similarity.
+1. [Scaffolding](#7-scaffolding): scaffolding the contigs to the centroid of each bin.
+1. [Annotation with Reference](#8-annotation-with-reference): annotating regions with 0-depth coverage with the reference sequence.
 
 ![assembly_polishing](../images/assembly_polishing.png)
 
@@ -17,28 +21,48 @@ Viralgenie offers an elaborate workflow for the assembly and polishing of viral
 
 The consensus genomes of all clusters are then sent to the [variant analysis & iterative refinement](variant_and_refinement.md) step.
 
-## De-novo Assembly
+## 1. De-novo Assembly
+
 Three assemblers are used, [SPAdes](http://cab.spbu.ru/software/spades/), [Megahit](https://github.com/voutcn/megahit), and [Trinity](https://github.com/trinityrnaseq/trinityrnaseq). The resulting contigs of all specified assemblers are combined and processed further together.
 > Modify the spades mode with `--spades_mode [default: rnaviral]` and supply specific params with `--spades_yml` or a hmm model with `--spades_hmm`.
 
 > Specify the assemblers to use with the `--assemblers` parameter where the assemblers are separated with a ','. The default is `spades,megahit,trinity`.
 
-Contigs can be extended using [SSPACE Basic](https://github.com/nsoranzo/sspace_basic) with the `--skip_sspace_basic false` parameter. SSPACE is a tool for scaffolding contigs using paired-end reads. It is modified from SSAKE assembler and has the feature of extending contigs using reads that are unmappable in the contig assembly step.
-
 Low complexity contigs can be filtered out using prinseq++ with the `--skip_contig_prinseq false` parameter. Complexity filtering is primarily a run-time optimisation step. Low-complexity sequences are defined as having commonly found stretches of nucleotides with limited information content (e.g. the dinucleotide repeat CACACACACA). Such sequences can produce a large number of high-scoring but biologically insignificant results in database searches. Removing these reads therefore saves computational time and resources.
 
-## Reference Matching
+## 2. Extension
+
+Contigs can be extended using [SSPACE Basic](https://github.com/nsoranzo/sspace_basic) with the `--skip_sspace_basic false` parameter. SSPACE is a tool for scaffolding contigs using paired-end reads. It is modified from the SSAKE assembler and can extend contigs using reads that were unmappable in the contig assembly step. To maximize its efficiency, consider specifying the arguments `--read_distance`, `--read_distance_sd`, and `--read_orientation`. For more information on these arguments, see the [parameters assembly section](../parameters.md#assembly).
+
+> The extension of contigs is run by default; to skip this step, use `--skip_sspace_basic`.
+
+## 3. Coverage calculation
+
+Processed reads are mapped back against the contigs to determine the number of reads mapping to each contig. This is done with [`BowTie2`](http://bowtie-bio.sourceforge.net/bowtie2/), [`BWAmem2`](https://github.com/bwa-mem2/bwa-mem2) or [`BWA`](https://github.com/lh3/bwa). This step is used to remove contig clusters that have little to no coverage downstream.
+
+> Specify the mapper to use with the `--mapper` parameter. The default is [`BWAmem2`](https://github.com/bwa-mem2/bwa-mem2). To skip contig filtering specify `--perc_reads_contig 0`.
+
+## 4. Reference Matching
+
 The newly assembled contigs are compared to a reference sequence pool (`--reference_pool`) using a [BLASTn search](https://www.ncbi.nlm.nih.gov/books/NBK153387/). This process not only helps annotate the contigs but also assists in linking together sets of contigs that are distant within a single genome. Essentially, it aids in identifying contigs belonging to the same genomic segment and choosing the right reference for scaffolding purposes.
 
 The top 5 hits for each contig are combined with the de novo contigs and sent to the clustering step.
 
 > The reference pool can be specified with the `--reference_pool` parameter. The default is the latest clustered [Reference Viral DataBase (RVDB)](https://rvdb.dbi.udel.edu/).
 
-## Clustering
+## 5. Taxonomy guided Clustering
 
-The clustering workflow of contigs consists out of 2 steps, the [pre-clustering](#pre-clustering) and [actual clustering](#actual-clustering). Here contigs are first separated based on identified taxonomy-id ([Kraken2](https://ccb.jhu.edu/software/kraken2/), [Kaiju](https://kaiju.binf.ku.dk/)) and are subsequently clustered further to identify genome segments.
+The clustering workflow of contigs consists of two steps, the [pre-clustering using taxonomy](#51-pre-clustering-using-taxonomy) and
+[actual clustering on nucleotide similarity](#52-actual-clustering-on-nucloetide-similarity). The taxonomy guided clustering is used to separate contigs based on taxonomy and nucleotide similarity.
 
-### Pre-clustering
+```mermaid
+graph LR;
+    A[Contigs] --> B["`**Pre-clustering**`"];
+    B --> C["`**Actual clustering**`"];
+```
+
+
+### 5.1 Pre-clustering using taxonomy
 
 The contigs along with their references have their taxonomy assigned using [Kraken2](https://ccb.jhu.edu/software/kraken2/) and [Kaiju](https://kaiju.binf.ku.dk/).
 
@@ -58,7 +82,7 @@ graph LR;
     E --> F["Taxon simplification"];
 ```
 
-!!! Tip annotate "Having very complex metagenomes"
+!!! Tip annotate "Having complex metagenomic samples?"
     The pre-clustering step can be used to simplify the taxonomy of the contigs. [NCBI's taxonomy browser](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi) can help you identify taxon IDs for simplification. The simplification can be done in several ways:
 
     - Make sure your contamination database is up to date and removes the relevant taxa.
@@ -80,6 +104,7 @@ graph LR;
         B --- E[species2];
         C -.- F[species3];
     ```
+    Dotted lines represent exclusion of taxa.
 
 3. `--precluster_include_parents` __"species3"__ :
 
@@ -91,10 +116,11 @@ graph LR;
         B -.- E[species2]
         C --- F[species3]
     ```
+    Dotted lines represent exclusion of taxa.
 
 > The pre-clustering step will be run by default but can be skipped with the argument `--skip_preclustering`. Specify which classifier to use with the `--precluster_classifiers` parameter. The default is `kaiju,kraken2`. Contig taxon filtering is still enabled despite not having to solve for inconsistencies if only Kaiju or Kraken2 is run.
 
-### Actual clustering
+### 5.2 Actual clustering on nucloetide similarity
 
 The clustering is performed with one of the following tools:
 
@@ -111,18 +137,33 @@ These methods all come with their own advantages and disadvantages. For example,
 !!! Tip
     When pre-clustering is performed, it is recommended to set a lower identity_threshold (60-70% ANI) as the new goal becomes to separate genome segments within the same bin.
 
-> The clustering method can be specified with the `--clustering_method` parameter. The default is `mash`.
+> The clustering method can be specified with the `--clustering_method` parameter. The default is `cdhitest`.
 
 > The network clustering method for `mash` can be specified with the `--network_clustering` parameter. The default is `connected_components`, alternative is [`leiden`](https://www.nature.com/articles/s41598-019-41695-z).
 
-> The similarity threshold can be specified with the `--similarity_threshold` parameter. The default is `0.6`. However, for cdhit the default is `0.8` which is its minimum value.
+> The similarity threshold can be specified with the `--similarity_threshold` parameter. The default is `0.85`.
+
+
+## 6. Coverage filtering
+
+The coverage of the contigs is calculated using the same method as in the [coverage calculation step](#3-coverage-calculation). For each cluster, a cumulative sum is taken across the contigs from every assembler. If any of these cumulative sums is above the value of the `--perc_reads_contig` parameter, the cluster is kept. If all cumulative sums are below it, the cluster is removed.
+
+!!! Info annotate "Show me an example of how it works"
+    If the `--perc_reads_contig` is set to `5`, the cumulative sum of the contigs from every assembler is calculated. For example:
+
+    - Cluster 1: the cumulative sum of the contigs from SPAdes is 0.6 and from Megahit 0.5, so the cluster is kept.
+    - Cluster 2: the cumulative sum of the contigs from SPAdes is 0.1 and from Megahit 0.1, so the cluster is removed.
+    - Cluster 3: the cumulative sum of the contigs from SPAdes is 0.5 and from Megahit 0, so the cluster is kept.
+
+> The default is `5` and can be specified with the `--perc_reads_contig` parameter.
+
 
-## Scaffolding
+## 7. Scaffolding
 
-After classifying all contigs and their top BLAST hits into distinct clusters or bins, the contigs are then scaffolded to the centroid of each bin. Any external references that are not centroids of the cluster are subsequently removed to prevent further bias. All members of the cluster are consequently mapped towards their centroid with [Minimap2](https://github.com/lh3/minimap2) and conensus is called using [iVar-consensus](https://andersen-lab.github.io/ivar/html/manualpage.html).
+After classifying all contigs and their top BLAST hits into distinct clusters or bins, the contigs are then scaffolded to the centroid of each bin. Any external references that are not centroids of the cluster are subsequently removed to prevent further bias. All members of the cluster are consequently mapped towards their centroid with [Minimap2](https://github.com/lh3/minimap2) and consensus is called using [iVar-consensus](https://andersen-lab.github.io/ivar/html/manualpage.html).
 
 
-## Annotation with Reference
+## 8. Annotation with Reference
 
 Regions with 0-depth coverage are annotated with the reference sequence. This is done with a [custom script](https://github.com/Joon-Klaps/viralgenie/blob/dev/bin/lowcov_to_reference.py) that uses the coverage of the de novo contigs against the reference sequence to identify regions with 0-depth coverage. The reference sequence is then annotated to these regions.
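
The coverage filtering rule documented in `docs/workflow/assembly_polishing.md` (section "6. Coverage filtering") can be illustrated with a minimal Python sketch. It is not taken from the pipeline: the function name `filter_clusters`, the `(cluster, assembler, perc_reads)` record layout, and the sample values are all hypothetical. It also assumes that "above" means strictly greater than `--perc_reads_contig`, and that a cluster is kept as soon as one assembler's cumulative sum passes.

```python
from collections import defaultdict


def filter_clusters(contigs, perc_reads_contig=5.0):
    """Return {cluster: keep?} for (cluster, assembler, perc_reads) records.

    Hypothetical helper: sums the read percentages per assembler within each
    cluster and keeps the cluster if any assembler's sum exceeds the threshold.
    """
    sums = defaultdict(lambda: defaultdict(float))
    for cluster, assembler, perc_reads in contigs:
        sums[cluster][assembler] += perc_reads
    return {
        cluster: any(total > perc_reads_contig for total in per_assembler.values())
        for cluster, per_assembler in sums.items()
    }


# Made-up percentages of reads mapping to each contig (assumption, not pipeline output).
example = [
    ("cluster_a", "spades", 4.0),
    ("cluster_a", "spades", 3.0),
    ("cluster_a", "megahit", 2.0),
    ("cluster_b", "spades", 1.0),
    ("cluster_b", "megahit", 0.5),
    ("cluster_c", "megahit", 6.0),
]
decisions = filter_clusters(example, perc_reads_contig=5.0)
```

In this made-up input, `cluster_a` passes because its SPAdes contigs sum to 7.0, `cluster_b` fails because no assembler exceeds 5.0, and `cluster_c` passes on a single Megahit contig.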
 
diff --git a/nextflow.config b/nextflow.config
@@ -150,7 +150,6 @@ params {
     skip_alignment_qc           = false
     annotation_db               = "ftp://ftp.expasy.org/databases/viralzone/2020_4/virosaurus90_vertebrate-20200330.fas.gz"
     skip_annotation             = false
-    contig_filter_level         = "normal"
     mmseqs_searchtype           = 4
 
     // Global
diff --git a/nextflow_schema.json b/nextflow_schema.json
@@ -714,13 +714,6 @@
                     "description": "Skip creating an alignment of each the collapsed clusters and each iterative step",
                     "fa_icon": "fas fa-forward"
                 },
-                "contig_filter_level": {
-                    "type": "string",
-                    "default": "normal",
-                    "fa_icon": "fas fa-filter",
-                    "description": "Specify how strict the filtering should be.",
-                    "help_text": "none - don't filter the report of final contigs (not recomended)\nnormal - display only the consensus of the last iteration that had hits in the annotation set\nstrict - Group results by identified species & segment & lineage (if available) to report only the top hits"
-                },
                 "mmseqs_searchtype": {
                     "type": "integer",
                     "default": 4,
