Commit c90297d

fixing quite a few typos
1 parent 0c6c561 commit c90297d

21 files changed: +140 −125 lines changed

README.md (+1 −1)

@@ -59,7 +59,7 @@
 10. [Optional] Remove clusters with low read coverage. `bin/extract_clusters.py`
 11. Scaffolding of contigs to centroid ([`Minimap2`](https://github.com/lh3/minimap2), [`iVar-consensus`](https://andersen-lab.github.io/ivar/html/manualpage.html))
 12. [Optional] Annotate 0-depth regions with external reference `bin/lowcov_to_reference.py`.
-13. [Optional] Select best reference from `--mapping_constrains`:
+13. [Optional] Select best reference from `--mapping_constraints`:
     - [`Mash sketch`](https://github.com/marbl/Mash)
     - [`Mash screen`](https://github.com/marbl/Mash)
 14. Mapping filtered reads to supercontig and mapping constrains([`BowTie2`](http://bowtie-bio.sourceforge.net/bowtie2/),[`BWAmem2`](https://github.com/bwa-mem2/bwa-mem2) and [`BWA`](https://github.com/lh3/bwa))

assets/schemas/mapping_constrains.json (+3 −3)

@@ -1,8 +1,8 @@
 {
     "$schema": "https://json-schema.org/draft/2020-12/schema",
-    "$id": "https://raw.githubusercontent.com/Joon-Klaps/viralgenie/dev/assets/schemas/mapping_constrains.json",
-    "title": "Joon-Klaps/viralgenie pipeline - params.mapping_constrains schema",
-    "description": "Schema for the file provided with params.mapping_constrains",
+    "$id": "https://raw.githubusercontent.com/Joon-Klaps/viralgenie/dev/assets/schemas/mapping_constraints.json",
+    "title": "Joon-Klaps/viralgenie pipeline - params.mapping_constraints schema",
+    "description": "Schema for the file provided with params.mapping_constraints",
     "type": "array",
     "items": {
         "type": "object",

bin/custom_multiqc.py (+1 −1)

@@ -101,7 +101,7 @@ def file_choices(choices, fname):
     )

     parser.add_argument(
-        "--mapping_constrains",
+        "--mapping_constraints",
         metavar="MAPPING CONSTRAINS",
         help="Mapping constrains file containing information on the sequences that need to be used for mapping against the samples, supported formats: '.csv', '.tsv', '.yaml', '.yml'",
         type=lambda s: file_choices(("csv", "tsv", "yaml", "yml"), s),
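The hunk header shows this argument sits near `def file_choices(choices, fname)`, whose body is not included in the diff. A minimal sketch of how such an extension validator is typically wired into argparse (a reconstruction under that assumption, not the pipeline's actual code):

```python
import argparse
import os

def file_choices(choices, fname):
    # Sketch only: the real helper in bin/custom_multiqc.py is not shown in this diff.
    # Reject the argument unless the file extension is one of `choices`.
    ext = os.path.splitext(fname)[1][1:].lower()
    if ext not in choices:
        raise argparse.ArgumentTypeError(f"file must end in one of {choices}: {fname}")
    return fname

parser = argparse.ArgumentParser()
parser.add_argument(
    "--mapping_constraints",
    help="Mapping constraints file: '.csv', '.tsv', '.yaml', '.yml'",
    type=lambda s: file_choices(("csv", "tsv", "yaml", "yml"), s),
)
args = parser.parse_args(["--mapping_constraints", "constraints.csv"])
print(args.mapping_constraints)
```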

bin/utils/module_data_processing.py (+3 −3)

@@ -275,7 +275,7 @@ def reformat_constrain_df(df, file_columns, args):
        return df, df

    # Add constrain metadata to the mapping constrain table
-    constrain_meta = filelist_to_df([args.mapping_constrains])
+    constrain_meta = filelist_to_df([args.mapping_constraints])

    # drop unwanted columns & reorder
    constrain_meta = drop_columns(constrain_meta, ["sequence", "samples"])
@@ -295,12 +295,12 @@ def reformat_constrain_df(df, file_columns, args):

    # add mapping summary to sample overview table in ... wide format with species & segment combination
    logger.info("Creating mapping constrain summary (wide) table")
-    mapping_constrains_summary = create_constrain_summary(df, file_columns).set_index("sample")
+    mapping_constraints_summary = create_constrain_summary(df, file_columns).set_index("sample")

    logger.info("Coalescing columns")
    coalesced_constrains = coalesce_constrain(df)
    coalesced_constrains = drop_columns(coalesced_constrains, ["id", "selection", "rank"])
-    return coalesced_constrains, mapping_constrains_summary
+    return coalesced_constrains, mapping_constraints_summary


 def generate_ignore_samples(dataframe: pd.DataFrame) -> pd.Series:
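The helpers used in this hunk (`filelist_to_df`, `drop_columns`, `coalesce_constrain`, `create_constrain_summary`) are defined elsewhere in the module and not shown in the diff. As a hedged sketch, the calls above appear to rely on a tolerant `drop_columns` that ignores absent names; one plausible shape:

```python
import pandas as pd

def drop_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Sketch of a tolerant column drop: skip names that are absent,
    so callers can pass ["sequence", "samples"] unconditionally."""
    present = [c for c in columns if c in df.columns]
    return df.drop(columns=present)

df = pd.DataFrame({"id": [1], "sequence": ["ACGT"], "species": ["x"]})
print(drop_columns(df, ["sequence", "samples"]).columns.tolist())  # ['id', 'species']
```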

conf/modules.config (+1 −1)

@@ -372,7 +372,7 @@ process {
         ext.args =
             [
                 params.spades_mode ? "--${params.spades_mode}" : '' //,
-                // params.mapping_constrains ? "--trusted-contigs ${params.mapping_constrains}" : ''
+                // params.mapping_constraints ? "--trusted-contigs ${params.mapping_constraints}" : ''
             ].join(' ').trim()
         publishDir = [
             [
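The surrounding `ext.args` block builds the SPAdes argument string by joining optional flags and trimming the result; the `--trusted-contigs` option remains commented out after the rename. A hedged Python analogue of that join-and-trim idiom, for illustration only (the pipeline itself expresses this in Nextflow/Groovy):

```python
# Hypothetical values standing in for the Nextflow params above.
spades_mode = "rnaviral"
mapping_constraints = None  # the --trusted-contigs flag is still disabled in the config

parts = [
    f"--{spades_mode}" if spades_mode else "",
    # f"--trusted-contigs {mapping_constraints}" if mapping_constraints else "",
]
ext_args = " ".join(parts).strip()  # mirrors .join(' ').trim()
print(ext_args)  # --rnaviral
```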

conf/tests/test.config (+1 −1)

@@ -39,7 +39,7 @@ params {
     kaiju_db = "https://kaiju-idx.s3.eu-central-1.amazonaws.com/2023/kaiju_db_viruses_2023-05-26.tgz"

     reference_pool = "https://github.com/Joon-Klaps/nextclade_data/raw/old_datasets/data/nextstrain/sars-cov-2/MN908947/sequences.fasta"
-    mapping_constrains = "${projectDir}/assets/samplesheets/mapping_constrains.csv"
+    mapping_constraints = "${projectDir}/assets/samplesheets/mapping_constraints.csv"

     save_intermediate_polishing = true
     intermediate_mapping_stats = true

conf/tests/test_fail_mapped.config (+1 −1)

@@ -34,7 +34,7 @@ params {
     skip_read_classification = true
     kaiju_db = "https://kaiju-idx.s3.eu-central-1.amazonaws.com/2023/kaiju_db_viruses_2023-05-26.tgz"
     reference_pool = "https://github.com/Joon-Klaps/nextclade_data/raw/old_datasets/data/nextstrain/sars-cov-2/MN908947/sequences.fasta"
-    mapping_constrains = "${projectDir}/assets/samplesheets/mapping_constrains_fail.tsv"
+    mapping_constraints = "${projectDir}/assets/samplesheets/mapping_constraints_fail.tsv"

     min_mapped_reads = 100
     intermediate_mapping_stats = true

conf/tests/test_full.config (+1 −1)

@@ -40,7 +40,7 @@ params {
     kaiju_db = "https://kaiju-idx.s3.eu-central-1.amazonaws.com/2023/kaiju_db_viruses_2023-05-26.tgz"
     reference_pool = "https://github.com/Joon-Klaps/nextclade_data/raw/old_datasets/data/nextstrain/sars-cov-2/MN908947/sequences.fasta"

-    mapping_constrains = "${projectDir}/assets/samplesheets/mapping_constrains.csv"
+    mapping_constraints = "${projectDir}/assets/samplesheets/mapping_constraints.csv"
     checkv_db = "https://github.com/nf-core/test-datasets/raw/phageannotator/modules/nfcore/checkv/endtoend/checkv_minimal_db.tar"

     save_intermediate_polishing = true

conf/tests/test_umi.config (+1 −1)

@@ -42,7 +42,7 @@ params {
     kaiju_db = "https://kaiju-idx.s3.eu-central-1.amazonaws.com/2023/kaiju_db_viruses_2023-05-26.tgz"

     reference_pool = "https://github.com/Joon-Klaps/nextclade_data/raw/old_datasets/data/nextstrain/sars-cov-2/MN908947/sequences.fasta"
-    mapping_constrains = "${projectDir}/assets/samplesheets/mapping_constrains.csv"
+    mapping_constraints = "${projectDir}/assets/samplesheets/mapping_constraints.csv"

     save_intermediate_polishing = true
     save_intermediate_reads = true

docs/images/preprocessing.png (binary file changed: 2.76 KB → 53.5 KB)

docs/parameters.md (+1 −1)

@@ -133,7 +133,7 @@ Parameters relating to the analysis of variants associated to contigs and scaffo
 |-----------|-----------|-----------|
 | `skip_variant_calling` | Skip the analysis of variants for the external reference or contigs | |
 | `mapper` | Define which mapping tool needs to be used when mapping reads to reference | bwamem2 |
-| `mapping_constrains` | Sequence to use as a mapping reference instead of the de novo contigs or scaffolds | |
+| `mapping_constraints` | Sequence to use as a mapping reference instead of the de novo contigs or scaffolds | |
 | `deduplicate` | deduplicate the reads <details><summary>Help</summary><small>If used with umi's, `umi tools` will be used to group and call consensus of each indiual read group. If not used with umi's use `PicardsMarkDuplicates`. </small></details>| True |
 | `variant_caller` | Define the variant caller to use: 'ivar' or 'bcftools' | ivar |
 | `consensus_caller` | consensus tool used for calling new consensus in final iteration | ivar |

docs/workflow/assembly_polishing.md (+17 −23)

@@ -1,4 +1,3 @@
-
 # Assembly & polishing

 Viralgenie offers an elaborate workflow for the assembly and polishing of viral genomes:
@@ -9,17 +8,17 @@ Viralgenie offers an elaborate workflow for the assembly and polishing of viral
 1. [Reference Matching](#4-reference-matching): comparing contigs to a reference sequence pool.
 1. [Taxonomy guided Clustering](#5-taxonomy-guided-clustering): clustering contigs based on taxonomy and nucleotide similarity.
     - [Pre-clustering](#51-pre-clustering-using-taxonomy): separating contigs based on identified taxonomy-id.
-    - [Actual clustering](#52-actual-clustering-on-nucloetide-similarity): clustering contigs based on nucleotide similarity.
-1. [Scaffolding](#scaffolding): scaffolding the contigs to the centroid of each bin.
-1. [Annotation with Reference](#annotation-with-reference): annotating regions with 0-depth coverage with the reference sequence.
+    - [Actual clustering](#52-actual-clustering-on-nucleotide-similarity): clustering contigs based on nucleotide similarity.
+1. [Scaffolding](#7-scaffolding): scaffolding the contigs to the centroid of each bin.
+1. [Annotation with Reference](#8-annotation-with-reference): annotating regions with 0-depth coverage with the reference sequence.

 ![assembly_polishing](../images/assembly_polishing.png)

-> The overal workflow of creating reference assisted assemblies can be skipped with the argument `--skip_assembly`. See the [parameters assembly section](../parameters.md#assembly) for all relevant arguments to control the assembly steps.
+> The overall workflow of creating reference assisted assemblies can be skipped with the argument `--skip_assembly`. See the [parameters assembly section](../parameters.md#assembly) for all relevant arguments to control the assembly steps.

 > The overall refinement of contigs can be skipped with the argument `--skip_polishing`. See the [parameters polishing section](../parameters.md#polishing) for all relevant arguments to control the polishing steps.

-The consensus genome of all clusters are then send to the [variant analysis & iterative refinement](variant_and_refinement.md) step.
+The consensus genome of all clusters are then sent to the [variant analysis & iterative refinement](variant_and_refinement.md) step.

 ## 1. De-novo Assembly

@@ -34,7 +33,7 @@ Low complexity contigs can be filtered out using prinseq++ with the `--skip_cont

 Contigs can be extended using [SSPACE Basic](https://github.com/nsoranzo/sspace_basic) with the `--skip_sspace_basic false` parameter. SSPACE is a tool for scaffolding contigs using paired-end reads. It is modified from SSAKE assembler and has the feature of extending contigs using reads that are unmappable in the contig assembly step. To maximize its efficiency, consider specifying the arguments `--read_distance`, `--read_distance_sd`, and `--read_orientation`. For more information on these arguments, see the [parameters assembly section](../parameters.md#assembly).

-> The extension of contigs is ran by default, to skip this step, use `--skip_sspace_basic`.
+> The extension of contigs is run by default, to skip this step, use `--skip_sspace_basic`.

 ## 3. Coverage calculation

@@ -44,24 +43,23 @@ Processed reads are mapped back against the contigs to determine the number of r

 ## 4. Reference Matching

-The newly assembled contigs are compared to a reference sequence pool (--reference_pool) using a [BLASTn search](https://www.ncbi.nlm.nih.gov/books/NBK153387/). This process not only helps annotate the contigs but also assists in linking together sets of contigs that are distant within a single genome. Essentially, it aids in identifying contigs belonging to the same genomic segment and choosing the right reference for scaffolding purposes.
+The newly assembled contigs are compared to a reference sequence pool (`--reference_pool`) using a [BLASTn search](https://www.ncbi.nlm.nih.gov/books/NBK153387/). This process not only helps annotate the contigs but also assists in linking together sets of contigs that are distant within a single genome. Essentially, it aids in identifying contigs belonging to the same genomic segment and choosing the right reference for scaffolding purposes.

-The top 5 hits for each contig are combined with the denovo contigs and send to the clustering step.
+The top 5 hits for each contig are combined with the de novo contigs and sent to the clustering step.

 > The reference pool can be specified with the `--reference_pool` parameter. The default is the latest clustered [Reference Viral DataBase (RVDB)](https://rvdb.dbi.udel.edu/).

 ## 5. Taxonomy guided Clustering

-The clustering workflow of contigs consists out of 2 steps, the [pre-clustering using taxonomy](#51-pre-clustering-using-taxonomy) and
-[actual clustering on nucleotide similarity](#52-actual-clustering-on-nucloetide-similarity). The taxonomy guided clustering is used to separate contigs based on taxonomy and nucleotide similarity.
+The clustering workflow of contigs consists of 2 steps, the [pre-clustering using taxonomy](#51-pre-clustering-using-taxonomy) and
+[actual clustering on nucleotide similarity](#52-actual-clustering-on-nucleotide-similarity). The taxonomy guided clustering is used to separate contigs based on taxonomy and nucleotide similarity.

 ```mermaid
 graph LR;
     A[Contigs] --> B["`**Pre-clustering**`"];
     B --> C["`**Actual clustering**`"];
 ```

-
 ### 5.1 Pre-clustering using taxonomy

 The contigs along with their references have their taxonomy assigned using [Kraken2](https://ccb.jhu.edu/software/kraken2/) and [Kaiju](https://kaiju.binf.ku.dk/).
@@ -70,7 +68,7 @@ The contigs along with their references have their taxonomy assigned using [Krak
 > - Kraken2: viral refseq database, `--kraken2_db`
 > - Kaiju: clustered [RVDB](https://rvdb.dbi.udel.edu/), `--kaiju_db`

-As Kajiu and Kraken2 can have different taxonomic assignments, an additional step is performed to resolve potential inconsistencies in taxonomy and to identify the taxonomy of the contigs. This is done with a custom script that is based on `KrakenTools extract_kraken_reads.py` and `kaiju-Merge-Outputs`.
+As Kaiju and Kraken2 can have different taxonomic assignments, an additional step is performed to resolve potential inconsistencies in taxonomy and to identify the taxonomy of the contigs. This is done with a custom script that is based on `KrakenTools extract_kraken_reads.py` and `kaiju-Merge-Outputs`.

 ```mermaid
 graph LR;
@@ -94,7 +92,7 @@

 1. Options here are 'species', 'genus', 'family', 'order', 'class', 'phylum', 'kingdom' or 'superkingdom'.

-2. `--precluster_include_childeren`__"genus1"__ :
+2. `--precluster_include_children` __"genus1"__ :

 ```mermaid
 graph TD;
@@ -118,9 +116,9 @@
 ```
 Dotted lines represent exclusion of taxa.

-> The pre-clustering step will be run by default but can be skipped with the argument `--skip_preclustering`. Specify which classifier to use with `--precluster_classifiers` parameter. The default is `kaiju,kraken2`. Contig taxon filtering is still enabled despite not having to solve for inconsistencies if only Kaiju or Kraken2 is ran.
+> The pre-clustering step will be run by default but can be skipped with the argument `--skip_preclustering`. Specify which classifier to use with `--precluster_classifiers` parameter. The default is `kaiju,kraken2`. Contig taxon filtering is still enabled despite not having to solve for inconsistencies if only Kaiju or Kraken2 is run.

-### 5.2 Actual clustering on nucloetide similarity
+### 5.2 Actual clustering on nucleotide similarity

 The clustering is performed with one of the following tools:

@@ -131,7 +129,6 @@ The clustering is performed with one of the following tools:
 - [`vRhyme`](https://github.com/AnantharamanLab/vRhyme)
 - [`mash`](https://github.com/marbl/Mash)

-
 These methods all come with their own advantages and disadvantages. For example, cdhitest is very fast but cannot be used for large viruses >10Mb and similarity threshold cannot go below 80% which is not preferable for highly diverse RNA viruses. Vsearch is slower but accurate. Mmseqs-linclust is the fastest but tends to create a large amount of bins. Mmseqs-cluster is slower but can handle larger datasets and is more accurate. vRhyme is a new method that is still under development but has shown promising results but can sometimes not output any bins when segments are small. Mash is a very fast comparison method is linked with a custom script that identifies communities within a network.

 !!! Tip
@@ -143,28 +140,25 @@ These methods all come with their own advantages and disadvantages. For example,

 > The similarity threshold can be specified with the `--similarity_threshold` parameter. The default is `0.85`.

-
 ## 6. Coverage filtering

-The coverage of the contigs is calculated using the same method as in the [coverage calculation step](#3-coverage-calculation). A cumulative sum is taken across the contigs from every assembler. If these cumulative sums is above the specified `--perc_reads_contig` parameter, the contig is kept. If all cumulative sums is below the specified parameter, the contig is removed.
+The coverage of the contigs is calculated using the same method as in the [coverage calculation step](#3-coverage-calculation). A cumulative sum is taken across the contigs from every assembler. If these cumulative sums are above the specified `--perc_reads_contig` parameter, the contig is kept. If all cumulative sums are below the specified parameter, the contig is removed.

 !!! Info annotate "Show me an example how it works"
     If the `--perc_reads_contig` is set to `5`, the cumulative sum of the contigs from every assembler is calculated. For example:

-    - Cluster 1: the cumulative sum of the contigs from SPAdes is 0.6, Megahit is 0.5, the cluster is kept. 
+    - Cluster 1: the cumulative sum of the contigs from SPAdes is 0.6, Megahit is 0.5, the cluster is kept.
     - Cluster 2: the cumulative sum of the contigs from SPAdes is 0.1, Megahit is 0.1, the cluster is removed.
     - Cluster 3: the cumulative sum of the contigs from SPAdes is 0.5, Megahit is 0, the cluster is kept.

 > The default is `5` and can be specified with the `--perc_reads_contig` parameter.

-
 ## 7. Scaffolding

 After classifying all contigs and their top BLAST hits into distinct clusters or bins, the contigs are then scaffolded to the centroid of each bin. Any external references that are not centroids of the cluster are subsequently removed to prevent further bias. All members of the cluster are consequently mapped towards their centroid with [Minimap2](https://github.com/lh3/minimap2) and consensus is called using [iVar-consensus](https://andersen-lab.github.io/ivar/html/manualpage.html).

-
 ## 8. Annotation with Reference

-Regions with 0-depth coverage are annotated with the reference sequence. This is done with a [custom script](https://github.com/Joon-Klaps/viralgenie/blob/dev/bin/lowcov_to_reference.py) that uses the coverage of the denovo contigs towards the reference sequence to identify regions with 0-depth coverage. The reference sequence is then annotated to these regions.
+Regions with 0-depth coverage are annotated with the reference sequence. This is done with a [custom script](https://github.com/Joon-Klaps/viralgenie/blob/dev/bin/lowcov_to_reference.py) that uses the coverage of the de novo contigs towards the reference sequence to identify regions with 0-depth coverage. The reference sequence is then annotated to these regions.

 > This step can be skipped using `--skip_hybrid_consensus` parameter.
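The coverage-filter wording fixed in this hunk keeps a cluster when the cumulative read share from at least one assembler reaches `--perc_reads_contig`, and drops it only when every assembler falls below. A hedged Python sketch of that rule with hypothetical data, assuming the threshold and the per-contig read shares are on the same percentage scale:

```python
from collections import defaultdict

def keep_clusters(contigs, perc_reads_contig=5.0):
    """Sketch of the section-6 rule: keep a cluster if, for at least one
    assembler, the cumulative read percentage across its contigs meets the cutoff.

    contigs: iterable of (cluster, assembler, pct_reads) tuples (hypothetical shape).
    """
    totals = defaultdict(float)
    for cluster, assembler, pct in contigs:
        totals[(cluster, assembler)] += pct
    return {cluster for (cluster, _), total in totals.items() if total >= perc_reads_contig}

contigs = [
    ("cluster1", "spades", 4.0), ("cluster1", "spades", 2.0),   # cumulative 6.0 -> kept
    ("cluster2", "spades", 1.0), ("cluster2", "megahit", 1.0),  # all below 5  -> removed
]
print(keep_clusters(contigs))  # {'cluster1'}
```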
