Joon-Klaps
diff --git a/README.md b/README.md (+11 -8)
diff --git a/assets/samplesheets/mapping_constrains.csv b/assets/samplesheets/mapping_constrains.csv (+4 -4)
diff --git a/assets/schemas/mapping_constrains.json b/assets/schemas/mapping_constrains.json (+6)
diff --git a/bin/select_reference.py b/bin/select_reference.py (+137)
diff --git a/conf/modules.config b/conf/modules.config (+43 -2)
diff --git a/docs/images/metromap_style_pipeline_workflow_viralgenie.png b/docs/images/metromap_style_pipeline_workflow_viralgenie.png (0 Bytes)
@@ -51,16 +51,19 @@
     - [`mmseqs-linclust`](https://github.com/soedinglab/MMseqs2/wiki#linear-time-clustering-using-mmseqs-linclust)
     - [`mmseqs-cluster`](https://github.com/soedinglab/MMseqs2/wiki#cascaded-clustering)
     - [`vRhyme`](https://github.com/AnantharamanLab/vRhyme)
-    - [`mash`](https://github.com/marbl/Mash)
+    - [`Mash`](https://github.com/marbl/Mash)
 8. Scaffolding of contigs to centroid ([`Minimap2`](https://github.com/lh3/minimap2), [`iVar-consensus`](https://andersen-lab.github.io/ivar/html/manualpage.html))
 9. [Optional] Annotate 0-depth regions with external reference `custom-script`.
-10. Mapping filtered reads to supercontig and mapping constrains([`BowTie2`](http://bowtie-bio.sourceforge.net/bowtie2/),[`BWAmem2`](https://github.com/bwa-mem2/bwa-mem2) and [`BWA`](https://github.com/lh3/bwa))
-11. [Optional] Deduplicate reads ([`Picard`](https://broadinstitute.github.io/picard/) or if UMI's are used [`UMI-tools`](https://umi-tools.readthedocs.io/en/latest/QUICK_START.html))
-12. Variant calling and filtering ([`BCFTools`](http://samtools.github.io/bcftools/bcftools.html),[`iVar`](https://andersen-lab.github.io/ivar/html/manualpage.html))
-13. Create consensus genome ([`BCFTools`](http://samtools.github.io/bcftools/bcftools.html),[`iVar`](https://andersen-lab.github.io/ivar/html/manualpage.html))
-14. Repeat step 10-13 multiple times for the denovo contig route
-15. Consensus evaluation and annotation ([`QUAST`](http://quast.sourceforge.net/quast),[`CheckV`](https://bitbucket.org/berkeleylab/checkv/src/master/),[`blastn`](https://blast.ncbi.nlm.nih.gov/Blast.cgi), [`mmseqs-search`](https://github.com/soedinglab/MMseqs2/wiki#batch-sequence-searching-using-mmseqs-search))
-16. Result summary visualisation for raw read, alignment, assembly, variant calling and consensus calling results ([`MultiQC`](http://multiqc.info/))
+10. [Optional] Select best reference from `--mapping_constrains`:
+    - [`Mash sketch`](https://github.com/marbl/Mash)
+    - [`Mash screen`](https://github.com/marbl/Mash)
+11. Mapping filtered reads to supercontig and mapping constrains ([`BowTie2`](http://bowtie-bio.sourceforge.net/bowtie2/), [`BWAmem2`](https://github.com/bwa-mem2/bwa-mem2) and [`BWA`](https://github.com/lh3/bwa))
+12. [Optional] Deduplicate reads ([`Picard`](https://broadinstitute.github.io/picard/) or, if UMIs are used, [`UMI-tools`](https://umi-tools.readthedocs.io/en/latest/QUICK_START.html))
+13. Variant calling and filtering ([`BCFTools`](http://samtools.github.io/bcftools/bcftools.html), [`iVar`](https://andersen-lab.github.io/ivar/html/manualpage.html))
+14. Create consensus genome ([`BCFTools`](http://samtools.github.io/bcftools/bcftools.html), [`iVar`](https://andersen-lab.github.io/ivar/html/manualpage.html))
+15. Repeat steps 11-14 multiple times for the de novo contig route
+16. Consensus evaluation and annotation ([`QUAST`](http://quast.sourceforge.net/quast),[`CheckV`](https://bitbucket.org/berkeleylab/checkv/src/master/),[`blastn`](https://blast.ncbi.nlm.nih.gov/Blast.cgi), [`mmseqs-search`](https://github.com/soedinglab/MMseqs2/wiki#batch-sequence-searching-using-mmseqs-search))
+17. Result summary visualisation for raw read, alignment, assembly, variant calling and consensus calling results ([`MultiQC`](http://multiqc.info/))
 
 ## Usage
 
 
@@ -1,4 +1,4 @@
-id,species,segment,samples,definition,sequence
-MN908947,SARS-COV-2,Wuhan-hu-1,SRR11140744;SRR11140748,Original Wuhan strain sequenced in 2019,https://raw.githubusercontent.com/Joon-Klaps/nextclade_data/old_datasets/data/nextstrain/sars-cov-2/MN908947/reference.fasta
-BA2,,,SRR11140744,,https://github.com/Joon-Klaps/nextclade_data/raw/old_datasets/data/nextstrain/sars-cov-2/BA.2/reference.fasta
-BA24,,,,,https://github.com/Joon-Klaps/nextclade_data/raw/old_datasets/data/nextstrain/sars-cov-2/BA.2/reference.fasta
+id,species,segment,samples,definition,selection,sequence
+MN908947,SARS-COV-2,Wuhan-hu-1,SRR11140744;SRR11140748,Original Wuhan strain sequenced in 2019,false,https://raw.githubusercontent.com/Joon-Klaps/nextclade_data/old_datasets/data/nextstrain/sars-cov-2/MN908947/reference.fasta
+BA2,,,SRR11140744,,false,https://github.com/Joon-Klaps/nextclade_data/raw/old_datasets/data/nextstrain/sars-cov-2/BA.2/reference.fasta
+BA24,,,,,true,https://github.com/Joon-Klaps/nextclade_data/raw/old_datasets/data/nextstrain/sars-cov-2/MN908947/sequences.fasta
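To illustrate how the new `selection` column might be consumed, the sketch below parses a samplesheet with the header above and separates rows flagged for reference selection from the rest. It is not the pipeline's own implementation: the helper name `split_by_selection`, the shortened `example.org` URLs, and the assumption that unflagged rows are mapped against directly are all hypothetical.

```python
import csv
import io

# Hypothetical samplesheet content using the header from this change;
# the sequence URLs are shortened placeholders.
SAMPLESHEET = """\
id,species,segment,samples,definition,selection,sequence
MN908947,SARS-COV-2,Wuhan-hu-1,SRR11140744;SRR11140748,Original Wuhan strain,false,https://example.org/MN908947.fasta
BA2,,,SRR11140744,,false,https://example.org/BA2.fasta
BA24,,,,,true,https://example.org/pool.fasta
"""


def split_by_selection(text):
    """Return (to_select, to_map) lists of row ids based on the `selection` flag."""
    to_select, to_map = [], []
    for row in csv.DictReader(io.StringIO(text)):
        # An empty cell falls back to the schema default (false).
        flag = (row["selection"] or "false").strip().lower() == "true"
        (to_select if flag else to_map).append(row["id"])
    return to_select, to_map


selected, mapped = split_by_selection(SAMPLESHEET)
print("select best reference from:", selected)
print("map directly against:", mapped)
```

Treating an empty cell as `false` mirrors the `"default": false` entry in the schema change.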
@@ -64,6 +64,12 @@
                     }
                 ]
             },
+            "selection": {
+                "errorMessage": "Selection can only be true or false",
+                "meta": ["selection"],
+                "type": "boolean",
+                "default": false
+            },
             "definition": {
                 "errorMessage": "Give a definition of the sequence for metadata annotation purposes only",
                 "meta": ["definition"],
 
@@ -0,0 +1,137 @@
+#!/usr/bin/env python
+
+"""Provide a command line tool to select the best reference from Mash screen results."""
+
+import argparse
+import logging
+import sys
+from pathlib import Path
+
+import pandas as pd
+from Bio import SeqIO
+
+logger = logging.getLogger()
+
+
+def parse_args(argv=None):
+    """Define and immediately parse command line arguments."""
+    parser = argparse.ArgumentParser(
+        description="Provide a command line tool to select the best reference from Mash screen results.",
+        epilog="Example: python select_reference.py --mash sample.screen --references references.fa --prefix sample",
+    )
+
+    parser.add_argument(
+        "-i",
+        "--mash",
+        metavar="MASH FILE",
+        type=Path,
+        help="Mash screen result file.",
+    )
+
+    parser.add_argument(
+        "-r",
+        "--references",
+        metavar="REFERENCE FILE",
+        type=Path,
+        help="Reference sequence file (FASTA) that the reads were screened against",
+    )
+
+    parser.add_argument(
+        "-p",
+        "--prefix",
+        metavar="PREFIX",
+        type=str,
+        help="Output file prefix",
+    )
+
+    parser.add_argument(
+        "-l",
+        "--log-level",
+        help="The desired log level (default INFO).",
+        choices=("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"),
+        default="INFO",
+    )
+    return parser.parse_args(argv)
+
+def to_dict_remove_dups(sequences):
+    return {record.id: record for record in sequences}
+
+
+def extract_hits(df, references, prefix):
+    """
+    Extract the reference sequences hit in the Mash screen results and write them to a FASTA file.
+    The output is written to `<prefix>_reference.fa`.
+
+    Args:
+        df (pandas.DataFrame): DataFrame containing the hits information.
+        references (pathlib.Path): Path to the references file in FASTA format.
+        prefix (str): Prefix for the output file.
+
+    Returns:
+        None
+    """
+    try:
+        ref_records = SeqIO.to_dict(SeqIO.parse(references, "fasta"))
+    except ValueError as e:
+        logger.warning(
+            "Indexing the reference pool file causes an error: %s \n Make sure all fasta headers are unique and it is in fasta format! \n AUTOFIX: Taking last occurrence of duplicates to continue analysis",
+            e,
+        )
+        ref_records = to_dict_remove_dups(SeqIO.parse(references, "fasta"))
+    with open(f"{prefix}_reference.fa", "w") as f:
+        init_position = f.tell()
+        for hit in df["query-ID"].unique():
+            hit_name = hit.split(" ")[0]
+            if hit_name in ref_records:
+                # Sometimes reference headers contain illegal characters (backslashes); replace them with underscores
+                ref_records[hit_name].id = ref_records[hit_name].id.replace("\\","_")
+                ref_records[hit_name].description = ref_records[hit_name].description.replace("\\","_")
+                SeqIO.write(ref_records[hit_name], f, "fasta")
+        if f.tell() == init_position:
+            logger.error("No reference sequences found in the hits; the output FASTA file is empty.")
+
+
+def read_mash_screen(file):
+    """
+    Read in the Mash screen file and return the best hit as a single-row pandas DataFrame
+    File format:
+    [identity, shared-hashes, median-multiplicity, p-value, query-ID, query-comment]
+    0.996341	3786/4000	121	0	USANYPRL230901_81A112023
+    0.997144	3832/4000	124	0	EnglandQEUH3267E6482022
+    0.997039	3826/4000	121	0	OP958840
+    0.997022	3825/4000	122	0	OP971202
+    """
+
+    logger.info("Reading in the mash screen file...")
+    df = pd.read_csv(file, sep="\t", header=None)
+    df.columns = ["identity", "shared-hashes", "median-multiplicity", "p-value", "query-ID", "query-comment"]
+
+    logger.info("Sorting by identity and shared-hashes and keeping the top hit...")
+    df['shared-hashes_num'] = df['shared-hashes'].str.split('/').str[0].astype(float)
+    df = df.sort_values(by=["identity", "shared-hashes_num"], ascending=False)
+    df = df.drop(columns=['shared-hashes_num'])
+
+    return df.iloc[[0]]
+
+
+def main(argv=None):
+    """Coordinate argument parsing and program execution."""
+    args = parse_args(argv)
+    logging.basicConfig(level=args.log_level, format="[%(levelname)s] %(message)s")
+    if not args.mash.is_file():
+        logger.error(f"The given input file {args.mash} was not found!")
+        sys.exit(2)
+    if not args.references.is_file():
+        logger.error(f"The given input file {args.references} was not found!")
+        sys.exit(2)
+
+    df = read_mash_screen(args.mash)
+
+    extract_hits(df, args.references, args.prefix)
+
+    df.to_json(f"{args.prefix}.json", orient="records", lines=True)
+
+    return 0
+
+if __name__ == "__main__":
+    sys.exit(main())
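The ranking performed in `read_mash_screen` can be illustrated without pandas. The sketch below is a dependency-free approximation, not the script's own code: it ranks hits by identity and then by the numerator of `shared-hashes`, both descending, and keeps the top one. The rows are taken from the docstring above; the function name `best_hit` is hypothetical.

```python
# Rows taken from the read_mash_screen docstring (tab-separated, no comment column).
SCREEN = (
    "0.996341\t3786/4000\t121\t0\tUSANYPRL230901_81A112023\n"
    "0.997144\t3832/4000\t124\t0\tEnglandQEUH3267E6482022\n"
    "0.997039\t3826/4000\t121\t0\tOP958840\n"
    "0.997022\t3825/4000\t122\t0\tOP971202\n"
)


def best_hit(text):
    """Return the query-ID with the highest identity, ties broken by shared hashes."""
    hits = []
    for line in text.splitlines():
        fields = line.split("\t")
        identity = float(fields[0])
        shared = int(fields[1].split("/")[0])  # numerator of e.g. 3832/4000
        hits.append((identity, shared, fields[4]))
    # Taking the max over (identity, shared) corresponds to a descending sort
    # on both keys followed by keeping the first row.
    return max(hits, key=lambda h: (h[0], h[1]))[2]


print(best_hit(SCREEN))
```

Only the identifier of this single best hit is then looked up in the reference FASTA by `extract_hits`.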
@@ -595,8 +595,10 @@ process {
 
             withName: MASH_DIST{
                 ext.args = [
-                        "-i",       // sketch sequence not file, **don't change will make pipeline fail**
-                        "-t"        // table format **don't change will make pipeline fail**
+                        "-i",                                   // sketch sequence not file, **don't change will make pipeline fail**
+                        "-t",                                   // table format **don't change will make pipeline fail**
+                        "-s ${params.mash_sketch_size}",        // sketch size
+                        "-k ${params.mash_sketch_kmer_size}",   // k-mer size
                     ].join(' ').trim()
                 publishDir =[
                     [
@@ -722,6 +724,45 @@ process {
         }
     }
 
+    withName: CAT_CAT_READS {
+        publishDir = [
+            enabled: false
+        ]
+    }
+
+
+    withName: MASH_SKETCH {
+        ext.args = [
+            "-s ${params.mash_sketch_size}",        // sketch size
+            "-k ${params.mash_sketch_kmer_size}",   // k-mer size
+            "-i ",                                  // Sketch individual sequences DON'T CHANGE
+        ].join(' ').trim()
+        publishDir = [
+            path: { "${params.outdir}/variants/mapping-info/mash/sketch" },
+            mode: params.publish_dir_mode,
+            pattern: '*.msh',
+            saveAs: { filename -> params.prefix || params.global_prefix  ? "${params.global_prefix}-$filename" : filename }
+        ]
+    }
+
+    withName: MASH_SCREEN {
+        publishDir = [
+            path: { "${params.outdir}/variants/mapping-info/mash/screen" },
+            mode: params.publish_dir_mode,
+            pattern: '*.screen',
+            saveAs: { filename -> params.prefix || params.global_prefix  ? "${params.global_prefix}-$filename" : filename }
+        ]
+    }
+
+    withName: SELECT_REFERENCE {
+        publishDir = [
+            path: { "${params.outdir}/variants/mapping-info/mash/select-ref" },
+            mode: params.publish_dir_mode,
+            pattern: '*.json',
+            saveAs: { filename -> params.prefix || params.global_prefix  ? "${params.global_prefix}-$filename" : filename }
+        ]
+    }
+
     //
     // Iterative mapping
     // use the '$meta.iteration' variable to create a new directory for each iteration when publishing the dir's of the modules
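For readers less familiar with Groovy, the `ext.args` lists in this config are joined into a single command-line string. The Python sketch below mimics `[...].join(' ').trim()` for `MASH_SKETCH`. The values 4000 and 15 are placeholders, since the defaults of `params.mash_sketch_size` and `params.mash_sketch_kmer_size` are not shown in this diff.

```python
# Placeholder values; the real defaults are defined elsewhere in the pipeline.
params = {"mash_sketch_size": 4000, "mash_sketch_kmer_size": 15}

args = [
    f"-s {params['mash_sketch_size']}",       # sketch size
    f"-k {params['mash_sketch_kmer_size']}",  # k-mer size
    "-i ",                                    # sketch individual sequences
]

# Equivalent of Groovy's .join(' ').trim(): the trailing space after -i is stripped.
mash_sketch_args = " ".join(args).strip()
print(repr(mash_sketch_args))
```

Because both `MASH_SKETCH` and `MASH_DIST` read the same two parameters, the sketch size and k-mer size stay consistent between the two processes.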