MucOneUp is a Python tool for simulating MUC1 VNTR diploid references. It builds customized references that:
- Generate haplotypes containing a variable-length VNTR region using a probability model or fixed lengths.
- Force a canonical terminal block (`6` or `6p` → `7` → `8` → `9`) before appending the right-hand constant.
- Optionally introduce mutations (inserts, deletes, or replacements) in selected repeats with precise control over which haplotypes are affected.
- Support structure files with mutation information, allowing for reproducible generation of specific mutated VNTR structures.
- Generate a series of simulations when fixed-length ranges are provided (via the `--simulate-series` flag), so that a simulation is run for each possible length (or combination of lengths for multiple haplotypes).
- Run dual simulations (normal and mutated) when a comma-separated mutation name is provided.
- **Detect Toxic Protein Features:** When ORF prediction is activated (via the `--output-orfs` flag), the tool scans the resulting ORF FASTA file for toxic protein sequence features by analyzing the repeat structure and amino acid composition of the variable region. A quantitative "deviation" (or similarity) score is computed relative to a wild-type model, and if the overall score exceeds a user-defined cutoff, the protein is flagged as toxic.
- **Generate Comprehensive Simulation Statistics:** For each simulation run, detailed statistics are generated—including simulation runtime, haplotype-level metrics (repeat counts, VNTR lengths, GC content, repeat length summaries, and mutation details), as well as overall aggregated statistics. In dual simulation mode, separate reports are produced for the normal and mutated outputs.
Additionally, MucOneUp supports multiple read simulation pipelines:
For Illumina reads, we use a port of w‑Wessim2. This pipeline simulates reads from the simulated FASTA by:
- Replacing Ns and generating systematic errors using external tools
- Converting the FASTA to 2bit format
- Extracting a subset reference from a sample BAM
- Running pblat for alignment
- Generating fragments and creating reads using the w‑Wessim2 port
- Splitting interleaved FASTQ into paired FASTQ files and aligning to a human reference
For Oxford Nanopore (ONT) reads, we integrate NanoSim to generate realistic long reads with the error profiles characteristic of nanopore sequencing. This pipeline:
- Uses pre-trained models specific to ONT technologies
- Simulates long reads with realistic error profiles
- Aligns reads to the reference using minimap2
- Generates alignment files in BAM format
- Installation
- Installing Required Reference Files
- Quick Start
- Usage
- Toxic Protein Detection
- Simulation Statistics
- Read Simulation Integration
- Project Structure and Logic
- License
- Clone or download this repository.
- Install in a Python 3.7+ environment:
  pip install .

  This will install the `muc_one_up` Python package locally.
Optional (Conda/Mamba Environments for Read Simulation):
For Illumina read simulation with w-Wessim2, create the environment:
mamba env create -f conda/env_wessim.yaml

For ONT read simulation with NanoSim, create the environment:

mamba env create -f conda/env_nanosim.yaml

After creating these environments, update the `tools` section in your configuration file to reference the executables from the newly created environments. Alternatively, you can install the tools locally.
Once installed, you’ll have a command-line program called `muconeup` available.
In order to run the read simulation pipeline, MucOneUp requires several external reference files (such as a human reference FASTA and a reseq model file). A helper script is provided to automate the download, extraction, and (optional) indexing of these reference files. Follow these steps to install the required references and update your configuration accordingly:
The helper script lives in the `helpers` directory and downloads the required reference files based on a JSON configuration (by default, `helpers/install_references_config.json`).
For example, to install the references into a directory named `./references`, run:
python helpers/install_references.py --output-dir ./references
This command will:
- Download the reference files (e.g. the GRCh38 human reference and a reseq model file).
- Verify the downloads via MD5 checksum.
- Extract any gzip-compressed files automatically.
- Index FASTA files using BWA if an indexing command is provided in the configuration.
(If you prefer not to index the FASTA files automatically, use the `--skip-indexing` flag.)
After the script completes, an `installed_references.json` file is generated in the output directory. This file maps each reference name to its absolute file path.
Once the references are installed, update your MucOneUp configuration file (`config.json`) so that the fields in the `tools` and `read_simulation` sections point to the correct local paths. For example, if your reference FASTA file (e.g. GRCh38) is now located at:
/path/to/references/GRCh38_no_alt_analysis_set.fna.gz
update the `human_reference` field in the `read_simulation` section accordingly. Similarly, update the path for the reseq model file and any other reference paths you are using. For example:
{
"tools": {
"reseq": "mamba run --no-capture-output -n env_wessim reseq",
"faToTwoBit": "mamba run --no-capture-output -n env_wessim faToTwoBit",
"samtools": "mamba run --no-capture-output -n env_wessim samtools",
"pblat": "mamba run --no-capture-output -n env_wessim pblat",
"bwa": "mamba run --no-capture-output -n env_wessim bwa",
"nanosim": "mamba run --no-capture-output -n env_nanosim simulator.py",
"minimap2": "mamba run --no-capture-output -n env_nanosim minimap2"
},
"read_simulation": {
"reseq_model": "/path/to/references/Hs-Nova-TruSeq.reseq",
"sample_bam": "data/twist_v2.bam",
"human_reference": "/path/to/references/GRCh38_no_alt_analysis_set.fna.gz",
"read_number": 10000,
"fragment_size": 250,
"fragment_sd": 35,
"min_fragment": 20,
"threads": 8,
"downsample_coverage": 200,
"downsample_seed": 42,
"reference_assembly": "hg38",
"vntr_region_hg19": "chr1:155160500-155162000",
"vntr_region_hg38": "chr1:155188000-155192500"
}
}
Be sure to replace `/path/to/references` with the absolute path where your references were installed (as reported in the `installed_references.json` file).
Before running the full simulation or read simulation pipeline, check that the reference files are accessible and correctly referenced in your config. Review the contents of the `installed_references.json` file and ensure that each required file exists at the specified path.
- Create or update a JSON config (see Config File Layout) describing your repeats, probabilities, mutations, etc.
- Run the tool by specifying your config along with desired parameters. For example:
muconeup --config config.json --out-base muc1_simulated --output muc1_simulated.fa --output-structure muc1_struct.txt
- Inspect the resulting outputs:
  - `muc1_simulated.fa`: Multi-FASTA file of haplotype sequences.
  - `muc1_struct.txt`: Textual representation of each haplotype’s chain of repeats.
  - `simulation_stats.json` (or with variant suffixes in dual simulation mode): JSON file(s) containing detailed simulation statistics.
Below are the available command-line arguments. Use `muconeup --help` for more details.
| Argument | Description |
|---|---|
| `--config <path>` | Required. Path to the JSON config file containing repeats, probabilities, constants, length model, mutations, tools, and read simulation settings. |
| `--out-base <basename>` | Base name for all output files. All outputs (simulation FASTA, VNTR structure, ORF FASTA, read simulation outputs, and ORF toxic protein statistics) will be named using this base. Default is `muc1_simulated`. |
| `--out-dir <folder>` | Output folder where all files will be written. Defaults to the current directory. |
| `--num-haplotypes N` | Number of haplotypes to simulate. Typically 2 for diploid. Defaults to 2. |
| `--fixed-lengths <vals>` | One or more fixed lengths (or ranges) for each haplotype’s VNTR repeats. Values may be a single integer (e.g. `60`) or a range (e.g. `20-40`). When a range is provided, the default behavior is to pick one value at random from each range. Use the `--simulate-series` flag (see below) to run a simulation for every value (or combination) in the range. |
| `--simulate-series` | (Optional) When specified and fixed-length ranges are provided, the program will generate a simulation iteration for every possible length (or combination of lengths for multiple haplotypes) instead of choosing a single random value. This flag is useful when you want to explore the entire parameter space. |
| `--seed <int>` | Random seed for reproducible simulations (affects VNTR building and mutation target selection). |
| `--mutation-name <str>` | (Optional) Name of a mutation from the config to apply. To run dual simulations (normal and mutated), provide a comma-separated pair (e.g. `normal,dupC`). If a single value is provided, only one simulation is mutated. |
| `--mutation-targets <pairs>` | (Optional) One or more `haplotype_index,repeat_index` pairs (1-based), e.g. `1,25 2,30`. If provided, each pair indicates which haplotype and repeat position to mutate. If omitted, the mutation is applied at a random allowed repeat. Only haplotypes specified in these targets will have mutation information in their FASTA headers. |
| `--input-structure <file>` | (Optional) Path to a structure file containing predefined VNTR repeat chains. If the structure file includes a header comment with mutation information (e.g. `# Mutation Applied: dupC (Targets: [(1, 25)])`), that information will be used to apply mutations to the specific haplotypes and positions instead of using random targets or CLI-specified targets. |
| `--output-structure` | (Optional) If provided, output a VNTR structure file (text) listing the chain of repeats for each haplotype. |
| `--output-orfs` | (Optional) If provided, run ORF prediction and output an ORF FASTA file using the normalized naming scheme. Additionally, the resulting ORF file will be scanned for toxic protein sequence features and a JSON statistics file is generated. |
| `--orf-min-aa <int>` | Minimum peptide length (in amino acids) to report from ORF prediction. Defaults to 100. |
| `--orf-aa-prefix <str>` | (Optional) Filter resulting peptides to only those beginning with this prefix. If used without a value, defaults to `MTSSV`. If omitted, no prefix filtering is applied. |
| `--simulate-reads` | (Optional) If provided, run the read simulation pipeline on the simulated FASTA. This pipeline produces an aligned/indexed BAM and gzipped paired FASTQ files. |
| `--random-snps` | (Optional) If provided, generate random SNPs and integrate them into the simulated sequences. |
| `--random-snp-density <float>` | Density of random SNPs to generate (SNPs per kilobase). Defaults to 1.0. |
| `--random-snp-output-file <path>` | (Optional) Path to a file where generated random SNPs will be saved in TSV format. |
| `--snp-file <path>` | (Optional) Path to a TSV file containing predefined SNPs to integrate into the simulated sequences. |
| `--log-level <level>` | Set logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL, NONE). Default is INFO. |
- Generate two diploid haplotypes with random VNTR lengths (sampled from the length model defined in your config):

  muconeup --config config.json --out-base muc1_sim --output muc1_sim.fa --output-structure muc1_struct.txt

- Force a fixed length of 60 repeats for each haplotype:

  muconeup --config config.json --out-base muc1_fixed --output muc1_fixed.fa --output-structure muc1_fixed.txt --num-haplotypes 2 --fixed-lengths 60

- Generate a single simulation using a fixed-length range (a random value is chosen from the range for each haplotype):

  muconeup --config config.json --out-base muc1_random_range --fixed-lengths 20-40

- Apply a specific mutation to targeted positions in specific haplotypes:

  muconeup --config config.json --out-base muc1_mutated --mutation-name dupC --mutation-targets 1,25 2,30

  This applies the `dupC` mutation to haplotype 1 at repeat position 25 and to haplotype 2 at repeat position 30. The FASTA header for each mutated haplotype will include the mutation information.

- Use a structure file with mutation information for reproducible generation of specific VNTR structures:

  muconeup --config config.json --out-base muc1_from_structure --input-structure structure_file.txt

  where the structure file contains mutation information in a header comment:

  # Mutation Applied: dupC (Targets: [(1, 25)])
  haplotype_1 1-2-3-4-5-C-X-B-X-X-X-X-X-X-X-X-V-G-B-X-X-G-A-B-Xm-X-X-X-A-A-A-B-X-D-E-C-6-7-8-9
  haplotype_2 1-2-3-4-5-C-X-A-B-X-X-X-V-G-A-A-A-B-B-X-X-X-X-X-X-X-X-X-X-X-X-X-V-V-V-V-V-V-V-V-V-G-A-B-B-X-A-A-N-R-X-X-X-X-A-B-6p-7-8-9

  Note that the mutated repeat position is marked with an "m" suffix in the structure file (e.g. `Xm`). The output FASTA will include mutation information only in the header of haplotype 1, which is the only one with a mutation according to the targets.

- Generate a series of simulations using a fixed-length range (each possible value in the range, or combination thereof, produces an output file):

  muconeup --config config.json --out-base muc1_series --simulate-series --fixed-lengths 20-40

- Apply a known mutation (`dupC`) at a specific repeat (haplotype #1, repeat #5):

  muconeup --config config.json --out-base muc1_mutated --mutation-name dupC --mutation-targets 1,5

- Run dual simulation (normal and mutated) with a mutation (`dupC`):

  muconeup --config config.json --out-base muc1_dual --mutation-name normal,dupC

- Apply a known mutation (`snpA`) to a random allowed repeat:

  muconeup --config config.json --out-base muc1_random_mut --mutation-name snpA
When ORF prediction is enabled (using the `--output-orfs` flag), MucOneUp automatically scans the resulting ORF FASTA file for toxic protein sequence features. This feature works as follows:
- **Extracting the Variable Region:** The ORF sequence is trimmed by removing the constant left/right flanks (if provided in the configuration), isolating the variable region (i.e. the repeat region).
- **Detecting and Quantifying Repeats:** A sliding window—whose length equals that of a consensus repeat motif (e.g. `"RCHLGPGHQAGPGLHR"`)—moves across the variable region. For each window, similarity to the motif is computed using a simple Hamming-distance approach, and windows exceeding a set identity threshold (e.g. 80%) are counted as valid repeats. From these, the number of repeats and the average repeat identity score are calculated.
- **Amino Acid Composition Analysis:** The tool computes the frequency of key residues (such as R, C, and H) in the variable region and compares these frequencies to a wild-type model (by default, an approximation generated by repeating the consensus motif). A composition similarity score is calculated as

  S_composition = 1 - (sum(|f_mut - f_wt|) / sum(f_wt))

  where a score near 1 indicates high similarity (i.e. normal) and lower scores indicate divergence (i.e. toxicity).
- **Combining Metrics:** The overall similarity (or deviation) score is computed as a weighted sum of the repeat score and the composition similarity. In this implementation, a higher overall score indicates divergence from the wild-type (i.e. a toxic protein), while a lower score indicates similarity to the wild-type (normal). If the overall score exceeds a user-defined toxic detection cutoff (e.g. 0.5), the ORF is flagged as toxic. A minimal sketch of the repeat-window and composition scoring steps follows after this list.
- **Output:** The detection metrics for each ORF (including repeat count, average repeat identity, repeat score, composition similarity, overall score, and the toxic flag) are saved in a JSON file (with the file extension `orf_stats.txt`) alongside the ORF FASTA output.
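The two scoring steps above can be pictured with a short, self-contained Python sketch. This is an illustrative simplification, not MucOneUp's `toxic_protein_detector.py`: the consensus motif, the 80% identity threshold, the key-residue set, and the toy sequences are assumptions taken from the description above.

```python
from collections import Counter

CONSENSUS = "RCHLGPGHQAGPGLHR"   # example consensus motif from the description above
KEY_RESIDUES = ("R", "C", "H")   # key residues compared against the wild-type model


def window_identity(window: str, motif: str) -> float:
    """Fraction of matching positions between a window and the motif (1 - normalized Hamming distance)."""
    matches = sum(1 for a, b in zip(window, motif) if a == b)
    return matches / len(motif)


def repeat_metrics(variable_region: str, motif: str = CONSENSUS, identity_threshold: float = 0.8):
    """Slide a motif-length window over the variable region and count high-identity windows."""
    k = len(motif)
    identities = [
        window_identity(variable_region[i:i + k], motif)
        for i in range(len(variable_region) - k + 1)
    ]
    valid = [s for s in identities if s >= identity_threshold]
    repeat_count = len(valid)
    avg_identity = sum(valid) / repeat_count if repeat_count else 0.0
    return repeat_count, avg_identity


def composition_similarity(variable_region: str, wild_type: str, residues=KEY_RESIDUES) -> float:
    """S_composition = 1 - sum(|f_mut - f_wt|) / sum(f_wt), computed over the key residues."""
    def freqs(seq):
        counts = Counter(seq)
        total = len(seq) or 1
        return {aa: counts.get(aa, 0) / total for aa in residues}

    f_mut, f_wt = freqs(variable_region), freqs(wild_type)
    denom = sum(f_wt.values()) or 1.0
    return 1.0 - sum(abs(f_mut[aa] - f_wt[aa]) for aa in residues) / denom


if __name__ == "__main__":
    # Wild-type approximation: the consensus motif repeated (as described above).
    wild_type_model = CONSENSUS * 20
    variable_region = CONSENSUS * 18 + "RRRRCCCCHHHH"  # toy "mutated" region
    count, avg_id = repeat_metrics(variable_region)
    s_comp = composition_similarity(variable_region, wild_type_model)
    print(f"repeats={count}, avg_identity={avg_id:.2f}, composition_similarity={s_comp:.2f}")
```

In the actual tool, the repeat score and composition similarity are then combined into a weighted overall score and compared against the configured toxic detection cutoff.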
A new feature generates a comprehensive statistics report for each simulation run. This report includes:
- Runtime: Total simulation runtime in seconds.
- Haplotype-Level Metrics: For each simulated haplotype, the report details the number of repeats, VNTR region length, GC content, individual repeat lengths (with min, max, and average), repeat type counts, and mutation details.
- Aggregated Metrics: Overall statistics aggregated from all haplotypes.
- Dual Simulation Reporting: In dual mutation mode, separate statistics reports are produced for the normal and mutated outputs.
The statistics are saved as JSON files (e.g., `muc1_simulated.002.simulation_stats.json.normal` and `muc1_simulated.002.simulation_stats.json.mut`).
The read simulation pipeline simulates Illumina reads from the generated FASTA files. This pipeline leverages external tools (reseq, faToTwoBit, samtools, pblat, bwa) and incorporates a port of w‑Wessim2 to:
- Replace Ns in the FASTA.
- Generate systematic errors and convert the FASTA to 2bit format.
- Extract a subset reference from a sample BAM.
- Align the 2bit file to the subset reference using pblat.
- Simulate fragments and create reads using the w‑Wessim2 port.
- Split the interleaved FASTQ into paired FASTQ files.
- Align the reads to a human reference.
MucOneUp supports the integration of Single Nucleotide Polymorphisms (SNPs) into simulated sequences. This feature allows for more realistic simulations by incorporating natural genetic variation. SNPs can be integrated in two ways:
- From a predefined file using the `--snp-file` parameter.
- Generated randomly using the `--random-snps` parameter and a specified density.
The SNP file format is tab-separated (TSV) with the following columns:
- haplotype (1-based): The haplotype index (1 or 2 for diploid)
- position (0-based): Position in the haplotype sequence
- ref_base: Expected reference base at the position
- alt_base: Alternative base to introduce
Example SNP file content:
haplotype position ref_base alt_base
1 125 A G
2 236 C T
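For illustration, such a TSV could be parsed and applied to haplotype sequences roughly as follows. This is a minimal sketch, not MucOneUp's internal code; the helper names are assumptions, and the `skip_reference_check` flag mirrors the behaviour described later in this section.

```python
import csv

def load_snps(tsv_path):
    """Read a MucOneUp-style SNP TSV: haplotype (1-based), position (0-based), ref_base, alt_base."""
    snps = []
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            snps.append((int(row["haplotype"]), int(row["position"]),
                         row["ref_base"].upper(), row["alt_base"].upper()))
    return snps

def apply_snps(haplotypes, snps, skip_reference_check=False):
    """Return haplotype sequences with SNPs applied; optionally skip the ref-base sanity check."""
    seqs = [list(seq) for seq in haplotypes]
    for hap, pos, ref, alt in snps:
        current = seqs[hap - 1][pos]
        if not skip_reference_check and current.upper() != ref:
            raise ValueError(f"haplotype {hap} pos {pos}: expected {ref}, found {current}")
        seqs[hap - 1][pos] = alt
    return ["".join(s) for s in seqs]

# Example usage (with two haplotype sequences hap1_seq and hap2_seq already in memory):
#   snps = load_snps("my_snps.tsv")
#   hap1_seq, hap2_seq = apply_snps([hap1_seq, hap2_seq], snps)
```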
When using `--random-snps`, MucOneUp will:
- Generate random SNPs based on the specified density (SNPs per kilobase).
- Ensure SNPs are distributed across both haplotypes.
- Save the generated SNPs to a file if `--random-snp-output-file` is provided (see the sketch below).
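A minimal sketch of what density-based SNP generation can look like (illustrative only; the function name and the way the alternative allele is chosen here are assumptions, and a fuller implementation would read the reference base at each position so the alt allele always differs from it):

```python
import random

def generate_random_snps(haplotype_lengths, density_per_kb=1.0, seed=None):
    """Pick random SNP positions at roughly `density_per_kb` SNPs per kilobase for each haplotype.

    Returns (haplotype_1based, position_0based, alt_base) tuples; a real implementation would
    also look up the reference base at each position to pick a differing alt allele.
    """
    rng = random.Random(seed)
    bases = "ACGT"
    snps = []
    for hap_index, length in enumerate(haplotype_lengths, start=1):
        n_snps = max(1, round(length / 1000 * density_per_kb))
        for pos in sorted(rng.sample(range(length), min(n_snps, length))):
            snps.append((hap_index, pos, rng.choice(bases)))
    return snps

# Example: two diploid haplotypes of ~5 kb each at 0.5 SNPs per kilobase.
print(generate_random_snps([5000, 5200], density_per_kb=0.5, seed=42))
```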
In dual mutation mode (using `--mutation-name normal,dupC`), SNPs are applied to both the normal and mutated sequences. For mutated sequences, the `skip_reference_check` option is automatically enabled, allowing SNPs to be applied even when mutations have altered the original reference bases.
This is particularly useful when you want to simulate scenarios where both structural mutations and SNPs are present in the sample, providing a more realistic representation of genetic diversity.
# Generate random SNPs with specified density
muconeup --config config.json --out-base muc1_with_snps --random-snps --random-snp-density 0.5 --random-snp-output-file output/muc1_random_snps.tsv
# Apply SNPs from a predefined file
muconeup --config config.json --out-base muc1_with_predefined_snps --snp-file my_snps.tsv
# Combine dual mutation mode with SNP integration
muconeup --config config.json --out-base muc1_dual_with_snps --mutation-name normal,dupC --random-snps
MucOneUp now supports structure files that contain mutation information embedded in header comments. This allows for reproducible generation of specific mutated VNTR structures.
A structure file with mutation information has the following format:
# Mutation Applied: <mutation_name> (Targets: [(<haplotype_index>, <position>), ...])
haplotype_1 <repeat_chain_with_m_suffix_for_mutated_positions>
haplotype_2 <repeat_chain>
Example:
# Mutation Applied: dupC (Targets: [(1, 25)])
haplotype_1 1-2-3-4-5-C-X-B-X-X-X-X-X-X-X-X-V-G-B-X-X-G-A-B-Xm-X-X-X-A-A-A-B-X-D-E-C-6-7-8-9
haplotype_2 1-2-3-4-5-C-X-A-B-X-X-X-V-G-A-A-A-B-B-X-X-X-X-X-X-X-X-X-X-X-X-X-V-V-V-V-V-V-V-V-V-G-A-B-B-X-A-A-N-R-X-X-X-X-A-B-6p-7-8-9
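For illustration, a structure file in this layout could be parsed with a few lines of Python. This is a minimal sketch under the format assumptions shown above, not the parser MucOneUp ships.

```python
import ast
import re

MUTATION_HEADER = re.compile(
    r"^#\s*Mutation Applied:\s*(?P<name>\S+)\s*\(Targets:\s*(?P<targets>\[.*\])\)"
)

def parse_structure_file(path):
    """Parse a structure file into (mutation_name, targets, {haplotype_name: [repeat symbols]})."""
    mutation_name, targets, chains = None, [], {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith("#"):
                match = MUTATION_HEADER.match(line)
                if match:
                    mutation_name = match.group("name")
                    # Targets are written as a Python-style list of tuples, e.g. [(1, 25)]
                    targets = ast.literal_eval(match.group("targets"))
                continue
            name, chain = line.split(None, 1)
            chains[name] = chain.split("-")
    return mutation_name, targets, chains
```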
When using such a structure file with the `--input-structure` argument, MucOneUp will:
- Parse the mutation information from the header comment
- Apply the specified mutation to the targeted haplotypes and positions
- Generate output FASTA files with mutation information only in the headers of affected haplotypes
- Create output structure files that preserve the mutation information
This feature is particularly useful for:
- Reproducing specific known mutations for testing
- Generating consistent test data across multiple runs
- Creating benchmark datasets for variant calling tools
The muc_one_up Python package is organized into modules. Here is a brief summary:
muc_one_up/
├── cli.py # Main CLI logic and argument parsing (supports series simulation, dual mutation modes, toxic protein detection, simulation statistics, and read simulation)
├── config.py # Loads and validates the JSON configuration file
├── distribution.py # Samples the target VNTR length from a specified distribution
├── fasta_writer.py # Helper for writing FASTA files with support for per-haplotype mutation comments
├── mutate.py # Logic to apply specified mutations (including complex types like delete_insert) to targeted haplotypes
├── probabilities.py # Provides weighted random selections for repeat transitions
├── simulate.py # Core simulation code for building haplotypes (chains of repeats with terminal block insertion)
├── read_simulation.py # Integrates an external read simulation pipeline (using w‑Wessim2) to generate reads from simulated FASTA files
├── translate.py # Translates DNA to protein and performs ORF prediction using orfipy
├── toxic_protein_detector.py # Scans ORF FASTA outputs to detect toxic protein sequence features based on repeat structure and amino acid composition
├── simulation_statistics.py # **New Feature**: Generates comprehensive simulation statistics for each simulation run
└── __init__.py # Package initialization and version information
- **CLI (cli.py):**
  - Parses command-line arguments and loads the configuration.
  - Supports using predefined VNTR chains from structure files via `--input-structure`, including embedded mutation information.
  - If fixed-length ranges are provided, either picks a random value for each haplotype (default) or—if `--simulate-series` is specified—runs a simulation for every possible length (or combination of lengths for multiple haplotypes).
  - Simulates haplotypes via `simulate_diploid()` or `simulate_from_chains()` when using structure files.
  - Optionally applies mutations using `apply_mutations()` with precise control over targeted haplotypes and positions via `--mutation-targets`.
  - Dual simulation is supported when a comma-separated mutation name is provided.
  - Writes output files (FASTA, VNTR structure, ORFs) with numbered filenames and haplotype-specific mutation information in FASTA headers.
  - When ORF prediction is activated (via `--output-orfs`), the resulting ORF FASTA is further scanned for toxic protein features using `toxic_protein_detector.py`. The detection statistics are saved as a JSON file.
  - Generates a detailed simulation statistics report for each simulation iteration.
  - Optionally runs the read simulation pipeline.
- **Simulation (simulate.py):**
  - Constructs haplotypes by sampling repeats according to probability distributions.
  - Forces the final block of repeats (`6`/`6p` → `7` → `8` → `9`).
  - Appends left and right constant flanks from the config.
- **Mutations (mutate.py):**
  - Applies mutations (insertion, deletion, replacement, and combined deletion–insertion via `"delete_insert"`) at specified repeats.
  - Ensures that if the current repeat symbol isn’t allowed, it is changed to an allowed one.
  - Rebuilds the haplotype sequence and marks mutated repeats with an “m” suffix.
  - Records the mutated VNTR unit sequences for separate output.
- **Configuration (config.py):**
  - Loads and validates the configuration JSON against a predefined schema.
  - The config file includes definitions for repeats, constants, probabilities, length model, mutations, external tool commands, and read simulation settings.
MucOneUp provides two modes for handling mutations when the target repeat is not in the mutation's `allowed_repeats` list:
By default, when a mutation is applied to a repeat that isn't listed in the `allowed_repeats` for that mutation:
- The system automatically changes the target repeat to a randomly chosen repeat from the `allowed_repeats` list.
- A warning message is logged indicating this forced change.
- The simulation continues with the substituted repeat type.
This behavior ensures simulations complete successfully even when target repeats don't match what the mutation allows.
When you need more precise control, enable strict mode by setting `"strict_mode": true` in a mutation definition:
"mutations": {
"myMutation": {
"allowed_repeats": ["X", "C"],
"strict_mode": true,
"changes": [
// mutation changes here
]
}
}
With strict mode enabled:
- The system validates target repeats before applying mutations.
- If a target repeat isn't in the `allowed_repeats` list, an error is raised with a detailed message.
- The simulation stops instead of automatically changing the repeat type.
For explicitly specified mutation targets (using `--mutation-targets`):
- In non-strict mode: if the target repeat isn't in `allowed_repeats`, it is automatically converted to a random allowed repeat with a warning.
- In strict mode: if the target repeat isn't in `allowed_repeats`, an error is raised and the simulation stops.
For random mutation targets (when no explicit targets are provided):
- In both modes: the system only selects target positions that already have a repeat type from the `allowed_repeats` list.
- This ensures that even in strict mode, randomly selected targets won't cause pipeline failures (see the sketch of both modes below).
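The difference between the two modes can be sketched as follows. This is illustrative only; `resolve_target_repeat` and the exact log/error messages are assumptions, not MucOneUp's actual implementation in `mutate.py`.

```python
import logging
import random

logger = logging.getLogger("muc_one_up.example")

def resolve_target_repeat(current_symbol, allowed_repeats, strict_mode=False, rng=random):
    """Return the repeat symbol to mutate, applying the strict/non-strict policy described above."""
    if current_symbol in allowed_repeats:
        return current_symbol
    if strict_mode:
        raise ValueError(
            f"Target repeat '{current_symbol}' is not in allowed_repeats {sorted(allowed_repeats)}; "
            "strict_mode is enabled, aborting."
        )
    substitute = rng.choice(sorted(allowed_repeats))
    logger.warning("Target repeat '%s' not allowed; forcing change to '%s'.", current_symbol, substitute)
    return substitute

# Non-strict: 'G' is silently replaced by an allowed symbol (with a warning).
print(resolve_target_repeat("G", {"X", "C"}, strict_mode=False))
# Strict: the same call with strict_mode=True raises a ValueError instead.
```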
Strict mode is particularly useful when:
- Precision is critical: Ensure mutations are only applied to specific repeat types
- Debugging simulations: Catch configuration issues early rather than having silent substitutions
- Scientific rigor: Prevent automatic changes that could compromise experimental design
- Quality control: Verify that manually specified targets meet your configuration requirements
A simplified example:
{
"repeats": {
"1": "CACAGCATTCTTCTC...",
"2": "CTGAGTGGTGGAGGA...",
// ...
"9": "TGAGCCTGATGCAGA..."
},
"constants": {
"left": "ACGTACGTACGTACGT",
"right": "TGCAAGCTTTGCAAGC"
},
"probabilities": {
"1": { "2": 1.0 },
"2": { "3": 1.0 },
"3": { "4": 1.0 },
"4": { "5": 1.0 },
"5": { "X": 0.2, "C": 0.8 },
// ...
"9": { "END": 1.0 }
},
"length_model": {
"distribution": "normal",
"min_repeats": 20,
"max_repeats": 130,
"mean_repeats": 70,
"median_repeats": 65
},
"mutations": {
"dupC": {
"allowed_repeats": ["X", "C", "B", "A"],
"strict_mode": false, // Optional: When true, raises an error if target repeat isn't allowed
"changes": [
{
"type": "insert",
"start": 2,
"end": 3,
"sequence": "G"
}
]
},
"delinsAT": {
"allowed_repeats": ["C", "X"], // Must only contain valid repeat keys from 'repeats' section
"changes": [
{
"type": "delete_insert",
"start": 2,
"end": 4,
"sequence": "AT"
}
]
}
// Additional mutations...
},
"tools": {
"reseq": "mamba run --no-capture-output -n wessim reseq",
"faToTwoBit": "mamba run --no-capture-output -n wessim faToTwoBit",
"samtools": "mamba run --no-capture-output -n wessim samtools",
"pblat": "mamba run --no-capture-output -n wessim pblat",
"bwa": "mamba run --no-capture-output -n wessim bwa"
},
"read_simulation": {
"reseq_model": "reference/Hs-Nova-TruSeq.reseq",
"sample_bam": "/path/to/sample.bam",
"human_reference": "/path/to/GRCh38.fna.gz",
"read_number": 10000,
"fragment_size": 250,
"fragment_sd": 35,
"min_fragment": 20,
"threads": 8
},
"nanosim_params": {
"training_data_path": "reference/nanosim/human_giab_hg002_sub1M_kitv14_dorado_v3.2.1",
"coverage": 30,
"read_type": "ONT",
"min_read_length": 100,
"max_read_length": 100000,
"threads": 8
}
}
Note: When using the reference installation helper (see above), update your configuration to reference the absolute paths of the downloaded files (as indicated in the `installed_references.json` file).
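To make the `probabilities` section of the config above concrete, here is a small illustrative sketch of how a repeat chain could be sampled from such transition weights. It is a simplification of what `simulate.py` does: the toy transition table below is an assumption, and it omits the forced `6`/`6p` → `7` → `8` → `9` terminal block and the length model.

```python
import random

def sample_chain(probabilities, max_steps=500, rng=random):
    """Walk the repeat-transition weights from symbol '1' until the END state is reached."""
    chain, current = ["1"], "1"
    for _ in range(max_steps):
        transitions = probabilities[current]
        symbols, weights = zip(*transitions.items())
        current = rng.choices(symbols, weights=weights, k=1)[0]
        if current == "END":
            break
        chain.append(current)
    return chain

# Toy transition table (illustrative only; a real config also routes through the terminal block).
toy_probabilities = {
    "1": {"2": 1.0}, "2": {"3": 1.0}, "3": {"4": 1.0}, "4": {"5": 1.0},
    "5": {"X": 0.2, "C": 0.8},
    "X": {"X": 0.7, "C": 0.2, "9": 0.1},
    "C": {"X": 0.8, "9": 0.2},
    "9": {"END": 1.0},
}
print("-".join(sample_chain(toy_probabilities, rng=random.Random(7))))
```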
This project is released under the MIT License. See LICENSE for details.