MucOneUp is a Python tool for simulating MUC1 VNTR diploid references. It builds customized references that:
- Generate haplotypes containing a variable-length VNTR region using a probability model or fixed lengths.
- Force a canonical terminal block (`6` or `6p` → `7` → `8` → `9`) before appending the right-hand constant.
- Optionally introduce mutations (inserts, deletes, or replacements) in selected repeats with precise control over which haplotypes are affected.
- Support structure files with mutation information, allowing for reproducible generation of specific mutated VNTR structures.
- Generate a series of simulations when fixed-length ranges are provided (via the `--simulate-series` flag), so that a simulation is run for each possible length (or combination of lengths for multiple haplotypes).
- Run dual simulations (normal and mutated) when a comma-separated mutation name is provided.
- **Detect Toxic Protein Features:** When ORF prediction is activated (via the `--output-orfs` flag), the tool scans the resulting ORF FASTA file for toxic protein sequence features by analyzing the repeat structure and amino acid composition of the variable region. A quantitative "deviation" (or similarity) score is computed relative to a wild-type model, and if the overall score exceeds a user-defined cutoff, the protein is flagged as toxic.
- **Generate Comprehensive Simulation Statistics:** For each simulation run, detailed statistics are generated—including simulation runtime, haplotype-level metrics (repeat counts, VNTR lengths, GC content, repeat length summaries, and mutation details), as well as overall aggregated statistics. In dual simulation mode, separate reports are produced for the normal and mutated outputs.
Additionally, MucOneUp supports multiple read simulation pipelines:
For Illumina reads, we use a port of w‑Wessim2. This pipeline simulates reads from the simulated FASTA by:
- Replacing Ns and generating systematic errors using external tools
- Converting the FASTA to 2bit format
- Extracting a subset reference from a sample BAM
- Running pblat for alignment
- Generating fragments and creating reads using the w‑Wessim2 port
- Splitting interleaved FASTQ into paired FASTQ files and aligning to a human reference
For Oxford Nanopore (ONT) reads, we integrate NanoSim to generate realistic long reads with the error profiles characteristic of nanopore sequencing. This pipeline:
- Uses pre-trained models specific to ONT technologies
- Simulates long reads with realistic error profiles
- Aligns reads to the reference using minimap2
- Generates alignment files in BAM format
- Installation
- Installing Required Reference Files
- Quick Start
- Usage
- Toxic Protein Detection
- Simulation Statistics
- Read Simulation Integration
- Project Structure and Logic
- License
- Clone or download this repository.
- Install in a Python 3.7+ environment:
  pip install .

  This will install the `muc_one_up` Python package locally.
Optional (Conda/Mamba Environments for Read Simulation):
For Illumina read simulation with w-Wessim2, create the environment:
mamba env create -f conda/env_wessim.yaml

For ONT read simulation with NanoSim, create the environment:

mamba env create -f conda/env_nanosim.yaml

After creating these environments, update the `tools` section in your configuration file to reference the executables from the newly created environments. Alternatively, you can install the tools locally.
Once installed, you’ll have a command-line program called `muconeup` available.
In order to run the read simulation pipeline, MucOneUp requires several external reference files (such as a human reference FASTA and a reseq model file). A helper script is provided to automate the download, extraction, and (optional) indexing of these reference files. Follow these steps to install the required references and update your configuration accordingly:
The helper script lives in the `helpers` directory and downloads the required reference files based on a JSON configuration (by default, `helpers/install_references_config.json`).
For example, to install the references into a directory named `./references`, run:
python helpers/install_references.py --output-dir ./references
This command will:
- Download the reference files (e.g. the GRCh38 human reference and a reseq model file).
- Verify the downloads via MD5 checksum.
- Extract any gzip-compressed files automatically.
- Index FASTA files using BWA if an indexing command is provided in the configuration.
(If you prefer not to index the FASTA files automatically, use the `--skip-indexing` flag.)
After the script completes, an `installed_references.json` file is generated in the output directory. This file maps each reference name to its absolute file path.
Once the references are installed, update your MucOneUp configuration file (`config.json`) so that the fields in the `tools` and `read_simulation` sections point to the correct local paths. For example, if your reference FASTA file (e.g. GRCh38) is now located at:
/path/to/references/GRCh38_no_alt_analysis_set.fna.gz
update the `human_reference` field in the `read_simulation` section accordingly. Similarly, update the path for the reseq model file and any other reference paths you are using. For example:
{
"tools": {
"reseq": "mamba run --no-capture-output -n env_wessim reseq",
"faToTwoBit": "mamba run --no-capture-output -n env_wessim faToTwoBit",
"samtools": "mamba run --no-capture-output -n env_wessim samtools",
"pblat": "mamba run --no-capture-output -n env_wessim pblat",
"bwa": "mamba run --no-capture-output -n env_wessim bwa",
"nanosim": "mamba run --no-capture-output -n env_nanosim simulator.py",
"minimap2": "mamba run --no-capture-output -n env_nanosim minimap2"
},
"read_simulation": {
"reseq_model": "/path/to/references/Hs-Nova-TruSeq.reseq",
"sample_bam": "data/twist_v2.bam",
"human_reference": "/path/to/references/GRCh38_no_alt_analysis_set.fna.gz",
"read_number": 10000,
"fragment_size": 250,
"fragment_sd": 35,
"min_fragment": 20,
"threads": 8,
"downsample_coverage": 200,
"downsample_seed": 42,
"reference_assembly": "hg38",
"vntr_region_hg19": "chr1:155160500-155162000",
"vntr_region_hg38": "chr1:155188000-155192500"
}
}
Be sure to replace `/path/to/references` with the absolute path where your references were installed (as reported in the `installed_references.json` file).
Before running the full simulation or read simulation pipeline, check that the reference files are accessible and correctly referenced in your config. Review the contents of the `installed_references.json` file and ensure that each required file exists at the specified path.
- Create or update a JSON config (see Config File Layout) describing your repeats, probabilities, mutations, etc.
- Run the tool by specifying your config along with desired parameters. For example:
muconeup --config config.json --out-base muc1_simulated --output muc1_simulated.fa --output-structure muc1_struct.txt
- Inspect the resulting outputs:
  - `muc1_simulated.fa`: Multi-FASTA file of haplotype sequences.
  - `muc1_struct.txt`: Textual representation of each haplotype’s chain of repeats.
  - `simulation_stats.json` (or with variant suffixes in dual simulation mode): JSON file(s) containing detailed simulation statistics.
Below are the available command-line arguments. Use `muconeup --help` for more details.
| Argument | Description |
|---|---|
| `--config <path>` | Required. Path to the JSON config file containing repeats, probabilities, constants, length model, mutations, tools, and read simulation settings. |
| `--out-base <basename>` | Base name for all output files. All outputs (simulation FASTA, VNTR structure, ORF FASTA, read simulation outputs, and ORF toxic protein statistics) will be named using this base. Default is `muc1_simulated`. |
| `--out-dir <folder>` | Output folder where all files will be written. Defaults to the current directory. |
| `--num-haplotypes N` | Number of haplotypes to simulate. Typically 2 for diploid. Defaults to 2. |
| `--fixed-lengths <vals>` | One or more fixed lengths (or ranges) for each haplotype’s VNTR repeats. Values may be a single integer (e.g. `60`) or a range (e.g. `20-40`). When a range is provided, the default behavior is to pick one value at random from each range. Use the `--simulate-series` flag (see below) to run a simulation for every value (or combination) in the range. |
| `--simulate-series` | (Optional) When specified and fixed-length ranges are provided, the program will generate a simulation iteration for every possible length (or combination of lengths for multiple haplotypes) instead of choosing a single random value. This flag is useful when you want to explore the entire parameter space. |
| `--seed <int>` | Random seed for reproducible simulations (affects VNTR building and mutation target selection). |
| `--mutation-name <str>` | (Optional) Name of a mutation from the config to apply. To run dual simulations (normal and mutated), provide a comma-separated pair (e.g. `normal,dupC`). If a single value is provided, only one simulation is mutated. |
| `--mutation-targets <pairs>` | (Optional) One or more `haplotype_index,repeat_index` pairs (1-based), e.g. `1,25 2,30`. If provided, each pair indicates which haplotype and repeat position to mutate. If omitted, the mutation is applied at a random allowed repeat. Only haplotypes specified in these targets will have mutation information in their FASTA headers. |
| `--input-structure <file>` | (Optional) Path to a structure file containing predefined VNTR repeat chains. If the structure file includes a header comment with mutation information (e.g. `# Mutation Applied: dupC (Targets: [(1, 25)])`), that information will be used to apply mutations to the specific haplotypes and positions instead of using random targets or CLI-specified targets. |
| `--output-structure` | (Optional) If provided, output a VNTR structure file (text) listing the chain of repeats for each haplotype. |
| `--output-orfs` | (Optional) If provided, run ORF prediction and output an ORF FASTA file using the normalized naming scheme. Additionally, the resulting ORF file will be scanned for toxic protein sequence features and a JSON statistics file is generated. |
| `--orf-min-aa <int>` | Minimum peptide length (in amino acids) to report from ORF prediction. Defaults to 100. |
| `--orf-aa-prefix <str>` | (Optional) Filter resulting peptides to only those beginning with this prefix. If used without a value, defaults to `MTSSV`. If omitted, no prefix filtering is applied. |
| `--simulate-reads` | (Optional) If provided, run the read simulation pipeline on the simulated FASTA. This pipeline produces an aligned/indexed BAM and gzipped paired FASTQ files. |
| `--random-snps` | (Optional) If provided, generate random SNPs and integrate them into the simulated sequences. |
| `--random-snp-density <float>` | Density of random SNPs to generate (SNPs per kilobase). Defaults to 1.0. |
| `--random-snp-output-file <path>` | (Optional) Path to a file where generated random SNPs will be saved in TSV format. |
| `--snp-file <path>` | (Optional) Path to a TSV file containing predefined SNPs to integrate into the simulated sequences. |
| `--log-level <level>` | Set logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL, NONE). Default is INFO. |
- Generate two diploid haplotypes with random VNTR lengths (sampled from the length model defined in your config):

  muconeup --config config.json --out-base muc1_sim --output muc1_sim.fa --output-structure muc1_struct.txt

- Force a fixed length of 60 repeats for each haplotype:

  muconeup --config config.json --out-base muc1_fixed --output muc1_fixed.fa --output-structure muc1_fixed.txt --num-haplotypes 2 --fixed-lengths 60

- Generate a single simulation using a fixed-length range (a random value is chosen from the range for each haplotype):

  muconeup --config config.json --out-base muc1_random_range --fixed-lengths 20-40

- Apply a specific mutation to targeted positions in specific haplotypes:

  muconeup --config config.json --out-base muc1_mutated --mutation-name dupC --mutation-targets 1,25 2,30

  This applies the `dupC` mutation to haplotype 1 at repeat position 25 and to haplotype 2 at repeat position 30. The FASTA header for each mutated haplotype will include the mutation information.

- Use a structure file with mutation information for reproducible generation of specific VNTR structures:

  muconeup --config config.json --out-base muc1_from_structure --input-structure structure_file.txt

  where the structure file contains mutation information in a header comment:

  # Mutation Applied: dupC (Targets: [(1, 25)])
  haplotype_1 1-2-3-4-5-C-X-B-X-X-X-X-X-X-X-X-V-G-B-X-X-G-A-B-Xm-X-X-X-A-A-A-B-X-D-E-C-6-7-8-9
  haplotype_2 1-2-3-4-5-C-X-A-B-X-X-X-V-G-A-A-A-B-B-X-X-X-X-X-X-X-X-X-X-X-X-X-V-V-V-V-V-V-V-V-V-G-A-B-B-X-A-A-N-R-X-X-X-X-A-B-6p-7-8-9

  Note that the mutated repeat position is marked with an "m" suffix in the structure file (e.g. `Xm`). The output FASTA will include mutation information only in the header of haplotype 1, which is the only one with a mutation according to the targets.

- Generate a series of simulations using a fixed-length range (each possible value in the range, or combination thereof, produces an output file):

  muconeup --config config.json --out-base muc1_series --simulate-series --fixed-lengths 20-40

- Apply a known mutation (`dupC`) at a specific repeat (haplotype #1, repeat #5):

  muconeup --config config.json --out-base muc1_mutated --mutation-name dupC --mutation-targets 1,5

- Run dual simulation (normal and mutated) with a mutation (`dupC`):

  muconeup --config config.json --out-base muc1_dual --mutation-name normal,dupC

- Apply a known mutation (`snpA`) to a random allowed repeat:

  muconeup --config config.json --out-base muc1_random_mut --mutation-name snpA
When ORF prediction is enabled (using the `--output-orfs` flag), MucOneUp automatically scans the resulting ORF FASTA file for toxic protein sequence features. This feature works as follows:
- **Extracting the Variable Region:** The ORF sequence is trimmed by removing the constant left/right flanks (if provided in the configuration), isolating the variable region (i.e. the repeat region).
- **Detecting and Quantifying Repeats:** A sliding window—whose length equals that of a consensus repeat motif (e.g. `"RCHLGPGHQAGPGLHR"`)—moves across the variable region. For each window, similarity to the motif is computed using a simple Hamming-distance approach, and windows exceeding a set identity threshold (e.g. 80%) are counted as valid repeats. From these, the number of repeats and the average repeat identity score are calculated.
- **Amino Acid Composition Analysis:** The tool computes the frequency of key residues (such as R, C, and H) in the variable region and compares these frequencies to a wild-type model (by default, an approximation generated by repeating the consensus motif). A composition similarity score is calculated as

  S_composition = 1 - (sum(|f_mut - f_wt|) / sum(f_wt))

  where a score near 1 indicates high similarity (i.e. normal) and lower scores indicate divergence (i.e. toxicity).
- **Combining Metrics:** The overall similarity (or deviation) score is computed as a weighted sum of the repeat score and the composition similarity. In this implementation, a higher overall score indicates divergence from the wild-type (i.e. a toxic protein), while a lower score indicates similarity to the wild-type (normal). If the overall score exceeds a user-defined toxic detection cutoff (e.g. 0.5), the ORF is flagged as toxic. A minimal sketch of the repeat-window and composition scoring steps follows after this list.
- **Output:** The detection metrics for each ORF (including repeat count, average repeat identity, repeat score, composition similarity, overall score, and the toxic flag) are saved in a JSON file (with the file extension `orf_stats.txt`) alongside the ORF FASTA output.
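The two scoring steps above can be pictured with a short, self-contained Python sketch. This is an illustrative simplification, not MucOneUp's `toxic_protein_detector.py`: the consensus motif, the 80% identity threshold, the key-residue set, and the toy sequences are assumptions taken from the description above.

```python
from collections import Counter

CONSENSUS = "RCHLGPGHQAGPGLHR"   # example consensus motif from the description above
KEY_RESIDUES = ("R", "C", "H")   # key residues compared against the wild-type model


def window_identity(window: str, motif: str) -> float:
    """Fraction of matching positions between a window and the motif (1 - normalized Hamming distance)."""
    matches = sum(1 for a, b in zip(window, motif) if a == b)
    return matches / len(motif)


def repeat_metrics(variable_region: str, motif: str = CONSENSUS, identity_threshold: float = 0.8):
    """Slide a motif-length window over the variable region and count high-identity windows."""
    k = len(motif)
    identities = [
        window_identity(variable_region[i:i + k], motif)
        for i in range(len(variable_region) - k + 1)
    ]
    valid = [s for s in identities if s >= identity_threshold]
    repeat_count = len(valid)
    avg_identity = sum(valid) / repeat_count if repeat_count else 0.0
    return repeat_count, avg_identity


def composition_similarity(variable_region: str, wild_type: str, residues=KEY_RESIDUES) -> float:
    """S_composition = 1 - sum(|f_mut - f_wt|) / sum(f_wt), computed over the key residues."""
    def freqs(seq):
        counts = Counter(seq)
        total = len(seq) or 1
        return {aa: counts.get(aa, 0) / total for aa in residues}

    f_mut, f_wt = freqs(variable_region), freqs(wild_type)
    denom = sum(f_wt.values()) or 1.0
    return 1.0 - sum(abs(f_mut[aa] - f_wt[aa]) for aa in residues) / denom


if __name__ == "__main__":
    # Wild-type approximation: the consensus motif repeated (as described above).
    wild_type_model = CONSENSUS * 20
    variable_region = CONSENSUS * 18 + "RRRRCCCCHHHH"  # toy "mutated" region
    count, avg_id = repeat_metrics(variable_region)
    s_comp = composition_similarity(variable_region, wild_type_model)
    print(f"repeats={count}, avg_identity={avg_id:.2f}, composition_similarity={s_comp:.2f}")
```

In the actual tool, the repeat score and composition similarity are then combined into a weighted overall score and compared against the configured toxic detection cutoff.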
A new feature generates a comprehensive statistics report for each simulation run. This report includes:
- Runtime: Total simulation runtime in seconds.
- Haplotype-Level Metrics: For each simulated haplotype, the report details the number of repeats, VNTR region length, GC content, individual repeat lengths (with min, max, and average), repeat type counts, and mutation details.
- Aggregated Metrics: Overall statistics aggregated from all haplotypes.
- Dual Simulation Reporting: In dual mutation mode, separate statistics reports are produced for the normal and mutated outputs.
The statistics are saved as JSON files (e.g., `muc1_simulated.002.simulation_stats.json.normal` and `muc1_simulated.002.simulation_stats.json.mut`).
The read simulation pipeline simulates Illumina reads from the generated FASTA files. This pipeline leverages external tools (reseq, faToTwoBit, samtools, pblat, bwa) and incorporates a port of w‑Wessim2 to:
- Replace Ns in the FASTA.
- Generate systematic errors and convert the FASTA to 2bit format.
- Extract a subset reference from a sample BAM.
- Align the 2bit file to the subset reference using pblat.
- Simulate fragments and create reads using the w‑Wessim2 port.
- Split the interleaved FASTQ into paired FASTQ files.
- Align the reads to a human reference.
MucOneUp supports the integration of Single Nucleotide Polymorphisms (SNPs) into simulated sequences. This feature allows for more realistic simulations by incorporating natural genetic variation. SNPs can be integrated in two ways:
- From a predefined file using the `--snp-file` parameter.
- Generated randomly using the `--random-snps` parameter and a specified density.
The SNP file format is tab-separated (TSV) with the following columns:
- haplotype (1-based): The haplotype index (1 or 2 for diploid)
- position (0-based): Position in the haplotype sequence
- ref_base: Expected reference base at the position
- alt_base: Alternative base to introduce
Example SNP file content:
haplotype position ref_base alt_base
1 125 A G
2 236 C T
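For illustration, such a TSV could be parsed and applied to haplotype sequences roughly as follows. This is a minimal sketch, not MucOneUp's internal code; the helper names are assumptions, and the `skip_reference_check` flag mirrors the behaviour described later in this section.

```python
import csv

def load_snps(tsv_path):
    """Read a MucOneUp-style SNP TSV: haplotype (1-based), position (0-based), ref_base, alt_base."""
    snps = []
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            snps.append((int(row["haplotype"]), int(row["position"]),
                         row["ref_base"].upper(), row["alt_base"].upper()))
    return snps

def apply_snps(haplotypes, snps, skip_reference_check=False):
    """Return haplotype sequences with SNPs applied; optionally skip the ref-base sanity check."""
    seqs = [list(seq) for seq in haplotypes]
    for hap, pos, ref, alt in snps:
        current = seqs[hap - 1][pos]
        if not skip_reference_check and current.upper() != ref:
            raise ValueError(f"haplotype {hap} pos {pos}: expected {ref}, found {current}")
        seqs[hap - 1][pos] = alt
    return ["".join(s) for s in seqs]

# Example usage (with two haplotype sequences hap1_seq and hap2_seq already in memory):
#   snps = load_snps("my_snps.tsv")
#   hap1_seq, hap2_seq = apply_snps([hap1_seq, hap2_seq], snps)
```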
When using `--random-snps`, MucOneUp will:
- Generate random SNPs based on the specified density (SNPs per kilobase).
- Ensure SNPs are distributed across both haplotypes.
- Save the generated SNPs to a file if `--random-snp-output-file` is provided (see the sketch below).
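A minimal sketch of what density-based SNP generation can look like (illustrative only; the function name and the way the alternative allele is chosen here are assumptions, and a fuller implementation would read the reference base at each position so the alt allele always differs from it):

```python
import random

def generate_random_snps(haplotype_lengths, density_per_kb=1.0, seed=None):
    """Pick random SNP positions at roughly `density_per_kb` SNPs per kilobase for each haplotype.

    Returns (haplotype_1based, position_0based, alt_base) tuples; a real implementation would
    also look up the reference base at each position to pick a differing alt allele.
    """
    rng = random.Random(seed)
    bases = "ACGT"
    snps = []
    for hap_index, length in enumerate(haplotype_lengths, start=1):
        n_snps = max(1, round(length / 1000 * density_per_kb))
        for pos in sorted(rng.sample(range(length), min(n_snps, length))):
            snps.append((hap_index, pos, rng.choice(bases)))
    return snps

# Example: two diploid haplotypes of ~5 kb each at 0.5 SNPs per kilobase.
print(generate_random_snps([5000, 5200], density_per_kb=0.5, seed=42))
```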
In dual mutation mode (using `--mutation-name normal,dupC`), SNPs are applied to both the normal and mutated sequences. For mutated sequences, the `skip_reference_check` option is automatically enabled, allowing SNPs to be applied even when mutations have altered the original reference bases.
This is particularly useful when you want to simulate scenarios where both structural mutations and SNPs are present in the sample, providing a more realistic representation of genetic diversity.
# Generate random SNPs with specified density
muconeup --config config.json --out-base muc1_with_snps --random-snps --random-snp-density 0.5 --random-snp-output-file output/muc1_random_snps.tsv
# Apply SNPs from a predefined file
muconeup --config config.json --out-base muc1_with_predefined_snps --snp-file my_snps.tsv
# Combine dual mutation mode with SNP integration
muconeup --config config.json --out-base muc1_dual_with_snps --mutation-name normal,dupC --random-snps
MucOneUp now supports structure files that contain mutation information embedded in header comments. This allows for reproducible generation of specific mutated VNTR structures.
A structure file with mutation information has the following format:
# Mutation Applied: <mutation_name> (Targets: [(<haplotype_index>, <position>), ...])
haplotype_1 <repeat_chain_with_m_suffix_for_mutated_positions>
haplotype_2 <repeat_chain>
Example:
# Mutation Applied: dupC (Targets: [(1, 25)])
haplotype_1 1-2-3-4-5-C-X-B-X-X-X-X-X-X-X-X-V-G-B-X-X-G-A-B-Xm-X-X-X-A-A-A-B-X-D-E-C-6-7-8-9
haplotype_2 1-2-3-4-5-C-X-A-B-X-X-X-V-G-A-A-A-B-B-X-X-X-X-X-X-X-X-X-X-X-X-X-V-V-V-V-V-V-V-V-V-G-A-B-B-X-A-A-N-R-X-X-X-X-A-B-6p-7-8-9
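For illustration, a structure file in this layout could be parsed with a few lines of Python. This is a minimal sketch under the format assumptions shown above, not the parser MucOneUp ships.

```python
import ast
import re

MUTATION_HEADER = re.compile(
    r"^#\s*Mutation Applied:\s*(?P<name>\S+)\s*\(Targets:\s*(?P<targets>\[.*\])\)"
)

def parse_structure_file(path):
    """Parse a structure file into (mutation_name, targets, {haplotype_name: [repeat symbols]})."""
    mutation_name, targets, chains = None, [], {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith("#"):
                match = MUTATION_HEADER.match(line)
                if match:
                    mutation_name = match.group("name")
                    # Targets are written as a Python-style list of tuples, e.g. [(1, 25)]
                    targets = ast.literal_eval(match.group("targets"))
                continue
            name, chain = line.split(None, 1)
            chains[name] = chain.split("-")
    return mutation_name, targets, chains
```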
When using such a structure file with the `--input-structure` argument, MucOneUp will:
- Parse the mutation information from the header comment
- Apply the specified mutation to the targeted haplotypes and positions
- Generate output FASTA files with mutation information only in the headers of affected haplotypes
- Create output structure files that preserve the mutation information
This feature is particularly useful for:
- Reproducing specific known mutations for testing
- Generating consistent test data across multiple runs
- Creating benchmark datasets for variant calling tools
The muc_one_up Python package is organized into modules. Here is a brief summary:
muc_one_up/
├── cli.py # Main CLI logic and argument parsing (supports series simulation, dual mutation modes, toxic protein detection, simulation statistics, and read simulation)
├── config.py # Loads and validates the JSON configuration file
├── distribution.py # Samples the target VNTR length from a specified distribution
├── fasta_writer.py # Helper for writing FASTA files with support for per-haplotype mutation comments
├── mutate.py # Logic to apply specified mutations (including complex types like delete_insert) to targeted haplotypes
├── probabilities.py # Provides weighted random selections for repeat transitions
├── simulate.py # Core simulation code for building haplotypes (chains of repeats with terminal block insertion)
├── read_simulation.py # Integrates an external read simulation pipeline (using w‑Wessim2) to generate reads from simulated FASTA files
├── translate.py # Translates DNA to protein and performs ORF prediction using orfipy
├── toxic_protein_detector.py # Scans ORF FASTA outputs to detect toxic protein sequence features based on repeat structure and amino acid composition
├── simulation_statistics.py # **New Feature**: Generates comprehensive simulation statistics for each simulation run
└── __init__.py # Package initialization and version information
- **CLI (cli.py):**
  - Parses command-line arguments and loads the configuration.
  - Supports using predefined VNTR chains from structure files via `--input-structure`, including embedded mutation information.
  - If fixed-length ranges are provided, either picks a random value for each haplotype (default) or—if `--simulate-series` is specified—runs a simulation for every possible length (or combination of lengths for multiple haplotypes).
  - Simulates haplotypes via `simulate_diploid()` or `simulate_from_chains()` when using structure files.
  - Optionally applies mutations using `apply_mutations()` with precise control over targeted haplotypes and positions via `--mutation-targets`.
  - Dual simulation is supported when a comma-separated mutation name is provided.
  - Writes output files (FASTA, VNTR structure, ORFs) with numbered filenames and haplotype-specific mutation information in FASTA headers.
  - When ORF prediction is activated (via `--output-orfs`), the resulting ORF FASTA is further scanned for toxic protein features using `toxic_protein_detector.py`. The detection statistics are saved as a JSON file.
  - Generates a detailed simulation statistics report for each simulation iteration.
  - Optionally runs the read simulation pipeline.
- **Simulation (simulate.py):**
  - Constructs haplotypes by sampling repeats according to probability distributions.
  - Forces the final block of repeats (`6`/`6p` → `7` → `8` → `9`).
  - Appends left and right constant flanks from the config.
- **Mutations (mutate.py):**
  - Applies mutations (insertion, deletion, replacement, and combined deletion–insertion via `"delete_insert"`) at specified repeats.
  - Ensures that if the current repeat symbol isn’t allowed, it is changed to an allowed one.
  - Rebuilds the haplotype sequence and marks mutated repeats with an “m” suffix.
  - Records the mutated VNTR unit sequences for separate output.
- **Configuration (config.py):**
  - Loads and validates the configuration JSON against a predefined schema.
  - The config file includes definitions for repeats, constants, probabilities, length model, mutations, external tool commands, and read simulation settings.
MucOneUp provides two modes for handling mutations when the target repeat is not in the mutation's `allowed_repeats` list:
By default, when a mutation is applied to a repeat that isn't listed in the `allowed_repeats` for that mutation:
- The system automatically changes the target repeat to a randomly chosen repeat from the `allowed_repeats` list.
- A warning message is logged indicating this forced change.
- The simulation continues with the substituted repeat type.
This behavior ensures simulations complete successfully even when target repeats don't match what the mutation allows.
When you need more precise control, enable strict mode by setting `"strict_mode": true` in a mutation definition:
"mutations": {
"myMutation": {
"allowed_repeats": ["X", "C"],
"strict_mode": true,
"changes": [
// mutation changes here
]
}
}
With strict mode enabled:
- The system validates target repeats before applying mutations.
- If a target repeat isn't in the `allowed_repeats` list, an error is raised with a detailed message.
- The simulation stops instead of automatically changing the repeat type.
For explicitly specified mutation targets (using `--mutation-targets`):
- In non-strict mode: if the target repeat isn't in `allowed_repeats`, it is automatically converted to a random allowed repeat with a warning.
- In strict mode: if the target repeat isn't in `allowed_repeats`, an error is raised and the simulation stops.
For random mutation targets (when no explicit targets are provided):
- In both modes: the system only selects target positions that already have a repeat type from the `allowed_repeats` list.
- This ensures that even in strict mode, randomly selected targets won't cause pipeline failures (see the sketch of both modes below).
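The difference between the two modes can be sketched as follows. This is illustrative only; `resolve_target_repeat` and the exact log/error messages are assumptions, not MucOneUp's actual implementation in `mutate.py`.

```python
import logging
import random

logger = logging.getLogger("muc_one_up.example")

def resolve_target_repeat(current_symbol, allowed_repeats, strict_mode=False, rng=random):
    """Return the repeat symbol to mutate, applying the strict/non-strict policy described above."""
    if current_symbol in allowed_repeats:
        return current_symbol
    if strict_mode:
        raise ValueError(
            f"Target repeat '{current_symbol}' is not in allowed_repeats {sorted(allowed_repeats)}; "
            "strict_mode is enabled, aborting."
        )
    substitute = rng.choice(sorted(allowed_repeats))
    logger.warning("Target repeat '%s' not allowed; forcing change to '%s'.", current_symbol, substitute)
    return substitute

# Non-strict: 'G' is silently replaced by an allowed symbol (with a warning).
print(resolve_target_repeat("G", {"X", "C"}, strict_mode=False))
# Strict: the same call with strict_mode=True raises a ValueError instead.
```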
Strict mode is particularly useful when:
- Precision is critical: Ensure mutations are only applied to specific repeat types
- Debugging simulations: Catch configuration issues early rather than having silent substitutions
- Scientific rigor: Prevent automatic changes that could compromise experimental design
- Quality control: Verify that manually specified targets meet your configuration requirements
A simplified example:
{
"repeats": {
"1": "CACAGCATTCTTCTC...",
"2": "CTGAGTGGTGGAGGA...",
// ...
"9": "TGAGCCTGATGCAGA..."
},
"constants": {
"left": "ACGTACGTACGTACGT",
"right": "TGCAAGCTTTGCAAGC"
},
"probabilities": {
"1": { "2": 1.0 },
"2": { "3": 1.0 },
"3": { "4": 1.0 },
"4": { "5": 1.0 },
"5": { "X": 0.2, "C": 0.8 },
// ...
"9": { "END": 1.0 }
},
"length_model": {
"distribution": "normal",
"min_repeats": 20,
"max_repeats": 130,
"mean_repeats": 70,
"median_repeats": 65
},
"mutations": {
"dupC": {
"allowed_repeats": ["X", "C", "B", "A"],
"strict_mode": false, // Optional: When true, raises an error if target repeat isn't allowed
"changes": [
{
"type": "insert",
"start": 2,
"end": 3,
"sequence": "G"
}
]
},
"delinsAT": {
"allowed_repeats": ["C", "X"], // Must only contain valid repeat keys from 'repeats' section
"changes": [
{
"type": "delete_insert",
"start": 2,
"end": 4,
"sequence": "AT"
}
]
}
// Additional mutations...
},
"tools": {
"reseq": "mamba run --no-capture-output -n wessim reseq",
"faToTwoBit": "mamba run --no-capture-output -n wessim faToTwoBit",
"samtools": "mamba run --no-capture-output -n wessim samtools",
"pblat": "mamba run --no-capture-output -n wessim pblat",
"bwa": "mamba run --no-capture-output -n wessim bwa"
},
"read_simulation": {
"reseq_model": "reference/Hs-Nova-TruSeq.reseq",
"sample_bam": "/path/to/sample.bam",
"human_reference": "/path/to/GRCh38.fna.gz",
"read_number": 10000,
"fragment_size": 250,
"fragment_sd": 35,
"min_fragment": 20,
"threads": 8
},
"nanosim_params": {
"training_data_path": "reference/nanosim/human_giab_hg002_sub1M_kitv14_dorado_v3.2.1",
"coverage": 30,
"read_type": "ONT",
"min_read_length": 100,
"max_read_length": 100000,
"threads": 8
}
}
Note: When using the reference installation helper (see above), update your configuration to reference the absolute paths of the downloaded files (as indicated in the `installed_references.json` file).
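To make the `probabilities` section of the config above concrete, here is a small illustrative sketch of how a repeat chain could be sampled from such transition weights. It is a simplification of what `simulate.py` does: the toy transition table below is an assumption, and it omits the forced `6`/`6p` → `7` → `8` → `9` terminal block and the length model.

```python
import random

def sample_chain(probabilities, max_steps=500, rng=random):
    """Walk the repeat-transition weights from symbol '1' until the END state is reached."""
    chain, current = ["1"], "1"
    for _ in range(max_steps):
        transitions = probabilities[current]
        symbols, weights = zip(*transitions.items())
        current = rng.choices(symbols, weights=weights, k=1)[0]
        if current == "END":
            break
        chain.append(current)
    return chain

# Toy transition table (illustrative only; a real config also routes through the terminal block).
toy_probabilities = {
    "1": {"2": 1.0}, "2": {"3": 1.0}, "3": {"4": 1.0}, "4": {"5": 1.0},
    "5": {"X": 0.2, "C": 0.8},
    "X": {"X": 0.7, "C": 0.2, "9": 0.1},
    "C": {"X": 0.8, "9": 0.2},
    "9": {"END": 1.0},
}
print("-".join(sample_chain(toy_probabilities, rng=random.Random(7))))
```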
This project is released under the MIT License. See LICENSE for details.