Create a testing viral metagenomics dataset #7

Open
5 tasks
saramonzon opened this issue Mar 23, 2025 · 0 comments
Labels
enhancement New feature or request

Description:

We need to generate a small, controlled in silico test dataset for the viral metagenomics pipeline. This will be used for validating functionality and benchmarking performance.

The dataset should include:

  • A metadata.csv file with a list of organisms (viruses, bacteria, fungi, and host genomes like human or other mammals)
  • Simulated FASTQ files for both Illumina and Nanopore sequencing platforms
  • The expected percentage of reads from each organism in every sample (to later test detection and quantification accuracy)

The goal is to create synthetic samples that closely resemble real-world metagenomic data in complexity and composition.


Deliverables:

  • metadata.csv with organism names and assigned abundance percentages per sample
  • Paired-end FASTQ files for Illumina
  • Single-end FASTQ files (or appropriate format) for Nanopore
  • Clear documentation on how the data was generated (tools, parameters, etc.)
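
As a starting point for discussion, here is a minimal sketch of one possible `metadata.csv` layout: long format, one row per sample and organism. The column names (`sample`, `organism`, `category`, `expected_pct`), the organisms, and the percentages are illustrative assumptions, not a decided schema. The helper checks that the expected percentages of every sample add up to 100.

```python
import csv
import io

# Hypothetical columns and values, for illustration only.
FIELDS = ["sample", "organism", "category", "expected_pct"]
ROWS = [
    {"sample": "sample1", "organism": "Homo sapiens", "category": "host", "expected_pct": 90.0},
    {"sample": "sample1", "organism": "Escherichia coli", "category": "bacteria", "expected_pct": 8.0},
    {"sample": "sample1", "organism": "Human mastadenovirus C", "category": "virus", "expected_pct": 2.0},
]


def write_metadata(rows, handle):
    """Write the rows as CSV with a header line to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def check_sample_totals(rows, tolerance=1e-6):
    """Return {sample: total} for samples whose percentages do not sum to 100."""
    totals = {}
    for row in rows:
        totals[row["sample"]] = totals.get(row["sample"], 0.0) + float(row["expected_pct"])
    return {s: t for s, t in totals.items() if abs(t - 100.0) > tolerance}


if __name__ == "__main__":
    buffer = io.StringIO()
    write_metadata(ROWS, buffer)
    print(buffer.getvalue())
    print("samples with bad totals:", check_sample_totals(ROWS))
```

A long format keeps the file easy to extend with more samples or extra columns (for example a genome accession) without changing the parser.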

Useful tools for in silico dataset generation:

Here are some open-source tools that can help with generating synthetic metagenomic data:

  1. [CAMISIM](https://github.com/CAMI-challenge/CAMISIM)

    • A powerful tool for simulating microbial communities and metagenomic sequencing datasets.
    • Supports both Illumina and Nanopore.
    • Can simulate taxonomic compositions and include strain-level variation.
  2. [InSilicoSeq](https://github.com/HadrienG/InSilicoSeq)

    • Focused on realistic Illumina read simulation from genomes.
    • Allows setting error models and read proportions.
  3. [NanoSim](https://github.com/bcgsc/NanoSim)

    • Simulator for Oxford Nanopore reads.
    • Can be trained on real data or use predefined profiles.
  4. [ART](https://www.niehs.nih.gov/research/resources/software/biostatistics/art/index.cfm)

    • A classic Illumina read simulator.
    • Good for simple and fast simulations.
  5. [Grinder](https://github.com/zlinsly/grinder)

    • General-purpose read simulator for amplicon and shotgun sequencing.
    • Can generate mixed community samples with specific abundance profiles.
  6. [NeatSeq-Flow’s simulator module](https://neatseq-flow.readthedocs.io/en/latest/Modules.html#simulatefastq)

    • Simple to use for small datasets; can complement other tools.
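
Whichever simulator is chosen, the expected read percentages have to be turned into concrete per-organism read counts at some point. The sketch below is tool-agnostic and not tied to the interface of any simulator listed above. It uses largest-remainder rounding so that the counts always add up to the requested total. The function name and the example values are hypothetical.

```python
def reads_per_organism(percentages, total_reads):
    """Split total_reads across organisms according to their read percentages.

    Uses largest-remainder rounding: every organism gets the floor of its
    exact share, and leftover reads go to the largest fractional parts.
    """
    total_pct = sum(percentages.values())
    if abs(total_pct - 100.0) > 1e-6:
        raise ValueError(f"percentages sum to {total_pct}, expected 100")

    exact = {name: total_reads * pct / 100.0 for name, pct in percentages.items()}
    counts = {name: int(value) for name, value in exact.items()}

    leftover = total_reads - sum(counts.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


if __name__ == "__main__":
    # Illustrative composition only.
    print(reads_per_organism({"host": 90.0, "bacteria": 8.0, "virus": 2.0}, 1000))
```

Note that these are percentages of reads, not of genome copies. A simulator that expects coverage or genome abundance as input would need the counts converted using genome length and read length.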

Next steps:

  • Decide on the list of organisms to include (and find/download their genomes)
  • Define desired composition for a small number of test samples (e.g., 3-5 samples)
  • Select tool(s) for simulation and document the choice
  • Generate and validate synthetic reads
  • Upload test data to the repo or a public bucket (Zenodo, S3, etc.)
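
For the validation step, one simple sanity check is to count simulated reads per source organism and compare the observed percentages with the expected ones. The sketch below assumes that each read header starts with the source organism ID followed by a `|` separator. That naming scheme is a hypothetical convention and would need adapting to whatever the chosen simulator actually writes. It also assumes plain four-line FASTQ records with no blank lines.

```python
from collections import Counter


def observed_percentages(fastq_lines, separator="|"):
    """Percentage of reads per organism, parsed from FASTQ header lines.

    Assumes headers look like '@<organism_id><separator><anything>'.
    """
    counts = Counter()
    for index, line in enumerate(fastq_lines):
        if index % 4 == 0:  # first line of each four-line record
            header = line.strip()
            if not header.startswith("@"):
                raise ValueError(f"unexpected FASTQ header at line {index + 1}")
            counts[header[1:].split(separator, 1)[0]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 100.0 * n / total for name, n in counts.items()}


def max_abs_deviation(expected, observed):
    """Largest absolute difference in percentage points across all organisms."""
    names = set(expected) | set(observed)
    return max(abs(expected.get(n, 0.0) - observed.get(n, 0.0)) for n in names)


if __name__ == "__main__":
    # Tiny made-up example: three reads from virusA, one from host.
    records = []
    for read_id in ["virusA|1", "virusA|2", "virusA|3", "host|1"]:
        records += [f"@{read_id}", "ACGT", "+", "IIII"]
    observed = observed_percentages(records)
    print(observed, max_abs_deviation({"virusA": 70.0, "host": 30.0}, observed))
```

The same comparison could later be run on the pipeline's classification output instead of read headers to measure detection and quantification accuracy.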
@saramonzon saramonzon added the enhancement New feature or request label Mar 23, 2025
@palec87 palec87 self-assigned this Mar 24, 2025
@Shettland Shettland transferred this issue from BU-ISCIII/PikaVirus_OLD Mar 24, 2025