Create a testing viral metagenomics dataset #7

saramonzon · 2025-03-23T19:52:33Z

Description:

We need to generate a small, controlled in silico test dataset for the viral metagenomics pipeline. This will be used for validating functionality and benchmarking performance.

The dataset should include:

- A metadata.csv file with a list of organisms (viruses, bacteria, fungi, and host genomes like human or other mammals)
- Simulated FASTQ files for both Illumina and Nanopore sequencing platforms
- For each sample, the expected % of reads from each organism (to later test detection/quantification accuracy)

The goal is to create synthetic samples that closely resemble real-world metagenomic data in complexity and composition.

Deliverables:

- metadata.csv with organism names and assigned abundance percentages per sample
- Paired-end FASTQ files for Illumina
- Single-end FASTQ files (or appropriate format) for Nanopore
- Clear documentation on how the data was generated (tools, parameters, etc.)
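
To make the metadata.csv deliverable concrete, here is a minimal sketch of one possible layout and a sanity check on it. The column names (`sample`, `organism`, `category`, `expected_percent`), the long format (one row per sample/organism pair), and the example organisms and values are all assumptions for illustration, not a layout decided in this issue.

```python
import csv
import io
from collections import defaultdict

# Hypothetical long-format layout: one row per (sample, organism) pair.
# Organisms and percentages are placeholders, not the agreed composition.
EXAMPLE_CSV = """sample,organism,category,expected_percent
sample1,Homo sapiens,host,90.0
sample1,Escherichia coli,bacteria,8.0
sample1,Hepatitis B virus,virus,2.0
sample2,Homo sapiens,host,99.0
sample2,Hepatitis B virus,virus,1.0
"""


def check_abundances(handle, tolerance=1e-6):
    """Return {sample: total_percent} for samples that do not sum to 100%."""
    totals = defaultdict(float)
    for row in csv.DictReader(handle):
        totals[row["sample"]] += float(row["expected_percent"])
    return {s: t for s, t in totals.items() if abs(t - 100.0) > tolerance}


bad_samples = check_abundances(io.StringIO(EXAMPLE_CSV))
print(bad_samples or "all samples sum to 100%")
```

Running a check like this before simulation catches typos in the composition table early, when they are still cheap to fix.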

Useful tools for in silico dataset generation:

Here are some open-source tools that can help with generating synthetic metagenomic data:

- [CAMISIM](https://github.com/CAMI-challenge/CAMISIM)
  - A powerful tool for simulating microbial communities and metagenomic sequencing datasets.
  - Supports both Illumina and Nanopore read simulation.
  - Can simulate taxonomic compositions and include strain-level variation.
- [InSilicoSeq](https://github.com/HadrienG/InSilicoSeq)
  - Focused on realistic Illumina read simulation from genomes.
  - Allows setting error models and read proportions.
- [NanoSim](https://github.com/bcgsc/NanoSim)
  - Simulator for Oxford Nanopore reads.
  - Can be trained on real data or use predefined profiles.
- [ART](https://www.niehs.nih.gov/research/resources/software/biostatistics/art/index.cfm)
  - A classic Illumina read simulator.
  - Good for simple and fast simulations.
- [Grinder](https://github.com/zlinsly/grinder)
  - General-purpose read simulator for amplicon and shotgun sequencing.
  - Can generate mixed community samples with specific abundance profiles.
- [NeatSeq-Flow’s simulator module](https://neatseq-flow.readthedocs.io/en/latest/Modules.html#simulatefastq)
  - Simple for small datasets, can complement other tools.
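
Whichever simulator is chosen, the per-sample percentages have to be turned into read counts per organism at some point. The sketch below shows one tool-agnostic way to do that using the largest-remainder method, so the counts always add up to the requested total. The function name `allocate_reads` and the example composition are hypothetical. Note also that simulators differ in whether they count reads or read pairs, and in whether abundance means a fraction of reads or of genome copies, so the chosen tool's documentation should be checked before feeding it these numbers.

```python
def allocate_reads(percentages, total_reads):
    """Split total_reads across organisms in proportion to their percentages.

    Uses the largest-remainder method so the counts sum exactly to total_reads.
    """
    total_pct = sum(percentages.values())
    if total_pct <= 0:
        raise ValueError("percentages must sum to a positive value")

    # Exact (fractional) share of reads for each organism.
    exact = {name: total_reads * pct / total_pct for name, pct in percentages.items()}
    # Start from the integer part of each share.
    counts = {name: int(value) for name, value in exact.items()}
    leftover = total_reads - sum(counts.values())

    # Hand the remaining reads to the organisms with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Placeholder composition, not the agreed one.
example = {"host": 90.0, "bacterium": 8.0, "virus": 2.0}
print(allocate_reads(example, 1000))
```

For low-abundance viruses in a small test dataset, it is worth checking that the allocated count is not zero; otherwise the organism cannot be detected regardless of pipeline quality.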

Next steps:

- Decide on the list of organisms to include (and find/download their genomes)
- Define desired composition for a small number of test samples (e.g., 3-5 samples)
- Select tool(s) for simulation and document the choice
- Generate and validate synthetic reads
- Upload test data to the repo or a public bucket (Zenodo, S3, etc.)
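
For the validation step, one simple approach is to compare the expected percentages with what a classifier reports on the simulated reads. The following is a minimal sketch under that assumption; `compare_abundances`, the status labels, and the observed values are made up for illustration and do not come from any actual pipeline run.

```python
def compare_abundances(expected, observed):
    """Return per-organism expected %, observed %, and absolute error."""
    report = {}
    for name in sorted(set(expected) | set(observed)):
        exp = expected.get(name, 0.0)
        obs = observed.get(name, 0.0)
        report[name] = {"expected": exp, "observed": obs, "abs_error": abs(obs - exp)}
    return report


# Placeholder numbers: the observed values are invented for this example.
expected = {"host": 90.0, "bacterium": 8.0, "virus": 2.0}
observed = {"host": 91.5, "bacterium": 7.0, "unexpected_taxon": 1.5}

for name, row in compare_abundances(expected, observed).items():
    if row["observed"] == 0:
        status = "missed"
    elif row["expected"] == 0:
        status = "false positive"
    else:
        status = "detected"
    print(f"{name}: {status}, abs error {row['abs_error']:.2f}")
```

Reporting missed organisms and false positives separately from the abundance error keeps detection accuracy and quantification accuracy distinct, which matches the two goals stated in the description.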


saramonzon added the enhancement label Mar 23, 2025

saramonzon mentioned this issue Mar 23, 2025

Benchmark DIAMOND for pathogen detection and reference-level classification #8

Open

11 tasks

palec87 self-assigned this Mar 24, 2025

Shettland transferred this issue from BU-ISCIII/PikaVirus_OLD Mar 24, 2025
