
Long read taxonomy classifier: recommended command line? #2758

Open

luizirber opened this issue Sep 13, 2023 · 6 comments

@luizirber
Member

from a private message:

> We are currently benchmarking sourmash as a long-read taxonomy classifier, meaning mapping reads against genomes to assign taxonomy labels. But we struggle a bit with how best to do this. Do you have a recommended command line for this task? Sketching each read separately is a bit slow.

Posting here and opening the floor for suggestions =]

A clarifying question: is the goal to assign taxonomy to each read? Since reads are expected to be long, scaled=1000 might work, but you will probably have to go to scaled=100, which means building new databases (ours at https://sourmash.readthedocs.io/en/latest/databases.html are usually scaled=1000).

We usually do taxonomic profiling, which means building one signature for the whole dataset and then running gather (roughly as sketched below). But that doesn't do read-level classification...
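For concreteness, the profiling workflow is roughly the following (a minimal sketch; reads.fq and the database file name are placeholders, with the database zip coming from the page linked above):

## one sketch for the whole dataset, then gather for taxonomic profiling
sourmash sketch dna reads.fq -p k=31,scaled=1000 -o reads.sig.zip
sourmash gather reads.sig.zip gtdb-rs214-k31.zip -o gather_results.csv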

cc @ctb @bluegenes

@ctb
Contributor

ctb commented Sep 13, 2023

soooooo... I have Thoughts. But while I think they are not wrong, they are not fully fleshed out.

When working with massive collections of genomes that have substantial redundancy (GTDB rs214, for example), starting with read-level classification will result in less precise results. This is because many reads will map equally to multiple genomes - even fairly long reads. Evolution, man! (Even discounting issues of contamination and lateral transfer of genomic regions.)

The trick that sourmash uses is to first find the best set of reference genomes, based on unique combinations of distinctive hashes. This seems to work well when strain or genome specific regions are available in the reference database, as they typically are in Bacteria and Archaea.

Then you can do things like map reads to those genomes. genome-grist implements this workflow for Illumina read metagenomes.
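For long reads, the mapping step of that workflow might look something like this (a sketch, not genome-grist itself; minimap2/samtools and the file names are assumptions, with matched_genomes.fa holding the genomes reported by gather):

## map ONT reads against the gather-matched genomes only
minimap2 -ax map-ont matched_genomes.fa reads.fq > reads_vs_matches.sam
## keep primary alignments so each read is assigned to a single best genome
samtools view -b -F 0x904 reads_vs_matches.sam | samtools sort -o reads_vs_matches.bam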

This is why (I strongly believe, with solid but not yet published receipts ;) sourmash performs so well in Portik et al., 2022. Corollary: single-read classification on its own is DoOmED.

However, it's also clear to me that in this case sourmash is simply doing a highly specific and very sensitive database lookup, with no real hope of generalizing to unknown genomes. Maybe that's ok, or maybe not - depends on use case ;).

@jaebeom-kim
Hi!
I'm trying to get some read-by-read classification results with the following strategy.

## BUILD DB (one signature per reference genome, then an LCA index with taxonomy)
sourmash sketch fromfile files.csv -p dna,scaled=1000 \
        -o /mnt/scratch/jaebeom/gtdb_202_inclusion/databases/sourmash/inclusion_db_sigs_31_1000.zip

sourmash lca index --scaled 1000 \
        -f taxonomy.csv \
        inclusion_31_1000.lca.json \
        /mnt/scratch/jaebeom/gtdb_202_inclusion/databases/sourmash/inclusion_db_sigs_31_1000.zip

## SKETCH QUERY (one signature per read via --singleton)
sourmash sketch dna /fast/jaebeom/long-read/inclusion/prokaryote_inclusion_ont.fq \
        -p k=31,scaled=1000 \
        --singleton \
        -o /fast/jaebeom/long-read/results/sourmash/inclusion/inclusion_ont_reads_singleton.sig.zip

## CLASSIFY reads against the LCA database built above
sourmash lca classify \
        --scaled 1000 \
        --query /fast/jaebeom/long-read/results/sourmash/inclusion/inclusion_ont_reads_singleton.sig.zip \
        --db inclusion_31_1000.lca.json \
        -o /fast/jaebeom/long-read/results/sourmash/inclusion/ont_lca_classify_31-1000.csv

With this strategy, only 3,271 out of 3,433,969 simulated ONT reads (generated with PBSIM3) could be classified.
The genomes used for the read simulation were also used to build the reference DB,
so we would expect far more reads to be classifiable.

To get a better result, I'm currently testing with scaled=100, as @luizirber suggested (the variant commands are sketched below).
I'll post the results here.

If you have other opinions or advice, please share them! :)
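For reference, the scaled=100 variant changes only the scaled values in the same three steps (a minimal sketch; the shortened file names are placeholders):

## BUILD DB at scaled=100
sourmash sketch fromfile files.csv -p dna,k=31,scaled=100 \
        -o inclusion_db_sigs_31_100.zip
sourmash lca index --scaled 100 -f taxonomy.csv \
        inclusion_31_100.lca.json inclusion_db_sigs_31_100.zip

## SKETCH QUERY + CLASSIFY at scaled=100
sourmash sketch dna prokaryote_inclusion_ont.fq -p k=31,scaled=100 --singleton \
        -o inclusion_ont_reads_singleton_100.sig.zip
sourmash lca classify --scaled 100 \
        --query inclusion_ont_reads_singleton_100.sig.zip \
        --db inclusion_31_100.lca.json \
        -o ont_lca_classify_31-100.csv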

@taylorreiter
Contributor

> When working with massive collections of genomes that have substantial redundancy (GTDB rs214, for example), starting with read-level classification will result in less precise results. This is because many reads will map equally to multiple genomes - even fairly long reads. Evolution, man! (Even discounting issues of contamination and lateral transfer of genomic regions.)
>
> The trick that sourmash uses is to first find the best set of reference genomes, based on unique combinations of distinctive hashes. This seems to work well when strain or genome specific regions are available in the reference database, as they typically are in Bacteria and Archaea.

Oooh this is a great point and I totally agree. I recently decontaminated an Iso-Seq transcriptome by:

  1. running sourmash gather on the whole thing
  2. identifying contaminant matches
  3. downloading those genomes from genbank
  4. BLASTn-ing each transcript against the downloaded genomes
  5. filtering BLAST results for strong matches (large fraction of contig, high percent identity) and removing them as contaminants.

à la genome-grist, but with BLAST instead of read mapping, because the fragments are much longer and there are far fewer of them (roughly as sketched below).
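A rough sketch of steps 3-5, assuming the NCBI datasets CLI and BLAST+ are installed; the accession list, file names, and filter thresholds are all placeholders:

## 3. fetch the contaminant genomes flagged by gather (downloads ncbi_dataset.zip)
datasets download genome accession --inputfile contaminant_accessions.txt

## 4. BLAST each transcript against the contaminant genomes
makeblastdb -in contaminant_genomes.fa -dbtype nucl -out contaminants
blastn -query transcripts.fa -db contaminants \
        -outfmt "6 qseqid sseqid pident length qlen" -out hits.tsv

## 5. keep strong matches (placeholder thresholds: >=95% identity over >=80% of the transcript)
awk '$3 >= 95 && $4/$5 >= 0.8 {print $1}' hits.tsv | sort -u > contaminant_transcripts.txt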

@ctb
Contributor

ctb commented Sep 14, 2023

thanks taylor!

a different phrasing: I'd like to classify reads to their most likely genome, not taxon, and that has different challenges!

@ctb
Contributor

ctb commented Sep 14, 2023

> To get a better result, I'm currently testing with scaled=100 as @luizirber suggested.
> I'll post the result here.

Yes, this sounds good to me! I do think sourmash gather is going to work no worse than, and potentially much better than, sourmash lca. Plus it should be a lot faster (especially given the optimizations to gather that we're releasing over in https://github.com/sourmash-bio/pyo3_branchwater).
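Running gather once per read signature can be done with sourmash multigather (a minimal sketch, reusing the singleton sketches and database zip from the commands above; multigather writes one set of output files per query signature):

## gather, run separately for every signature in the query zip (i.e. per read)
sourmash multigather \
        --query inclusion_ont_reads_singleton.sig.zip \
        --db inclusion_db_sigs_31_1000.zip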

@ctb
Contributor

ctb commented Oct 19, 2023

recipe for multigather and tax annotation here: #2816 (comment)
