Long read taxonomy classifier: recommended command line? #2758
soooooo... I have Thoughts. But while I think they are not wrong, they are not fully fleshed out.

When working with massive collections of genomes that have substantial redundancy (GTDB rs214, for example), starting with read-level classification will result in less precise results. This is because many reads will map equally well to multiple genomes - even fairly long reads. Evolution, man! (Even discounting issues of contamination and lateral transfer of genomic regions.)

The trick that sourmash uses is to first find the best set of reference genomes, based on unique combinations of distinctive hashes. This seems to work well when strain- or genome-specific regions are available in the reference database, as they typically are in Bacteria and Archaea. Then you can do things like map reads to those genomes. genome-grist implements this workflow for Illumina read metagenomes. This is why (I strongly believe, with solid but not yet published receipts ;) sourmash performs so well in Portik et al., 2022.

Corollary: single-read classification on its own is DoOmED.

However, it's also clear to me that in this case sourmash is simply doing a highly specific and very sensitive database lookup, with no real hope of generalizing to unknown genomes. Maybe that's ok, or maybe not - depends on use case ;).
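The "find the best set of reference genomes" step can be sketched as a greedy minimum set cover over hash sets. This is a toy illustration of the idea behind `gather`, not sourmash's actual implementation; all names and hash values here are made up:

```python
def greedy_gather(query_hashes, ref_hashes):
    """Toy greedy set cover: query_hashes is a set of ints,
    ref_hashes maps genome name -> set of ints.
    Returns [(name, n_newly_covered_hashes), ...] in pick order."""
    remaining = set(query_hashes)
    picks = []
    while remaining:
        # pick the reference covering the most still-unexplained hashes
        name, cover = max(
            ((n, remaining & h) for n, h in ref_hashes.items()),
            key=lambda t: len(t[1]),
        )
        if not cover:
            break  # nothing left matches any reference
        picks.append((name, len(cover)))
        remaining -= cover
    return picks

# Two near-identical strains share most hashes; the greedy pass credits
# strainA with the shared content and strainB only with its distinctive
# leftovers - which is why redundant databases don't wreck precision.
refs = {
    "strainA": {1, 2, 3, 4, 5, 6},
    "strainB": {1, 2, 3, 4, 5, 9},
    "other":   {20, 21},
}
query = {1, 2, 3, 4, 5, 6, 9, 20}
print(greedy_gather(query, refs))
# -> [('strainA', 6), ('strainB', 1), ('other', 1)]
```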
Hi!
With this strategy, only 3,271 out of 3,433,969 simulated ONT reads (generated with PBSIM3) could be classified. To get a better result, I'm currently testing with [...]. If you have other opinions or advice, please share them! :)
Oooh, this is a great point and I totally agree. I recently decontaminated an isoseq transcriptome à la genome-grist, but with BLAST instead of read mapping, because the fragments are much longer and there are far fewer of them.
thanks taylor! a different phrasing: I'd like to classify reads to their most likely genome, not taxon, and that has different challenges!
Yes, this sounds good to me! I do think [...]
recipe for multigather and tax annotation here: #2816 (comment)
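For orientation, a hedged sketch of what that kind of gather-then-annotate pipeline looks like with the sourmash CLI. All filenames, the reference database, and the lineage spreadsheet are placeholders; the parameters are illustrative, so check the linked comment and the sourmash docs for the exact recipe:

```shell
# Sketch the reads (k-mer size and scaled are illustrative choices)
sourmash sketch dna -p k=31,scaled=1000 reads.fastq.gz -o reads.sig

# Greedily find the best set of reference genomes for the sample
sourmash gather reads.sig ref-db.zip -o gather-results.csv

# Attach taxonomic lineages to the gather matches
sourmash tax annotate -g gather-results.csv -t lineages.csv
```

Read-level assignment would then come afterward, e.g. by mapping reads against just the genomes that `gather` picked, as genome-grist does for Illumina metagenomes.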
from a private message:
Posting here and opening the floor for suggestions =]
A clarification question: is the goal to assign taxonomy to each read? Since reads are expected to be long, `scaled=1000` might work, but we'll probably have to go to `scaled=100`, which means building new databases (ours at https://sourmash.readthedocs.io/en/latest/databases.html are usually `scaled=1000`). We usually do taxonomic profiling, which would be one sig for the dataset and then running `gather`. But that doesn't do read-level classification...

cc @ctb @bluegenes
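To make the `scaled` trade-off concrete, here is a toy FracMinHash sketcher. This is simplified for illustration only: sourmash hashes canonical k-mers with MurmurHash, not MD5, and the function names here are made up. The point is that `scaled=100` keeps roughly 10x more hashes per sequence than `scaled=1000`, which matters when individual (shortish) reads need enough hashes to classify:

```python
import hashlib

MAX_HASH = 2**64  # hashes are treated as 64-bit integers

def kmer_hash(kmer):
    """Toy stand-in for a 64-bit k-mer hash (sourmash uses MurmurHash)."""
    return int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")

def fracminhash(seq, k=21, scaled=1000):
    """Keep a k-mer's hash only if it falls below MAX_HASH / scaled,
    retaining ~1/scaled of all distinct k-mers."""
    threshold = MAX_HASH // scaled
    return {
        h
        for i in range(len(seq) - k + 1)
        if (h := kmer_hash(seq[i : i + k])) < threshold
    }
```

A useful property of this scheme is containment: the `scaled=1000` sketch is always a subset of the `scaled=100` sketch of the same sequence, so coarser databases are strictly downsampled views of finer ones.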