Analyze organism/source #108
I'm currently running the 770-sample tests with different […]

To summarize: of our 77 lab samples, which consisted mostly of model organisms, only 5 had the organism falsely inferred: 1 mouse sample (where the issue was that the ratio between the 1st and 2nd inferred organisms was less than 2, with 30% mmusculus and 17% omeridionalis, which is weird, because that's a butterfly; for the other mouse samples, the second most frequent organism is usually another mouse, like mspicilegus) and 4 E. coli samples. I checked, and E. coli is not in our list of transcripts; I suppose some RP sequences could be added?
So unfortunately, as it turns out, there's no point lowering it. However, of the 226 Undecided samples (29% of the 770), 77 had the organism correctly matched except that the frequency ratio between the 1st and 2nd most frequent source was <2 (the default), and 94 had an incorrect organism inferred. We could try setting this to 1, as none of the samples had a frequency ratio lower than that (except when none of the organisms were identified: 55 samples). We would get an additional ~10% increase in correctly identified organisms, but at the same time a ~12% increase in false predictions. What do you think? Here are the results from the rerun, with […]
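To make the trade-off discussed above concrete, here is a minimal sketch of the frequency-ratio decision rule: the top organism is accepted only if it is at least `min_freq_ratio` times as frequent as the runner-up, otherwise the sample is Undecided. The function name and signature are hypothetical, not HTSinfer's actual API.

```python
from collections import Counter

def infer_source(org_counts, min_freq_ratio=2.0):
    """Return the most frequent organism, or None ("Undecided") if the
    top hit is not at least `min_freq_ratio` times as frequent as the
    runner-up. `org_counts` maps organism name -> match count/percentage."""
    if not org_counts:
        return None
    ranked = Counter(org_counts).most_common(2)
    if len(ranked) == 1:
        return ranked[0][0]
    (top_org, top_n), (_, second_n) = ranked
    if second_n == 0 or top_n / second_n >= min_freq_ratio:
        return top_org
    return None
```

With the mouse example from the thread (30% mmusculus vs. 17% omeridionalis), the ratio is ~1.76, so the default threshold of 2 yields Undecided, while lowering the threshold to 1 yields mmusculus, which illustrates both the gain and the risk of relaxing it.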
Interesting results. I think it's really not too bad for a start. A couple of points:
One other thing you could do: create separate stats for (1) the 10 most common organisms in SRA (or whatever number of organisms we need to account for more than 95% of RNA-Seq samples in SRA, out of the organisms we support; for these we wanna be particularly good), (2) the rest of the organisms, and (3) all organisms (what you have now).
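The three-way stratification suggested above could be computed along these lines; a sketch, assuming per-sample results are available as (true organism, inferred organism) pairs, with all names hypothetical:

```python
def stratified_accuracy(results, common_orgs):
    """Split (true_org, inferred_org) pairs into three strata:
    samples whose true organism is among the common SRA organisms,
    the rest, and all samples together; return accuracy per stratum."""
    strata = {"common": [], "rest": [], "all": []}
    for true_org, inferred in results:
        hit = inferred == true_org
        strata["all"].append(hit)
        strata["common" if true_org in common_orgs else "rest"].append(hit)
    return {k: (sum(v) / len(v) if v else None) for k, v in strata.items()}
```

Reporting the "common" stratum separately would directly show whether the tool is particularly good where it matters most.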
And one more question: Are all of the annotated organisms for the 770 samples supported, in theory, by HTSinfer?
Yeah, sorry, I got it mixed up with the min-match-percentage 😅 It makes sense that we wouldn't want this lowered.
Yes, I checked these, I think they were specifically chosen because they're in the transcript list.
It's a good idea, I'll do the mapping and create the stats. Tbh I think it would be best to focus on the most common organisms (like you mentioned); currently there are 440 orgs supported by HTSinfer, which is far too many, and a large number of them are closely related. I'll look up the most common ones in SRA, remove the others from the list, and check whether that lowers the number of false positives.
Yes, but please don't remove orgs too drastically. Once you have a list of orgs by sample count in SRA, we can take the top 100. If that's still not enough, we can reduce further. But being able to support many organisms is still a cool feature, so I wouldn't want to go down to 20 or fewer if we can avoid it.
Oh, that's a bit unexpected! I mean, when checking against all the 770 orgs, okay, we are then "forcing" a lot of false positives, because the true ones aren't in the list. But for the second part, where you only consider the most common orgs, I would have expected a better result: a larger reduction of undecided samples and fewer false positives. It's hard to imagine why we get more, although ... I guess we are still pushing reads that don't map well to the target organism onto one of the fewer remaining choices, and those could then mess with the numbers, which are probably rather low in general. Two things I can think of:
As for the PE samples: That's weird indeed. What I would actually expect for PE samples is that both mates are mapped together, resulting in a single file of alignments, rather than mapping both libraries separately, deciding separately, and then somehow combining or concatenating. This should actually be much more stringent, because the best RP gene compatible with both mates would be picked, and in cases where, say, there are two reasonably good options for one mate, it is a lot less likely that for the other mate the wrong organism of the two would also be among the top choices. Anyway, thanks for the good work!
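The joint-mapping intuition above can be sketched as a per-pair consensus: only organisms supported by both mates are eligible, which filters out spurious single-mate hits. This is an illustration of the idea, not HTSinfer code; the function and its inputs (per-mate organism hit counts) are hypothetical.

```python
def pe_consensus(mate1_hits, mate2_hits):
    """For one read pair, keep only organisms that have alignments for
    BOTH mates, then pick the one with the highest combined support.
    Returns None if the mates share no candidate organism."""
    shared = set(mate1_hits) & set(mate2_hits)
    if not shared:
        return None
    return max(shared, key=lambda org: mate1_hits[org] + mate2_hits[org])
```

If one mate ambiguously supports two organisms, the requirement that the other mate also supports the same organism makes the wrong choice much less likely, which is exactly the stringency argument made above.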
* feat: add org param
* refactor: avoid duplicate mappings (#131)
  Co-authored-by: Boris Jurič <499542@mail.muni.cz>
  Co-authored-by: Alex Kanitz <alexander.kanitz@alumni.ethz.ch>
* fix typo, update pylint config
* feat: add org_id param #108
* refactor: get_library_source.py #108
* test: add org param tests #108
* fix: update Pydantic version (#146)
* fix pydantic issues
* fix: update pydantic version in envs
* fix: pin sphinx-rtd-theme into env
* fix: update readthedocs config
* update readme, gitignore
* feat: infer org source if id not in dict #108
* replace json with model_dump
* feat: add org_id param #108
* feat: add org_id param #108
* refactor: replace org with tax-id
* refactor get_library_source
* refactor get_library_source tests
* refactor: update models.py
* refactor: fix typos

Co-authored-by: Boris Jurič <74237898+BorisYourich@users.noreply.github.com>
Co-authored-by: Boris Jurič <499542@mail.muni.cz>
Co-authored-by: Alex Kanitz <alexander.kanitz@alumni.ethz.ch>
It is important that we also get an idea of (1) why some of our organism annotations fail and (2) whether perhaps there are also mistakes in the SRA metadata. @balajtimate: To do that, please map all libraries for which the organism was falsely annotated, and at least a few dozen of the libraries for which no organism was annotated, against the top 3 organisms and check the mapping rates.
So here are some final notes on this issue:
So I cleaned up the mined dataset (corrected the lib source so that it matched the SRA metadata, and removed 45 problematic samples, like single-cell ones and those where the actual lib source organism is not in our transcript DB) and checked again whether the lib source inferred by HTSinfer matched the metadata. There are still 37 mismatches, but I think 5% false positives is not bad. There are also some weird cases:
I would say in these cases HTSinfer actually provided more precise information about the lib source than the SRA metadata. Lastly, I mapped the rest of the libraries with mismatches against both the metadata organism and the inferred organism, and unsurprisingly, the samples had a significantly higher mapping rate to the metadata organism than to the inferred one. This was also true for the couple of Undecided lib sources that I mapped. In most cases, the decision came down to the ratio of mapped reads between the first and second most common lib source, which has to be at least 2. Lowering it to 1.5 wouldn't make sense either, as we would get an almost equal number of matches and mismatches. It's also worth noting that the most commonly incorrectly inferred organisms are Ficedula albicollis and Oryza meridionalis; I have no idea what that could mean. So while there is still room for improvement, I think with the testing data we had, this is probably the best result. I will summarize the findings and write them up in #56.
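The threshold comparison described above (default 2 vs. 1.5 vs. 1) can be made systematic with a small sweep over candidate ratios. A sketch, assuming each evaluated sample is reduced to a (true organism, top inferred organism, top-1/top-2 ratio) triple; names and input shape are assumptions for illustration:

```python
def sweep_ratio_threshold(samples, thresholds=(1.0, 1.5, 2.0)):
    """For each candidate min-frequency-ratio, count matches, mismatches
    and undecided calls over (true_org, top_org, ratio) triples."""
    table = {}
    for t in thresholds:
        match = mismatch = undecided = 0
        for true_org, top_org, ratio in samples:
            if ratio < t:
                undecided += 1
            elif top_org == true_org:
                match += 1
            else:
                mismatch += 1
        table[t] = (match, mismatch, undecided)
    return table
```

Tabulating matches against mismatches per threshold makes the "almost equal number of matches and mismatches at 1.5" observation directly visible.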
This is awesome work, thanks a lot @balajtimate! And fantastic results 😍 The only thing I would still like to see though is the full mappings of at least the 37 mismatches (ideally of all samples) against the top 3 inferred organisms/sources and a comparison with what SRA reports. It is a really powerful statement for the paper (and for the usefulness of HTSinfer) if we are able to quantify (to a limited extent) how often samples are misannotated or heavily contaminated.
One goal is to minimize false predictions. Here we can try playing with `--library-source-min-match-percentage` and `--library-source-min-frequency-ratio`. However, another goal is to also make sure that we get reliable predictions at least for the most common model organisms (it's not useful if we never get a mouse because, based on our rRNA gene transcripts, we cannot sufficiently distinguish between mouse and shrew mouse). About this, there are basically two things we can do:
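The two parameters mentioned above gate the prediction in complementary ways: one requires the top organism to account for a minimum share of matched reads, the other requires it to dominate the runner-up. A minimal sketch of how they could interact; the default values here are illustrative, not HTSinfer's actual defaults.

```python
def decide(org_percentages, min_match_percentage=10.0, min_frequency_ratio=2.0):
    """Accept the top organism only if it (a) reaches the minimum match
    percentage and (b) is at least `min_frequency_ratio` times as frequent
    as the runner-up; otherwise return None (Undecided)."""
    ranked = sorted(org_percentages.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] < min_match_percentage:
        return None  # too few matched reads overall
    if (
        len(ranked) > 1
        and ranked[1][1] > 0
        and ranked[0][1] / ranked[1][1] < min_frequency_ratio
    ):
        return None  # runner-up too close; ambiguous call
    return ranked[0][0]
```

Raising either threshold trades undecided calls for fewer false predictions, which is exactly the tension explored throughout this thread.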