-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix performance issue finding last alignment with lots of alignments #16
fix performance issue finding last alignment with lots of alignments #16
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, suggesting fixing one spelling error.
https://biopython.org/docs/1.75/api/Bio.Align.html#Bio.Align.PairwiseAlignments
Biopython recommends checking len(alignments)
because of the issue you identified.
The one concern I would have is—have we tested this with different match, mismatch, etc. scores? Biopython pairwise alignment selects the alignment algorithm based on those values. Is it possible that while the algorithm with the current settings returns a 5' favoring alignment first, other algorithms don't maintain this behavior? [Might be moot since we don't expose aligner scores to the user in __main__.py
and maybe don't anticipate ever changing them]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I have done a bit more extensive testing. I generated ~75 random AA seqs with ~200 residues each. I then aligned all against all on the main branch and this branch. Attached is a CSV with the summary of the results and some screenshots showing perf improvements: Here is the rough script for generating scores (switch to generate_hgvs_variants call to get the alignments): import pandas
from Bio import SeqIO
from palamedes import generate_alignment
def load_seqs(file):
with open(file, "r") as fh:
return list(SeqIO.parse(fh, "fasta"))
def main():
seqs = load_seqs("seqs.fa")
print(f"Loaded {len(seqs)}...")
for seq in seqs:
seq.annotations['molecule_type'] = 'protein'
rows = []
for ref_idx, ref_seq in enumerate(seqs):
print(f"Aligning seq {ref_idx} to all others...")
for alt_idx, alt_seq in enumerate(seqs):
res = generate_alignment(ref_seq, alt_seq)
rows.append(
{
'ref_id': ref_idx,
'alt_id': alt_idx,
'score': res.score,
}
)
pandas.DataFrame(rows).to_csv('alignment_score.csv', index=False)
if __name__ == "__main__":
main() The TLDR is: scores are the same in all cases. Variants are different in ~40% of cases. Of the ones I eyeballed, it seemed like there was mostly agreement across the whole alignment but some slight differences in the variants called. In general the PR results were more biased towards 3' end, which is ideal. So I think it's clear that the reverse process + taking first does not produce the "same" alignment as the forward taking last every time, but the score is the same. I am comfortable with this, but open to thoughts from others. ![]() Files: |
Also @rohan-mammothbio - The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reversing makes sense, this would just flip the alignment matrix so the last path in the backtracking process is the first path.
Also Like the palindromic test case.
Description
After some extensive testing, I have identified a fairly significant bottleneck in the alignment code for highly divergent sequences which produce a large number of alignments. We attempt to take the last one in the set of alignments, but this is a generate with an expensive next operation, so advancing to the last alignment has the potential to take quite a long time. Additionally, we were doing it in a list comp so the memory usage exploded. As a cherry on top, I misread the docs and all alignments returned are assumed to be tied for score, so doing the check is pointless. We would still need to advance to the end, which is not doable for large alignment sets. We very much want that last alignment in order to follow the 3' end rule for HGVS. To that end, I have updated the code with an idea that I "think" works. I have not fully validated it in all cases but the basic idea is that we:
The thinking is that the "first" alignment in the reverse, should correspond to the last alignment in the forward. If we can correctly reverse the alignment, then we can access the last forward alignment in a much more performant manner. A toy example is as follows. Given
ATTG
andATG
, we want to gap the second T. We get 2 alignments back from the aligner:If we align the reversed sequences:
If you reverse the first alignment provided by the second call you get:
Which is the second alignment provided by the forward.
I added a few additional test cases to ensure this is working. I would love feedback from folks on edge cases or additional test cases they would like added.
Relevant Links
https://www.sophiagenetics.com/science-hub/hgvs-nomenclature/: The "The 3 prime rule for mutation nomenclature" second
Screenshots & Media
Random 500 residue AA vs another random one, killed manually after on min on


main
:Finished quickly on new branch:
200 residue test, showing perf improvement and same results:


Testing