Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix performance issue finding last alignment with lots of alignments #16

Merged

Conversation

mbiokyle29
Copy link
Contributor

Description

After some extensive testing, I have identified a fairly significant bottleneck in the alignment code for highly divergent sequences which produce a large number of alignments. We attempt to take the last one in the set of alignments, but this is a generate with an expensive next operation, so advancing to the last alignment has the potential to take quite a long time. Additionally, we were doing it in a list comp so the memory usage exploded. As a cherry on top, I misread the docs and all alignments returned are assumed to be tied for score, so doing the check is pointless. We would still need to advance to the end, which is not doable for large alignment sets. We very much want that last alignment in order to follow the 3' end rule for HGVS. To that end, I have updated the code with an idea that I "think" works. I have not fully validated it in all cases but the basic idea is that we:

  • Figure out how to "reverse" an alignment (just reference the gapped sequences)
  • Reverse the input sequences and generate alignments
  • Take the first alignment and then "reverse" it using the logic above

The thinking is that the "first" alignment in the reverse, should correspond to the last alignment in the forward. If we can correctly reverse the alignment, then we can access the last forward alignment in a much more performant manner. A toy example is as follows. Given ATTG and ATG, we want to gap the second T. We get 2 alignments back from the aligner:

> palamedes ATTG ATG
First:
ref               0 ATTG 4
                  0 |-|| 4
alt               0 A-TG 3


Last:
ref               0 ATTG 4
                  0 ||-| 4
alt               0 AT-G 3

If we align the reversed sequences:

> palamedes GTTA GTA
First:
ref               0 GTTA 4
                  0 |-|| 4
alt               0 G-TA 3


Last:
ref               0 GTTA 4
                  0 ||-| 4
alt               0 GT-A 3

If you reverse the first alignment provided by the second call you get:

ATTG
||-|
AT-G

Which is the second alignment provided by the forward.

I added a few additional test cases to ensure this is working. I would love feedback from folks on edge cases or additional test cases they would like added.

Relevant Links

https://www.sophiagenetics.com/science-hub/hgvs-nomenclature/: The "The 3 prime rule for mutation nomenclature" second

Screenshots & Media

Random 500 residue AA vs another random one, killed manually after on min on main:
Screen Shot 2024-04-17 at 3 01 04 PM
Finished quickly on new branch:
Screen Shot 2024-04-17 at 3 28 03 PM

200 residue test, showing perf improvement and same results:
Screen Shot 2024-04-17 at 3 30 53 PM
Screen Shot 2024-04-17 at 3 31 00 PM

Testing

  • As mentioned, added unit tests for some cases. All other tests pass (although there was not much alignment testing happening)
  • See screenshot for performance test showing big improvement on highly divergent sequences

Copy link
Contributor

@rohan-mammothbio rohan-mammothbio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, suggesting fixing one spelling error.

https://biopython.org/docs/1.75/api/Bio.Align.html#Bio.Align.PairwiseAlignments
Biopython recommends checking len(alignments) because of the issue you identified.

The one concern I would have is—have we tested this with different match, mismatch, etc. scores? Biopython pairwise alignment selects the alignment algorithm based on those values. Is it possible that while the algorithm with the current settings returns a 5' favoring alignment first, other algorithms don't maintain this behavior? [Might be moot since we don't expose aligner scores to the user in __main__.py and maybe don't anticipate ever changing them]

Copy link
Contributor

@rohan-mammothbio rohan-mammothbio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mbiokyle29
Copy link
Contributor Author

mbiokyle29 commented Apr 18, 2024

I have done a bit more extensive testing. I generated ~75 random AA seqs with ~200 residues each. I then aligned all against all on the main branch and this branch. Attached is a CSV with the summary of the results and some screenshots showing perf improvements:
Screen Shot 2024-04-18 at 10 10 58 AM
Screen Shot 2024-04-18 at 9 59 50 AM

Here is the rough script for generating scores (switch to generate_hgvs_variants call to get the alignments):

import pandas
from Bio import SeqIO

from palamedes import generate_alignment

def load_seqs(file):
    with open(file, "r") as fh:
        return list(SeqIO.parse(fh, "fasta"))


def main():
    seqs = load_seqs("seqs.fa")
    print(f"Loaded {len(seqs)}...")

    for seq in seqs:
        seq.annotations['molecule_type'] = 'protein'

    rows = []
    for ref_idx, ref_seq in enumerate(seqs):

        print(f"Aligning seq {ref_idx} to all others...")
        for alt_idx, alt_seq in enumerate(seqs):

            res = generate_alignment(ref_seq, alt_seq)
            rows.append(
                {
                    'ref_id': ref_idx,
                    'alt_id': alt_idx,
                    'score': res.score,
                }
            )

    pandas.DataFrame(rows).to_csv('alignment_score.csv', index=False)

if __name__ == "__main__":
    main()

The TLDR is: scores are the same in all cases. Variants are different in ~40% of cases. Of the ones I eyeballed, it seemed like there was mostly agreement across the whole alignment but some slight differences in the variants called. In general the PR results were more biased towards 3' end, which is ideal. So I think it's clear that the reverse process + taking first does not produce the "same" alignment as the forward taking last every time, but the score is the same. I am comfortable with this, but open to thoughts from others.

Screen Shot 2024-04-18 at 11 02 03 AM

Files:
full.csv
seqs.fa.txt

@mbiokyle29
Copy link
Contributor Author

Also @rohan-mammothbio - The generate_hgvs_variants function in the public API does allow for passing a pre-configured aligner via a kwarg. I can add a note to the doc string that "all bets might be off" on 3' end rule if you think it's worth it. TBH 3' end rule seems like it's already best effort at this point, so maybe not a huge issue.

Copy link

@matandro matandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reversing makes sense, this would just flip the alignment matrix so the last path in the backtracking process is the first path.
Also Like the palindromic test case.

@mbiokyle29 mbiokyle29 merged commit cb6e300 into mammothbio-os:main Apr 18, 2024
3 checks passed
@mbiokyle29 mbiokyle29 deleted the alignment-reversal-3p-end-rule branch April 18, 2024 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants