fix performance issue finding last alignment with lots of alignments #16

mbiokyle29 · 2024-04-17T21:31:38Z

Description

After some extensive testing, I have identified a fairly significant bottleneck in the alignment code for highly divergent sequences which produce a large number of alignments. We attempt to take the last one in the set of alignments, but this is a generate with an expensive next operation, so advancing to the last alignment has the potential to take quite a long time. Additionally, we were doing it in a list comp so the memory usage exploded. As a cherry on top, I misread the docs and all alignments returned are assumed to be tied for score, so doing the check is pointless. We would still need to advance to the end, which is not doable for large alignment sets. We very much want that last alignment in order to follow the 3' end rule for HGVS. To that end, I have updated the code with an idea that I "think" works. I have not fully validated it in all cases but the basic idea is that we:

Figure out how to "reverse" an alignment (just reference the gapped sequences)
Reverse the input sequences and generate alignments
Take the first alignment and then "reverse" it using the logic above

The thinking is that the "first" alignment in the reverse, should correspond to the last alignment in the forward. If we can correctly reverse the alignment, then we can access the last forward alignment in a much more performant manner. A toy example is as follows. Given ATTG and ATG, we want to gap the second T. We get 2 alignments back from the aligner:

> palamedes ATTG ATG
First:
ref               0 ATTG 4
                  0 |-|| 4
alt               0 A-TG 3


Last:
ref               0 ATTG 4
                  0 ||-| 4
alt               0 AT-G 3

If we align the reversed sequences:

> palamedes GTTA GTA
First:
ref               0 GTTA 4
                  0 |-|| 4
alt               0 G-TA 3


Last:
ref               0 GTTA 4
                  0 ||-| 4
alt               0 GT-A 3

If you reverse the first alignment provided by the second call you get:

ATTG
||-|
AT-G

Which is the second alignment provided by the forward.

I added a few additional test cases to ensure this is working. I would love feedback from folks on edge cases or additional test cases they would like added.

Relevant Links

https://www.sophiagenetics.com/science-hub/hgvs-nomenclature/: The "The 3 prime rule for mutation nomenclature" second

Screenshots & Media

Random 500 residue AA vs another random one, killed manually after on min on main:

Finished quickly on new branch:

200 residue test, showing perf improvement and same results:

Testing

As mentioned, added unit tests for some cases. All other tests pass (although there was not much alignment testing happening)
See screenshot for performance test showing big improvement on highly divergent sequences

rohan-mammothbio

Looks good, suggesting fixing one spelling error.

https://biopython.org/docs/1.75/api/Bio.Align.html#Bio.Align.PairwiseAlignments
Biopython recommends checking len(alignments) because of the issue you identified.

The one concern I would have is—have we tested this with different match, mismatch, etc. scores? Biopython pairwise alignment selects the alignment algorithm based on those values. Is it possible that while the algorithm with the current settings returns a 5' favoring alignment first, other algorithms don't maintain this behavior? [Might be moot since we don't expose aligner scores to the user in __main__.py and maybe don't anticipate ever changing them]

palamedes/align.py

rohan-mammothbio

LGTM

mbiokyle29 · 2024-04-18T17:04:59Z

I have done a bit more extensive testing. I generated ~75 random AA seqs with ~200 residues each. I then aligned all against all on the main branch and this branch. Attached is a CSV with the summary of the results and some screenshots showing perf improvements:

Here is the rough script for generating scores (switch to generate_hgvs_variants call to get the alignments):

import pandas
from Bio import SeqIO

from palamedes import generate_alignment

def load_seqs(file):
    with open(file, "r") as fh:
        return list(SeqIO.parse(fh, "fasta"))


def main():
    seqs = load_seqs("seqs.fa")
    print(f"Loaded {len(seqs)}...")

    for seq in seqs:
        seq.annotations['molecule_type'] = 'protein'

    rows = []
    for ref_idx, ref_seq in enumerate(seqs):

        print(f"Aligning seq {ref_idx} to all others...")
        for alt_idx, alt_seq in enumerate(seqs):

            res = generate_alignment(ref_seq, alt_seq)
            rows.append(
                {
                    'ref_id': ref_idx,
                    'alt_id': alt_idx,
                    'score': res.score,
                }
            )

    pandas.DataFrame(rows).to_csv('alignment_score.csv', index=False)

if __name__ == "__main__":
    main()

The TLDR is: scores are the same in all cases. Variants are different in ~40% of cases. Of the ones I eyeballed, it seemed like there was mostly agreement across the whole alignment but some slight differences in the variants called. In general the PR results were more biased towards 3' end, which is ideal. So I think it's clear that the reverse process + taking first does not produce the "same" alignment as the forward taking last every time, but the score is the same. I am comfortable with this, but open to thoughts from others.

Files:
full.csv
seqs.fa.txt

mbiokyle29 · 2024-04-18T17:10:04Z

Also @rohan-mammothbio - The generate_hgvs_variants function in the public API does allow for passing a pre-configured aligner via a kwarg. I can add a note to the doc string that "all bets might be off" on 3' end rule if you think it's worth it. TBH 3' end rule seems like it's already best effort at this point, so maybe not a huge issue.

matandro

The reversing makes sense, this would just flip the alignment matrix so the last path in the backtracking process is the first path.
Also Like the palindromic test case.

mbiokyle29 added 2 commits April 17, 2024 14:44

fix performance issue finding last alignment with lots of alignments

cb6ddc4

use slice syntax for reversed

8099521

rohan-mammothbio requested changes Apr 17, 2024

View reviewed changes

palamedes/align.py Outdated Show resolved Hide resolved

palamedes/align.py Outdated Show resolved Hide resolved

palamedes/align.py Outdated Show resolved Hide resolved

minor: spelling fix

a305ca5

rohan-mammothbio approved these changes Apr 17, 2024

View reviewed changes

matandro approved these changes Apr 18, 2024

View reviewed changes

mbiokyle29 merged commit cb6e300 into mammothbio-os:main Apr 18, 2024
3 checks passed

mbiokyle29 deleted the alignment-reversal-3p-end-rule branch April 18, 2024 18:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix performance issue finding last alignment with lots of alignments #16

fix performance issue finding last alignment with lots of alignments #16

mbiokyle29 commented Apr 17, 2024

rohan-mammothbio left a comment •

edited

Loading

rohan-mammothbio left a comment

mbiokyle29 commented Apr 18, 2024 •

edited

Loading

mbiokyle29 commented Apr 18, 2024

matandro left a comment

fix performance issue finding last alignment with lots of alignments #16

fix performance issue finding last alignment with lots of alignments #16

Conversation

mbiokyle29 commented Apr 17, 2024

Description

Relevant Links

Screenshots & Media

Testing

rohan-mammothbio left a comment • edited Loading

Choose a reason for hiding this comment

rohan-mammothbio left a comment

Choose a reason for hiding this comment

mbiokyle29 commented Apr 18, 2024 • edited Loading

mbiokyle29 commented Apr 18, 2024

matandro left a comment

Choose a reason for hiding this comment

rohan-mammothbio left a comment •

edited

Loading

mbiokyle29 commented Apr 18, 2024 •

edited

Loading