Use nextalign instead of custom Python codon alignment script #64

huddlej · 2021-08-10T03:20:27Z

Context
The workflow uses a custom Python script to perform codon-aware alignment, to resolve issues with variable sites in mafft's alignments. Since @rneher wrote this script, he and @ivan-aksamentov developed nextalign. Even though nextalign was developed with SARS-CoV-2 in mind, it works well for H3N2 HA sequences, too, and is way faster.

Description / proposed solution
We should try using nextalign for our flu builds. We'll need to set up the corresponding FASTA reference files for all of the lineages and segments, or maybe implement GenBank file support in nextalign (whichever is easier... I can guess which, though!). Then we can take advantage of the codon-aware alignment functionality and also run alignments with multiple threads to speed up that step of our builds.
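For illustration, here is a minimal sketch of how a build step might assemble such a nextalign call. It assumes the nextalign 1.x command-line options (`--sequences`, `--reference`, `--genemap`, `--genes`, `--output-dir`, `--output-basename`, `--jobs`); later releases changed the CLI, so check `nextalign --help` for the installed version. The helper name, the file paths, and the gene names are hypothetical placeholders, not the workflow's actual configuration.

```python
import shlex


def build_nextalign_command(sequences, reference, genemap, genes,
                            output_dir, basename, threads=4):
    """Return an argv list for a nextalign run.

    Assumption: the flag names follow the nextalign 1.x CLI.
    All paths and gene names passed in are supplied by the caller.
    """
    return [
        "nextalign",
        f"--jobs={threads}",                 # number of alignment threads
        "--sequences", str(sequences),       # unaligned input FASTA
        "--reference", str(reference),       # per-lineage, per-segment reference FASTA
        "--genemap", str(genemap),           # gene annotation used for codon-aware alignment
        "--genes", ",".join(genes),          # genes to translate
        "--output-dir", str(output_dir),
        "--output-basename", basename,
    ]


if __name__ == "__main__":
    # Hypothetical inputs for an H3N2 HA build.
    cmd = build_nextalign_command(
        sequences="results/h3n2/ha/sequences.fasta",
        reference="config/h3n2/ha/reference.fasta",
        genemap="config/h3n2/ha/genemap.gff",
        genes=["SigPep", "HA1", "HA2"],
        output_dir="results/h3n2/ha",
        basename="aligned",
        threads=8,
    )
    # Print the shell-quoted command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps paths with spaces safe and makes it easy to vary the reference and gene map per lineage and segment.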

Once we have nextalign in place, we could start to do analyses that previously would have taken too long, like creating multiple sequence alignments of all amino acid sequences for HA and running the titer substitution model on all available sequences and titers.

huddlej · 2021-12-17T17:38:27Z

Now that Nextclade has seasonal flu datasets, we should consider using Nextclade for alignment and clade annotations in our standard builds. This approach would quickly produce codon-aware nucleotide alignments, amino acid translations, and clade annotations for every sequence in the database.

ivan-aksamentov · 2021-12-17T17:52:58Z

> We'll need to setup the corresponding FASTA reference files for all of the lineages and segments

For the Nextalign part, there are many input files in the Nextclade repo already. You'll probably need something even more sophisticated than that. But this might be a partial solution, or at least a starting point:

https://github.com/nextstrain/nextclade/tree/master/data/flu

> We should consider using Nextclade for alignment and clade annotations in our standard builds

Nextclade would require more files to run than Nextalign, including a reference Auspice tree for every variation, every root sequence, etc. So it is much more involved on the science side, unless it can all piggyback on the existing trees somehow:

https://github.com/nextstrain/nextclade_data/tree/master/data/datasets

In either case, happy to help with the engineering part! Sadly I am almost entirely ignorant about the science of flu itself.

huddlej · 2024-01-30T19:53:21Z

Closing this since we've used nextalign since the refactor.

huddlej added the enhancement (New feature or request) label Aug 10, 2021

huddlej closed this as completed Jan 30, 2024
