Use nextalign instead of custom Python codon alignment script #64

Closed
huddlej opened this issue Aug 10, 2021 · 3 comments
Labels
enhancement New feature or request

Comments

Contributor

huddlej commented Aug 10, 2021

Context
The workflow uses a custom Python script to perform codon-aware alignment, to resolve issues with variable sites in mafft's alignments. Since @rneher wrote this script, he and @ivan-aksamentov developed nextalign. Even though nextalign was developed with SARS-CoV-2 in mind, it works well for H3N2 HA sequences, too, and is way faster.

Description / proposed solution
We should try using nextalign for our flu builds. We'll need to set up the corresponding FASTA reference files for all of the lineages and segments, or maybe implement GenBank file support in nextalign (whichever is easier...I can guess which though!). Then we can take advantage of the codon-aware alignment functionality and also run alignments with multiple threads to speed up that step of our builds.

Once we have nextalign in place, we could start to do analyses that previously would have taken too long, like creating multiple sequence alignments of all amino acid sequences for HA and running the titer substitution model on all available sequences and titers.
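
To make the per-lineage, per-segment setup described above concrete, here is a minimal sketch that builds one nextalign command line per reference. It only constructs argument lists and does not run anything. The file layout, the lineage and segment names, the gene names, and the helper `nextalign_command` are all hypothetical, and the flags follow the nextalign v1-style CLI as I understand it, so check them against `nextalign --help` for the installed version.

```python
from pathlib import PurePosixPath

# Hypothetical lineage and segment names, used only for illustration.
LINEAGES = ["h3n2", "h1n1pdm", "vic", "yam"]
SEGMENTS = ["ha", "na"]


def nextalign_command(lineage, segment, genes, threads=4,
                      config_dir="config", results_dir="results"):
    """Build an argument list for one nextalign run.

    Assumptions: one FASTA reference and one GFF gene map per
    lineage/segment pair (hypothetical file names), and nextalign
    v1-style flags (--sequences, --reference, --genemap, --genes,
    --output-dir, --output-basename, --jobs).
    """
    config = PurePosixPath(config_dir)
    results = PurePosixPath(results_dir)
    name = f"{lineage}_{segment}"
    return [
        "nextalign",
        f"--sequences={results / f'sequences_{name}.fasta'}",
        f"--reference={config / f'reference_{name}.fasta'}",
        # The gene map and gene list are what enable codon-aware alignment.
        f"--genemap={config / f'genemap_{name}.gff'}",
        f"--genes={','.join(genes)}",
        f"--output-dir={results / f'aligned_{name}'}",
        "--output-basename=aligned",
        # Number of alignment threads.
        f"--jobs={threads}",
    ]


if __name__ == "__main__":
    # Print the command for each lineage/segment pair instead of running it.
    for lineage in LINEAGES:
        for segment in SEGMENTS:
            # Hypothetical gene names; real ones depend on the gene map.
            genes = ["SigPep", "HA1", "HA2"] if segment == "ha" else ["NA"]
            print(" ".join(nextalign_command(lineage, segment, genes, threads=8)))
```

Each list could be passed to `subprocess.run` or turned into a Snakemake rule's shell command once the reference files actually exist.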

huddlej added the enhancement label Aug 10, 2021
Contributor Author

huddlej commented Dec 17, 2021

Now that Nextclade has seasonal flu datasets, we should consider using Nextclade for alignment and clade annotations in our standard builds. This approach would quickly produce codon-aware nucleotide alignments, amino acid translations, and clade annotations for every sequence in the database.

Member

ivan-aksamentov commented Dec 17, 2021

> We'll need to set up the corresponding FASTA reference files for all of the lineages and segments

For the Nextalign part, there are many input files in the Nextclade repo already. You'll probably need something even more sophisticated than that. But this might be a partial solution, or at least a starting point:

https://github.com/nextstrain/nextclade/tree/master/data/flu

> we should consider using Nextclade for alignment and clade annotations in our standard builds

Nextclade would require more files to run than Nextalign, including a reference Auspice tree for every variation, every root sequence, etc. So it is much more involved in terms of science things, unless it can all piggyback on the existing trees somehow:

https://github.com/nextstrain/nextclade_data/tree/master/data/datasets

In either case, happy to help with the engineering part! Sadly I am almost entirely ignorant about the science of flu itself.

Contributor Author

huddlej commented Jan 30, 2024

Closing this since we've used nextalign since the refactor.

huddlej closed this as completed Jan 30, 2024