aws-samples/genomic-language-model-pretraining-with-healthomics-seq-store

Genomic language model pretraining with the HealthOmics sequence store

In this repo we show how to use genomic language models on the AWS cloud.

A genomic language model is a machine learning model, typically built on an architecture such as a transformer or a recurrent neural network (RNN), that is trained to model and generate DNA, RNA, or other biological sequences. Just as language models are trained on human text to predict and generate natural language, genomic language models are trained on nucleotide sequences (such as those composed of the bases A, T, C, and G) to capture the underlying patterns and structure of genomic data.
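To make the analogy with text language models concrete, here is a minimal sketch of how a nucleotide sequence could be turned into next-token prediction examples. The vocabulary, the function names `tokenize` and `next_token_pairs`, and the context length are all illustrative assumptions, not code from this repo; each model covered here ships its own tokenizer and training objective, which may differ from this.

```python
# Hypothetical vocabulary: one integer id per nucleotide, plus N for unknown bases.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def tokenize(sequence: str) -> list[int]:
    """Map each base to an integer id; anything outside A/C/G/T becomes N."""
    return [VOCAB.get(base, VOCAB["N"]) for base in sequence.upper()]


def next_token_pairs(
    token_ids: list[int], context_length: int
) -> list[tuple[list[int], int]]:
    """Build (context, target) pairs for next-token prediction.

    Each context is a window of `context_length` tokens and the target is
    the token that immediately follows that window.
    """
    return [
        (token_ids[i : i + context_length], token_ids[i + context_length])
        for i in range(len(token_ids) - context_length)
    ]


ids = tokenize("ACGTAC")
pairs = next_token_pairs(ids, context_length=3)
for context, target in pairs:
    print(context, "->", target)
```

A model trained on pairs like these learns to predict the next base from the preceding ones, which is the same objective a text language model uses on words or subwords.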

Here we show how to work with four genomic language models (HyenaDNA, Evo, Evo 2, and Caduceus) and one single-cell RNA-seq (scRNA-seq) foundation model, Geneformer.


HyenaDNA

See our HyenaDNA project.


Evo

See our Evo project.


Evo 2

See our Evo 2 project.


Geneformer

See our Geneformer project.


Caduceus

See our Caduceus project.


Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
