Pre-training data for 1000 Genomes Project #89

cbirchsy · 2025-01-30T18:30:20Z

Hi,

I was wondering if you could share the pre-training data for the 1000G model and/or share more details on the preprocessing steps to generate the FASTA sequences?

Many thanks,
Callum

JavierMenRev · 2025-02-17T10:26:36Z

Hi @cbirchsy --

Sorry for the late reply. We used the bcftools consensus command to convert the VCF genotypes into FASTA sequences. Note that we extracted two sequences per individual, one per haplotype: after selecting a given individual and a 6 kb region of their genome, we called the command twice, once with the -I -H 1pIu flags and once with the -I -H 2pIu flags.
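
For illustration only, here is a minimal Python sketch of how the two calls per individual could be assembled. Only the bcftools consensus command and the -I -H 1pIu / -I -H 2pIu flags come from the description above. Everything else is an assumption: extracting the region from a reference with samtools faidx and piping it into bcftools consensus, selecting the individual with -s, writing with -o, and all file names, the sample ID, and the region, which are hypothetical placeholders. It is not the exact pipeline used for the model.

```python
import subprocess
from typing import List, Tuple

Job = Tuple[List[str], List[str], str]


def build_commands(ref_fasta: str, vcf_gz: str, sample: str,
                   region: str, out_prefix: str) -> List[Job]:
    """Build (faidx_cmd, consensus_cmd, out_path) for haplotypes 1 and 2.

    Assumption: the region is cut from the reference with `samtools faidx`
    and streamed into `bcftools consensus` on stdin.
    """
    faidx_cmd = ["samtools", "faidx", ref_fasta, region]
    jobs: List[Job] = []
    for hap in (1, 2):
        out_path = f"{out_prefix}.hap{hap}.fa"
        consensus_cmd = [
            "bcftools", "consensus",
            "-s", sample,            # assumption: pick one individual from the VCF
            "-I",                    # flag quoted in the thread
            "-H", f"{hap}pIu",       # 1pIu / 2pIu, as quoted in the thread
            "-o", out_path,
            vcf_gz,                  # expected to be bgzip-compressed and indexed
        ]
        jobs.append((faidx_cmd, consensus_cmd, out_path))
    return jobs


def run(jobs: List[Job]) -> None:
    """Run each job as `samtools faidx ... | bcftools consensus ...`."""
    for faidx_cmd, consensus_cmd, _ in jobs:
        faidx = subprocess.Popen(faidx_cmd, stdout=subprocess.PIPE)
        subprocess.run(consensus_cmd, stdin=faidx.stdout, check=True)
        faidx.stdout.close()
        if faidx.wait() != 0:
            raise RuntimeError("samtools faidx failed")
```

A call such as `run(build_commands("ref.fa", "calls.vcf.gz", "HG00096", "chr1:1000001-1006000", "HG00096_region"))` would then write two FASTA files for that individual and 6 kb window; the paths, sample ID, and coordinates here are placeholders, and samtools and bcftools need to be installed.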

Let me know if this helps.
