Genotype-phenotype prediction requires diverse data, including genetic variants, functional annotations (FA), linkage disequilibrium, polygenic risk scores (PRS), covariates, and genome-wide association studies (GWAS). We present a framework to optimize data integration for phenotype prediction. Using UK Biobank genotypes (733 samples), 2 GWAS for migraine, 3 for depression, publicly available FA, and 4 PRS tools, we tested various combinations of datasets for migraine prediction. The best individual dataset achieved a test AUC of 0.643899 (±0.143235). However, when combinations were formed under two configurations, Configuration 1 (migraine-related data sources) and Configuration 2 (migraine- and depression-related data sources), the combination of covariates, PRS-PLINK, and weighted-annotated genotype data in Configuration 1 achieved a test AUC of 0.685461 (±0.127186). In Configuration 2, combining unweighted-annotated genotype data and PRS-LDAK achieved a test AUC of 0.663471 (±0.048293). Overall, including covariates, PRS (particularly from AnnoPred and LDAK), and annotated genotype data improves prediction performance.
- Place your genotype data in the following format for a specific phenotype.
- Create a working directory/folder:
EFGPP/
- Download Annotations.tsv from https://drive.google.com/drive/folders/19PMporIqzUj9IY3FNXbnzR_I-aBy8KmT?usp=sharing and place it in the working folder.
Location: EFGPP/Annotations.tsv
EFGPP/Phenotype
EFGPP/Phenotype/Phenotype_1.gz (GWAS file 1)
EFGPP/Phenotype/Phenotype_2.gz (GWAS file 2)
EFGPP/Phenotype/Phenotype_1 (Directory containing genotype data)
EFGPP/Phenotype/Phenotype_2 (Directory containing genotype data)
For Phenotype_1:
EFGPP/Phenotype/Phenotype_1/Phenotype.gz (Cleaned GWAS file)
EFGPP/Phenotype/Phenotype_1/Phenotype.bed
EFGPP/Phenotype/Phenotype_1/Phenotype.bim
EFGPP/Phenotype/Phenotype_1/Phenotype.fam
EFGPP/Phenotype/Phenotype_1/Phenotype.cov (FID,IID,COV1,COV2,...)
EFGPP/Phenotype/Phenotype_1/Phenotype.height (FID,IID,Height)
For Phenotype_2:
EFGPP/Phenotype/Phenotype_2/Phenotype.gz (Cleaned GWAS file)
EFGPP/Phenotype/Phenotype_2/Phenotype.bed
EFGPP/Phenotype/Phenotype_2/Phenotype.bim
EFGPP/Phenotype/Phenotype_2/Phenotype.fam
EFGPP/Phenotype/Phenotype_2/Phenotype.cov (FID,IID,COV1,COV2,...)
EFGPP/Phenotype/Phenotype_2/Phenotype.height (FID,IID,Height)
Sample EFGPP/Phenotype/Phenotype_2/Phenotype.height:
IID FID Height
5777494 5777494 1
3939107 3939107 1
Sample EFGPP/Phenotype/Phenotype_2/Phenotype.gz:
CHR BP SNP A1 A2 N SE P OR INFO MAF
8 101592213 rs62513865 T C 480359 0.0153 0.3438 1.01461 0.957 1
8 106973048 rs79643588 A G 480359 0.0136 0.1231 1.02122 0.999 1
8 108690829 rs17396518 T G 480359 0.008 0.6821 1.00331 0.98 1
8 108681675 rs983166 A C 480359 0.008 0.2784 0.99144 0.991 1
8 103044620 rs28842593 T C 480359 0.0112 0.3381 0.98926 0.934 1
8 105176418 rs377046245 I D 480359 0.0196 0.1311 1.03004 0.994 1
8 103128181 chr8_103128181_I D I 480359 0.0108 0.968 1.0004 0.995 1
8 100479917 rs3134156 T C 480359 0.0108 0.6805 0.99561 0.987 1
8 103144592 rs6980591 A C 480359 0.0095 0.1579 1.01349 0.994 1
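The snippet below is a minimal sketch (assuming pandas and the column names shown above) for loading a cleaned GWAS file and verifying that the required columns are present:

import pandas as pd

# Illustrative path; replace with your own cleaned GWAS file
gwas_path = "EFGPP/Phenotype/Phenotype_2/Phenotype.gz"

# The cleaned GWAS file is whitespace-delimited and gzip-compressed
gwas = pd.read_csv(gwas_path, sep=r"\s+", compression="gzip")

required = {"CHR", "BP", "SNP", "A1", "A2", "N", "SE", "P", "OR", "INFO", "MAF"}
missing = required - set(gwas.columns)
if missing:
    raise ValueError(f"GWAS file is missing columns: {missing}")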
EFGPP/migraine/
├── migraine
│   ├── migraine.bed
│   ├── migraine.bim
│   ├── migraine.cov
│   ├── migraine.fam
│   ├── migraine.gz
│   └── migraine.height
├── migraine_5
│   ├── migraine.bed
│   ├── migraine.bim
│   ├── migraine.cov
│   ├── migraine.fam
│   ├── migraine.gz
│   └── migraine.height
├── migraine_5.gz (GWAS file)
└── migraine.gz (GWAS file)
EFGPP/depression/
├── depression_11
│   ├── depression.bed
│   ├── depression.bim
│   ├── depression.cov
│   ├── depression.fam
│   └── depression.gz
├── depression_11.gz
├── depression_17
│   ├── depression.bed
│   ├── depression.bim
│   ├── depression.cov
│   ├── depression.fam
│   └── depression.gz
├── depression_17.gz
├── depression_4
│   ├── depression.bed
│   ├── depression.bim
│   ├── depression.cov
│   ├── depression.fam
│   └── depression.gz
└── depression_4.gz
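Before starting, the expected layout can be checked with a small sketch like the following (file names assumed from the trees above):

import os

def check_phenotype_dir(phenotype, gwas_dir):
    """Check that one GWAS-specific genotype directory contains the expected files."""
    base = os.path.join("EFGPP", phenotype, gwas_dir)
    # .height is optional for some phenotypes (e.g. the depression directories above)
    for ext in (".bed", ".bim", ".fam", ".cov", ".gz"):
        path = os.path.join(base, phenotype + ext)
        print(path, "OK" if os.path.exists(path) else "MISSING")

check_phenotype_dir("migraine", "migraine_5")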
Execute this command for quality control, specifying the phenotype and GWAS file as arguments:
python Step2-GWASAndIndividualGenotypeDataQualityControls.py migraine migraine_5.gz
python Step2-GWASAndIndividualGenotypeDataQualityControls.py Phenotype GWASFile
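To run quality control for every phenotype/GWAS pair in one go, a simple loop such as the sketch below can be used (the pair list is assumed from the layout above):

import subprocess

# Phenotype / GWAS-file pairs assumed from the directory layout above
pairs = [
    ("migraine", "migraine.gz"),
    ("migraine", "migraine_5.gz"),
    ("depression", "depression_4.gz"),
    ("depression", "depression_11.gz"),
    ("depression", "depression_17.gz"),
]

for phenotype, gwas_file in pairs:
    subprocess.run(
        ["python", "Step2-GWASAndIndividualGenotypeDataQualityControls.py", phenotype, gwas_file],
        check=True,
    )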
This step splits the data using stratified folds:
python Step2-SplitData.py
phenotype_paths = [
('migraine', 'migraine/migraine'),
('migraine', 'migraine/migraine_5'),
('depression', 'depression/depression_11'),
('depression', 'depression/depression_17'),
('depression', 'depression/depression_4'),
]
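For reference, a minimal sketch of a stratified split, assuming five folds, scikit-learn, and that the binary phenotype is taken from the sixth column of the PLINK .fam file (the actual splitting is done by Step2-SplitData.py):

import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Read sample IDs and phenotype status from the PLINK .fam file (assumed path)
fam = pd.read_csv(
    "EFGPP/migraine/migraine_5/migraine.fam",
    sep=r"\s+", header=None,
    names=["FID", "IID", "PID", "MID", "SEX", "PHENO"],
)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(fam, fam["PHENO"])):
    print(f"Fold {fold}: {len(train_idx)} train samples, {len(test_idx)} test samples")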
Perform p-value thresholding on the genotype data:
python Step3-PrepareDataForGenotypeMachineDeepLearning-P-valueThreshold.py migraine migraine_5 0
python Step3-PrepareDataForGenotypeMachineDeepLearning-P-valueThreshold.py Phenotype GWASFile Fold
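Conceptually, this step keeps only SNPs whose GWAS p-value passes a cutoff; a hedged sketch of the idea (threshold and paths are illustrative):

import pandas as pd

# Illustrative cutoff; the pipeline may instead keep the top-N SNPs by p-value
# (see the snps_50, snps_200, ... options further below)
p_threshold = 1e-5

gwas = pd.read_csv("EFGPP/migraine/migraine_5/migraine.gz", sep=r"\s+")
selected_snps = gwas.loc[gwas["P"] < p_threshold, "SNP"]
selected_snps.to_csv("selected_snps.txt", index=False, header=False)
# The resulting SNP list can then be used to subset the genotype data (e.g. with plink --extract)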
Append functional annotations to the genotype data:
python Step3.1-UpdateGenotypeWithAnnotationData.py migraine migraine_5 0
python Step3.1-UpdateGenotypeWithAnnotationData.py Phenotype GWASFile Fold
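A sketch of how functional annotations could be joined onto the genotyped SNPs, assuming Annotations.tsv can be keyed by SNP ID (column names are assumptions):

import pandas as pd

# Assumed column names; adjust to the actual Annotations.tsv header
annotations = pd.read_csv("EFGPP/Annotations.tsv", sep="\t")
bim = pd.read_csv(
    "EFGPP/migraine/migraine_5/migraine.bim",
    sep=r"\s+", header=None,
    names=["CHR", "SNP", "CM", "BP", "A1", "A2"],
)

# Left-join so every genotyped SNP keeps a row even without an annotation
annotated = bim.merge(annotations, on="SNP", how="left")
print(annotated.head())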
Generate PCA for genotype data:
python Step4-GeneratePCA.py migraine migraine_5
python Step4-GeneratePCA.py Phenotype GWASFile
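For intuition, a minimal sketch of principal component generation with scikit-learn on a numeric dosage matrix (placeholder data; the actual script works from the PLINK files):

import numpy as np
from sklearn.decomposition import PCA

# genotype_matrix: samples x SNPs dosage matrix (0/1/2 copies of the effect allele)
rng = np.random.default_rng(0)
genotype_matrix = rng.integers(0, 3, size=(100, 1000)).astype(float)  # placeholder data

pca = PCA(n_components=10)  # matches pca_components = 10 used below
components = pca.fit_transform(genotype_matrix)
print(components.shape)  # (samples, 10)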
Generate PRS scores using various tools:
# Using Plink
python Step5-Plink.py Phenotype GWASFile 0
python Step5-Plink.py migraine migraine_5 0
# Using LDAK-GWAS
python Step5-LDAK-GWAS.py migraine migraine_5 0
# Using PRSice-2
python Step5-PRSice-2.py migraine migraine_5 0
# Using AnnoPred
python Step5-AnnoPredCode.py migraine migraine_5 0
Note: Before running these commands:
- Download and install: Plink, LDAK, LDAK-GWAS, and PRSice-2
- Installation guide: PRSTools Installation Guide
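Regardless of the tool, a polygenic risk score is essentially a weighted sum of risk-allele dosages; the toy sketch below only illustrates the idea and is not a replacement for the tools above:

import numpy as np

# Toy example: 3 individuals x 4 SNPs dosage matrix (0/1/2 copies of the effect allele)
dosages = np.array([
    [0, 1, 2, 1],
    [1, 1, 0, 2],
    [2, 0, 1, 1],
], dtype=float)

# Per-SNP effect sizes, e.g. log odds ratios taken from the GWAS summary statistics
effect_sizes = np.log(np.array([1.0146, 1.0212, 0.9914, 1.0135]))

prs = dosages @ effect_sizes  # one score per individual
print(prs)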
Generate the core base datasets:
python Step6-CoreBaseDataGenerator.py migraine 0
python Step6-CoreBaseDataGenerator.py Phenotype Fold
This will create datasets in:
EFGPP/migraine/Fold_0/Datasets
EFGPP/migraine/Fold_1/Datasets
EFGPP/migraine/Fold_2/Datasets
EFGPP/migraine/Fold_3/Datasets
EFGPP/migraine/Fold_4/Datasets
Configuration options in Step6-CoreBaseDataGenerator.py:
self.phenotype_gwas_pairs = [
("migraine", "migraine.gz"),
("migraine", "migraine_5.gz"),
# ("depression", "depression_11.gz"),
# ("depression", "depression_4.gz"),
# ("depression", "depression_17.gz"),
]
self.models = ["Plink", "PRSice-2", "AnnoPred", "LDAK-GWAS"]
self.pca_components = 10
self.scaling_options = [False] # Can be [True, False]
self.snp_options = [
"snps_50", "snps_200", "snps_500", "snps_1000",
"snps_5000", "snps_10000", "snps_annotated_50",
"snps_annotated_200", "snps_annotated_500"
]
Analyze datasets across folds using these commands:
# Find unique datasets for a specific fold
python Step7-CoreBaseDataSelection-FindSimilarity.py migraine 0
python Step7-CoreBaseDataSelection-FindSimilarity.py Phenotype Fold
# Find common datasets across all folds
python Step7.1-CoreBaseDataSelection-FindCommon.py migraine
python Step7.1-CoreBaseDataSelection-FindCommon.py Phenotype
Results will be stored in: EFGPP/migraine/Results/UniqueDatasets.txt
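Conceptually, the "common" step intersects the dataset identifiers found in each fold; a small illustrative sketch (identifiers are made up):

# Illustrative dataset identifiers per fold
fold_datasets = {
    0: {"Genotype_W_annotated_migraine_5", "PRS_migraine_Plink"},
    1: {"Genotype_W_annotated_migraine_5", "PRS_migraine_LDAK-GWAS"},
    2: {"Genotype_W_annotated_migraine_5", "PRS_migraine_Plink"},
}

common = set.intersection(*fold_datasets.values())
print(common)  # datasets present in every fold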
Apply machine learning and deep learning algorithms using:
python Step8-CoreBasePredictor.py Phenotype Fold NumberofDataset Algorithm
Example commands:
- For machine learning:
python Step8-CoreBasePredictor.py migraine 0 1 ML
- For deep learning:
python Step8-CoreBasePredictor.py migraine 0 1 DL
Aggregate results across folds per dataset:
# For machine learning results
python Step9-CoreBasePredictorAggregate.py Phenotype Result_Directory Minimum_Validation_AUC
python Step9-CoreBasePredictorAggregate.py migraine ResultsML 0.5
# For deep learning results
python Step9-CoreBasePredictorAggregate.py migraine ResultsDL 0.5
Results are stored in EFGPP/migraine/Results/ResultsML/Aggregated
Dataset,Model,ML_Parameters,Train AUC,Validation AUC,Test AUC,...
1,Naive Bayes,{'var_smoothing': 1e-07},0.7487,0.5550,0.5867,...
2,Logistic Regression,{'C': 0.1, 'penalty': 'l2'},0.6540,0.5378,0.5419,...
3,Logistic Regression,{'C': 1.0, 'penalty': 'l2'},0.6483,0.6286,0.6321,...
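The aggregation idea, sketched below, is to average each dataset/model's AUCs across folds and keep rows whose mean validation AUC clears the threshold (column names taken from the output above; the per-fold file layout is an assumption):

import glob
import pandas as pd

min_validation_auc = 0.5

# Assumed layout: one results CSV per fold under EFGPP/migraine/Fold_*/ResultsML
frames = [pd.read_csv(path) for path in glob.glob("EFGPP/migraine/Fold_*/ResultsML/*.csv")]
results = pd.concat(frames, ignore_index=True)

aggregated = (
    results.groupby(["Dataset", "Model"])[["Train AUC", "Validation AUC", "Test AUC"]]
    .mean()
    .reset_index()
)
aggregated = aggregated[aggregated["Validation AUC"] >= min_validation_auc]
print(aggregated.sort_values("Test AUC", ascending=False).head())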
Analyze your results with:
# For basic exploratory data analysis
python Step10-CoreBaseExploratoryDataAnalysis.py migraine ResultsML
python Step10-CoreBaseExploratoryDataAnalysis.py Phenotype Result_Directory
This generates comprehensive insights including:
- Top performing models and their frequency
- Best AUC metrics across:
- Phenotype-GWAS pairs
- Dataset types
- Weight file configurations
- SNP types
- Model types
Perform cluster analysis with:
# For detailed cluster analysis
python Step10.1-CoreBaseDataClusterAnalysis.py migraine ResultsML
python Step10.1-CoreBaseDataClusterAnalysis.py Phenotype Result_Directory
An example cluster visualization figure is included in the repository.
Get the best datasets based on training/validation stability and AUC:
# Find top performing datasets
python Step11-CoreBasePredictorFindTop10.py migraine ResultsML 0.6
python Step11-CoreBasePredictorFindTop10.py Phenotype Result_Directory Minimum_Validation_AUC
The composite score is calculated as:
valid_models['Composite_Score'] = (
0.2 * valid_models['Validation AUC_Norm'] +
0.2 * valid_models['Train_Val_Gap_Norm'] +
0.2 * valid_models['Train_Stability_Norm'] +
0.2 * valid_models['Val_Stability_Norm']
)
Group the data into categories and try all combinations of the categories. There are multiple ways to form the categories. For genotype data there are 4 categories, keyed as f"Genotype_{weight_status}_{annotation_status}_{gwas_name}". For PRS there can be (number of PRS tools × number of GWAS files) categories, keyed as f"PRS_{gwas_name}_{model}", or one category per GWAS file, keyed as f"PRS_{gwas_name}".
One can modify the following code to form different categories:
# `categories` maps a category key to the list of dataset IDs that belong to it,
# e.g. categories = defaultdict(list)
for _, row in results_df.iterrows():
    dataset_id = row['Dataset']
    dataset_type = row['Dataset_Type'].lower()
    annotation_status = get_annotation_status(row.get('snps'))
    # Build the category key based on the dataset type and its properties
    if dataset_type == 'genotype':
        weight_status = 'W' if row.get('weight_file_present', False) else 'UW'
        gwas_name = get_gwas_name(row.get('gwas_file'))
        category_key = f"Genotype_{weight_status}_{annotation_status}_{gwas_name}"
    elif dataset_type == 'prs':
        gwas_name = get_gwas_name(row.get('gwas_file'))
        model = row.get('model', 'NA')
        # Use f"PRS_{gwas_name}_{model}" instead to get one category per PRS tool
        category_key = f"PRS_{gwas_name}"
    elif dataset_type == 'covariates':
        category_key = "Covariates"
    elif dataset_type == 'pca':
        components = row.get('pca_components', 'NA')
        category_key = f"PCA_{components}"
    else:
        category_key = f"Other_{dataset_type}"
    categories[category_key].append(dataset_id)
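Once categories are built, all cross-category combinations can be enumerated, for example with itertools (category names and dataset IDs below are illustrative):

from itertools import combinations, product

# Illustrative mapping of category key -> dataset IDs in that category
categories = {
    "Covariates": [12],
    "PRS_migraine": [3, 7],
    "Genotype_W_annotated_migraine": [21],
}

# Try every subset of categories, and every choice of one dataset per chosen category
for r in range(1, len(categories) + 1):
    for keys in combinations(categories, r):
        for datasets in product(*(categories[k] for k in keys)):
            print(keys, datasets)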
The selected datasets for Configuration 1 and Configuration 2 are presented in Table 1 and on GitHub.
Results for each combination are also available on GitHub.
# Evaluate all dataset combinations
python Step12-Combinations.py --phenotype migraine --results_dir ResultsML
python Step12-Combinations.py --phenotype Phenotype --results_dir Result_Directory
# Plot the combination results
python Step13-PlotCombinations.py migraine ResultsML
python Step13-PlotCombinations.py Phenotype Result_Directory