Commit 7ea3f62 (initial commit, 0 parents): add Team Convalesco submission code

File tree: 102 files changed, +5517 −0 lines


README.md (+93)
# NIH Long Covid Challenge Solution

This repository contains a winning submission for the NIH Long Covid Computational Challenge ([L3C](https://www.challenge.gov/?challenge=l3c)) developed by [Team Convalesco](https://www.linkedin.com/pulse/announcing-nih-long-covid-computational/). The objective of the challenge was to develop machine learning models that predict which patients are susceptible to developing PASC/Long COVID, using structured medical records from up to 28 days after COVID onset.

## Overview

Our solution leverages the rich clinical data available in the [N3C environment](https://ncats.nih.gov/n3c/about/data-overview), including condition occurrences, lab measurements, drug exposures, and doctor notes. With model generalizability and robustness in mind, we focus on creating a small number of meaningful features by curating and expanding concept sets. A key idea in our feature engineering is to use the temporal information in the medical records to create features that are more predictive of Long COVID risk. The original submission consists of ~100 workflow cells operating on Spark dataframes in the N3C enclave. All the transform code is included in this repository so it can be tested and run locally on synthetic data.
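As a toy illustration of this temporal idea (the column names below are hypothetical, not the actual N3C schema or our transforms), one such feature counts a patient's events inside a window anchored on the COVID index date:

```python
import pandas as pd

# Hypothetical event and index tables (illustrative column names only)
events = pd.DataFrame({
    "person_id": [1, 1, 2],
    "event_date": pd.to_datetime(["2021-01-05", "2021-01-20", "2021-02-01"]),
})
covid_index = pd.DataFrame({
    "person_id": [1, 2],
    "covid_index": pd.to_datetime(["2021-01-01", "2021-01-25"]),
})

df = events.merge(covid_index, on="person_id")
df["days_from_index"] = (df["event_date"] - df["covid_index"]).dt.days
# Temporal feature: number of events within 28 days of COVID onset
feature = (df[df["days_from_index"].between(0, 28)]
           .groupby("person_id").size().rename("events_within_28d"))
```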
## Installation

1. Clone the repository:
```
git clone https://github.com/levinas/long-covid-prediction.git
cd long-covid-prediction
```

2. Create a virtual environment (optional):
```
conda create -n l3c python=3.10
conda activate l3c
```

3. Install the required packages:
```
pip install -r requirements.txt
```

4. Ensure Java, a [PySpark dependency](https://spark.apache.org/docs/latest/api/python/getting_started/install.html), is installed and the JAVA_HOME environment variable is set.

For example, on an Ubuntu Linux machine, you can run the following commands (or use another package manager, such as Homebrew, to avoid sudo):
```
sudo apt-get install openjdk-17-jdk
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
```
## Running the Code on Synthetic Data
37+
38+
1. Download the synthetic data:
39+
40+
Download [synthetic_data.zip](https://www.dropbox.com/s/krrw6ydutf6j98p/synthetic_data.zip?dl=0) (1.5GB). Extract the zip file and place the folder in the root directory of the repo. Make sure the directory structure looks like `synthetic_data/training/person.csv`. A command line example to do this is:
41+
```
42+
cd long-covid-prediction
43+
wget https://www.dropbox.com/s/krrw6ydutf6j98p/synthetic_data.zip
44+
unzip synthetic_data.zip
45+
```
46+
47+
2. Run the demo script from the root directory of the repo:
48+
```
49+
./run_all.sh
50+
```
51+
This will run the entire workflow on the synthetic data. The final output will be saved as `Convalesco_predictions.csv` in the root directory of this repo; the outputs of all intermediate datasets will be saved in the `output/` folder.
52+
53+
The test run on the synthetic data could take 1-2 hours on a typical linux machine with 64 GB memory. PySpark may generate `RowBasedKeyValueBatch` warnings that could be safely avoided.
54+
55+
Th final output is a patient-level table with prediction results for the testing data with 8 columns:
```python
# Key columns:
# person_id
# outcome_likelihoods: final prediction on patient PASC probability
# confidence_estimate: a proxy estimate based on patient data completeness
# likelihood_3month: predicted probability of PASC within 3 months after COVID index
# likelihood_6month: predicted probability of PASC within 6 months after COVID index
# Additional columns:
# model100_pred: prediction of Model_100 with 100 temporal features
# model36_pred: prediction of Model_36, a simple model with 36 temporal features
# model_z_pred: prediction of Model_Z, an aspiring "zero-bias" model
```
In this example, since we are using synthetic data, the predictions will not be meaningful.
## Models and Features

We created four models with different emphases; our submission is an ensemble of the first three.

![Table 1](figs/table1-model-description.png)
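For intuition, an equal-weight average of three model outputs looks like the sketch below (the model names and weights are placeholders; the actual ensembling is defined in the transform code):

```python
import numpy as np

# Placeholder predictions from three models for two patients
preds = {
    "model_a": np.array([0.8, 0.2]),
    "model_b": np.array([0.6, 0.4]),
    "model_c": np.array([0.7, 0.3]),
}
# Equal-weight ensemble: average the predicted probabilities
ensemble = np.mean(list(preds.values()), axis=0)
```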
The model features are grouped into seven categories, and the population-level feature utilization scores on the real data are shown in the figure below.

![Fig. 2](figs/fig2-feature-categories.png)

## Documentation

The key components of the repository are as follows:

- `src/`: Contains all the source code, including the ~100 transforms and the global code.
- `src/global_code.py`: Global Python code shared by the transforms.
- `utils/execution_engine.py`: Execution engine for running the transforms locally.

The original submission was developed in the N3C environment in the form of a [Palantir Code Workbook](https://www.palantir.com/docs/foundry/code-workbook/overview/). We used global Python code extensively to simplify the transforms and make the reusable blocks more readable. The execution engine is a Python module we developed after the challenge to enable local execution of the original code with minimal modifications.
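The core idea behind such an engine can be sketched in a few lines (a simplified illustration, not the actual `utils/execution_engine.py`): each transform's parameter names refer to upstream datasets or raw inputs, so any output can be computed by resolving its dependencies recursively:

```python
import inspect

def run(name, transforms, inputs, cache=None):
    """Resolve and execute transform `name`, memoizing intermediate results."""
    cache = {} if cache is None else cache
    if name in cache:
        return cache[name]
    if name in inputs:            # a raw input dataset
        return inputs[name]
    fn = transforms[name]
    # each parameter name refers to another transform or input dataset
    args = [run(p, transforms, inputs, cache)
            for p in inspect.signature(fn).parameters]
    cache[name] = fn(*args)
    return cache[name]

# tiny example: c depends on a (raw input) and b (a transform of a)
transforms = {"b": lambda a: a + 1, "c": lambda a, b: a + b}
result = run("c", transforms, {"a": 1})
```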
For more details, please refer to the [DOCUMENTATION](DOCUMENTATION.md).
src/Convalesco_predictions.py (+39)

# Key columns in this submission:
# person_id
# outcome_likelihoods: final prediction on patient PASC probability
# confidence_estimate: proxy quality estimate based on data completeness
# likelihood_3month: predicted probability of PASC within 3 months after COVID index
# likelihood_6month: predicted probability of PASC within 6 months after COVID index

# Additional columns:
# model100_pred: prediction of Model_100 with 100 temporal features
# model36_pred: prediction of Model_36, a simple model with 36 temporal features
# model_z_pred: prediction of Model_Z, an aspiring "zero-bias" model

import pandas as pd

# `spark`, `F`, and `col` are expected to come from the shared global code
# (src/global_code.py) and the execution environment, not local imports.
def Convalesco_predictions(train_test_model: pd.DataFrame,
                           person_data_completeness_test):
    df = spark.createDataFrame(train_test_model)

    # add confidence estimate; patients without a completeness score default to 0
    df_quality = person_data_completeness_test \
        .select('person_id', 'completeness_score') \
        .join(df.select('person_id'), on='person_id', how='right') \
        .fillna(0)

    df = df.join(df_quality, on='person_id', how='left')

    # round numbers for better display
    df = df.select('person_id',
                   F.round(col('outcome_likelihoods'), 8).alias('outcome_likelihoods'),
                   F.round(col('completeness_score'), 3).alias('confidence_estimate'),
                   F.round(col('model_t_3month'), 6).alias('likelihood_3month'),
                   F.round(col('model_t_6month'), 6).alias('likelihood_6month'),
                   F.round(col('model100'), 6).alias('model100_pred'),
                   F.round(col('model36'), 6).alias('model36_pred'),
                   F.round(col('model_z'), 6).alias('model_z_pred'),
                   )

    return df
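The completeness join above (a right join on the prediction rows, then `fillna(0)`) has a direct pandas analogue; a minimal sketch on made-up data:

```python
import pandas as pd

preds = pd.DataFrame({"person_id": [1, 2, 3]})
quality = pd.DataFrame({"person_id": [1, 3], "completeness_score": [0.9, 0.5]})

# right join keeps every predicted patient; missing scores default to 0
df_quality = quality.merge(preds, on="person_id", how="right").fillna(0)
```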

src/concept_bundles.py (+27)

def concept_bundles(raw_concept_bundles,
                    concept_set_members,
                    curated_bundles):
    df1 = raw_concept_bundles.select(
        'tag_name',
        col('concept_set_name').alias('raw_concept_set_name'),
        col('best_version_id'))
    df2 = concept_set_members.drop('concept_id', 'concept_name').distinct()
    df2_current = df2.where(col('is_most_recent_version') & ~col('archived'))
    df = df1.join(df2, df1.best_version_id == df2.codeset_id, how='left')
    # concept sets whose pinned version is outdated or archived
    df_outdated = df.where(~col('is_most_recent_version') | col('archived')) \
        .drop('is_most_recent_version', 'archived') \
        .select('tag_name', 'raw_concept_set_name', 'concept_set_name',
                col('codeset_id').alias('old_codeset_id'),
                col('version').alias('old_version'))
    df_current = df2_current.join(df.select('tag_name', 'codeset_id'),
                                  on='codeset_id')
    # remap outdated sets to their most recent version by concept set name
    df_updated = df_outdated.join(df2_current, on='concept_set_name')
    cols = [
        'tag_name', 'codeset_id', 'concept_set_name', 'version',
        'is_most_recent_version', 'archived'
    ]
    df = df_current.select(cols).union(df_updated.select(cols)) \
        .withColumnRenamed('tag_name', 'bundle_name') \
        .join(curated_bundles, on='bundle_name', how='left') \
        .orderBy('bundle_name', 'concept_set_name')
    return df

src/concept_sets.py (+12)

# typical runtime: 20s; output shape: 10 x 4848231

def concept_sets(concept_set_members,
                 concept_bundles):
    df = concept_set_members.where(col('is_most_recent_version') & ~col('archived'))
    df = df.join(concept_bundles.select('bundle_name', 'bundle_id', 'codeset_id'),
                 on='codeset_id',
                 how='left')
    # attach the member count of each concept set to every row
    df_count = df.groupBy('codeset_id').count().withColumnRenamed('count', 'member_count')
    df = df.join(df_count, on='codeset_id', how='left')
    df = df.orderBy(col('version').desc(), col('codeset_id'))
    return df
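The member-count step (group, count, join back) is a common pattern; a pandas analogue on toy data:

```python
import pandas as pd

members = pd.DataFrame({"codeset_id": [10, 10, 20],
                        "concept_id": [101, 102, 201]})

# count members per concept set, then join the count back onto each row
counts = members.groupby("codeset_id").size().rename("member_count").reset_index()
df = members.merge(counts, on="codeset_id", how="left")
```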

src/concept_to_feature.py (+66)

# typical runtime: 1m; output shape: 17 x 57010

# `col`, `lit`, `format_string`, `regexp_extract`, `reduce`, `DataFrame`, and
# `move_cols_to_front` come from the shared global code (src/global_code.py).
def concept_to_feature(event_stats, concept_sets):
    df_concept = event_stats  # assumes columns: concept_id, concept_name
    df_bundle = concept_sets  # assumes columns: concept_id, concept_name, codeset_id, concept_set_name, bundle_id, bundle_name

    selected_concept_ids = get_selected_concept_ids()
    selected_set_names = get_selected_concept_set_names()
    custom_sets = create_custom_concept_sets(df_bundle, df_concept)

    null = lit(None)

    # case 1: individually selected concepts
    case1 = col('concept_id').isin(selected_concept_ids)
    df1 = df_concept.where(case1) \
        .withColumn('feature_source', lit('concept_id')) \
        .withColumn('feature_id', format_string('c%d', 'concept_id')) \
        .withColumn('feature_name', format_string('C: %s', 'concept_name'))

    df1.count()  # force evaluation

    # case 2: selected concept sets
    case2 = col('concept_set_name').isin(selected_set_names)
    df2 = df_bundle.where(case2) \
        .withColumn('feature_source', lit('codeset_id')) \
        .withColumn('feature_id', format_string('s%d', 'codeset_id')) \
        .withColumn('feature_name', format_string('S: %s', 'concept_set_name'))

    df2.count()

    # case 3: ARIScience concept sets; strip the prefix and trailing tag from the name
    case3 = col('concept_set_name').startswith('ARIScience')
    df3 = df_bundle.where(case3) \
        .withColumn('feature_source', lit('codeset_id')) \
        .withColumn('feature_id', format_string('s%d', 'codeset_id')) \
        .withColumn('feature_name', format_string('A: %s',
            regexp_extract('concept_set_name', r'ARIScience\s+[-–]\s+(.*?)\s*[-–]*\s*[A-Z]*$', 1)))

    df3.count()

    # case 4: curated bundles
    df4 = df_bundle.where(col('bundle_id').isNotNull()) \
        .withColumn('feature_source', lit('bundle_id')) \
        .withColumn('feature_id', col('bundle_id')) \
        .withColumn('feature_name', format_string('B: %s', 'bundle_name'))

    df4.count()

    # case 5: custom concept sets
    df_custom_sets = custom_sets.select('concept_id', 'custom_set_id', 'custom_set_name')
    df5 = df_concept.join(df_custom_sets, on='concept_id') \
        .withColumn('feature_source', lit('custom_set_id')) \
        .withColumn('feature_id', col('custom_set_id')) \
        .withColumn('feature_name', format_string('X: %s', 'custom_set_name'))

    df5.count()

    dfs = [df1, df2, df3, df4, df5]
    cols = ['concept_id', 'feature_id', 'feature_name', 'feature_source']
    df_union = reduce(DataFrame.union, [d.select(cols) for d in dfs]).distinct()

    df_union.count()

    df = df_concept.join(df_union, on='concept_id', how='left')

    df = move_cols_to_front(df, ['concept_id', 'concept_name', 'domain_id',
                                 'feature_id', 'feature_name', 'feature_source'])
    df = df.orderBy(col('cmi').desc())

    return df
src/covid_dates.py (+15)

def covid_dates(concept,
                measurement,
                merged_events,
                person_table,
                concept_set_members):
    df1 = covid_index_from_measurement(concept, measurement, person_table, concept_set_members, mark_all=True)
    df2 = covid_index_from_concepts(merged_events, person_table, use_custom_covid_set=True, mark_all=True)

    df = df1.unionByName(df2, allowMissingColumns=True)
    df = df.select('person_id', 'date',
                   'covid_test_positive', 'covid_concept_positive',
                   'concept_id', 'concept_name')
    df = df.orderBy('person_id', 'date')

    return df
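`unionByName(..., allowMissingColumns=True)` aligns the two frames by column name and fills absent columns with nulls; the pandas analogue is `concat`, sketched here with toy frames:

```python
import pandas as pd

# one frame carries only the test-positive flag, the other only the concept flag
a = pd.DataFrame({"person_id": [1], "date": ["2021-01-01"],
                  "covid_test_positive": [True]})
b = pd.DataFrame({"person_id": [2], "date": ["2021-02-01"],
                  "covid_concept_positive": [True]})

# concat aligns by column name; missing columns become NaN (Spark: null)
df = pd.concat([a, b], ignore_index=True)
```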

src/covid_dates_test.py (+15)

def covid_dates_test(concept,
                     measurement_test,
                     merged_events_test,
                     person_test,
                     concept_set_members):
    df1 = covid_index_from_measurement(concept, measurement_test, person_test, concept_set_members, mark_all=True)
    df2 = covid_index_from_concepts(merged_events_test, person_test, use_custom_covid_set=True, mark_all=True)

    df = df1.unionByName(df2, allowMissingColumns=True)
    df = df.select('person_id', 'date',
                   'covid_test_positive', 'covid_concept_positive',
                   'concept_id', 'concept_name')
    df = df.orderBy('person_id', 'date')

    return df

src/covid_episodes.py (+16)

# typical runtime: 30s; output shape: 18 x 57672

def covid_episodes(covid_dates,
                   silver):
    df = compute_covid_diagnostic_windows(covid_dates, silver)
    cols = ['person_id', 'time_to_pasc', 'covid_index',
            'num_covid_episodes', 'total_episode_length', 'max_episode_length',
            'months_from_covid_index', 'months_from_first_covid']
    # keep only the first episode's window columns (episodes 2-5 disabled)
    # for i in range(1, 6):
    for i in range(1, 2):
        cols.append(f'covid_{i}_first')
        cols.append(f'covid_{i}_last')
    df = silver.select('person_id', 'time_to_pasc') \
        .join(df, on='person_id', how='left')
    df = df.select(cols).distinct().orderBy('person_id')
    return df

src/covid_episodes_test.py (+13)

def covid_episodes_test(covid_dates_test,
                        silver_test):
    df = compute_covid_diagnostic_windows(covid_dates_test, silver_test)
    cols = ['person_id', 'covid_index',
            'num_covid_episodes', 'total_episode_length', 'max_episode_length',
            'months_from_covid_index', 'months_from_first_covid']
    # keep only the first episode's window columns (episodes 2-5 disabled)
    # for i in range(1, 6):
    for i in range(1, 2):
        if f'covid_{i}_first' in df.columns:
            cols.append(f'covid_{i}_first')
            cols.append(f'covid_{i}_last')
    df = df.select(cols).distinct().orderBy('person_id')
    return df

src/curated_bundles.py (+7)

# typical runtime: 4m; output shape: 2 x 32

# `pd` and `spark` come from the shared global code / execution environment.
def curated_bundles():
    selected_bundle_dict = get_selected_bundle_dict()
    pandas_bundle = pd.DataFrame.from_dict(selected_bundle_dict, orient='index').reset_index()
    pandas_bundle.columns = ['bundle_id', 'bundle_name']
    return spark.createDataFrame(pandas_bundle)
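The dict-to-DataFrame step can be previewed in plain pandas (the bundle ids and names here are made up, not the actual `get_selected_bundle_dict` contents):

```python
import pandas as pd

bundle_dict = {"b01": "Diabetes", "b02": "Kidney Disease"}  # hypothetical values

# dict keys become the index, then a regular column after reset_index
df = pd.DataFrame.from_dict(bundle_dict, orient="index").reset_index()
df.columns = ["bundle_id", "bundle_name"]
```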

src/demographics.py (+21)

# typical runtime: 10s; output shape: 17 x 57672

# def demographics(person_table, location):
def demographics(person_table):
    cols = ['year_of_birth', 'gender_concept_id', 'race_concept_id', 'ethnicity_concept_id']  # , 'data_partner_id'
    df = person_table.select('person_id', *cols)

    # df_dp = data_partner_id_to_onehot(df)
    # df = df.join(df_dp, on='person_id', how='left')

    # df_loc = location.dropDuplicates(['location_id'])
    # df_zip = person_table.select('person_id', 'location_id') \
    #     .join(df_loc, on='location_id', how='left') \
    #     .select('person_id', 'zip') \
    #     .withColumn('zip_id', col('zip').astype(IntegerType())) \
    #     .drop('zip')

    # df = df.join(df_zip, on='person_id', how='left').fillna(0)
    df = df.fillna(0)

    return df

src/demographics_test.py (+19)

# def demographics_test(person_test, location_test):
def demographics_test(person_test):
    cols = ['year_of_birth', 'gender_concept_id', 'race_concept_id', 'ethnicity_concept_id']  # , 'data_partner_id'
    df = person_test.select('person_id', *cols)

    # df_dp = data_partner_id_to_onehot(df)
    # df = df.join(df_dp, on='person_id', how='left')

    # df_loc = location_test.dropDuplicates(['location_id'])
    # df_zip = person_test.select('person_id', 'location_id') \
    #     .join(df_loc, on='location_id', how='left') \
    #     .select('person_id', 'zip') \
    #     .withColumn('zip_id', col('zip').astype(IntegerType())) \
    #     .drop('zip')

    # df = df.join(df_zip, on='person_id', how='left').fillna(0)
    df = df.fillna(0)

    return df
