This repository contains a winning submission for the NIH Long Covid Computational Challenge ([L3C](https://www.challenge.gov/?challenge=l3c)) developed by [Team Convalesco](https://www.linkedin.com/pulse/announcing-nih-long-covid-computational/). The objective of the challenge was to develop machine learning models to predict which patients are susceptible to developing PASC/Long COVID using structured medical records up to 28 days from COVID onset.

## Overview
Our solution leverages the rich clinical data available in the [N3C environment](https://ncats.nih.gov/n3c/about/data-overview), including condition occurrences, lab measurements, drug exposures, and clinical notes. With model generalizability and robustness in mind, we focus on creating a small number of meaningful features by curating and expanding concept sets. A key idea in our feature engineering is to use the temporal information in the medical records to create features that are more predictive of Long COVID risk. The original submission consists of ~100 workflow cells operating on Spark dataframes in the N3C enclave. All the transform code is included in this repository so it can be tested and run locally on synthetic data.
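To make the temporal idea concrete, here is a minimal, illustrative PySpark sketch (table and column names are hypothetical, not the repo's actual schema) that counts a patient's condition records in time windows relative to their COVID index date:

```python
# Illustrative only: a toy temporal feature, not the actual feature code.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical miniature inputs.
conditions = spark.createDataFrame(
    [(1, "2021-03-01"), (1, "2021-04-10"), (2, "2021-02-20")],
    ["person_id", "condition_start_date"],
)
index_dates = spark.createDataFrame(
    [(1, "2021-04-01"), (2, "2021-02-15")],
    ["person_id", "covid_index_date"],
)

features = (
    conditions.join(index_dates, "person_id")
    .withColumn(
        "days_from_index",
        F.datediff(F.to_date("condition_start_date"), F.to_date("covid_index_date")),
    )
    .groupBy("person_id")
    .agg(
        # records in the 28-day acute window after COVID onset
        F.sum(F.when(F.col("days_from_index").between(0, 28), 1).otherwise(0)).alias("n_acute"),
        # pre-COVID history as a baseline
        F.sum(F.when(F.col("days_from_index") < 0, 1).otherwise(0)).alias("n_pre_covid"),
    )
)
features.show()
```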
4. Ensure Java, a [PySpark dependency](https://spark.apache.org/docs/latest/api/python/getting_started/install.html), is installed and the `JAVA_HOME` environment variable is set. For example, on an Ubuntu Linux machine, you can run commands along the following lines (or use another package manager such as Homebrew to avoid sudo; exact package names and paths vary by system):
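```
sudo apt-get update && sudo apt-get install -y openjdk-11-jdk  # any JDK version supported by your PySpark release
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64            # adjust to where Java lands on your system
```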
1. Download [synthetic_data.zip](https://www.dropbox.com/s/krrw6ydutf6j98p/synthetic_data.zip?dl=0) (1.5GB). Extract the zip file and place the folder in the root directory of the repo. Make sure the directory structure looks like `synthetic_data/training/person.csv`. A command line example to do this is:
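```
# one possible way to fetch and unpack the data (dl=1 asks Dropbox for a direct download)
curl -L -o synthetic_data.zip "https://www.dropbox.com/s/krrw6ydutf6j98p/synthetic_data.zip?dl=1"
unzip synthetic_data.zip
ls synthetic_data/training/person.csv  # verify the expected layout
```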
2. Run the demo script from the root directory of the repo:
```
./run_all.sh
```
This will run the entire workflow on the synthetic data. The final output will be saved as `Convalesco_predictions.csv` in the root directory of this repo; the outputs of all intermediate datasets will be saved in the `output/` folder.

The test run on the synthetic data could take 1-2 hours on a typical Linux machine with 64 GB of memory. PySpark may generate `RowBasedKeyValueBatch` warnings, which can be safely ignored.

The final output is a patient-level table with prediction results for the testing data, with 8 columns:
```python
# Key columns:
# person_id
# outcome_likelihoods: final prediction of patient PASC probability
# confidence_estimate: a proxy estimate based on patient data completeness
# likelihood_3month: predicted probability of PASC within 3 months after COVID index
# likelihood_6month: predicted probability of PASC within 6 months after COVID index
# Additional columns:
# model100_pred: prediction of Model_100 with 100 temporal features
# model36_pred: prediction of Model_36, a simple model with 36 temporal features
# model_z_pred: prediction of Model_Z, an aspiring "zero-bias" model
```
Since this example runs on synthetic data, the predictions themselves are not meaningful.
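For a quick look at the output, you can load it with pandas (an optional sanity check, not part of the workflow; the column names are those listed above):

```python
import pandas as pd

# Read the patient-level prediction table produced by run_all.sh.
preds = pd.read_csv("Convalesco_predictions.csv")
print(preds[["person_id", "outcome_likelihoods", "confidence_estimate"]].head())
```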
## Models and Features

We have created 4 models with different emphases, and our submission is an ensemble of the first three.



The model features are grouped into seven categories, and the population-level feature utilization scores on the real data are shown in the figure below.


## Documentation
The key components of the repository are as follows:
- `src/`: Contains all the source code, including the ~100 transforms and global code.
- `src/global_code.py`: Global Python code.
- `utils/execution_engine.py`: Execution engine for running the transforms locally.

The original submission was developed in the N3C environment in the form of a [Palantir Code Workbook](https://www.palantir.com/docs/foundry/code-workbook/overview/). We used global Python code extensively to simplify the transforms and make the reusable blocks more readable. The execution engine is a Python module we developed after the challenge to enable local execution of the original code with minimal modifications.
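As a rough illustration of the pattern (names here are hypothetical; see `utils/execution_engine.py` for the actual implementation), each transform is a plain Python function from upstream Spark DataFrames to one output DataFrame, and a runner can wire inputs by parameter name:

```python
# Hypothetical sketch of the local execution pattern, not the engine's real code.
import inspect
from pyspark.sql import DataFrame

def person_with_conditions(person: DataFrame, condition_occurrence: DataFrame) -> DataFrame:
    # A Code Workbook-style transform: inputs are upstream datasets.
    return person.join(condition_occurrence, "person_id", "left")

def run_transform(fn, datasets: dict) -> None:
    # Resolve each parameter by name among already-computed datasets,
    # call the transform, and register the result under the function's name.
    args = [datasets[name] for name in inspect.signature(fn).parameters]
    datasets[fn.__name__] = fn(*args)

# Usage (assuming `datasets` already maps "person" and "condition_occurrence"
# to DataFrames): run_transform(person_with_conditions, datasets)
```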
For more details, please refer to the [DOCUMENTATION](DOCUMENTATION.md).