Add SpliceAI project #5

Merged
merged 2 commits on Nov 30, 2024
1 change: 1 addition & 0 deletions README.md
@@ -18,6 +18,7 @@ The repo has a folder "algo" for various algorithms. Other folders are dedicated


### Projects Included:
- SpliceAI: A reimplementation of the SpliceAI algorithm, based on the [paper](https://doi.org/10.1016/j.cell.2018.12.015)
- Mikrograd: A minimal autograd engine based on [micrograd](https://github.com/karpathy/micrograd) by `Andrej Karpathy`
- Makemore: A tool for generating synthetic data. Based on [makemore](https://github.com/karpathy/makemore/tree/master) by `Andrej Karpathy`
- RuPoemGPT: A character-level language model trained on a collection of Russian poems. Based on [nanoGPT](https://github.com/karpathy/nanoGPT) by `Andrej Karpathy`
5 changes: 5 additions & 0 deletions spliceai/.gitignore
@@ -0,0 +1,5 @@
libtorch/
CMakeCache.txt
cmake_install.cmake
Makefile
CMakeFiles/
27 changes: 27 additions & 0 deletions spliceai/CMakeLists.txt
@@ -0,0 +1,27 @@
cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
project(SpliceAI_Cpp)

# Set the C++ standard
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Add libtorch
find_package(Torch REQUIRED)

# Add HDF5
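# NOTE: the paths below assume an HDF5 installed via Homebrew on macOS;
# on Linux, point HDF5_ROOT at your HDF5 prefix (or use find_package(HDF5)) instead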
set(HDF5_ROOT "/opt/homebrew")
set(HDF5_INCLUDE_DIRS "${HDF5_ROOT}/include")
set(HDF5_LIBRARIES "${HDF5_ROOT}/lib/libhdf5.dylib")

# Add include directories
include_directories(${HDF5_INCLUDE_DIRS})
include_directories(${TORCH_INCLUDE_DIRS})

# Add executable and link libraries
add_executable(train train.cpp data_loader.cpp spliceai.cpp)
target_link_libraries(train "${TORCH_LIBRARIES}" "${HDF5_LIBRARIES}")
set_property(TARGET train PROPERTY CXX_STANDARD 17)

# Optional: Print HDF5 paths for debugging
message(STATUS "HDF5 Include: ${HDF5_INCLUDE_DIRS}")
message(STATUS "HDF5 Libraries: ${HDF5_LIBRARIES}")
185 changes: 185 additions & 0 deletions spliceai/README.md
@@ -0,0 +1,185 @@
# SpliceAI C++ Reimplementation

![SpliceAI header](./assets/header.jpg)

SpliceAI is a deep learning-based tool for identifying splice sites in genomic sequences. This project is a C++ reimplementation of the original [SpliceAI](https://basespace.illumina.com/s/5u6ThOblecrh) project, which was implemented in Python. The reimplementation uses **LibTorch**, the official PyTorch C++ API, to define and train the neural network models.

---

## Table of Contents

1. [Project Overview](#project-overview)
2. [Setup](#setup)
    - [Prerequisites](#prerequisites)
    - [Installation](#installation)
    - [Dataset Preparation](#dataset-preparation)
3. [Usage](#usage)
    - [Training the Model](#training-the-model)
    - [Testing the Model](#testing-the-model)
4. [Dataset Information](#dataset-information)
5. [Examples](#examples)

---

## Project Overview

SpliceAI is a state-of-the-art deep learning model that identifies splice sites in human genomic sequences: a deep convolutional neural network with residual connections predicts, for each position in a DNA sequence, whether it is an acceptor site, a donor site, or neither.
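
As a rough illustration of the building block involved, here is a minimal LibTorch sketch of a SpliceAI-style residual unit (batch norm, ReLU, dilated convolution, applied twice, plus a skip connection); the channel count, kernel size, and dilation are placeholders, not the exact hyperparameters of this implementation:

```cpp
#include <torch/torch.h>

// One residual unit: BN -> ReLU -> Conv1d, twice, with a skip connection.
// Hyperparameters here are illustrative only.
struct ResidualBlockImpl : torch::nn::Module {
    torch::nn::BatchNorm1d bn1{nullptr}, bn2{nullptr};
    torch::nn::Conv1d conv1{nullptr}, conv2{nullptr};

    ResidualBlockImpl(int64_t channels, int64_t kernel, int64_t dilation) {
        auto opts = torch::nn::Conv1dOptions(channels, channels, kernel)
                        .dilation(dilation)
                        .padding(dilation * (kernel - 1) / 2);  // keeps sequence length
        bn1   = register_module("bn1", torch::nn::BatchNorm1d(channels));
        conv1 = register_module("conv1", torch::nn::Conv1d(opts));
        bn2   = register_module("bn2", torch::nn::BatchNorm1d(channels));
        conv2 = register_module("conv2", torch::nn::Conv1d(opts));
    }

    torch::Tensor forward(torch::Tensor x) {
        auto y = conv1->forward(torch::relu(bn1->forward(x)));
        y = conv2->forward(torch::relu(bn2->forward(y)));
        return x + y;  // residual (skip) connection
    }
};
TORCH_MODULE(ResidualBlock);
```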

This C++ implementation aims to provide a high-performance version of SpliceAI that can be integrated into other C++ applications or run independently.

Key features of this implementation:
- Implements the SpliceAI neural network variants (`80nt`, `400nt`, `2000nt`, `10000nt`), named by the length of flanking sequence context in nucleotides.
- Compatible with Linux and macOS.
- Supports training, validation, and testing with HDF5 datasets.
- Uses **LibTorch** for neural network operations and **HDF5 C++ API** for data handling.

---

## Setup

### Prerequisites

1. **Linux or macOS**:
- Ubuntu 20.04+ or macOS Big Sur+ recommended.
2. **C++ Compiler**:
- GCC 8+ (Linux) or Clang (macOS).
3. **Libraries**:
- [LibTorch](https://pytorch.org/cppdocs/installing.html) (PyTorch C++ API).
- HDF5 library for reading and writing HDF5 files.

### Installation

1. **Clone the Repository**:
   ```bash
   git clone https://github.com/gromdimon/stronghold.git
   cd stronghold/spliceai
   ```

2. **Install Dependencies**:
   - **LibTorch**:
     Download and install LibTorch from the [official website](https://pytorch.org/get-started/locally/).
     ```bash
     wget https://download.pytorch.org/libtorch/cpu/libtorch-shared-with-deps-latest.zip
     unzip libtorch-shared-with-deps-latest.zip
     export TORCH_HOME=$(pwd)/libtorch
     ```

   - **HDF5 Library**:
     For Ubuntu:
     ```bash
     sudo apt-get install libhdf5-dev
     ```
     For macOS (using Homebrew):
     ```bash
     brew install hdf5
     ```

3. **Build the Project**:
   - Use CMake to build the project:
     ```bash
     mkdir build
     cd build
     cmake -DCMAKE_PREFIX_PATH=$TORCH_HOME ..
     make
     ```

### Dataset Preparation

1. **Download the SpliceAI Train Code**:
   - Download the SpliceAI training code directory from [Illumina BaseSpace](https://basespace.illumina.com/s/5u6ThOblecrh).
   - Unzip it into the `spliceai_train_code` directory:
     ```bash
     mkdir spliceai_train_code
     unzip path/to/spliceai_train_code.zip -d spliceai_train_code
     ```

2. **Download the Human Reference Genome**:
   - Download the `hg19` reference genome into the `spliceai_train_code/reference` directory.

3. **Generate Training and Test Sets**:
   - Navigate to the `spliceai_train_code/Canonical` directory:
     ```bash
     cd spliceai_train_code/Canonical
     ```

   - Set the `CL_max` variable in `constants.py` to the desired context length (e.g., `80`, `400`, `2000`, or `10000`).

   - Run the following commands to generate the train/test datasets:
     ```bash
     chmod 755 grab_sequence.sh
     ./grab_sequence.sh

     # Requires Python 2.7 with numpy, h5py, and scikit-learn installed
     python create_datafile.py train all  # ~4 minutes
     python create_datafile.py test 0     # ~1 minute

     python create_dataset.py train all   # ~11 minutes
     python create_dataset.py test 0      # ~1 minute
     ```

   - This will create:
     - `dataset_train_all.h5` (~5.4 GB)
     - `dataset_test_0.h5` (~0.5 GB)

---

## Usage

### Training the Model

To train the SpliceAI model, use the following command:
```bash
./train --model 80nt --output model.pt --train-h5 spliceai_train_code/Canonical/dataset_train_all.h5 --test-h5 spliceai_train_code/Canonical/dataset_test_0.h5 --epochs 10 --batch-size 18 --learning-rate 0.001 --seed 42
```

### Testing the Model

To test the trained model on a test dataset:
```bash
./train --model 80nt --test-only --test-h5 spliceai_train_code/Canonical/dataset_test_0.h5 --pretrained model.pt
```

---

## Dataset Information

The dataset for this project consists of genomic sequences and corresponding splice site labels.

- **Input**: DNA sequences one-hot encoded over the four bases (shape: `(batch_size, 4, sequence_length)`).
- **Output**: Per-position labels giving the likelihood that each position is an acceptor or donor splice site.
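
For illustration, a hypothetical helper (not part of this repo) showing the one-hot layout the model expects for a single sequence:

```cpp
#include <torch/torch.h>
#include <string>

// Encode a DNA string as a (4, L) one-hot tensor: rows A, C, G, T.
// Ambiguous bases (e.g. 'N') are left as all-zero columns.
torch::Tensor one_hot_encode(const std::string& seq) {
    auto out = torch::zeros({4, static_cast<int64_t>(seq.size())});
    for (size_t i = 0; i < seq.size(); ++i) {
        switch (seq[i]) {
            case 'A': out[0][i] = 1.0f; break;
            case 'C': out[1][i] = 1.0f; break;
            case 'G': out[2][i] = 1.0f; break;
            case 'T': out[3][i] = 1.0f; break;
            default: break;
        }
    }
    return out;  // stack such tensors to form a (batch_size, 4, L) input
}
```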

### Files:

- **Training Dataset**:
- File: `dataset_train_all.h5`
- Size: ~5.4 GB
- **Testing Dataset**:
- File: `dataset_test_0.h5`
- Size: ~0.5 GB

### Data Format:

- **HDF5 Structure**:
    - `X{shard_idx}`: Input DNA sequences (shape: `(num_samples, sequence_length, 4)`); the loader permutes these to channels-first.
    - `Y{shard_idx}`: Labels (shape: `(num_samples, sequence_length, 3)`), one channel each for no splice site, acceptor, and donor.
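
A training file contains multiple such shard pairs. As a small sketch (assuming `H5::H5Location::nameExists`, available in recent HDF5 C++ releases), the shards can be counted by probing consecutive indices:

```cpp
#include "H5Cpp.h"
#include <string>

// Count consecutive X{idx}/Y{idx} shard pairs stored in an HDF5 file.
int count_shards(const std::string& file_path) {
    H5::H5File file(file_path, H5F_ACC_RDONLY);
    int idx = 0;
    while (file.nameExists("X" + std::to_string(idx)) &&
           file.nameExists("Y" + std::to_string(idx))) {
        ++idx;
    }
    return idx;
}
```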

---

## Examples

### Training Example

Training a `400nt` model with default parameters:
```bash
./train --model 400nt --output spliceai_400nt.pt --train-h5 spliceai_train_code/Canonical/dataset_train_all.h5 --test-h5 spliceai_train_code/Canonical/dataset_test_0.h5 --epochs 15
```

### Testing Example

Evaluating the trained model:
```bash
./train --model 400nt --test-only --test-h5 spliceai_train_code/Canonical/dataset_test_0.h5 --pretrained spliceai_400nt.pt
```
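
### Using the Data Loader Programmatically

For readers embedding the loader in their own C++ code, a sketch of driving `HDF5Dataset` directly; each dataset element holds an entire shard, so mini-batches are sliced out manually (the batch size of 18 is illustrative):

```cpp
#include "data_loader.h"
#include <iostream>

int main() {
    // Each HDF5Dataset element is one whole shard: X is (N, 4, L), Y is (N, L, 3).
    HDF5Dataset shard("spliceai_train_code/Canonical/dataset_test_0.h5", /*shard_idx=*/0);
    auto example = shard.get(0);

    const int64_t batch_size = 18;  // illustrative
    for (int64_t i = 0; i + batch_size <= example.data.size(0); i += batch_size) {
        auto xb = example.data.narrow(0, i, batch_size);    // (18, 4, L)
        auto yb = example.target.narrow(0, i, batch_size);  // (18, L, 3)
        std::cout << xb.sizes() << " " << yb.sizes() << std::endl;
        break;  // just demonstrate the first mini-batch
    }
    return 0;
}
```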

Binary file added spliceai/assets/header.jpg
65 changes: 65 additions & 0 deletions spliceai/data_loader.cpp
@@ -0,0 +1,65 @@
#include "data_loader.h"
#include <iostream>

HDF5Dataset::HDF5Dataset(const std::string& file_path, int shard_idx) {
    try {
        H5::H5File file(file_path, H5F_ACC_RDONLY);

        // Read dataset X{shard_idx}
        std::string dataset_x_name = "X" + std::to_string(shard_idx);
        H5::DataSet dataset_x = file.openDataSet(dataset_x_name);
        H5::DataSpace dataspace_x = dataset_x.getSpace();

        // Get dimensions: (num_samples, sequence_length, 4)
        hsize_t dims_x[3];
        dataspace_x.getSimpleExtentDims(dims_x, NULL);

        // Read data into a contiguous buffer
        std::vector<float> x_data(dims_x[0] * dims_x[1] * dims_x[2]);
        dataset_x.read(x_data.data(), H5::PredType::NATIVE_FLOAT);

        // Convert to a tensor; clone() copies the data out of the temporary buffer
        auto x_tensor = torch::from_blob(x_data.data(),
                                         {static_cast<int64_t>(dims_x[0]),
                                          static_cast<int64_t>(dims_x[1]),
                                          static_cast<int64_t>(dims_x[2])},
                                         torch::kFloat32).clone();
        x_tensor = x_tensor.permute({0, 2, 1});  // From (N, L, C) to (N, C, L)

        // Read dataset Y{shard_idx}
        std::string dataset_y_name = "Y" + std::to_string(shard_idx);
        H5::DataSet dataset_y = file.openDataSet(dataset_y_name);
        H5::DataSpace dataspace_y = dataset_y.getSpace();

        // Get dimensions: (num_samples, sequence_length, 3)
        hsize_t dims_y[3];
        dataspace_y.getSimpleExtentDims(dims_y, NULL);

        // Read data into a contiguous buffer
        std::vector<float> y_data(dims_y[0] * dims_y[1] * dims_y[2]);
        dataset_y.read(y_data.data(), H5::PredType::NATIVE_FLOAT);

        // Convert to a tensor, keeping all three dimensions (N, L, 3)
        auto y_tensor = torch::from_blob(y_data.data(),
                                         {static_cast<int64_t>(dims_y[0]),
                                          static_cast<int64_t>(dims_y[1]),
                                          static_cast<int64_t>(dims_y[2])},
                                         torch::kFloat32).clone();

        // Store the whole shard as a single dataset element
        data_.push_back(x_tensor);
        targets_.push_back(y_tensor);

    } catch (H5::FileIException& error) {
        error.printErrorStack();
    } catch (H5::DataSetIException& error) {
        error.printErrorStack();
    } catch (H5::DataSpaceIException& error) {
        error.printErrorStack();
    }
}

torch::data::Example<> HDF5Dataset::get(size_t index) {
    return {data_[index], targets_[index]};
}

torch::optional<size_t> HDF5Dataset::size() const {
    return data_.size();
}
21 changes: 21 additions & 0 deletions spliceai/data_loader.h
@@ -0,0 +1,21 @@
#ifndef DATA_LOADER_H
#define DATA_LOADER_H

#include <torch/torch.h>
#include <string>
#include <vector>
#include "H5Cpp.h"

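// Loads the X{shard_idx}/Y{shard_idx} pair from an HDF5 file and exposes it
// as a LibTorch dataset whose single element holds the whole shard.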
class HDF5Dataset : public torch::data::datasets::Dataset<HDF5Dataset> {
public:
    HDF5Dataset(const std::string& file_path, int shard_idx);

    torch::data::Example<> get(size_t index) override;
    torch::optional<size_t> size() const override;

private:
    std::vector<torch::Tensor> data_;
    std::vector<torch::Tensor> targets_;
};

#endif // DATA_LOADER_H