Add SpliceAI project #5

Merged
merged 2 commits on Nov 30, 2024
1 change: 1 addition & 0 deletions README.md
@@ -18,6 +18,7 @@ The repo has a folder "algo" for various algorithms. Other folders are dedicated


### Projects Included:
- SpliceAI: A reimplementation of the SpliceAI algorithm, based on the [paper](https://doi.org/10.1016/j.cell.2018.12.015)
- Mikrograd: A minimal autograd engine based on [micrograd](https://github.com/karpathy/micrograd) by `Andrej Karpathy`
- Makemore: A tool for generating synthetic data. Based on [makemore](https://github.com/karpathy/makemore/tree/master) by `Andrej Karpathy`
- RuPoemGPT: A character-level language model trained on a collection of Russian poems. Based on [nanoGPT](https://github.com/karpathy/nanoGPT) by `Andrej Karpathy`
5 changes: 5 additions & 0 deletions spliceai/.gitignore
@@ -0,0 +1,5 @@
libtorch/
CMakeCache.txt
cmake_install.cmake
Makefile
CMakeFiles/
27 changes: 27 additions & 0 deletions spliceai/CMakeLists.txt
@@ -0,0 +1,27 @@
cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
project(SpliceAI_Cpp)

# Set the C++ standard
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Add libtorch
find_package(Torch REQUIRED)

# Add HDF5
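# NOTE: the paths below assume an HDF5 installed via Homebrew on macOS;
# on Linux, point HDF5_ROOT at your HDF5 prefix (or use find_package(HDF5)) instead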
set(HDF5_ROOT "/opt/homebrew")
set(HDF5_INCLUDE_DIRS "${HDF5_ROOT}/include")
set(HDF5_LIBRARIES "${HDF5_ROOT}/lib/libhdf5.dylib")

# Add include directories
include_directories(${HDF5_INCLUDE_DIRS})
include_directories(${TORCH_INCLUDE_DIRS})

# Add executable and link libraries
add_executable(train train.cpp data_loader.cpp spliceai.cpp)
target_link_libraries(train "${TORCH_LIBRARIES}" "${HDF5_LIBRARIES}")
set_property(TARGET train PROPERTY CXX_STANDARD 17)

# Optional: Print HDF5 paths for debugging
message(STATUS "HDF5 Include: ${HDF5_INCLUDE_DIRS}")
message(STATUS "HDF5 Libraries: ${HDF5_LIBRARIES}")
185 changes: 185 additions & 0 deletions spliceai/README.md
@@ -0,0 +1,185 @@
# SpliceAI C++ Reimplementation

![SpliceAI header](./assets/header.jpg)

SpliceAI is a deep learning-based tool for identifying splice sites in genomic sequences. This project is a C++ reimplementation of the original [SpliceAI](https://basespace.illumina.com/s/5u6ThOblecrh) project, which was implemented in Python. The reimplementation uses **LibTorch**, the official PyTorch C++ API, to define and train the neural network models.

---

## Table of Contents

1. [Project Overview](#project-overview)
2. [Setup](#setup)
    - [Prerequisites](#prerequisites)
    - [Installation](#installation)
    - [Dataset Preparation](#dataset-preparation)
3. [Usage](#usage)
    - [Training the Model](#training-the-model)
    - [Testing the Model](#testing-the-model)
4. [Dataset Information](#dataset-information)
5. [Examples](#examples)

---

## Project Overview

SpliceAI is a state-of-the-art deep learning model that identifies splice sites in human genomic sequences: a deep convolutional neural network with residual connections predicts, for each position in a DNA sequence, whether it is an acceptor site, a donor site, or neither.
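
As a rough illustration of the building block involved, here is a minimal LibTorch sketch of a SpliceAI-style residual unit (batch norm, ReLU, dilated convolution, applied twice, plus a skip connection); the channel count, kernel size, and dilation are placeholders, not the exact hyperparameters of this implementation:

```cpp
#include <torch/torch.h>

// One residual unit: BN -> ReLU -> Conv1d, twice, with a skip connection.
// Hyperparameters here are illustrative only.
struct ResidualBlockImpl : torch::nn::Module {
    torch::nn::BatchNorm1d bn1{nullptr}, bn2{nullptr};
    torch::nn::Conv1d conv1{nullptr}, conv2{nullptr};

    ResidualBlockImpl(int64_t channels, int64_t kernel, int64_t dilation) {
        auto opts = torch::nn::Conv1dOptions(channels, channels, kernel)
                        .dilation(dilation)
                        .padding(dilation * (kernel - 1) / 2);  // keeps sequence length
        bn1   = register_module("bn1", torch::nn::BatchNorm1d(channels));
        conv1 = register_module("conv1", torch::nn::Conv1d(opts));
        bn2   = register_module("bn2", torch::nn::BatchNorm1d(channels));
        conv2 = register_module("conv2", torch::nn::Conv1d(opts));
    }

    torch::Tensor forward(torch::Tensor x) {
        auto y = conv1->forward(torch::relu(bn1->forward(x)));
        y = conv2->forward(torch::relu(bn2->forward(y)));
        return x + y;  // residual (skip) connection
    }
};
TORCH_MODULE(ResidualBlock);
```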

This C++ implementation aims to provide a high-performance version of SpliceAI that can be integrated into other C++ applications or run independently.

Key features of this implementation:
- Implements the SpliceAI neural network variants (`80nt`, `400nt`, `2000nt`, `10000nt`), named by the length of flanking sequence context in nucleotides.
- Compatible with Linux and macOS.
- Supports training, validation, and testing with HDF5 datasets.
- Uses **LibTorch** for neural network operations and **HDF5 C++ API** for data handling.

---

## Setup

### Prerequisites

1. **Linux or macOS**:
- Ubuntu 20.04+ or macOS Big Sur+ recommended.
2. **C++ Compiler**:
- GCC 8+ (Linux) or Clang (macOS).
3. **Libraries**:
- [LibTorch](https://pytorch.org/cppdocs/installing.html) (PyTorch C++ API).
- HDF5 library for reading and writing HDF5 files.

### Installation

1. **Clone the Repository**:
   ```bash
   git clone https://github.com/gromdimon/stronghold.git
   cd stronghold/spliceai
   ```

2. **Install Dependencies**:
   - **LibTorch**:
     Download and install LibTorch from the [official website](https://pytorch.org/get-started/locally/).
     ```bash
     wget https://download.pytorch.org/libtorch/cpu/libtorch-shared-with-deps-latest.zip
     unzip libtorch-shared-with-deps-latest.zip
     export TORCH_HOME=$(pwd)/libtorch
     ```

   - **HDF5 Library**:
     For Ubuntu:
     ```bash
     sudo apt-get install libhdf5-dev
     ```
     For macOS (using Homebrew):
     ```bash
     brew install hdf5
     ```

3. **Build the Project**:
   - Use CMake to build the project:
     ```bash
     mkdir build
     cd build
     cmake -DCMAKE_PREFIX_PATH=$TORCH_HOME ..
     make
     ```

### Dataset Preparation

1. **Download the SpliceAI Train Code**:
   - Download the SpliceAI training code directory from [Illumina BaseSpace](https://basespace.illumina.com/s/5u6ThOblecrh).
   - Unzip it into the `spliceai_train_code` directory:
     ```bash
     mkdir spliceai_train_code
     unzip path/to/spliceai_train_code.zip -d spliceai_train_code
     ```

2. **Download the Human Reference Genome**:
   - Download the `hg19` reference genome into the `spliceai_train_code/reference` directory.

3. **Generate Training and Test Sets**:
   - Navigate to the `spliceai_train_code/Canonical` directory:
     ```bash
     cd spliceai_train_code/Canonical
     ```

   - Set the `CL_max` variable in `constants.py` to the desired context length (e.g., `80`, `400`, `2000`, or `10000`).

   - Run the following commands to generate the train/test datasets:
     ```bash
     chmod 755 grab_sequence.sh
     ./grab_sequence.sh

     # Requires Python 2.7 with numpy, h5py, and scikit-learn installed
     python create_datafile.py train all  # ~4 minutes
     python create_datafile.py test 0     # ~1 minute

     python create_dataset.py train all   # ~11 minutes
     python create_dataset.py test 0      # ~1 minute
     ```

   - This will create:
     - `dataset_train_all.h5` (~5.4 GB)
     - `dataset_test_0.h5` (~0.5 GB)

---

## Usage

### Training the Model

To train the SpliceAI model, use the following command:
```bash
./train --model 80nt --output model.pt --train-h5 spliceai_train_code/Canonical/dataset_train_all.h5 --test-h5 spliceai_train_code/Canonical/dataset_test_0.h5 --epochs 10 --batch-size 18 --learning-rate 0.001 --seed 42
```

### Testing the Model

To test the trained model on a test dataset:
```bash
./train --model 80nt --test-only --test-h5 spliceai_train_code/Canonical/dataset_test_0.h5 --pretrained model.pt
```

---

## Dataset Information

The dataset for this project consists of genomic sequences and corresponding splice site labels.

- **Input**: DNA sequences one-hot encoded over the four bases (shape: `(batch_size, 4, sequence_length)`).
- **Output**: Per-position labels giving the likelihood that each position is an acceptor or donor splice site.
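
For illustration, a hypothetical helper (not part of this repo) showing the one-hot layout the model expects for a single sequence:

```cpp
#include <torch/torch.h>
#include <string>

// Encode a DNA string as a (4, L) one-hot tensor: rows A, C, G, T.
// Ambiguous bases (e.g. 'N') are left as all-zero columns.
torch::Tensor one_hot_encode(const std::string& seq) {
    auto out = torch::zeros({4, static_cast<int64_t>(seq.size())});
    for (size_t i = 0; i < seq.size(); ++i) {
        switch (seq[i]) {
            case 'A': out[0][i] = 1.0f; break;
            case 'C': out[1][i] = 1.0f; break;
            case 'G': out[2][i] = 1.0f; break;
            case 'T': out[3][i] = 1.0f; break;
            default: break;
        }
    }
    return out;  // stack such tensors to form a (batch_size, 4, L) input
}
```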

### Files:

- **Training Dataset**:
- File: `dataset_train_all.h5`
- Size: ~5.4 GB
- **Testing Dataset**:
- File: `dataset_test_0.h5`
- Size: ~0.5 GB

### Data Format:

- **HDF5 Structure**:
    - `X{shard_idx}`: Input DNA sequences (shape: `(num_samples, sequence_length, 4)`); the loader permutes these to channels-first.
    - `Y{shard_idx}`: Labels (shape: `(num_samples, sequence_length, 3)`), one channel each for no splice site, acceptor, and donor.
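
A training file contains multiple such shard pairs. As a small sketch (assuming `H5::H5Location::nameExists`, available in recent HDF5 C++ releases), the shards can be counted by probing consecutive indices:

```cpp
#include "H5Cpp.h"
#include <string>

// Count consecutive X{idx}/Y{idx} shard pairs stored in an HDF5 file.
int count_shards(const std::string& file_path) {
    H5::H5File file(file_path, H5F_ACC_RDONLY);
    int idx = 0;
    while (file.nameExists("X" + std::to_string(idx)) &&
           file.nameExists("Y" + std::to_string(idx))) {
        ++idx;
    }
    return idx;
}
```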

---

## Examples

### Training Example

Training a `400nt` model with default parameters:
```bash
./train --model 400nt --output spliceai_400nt.pt --train-h5 spliceai_train_code/Canonical/dataset_train_all.h5 --test-h5 spliceai_train_code/Canonical/dataset_test_0.h5 --epochs 15
```

### Testing Example

Evaluating the trained model:
```bash
./train --model 400nt --test-only --test-h5 spliceai_train_code/Canonical/dataset_test_0.h5 --pretrained spliceai_400nt.pt
```
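
### Using the Data Loader Programmatically

For readers embedding the loader in their own C++ code, a sketch of driving `HDF5Dataset` directly; each dataset element holds an entire shard, so mini-batches are sliced out manually (the batch size of 18 is illustrative):

```cpp
#include "data_loader.h"
#include <iostream>

int main() {
    // Each HDF5Dataset element is one whole shard: X is (N, 4, L), Y is (N, L, 3).
    HDF5Dataset shard("spliceai_train_code/Canonical/dataset_test_0.h5", /*shard_idx=*/0);
    auto example = shard.get(0);

    const int64_t batch_size = 18;  // illustrative
    for (int64_t i = 0; i + batch_size <= example.data.size(0); i += batch_size) {
        auto xb = example.data.narrow(0, i, batch_size);    // (18, 4, L)
        auto yb = example.target.narrow(0, i, batch_size);  // (18, L, 3)
        std::cout << xb.sizes() << " " << yb.sizes() << std::endl;
        break;  // just demonstrate the first mini-batch
    }
    return 0;
}
```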

Binary file added spliceai/assets/header.jpg
65 changes: 65 additions & 0 deletions spliceai/data_loader.cpp
@@ -0,0 +1,65 @@
#include "data_loader.h"
#include <iostream>

HDF5Dataset::HDF5Dataset(const std::string& file_path, int shard_idx) {
    try {
        H5::H5File file(file_path, H5F_ACC_RDONLY);

        // Read dataset X{shard_idx}
        std::string dataset_x_name = "X" + std::to_string(shard_idx);
        H5::DataSet dataset_x = file.openDataSet(dataset_x_name);
        H5::DataSpace dataspace_x = dataset_x.getSpace();

        // Get dimensions: (num_samples, sequence_length, 4)
        hsize_t dims_x[3];
        dataspace_x.getSimpleExtentDims(dims_x, NULL);

        // Read data into a contiguous buffer
        std::vector<float> x_data(dims_x[0] * dims_x[1] * dims_x[2]);
        dataset_x.read(x_data.data(), H5::PredType::NATIVE_FLOAT);

        // Convert to a tensor; clone() copies the data out of the temporary buffer
        auto x_tensor = torch::from_blob(x_data.data(),
                                         {static_cast<int64_t>(dims_x[0]),
                                          static_cast<int64_t>(dims_x[1]),
                                          static_cast<int64_t>(dims_x[2])},
                                         torch::kFloat32).clone();
        x_tensor = x_tensor.permute({0, 2, 1});  // From (N, L, C) to (N, C, L)

        // Read dataset Y{shard_idx}
        std::string dataset_y_name = "Y" + std::to_string(shard_idx);
        H5::DataSet dataset_y = file.openDataSet(dataset_y_name);
        H5::DataSpace dataspace_y = dataset_y.getSpace();

        // Get dimensions: (num_samples, sequence_length, 3)
        hsize_t dims_y[3];
        dataspace_y.getSimpleExtentDims(dims_y, NULL);

        // Read data into a contiguous buffer
        std::vector<float> y_data(dims_y[0] * dims_y[1] * dims_y[2]);
        dataset_y.read(y_data.data(), H5::PredType::NATIVE_FLOAT);

        // Convert to a tensor, keeping all three dimensions (N, L, 3)
        auto y_tensor = torch::from_blob(y_data.data(),
                                         {static_cast<int64_t>(dims_y[0]),
                                          static_cast<int64_t>(dims_y[1]),
                                          static_cast<int64_t>(dims_y[2])},
                                         torch::kFloat32).clone();

        // Store the whole shard as a single dataset element
        data_.push_back(x_tensor);
        targets_.push_back(y_tensor);

    } catch (H5::FileIException& error) {
        error.printErrorStack();
    } catch (H5::DataSetIException& error) {
        error.printErrorStack();
    } catch (H5::DataSpaceIException& error) {
        error.printErrorStack();
    }
}

torch::data::Example<> HDF5Dataset::get(size_t index) {
    return {data_[index], targets_[index]};
}

torch::optional<size_t> HDF5Dataset::size() const {
    return data_.size();
}
21 changes: 21 additions & 0 deletions spliceai/data_loader.h
@@ -0,0 +1,21 @@
#ifndef DATA_LOADER_H
#define DATA_LOADER_H

#include <torch/torch.h>
#include <string>
#include <vector>
#include "H5Cpp.h"

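// Loads the X{shard_idx}/Y{shard_idx} pair from an HDF5 file and exposes it
// as a LibTorch dataset whose single element holds the whole shard.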
class HDF5Dataset : public torch::data::datasets::Dataset<HDF5Dataset> {
public:
    HDF5Dataset(const std::string& file_path, int shard_idx);

    torch::data::Example<> get(size_t index) override;
    torch::optional<size_t> size() const override;

private:
    std::vector<torch::Tensor> data_;
    std::vector<torch::Tensor> targets_;
};

#endif // DATA_LOADER_H