Discrimination of PDPs in Water Samples: A Machine Learning Approach Using TIC Data (HS-GC-IMS)

📌 Overview

This repository contains the code and workflows for the analysis of hydrocarbon contamination in water samples using Headspace Gas Chromatography-Ion Mobility Spectrometry (HS-GC-IMS) in combination with machine learning techniques. The project aims to develop classification models for detecting and differentiating hydrocarbons in complex aqueous matrices.

📂 Project Structure

The repository is structured as follows:

├── App/                                 # Shiny Application
├── figures/                             # Generated figures from data analysis
├── scripts/                             # Contains scripts for data analysis
│   ├── eda                              # Exploratory Data Analysis (EDA)
│   │   ├── missing_values_hsgcims_waters.R
│   │   └── outliers_hsgcims_waters.R
│   ├── tic                              # Total Ion Chromatogram (TIC) plotting
│   │   └── tic_hsgcims_waters.R
│   ├── unsupervised_algorithms          # Unsupervised Machine Learning (HCA, PCA)
│   │   ├── hca_hsgcims_waters.R
│   │   └── pca_hsgcims_waters.R
│   └── supervised_algorithms            # Supervised Machine Learning (SVM, RF)
│       ├── svm_hsgcims_waters.R
│       └── rf_hsgcims_waters.R
├── requirements.txt                     # Required R packages
├── README.md                            # Project documentation
├── .gitignore                           # Ignore file
├── LICENSE                              # License file

🔄 Workflow

The data analysis workflow follows these main steps:

Exploratory Data Analysis (EDA)

Detection of missing values and outliers
Visualization of Total Ion Chromatogram (TIC) profiles based on contamination classes
Exploratory assessment of the dataset using Hierarchical Clustering Analysis (HCA) and Principal Component Analysis (PCA)
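
The EDA steps above can be sketched in base R on synthetic data. This is a minimal illustration and not the code in scripts/eda. The matrix `tic`, the labels `A`/`B`/`C`, the median imputation, and the 1.5 x IQR outlier rule are all assumptions made for the example.

```r
# Minimal EDA sketch on synthetic data (base R only).
# Assumption: rows are samples, columns are TIC intensities at successive
# retention-time points; `label` holds hypothetical contamination classes.
set.seed(42)
n_samples <- 30
n_points  <- 50
tic <- matrix(rnorm(n_samples * n_points, mean = 100, sd = 10),
              nrow = n_samples,
              dimnames = list(paste0("s", seq_len(n_samples)),
                              paste0("rt", seq_len(n_points))))
label <- factor(rep(c("A", "B", "C"), each = 10))

# Shift some retention-time regions so the synthetic classes differ
tic[label == "B", 1:10]  <- tic[label == "B", 1:10] + 40
tic[label == "C", 20:30] <- tic[label == "C", 20:30] + 40
tic[3, 7] <- NA  # inject one missing value for demonstration

# 1. Missing values: count per sample, then impute with the column median
#    (median imputation is just one simple option)
na_per_sample <- rowSums(is.na(tic))
for (j in seq_len(ncol(tic))) {
  tic[is.na(tic[, j]), j] <- median(tic[, j], na.rm = TRUE)
}

# 2. Outliers: flag samples whose summed intensity lies outside 1.5 * IQR
total_signal <- rowSums(tic)
q   <- quantile(total_signal, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
is_outlier <- total_signal < q[[1]] - 1.5 * iqr |
              total_signal > q[[2]] + 1.5 * iqr

# 3. PCA on centered and scaled data; proportion of variance per component
pca <- prcomp(tic, center = TRUE, scale. = TRUE)
explained <- pca$sdev^2 / sum(pca$sdev^2)

# 4. HCA with Ward linkage on Euclidean distances, cut into three clusters
hca <- hclust(dist(scale(tic)), method = "ward.D2")
clusters <- cutree(hca, k = 3)
table(clusters, label)  # cross-tabulate clusters against known classes
```

Plotting `pca$x[, 1:2]` colored by `label`, or calling `plot(hca)`, gives the usual score plot and dendrogram.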

Supervised Machine Learning

Random Forest (RF) and Support Vector Machine (SVM) models are developed for classification. Three feature selection strategies are applied:

Using all predictors
Selection via the Boruta algorithm
Selection via Genetic Algorithm (GA)
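
As a rough sketch, the three strategies could look as follows on synthetic data. The names `x_train` and `y_train` are hypothetical, and the GA fitness function (a nearest-centroid classifier with a small penalty per selected feature) is an illustrative stand-in, not the fitness used in the repository scripts. The Boruta and GA calls run only if those packages are installed.

```r
# Sketch of the three feature-selection strategies on synthetic data.
# Assumption: `x_train` / `y_train` are hypothetical training predictors
# and class labels; only rt1 and rt2 carry class information.
set.seed(1)
n <- 60
p <- 12
x_train <- as.data.frame(matrix(rnorm(n * p), nrow = n))
names(x_train) <- paste0("rt", seq_len(p))
y_train <- factor(rep(c("A", "B"), each = n / 2))
x_train$rt1 <- x_train$rt1 + ifelse(y_train == "B", 3, 0)
x_train$rt2 <- x_train$rt2 - ifelse(y_train == "B", 3, 0)

# Strategy 1: use all predictors
all_features <- names(x_train)

# Strategy 2: Boruta (random-forest-based all-relevant selection)
boruta_features <- NULL
if (requireNamespace("Boruta", quietly = TRUE)) {
  bor <- Boruta::Boruta(x = x_train, y = y_train, maxRuns = 50)
  boruta_features <- Boruta::getSelectedAttributes(bor, withTentative = FALSE)
}

# Strategy 3: genetic algorithm over binary feature masks.
# Illustrative fitness: resubstitution accuracy of a nearest-centroid
# classifier minus a penalty per selected feature. Resubstitution accuracy
# is optimistic; a cross-validated score would be preferable in practice.
centroid_fitness <- function(mask, x, y, penalty = 0.01) {
  if (sum(mask) == 0) return(0)
  xs <- as.matrix(x[, mask == 1, drop = FALSE])
  # Squared Euclidean distance of every sample to each class centroid
  sq_dist <- sapply(levels(y), function(l) {
    centre <- colMeans(xs[y == l, , drop = FALSE])
    rowSums(sweep(xs, 2, centre)^2)
  })
  pred <- levels(y)[max.col(-sq_dist, ties.method = "first")]
  mean(pred == as.character(y)) - penalty * sum(mask)
}

ga_features <- NULL
if (requireNamespace("GA", quietly = TRUE)) {
  ga_fit <- GA::ga(type = "binary",
                   fitness = function(mask) {
                     centroid_fitness(mask, x_train, y_train)
                   },
                   nBits = ncol(x_train), popSize = 20, maxiter = 30,
                   seed = 1, monitor = FALSE)
  # `solution` is a matrix; take the first best mask
  ga_features <- names(x_train)[ga_fit@solution[1, ] == 1]
}
```

Each resulting feature set (`all_features`, `boruta_features`, `ga_features`) would then feed the same downstream modeling steps, so that the strategies can be compared on equal terms.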

A standard machine learning workflow is applied:

Splitting the dataset into training (70%) and test (30%) sets.
Feature extraction is performed exclusively on the training set.
Data is scaled and centered after splitting and feature extraction.
Class balancing strategies such as undersampling and SMOTE are applied to the training set.
Model training includes hyperparameter tuning and cross-validation.
Evaluation is conducted on the test set to assess final performance.
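
The splitting, scaling, and balancing steps above can be illustrated in base R. This is a sketch on synthetic, imbalanced data, and all object names are hypothetical. The point it shows is that scaling parameters and class balancing are derived from the training set only, so no information from the test set leaks into training.

```r
# Sketch: stratified 70/30 split, train-only scaling, and undersampling.
# Assumption: synthetic data with a hypothetical majority class "A" and
# minority class "B"; object names are illustrative.
set.seed(123)
n_a <- 60
n_b <- 20
x <- matrix(rnorm((n_a + n_b) * 5, mean = 50, sd = 8), ncol = 5,
            dimnames = list(NULL, paste0("rt", 1:5)))
y <- factor(c(rep("A", n_a), rep("B", n_b)))

# 1. Stratified split: sample 70% of the indices within each class
train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}), use.names = FALSE)
x_train <- x[train_idx, , drop = FALSE]
y_train <- y[train_idx]
x_test  <- x[-train_idx, , drop = FALSE]
y_test  <- y[-train_idx]

# 2. Center and scale with statistics estimated on the training set only,
#    then apply the same transformation to the test set
mu   <- colMeans(x_train)
sdev <- apply(x_train, 2, sd)
x_train_sc <- scale(x_train, center = mu, scale = sdev)
x_test_sc  <- scale(x_test,  center = mu, scale = sdev)

# 3. Random undersampling of the training set down to the minority size
n_min <- min(table(y_train))
keep <- unlist(lapply(split(seq_along(y_train), y_train), function(idx) {
  sample(idx, size = n_min)
}), use.names = FALSE)
x_bal <- x_train_sc[keep, , drop = FALSE]
y_bal <- y_train[keep]

table(y_train)  # class counts before balancing
table(y_bal)    # class counts after balancing
```

With the packages listed for this project, the same steps are commonly written with `caret::createDataPartition()`, `caret::preProcess()`, and `themis::step_smote()` or `themis::step_downsample()`. Which of these the repository scripts call is not stated here.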

🖥️ Software and Dependencies

The analysis is conducted in R (v4.4.0) within Visual Studio Code. The required R packages are specified in requirements.txt, and include:

Data handling: readxl (v1.4.3), writexl (v1.5.1), dplyr (v1.1.4), tidyr (v1.3.1), data.table (v1.16.2)
Visualization: ggplot2 (v3.5.1), ggcorrplot (v0.1.4.1), gridExtra (v2.3), viridis (v0.6.5), ggrepel (v0.9.6), reshape2 (v1.4.4)
Clustering & Dimensionality Reduction: stats, factoextra (v1.0.7), cluster (v2.1.6)
Feature Selection: Boruta (v8.0.0), GA (v3.2.4), recipes (v1.1.0), themis (v1.0.2)
Machine Learning: caret (v6.0.94), doParallel (v1.0.17)

🚀 How to Use the Repository

Clone the repository

git clone https://github.com/Marta-Barea/tic-hs-gc-ims-waters-ml.git
cd tic-hs-gc-ims-waters-ml

Set up the R environment

Ensure you have R (v4.4.0 or later) installed. Install the required packages using the requirements.txt file.
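
One possible way to install the packages from an R session is sketched below. The layout of requirements.txt is an assumption: the sketch expects one package per line, with the package name first and an optional version specifier after it (for example `dplyr==1.1.4`). The helper names `parse_requirements` and `install_missing` are hypothetical.

```r
# Sketch: install packages listed in requirements.txt.
# Assumption: each non-empty line starts with a package name, optionally
# followed by a version specifier; text after "#" is treated as a comment.
parse_requirements <- function(lines) {
  lines <- trimws(sub("#.*$", "", lines))  # strip comments and whitespace
  lines <- lines[nzchar(lines)]            # drop empty lines
  # Keep only the leading package name (letters, digits, dots)
  pkgs <- regmatches(lines, regexpr("^[A-Za-z][A-Za-z0-9.]*", lines))
  unique(pkgs)
}

install_missing <- function(pkgs, repos = "https://cloud.r-project.org") {
  # Install only what is not already available in the library
  missing <- setdiff(pkgs, rownames(installed.packages()))
  if (length(missing) > 0) {
    install.packages(missing, repos = repos)
  }
  invisible(missing)
}

# Run from the repository root; does nothing if the file is absent
if (file.exists("requirements.txt")) {
  install_missing(parse_requirements(readLines("requirements.txt")))
}
```

This installs the current CRAN release of each package and ignores the version numbers. To reproduce the exact versions listed above, a tool such as `remotes::install_version()` can be used instead.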

🤝 Collaboration

This project is conducted in collaboration with the Department of Analytical Chemistry, University of Cádiz (UCA).

📜 License

This project is licensed under the GNU General Public License. See LICENSE for details.
