Large-scale dimensionality reduction on HPC clusters

This repository contains scripts and workflows to perform large-scale dimensionality reduction tasks using HPC resources on Athena and Ares clusters. The primary goal is to enable efficient computation and parallelism for handling high-dimensional datasets. Below are the steps, commands, and setup details for this repository.


Prerequisites

Requirements

  • A valid account on the Athena or Ares cluster:
    ssh [username]@ares.cyfronet.pl
    ssh [username]@athena.cyfronet.pl
  • Access to the dataset:
    ls -l /net/pr2/projects/plgrid/plgglscclass/geometricus_embeddings/X_concatenated_all_dims.npy
  • Familiarity with the SLURM job scheduler.

Environment Setup

  • Install Miniconda to manage Python environments.
  • Load necessary modules (specific to the cluster):
    module load miniconda3
    on Ares or
    module load Miniconda3
    on Athena.

Repository Structure

.
├── ares/                           # Directory for Ares cluster scripts and outputs
│   └── dim_red/                    # Dimensionality reduction scripts and results
│       ├── output_1node_max_run0/  # Results for 1-node maximum configuration
│       ├── output_2nodes_run0/     # Results for 2-node configuration
│       ├── dim_red_1node.sh        # SLURM script for 1 node
│       ├── dim_red_2nodes.sh       # SLURM script for 2 nodes
│       └── geom_emb_dim_red.ipynb  # Jupyter notebook for dimensionality reduction
├── athena/                         # Directory for Athena cluster scripts and outputs
│   └── dim_red/                    # Dimensionality reduction scripts and results
│       ├── output_1node_run0/      # Results for 1-node configuration
│       ├── output_2nodes_run0/     # Results for 2-node configuration
│       ├── dim_red_1node.sh        # SLURM script for 1 node
│       ├── dim_red_2nodes.sh       # SLURM script for 2 nodes
│       └── geom_emb_dim_red.py     # Python script for dimensionality reduction
└── README.md                       # This file

Workflow and Commands

1. Setting Up the Python Environment

  • (Recommended step) Configure conda to use your $SCRATCH storage space:
    conda config --add envs_dirs ${SCRATCH}/.conda/envs 
    conda config --add pkgs_dirs ${SCRATCH}/.conda/pkgs
  • Create a conda environment:
    conda create -n dim-reduction python=3.8 -y
    conda activate dim-reduction
  • Install required libraries:
    conda install -c conda-forge numpy pandas seaborn matplotlib scikit-learn umap-learn pacmap trimap
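
A quick way to confirm the environment is usable is to import the main libraries from inside the activated dim-reduction environment; any missing package fails loudly on import. A minimal check (not part of the repository, just a suggested sanity test):

# Verify that the dim-reduction environment imports cleanly.
import matplotlib
import numpy
import pacmap
import pandas
import seaborn
import sklearn
import trimap
import umap

for mod in (numpy, pandas, sklearn, umap, pacmap, trimap, matplotlib, seaborn):
    # Not every package exposes __version__, so fall back gracefully.
    print(f"{mod.__name__:12s} {getattr(mod, '__version__', 'version unknown')}")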

2. Preparing the Input Data

  • Check whether you have access to the data:
    ls -l /net/pr2/projects/plgrid/plgglscclass/geometricus_embeddings/X_concatenated_all_dims.npy
    ls -ld /net/pr2/projects/plgrid/plgglscclass/geometricus_embeddings
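
Because the embeddings array is large, it is convenient to inspect it without loading it fully into memory; NumPy's memory-mapped mode reads only the file header. A minimal sketch (the path is the shared dataset listed above; this snippet is not part of the repository):

import numpy as np

# Memory-map the array: the header is parsed, but the data stays on disk.
path = "/net/pr2/projects/plgrid/plgglscclass/geometricus_embeddings/X_concatenated_all_dims.npy"
X = np.load(path, mmap_mode="r")
print("shape:", X.shape, "dtype:", X.dtype, "approx. size:", X.nbytes / 1e9, "GB")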

3. Running Dimensionality Reduction Locally

  • Test scripts locally to ensure compatibility before deploying on the cluster (a minimal smoke-test sketch appears at the end of this section):

    python ares/dim_red/geom_emb_dim_red.py
  • You can also use the Jupyter notebook to test the code, either locally or on the cluster. To do so on the cluster:

    1. Request an interactive job on the cluster, e.g., on Ares:
    srun --time=2:00:00 --mem=64G --cpus-per-task=16 --ntasks=1 --partition=plgrid --account=[grantname]-cpu --pty /bin/bash
    2. Install and run a Jupyter server:
    conda install -c conda-forge jupyter
    hostname
    jupyter notebook --no-browser --port=[port-number] --ip=[hostname]

    Note: Use a port number greater than 1024 (e.g., 8888). Jupyter will display a connection URL like:

    http://ag0009:8888/?token=your_token_here
    

    Keep this URL handy; you will need it later.

    3. On your local machine:
    • Open Visual Studio Code.
    • Install the Remote - SSH extension by Microsoft (if not already installed).
    • Press Ctrl+Shift+P (or Cmd+Shift+P on macOS) to open the Command Palette.
    • Type "Remote-SSH: Connect to Host" and select it.
    • Connect to Ares using your SSH configuration.
    • In VS Code, navigate to ares/dim_red and open the geom_emb_dim_red.ipynb file.
    • When prompted to select a kernel, choose "Existing Jupyter Server".
    • Paste the Jupyter connection URL you obtained earlier (e.g., http://ag0009:8888/?token=your_token_here).
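
Whichever route you choose, a quick smoke test on small random data confirms that the installed reducers run end to end before you spend cluster hours on the full array. A minimal sketch with random stand-in data (the repository's geom_emb_dim_red.py and the notebook remain the authoritative pipeline):

import numpy as np
import pacmap
import trimap
import umap

# Small random stand-in for the real embeddings; the full dataset is far larger.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 64)).astype(np.float32)

for name, reducer in [
    ("UMAP", umap.UMAP(n_components=2)),
    ("PaCMAP", pacmap.PaCMAP(n_components=2)),
    ("TriMAP", trimap.TRIMAP(n_dims=2)),
]:
    embedding = reducer.fit_transform(X)
    print(f"{name}: {X.shape} -> {embedding.shape}")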

4. Running on HPC Clusters

Example SLURM Script for Ares (1 Node)

Save the following script as ares/dim_red/dim_red_1node.sh:

#!/bin/bash
#SBATCH --job-name=geom_emb_dim_red_1node
#SBATCH --output=dim_red_1node_%j.out
#SBATCH --error=dim_red_1node_%j.err
#SBATCH --time=72:00:00
#SBATCH --partition=plgrid
#SBATCH --account=[grantname]-cpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=48
#SBATCH --mem=184800

module load miniconda3
conda init
eval "$(conda shell.bash hook)"
conda activate dim-reduction

python geom_emb_dim_red.py
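
The script above gives a single task 48 CPU cores; rather than hard-coding thread counts, the Python side can read the allocation from SLURM's environment. A sketch of that pattern (SLURM_CPUS_PER_TASK is a standard SLURM variable, but whether geom_emb_dim_red.py uses it this way is an assumption, as is the input file name):

import os

import numpy as np
import umap

# Use the CPUs SLURM allocated to this task; default to 1 outside a SLURM job.
n_cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))

X = np.load("X_subset.npy")  # hypothetical input; the real job uses the shared embeddings

# n_jobs controls UMAP's internal parallelism (nearest-neighbour search, layout optimization).
reducer = umap.UMAP(n_components=2, n_jobs=n_cpus)
embedding = reducer.fit_transform(X)
np.save("umap_embedding_2d.npy", embedding)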

Example SLURM Script for Athena (2 Nodes)

Save the following script as athena/dim_red/dim_red_2nodes.sh:

#!/bin/bash
#SBATCH --job-name=dim_red_2nodes
#SBATCH --output=dim_red_2nodes_%j.out
#SBATCH --error=dim_red_2nodes_%j.err
#SBATCH --time=48:00:00
#SBATCH --partition=plgrid-gpu-a100
#SBATCH --account=[grantname]-gpu-a100
#SBATCH --nodes=2
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=800G
#SBATCH --gres=gpu:1

module load Miniconda3
conda init
eval "$(conda shell.bash hook)"
conda activate dim-reduction

python geom_emb_dim_red.py

Submitting the Job

Submit the job using:

sbatch dim_red/dim_red_1node.sh
sbatch dim_red/dim_red_2nodes.sh

5. Monitoring Job Progress

  • Check the status of all jobs:
    squeue
  • Check the status of your jobs:
    squeue -u $USER
  • View detailed job information:
    sacct -j <job_id>
  • Cancel the job:
    scancel <job_id>
  • Inspect logs:
    less dim_red/dim_red_1node_<job_id>.out
    less dim_red/dim_red_2nodes_<job_id>.out

Links and Resources


Notes

  • Modify memory and time requirements in the SLURM script according to the size of your dataset.
  • Use multi-node setups for larger datasets and adjust #SBATCH directives accordingly.
  • The first time you run conda, it might be necessary to initialize it with the command conda init bash (after which the shell needs to be reloaded).

Authors and Acknowledgments

  • Developed by purbancz (GitHub, X, LinkedIn, website).
  • Thanks to the Cyfronet HPC team for their documentation.

License

This project is licensed under the MIT License. See the LICENSE file for details.
