🚀 ML Project Template

A modern template for machine learning experimentation using wandb, hydra-zen, and submitit on a Slurm cluster with Docker/Apptainer containerization.

Note: This template is optimized for the ML Group cluster setup but can be easily adapted to similar environments.

Python 3.12 · Docker · WandB · hydra-zen · submitit

✨ Key Features

  • 📦 Python environment in Docker via uv
  • 📊 Logging and visualizations via Weights & Biases
  • 🧩 Reproducibility and modular, type-checked configs via hydra-zen
  • 🖥️ Submit Slurm jobs and parameter sweeps directly from Python via submitit
  • 🔄 No .def or .sh files needed for Apptainer/Slurm

📋 Table of Contents

  • 🐳 Container Setup
  • 📦 Package Management
  • 🔄 Updating the Docker Image
  • 🔑 Container Registry Authentication
  • 🛠️ Development Notes
  • 🧪 Running Experiments
  • 👥 Contributions
  • 🙏 Acknowledgements

🐳 Container Setup

Choose one of the following methods to set up your environment:

Option 1: Apptainer

  1. Configure environment bindings

    Add to your .zshrc or .bashrc:

    export APPTAINER_BIND=/opt/slurm-23.2,/opt/slurm,/etc/slurm,/etc/munge,/var/log/munge,/var/run/munge,/lib/x86_64-linux-gnu
    export APPTAINERENV_APPEND_PATH=/opt/slurm/bin:/opt/slurm/sbin
  2. Install VSCode Command Line Interface (Optional)

    This step is only required if you plan to connect through a remote tunnel. First, install the Remote Tunnels extension in VSCode.

  3. Connect to compute resources

    For CPU resources:

    srun --partition=cpu-2h --pty bash

    For GPU resources:

    srun --partition=gpu-2h --gpus-per-task=1 --pty bash
  4. Launch container

    To open a tunnel that connects your local VSCode to the container on the cluster:

    apptainer run --nv --writable-tmpfs docker://ghcr.io/marvinsxtr/ml-project-template:main code tunnel

    In VSCode, press Ctrl+Shift+P (Windows/Linux) or Shift+Cmd+P (Mac), type "Connect to Tunnel", select GitHub, and pick your named node on the cluster. Your IDE is now connected to the cluster.

    To open a shell in the container on the cluster:

    apptainer run --nv --writable-tmpfs docker://ghcr.io/marvinsxtr/ml-project-template:main /bin/bash

    💡 This may take a few minutes on the first run as the container image is downloaded.

Option 2: Docker

Run the container directly with:

docker run -it --rm --platform=linux/amd64 ghcr.io/marvinsxtr/ml-project-template:main /bin/bash

💡 You can specify a version tag (e.g., v0.0.1) instead of main. Available versions are listed in the GitHub Container Registry.

📦 Package Management

This project uses uv for Python dependency management.

Adding or Updating Dependencies

Run these inside the container (e.g., in a VSCode terminal attached to the container):

# Add a specific package
uv add <package-name>

# Sync all dependencies from pyproject.toml and uv.lock
uv sync

🔄 Updating the Docker Image

  1. Update dependencies using uv as described above

  2. Commit changes to the repository:

    Use tags for versioning:

    git add pyproject.toml uv.lock 
    git commit -m "Updated dependencies"
    git tag v0.0.1
    git push && git push --tags
  3. Use the updated image:

    The GitHub Actions workflow automatically builds a new image when changes are pushed.

    With Apptainer:

    apptainer run --nv --writable-tmpfs docker://ghcr.io/marvinsxtr/ml-project-template:v0.0.1 /bin/bash

    With Docker:

    docker run -it --rm --platform=linux/amd64 ghcr.io/marvinsxtr/ml-project-template:v0.0.1 /bin/bash

🔑 Container Registry Authentication

Generate Token

  1. Create a new GitHub token at Settings → Developer settings → Personal access tokens with:
    • read:packages permission
    • write:packages permission

Log In

With Apptainer:

apptainer remote login --username <your GitHub username> docker://ghcr.io

When prompted, enter your token as the password.

With Docker:

echo <your GitHub token> | docker login ghcr.io -u <your GitHub username> --password-stdin

πŸ› οΈ Development Notes

Building Locally for Testing

Test your Dockerfile locally before pushing:

docker buildx build -t ml-project-template .

🧪 Running Experiments

WandB Logging

Logging to WandB is optional for local jobs but mandatory for jobs submitted to the cluster.

Create a .env file in the root of the repository with:

WANDB_API_KEY=your_api_key_here
WANDB_ENTITY=your_entity
WANDB_PROJECT=your_project_name
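
The entry point can then pick these variables up from the environment. A minimal sketch of that pattern (assuming python-dotenv is available; the template's actual startup code may differ):

import os

import wandb
from dotenv import load_dotenv

load_dotenv()  # loads WANDB_API_KEY, WANDB_ENTITY, WANDB_PROJECT from .env

run = wandb.init(
    entity=os.environ["WANDB_ENTITY"],
    project=os.environ["WANDB_PROJECT"],
)  # the API key is read from the environment automatically
run.log({"loss": 0.0})
run.finish()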

Local Execution

Run a script locally with:

python src/ml_project_template/runs/main.py

Hydra will automatically generate a config.yaml in the outputs/<date>/<time>/.hydra folder which you can use to reproduce the same run later.
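
Under the hood, main.py follows the standard hydra-zen pattern: a config is auto-generated from the task function's signature and registered with Hydra. A minimal sketch with illustrative names (not the template's actual code):

from hydra_zen import store, zen

def task(seed: int = 0, lr: float = 1e-3) -> None:
    print(f"seed={seed}, lr={lr}")

# Generate a type-checked config from task's signature and register it
store(task, name="base")

if __name__ == "__main__":
    store.add_to_hydra_store()
    zen(task).hydra_main(config_name="base", config_path=None, version_base="1.3")

With this pattern, overrides such as seed=42 can be passed on the command line, just like the cfg/wandb overrides below.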

To enable WandB logging:

python src/ml_project_template/runs/main.py cfg/wandb=base

For WandB offline mode:

python src/ml_project_template/runs/main.py cfg/wandb=base cfg.wandb.mode=offline

Single Job

To run a job on the cluster:

python src/ml_project_template/runs/main.py cfg/job=base

This will automatically enable WandB logging. See src/ml_project_template/configs/runs/base.py to configure the job settings.
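
submitit handles the Slurm submission from Python, which is why no .sh or sbatch files are needed. Roughly, the pattern looks like this (a sketch with assumed partition and resource values, not the template's exact settings):

import submitit

def train() -> None:
    ...  # your experiment entry point

executor = submitit.AutoExecutor(folder="outputs/submitit")
executor.update_parameters(
    slurm_partition="gpu-2h",
    gpus_per_node=1,
    timeout_min=120,
)

job = executor.submit(train)
print(job.job_id)  # Slurm job ID
job.result()       # blocks until the job finishes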

Distributed Sweep

Run a parameter sweep over multiple seeds using multiple nodes:

python src/ml_project_template/runs/main.py cfg/job=sweep

This will automatically enable WandB logging. See src/ml_project_template/configs/runs/base.py to configure sweep parameters.
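
A sweep follows the same idea but maps the run function over a list of values as a Slurm job array. A sketch of a seed sweep (hypothetical values; the real parameters live in the config file above):

import submitit

def run(seed: int) -> float:
    ...  # train with this seed and return a metric

executor = submitit.AutoExecutor(folder="outputs/sweep")
executor.update_parameters(slurm_partition="cpu-2h", slurm_array_parallelism=4)

jobs = executor.map_array(run, [0, 1, 2, 3])  # one array task per seed
results = [job.result() for job in jobs]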

👥 Contributions

Contributions to this documentation and template are very welcome! Feel free to open a PR or reach out with suggestions.

πŸ™ Acknowledgements

This template is based on a previous example project.
