A modern template for machine learning experimentation using wandb, hydra-zen, and submitit on a Slurm cluster with Docker/Apptainer containerization.
Note: This template is optimized for the ML Group cluster setup but can be easily adapted to similar environments.
- 📦 Python environment in Docker via uv
- 📊 Logging and visualizations via Weights & Biases
- 🧩 Reproducibility and modular type-checked configs via hydra-zen
- 🖥️ Submit Slurm jobs and parameter sweeps directly from Python via submitit
- 🚀 No `.def` or `.sh` files needed for Apptainer/Slurm
- Container Setup
- Package Management
- Updating the Docker Image
- Container Registry Authentication
- Development Notes
- Running Experiments
- Contributions
- Acknowledgements
Choose one of the following methods to set up your environment:
- **Configure environment bindings**

  Add to your `.zshrc` or `.bashrc`:

  ```sh
  export APPTAINER_BIND=/opt/slurm-23.2,/opt/slurm,/etc/slurm,/etc/munge,/var/log/munge,/var/run/munge,/lib/x86_64-linux-gnu
  export APPTAINERENV_APPEND_PATH=/opt/slurm/bin:/opt/slurm/sbin
  ```
- **Install the VSCode Command Line Interface (Optional)**

  This step is only required if you plan to create a remote tunnel. First, install the Remote Tunnels extension in VSCode.
- **Connect to compute resources**

  For CPU resources:

  ```sh
  srun --partition=cpu-2h --pty bash
  ```

  For GPU resources:

  ```sh
  srun --partition=gpu-2h --gpus-per-task=1 --pty bash
  ```
- **Launch container**

  To open a tunnel connecting your local VSCode to the container on the cluster:

  ```sh
  apptainer run --nv --writable-tmpfs docker://ghcr.io/marvinsxtr/ml-project-template:main code tunnel
  ```

  In VSCode, press `Ctrl+Shift+P` (Windows/Linux) or `Shift+Cmd+P` (Mac), type "connect to tunnel", select GitHub, and select your named node on the cluster. Your IDE is now connected to the cluster.

  To open a shell in the container on the cluster:

  ```sh
  apptainer run --nv --writable-tmpfs docker://ghcr.io/marvinsxtr/ml-project-template:main /bin/bash
  ```
💡 This may take a few minutes on the first run as the container image is downloaded.
Run the container directly with:
```sh
docker run -it --rm --platform=linux/amd64 ghcr.io/marvinsxtr/ml-project-template:main /bin/bash
```

💡 You can specify a version tag (e.g., `v0.0.1`) instead of `main`. Available versions are listed at the GitHub Container Registry.
This project uses uv for Python dependency management.
Inside the container (e.g., a VSCode shell attached to the Docker container):

```sh
# Add a specific package
uv add <package-name>

# Sync the environment with the dependencies in pyproject.toml
uv sync
```
- **Update dependencies** using `uv` as described above

- **Commit changes to the repository**

  Use tags for versioning:

  ```sh
  git add pyproject.toml uv.lock
  git commit -m "Updated dependencies"
  git tag v0.0.1
  git push && git push --tags
  ```
- **Use the updated image**

  The GitHub Actions workflow automatically builds a new image when changes are pushed.

  With Apptainer:

  ```sh
  apptainer run --nv --writable-tmpfs docker://ghcr.io/marvinsxtr/ml-project-template:v0.0.1 /bin/bash
  ```

  With Docker:

  ```sh
  docker run -it --rm --platform=linux/amd64 ghcr.io/marvinsxtr/ml-project-template:v0.0.1 /bin/bash
  ```
- Create a new GitHub token at Settings → Developer settings → Personal access tokens with:
  - `read:packages` permission
  - `write:packages` permission
With Apptainer:

```sh
apptainer remote login --username <your GitHub username> docker://ghcr.io
```

When prompted, enter your token as the password.

With Docker:

```sh
echo <your GitHub token> | docker login ghcr.io -u <your GitHub username> --password-stdin
```
Test your Dockerfile locally before pushing:

```sh
docker buildx build -t ml-project-template .
```
Logging to WandB is optional for local jobs but mandatory for jobs submitted to the cluster.
Create a `.env` file in the root of the repository with:

```sh
WANDB_API_KEY=your_api_key_here
WANDB_ENTITY=your_entity
WANDB_PROJECT=your_project_name
```
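Once these variables are set, `wandb` picks them up from the environment automatically. As a minimal illustration (assuming the `.env` has been loaded into the process environment, e.g. by your shell or `python-dotenv`; this is not the template's actual logging code):

```python
# Minimal sketch: wandb resolves WANDB_API_KEY, WANDB_ENTITY, and
# WANDB_PROJECT from the environment, so init() needs no explicit arguments.
import wandb

run = wandb.init()      # entity/project come from the .env variables above
run.log({"loss": 0.5})  # log a metric to the WandB dashboard
run.finish()
```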
Run a script locally with:

```sh
python src/ml_project_template/runs/main.py
```

Hydra will automatically generate a `config.yaml` in the `outputs/<date>/<time>/.hydra` folder which you can use to reproduce the same run later.
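For orientation, here is a minimal sketch of how a hydra-zen entry point like this is typically wired; the names below are illustrative, not the template's actual code (see `src/ml_project_template/runs/main.py` for that):

```python
# Hypothetical hydra-zen entry point; illustrative names only.
from hydra_zen import store, zen

def run(seed: int = 42, lr: float = 1e-3) -> None:
    # Task function: hydra-zen generates a type-checked config from its signature.
    print(f"seed={seed}, lr={lr}")

store(run, name="base")  # register an auto-generated config named "base"

if __name__ == "__main__":
    store.add_to_hydra_store()
    # Expose the task through Hydra's CLI; overrides like `seed=7` now work,
    # and each run writes its resolved config to outputs/<date>/<time>/.hydra.
    zen(run).hydra_main(config_name="base", config_path=None, version_base="1.3")
```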
To enable WandB logging:

```sh
python src/ml_project_template/runs/main.py cfg/wandb=base
```

For WandB offline mode:

```sh
python src/ml_project_template/runs/main.py cfg/wandb=base cfg.wandb.mode=offline
```
To run a job on the cluster:

```sh
python src/ml_project_template/runs/main.py cfg/job=base
```

This will automatically enable WandB logging. See `src/ml_project_template/configs/runs/base.py` to configure the job settings.
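Under the hood, submitit lets you submit such jobs from Python without writing any `.sh` files. A minimal, hypothetical sketch of a single-job submission (illustrative names; the template drives this through its configs rather than inline code like this):

```python
# Hypothetical sketch: submit one function call as a Slurm job via submitit.
import submitit

executor = submitit.AutoExecutor(folder="outputs/submitit")  # logs/pickles go here
executor.update_parameters(
    slurm_partition="gpu-2h",  # matches the partition used with srun above
    gpus_per_node=1,
    timeout_min=120,
)

def train(seed: int) -> str:
    return f"trained with seed={seed}"  # placeholder for real training code

job = executor.submit(train, 42)  # schedules the Slurm job
print(job.result())               # blocks until the job finishes
```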
Run a parameter sweep over multiple seeds using multiple nodes:

```sh
python src/ml_project_template/runs/main.py cfg/job=sweep
```

This will automatically enable WandB logging. See `src/ml_project_template/configs/runs/base.py` to configure sweep parameters.
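A sweep maps the same idea over many values as a Slurm job array. Again a minimal sketch under the same assumptions, not the template's actual sweep implementation:

```python
# Hypothetical sketch: run a seed sweep as a Slurm job array via submitit.
import submitit

executor = submitit.AutoExecutor(folder="outputs/submitit")
executor.update_parameters(slurm_partition="gpu-2h", gpus_per_node=1, timeout_min=120)

def train(seed: int) -> float:
    return 0.0  # placeholder for a training run returning a metric

jobs = executor.map_array(train, [0, 1, 2, 3])  # one array task per seed
results = [job.result() for job in jobs]        # gather results from all tasks
```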
Contributions to this documentation and template are very welcome! Feel free to open a PR or reach out with suggestions.
This template is based on a previous example project.