Matepoint is a fork of PyTorch's `torch.utils.checkpoint` that lets you use CPU RAM when you're low on GPU VRAM. While standard checkpointing trades computation for memory by recomputing activations during the backward pass, Matepoint takes this further by:
- Automatically offloading activation tensors to CPU after the forward pass
- Efficiently moving tensors back to GPU only when needed during the backward pass
- Supporting pipelined tensor transfers for better performance
- Providing optional CPU memory pooling for large, similarly-shaped tensors
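To make the offload/reload idea concrete, here is a minimal sketch using PyTorch's built-in `torch.autograd.graph.saved_tensors_hooks`. This illustrates the underlying technique only; it is not Matepoint's implementation, which layers checkpointing, pipelining, and pooling on top:

```python
import torch

def pack_to_cpu(tensor):
    # Called when autograd saves an activation for backward:
    # ship it off to CPU RAM instead of keeping it on the GPU.
    return tensor.to("cpu")

def unpack_to_gpu(packed):
    # Called when backward actually needs the activation:
    # bring it back to the GPU just in time.
    return packed.to("cuda")

x = torch.randn(8, 1024, device="cuda", requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_to_gpu):
    y = x.sin().sum()  # the saved activation lives in CPU RAM
y.backward()           # and returns to the GPU only during backward
```

PyTorch also ships a ready-made context manager for this pattern, `torch.autograd.graph.save_on_cpu(pin_memory=True)`; Matepoint's value is doing the same offload inside checkpointing, with transfers pipelined so they overlap with computation.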
Replace your existing `torch.utils.checkpoint` calls with `matepoint`:
```python
from matepoint import checkpoint
# Instead of:
# from torch.utils.checkpoint import checkpoint

def forward(self, x):
    # Use exactly like torch.utils.checkpoint
    out = checkpoint(self.layer, x, use_reentrant=False)
    return out
```
- PyTorch >= 2.4.0
- CUDA-capable GPU
- Sufficient CPU memory for activation storage
```bash
pip install --index-url https://test.pypi.org/simple/ matepoint
```
To build and publish a new release to TestPyPI:

```bash
rm -rf dist/ build/ *.egg-info
python setup.py sdist bdist_wheel
twine upload --repository testpypi dist/*
# if needed, specify the exact version:
# twine upload --repository testpypi dist/matepoint-0.1.7*
```
Refer to the Matepoint section in this blog post for more details on the implementation and performance benefits.
This project is licensed under the MIT License - see the LICENSE file for details.
We actually built Matepoint when we were running out of VRAM trying to solve weather(™) with transformers. While our model, WeatherMesh, isn't huge (~180M parameters), forecasting weather for the entire planet over 6 days means running through 200+ transformer layers.
Without some clever tricks, we'd need hundreds of GiB of VRAM. Even regular checkpointing wasn't enough: storing a ~200 MiB latent tensor for each of those 200+ transformer blocks would eat up around 40 GiB of VRAM, more than even an RTX 4090 can handle.
Matepoint ships those tensors off to CPU RAM when we don't need them, then brings them back just in time during the backward pass. Adding more forecast days costs almost nothing in VRAM terms. This meant we could train our whole weather model on consumer RTX 4090s instead of shelling out for pricier hardware.
Check out these visualizations to see Matepoint in action:
Matepoint pipelines tensor transfers by default, overlapping CPU-GPU data movement with computation for better performance. You can disable this optimization if needed:
```python
import matepoint

matepoint.NOPIPELINE = True  # Disable pipelined tensor transfers
```
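For the curious, overlap like this is conventionally built on a side CUDA stream plus pinned host memory. The sketch below shows the general technique in plain PyTorch; it is illustrative only and not Matepoint's internal code:

```python
import torch

copy_stream = torch.cuda.Stream()  # side stream dedicated to transfers

# Pinned (page-locked) host memory is required for truly async H2D copies
cpu_activation = torch.randn(1024, 1024).pin_memory()
weight = torch.randn(1024, 1024, device="cuda")

with torch.cuda.stream(copy_stream):
    # The copy runs asynchronously on the side stream...
    gpu_activation = cpu_activation.to("cuda", non_blocking=True)

# ...while unrelated compute proceeds on the default stream
partial = weight @ weight

# Synchronize before consuming the transferred tensor
torch.cuda.current_stream().wait_stream(copy_stream)
out = partial + gpu_activation
```

The `wait_stream` call is what keeps this safe: the default stream won't touch `gpu_activation` until the copy stream has finished writing it.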