
Install PyTorch for ROCM instead of CPU-only #1032

Draft · marbre wants to merge 1 commit into main
Conversation

@marbre (Collaborator) commented Mar 4, 2025

These workflows run on MI300 machines but install a CPU-only version of PyTorch instead of installing the ROCm enabled one.

marbre marked this pull request as draft March 4, 2025 10:32
@marbre (Collaborator, Author) commented Mar 4, 2025

While these tests use a runner with an MI300, they might not use torch+ROCm to run anything on the GPU. This PR would switch all of those workflows to PyTorch+ROCm. Instead of switching in our workflows (if not needed), we might want to be more specific in the developer_guide.md#install-pytorch-for-your-system docs.

pip install --no-compile -r pytorch-cpu-requirements.txt
pip install --no-compile -r pytorch-rocm-requirements.txt
A project member commented:

Our CI should install what is minimally needed, and our stack is designed from the ground up to avoid dependencies on kernel libraries and the other bloat that makes its way into ML frameworks. Users can install whatever they want, for example if they are mixing stock PyTorch with our packages.

I would be okay with deleting https://github.com/nod-ai/shark-ai/blob/main/pytorch-rocm-requirements.txt and instead directing users to either the official PyTorch/ROCm install instructions or linking to our other recommendations (e.g. https://github.com/nod-ai/TheRock/, once it is ready/tested).
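For reference, judging from the install logs quoted further down ("Looking in indexes: https://download.pytorch.org/whl/rocm6.2" and "Collecting torch>=2.3.0 ... (line 2)"), the requirements file under discussion presumably looks roughly like this (a reconstruction, not a verbatim copy of the repo file):

```
--index-url https://download.pytorch.org/whl/rocm6.2
torch>=2.3.0
```

Deleting it would mean users pick an index URL themselves, e.g. by following the official PyTorch install instructions for their ROCm version.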


See how much this slows down CI:

  • Before: https://github.com/nod-ai/shark-ai/actions/runs/13645323949/job/38143083590#step:6:35

    40s for "Install pip deps"

    Tue, 04 Mar 2025 03:05:47 GMT Looking in indexes: https://download.pytorch.org/whl/cpu/
    Tue, 04 Mar 2025 03:05:48 GMT Collecting torch==2.3.0 (from -r pytorch-cpu-requirements.txt (line 2))
    Tue, 04 Mar 2025 03:05:48 GMT   Downloading https://download.pytorch.org/whl/cpu/torch-2.3.0%2Bcpu-cp311-cp311-linux_x86_64.whl (190.4 MB)
    Tue, 04 Mar 2025 03:05:49 GMT      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 190.4/190.4 MB 215.3 MB/s eta 0:00:00
    
  • After: https://github.com/nod-ai/shark-ai/actions/runs/13648884431/job/38152825413?pr=1032#step:6:35

    2m40s for "Install pip deps", downloading 4GB+

    Tue, 04 Mar 2025 08:00:18 GMT Looking in indexes: https://download.pytorch.org/whl/rocm6.2
    Tue, 04 Mar 2025 08:00:18 GMT Collecting torch>=2.3.0 (from -r pytorch-rocm-requirements.txt (line 2))
    Tue, 04 Mar 2025 08:00:18 GMT   Downloading https://download.pytorch.org/whl/rocm6.2/torch-2.5.1%2Brocm6.2-cp311-cp311-linux_x86_64.whl (3973.6 MB)
    Tue, 04 Mar 2025 08:00:45 GMT      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.0/4.0 GB 24.3 MB/s eta 0:00:00
    

A Collaborator commented:

Agree with Scott here. sharktank uses PyTorch only for very minimal dependencies. All CI jobs changed here use the GPUs, but not in eager mode, so they lose nothing without torch+ROCm.
I am currently working on enabling ci_eval.yaml to use the GPU in eager mode, which will require this change in the future.
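The policy the reviewers converge on can be sketched as a tiny helper (hypothetical; `pick_requirements` is not part of the repo, just an illustration of the decision rule):

```python
# Hypothetical helper illustrating the policy discussed above: CI jobs that
# only drive the GPU through compiled runtimes keep the small CPU wheel
# (~190 MB download); only eager-mode GPU tests need the 4 GB+ ROCm build.
def pick_requirements(needs_eager_gpu: bool) -> str:
    if needs_eager_gpu:
        return "pytorch-rocm-requirements.txt"
    return "pytorch-cpu-requirements.txt"


if __name__ == "__main__":
    print(pick_requirements(False))  # pytorch-cpu-requirements.txt
    print(pick_requirements(True))   # pytorch-rocm-requirements.txt
```

Under this rule, only a future eager-mode job like ci_eval.yaml would switch requirements files; everything else stays on the CPU wheel.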
