
[proposal] support GPU topology-aware scheduling #1116

Closed
Tracked by #332
happy2048 opened this issue Mar 14, 2023 · 6 comments

@happy2048

What is your proposal:
In a distributed deep learning job, the workers exchange data with each other, so the bandwidth between the GPU cards assigned to them directly affects training time. The native Kubernetes scheduler can allocate GPU cards to the workers of a training job, but it does not take the bandwidth between those cards into account. This proposal adds a scheduling plugin that considers the bandwidth within a group of GPU cards when allocating that group to pods.

Why is this needed:
The NVIDIA Collective Communications Library (NCCL) is a Magnum IO library that provides GPU-accelerated collective operations. NCCL is topology-aware: it automatically detects the connection type between GPU cards with no manual configuration, and it is optimized for high bandwidth and low latency over PCIe, NVLink, Ethernet, and InfiniBand interconnects. In distributed deep learning training jobs, frameworks such as PyTorch and MPI combined with NCCL deliver this acceleration. Because different connection types provide different bandwidths, the bandwidth between the allocated GPU cards directly affects the training time of the job.

The following is a matrix describing the bandwidth between the 8 GPU cards on a node; values are in GB/s:

Bandwidth Matrix:
       gpu_0   gpu_1   gpu_2   gpu_3   gpu_4   gpu_5   gpu_6   gpu_7
gpu_0  750.48  48.39   48.33   96.41   15.77   15.52   96.40   15.74
gpu_1  48.39   753.38  96.46   48.38   4.64    16.93   16.98   96.39
gpu_2  48.38   96.25   751.92  96.48   48.39   17.57   17.59   16.72
gpu_3  96.25   48.39   96.43   750.48  15.45   48.39   15.88   14.62
gpu_4  5.00    16.81   48.38   15.98   755.56  96.39   48.38   96.44
gpu_5  15.80   16.93   17.50   48.39   96.25   751.92  96.23   48.38
gpu_6  96.42   16.75   17.47   15.89   48.35   96.28   754.10  48.33
gpu_7  15.65   96.20   16.77   15.71   96.25   48.38   48.33   754.83

If a distributed training job has 2 Pods and each Pod requests 2 GPU cards, then the combination [gpu0, gpu1, gpu2, gpu3] should be preferred over [gpu0, gpu1, gpu2, gpu4]: the bottleneck bandwidth of the former is 48.33 (the bottleneck bandwidth of a group of GPU cards is the minimum bandwidth between any two GPUs in the group), while the bottleneck bandwidth of the latter is only 4.64 (the gpu1-gpu4 link). Allocating the latter to the training job would significantly increase its training time.
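
For illustration, here is a minimal sketch in Go (this is not the koord-scheduler plugin API; the function names and the brute-force enumeration are assumptions made only for this example) of how the bottleneck bandwidth of a candidate GPU group could be computed from such a matrix and used to pick the best group:

```go
package main

import (
	"fmt"
	"math"
)

// bottleneckBandwidth returns the minimum pairwise bandwidth (GB/s) within the
// selected group of GPUs. If the measured matrix is asymmetric, the smaller of
// the two directions is used.
func bottleneckBandwidth(bw [][]float64, gpus []int) float64 {
	minBW := math.MaxFloat64
	for i := 0; i < len(gpus); i++ {
		for j := i + 1; j < len(gpus); j++ {
			b := math.Min(bw[gpus[i]][gpus[j]], bw[gpus[j]][gpus[i]])
			if b < minBW {
				minBW = b
			}
		}
	}
	return minBW
}

// bestGroup enumerates every k-sized combination of the free GPUs and returns
// the one with the highest bottleneck bandwidth. With at most 8 GPUs per node
// the search space is tiny, so brute force is fine for a sketch.
func bestGroup(bw [][]float64, free []int, k int) ([]int, float64) {
	var best []int
	bestBW := -1.0
	var walk func(start int, picked []int)
	walk = func(start int, picked []int) {
		if len(picked) == k {
			if b := bottleneckBandwidth(bw, picked); b > bestBW {
				bestBW = b
				best = append([]int(nil), picked...)
			}
			return
		}
		for i := start; i < len(free); i++ {
			walk(i+1, append(append([]int(nil), picked...), free[i]))
		}
	}
	walk(0, nil)
	return best, bestBW
}

func main() {
	// Bandwidth matrix from the node above (GB/s), indexed as bw[src][dst].
	bw := [][]float64{
		{750.48, 48.39, 48.33, 96.41, 15.77, 15.52, 96.40, 15.74},
		{48.39, 753.38, 96.46, 48.38, 4.64, 16.93, 16.98, 96.39},
		{48.38, 96.25, 751.92, 96.48, 48.39, 17.57, 17.59, 16.72},
		{96.25, 48.39, 96.43, 750.48, 15.45, 48.39, 15.88, 14.62},
		{5.00, 16.81, 48.38, 15.98, 755.56, 96.39, 48.38, 96.44},
		{15.80, 16.93, 17.50, 48.39, 96.25, 751.92, 96.23, 48.38},
		{96.42, 16.75, 17.47, 15.89, 48.35, 96.28, 754.10, 48.33},
		{15.65, 96.20, 16.77, 15.71, 96.25, 48.38, 48.33, 754.83},
	}
	fmt.Println(bottleneckBandwidth(bw, []int{0, 1, 2, 3})) // 48.33
	fmt.Println(bottleneckBandwidth(bw, []int{0, 1, 2, 4})) // 4.64
	fmt.Println(bestGroup(bw, []int{0, 1, 2, 3, 4, 5, 6, 7}, 4))
}
```

Run against the matrix above, [gpu0, gpu1, gpu2, gpu3] scores 48.33 while [gpu0, gpu1, gpu2, gpu4] scores 4.64, matching the example; a real plugin would feed this scoring into the scheduler's extension points rather than run as a standalone program.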

Is there a suggested solution, if so, please add it:

#1115

@happy2048 happy2048 added the kind/proposal label Mar 14, 2023
happy2048 added a commit to happy2048/koordinator that referenced this issue Mar 14, 2023
@saintube
Member

/area koord-scheduler

@jasonliu747 jasonliu747 changed the title from "[proposal]Supporting GPU topology-aware scheduling" to "[proposal] support GPU topology-aware scheduling" Mar 17, 2023
@stale

stale bot commented Jun 15, 2023

This issue has been automatically marked as stale because it has not had recent activity.
This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Close this issue or PR with /close

Thank you for your contributions.

@eahydra
Member

eahydra commented Jul 11, 2023

/remove-lifecycle stale

@stale

stale bot commented Oct 9, 2023

This issue has been automatically marked as stale because it has not had recent activity.
This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Close this issue or PR with /close

Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Oct 9, 2023

stale bot commented Nov 8, 2023

This issue has been automatically closed because it has not had recent activity.
This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the issue is closed

You can:

  • Reopen this PR with /reopen

Thank you for your contributions.

@stale stale bot closed this as completed Nov 8, 2023
@eahydra eahydra reopened this Dec 4, 2023

stale bot commented Jan 3, 2024

This issue has been automatically closed because it has not had recent activity.
This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the issue is closed

You can:

  • Reopen this PR with /reopen

Thank you for your contributions.

@stale stale bot closed this as completed Jan 3, 2024