[proposal] support GPU topology-aware scheduling #1116
What is your proposal:
In a distributed deep learning job, the workers of the training job exchange data with each other, so the bandwidth between the GPU cards assigned to them directly affects training time. The native Kubernetes scheduler can allocate GPU cards to the workers of a training job, but it does not consider the bandwidth between those cards. This proposal adds a scheduling plugin that takes the bandwidth between a group of GPU cards into account when allocating that group to Pods. A minimal sketch of the selection logic such a plugin could use is shown below.
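The following is only an illustrative sketch, not the actual koordinator API: it assumes a per-node pairwise GPU bandwidth matrix is available (for example reported by the device plugin), brute-forces all GPU groups of the requested size, and prefers the group with the highest bottleneck bandwidth.

```go
package main

import "fmt"

// bottleneckBandwidth returns the minimum pairwise bandwidth (GB/s) within a
// group of GPU indices, i.e. the bandwidth of the weakest link in the group.
func bottleneckBandwidth(group []int, bw [][]float64) float64 {
	min := -1.0
	for i := 0; i < len(group); i++ {
		for j := i + 1; j < len(group); j++ {
			if b := bw[group[i]][group[j]]; min < 0 || b < min {
				min = b
			}
		}
	}
	return min
}

// bestGPUGroup brute-forces all groups of size `need` and returns the group
// whose bottleneck bandwidth is largest. This is fine for 8 GPUs per node;
// a real plugin would likely prune using the known topology (NVLink islands,
// PCIe switches) instead of enumerating every combination.
func bestGPUGroup(bw [][]float64, need int) (best []int, bestBW float64) {
	n := len(bw)
	group := make([]int, 0, need)
	var rec func(start int)
	rec = func(start int) {
		if len(group) == need {
			if b := bottleneckBandwidth(group, bw); b > bestBW {
				bestBW = b
				best = append([]int(nil), group...)
			}
			return
		}
		for i := start; i < n; i++ {
			group = append(group, i)
			rec(i + 1)
			group = group[:len(group)-1]
		}
	}
	rec(0)
	return best, bestBW
}

func main() {
	// Hypothetical 4-GPU bandwidth matrix (GB/s), for illustration only.
	bw := [][]float64{
		{0, 48, 48, 5},
		{48, 0, 48, 5},
		{48, 48, 0, 5},
		{5, 5, 5, 0},
	}
	group, b := bestGPUGroup(bw, 2)
	fmt.Printf("best group %v, bottleneck %.2f GB/s\n", group, b)
}
```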
Why is this needed:
The NVIDIA Collective Communication Library (NCCL) is a Magnum IO library from NVIDIA that provides GPU-accelerated collective operations. NCCL is topology-aware (it automatically detects the connection type between GPU cards, with no manual configuration required) and is optimized to achieve high bandwidth and low latency over PCIe, NVLink, Ethernet, and InfiniBand interconnects. In distributed deep learning training jobs, distributed training frameworks (PyTorch, MPI) combined with NCCL provide this acceleration. Because NCCL communicates over whatever links connect the allocated GPU cards, and different connection types provide very different bandwidths, the available bandwidth directly affects training time.
The following matrix describes the bandwidth between the 8 GPU cards on a node; values are in GB/s:
If a distributed training job has 2 Pods, and each Pod requests 2 GPU cards, then the combination [gpu0, gpu1, gpu2, gpu3] should be preferred over [gpu0, gpu1, gpu2, gpu5]: the bottleneck bandwidth of the former is 48.33, while that of the latter is 4.64 (bottleneck bandwidth is the minimum bandwidth between any two GPUs in the group). Allocating the latter to the training job would significantly increase training time. See the snippet after this paragraph for the arithmetic spelled out.
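To make the comparison concrete, here is a small, self-contained Go snippet that reproduces it under a simplified, hypothetical topology: GPUs 0-3 and GPUs 4-7 are each assumed to form one fully connected high-bandwidth island at 48.33 GB/s, and cross-island pairs fall back to PCIe at 4.64 GB/s. These values only mirror the numbers used in the example above, not a real measured matrix.

```go
package main

import "fmt"

// pairBandwidth returns an illustrative bandwidth (GB/s) between two GPUs:
// GPUs 0-3 and GPUs 4-7 are each assumed to be on one NVLink island
// (48.33 GB/s); pairs that cross islands fall back to PCIe (4.64 GB/s).
func pairBandwidth(a, b int) float64 {
	if (a < 4) == (b < 4) {
		return 48.33
	}
	return 4.64
}

// bottleneck returns the minimum pairwise bandwidth inside a GPU group.
func bottleneck(group []int) float64 {
	min := -1.0
	for i := 0; i < len(group); i++ {
		for j := i + 1; j < len(group); j++ {
			if b := pairBandwidth(group[i], group[j]); min < 0 || b < min {
				min = b
			}
		}
	}
	return min
}

func main() {
	fmt.Printf("[gpu0 gpu1 gpu2 gpu3]: %.2f GB/s\n", bottleneck([]int{0, 1, 2, 3})) // 48.33
	fmt.Printf("[gpu0 gpu1 gpu2 gpu5]: %.2f GB/s\n", bottleneck([]int{0, 1, 2, 5})) // 4.64
}
```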
Is there a suggested solution? If so, please add it:
#1115