[proposal] support GPU topology-aware scheduling #1116
What is your proposal:
In a distributed deep learning job, the workers of the training job exchange data with each other, so the bandwidth between the GPU cards assigned to them directly affects training time. The native Kubernetes scheduler can allocate GPU cards to the workers of a training job, but it does not consider the bandwidth between those cards. This proposal adds a scheduling plugin that takes the bandwidth between a group of GPU cards into account when allocating that group to Pods. A minimal sketch of the selection logic such a plugin could use is shown below.
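The following is only an illustrative sketch, not the actual koordinator API: it assumes a per-node pairwise GPU bandwidth matrix is available (for example reported by the device plugin), brute-forces all GPU groups of the requested size, and prefers the group with the highest bottleneck bandwidth.

```go
package main

import "fmt"

// bottleneckBandwidth returns the minimum pairwise bandwidth (GB/s) within a
// group of GPU indices, i.e. the bandwidth of the weakest link in the group.
func bottleneckBandwidth(group []int, bw [][]float64) float64 {
	min := -1.0
	for i := 0; i < len(group); i++ {
		for j := i + 1; j < len(group); j++ {
			if b := bw[group[i]][group[j]]; min < 0 || b < min {
				min = b
			}
		}
	}
	return min
}

// bestGPUGroup brute-forces all groups of size `need` and returns the group
// whose bottleneck bandwidth is largest. This is fine for 8 GPUs per node;
// a real plugin would likely prune using the known topology (NVLink islands,
// PCIe switches) instead of enumerating every combination.
func bestGPUGroup(bw [][]float64, need int) (best []int, bestBW float64) {
	n := len(bw)
	group := make([]int, 0, need)
	var rec func(start int)
	rec = func(start int) {
		if len(group) == need {
			if b := bottleneckBandwidth(group, bw); b > bestBW {
				bestBW = b
				best = append([]int(nil), group...)
			}
			return
		}
		for i := start; i < n; i++ {
			group = append(group, i)
			rec(i + 1)
			group = group[:len(group)-1]
		}
	}
	rec(0)
	return best, bestBW
}

func main() {
	// Hypothetical 4-GPU bandwidth matrix (GB/s), for illustration only.
	bw := [][]float64{
		{0, 48, 48, 5},
		{48, 0, 48, 5},
		{48, 48, 0, 5},
		{5, 5, 5, 0},
	}
	group, b := bestGPUGroup(bw, 2)
	fmt.Printf("best group %v, bottleneck %.2f GB/s\n", group, b)
}
```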
Why is this needed:
The NVIDIA Collective Communication Library (NCCL) is a Magnum IO library from NVIDIA that provides GPU-accelerated collective operations. NCCL is topology-aware (it automatically detects the connection type between GPU cards, with no manual configuration required) and is optimized to achieve high bandwidth and low latency over PCIe, NVLink, Ethernet, and InfiniBand interconnects. In distributed deep learning training jobs, distributed training frameworks (PyTorch, MPI) combined with NCCL provide this acceleration. Because NCCL communicates over whatever links connect the allocated GPU cards, and different connection types provide very different bandwidths, the available bandwidth directly affects training time.
The following matrix describes the bandwidth between the 8 GPU cards on a node; values are in GB/s:
If a distributed training job has 2 Pods, and each Pod requests 2 GPU cards, then the combination [gpu0, gpu1, gpu2, gpu3] should be preferred over [gpu0, gpu1, gpu2, gpu5]: the bottleneck bandwidth of the former is 48.33, while that of the latter is 4.64 (bottleneck bandwidth is the minimum bandwidth between any two GPUs in the group). Allocating the latter to the training job would significantly increase training time. See the snippet after this paragraph for the arithmetic spelled out.
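To make the comparison concrete, here is a small, self-contained Go snippet that reproduces it under a simplified, hypothetical topology: GPUs 0-3 and GPUs 4-7 are each assumed to form one fully connected high-bandwidth island at 48.33 GB/s, and cross-island pairs fall back to PCIe at 4.64 GB/s. These values only mirror the numbers used in the example above, not a real measured matrix.

```go
package main

import "fmt"

// pairBandwidth returns an illustrative bandwidth (GB/s) between two GPUs:
// GPUs 0-3 and GPUs 4-7 are each assumed to be on one NVLink island
// (48.33 GB/s); pairs that cross islands fall back to PCIe (4.64 GB/s).
func pairBandwidth(a, b int) float64 {
	if (a < 4) == (b < 4) {
		return 48.33
	}
	return 4.64
}

// bottleneck returns the minimum pairwise bandwidth inside a GPU group.
func bottleneck(group []int) float64 {
	min := -1.0
	for i := 0; i < len(group); i++ {
		for j := i + 1; j < len(group); j++ {
			if b := pairBandwidth(group[i], group[j]); min < 0 || b < min {
				min = b
			}
		}
	}
	return min
}

func main() {
	fmt.Printf("[gpu0 gpu1 gpu2 gpu3]: %.2f GB/s\n", bottleneck([]int{0, 1, 2, 3})) // 48.33
	fmt.Printf("[gpu0 gpu1 gpu2 gpu5]: %.2f GB/s\n", bottleneck([]int{0, 1, 2, 5})) // 4.64
}
```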
Is there a suggested solution? If so, please add it:
#1115