The two-threaded all-to-all communication timeout error #1610

Open
wczhao opened this issue Feb 18, 2025 · 3 comments

wczhao commented Feb 18, 2025

Hello. While training an LLM MoE model that uses all-to-all communication, I encountered the error "proxy.cc:1458 NCCL WARN [Service thread] Accept failed Success" when using multiple threads.

Environment
NCCL version: 2.22.3-1+cuda12.6
Megatron-LM version: core_r0.10.0
CUDA version: 12.6
Operating system: Debian 12
Hardware: NVIDIA A100, two nodes (16 GPUs)
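
A log like the one in this report interleaves watchdog timeouts from several ranks and process groups, which makes it hard to see which collectives each rank was stuck on. The following is a minimal sketch, not part of the original run, that groups the PyTorch watchdog timeout lines by global rank. The function name `summarize_timeouts` and the regular expression are assumptions for illustration; the pattern is written against the line format shown in this report and may need adjusting for other PyTorch versions.

```python
import re
from collections import defaultdict

# Pattern for ProcessGroupNCCL watchdog timeout lines, based on the format
# seen in this report (an assumption; other PyTorch versions may differ).
TIMEOUT_RE = re.compile(
    r"\[rank(?P<global_rank>\d+)\]:.*?"
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) ran for (?P<elapsed_ms>\d+) milliseconds"
)


def summarize_timeouts(log_text):
    """Map each global rank to a sorted list of (op, seq, numel_in, numel_out).

    Repeated reports of the same timed-out work item are collapsed.
    """
    summary = defaultdict(set)
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            summary[int(match["global_rank"])].add(
                (
                    match["op"],
                    int(match["seq"]),
                    int(match["numel_in"]),
                    int(match["numel_out"]),
                )
            )
    return {rank: sorted(ops) for rank, ops in sorted(summary.items())}


if __name__ == "__main__":
    import sys

    # Usage: python summarize_timeouts.py < training.log
    for rank, ops in summarize_timeouts(sys.stdin.read()).items():
        for op, seq, numel_in, numel_out in ops:
            print(f"rank {rank}: {op} SeqNum={seq} in={numel_in} out={numel_out}")
```

Grouping by rank makes it easier to compare which sequence number each rank stalled at for a given operation type. This only summarizes the log; it does not by itself identify the cause of the hang.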

Here is the error log:

2025-02-18 12:54 [rank8]:[E218 04:54:22.731995105 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=146, OpType=ALLTOALL_BASE, NumelIn=1621248, NumelOut=1572864, Timeout(ms)=600000) ran for 600021 milliseconds before timing out.
 2025-02-18 12:54 [rank8]:[E218 04:54:22.732116247 ProcessGroupNCCL.cpp:1795] [PG ID 11 PG GUID 117 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 146, last enqueued NCCL work: 154, last completed NCCL work: 145.
 2025-02-18 12:54 [rank8]:[E218 04:54:22.793014061 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70, OpType=_ALLGATHER_BASE, NumelIn=32, NumelOut=64, Timeout(ms)=600000) ran for 600079 milliseconds before timing out.
 2025-02-18 12:54 [rank8]:[E218 04:54:22.793053616 ProcessGroupNCCL.cpp:1795] [PG ID 13 PG GUID 141 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 70, last enqueued NCCL work: 70, last completed NCCL work: 69.
 2025-02-18 12:54 [rank8]:[E218 04:54:22.793542372 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=70, OpType=ALLREDUCE, NumelIn=32, NumelOut=32, Timeout(ms)=600000) ran for 600080 milliseconds before timing out.
 2025-02-18 12:54 [rank8]:[E218 04:54:22.793571553 ProcessGroupNCCL.cpp:1795] [PG ID 10 PG GUID 105 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 70, last enqueued NCCL work: 70, last completed NCCL work: 69.
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:934 [0] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:934 [0] NCCL INFO misc/socket.cc:550 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:934 [0] NCCL INFO misc/socket.cc:573 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:934 [0] NCCL INFO misc/socket.cc:621 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:3260 [0] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:3260 [0] NCCL INFO misc/socket.cc:752 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:3260 [0] NCCL INFO misc/socket.cc:428 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:3260 [0] NCCL INFO misc/socket.cc:564 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:3260 [0] NCCL INFO misc/socket.cc:668 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:3260 [0] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Success
 2025-02-18 12:54 [rank15]:[E218 04:54:22.845578749 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=151, OpType=ALLTOALL_BASE, NumelIn=1572864, NumelOut=1731840, Timeout(ms)=600000) ran for 600011 milliseconds before timing out.
 2025-02-18 12:54 [rank15]:[E218 04:54:22.845670158 ProcessGroupNCCL.cpp:1795] [PG ID 12 PG GUID 120 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 151, last enqueued NCCL work: 152, last completed NCCL work: 150.
 2025-02-18 12:54 [rank12]:[E218 04:54:22.854771287 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=216, OpType=ALLTOALL_BASE, NumelIn=1766400, NumelOut=1572864, Timeout(ms)=600000) ran for 600041 milliseconds before timing out.
 2025-02-18 12:54 [rank12]:[E218 04:54:22.854846003 ProcessGroupNCCL.cpp:1795] [PG ID 11 PG GUID 119 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 216, last enqueued NCCL work: 220, last completed NCCL work: 215.
 2025-02-18 12:54 [rank12]:[E218 04:54:22.856940622 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=87, OpType=ALLREDUCE, NumelIn=32, NumelOut=32, Timeout(ms)=600000) ran for 600036 milliseconds before timing out.
 2025-02-18 12:54 [rank12]:[E218 04:54:22.856973869 ProcessGroupNCCL.cpp:1795] [PG ID 10 PG GUID 109 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 87, last enqueued NCCL work: 87, last completed NCCL work: 86.
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:1116 [4] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:1116 [4] NCCL INFO misc/socket.cc:550 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:1116 [4] NCCL INFO misc/socket.cc:573 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:1116 [4] NCCL INFO misc/socket.cc:621 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:3156 [4] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:3156 [4] NCCL INFO misc/socket.cc:752 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:3156 [4] NCCL INFO misc/socket.cc:428 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:3156 [4] NCCL INFO misc/socket.cc:564 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:3156 [4] NCCL INFO misc/socket.cc:668 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:3156 [4] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Success
 2025-02-18 12:54 [rank12]:[E218 04:54:22.861173210 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=87, OpType=_ALLGATHER_BASE, NumelIn=32, NumelOut=64, Timeout(ms)=600000) ran for 600039 milliseconds before timing out.
 2025-02-18 12:54 [rank12]:[E218 04:54:22.861196431 ProcessGroupNCCL.cpp:1795] [PG ID 13 PG GUID 143 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 87, last enqueued NCCL work: 87, last completed NCCL work: 86.
 2025-02-18 12:54 [rank13]:[E218 04:54:22.864467604 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=216, OpType=ALLTOALL_BASE, NumelIn=1379328, NumelOut=1572864, Timeout(ms)=600000) ran for 600045 milliseconds before timing out.
 2025-02-18 12:54 [rank13]:[E218 04:54:22.864535778 ProcessGroupNCCL.cpp:1795] [PG ID 11 PG GUID 119 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 216, last enqueued NCCL work: 220, last completed NCCL work: 215.
 2025-02-18 12:54 [rank15]:[E218 04:54:22.883327702 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=57, OpType=ALLREDUCE, NumelIn=32, NumelOut=32, Timeout(ms)=600000) ran for 600045 milliseconds before timing out.
 2025-02-18 12:54 [rank15]:[E218 04:54:22.883358915 ProcessGroupNCCL.cpp:1795] [PG ID 11 PG GUID 112 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 57, last enqueued NCCL work: 57, last completed NCCL work: 56.
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:915 [7] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:915 [7] NCCL INFO misc/socket.cc:550 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:915 [7] NCCL INFO misc/socket.cc:573 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:915 [7] NCCL INFO misc/socket.cc:621 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:3078 [7] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:3078 [7] NCCL INFO misc/socket.cc:752 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:3078 [7] NCCL INFO misc/socket.cc:428 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:3078 [7] NCCL INFO misc/socket.cc:564 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:3078 [7] NCCL INFO misc/socket.cc:668 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:3078 [7] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Success
 2025-02-18 12:54 [rank13]:[E218 04:54:22.898471814 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=87, OpType=_ALLGATHER_BASE, NumelIn=32, NumelOut=64, Timeout(ms)=600000) ran for 600081 milliseconds before timing out.
 2025-02-18 12:54 [rank13]:[E218 04:54:22.898502583 ProcessGroupNCCL.cpp:1795] [PG ID 13 PG GUID 143 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 87, last enqueued NCCL work: 87, last completed NCCL work: 86.
 2025-02-18 12:54 [rank14]:[E218 04:54:22.918261108 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=151, OpType=ALLTOALL_BASE, NumelIn=1572864, NumelOut=1413888, Timeout(ms)=600000) ran for 600077 milliseconds before timing out.
 2025-02-18 12:54 [rank14]:[E218 04:54:22.918359341 ProcessGroupNCCL.cpp:1795] [PG ID 12 PG GUID 120 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 151, last enqueued NCCL work: 152, last completed NCCL work: 150.
 2025-02-18 12:54 [rank15]:[E218 04:54:22.921481021 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=57, OpType=_ALLGATHER_BASE, NumelIn=32, NumelOut=64, Timeout(ms)=600000) ran for 600083 milliseconds before timing out.
 2025-02-18 12:54 [rank15]:[E218 04:54:22.921511099 ProcessGroupNCCL.cpp:1795] [PG ID 14 PG GUID 144 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 57, last enqueued NCCL work: 57, last completed NCCL work: 56.
 2025-02-18 12:54 [rank14]:[E218 04:54:22.921550502 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=57, OpType=_ALLGATHER_BASE, NumelIn=32, NumelOut=64, Timeout(ms)=600000) ran for 600083 milliseconds before timing out.
 2025-02-18 12:54 [rank14]:[E218 04:54:22.921584749 ProcessGroupNCCL.cpp:1795] [PG ID 14 PG GUID 144 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 57, last enqueued NCCL work: 57, last completed NCCL work: 56.
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:1130 [5] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:1130 [5] NCCL INFO misc/socket.cc:550 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:1130 [5] NCCL INFO misc/socket.cc:573 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:1130 [5] NCCL INFO misc/socket.cc:621 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:1130 [5] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:3163 [5] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:3163 [5] NCCL INFO misc/socket.cc:752 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:1130 [5] NCCL INFO misc/socket.cc:58 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:3163 [5] NCCL INFO misc/socket.cc:428 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:3163 [5] NCCL INFO misc/socket.cc:564 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:1130 [5] NCCL INFO misc/socket.cc:775 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:3163 [5] NCCL INFO misc/socket.cc:668 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:3163 [5] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:3163 [5] NCCL INFO misc/socket.cc:826 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:3163 [5] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 1, res=3, closed=0
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:3163 [5] proxy.cc:1521 NCCL WARN [Proxy Service 1] Failed to execute operation Close from rank 1, retcode 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:1128 [4] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:1128 [4] NCCL INFO misc/socket.cc:550 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:1128 [4] NCCL INFO misc/socket.cc:573 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:1128 [4] NCCL INFO misc/socket.cc:621 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:3165 [4] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:3165 [4] NCCL INFO misc/socket.cc:752 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:3165 [4] NCCL INFO misc/socket.cc:428 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:3165 [4] NCCL INFO misc/socket.cc:564 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:3165 [4] NCCL INFO misc/socket.cc:668 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:3165 [4] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:1128 [4] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:1128 [4] NCCL INFO misc/socket.cc:58 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:1128 [4] NCCL INFO misc/socket.cc:775 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:3165 [4] NCCL INFO misc/socket.cc:826 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:3165 [4] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 0, res=3, closed=0
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:3165 [4] proxy.cc:1521 NCCL WARN [Proxy Service 0] Failed to execute operation Close from rank 0, retcode 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:1122 [5] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:1122 [5] NCCL INFO misc/socket.cc:550 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:1122 [5] NCCL INFO misc/socket.cc:573 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:1120 [4] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:1122 [5] NCCL INFO misc/socket.cc:621 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:1120 [4] NCCL INFO misc/socket.cc:550 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:1120 [4] NCCL INFO misc/socket.cc:573 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:1120 [4] NCCL INFO misc/socket.cc:621 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:3172 [5] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:3172 [5] NCCL INFO misc/socket.cc:752 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:3172 [5] NCCL INFO misc/socket.cc:428 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:3172 [5] NCCL INFO misc/socket.cc:564 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:3174 [4] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:3172 [5] NCCL INFO misc/socket.cc:668 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:3174 [4] NCCL INFO misc/socket.cc:752 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:3172 [5] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:3174 [4] NCCL INFO misc/socket.cc:428 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:3174 [4] NCCL INFO misc/socket.cc:564 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:3174 [4] NCCL INFO misc/socket.cc:668 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:1122 [5] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:3174 [4] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:1122 [5] NCCL INFO misc/socket.cc:58 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:1122 [5] NCCL INFO misc/socket.cc:775 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:3172 [5] NCCL INFO misc/socket.cc:826 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:3172 [5] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 1, res=3, closed=0
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:3172 [5] proxy.cc:1521 NCCL WARN [Proxy Service 1] Failed to execute operation Close from rank 1, retcode 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:1120 [4] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:3174 [4] NCCL INFO misc/socket.cc:826 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:3174 [4] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 0, res=3, closed=0
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:1120 [4] NCCL INFO misc/socket.cc:58 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:1120 [4] NCCL INFO misc/socket.cc:775 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:3174 [4] proxy.cc:1521 NCCL WARN [Proxy Service 0] Failed to execute operation Close from rank 0, retcode 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:939 [6] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:939 [6] NCCL INFO misc/socket.cc:550 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:939 [6] NCCL INFO misc/socket.cc:573 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:939 [6] NCCL INFO misc/socket.cc:621 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:939 [6] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:939 [6] NCCL INFO misc/socket.cc:58 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:939 [6] NCCL INFO misc/socket.cc:775 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:3087 [6] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:3087 [6] NCCL INFO misc/socket.cc:752 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:3087 [6] NCCL INFO misc/socket.cc:428 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:3087 [6] NCCL INFO misc/socket.cc:564 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:3087 [6] NCCL INFO misc/socket.cc:668 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:3087 [6] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:3087 [6] NCCL INFO misc/socket.cc:826 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:3087 [6] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 0, res=3, closed=0
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:3087 [6] proxy.cc:1521 NCCL WARN [Proxy Service 0] Failed to execute operation Close from rank 0, retcode 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:922 [7] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:922 [7] NCCL INFO misc/socket.cc:550 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:922 [7] NCCL INFO misc/socket.cc:573 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:922 [7] NCCL INFO misc/socket.cc:621 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:3094 [7] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:3094 [7] NCCL INFO misc/socket.cc:752 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:3094 [7] NCCL INFO misc/socket.cc:428 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:3094 [7] NCCL INFO misc/socket.cc:564 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:3094 [7] NCCL INFO misc/socket.cc:668 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:3094 [7] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:3096 [6] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:3094 [7] NCCL INFO misc/socket.cc:826 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:3096 [6] NCCL INFO misc/socket.cc:752 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:922 [7] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:3096 [6] NCCL INFO misc/socket.cc:428 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:922 [7] NCCL INFO misc/socket.cc:58 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:922 [7] NCCL INFO misc/socket.cc:775 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:3096 [6] NCCL INFO misc/socket.cc:564 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:923 [6] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:3094 [7] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 1, res=3, closed=0
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:923 [6] NCCL INFO misc/socket.cc:550 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:3096 [6] NCCL INFO misc/socket.cc:668 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:923 [6] NCCL INFO misc/socket.cc:573 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:3094 [7] proxy.cc:1521 NCCL WARN [Proxy Service 1] Failed to execute operation Close from rank 1, retcode 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:923 [6] NCCL INFO misc/socket.cc:621 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:3096 [6] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:936 [7] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:936 [7] NCCL INFO misc/socket.cc:550 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:936 [7] NCCL INFO misc/socket.cc:573 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:936 [7] NCCL INFO misc/socket.cc:621 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:923 [6] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:3085 [7] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:3096 [6] NCCL INFO misc/socket.cc:826 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:3085 [7] NCCL INFO misc/socket.cc:752 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:923 [6] NCCL INFO misc/socket.cc:58 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:3085 [7] NCCL INFO misc/socket.cc:428 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:3096 [6] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 0, res=3, closed=0
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:923 [6] NCCL INFO misc/socket.cc:775 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:3085 [7] NCCL INFO misc/socket.cc:564 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:3085 [7] NCCL INFO misc/socket.cc:668 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:3096 [6] proxy.cc:1521 NCCL WARN [Proxy Service 0] Failed to execute operation Close from rank 0, retcode 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:936 [7] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:936 [7] NCCL INFO misc/socket.cc:58 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:936 [7] NCCL INFO misc/socket.cc:775 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:3085 [7] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:3085 [7] NCCL INFO misc/socket.cc:826 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:3085 [7] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 1, res=3, closed=0
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:3085 [7] proxy.cc:1521 NCCL WARN [Proxy Service 1] Failed to execute operation Close from rank 1, retcode 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:960 [0] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:960 [0] NCCL INFO misc/socket.cc:550 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:960 [0] NCCL INFO misc/socket.cc:573 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:960 [0] NCCL INFO misc/socket.cc:621 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:3266 [0] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:3266 [0] NCCL INFO misc/socket.cc:752 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:3266 [0] NCCL INFO misc/socket.cc:428 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:3266 [0] NCCL INFO misc/socket.cc:564 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:3266 [0] NCCL INFO misc/socket.cc:668 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:3266 [0] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:960 [0] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:960 [0] NCCL INFO misc/socket.cc:58 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:960 [0] NCCL INFO misc/socket.cc:775 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:3266 [0] NCCL INFO misc/socket.cc:826 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:3266 [0] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 0, res=3, closed=0
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:173:3266 [0] proxy.cc:1521 NCCL WARN [Proxy Service 0] Failed to execute operation Close from rank 0, retcode 3
 2025-02-18 12:54 [rank9]:[E218 04:54:22.061694977 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66, OpType=ALLREDUCE, NumelIn=32, NumelOut=32, Timeout(ms)=600000) ran for 600394 milliseconds before timing out.
 2025-02-18 12:54 [rank9]:[E218 04:54:22.061699237 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=66, OpType=_ALLGATHER_BASE, NumelIn=32, NumelOut=64, Timeout(ms)=600000) ran for 600394 milliseconds before timing out.
 2025-02-18 12:54 [rank9]:[E218 04:54:22.061701708 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=130, OpType=ALLTOALL_BASE, NumelIn=1774848, NumelOut=1572864, Timeout(ms)=600000) ran for 600413 milliseconds before timing out.
 2025-02-18 12:54 [rank9]:[E218 04:54:22.061816341 ProcessGroupNCCL.cpp:1795] [PG ID 11 PG GUID 117 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 130, last enqueued NCCL work: 147, last completed NCCL work: 129.
 2025-02-18 12:54 [rank9]:[E218 04:54:22.061821020 ProcessGroupNCCL.cpp:1795] [PG ID 10 PG GUID 106 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 66, last enqueued NCCL work: 70, last completed NCCL work: 65.
 2025-02-18 12:54 [rank9]:[E218 04:54:22.061823344 ProcessGroupNCCL.cpp:1795] [PG ID 13 PG GUID 141 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 66, last enqueued NCCL work: 70, last completed NCCL work: 65.
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:943 [1] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:949 [1] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:949 [1] NCCL INFO misc/socket.cc:550 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:949 [1] NCCL INFO misc/socket.cc:573 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:943 [1] NCCL INFO misc/socket.cc:550 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:949 [1] NCCL INFO misc/socket.cc:621 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:943 [1] NCCL INFO misc/socket.cc:573 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:964 [1] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:964 [1] NCCL INFO misc/socket.cc:550 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:3258 [1] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:964 [1] NCCL INFO misc/socket.cc:573 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:943 [1] NCCL INFO misc/socket.cc:621 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:964 [1] NCCL INFO misc/socket.cc:621 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:3258 [1] NCCL INFO misc/socket.cc:752 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:3274 [1] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:3274 [1] NCCL INFO misc/socket.cc:752 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:3274 [1] NCCL INFO misc/socket.cc:428 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:964 [1] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:3274 [1] NCCL INFO misc/socket.cc:564 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:3258 [1] NCCL INFO misc/socket.cc:428 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:3258 [1] NCCL INFO misc/socket.cc:564 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:3265 [1] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:3258 [1] NCCL INFO misc/socket.cc:668 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:3265 [1] NCCL INFO misc/socket.cc:752 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:3274 [1] NCCL INFO misc/socket.cc:668 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:3265 [1] NCCL INFO misc/socket.cc:428 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:3258 [1] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Success
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:3274 [1] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:964 [1] NCCL INFO misc/socket.cc:58 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:3265 [1] NCCL INFO misc/socket.cc:564 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:964 [1] NCCL INFO misc/socket.cc:775 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:3265 [1] NCCL INFO misc/socket.cc:668 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:949 [1] NCCL INFO misc/socket.cc:47 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:3265 [1] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:3274 [1] NCCL INFO misc/socket.cc:826 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:3274 [1] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 1, res=3, closed=0
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:949 [1] NCCL INFO misc/socket.cc:58 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:949 [1] NCCL INFO misc/socket.cc:775 -> 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:3265 [1] NCCL INFO misc/socket.cc:826 -> 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:3265 [1] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 1, res=3, closed=0
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:3265 [1] proxy.cc:1521 NCCL WARN [Proxy Service 1] Failed to execute operation Close from rank 1, retcode 3
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:174:3274 [1] proxy.cc:1521 NCCL WARN [Proxy Service 1] Failed to execute operation Close from rank 1, retcode 3
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:179:923 [6] NCCL INFO comm 0x555590368b10 rank 0 nranks 2 cudaDev 6 busId e0000 - Abort COMPLETE
 2025-02-18 12:54 [rank14]:[E218 04:54:23.220620192 ProcessGroupNCCL.cpp:1844] [PG ID 12 PG GUID 120 Rank 0] Timeout at NCCL work: 151, last enqueued NCCL work: 152, last completed NCCL work: 150.
 2025-02-18 12:54 [rank14]:[E218 04:54:23.220632044 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
 2025-02-18 12:54 [rank14]:[E218 04:54:23.220636179 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
 2025-02-18 12:54 [rank14]:[E218 04:54:23.222095654 ProcessGroupNCCL.cpp:1605] [PG ID 12 PG GUID 120 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=151, OpType=ALLTOALL_BASE, NumelIn=1572864, NumelOut=1413888, Timeout(ms)=600000) ran for 600077 milliseconds before timing out.
 2025-02-18 12:54 Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
 2025-02-18 12:54 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f419569c0a8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
 2025-02-18 12:54 frame #1: <unknown function> + 0x114308e (0x7f419685008e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f4196865ee2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x24b (0x7f419686f31b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f4196870d7d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #5: <unknown function> + 0xdc253 (0x7f41950b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
 2025-02-18 12:54 frame #6: <unknown function> + 0x94ac3 (0x7f41f4d30ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
 2025-02-18 12:54 frame #7: <unknown function> + 0x126850 (0x7f41f4dc2850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
 2025-02-18 12:54 
 2025-02-18 12:54 terminate called after throwing an instance of 'c10::DistBackendError'
 2025-02-18 12:54   what():  [PG ID 12 PG GUID 120 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=151, OpType=ALLTOALL_BASE, NumelIn=1572864, NumelOut=1413888, Timeout(ms)=600000) ran for 600077 milliseconds before timing out.
 2025-02-18 12:54 Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
 2025-02-18 12:54 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f419569c0a8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
 2025-02-18 12:54 frame #1: <unknown function> + 0x114308e (0x7f419685008e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f4196865ee2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x24b (0x7f419686f31b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f4196870d7d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #5: <unknown function> + 0xdc253 (0x7f41950b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
 2025-02-18 12:54 frame #6: <unknown function> + 0x94ac3 (0x7f41f4d30ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
 2025-02-18 12:54 frame #7: <unknown function> + 0x126850 (0x7f41f4dc2850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
 2025-02-18 12:54 
 2025-02-18 12:54 Exception raised from ncclCommWatchdog at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1611 (most recent call first):
 2025-02-18 12:54 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f419569c0a8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
 2025-02-18 12:54 frame #1: <unknown function> + 0x114308e (0x7f419685008e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #2: <unknown function> + 0xdd23be (0x7f41964df3be in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #3: <unknown function> + 0xdc253 (0x7f41950b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
 2025-02-18 12:54 frame #4: <unknown function> + 0x94ac3 (0x7f41f4d30ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
 2025-02-18 12:54 frame #5: <unknown function> + 0x126850 (0x7f41f4dc2850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:180:922 [7] NCCL INFO comm 0x555744cda020 rank 1 nranks 2 cudaDev 7 busId e4000 - Abort COMPLETE
 2025-02-18 12:54 [rank15]:[E218 04:54:23.231578316 ProcessGroupNCCL.cpp:1844] [PG ID 12 PG GUID 120 Rank 1] Timeout at NCCL work: 151, last enqueued NCCL work: 152, last completed NCCL work: 150.
 2025-02-18 12:54 [rank15]:[E218 04:54:23.231589411 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
 2025-02-18 12:54 [rank15]:[E218 04:54:23.231593265 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
 2025-02-18 12:54 [rank15]:[E218 04:54:23.233033151 ProcessGroupNCCL.cpp:1605] [PG ID 12 PG GUID 120 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=151, OpType=ALLTOALL_BASE, NumelIn=1572864, NumelOut=1731840, Timeout(ms)=600000) ran for 600011 milliseconds before timing out.
 2025-02-18 12:54 Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
 2025-02-18 12:54 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f0729e9c0a8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
 2025-02-18 12:54 frame #1: <unknown function> + 0x114308e (0x7f072b05008e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f072b065ee2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x24b (0x7f072b06f31b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f072b070d7d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #5: <unknown function> + 0xdc253 (0x7f07298b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
 2025-02-18 12:54 frame #6: <unknown function> + 0x94ac3 (0x7f0789525ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
 2025-02-18 12:54 frame #7: <unknown function> + 0x126850 (0x7f07895b7850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
 2025-02-18 12:54 
 2025-02-18 12:54 terminate called after throwing an instance of 'c10::DistBackendError'
 2025-02-18 12:54   what():  [PG ID 12 PG GUID 120 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=151, OpType=ALLTOALL_BASE, NumelIn=1572864, NumelOut=1731840, Timeout(ms)=600000) ran for 600011 milliseconds before timing out.
 2025-02-18 12:54 Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
 2025-02-18 12:54 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f0729e9c0a8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
 2025-02-18 12:54 frame #1: <unknown function> + 0x114308e (0x7f072b05008e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f072b065ee2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x24b (0x7f072b06f31b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f072b070d7d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #5: <unknown function> + 0xdc253 (0x7f07298b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
 2025-02-18 12:54 frame #6: <unknown function> + 0x94ac3 (0x7f0789525ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
 2025-02-18 12:54 frame #7: <unknown function> + 0x126850 (0x7f07895b7850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
 2025-02-18 12:54 
 2025-02-18 12:54 Exception raised from ncclCommWatchdog at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1611 (most recent call first):
 2025-02-18 12:54 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f0729e9c0a8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
 2025-02-18 12:54 frame #1: <unknown function> + 0x114308e (0x7f072b05008e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #2: <unknown function> + 0xdd23be (0x7f072acdf3be in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #3: <unknown function> + 0xdc253 (0x7f07298b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
 2025-02-18 12:54 frame #4: <unknown function> + 0x94ac3 (0x7f0789525ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
 2025-02-18 12:54 frame #5: <unknown function> + 0x126850 (0x7f07895b7850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
 2025-02-18 12:54 
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:177:1128 [4] NCCL INFO comm 0x5591a8431ff0 rank 0 nranks 2 cudaDev 4 busId c5000 - Abort COMPLETE
 2025-02-18 12:54 [rank12]:[E218 04:54:23.314069184 ProcessGroupNCCL.cpp:1844] [PG ID 13 PG GUID 143 Rank 0] Timeout at NCCL work: 87, last enqueued NCCL work: 87, last completed NCCL work: 86.
 2025-02-18 12:54 [rank12]:[E218 04:54:23.314083987 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
 2025-02-18 12:54 [rank12]:[E218 04:54:23.314089273 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
 2025-02-18 12:54 ncivki7qpk7pmo1btv0h0:178:1130 [5] NCCL INFO comm 0x55ffa0a299c0 rank 1 nranks 2 cudaDev 5 busId ca000 - Abort COMPLETE
 2025-02-18 12:54 [rank13]:[E218 04:54:23.314207038 ProcessGroupNCCL.cpp:1844] [PG ID 13 PG GUID 143 Rank 1] Timeout at NCCL work: 87, last enqueued NCCL work: 87, last completed NCCL work: 86.
 2025-02-18 12:54 [rank13]:[E218 04:54:23.314219642 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
 2025-02-18 12:54 [rank13]:[E218 04:54:23.314224750 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
 2025-02-18 12:54 [rank12]:[E218 04:54:23.315529658 ProcessGroupNCCL.cpp:1605] [PG ID 13 PG GUID 143 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=87, OpType=_ALLGATHER_BASE, NumelIn=32, NumelOut=64, Timeout(ms)=600000) ran for 600039 milliseconds before timing out.
 2025-02-18 12:54 Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
 2025-02-18 12:54 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f9e6cfb30a8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
 2025-02-18 12:54 frame #1: <unknown function> + 0x114308e (0x7f9e6e16708e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f9e6e17cee2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x24b (0x7f9e6e18631b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f9e6e187d7d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #5: <unknown function> + 0xdc253 (0x7f9e6cab0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
 2025-02-18 12:54 frame #6: <unknown function> + 0x94ac3 (0x7f9ecc5a2ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
 2025-02-18 12:54 frame #7: <unknown function> + 0x126850 (0x7f9ecc634850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
 2025-02-18 12:54 
 2025-02-18 12:54 terminate called after throwing an instance of 'c10::DistBackendError'
 2025-02-18 12:54 [rank13]:[E218 04:54:23.315646688 ProcessGroupNCCL.cpp:1605] [PG ID 13 PG GUID 143 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=87, OpType=_ALLGATHER_BASE, NumelIn=32, NumelOut=64, Timeout(ms)=600000) ran for 600081 milliseconds before timing out.
 2025-02-18 12:54 Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
 2025-02-18 12:54 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f43aaa9c0a8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
 2025-02-18 12:54 frame #1: <unknown function> + 0x114308e (0x7f43abc5008e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f43abc65ee2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x24b (0x7f43abc6f31b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f43abc70d7d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
 2025-02-18 12:54 frame #5: <unknown function> + 0xdc253 (0x7f43aa4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
 2025-02-18 12:54 frame #6: <unknown function> + 0x94ac3 (0x7f440a143ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
 2025-02-18 12:54 frame #7: <unknown function> + 0x126850 (0x7f440a1d5850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
@kiskra-nvidia
Member

Yeah, the error message is far from perfect, but I suspect that it's not the root cause. The root cause appears to be a collective operation timeout, which could be due to any number of reasons.

What do you mean by "when using multiple threads"? Are those threads using the same communicator?

@wczhao changed the title from 'The two-threaded all-to-all communication error: "proxy.cc:1458 NCCL WARN [Service thread] Accept failed Success"' to 'The two-threaded all-to-all communication timeout error' on Feb 24, 2025
@wczhao
Author

wczhao commented Feb 24, 2025

> Yeah, the error message is far from perfect, but I suspect that it's not the root cause. The root cause appears to be a collective operation timeout, which could be due to any number of reasons.
>
> What do you mean by "when using multiple threads"? Are those threads using the same communicator?

Thank you for your reply. In my setup, two threads concurrently execute two AlltoAll communications. Both AlltoAll communications use the same communicator, and that is when the timeout occurs. If the two threads used different communicators, could a timeout still occur?

@kiskra-nvidia
Member

If you have two threads issuing collective operations at the same time on the same communicator, there's a good chance that that's the root cause of the hangs. See https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/threadsafety.html .
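
To illustrate the ordering concern, here is a minimal pure-Python sketch of one way two threads could take turns issuing operations on a shared resource in a fixed order. It makes no NCCL or PyTorch calls. `OrderedIssuer`, `run_rank`, and the ticket scheme are hypothetical names invented for this example, and the `append` call merely stands in for enqueueing a collective. Whether this kind of serialization is enough for a real communicator is an assumption that has not been confirmed in this thread; the linked thread-safety page remains the authority.

```python
import threading


class OrderedIssuer:
    """Hypothetical helper (not part of NCCL or PyTorch).

    Threads call issue() with a ticket number; operations run one at a
    time, strictly in ticket order, regardless of thread scheduling.
    """

    def __init__(self):
        self._cond = threading.Condition()
        self._next_ticket = 0

    def issue(self, ticket, op):
        with self._cond:
            # Block until every lower-numbered ticket has been issued.
            self._cond.wait_for(lambda: self._next_ticket == ticket)
            try:
                return op()
            finally:
                self._next_ticket += 1
                self._cond.notify_all()


def run_rank(steps=3, num_threads=2):
    """Simulate one rank: num_threads threads share one 'communicator'."""
    issuer = OrderedIssuer()
    issue_log = []  # records the order in which operations were issued

    def worker(thread_id):
        for step in range(steps):
            # Fixed interleaving: at each step, thread 0 goes first, then thread 1.
            ticket = step * num_threads + thread_id
            # Stand-in for enqueueing a collective (assumption: illustration only).
            issuer.issue(ticket, lambda: issue_log.append((step, thread_id)))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return issue_log


if __name__ == "__main__":
    # Every simulated rank produces the same issue order.
    print(run_rank())
    print(run_rank())
```

The point of the ticket scheme is that the issue order depends only on `(step, thread_id)`, so every rank running the same code issues its operations in the same sequence. A plain mutex would prevent simultaneous calls but would not by itself guarantee that different ranks pick the same order.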
