This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

completion wasn't found in the CQ after timeout ; poll completion failed #1

Open
emperorlu opened this issue Aug 20, 2018 · 10 comments

Comments

@emperorlu

image

Why does this happen?
Thanks.

@alvinkwok1
Owner

What command-line arguments are you running the server and the client with?

@alvinkwok1
Owner

A friend of mine tested this code from the Mellanox documentation and did not find any problem.

@emperorlu
Author

Oh, I've found the cause and fixed it. Thanks!

@alvinkwok1
Owner

Would you mind sharing what the cause was?

@chiica

chiica commented Apr 11, 2019

Server:

[root@clx05 01]# ./service -d mlx5_1
Device name : "mlx5_1"
IB port : 1
TCP port : 19875

waiting on port 19875 for TCP connection
TCP connection was established
searching for IB devices in host
found 2 device(s)
going to send the message: 'SEND operation '
MR was registered with addr=0x114e490, lkey=0x8df1a, rkey=0x8df1a, flags=0x7
QP was created, QP number=0x8e7

Local LID = 0x0
Remote address = 0x23e5490
Remote rkey = 0x8bddd
Remote QP number = 0x8e6
Remote LID = 0x0
failed to modify QP state to RTR
failed to modify QP state to RTR
failed to connect QPs

test result is 1

Client:

[root@clx05 01]# ./service -d mlx5_1 10.0.0.101
servername=10.0.0.101

Device name : "mlx5_1"
IB port : 1
IP : 10.0.0.101
TCP port : 19875

TCP connection was established
searching for IB devices in host
found 2 device(s)
MR was registered with addr=0x23e5490, lkey=0x8bddd, rkey=0x8bddd, flags=0x7
QP was created, QP number=0x8e6

Local LID = 0x0
Remote address = 0x114e490
Remote rkey = 0x8df1a
Remote QP number = 0x8e7
Remote LID = 0x0
Receive Request was posted
failed to modify QP state to RTR
failed to modify QP state to RTR
failed to connect QPs

test result is 1

The NICs can reach each other, so why does it report that connecting the QPs failed? Thanks!

@SomXinBingKuang

@emperorlu Could you share what the cause was? I've run into this problem too.

@SomXinBingKuang

@fruitdish @emperorlu
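After I set MAX_POLL_CQ_TIMEOUT to 20 seconds, the server reports the following error:
got bad completion with status: 0xc, vendor syndrome: 0x81
Everything works when the server and the client run on the same machine, but the problem appears when they run on two different machines: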
./a.out -g 0
./a.out -g 0 30.102.74.192
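
A note on reading that status code: 0xc is decimal 12, which in the `enum ibv_wc_status` ordering of `<infiniband/verbs.h>` corresponds to `IBV_WC_RETRY_EXC_ERR` (transport retry counter exceeded). That usually means the sender never received an acknowledgement from the remote QP, which would be consistent with traffic not reaching the peer once the two sides sit on different machines. The sketch below is a small standalone lookup table, copied by hand from that enum so it compiles without libibverbs; the function name `wc_status_name` is made up for illustration, and a real program should prefer `ibv_wc_status_str()` from libibverbs.

```c
#include <stddef.h>

/* Names in the order of enum ibv_wc_status from <infiniband/verbs.h>.
 * Reproduced by hand (assumption: check against your installed headers). */
static const char *const wc_status_names[] = {
    "IBV_WC_SUCCESS",            /*  0 */
    "IBV_WC_LOC_LEN_ERR",        /*  1 */
    "IBV_WC_LOC_QP_OP_ERR",      /*  2 */
    "IBV_WC_LOC_EEC_OP_ERR",     /*  3 */
    "IBV_WC_LOC_PROT_ERR",       /*  4 */
    "IBV_WC_WR_FLUSH_ERR",       /*  5 */
    "IBV_WC_MW_BIND_ERR",        /*  6 */
    "IBV_WC_BAD_RESP_ERR",       /*  7 */
    "IBV_WC_LOC_ACCESS_ERR",     /*  8 */
    "IBV_WC_REM_INV_REQ_ERR",    /*  9 */
    "IBV_WC_REM_ACCESS_ERR",     /* 10 */
    "IBV_WC_REM_OP_ERR",         /* 11 */
    "IBV_WC_RETRY_EXC_ERR",      /* 12 */
    "IBV_WC_RNR_RETRY_EXC_ERR",  /* 13 */
    "IBV_WC_LOC_RDD_VIOL_ERR",   /* 14 */
    "IBV_WC_REM_INV_RD_REQ_ERR", /* 15 */
    "IBV_WC_REM_ABORT_ERR",      /* 16 */
    "IBV_WC_INV_EECN_ERR",       /* 17 */
    "IBV_WC_INV_EEC_STATE_ERR",  /* 18 */
    "IBV_WC_FATAL_ERR",          /* 19 */
    "IBV_WC_RESP_TIMEOUT_ERR",   /* 20 */
    "IBV_WC_GENERAL_ERR",        /* 21 */
};

/* Hypothetical helper: map a numeric completion status to its enum name.
 * Returns "unknown" for values outside the table. */
const char *wc_status_name(unsigned status)
{
    size_t n = sizeof wc_status_names / sizeof wc_status_names[0];
    return status < n ? wc_status_names[status] : "unknown";
}
```

Calling `wc_status_name(0xc)` yields the name for the status printed in the log above.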

@alvinkwok1
Owner

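> @fruitdish @emperorlu
> After I set MAX_POLL_CQ_TIMEOUT to 20 seconds, the server reports the following error:
> got bad completion with status: 0xc, vendor syndrome: 0x81
> Everything works when the server and the client run on the same machine, but the problem appears when they run on two different machines:
> ./a.out -g 0
> ./a.out -g 0 30.102.74.192

Sorry, I no longer work in this area, so I can't answer your question.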

@maijinsheng

> image
>
> Why does this happen?
> Thanks.

How did you solve it? Could you please explain? Thanks.

@SomXinBingKuang

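You could try changing the -g 0 argument on the client or the server to -g 1 or -g 2 (changing it on one side is enough). This is a GID table issue: the GIDs used by the client and the server need to be on the same network segment (though my wording may not be technically precise).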
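
To see which `-g` index is a plausible candidate, it can help to look at the GID table itself. On Linux the RDMA stack exposes each entry under `/sys/class/infiniband/<device>/ports/<port>/gids/<index>`, and IPv4-based RoCE GIDs appear there in IPv4-mapped form (`0000:0000:0000:0000:0000:ffff:` followed by the address in hex). The following is only a rough sketch of the "same network segment" check suggested above: the helper names `read_gid`, `gid_to_ipv4` and `same_subnet` are invented for illustration, and the prefix length you compare with is an assumption that depends on your network configuration.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Read one GID table entry from sysfs into buf.
 * Returns 0 on success, -1 if the entry cannot be read
 * (for example, no such device, port, or index). */
int read_gid(const char *dev, int port, int index, char *buf, size_t len)
{
    char path[256];
    snprintf(path, sizeof path,
             "/sys/class/infiniband/%s/ports/%d/gids/%d", dev, port, index);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0'; /* strip the trailing newline */
    return 0;
}

/* Parse a sysfs-style GID (8 colon-separated groups of 4 hex digits).
 * If it is IPv4-mapped (::ffff:a.b.c.d), store the address in host
 * byte order in *ip and return 1; otherwise return 0. */
int gid_to_ipv4(const char *gid, uint32_t *ip)
{
    unsigned g[8];
    if (sscanf(gid, "%4x:%4x:%4x:%4x:%4x:%4x:%4x:%4x",
               &g[0], &g[1], &g[2], &g[3],
               &g[4], &g[5], &g[6], &g[7]) != 8)
        return 0;
    for (int i = 0; i < 5; i++)
        if (g[i] != 0)
            return 0;
    if (g[5] != 0xffff)
        return 0;
    *ip = ((uint32_t)g[6] << 16) | (uint32_t)g[7];
    return 1;
}

/* Return 1 if two IPv4 addresses (host byte order) share the same
 * network prefix of prefix_len bits, 0 otherwise. */
int same_subnet(uint32_t a, uint32_t b, int prefix_len)
{
    uint32_t mask;
    if (prefix_len <= 0)
        mask = 0;
    else if (prefix_len >= 32)
        mask = 0xffffffffu;
    else
        mask = 0xffffffffu << (32 - prefix_len);
    return (a & mask) == (b & mask);
}
```

One way to use this: on each host, call `read_gid("mlx5_1", 1, idx, buf, sizeof buf)` for a few values of `idx`, convert the results with `gid_to_ipv4`, and pick the indices whose addresses pass `same_subnet` against the peer's address. Entries that are all zeros or start with `fe80` are not IPv4-mapped and will be skipped by `gid_to_ipv4`.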
