Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: underlying websocket connection leak #1186

Closed
jm-clius opened this issue Sep 26, 2022 · 14 comments · Fixed by #1483
Closed

bug: underlying websocket connection leak #1186

jm-clius opened this issue Sep 26, 2022 · 14 comments · Fixed by #1483
Assignees
Labels
bug Something isn't working

Comments

@jm-clius
Copy link
Contributor

Problem

An underlying bug is causing some incoming websocket connections to "leak", resulting in an increase in open file descriptors. It's not yet clear whether this is related to the nim-libp2p, nim-chronos or nim-websock libraries, so this tracking issue will be linked to the underlying issue and relevant repo once it's created.

Impact

Eventually a node exhibiting this leak will reach the maximum amount of open file descriptors. This could lead to:

  • the node restarting
  • the node refusing new connections

To reproduce

This has been observed on the status.prod and wakuv2.test fleets. The status.test fleet, which runs the same version of nwaku has not shown the same leak. This is likely because the leak is triggered by very specific (possibly erratic) websocket behaviour exhibited by client(s) connected to only the fleets with leak detected.

Expected behavior

Connections should close properly and the corresponding file descriptor released in all cases.

Screenshots/logs

image

7 days of open file descriptor count on status.prod

nwaku version/commit hash

latest master on 2022/09/21: commit 11832411

cc @Menduist

@jm-clius jm-clius added bug Something isn't working track:maintenance labels Sep 26, 2022
@jm-clius jm-clius added this to the Release 0.14 milestone Sep 26, 2022
@Menduist
Copy link
Contributor

Posting my original investigation here:

16:52:28.586629 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000031>
16:52:35.008421 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 70 <0.000034>
16:52:58.624601 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000030>
16:53:05.009432 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 70 <0.000034>
16:53:09.844583 accept(11, 0x7ffe0be9b9e0, [0 => 28]) = 70 <0.000032> # bye bye 70
16:53:28.662283 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000037>
16:53:35.010786 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 71 <0.000062>
16:53:58.706336 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000036>
16:54:05.011209 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 71 <0.000035>
16:54:09.881759 accept(11, 0x7ffe0be9b9e0, [0 => 28]) = 71 <0.000033> # bye bye 71
16:54:28.742303 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000029>
16:54:35.019056 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 72 <0.000040>
16:54:58.785436 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000029>
16:55:05.020150 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 72 <0.000036>
16:55:09.941550 accept(11, 0x7ffe0be9b9e0, [0 => 28]) = 72 <0.000033> # bye bye 72
16:55:30.210264 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000036>
16:55:35.021139 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 73 <0.000038>
16:55:58.904547 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000030>
16:56:05.022469 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 73 <0.000032>
16:56:09.997573 accept(11, 0x7ffe0be9b9e0, [0 => 28]) = 73 <0.000047> # bye bye 73
16:56:28.941819 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000031>
16:56:35.023149 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 74 <0.000035>

Everytime we accept on the fd number 11, the returned fd get stuck
fd number 11 = websocket
But when I connect manually, using nc:

17:01:41.659418 accept(11, 0x7ffe0be9b9e0, [0 => 28]) = 77 <0.000040>
17:01:45.720612 accept(11, 0x7ffe0be9b9e0, [0 => 28]) = 77 <0.000037>
17:01:49.843990 accept(11, 0x7ffe0be9b9e0, [0 => 28]) = 77 <0.000046>
17:01:53.740448 accept(11, 0x7ffe0be9b9e0, [0 => 28]) = 77 <0.000079>

The fd get released correctly

16:54:09.901187 sendto(71, "\25\3\3\0\32\0\0\0\0\0\0\0\3\245OR\242\227\260\234]g\237\210T_\342d\226w\260", 31, MSG_NOSIGNAL, NULL, 0) = -1 EPIPE (Broken pipe)

16:55:09.990054 sendto(72, "\25\3\3\0\32\0\0\0\0\0\0\0\3\17\263\371+\205!\355\6*\275\351\340\225\313\253\232\30\353", 31, MSG_NOSIGNAL, NULL, 0) = -1 EPIPE (Broken pipe)

16:56:10.019808 sendto(73, "\25\3\3\0\32\0\0\0\0\0\0\0\3Yr\320q|nnt\376\371\276h\246n\370;\205\372", 31, MSG_NOSIGNAL, NULL, 0) = -1 EPIPE (Broken pipe) <0.000028>

Last sign of life we get for each stuck fd
Soo somewhere amongst the chronos - websock - libp2p stack we don't handle EPIPE correctly

@fryorcraken fryorcraken moved this to Todo in Waku Sep 28, 2022
jakubgs added a commit to status-im/infra-role-nim-waku that referenced this issue Oct 6, 2022
Websockets stop responding after some time due to socket leak:
waku-org/nwaku#1186

Signed-off-by: Jakub Sokołowski <jakub@status.im>
@jakubgs
Copy link
Contributor

jakubgs commented Oct 6, 2022

Added restarts every 6 hours with random 3 hour stagger to mitigate this temporarily:

jakubgs added a commit to status-im/infra-role-nim-waku that referenced this issue Oct 6, 2022
Websockets stop responding after some time due to socket leak:
waku-org/nwaku#1186

Signed-off-by: Jakub Sokołowski <jakub@status.im>
jakubgs added a commit to status-im/infra-role-nim-waku that referenced this issue Oct 11, 2022
waku-org/nwaku#1186

Signed-off-by: Jakub Sokołowski <jakub@status.im>
@jakubgs
Copy link
Contributor

jakubgs commented Oct 11, 2022

Due to periodic restarts the issue might be less noticeable, but it is still there:

image

Any updates?

@jm-clius
Copy link
Contributor Author

With many of the nim-libp2p/websock team participating at devcon, the fix is targeted for beginning-December (and release 0.14.0)

jakubgs added a commit to status-im/infra-role-nim-waku that referenced this issue Oct 11, 2022
waku-org/nwaku#1186

Signed-off-by: Jakub Sokołowski <jakub@status.im>
@jakubgs
Copy link
Contributor

jakubgs commented Oct 11, 2022

Okay, then if there is no fix in sight I will simply disable the Consul Websocket healthchecks, because I don't want to get pinged for a known issue that is not currently being researched: status-im/infra-role-nim-waku@b9868446

@Menduist
Copy link
Contributor

Status update:
The two PRs above fix about 5 leaks, but there are still more to go!
It's now exhibiting a new symptom, where the last sign of life in strace of a leaked FD is:
recvfrom(22, "", 4096, 0, NULL, NULL) = 0
after a correct connection has been setup.

So again, somewhere in the stack this is not handled properly. Investigation will continue next week.

@fryorcraken fryorcraken added critical This issue needs critical attention and removed critical This issue needs critical attention labels Nov 17, 2022
@fryorcraken
Copy link
Collaborator

@Menduist: Would it be worth merging and release the PRs above so that we can improve the status quo?
We could then integrate changes to nwaku and deploy them on status.prod and status.test.

@fryorcraken
Copy link
Collaborator

@jakubgs what is the restart frequency at the moment for status.prod and status.test please? I tried to check the config but not able to relate.

@jakubgs
Copy link
Contributor

jakubgs commented Nov 18, 2022

@fryorcraken It's literally in he second comment on this issue, with commits: #1186 (comment)

@Menduist
Copy link
Contributor

Update:
Now been running for 4h without any leaks thanks to further patches (and status-im/nim-chronos#329)
Will leave it running over the week-end to be sure, clean up the various PRs & merge them

@jm-clius jm-clius moved this from Todo to In Progress in Waku Nov 21, 2022
@jakubgs
Copy link
Contributor

jakubgs commented Nov 22, 2022

Indeed, it looks very good:

image

Great work!

@Menduist
Copy link
Contributor

Merged libp2p & websock PR, now waiting on status-im/nim-chronos#330

@jm-clius
Copy link
Contributor Author

jm-clius commented Jan 9, 2023

@Menduist are you aware of any progress in merging the underlying fixes?

@Menduist
Copy link
Contributor

Menduist commented Jan 9, 2023

No :/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants