dnsperf: net.h:133: perf_net_sockeq: Assertion `sock_b' failed. #208

Closed
pspacek opened this issue Dec 6, 2022 · 9 comments · Fixed by #218

Comments

@pspacek
Contributor

pspacek commented Dec 6, 2022

Version: 2.10.0, commit c1bef8b

I've occasionally noticed a crash in -m tcp mode:

dnsperf: net.h:133: perf_net_sockeq: Assertion `sock_b' failed.

Shortest reproducer I can come up with is:

while dnsperf -s 127.0.0.1  -d /tmp/qlist -l 0.0001 -m tcp; do true; done

where nothing listens on port 53 at 127.0.0.1, and /tmp/qlist contains just net. SOA and nothing else. Repeat it a couple of times and it should crash within a minute.

Unfortunately, the machine where this happens is in a farm and has neither a debug build nor the other tooling needed to get a proper traceback.

@jelu
Member

jelu commented Jan 16, 2023

I'm unable to hit this, although I hit another bug :D

Possible to get a stack trace via gdb or something?

@pspacek
Contributor Author

pspacek commented Jan 16, 2023

Luckily, I've reproduced it locally, although a build with CFLAGS=-O0 -ggdb3 took many rounds to hit the assertion failure.

Retested on commit 638e7e7.

(gdb) thread apply all bt

Thread 3 (Thread 0x7f6142f11740 (LWP 55448)):
#0  0x00007f614282c0bf in poll () from /usr/lib/libc.so.6
#1  0x00005561f54e6ac7 in perf_os_waituntilanyreadable (socks=0x7ffc05cfd080, nfds=1, pipe_fd=8, timeout=100) at os.c:96
#2  0x00005561f54e68bd in perf_os_waituntilreadable (sock=0x7ffc05cfd100, pipe_fd=8, timeout=100) at os.c:65
#3  0x00005561f54fb612 in main (argc=9, argv=0x7ffc05ffd718) at dnsperf.c:1368

Thread 2 (Thread 0x7f6141c2e6c0 (LWP 55450)):
#0  0x00007f614283a97b in connect () from /usr/lib/libc.so.6
#1  0x00005561f54ebbf5 in perf__tcp_connect (sock=0x5561f6065580) at net_tcp.c:141
#2  0x00005561f54ec254 in perf__tcp_sockready (sock=0x5561f6065580, pipe_fd=4, timeout=100000) at net_tcp.c:283
#3  0x00005561f54f650a in perf_net_sockready (sock=0x5561f6065580, pipe_fd=4, timeout=100000) at /home/pspacek/w/pkg/dnsperf/git/src/net.h:140
#4  0x00005561f54f957a in do_send (arg=0x7f6142430010) at dnsperf.c:848
#5  0x00007f61427b78fd in ?? () from /usr/lib/libc.so.6
#6  0x00007f6142839a60 in ?? () from /usr/lib/libc.so.6

Thread 1 (Thread 0x7f614242f6c0 (LWP 55449)):
#0  0x00007f61427b964c in ?? () from /usr/lib/libc.so.6
#1  0x00007f6142769958 in raise () from /usr/lib/libc.so.6
#2  0x00007f614275353d in abort () from /usr/lib/libc.so.6
#3  0x00007f614275345c in ?? () from /usr/lib/libc.so.6
#4  0x00007f6142762486 in __assert_fail () from /usr/lib/libc.so.6
#5  0x00005561f54f6488 in perf_net_sockeq (sock_a=0x5561f6065580, sock_b=0x0) at /home/pspacek/w/pkg/dnsperf/git/src/net.h:133
#6  0x00005561f54f9de7 in do_recv (arg=0x7f6142430010) at dnsperf.c:1028
#7  0x00007f61427b78fd in ?? () from /usr/lib/libc.so.6
#8  0x00007f6142839a60 in ?? () from /usr/lib/libc.so.6
#5  0x00005561f54f6488 in perf_net_sockeq (sock_a=0x5561f6065580, sock_b=0x0) at /home/pspacek/w/pkg/dnsperf/git/src/net.h:133
133	   assert(sock_b);
(gdb) p sock_b
$1 = (struct perf_net_socket *) 0x0

@pspacek
Contributor Author

pspacek commented Jan 16, 2023

This might or might not be the same thing as #216.

jelu added a commit to jelu/dnsperf that referenced this issue Jan 17, 2023
- `dnsperf`: Fix DNS-OARC#208:
  - `recv_one()`: Fix handling errno, only store EAGAIN if no other error has been received
  - `do_recv()`: Don't break on error as it will count it as a received message
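The two points in this commit message can be illustrated with a small, standalone sketch. This is not dnsperf's actual code: `recv_status_t`, `recv_status_note()`, `recv_status_is_fatal()`, and `drain_batch()` are hypothetical names invented for illustration. The assumption is that `EAGAIN` should only be remembered when nothing more serious has been seen, and that a failed receive should be skipped rather than end the loop and be counted as a received message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical status tracker (not from dnsperf): remembers the most
 * significant errno seen while receiving a batch. 0 means "no error yet". */
typedef struct {
    int saved_errno;
} recv_status_t;

/* Record an error. EAGAIN is stored only if no other error has been
 * received; the first real error is kept and replaces a stored EAGAIN. */
void recv_status_note(recv_status_t *st, int err)
{
    if (err == 0) {
        return;
    }
    if (err == EAGAIN) {
        if (st->saved_errno == 0) {
            st->saved_errno = EAGAIN;
        }
        return;
    }
    if (st->saved_errno == 0 || st->saved_errno == EAGAIN) {
        st->saved_errno = err;
    }
}

/* True when something other than "try again later" was recorded. */
bool recv_status_is_fatal(const recv_status_t *st)
{
    return st->saved_errno != 0 && st->saved_errno != EAGAIN;
}

/* results[i] is 0 for a successful receive or an errno value for a failed
 * one (an assumption made for this sketch). Failed attempts are noted and
 * skipped, so they are never counted as received messages. */
int drain_batch(const int *results, int n, recv_status_t *st)
{
    int received = 0;
    for (int i = 0; i < n; i++) {
        if (results[i] != 0) {
            recv_status_note(st, results[i]);
            continue; /* skip the failure instead of breaking out */
        }
        received++;
    }
    return received;
}
```

With a batch of `{0, ECONNRESET, 0, EAGAIN}`, this sketch counts two received messages and keeps `ECONNRESET` as the saved error, so the later `EAGAIN` does not mask it.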
@jelu
Member

jelu commented Jan 17, 2023

Those CFLAGS helped me replicate! Try the PR?

@pspacek
Contributor Author

pspacek commented Jan 18, 2023

It's probably incomplete. Even with the fix I can still see behavior described in #216.

Also, occasionally dnsperf exits with

Error: failed to receive packet: Bad file descriptor

or

Error: all sockets reported failure, can not continue

while sometimes it produces "normal looking" stats with all zeroes.

The "badfd" error message is the most suspicious part. The all-zero stats are also odd, since "all sockets reported failure" is what should actually happen.

Extending the test length to 1 second, to allow all RSTs to be processed, does not change the behavior; it just takes a lot more cycles to get there.

@jelu
Member

jelu commented Jan 18, 2023

Well, I expected that. At least the assert is fixed. Will look at the badfd more.

jelu added a commit to jelu/dnsperf that referenced this issue Jan 18, 2023
- `dnsperf`: Issue DNS-OARC#208:
  - `recv_one()`: Fix handling errno, only store EAGAIN if no other error has been received
  - `do_recv()`: Don't break on error as it will count it as a received message
jelu added a commit to jelu/dnsperf that referenced this issue Jan 18, 2023
- `net`: Fix DNS-OARC#208: Treat `EBADF` as `EAGAIN` for stateful connections, receive thread might read from a closed socket if send thread is reconnecting
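A rough sketch of the idea in this commit message, under stated assumptions: `recv_stateful()` and its `stateful` flag are hypothetical and are not taken from the dnsperf source. It only shows how a receive path could report `EBADF` as `EAGAIN` when another thread may have closed the descriptor while reconnecting.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical wrapper (not dnsperf's API): receive from fd and, for
 * stateful (connection-oriented) transports, report EBADF as EAGAIN.
 * Assumption: the descriptor may have been closed by a sending thread
 * that is reconnecting, so the reader should simply retry later. */
ssize_t recv_stateful(int fd, void *buf, size_t len, bool stateful)
{
    ssize_t n = recv(fd, buf, len, 0);
    if (n < 0 && errno == EBADF && stateful) {
        errno = EAGAIN; /* treat "socket went away" as "nothing to read yet" */
    }
    return n;
}
```

Calling `recv_stateful(-1, buf, sizeof(buf), true)` returns -1 with `errno` set to `EAGAIN`, while the same call with `stateful` set to `false` leaves `EBADF` in place. One trade-off of this approach is that a genuine descriptor bug would also be reported as `EAGAIN`, so it is probably best limited to the code path where a concurrent reconnect is actually possible.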
@jelu
Member

jelu commented Jan 18, 2023

@pspacek let's try again 😄

@pspacek
Contributor Author

pspacek commented Jan 18, 2023

Not there yet. I can still get the assertion failure and also reproduce #216. Tested on 9f31595.

@jelu jelu closed this as completed in #218 Jan 18, 2023
@jelu jelu reopened this Jan 18, 2023
@pspacek
Contributor Author

pspacek commented Jan 18, 2023

It turns out I was executing an old binary 🤦

Anyway, I don't see assertion failures or badfd messages anymore. What I can still reproduce consistently is #216. This time tested for real on 2e648bb.

@jelu jelu closed this as completed Jan 19, 2023