socket_vmnet gets stuck randomly #39
More information on the issue:
Logs after the VM corresponding to socket 11 is stopped:
On further analysis, it looks like a deadlock problem:
This code has multiple threads writing to a socket at the same time. This will cause problems - potentially corrupting packets. We need to remove the flooding by mapping MAC addresses to socket IDs, so that 2 threads don't write to a given socket at the same time.
@sheelchand Thanks for the analysis, would you be interested in submitting a PR?
I tried, but the MACs in the packets don't match the VMs' MACs. I tried putting a semaphore before sending to a socket, but that did not help either. I will continue to look at it.
There are 2 race conditions when writing to connections:
- Since we don't hold the semaphore during iteration, a connection can be removed by another thread while we try to use it, which will lead to use-after-free.
- Multiple threads may try to write to the same connection socket, corrupting the packets (lima-vm#39).

Both issues are fixed by holding the semaphore during iteration and while writing to the socket. This is not the most efficient way, but socket_vmnet crashes daily and we must stop the bleeding first. We can add more fine-grained locking later.
@sheelchand Removing flooding is important for performance, but it will not solve the issue of writing to the same socket at the same time from different threads. Example flow when we send each packet only to its destination:
writev(2) does not mention anything about thread safety or the message size that can be written atomically, so we should assume that it is unsafe. send(2) seems safe:
But using send we would have to copy the packet to do one syscall, so writev seems the better way, especially if we find how to send multiple packets per syscall instead of one, without sending all packets to all the guests.
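As a language-neutral illustration of the locking idea (socket_vmnet itself is written in C, so this Go sketch only shows the concept of one lock per guest connection; the names are made up for illustration):

```go
package forward

import (
	"net"
	"sync"
)

// guardedConn serializes writes to one guest connection so that two
// senders can never interleave the header and body of different packets
// on the same stream socket. Conceptual sketch only; the C code would
// hold a per-connection mutex around its writev() call instead.
type guardedConn struct {
	mu   sync.Mutex
	conn net.Conn
}

func (g *guardedConn) writePacket(header, body []byte) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if _, err := g.conn.Write(header); err != nil {
		return err
	}
	_, err := g.conn.Write(body)
	return err
}
```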
I can reproduce it reliably now - the key is running socket_vmnet in the shell and not via launchd. When we run with launchd, we don't specify ProcessType, so the system applies resource limits which make socket_vmnet almost twice as slow and make this bug harder to reproduce. The backtrace I see is:
This is not really a deadlock in socket_vmnet - it is trying to write to the vm unix socket, and if the buffers are full, the write will block. We can avoid this by using non-blocking I/O and dropping packets when the socket is not writable. But why can't we write to the socket? I think the issue is in the lima hostagent - if it fails to copy a packet from the unix socket to the datagram socket, it stops reading without logging any error! The issue comes from tcpproxy:
Inside proxyCopy we get the error from io.Copy and return it via the channel:
So it waits until both goroutines finish, but drops the error silently. The most likely error is ENOBUFS - it is impossible to avoid this error on macOS when writing fast to a unix datagram socket. The write should be retried until it succeeds, or the packet should be dropped. But tcpproxy just fails the entire operation silently. @balajiv113 What do you think?
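For illustration, handling this on the writing side could look like the following sketch, assuming the hostagent writes to the vz datagram socket through a net.Conn; the timeout, sleep interval, and function name are illustrative, not lima's actual code:

```go
package forward

import (
	"errors"
	"net"
	"syscall"
	"time"
)

// writeWithRetry retries a datagram write that fails with ENOBUFS and
// drops the packet if the socket stays unwritable past the timeout.
// Sketch only: values and names are illustrative.
func writeWithRetry(conn net.Conn, pkt []byte, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		_, err := conn.Write(pkt)
		if err == nil {
			return nil
		}
		if !errors.Is(err, syscall.ENOBUFS) {
			return err // unexpected error: propagate instead of hiding it
		}
		if time.Now().After(deadline) {
			return err // give up and drop this packet, but report it
		}
		// The kernel buffer is full; give it a moment to drain.
		time.Sleep(100 * time.Microsecond)
	}
}
```

The important part is that an error is either retried or surfaced to the caller for logging, instead of silently terminating the forwarding loop.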
According to launchd.plist(5), if ProcessType is left unspecified, the system will apply light resource limits to the job, throttling its CPU usage and I/O bandwidth. It turns out that these resource limits cause lower and unpredictable performance. Setting ProcessType to Interactive increases iperf3 throughput from 1.32 to 3.36 Gbit/s (2.45 times faster).

Before:

% iperf3-darwin -c 192.168.105.58
Connecting to host 192.168.105.58, port 5201
[ 5] local 192.168.105.1 port 50333 connected to 192.168.105.58 port 5201
[ ID] Interval        Transfer    Bitrate         Retr  Cwnd         RTT
[ 5] 0.00-1.00  sec  158 MBytes  1.32 Gbits/sec    0    2.91 MBytes  21ms
[ 5] 1.00-2.00  sec  163 MBytes  1.36 Gbits/sec    0    3.05 MBytes  19ms
[ 5] 2.00-3.00  sec  152 MBytes  1.28 Gbits/sec    0    3.15 MBytes  19ms
[ 5] 3.00-4.00  sec  167 MBytes  1.40 Gbits/sec    0    3.23 MBytes  21ms
[ 5] 4.00-5.00  sec  162 MBytes  1.36 Gbits/sec    0    3.29 MBytes  21ms
[ 5] 5.00-6.00  sec  151 MBytes  1.27 Gbits/sec    0    3.34 MBytes  21ms
[ 5] 6.00-7.00  sec  160 MBytes  1.34 Gbits/sec    0    3.36 MBytes  20ms
[ 5] 7.00-8.00  sec  152 MBytes  1.28 Gbits/sec    0    3.38 MBytes  22ms
[ 5] 8.00-9.00  sec  161 MBytes  1.35 Gbits/sec    0    3.38 MBytes  23ms
[ 5] 9.00-10.00 sec  152 MBytes  1.27 Gbits/sec    0    3.39 MBytes  21ms
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval        Transfer    Bitrate         Retr
[ 5] 0.00-10.00 sec  1.54 GBytes 1.32 Gbits/sec    0    sender
[ 5] 0.00-10.00 sec  1.54 GBytes 1.32 Gbits/sec         receiver

After:

% iperf3-darwin -c 192.168.105.58
Connecting to host 192.168.105.58, port 5201
[ 5] local 192.168.105.1 port 50358 connected to 192.168.105.58 port 5201
[ ID] Interval        Transfer    Bitrate         Retr  Cwnd         RTT
[ 5] 0.00-1.00  sec  431 MBytes  3.61 Gbits/sec    0    8.00 MBytes   9ms
[ 5] 1.00-2.00  sec  333 MBytes  2.79 Gbits/sec    0    8.00 MBytes  15ms
[ 5] 2.00-3.00  sec  371 MBytes  3.11 Gbits/sec    0    8.00 MBytes  10ms
[ 5] 3.00-4.00  sec  373 MBytes  3.12 Gbits/sec    0    8.00 MBytes   9ms
[ 5] 4.00-5.00  sec  415 MBytes  3.48 Gbits/sec    0    8.00 MBytes   9ms
[ 5] 5.00-6.00  sec  424 MBytes  3.55 Gbits/sec    0    8.00 MBytes  10ms
[ 5] 6.00-7.00  sec  423 MBytes  3.55 Gbits/sec    0    8.00 MBytes  10ms
[ 5] 7.00-8.00  sec  413 MBytes  3.46 Gbits/sec    0    8.00 MBytes  10ms
[ 5] 8.00-9.00  sec  418 MBytes  3.51 Gbits/sec    0    8.00 MBytes   9ms
[ 5] 9.00-10.00 sec  407 MBytes  3.41 Gbits/sec    0    8.00 MBytes   9ms
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval        Transfer    Bitrate         Retr
[ 5] 0.00-10.00 sec  3.91 GBytes 3.36 Gbits/sec    0    sender
[ 5] 0.00-10.01 sec  3.91 GBytes 3.36 Gbits/sec         receiver

Testing with 2 VMs is much slower and tends to get stuck because of lima-vm#39.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
According to launchd.plist(5), if ProcessType is left unspecified, the system will apply light resource limits to the job, throttling its CPU usage and I/O bandwidth. It turns out that these resource limits cause lower and unpredictable performance. Setting ProcessType to Interactive increases iperf3 throughput from 1.32 to 3.59 Gbit/s (2.71 times faster).

Before:

% iperf3-darwin -c 192.168.105.58
Connecting to host 192.168.105.58, port 5201
[ 5] local 192.168.105.1 port 50333 connected to 192.168.105.58 port 5201
[ ID] Interval        Transfer    Bitrate         Retr  Cwnd         RTT
[ 5] 0.00-1.00  sec  158 MBytes  1.32 Gbits/sec    0    2.91 MBytes  21ms
[ 5] 1.00-2.00  sec  163 MBytes  1.36 Gbits/sec    0    3.05 MBytes  19ms
[ 5] 2.00-3.00  sec  152 MBytes  1.28 Gbits/sec    0    3.15 MBytes  19ms
[ 5] 3.00-4.00  sec  167 MBytes  1.40 Gbits/sec    0    3.23 MBytes  21ms
[ 5] 4.00-5.00  sec  162 MBytes  1.36 Gbits/sec    0    3.29 MBytes  21ms
[ 5] 5.00-6.00  sec  151 MBytes  1.27 Gbits/sec    0    3.34 MBytes  21ms
[ 5] 6.00-7.00  sec  160 MBytes  1.34 Gbits/sec    0    3.36 MBytes  20ms
[ 5] 7.00-8.00  sec  152 MBytes  1.28 Gbits/sec    0    3.38 MBytes  22ms
[ 5] 8.00-9.00  sec  161 MBytes  1.35 Gbits/sec    0    3.38 MBytes  23ms
[ 5] 9.00-10.00 sec  152 MBytes  1.27 Gbits/sec    0    3.39 MBytes  21ms
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval        Transfer    Bitrate         Retr
[ 5] 0.00-10.00 sec  1.54 GBytes 1.32 Gbits/sec    0    sender
[ 5] 0.00-10.00 sec  1.54 GBytes 1.32 Gbits/sec         receiver

After:

% iperf3-darwin -c 192.168.105.58
Connecting to host 192.168.105.58, port 5201
[ 5] local 192.168.105.1 port 50415 connected to 192.168.105.58 port 5201
[ ID] Interval        Transfer    Bitrate         Retr  Cwnd         RTT
[ 5] 0.00-1.00  sec  453 MBytes  3.80 Gbits/sec    0    8.00 MBytes  10ms
[ 5] 1.00-2.00  sec  426 MBytes  3.57 Gbits/sec    0    8.00 MBytes   9ms
[ 5] 2.00-3.00  sec  422 MBytes  3.54 Gbits/sec    0    8.00 MBytes   9ms
[ 5] 3.00-4.00  sec  405 MBytes  3.40 Gbits/sec    0    8.00 MBytes   9ms
[ 5] 4.00-5.00  sec  429 MBytes  3.60 Gbits/sec    0    8.00 MBytes   9ms
[ 5] 5.00-6.00  sec  433 MBytes  3.64 Gbits/sec    0    8.00 MBytes   9ms
[ 5] 6.00-7.00  sec  432 MBytes  3.62 Gbits/sec    0    8.00 MBytes  10ms
[ 5] 7.00-8.00  sec  432 MBytes  3.63 Gbits/sec    0    8.00 MBytes   9ms
[ 5] 8.00-9.00  sec  414 MBytes  3.47 Gbits/sec    0    8.00 MBytes   9ms
[ 5] 9.00-10.00 sec  433 MBytes  3.63 Gbits/sec    0    8.00 MBytes   9ms
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval        Transfer    Bitrate         Retr
[ 5] 0.00-10.00 sec  4.18 GBytes 3.59 Gbits/sec    0    sender
[ 5] 0.00-10.01 sec  4.18 GBytes 3.59 Gbits/sec         receiver

Testing with 2 VMs is much slower and tends to get stuck because of lima-vm#39.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
We used an external package (tcpproxy) for proxying between unix stream and datagram sockets. This package cannot handle the ENOBUFS error, an expected condition on BSD-based systems, and worse, it hides errors and stops forwarding packets silently when a write to the vz socket fails with ENOBUFS[1]. Fix the issues by replacing tcpproxy with our own implementation.

Fixes:
- Fix error handling if a write to the vz datagram socket fails with ENOBUFS. We retry the write and drop the packet if we could not write after a short timeout.
- Fix error handling if we could not read the packet header or body from the socket_vmnet stream socket. Previously we logged an error and continued, sending a corrupted packet to vz.
- Fix error handling if writing a packet to the socket_vmnet stream socket returned after writing a partial packet. Now we write the complete packet after a short write. Previously this ended with 2 corrupted packets.
- Log an error if forwarding packets from vz to socket_vmnet or from socket_vmnet to vz failed.

Simplification:
- Use binary.Read() and binary.Write() to read and write the qemu packet header.

New logs:

% grep 'Dropping packet' ~/.lima/server/ha.stderr.log
{"level":"debug","msg":"Dropping packet: write unixgram -\u003e: write: no buffer space available","time":"2024-10-02T04:01:39+03:00"}

[1] lima-vm/socket_vmnet#39

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
@nirs In gvisor-tap-vsock as well, we recently handled ENOBUFS: https://github.com/containers/gvisor-tap-vsock/pull/370/files This usually happens with dgram sockets; maybe we can simply try to create a custom connection overriding
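A minimal sketch of that approach, assuming the datagram side is exposed as a net.Conn that we can wrap; the type name and retry interval are made up for illustration and are not the actual gvisor-tap-vsock or lima code:

```go
package forward

import (
	"errors"
	"net"
	"syscall"
	"time"
)

// retryConn wraps a datagram connection and retries writes that fail with
// ENOBUFS, which macOS returns when the socket buffer is temporarily full.
// Illustrative sketch only.
type retryConn struct {
	net.Conn
}

func (c retryConn) Write(p []byte) (int, error) {
	for {
		n, err := c.Conn.Write(p)
		if err == nil || !errors.Is(err, syscall.ENOBUFS) {
			return n, err
		}
		// Buffer full: back off briefly and try the same packet again.
		time.Sleep(50 * time.Microsecond)
	}
}
```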
@balajiv113 thanks, I'll send a lima fix.
We used an external package (tcpproxy) for proxying between unix stream and datagram sockets. This package cannot handle the ENOBUFS error, an expected condition on BSD-based systems, and worse, it hides errors and stops forwarding packets silently when a write to the vz socket fails with ENOBUFS[1]. Fix the issues by replacing tcpproxy with a simpler and more direct implementation that will be easier to maintain.

Fixes:
- Fix error handling if a write to the vz datagram socket fails with ENOBUFS. We retry the write until it succeeds. The same solution is used in gvisor-tap-vsock[2].
- Fix error handling if we could not read the packet header or body from the socket_vmnet stream socket. Previously we logged an error and continued, sending a corrupted packet to vz.
- Fix error handling if writing a packet to the socket_vmnet stream socket returned after writing a partial packet. Now we handle short writes and write the complete packet. Previously this ended with 2 corrupted packets.
- Log an error if forwarding packets from vz to socket_vmnet or from socket_vmnet to vz failed.

Simplification:
- Use binary.Read() and binary.Write() to read and write the qemu packet header.

[1] lima-vm/socket_vmnet#39
[2] containers/gvisor-tap-vsock#370

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
We used an external package (tcpproxy) for proxying between unix stream and datagram sockets. This package cannot handle the ENOBUFS error, an expected condition on BSD-based systems, and worse, it hides errors and stops forwarding packets silently when a write to the vz socket fails with ENOBUFS[1]. Fix the issues by replacing tcpproxy with a simpler and more direct implementation that will be easier to maintain.

Fixes:
- Fix error handling if a write to the vz datagram socket fails with ENOBUFS. We retry the write until it succeeds. The same solution is used in gvisor-tap-vsock[2].
- Fix error handling if we could not read the packet header or body from the socket_vmnet stream socket. Previously we logged an error and continued, sending a corrupted packet to vz.
- Fix error handling if writing a packet to the socket_vmnet stream socket returned after writing a partial packet. Now we handle short writes and write the complete packet. Previously this ended with 2 corrupted packets.
- Log an error if forwarding packets from vz to socket_vmnet or from socket_vmnet to vz failed.

Simplification:
- Use binary.Read() and binary.Write() to read and write the qemu packet header.

Visibility:
- Make QEMUPacketConn private since it is an implementation detail correct only for the lima vz-socket_vmnet use case.

[1] lima-vm/socket_vmnet#39
[2] containers/gvisor-tap-vsock#370

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
We used an external package (tcpproxy) for proxying between unix stream and datagram sockets. This package cannot handle the ENOBUFS error, an expected condition on BSD-based systems, and worse, it hides errors and stops forwarding packets silently when a write to the vz socket fails with ENOBUFS[1]. Fix the issues by replacing tcpproxy with a simpler and more direct implementation that will be easier to maintain.

Fixes:
- Fix error handling if a write to the vz datagram socket fails with ENOBUFS. We retry the write until it succeeds, with a very short sleep between retries. A similar solution is used in gvisor-tap-vsock[2].
- Fix error handling if we could not read the packet header or body from the socket_vmnet stream socket. Previously we logged an error and continued, sending a corrupted packet to vz.
- Fix error handling if writing a packet to the socket_vmnet stream socket returned after writing a partial packet. Now we handle short writes and write the complete packet. Previously this ended with 2 corrupted packets.
- Log an error if forwarding packets from vz to socket_vmnet or from socket_vmnet to vz failed.

Simplification:
- Use binary.Read() and binary.Write() to read and write the qemu packet header.

Visibility:
- Make QEMUPacketConn private since it is an implementation detail correct only for the lima vz-socket_vmnet use case.

[1] lima-vm/socket_vmnet#39
[2] containers/gvisor-tap-vsock#370

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
We used an external package (tcpproxy) for proxying between unix stream and datagram sockets. This package cannot handle the ENOBUFS error, an expected condition on BSD-based systems, and worse, it hides errors and stops forwarding packets silently when a write to the vz socket fails with ENOBUFS[1]. Fix the issues by replacing tcpproxy with a simpler and more direct implementation that will be easier to maintain.

Fixes:
- Fix error handling if a write to the vz datagram socket fails with ENOBUFS. We retry the write until it succeeds, with a very short sleep between retries. A similar solution is used in gvisor-tap-vsock[2].
- Fix error handling if we could not read the packet header or body from the socket_vmnet stream socket. Previously we logged an error and continued to send corrupted packets to vz from the point of the failure.
- Fix error handling if writing a packet to the socket_vmnet stream socket returned after writing a partial packet. Now we handle short writes and write the complete packet. Previously this would break the protocol and continue to send corrupted packets from the point of the failure.
- Log an error if forwarding packets from vz to socket_vmnet or from socket_vmnet to vz failed.

Simplification:
- Use binary.Read() and binary.Write() to read and write the qemu packet header.

Visibility:
- Make QEMUPacketConn private since it is an implementation detail of vz when using socket_vmnet.

[1] lima-vm/socket_vmnet#39
[2] containers/gvisor-tap-vsock#370

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
The ENOBUFS errors are easily reproducible and recoverable with lima-vm/lima#2680. @sheelchand can you test with the lima fix?
We used an external package (tcpproxy) for proxying between unix stream and datagram sockets. This package cannot handle the ENOBUFS error, an expected condition on BSD-based systems, and worse, it hides errors and stops forwarding packets silently when a write to the vz socket fails with ENOBUFS[1]. Fix the issues by replacing tcpproxy with a simpler and more direct implementation that will be easier to maintain.

Fixes:
- Fix error handling if a write to the vz datagram socket fails with ENOBUFS. We retry the write until it succeeds, with a very short sleep between retries. A similar solution is used in gvisor-tap-vsock[2].
- Fix error handling if we could not read the packet header or body from the socket_vmnet stream socket. Previously we logged an error and continued to send corrupted packets to vz from the point of the failure.
- Fix error handling if writing a packet to the socket_vmnet stream socket returned after writing a partial packet. Now we handle short writes and write the complete packet. Previously this would break the protocol and continue to send corrupted packets from the point of the failure.
- Log an error if forwarding packets from vz to socket_vmnet or from socket_vmnet to vz failed.

Simplification:
- Use binary.Read() and binary.Write() to read and write the qemu packet header.

Visibility:
- Make QEMUPacketConn private since it is an implementation detail of vz when using socket_vmnet.

Testing:
- Add a packet forwarding test covering the happy path in 10 milliseconds.

[1] lima-vm/socket_vmnet#39
[2] containers/gvisor-tap-vsock#370

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
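For reference, the length-prefixed framing described above can be sketched like this; it assumes a 4-byte big-endian length header as used by the qemu-style stream protocol, and the function names are illustrative rather than lima's actual code:

```go
package forward

import (
	"encoding/binary"
	"fmt"
	"io"
)

// readPacket reads one packet from the socket_vmnet stream socket:
// a 4-byte big-endian length header followed by the frame itself.
func readPacket(r io.Reader) ([]byte, error) {
	var size uint32
	if err := binary.Read(r, binary.BigEndian, &size); err != nil {
		return nil, fmt.Errorf("reading packet header: %w", err)
	}
	buf := make([]byte, size)
	if _, err := io.ReadFull(r, buf); err != nil {
		return nil, fmt.Errorf("reading packet body: %w", err)
	}
	return buf, nil
}

// writePacket writes the header and then the complete frame, treating a
// short write as an error instead of leaving a half-written packet.
func writePacket(w io.Writer, pkt []byte) error {
	if err := binary.Write(w, binary.BigEndian, uint32(len(pkt))); err != nil {
		return fmt.Errorf("writing packet header: %w", err)
	}
	if n, err := w.Write(pkt); err != nil {
		return fmt.Errorf("writing packet body: %w", err)
	} else if n < len(pkt) {
		return io.ErrShortWrite
	}
	return nil
}
```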
The lima issue is fixed now. We can close this when a lima version including the fix is released.
@AkihiroSuda, @jandubois: lima >= 1.0.0 fixed the issue, we can close this.
We have 7 qemu VMs running, with 3 virtual ethernet interfaces each.
socket_vmnet works most of the time, but it randomly stops working and communication between the VMs stops.
The debug logs show the process gets stuck on a writev() call.
DEBUG| [Socket-to-Socket i=1815762] Sending from socket 8 to socket 5: 4 + 95 bytes
There is no log after the above line.
On VM reboot, the logs show that the writev() call returns -1.
I suspect this is due to a race condition when multiple threads are accessing the socket to send and receive data. I don't have the exact explanation yet, but the behavior points to a race condition.