socket_vmnet gets stuck randomly #39

Closed
sheelchand opened this issue Oct 17, 2023 · 12 comments
Labels
bug Something isn't working

Comments

@sheelchand

We have 7 qemu VMs running, each with 3 virtual ethernet interfaces.

socket_vmnet works most of the time but randomly stops working, and communication between the VMs stops.

The debug logs show the process getting stuck on a writev() call:
DEBUG| [Socket-to-Socket i=1815762] Sending from socket 8 to socket 5: 4 + 95 bytes

There is no log after the above line.
After the VM is rebooted, the logs show that the writev() call returned -1.

I suspect this is due to a race condition when multiple threads access the socket to send and receive data. I don't have the exact explanation yet, but the behavior points to a race condition.

@sheelchand
Author

More information on the issue:

  1. I mapped each VM to a socket ID and saw that a thread was stuck writing to socket 11. I stopped the VM corresponding to socket 11 and socket_vmnet crashed out:
    The process was stuck sending to socket 11:
    DEBUG| [Socket-to-Socket i=166220] Sending from socket 7 to socket 11: 4 + 46 bytes

Logs after the VM corresponding to socket 11 is stopped:
DEBUG| [Socket-to-Socket i=166220] Sent from socket 7 to socket 11: -1 bytes (including uint32be header)
writev: Broken pipe
DEBUG| [Socket-to-VMNET i=166221] Receiving from the socket 7
DEBUG| [Socket-to-VMNET i=166221] Received from the socket 7: 46 bytes
DEBUG| [Socket-to-VMNET i=166221] Sending to VMNET: 46 bytes
on_accept(): vmnet_return_t VMNET_INVALID_ARGUMENT
vmnet_write: Broken pipe
DEBUG| [Socket-to-Socket i=9903992] Sent from socket 5 to socket 7: -1 bytes (including uint32be header)
writev: Bad file descriptor
DEBUG| [Socket-to-VMNET i=9903993] Receiving from the socket 5
DEBUG| [Socket-to-Socket i=53677] Sent from socket 10 to socket 7: -1 bytes (including uint32be header)
writev: Bad file descriptor
DEBUG| [Socket-to-VMNET i=53678] Receiving from the socket 10
DEBUG| [Socket-to-Socket i=4171484] Sent from socket 9 to socket 7: -1 bytes (including uint32be header)
writev: Bad file descriptor
DEBUG| [Socket-to-VMNET i=4171485] Receiving from the socket 9
DEBUG| [Socket-to-VMNET i=9903993] Received from the socket 5: 1514 bytes
DEBUG| [Socket-to-VMNET i=9903993] Sending to VMNET: 1514 bytes
DEBUG| [Socket-to-Socket i=5600510] Sent from socket 11 to socket 7: -1 bytes (including uint32be header)
writev: Bad file descriptor
DEBUG| [Socket-to-VMNET i=5600511] Receiving from the socket 11
DEBUG| [Socket-to-Socket i=93534327] Sent from socket 6 to socket 7: -1 bytes (including uint32be header)
writev: Bad file descriptor
DEBUG| [Socket-to-VMNET i=93534328] Receiving from the socket 6
on_accept(): vmnet_return_t VMNET_INVALID_ARGUMENT
vmnet_write: Bad file descriptor
DEBUG| [Socket-to-Socket i=68470895] Sent from socket 8 to socket 7: -1 bytes (including uint32be header)
writev: Bad file descriptor
DEBUG| [Socket-to-VMNET i=68470896] Receiving from the socket 8
DEBUG| [Handler i=0] Sent to the socket: -1 bytes (including uint32be header)
writev: Bad file descriptor
DEBUG| [Socket-to-VMNET i=53678] Received from the socket 10: 46 bytes
DEBUG| [Socket-to-VMNET i=53678] Sending to VMNET: 46 bytes
DEBUG| [Socket-to-VMNET i=5600511] Received from the socket 11: 78 bytes
on_accept(): vmnet_return_t VMNET_INVALID_ARGUMENT
vmnet_write: Bad file descriptor
DEBUG| [Socket-to-VMNET i=93534328] Received from the socket 6: 70 bytes
DEBUG| [Socket-to-VMNET i=93534328] Sending to VMNET: 70 bytes
DEBUG| shutting down with rc=1
DEBUG| [Socket-to-VMNET i=4171485] Received from the socket 9: 124 bytes
DEBUG| [Socket-to-VMNET i=4171485] Sending to VMNET: 124 bytes
DEBUG| [Socket-to-VMNET i=5600511] Sending to VMNET: 78 bytes

@sheelchand
Author

On further analysis, it looks like a deadlock problem:
Socket 7's buffer is full, so other sockets (including socket 11) are blocked sending to it. Socket 7 in turn gets blocked sending to socket 11 because socket 11's buffer is full, so socket 7 and socket 11 are deadlocked.

@sheelchand
Author

This code has multiple threads writing to a socket at the same time. This will cause problems, potentially corrupting packets. We need to remove the flooding by mapping MAC addresses to socket IDs so that two threads don't write to a given socket at the same time.
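
For illustration, a minimal Go sketch of the MAC-to-socket mapping proposed above (hypothetical names, not code from this repository, which is written in C): learn each VM's source MAC and forward unicast frames only to the matching connection, flooding only broadcast/multicast or unknown destinations.

package vmnetswitch

import (
	"net"
	"sync"
)

// switchTable maps learned MAC addresses to VM connections so that each frame
// is written by exactly one forwarding path instead of being flooded.
type switchTable struct {
	mu    sync.RWMutex
	ports map[[6]byte]net.Conn // MAC address -> VM connection
}

// learn records which connection owns a source MAC address.
func (t *switchTable) learn(srcMAC [6]byte, c net.Conn) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.ports[srcMAC] = c
}

// lookup returns the single destination for a unicast MAC, or nil if the
// frame still has to be flooded (broadcast/multicast or unknown address).
func (t *switchTable) lookup(dstMAC [6]byte) net.Conn {
	if dstMAC[0]&1 == 1 { // group bit set: broadcast or multicast
		return nil
	}
	t.mu.RLock()
	defer t.mu.RUnlock()
	return t.ports[dstMAC]
}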

@AkihiroSuda AkihiroSuda added the bug Something isn't working label Oct 24, 2023
@AkihiroSuda
Member

@sheelchand Thanks for the analysis - would you be interested in submitting a PR?

@sheelchand
Author

I tried, but the MACs in the packets don't match the VMs' MACs. I tried putting a semaphore before sending to a socket, but that did not help either. I will continue to look at it.

nirs added a commit to nirs/socket_vmnet that referenced this issue Sep 14, 2024
There are 2 race conditions when writing to connections:
- Since we don't hold the semaphore during iteration, a connection can
  be removed by another thread while we try to use it, which will lead
  to a use-after-free.
- Multiple threads may try to write to the same connection socket,
  corrupting the packets (lima-vm#39).

Both issues are fixed by holding the semaphore during iteration and
while writing to the socket. This is not the most efficient way, but
socket_vmnet crashes daily and we must stop the bleeding first. We can
add more fine-grained locking later.
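
The coarse-grained pattern described in the commit, as a hedged Go sketch (the actual fix is C code using a dispatch semaphore; names here are illustrative): the lock is held for the whole iterate-and-write step, so a connection cannot be freed under the writer and two writers cannot interleave on one socket.

package vmnetswitch

import (
	"net"
	"sync"
)

type registry struct {
	mu    sync.Mutex
	conns map[int]net.Conn // socket id -> VM connection
}

// broadcast forwards a frame to every connection except the sender while the
// lock is held, preventing both use-after-free on removal and interleaved
// writes to the same socket from different goroutines.
func (r *registry) broadcast(senderID int, frame []byte) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for id, c := range r.conns {
		if id == senderID {
			continue
		}
		if _, err := c.Write(frame); err != nil {
			// The peer went away; close and remove it under the same lock.
			c.Close()
			delete(r.conns, id)
		}
	}
}
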
@nirs
Member

nirs commented Sep 15, 2024

@sheelchand removing flooding is important for performance but it will not solve the issue of writing to the same socket at the same time from different threads.

Example flow when we send each packet only to its destination:

thread1                              thread2          thread3
write packet from conn1 to conn2     -                write packet from conn3 to conn2

writev(2) does not say anything about thread safety or about a message size that can be written atomically, so we should assume that concurrent writes are unsafe.

send(2) seems safe:

 ...The length of the message is given by length.  If the message is too long
 to pass atomically through the underlying protocol, the error EMSGSIZE is
 returned, and the message is not transmitted.

But using send() we would have to copy the packet in order to do one syscall, so writev() seems the better way - especially if we find a way to send multiple packets per syscall, instead of one, without sending all packets to all the guests.
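
As a hedged sketch of that direction (illustrative Go, hypothetical names; the daemon itself is C): serialize writes per destination with a lock and batch several length-prefixed packets into one writev(2)-backed call via net.Buffers.

package vmnetswitch

import (
	"encoding/binary"
	"net"
	"sync"
)

// dest wraps one destination socket; the mutex guarantees that the 4-byte
// big-endian headers and payloads from different senders never interleave.
type dest struct {
	mu sync.Mutex
	c  net.Conn
}

// sendBatch frames each packet with a uint32be length header and writes the
// whole batch with a single writev where the platform supports it.
func (d *dest) sendBatch(packets [][]byte) error {
	bufs := make(net.Buffers, 0, 2*len(packets))
	for _, p := range packets {
		hdr := make([]byte, 4)
		binary.BigEndian.PutUint32(hdr, uint32(len(p)))
		bufs = append(bufs, hdr, p)
	}
	d.mu.Lock()
	defer d.mu.Unlock()
	_, err := bufs.WriteTo(d.c) // net.Buffers uses writev(2) under the hood
	return err
}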

@nirs
Member

nirs commented Sep 30, 2024

I can reproduce it reliably now - the key is running socket_vmnet in the shell and not via launchd. When we run with launchd, we don't specify ProcessType, so the system applies resource limits that make socket_vmnet almost twice as slow and make this bug harder to reproduce.

The backtrace I see is:

(lldb) bt all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x0000000193c856b4 libsystem_kernel.dylib`__accept + 8
    frame #1: 0x00000001044a5ad8 socket_vmnet`main + 2292
    frame #2: 0x0000000193933154 dyld`start + 2476
  thread #2, queue = 'io.github.lima-vm.socket_vmnet.accept'
    frame #0: 0x0000000193c80c4c libsystem_kernel.dylib`writev + 8
    frame #1: 0x00000001044a5ed0 socket_vmnet`__main_block_invoke + 748
    frame #2: 0x0000000193b0b750 libdispatch.dylib`_dispatch_call_block_and_release + 32
    frame #3: 0x0000000193b0d3e8 libdispatch.dylib`_dispatch_client_callout + 20
    frame #4: 0x0000000193b108ec libdispatch.dylib`_dispatch_continuation_pop + 600
    frame #5: 0x0000000193b0ff08 libdispatch.dylib`_dispatch_async_redirect_invoke + 584
    frame #6: 0x0000000193b1eea8 libdispatch.dylib`_dispatch_root_queue_drain + 392
    frame #7: 0x0000000193b1f6b8 libdispatch.dylib`_dispatch_worker_thread2 + 156
    frame #8: 0x0000000193cb9fd0 libsystem_pthread.dylib`_pthread_wqthread + 228
  thread #3, queue = 'io.github.lima-vm.socket_vmnet.accept'
    frame #0: 0x0000000193c7d994 libsystem_kernel.dylib`read + 8
    frame #1: 0x00000001044a5d98 socket_vmnet`__main_block_invoke + 436
    frame #2: 0x0000000193b0b750 libdispatch.dylib`_dispatch_call_block_and_release + 32
    frame #3: 0x0000000193b0d3e8 libdispatch.dylib`_dispatch_client_callout + 20
    frame #4: 0x0000000193b108ec libdispatch.dylib`_dispatch_continuation_pop + 600
    frame #5: 0x0000000193b0ff08 libdispatch.dylib`_dispatch_async_redirect_invoke + 584
    frame #6: 0x0000000193b1eea8 libdispatch.dylib`_dispatch_root_queue_drain + 392
    frame #7: 0x0000000193b1f6b8 libdispatch.dylib`_dispatch_worker_thread2 + 156
    frame #8: 0x0000000193cb9fd0 libsystem_pthread.dylib`_pthread_wqthread + 228
  thread #4, queue = 'io.github.lima-vm.socket_vmnet.events'
    frame #0: 0x0000000193c80c4c libsystem_kernel.dylib`writev + 8
    frame #1: 0x00000001044a66b8 socket_vmnet`_on_vmnet_packets_available + 740
    frame #2: 0x00000001044a622c socket_vmnet`__start_block_invoke.31 + 176
    frame #3: 0x000000022a5018c8 vmnet`__vmnet_interface_set_event_callback_block_invoke_3 + 132
    frame #4: 0x0000000193b0d3e8 libdispatch.dylib`_dispatch_client_callout + 20
    frame #5: 0x0000000193b108ec libdispatch.dylib`_dispatch_continuation_pop + 600
    frame #6: 0x0000000193b247f0 libdispatch.dylib`_dispatch_source_latch_and_call + 420
    frame #7: 0x0000000193b233b4 libdispatch.dylib`_dispatch_source_invoke + 832
    frame #8: 0x0000000193b10764 libdispatch.dylib`_dispatch_continuation_pop + 208
    frame #9: 0x0000000193b0ff08 libdispatch.dylib`_dispatch_async_redirect_invoke + 584
    frame #10: 0x0000000193b2000c libdispatch.dylib`_dispatch_root_queue_drain_deferred_item + 336
    frame #11: 0x0000000193b1f8f4 libdispatch.dylib`_dispatch_kevent_worker_thread + 500
    frame #12: 0x0000000193cba044 libsystem_pthread.dylib`_pthread_wqthread + 344
  thread #5
    frame #0: 0x0000000193c7ea84 libsystem_kernel.dylib`__workq_kernreturn + 8

This is not really a deadlock in socket_vmnet - it is trying to write to the VM's unix socket, and if the buffers are full, the write blocks. We can avoid this by using non-blocking I/O and dropping packets when the socket is not writable. But why can't we write to the socket?
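
A minimal sketch of that non-blocking alternative (illustrative Go using golang.org/x/sys/unix; the real change would be in the C daemon): mark the descriptor non-blocking and drop the frame when the peer's socket buffer is full.

package vmnetswitch

import "golang.org/x/sys/unix"

// writeOrDrop does one non-blocking write and drops the frame when the
// destination socket buffer is full, instead of blocking the forwarder.
func writeOrDrop(fd int, frame []byte) (dropped bool, err error) {
	if err := unix.SetNonblock(fd, true); err != nil {
		return false, err
	}
	if _, err := unix.Write(fd, frame); err != nil {
		if err == unix.EAGAIN {
			return true, nil // buffer full: drop this packet
		}
		return false, err
	}
	return false, nil
}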

I think the issue is in the lima hostagent - if it fails to copy a packet from the unix socket to the datagram socket, it stops reading without logging any error!

The issue comes from tcpproxy:

// HandleConn implements the Target interface.
func (dp *DialProxy) HandleConn(src net.Conn) {
	ctx := context.Background()
	var cancel context.CancelFunc
	if dp.DialTimeout >= 0 {
		ctx, cancel = context.WithTimeout(ctx, dp.dialTimeout())
	}
	dst, err := dp.dialContext()(ctx, "tcp", dp.Addr)
	if cancel != nil {
		cancel()
	}
	if err != nil {
		dp.onDialError()(src, err)
		return
	}
	defer goCloseConn(dst)

	if err = dp.sendProxyHeader(dst, src); err != nil {
		dp.onDialError()(src, err)
		return
	}
	defer goCloseConn(src)

	if ka := dp.keepAlivePeriod(); ka > 0 {
		for _, c := range []net.Conn{src, dst} {
			if c, ok := tcpConn(c); ok {
				c.SetKeepAlive(true)
				c.SetKeepAlivePeriod(ka)
			}
		}
	}

	errc := make(chan error, 2)
	go proxyCopy(errc, src, dst)
	go proxyCopy(errc, dst, src)
	<-errc
	<-errc
}

Inside proxyCopy we get the error from io.Copy and return it via the channel:

	_, err := io.Copy(dst, src)
	errc <- err

So HandleConn waits until both goroutines finish, but it drops the errors silently.
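
For comparison, a hypothetical variant of the tail of HandleConn that surfaces those errors instead of discarding them (sketch only, not the actual tcpproxy code; assumes a log import):

	errc := make(chan error, 2)
	go proxyCopy(errc, src, dst)
	go proxyCopy(errc, dst, src)
	for i := 0; i < 2; i++ {
		if err := <-errc; err != nil {
			log.Printf("tcpproxy: copy failed: %v", err)
		}
	}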

The most likely error is ENOBUFS - it is impossible to avoid this error on macOS when writing fast to a unix datagram socket. The write should be retried until it succeeds, or the packet should be dropped. But tcpproxy just fails the entire operation silently.

@balajiv113 What do you think?

nirs added a commit to nirs/socket_vmnet that referenced this issue Sep 30, 2024
According to launchd.plist(5), if ProcessType is left unspecified, the
system will apply light resource limits to the job, throttling its CPU
usage and I/O bandwidth. Turns out that these resource limits cause
lower and unpredictable performance.

Testing ProcessType Interactive increases iperf3 throughput from 1.32 to
3.36 Gbit/s (2.45 times faster).

Before:

    % iperf3-darwin -c 192.168.105.58
    Connecting to host 192.168.105.58, port 5201
    [  5] local 192.168.105.1 port 50333 connected to 192.168.105.58 port 5201
    [ ID] Interval           Transfer     Bitrate         Retr  Cwnd          RTT
    [  5]   0.00-1.00   sec   158 MBytes  1.32 Gbits/sec    0   2.91 MBytes   21ms
    [  5]   1.00-2.00   sec   163 MBytes  1.36 Gbits/sec    0   3.05 MBytes   19ms
    [  5]   2.00-3.00   sec   152 MBytes  1.28 Gbits/sec    0   3.15 MBytes   19ms
    [  5]   3.00-4.00   sec   167 MBytes  1.40 Gbits/sec    0   3.23 MBytes   21ms
    [  5]   4.00-5.00   sec   162 MBytes  1.36 Gbits/sec    0   3.29 MBytes   21ms
    [  5]   5.00-6.00   sec   151 MBytes  1.27 Gbits/sec    0   3.34 MBytes   21ms
    [  5]   6.00-7.00   sec   160 MBytes  1.34 Gbits/sec    0   3.36 MBytes   20ms
    [  5]   7.00-8.00   sec   152 MBytes  1.28 Gbits/sec    0   3.38 MBytes   22ms
    [  5]   8.00-9.00   sec   161 MBytes  1.35 Gbits/sec    0   3.38 MBytes   23ms
    [  5]   9.00-10.00  sec   152 MBytes  1.27 Gbits/sec    0   3.39 MBytes   21ms
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval           Transfer     Bitrate         Retr
    [  5]   0.00-10.00  sec  1.54 GBytes  1.32 Gbits/sec    0             sender
    [  5]   0.00-10.00  sec  1.54 GBytes  1.32 Gbits/sec                  receiver

After:

    % iperf3-darwin -c 192.168.105.58
    Connecting to host 192.168.105.58, port 5201
    [  5] local 192.168.105.1 port 50358 connected to 192.168.105.58 port 5201
    [ ID] Interval           Transfer     Bitrate         Retr  Cwnd          RTT
    [  5]   0.00-1.00   sec   431 MBytes  3.61 Gbits/sec    0   8.00 MBytes   9ms
    [  5]   1.00-2.00   sec   333 MBytes  2.79 Gbits/sec    0   8.00 MBytes   15ms
    [  5]   2.00-3.00   sec   371 MBytes  3.11 Gbits/sec    0   8.00 MBytes   10ms
    [  5]   3.00-4.00   sec   373 MBytes  3.12 Gbits/sec    0   8.00 MBytes   9ms
    [  5]   4.00-5.00   sec   415 MBytes  3.48 Gbits/sec    0   8.00 MBytes   9ms
    [  5]   5.00-6.00   sec   424 MBytes  3.55 Gbits/sec    0   8.00 MBytes   10ms
    [  5]   6.00-7.00   sec   423 MBytes  3.55 Gbits/sec    0   8.00 MBytes   10ms
    [  5]   7.00-8.00   sec   413 MBytes  3.46 Gbits/sec    0   8.00 MBytes   10ms
    [  5]   8.00-9.00   sec   418 MBytes  3.51 Gbits/sec    0   8.00 MBytes   9ms
    [  5]   9.00-10.00  sec   407 MBytes  3.41 Gbits/sec    0   8.00 MBytes   9ms
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval           Transfer     Bitrate         Retr
    [  5]   0.00-10.00  sec  3.91 GBytes  3.36 Gbits/sec    0             sender
    [  5]   0.00-10.01  sec  3.91 GBytes  3.36 Gbits/sec                  receiver

Testing with 2 VMs is much slower and tends to get stuck because of lima-vm#39.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
nirs added a commit to nirs/socket_vmnet that referenced this issue Sep 30, 2024
According to launchd.plist(5), if ProcessType is left unspecified, the
system will apply light resource limits to the job, throttling its CPU
usage and I/O bandwidth. Turns out that these resource limits cause
lower and unpredictable performance.

Testing ProcessType Interactive increases iperf3 throughput from 1.32 to
3.59 Gbit/s (2.71 times faster).

Before:

    % iperf3-darwin -c 192.168.105.58
    Connecting to host 192.168.105.58, port 5201
    [  5] local 192.168.105.1 port 50333 connected to 192.168.105.58 port 5201
    [ ID] Interval           Transfer     Bitrate         Retr  Cwnd          RTT
    [  5]   0.00-1.00   sec   158 MBytes  1.32 Gbits/sec    0   2.91 MBytes   21ms
    [  5]   1.00-2.00   sec   163 MBytes  1.36 Gbits/sec    0   3.05 MBytes   19ms
    [  5]   2.00-3.00   sec   152 MBytes  1.28 Gbits/sec    0   3.15 MBytes   19ms
    [  5]   3.00-4.00   sec   167 MBytes  1.40 Gbits/sec    0   3.23 MBytes   21ms
    [  5]   4.00-5.00   sec   162 MBytes  1.36 Gbits/sec    0   3.29 MBytes   21ms
    [  5]   5.00-6.00   sec   151 MBytes  1.27 Gbits/sec    0   3.34 MBytes   21ms
    [  5]   6.00-7.00   sec   160 MBytes  1.34 Gbits/sec    0   3.36 MBytes   20ms
    [  5]   7.00-8.00   sec   152 MBytes  1.28 Gbits/sec    0   3.38 MBytes   22ms
    [  5]   8.00-9.00   sec   161 MBytes  1.35 Gbits/sec    0   3.38 MBytes   23ms
    [  5]   9.00-10.00  sec   152 MBytes  1.27 Gbits/sec    0   3.39 MBytes   21ms
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval           Transfer     Bitrate         Retr
    [  5]   0.00-10.00  sec  1.54 GBytes  1.32 Gbits/sec    0             sender
    [  5]   0.00-10.00  sec  1.54 GBytes  1.32 Gbits/sec                  receiver

After:

    % iperf3-darwin -c 192.168.105.58
    Connecting to host 192.168.105.58, port 5201
    [  5] local 192.168.105.1 port 50415 connected to 192.168.105.58 port 5201
    [ ID] Interval           Transfer     Bitrate         Retr  Cwnd          RTT
    [  5]   0.00-1.00   sec   453 MBytes  3.80 Gbits/sec    0   8.00 MBytes   10ms
    [  5]   1.00-2.00   sec   426 MBytes  3.57 Gbits/sec    0   8.00 MBytes   9ms
    [  5]   2.00-3.00   sec   422 MBytes  3.54 Gbits/sec    0   8.00 MBytes   9ms
    [  5]   3.00-4.00   sec   405 MBytes  3.40 Gbits/sec    0   8.00 MBytes   9ms
    [  5]   4.00-5.00   sec   429 MBytes  3.60 Gbits/sec    0   8.00 MBytes   9ms
    [  5]   5.00-6.00   sec   433 MBytes  3.64 Gbits/sec    0   8.00 MBytes   9ms
    [  5]   6.00-7.00   sec   432 MBytes  3.62 Gbits/sec    0   8.00 MBytes   10ms
    [  5]   7.00-8.00   sec   432 MBytes  3.63 Gbits/sec    0   8.00 MBytes   9ms
    [  5]   8.00-9.00   sec   414 MBytes  3.47 Gbits/sec    0   8.00 MBytes   9ms
    [  5]   9.00-10.00  sec   433 MBytes  3.63 Gbits/sec    0   8.00 MBytes   9ms
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval           Transfer     Bitrate         Retr
    [  5]   0.00-10.00  sec  4.18 GBytes  3.59 Gbits/sec    0             sender
    [  5]   0.00-10.01  sec  4.18 GBytes  3.59 Gbits/sec                  receiver

Testing with 2 VMs is much slower and tends to get stuck because of lima-vm#39.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
nirs added a commit to nirs/lima that referenced this issue Oct 2, 2024
We used an external package (tcpproxy) for proxying between unix stream and
datagram sockets. This package cannot handle the ENOBUFS error, an expected
condition on BSD-based systems, and worse, it hides errors and stops
forwarding packets silently when a write to the vz socket fails with
ENOBUFS[1].

Fix the issues by replacing tcpproxy with our own implementation.

Fixes:

- Fix error handling if a write to the vz datagram socket fails with ENOBUFS.
  We retry the write and drop the packet if we could not write after a
  short timeout.

- Fix error handling if we could not read the packet header or body from the
  socket_vmnet stream socket. Previously we logged an error and
  continued, sending a corrupted packet to VZ.

- Fix error handling if writing a packet to the socket_vmnet stream socket
  returned after writing a partial packet. Now we write the complete
  packet after a short write. Previously this ended with 2 corrupted
  packets.

- Log an error if forwarding packets from vz to socket_vmnet or from
  socket_vmnet to vz failed.

Simplification:

- Use binary.Read() and binary.Write() to read and write qemu packet
  header.

New logs:

    % grep 'Dropping packet' ~/.lima/server/ha.stderr.log
    {"level":"debug","msg":"Dropping packet: write unixgram -\u003e: write: no buffer space available","time":"2024-10-02T04:01:39+03:00"}

[1] lima-vm/socket_vmnet#39

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
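
The qemu-style framing mentioned in the commit (a 4-byte big-endian length before each packet) can be read and written with binary.Read and binary.Write roughly like this (hedged sketch, not the actual lima code):

package qemuframe

import (
	"encoding/binary"
	"io"
)

// readPacket reads one length-prefixed packet from the socket_vmnet stream.
func readPacket(r io.Reader) ([]byte, error) {
	var size uint32
	if err := binary.Read(r, binary.BigEndian, &size); err != nil {
		return nil, err
	}
	buf := make([]byte, size)
	if _, err := io.ReadFull(r, buf); err != nil {
		return nil, err
	}
	return buf, nil
}

// writePacket writes the header and then the payload. A net.Conn's Write
// either completes the write or returns an error, so checking err is enough
// here; a raw file descriptor would need an explicit short-write loop.
func writePacket(w io.Writer, pkt []byte) error {
	if err := binary.Write(w, binary.BigEndian, uint32(len(pkt))); err != nil {
		return err
	}
	_, err := w.Write(pkt)
	return err
}
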
@balajiv113
Member

@nirs In gvisor-tap-vsock we also handled ENOBUFS recently:

https://github.com/containers/gvisor-tap-vsock/pull/370/files

This usually happens with datagram sockets; maybe we can simply create a custom connection that overrides
Write alone and retries on the ENOBUFS error.
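
A hedged sketch of that idea (hypothetical wrapper, not the gvisor-tap-vsock or lima code): embed the datagram connection and override only Write, retrying with a short sleep while the kernel reports ENOBUFS.

package enobufs

import (
	"errors"
	"net"
	"time"

	"golang.org/x/sys/unix"
)

// retryConn wraps a datagram connection and retries writes that fail with
// ENOBUFS, which macOS returns when the socket buffer is momentarily full.
type retryConn struct {
	net.Conn
	delay time.Duration // short sleep between retries
}

func (c *retryConn) Write(p []byte) (int, error) {
	for {
		n, err := c.Conn.Write(p)
		if err != nil && errors.Is(err, unix.ENOBUFS) {
			time.Sleep(c.delay)
			continue
		}
		return n, err
	}
}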

@nirs
Member

nirs commented Oct 2, 2024

@balajiv113 thanks, I'll send a lima fix.

nirs added a commit to nirs/lima that referenced this issue Oct 2, 2024
We used an external package (tcpproxy) for proxying between unix stream and
datagram sockets. This package cannot handle the ENOBUFS error, an expected
condition on BSD-based systems, and worse, it hides errors and stops
forwarding packets silently when a write to the vz socket fails with
ENOBUFS[1].

Fix the issues by replacing tcpproxy with a simpler and more direct
implementation that will be easier to maintain.

Fixes:

- Fix error handling if a write to the vz datagram socket fails with ENOBUFS.
  We retry the write until it succeeds. The same solution is used in
  gvisor-tap-vsock[2].

- Fix error handling if we could not read the packet header or body from the
  socket_vmnet stream socket. Previously we logged an error and
  continued, sending a corrupted packet to VZ.

- Fix error handling if writing a packet to the socket_vmnet stream socket
  returned after writing a partial packet. Now we handle short writes and
  write the complete packet. Previously this ended with 2 corrupted
  packets.

- Log an error if forwarding packets from vz to socket_vmnet or from
  socket_vmnet to vz failed.

Simplification:

- Use binary.Read() and binary.Write() to read and write qemu packet
  header.

[1] lima-vm/socket_vmnet#39
[2] containers/gvisor-tap-vsock#370

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
nirs added a commit to nirs/lima that referenced this issue Oct 2, 2024
We used an external package (tcpproxy) for proxying between unix stream and
datagram sockets. This package cannot handle the ENOBUFS error, an expected
condition on BSD-based systems, and worse, it hides errors and stops
forwarding packets silently when a write to the vz socket fails with
ENOBUFS[1].

Fix the issues by replacing tcpproxy with a simpler and more direct
implementation that will be easier to maintain.

Fixes:

- Fix error handling if a write to the vz datagram socket fails with ENOBUFS.
  We retry the write until it succeeds. The same solution is used in
  gvisor-tap-vsock[2].

- Fix error handling if we could not read the packet header or body from the
  socket_vmnet stream socket. Previously we logged an error and
  continued, sending a corrupted packet to VZ.

- Fix error handling if writing a packet to the socket_vmnet stream socket
  returned after writing a partial packet. Now we handle short writes and
  write the complete packet. Previously this ended with 2 corrupted
  packets.

- Log an error if forwarding packets from vz to socket_vmnet or from
  socket_vmnet to vz failed.

Simplification:

- Use binary.Read() and binary.Write() to read and write qemu packet
  header.

Visibility:

- Make QEMUPacketConn private since it is an implementation detail
  correct only for lima vz-socket_vmnet use case.

[1] lima-vm/socket_vmnet#39
[2] containers/gvisor-tap-vsock#370

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
nirs added a commit to nirs/lima that referenced this issue Oct 3, 2024
We used an external package (tcpproxy) for proxying between unix stream and
datagram sockets. This package cannot handle the ENOBUFS error, an expected
condition on BSD-based systems, and worse, it hides errors and stops
forwarding packets silently when a write to the vz socket fails with
ENOBUFS[1].

Fix the issues by replacing tcpproxy with a simpler and more direct
implementation that will be easier to maintain.

Fixes:

- Fix error handling if a write to the vz datagram socket fails with ENOBUFS.
  We retry the write until it succeeds, with a very short sleep between
  retries. A similar solution is used in gvisor-tap-vsock[2].

- Fix error handling if we could not read the packet header or body from the
  socket_vmnet stream socket. Previously we logged an error and
  continued, sending a corrupted packet to VZ.

- Fix error handling if writing a packet to the socket_vmnet stream socket
  returned after writing a partial packet. Now we handle short writes and
  write the complete packet. Previously this ended with 2 corrupted
  packets.

- Log an error if forwarding packets from vz to socket_vmnet or from
  socket_vmnet to vz failed.

Simplification:

- Use binary.Read() and binary.Write() to read and write qemu packet
  header.

Visibility:

- Make QEMUPacketConn private since it is an implementation detail
  correct only for lima vz-socket_vmnet use case.

[1] lima-vm/socket_vmnet#39
[2] containers/gvisor-tap-vsock#370

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
nirs added a commit to nirs/lima that referenced this issue Oct 3, 2024
We used an external package (tcpproxy) for proxying between unix stream and
datagram sockets. This package cannot handle the ENOBUFS error, an expected
condition on BSD-based systems, and worse, it hides errors and stops
forwarding packets silently when a write to the vz socket fails with
ENOBUFS[1].

Fix the issues by replacing tcpproxy with a simpler and more direct
implementation that will be easier to maintain.

Fixes:

- Fix error handling if a write to the vz datagram socket fails with ENOBUFS.
  We retry the write until it succeeds, with a very short sleep between
  retries. A similar solution is used in gvisor-tap-vsock[2].

- Fix error handling if we could not read the packet header or body from the
  socket_vmnet stream socket. Previously we logged an error and continued
  to send corrupted packets to vz from the point of the failure.

- Fix error handling if writing a packet to the socket_vmnet stream socket
  returned after writing a partial packet. Now we handle short writes and
  write the complete packet. Previously this would break the protocol and
  continue sending corrupted packets from the point of the failure.

- Log an error if forwarding packets from vz to socket_vmnet or from
  socket_vmnet to vz failed.

Simplification:

- Use binary.Read() and binary.Write() to read and write qemu packet
  header.

Visibility:

- Make QEMUPacketConn private since it is an implementation detail of vz
  when using socket_vmnet.

[1] lima-vm/socket_vmnet#39
[2] containers/gvisor-tap-vsock#370

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
@nirs
Member

nirs commented Oct 3, 2024

The ENOBUFS errors are easily reproducible and recoverable with lima-vm/lima#2680.

@sheelchand can you test with the lima fix?

nirs added a commit to nirs/lima that referenced this issue Oct 10, 2024
We used an external package (tcpproxy) for proxying between unix stream and
datagram sockets. This package cannot handle the ENOBUFS error, an expected
condition on BSD-based systems, and worse, it hides errors and stops
forwarding packets silently when a write to the vz socket fails with
ENOBUFS[1].

Fix the issues by replacing tcpproxy with a simpler and more direct
implementation that will be easier to maintain.

Fixes:

- Fix error handling if a write to the vz datagram socket fails with ENOBUFS.
  We retry the write until it succeeds, with a very short sleep between
  retries. A similar solution is used in gvisor-tap-vsock[2].

- Fix error handling if we could not read the packet header or body from the
  socket_vmnet stream socket. Previously we logged an error and continued
  to send corrupted packets to vz from the point of the failure.

- Fix error handling if writing a packet to the socket_vmnet stream socket
  returned after writing a partial packet. Now we handle short writes and
  write the complete packet. Previously this would break the protocol and
  continue sending corrupted packets from the point of the failure.

- Log an error if forwarding packets from vz to socket_vmnet or from
  socket_vmnet to vz failed.

Simplification:

- Use binary.Read() and binary.Write() to read and write qemu packet
  header.

Visibility:

- Make QEMUPacketConn private since it is an implementation detail of vz
  when using socket_vmnet.

Testing:

- Add a packet forwarding test covering the happy path in 10
  milliseconds.

[1] lima-vm/socket_vmnet#39
[2] containers/gvisor-tap-vsock#370

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
@nirs
Member

nirs commented Oct 11, 2024

The lima issue is fixed now. We can close this when a lima version including the fix is released.

@nirs
Member

nirs commented Nov 9, 2024

@AkihiroSuda, @jandubois: lima >= 1.0.0 fixed the issue, we can close this.
