Missed upstream disconnect leading to 503 UC #6190
@alyssawilk do you have any ideas about this? Thanks! |
Ok, so reading this, the first request was proxied successfully, the upstream sent a FIN, but Envoy reused the connection 4s later for a subsequent request? There's always a potential race where a FIN may not be received, but a 4s pause is not a race - that's clearly crummy. I thought that Envoy listens for data between H1 requests and should have detected that FIN, but maybe I'm misremembering. @lizan do you remember from your work on #2871? If not I'll dig around. |
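For readers less familiar with the mechanism being referred to: between requests, a pooled HTTP/1 connection can only notice the upstream's FIN if it is actually reading, and the signal is a zero-byte read (EOF). A minimal standalone sketch of that check, in plain POSIX terms rather than Envoy's event-driven code:

```cpp
#include <cerrno>
#include <sys/socket.h>
#include <sys/types.h>

// Generic illustration (not Envoy code): decide whether an idle, pooled
// HTTP/1 upstream socket is still safe to reuse. A zero-byte read means the
// peer already sent a FIN, so the connection must be discarded; reusing it
// gets the next request answered with a RST, which surfaces as "503 UC".
bool idleConnectionReusable(int fd) {
  char byte;
  // MSG_PEEK leaves any buffered data in place; MSG_DONTWAIT avoids blocking.
  const ssize_t rc = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);
  if (rc == 0) {
    return false; // EOF: the upstream half-closed while the connection was idle.
  }
  if (rc < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
    return true; // Nothing pending and no FIN yet: still reusable.
  }
  return false; // Unexpected data or a socket error: don't reuse either.
}
```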
I would like to clarify that we're talking minutes and not seconds. That being said I was able to reproduce the issue with trace logs. From C425 (good)
From C427 (bad)
It seems that libevent correctly received socket event: 3 (read ready and write ready) in the good case but only write ready in the bad one. I'm using Photon OS; maybe the issue has something to do with its TCP stack. |
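To help decode those trace lines: the "socket event" value is a bitmask. Assuming it mirrors Envoy's Event::FileReadyType (Read = 0x1, Write = 0x2, Closed = 0x4; worth confirming against file_event.h in the version in use), 3 means read-and-write ready, while a write-only wakeup would be 2, i.e. the FIN never surfaced as a read event. A small sketch of the decoding:

```cpp
#include <cstdint>
#include <cstdio>

// Assumed to mirror Envoy's Event::FileReadyType bitmask; check
// include/envoy/event/file_event.h in the running version for the real values.
enum FileReadyType : uint32_t { Read = 0x1, Write = 0x2, Closed = 0x4 };

void decodeSocketEvent(uint32_t events) {
  std::printf("socket event: %u =>%s%s%s\n", static_cast<unsigned>(events),
              (events & Read) ? " read-ready" : "",
              (events & Write) ? " write-ready" : "",
              (events & Closed) ? " closed" : "");
}

int main() {
  decodeSocketEvent(3); // the "good" connection: read + write ready
  decodeSocketEvent(2); // the "bad" connection: write ready only, no read event for the FIN
}
```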
Actually, now that I've looked closer at C425 (good) and C427 (bad), I noticed another difference. Attaching logs for both upstreams.
It seems that read was disabled, hence no FIN notification. |
I'm not familiar with the Envoy codebase, but here's an idea - why not call readDisable(false) unconditionally when "moving to ready"?
FYI the following patch seems to work fine.
|
Hey @mattklein123, based on #2715 I think we should detect that disconnect, and the supplied patch working makes me think the H1 stream is being returned to the conn pool disabled (probably due to flow control...). The hopefully quick question I have for you is whether onMessageComplete is the right place to unwind those - I'd lean towards unwinding where the connection is returned to the conn pool for reuse (where it's moved to the ready_clients_ list?) just in case there's lag between the message being complete and the prior stream detangling itself from data callbacks. Thoughts? |
The patch that I proposed works on a large-scale setup that used to reproduce the issue; see C5683.txt
Ideally I'd like to see "readDisable: enabled=false disable=false" before "moving to ready", but I was in a hurry to patch the environment and didn't have time to explore that option. The good news is that I no longer get "upstream connect error or disconnect/reset before headers" : ) |
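For context on that trace line, here is a toy with assumed semantics (the real ConnectionImpl keeps a nesting count and its log format varies by version): "enabled" is the read state before the call and "disable" is the requested action, so "enabled=false disable=false" right before "moving to ready" would show a read-disabled connection being re-enabled as it goes back into the pool:

```cpp
#include <cstdio>

// Toy model only; Envoy's real readDisable() nests and logs differently
// depending on the version.
struct ToyConnection {
  bool read_enabled = true;
  void readDisable(bool disable) {
    std::printf("readDisable: enabled=%s disable=%s\n",
                read_enabled ? "true" : "false", disable ? "true" : "false");
    read_enabled = !disable;
  }
};

int main() {
  ToyConnection c;
  c.readDisable(true);  // flow control pauses reads mid-response
  c.readDisable(false); // prints "enabled=false disable=false": the desired re-enable
}
```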
Now that we've gotten to the bottom of this issue, I think it's a bug and not a question. |
@ppadevski are you willing to propose a patch for this? Thanks! |
@ppadevski @alyssawilk how is readDisable(true) getting called in the first place? From my quick read of the code I think this must be due to flow control? I think the router is telling the upstream connection to stop reading, but then the upstream connection finishes, right? (Agree this is a bug.)
@alyssawilk I tend to agree with you here that we should be handling this in the calling code. We are already kind of doing this here https://github.com/envoyproxy/envoy/blob/master/source/common/http/http1/codec_impl.cc#L677, but I think that code now seems to be in the wrong place. I think we should replace that code with an ASSERT that reads are already enabled, and enable reads in the owning codec client code that wants to reuse the connection? |
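A self-contained toy of that proposal (the names are illustrative, not Envoy's actual classes): the pool unwinds any leftover read-disable when it parks a connection for reuse, and the codec merely asserts the invariant when a stream is created on a reused connection:

```cpp
#include <cassert>
#include <vector>

// Toy model of the proposal, not Envoy code.
struct Connection {
  int read_disable_count = 0; // readDisable() nests, so track a count
  void readDisable(bool disable) { read_disable_count += disable ? 1 : -1; }
  bool readEnabled() const { return read_disable_count == 0; }
};

struct Codec {
  Connection& conn;
  void newStream() {
    // Proposed invariant: the owner must have re-enabled reads already.
    assert(conn.readEnabled() && "reads must be enabled before reuse");
  }
};

struct ConnPool {
  std::vector<Connection*> ready_clients;
  void returnToPool(Connection& conn) {
    // Re-enable reads here, so a FIN arriving while the connection is idle
    // is observed and the dead connection is never handed out again.
    while (!conn.readEnabled()) {
      conn.readDisable(false);
    }
    ready_clients.push_back(&conn);
  }
};

int main() {
  Connection c;
  c.readDisable(true); // e.g. flow control paused reads mid-response
  ConnPool pool;
  pool.returnToPool(c); // the pool unwinds the read-disable
  Codec codec{c};
  codec.newStream();    // the invariant holds when the connection is reused
}
```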
Catching up here... Yeah, I agree that this is a bug. #2871 tried to cover the race in the case where the upstream sends a FIN right after a completed response. IIUC, per this comment we're not guaranteed to get notified of the FIN if read is disabled. Moving this part of the code to here sounds right to me. |
I will try to fix this before we ship 1.10.0 |
@mattklein123 Hello. Just want to double check, is the fix now in 1.10.0? Reading through the release notes does not indicate so: https://www.envoyproxy.io/docs/envoy/v1.10.0/intro/version_history thank you! |
Does anyone know some of the details surrounding what causes this issue to trigger? I have an Istio environment with 200+ services using identical foundational setups and this bug is only hitting a few of them -- one more than the others (which happens to serve slower responses if that's relevant; something like 30+ seconds). These are node services with a 5s keep-alive which I assume could cause it to hit a little more frequently... but again, this is only really hitting a few services out of the many. Interestingly, this also doesn't seem to get caught by 503 retries... using either |
Previously we were doing this when we create a new stream, but on a reused connection this can lead to us missing an upstream disconnection when the connection is placed back in the pool. Fixes #6190 Signed-off-by: Matt Klein <mklein@lyft.com>
I was struggling with half-open connections where my Node app would send a TCP FIN but Envoy would still proxy the next GET request to the server, which would then trigger the Node app to send back a TCP RST. This was not fixed for me in 1.10, but I just pulled the latest envoy-alpine-dev and the issues have gone away, hopefully because Envoy, after receiving the FIN/ACK from the Node.js app, is removing the connection from the connection pool. From other threads it seems like I need to make my Node app (Koa plus Node v12) send a "Connection: close" header so that it behaves better; I'll have to look into that more. If you're hitting this and can run on the latest dev code, you might want to give this a try. |
…#6578) Previously we were doing this when we create a new stream, but on a reused connection this can lead to us missing an upstream disconnection when the connection is placed back in the pool. Fixes envoyproxy#6190 Signed-off-by: Matt Klein <mklein@lyft.com>
Hi,
I've been chasing a certain issue that is very difficult to reproduce. I'm using Envoy v1.8.0 and have an unencrypted HTTP/1.1 local upstream, an HTTP/2 downstream, and use the HTTP connection manager. The cluster configuration is
and the route match rule is
{"match":{"path":"/websso"},"route":{"cluster":"/websso","timeout":"28800s","use_websocket":true,"idle_timeout":"28800s"}}
My issue is that from time to time I get these "503 UC"s, e.g.
The packet capture has this to say for request/response 1.
and the following for request/response 2.
It seems that Envoy didn't reply with a FIN and instead pooled the connection and tried to reuse it 4 minutes later, which ended badly.
The envoy log contains the following for that connection.
A good connection that happened around that time looks like
The upstream server has a 1-minute idle timeout, and it seems that for C714 and many other connections Envoy properly received and processed the disconnect, but not for C715.
Have you heard of the above issue before, and is there any remedy for it? Is it possible to retry once for that particular case only, and if so, how?