
Sporadic 503 responses from Quarkus MP application #4572

Closed
enbohm opened this issue Oct 15, 2019 · 30 comments
Labels: kind/bug, triage/out-of-date, triage/regression

Comments


enbohm commented Oct 15, 2019

Describe the bug
After upgrading from 0.22.0 to 0.23.x and/or 0.24.0 we see some HTTP "503 Service Unavailable" responses when calling any of our Quarkus endpoints. Even the /openapi and /health endpoints seem to fail from time to time. After a few retries it works again. This seems to happen only after the application has been idle for some time (> 5 min or so), but the behavior is very sporadic. For instance, our load test seems to run just fine except that the first few requests might fail.

We currently have four different services (using, among other things, MicroProfile) and have noticed that as soon as we downgrade to 0.22.0 everything is back to normal (i.e. the 503 never occurs). Speculating, but could it be related to #3506 ("Use Vert.x as the basis for all HTTP")?

Expected behavior
Requests to any Quarkus endpoint should not respond with a 503 code, even if the service has been idle for a while.

Actual behavior
Sporadic 503 Service Unavailable responses when the Quarkus app has been idle for a short time.

To Reproduce
Since I can't really create a reproducer, I hope someone might know if this is a known issue (or if someone else has reported the same behavior). What we can see is that it only occurs with Quarkus > 0.22.0.
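A minimal sketch of the kind of probe that shows the behavior (hypothetical; assuming the service is reachable at http://localhost:8080, the Quarkus default, and the health endpoint is enabled):

    # Hypothetical check: let the app sit idle, then see whether the first
    # request after the idle period comes back as 503 instead of 200.
    sleep 600   # idle for ~10 minutes
    for i in 1 2 3 4 5; do
      code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/health)
      echo "attempt $i -> HTTP $code"
    done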

Environment (please complete the following information):

  • Output of uname -a or ver:
  • Output of java -version:
    openjdk version "11" 2018-09-25
    OpenJDK Runtime Environment 18.9 (build 11+28)
    OpenJDK 64-Bit Server VM 18.9 (build 11+28, mixed mode)
  • GraalVM version (if different from Java):
    N/A
  • Quarkus version or git rev: 0.23.2 and 0.24.0
enbohm added the kind/bug label Oct 15, 2019

gsmet commented Oct 15, 2019

@stuartwdouglas this looks like an important regression. Not sure how we will figure out where it comes from, though...

@enbohm do you have a proxy in front of your Quarkus app? Or is it Quarkus itself serving the 503?


enbohm commented Oct 15, 2019

@gsmet I have a proxy in front (Envoy 1.9.0), which reports the 503. I can't see anything in the logs from the Quarkus apps, so nothing seems to crash.

@cescoffier

@enbohm did you try with the quarkus-undertow dependency?


gsmet commented Oct 15, 2019

@enbohm and another question: could you paste the list of dependencies you have when starting Quarkus?


vietj commented Oct 15, 2019

@enbohm it is not clear whether the proxy replies with the 503 because it cannot connect, or whether it is Quarkus itself that returns the 503. Can you elaborate?

@stuartwdouglas

My guess would be that connections are timing out, and that the proxy is then using dead connections. It looks like the default Netty timeout is 30s (as far as I can tell from reading the code).


vietj commented Oct 15, 2019

@stuartwdouglas the Vert.x HTTP server does not configure an idle timeout by default unless it is set in HttpServerOptions via idleTimeout. What are you referring to?
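For reference, a minimal sketch (plain Vert.x, not Quarkus code) of where such an idle timeout would be configured on HttpServerOptions; the 30-second value is an arbitrary example:

    import io.vertx.core.Vertx;
    import io.vertx.core.http.HttpServerOptions;

    public class IdleTimeoutSketch {
        public static void main(String[] args) {
            Vertx vertx = Vertx.vertx();
            // The idle timeout is 0 (disabled) unless explicitly set.
            HttpServerOptions options = new HttpServerOptions()
                    .setIdleTimeout(30); // seconds; idle connections are closed after this
            vertx.createHttpServer(options)
                    .requestHandler(req -> req.response().end("ok"))
                    .listen(8080);
        }
    }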

@stuartwdouglas

Actually, I misread the code.


enbohm commented Oct 15, 2019

@enbohm did you try with the quarkus-undertow dependency?

@cescoffier No, but I will try it out and see the result; I will get back shortly.

@enbohm and another question: could you paste the list of dependencies you have when starting Quarkus?
@gsmet here are the dependencies I use (a pom.xml sketch follows the list):

quarkus-resteasy-jsonb
quarkus-arc
quarkus-smallrye-health
quarkus-smallrye-openapi
quarkus-smallrye-metrics
quarkus-smallrye-opentracing
quarkus-smallrye-fault-tolerance
quarkus-hibernate-orm
quarkus-jdbc-mariadb
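A sketch of how a couple of these look in the pom.xml (assuming the usual io.quarkus group id, with versions managed by the Quarkus BOM); the rest follow the same pattern:

    <dependency>
        <groupId>io.quarkus</groupId>
        <artifactId>quarkus-resteasy-jsonb</artifactId>
    </dependency>
    <dependency>
        <groupId>io.quarkus</groupId>
        <artifactId>quarkus-smallrye-health</artifactId>
    </dependency>
    <!-- ...and likewise for the remaining extensions listed above -->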


vietj commented Oct 15, 2019

@enbohm can you also try to investigate what is going on between the proxy and the server (e.g. using tcpdump or Wireshark)?
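As a rough sketch (interface and port are placeholders; 8080 is the Quarkus default HTTP port), such a capture could look like:

    # Capture the proxy <-> Quarkus traffic, then open the file in Wireshark
    # and look for RST/FIN packets around the requests that fail with 503.
    tcpdump -i any -s 0 -w envoy-quarkus.pcap tcp port 8080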


enbohm commented Oct 15, 2019

@enbohm it is not clear whether the proxy replies with the 503 because it cannot connect, or whether it is Quarkus itself that returns the 503. Can you elaborate?

@vietj Currently, we can see that requests going through our Envoy proxy report a 503 from time to time when calling the Quarkus 0.24.0 application. We have quite a few other services (Helidon, Node.js, Python) running behind the Envoy proxy that work just fine. When downgrading to Quarkus 0.22.0, everything works fine. We also have a Swagger UI that directly tries to access the openapi endpoint, which sometimes fails with Quarkus 0.24.0 (but never with 0.22.0).

I will try to reproduce the failure in standalone mode, i.e. just a simple Quarkus app without any proxy in front. Will get back later on.


enbohm commented Oct 15, 2019

@cescoffier I just tried adding quarkus-undertow with 0.24.0, but unfortunately, after a few requests that worked, waiting ca. 10 minutes and then calling the /openapi endpoint resulted in a 503...

It is as if the HTTP handler is just gone for a short while. No logs or anything from the Quarkus app. After a few more requests it starts serving content again as expected.


vietj commented Oct 15, 2019

@enbohm it is not clear whether using quarkus-undertow behaves differently from the default Vert.x-based solution. Can you clarify?

@stuartwdouglas

Undertow used to time out connections, so I guess one of the changes here is that this no longer happens.


enbohm commented Oct 15, 2019

@enbohm it is not clear whether using quarkus-undertow behaves differently from the default Vert.x-based solution. Can you clarify?

@vietj I added quarkus-undertow as a dependency, nothing else. According to https://quarkus.io/blog/quarkus-0-24.0-released/:
"If Undertow is present, RESTEasy will detect this and will fall back to running as a Servlet."

but that made no difference whatsoever - still a random 503 after a while. Should I have made any other configuration changes, or removed any other dependency such as quarkus-resteasy-jsonb?

@stuartwdouglas

No, it won't make any difference here; I was referring to Undertow as it was configured before the change to Vert.x.


enbohm commented Oct 16, 2019

@stuartwdouglas thanks for clarifying. Are there any configuration changes I could try, such as changing timeouts in Vert.x, the eviction time of stale connections, thread pool sizes, etc., that could be a step forward in understanding what is going on?


vietj commented Oct 16, 2019

It would be good to understand what is going on without the proxy in front.


enbohm commented Oct 16, 2019

@vietj yes, I will try without the proxy in front and see the outcome. It is still weird that the proxy (Envoy) should fail here, since all the other services we have (Node.js, Helidon and Python) work 100% and Quarkus 0.22.0 works like a charm, but of course that could be the case.

I have also attached the trace error (from the Jaeger UI). It is not very informative, but one can at least see the response size and status code. As mentioned before, repeating the request makes everything work as expected (i.e. a 200 response).

[Jaeger screenshot: service-unavailable]


enbohm commented Oct 16, 2019

I also noticed that a new version of Quarkus is out (0.25.0) - I will try that one as well and see if the error occurs.


enbohm commented Oct 16, 2019

I also noticed that a new version of Quarkus is out (0.25.0) - I will try that one as well and see if the error occurs.

Unfortunately, same behavior with Quarkus 0.25.0 - a 503 occurred after a few minutes.


enbohm commented Oct 17, 2019

@vietj I've now tried to reproduce this in a standalone Quarkus app, i.e. with no Envoy proxy in front. I have not been able to trigger the error (I only tried for ca. 30 min, but the error should have occurred by then). This could mean that something does not work as expected between Envoy and Quarkus > 0.22. I also tried all kinds of config in Envoy to drop stale connections, timeouts, stream timeouts etc., but with no luck :(

However, as I mentioned, we have several Quarkus 0.22 applications, Oracle Helidon MP apps and Node.js apps behind Envoy which do not have this issue at all. So there must be something that makes Envoy fail when calling the newer Quarkus apps. Since having an Envoy proxy in front of your services is quite common, it would be nice to see if any other users with a similar setup suffer from this (Istio users maybe?).

I will try to put together a very simple Docker Compose reproducer with Envoy and a Quarkus 0.25 app within a day or two.


enbohm commented Oct 18, 2019

I've now created a repo which contains the setup for a reproducer: https://github.com/enbohm/quarkus-envoy-503/
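Not the exact contents of that repo, but a rough sketch of this kind of docker-compose setup (image tag, ports and file names are hypothetical):

    version: "3"
    services:
      quarkus-app:
        build: .                  # the Quarkus application
        expose:
          - "8080"
      envoy:
        image: envoyproxy/envoy:v1.9.0
        ports:
          - "10000:10000"         # edge listener
        volumes:
          - ./envoy.yaml:/etc/envoy/envoy.yaml   # routes to quarkus-app:8080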

However, I am not able to provoke the error using Quarkus only (i.e. without a front proxy), and I more and more suspect that it has something to do with Envoy. There are similar issues reported by Istio users (see istio/istio#13205), and someone pointed out that envoyproxy/envoy#6190 could be the root cause. But what I can't understand is why it worked with Quarkus 0.22.0 (and works with several other applications such as Node.js, Oracle Helidon and Python apps); it is probably something that has changed with the new Vert.x HTTP layer.

I fully understand if this issue is closed, BUT it could also be good to get a deeper understanding of why Envoy and the latest Quarkus are not 100% compatible, since this is a very common setup. Unfortunately, we can't easily switch away from Envoy, so if I can't figure anything out we might need to pause the development of Quarkus applications, which is a bit sad :(


vietj commented Oct 19, 2019

@enbohm that is great investigation you are doing; I'm not familiar enough with such proxies and their behaviour to have insights into what is going on. Is there a way to dump the TCP traffic between the containers so we can look at it with Wireshark?


enbohm commented Oct 21, 2019

@vietj will investigate and try to get a TCP dump if possible


enbohm commented Nov 5, 2019

Reading more about similar issues in Istio (among others, istio/istio#13848), this could be related to the socket timeout settings in Quarkus. I found a useful write-up about sporadic 503 UC (upstream closed) errors at https://karlstoney.com/2019/05/31/istio-503s-ucs-and-tcp-fun-times/index.html

The conclusion there was that the applications were using a low socket timeout, e.g. Node.js 5 seconds, Java/Tomcat 20s. Setting a higher socket keep-alive seems to do the trick, and I'm now trying this with Quarkus 0.28.1. Currently I have these settings:

quarkus.vertx.eventbus.tcp-keep-alive=true
quarkus.vertx.eventbus.connect-timeout=PT6M

but I can still see a few 503 upstream-disconnect errors in Envoy. I'm trying to work out what these settings were before the Vert.x introduction in 0.23 and then apply them to Quarkus 0.28.1 to see if I can get back to a stable state.


enbohm commented Nov 11, 2019

Using Envoy and Quarkus is now working for me. I tried various settings in Quarkus such as connect timeout and idle timeout, but could still see that a few requests failed (= 503). Finally, I managed to effectively disable keep-alive in Envoy for all upstream connections towards the Quarkus applications by setting max_requests_per_connection to 1 (see https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/cds.proto.html?highlight=keepalive).

I am not sure if there are any problems in the Quarkus/Vert.x layer when it comes to connect and idle timeouts, since I don't have to disable keep-alive towards our other applications written in Node.js and Python. Anyhow, I hope this can be of guidance to others facing similar issues.
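For reference, a sketch of where that setting sits in a v2-API Envoy cluster definition (cluster name, address and timeout are placeholders):

    clusters:
      - name: quarkus_service
        connect_timeout: 5s
        type: STRICT_DNS
        lb_policy: ROUND_ROBIN
        # Effectively disables upstream keep-alive: each connection serves a single request.
        max_requests_per_connection: 1
        hosts:
          - socket_address:
              address: quarkus-app
              port_value: 8080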


gsmet commented Nov 18, 2019

@stuartwdouglas do you think there is any action left on our part?

@stuartwdouglas

I would like to understand why this was happening. We now have an HTTP idle-timeout config option that can be used to control when connections time out; I think if this is set to a sufficiently large value it should be enough to stop this occurring. The default is now 30m (running with no timeout at all can result in running out of open sockets if there are lots of clients with long-lived connections).
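In application.properties that looks something like the following (the 30M value just spells out the mentioned default):

    # Close HTTP connections that have been idle longer than this.
    quarkus.http.idle-timeout=30M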

@stuartwdouglas

I think this can be closed.

stuartwdouglas added the triage/out-of-date label Jul 8, 2020