
Sporadic 503 responses from Quarkus MP application #4572

Closed
enbohm opened this issue Oct 15, 2019 · 30 comments
Labels: kind/bug, triage/out-of-date, triage/regression

Comments


enbohm commented Oct 15, 2019

Describe the bug
After upgrading from 0.22.0 to 0.23.x and/or 0.24.0 we see some HTTP "503 Service Unavailable" responses when calling any of our Quarkus endpoints. Even the /openapi and /health endpoints seem to fail from time to time. After a few retries it works again. This seems to happen only after the application has been idle for some time (> 5 min or so), but the behavior is very sporadic. For instance, our load test seems to run just fine except that the first few requests might fail.

We currently have four different services (using, among other things, MicroProfile) and have noticed that as soon as we downgrade to 0.22.0 everything is back to normal (i.e. the 503 never occurs). Speculating, but could it be related to #3506 ("Use Vert.x as the basis for all HTTP")?

Expected behavior
Requests to any Quarkus endpoint should not respond with a 503 code, even if the service has been idle for a while.

Actual behavior
Sporadic 503 Service Unavailable responses when the Quarkus app has been idle for a short time.

To Reproduce
Since I can't really create a reproducer, I hope someone might know if this is a known issue (or if someone else has reported the same behavior). What we can see is that it only occurs with Quarkus > 0.22.0.
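A minimal sketch of the kind of probe that shows the behavior (hypothetical; assuming the service is reachable at http://localhost:8080, the Quarkus default, and the health endpoint is enabled):

    # Hypothetical check: let the app sit idle, then see whether the first
    # request after the idle period comes back as 503 instead of 200.
    sleep 600   # idle for ~10 minutes
    for i in 1 2 3 4 5; do
      code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/health)
      echo "attempt $i -> HTTP $code"
    done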

Environment (please complete the following information):

  • Output of uname -a or ver:
  • Output of java -version:
    openjdk version "11" 2018-09-25
    OpenJDK Runtime Environment 18.9 (build 11+28)
    OpenJDK 64-Bit Server VM 18.9 (build 11+28, mixed mode)
  • GraalVM version (if different from Java):
    N/A
  • Quarkus version or git rev: 0.23.2 and 0.24.0
enbohm added the kind/bug label Oct 15, 2019

gsmet commented Oct 15, 2019

@stuartwdouglas this looks like an important regression. Not sure how we will figure out where it comes from, though...

@enbohm do you have a proxy in front of your Quarkus app? Or is it Quarkus itself serving the 503?


enbohm commented Oct 15, 2019

@gsmet I have a proxy in front (Envoy 1.9.0), which reports the 503. I can't see anything in the logs from the Quarkus apps, so nothing seems to crash.

@cescoffier

@enbohm did you try with the quarkus-undertow dependency?


gsmet commented Oct 15, 2019

@enbohm and another question: could you paste the list of dependencies you have when starting Quarkus?


vietj commented Oct 15, 2019

@enbohm it is not clear whether the proxy replies with the 503 because it cannot connect, or whether it is Quarkus itself that returns the 503. Can you elaborate?

@stuartwdouglas

My guess would be that connections are timing out, and that the proxy is then using dead connections. It looks like the default Netty timeout is 30s (as far as I can tell from reading the code).


vietj commented Oct 15, 2019

@stuartwdouglas the Vert.x HTTP server does not configure an idle timeout by default unless it is set in HttpServerOptions via idleTimeout. What are you referring to?
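For reference, a minimal sketch (plain Vert.x, not Quarkus code) of where such an idle timeout would be configured on HttpServerOptions; the 30-second value is an arbitrary example:

    import io.vertx.core.Vertx;
    import io.vertx.core.http.HttpServerOptions;

    public class IdleTimeoutSketch {
        public static void main(String[] args) {
            Vertx vertx = Vertx.vertx();
            // The idle timeout is 0 (disabled) unless explicitly set.
            HttpServerOptions options = new HttpServerOptions()
                    .setIdleTimeout(30); // seconds; idle connections are closed after this
            vertx.createHttpServer(options)
                    .requestHandler(req -> req.response().end("ok"))
                    .listen(8080);
        }
    }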

@stuartwdouglas

Actually, I misread the code.


enbohm commented Oct 15, 2019

@enbohm did you try with the quarkus-undertow dependency?

@cescoffier No, but I will try it out and see the result; I will get back shortly.

@enbohm and another question: could you paste the list of dependencies you have when starting Quarkus?
@gsmet here are the dependencies I use (a pom.xml sketch follows the list):

quarkus-resteasy-jsonb
quarkus-arc
quarkus-smallrye-health
quarkus-smallrye-openapi
quarkus-smallrye-metrics
quarkus-smallrye-opentracing
quarkus-smallrye-fault-tolerance
quarkus-hibernate-orm
quarkus-jdbc-mariadb
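A sketch of how a couple of these look in the pom.xml (assuming the usual io.quarkus group id, with versions managed by the Quarkus BOM); the rest follow the same pattern:

    <dependency>
        <groupId>io.quarkus</groupId>
        <artifactId>quarkus-resteasy-jsonb</artifactId>
    </dependency>
    <dependency>
        <groupId>io.quarkus</groupId>
        <artifactId>quarkus-smallrye-health</artifactId>
    </dependency>
    <!-- ...and likewise for the remaining extensions listed above -->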


vietj commented Oct 15, 2019

@enbohm can you also try to investigate what is going on between the proxy and the server (e.g. using tcpdump or Wireshark)?
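As a rough sketch (interface and port are placeholders; 8080 is the Quarkus default HTTP port), such a capture could look like:

    # Capture the proxy <-> Quarkus traffic, then open the file in Wireshark
    # and look for RST/FIN packets around the requests that fail with 503.
    tcpdump -i any -s 0 -w envoy-quarkus.pcap tcp port 8080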


enbohm commented Oct 15, 2019

@enbohm it is not clear whether the proxy replies with the 503 because it cannot connect, or whether it is Quarkus itself that returns the 503. Can you elaborate?

@vietj Currently, we can see that requests going through our Envoy proxy report a 503 from time to time when calling the Quarkus 0.24.0 application. We have quite a few other services (Helidon, Node.js, Python) running behind the Envoy proxy that work just fine. When downgrading to Quarkus 0.22.0, everything works fine. We also have a Swagger UI that directly tries to access the openapi endpoint, which sometimes fails with Quarkus 0.24.0 (but never with 0.22.0).

I will try to reproduce the failure in standalone mode, i.e. just a simple Quarkus app without any proxy in front. Will get back later on.


enbohm commented Oct 15, 2019

@cescoffier I just tried adding quarkus-undertow with 0.24.0, but unfortunately, after a few requests that worked, waiting ca. 10 minutes and then calling the /openapi endpoint resulted in a 503...

It is as if the HTTP handler is just gone for a short while. No logs or anything from the Quarkus app. After a few more requests it starts serving content again as expected.


vietj commented Oct 15, 2019

@enbohm it is not clear whether using quarkus-undertow behaves differently from the default Vert.x-based solution. Can you clarify?

@stuartwdouglas

Undertow used to time out connections, so I guess one of the changes here is that this no longer happens.


enbohm commented Oct 15, 2019

@enbohm it is not clear whether using quarkus-undertow behaves differently from the default Vert.x-based solution. Can you clarify?

@vietj I added quarkus-undertow as a dependency, nothing else. According to https://quarkus.io/blog/quarkus-0-24.0-released/:
"If Undertow is present, RESTEasy will detect this and will fall back to running as a Servlet."

but that made no difference whatsoever - still a random 503 after a while. Should I have made any other configuration changes, or removed any other dependency such as quarkus-resteasy-jsonb?

@stuartwdouglas

No, it won't make any difference here; I was referring to Undertow as it was configured before the change to Vert.x.


enbohm commented Oct 16, 2019

@stuartwdouglas thanks for clarifying. Are there any configuration changes I could try, such as changing timeouts in Vert.x, the eviction time of stale connections, thread pool sizes, etc., that could be a step forward in understanding what is going on?


vietj commented Oct 16, 2019

It would be good to understand what is going on without the proxy in front.


enbohm commented Oct 16, 2019

@vietj yes, I will try without the proxy in front and see the outcome. It is still weird that the proxy (Envoy) should fail here, since all the other services we have (Node.js, Helidon and Python) work 100% and Quarkus 0.22.0 works like a charm, but of course that could be the case.

I have also attached the trace error (from the Jaeger UI). It is not very informative, but one can at least see the response size and status code. As mentioned before, repeating the request makes everything work as expected (i.e. a 200 response).

[Jaeger screenshot: service-unavailable]


enbohm commented Oct 16, 2019

I also noticed that a new version of Quarkus is out (0.25.0) - I will try that one as well and see if the error occurs.


enbohm commented Oct 16, 2019

I also noticed that a new version of Quarkus is out (0.25.0) - I will try that one as well and see if the error occurs.

Unfortunately, same behavior with Quarkus 0.25.0 - a 503 occurred after a few minutes.


enbohm commented Oct 17, 2019

@vietj I've now tried to reproduce this in a standalone Quarkus app, i.e. with no Envoy proxy in front. I have not been able to trigger the error (I only tried for ca. 30 min, but the error should have occurred by then). This could mean that something does not work as expected between Envoy and Quarkus > 0.22. I also tried all kinds of config in Envoy to drop stale connections, timeouts, stream timeouts etc., but with no luck :(

However, as I mentioned, we have several Quarkus 0.22 applications, Oracle Helidon MP apps and Node.js apps behind Envoy which do not have this issue at all. So there must be something that makes Envoy fail when calling the newer Quarkus apps. Since having an Envoy proxy in front of your services is quite common, it would be nice to see if any other users with a similar setup suffer from this (Istio users maybe?).

I will try to put together a very simple Docker Compose reproducer with Envoy and a Quarkus 0.25 app within a day or two.


enbohm commented Oct 18, 2019

I've now created a repo which contains the setup for a reproducer: https://github.com/enbohm/quarkus-envoy-503/
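Not the exact contents of that repo, but a rough sketch of this kind of docker-compose setup (image tag, ports and file names are hypothetical):

    version: "3"
    services:
      quarkus-app:
        build: .                  # the Quarkus application
        expose:
          - "8080"
      envoy:
        image: envoyproxy/envoy:v1.9.0
        ports:
          - "10000:10000"         # edge listener
        volumes:
          - ./envoy.yaml:/etc/envoy/envoy.yaml   # routes to quarkus-app:8080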

However, I am not able to provoke the error using Quarkus only (i.e. without a front proxy), and I more and more suspect that it has something to do with Envoy. There are similar issues reported by Istio users (see istio/istio#13205), and someone pointed out that envoyproxy/envoy#6190 could be the root cause. But what I can't understand is why it worked with Quarkus 0.22.0 (and works with several other applications such as Node.js, Oracle Helidon and Python apps); it is probably something that has changed with the new Vert.x HTTP layer.

I fully understand if this issue is closed, BUT it could also be good to get a deeper understanding of why Envoy and the latest Quarkus are not 100% compatible, since this is a very common setup. Unfortunately, we can't easily switch away from Envoy, so if I can't figure anything out we might need to pause the development of Quarkus applications, which is a bit sad :(


vietj commented Oct 19, 2019

@enbohm that is great investigation you are doing; I'm not familiar enough with such proxies and their behaviour to have insights into what is going on. Is there a way to dump the TCP traffic between the containers so we can look at it with Wireshark?


enbohm commented Oct 21, 2019

@vietj will investigate and try to get a TCP dump if possible


enbohm commented Nov 5, 2019

Reading more about similar issues in Istio (among others, istio/istio#13848), this could be related to the socket timeout settings in Quarkus. I found a useful write-up about sporadic 503 UC (upstream closed) errors at https://karlstoney.com/2019/05/31/istio-503s-ucs-and-tcp-fun-times/index.html

The conclusion there was that the applications were using a low socket timeout, e.g. Node.js 5 seconds, Java/Tomcat 20s. Setting a higher socket keep-alive seems to do the trick, and I'm now trying this with Quarkus 0.28.1. Currently I have these settings:

quarkus.vertx.eventbus.tcp-keep-alive=true
quarkus.vertx.eventbus.connect-timeout=PT6M

but I can still see a few 503 upstream-disconnect errors in Envoy. I'm trying to work out what these settings were before the Vert.x introduction in 0.23 and then apply them to Quarkus 0.28.1 to see if I can get back to a stable state.


enbohm commented Nov 11, 2019

Using Envoy and Quarkus is now working for me. I tried various settings in Quarkus such as connect timeout and idle timeout, but could still see that a few requests failed (= 503). Finally, I managed to effectively disable keep-alive in Envoy for all upstream connections towards the Quarkus applications by setting max_requests_per_connection to 1 (see https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/cds.proto.html?highlight=keepalive).

I am not sure if there are any problems in the Quarkus/Vert.x layer when it comes to connect and idle timeouts, since I don't have to disable keep-alive towards our other applications written in Node.js and Python. Anyhow, I hope this can be of guidance to others facing similar issues.
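For reference, a sketch of where that setting sits in a v2-API Envoy cluster definition (cluster name, address and timeout are placeholders):

    clusters:
      - name: quarkus_service
        connect_timeout: 5s
        type: STRICT_DNS
        lb_policy: ROUND_ROBIN
        # Effectively disables upstream keep-alive: each connection serves a single request.
        max_requests_per_connection: 1
        hosts:
          - socket_address:
              address: quarkus-app
              port_value: 8080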


gsmet commented Nov 18, 2019

@stuartwdouglas do you think there is any action left on our part?

@stuartwdouglas

I would like to understand why this was happening. We now have an HTTP idle-timeout config option that can be used to control when connections time out; I think if this is set to a sufficiently large value it should be enough to stop this occurring. The default is now 30m (running with no timeout at all can result in running out of open sockets if there are lots of clients with long-lived connections).
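In application.properties that looks something like the following (the 30M value just spells out the mentioned default):

    # Close HTTP connections that have been idle longer than this.
    quarkus.http.idle-timeout=30M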

@stuartwdouglas

I think this can be closed.

stuartwdouglas added the triage/out-of-date label Jul 8, 2020