cluster manager init takes minutes due to slow DNS and STRICT_DNS behaviour #14670
Comments
To me it would make sense to be able to have a per-cluster configuration to disable warming. This could be coupled with a control to permit connection picking / LB to defer until a cluster is warm. That would then allow normal route timeouts to be applied to control the process. Not sure if that's exactly what's needed, but it might have fewer moving parts than an explicit warming filter. @snowp @mattklein123 WDYT?
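As a rough illustration of the idea, a per-cluster opt-out might look something like the sketch below. The `wait_for_warm_on_init` field is purely hypothetical and is not part of Envoy's cluster API; the remaining fields are standard cluster configuration with illustrative values.

```yaml
clusters:
- name: backend
  type: STRICT_DNS
  connect_timeout: 1s
  # Hypothetical knob sketching the idea above; not an actual Envoy field.
  # When false, the cluster would be marked initialised immediately and
  # host selection would defer until the first DNS resolution completes.
  wait_for_warm_on_init: false
  lb_policy: ROUND_ROBIN
  load_assignment:
    cluster_name: backend
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: backend.example.com   # illustrative hostname
              port_value: 443
```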
+1. I know that @lambdai has also been looking at the general problem of "too many things need warming" and might have thoughts here.
This would work for me.
I actually use a custom LB (FIFO) and think that it would be too late there, as it only has chooseHost, which is called from within the router and cannot reschedule and must not block. Here's the code for reference (it is always coupled with envoy.retry_host_predicates.previous_hosts).
In any case, I only shared the code as a reference. I know it is far from the best solution, but I wanted to start from somewhere.
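The custom LB code referenced above is not reproduced here. For context, the `envoy.retry_host_predicates.previous_hosts` predicate it is paired with is normally attached to a route's retry policy roughly as follows (cluster name and retry settings are illustrative):

```yaml
route:
  cluster: backend
  retry_policy:
    retry_on: connect-failure,refused-stream
    num_retries: 3
    # Avoid re-selecting hosts that have already failed for this request.
    retry_host_predicate:
    - name: envoy.retry_host_predicates.previous_hosts
```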
Yeah, we would need to block somewhere in the router. I think this behavior is generic enough that we could have a "block-on-cluster-warming" capability.
Drive by: per-cluster warming might be good. @ppadevski An HTTP filter is an acceptable place, but we'd be better off putting it in the HTTP router, which handles general upstream failure (e.g. by adding a retry). Ideally, I am planning an experimental asynchronous load balancer that would return a Future, so that any terminal network filter (including the HTTP connection manager, the TCP proxy, and existing and future filters) can wait on the cluster. This could resolve this slow-DNS use case, as well as on-demand EDS.
Hi all. Is there any update on this feature?
I believe I encountered the same issue. What's strange is that, after the message shown in the logs below, I would have expected the DNS request to be canceled after 5 seconds (the default timeout per Envoy's documentation) and retried. Unless, of course, this DNS configuration needs to be set explicitly for it to apply.
Logs:
Config:
I noticed recently that cluster manager init can take quite some time. For example, here it takes 3+ minutes:
After debugging the issue, it turned out that a STRICT_DNS cluster is considered initialised once the very first DNS resolution completes, whether with success or failure.
This is a problem for me, as during these 3+ minutes Envoy wasn't working at all - it was unable to get its listeners and routes because it was stuck in CDS. I only had a few STRICT_DNS clusters, and most of my other clusters (100+) are STATIC but were unusable due to the lack of listeners and routes.
Note that this only happens when DNS is (very) slow. I tried reproducing the issue with iptables DROP rules but was unable to, as c-ares returns failure immediately. The following was the only way I was able to slow down DNS enough to reproduce the issue (10.10.10.10 is my DNS server).
After the initial initialisation everything is fine, as the DNS responses are cached and deduplicated. My apps and services tend to restart from time to time, and having a 3+ minute gap when DNS is slow is bad - and DNS does sometimes happen to be slow when running a cloud service.
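For reference, a minimal STRICT_DNS cluster with the standard DNS-related fields looks roughly like the sketch below (values are illustrative). These fields control how resolution is refreshed and retried once the cluster is running, but none of them avoid the warming delay described above, since the cluster only counts as initialised after the first resolution completes.

```yaml
clusters:
- name: upstream_svc
  type: STRICT_DNS
  connect_timeout: 1s
  dns_lookup_family: V4_ONLY
  dns_refresh_rate: 5s          # periodic re-resolution interval
  respect_dns_ttl: true         # honour the TTL returned by the DNS server when available
  dns_failure_refresh_rate:     # back-off applied after a failed resolution
    base_interval: 1s
    max_interval: 10s
  lb_policy: ROUND_ROBIN
  load_assignment:
    cluster_name: upstream_svc
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: upstream.example.com   # illustrative hostname
              port_value: 8080
```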
I ended up implementing the following solution:
The solution does the following:
At the moment I am not using L4 filters, so StrictDnsWarmupFilter is specifically for the L7 router.
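Since the patch itself is not shown here, the sketch below only illustrates where such a warmup filter would sit in the HTTP filter chain, ahead of the router; the extension name and its (empty) config are assumptions.

```yaml
http_filters:
# Hypothetical name for the StrictDnsWarmupFilter from the referenced patch;
# it would hold requests until the target STRICT_DNS cluster has resolved
# or a grace period expires.
- name: envoy.filters.http.strict_dns_warmup
- name: envoy.filters.http.router
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```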
I would like to ask whether the above use case could be incorporated into Envoy for broader use (the patch is only for reference) - that is, initialise STRICT_DNS clusters immediately so that some routing can happen, and have a grace period when DNS is slow or down.