1.66.0 - increase in produced zero addresses #7625
Before each of the error logs, the listener resource for
Once the listener becomes available,
Thanks @arjan-bal. The logs were captured from a service that would have been hitting a few services configured via xDS; however, during our investigation we only focused on iam-policy-decision-api.auth:5000. I'll take that away & try to dig into why the listener resource is missing, as I can see from the logs that it's happening across both endpoints being discussed here. Best I can tell, none of the servers have been restarted for some time, so it's not immediately clear why this resource would be missing.
@arjan-bal Reading through the logs, it seems to me that the listener resource config for iam-policy-decision-api.auth:5000 is initially received successfully:
And then subsequent requests to the rest of the APIs are also successful and initiate watches. Then, at some point later, the client sends another ADS request for the same listener:
which in fact receives an empty response:
and triggers the "produced zero addresses" error.
Here are the events leading up to one of the errors:
@arjan-bal One difference I've spotted compared with the previous client version (v1.65.0) is in the list of resource names carried by each discovery request.
Our xDS server implementation used that list to keep an up-to-date snapshot for the respective node; this was implemented because the client's behaviour, when a connection to a resource hit its idle timeout, was to send an updated ADS request. The reason I was asking about the ADS flow control feature added in the latest release is that we now see all our ADS requests containing a single resource name.
The ADS flow control mechanism that was implemented recently only applies flow control on the receive path, i.e. once an ADS response is received by the xDS client in gRPC, it notifies all registered watchers for all the resources in that response, and blocks the next read on the ADS stream until all the watchers have processed the previous update.
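To picture that receive-path flow control, here is a conceptual sketch; it is not the actual grpc-go implementation, and all of the names in it (adsStream, watcher, receiveLoop) are invented for illustration:

```go
package adsflow

import "sync"

type resource struct{ name string }

// watcher is a stand-in for a registered resource watcher. OnUpdate must call
// done() once it has finished processing the update.
type watcher interface {
	OnUpdate(r resource, done func())
}

// adsStream is a stand-in for the ADS stream; Recv returns the resources
// contained in the next ADS response.
type adsStream interface {
	Recv() ([]resource, error)
}

// receiveLoop reads one ADS response at a time and does not issue the next
// Recv until every watcher has finished processing the previous response.
func receiveLoop(stream adsStream, watchers map[string][]watcher) error {
	for {
		resources, err := stream.Recv()
		if err != nil {
			return err
		}
		var wg sync.WaitGroup
		for _, r := range resources {
			for _, w := range watchers[r.name] {
				wg.Add(1)
				w.OnUpdate(r, wg.Done)
			}
		}
		// Flow control on the receive path: block here until all watchers
		// report they are done before reading the next response.
		wg.Wait()
	}
}
```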
gRPC only supports the SotW (State of the World) variant of the xDS protocol; it does not support the incremental variant. I'm not sure what exactly you mean here. The xDS server is expected to return all requested resources (for LDS and CDS), even if it knows it has sent the same resource to this client previously.
That doesn't sound right, and if that is the case, then it's a bug. Would you be able to give us a simple way to reproduce this problem? Thanks.
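For context on the SotW shape discussed above, here is a hedged sketch of an LDS DiscoveryRequest built with the go-control-plane generated types; the node ID and listener names are invented and are not taken from this issue:

```go
package main

import (
	"fmt"

	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	discoveryv3 "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
)

func main() {
	req := &discoveryv3.DiscoveryRequest{
		Node:    &corev3.Node{Id: "example-node"},
		TypeUrl: "type.googleapis.com/envoy.config.listener.v3.Listener",
		// State of the World: each request carries the full set of listener
		// names the client is interested in, and the server is expected to
		// answer with all of them, even if nothing has changed.
		ResourceNames: []string{
			"service-a.example:5000",
			"service-b.example:5000",
		},
	}
	fmt.Println(req.String())
}
```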
@easwars If I create a very simplistic example, where I call 2 echo servers like:
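The snippet itself was not captured in this thread, so here is a minimal hedged reconstruction of that kind of client; the target names, the insecure transport credentials, and the presence of a valid xDS bootstrap file are all assumptions:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	_ "google.golang.org/grpc/xds" // registers the xds:/// resolver
)

func main() {
	// Assumes GRPC_XDS_BOOTSTRAP points at a bootstrap file for the control plane.
	// Hypothetical targets; the real echo server names are not in the issue.
	targets := []string{
		"xds:///echo-server-1.example:50051",
		"xds:///echo-server-2.example:50051",
	}
	for _, target := range targets {
		conn, err := grpc.NewClient(target, grpc.WithTransportCredentials(insecure.NewCredentials()))
		if err != nil {
			log.Fatalf("failed to create a channel for %s: %v", target, err)
		}
		defer conn.Close()
		// ... invoke the Echo RPC on conn via the generated stub ...
	}
}
```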
and build it with grpc-go v1.65.0 and the latest release (v1.66.2), I can see how the behaviour changes between releases. When building the above code with grpc-go v1.65, I get the following ADS discovery requests:
Whereas when using v1.66.2, it looks like my requests are sent from different clients:
I appreciate my knowledge of gRPC xDS might not be very deep, so please let me know if I am doing anything wrong here.
Ah, now I see what is happening. This is not really related to the flow control changes that we made, but to some fallback changes that we are in the process of making. Earlier, we used to have a global singleton xDS client for the whole gRPC application, and if the application created multiple gRPC channels (thereby requesting multiple listener resources), all of those listener resources would be requested by the same xDS client, and would therefore result in multiple listener resource names being specified in a single LDS request. But with fallback, we have switched to using a separate xDS client per gRPC target URI. So, if you create three gRPC channels:
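The example that followed was not captured; a hedged sketch of what is being described might look like this (service names and credentials are invented):

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	_ "google.golang.org/grpc/xds" // registers the xds:/// resolver
)

func main() {
	// From v1.66 each of these target URIs is served by its own xDS client,
	// so each LDS request names exactly one listener. On v1.65 and earlier a
	// single shared xDS client would have requested all three names in one
	// LDS request.
	for _, target := range []string{
		"xds:///service-a.example:5000",
		"xds:///service-b.example:5000",
		"xds:///service-c.example:5000",
	} {
		cc, err := grpc.NewClient(target, grpc.WithTransportCredentials(insecure.NewCredentials()))
		if err != nil {
			log.Fatalf("creating channel for %s: %v", target, err)
		}
		defer cc.Close()
	}
}
```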
On the server side, though, all servers share a single xDS client. Hope this helps. This is the PR that contains the changes I described above: #7347. And this is the design for the overall fallback feature: https://github.com/grpc/proposal/blob/master/A71-xds-fallback.md
@easwars do we need to do anything from our side, or is this the intended behavior?
This is working as per the design.
@ffilippopoulos feel free to reopen if you have any more questions.
What version of gRPC are you using?
v1.66.0
What version of Go are you using (go version)?
1.22.5
What operating system (Linux, Windows, …) and version?
Alpine 3.14
K8s v1.30.1
What did you do?
Ran v1.66.0 across a fleet of gRPC clients & servers, using xDS to talk to an in-house control plane.
xDS services are configured with the round_robin load balancer.
Logs captured with:
GRPC_GO_LOG_VERBOSITY_LEVEL=99
GRPC_GO_LOG_SEVERITY_LEVEL=info
What did you expect to see?
Running v1.65.0, we observe no such errors with our clients talking to our servers.
What did you see instead?
A % of requests will see...
We observe the errors happening multiple times per minute for a few minutes, & then the errors will drop off for up to 5 minutes. It's frequent enough that we can reliably replicate the problem across all our environments, should you need us to capture more information.
I've not yet understood the pattern of when it fails & how often, but have attached logs.tar.gz from a small timeframe when we observed multiple errors, for a single container.
We were focused on the failing calls to iam-policy-decision-api.auth:5000 during our debugging, & by querying our trace backend we observed the following timestamps where these errors occurred. Note the logs are in UTC while the traces below are in BST (UTC+1).