1.66.0 - increase in produced zero addresses #7625

rcrowe · 2024-09-12T08:53:14Z

What version of gRPC are you using?

v.1.66.0

What version of Go are you using (`go version`)?

1.22.5

What operating system (Linux, Windows, …) and version?

Alpine 3.14
K8s v1.30.1

What did you do?

Ran v.1.66.0 across a fleet of gRPC clients & servers, using xDS to talk to an in-house control plane.

xDS services configured with round_robin load-balancer.

Logs captured with:

GRPC_GO_LOG_VERBOSITY_LEVEL=99
GRPC_GO_LOG_SEVERITY_LEVEL=info

What did you expect to see?

Running v1.65.0 we observe no such errors in with our clients talking to our servers.

What did you see instead?

A % of requests will see...

rpc error: code = Unavailable desc = name resolver error: produced zero addresses

We observe the errors happening multiple times per-minute, for a few minutes & then the errors will drop for up to 5 minutes. It's frequent enough to reliably replicate the problem across all our environments should you need us to capture more information.

I've not yet understood the pattern of when it fails & how often, but have attached logs.tar.gz from a small timeframe when we observed multiple errors, for a single container.

We were focused on the falling calls to iam-policy-decision-api.auth:5000 during our debugging & by querying against our trace backend, observed the following timestamps where these errors occurred. Note logs are in UTC while traces bellow are in BST (UTC+1).

The text was updated successfully, but these errors were encountered:

arjan-bal · 2024-09-12T12:56:22Z

Before each of the error logs, the listener resource for iam-machine-api.auth isn't found (maybe the servers restarted?). This is causing an empty resolver state to be created which results in the creation of a pickfirst leaf LB policy which is the default LB policy when none is specified. pickfirst is given the empty endpoint list and it logs the error we're seeing.

2024/09/12 07:38:25 INFO: [xds] [xds-resolver 0xc00918f320] Received resource-not-found-error for Listener resource "iam-machine-api.auth:5000"
2024/09/12 07:38:25 INFO: [xds] [xds-client 0xc0091edd10] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
  "versionInfo": "1789713",
  "resourceNames": [
    "iam-machine-api.auth:5000"
  ],
  "typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
  "responseNonce": "1538097"
}
2024/09/12 07:38:25 INFO: [xds] [xds-resolver 0xc00918f320] For Listener resource "iam-machine-api.auth:5000" and RouteConfiguration resource "", generated service config: {
  "loadBalancingConfig": [
    {
      "xds_cluster_manager_experimental": {
        "children": {
          "cluster:iam-machine-api.auth.5000": {
            "childPolicy": [
              {
                "cds_experimental": {
                  "cluster": "iam-machine-api.auth.5000"
                }
              }
            ]
          }
        }
      }
    }
  ]
}
2024/09/12 07:38:25 INFO: [core] [Channel #9]Resolver state updated: {
  "Addresses": null,
  "Endpoints": [],
  "ServiceConfig": {
    "Config": {
      "Config": null,
      "Methods": {}
    },
    "Err": null
  },
  "Attributes": {
    "\u003c%!p(resolver.csKeyType=grpc.internal.resolver.configSelector)\u003e": "\u003c0x0\u003e",
    "\u003c%!p(xdsclient.clientKeyType=grpc.xds.internal.client.Client)\u003e": "\u003c0xc007d369b0\u003e"
  }
} ()
2024/09/12 07:38:25 INFO: [xds] [xds-cluster-manager-lb 0xc02deeda00] Update with config {
  "LoadBalancingConfig": null,
  "Children": {
    "cluster:iam-machine-api.auth.5000": {
      "ChildPolicy": [
        {
          "cds_experimental": {
            "LoadBalancingConfig": null,
            "Cluster": "iam-machine-api.auth.5000"
          }
        }
      ]
    }
  }
}, resolver state {Addresses:[] Endpoints:[] ServiceConfig:0xc02fff8580 Attributes:{"<%!p(resolver.csKeyType=grpc.internal.resolver.configSelector)>": "<0x0>" , "<%!p(xdsclient.clientKeyType=grpc.xds.internal.client.Client)>": "<0xc007d369b0>" }}
2024/09/12 07:38:25 INFO: [xds] [cds-lb 0xc02f3cc9c0] Received balancer config update: {
  "LoadBalancingConfig": null,
  "Cluster": "iam-machine-api.auth.5000"
}
2024/09/12 07:38:25 INFO: [core] [Channel #9]Resolver state updated: {
  "Addresses": null,
  "Endpoints": [],
  "ServiceConfig": {
    "Config": {
      "Config": null,
      "Methods": {}
    },
    "Err": null
  },
  "Attributes": null
} (service config updated)
2024/09/12 07:38:25 INFO: [core] [Channel #9]Channel switches to new LB policy "pick_first"
2024/09/12 07:38:25 INFO: [pick-first-lb] [pick-first-lb 0xc02fffbda0] Received error from the name resolver: produced zero addresses
2024/09/12 07:38:25 INFO: [core] [Channel #9]Channel Connectivity change to TRANSIENT_FAILURE
2024/09/12 07:38:25 INFO: [core] error from balancer.UpdateClientConnState: bad resolver state

Once the listener becomes available, pickfirst is removed. Still looking into whether anything changed about the handling of not found listeners in 1.66 which would cause these logs to show up.

rcrowe · 2024-09-12T15:50:28Z

Thanks @arjan-bal

The logs were captured from a service that would have been hitting a few services configured via xDS, however, during our investigation only focused on iam-policy-decision-api.auth so can only say with certainty that this endpoint was exhibiting the issue. It could well have been the same with the machine endpoint.

I'll take that away & try to dig into why the listener resource is missing as I see from the logs that's happening across both endpoints being discussed here; best I can tell is that none of the servers have been restarted for some time so it's not immediately clear why this would be missing.

ffilippopoulos · 2024-09-12T20:14:37Z

@arjan-bal Reading through the logs, it seems to me that the listener resource config for iam-machine-api.auth:5000 is served down the client and is added to it's cache:

2024/09/12 07:38:22 INFO: [xds] [xds-client 0xc0091edd10] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] Resource type "ListenerResource" with name "iam-machine-api.auth:5000" added to cache

And then subsequent requests to the rest of the APIs are also successful and initiate watches. Then at some point after the client sends another ADS request for the same listener:

2024/09/12 07:38:26 INFO: [xds] [xds-client 0xc0091edd10] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
  "versionInfo": "1789812",
  "resourceNames": [
    "iam-machine-api.auth:5000"
  ],
  "typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
  "responseNonce": "1538184"

which in fact receives an empty response:

2024/09/12 07:38:26 INFO: [xds] [xds-client 0xc0091edd10] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS response received: {
  "versionInfo": "1789812",
  "typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
  "nonce": "1538184"
}

and triggers the resource-not-found-error error.
This though should be as the xDS server thinks that the client is up to date and no update should be sent.
The only change I can see that could have affected the call flow is the ADS stream flow control mechanism: #7458. Is there anything new we need to implement on the xDS server side to respect the new flow?

arjan-bal · 2024-09-13T06:54:50Z

Here are the events leading up to one of the resource-not-found errors:

Request for listener sent with version 1790189.

2024/09/12 07:38:33 INFO: [xds] [xds-client 0xc0091edd10] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
  "versionInfo": "1790189",
  "resourceNames": [
    "iam-machine-api.auth:5000"
  ],
  "typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
  "responseNonce": "1538494"
}

ADS response received with version 1790190. Since the version has changed, I believe this should be considered as a not found response instead of not updated. Based on the envoy docs

In the SotW protocol variants, all resource types except for Listener and Cluster are grouped into responses in the same way as in the incremental protocol variants. However, Listener and Cluster resource types are handled differently: the server must include the complete state of the world, meaning that all resources of the relevant type that are needed by the client must be included, even if they did not change since the last response.

2024/09/12 07:38:33 INFO: [xds] [xds-client 0xc0091edd10] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS response received: {
  "versionInfo": "1790190",
  "typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
  "nonce": "1538496"
}

Request sent again with version 1790190.

2024/09/12 07:38:33 INFO: [xds] [xds-client 0xc0091edd10] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
  "versionInfo": "1790190",
  "resourceNames": [
    "iam-machine-api.auth:5000"
  ],
  "typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
  "responseNonce": "1538496"
}

This time the response for the same version 1790190 contains the listener details

2024/09/12 07:38:33 INFO: [xds] [xds-client 0xc01c3a2460] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS response received: {
  "versionInfo": "1790190",
  "resources": [
    {
      "@type": "type.googleapis.com/envoy.config.listener.v3.Listener",
      "name": "iam-policy-decision-api.auth:5000",
      "apiListener": {
        "apiListener": {
          "@type": "type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager",
          "routeConfig": {
            "name": "iam-policy-decision-api.auth:5000",
            "virtualHosts": [
              {
                "name": "iam-policy-decision-api.auth:5000",
                "domains": [
                  "iam-policy-decision-api.auth:5000"
                ],
                "routes": [
                  {
                    "match": {
                      "prefix": ""
                    },
                    "route": {
                      "cluster": "iam-policy-decision-api.auth.5000"
                    }
                  }
                ],
                "retryPolicy": {
                  "retryOn": "cancelled,deadline-exceeded,internal,unavailable",
                  "numRetries": 3,
                  "retryBackOff": {
                    "baseInterval": "0.005s",
                    "maxInterval": "0.050s"
                  }
                }
              }
            ]
          },
          "httpFilters": [
            {
              "name": "envoy.filters.http.router",
              "typedConfig": {
                "@type": "type.googleapis.com/envoy.extensions.filters.http.router.v3.Router"
              }
            }
          ]
        }
      }
    }
  ],
  "typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
  "nonce": "1539401"
}

ffilippopoulos · 2024-09-13T09:48:51Z

@arjan-bal One difference I've spotted comparing to the previous client version (v1.65.0) is that following the incremental xDS, it was sending ADS requests including a list of resources. for example:

2024/09/13 09:14:32 INFO: [xds] [xds-client 0xc0093c02d0] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
  "versionInfo": "8",
  "resourceNames": [
    "iam-machine-api.auth:5000",
    "iam-policy-decision-api.auth:5000"
  ],
  "typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
  "responseNonce": "16"
}

Our xDS server implementation used that list to keep an up to date snapshot for the respective node. This was implemented as the client's behaviour when a connection to a resource hit idle timeout was to send an ADS request
with an updated list (missing the resources that timed out) to notify the xDS server that it offloaded respective resources from cache and the server needs to remove from snaphot. Please correct me if I am wrong here!

The reason I was asking about the ADS flow control feature added in the latest release is because we now see all our ADS requests containing a single resourceName, which would result in constantly updating our snapshots on the xDS server side.

easwars · 2024-09-13T20:13:59Z

The ADS flow control mechanism that was implemented recently only applies flow control on the receive path, i.e. once an ADS response is received by the xDS client in gRPC, it notifies all registered watchers for all the resources in that response, and blocks the next read on the ADS stream until all the watchers have processed the previous update.

One difference I've spotted comparing to the previous client version (v1.65.0) is that following the incremental xDS, it was sending ADS requests including a list of resources

gRPC only supports the SotW variant. It does not support the incremental variant. I'm not sure what exactly you mean here. The xDS server is expected to return all requested resources (for LDS and CDS), even if it knows that it had sent the same resource previously to this client.

The reason I was asking about the ADS flow control feature added in the latest release is because we now see all our ADS requests containing a single resourceName, which would result in constantly updating our snapshots on the xDS server side

That doesn't sound right and if that is the case, then its a bug. Would you be able to give us a simple way to reproduce this problem? Thanks.

ffilippopoulos · 2024-09-13T21:06:44Z

@easwars If I create a very simplistic example, where I call 2 echo servers like:

	serverAddr1 := "xds:///grpc-echo-server.labs:50051"
	serverAddr2 := "xds:///grpc-echo-another-server.labs:50051"
	conn1, err := grpc.Dial(serverAddr1, grpc.WithInsecure())
	if err != nil {
		log.Fatalf("Could not connect %v", err)
	}
	defer conn1.Close()
	c1 := echo.NewEchoServerClient(conn1)
	ctx := context.Background()
	r1, err := c1.SayHello(ctx, &echo.EchoRequest{Name: "unary RPC msg "})
	if err != nil {
		log.Printf("Could not get RPC %v\n", err)
	}
	log.Printf("RPC Response: %v", r1)
	time.Sleep(1 * time.Second)
	conn2, err := grpc.Dial(serverAddr2, grpc.WithInsecure())
	if err != nil {
		log.Fatalf("Could not connect %v", err)
	}
	defer conn1.Close()
	c2 := echo.NewEchoServerClient(conn2)
	r2, err := c2.SayHello(ctx, &echo.EchoRequest{Name: "unary RPC msg "})
	if err != nil {
		log.Printf("Could not get RPC %v\n", err)
	}
	log.Printf("RPC Response: %v", r2)

and build this with grpc-go v1.65.0 and latest (v1.66.2), I can see how the behaviour changes between releases.
Logs when running with v1.65.0: https://gist.github.com/ffilippopoulos/c2112dccb04e05a7255184bb305b4593
Logs when running with v1.66.2: https://gist.github.com/ffilippopoulos/d5576a2ac7efbc8e9f4e89aa9fe920a7

From there you can see that when building the above code with grpc-go v1.65, I get the following ADS discovery requests

2024/09/13 20:41:13 INFO: [xds] [xds-client 0xc0002ceeb0] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
  "node":  {
    "id":  "labs/grpc-echo-client-869c7c9dfb-jmrm6",
    "locality":  {},
    "userAgentName":  "gRPC Go",
    "userAgentVersion":  "1.65.0",
    "clientFeatures":  [
      "envoy.lb.does_not_support_overprovisioning",
      "xds.config.resource-in-sotw"
    ]
  },
  "resourceNames":  [
    "grpc-echo-server.labs:50051"
  ],
  "typeUrl":  "type.googleapis.com/envoy.config.listener.v3.Listener"
}

2024/09/13 20:41:14 INFO: [xds] [xds-client 0xc0002ceeb0] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
  "versionInfo":  "2",
  "resourceNames":  [
    "grpc-echo-server.labs:50051",
    "grpc-echo-another-server.labs:50051"
  ],
  "typeUrl":  "type.googleapis.com/envoy.config.listener.v3.Listener",
  "responseNonce":  "3"
}

While when using v1.66.2 it looks like my requests are sent from different clients:

2024/09/13 20:47:37 INFO: [xds] [xds-client 0xc00023a9b0] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
  "node":  {
    "id":  "labs/grpc-echo-client-5b48dff6b5-slqq4",
    "userAgentName":  "gRPC Go",
    "userAgentVersion":  "1.66.2",
    "clientFeatures":  [
      "envoy.lb.does_not_support_overprovisioning",
      "xds.config.resource-in-sotw"
    ]
  },
  "resourceNames":  [
    "grpc-echo-server.labs:50051"
  ],
  "typeUrl":  "type.googleapis.com/envoy.config.listener.v3.Listener"
}

2024/09/13 20:47:38 INFO: [xds] [xds-client 0xc000374eb0] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
  "node":  {
    "id":  "labs/grpc-echo-client-5b48dff6b5-slqq4",
    "userAgentName":  "gRPC Go",
    "userAgentVersion":  "1.66.2",
    "clientFeatures":  [
      "envoy.lb.does_not_support_overprovisioning",
      "xds.config.resource-in-sotw"
    ]
  },
  "resourceNames":  [
    "grpc-echo-another-server.labs:50051"
  ],
  "typeUrl":  "type.googleapis.com/envoy.config.listener.v3.Listener"
}

I appreciate my knowledge in grpc xDS might not be very deep so please let me know if I am doing anything wrong here.

easwars · 2024-09-13T21:50:17Z

Ah, now I see what is happening. So, this is not really related to the flow control changes that we made, but is related to some fallback changes that we are in the process of making.

So, earlier, we used to have a global singleton xDS client for the whole grpc application and if the application creates multiple grpc channels (thereby requesting multiple listener resources), all these listener resources would be requested by the same xDS client, and therefore would result in multiple listener resource names being specified in a single LDS request.

But with fallback, we have switched to using a separate xDS client per grpc target URI. So, if you create three grpc channels:

channel1 to URI xds:///foo
channel2 to URI xds:///foo
channel3 to URI xds:///bar
channels 1 and 2, would share an xDS client, while channel3 would get its own.

On the server side though, all servers share a single xDS client. Hope this helps.

This is the PR that contains the changes I described above: #7347

And this is the design for the overall fallback feature: https://github.com/grpc/proposal/blob/master/A71-xds-fallback.md

purnesh42H · 2024-09-17T11:27:00Z

@easwars do we need to do anything from our side or this is intended behavior?

easwars · 2024-09-26T09:40:58Z

@easwars do we need to do anything from our side or this is intended behavior?

This is working as per the design.

purnesh42H · 2024-09-29T06:14:51Z

@ffilippopoulos feel free to reopen if you have any more questions

rcrowe added the Type: Bug label Sep 12, 2024

arjan-bal self-assigned this Sep 12, 2024

arjan-bal assigned easwars and unassigned arjan-bal Sep 13, 2024

arjan-bal added the Area: xDS Includes everything xDS related, including LB policies used with xDS. label Sep 13, 2024

easwars assigned ffilippopoulos and unassigned easwars Sep 13, 2024

easwars added the Status: Requires Reporter Clarification label Sep 13, 2024

easwars removed the Status: Requires Reporter Clarification label Sep 13, 2024

purnesh42H assigned easwars and unassigned ffilippopoulos Sep 17, 2024

easwars added the Status: Requires Reporter Clarification label Sep 26, 2024

purnesh42H closed this as completed Sep 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

1.66.0 - increase in produced zero addresses #7625

1.66.0 - increase in produced zero addresses #7625

rcrowe commented Sep 12, 2024 •

edited

Loading

arjan-bal commented Sep 12, 2024

rcrowe commented Sep 12, 2024 •

edited

Loading

ffilippopoulos commented Sep 12, 2024 •

edited

Loading

arjan-bal commented Sep 13, 2024

ffilippopoulos commented Sep 13, 2024

easwars commented Sep 13, 2024

ffilippopoulos commented Sep 13, 2024

easwars commented Sep 13, 2024

purnesh42H commented Sep 17, 2024

easwars commented Sep 26, 2024

purnesh42H commented Sep 29, 2024

1.66.0 - increase in produced zero addresses #7625

1.66.0 - increase in produced zero addresses #7625

Comments

rcrowe commented Sep 12, 2024 • edited Loading

What version of gRPC are you using?

What version of Go are you using (go version)?

What operating system (Linux, Windows, …) and version?

What did you do?

What did you expect to see?

What did you see instead?

arjan-bal commented Sep 12, 2024

rcrowe commented Sep 12, 2024 • edited Loading

ffilippopoulos commented Sep 12, 2024 • edited Loading

arjan-bal commented Sep 13, 2024

ffilippopoulos commented Sep 13, 2024

easwars commented Sep 13, 2024

ffilippopoulos commented Sep 13, 2024

easwars commented Sep 13, 2024

purnesh42H commented Sep 17, 2024

easwars commented Sep 26, 2024

purnesh42H commented Sep 29, 2024

rcrowe commented Sep 12, 2024 •

edited

Loading

What version of Go are you using (`go version`)?

rcrowe commented Sep 12, 2024 •

edited

Loading

ffilippopoulos commented Sep 12, 2024 •

edited

Loading