-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PreviousPrioritiesConfig Type Not Found #14735
Comments
Not a lot that looks relevant - here for reference is some context around the error when running with trace logging (I'm running a local config.yaml that references a filesystem rds.yaml with the route definitions):
For reference our config.yaml is:
The rds.yaml is:
I can provide the cds.yaml if needed, although I don't think it's relevant here so for now I won't silt up the issue with it. |
FYI - I reviewed the Envoy startup output and addressed all deprecation warnings on the vague hope that one of them might be related (I was especially hopeful about adding a resource_api_version set to V3 on the rds configure_source thinking it might somehow unlock the use of v3 protos in rds.yaml), but alas none of these updates had any effect on the observed behavior. As another test I swapped out the retry_priority stanza for a envoy.retry_host_predicates.omit_host_metadata in a retry_host_predicate (basically using the syntax shown on https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/http/http_connection_management), simply to see if any typed_config would work inside rds.yaml. This configuration did not produce any type URL or other validator errors on startup - so the problem here seems to be very specific to the PreviousPrioritiesConfig. |
I think I see the issue; the actual implementation of this extension is linking and using v2 config, rather than v3 config, see
|
@ankatare in all the cases that you are adding v3 config, you need to make sure that the code consuming the new protos is also upgraded to link in these protos and accept configurations that are both v2 and v3. I'm thinking actually we probably want to not do the denylist of v2 --> v3 upgrade, but instead move this elsewhere (maybe back to denylisting v2 --> v3 upgrade exceptions). This is kind of why I was asking for proper testing. FWIW, your changes are unrelated to #14735 but I think we will see the exact same behavior. |
I've been trying today to configure the envoy.retry_host_predicates.previous_hosts retry host predicate but finding that it never takes effect - I'm suspicious that this is actually due to the same v2/v3 proto issue we're covering in this issue, so I'll describe the problem here and we can sort out if it's worth a new issue or folds into the existing remediation effort. I'm still on Envoy 1.17.0 as described above, and again using the most basic definition from https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/http/http_connection_management#arch-overview-http-retry-plugins, in this case this snippet:
With no typed_config reference I don't hit any invalid-type-URL errors, but the previous_hosts logic simply never takes effect. To confirm this I created a two-member cluster with a simple v3 config (brodmerkle1 is not listening on 9444 while brodmerkle2 is - so half the time I'd expect a retry to occur, at least until outlier detection spots enough consecutive local origin failures):
I then built the simplest possible route configuration to demonstrate the problem:
When hitting Envoy immediately after startup I do occasionally experience 503 responses, and I can see in the trace logs that Envoy is selecting the same host twice, for example:
Nothing in the code suggests that this should be possible - the previous-host logic at https://github.com/envoyproxy/envoy/blob/v1.17.0/source/extensions/retry/host/previous_hosts/previous_hosts.h#L12 is simple enough, and shouldSelectAnotherHost is just checking if any of the retry host predicates complained about the host selection at https://github.com/envoyproxy/envoy/blob/v1.17.0/source/common/router/retry_state_impl.h#L75. However, as with previous_priorities the previous_hosts config is utilizing v2 config instead of v3 - see https://github.com/envoyproxy/envoy/blob/v1.17.0/source/extensions/retry/host/previous_hosts/BUILD#L29. Is this then another manifestation of the same v2/v3 problem that will fall into scope of the PR work being discussed here, or should I split this into another issue with a potentially different fix? |
@brodmerkle looking at Re: the v2/v3 dependency and upgrade issue, I haven't forgotten about this, I didn't have any time last week to tackle it. Happy to have someone else in the community dig into it if they have cycles, otherwise I'm hoping to get some time this week. |
I believe there aren't any integration tests that exercise the previous_hosts one specifically, just one that exercises a test extension. I can look into getting an integration test added, it shouldn't be that hard. Re whether it's working properly: I know that it worked originally (as users reported a substantial increase in success rate for first retry by using it) and I'm not familiar with any changes that would have broken it. My guess is that the reproducer is showing the limitations of the host retry extensions: they don't allow a strict "never retry to the same hosts", but rather attempts to perform load balancing N times until it finds a host that doesn't match the predicate. This means that especially for random load balancers (RANDOM, LEAST_REQUEST), you'd likely see some requests be routed to the same host, especially when |
* Add v3 equivalents of v2 configs that were included in v3 due to no transitive deprecation. This increases consistency and reduces user confusion. We will continue to support these straggler v2 configs beyond the v2 turndown due to the late addition of v3 counterparts, special case code is added to utility.cc to handle this. * There were two extensions, //envoy/config/cluster/redis and //envoy/config/retry/previous_priorities, that for some reason were not upgraded to use v3 config. This is now fixed and I've grepped for other v2 in //source, none remain. Risk level: Medium (changes to extension config types and deprecated config handling). Testing: Additional unit test added for utility.cc handling, upgraded configs to v3 for other tests. Fixes envoyproxy#14735 Fixes envoyproxy#12841 Signed-off-by: Harvey Tuch <htuch@google.com> Co-authored-by: Abhay Narayan Katare <abhay.katare@india.nec.com>
I finally had some cycles to work on this, #14907 is the v3 API fix. |
* Add v3 equivalents of v2 configs that were included in v3 due to no transitive deprecation. This increases consistency and reduces user confusion. We will continue to support these straggler v2 configs beyond the v2 turndown due to the late addition of v3 counterparts, special case code is added to utility.cc to handle this. * There were two extensions, //envoy/config/cluster/redis and //envoy/config/retry/previous_priorities, that for some reason were not upgraded to use v3 config. This is now fixed and I've grepped for other v2 in //source, none remain. Risk level: Medium (changes to extension config types and deprecated config handling). Testing: Additional unit test added for utility.cc handling, upgraded configs to v3 for other tests. Fixes #14735 Fixes #12841 Signed-off-by: Harvey Tuch <htuch@google.com> Co-authored-by: Abhay Narayan Katare <abhay.katare@india.nec.com>
When trying to use the previous-priorities predicate in a retry handler, Envoy 1.17.0 is returning an invalid-type-URL error even when using the same syntax from the example included in the Envoy documentation. For example, consider the following retry handler snippet in a route:
This YAML uses the same previous-priorities configuration shown in the Envoy 1.17.0 documentation (https://www.envoyproxy.io/docs/envoy/v1.17.0/intro/arch_overview/http/http_connection_management#arch-overview-http-retry-plugins), but it fails when Envoy starts up:
This is strange, as the type URL seems to match up with the corresponding proto file - https://github.com/envoyproxy/envoy/blob/main/api/envoy/extensions/retry/priority/previous_priorities/v3/previous_priorities_config.proto. Since update_frequency is effectively a required field (it has a greater-than-zero validation rule defined at https://github.com/envoyproxy/envoy/blob/main/api/envoy/extensions/retry/priority/previous_priorities/v3/previous_priorities_config.proto#L56 but as an int32 will default to zero if not defined), it's not possible to simply skip defining a value here - indeed trying to use
previous_priorities
without any typed_config will lead a full Envoy crash when constraint validation fails (not sure if a crash is expected in this case):I can see the
previous_priorities
extension is in place in this Envoy binary as expected:I'm using the Envoy 1.17.0 binary from the envoyproxy/envoy:v1.17.0 image, and from the envoyproxy/envoy-debug:v1.17.0 image when collecting debug output from the stack trace above, so nothing remarkable on that front:
Is the example above (and in the retry plugin documentation itself) the correct syntax to utilize the previous-priorities plugin? If so, does this indicate a deeper problem in Envoy with the type URL not existing when it should?
For background here - we're trying to use Envoy as a far more capable reverse-proxy than NGINX, but are running into a challenge with retry handling logic. The simplest version of our goal here can be described with a use case in which we have only two upstreams, one at priority 0 (the "primary upstream") that should handle all traffic by default, and one at priority 1 (the "backup upstream") who should be used when the priority 0 upstream is unavailable. We can achieve this functionality using active health checks, but would like to avoid that as our sole approach - especially because there's a latency before ill health of the first upstream is detected. An ideal approach here is to retry a wide variety of failed idempotent requests against the backup upstream, and to retry a much narrower set of failed non-idempotent requests as well (pretty much just outright connect failures). In NGINX we can use a proxy_next_upstream directive (http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_next_upstream) to achieve this outcome, since NGINX will simply move on from the failing upstream member to the next available member even if it's a backup member.
In Envoy the first approach I tried was to use the
envoy.retry_host_predicates.previous_hosts
host predicate in our route's retry policy, to prevent Envoy from simply rerunning the same host selection algorithm and selecting the primary upstream again. However, this leads to failure, with UF and URX response flags confirming that host selection just kept picking the sole priority 0 upstream and the host predicate kept rejecting it.Given that result I next tried to use the
envoy.retry_priorities.previous_priorities
priority predicate, planning to dial the update_frequency down to 1 but running into the issue described above. Is there a more elegant way to achieve this outcome?I can provide more exhaustive YAML config files if it helps, although the behavior should be reasonably clear from the examples above.
The text was updated successfully, but these errors were encountered: