
PreviousPrioritiesConfig Type Not Found #14735

Closed
brodmerkle opened this issue Jan 17, 2021 · 9 comments · Fixed by #14907

@brodmerkle

When trying to use the previous-priorities predicate in a retry handler, Envoy 1.17.0 returns an invalid-type-URL error, even with the same syntax as the example included in the Envoy documentation. For example, consider the following retry handler snippet in a route:

retry_policy:
  num_retries: 1
  retry_on: 5xx
  retry_priority:
    name: envoy.retry_priorities.previous_priorities
    typed_config:
      '@type': type.googleapis.com/envoy.extensions.retry.priority.previous_priorities.v3.PreviousPrioritiesConfig
      update_frequency: 1

This YAML uses the same previous-priorities configuration shown in the Envoy 1.17.0 documentation (https://www.envoyproxy.io/docs/envoy/v1.17.0/intro/arch_overview/http/http_connection_management#arch-overview-http-retry-plugins), but it fails when Envoy starts up:

[2021-01-17 13:01:31.235][91][warning][config] [source/common/config/filesystem_subscription_impl.cc:78] Filesystem config update failure: Unable to parse JSON as proto (INVALID_ARGUMENT:(virtual_hosts[0].routes[2].route.retry_policy.retry_priority.typed_config): invalid value Invalid type URL, unknown type: envoy.extensions.retry.priority.previous_priorities.v3.PreviousPrioritiesConfig for type Any): {"resources":{"name":"local-routes","@type":"type.googleapis.com/envoy.config.route.v3.RouteConfiguration","virtual_hosts":[{"name":"http","domains":["*"],"routes":[{"match":{"prefix":"/x-tree/F5Monitor.html"},"direct_response":{"body":{"inline_string":"THE SERVER IS UP"},"status":200}},{"route":{"retry_policy":{"retry_priority":{"typed_config":{"update_frequency":1,"@type":"type.googleapis.com/envoy.extensions.retry.priority.previous_priorities.v3.PreviousPrioritiesConfig"},"name":"envoy.retry_priorities.previous_priorities"},"retry_on":"5xx","num_retries":1},"cluster":"healthtest","timeout":"60s"},"match":{"prefix":"/","headers":[{"safe_regex_match":{"google_re2":{},"regex":"POST|PUT|DELETE"},"invert_match":true,"name":":method"}]}},{"route":{"timeout":"60s","retry_policy":{"retry_on":"connect-failure","retry_priority":{"typed_config":{"update_frequency":1,"@type":"type.googleapis.com/envoy.extensions.retry.priority.previous_priorities.v3.PreviousPrioritiesConfig"},"name":"envoy.retry_priorities.previous_priorities"},"num_retries":1},"cluster":"healthtest"},"match":{"prefix":"/","headers":[{"name":":method","safe_regex_match":{"regex":"POST|PUT|DELETE","google_re2":{}},"invert_match":false}]}}]}]}}

This is strange, as the type URL seems to match up with the corresponding proto file - https://github.com/envoyproxy/envoy/blob/main/api/envoy/extensions/retry/priority/previous_priorities/v3/previous_priorities_config.proto. Since update_frequency is effectively a required field (it has a greater-than-zero validation rule defined at https://github.com/envoyproxy/envoy/blob/main/api/envoy/extensions/retry/priority/previous_priorities/v3/previous_priorities_config.proto#L56, but as an int32 it defaults to zero when not set), it's not possible to simply skip defining a value here - indeed, trying to use previous_priorities without any typed_config leads to a full Envoy crash when constraint validation fails (not sure if a crash is expected in this case):

[2021-01-17 12:53:27.149][81][debug][router] [source/common/router/router.cc:425] [C2][S11559433518512939447] cluster 'healthtest' match for URL '/test'
[2021-01-17 12:53:27.149][81][debug][misc] [source/common/protobuf/utility.cc:266] Proto validation error; throwing Proto constraint validation failed (PreviousPrioritiesConfigValidationError.UpdateFrequency: ["value must be greater than " '\x00']):
[2021-01-17 12:53:27.150][81][critical][main] [source/exe/terminate_handler.cc:13] std::terminate called! (possible uncaught exception, see trace)
[2021-01-17 12:53:27.150][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2021-01-17 12:53:27.150][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:92] Envoy version: 5c801b25cae04f06bf48248c90e87d623d7a6283/1.17.0/Clean/RELEASE/BoringSSL
[2021-01-17 12:53:27.160][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #0: Envoy::TerminateHandler::logOnTerminate()::$_0::operator()() [0x55781a636a7b]
[2021-01-17 12:53:27.170][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #1: [0x55781a6368e9]
[2021-01-17 12:53:27.180][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #2: std::__terminate() [0x55781acc05c3]
[2021-01-17 12:53:27.191][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #3: Envoy::MessageUtil::validate<>() [0x55781946c85f]
[2021-01-17 12:53:27.201][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #4: Envoy::Extensions::Retry::Priority::PreviousPrioritiesRetryPriorityFactory::createRetryPriority() [0x55781946c4c5]
[2021-01-17 12:53:27.211][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #5: Envoy::Router::RetryPolicyImpl::retryPriority() [0x55781a3d933d]
[2021-01-17 12:53:27.221][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #6: Envoy::Router::RetryStateImpl::RetryStateImpl() [0x55781a4053fc]
[2021-01-17 12:53:27.230][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #7: Envoy::Router::RetryStateImpl::create() [0x55781a405261]
[2021-01-17 12:53:27.240][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #8: Envoy::Router::ProdFilter::createRetryState() [0x55781a3cd36f]

I can see the previous_priorities extension is in place in this Envoy binary as expected:

[2021-01-17 12:53:21.350][70][info][main] [source/server/server.cc:325] statically linked extensions:
...
[2021-01-17 12:53:21.350][70][info][main] [source/server/server.cc:327]   envoy.retry_priorities: envoy.retry_priorities.previous_priorities

I'm using the Envoy 1.17.0 binary from the envoyproxy/envoy:v1.17.0 image, and from the envoyproxy/envoy-debug:v1.17.0 image when collecting debug output from the stack trace above, so nothing remarkable on that front:

envoy@acfa1297d53b:/$ envoy --version

envoy  version: 5c801b25cae04f06bf48248c90e87d623d7a6283/1.17.0/Clean/RELEASE/BoringSSL

Is the example above (and in the retry plugin documentation itself) the correct syntax to utilize the previous-priorities plugin? If so, does this indicate a deeper problem in Envoy with the type URL not existing when it should?

For background here - we're trying to use Envoy as a far more capable reverse proxy than NGINX, but are running into a challenge with retry handling logic. The simplest version of our goal can be described with a use case in which we have only two upstreams: one at priority 0 (the "primary upstream") that should handle all traffic by default, and one at priority 1 (the "backup upstream") that should be used when the priority 0 upstream is unavailable (see the sketch below). We can achieve this functionality using active health checks, but would like to avoid that as our sole approach - especially because there's latency before ill health of the first upstream is detected. An ideal approach here is to retry a wide variety of failed idempotent requests against the backup upstream, and to retry a much narrower set of failed non-idempotent requests as well (pretty much just outright connect failures). In NGINX we can use a proxy_next_upstream directive (http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_next_upstream) to achieve this outcome, since NGINX will simply move on from the failing upstream member to the next available member, even if it's a backup member.
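For illustration, a minimal sketch of the two-priority cluster layout described above (the hostnames primary-upstream and backup-upstream and the port are hypothetical placeholders, not our actual config):

load_assignment:
  cluster_name: healthtest
  endpoints:
  - priority: 0   # primary upstream: handles all traffic by default
    lb_endpoints:
    - endpoint:
        address:
          socket_address:
            address: primary-upstream
            port_value: 9444
  - priority: 1   # backup upstream: used only when priority 0 is unavailable
    lb_endpoints:
    - endpoint:
        address:
          socket_address:
            address: backup-upstream
            port_value: 9444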

In Envoy the first approach I tried was to use the envoy.retry_host_predicates.previous_hosts host predicate in our route's retry policy, to prevent Envoy from simply rerunning the same host selection algorithm and selecting the primary upstream again. However, this leads to failure, with UF and URX response flags confirming that host selection just kept picking the sole priority 0 upstream and the host predicate kept rejecting it.

Given that result I next tried to use the envoy.retry_priorities.previous_priorities priority predicate, planning to dial the update_frequency down to 1 but running into the issue described above. Is there a more elegant way to achieve this outcome?

I can provide more exhaustive YAML config files if it helps, although the behavior should be reasonably clear from the examples above.

@htuch (Member) commented Jan 17, 2021

At a quick glance, it looks like this is being configured correctly. Is there anything else in the logs if you run with -l trace?

@snowp thoughts on configuration?
@phlax for documentation and example verification visibility.

@brodmerkle (Author)

Not a lot that looks relevant - here for reference is some context around the error when running with trace logging (I'm running a local config.yaml that references a filesystem rds.yaml with the route definitions):

[2021-01-17 15:05:35.127][93][debug][init] [source/common/init/manager_impl.cc:53] init manager Server initializing
[2021-01-17 15:05:35.127][93][debug][init] [source/common/init/target_impl.cc:15] init manager Server initializing target Listener-init-target listener-8443
[2021-01-17 15:05:35.127][93][debug][init] [source/common/init/manager_impl.cc:53] init manager Listener-local-init-manager listener-8443 2936379492351167496 initializing
[2021-01-17 15:05:35.127][93][debug][init] [source/common/init/target_impl.cc:15] init manager Listener-local-init-manager listener-8443 2936379492351167496 initializing shared target RdsRouteConfigSubscription init local-routes
[2021-01-17 15:05:35.127][93][debug][init] [source/common/init/manager_impl.cc:53] init manager RDS local-init-manager local-routes initializing
[2021-01-17 15:05:35.127][93][debug][init] [source/common/init/target_impl.cc:15] init manager RDS local-init-manager local-routes initializing target RdsRouteConfigSubscription local-init-target local-routes
[2021-01-17 15:05:35.127][93][debug][config] [source/common/config/filesystem_subscription_impl.cc:61] Filesystem config refresh for /apps/envoy/rds.yaml
[2021-01-17 15:05:35.128][93][warning][config] [source/common/config/filesystem_subscription_impl.cc:78] Filesystem config update failure: Unable to parse JSON as proto (INVALID_ARGUMENT:(virtual_hosts[0].routes[2].route.retry_policy.retry_priority.typed_config): invalid value Invalid type URL, unknown type: envoy.extensions.retry.priority.previous_priorities.v3.PreviousPrioritiesConfig for type Any): {"resources":{"virtual_hosts":[{"name":"http","domains":["*"],"routes":[{"direct_response":{"status":200,"body":{"inline_string":"THE SERVER IS UP"}},"match":{"prefix":"/x-tree/F5Monitor.html"}},{"match":{"prefix":"/","headers":[{"safe_regex_match":{"google_re2":{},"regex":"POST|PUT|DELETE"},"name":":method","invert_match":true}]},"route":{"retry_policy":{"retry_priority":{"typed_config":{"update_frequency":1,"@type":"type.googleapis.com/envoy.extensions.retry.priority.previous_priorities.v3.PreviousPrioritiesConfig"},"name":"envoy.retry_priorities.previous_priorities"},"num_retries":1,"retry_on":"5xx"},"cluster":"healthtest","timeout":"60s"}},{"route":{"timeout":"60s","retry_policy":{"num_retries":1,"retry_on":"connect-failure","retry_priority":{"name":"envoy.retry_priorities.previous_priorities","typed_config":{"@type":"type.googleapis.com/envoy.extensions.retry.priority.previous_priorities.v3.PreviousPrioritiesConfig","update_frequency":1}}},"cluster":"healthtest"},"match":{"headers":[{"name":":method","invert_match":false,"safe_regex_match":{"google_re2":{},"regex":"POST|PUT|DELETE"}}],"prefix":"/"}}]}],"name":"local-routes","@type":"type.googleapis.com/envoy.config.route.v3.RouteConfiguration"}}
[2021-01-17 15:05:35.128][93][debug][init] [source/common/init/watcher_impl.cc:14] target RdsRouteConfigSubscription local-init-target local-routes initialized, notifying init manager RDS local-init-manager local-routes
[2021-01-17 15:05:35.128][93][debug][init] [source/common/init/watcher_impl.cc:14] init manager RDS local-init-manager local-routes initialized, notifying RDS local-init-watcher local-routes

For reference our config.yaml is:

admin:
  access_log_path: /dev/stdout
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 9901
dynamic_resources:
  cds_config:
    path: /apps/envoy/cds.yaml
layered_runtime:
  layers:
  - name: static-layer
    static_layer:
      overload.global_downstream_max_connections: 10000
node:
  cluster: envoy
  id: envoy-worker
static_resources:
  listeners:
  - address:
      socket_address:
        address: 0.0.0.0
        port_value: 8443
    filter_chains:
    - filters:
      - name: httpfilter
        typed_config:
          '@type': type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          access_log:
            typed_config:
              '@type': type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
              log_format:
                text_format: "[logtype=access] x-client-ip=\"%REQ(X-CLIENT-IP)%\"\
                  \ time=[%START_TIME%] request=\"%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%\
                  \ %PROTOCOL%\" http_status=%RESPONSE_CODE% content_length=%BYTES_SENT%\
                  \ user_agent=\"%REQ(USER-AGENT)%\" duration=%DURATION% fsreqid=\"\
                  %REQ(FSREQID)%\" srcip=%DOWNSTREAM_DIRECT_REMOTE_ADDRESS_WITHOUT_PORT%\
                  \ rflags=\"%RESPONSE_FLAGS%\" upstream=\"%UPSTREAM_HOST%\"\n "
              path: /dev/stdout
          common_http_protocol_options:
            idle_timeout: 65s
            max_connection_duration: 300s
            max_headers_count: 100
          http_filters:
          - name: envoy.filters.http.compressor
            typed_config:
              '@type': type.googleapis.com/envoy.extensions.filters.http.compressor.v3.Compressor
              compressor_library:
                name: text_optimized
                typed_config:
                  '@type': type.googleapis.com/envoy.extensions.compression.gzip.compressor.v3.Gzip
              content_length: 1024
          - name: envoy.filters.http.router
          http_protocol_options:
            accept_http_10: true
            default_host_for_http_10: d35aa7ddfe2d
          rds:
            config_source:
              path: /apps/envoy/rds.yaml
            route_config_name: local-routes
          stat_prefix: http
      transport_socket:
        name: transport-socket-8443
        typed_config:
          '@type': type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
          common_tls_context:
            tls_certificates:
            - certificate_chain:
                filename: /apps/envoy/certs/listener-cert.pem
              private_key:
                filename: /apps/envoy/certs/listener-key.pem
            tls_params:
              cipher_suites: '[ECDHE-RSA-AES128-GCM-SHA256|ECDHE-RSA-AES256-GCM-SHA384|ECDHE-ECDSA-AES128-GCM-SHA256|ECDHE-ECDSA-AES256-GCM-SHA384]'
              tls_minimum_protocol_version: TLSv1_2
    name: listener-8443

The rds.yaml is:

resources:
  '@type': type.googleapis.com/envoy.config.route.v3.RouteConfiguration
  name: local-routes
  virtual_hosts:
  - domains:
    - '*'
    name: http
    routes:
    - direct_response:
        body:
          inline_string: THE SERVER IS UP
        status: 200
      match:
        prefix: /x-tree/F5Monitor.html
    - match:
        headers:
        - invert_match: true
          name: :method
          safe_regex_match:
            google_re2: {}
            regex: POST|PUT|DELETE
        prefix: /
      route:
        cluster: healthtest
        retry_policy:
          num_retries: 1
          retry_on: 5xx
          retry_priority:
            name: envoy.retry_priorities.previous_priorities
            typed_config:
              '@type': type.googleapis.com/envoy.extensions.retry.priority.previous_priorities.v3.PreviousPrioritiesConfig
              update_frequency: 1
        timeout: 60s
    - match:
        headers:
        - invert_match: false
          name: :method
          safe_regex_match:
            google_re2: {}
            regex: POST|PUT|DELETE
        prefix: /
      route:
        cluster: healthtest
        retry_policy:
          num_retries: 1
          retry_on: connect-failure
          retry_priority:
            name: envoy.retry_priorities.previous_priorities
            typed_config:
              '@type': type.googleapis.com/envoy.extensions.retry.priority.previous_priorities.v3.PreviousPrioritiesConfig
              update_frequency: 1
        timeout: 60s

I can provide the cds.yaml if needed, although I don't think it's relevant here so for now I won't silt up the issue with it.

@brodmerkle (Author)

FYI - I reviewed the Envoy startup output and addressed all deprecation warnings on the vague hope that one of them might be related (I was especially hopeful about adding a resource_api_version set to V3 on the rds config_source, sketched below, thinking it might somehow unlock the use of v3 protos in rds.yaml), but alas none of these updates had any effect on the observed behavior.
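For reference, roughly what that resource_api_version tweak looked like, applied to the rds stanza from the config.yaml above:

rds:
  config_source:
    path: /apps/envoy/rds.yaml
    resource_api_version: V3
  route_config_name: local-routes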

As another test I swapped out the retry_priority stanza for an envoy.retry_host_predicates.omit_host_metadata retry_host_predicate (basically using the syntax shown on https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/http/http_connection_management, roughly as sketched below), simply to see if any typed_config would work inside rds.yaml. This configuration did not produce any type URL or other validation errors on startup - so the problem here seems to be very specific to the PreviousPrioritiesConfig.
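Roughly the shape I tested (the envoy.lb key/value metadata is just the placeholder from the docs example, not anything meaningful in our setup):

retry_policy:
  retry_host_predicate:
  - name: envoy.retry_host_predicates.omit_host_metadata
    typed_config:
      '@type': type.googleapis.com/envoy.extensions.retry.host.omit_host_metadata.v3.OmitHostMetadataConfig
      metadata_match:
        filter_metadata:
          envoy.lb:
            key: value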

@htuch (Member) commented Jan 18, 2021

I think I see the issue; the actual implementation of this extension is linking and using v2 config, rather than v3 config - see:

"@envoy_api//envoy/config/retry/previous_priorities:pkg_cc_proto",

I'll dig into this some more tomorrow; I'm surprised we haven't caught this in tests around v2 deprecated config.

htuch self-assigned this Jan 18, 2021
@htuch (Member) commented Jan 18, 2021

@ankatare in all the cases where you are adding v3 config, you need to make sure that the code consuming the new protos is also upgraded to link in these protos and to accept both v2 and v3 configurations. I'm actually thinking we probably want to not do the denylist of v2 --> v3 upgrade, but instead move this elsewhere (maybe back to denylisting v2 --> v3 upgrade exceptions). This is kind of why I was asking for proper testing. FWIW, your changes are unrelated to #14735, but I think we will see the exact same behavior.

@brodmerkle (Author)

I've been trying today to configure the envoy.retry_host_predicates.previous_hosts retry host predicate, but I'm finding that it never takes effect. I suspect this is actually due to the same v2/v3 proto issue we're covering in this issue, so I'll describe the problem here and we can sort out whether it's worth a new issue or folds into the existing remediation effort.

I'm still on Envoy 1.17.0 as described above, and again using the most basic definition from https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/http/http_connection_management#arch-overview-http-retry-plugins, in this case this snippet:

retry_policy:
  retry_host_predicate:
  - name: envoy.retry_host_predicates.previous_hosts

With no typed_config reference I don't hit any invalid-type-URL errors, but the previous_hosts logic simply never takes effect. To confirm this I created a two-member cluster with a simple v3 config (brodmerkle1 is not listening on 9444 while brodmerkle2 is - so half the time I'd expect a retry to occur, at least until outlier detection spots enough consecutive local origin failures):

resources:
- '@type': type.googleapis.com/envoy.config.cluster.v3.Cluster
  circuit_breakers:
    thresholds:
    - max_connections: 1024
  connect_timeout: 2s
  lb_policy: LEAST_REQUEST
  load_assignment:
    cluster_name: post-retry
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: brodmerkle1
              port_value: 9444
      - endpoint:
          address:
            socket_address:
              address: brodmerkle2
              port_value: 9444
      priority: 0
  name: post-retry
  respect_dns_ttl: true
  transport_socket:
    name: post-retry-transport-socket
    typed_config:
      '@type': type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
      common_tls_context:
        tls_params:
          cipher_suites: '[ECDHE-RSA-AES128-GCM-SHA256|ECDHE-RSA-AES256-GCM-SHA384|ECDHE-ECDSA-AES128-GCM-SHA256|ECDHE-ECDSA-AES256-GCM-SHA384]'
          tls_minimum_protocol_version: TLSv1_2
  type: STRICT_DNS
  typed_extension_protocol_options:
    envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
      '@type': type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
      common_http_protocol_options:
        idle_timeout: 60s
        max_connection_duration: 300s
        max_headers_count: 100
      explicit_http_config:
        http_protocol_options: {}
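Outlier detection settings aren't shown in the paste above; purely for illustration, a minimal hypothetical block tracking consecutive local-origin failures (using the cluster.v3 OutlierDetection fields) would look roughly like:

outlier_detection:
  split_external_local_origin_errors: true   # track local-origin (e.g. connect) failures separately
  consecutive_local_origin_failure: 3        # eject a host after 3 consecutive local-origin failures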

I then built the simplest possible route configuration to demonstrate the problem:

resources:
  '@type': type.googleapis.com/envoy.config.route.v3.RouteConfiguration
  name: local-routes
  virtual_hosts:
  - domains:
    - '*'
    name: http
    routes:
    - match:
        prefix: /post-retry
      route:
        cluster: post-retry
        retry_policy:
          num_retries: 1
          retry_host_predicate:
          - name: envoy.retry_host_predicates.previous_hosts
          retry_on: 5xx
        timeout: 60s

When hitting Envoy immediately after startup I do occasionally experience 503 responses, and I can see in the trace logs that Envoy is selecting the same host twice, for example:

[2021-01-23 19:27:30.631][124][debug][router] [source/common/router/router.cc:425] [C23][S13529278531814971024] cluster 'post-retry' match for URL '/post-retry'
[2021-01-23 19:27:30.631][124][debug][router] [source/common/router/router.cc:582] [C23][S13529278531814971024] router decoding headers:
':authority', 'brodmerkle1:8443'
':path', '/post-retry'
':method', 'GET'
':scheme', 'https'
'user-agent', 'curl/7.58.0'
'accept', '*/*'
'x-forwarded-proto', 'https'
'x-request-id', '56b2be6f-07df-4638-bffe-f582d0304611'
'x-envoy-expected-rq-timeout-ms', '60000'

[2021-01-23 19:27:30.631][124][debug][pool] [source/common/http/conn_pool_base.cc:79] queueing stream due to no available connections
[2021-01-23 19:27:30.631][124][debug][pool] [source/common/conn_pool/conn_pool_base.cc:106] creating a new connection
[2021-01-23 19:27:30.631][124][debug][client] [source/common/http/codec_client.cc:41] [C24] connecting
[2021-01-23 19:27:30.631][124][debug][connection] [source/common/network/connection_impl.cc:860] [C24] connecting to 10.147.1.144:9444
[2021-01-23 19:27:30.631][124][debug][connection] [source/common/network/connection_impl.cc:876] [C24] connection in progress
[2021-01-23 19:27:30.631][124][trace][http] [source/common/http/filter_manager.cc:498] [C23][S13529278531814971024] decode headers called: filter=0x17333f48b500 status=1
[2021-01-23 19:27:30.631][124][trace][misc] [source/common/event/scaled_range_timer_manager_impl.cc:60] enableTimer called on 0x17333f48a380 for 300000ms, min is 300000ms
[2021-01-23 19:27:30.631][124][trace][http] [source/common/http/http1/codec_impl.cc:620] [C23] parsed 102 bytes
[2021-01-23 19:27:30.631][124][trace][connection] [source/common/network/connection_impl.cc:548] [C23] socket event: 2
[2021-01-23 19:27:30.631][124][trace][connection] [source/common/network/connection_impl.cc:657] [C23] write ready
[2021-01-23 19:27:30.669][124][trace][connection] [source/common/network/connection_impl.cc:548] [C24] socket event: 3
[2021-01-23 19:27:30.669][124][trace][connection] [source/common/network/connection_impl.cc:657] [C24] write ready
[2021-01-23 19:27:30.669][124][debug][connection] [source/common/network/connection_impl.cc:675] [C24] delayed connection error: 111
[2021-01-23 19:27:30.669][124][debug][connection] [source/common/network/connection_impl.cc:241] [C24] closing socket: 0
[2021-01-23 19:27:30.669][124][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:224] [C24]
[2021-01-23 19:27:30.669][124][trace][connection] [source/common/network/connection_impl.cc:407] [C24] raising connection event 0
[2021-01-23 19:27:30.669][124][debug][client] [source/common/http/codec_client.cc:99] [C24] disconnect. resetting 0 pending requests
[2021-01-23 19:27:30.669][124][debug][pool] [source/common/conn_pool/conn_pool_base.cc:343] [C24] client disconnected, failure reason:
[2021-01-23 19:27:30.669][124][debug][router] [source/common/router/router.cc:1026] [C23][S13529278531814971024] upstream reset: reset reason: connection failure, transport failure reason:
[2021-01-23 19:27:30.669][124][trace][main] [source/common/event/dispatcher_impl.cc:213] item added to deferred deletion list (size=1)
[2021-01-23 19:27:30.669][124][trace][main] [source/common/event/dispatcher_impl.cc:89] clearing deferred deletion list (size=1)
[2021-01-23 19:27:30.685][124][debug][router] [source/common/router/router.cc:1530] [C23][S13529278531814971024] performing retry
[2021-01-23 19:27:30.685][124][debug][pool] [source/common/http/conn_pool_base.cc:79] queueing stream due to no available connections
[2021-01-23 19:27:30.685][124][debug][pool] [source/common/conn_pool/conn_pool_base.cc:106] creating a new connection
[2021-01-23 19:27:30.685][124][debug][client] [source/common/http/codec_client.cc:41] [C25] connecting
[2021-01-23 19:27:30.685][124][debug][connection] [source/common/network/connection_impl.cc:860] [C25] connecting to 10.147.1.144:9444
[2021-01-23 19:27:30.685][124][debug][connection] [source/common/network/connection_impl.cc:876] [C25] connection in progress
[2021-01-23 19:27:30.722][124][trace][connection] [source/common/network/connection_impl.cc:548] [C25] socket event: 3
[2021-01-23 19:27:30.723][124][trace][connection] [source/common/network/connection_impl.cc:657] [C25] write ready
[2021-01-23 19:27:30.723][124][debug][connection] [source/common/network/connection_impl.cc:675] [C25] delayed connection error: 111
[2021-01-23 19:27:30.723][124][debug][connection] [source/common/network/connection_impl.cc:241] [C25] closing socket: 0
[2021-01-23 19:27:30.723][124][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:224] [C25]
[2021-01-23 19:27:30.723][124][trace][connection] [source/common/network/connection_impl.cc:407] [C25] raising connection event 0
[2021-01-23 19:27:30.723][124][debug][client] [source/common/http/codec_client.cc:99] [C25] disconnect. resetting 0 pending requests
[2021-01-23 19:27:30.723][124][debug][pool] [source/common/conn_pool/conn_pool_base.cc:343] [C25] client disconnected, failure reason:
[2021-01-23 19:27:30.723][124][debug][router] [source/common/router/router.cc:1026] [C23][S13529278531814971024] upstream reset: reset reason: connection failure, transport failure reason:
[2021-01-23 19:27:30.723][124][debug][http] [source/common/http/filter_manager.cc:839] [C23][S13529278531814971024] Sending local reply with details upstream_reset_before_response_started{connection failure}

Nothing in the code suggests that this should be possible - the previous-hosts logic at https://github.com/envoyproxy/envoy/blob/v1.17.0/source/extensions/retry/host/previous_hosts/previous_hosts.h#L12 is simple enough, and shouldSelectAnotherHost just checks whether any of the retry host predicates rejected the host selection at https://github.com/envoyproxy/envoy/blob/v1.17.0/source/common/router/retry_state_impl.h#L75.

However, as with previous_priorities, the previous_hosts extension is built against v2 config instead of v3 - see https://github.com/envoyproxy/envoy/blob/v1.17.0/source/extensions/retry/host/previous_hosts/BUILD#L29.
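For reference, my understanding is that the v3 counterpart, once wired up, would look roughly like this (the type URL here is my assumption based on the v3 API tree, not something I've verified working):

retry_policy:
  retry_host_predicate:
  - name: envoy.retry_host_predicates.previous_hosts
    typed_config:
      '@type': type.googleapis.com/envoy.extensions.retry.host.previous_hosts.v3.PreviousHostsPredicate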

Is this then another manifestation of the same v2/v3 problem that will fall into scope of the PR work being discussed here, or should I split this into another issue with a potentially different fix?

@htuch (Member) commented Jan 25, 2021

@brodmerkle looking at test/extensions/retry/host/previous_hosts, it seems we only have a config test. @snowp are there other tests that demonstrate functional behavior for this extension? We may not even have the v2 behavior working.

Re: the v2/v3 dependency and upgrade issue - I haven't forgotten about this; I just didn't have any time last week to tackle it. Happy to have someone else in the community dig into it if they have cycles; otherwise I'm hoping to get some time this week.

@snowp (Contributor) commented Jan 25, 2021

I believe there aren't any integration tests that exercise the previous_hosts one specifically, just one that exercises a test extension. I can look into getting an integration test added; it shouldn't be that hard.

Re whether it's working properly: I know that it worked originally (users reported a substantial increase in first-retry success rate by using it), and I'm not familiar with any changes that would have broken it. My guess is that the reproducer is showing the limitations of the host retry extensions: they don't enforce a strict "never retry the same host", but rather attempt load balancing N times until a host that doesn't match the predicate is found. This means that, especially for random load balancers (RANDOM, LEAST_REQUEST), you'd likely see some requests routed to the same host, especially when host_selection_retry_max_attempts is left unset (see the sketch below).
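For illustration, roughly what raising that knob looks like on the route's retry policy (the value 3 here is arbitrary); it gives the load balancer more attempts to find a host the predicate accepts:

retry_policy:
  num_retries: 1
  retry_on: 5xx
  host_selection_retry_max_attempts: 3
  retry_host_predicate:
  - name: envoy.retry_host_predicates.previous_hosts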

htuch added a commit to htuch/envoy that referenced this issue Feb 2, 2021
* Add v3 equivalents of v2 configs that were included in v3 due to no
  transitive deprecation. This increases consistency and reduces user
  confusion. We will continue to support these straggler v2 configs
  beyond the v2 turndown due to the late addition of v3 counterparts;
  special-case code is added to utility.cc to handle this.

* There were two extensions, //envoy/config/cluster/redis and
  //envoy/config/retry/previous_priorities, that for some reason were
  not upgraded to use v3 config. This is now fixed and I've grepped for
  other v2 in //source, none remain.

Risk level: Medium (changes to extension config types and deprecated
  config handling).
Testing: Additional unit test added for utility.cc handling, upgraded
  configs to v3 for other tests.

Fixes envoyproxy#14735
Fixes envoyproxy#12841

Signed-off-by: Harvey Tuch <htuch@google.com>
Co-authored-by: Abhay Narayan Katare <abhay.katare@india.nec.com>
@htuch (Member) commented Feb 2, 2021

I finally had some cycles to work on this, #14907 is the v3 API fix.

lizan pushed a commit that referenced this issue Feb 11, 2021