
Health Discovery Service #1310

Closed
htuch opened this issue Jul 21, 2017 · 16 comments
Labels
api/v3 (Major version release @ end of Q3 2019), no stalebot (Disables stalebot from closing an issue)


@htuch
Member

htuch commented Jul 21, 2017

The v2 APIs introduce the Health Discovery Service (HDS) - https://github.com/lyft/envoy-api/blob/master/api/hds.proto (@amb67). It works as follows (pasting in the .proto comment; a rough sketch of the HealthCheckSpecifier follows the list):

  1. Envoy starts up and, if its can_healthcheck option in the static bootstrap config is enabled, sends a HealthCheckRequest to the management server. It supplies its capabilities (which protocols it can health check with, what zone it resides in, etc.).
  2. In response to (1), the management server designates this Envoy as a healthchecker to health check a subset of all upstream hosts for a given cluster (for example, upstream Host 1 and Host 2). It streams HealthCheckSpecifier messages with cluster-related configuration for all clusters this Envoy is designated to health check. Subsequent HealthCheckSpecifier messages will be sent on changes to:
    a. Endpoints to health check
    b. Per-cluster configuration
  3. Envoy creates a health probe based on the HealthCheck config and sends it to the endpoint (ip:port) of Hosts 1 and 2. Based on the HealthCheck configuration, Envoy waits for the probe response and inspects its content to decide whether the endpoint is healthy. If a response hasn’t been received within the timeout interval, the endpoint health status is considered TIMEOUT.
  4. Envoy reports results back in an EndpointHealthResponse message. Envoy streams responses at the interval configured by the management server in HealthCheckSpecifier.
  5. The management server collects health statuses for all endpoints in the cluster (for all clusters) and uses this information to construct EDS DiscoveryResponse messages.
  6. Once Envoy has a list of upstream endpoints to send traffic to, it load balances traffic to them without additional health checking. It may use inline health checking (i.e. consider an endpoint UNHEALTHY if a connection to it fails) to account for the health status propagation delay between HDS and EDS.
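
For concreteness, the HealthCheckSpecifier streamed in step 2 might look roughly like the following (a hand-written sketch in YAML form; field names follow hds.proto, while the cluster name and addresses are made up):

# Sketch only: an illustrative HealthCheckSpecifier from the management server.
cluster_health_checks:
- cluster_name: some_cluster
  health_checks:
  - timeout: 1s
    interval: 5s
    unhealthy_threshold: 2
    healthy_threshold: 2
    http_health_check:
      path: /healthcheck
  locality_endpoints:
  - endpoints:
    - address:
        socket_address:
          address: 10.0.0.1   # "Host 1"
          port_value: 8080
    - address:
        socket_address:
          address: 10.0.0.2   # "Host 2"
          port_value: 8080
interval: 5s   # how often Envoy streams EndpointHealthResponse back (step 4)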

This issue will track implementation work on HDS.

@htuch htuch added the api/v2 label Jul 21, 2017
@htuch htuch added the help wanted Needs help! label Aug 10, 2017
@htuch htuch removed the help wanted Needs help! label Jun 12, 2018
This was referenced Jun 19, 2018
htuch pushed a commit that referenced this issue Jun 26, 2018
Added a health discovery service config option in bootstrap.proto.
There is skeleton code for this service in source/common/upstream/health_discovery_service.cc
and an integration test that tests connectivity between an Envoy running HDS and a management server.

This work is for #1310.

Risk Level: Low

Signed-off-by: Lilika Markatou <lilika@google.com>
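
For reference, the bootstrap option this adds is configured roughly as follows (a minimal sketch; the xds_cluster name is illustrative and must refer to a static cluster pointing at the management server):

hds_config:
  api_type: GRPC
  grpc_services:
  - envoy_grpc:
      cluster_name: xds_cluster   # illustrative: a static cluster for the HDS server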
@stale

stale bot commented Jul 12, 2018

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Jul 12, 2018
@htuch htuch added no stalebot Disables stalebot from closing an issue and removed stale stalebot believes this issue/PR has not been touched recently labels Jul 12, 2018
htuch pushed a commit that referenced this issue Aug 6, 2018
This pull request contains a basic implementation of HDS, where a management server can request an HTTP healthcheck for any number of endpoints, the HdsDelegate health checks them, and it reports back. The code also includes TODOs to help identify the work that needs to be done next, such as supporting updates to the set of endpoints that require health checking.

Risk Level: Low

Testing:
There are integration tests in test/integration/hds_integration_test.cc
and unit tests in test/common/upstream/hds_test.cc.

This work is for #1310.

Signed-off-by: Lilika Markatou <lilika@google.com>
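
The report that the HdsDelegate streams back is an EndpointHealthResponse, roughly of this shape (a sketch; field names as they appear in hds.proto, address illustrative):

# Sketch only: EndpointHealthResponse streamed back by the HdsDelegate.
endpoints_health:
- endpoint:
    address:
      socket_address:
        address: 10.0.0.1
        port_value: 8080
  health_status: HEALTHY   # or UNHEALTHY / TIMEOUT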
htuch pushed a commit that referenced this issue Aug 10, 2018
Renaming message types in api/envoy/service/discovery/v2/hds.proto to improve readability

Risk Level:
Low

This is for #1310.

Signed-off-by: Lilika Markatou <lilika@google.com>
htuch pushed a commit that referenced this issue Aug 12, 2018
The HdsDelegate now informs the management server that it can perform TCP health checks.

This change also comes with three tests showing that the HdsDelegate can indeed perform TCP health checks.

Risk Level: Low

This is for #1310.

Signed-off-by: Lilika Markatou <lilika@google.com>
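
The capability advertisement happens in the initial HealthCheckRequest on the stream, roughly like this (a sketch; the capability and health_check_protocols field names are assumed from hds.proto, and the node values are illustrative):

# Sketch only: initial request now advertising both HTTP and TCP health checking.
health_check_request:
  node:
    id: some-envoy          # illustrative
    cluster: some-cluster   # illustrative
  capability:
    health_check_protocols:
    - HTTP
    - TCP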
htuch pushed a commit that referenced this issue Aug 12, 2018
…ction failures (#4108)

I changed the retry strategy of the HdsDelegate on stream/connection failures. Instead of retrying every fixed number of seconds, we now use a jittered backoff strategy, as in #3791.

Risk Level:
Low

This is for #1310.

Signed-off-by: Lilika Markatou <lilika@google.com>
htuch pushed a commit that referenced this issue Aug 20, 2018
When reporting back to the management server, the HdsDelegate can now tell it whether an unhealthy host responded that it is unhealthy or whether the request timed out.

Risk Level:
Low

This is for #1310.

Signed-off-by: markatou <lilika@google.com>
htuch pushed a commit that referenced this issue Aug 22, 2018
…obust (#4164)

Follow up on #4149.

I increased any unused timers to 100 seconds. Additionally, instead of waiting on a timer to go off, all but one test wait on cluster stats to change.
I decreased the timer wait times, reducing the test runtime from 25 seconds to 5.

Exceptions:

During TCP health checking, if an endpoint responds with the wrong phrase, the cluster failure stats do not change, so the one test that examines this case still depends on timers.
There are two tests that cover hosts timing out. I set those timeouts to 0.1 seconds (vs. 100), but both tests still wait on the cluster stats to change.
Risk Level:
Low

Testing:
I ran the test locally, and it passed 3000/3000 times.

This work is for #1310.

Signed-off-by: markatou <lilika@google.com>
@sschepens
Contributor

@htuch I've been meaning to test and probably use HDS, but I bumped into #5279, which made me think.
What's the status of HDS?
What remains to be done? It would be great to have a list of TODOs; I can probably tackle some of them.
For example, I find that the EndpointHealthResponse that gets sent to the control plane is probably lacking information, such as which cluster the target belongs to.
I can also imagine HDS being a bit more intelligent, e.g. not reporting health check results at a fixed interval (using the health check interval instead?), or reporting only when a status has changed or a new stream has been established.

@htuch
Member Author

htuch commented Dec 13, 2018

@sschepens I left some notes in #5279 (comment).

If you'd like to continue to drive HDS, taking it from its current alpha status to production ready, that would be totally rad. I'd be happy to do reviews and have design discussions. We'd like to iterate here. It's likely that Google will also have bandwidth to help with the implementation in the next 1-2 quarters, but no guarantees at this point.

@sschepens
Contributor

@htuch I'd be totally willing for me and my team to continue with HDS and hopefully get it production ready.
What would be the channel for design discussions? I also want to understand previous design decisions.

@htuch
Member Author

htuch commented Dec 19, 2018

@sschepens this issue or Slack; we could create an HDS channel. Also, happy to do a Hangouts meeting to context dump.

@htuch htuch added api/v3 Major version release @ end of Q3 2019 and removed api/v2 labels Mar 12, 2020
@banks
Contributor

banks commented Apr 28, 2020

I see that health moved out of service/discovery/v2 into its own service/health/v3 for v3. Is that actually usable now? I tried to run a v3 service and bootstrap Envoy with:

hds_config:
  api_type: GRPC
  transport_api_version: V3
  grpc_services:
  - envoy_grpc:
      cluster_name: xds_cluster
  set_node_on_first_message_only: true

but I see in the logs that it's still failing with unknown service envoy.service.discovery.v2.HealthDiscoveryService. This is Envoy 1.14.1.

I don't think I see differences from v2 yet; I'm just wondering what the "right" way to work with HDS is now that v2 is supposed to be deprecated?

If I can get the basics working here, I might be able to find resources to contribute to any additional work needed, as this could become a critical feature for us in the next 6 months or so.

@htuch
Member Author

htuch commented Apr 29, 2020

@banks yeah, I think you're seeing the effect of #10609

@banks
Contributor

banks commented Apr 29, 2020

I'm also struggling to get HDS to work at all in 1.14.1. I can file a bug separately if that helps, but I'm not sure whether it's expected to work given that this issue is still open.

Whenever I send HealthChecks for a cluster (I've tried both with static clusters and clusters delivered via xDS), Envoy crashes:

Envoy Debug output
[2020-04-29 14:21:18.511][3530335][debug][upstream] [external/envoy/source/common/upstream/health_discovery_service.cc:187] New health check response message cluster_health_checks {
  cluster_name: "cluster-00000"
  health_checks {
    timeout {
      seconds: 1
    }
    interval {
      seconds: 5
    }
    http_health_check {
      host: "localhost"
      path: "/v1/agent/self"
    }
  }
}
interval {
  seconds: 5
}

[2020-04-29 14:21:18.511][3530335][debug][upstream] [external/envoy/source/common/upstream/health_discovery_service.cc:140] New health check response message cluster_health_checks {
  cluster_name: "cluster-00000"
  health_checks {
    timeout {
      seconds: 1
    }
    interval {
      seconds: 5
    }
    http_health_check {
      host: "localhost"
      path: "/v1/agent/self"
    }
  }
}
interval {
  seconds: 5
}

[2020-04-29 14:21:18.511][3530335][debug][upstream] [external/envoy/source/common/upstream/health_discovery_service.cc:169] New HdsCluster config name: "cluster-00000"
connect_timeout {
  seconds: 1
}
per_connection_buffer_limit_bytes {
  value: 32768
}
health_checks {
  timeout {
    seconds: 1
  }
  interval {
    seconds: 5
  }
  http_health_check {
    host: "localhost"
    path: "/v1/agent/self"
  }
}
load_assignment {
  endpoints {
  }
}

[2020-04-29 14:21:18.511][3530335][debug][upstream] [external/envoy/source/common/upstream/health_discovery_service.cc:228] Creating an HdsCluster
[2020-04-29 14:21:18.516][3530335][critical][main] [external/envoy/source/exe/terminate_handler.cc:13] std::terminate called! (possible uncaught exception, see trace)
[2020-04-29 14:21:18.516][3530335][critical][backtrace] [bazel-out/darwin-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2020-04-29 14:21:18.516][3530335][critical][backtrace] [bazel-out/darwin-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:92] Envoy version: 3504d40f752eb5c20bc2883053547717bcb92fd8/1.14.1/clean-getenvoy-902f20f-envoy/RELEASE/BoringSSL
[2020-04-29 14:21:18.516][3530335][critical][backtrace] [bazel-out/darwin-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:104] Caught Abort trap: 6, suspect faulting address 0x7fff70d6d2c2
[2020-04-29 14:21:18.516][3530335][critical][backtrace] [bazel-out/darwin-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2020-04-29 14:21:18.516][3530335][critical][backtrace] [bazel-out/darwin-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:92] Envoy version: 3504d40f752eb5c20bc2883053547717bcb92fd8/1.14.1/clean-getenvoy-902f20f-envoy/RELEASE/BoringSSL
AsyncClient 0x9a30d00, stream_id_: 15660133565381783220
&stream_info_:
  StreamInfoImpl 0x9a30eb0, protocol_: 1, response_code_: null, response_code_details_: via_upstream, health_check_request_: 0, route_name_:
fish: 'envoy -c ./bootstrap.yaml -l de…' terminated by signal SIGABRT (Abort)

One interesting thing is that the initial message is repeated - I've checked, and as far as I can see my control plane (a simple test app running locally) is not sending the message twice on the stream.

I also spent about an hour trying to get stack_decode.py to work on a Mac and failed - potentially because this binary (from getenvoy.io) has debug symbols stripped anyway?

Happy to file this as a separate issue if that's most useful.

Does anyone have a basic example of an HDS check actually working in 1.14 or 1.15 they can share?

I think we'd eventually need the TODO item of specific endpoint checks rather than cluster-level ones, but for now I'm just trying to see the mechanism work!

@banks
Contributor

banks commented Apr 29, 2020

I realised that I was missing specifying the actual Endpoints, as I wrongly assumed from the docs that it would get them from the cluster spec.

Even with that, I still see the same crash. Also interesting: the protobuf dump of the message I sent uses different field names than the actual v2 HDS proto definition:

[2020-04-29 16:15:19.112][3773067][debug][upstream] [external/envoy/source/common/upstream/health_discovery_service.cc:169] New HdsCluster config name: "cluster-00000"
connect_timeout {
  seconds: 1
}
per_connection_buffer_limit_bytes {
  value: 32768
}
health_checks {
  timeout {
    seconds: 1
  }
  interval {
    seconds: 5
  }
  http_health_check {
    host: "localhost"
    path: "/v1/agent/self"
  }
}
load_assignment {
  endpoints {
    lb_endpoints {
      endpoint {
        address {
          socket_address {
            address: "127.0.0.1"
            port_value: 8500
          }
        }
      }
    }
  }
}
[2020-04-29 16:15:19.112][3773067][debug][upstream] [external/envoy/source/common/upstream/health_discovery_service.cc:228] Creating an HdsCluster
[2020-04-29 16:15:19.113][3773067][debug][upstream] [external/envoy/source/common/upstream/upstream_impl.cc:275] transport socket match, socket default selected for host with address 127.0.0.1:8500
[2020-04-29 16:15:19.116][3773067][critical][main] [external/envoy/source/exe/terminate_handler.cc:13] std::terminate called! (possible uncaught exception, see trace)

It logs load_assignment where the actual hds.proto calls that field locality_endpoints - no idea whether that's significant though...

The lines just before the crash are different now - something about the socket address - but I'm not sure whether that's significant either.
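
For what it's worth, the naming difference is probably just because the log dumps the internal HdsCluster (a Cluster config, which uses load_assignment) that Envoy builds from the HDS message, rather than the HDS message itself; what the management server sends still uses locality_endpoints, roughly like this (a sketch, field names assumed from hds.proto):

# Sketch only: the ClusterHealthCheck side of what I sent, in hds.proto field names.
cluster_name: cluster-00000
locality_endpoints:
- endpoints:
  - address:
      socket_address:
        address: 127.0.0.1
        port_value: 8500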

@htuch
Member Author

htuch commented Apr 30, 2020

@banks can you share the stack trace? FWIW, we do have an HDS integration test that demonstrates some subset of this working, see https://github.com/envoyproxy/envoy/blob/master/test/integration/hds_integration_test.cc.

@banks
Contributor

banks commented Apr 30, 2020

@htuch Thanks for your help!

I've not managed to get any more stack trace info than what's in my first comment above. Note it's collapsed by default and needs to be expanded. If that's what you mean, then great; otherwise I've not managed to get any further decoding it.

I also spent about an hour trying to get stack_decode.py to work on a Mac and failed - potentially because this binary (from getenvoy.io) has debug symbols stripped anyway?

If it would help, I can try re-running the same config on a Linux VM and see if I can get a more useful trace. I didn't want to spend too many hours shaving that pile of yaks, though, if this report is useful on its own.

Thanks for the link to the integration test. I'll see if I can get my example to work based on that.

@banks
Contributor

banks commented Apr 30, 2020

@htuch that integration test was perfect - I got it to work without a crash just by adding unhealthy_threshold: 2 and healthy_threshold: 2 to the HealthCheck definition.

To be fair, those are documented as required, which I'd missed. I'm not sure whether an uncaught exception is still considered a bug for bad config, though? Hopefully it will be easy to recreate now anyway - I can file a separate issue if that is useful.
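
For anyone else who hits this, the health check definition that worked for me looks roughly like this (same timeout/interval/path as in my earlier dump, with the missing required thresholds added; sketch only):

health_checks:
- timeout: 1s
  interval: 5s
  unhealthy_threshold: 2   # adding these two documented-as-required fields avoided the crash
  healthy_threshold: 2
  http_health_check:
    host: localhost
    path: /v1/agent/self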

@htuch
Member Author

htuch commented Apr 30, 2020

Yeah, this is probably a bug in HDS handling of PGV validation. I think filing a separate issue makes sense (please tag me); the fix is likely only a few lines of code (plus tests...).

@banks
Contributor

banks commented Apr 30, 2020

@htuch I went to file a new issue on GitHub, but the notice about not reporting crashes publicly put me off a bit... It's already public here, but should I still report it via the security list instead?

@htuch
Member Author

htuch commented Apr 30, 2020

@banks it's fine, since this is a control-plane-induced crash, which is outside the threat model as far as embargoes go today. Thanks for checking though.

@mattklein123
Member

I'm going to call this finished. Let's open new issues that are specific to features/bugs/etc.

jpsim pushed a commit that referenced this issue Nov 28, 2022
Description: this PR consolidates all envoy_data related utility functions in order to (a) keep code consistent and (b) more easily cover the functionality with fuzzing. It also moves them from the Buffer namespace to the new (and more aptly named) Data namespace -- credit to @jingwei99.
Risk Level: low - consolidating functionality.
Testing: existing tests pass, giving confidence that no functionality has changed, only consolidation.

Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: JP Simard <jp@jpsim.com>
jpsim pushed a commit that referenced this issue Nov 29, 2022
Description: this PR consolidates all envoy_data related utility functions in order to (a) keep code consistent and (b) more easily cover the functionality with fuzzing. It also moves them from the Buffer namespace to the new (and more aptly named) Data namespace -- credit to @jingwei99.
Risk Level: low - consolidating functionality.
Testing: existing tests pass, giving confidence that no functionality has changed, only consolidation.

Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: JP Simard <jp@jpsim.com>