
Health Discovery Service #1310

Closed
htuch opened this issue Jul 21, 2017 · 16 comments
Labels
api/v3 (Major version release @ end of Q3 2019), no stalebot (Disables stalebot from closing an issue)


@htuch
Member

htuch commented Jul 21, 2017

The v2 APIs introduce the Health Discovery Service (HDS) - https://github.com/lyft/envoy-api/blob/master/api/hds.proto (@amb67). It works as follows (pasting in the .proto comment; a rough sketch of the HealthCheckSpecifier follows the list):

  1. Envoy starts up and, if its can_healthcheck option in the static bootstrap config is enabled, sends a HealthCheckRequest to the management server. It supplies its capabilities (which protocols it can health check with, what zone it resides in, etc.).
  2. In response to (1), the management server designates this Envoy as a healthchecker to health check a subset of all upstream hosts for a given cluster (for example, upstream Host 1 and Host 2). It streams HealthCheckSpecifier messages with cluster-related configuration for all clusters this Envoy is designated to health check. Subsequent HealthCheckSpecifier messages will be sent on changes to:
    a. Endpoints to health check
    b. Per-cluster configuration
  3. Envoy creates a health probe based on the HealthCheck config and sends it to the endpoint (ip:port) of Hosts 1 and 2. Based on the HealthCheck configuration, Envoy waits for the probe response and inspects its content to decide whether the endpoint is healthy. If a response hasn’t been received within the timeout interval, the endpoint health status is considered TIMEOUT.
  4. Envoy reports results back in an EndpointHealthResponse message. Envoy streams responses at the interval configured by the management server in HealthCheckSpecifier.
  5. The management server collects health statuses for all endpoints in the cluster (for all clusters) and uses this information to construct EDS DiscoveryResponse messages.
  6. Once Envoy has a list of upstream endpoints to send traffic to, it load balances traffic to them without additional health checking. It may use inline health checking (i.e. consider an endpoint UNHEALTHY if a connection to it fails) to account for the health status propagation delay between HDS and EDS.
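
For concreteness, the HealthCheckSpecifier streamed in step 2 might look roughly like the following (a hand-written sketch in YAML form; field names follow hds.proto, while the cluster name and addresses are made up):

# Sketch only: an illustrative HealthCheckSpecifier from the management server.
cluster_health_checks:
- cluster_name: some_cluster
  health_checks:
  - timeout: 1s
    interval: 5s
    unhealthy_threshold: 2
    healthy_threshold: 2
    http_health_check:
      path: /healthcheck
  locality_endpoints:
  - endpoints:
    - address:
        socket_address:
          address: 10.0.0.1   # "Host 1"
          port_value: 8080
    - address:
        socket_address:
          address: 10.0.0.2   # "Host 2"
          port_value: 8080
interval: 5s   # how often Envoy streams EndpointHealthResponse back (step 4)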

This issue will track implementation work on HDS.

@htuch htuch added the api/v2 label Jul 21, 2017
@htuch htuch added the help wanted Needs help! label Aug 10, 2017
@htuch htuch removed the help wanted Needs help! label Jun 12, 2018
This was referenced Jun 19, 2018
htuch pushed a commit that referenced this issue Jun 26, 2018
Added a health discovery service config option in bootstrap.proto.
There is skeleton code for this service in source/common/upstream/health_discovery_service.cc
and an integration test that tests connectivity between an Envoy running HDS and a management server.

This work is for #1310.

Risk Level: Low

Signed-off-by: Lilika Markatou <lilika@google.com>
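
For reference, the bootstrap option this adds is configured roughly as follows (a minimal sketch; the xds_cluster name is illustrative and must refer to a static cluster pointing at the management server):

hds_config:
  api_type: GRPC
  grpc_services:
  - envoy_grpc:
      cluster_name: xds_cluster   # illustrative: a static cluster for the HDS server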
@stale

stale bot commented Jul 12, 2018

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Jul 12, 2018
@htuch htuch added no stalebot Disables stalebot from closing an issue and removed stale stalebot believes this issue/PR has not been touched recently labels Jul 12, 2018
htuch pushed a commit that referenced this issue Aug 6, 2018
This pull request contains a basic implementation of HDS, where a management server can request an HTTP healthcheck for any number of endpoints, the HdsDelegate health checks them, and it reports back. The code also includes TODOs to help identify the work that needs to be done next, such as supporting updates to the set of endpoints that require health checking.

Risk Level: Low

Testing:
There are integration tests in test/integration/hds_integration_test.cc
and unit tests in test/common/upstream/hds_test.cc.

This work is for #1310.

Signed-off-by: Lilika Markatou <lilika@google.com>
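
The report that the HdsDelegate streams back is an EndpointHealthResponse, roughly of this shape (a sketch; field names as they appear in hds.proto, address illustrative):

# Sketch only: EndpointHealthResponse streamed back by the HdsDelegate.
endpoints_health:
- endpoint:
    address:
      socket_address:
        address: 10.0.0.1
        port_value: 8080
  health_status: HEALTHY   # or UNHEALTHY / TIMEOUT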
htuch pushed a commit that referenced this issue Aug 10, 2018
Renaming message types in api/envoy/service/discovery/v2/hds.proto to improve readability

Risk Level:
Low

This is for #1310.

Signed-off-by: Lilika Markatou <lilika@google.com>
htuch pushed a commit that referenced this issue Aug 12, 2018
The HdsDelegate now informs the management server that it can perform TCP health checks.

This change also comes with three tests showing that the HdsDelegate can indeed perform TCP health checks.

Risk Level: Low

This is for #1310.

Signed-off-by: Lilika Markatou <lilika@google.com>
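
The capability advertisement happens in the initial HealthCheckRequest on the stream, roughly like this (a sketch; the capability and health_check_protocols field names are assumed from hds.proto, and the node values are illustrative):

# Sketch only: initial request now advertising both HTTP and TCP health checking.
health_check_request:
  node:
    id: some-envoy          # illustrative
    cluster: some-cluster   # illustrative
  capability:
    health_check_protocols:
    - HTTP
    - TCP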
htuch pushed a commit that referenced this issue Aug 12, 2018
…ction failures (#4108)

I changed the retry strategy of the HdsDelegate on stream/connection failures. Instead of retrying every fixed number of seconds, we now use a jittered backoff strategy, as in #3791.

Risk Level:
Low

This is for #1310.

Signed-off-by: Lilika Markatou <lilika@google.com>
htuch pushed a commit that referenced this issue Aug 20, 2018
When reporting back to the management server, the HdsDelegate can now tell it whether an unhealthy host responded that it is unhealthy or whether the request timed out.

Risk Level:
Low

This is for #1310.

Signed-off-by: markatou <lilika@google.com>
htuch pushed a commit that referenced this issue Aug 22, 2018
…obust (#4164)

Follow up on #4149.

I increased any unused timers to 100 seconds. Additionally, instead of waiting on a timer to go off, all but one test wait on cluster stats to change.
I decreased the timer wait times, reducing the test runtime from 25 seconds to 5.

Exceptions:

During TCP health checking, if an endpoint responds with the wrong phrase, the cluster failure stats do not change, so the one test that examines this case still depends on timers.
There are two tests that cover hosts timing out. I set those timeouts to 0.1 seconds (vs. 100), but both tests still wait on the cluster stats to change.
Risk Level:
Low

Testing:
I ran the test locally, and it passed 3000/3000 times.

This work is for #1310.

Signed-off-by: markatou <lilika@google.com>
@sschepens
Contributor

@htuch I've been meaning to test and probably use HDS, but I bumped into #5279, which made me think.
What's the status of HDS?
What remains to be done? It would be great to have a list of TODOs; I can probably tackle some of them.
For example, I find that the EndpointHealthResponse that gets sent to the control plane is probably lacking information, such as which cluster the target belongs to.
I can also imagine HDS being a bit more intelligent, e.g. not reporting health check results at a fixed interval (using the health check interval instead?), or reporting only when a status has changed or a new stream has been established.

@htuch
Member Author

htuch commented Dec 13, 2018

@sschepens I left some notes in #5279 (comment).

If you'd like to continue to drive HDS, taking it from its current alpha status to production ready, that would be totally rad. I'd be happy to do reviews and have design discussions. We'd like to iterate here. It's likely that Google will also have bandwidth to help with the implementation in the next 1-2 quarters, but no guarantees at this point.

@sschepens
Contributor

@htuch I'd be totally willing for me and my team to continue with HDS and hopefully get it production ready.
What would be the channel for design discussions? I also want to understand previous design decisions.

@htuch
Member Author

htuch commented Dec 19, 2018

@sschepens this issue or Slack; we could create an HDS channel. Also, happy to do a Hangouts meeting to context dump.

@htuch htuch added api/v3 Major version release @ end of Q3 2019 and removed api/v2 labels Mar 12, 2020
@banks
Contributor

banks commented Apr 28, 2020

I see that health moved out of service/discovery/v2 into its own service/health/v3 for v3. Is that actually usable now? I tried to run a v3 service and bootstrap Envoy with:

hds_config:
  api_type: GRPC
  transport_api_version: V3
  grpc_services:
  - envoy_grpc:
      cluster_name: xds_cluster
  set_node_on_first_message_only: true

but I see in the logs that it's still failing with unknown service envoy.service.discovery.v2.HealthDiscoveryService. This is Envoy 1.14.1.

I don't think I see differences from v2 yet; I'm just wondering what the "right" way to work with HDS is now that v2 is supposed to be deprecated?

If I can get the basics working here, I might be able to find resources to contribute to any additional work needed, as this could become a critical feature for us in the next 6 months or so.

@htuch
Member Author

htuch commented Apr 29, 2020

@banks yeah, I think you're seeing the effect of #10609

@banks
Contributor

banks commented Apr 29, 2020

I'm also struggling to get HDS to work at all in 1.14.1. I can file a bug separately if that helps, but I'm not sure whether it's expected to work given that this issue is still open.

Whenever I send HealthChecks for a cluster (I've tried both with static clusters and clusters delivered via xDS), Envoy crashes:

Envoy Debug output
[2020-04-29 14:21:18.511][3530335][debug][upstream] [external/envoy/source/common/upstream/health_discovery_service.cc:187] New health check response message cluster_health_checks {
  cluster_name: "cluster-00000"
  health_checks {
    timeout {
      seconds: 1
    }
    interval {
      seconds: 5
    }
    http_health_check {
      host: "localhost"
      path: "/v1/agent/self"
    }
  }
}
interval {
  seconds: 5
}

[2020-04-29 14:21:18.511][3530335][debug][upstream] [external/envoy/source/common/upstream/health_discovery_service.cc:140] New health check response message cluster_health_checks {
  cluster_name: "cluster-00000"
  health_checks {
    timeout {
      seconds: 1
    }
    interval {
      seconds: 5
    }
    http_health_check {
      host: "localhost"
      path: "/v1/agent/self"
    }
  }
}
interval {
  seconds: 5
}

[2020-04-29 14:21:18.511][3530335][debug][upstream] [external/envoy/source/common/upstream/health_discovery_service.cc:169] New HdsCluster config name: "cluster-00000"
connect_timeout {
  seconds: 1
}
per_connection_buffer_limit_bytes {
  value: 32768
}
health_checks {
  timeout {
    seconds: 1
  }
  interval {
    seconds: 5
  }
  http_health_check {
    host: "localhost"
    path: "/v1/agent/self"
  }
}
load_assignment {
  endpoints {
  }
}

[2020-04-29 14:21:18.511][3530335][debug][upstream] [external/envoy/source/common/upstream/health_discovery_service.cc:228] Creating an HdsCluster
[2020-04-29 14:21:18.516][3530335][critical][main] [external/envoy/source/exe/terminate_handler.cc:13] std::terminate called! (possible uncaught exception, see trace)
[2020-04-29 14:21:18.516][3530335][critical][backtrace] [bazel-out/darwin-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2020-04-29 14:21:18.516][3530335][critical][backtrace] [bazel-out/darwin-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:92] Envoy version: 3504d40f752eb5c20bc2883053547717bcb92fd8/1.14.1/clean-getenvoy-902f20f-envoy/RELEASE/BoringSSL
[2020-04-29 14:21:18.516][3530335][critical][backtrace] [bazel-out/darwin-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:104] Caught Abort trap: 6, suspect faulting address 0x7fff70d6d2c2
[2020-04-29 14:21:18.516][3530335][critical][backtrace] [bazel-out/darwin-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2020-04-29 14:21:18.516][3530335][critical][backtrace] [bazel-out/darwin-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:92] Envoy version: 3504d40f752eb5c20bc2883053547717bcb92fd8/1.14.1/clean-getenvoy-902f20f-envoy/RELEASE/BoringSSL
AsyncClient 0x9a30d00, stream_id_: 15660133565381783220
&stream_info_:
  StreamInfoImpl 0x9a30eb0, protocol_: 1, response_code_: null, response_code_details_: via_upstream, health_check_request_: 0, route_name_:
fish: 'envoy -c ./bootstrap.yaml -l de…' terminated by signal SIGABRT (Abort)

One interesting thing is that the initial message is repeated - I've checked, and as far as I can see my control plane (a simple test app running locally) is not sending the message twice on the stream.

I also spent about an hour trying to get stack_decode.py to work on a Mac and failed - potentially because this binary (from getenvoy.io) has debug symbols stripped anyway?

Happy to file this as a separate issue if that's most useful.

Does anyone have a basic example of an HDS check actually working in 1.14 or 1.15 they can share?

I think we'd eventually need the TODO item of specific endpoint checks rather than cluster-level ones, but for now I'm just trying to see the mechanism work!

@banks
Contributor

banks commented Apr 29, 2020

I realised that I was missing specifying the actual Endpoints, as I wrongly assumed from the docs that it would get them from the cluster spec.

Even with that, I still see the same crash. Also interesting: the protobuf dump of the message I sent uses different field names than the actual v2 HDS proto definition:

[2020-04-29 16:15:19.112][3773067][debug][upstream] [external/envoy/source/common/upstream/health_discovery_service.cc:169] New HdsCluster config name: "cluster-00000"
connect_timeout {
  seconds: 1
}
per_connection_buffer_limit_bytes {
  value: 32768
}
health_checks {
  timeout {
    seconds: 1
  }
  interval {
    seconds: 5
  }
  http_health_check {
    host: "localhost"
    path: "/v1/agent/self"
  }
}
load_assignment {
  endpoints {
    lb_endpoints {
      endpoint {
        address {
          socket_address {
            address: "127.0.0.1"
            port_value: 8500
          }
        }
      }
    }
  }
}
[2020-04-29 16:15:19.112][3773067][debug][upstream] [external/envoy/source/common/upstream/health_discovery_service.cc:228] Creating an HdsCluster
[2020-04-29 16:15:19.113][3773067][debug][upstream] [external/envoy/source/common/upstream/upstream_impl.cc:275] transport socket match, socket default selected for host with address 127.0.0.1:8500
[2020-04-29 16:15:19.116][3773067][critical][main] [external/envoy/source/exe/terminate_handler.cc:13] std::terminate called! (possible uncaught exception, see trace)

It logs load_assignment where the actual hds.proto calls that field locality_endpoints - no idea whether that's significant though...

The lines just before the crash are different now - something about the socket address - but I'm not sure whether that's significant either.
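
For what it's worth, the naming difference is probably just because the log dumps the internal HdsCluster (a Cluster config, which uses load_assignment) that Envoy builds from the HDS message, rather than the HDS message itself; what the management server sends still uses locality_endpoints, roughly like this (a sketch, field names assumed from hds.proto):

# Sketch only: the ClusterHealthCheck side of what I sent, in hds.proto field names.
cluster_name: cluster-00000
locality_endpoints:
- endpoints:
  - address:
      socket_address:
        address: 127.0.0.1
        port_value: 8500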

@htuch
Member Author

htuch commented Apr 30, 2020

@banks can you share the stack trace? FWIW, we do have an HDS integration test that demonstrates some subset of this working, see https://github.com/envoyproxy/envoy/blob/master/test/integration/hds_integration_test.cc.

@banks
Contributor

banks commented Apr 30, 2020

@htuch Thanks for your help!

I've not managed to get any more stack trace info than what's in my first comment above. Note it's collapsed by default and needs to be expanded. If that's what you mean, then great; otherwise I've not managed to get any further decoding it.

I also spent about an hour trying to get stack_decode.py to work on a Mac and failed - potentially because this binary (from getenvoy.io) has debug symbols stripped anyway?

If it would help, I can try re-running the same config on a Linux VM and see if I can get a more useful trace. I didn't want to spend too many hours shaving that pile of yaks, though, if this report is useful on its own.

Thanks for the link to the integration test. I'll see if I can get my example to work based on that.

@banks
Contributor

banks commented Apr 30, 2020

@htuch that integration test was perfect - I got it to work without a crash just by adding unhealthy_threshold: 2 and healthy_threshold: 2 to the HealthCheck definition.

To be fair, those are documented as required, which I'd missed. I'm not sure whether an uncaught exception is still considered a bug for bad config, though? Hopefully it will be easy to recreate now anyway - I can file a separate issue if that is useful.
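
For anyone else who hits this, the health check definition that worked for me looks roughly like this (same timeout/interval/path as in my earlier dump, with the missing required thresholds added; sketch only):

health_checks:
- timeout: 1s
  interval: 5s
  unhealthy_threshold: 2   # adding these two documented-as-required fields avoided the crash
  healthy_threshold: 2
  http_health_check:
    host: localhost
    path: /v1/agent/self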

@htuch
Member Author

htuch commented Apr 30, 2020

Yeah, this is probably a bug in HDS handling of PGV validation. I think filing a separate issue makes sense (please tag me); the fix is likely only a few lines of code (plus tests...).

@banks
Contributor

banks commented Apr 30, 2020

@htuch I went to file a new issue on GitHub, but the notice about not reporting crashes publicly put me off a bit... It's already public here, but should I still report it via the security list instead?

@htuch
Member Author

htuch commented Apr 30, 2020

@banks it's fine, since this is a control-plane-induced crash, which is outside the threat model as far as embargoes go today. Thanks for checking though.

@mattklein123
Member

I'm going to call this finished. Let's open new issues that are specific to features/bugs/etc.

jpsim pushed a commit that referenced this issue Nov 28, 2022
Description: this PR consolidates all envoy_data related utility functions in order to (a) keep code consistent and (b) more easily cover the functionality with fuzzing. It also moves them from the Buffer namespace to the new (and more aptly named) Data namespace -- credit to @jingwei99.
Risk Level: low - consolidating functionality.
Testing: existing tests pass, giving confidence that no functionality has changed, only consolidation.

Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: JP Simard <jp@jpsim.com>
jpsim pushed a commit that referenced this issue Nov 29, 2022
Description: this PR consolidates all envoy_data related utility functions in order to (a) keep code consistent and (b) more easily cover the functionality with fuzzing. It also moves them from the Buffer namespace to the new (and more aptly named) Data namespace -- credit to @jingwei99.
Risk Level: low - consolidating functionality.
Testing: existing tests pass, giving confidence that no functionality has changed, only consolidation.

Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: JP Simard <jp@jpsim.com>