Health Discovery Service #1310
Added a health discovery service config option in bootstrap.proto. There is skeleton code for this service in source/common/upstream/health_discovery_service.cc and an integration test that tests connectivity between envoy running hds and a management server. This work is for #1310. Risk Level: Low Signed-off-by: Lilika Markatou <lilika@google.com>
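The bootstrap wiring this describes might look like the following sketch. The cluster name `hds_management_server` is invented for illustration, and the exact field layout should be checked against the bootstrap.proto of the Envoy version in use:

```yaml
# Sketch: enable the HdsDelegate by pointing hds_config at a gRPC
# management server. The cluster name below is an assumption and must
# match a cluster defined elsewhere in the bootstrap.
hds_config:
  api_type: GRPC
  grpc_services:
    - envoy_grpc:
        cluster_name: hds_management_server
```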
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
This pull request contains a basic implementation of HDS, where a management server can request an HTTP healthcheck for any number of endpoints, the HdsDelegate healthchecks them, and reports back. The code also includes TODOs to help identify the work that needs to be done next, like supporting updates to the set of endpoints that require healthchecking. Risk Level: Low Testing: There are integration tests in test/integration/hds_integration_test.cc and unit tests in test/common/upstream/hds_test.cc. This work is for #1310. Signed-off-by: Lilika Markatou <lilika@google.com>
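As a sketch of what the management server might stream to request HTTP healthchecking of some endpoints (field names follow the v2 hds.proto; the cluster name, path, zone, and address are invented for illustration):

```yaml
# Sketch of a HealthCheckSpecifier message, expressed as YAML.
cluster_health_checks:
  - cluster_name: example_cluster          # invented name
    health_checks:
      - timeout: 1s
        interval: 5s
        http_health_check:
          path: /healthz                   # invented health endpoint
    locality_endpoints:
      - locality: {zone: us-east-1a}
        endpoints:
          - address:
              socket_address: {address: 10.0.0.1, port_value: 8080}
interval: 5s                               # how often Envoy reports back
```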
Renaming message types in api/envoy/service/discovery/v2/hds.proto to improve readability Risk Level: Low This is for #1310. Signed-off-by: Lilika Markatou <lilika@google.com>
The HdsDelegate now informs the management server that it can perform TCP healthchecks. This change also comes with three tests showing that the HdsDelegate can indeed perform TCP healthchecks. Risk Level: Low This is for #1310. Signed-off-by: Lilika Markatou <lilika@google.com>
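A TCP healthcheck in this scheme exchanges hex-encoded payloads; a hedged sketch (payloads and thresholds are illustrative, not taken from the PR):

```yaml
health_checks:
  - timeout: 1s
    interval: 5s
    unhealthy_threshold: 1
    healthy_threshold: 1
    tcp_health_check:
      send:
        text: "50494E47"        # hex encoding of "PING" (illustrative)
      receive:
        - text: "504F4E47"      # expect "PONG"; a wrong phrase marks the host unhealthy
```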
When reporting back to the management server, the HdsDelegate can now tell it whether an unhealthy host responded that it's unhealthy or the request timed out. Risk Level: Low This is for #1310. Signed-off-by: markatou <lilika@google.com>
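The reported distinction might look like this inside an EndpointHealthResponse (a sketch based on the v2 proto field names; addresses invented):

```yaml
endpoints_health:
  - endpoint:
      address:
        socket_address: {address: 10.0.0.1, port_value: 8080}
    health_status: UNHEALTHY    # host responded and reported itself unhealthy
  - endpoint:
      address:
        socket_address: {address: 10.0.0.2, port_value: 8080}
    health_status: TIMEOUT      # no response within the configured timeout
```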
…obust (#4164) Follow up on #4149. I increased any unused timers to 100 seconds. Additionally, instead of waiting on a timer to go off, all but one test wait on cluster stats to change. I decreased the timer wait times, reducing the test runtime from 25 seconds to 5. Exceptions: while TCP health checking, if an endpoint responds with a wrong phrase, the cluster failure stats do not change, so the one test that examines this case still depends on timers. There are 2 tests that test hosts timing out; I set those timeouts to 0.1 seconds (vs 100), but both tests still wait on the cluster stats to change. Risk Level: Low Testing: I ran the test locally, and it passed 3000/3000 times. This work is for #1310. Signed-off-by: markatou <lilika@google.com>
@htuch Been meaning to test and probably use HDS; I bumped into #5279 though, which made me think.
@sschepens I left some notes in #5279 (comment). If you'd like to continue to drive HDS, taking it from its current alpha status to production ready, that would be totally rad. I'd be happy to do reviews and have design discussions. We'd like to iterate here. It's likely that Google will also have bandwidth to help with the implementation in the next 1-2 quarters, but no guarantees at this point.
@htuch I'd be totally willing to have my team and me continue with HDS and hopefully get it production ready.
@sschepens this PR or Slack; we could create an HDS channel. Also, happy to do a Hangouts meeting to context dump.
I see that health moved out of
but I see in logs it's failing with
I don't think I see differences from v2 yet; just wondering what the "right" way to work with HDS is now that v2 is supposed to be deprecated? If I can get the basics working here, I might be able to find resources to contribute to any additional work needed here, as this could become a critical feature for us in the next 6 months or so.
I'm also struggling to get HDS to work at all in 1.14.1. I can file a bug separately if that helps, but I'm not sure if it's expected to be working given this issue's open state. Whenever I send HealthChecks for a cluster (I've tried both with static clusters and clusters delivered via xDS), Envoy crashes: Envoy Debug output
One interesting thing is that the initial message is repeated - I've checked, and as far as I can see my control plane (a simple test app running locally) is not sending the message twice on the stream. I also spent about an hour trying to get
Happy to file this as a separate issue if that's most useful. Does anyone have a basic example of an HDS check actually working in 1.14 or 1.15 they can share? I think we'd eventually need the TODO item of specific endpoint checks rather than cluster-level ones, but I'm just trying to see the mechanism work for now!
I realised that I was missing specifying the actual Endpoints, as I wrongly assumed from the docs that it would get them from the cluster spec. Even with that, I still see the same crash. Also interesting is that the protobuf dump of the thing I sent uses different field names than the actual v2 HDS proto definition:
The lines just before the crash are different now - something about the socket address. But not sure if that's significant.
@banks can you share the stack trace? FWIW, we do have an HDS integration test that demonstrates some subset of this working, see https://github.com/envoyproxy/envoy/blob/master/test/integration/hds_integration_test.cc.
@htuch Thanks for your help! I've not managed to get any more stack-trace info than is in the first comment above. Note it's collapsed by default and needs to be expanded. If that's what you mean then great, but otherwise I've not managed to get further decoding it.
If it would help, I can try to re-run the same config on a Linux VM and see if I can get a more useful trace? I didn't want to spend too many hours shaving that pile of yaks though, if this report is useful on its own. Thanks for the link to the integration test. I'll see if I can get my example to work based on that.
@htuch that integration test was perfect - I got it to work without a crash just by adding
To be fair, those are documented as required, which I'd missed. I'm not sure if an uncaught exception is still considered a bug for bad config though? Hopefully it will be easy to recreate now anyway - I can file a separate issue if that is useful.
Yeah, this is probably a bug in HDS handling of PGV validation. I think filing a separate issue makes sense (please tag me); the fix is likely only a few lines of code (plus tests..).
@htuch I went to file a new issue on GH, but the notice about not reporting any crash publicly put me off a bit... It's already public here, but should I still report this via the security list instead?
@banks it's fine, since this is a control-plane-induced crash, which is outside the threat model as far as embargoes go today. Thanks for checking though.
I'm going to call this finished. Let's open new issues that are specific to features/bugs/etc.
Description: this PR consolidates all envoy_data related utility functions in order to a) keep code consistent b) more easily cover the functionality with fuzzing. It also moves from the Buffer namespace to the newly (and more aptly named) Data namespace -- credit to @jingwei99. Risk Level: low - consolidating functionality. Testing: existing tests pass giving confidence that no functionality has changed, only consolidation. Signed-off-by: Jose Nino <jnino@lyft.com> Signed-off-by: JP Simard <jp@jpsim.com>
The v2 APIs introduce HDS - https://github.com/lyft/envoy-api/blob/master/api/hds.proto (@amb67). It works as follows (pasting in the `.proto` comment):

1. Envoy starts up and, if its `can_healthcheck` option in the static bootstrap config is enabled, sends `HealthCheckRequest` to the management server. It supplies its capabilities (which protocol it can health check with, what zone it resides in, etc.).
2. In response to (1), the management server designates this Envoy as a healthchecker and streams `HealthCheckSpecifier` messages with cluster related configuration for all clusters this Envoy is designated to health check. Subsequent `HealthCheckSpecifier` messages will be sent on changes to:
   a. Endpoints to health check
   b. Per cluster configuration change
3. Envoy creates a health probe based on the `HealthCheck` config and sends it to the endpoint (ip:port) of Host 1 and 2. Based on the `HealthCheck` configuration, Envoy waits upon the arrival of the probe response and looks at the content of the response to decide whether the endpoint is healthy or not. If a response hasn't been received within the timeout interval, the endpoint health status is considered TIMEOUT.
4. Envoy reports results back in an `EndpointHealthResponse` message. Envoy streams responses as often as the interval configured by the management server in `HealthCheckSpecifier`.
5. The management server collects the reported health statuses and can use them when constructing EDS `DiscoveryResponse` messages.

This issue will track implementation work on HDS.
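As an illustration of step 1, the initial capabilities message might be sketched like this (field names based on the v2 hds.proto; the node id and cluster name are invented):

```yaml
# Sketch of a HealthCheckRequest, expressed as YAML.
health_check_request:
  node:
    id: envoy-node-1            # invented identifier
    cluster: edge-proxies       # invented cluster name
  capability:
    health_check_protocols: [HTTP, TCP]
```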