-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stats: optimize json representation of histograms #3184
stats: optimize json representation of histograms #3184
Conversation
Signed-off-by: Rama <rama.rao@salesforce.com>
Here is how the new optimized format will look like
|
@jmarantz @mattklein123 one of the TODOs in histogram PR. |
Very nice. Is there a test that at least checks that this is valid JSON? Better would be a test that sanity checks some of the output? |
nice. Would it also be worthwhile to have computed_quantiles be a simple array of 2-element arrays, to avoid the interleaved repeated keywords? Of course if the json is gzipped it's not going to matter much. It's also worth explicitly discussing what happens when some of the underlying data is NaN, no? |
@mattklein123 Yes. There is a test that validates whether this a valid json. Do you think any other test is needed? @jmarantz I think leaving it with interleaved keywords is fine because it would be readable. As you mentioned, it is not going to matter much. When underlying data is Nan(that means that histogram is not used in that interval), we set it to null - I think that is fine. |
@ramaraochavali It's a pretty big size difference if not gzipped. I'm not sure whether in the flows that consume this, gzip is commonly in the path. If not, then this change adds readability, but makes the output much more verbose, so it's a significant tradeoff. If we are certain gzip is in the flow then I think it's a reasonable idea. Can you confirm that's the case? I thought any individual value can be NaN, e.g. the summary and not the interval or vice-versa. What happens then? |
@jmarantz agree on the size different if not gzipped. I am not sure about whether gzip filer configured commonly here. @mattklein123 might have details. If it is not there, I am ok, to reduce it to simple arrays excluding the keywords. We can document the pattern similar to how we have done for admin end point. regarding Nan, the code that outputs to json ensures that it does not set that value if it is Nan so would automatically be null. |
source/server/http/admin.cc
Outdated
histograms_obj.SetObject(); | ||
Stats::HistogramStatisticsImpl empty_statistics; | ||
rapidjson::Value supported_quantile_array(rapidjson::kArrayType); | ||
for (size_t i = 0; i < empty_statistics.supportedQuantiles().size(); ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (double quantile : empty_statistics.supportedQuantiles()) {
....
source/server/http/admin.cc
Outdated
quantile_type.SetDouble(empty_statistics.supportedQuantiles()[i] * 100); | ||
supported_quantile_array.PushBack(quantile_type, allocator); | ||
} | ||
histograms_obj.AddMember("supported_quanlties", supported_quantile_array, allocator); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/quanlties/quantiles/
Signed-off-by: Rama <rama.rao@salesforce.com>
There is no gzip support right now for the admin endpoint (though the same loopback hack that we have done for security in some cases could be used for gzip also, so maybe docs is all that is needed). With that said, given the code @gsagula has written it would be not hard to add. I think we should do this, especially for Prometheus output. I'm fine either way on this. Will defer to @jmarantz on final approval. |
@ramaraochavali this looks solid! Awesome stuff! Are there any tests that sanity check the output of this method, so we know that it output the specific json we expect? If not do you think it makes sense to add a simple one? |
OK let's go with the format as described. Tests are the next step I think. |
source/server/http/admin.cc
Outdated
histograms_obj.SetObject(); | ||
Stats::HistogramStatisticsImpl empty_statistics; | ||
rapidjson::Value supported_quantile_array(rapidjson::kArrayType); | ||
for (double quantile : empty_statistics.supportedQuantiles()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from an API perspective, the supported quantiles can in theory be overridden per-histogram, though the code doesn't support that now. Do you want to just put a warning check below as you loop through the histograms, that the quantiles match and at least print a warning if they don't? I think that'll never fire, though, so test-coverage would be annoying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a validation in tests that validates supported quantiles and computed quantiles are of the same size. I think that should be fine for now.
Signed-off-by: Rama <rama.rao@salesforce.com>
@mrice32 there is a test that validates whether this JSON is valid or not. I have extended and added additional validations to ensure the new histogram structure is what we want. PTAL. |
Here is the new format. I thought this more structured than previous one. The main change is "histograms" come as the top level container that has supported_quantiles, computed_quantlies. The computed_quantiles is an array of histograms with name and values (with interval and cumulative values).
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good. +1 for having a test with some small amount of data where actually EXPECT_EQ on the stringified json.
More importantly, we should doc the json format, right?
} | ||
|
||
// Validate that the stats JSON has histograms. | ||
EXPECT_TRUE(has_histograms); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious why you made this a bool as opposed to a count that you increment in the loop. I wouldn't expect this to be flaky between tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made it to bool because, we will only have one "histograms" top level object in the JSON which has supported_quantlies (one) and an array of computed_quantiles for each histogram. I may be missing some thing here and not able to follow your suggestion about counting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That means you could successfully pass the test even if the underlying impl matched an extra histogram that it shouldn't have. I'd err on the side of checking more precisely, especially as it's no extra lines of code.
source/server/http/admin.cc
Outdated
Value histograms_obj; | ||
histograms_obj.SetObject(); | ||
|
||
Stats::HistogramStatisticsImpl empty_statistics; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's worth noting in a comment here that an implementation design makes it impossible for the supported quantiles to differ across histograms, although the interface in include/envoy/stats/stats.h, ParentHistogram, makes it look like they can.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test looks good to me outside of the issue @jmarantz pointed out. LGTM after addressing that.
} | ||
|
||
// Validate that the stats JSON has histograms. | ||
EXPECT_TRUE(has_histograms); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Obviously we will hold on this until we sort out the underlying issues in #3235 |
Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
…3165) server_test proto fuzzing resulted in OOM because of /dev traversal. It seems bad to allow runtime paths to point in these places. Risk Level: Low Testing: New unit tests. Signed-off-by: Harvey Tuch <htuch@google.com> Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Simon Pasquier <spasquie@redhat.com> Signed-off-by: Rama <rama.rao@salesforce.com>
This patch stems from an operational issue at Lyft that took a really long time to debug. The issue was a client sending a content-length along with a 204 response which is against spec and nghttp2 does not like. There are a few different changes here: 1) Treat all messaging errors as stream level errors vs. connection level errors. 2) Increment a specific stat so its easier to debug these cases. 3) Charge resets that stem from these issues as local resets vs. remote resets which is what was happening previously. Signed-off-by: Matt Klein <mklein@lyft.com> Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Tal Nordan <tal.nordan@solo.io> Signed-off-by: Rama <rama.rao@salesforce.com>
Attach pending upstream requests in next event after onResponseComplete. Risk Level: Medium Testing: unit test, integration test Docs Changes: N/A Release Notes: N/A Fixes envoyproxy#2715 Signed-off-by: Lizan Zhou <zlizan@google.com> Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Kyle Myerson <kmyerson@google.com> Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: James Buckland <jbuckland@google.com> Signed-off-by: Rama <rama.rao@salesforce.com>
…#3042) Adds TCP Keepalive support for upstream connections. This can be configured on the cluster manager level, and overridden on the cluster level. Risk Level: Medium Testing: Unit tests have been added. It appears to run and work. Docs Changes: envoyproxy/data-plane-api#614 Fixes envoyproxy#3028 API Changes: envoyproxy/data-plane-api#614 Signed-off-by: Jonathan Oddy <jonathan.oddy@transferwise.com> Signed-off-by: Rama <rama.rao@salesforce.com>
Added protos to support Role Based Access Control in Envoy. Also removed existing auth.proto because the new RBAC proto is a replacement of it. Ealier discussions at envoyproxy/data-plane-api#586. Signed-off-by: Limin Wang <liminwang@google.com> Signed-off-by: Rama <rama.rao@salesforce.com>
docs: clarify the use of "/" as a prefix_rewrite This patch clarifies the use of "/" as a prefix_rewrite in route and redirect prefix rewriting. And also a note on the use of trailing slashes as match value. Risk Level: Low Testing: add more input samples to RedirectPrefixRewrite test. Docs Changes: Update route.proto doc regarding path_rewrite both for redirect and route. Release Notes: N/A Fixes envoyproxy#2956 Signed-off-by: Dhi Aurrahman <dio@rockybars.com> Signed-off-by: Rama <rama.rao@salesforce.com>
…oxy#3174) The current implementation of `healthy_edge_interval` and `unhealthy_edge_interval` will wait for the health threshold to be reached before the interval is used. That leads to situations like this: - `healthy_threshold` is set to 3; - host health state is currently `unhealthy`; - host health check fails, next check happens after `unhealthy_interval`; - host health check succeeds, next check happens after `interval`; - host health check succeeds, next check happens after `interval`; - host health check succeeds, next check happens after `healthy_edge_interval`; - host health check succeeds, next check happens after `interval`. The behavior above defeats the purpose of having an edge interval since its goal is to detect health state changes faster whilst reducing the burden of health checks on healthy hosts. The intended behavior to achieve edge interval's purpose on the same scenario would be: - host health check fails, next check happens after `unhealthy_interval`; - host health check succeeds, next check happens after `healthy_edge_interval`; - host health check succeeds, next check happens after `healthy_edge_interval`; - host health check succeeds, next check happens after `interval`; - host health check succeeds, next check happens after `interval`. This PR fixes health check interval overrides so that the latter bahavior is achieved. Fixes: envoyproxy#3173 Signed-off-by: Marcelo Juchem <juchem@gmail.com> Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Chris Roche <croche@lyft.com> Signed-off-by: Rama <rama.rao@salesforce.com>
…envoyproxy#3253) The memory allocator can reuse a memory address immediately, leading to a situation in which TLS references are used for an already destroyed scope. Instead, an incrementing ID is used as the cache key. Fixes envoyproxy#3214 Signed-off-by: Matt Klein <mklein@lyft.com> Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Yangmin Zhu <ymzhu@google.com> Signed-off-by: Rama <rama.rao@salesforce.com>
…mode (envoyproxy#3209) Config: cluster is static with health checking enabled. Added unit test to run the same procedure as when in config verification mode. Signed-off-by: Christoph Pakulski <paker8848@gmail.com> Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Mark Woo <wcgwuxinwei@gmail.com> Signed-off-by: Rama <rama.rao@salesforce.com>
This will be used for testing utilities such as capture2pcap in CI. Signed-off-by: Harvey Tuch <htuch@google.com> Signed-off-by: Rama <rama.rao@salesforce.com>
* common/network: GCC8 memcpy error ``` source/common/network/address_impl.cc: In member function 'virtual absl::uint128 Envoy::Network::Address::Ipv6Instance::Ipv6Helper::address() const': source/common/network/address_impl.cc:188:69: error: 'void* memcpy(void*, const void*, size_t)' copying an object of type 'class absl::uint128' with 'private' member 'absl::uint128::lo_' from an array of 'const uint8_t [16]' {aka 'const unsigned char [16]'}; use assignment or copy-initialization instead [-Werror=class-memaccess] memcpy(&result, &address_.sin6_addr.s6_addr, sizeof(absl::uint128)); ^ In file included from bazel-out/k8-opt/bin/include/envoy/network/_virtual_includes/address_interface/envoy/network/address.h:13, from bazel-out/k8-opt/bin/source/common/network/_virtual_includes/address_lib/common/network/address_impl.h:12, from source/common/network/address_impl.cc:1: external/com_google_absl/absl/numeric/int128.h:81:19: note: 'class absl::uint128' declared here class alignas(16) uint128 { ^~~~~~~ cc1plus: all warnings being treated as errors Target //source/exe:envoy-static failed to build ``` Signed-off-by: Vincent Batts <vbatts@hashbangbash.com> * server/hot_restart_impl: GCC8 memset error ``` source/server/hot_restart_impl.cc: In member function 'virtual void Envoy::Server::HotRestartImpl::free(Envoy::Stats::RawStatData&)': source/server/hot_restart_impl.cc:171:46: error: 'void* memset(void*, int, size_t)' clearing an object of type 'struct Envoy::Stats::RawStatData' with no trivial copy-assignment [-Werror=class-memaccess] memset(&data, 0, Stats::RawStatData::size()); ``` Signed-off-by: Vincent Batts <vbatts@hashbangbash.com> Signed-off-by: Rama <rama.rao@salesforce.com>
…o upstream (envoyproxy#3243) * websocket: fix crash when data is received after failing to connect to upstream Fixes envoyproxy#3236 Signed-off-by: Greg Greenway <ggreenway@apple.com> Signed-off-by: Rama <rama.rao@salesforce.com>
…th checks fail. (envoyproxy#3171) * cluster: Add option to close tcp_proxy connections when health checks fail. Signed-off-by: Greg Greenway <ggreenway@apple.com> Signed-off-by: Rama <rama.rao@salesforce.com>
…d for per-filter config. (envoyproxy#3268) Signed-off-by: Greg Greenway <ggreenway@apple.com> Signed-off-by: Rama <rama.rao@salesforce.com>
Previously, TLS Inspector set detected transport protocol to either "tls" or "raw_buffer", however, such approach didn't allow multiple sniffers to be installed at the same time, since they would overwrite results from each other. After this change, protocol sniffers will set detected transport protocol only when they find the one they know about it, and listener will set the default one ("raw_buffer") in case no known transport protocol was found. Signed-off-by: Piotr Sikora <piotrsikora@google.com> Signed-off-by: Rama <rama.rao@salesforce.com>
…ed Status. (envoyproxy#3272) Signed-off-by: Trevor Schroeder <trevors@google.com> Signed-off-by: Rama <rama.rao@salesforce.com>
Adds the --flaky_test_attempts flag for OS X CI builds to paper over the flaky tests noted in envoyproxy#2428. Test with "integration" in their names will be retried on failure and the CI will succeed if the retry succeeds. Risk Level: Low Testing: n/a Docs Changes: n/a Release Notes: n/a Signed-off-by: Stephan Zuercher <stephan@turbinelabs.io> Signed-off-by: Rama <rama.rao@salesforce.com>
…yproxy#3267) * cleanup: move filter wrappers into Network namespace. No functional changes. Signed-off-by: Piotr Sikora <piotrsikora@google.com> Signed-off-by: Rama <rama.rao@salesforce.com>
The test assumes that IPv4 is always supported. It is not. This PR parameterizes it to run in IPv6-only environments. Risk Level: Low Signed-off-by: Trevor Schroeder <trevors@google.com> Signed-off-by: Rama <rama.rao@salesforce.com>
…oyproxy#3215) Allow `upstream.zone_routing.enabled` and `upstream.zone_routing.min_cluster_size` to be set through the configuration API. It closes the loop on the work started with envoyproxy/data-plane-api#480 Fixes: envoyproxy#1344 Signed-off-by: Marcelo Juchem <juchem@gmail.com> Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Ryan Burn <ryan.burn@gmail.com> Signed-off-by: Rama <rama.rao@salesforce.com>
This change unreverts: Also fixes envoyproxy#3223 Please see the 2nd commit for the actual changes. The changes are: Bring in a new histogram library version with the fix (and more debug checking). Fix a small issue with scope iteration during merging. Remove de-dup for histograms until we iterate to shared storage for overlapping scope histograms. Personally, I found this very confusing when debugging and I think the way I changed it is better for now given the code we have. Signed-off-by: Matt Klein <mklein@lyft.com>
This PR fixes two issues: Avoid having synchronized behavior across the fleet by having a seed applied to each WRR balancer instance. Avoid having unweighted RR reset to same host on each refresh, biasing towards earlier hosts in the schedule. This TODO still exists for weighted. Risk Level: Medium (there will be observable traffic effects due to changes in host pick schedule). Testing: New unit tests, modified existing. Signed-off-by: Harvey Tuch <htuch@google.com> Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama rama.rao@salesforce.com
Description:
optimizes the json representation of histograms by sending all supported_quantile once and computed_quantlie array multiple times.
Risk Level: Low
Testing: Manual
Docs Changes: NA
Release Notes: NA
Fixes #3102