
stats: optimize json representation of histograms #3184

Closed

Conversation

ramaraochavali (Contributor)

Signed-off-by: Rama rama.rao@salesforce.com

Description:
Optimizes the JSON representation of histograms by sending the supported quantiles once and a computed-quantiles array per histogram.

Risk Level: Low
Testing: Manual
Docs Changes: NA
Release Notes: NA

Fixes #3102

Signed-off-by: Rama <rama.rao@salesforce.com>
@ramaraochavali (Contributor Author)

Here is how the new optimized format will look:

{
            "supported_quanlties": [
                0.0,
                25.0,
                50.0,
                75.0,
                90.0,
                95.0,
                99.0,
                99.9,
                100.0
            ],
            "histograms": [
                {
                    "name": "http.egress_http2.downstream_rq_time",
                    "computed_quantiles": [
                        {
                            "interval": 0.0,
                            "cumulative": 0.0
                        },
                        {
                            "interval": 0.0,
                            "cumulative": 0.0
                        },
                        {
                            "interval": 1.043578738857709,
                            "cumulative": 1.043578738857709
                        },
                        {
                            "interval": 1.0941564872895345,
                            "cumulative": 1.0941564872895345
                        },
                        {
                            "interval": 2.086002317497103,
                            "cumulative": 2.086002317497103
                        },
                        {
                            "interval": 3.066523297491039,
                            "cumulative": 3.066523297491039
                        },
                        {
                            "interval": 6.046608695652175,
                            "cumulative": 6.046608695652175
                        },
                        {
                            "interval": 229.57333333333436,
                            "cumulative": 229.57333333333436
                        },
                        {
                            "interval": 260.0,
                            "cumulative": 260.0
                        }
                    ]
                }
            ]
}
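To make the motivation concrete, here is a rough, illustrative sketch (not Envoy code) of the saving from emitting the supported quantiles once at the top level instead of repeating them inside every histogram object; the histogram count is a made-up number.

```python
import json

# Hypothetical: one shared copy of the supported quantiles vs. one copy
# embedded in each of n_histograms histogram objects.
supported = [0.0, 25.0, 50.0, 75.0, 90.0, 95.0, 99.0, 99.9, 100.0]
n_histograms = 50

per_copy = len(json.dumps({"supported_quantiles": supported}))
repeated_cost = n_histograms * per_copy  # old shape: one copy per histogram
hoisted_cost = per_copy                  # new shape: a single shared copy

print(f"roughly {repeated_cost - hoisted_cost} bytes saved "
      f"across {n_histograms} histograms")
```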

@ramaraochavali (Contributor Author)

@jmarantz @mattklein123 one of the TODOs in histogram PR.

@mattklein123 (Member)

Very nice. Is there a test that at least checks that this is valid JSON? Even better would be a test that sanity-checks some of the output.

@jmarantz (Contributor)

nice. Would it also be worthwhile to have computed_quantiles be a simple array of 2-element arrays, to avoid the interleaved repeated keywords? Of course if the json is gzipped it's not going to matter much.

It's also worth explicitly discussing what happens when some of the underlying data is NaN, no?
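The size tradeoff raised here can be sketched as follows; the quantile values are made up and the two shapes are simplified stand-ins for the real formats, but the pattern (raw sizes differ a lot, gzipped sizes converge) is what the discussion is about.

```python
import gzip
import json

# Hypothetical quantile data: (interval, cumulative) pairs.
pairs = [[0.0, 0.0], [1.04, 1.04], [2.09, 2.09], [6.05, 6.05], [260.0, 260.0]]

# Interleaved repeated keywords, as in the proposed format.
verbose = json.dumps([{"interval": i, "cumulative": c} for i, c in pairs]).encode()
# Simple array of 2-element arrays, as suggested.
compact = json.dumps(pairs).encode()

print(len(verbose), len(compact))  # raw sizes differ noticeably
# gzip compresses the repeated keys away, shrinking the gap
print(len(gzip.compress(verbose)), len(gzip.compress(compact)))
```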

@ramaraochavali (Contributor Author)

@mattklein123 Yes. There is a test that validates whether this is valid JSON. Do you think any other test is needed?

@jmarantz I think leaving it with interleaved keywords is fine because it is readable. As you mentioned, it is not going to matter much. When the underlying data is NaN (meaning the histogram is not used in that interval), we set it to null - I think that is fine.

@jmarantz (Contributor)

@ramaraochavali It's a pretty big size difference if not gzipped. I'm not sure whether gzip is commonly in the path for the flows that consume this. If not, then this change adds readability but makes the output much more verbose, so it's a significant tradeoff. If we are certain gzip is in the flow, then I think it's a reasonable idea. Can you confirm that's the case?

I thought any individual value can be NaN, e.g. the summary but not the interval, or vice versa. What happens then?

@ramaraochavali (Contributor Author)

@jmarantz agree on the size difference if not gzipped. I am not sure whether the gzip filter is commonly configured here. @mattklein123 might have details. If it is not, I am OK with reducing it to simple arrays excluding the keywords. We can document the pattern similar to how we have done for the admin endpoint.

Regarding NaN, the code that outputs the JSON ensures that it does not set the value if it is NaN, so it would automatically be null.
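As a hedged sketch of that behavior: the serializing code skips a value when it is NaN, so the field comes out as null. The snippet below mimics that by mapping NaN to None before dumping; note that `json.dumps` would otherwise emit the non-standard token `NaN`, which is not valid JSON.

```python
import json
import math

computed = [1.04, float("nan"), 260.0]  # hypothetical quantile values

# Replace NaN with None so the JSON output contains null, mirroring the
# skip-if-NaN behavior described above.
safe = [None if math.isnan(v) else v for v in computed]
out = json.dumps(safe)
print(out)  # [1.04, null, 260.0]
```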

histograms_obj.SetObject();
Stats::HistogramStatisticsImpl empty_statistics;
rapidjson::Value supported_quantile_array(rapidjson::kArrayType);
for (size_t i = 0; i < empty_statistics.supportedQuantiles().size(); ++i) {
Reviewer comment (Contributor):

for (double quantile : empty_statistics.supportedQuantiles()) {
....

quantile_type.SetDouble(empty_statistics.supportedQuantiles()[i] * 100);
supported_quantile_array.PushBack(quantile_type, allocator);
}
histograms_obj.AddMember("supported_quanlties", supported_quantile_array, allocator);
Reviewer comment (Contributor):

s/quanlties/quantiles/

Signed-off-by: Rama <rama.rao@salesforce.com>
@mattklein123 (Member)

@jmarantz agree on the size difference if not gzipped. I am not sure whether the gzip filter is commonly configured here. @mattklein123 might have details. If it is not, I am OK with reducing it to simple arrays excluding the keywords. We can document the pattern similar to how we have done for the admin endpoint.

There is no gzip support right now for the admin endpoint (though the same loopback hack that we have done for security in some cases could be used for gzip also, so maybe docs is all that is needed). With that said, given the code @gsagula has written, it would not be hard to add. I think we should do this, especially for Prometheus output.

I'm fine either way on this. Will defer to @jmarantz on final approval.

@mrice32 (Member)

mrice32 commented Apr 25, 2018

@ramaraochavali this looks solid! Awesome stuff! Are there any tests that sanity-check the output of this method, so we know that it outputs the specific JSON we expect? If not, do you think it makes sense to add a simple one?

@jmarantz (Contributor)

OK let's go with the format as described. Tests are the next step I think.

histograms_obj.SetObject();
Stats::HistogramStatisticsImpl empty_statistics;
rapidjson::Value supported_quantile_array(rapidjson::kArrayType);
for (double quantile : empty_statistics.supportedQuantiles()) {
Reviewer comment (Contributor):

from an API perspective, the supported quantiles can in theory be overridden per-histogram, though the code doesn't support that now. Do you want to just put a warning check below as you loop through the histograms, that the quantiles match and at least print a warning if they don't? I think that'll never fire, though, so test-coverage would be annoying.

Author reply (Contributor Author):

I have added a validation in tests that checks that the supported quantiles and computed quantiles are of the same size. I think that should be fine for now.
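The size check described above can be sketched like this, run against a hypothetical minimal stats document in the new format (the field names follow this PR's discussion, not necessarily the final Envoy output):

```python
import json

doc = json.loads("""
{
  "histograms": {
    "supported_quantiles": [0.0, 50.0, 100.0],
    "computed_quantiles": [
      {"name": "example_histogram",
       "values": [{"interval": 0.0, "cumulative": 0.0},
                  {"interval": 1.0, "cumulative": 1.0},
                  {"interval": 2.0, "cumulative": 2.0}]}
    ]
  }
}
""")

supported = doc["histograms"]["supported_quantiles"]
for histogram in doc["histograms"]["computed_quantiles"]:
    # Each histogram must report exactly one value per supported quantile.
    assert len(histogram["values"]) == len(supported), histogram["name"]
print("ok")
```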

Signed-off-by: Rama <rama.rao@salesforce.com>
@ramaraochavali (Contributor Author)

@mrice32 there is a test that validates whether this JSON is valid or not. I have extended it and added additional validations to ensure the new histogram structure is what we want. PTAL.

@ramaraochavali (Contributor Author)

Here is the new format. I think this is more structured than the previous one. The main change is that "histograms" becomes the top-level container holding supported_quantiles and computed_quantiles. computed_quantiles is an array of histograms, each with a name and values (interval and cumulative).

{
    "histograms": {
        "supported_quantiles": [
            0.0,
            25.0,
            50.0,
            75.0,
            90.0,
            95.0,
            99.0,
            99.9,
            100.0
        ],
        "computed_quantiles": [
            {
                "name": "cluster.external_auth_cluster.upstream_cx_length_ms",
                "values": [
                    {
                        "interval": 0.0,
                        "cumulative": 0.0
                    },
                    {
                        "interval": 0.0,
                        "cumulative": 0.0
                    },
                    {
                        "interval": 1.043578738857709,
                        "cumulative": 1.043578738857709
                    },
                    {
                        "interval": 1.0941564872895345,
                        "cumulative": 1.0941564872895345
                    },
                    {
                        "interval": 2.086002317497103,
                        "cumulative": 2.086002317497103
                    },
                    {
                        "interval": 3.066523297491039,
                        "cumulative": 3.066523297491039
                    },
                    {
                        "interval": 6.046608695652175,
                        "cumulative": 6.046608695652175
                    },
                    {
                        "interval": 229.57333333333436,
                        "cumulative": 229.57333333333436
                    },
                    {
                        "interval": 260.0,
                        "cumulative": 260.0
                    }
                ]
            }
        ]
    }
}

@jmarantz left a review comment (Contributor):

This looks good. +1 for having a test with some small amount of data where we actually EXPECT_EQ on the stringified JSON.

More importantly, we should document the JSON format, right?

}

// Validate that the stats JSON has histograms.
EXPECT_TRUE(has_histograms);
Reviewer comment (Contributor):

I'm curious why you made this a bool as opposed to a count that you increment in the loop. I wouldn't expect this to be flaky between tests.

Reviewer comment (Member):

+1

Author reply (Contributor Author):

I made it a bool because we will only have one "histograms" top-level object in the JSON, which has supported_quantiles (one) and an array of computed_quantiles for each histogram. I may be missing something here and am not able to follow your suggestion about counting.

Reviewer comment (Contributor):

That means you could successfully pass the test even if the underlying impl matched an extra histogram that it shouldn't have. I'd err on the side of checking more precisely, especially as it's no extra lines of code.

Value histograms_obj;
histograms_obj.SetObject();

Stats::HistogramStatisticsImpl empty_statistics;
Reviewer comment (Contributor):

It's worth noting in a comment here that an implementation design makes it impossible for the supported quantiles to differ across histograms, although the interface in include/envoy/stats/stats.h, ParentHistogram, makes it look like they can.

@mrice32 left a review comment (Member):

Test looks good to me outside of the issue @jmarantz pointed out. LGTM after addressing that.

}

// Validate that the stats JSON has histograms.
EXPECT_TRUE(has_histograms);
Reviewer comment (Member):

+1

@mattklein123 (Member)

Obviously we will hold off on this until we sort out the underlying issues in #3235.

ramaraochavali and others added 7 commits April 29, 2018 09:33
Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
…3165)

server_test proto fuzzing resulted in OOM because of /dev traversal. It
seems bad to allow runtime paths to point in these places.

Risk Level: Low
Testing: New unit tests.

Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
This patch stems from an operational issue at Lyft that took a really long
time to debug. The issue was a client sending a content-length along with a
204 response which is against spec and nghttp2 does not like.

There are a few different changes here:
1) Treat all messaging errors as stream level errors vs. connection level
   errors.
2) Increment a specific stat so its easier to debug these cases.
3) Charge resets that stem from these issues as local resets vs. remote
   resets which is what was happening previously.

Signed-off-by: Matt Klein <mklein@lyft.com>

Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Tal Nordan <tal.nordan@solo.io>
Signed-off-by: Rama <rama.rao@salesforce.com>
Attach pending upstream requests in next event after onResponseComplete.

Risk Level: Medium
Testing: unit test, integration test
Docs Changes: N/A
Release Notes: N/A

Fixes envoyproxy#2715

Signed-off-by: Lizan Zhou <zlizan@google.com>

Signed-off-by: Rama <rama.rao@salesforce.com>
kmyerson and others added 27 commits May 3, 2018 21:05
)

Signed-off-by: Kyle Myerson <kmyerson@google.com>

Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Kyle Myerson <kmyerson@google.com>

Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: James Buckland <jbuckland@google.com>

Signed-off-by: Rama <rama.rao@salesforce.com>
…#3042)

Adds TCP Keepalive support for upstream connections. This can be configured on the cluster manager level, and overridden on the cluster level.

Risk Level: Medium

Testing:
Unit tests have been added. It appears to run and work.

Docs Changes:
envoyproxy/data-plane-api#614

Fixes envoyproxy#3028

API Changes:
envoyproxy/data-plane-api#614

Signed-off-by: Jonathan Oddy <jonathan.oddy@transferwise.com>

Signed-off-by: Rama <rama.rao@salesforce.com>
Added protos to support Role Based Access Control in Envoy.

Also removed existing auth.proto because the new RBAC proto is a replacement of it.

Earlier discussions at
envoyproxy/data-plane-api#586.

Signed-off-by: Limin Wang <liminwang@google.com>

Signed-off-by: Rama <rama.rao@salesforce.com>
docs: clarify the use of "/" as a prefix_rewrite

This patch clarifies the use of "/" as a prefix_rewrite in route and redirect prefix rewriting, and adds a note on the use of trailing slashes as match values.

Risk Level: Low
Testing: add more input samples to RedirectPrefixRewrite test.
Docs Changes: Update route.proto doc regarding path_rewrite both for redirect and route.
Release Notes: N/A

Fixes envoyproxy#2956

Signed-off-by: Dhi Aurrahman <dio@rockybars.com>

Signed-off-by: Rama <rama.rao@salesforce.com>
…oxy#3174)

The current implementation of `healthy_edge_interval` and
`unhealthy_edge_interval` will wait for the health threshold to be
reached before the interval is used.

That leads to situations like this:
- `healthy_threshold` is set to 3;
- host health state is currently `unhealthy`;
- host health check fails, next check happens after `unhealthy_interval`;
- host health check succeeds, next check happens after `interval`;
- host health check succeeds, next check happens after `interval`;
- host health check succeeds, next check happens after `healthy_edge_interval`;
- host health check succeeds, next check happens after `interval`.

The behavior above defeats the purpose of having an edge interval since
its goal is to detect health state changes faster whilst reducing the
burden of health checks on healthy hosts.

The intended behavior to achieve edge interval's purpose on the same scenario
would be:
- host health check fails, next check happens after `unhealthy_interval`;
- host health check succeeds, next check happens after `healthy_edge_interval`;
- host health check succeeds, next check happens after `healthy_edge_interval`;
- host health check succeeds, next check happens after `interval`;
- host health check succeeds, next check happens after `interval`.

This PR fixes health check interval overrides so that the latter
behavior is achieved.

Fixes: envoyproxy#3173

Signed-off-by: Marcelo Juchem <juchem@gmail.com>

Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Chris Roche <croche@lyft.com>

Signed-off-by: Rama <rama.rao@salesforce.com>
…envoyproxy#3253)

The memory allocator can reuse a memory address immediately, leading to
a situation in which TLS references are used for an already destroyed
scope. Instead, an incrementing ID is used as the cache key.

Fixes envoyproxy#3214

Signed-off-by: Matt Klein <mklein@lyft.com>

Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Yangmin Zhu <ymzhu@google.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
…mode (envoyproxy#3209)

Config: cluster is static with health checking enabled.
Added unit test to run the same procedure as when in config
verification mode.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Mark Woo <wcgwuxinwei@gmail.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
This will be used for testing utilities such as capture2pcap in CI.

Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
* common/network: GCC8 memcpy error

```
source/common/network/address_impl.cc: In member function 'virtual absl::uint128 Envoy::Network::Address::Ipv6Instance::Ipv6Helper::address() const':
source/common/network/address_impl.cc:188:69: error: 'void* memcpy(void*, const void*, size_t)' copying an object of type 'class absl::uint128' with 'private' member 'absl::uint128::lo_' from an array of 'const uint8_t [16]' {aka 'const unsigned char [16]'}; use assignment or copy-initialization instead [-Werror=class-memaccess]
   memcpy(&result, &address_.sin6_addr.s6_addr, sizeof(absl::uint128));
                                                                     ^
In file included from bazel-out/k8-opt/bin/include/envoy/network/_virtual_includes/address_interface/envoy/network/address.h:13,
                 from bazel-out/k8-opt/bin/source/common/network/_virtual_includes/address_lib/common/network/address_impl.h:12,
                 from source/common/network/address_impl.cc:1:
external/com_google_absl/absl/numeric/int128.h:81:19: note: 'class absl::uint128' declared here
 class alignas(16) uint128 {
                   ^~~~~~~
cc1plus: all warnings being treated as errors
Target //source/exe:envoy-static failed to build
```

Signed-off-by: Vincent Batts <vbatts@hashbangbash.com>

* server/hot_restart_impl: GCC8 memset error

```
source/server/hot_restart_impl.cc: In member function 'virtual void Envoy::Server::HotRestartImpl::free(Envoy::Stats::RawStatData&)':
source/server/hot_restart_impl.cc:171:46: error: 'void* memset(void*, int, size_t)' clearing an object of type 'struct Envoy::Stats::RawStatData' with no trivial copy-assignment [-Werror=class-memaccess]
   memset(&data, 0, Stats::RawStatData::size());
```

Signed-off-by: Vincent Batts <vbatts@hashbangbash.com>

Signed-off-by: Rama <rama.rao@salesforce.com>
…o upstream (envoyproxy#3243)

* websocket: fix crash when data is received after failing to connect to upstream

Fixes envoyproxy#3236

Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
…th checks fail. (envoyproxy#3171)

* cluster: Add option to close tcp_proxy connections when health checks fail.

Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
…d for per-filter config. (envoyproxy#3268)

Signed-off-by: Greg Greenway <ggreenway@apple.com>

Signed-off-by: Rama <rama.rao@salesforce.com>
Previously, TLS Inspector set detected transport protocol to either "tls"
or "raw_buffer", however, such approach didn't allow multiple sniffers to
be installed at the same time, since they would overwrite results from
each other.

After this change, protocol sniffers will set detected transport protocol
only when they find the one they know about it, and listener will set the
default one ("raw_buffer") in case no known transport protocol was found.

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
…ed Status. (envoyproxy#3272)

Signed-off-by: Trevor Schroeder <trevors@google.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
Adds the --flaky_test_attempts flag for OS X CI builds to paper over the flaky tests noted in envoyproxy#2428. Tests with "integration" in their names will be retried on failure, and the CI will succeed if the retry succeeds.

Risk Level: Low
Testing: n/a
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Stephan Zuercher <stephan@turbinelabs.io>
Signed-off-by: Rama <rama.rao@salesforce.com>
…yproxy#3267)

* cleanup: move filter wrappers into Network namespace.

No functional changes.

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
The test assumes that IPv4 is always supported. It is not. This PR parameterizes it to run in IPv6-only environments.

Risk Level: Low

Signed-off-by: Trevor Schroeder <trevors@google.com>

Signed-off-by: Rama <rama.rao@salesforce.com>
…oyproxy#3215)

Allow `upstream.zone_routing.enabled` and `upstream.zone_routing.min_cluster_size`
to be set through the configuration API.

It closes the loop on the work started with envoyproxy/data-plane-api#480

Fixes: envoyproxy#1344

Signed-off-by: Marcelo Juchem <juchem@gmail.com>

Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Ryan Burn <ryan.burn@gmail.com>

Signed-off-by: Rama <rama.rao@salesforce.com>
This change unreverts:
Also fixes envoyproxy#3223

Please see the 2nd commit for the actual changes. The changes are:

1) Bring in a new histogram library version with the fix (and more debug
   checking).
2) Fix a small issue with scope iteration during merging.
3) Remove de-dup for histograms until we iterate to shared storage for
   overlapping scope histograms. Personally, I found this very confusing
   when debugging and I think the way I changed it is better for now given
   the code we have.

Signed-off-by: Matt Klein <mklein@lyft.com>
This PR fixes two issues:

1) Avoid having synchronized behavior across the fleet by having a seed
   applied to each WRR balancer instance.
2) Avoid having unweighted RR reset to the same host on each refresh, biasing
   towards earlier hosts in the schedule. This TODO still exists for weighted.

Risk Level: Medium (there will be observable traffic effects due to
changes in host pick schedule).
Testing: New unit tests, modified existing.


Signed-off-by: Harvey Tuch <htuch@google.com>

Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
@ramaraochavali ramaraochavali deleted the feature/optimize_json branch May 8, 2018 13:04