Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: Adaptive concurrency documentation and stats #8582

Merged
merged 11 commits into from
Oct 17, 2019

Conversation

tonya11en
Copy link
Member

@tonya11en tonya11en commented Oct 11, 2019

Documentation for the adaptive concurrency control filter. This patch also introduces a new stat to the filter and adds some coverage in an existing test.

Fixes #7789

Tony Allen added 6 commits October 9, 2019 17:11
wip
Signed-off-by: Tony Allen <tallen@lyft.com>
Signed-off-by: Tony Allen <tallen@lyft.com>
Signed-off-by: Tony Allen <tallen@lyft.com>
Signed-off-by: Tony Allen <tallen@lyft.com>
Signed-off-by: Tony Allen <tallen@lyft.com>
@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.

🐱

Caused by: #8582 was opened by tonya11en.

see: more, trace.

Signed-off-by: Tony Allen <tallen@lyft.com>
Copy link
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a look at the docs from someone who hasn't been following along - seems pretty well explained in general, just a few comments

Tony Allen added 2 commits October 11, 2019 12:38
Signed-off-by: Tony Allen <tallen@lyft.com>
Signed-off-by: Tony Allen <tallen@lyft.com>
Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; I haven't been following this that closely either. The only question I have is whether there are references to other systems that implement this algorithm?

Also, this only works if the system is linear, right? Is there some heuristic to know whether adaptive control works in a given situation, or does it basically work in all practical scenarios?

return runtime_.snapshot().getDouble(RuntimeKeys::get().SampleAggregatePercentileKey,
sample_aggregate_percentile_) /
100.0;
const double val = runtime_.snapshot().getDouble(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: prefer code and major docs PR to be separate, but it's fine for this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 in the future.

@tonya11en
Copy link
Member Author

@htuch thanks for taking a look. The only other system I'm aware of that has implemented this algorithm is the Netflix concurrency limits library and the associated blog post. I'll admit I didn't look too closely at the Java library, so my implementation here might be dramatically different; however, the concurrency limit calculation should be identical.

There are still a lot that needs to be explored regarding the filter's performance under various workloads. I've only tested the filter for ingress circuit breaking while protecting a single host. Also, since the filter's decisions are measurement-based, I doubt it would perform well in situations where it cannot sample all requests going to the host or in scenarios where the latency distribution is bi-modal (since the sampleRTT is measured as a percentile).

To figure out the situations where this filter does work well, I hacked together a client/server load test with configurable RPS, concurrency, request latencies, etc. I have a lot of data showing how the filter performs under various workload patterns and scenarios where the upstream itself begins to "degrade" by artificially increasing the server's time to respond. I can share this data if you'd like, but I didn't think the configuration docs were the place for my experimental results. I did recently provide an update to the parent issue with a single experiment that might be of interest.

I haven't found any synthetic workload patterns where the filter doesn't work as expected. I'll be rolling this out in our staging environment over the next few weeks to see how it performs in real life, so stay tuned for those results.

@htuch
Copy link
Member

htuch commented Oct 14, 2019

@tonya11en thanks for the context, it sounds like this is an awesome piece of kit. I agree we don't need a complete writeup of experiments in the docs, but some explanation of how this technology works and where its limits might be would be great. I'm thinking a single paragraph, a few sentences.

Signed-off-by: Tony Allen <tony@allen.gg>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, awesome work. A few small comments. Super excited to see this land.

/wait


The adaptive concurrency filter supports the following runtime settings:

adaptive_concurrency.enabled
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It occurs to me that if we ever want to support both ingress and egress adaptive concurrency in a single side car, we will have conflicting runtime names. Would it be better to not hard code these names and instead read them from runtime value configuration fields or similar? We might consider doing this in a follow up before we consider this filter production ready. WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these runtime parameters are all overrides of the config parameters, we can just use the runtime configuration fields for the config like you mention. That'll allow for unique runtime names.

Let's knock that out in a different patch. I'll open an issue.

return runtime_.snapshot().getDouble(RuntimeKeys::get().SampleAggregatePercentileKey,
sample_aggregate_percentile_) /
100.0;
const double val = runtime_.snapshot().getDouble(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 in the future.

Signed-off-by: Tony Allen <tony@allen.gg>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mattklein123 mattklein123 merged commit 999c27b into envoyproxy:master Oct 17, 2019
@tonya11en tonya11en deleted the acc_docs branch November 13, 2019 04:35
@Onewaysidewalks
Copy link

Is there a point where this moves out of "experimental" in the documentation? (still that way as of 1.16.x)

@tonya11en
Copy link
Member Author

I'm not quite sure what the criteria are for removing that disclaimer. Have you used this filter in your deployments or are you waiting for a "green light" before trying it out?

@Onewaysidewalks
Copy link

definitely the second, as it looks like the netflix lib this was based off of (https://github.com/Netflix/concurrency-limits/) is now abandoned

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adaptive Concurrency Control L7 Filter
5 participants