
Adaptive Concurrency Control L7 Filter #7789

Closed · 4 of 6 tasks
tonya11en opened this issue Jul 31, 2019 · 11 comments · Fixed by #8582

Labels: enhancement (Feature requests. Not bugs or questions.), no stalebot (Disables stalebot from closing an issue)
@tonya11en (Member) commented Jul 31, 2019

Adaptive Concurrency Control HTTP Filter

TL;DR:
L7 filter to dynamically update request concurrency values based on sampled request latencies.

Motivation

At Lyft, we’re using Envoy’s circuit breaking system to defend our services from load caused by too many concurrent requests. The circuit breakers currently require service owners to manually set limits for each service/upstream cluster. While the feature is beneficial, the service owners are expected to understand the ideal concurrency limits of their services, configure them, and update the limits as their service evolves. This is not a problem if the concurrency values are always tuned properly, but the inevitable stale values have caused several incidents in production.

This work aims to remove the need for manual tuning of these circuit breaking parameters and keep latencies low by using techniques from Netflix’s concurrency limits Java library.

Planned Patches

The filter will be committed in stages to make the review process easier. Here's an ordered list of what is currently planned:

  • L7 filter with calls to a noop concurrency control framework.
  • Implement concurrency control framework and unit tests.
  • Make parameters runtime configurable.
  • Integration tests.
  • Documentation.

Nice-to-haves:

  • Benchmark.

Preliminary Data

To test the various concurrency control algorithms, I used bufferbloater (whose functionality will eventually be added to Nighthawk) and a proof-of-concept implementation in my envoy-filter-example fork.

Identical tests were performed on two setups: one with sensible circuit breaker (CB) settings configured, and one with adaptive concurrency configured. The client RPS to the server remained constant, but the server's response times varied every 30 seconds.

Circuit breaker configured

[graph: realistic_cb_well_configured]

Adaptive concurrency control configured

[graph: realistic_server_degrade]

So far, the results seem promising. The proof-of-concept filter was able to replicate the circuit breaker behavior, but did so with latencies at or below those of the CB test, all without requiring any manual concurrency configuration.

@mattklein123 (Member) commented:

This plan LGTM, @tonya11en. Very excited to see this land. In the proposed patch set:

  • L7 filter with calls to a noop concurrency control framework.
  • Implement concurrency control framework and unit tests.
  • Make parameters runtime configurable.
  • Integration tests and benchmark.
  • Documentation.

In which PR is the real algorithm going to be wired into the filter? The integration test PR? If so that makes sense to me. Also, you might want to split stats out into its own PR and do that after the first PR with the NOP filter, but up to you.

@tonya11en (Member, Author) commented:

@mattklein123 I plan to implement the actual algorithm (roughly gradient2 from Netflix's library) in the 2nd PR (the framework implementation).
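For readers unfamiliar with the gradient family of limit algorithms, here is a minimal sketch of the idea in Python. This is loosely modeled on Netflix's gradient approach; the function name, smoothing factor, and headroom term are all illustrative assumptions, not the filter's actual implementation (which is C++ inside Envoy):

```python
def update_limit(limit, min_rtt, sample_rtt,
                 smoothing=0.2, min_limit=1, max_limit=1000):
    """One illustrative gradient-style concurrency limit update.

    The gradient compares the best-observed ("no-load") latency against
    the latest sampled latency. A gradient near 1.0 means no queueing
    was detected, so the limit can grow; below 1.0 means latency has
    inflated, so the limit should shrink.
    """
    gradient = max(0.5, min(1.0, min_rtt / sample_rtt))
    headroom = limit ** 0.5  # allow some probing above the current limit
    new_limit = limit * gradient + headroom
    # Smooth to avoid overreacting to a single noisy sample.
    smoothed = (1 - smoothing) * limit + smoothing * new_limit
    return max(min_limit, min(max_limit, smoothed))
```

With equal sampled and minimum RTTs the limit drifts upward; when sampled latency is well above the minimum, the limit is pulled back down.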

@mattklein123 (Member) commented:

OK. Since the algorithm should be behind an interface, I would prefer to see a dedicated algorithm PR with tests, and then we can do a further PR to actually integrate the real algorithm into the filter along with integration tests.

@stale bot commented Sep 1, 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale bot added the "stale" label (stalebot believes this issue/PR has not been touched recently) on Sep 1, 2019
@tonya11en (Member, Author) commented Sep 2, 2019 via email

@stale bot removed the "stale" label (stalebot believes this issue/PR has not been touched recently) on Sep 2, 2019
@mmckeen commented Sep 9, 2019

Would there be any chance in this proposal to allow concurrency control at a finer level? Maybe some form of client_id passed from callers that Envoy could maintain separate concurrency limits for? There are sets of problems that an approach similar to client side throttling in https://landing.google.com/sre/sre-book/chapters/handling-overload/ might be able to help solve.

@mattklein123 (Member) commented:

@mmckeen I've been discussing this offline with @tonya11en. I think we do want to allow concurrency to be set at a finer level, either per upstream cluster, via an arbitrary header (e.g. client ID), etc. We can discuss this once we land the initial support.

@mattklein123 added the "no stalebot" label (Disables stalebot from closing an issue) on Sep 9, 2019
@tonya11en (Member, Author) commented:

@mmckeen in addition to what Matt said, what you mention is an approach for implementing QoS on top of this concurrency control filter. This is under consideration as a next step, and I have some thoughts on it; I'll CC you on the brainstorming doc when the time comes if you'd like.

@tonya11en (Member, Author) commented:

@PiotrSikora there's currently a necessary default maximum concurrency in this filter's config proto. In steady-state, the concurrency limit continues to increase until there is a measurable increase in latency.

This might cause more problems for you with regards to istio/istio#17233.
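That steady-state growth can be illustrated with a toy update loop (the update rule, smoothing factor, and cap below are made up for illustration, not the filter's real logic): with a gradient of 1.0, meaning no measurable latency increase, every step pushes the limit upward, so only the configured maximum halts the growth.

```python
def step(limit, gradient=1.0, smoothing=0.2, max_limit=1000):
    # gradient == 1.0 models a server showing no latency inflation:
    # the probing headroom term then always proposes a higher limit.
    proposed = limit * gradient + limit ** 0.5
    return min(max_limit, (1 - smoothing) * limit + smoothing * proposed)

limit = 10.0
for _ in range(1000):
    limit = step(limit)
# the limit climbs until pinned at the configured maximum
```

This is why a default maximum concurrency in the config proto is unavoidable: without a latency signal there is no intrinsic stopping point.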

@PiotrSikora (Contributor) commented:

@tonya11en thanks!

@tonya11en (Member, Author) commented:

Just to provide an update on what we have so far, I wanted to share some experimental data. Once we merge #8524, we'll have an MVP for the adaptive concurrency filter. It's behaving as expected with the bugfix.

Below you'll find some output from a client/server load generator that I hacked together. It's a 160s test with RPS slowly ramping up to values that tip the fake server over. The fake server injects latencies of p99=150ms and p90=50ms with 8 worker threads. The fake client times out a request after 1s.

Filter passthrough, no concurrency limiting

Below is the output of a test run without circuit breaking or adaptive concurrency. It's just a simple pass-through to compare against.

[graph: prod_bypass_filter]

Adaptive concurrency

The adaptive concurrency filter appears to keep the latencies relatively close to the ideal. The filter deduces the ideal concurrency (we know this to be ~8 based on the number of workers) purely based on latency samples.

[graph: prod_with_filter]
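As a rough illustration of how a controller can deduce the worker count from latency samples alone, here's a toy simulation. The server model, gradient formula, and constants are all illustrative assumptions, not the filter's implementation; the point is only that latency inflation above the worker count pulls the limit back toward it.

```python
def server_latency(concurrency, workers=8, service_time=0.05):
    # Toy model of the fake server: once in-flight requests exceed the
    # worker threads, requests queue and observed latency inflates.
    queued = max(0, concurrency - workers)
    return service_time * (1 + queued / workers)

def update(limit, min_rtt, sample_rtt, smoothing=0.2):
    gradient = max(0.5, min(1.0, min_rtt / sample_rtt))
    # An additive headroom of 1 lets the limit probe upward whenever
    # no latency inflation is observed.
    return max(1.0, (1 - smoothing) * limit + smoothing * (limit * gradient + 1))

limit, min_rtt = 1.0, 0.05
for _ in range(300):
    sample = server_latency(round(limit))
    limit = update(limit, min_rtt, sample)
# the limit settles just above the worker count of 8
```

The controller never sees the worker count directly; the equilibrium emerges from the latency samples, which mirrors what the graphs above show.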

Things look promising. I'll be writing up some thorough documentation once the remaining 2 patches are merged, then I'll see how it performs in our staging environment within the next few weeks. If anyone would like to assist in hardening the filter, I'd be happy to work with you.
