
Adaptive Concurrency Control L7 Filter #7789

Closed · 4 of 6 tasks
tonya11en opened this issue Jul 31, 2019 · 11 comments · Fixed by #8582

Labels: enhancement (Feature requests. Not bugs or questions.), no stalebot (Disables stalebot from closing an issue)
@tonya11en (Member) commented Jul 31, 2019

Adaptive Concurrency Control HTTP Filter

TL;DR:
L7 filter to dynamically update request concurrency values based on sampled request latencies.

Motivation

At Lyft, we’re using Envoy’s circuit breaking system to defend our services from load caused by too many concurrent requests. The circuit breakers currently require service owners to manually set limits for each service/upstream cluster. While the feature is beneficial, the service owners are expected to understand the ideal concurrency limits of their services, configure them, and update the limits as their service evolves. This is not a problem if the concurrency values are always tuned properly, but the inevitable stale values have caused several incidents in production.

This work aims to remove the need for manual tuning of these circuit breaking parameters and keep latencies low by using techniques from Netflix’s concurrency limits Java library.

Planned Patches

The filter will be committed in stages to make the review process easier. Here's an ordered list of what is currently planned:

  • L7 filter with calls to a noop concurrency control framework.
  • Implement concurrency control framework and unit tests.
  • Make parameters runtime configurable.
  • Integration tests.
  • Documentation.

Nice-to-haves:

  • Benchmark.

Preliminary Data

To test the various concurrency control algorithms, I used bufferbloater (whose functionality will eventually be added to Nighthawk) and a proof-of-concept implementation in my envoy-filter-example fork.

Identical tests were performed on two setups: one with sensible circuit breaker (CB) settings configured, and one with adaptive concurrency configured. The client RPS to the server remained constant, but the server's response times varied every 30 seconds.

Circuit breaker configured

[graph: realistic_cb_well_configured]

Adaptive concurrency control configured

[graph: realistic_server_degrade]

So far, the results seem promising. The proof-of-concept filter was able to replicate the circuit breaker behavior, but did so with latencies at or below those of the CB test, all without requiring any manual concurrency configuration.

@mattklein123 (Member) commented:

This plan LGTM, @tonya11en. Very excited to see this land. In the proposed patch set:

  • L7 filter with calls to a noop concurrency control framework.
  • Implement concurrency control framework and unit tests.
  • Make parameters runtime configurable.
  • Integration tests and benchmark.
  • Documentation.

In which PR is the real algorithm going to be wired into the filter? The integration test PR? If so that makes sense to me. Also, you might want to split stats out into its own PR and do that after the first PR with the NOP filter, but up to you.

@tonya11en (Member, Author) commented:

@mattklein123 I plan to implement the actual algorithm (roughly gradient2 from Netflix's library) in the 2nd PR (the framework implementation).
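For readers unfamiliar with the gradient family of limit algorithms, here is a minimal sketch of the idea in Python. This is loosely modeled on Netflix's gradient approach; the function name, smoothing factor, and headroom term are all illustrative assumptions, not the filter's actual implementation (which is C++ inside Envoy):

```python
def update_limit(limit, min_rtt, sample_rtt,
                 smoothing=0.2, min_limit=1, max_limit=1000):
    """One illustrative gradient-style concurrency limit update.

    The gradient compares the best-observed ("no-load") latency against
    the latest sampled latency. A gradient near 1.0 means no queueing
    was detected, so the limit can grow; below 1.0 means latency has
    inflated, so the limit should shrink.
    """
    gradient = max(0.5, min(1.0, min_rtt / sample_rtt))
    headroom = limit ** 0.5  # allow some probing above the current limit
    new_limit = limit * gradient + headroom
    # Smooth to avoid overreacting to a single noisy sample.
    smoothed = (1 - smoothing) * limit + smoothing * new_limit
    return max(min_limit, min(max_limit, smoothed))
```

With equal sampled and minimum RTTs the limit drifts upward; when sampled latency is well above the minimum, the limit is pulled back down.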

@mattklein123 (Member) commented:

OK. Since the algorithm should be behind an interface, I would prefer to see a dedicated algorithm PR with tests, and then we can do a further PR to actually integrate the real algorithm into the filter along with integration tests.

@stale bot commented Sep 1, 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale bot added the "stale" label (stalebot believes this issue/PR has not been touched recently) on Sep 1, 2019
@tonya11en (Member, Author) commented Sep 2, 2019 via email

@stale bot removed the "stale" label (stalebot believes this issue/PR has not been touched recently) on Sep 2, 2019
@mmckeen commented Sep 9, 2019

Would there be any chance in this proposal to allow concurrency control at a finer level? Maybe some form of client_id passed from callers that Envoy could maintain separate concurrency limits for? There are sets of problems that an approach similar to client side throttling in https://landing.google.com/sre/sre-book/chapters/handling-overload/ might be able to help solve.

@mattklein123 (Member) commented:

@mmckeen I've been discussing this offline with @tonya11en. I think we do want to allow concurrency to be set at a finer level, either per upstream cluster, via an arbitrary header (e.g. client ID), etc. We can discuss this once we land the initial support.

@mattklein123 added the "no stalebot" label (Disables stalebot from closing an issue) on Sep 9, 2019
@tonya11en (Member, Author) commented:

@mmckeen in addition to what Matt said, what you mention is an approach for implementing QoS on top of this concurrency control filter. This is under consideration as a next step, and I have some thoughts on it; I'll CC you on the brainstorming doc when the time comes if you'd like.

@tonya11en (Member, Author) commented:

@PiotrSikora there's currently a necessary default maximum concurrency in this filter's config proto. In steady-state, the concurrency limit continues to increase until there is a measurable increase in latency.

This might cause more problems for you with regards to istio/istio#17233.
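That steady-state growth can be illustrated with a toy update loop (the update rule, smoothing factor, and cap below are made up for illustration, not the filter's real logic): with a gradient of 1.0, meaning no measurable latency increase, every step pushes the limit upward, so only the configured maximum halts the growth.

```python
def step(limit, gradient=1.0, smoothing=0.2, max_limit=1000):
    # gradient == 1.0 models a server showing no latency inflation:
    # the probing headroom term then always proposes a higher limit.
    proposed = limit * gradient + limit ** 0.5
    return min(max_limit, (1 - smoothing) * limit + smoothing * proposed)

limit = 10.0
for _ in range(1000):
    limit = step(limit)
# the limit climbs until pinned at the configured maximum
```

This is why a default maximum concurrency in the config proto is unavoidable: without a latency signal there is no intrinsic stopping point.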

@PiotrSikora (Contributor) commented:

@tonya11en thanks!

@tonya11en (Member, Author) commented:

Just to provide an update on what we have so far, I wanted to share some experimental data. Once we merge #8524, we'll have an MVP for the adaptive concurrency filter. It's behaving as expected with the bugfix.

Below you'll find some output from a client/server load generator that I hacked together. It's a 160s test with RPS slowly ramping up to values that tip the fake server over. The fake server injects latencies of p99=150ms and p90=50ms with 8 worker threads. The fake client times out a request after 1s.

Filter passthrough, no concurrency limiting

Below is the output of a test run without circuit breaking or adaptive concurrency. It's just a simple pass-through to compare against.

[graph: prod_bypass_filter]

Adaptive concurrency

The adaptive concurrency filter appears to keep the latencies relatively close to the ideal. The filter deduces the ideal concurrency (we know this to be ~8 based on the number of workers) purely based on latency samples.

[graph: prod_with_filter]
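As a rough illustration of how a controller can deduce the worker count from latency samples alone, here's a toy simulation. The server model, gradient formula, and constants are all illustrative assumptions, not the filter's implementation; the point is only that latency inflation above the worker count pulls the limit back toward it.

```python
def server_latency(concurrency, workers=8, service_time=0.05):
    # Toy model of the fake server: once in-flight requests exceed the
    # worker threads, requests queue and observed latency inflates.
    queued = max(0, concurrency - workers)
    return service_time * (1 + queued / workers)

def update(limit, min_rtt, sample_rtt, smoothing=0.2):
    gradient = max(0.5, min(1.0, min_rtt / sample_rtt))
    # An additive headroom of 1 lets the limit probe upward whenever
    # no latency inflation is observed.
    return max(1.0, (1 - smoothing) * limit + smoothing * (limit * gradient + 1))

limit, min_rtt = 1.0, 0.05
for _ in range(300):
    sample = server_latency(round(limit))
    limit = update(limit, min_rtt, sample)
# the limit settles just above the worker count of 8
```

The controller never sees the worker count directly; the equilibrium emerges from the latency samples, which mirrors what the graphs above show.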

Things look promising. I'll be writing up some thorough documentation once the remaining 2 patches are merged, then I'll see how it performs in our staging environment within the next few weeks. If anyone would like to assist in hardening the filter, I'd be happy to work with you.
