-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adaptive Concurrency Control L7 Filter #7789
Comments
This plans LGTM to me @tonya11en. Very excited to see this land. In the proposed patch set:
In which PR is the real algorithm going to be wired into the filter? The integration test PR? If so that makes sense to me. Also, you might want to split stats out into its own PR and do that after the first PR with the NOP filter, but up to you. |
@mattklein123 I plan to implement the actual algorithm (~gradient2 from netflix's lib) in the 2nd PR (implementation of the framework). |
OK. Since the algorithm should be behind an interface, I would prefer to see a dedicated algorithm PR with tests, and then we can do a further PR to actually integrate the real algorithm into the filter along with integration tests. |
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions. |
Keepalive.
…On Sun, Sep 1, 2019, 2:35 PM stale[bot] ***@***.***> wrote:
This issue has been automatically marked as stale because it has not had
activity in the last 30 days. It will be closed in the next 7 days unless
it is tagged "help wanted" or other activity occurs. Thank you for your
contributions.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#7789?email_source=notifications&email_token=AAIOZ7WE3TLS36WQUOHQU2LQHQYSHA5CNFSM4IILPIWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5ULCEY#issuecomment-526954771>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAIOZ7TZTJF2LNQXNBXZUZDQHQYSHANCNFSM4IILPIWA>
.
|
Would there be any chance in this proposal to allow concurrency control at a finer level? Maybe some form of client_id passed from callers that Envoy could maintain separate concurrency limits for? There are sets of problems that an approach similar to client side throttling in https://landing.google.com/sre/sre-book/chapters/handling-overload/ might be able to help solve. |
@mmckeen I've been discussing this offline with @tonya11en. I think we do want to allow concurrency to be set at a finer level either via upstream cluster, arbitrary header (e.g. client id), etc. We can discsus this once we land the initial support. |
@mmckeen in addition to what Matt said, what you mention is an approach for implementing QoS using this concurrency control filter. This is under consideration as a next step and I had some thoughts on this - I'll CC you on the brainstorming doc when the time comes if you'd like. |
@PiotrSikora there's currently a necessary default maximum concurrency in this filter's config proto. In steady-state, the concurrency limit continues to increase until there is a measurable increase in latency. This might cause more problems for you with regards to istio/istio#17233. |
@tonya11en thanks! |
Just to provide an update on what we have so far, I wanted to share some experimental data. Once we merge #8524, we'll have an MVP for the adaptive concurrency filter. It's behaving as expected with the bugfix. Below you'll find some output from a client/server load generator that I hacked together. It's a 160s test with RPS slowly ramping up from to values that tip the fake server over. The fake server injects latencies of p99=150ms and p90=50ms with 8 worker threads. The fake client will time out a request at 1s. Filter passthrough, no concurrency limitingBelow is the output of a test run without circuit breaking or adaptive concurrency. It's just a simple pass-through to compare against. Adaptive concurrencyThe adaptive concurrency filter appears to keep the latencies relatively close to the ideal. The filter deduces the ideal concurrency (we know this to be ~8 based on the number of workers) purely based on latency samples. Things look promising. I'll be writing up some thorough documentation once the remaining 2 patches are merged, then I'll see how it performs in our staging environment within the next few weeks. If anyone would like to assist in hardening the filter, I'd be happy to work with you. |
Adaptive Concurrency Control HTTP Filter
TL; DR:
L7 filter to dynamically update request concurrency values based on sampled request latencies.
Motivation
At Lyft, we’re using Envoy’s circuit breaking system to defend our services from load caused by too many concurrent requests. The circuit breakers currently require service owners to manually set limits for each service/upstream cluster. While the feature is beneficial, the service owners are expected to understand the ideal concurrency limits of their services, configure them, and update the limits as their service evolves. This is not a problem if the concurrency values are always tuned properly, but the inevitable stale values have caused several incidents in production.
This work aims to remove the need for manual tuning of these circuit breaking parameters and keep latencies low by using techniques from Netflix’s concurrency limits Java library.
Planned Patches
The filter will be committed in stages to make the review process easier. Here's an ordered list of what is currently planned:
Nice-to-haves:
Preliminary Data
To test the various concurrency control algorithms, I used bufferbloater (whose functionality will eventually be added to Nighthawk) and a proof-of-concept implementation in my envoy-filter-example fork.
Identical tests were performed on a system with sane CB settings configured and adaptive concurrency configured. The client RPS to the server remained constant, but the server's response times varied every 30 seconds.
Circuit breaker configured
Adaptive concurrency control configured
So far, the results seem promising. The proof-of-concept filter was able to replicate the circuit breaker behavior, but did so with latencies that were <= the CB test latencies. All while not requiring any manual concurrency configurations.
The text was updated successfully, but these errors were encountered: