Poller scaling #874
base: master
Conversation
LGTM, but I didn't get into the details. I think an English description of what's happening could help. I tried to guess at one in the comments. But since the default behavior is unchanged for users, no problem.
I do think we should consider holding off (or at least holding off lang side) until a server is released that can even take advantage of this. We will want some end-to-end automated test somewhere that just proves poller auto-scaling works (often we can just do this in lang, in a smoke-test kind of integration test, just to confirm).
core/src/pollers/poll_buffer.rs
Outdated
Some(tokio::task::spawn(async move {
    // Every 100ms, compare how many tasks were ingested this period vs. the last one.
    let mut interval = tokio::time::interval(Duration::from_millis(100));
    loop {
        tokio::select! {
            _ = interval.tick() => {}
            _ = shutdown.cancelled() => { break; }
        }
        // Reset this period's counter and shift its value into the "last period" slot.
        let ingested = rhc.ingested_this_period.swap(0, Ordering::Relaxed);
        let ingested_last = rhc.ingested_last_period.swap(ingested, Ordering::Relaxed);
        rhc.scale_up_allowed
            .store(ingested_last >= ingested, Ordering::Relaxed);
    }
}))
So to confirm, the algorithm is:
The server returns the number of pollers to scale up by (or down, if negative) on each poll response, and the SDK respects that decision so long as it stays within the min/max bounds, and, in the case of scale-up, only if at least as many polls were accepted in the last 100ms period as in this 100ms period?
So if I accepted a poll 50ms ago that told me to scale up, but my last poll response was 250ms ago, I would not scale up? (Because `ingested_last` is 0 and `ingested` is 1.)
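For my own understanding, here is a minimal Rust sketch of that flow; the names (`ScalingSuggestion`, `PollerScaler`, `apply`) are hypothetical and not the PR's actual types, and the scale-up gate just stands in for the ingestion comparison from the snippet above:

```rust
// Hypothetical sketch, not the PR's API: apply a per-response scaling
// suggestion, bounded by min/max, with scale-up additionally gated.
struct ScalingSuggestion {
    /// Positive = scale up by this many pollers, negative = scale down.
    delta: i32,
}

struct PollerScaler {
    target: usize,
    min_pollers: usize,
    max_pollers: usize,
    /// Set by the 100ms background task based on recent ingestion.
    scale_up_allowed: bool,
}

impl PollerScaler {
    fn apply(&mut self, suggestion: ScalingSuggestion) {
        if suggestion.delta > 0 && !self.scale_up_allowed {
            // Scale-up is additionally gated on the ingestion comparison.
            return;
        }
        let proposed = self.target as i64 + i64::from(suggestion.delta);
        // Always clamp to the user-configured bounds.
        self.target =
            proposed.clamp(self.min_pollers as i64, self.max_pollers as i64) as usize;
    }
}
```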
I'm glad you looked at this again because it sort of obviously makes no sense. I was wondering why my testing wasn't showing good restriction of overshooting like I know it did at one point, and I had tried like a bazillion different methods but I knew this worked and it was simple, so I went back to it -- but results were inconsistent.
Somehow this comparison got flipped at some point. I've changed it back, and the results are more consistent again.
Honestly, though, I'm tempted to get rid of this entirely. My tests show it can still be "defeated" semi-often by rapid scale-ups, and ingestion is indeed going up, so there's not really any reason to say no, and then you end up with a bunch of polls sitting there once the backlog clears.
Smoothing out the calculation so that the shorter timeout gets set more reliably might have more value.
Agreed. I am always a bit wary when I see hardcoded time expectations like 100ms unless it's a good number arrived at by testing or something (which maybe it is). Maybe a more naive "do not make poller count changes more frequently than X interval" could help, but I don't understand the details and haven't run the tests.
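To illustrate, a minimal sketch of that minimum-interval gate; `ChangeRateLimiter` and its fields are hypothetical (not from this PR), and the actual interval would need to be chosen through testing:

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper, not from this PR: refuse poller-count changes that
/// arrive more often than `min_interval` apart.
struct ChangeRateLimiter {
    min_interval: Duration,
    last_change: Option<Instant>,
}

impl ChangeRateLimiter {
    fn new(min_interval: Duration) -> Self {
        Self { min_interval, last_change: None }
    }

    /// Returns true if enough time has passed to allow another change.
    fn try_change(&mut self) -> bool {
        let now = Instant::now();
        match self.last_change {
            Some(prev) if now.duration_since(prev) < self.min_interval => false,
            _ => {
                self.last_change = Some(now);
                true
            }
        }
    }
}
```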
Yeah, I'm happy to wait for a server release so that we have an integ test. I'll add a high-level prose description of the approach somewhere too.
I intend to add a few unit tests too, and some simpler integ tests once the server is actually released with this. This PR will have to sit and wait for a server release with the changes in temporalio/temporal#7300.
What was changed
Read and handle poller scaling decisions from server
Why?
Part of the worker management effort to simplify configuration of workers for users.
Checklist
Closes
How was this tested:
Big manual tests + integ tests (to come)
Any docs updates needed?
Doc updates will come as part of adding this feature to all SDKs.