HashRing: proper debounce instead of sleep-if-seen-in4s #6342
Conversation
The current mechanism by which HashRing handles updates from PeerProvider (ringpop/uns) works like a rate limiter: it allows the first update, but then drops any further updates within the next 4s. This leads to a situation which often takes place in production:
1. Some host disappears from membership; HashRing refresh()-es the state and notifies listeners.
2. Half a second later another host appears; HashRing throttles this update.
3. After the periodic forced refresh (10 seconds) HashRing finally catches up: it performs refresh() and notifies listeners.
This leads to a 10s delay before matching instances realise the ownership has changed. I believe the idea behind the initial 4s throttling was "we don't want to have 100 updates/s because of massive restarts, since we think grabbing ownership of tasklists and especially history is bad". Therefore, the change brings a fair debounce mechanism: do a single, postponed call a second later instead of the hundreds of calls which could otherwise take place. We also preserve the initial behaviour of performing an immediate call if there were no calls for a long time (twice the debounce interval). The debouncer itself is implemented as a library with two ways of using it: by calling a handler, or via a wrapper that implements the signal-channel pattern we use in HashRing.
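For illustration only, the general shape is roughly the following minimal sketch. The type and method names (DebouncedCallback, Start, Stop, Handler) follow the diff discussed below, but the constructor and body are simplified assumptions, and the "immediate call after a long idle period" behaviour is left out:

```go
package debounce

import "time"

// DebouncedCallback (sketch): a background loop wakes up once per interval
// and, if any Handler() calls were seen since the last firing, performs a
// single callback on their behalf.
type DebouncedCallback struct {
	interval time.Duration
	callback func()
	pending  chan struct{} // buffered, size 1: "at least one call was seen"
	done     chan struct{}
}

// NewDebouncedCallback is a hypothetical constructor for this sketch.
func NewDebouncedCallback(interval time.Duration, callback func()) *DebouncedCallback {
	return &DebouncedCallback{
		interval: interval,
		callback: callback,
		pending:  make(chan struct{}, 1),
		done:     make(chan struct{}),
	}
}

func (d *DebouncedCallback) Start() {
	go func() {
		ticker := time.NewTicker(d.interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				select {
				case <-d.pending: // at least one update arrived this interval
					d.callback()
				default: // idle interval: nothing to do
				}
			case <-d.done:
				return
			}
		}
	}()
}

func (d *DebouncedCallback) Stop() { close(d.done) }

// Handler records that an update arrived; hundreds of calls within one
// interval collapse into a single callback at the next tick.
func (d *DebouncedCallback) Handler() {
	select {
	case d.pending <- struct{}{}:
	default: // a callback is already pending, nothing more to record
	}
}
```

The signal-channel variant mentioned above is the same idea with the callback replaced by a send on a buffered channel that HashRing's event loop can select on.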
We need DaemonStatusManager to manage Init -> Started -> Stopped transitions; it is a common pattern we use in the Cadence server. For now I'm only adopting it for the debouncer + hashring, and will migrate the other usages later.
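The lifecycle guard behind that is the usual atomic compare-and-swap pattern. A minimal sketch, for illustration only (TransitionToStart appears in the diff below; TransitionToStop is assumed by analogy and may not match the real DaemonStatusManager):

```go
package daemon

import "sync/atomic"

const (
	statusInitialized int32 = iota
	statusStarted
	statusStopped
)

// StatusManager guards the Init -> Started -> Stopped transitions.
type StatusManager struct {
	status int32
}

// TransitionToStart succeeds only for the first caller; duplicate Start()
// calls see the CAS fail and can simply be ignored.
func (s *StatusManager) TransitionToStart() bool {
	return atomic.CompareAndSwapInt32(&s.status, statusInitialized, statusStarted)
}

// TransitionToStop succeeds only once, and only after a successful start.
func (s *StatusManager) TransitionToStop() bool {
	return atomic.CompareAndSwapInt32(&s.status, statusStarted, statusStopped)
}
```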
After this change, the delay is reduced to 1s?
common/debounce/callback.go
func (d *DebouncedCallback) Start() {
	if !d.status.TransitionToStart() {
		return
tbh I'd treat these as major errors. mis-calling lifecycle events like this is generally a sign of fatal flaws somewhere, like double-(un)locking a mutex
what do you mean by "tbh I'd treat these as major errors"?
They shouldn't cause errors/panics. In my mind, we should just ignore double-Start, double-Stop
why would duplicate starting be a desirable thing to allow though? by allowing it, you're requiring every future modifier to figure out if that's needed / if it's in use, and that kind of thing tends to be quite a lot of work because it's internal state. and users of it tend to be less trusting about whether something has been started, because they can't necessarily tell, but starting is safe so they Start() multiple times, which makes the "is this used" check from above even harder.
this kind of unused-lifecycle-flexibility tends to become a major complicator in the long run. if we're not getting some concrete benefit from it, we really shouldn't build it.
(stopping is less dubious, since it simplifies defer stop(). generally though there's some clear entry-and-exit point that should be doing it, and doing it anywhere else is unnecessary and sometimes wrong)
For clarification here: part of this is just that I really hate that daemon API / the atomics-pattern through the codebase.
But making a wrapper for it makes plenty of sense, tests look good, no complaint at all about that.
It's more that things using the Daemon API almost always have clear expectations about who should call what, when, but are doing nothing to enforce that. That kind of thing has repeatedly led to unnecessary calls making it extremely hard to figure out what's actually happening, and I really don't want to make that situation worse :\
But that's not really an issue for an isolated thing used in one place. And we already have this pattern, and continuing it isn't immediately damaging or anything.
So no request to make changes, just complaining out loud :) If you've got ideas, I'd love to see them, but it definitely does not need to happen in this PR.
... I think I'm going to leave a fairly general block on this.
- descriptions are not accurate.
- e.g. "never ignores calls to handler - only postpone" but if two calls come in, the second is ignored.
- more generally, this seems to be a bit like a combination between "throttle" and "lossy ticker" and "debounce" but those are rather different things, and I'm not sure the blend is beneficial (or what anyone will expect, given the names)
- there's a constant upkeep cost even when idle, due to the goroutine+ticker. how many of these are we expecting, particularly since it's in common and not private to membership? seems likely worth optimizing, and probably not all that hard.
- it exposes Lock() and Unlock() methods, but they don't seem useful to expose / are arguably dangerous leaks of internal state
- the "if long idle, trigger immediately" makes sense given what we generally want hashring updates to do, but the way this is implemented will mean that on a 1 second loop stuff like this is possible:
- Handle() -> triggers immediately
- 1.5s passes
- Handle() -> does not trigger immediately
- 0.5s passes -> triggers
- 1.99s passes
- Handle() -> races with the background loop, could mean:
  - background read fails
  - send succeeds
  - out-of-date check fails
- 1s passes -> triggers
What's the intended behavior? I'd be willing to bet it can be implemented much more simply.
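For what it's worth, one simpler shape (sketched here purely as an illustration, not as what the PR implements) would arm a timer only on demand; that removes the idle goroutine+ticker cost and folds the "fire immediately when idle" rule into Handle():

```go
package membership

import (
	"sync"
	"time"
)

type debouncer struct {
	mu       sync.Mutex
	lastFire time.Time
	pending  *time.Timer
	interval time.Duration
	callback func()
}

// Handle schedules at most one callback per interval. If nothing has fired
// for at least one interval, the callback runs (asynchronously) right away;
// otherwise it is deferred to the end of the current interval, and repeated
// calls in between are coalesced into that single firing.
func (d *debouncer) Handle() {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.pending != nil {
		return // a firing is already scheduled
	}
	delay := d.interval - time.Since(d.lastFire)
	if delay < 0 {
		delay = 0 // long idle: fire immediately (still asynchronously)
	}
	d.pending = time.AfterFunc(delay, func() {
		d.mu.Lock()
		d.lastFire = time.Now()
		d.pending = nil
		d.mu.Unlock()
		d.callback()
	})
}
```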
- More tricky test of DaemonStatus
- Lock/Unlock are no longer exposed from CallbackDebounce
- Moved Debounce to the membership package so we can be more opinionated in decisions
The mutex is now hidden, but everything else from last time is still in here.
And e.g. the tests don't match the use (there aren't any concurrent tests on this concurrency-using-and-concurrent-call-expected code), the description doesn't match the intended (afaik) or implemented behavior, and the resulting behavior doesn't seem to match what we want (e.g. why the double interval? I don't see any sign of that in the original code).
It's really not meeting the baseline requirements for something like this: I'm not sure if you know what it's doing because nothing matches, and it's taking a lot of effort for me to understand all the unexplained edge cases and figure out if they're dangerous or not. It clearly looks close and race-free but beyond that it's hard to be confident.
(edit: though reading more closely, yea - DebouncedCallback does debounce, not sure why I thought it didn't earlier. though I don't think that's what we want here since it can starve updates.)
It does seem close to what we want, and I don't think the gaps will cause issues (AFAICT it achieves something close enough and seems safe), but that's not really a good enough reason to add code like this.
updating after a chat:
- Thanks for spotting my exploratory-test flaw :) I must need more sleep or something.
- Seems to behave like I had thought, and no races/major misbehaviors/etc. So it is and was pretty close, behavior-wise.
- I do think this needs a (very simple) concurrent test that matches how it'll be used, i.e. multiple goroutines calling Handler() (see the sketch after this list).
  - Basically just to push the race detector a bit more.
  - Even just two goroutines with one call each is probably fine, and it'll plug a big gap that future changes could fall into. Nothing fancy needed.
- "always call the callback async" seems worth doing, blended sync + async callbacks can be a really nasty footgun since they often lead to easy-to-miss-in-tests deadlocks
- Inside membership, small behavioral imperfections are fine as long as they're not dangerous, and nothing I see here is dangerous.
  - E.g. the buffered channel allows reading up to 2 values "immediately", not separated by at least interval. But we won't be using it in a way that risks this, and it won't break anything even if it happens, so that seems fine.
  - My earlier block here was because it was presenting itself as a much more general tool in common, where stuff like that may not be safe. That's no longer a concern 👍
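A minimal version of that concurrent test could look roughly like the sketch below (the constructor, interval, and testify's assert.Eventually are assumptions on top of the usual testing, sync, sync/atomic and time imports; it is not the PR's actual test code):

```go
func TestDebouncedCallback_ConcurrentHandlerCalls(t *testing.T) {
	var calls int32
	d := NewDebouncedCallback(10*time.Millisecond, func() {
		atomic.AddInt32(&calls, 1)
	})
	d.Start()
	defer d.Stop()

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			d.Handler() // concurrent calls; the main point is `go test -race`
		}()
	}
	wg.Wait()

	// Functionally we only check that the burst eventually produces a callback.
	assert.Eventually(t, func() bool {
		return atomic.LoadInt32(&calls) >= 1
	}, time.Second, 10*time.Millisecond)
}
```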
And adding a commit-message and possibly a go-doc for a more precise description of the behavior is probably good too, e.g. maybe:
If no Handler() call has occurred for at least [interval],
a callback will be immediately performed (asynchronously).
After that, any additional calls to Handler() within that
interval will be deduplicated, and one call will be enqueued
for the end of the interval.
This will lead to at most one call to the callback per
[interval], regardless of Handler() call patterns.
since it's close to a "debounce" but isn't exactly one (for good reasons).
case <-r.debounce.Chan():
	callRefresh()
case <-refreshTicker.Chan():
	callRefresh()
the earlier-code equivalent would probably be to call r.debounce.Handler() inside the refresh-ticker case (roughly as sketched below), so it's debounced with updates / we don't refresh 2x quickly.
but this seems fine either way tbh. and in some ways doing it separately is more bug-tolerant since it ensures at least some updates occur.
not requesting a change, just pointing out that it was changed, in case that wasn't intended.
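For concreteness, the alternative described above would look something like this fragment, mirroring the quoted loop (other select cases and shutdown handling omitted):

```go
for {
	select {
	case <-r.debounce.Chan():
		callRefresh()
	case <-refreshTicker.Chan():
		// feed the debouncer instead of refreshing directly, so periodic
		// refreshes are coalesced with update-driven ones
		r.debounce.Handler()
	}
}
```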
	defer wg.Done()
	<-changeCh
}()
assert.True(t, common.AwaitWaitGroup(&wg, maxTestDuration), "first notification has to be there")
maybe worth mentioning that this is the "ring-started" refresh notification, as "first" is a bit vague
updateHandler(ChangedEvent{}) // we call updateHandler, but .GetMembers provides the same set of hosts
time.Sleep(100 * time.Millisecond)
assert.True(t, len(changeCh) == 0, "should not be notified again since membership remained the same")
this seems to be mimicking the earlier test so it seems probably fine, but:
if the change event was just being delayed due to the debouncer, would this test still pass? and is there an easy way to assert that that isn't the reason there were no changes?
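One way to make that explicit (a sketch; it assumes the test can reference the debounce interval as a hypothetical debounceInterval constant) is to wait past the debounce window before asserting, so a merely-delayed notification would already have arrived:

```go
updateHandler(ChangedEvent{}) // same member set, so no change event is expected
time.Sleep(debounceInterval + 100*time.Millisecond)
assert.True(t, len(changeCh) == 0,
	"should not be notified: membership is unchanged and the debounce window has already elapsed")
```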
I just finished rereading tests / hashring code more carefully: looks good and probably not going to be affected by any of the coming changes. So it should be a fairly easy re-review 👍
Closing in favor of a much simpler temporary solution: #6357
What changed?
The current mechanism by which HashRing handles updates from PeerProvider
(ringpop/uns) works like a rate limiter: it allows the first update, but then
drops any further updates within the next 4s. This leads to a situation
which often takes place in production:
1. Some host disappears from membership; HashRing refresh()-es the state and notifies listeners.
2. Half a second later another host appears; HashRing throttles this update.
3. After the periodic forced refresh (10 seconds) HashRing finally catches up: it performs refresh() and notifies listeners.
This leads to a 10s delay before matching instances realise the ownership
has changed.
I believe the idea behind the initial 4s throttling was "we don't want to
have 100 updates/s because of massive restarts, since we think
grabbing ownership of tasklists and especially history is bad".
Therefore, the change brings a fair debounce mechanism: do a single,
postponed call a second later instead of the hundreds of calls which could
otherwise take place. We also preserve the initial behaviour of performing
an immediate call if there were no calls for a long time (twice the
debounce interval).
The debouncer itself is implemented as a library with two ways of using
it: by calling a handler, or via a wrapper that implements the signal-channel pattern we use in HashRing.
Why?
The existing approach leads to a 10s delay before matching instances realise that task-list ownership needs to change.
How did you test it?
Staging environment at Uber: all 3 services - frontend, history, matching.
Potential risks
Release notes
Documentation Changes