
HashRing: proper debounce instead of sleep-if-seen-in4s #6342

Conversation

dkrotx
Member

@dkrotx dkrotx commented Oct 8, 2024

What changed?
The current mechanism by which HashRing handles updates from PeerProvider
(ringpop/uns) works like a ratelimiter: it allows the first update, but then
drops any further updates within the next 4s. This leads to a situation
which often occurs in production:

  1. Some host disappears from membership
     HashRing refresh()-es the state and notifies listeners
  2. Half a second later another host appears
     HashRing throttles this update
  3. After the periodic forced refresh (10 seconds) HashRing finally catches up
     - it performs refresh() and notifies listeners

This leads to a 10s delay before matching instances realise that ownership
has changed.

I believe the idea behind the initial 4s throttling was "we don't want to
have 100 updates/s because of massive restarts, since we think grabbing
ownership of tasklists, and especially history, is bad". Therefore, this
change introduces a proper debounce mechanism (make a single, postponed call
a second later instead of the hundreds of calls that could otherwise take
place). We also preserve the initial behaviour of performing an immediate
call if there were no calls for a long time (twice the debounce interval).

The debouncer itself is implemented as a library with two ways of using it:
by calling a handler directly, or via a wrapper that implements the
signal-channel pattern we use in HashRing.
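
Roughly, the intended behaviour can be sketched like this (a simplified illustration with hypothetical names, not the actual implementation in this PR, which is built around a background goroutine and ticker):

```go
package membership

import (
	"sync"
	"time"
)

// debouncedCallback coalesces bursts of Handler() calls into at most one
// callback invocation per interval. If the callback has been idle for at
// least 2*interval, the next Handler() call fires it immediately; otherwise
// a single call is postponed to the end of the interval.
type debouncedCallback struct {
	mu       sync.Mutex
	interval time.Duration
	callback func()
	lastCall time.Time
	pending  bool
}

func newDebouncedCallback(interval time.Duration, callback func()) *debouncedCallback {
	return &debouncedCallback{interval: interval, callback: callback}
}

func (d *debouncedCallback) Handler() {
	d.mu.Lock()
	defer d.mu.Unlock()

	now := time.Now()
	if !d.pending && now.Sub(d.lastCall) >= 2*d.interval {
		// long idle: call immediately (asynchronously), preserving the old behaviour
		d.lastCall = now
		go d.callback()
		return
	}
	if d.pending {
		// a postponed call is already scheduled; this call is deduplicated
		return
	}
	// schedule a single postponed call for the end of the interval
	d.pending = true
	time.AfterFunc(d.interval, func() {
		d.mu.Lock()
		d.pending = false
		d.lastCall = time.Now()
		d.mu.Unlock()
		d.callback()
	})
}
```

The net effect is at most one callback per interval, regardless of how many Handler() calls arrive.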

Why?
The existing approach leads to a 10s delay before matching instances realise that task-list ownership needs to change.

How did you test it?
Staging environment at Uber: all three services (frontend, history, matching).

Potential risks

Release notes

Documentation Changes

We need DaemonStatusManager to manage Init -> Started -> Stopped transitions
- it is a common pattern we use in the Cadence server. For now I'm only
adopting it for the debouncer + hashring; other usages will be changed later.
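
As a rough illustration of the idea (a hypothetical sketch, not necessarily the API added in this PR - only TransitionToStart appears in the snippets below; TransitionToStop and the constant names are assumed):

```go
package common

import "sync/atomic"

// Hypothetical status values; the real code may use different names.
const (
	daemonStatusInitialized int32 = iota
	daemonStatusStarted
	daemonStatusStopped
)

// DaemonStatusManager guards the Init -> Started -> Stopped lifecycle so that
// duplicate or out-of-order Start()/Stop() calls can be detected and ignored.
type DaemonStatusManager struct {
	status int32
}

// TransitionToStart succeeds only for the first Start() caller.
func (m *DaemonStatusManager) TransitionToStart() bool {
	return atomic.CompareAndSwapInt32(&m.status, daemonStatusInitialized, daemonStatusStarted)
}

// TransitionToStop (assumed name) succeeds only if the daemon was started and
// has not already been stopped.
func (m *DaemonStatusManager) TransitionToStop() bool {
	return atomic.CompareAndSwapInt32(&m.status, daemonStatusStarted, daemonStatusStopped)
}
```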

codecov bot commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 95.94595% with 3 lines in your changes missing coverage. Please review.

Project coverage is 73.74%. Comparing base (fd46d4c) to head (dd0eede).
Report is 11 commits behind head on master.

Files with missing lines Patch % Lines
common/membership/debounce.go 96.36% 1 Missing and 1 partial ⚠️
common/membership/hashring.go 93.33% 0 Missing and 1 partial ⚠️
Additional details and impacted files
Files with missing lines Coverage Δ
common/daemon.go 100.00% <100.00%> (ø)
common/membership/hashring.go 90.39% <93.33%> (-0.57%) ⬇️
common/membership/debounce.go 96.36% <96.36%> (ø)

... and 38 files with indirect coverage changes



Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Member

@Shaddoll Shaddoll left a comment


After this change, the delay is reduced to 1s?

Comment on lines 65 to 67
func (d *DebouncedCallback) Start() {
	if !d.status.TransitionToStart() {
		return
Member

@Groxx Groxx Oct 9, 2024


tbh I'd treat these as major errors. mis-calling lifecycle events like this is generally a sign of fatal flaws somewhere, like double-(un)locking a mutex

Member Author


What do you mean by "tbh I'd treat these as major errors"?
They shouldn't cause errors/panics. In my mind, we should just ignore double-Start / double-Stop.

Member

@Groxx Groxx Oct 10, 2024


why would duplicate starting be a desirable thing to allow though? By allowing it, you're requiring every future modifier to figure out whether that's needed / whether it's in use, and that kind of thing tends to be quite a lot of work because it's internal state. And users of it tend to be less trusting about whether something has been started because they can't necessarily tell, but starting is safe, so they Start() multiple times, which makes the above "is this used" check even harder.

this kind of unused-lifecycle-flexibility tends to become a major complicator in the long run. if we're not getting some concrete benefit from it, we really shouldn't build it.

(stopping is less dubious, since it simplifies defer stop(). generally though there's some clear entry-and-exit point that should be doing it, and doing it anywhere else is unnecessary and sometimes wrong)

Member

@Groxx Groxx Oct 10, 2024


For clarification here: part of this is just that I really hate that daemon API / the atomics-pattern through the codebase.
But making a wrapper for it makes plenty of sense, tests look good, no complaint at all about that.

It's more that things using the Daemon API almost always have clear expectations about who should call what, when, but are doing nothing to enforce that. That kind of thing has repeatedly led to unnecessary calls making it extremely hard to figure out what's actually happening, and I really don't want to make that situation worse :\

But that's not really an issue for an isolated thing used in one place. And we already have this pattern, and continuing it isn't immediately damaging or anything.

So no request to make changes, just complaining out loud :) If you've got ideas, I'd love to see them, but it definitely does not need to happen in this PR.

Member

@Groxx Groxx left a comment


... I think I'm going to leave a fairly general block on this.

  • descriptions are not accurate.
    • e.g. "never ignores calls to handler - only postpone" but if two calls come in, the second is ignored.
    • more generally, this seems to be a bit like a combination between "throttle" and "lossy ticker" and "debounce" but those are rather different things, and I'm not sure the blend is beneficial (or what anyone will expect, given the names)
  • there's a constant upkeep cost even when idle, due to the goroutine+ticker. how many of these are we expecting, particularly since it's in common and not private to membership? seems likely worth optimizing, and probably not all that hard.
  • it exposes Lock() and Unlock() methods, but they don't seem useful to expose / are arguably dangerous leaks of internal state
  • the "if long idle, trigger immediately" makes sense given what we generally want hashring updates to do, but the way this is implemented will mean that on a 1 second loop stuff like this is possible:
    • Handle() -> triggers immediately
    • 1.5s passes
    • Handle() -> does not trigger immediately
    • 0.5s passes -> triggers
    • 1.99s passes
    • Handle() -> races with the background loop, could mean
      • background read fails
      • send succeeds
      • out-of-date check fails
    • 1s passes -> triggers

What's the intended behavior? I'd be willing to bet it can be implemented much more simply.

More tricky test of DaemonStatus
Lock/Unlock are no longer exposed from CallbackDebounce
Moved Debounce to membership package so we can be more opinionated in
decisions
@dkrotx dkrotx requested a review from Groxx October 10, 2024 14:05
Member

@Groxx Groxx left a comment


The mutex is now hidden, but everything else from last time is still in here.

And e.g. the tests don't match the use (there aren't any concurrent tests on this concurrency-using-and-concurrent-call-expected code), the description doesn't match the intended (afaik) or implemented behavior, and the resulting behavior doesn't seem to match what we want (e.g. why the double interval? I don't see any sign of that in the original code).
It's really not meeting the baseline requirements for something like this: I'm not sure if you know what it's doing because nothing matches, and it's taking a lot of effort for me to understand all the unexplained edge cases and figure out if they're dangerous or not. It clearly looks close and race-free but beyond that it's hard to be confident.

(edit: though reading more closely, yea - DebouncedCallback does debounce, not sure why I thought it didn't earlier. though I don't think that's what we want here since it can starve updates.)

It does seem close to what we want, and I don't think the gaps will cause issues (AFAICT it achieves something close enough and seems safe), but that's not really a good enough reason to add code like this.


updating after a chat:

  • Thanks for spotting my exploratory-test flaw :) I must need more sleep or something.
    • Seems to behave like I had thought, and no races/major misbehaviors/etc. So it is and was pretty close, behavior-wise.
  • I do think this needs a (very simple) concurrent test that matches how it'll be used, i.e. multiple goroutines calling Handler() (see the sketch at the end of this comment).
    • Basically just to push the race detector a bit more.
    • Even just two goroutines with one call each is probably fine, and it'll plug a big gap that future changes could fall into. Nothing fancy needed.
  • "always call the callback async" seems worth doing, blended sync + async callbacks can be a really nasty footgun since they often lead to easy-to-miss-in-tests deadlocks
  • Inside membership, small behavioral imperfections are fine as long as they're not dangerous, and nothing I see here is dangerous.
    • E.g. the buffered channel allows reading up to 2 values "immediately", not separated by at least interval. But we won't be using it in a way that risks this, and it won't break anything even if it happens, so that seems fine.
    • My earlier block here was because it was presenting itself as a much more general tool in common, where stuff like that may not be safe. That's no longer a concern 👍

And adding a commit-message and possibly a go-doc for a more precise description of the behavior is probably good too, e.g. maybe:

If no Handler() call has occurred for at least [interval],
a callback will be immediately performed (asynchronously).

After that, any additional calls to Handler() within that
interval will be deduplicated, and one call will be enqueued
for the end of the interval.

This will lead to at most one call to the callback per
[interval], regardless of Handler() call patterns.

since it's close to a "debounce" but isn't exactly one (for good reasons).
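
A concurrent test of the kind suggested above could be as small as the following sketch (it reuses the hypothetical newDebouncedCallback constructor from the description earlier; the actual type and constructor in the PR may differ):

```go
package membership

import (
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

// Two goroutines calling Handler() concurrently - nothing fancy, just enough
// to give the race detector something to chew on.
func TestDebouncedCallbackConcurrentHandler(t *testing.T) {
	var calls int32
	d := newDebouncedCallback(10*time.Millisecond, func() {
		atomic.AddInt32(&calls, 1)
	})

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			d.Handler()
		}()
	}
	wg.Wait()

	// Give the asynchronous callback(s) time to run: one immediate call for
	// the idle case plus at most one postponed call for the coalesced burst.
	time.Sleep(50 * time.Millisecond)
	if n := atomic.LoadInt32(&calls); n < 1 || n > 2 {
		t.Fatalf("expected 1 or 2 callback invocations, got %d", n)
	}
}
```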

Comment on lines +292 to +295
case <-r.debounce.Chan():
	callRefresh()
case <-refreshTicker.Chan():
	callRefresh()
Member


the earlier-code equivalent would probably be to r.debounce.Handler() inside the refresh ticker, so it's debounced with updates / we don't refresh 2x quickly.

but this seems fine either way tbh. and in some ways doing it separately is more bug-tolerant since it ensures at least some updates occur.
not requesting a change, just pointing out that it was changed, in case that wasn't intended.
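
In code, the earlier-code equivalent described above would look roughly like this fragment of the same select loop (other cases elided; names taken from the quoted snippet):

```go
select {
case <-r.debounce.Chan():
	callRefresh()
case <-refreshTicker.Chan():
	// feed the periodic tick through the debouncer instead of refreshing
	// directly, so it is coalesced with membership-update refreshes and we
	// don't refresh twice in quick succession
	r.debounce.Handler()
}
```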

	defer wg.Done()
	<-changeCh
}()
assert.True(t, common.AwaitWaitGroup(&wg, maxTestDuration), "first notification has to be there")
Member


maybe worth mentioning that this is the "ring-started" refresh notification, as "first" is a bit vague

Comment on lines +303 to +305
updateHandler(ChangedEvent{}) // we call updateHandler, but .GetMembers provides the same set of hosts
time.Sleep(100 * time.Millisecond)
assert.True(t, len(changeCh) == 0, "should not be notified again since membership remained the same")
Member

@Groxx Groxx Oct 10, 2024


this seems to be mimicking the earlier test so it seems probably fine, but:

if the change event was just being delayed due to the debouncer, would this test still pass? and is there an easy way to assert that that isn't the reason there were no changes?

@Groxx
Member

Groxx commented Oct 10, 2024

I just finished rereading tests / hashring code more carefully: looks good and probably not going to be affected by any of the coming changes. So it should be a fairly easy re-review 👍

@dkrotx dkrotx closed this Oct 11, 2024
@dkrotx
Member Author

dkrotx commented Oct 11, 2024

Closing in favor of a much simpler temporary solution: #6357
