
HashRing: proper debounce instead of sleep-if-seen-in4s #6342

Conversation

dkrotx
Member

@dkrotx dkrotx commented Oct 8, 2024

What changed?
The current mechanism by which HashRing handles updates from PeerProvider
(ringpop/uns) works like a ratelimiter: it allows the first update, but then
drops any further updates within the next 4s. This leads to a situation
which often occurs in production:

  1. Some host disappears from membership
     HashRing refresh()-es the state and notifies listeners
  2. Half a second later another host appears
     HashRing throttles this update
  3. After the periodic forced refresh (10 seconds) HashRing finally catches up
     - it performs refresh() and notifies listeners

This leads to a 10s delay before matching instances realise that ownership
has changed.

I believe the idea behind the initial 4s throttling was "we don't want to
have 100 updates/s because of massive restarts, since we think grabbing
ownership of tasklists, and especially history, is bad". Therefore, this
change introduces a proper debounce mechanism (make a single, postponed call
a second later instead of the hundreds of calls that could otherwise take
place). We also preserve the initial behaviour of performing an immediate
call if there were no calls for a long time (twice the debounce interval).

The debouncer itself is implemented as a library with two ways of using it:
by calling a handler directly, or via a wrapper that implements the
signal-channel pattern we use in HashRing.
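
Roughly, the intended behaviour can be sketched like this (a simplified illustration with hypothetical names, not the actual implementation in this PR, which is built around a background goroutine and ticker):

```go
package membership

import (
	"sync"
	"time"
)

// debouncedCallback coalesces bursts of Handler() calls into at most one
// callback invocation per interval. If the callback has been idle for at
// least 2*interval, the next Handler() call fires it immediately; otherwise
// a single call is postponed to the end of the interval.
type debouncedCallback struct {
	mu       sync.Mutex
	interval time.Duration
	callback func()
	lastCall time.Time
	pending  bool
}

func newDebouncedCallback(interval time.Duration, callback func()) *debouncedCallback {
	return &debouncedCallback{interval: interval, callback: callback}
}

func (d *debouncedCallback) Handler() {
	d.mu.Lock()
	defer d.mu.Unlock()

	now := time.Now()
	if !d.pending && now.Sub(d.lastCall) >= 2*d.interval {
		// long idle: call immediately (asynchronously), preserving the old behaviour
		d.lastCall = now
		go d.callback()
		return
	}
	if d.pending {
		// a postponed call is already scheduled; this call is deduplicated
		return
	}
	// schedule a single postponed call for the end of the interval
	d.pending = true
	time.AfterFunc(d.interval, func() {
		d.mu.Lock()
		d.pending = false
		d.lastCall = time.Now()
		d.mu.Unlock()
		d.callback()
	})
}
```

The net effect is at most one callback per interval, regardless of how many Handler() calls arrive.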

Why?
The existing approach leads to a 10s delay before matching instances realise that task-list ownership needs to change.

How did you test it?
Staging environment at Uber: all three services (frontend, history, matching).

Potential risks

Release notes

Documentation Changes

We need DaemonStatusManager to manage Init -> Started -> Stopped transitions
- it is a common pattern we use in the Cadence server. For now I'm only
adopting it for the debouncer + hashring; other usages will be changed later.
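
As a rough illustration of the idea (a hypothetical sketch, not necessarily the API added in this PR - only TransitionToStart appears in the snippets below; TransitionToStop and the constant names are assumed):

```go
package common

import "sync/atomic"

// Hypothetical status values; the real code may use different names.
const (
	daemonStatusInitialized int32 = iota
	daemonStatusStarted
	daemonStatusStopped
)

// DaemonStatusManager guards the Init -> Started -> Stopped lifecycle so that
// duplicate or out-of-order Start()/Stop() calls can be detected and ignored.
type DaemonStatusManager struct {
	status int32
}

// TransitionToStart succeeds only for the first Start() caller.
func (m *DaemonStatusManager) TransitionToStart() bool {
	return atomic.CompareAndSwapInt32(&m.status, daemonStatusInitialized, daemonStatusStarted)
}

// TransitionToStop (assumed name) succeeds only if the daemon was started and
// has not already been stopped.
func (m *DaemonStatusManager) TransitionToStop() bool {
	return atomic.CompareAndSwapInt32(&m.status, daemonStatusStarted, daemonStatusStopped)
}
```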

codecov bot commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 95.94595% with 3 lines in your changes missing coverage. Please review.

Project coverage is 73.74%. Comparing base (fd46d4c) to head (dd0eede).
Report is 11 commits behind head on master.

Files with missing lines Patch % Lines
common/membership/debounce.go 96.36% 1 Missing and 1 partial ⚠️
common/membership/hashring.go 93.33% 0 Missing and 1 partial ⚠️
Additional details and impacted files
Files with missing lines Coverage Δ
common/daemon.go 100.00% <100.00%> (ø)
common/membership/hashring.go 90.39% <93.33%> (-0.57%) ⬇️
common/membership/debounce.go 96.36% <96.36%> (ø)

... and 38 files with indirect coverage changes



Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Member

@Shaddoll Shaddoll left a comment


After this change, the delay is reduced to 1s?

Comment on lines 65 to 67
func (d *DebouncedCallback) Start() {
	if !d.status.TransitionToStart() {
		return
Member

@Groxx Groxx Oct 9, 2024


tbh I'd treat these as major errors. mis-calling lifecycle events like this is generally a sign of fatal flaws somewhere, like double-(un)locking a mutex

Member Author


What do you mean by "tbh I'd treat these as major errors"?
They shouldn't cause errors/panics. In my mind, we should just ignore double-Start / double-Stop.

Member

@Groxx Groxx Oct 10, 2024


why would duplicate starting be a desirable thing to allow though? By allowing it, you're requiring every future modifier to figure out whether that's needed / whether it's in use, and that kind of thing tends to be quite a lot of work because it's internal state. And users of it tend to be less trusting about whether something has been started because they can't necessarily tell, but starting is safe, so they Start() multiple times, which makes the above "is this used" check even harder.

this kind of unused-lifecycle-flexibility tends to become a major complicator in the long run. if we're not getting some concrete benefit from it, we really shouldn't build it.

(stopping is less dubious, since it simplifies defer stop(). generally though there's some clear entry-and-exit point that should be doing it, and doing it anywhere else is unnecessary and sometimes wrong)

Member

@Groxx Groxx Oct 10, 2024


For clarification here: part of this is just that I really hate that daemon API / the atomics-pattern through the codebase.
But making a wrapper for it makes plenty of sense, tests look good, no complaint at all about that.

It's more that things using the Daemon API almost always have clear expectations about who should call what, when, but are doing nothing to enforce that. That kind of thing has repeatedly led to unnecessary calls making it extremely hard to figure out what's actually happening, and I really don't want to make that situation worse :\

But that's not really an issue for an isolated thing used in one place. And we already have this pattern, and continuing it isn't immediately damaging or anything.

So no request to make changes, just complaining out loud :) If you've got ideas, I'd love to see them, but it definitely does not need to happen in this PR.

Member

@Groxx Groxx left a comment


... I think I'm going to leave a fairly general block on this.

  • descriptions are not accurate.
    • e.g. "never ignores calls to handler - only postpone" but if two calls come in, the second is ignored.
    • more generally, this seems to be a bit like a combination between "throttle" and "lossy ticker" and "debounce" but those are rather different things, and I'm not sure the blend is beneficial (or what anyone will expect, given the names)
  • there's a constant upkeep cost even when idle, due to the goroutine+ticker. how many of these are we expecting, particularly since it's in common and not private to membership? seems likely worth optimizing, and probably not all that hard.
  • it exposes Lock() and Unlock() methods, but they don't seem useful to expose / are arguably dangerous leaks of internal state
  • the "if long idle, trigger immediately" makes sense given what we generally want hashring updates to do, but the way this is implemented will mean that on a 1 second loop stuff like this is possible:
    • Handle() -> triggers immediately
    • 1.5s passes
    • Handle() -> does not trigger immediately
    • 0.5s passes -> triggers
    • 1.99s passes
    • Handle() -> races with the background loop, could mean
      • background read fails
      • send succeeds
      • out-of-date check fails
    • 1s passes -> triggers

What's the intended behavior? I'd be willing to bet it can be implemented much more simply.

More tricky test of DaemonStatus
Lock/Unlock are no longer exposed from CallbackDebounce
Moved Debounce to membership package so we can be more opinionated in
decisions
@dkrotx dkrotx requested a review from Groxx October 10, 2024 14:05
Member

@Groxx Groxx left a comment


The mutex is now hidden, but everything else from last time is still in here.

And e.g. the tests don't match the use (there aren't any concurrent tests on this concurrency-using-and-concurrent-call-expected code), the description doesn't match the intended (afaik) or implemented behavior, and the resulting behavior doesn't seem to match what we want (e.g. why the double interval? I don't see any sign of that in the original code).
It's really not meeting the baseline requirements for something like this: I'm not sure if you know what it's doing because nothing matches, and it's taking a lot of effort for me to understand all the unexplained edge cases and figure out if they're dangerous or not. It clearly looks close and race-free but beyond that it's hard to be confident.

(edit: though reading more closely, yea - DebouncedCallback does debounce, not sure why I thought it didn't earlier. though I don't think that's what we want here since it can starve updates.)

It does seem close to what we want, and I don't think the gaps will cause issues (AFAICT it achieves something close enough and seems safe), but that's not really a good enough reason to add code like this.


updating after a chat:

  • Thanks for spotting my exploratory-test flaw :) I must need more sleep or something.
    • Seems to behave like I had thought, and no races/major misbehaviors/etc. So it is and was pretty close, behavior-wise.
  • I do think this needs a (very simple) concurrent test that matches how it'll be used, i.e. multiple goroutines calling Handler() (see the sketch at the end of this comment).
    • Basically just to push the race detector a bit more.
    • Even just two goroutines with one call each is probably fine, and it'll plug a big gap that future changes could fall into. Nothing fancy needed.
  • "always call the callback async" seems worth doing, blended sync + async callbacks can be a really nasty footgun since they often lead to easy-to-miss-in-tests deadlocks
  • Inside membership, small behavioral imperfections are fine as long as they're not dangerous, and nothing I see here is dangerous.
    • E.g. the buffered channel allows reading up to 2 values "immediately", not separated by at least interval. But we won't be using it in a way that risks this, and it won't break anything even if it happens, so that seems fine.
    • My earlier block here was because it was presenting itself as a much more general tool in common, where stuff like that may not be safe. That's no longer a concern 👍

And adding a commit-message and possibly a go-doc for a more precise description of the behavior is probably good too, e.g. maybe:

If no Handler() call has occurred for at least [interval],
a callback will be immediately performed (asynchronously).

After that, any additional calls to Handler() within that
interval will be deduplicated, and one call will be enqueued
for the end of the interval.

This will lead to at most one call to the callback per
[interval], regardless of Handler() call patterns.

since it's close to a "debounce" but isn't exactly one (for good reasons).
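
A concurrent test of the kind suggested above could be as small as the following sketch (it reuses the hypothetical newDebouncedCallback constructor from the description earlier; the actual type and constructor in the PR may differ):

```go
package membership

import (
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

// Two goroutines calling Handler() concurrently - nothing fancy, just enough
// to give the race detector something to chew on.
func TestDebouncedCallbackConcurrentHandler(t *testing.T) {
	var calls int32
	d := newDebouncedCallback(10*time.Millisecond, func() {
		atomic.AddInt32(&calls, 1)
	})

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			d.Handler()
		}()
	}
	wg.Wait()

	// Give the asynchronous callback(s) time to run: one immediate call for
	// the idle case plus at most one postponed call for the coalesced burst.
	time.Sleep(50 * time.Millisecond)
	if n := atomic.LoadInt32(&calls); n < 1 || n > 2 {
		t.Fatalf("expected 1 or 2 callback invocations, got %d", n)
	}
}
```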

Comment on lines +292 to +295
case <-r.debounce.Chan():
	callRefresh()
case <-refreshTicker.Chan():
	callRefresh()
Member


the earlier-code equivalent would probably be to r.debounce.Handler() inside the refresh ticker, so it's debounced with updates / we don't refresh 2x quickly.

but this seems fine either way tbh. and in some ways doing it separately is more bug-tolerant since it ensures at least some updates occur.
not requesting a change, just pointing out that it was changed, in case that wasn't intended.
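
In code, the earlier-code equivalent described above would look roughly like this fragment of the same select loop (other cases elided; names taken from the quoted snippet):

```go
select {
case <-r.debounce.Chan():
	callRefresh()
case <-refreshTicker.Chan():
	// feed the periodic tick through the debouncer instead of refreshing
	// directly, so it is coalesced with membership-update refreshes and we
	// don't refresh twice in quick succession
	r.debounce.Handler()
}
```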

	defer wg.Done()
	<-changeCh
}()
assert.True(t, common.AwaitWaitGroup(&wg, maxTestDuration), "first notification has to be there")
Member


maybe worth mentioning that this is the "ring-started" refresh notification, as "first" is a bit vague

Comment on lines +303 to +305
updateHandler(ChangedEvent{}) // we call updateHandler, but .GetMembers provides the same set of hosts
time.Sleep(100 * time.Millisecond)
assert.True(t, len(changeCh) == 0, "should not be notified again since membership remained the same")
Member

@Groxx Groxx Oct 10, 2024


this seems to be mimicking the earlier test so it seems probably fine, but:

if the change event was just being delayed due to the debouncer, would this test still pass? and is there an easy way to assert that that isn't the reason there were no changes?

@Groxx
Member

Groxx commented Oct 10, 2024

I just finished rereading tests / hashring code more carefully: looks good and probably not going to be affected by any of the coming changes. So it should be a fairly easy re-review 👍

@dkrotx dkrotx closed this Oct 11, 2024
@dkrotx
Member Author

dkrotx commented Oct 11, 2024

Closing in favor of a much simpler temporary solution: #6357
