Topology change resilience for incremental repair #1235
Conversation
Force-pushed from accb60c to 27186ed
This is still a work in progress. I have written a test (which is passing), but it isn't clear to me that this should actually be working.
Force-pushed from 27186ed to 323dbaa
Manual testing of this change suggests that there are still possible improvements, but that it does address the specific requirements in the ticket.

First, I tried starting the repair, pausing it, then doing a rolling restart on the cluster. I've included a log below showing the results. The problem there was that no coordinators could be found.

test-dc1-reaper-94769775c-npd4n.log

But that problem is subtly different from the one we are trying to address here. The matter at hand is not a full rolling restart (where every IP changes) but the case where a single node's IP changes. I recommend creating another ticket to handle that more extreme case by redoing the DNS lookup, so that the list of potential coordinators is refreshed when an FQDN is used as the contact point.

To make this test more specific, I created a new repair, paused it, took note of the host the current segment was running against, and restarted just that node (leaving the other two in place). I do still get the error below, but the repair appears to reschedule correctly:
You'll note from the screenshot below that it does switch to a different coordinator for the subsequent two segments, which I think is the behaviour we want.
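The follow-up ticket suggested above would re-resolve FQDN contact points on each run so that potentialCoordinators tracks the nodes' current IPs. A minimal Java sketch of that idea (the class and method names here are illustrative, not Reaper's actual API):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: redo the DNS lookup for each contact point at the
// start of a repair run, so the potential-coordinator list reflects the
// cluster's current IPs rather than the addresses seen at startup.
public class ContactPointRefresher {

  public static List<String> refresh(List<String> contactPoints) {
    List<String> currentIps = new ArrayList<>();
    for (String host : contactPoints) {
      try {
        // Resolves an FQDN to all of its current addresses; a literal IP
        // simply passes through unchanged.
        for (InetAddress addr : InetAddress.getAllByName(host)) {
          currentIps.add(addr.getHostAddress());
        }
      } catch (UnknownHostException e) {
        // Hosts that no longer resolve cannot coordinate; skip them.
      }
    }
    return currentIps;
  }
}
```

Calling this once per run, rather than caching the resolution, is what would let a full rolling restart (where every IP changes) recover.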
…and looks up its current IP when determining the list of potential coordinators.
Force-pushed from 323dbaa to 92643af
We took this back to the drawing board because our JMX calls were not returning any way to identify the primary replica when doing range -> endpoint queries. The new approach instead tracks the hostID associated with a given segment and looks up its current IP when starting each run. It then uses that IP as the coordinator, ensuring we hit the same endpoint even if its IP has changed.
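The core of the approach above is treating the host ID, not the IP, as the segment's stable identity. A minimal Java sketch under that assumption (class and field names here are illustrative, not Reaper's actual API):

```java
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

// Hypothetical sketch: each segment stores the Cassandra host ID of its
// coordinator; at run time that ID is resolved against the cluster's
// current state to get the node's present address.
public class CoordinatorResolver {

  // Snapshot of cluster state: host ID -> current broadcast address.
  private final Map<UUID, String> hostIdToAddress;

  public CoordinatorResolver(Map<UUID, String> hostIdToAddress) {
    this.hostIdToAddress = hostIdToAddress;
  }

  // Look up the coordinator for a segment by its stored host ID, so an
  // IP change between runs does not strand the segment on a dead address.
  public Optional<String> coordinatorFor(UUID segmentHostId) {
    return Optional.ofNullable(hostIdToAddress.get(segmentHostId));
  }
}
```

An empty result would correspond to the host ID having left the cluster entirely, at which point the segment has to fall back to another potential coordinator.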
…at the coordinator is the node that gets swapped out.
Force-pushed from 3e16fac to c186b9a
…D field in RepairSegment.
Force-pushed from 05eb4bf to e8a2d04
…e correctly propagated into the potentialCoordinators so that repairs correctly pick them up.
Force-pushed from fbdb724 to 4a0d51a
Awesome sauce @Miles-Garnsey! Approved ✅
Allows incremental repair to survive nodes changing IP address during the repair. The hostID is now stored in each segment, and the IP address is recomputed from it when the segment runs.
Ensures that, in an incremental repair, the replica list is updated on every repair run. This addresses cases where node IPs change.
Fixes #1213