-
Notifications
You must be signed in to change notification settings - Fork 9.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[release-3.5] [Solution 2] Auto sync members in v3store if Islearner is the only field that differs between v2store and v3store #19606
base: release-3.5
Are you sure you want to change the base?
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ahrtr The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
f88c944
to
3c99280
Compare
cc @fuweid @ivanvc @jmhbnz @serathius @siyuanfoundation Hi folks, can you spend at least 1 hour each day to move this forward? |
@@ -277,6 +277,9 @@ func StartEtcd(inCfg *Config) (e *Etcd, err error) { | |||
return e, err | |||
} | |||
} | |||
|
|||
e.Server.SyncLearnerPromotionIfNeeded() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not 100% sure about consequences of calling sync here. I don't know if we really don't run any goroutines in background at this point, and proving so might be very hard.
By proposing to run it during bootstrap I meant within bootstrap.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is release-3.5 isntead of main or release-3.6, there is no bootstrap.go. Note that we need to do this after corruption check,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you think this change will interact with corrupt check? Even if we do the sync before corrupt check on one member, next member will start with corrupt check before sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if we really don't run any goroutines in background at this point, and proving so might be very hard.
We call SyncLearnerPromotionIfNeeded
before starting the raft loop, before starting all the [periodically] jobs (i.e. monitorVersions
, linearizableReadLoop
, monitorKVHash
, etc), also before starting to serve client & peer requests. It's safe. What's your concern here?
How do you think this change will interact with corrupt check?
If there is data corrupt, we should leave it as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thinking if we still need this migration.
I found there is publish API call during starting. Since we already allow to override member information, this publish can help us migration (if I understand it correctly).
etcd/server/etcdserver/server.go
Line 2132 in 3c65dfa
Path: membership.MemberAttributesStorePath(s.id), |
- If there is available snapshot, the v2store contains valid data. UpdateAtribute will override existing invalid one. (REF: [3.5] tests: add TestIssue19557FixMemberDataAfterUpgrade #19627)
- If there is no snapshot, WAL replay can fix this issue as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@serathius please let me know if you still have any concern on my previous comment. I think we should be all good now.
I am thinking if we still need this migration.
I found there is publish API call during starting. Since we already allow to override member information, this publish can help us migration (if I understand it correctly).
@fuweid I don't think we are on the same page. The publish logic is about attribute, what we are resolving/talking about in this PR (and issue 19557) is about IsLearner, which is part of RaftAttribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
override member information
Yes, this is true. thx for pointing out this.
Although the publish logic is only supposed to update attribute, but the implementation just overwrites the whole member info into v3store (bbolt), so it can automatically fix the issues which are affected by #19557 on top of the patch #19563
Just raised another PR with only e2e test cases #19629, which I believe is better than #19627
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For release-3.6, we don't need any further validation (i.e. requiring 3.5.20+ before upgrading to 3.6) any more, reasons:
- If there are 2 or more (already promoted) learners, then etcd will crash anyway.
- If there is only 1 (already promoted) learner, then etcd will automatically fix it due to the same reason as mentioned above.
So for release-3.6, we only need to add two e2e test cases:
- the kubeadm style upgrade case
- verify etcd is able to automatically fix the issue when upgrading to 3.6 for the one learner case, similar to [release-3.5] Add e2e test to verify etcd is able to automatically fix the issue #19629
Note that we still need to document the requirement of requiring 3.5.20+ before upgrading to 3.6, to avoid the potential upgrading failure (as mentioned above, etcd will crash if there are 2 or more learners).
3c99280
to
2f55a7c
Compare
…ers between v2store and v3store Signed-off-by: Benjamin Wang <benjamin.ahrtr@gmail.com>
2f55a7c
to
b5a3046
Compare
Second solution to provide a fix for the users who have already been affected by #19557. It's based on the comment #19586 (comment),
shouldApplyV3
isfalse
LookOutsideApply
inside the apply workflow, so we have to disable the verification in e2e test case.The first solution is #19586
Both solutions work. If there is no strong objection or preference, let's go for solution 2 (this PR)
cc @fuweid @ivanvc @jmhbnz @serathius @siyuanfoundation