Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[release-3.5] Fix the learner promotion changes not being persisted into v3store (bbolt) #19563

Merged
merged 3 commits into from
Mar 12, 2025

Conversation

ahrtr
Copy link
Member

@ahrtr ahrtr commented Mar 10, 2025

Link to #19557

Also added a generic verification on ensure the membership data in v3store is always in sync with v2store.

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

@ahrtr ahrtr changed the title add verification to check whether membership data is in sync between … [release-3.5] add verification to check whether membership data is in sync between … Mar 10, 2025
@ahrtr ahrtr force-pushed the 3.5_learner_20250310 branch from af23911 to 6ee8abe Compare March 10, 2025 14:49
@ahrtr ahrtr force-pushed the 3.5_learner_20250310 branch from ba0b6ac to 87f5cb2 Compare March 10, 2025 15:47
@ahrtr ahrtr changed the title [release-3.5] add verification to check whether membership data is in sync between … [release-3.5] Fix the learner promotion changes not being persisted into v3store (bbolt) Mar 10, 2025
@ahrtr ahrtr force-pushed the 3.5_learner_20250310 branch 3 times, most recently from 746bb27 to 2ca9b2f Compare March 10, 2025 19:33
@ahrtr
Copy link
Member Author

ahrtr commented Mar 10, 2025

cc @fuweid @ivanvc @serathius @jmhbnz @siyuanfoundation PTAL

This PR is the first item of the proposed actions mentioned in #19557 (comment)

Once this gets merged, I will work on the second item "auto sync v3store based on v2store on bootstrap" in release-3.5 instead of release-3.6.

@ahrtr ahrtr force-pushed the 3.5_learner_20250310 branch from 2ca9b2f to ab38710 Compare March 10, 2025 22:02
@ahrtr
Copy link
Member Author

ahrtr commented Mar 10, 2025

/retest

@serathius
Copy link
Member

serathius commented Mar 11, 2025

This PR seems to fix the bug in learner promotion, but doesn't patches store for clusters that already encountered the bug. Meaning that they will still need to run the etcdutl check members && etcdutl sync members commands, making it critical command before upgrade to v3.6 or risking cluster shuttdown.

EDIT: Fixed "leader promotion" --> "learner promotion".

@ahrtr
Copy link
Member Author

ahrtr commented Mar 11, 2025

This PR seems to fix the bug in leader promotion

"leader promotion" --> "learner promotion".

Yes, exactly.

but doesn't patches store for clusters that already encountered the bug.

It will be covered in a followup PR, also mentioned this multiple times. Please see the second item as mentioned in #19557 (comment)

Meaning that they will still need to run the etcdutl check members && etcdutl sync members commands, making it critical command before upgrade to v3.6 or risking cluster shuttdown.

Already clarified this several times, please see the proposed action and also #19557 (comment)

@@ -220,8 +220,23 @@ func unsafeSaveMemberToStore(lg *zap.Logger, s v2store.Store, m *Member) error {
lg.Panic("failed to marshal raftAttributes", zap.Error(err))
}
p := path.Join(MemberStoreKey(m.ID), raftAttributesSuffix)
_, err = s.Create(p, false, string(b), false, v2store.TTLOptionSet{ExpireTime: v2store.Permanent})
return err
if _, err = s.Create(p, false, string(b), false, v2store.TTLOptionSet{ExpireTime: v2store.Permanent}); err != nil {
Copy link
Member

@serathius serathius Mar 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a separate issue. Thanks for adding validation! However I'm worried about trying to fix two issues in one PR thus not putting enough attention into understanding each of them and introducing proper testing.

Can we separate PR for this one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a separate issue.

Yes, it's a separate (but related) minor issue. When adding a new member,

  • v3store persists the whole member,
  • but v2store only persists the raftAttributes (peerURL, isLearner); so it causes a temprary inconsistency between v2 and v3store, and it depends on the following publish logic to update the attributes (name, clientURL), so it converge soon.

If we don't fix it in this PR, then the verification will fail, otherwise I will have to update the verification to ignore the attribute and only verify the raftAttributes(including peerURLs, isLearner).

Since from users perspective, this isn't an issue. We can also leave it as it's and only verify raftAttribute.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't fix it in this PR, then the verification will fail, otherwise I will have to update the verification to ignore the attribute and only verify the raftAttributes(including peerURLs, isLearner).

That's why I'm suggesting to send PR to just fix this so we can merge it fast. This could also be backported to v3.4.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this isn't an issue from user perspective, so suggest to keep it as it's. This issue only affects version 3.5 (and possibly 3.4, though unverified). Since it is a minor issue and not a concern from the user's perspective, let's just let it fade into history.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this isn't an issue from user perspective, so suggest to keep it as it's

I mean that I will revert the change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted the change.

ahrtr added 2 commits March 11, 2025 13:56
…v2store and v3store

Signed-off-by: Benjamin Wang <benjamin.ahrtr@gmail.com>
When promoting a learner, the member is definitely already
exist.

Signed-off-by: Benjamin Wang <benjamin.ahrtr@gmail.com>
@ahrtr ahrtr force-pushed the 3.5_learner_20250310 branch from 1fadfb1 to e3cd57a Compare March 11, 2025 13:57
@ahrtr ahrtr force-pushed the 3.5_learner_20250310 branch from 1609443 to 31b7495 Compare March 11, 2025 15:19
@ahrtr ahrtr force-pushed the 3.5_learner_20250310 branch from 31b7495 to cd24fe8 Compare March 11, 2025 15:52
@ahrtr
Copy link
Member Author

ahrtr commented Mar 11, 2025

Signed-off-by: Benjamin Wang <benjamin.ahrtr@gmail.com>
@ahrtr ahrtr force-pushed the 3.5_learner_20250310 branch from cd24fe8 to a1cb13b Compare March 11, 2025 16:53
@ahrtr
Copy link
Member Author

ahrtr commented Mar 11, 2025

@@ -1049,10 +1049,9 @@ func TestAddMemberSyncsBackendAndStoreV2(t *testing.T) {
backendMembers: []*Member{alice},
},
{
name: "Adding member should fail if it exists in both",
name: "Adding member should success if it exists in both",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case was meant to prevent same member being added twice, while we are fixing the issue of v3store rejecting learner being promoted. Should the test be updated to separate those two cases?

Copy link
Member Author

@ahrtr ahrtr Mar 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For v3store (bbolt), when we add a member or promote a learner, etcd just marshals the member and write the data into bbolt, if it already exists, just overwrites it. This is the behaviour in both release-3.4, release-3.6 and main. But In release-3.5, it returned an error if it already exists, it's a bug and now we fix it in this PR.

Also FYI.

  • When we add a member again, the ValidateConfigurationChange will fail.
  • We also have separate unit test case to verify the v2store will return error when adding a member again. (see line 1057 in cluster_test.go).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the test be updated to separate those two cases?

No need. Please see my response above.

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, fuweid, serathius, siyuanfoundation

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ahrtr ahrtr merged commit 2b9598b into etcd-io:release-3.5 Mar 12, 2025
24 checks passed
@ahrtr ahrtr deleted the 3.5_learner_20250310 branch March 12, 2025 12:21
@ivanvc ivanvc mentioned this pull request Mar 14, 2025
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

5 participants