[release-3.5] Learner promotion isn't persisted to v3store (bbolt) #19557
Could we implement a fix without introducing a series of commands that every user who ever used learners would need to run? For example, during etcd v3.6 bootstrap, compare v2store and v3store, and if there are any unpromoted learners in v3 that are normal members in v2, then override them in v3store?
Thanks for the investigation @ahrtr
I wish we could follow this approach, as it wouldn't add any extra burden on users when upgrading to 3.6. But unfortunately, it won't work because v3store is the sole source of truth in 3.6; even the v2 snapshot is generated from v3store. So before upgrading from 3.5 to 3.6, users need to
So the proposed actions:
I understand how v3.6 is meant to treat v2 state; I'm saying that during bootstrap you still have some wiggle room. When we upgrade etcd v3.5 to v3.6, there will be v2store snapshots on disk generated by v3.5, which can be read and cross-validated with the v3 state.
What's the "wiggle room"? As mentioned above, v3store is usually more up-to-date than v2store, so theoretically, during the very first 3.6 bootstrap, the v2store might be out of date, and we might miss the only chance to sync from v2store to v3store.
Yes, but we will also replay all WAL entries from v2store. They will be skipped for v3store, but we can make an exception for learner promotion. What I'm proposing is hacky, but it should be possible. Please let me know if you think it's too risky to modify the apply loop in that way, but I want to make sure we considered alternatives.
It's too error prone, and it also greatly complicates release-3.6, which we will maintain for a long time.
In terms of the K8s release schedule, it might be safer to release 1.33 with 3.5.20 (which has the fix) instead of 3.6.0. It's also possible to backport 3.5.20 to older versions of K8s and add 3.6.0 in 1.33, but that seems more rushed and risky.
We only add the command in release-3.5.
As mentioned above, users don't necessarily have to execute the commands if they bump to 3.5.20+ anyway.
There is no plan to integrate etcd 3.6.0 with K8s 1.33; I think we are targeting 1.34.
Agreed; to bump the etcd version in K8s we would need to do it at the beginning of a cycle to get enough soak time for K8s to feel safe.
@fuweid @serathius @siyuanfoundation @ivanvc Please see the proposed action in #19557 (comment), and give feedback if you have any further comments, thanks.
I agree with @ahrtr, but for a different reason than the long maintenance time. The root cause of this issue is the inconsistency between v2store and v3store in 3.5, and that is what we really should fix. I would prefer not to complicate the 3.6 code to fix a 3.5 bug.
@fuweid @siyuanfoundation do you have time for a call today?
I agree with Siyuan's point. It should be treated as a 3.5 bug, not a 3.6 one. I also think the less the end user has to do, the better. So, upgrading to 3.5.20 before upgrading to 3.6.0 should be reasonable. If we follow that approach, does that mean we must ship 3.5.20 with K8s 1.33?
I'm concerned more with user experience than with where the bug came from. We should strive to minimize maintenance cost, but we should be careful when this impacts user experience. My proposal came from a concern about writing another 10-page set of upgrade instructions, scaring users away from upgrading and further sharding the ecosystem. In this day and age, I don't think instructions like https://etcd.io/docs/v3.3/upgrades/upgrade_3_5/ should have a place at all. Most users' expectation is that they can just replace the container tag. But we want to require users to run dedicated commands before an upgrade, or risk a complete, unrecoverable cluster shutdown when upgrading to v3.6. Requiring additional steps for an upgrade will just cause a bad user experience, eroding trust.
I think you might be lacking full context; the options are:
Option 1 - Implement resynchronization in v3.5.20. Proposed by @ahrtr, preferred by @siyuanfoundation, @ivanvc.
Downside:
Option 2 - Implement resynchronization in v3.6.0. Proposed by @serathius.
Downside:
Note: the resynchronization code is risky to implement; whether we implement it in v3.5.20 or v3.6 doesn't matter. Both will be non-trivial and require extensive testing.
Why require an upgrade through v3.5.20?
We should not require or assume that user upgrades go through each and every patch release. Many upgrade automations (including the one used in GKE) don't guarantee total control over the exact upgrade path.
Responses to comments
This is a generally valid point.
We are releasing 3.6.0 almost 4 years after 3.5.0. There are already some unavoidable items in the upgrade checklist. Long-term success depends on building a more collaborative and stronger team, and maintaining a regular release cycle (this can be discussed separately).
In my proposal, no workaround in 3.6 is needed.
No, it matters.
New comments
One concern that I have is that this bug was caught on the k/k PR by an upgrade test that runs on demand, manually. The etcd maintainers might have missed this error and merged the upgrade to 3.6.0. Later, the kubeadm periodic upgrade e2e would have failed, and the kubeadm maintainers would have contacted the etcd maintainers with a report. That is all good, but are there any plans to run local etcd upgrade e2e tests on the etcd project side? Perhaps the etcd-operator project could help with that?
@neolit123 we have upgrade & downgrade e2e test cases, but the problem is that we don't have a kubeadm-style upgrade case (adding & promoting learners, upgrading later). We will fill the gap; please see the last action item in the proposed actions in #19557 (comment).
If users are already on 3.5.20+, then all is good; otherwise, there are two options as mentioned in
Now you are proposing to limit this to only the first option. I am not strongly against this; the good side is that it simplifies the users' burden to understand this (it also saves our effort), the bad side is that it removes the flexibility. From the 3.6 perspective, we will have to add some validation on bootstrap, and check (1) whether or not there is any learner and (2) whether the version of any other member is <= 3.5.19, and raise an alarm if so. Is this what you proposed? @serathius Actually I tend not to do it. Reasons:
Please see my answer above.
Looking at bootstrap.go, it looks like the validation for max learners is pretty robust. It checks all paths and should prevent a cluster with multiple learners from starting. Still, I think we should improve the error message by providing context about this issue.
I'm not sure about this path; it seems like there are many edge cases (promotion before the last snapshot, promotion after the last snapshot, the restarted member was promoted, the restarted member was not promoted, the restarted member becomes the leader), yet you just assume it will work the same in all cases and that the user will be able to easily fix it by promotion.
That's why we need to fix it in 3.5 before upgrading to 3.6. The key question for now is whether we should forcibly prevent users from upgrading from 3.5.19 (or older versions) to 3.6. If yes, then we don't need the
Summary
release-3.5
release-3.6: Add validation on bootstrap, and raise an alarm if any other member is on 3.5.19 or an older version.
Raising an alarm has a specific meaning in etcd, and I would be even more worried about adding a new type of alarm and its consequences.
As @serathius mentioned, there are many edge cases to consider, so it is potentially more risky to rely on syncing from v2store in 3.6 in cases where v3store should be the ground truth.
Clarifying:
Kubernetes is not in code freeze until the 21st, per https://k8s.dev/release. However, the 21st is pretty soon.
Anyone, please feel free to add comments. I also added the link in the very first comment.
cc @dims @chaochn47 @jberkus @jmhbnz for visibility
cc @stackbaek |
caused by etcd-io#19557 Signed-off-by: Benjamin Wang <benjamin.ahrtr@gmail.com>
Update & summary
We were planning to release 3.5.20 and land it in K8s 1.33 before the code freeze, but unfortunately we didn't make it, although I have been working & pushing hard. We will have to land 3.5.20 in 1.33.1. cc @neolit123
We were also planning to release etcd 3.6.0 before KubeCon; we have to postpone it until after KubeCon, because:
release-3.5
release-3.6
We need to add the two e2e tests below:
I also updated the Google doc: https://docs.google.com/document/d/11wkyiK-bDYGOPtFfLGII7iPQxsXoYiq8f6L1fc30S4k/edit?tab=t.0#heading=h.58mxw7gv0jjj
Unfortunately, this isn't correct :( . 3.6 won't be able to automatically fix clusters that are already affected, because v3store is the source of truth in 3.6. cc @fuweid, who confirmed this. But adding validation in 3.6 also has flaws:
Options: we need to provide both options.
See summary: https://docs.google.com/document/d/11wkyiK-bDYGOPtFfLGII7iPQxsXoYiq8f6L1fc30S4k/edit?tab=t.0#heading=h.efgyhz20f9eb
What's the symptom
When upgrading etcd from v3.5.x to v3.6.0-rc.x, the upgrade may fail with "membership: too many learner members in cluster", because etcd allows at most 1 learner by default. Thanks @neolit123 for uploading the initial log.
Which versions are impacted
Note that only release-3.5 has this issue. The issue was introduced in v3.5.1 in #13348. All etcd patch versions in v3.5.1 - v3.5.19 are affected. If you ever added & promoted learner(s) in v3.5.1 - v3.5.19 and try to upgrade to v3.6.0-rc.x, then you will see this issue: "membership: too many learner members in cluster".
What's the root cause
When promoting a learner, the change is only persisted in v2store, not in v3store. The reason is simple: etcd returns errMemberAlreadyExist, because the member (learner) ID already exists. See the 3.5 code below,
etcd/server/etcdserver/api/membership/store.go
Lines 57 to 59 in 1810af3
So the membership data will be inconsistent between v2store and v3store in such a case.
Why we only see this issue when upgrading from 3.5 to 3.6?
In 3.5, the v2store is the source of truth for the membership data. In 3.6, v3store (bbolt) is the source of truth for the membership data. When upgrading from 3.5.x to 3.6, the source of truth changes.
How to reproduce this issue
Manual steps
Note: try the steps using a 3.5.x (x >= 1) binary,
Step 1: start an etcd instance
Step 2: add a learner in another terminal
Step 3: start the learner
Step 4: promote the learner
Step 5: stop both etcd instances
Step 6: check the bbolt db file directly
Use the etcd-dump-db tool to check the db file. You will see that the already promoted learner is still a learner.
Automatic step
Just execute upgrade_test.sh (of course, you need to download both the v3.5.19 and v3.6.0-rc.2 binaries beforehand); afterwards, check the log and you will see the error message "membership: too many learner members in cluster".
Proposed solution & actions
Proposal for release-3.5
main branch's implementation). Also add an e2e test to verify that the membership data is consistent between v2store and v3store. Probably we should add the verification in production code, but only enable it in tests.
etcd/server/etcdserver/api/membership/store.go
Lines 57 to 59 in 1810af3
Add etcdutl commands:
etcdutl check members --data-dir path-2-data-dir
etcdutl sync members --data-dir path-2-data-dir
Proposal for main & release-3.6.
Make the change on main first, then backport it to release-3.6.
Add an e2e test similar to upgrade_test.sh to cover an upgrade case similar to what kubeadm does. The rough steps:
We should definitely get the issue included in the upgrade(3.5->3.6) checklist. Users should check the membership data is consistent between v3store and v2store.
Probably we need to publish an official announcement
cc @fuweid @ivanvc @jmhbnz @serathius @siyuanfoundation @spzala @neolit123 @dims
Next step
Let's discuss this next Monday, and decide who works on what.
PRs