Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add proposal for single step multi version downgrade #136

Merged
merged 10 commits into from
Jan 31, 2025

Conversation

MichaelMorrisEst
Copy link
Contributor

# Support Single step multi version downgrade for Zookeeper based clusters

This proposal seeks to introduce support for multi-version downgrade of Strimzi in a single step where possible.
The proposal only relates to Zookeeper based clusters.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I commented on the strimzi/strimzi-kafka-operator#10801 issue. This does not seem to make any sense anymore:

  • At best, this would land in Strimzi 0.45 with support for Kafka 3.8 and 3.9.
  • These are the last versions with ZooKeeper support. ZooKeeper will be dropped in Strimzi 0.46 / Kafka 4.0
  • So there will be never a downgrade of a ZooKeeper-based cluster that would utilize this feature

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, doesn't make sense for Zookeeper at this stage. I've updated the proposal to address the same issue for KRaft based clusters instead

Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating this proposal to KRaft. I think this would be a good feature to have and in fact I had it somewhere deep on my list. But I think the proposal should be extended to cover:

  • More details of the implementation -> I assume you will need to somehow carry the information we are downgrading from unknown version in the code etc. This should be explained in more detail.
  • System test strategy -> Describe how will it be tested etc. Saying that we do not expect any system tests is probably an option ... but even that should be covered.
  • The risks tat are part of this proposal and how they will be documented
    • While most Kafka versions do not expect any special treatment when upgrading / downgrading, it is possible that there might be something needed for some versions. While for upgrade, this can be simply implemented in the new operator version, for a downgrade like this it will not be supported by the old operator. As such, it is unclear how will such a downgrade be supported, which versions will support it, how deep will it be possible etc.
    • Similarly, there are many issues with the operator versions as well. For example, in one of the future Strimzi versions where the dynamic KRaft qourum configuration will be supported - the new operator will know how to swicth between the static qourum configuration and dynamic qourum configuration and possibly the other way around. But the old operator version will not have any knowledge of this and will not be able to do this and the downgrade would simply break.
    • I do not think ^^^ these are blockers. But these risks need to be covered by the proosal and it needs to describe how we will protect the users from it but also how will the expectations be set for the users. E.g. by explaining that this will be something to be used at your own risk and that you are expected to test it for your combination of versions / configuration etc. This should likely also explain what exactly we test and what is just some code that does not block the downgrade, but offers not guarantees that it would work.

PS: If you write the proposal with a sentence-per-line, it will make it much easier to comment on it during the reviews than with paragraph-per-line. No need to change the existing proposal ... just something what might be helpful for the next one 😉.


During downgrade an attempt is made to read the Kafka information for the 'from' Kafka version by the 'to' version of the operator. When the 'from' Kafka version is not supported by the 'to' Strimzi version an error will be thrown because the version is unknown. Information such as the metadata version, interbroker protocol message format, log message format version are read from kafka-versions.yaml during the creation of a KafkaVersion object to represent the 'from' kafka version and the error message is generated as it cannot find the information for the unknown version.

However, while this information is important to know for the 'to' version in upgrade and downgrade, and for the 'from' version in upgrade, it is not important for the 'from' version in the downgrade as it is not subsequently used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree it is not important. While currently, there are no explicit steps needed to do during downgrade, there might be in the future. So I think the statement that it is never needed is not true. (in the distant past, I think there were for example, some breaking ZooKeeper changes that might have included special steps at downgrade).

Copy link
Contributor

@PaulRMellor PaulRMellor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I left a few comments on wording for clarity. I can return when the proposal covers the implementation detail and handling differences in functionality mentioned by Jakub

Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates. I left few more nits.

Comment on lines 30 to 32
This exception can be caught in detectToAndFromVersions() in the rollback scenario and an instance of KafkaVersion created that has null values for the unknown data (such as metadata version).
As this data is not currently read anywhere in the code the null values will not have any negative effect.
However future code changes could result in NPE as it will not necessarily be obvious the values may be null.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be more feasible to just have null as the from version rather than have a version with null values if we do not know it? I wonder if that would make it easier to handle compared to having null values inside the object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do know the version number, just not anything else. The version number is useful (for example in KRaftVersionChnageCreator.prepareVersionChange(...) to determine if its an upgrade or downgrade), so it is useful to have the KafkaVersion object with the version number in it rather than having a null and losing the means to pass the 'from' version number

Comment on lines 48 to 54
Proposal B
Add a new field to KafkaStatus, named minOperatorVersionForRollback, which is set by the operator during Kafka upgrade when it is known that specific extra steps are needed for a rollback to succeed.
When a downgrade is identified (in KraftVerionChangeCreator.detectToAndFromVersions()) the version of the operator is compared with the value of minOperatorVersionForRollback, if set.
An exception will be thrown if the requirement is not satisfied.

For example, if specific steps need to be implemented in the operator to upgrade/rollback from Kafka version 'x' to Kafka version 'y' and such steps are implemented in operator version 'n' then as part of the implementation in operator version 'n' to support the 'x' -> 'y' upgrade, the minOperatorVersionForRollback will bet set to 'n'.
Any attempt to rollback a Kafka cluster from 'y' -> 'x' with operator version earlier than 'n' shall be rejected by the operator.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is would be super hard to track across various configurations. So I would not go this way.

@scholzj scholzj requested a review from katheris January 16, 2025 16:34
Copy link
Contributor

@katheris katheris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this proposal looks good to me. I agree with Jakub that option A seems best in both scenarios, so you should update to move the rejected option to the Rejected alternatives section. Once that is done I would be happy to approve this proposal.

@ppatierno
Copy link
Member

A couple of comments ...

  1. I agree with Kate and Jakub and the proposal should stick with one option and I am for option A as well.
  2. I think you are using "rollback" and "downgrade" with the same meaning, right? Could we stick with just one term, maybe using "downgrade" consistently?

Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MichaelMorrisEst Thanks for the updates. I left some more formatting comments and two comments related to the content -> one is to confirm if we plan to do a check that it is actually a downgrade and other is about the ST where I realized it does not make much sense.

Comment on lines 29 to 30
This exception can be caught in detectToAndFromVersions() in the downgrade scenario and an instance of KafkaVersion created that has null values for the unknown data (such as metadata version).
As this data is not currently read anywhere in the code the null values will not have any negative effect.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should clarify here that we will compare the versions to confirm the assumption that it is a downgrade? (i.e. we will double-check that it is for example 4.5.0 to 4.1.0 and not 3.17.5 to 4.1.0)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Comment on lines 42 to 46
A new system test will be added with the following steps:
- Deploy the latest operator
- Deploy a Kafka cluster with the highest supported Kafka version but metadata version set to the appropriate version for the Kafka version the cluster will be downgrade to in the following steps
- Downgrade the operator to the most recent version which does not support the Kafka version used in the cluster.
- Downgrade the Kafka version used in the cluster to a version supported by the downgraded operator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this a bit more -> this ST makes not much 🤷‍♂️. The problem is that it would not really test the current version but the past / already released version. E.g., if we tested this today, we would start with Strimzi main branch / Kafka 3.9.0 and downgrade to Strimzi 0.44.0. If the downgrade would not work, it would be because of an issue with Strimzi 0.44. So it will not help us to discover any issues in the operator version we are actually testing.

I wonder if we should instead fake it? for example:

  • Deploy Kafka 3.9.0 cluster with current main Strimzi branch
  • Pause the reconciliation
  • Change the annotations / Kafka CR status to some unknown future version (e.g. 98.99.100)
  • Unpause the reconciliation and have the operator handle the downgrade?

Or maybe we should give up on ST completely and have it covered by UT only? 🤷‍♂️

Any thoughts @im-konge @Frawless @see-quick?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes thats a problem alright. I will update to state UT only

Copy link
Contributor

@katheris katheris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of nits but otherwise looks good to me


Kafka does not support downgrade where the metadata version in use is higher than the highest metadata supported by the 'to' Kafka version or where there has been changes in the metadata between the 'from' and 'to' versions. There is already logic in place in Strimzi to prevent such a downgrade from happening. This proposal does not propose any changes to this functionality, it shall continue to be not supported to downgrade in such a scenario.

### Impacts:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be an empty heading here

Copy link
Member

@scholzj scholzj Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is level 3 heading that groups the Level 4 headings below it, or?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it groups the level 4 headings below. I can add a sentence just to make it clearer

Comment on lines 29 to 30
This exception can be caught in detectToAndFromVersions() in the downgrade scenario and an instance of KafkaVersion created that has null values for the unknown data (such as metadata version).
As this data is not currently read anywhere in the code the null values will not have any negative effect.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the proposal.

Copy link
Member

@ppatierno ppatierno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just a nit.

@scholzj
Copy link
Member

scholzj commented Jan 28, 2025

This proposal has now 4 binding +1 votes. Given it has been open for some time, if there are no new comments, we should try to get it closed this week. Si if you haven't looked at it and are interested in it, please do so soon.

@scholzj
Copy link
Member

scholzj commented Jan 30, 2025

@MichaelMorrisEst I wanted to update the proposal number and add it to the index in README.md. But I do not seem to be able to push into the PR for some reason. Can you please do it so that I can close the vote and merge it? The next proposal number is 093 right now. But it might change depending on when you get to it. Thanks.

MichaelMorrisEst and others added 3 commits January 31, 2025 09:26
Signed-off-by: MichaelMorris <michael.morris@est.tech>
Signed-off-by: MichaelMorris <michael.morris@est.tech>
Co-authored-by: PaulRMellor <47596553+PaulRMellor@users.noreply.github.com>
Signed-off-by: Michael Morris <105736419+MichaelMorrisEst@users.noreply.github.com>
MichaelMorrisEst and others added 7 commits January 31, 2025 09:26
Signed-off-by: MichaelMorris <michael.morris@est.tech>
Co-authored-by: Jakub Scholz <www@scholzj.com>
Signed-off-by: Michael Morris <105736419+MichaelMorrisEst@users.noreply.github.com>
Signed-off-by: MichaelMorris <michael.morris@est.tech>
Co-authored-by: Kate Stanley <11195226+katheris@users.noreply.github.com>
Co-authored-by: Jakub Scholz <www@scholzj.com>
Signed-off-by: Michael Morris <105736419+MichaelMorrisEst@users.noreply.github.com>
Signed-off-by: MichaelMorris <michael.morris@est.tech>
Co-authored-by: Paolo Patierno <paolo.patierno@gmail.com>
Co-authored-by: Jakub Scholz <www@scholzj.com>
Signed-off-by: Michael Morris <105736419+MichaelMorrisEst@users.noreply.github.com>
Signed-off-by: MichaelMorris <michael.morris@est.tech>
@MichaelMorrisEst
Copy link
Contributor Author

@MichaelMorrisEst I wanted to update the proposal number and add it to the index in README.md. But I do not seem to be able to push into the PR for some reason. Can you please do it so that I can close the vote and merge it? The next proposal number is 093 right now. But it might change depending on when you get to it. Thanks.

@scholzj Done, thanks

@ppatierno
Copy link
Member

This proposal has now 7 binding +1 votes, so it's approved.
@MichaelMorrisEst thanks! looking forward to the implementation!

@ppatierno ppatierno merged commit ad3c47d into strimzi:main Jan 31, 2025
1 check passed
@MichaelMorrisEst
Copy link
Contributor Author

Thanks everyone for taking the time to review and for providing feedback

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants