Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v2: refactor managed decommission #288

Merged
merged 1 commit into from
Nov 8, 2024

Conversation

chrisseto
Copy link
Contributor

Prior to this commit the managed decommission controller was utilizing patterns akin to the node watcher and decommission controller for connecting to the admin API. This made it difficult to test via the RedpandaControllerSuite test which was desirable due to the flakey nature (especially when considering a later commit) of it's corresponding kuttl based test.

This commit refactors the manged decommission controller to leverage the ClientFactory struct used in the redpanda controller which makes it possible to test in the faster and more easily debuggable RedpandaControllerSuite.

@chrisseto
Copy link
Contributor Author

The refactor to the controller ended up being pretty heavy but the logic is largely unchanged with the exception of getPodFromRedpandaNodeID which now relies on a heuristic with a really big comment instead of dialing directly into the admin API.

@chrisseto chrisseto force-pushed the chris/p/managed-decom-client-factory branch from b591c32 to a36efbc Compare November 7, 2024 21:19
Copy link
Contributor

@RafalKorepta RafalKorepta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines +252 to +253
// NB: Deleting this Pod will take an unexpectedly long time as the
// pre-stop hook will "spin" due to not handling decommissioned brokers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment. I'm under impression that podEvict is called after Redpanda is successfully decommissioned. Maybe the problem is that maintenance mode is not achievable if particular Redpanda is no longer part of the Cluster?

if !decomStatus.Finished {
log.Info("decommission status not finished", "decommission-broker-status", decomStatus)
return &resources.RequeueAfterError{
RequeueAfter: wait.Jitter(defaultDecommissionWaitInterval, decommissionWaitJitterFactor),
Msg: fmt.Sprintf("broker %d decommission status not finished", decommissionNodeID),
}
}
log.Info("Node decommissioned")
r.EventRecorder.AnnotatedEventf(rp,
map[string]string{v1alpha2.GroupVersion.Group + revisionPath: rp.ResourceVersion},
corev1.EventTypeNormal, v1alpha2.EventTypeTrace, fmt.Sprintf("Node decommissioned: %d", decommissionNodeID))
if err := r.podEvict(ctx, rp); err != nil {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the problem is that maintenance mode is not achievable if particular Redpanda is no longer part of the Cluster?

Exactly this! You can't set maintenance mode on a broker that's not part of a cluster. The admin API returns a 404.

Repository: ptr.To("redpandadata/redpanda"), // Override the default to make use of the docker-io image cache.
},
Config: &redpandav1alpha2.Config{},
Image: &redpandav1alpha2.RedpandaImage{},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why an override was removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It got added by accident 😓 I've actually added it back in and updated the comment. This way we won't inflate our own metrics :P

Image: &redpandav1alpha2.RedpandaImage{
Repository: ptr.To("redpandadata/redpanda"), // Override the default to make use of the docker-io image cache.
},
Config: &redpandav1alpha2.Config{},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why empty Config struct is added?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry that leaked in from another PR. It's so that I can set config later on without having to worry about nil pointers. Added a comment.

@chrisseto chrisseto force-pushed the chris/p/managed-decom-client-factory branch 5 times, most recently from 7c8d7be to 12205e4 Compare November 8, 2024 17:42
Prior to this commit the managed decommission controller was utilizing patterns
akin to the node watcher and decommission controller for connecting to the
admin API. This made it difficult to test via the `RedpandaControllerSuite`
test which was desirable due to the flakey nature (especially when considering
a later commit) of it's corresponding kuttl based test.

This commit refactors the manged decommission controller to leverage the
`ClientFactory` struct used in the redpanda controller which makes it possible
to test in the faster and more easily debuggable `RedpandaControllerSuite`.
@chrisseto chrisseto force-pushed the chris/p/managed-decom-client-factory branch from 12205e4 to 908182b Compare November 8, 2024 17:52
@chrisseto chrisseto enabled auto-merge (rebase) November 8, 2024 18:19
@chrisseto chrisseto merged commit d329b7d into main Nov 8, 2024
4 of 5 checks passed
@RafalKorepta RafalKorepta deleted the chris/p/managed-decom-client-factory branch December 2, 2024 13:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants