v2: refactor managed decommission #288

chrisseto · 2024-11-07T21:14:59Z

Prior to this commit, the managed decommission controller connected to the admin API using patterns akin to the node watcher and decommission controller. This made it difficult to test via the RedpandaControllerSuite test, which was desirable due to the flaky nature (especially when considering a later commit) of its corresponding kuttl-based test.

This commit refactors the managed decommission controller to leverage the ClientFactory struct used in the redpanda controller, which makes it possible to test in the faster and more easily debuggable RedpandaControllerSuite.

chrisseto · 2024-11-07T21:17:03Z

The refactor to the controller ended up being pretty heavy, but the logic is largely unchanged with the exception of getPodFromRedpandaNodeID, which now relies on a heuristic with a really big comment instead of dialing directly into the admin API.

RafalKorepta

LGTM

RafalKorepta · 2024-11-07T22:14:55Z

operator/internal/controller/redpanda/managed_decommission_controller.go

+	// NB: Deleting this Pod will take an unexpectedly long time as the
+	// pre-stop hook will "spin" due to not handling decommissioned brokers.


I don't understand this comment. I'm under the impression that podEvict is called after Redpanda is successfully decommissioned. Maybe the problem is that maintenance mode is not achievable if a particular Redpanda broker is no longer part of the cluster?

redpanda-operator/operator/internal/controller/redpanda/managed_decommission_controller.go

Lines 208 to 221 in a36efbc

	if !decomStatus.Finished {
		log.Info("decommission status not finished", "decommission-broker-status", decomStatus)
		return &resources.RequeueAfterError{
			RequeueAfter: wait.Jitter(defaultDecommissionWaitInterval, decommissionWaitJitterFactor),
			Msg:          fmt.Sprintf("broker %d decommission status not finished", decommissionNodeID),
		}
	}

	log.Info("Node decommissioned")
	r.EventRecorder.AnnotatedEventf(rp,
		map[string]string{v1alpha2.GroupVersion.Group + revisionPath: rp.ResourceVersion},
		corev1.EventTypeNormal, v1alpha2.EventTypeTrace, fmt.Sprintf("Node decommissioned: %d", decommissionNodeID))

	if err := r.podEvict(ctx, rp); err != nil {

> Maybe the problem is that maintenance mode is not achievable if particular Redpanda is no longer part of the Cluster?

Exactly this! You can't set maintenance mode on a broker that's not part of a cluster. The admin API returns a 404.

RafalKorepta · 2024-11-07T22:32:52Z

operator/internal/controller/redpanda/redpanda_controller_test.go

-					Repository: ptr.To("redpandadata/redpanda"), // Override the default to make use of the docker-io image cache.
-				},
+				Config: &redpandav1alpha2.Config{},
+				Image:  &redpandav1alpha2.RedpandaImage{},


Why was the override removed?

It got added by accident 😓 I've actually added it back in and updated the comment. This way we won't inflate our own metrics :P

RafalKorepta · 2024-11-07T22:33:42Z

operator/internal/controller/redpanda/redpanda_controller_test.go

-				Image: &redpandav1alpha2.RedpandaImage{
-					Repository: ptr.To("redpandadata/redpanda"), // Override the default to make use of the docker-io image cache.
-				},
+				Config: &redpandav1alpha2.Config{},


Why was an empty Config struct added?

Oh, sorry that leaked in from another PR. It's so that I can set config later on without having to worry about nil pointers. Added a comment.
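The nil-pointer concern can be sketched with simplified types. `Spec`, `Config`, and `setClusterConfig` below are hypothetical stand-ins, not the real `redpandav1alpha2` types; the point is only that a pre-initialized pointer field lets later code write through it without a guard.

```go
package main

import "fmt"

// Config is a hypothetical, simplified stand-in for a config struct.
type Config struct {
	Cluster map[string]any
}

// Spec holds an optional pointer to Config, as CRD-style specs often do.
type Spec struct {
	Config *Config
}

// setClusterConfig writes a key without checking s.Config for nil.
// It would panic with a nil pointer dereference if s.Config were nil.
func setClusterConfig(s *Spec, key string, value any) {
	if s.Config.Cluster == nil {
		s.Config.Cluster = map[string]any{}
	}
	s.Config.Cluster[key] = value
}

func main() {
	// Initializing Config up front makes the later write safe.
	spec := Spec{Config: &Config{}}
	setClusterConfig(&spec, "example_key", true)
	fmt.Println(spec.Config.Cluster["example_key"])
}
```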

chrisseto requested review from RafalKorepta and andrewstucki as code owners November 7, 2024 21:15

chrisseto mentioned this pull request Nov 7, 2024

v2: implement post-install/upgrade job in controller #282

Merged

chrisseto force-pushed the chris/p/managed-decom-client-factory branch from b591c32 to a36efbc Compare November 7, 2024 21:19

RafalKorepta approved these changes Nov 7, 2024

chrisseto force-pushed the chris/p/managed-decom-client-factory branch 5 times, most recently from 7c8d7be to 12205e4 Compare November 8, 2024 17:42

chrisseto force-pushed the chris/p/managed-decom-client-factory branch from 12205e4 to 908182b Compare November 8, 2024 17:52

chrisseto enabled auto-merge (rebase) November 8, 2024 18:19

chrisseto merged commit d329b7d into main Nov 8, 2024
4 of 5 checks passed

RafalKorepta deleted the chris/p/managed-decom-client-factory branch December 2, 2024 13:25
