Implementing Operator downscaling functionality #901

Open · wants to merge 21 commits into base: master

Conversation

bsocaciu (Contributor)

No description provided.

@bsocaciu bsocaciu marked this pull request as ready for review February 17, 2025 07:25
@bsocaciu bsocaciu requested a review from a team as a code owner February 17, 2025 07:25
@bsocaciu (Contributor, Author)

Fixes issue: #179

Comment on lines 2185 to 2193
err = r.Status().Update(ctx, hc)
if err != nil {
	r.Log.Error(err, "failed to update cluster status")
	return reconcile.Result{}, err
}
r.Log.Info(fmt.Sprintf("removing pod %s containing vhost %d", pod.Name, vhost))
if err := r.Delete(ctx, &pod); err != nil { // delete pod before unregistering node
	return reconcile.Result{}, r.logErrorAndReturn(err, fmt.Sprintf("failed to delete pod %s for vhost %d!", pod.Name, vhost))
}
Member

What happens if the r.Status().Update(ctx, hc) call succeeds, but then the r.Delete(ctx, &pod) fails? Or perhaps we update status, and then due to the operator being restarted, it doesn't get to do the pod deletion?

@SaaldjorMike (Member), Feb 19, 2025

If I read the code correctly, it would essentially "just" mean it'll append the ID one more time to the Status, and then try to do the pod deletion again? Perhaps it should only do the append operation if the vhost is not already in the list? Mostly to prevent some sort of edge case that appends the same vhost over and over, where it fails to actually delete the pod.

@bsocaciu (Contributor, Author), Feb 19, 2025

That's fine. If the delete fails after the update succeeds, then the node will remain marked as Evicted, and at the next reconcile run it will try to delete the pod again.
As per the check at line 2228, the node will NOT be unregistered if it's still alive (if the pod is not deleted).

Edit: Nvm, I see your point. I'll add a check before the insert.
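
A minimal sketch of such a guard, assuming the status keeps a plain slice of vhost IDs (the field name EvictedNodeIds and the use of Go's standard slices package are assumptions here, not the PR's actual code):

// Sketch only: guard against appending the same vhost repeatedly when the pod
// deletion fails and the reconcile loop runs again. The status field name
// (EvictedNodeIds) is an illustrative assumption; "slices" is the Go 1.21+
// standard library package.
if !slices.Contains(hc.Status.EvictedNodeIds, vhost) {
	hc.Status.EvictedNodeIds = append(hc.Status.EvictedNodeIds, vhost)
	if err := r.Status().Update(ctx, hc); err != nil {
		r.Log.Error(err, "failed to update cluster status")
		return reconcile.Result{}, err
	}
}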

// Default: false
// Preview: this feature is in a preview state
//+kubebuilder:default=false
EnableDownscalingFeature bool `json:"enableDownscalingFeature,omitempty"`
Contributor

This may be nitpicky but if we plan on having opt-in features, perhaps we should create a feature flag type for this purpose?

FeatureFlags could then hold a Downscaling field that is a bool, or could be a string that we parse.

In general, if we can keep the list of top-level fields as minimal as possible, I think that is cleaner, especially if they are just holding a boolean.
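
As a rough sketch of that shape (the type and field names HumioFeatureFlags and DownscalingEnabled are illustrative, not taken from the PR):

// Sketch only: one possible shape for grouping opt-in features under a single
// top-level field instead of adding a new bool per feature.
// Type and field names are illustrative assumptions.

// HumioFeatureFlags groups opt-in, preview-state features.
type HumioFeatureFlags struct {
	// DownscalingEnabled turns on the operator downscaling functionality.
	// Preview: this feature is in a preview state
	//+kubebuilder:default=false
	DownscalingEnabled bool `json:"downscalingEnabled,omitempty"`
}

// In the cluster spec, a single field would then replace the top-level boolean:
// FeatureFlags HumioFeatureFlags `json:"featureFlags,omitempty"`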

@bsocaciu (Contributor, Author)

Great idea. I've implemented it, now I'm testing it, along with the eviction comment.

continue
}
vhost := nodeIdToPodMap[pod.GetName()]
err = r.HumioClient.SetIsBeingEvicted(ctx, humioHttpClient, req, vhost, true)
Contributor

This section makes me nervous. Do we have any protection against runaway evictions?

For example, what if SetIsBeingEvicted returns an error but still executes the eviction, or something else happens (operator restart, etc.) in the window after we mark the node for eviction but before we mark the pod as being evicted?

I think we should do some validation before this step, or change things where we're forced to be a little more careful. For example, in the reconciliation we could:

1. First, check against the humio API to see whether any nodes are marked for eviction. If they are, ensure the labels/annotations are set. If it sets the annotation, close the reconciliation loop; if it is already set, then continue.

2. Second, check whether there are any pods with the label/annotation and compare this against the humio API. If it doesn't match, trigger the eviction for the pods that have the label and restart the reconciliation loop. If it does match, continue with matchPodsToHosts() and start the eviction process for a node, assuming the node can be safely evicted.

3. Add logic to restrict the maximum number of concurrent evictions by default, and possibly offer a way to override it via the spec (perhaps a max per zone?).

This way if someone evicts via the UI then it will eventually lead to the pod being replaced, and we will not inadvertently evict all nodes in the cluster.
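
As a rough illustration of the last point, capping concurrent evictions inside the reconcile pass might look something like this (the annotation key, the default limit, and the foundPodList/podsToEvict variables are illustrative assumptions, not identifiers from the PR):

// Sketch only: cap the number of in-flight evictions per reconcile pass.
// Pods that already carry the eviction annotation count against the limit, so
// an operator restart or an eviction triggered from the UI does not cascade
// into evicting every node in the cluster. The annotation key, the default
// limit, and foundPodList/podsToEvict are illustrative assumptions.
const evictionAnnotation = "humio.com/marked-for-eviction"
maxConcurrentEvictions := 1 // default; could be made overridable via the spec, e.g. per zone

inFlight := 0
for _, pod := range foundPodList {
	if _, ok := pod.Annotations[evictionAnnotation]; ok {
		inFlight++
	}
}

for _, pod := range podsToEvict {
	if inFlight >= maxConcurrentEvictions {
		break // let a later reconcile continue once current evictions finish
	}
	vhost := nodeIdToPodMap[pod.GetName()]
	if err := r.HumioClient.SetIsBeingEvicted(ctx, humioHttpClient, req, vhost, true); err != nil {
		return reconcile.Result{}, err
	}
	inFlight++
}

Counting already-annotated pods against the budget before marking new ones is what would let an eviction triggered externally (e.g. via the UI) consume the limit instead of being evicted around.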

… scaling up the cluster as old pods don't have the new eviction annotation set