
Add UpgradeRequeueTime logic to prevent exponential backoff during upgrade #168

Merged
merged 8 commits into vertica:main on Mar 10, 2022

Conversation

@narphu (Collaborator) commented Mar 9, 2022:

During an online upgrade, we drain all of the connections to a subcluster before shutting it down. In the code, this is done by requeuing the current reconcile, which causes the operator to schedule the next reconcile using an exponential backoff algorithm. That makes for a bad user experience, because the backoff delay can grow to about 20 minutes. This PR adds a new spec field to our API, upgradeRequeueTime, which lets us configure the delay before the requeued reconcile runs.

@narphu (Collaborator, Author) commented Mar 9, 2022:

I am probably going to do another PR for the error handling changes, just so that we can separate the changes.

narphu added 2 commits March 9, 2022 16:06
Conflicts:
	api/v1beta1/verticadb_types.go
	config/manifests/bases/verticadb-operator.clusterserviceversion.yaml
	pkg/controllers/offlineupgrade_reconcile_test.go
	pkg/controllers/onlineupgrade_reconciler.go
@spilchen (Collaborator) left a comment:

Looks good. Thanks for doing these changes.

// If Reconcile was aborted with a requeue, set the RequeueAfter interval to prevent exponential backoff
if err == nil {
res.Requeue = false
res.RequeueAfter = time.Duration(o.Vdb.GetUpgradeRequeueTime())
spilchen (Collaborator):

I think we need to multiply the time.Duration by time.Second to ensure that the requeue time is in seconds. It also helps clarify what unit of time we are using.

narphu (Collaborator, Author):

Ok, will add this

@@ -36,6 +36,9 @@ import (
const VerticaDBKind = "VerticaDB"
const VerticaDBAPIVersion = "vertica.com/v1beta1"

// Set Constant Upgrade Requeue Time
const URTime = 120
spilchen (Collaborator):

Can we use 30 seconds instead? We will want to retry as quickly as possible during an upgrade, as we really need to get the system back up.

@@ -959,8 +970,15 @@ func (v *VerticaDB) IsOnlineUpgradeInProgress() bool {
return inx < len(v.Status.Conditions) && v.Status.Conditions[inx].Status == corev1.ConditionTrue
}

// buildTransientSubcluster creates a temporary read-only subcluster based on an
// existing subcluster
// GetUpgradeRequeueTime returns default (2 minutes) if not set in the CRD
spilchen (Collaborator):

Let's leave the '2 minute' comment out in case the default ever changes.
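Putting the two review comments together (a 30-second constant, and a getter comment that doesn't hard-code the value), the getter might look like the sketch below. The type and field names mirror the PR's diffs, but this is an illustrative stand-in, not the merged code:

```go
package main

import "fmt"

// URTime is the default upgrade requeue time in seconds, used when the
// CRD does not set upgradeRequeueTime (30 per the review suggestion).
const URTime = 30

// VerticaDBSpec holds only the field relevant to this sketch.
type VerticaDBSpec struct {
	UpgradeRequeueTime int // seconds; 0 means "use the default"
}

// VerticaDB is a minimal stand-in for the CRD type.
type VerticaDB struct {
	Spec VerticaDBSpec
}

// GetUpgradeRequeueTime returns the configured requeue time in seconds,
// falling back to the package default when the spec leaves it unset.
func (v *VerticaDB) GetUpgradeRequeueTime() int {
	if v.Spec.UpgradeRequeueTime > 0 {
		return v.Spec.UpgradeRequeueTime
	}
	return URTime
}

func main() {
	vdb := &VerticaDB{}
	fmt.Println(vdb.GetUpgradeRequeueTime()) // 30
	vdb.Spec.UpgradeRequeueTime = 45
	fmt.Println(vdb.GetUpgradeRequeueTime()) // 45
}
```

Note the doc comment states the fallback behavior without naming the default value, so it stays accurate if URTime changes.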

@@ -217,7 +223,7 @@ func (o *OfflineUpgradeReconciler) checkForNewPods(ctx context.Context) (ctrl.Re
}
if !foundPodWithNewImage {
o.Log.Info("Requeue to wait until at least one pod exists with the new image")
return ctrl.Result{Requeue: true}, nil
return ctrl.Result{Requeue: true, RequeueAfter: time.Duration(o.Vdb.Spec.UpgradeRequeueTime)}, nil
spilchen (Collaborator):

We shouldn't need this. The one added in the Reconcile function should handle adding the RequeueAfter time.

narphu (Collaborator, Author):

Just saw this, added in the new commit.

@@ -49,7 +49,7 @@ var _ = Describe("offlineupgrade_reconcile", func() {
updateVdbToCauseUpgrade(ctx, vdb, NewImage)

r, _, _ := createOfflineUpgradeReconciler(vdb)
Expect(r.Reconcile(ctx, &ctrl.Request{})).Should(Equal(ctrl.Result{Requeue: true}))
Expect(r.Reconcile(ctx, &ctrl.Request{})).Should(Equal(ctrl.Result{Requeue: false, RequeueAfter: 120}))
spilchen (Collaborator):

Let's use the const here rather than hard-coding 120. A similar comment applies in other places in this file.

@@ -665,8 +671,7 @@ func (o *OnlineUpgradeReconciler) isSubclusterIdle(ctx context.Context, scName s
}

// Parse the output. We requeue if there is an active connection. This
// will rely on the exponential backoff algorithm that is in implemented by
// the controller-runtime: start at 5ms, doubles until it gets to 16minutes.
// will rely on the UpgradeRequeueTime that is set at 2 minutes default
spilchen (Collaborator):

Please update the comment here, since we aren't using 2 minutes as the default now.

res.Requeue = false
res.RequeueAfter = time.Second * time.Duration(vdb.Spec.RequeueTime)
res.RequeueAfter = time.Duration(vdb.Spec.RequeueTime)
spilchen (Collaborator):

Let's keep the time.Second. I thought it was needed to make sure we set seconds, but it also clarifies what unit we are using for the requeueTime.

// If any function needs a requeue and we have a RequeueTime set,
// then overwrite RequeueAfter.
// Functions such as Upgrade may already set RequeueAfter and Requeue to false
if res.RequeueAfter > 0 && vdb.Spec.RequeueTime > 0 {
spilchen (Collaborator):

I think the check should be:
if (res.Requeue || res.RequeueAfter > 0) && vdb.Spec.RequeueTime > 0

That way any place that previously set Requeue to true will be overridden.

narphu (Collaborator, Author):

Good point. Thanks for the comments. Will add the changes.

ctrl "sigs.k8s.io/controller-runtime"
)

// Checks if the reconcile function returned an error,
spilchen (Collaborator):

All function comments in Go must start with the function name. So something like:
// IsReconcileAborted checks if the ....

@spilchen (Collaborator) left a comment:

Looks good.

@spilchen spilchen merged commit 320f8c7 into vertica:main Mar 10, 2022
spilchen pushed a commit that referenced this pull request Mar 11, 2022
… handling (#172)

As part of #168 we added a new error handling package. The function
IsReconcileAborted verifies whether the reconcile needs a requeue, needs to be
requeued after a delay, or returned an error. This function can be used at
other points in the reconcilers where a similar check happens.
spilchen pushed a commit to spilchen/vertica-kubernetes that referenced this pull request Mar 21, 2022
2 participants