
Add UpgradeRequeueTime logic to prevent exponential backoff during upgrade #168

Merged
merged 8 commits into vertica:main on Mar 10, 2022

Conversation

@narphu (Collaborator) commented Mar 9, 2022:

During an online upgrade, we drain all of the connections to a subcluster before shutting it down. In the code, this is done by requeuing the current reconcile, which causes the operator to schedule the next reconcile using an exponential backoff algorithm. That makes for a bad user experience, because the backoff delay can grow to about 20 minutes. This PR adds a new spec field to our API, upgradeRequeueTime, which lets us configure the delay before the requeued reconcile runs.

@narphu (Collaborator, Author) commented Mar 9, 2022:

I am probably going to do another PR for the error handling changes, just so that we can separate the changes.

narphu added 2 commits March 9, 2022 16:06
Conflicts:
	api/v1beta1/verticadb_types.go
	config/manifests/bases/verticadb-operator.clusterserviceversion.yaml
	pkg/controllers/offlineupgrade_reconcile_test.go
	pkg/controllers/onlineupgrade_reconciler.go
@spilchen (Collaborator) left a comment:

Looks good. Thanks for doing these changes.

// If Reconcile was aborted with a requeue, set the RequeueAfter interval to prevent exponential backoff
if err == nil {
res.Requeue = false
res.RequeueAfter = time.Duration(o.Vdb.GetUpgradeRequeueTime())
spilchen (Collaborator):

I think we need to multiply the time.Duration by time.Second to ensure that the requeue time is in seconds. It also helps clarify what unit of time we are using.

narphu (Collaborator, Author):

Ok, will add this

@@ -36,6 +36,9 @@ import (
const VerticaDBKind = "VerticaDB"
const VerticaDBAPIVersion = "vertica.com/v1beta1"

// Set Constant Upgrade Requeue Time
const URTime = 120
spilchen (Collaborator):

Can we use 30 seconds instead? We will want to retry as quickly as possible during an upgrade, as we really need to get the system back up.

@@ -959,8 +970,15 @@ func (v *VerticaDB) IsOnlineUpgradeInProgress() bool {
return inx < len(v.Status.Conditions) && v.Status.Conditions[inx].Status == corev1.ConditionTrue
}

// buildTransientSubcluster creates a temporary read-only subcluster based on an
// existing subcluster
// GetUpgradeRequeueTime returns default (2 minutes) if not set in the CRD
spilchen (Collaborator):

Let's leave the '2 minute' comment out in case the default ever changes.
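Putting the two review comments together (a 30-second constant, and a getter comment that doesn't hard-code the value), the getter might look like the sketch below. The type and field names mirror the PR's diffs, but this is an illustrative stand-in, not the merged code:

```go
package main

import "fmt"

// URTime is the default upgrade requeue time in seconds, used when the
// CRD does not set upgradeRequeueTime (30 per the review suggestion).
const URTime = 30

// VerticaDBSpec holds only the field relevant to this sketch.
type VerticaDBSpec struct {
	UpgradeRequeueTime int // seconds; 0 means "use the default"
}

// VerticaDB is a minimal stand-in for the CRD type.
type VerticaDB struct {
	Spec VerticaDBSpec
}

// GetUpgradeRequeueTime returns the configured requeue time in seconds,
// falling back to the package default when the spec leaves it unset.
func (v *VerticaDB) GetUpgradeRequeueTime() int {
	if v.Spec.UpgradeRequeueTime > 0 {
		return v.Spec.UpgradeRequeueTime
	}
	return URTime
}

func main() {
	vdb := &VerticaDB{}
	fmt.Println(vdb.GetUpgradeRequeueTime()) // 30
	vdb.Spec.UpgradeRequeueTime = 45
	fmt.Println(vdb.GetUpgradeRequeueTime()) // 45
}
```

Note the doc comment states the fallback behavior without naming the default value, so it stays accurate if URTime changes.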

@@ -217,7 +223,7 @@ func (o *OfflineUpgradeReconciler) checkForNewPods(ctx context.Context) (ctrl.Re
}
if !foundPodWithNewImage {
o.Log.Info("Requeue to wait until at least one pod exists with the new image")
return ctrl.Result{Requeue: true}, nil
return ctrl.Result{Requeue: true, RequeueAfter: time.Duration(o.Vdb.Spec.UpgradeRequeueTime)}, nil
spilchen (Collaborator):

We shouldn't need this. The one added in the Reconcile function should handle adding the RequeueAfter time.

narphu (Collaborator, Author):

Just saw this, added in the new commit.

@@ -49,7 +49,7 @@ var _ = Describe("offlineupgrade_reconcile", func() {
updateVdbToCauseUpgrade(ctx, vdb, NewImage)

r, _, _ := createOfflineUpgradeReconciler(vdb)
Expect(r.Reconcile(ctx, &ctrl.Request{})).Should(Equal(ctrl.Result{Requeue: true}))
Expect(r.Reconcile(ctx, &ctrl.Request{})).Should(Equal(ctrl.Result{Requeue: false, RequeueAfter: 120}))
spilchen (Collaborator):

Let's use the const here rather than hard-coding 120. A similar comment applies in other places in this file.

@@ -665,8 +671,7 @@ func (o *OnlineUpgradeReconciler) isSubclusterIdle(ctx context.Context, scName s
}

// Parse the output. We requeue if there is an active connection. This
// will rely on the exponential backoff algorithm that is in implemented by
// the controller-runtime: start at 5ms, doubles until it gets to 16minutes.
// will rely on the UpgradeRequeueTime that is set at 2 minutes default
spilchen (Collaborator):

Please update the comment here, since we aren't using 2 minutes as the default now.

res.Requeue = false
res.RequeueAfter = time.Second * time.Duration(vdb.Spec.RequeueTime)
res.RequeueAfter = time.Duration(vdb.Spec.RequeueTime)
spilchen (Collaborator):

Let's keep the time.Second. I thought it was needed to make sure we set seconds, but it also clarifies what unit we are using for the requeueTime.

// If any function needs a requeue and we have a RequeueTime set,
// then overwrite RequeueAfter.
// Functions such as Upgrade may already set RequeueAfter and Requeue to false
if res.RequeueAfter > 0 && vdb.Spec.RequeueTime > 0 {
spilchen (Collaborator):

I think the check should be:
if (res.Requeue || res.RequeueAfter > 0) && vdb.Spec.RequeueTime > 0

That way any place that previously set Requeue to true will be overridden.

narphu (Collaborator, Author):

Good point. Thanks for the comments. Will add the changes.

ctrl "sigs.k8s.io/controller-runtime"
)

// Checks if the reconcile function returned an error,
spilchen (Collaborator):

All function comments in Go must start with the function name. So something like:
// IsReconcileAborted checks if the ....

@spilchen (Collaborator) left a comment:

Looks good.

@spilchen spilchen merged commit 320f8c7 into vertica:main Mar 10, 2022
spilchen pushed a commit that referenced this pull request Mar 11, 2022
… handling (#172)

As part of #168 we added a new error handling package. The function
IsReconcileAborted verifies whether the reconcile needs a requeue, needs to be
requeued after a delay, or returned an error. This function can be used at
other points in the reconcilers where a similar check happens.
spilchen pushed a commit to spilchen/vertica-kubernetes that referenced this pull request Mar 21, 2022
2 participants