-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add UpgradeRequeueTime logic to prevent exponential backoff during upgrade #168
Conversation
I am probably going to do another PR for the error handling changes just so that we can separate the changes |
Conflicts: api/v1beta1/verticadb_types.go config/manifests/bases/verticadb-operator.clusterserviceversion.yaml pkg/controllers/offlineupgrade_reconcile_test.go pkg/controllers/onlineupgrade_reconciler.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thanks for doing these changes.
// If Reconcile was aborted with a requeue, set the RequeueAfter interval to prevent exponential backoff | ||
if err == nil { | ||
res.Requeue = false | ||
res.RequeueAfter = time.Duration(o.Vdb.GetUpgradeRequeueTime()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to multiply the time.Duration against time.Second to ensure that the requeue time is in seconds. It also helps clarify what unit of time we are using.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, will add this
api/v1beta1/verticadb_types.go
Outdated
@@ -36,6 +36,9 @@ import ( | |||
const VerticaDBKind = "VerticaDB" | |||
const VerticaDBAPIVersion = "vertica.com/v1beta1" | |||
|
|||
// Set Constant Upgrade Requeue Time | |||
const URTime = 120 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use 30 seconds instead? We will want to retry as quick as possible during an upgrade as we really need to get the system back up.
api/v1beta1/verticadb_types.go
Outdated
@@ -959,8 +970,15 @@ func (v *VerticaDB) IsOnlineUpgradeInProgress() bool { | |||
return inx < len(v.Status.Conditions) && v.Status.Conditions[inx].Status == corev1.ConditionTrue | |||
} | |||
|
|||
// buildTransientSubcluster creates a temporary read-only subcluster based on an | |||
// existing subcluster | |||
// GetUpgradeRequeueTime returns default (2 minutes) if not set in the CRD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets leave the '2 minute' comment out in case the default ever changes.
@@ -217,7 +223,7 @@ func (o *OfflineUpgradeReconciler) checkForNewPods(ctx context.Context) (ctrl.Re | |||
} | |||
if !foundPodWithNewImage { | |||
o.Log.Info("Requeue to wait until at least one pod exists with the new image") | |||
return ctrl.Result{Requeue: true}, nil | |||
return ctrl.Result{Requeue: true, RequeueAfter: time.Duration(o.Vdb.Spec.UpgradeRequeueTime)}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't need this. The one added in the Reconcile function should handle adding the RequeueAfter time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just saw this, added in the new commit.
@@ -49,7 +49,7 @@ var _ = Describe("offlineupgrade_reconcile", func() { | |||
updateVdbToCauseUpgrade(ctx, vdb, NewImage) | |||
|
|||
r, _, _ := createOfflineUpgradeReconciler(vdb) | |||
Expect(r.Reconcile(ctx, &ctrl.Request{})).Should(Equal(ctrl.Result{Requeue: true})) | |||
Expect(r.Reconcile(ctx, &ctrl.Request{})).Should(Equal(ctrl.Result{Requeue: false, RequeueAfter: 120})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets use the const here rather than hard coding 120. Similar comment in other places in this file.
@@ -665,8 +671,7 @@ func (o *OnlineUpgradeReconciler) isSubclusterIdle(ctx context.Context, scName s | |||
} | |||
|
|||
// Parse the output. We requeue if there is an active connection. This | |||
// will rely on the exponential backoff algorithm that is in implemented by | |||
// the controller-runtime: start at 5ms, doubles until it gets to 16minutes. | |||
// will rely on the UpgradeRequeueTime that is set at 2 minutes default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update comment here since we aren't using 2 minutes as the default now.
res.Requeue = false | ||
res.RequeueAfter = time.Second * time.Duration(vdb.Spec.RequeueTime) | ||
res.RequeueAfter = time.Duration(vdb.Spec.RequeueTime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets keep the time.Second. I thought it was needed to make sure we set seconds. But it also clarifies what unit we are using for the requeueTime.
// If any function needs a requeue and we have a RequeueTime set, | ||
// then overwrite RequeueAfter. | ||
// Functions such as Upgrade may already set RequeueAfter and Requeue to false | ||
if res.RequeueAfter > 0 && vdb.Spec.RequeueTime > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the check should be:
if (res.Requeue || res.RequeueAfter > 0) && vdb.Spec.RequeueTime > 0
That way any place that previously set Requeue to true will be overridden.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Thanks for the comments. Will add the changes
pkg/errors/errors.go
Outdated
ctrl "sigs.k8s.io/controller-runtime" | ||
) | ||
|
||
// Checks if the reconcile function returned an error, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All function comments in Go must start with the function name. So something like:
// IsReconcileAborted checks if the ....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
… handling (vertica#172) As part of vertica#168 we added a new error handling package. The function IsReconcileAborted verifies if the reconcile needs a re-queue, needs to be re-queued after a delay or returned an error. This function can be utilized at other points of reconcilers where similar check happens.
During an online upgrade, we drain all of the connections to a subcluster before shutting it down. The way that this is instrumented in the code is that we requeue the current reconcile. This causes the operator to queue up the next reconcile action using an exponential backoff algorithm. This causes a bad user experience because the exponential backoff can get up to 20 minutes. This adds a new spec in our API - upgradeRequeueTime, that helps us configure delaying the Requeue reconcile.