✨Support draining unready nodes #2165
Conversation
Hi @hypnoglow. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test FYI @michaelgugino
@michaelgugino @ncdc Does this PR cover the issue? Is anything missing?
Force-pushed from 1dc0442 to 57813cf
/retest
@hypnoglow: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
controllers/machine_controller.go (Outdated)
@@ -348,6 +349,12 @@ func (r *MachineReconciler) drainNode(cluster *clusterv1.Cluster, nodeName strin
		DryRun: false,
	}

	if !noderefutil.IsNodeReady(node) {
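For context, the added hunk presumably continues roughly as sketched below; this is a reconstruction from the discussion in this thread, and the concrete timeout value is an assumption, not taken from the PR:

	if !noderefutil.IsNodeReady(node) {
		// When the node is not ready, the kubelet may never confirm pod deletion,
		// so stop waiting for pods that have been terminating longer than this timeout.
		drainer.SkipWaitForDeleteTimeoutSeconds = 60 * 5 // assumed value: 5 minutes
	}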
shouldn't this be checking the node is unreachable instead?
Could you please elaborate? The issue was about an "unready" node. How do we want to check whether the node is unreachable?
I don't believe there is any way for Cluster API to evaluate a node's reachability other than by looking at the conditions of the node in the workload cluster. Or am I missing something?
@michaelgugino @vincepri would we want to always set drainer.SkipWaitForDeleteTimeoutSeconds, or only if certain node conditions are set?
This feature comes in handy when the node is unreachable, see https://kubernetes.io/docs/concepts/architecture/nodes/.
In some cases when the node is unreachable, the apiserver is unable to communicate with the kubelet on the node. The decision to delete the pods cannot be communicated to the kubelet until communication with the apiserver is re-established. In the meantime, the pods that are scheduled for deletion may continue to run on the partitioned node.
#1870 (comment)
I believe we can assume this when the nodeReady status [1] is unknown [2].
[1] https://github.com/kubernetes-sigs/cluster-api/blob/master/controllers/noderefutil/util.go#L62-L67
[2] https://github.com/kubernetes/api/blob/master/core/v1/types.go#L2318
100% agree with what you wrote, but I'm not clear if this is something we should always set, or only when the ready status is unknown?
@enxebre did you see my last question?
I think I'd be in favour of setting the flag only when the kubelet status is unknown, as we deliberately want to let draining move forward in such a scenario. For now, it seems safer to keep any other scenario where this might happen visible by blocking draining.
I've updated the PR to check if the node ready status is "Unknown" instead of "not True". Please check if this matches what we have discussed.
/test pull-cluster-api-capd-e2e
Force-pushed from 57813cf to ec648ec
// IsNodeUnreachable returns true if a node is unreachable.
// Node is considered unreachable when its ready status is "Unknown".
func IsNodeUnreachable(node *corev1.Node) bool {
	if node == nil {
I'm not sure we need this check. It should only ever be called with a valid node.
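For readers following along, a plausible complete version of the helper under review might look like the following; this is a sketch inferred from the excerpt and the thread rather than the merged code, it keeps the nil guard questioned above, and corev1 here refers to k8s.io/api/core/v1:

// IsNodeUnreachable returns true if a node is unreachable.
// A node is considered unreachable when its Ready condition status is "Unknown".
func IsNodeUnreachable(node *corev1.Node) bool {
	if node == nil {
		return false
	}
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady {
			return c.Status == corev1.ConditionUnknown
		}
	}
	return false
}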
/assign @enxebre @michaelgugino
@ncdc: GitHub didn't allow me to assign the following users: enxebre. Note that only kubernetes-sigs members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: hypnoglow, ncdc. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@@ -348,6 +349,11 @@ func (r *MachineReconciler) drainNode(cluster *clusterv1.Cluster, nodeName strin
		DryRun: false,
	}

	if noderefutil.IsNodeUnreachable(node) {
		// When the node is unreachable and some pods are not evicted for as long as this timeout, we ignore them.
We should probably include a logging message here, e.g. "This node is unreachable, draining will wait for pod deletion during skipWaitForDeleteTimeoutSeconds after the request and will skip otherwise." cc @ncdc @alexander-demichev @hypnoglow
I can open a PR if everyone is ok with the change.
Do we want this to be something end users are aware of? If so, I would suggest an Event in addition to the log message.
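A minimal sketch of what that could look like, assuming a logr-style logger and an event recorder are in scope on the reconciler; the names logger, r.recorder, machine, and the timeout value are illustrative, not taken from the PR:

	if noderefutil.IsNodeUnreachable(node) {
		// When the node is unreachable and some pods are not evicted for as long as this timeout, we ignore them.
		drainer.SkipWaitForDeleteTimeoutSeconds = 60 * 5 // assumed value: 5 minutes
		logger.V(1).Info("Node is unreachable; pods not evicted within SkipWaitForDeleteTimeoutSeconds will be skipped")
		// Optionally surface this to end users as an Event as well:
		// r.recorder.Eventf(machine, corev1.EventTypeNormal, "DrainingUnreachableNode",
		//	"Node %s is unreachable, skipping pods still terminating after %d seconds", nodeName, drainer.SkipWaitForDeleteTimeoutSeconds)
	}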
The shouldSkipPod function doesn't seem to take into account nodes that come back or become unhealthy during the drain. Additionally, I'm pretty sure that the kubelet updates the DeletionTimestamp multiple times while evicting a pod.
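For reference, the upstream skip logic being discussed is roughly the following; this is a paraphrased sketch of the drain package's filter, not a verbatim copy:

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// shouldSkipPod reports whether a pod already marked for deletion can be ignored:
// its DeletionTimestamp must be set and older than the configured timeout.
func shouldSkipPod(pod corev1.Pod, skipDeletedTimeoutSeconds int) bool {
	return skipDeletedTimeoutSeconds > 0 &&
		!pod.ObjectMeta.DeletionTimestamp.IsZero() &&
		time.Now().After(pod.ObjectMeta.DeletionTimestamp.Time.Add(time.Duration(skipDeletedTimeoutSeconds)*time.Second))
}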
What this PR does / why we need it:
Adds support for draining unready nodes.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1870