
Bug 1733474: Use upstream drain library instead of downstream #464

Merged

Conversation

@alexander-demicev (Contributor) commented Jan 13, 2020

This PR drops our downstream drain library in favor of the upstream one. This should make the code easier to maintain, let us use new upstream features, and bring in kubernetes/kubernetes#85577 and kubernetes/kubernetes#85574.

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 13, 2020
@michaelgugino (Contributor) left a comment


We should resist making any more changes to our local drain library and see if we can import the kubernetes/kubectl drain package now. In upstream cluster-api we just copied in the files due to conflicting controller-runtime dependencies, but I think that is resolved in k8s >= 1.16.
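
For reference, a minimal sketch of what consuming the upstream helper could look like, assuming the k8s.io/kubectl/pkg/drain package from roughly k8s 1.16+; the writer adapter and the drainNode wrapper here are illustrative, not this repo's final code:

```go
package drainexample

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog"
	"k8s.io/kubectl/pkg/drain"
)

// writer adapts a klog logging function to io.Writer so the drain
// helper's progress messages land in the controller logs.
type writer struct {
	logf func(args ...interface{})
}

func (w writer) Write(p []byte) (int, error) {
	w.logf(string(p))
	return len(p), nil
}

// drainNode cordons the node and then evicts or deletes its pods using
// the upstream helper instead of a downstream copy of the drain code.
func drainNode(client kubernetes.Interface, node *corev1.Node) error {
	drainer := &drain.Helper{
		Client:              client,
		Force:               true,
		IgnoreAllDaemonSets: true,
		DeleteLocalData:     true,
		GracePeriodSeconds:  -1, // honor each pod's own termination grace period
		Timeout:             20 * time.Second,
		Out:                 writer{klog.Info},
		ErrOut:              writer{klog.Error},
	}
	if err := drain.RunCordonOrUncordon(drainer, node, true); err != nil {
		return err
	}
	return drain.RunNodeDrain(drainer, node.Name)
}
```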

@openshift-ci-robot openshift-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 16, 2020
@alexander-demicev alexander-demicev changed the title from "Cherry-pick add option skip-wait-for-delete-timeout changes" to "Use upstream drain library instead of downstream" on Jan 16, 2020
@alexander-demicev (Contributor, Author)

/retest

1 similar comment

@enxebre (Member) commented Jan 17, 2020

@alexander-demichev can you please add links to the most significant changes we are bringing in here to the PR description, e.g. kubernetes/kubernetes#85577 and kubernetes/kubernetes#85574.

@enxebre (Member) commented Jan 17, 2020

/retest

@enxebre (Member) commented Jan 17, 2020

In a follow-up, let's set skip-wait-for-delete-timeout.

@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 20, 2020
@openshift-ci-robot openshift-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 20, 2020
```go
Out:                             writer{klog.Info},
ErrOut:                          writer{klog.Error},
DryRun:                          false,
SkipWaitForDeleteTimeoutSeconds: 60 * 5,
```
(Member) left a comment

Maybe move this value to a constant? And do we maybe want to set this value only when the node is unreachable, and log that?


(Contributor) left a comment

I agree this should be a constant. As for only setting it when a node is unreachable: it will be quite a rare case where pods stick around that long for shutdown reasons, but I have seen it (pods holding long-lived connections, draining over 10 minutes to reduce reconnection spikes).

So if we can detect that the node is unreachable, only setting this when the node is unreachable would be better.

(Contributor, Author) left a comment

Would looking at status.conditions be a good check for an unreachable node?

(Contributor) left a comment

That, or the taints placed on the node, whichever is easiest. This might help determine when to say it's unreachable: https://github.com/kubernetes/kubernetes/blob/ed3cc6afea6fa3d9f8e2a1544daaa12f87d2b65c/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L69-L104

To me that suggests the NodeReady condition being in status Unknown is the only "unreachable" state; do you agree?
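
For illustration, that check could look like the sketch below, a minimal version of the nodeIsUnreachable helper that later appears in this PR's diff; the surrounding package and imports are assumed:

```go
import corev1 "k8s.io/api/core/v1"

// nodeIsUnreachable reports whether the node's Ready condition is Unknown,
// i.e. the node controller has lost contact with the kubelet.
func nodeIsUnreachable(node *corev1.Node) bool {
	for _, condition := range node.Status.Conditions {
		if condition.Type == corev1.NodeReady && condition.Status == corev1.ConditionUnknown {
			return true
		}
	}
	return false
}
```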


(Member) left a comment

Related upstream PR: kubernetes-sigs/cluster-api#2165

@alexander-demicev alexander-demicev force-pushed the unready-node-timeout branch 2 times, most recently from dc06c7d to c99f27b on January 29, 2020 16:18
@alexander-demicev alexander-demicev changed the title from "Use upstream drain library instead of downstream" to "Bug 1733474: Use upstream drain library instead of downstream" on Feb 3, 2020
@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Feb 3, 2020
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 5, 2020
@michaelgugino (Contributor) left a comment

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 5, 2020
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

5 similar comments

@enxebre (Member) commented Feb 6, 2020

/retest

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments

@alexander-demicev (Contributor, Author)

/retest

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments

@openshift-merge-robot openshift-merge-robot merged commit 8b94fd7 into openshift:master Feb 6, 2020
@openshift-ci-robot (Contributor)

@alexander-demichev: All pull requests linked via external trackers have merged. Bugzilla bug 1733474 has been moved to the MODIFIED state.

In response to this:

Bug 1733474: Use upstream drain library instead of downstream

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alexander-demicev alexander-demicev deleted the unready-node-timeout branch February 6, 2020 16:39
@@ -344,6 +346,10 @@ func (r *ReconcileMachine) drainNode(machine *machinev1.Machine) error {

```go
	DryRun: false,
}

if nodeIsUnreachable(node) {
	drainer.SkipWaitForDeleteTimeoutSeconds = skipWaitForDeleteTimeoutSeconds
```
(Member) left a comment

Can we include a logging message here? E.g. "This node is unreachable; draining will wait for pod deletion up to skipWaitForDeleteTimeoutSeconds after the request, and will skip waiting otherwise."

(Contributor, Author) left a comment

Makes sense. Can we do it after master opens?

(Contributor) left a comment

@enxebre Why not skip the drain entirely when we know the node is unreachable?

(Member) left a comment

The idea is that kubelet unreachability might be temporary. The intention is to tolerate unreachability for a reasonable timeframe, i.e. 5 minutes, before considering the node dead and deleting the underlying infra, since deleting it disrupts the app's intent for graceful shutdowns and PDB policies.
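
Putting this thread's suggestions together (a named constant, unreachable-only gating, and a log line), a minimal sketch could look like the following; the configureDrainerForNode wrapper and the log wording are illustrative assumptions, not the merged code:

```go
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/klog"
	"k8s.io/kubectl/pkg/drain"
)

// Tolerate kubelet unreachability for 5 minutes before giving up on
// waiting for pod deletions during a drain.
const skipWaitForDeleteTimeoutSeconds = 60 * 5

func configureDrainerForNode(drainer *drain.Helper, node *corev1.Node) {
	if nodeIsUnreachable(node) {
		// Bounded wait: after the timeout we stop waiting for each pod
		// deletion to complete, since the kubelet likely cannot confirm it.
		klog.Infof("node %q is unreachable; draining will wait up to %d seconds for each pod deletion before skipping it",
			node.Name, skipWaitForDeleteTimeoutSeconds)
		drainer.SkipWaitForDeleteTimeoutSeconds = skipWaitForDeleteTimeoutSeconds
	}
}
```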

@openshift-ci-robot (Contributor)

@alexander-demichev: The following test failed, say /retest to rerun all failed tests:

Test name: ci/prow/e2e-aws-scaleup-rhel7
Commit: 4e9ca8a
Rerun command: /test e2e-aws-scaleup-rhel7

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
lgtm: Indicates that a PR is ready to be merged.
size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.
8 participants