
Bug 1733474: Use upstream drain library instead of downstream #464

Merged

Conversation

@alexander-demicev (Contributor) commented Jan 13, 2020

This PR drops our downstream drain library in favor of the upstream one. This should make the code easier to maintain, let us use new upstream features, and bring in kubernetes/kubernetes#85577 and kubernetes/kubernetes#85574.

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 13, 2020
@michaelgugino (Contributor) left a comment


We should resist making any more changes to our local drain library and see if we can import the kubernetes/kubectl drain package now. In upstream cluster-api we just copied in the files due to conflicting controller-runtime dependencies, but I think that is resolved in k8s >= 1.16.
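
For reference, a minimal sketch of what consuming the upstream helper could look like, assuming the k8s.io/kubectl/pkg/drain package from roughly k8s 1.16+; the writer adapter and the drainNode wrapper here are illustrative, not this repo's final code:

```go
package drainexample

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog"
	"k8s.io/kubectl/pkg/drain"
)

// writer adapts a klog logging function to io.Writer so the drain
// helper's progress messages land in the controller logs.
type writer struct {
	logf func(args ...interface{})
}

func (w writer) Write(p []byte) (int, error) {
	w.logf(string(p))
	return len(p), nil
}

// drainNode cordons the node and then evicts or deletes its pods using
// the upstream helper instead of a downstream copy of the drain code.
func drainNode(client kubernetes.Interface, node *corev1.Node) error {
	drainer := &drain.Helper{
		Client:              client,
		Force:               true,
		IgnoreAllDaemonSets: true,
		DeleteLocalData:     true,
		GracePeriodSeconds:  -1, // honor each pod's own termination grace period
		Timeout:             20 * time.Second,
		Out:                 writer{klog.Info},
		ErrOut:              writer{klog.Error},
	}
	if err := drain.RunCordonOrUncordon(drainer, node, true); err != nil {
		return err
	}
	return drain.RunNodeDrain(drainer, node.Name)
}
```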

@openshift-ci-robot openshift-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 16, 2020
@alexander-demicev alexander-demicev changed the title from "Cherry-pick add option skip-wait-for-delete-timeout changes" to "Use upstream drain library instead of downstream" on Jan 16, 2020
@alexander-demicev (Contributor, Author)

/retest

1 similar comment

@enxebre (Member) commented Jan 17, 2020

@alexander-demichev can you please add links to the most significant changes we are bringing in here to the PR description, e.g. kubernetes/kubernetes#85577 and kubernetes/kubernetes#85574.

@enxebre (Member) commented Jan 17, 2020

/retest

@enxebre (Member) commented Jan 17, 2020

In a follow-up, let's set skip-wait-for-delete-timeout.

@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 20, 2020
@openshift-ci-robot openshift-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 20, 2020
```go
Out:                             writer{klog.Info},
ErrOut:                          writer{klog.Error},
DryRun:                          false,
SkipWaitForDeleteTimeoutSeconds: 60 * 5,
```
(Member) left a comment

Maybe move this value to a constant? And do we maybe want to set this value only when the node is unreachable, and log that?


(Contributor) left a comment

I agree this should be a constant. As for only setting it when a node is unreachable: it will be quite a rare case where pods stick around that long for shutdown reasons, but I have seen it (pods holding long-lived connections, draining over 10 minutes to reduce reconnection spikes).

So if we can detect that the node is unreachable, only setting this when the node is unreachable would be better.

(Contributor, Author) left a comment

Would looking at status.conditions be a good check for an unreachable node?

(Contributor) left a comment

That, or the taints placed on the node, whichever is easiest. This might help determine when to say it's unreachable: https://github.com/kubernetes/kubernetes/blob/ed3cc6afea6fa3d9f8e2a1544daaa12f87d2b65c/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L69-L104

To me that suggests the NodeReady condition being in status Unknown is the only "unreachable" state; do you agree?
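
For illustration, that check could look like the sketch below, a minimal version of the nodeIsUnreachable helper that later appears in this PR's diff; the surrounding package and imports are assumed:

```go
import corev1 "k8s.io/api/core/v1"

// nodeIsUnreachable reports whether the node's Ready condition is Unknown,
// i.e. the node controller has lost contact with the kubelet.
func nodeIsUnreachable(node *corev1.Node) bool {
	for _, condition := range node.Status.Conditions {
		if condition.Type == corev1.NodeReady && condition.Status == corev1.ConditionUnknown {
			return true
		}
	}
	return false
}
```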


(Member) left a comment

Related upstream PR: kubernetes-sigs/cluster-api#2165

@alexander-demicev alexander-demicev force-pushed the unready-node-timeout branch 2 times, most recently from dc06c7d to c99f27b on January 29, 2020 16:18
@alexander-demicev alexander-demicev changed the title from "Use upstream drain library instead of downstream" to "Bug 1733474: Use upstream drain library instead of downstream" on Feb 3, 2020
@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Feb 3, 2020
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 5, 2020
@michaelgugino (Contributor) left a comment

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 5, 2020
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

5 similar comments

@enxebre (Member) commented Feb 6, 2020

/retest

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments

@alexander-demicev (Contributor, Author)

/retest

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments

@openshift-merge-robot openshift-merge-robot merged commit 8b94fd7 into openshift:master Feb 6, 2020
@openshift-ci-robot (Contributor)

@alexander-demichev: All pull requests linked via external trackers have merged. Bugzilla bug 1733474 has been moved to the MODIFIED state.

In response to this:

Bug 1733474: Use upstream drain library instead of downstream

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alexander-demicev alexander-demicev deleted the unready-node-timeout branch February 6, 2020 16:39
@@ -344,6 +346,10 @@ func (r *ReconcileMachine) drainNode(machine *machinev1.Machine) error {

```go
	DryRun: false,
}

if nodeIsUnreachable(node) {
	drainer.SkipWaitForDeleteTimeoutSeconds = skipWaitForDeleteTimeoutSeconds
```
(Member) left a comment

Can we include a logging message here? E.g. "This node is unreachable; draining will wait for pod deletion up to skipWaitForDeleteTimeoutSeconds after the request, and will skip waiting otherwise."

(Contributor, Author) left a comment

Makes sense. Can we do it after master opens?

(Contributor) left a comment

@enxebre Why not skip the drain entirely when we know the node is unreachable?

(Member) left a comment

The idea is that kubelet unreachability might be temporary. The intention is to tolerate unreachability for a reasonable timeframe, i.e. 5 minutes, before considering the node dead and deleting the underlying infra, since deleting it disrupts the app's intent for graceful shutdowns and PDB policies.
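
Putting this thread's suggestions together (a named constant, unreachable-only gating, and a log line), a minimal sketch could look like the following; the configureDrainerForNode wrapper and the log wording are illustrative assumptions, not the merged code:

```go
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/klog"
	"k8s.io/kubectl/pkg/drain"
)

// Tolerate kubelet unreachability for 5 minutes before giving up on
// waiting for pod deletions during a drain.
const skipWaitForDeleteTimeoutSeconds = 60 * 5

func configureDrainerForNode(drainer *drain.Helper, node *corev1.Node) {
	if nodeIsUnreachable(node) {
		// Bounded wait: after the timeout we stop waiting for each pod
		// deletion to complete, since the kubelet likely cannot confirm it.
		klog.Infof("node %q is unreachable; draining will wait up to %d seconds for each pod deletion before skipping it",
			node.Name, skipWaitForDeleteTimeoutSeconds)
		drainer.SkipWaitForDeleteTimeoutSeconds = skipWaitForDeleteTimeoutSeconds
	}
}
```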

@openshift-ci-robot (Contributor)

@alexander-demichev: The following test failed, say /retest to rerun all failed tests:

Test name: ci/prow/e2e-aws-scaleup-rhel7
Commit: 4e9ca8a
Rerun command: /test e2e-aws-scaleup-rhel7

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
lgtm: Indicates that a PR is ready to be merged.
size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.
8 participants