Bug 1733474: Use upstream drain library instead of downstream #464
Conversation
We should resist making any more changes to our local drain library and see if we can import kubernetes/kubectl drain now. In upstream cluster-api we just copied in the files due to conflicting controller-runtime dependencies, but I think that is resolved in k8s >= 1.16.
Force-pushed from 1546dd8 to 8b7009a
/retest
1 similar comment
@alexander-demichev can you please add to the PR description links to the most significant specific changes we are bringing in here, e.g. kubernetes/kubernetes#85577
/retest
in a follow up let's set
Force-pushed from 8b7009a to 1546dd8
Force-pushed from 1546dd8 to 6920f7e
Force-pushed from 6920f7e to a34afcb
pkg/controller/machine/controller.go
Outdated
Out: writer{klog.Info},
ErrOut: writer{klog.Error},
DryRun: false,
SkipWaitForDeleteTimeoutSeconds: 60 * 5,
Maybe move this value to a constant?
Do we maybe want to set this value only when the node is unreachable, and log that?
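For illustration, the first suggestion might look something like this; a minimal sketch, and the constant name simply mirrors the one that shows up in the later diff:

```go
// skipWaitForDeleteTimeoutSeconds is how long drain keeps waiting for pods that
// are already marked for deletion before skipping them (5 minutes).
const skipWaitForDeleteTimeoutSeconds = 60 * 5
```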
cc @bison @JoelSpeed @michaelgugino wdyt
I would agree this should be a constant. As for only setting it when a node is unreachable: I think it will be quite a rare case where pods stick around for that long for shutdown reasons, but I have seen it (pods holding long-lived connections draining over 10 minutes to reduce reconnection spikes).
So if we can detect that the node is unreachable, only setting this in that case would be better.
Can looking at status.conditions be a good check for an unreachable node?
That or the taints placed on the node, whichever is easiest. This might help determine when to say it's unreachable: https://github.com/kubernetes/kubernetes/blob/ed3cc6afea6fa3d9f8e2a1544daaa12f87d2b65c/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L69-L104
To me that suggests the NodeReady condition being in status Unknown is the only "unreachable" state, do you agree?
Yes, you are right: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#taint-based-evictions
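A minimal sketch of that check, assuming the controller already has the corev1.Node in hand; the helper name matches the one used in the later diff, but the body here is illustrative rather than the final implementation:

```go
import corev1 "k8s.io/api/core/v1"

// nodeIsUnreachable reports whether the node's NodeReady condition is in
// status Unknown, which is the state the node lifecycle controller translates
// into the node.kubernetes.io/unreachable taint.
func nodeIsUnreachable(node *corev1.Node) bool {
	if node == nil {
		return false
	}
	for _, condition := range node.Status.Conditions {
		if condition.Type == corev1.NodeReady {
			return condition.Status == corev1.ConditionUnknown
		}
	}
	return false
}
```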
Related upstream PR: kubernetes-sigs/cluster-api#2165
Force-pushed from dc06c7d to c99f27b
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: enxebre. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/lgtm
/retest Please review the full test history for this PR and help us cut down flakes.
5 similar comments
/retest
/retest Please review the full test history for this PR and help us cut down flakes.
2 similar comments
/retest
/retest Please review the full test history for this PR and help us cut down flakes.
3 similar comments
@alexander-demichev: All pull requests linked via external trackers have merged. Bugzilla bug 1733474 has been moved to the MODIFIED state.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@@ -344,6 +346,10 @@ func (r *ReconcileMachine) drainNode(machine *machinev1.Machine) error {
	DryRun: false,
}

if nodeIsUnreachable(node) {
	drainer.SkipWaitForDeleteTimeoutSeconds = skipWaitForDeleteTimeoutSeconds
Can we include a logging message here, e.g. "This node is unreachable, draining will wait for pod deletion during skipWaitForDeleteTimeoutSeconds after the request and will skip otherwise."
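Something along these lines could slot into drainNode next to the check added in the hunk above; the log wording and level are illustrative, not the committed code:

```go
if nodeIsUnreachable(node) {
	// Hypothetical message following the suggestion above; exact wording is illustrative.
	klog.Infof("Node %q is unreachable, drain will only wait %d seconds for pods already marked for deletion, then skip them",
		node.Name, skipWaitForDeleteTimeoutSeconds)
	drainer.SkipWaitForDeleteTimeoutSeconds = skipWaitForDeleteTimeoutSeconds
}
```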
Makes sense, can we do it after master opens?
@enxebre Why not skip the drain entirely if we know the node to be unreachable?
The idea is that kubelet unreachability might be temporary. The intention is to tolerate unreachability for a reasonable timeframe, i.e. 5 minutes, before considering the node dead and deleting the underlying infra, which would disrupt the application's intent for graceful shutdowns and PDB policies.
@alexander-demichev: The following test failed:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
This PR drops our drain library in favor of the upstream one. This should make it easier to maintain, let us use new upstream features, and gets us kubernetes/kubernetes#85577 and kubernetes/kubernetes#85574.
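For context on what "using upstream" means in practice, here is a rough, self-contained sketch of how the k8s.io/kubectl/pkg/drain helper is typically wired up; the field values, writers, and error handling are illustrative and not the exact code in this PR:

```go
package example

import (
	"fmt"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

// drainNodeSketch cordons a node and then evicts or deletes the pods running
// on it using the upstream drain helper.
func drainNodeSketch(client kubernetes.Interface, node *corev1.Node) error {
	drainer := &drain.Helper{
		Client:              client,
		Force:               true,             // also remove pods not owned by a controller
		IgnoreAllDaemonSets: true,             // DaemonSet pods are recreated by their controller anyway
		GracePeriodSeconds:  -1,               // respect each pod's own termination grace period
		Timeout:             20 * time.Minute, // illustrative overall drain timeout
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}

	// Mark the node unschedulable first.
	if err := drain.RunCordonOrUncordon(drainer, node, true); err != nil {
		return fmt.Errorf("cordon failed: %v", err)
	}

	// Evict (or delete) every drainable pod, honoring PodDisruptionBudgets.
	if err := drain.RunNodeDrain(drainer, node.Name); err != nil {
		return fmt.Errorf("drain failed: %v", err)
	}
	return nil
}
```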