reflect pod-blocking issues into member status #562

Closed
joel-bluedata opened this issue Feb 17, 2022 · 4 comments

Comments

@joel-bluedata
Member

joel-bluedata commented Feb 17, 2022

There are various reasons why a K8s pod might not come up right away, e.g.:

  • bad image / inaccessible repo
  • can't schedule because of resources
  • can't schedule because of node taints
  • can't schedule because of affinity constraints
  • can't use PV in exclusive mode because it's already in use (see PR #561, "Support for mounting user created volumes")

...and probably others, but we're mainly concerned with the common cases. Currently, discovering these issues requires digging into pod status, which in turn requires some set of K8s API privileges and some knowledge of where to look. We don't want to mirror the whole pod status in our status -- that would be a lot of content and wouldn't really help with any of the know-where-to-look issues.

Instead, we need to surface some summary of common issues (like those above) into the per-member state.

Note that all the things I listed so far apply when the member is in "create pending" state and its lastKnownContainerState is "waiting" or "absent" -- this also sets the "membersWaiting" flag in the rollup status. So this work would be about drilling further into those situations.

I do also wonder if it would be useful to have a separate rollup flag for that "absent" state.

==========

Other than hard errors like the above, there are also fuzzier "taking too long" kinds of problems. We're already tackling one of those in issue #540 for PV initialization. There are a couple of other things we might be able to give insight on, like time spent waiting on image load or time spent running the config script. Not sure about that. Maybe that just boils down to "time spent in current member state" for transitional states?

==========

Along with issue #540 and issue #547, this work would improve the user experience for launching a kdcluster in the situations where anything is wrong.

@joel-bluedata
Member Author

joel-bluedata commented Feb 18, 2022

Something to ponder as an additional change: how proactive could we be about warning the kdcluster-submitter that the spec is likely not to be schedulable? This is a hard problem (and ultimately best-effort, since even if we give something the green light it could still fail), but we get an interesting number of requests for that.

Edit: this just seems like way too much to bite off, unless and until K8s introduces a scheduling dry-run capability. Without that capability we'd more or less need to replicate the K8s scheduling logic internally in KD, although we could make some simplifying assumptions. It would be good to research the OpenShift cluster capacity tool, which at first glance is attacking the same problem.

@joel-bluedata
Member Author

Let's look at the steps involved in, for example, advertising the "can't schedule" cases. For this exercise let's bundle them all together and just distinguish between them in the actual text of the error message, since I think that's what K8s does.

When this happens for a member, the per-member status currently will show:

```yaml
state: create pending
stateDetail:
  lastKnownContainerState: absent
```

So we can hook into the KD code that populates this status information to further investigate the pod and see if it is in the "not enough resources" case.

What should that investigation actually look like?

Well, if we look at the pod's status.conditions, it will have something like this:

```yaml
lastProbeTime: null
lastTransitionTime: "2022-03-09T00:27:11Z"
message: '0/4 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master:
  }, that the pod didn''t tolerate, 3 Insufficient cpu.'
reason: Unschedulable
status: "False"
type: PodScheduled
```

That message is interesting and seems to encapsulate what we want to expose to the user. As a strawman for how the per-member status will be modified, let's say that we want to add a "schedulingErrorMessage" field to that object, which will be populated with this message when we're in this situation. We may well want to modify this arrangement once we start looking at the other things we want to advertise, but let's use it as a working assumption.
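Under that working assumption, the per-member status for an unschedulable member would look something like this (a sketch only; the exact field placement is just the strawman above):

```yaml
state: create pending
stateDetail:
  lastKnownContainerState: absent
  schedulingErrorMessage: '0/4 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master:
    }, that the pod didn''t tolerate, 3 Insufficient cpu.'
```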

So OK, let's hop over to the KD code for a bit.

To add that field, we'll need to modify kubedirector.hpe.com_kubedirectorclusters_crd.yaml to add schedulingErrorMessage as another string field under stateDetail. Let's also say that we add a membersNotScheduled boolean field under memberStateRollup.
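As a rough sketch of those CRD edits (assuming the usual openAPIV3Schema property layout; the exact nesting in the real CRD file may differ):

```yaml
# sketch: under the per-member stateDetail properties
schedulingErrorMessage:
  type: string
# sketch: under the memberStateRollup properties
membersNotScheduled:
  type: boolean
```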

In kubedirectorcluster_types.go we'll need to correspondingly update the MemberStateDetail and StateRollup structures. For SchedulingErrorMessage in MemberStateDetail, let's make it a string pointer (rather than just a string) so it's clear when it is or isn't populated.
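A minimal sketch of those type changes, with existing fields elided and JSON tags assumed to match the CRD field names above:

```go
// Sketch only; surrounding fields and exact tags are assumptions.
type MemberStateDetail struct {
	// ...existing fields...
	LastKnownContainerState string `json:"lastKnownContainerState"`

	// SchedulingErrorMessage carries the scheduler's explanation when the
	// member's pod cannot be scheduled. A pointer distinguishes "not set"
	// from "empty string".
	SchedulingErrorMessage *string `json:"schedulingErrorMessage,omitempty"`
}

type StateRollup struct {
	// ...existing fields...
	MembersWaiting bool `json:"membersWaiting"`

	// MembersNotScheduled is true if any member's pod is currently
	// reported as unschedulable.
	MembersNotScheduled bool `json:"membersNotScheduled"`
}
```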

Now, where can we put the code to populate these? The place to do the pod investigation would be wherever that lastKnownContainerState is being set to "absent". In the KD code this value is actually a constant called containerMissing. That is set in the checkContainerStates function, so let's look at that function.

checkContainerStates actually starts with an assumption of lastKnownContainerState=containerMissing, then loads the Pod, then tries to determine if lastKnownContainerState can be set to something else. If it goes through that code and still ends up thinking that lastKnownContainerState=containerMissing, it can look into the status.conditions of the loaded Pod object. Probably by calling some subroutine, since this function is getting big.

That subroutine can look through the pod's status.conditions, looking for a condition where Type == corev1.PodScheduled. If it finds that condition AND sees that the Reason == corev1.PodReasonUnschedulable, it should set our new SchedulingErrorMessage field to whatever is in the condition's Message.

(Note that it's possible for an unschedulable pod to BECOME schedulable if things change, so we also need to think about clearing out that field when necessary. Probably we could just always clear it in checkContainerStates when we first start processing the per-member status.)
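A minimal sketch of what such a subroutine could look like (the function name, package, and the extra check of the condition's Status are my assumptions, not existing KD code):

```go
package kubedirectorcluster // hypothetical; wherever checkContainerStates lives

import (
	corev1 "k8s.io/api/core/v1"
)

// schedulingErrorFromPod returns the scheduler's message if the pod is
// currently marked unschedulable, or nil otherwise. The caller is assumed to
// have already cleared any previously recorded message for this member.
func schedulingErrorFromPod(pod *corev1.Pod) *string {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodScheduled &&
			cond.Status == corev1.ConditionFalse && // extra safety check beyond Reason
			cond.Reason == corev1.PodReasonUnschedulable {
			msg := cond.Message
			return &msg
		}
	}
	return nil
}
```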

Last question then would be where to set the rollup boolean of MembersNotScheduled. Rollups are set in the updateStateRollup function, and adding another one should be straightforward. If we want to be laser-targeted about it: in the memberCreatePending case, when LastKnownContainerState == containerMissing, we can additionally check whether SchedulingErrorMessage is non-nil and, if so, set MembersNotScheduled.
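As a hedged illustration of that check (reusing the assumed types from the sketch above, plus a minimal MemberStatus wrapper; the real updateStateRollup loop is organized differently and uses the memberCreatePending / containerMissing constants rather than string literals):

```go
// Assumed minimal shape of a per-member status entry, for this sketch only.
type MemberStatus struct {
	State       string
	StateDetail MemberStateDetail
}

// markNotScheduled sets MembersNotScheduled if any create-pending member with
// a missing container also has a scheduling error message recorded.
func markNotScheduled(members []MemberStatus, rollup *StateRollup) {
	for _, m := range members {
		if m.State == "create pending" && // memberCreatePending in real code
			m.StateDetail.LastKnownContainerState == "absent" && // containerMissing
			m.StateDetail.SchedulingErrorMessage != nil {
			rollup.MembersNotScheduled = true
			return
		}
	}
}
```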

@joel-bluedata
Member Author

We'll do the above-described "unschedulable" status advertisement as the first cut, to definitely be in the next minor KD release. Advertisement of bad image, inaccessible repo, and already-in-use-PV will be investigated next, and probably not handled for this next KD release.

The scheduling dry run is off the table for now and I'll update the relevant comment above accordingly.

@joel-bluedata
Member Author

A change has been added to partially address this. I'll split this issue so we can track the future work separately.
