handle host/pod shutdown and reboot #272

Merged (25 commits), Feb 15, 2020

Conversation

joel-bluedata (Member)

This PR addresses (at least partially) issue #18 and issue #66.

We want to add more status information that's useful for analyzing when cluster members are in bad or transitional states, and add more smarts to reconciliation to help cluster members survive those states.

Note that this change does NOT include automatic recovery if the host for some member is permanently down. Someone with the right privileges will still need to explicitly delete that pod so that the statefulset controller will reschedule/recreate it. There is room for us to add value there in the future.

If a member is stuck in config-error state, you can also delete its pod to trigger another attempt at getting that member set up. (If there's a bug in your app image/setup it will probably just end up in config-error state again though.)

There are a couple more things I want to add after this in terms of dev/debug "trigger labels" that can be applied to a KD cluster to forcibly move members out of bad states. Not in this PR though.


The general ideas for handling pod death and/or restart better:

  • Be stricter in the validator about whether we can send additional changes to a cluster that has not finished digesting previous changes yet (because of dead pods). Once we start propagating spec changes to the app containers, we don't track enough state history to continue to allow more spec changes while the previous changes have not been fully propagated.

  • Be persistent -- if configmeta updates or add/delete notifies fail on one handler pass, try again later.

  • Notice if the container ID changes for a member, indicating that the pod has restarted. If so, pop the member back into create-pending state so we can send them back through the process of making sure they have their app configuration set up. This will cause any file injections to happen again, and one of four things to happen with app setup (a rough sketch of this decision follows the list):

    • If the role has no setup package, do nothing.
    • Or if the previous setup had ended in a config error, re-run setup from scratch.
    • Or if the member has no persistent storage, re-run setup from scratch.
    • Or just run through the normal appConfig code paying attention to any existing status file. If it shows that the previous config completed successfully then no need to do initial setup again. (If stuff has happened while the pod was down, it may immediately then need to get add/delete notifies.)
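
To make that concrete, here's a minimal sketch of the decision, assuming hypothetical member fields and outcome labels (this is not the actual KD reconciler code, just an illustration of the four cases above):

```go
package reconcile

// memberRestartInfo is a hypothetical, pared-down view of the state the real
// reconciler consults; the field names here are illustrative only.
type memberRestartInfo struct {
	hasSetupPackage      bool
	hadConfigError       bool
	hasPersistentStorage bool
	priorConfigSucceeded bool // from any surviving configure.status file
}

// restartSetupAction sketches the four outcomes described above for a member
// that has been popped back to create-pending after its container restarted.
func restartSetupAction(m memberRestartInfo) string {
	switch {
	case !m.hasSetupPackage:
		return "nothing to do"
	case m.hadConfigError:
		return "re-run setup from scratch" // "error retry restart"
	case !m.hasPersistentStorage:
		return "re-run setup from scratch" // "clean slate restart"
	case m.priorConfigSucceeded:
		// Prior config survived on persistent storage; skip initial setup and
		// just deliver any add/delete notifies that accumulated while down.
		return "skip initial setup"
	default:
		// Normal appConfig path, honoring whatever the status file says.
		return "run appConfig normally"
	}
}
```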

Specific changes in status (a rough Go rendering of these fields follows this list):

  • A memberStateRollup block at the top level of the status stanza. This is just to help advertise when things are "going on" with some individual members; it has no functional impact. The properties in this block are:

    • membershipChanging: This is true if members are in the process of being added to or removed from the cluster. Specifically these are members in the creating or create-pending state that have not previously been successfully brought to a configured state (i.e. are not just restarting), or members in the deleting or delete-pending state.
    • membersDown: This is true if member containers (our "app" container in each pod) are in a terminated or unresponsive state or just plain missing. Members in deleting or create-pending state are not included here.
    • membersInitializing: This is true if members are waiting on persistent storage setup. This may take a while if large quantities of data must be initially copied to a PV.
    • membersWaiting: This is true if member containers are waiting for some condition (other than persistent storage setup) before they can start running. Investigate the member status and pod resource status/events for details if this persists. May require admin intervention; e.g. a Docker image pull will resolve itself if the image is indeed available in the repo, but lack of resources might not.
    • membersRestarting: This is true if members are in the process of coming back up and possibly getting their app config/setup refreshed. Specifically these are members in the creating or create-pending state that have previously been successfully brought to a configured state.
    • configErrors: This is true if any members are in the config-error state.
  • A specGenerationToProcess property, also at the top level of the status stanza. This reflects the generation of the KD cluster (from the resource's metadata) that the operator intends to reflect in the app setup. It's quite possible to make changes to the KD cluster resource, e.g. by just editing labels, that don't affect this number. This number is not useful for most clients but is used by reconciliation logic in KD.

  • A stateDetail block within each member status. This is used by reconciliation logic, and some of the properties can also be useful for other clients to help determine exactly what is up with an individual member. The properties in this block are:

    • configErrorDetail: A string that is populated if and only if the member is in config-error state, and then also persisted after restarting from config-error state until config is attempted again. Describes the error encountered.
    • lastConfigDataGeneration: An integer populated when the member has configmeta injected (note that for no-setup-package roles this will not happen). Advertises the KD cluster resource generation number for which configmeta was generated and pushed into the member. Updated each time that configmeta is refreshed for this member.
    • lastSetupGeneration: An integer set equal to lastConfigDataGeneration when the member runs a setup script to process that config data. This can be at initial setup time, at reboot time if setup needs to be re-run, or when processing an add/delete notify.
    • configuringContainer: While going through initial configuration, this is the Docker container ID corresponding to our app container inside the member's pod. Functionally this is just an opaque UUID that we trust will change if the container gets restarted.
    • lastConfiguredContainer: The Docker container ID at the time that the member successfully reached configured state. Again all we care about is that this is a UUID that changes if the container gets restarted.
    • lastKnownContainerState: Just advertised FYI. Can be one of:
      • running: Container is in the K8s running state, and neither KD nor K8s are having trouble contacting it.
      • unresponsive: Container is in the K8s running state. But KD can't execute necessary commands in it, and/or K8s has its Ready condition set false.
      • initializing: An init container is running (to set up persistent storage).
      • waiting: Container is in the K8s waiting state but an init container is NOT what it is waiting on. The most likely candidate for time spent in this state is a docker image pull.
      • terminated: Container is in the K8s terminated state.
      • absent: Container does not exist. Not a problem if member state is create-pending, but probably bad news otherwise. If this persists while in create-pending, it's usually a case of insufficient resources to schedule.
      • unknown: Couldn't determine a result based on the rules above. "Shouldn't happen."
    • pendingNotifyCmds: A list of objects describing the add or delete notifies to be sent to a member. Each object just has one property "arguments" whose value is a list of the appropriate startscript arguments. It's fine for objects to come and go in this list, but if they remain here then that will block future spec changes to the KD cluster.
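
For reference, here's a rough Go rendering of the new status fields described above. The field names match the JSON properties; the actual KubeDirector type declarations may differ in detail:

```go
package v1beta1

// Additions at the top level of the KD cluster status (sketch):
//
//	SpecGenerationToProcess *int64            `json:"specGenerationToProcess,omitempty"`
//	MemberStateRollup       MemberStateRollup `json:"memberStateRollup"`

// MemberStateRollup advertises when things are "going on" with some members.
type MemberStateRollup struct {
	MembershipChanging  bool `json:"membershipChanging"`
	MembersDown         bool `json:"membersDown"`
	MembersInitializing bool `json:"membersInitializing"`
	MembersWaiting      bool `json:"membersWaiting"`
	MembersRestarting   bool `json:"membersRestarting"`
	ConfigErrors        bool `json:"configErrors"`
}

// MemberStateDetail is the per-member bookkeeping used by reconciliation.
type MemberStateDetail struct {
	ConfigErrorDetail        *string     `json:"configErrorDetail,omitempty"`
	LastConfigDataGeneration *int64      `json:"lastConfigDataGeneration,omitempty"`
	LastSetupGeneration      *int64      `json:"lastSetupGeneration,omitempty"`
	ConfiguringContainer     string      `json:"configuringContainer,omitempty"`
	LastConfiguredContainer  string      `json:"lastConfiguredContainer,omitempty"`
	LastKnownContainerState  string      `json:"lastKnownContainerState,omitempty"`
	PendingNotifyCmds        []NotifyCmd `json:"pendingNotifyCmds,omitempty"`
}

// NotifyCmd is one queued add/delete notify: just the startscript arguments.
type NotifyCmd struct {
	Arguments []string `json:"arguments"`
}
```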

There are a couple of interesting situations that can affect these properties (a sketch of the state-clearing in each case follows this list):

  • appConfig is run on a restarted member that was previously in the config-error state. In this case we will try to re-run initial configuration from scratch. We will intentionally forget any status that stores previous config results or pending add/delete notifications. We'll call this the "error retry restart" situation.
  • A container gets restarted and it doesn't have persistent storage. In this case we will definitely need to re-run appConfig even if it was previously successful. We need to forget not only previous config results but also any idea of the configmeta version sent to the member. We'll call this the "clean slate restart" situation. checkContainerState will detect this case at the beginning of our cluster handler. appConfig should naturally do setup from scratch since any configure.status file will have been lost.
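
A small sketch of what gets forgotten in each situation, using the stateDetail fields described above (illustrative types; the real logic lives in checkContainerStates and the appConfig path):

```go
package reconcile

// stateDetailSketch holds just the fields relevant to the two restart
// situations; illustrative only.
type stateDetailSketch struct {
	lastConfigDataGeneration *int64
	lastSetupGeneration      *int64
	pendingNotifyCmds        [][]string
}

// forgetOnErrorRetryRestart: drop previous config results and queued notifies,
// but keep our record of which configmeta generation the member has seen.
func forgetOnErrorRetryRestart(d *stateDetailSketch) {
	d.lastSetupGeneration = nil
	d.pendingNotifyCmds = nil
}

// forgetOnCleanSlateRestart: drop all of the above plus any idea of the
// configmeta version previously sent, since nothing survived in the container.
func forgetOnCleanSlateRestart(d *stateDetailSketch) {
	forgetOnErrorRetryRestart(d)
	d.lastConfigDataGeneration = nil
}
```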

So that being said, here's how these properties are populated:

  • memberStateRollup is calculated in syncCluster at the end of every handler. updateStateRollup is called from the same deferred func that writes back any CR status changes. It's a pretty straightforward examination of each member status.

  • specGenerationToProcess is also populated in syncCluster; it's set to the current resource generation if syncClusterRoles returned clusterMembersChangedUnready, i.e. when something in the spec has changed that we need to propagate into the members.

  • The bits of the member stateDetail block are calculated at various points:

    • configErrorDetail is set whenever configuration succeeds/fails in handleCreatingMembers. It is cleared on success and set to some message on failure.
    • lastConfigDataGeneration is normally set either in appConfig (during handleCreatingMembers) when first configuring a member or in handleReadyMembers when updating the configmeta. It will be cleared in a "clean slate restart".
    • lastSetupGeneration is set equal to lastConfigDataGeneration whenever that data is processed by the member. It is initially set when initial setup completes, and is then updated when any notify to the member succeeds. It will be cleared in "error retry restart" as well as "clean slate restart".
    • configuringContainer is set in handleCreatePendingMembers when moving members to creating state, and then cleared in handleCreatingMembers when setup is complete.
    • lastConfiguredContainer is set in handleCreatingMembers when setup is complete.
    • lastKnownContainerState is set in checkContainerStates.
    • pendingNotifyCmds is normally set/modified in queueNotify (during handleCreatingMembers and handleDeletePendingMembers) and then processed/cleared in syncMemberNotifies, called at the end of each cluster handler in the deferred func. It is cleared at the end of setup, because setup will have captured the same info that a notify would bring. It will also be cleared in "error retry restart" as well as "clean slate restart".

Here's how our validation has changed:

Our KD cluster validation webhook has to be a little more strict about when it can accept changes. Partly this is to support the reconciliation changes described below, and partly to close some existing holes. There are two new things we enforce (sketched after this list):

  • A change to the spec is not allowed if the reconciler has not yet taken at least one pass through the previous change. (Changes to metadata like labels are fine though.) It's possible that a quick race between multiple client requests to change a given KD cluster could hit this restriction, but in practice this shouldn't happen unless a reconciler handler for a specific cluster is permanently stuck without returning because of some bug.
  • A change to the spec is not allowed if there are pending notifies. (Because we only persist the need for notifies, and not their content, we can't change the source info that will be used to generate that content.) For a healthy cluster the pending notifies list should be generated and cleared inside the same handler pass and the validator won't see them. But if some of the configured-state members are down and not able to process their notifies, you'll be blocked from making further spec changes.
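
A minimal sketch of those two rejections, assuming a hypothetical pared-down view of the incoming update (the real admission webhook works on the full KD cluster objects and derives this information itself):

```go
package validator

import "errors"

// specChangeView is a hypothetical summary of what the webhook needs to know
// about an incoming update; the real code derives this from old/new objects.
type specChangeView struct {
	specChanged          bool // metadata-only edits (e.g. labels) leave this false
	priorChangeProcessed bool // reconciler has taken at least one pass at the previous spec change
	pendingNotifyCount   int  // total queued notifies across all members
}

// admitSpecChange sketches the two new checks described above.
func admitSpecChange(v specChangeView) error {
	if !v.specChanged {
		return nil
	}
	if !v.priorChangeProcessed {
		return errors.New("previous spec change not yet processed; retry shortly")
	}
	if v.pendingNotifyCount != 0 {
		return errors.New("members have pending notifies; spec changes blocked until they are delivered")
	}
	return nil
}
```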

Note: a good and important KD improvement would be to have the KD app spec explicitly state which roles need which notifies. So if for example the workers don't need to receive add/delete notifies, we know that we don't need to update their config or send them notifies on add/delete events... and so we won't get blocked from processing KD cluster changes if some of those workers are down.


Finally here's a walkthrough of how the reconciliation has changed, aside from all the status-updating already described above.

One broad thing to notice: any time we run some function from guest.go, we check that the container ID we're accessing is still the lastConfiguredContainer. (So you'll see "container ID" arguments passed down through a lot of functions.) If it isn't, that is treated as an error, the same as if we had completely failed to execute the operation. The next handler pass will detect the container ID change and trigger the whole member recovery process.
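
The pattern looks roughly like this. This is a hypothetical wrapper just to illustrate the idea; in the actual change the check is pushed down into the executor guest functions:

```go
package executor

import "fmt"

// guestOp is an illustrative signature: the operation reports the ID of the
// container it actually executed against.
type guestOp func() (containerID string, err error)

// runAgainstContainer treats "wrong container" the same as an outright failure
// to execute; the next handler pass will notice the container ID change and
// push the member through the restart/recovery flow.
func runAgainstContainer(expectedContainerID string, op guestOp) error {
	gotID, err := op()
	if err != nil {
		return err
	}
	if gotID != expectedContainerID {
		return fmt.Errorf("container changed (expected %s, got %s)",
			expectedContainerID, gotID)
	}
	return nil
}
```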

OK, so first the high-level view of the handler processing as seen in syncCluster (a simplified outline follows this list):

  • Before doing any sync-ing we will invoke checkContainerStates. Along with populating status as described above, this function's job is to handle any restart case, i.e. where we notice that the container ID has changed. The most basic thing to do here is to kick the member back to create-pending state so it can get reconfigured if necessary. If the member doesn't have persistent storage then we'll also handle the "clean slate restart" described above.

  • We haven't made any changes to syncClusterRoles. But notice that if checkContainerStates put any members into create-pending, syncClusterRoles will never return clusterMembersStableReady, and therefore we will proceed to do member syncing.

  • Member processing happens in syncMembers; we'll get to the changes to that below in the next part. We no longer pass a boolean flag to syncMembers indicating whether membership has changed in this handler. Instead, we'll make decisions about where to update configmeta based on the per-member status we're tracking.

  • When syncCluster exits, the defer func runs. This is where syncMemberNotifies is invoked, and that'll be the final part described below.
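
Pulling those pieces together, the handler flow is roughly as follows. This is a simplified outline with stubbed-out types and helpers; the function names mirror the real ones, the signatures and return values do not:

```go
package reconcile

// Stubs standing in for the real KD types and helpers; illustrative only.
type clusterCR struct{}
type membersState int

const (
	clusterMembersStableReady membersState = iota
	clusterMembersChangedUnready
)

func checkContainerStates(cr *clusterCR)          {}
func syncClusterRoles(cr *clusterCR) membersState { return clusterMembersChangedUnready }
func syncMembers(cr *clusterCR)                   {}
func updateStateRollup(cr *clusterCR)             {}
func syncMemberNotifies(cr *clusterCR)            {}
func writeStatus(cr *clusterCR)                   {}

// syncClusterOutline mirrors the ordering described above.
func syncClusterOutline(cr *clusterCR) {
	defer func() {
		// Runs at the end of every handler pass: flush queued notifies,
		// recompute the state rollup, and write back status changes.
		syncMemberNotifies(cr)
		updateStateRollup(cr)
		writeStatus(cr)
	}()

	// Detect restarts first; may kick members back to create-pending and
	// apply "clean slate" handling if there is no persistent storage.
	checkContainerStates(cr)

	if syncClusterRoles(cr) == clusterMembersStableReady {
		return // nothing membership-related to propagate this pass
	}

	// Per-member processing: configmeta refresh, creates, deletes, queuing
	// of add/delete notifies.
	syncMembers(cr)
}
```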

The next part to mention is syncMembers, and the processing functions it invokes for the different member states:

  • In the loop that calls handleReadyMembers we now attempt sending notifies to each role, even if some role along the way encounters issues.

  • handleReadyMembers will skip updating a configured member's configmeta if that role doesn't have a setup package or if the member's lastConfigDataGeneration indicates that it already has the configmeta.

  • We now bail out before any further processing if ANY previously-configured member is stuck on an old configmeta (not just ready members). This catches members that were previously configured but are currently "rebooting"... we need to wait for them to come back & give them the update. This maintains a previously-established invariant that to get past this step of the reconciliation, no member can have a stale version of the configmeta. (A conceptual sketch of this check follows the list.)

  • handleCreatePendingMembers no longer uses an "internal" configured state to help decide where to send notifies; our other state we're tracking can better cover that. (Cf. the code and comments at the end of this function.) And instead of immediately sending out notifies, handleCreatePendingMembers calls generateNotifies which adds to the pending-notifies lists of the target members.

  • handleDeletePendingMembers also queues up notifies rather than sending them right away.
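
For example, the "no stale configmeta" gate could be expressed conceptually like this (illustrative types and field names, not the actual syncMembers code):

```go
package reconcile

// memberConfigView is an illustrative slice of per-member status.
type memberConfigView struct {
	previouslyConfigured     bool   // has reached configured state at some point
	hasSetupPackage          bool   // role actually consumes configmeta
	lastConfigDataGeneration *int64 // nil if configmeta never injected
}

// anyStaleConfigmeta reports whether some previously-configured member (ready
// or currently restarting) still carries configmeta older than the generation
// we're trying to propagate (cr.Status.SpecGenerationToProcess).
func anyStaleConfigmeta(members []memberConfigView, specGenerationToProcess int64) bool {
	for _, m := range members {
		if !m.previouslyConfigured || !m.hasSetupPackage {
			continue
		}
		if m.lastConfigDataGeneration == nil ||
			*m.lastConfigDataGeneration < specGenerationToProcess {
			return true
		}
	}
	return false
}
```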

The last part is syncMemberNotifies. This always happens at the end of each handler, regardless of whether we had to handle any membership changes.

syncMemberNotifies walks through all configured-state members and tries to send them any add/delete notifies stored in their pending-notifies lists. On success a notify is removed from the list. Ideally all these notifies will succeed, but if some of them don't, then they'll stay on the list and we'll try them again next handler. Meanwhile the KD cluster status will be advertising that there are pending notifies (which block further cluster spec changes) and that maybe something needs to get fixed.
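
A minimal sketch of that per-member loop, assuming an illustrative sendNotify callback (in the real code the notify is delivered by running the member's startscript with the stored arguments):

```go
package reconcile

// pendingNotify mirrors a pendingNotifyCmds entry: just startscript arguments.
type pendingNotify struct {
	arguments []string
}

// flushNotifies tries every queued notify for one configured member. Anything
// that fails stays queued (and advertised in status) to be retried on the next
// handler pass.
func flushNotifies(
	queued []pendingNotify,
	sendNotify func(args []string) error, // illustrative; see lead-in
) []pendingNotify {
	var remaining []pendingNotify
	for _, n := range queued {
		if err := sendNotify(n.arguments); err != nil {
			remaining = append(remaining, n)
		}
	}
	return remaining
}
```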

neogeographica and others added 25 commits February 9, 2020 12:56
This is really unlikely to happen, but if it does (perhaps because a handler is taking a really long time?) it would be bad.

One problem is that validateRoleChanges might get the wrong answer. It also affects our upcoming changes to better deal with pod error states. Basically anything here in the validator that relies on looking at the cluster status stanza is made reliable by enforcing this invariant.
Let folks know at a glance if something is hinky. Will also be useful for the validator in the future.
configErrorDetail can contain a message if the member is in config-error state.

lastGenerationUpdate has the Generation number of the spec most recently used to inject configmeta into the member. Won't be set if configmeta never injected. Doesn't necessarily need to be the current Generation if configmeta didn't change across spec changes (theoretically).
Also update the validator to reject spec changes if there are pending notifies.
This way we can retry failed notifications on subsequent handlers. Will also be very handy for handling restarted members.
This will be used to determine if the container got restarted (and therefore we might have something we need to do to it). The restart counter is not reliable.

We need to get the container ID from executor.ReadFile, so for symmetry and possible future use I'll make the other similar guest-op functions also return container ID.
Informative; also used in the status rollup and will be used in down-member handling.
Helps to also track which container we're currently working on during member bringup.

Pushed the container ID checking down into the executor guest functions.
Don't report "members down" for a member in create-pending or deleting state. Also just a bit of reorg.
Don't report pending config data because of members that have no setup package to run.

Add members restarting and members waiting. "Restarting" isn't possible quite yet, but will be with an upcoming change.
If the container ID changes on a ready or config-error member, boot it back to create pending state.

Various bits of app setup tweaked to acknowledge that it could be a restart. Also note that if we are restarting after a config error, or if there's no persistent storage, we want to completely rerun the setup.
If there's an existing config status file with no final status in it yet, we need to know if maybe it was created by a previous incarnation of this container (and therefore the setup script is not actually running anymore and needs to be restarted). To do this we'll stash the container ID in that status file.
With all our new kdcluster status it can be the case that some things in status have changed even as we're removing our finalizer, and because we have essentially cleaned up our internal object state at this point the status update will be rejected by the validator.

Making the same change for config CR too although I don't think it is currently vulnerable.

(Obviously we need to factor out this status-mgmt code and share it.)
* Add "unresponsive" container state for members. This will be set for a container that is nominally in running state if we can't send configmeta or notifies to it, or if K8s sets its ready condition to false. A container in this state will add to the membersDown count.

* Add "initializing" container state too. This will be set for a container that is nominally in waiting state if it has a running init container.

* Remove the rollup booleans for pending stuff. Those conditions will result in "unresponsive" state detail for the member and also be reflected in the "membersDown" rollup.

* Fix the sense of the check for persistent storage in checkContainerStates.

* Fix the top of syncMembers so that it does handleReadyMembers on all roles... don't stop if one of the roles is incompletely handled.

* Also fix it to bail out if ANY member is lagging in its config data generation. Not just ready members. We might have a previously-configured member that is in the process of restarting.

* Fix checks comparing against cr.Generation that really should be checking against cr.Status.SpecGenerationToProcess.
Anything specified here is a NOP anyway since status is now a subresource, but it can trigger validation failure in the case that a) we did make a change to the status but b) we are ignoring the status update because the object is being deleted.

It should be OK to incur the cost of DeepCopy here since finalizer update is rare.
Call it lastSetupGeneration and have it reflect when either initial setup or notify was successfully run.
Similar to the change for members. "ready" implies too much. All we can definitively say is that we did what we intended to do. Some members might be currently down, etc... check the status rollup and member statuses for more info.
We have better ways of tracking that now.
Also let's not try to be overly clever w/ use of switch case fallthrough.
@joel-bluedata (Member, Author)

Been going over this w/ Swami offline and got approval, so will merge later today.
