handle host/pod shutdown and reboot #272
Merged
This is really unlikely to happen, but if it does (perhaps because a handler is taking a really long time?) it would be bad. One consequence is that validateRoleChanges might get the wrong answer; our upcoming changes to better deal with pod error states would also be affected. Basically, anything in the validator that relies on looking at the cluster status stanza is made reliable by enforcing this invariant.
Let folks know at a glance if something is hinky. Will also be useful for the validator in the future.
configErrorDetail can contain a message if member is in config error state. lastGenerationUpdate has the Generation number of the spec most recently used to inject configmeta into the member. Won't be set if configmeta never injected. Doesn't necessarily need to be the current Generation if configmeta didn't change across spec changes (theoretically).
Also update the validator to reject spec changes if there are pending notifies.
This way we can retry failed notifications on subsequent handlers. Will also be very handy for handling restarted members.
This will be used to determine if the container got restarted (and therefore we might have something we need to do to it). The restart counter is not reliable. We need to get the container ID from executor.ReadFile, so for symmetry and possible future use I'll make the other similar guest-op functions also return container ID.
Informative; also used in the status rollup, and will be used in down-member handling.
Helps to also track which container we're currently working on during member bringup. Pushed the container ID checking down into the executor guest functions.
Don't report "members down" for a member in create-pending or deleting state. Also just a bit of reorg.
Don't report pending config data because of members that have no setup package to run. Add members restarting and members waiting. "Restarting" isn't possible quite yet, but will be with an upcoming change.
If the container ID changes on a ready or config-error member, boot it back to create pending state. Various bits of app setup tweaked to acknowledge that it could be a restart. Also note that if we are restarting after a config error, or if there's no persistent storage, we want to completely rerun the setup.
If there's an existing config status file with no final status in it yet, we need to know if maybe it was created by a previous incarnation of this container (and therefore the setup script is not actually running anymore and needs to be restarted). To do this we'll stash the container ID in that status file.
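As a rough sketch of that idea, in Go (the struct, field names, and helper below are hypothetical stand-ins, not the actual KubeDirector types):

```go
package sketch

// Hypothetical illustration: record which container incarnation wrote the
// in-progress config status file, so a later handler can tell whether the
// setup script is still alive in the current container or died with a
// previous one and needs to be restarted.
type configStatusFile struct {
	ContainerID string `json:"containerID"`           // container that started setup
	FinalStatus string `json:"finalStatus,omitempty"` // empty until setup finishes
}

// setupStillRunning returns true only if the unfinished status file was
// written by the container we are looking at right now.
func setupStillRunning(status configStatusFile, currentContainerID string) bool {
	if status.FinalStatus != "" {
		return false // setup already finished; nothing is running
	}
	// No final status yet: setup is only "still running" if the same
	// container incarnation that started it is still the one present.
	return status.ContainerID == currentContainerID
}
```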
With all our new kdcluster status, it can be the case that some things in status have changed even as we're removing our finalizer; because we have essentially cleaned up our internal object state at this point, the status update will be rejected by the validator. Making the same change for the config CR too, although I don't think it is currently vulnerable. (Obviously we need to factor out this status-mgmt code and share it.)
* Add "unresponsive" container state for members. This will be set for a container that is nominally running state if we can't send configmeta or notifies to it, or if K8s sets its ready condition to false. A container in this state will add to the membersDown count.
* Add "initializing" container state too. This will be set for a container that is nominally waiting state if it has a running init container.
* Remove the rollup booleans for pending stuff. Those conditions will result in "unresponsive" state detail for the member and also be reflected in the "membersDown" rollup.
* Fix the sense of the check for persistent storage in checkContainerStates.
* Fix the top of syncMembers so that it does handleReadyMembers on all roles... don't stop if one of the roles is incompletely handled.
* Also fix it to bail out if ANY member is lagging in its config data generation. Not just ready members. We might have a previously-configured member that is in the process of restarting.
* Fix checks comparing against cr.Generation that really should be checking against cr.Status.SpecGenerationToProcess.
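A minimal sketch of how that classification could look, using hypothetical state names and inputs distilled from the bullets above (not the actual checkContainerStates code):

```go
package sketch

// Illustrative only: map observed pod/container conditions onto the member
// container states described above.
type containerState string

const (
	containerRunning      containerState = "running"
	containerUnresponsive containerState = "unresponsive" // running but unreachable or not ready
	containerInitializing containerState = "initializing" // waiting with an init container still running
	containerWaiting      containerState = "waiting"
)

func classifyContainer(nominallyRunning, reachable, k8sReady, initRunning bool) containerState {
	if nominallyRunning {
		if !reachable || !k8sReady {
			return containerUnresponsive // counts toward the membersDown rollup
		}
		return containerRunning
	}
	if initRunning {
		return containerInitializing
	}
	return containerWaiting
}
```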
Anything specified here is a NOP anyway since status is now a subresource, but it can trigger validation failure in the case that a) we did make a change to the status but b) we are ignoring the status update because the object is being deleted. It should be OK to incur the cost of DeepCopy here since finalizer update is rare.
Call it lastSetupGeneration and have it reflect when either initial setup or notify was successfully run.
Similar to the change for members. "ready" implies too much. All we can definitively say is that we did what we intended to do. Some members might be currently down, etc... check the status rollup and member statuses for more info.
We have better ways of tracking that now.
Also let's not try to be overly clever w/ use of switch case fallthrough.
Been going over this w/ Swami offline and got approval, so will merge later today.
This PR addresses (at least partially) issue #18 and issue #66.
We want to add more status useful for analyzing when cluster members are in bad or transitional states, and add more smarts to reconciliation to help cluster members survive those states.
Note that this change does NOT include automatic recovery if the host for some member is permanently down. Someone with the right privileges will still need to explicitly delete that pod so that the statefulset controller will reschedule/recreate it. There is room for us to add value there in the future.
If a member is stuck in config-error state, you can also delete its pod to trigger another attempt at getting that member set up. (If there's a bug in your app image/setup it will probably just end up in config-error state again though.)
There are a couple more things I want to add after this in terms of dev/debug "trigger labels" that can be applied to a KD cluster to forcibly move members out of bad states. Not in this PR though.
The general ideas for handling pod death and/or restart better:
Be stricter in the validator about whether we can send additional changes to a cluster that has not finished digesting previous changes yet (because of dead pods). Once we start propagating spec changes to the app containers, we don't track enough state history to continue to allow more spec changes while the previous changes have not been fully propagated.
Be persistent -- if configmeta updates or add/delete notifies fail on one handler pass, try again later.
Notice if the container ID changes for a member, indicating that the pod has restarted. If so, pop the member back into create-pending state so we can send them back through the process of making sure they have their app configuration set up. This will cause any file injections to happen again, and one of four things to happen with app setup:
Specific changes in status:
A memberStateRollup block at the top level of the status stanza. This is to just help advertise when things are "going on" with some individual members; it has no functional impact. The properties in this block are:
A specGenerationToProcess property also in the top level. This reflects the generation of the KD cluster (from the resource's metadata) that the operator desires to reflect in the app setup. It's quite possible to make changes to the KD cluster resource, e.g. by just editing labels, that don't affect this number. This number is not useful for most clients but is used by reconciliation logic in KD.
A stateDetail block within each member status. This is used by reconciliation logic, and some of the properties can also be useful for other clients to help determine exactly what is up with an individual member. The properties in this block are:
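To make the shape of these additions concrete, here is a rough sketch in Go. The field names follow the discussion in this PR (membersDown/membersRestarting/membersWaiting, configErrorDetail, lastConfigDataGeneration, lastSetupGeneration, lastConfiguredContainer, pending notifies), but the exact types and tags, and whether the rollup holds counts or member-name lists, may differ from the real CRD definitions:

```go
package sketch

// MemberStateRollup is a sketch of the top-level rollup block; the real
// definition may use counts rather than member-name lists.
type MemberStateRollup struct {
	MembersDown       []string `json:"membersDown"`       // includes "unresponsive" members
	MembersRestarting []string `json:"membersRestarting"` // containers seen restarting
	MembersWaiting    []string `json:"membersWaiting"`    // containers not yet running
}

// MemberStateDetail is a sketch of the per-member stateDetail block.
type MemberStateDetail struct {
	ConfigErrorDetail        *string  `json:"configErrorDetail,omitempty"`        // set when member is in config error state
	LastConfigDataGeneration *int64   `json:"lastConfigDataGeneration,omitempty"` // spec generation last injected as configmeta
	LastSetupGeneration      *int64   `json:"lastSetupGeneration,omitempty"`      // generation when initial setup or notify last succeeded
	LastConfiguredContainer  string   `json:"lastConfiguredContainer,omitempty"`  // container ID we last configured
	PendingNotifyCmds        []string `json:"pendingNotifyCmds,omitempty"`        // add/delete notifies not yet delivered
}
```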
There are a couple of interesting situations that can affect these properties:
So that being said, here's how these properties are populated:
memberStateRollup is calculated in syncCluster at the end of every handler. updateStateRollup is called from the same deferred func that writes back any CR status changes. It's a pretty straightforward examination of each member status.
specGenerationToProcess is also populated in syncCluster; it's set to the current resource generation if syncClusterRoles returned clusterMembersChangedUnready, i.e. when something in the spec has changed that we need to propagate into the members.
The bits of the member stateDetail block are calculated at various points:
Here's how our validation has changed:
Our KD cluster validation webhook has to be a little more strict about when it can accept changes. Partly this is to support the reconciliation changes described below, and partly to close some existing holes. There are two new things we enforce:
Note: a good and important KD improvement would be to have the KD app spec explicitly state which roles need which notifies. So if for example the workers don't need to receive add/delete notifies, we know that we don't need to update their config or send them notifies on add/delete events... and so we won't get blocked from processing KD cluster changes if some of those workers are down.
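For illustration only, here is a minimal sketch of the "don't accept more spec changes while previous changes are still being digested" rule, keyed off pending notifies as described earlier. The types and function below are hypothetical stand-ins, not the actual webhook code:

```go
package sketch

import "fmt"

// Minimal stand-in types for illustration only.
type memberStatus struct {
	Pod               string
	PendingNotifyCmds []string
}

type roleStatus struct {
	Members []memberStatus
}

// validateSpecChangeAllowed refuses a new spec change while any member still
// has undelivered add/delete notifies from a previous change.
func validateSpecChangeAllowed(roles []roleStatus, specChanged bool) error {
	if !specChanged {
		return nil // e.g. label-only edits never touch app config
	}
	for _, role := range roles {
		for _, member := range role.Members {
			if len(member.PendingNotifyCmds) != 0 {
				return fmt.Errorf(
					"spec change rejected: member %s still has pending notifies",
					member.Pod)
			}
		}
	}
	return nil
}
```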
Finally here's a walkthrough of how the reconciliation has changed, aside from all the status-updating already described above.
One broad thing to notice is any time we run some function from guest.go we check to see that the container ID we're accessing is still the lastConfiguredContainer. (So you'll see "container ID" arguments passed down through a lot of functions.) If not, that is treated as an error, same as if we had just completely failed to execute the operation. Next handler will detect the container ID change and trigger the whole member recovery process.
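Here is a hedged sketch of that guard; runGuestOp and its callback shape are invented for illustration and are not the actual guest.go signatures:

```go
package sketch

import "fmt"

// runGuestOp illustrates the guard: every guest operation reports which
// container it actually ran against, and a mismatch with the member's
// last-configured container is treated the same as a failed operation, so
// the next handler will notice the container change and start recovery.
func runGuestOp(
	expectedContainerID string,
	op func() (containerID string, err error),
) error {
	gotID, err := op()
	if err != nil {
		return err
	}
	if gotID != expectedContainerID {
		return fmt.Errorf(
			"container ID changed (had %s, found %s); treating guest op as failed",
			expectedContainerID, gotID)
	}
	return nil
}
```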
OK, so first the high-level view of the handler processing as seen in syncCluster:
Before doing any sync-ing we will invoke checkContainerStates. Along with populating status as described above, this function's job is to handle any restart case i.e. we notice that the container ID has changed. The most basic thing to do here is to kick the member back to create-pending state so it can get reconfigured if necessary. If the member doesn't have persistent storage then we'll also handle the "clean slate restart" described above. (A rough sketch of this restart handling appears after this walkthrough.)
We haven't made any changes to syncClusterRoles. But notice that if checkContainerStates put any members into create-pending, syncClusterRoles will never return clusterMembersStableReady, and therefore we will proceed to do member syncing.
Member processing happens in syncMembers; we'll get to the changes to that below in the next part. We no longer pass a boolean flag to syncMembers indicating whether membership has changed in this handler. Instead, we'll make decisions about where to update configmeta based on the per-member status we're tracking.
When syncCluster exits, the defer func runs. This is where syncMemberNotifies is invoked, and that'll be the final part described below.
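As referenced from the checkContainerStates step above, a rough sketch of the restart handling; the state names and fields are illustrative, not the actual KubeDirector types:

```go
package sketch

type memberState string

const (
	memberConfigError   memberState = "configError"
	memberCreatePending memberState = "createPending"
)

type member struct {
	State                   memberState
	LastConfiguredContainer string
	HasPersistentStorage    bool
	RerunSetupFromScratch   bool
}

// checkRestart stands in for the restart handling in checkContainerStates:
// if the container ID has changed since we last configured this member, kick
// it back to create-pending; after a config error, or without persistent
// storage, setup must also be rerun from a clean slate.
func checkRestart(m *member, currentContainerID string) {
	if m.LastConfiguredContainer == "" || m.LastConfiguredContainer == currentContainerID {
		return // never configured yet, or same container incarnation
	}
	if m.State == memberConfigError || !m.HasPersistentStorage {
		m.RerunSetupFromScratch = true
	}
	m.State = memberCreatePending
}
```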
The next part to mention is syncMembers, and the processing functions it invokes for the different member states:
In the loop that calls handleReadyMembers we now attempt sending notifies to each role, even if some role along the way encounters issues.
handleReadyMembers will skip updating a configured member's configmeta if that role doesn't have a setup package or if the member's lastConfigDataGeneration indicates that it already has the configmeta (see the sketch after this list).
We now bail out before any further processing if ANY previously-configured member is stuck on an old configmeta (not just ready members). This catches members that were previously configured but are currently "rebooting"... we need to wait for them to come back & give them the update. This maintains a previously-established invariant that to get past this step of the reconciliation, no member can have a stale version of the configmeta.
handleCreatePendingMembers no longer uses an "internal" configured state to help decide where to send notifies; our other state we're tracking can better cover that. (Cf. the code and comments at the end of this function.) And instead of immediately sending out notifies, handleCreatePendingMembers calls generateNotifies which adds to the pending-notifies lists of the target members.
handleDeletePendingMembers also queues up notifies rather than sending them right away.
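As a small illustration of the skip logic referenced above for handleReadyMembers, under the assumption that each member tracks lastConfigDataGeneration as described; the function below is hypothetical:

```go
package sketch

// needsConfigMetaUpdate decides whether a configured member should get a
// fresh configmeta injection for the spec generation we are processing.
func needsConfigMetaUpdate(
	roleHasSetupPackage bool,
	lastConfigDataGeneration *int64,
	specGenerationToProcess int64,
) bool {
	if !roleHasSetupPackage {
		return false // nothing in this role consumes configmeta
	}
	if lastConfigDataGeneration != nil && *lastConfigDataGeneration == specGenerationToProcess {
		return false // member already has the current configmeta
	}
	return true
}
```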
The last part is syncMemberNotifies. This always happens at the end of each handler, regardless of whether we had to handle any membership changes.
syncMemberNotifies walks through all configured-state members and tries to send them any add/delete notifies stored in their pending-notifies lists. On success a notify is removed from the list. Ideally all these notifies will succeed, but if some of them don't, then they'll stay on the list and we'll try them again next handler. Meanwhile the KD cluster status will be advertising that there are pending notifies (which block further cluster spec changes) and that maybe something needs to get fixed.
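Finally, a minimal sketch of that retry behavior: pending notifies survive across handler passes until each one is delivered. The notifyCmd type and function are illustrative only, not the actual syncMemberNotifies code:

```go
package sketch

// notifyCmd represents one queued add/delete notify for a target member.
type notifyCmd struct {
	Description string
	Run         func() error // attempts delivery to the target member
}

// syncPendingNotifies tries every queued notify and keeps only the failures,
// so the next handler pass will retry them while the cluster status continues
// to advertise that notifies are pending.
func syncPendingNotifies(pending []notifyCmd) (remaining []notifyCmd) {
	for _, cmd := range pending {
		if err := cmd.Run(); err != nil {
			remaining = append(remaining, cmd) // leave it queued; retry next handler
		}
	}
	return remaining
}
```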