etcd (un)availability #373
/cc @jonboulle @unihorn
For posterity - I propose that every Job has an "SLA", and when contemplating leaving the cluster, an Agent will consider the job terminable only on the expiration of this SLA; this allows a (configurable) recovery window for transient network hiccups. For example, if a Job has an SLA of 10 minutes and an Agent suffers a network partition from the Registry, the Agent would wait 10 minutes to recover connectivity to the Registry before it terminates the job. Similarly, on the other side, an Engine would probably wait up to the SLA before attempting to reschedule the Job (the exact semantics here would depend on how exactly agent/job health is heartbeated). Agents and Engines would have a default SLA applied to all Jobs; any SLA configured in the Jobs themselves (e.g. …) would take precedence.
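To make the proposal concrete, here is a minimal Go sketch of the Agent-side wait-out-the-SLA behaviour described above. None of these types or functions exist in fleet; the `SLA` field, the poll interval, and the `registryReachable` connectivity check are assumptions purely for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Job is a stand-in for fleet's job type, extended with the proposed SLA.
// A zero SLA means "use the agent-wide default".
type Job struct {
	Name string
	SLA  time.Duration
}

// registryReachable is a placeholder for whatever connectivity check the
// Agent would run against the Registry (etcd).
func registryReachable() bool { return false }

// shouldTerminate waits out the Job's SLA before giving up on the Registry.
// It returns true only if connectivity never returned within the window.
func shouldTerminate(j Job, defaultSLA time.Duration) bool {
	sla := j.SLA
	if sla == 0 {
		sla = defaultSLA
	}
	deadline := time.Now().Add(sla)
	for time.Now().Before(deadline) {
		if registryReachable() {
			// Transient hiccup recovered inside the SLA: keep the Job running.
			return false
		}
		time.Sleep(5 * time.Second) // hypothetical poll interval
	}
	// SLA expired without the Registry coming back: the Job is now terminable
	// here, and an Engine applying the same window may reschedule it elsewhere.
	return true
}

func main() {
	// Short window only so the demo finishes quickly; the proposal above
	// uses 10 minutes as its example.
	j := Job{Name: "memcache.service", SLA: 10 * time.Second}
	fmt.Println("terminable:", shouldTerminate(j, 5*time.Minute))
}
```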
+1 for a config/SLA-based approach. Some jobs have a long warm-up time (think memcache) where it's preferable to wait a certain amount of time for network connectivity to be restored instead of immediately rescheduling replacement jobs. An SLA would allow network maintenance to occur (which disconnects Agents for a short period of time, within the SLA) without adversely impacting those jobs by killing them and starting replacements. This is essentially saying "this job can disappear for X minutes, then start a replacement".

I don't think the Agent should terminate jobs when leaving the cluster (when etcd is unavailable), even for an extended period of time. The action an Agent takes upon rejoining the cluster should be configurable on a per-job basis (defined in the Job's config/SLA): running jobs could either be killed if a replacement has been started elsewhere, or could continue running and the replacement killed if it exists. This is essentially saying "when jobs disappear, start replacements and kill any old jobs that reappear" or "when jobs disappear, start replacements but kill them if the old jobs reappear".
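A rough sketch of how that per-job rejoin behaviour could be expressed, again in Go and with invented names; fleet has no such option today, so this is only an illustration of the two policies described above.

```go
package main

import "fmt"

// RejoinPolicy says which copy of a Job survives when an Agent rejoins the
// cluster and a replacement may have been scheduled elsewhere in the meantime.
type RejoinPolicy int

const (
	// KillOld: the replacement wins; stop the original on rejoin.
	KillOld RejoinPolicy = iota
	// KillReplacement: the original wins; tear down the replacement if one exists.
	KillReplacement
)

// reconcile decides what the Agent does with its still-running Job on rejoin.
func reconcile(p RejoinPolicy, replacementRunning bool) string {
	switch {
	case !replacementRunning:
		return "keep original; no replacement was started"
	case p == KillOld:
		return "stop original; replacement keeps running"
	default:
		return "keep original; stop the replacement"
	}
}

func main() {
	// A job that prefers its original instance once the Agent reconnects.
	fmt.Println(reconcile(KillReplacement, true))
}
```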
This was the 80% solution that lets us kill the etcd leader in a cluster again: #377 |
Another big step: #611 |
superseded by #708 |
fleet does not deal with a lack of etcd availability correctly. This can cause the cluster to act in odd ways, even unscheduling all jobs in the cluster.