-
Notifications
You must be signed in to change notification settings - Fork 301
fix(agent): Clear UnitState before unscheduling on shutdown #463
Conversation
On graceful shutdown, an Agent must clear each individual Job's UnitState before unscheduling it. If an Agent takes these two actions in the opposite order, it runs the risk of deleting legitimate state from the Registry that was published by a different Agent.
This was evident in the indeterminacy of functional tests exercising unit migrations. |
This brings up the point that it might be worthwhile blocking agents from removing anything other than their own state. |
@jonboulle Absolutely. I started working through a slightly different approach this afternoon, but it did not pan out. The quickest way I see to get there is to track the index of the last published state for each Job in AgentState. We can use that when we call RemoveUnitState in the prevIndex query param. Thoughts? |
I'd like to get this in as a bugfix regardless of the approach we come out of that discussion with, though |
It sounds suspiciously like #452 |
Yeah, |
I think the index tracking is OK - but wouldn't a simpler solution just be to base it on the machineid? (related point: should we be purging unitstate during Initialize() for all jobs that get unscheduled?) |
@jonboulle How would we identify that state is owned by a given machineid? |
UnitState.MachineState.ID, no? |
@jonboulle So we would have to do a GET on the unit state, compare the ID to the local ID, then issue a compareAndDelete with the prevIndex the modifiedIndex of the initial GET. That's the same as tracking the indexes internally, but without the actual tracking part (which I like). Let's do that. |
👍 |
fix(agent): Clear UnitState before unscheduling on shutdown
On graceful shutdown, an Agent must clear each individual Job's UnitState
before unscheduling it. If an Agent takes these two actions in the
opposite order, it runs the risk of deleting legitimate state from the
Registry that was published by a different Agent.