Unable to reschedule job if scheduled to nonexistent target #500

Closed
bcwaldon opened this issue May 28, 2014 · 9 comments

@bcwaldon
Contributor

When a job is scheduled, a key like /_coreos.com/fleet/job/foo.service/target is written. The value of the key is the ID of the machine to which the job has been scheduled.
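For example, reading the key back directly from etcd returns the scheduled machine's ID (the key name and machine ID below are illustrative):

$ etcdctl get /_coreos.com/fleet/job/foo.service/target
915215859c6547c4b7957ad443640767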

There exist situations where a job can be scheduled to a machine that does not exist, the most common being the loss of a machine after it submits a bid for a job but before an engine accepts its bid. If the machine does not come back up, the job will never be started. The job will not be automatically rescheduled, either.

To work around this, a user can call fleetctl unload foo.service; fleetctl start foo.service, but this is clearly not ideal.

@bcwaldon bcwaldon added this to the v0.5.0 milestone May 28, 2014
@bcwaldon
Contributor Author

/cc @philips

@bcwaldon bcwaldon self-assigned this May 28, 2014
@bcwaldon
Contributor Author

Seems like the approach to take for now is this:

  1. The Engine sets a TTL on the /target key when the scheduling decision has been made
  2. The Agent compareAndSwaps the /target key without a TTL to confirm the decision
  3. If the /target key expires, the Engine reacts by re-offering the relevant Job
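At the etcd level, a rough sketch of those three steps using etcdctl (the TTL value, key path, and machine ID are illustrative; fleet itself does this through its Go etcd client rather than etcdctl):

# 1. Engine records the scheduling decision with a TTL
$ etcdctl set /_coreos.com/fleet/job/foo.service/target 915215859c6547c4b7957ad443640767 --ttl 30

# 2. Agent confirms by rewriting the same value without a TTL, but only if the value is still the one the Engine wrote
$ etcdctl set /_coreos.com/fleet/job/foo.service/target 915215859c6547c4b7957ad443640767 --swap-with-value 915215859c6547c4b7957ad443640767

# 3. Engine watches the key; if it expires before the Agent confirms, the Job should be re-offered
$ etcdctl watch /_coreos.com/fleet/job/foo.service/target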

bcwaldon added a commit to bcwaldon/fleet that referenced this issue May 28, 2014
After an Engine schedules a Job, the target Agent must respond
by clearing the TTL from the /.../target key in etcd. If the
Agent does not clear the TTL in time, the Engine will respond
to the expiration of the key by reoffering the Job.

Fix coreos#500
bcwaldon added a commit to bcwaldon/fleet that referenced this issue May 29, 2014
Actually use a nonzero value when an Engine persists
a scheduling decision to etcd.

Fix coreos#500.
@bcwaldon bcwaldon removed this from the v0.5.0 milestone Jun 17, 2014
@andyshinn

I think I am running into this issue, except in my case the target machine ID is the same for two units: the target machine exists, but the placement conflicts with a service that has X-Fleet constraints. This happened the other day when CoreOS updated from the Alpha channel.

@jonboulle
Contributor

@andyshinn I don't quite follow your description of your problem. Are you saying that a job is scheduled to a machine which it should not be according to an X-Fleet restriction?

@andyshinn

Yes. This pretty much happens every time I destroy and then start a bunch of units that use X-Conflicts. Here is an example in my 3-node cluster:

The nodes:

$ fleetctl list-machines
MACHINE             IP              METADATA
91521585...     172.16.32.81    -
d5240e23...     172.16.32.83    -
efe374cc...     172.16.32.82    -

The units currently running:

$ fleetctl list-units | grep dd-agent
dd-agent@1.service         launched        loaded  active  running Datadog agent           91521585.../172.16.32.81
dd-agent@2.service          launched        loaded  active  running Datadog agent           efe374cc.../172.16.32.82
dd-agent@3.service          launched        loaded  active  running Datadog agent           d5240e23.../172.16.32.83

Targets look good:

$ for i in 1 2 3; do etcdctl get /_coreos.com/fleet/job/dd-agent@$i.service/target; done
915215859c6547c4b7957ad443640767
efe374cc4d214642805164b99165e802
d5240e23706c4496bcae266618cc611d

Let's destroy and start:

$ fleetctl destroy dd-agent@1.service dd-agent@2.service dd-agent@3.service
Destroyed Job dd-agent@1.service
Destroyed Job dd-agent@2.service
Destroyed Job dd-agent@3.service

Our unit file looks like:

$ fleetctl cat dd-agent@1.service
[Unit]
Description=Datadog agent

[Service]
EnvironmentFile=/etc/environment
ExecStartPre=/usr/bin/docker pull andyshinn/docker-dd-agent
ExecStartPre=/bin/sh -c "docker inspect dd-agent-%i > /dev/null 2>&1 && docker rm -f dd-agent-%i || true"
ExecStart=/bin/sh -c "docker run --privileged -v /var/run/docker.sock:/var/run/docker.sock --name dd-agent-%i -h $(hostname) -e API_KEY=$DATADOG_API_KEY -v /home/core/datadog:/etc/dd-agent/conf.d andyshinn/docker-dd-agent"
ExecStop=/usr/bin/docker rm -f dd-agent-%i
TimeoutStartSec=5m

[X-Fleet]
X-Conflicts=dd-agent@*.service

Submit and start the local unit files:

$ fleetctl start config/systemd/dd-agent@*.service
Job dd-agent@1.service launched on efe374cc.../172.16.32.82
Job dd-agent@2.service launched on d5240e23.../172.16.32.83
...
<hangs>

Oh noes!

$ for i in 1 2 3; do etcdctl get /_coreos.com/fleet/job/dd-agent@$i.service/target; done
efe374cc4d214642805164b99165e802
d5240e23706c4496bcae266618cc611d
d5240e23706c4496bcae266618cc611d

But if we delete the erroneous target:

$ etcdctl rm /_coreos.com/fleet/job/dd-agent@3.service/target

The fleetctl start completes:

$ fleetctl start config/systemd/dd-agent@*.service
Job dd-agent@1.service launched on efe374cc.../172.16.32.82
Job dd-agent@2.service launched on d5240e23.../172.16.32.83
Job dd-agent@3.service launched on 91521585.../172.16.32.81

💥

$ fleetctl list-units | grep dd-agent
dd-agent@1.service         launched        loaded  active  running Datadog agent           efe374cc.../172.16.32.82
dd-agent@2.service          launched        loaded  active  running Datadog agent           d5240e23.../172.16.32.83
dd-agent@3.service          launched        loaded  active  running Datadog agent           91521585.../172.16.32.81

Could this be a separate bug?

@bcwaldon
Contributor Author

bcwaldon commented Jul 1, 2014

@andyshinn I would suggest you not try to interpret fleet's internal data model. It can be misleading. Can you pull the logs of the machine that seems to have two conflicting units scheduled to it?
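On a CoreOS host, the fleet logs can typically be pulled with something like this (assuming the systemd unit is named fleet.service):

$ journalctl -u fleet.service --no-pager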

@bcwaldon
Contributor Author

bcwaldon commented Jul 9, 2014

This should be fixed by #627

@bcwaldon bcwaldon added this to the v0.5.3 milestone Jul 9, 2014
@andyshinn

👍

@bcwaldon
Contributor Author

bcwaldon commented Jul 9, 2014

#627 is in v0.5.3. Reopen if the problem persists.

@bcwaldon bcwaldon closed this as completed Jul 9, 2014