Unable to reschedule job if scheduled to nonexistent target #500
/cc @philips
Seems like the approach to take for now is this:
After an Engine schedules a Job, the target Agent must respond by clearing the TTL from the /.../target key in etcd. If the Agent does not clear the TTL in time, the Engine will respond to the expiration of the key by reoffering the Job. Fix coreos#500
Actually use a nonzero value when an Engine persists a scheduling decision to etcd. Fix coreos#500.
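To make the proposed TTL handshake concrete, here is a minimal sketch in terms of raw etcdctl calls; the 20-second TTL, the literal key, and the machine ID are illustrative assumptions rather than fleet's actual values:

# Engine persists its scheduling decision with a short TTL (20s is hypothetical)
$ etcdctl set /_coreos.com/fleet/job/foo.service/target 915215859c6547c4b7957ad443640767 --ttl 20

# Target Agent acknowledges in time by rewriting the key with no TTL
$ etcdctl set /_coreos.com/fleet/job/foo.service/target 915215859c6547c4b7957ad443640767

# If the Agent never acknowledges, the key expires; an Engine watching it
# sees the expiration and reoffers the Job
$ etcdctl watch /_coreos.com/fleet/job/foo.service/target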
I think I am running into this issue when the machine ID is the same. But when the target machine ID exists and conflicts with a service with
@andyshinn I don't quite follow your description of your problem. Are you saying that a job is scheduled to a machine which it should not be according to an
Yes. This pretty much happens every time I destroy and then start a bunch of units that use

The nodes:

$ fleetctl list-machines
MACHINE      IP             METADATA
91521585...  172.16.32.81   -
d5240e23...  172.16.32.83   -
efe374cc...  172.16.32.82   -

The units currently running:

$ fleetctl list-units | grep dd-agent
dd-agent@1.service  launched  loaded  active  running  Datadog agent  91521585.../172.16.32.81
dd-agent@2.service  launched  loaded  active  running  Datadog agent  efe374cc.../172.16.32.82
dd-agent@3.service  launched  loaded  active  running  Datadog agent  d5240e23.../172.16.32.83

Targets look good:

$ for i in 1 2 3; do etcdctl get /_coreos.com/fleet/job/dd-agent@$i.service/target; done
915215859c6547c4b7957ad443640767
efe374cc4d214642805164b99165e802
d5240e23706c4496bcae266618cc611d

Let's destroy and start:

$ fleetctl destroy dd-agent@1.service dd-agent@2.service dd-agent@3.service
Destroyed Job dd-agent@1.service
Destroyed Job dd-agent@2.service
Destroyed Job dd-agent@3.service

Our unit file looks like:

$ fleetctl cat dd-agent@1.service
[Unit]
Description=Datadog agent
[Service]
EnvironmentFile=/etc/environment
ExecStartPre=/usr/bin/docker pull andyshinn/docker-dd-agent
ExecStartPre=/bin/sh -c "docker inspect dd-agent-%i > /dev/null 2>&1 && docker rm -f dd-agent-%i || true"
ExecStart=/bin/sh -c "docker run --privileged -v /var/run/docker.sock:/var/run/docker.sock --name dd-agent-%i -h $(hostname) -e API_KEY=$DATADOG_API_KEY -v /home/core/datadog:/etc/dd-agent/conf.d andyshinn/docker-dd-agent"
ExecStop=/usr/bin/docker rm -f dd-agent-%i
TimeoutStartSec=5m
[X-Fleet]
X-Conflicts=dd-agent@*.service

Submit and start the local unit files:

$ fleetctl start config/systemd/dd-agent@*.service
Job dd-agent@1.service launched on efe374cc.../172.16.32.82
Job dd-agent@2.service launched on d5240e23.../172.16.32.83
...
<hangs>

Oh noes!

$ for i in 1 2 3; do etcdctl get /_coreos.com/fleet/job/dd-agent@$i.service/target; done
efe374cc4d214642805164b99165e802
d5240e23706c4496bcae266618cc611d
d5240e23706c4496bcae266618cc611d

But if we delete the erroneous target:

$ etcdctl rm /_coreos.com/fleet/job/dd-agent@3.service/target

The

$ fleetctl start config/systemd/dd-agent@*.service
Job dd-agent@1.service launched on efe374cc.../172.16.32.82
Job dd-agent@2.service launched on d5240e23.../172.16.32.83
Job dd-agent@3.service launched on 91521585.../172.16.32.81

💥

$ fleetctl list-units | grep dd-agent
dd-agent@1.service  launched  loaded  active  running  Datadog agent  efe374cc.../172.16.32.82
dd-agent@2.service  launched  loaded  active  running  Datadog agent  d5240e23.../172.16.32.83
dd-agent@3.service  launched  loaded  active  running  Datadog agent  91521585.../172.16.32.81

Could this be a separate bug?
@andyshinn I would suggest you not try to interpret fleet's internal data model. It can be misleading. Can you pull the logs of the machine that seems to have two conflicting units scheduled to it?
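(For anyone reproducing this: one way to grab those logs on a CoreOS host, assuming fleet runs as the systemd unit fleet.service, is roughly the following.)

# Dump the most recent fleet log lines on the affected machine
$ journalctl -u fleet.service --no-pager -n 200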
This should be fixed by #627
👍 |
#627 in v0.5.3. Reopen if problem persists.
When a job is scheduled, a key like /_coreos.com/fleet/job/foo.service/target is written. The value of the key is the ID of the machine to which the job has been scheduled.

There exist situations where a job can be scheduled to a machine that does not exist, the most common being the loss of a machine after it submits a bid for a job but before an engine accepts its bid. If the machine does not come back up, the job will never be started. The job will not be automatically rescheduled, either.

To work around this, a user can call fleetctl unload foo.service; fleetctl start foo.service, but this is clearly not ideal.
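As a rough illustration of the failure mode and the workaround described above (foo.service and the machine ID are placeholders):

# Where does fleet think foo.service is scheduled? The value is a machine ID.
$ etcdctl get /_coreos.com/fleet/job/foo.service/target
915215859c6547c4b7957ad443640767

# If no machine with that ID shows up here, the job is stuck on a dead target
$ fleetctl list-machines

# Workaround: clear the stale schedule and let an engine reschedule the job
$ fleetctl unload foo.service
$ fleetctl start foo.service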