Frequent Fleet Panics (from the go-etcd client) #437

edpaget · 2014-05-12T17:02:05Z

Hi all, I'm trying out CoreOS using your Vagrant setup, but I seem to be having frequent issues with fleet crashing. Here's a small snippet from fleet's logs; let me know if you'd like the entire thing.

May 12 04:29:07 core-03 fleet[8624]: I0512 04:29:07.062556 08624 engine.go:85] Published JobOffer(zookeeper@2.service)
May 12 04:29:07 core-03 fleet[8624]: I0512 04:29:07.206584 08624 engine.go:115] Scheduled Job(zookeeper@2.service) to Machine(23d57bd8-eeb3-4e59-9b26-ca52f8bacc9f)
May 12 04:31:17 core-03 fleet[8624]: panic: runtime error: invalid memory address or nil pointer dereference
May 12 04:31:17 core-03 fleet[8624]: [signal 0xb code=0x1 addr=0x0 pc=0x5765e0]
May 12 04:31:17 core-03 fleet[8624]: goroutine 2482 [running]:
May 12 04:31:17 core-03 fleet[8624]: runtime.panic(0x6f6100, 0xb1d228)
May 12 04:31:17 core-03 fleet[8624]: /usr/lib/go/src/pkg/runtime/panic.c:266 +0xb6
May 12 04:31:17 core-03 fleet[8624]: net/http.(*Client).Do(0xc210072fc0, 0x0, 0xc2102aa200, 0x0, 0x0)
May 12 04:31:17 core-03 fleet[8624]: /usr/lib/go/src/pkg/net/http/client.go:128 +0x30
May 12 04:31:17 core-03 fleet[8624]: github.com/coreos/fleet/third_party/github.com/coreos/go-etcd/etcd.(*Client).sendRequest(0xc21005bc00, 0x7511a0, 0x3, 0xc210122230, 0x62, ...)
May 12 04:31:17 core-03 fleet[8624]: /build/amd64-usr/tmp/portage/app-admin/fleet-0.2.0/work/fleet-0.2.0/gopath/src/github.com/coreos/fleet/third_party/github.com/coreos/go-etcd/etcd/requests.go:193 +0xabb
May 12 04:31:17 core-03 fleet[8624]: github.com/coreos/fleet/third_party/github.com/coreos/go-etcd/etcd.(*Client).getCancelable(0xc21005bc00, 0xc2103d9e80, 0x32, 0xc21029ede0, 0x0, ...)
May 12 04:31:17 core-03 fleet[8624]: /build/amd64-usr/tmp/portage/app-admin/fleet-0.2.0/work/fleet-0.2.0/gopath/src/github.com/coreos/fleet/third_party/github.com/coreos/go-etcd/etcd/requests.go:38 +0x2c5

For me it seems to happen intermittently and I can't point to one action on my part that causes it to go down.

I'm using Fleet 0.2.0, since that was the version on the Vagrant box. I know Fleet 0.3.0 has been released, but I was unsure whether there was anything I needed to do to update to it.

bcwaldon · 2014-05-12T17:17:12Z

@edpaget Nothing jumps out at me as suspect. Do you tend to be actively interacting with the cluster when these panics happen? Can you correlate changes in etcd cluster membership with these panics?

edpaget · 2014-05-12T17:50:51Z

@bcwaldon In all cases my last interaction with the cluster was to try to start a job.

I just tried the whole setup again, running fleetctl list-machines occasionally. It looks like machines drop out of the cluster, as listed by fleetctl, right after I start a job. However, if I log into a cluster member and ping another machine I get a response, and running curl -L http://127.0.0.1:4001/v2/machines from one of the cluster members returns a list of all their IP addresses.

edpaget · 2014-05-12T18:08:50Z

I think I figured it out.

I was trying to run the following two services:

zookeeper.service

[Unit]
Description=Zookeeper
After=docker.service
Requires=docker.service

[Service]
EnvironmentFile=/etc/environment
ExecStart=/usr/bin/docker run --name zk-%i --rm -e HOST_IP=${COREOS_PUBLIC_IPV4} -p 2181:2181 -p 2888:2888 -p 3888:3888 edpaget/zookeeper:3.4.6-coreos -i %i
ExecStop=/usr/bin/docker kill zk-%i

[X-Fleet]
X-Conflicts=zookeeper@*.service

zookeeper-discovery.service

[Unit]
Description=Announces Zookeeper
BindsTo=zookeeper@%i.service

[Service]
ExecStart=/usr/bin/etcdctl set /zk/%i ${COREOS_PUBLIC_IPV4}
ExecStop=/usr/bin/etcdctl rm /zk/%i

[X-Fleet]
X-ConditionMachineOf=zookeeper@%i.service

Apparently X-ConditionMachineOf doesn't yet support %i. Changing it to the explicit X-ConditionMachineOf=zookeeper@1.service works and doesn't cause fleet to crash. I didn't see that mentioned in the docs, but it makes sense that it's the case for now.
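For anyone who wants to keep a single template on disk, the workaround above can be scripted by substituting the instance name before the unit is submitted. This is only a minimal sketch: renderInstance is a hypothetical helper, not something fleet or systemd provides, and it handles only %i (no other specifiers, no %% escaping).

```go
package main

import (
	"fmt"
	"strings"
)

// renderInstance replaces every %i specifier in a unit file with a concrete
// instance name. Hypothetical helper for illustration; it is not part of
// fleet or systemd and ignores all other specifiers, including %%.
func renderInstance(unit, instance string) string {
	return strings.ReplaceAll(unit, "%i", instance)
}

func main() {
	// The [X-Fleet] section from zookeeper-discovery.service above.
	section := "[X-Fleet]\nX-ConditionMachineOf=zookeeper@%i.service\n"

	// Render the section for instance 1 before handing it to fleet.
	fmt.Print(renderInstance(section, "1"))
}
```

The rendered text would then be saved as the per-instance unit file, so fleet never sees a literal %i in the [X-Fleet] section.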

bcwaldon · 2014-05-12T18:11:51Z

@edpaget X-ConditionMachineOf is interpreted before systemd has a chance to render that value into what you expect, so that makes sense. It's something we need to address as part of #303, though.

I would not expect this to make fleet panic, though. That's a major bug if so. I'll look into it.
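One possible explanation, which nothing in this thread confirms: in the trace above, the second value printed for net/http.(*Client).Do is 0x0, which is consistent with a nil *http.Request being passed in. Go's http.NewRequest returns a nil request together with an error when the URL contains an invalid percent-escape such as %i, so a caller that ignores that error would hand nil to Do. The sketch below only demonstrates that standard-library behavior. The key path and the buildRequest helper are assumptions for illustration, not go-etcd's actual code.

```go
package main

import (
	"fmt"
	"net/http"
)

// buildRequest builds a GET request for an etcd-style key URL.
// Hypothetical helper: the URL layout is assumed for illustration and is
// not taken from go-etcd.
func buildRequest(key string) (*http.Request, error) {
	return http.NewRequest("GET", "http://127.0.0.1:4001/v2/keys/"+key, nil)
}

func main() {
	for _, key := range []string{"zookeeper@%i.service", "zookeeper@1.service"} {
		req, err := buildRequest(key)
		if err != nil {
			// Checking err here is what keeps a nil request away from
			// (*http.Client).Do, which dereferences it.
			fmt.Printf("%s: request is nil: %v, error: %v\n", key, req == nil, err)
			continue
		}
		fmt.Printf("%s: ok, path %s\n", key, req.URL.Path)
	}
}
```

No network call is made here; the point is only that the error from http.NewRequest has to be checked before the request is used.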

bcwaldon · 2014-05-12T18:19:46Z

And it most definitely is reproducible: it happened on the first try. Filed a bug here: #438

Systemd specifiers aren't yet supported in [X-Fleet] sections until coreos#303 is resolved. This notes the issues encountered in coreos#437 and coreos#438.

edpaget · 2014-05-12T21:16:43Z

I submitted a small note for the documentation. Thanks for troubleshooting!

bcwaldon mentioned this issue May 12, 2014

fleet panics when scheduling unit with % in X-ConditionMachineOf #438

Closed

edpaget mentioned this issue May 12, 2014

docs(unit-files): Add Documentation about Specifiers #442

Merged

edpaget closed this as completed May 12, 2014
