This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

Frequent Fleet Panics (from the go-etcd client) #437

Closed
edpaget opened this issue May 12, 2014 · 6 comments

Comments

@edpaget
Contributor

edpaget commented May 12, 2014

Hi all, I'm trying out CoreOS using your Vagrant setup, but I'm seeing frequent fleet crashes. Here's a small snippet from fleet's logs; let me know if you'd like the entire thing.

May 12 04:29:07 core-03 fleet[8624]: I0512 04:29:07.062556 08624 engine.go:85] Published JobOffer(zookeeper@2.service)
May 12 04:29:07 core-03 fleet[8624]: I0512 04:29:07.206584 08624 engine.go:115] Scheduled Job(zookeeper@2.service) to Machine(23d57bd8-eeb3-4e59-9b26-ca52f8bacc9f)
May 12 04:31:17 core-03 fleet[8624]: panic: runtime error: invalid memory address or nil pointer dereference
May 12 04:31:17 core-03 fleet[8624]: [signal 0xb code=0x1 addr=0x0 pc=0x5765e0]
May 12 04:31:17 core-03 fleet[8624]: goroutine 2482 [running]:
May 12 04:31:17 core-03 fleet[8624]: runtime.panic(0x6f6100, 0xb1d228)
May 12 04:31:17 core-03 fleet[8624]: /usr/lib/go/src/pkg/runtime/panic.c:266 +0xb6
May 12 04:31:17 core-03 fleet[8624]: net/http.(*Client).Do(0xc210072fc0, 0x0, 0xc2102aa200, 0x0, 0x0)
May 12 04:31:17 core-03 fleet[8624]: /usr/lib/go/src/pkg/net/http/client.go:128 +0x30
May 12 04:31:17 core-03 fleet[8624]: github.com/coreos/fleet/third_party/github.com/coreos/go-etcd/etcd.(*Client).sendRequest(0xc21005bc00, 0x7511a0, 0x3, 0xc210122230, 0x62, ...)
May 12 04:31:17 core-03 fleet[8624]: /build/amd64-usr/tmp/portage/app-admin/fleet-0.2.0/work/fleet-0.2.0/gopath/src/github.com/coreos/fleet/third_party/github.com/coreos/go-etcd/etcd/requests.go:193 +0xabb
May 12 04:31:17 core-03 fleet[8624]: github.com/coreos/fleet/third_party/github.com/coreos/go-etcd/etcd.(*Client).getCancelable(0xc21005bc00, 0xc2103d9e80, 0x32, 0xc21029ede0, 0x0, ...)
May 12 04:31:17 core-03 fleet[8624]: /build/amd64-usr/tmp/portage/app-admin/fleet-0.2.0/work/fleet-0.2.0/gopath/src/github.com/coreos/fleet/third_party/github.com/coreos/go-etcd/etcd/requests.go:38 +0x2c5

For me it seems to happen intermittently and I can't point to one action on my part that causes it to go down.

I'm using Fleet 0.2.0, since that was the version on the Vagrant box. I know Fleet 0.3.0 has been released, but I was unsure whether there was anything I needed to do to update to it.

@bcwaldon
Contributor

@edpaget Nothing jumps out at me as suspect. Do you tend to be actively interacting with the cluster when these panics happen? Can you correlate changes in etcd cluster membership with these panics?

@edpaget
Contributor Author

edpaget commented May 12, 2014

@bcwaldon In all cases my last interaction with the cluster was to try to start a job.

I just tried the whole setup again, running fleetctl list-machines occasionally. It looks like machines drop out of the cluster, as listed by fleetctl, right after I start a job. However, if I log into a cluster member and ping another machine, I get a response, and running curl -L http://127.0.0.1:4001/v2/machines from one of the cluster members returns a list of all their IP addresses.

@edpaget
Contributor Author

edpaget commented May 12, 2014

I think I figured it out.

I was trying to run the following two services:

zookeeper.service

[Unit]
Description=Zookeeper
After=docker.service
Requires=docker.service

[Service]
EnvironmentFile=/etc/environment
ExecStart=/usr/bin/docker run --name zk-%i --rm -e HOST_IP=${COREOS_PUBLIC_IPV4} -p 2181:2181 -p 2888:2888 -p 3888:3888 edpaget/zookeeper:3.4.6-coreos -i %i
ExecStop=/usr/bin/docker kill zk-%i

[X-Fleet]
X-Conflicts=zookeeper@*.service

zookeeper-discovery.service

[Unit]
Description=Announces Zookeeper
BindsTo=zookeeper@%i.service

[Service]
ExecStart=/usr/bin/etcdctl set /zk/%i ${COREOS_PUBLIC_IPV4}
ExecStop=/usr/bin/etcdctl rm /zk/%i

[X-Fleet]
X-ConditionMachineOf=zookeeper@%i.service

Apparently X-ConditionMachineOf doesn't yet support %i. Changing it to the explicit X-ConditionMachineOf=zookeeper@1.service works and doesn't cause fleet to crash. I didn't see that mentioned in the docs, but it makes sense that that's the case for now.

@bcwaldon
Contributor

@edpaget X-ConditionMachineOf is interpreted before systemd has a chance to render that value into what you expect, so that makes sense. It's something we need to address as part of #303, though.

I would not expect this to make fleet panic, though. That's a major bug if so. I'll look into it.

@bcwaldon
Contributor

And it most definitely is reproducible: it happened on the first try. Filed a bug here: #438

edpaget pushed a commit to edpaget/fleet that referenced this issue May 12, 2014
Systemd specifiers aren't yet supported in [X-Fleet] sections
until coreos#303 is resolved. This notes the issues encountered in coreos#437 and coreos#438.
@edpaget
Contributor Author

edpaget commented May 12, 2014

I submitted a small note for the documentation. Thanks for troubleshooting!

@edpaget edpaget closed this as completed May 12, 2014