launcher stops reporting to the server after a "Unavailable" grpc error #134

groob · 2017-09-13T18:14:03Z

User reports that once the Unavailable desc = transport is closing error happens, the launcher stops communicating with the server.

There could be several issues (including server side) but it's important to first eliminate the launcher as a probable cause.
We need to verify if the error causes the grpc connection to be closed completely (not just the request failing) and either re-dial or fatal.

log for reference:

I0912 21:06:10.245666 253878272 distributed.cpp:138] Executing distributed query: kolide:populate:storage: SELECT b.label, m.device_alias, m.device, m.blocks, m.blocks_available, m.blocks_size, m.path, m.type as filesystem, d.encrypted, d.type as encryption_type, d.uid as encryption_uid, d.user_uuid as encryption_user_uuid FROM mounts m LEFT OUTER JOIN block_devices b ON m.device_alias = b.name LEFT OUTER JOIN disk_encryption d ON m.device_alias = d.name;
I0912 21:06:10.304114 253878272 distributed.cpp:138] Executing distributed query: kolide:populate:uptime: SELECT total_seconds FROM uptime;
I0912 21:06:10.305757 253878272 distributed.cpp:138] Executing distributed query: kolide:populate:version: SELECT name, major, minor, patch, build, version, platform, platform_like, codename FROM os_version;
I0912 21:06:10.307452 253878272 distributed.cpp:138] Executing distributed query: kolide:populate:wifi_networks: SELECT * FROM wifi_survey;
{"caller":"level.go:63","err":"sending status logs: writing buffered logs: writing logs: transport error sending logs: rpc error: code = Unavailable desc = transport is closing","level":"info","ts":"2017-09-12T22:06:31.537321819Z"}
{"caller":"level.go:63","err":"sending status logs: writing buffered logs: writing logs: transport error sending logs: rpc error: code = Unavailable desc = transport is closing","level":"info","ts":"2017-09-12T22:25:37.708667224Z"}

The text was updated successfully, but these errors were encountered:

groob · 2017-09-26T14:12:55Z

launcher-logs.zip

@zwass attached some logs from @FritzX6 mac this morning which have debug enabled. The relevant file is launcher-stderr.log
After noticing the launcher still running, but not seeing an active connection to the server, I killed it with SIGABRT, which generates a goroutine dump.

It looks like the launcher was failing to connect to the server around 9:35-36am this morning, but then successfully connected and got a list of distributed queries, which it executed. Then the launcher stops logging anything until about 9:55, when I killed the process.

Edit:

this looks suspicious to me given the interval beween ~9:37-9:55

goroutine 15 [IO wait, 19 minutes]:
net.runtime_pollWait(0x1b29188, 0x72, 0x0)
	/usr/local/go/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xc4200588b8, 0x72, 0x0, 0xc420c4fde0)
	/usr/local/go/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitRead(0xc4200588b8, 0xffffffffffffffff, 0x0)
	/usr/local/go/src/net/fd_poll_runtime.go:80 +0x34
net.(*netFD).accept(0xc420058850, 0x0, 0x1828b20, 0xc420c4fde0)
	/usr/local/go/src/net/fd_unix.go:430 +0x1e5
net.(*UnixListener).accept(0xc42029cde0, 0x12d6909, 0xc4204a0340, 0xc420339590)
	/usr/local/go/src/net/unixsock_posix.go:162 +0x32
net.(*UnixListener).Accept(0xc42029cde0, 0x103318b, 0xc42003ae80, 0x12d618b, 0x1052630)
	/usr/local/go/src/net/unixsock.go:237 +0x49
github.com/kolide/launcher/vendor/git.apache.org/thrift.git/lib/go/thrift.(*TServerSocket).Accept(0xc420280140, 0x15c33e8, 0xc420390800, 0x1832880, 0xc420339590)
	/Users/marpaia/go/src/github.com/kolide/launcher/vendor/git.apache.org/thrift.git/lib/go/thrift/server_socket.go:83 +0x81
github.com/kolide/launcher/vendor/git.apache.org/thrift.git/lib/go/thrift.(*TSimpleServer).AcceptLoop(0xc420390800, 0x0, 0x0)
	/Users/marpaia/go/src/github.com/kolide/launcher/vendor/git.apache.org/thrift.git/lib/go/thrift/simple_server.go:130 +0x38
github.com/kolide/launcher/vendor/git.apache.org/thrift.git/lib/go/thrift.(*TSimpleServer).Serve(0xc420390800, 0xc42003af60, 0x0)
	/Users/marpaia/go/src/github.com/kolide/launcher/vendor/git.apache.org/thrift.git/lib/go/thrift/simple_server.go:156 +0x5c
github.com/kolide/launcher/vendor/github.com/kolide/osquery-go.(*ExtensionManagerServer).Start(0xc420058460, 0x0, 0x0)
	/Users/marpaia/go/src/github.com/kolide/launcher/vendor/github.com/kolide/osquery-go/server.go:173 +0x82
github.com/kolide/launcher/osquery.launchOsqueryInstance.func6(0xc420133920, 0xc420070a80)
	/Users/marpaia/go/src/github.com/kolide/launcher/osquery/runtime.go:428 +0x39
created by github.com/kolide/launcher/osquery.launchOsqueryInstance
	/Users/marpaia/go/src/github.com/kolide/launcher/osquery/runtime.go:431 +0x760

- Correctly detect when error channel is closed (potential fix for #134). Previously the logic was inverted for whether the channel was closed, so recovery was not initiated. Unit test TestOsqueryDies repros the suspected issue. - Allow logger to be set properly. - Add logging around recovery scenarios. - Check communication with both osquery and extension server in health check (previously only the extension server was checked). - Add healthcheck on interval that causes recovery on failure (Closes #141).

- Correctly detect when error channel is closed (potential fix for #134). Previously the logic was inverted for whether the channel was closed, so recovery was not initiated. Unit test TestOsqueryDies repros the suspected issue. - Allow logger to be set properly. - Add logging around recovery scenarios. - Check communication with both osquery and extension server in health check (previously only the extension server was checked). - Add healthcheck on interval that causes recovery on failure (Closes #141). - Do not set cmd output to ioutil.Discard. Causes a bug with cmd.Wait (see golang/go#20730)

zwass · 2017-10-24T15:48:15Z

I believe this was fixed with #176. Closing it, but we can reopen if there are further reports.

groob added the Bug label Sep 13, 2017

zwass self-assigned this Sep 13, 2017

marpaia assigned marpaia and unassigned zwass Sep 14, 2017

marpaia removed their assignment Sep 24, 2017

groob mentioned this issue Sep 26, 2017

launcher grpc methods don't time out #159

Closed

zwass mentioned this issue Oct 7, 2017

Bugfixes and refactoring in osquery runtime #169

Closed

zwass closed this as completed Oct 24, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

launcher stops reporting to the server after a "Unavailable" grpc error #134

launcher stops reporting to the server after a "Unavailable" grpc error #134

groob commented Sep 13, 2017

groob commented Sep 26, 2017 •

edited

Loading

zwass commented Oct 24, 2017

launcher stops reporting to the server after a "Unavailable" grpc error #134

launcher stops reporting to the server after a "Unavailable" grpc error #134

Comments

groob commented Sep 13, 2017

groob commented Sep 26, 2017 • edited Loading

zwass commented Oct 24, 2017

groob commented Sep 26, 2017 •

edited

Loading