Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

launcher stops reporting to the server after a "Unavailable" grpc error #134

Closed
groob opened this issue Sep 13, 2017 · 2 comments
Closed

Comments

@groob
Copy link
Contributor

groob commented Sep 13, 2017

User reports that once the Unavailable desc = transport is closing error happens, the launcher stops communicating with the server.

There could be several issues (including server side) but it's important to first eliminate the launcher as a probable cause.
We need to verify if the error causes the grpc connection to be closed completely (not just the request failing) and either re-dial or fatal.

log for reference:

I0912 21:06:10.245666 253878272 distributed.cpp:138] Executing distributed query: kolide:populate:storage: SELECT b.label, m.device_alias, m.device, m.blocks, m.blocks_available, m.blocks_size, m.path, m.type as filesystem, d.encrypted, d.type as encryption_type, d.uid as encryption_uid, d.user_uuid as encryption_user_uuid FROM mounts m LEFT OUTER JOIN block_devices b ON m.device_alias = b.name LEFT OUTER JOIN disk_encryption d ON m.device_alias = d.name;
I0912 21:06:10.304114 253878272 distributed.cpp:138] Executing distributed query: kolide:populate:uptime: SELECT total_seconds FROM uptime;
I0912 21:06:10.305757 253878272 distributed.cpp:138] Executing distributed query: kolide:populate:version: SELECT name, major, minor, patch, build, version, platform, platform_like, codename FROM os_version;
I0912 21:06:10.307452 253878272 distributed.cpp:138] Executing distributed query: kolide:populate:wifi_networks: SELECT * FROM wifi_survey;
{"caller":"level.go:63","err":"sending status logs: writing buffered logs: writing logs: transport error sending logs: rpc error: code = Unavailable desc = transport is closing","level":"info","ts":"2017-09-12T22:06:31.537321819Z"}
{"caller":"level.go:63","err":"sending status logs: writing buffered logs: writing logs: transport error sending logs: rpc error: code = Unavailable desc = transport is closing","level":"info","ts":"2017-09-12T22:25:37.708667224Z"}
@groob groob added the Bug label Sep 13, 2017
@zwass zwass self-assigned this Sep 13, 2017
@marpaia marpaia assigned marpaia and unassigned zwass Sep 14, 2017
@marpaia marpaia removed their assignment Sep 24, 2017
@groob
Copy link
Contributor Author

groob commented Sep 26, 2017

launcher-logs.zip

@zwass attached some logs from @FritzX6 mac this morning which have debug enabled. The relevant file is launcher-stderr.log
After noticing the launcher still running, but not seeing an active connection to the server, I killed it with SIGABRT, which generates a goroutine dump.

It looks like the launcher was failing to connect to the server around 9:35-36am this morning, but then successfully connected and got a list of distributed queries, which it executed. Then the launcher stops logging anything until about 9:55, when I killed the process.

Edit:

this looks suspicious to me given the interval beween ~9:37-9:55

goroutine 15 [IO wait, 19 minutes]:
net.runtime_pollWait(0x1b29188, 0x72, 0x0)
	/usr/local/go/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xc4200588b8, 0x72, 0x0, 0xc420c4fde0)
	/usr/local/go/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitRead(0xc4200588b8, 0xffffffffffffffff, 0x0)
	/usr/local/go/src/net/fd_poll_runtime.go:80 +0x34
net.(*netFD).accept(0xc420058850, 0x0, 0x1828b20, 0xc420c4fde0)
	/usr/local/go/src/net/fd_unix.go:430 +0x1e5
net.(*UnixListener).accept(0xc42029cde0, 0x12d6909, 0xc4204a0340, 0xc420339590)
	/usr/local/go/src/net/unixsock_posix.go:162 +0x32
net.(*UnixListener).Accept(0xc42029cde0, 0x103318b, 0xc42003ae80, 0x12d618b, 0x1052630)
	/usr/local/go/src/net/unixsock.go:237 +0x49
github.com/kolide/launcher/vendor/git.apache.org/thrift.git/lib/go/thrift.(*TServerSocket).Accept(0xc420280140, 0x15c33e8, 0xc420390800, 0x1832880, 0xc420339590)
	/Users/marpaia/go/src/github.com/kolide/launcher/vendor/git.apache.org/thrift.git/lib/go/thrift/server_socket.go:83 +0x81
github.com/kolide/launcher/vendor/git.apache.org/thrift.git/lib/go/thrift.(*TSimpleServer).AcceptLoop(0xc420390800, 0x0, 0x0)
	/Users/marpaia/go/src/github.com/kolide/launcher/vendor/git.apache.org/thrift.git/lib/go/thrift/simple_server.go:130 +0x38
github.com/kolide/launcher/vendor/git.apache.org/thrift.git/lib/go/thrift.(*TSimpleServer).Serve(0xc420390800, 0xc42003af60, 0x0)
	/Users/marpaia/go/src/github.com/kolide/launcher/vendor/git.apache.org/thrift.git/lib/go/thrift/simple_server.go:156 +0x5c
github.com/kolide/launcher/vendor/github.com/kolide/osquery-go.(*ExtensionManagerServer).Start(0xc420058460, 0x0, 0x0)
	/Users/marpaia/go/src/github.com/kolide/launcher/vendor/github.com/kolide/osquery-go/server.go:173 +0x82
github.com/kolide/launcher/osquery.launchOsqueryInstance.func6(0xc420133920, 0xc420070a80)
	/Users/marpaia/go/src/github.com/kolide/launcher/osquery/runtime.go:428 +0x39
created by github.com/kolide/launcher/osquery.launchOsqueryInstance
	/Users/marpaia/go/src/github.com/kolide/launcher/osquery/runtime.go:431 +0x760

zwass added a commit that referenced this issue Oct 7, 2017
- Correctly detect when error channel is closed (potential fix for #134).
  Previously the logic was inverted for whether the channel was closed, so
  recovery was not initiated. Unit test TestOsqueryDies repros the suspected
  issue.
- Allow logger to be set properly.
- Add logging around recovery scenarios.
- Check communication with both osquery and extension server in health check
  (previously only the extension server was checked).
- Add healthcheck on interval that causes recovery on failure (Closes #141).
zwass added a commit that referenced this issue Oct 7, 2017
- Correctly detect when error channel is closed (potential fix for #134).
  Previously the logic was inverted for whether the channel was closed, so
  recovery was not initiated. Unit test TestOsqueryDies repros the suspected
  issue.
- Allow logger to be set properly.
- Add logging around recovery scenarios.
- Check communication with both osquery and extension server in health check
  (previously only the extension server was checked).
- Add healthcheck on interval that causes recovery on failure (Closes #141).
- Do not set cmd output to ioutil.Discard. Causes a bug with cmd.Wait (see
  golang/go#20730)
zwass added a commit that referenced this issue Oct 12, 2017
- Correctly detect when error channel is closed (potential fix for #134).
  Previously the logic was inverted for whether the channel was closed, so
  recovery was not initiated. Unit test TestOsqueryDies repros the suspected
  issue.
- Allow logger to be set properly.
- Add logging around recovery scenarios.
- Check communication with both osquery and extension server in health check
  (previously only the extension server was checked).
- Add healthcheck on interval that causes recovery on failure (Closes #141).
- Do not set cmd output to ioutil.Discard. Causes a bug with cmd.Wait (see
  golang/go#20730)
@zwass
Copy link
Contributor

zwass commented Oct 24, 2017

I believe this was fixed with #176. Closing it, but we can reopen if there are further reports.

@zwass zwass closed this as completed Oct 24, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants