
Faucet stops interacting with the eth client after err="context deadline exceeded" #20793

Closed
maxsam4 opened this issue Mar 21, 2020 · 12 comments · Fixed by #20828

@maxsam4 (Contributor) commented Mar 21, 2020

I'm running a faucet for Goerli over at https://faucet.goerli.mudit.blog/.

There's a severe lack of light-client-serving peers and, as a result, the faucet often loses all its connections. The problem is that it stops retrying: once a context deadline exceeded error happens, the faucet never updates its state again (I've waited for hours after this error). If the light client kept retrying, it would eventually find a peer; when I restart the faucet, it finds a couple of peers within a few seconds.

I'm running a Goerli full node with a 300 LES client limit and a 400 LES CPU limit (on an 8-core server) on the same server as the faucet, to try to have at least one node always serving the faucet, but the connection between my node and the faucet still breaks sometimes.

There are a couple of other people I know who are serving light clients on Goerli, but we're all maxing out on light-client connections. I don't know why there's such huge demand for Goerli light clients. Perhaps it has something to do with the beacon chain.

I'm new to Go, so I don't really know what exactly is going on with the context, but I'm thinking of increasing the context limit a little and panicking when this error happens. systemd will catch the panic and restart the service. That would "fix" the problem, but it's an ugly solution.

Is it possible to have the faucet/client keep looking for peers after the error and resume operations when it finds some?

P.S. If the EF is still running Goerli nodes, can you please raise your LES client limit?

@holiman (Contributor) commented Mar 26, 2020

Could you please supply the geth version number?

@karalabe (Member):

If it happens again, please kill it with kill -SIGABRT <pid>. That will make Go do a stack dump, which might help us see whether something is blocked on a mutex or similar.

@karalabe (Member):

We've just redeployed the Rinkeby faucet to latest master; let's see if it acts up for us.

@maxsam4 (Contributor, Author) commented Mar 26, 2020

Geth
Version: 1.9.13-unstable
Git Commit: 8d7aa9078f8a94c2c10b1d11e04242df0ea91e5b
Architecture: amd64
Protocol Versions: [65 64 63]
Go Version: go1.14
Operating System: linux

I modified the live faucet to panic (maxsam4@87fb4de) when this happens, so that it can be restarted by systemd. Is there an easy way to do the stack dump while panicking?

If not, I can spin up another instance and get the dump from there. I don't want to disturb the live faucet.

(I can also return a systemd watchdog signal and have systemd kill it with SIGABRT)

@holiman (Contributor) commented Mar 26, 2020

A panic should give you a stack dump. You can also use debug.stacks() to obtain one from a live instance.

@karalabe (Member):

@holiman A faucet doesn't have RPC, you can't attach a console to it.

@maxsam4 (Contributor, Author) commented Mar 26, 2020

Mar 26 09:06:34 dedi faucet[19833]: panic: context deadline exceeded
Mar 26 09:06:34 dedi faucet[19833]: goroutine 143 [running]:
Mar 26 09:06:34 dedi faucet[19833]: main.(*faucet).loop.func1(0xc0001bf9e0, 0xc00112cd20)
Mar 26 09:06:34 dedi faucet[19833]:         /home/max/maxsam4/go-ethereum/cmd/faucet/faucet.go:613 +0xe06
Mar 26 09:06:34 dedi faucet[19833]: created by main.(*faucet).loop
Mar 26 09:06:34 dedi faucet[19833]:         /home/max/maxsam4/go-ethereum/cmd/faucet/faucet.go:603 +0x1b8
Mar 26 09:06:34 dedi systemd[1]: faucet.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Mar 26 09:06:34 dedi systemd[1]: faucet.service: Unit entered failed state.
Mar 26 09:06:34 dedi systemd[1]: faucet.service: Failed with result 'exit-code'.
Mar 26 09:06:34 dedi systemd[1]: faucet.service: Service hold-off time over, scheduling restart.
Mar 26 09:06:34 dedi systemd[1]: Stopped Goerli faucet.
Mar 26 09:06:34 dedi systemd[1]: Started Goerli faucet.
Mar 26 09:06:41 dedi faucet[23090]: INFO [03-26|09:06:41.504] Starting peer-to-peer node               instance=Geth/v1.9.13-unstable-87fb4de5-20200322/linux-amd64/go1.13.4

This is what was in syslog from the last time it crashed.

@maxsam4 (Contributor, Author) commented Mar 26, 2020

I'm not sure if this is what you're after. Please let me know if you'd like logs from somewhere else.

@holiman (Contributor) commented Mar 26, 2020

My bad -- a panic shows only the current goroutine's stack; what @karalabe suggested would give you all stacks, which is what we'd need.

@maxsam4 (Contributor, Author) commented Mar 27, 2020

The faucet hasn't hit the context deadline exceeded error since the last logs I shared. Perhaps more people started serving light clients on Goerli?

In any case, if/when it happens again, I'll share the stack dump.

@karalabe (Member):

Managed to reproduce it on Rinkeby (updated our faucet from 1.9.7 to latest). Apparently something broke in the new light-client code; looking into it.

https://gist.github.com/karalabe/3e27109ab98051375aaa7feb136d640b

@karalabe (Member):

This seems interesting:

goroutine 74 [semacquire, 981 minutes]:
sync.runtime_SemacquireMutex(0xc0001d21f4, 0xc000b6f900, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc0001d21f0)
	/usr/local/go/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
github.com/ethereum/go-ethereum/les.(*lightFetcher).peerHasBlock(0xc0001d21e0, 0xc00009ed00, 0x58002b399adc28f2, 0x3d70f057522f8f44, 0x6baaa4df12dc0b37, 0xc9aa260894914d60, 0x5eb613, 0x1, 0xc000b6fb00)
	/go-ethereum/les/fetcher.go:362 +0x435
github.com/ethereum/go-ethereum/les.(*lightFetcher).registerPeer.func1(0x58002b399adc28f2, 0x3d70f057522f8f44, 0x6baaa4df12dc0b37, 0xc9aa260894914d60, 0x5eb613, 0xe1e401, 0xe97560)
	/go-ethereum/les/fetcher.go:228 +0x5d
github.com/ethereum/go-ethereum/les.(*serverPeer).HasBlock(0xc00009ed00, 0x58002b399adc28f2, 0x3d70f057522f8f44, 0x6baaa4df12dc0b37, 0xc9aa260894914d60, 0x5eb613, 0xc0021ee901, 0x1149200)
	/go-ethereum/les/peer.go:547 +0x114
github.com/ethereum/go-ethereum/les.(*TrieRequest).CanSend(0xc00978fdc0, 0xc00009ed00, 0xffffffffffffff00)
	/go-ethereum/les/odr_requests.go:200 +0x6b
github.com/ethereum/go-ethereum/les.(*LesOdr).Retrieve.func2(0x115d7a0, 0xc00009ed00, 0xc000b6fc80)
	/go-ethereum/les/odr.go:114 +0x66
github.com/ethereum/go-ethereum/les.(*retrieveManager).sendReq.func1(0x115d7a0, 0xc00009ed00, 0xc000b6fd68)
	/go-ethereum/les/retrieve.go:146 +0xcc
github.com/ethereum/go-ethereum/les.(*requestDistributor).nextRequest(0xc0000f7810, 0x0, 0x0, 0x0, 0x0)
	/go-ethereum/les/distributor.go:216 +0x4a1
github.com/ethereum/go-ethereum/les.(*requestDistributor).loop(0xc0000f7810)
	/go-ethereum/les/distributor.go:144 +0xdc
created by github.com/ethereum/go-ethereum/les.newRequestDistributor
	/go-ethereum/les/distributor.go:88 +0x149
