-
Notifications
You must be signed in to change notification settings - Fork 493
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bug: panic when a vote verification task added while shutting down the agreement service #5341
base: master
Are you sure you want to change the base?
bug: panic when a vote verification task added while shutting down the agreement service #5341
Conversation
So the panic is here go-algorand/util/execpool/pool.go Lines 161 to 163 in 9fc6158
And the channel close is here: go-algorand/agreement/asyncVoteVerifier.go Lines 179 to 188 in 9fc6158
Shouldn't the backlogExecPool.Shutdown() wait until the underlying pool worker shuts down before returning to AsyncVoteVerifier.Quit() and letting it close(avv.execpoolOut)? Oh I see is it because this pool is passed into MakeAsyncVoteVerifier and created in the node implementation, so it is not the owner...
Lines 418 to 443 in 9fc6158
|
You are making all the correct observations. I need to clarify one more thing to convince you that the problem is in the node. The services rely on the exec-pool to process the tasks. The services are well protected against the exec-pool shutting down prematurely. They cannot function, but report an error. This is what happens when So, we have a problem when we shut the exec-pool before the service, but also have a problem when we shut down the service before the exec-pool. The exec-pool also depends on the services to return the execution results to. The code is designed to never have an execution task in the exec-pool after the service shuts down. Thus, breaking the dependency of the exec-pool to the service after the service shuts down. But I demonstrated with this test that this design has a flaw in it, which leaves the possibility of a task escaping the service shut down to the exec-pool, and make the return to the service after the service shuts down, and before the exec-pool shuts down. Do you think this explanation is valid? |
The agreement service shuts down when the node is shutting down, or when the node is asked to catchup from a catchpoint.
This test demonstrates that it is possible to have a vote verification task sneak through the ctx just before it is canceled, and the output channel close just before the wg is incremented, resulting in pushing to a closed channel and getting a panic.
This is not a critical bug, since the node will crash either when shutting down, or when is to catchup from a catchpoint.
However, it will result in a panic, and bad things can happen...
Analysis of the verification process
Understanding the problem for possible solutions: Mutex protection for atomic ctx check and wg increment
Consider adding a mutex to make sure every submitted vote to
verifyVote
either increments thewg
or notifies the caller about the verifier cancellation.Here I present the verifyVote use cases, and analyze the impact of skipping verification tasks from returning the results to the out chan.
Overall, there are 3 main use cases:
Case of bundle.go
The code in general does not support the case of AsyncVoterVerifier.verifyVote not returning the result from the response chan:
In
unauthenticatedBundle verifyAsync
the response fromavv.verifyVote
is not checked (err is returned only when the exec-pool is shut, which never happens in production). Moreover,verifyAsync
does not return an error when the verifier ctx is canceled, and silently ignores the verification task.When
verifyVote
is called, there is an expectation to get the response from the out chan.Two call sites are described below:
Call site A
When used from agreement/bundle.go
return b.verifyAsync(ctx, l, avv)()
Certificate Authenticate
callsverify
withcontext.Background()
, which means, the verification will never get canceled. However, the verifier may get canceled in case (1), but not (2) below.This is important because
Authenticate
callsb.verifyAsync(ctx, l, avv)()
, which will never return if the verifier does not return a response for all the verification votes. When the verifier quits, not all responses will be returned.fetchedCert.Authenticate
herethe verifier used is the same as the agreement service verifier, which shuts down when the agreement shuts down (at node shutdown or catchpoint catchup).
2. In catchup/service.go
fetchAndWrite
The verifier (
node.catchupBlockAuth
) will only quit when the node is shutting down.Call site B
In this case, the verifier will quit only after the
verifyAsync
is done.Uses the agreement service verifier, which quits when the agreement service stops.
TODO: understand how the
bundlereq.ctx
is canceled. Not crucial, since the verifier is blocked from shutting down beforefn
call returns.It is called from the
poolCryptoVerifier) bundleWaitWorker
loop. When the verifier quits, the call will never return. In which case, thepoolCryptoVerifier) Quit
will forever wait forc.wg.Wait()
.Fortunately, when the agreement service is canceled, the verifier will quit after
c.wg.Wait()
.Case of cryptoVerifier.go
Here, the verifier used is the agreement service verifier, which is shut down when the agreement service shuts down.
The consumer of the results from the output channel will also terminate when the agreement service terminates. This use case is safe to terminate.
Case of pseudonode.go
The verifier used is the agreement service verifier.
pseudonodeVotesTask use
When called from
pseudonodeVotesTask execute
, it callsverifyVote
to queue the votes, and then expect to get the results for all submitted votes. It does not expect the verifier to quit, it will hang waiting for results from the output chan when it happens.The code assumes that when
verifyVote
cannot deliver the result, it will return an error, and accounts for that case only.Fortunately, the assumption is backed by the
Service demuxLoop
termination order, which first makes sure thatasyncPseudonode
terminated before quitting the vote verifier.