bug: panic when a vote verification task is added while shutting down the agreement service #5341

Open · wants to merge 7 commits into master
Conversation

@algonautshant (Contributor) commented Apr 28, 2023

The agreement service shuts down when the node is shutting down, or when the node is asked to catch up from a catchpoint.
This test demonstrates that a vote verification task can sneak past the ctx check just before the ctx is canceled, while the output channel is closed just before the wg is incremented, resulting in a push to a closed channel and a panic.

This is not a critical bug, since it can only happen while the node is shutting down or is about to catch up from a catchpoint.
However, it does produce a panic, and bad things can happen...
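
To make the interleaving concrete, here is a minimal, self-contained Go sketch of the same check-then-act race. The verifier type, field names, and sleep-based timing below are illustrative assumptions, not the actual AsyncVoteVerifier code:

package main

import (
	"context"
	"sync"
	"time"
)

// verifier is an illustrative stand-in for the vote verifier: a ctx to signal
// shutdown, a wg counting in-flight tasks, and an out channel for results.
type verifier struct {
	ctx    context.Context
	cancel context.CancelFunc
	wg     sync.WaitGroup
	out    chan int
}

// submit mirrors the check-then-act pattern: the ctx check and the wg.Add
// are not atomic, so Quit can run to completion in between them.
func (v *verifier) submit(task int) {
	select {
	case <-v.ctx.Done():
		return // verifier is quitting, drop the task
	default:
	}
	time.Sleep(10 * time.Millisecond) // widen the race window so Quit wins
	v.wg.Add(1)                       // incremented after Quit already passed wg.Wait()
	go func() {
		defer v.wg.Done()
		v.out <- task // send on a closed channel -> panic
	}()
}

// Quit mirrors the shutdown path: cancel, wait for counted tasks, close out.
func (v *verifier) Quit() {
	v.cancel()
	v.wg.Wait()  // sees zero tasks: submit has not reached wg.Add yet
	close(v.out) // the late task will push to this closed channel
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	v := &verifier{ctx: ctx, cancel: cancel, out: make(chan int, 1)}
	go v.submit(1)                    // passes the ctx check first
	time.Sleep(time.Millisecond)      // let submit get past the select
	v.Quit()                          // then shutdown closes the channel
	time.Sleep(20 * time.Millisecond) // the late task now panics
}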

Analysis of the verification process

Understanding the problem for possible solutions: mutex protection for an atomic ctx check and wg increment
Consider adding a mutex so that every vote submitted to verifyVote either increments the wg or notifies the caller that the verifier has been canceled.
Below I present the verifyVote use cases and analyze the impact of verification tasks that are skipped and never return a result to the out chan.
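
A hedged sketch of the mutex idea, as revised submit and Quit methods building on the illustrative verifier type from the sketch above (the mu field, the error value, and the method shapes are assumptions, not a patch against the real AsyncVoteVerifier): the ctx check and the wg.Add happen under the same mutex that Quit takes before cancellation, so every submitted vote is either counted before wg.Wait or rejected with an explicit error.

// Assumes the verifier struct above gains a `mu sync.Mutex` field,
// and that "errors" is imported in addition to the imports above.
var errVerifierQuit = errors.New("vote verifier is shutting down")

func (v *verifier) submit(task int) error {
	v.mu.Lock()
	select {
	case <-v.ctx.Done():
		v.mu.Unlock()
		return errVerifierQuit // the caller learns the task was not enqueued
	default:
	}
	v.wg.Add(1) // counted before Quit can possibly reach wg.Wait()
	v.mu.Unlock()

	go func() {
		defer v.wg.Done()
		v.out <- task
	}()
	return nil
}

func (v *verifier) Quit() {
	v.mu.Lock()
	v.cancel() // after this point no submit can pass the ctx check
	v.mu.Unlock()
	v.wg.Wait()  // every counted task has sent its result before calling Done
	close(v.out) // safe: no further sends can happen
}

Whether this protection belongs in verifyVote itself or in the surrounding callers is what the use-case analysis below tries to answer.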

Overall, there are 3 main use cases:

agreement/bundle.go
206: 		avv.verifyVote(ctx, l, uv, uint64(i), message{}, results) //nolint:errcheck // verifyVote will call EnqueueBacklog, which blocks until the verify task is queued, or returns an error when ctx.Done(), which we are already checking
agreement/cryptoVerifier.go
217: 			err := c.voteVerifier.verifyVote(votereq.ctx, c.ledger, uv, votereq.TaskIndex, votereq.message, c.votes.out)
agreement/pseudonode.go
390: 		err := verifier.verifyVote(context.TODO(), t.node.ledger, uv, uint64(i), msg, results)
518: 		err := verifier.verifyVote(context.TODO(), t.node.ledger, uv, uint64(i), msg, results)

Case of bundle.go

The code in general does not support the case of AsyncVoteVerifier.verifyVote not returning a result on the response chan:
In unauthenticatedBundle.verifyAsync the error returned by avv.verifyVote is not checked (it is non-nil only when the exec-pool is shut down, which never happens in production). Moreover, verifyAsync does not return an error when the verifier ctx is canceled; it silently drops the verification task.
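
To make the resulting hang concrete, here is a hedged sketch of the submit-then-drain pattern that verifyAsync uses (function names and types are illustrative, not the actual bundle.go code): if the verifier silently drops a task, the returned closure blocks forever on the results channel.

// Sketch of the pattern: enqueue one verification per vote, then return a
// closure that blocks until every result has been read back.
func verifyAllAsync(votes []int, submit func(task int, out chan<- error)) func() error {
	results := make(chan error, len(votes))
	for _, vote := range votes {
		submit(vote, results) // any error from submit is ignored, mirroring bundle.go
	}
	return func() error {
		for range votes {
			// If the verifier dropped a task without writing to results,
			// this receive blocks forever and the closure never returns.
			if err := <-results; err != nil {
				return err
			}
		}
		return nil
	}
}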

When verifyVote is called, the caller expects to get the response from the out chan.
Two call sites are described below:

Call site A

agreement/bundle.go
142: 	return b.verifyAsync(ctx, l, avv)()

Certificate.Authenticate calls verify with context.Background(), which means the verification itself will never be canceled. However, the verifier may be canceled in case (1) below, but not in case (2).
This is important because Authenticate calls b.verifyAsync(ctx, l, avv)(), which will never return unless the verifier returns a response for every submitted vote, and when the verifier quits, not all responses are returned.

  1. When calling fetchedCert.Authenticate here
catchup/service.go
// As a failsafe, if the cert we fetched is valid but for the wrong block, panic as loudly as possible
if cert.Round == fetchedCert.Round &&
	cert.Proposal.BlockDigest != fetchedCert.Proposal.BlockDigest &&
	fetchedCert.Authenticate(*block, s.ledger, verifier) == nil {

the verifier used is the same as the agreement service verifier, which shuts down when the agreement service shuts down (at node shutdown or catchpoint catchup).
2. In catchup/service.go fetchAndWrite

			err = s.auth.Authenticate(block, cert)

The verifier (node.catchupBlockAuth) will only quit when the node is shutting down.

Call site B

In this case, the verifier can quit only after the verifyAsync call is done.

agreement/cryptoVerifier.go
236: 			fn := bundlereq.message.UnauthenticatedBundle.verifyAsync(bundlereq.ctx, c.ledger, c.voteVerifier)

Uses the agreement service verifier, which quits when the agreement service stops.
TODO: understand how bundlereq.ctx is canceled. Not crucial, since the verifier is blocked from shutting down until the fn call returns.
It is called from the poolCryptoVerifier.bundleWaitWorker loop. If the verifier quits, the call will never return, in which case poolCryptoVerifier.Quit will wait forever on c.wg.Wait().

Fortunately, when the agreement service is canceled, the verifier quits only after c.wg.Wait() returns.

Case of cryptoVerifier.go

Here, the verifier used is the agreement service verifier, which is shut down when the agreement service shuts down.
The consumer of the results from the output channel will also terminate when the agreement service terminates. This use case terminates safely.

Case of pseudonode.go

The verifier used is the agreement service verifier.

pseudonodeVotesTask use

When called from pseudonodeVotesTask.execute, the task calls verifyVote to queue the votes and then expects to get the results for all submitted votes. It does not expect the verifier to quit; if that happens, it will hang waiting for results from the output chan.
The code assumes that when verifyVote cannot deliver the result, it returns an error, and it accounts for that case only (a sketch of this caller pattern follows the snippet below).
Fortunately, the assumption is backed by the Service demuxLoop termination order, which makes sure that asyncPseudonode has terminated before quitting the vote verifier:

	s.loopback.Quit()
	s.voteVerifier.Quit()
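
For illustration, a hedged sketch of the caller pattern described above (names and types are illustrative, not the actual pseudonodeVotesTask code): the task only waits for as many results as verifyVote accepted, so a vote that is neither rejected with an error nor answered on the output chan would make the final loop hang.

// Sketch: count only the votes that the verifier accepted, then drain
// exactly that many results from the output channel.
func collectVerifiedVotes(votes []int, submit func(task int, out chan<- error) error) []error {
	results := make(chan error, len(votes))
	accepted := 0
	for _, vote := range votes {
		if err := submit(vote, results); err != nil {
			continue // the verifier refused the vote; do not wait for its result
		}
		accepted++
	}
	out := make([]error, 0, accepted)
	for i := 0; i < accepted; i++ {
		out = append(out, <-results) // hangs if an accepted vote never reports back
	}
	return out
}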

@cce (Contributor) commented Apr 28, 2023

So the panic is here:

if t.out != nil {
	t.out <- res
}

And the channel close is here:

// wait until all the tasks we've given the pool are done.
avv.wg.Wait()
if avv.backlogExecPool.GetOwner() == avv {
	avv.backlogExecPool.Shutdown()
}
// since no more tasks are coming, we can safely close the output pool channel.
close(avv.execpoolOut)
// wait until the worker function exists.
<-avv.workerWaitCh

Shouldn't the backlogExecPool.Shutdown() wait until the underlying pool worker shuts down before returning to AsyncVoteVerifier.Quit() and letting it close(avv.execpoolOut)? Oh, I see: is it because this pool is passed into MakeAsyncVoteVerifier and created in the node implementation, so it is not the owner...

But it seems like the node implementation is correctly shutting down the backlog pools AFTER all the services are shut down. I see: since the node shuts down the backlog pools AFTER the services, a task might still try to write to avv.execpoolOut after agreement has shut down and already closed the channel.

go-algorand/node/node.go

Lines 418 to 443 in 9fc6158

func (node *AlgorandFullNode) Stop() {
	node.mu.Lock()
	defer func() {
		node.mu.Unlock()
		node.waitMonitoringRoutines()
	}()
	node.net.ClearHandlers()
	if !node.config.DisableNetworking {
		node.net.Stop()
	}
	if node.catchpointCatchupService != nil {
		node.catchpointCatchupService.Stop()
	} else {
		node.stateProofWorker.Stop()
		node.txHandler.Stop()
		node.agreementService.Shutdown()
		node.catchupService.Stop()
		node.txPoolSyncerService.Stop()
		node.blockService.Stop()
		node.ledgerService.Stop()
	}
	node.catchupBlockAuth.Quit()
	node.highPriorityCryptoVerificationPool.Shutdown()
	node.lowPriorityCryptoVerificationPool.Shutdown()
	node.cryptoPool.Shutdown()

So this seems like not a problem for the node, but just a problem for a test that passes in a pool implementation and doesn't shut down the services before the worker pools?

@algonautshant (Contributor, Author) commented:
> But it seems like the node implementation is correctly shutting down the backlog pools AFTER all the services are shut down.
>
> So this seems like not a problem for the node, but just a problem for a test that passes in a pool implementation and doesn't shut down the services before the worker pools?

You are making all the correct observations. I need to clarify one more thing to convince you that the problem is in the node.

The services rely on the exec-pool to process the tasks. The services are well protected against the exec-pool shutting down prematurely: they cannot function, but they report an error. This is what happens when EnqueueBacklog fails.

So we have a problem when we shut down the exec-pool before the service, but we also have a problem when we shut down the service before the exec-pool.

The exec-pool also depends on the services as the destination for the execution results. The code is designed so that no execution task remains in the exec-pool after the service shuts down, thus breaking the dependency of the exec-pool on the service once the service has shut down. But I demonstrated with this test that this design has a flaw: a task can escape the service shutdown into the exec-pool and return its result to the service after the service has shut down but before the exec-pool shuts down.

Do you think this explanation is valid?
