bug: panic when a vote verification task is added while shutting down the agreement service #5341

Open · wants to merge 7 commits into master
Conversation

@algonautshant (Contributor) commented Apr 28, 2023

The agreement service shuts down when the node is shutting down, or when the node is asked to catch up from a catchpoint.
This test demonstrates that a vote verification task can sneak past the ctx check just before the ctx is canceled, while the output channel is closed just before the wg is incremented, resulting in a push to a closed channel and a panic.

This is not a critical bug, since it can only happen while the node is shutting down or is about to catch up from a catchpoint.
However, it does produce a panic, and bad things can happen...
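
To make the interleaving concrete, here is a minimal, self-contained Go sketch of the same check-then-act race. The verifier type, field names, and sleep-based timing below are illustrative assumptions, not the actual AsyncVoteVerifier code:

package main

import (
	"context"
	"sync"
	"time"
)

// verifier is an illustrative stand-in for the vote verifier: a ctx to signal
// shutdown, a wg counting in-flight tasks, and an out channel for results.
type verifier struct {
	ctx    context.Context
	cancel context.CancelFunc
	wg     sync.WaitGroup
	out    chan int
}

// submit mirrors the check-then-act pattern: the ctx check and the wg.Add
// are not atomic, so Quit can run to completion in between them.
func (v *verifier) submit(task int) {
	select {
	case <-v.ctx.Done():
		return // verifier is quitting, drop the task
	default:
	}
	time.Sleep(10 * time.Millisecond) // widen the race window so Quit wins
	v.wg.Add(1)                       // incremented after Quit already passed wg.Wait()
	go func() {
		defer v.wg.Done()
		v.out <- task // send on a closed channel -> panic
	}()
}

// Quit mirrors the shutdown path: cancel, wait for counted tasks, close out.
func (v *verifier) Quit() {
	v.cancel()
	v.wg.Wait()  // sees zero tasks: submit has not reached wg.Add yet
	close(v.out) // the late task will push to this closed channel
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	v := &verifier{ctx: ctx, cancel: cancel, out: make(chan int, 1)}
	go v.submit(1)                    // passes the ctx check first
	time.Sleep(time.Millisecond)      // let submit get past the select
	v.Quit()                          // then shutdown closes the channel
	time.Sleep(20 * time.Millisecond) // the late task now panics
}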

Analysis of the verification process

Understanding the problem for possible solutions: mutex protection for an atomic ctx check and wg increment
Consider adding a mutex so that every vote submitted to verifyVote either increments the wg or notifies the caller that the verifier has been canceled.
Below I present the verifyVote use cases and analyze the impact of verification tasks that are skipped and never return a result to the out chan.
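
A hedged sketch of the mutex idea, as revised submit and Quit methods building on the illustrative verifier type from the sketch above (the mu field, the error value, and the method shapes are assumptions, not a patch against the real AsyncVoteVerifier): the ctx check and the wg.Add happen under the same mutex that Quit takes before cancellation, so every submitted vote is either counted before wg.Wait or rejected with an explicit error.

// Assumes the verifier struct above gains a `mu sync.Mutex` field,
// and that "errors" is imported in addition to the imports above.
var errVerifierQuit = errors.New("vote verifier is shutting down")

func (v *verifier) submit(task int) error {
	v.mu.Lock()
	select {
	case <-v.ctx.Done():
		v.mu.Unlock()
		return errVerifierQuit // the caller learns the task was not enqueued
	default:
	}
	v.wg.Add(1) // counted before Quit can possibly reach wg.Wait()
	v.mu.Unlock()

	go func() {
		defer v.wg.Done()
		v.out <- task
	}()
	return nil
}

func (v *verifier) Quit() {
	v.mu.Lock()
	v.cancel() // after this point no submit can pass the ctx check
	v.mu.Unlock()
	v.wg.Wait()  // every counted task has sent its result before calling Done
	close(v.out) // safe: no further sends can happen
}

Whether this protection belongs in verifyVote itself or in the surrounding callers is what the use-case analysis below tries to answer.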

Overall, there are 3 main use cases:

agreement/bundle.go
206: 		avv.verifyVote(ctx, l, uv, uint64(i), message{}, results) //nolint:errcheck // verifyVote will call EnqueueBacklog, which blocks until the verify task is queued, or returns an error when ctx.Done(), which we are already checking
agreement/cryptoVerifier.go
217: 			err := c.voteVerifier.verifyVote(votereq.ctx, c.ledger, uv, votereq.TaskIndex, votereq.message, c.votes.out)
agreement/pseudonode.go
390: 		err := verifier.verifyVote(context.TODO(), t.node.ledger, uv, uint64(i), msg, results)
518: 		err := verifier.verifyVote(context.TODO(), t.node.ledger, uv, uint64(i), msg, results)

Case of bundle.go

The code in general does not support the case of AsyncVoteVerifier.verifyVote not returning a result on the response chan:
In unauthenticatedBundle.verifyAsync the error returned by avv.verifyVote is not checked (it is non-nil only when the exec-pool is shut down, which never happens in production). Moreover, verifyAsync does not return an error when the verifier ctx is canceled; it silently drops the verification task.
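
To make the resulting hang concrete, here is a hedged sketch of the submit-then-drain pattern that verifyAsync uses (function names and types are illustrative, not the actual bundle.go code): if the verifier silently drops a task, the returned closure blocks forever on the results channel.

// Sketch of the pattern: enqueue one verification per vote, then return a
// closure that blocks until every result has been read back.
func verifyAllAsync(votes []int, submit func(task int, out chan<- error)) func() error {
	results := make(chan error, len(votes))
	for _, vote := range votes {
		submit(vote, results) // any error from submit is ignored, mirroring bundle.go
	}
	return func() error {
		for range votes {
			// If the verifier dropped a task without writing to results,
			// this receive blocks forever and the closure never returns.
			if err := <-results; err != nil {
				return err
			}
		}
		return nil
	}
}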

When verifyVote is called, the caller expects to get the response from the out chan.
Two call sites are described below:

Call site A

agreement/bundle.go
142: 	return b.verifyAsync(ctx, l, avv)()

Certificate.Authenticate calls verify with context.Background(), which means the verification itself will never be canceled. However, the verifier may be canceled in case (1) below, but not in case (2).
This is important because Authenticate calls b.verifyAsync(ctx, l, avv)(), which will never return unless the verifier returns a response for every submitted vote, and when the verifier quits, not all responses are returned.

  1. When calling fetchedCert.Authenticate here
catchup/service.go
// As a failsafe, if the cert we fetched is valid but for the wrong block, panic as loudly as possible
if cert.Round == fetchedCert.Round &&
	cert.Proposal.BlockDigest != fetchedCert.Proposal.BlockDigest &&
	fetchedCert.Authenticate(*block, s.ledger, verifier) == nil {

the verifier used is the same as the agreement service verifier, which shuts down when the agreement service shuts down (at node shutdown or catchpoint catchup).
2. In catchup/service.go fetchAndWrite

			err = s.auth.Authenticate(block, cert)

The verifier (node.catchupBlockAuth) will only quit when the node is shutting down.

Call site B

In this case, the verifier can quit only after the verifyAsync call is done.

agreement/cryptoVerifier.go
236: 			fn := bundlereq.message.UnauthenticatedBundle.verifyAsync(bundlereq.ctx, c.ledger, c.voteVerifier)

Uses the agreement service verifier, which quits when the agreement service stops.
TODO: understand how bundlereq.ctx is canceled. Not crucial, since the verifier is blocked from shutting down until the fn call returns.
It is called from the poolCryptoVerifier.bundleWaitWorker loop. If the verifier quits, the call will never return, in which case poolCryptoVerifier.Quit will wait forever on c.wg.Wait().

Fortunately, when the agreement service is canceled, the verifier quits only after c.wg.Wait() returns.

Case of cryptoVerifier.go

Here, the verifier used is the agreement service verifier, which is shut down when the agreement service shuts down.
The consumer of the results from the output channel will also terminate when the agreement service terminates. This use case terminates safely.

Case of pseudonode.go

The verifier used is the agreement service verifier.

pseudonodeVotesTask use

When called from pseudonodeVotesTask.execute, the task calls verifyVote to queue the votes and then expects to get the results for all submitted votes. It does not expect the verifier to quit; if that happens, it will hang waiting for results from the output chan.
The code assumes that when verifyVote cannot deliver the result, it returns an error, and it accounts for that case only (a sketch of this caller pattern follows the snippet below).
Fortunately, the assumption is backed by the Service demuxLoop termination order, which makes sure that asyncPseudonode has terminated before quitting the vote verifier:

	s.loopback.Quit()
	s.voteVerifier.Quit()
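
For illustration, a hedged sketch of the caller pattern described above (names and types are illustrative, not the actual pseudonodeVotesTask code): the task only waits for as many results as verifyVote accepted, so a vote that is neither rejected with an error nor answered on the output chan would make the final loop hang.

// Sketch: count only the votes that the verifier accepted, then drain
// exactly that many results from the output channel.
func collectVerifiedVotes(votes []int, submit func(task int, out chan<- error) error) []error {
	results := make(chan error, len(votes))
	accepted := 0
	for _, vote := range votes {
		if err := submit(vote, results); err != nil {
			continue // the verifier refused the vote; do not wait for its result
		}
		accepted++
	}
	out := make([]error, 0, accepted)
	for i := 0; i < accepted; i++ {
		out = append(out, <-results) // hangs if an accepted vote never reports back
	}
	return out
}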

@cce (Contributor) commented Apr 28, 2023

So the panic is here:

if t.out != nil {
	t.out <- res
}

And the channel close is here:

// wait until all the tasks we've given the pool are done.
avv.wg.Wait()
if avv.backlogExecPool.GetOwner() == avv {
	avv.backlogExecPool.Shutdown()
}
// since no more tasks are coming, we can safely close the output pool channel.
close(avv.execpoolOut)
// wait until the worker function exists.
<-avv.workerWaitCh

Shouldn't the backlogExecPool.Shutdown() wait until the underlying pool worker shuts down before returning to AsyncVoteVerifier.Quit() and letting it close(avv.execpoolOut)? Oh, I see: is it because this pool is passed into MakeAsyncVoteVerifier and created in the node implementation, so it is not the owner...

But it seems like the node implementation is correctly shutting down the backlog pools AFTER all the services are shut down. I see: since the node shuts down the backlog pools AFTER the services, a task might still try to write to avv.execpoolOut after agreement has shut down and already closed the channel.

go-algorand/node/node.go

Lines 418 to 443 in 9fc6158

func (node *AlgorandFullNode) Stop() {
	node.mu.Lock()
	defer func() {
		node.mu.Unlock()
		node.waitMonitoringRoutines()
	}()
	node.net.ClearHandlers()
	if !node.config.DisableNetworking {
		node.net.Stop()
	}
	if node.catchpointCatchupService != nil {
		node.catchpointCatchupService.Stop()
	} else {
		node.stateProofWorker.Stop()
		node.txHandler.Stop()
		node.agreementService.Shutdown()
		node.catchupService.Stop()
		node.txPoolSyncerService.Stop()
		node.blockService.Stop()
		node.ledgerService.Stop()
	}
	node.catchupBlockAuth.Quit()
	node.highPriorityCryptoVerificationPool.Shutdown()
	node.lowPriorityCryptoVerificationPool.Shutdown()
	node.cryptoPool.Shutdown()

So this seems like not a problem for the node, but just a problem for a test that passes in a pool implementation and doesn't shut down the services before the worker pools?

@algonautshant (Contributor, Author) commented:
> But it seems like the node implementation is correctly shutting down the backlog pools AFTER all the services are shut down.
>
> So this seems like not a problem for the node, but just a problem for a test that passes in a pool implementation and doesn't shut down the services before the worker pools?

You are making all the correct observations. I need to clarify one more thing to convince you that the problem is in the node.

The services rely on the exec-pool to process the tasks. The services are well protected against the exec-pool shutting down prematurely: they cannot function, but they report an error. This is what happens when EnqueueBacklog fails.

So we have a problem when we shut down the exec-pool before the service, but we also have a problem when we shut down the service before the exec-pool.

The exec-pool also depends on the services as the destination for the execution results. The code is designed so that no execution task remains in the exec-pool after the service shuts down, thus breaking the dependency of the exec-pool on the service once the service has shut down. But I demonstrated with this test that this design has a flaw: a task can escape the service shutdown into the exec-pool and return its result to the service after the service has shut down but before the exec-pool shuts down.

Do you think this explanation is valid?
