
fix: deadlock scenario from BulkReadVeneerApi and fixed flaky tests #2484

Merged
merged 5 commits into googleapis:master on Apr 14, 2020

Conversation

rahulKQL
Contributor

@rahulKQL rahulKQL commented Apr 6, 2020

Apologies for not adding enough test cases in the original BulkReadVeneerApi PR. Because of that, we are seeing flaky tests from TestBulkReadVeneerApi in CI jobs.

Bug Description

The BulkReadVeneerApi#cleanUp goes into a deadlock caused by the fact that the element future resolves before the Batch's status is updated. More information can be found in the review discussion below.

The order of events is:

  • Batch#onBatchSuccess()

    • BatchingDescriptor#splitResponse()

      • elementFuture#set()
      • BatcherStats#recordBatchElementsCompletion
    • Batcher#onBatchCompletion

      • BatcherImpl#flushLock.notifyAll()

This means that the callback in BulkReadVeneerApi is notified before onBatchCompletion is executed, preventing numOfOutstandingBatches from decrementing. In other words, we are notifying external callers before we have finished updating our own internal state, so I would argue that the bug is actually in gax's BatcherImpl, not here.
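
To make the cycle concrete, here is a minimal, self-contained sketch of the same pattern. All names here (DeadlockSketch, numOfOutstandingBatches, flushLock) are illustrative stand-ins for the gax internals, not the real BatcherImpl code; running it hangs forever on the wait below.

import com.google.common.util.concurrent.MoreExecutors;
import com.google.common.util.concurrent.SettableFuture;

public class DeadlockSketch {
  private final Object flushLock = new Object();
  private int numOfOutstandingBatches = 1;

  public static void main(String[] args) {
    DeadlockSketch batcher = new DeadlockSketch();
    SettableFuture<String> elementFuture = SettableFuture.create();

    // The BulkReadVeneerApi-style callback is registered with directExecutor(),
    // so it runs inline on whichever thread resolves the future.
    elementFuture.addListener(
        () -> {
          try {
            batcher.awaitAllOutstandingBatches(); // cleanUp() -> close() -> flush()
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        },
        MoreExecutors.directExecutor());

    // Batch#onBatchSuccess(): the element future resolves BEFORE the batch
    // bookkeeping runs, so the listener above blocks this very thread...
    elementFuture.set("row");
    batcher.onBatchCompletion(); // ...and the only notifier is never reached.
  }

  void onBatchCompletion() {
    synchronized (flushLock) {
      numOfOutstandingBatches--;
      flushLock.notifyAll();
    }
  }

  void awaitAllOutstandingBatches() throws InterruptedException {
    synchronized (flushLock) {
      while (numOfOutstandingBatches > 0) {
        flushLock.wait(); // deadlock: the thread that would notify is blocked here
      }
    }
  }
}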

Proposed Fix:

This PR adds a workaround for this problem by introducing a separate single-thread executor. The callback that decrements the entry response counter is now deferred onto that executor, which gives the original grpc thread a chance to call onBatchCompletion.

I verified this solution by running these unit tests 100 times each on my local system.
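
For reference, one way to do such a stress run (not necessarily how it was done here; TestBulkReadVeneerApi is assumed to be on the classpath in the same package) is a small JUnit 4 loop via JUnitCore:

import org.junit.runner.JUnitCore;
import org.junit.runner.Result;
import org.junit.runner.notification.Failure;

public class RepeatFlakyTest {
  public static void main(String[] args) {
    for (int i = 1; i <= 100; i++) {
      // Run the whole test class and stop on the first failing iteration.
      Result result = JUnitCore.runClasses(TestBulkReadVeneerApi.class);
      if (!result.wasSuccessful()) {
        System.out.println("Failed on iteration " + i);
        for (Failure failure : result.getFailures()) {
          System.out.println(failure.getTrace());
        }
        return;
      }
    }
    System.out.println("All 100 iterations passed");
  }
}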

This test case was failing roughly once in every ~100 runs, both in the local IDE and in the maven build. This could be because `Batcher` is async in nature, so I refactored `BulkReadVeneerApi#testAdd` to fix it.
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Apr 6, 2020
@codecov
Copy link

codecov bot commented Apr 6, 2020

Codecov Report

Merging #2484 into master will increase coverage by 0.49%.
The diff coverage is 64.28%.


@@             Coverage Diff              @@
##             master    #2484      +/-   ##
============================================
+ Coverage     61.74%   62.24%   +0.49%     
- Complexity     1608     1628      +20     
============================================
  Files           197      199       +2     
  Lines          9992    10106     +114     
  Branches       1022     1030       +8     
============================================
+ Hits           6170     6290     +120     
+ Misses         3360     3352       -8     
- Partials        462      464       +2     
| Impacted Files | Coverage Δ | Complexity Δ |
| --- | --- | --- |
| ...gle/cloud/bigtable/hbase/wrappers/BigtableApi.java | 71.42% <ø> (ø) | 3.00 <0.00> (ø) |
| ...ble/hbase/wrappers/classic/BigtableClassicApi.java | 95.45% <ø> (ø) | 5.00 <0.00> (ø) |
| ...e/hbase/wrappers/classic/DataClientClassicApi.java | 97.02% <ø> (ø) | 18.00 <0.00> (ø) |
| ...le/hbase/wrappers/veneer/AdminClientVeneerApi.java | 100.00% <ø> (ø) | 9.00 <0.00> (ø) |
| ...e/hadoop/hbase/client/BigtableAsyncConnection.java | 0.00% <0.00%> (ø) | 0.00 <0.00> (ø) |
| ...adoop/hbase/client/AbstractBigtableConnection.java | 55.55% <66.66%> (-0.36%) | 16.00 <0.00> (ø) |
| ...table/hbase/wrappers/veneer/BulkReadVeneerApi.java | 97.82% <100.00%> (+17.82%) | 12.00 <1.00> (+2.00) |
| ...gtable/hbase/wrappers/veneer/RowResultAdapter.java | 95.58% <0.00%> (ø) | 5.00% <0.00%> (?%) |
| ...ble/hbase/wrappers/veneer/DataClientVeneerApi.java | 95.65% <0.00%> (ø) | 13.00% <0.00%> (?%) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@rahulKQL
Contributor Author

rahulKQL commented Apr 8, 2020

After spending some time on this today, it seems BulkReadVeneerApi has a bug: if we try to close all the batchers in the map, this bug makes one of the batcher instances in the map hang indefinitely.

private void cleanUp() {
  for (Batcher<ByteString, Row> batcher : batchers.values()) {
    try {
      batcher.close();
    } catch (Throwable ignored) {
      // Ignored
    }
  }
  batchers.clear();
}

While debugging, it seems that even after all the resultEntry futures are resolved, BatcherImpl's close() -> flush() still awaits the same batch's Batch#onSuccess callback, which should already have finished by then.
https://github.com/googleapis/gax-java/blob/a8827adc670e99c289993791c52afeed0fc7a6ef/gax/src/main/java/com/google/api/gax/batching/BatcherImpl.java#L166-L187

Besides this bug, another test case from the same class is flaky: TestBulkReadVeneerApi#testSendOutstanding.

I am still investigating this and will add more information tomorrow.

@rahulKQL rahulKQL added do not merge Indicates a pull request not ready for merge, due to either quality or timing. status: investigating The issue is under investigation, which is determined to be non-trivial. labels Apr 8, 2020
@igorbernstein2
Collaborator

Thanks for the update

Problem:
There is a race condition between cleanUp() and the batcher's callback. cleanUp() calls `batcher.close()` -> `flush()` -> `awaitAllOutstandingBatches()`, while the `BatcherImpl#onBatchCompletion` callback decrements `numOfOutstandingBatches` and notifies all waiters once it reaches **0**.
This leads to a situation where `BatcherImpl#awaitAllOutstandingBatches()` grabs the `flushLock` and waits to be notified, while `BatcherImpl#onBatchCompletion` never gets a chance to decrement the counter and notify.

Until now, this only happened when we called `BulkReadVeneerApi#sendOutstanding()`. That method decrements the `cleanupBarrier` on each `sendOutstanding` call, which could bring the counter to **0** before all the entries were actually resolved.

Solution:
With this commit, I moved the counter from `notifyArrival()` to `ApiFuture<Result>`'s callback, and added a `signalCleanUp` flag to trigger batcher cleanups.

The `newSingleThreadExecutor` is also needed to handle the case where the user calls `BulkReadVeneerApi#sendOutstanding` after adding only one `rowKey` to the batch; this creates the above scenario of waiting on `flushLock`.
@rahulKQL rahulKQL removed do not merge Indicates a pull request not ready for merge, due to either quality or timing. status: investigating The issue is under investigation, which is determined to be non-trivial. labels Apr 13, 2020
@rahulKQL
Contributor Author

@igorbernstein2 I have updated this PR with the bug description.

Please have a look and let me know your thoughts on the proposed fix.

@igorbernstein2
Collaborator

igorbernstein2 commented Apr 13, 2020

So I don't think this is a race condition, but a deadlock caused by the fact that the element future resolves before the Batch's status is updated.

The order of events is:

  • Batch#onBatchSuccess()
    • BatchingDescriptor#splitResponse()
      • elementFuture#set()
      • BatcherStats#recordBatchElementsCompletion
    • Batcher#onBatchCompletion
      • BatcherImpl#flushLock.notifyAll()

This means that the callback in BulkReadVeneerApi is notified before onBatchCompletion is executed, preventing numOfOutstandingBatches from decrementing. In other words, we are notifying external callers before we have finished updating our own internal state, so I would argue that the bug is actually in gax's BatcherImpl, not here.

This PR has a bunch of changes, but I believe only one of them is the actual workaround for the bug; I think the other changes (adding additional atomics) are unnecessary.

I believe the actual fix (temporary workaround) is to defer the callback onto a separate thread so that the original grpc thread has a chance to call onBatchCompletion. I think the cleanest way to fix this is to replace the directExecutor with a SingleThreadExecutor when adding the rowFutureListener:

import java.util.concurrent.Executor;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

// Deferred executor so that notifyArrival() never runs inline on the grpc thread.
private static final Executor CLEANUP_EXECUTOR =
    Executors.newSingleThreadExecutor(
        new ThreadFactory() {
          @Override
          public Thread newThread(Runnable r) {
            Thread thread = new Thread(r);
            thread.setDaemon(true);
            thread.setName("bigtable-bulkread-cleanup");
            return thread;
          }
        });

// ...

rowFuture.addListener(
    new Runnable() {
      @Override
      public void run() {
        notifyArrival();
      }
    },
    CLEANUP_EXECUTOR);
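
As an aside on the snippet above: setDaemon(true) keeps the cleanup thread from blocking JVM shutdown, and a single thread is sufficient because the deferred work is only the notifyArrival() bookkeeping; the important property is just that it is not the grpc callback thread.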

If you agree with this, then I propose the following:

  1. In the short term use the above approach
  2. In the medium term, add Batcher#closeAsync() to gax and remove the workaround
  3. In the longer term (a rough sketch follows this list):
    1. add BatchEntry#setResult(T value) and BatchEntry#setException(Throwable), which will temporarily delegate to the underlying SettableFuture
    2. update gax's BatchEntry interface to hide the underlying future from the BatchingDescriptor
    3. introduce temporary state to BatchEntry to store the Throwable or the result
    4. add a BatchEntry#finalizeResult() that will resolve the internal SettableFuture
    5. call finalizeResult() after all of the internal bookkeeping in BatcherImpl is done
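
To make item 3 concrete, here is a rough, hypothetical sketch of what such a BatchEntry could look like. None of this is existing gax API; it only illustrates staging the result until the batcher's bookkeeping is done:

import com.google.api.core.SettableApiFuture;

// Hypothetical shape only -- not the actual gax BatchEntry.
final class BatchEntry<T> {
  private final SettableApiFuture<T> resultFuture = SettableApiFuture.create();
  private T pendingResult;
  private Throwable pendingError;

  // Called by the BatchingDescriptor: stages the outcome without resolving
  // the caller-visible future yet.
  void setResult(T value) {
    this.pendingResult = value;
  }

  void setException(Throwable error) {
    this.pendingError = error;
  }

  // Called by BatcherImpl only after onBatchCompletion() and all other
  // internal bookkeeping have finished.
  void finalizeResult() {
    if (pendingError != null) {
      resultFuture.setException(pendingError);
    } else {
      resultFuture.set(pendingResult);
    }
  }
}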

@rahulKQL
Contributor Author

Thanks a lot for taking the time to write up a detailed analysis of this bug.

I agree with your listed approach and will update this PR with the proposed workaround.

@igorbernstein2
Collaborator

Also, please update the PR title since this is a bugfix now.

@rahulKQL rahulKQL changed the title chore(test): fix BulkReadVeneerApi flaky test case fix: deadlock scenario from BulkReadVeneerApi and fixed flaky tests Apr 14, 2020
Collaborator

@igorbernstein2 igorbernstein2 left a comment


LGTM, but please update the PR description

@rahulKQL rahulKQL merged commit e4cd4ef into googleapis:master Apr 14, 2020