
fix: deadlock scenario from BulkReadVeneerApi and fixed flaky tests #2484

Merged
merged 5 commits into googleapis:master on Apr 14, 2020

Conversation

rahulKQL
Contributor

@rahulKQL rahulKQL commented Apr 6, 2020

Apologies for not adding enough test cases in the original BulkReadVeneerApi PR. Because of that, we are seeing flaky tests from TestBulkReadVeneerApi in CI jobs.

Bug Description

The BulkReadVeneerApi#cleanUp goes into a deadlock caused by the fact that the element future resolves before the Batch's status is updated. More information can be found in the review discussion below.

The order of events is:

  • Batch#onBatchSuccess()

    • BatchingDescriptor#splitResponse()

      • elementFuture#set()
      • BatcherStats#recordBatchElementsCompletion
    • Batcher#onBatchCompletion

      • BatcherImpl#flushLock.notifyAll()

This means that the callback in BulkReadVeneerApi is notified before onBatchCompletion is executed, preventing numOfOutstandingBatches from decrementing. In other words, we are notifying external callers before we have finished updating our own internal state, so I would argue that the bug is actually in gax's BatcherImpl, not here.
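
To make the cycle concrete, here is a minimal, self-contained sketch of the same pattern. All names here (DeadlockSketch, numOfOutstandingBatches, flushLock) are illustrative stand-ins for the gax internals, not the real BatcherImpl code; running it hangs forever on the wait below.

import com.google.common.util.concurrent.MoreExecutors;
import com.google.common.util.concurrent.SettableFuture;

public class DeadlockSketch {
  private final Object flushLock = new Object();
  private int numOfOutstandingBatches = 1;

  public static void main(String[] args) {
    DeadlockSketch batcher = new DeadlockSketch();
    SettableFuture<String> elementFuture = SettableFuture.create();

    // The BulkReadVeneerApi-style callback is registered with directExecutor(),
    // so it runs inline on whichever thread resolves the future.
    elementFuture.addListener(
        () -> {
          try {
            batcher.awaitAllOutstandingBatches(); // cleanUp() -> close() -> flush()
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        },
        MoreExecutors.directExecutor());

    // Batch#onBatchSuccess(): the element future resolves BEFORE the batch
    // bookkeeping runs, so the listener above blocks this very thread...
    elementFuture.set("row");
    batcher.onBatchCompletion(); // ...and the only notifier is never reached.
  }

  void onBatchCompletion() {
    synchronized (flushLock) {
      numOfOutstandingBatches--;
      flushLock.notifyAll();
    }
  }

  void awaitAllOutstandingBatches() throws InterruptedException {
    synchronized (flushLock) {
      while (numOfOutstandingBatches > 0) {
        flushLock.wait(); // deadlock: the thread that would notify is blocked here
      }
    }
  }
}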

Proposed Fix:

This PR adds a workaround for this problem by introducing a separate single-thread executor. The callback that decrements the entry response counter is now deferred onto that executor, which gives the original grpc thread a chance to call onBatchCompletion.

I verified this solution by running these unit tests 100 times each on my local system.
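
For reference, one way to do such a stress run (not necessarily how it was done here; TestBulkReadVeneerApi is assumed to be on the classpath in the same package) is a small JUnit 4 loop via JUnitCore:

import org.junit.runner.JUnitCore;
import org.junit.runner.Result;
import org.junit.runner.notification.Failure;

public class RepeatFlakyTest {
  public static void main(String[] args) {
    for (int i = 1; i <= 100; i++) {
      // Run the whole test class and stop on the first failing iteration.
      Result result = JUnitCore.runClasses(TestBulkReadVeneerApi.class);
      if (!result.wasSuccessful()) {
        System.out.println("Failed on iteration " + i);
        for (Failure failure : result.getFailures()) {
          System.out.println(failure.getTrace());
        }
        return;
      }
    }
    System.out.println("All 100 iterations passed");
  }
}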

This test case was failing roughly once in every ~100 runs, both in the local IDE and in the maven build. This could be because `Batcher` is async in nature, so I refactored `BulkReadVeneerApi#testAdd` to fix it.
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Apr 6, 2020
@codecov
Copy link

codecov bot commented Apr 6, 2020

Codecov Report

Merging #2484 into master will increase coverage by 0.49%.
The diff coverage is 64.28%.


@@             Coverage Diff              @@
##             master    #2484      +/-   ##
============================================
+ Coverage     61.74%   62.24%   +0.49%     
- Complexity     1608     1628      +20     
============================================
  Files           197      199       +2     
  Lines          9992    10106     +114     
  Branches       1022     1030       +8     
============================================
+ Hits           6170     6290     +120     
+ Misses         3360     3352       -8     
- Partials        462      464       +2     
| Impacted Files | Coverage Δ | Complexity Δ |
| --- | --- | --- |
| ...gle/cloud/bigtable/hbase/wrappers/BigtableApi.java | 71.42% <ø> (ø) | 3.00 <0.00> (ø) |
| ...ble/hbase/wrappers/classic/BigtableClassicApi.java | 95.45% <ø> (ø) | 5.00 <0.00> (ø) |
| ...e/hbase/wrappers/classic/DataClientClassicApi.java | 97.02% <ø> (ø) | 18.00 <0.00> (ø) |
| ...le/hbase/wrappers/veneer/AdminClientVeneerApi.java | 100.00% <ø> (ø) | 9.00 <0.00> (ø) |
| ...e/hadoop/hbase/client/BigtableAsyncConnection.java | 0.00% <0.00%> (ø) | 0.00 <0.00> (ø) |
| ...adoop/hbase/client/AbstractBigtableConnection.java | 55.55% <66.66%> (-0.36%) | 16.00 <0.00> (ø) |
| ...table/hbase/wrappers/veneer/BulkReadVeneerApi.java | 97.82% <100.00%> (+17.82%) | 12.00 <1.00> (+2.00) |
| ...gtable/hbase/wrappers/veneer/RowResultAdapter.java | 95.58% <0.00%> (ø) | 5.00% <0.00%> (?%) |
| ...ble/hbase/wrappers/veneer/DataClientVeneerApi.java | 95.65% <0.00%> (ø) | 13.00% <0.00%> (?%) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@rahulKQL
Contributor Author

rahulKQL commented Apr 8, 2020

After spending some time on this today, it seems BulkReadVeneerApi has a bug: if we try to close all the batchers in the map, this bug makes one of the batcher instances in the map hang indefinitely.

private void cleanUp() {
  for (Batcher<ByteString, Row> batcher : batchers.values()) {
    try {
      batcher.close();
    } catch (Throwable ignored) {
      // Ignored
    }
  }
  batchers.clear();
}

While debugging, it seems that even after all the resultEntry futures are resolved, BatcherImpl's close() -> flush() still awaits the same batch's Batch#onSuccess callback, which should already have finished by then.
https://github.com/googleapis/gax-java/blob/a8827adc670e99c289993791c52afeed0fc7a6ef/gax/src/main/java/com/google/api/gax/batching/BatcherImpl.java#L166-L187

Besides this bug, another test case from the same class is flaky: TestBulkReadVeneerApi#testSendOutstanding.

I am still investigating this and will add more information tomorrow.

@rahulKQL rahulKQL added do not merge Indicates a pull request not ready for merge, due to either quality or timing. status: investigating The issue is under investigation, which is determined to be non-trivial. labels Apr 8, 2020
@igorbernstein2
Collaborator

Thanks for the update

Problem:
There is a race condition between cleanUp() and the batcher's callback. cleanUp() calls `batcher.close()` -> `flush()` -> `awaitAllOutstandingBatches()`, while the `BatcherImpl#onBatchCompletion` callback decrements `numOfOutstandingBatches` and notifies all waiters once it reaches **0**.
This leads to a situation where `BatcherImpl#awaitAllOutstandingBatches()` grabs the `flushLock` and waits to be notified, while `BatcherImpl#onBatchCompletion` never gets a chance to decrement the counter and notify.

Until now, this only happened when we called `BulkReadVeneerApi#sendOutstanding()`. That method decrements the `cleanupBarrier` on each `sendOutstanding` call, which could bring the counter to **0** before all the entries were actually resolved.

Solution:
With this commit, I moved the counter from `notifyArrival()` to `ApiFuture<Result>`'s callback, and added a `signalCleanUp` flag to trigger batcher cleanups.

The `newSingleThreadExecutor` is also needed to handle the case where the user calls `BulkReadVeneerApi#sendOutstanding` after adding only one `rowKey` to the batch; this creates the above scenario of waiting on `flushLock`.
@rahulKQL rahulKQL removed do not merge Indicates a pull request not ready for merge, due to either quality or timing. status: investigating The issue is under investigation, which is determined to be non-trivial. labels Apr 13, 2020
@rahulKQL
Contributor Author

@igorbernstein2 I have updated this PR with the bug description.

Please have a look and let me know your thoughts on the proposed fix.

@igorbernstein2
Collaborator

igorbernstein2 commented Apr 13, 2020

So I don't think this is a race condition, but a deadlock caused by the fact that the element future resolves before the Batch's status is updated.

The order of events is:

  • Batch#onBatchSuccess()
    • BatchingDescriptor#splitResponse()
      • elementFuture#set()
      • BatcherStats#recordBatchElementsCompletion
    • Batcher#onBatchCompletion
      • BatcherImpl#flushLock.notifyAll()

This means that the callback in BulkReadVeneerApi is notified before onBatchCompletion is executed, preventing numOfOutstandingBatches from decrementing. In other words, we are notifying external callers before we have finished updating our own internal state, so I would argue that the bug is actually in gax's BatcherImpl, not here.

This PR has a bunch of changes, but I believe only one of them is the actual workaround for the bug; I think the other changes (adding additional atomics) are unnecessary.

I believe the actual fix (temporary workaround) is to defer the callback onto a separate thread so that the original grpc thread has a chance to call onBatchCompletion. I think the cleanest way to fix this is to replace the directExecutor with a SingleThreadExecutor when adding the rowFutureListener:

import java.util.concurrent.Executor;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

// Deferred executor so that notifyArrival() never runs inline on the grpc thread.
private static final Executor CLEANUP_EXECUTOR =
    Executors.newSingleThreadExecutor(
        new ThreadFactory() {
          @Override
          public Thread newThread(Runnable r) {
            Thread thread = new Thread(r);
            thread.setDaemon(true);
            thread.setName("bigtable-bulkread-cleanup");
            return thread;
          }
        });

// ...

rowFuture.addListener(
    new Runnable() {
      @Override
      public void run() {
        notifyArrival();
      }
    },
    CLEANUP_EXECUTOR);
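
As an aside on the snippet above: setDaemon(true) keeps the cleanup thread from blocking JVM shutdown, and a single thread is sufficient because the deferred work is only the notifyArrival() bookkeeping; the important property is just that it is not the grpc callback thread.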

If you agree with this, then I propose the following:

  1. In the short term use the above approach
  2. In the medium term, add Batcher#closeAsync() to gax and remove the workaround
  3. In the longer term (a rough sketch follows this list):
    1. add BatchEntry#setResult(T value) and BatchEntry#setException(Throwable), which will temporarily delegate to the underlying SettableFuture
    2. update gax's BatchEntry interface to hide the underlying future from the BatchingDescriptor
    3. introduce temporary state to BatchEntry to store the Throwable or the result
    4. add a BatchEntry#finalizeResult() that will resolve the internal SettableFuture
    5. call finalizeResult() after all of the internal bookkeeping in BatcherImpl is done
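
To make item 3 concrete, here is a rough, hypothetical sketch of what such a BatchEntry could look like. None of this is existing gax API; it only illustrates staging the result until the batcher's bookkeeping is done:

import com.google.api.core.SettableApiFuture;

// Hypothetical shape only -- not the actual gax BatchEntry.
final class BatchEntry<T> {
  private final SettableApiFuture<T> resultFuture = SettableApiFuture.create();
  private T pendingResult;
  private Throwable pendingError;

  // Called by the BatchingDescriptor: stages the outcome without resolving
  // the caller-visible future yet.
  void setResult(T value) {
    this.pendingResult = value;
  }

  void setException(Throwable error) {
    this.pendingError = error;
  }

  // Called by BatcherImpl only after onBatchCompletion() and all other
  // internal bookkeeping have finished.
  void finalizeResult() {
    if (pendingError != null) {
      resultFuture.setException(pendingError);
    } else {
      resultFuture.set(pendingResult);
    }
  }
}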

@rahulKQL
Contributor Author

Thanks a lot for taking the time to write up a detailed analysis of this bug.

I agree with your listed approach and will update this PR with the proposed workaround.

@igorbernstein2
Collaborator

Also, please update the PR title since this is a bugfix now.

@rahulKQL rahulKQL changed the title chore(test): fix BulkReadVeneerApi flaky test case fix: deadlock scenario from BulkReadVeneerApi and fixed flaky tests Apr 14, 2020
Collaborator

@igorbernstein2 igorbernstein2 left a comment


LGTM, but please update the PR description

@rahulKQL rahulKQL merged commit e4cd4ef into googleapis:master Apr 14, 2020