-
Notifications
You must be signed in to change notification settings - Fork 180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: deadlock scenario from BulkReadVeneerApi and fixed flaky tests #2484
Conversation
This test case was failing once in ~100 run on local IDE as well maven build. This could be because of `Batcher` are async in nature So refactored the `BulkReadVeneerApi#testAdd` to fix the same.
Codecov Report
@@ Coverage Diff @@
## master #2484 +/- ##
============================================
+ Coverage 61.74% 62.24% +0.49%
- Complexity 1608 1628 +20
============================================
Files 197 199 +2
Lines 9992 10106 +114
Branches 1022 1030 +8
============================================
+ Hits 6170 6290 +120
+ Misses 3360 3352 -8
- Partials 462 464 +2
Continue to review full report at Codecov.
|
...ase/src/test/java/com/google/cloud/bigtable/hbase/wrappers/veneer/TestBulkReadVeneerApi.java
Show resolved
Hide resolved
After spending some time on this today, It seems BulkReadVeneerApi has a bug: If we try to close all the batchers on the map then this bug makes one of the Lines 100 to 109 in 7b2d0df
While debugging it seems even after all the Beside this bug, another test case from the same class is flaky i.e. TestBulkReadVeneerApi#testSendOutstanding. I am still investigating this, I will add more information on this tomorrow. |
Thanks for the update |
Problem: There is a race condition between cleanUp() and batcher's callback. The cleanUp() calls -> `batcher.close()` -> `flush()` -> `awaitAllOutstandingBatches()` and the `BatcherImpl#onBatchComplemetion` callback decrements `numOfOutstandingBatches`, It notifies all once it reaches **0**. This led to a situation where `BatcherImpl#awaitAllOutstandingBatches()` grabs the `flushLock` and wait for notify & The `BatcherImpl#onBatchCompletion` never gets a chance to decrement the counter and notify. This only happened until now in case we called `BulkReadVeneerApi#sendOutstanding()`. This method decrements the `cleanupBarrier` on each `sendOutstanding` call, which led to counter being **0** before all the entries could actually be resolved. Solution: With this commit, I moved the counter from `notifyArrival()` to `ApiFuture<Result>`'s callback, and added a `signalCleanUp` flag to trigger batcher cleanups. The `newSingleThreadExecutor` is also needed to handle a case where the user calls `BulkReadVeneerApi#sendOutstanding` after adding only one `rowKey` in batch, This creates the above scenario of waiting on `flushLock`.
@igorbernstein2 I have updated this PR with bug description. Please have a look and let me know your thoughts on the proposed fix. |
So I don't think this is a race condition, but a deadlock caused by the fact that the element future resolves before the Batch's status is updated. The order of events is:
Which means that the callback in BulkReadVeneerApi is notified before onBatchCompletion is executed, preventing numOfOutstandingBatches from decrementing. In other words we are notifying external callers before finishing updating our own internal state. So I would argue that the bug is actually in gax's BatcherImpl not here. This PR has a bunch of changes, but I believe only one of the them is the actual workaround to the bug, I think other changes are unnecessary (adding additional atomics). I believe the actual fix (temporary workaround) is to defer the callback onto a separate thread so that the original grpc thread has a chance to call onBatchCompletion. I think the cleanest way to fix this is to replace the directExecutor with a SingleThreadExecutor when adding the rowFutureListener: private static final Executor CLEANUP_EXECUTOR = Executors.newSingleThreadExecutor(new ThreadFactory() {
@Override
public Thread newThread(Runnable r) {
Thread thread = new Thread(r);
thread.setDaemon(true);
thread.setName("bigtable-bulkread-cleanup");
return thread;
}
});
//...
rowFuture.addListener(
new Runnable() {
@Override
public void run() {
notifyArrival();
}
},
CLEANUP_EXECUTOR); If you agree with this, then I propose the following:
|
Thanks a lot for taking the time and adding a detailed analysis of this bug. I agree with your listed approach and will update this PR with the proposed workaround. |
Also please update the PR title since this is a bugfix now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but please update the PR description
Apologies for not adding enough test cases in the original BulkReadVeneerApi PR. Due to that, we are seeing flaky test cases from TestBulkReadVeneerApi in CI jobs.
Bug Description
The BulkReadVeneerApi#cleanUp goes into a deadlock caused by the fact that the element future resolves before the Batch's status is updated. More information can be found here:
Proposed Fix:
This PR adds a workaround for this problem by introducing a separate single thread executor, Now we defer the callback which decrement the entry response counter on this executor, which gives a chance to the original grpc thread to call onBatchCompletion.
I confirm this solution by running these unit tests for 100 times each on my local system.