refactor(batch): set VISIBILITY_MODE to all by default #7188
Conversation
Codecov Report
@@ Coverage Diff @@
## main #7188 +/- ##
==========================================
- Coverage 71.64% 71.63% -0.01%
==========================================
Files 1083 1083
Lines 173644 173693 +49
==========================================
+ Hits 124406 124424 +18
- Misses 49238 49269 +31
Discussed offline: waiting for the current epoch to bump on recovery can still cause a correctness issue. Consider the following case:
@zwang28 also pointed out that the read results from non-restarted CNs on recovery are also non-deterministic. In the above example, CN1 can serve a read using data from epoch2 or epoch3 depending on the progress of the recovery process.
This reverts commit 9a6542e.
We address the correctness issue #7188 (comment) as follows:
Here are some examples to show how it works (epoch_1 is a checkpoint). Case 1: a compute node that is restarted:
Any batch query with epoch_2 after step 2 will be rejected by
Case 2: a compute node that is not restarted:
Any batch query with epoch_2 after step 5 will be rejected by
@hzxa21 PTAL
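The rejections described in the cases above rely on the CN knowing how far it has sealed; a minimal sketch of a `try_wait_epoch`-style check for a non-checkpoint epoch, with hypothetical struct and field names (only `try_wait_epoch` and `sealed_epoch` appear in the thread):

```rust
// Sketch (hypothetical names) of try_wait_epoch for a non-checkpoint epoch:
// a batch read at an uncommitted epoch is only readable once the compute
// node has sealed at least that epoch.
struct MockLocalVersion {
    sealed_epoch: u64,
}

impl MockLocalVersion {
    /// Ok(()) if data at `epoch` is readable; Err if the caller should keep
    /// waiting (or time out and reject the batch query).
    fn try_wait_epoch(&self, epoch: u64) -> Result<(), &'static str> {
        if self.sealed_epoch >= epoch {
            Ok(())
        } else {
            Err("requested epoch not yet sealed")
        }
    }
}

fn main() {
    let version = MockLocalVersion { sealed_epoch: 4 };
    assert!(version.try_wait_epoch(3).is_ok());  // already sealed
    assert!(version.try_wait_epoch(5).is_err()); // restarted CN hasn't caught up
}
```

In the real (async) store the error case would translate into waiting on a notification rather than returning immediately.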
A bit surprised that the recovery test doesn't fail, because a batch query to a restarted compute node with a current epoch < the recovery checkpoint's epoch should fail.
+1. Shall we change the StateStore read interfaces to take HummockReadEpoch instead of u64 then?
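A typed epoch parameter as suggested might look like the following sketch; only the type name `HummockReadEpoch` comes from the thread, and the variants and helper are assumptions for illustration:

```rust
// Sketch of a typed read epoch replacing a raw u64. Variant names are
// hypothetical; only `HummockReadEpoch` itself is named in the discussion.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HummockReadEpoch {
    /// Read only data committed at this checkpoint epoch.
    Committed(u64),
    /// Read the latest (possibly uncommitted) data at this epoch.
    Current(u64),
}

impl HummockReadEpoch {
    pub fn get_epoch(self) -> u64 {
        match self {
            HummockReadEpoch::Committed(e) | HummockReadEpoch::Current(e) => e,
        }
    }
}

fn main() {
    let read = HummockReadEpoch::Current(42);
    assert_eq!(read.get_epoch(), 42);
    // A store taking HummockReadEpoch can branch on visibility intent
    // instead of guessing from a bare u64.
    match read {
        HummockReadEpoch::Committed(_) => println!("committed read"),
        HummockReadEpoch::Current(_) => println!("current read"),
    }
}
```

The design benefit is that the caller's visibility intent travels with the epoch, so validation logic doesn't need out-of-band flags.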
Good idea. Overall looks good, and it enlightens me to think of a potential issue and an improvement:

Issue: IIUC, under this proposal, it is possible that we reject a read even when there is no recovery, due to async notification of epoch updates. Consider the following example:

Improvement: To solve the above issue, we can simply avoid updating

Let's see whether there is any pitfall and continue the discussion next week.
Yes, this is what I mean by

Ah, I see. I misunderstood. Makes sense.
Verified it's just because the recovery test doesn't trigger any case where read current epoch < min_current_epoch or read current epoch > sealed_epoch, i.e.
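The two rejection conditions mentioned here can be sketched as a single check; the epoch names come from the comment, but combining them into one function is an illustrative assumption:

```rust
// Illustrative check combining the two failure cases from the comment:
// a current-epoch read is rejected if it predates the last recovery
// (epoch < min_current_epoch) or outruns what the CN has sealed
// (epoch > sealed_epoch).
fn is_valid_current_epoch_read(epoch: u64, min_current_epoch: u64, sealed_epoch: u64) -> bool {
    epoch >= min_current_epoch && epoch <= sealed_epoch
}

fn main() {
    assert!(is_valid_current_epoch_read(5, 3, 8));
    assert!(!is_valid_current_epoch_read(2, 3, 8)); // predates last recovery
    assert!(!is_valid_current_epoch_read(9, 3, 8)); // not yet sealed locally
}
```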
Rest LGTM. Thanks for the PR!
@@ -257,6 +257,7 @@ impl<S: StateStore> StorageTable<S> {
        read_version_from_backup: read_backup,
    };
    if let Some(value) = self.store.get(&serialized_pk, epoch, read_options).await? {
        self.store.validate_read_epoch(wait_epoch)?;
The intention of calling validate_read_epoch after get is to handle the case where min_current_epoch is bumped between L245 and L259. It is a bit subtle. Can we add some comments to explain it?
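The subtlety described here can be sketched with a mock store (the real StateStore is async and more involved; the mock types and error name below are assumptions):

```rust
// Minimal sketch of the validate-after-read pattern: the result of get is
// only trusted if the epoch is still valid AFTER the read, so a recovery
// that bumps min_current_epoch mid-read is caught before the stale result
// reaches the batch query.
use std::sync::atomic::{AtomicU64, Ordering};

struct MockStore {
    min_current_epoch: AtomicU64, // bumped on recovery
}

#[derive(Debug, PartialEq)]
enum ReadError {
    EpochTooOld,
}

impl MockStore {
    fn get(&self, _key: &[u8], _epoch: u64) -> Option<Vec<u8>> {
        Some(b"value".to_vec()) // pretend the key exists
    }

    // Reject reads whose epoch predates the last recovery.
    fn validate_read_epoch(&self, epoch: u64) -> Result<(), ReadError> {
        if epoch < self.min_current_epoch.load(Ordering::SeqCst) {
            Err(ReadError::EpochTooOld)
        } else {
            Ok(())
        }
    }
}

fn read_checked(store: &MockStore, key: &[u8], epoch: u64) -> Result<Option<Vec<u8>>, ReadError> {
    let value = store.get(key, epoch);
    // Validate AFTER the get: if a recovery happened between acquiring the
    // epoch and finishing the read, the stale result is discarded here.
    store.validate_read_epoch(epoch)?;
    Ok(value)
}

fn main() {
    let store = MockStore { min_current_epoch: AtomicU64::new(3) };
    assert!(read_checked(&store, b"k", 5).is_ok());
    // Simulate a recovery bumping min_current_epoch past the read epoch.
    store.min_current_epoch.store(10, Ordering::SeqCst);
    assert_eq!(read_checked(&store, b"k", 5), Err(ReadError::EpochTooOld));
}
```

Validating before the get would leave a window where the read observes post-recovery data while still passing the check, which is exactly the race the comment asks to document.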
let iter = store.iter(range, raw_epoch, read_options).await?;
store.validate_read_epoch(epoch)?;
ditto
After merging main, the recovery test triggers both the "current epoch < min_current_epoch" failure and the "current epoch > sealed_epoch" failure as expected. Update:
…round known local_mode propagation issue
As a workaround, we temporarily enforce VISIBILITY_MODE=checkpoint during the recovery test.
I suggest addressing this refactor in another PR. Likely lots of test cases need to be touched.
I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.
What's changed and what's your intention?
- Add validate_read_epoch to the StateStore trait. StorageTable uses it to verify whether any recovery occurs during its read. If so, the read result is invalidated and the batch query is aborted.
- Implement try_wait_epoch for non-checkpoint epochs. It's necessary because a restarted CN cannot ensure seal_epoch >= uncommitted_read_epoch from a batch query.

Checklist
- [ ] I have added necessary unit tests and integration tests.
- [ ] All checks passed in ./risedev check (or alias, ./risedev c).

Documentation
If your pull request contains user-facing changes, please specify the types of the changes, and create a release note. Otherwise, please feel free to remove this section.
Types of user-facing changes
Please keep the types that apply to your changes, and remove those that do not apply.
Release note
Support the VISIBILITY_MODE session variable. It is set to ALL by default.
Refer to a related PR or issue link (optional)
#6558