
refactor(batch): set VISIBILITY_MODE to all by default #7188

Merged: 14 commits merged into main from wangzheng/chore_visibility_mode on Jan 18, 2023

Conversation

@zwang28 (Contributor) commented Jan 4, 2023

I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.

What's changed and what's your intention?

  • Add validate_read_epoch to the StateStore trait. StorageTable uses it to verify whether any recovery occurred during its read. If so, the read result is invalidated and the batch query is aborted (see the sketch after this list).
  • Set VISIBILITY_MODE to all by default.
  • Implement try_wait_epoch for non-checkpoint epochs.
    • It's necessary because a restarted CN cannot guarantee seal_epoch >= uncommitted_read_epoch for a batch query.
  • Temporarily revert FLUSH to always enforce a checkpoint.
    • We will adopt CHECKPOINT in another PR, because it requires considerable changes to test scripts.
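
A minimal sketch of the intended read path (simplified, hypothetical types and signatures; not the actual RisingWave traits):

```rust
// Sketch only: wait until the epoch is readable, do the read, then re-validate
// the epoch so that a recovery happening mid-read invalidates the result.
type Result<T> = std::result::Result<T, String>;

#[derive(Clone, Copy)]
enum ReadEpoch {
    Committed(u64),
    Current(u64),
}

trait StateStoreLike {
    /// Block until `epoch` is readable on this node.
    fn try_wait_epoch(&self, epoch: ReadEpoch) -> Result<()>;
    /// Plain point get at `raw_epoch`.
    fn get(&self, key: &[u8], raw_epoch: u64) -> Result<Option<Vec<u8>>>;
    /// Fail if a recovery has made `epoch` unreadable.
    fn validate_read_epoch(&self, epoch: ReadEpoch) -> Result<()>;
}

fn batch_point_get<S: StateStoreLike>(
    store: &S,
    key: &[u8],
    read_epoch: ReadEpoch,
    raw_epoch: u64,
) -> Result<Option<Vec<u8>>> {
    store.try_wait_epoch(read_epoch)?; // wait until the epoch is readable
    let value = store.get(key, raw_epoch)?; // perform the read
    store.validate_read_epoch(read_epoch)?; // abort the batch query if recovery occurred
    Ok(value)
}
```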

Checklist

  • I have written necessary rustdoc comments
  - [ ] I have added necessary unit tests and integration tests
  • All checks passed in ./risedev check (or alias, ./risedev c)

Documentation

If your pull request contains user-facing changes, please specify the types of the changes, and create a release note. Otherwise, please feel free to remove this section.

Types of user-facing changes

Please keep the types that apply to your changes, and remove those that do not apply.

  • SQL commands, functions, and operators

Release note

Support the VISIBILITY_MODE session variable, which is set to ALL by default.

Refer to a related PR or issue link (optional)

#6558

@zwang28 posted a comment that was later marked as resolved.

@zwang28 zwang28 force-pushed the wangzheng/chore_visibility_mode branch 2 times, most recently from 2966432 to 68fa60c on January 5, 2023 12:56
codecov bot commented Jan 5, 2023

Codecov Report

Merging #7188 (dde0456) into main (0524142) will decrease coverage by 0.01%.
The diff coverage is 41.97%.

@@            Coverage Diff             @@
##             main    #7188      +/-   ##
==========================================
- Coverage   71.64%   71.63%   -0.01%     
==========================================
  Files        1083     1083              
  Lines      173644   173693      +49     
==========================================
+ Hits       124406   124424      +18     
- Misses      49238    49269      +31     
Flag Coverage Δ
rust 71.63% <41.97%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/common/src/session_config/visibility_mode.rs 79.16% <ø> (ø)
src/frontend/src/handler/flush.rs 0.00% <0.00%> (ø)
src/storage/hummock_test/src/test_utils.rs 94.09% <0.00%> (-1.31%) ⬇️
src/storage/src/hummock/error.rs 24.35% <0.00%> (-0.98%) ⬇️
...src/hummock/event_handler/hummock_event_handler.rs 80.29% <ø> (-0.24%) ⬇️
src/storage/src/monitor/monitored_store.rs 4.85% <0.00%> (-0.08%) ⬇️
src/storage/src/panic_store.rs 0.00% <0.00%> (ø)
src/storage/src/store.rs 69.64% <ø> (ø)
src/storage/src/store_impl.rs 9.39% <0.00%> (-0.20%) ⬇️
src/storage/src/hummock/state_store_v1.rs 69.52% <20.00%> (-0.58%) ⬇️
... and 10 more


@zwang28 zwang28 force-pushed the wangzheng/chore_visibility_mode branch from 68fa60c to 9a6542e on January 5, 2023 14:05
@zwang28 zwang28 changed the title from "chore(batch): set VISIBILITY_MODE to all by default" to "refactor(batch): set VISIBILITY_MODE to all by default" on Jan 5, 2023
@zwang28 zwang28 added the component/storage and component/batch labels on Jan 5, 2023
@zwang28 zwang28 requested a review from wenym1 on January 5, 2023 15:13
@zwang28 zwang28 assigned hzxa21 and then unassigned hzxa21 on Jan 5, 2023
@zwang28 zwang28 requested reviews from xxhZs and hzxa21 on January 5, 2023 15:14
@zwang28 zwang28 marked this pull request as draft on January 6, 2023 13:06
@hzxa21 (Collaborator) commented Jan 10, 2023

Need to implement try_wait_epoch for current_epoch, because the compute node's sealed_epoch is transient (thus the recovery test fails).

Discussed offline: waiting for the current epoch to bump on recovery can still cause a correctness issue. Consider the following case:

  • We have CN [1, 2] and epoch [1, 2, 3] with
    • epoch 1 as the checkpoint epoch
    • epoch 2 as the non-checkpoint epoch
    • epoch 3 as the recovery epoch
  • CN2 restarts after epoch2 and CN1 remains healthy
  • While CN2's recovery is still in progress, the frontend holds epoch2 to read from CN1 and CN2
  • With try_wait_epoch for current_epoch, CN1 will serve the read using data from epoch2 while CN2 will serve the read using data from epoch3 (which is equivalent to serving data from epoch1)

@zwang28 also pointed out that the read results from non-restarted CNs during recovery are also non-deterministic. In the above example, CN1 can serve the read using data from epoch2 or epoch3 depending on the progress of the recovery process.

@zwang28 (Contributor, author) commented Jan 12, 2023

We address the correctness issue #7188 (comment) as follows:

  • Maintain a min_current_epoch in HummockStorage.
    • min_current_epoch is initialized as the latest MCE.
    • min_current_epoch is set to epoch::MAX whenever HummockStorage::clear_shared_buffer is called, before HummockEvent::Clear takes effect. It means recovery is in progress and no current_epoch read is allowed.
    • min_current_epoch is set to the new epoch whenever a checkpoint is sealed AND min_current_epoch == epoch::MAX. It means recovery is finished and valid current_epoch starts from the new epoch.
  • Add a validate_read_epoch, which requires min_current_epoch <= read_current_epoch <= seal_epoch and otherwise raises an error (see the sketch after this list).
  • We call validate_read_epoch after StateStore::get/StateStore::iter in StorageTable, paired with try_wait_epoch.
    • Actually we could make validate_read_epoch private to HummockStorage and call it inside StateStore::get/StateStore::iter after HummockStorage::build_read_version_tuple, but currently StateStore::get/StateStore::iter doesn't know about HummockReadEpoch while validate_read_epoch needs it. I would still prefer to make validate_read_epoch private to HummockStorage rather than expose it via the StateStore trait. @wenym1
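
A minimal sketch of the validate_read_epoch check and its error messages (field names and the use of plain atomics are assumptions for illustration, not the actual HummockStorage code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical, simplified stand-in for the relevant HummockStorage state.
struct EpochState {
    min_current_epoch: AtomicU64,
    seal_epoch: AtomicU64,
}

impl EpochState {
    /// Reject a current-epoch read unless
    /// min_current_epoch <= read_current_epoch <= seal_epoch.
    fn validate_read_epoch(&self, read_current_epoch: u64) -> Result<(), String> {
        let min_current_epoch = self.min_current_epoch.load(Ordering::Acquire);
        let seal_epoch = self.seal_epoch.load(Ordering::Acquire);
        if read_current_epoch > seal_epoch {
            return Err(format!(
                "cannot read current epoch because read epoch {} > sealed epoch {}",
                read_current_epoch, seal_epoch
            ));
        }
        if read_current_epoch < min_current_epoch {
            return Err(format!(
                "cannot read current epoch because read epoch {} < min current epoch {}",
                read_current_epoch, min_current_epoch
            ));
        }
        Ok(())
    }
}
```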

Here are some examples to show how it works (a compact code sketch follows the two cases):

epoch_1 is a checkpoint
epoch_2 is a non-checkpoint
epoch_3 is a non-checkpoint
epoch_4 is the checkpoint emitted by the recovery process

Case 1: compute node that is restarted

  1. The compute node seals epoch_2. min_current_epoch=[some epoch <= epoch_1], seal_epoch=epoch_2
  2. The compute node is restarted.
  3. The compute node is initialized with min_current_epoch=epoch_1, seal_epoch=epoch_1.
  4. epoch_3 is scheduled by meta.
  5. The compute node cannot seal epoch_3 because its actors are not recovered yet.
  6. Recovery starts.
  7. clear_shared_buffer is called, setting min_current_epoch to epoch::MAX.
  8. The compute node seals epoch_4, setting min_current_epoch to epoch_4 and seal_epoch to epoch_4.

Any batch query with epoch_2 after step 2 will be rejected by validate_read_epoch.

  • For steps 2 to 6, the error message will be "cannot read current epoch because read epoch {} > sealed epoch {}".
  • From step 7 onward, the error message will be "cannot read current epoch because read epoch {} < min current epoch {}".

Case 2: compute node that is not restarted

  1. The compute node seals epoch_2. min_current_epoch=[some epoch <= epoch_1], seal_epoch=epoch_2
  2. epoch_3 is scheduled by meta.
  3. The compute node seals epoch_3. min_current_epoch=[some epoch <= epoch_1], seal_epoch=epoch_3
  4. Recovery starts.
  5. clear_shared_buffer is called, setting min_current_epoch to epoch::MAX.
  6. The compute node seals epoch_4, setting min_current_epoch to epoch_4 and seal_epoch to epoch_4.

Any batch query with epoch_2 after step 5 will be rejected by validate_read_epoch.
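
A self-contained replay of the two cases above (the synchronous struct and method names are simplifications for illustration, not the actual HummockStorage code):

```rust
// min_current_epoch / sealed_epoch bookkeeping as described above, modeled
// on a single compute node.
struct ComputeNodeEpochs {
    min_current_epoch: u64,
    sealed_epoch: u64,
}

impl ComputeNodeEpochs {
    /// Recovery in progress: forbid all current-epoch reads.
    fn clear_shared_buffer(&mut self) {
        self.min_current_epoch = u64::MAX;
    }

    /// Seal an epoch. Only a checkpoint sealed while min_current_epoch == MAX
    /// (i.e. the recovery checkpoint) re-enables current-epoch reads.
    fn seal_epoch(&mut self, epoch: u64, is_checkpoint: bool) {
        self.sealed_epoch = epoch;
        if is_checkpoint && self.min_current_epoch == u64::MAX {
            self.min_current_epoch = epoch;
        }
    }

    /// A current-epoch read is valid iff
    /// min_current_epoch <= read_epoch <= sealed_epoch.
    fn validate_read_epoch(&self, read_epoch: u64) -> bool {
        self.min_current_epoch <= read_epoch && read_epoch <= self.sealed_epoch
    }
}

fn main() {
    // Case 1: a restarted compute node re-initializes at checkpoint epoch 1.
    let mut cn = ComputeNodeEpochs { min_current_epoch: 1, sealed_epoch: 1 };
    assert!(!cn.validate_read_epoch(2)); // steps 2-6: read epoch 2 > sealed epoch 1
    cn.clear_shared_buffer(); // step 7: recovery clears the shared buffer
    cn.seal_epoch(4, true); // step 8: recovery checkpoint epoch 4
    assert!(!cn.validate_read_epoch(2)); // read epoch 2 < min current epoch 4
    assert!(cn.validate_read_epoch(4));

    // Case 2: a healthy compute node seals epochs 2 and 3 before recovery.
    let mut cn = ComputeNodeEpochs { min_current_epoch: 1, sealed_epoch: 1 };
    cn.seal_epoch(2, false);
    cn.seal_epoch(3, false);
    assert!(cn.validate_read_epoch(2)); // accepted before recovery
    cn.clear_shared_buffer();
    cn.seal_epoch(4, true);
    assert!(!cn.validate_read_epoch(2)); // rejected after step 5
}
```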

@hzxa21 PTAL

@zwang28 zwang28 marked this pull request as ready for review January 12, 2023 09:01
@zwang28 (Contributor, author) commented Jan 12, 2023

A bit surprised that the recovery test doesn't fail, because a batch query against a restarted compute node with a current epoch < the recovery checkpoint's epoch should fail.
Note that the recovery test auto-retries failed queries several times; maybe the retries hide the failure. Verifying.

@hzxa21 (Collaborator) commented Jan 13, 2023

But currently StateStore::get/StateStore::iter doesn't know about HummockReadEpoch while validate_read_epoch needs it. I prefer to make validate_read_epoch private to HummockStorage, rather than expose it via StateStore trait.

+1. Shall we change the StateStore read interfaces to take HummockReadEpoch instead of u64 then?

@hzxa21 (Collaborator) commented Jan 13, 2023

(quoting the full explanation above: "We address the correctness issue #7188 (comment) as follows: …")

Good idea. Overall it looks good, and it prompts me to think of a potential issue and an improvement:

Issue

IIUC, under this proposal it is possible that we reject a read even when there is no recovery, due to the asynchronous notification of epoch updates. Consider the following example:

epoch_1 is non-checkpoint
epoch_2 is checkpoint
  1. FE uses epoch_1 for a batch read
  2. CN seals epoch_2 and bumps min_current_epoch to epoch_2
  3. The batch read issued in step 1 reaches the CN. validate_read_epoch fails because read_epoch = epoch_1 < min_current_epoch = epoch_2 (sketched below).
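
A tiny hypothetical illustration of this rejection, assuming min_current_epoch were bumped on every sealed checkpoint:

```rust
// Hypothetical illustration: if min_current_epoch were bumped on every sealed
// checkpoint, a read that merely raced with a checkpoint (no recovery at all)
// would be rejected. Simplified, not actual code.
fn validate(min_current_epoch: u64, sealed_epoch: u64, read_epoch: u64) -> bool {
    min_current_epoch <= read_epoch && read_epoch <= sealed_epoch
}

fn main() {
    // 1. FE picks non-checkpoint epoch_1 for a batch read.
    let read_epoch = 1;
    // 2. CN seals checkpoint epoch_2 and bumps min_current_epoch to epoch_2.
    let (min_current_epoch, sealed_epoch) = (2, 2);
    // 3. The read reaches the CN and is rejected despite no recovery.
    assert!(!validate(min_current_epoch, sealed_epoch, read_epoch));
}
```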

Improvement

To solve the above issue, we can simply avoid updating min_current_epoch on every checkpoint. The update rules become:

  • Set min_current_epoch = MCE and sealed_epoch = MCE on CN bootup.
  • Set min_current_epoch = MAX when recovery is in progress (before clear_shared_buffer).
  • Set min_current_epoch = sealed_epoch only on sealing the recovery epoch (guaranteed to happen after clear_shared_buffer).

Let's see whether there is any pitfall and continue the discussion next week.

@zwang28 (Contributor, author) commented Jan 14, 2023

Set min_current_epoch = sealed_epoch only on sealing the recovery epoch (guaranteed to happen after clear_shared_buffer).

Yes, this is what I mean by

min_current_epoch is set to the new epoch whenever a checkpoint is sealed AND min_current_epoch == epoch::MAX.

@hzxa21 (Collaborator) commented Jan 14, 2023

Set min_current_epoch = sealed_epoch only on sealing the recovery epoch (guaranteed to happen after clear_shared_buffer).

Yes, this is what I mean by

min_current_epoch is set to the new epoch whenever a checkpoint is sealed AND min_current_epoch == epoch::MAX.

Ah I see, I misunderstood. Makes sense.

@zwang28 (Contributor, author) commented Jan 15, 2023

A bit surprised that the recovery test doesn't fail, because a batch query against a restarted compute node with a current epoch < the recovery checkpoint's epoch should fail.

Verified: it's just because the recovery test doesn't trigger any case where read current epoch < min_current_epoch or read current epoch > sealed_epoch, i.e. validate_read_epoch never fails.

@hzxa21 hzxa21 (Collaborator) left a review

Rest LGTM. Thanks for the PR!

@@ -257,6 +257,7 @@ impl<S: StateStore> StorageTable<S> {
read_version_from_backup: read_backup,
};
if let Some(value) = self.store.get(&serialized_pk, epoch, read_options).await? {
self.store.validate_read_epoch(wait_epoch)?;

The intention of calling validate_read_epoch after get is to handle the case where min_current_epoch is bumped between L245 and L259. It is a bit subtle. Can we add some comments to explain it?
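
One possible wording for such a comment (hypothetical phrasing that summarizes the rationale above):

```rust
// Validate the epoch again *after* the read: a concurrent recovery may bump
// `min_current_epoch` after the read version was built, in which case the
// data just read may reflect a rolled-back state and the batch query must be
// aborted instead of returning it.
self.store.validate_read_epoch(wait_epoch)?;
```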

let iter = store.iter(range, raw_epoch, read_options).await?;
store.validate_read_epoch(epoch)?;

ditto

@zwang28 (Contributor, author) commented Jan 17, 2023

A bit surprised that the recovery test doesn't fail, because a batch query against a restarted compute node with a current epoch < the recovery checkpoint's epoch should fail.

Verified: it's just because the recovery test doesn't trigger any case where read current epoch < min_current_epoch or read current epoch > sealed_epoch, i.e. validate_read_epoch never fails.

After merging in main, the recovery test triggers both the "current epoch < min_current_epoch" failure and the "current epoch > sealed_epoch" failure as expected.
Fixing the recovery test.

Update:
I believe it's the same issue as #7433, i.e. the frontend doesn't receive the error raised by the compute node, and the query results in a timeout. Trying some workarounds.

@zwang28 zwang28 force-pushed the wangzheng/chore_visibility_mode branch from ab9d46e to 7ef4459 on January 17, 2023 13:37
@zwang28 zwang28 added the user-facing-changes label ("Contains changes that are visible to users") on Jan 18, 2023
@zwang28 (Contributor, author) commented Jan 18, 2023

I believe it's the same issue as #7433, i.e. the frontend doesn't receive the error raised by the compute node, and the query results in a timeout. Trying some workarounds.

As a workaround, we temporarily enforce VISIBILITY_MODE=checkpoint during the recovery test.
We will remove this workaround after the error-propagation issue is fixed for QUERY_MODE=local.

@zwang28 (Contributor, author) commented Jan 18, 2023

Shall we change the StateStore read interfaces to take HummockReadEpoch instead of u64 then?

I suggest addressing this refactoring in another PR; likely lots of test cases would need to be touched.

@zwang28 zwang28 added this to the release-0.1.16 milestone Jan 18, 2023
Labels: component/batch, component/storage, type/refactor, user-facing-changes
Participants: 2