
refactor(batch): set VISIBILITY_MODE to all by default #7188

Merged: 14 commits merged into main from wangzheng/chore_visibility_mode on Jan 18, 2023

Conversation

@zwang28 (Contributor) commented Jan 4, 2023

I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.

What's changed and what's your intention?

  • Add validate_read_epoch to the StateStore trait. StorageTable uses it to verify whether any recovery occurred during its read. If so, the read result is invalidated and the batch query is aborted (see the sketch after this list).
  • Set VISIBILITY_MODE to all by default.
  • Implement try_wait_epoch for non-checkpoint epochs.
    • It's necessary because a restarted CN cannot guarantee seal_epoch >= uncommitted_read_epoch for a batch query.
  • Temporarily revert FLUSH to always enforce a checkpoint.
    • We will adopt CHECKPOINT in another PR, because it requires considerable changes to test scripts.
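
A minimal sketch of the intended read path (simplified, hypothetical types and signatures; not the actual RisingWave traits):

```rust
// Sketch only: wait until the epoch is readable, do the read, then re-validate
// the epoch so that a recovery happening mid-read invalidates the result.
type Result<T> = std::result::Result<T, String>;

#[derive(Clone, Copy)]
enum ReadEpoch {
    Committed(u64),
    Current(u64),
}

trait StateStoreLike {
    /// Block until `epoch` is readable on this node.
    fn try_wait_epoch(&self, epoch: ReadEpoch) -> Result<()>;
    /// Plain point get at `raw_epoch`.
    fn get(&self, key: &[u8], raw_epoch: u64) -> Result<Option<Vec<u8>>>;
    /// Fail if a recovery has made `epoch` unreadable.
    fn validate_read_epoch(&self, epoch: ReadEpoch) -> Result<()>;
}

fn batch_point_get<S: StateStoreLike>(
    store: &S,
    key: &[u8],
    read_epoch: ReadEpoch,
    raw_epoch: u64,
) -> Result<Option<Vec<u8>>> {
    store.try_wait_epoch(read_epoch)?; // wait until the epoch is readable
    let value = store.get(key, raw_epoch)?; // perform the read
    store.validate_read_epoch(read_epoch)?; // abort the batch query if recovery occurred
    Ok(value)
}
```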

Checklist

  • I have written necessary rustdoc comments
  - [ ] I have added necessary unit tests and integration tests
  • All checks passed in ./risedev check (or alias, ./risedev c)

Documentation

If your pull request contains user-facing changes, please specify the types of the changes, and create a release note. Otherwise, please feel free to remove this section.

Types of user-facing changes

Please keep the types that apply to your changes, and remove those that do not apply.

  • SQL commands, functions, and operators

Release note

Support the VISIBILITY_MODE session variable, which is set to ALL by default.

Refer to a related PR or issue link (optional)

#6558

@zwang28 posted a comment that was later marked as resolved.

@zwang28 zwang28 force-pushed the wangzheng/chore_visibility_mode branch 2 times, most recently from 2966432 to 68fa60c on January 5, 2023 12:56
codecov bot commented Jan 5, 2023

Codecov Report

Merging #7188 (dde0456) into main (0524142) will decrease coverage by 0.01%.
The diff coverage is 41.97%.

@@            Coverage Diff             @@
##             main    #7188      +/-   ##
==========================================
- Coverage   71.64%   71.63%   -0.01%     
==========================================
  Files        1083     1083              
  Lines      173644   173693      +49     
==========================================
+ Hits       124406   124424      +18     
- Misses      49238    49269      +31     
Flag Coverage Δ
rust 71.63% <41.97%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/common/src/session_config/visibility_mode.rs 79.16% <ø> (ø)
src/frontend/src/handler/flush.rs 0.00% <0.00%> (ø)
src/storage/hummock_test/src/test_utils.rs 94.09% <0.00%> (-1.31%) ⬇️
src/storage/src/hummock/error.rs 24.35% <0.00%> (-0.98%) ⬇️
...src/hummock/event_handler/hummock_event_handler.rs 80.29% <ø> (-0.24%) ⬇️
src/storage/src/monitor/monitored_store.rs 4.85% <0.00%> (-0.08%) ⬇️
src/storage/src/panic_store.rs 0.00% <0.00%> (ø)
src/storage/src/store.rs 69.64% <ø> (ø)
src/storage/src/store_impl.rs 9.39% <0.00%> (-0.20%) ⬇️
src/storage/src/hummock/state_store_v1.rs 69.52% <20.00%> (-0.58%) ⬇️
... and 10 more


@zwang28 zwang28 force-pushed the wangzheng/chore_visibility_mode branch from 68fa60c to 9a6542e on January 5, 2023 14:05
@zwang28 zwang28 changed the title from "chore(batch): set VISIBILITY_MODE to all by default" to "refactor(batch): set VISIBILITY_MODE to all by default" on Jan 5, 2023
@zwang28 zwang28 added the component/storage and component/batch labels on Jan 5, 2023
@zwang28 zwang28 requested a review from wenym1 on January 5, 2023 15:13
@zwang28 zwang28 assigned hzxa21 and then unassigned hzxa21 on Jan 5, 2023
@zwang28 zwang28 requested reviews from xxhZs and hzxa21 on January 5, 2023 15:14
@zwang28 zwang28 marked this pull request as draft on January 6, 2023 13:06
@hzxa21 (Collaborator) commented Jan 10, 2023

Need to implement try_wait_epoch for current_epoch, because the compute node's sealed_epoch is transient (thus the recovery test fails).

Discussed offline: waiting for the current epoch to bump on recovery can still cause a correctness issue. Consider the following case:

  • We have CN [1, 2] and epoch [1, 2, 3] with
    • epoch 1 as the checkpoint epoch
    • epoch 2 as the non-checkpoint epoch
    • epoch 3 as the recovery epoch
  • CN2 restarts after epoch2 and CN1 remains healthy
  • While CN2's recovery is still in progress, the frontend holds epoch2 to read from CN1 and CN2
  • With try_wait_epoch for current_epoch, CN1 will serve the read using data from epoch2 while CN2 will serve the read using data from epoch3 (which is equivalent to serving data from epoch1)

@zwang28 also pointed out that the read results from non-restarted CNs during recovery are also non-deterministic. In the above example, CN1 can serve the read using data from epoch2 or epoch3 depending on the progress of the recovery process.

@zwang28 (Contributor, author) commented Jan 12, 2023

We address the correctness issue #7188 (comment) as follows:

  • Maintain a min_current_epoch in HummockStorage.
    • min_current_epoch is initialized as the latest MCE.
    • min_current_epoch is set to epoch::MAX whenever HummockStorage::clear_shared_buffer is called, before HummockEvent::Clear takes effect. It means recovery is in progress and no current_epoch read is allowed.
    • min_current_epoch is set to the new epoch whenever a checkpoint is sealed AND min_current_epoch == epoch::MAX. It means recovery is finished and valid current_epoch starts from the new epoch.
  • Add a validate_read_epoch, which requires min_current_epoch <= read_current_epoch <= seal_epoch and otherwise raises an error (see the sketch after this list).
  • We call validate_read_epoch after StateStore::get/StateStore::iter in StorageTable, paired with try_wait_epoch.
    • Actually we could make validate_read_epoch private to HummockStorage and call it inside StateStore::get/StateStore::iter after HummockStorage::build_read_version_tuple, but currently StateStore::get/StateStore::iter doesn't know about HummockReadEpoch while validate_read_epoch needs it. I would still prefer to make validate_read_epoch private to HummockStorage rather than expose it via the StateStore trait. @wenym1
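
A minimal sketch of the validate_read_epoch check and its error messages (field names and the use of plain atomics are assumptions for illustration, not the actual HummockStorage code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical, simplified stand-in for the relevant HummockStorage state.
struct EpochState {
    min_current_epoch: AtomicU64,
    seal_epoch: AtomicU64,
}

impl EpochState {
    /// Reject a current-epoch read unless
    /// min_current_epoch <= read_current_epoch <= seal_epoch.
    fn validate_read_epoch(&self, read_current_epoch: u64) -> Result<(), String> {
        let min_current_epoch = self.min_current_epoch.load(Ordering::Acquire);
        let seal_epoch = self.seal_epoch.load(Ordering::Acquire);
        if read_current_epoch > seal_epoch {
            return Err(format!(
                "cannot read current epoch because read epoch {} > sealed epoch {}",
                read_current_epoch, seal_epoch
            ));
        }
        if read_current_epoch < min_current_epoch {
            return Err(format!(
                "cannot read current epoch because read epoch {} < min current epoch {}",
                read_current_epoch, min_current_epoch
            ));
        }
        Ok(())
    }
}
```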

Here are some examples to show how it works (a compact code sketch follows the two cases):

epoch_1 is a checkpoint
epoch_2 is a non-checkpoint
epoch_3 is a non-checkpoint
epoch_4 is the checkpoint emitted by the recovery process

Case 1: compute node that is restarted

  1. The compute node seals epoch_2. min_current_epoch=[some epoch <= epoch_1], seal_epoch=epoch_2
  2. The compute node is restarted.
  3. The compute node is initialized with min_current_epoch=epoch_1, seal_epoch=epoch_1.
  4. epoch_3 is scheduled by meta.
  5. The compute node cannot seal epoch_3 because its actors are not recovered yet.
  6. Recovery starts.
  7. clear_shared_buffer is called, setting min_current_epoch to epoch::MAX.
  8. The compute node seals epoch_4, setting min_current_epoch to epoch_4 and seal_epoch to epoch_4.

Any batch query with epoch_2 after step 2 will be rejected by validate_read_epoch.

  • For steps 2 to 6, the error message will be "cannot read current epoch because read epoch {} > sealed epoch {}".
  • From step 7 onward, the error message will be "cannot read current epoch because read epoch {} < min current epoch {}".

Case 2: compute node that is not restarted

  1. The compute node seals epoch_2. min_current_epoch=[some epoch <= epoch_1], seal_epoch=epoch_2
  2. epoch_3 is scheduled by meta.
  3. The compute node seals epoch_3. min_current_epoch=[some epoch <= epoch_1], seal_epoch=epoch_3
  4. Recovery starts.
  5. clear_shared_buffer is called, setting min_current_epoch to epoch::MAX.
  6. The compute node seals epoch_4, setting min_current_epoch to epoch_4 and seal_epoch to epoch_4.

Any batch query with epoch_2 after step 5 will be rejected by validate_read_epoch.
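
A self-contained replay of the two cases above (the synchronous struct and method names are simplifications for illustration, not the actual HummockStorage code):

```rust
// min_current_epoch / sealed_epoch bookkeeping as described above, modeled
// on a single compute node.
struct ComputeNodeEpochs {
    min_current_epoch: u64,
    sealed_epoch: u64,
}

impl ComputeNodeEpochs {
    /// Recovery in progress: forbid all current-epoch reads.
    fn clear_shared_buffer(&mut self) {
        self.min_current_epoch = u64::MAX;
    }

    /// Seal an epoch. Only a checkpoint sealed while min_current_epoch == MAX
    /// (i.e. the recovery checkpoint) re-enables current-epoch reads.
    fn seal_epoch(&mut self, epoch: u64, is_checkpoint: bool) {
        self.sealed_epoch = epoch;
        if is_checkpoint && self.min_current_epoch == u64::MAX {
            self.min_current_epoch = epoch;
        }
    }

    /// A current-epoch read is valid iff
    /// min_current_epoch <= read_epoch <= sealed_epoch.
    fn validate_read_epoch(&self, read_epoch: u64) -> bool {
        self.min_current_epoch <= read_epoch && read_epoch <= self.sealed_epoch
    }
}

fn main() {
    // Case 1: a restarted compute node re-initializes at checkpoint epoch 1.
    let mut cn = ComputeNodeEpochs { min_current_epoch: 1, sealed_epoch: 1 };
    assert!(!cn.validate_read_epoch(2)); // steps 2-6: read epoch 2 > sealed epoch 1
    cn.clear_shared_buffer(); // step 7: recovery clears the shared buffer
    cn.seal_epoch(4, true); // step 8: recovery checkpoint epoch 4
    assert!(!cn.validate_read_epoch(2)); // read epoch 2 < min current epoch 4
    assert!(cn.validate_read_epoch(4));

    // Case 2: a healthy compute node seals epochs 2 and 3 before recovery.
    let mut cn = ComputeNodeEpochs { min_current_epoch: 1, sealed_epoch: 1 };
    cn.seal_epoch(2, false);
    cn.seal_epoch(3, false);
    assert!(cn.validate_read_epoch(2)); // accepted before recovery
    cn.clear_shared_buffer();
    cn.seal_epoch(4, true);
    assert!(!cn.validate_read_epoch(2)); // rejected after step 5
}
```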

@hzxa21 PTAL

@zwang28 zwang28 marked this pull request as ready for review January 12, 2023 09:01
@zwang28 (Contributor, author) commented Jan 12, 2023

A bit surprised that the recovery test doesn't fail, because a batch query against a restarted compute node with a current epoch < the recovery checkpoint's epoch should fail.
Note that the recovery test auto-retries failed queries several times; maybe the retries hide the failure. Verifying.

@hzxa21 (Collaborator) commented Jan 13, 2023

But currently StateStore::get/StateStore::iter doesn't know about HummockReadEpoch while validate_read_epoch needs it. I prefer to make validate_read_epoch private to HummockStorage, rather than expose it via StateStore trait.

+1. Shall we change the StateStore read interfaces to take HummockReadEpoch instead of u64 then?

@hzxa21 (Collaborator) commented Jan 13, 2023

(quoting the full explanation above: "We address the correctness issue #7188 (comment) as follows: …")

Good idea. Overall it looks good, and it prompts me to think of a potential issue and an improvement:

Issue

IIUC, under this proposal it is possible that we reject a read even when there is no recovery, due to the asynchronous notification of epoch updates. Consider the following example:

epoch_1 is non-checkpoint
epoch_2 is checkpoint
  1. FE uses epoch_1 for a batch read
  2. CN seals epoch_2 and bumps min_current_epoch to epoch_2
  3. The batch read issued in step 1 reaches the CN. validate_read_epoch fails because read_epoch = epoch_1 < min_current_epoch = epoch_2 (sketched below).
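
A tiny hypothetical illustration of this rejection, assuming min_current_epoch were bumped on every sealed checkpoint:

```rust
// Hypothetical illustration: if min_current_epoch were bumped on every sealed
// checkpoint, a read that merely raced with a checkpoint (no recovery at all)
// would be rejected. Simplified, not actual code.
fn validate(min_current_epoch: u64, sealed_epoch: u64, read_epoch: u64) -> bool {
    min_current_epoch <= read_epoch && read_epoch <= sealed_epoch
}

fn main() {
    // 1. FE picks non-checkpoint epoch_1 for a batch read.
    let read_epoch = 1;
    // 2. CN seals checkpoint epoch_2 and bumps min_current_epoch to epoch_2.
    let (min_current_epoch, sealed_epoch) = (2, 2);
    // 3. The read reaches the CN and is rejected despite no recovery.
    assert!(!validate(min_current_epoch, sealed_epoch, read_epoch));
}
```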

Improvement

To solve the above issue, we can simply avoid updating min_current_epoch on every checkpoint. The update rules become:

  • Set min_current_epoch = MCE and sealed_epoch = MCE on CN bootup.
  • Set min_current_epoch = MAX when recovery is in progress (before clear_shared_buffer).
  • Set min_current_epoch = sealed_epoch only on sealing the recovery epoch (guaranteed to happen after clear_shared_buffer).

Let's see whether there is any pitfall and continue the discussion next week.

@zwang28 (Contributor, author) commented Jan 14, 2023

Set min_current_epoch = sealed_epoch only on sealing the recovery epoch (guaranteed to happen after clear_shared_buffer).

Yes, this is what I mean by

min_current_epoch is set to the new epoch whenever a checkpoint is sealed AND min_current_epoch == epoch::MAX.

@hzxa21 (Collaborator) commented Jan 14, 2023

Set min_current_epoch = sealed_epoch only on sealing the recovery epoch (guaranteed to happen after clear_shared_buffer).

Yes, this is what I mean by

min_current_epoch is set to the new epoch whenever a checkpoint is sealed AND min_current_epoch == epoch::MAX.

Ah I see, I misunderstood. Makes sense.

@zwang28 (Contributor, author) commented Jan 15, 2023

A bit surprised that the recovery test doesn't fail, because a batch query against a restarted compute node with a current epoch < the recovery checkpoint's epoch should fail.

Verified: it's just because the recovery test doesn't trigger any case where read current epoch < min_current_epoch or read current epoch > sealed_epoch, i.e. validate_read_epoch never fails.

@hzxa21 hzxa21 (Collaborator) left a review

Rest LGTM. Thanks for the PR!

@@ -257,6 +257,7 @@ impl<S: StateStore> StorageTable<S> {
read_version_from_backup: read_backup,
};
if let Some(value) = self.store.get(&serialized_pk, epoch, read_options).await? {
self.store.validate_read_epoch(wait_epoch)?;

The intention of calling validate_read_epoch after get is to handle the case where min_current_epoch is bumped between L245 and L259. It is a bit subtle. Can we add some comments to explain it?
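
One possible wording for such a comment (hypothetical phrasing that summarizes the rationale above):

```rust
// Validate the epoch again *after* the read: a concurrent recovery may bump
// `min_current_epoch` after the read version was built, in which case the
// data just read may reflect a rolled-back state and the batch query must be
// aborted instead of returning it.
self.store.validate_read_epoch(wait_epoch)?;
```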

let iter = store.iter(range, raw_epoch, read_options).await?;
store.validate_read_epoch(epoch)?;

ditto

@zwang28 (Contributor, author) commented Jan 17, 2023

A bit surprised that the recovery test doesn't fail, because a batch query against a restarted compute node with a current epoch < the recovery checkpoint's epoch should fail.

Verified: it's just because the recovery test doesn't trigger any case where read current epoch < min_current_epoch or read current epoch > sealed_epoch, i.e. validate_read_epoch never fails.

After merging in main, the recovery test triggers both the "current epoch < min_current_epoch" failure and the "current epoch > sealed_epoch" failure as expected.
Fixing the recovery test.

Update:
I believe it's the same issue as #7433, i.e. the frontend doesn't receive the error raised by the compute node, and the query results in a timeout. Trying some workarounds.

@zwang28 zwang28 force-pushed the wangzheng/chore_visibility_mode branch from ab9d46e to 7ef4459 on January 17, 2023 13:37
@zwang28 zwang28 added the user-facing-changes label ("Contains changes that are visible to users") on Jan 18, 2023
@zwang28 (Contributor, author) commented Jan 18, 2023

I believe it's the same issue as #7433, i.e. the frontend doesn't receive the error raised by the compute node, and the query results in a timeout. Trying some workarounds.

As a workaround, we temporarily enforce VISIBILITY_MODE=checkpoint during the recovery test.
We will remove this workaround after the error-propagation issue is fixed for QUERY_MODE=local.

@zwang28 (Contributor, author) commented Jan 18, 2023

Shall we change the StateStore read interfaces to take HummockReadEpoch instead of u64 then?

I suggest addressing this refactoring in another PR; likely lots of test cases would need to be touched.

@zwang28 zwang28 added this to the release-0.1.16 milestone Jan 18, 2023
Labels: component/batch, component/storage, type/refactor, user-facing-changes
Participants: 2