
fix(file source): clear counter splits_on_fetch when updating rate limit #20622

Merged
merged 3 commits into from
Feb 27, 2025

Conversation

wcy-fdu
Contributor

@wcy-fdu wcy-fdu commented Feb 26, 2025

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Close #20593
How to reproduce

CREATE SOURCE s1 (
    id int
) WITH (
    connector = 's3',
    match_pattern = '*.parquet',
    s3.region_name = 'custom',
    s3.bucket_name = 'hummock001',
    s3.credentials.access = 'hummockadmin',
    s3.credentials.secret = 'hummockadmin',
    s3.endpoint_url = 'http://hummock001.127.0.0.1:9301',
    refresh.interval.sec = 1
) FORMAT PLAIN ENCODE PARQUET;

CREATE TABLE d (id int);

CREATE SINK sink1 INTO d AS SELECT id FROM s1 WITH (
    type = 'append-only',
    force_append_only = 'true'
);

ALTER SINK sink1 SET PARALLELISM TO 1;

Upload file 1.
ALTER SOURCE s1 SET source_rate_limit TO 0;
Upload file 2.
ALTER SOURCE s1 SET source_rate_limit TO DEFAULT;
Upload files 3, 4, and 5.

Files 3, 4, and 5 are never read, even though they are recorded in the fetch state table.

Root cause

This bug only occurs when a rate limit is applied to the file source. After the rate limit is updated, we set need_rebuild_reader to true but do not sync the corresponding change to splits_on_fetch. Each call to replace_with_new_batch_reader reads the fetch executor's state table and increments the splits_on_fetch counter, so the pending files end up being counted twice during this window. Because splits_on_fetch then never drops back to zero, subsequent files remain pending.
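For illustration, here is a minimal Rust sketch of the counter logic and the fix. It is not the actual RisingWave executor code: FetchState, on_rate_limit_change, and the pending_files_in_state_table parameter are simplified stand-ins, and only splits_on_fetch, need_rebuild_reader, and replace_with_new_batch_reader come from the description above.

// Simplified model of the fetch executor's bookkeeping (hypothetical names).
struct FetchState {
    /// Number of splits (pending files) currently scheduled for fetching.
    splits_on_fetch: usize,
    /// Set when the rate limit changes and the batch reader must be rebuilt.
    need_rebuild_reader: bool,
}

impl FetchState {
    /// Called when `ALTER SOURCE ... SET source_rate_limit ...` is applied.
    fn on_rate_limit_change(&mut self) {
        self.need_rebuild_reader = true;
        // The fix: clear the counter so that the subsequent
        // `replace_with_new_batch_reader` call, which re-reads the fetch
        // state table and re-counts pending files, does not count them twice.
        self.splits_on_fetch = 0;
    }

    /// Rebuilds the batch reader from the fetch state table.
    fn replace_with_new_batch_reader(&mut self, pending_files_in_state_table: usize) {
        // Before the fix, `splits_on_fetch` still held the old count here,
        // so adding the re-read files doubled the counter; it then never
        // returned to zero and later files stayed pending.
        self.splits_on_fetch += pending_files_in_state_table;
        self.need_rebuild_reader = false;
    }
}

fn main() {
    let mut state = FetchState { splits_on_fetch: 1, need_rebuild_reader: false };
    state.on_rate_limit_change();
    state.replace_with_new_batch_reader(1);
    assert_eq!(state.splits_on_fetch, 1); // without the reset this would be 2
    assert!(!state.need_rebuild_reader);
    println!("splits_on_fetch = {}", state.splits_on_fetch);
}

With the counter cleared in on_rate_limit_change, the re-count performed by replace_with_new_batch_reader becomes the only source of the value, so it can drain back to zero once the files are fetched.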

Checklist

  • I have written necessary rustdoc comments.
  • I have added necessary unit tests and integration tests.
  • I have added test labels as necessary.
  • I have added fuzzing tests or opened an issue to track them.
  • My PR contains breaking changes.
  • My PR changes performance-critical code, so I will run (micro) benchmarks and present the results.
  • My PR contains critical fixes that are necessary to be merged into the latest release.

Documentation

  • My PR needs documentation updates.
Release note

@tabVersion
Contributor

Shall we add a test for it?

Contributor

@kwannoel kwannoel left a comment


Great work! Thanks for finding the root cause

@kwannoel
Contributor

We can add a test as per your repro

@wcy-fdu
Contributor Author

wcy-fdu commented Feb 27, 2025

I will add the test in PR #20236 together with the related test cases.

@wcy-fdu wcy-fdu enabled auto-merge February 27, 2025 08:49
@wcy-fdu wcy-fdu added this pull request to the merge queue Feb 27, 2025
Merged via the queue into main with commit 5ea7863 Feb 27, 2025
29 of 30 checks passed
@wcy-fdu wcy-fdu deleted the wcy/fix_rate_limit branch February 27, 2025 10:26
Development

Successfully merging this pull request may close these issues.

Parquet Source: Alter source_rate_limit=0 sometimes does not resume after altering back to default
3 participants