Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: async writer + multi-part #3255

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

ion-elgreco
Copy link
Collaborator

Description

Changed to use arrows async writer and use multi-part uploads.

Related Issue(s)

Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Feb 23, 2025
@ion-elgreco ion-elgreco force-pushed the refactor/writer-async branch from fc1ae86 to a1787c9 Compare February 23, 2025 16:14
Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
@ion-elgreco ion-elgreco force-pushed the refactor/writer-async branch from a1787c9 to 2353a21 Compare February 23, 2025 16:20
Copy link

codecov bot commented Feb 23, 2025

Codecov Report

Attention: Patch coverage is 64.89362% with 33 lines in your changes missing coverage. Please review.

Project coverage is 72.17%. Comparing base (666179e) to head (d9e79c5).

Files with missing lines Patch % Lines
crates/core/src/operations/write/writer.rs 62.00% 8 Missing and 11 partials ⚠️
crates/core/src/operations/write/async_utils.rs 68.18% 13 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3255      +/-   ##
==========================================
- Coverage   72.18%   72.17%   -0.01%     
==========================================
  Files         143      144       +1     
  Lines       45629    45709      +80     
  Branches    45629    45709      +80     
==========================================
+ Hits        32936    32992      +56     
- Misses      10621    10640      +19     
- Partials     2072     2077       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines -379 to +408
self.object_store.put(&path, buffer.into()).await?;
let mut multi_part_upload = self.object_store.put_multipart(&path).await?;
let part_size = upload_part_size();
let mut tasks = JoinSet::new();
let max_concurrent_tasks = 10; // TODO: make configurable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really such a major behavior change. I am not terribly familiar with the maturity level of multipart uploads in object_store I don't think this is necessarily a bad change, but I am doubtful of this addressing the originally linked issue.

As best as I can tell the buffers are still going to fill up memory until the flush, and then the flush is going to fan out to have parallel uploads

flowchart LR
    write --> buffer_batch;
    write --> buffer_batch;
    write --> buffer_batch;
    buffer_batch --> flush;
    flush --> p1;
    flush --> p2;
    flush --> p3;
    flush --> p4;
    p1 --> close;
    p2 --> close;
    p3 --> close;
    p4 --> close;
Loading

Copy link
Collaborator Author

@ion-elgreco ion-elgreco Feb 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, in its current form we indeed still buffer longer until the flush, but the buffering and writing has more parallelism now, so it should be faster.

We could do two things here btw:

  • release this as is for people to experiment with (maybe after 1.0)
  • iterate on this to also flush after the min - part size is available, also create some benchmark tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use async writer + multipart + explore Datafusion sink
2 participants