Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] Delta Lake partitioned writing #2884

Merged
merged 3 commits into from
Sep 24, 2024

Conversation

kevinzwang
Copy link
Member

@kevinzwang kevinzwang commented Sep 23, 2024

Some things that I will cover in follow-up PRs:

  • split table_io.py up into multiple files
  • fix partitioned writes to conform to hive style (binary encoding, string escaping, etc)

This should not actually be blocking since partition values in the delta log do not actually need to be encoded (Spark does not do so), just stringified. Just don't read it as a hive table lol

@kevinzwang kevinzwang requested a review from jaychia September 23, 2024 00:08
@github-actions github-actions bot added the enhancement New feature or request label Sep 23, 2024
Copy link

codspeed-hq bot commented Sep 23, 2024

CodSpeed Performance Report

Merging #2884 will not alter performance

Comparing kevin/deltalake-partitioned-writes (2b0c7c7) with main (02b30be)

Summary

✅ 17 untouched benchmarks

Copy link

codecov bot commented Sep 23, 2024

Codecov Report

Attention: Patch coverage is 78.04878% with 27 lines in your changes missing coverage. Please review.

Project coverage is 66.47%. Comparing base (2ae875f) to head (2b0c7c7).
Report is 15 commits behind head on main.

Files with missing lines Patch % Lines
daft/table/table_io.py 75.36% 17 Missing ⚠️
src/daft-plan/src/sink_info.rs 0.00% 9 Missing ⚠️
daft/table/partitioning.py 94.44% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2884      +/-   ##
==========================================
+ Coverage   66.04%   66.47%   +0.42%     
==========================================
  Files        1003     1005       +2     
  Lines      117111   114238    -2873     
==========================================
- Hits        77351    75940    -1411     
+ Misses      39760    38298    -1462     
Flag Coverage Δ
66.47% <78.04%> (+0.42%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
daft/dataframe/dataframe.py 86.62% <100.00%> (+0.07%) ⬆️
daft/execution/execution_step.py 91.99% <100.00%> (+0.01%) ⬆️
daft/execution/physical_plan.py 89.30% <ø> (ø)
daft/execution/rust_physical_plan_shim.py 91.30% <ø> (ø)
daft/iceberg/iceberg_write.py 75.43% <100.00%> (ø)
daft/logical/builder.py 89.61% <ø> (ø)
src/daft-plan/src/builder.rs 92.94% <100.00%> (+0.04%) ⬆️
src/daft-plan/src/logical_ops/sink.rs 61.64% <100.00%> (ø)
src/daft-scheduler/src/scheduler.rs 90.84% <100.00%> (+0.01%) ⬆️
daft/table/partitioning.py 95.12% <94.44%> (-2.11%) ⬇️
... and 2 more

... and 116 files with indirect coverage changes

Copy link
Contributor

@jaychia jaychia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM. Main question is why we need to propagate the partitioning as Exprs down to the write -- isn't Deltalake only every going to support writing with the identity transform? I.e. we should be able to just propagate string column names to the write?

Also, let's add more tests -- this is going to be hit pretty often from a bunch of users we should make sure the behavior is rock-solid for any corner-cases

@kevinzwang kevinzwang requested a review from jaychia September 24, 2024 21:33
Copy link
Contributor

@jaychia jaychia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GR8.

@kevinzwang kevinzwang enabled auto-merge (squash) September 24, 2024 21:40
@kevinzwang kevinzwang merged commit b519944 into main Sep 24, 2024
40 checks passed
@kevinzwang kevinzwang deleted the kevin/deltalake-partitioned-writes branch September 24, 2024 22:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants