[FEAT] Delta Lake partitioned writing #2884

kevinzwang · 2024-09-23T00:08:47Z

Some things that I will cover in follow-up PRs:

- split table_io.py up into multiple files
- fix partitioned writes to conform to hive style (binary encoding, string escaping, etc.)

This should not be blocking, since partition values in the Delta log do not need to be Hive-encoded (Spark does not encode them), only stringified. Just don't read the table as a Hive table lol
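To illustrate the difference between stringifying and Hive-style encoding, here is a minimal sketch. It is not the code in this PR: the helper names `stringify_partition_value` and `hive_path_segment` are hypothetical, and percent-encoding via `urllib.parse.quote` only approximates Hive's escaping rules, which use their own character set.

```python
from urllib.parse import quote

# Conventional Hive directory name for a null partition value.
HIVE_NULL = "__HIVE_DEFAULT_PARTITION__"


def stringify_partition_value(value):
    """Plain stringification, as a Delta log partition value would need.

    Hypothetical helper: nulls stay null, booleans are lowercased,
    everything else goes through str() with no escaping.
    """
    if value is None:
        return None
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def hive_path_segment(column, value):
    """Approximate Hive-style `column=value` path segment.

    Hypothetical helper: percent-encodes the stringified value so that
    characters such as '/' and ' ' cannot break the directory layout.
    """
    text = stringify_partition_value(value)
    if text is None:
        return f"{column}={HIVE_NULL}"
    return f"{column}={quote(text, safe='')}"


raw = "a/b c"
print(stringify_partition_value(raw))   # a/b c
print(hive_path_segment("col", raw))    # col=a%2Fb%20c
```

The log entry keeps the raw string, while a Hive reader would expect the escaped form in the path, which is why the table should not be read as a Hive table until the follow-up lands.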

codspeed-hq · 2024-09-23T00:23:37Z

CodSpeed Performance Report

Merging #2884 will not alter performance

Comparing kevin/deltalake-partitioned-writes (2b0c7c7) with main (02b30be)

Summary

✅ 17 untouched benchmarks

codecov · 2024-09-23T18:33:21Z

Codecov Report

Attention: Patch coverage is 78.04878% with 27 lines in your changes missing coverage. Please review.

Project coverage is 66.47%. Comparing base (2ae875f) to head (2b0c7c7).
Report is 15 commits behind head on main.

Files with missing lines	Patch %	Lines
daft/table/table_io.py	75.36%	17 Missing ⚠️
src/daft-plan/src/sink_info.rs	0.00%	9 Missing ⚠️
daft/table/partitioning.py	94.44%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2884      +/-   ##
==========================================
+ Coverage   66.04%   66.47%   +0.42%     
==========================================
  Files        1003     1005       +2     
  Lines      117111   114238    -2873     
==========================================
- Hits        77351    75940    -1411     
+ Misses      39760    38298    -1462

Flag	Coverage Δ
	`66.47% <78.04%> (+0.42%)`	⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines	Coverage Δ
daft/dataframe/dataframe.py	`86.62% <100.00%> (+0.07%)`	⬆️
daft/execution/execution_step.py	`91.99% <100.00%> (+0.01%)`	⬆️
daft/execution/physical_plan.py	`89.30% <ø> (ø)`
daft/execution/rust_physical_plan_shim.py	`91.30% <ø> (ø)`
daft/iceberg/iceberg_write.py	`75.43% <100.00%> (ø)`
daft/logical/builder.py	`89.61% <ø> (ø)`
src/daft-plan/src/builder.rs	`92.94% <100.00%> (+0.04%)`	⬆️
src/daft-plan/src/logical_ops/sink.rs	`61.64% <100.00%> (ø)`
src/daft-scheduler/src/scheduler.rs	`90.84% <100.00%> (+0.01%)`	⬆️
daft/table/partitioning.py	`95.12% <94.44%> (-2.11%)`	⬇️
... and 2 more

... and 116 files with indirect coverage changes

jaychia

Mostly LGTM. Main question is why we need to propagate the partitioning as Exprs down to the write -- isn't Deltalake only ever going to support writing with the identity transform? I.e. we should be able to just propagate string column names to the write?

Also, let's add more tests -- this is going to be hit pretty often by a bunch of users, so we should make sure the behavior is rock-solid for any corner cases
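A rough sketch of the point about identity partitioning, assuming rows are plain dicts: when the only transform is identity, the partition key is just the tuple of values in the named columns, so a list of column-name strings is enough. The function `partition_rows` is hypothetical and is not Daft's implementation; the sample rows include a null value as one of the corner cases worth testing.

```python
from collections import defaultdict


def partition_rows(rows, partition_cols):
    """Group rows by the raw values of the named columns (identity transform).

    Hypothetical helper: `rows` is a list of dicts and `partition_cols`
    is a list of column-name strings; no expressions are evaluated.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in partition_cols)
        groups[key].append(row)
    return dict(groups)


rows = [
    {"country": "US", "year": 2024, "amount": 1},
    {"country": "US", "year": 2023, "amount": 2},
    {"country": None, "year": 2024, "amount": 3},  # null partition value
    {"country": "US", "year": 2024, "amount": 4},
]

groups = partition_rows(rows, ["country", "year"])
for key, members in groups.items():
    print(key, [r["amount"] for r in members])
```

Each key would correspond to one set of partition values, and therefore to one group of data files in a partitioned write.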

daft/dataframe/dataframe.py

daft/table/partitioning.py

src/daft-plan/src/sink_info.rs

tests/io/delta_lake/test_table_write.py

daft/iceberg/iceberg_write.py

daft/table/table_io.py

jaychia

GR8.

[FEAT] Delta Lake partitioned writing

753daea

kevinzwang requested a review from jaychia September 23, 2024 00:08

github-actions bot added the enhancement New feature or request label Sep 23, 2024

fix partition values to str

923f2b6

jaychia reviewed Sep 23, 2024


add more tests and use strings for partition cols

2b0c7c7

kevinzwang requested a review from jaychia September 24, 2024 21:33

jaychia approved these changes Sep 24, 2024


kevinzwang enabled auto-merge (squash) September 24, 2024 21:40

kevinzwang merged commit b519944 into main Sep 24, 2024
40 checks passed

kevinzwang deleted the kevin/deltalake-partitioned-writes branch September 24, 2024 22:10
