[FEAT] Delta Lake partitioned writing #2884
CodSpeed Performance Report
Merging #2884 will not alter performance.
Codecov Report
Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #2884      +/-   ##
==========================================
+ Coverage   66.04%   66.47%   +0.42%
==========================================
  Files        1003     1005       +2
  Lines      117111   114238    -2873
==========================================
- Hits        77351    75940    -1411
+ Misses      39760    38298    -1462
Mostly LGTM. Main question: why do we need to propagate the partitioning as Exprs down to the write? Isn't Delta Lake only ever going to support writing with the identity transform? I.e., we should be able to just propagate string column names to the write.
Also, let's add more tests. This is going to be hit pretty often by a bunch of users, so we should make sure the behavior is rock-solid for any corner cases.
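To make the question concrete, here is a minimal sketch of identity-transform partitioning driven only by string column names. It is not the code in this PR: `partition_rows_by_columns` is a hypothetical helper, and rows are modelled as plain dicts rather than Daft tables. It also shows the kind of corner cases (null partition values, missing columns, empty input) that tests would probably want to pin down.

```python
from collections import defaultdict
from typing import Any


def partition_rows_by_columns(
    rows: list[dict[str, Any]], partition_cols: list[str]
) -> dict[tuple, list[dict[str, Any]]]:
    """Group rows by the raw values of the named columns (identity transform).

    Hypothetical helper for illustration only; not part of Daft or deltalake.
    """
    groups: dict[tuple, list[dict[str, Any]]] = defaultdict(list)
    for row in rows:
        # Fail loudly if a partition column is absent instead of silently
        # treating it as null.
        missing = [c for c in partition_cols if c not in row]
        if missing:
            raise KeyError(f"row is missing partition column(s): {missing}")
        # With the identity transform, the partition key is just the tuple
        # of column values, so no expression evaluation is needed.
        key = tuple(row[c] for c in partition_cols)
        groups[key].append(row)
    return dict(groups)


rows = [
    {"country": "US", "year": 2023, "v": 1},
    {"country": "US", "year": 2024, "v": 2},
    {"country": None, "year": 2024, "v": 3},  # null partition value
    {"country": "US", "year": 2023, "v": 4},
]
groups = partition_rows_by_columns(rows, ["country", "year"])
```

In this sketch a null value simply becomes part of the key, so null rows land in their own partition; whether the real writer should behave that way is exactly the sort of thing a test should decide.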
GR8.
Some things that I will cover in follow-up PRs:
- Splitting table_io.py up into multiple files
This should not actually be blocking, since partition values in the delta log do not need to be encoded (Spark does not do so), just stringified. Just don't read it as a hive table lol
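A rough sketch of the distinction drawn here between stringifying partition values for the delta log and encoding them for a hive-style directory path. Both functions are hypothetical illustrations, not Daft or deltalake APIs. Mapping `None` to a null entry and booleans to lowercase strings are assumptions, and `urllib.parse.quote` only approximates Hive's own escaping rules.

```python
from typing import Any, Optional
from urllib.parse import quote


def delta_log_partition_values(values: dict[str, Any]) -> dict[str, Optional[str]]:
    """Stringify partition values for a log entry, with no escaping.

    Assumption: nulls stay null and booleans become "true"/"false".
    """
    out: dict[str, Optional[str]] = {}
    for col, value in values.items():
        if value is None:
            out[col] = None
        elif isinstance(value, bool):
            out[col] = "true" if value else "false"
        else:
            out[col] = str(value)
    return out


def hive_style_path(values: dict[str, Any]) -> str:
    """Build a col=value/... path with percent-encoding of unsafe characters.

    Percent-encoding via urllib is a stand-in for Hive's escaping.
    """
    parts = []
    for col, value in values.items():
        raw = "__HIVE_DEFAULT_PARTITION__" if value is None else str(value)
        parts.append(f"{quote(col, safe='')}={quote(raw, safe='')}")
    return "/".join(parts)


values = {"city": "New York/NY", "year": 2024}
log_values = delta_log_partition_values(values)
path = hive_style_path(values)
```

The log entry keeps `"New York/NY"` verbatim, while the path form has to escape the space and the slash. A reader that infers partition values from the path, as a hive-style reader would, therefore sees different strings than one that reads them from the log.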