-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for Utf8View, Boolean, Date32/64, int32/64 for writing hive style partitions #12283
Conversation
…t hive style partitioning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Omega359 and @devinjdangelo and I apoligize for the delay in reviewing this PR 😢
My only concern is about the seeming change to allocate a new String
for each row. Otherwise this PR looks great ✨
@@ -320,9 +324,11 @@ async fn hive_style_partitions_demuxer( | |||
fn compute_partition_keys_by_row<'a>( | |||
rb: &'a RecordBatch, | |||
partition_by: &'a [(String, DataType)], | |||
) -> Result<Vec<Vec<&'a str>>> { | |||
) -> Result<Vec<Vec<String>>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 I wonder if computing new strings for each row will be unnecessarily slow 🤔 The current code only allocates a string for each distinct partition value (in the final take map) but this code now creates a new string for each row in the output record batch, just to match them up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vaguely recall I had to make this change because of an issue with borrowing temp values with the dates but I'd have to switch back to see exactly where the cause was. You are obviously correct though in that there is a fairly obvious overhead to using String. I'll take another look and see if there is something I can come up with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you could use Cow
to avoid causing a regression for StringArrays
🤔
I think it would be fine if existing functionality was unaffected but new features (aka partitioning on newly supported types) was not as fast as it could be
We could treat the subsequent optimization of such new features as a follow on project
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to use Cow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love it -- thank you @Omega359
cc @devinjdangelo in case you have additional thoughts
@@ -320,9 +325,11 @@ async fn hive_style_partitions_demuxer( | |||
fn compute_partition_keys_by_row<'a>( | |||
rb: &'a RecordBatch, | |||
partition_by: &'a [(String, DataType)], | |||
) -> Result<Vec<Vec<&'a str>>> { | |||
) -> Result<Vec<Vec<Cow<'a, str>>>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️ 🐮
Thanks again @Omega359 |
Which issue does this PR close?
Closes #12221
Rationale for this change
Support hive style partitions for additional data types.
What changes are included in this PR?
Code, slt test coverage.
Are these changes tested?
Yes
Are there any user-facing changes?
No