Add some DataFrame method(s) to combine two inputs where the schema can be different #12650
Comments
Are you thinking that this is limited to the case of the same schema in a different order, or should it also include the case of partially overlapping schemas? I ask because our internal toolkit does the latter. It would be nice to have, but not necessary. As far as the API is concerned, keeping it basically as is would be my preferred approach, just updating the internals to reorder if necessary.
cc @doupache
I would argue that partially overlapping schemas should be possible, albeit with a different concat mode. This way we can evolve schemas easily.
Both Spark and DuckDB push this even further. They have
take |
I think union_by_name is the core functionality desired here - here is Spark's method for it
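To make the union-by-name idea concrete, here is a minimal sketch of the semantics on a toy in-memory table, including the partially overlapping case discussed above. Everything in it (NullableTable, merged_columns, align, union_by_name) is a hypothetical illustration using only the Rust standard library, not DataFusion's API. It assumes column names are unique within each input and that a column missing from one input is filled with nulls.

```rust
/// Toy table with nullable integer cells.
/// Hypothetical model for illustration only; not a DataFusion type.
#[derive(Debug, Clone, PartialEq)]
pub struct NullableTable {
    pub columns: Vec<String>,
    pub rows: Vec<Vec<Option<i64>>>,
}

/// Output column list: all of `left`'s columns in their original order,
/// followed by any columns that appear only in `right`.
fn merged_columns(left: &[String], right: &[String]) -> Vec<String> {
    let mut out = left.to_vec();
    for name in right {
        if !out.contains(name) {
            out.push(name.clone());
        }
    }
    out
}

/// Rewrite every row of `input` into the `target` column order,
/// filling columns that `input` does not have with `None` (null).
fn align(input: &NullableTable, target: &[String]) -> Vec<Vec<Option<i64>>> {
    // For each target column, the matching position in `input`, if any.
    let indices: Vec<Option<usize>> = target
        .iter()
        .map(|name| input.columns.iter().position(|c| c == name))
        .collect();

    input
        .rows
        .iter()
        .map(|row| indices.iter().map(|&idx| idx.and_then(|i| row[i])).collect())
        .collect()
}

/// Union two tables by matching columns on name rather than position.
pub fn union_by_name(left: &NullableTable, right: &NullableTable) -> NullableTable {
    let columns = merged_columns(&left.columns, &right.columns);
    let mut rows = align(left, &columns);
    rows.extend(align(right, &columns));
    NullableTable { columns, rows }
}
```

With inputs whose columns are (a, b) and (b, c), the result has columns (a, b, c): rows from the first input get a null c, and rows from the second get a null a. Type mismatches between same-named columns are not handled in this sketch.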
take |
Have you made progress on this ticket @doupache? If not, I would like to take a stab at it, as it would help me in my work project...
@Omega359, I'm still working on figuring out how to resolve the type inconsistencies between the data frames. If this is a critical issue for you, I’m open to stepping back and unassigning myself.
My thought was to behave exactly like union does today in that respect. The docs on union have links to helpers if type coercion is required though: https://docs.rs/datafusion/latest/datafusion/logical_expr/fn.union.html It's not urgent for me to get this functionality - I was more looking at it as something to do that is a bit closer to core than the work I normally do, is all.
This should be much easier to implement now that #14508 has landed
Is your feature request related to a problem or challenge?
@ion-elgreco asked in Discord
The documentation for polars.concat
DataFrame::union requires the inputs to have the same schema
Describe the solution you'd like
No response
Describe alternatives you've considered
DataFrame::concat
Add a DataFrame::concat method that works like this
Implementing this might be somewhat complicated (as there is no existing LogicalPlan that could do this easily).
One way to implement this could be to add a fake column (__child_number perhaps) to df1 and df2 and have the plan be
DataFrame::union_with_reorder_schema
Could implement this with just a Projection that reordered the input schemas and then used the existing Union.
Change semantics of DataFrame::union to do reordering
Another thing we could do is to change the semantics of Union to do the reordering, but that may have unintended consequences.
I don't think there is a DataFrame-level implementation of that functionality, though I think it would be straightforward to add (the DataFrame could add a projection to the inputs to rearrange the column order to match).
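The "projection that reorders, then the existing union" alternative can be sketched on a toy in-memory table as follows. This is only an illustration of the idea under stated assumptions: Table, project_to_order, and union_with_reorder are hypothetical names, the code uses only the Rust standard library rather than DataFusion's plan types, and it assumes both inputs have the same set of uniquely named columns.

```rust
/// Toy stand-in for a table: column names plus rows of values.
/// Hypothetical model for illustration only; not a DataFusion type.
#[derive(Debug, Clone, PartialEq)]
pub struct Table {
    pub columns: Vec<String>,
    pub rows: Vec<Vec<i64>>,
}

/// "Projection" step: rearrange `input` so its columns follow `target`.
/// Fails if the column counts differ or a target column is missing.
pub fn project_to_order(input: &Table, target: &[String]) -> Result<Table, String> {
    if input.columns.len() != target.len() {
        return Err(format!(
            "column count mismatch: {} vs {}",
            input.columns.len(),
            target.len()
        ));
    }

    // For each target column, look up its position in the input.
    let mut indices = Vec::with_capacity(target.len());
    for name in target {
        let idx = input
            .columns
            .iter()
            .position(|c| c == name)
            .ok_or_else(|| format!("column {} not found in input", name))?;
        indices.push(idx);
    }

    let rows: Vec<Vec<i64>> = input
        .rows
        .iter()
        .map(|row| indices.iter().map(|&i| row[i]).collect())
        .collect();

    Ok(Table { columns: target.to_vec(), rows })
}

/// Reorder `right` to match `left`, then do a plain positional union.
pub fn union_with_reorder(left: &Table, right: &Table) -> Result<Table, String> {
    let right = project_to_order(right, &left.columns)?;
    let mut rows = left.rows.clone();
    rows.extend(right.rows);
    Ok(Table { columns: left.columns.clone(), rows })
}
```

Keeping the reordering in a separate step leaves the positional union untouched, which is the appeal of this alternative over changing the semantics of Union itself.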
Additional context
We should double check with our DataFrame experts like @timsaucer and @Omega359 whether this is a reasonable API