Add some DataFrame method(s) to combine two inputs where the schema can be different #12650

alamb · 2024-09-27T11:36:53Z

Is your feature request related to a problem or challenge?

@ion-elgreco asked in Discord

Does datafusion support a more relaxed Union where the schema can be in a different order? Akin to Polars.concat

The documentation for polars.concat

DataFrame::union requires the inputs to have the same schema

Describe the solution you'd like

No response

Describe alternatives you've considered

`DataFrame::concat`

Add a `DataFrame::concat` method that works like this:

  let df1 = ...; // DataFrame with schema {a: int, b: string}
  let df2 = ...; // DataFrame with schema {b: string, a: int}
  let df3 = df1.concat(df2); // DataFrame with schema {a: int, b: string}, all rows from df1 before df2

Implementing this might be somewhat complicated (as there is no existing LogicalPlan that could do this easily)

One way to implement this would be to add a synthetic column (perhaps `__child_number`) to df1 and df2 and build the plan like this (pseudocode):

let df1 = df1.add_column("__child_number", 1); // add new __child_number column
let df2 = df2.add_column("__child_number", 2); // add new __child_number column
let df3 = df1
  .union_with_reorder_schema(df2) // see below for union with reorder
  .order_by("__child_number")
  .project(...); // remove __child_number column
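The tagging idea above can be illustrated without any DataFusion types. The following is a minimal sketch, assuming rows are plain values and the union may return them in any interleaving; `tag_rows` and `concat_from_union` are hypothetical helper names, not DataFusion APIs. It relies on Rust's stable sort to keep each input's internal order, which a real query engine would not guarantee when sorting on `__child_number` alone.

```rust
/// Tag every row with the number of the input it came from
/// (the `__child_number` idea from the proposal).
/// Hypothetical helper for illustration only.
pub fn tag_rows<T: Clone>(rows: &[T], child_number: u32) -> Vec<(u32, T)> {
    rows.iter().cloned().map(|row| (child_number, row)).collect()
}

/// Given the output of a union (rows from both inputs in arbitrary order),
/// restore "all rows of input 1 before input 2" and drop the tag.
/// `sort_by_key` is a stable sort, so rows with the same tag keep their
/// relative order from the unioned vector.
pub fn concat_from_union<T>(mut unioned: Vec<(u32, T)>) -> Vec<T> {
    unioned.sort_by_key(|(child, _)| *child);
    unioned.into_iter().map(|(_, row)| row).collect()
}
```

Calling `tag_rows` on each input, interleaving the results, and passing them to `concat_from_union` yields the rows of the first input followed by the rows of the second.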

`DataFrame::union_with_reorder_schema`

  let df1 = ...; // DataFrame with schema {a: int, b: string}
  let df2 = ...; // DataFrame with schema {b: string, a: int}
  let df3 = df1.union_with_reorder_schema(df2); // DataFrame with schema {a: int, b: string}, rows from df1 and df2 interleaved (like union)

This could be implemented with just a Projection that reorders the second input's columns to match the first, followed by the existing Union.
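As a rough illustration of the "project, then union" idea, here is a minimal sketch using a toy in-memory table rather than DataFusion's `DataFrame` or `LogicalPlan`. The `Table` type and both function names are hypothetical, and the sketch assumes column names are unique and every row has one cell per column.

```rust
use std::collections::HashMap;

/// Toy stand-in for a DataFrame: column names plus rows of string cells.
/// Hypothetical type for illustration; not a DataFusion API.
#[derive(Debug, Clone, PartialEq)]
pub struct Table {
    pub columns: Vec<String>,
    pub rows: Vec<Vec<String>>,
}

/// For each column of `target`, find its position in `source`.
/// Returns None unless both schemas contain exactly the same unique names.
pub fn projection_indices(target: &[String], source: &[String]) -> Option<Vec<usize>> {
    if target.len() != source.len() {
        return None;
    }
    let positions: HashMap<&str, usize> = source
        .iter()
        .enumerate()
        .map(|(i, name)| (name.as_str(), i))
        .collect();
    let mut used = vec![false; source.len()];
    let mut indices = Vec::with_capacity(target.len());
    for name in target {
        let i = *positions.get(name.as_str())?;
        if used[i] {
            return None; // the same name appears twice in `target`
        }
        used[i] = true;
        indices.push(i);
    }
    Some(indices)
}

/// Reorder `right`'s columns to match `left` (the projection step),
/// then append its rows to `left`'s rows (the union step).
pub fn union_with_reorder(left: &Table, right: &Table) -> Option<Table> {
    let indices = projection_indices(&left.columns, &right.columns)?;
    let mut rows = left.rows.clone();
    for row in &right.rows {
        rows.push(indices.iter().map(|&i| row[i].clone()).collect());
    }
    Some(Table { columns: left.columns.clone(), rows })
}
```

The output keeps the left input's column order, matching the `{a: int, b: string}` result in the example above. Inputs whose column names differ are rejected here rather than padded.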

Change semantics of `DataFrame::union` to do reordering

Another thing we could do is to change the semantics of Union to do the reordering, but that may have unintended consequences

I don't think there is a DataFrame-level implementation of that functionality, though I think it would be straightforward to add (the DataFrame could add a projection to the inputs to rearrange the column order to match).

Additional context

We should double check with our DataFrame experts like @timsaucer and @Omega359 whether this is a reasonable API.


timsaucer · 2024-09-27T17:15:22Z

Are you thinking that this is limited to case of the same schema with different order or should it also include the case of partially overlapping schema? I ask because our internal toolkit does the latter. It would be nice to have, but not necessary.

As far as the API is concerned, keeping it basically as is would be my preferred approach and just updating the internals to reorder if necessary.

austin362667 · 2024-09-27T17:15:57Z

cc @doupache

ion-elgreco · 2024-09-27T17:17:45Z

> Are you thinking that this is limited to case of the same schema with different order or should it also include the case of partially overlapping schema? I ask because our internal toolkit does the latter. It would be nice to have, but not necessary.
>
> As far as the API is concerned, keeping it basically as is would be my preferred approach and just updating the internals to reorder if necessary.

I would argue that partially overlapping schemas should be possible, albeit with a different concat mode. That way we can support schema evolution easily.

doupache · 2024-09-27T17:40:32Z

Both Spark and DuckDB push this even further. They have UNION BY NAME, which has a lot of use cases. I think supporting this could be a first step.

doupache · 2024-09-27T17:40:37Z

take

Omega359 · 2024-09-27T18:05:03Z

I think union_by_name is the core functionality desired here. Here is Spark's method for it: `unionByName(other: Dataset[T], allowMissingColumns: Boolean)`. I suggest this is the functionality that should be worked on.
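To make the by-name semantics concrete, here is a minimal sketch of a union that matches columns by name and optionally fills missing columns with nulls. It uses a toy in-memory `Table` (a hypothetical type, not a DataFusion or Spark API), assumes unique column names, and models nulls as `None`. Placing columns that exist only in the right input after the left input's columns is an assumption of this sketch.

```rust
/// Toy stand-in for a DataFrame with nullable string cells.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct Table {
    pub columns: Vec<String>,
    pub rows: Vec<Vec<Option<String>>>,
}

/// Union two tables by column name.
/// With `allow_missing_columns == false`, both inputs must have the same
/// set of column names. With `true`, a column missing from one input is
/// filled with `None` for that input's rows.
pub fn union_by_name(
    left: &Table,
    right: &Table,
    allow_missing_columns: bool,
) -> Result<Table, String> {
    // Output schema: left's columns, then right-only columns in right's order.
    let mut columns = left.columns.clone();
    for name in &right.columns {
        if !columns.contains(name) {
            columns.push(name.clone());
        }
    }
    let same_names =
        columns.len() == left.columns.len() && left.columns.len() == right.columns.len();
    if !allow_missing_columns && !same_names {
        return Err("inputs have different column names".to_string());
    }
    // Rearrange one table's rows into the output column order.
    let align = |t: &Table| -> Vec<Vec<Option<String>>> {
        let idx: Vec<Option<usize>> = columns
            .iter()
            .map(|c| t.columns.iter().position(|x| x == c))
            .collect();
        t.rows
            .iter()
            .map(|row| idx.iter().map(|&i| i.and_then(|j| row[j].clone())).collect())
            .collect()
    };
    let mut rows = align(left);
    rows.extend(align(right));
    Ok(Table { columns, rows })
}
```

With inputs `{a, b}` and `{b, c}`, the permissive mode produces the schema `{a, b, c}` with nulls where a column was absent, while the strict mode returns an error. Type coercion between same-named columns is out of scope for this sketch.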

doupache · 2024-09-28T04:26:26Z

take

Omega359 · 2024-10-23T15:15:24Z

Have you made progress on this ticket @doupache ? If not I would like to take a stab at it as it would help me in my work project...

doupache · 2024-10-24T07:58:03Z

@Omega359, I'm still working on figuring out how to resolve the type inconsistencies between the data frames. If this is a critical issue for you, I’m open to stepping back and unassigning myself.

Omega359 · 2024-10-24T12:24:24Z

My thought was to behave exactly like union does today in that respect. The docs on union have links to helpers if type coercion is required though:

https://docs.rs/datafusion/latest/datafusion/logical_expr/fn.union.html
https://docs.rs/datafusion-optimizer/latest/datafusion_optimizer/analyzer/type_coercion/fn.coerce_union_schema.html

It's not urgent for me to get this functionality - I was more looking at it as something to do that is a bit closer to core than the work I normally do is all.

Omega359 · 2025-02-21T19:44:43Z

This should be much easier to implement now that #14508 has landed

alamb added the enhancement New feature or request label Sep 27, 2024

alamb changed the title ~~Add some DataFrame::union method for two inputs where the schema can be different~~ Add some DataFrame method(s) to combine two inputs where the schema can be different Sep 27, 2024

github-actions bot assigned doupache Sep 28, 2024

Omega359 mentioned this issue Sep 29, 2024

How to merge multiple data sources and deduplicate based on certain fields? #12532

Open

doupache removed their assignment Oct 24, 2024

ttencate mentioned this issue Nov 7, 2024

DataFrame::union() does not detect schema mismatches #13287

Closed

This was referenced Feb 5, 2025

Implement UNION ALL BY NAME #14508

Closed

[DISCUSSION] 2025 Q1-Q2 Roadmap #14580

Open
