fix: incorrect NATURAL/USING JOIN schema #14102

jonahgao · 2025-01-13T06:54:51Z

Which issue does this PR close?

Rationale for this change

When expanding an unqualified wildcard over a NATURAL/USING join, the columns specified in the join condition should be deduplicated. For example, select * from t t1 join t t2 using(a) should output the column a only once.

We have already done this in ExpandWildcardRule, and this PR re-implements it when computing plan schemas.
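
The deduplication described above can be sketched in plain Rust. This is a minimal illustration only: `Col`, `excluded_columns`, `expand_unqualified_wildcard`, and `using_join_example` are hypothetical names, not DataFusion types or functions, and the table `t` is assumed to have columns `a` and `b`. Which side's copy survives is decided here by sorting, which is an assumption of the sketch.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a qualified column (not DataFusion's `Column`).
#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
struct Col {
    relation: String,
    name: String,
}

impl Col {
    fn new(relation: &str, name: &str) -> Self {
        Col { relation: relation.to_string(), name: name.to_string() }
    }
}

/// Each inner set holds the columns equated by one USING/NATURAL condition.
/// Keep one column per name (the smallest after sorting) and return the rest.
fn excluded_columns(using_sets: &[HashSet<Col>]) -> HashSet<Col> {
    let mut excluded = HashSet::new();
    for set in using_sets {
        let mut cols: Vec<&Col> = set.iter().collect();
        cols.sort(); // deterministic choice of which side survives
        let mut seen_names: HashSet<&str> = HashSet::new();
        for c in cols {
            // `insert` returns false when this name has already been kept once.
            if !seen_names.insert(c.name.as_str()) {
                excluded.insert(c.clone());
            }
        }
    }
    excluded
}

/// Expand an unqualified `*` over the join's output columns.
fn expand_unqualified_wildcard(join_output: &[Col], using_sets: &[HashSet<Col>]) -> Vec<Col> {
    let excluded = excluded_columns(using_sets);
    join_output.iter().filter(|c| !excluded.contains(*c)).cloned().collect()
}

/// `select * from t t1 join t t2 using(a)`, assuming `t` has columns `a` and `b`.
fn using_join_example() -> Vec<String> {
    let join_output = vec![
        Col::new("t1", "a"),
        Col::new("t1", "b"),
        Col::new("t2", "a"),
        Col::new("t2", "b"),
    ];
    let using_sets = vec![HashSet::from([Col::new("t1", "a"), Col::new("t2", "a")])];
    expand_unqualified_wildcard(&join_output, &using_sets)
        .iter()
        .map(|c| format!("{}.{}", c.relation, c.name))
        .collect()
}
```

Under these assumptions, `using_join_example()` yields t1.a, t1.b, t2.b: the join outputs a twice, and the wildcard keeps it once.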

What changes are included in this PR?

Are these changes tested?

Yes

Are there any user-facing changes?

No

jonahgao · 2025-01-13T06:56:33Z

datafusion/expr/src/utils.rs

+/// For each column specified in the USING JOIN condition, the JOIN plan outputs it twice
+/// (once for each join side), but an unqualified wildcard should include it only once.
+/// This function returns the columns that should be excluded.
+fn exclude_using_columns(plan: &LogicalPlan) -> Result<HashSet<Column>> {


This function is extracted from expand_wildcard, so that we can reuse it in exprlist_to_fields.

jonahgao · 2025-01-13T07:08:13Z

datafusion/expr/src/utils.rs

@@ -705,27 +711,20 @@ pub fn exprlist_to_fields<'a>(
        .map(|e| match e {
            Expr::Wildcard { qualifier, options } => match qualifier {


Although we have moved wildcard expansion to the analyzer (#11681), wildcard expansion still happens when computing plan schemas (in exprlist_to_fields and exprlist_len). I wonder if performing wildcard expansion before computing schemas would be simpler; at the least, it would avoid duplicated work.

Another issue is that exprlist_to_fields doesn't handle replace items, but ExpandWildcardRule does.
For example, select * replace ('foo' as a) from t will give the wrong data type in the schema before the analyzer runs.

jonahgao · 2025-01-13T07:17:01Z

datafusion/expr/src/utils.rs

+/// For each column specified in the USING JOIN condition, the JOIN plan outputs it twice
+/// (once for each join side), but an unqualified wildcard should include it only once.
+/// This function returns the columns that should be excluded.
+fn exclude_using_columns(plan: &LogicalPlan) -> Result<HashSet<Column>> {
    let using_columns = plan.using_columns()?;


using_columns() finds join condition columns by traversing the plan tree. This approach might be unsafe, as it could pick up columns that are not relevant to the current SQL context. For example, the result of the query below differs from other databases.

create table t(a int); insert into t values(1),(2),(3); select * from (select t.a+2 as a from t join t t2 using(a)) as t2;
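
The concern can be illustrated with a toy model. Everything here is hypothetical (`Plan`, `using_duplicates`, `expand_wildcard`, `query_from_thread` are not DataFusion APIs), and the sketch does not claim to reproduce DataFusion's exact output for the query. It only shows how a traversal that crosses a subquery boundary can exclude an outer column that merely shares a qualified name with an inner USING column. The non-crossing mode is included for contrast and is not presented as the fix for this problem.

```rust
use std::collections::HashSet;

/// (relation, name): a toy qualified column, not DataFusion's `Column`.
type Col = (String, String);

fn col(relation: &str, name: &str) -> Col {
    (relation.to_string(), name.to_string())
}

/// Toy plan nodes, far simpler than DataFusion's `LogicalPlan`.
enum Plan {
    Scan { alias: String, columns: Vec<String> },
    /// `left JOIN right USING(...)`; `dup` lists the right-side copies of the USING columns.
    UsingJoin { left: Box<Plan>, right: Box<Plan>, dup: Vec<Col> },
    /// `(SELECT ... AS output FROM input) AS alias`
    Subquery { alias: String, output: Vec<String>, input: Box<Plan> },
}

impl Plan {
    fn schema(&self) -> Vec<Col> {
        match self {
            Plan::Scan { alias, columns } => columns.iter().map(|c| col(alias, c)).collect(),
            Plan::Subquery { alias, output, .. } => output.iter().map(|c| col(alias, c)).collect(),
            Plan::UsingJoin { left, right, .. } => {
                let mut s = left.schema();
                s.extend(right.schema());
                s
            }
        }
    }

    /// Collects duplicate USING columns from the joins below this node.
    /// `cross_subqueries` controls whether the walk descends into subqueries.
    fn using_duplicates(&self, cross_subqueries: bool) -> HashSet<Col> {
        match self {
            Plan::Scan { .. } => HashSet::new(),
            Plan::UsingJoin { left, right, dup } => {
                let mut out: HashSet<Col> = dup.iter().cloned().collect();
                out.extend(left.using_duplicates(cross_subqueries));
                out.extend(right.using_duplicates(cross_subqueries));
                out
            }
            Plan::Subquery { input, .. } if cross_subqueries => input.using_duplicates(true),
            Plan::Subquery { .. } => HashSet::new(),
        }
    }

    fn expand_wildcard(&self, cross_subqueries: bool) -> Vec<Col> {
        let excluded = self.using_duplicates(cross_subqueries);
        self.schema().into_iter().filter(|c| !excluded.contains(c)).collect()
    }
}

/// select * from (select t.a+2 as a from t join t t2 using(a)) as t2
fn query_from_thread() -> Plan {
    let scan = |alias: &str| Plan::Scan { alias: alias.to_string(), columns: vec!["a".to_string()] };
    Plan::Subquery {
        alias: "t2".to_string(),
        output: vec!["a".to_string()],
        input: Box::new(Plan::UsingJoin {
            left: Box::new(scan("t")),
            right: Box::new(scan("t2")),
            dup: vec![col("t2", "a")],
        }),
    }
}
```

In this toy, `query_from_thread().expand_wildcard(true)` returns no columns, because the outer t2.a collides with the inner join's t2.a, while `expand_wildcard(false)` returns the single column t2.a.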

is this something we should file a ticket to track?

@alamb Filed #14118

DDtKey

Thank you for quickly addressing the issue and preparing the PR! 🙏 ❤️

I haven’t had a chance to dive deeply into the code yet, but I do have one request: improving test coverage, as I believe it’s really important.

DDtKey · 2025-01-13T15:59:22Z

datafusion/sql/tests/sql_integration.rs

+#[test]
+fn test_using_join_wildcard_schema() {
+    let sql = "SELECT * FROM orders o1 JOIN orders o2 USING (order_id)";
+    let plan = logical_plan(sql).unwrap();
+    let count = plan
+        .schema()
+        .iter()
+        .filter(|(_, f)| f.name() == "order_id")
+        .count();
+    // Only one order_id column
+    assert_eq!(count, 1);
+
+    let sql = "SELECT * FROM orders o1 NATURAL JOIN orders o2";
+    let plan = logical_plan(sql).unwrap();
+    // Only columns from one join side should be present
+    let expected_fields = vec![
+        "o1.order_id".to_string(),
+        "o1.customer_id".to_string(),
+        "o1.o_item_id".to_string(),
+        "o1.qty".to_string(),
+        "o1.price".to_string(),
+        "o1.delivered".to_string(),
+    ];
+    assert_eq!(plan.schema().field_names(), expected_fields);
+}


I'd insist on better test coverage:

A test case where the output (expected fields) contains at least one column from the second table, as in the MRE from "Regression: DataFrame::schema returns incorrect schema for NATURAL JOIN" #14058

A more complex select, e.g. with WITH or a subselect

A join of more than two tables

(?)

Otherwise, another regression could easily happen.

Added in 789e9f9

alamb

This makes sense to me -- thank you @jonahgao and @DDtKey

I verified that the test fails without this code change, so from that perspective this PR is strictly better than main. In my opinion, it could be merged as is.

However, I very much think @DDtKey's comment about more testing is important: https://github.com/apache/datafusion/pull/14102/files#r1913426831

Though I do think we can do that in a follow-on PR as well.

Here is how the test fails:

thread 'test_using_join_wildcard_schema' panicked at datafusion/sql/tests/sql_integration.rs:4568:5:
assertion `left == right` failed
  left: 2
 right: 1
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/library/std/src/panicking.rs:665:5
   1: core::panicking::panic_fmt
             at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/library/core/src/panicking.rs:76:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /Users/andrewlamb/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/panicking.rs:373:5
   4: sql_integration::test_using_join_wildcard_schema
             at ./tests/sql_integration.rs:4568:5
   5: sql_integration::test_using_join_wildcard_schema::{{closure}}
             at ./tests/sql_integration.rs:4559:37
   6: core::ops::function::FnOnce::call_once
             at /Users/andrewlamb/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
   7: core::ops::function::FnOnce::call_once
             at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

alamb · 2025-01-14T01:40:13Z

datafusion/expr/src/utils.rs

+/// For each column specified in the USING JOIN condition, the JOIN plan outputs it twice
+/// (once for each join side), but an unqualified wildcard should include it only once.
+/// This function returns the columns that should be excluded.
+fn exclude_using_columns(plan: &LogicalPlan) -> Result<HashSet<Column>> {
    let using_columns = plan.using_columns()?;


is this something we should file a ticket to track?

alamb · 2025-01-14T21:06:43Z

Thanks @jonahgao and @DDtKey

jonahgao · 2025-01-15T01:37:18Z

Thanks @DDtKey @alamb for the review.

jonahgao added 4 commits January 12, 2025 23:49

fix: incorrect NATURAL/USING JOIN schema

48a8e6b

Add test

7b2cdd2

Merge branch 'main' into using_schema

6de6b5f

Simplify exclude_using_columns

efca753

github-actions bot added the sql (SQL Planner) and logical-expr (Logical plan and expressions) labels Jan 13, 2025

jonahgao commented Jan 13, 2025

DDtKey reviewed Jan 13, 2025

alamb approved these changes Jan 14, 2025

Add more tests

789e9f9

DDtKey approved these changes Jan 14, 2025

alamb merged commit 02c8247 into apache:main Jan 14, 2025
25 checks passed

jonahgao deleted the using_schema branch January 15, 2025 01:37

alamb mentioned this pull request Jan 18, 2025

Jan 18, 2025: This week(s) in DataFusion #14179

Closed

jonahgao mentioned this pull request Feb 11, 2025

Create UNION plan node with correct schema #14380

Open
