Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: incorrect NATURAL/USING JOIN schema #14102

Merged
merged 5 commits into from
Jan 14, 2025
Merged

Conversation

jonahgao
Copy link
Member

@jonahgao jonahgao commented Jan 13, 2025

Which issue does this PR close?

Closes #14058.

Rationale for this change

When expanding unqualified wildcard over a natural/using join, it should deduplicate the columns specified in the join conditions. For example, select * from t t1 join t t2 using(a) should output the column a only once.

We have already done this in ExpandWildcardRule, and this PR re-implements it when computing plan schemas.

What changes are included in this PR?

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions labels Jan 13, 2025
/// For each column specified in the USING JOIN condition, the JOIN plan outputs it twice
/// (once for each join side), but an unqualified wildcard should include it only once.
/// This function returns the columns that should be excluded.
fn exclude_using_columns(plan: &LogicalPlan) -> Result<HashSet<Column>> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is extracted from expand_wildcard, so that we can reuse it in exprlist_to_fields.

@@ -705,27 +711,20 @@ pub fn exprlist_to_fields<'a>(
.map(|e| match e {
Expr::Wildcard { qualifier, options } => match qualifier {
Copy link
Member Author

@jonahgao jonahgao Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although we have moved wildcard expansions to the analyzer #11681, it still does wildcard expansions when computing plan schemas(in exprlist_to_fields and exprlist_len). I wonder if performing wildcard expansions before computing schemas would be simplier, at least it would avoid duplicated work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another issue is that in exprlist_to_fields, we don't handle replace items, but ExpandWildcardRule does.
select * replace ('foo' as a) from t will give the wrong datatype in schema before executing analyzer.

/// For each column specified in the USING JOIN condition, the JOIN plan outputs it twice
/// (once for each join side), but an unqualified wildcard should include it only once.
/// This function returns the columns that should be excluded.
fn exclude_using_columns(plan: &LogicalPlan) -> Result<HashSet<Column>> {
let using_columns = plan.using_columns()?;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using_columns() finds join condition columns by traversing the plan tree. This manner might be unsafe as it could incorrectly find columns that are not relevant to the current SQL context. For example, the result of the query below is different from other databases.

create table t(a int);
insert into t values(1),(2),(3);
select * from (select t.a+2 as a from t join t t2 using(a)) as t2;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this something we should file a ticket to track?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this something we should file a ticket to track?

@alamb Filed #14118

Copy link
Contributor

@DDtKey DDtKey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for quickly addressing the issue and preparing the PR! 🙏 ❤️

I haven’t had a chance to dive deeply into the code yet, but I do have one request: improving test coverage, as I believe it’s really important.

Comment on lines 4558 to 4582
#[test]
fn test_using_join_wildcard_schema() {
let sql = "SELECT * FROM orders o1 JOIN orders o2 USING (order_id)";
let plan = logical_plan(sql).unwrap();
let count = plan
.schema()
.iter()
.filter(|(_, f)| f.name() == "order_id")
.count();
// Only one order_id column
assert_eq!(count, 1);

let sql = "SELECT * FROM orders o1 NATURAL JOIN orders o2";
let plan = logical_plan(sql).unwrap();
// Only columns from one join side should be present
let expected_fields = vec![
"o1.order_id".to_string(),
"o1.customer_id".to_string(),
"o1.o_item_id".to_string(),
"o1.qty".to_string(),
"o1.price".to_string(),
"o1.delivered".to_string(),
];
assert_eq!(plan.schema().field_names(), expected_fields);
}
Copy link
Contributor

@DDtKey DDtKey Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd insist on better test coverage:

Because otherwise, another regression may happen easily

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 789e9f9

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me -- thank you @jonahgao and @DDtKey

I verified that the test fails without this code change, so from that perspective this PR is strictly better than main so in my opinion this PR could be merged in as is

However, I very much think @DDtKey 's comment about more testing is important https://github.com/apache/datafusion/pull/14102/files#r1913426831

Though I do think we can do it as a follow on PR as well

Here is how the test fails

assertion `left == right` failed
  left: 2
 right: 1

Left:  2
Right: 1
<Click to see difference>

thread 'test_using_join_wildcard_schema' panicked at datafusion/sql/tests/sql_integration.rs:4568:5:
assertion `left == right` failed
  left: 2
 right: 1
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/library/std/src/panicking.rs:665:5
   1: core::panicking::panic_fmt
             at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/library/core/src/panicking.rs:76:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /Users/andrewlamb/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/panicking.rs:373:5
   4: sql_integration::test_using_join_wildcard_schema
             at ./tests/sql_integration.rs:4568:5
   5: sql_integration::test_using_join_wildcard_schema::{{closure}}
             at ./tests/sql_integration.rs:4559:37
   6: core::ops::function::FnOnce::call_once
             at /Users/andrewlamb/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
   7: core::ops::function::FnOnce::call_once
             at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


/// For each column specified in the USING JOIN condition, the JOIN plan outputs it twice
/// (once for each join side), but an unqualified wildcard should include it only once.
/// This function returns the columns that should be excluded.
fn exclude_using_columns(plan: &LogicalPlan) -> Result<HashSet<Column>> {
let using_columns = plan.using_columns()?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this something we should file a ticket to track?

@alamb alamb merged commit 02c8247 into apache:main Jan 14, 2025
25 checks passed
@alamb
Copy link
Contributor

alamb commented Jan 14, 2025

Thanks @jonahgao and @DDtKey

@jonahgao
Copy link
Member Author

Thanks @DDtKey @alamb for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression: DataFrame::schema returns incorrect schema for NATURAL JOIN
3 participants