-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: incorrect NATURAL/USING JOIN schema #14102
Conversation
/// For each column specified in the USING JOIN condition, the JOIN plan outputs it twice | ||
/// (once for each join side), but an unqualified wildcard should include it only once. | ||
/// This function returns the columns that should be excluded. | ||
fn exclude_using_columns(plan: &LogicalPlan) -> Result<HashSet<Column>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is extracted from expand_wildcard
, so that we can reuse it in exprlist_to_fields
.
@@ -705,27 +711,20 @@ pub fn exprlist_to_fields<'a>( | |||
.map(|e| match e { | |||
Expr::Wildcard { qualifier, options } => match qualifier { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although we have moved wildcard expansions to the analyzer #11681, it still does wildcard expansions when computing plan schemas(in exprlist_to_fields and exprlist_len). I wonder if performing wildcard expansions before computing schemas would be simplier, at least it would avoid duplicated work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another issue is that in exprlist_to_fields, we don't handle replace items, but ExpandWildcardRule
does.
select * replace ('foo' as a) from t
will give the wrong datatype in schema before executing analyzer.
/// For each column specified in the USING JOIN condition, the JOIN plan outputs it twice | ||
/// (once for each join side), but an unqualified wildcard should include it only once. | ||
/// This function returns the columns that should be excluded. | ||
fn exclude_using_columns(plan: &LogicalPlan) -> Result<HashSet<Column>> { | ||
let using_columns = plan.using_columns()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using_columns() finds join condition columns by traversing the plan tree. This manner might be unsafe as it could incorrectly find columns that are not relevant to the current SQL context. For example, the result of the query below is different from other databases.
create table t(a int);
insert into t values(1),(2),(3);
select * from (select t.a+2 as a from t join t t2 using(a)) as t2;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this something we should file a ticket to track?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for quickly addressing the issue and preparing the PR! 🙏 ❤️
I haven’t had a chance to dive deeply into the code yet, but I do have one request: improving test coverage, as I believe it’s really important.
#[test] | ||
fn test_using_join_wildcard_schema() { | ||
let sql = "SELECT * FROM orders o1 JOIN orders o2 USING (order_id)"; | ||
let plan = logical_plan(sql).unwrap(); | ||
let count = plan | ||
.schema() | ||
.iter() | ||
.filter(|(_, f)| f.name() == "order_id") | ||
.count(); | ||
// Only one order_id column | ||
assert_eq!(count, 1); | ||
|
||
let sql = "SELECT * FROM orders o1 NATURAL JOIN orders o2"; | ||
let plan = logical_plan(sql).unwrap(); | ||
// Only columns from one join side should be present | ||
let expected_fields = vec![ | ||
"o1.order_id".to_string(), | ||
"o1.customer_id".to_string(), | ||
"o1.o_item_id".to_string(), | ||
"o1.qty".to_string(), | ||
"o1.price".to_string(), | ||
"o1.delivered".to_string(), | ||
]; | ||
assert_eq!(plan.schema().field_names(), expected_fields); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd insist on better test coverage:
- test-case where output (expected fields) contains at least 1 column from second table. Just like in MRE Regression:
DataFrame::schema
returns incorrect schema for NATURAL JOIN #14058 - more complex select, e.g with
WITH
or subselect - Join of >2 tables
- (?)
Because otherwise, another regression may happen easily
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in 789e9f9
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me -- thank you @jonahgao and @DDtKey
I verified that the test fails without this code change, so from that perspective this PR is strictly better than main so in my opinion this PR could be merged in as is
However, I very much think @DDtKey 's comment about more testing is important https://github.com/apache/datafusion/pull/14102/files#r1913426831
Though I do think we can do it as a follow on PR as well
Here is how the test fails
assertion `left == right` failed
left: 2
right: 1
Left: 2
Right: 1
<Click to see difference>
thread 'test_using_join_wildcard_schema' panicked at datafusion/sql/tests/sql_integration.rs:4568:5:
assertion `left == right` failed
left: 2
right: 1
stack backtrace:
0: rust_begin_unwind
at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/library/std/src/panicking.rs:665:5
1: core::panicking::panic_fmt
at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/library/core/src/panicking.rs:76:14
2: core::panicking::assert_failed_inner
3: core::panicking::assert_failed
at /Users/andrewlamb/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/panicking.rs:373:5
4: sql_integration::test_using_join_wildcard_schema
at ./tests/sql_integration.rs:4568:5
5: sql_integration::test_using_join_wildcard_schema::{{closure}}
at ./tests/sql_integration.rs:4559:37
6: core::ops::function::FnOnce::call_once
at /Users/andrewlamb/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
7: core::ops::function::FnOnce::call_once
at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
/// For each column specified in the USING JOIN condition, the JOIN plan outputs it twice | ||
/// (once for each join side), but an unqualified wildcard should include it only once. | ||
/// This function returns the columns that should be excluded. | ||
fn exclude_using_columns(plan: &LogicalPlan) -> Result<HashSet<Column>> { | ||
let using_columns = plan.using_columns()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this something we should file a ticket to track?
Which issue does this PR close?
Closes #14058.
Rationale for this change
When expanding unqualified wildcard over a natural/using join, it should deduplicate the columns specified in the join conditions. For example,
select * from t t1 join t t2 using(a)
should output the columna
only once.We have already done this in
ExpandWildcardRule
, and this PR re-implements it when computing plan schemas.What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?
No