-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(join): joining on different types #3716
Conversation
CodSpeed Performance ReportMerging #3716 will degrade performances by 44.76%Comparing Summary
Benchmarks breakdown
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3716 +/- ##
==========================================
- Coverage 77.58% 77.56% -0.02%
==========================================
Files 729 728 -1
Lines 92010 92054 +44
==========================================
+ Hits 71389 71406 +17
- Misses 20621 20648 +27
|
null_equals_nulls: list[bool] | None, | ||
how: JoinType, | ||
null_equals_nulls: list[bool] | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small drive-by fix in typing to match rust definition
|
||
use crate::{deduplicate_expr_names, ExprRef}; | ||
|
||
pub fn get_common_join_cols<'a>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Common columns now determined via schema instead of join keys so that join keys can be modified (ex: casting to supertype) without side effects. This does introduce a small API change in the order of the join schema: common columns are now sorted by left side schema instead of by join keys. I think this is fine but we could introduce a project under the left side to reorder if necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some initial comments. Will get to a more thorough review tomorrow!
daft/daft/__init__.pyi
Outdated
join_prefix: str | None = None, | ||
join_suffix: str | None = None, | ||
join_strategy: JoinStrategy | None = None, | ||
column_renaming_params: JoinColumnRenamingParams | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd prefer deprecating join_prefix
and join_suffix
instead of just flat out removing them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm kind of also on the fence of leaving these as is for the dataframe api. IMO, it's a bit cleaner to do
df.join(df2, prefix="df2.", suffix="_joined")
instead of
from daft.daft import JoinColumnRenamingParams
df.join(df2, column_renaming_params=JoinColumnRenamingParams(prefix="df2.", suffix="_joined"))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API for the actual dataframe operation has not changed, this is just for the builder. Do you think we should be concerned about breaking the builder API?
let left_on = deduplicate_expr_names(&left_on); | ||
let right_on = deduplicate_expr_names(&right_on); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we make deduplicate_expr_names
take in an iter so we don't have to materialize twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to materialize here anyway to get the DaftResult out of the iterator as well as split the iterator into two vecs.
This PR fixes a few things and also has some QOL changes.
Fixes:
QOL:
JoinOptions
type. Reduces the parameters for a bunch of functions and also uses the builder patternkeep_join_keys
field tomerge_matching_join_keys
to make behavior more clear