-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add resolved target
to DmlStatement
(to eliminate need for table lookup after deserialization)
#14631
Conversation
9850957
to
62cd035
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @milenkovicm - I think this PR makes a lot of sense to me
The most compelling rationale in my mind is that it makes DMLNode follow the same pattern as the TableScan (which also has a resolved reference):
datafusion/datafusion/expr/src/logical_plan/plan.rs
Lines 2500 to 2501 in 62cd035
/// The source of the table | |
pub source: Arc<dyn TableSource>, |
I had some questions and comments about naming and theneed to keep the table schema around. Let me know what you think
scan_plan, | ||
"t", | ||
table_sink, | ||
&schema, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the table_source already has the target schema, we could avoid storing it again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking to remove it, but I did not as I got confused about its usage.
looking at the
LogicalPlanBuilder::insert_into(scan_plan, "t", &schema, InsertOp::Append)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it is the scan schema, then it should be possible to get from the scan_plan
🤔
let insert_into_table = LogicalPlanBuilder::insert_into( | ||
scan_plan, | ||
"t", | ||
table_sink, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found the name of table_sink
to be confusing here (as the type is called TableSource
I would expect it to be called table_source
for consistency)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But now I see that it is a sink because it is the target of the DML operations
Maybe target
would be clearer 🤔
pub struct DmlStatement { | ||
/// The table name | ||
pub table_name: TableReference, | ||
/// this is sink table which corresponds to table_name reference |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// this is sink table which corresponds to table_name reference | |
/// this is target table to insert into |
thanks @alamb will address comments and do cleanup later. |
62cd035
to
715781e
Compare
comments should be addressed @alamb |
target
to DmlStatement
(to eliminate need for table lookup after deserialization)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @milenkovicm -- I think the code in ths PR makes sense to me and is very nice 👌
I didn't see a test of serializing / deserializing a DML plan (which I think is the primary usecase for this change). I think the PR needs such a test prior to merge otherwise it is ready to go 🚀
BTW I also added some docs + an example in a follow on PR:
datafusion/sql/src/statement.rs
Outdated
LogicalPlanBuilder::scan(table_ref.clone(), table_source, None)?.build()?; | ||
let schema = table_source.schema().to_dfschema_ref()?; | ||
let scan = LogicalPlanBuilder::scan( | ||
//object_name_to_string(&table_name), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left over?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it showed up after rebase, so i left it in case i need to go back to i. will be removed
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#14079 added
async fn roundtrip_logical_plan_dml() -> Result<()>
roundtrip test, do we need more @alamb ?
Good point about the existing tests -- I was thinking about somehow testing the usecase of not having to re-resolve the table provider as that seems like important context not covered by existing tests.
However, I don't have any great ideas in this area so let's call it good!
I will push those small changes later, and will have a look if tests can be improved. |
@alamb I'm testing this branch in ballista just to verify it. Apparently ballista tests fail, with this branch first, but with latest main as well (without this patch) so not related to this patch.
Error log:
you may have educated guess where the issue is |
That looks like it may be related to projections somehow (expected column isn't present) 🤔 |
I tried to test apache/datafusion-ballista#1176 with latest to verify this PR.
made no changes to code apart from pointing to this branch / main |
Maybe it is related to using a different schema (from the target 🤔 ) |
its not problem (at least not for now) with this PR, I've tested it with main |
test with so its either on ballista side or datafusion-proto |
Is this PR good to merge from your perspective @milenkovicm ? |
I think this PR is fine, test fails with main. we may want to postpone merge it until we can identify problem |
ha ...
test fails
no problem at all |
Thanks @milenkovicm |
@alamb this PR looks fine I verified it with |
Which issue does this PR close?
TableSource
toDML
proto
to eliminate need for table lookup after deserialisation #14654.Rationale for this change
As discussed in apache/datafusion-ballista#1164 DML statement has table reference instead of table source, which as implication impacts usability in case of ser/de logical plan. After deserialisation another table lookup need to be performed which impacts usability in case of ballista (where we keep two separate session context and only one of them has table definition)
This PR add table source as well to LogicalPlan:DML to improve usability.
What changes are included in this PR?
this PR changes:
Are these changes tested?
using existent tests
Are there any user-facing changes?
proto definition changed