
feat: add resolved target to DmlStatement (to eliminate need for table lookup after deserialization) #14631

Merged
merged 3 commits into from
Feb 14, 2025

Conversation

milenkovicm
Contributor

@milenkovicm milenkovicm commented Feb 12, 2025

Which issue does this PR close?

Rationale for this change

As discussed in apache/datafusion-ballista#1164, the DML statement holds a table reference instead of a table source, which hurts usability when a logical plan is serialized and deserialized: after deserialization another table lookup has to be performed. This is a problem for Ballista, where we keep two separate session contexts and only one of them has the table definition.

This PR also adds the table source to LogicalPlan::Dml to improve usability.
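To make the shape of the change concrete, here is a toy before/after sketch of the statement struct (stand-in types; names follow the PR discussion but these are not the actual DataFusion definitions):

```rust
use std::sync::Arc;

// Stand-ins for DataFusion's types; this is an illustrative sketch,
// not the real definitions.
trait TableSource {
    fn name(&self) -> &str;
}

struct MemTable;
impl TableSource for MemTable {
    fn name(&self) -> &str {
        "mem"
    }
}

// Before: only the table *reference* was stored, so a deserialized
// plan required another catalog lookup to find the table.
#[allow(dead_code)]
struct DmlStatementBefore {
    table_name: String,
}

// After: the resolved target travels with the statement, so no
// lookup is needed on the deserializing side.
struct DmlStatementAfter {
    table_name: String,
    target: Arc<dyn TableSource>,
}

fn main() {
    let stmt = DmlStatementAfter {
        table_name: "t".to_string(),
        target: Arc::new(MemTable),
    };
    // The target is available directly from the statement.
    assert_eq!(stmt.table_name, "t");
    assert_eq!(stmt.target.name(), "mem");
    println!("resolved target for {}: {}", stmt.table_name, stmt.target.name());
}
```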

What changes are included in this PR?

This PR changes:

  • LogicalPlan::Dml
  • proto LogicalPlanNode::DML

Are these changes tested?

Using existing tests.

Are there any user-facing changes?

The proto definition changed.

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions core Core DataFusion crate proto Related to proto crate labels Feb 12, 2025
@milenkovicm milenkovicm force-pushed the feat_dml_table_source branch 3 times, most recently from 9850957 to 62cd035 Compare February 12, 2025 16:52
Contributor

@alamb alamb left a comment

Thanks @milenkovicm - I think this PR makes a lot of sense to me

The most compelling rationale in my mind is that it makes DMLNode follow the same pattern as the TableScan (which also has a resolved reference):

/// The source of the table
pub source: Arc<dyn TableSource>,

I had some questions and comments about naming and the need to keep the table schema around. Let me know what you think.

scan_plan,
"t",
table_sink,
&schema,
Contributor

Given the table_source already has the target schema, we could avoid storing it again

Contributor Author

I was thinking of removing it, but I didn't because I got confused about its usage.
Looking at

LogicalPlanBuilder::insert_into(scan_plan, "t", &schema, InsertOp::Append)?

I got confused whether this is the scan schema or the table schema.

Contributor

If it is the scan schema, then it should be possible to get it from the scan_plan 🤔
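The reviewer's point can be illustrated with toy types: if the builder already receives a table source, the schema parameter is derivable from it rather than passed separately (all names here are stand-ins, not the DataFusion API):

```rust
use std::sync::Arc;

struct Schema {
    fields: Vec<&'static str>,
}

struct TableSource {
    schema: Arc<Schema>,
}

impl TableSource {
    fn schema(&self) -> Arc<Schema> {
        Arc::clone(&self.schema)
    }
}

// Instead of insert_into(scan_plan, "t", source, &schema, ...), the
// builder can recover the target schema from the source itself.
fn target_schema(source: &TableSource) -> Arc<Schema> {
    source.schema()
}

fn main() {
    let source = TableSource {
        schema: Arc::new(Schema { fields: vec!["id", "value"] }),
    };
    let schema = target_schema(&source);
    assert_eq!(schema.fields, vec!["id", "value"]);
    println!("derived schema fields: {:?}", schema.fields);
}
```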

let insert_into_table = LogicalPlanBuilder::insert_into(
scan_plan,
"t",
table_sink,
Contributor

I found the name of table_sink to be confusing here (as the type is called TableSource I would expect it to be called table_source for consistency)

Contributor

But now I see that it is a sink because it is the target of the DML operations

Maybe target would be clearer 🤔

pub struct DmlStatement {
/// The table name
pub table_name: TableReference,
/// this is sink table which corresponds to table_name reference
Contributor

Suggested change
/// this is sink table which corresponds to table_name reference
/// this is target table to insert into

@milenkovicm
Contributor Author

Thanks @alamb, will address the comments and do a cleanup later.

@alamb alamb added the api change Changes the API exposed to users of the crate label Feb 13, 2025
@milenkovicm milenkovicm force-pushed the feat_dml_table_source branch from 62cd035 to 715781e Compare February 13, 2025 17:44
@milenkovicm
Contributor Author

The comments should be addressed now, @alamb.

@alamb alamb changed the title feat: add table source to DML proto to eliminate need for table lookup after deserialisation feat: add resolved target to DmlStatement (to eliminate need for table lookup after deserialization) Feb 14, 2025
Contributor

@alamb alamb left a comment

Thanks @milenkovicm -- I think the code in this PR makes sense to me and is very nice 👌

I didn't see a test of serializing / deserializing a DML plan (which I think is the primary use case for this change). I think the PR needs such a test prior to merge; otherwise it is ready to go 🚀

BTW I also added some docs + an example in a follow-on PR:

LogicalPlanBuilder::scan(table_ref.clone(), table_source, None)?.build()?;
let schema = table_source.schema().to_dfschema_ref()?;
let scan = LogicalPlanBuilder::scan(
//object_name_to_string(&table_name),
Contributor

left over?

Contributor Author

It showed up after a rebase, so I left it in case I needed to go back to it. It will be removed.

@milenkovicm
Contributor Author

#14079 added async fn roundtrip_logical_plan_dml() -> Result<()> roundtrip test, do we need more @alamb ?

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Contributor

@alamb alamb left a comment

#14079 added async fn roundtrip_logical_plan_dml() -> Result<()> roundtrip test, do we need more @alamb ?

Good point about the existing tests -- I was thinking about somehow testing the use case of not having to re-resolve the table provider, as that seems like important context not covered by the existing tests.

However, I don't have any great ideas in this area so let's call it good!
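The use case being discussed can be sketched with toy types: with the resolved target embedded, the deserializing side no longer needs its own catalog entry for the table (all names hypothetical):

```rust
use std::collections::HashMap;
use std::sync::Arc;

#[derive(Debug)]
struct TableSource {
    schema: Vec<&'static str>,
}

struct DmlStatement {
    table_name: String,
    target: Arc<TableSource>,
}

fn main() {
    // "Scheduler" side: knows the table definition and resolves it
    // while building the plan.
    let mut catalog: HashMap<String, Arc<TableSource>> = HashMap::new();
    catalog.insert(
        "t".to_string(),
        Arc::new(TableSource { schema: vec!["id", "name"] }),
    );
    let stmt = DmlStatement {
        table_name: "t".to_string(),
        target: Arc::clone(&catalog["t"]),
    };

    // "Executor" side: has no catalog entry for "t", yet the target
    // (and its schema) is still available from the statement itself.
    let executor_catalog: HashMap<String, Arc<TableSource>> = HashMap::new();
    assert!(executor_catalog.get(&stmt.table_name).is_none());
    assert_eq!(stmt.target.schema, vec!["id", "name"]);
    println!("no lookup needed; schema = {:?}", stmt.target.schema);
}
```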

@milenkovicm
Contributor Author

I will push those small changes later, and will have a look at whether the tests can be improved.
The issue I have is that we can have one table ref but a different table source, if I'm not mistaken; this would be very hard to debug.

@milenkovicm
Contributor Author

milenkovicm commented Feb 14, 2025

@alamb I'm testing this branch in ballista just to verify it.

Apparently the Ballista tests fail with this branch, but they also fail with latest main (without this patch), so this is not related to this patch.

DataFusionError(Internal("Could not create `ExprBoundaries`: in `try_from_column` `col_index` \n                has gone out of bounds with a value of 3, the schema has 3 columns."))

Error log:

[2025-02-14T17:57:07Z DEBUG ballista_scheduler::scheduler_server::query_stage_scheduler] processing task status updates from b77a5230-b9f6-4bf9-a914-ef337642e31e: []
[2025-02-14T17:57:07Z DEBUG ballista_scheduler::state::task_manager] Preparing task definition for TaskDescription[session_id: cd7a5bec-bda2-4908-bec4-2d62387ea568,job: EBoAKBv, stage: 1.0, partition: 0 task_id 0, task attempt 0]
    ShuffleWriterExec: None
      CoalesceBatchesExec: target_batch_size=8192
        FilterExec: id@0 > 4
          DataSourceExec: file_groups={1 group: [[var/folders/h_/cdg2zx090212402xhrwhv62w0000gn/T/.tmp9gafz6/YnnsNOURHACqq5wx_0.parquet]]}, projection=[id, string_col, timestamp_col], file_type=parquet, predicate=id@0 > 4, pruning_predicate=id_null_count@1 != row_count@2 AND id_max@0 > 4, required_guarantees=[]
    
[2025-02-14T17:57:07Z DEBUG ballista_scheduler::scheduler_server::query_stage_scheduler] processing task status updates from af0d3817-31e6-4336-b753-e84f9e466db1: []
[2025-02-14T17:57:07Z INFO  ballista_executor::execution_loop] Received task: [TID 0 EBoAKBv/1.0/0.0]
[2025-02-14T17:57:07Z WARN  datafusion_common::config] `enable_options_value_normalization` is deprecated and ignored
[2025-02-14T17:57:07Z WARN  ballista_executor::execution_loop] Failed to run task: DataFusionError(Internal("Could not create `ExprBoundaries`: in `try_from_column` `col_index` \n                has gone out of bounds with a value of 3, the schema has 3 columns."))

You may have an educated guess where the issue is.

@alamb
Contributor

alamb commented Feb 14, 2025

Could not create ExprBoundaries: in try_from_column col_index \n

That looks like it may be related to projections somehow (expected column isn't present) 🤔

@milenkovicm
Contributor Author

I tried to test apache/datafusion-ballista#1176 with latest to verify this PR.

cargo test should_execute_sql_write -- --nocapture

made no changes to code apart from pointing to this branch / main

@alamb
Contributor

alamb commented Feb 14, 2025

I tried to test apache/datafusion-ballista#1176 with latest to verify this PR.

cargo test should_execute_sql_write -- --nocapture

made no changes to code apart from pointing to this branch / main

Maybe it is related to using a different schema (from the target 🤔 )

@milenkovicm
Contributor Author

It's not a problem with this PR (at least not for now); I've tested it with main at 25b2a92aa683edb27605ea78473be170453583bf and it fails as well.

@milenkovicm
Contributor Author

test with SessionContext::standalone().await.unwrap() ... fails
test with SessionContext::new() ... all good

so it's either on the Ballista side or in datafusion-proto

@alamb
Contributor

alamb commented Feb 14, 2025

Is this PR good to merge from your perspective @milenkovicm ?

@milenkovicm
Contributor Author

I think this PR is fine; the test fails with main as well. We may want to postpone merging it until we can identify the problem.

@milenkovicm
Contributor Author

ha ...

+-----------------------------------------------+-------+
| name                                          | value |
+-----------------------------------------------+-------+
| datafusion.execution.parquet.pushdown_filters | false |
+-----------------------------------------------+-------+

test fails

+-----------------------------------------------+-------+
| name                                          | value |
+-----------------------------------------------+-------+
| datafusion.execution.parquet.pushdown_filters | true  |
+-----------------------------------------------+-------+

no problem at all

probably related to #14253 & #14465
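For anyone reproducing the toggle above, the option can be set on the session config; a sketch follows (treat the exact paths and method names as assumptions and verify them against your DataFusion version):

```rust
use datafusion::prelude::*;

fn main() {
    // Sketch only: enable parquet filter pushdown for a session.
    let config = SessionConfig::new()
        .set_bool("datafusion.execution.parquet.pushdown_filters", true);
    let _ctx = SessionContext::new_with_config(config);
}
```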

@alamb alamb merged commit a104661 into apache:main Feb 14, 2025
24 checks passed
@alamb
Contributor

alamb commented Feb 14, 2025

Thanks @milenkovicm

@milenkovicm
Contributor Author

@alamb this PR looks fine. I verified it with datafusion.execution.parquet.pushdown_filters=true, but there is still the same problem as in #14631 (comment) when datafusion.execution.parquet.pushdown_filters=false.

Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate logical-expr Logical plan and expressions proto Related to proto crate sql SQL Planner
Successfully merging this pull request may close these issues.

feat: add TableSource to DML proto to eliminate need for table lookup after deserialisation
2 participants