Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace TypeSignature::String with TypeSignature::Coercible for starts_with #14812

Merged
merged 1 commit into from
Feb 27, 2025

Conversation

zjregee
Copy link
Member

@zjregee zjregee commented Feb 21, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions functions Changes to functions implementation labels Feb 21, 2025
@zjregee
Copy link
Member Author

zjregee commented Feb 21, 2025

Hi, @jayzhan211. I encountered some problems when trying to discard TypeSignature::String.

After a simple direct replacement, the original test failed. The specific test failure items are shown in this https://github.com/apache/datafusion/actions/runs/13457031878/job/37603204917?pr=14812.

I think the problem may be caused by the need to modify the logic of this part of the code in this function https://github.com/zjregee/datafusion/blob/replace-type-signature/datafusion/expr/src/type_coercion/functions.rs#L603-L636, but I am not sure, and hope to get some help and suggestions.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Feb 21, 2025

fn simplify(
&self,
args: Vec<Expr>,
_info: &dyn SimplifyInfo,
) -> Result<ExprSimplifyResult> {
if let Expr::Literal(scalar_value) = &args[1] {
// Convert starts_with(col, 'prefix') to col LIKE 'prefix%' with proper escaping
// Example: starts_with(col, 'ja%') -> col LIKE 'ja\%%'
// 1. 'ja%' (input pattern)
// 2. 'ja\%' (escape special char '%')
// 3. 'ja\%%' (add suffix for starts_with)
let like_expr = match scalar_value {
ScalarValue::Utf8(Some(pattern)) => {
let escaped_pattern = pattern.replace("%", "\\%");
let like_pattern = format!("{}%", escaped_pattern);
Expr::Literal(ScalarValue::Utf8(Some(like_pattern)))
}
ScalarValue::LargeUtf8(Some(pattern)) => {
let escaped_pattern = pattern.replace("%", "\\%");
let like_pattern = format!("{}%", escaped_pattern);
Expr::Literal(ScalarValue::LargeUtf8(Some(like_pattern)))
}
ScalarValue::Utf8View(Some(pattern)) => {
let escaped_pattern = pattern.replace("%", "\\%");
let like_pattern = format!("{}%", escaped_pattern);
Expr::Literal(ScalarValue::Utf8View(Some(like_pattern)))
}
_ => return Ok(ExprSimplifyResult::Original(args)),
};
return Ok(ExprSimplifyResult::Simplified(Expr::Like(Like {
negated: false,
expr: Box::new(args[0].clone()),
pattern: Box::new(like_expr),
escape_char: None,
case_insensitive: false,
})));
}
Ok(ExprSimplifyResult::Original(args))
}

pub fn like(
negated: bool,
case_insensitive: bool,
expr: Arc<dyn PhysicalExpr>,
pattern: Arc<dyn PhysicalExpr>,
input_schema: &Schema,
) -> Result<Arc<dyn PhysicalExpr>> {
let expr_type = &expr.data_type(input_schema)?;
let pattern_type = &pattern.data_type(input_schema)?;
if !expr_type.eq(pattern_type) && !can_like_type(expr_type) {
return internal_err!(
"The type of {expr_type} AND {pattern_type} of like physical should be same"
);
}
Ok(Arc::new(LikeExpr::new(
negated,
case_insensitive,
expr,
pattern,
)))
}

The problem seems like you need to cast args[0] to the same as args[1] for like kernel function.

@zjregee zjregee force-pushed the replace-type-signature branch from 9c7ce7d to 0a84fdc Compare February 24, 2025 08:55
@github-actions github-actions bot added physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt) labels Feb 24, 2025
@zjregee
Copy link
Member Author

zjregee commented Feb 24, 2025

Hi, @jayzhan211. I'm sorry to bother you again. Before I start replacing other functions, I'd like to get some confirmation from you that my current implementation and handling are as expected. Thanks a lot!

fn simplify(
&self,
args: Vec<Expr>,
_info: &dyn SimplifyInfo,
) -> Result<ExprSimplifyResult> {
if let Expr::Literal(scalar_value) = &args[1] {
// Convert starts_with(col, 'prefix') to col LIKE 'prefix%' with proper escaping
// Example: starts_with(col, 'ja%') -> col LIKE 'ja\%%'
// 1. 'ja%' (input pattern)
// 2. 'ja\%' (escape special char '%')
// 3. 'ja\%%' (add suffix for starts_with)
let like_expr = match scalar_value {
ScalarValue::Utf8(Some(pattern)) => {
let escaped_pattern = pattern.replace("%", "\\%");
let like_pattern = format!("{}%", escaped_pattern);
Expr::Literal(ScalarValue::Utf8(Some(like_pattern)))
}
ScalarValue::LargeUtf8(Some(pattern)) => {
let escaped_pattern = pattern.replace("%", "\\%");
let like_pattern = format!("{}%", escaped_pattern);
Expr::Literal(ScalarValue::LargeUtf8(Some(like_pattern)))
}
ScalarValue::Utf8View(Some(pattern)) => {
let escaped_pattern = pattern.replace("%", "\\%");
let like_pattern = format!("{}%", escaped_pattern);
Expr::Literal(ScalarValue::Utf8View(Some(like_pattern)))
}
_ => return Ok(ExprSimplifyResult::Original(args)),
};
return Ok(ExprSimplifyResult::Simplified(Expr::Like(Like {
negated: false,
expr: Box::new(args[0].clone()),
pattern: Box::new(like_expr),
escape_char: None,
case_insensitive: false,
})));
}
Ok(ExprSimplifyResult::Original(args))
}

After some experimentation, I ended up not handling the type conversion in simplify of starts_with for the following reasons: this doesn't seem to cover all failure cases; If args[0] is of type Expr::Column, it seems that cannot do a direct type conversion here, because the type of args[0] is not deduced here.

The above is my superficial understanding. If there is anything that does not meet expectations, I hope to be corrected.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Feb 24, 2025

@zjregee How about wrapping with Expr::Cast?

The logic should be something similar to this

let data_type_a = complex_expr.get_data_type();
let data_type_b = complex_expr.get_data_type();

if data_type_a != data_type_b {
   target type = determine_common_type(data_type_a, data_type_b)
   Expr::Cast(complex_expr, target type)
}

@github-actions github-actions bot removed the physical-expr Changes to the physical-expr crates label Feb 24, 2025
@jayzhan211
Copy link
Contributor

jayzhan211 commented Feb 24, 2025

let left_type = expr.get_type(self.schema)?;
let right_type = pattern.get_type(self.schema)?;
let coerced_type = like_coercion(&left_type, &right_type).ok_or_else(|| {
let op_name = if case_insensitive {
"ILIKE"
} else {
"LIKE"
};
plan_datafusion_err!(
"There isn't a common type to coerce {left_type} and {right_type} in {op_name} expression"
)
})?;
let expr = match left_type {
DataType::Dictionary(_, inner) if *inner == DataType::Utf8 => expr,
_ => Box::new(expr.cast_to(&coerced_type, self.schema)?),
};
let pattern = Box::new(pattern.cast_to(&coerced_type, self.schema)?);
Ok(Transformed::yes(Expr::Like(Like::new(
negated,
expr,
pattern,
escape_char,
case_insensitive,
))))
}

This is how types coerced in like expression, I guess we might need a similar coerce logic

string_coercion is probably the only coercion we need

@zjregee
Copy link
Member Author

zjregee commented Feb 24, 2025

@jayzhan211. Got this, thank you very much for your help.

In addition, some functions will also have similar problems that need to be solved during the replacement process. Maybe we can split it into several PRs. I will try to complete all of this in the next few days.

@jayzhan211
Copy link
Contributor

can split it into several PRs.

sure

@zjregee zjregee changed the title [wip] replace TypeSignature::String with TypeSignature::Coercible replace TypeSignature::String with TypeSignature::Coercible for starts_with Feb 24, 2025
@zjregee
Copy link
Member Author

zjregee commented Feb 25, 2025

I think this PR is ready for review now. cc: @jayzhan211

use datafusion_macros::user_doc;

/// Returns true if string starts with prefix.
/// starts_with('alphabet', 'alph') = 't'
pub fn starts_with(args: &[ArrayRef]) -> Result<ArrayRef> {
let result = arrow::compute::kernels::comparison::starts_with(&args[0], &args[1])?;
let arg0 = arrow::compute::kernels::cast::cast(&args[0], args[1].data_type())?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think better to use string_coercion and binary_to_string_coercion

@@ -127,7 +137,7 @@ impl ScalarUDFImpl for StartsWithFunc {

return Ok(ExprSimplifyResult::Simplified(Expr::Like(Like {
negated: false,
expr: Box::new(args[0].clone()),
expr: Box::new(cast(args[0].clone(), scalar_value.data_type())),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think better to use string_coercion and binary_to_string_coercion

@@ -619,13 +619,13 @@ query TT
explain select * from foo where starts_with(column1, 'f');
----
logical_plan
01)Filter: foo.column1 LIKE Utf8View("f%")
02)--TableScan: foo projection=[column1], partial_filters=[foo.column1 LIKE Utf8View("f%")]
01)Filter: CAST(foo.column1 AS Utf8) LIKE Utf8("f%")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, casting array is less efficient, since you only has 2 characters "f%" on right hand side.

Even more, instead of calling make_scalar_function, we can wrap scalar case to arrow::Scalar and call arrow::compute::kernels::comparison::starts_with directly.
It accepts Datum

pub fn starts_with(left: &dyn Datum, right: &dyn Datum) -> Result<BooleanArray, ArrowError> {
    like_op(Op::StartsWith, left, right)
}

@zjregee zjregee force-pushed the replace-type-signature branch from 4cc2188 to 9c37c65 Compare February 25, 2025 11:40
Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

@zjregee zjregee force-pushed the replace-type-signature branch from 9c37c65 to 4df8fbe Compare February 25, 2025 12:02
@zjregee
Copy link
Member Author

zjregee commented Feb 25, 2025

Hi, @jayzhan211. I resubmitted the PR, and now we not only don't have redundant CAST, but also reduce the use of redundant CAST in some cases. I think the current treatment is appropriate, thank you very much for your help!

@zjregee zjregee force-pushed the replace-type-signature branch from 12e5d4d to 81e82b0 Compare February 27, 2025 03:13
@github-actions github-actions bot removed the logical-expr Logical plan and expressions label Feb 27, 2025
Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

@jayzhan211 jayzhan211 merged commit a28f283 into apache:main Feb 27, 2025
24 checks passed
@jayzhan211
Copy link
Contributor

Thanks @zjregee

@zjregee zjregee deleted the replace-type-signature branch February 27, 2025 04:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants