
[SPARK-21110][SQL] Structs, arrays, and other orderable datatypes should be usable in inequalities #18818

Closed
wants to merge 8 commits

Conversation

Contributor

@aray aray commented Aug 2, 2017

What changes were proposed in this pull request?

Allows BinaryComparison operators to work on any data type that actually supports ordering as verified by TypeUtils.checkForOrderingExpr instead of relying on the incomplete list TypeCollection.Ordered (which is removed by this PR).

How was this patch tested?

Updated unit tests to cover structs and arrays.
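
For illustration only (this snippet is not part of the PR itself, and the session setup is assumed), the kind of query this change enables — note that MyStruct is the case class from the test build report below:

```scala
// Hypothetical sketch: with this change, orderable complex types can be
// used with <, <=, >, >= in addition to = and <=>.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

case class MyStruct(a: Long, b: String)
val df = Seq((MyStruct(1L, "b"), MyStruct(2L, "a"))).toDF("s1", "s2")

// Structs compare field by field, left to right.
// Before this PR, an inequality on struct columns failed type checking.
df.selectExpr("s1 < s2").show()
```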

@SparkQA

SparkQA commented Aug 2, 2017

Test build #80161 has finished for PR 18818 at commit 0d1fd56.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class MyStruct(a: Long, b: String)

@SparkQA

SparkQA commented Aug 2, 2017

Test build #80162 has finished for PR 18818 at commit ec8dc95.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 2, 2017

Test build #80170 has finished for PR 18818 at commit caf74bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@nchammas nchammas left a comment


Not my area of expertise, but happy to see the changeset is very small. Maybe @HyukjinKwon, @viirya, or @gatorsmile can provide a more thorough review here.

@aray - Does the example from the original feature request work with this PR?

* Types that can be ordered/compared. In the long run we should probably make this a trait
* that can be mixed into each data type, and perhaps create an `AbstractDataType`.
*/
// TODO: Should we consolidate this with RowOrdering.isOrderable?
Contributor


Just curious: Do we need to do anything with RowOrdering.isOrderable given the change in this PR?

Contributor Author


Nope. RowOrdering.isOrderable (which is what TypeUtils.checkForOrderingExpr uses) returns true on a strict superset of this type collection, since it also works for complex types that need recursive checks.
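
A simplified sketch (not the actual Spark source; the type hierarchy is pared down for illustration) of how such a recursive orderability check works:

```scala
// Illustrative model of RowOrdering.isOrderable: atomic types are
// orderable, arrays/structs are orderable iff their children are,
// and maps are not orderable at all.
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType
case object NullType extends DataType
case object MapType extends DataType
case class ArrayType(elementType: DataType) extends DataType
case class StructType(fieldTypes: Seq[DataType]) extends DataType

def isOrderable(dt: DataType): Boolean = dt match {
  case NullType | IntegerType | StringType => true
  case ArrayType(et)   => isOrderable(et)          // recurse into the element type
  case StructType(fts) => fts.forall(isOrderable)  // recurse into every field
  case _               => false                    // e.g. MapType
}
```

Because structs and arrays nest arbitrarily deep, no finite type collection can enumerate the orderable types — which is why the recursive predicate is the right check.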

@@ -453,6 +453,14 @@ case class Or(left: Expression, right: Expression) extends BinaryOperator with P

abstract class BinaryComparison extends BinaryOperator with Predicate {

override def inputType: AbstractDataType = AnyDataType
Member


For LessThan-like operators, we should not accept AnyDataType, right?

Contributor Author


Right, but the real type check happens below in checkInputDataTypes.

Member


This will confuse other readers. Could you revert this change?

Contributor Author


We have to define inputType because BinaryComparison extends BinaryOperator. Previously, the inputType declared by the LessThan-like operators was a subset of what they could actually support. This PR fixes that, but since the supported types cannot be finitely specified as a type collection (there are countably infinitely many legal StructTypes), we need to declare a superset of what is actually supported as inputType and then do the real recursive check in checkInputDataTypes. This is much like how the EqualTo and EqualNullSafe operators were previously implemented; this PR just moves that logic up into BinaryComparison, since it is really the same for the equality and inequality operators.

Did that answer your concerns?
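
A sketch of the pattern being described, paraphrased rather than copied from the PR (names mirror Spark's catalyst internals but the body is illustrative): inputType advertises the permissive superset, and the recursive check does the real work.

```scala
// Hypothetical paraphrase of the BinaryComparison type-checking pattern.
abstract class BinaryComparison extends BinaryOperator with Predicate {
  // Advertise a superset: every type is nominally accepted here, because
  // the orderable types (arbitrarily nested structs/arrays) cannot be
  // enumerated as a finite TypeCollection.
  override def inputType: AbstractDataType = AnyDataType

  override def checkInputDataTypes(): TypeCheckResult =
    super.checkInputDataTypes() match {
      case TypeCheckResult.TypeCheckSuccess =>
        // The real, recursive check: rejects non-orderable types such as
        // maps, accepts structs/arrays whose children are all orderable.
        TypeUtils.checkForOrderingExpr(left.dataType, this.getClass.getSimpleName)
      case failure => failure
    }
}
```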

Member


We need code comments explaining this; given that assumption, inputType might otherwise be misused in the future.

In addition, this PR has a pretty critical change. We really need to check all the new data types we support in this PR. https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ordering.scala#L90-L97

Could you write comprehensive test coverage for this?

@SparkQA

SparkQA commented Aug 8, 2017

Test build #80416 has finished for PR 18818 at commit c4f43e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Could you also add NULL to the test cases?

@@ -465,7 +475,7 @@ abstract class BinaryComparison extends BinaryOperator with Predicate {
}
}

-  protected lazy val ordering = TypeUtils.getInterpretedOrdering(left.dataType)
+  protected lazy val ordering: Ordering[Any] = TypeUtils.getInterpretedOrdering(left.dataType)
Member


TypeUtils.getInterpretedOrdering accepts fewer types than RowOrdering.isOrderable: only AtomicType, ArrayType, and StructType.

NullType and UserDefinedType can cause problems.

Member


Since ordering is lazily accessed, and null inputs never lead these predicates to access it, NullType should be safe. We should add a related test.

Contributor Author


addressed in cc2f3ec

@SparkQA

SparkQA commented Aug 14, 2017

Test build #80644 has finished for PR 18818 at commit cc2f3ec.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@aray
Contributor Author

aray commented Aug 14, 2017

retest this please

@SparkQA

SparkQA commented Aug 15, 2017

Test build #80647 has finished for PR 18818 at commit cc2f3ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@aray
Contributor Author

aray commented Aug 15, 2017

@viirya @gatorsmile I have addressed your comments; could you take another look?

@aray
Contributor Author

aray commented Aug 31, 2017

ping @viirya @gatorsmile

@@ -582,6 +582,7 @@ class CodegenContext {
case array: ArrayType => genComp(array, c1, c2) + " == 0"
case struct: StructType => genComp(struct, c1, c2) + " == 0"
case udt: UserDefinedType[_] => genEqual(udt.sqlType, c1, c2)
case NullType => "true"
Member


Is this required? Will it be covered by any test?

BTW, the value should be false.

Member


I found the test case, but it is not affected by the value we generate here, since it runs under nullSafeCodeGen.

However, we should still return false when evaluating null = null.

Contributor Author


Yes, codegen fails without this. I had originally made the value false, but when I noticed that the codegen for comparison (https://github.com/aray/spark/blob/cc2f3eca28ee6b9faa87853568205307567827cc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L606) returns 0, I changed it to be consistent. Happy to change it back, though.
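
For context, a toy sketch (illustrative only, not Spark's actual codegen) of why the literal generated for NullType is unobservable in that test: under the nullSafeCodeGen pattern, the comparison body is only evaluated when both sides are non-null, so null inputs short-circuit before the body runs.

```scala
// Toy model of the nullSafeCodeGen pattern: the body is skipped
// whenever either input is null, producing a null result (SQL
// three-valued logic), so a NullType = NullType comparison never
// reaches the generated equality body.
def nullSafeEval(left: Option[Any], right: Option[Any])
                (body: (Any, Any) => Any): Option[Any] =
  for { l <- left; r <- right } yield body(l, r)

// null = null short-circuits to null; the body is never called.
assert(nullSafeEval(None, None)((a, b) => a == b).isEmpty)
// 1 = 1 evaluates the body normally.
assert(nullSafeEval(Some(1), Some(1))((a, b) => a == b).contains(true))
```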

@gatorsmile
Member

LGTM except one comment. Thanks for working on it!

@SparkQA

SparkQA commented Aug 31, 2017

Test build #81294 has finished for PR 18818 at commit 6e01186.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merging to master.
