Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Can be replaced by parts of #225 - Do NOT merge] Adding orderBy methods for TypedDetasets #231

Closed
wants to merge 2 commits into from

Conversation

imarios
Copy link
Contributor

@imarios imarios commented Jan 20, 2018

scala> val t = TypedDataset.create((1,2, List(1))::(6,4, List(2))::(7,4,List(8))::(6,10,List(2015, 2017))::Nil)
t: frameless.TypedDataset[(Int, Int, List[Int])] = [_1: int, _2: int ... 1 more field]

scala> t.orderByAsc('_2)
res0: frameless.TypedDataset[(Int, Int, List[Int])] = [_1: int, _2: int ... 1 more field]

scala> t.orderByAsc('_2).show().run
+---+---+------------+
| _1| _2|          _3|
+---+---+------------+
|  1|  2|         [1]|
|  6|  4|         [2]|
|  7|  4|         [8]|
|  6| 10|[2015, 2017]|
+---+---+------------+


scala> t.orderByDesc('_2).show().run
+---+---+------------+
| _1| _2|          _3|
+---+---+------------+
|  6| 10|[2015, 2017]|
|  6|  4|         [2]|
|  7|  4|         [8]|
|  1|  2|         [1]|
+---+---+------------+

scala> t.orderByDesc('_3)
<console>:26: error: Cannot compare columns of type A.
       t.orderByDesc('_3)


scala> t.orderByMany(t('_1).asc).show().run
+---+---+------------+
| _1| _2|          _3|
+---+---+------------+
|  1|  2|         [1]|
|  6|  4|         [2]|
|  6| 10|[2015, 2017]|
|  7|  4|         [8]|
+---+---+------------+


scala> t.orderByMany(t('_1).asc, t('_2).desc).show().run
+---+---+------------+
| _1| _2|          _3|
+---+---+------------+
|  1|  2|         [1]|
|  6| 10|[2015, 2017]|
|  6|  4|         [2]|
|  7|  4|         [8]|
+---+---+------------+

You may wonder here why we are forcing the users to choose an ordering (ascending/descending) and why do we really need a new SortedTypedColumn type. Better to show this with an example:

scala> t.dataset.orderBy($"_1".desc).show()
+---+---+------------+
| _1| _2|          _3|
+---+---+------------+
|  7|  4|         [8]|
|  6| 10|[2015, 2017]|
|  6|  4|         [2]|
|  1|  2|         [1]|
+---+---+------------+


scala> t.dataset.orderBy($"_1".desc+1).show()
java.lang.UnsupportedOperationException: Cannot evaluate expression: input[0, int, false] DESC NULLS LAST
  at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:226)
  at org.apache.spark.sql.catalyst.expressions.SortOrder.doGenCode(SortOrder.scala:60)
  at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106)
  at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103)
  at org.apache.spark.sql.catalyst.expressions.BinaryExpression.nullSafeCodeGen(Expression.scala:460)
  at org.apache.spark.sql.catalyst.expressions.BinaryExpression.defineCodeGen(Expression.scala:443)
  at org.apache.spark.sql.catalyst.expressions.Add.doGenCode(arithmetic.scala:174)
  at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106)
  at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103)
  at scala.Option.getOrElse(Option.scala:121)
....

Ups, applying any further operations on a column after you marked it for ordering, leads to a run-time error. So we need to return a different column-like type that doesn't allow for any other ops.

@codecov-io
Copy link

codecov-io commented Jan 20, 2018

Codecov Report

Merging #231 into master will increase coverage by 0.49%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #231      +/-   ##
==========================================
+ Coverage   96.18%   96.68%   +0.49%     
==========================================
  Files          52       53       +1     
  Lines         944      964      +20     
  Branches       18       12       -6     
==========================================
+ Hits          908      932      +24     
+ Misses         36       32       -4
Impacted Files Coverage Δ
...set/src/main/scala/frameless/ops/ColumnTypes.scala 100% <ø> (ø) ⬆️
...ore/src/main/scala/frameless/CatalystOrdered.scala 100% <ø> (ø) ⬆️
...ataset/src/main/scala/frameless/TypedDataset.scala 95.55% <100%> (+1.66%) ⬆️
...main/scala/frameless/ops/SortableColumnTypes.scala 100% <100%> (ø)
dataset/src/main/scala/frameless/TypedColumn.scala 100% <100%> (ø) ⬆️
...c/main/scala/frameless/TypedDatasetForwarded.scala 75% <0%> (+20.83%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b99be3f...e73127c. Read the comment docs.

@iravid
Copy link
Contributor

iravid commented Jan 20, 2018

That is very strange behaviour. Is there a Spark bug tracking this?

Copy link
Contributor

@OlivierBlanvillain OlivierBlanvillain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good find @imarios, I definitely think this deserves creating a Spark issue given that their documentation has no mentions of this.


/** Orders the TypedDataset using the column selected in descending order.
*/
def orderByDesc[A](column: Witness.Lt[Symbol])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's weird to me to have an asymmetry between the two interfaces:

  • ds.orderByDesc(a')
  • dsorderByMany(('a).desc)

I would rather have a def orderBy(on: SortedTypedColumn) and use the second syntax in both cases.

"Either one of the selected columns is not sortable (${U}), " +
"or no ordering (ascending/descending) has been selected. " +
"Select an ordering on any column (t) using the t.asc or t.desc methods.")
trait SortableColumnTypes[T, U <: HList]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there might be something in shapeless doing just that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, it's called Comapped

Copy link
Contributor Author

@imarios imarios Jan 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for all the comments @OlivierBlanvillain. This one does two things, enforces that they are CatalystOrdered and that they are SortedTypedColumn. It also allows us to give the ability to write a custom error compilation error (using implicitNotFound). FInally, our ColumnTypes already is a small rework of Comapped and there @kanterov mentioned a performance reason why we prefer our own version rather than relying on Comapped.

@OlivierBlanvillain
Copy link
Contributor

OlivierBlanvillain commented Jan 20, 2018

@iravid I quickly went through the Spark jira issues with OrderBy in the title and couldn't find anything similar.

@imarios
Copy link
Contributor Author

imarios commented Jan 20, 2018

@iravid @OlivierBlanvillain that has been the behavior for Spark for as long as I remember. Now, I am not really sure its a bug. It seems that they purposefully added the exception there. Read my next comment below. I think probably they want to have a way to prevent sorting on projections (select). This guard, seems to be preventing exactly this.

@imarios
Copy link
Contributor Author

imarios commented Jan 20, 2018

@iravid @OlivierBlanvillain also forgot to mention that you cannot really project a column when an ordering is applied to it. This was the main reason why we needed a new type for sort-enabled columns.

scala> t.dataset.select($"_1".desc)
res3: org.apache.spark.sql.DataFrame = [_1 DESC NULLS LAST: int]

scala> t.dataset.select($"_1".desc).show()
java.lang.UnsupportedOperationException: Cannot evaluate expression: input[0, int, false] DESC NULLS LAST
  at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:226)
  at org.apache.spark.sql.catalyst.expressions.SortOrder.doGenCode(SortOrder.scala:60)
  at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106)
  at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103)
  at scala.Option.getOrElse(Option.scala:121)

@iravid
Copy link
Contributor

iravid commented Jan 20, 2018

Makes sense. I like your solution :-)

@imarios imarios changed the title Adding orderBy methods for TypedDetasets [Can be replaced by parts of #225 - DO not merge] Adding orderBy methods for TypedDetasets Jan 21, 2018
@imarios imarios changed the title [Can be replaced by parts of #225 - DO not merge] Adding orderBy methods for TypedDetasets [Can be replaced by parts of #225 - Do NOT merge] Adding orderBy methods for TypedDetasets Jan 21, 2018
@imarios
Copy link
Contributor Author

imarios commented Jan 23, 2018

Hey @frosforever, do you think you have time to move this together with the remaining column ordering work in a new PR? Let me know if you need any help.

@frosforever
Copy link
Contributor

Hey @imarios I'm not entirely sure what you mean. I can split out the sorting from the window work in #225 and submit a new PR. Comparing to this PR, it seems there are few differences:

  1. rename sort to orderBy
  2. add helper orderBy for N columns and then a separate orderByMany for arbitrary columns
    a. Is this for better type inference?
  3. Strictly use CatalystOrdered and not add CatalystRowOrdered
    a. Don't see a problem with that as a first pass though [WIP] Window functions and column sorting #225 supports sorting by struct as well as collections and options. I don't know off hand if all the methods that currently use CatalystOrdered can also use structs, collections and options
  4. This is missing ascNonesFirst etc.
    a. Is that intentional? Again this would require CatalystOrdered for Option

I can submit a PR with the above sometime tomorrow, barring the unforeseen.

I somehow missed this PR and am not sure how it originated. Was this done before I opened #225 or was this an effort to split out my ordering work from #225?

@imarios
Copy link
Contributor Author

imarios commented Jan 24, 2018

@frosforever I started coding this without referencing your PR. Probably subconsciously my brain remembered parts of it (looked at your PR 1.5 months ago), but didn't even reminded me (my brain that is) that this PR had somehow similar parts so that I don't reinvent the wheel. Surprisingly perhabs, the two solutions are not terribly different.

Regarding your questions, 1 we can use both (one as an alias to the other), 2 yes, 3 let's stick to CatalystOrdered, 4 no particular reason let's add this API as well.

@frosforever
Copy link
Contributor

@imarios Thanks for the background. It's reassuring we both came up with pretty similar solutions to the same problems!
I'll work on splitting out the sorting to a different PR

@frosforever frosforever mentioned this pull request Jan 29, 2018
@frosforever
Copy link
Contributor

@imarios I've opened #236 addressing what you've suggested here. I bumped into many issues conflating sorting with comparable. Besides the collection issue that we can expect 2.3 to address, there are also issues around supporting Options.

@OlivierBlanvillain
Copy link
Contributor

@imarios What's the status of this PR?

@imarios imarios closed this Feb 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants