[Can be replaced by parts of #225 - Do NOT merge] Adding orderBy methods for TypedDetasets #231

imarios · 2018-01-20T08:10:04Z

scala> val t = TypedDataset.create((1,2, List(1))::(6,4, List(2))::(7,4,List(8))::(6,10,List(2015, 2017))::Nil)
t: frameless.TypedDataset[(Int, Int, List[Int])] = [_1: int, _2: int ... 1 more field]

scala> t.orderByAsc('_2)
res0: frameless.TypedDataset[(Int, Int, List[Int])] = [_1: int, _2: int ... 1 more field]

scala> t.orderByAsc('_2).show().run
+---+---+------------+
| _1| _2|          _3|
+---+---+------------+
|  1|  2|         [1]|
|  6|  4|         [2]|
|  7|  4|         [8]|
|  6| 10|[2015, 2017]|
+---+---+------------+


scala> t.orderByDesc('_2).show().run
+---+---+------------+
| _1| _2|          _3|
+---+---+------------+
|  6| 10|[2015, 2017]|
|  6|  4|         [2]|
|  7|  4|         [8]|
|  1|  2|         [1]|
+---+---+------------+

scala> t.orderByDesc('_3)
<console>:26: error: Cannot compare columns of type A.
       t.orderByDesc('_3)


scala> t.orderByMany(t('_1).asc).show().run
+---+---+------------+
| _1| _2|          _3|
+---+---+------------+
|  1|  2|         [1]|
|  6|  4|         [2]|
|  6| 10|[2015, 2017]|
|  7|  4|         [8]|
+---+---+------------+


scala> t.orderByMany(t('_1).asc, t('_2).desc).show().run
+---+---+------------+
| _1| _2|          _3|
+---+---+------------+
|  1|  2|         [1]|
|  6| 10|[2015, 2017]|
|  6|  4|         [2]|
|  7|  4|         [8]|
+---+---+------------+

You may wonder here why we are forcing the users to choose an ordering (ascending/descending) and why do we really need a new SortedTypedColumn type. Better to show this with an example:

scala> t.dataset.orderBy($"_1".desc).show()
+---+---+------------+
| _1| _2|          _3|
+---+---+------------+
|  7|  4|         [8]|
|  6| 10|[2015, 2017]|
|  6|  4|         [2]|
|  1|  2|         [1]|
+---+---+------------+


scala> t.dataset.orderBy($"_1".desc+1).show()
java.lang.UnsupportedOperationException: Cannot evaluate expression: input[0, int, false] DESC NULLS LAST
  at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:226)
  at org.apache.spark.sql.catalyst.expressions.SortOrder.doGenCode(SortOrder.scala:60)
  at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106)
  at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103)
  at org.apache.spark.sql.catalyst.expressions.BinaryExpression.nullSafeCodeGen(Expression.scala:460)
  at org.apache.spark.sql.catalyst.expressions.BinaryExpression.defineCodeGen(Expression.scala:443)
  at org.apache.spark.sql.catalyst.expressions.Add.doGenCode(arithmetic.scala:174)
  at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106)
  at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103)
  at scala.Option.getOrElse(Option.scala:121)
....

Ups, applying any further operations on a column after you marked it for ordering, leads to a run-time error. So we need to return a different column-like type that doesn't allow for any other ops.

codecov-io · 2018-01-20T08:44:46Z

Codecov Report

Merging #231 into master will increase coverage by 0.49%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #231      +/-   ##
==========================================
+ Coverage   96.18%   96.68%   +0.49%     
==========================================
  Files          52       53       +1     
  Lines         944      964      +20     
  Branches       18       12       -6     
==========================================
+ Hits          908      932      +24     
+ Misses         36       32       -4

Impacted Files	Coverage Δ
...set/src/main/scala/frameless/ops/ColumnTypes.scala	`100% <ø> (ø)`	⬆️
...ore/src/main/scala/frameless/CatalystOrdered.scala	`100% <ø> (ø)`	⬆️
...ataset/src/main/scala/frameless/TypedDataset.scala	`95.55% <100%> (+1.66%)`	⬆️
...main/scala/frameless/ops/SortableColumnTypes.scala	`100% <100%> (ø)`
dataset/src/main/scala/frameless/TypedColumn.scala	`100% <100%> (ø)`	⬆️
...c/main/scala/frameless/TypedDatasetForwarded.scala	`75% <0%> (+20.83%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b99be3f...e73127c. Read the comment docs.

iravid · 2018-01-20T09:14:04Z

That is very strange behaviour. Is there a Spark bug tracking this?

OlivierBlanvillain

Good find @imarios, I definitely think this deserves creating a Spark issue given that their documentation has no mentions of this.

OlivierBlanvillain · 2018-01-20T13:58:50Z

dataset/src/main/scala/frameless/TypedDataset.scala

+
+  /** Orders the TypedDataset using the column selected in descending order.
+    */
+  def orderByDesc[A](column: Witness.Lt[Symbol])


It's weird to me to have an asymmetry between the two interfaces:

ds.orderByDesc(a')

dsorderByMany(('a).desc)

I would rather have a def orderBy(on: SortedTypedColumn) and use the second syntax in both cases.

OlivierBlanvillain · 2018-01-20T14:01:49Z

dataset/src/main/scala/frameless/ops/SortableColumnTypes.scala

+  "Either one of the selected columns is not sortable (${U}), " +
+    "or no ordering (ascending/descending) has been selected. " +
+    "Select an ordering on any column (t) using the t.asc or t.desc methods.")
+trait SortableColumnTypes[T, U <: HList]


I think there might be something in shapeless doing just that.

Got it, it's called Comapped

Thank you for all the comments @OlivierBlanvillain. This one does two things, enforces that they are CatalystOrdered and that they are SortedTypedColumn. It also allows us to give the ability to write a custom error compilation error (using implicitNotFound). FInally, our ColumnTypes already is a small rework of Comapped and there @kanterov mentioned a performance reason why we prefer our own version rather than relying on Comapped.

OlivierBlanvillain · 2018-01-20T14:14:00Z

@iravid I quickly went through the Spark jira issues with OrderBy in the title and couldn't find anything similar.

imarios · 2018-01-20T17:38:09Z

@iravid @OlivierBlanvillain that has been the behavior for Spark for as long as I remember. Now, I am not really sure its a bug. It seems that they purposefully added the exception there. Read my next comment below. I think probably they want to have a way to prevent sorting on projections (select). This guard, seems to be preventing exactly this.

imarios · 2018-01-20T19:23:39Z

@iravid @OlivierBlanvillain also forgot to mention that you cannot really project a column when an ordering is applied to it. This was the main reason why we needed a new type for sort-enabled columns.

scala> t.dataset.select($"_1".desc)
res3: org.apache.spark.sql.DataFrame = [_1 DESC NULLS LAST: int]

scala> t.dataset.select($"_1".desc).show()
java.lang.UnsupportedOperationException: Cannot evaluate expression: input[0, int, false] DESC NULLS LAST
  at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:226)
  at org.apache.spark.sql.catalyst.expressions.SortOrder.doGenCode(SortOrder.scala:60)
  at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106)
  at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103)
  at scala.Option.getOrElse(Option.scala:121)

iravid · 2018-01-20T19:26:56Z

Makes sense. I like your solution :-)

imarios · 2018-01-23T05:32:35Z

Hey @frosforever, do you think you have time to move this together with the remaining column ordering work in a new PR? Let me know if you need any help.

frosforever · 2018-01-23T23:04:47Z

Hey @imarios I'm not entirely sure what you mean. I can split out the sorting from the window work in #225 and submit a new PR. Comparing to this PR, it seems there are few differences:

rename sort to orderBy
add helper orderBy for N columns and then a separate orderByMany for arbitrary columns
a. Is this for better type inference?
Strictly use CatalystOrdered and not add CatalystRowOrdered
a. Don't see a problem with that as a first pass though [WIP] Window functions and column sorting #225 supports sorting by struct as well as collections and options. I don't know off hand if all the methods that currently use CatalystOrdered can also use structs, collections and options
This is missing ascNonesFirst etc.
a. Is that intentional? Again this would require CatalystOrdered for Option

I can submit a PR with the above sometime tomorrow, barring the unforeseen.

I somehow missed this PR and am not sure how it originated. Was this done before I opened #225 or was this an effort to split out my ordering work from #225?

imarios · 2018-01-24T00:13:03Z

@frosforever I started coding this without referencing your PR. Probably subconsciously my brain remembered parts of it (looked at your PR 1.5 months ago), but didn't even reminded me (my brain that is) that this PR had somehow similar parts so that I don't reinvent the wheel. Surprisingly perhabs, the two solutions are not terribly different.

Regarding your questions, 1 we can use both (one as an alias to the other), 2 yes, 3 let's stick to CatalystOrdered, 4 no particular reason let's add this API as well.

frosforever · 2018-01-24T17:01:02Z

@imarios Thanks for the background. It's reassuring we both came up with pretty similar solutions to the same problems!
I'll work on splitting out the sorting to a different PR

frosforever · 2018-01-29T02:53:30Z

@imarios I've opened #236 addressing what you've suggested here. I bumped into many issues conflating sorting with comparable. Besides the collection issue that we can expect 2.3 to address, there are also issues around supporting Options.

OlivierBlanvillain · 2018-02-25T06:36:27Z

@imarios What's the status of this PR?

Adding orderBy methods for TypedDetasets

66da757

imarios requested a review from OlivierBlanvillain January 20, 2018 08:10

OlivierBlanvillain reviewed Jan 20, 2018

View reviewed changes

Addressing core review comments.

e73127c

OlivierBlanvillain approved these changes Jan 20, 2018

View reviewed changes

imarios mentioned this pull request Jan 21, 2018

[WIP] Window functions and column sorting #225

Closed

imarios changed the title ~~Adding orderBy methods for TypedDetasets~~ [Can be replaced by parts of #225 - DO not merge] Adding orderBy methods for TypedDetasets Jan 21, 2018

imarios changed the title ~~[Can be replaced by parts of #225 - DO not merge] Adding orderBy methods for TypedDetasets~~ [Can be replaced by parts of #225 - Do NOT merge] Adding orderBy methods for TypedDetasets Jan 21, 2018

frosforever mentioned this pull request Jan 29, 2018

Sort columns #236

Merged

imarios closed this Feb 26, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Can be replaced by parts of #225 - Do NOT merge] Adding orderBy methods for TypedDetasets #231

[Can be replaced by parts of #225 - Do NOT merge] Adding orderBy methods for TypedDetasets #231

imarios commented Jan 20, 2018 •

edited

Loading

codecov-io commented Jan 20, 2018 •

edited

Loading

iravid commented Jan 20, 2018

OlivierBlanvillain left a comment

OlivierBlanvillain Jan 20, 2018

OlivierBlanvillain Jan 20, 2018

OlivierBlanvillain Jan 20, 2018

imarios Jan 20, 2018 •

edited

Loading

OlivierBlanvillain commented Jan 20, 2018 •

edited

Loading

imarios commented Jan 20, 2018 •

edited

Loading

imarios commented Jan 20, 2018

iravid commented Jan 20, 2018

imarios commented Jan 23, 2018

frosforever commented Jan 23, 2018

imarios commented Jan 24, 2018 •

edited

Loading

frosforever commented Jan 24, 2018

frosforever commented Jan 29, 2018

OlivierBlanvillain commented Feb 25, 2018

[Can be replaced by parts of #225 - Do NOT merge] Adding orderBy methods for TypedDetasets #231

[Can be replaced by parts of #225 - Do NOT merge] Adding orderBy methods for TypedDetasets #231

Conversation

imarios commented Jan 20, 2018 • edited Loading

codecov-io commented Jan 20, 2018 • edited Loading

Codecov Report

iravid commented Jan 20, 2018

OlivierBlanvillain left a comment

Choose a reason for hiding this comment

OlivierBlanvillain Jan 20, 2018

Choose a reason for hiding this comment

OlivierBlanvillain Jan 20, 2018

Choose a reason for hiding this comment

OlivierBlanvillain Jan 20, 2018

Choose a reason for hiding this comment

imarios Jan 20, 2018 • edited Loading

Choose a reason for hiding this comment

OlivierBlanvillain commented Jan 20, 2018 • edited Loading

imarios commented Jan 20, 2018 • edited Loading

imarios commented Jan 20, 2018

iravid commented Jan 20, 2018

imarios commented Jan 23, 2018

frosforever commented Jan 23, 2018

imarios commented Jan 24, 2018 • edited Loading

frosforever commented Jan 24, 2018

frosforever commented Jan 29, 2018

OlivierBlanvillain commented Feb 25, 2018

imarios commented Jan 20, 2018 •

edited

Loading

codecov-io commented Jan 20, 2018 •

edited

Loading

imarios Jan 20, 2018 •

edited

Loading

OlivierBlanvillain commented Jan 20, 2018 •

edited

Loading

imarios commented Jan 20, 2018 •

edited

Loading

imarios commented Jan 24, 2018 •

edited

Loading