[Can be replaced by parts of #225 - Do NOT merge] Adding orderBy methods for TypedDatasets #231
Conversation
Codecov Report
@@ Coverage Diff @@
## master #231 +/- ##
==========================================
+ Coverage 96.18% 96.68% +0.49%
==========================================
Files 52 53 +1
Lines 944 964 +20
Branches 18 12 -6
==========================================
+ Hits 908 932 +24
+ Misses 36 32 -4
Continue to review full report at Codecov.
That is very strange behaviour. Is there a Spark bug tracking this?
Good find @imarios, I definitely think this deserves creating a Spark issue given that their documentation doesn't mention this.
/** Orders the TypedDataset using the column selected in descending order.
  */
def orderByDesc[A](column: Witness.Lt[Symbol])
It's weird to me to have an asymmetry between the two interfaces:

ds.orderByDesc('a)
ds.orderByMany('a.desc)

I would rather have a single def orderBy(on: SortedTypedColumn) and use the second syntax in both cases.
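A rough, self-contained sketch of what such a single entry point could look like; all names here (TypedColumn, SortedTypedColumn, TypedDataset) are illustrative stand-ins, not the actual frameless definitions:

// Sketch: one orderBy that always takes sorted columns, so ascending vs.
// descending is chosen at the call site with .asc / .desc in both cases.
final case class TypedColumn[T, A](name: String) {
  def asc:  SortedTypedColumn[T, A] = SortedTypedColumn(name, ascending = true)
  def desc: SortedTypedColumn[T, A] = SortedTypedColumn(name, ascending = false)
}

final case class SortedTypedColumn[T, A](name: String, ascending: Boolean)

final case class TypedDataset[T](rows: Seq[T]) {
  // Varargs covers both the single-column and the multi-column case uniformly.
  def orderBy(columns: SortedTypedColumn[T, _]*): TypedDataset[T] = this // sorting elided
}

// ds.orderBy(a.desc)         instead of ds.orderByDesc('a)
// ds.orderBy(a.asc, b.desc)  instead of ds.orderByMany('a.asc, 'b.desc)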
"Either one of the selected columns is not sortable (${U}), " + | ||
"or no ordering (ascending/descending) has been selected. " + | ||
"Select an ordering on any column (t) using the t.asc or t.desc methods.") | ||
trait SortableColumnTypes[T, U <: HList] |
I think there might be something in shapeless doing just that.
Got it, it's called Comapped
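For reference, shapeless' Comapped[L, F] witnesses that every element of an HList L has the shape F[X]. A small sketch (the demo object name is mine):

import shapeless._
import shapeless.ops.hlist.Comapped

object ComappedDemo extends App {
  // Comapped[L, F] can be summoned only when every element of L has the
  // shape F[X]; its Out member is the HList of the unwrapped X's.
  type Cols = Option[Int] :: Option[String] :: HNil

  implicitly[Comapped[Cols, Option]]            // compiles
  // implicitly[Comapped[Int :: Cols, Option]]  // would not compile: Int is not an Option[_]
}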
Thank you for all the comments @OlivierBlanvillain. This one does two things: it enforces that the columns are CatalystOrdered and that they are SortedTypedColumns. It also gives us the ability to write a custom compilation error (using implicitNotFound). Finally, our ColumnTypes is already a small rework of Comapped, and @kanterov mentioned a performance reason there for preferring our own version over relying on Comapped.
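For intuition only, here is a rough, self-contained sketch of the kind of induction such a constraint performs, using hypothetical stand-ins for CatalystOrdered and SortedTypedColumn (this is not the actual frameless implementation):

import scala.annotation.implicitNotFound
import shapeless._

// Illustrative stand-ins; the real frameless types carry more information.
trait CatalystOrdered[A]
final case class SortedTypedColumn[T, A](name: String)

@implicitNotFound(
  "Either one of the selected columns is not sortable (${U}), " +
  "or no ordering (ascending/descending) has been selected.")
trait SortableColumnTypes[T, U <: HList]

object SortableColumnTypes {
  // Induction over the HList: every element must be a SortedTypedColumn
  // whose underlying type has a CatalystOrdered instance.
  implicit def hnil[T]: SortableColumnTypes[T, HNil] =
    new SortableColumnTypes[T, HNil] {}

  implicit def hcons[T, A, Tail <: HList](
    implicit
    ordered: CatalystOrdered[A],
    tail: SortableColumnTypes[T, Tail]
  ): SortableColumnTypes[T, SortedTypedColumn[T, A] :: Tail] =
    new SortableColumnTypes[T, SortedTypedColumn[T, A] :: Tail] {}
}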
@iravid I quickly went through the Spark jira issues with OrderBy in the title and couldn't find anything similar.
@iravid @OlivierBlanvillain that has been the behavior of Spark for as long as I remember. Now, I am not really sure it's a bug. It seems that they purposefully added the exception there. Read my next comment below. I think they probably want a way to prevent sorting on projections (select). This guard seems to be preventing exactly that.
@iravid @OlivierBlanvillain also forgot to mention that you cannot really project a column when an ordering is applied to it. This was the main reason why we needed a new type for sort-enabled columns.

scala> t.dataset.select($"_1".desc)
res3: org.apache.spark.sql.DataFrame = [_1 DESC NULLS LAST: int]

scala> t.dataset.select($"_1".desc).show()
java.lang.UnsupportedOperationException: Cannot evaluate expression: input[0, int, false] DESC NULLS LAST
  at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:226)
  at org.apache.spark.sql.catalyst.expressions.SortOrder.doGenCode(SortOrder.scala:60)
  at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106)
  at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103)
  at scala.Option.getOrElse(Option.scala:121)
Makes sense. I like your solution :-)
Hey @frosforever, do you think you have time to move this together with the remaining column ordering work in a new PR? Let me know if you need any help.
Hey @imarios I'm not entirely sure what you mean. I can split out the sorting from the window work in #225 and submit a new PR. Comparing to this PR, it seems there are a few differences:
I can submit a PR with the above sometime tomorrow, barring the unforeseen. I somehow missed this PR and am not sure how it originated. Was this done before I opened #225 or was this an effort to split out my ordering work from #225?
@frosforever I started coding this without referencing your PR. Probably my brain subconsciously remembered parts of it (I looked at your PR 1.5 months ago), but it didn't remind me that this PR had somewhat similar parts so that I wouldn't reinvent the wheel. Perhaps surprisingly, the two solutions are not terribly different. Regarding your questions: (1) we can use both (one as an alias to the other), (2) yes, (3) let's stick to CatalystOrdered, (4) no particular reason, let's add this API as well.
@imarios Thanks for the background. It's reassuring we both came up with pretty similar solutions to the same problems!
@imarios What's the status of this PR?
You may wonder why we force users to choose an ordering (ascending/descending) and why we really need a new SortedTypedColumn type. It's best shown with an example (see the Spark session above): oops, applying any further operation on a column after it has been marked for ordering leads to a run-time error. So we need to return a different column-like type that doesn't allow any other ops.
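To illustrate the idea with hypothetical stand-in types (not the actual frameless definitions): if select only accepts plain columns and the sorted column type exposes no further operations, the run-time failure shown earlier becomes a compile-time error.

// Hypothetical sketch: select accepts only plain TypedColumns, so a sorted
// column can never end up inside a projection.
final class TypedColumn[T, A](val expr: String) {
  def asc:  SortedTypedColumn[T, A] = new SortedTypedColumn(expr, ascending = true)
  def desc: SortedTypedColumn[T, A] = new SortedTypedColumn(expr, ascending = false)
}

// Deliberately exposes no further column operations; the only thing left to
// do with it is hand it to orderBy.
final class SortedTypedColumn[T, A](val expr: String, val ascending: Boolean)

final case class TypedDataset[T](rows: Seq[T]) {
  def select[A](c: TypedColumn[T, A]): TypedDataset[A] = TypedDataset(Seq.empty)
  def orderBy(c: SortedTypedColumn[T, _]): TypedDataset[T] = this // sorting elided
}

// ds.select(col.desc)  // does not compile: SortedTypedColumn is not a TypedColumn
// ds.orderBy(col.desc) // fine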