
add drop to a case class #212

Merged: 1 commit into master, Dec 8, 2017
Conversation

frosforever (Contributor):

Connects to #163 and #209

Similar to #208 adds support for dropping a column from a dataset with greater than 22 columns by returning a specified case class instead of a tuple.
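A hedged sketch of the call shape being proposed at this point in the PR (the `x.drop[U]('w)` form mentioned later in the review); the case classes and column names here are illustrative, not from the PR diff:

```scala
import frameless.TypedDataset

// Illustrative types; in practice the source type may have more than
// 22 fields, which is what rules out returning a tuple.
case class Wide(i: Int, j: Int, k: Boolean)
case class Narrow(i: Int, j: Int)

// Drop column `k` from a TypedDataset[Wide], materializing the result
// as the case class Narrow rather than a tuple (requires a Spark
// session and the frameless dependency, so shown as a sketch):
// val wide: TypedDataset[Wide] = TypedDataset.create(Wide(1, 1, false) :: Nil)
// val narrow: TypedDataset[Narrow] = wide.drop[Narrow]('k)
```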

@codecov-io commented Nov 29, 2017

Codecov Report

Merging #212 into master will increase coverage by 0.12%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #212      +/-   ##
==========================================
+ Coverage   95.86%   95.99%   +0.12%     
==========================================
  Files          42       42              
  Lines         847      848       +1     
  Branches       11       11              
==========================================
+ Hits          812      814       +2     
+ Misses         35       34       -1
Impacted Files Coverage Δ
...ataset/src/main/scala/frameless/TypedDataset.scala 93.89% <100%> (+0.04%) ⬆️
...c/main/scala/frameless/TypedDatasetForwarded.scala 54.16% <0%> (+4.16%) ⬆️

Last update 88b20c7...0fab834.

@imarios (Contributor) left a comment:

Looks really good! Let's discuss the comments a bit and we can merge.

*
* @see [[frameless.TypedDataset.dropApply#apply]]
*/
def drop[U] = new dropApply[U]
imarios (Contributor):
Looks kind of strange that the types are documented here, but they are not defined here ...

frosforever (Contributor, Author):

Agreed that it looks strange, but it's the same thing that I did with withColumn:

/** Adds a column to a Dataset so long as the specified output type, `U`, has
* an extra column from `T` that has type `A`.
*
* @example
* {{{
* case class X(i: Int, j: Int)
* case class Y(i: Int, j: Int, k: Boolean)
* val f: TypedDataset[X] = TypedDataset.create(X(1,1) :: X(1,1) :: X(1,10) :: Nil)
* val fNew: TypedDataset[Y] = f.withColumn[Y](f('j) === 10)
* }}}
* @param ca The typed column to add
* @param i0 TypeEncoder for output type U
* @param i1 TypeEncoder for added column type A
* @param i2 the LabelledGeneric derived for T
* @param i3 the LabelledGeneric derived for U
* @param i4 proof no fields have been removed
* @param i5 diff from T to U
* @param i6 keys from newFields
* @param i7 the one and only new key
* @param i8 the one and only new field enforcing the type of A exists
* @param i9 the keys of U
* @param iA allows for traversing the keys of U
* @tparam U the output type
* @tparam A The added column type
* @tparam TRep shapeless' record representation of T
* @tparam URep shapeless' record representation of U
* @tparam UKeys the keys of U as an HList
* @tparam NewFields the added fields to T to get U
* @tparam NewKeys the keys of NewFields as an HList
* @tparam NewKey the first, and only, key in NewKey
*
* @see [[frameless.TypedDataset.withColumnApply#apply]]
*/
def withColumn[U] = new withColumnApply[U]

The reason for doing so is to get the nice Scaladoc in IDEs etc. at the method call site that the user will see. I don't have strong feelings either way but originally moved it due to the suggestion in #208 (comment).

Whatever the decision, we should be consistent and update both this and withColumn.

check(prop[SQLDate] _)
check(prop[Option[X1[Boolean]]] _)
test("remove column in the middle") {
val f: TypedDataset[X] = TypedDataset.create(X(1,1, false) :: X(1,1, false) :: X(1,10, false) :: Nil)
imarios (Contributor):

minor: spacing seems inconsistent. (same for the other X literals)

*/
def drop[U] = new dropApply[U]

class dropApply[U] {
imarios (Contributor):
dropApply can have a regular class name DropApply.

frosforever (Contributor, Author):
sure, I'll update

as well then

@frosforever

@imarios Thanks once again for the review!

Let me know how to proceed regarding #212 (comment)

@imarios

imarios commented Dec 6, 2017

@frosforever I had this thought after finishing the report: Isn't x.drop[U]('w) a more verbose version of x.project[U]? Actually, with project you can drop multiple columns at once instead of having to chain multiple .drops. What do you think? Are there cases where you would prefer to use drop over project?

@frosforever

@imarios Ha that's an excellent point!
I would say there are two reasons to have drop, but neither is imperative.

  1. It can better convey intention. Adding a column, filtering on it, and then dropping it (often done with ranking to get the top rows by some category) is a common enough pattern that it is better expressed using withColumn and then drop. Depending on how Named intermediate Datasets #188 goes, this can be even more useful if HLists or Records are the intermediate type.

  2. Improve coverage of available spark functions. A user coming from spark expects to see drop.

What do you think?
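The add-filter-drop pattern in point 1 might look roughly like this (a hedged sketch; the types, column names, and predicate are illustrative, with the withColumn call shape taken from the Scaladoc example quoted above):

```scala
import frameless.TypedDataset

// Tag each row with a helper column, filter on it, then drop the
// helper again so the result has the original shape.
case class Score(name: String, points: Int)
case class Flagged(name: String, points: Int, top: Boolean)

// Sketch of the pipeline (requires a Spark session to actually run):
// val scores: TypedDataset[Score] = TypedDataset.create(
//   Score("a", 10) :: Score("b", 3) :: Nil)
// val flagged: TypedDataset[Flagged] =
//   scores.withColumn[Flagged](scores('points) >= 10) // add helper column
// val best: TypedDataset[Score] =
//   flagged.filter(flagged('top) === true).drop[Score]('top) // drop it again
```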

@imarios

imarios commented Dec 7, 2017

I think keeping drop for consistency with Spark makes sense. How about we keep drop[U] as an alias of project[U]? That is, def drop[U] = project[U]. Someone may complain that with project you may not actually drop anything, but I think that's unlikely to happen accidentally. project has the added benefit that you can drop more than one column; and as a bonus it lets you also reorder fields on the resulting class. What do you think?
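A hedged sketch of the suggested alias: the implicit parameter shape is an assumption about how frameless resolves project, and the case classes are illustrative. It shows project's two bonuses from the comment above: dropping several columns at once and reordering the remaining fields.

```scala
// Sketch of the alias, assuming project is driven by an implicit
// SmartProject[T, U] instance (an assumption about its signature):
// def drop[U](implicit projector: SmartProject[T, U]): TypedDataset[U] =
//   project[U]

case class Wide(a: Int, b: String, c: Boolean, d: Long)
case class Narrow(d: Long, a: Int) // drops b and c, and puts d before a

// Both calls would then be equivalent:
// val viaProject: TypedDataset[Narrow] = wide.project[Narrow]
// val viaDrop:    TypedDataset[Narrow] = wide.drop[Narrow]
```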

@frosforever

@imarios agreed. Updated PR.

@imarios imarios self-requested a review December 8, 2017 03:02
@imarios imarios merged commit f249172 into typelevel:master Dec 8, 2017