Add TypedTransformer and TypedEstimator, towards a type-safe Spark ML API #206
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #206      +/-   ##
==========================================
+ Coverage   95.71%   96.01%     +0.3%
==========================================
  Files          39       52       +13
  Lines         793      904      +111
  Branches       11       14        +3
==========================================
+ Hits          759      868      +109
- Misses         34       36        +2

Continue to review full report at Codecov.
@atamborrino Thanks for the PR, this looks amazing! I'm really unfamiliar with these APIs so I couldn't do more than a superficial review. Maybe I will be able to understand more once you add documentation.
By the way I think it could be useful to make a "meta issue" for the ML stuff, just to have a global view of how much work there is here, what you can plan to do and so on.
import org.apache.spark.ml.{Estimator, Model}

/**
 * A TypedEstimator `fit` method takes as input a TypedDataset containing `Inputs`and
Space missing between these two: `Inputs` & `and`
 * A TypedEstimator `fit` method takes as input a TypedDataset containing `Inputs`and
 * return an AppendTransformer with `Inputs` as inputs and `Outputs` as outputs
 */
abstract class TypedEstimator[Inputs, Outputs, M <: Model[M]] private[ml] {
Could you explain the need for F-bounded polymorphism here?
That's how Spark's `Estimator` is modeled, unfortunately.
 * A TypedEstimator `fit` method takes as input a TypedDataset containing `Inputs`and
 * return an AppendTransformer with `Inputs` as inputs and `Outputs` as outputs
 */
abstract class TypedEstimator[Inputs, Outputs, M <: Model[M]] private[ml] {
Any reason to use `abstract class` and not `trait`?
Nope, no special reason, I'll move it to a trait.
prepend: Prepend.Aux[TVals, OutputsVals, OutVals],
tupler: Tupler.Aux[OutVals, Out],
outEncoder: TypedEncoder[Out]
): TypedDataset[Out] = {
I think the names really reduce readability here... Lately I've been using the following pattern:
(implicit
i0: SmartProject[T, Inputs],
i1: Generic.Aux[T, TVals],
i2: Generic.Aux[Outputs, OutputsVals],
i3: Prepend.Aux[TVals, OutputsVals, OutVals],
i4: Tupler.Aux[OutVals, Out],
i5: TypedEncoder[Out]
)
And then query things by type in the method body if needed.
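To illustrate this pattern outside frameless (a hedged sketch — `describeMax`, `I0Pattern` and the typeclasses used here are invented for the example, not frameless code): with the `i0`..`iN` naming, the evidence names never appear in the method body; anything needed is re-summoned by type with `implicitly`:

```scala
// Hypothetical example of the i0..iN naming convention: implicit evidence
// gets opaque names, and the body retrieves what it needs by type.
object I0Pattern {
  def describeMax[A](xs: List[A])(implicit
    i0: Ordering[A], // never referenced by name below
    i1: Numeric[A]
  ): String = {
    val max = xs.max(implicitly[Ordering[A]]) // summoned by type, not by name
    s"max=${max}, zero=${implicitly[Numeric[A]].zero}"
  }
}
```

The point is that renaming or reordering the evidence parameters never forces edits in the body, since nothing refers to `i0`/`i1` directly.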
Agreed, it will improve readability, I'll change it.
val sparkValue: String
}
object FeatureSubsetStrategy {
  case object Auto extends FeatureSubsetStrategy {
I guess you could reduce the boilerplate a bit by having `FeatureSubsetStrategy` as a class and using `extends FeatureSubsetStrategy("auto")` instead.
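A sketch of the suggestion (the ADT shape is assumed from the snippet above; the strategy strings are the ones Spark's `featureSubsetStrategy` parameter accepts): moving `sparkValue` into the constructor of a sealed abstract class means each case object no longer has to override a `val` in its own body.

```scala
// Push sparkValue into the constructor instead of overriding it per object.
sealed abstract class FeatureSubsetStrategy(val sparkValue: String)

object FeatureSubsetStrategy {
  case object Auto     extends FeatureSubsetStrategy("auto")
  case object All      extends FeatureSubsetStrategy("all")
  case object OneThird extends FeatureSubsetStrategy("onethird")
  case object Sqrt     extends FeatureSubsetStrategy("sqrt")
  case object Log2     extends FeatureSubsetStrategy("log2")
}
```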
/**
 * Typeclass supporting record selection by value type (returning the first key whose value is of type `Value`)
 */
trait SelectorByValue[L <: HList, Value] extends DepFn1[L] with Serializable { type Out <: Symbol }
I think this could be up-streamed to shapeless
I think you only use it at the type level; you might want to remove the apply altogether for this usage.
Yes, upstreaming this to shapeless was my intention :) (that's why I also added the apply pattern). But I prefer to include it in this PR to avoid having to wait for a new version of shapeless to be published. Once I've done the shapeless PR and it's merged, I'll remove it from here via a new PR.
import org.scalatest.MustMatchers

class TypedRandomForestClassifierTests extends FramelessMlSuite with MustMatchers {
  implicit val arbDouble: Arbitrary[Double] =
I wonder if this would be worth a new type.
val rf = TypedRandomForestRegressor.create[X2[Double, Vector]]()
  .setNumTrees(10)
  .setMaxBins(100)
  .setFeatureSubsetStrategy(FeatureSubsetStrategy.All)
It looks like you only tested `All`, could you add something for the other cases?
@OlivierBlanvillain BTW, for params like `numTrees`, `maxDepth` etc. that have a bounded valid range, I was thinking about using https://github.com/fthomas/refined to check the correctness of the parameter at compile-time. Does including refined in frameless-ml's build seem ok to you?
I would stick to shapeless for now; you could use `NatProductArgs` and ask for a `nat.ops.Range.Aux[_0, _100, N]`. But don't do that in this PR, it's already getting pretty big :)
Ok :)
  tt: TypedVectorAssemblerInputsValueChecker[T]
): TypedVectorAssemblerInputsValueChecker[H :: T] = new TypedVectorAssemblerInputsValueChecker[H :: T] {}

implicit def hlistCheckInputsValueBoolean[T <: HList](
codecov says this case is never hit in tests
Right, I forgot to check the property...
@atamborrino isn't there already a typeclass in shapeless for checking that all fields of a type are subtypes of another type? This is related to TypedVectorAssemblerInputsValueChecker.
This is very exciting and I can't wait to use it :-)
def fit[T](ds: TypedDataset[T])(implicit smartProject: SmartProject[T, Inputs]): AppendTransformer[Inputs, Outputs, M] = {
  val inputDs = smartProject.apply(ds)
  val model = estimator.fit(inputDs.dataset)
Hmm, this is an effect. Spark usually runs a bunch of actions in the `fit` methods. The rest of the frameless APIs wrap these effects in `SparkDelay.delay`. Perhaps this should too?
You're right, I forgot about that! I'll make the change.
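To illustrate the suggested change (a self-contained sketch: frameless' real `SparkDelay` has a different signature, and the `Delay`/`Thunk` pair here is invented for the example), suspending the eager `fit` action in an effect `F[_]` means nothing runs until the caller executes the returned effect:

```scala
// Minimal stand-in for an effect-suspension typeclass like SparkDelay.
trait Delay[F[_]] { def delay[A](a: => A): F[A] }

object DelayedFit {
  // A trivially runnable effect type: a thunk.
  type Thunk[A] = () => A
  implicit val thunkDelay: Delay[Thunk] = new Delay[Thunk] {
    def delay[A](a: => A): Thunk[A] = () => a
  }

  var fitCalls = 0 // observe when the "action" actually runs

  // The fitting action is suspended in F instead of running eagerly.
  def fit[F[_]](implicit F: Delay[F]): F[String] =
    F.delay { fitCalls += 1; "model" }
}
```

Calling `DelayedFit.fit[DelayedFit.Thunk]` builds the program without running it; only invoking the thunk performs the side effect, which is exactly the property `SparkDelay.delay` gives the other frameless APIs.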
  tupler: Tupler.Aux[OutVals, Out],
  outEncoder: TypedEncoder[Out]
): TypedDataset[Out] = {
  val transformed = transformer.transform(ds.dataset).as[Out](TypedExpressionEncoder[Out])
Same re effects
test("fit() returns a correct TypedTransformer") {
  val prop = forAll { x2: X2[Double, Vector] =>
    val rf = TypedRandomForestClassifier.create[X2[Double, Vector]]()
Aesthetics suggestion: perhaps `TypedRandomForestClassifier[X2[Double, Vector]]` would be more ergonomic? E.g. an `apply` method with no explicit parameter list.
Agreed.
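A sketch of what that ergonomics could look like (the `InputsChecker` evidence and its instance are invented stand-ins for the real inputs-checker typeclass): when `apply`'s only parameter list is implicit, `TypedRandomForestClassifier[MyInputs]` compiles as a bare expression, the same way `Generic[Foo]` does in shapeless.

```scala
// Invented evidence typeclass standing in for the real inputs checker.
trait InputsChecker[Inputs] { def featuresCol: String }

object InputsChecker {
  // Example instance for a concrete inputs type.
  implicit val pair: InputsChecker[(Double, String)] =
    new InputsChecker[(Double, String)] { val featuresCol = "features" }
}

final class TypedRandomForestClassifier[Inputs](val featuresCol: String)

object TypedRandomForestClassifier {
  // No explicit value-parameter list: callers write
  // TypedRandomForestClassifier[T] instead of .create[T]().
  def apply[Inputs](implicit checker: InputsChecker[Inputs]): TypedRandomForestClassifier[Inputs] =
    new TypedRandomForestClassifier[Inputs](checker.featuresCol)
}
```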
val featuresCol: String
}

private[ml] object TypedRandomForestRegressorInputsChecker {
Since this typeclass is going to be used in basically all ML algos, can we package it up as something generic? E.g. something that converts the input type to a `Record` and uses `Align` to make sure that the input record conforms to a specified requirements `Record`.
Yes agreed, I'll try to come up with something more generic for this.
@OlivierBlanvillain @iravid I've pushed fixes for the past reviews. Looking at #207, I was wondering if it could be interesting that
@atamborrino thanks, I'll try to do a second pass this weekend.
I agree with what you say here. Unless it's a blocker for you, I propose to iterate: merge this PR first and try to fix #207 separately for all the affected methods.
I am also looking at this. @atamborrino amazing work, thank you!
 * An AppendTransformer `transform` method takes as input a TypedDataset containing `Inputs` and
 * return a TypedDataset with `Outputs` columns appended to the input TypedDataset.
 */
trait AppendTransformer[Inputs, Outputs, InnerTransformer <: Transformer] extends TypedTransformer {
@atamborrino do we need to have our own Append operation here? Shapeless provides an operation that may do exactly this.
`AppendTransformer` does a little more than just appending: it first checks that `T` in `.transform[T]` can be projected to `Inputs`, and then indeed it uses shapeless' Prepend operation (see `i3: Prepend.Aux[TVals, OutputsVals, OutVals]`). The type-level append logic is basically the same as in `TypedDataset.withColumn`.
@atamborrino I think we should add the scala docs from Spark's Transformers and Estimators to our versions. I find myself going back and forth to Spark to remind myself what these classes do.
@atamborrino same for all the other classes, like VectorAssembler etc.
trait AppendTransformer[Inputs, Outputs, InnerTransformer <: Transformer] extends TypedTransformer {
  val transformer: InnerTransformer

  def transform[T, TVals <: HList, OutputsVals <: HList, OutVals <: HList, Out, F[_]](ds: TypedDataset[T])(
@atamborrino to the best of my knowledge, when you execute transform, this is a lazy operation from Dataset to Dataset. I don't think we need to enclose it in the effectful `SparkDelay`. I think transform should just take a `TypedDataset[T]` and return a `TypedDataset[Out]`.
I sampled a few mllib `Model`s and that looks right, although they may throw exceptions (`require` is used liberally).
I think all comments have been addressed, @atamborrino is there anything else you want to do before the merge?
I just have to update documentation, I'll do it today or tomorrow.
Documentation pushed. Will it be possible to publish this new documentation to https://typelevel.org/frameless/TypedML.html without having to wait for a new frameless release?
Docs LGTM! Yes, we can publish the docs separately, it's not fully automated at the moment but we have this script.
Hi @atamborrino, I think it doesn't make sense for the docs to include things that are not in the current release, so I would rather not push the docs out right now. We can still review what you have here and add it to the code base when we merge this PR. That said, cutting a new release looks like a pretty good option right now, especially with all the cool stuff you did in the ML package. I want to get a few more hours this weekend to give a closer look and probably help in merging this PR. When the release is out, we will update the docs and everything should be in sync. Thank you again for all the hard work!
I guess you could always publish the docs to https://atamborrino.github.io/frameless/ using the above script.
docs/src/main/tut/TypedML.md
Both `TypedEstimator` and `TypedTransformer` check at compile-time the correctness of their input field names and types,
contrary to the Spark ML API which only deals with DataFrames (the data structure with the lowest level of type-safety in Spark).

`frameless-ml` adds type-safety to the Spark ML API but stays very close to it in terms of abstractions and code flow, so
replace "code flow" with "API calls"?
val assembler = TypedVectorAssembler[Features]

case class DataWithFeatures(field1: Double, field2: Int, field3: Double, features: Vector)
val trainingDataWithFeatures = assembler.transform(trainingData).as[DataWithFeatures]
As you had in the ticket, I think it's good to mention here that the `.as[DataWithFeatures]` is a type-safe cast.
Then, we train the model:

```tut:book
case class RFInputs(field3: Double, features: Vector)
can we explain here why we also need field3? I would think that only the features vector is needed.
val assembler = TypedVectorAssembler[Features]

case class DataWithFeatures(field1: Double, field2: Int, field3: Double, features: Vector)
val trainingDataWithFeatures = assembler.transform(trainingData).as[DataWithFeatures]
Can you add the type annotation for trainingDataWithFeatures? I think it would help here.
In the compiled/interpreted documentation, we will have the result type:
val trainingDataWithFeatures = assembler.transform(trainingData).as[DataWithFeatures]
// trainingDataWithFeatures: frameless.TypedDataset[DataWithFeatures] = [field1: double, field2: int ... 2 more fields]
Do you think we still need to add the type annotation in the Scala line code?
```

```tut:book
case class Data(field1: Double, field2: Int, field3: Double)
Maybe we should rename field3 to "label" and field1 and field2 to feature1, feature2? By giving them a semantic naming it will help the reader to follow the example easier.
By using generic field names like field1, field2 and field3, I wanted to emphasize that TypedTransformers and TypedEstimators do not rely on hard-coded field names for features and label, and can operate on already-well-named user data fields like "temperature", "housePrice", ...
Maybe we should take a less abstract example. For example, we could try to predict the "housePrice" (a classic ML problem) based on the house's square footage and whether the house has a garden or not. This would give the following data:
case class HouseData(squareFeet: Double, hasGarden: Boolean, price: Double)
What do you think?
I like it!
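A sketch of how the docs' data model could look with the house-price naming from the comment above (the field layout and sample values are illustrative, mirroring the `Data`/`Features` classes in the current docs):

```scala
// Semantic names replacing field1/field2/field3 from the docs example.
case class HouseData(squareFeet: Double, hasGarden: Boolean, price: Double)

// Fields the vector assembler would gather into the feature vector.
case class Features(squareFeet: Double, hasGarden: Boolean)

object HousePriceExample {
  // Made-up training rows, just to show the shape of the data.
  val trainingData: Seq[HouseData] = Seq(
    HouseData(20, false, 100000),
    HouseData(50, false, 200000),
    HouseData(50, true, 250000),
    HouseData(100, true, 500000)
  )
}
```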
rf.fit(wrongTrainingDataWithFeatures)
```

For new-comers to frameless, please note that `typedDataset.as[...]` is a type-safe cast,
I would remove the "For new-comers to frameless," and place this section where we first call .as[]
### Prediction

We now want to predict `field3` for `testData` using the previously trained model. Please note that, like Spark ML API,
like the Spark ML API
As with Spark ML API, we use a `TypedVectorAssembler` to compute feature vectors and a `TypedStringIndexer`
to index `field3` values in order to be able to pass them to a `TypedRandomForestClassifier`
(which only accepts indexed Double values as label):
Maybe it's indexed String instead of indexed Double values?
`frameless-ml` provides `TypedEncoder` instances for `org.apache.spark.ml.linalg.Vector`
and `org.apache.spark.ml.linalg.Matrix`:

```tut:book
```tut:silent
I think it's good to include here the imports a user needs to add. What do you think?
`tut:silent` will indeed show the imports in the documentation (unlike `tut:invisible`), it's just that it will not add the interpreted lines:

// import frameless.ml._
// import org.apache.spark.ml.linalg._
sorry, yes you are right
Hi @atamborrino! Again, amazing work. I added a few comments in the tut docs. Let me know what you think.
@atamborrino looks really good! Should I merge? Do you want to add anything?
@imarios Ready for the merge! :)
This PR introduces `TypedEstimator` and `TypedTransformer` to frameless-ml, the type-safe equivalents of Spark ML's `Estimator` and `Transformer`. The goal is to provide a type-safe API for Spark ML (which currently only deals with DataFrames, the data structure with the lowest level of type-safety in Spark).

Several ML models and ML transformers have been implemented extending `TypedEstimator` and `TypedTransformer`, such as `TypedRandomForestRegressor`, `TypedRandomForestClassifier`, `TypedIndexToString`, `TypedVectorAssembler` and `TypedStringIndexer`. The final goal is to provide a typed version for every estimator and transformer of Spark ML.

As an example, the following code snippet uses a Random Forest algorithm to predict `field3` from `field1` and `field2`. Every transformation and fitting is type-safe (input fields are checked for correctness at compile-time), so no runtime exception can be thrown from the below code. For non-frameless users, please note that `typedDataset.as[...]` is a type-safe cast.

In the above example, `TypedRandomForestRegressor[RFInputs]` will compile only if the given `RFInputs` case class contains only one field of type Double (the label) and one field of type Vector (the features). The resulting `rf.fit(trainingDataWithFeatures)` fitting will only compile if `trainingData` contains the same fields (names and types) as `RFInputs`.

All checks are done at compile-time, so runtime performance is the same as with the Spark ML DataFrame API (or even better if you consider that a ML model input mismatch is now caught at compile-time rather than after one hour of feature engineering on your Spark cluster ;))

I have not updated the documentation for now; I prefer to wait for reviews and then write it once we all agree on the API.
I have not updated the documentation for now, I prefer to wait for reviews and then write it once we all agree on the API.