Missing Dataset methods #163

Open
62 of 76 tasks
OlivierBlanvillain opened this issue Aug 8, 2017 · 3 comments

Comments

@OlivierBlanvillain
Contributor

OlivierBlanvillain commented Aug 8, 2017

Here is an exhaustive status of the API implemented by frameless.TypedDataset compared to Spark's Dataset. We are getting pretty close to 100% API coverage 😄
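
For readers new to the library, here is a minimal sketch of what the typed side of this comparison looks like (assuming frameless 0.x-era syntax and a made-up Apartment case class; the SparkSession setup is illustrative only):

```scala
import frameless.TypedDataset
import org.apache.spark.sql.SparkSession

// Hypothetical record type, used only for illustration.
case class Apartment(city: String, surface: Int, price: Double)

object TypedDatasetSketch {
  // TypedDataset.create needs an implicit SparkSession; a TypedEncoder is derived for the case class.
  implicit val spark: SparkSession =
    SparkSession.builder().master("local[*]").appName("frameless-sketch").getOrCreate()

  val apartments = Seq(Apartment("Paris", 50, 300000.0), Apartment("Nice", 74, 325000.0))

  // Typed counterpart of spark.createDataset(apartments).
  val aptDs: TypedDataset[Apartment] = TypedDataset.create(apartments)

  // Column references are checked at compile time:
  // aptDs('cty) would not compile, whereas Dataset.col("cty") only fails at runtime.
  val cities: TypedDataset[String] = aptDs.select(aptDs('city))
}
```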

Won't fix:

  • Dataset alias(String alias) inherently unsafe
  • Dataset withColumnRenamed(String existingName, String newName) inherently unsafe
  • void createGlobalTempView(String viewName) inherently unsafe
  • void createOrReplaceTempView(String viewName) inherently unsafe
  • void createTempView(String viewName) inherently unsafe
  • void registerTempTable(String tableName) inherently unsafe
  • Dataset where(String conditionExpr) use select instead
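
To make the "inherently unsafe" label above concrete, here is a minimal plain-Spark sketch (column names made up): a misspelled column in a string expression only fails at runtime, and withColumnRenamed on a missing column is a silent no-op.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}

object StringlyTypedSketch {
  val spark: SparkSession =
    SparkSession.builder().master("local[*]").appName("unsafe-sketch").getOrCreate()
  import spark.implicits._

  val df = Seq(("Paris", 300000.0), ("Nice", 325000.0)).toDF("city", "price")

  // Compiles fine, but the misspelled column is only caught when the plan is analysed at runtime.
  try df.where("pric > 310000").show()
  catch { case e: AnalysisException => println(s"runtime failure: ${e.getMessage}") }

  // Renaming a column that does not exist never fails: it is documented as a no-op.
  df.withColumnRenamed("pric", "cost").printSchema()
}
```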

TODO:

Done:

  • Dataset sort(String sortCol, String... sortCols) (Window dense rank #248)
  • Dataset sortWithinPartitions(String sortCol, String... sortCols) (Window dense rank #248)
  • Dataset repartition(int numPartitions, Column... partitionExprs)
  • Dataset drop(String... colNames) (I#163 dataset drop #209)
  • Dataset join(Dataset<?> right, Column joinExprs, String joinType)
  • Dataset<scala.Tuple2<T,U>> joinWith(Dataset other, Column condition, String joinType)
  • Dataset crossJoin(Dataset<?> right)
  • Dataset agg(Column expr, Column... exprs)
  • Column apply(String colName)
  • Dataset as(Encoder evidence2)
  • Dataset cache()
  • Dataset coalesce(int numPartitions)
  • Column col(String colName)
  • Object collect()
  • long count()
  • Dataset distinct()
  • Dataset except(Dataset other)
  • void explain(boolean extended)
  • <A,B> Dataset explode(String inputColumn, String outputColumn, scala.Function1<A,TraversableOnce<B>> f)
  • Dataset filter(Column condition)
  • Dataset filter(scala.Function1<T,Object> func)
  • T first() (as firstOption)
  • Dataset flatMap(scala.Function1<T,TraversableOnce> func, Encoder evidence8)
  • void foreach(ForeachFunction func)
  • void foreachPartition(scala.Function1<Iterator,scala.runtime.BoxedUnit> f)
  • RelationalGroupedDataset groupBy(String col1, String... cols)
  • Dataset intersect(Dataset other)
  • Dataset limit(int n)
  • Dataset map(scala.Function1<T,U> func, Encoder evidence6)
  • Dataset mapPartitions(MapPartitionsFunction<T,U> f, Encoder encoder)
  • Dataset persist(StorageLevel newLevel)
  • void printSchema()
  • RDD rdd()
  • T reduce(scala.Function2<T,T,T> func) (as reduceOption)
  • Dataset repartition(int numPartitions)
  • Dataset sample(boolean withReplacement, double fraction, long seed)
  • Dataset select(String col, String... cols)
  • void show(int numRows, boolean truncate)
  • Object take(int n)
  • Dataset toDF()
  • String toString()
  • Dataset transform(scala.Function1<Dataset,Dataset> t)
  • Dataset union(Dataset other)
  • Dataset unpersist(boolean blocking)
  • Dataset withColumn(String colName, Column col)
  • Dataset orderBy(String sortCol, String... sortCols)
  • String[] columns()
  • org.apache.spark.sql.execution.QueryExecution queryExecution()
  • StructType schema()
  • SparkSession sparkSession()
  • SQLContext sqlContext()
  • Dataset checkpoint(boolean eager)
  • String[] inputFiles()
  • boolean isLocal()
  • boolean isStreaming()
  • Dataset[] randomSplit(double[] weights, long seed)
  • StorageLevel storageLevel()
  • Dataset toJSON()
  • java.util.Iterator toLocalIterator()
  • DataFrameWriter write()
@snadorp
Contributor

snadorp commented Jul 25, 2018

<A,B> Dataset explode(String inputColumn, String outputColumn, scala.Function1<A,TraversableOnce<B>> f) does not work for Map-typed columns, while vanilla Spark supports them.
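
For reference, this is the vanilla Spark behaviour being compared against (a minimal sketch using functions.explode; data made up): exploding a MapType column yields one row per entry, with two output columns, key and value.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

object MapExplodeSketch {
  val spark: SparkSession =
    SparkSession.builder().master("local[*]").appName("map-explode").getOrCreate()
  import spark.implicits._

  // Each row holds a Map[String, Int].
  val df = Seq(Map("a" -> 1, "b" -> 2), Map("c" -> 3)).toDF("m")

  // Vanilla Spark explodes a MapType column into one row per entry,
  // producing two columns (key, value) rather than a single element column.
  df.select(explode($"m")).show()
}
```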

@imarios
Contributor

imarios commented Jul 25, 2018

Yes, I was not able to fit Map because its type signature has two type holes, compared to one for all the others. I think we could add an overloaded method just for Map.
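
A rough sketch of what such an overload's signature could look like (purely hypothetical, names and shape made up for discussion; not frameless's actual API):

```scala
import frameless.{TypedDataset, TypedEncoder}

// Hypothetical signature only: a Map-specific explode needs two type holes,
// K for keys and V for values, yielding one (K, V) row per map entry,
// instead of the single element type used by the collection-based explode.
trait MapExplodeOverload {
  def explodeMap[A, K: TypedEncoder, V: TypedEncoder](
      ds: TypedDataset[A])(f: A => Map[K, V]): TypedDataset[(K, V)]
}
```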

@etspaceman
Contributor

writeStream can be marked as done here.
