Develop a Roadmap #210

Closed
scopatz opened this issue Jun 11, 2018 · 15 comments

@scopatz

scopatz commented Jun 11, 2018

Hello All,

I'd like to help Dask-ML develop a roadmap, and @TomAugspurger requested that I open an issue here to kickstart the discussion.

For those of you unfamiliar, a roadmap is a listing of near- and long-term goals that the project has, but which have yet to be implemented. These goals are generally larger than a single pull request. "Add GPU Support" or "Port to Python 3" are examples of roadmap items a project might have. However, goals can also include activities that seem small and mundane but are critical to the project, such as "Improve Documentation" or "Achieve 100% Test Coverage."

The purpose of this issue is to let roadmap items be listed and commented on in the discussion. Once we have a reasonable sense of what folks would like to see on the roadmap, the contents should make their way to the website/docs in some fashion. It is also a good idea to list the items in priority order, so that higher-priority items are closer to the top.

Some good examples of project roadmaps are:

A decent example of a project roadmap is Spyder's Roadmap, which looks more like a timeline with milestones.

Roadmaps serve a joint purpose:

  • They help inform and attract potential funders, and some funders require roadmap documents.
  • They help attract contributors, as developers are more likely to jump in and help an established project than start a new project to scratch their itch.

TL;DR: Roadmaps communicate a project's intentions. It would be a good idea for Dask-ML to have one, and I am happy to help facilitate!

So where would you like to see Dask-ML go?

This is a mirror issue to dask/dask#3589

@TomAugspurger
Member

cc @stsievert, I wonder if you could post your summer plans in here, and we can use this issue to develop a wishlist, and from there a prioritized roadmap.

@TomAugspurger
Member

Here's my scattered wishlist. Will add / cleanup things.

Optimization

  • Hogwild type
  • Distributed SGD
  • SAGA / incremental SAGA
  • Adaptive Hyperparameter Optimization

Algorithms

Scikit-Learn compatibility

Miscellaneous

  • Coherent sparse / mixed sparse & dense array support (see "Sparse arrays support", #123)
  • Light-GBM wrapper
    • Similar to dask-xgboost
  • Dask-Tensorflow

@stsievert
Member

My wishlist is similar. Should we edit this wishlist in some place that's more amenable to changes than a GitHub issue? Maybe a fork of dask-ml or a Google Doc.

  • Optimization:
    • I see these algs (e.g., Hogwild) as exploiting dask's distributed architecture
    • These will require a parameter server. Can we make this general and integrate with (for example) CuPy/Chainer and PyTorch?
  • Clean optimization framework.

I have put some thoughts down at https://docs.google.com/document/d/1jsCmPcXlXsSLgdFYgXgngj_P1EkumwZ3MrjkoVaMTjY but it's much less clear than this.

@scopatz
Author

scopatz commented Jun 14, 2018

Thanks @TomAugspurger @stsievert! I think that this is great! It would be awesome if you could expand each of those topics into a one- to three-sentence description.

@Kulbear

Kulbear commented Jun 20, 2018

Do you have a plan to support other algorithms, like tree-based methods and support vector machines?

@lesteve
Member

lesteve commented Jul 2, 2018

I have put some thoughts down at https://docs.google.com/document/d/1jsCmPcXlXsSLgdFYgXgngj_P1EkumwZ3MrjkoVaMTjY but it's much less clear than this.

@stsievert I'd be interested to have a look at your notes (people in my team may be interested to combine dask with PyTorch and/or Tensorflow at one point in the future). It looks like your google document is not public though. Would you be willing to make it public?

@stsievert
Member

@lesteve I think I've made it public. I'd still label it as a work in progress though.

@TomAugspurger
Member

people in my team may be interested to combine dask with PyTorch and/or Tensorflow at one point in the future

Better integration with the various DL frameworks is certainly in scope.

If you have / develop thoughts on how this should be done, then please share them :)

@lesteve
Member

lesteve commented Jul 2, 2018

@lesteve I think I've made it public. I'd still label it as a work in progress though.

Great, thanks!

Better integration with the various DL frameworks is certainly in scope.
If you have / develop thoughts on how this should be done, then please share them :)

I would say we are just getting started so we probably have more questions than answers for now.

@stsievert
Member

If you have / develop thoughts on how this should be done, then please share them

Especially with PyTorch. It certainly feels like there should be an integration with PyTorch because it has torch.distributed and torch.multiprocessing. I almost added it before, and I've added it now.

@mrocklin
Member

mrocklin commented Jul 2, 2018 via email

@lesteve
Member

lesteve commented Jul 3, 2018

I opened #268.

@js3711

js3711 commented Jul 10, 2018

Is there any appetite to port many of the classifiers built for larger-than-memory datasets from spark-ml? I would love to see dask_ml have some of this functionality natively.

https://spark.apache.org/docs/latest/ml-classification-regression.html

Here's a specific example of a decision tree that has been adapted for larger-than-memory training datasets:
https://spark.apache.org/docs/latest/mllib-decision-tree.html

Spark-ml also has many useful feature extractors, selectors, and transformers.

@TomAugspurger
Member

TomAugspurger commented Jul 10, 2018 via email

@js3711

js3711 commented Jul 10, 2018

I am interested specifically in the distributed decision tree implementation with customized stopping criteria at the moment. I currently use trees as a feature extraction method on large datasets.

More generally, it would be great not to have to hop over to Scala :)
