
Efficient sparse arrays storage #44

Closed
rth opened this issue Dec 27, 2016 · 3 comments


rth commented Dec 27, 2016

The performance of the FreeDiscovery REST API is in large part constrained by how fast we can retrieve the extracted features from disk (several GB of sparse arrays from Bag of Words / n-gram models). This is also relevant to issue #43.

Currently joblib is used to store the computed sparse CSR arrays of features, but better solutions could exist to this problem.

The most critical requirements are:

  • fast read performance
  • ability to read a subset of rows without loading the full dataset in memory (~ memory map)
  • (preferably) should not introduce new dependencies
  • (optionally) be compatible with the chosen caching mechanism (see issue #43)

A baseline benchmark could be,

In [12]: from sklearn.externals import joblib
    ...: import scipy.sparse
    ...: 
    ...: filename = '/tmp/test.pkl'
    ...: X = scipy.sparse.random(1000000, 100000, density=1e-3, format='csr')
    ...: 
    ...: %time joblib.dump(X, filename)
    ...: 
    ...: %time joblib.load(filename)
    ...: # load a few rows
    ...: %time joblib.load(filename, mmap_mode='r')[[290290, 2, 3, 0]].data
CPU times: user 184 ms, sys: 880 ms, total: 1.06 s
Wall time: 8.15 s
CPU times: user 192 ms, sys: 432 ms, total: 624 ms
Wall time: 858 ms
CPU times: user 4 ms, sys: 4 ms, total: 8 ms
Wall time: 161 ms
@rth rth added this to the v2.0 milestone Dec 27, 2016
@rth rth removed this from the v2.0 milestone Jan 19, 2017
@rth rth added this to the v2.0 milestone Feb 12, 2017

rth commented Oct 11, 2018

At present, joblib might still be the best way to store sparse arrays. zarr is worth investigating, but its sparse support is not great (it relies on compression rather than a true sparse representation, see zarr-developers/zarr-python#152). https://github.com/pydata/sparse is another relevant project, but it currently only supports the .npz file format.
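
For reference, a minimal sketch of the .npz route for the CSR case (using scipy.sparse.save_npz / load_npz, available since scipy 0.19; the path here is just illustrative) could look like this. Note that load_npz reads the whole matrix back into memory, so it does not cover the "read a subset of rows" requirement above:

import scipy.sparse

filename = '/tmp/test.npz'
X = scipy.sparse.random(1000000, 100000, density=1e-3, format='csr')

# save the CSR matrix as a .npz archive (compressed or not)
scipy.sparse.save_npz(filename, X, compressed=False)

# loads the full matrix back into memory -- no mmap of a row subset
X2 = scipy.sparse.load_npz(filename)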

It looks like joblib is still the way to go, unless one wants to write a custom wrapper to achieve comparable functionality (mmap support, etc.). Closing.

@rth rth closed this as completed Oct 11, 2018
@jakirkham

FWIW it should be possible to store COO directly in Zarr with compression disabled. This may pair nicely with Sparse, which also operates on a COO representation of the data. I was already writing up a brief comment about it when I saw your xref. Not totally sure if this is what you were looking for, but figured I'd share it anyway in case it is. If there is still something else you need, we'd be happy to get more feedback from you. :)
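
Something along these lines could work (a rough sketch assuming the zarr v2 group API and the pydata/sparse COO constructor; the path and array names are just illustrative):

import numpy as np
import scipy.sparse
import sparse
import zarr

X = scipy.sparse.random(1000000, 100000, density=1e-3, format='coo')

# store the COO components as separate, uncompressed Zarr arrays
g = zarr.open_group('/tmp/test.zarr', mode='w')
g.create_dataset('row', data=X.row, compressor=None)
g.create_dataset('col', data=X.col, compressor=None)
g.create_dataset('data', data=X.data, compressor=None)
g.attrs['shape'] = X.shape

# rebuild a pydata/sparse COO array from the stored components
g2 = zarr.open_group('/tmp/test.zarr', mode='r')
coords = np.stack([g2['row'][:], g2['col'][:]])
Y = sparse.COO(coords, g2['data'][:], shape=tuple(g2.attrs['shape']))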


rth commented Dec 19, 2018

Thanks for the information @jakirkham! Zarr is indeed an interesting possibility. So far I have been avoiding the sparse COO format (and, as a result, the sparse package) in favor of CSR (which scikit-learn uses everywhere), but maybe I should re-evaluate that. Still, a ~2x increase in memory usage is unfortunate unless there are performance benefits I'm not aware of that compensate for it.
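
Back-of-the-envelope: CSR stores data + indices (one index entry per nonzero) + a small indptr (n_rows + 1), while COO stores data + row + col (two index entries per nonzero), so the index arrays roughly double, though the total footprint with float64 data and int32 indices grows by closer to ~1.3x. A quick sanity check on a smaller matrix of the same density (sketch only):

import scipy.sparse

X_csr = scipy.sparse.random(100000, 10000, density=1e-3, format='csr')
X_coo = X_csr.tocoo()

# CSR: data + indices (nnz entries) + indptr (n_rows + 1 entries)
csr_bytes = X_csr.data.nbytes + X_csr.indices.nbytes + X_csr.indptr.nbytes
# COO: data + row + col (nnz entries each)
coo_bytes = X_coo.data.nbytes + X_coo.row.nbytes + X_coo.col.nbytes

print(coo_bytes / csr_bytes)  # ~1.3x here; ~2x on the index arrays alone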
