
Efficient sparse arrays storage #44

Closed
rth opened this issue Dec 27, 2016 · 3 comments


rth commented Dec 27, 2016

The performance of the FreeDiscovery REST API is in large part constrained by how fast we can retrieve the extracted features from disk (several GB of sparse arrays from Bag of Words / n-gram models). This is also relevant to issue #43.

Currently joblib is used to store the computed sparse CSR arrays of features, but better solutions could exist to this problem.

The most critical requirements are:

  • fast read performance
  • ability to read a subset of rows without loading the full dataset in memory (~ memory map)
  • (preferably) should not introduce new dependencies
  • (optionally) be compatible with the chosen caching mechanism (see issue #43)

A baseline benchmark could be,

In [12]: from sklearn.externals import joblib
    ...: import scipy.sparse
    ...: 
    ...: filename = '/tmp/test.pkl'
    ...: X = scipy.sparse.random(1000000, 100000, density=1e-3, format='csr')
    ...: 
    ...: %time joblib.dump(X, filename)
    ...: 
    ...: %time joblib.load(filename)
    ...: # load a few rows
    ...: %time joblib.load(filename, mmap_mode='r')[[290290, 2, 3, 0]].data
CPU times: user 184 ms, sys: 880 ms, total: 1.06 s
Wall time: 8.15 s
CPU times: user 192 ms, sys: 432 ms, total: 624 ms
Wall time: 858 ms
CPU times: user 4 ms, sys: 4 ms, total: 8 ms
Wall time: 161 ms
@rth rth added this to the v2.0 milestone Dec 27, 2016
@rth rth removed this from the v2.0 milestone Jan 19, 2017
@rth rth added this to the v2.0 milestone Feb 12, 2017

rth commented Oct 11, 2018

At present, joblib might still be the best way to store sparse arrays. zarr is worth investigating, but its sparse support is not great (it relies on compression rather than a true sparse representation, see zarr-developers/zarr-python#152). https://github.com/pydata/sparse is another relevant project, but it currently only supports the .npz file format.
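
For reference, a minimal sketch of the .npz route for the CSR case (using scipy.sparse.save_npz / load_npz, available since scipy 0.19; the path here is just illustrative) could look like this. Note that load_npz reads the whole matrix back into memory, so it does not cover the "read a subset of rows" requirement above:

import scipy.sparse

filename = '/tmp/test.npz'
X = scipy.sparse.random(1000000, 100000, density=1e-3, format='csr')

# save the CSR matrix as a .npz archive (compressed or not)
scipy.sparse.save_npz(filename, X, compressed=False)

# loads the full matrix back into memory -- no mmap of a row subset
X2 = scipy.sparse.load_npz(filename)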

It looks like joblib is still the way to go, unless one wants to write a custom wrapper to achieve comparable functionality (mmap support, etc.). Closing.

@rth rth closed this as completed Oct 11, 2018
@jakirkham

FWIW it should be possible to store COO directly in Zarr with compression disabled. This may pair nicely with Sparse, which also operates on a COO representation of the data. I was already writing up a brief comment about it when I saw your xref. Not totally sure if this is what you were looking for, but figured I'd share it anyway in case it is. If there is still something else you need, we'd be happy to get more feedback from you. :)
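
Something along these lines could work (a rough sketch assuming the zarr v2 group API and the pydata/sparse COO constructor; the path and array names are just illustrative):

import numpy as np
import scipy.sparse
import sparse
import zarr

X = scipy.sparse.random(1000000, 100000, density=1e-3, format='coo')

# store the COO components as separate, uncompressed Zarr arrays
g = zarr.open_group('/tmp/test.zarr', mode='w')
g.create_dataset('row', data=X.row, compressor=None)
g.create_dataset('col', data=X.col, compressor=None)
g.create_dataset('data', data=X.data, compressor=None)
g.attrs['shape'] = X.shape

# rebuild a pydata/sparse COO array from the stored components
g2 = zarr.open_group('/tmp/test.zarr', mode='r')
coords = np.stack([g2['row'][:], g2['col'][:]])
Y = sparse.COO(coords, g2['data'][:], shape=tuple(g2.attrs['shape']))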


rth commented Dec 19, 2018

Thanks for the information @jakirkham! Zarr is indeed an interesting possibility. So far I have been avoiding the sparse COO format (and, as a result, the sparse package) in favor of CSR (which scikit-learn uses everywhere), but maybe I should re-evaluate that. Still, a ~2x increase in memory usage is unfortunate unless there are performance benefits I'm not aware of that compensate for it.
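
Back-of-the-envelope: CSR stores data + indices (one index entry per nonzero) + a small indptr (n_rows + 1), while COO stores data + row + col (two index entries per nonzero), so the index arrays roughly double, though the total footprint with float64 data and int32 indices grows by closer to ~1.3x. A quick sanity check on a smaller matrix of the same density (sketch only):

import scipy.sparse

X_csr = scipy.sparse.random(100000, 10000, density=1e-3, format='csr')
X_coo = X_csr.tocoo()

# CSR: data + indices (nnz entries) + indptr (n_rows + 1 entries)
csr_bytes = X_csr.data.nbytes + X_csr.indices.nbytes + X_csr.indptr.nbytes
# COO: data + row + col (nnz entries each)
coo_bytes = X_coo.data.nbytes + X_coo.row.nbytes + X_coo.col.nbytes

print(coo_bytes / csr_bytes)  # ~1.3x here; ~2x on the index arrays alone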
