Efficient sparse arrays storage #44
Comments
At present, joblib might still be the best way to achieve sparse array storage. zarr is worth investigating, but its sparse support is not great (it uses compression instead of a true sparse representation, see zarr-developers/zarr-python#152). https://github.com/pydata/sparse is another relevant project, but currently it only supports [...] It looks like joblib is still the way to go, unless one wants to write a custom wrapper to achieve comparable functionality (mmap support, etc.). Closing.
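To make the "custom wrapper" idea above concrete, here is a minimal sketch of what it could look like, assuming the CSR matrix is stored as its three component arrays in uncompressed `.npy` files and memory-mapped with NumPy. The helper names (`save_csr`, `load_csr_mmap`, `get_row`) and the file layout are hypothetical, chosen only for illustration; this is not how joblib or FreeDiscovery does it.

```python
import os
import tempfile

import numpy as np


def save_csr(dirname, data, indices, indptr):
    """Write the three CSR component arrays as uncompressed .npy files."""
    os.makedirs(dirname, exist_ok=True)
    np.save(os.path.join(dirname, "data.npy"), data)
    np.save(os.path.join(dirname, "indices.npy"), indices)
    np.save(os.path.join(dirname, "indptr.npy"), indptr)


def load_csr_mmap(dirname):
    """Memory-map the component arrays read-only; contents are paged in lazily."""
    return tuple(
        np.load(os.path.join(dirname, name + ".npy"), mmap_mode="r")
        for name in ("data", "indices", "indptr")
    )


def get_row(data, indices, indptr, i):
    """Return (column indices, values) of row i, reading only that row's slice."""
    start, stop = indptr[i], indptr[i + 1]
    return np.asarray(indices[start:stop]), np.asarray(data[start:stop])


# A 3 x 4 matrix: [[1, 0, 2, 0], [0, 0, 0, 0], [0, 3, 0, 4]]
data = np.array([1.0, 2.0, 3.0, 4.0])
indices = np.array([0, 2, 1, 3], dtype=np.int32)
indptr = np.array([0, 2, 2, 4], dtype=np.int32)

tmpdir = tempfile.mkdtemp()  # hypothetical storage location
save_csr(tmpdir, data, indices, indptr)
mm_data, mm_indices, mm_indptr = load_csr_mmap(tmpdir)
cols, vals = get_row(mm_data, mm_indices, mm_indptr, 2)
empty_cols, empty_vals = get_row(mm_data, mm_indices, mm_indptr, 1)
```

Row access only touches the slice between `indptr[i]` and `indptr[i + 1]`, which is the property that makes CSR attractive for retrieving a few documents from a large on-disk feature matrix.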
FWIW it should be possible to store COO directly in Zarr with compression disabled. This may pair nicely with Sparse, which also operates on a COO representation of data. Was already writing up a brief comment about it when I saw your xref. Not totally sure if this is what you were looking for, but figured I'd share it with you anyway in case it was. If there is still something else you need, we'd be happy to get more feedback from you. :)
Thanks for the information, @jakirkham! Zarr is indeed an interesting possibility. So far I have been avoiding the sparse COO format (and, as a result, e.g. the sparse package) in favor of CSR (as used everywhere in scikit-learn), but maybe I should re-evaluate that. Still, an increase in memory usage by ~2x is unfortunate, unless there are some performance benefits I'm not aware of to compensate for it.
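For reference, a rough back-of-the-envelope estimate of the CSR vs. COO footprint, as a sketch only. The dtype widths are assumptions: 8-byte values, 4-byte indices for CSR, and 8-byte coordinates for COO. The function names and the matrix size are made up for illustration; the actual ratio depends on the index dtypes each library uses.

```python
def csr_nbytes(nnz, n_rows, value_bytes=8, index_bytes=4):
    """CSR stores nnz values, nnz column indices and n_rows + 1 row pointers."""
    return nnz * value_bytes + nnz * index_bytes + (n_rows + 1) * index_bytes


def coo_nbytes(nnz, ndim=2, value_bytes=8, index_bytes=8):
    """COO stores nnz values plus one coordinate per dimension per value."""
    return nnz * value_bytes + ndim * nnz * index_bytes


# Hypothetical feature matrix: 100k documents, 10M non-zero entries.
nnz, n_rows = 10_000_000, 100_000
csr = csr_nbytes(nnz, n_rows)
coo = coo_nbytes(nnz)
ratio = coo / csr  # close to 2 under the assumed dtypes

# With 4-byte COO coordinates the overhead is noticeably smaller.
coo_int32 = coo_nbytes(nnz, index_bytes=4)
ratio_int32 = coo_int32 / csr
```

Under these assumptions the ~2x figure comes mostly from the wider coordinate dtype; with matching 4-byte indices, COO only adds one extra index array.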
The performance of the FreeDiscovery REST API is in large part constrained by how fast we can retrieve the extracted features from disk (several GB of sparse arrays from Bag of Words / n-gram models). This is also relevant to issue #43.
Currently, joblib is used to store the computed sparse CSR arrays of features, but better solutions to this problem could exist.
The most critical parameters are,
A baseline benchmark could be,