Experiment: Potential speed up strategies #399
base: main
Conversation
- Arrow project: added deps for InlineStrings
- Arrow.Table: added kwarg useinlinestrings to load strings as InlineStrings whenever possible
- Arrow.write: added chunking as a default (à la PyArrow), kwarg chunksize
- Compression: added a thread-safe implementation with locks for decompression, and added locks for compression
- TranscodingStreams: added an in-place mutating transcode to avoid unnecessary resizing
(a usage sketch of the new keywords follows below)
…fer for type consistency))
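A rough usage sketch of the keywords proposed above. Note that `useinlinestrings` and `chunksize` are the additions from this PR, not part of released Arrow.jl; the file path and table are placeholders:

```julia
using Arrow, DataFrames

df = DataFrame(a = ["some", missing, "strings"], b = 1:3)

# Proposed: write in chunks by default (à la PyArrow); `chunksize` is the
# keyword added in this PR.
Arrow.write("data.arrow", df; compress = :zstd, chunksize = 64_000)

# Proposed: load eligible string columns as InlineStrings where possible;
# `useinlinestrings` is also added in this PR.
tbl = Arrow.Table("data.arrow"; useinlinestrings = true)
```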
Two additional thoughts on InlineStrings:
One side note on compression: I was quite surprised that, despite LZ4 being known as the "fastest" option, ZSTD (which compressed the data to smaller file sizes) has mostly been on par with LZ4 and in some cases even faster! It depends on your data and its size (repetition etc.), but I was positively surprised by ZSTD -- I'll be using it much more from now on.
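For anyone who wants to check this on their own data, a minimal sketch using the existing `compress` keyword of `Arrow.write` (assumes `df` is any Tables.jl-compatible table; the file names are placeholders):

```julia
using Arrow

for codec in (:lz4, :zstd)
    fn = "data_$(codec).arrow"
    t = @elapsed Arrow.write(fn, df; compress = codec)
    println(codec, ": ", filesize(fn), " bytes, ", round(t; digits = 3), " s")
end
```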
@@ -27,6 +27,7 @@ CodecLz4 = "5ba52731-8f18-5e0d-9241-30f10d1ec561"
 CodecZstd = "6b39b394-51ab-5f42-8807-6242bab2b4c2"
 DataAPI = "9a962f9c-6df0-11e9-0e5d-c546b8b5ee8a"
 Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
+InlineStrings = "842dd82b-1e85-43dc-bf29-5d0ee9dffc48"
Missing compat entry for InlineStrings?
Added. I set it to 1.4, since I don't fully understand its version history up to this point, but I suspect that "1" would work as well.
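For reference, the resulting entry in Project.toml (matching the "1.4" mentioned above) would look like:

```toml
[compat]
InlineStrings = "1.4"
```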
To be clear, I'm not sure this PR could ever be merged as-is.
I've essentially copy-pasted the transcode method from TranscodingStreams to create the mutating version, which is probably not best practice :-/
If we let transcode do its own allocation, it will allocate a small vector, start filling it, resize the vector, fill it some more, resize again, and so on. Instead, this commit pre-allocates a vector of the correct size and passes it to transcode(). Inspired by apache#399
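A toy illustration of why this matters (not Arrow.jl's or TranscodingStreams' actual code, just the allocation pattern): growing the output vector incrementally triggers repeated reallocation and copying, while allocating the known final size once avoids that entirely.

```julia
# Toy example: same result, very different allocation behaviour.
function grow_incrementally(n)
    out = UInt8[]
    for _ in 1:n
        push!(out, 0x00)           # the vector is repeatedly reallocated as it grows
    end
    return out
end

function preallocate(n)
    out = Vector{UInt8}(undef, n)  # one allocation of the final size
    fill!(out, 0x00)
    return out
end

@time grow_incrementally(10^8);
@time preallocate(10^8);
```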
This PR aims to showcase various strategies for improving the performance of Arrow.jl and to allow for easier testing. The intention is for it to be broken up into separate, modular PRs for actual contributions (upon interest).
TL;DR: Arrow.jl beats everyone except in one case (loading large strings in Task 1, where uncompressed Polars+PyArrow is so fast, 5 ms, that it clearly uses some lazy strategy and skips materializing the strings)
Changes:
Future:
Timings (copied from the original thread for comparison):
Task 1: 10x count nonmissing elements in the first column of a table
Data: 2 columns of 5K-long strings each, 10% of data missing, 10K rows
Timings: (ordered by Uncompressed, LZ4, ZSTD)
Data: 32 partitions (!), 2 columns of 5K-long strings each, 10% of data missing, 10K rows
Timings: (ordered by Uncompressed, LZ4, ZSTD)
(Arrow.jl timing also benefits from a quick fix to TranscodingStreams)
NEW: partitioned using the new defaults+keywords
write_out_compressions(df, fn; chunksize = cld(nrow(df), Threads.nthreads()));
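`write_out_compressions` is defined in the benchmarking code from #393; a hypothetical sketch of such a helper (the name, file suffixes, and keyword pass-through are assumptions, and `chunksize` is the keyword proposed in this PR) could look like:

```julia
using Arrow

# Hypothetical helper: writes the same table once per compression setting.
function write_out_compressions(df, fn; chunksize = nothing)
    for (suffix, codec) in (("plain", nothing), ("lz4", :lz4), ("zstd", :zstd))
        Arrow.write("$(fn)_$(suffix).arrow", df; compress = codec, chunksize = chunksize)
    end
end
```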
Task 2: 10x mean of the Int values in the first column of a table
Data: 10 columns, Int64, 10M rows
Timings: (ordered by Uncompressed, LZ4, ZSTD)
Data: 32 partitions (!), 10 columns, Int64, 10M rows
Timings: (ordered by Uncompressed, LZ4, ZSTD)
(Arrow.jl timing also benefits from a quick fix to TranscodingStreams)
NEW: partitioned using the new defaults+keywords
write_out_compressions(df, fn; chunksize = min(64000, cld(nrow(df), Threads.nthreads())));
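The measured operation in Task 2 boils down to a column mean over the loaded table; a minimal sketch (assuming `fn` is the benchmark file path):

```julia
using Arrow, Tables, Statistics

tbl = Arrow.Table(fn)
mean(Tables.getcolumn(tbl, 1))   # repeated 10x in the benchmark
```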
Added a new task to test out the automatic string inlining
Task 3: 10x count nonmissing elements in the first column of a table
Data: 2 columns of 10-codeunit-long strings each, 10% of data missing, 1M rows
partitioned using the new defaults+keywords
write_out_compressions(df, fn; chunksize = min(64000, cld(nrow(df), Threads.nthreads())));
Timings: (ordered by Uncompressed, LZ4, ZSTD)
(for comparison, I also ran a string length check on Polars+PyArrow: 0.32s, 0.36s, 0.41s)
And the best part? On my MacBook, I can get timings for Task 3 around 40ms (uncompressed) and 90ms (compressed), which is better than what I could have hoped for :)
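And the corresponding Task 3 measurement with the new keyword (again, `useinlinestrings` is the keyword added in this PR; `fn` is a placeholder path):

```julia
using Arrow, Tables

tbl = Arrow.Table(fn; useinlinestrings = true)
count(!ismissing, Tables.getcolumn(tbl, 1))   # repeated 10x in the benchmark
```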
Setup:
Benchmarking code is provided in #393