Add an indexable variant of Arrow.Stream #353
Comments
I don't think this is possible. The Arrow file format is a series of FlatBuffer messages that are not indexed and therefore have to be iterated over. More concretely, the …
My idea was that the constructor of such an indexable object could do the indexing you mention. I assume the whole file would have to be scanned, but maybe that could be done cheaply, i.e. without having to read/interpret all the data stored in the file.
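The cheap-scan idea can be sketched in plain Python, using a simplified length-prefixed framing rather than Arrow's actual encapsulated-message format: only each message's length header is read, and the payload is skipped with a seek, so an offset index is built without decoding any data.

```python
# Sketch only: a hypothetical length-prefixed message stream, not Arrow's
# real IPC framing. The point is that indexing needs header reads + seeks,
# never a full decode of the payload bytes.
import io
import struct

def build_index(stream):
    """Return the byte offset of every message in a length-prefixed stream."""
    offsets = []
    while True:
        offset = stream.tell()
        header = stream.read(4)
        if len(header) < 4:
            break  # end of stream
        (length,) = struct.unpack("<I", header)  # 4-byte little-endian length
        offsets.append(offset)
        stream.seek(length, io.SEEK_CUR)  # skip payload without reading it
    return offsets

# Example: three messages with payloads of 2, 3, and 1 bytes.
data = b"".join(struct.pack("<I", len(p)) + p for p in (b"ab", b"xyz", b"q"))
print(build_index(io.BytesIO(data)))  # [0, 6, 13]
```

With such an index in hand, a reader can later seek straight to any message's offset and deserialize just that one batch.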
Yeah, we could probably add support for this. Maybe with a … Curious though, because a workflow you can already do is:

```julia
for record_batch in Arrow.Stream(...)
    Distributed.@spawn begin
        # do stuff with record_batch
    end
end
```

What are the alternative workflows where that doesn't work for you?
What you propose works, but I thought that with this approach the parallelism would not be achieved (i.e. that the message processing itself would still happen sequentially in the iterating task).
Ah, you're correct; we do all the message processing in the iteration itself.
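For contrast, here is a sketch of the workflow an indexable variant would enable, written in Python with a thread pool standing in for `Distributed` workers and a hypothetical `read_batch` standing in for an index-based reader: each worker receives only a batch index and does its own deserialization, so the parsing itself runs in parallel rather than in the iterating task.

```python
# Sketch only: `read_batch` is a hypothetical stand-in for an index-based
# reader; a thread pool stands in for distributed worker processes.
from concurrent.futures import ThreadPoolExecutor

def read_batch(path, i):
    # Hypothetical: a real reader would seek to the i-th batch's recorded
    # offset and decode only that batch.
    return [i, i + 1, i + 2]

def process(path, i):
    batch = read_batch(path, i)  # deserialization happens in the worker
    return sum(batch)

with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda i: process("data.arrow", i), range(4)))
print(results)  # [3, 6, 9, 12]
```

The key difference from the `Arrow.Stream` loop above is that no single task has to parse every message before work can be handed out.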
This would be a great improvement, as it would also allow predicate pushdown at the RecordBatch level based on Message-level metadata, opening up the ability to operate on a single RecordBatch without decompressing all RecordBatches in a file. This is an important feature for me, so I'll try to spend some time building it without breaking too much.
If the data uses the "IPC File Format", then the footer (link) should contain all the information we need to construct this index. This should be more performant than scanning the whole file, but it is an optimization: scanning should also be supported.
I implemented this minus the indexing. Thoughts?
In a distributed computing context it would be nice to have a vector variant of the `Arrow.Stream` iterator. The idea is to be able to split processing of a single large Arrow file containing multiple record batches across multiple worker processes. Looking at the source code, this should be possible to do relatively efficiently. @quinnj - what do you think?