-
Notifications
You must be signed in to change notification settings - Fork 884
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support dictionary encoding in structures for FlightDataEncoder
, add documentation for arrow_flight::encode::Dictionary
#5488
Changes from 1 commit
8a13eef
b7a66d5
2799e7e
c4f37ec
69c7912
db37231
c8726b7
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -388,29 +388,39 @@ impl Stream for FlightDataEncoder { | |
/// Defines how a [`FlightDataEncoder`] encodes [`DictionaryArray`]s | ||
/// | ||
/// [`DictionaryArray`]: arrow_array::DictionaryArray | ||
/// | ||
/// In the arrow flight protocol dictionary values and keys are sent as two separate messages. | ||
/// When a sender is encoding a [`RecordBatch`] containing [`DictionaryArray`] columns, it will | ||
/// first send a dictionary batch (a batch with header [`MessageHeader::DictionaryBatch`]) containing | ||
/// the dictionary values. The receiver is responsible for reading this batch and maintaining state that associates | ||
/// those dictionary values with the corresponding array using the `dict_id` as a key. | ||
/// | ||
/// After sending the dictionary batch the sender will send the array data in a batch with header ['MessageHeader::RecordBatch ']. | ||
/// For any dictionary array batches in this message, the encoded flight message will only contain the dictionary keys. The receiver | ||
/// is then responsible for rebuilding the `DictionaryArray` on the client side using the dictionary values from the DictionaryBatch message | ||
/// and the keys from the RecordBatch message. | ||
/// | ||
/// For example, if we have a batch with a `TypedDictionaryArray<'_, UInt32Type, Utf8Type>` (a dictionary array where they keys are `u32` and the | ||
/// values are `String`), then the DictionaryBatch will contain a `StringArray` and the RecordBatch will contain a `UInt32Array`. | ||
/// | ||
/// Not that since `dict_id` defined in the `Schema` is used as a key to assicated dictionary values to their arrays it is required that each | ||
/// `DictionaryArray` in a `RecordBatch` have a unique `dict_id`. | ||
#[derive(Debug, PartialEq)] | ||
pub enum DictionaryHandling { | ||
/// Expands to the underlying type (default). This likely sends more data | ||
/// over the network but requires less memory (dictionaries are not tracked) | ||
/// and is more compatible with other arrow flight client implementations | ||
/// that may not support `DictionaryEncoding` | ||
/// | ||
/// An IPC response, streaming or otherwise, defines its schema up front | ||
/// which defines the mapping from dictionary IDs. It then sends these | ||
/// dictionaries over the wire. | ||
/// This method should be used if all batches over the entire lifetime fo the flight stream | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think this description is accurate. The behavior is what is "supposed" to happen with Flight, but to the best of my knowledge, this behavior isn't implemented -- the gap is tracked in #3389 but I don't think it is yet implemented I think what I believe There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm, I think you're right but the current implementation is actually broken then because it doesn't take into account nested dictionary fields:
Then if you have
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok yeah, I think this is why I was getting so confused. The original description of
But I don't quite understand why There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
If you have a sparse dictionary, as you might if you've applied a selective predicate, sending a dictionary for each batch where most of the values are not used could conceivably be worse. That being said this is also historical, as resend was added later There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree it was mostly historical The last time I checked (maybe 2 years ago) at least one client (maybe the golang one) didn't handle dictionaries (so if the rust server sent DictionaryArray the client couldn't deal with it) which was important for us as well All in all, sorting out how to handle Dictionaries better / more rationally in arrow-flight would be a very nice improvement I think (aka working on #3389) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
/// share the same dictionary (as determined by pointer-equality of the `DictionaryArray`'s `values` array). | ||
/// | ||
/// This requires identifying the different dictionaries in use, assigning | ||
/// them IDs, and sending new dictionaries, delta or otherwise, when needed | ||
/// The shared dictionary will be sent only once and the encoder will error in the case where it detects | ||
/// a dictionary which is not pointer-equal to the initial dictionary. | ||
/// | ||
/// See also: | ||
/// * <https://github.com/apache/arrow-rs/issues/1206> | ||
/// The arrow flight protocol supports delta dictionaries where the sender can send new dictionary values | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This matches my understanding too |
||
/// incrementally but this is currently not supported in arrow-rs. | ||
Hydrate, | ||
/// Send dictionary FlightData with every RecordBatch that contains a | ||
/// [`DictionaryArray`]. See [`Self::Hydrate`] for more tradeoffs. No | ||
/// attempt is made to skip sending the same (logical) dictionary values | ||
/// twice. | ||
/// This method should be used if each batch may contain different dictionary values. This will require more data to be sent | ||
/// across the wire as each batch sent will require a new DictionaryBatch message with all dictionary values for that batch. | ||
/// | ||
/// [`DictionaryArray`]: arrow_array::DictionaryArray | ||
/// The dictionary values will be sent for each batch separately and the receiver will use dictionary values | ||
/// from the most recent dictionary batch it received for the given `dict_id`. | ||
Resend, | ||
} | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is how things work in
arrow-rs
and I'm just assuming this is how the flight protocol is supposed to work :)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we update this comment maybe to say that the FlightDataEncoder doesn't really handle sending the same dictionary multiple times?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't really handle as in will send the same dictionary over and over again for evert batch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
to clarify that we do not send delta dictionaries but the current implementation will actually skip sending the dictionary if it is pointer-equal to existing tracked dictionary.