-
Notifications
You must be signed in to change notification settings - Fork 881
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add with_skip_validation
flag to IPC StreamReader
, FileReader
and FileDecoder
#7120
Conversation
d4102d6
to
59b3033
Compare
7b85e5e
to
77d3de5
Compare
@@ -1781,33 +1780,61 @@ impl PartialEq for ArrayData { | |||
} | |||
} | |||
|
|||
mod private { | |||
/// A boolean flag that cannot be mutated outside of unsafe code. | |||
/// A boolean flag that cannot be mutated outside of unsafe code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose to make this UnsafeFlag
public (and added examples and more docs) so I could use it across the two crates. However, I can also make a private copy of it in arrow-ipc if reviewers feel it would be better to avoid a new API
arrow-ipc/benches/ipc_reader.rs
Outdated
writer.write(&batch).unwrap(); | ||
} | ||
writer.finish().unwrap(); | ||
let buffer = ipc_stream(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added new versions of each benchmark that work with disabled validation
StructArray::try_new(struct_fields.clone(), struct_arrays, None)? | ||
}; | ||
Ok(Arc::new(struct_array)) | ||
self.create_struct_array(struct_node, null_buffer, struct_fields, struct_arrays) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored this code into its own function so it was eaiser to call StructArray::new_unchecked
when validation was disabled
arrow-ipc/src/reader.rs
Outdated
/// | ||
/// Relies on the caller only passing a flag with `true` value if they are | ||
/// certain that the data is valid | ||
pub fn with_skip_validation(mut self, skip_validation: UnsafeFlag) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that RecordBatchDecoder
is not a public API:
https://docs.rs/arrow-ipc/54.1.0/arrow_ipc/reader/struct.StreamReader.html?search=RecordBatchDecoder
@@ -809,6 +858,21 @@ impl FileDecoder { | |||
self | |||
} | |||
|
|||
/// Specifies if validation should be skipped when reading data (defaults to `false`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a new public API and follows the same pattern as ArrayData::skip_validation
@@ -1177,6 +1243,16 @@ impl<R: Read + Seek> FileReader<R> { | |||
pub fn get_mut(&mut self) -> &mut R { | |||
&mut self.reader | |||
} | |||
|
|||
/// Specifies if validation should be skipped when reading data (defaults to `false`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new API
@@ -1462,6 +1546,16 @@ impl<R: Read> StreamReader<R> { | |||
pub fn get_mut(&mut self) -> &mut R { | |||
&mut self.reader | |||
} | |||
|
|||
/// Specifies if validation should be skipped when reading data (defaults to `false`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new API
@@ -2456,6 +2584,57 @@ mod tests { | |||
); | |||
} | |||
|
|||
#[test] | |||
fn test_invalid_nested_array_ipc_read_errors() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some additional coverage to make sure the flag got passed when decoding multiple nesting levels
@@ -2592,6 +2771,32 @@ mod tests { | |||
); | |||
} | |||
|
|||
#[test] | |||
fn test_validation_of_invalid_union_array() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out UnionArray had its own path / handling so new coverage added
@@ -2602,18 +2807,18 @@ mod tests { | |||
|
|||
// IPC Stream format | |||
let buf = write_stream(&rb); // write is ok | |||
read_stream_skip_validation(&buf).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now all the validation tests also verify they can read the batch back without error if validation is disabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly just some minor nits
arrow-ipc/benches/ipc_reader.rs
Outdated
} | ||
|
||
/// Return an IPC stream with ZSTD compression with 10 record batches | ||
fn ipc_stream_zstd() -> Vec<u8> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit, but I wonder if we could have an ipc_stream_options that takes a IpcWriteOptions
and then have ipc_stream just call through to this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was a good suggestion -- I did it in 7d3f020 🧹
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
…-rs into alamb/ipc_disable_validation
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
…-rs into alamb/ipc_disable_validation
Thank you very much for the review @tustvold |
Which issue does this PR close?
with_skip_validation
flag to IPC readers/writers #7093Rationale for this change
Forcing
Array
validation while reading arrow IPC trusted data is inefficient. Users should be able to avoid doing so if they wantWhat changes are included in this PR?
This PR builds on this PR from @totoroyyb
Are there any user-facing changes?
with_disable_validation
APIs onStreamReader
,FileReader
andFileDecoder
Benchmark results Mac M3
Observations
mmap
is super fast and thus the validation is a much larger portion of the time, resulting in a corresponding 8-9x faster without validationwith_skip_validation
StreamReader
StreamReader(zstd)
FileReader
FileDecoder
+mmap
Details for Mac M3
Benchmarks: GCP
On a GCP
c2-standard-16 (16 vCPUs, 64 GB Memory)
with_skip_validation
StreamReader
StreamReader(zstd)
FileReader
FileDecoder
+mmap
Raw Results (gpc)