-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: add support for reading .tar archives #44787
Conversation
co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>
python's `tarfile` supports gzip, xz and bz2 encoding, so we don't need to make any special cases for that. co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>
pandas/io/common.py
Outdated
@@ -747,6 +751,21 @@ def get_handle( | |||
f"Only one file per ZIP: {zip_names}" | |||
) | |||
|
|||
# TAR Encoding | |||
elif compression == "tar": | |||
tar = tarfile.open(handle, "r:*") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If pandas supports reading from .tar*, users will probably also expect being able to write to .tar*.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented in 5f22df7 ✓
…xtensions on same compression co-authored-by: Margarete Dippel <margarete01@users.norepl y.github.com>
co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>
co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>
Hey @twoertwein, thank you for the review! Addressed the feedback. I think we're still missing some documentation things here and there.
|
pandas/io/common.py
Outdated
@@ -823,6 +852,96 @@ def get_handle( | |||
) | |||
|
|||
|
|||
class _BytesTarFile(tarfile.TarFile, BytesIO): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to share code with the _BytesZipFile class? Maybe _BytesCompressMixin from which both classes inherit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you comment on why we need this wrapper?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried extracting some of their code into a Mixin, but found that there's little room for abstraction. Except for the three lines of write
, none of the methods are identical. Since adding a Mixin also makes the code harder to follow, I'd prefer to keep the duplication. Having said that, if you see a good way of abstracting here, I'm more than open to it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment in 887fd10.
pandas/core/generic.py
Outdated
@@ -2341,6 +2341,7 @@ def to_json( | |||
default_handler: Callable[[Any], JSONSerializable] | None = None, | |||
lines: bool_t = False, | |||
compression: CompressionOptions = "infer", | |||
mode: str = "w", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is mode needed? I think we expect/require that file handles are opened in binary mode when the user request compression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, not needed. I realised that the passthrough described in
doesn't yet work. Fixed in 57eba0a, and removed the added mode
parameter.
will need #43925 and review |
Merged in #43925 and added to the shared docs. |
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
@@ -398,6 +400,14 @@ def write_to_compressed(compression, path, data, dest="test"): | |||
mode = "w" | |||
args = (dest, data) | |||
method = "writestr" | |||
elif compression == "tar": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this entire function could be replaced with a call to get_handle
(not needed in this PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Skn0tt looks good to me!
I hope that the compression wrappers can be simplified in the future. Another future todo is to close file handles when the tar archive contains no/too many files (zip has the same issue).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sticking with this. Could you merge main one more time? It looks fairly close to merging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like some tests are still failing on Windows
green! :) |
members = archive.getmembers() | ||
assert len(members) == 1 | ||
content = archive.extractfile(members[0]).read().decode("utf8") | ||
content = content.replace("\r\n", "\n") # windows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you test this based on the platform instead? There is pandas.compat.is_platform_windows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks reasonable. over to you @twoertwein & @mroeschke
Thanks @Skn0tt. If you could follow up with this #44787 (comment) in another PR that would be great |
* Add reproduction test for .tar.gz archives co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com> * add support for .tar archives python's `tarfile` supports gzip, xz and bz2 encoding, so we don't need to make any special cases for that. co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com> * update doc comments * fix: pep8 errors * refactor: flip _compression_to_extension around to support multiple extensions on same compression co-authored-by: Margarete Dippel <margarete01@users.norepl y.github.com> * refactor: detect tar files using existing extension mapping co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com> * feat: add support for writing tar files co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com> * feat: assure it respects .gz endings * feat: add "tar" entry to compressionoptions * chore: add whatsnew entry * fix: test_compression_size_fh * add tarfile to shared compression docs * fix formatting * pass through "mode" via compression args * fix pickle test * add class comment * sort imports * add _compression_to_extension back for backwards compatibility * fix some type warnings * fix: formatting * fix: mypy complaints * fix: more tests * fix: some error with xml * fix: interpreted text role * move to v1.5 whatsnw * add versionadded note * don't leave blank lines * add tests for zero files / multiple files * move _compression_to_extension to tests * revert added "mode" argument * add test to ensure that `compression.mode` works * compare strings, not bytes * replace carriage returns Co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>
At the moment, reading from
.tar.gz
will decode using.gzip
, but interprets thetar
contents as if it were acsv
. This leads to funky behaviour (due to the waytar
files are structured, it will interpret filenames as column names), and is generally incorrect.At the moment, Pandas users can work-around this using
tarfile
:This PR adds this logic into Pandas, similar to how Pandas already supports reading from
.zip
files.Co-Authored-By: @Margarete01