Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add support for reading .tar archives #44787

Merged
merged 38 commits into from
May 7, 2022

Conversation

Skn0tt
Copy link
Contributor

@Skn0tt Skn0tt commented Dec 6, 2021

  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

At the moment, reading from .tar.gz will decode using .gzip, but interprets the tar contents as if it were a csv. This leads to funky behaviour (due to the way tar files are structured, it will interpret filenames as column names), and is generally incorrect.

At the moment, Pandas users can work-around this using tarfile:

import pandas as pd
import tarfile

with tarfile.open("file.tar.gz", "r:*") as tar:
    file = tar.getnames()[0]
    return pd.read_csv(tar.extractfile(file))

This PR adds this logic into Pandas, similar to how Pandas already supports reading from .zip files.

Co-Authored-By: @Margarete01

Skn0tt and others added 3 commits December 6, 2021 12:54
co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>
python's `tarfile` supports gzip, xz and bz2 encoding,
so we don't need to make any special cases for that.

co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>
@pep8speaks
Copy link

pep8speaks commented Dec 6, 2021

Hello @Skn0tt! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-05-06 09:26:10 UTC

@twoertwein twoertwein added the IO Data IO issues that don't fit into a more specific label label Dec 6, 2021
@@ -747,6 +751,21 @@ def get_handle(
f"Only one file per ZIP: {zip_names}"
)

# TAR Encoding
elif compression == "tar":
tar = tarfile.open(handle, "r:*")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If pandas supports reading from .tar*, users will probably also expect being able to write to .tar*.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented in 5f22df7

Skn0tt and others added 5 commits December 7, 2021 10:14
…xtensions on same compression

co-authored-by: Margarete Dippel <margarete01@users.norepl
y.github.com>
co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>
co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>
@Skn0tt
Copy link
Contributor Author

Skn0tt commented Dec 15, 2021

Hey @twoertwein, thank you for the review! Addressed the feedback. I think we're still missing some documentation things here and there.

  • TODO: add tar as compression option to docs

@@ -823,6 +852,96 @@ def get_handle(
)


class _BytesTarFile(tarfile.TarFile, BytesIO):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to share code with the _BytesZipFile class? Maybe _BytesCompressMixin from which both classes inherit?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you comment on why we need this wrapper?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried extracting some of their code into a Mixin, but found that there's little room for abstraction. Except for the three lines of write, none of the methods are identical. Since adding a Mixin also makes the code harder to follow, I'd prefer to keep the duplication. Having said that, if you see a good way of abstracting here, I'm more than open to it!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment in 887fd10.

@@ -2341,6 +2341,7 @@ def to_json(
default_handler: Callable[[Any], JSONSerializable] | None = None,
lines: bool_t = False,
compression: CompressionOptions = "infer",
mode: str = "w",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is mode needed? I think we expect/require that file handles are opened in binary mode when the user request compression.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, not needed. I realised that the passthrough described in

https://github.com/pandas-dev/pandas/pull/44787/files#diff-132ee3be1f83a9f885442f45ed9ccbc96796ae28f97991b7c99ce25d44fd6af7R206

doesn't yet work. Fixed in 57eba0a, and removed the added mode parameter.

@jreback
Copy link
Contributor

jreback commented Dec 20, 2021

will need #43925 and review

@Skn0tt
Copy link
Contributor Author

Skn0tt commented Jan 4, 2022

Merged in #43925 and added to the shared docs.

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@@ -398,6 +400,14 @@ def write_to_compressed(compression, path, data, dest="test"):
mode = "w"
args = (dest, data)
method = "writestr"
elif compression == "tar":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this entire function could be replaced with a call to get_handle (not needed in this PR)

Copy link
Member

@twoertwein twoertwein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Skn0tt looks good to me!

I hope that the compression wrappers can be simplified in the future. Another future todo is to close file handles when the tar archive contains no/too many files (zip has the same issue).

@mroeschke mroeschke added this to the 1.5 milestone Apr 24, 2022
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sticking with this. Could you merge main one more time? It looks fairly close to merging

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like some tests are still failing on Windows

@Skn0tt
Copy link
Contributor Author

Skn0tt commented May 6, 2022

green! :)

members = archive.getmembers()
assert len(members) == 1
content = archive.extractfile(members[0]).read().decode("utf8")
content = content.replace("\r\n", "\n") # windows
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you test this based on the platform instead? There is pandas.compat.is_platform_windows

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks reasonable. over to you @twoertwein & @mroeschke

@mroeschke mroeschke merged commit 8647298 into pandas-dev:main May 7, 2022
@mroeschke
Copy link
Member

Thanks @Skn0tt. If you could follow up with this #44787 (comment) in another PR that would be great

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
* Add reproduction test for .tar.gz archives

co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>

* add support for .tar archives

python's `tarfile` supports gzip, xz and bz2 encoding,
so we don't need to make any special cases for that.

co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>

* update doc comments

* fix: pep8 errors

* refactor: flip _compression_to_extension around to support multiple extensions on same compression

co-authored-by: Margarete Dippel <margarete01@users.norepl
y.github.com>

* refactor: detect tar files using existing extension mapping

co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>

* feat: add support for writing tar files

co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>

* feat: assure it respects .gz endings

* feat: add "tar" entry to compressionoptions

* chore: add whatsnew entry

* fix: test_compression_size_fh

* add tarfile to shared compression docs

* fix formatting

* pass through "mode" via compression args

* fix pickle test

* add class comment

* sort imports

* add _compression_to_extension back for backwards compatibility

* fix some type warnings

* fix: formatting

* fix: mypy complaints

* fix: more tests

* fix: some error with xml

* fix: interpreted text role

* move to v1.5 whatsnw

* add versionadded note

* don't leave blank lines

* add tests for zero files / multiple files

* move _compression_to_extension to tests

* revert added "mode" argument

* add test to ensure that `compression.mode` works

* compare strings, not bytes

* replace carriage returns

Co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants