ENH: add support for reading .tar archives #44787

Skn0tt · 2021-12-06T13:01:38Z

tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

At the moment, reading from a .tar.gz file decompresses the gzip layer, but then parses the raw tar stream as if it were a CSV. This leads to funky behaviour (because of the way tar files are structured, file names are interpreted as column names), and is generally incorrect.
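To illustrate the problem described above, here is a minimal stdlib-only sketch (not taken from the PR). It builds a small .tar.gz in memory, undoes only the gzip layer, and shows that what remains starts with a tar header carrying the member name rather than with the CSV text. The member name `data.csv` and the sample payload are made up for the example.

```python
import gzip
import io
import tarfile

# Build a small .tar.gz in memory holding one CSV member.
csv_bytes = b"a,b\n1,2\n"
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz", format=tarfile.USTAR_FORMAT) as tar:
    info = tarfile.TarInfo(name="data.csv")  # illustrative member name
    info.size = len(csv_bytes)
    tar.addfile(info, io.BytesIO(csv_bytes))

# Undoing only the gzip layer leaves the raw tar stream, not the CSV itself.
raw_tar = gzip.decompress(buffer.getvalue())

# A tar stream begins with a 512-byte header block; its first field is the member name.
starts_with_name = raw_tar.startswith(b"data.csv")

# The CSV payload only begins after that header block.
payload = raw_tar[512:512 + len(csv_bytes)]
```

A CSV parser fed `raw_tar` therefore sees the header block first, which is consistent with file names ending up as column names.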

At the moment, Pandas users can work around this using tarfile:

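import tarfile

import pandas as pd

# "r:*" lets tarfile detect the compression (gzip, bz2, xz or none) itself.
with tarfile.open("file.tar.gz", "r:*") as tar:
    # Read the first archive member as a CSV file.
    member = tar.getnames()[0]
    df = pd.read_csv(tar.extractfile(member))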

This PR adds this logic into Pandas, similar to how Pandas already supports reading from .zip files.

Co-Authored-By: @Margarete01

Python's `tarfile` supports gzip, xz and bz2 compression, so we don't need to make any special cases for that.
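As a quick check of that claim, the following sketch (stdlib only, not code from the PR) writes the same payload with each compression and reads every archive back with the single mode `"r:*"`. The helper names `make_archive` and `read_first_member` are made up for the example.

```python
import io
import tarfile


def make_archive(mode: str, payload: bytes) -> bytes:
    """Write `payload` as a single member using the given tarfile write mode."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode=mode) as tar:
        info = tarfile.TarInfo(name="data.csv")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_first_member(archive: bytes) -> bytes:
    """Read the first member back; "r:*" detects the compression transparently."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:*") as tar:
        member = tar.getmembers()[0]
        return tar.extractfile(member).read()


payload = b"a,b\n1,2\n"
results = {
    mode: read_first_member(make_archive(mode, payload))
    for mode in ("w", "w:gz", "w:bz2", "w:xz")
}
```

The reading side never has to name the compression, which is why no per-format branches are needed there.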

pep8speaks · 2021-12-06T13:01:45Z

Hello @Skn0tt! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-05-06 09:26:10 UTC

pandas/io/common.py

pandas/tests/io/parser/test_compression.py

twoertwein · 2021-12-06T22:16:05Z

pandas/io/common.py

@@ -747,6 +751,21 @@ def get_handle(
                        f"Only one file per ZIP: {zip_names}"
                    )

+        # TAR Encoding
+        elif compression == "tar":
+            tar = tarfile.open(handle, "r:*")


If pandas supports reading from .tar*, users will probably also expect being able to write to .tar*.

Implemented in 5f22df7 ✓
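Writing differs from reading in one respect: `tarfile` has no transparent `"w:*"` mode, so the compression has to be chosen explicitly. A minimal sketch of one way to do that, assuming the choice is made from the file suffix, is shown below. `tar_write_mode` and `write_single_member` are hypothetical helpers for illustration, not pandas functions, and pandas' actual logic may differ.

```python
import io
import os
import tarfile
import tempfile


def tar_write_mode(path: str) -> str:
    """Hypothetical helper: pick a tarfile write mode from the file suffix."""
    for suffix, mode in ((".gz", "w:gz"), (".bz2", "w:bz2"), (".xz", "w:xz")):
        if path.endswith(suffix):
            return mode
    return "w"  # plain, uncompressed tar


def write_single_member(path: str, member_name: str, payload: bytes) -> None:
    """Write `payload` as the only member of a new archive at `path`."""
    with tarfile.open(path, tar_write_mode(path)) as tar:
        info = tarfile.TarInfo(name=member_name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "frame.csv.tar.gz")
    write_single_member(target, "frame.csv", b"a,b\n1,2\n")

    # The file on disk starts with the gzip magic bytes.
    with open(target, "rb") as fh:
        magic = fh.read(2)

    # Reading back needs no knowledge of the compression.
    with tarfile.open(target, "r:*") as tar:
        names = tar.getnames()
        roundtrip = tar.extractfile(names[0]).read()
```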

Skn0tt · 2021-12-15T11:01:35Z

Hey @twoertwein, thank you for the review! Addressed the feedback. I think we're still missing some documentation things here and there.

TODO: add tar as compression option to docs

twoertwein · 2021-12-16T01:44:06Z

pandas/io/common.py

@@ -823,6 +852,96 @@ def get_handle(
    )


+class _BytesTarFile(tarfile.TarFile, BytesIO):


Is it possible to share code with the _BytesZipFile class? Maybe _BytesCompressMixin from which both classes inherit?

Can you comment on why we need this wrapper?

I tried extracting some of their code into a Mixin, but found that there's little room for abstraction. Except for the three lines of write, none of the methods are identical. Since adding a Mixin also makes the code harder to follow, I'd prefer to keep the duplication. Having said that, if you see a good way of abstracting here, I'm more than open to it!

Added a comment in 887fd10.
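For readers wondering what such a wrapper does in general, here is a simplified sketch under stated assumptions. It is not the `_BytesTarFile` class from this PR. The idea shown is that a tar header records the member size before the data, so a writer that receives data in several `write` calls can buffer everything in memory and emit the archive member on `close`. The class name `BufferedTarWriter` and its parameters are hypothetical.

```python
import io
import tarfile


class BufferedTarWriter(io.BytesIO):
    """Hypothetical sketch: collect writes in memory, archive them on close."""

    def __init__(self, target, member_name: str, mode: str = "w") -> None:
        super().__init__()
        self._target = target            # binary file object receiving the archive
        self._member_name = member_name  # name of the single archive member
        self._mode = mode                # tarfile write mode, e.g. "w" or "w:gz"

    def close(self) -> None:
        if self.closed:
            return
        payload = self.getvalue()
        # The header stores the size, so the full payload must be known first.
        with tarfile.open(fileobj=self._target, mode=self._mode) as tar:
            info = tarfile.TarInfo(name=self._member_name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
        super().close()


target = io.BytesIO()
writer = BufferedTarWriter(target, "data.csv")
writer.write(b"a,b\n")
writer.write(b"1,2\n")
writer.close()

# Read the archive back to confirm both writes landed in one member.
target.seek(0)
with tarfile.open(fileobj=target, mode="r:*") as tar:
    names = tar.getnames()
    content = tar.extractfile(names[0]).read()
```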

twoertwein · 2021-12-16T01:45:49Z

pandas/core/generic.py

@@ -2341,6 +2341,7 @@ def to_json(
        default_handler: Callable[[Any], JSONSerializable] | None = None,
        lines: bool_t = False,
        compression: CompressionOptions = "infer",
+        mode: str = "w",


Is mode needed? I think we expect/require that file handles are opened in binary mode when the user requests compression.

Nope, not needed. I realised that the passthrough described in https://github.com/pandas-dev/pandas/pull/44787/files#diff-132ee3be1f83a9f885442f45ed9ccbc96796ae28f97991b7c99ce25d44fd6af7R206 doesn't yet work. Fixed in 57eba0a, and removed the added mode parameter.
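The passthrough idea can be sketched roughly as follows, assuming the compression options arrive as a dict with a `"method"` key and that every remaining key is forwarded to `tarfile.open`. `open_tar_for_writing` is a hypothetical helper for illustration; the way pandas handles this internally may differ.

```python
import io
import tarfile


def open_tar_for_writing(fileobj, compression: dict) -> tarfile.TarFile:
    """Hypothetical helper: forward extra compression options to tarfile.open."""
    options = dict(compression)      # copy so the caller's dict is not mutated
    method = options.pop("method")
    if method != "tar":
        raise ValueError(f"unsupported method: {method!r}")
    options.setdefault("mode", "w")  # plain tar unless the caller asks otherwise
    return tarfile.open(fileobj=fileobj, **options)


buffer = io.BytesIO()
payload = b"a,b\n1,2\n"
with open_tar_for_writing(buffer, {"method": "tar", "mode": "w:gz"}) as tar:
    info = tarfile.TarInfo(name="data.csv")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# The requested "w:gz" mode took effect: the output starts with the gzip magic bytes.
is_gzip = buffer.getvalue()[:2] == b"\x1f\x8b"
```

With this shape, no separate `mode` parameter is needed on the public writer methods.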

pandas/core/frame.py

jreback · 2021-12-20T01:30:30Z

will need #43925 and review

Skn0tt · 2022-01-04T08:07:15Z

Merged in #43925 and added to the shared docs.

github-actions · 2022-02-13T00:04:29Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

pandas/_typing.py

twoertwein · 2022-04-13T11:01:49Z

pandas/_testing/_io.py

@@ -398,6 +400,14 @@ def write_to_compressed(compression, path, data, dest="test"):
        mode = "w"
        args = (dest, data)
        method = "writestr"
+    elif compression == "tar":


I think this entire function could be replaced with a call to get_handle (not needed in this PR).

pandas/io/common.py

pandas/io/json/_json.py

pandas/io/pickle.py

twoertwein

Thank you @Skn0tt, looks good to me!

I hope that the compression wrappers can be simplified in the future. Another future todo is to close file handles when the tar archive contains no/too many files (zip has the same issue).
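The future todo about closing handles could look roughly like the sketch below, assuming the archive is validated right after opening. It uses `try`/`finally` so the archive is closed whether the member count is valid or not. `build_archive` and `read_only_member` are hypothetical helpers, and the error messages are illustrative rather than the ones pandas raises.

```python
import io
import tarfile


def build_archive(members: dict) -> bytes:
    """Create an in-memory tar archive from a {name: payload} mapping."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_only_member(archive: bytes) -> bytes:
    """Return the payload of the sole member; always close the archive."""
    tar = tarfile.open(fileobj=io.BytesIO(archive), mode="r:*")
    try:
        names = tar.getnames()
        if len(names) == 0:
            raise ValueError("no files found in tar archive")
        if len(names) > 1:
            raise ValueError(f"multiple files found in tar archive: {names}")
        return tar.extractfile(names[0]).read()
    finally:
        # Runs on success and on error, so no handle is left open.
        tar.close()


single = read_only_member(build_archive({"data.csv": b"a,b\n1,2\n"}))

try:
    read_only_member(build_archive({"a.csv": b"x\n", "b.csv": b"y\n"}))
    multiple_rejected = False
except ValueError:
    multiple_rejected = True
```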

mroeschke

Thanks for sticking with this. Could you merge main one more time? It looks fairly close to merging

mroeschke

Looks like some tests are still failing on Windows

Skn0tt · 2022-05-06T11:04:11Z

green! :)

mroeschke · 2022-05-06T17:01:49Z

pandas/tests/io/test_compression.py

+                members = archive.getmembers()
+                assert len(members) == 1
+                content = archive.extractfile(members[0]).read().decode("utf8")
+                content = content.replace("\r\n", "\n")  # windows


Could you test this based on the platform instead? There is pandas.compat.is_platform_windows
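A rough sketch of the platform-conditional variant is below. It uses a stdlib `sys.platform` check as a stand-in for `pandas.compat.is_platform_windows`, so it runs without pandas. The function name `normalise_newlines` and its `is_windows` parameter are made up for the example.

```python
import sys


def normalise_newlines(content: str, is_windows: bool = sys.platform == "win32") -> str:
    """Convert CRLF to LF only when running on Windows."""
    if is_windows:
        return content.replace("\r\n", "\n")
    return content


# Passing the flag explicitly makes both branches checkable on any platform.
on_windows = normalise_newlines("a,b\r\n1,2\r\n", is_windows=True)
elsewhere = normalise_newlines("a,b\r\n1,2\r\n", is_windows=False)
```

Making the replacement conditional means a stray `\r\n` on other platforms would still fail the comparison instead of being silently normalised.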

jreback

looks reasonable. over to you @twoertwein & @mroeschke

mroeschke · 2022-05-07T21:15:52Z

Thanks @Skn0tt. If you could follow up on #44787 (comment) in another PR, that would be great.

* Add reproduction test for .tar.gz archives
  co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>
* add support for .tar archives
  python's `tarfile` supports gzip, xz and bz2 encoding, so we don't need to make any special cases for that.
  co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>
* update doc comments
* fix: pep8 errors
* refactor: flip _compression_to_extension around to support multiple extensions on same compression
  co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>
* refactor: detect tar files using existing extension mapping
  co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>
* feat: add support for writing tar files
  co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>
* feat: assure it respects .gz endings
* feat: add "tar" entry to compressionoptions
* chore: add whatsnew entry
* fix: test_compression_size_fh
* add tarfile to shared compression docs
* fix formatting
* pass through "mode" via compression args
* fix pickle test
* add class comment
* sort imports
* add _compression_to_extension back for backwards compatibility
* fix some type warnings
* fix: formatting
* fix: mypy complaints
* fix: more tests
* fix: some error with xml
* fix: interpreted text role
* move to v1.5 whatsnw
* add versionadded note
* don't leave blank lines
* add tests for zero files / multiple files
* move _compression_to_extension to tests
* revert added "mode" argument
* add test to ensure that `compression.mode` works
* compare strings, not bytes
* replace carriage returns

Co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>

Skn0tt and others added 3 commits December 6, 2021 12:54

Add reproduction test for .tar.gz archives

c1823ef

co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>

add support for .tar archives

9a85cba

python's `tarfile` supports gzip, xz and bz2 encoding, so we don't need to make any special cases for that.

co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>

update doc comments

e673061

fix: pep8 errors

a0d6386

twoertwein reviewed Dec 6, 2021

pandas/io/common.py (outdated)

twoertwein reviewed Dec 6, 2021

pandas/tests/io/parser/test_compression.py

twoertwein added the IO Data label Dec 6, 2021

twoertwein reviewed Dec 6, 2021

Skn0tt and others added 5 commits December 7, 2021 10:14

refactor: flip _compression_to_extension around to support multiple extensions on same compression

6a8edef

co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>

refactor: detect tar files using existing extension mapping

d4e40c9

co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>

feat: add support for writing tar files

5f22df7

co-authored-by: Margarete Dippel <margarete01@users.noreply.github.com>

feat: assure it respects .gz endings

c6573ef

Merge branch 'master' into read-tar-archives

f3b6ed5

Skn0tt added 3 commits December 15, 2021 11:12

feat: add "tar" entry to compressionoptions

a4ac382

chore: add whatsnew entry

e66826b

fix: test_compression_size_fh

941be37

twoertwein reviewed Dec 16, 2021

pandas/core/frame.py (outdated)

Skn0tt added 3 commits January 4, 2022 08:34

Merge branch 'master' into read-tar-archives

e3369aa

add tarfile to shared compression docs

0468e5f

fix formatting

2531ee0

Skn0tt added 3 commits January 4, 2022 08:27

pass through "mode" via compression args

57eba0a

fix pickle test

38f7d54

add class comment

887fd10

don't leave blank lines

0c31aa8

twoertwein reviewed Apr 11, 2022

pandas/_typing.py

add tests for zero files / multiple files

086c598

twoertwein reviewed Apr 13, 2022

pandas/io/common.py (outdated)

twoertwein reviewed Apr 13, 2022

pandas/io/json/_json.py (outdated)

twoertwein reviewed Apr 13, 2022

pandas/io/pickle.py (outdated)

Skn0tt added 3 commits April 13, 2022 14:00

move _compression_to_extension to tests

861faf0

revert added "mode" argument

9458ecb

add test to ensure that compression.mode works

d20f315

twoertwein approved these changes Apr 14, 2022

mroeschke added this to the 1.5 milestone Apr 24, 2022

mroeschke added Enhancement and removed Stale labels Apr 24, 2022

mroeschke reviewed Apr 24, 2022

Merge branch 'main' into read-tar-archives

1066f1b

mroeschke requested changes Apr 26, 2022

Skn0tt added 3 commits May 5, 2022 15:12

Merge branch 'main' into read-tar-archives

6b0e1e6

compare strings, not bytes

0d9ed18

replace carriage returns

37370c2

mroeschke reviewed May 6, 2022

jreback approved these changes May 6, 2022

twoertwein approved these changes May 7, 2022

mroeschke approved these changes May 7, 2022

mroeschke merged commit 8647298 into pandas-dev:main May 7, 2022

Skn0tt mentioned this pull request May 9, 2022

follow-up to 44787, use pandas compat for platform specifics in added test #46973

Merged

MarcoGorelli mentioned this pull request Sep 22, 2022

BUG: Inconsistent behaviour reading .tar.gz files for 1.5.0 #48708

Closed

3 tasks
