Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data descriptors aren't supported in zip files #2216

Closed
charliermarsh opened this issue Mar 5, 2024 · 7 comments · Fixed by #2320
Closed

Data descriptors aren't supported in zip files #2216

charliermarsh opened this issue Mar 5, 2024 · 7 comments · Fixed by #2320
Assignees
Labels
bug Something isn't working registry Related to package indexes and registries

Comments

@charliermarsh
Copy link
Member

If you run:

cargo run pip install --extra-index-url https://buf.build/gen/python hashb_foxglove_protocolbuffers_python==25.3.0.1.20240226043130+465630478360 --force-reinstall -n

You'll hit an error like:

error: Failed to download: hashb-foxglove-protocolbuffers-python==25.3.0.1.20240226043130+465630478360
  Caused by: The wheel hashb_foxglove_protocolbuffers_python-25.3.0.1.20240226043130+465630478360-py3-none-any.whl is not a valid zip file
  Caused by: feature not supported: 'stream reading entries with data descriptors (planned to be reintroduced)'
@charliermarsh charliermarsh added bug Something isn't working registry Related to package indexes and registries labels Mar 5, 2024
@akshayjshah
Copy link

akshayjshah commented Mar 5, 2024

👋🏽 Hi from Buf, the company running the buf.build registry! Happy to chat about why and how we're using this feature if it's helpful.

Regardless, we (and our customers) really appreciate the attention you're giving to #2202 😻

@charliermarsh
Copy link
Member Author

Hey thanks for chiming in! I'd be happy to hear about why / how you're using data descriptors. I'd like to support them but it'll require either making some upgrades to the async_zip crate or adding fallback on our end to do a synchronous download-then-unzip. (Right now, we unzip directly to disk as we stream the download in-memory.)

@stefanvanburen
Copy link

Hi Charlie - I don't think our use of data descriptors is intentional, just what our downstream library (klauspost/compress) does when creating zip files. FWIW, it seems like similar behavior exists in the Go standard library. The calling code is roughly here, but it's nothing special - I believe any call to zip.Writer.Create with a regular file will write a data descriptor.

Hope that helps - not sure there's much we can do on our end, but I'll be watching over at Majored/rs-async-zip#122 to see if things can be resolved over there. Thanks again for all the work!

@charliermarsh
Copy link
Member Author

Thanks @stefanvanburen -- I can also try to take a look today. We use an async-zip fork anyway (https://github.com/charliermarsh/rs-async-zip) so we can easily experiment with different behaviors.

@charliermarsh charliermarsh self-assigned this Mar 6, 2024
@charliermarsh
Copy link
Member Author

Data descriptors add a lot of complexity for streaming. In ZIP, each entry looks like [header][file contents][data descriptor]. The [data descriptor] is optional, and the [header] tells you if it's present. The [header] tells you the length of [file contents], so that you can read the body. But if [data descriptor] is present, the the length is in [data descriptor], not [header]... So you can't stream it (trivially), because you don't know the length of the file until you're past it.

In theory it should be possible to make it work for Deflate, since the compressed body will be self-terminating. Alternatively, we could fall back to downloading the entire file, then unzipping it from disk.

@charliermarsh
Copy link
Member Author

I think the best path is to catch this error and fall back to downloading + unzipping from disk. It's not that much of a slowdown.

charliermarsh added a commit that referenced this issue Mar 10, 2024
## Summary

Some zip files can't be streamed; in particular, `rs-async-zip` doesn't
support data descriptors right now (though it may in the future). This
PR adds a fallback path for such zips that downloads the entire zip file
to disk, then unzips it from disk (which gives us `Seek`).

Closes #2216.

## Test Plan

`cargo run pip install --extra-index-url https://buf.build/gen/python
hashb_foxglove_protocolbuffers_python==25.3.0.1.20240226043130+465630478360
--force-reinstall -n`
@charliermarsh
Copy link
Member Author

We now support data descriptors natively.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working registry Related to package indexes and registries
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants