Built in gzip decompression for url.download #3821

Open
zheng-essential opened this issue Feb 18, 2025 · 1 comment
Labels: enhancement (New feature or request), feat, p1 (Important to tackle soon, but preemptable by p0)

Comments

@zheng-essential

Is your feature request related to a problem?

When I use url.download, I sometimes end up loading gzip files and have to write a UDF to decompress them.

Describe the solution you'd like

I would like a built-in for gzip decompression.

I think it would be best to put this in a new expression, but we could also roll it into url.download. Either way, we need to catch failure cases.
I haven't checked whether Daft already does this automatically when you use daft.read, but it might.
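For context, the per-value logic such a UDF ends up wrapping looks roughly like the sketch below. It uses only the Python standard library; `maybe_gunzip` and `GZIP_MAGIC` are hypothetical names chosen for illustration, not Daft APIs, and the Daft UDF wiring is left out. The magic-byte check is one assumed way to handle columns where only some downloads are gzip-compressed.

```python
import gzip
from typing import Optional

# Every gzip member starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(data: Optional[bytes]) -> Optional[bytes]:
    """Decompress gzip payloads; pass nulls and non-gzip bytes through unchanged."""
    if data is None or not data.startswith(GZIP_MAGIC):
        return data
    # Corrupt or truncated gzip input still raises here; this is the
    # failure case a built-in would have to define behavior for.
    return gzip.decompress(data)


compressed = gzip.compress(b"hello")
print(maybe_gunzip(compressed))
print(maybe_gunzip(b"plain text"))
```

Passing non-gzip bytes through silently is a design choice, not a requirement; a stricter built-in could raise instead.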

Describe alternatives you've considered

No response

Additional Context

No response

Would you like to implement a fix?

No

zheng-essential added the enhancement and needs triage labels on Feb 18, 2025
universalmind303 changed the title from "Built in gzip decompression" to "Built in gzip decompression for url.download" on Feb 18, 2025
rchowell self-assigned this on Feb 18, 2025
rchowell added the feat and p1 labels and removed the needs triage label on Feb 18, 2025
@rchowell
Contributor

Thanks for the feature request! I agree, let's make this its own expression for binary strings.

What do you think about these signatures?

# decode as "gzip" and raise ERROR on failure
col("my_bytes").decode("gzip")

# decode as "gzip" or return NULL on failure
col("my_bytes").try_decode("gzip")

Could you please elaborate on needing to catch failure cases? We may also want to allow for an encode/decode context for more complicated use cases. There is no need to design or solve that now, but I would like us to be aware of things like encoding levels and encode/decode dictionaries.
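To make the proposed raise-versus-NULL distinction concrete, here is a plain-Python sketch of the per-value semantics. The functions `decode_gzip` and `try_decode_gzip` are hypothetical stand-ins for the proposed `decode("gzip")` and `try_decode("gzip")` expressions, not an existing Daft API. It also shows where an encoding level would come in on the encode side, using the standard library's `compresslevel` parameter.

```python
import gzip
import zlib
from typing import Optional


def decode_gzip(data: Optional[bytes]) -> Optional[bytes]:
    """Strict variant: nulls pass through, invalid input raises."""
    if data is None:
        return None
    return gzip.decompress(data)


def try_decode_gzip(data: Optional[bytes]) -> Optional[bytes]:
    """Lenient variant: invalid input becomes None instead of raising."""
    try:
        return decode_gzip(data)
    except (OSError, EOFError, zlib.error):
        # OSError covers gzip.BadGzipFile (bad header, CRC mismatch),
        # EOFError covers truncated streams, zlib.error covers corrupt data.
        return None


# The encoding level is an encode-side option; decoding does not need it.
good = gzip.compress(b"payload", compresslevel=9)
truncated = good[:-4]          # trailer cut short
not_gzip = b"not gzip at all"  # wrong magic bytes

for value in (good, truncated, not_gzip, None):
    print(try_decode_gzip(value))
```

The failure cases exercised here (a truncated stream and bytes that are not gzip at all) are examples of what "catching failure cases" could mean; the issue does not specify which ones matter.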
