
Storage: Compressed files and Content-Encoding #7154

Closed
nathan-c opened this issue Sep 10, 2021 · 3 comments · Fixed by #7155
Labels: api: storage (Issues related to the Cloud Storage API.) · priority: p2 (Moderately-important priority. Fix may not be included in next release.) · type: question (Request for information or clarification. Not an issue.)

Comments

@nathan-c

The .NET SDK will throw an exception like IOException: Incorrect hash: expected 'jdwXEg==' (base64), was '0RBSEw==' (base64) when trying to download a ZIP file that has Content-Type: application/zip and Content-Encoding: gzip, unless the file also has Cache-Control: no-transform. This is because of the decompressive transcoding feature documented at https://cloud.google.com/storage/docs/transcoding#gzip-gzip
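For reference, a minimal sketch of the failing call (bucket and object names are made up):

```csharp
using System.IO;
using Google.Cloud.Storage.V1;

var client = StorageClient.Create();
using var destination = new MemoryStream();

// The object was uploaded gzip-compressed with Content-Type: application/zip
// and Content-Encoding: gzip. GCS decompresses it in transit, so the bytes
// received no longer match the stored object's hash and the SDK throws
// IOException: Incorrect hash ...
client.DownloadObject("my-bucket", "archive.zip", destination);
```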

I am not really sure of the logic behind the GCS feature, but I guess the horse has bolted on that one, so my question now is how I can tell whether GCS is going to ignore the Accept-Encoding header sent by the .NET SDK. The documentation doesn't have a definitive list of content types for which this holds true. Is there any way this list can be built into the .NET SDK so that it knows it should ignore the hash for these files? Maybe the SDK could check whether the X-GUploader-Response-Body-Transformations: gunzipped response header exists and skip the hash check if it does? Or is there some other solution where we can still validate the download?

To head off the suggestion that there is no point compressing already-compressed files: a ZIP can benefit from GZIP compression if, for example, it contains duplicate files.

@product-auto-label product-auto-label bot added the api: storage Issues related to the Cloud Storage API. label Sep 10, 2021
@amanda-tarafa amanda-tarafa self-assigned this Sep 10, 2021
@amanda-tarafa amanda-tarafa added type: question Request for information or clarification. Not an issue. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Sep 10, 2021
@jskeet jskeet assigned jskeet and amanda-tarafa and unassigned amanda-tarafa and jskeet Sep 10, 2021
@amanda-tarafa (Contributor)

For background see #1641 and #1784.

If you want to validate the download, then the only solution is to add Cache-Control: no-transform, as that is the only way to obtain data that matches the stored hash.
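For instance, a sketch of setting that at upload time (names are illustrative, and the gzip step itself is elided):

```csharp
using System.IO;
using Google.Cloud.Storage.V1;

var client = StorageClient.Create();
var metadata = new Google.Apis.Storage.v1.Data.Object
{
    Bucket = "my-bucket",
    Name = "archive.zip",
    ContentType = "application/zip",
    ContentEncoding = "gzip",
    // Disables decompressive transcoding: downloads return the stored
    // (gzip-compressed) bytes, so hash validation succeeds.
    CacheControl = "no-transform"
};
using var source = File.OpenRead("archive.zip.gz"); // pre-gzipped content
client.UploadObject(metadata, source);
```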

If you want to skip hash validation, you can set DownloadObjectOptions.DownloadValidationMode to DownloadValidationMode.Never.
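For example:

```csharp
var options = new DownloadObjectOptions
{
    // Skips the hash check entirely, at the cost of not detecting
    // genuinely corrupted downloads.
    DownloadValidationMode = DownloadValidationMode.Never
};
client.DownloadObject("my-bucket", "archive.zip", destination, options);
```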

As you can see in #1784, which covers a similar corner case, automatically detecting all of these situations is unreliable: there's no reliable header that describes whether the server has stripped a compression layer, and the object metadata is not enough for us to know whether to ignore the hash.

The request to publish the set of already-compressed content types for which Accept-Encoding: gzip is ignored is better made to the API team; you can do so by clicking the Send Feedback button at the bottom of the Transcoding documentation. But even if they were to publish such a list, we still couldn't know whether to skip validation, because the Content-Encoding header is dropped in the process. That means we can't tell whether the file was doubly compressed to begin with, and if it wasn't (a video, for instance) we'd still want to validate the hash.

@nathan-c (Author)

Thanks for the quick response. It looks like there is no way to validate the download if GCS decompresses the object, short of re-compressing and re-hashing the downloaded data and checking whether that matches the value stored on the object in GCS.

Ideally we want to leave server-side decompression enabled for the clients that need it. In our case we control file uploads, so we could perhaps add the uncompressed hash to the object metadata and then perform our own hash validation on download (sketched below), but it is a shame this can't be solved inside the SDK. We also maintain a list of content types not to apply gzip compression to, and we can keep adding to that list as we hit this error, but obviously that isn't a great solution either.
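A rough sketch of that metadata-based workaround, assuming the SDK's validation is disabled as suggested above (the metadata key uncompressed-sha256 is just our own convention):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using Google.Cloud.Storage.V1;

var client = StorageClient.Create();

// At upload time: record a hash of the uncompressed content in custom
// object metadata, alongside the usual Content-Type/Content-Encoding.
byte[] uncompressed = File.ReadAllBytes("archive.zip");
string expectedHash = Convert.ToBase64String(SHA256.HashData(uncompressed));
var metadata = new Google.Apis.Storage.v1.Data.Object
{
    Bucket = "my-bucket",
    Name = "archive.zip",
    ContentType = "application/zip",
    ContentEncoding = "gzip",
    Metadata = new Dictionary<string, string> { ["uncompressed-sha256"] = expectedHash }
};
// ... gzip the content and upload it with this metadata ...

// At download time: skip the SDK's hash check, then validate the
// decompressed bytes against our stored metadata hash instead.
using var destination = new MemoryStream();
client.DownloadObject("my-bucket", "archive.zip", destination,
    new DownloadObjectOptions { DownloadValidationMode = DownloadValidationMode.Never });

var stored = client.GetObject("my-bucket", "archive.zip");
string actualHash = Convert.ToBase64String(SHA256.HashData(destination.ToArray()));
if (actualHash != stored.Metadata["uncompressed-sha256"])
    throw new IOException("Uncompressed content hash mismatch.");
```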

I know this is a question for the API team, but do you know what the use case is for forcibly removing the outer gzip compression from certain "compressed" types even when the client explicitly sends Accept-Encoding: gzip?

@amanda-tarafa (Contributor)

I'll raise these issues again with the API team, as they are best positioned to offer a solution that works for everyone, rather than us trying to patch the .NET library based on assumptions.

As for why they are removing the outer compression layer, I really don't know.

I'll move this issue to the backlog now, where #1784 also is, but I'll update it if/when I know more. Do feel free to add a comment if you think there's something else we can address.
