
Storage: Compressed files and Content-Encoding #7154

Closed
nathan-c opened this issue Sep 10, 2021 · 3 comments · Fixed by #7155
Labels: api: storage (Issues related to the Cloud Storage API.) · priority: p2 (Moderately-important priority. Fix may not be included in next release.) · type: question (Request for information or clarification. Not an issue.)

Comments

@nathan-c

The .NET SDK will throw an exception like IOException: Incorrect hash: expected 'jdwXEg==' (base64), was '0RBSEw==' (base64) when trying to download a ZIP file that has Content-Type: application/zip and Content-Encoding: gzip, unless the file also has Cache-Control: no-transform. This is because of the decompressive transcoding feature documented at https://cloud.google.com/storage/docs/transcoding#gzip-gzip
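For reference, a minimal sketch of the failing call (bucket and object names are made up):

```csharp
using System.IO;
using Google.Cloud.Storage.V1;

var client = StorageClient.Create();
using var destination = new MemoryStream();

// The object was uploaded gzip-compressed with Content-Type: application/zip
// and Content-Encoding: gzip. GCS decompresses it in transit, so the bytes
// received no longer match the stored object's hash and the SDK throws
// IOException: Incorrect hash ...
client.DownloadObject("my-bucket", "archive.zip", destination);
```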

I am not really sure of the logic behind the GCS feature, but I guess the horse has bolted on that one, so my question now is how I can tell whether GCS is going to ignore the Accept-Encoding header sent by the .NET SDK. The documentation doesn't have a definitive list of content types for which this holds true. Is there any way this list can be built into the .NET SDK so that it knows it should ignore the hash for these files? Maybe the SDK could check whether the X-GUploader-Response-Body-Transformations: gunzipped response header exists and skip the hash check if it does? Or is there some other solution where we can still validate the download?

To head off the suggestion that there is no point compressing already-compressed files: a ZIP can benefit from GZIP compression if, for example, it contains duplicate files.

@product-auto-label product-auto-label bot added the api: storage Issues related to the Cloud Storage API. label Sep 10, 2021
@amanda-tarafa amanda-tarafa self-assigned this Sep 10, 2021
@amanda-tarafa amanda-tarafa added type: question Request for information or clarification. Not an issue. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Sep 10, 2021
@jskeet jskeet assigned jskeet and amanda-tarafa and unassigned amanda-tarafa and jskeet Sep 10, 2021
@amanda-tarafa (Contributor)

For background see #1641 and #1784.

If you want to validate the download, then the only solution is to add Cache-Control: no-transform, as that is the only way to obtain data that matches the stored hash.
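For instance, a sketch of setting that at upload time (names are illustrative, and the gzip step itself is elided):

```csharp
using System.IO;
using Google.Cloud.Storage.V1;

var client = StorageClient.Create();
var metadata = new Google.Apis.Storage.v1.Data.Object
{
    Bucket = "my-bucket",
    Name = "archive.zip",
    ContentType = "application/zip",
    ContentEncoding = "gzip",
    // Disables decompressive transcoding: downloads return the stored
    // (gzip-compressed) bytes, so hash validation succeeds.
    CacheControl = "no-transform"
};
using var source = File.OpenRead("archive.zip.gz"); // pre-gzipped content
client.UploadObject(metadata, source);
```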

If you want to skip hash validation, you can set DownloadObjectOptions.DownloadValidationMode to DownloadValidationMode.Never.
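For example:

```csharp
var options = new DownloadObjectOptions
{
    // Skips the hash check entirely, at the cost of not detecting
    // genuinely corrupted downloads.
    DownloadValidationMode = DownloadValidationMode.Never
};
client.DownloadObject("my-bucket", "archive.zip", destination, options);
```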

As you can see in #1784, which covers a similar corner case, automatically detecting all of these situations is unreliable: there's no reliable header that describes whether the server has stripped a compression layer, and the object metadata is not enough for us to know whether to ignore the hash.

The request to publish the set of already-compressed content types for which Accept-Encoding: gzip is ignored is better made to the API team; you can do so by clicking the Send Feedback button at the bottom of the Transcoding documentation. But even if they were to publish such a list, we still couldn't know whether to skip validation, because the Content-Encoding header is dropped in the process. That means we can't tell whether the file was doubly compressed to begin with, and if it wasn't (a video, for instance) we'd still want to validate the hash.

@nathan-c (Author)

Thanks for the quick response. It looks like there is no way to validate the download if GCS decompresses the object, short of re-compressing and re-hashing the downloaded data and checking whether that matches the value stored on the object in GCS.

Ideally we want to leave server-side decompression enabled for the clients that need it. In our case we control file uploads, so we could perhaps add the uncompressed hash to the object metadata and then perform our own hash validation on download (sketched below), but it is a shame this can't be solved inside the SDK. We also maintain a list of content types not to apply gzip compression to, and we can keep adding to that list as we hit this error, but obviously that isn't a great solution either.
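A rough sketch of that metadata-based workaround, assuming the SDK's validation is disabled as suggested above (the metadata key uncompressed-sha256 is just our own convention):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using Google.Cloud.Storage.V1;

var client = StorageClient.Create();

// At upload time: record a hash of the uncompressed content in custom
// object metadata, alongside the usual Content-Type/Content-Encoding.
byte[] uncompressed = File.ReadAllBytes("archive.zip");
string expectedHash = Convert.ToBase64String(SHA256.HashData(uncompressed));
var metadata = new Google.Apis.Storage.v1.Data.Object
{
    Bucket = "my-bucket",
    Name = "archive.zip",
    ContentType = "application/zip",
    ContentEncoding = "gzip",
    Metadata = new Dictionary<string, string> { ["uncompressed-sha256"] = expectedHash }
};
// ... gzip the content and upload it with this metadata ...

// At download time: skip the SDK's hash check, then validate the
// decompressed bytes against our stored metadata hash instead.
using var destination = new MemoryStream();
client.DownloadObject("my-bucket", "archive.zip", destination,
    new DownloadObjectOptions { DownloadValidationMode = DownloadValidationMode.Never });

var stored = client.GetObject("my-bucket", "archive.zip");
string actualHash = Convert.ToBase64String(SHA256.HashData(destination.ToArray()));
if (actualHash != stored.Metadata["uncompressed-sha256"])
    throw new IOException("Uncompressed content hash mismatch.");
```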

I know this is a question for the API team, but do you know what the use case is for forcibly removing the outer gzip compression from certain "compressed" types even when the client explicitly sends Accept-Encoding: gzip?

@amanda-tarafa (Contributor)

I'll raise these issues again with the API team, as they are best positioned to offer a solution that works for everyone, rather than us trying to patch the .NET library based on assumptions.

As for why they are removing the outer compression layer, I really don't know.

I'll move this issue to the backlog now, where #1784 also is, but I'll update it if/when I know more. Do feel free to add a comment if you think there's something else we can address.
