
gzip: Fix bug causing the EOF value (b'') to be returned too early #63

Merged (2 commits) on Mar 16, 2020

Conversation

@tsibley (Member) commented Mar 16, 2020

Depending on the data and conditions of compression, the .compress()
method will sometimes internally buffer all the data fed to it and
return only an empty byte string. When that happened, the previous
implementation of GzipCompressingReader() mistook it for an EOF value,
stopped compression, and signaled EOF itself. This led to some files,
but not all, being truncated during deploy/remote upload.

GzipCompressingReader() now uses its own internal buffer to make sure it
always has something to return to callers, even if multiple calls to
.compress() are necessary, thus avoiding a false EOF condition. Support
for .read1() is dropped since it is impossible to implement its
behaviour given the API of gzip. It was not used anyway.

Resolves #62.

@tsibley tsibley requested a review from trvrb March 16, 2020 04:46
@tsibley (Member, Author) commented Mar 16, 2020

@trvrb This bug fix should resolve #62. I was able to replicate the issue locally (using the script below) and it is resolved when I use this branch.

I'm happy to merge and release tomorrow (Monday) for you to test, or you can test directly against this branch if you'd prefer.

#!/bin/bash
set -euo pipefail
set -x

rm -vf roundtrip-test-*.json

./bin/nextstrain deploy s3://trs-blab-test roundtrip-test.json
aws s3 cp s3://trs-blab-test/roundtrip-test.json roundtrip-test-uploaded.json.gz
gunzip -v roundtrip-test-uploaded.json.gz
test "$(sha256sum < roundtrip-test.json)" == "$(sha256sum < roundtrip-test-uploaded.json)"

./bin/nextstrain remote download s3://trs-blab-test/roundtrip-test.json roundtrip-test-downloaded.json
test "$(sha256sum < roundtrip-test.json)" == "$(sha256sum < roundtrip-test-downloaded.json)"

where roundtrip-test.json was downloaded with curl -fsSL --compressed http://data.nextstrain.org/ncov_test1.15.0.json > roundtrip-test.json and has a sha256sum of 91af331993f3065d03e2019f31a2b6ad869944999ae52849f85fc01dd574b1d0.
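For context, the buffering behaviour that triggered the bug is easy to see with the standard library's zlib module directly (a standalone demonstration, not this repo's code):

```python
import zlib

# A gzip-format compressor (wbits = 16 + MAX_WBITS).
compressor = zlib.compressobj(wbits=16 + zlib.MAX_WBITS)

first = compressor.compress(b'{"first": "chunk"}')    # typically just the gzip header
second = compressor.compress(b'{"second": "chunk"}')  # typically b'': data is buffered

# An empty result mid-stream is NOT end of stream; the buffered
# data only appears once the compressor is flushed.
tail = compressor.flush()
```

Treating that mid-stream b'' as EOF is exactly how files ended up truncated.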

@abitrolly commented

Is it possible to write a test for this?

@tsibley (Member, Author) commented Mar 16, 2020

@abitrolly Possible, yes, and it would definitely be ideal. There's no proper testing setup for this package yet, however, so adding a test for this isn't quick. I'd love to have pytest set up like we do in https://github.com/seattleflu/id3c. In the meantime, something like the above script could be added as another line in the repo's Travis CI config. The script would need to be cleaned up a bit, test data uploaded somewhere, and appropriate S3 credentials for a testing bucket configured.
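Once pytest is in place, a regression test for this class of bug could look something like the following. This is a hypothetical sketch using only the standard library; the test name and test data are invented:

```python
import gzip
import zlib

def test_empty_compress_result_is_not_eof():
    """An empty result from .compress() mid-stream must not be
    treated as end of stream (see #62)."""
    data = b'{"key": "value"}' * 1000
    compressor = zlib.compressobj(wbits=16 + zlib.MAX_WBITS)

    compressed = b''
    empty_results = 0
    for offset in range(0, len(data), 64):
        out = compressor.compress(data[offset:offset + 64])
        if not out:
            empty_results += 1
        compressed += out
    compressed += compressor.flush()

    # The bug only manifests when .compress() returns b'' mid-stream,
    # so check that this test actually exercised that case...
    assert empty_results > 0
    # ...and that the full round trip is still lossless.
    assert gzip.decompress(compressed) == data
```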

@tsibley tsibley merged commit f9bb71e into master Mar 16, 2020
@tsibley tsibley deleted the remote-upload-corruption-fix branch March 16, 2020 17:11
@abitrolly commented

Deploying pytest infrastructure is easy. I can make a PR for it.

For an integration test with S3, it may be easier to mock its services with https://github.com/spulec/moto or https://botocore.amazonaws.com/v1/documentation/api/latest/reference/stubber.html

I was actually thinking about unit testing the bug where data goes missing in the compression algorithm. With such tests, it could possibly be submitted upstream.

@tsibley (Member, Author) commented Mar 16, 2020

Ah, I see.

I'd rather have an end-to-end integration test over one with mocks, as it tests the actual behaviour of the actual systems.

> I was actually thinking about unit testing the bug where data goes missing in the compression algorithm. With such tests, it could possibly be submitted upstream.

The bug was in this repo's code, not in anything upstream.

Successfully merging this pull request may close these issues.

Bug in deploy after version 1.16.0