-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gzip: Fix bug causing the EOF value (b'') to be returned too early #63
Conversation
Depending on the data and conditions of compression, the .compress() method will sometimes internally buffer all the data fed to it and return only an empty byte string. When that happened, the previous implementation of GzipCompressingReader() mistook it as an EOF value, stopped compression, and signaled EOF itself. This led to some files, but not all, being truncated during deploy/remote upload. GzipCompressingReader() now uses its own internal buffer to make sure it always has something to return to callers, even if multiple calls to .compress() are necessary, thus avoiding a false EOF condition. Support for .read1() is dropped since it is impossible to implement its behaviour given the API of gzip. It was not used anyway. Resolves #62.
@trvrb This bug fix should resolve #62. I was able to replicate the issue locally (using the script below) and it is resolved when I use this branch. I'm happy to merge and release tomorrow (Monday) for you test, or you can test directly against this branch if you'd prefer. #!/bin/bash
set -euo pipefail
set -x
rm -vf roundtrip-test-*.json
./bin/nextstrain deploy s3://trs-blab-test roundtrip-test.json
aws s3 cp s3://trs-blab-test/roundtrip-test.json roundtrip-test-uploaded.json.gz
gunzip -v roundtrip-test-uploaded.json.gz
test "$(sha256sum < roundtrip-test.json)" == "$(sha256sum < roundtrip-test-uploaded.json)"
./bin/nextstrain remote download s3://trs-blab-test/roundtrip-test.json roundtrip-test-downloaded.json
test "$(sha256sum < roundtrip-test.json)" == "$(sha256sum < roundtrip-test-downloaded.json)" where |
Is it possible to write a test for this? |
@abitrolly Possible, yes, and definitely would be ideal. There's not a proper testing setup for this package yet, however, so adding a test for this isn't quick. I'd love to have pytest setup like we do in https://github.com/seattleflu/id3c. In the meantime, something like the above script could be added as another line in the repo's Travis CI config. The script would need to be cleaned up a bit, test data uploaded somewhere, and appropriate S3 credentials for a testing bucket configured. |
Deploying For integration test with S3 it may be easier to mock its services with https://github.com/spulec/moto or https://botocore.amazonaws.com/v1/documentation/api/latest/reference/stubber.html I actually thought about unit testing the bug with missing data in compression algorithm. With such tests it can be possibly submitted upstream. |
Ah, I see. I'd rather have an end-to-end integration test over one with mocks, as it tests the actual behaviour of the actual systems.
The bug was in this repo's code, not in anything upstream. |
Depending on the data and conditions of compression, the .compress()
method will sometimes internally buffer all the data fed to it and
return only an empty byte string. When that happened, the previous
implementation of GzipCompressingReader() mistook it as an EOF value,
stopped compression, and signaled EOF itself. This led to some files,
but not all, being truncated during deploy/remote upload.
GzipCompressingReader() now uses its own internal buffer to make sure it
always has something to return to callers, even if multiple calls to
.compress() are necessary, thus avoiding a false EOF condition. Support
for .read1() is dropped since it is impossible to implement its
behaviour given the API of gzip. It was not used anyway.
Resolves #62.