-
-
Notifications
You must be signed in to change notification settings - Fork 316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update ABSStore to current Azure Storage API version #620
Conversation
zarr/storage.py
Outdated
# It is possible azure.store.blob doesn't provide the content_length attribute on | ||
# the blob propoeries object anymore. Something to look into. | ||
# | ||
# def getsize(self, path=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flagging this as one major thing I haven't figured out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
zarr/storage.py
Outdated
for blob in self.client.list_blobs(name_starts_with=dir_path): | ||
# items.append(self._strip_prefix_from_path(blob.name, dir_path)) | ||
if '/' not in blob.name: # what is this doing? | ||
items.append(self._strip_prefix_from_path(blob.name, dir_path)) | ||
else: | ||
items.append(self._strip_prefix_from_path( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tjcrone - I'm pretty confused here. Any points on the listdir
method would be much appreciated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line 8 in e3cdd1a
optional methods `listdir` (list members of a "directory") and `rmdir` (remove all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The listdir
method which was implemented using the old version of the azure storage library, used the list_blobs
method in that old library. That method in the old library had a delimiter parameter, which helps in listing directories much easier. For example if we have two directories in the container, say foo/
and bar/
each of which could have hundreds of thousands of blobs, we would not want to list all of them in the list_blobs
operations. Instead if you specify delimiter='/'
, you would only be returned ['foo/', 'bar/']
in the list_blobs
operation. The new version of the azure storage library seems not to have the delimiter
option in the list_blobs
function. Instead there is a function walk_blobs which has this parameter. I think that function would be more appropriate here instead of list_blobs
. I think just replacing self.client.list_blobs(name_starts_with=dir_path)
with self.client.walk_blobs(name_starts_with=dir_path, delimiter='/')
, should work.
zarr/storage.py
Outdated
# if self.client.get_blob_client(fs_path).exists(): | ||
# return self.client.get_blob_properties(self.container, | ||
# fs_path).properties.content_length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this PR @jhamman
I think the way to go here would be like so
blob_client = self.client.get_blob_client(fs_path) # blob may or may not exist
if blob_client.exists():
return blob_client.get_blob_properties().size
zarr/storage.py
Outdated
for blob in self.client.list_blobs(name_starts_with=dir_path): | ||
# items.append(self._strip_prefix_from_path(blob.name, dir_path)) | ||
if '/' not in blob.name: # what is this doing? | ||
items.append(self._strip_prefix_from_path(blob.name, dir_path)) | ||
else: | ||
items.append(self._strip_prefix_from_path( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The listdir
method which was implemented using the old version of the azure storage library, used the list_blobs
method in that old library. That method in the old library had a delimiter parameter, which helps in listing directories much easier. For example if we have two directories in the container, say foo/
and bar/
each of which could have hundreds of thousands of blobs, we would not want to list all of them in the list_blobs
operations. Instead if you specify delimiter='/'
, you would only be returned ['foo/', 'bar/']
in the list_blobs
operation. The new version of the azure storage library seems not to have the delimiter
option in the list_blobs
function. Instead there is a function walk_blobs which has this parameter. I think that function would be more appropriate here instead of list_blobs
. I think just replacing self.client.list_blobs(name_starts_with=dir_path)
with self.client.walk_blobs(name_starts_with=dir_path, delimiter='/')
, should work.
zarr/storage.py
Outdated
if type(blob) == Blob: | ||
for blob in self.client.list_blobs(name_starts_with=dir_path): | ||
# items.append(self._strip_prefix_from_path(blob.name, dir_path)) | ||
if '/' not in blob.name: # what is this doing? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the list_blobs will be replaced with walk_blobs with a delimiter='/'
parameter, the return items will be the contents of the directory. So if the directory structure is like so:
.zgroup
foo/.zgroup
foo/bar/.zarray
then walk_blobs(self.container, name_starts_with='foo/', delimiter='/')
will return ['foo/.zgroup', 'foo/bar/']
The .zgroup
will be a case of the first condition if '/' not in blob.name
, which if true, means that we have a blob, otherwise it's a directory. And the code in the else block removes the '/' from the directory.
So return value of listdir('foo')
should be ['.zgroup', 'bar']
.
I admit, it is not immediately obvious. Any refactoring would be greatly appreciated.
@shikharsg - do you happen to understand how we can do the blob emulation with the new version of the api? The current test uses: zarr-python/zarr/tests/test_storage.py Lines 1775 to 1783 in 610db34
As far as I can tell, this option is no longer available in the |
Yes, it seems like that option is not available in the new version. So the azure storage emulators (azurite or storage emulator) run on a default host and port and have default credentials. The connection string is:
Should be something like this I think: blob_client = asb.BlobServiceClient.from_connection_string(conn_str="PUT CONNECTION STRING HERE")
blob_client.delete_container('test')
blob_client.create_container('test')
store = ABSStore(container='test', prefix=prefix, account_name='devstoreaccount1', account_key='Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==') |
I have some time to help move this forward. @jhamman and @shikharsg, what is the primary blocker on this PR? Thanks! |
@alimanfoo, what is your preferred method for me to work on this PR? Should we merge into a fix/absstore branch in this repo |
I think main blocker is getting the tests to work with the emulator. Once the tests work it should most likely tell you if something else was breaking as well. Since the new storage library does not have a convenient way to connect to the emulator, we will have to use the default credentials that the emulator runs on, to connect to it. See this comment. |
Merged in master and pushed to kick off the tests again. |
|
Thanks @joshmoore! I've actually been thinking this we should discuss closing this PR and deprecating the |
Yes! Closing. |
This PR addresses #618 and updates the minimum supported version of
azure-storage-blob>=12
.You'll note the tests will fail for now. I haven't figured out where the emulation feature of the azure api is in the new version. I'll keep digging but tests on actual data are indicating this is mostly functional now.
cc @tjcrone
TODO: