Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update ABSStore to current Azure Storage API version #620

Closed
wants to merge 7 commits into from
Closed

Update ABSStore to current Azure Storage API version #620

wants to merge 7 commits into from

Conversation

jhamman
Copy link
Member

@jhamman jhamman commented Sep 27, 2020

This PR addresses #618 and updates the minimum supported version of azure-storage-blob>=12.

You'll note the tests will fail for now. I haven't figured out where the emulation feature of the azure api is in the new version. I'll keep digging but tests on actual data are indicating this is mostly functional now.

cc @tjcrone

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • AppVeyor and Travis CI passes
  • Test coverage is 100% (Coveralls passes)

zarr/storage.py Outdated
Comment on lines 2259 to 2262
# It is possible azure.store.blob doesn't provide the content_length attribute on
# the blob propoeries object anymore. Something to look into.
#
# def getsize(self, path=None):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flagging this as one major thing I haven't figured out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zarr/storage.py Outdated
Comment on lines 2243 to 2248
for blob in self.client.list_blobs(name_starts_with=dir_path):
# items.append(self._strip_prefix_from_path(blob.name, dir_path))
if '/' not in blob.name: # what is this doing?
items.append(self._strip_prefix_from_path(blob.name, dir_path))
else:
items.append(self._strip_prefix_from_path(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tjcrone - I'm pretty confused here. Any points on the listdir method would be much appreciated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional methods `listdir` (list members of a "directory") and `rmdir` (remove all

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The listdir method which was implemented using the old version of the azure storage library, used the list_blobs method in that old library. That method in the old library had a delimiter parameter, which helps in listing directories much easier. For example if we have two directories in the container, say foo/ and bar/ each of which could have hundreds of thousands of blobs, we would not want to list all of them in the list_blobs operations. Instead if you specify delimiter='/', you would only be returned ['foo/', 'bar/'] in the list_blobs operation. The new version of the azure storage library seems not to have the delimiter option in the list_blobs function. Instead there is a function walk_blobs which has this parameter. I think that function would be more appropriate here instead of list_blobs. I think just replacing self.client.list_blobs(name_starts_with=dir_path) with self.client.walk_blobs(name_starts_with=dir_path, delimiter='/'), should work.

zarr/storage.py Outdated
Comment on lines 2268 to 2270
# if self.client.get_blob_client(fs_path).exists():
# return self.client.get_blob_properties(self.container,
# fs_path).properties.content_length
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR @jhamman

I think the way to go here would be like so

        blob_client = self.client.get_blob_client(fs_path)  # blob may or may not exist
        if blob_client.exists():
            return blob_client.get_blob_properties().size

zarr/storage.py Outdated
Comment on lines 2243 to 2248
for blob in self.client.list_blobs(name_starts_with=dir_path):
# items.append(self._strip_prefix_from_path(blob.name, dir_path))
if '/' not in blob.name: # what is this doing?
items.append(self._strip_prefix_from_path(blob.name, dir_path))
else:
items.append(self._strip_prefix_from_path(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The listdir method which was implemented using the old version of the azure storage library, used the list_blobs method in that old library. That method in the old library had a delimiter parameter, which helps in listing directories much easier. For example if we have two directories in the container, say foo/ and bar/ each of which could have hundreds of thousands of blobs, we would not want to list all of them in the list_blobs operations. Instead if you specify delimiter='/', you would only be returned ['foo/', 'bar/'] in the list_blobs operation. The new version of the azure storage library seems not to have the delimiter option in the list_blobs function. Instead there is a function walk_blobs which has this parameter. I think that function would be more appropriate here instead of list_blobs. I think just replacing self.client.list_blobs(name_starts_with=dir_path) with self.client.walk_blobs(name_starts_with=dir_path, delimiter='/'), should work.

zarr/storage.py Outdated
if type(blob) == Blob:
for blob in self.client.list_blobs(name_starts_with=dir_path):
# items.append(self._strip_prefix_from_path(blob.name, dir_path))
if '/' not in blob.name: # what is this doing?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the list_blobs will be replaced with walk_blobs with a delimiter='/' parameter, the return items will be the contents of the directory. So if the directory structure is like so:

.zgroup
foo/.zgroup
foo/bar/.zarray

then walk_blobs(self.container, name_starts_with='foo/', delimiter='/') will return ['foo/.zgroup', 'foo/bar/']
The .zgroup will be a case of the first condition if '/' not in blob.name, which if true, means that we have a blob, otherwise it's a directory. And the code in the else block removes the '/' from the directory.
So return value of listdir('foo') should be ['.zgroup', 'bar'].

I admit, it is not immediately obvious. Any refactoring would be greatly appreciated.

@jhamman
Copy link
Member Author

jhamman commented Sep 28, 2020

@shikharsg - do you happen to understand how we can do the blob emulation with the new version of the api? The current test uses:

def create_store(self, prefix=None):
asb = pytest.importorskip("azure.storage.blob")
blob_client = asb.BlockBlobService(is_emulated=True)
blob_client.delete_container('test')
blob_client.create_container('test')
store = ABSStore(container='test', prefix=prefix, account_name='foo',
account_key='bar', blob_service_kwargs={'is_emulated': True})
store.rmdir()
return store

As far as I can tell, this option is no longer available in the azure.storage.blob.

@shikharsg
Copy link
Contributor

shikharsg commented Sep 28, 2020

@shikharsg - do you happen to understand how we can do the blob emulation with the new version of the api? The current test uses:

def create_store(self, prefix=None):
asb = pytest.importorskip("azure.storage.blob")
blob_client = asb.BlockBlobService(is_emulated=True)
blob_client.delete_container('test')
blob_client.create_container('test')
store = ABSStore(container='test', prefix=prefix, account_name='foo',
account_key='bar', blob_service_kwargs={'is_emulated': True})
store.rmdir()
return store

As far as I can tell, this option is no longer available in the azure.storage.blob.

Yes, it seems like that option is not available in the new version.

So the azure storage emulators (azurite or storage emulator) run on a default host and port and have default credentials. The connection string is:

DefaultEndpointsProtocol=https;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=https://127.0.0.1:10000/devstoreaccount1;

Should be something like this I think:

blob_client = asb.BlobServiceClient.from_connection_string(conn_str="PUT CONNECTION STRING HERE")
blob_client.delete_container('test') 
blob_client.create_container('test')
store = ABSStore(container='test', prefix=prefix, account_name='devstoreaccount1', account_key='Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==') 

@tjcrone
Copy link
Member

tjcrone commented Oct 30, 2020

I have some time to help move this forward. @jhamman and @shikharsg, what is the primary blocker on this PR? Thanks!

@tjcrone
Copy link
Member

tjcrone commented Oct 30, 2020

@alimanfoo, what is your preferred method for me to work on this PR? Should we merge into a fix/absstore branch in this repo
that we can continue working on or would you like me to grab this branch from @jhamman and start a new PR? Thanks!

@shikharsg
Copy link
Contributor

I have some time to help move this forward. @jhamman and @shikharsg, what is the primary blocker on this PR? Thanks!

I think main blocker is getting the tests to work with the emulator. Once the tests work it should most likely tell you if something else was breaking as well. Since the new storage library does not have a convenient way to connect to the emulator, we will have to use the default credentials that the emulator runs on, to connect to it. See this comment.

@joshmoore
Copy link
Member

Merged in master and pushed to kick off the tests again.

@joshmoore
Copy link
Member

E AttributeError: module 'azure.storage.blob' has no attribute 'BlockBlobService'

@jhamman
Copy link
Member Author

jhamman commented Feb 18, 2021

Thanks @joshmoore!

I've actually been thinking this we should discuss closing this PR and deprecating the ABSStore all together. Lately, I've been using the adlfs mapper implementation with increasing success. Given the complexities with maintaining the test suite and interface API compatibility, maybe it would be wise to consolidate effort elsewhere? Certainly open to other opinions (@tjcrone, @shikharsg) but I'm thinking I'm unlikely to bring this one across the finish line.

@joshmoore
Copy link
Member

@jhamman: this can now be closed with #759 merged, right? That way #781 (which implements #764) doesn't conflict. But if you are still thinking about refactoring ABSStore further, I can hold off.

@jhamman
Copy link
Member Author

jhamman commented Jun 17, 2021

Yes! Closing.

@jhamman jhamman closed this Jun 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants