Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up glob patterns with a prefix #302

Merged
merged 5 commits into from
Nov 11, 2020

Conversation

JacobHayes
Copy link
Contributor

@JacobHayes JacobHayes commented Nov 9, 2020

Resolves the issue described in #223 (comment) by applying @martindurant's first suggestion.

I was experimenting with a path like gs://bucket/partial_folder_name**.csv.gz and these changes returned sub-second versus me canceling after 18 minutes.

The changes are a bit more indirect/wide than I'd like, but I think that's mostly from the async fs and the overridden find not delegating to ls like in the simple spec.

Any suggestions on how to add a test or benchmark around this to ensure it isn't lost? Perhaps I can use VCR or something to assert that the prefix is passed in the API request? I think the VCR recording should be updated accordingly now. The VCR process/docs have gotten much better, so thanks! 😃

@JacobHayes JacobHayes changed the title Speed up glob with ** patterns in large buckets Speed up glob with patterns in large buckets Nov 9, 2020
@JacobHayes JacobHayes changed the title Speed up glob with patterns in large buckets Speed up glob patterns with a prefix Nov 9, 2020
@martindurant
Copy link
Member

I haven't had time to look in detail yet. Is this essentially the same change as fsspec/s3fs#360 for s3fs? Note that there are some follow-on PRs there, to deal with caching the listings and making sure maxdepth is respected in find.

@martindurant
Copy link
Member

On second reading, is it just about providing "partial_folder_name" to find in your example?

@JacobHayes
Copy link
Contributor Author

On second reading, is it just about providing "partial_folder_name" to find in your example?

Yes, exactly. I have folders like:

  • gs://bucket/folder/
  • gs://bucket/folder_backfill/
  • gs://bucket/some_really_large_folder

and a pattern like gs://bucket/folder**.csv.gz. some_really_large_folder wouldn't match the glob, but the old code wouldn't pass along the prefix to the API call, so all of the objects in some_really_large_folder would also be retrieved and filtered in python.

@martindurant
Copy link
Member

OK, got it.
So in this case, you would expect "bucket/folder" and "bucket/folder_backfill" to be in the directory cache, but not "bucket" or "bucket/some_really_large_folder".

@martindurant
Copy link
Member

Thank you, this is going in - and congratulations for making vcrpy work for you.

@martindurant martindurant merged commit 6fc9870 into fsspec:master Nov 11, 2020
@JacobHayes JacobHayes deleted the hasten-glob branch November 11, 2020 15:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants