-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support using S3Config.credentials_provider for writes #3648
Conversation
CodSpeed Performance ReportMerging #3648 will degrade performances by 21.98%Comparing Summary
Benchmarks breakdown
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3648 +/- ##
==========================================
- Coverage 78.06% 77.85% -0.22%
==========================================
Files 728 728
Lines 89967 89939 -28
==========================================
- Hits 70236 70023 -213
- Misses 19731 19916 +185
|
daft/filesystem.py
Outdated
|
||
def _get_s3_creds_from_provider_cached(provider: Callable[[], S3Credentials]) -> S3Credentials: | ||
"""Get S3 credentials from the cache if the current provider was already called and the creds have not expired.""" | ||
global _CACHED_S3_CREDS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about "always" running the provider instead of caching it? Global state is scary and bug-prone.
Doesn't _CACHED_FSES
already handle caching for us, since this maps the IOConfig (which contains a provider) to the initialized fs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does in part, but fs objects are still created every time _resolve_paths_and_filesystem
is called as part of the call to _validate_filesystem
, even though the cached one is ultimately returned. Because of that, the S3 credentials will be retrieved again, potentially multiple times per filesystem resolution actually (_validate_filesystem
is called once per path)
I am honestly not sure why we cache the fs but didn't want to change that behavior in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok... Another idea that might be a little cleaner -- could we wrap the provider
callback into a cached class? This could help us avoid a global cache I think.
Something like this:
class CachedProvider:
def __init__(self, provider):
self.expiry = ...
self.provider = ...
self.creds = None
def __call__(self):
... # used cached if expiry is still good, otherwise re-trigger the provider
def __reduce__(self):
... # for pickling, we can discard the cached credentials and just pickle the provider itself I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, where would I store this cached class? I'm hesitant to overload S3Config.credentials_provider
but unsure if there are other options
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I was thinking we store it in there as a little wrapper!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ended up going with a different solution actually, adding a caching wrapper on the Rust side. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh very cool!!
@@ -154,10 +161,10 @@ def _resolve_paths_and_filesystem( | |||
if resolved_filesystem is None: | |||
# Resolve path and filesystem for the first path. | |||
# We use this first resolved filesystem for validation on all other paths. | |||
resolved_path, resolved_filesystem = _infer_filesystem(paths[0], io_config) | |||
resolved_path, resolved_filesystem, expiry = _infer_filesystem(paths[0], io_config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be cleaner instead if we wrapped the filesystem and expiry into a dataclass (e.g. PyarrowFilesystemWithExpiry
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that sounds good, I'll do that
BREAKING CHANGE:
S3Credentials.expiry
must be a timezoned datetime nowResolves #3367