Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support using S3Config.credentials_provider for writes #3648

Merged
merged 12 commits into from
Jan 15, 2025

Conversation

kevinzwang
Copy link
Member

@kevinzwang kevinzwang commented Jan 7, 2025

BREAKING CHANGE: S3Credentials.expiry must be a timezoned datetime now

Resolves #3367

@kevinzwang kevinzwang requested a review from jaychia January 7, 2025 22:17
@github-actions github-actions bot added the feat label Jan 7, 2025
Copy link

codspeed-hq bot commented Jan 7, 2025

CodSpeed Performance Report

Merging #3648 will degrade performances by 21.98%

Comparing kevin/s3-cred-provider-write (105f2dd) with main (feab49a)

Summary

❌ 1 regressions
✅ 26 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main kevin/s3-cred-provider-write Change
test_iter_rows_first_row[100 Small Files] 148.5 ms 190.3 ms -21.98%

@kevinzwang kevinzwang marked this pull request as draft January 7, 2025 22:48
Copy link

codecov bot commented Jan 8, 2025

Codecov Report

Attention: Patch coverage is 90.90909% with 9 lines in your changes missing coverage. Please review.

Project coverage is 77.85%. Comparing base (c932ec9) to head (105f2dd).
Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
src/common/io-config/src/python.rs 83.78% 6 Missing ⚠️
daft/filesystem.py 89.28% 3 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3648      +/-   ##
==========================================
- Coverage   78.06%   77.85%   -0.22%     
==========================================
  Files         728      728              
  Lines       89967    89939      -28     
==========================================
- Hits        70236    70023     -213     
- Misses      19731    19916     +185     
Files with missing lines Coverage Δ
src/common/io-config/src/lib.rs 58.06% <ø> (ø)
src/common/io-config/src/s3.rs 61.73% <100.00%> (+8.40%) ⬆️
daft/filesystem.py 69.84% <89.28%> (+1.50%) ⬆️
src/common/io-config/src/python.rs 61.90% <83.78%> (+2.06%) ⬆️

... and 24 files with indirect coverage changes

@kevinzwang kevinzwang marked this pull request as ready for review January 8, 2025 01:18

def _get_s3_creds_from_provider_cached(provider: Callable[[], S3Credentials]) -> S3Credentials:
"""Get S3 credentials from the cache if the current provider was already called and the creds have not expired."""
global _CACHED_S3_CREDS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about "always" running the provider instead of caching it? Global state is scary and bug-prone.

Doesn't _CACHED_FSES already handle caching for us, since this maps the IOConfig (which contains a provider) to the initialized fs?

Copy link
Member Author

@kevinzwang kevinzwang Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does in part, but fs objects are still created every time _resolve_paths_and_filesystem is called as part of the call to _validate_filesystem, even though the cached one is ultimately returned. Because of that, the S3 credentials will be retrieved again, potentially multiple times per filesystem resolution actually (_validate_filesystem is called once per path)

I am honestly not sure why we cache the fs but didn't want to change that behavior in this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok... Another idea that might be a little cleaner -- could we wrap the provider callback into a cached class? This could help us avoid a global cache I think.

Something like this:

class CachedProvider:
    def __init__(self, provider):
        self.expiry = ...
        self.provider = ...
        self.creds = None

    def __call__(self):
        ... # used cached if expiry is still good, otherwise re-trigger the provider

    def __reduce__(self):
        ... # for pickling, we can discard the cached credentials and just pickle the provider itself I think

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, where would I store this cached class? I'm hesitant to overload S3Config.credentials_provider but unsure if there are other options

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was thinking we store it in there as a little wrapper!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ended up going with a different solution actually, adding a caching wrapper on the Rust side. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh very cool!!

@kevinzwang kevinzwang requested a review from jaychia January 13, 2025 22:33
@@ -154,10 +161,10 @@ def _resolve_paths_and_filesystem(
if resolved_filesystem is None:
# Resolve path and filesystem for the first path.
# We use this first resolved filesystem for validation on all other paths.
resolved_path, resolved_filesystem = _infer_filesystem(paths[0], io_config)
resolved_path, resolved_filesystem, expiry = _infer_filesystem(paths[0], io_config)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be cleaner instead if we wrapped the filesystem and expiry into a dataclass (e.g. PyarrowFilesystemWithExpiry)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that sounds good, I'll do that

@kevinzwang kevinzwang enabled auto-merge (squash) January 15, 2025 00:07
@kevinzwang kevinzwang merged commit 0afc55f into main Jan 15, 2025
40 of 41 checks passed
@kevinzwang kevinzwang deleted the kevin/s3-cred-provider-write branch January 15, 2025 00:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

S3Config.credentials_provider not used in write path
2 participants