Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow passing boto3 S3 resource as a transport parameter #527

Closed
3 tasks done
AlexPadron opened this issue Aug 5, 2020 · 4 comments
Closed
3 tasks done

Allow passing boto3 S3 resource as a transport parameter #527

AlexPadron opened this issue Aug 5, 2020 · 4 comments

Comments

@AlexPadron
Copy link

Problem description

Not a problem, but a feature request that I wanted to get feedback on prior to making a PR.

My goal is to make smart_open fail faster when it tries to open an S3 URI that does not exist. To do this, I'd like to be able to pass a boto3 resource as a transport param, to reduce the runtime of open. Re-initializing the boto3 resource is expensive, and for opening nonexistent files constitutes the bulk of the runtime.

Steps/code to reproduce the problem

Here is some minimal benchmarking code with no dependencies besides smart_open. All of the benchmarks are taken around the same time on my laptop.

import time
import boto3

import smart_open


def timeit(func, *args, **kwargs):
    count = 0
    start = time.time()
    while True:
        try:
            func(*args, **kwargs)
        except OSError:
            pass

        count += 1
        print("Average:", (time.time() - start) / count)

# credentials in environment
#timeit(smart_open.open, "s3://<my bucket>/not/a/real/key")
# 0.15 seconds

# passing session to smart_open
#session = boto3.Session()
#timeit(smart_open.open, "s3://<my bucket>/not/a/real/key", transport_params={"session": session})
# 0.1 seconds

# passing s3 resource directly
resource = boto3.Session().resource("s3")
# Hack: To test this I cloned smart_open and replaced
# s3 = session.resource('s3', **resource_kwargs)
# with
# s3 = session 
# in smart_open/s3.py
timeit(smart_open.open, "s3://<my bucket>/not/a/real/key", transport_params={"session": resource})
# 0.028 seconds

Proposed Solution

Add a resource transport parameter to the s3 module. This would be exclusive with passing a boto session: if they are both passed together warn/raise an error.

Another option would be to take in a boto client instead of resource.

Versions

print(platform.platform())
Darwin-19.6.0-x86_64-i386-64bit
>>> print("Python", sys.version)
Python 3.7.7 (default, Mar 10 2020, 15:43:33) 
[Clang 11.0.0 (clang-1100.0.33.17)]
>>> print("smart_open", smart_open.__version__)
smart_open 2.1.0

Checklist

Before you create the issue, please make sure you have:

  • Described the problem clearly
  • Provided a minimal reproducible example, including any required data
  • Provided the version numbers of the relevant software
@piskvorky
Copy link
Owner

Sounds reasonable to me, but let's wait for @mpenkov 's review – he's the most up-to-date with that module.

@AlexPadron
Copy link
Author

Wanted to follow up on this issue and see if it is a reasonable solution

@mpenkov
Copy link
Collaborator

mpenkov commented Aug 30, 2020

Hi @AlexPadron,

Sure, adding a resource parameter is fine, as long as we clearly document how it works, in particular its interactions with the existing session parameter.

Sorry for the delay in looking into this.

@mpenkov
Copy link
Collaborator

mpenkov commented Dec 30, 2020

We did this in #569 and it's now available in smart_open v4.1.0

@mpenkov mpenkov closed this as completed Dec 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants