Not a problem, but a feature request that I wanted to get feedback on prior to making a PR.
My goal is to make smart_open fail faster when it tries to open an S3 URI that does not exist. To do this, I'd like to be able to pass a boto3 resource as a transport param, which reduces the runtime of open. Re-initializing the boto3 resource is expensive, and when opening a nonexistent key it accounts for the bulk of the runtime.
Steps/code to reproduce the problem
Here is some minimal benchmarking code with no dependencies besides smart_open (and the boto3 it uses for S3). All of the benchmarks were taken around the same time on my laptop.
import time

import boto3
import smart_open


def timeit(func, *args, **kwargs):
    count = 0
    start = time.time()
    while True:
        try:
            func(*args, **kwargs)
        except OSError:
            pass
        count += 1
        print("Average:", (time.time() - start) / count)


# credentials in environment
#timeit(smart_open.open, "s3://<my bucket>/not/a/real/key")
# 0.15 seconds

# passing session to smart_open
#session = boto3.Session()
#timeit(smart_open.open, "s3://<my bucket>/not/a/real/key", transport_params={"session": session})
# 0.1 seconds

# passing s3 resource directly
resource = boto3.Session().resource("s3")
# Hack: To test this I cloned smart_open and replaced
#     s3 = session.resource('s3', **resource_kwargs)
# with
#     s3 = session
# in smart_open/s3.py
timeit(smart_open.open, "s3://<my bucket>/not/a/real/key", transport_params={"session": resource})
# 0.028 seconds
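The timing loop in the benchmark runs until it is interrupted and prints a running average. For anyone who wants to reproduce the numbers non-interactively, a bounded variant could look like the sketch below. The name `timeit_bounded` and the `n` parameter are made up for illustration and are not part of smart_open.

```python
import time


def timeit_bounded(func, *args, n=20, **kwargs):
    """Call func n times, ignoring OSError, and return the mean seconds per call.

    Hypothetical helper: same idea as the timeit loop in the benchmark,
    but with a fixed iteration count instead of an endless loop.
    """
    start = time.perf_counter()
    for _ in range(n):
        try:
            func(*args, **kwargs)
        except OSError:
            # A missing key is expected here; we only care how long failing takes.
            pass
    return (time.perf_counter() - start) / n
```

It would be called the same way as the original, for example `timeit_bounded(smart_open.open, "s3://<my bucket>/not/a/real/key", n=20)`.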
Proposed Solution
Add a resource transport parameter to the s3 module. It would be mutually exclusive with the existing session parameter: if both are passed together, smart_open would warn or raise an error.
Another option would be to take in a boto client instead of resource.
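To make the proposal concrete, here is a rough sketch of how the exclusivity check might look. It is not the actual smart_open code. `resolve_s3_resource` is a hypothetical helper name, the choice to raise `ValueError` rather than warn is only one of the two options mentioned above, and treating `resource_kwargs` as a transport parameter is an assumption based on the `session.resource('s3', **resource_kwargs)` line in smart_open/s3.py.

```python
def resolve_s3_resource(transport_params):
    """Pick the S3 resource to use for a given transport_params dict.

    Hypothetical helper sketching the proposed behaviour:
    - "resource" given: use it as-is (the fast path from the benchmark)
    - "session" given: build the resource from that session
    - neither given: fall back to a fresh default boto3 session
    - both given: refuse, since it is ambiguous which one should win
    """
    session = transport_params.get("session")
    resource = transport_params.get("resource")
    # Assumption: resource_kwargs is passed alongside session, as in s3.py.
    resource_kwargs = transport_params.get("resource_kwargs") or {}

    if resource is not None:
        if session is not None or resource_kwargs:
            raise ValueError(
                "'resource' cannot be combined with 'session' or 'resource_kwargs'"
            )
        return resource

    if session is None:
        # Imported lazily so the fast path above never touches boto3.
        import boto3

        session = boto3.Session()

    return session.resource("s3", **resource_kwargs)
```

With something like this in place, callers who open many keys could create the resource once and reuse it, while existing code that passes only a session would keep working unchanged.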
Sure, adding a resource parameter is fine, as long as we clearly document how it works, in particular its interactions with the existing session parameter.