iter_bucket fails on AWS Lambda when worker threads is not 1 #340
Comments
That might work. Which version of Python is needed for |
It does work locally; I'm going to test it on Lambda in a second. Here's the code I "wrote" (basically a copy of the commit I linked, with some adaptations):

    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(download_key, key) for key in key_iterator]
        for key_no, future in enumerate(concurrent.futures.as_completed(futures)):
            (key, content) = future.result()
            # "True or" forces a log line for every key (debugging aid)
            if True or key_no % 1000 == 0:
                logger.info(
                    "yielding key #%i: %s, size %i (total %.1fMB)", key_no, key, len(content), total_size / 1024.0 ** 2
                )
            yield key, content
            total_size += len(content)
            if key_limit is not None and key_no + 1 >= key_limit:
                # we were asked to output only a limited number of keys => we're done
                break

concurrent.futures was added in Python 3.2 according to https://docs.python.org/3/library/concurrent.futures.html |
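For readers who want to try the pattern outside of smart_open, here is a minimal self-contained sketch of the same idea. The names `iter_keys` and the stub `download_key` are hypothetical stand-ins (the real function would fetch the object from S3); this is an illustration under those assumptions, not the code from the linked commit.

```python
import concurrent.futures


def download_key(key):
    # Hypothetical stand-in for the real S3 download: returns (key, bytes)
    return key, ("content of " + key).encode("utf-8")


def iter_keys(key_iterator, workers=4, key_limit=None):
    """Yield (key, content) pairs in completion order, using threads only."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        # All downloads are submitted up front; results arrive as they finish
        futures = [executor.submit(download_key, key) for key in key_iterator]
        for key_no, future in enumerate(concurrent.futures.as_completed(futures)):
            key, content = future.result()
            yield key, content
            if key_limit is not None and key_no + 1 >= key_limit:
                # Only a limited number of keys was requested, so stop here
                break


if __name__ == "__main__":
    for key, content in iter_keys(["a.txt", "b.txt", "c.txt"], workers=2):
        print(key, len(content))
```

Because every key is submitted before the first result is consumed, breaking out early on `key_limit` still waits for the already submitted downloads when the executor shuts down at the end of the `with` block.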
@mpenkov WDYT? According to https://pypistats.org/packages/smart-open , we still get more py2 downloads than py3 downloads. But god knows how many of these are bots / noise… "Support |
Yeah that doesn't seem like a worthy reason to completely drop python 2 support, but there's this: https://pypi.org/project/futures/ |
It sounds like the current smart_open version with default arguments does not work under Lambda at all, regardless of the Python version you're using. The fix suggested by @rodjunger improves that situation for Py3.2+ users only. As long as nothing changes for Py2.7 users (they continue to see @rodjunger Can you make a PR that fixes the problem if @piskvorky I think we'll have to drop Py2.7 support eventually, for the same reasons as in gensim. Our Py2.7 users can continue to download |
@mpenkov Yeah I can do that, I'll submit the PR this week if possible. |
This commit addresses issue piskvorky#340. AWS Lambda environments do not support multiprocessing.Queue or multiprocessing.Pool, which are used by iter_bucket to optimize the pulling of files from s3. Solution: Switch to using concurrent.futures.ThreadPoolExecutor instead. This still optimizes the pulling of files from s3 without using new processes.
@mpenkov I assumed that, since you marked this as Hacktoberfest and @rodjunger hasn't responded, this still needed work. I submitted a PR for the work @rodjunger said they were going to do. |
* Updated iter_bucket to use concurrent futures.
  This commit addresses issue #340. AWS Lambda environments do not support multiprocessing.Queue or multiprocessing.Pool, which are used by iter_bucket to optimize the pulling of files from s3.
  Solution: Switch to using concurrent.futures.ThreadPoolExecutor instead. This still optimizes the pulling of files from s3 without using new processes.
* disable test_old when mocks are disabled
* favor multiprocessing over concurrent.futures
* make imap_unordered return an iterator instead of a list
* skip tests when their respective features are unavailable
* Revert "disable test_old when mocks are disabled"
  This reverts commit 6506562.
* tweak imap_unordered
* remove tests_require pins

Co-authored-by: Michael Penkov <m@penkov.dev>
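The bullets "favor multiprocessing over concurrent.futures" and "make imap_unordered return an iterator instead of a list" can be illustrated roughly as follows. This is only a sketch of the general approach, not the code merged into smart_open: `ThreadPoolAdapter` and `create_pool` are hypothetical names, and it assumes that a restricted environment fails with `ImportError` or `OSError` when the process pool is imported or created.

```python
import concurrent.futures
import contextlib


class ThreadPoolAdapter(object):
    """Hypothetical adapter exposing a Pool-like imap_unordered on top of threads."""

    def __init__(self, max_workers):
        self.executor = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)

    def imap_unordered(self, function, items):
        # Generator: on first iteration, submit all items, then yield each
        # result lazily as its future completes (an iterator, not a list)
        futures = [self.executor.submit(function, item) for item in items]
        for future in concurrent.futures.as_completed(futures):
            yield future.result()

    def terminate(self):
        self.executor.shutdown(wait=True)


@contextlib.contextmanager
def create_pool(processes=1):
    """Yield a multiprocessing pool if one can be created, else a thread-backed one."""
    try:
        import multiprocessing.pool
        pool = multiprocessing.pool.Pool(processes=processes)
    except (ImportError, OSError):
        # Assumption: this is how the failure surfaces in restricted environments
        pool = ThreadPoolAdapter(max_workers=processes)
    try:
        yield pool
    finally:
        pool.terminate()
```

A caller would write `with create_pool(processes=4) as pool:` and loop over `pool.imap_unordered(func, items)` without caring which backend was chosen. When the multiprocessing backend is used, `func` has to be picklable (for example, a module-level function).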
Closed via #368 |
I did some research before opening this issue, so it's more of an "information" kind of issue.
AWS Lambda environments do not support multiprocessing.Queue or multiprocessing.Pool (which are used by smart_open), so calling iter_bucket with default arguments results in this exception:
I found this fix in another project; it seems simple enough for me to try applying it to this project too.