-
-
Notifications
You must be signed in to change notification settings - Fork 381
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
use prefetch for sftp? #848
Comments
that sounds to me like the entire file will be downloaded when calling prefetch, which would be an anti-pattern since this lib is designed for files that do not fit into memory |
while i can't find it in the docs, from testing with large (~1gb) files paramiko doesn't appear to load the entire file into memory, only a fixed buffer i was trying to confirm this in the paramiko code but couldn't quite follow it |
fwiw, Line 687 in 76415ca
I'm not a maintainer (just an active contributor), but if the API is analogous to that implementation (global cc @mpenkov |
If this does not change smart_open's API significantly, I'm open to a PR. |
ok i managed to understand paramiko's code a bit better and i'm now pretty sure only at most a single chunk (32k) is stored in memory regarding api, i expect we can handle it with as a first suggestion, how about the presence of the slight downside is that it makes this (pretty significant) performance increase opt-in, but on the other hand it's safer to add. making prefetch the default would presumably require some other transport param value to disable it again if this suggestion sounds ok i'm happy to make a pr |
if we're talking about 32k prefetch, probably there's not much of a downside to having this enabled permanently? can you permalink the relevant paramiko code for the prefetch size? |
at most one chunk stays in memory https://github.com/paramiko/paramiko/blob/ed8b09751ff20340332d4b1bb2b10e32aedc57ff/paramiko/sftp_file.py#L149 chunks are made here https://github.com/paramiko/paramiko/blob/ed8b09751ff20340332d4b1bb2b10e32aedc57ff/paramiko/sftp_file.py#L478 |
though while cheap on the client, it does use a bunch of resources on the server which can be wasteful if the intention is to only read a small part of the file. so maybe we do want it opt-in |
ok i'm not sure why i didn't realise this sooner but if i need custom code for a given protocol (to specify custom transport params) i guess i could also just call |
you'll be talking to this object: smart_open/smart_open/smart_open_lib.py Line 263 in 76415ca
which proxies to your paramiko object only if there are no other proxies in front of it like a decompression handler |
ok. raised #849 |
to me it still looks like paramiko fetches the entire file into memory when you call it's not prefetching lazy/just-in-time like with |
my understanding is that it _requests all the data from the server (hence my comment about server resources above) but it only reads the replies as required and only keeps a single such chunk of data in memory at a time |
aha, that's not obvious for me from a quick look at the code. would be cool to add a test that verifies this behaviour: when a big file is opened with |
all the data does get requested but none of it read until we start reading are there any existing tests that make real connections like this that i can start from? |
I don't know if I'm overthinking this, and I don't know if such test is even necessary. probably for @mpenkov to decide if he's comfortable with this :) |
Hey guys, I'm using very smart_open library to read large files from SFTP server and I observed that the reads are very slow. It is taking 2 sec to read 8 MB chunks form the file. I think adding support for prefetch would help. Please see https://docs.paramiko.org/en/latest/api/sftp.html I'm waiting on this MR to get merged in. If it is going to take more time to get this merged, I will try switching over to use paramiko. Please let me know if any one has updates on when this can be merged. |
@nithinkrishna2107 it would be great if you can test the PR and report back! in pip / setup.py / pyproject.toml:
or (equivalent but immutable):
|
Hi, FYI I've tested in a staging environment using the PR code, as you suggested, for an sftp compressed file download that would previously take over 35 minutes with smart_open to stream from the sftp server onto the local filesystem. The only change (aside from installing the PR version of smart_open) is: transport_params = {"prefetch_kwargs": {}}
with smart_open.open(url, "rb", transport_params=transport_params) as file_handle:
# rest of the code And the file now downloads in less than 30 seconds which is much better. |
@ddelange Sure, I will test this and let you know the results. Thanks to @tvs-lyst for testing the change. I observed that smart_open has 2 files where open is defined one here https://github.com/piskvorky/smart_open/blob/develop/smart_open/smart_open_lib.py#L100 and one here https://github.com/piskvorky/smart_open/blob/develop/smart_open/ssh.py Do we need to make similar change in smart_open_lib.py? |
no need to touch smart_open_lib (from smart_open import open), that one just propagates transport_params to (in this case) smart_open.ssh.open. so #849 is sufficient |
Hi,
I'm finding certain files slow to read over sftp. debugging lead me to wonder if we should be calling
prefetch()
before returning the file handlehttps://docs.paramiko.org/en/latest/api/sftp.html#paramiko.sftp_file.SFTPFile.prefetch
Thoughts?
The text was updated successfully, but these errors were encountered: