
use prefetch for sftp? #848

Open

davidszotten opened this issue Feb 20, 2025 · 21 comments · May be fixed by #849

Comments

@davidszotten

Hi,

I'm finding certain files slow to read over SFTP. Debugging led me to wonder whether we should be calling prefetch() before returning the file handle:

https://docs.paramiko.org/en/latest/api/sftp.html#paramiko.sftp_file.SFTPFile.prefetch

Thoughts?
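
For context, a minimal sketch of what that would look like in raw paramiko (host, credentials and path below are placeholders):

import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("sftp.example.com", username="user", password="secret")

sftp = client.open_sftp()
with sftp.open("/data/large_file.bin", "rb") as fh:
    # ask paramiko to start requesting the file's blocks in the background,
    # so later read() calls don't pay one round trip per 32 KB chunk
    fh.prefetch()
    first_mb = fh.read(1024 * 1024)

client.close()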

@ddelange
Contributor

Pre-fetch the remaining contents of this file in anticipation of future read calls.

That sounds to me like the entire file will be downloaded when calling prefetch, which would be an anti-pattern, since this lib is designed for files that do not fit into memory.

@davidszotten
Author

While I can't find it in the docs, from testing with large (~1 GB) files paramiko doesn't appear to load the entire file into memory, only a fixed buffer.

I was trying to confirm this in the paramiko code but couldn't quite follow it.

@ddelange
Contributor

FWIW, smart_open.s3 also uses a prefetch mechanic via self._buffer.fill:

self._buffer = smart_open.bytebuffer.ByteBuffer(buffer_size)

I'm not a maintainer (just an active contributor), but if the API is analogous to that implementation (a global DEFAULT_BUFFER_SIZE variable etc.) so that memory usage can be controlled transparently, I think it would have a good chance of landing :)
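
For anyone unfamiliar with that mechanic, a rough self-contained illustration of the idea (names here are made up for the sketch, not smart_open's actual internals): the reader refills a small buffer from the remote stream in buffer_size chunks as the caller consumes data, so only the unconsumed remainder stays in memory.

# illustrative sketch only, not smart_open code
class BufferedRemoteReader:
    def __init__(self, raw, buffer_size=256 * 1024):
        self._raw = raw                    # underlying remote file-like object
        self._buffer_size = buffer_size    # refill chunk size
        self._buffer = b""

    def read(self, size):
        # refill from the remote stream until we can satisfy `size` (or hit EOF)
        while len(self._buffer) < size:
            chunk = self._raw.read(self._buffer_size)
            if not chunk:                  # remote EOF
                break
            self._buffer += chunk
        out, self._buffer = self._buffer[:size], self._buffer[size:]
        return out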

cc @mpenkov

@mpenkov
Collaborator

mpenkov commented Feb 22, 2025

If this does not change smart_open's API significantly, I'm open to a PR.

@davidszotten
Author

OK, I managed to understand paramiko's code a bit better and I'm now pretty sure at most a single chunk (32 KB) is stored in memory.

Regarding the API, I expect we can handle it with transport_params. In fact, I think the docs https://github.com/piskvorky/smart_open/blob/develop/help.txt#L187C12-L187C79 ("Any additional settings to be passed to paramiko.SSHClient.connect") are fortunately incorrect: only transport_params['connect_kwargs'] is passed to connect, which leaves space to add, say, prefetch_kwargs or similar (name suggestions welcome).

As a first suggestion, how about this: the presence of prefetch_kwargs (a dict) triggers prefetching (to e.g. keep backwards compat), and the dict itself is passed as kwargs to the prefetch call (to e.g. control max concurrency).

The slight downside is that it makes this (pretty significant) performance increase opt-in, but on the other hand it's safer to add. Making prefetch the default would presumably require some other transport param value to disable it again.

If this suggestion sounds OK, I'm happy to make a PR.
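
As a sketch of the proposed usage from the caller's side (prefetch_kwargs is the name suggested above, not an existing smart_open parameter, and max_concurrent_requests assumes a paramiko version whose prefetch accepts it):

import smart_open

transport_params = {
    "connect_kwargs": {"password": "secret"},
    # the presence of this key would enable prefetching; its contents are
    # forwarded to paramiko's SFTPFile.prefetch()
    "prefetch_kwargs": {"max_concurrent_requests": 64},
}
with smart_open.open("sftp://user@sftp.example.com/data/large_file.bin", "rb",
                     transport_params=transport_params) as fh:
    total = 0
    for chunk in iter(lambda: fh.read(1024 * 1024), b""):
        total += len(chunk)  # process the chunk here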

@ddelange
Contributor

If we're talking about a 32 KB prefetch buffer, there's probably not much of a downside to having this enabled permanently? Can you permalink the relevant paramiko code for the prefetch size?

@davidszotten
Author

@davidszotten
Author

Though, while cheap on the client, it does use a bunch of resources on the server, which can be wasteful if the intention is to read only a small part of the file. So maybe we do want it opt-in.

@davidszotten
Author

OK, I'm not sure why I didn't realise this sooner, but if I need custom code for a given protocol anyway (to specify custom transport params), I guess I could also just call prefetch() on the returned file object. Or am I missing something?

@ddelange
Contributor

You'll be talking to this object:

return so_utils.FileLikeProxy(decoded, binary)

which proxies to your paramiko object only if there are no other proxies in front of it, like a decompression handler.
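
So a minimal sketch of that workaround, assuming the proxy forwards attribute access to the underlying paramiko SFTPFile (i.e. an uncompressed file opened in binary mode, no decompression layer in front):

import smart_open

transport_params = {"connect_kwargs": {"password": "secret"}}
with smart_open.open("sftp://user@sftp.example.com/data/large_file.bin", "rb",
                     transport_params=transport_params) as fh:
    # assumption: fh proxies straight through to the paramiko SFTPFile,
    # so prefetch() reaches paramiko; a decompression wrapper would break this
    fh.prefetch()
    while chunk := fh.read(1024 * 1024):
        pass  # process the chunk here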

@davidszotten
Author

OK, raised #849.

@ddelange
Contributor

ddelange commented Feb 23, 2025

To me it still looks like paramiko fetches the entire file into memory when you call prefetch:
https://github.com/paramiko/paramiko/blob/ed8b09751ff20340332d4b1bb2b10e32aedc57ff/paramiko/sftp_file.py#L482

It's not prefetching lazily/just-in-time like smart_open.s3 does. Could that be achieved with a more low-level approach?

@davidszotten
Author

My understanding is that it requests all the data from the server (hence my comment about server resources above), but it only reads the replies as required and only keeps a single such chunk of data in memory at a time.

@ddelange
Contributor

Aha, that's not obvious to me from a quick look at the code. It would be cool to add a test that verifies this behaviour: when a big file is opened with prefetch_kwargs passed, some but not all data gets immediately requested from the server, even before performing any read operation.

@davidszotten
Author

some but not all data gets immediately requested from the server

All the data does get requested, but none of it is read until we start reading.

Are there any existing tests that make real connections like this that I can start from?

@ddelange
Contributor

I don't know if I'm overthinking this, and I don't know if such a test is even necessary. Probably one for @mpenkov to decide if he's comfortable with this :)

@nithinkrishna2107

Hey guys, I'm using the smart_open library to read large files from an SFTP server and I observed that the reads are very slow. It is taking 2 seconds to read 8 MB chunks from the file. I think adding support for prefetch would help. Please see https://docs.paramiko.org/en/latest/api/sftp.html

I'm waiting on this PR to get merged. If it is going to take more time, I will try switching over to using paramiko directly. Please let me know if anyone has updates on when this can be merged.

@ddelange
Contributor

ddelange commented Feb 26, 2025

@nithinkrishna2107 it would be great if you could test the PR and report back!

in pip / setup.py / pyproject.toml:

smart_open[ssh]@https://github.com/davidszotten/smart_open/archive/refs/heads/sftp-prefetch.zip

or (equivalent but immutable):

smart_open[ssh]@https://github.com/piskvorky/smart_open/archive/afe7da9fb0632bc4f6686572c22f1fc83bc9b550.zip

davidszotten linked a pull request Feb 26, 2025 that will close this issue
@tvs-lyst

in pip / setup.py / pyproject.toml:

smart_open[ssh]@https://github.com/davidszotten/smart_open/archive/refs/heads/sftp-prefetch.zip

Hi, FYI I've tested the PR code in a staging environment, as you suggested, for a compressed file download over SFTP that would previously take over 35 minutes with smart_open to stream from the SFTP server onto the local filesystem.

The only change (aside from installing the PR version of smart_open) is:

transport_params = {"prefetch_kwargs": {}}
with smart_open.open(url, "rb", transport_params=transport_params) as file_handle:
    # rest of the code

And the file now downloads in less than 30 seconds, which is much better.

@nithinkrishna2107

nithinkrishna2107 commented Feb 26, 2025

@ddelange Sure, I will test this and let you know the results.

Thanks to @tvs-lyst for testing the change.

I observed that smart_open has two files where open is defined: one here https://github.com/piskvorky/smart_open/blob/develop/smart_open/smart_open_lib.py#L100 and one here https://github.com/piskvorky/smart_open/blob/develop/smart_open/ssh.py

Do we need to make a similar change in smart_open_lib.py?

@ddelange
Contributor

No need to touch smart_open_lib (from smart_open import open); that one just propagates transport_params to (in this case) smart_open.ssh.open. So #849 is sufficient.
