
use prefetch for sftp? #848

Open

davidszotten opened this issue Feb 20, 2025 · 21 comments · May be fixed by #849

Comments

@davidszotten

Hi,

I'm finding certain files slow to read over SFTP. Debugging led me to wonder whether we should be calling prefetch() before returning the file handle:

https://docs.paramiko.org/en/latest/api/sftp.html#paramiko.sftp_file.SFTPFile.prefetch

Thoughts?
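
For context, a minimal sketch of what that would look like in raw paramiko (host, credentials and path below are placeholders):

import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("sftp.example.com", username="user", password="secret")

sftp = client.open_sftp()
with sftp.open("/data/large_file.bin", "rb") as fh:
    # ask paramiko to start requesting the file's blocks in the background,
    # so later read() calls don't pay one round trip per 32 KB chunk
    fh.prefetch()
    first_mb = fh.read(1024 * 1024)

client.close()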

@ddelange
Contributor

Pre-fetch the remaining contents of this file in anticipation of future read calls.

That sounds to me like the entire file will be downloaded when calling prefetch, which would be an anti-pattern, since this lib is designed for files that do not fit into memory.

@davidszotten
Author

While I can't find it in the docs, from testing with large (~1 GB) files paramiko doesn't appear to load the entire file into memory, only a fixed buffer.

I was trying to confirm this in the paramiko code but couldn't quite follow it.

@ddelange
Contributor

FWIW, smart_open.s3 also uses a prefetch mechanic via self._buffer.fill:

self._buffer = smart_open.bytebuffer.ByteBuffer(buffer_size)

I'm not a maintainer (just an active contributor), but if the API is analogous to that implementation (a global DEFAULT_BUFFER_SIZE variable etc.) so that memory usage can be controlled transparently, I think it would have a good chance of landing :)
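
For anyone unfamiliar with that mechanic, a rough self-contained illustration of the idea (names here are made up for the sketch, not smart_open's actual internals): the reader refills a small buffer from the remote stream in buffer_size chunks as the caller consumes data, so only the unconsumed remainder stays in memory.

# illustrative sketch only, not smart_open code
class BufferedRemoteReader:
    def __init__(self, raw, buffer_size=256 * 1024):
        self._raw = raw                    # underlying remote file-like object
        self._buffer_size = buffer_size    # refill chunk size
        self._buffer = b""

    def read(self, size):
        # refill from the remote stream until we can satisfy `size` (or hit EOF)
        while len(self._buffer) < size:
            chunk = self._raw.read(self._buffer_size)
            if not chunk:                  # remote EOF
                break
            self._buffer += chunk
        out, self._buffer = self._buffer[:size], self._buffer[size:]
        return out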

cc @mpenkov

@mpenkov
Collaborator

mpenkov commented Feb 22, 2025

If this does not change smart_open's API significantly, I'm open to a PR.

@davidszotten
Author

OK, I managed to understand paramiko's code a bit better and I'm now pretty sure at most a single chunk (32 KB) is stored in memory.

Regarding the API, I expect we can handle it with transport_params. In fact, I think the docs https://github.com/piskvorky/smart_open/blob/develop/help.txt#L187C12-L187C79 ("Any additional settings to be passed to paramiko.SSHClient.connect") are fortunately incorrect: only transport_params['connect_kwargs'] is passed to connect, which leaves space to add, say, prefetch_kwargs or similar (name suggestions welcome).

As a first suggestion, how about this: the presence of prefetch_kwargs (a dict) triggers prefetching (to e.g. keep backwards compat), and the dict itself is passed as kwargs to the prefetch call (to e.g. control max concurrency).

The slight downside is that it makes this (pretty significant) performance increase opt-in, but on the other hand it's safer to add. Making prefetch the default would presumably require some other transport param value to disable it again.

If this suggestion sounds OK, I'm happy to make a PR.
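
As a sketch of the proposed usage from the caller's side (prefetch_kwargs is the name suggested above, not an existing smart_open parameter, and max_concurrent_requests assumes a paramiko version whose prefetch accepts it):

import smart_open

transport_params = {
    "connect_kwargs": {"password": "secret"},
    # the presence of this key would enable prefetching; its contents are
    # forwarded to paramiko's SFTPFile.prefetch()
    "prefetch_kwargs": {"max_concurrent_requests": 64},
}
with smart_open.open("sftp://user@sftp.example.com/data/large_file.bin", "rb",
                     transport_params=transport_params) as fh:
    total = 0
    for chunk in iter(lambda: fh.read(1024 * 1024), b""):
        total += len(chunk)  # process the chunk here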

@ddelange
Contributor

If we're talking about a 32 KB prefetch buffer, there's probably not much of a downside to having this enabled permanently? Can you permalink the relevant paramiko code for the prefetch size?

@davidszotten
Author

@davidszotten
Author

Though, while cheap on the client, it does use a bunch of resources on the server, which can be wasteful if the intention is to read only a small part of the file. So maybe we do want it opt-in.

@davidszotten
Author

OK, I'm not sure why I didn't realise this sooner, but if I need custom code for a given protocol anyway (to specify custom transport params), I guess I could also just call prefetch() on the returned file object. Or am I missing something?

@ddelange
Contributor

You'll be talking to this object:

return so_utils.FileLikeProxy(decoded, binary)

which proxies to your paramiko object only if there are no other proxies in front of it, like a decompression handler.
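
So a minimal sketch of that workaround, assuming the proxy forwards attribute access to the underlying paramiko SFTPFile (i.e. an uncompressed file opened in binary mode, no decompression layer in front):

import smart_open

transport_params = {"connect_kwargs": {"password": "secret"}}
with smart_open.open("sftp://user@sftp.example.com/data/large_file.bin", "rb",
                     transport_params=transport_params) as fh:
    # assumption: fh proxies straight through to the paramiko SFTPFile,
    # so prefetch() reaches paramiko; a decompression wrapper would break this
    fh.prefetch()
    while chunk := fh.read(1024 * 1024):
        pass  # process the chunk here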

@davidszotten
Author

OK, raised #849.

@ddelange
Contributor

ddelange commented Feb 23, 2025

To me it still looks like paramiko fetches the entire file into memory when you call prefetch:
https://github.com/paramiko/paramiko/blob/ed8b09751ff20340332d4b1bb2b10e32aedc57ff/paramiko/sftp_file.py#L482

It's not prefetching lazily/just-in-time like smart_open.s3 does. Could that be achieved with a more low-level approach?

@davidszotten
Author

My understanding is that it requests all the data from the server (hence my comment about server resources above), but it only reads the replies as required and only keeps a single such chunk of data in memory at a time.

@ddelange
Contributor

Aha, that's not obvious to me from a quick look at the code. It would be cool to add a test that verifies this behaviour: when a big file is opened with prefetch_kwargs passed, some but not all data gets immediately requested from the server, even before performing any read operation.

@davidszotten
Author

some but not all data gets immediately requested from the server

All the data does get requested, but none of it is read until we start reading.

Are there any existing tests that make real connections like this that I can start from?

@ddelange
Contributor

I don't know if I'm overthinking this, and I don't know if such a test is even necessary. Probably one for @mpenkov to decide if he's comfortable with this :)

@nithinkrishna2107

Hey guys, I'm using the smart_open library to read large files from an SFTP server and I observed that the reads are very slow. It is taking 2 seconds to read 8 MB chunks from the file. I think adding support for prefetch would help. Please see https://docs.paramiko.org/en/latest/api/sftp.html

I'm waiting on this PR to get merged. If it is going to take more time, I will try switching over to using paramiko directly. Please let me know if anyone has updates on when this can be merged.

@ddelange
Contributor

ddelange commented Feb 26, 2025

@nithinkrishna2107 it would be great if you could test the PR and report back!

in pip / setup.py / pyproject.toml:

smart_open[ssh]@https://github.com/davidszotten/smart_open/archive/refs/heads/sftp-prefetch.zip

or (equivalent but immutable):

smart_open[ssh]@https://github.com/piskvorky/smart_open/archive/afe7da9fb0632bc4f6686572c22f1fc83bc9b550.zip

davidszotten linked a pull request Feb 26, 2025 that will close this issue
@tvs-lyst

in pip / setup.py / pyproject.toml:

smart_open[ssh]@https://github.com/davidszotten/smart_open/archive/refs/heads/sftp-prefetch.zip

Hi, FYI I've tested the PR code in a staging environment, as you suggested, for a compressed file download over SFTP that would previously take over 35 minutes with smart_open to stream from the SFTP server onto the local filesystem.

The only change (aside from installing the PR version of smart_open) is:

transport_params = {"prefetch_kwargs": {}}
with smart_open.open(url, "rb", transport_params=transport_params) as file_handle:
    # rest of the code

And the file now downloads in less than 30 seconds, which is much better.

@nithinkrishna2107

nithinkrishna2107 commented Feb 26, 2025

@ddelange Sure, I will test this and let you know the results.

Thanks to @tvs-lyst for testing the change.

I observed that smart_open has two files where open is defined: one here https://github.com/piskvorky/smart_open/blob/develop/smart_open/smart_open_lib.py#L100 and one here https://github.com/piskvorky/smart_open/blob/develop/smart_open/ssh.py

Do we need to make a similar change in smart_open_lib.py?

@ddelange
Contributor

No need to touch smart_open_lib (from smart_open import open); that one just propagates transport_params to (in this case) smart_open.ssh.open. So #849 is sufficient.
