
Small reads in uv_distribution::distribution_database::download producing v. slow download performance when installing torch #2220

Closed
thundergolfer opened this issue Mar 5, 2024 · 28 comments
Labels
great writeup A wonderful example of a quality contribution 💜 performance Potential performance improvement

Comments

@thundergolfer
Contributor

thundergolfer commented Mar 5, 2024

Hey, this is a successor to #1978 as I'm still trying to integrate uv with Modal's internal mirror. But this time I don't yet have a solution to this problem 🙂.

I've observed extremely degraded install performance when installing PyTorch against our mirror. The install command is uv pip install torch.

On my test machine, installing without our mirror is fast, taking around 7 seconds.

[screenshot: uv pip install torch without the mirror completing in around 7 seconds]

But when I get our mirror involved, performance tanks and an install takes over 5 minutes. The command I'm running is this:

RUST_LOG="uv=trace,uv_extract=trace,async_zip=debug" \
UV_NO_CACHE=true UV_INDEX_URL="http://172.21.0.1:5555/simple" \
uv --verbose pip install torch

Investigating...

Using debug logs on our mirror I see that there is a vast (~64x) difference in read sizes between uv pip install and pip install.

Our mirror serves .whl files off disk like this:

use hyper::{Body, Response};
use tokio_util::codec::{BytesCodec, FramedRead};

// ... 

let file = tokio::fs::File::open(&filepath).await?;
debug!("cache hit: serving {} from disk cache", filepath.display());
let size = file.metadata().await.map(|m| Some(m.len())).unwrap_or(None);
let stream = FramedRead::new(file, BytesCodec::new());
let body = Body::wrap_stream(stream);
let mut response = Response::new(body);
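
The size computed above isn't attached to the response in this excerpt. Purely for illustration, a minimal sketch (hypothetical, not our actual mirror code) of surfacing it as a Content-Length header on the hyper response would be:

```
// Hypothetical sketch only: expose the known file size as Content-Length
// so hyper can use a sized (rather than chunked) body encoding.
if let Some(len) = size {
    response
        .headers_mut()
        .insert(hyper::header::CONTENT_LENGTH, hyper::header::HeaderValue::from(len));
}
```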

By enabling debug logs (RUST_LOG=debug) I can see that when using pip install read chunks are large:

[screenshot: mirror debug logs showing read sizes of up to 128KiB during pip install]

We can see in the screenshot of the debug logs that reads of up to 128KiB are happening. Performance of the pip install command is good; the .whl download takes around 3 seconds.

The pip command here was

PIP_TRUSTED_HOST="172.21.0.1:5555" \
PIP_NO_CACHE_DIR=off PIP_INDEX_URL="http://172.21.0.1:5555/simple" \
pip install torch

and pip --version is pip 23.3.1 from /home/ubuntu/modal/venv/lib/python3.11/site-packages/pip (python 3.11).

The read sizes of pip contrast strongly with uv pip install, where I'm seeing read sizes between 11 and 1120 bytes.

[screenshot: mirror debug logs showing read sizes of 11–1120 bytes during uv pip install]

I believe these relatively tiny read sizes when downloading are contributing to the slow download performance on the 720.55MiB torch .whl file.

Looking at the trace logs this is the region of code I'm in:

let download = |response: reqwest::Response| {


uv version

uv --version
uv 0.1.14

OS

uname -a
Linux ip-10-1-1-198 5.15.0-1052-aws #57~20.04.1-Ubuntu SMP Mon Jan 15 17:04:56 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
@thundergolfer
Contributor Author

I'm happy to look more into what's going on later in the week 🙂. I've read through some of the code in uv and async_zip but haven't figured out what's going on.

@charliermarsh
Member

Cool, would definitely welcome more exploration / eyes on it and happy to support. We unzip the wheel to disk directly as we stream the download in memory; it looks like we're making way too many small reads right now?

@thundergolfer
Contributor Author

Yeah, I added BufReader in a couple of places, which was ineffectual, and I see there's already a BufReader used in a couple of places (in uv and in async_zip), despite seeing reads much smaller than the 8KiB default buffer size.

It'd take me a decent bit more time to unpack why these small reads are being produced and slap a buffered reader in the right place.

@charliermarsh
Member

Is there any way to reproduce this with publicly available indexes?

@thundergolfer
Contributor Author

Yeah, I think that's job one so that we can collaborate better on this! We won't open-source our mirror implementation just yet.

@thundergolfer thundergolfer changed the title Small reads in uv_distribution::distribution_database::download producing v. slow download performance when install torch Small reads in uv_distribution::distribution_database::download producing v. slow download performance when installing torch Mar 5, 2024
@hmc-cs-mdrissi

This behavior seems very similar to what I'm seeing as well. For our internal index, proxied public packages have good performance even when the wheel is large (PyTorch), but internal packages with large wheels are much slower with uv than pip.

The only thing I can add is that I know our index doesn't support range requests/HEAD, so maybe the fallback path there is what's needed to trigger the slow behavior here.

@zanieb zanieb added performance Potential performance improvement great writeup A wonderful example of a quality contribution 💜 labels Mar 6, 2024
@charliermarsh
Member

We don't use range requests for the wheel download IIRC, just for fetching wheel metadata, so hopefully not a factor here.

@thundergolfer
Contributor Author

thundergolfer commented Mar 9, 2024

Adding some details of attempting a repro with an open-source mirror, proxpi: docker run -p 5000:5000 epicwink/proxpi.

time UV_NO_CACHE=true UV_INDEX_URL="http://127.0.0.1:5000/index" target/release/uv --verbose pip install torch

This has proxpi logging a .whl download time of around 7.4s:

2024-03-09 04:37:34 [    INFO] gunicorn.access: 172.18.0.1 "GET /index/torch/torch-2.2.1-cp311-cp311-manylinux1_x86_64.whl HTTP/1.1" 200 0 7431ms

With just pip:

PIP_TRUSTED_HOST="127.0.0.1:5000" PIP_NO_CACHE_DIR=off PIP_INDEX_URL="http://127.0.0.1:5000/index" pip install torch
2024-03-09 04:39:12 [    INFO] gunicorn.access: 172.18.0.1 "GET /index/torch/ HTTP/1.1" 200 24056 7ms
2024-03-09 04:39:14 [    INFO] gunicorn.access: 172.18.0.1 "GET /index/torch/torch-2.2.1-cp311-cp311-manylinux1_x86_64.whl HTTP/1.1" 200 0 2080ms

I get around 2s.

So there's a difference, but nothing like the ~5 minute slowdown seen against our internal mirror. That's a negative result on providing an open-source reproduction, so I'll keep digging.

@charliermarsh
Member

I'd really like to fix this. I'll try to play around with it today. The simplest thing we should try:

diff --git a/crates/uv-extract/src/stream.rs b/crates/uv-extract/src/stream.rs
index 8d8dfb11..b9272df5 100644
--- a/crates/uv-extract/src/stream.rs
+++ b/crates/uv-extract/src/stream.rs
@@ -18,7 +18,7 @@ pub async fn unzip<R: tokio::io::AsyncRead + Unpin>(
     target: impl AsRef<Path>,
 ) -> Result<(), Error> {
     let target = target.as_ref();
-    let mut reader = reader.compat();
+    let mut reader = futures::io::BufReader::new( reader.compat());
     let mut zip = async_zip::base::read::stream::ZipFileReader::new(&mut reader);
 
     let mut directories = FxHashSet::default();

(It looks like the async ZIP reader uses a BufReader internally, but only for the entry bodies, and not the headers.)

@charliermarsh
Member

Not seeing any improvement from this change.

FWIW, on my machine, the difference is much smaller with the open-source proxy.

pip:

gunicorn.access: 172.17.0.1 "GET /index/torch/torch-2.2.1-cp312-none-macosx_11_0_arm64.whl HTTP/1.1" 200 0 888ms

uv:

gunicorn.access: 172.17.0.1 "GET /index/torch/torch-2.2.1-cp312-none-macosx_11_0_arm64.whl HTTP/1.1" 200 0 1374ms

Which is maybe plausible given that we're unzipping and writing to disk -- not sure.

@charliermarsh
Member

We could also try increasing the buffer size.

@thundergolfer
Contributor Author

thundergolfer commented Mar 9, 2024

OK, from looking at trace logs for our internal mirror, I think the encoding strategy difference is key:

pip

2024-03-09T17:41:23.715190Z DEBUG pypi_mirror: cache hit: serving /tmp/pypi-mirror-worker-cache/packages/2c/df/5810707da6f2fd4be57f0cc417987c0fa16a2eecf0b1b71f82ea555dc619/torch-2.2.1-cp311-cp311-manylinux1_x86_64.whl from disk cache
2024-03-09T17:41:23.715200Z TRACE hyper::proto::h1::conn: flushed({role=server}): State { reading: KeepAlive, writing: Init, keep_alive: Busy }
2024-03-09T17:41:23.715260Z TRACE encode_headers: hyper::proto::h1::role: Server::encode status=200, body=Some(Unknown), req_method=Some(GET)
2024-03-09T17:41:23.715329Z DEBUG hyper::proto::h1::io: flushed 83 bytes
2024-03-09T17:41:23.715336Z TRACE hyper::proto::h1::conn: flushed({role=server}): State { reading: KeepAlive, writing: Body(Encoder { kind: Length(755552143), is_last: false }), keep_alive: Busy }
2024-03-09T17:41:23.715355Z TRACE hyper::proto::h1::encode: sized write, len = 8192
2024-03-09T17:41:23.715362Z TRACE hyper::proto::h1::io: buffer.queue self.len=0 buf.len=8192

uv

2024-03-09T17:40:37.658227Z DEBUG pypi_mirror: cache hit: serving /tmp/pypi-mirror-worker-cache/packages/2c/df/5810707da6f2fd4be57f0cc417987c0fa16a2eecf0b1b71f82ea555dc619/torch-2.2.1-cp311-cp311-manylinux1_x86_64.whl from disk cache
2024-03-09T17:40:37.658414Z TRACE hyper::proto::h1::conn: flushed({role=server}): State { reading: KeepAlive, writing: Init, keep_alive: Busy }
2024-03-09T17:40:37.658530Z TRACE encode_headers: hyper::proto::h1::role: Server::encode status=200, body=Some(Unknown), req_method=Some(GET)
2024-03-09T17:40:37.658576Z DEBUG hyper::proto::h1::io: flushed 108 bytes
2024-03-09T17:40:37.658591Z TRACE hyper::proto::h1::conn: flushed({role=server}): State { reading: KeepAlive, writing: Body(Encoder { kind: Chunked, is_last: false }), keep_alive: Busy }
2024-03-09T17:40:37.658755Z TRACE hyper::proto::h1::encode: encoding chunked 10B
2024-03-09T17:40:37.658759Z TRACE hyper::proto::h1::io: buffer.queue self.len=0 buf.len=15

Will work out why this difference occurs. (For context: hyper uses the sized Length encoder when it knows the body length up front, e.g. from a Content-Length header, and falls back to Chunked otherwise, so on the uv path the mirror evidently isn't reporting a length.)

@charliermarsh
Member

Ohh nice!

@thundergolfer
Contributor Author

thundergolfer commented Mar 9, 2024

I think this is another content-encoding related issue.

I added some extra trace! logs into hyper when it was running encode_headers and deciding on the encoding strategy.

uv

Server::encode head=MessageHead { version: HTTP/1.1, subject: 200, headers: {"content-encoding": "gzip"}, extensions: Extensions }, status=200, body=Some(Unknown), req_method=Some(GET)

pip

Server::encode head=MessageHead { version: HTTP/1.1, subject: 200, headers: {"content-length": "755552143"}, extensions: Extensions }, status=200, body=Some(Unknown), req_method=Some(GET)

Seeing gzip in the headers cued me to think this was just the same issue as #1978 showing up in a different place.

By making the same header change that's done in #1978, torch installs much, much faster against the mirror!

[screenshot: torch installing against the mirror with a ~15.64s wheel download]

A 15.64s download against the mirror is still slower than what uv gets against PyPI (around 10s), so there may be some additional change I can make to bring our mirror's performance to parity with PyPI. It really should be faster, as it's serving over the local network.

Checking the pip repository for its history of dealing with content-encoding, I see pypa/pip#1688.
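
For reference, the request-side change being described looks roughly like the following. This is a sketch only, using reqwest (which uv's download path is built on, per the reqwest::Response above); client and wheel_url are hypothetical placeholders, and the exact spot where uv sets this header isn't shown here.

```
// Sketch: advertise only `identity` so the server returns the .whl bytes as-is
// instead of wrapping the already-compressed archive in gzip.
// `client` (reqwest::Client) and `wheel_url` are hypothetical placeholders.
let response = client
    .get(wheel_url)
    .header(reqwest::header::ACCEPT_ENCODING, "identity")
    .send()
    .await?;
```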

@thundergolfer
Contributor Author

Looking at the flushes in our mirror after adding the identity content-encoding header, I can now see they're averaging around 8KiB, much higher than the ~1KiB before:

[screenshot: mirror debug logs showing flushes averaging around 8KiB]

Still lower than what pip was producing in our mirror, which went up to 128KiB.

@charliermarsh
Member

Can you try tweaking the buffer size on the BufReader? I think 8KiB might be the exact default.

@thundergolfer
Contributor Author

thundergolfer commented Mar 9, 2024

You mean doing this but with a non-default buffer size? #2220 (comment)

@charliermarsh
Member

Yeah. Alternatively I can change the buffer used inside async-zip-rs; we use a fork anyway.

@charliermarsh
Member

(I don’t have much intuition for how nested BufReaders interact.)

@thundergolfer
Contributor Author

thundergolfer commented Mar 9, 2024

Yep with this diff I get comparable performance to pip 🙂

diff --git a/crates/uv-extract/src/stream.rs b/crates/uv-extract/src/stream.rs
index 8d8dfb11..db3c42fb 100644
--- a/crates/uv-extract/src/stream.rs
+++ b/crates/uv-extract/src/stream.rs
@@ -18,7 +18,7 @@ pub async fn unzip<R: tokio::io::AsyncRead + Unpin>(
     target: impl AsRef<Path>,
 ) -> Result<(), Error> {
     let target = target.as_ref();
-    let mut reader = reader.compat();
+    let mut reader = futures::io::BufReader::with_capacity( 128 * 1024, reader.compat());
     let mut zip = async_zip::base::read::stream::ZipFileReader::new(&mut reader);

     let mut directories = FxHashSet::default();

Debug logging on hyper shows flushes of up to 128KiB.


[screenshot: hyper debug logs showing flushes of up to 128KiB]
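
As an aside on the nested-BufReader question: here's a toy illustration (blocking std::io, not uv code) of why the buffer sitting next to the source is the one that determines read sizes against that source. async_zip's internal 8KiB buffering downstream doesn't shrink the reads made against the network stream.

```
use std::io::{BufReader, Read};

/// Toy source that logs how large each read request against it is.
struct LoggingSource {
    remaining: usize,
}

impl Read for LoggingSource {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        let n = buf.len().min(self.remaining);
        println!("source read: requested {} bytes, served {}", buf.len(), n);
        self.remaining -= n;
        Ok(n) // buffer contents are left as zeroes; only the sizes matter here
    }
}

fn main() -> std::io::Result<()> {
    let source = LoggingSource { remaining: 512 * 1024 }; // pretend 512KiB "wheel"
    // The buffer adjacent to the source decides read sizes against it (128KiB here).
    let big = BufReader::with_capacity(128 * 1024, source);
    // A downstream 8KiB buffer (like async_zip's internal one) doesn't shrink those reads.
    let mut small = BufReader::with_capacity(8 * 1024, big);

    let mut sink = [0u8; 4096];
    while small.read(&mut sink)? != 0 {}
    Ok(())
}
```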

@thundergolfer
Contributor Author

Think this is enough to make a PR. It'd be two things:

  1. setting accept-encoding: identity, in the spirit of "Accept encoding identity" pypa/pip#1688
  2. increasing the buffer from 8KiB to 128KiB.

@charliermarsh
Member

Sounds great, thank you! The only thing I’ll want to test before merging is whether the buffer size changes benchmarks much / at all against PyPI.

@thundergolfer
Contributor Author

Makes sense. Looks like I could even check this myself by reading BENCHMARKS.md and going from there?

@charliermarsh
Member

Totally, up to you! I'd just do a release build on main, move it to uv-main, then do a release build on your branch, then run something like:

python -m scripts.bench \
        --uv-path ./target/release/uv \
        --uv-path ./target/release/uv-main \
        ./scripts/requirements/compiled/trio.txt --benchmark install-cold

@charliermarsh
Member

By the way, I can get this released pretty quickly.

thundergolfer added a commit to thundergolfer/uv that referenced this issue Mar 9, 2024
thundergolfer added a commit to thundergolfer/uv that referenced this issue Mar 10, 2024
charliermarsh pushed a commit that referenced this issue Mar 10, 2024
## Summary

Addressing the extremely slow performance detailed in
#2220. There are two changes to
increase download performance:

1. setting `accept-encoding: identity`, in the spirit of
pypa/pip#1688
2. increasing buffer from 8KiB to 128KiB. 

### 1. accept-encoding: identity

I think this related `pip` PR has a good explanation of what's going on:
pypa/pip#1688

```
  # We use Accept-Encoding: identity here because requests
  # defaults to accepting compressed responses. This breaks in
  # a variety of ways depending on how the server is configured.
  # - Some servers will notice that the file isn't a compressible
  #   file and will leave the file alone and with an empty
  #   Content-Encoding
  # - Some servers will notice that the file is already
  #   compressed and will leave the file alone and will add a
  #   Content-Encoding: gzip header
  # - Some servers won't notice anything at all and will take
  #   a file that's already been compressed and compress it again
  #   and set the Content-Encoding: gzip header
```

The `files.pythonhosted.org` server is the 1st kind. Example debug log I
added in `uv` when installing against PyPI:

<img width="1459" alt="image"
src="https://github.com/astral-sh/uv/assets/12058921/ef10d758-46aa-4c8e-9dba-47f33437401b">

(there is no `content-encoding` header in this response, the `whl`
hasn't been compressed, and there is a content-length header)

Our internal mirror is the third kind. It does seem sensible that our
mirror should be modified to act like the first kind, but `uv` should
handle all three cases like `pip` does.

### 2. buffer increase

In #2220 I observed that `pip`'s
downloading was causing up-to 128KiB flushes in our mirror.

After fix 1, `uv` was still only causing up-to 8KiB flushes, and was
slower to download than `pip`. Increasing this buffer from the default
8KiB led to a download performance improvement against our mirror and
the expected observed 128KiB flushes.

## Test Plan

Ran benchmarking as instructed by @charliermarsh 

<img width="1447" alt="image"
src="https://github.com/astral-sh/uv/assets/12058921/840d9c8d-4b98-4bfa-89f3-073a2dec1f23">

No performance improvement or regression.
@hmc-cs-mdrissi

I tested the fix on my slow large wheel and it works great. Installation time is now fast for that one wheel, and I guess my index server had the same issue here. Thanks to both of you for continuing to improve uv's performance.

@charliermarsh
Member

All @thundergolfer!

@charliermarsh
Member

I'll probably cut a release tomorrow with this in it.

@zanieb zanieb closed this as completed Mar 10, 2024