
Small reads in uv_distribution::distribution_database::download producing v. slow download performance when installing torch #2220

Closed
thundergolfer opened this issue Mar 5, 2024 · 28 comments
Labels
great writeup A wonderful example of a quality contribution 💜 performance Potential performance improvement

Comments

@thundergolfer
Contributor

thundergolfer commented Mar 5, 2024

Hey, this is a successor to #1978 as I'm still trying to integrate uv with Modal's internal mirror. But this time I don't yet have a solution to this problem 🙂.

I've observed extremely degraded install performance when installing PyTorch against our mirror. The install command is uv pip install torch.

On my test machine, installing without our mirror is fast, taking around 7 seconds.

[screenshot: uv pip install torch without the mirror completing in around 7 seconds]

But when I get our mirror involved, performance tanks and an install takes over 5 minutes. The command I'm running is this:

RUST_LOG="uv=trace,uv_extract=trace,async_zip=debug" \
UV_NO_CACHE=true UV_INDEX_URL="http://172.21.0.1:5555/simple" \
uv --verbose pip install torch

Investigating...

Using debug logs on our mirror I see that there is a vast (~64x) difference in read sizes between uv pip install and pip install.

Our mirror serves .whl files off disk like this:

use hyper::{Body, Response};
use tokio_util::codec::{BytesCodec, FramedRead};

// ... 

let file = tokio::fs::File::open(&filepath).await?;
debug!("cache hit: serving {} from disk cache", filepath.display());
let size = file.metadata().await.map(|m| Some(m.len())).unwrap_or(None);
let stream = FramedRead::new(file, BytesCodec::new());
let body = Body::wrap_stream(stream);
let mut response = Response::new(body);
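
The size computed above isn't attached to the response in this excerpt. Purely for illustration, a minimal sketch (hypothetical, not our actual mirror code) of surfacing it as a Content-Length header on the hyper response would be:

```
// Hypothetical sketch only: expose the known file size as Content-Length
// so hyper can use a sized (rather than chunked) body encoding.
if let Some(len) = size {
    response
        .headers_mut()
        .insert(hyper::header::CONTENT_LENGTH, hyper::header::HeaderValue::from(len));
}
```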

By enabling debug logs (RUST_LOG=debug) I can see that when using pip install read chunks are large:

[screenshot: mirror debug logs showing read sizes of up to 128KiB during pip install]

We can see in the screenshot of the debug logs that reads of up to 128KiB are happening. Performance of the pip install command is good; the .whl download takes around 3 seconds.

The pip command here was

PIP_TRUSTED_HOST="172.21.0.1:5555" \
PIP_NO_CACHE_DIR=off PIP_INDEX_URL="http://172.21.0.1:5555/simple" \
pip install torch

and pip --version is pip 23.3.1 from /home/ubuntu/modal/venv/lib/python3.11/site-packages/pip (python 3.11).

The read sizes of pip contrast strongly with uv pip install, where I'm seeing read sizes between 11 and 1120 bytes.

[screenshot: mirror debug logs showing read sizes of 11–1120 bytes during uv pip install]

I believe these relatively tiny read sizes when downloading are contributing to the slow download performance on the 720.55MiB torch .whl file.

Looking at the trace logs this is the region of code I'm in:

let download = |response: reqwest::Response| {


uv version

uv --version
uv 0.1.14

OS

uname -a
Linux ip-10-1-1-198 5.15.0-1052-aws #57~20.04.1-Ubuntu SMP Mon Jan 15 17:04:56 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
@thundergolfer
Contributor Author

I'm happy to look more into what's going on later in the week 🙂. I've read through some of the code in uv and async_zip but haven't figured out what's going on.

@charliermarsh
Member

Cool, would definitely welcome more exploration / eyes on it and happy to support. We unzip the wheel to disk directly as we stream the download in memory; it looks like we're making way too many small reads right now?

@thundergolfer
Contributor Author

Yeah, I added BufReader in a couple of places, which was ineffectual, and I see there's already a BufReader used in a couple of places (in uv and in async_zip), despite seeing reads much smaller than the 8KiB default buffer size.

It'd take me a decent bit more time to unpack why these small reads are being produced and slap a buffered reader in the right place.

@charliermarsh
Member

Is there any way to reproduce this with publicly available indexes?

@thundergolfer
Contributor Author

Yeah, I think that's job one so that we can collaborate better on this! We won't open-source our mirror implementation just yet.

@thundergolfer thundergolfer changed the title Small reads in uv_distribution::distribution_database::download producing v. slow download performance when install torch Small reads in uv_distribution::distribution_database::download producing v. slow download performance when installing torch Mar 5, 2024
@hmc-cs-mdrissi

This behavior seems very similar to what I'm seeing as well. For our internal index, proxied public packages have good performance even when the wheel is large (PyTorch), but internal packages with large wheels are much slower with uv than pip.

The only thing I can add is that I know our index doesn't support range requests/HEAD, so maybe the fallback path there is what's needed to trigger the slow behavior here.

@zanieb zanieb added performance Potential performance improvement great writeup A wonderful example of a quality contribution 💜 labels Mar 6, 2024
@charliermarsh
Member

We don't use range requests for the wheel download IIRC, just for fetching wheel metadata, so hopefully not a factor here.

@thundergolfer
Contributor Author

thundergolfer commented Mar 9, 2024

Adding some details of attempting a repro with an open-source mirror, proxpi: docker run -p 5000:5000 epicwink/proxpi.

time UV_NO_CACHE=true UV_INDEX_URL="http://127.0.0.1:5000/index" target/release/uv --verbose pip install torch

This has proxpi logging a .whl download time of around 7.4s:

2024-03-09 04:37:34 [    INFO] gunicorn.access: 172.18.0.1 "GET /index/torch/torch-2.2.1-cp311-cp311-manylinux1_x86_64.whl HTTP/1.1" 200 0 7431ms

With just pip:

PIP_TRUSTED_HOST="127.0.0.1:5000" PIP_NO_CACHE_DIR=off PIP_INDEX_URL="http://127.0.0.1:5000/index" pip install torch
2024-03-09 04:39:12 [    INFO] gunicorn.access: 172.18.0.1 "GET /index/torch/ HTTP/1.1" 200 24056 7ms
2024-03-09 04:39:14 [    INFO] gunicorn.access: 172.18.0.1 "GET /index/torch/torch-2.2.1-cp311-cp311-manylinux1_x86_64.whl HTTP/1.1" 200 0 2080ms

I get around 2s.

So there's a difference, but nothing like the ~5 minute slowdown seen against our internal mirror. That's a negative result on providing an open-source reproduction, so I'll keep digging.

@charliermarsh
Member

I'd really like to fix this. I'll try to play around with it today. The simplest thing we should try:

diff --git a/crates/uv-extract/src/stream.rs b/crates/uv-extract/src/stream.rs
index 8d8dfb11..b9272df5 100644
--- a/crates/uv-extract/src/stream.rs
+++ b/crates/uv-extract/src/stream.rs
@@ -18,7 +18,7 @@ pub async fn unzip<R: tokio::io::AsyncRead + Unpin>(
     target: impl AsRef<Path>,
 ) -> Result<(), Error> {
     let target = target.as_ref();
-    let mut reader = reader.compat();
+    let mut reader = futures::io::BufReader::new( reader.compat());
     let mut zip = async_zip::base::read::stream::ZipFileReader::new(&mut reader);
 
     let mut directories = FxHashSet::default();

(It looks like the async ZIP reader uses a BufReader internally, but only for the entry bodies, and not the headers.)

@charliermarsh
Member

Not seeing any improvement from this change.

FWIW, on my machine, the difference is much smaller with the open-source proxy.

pip:

gunicorn.access: 172.17.0.1 "GET /index/torch/torch-2.2.1-cp312-none-macosx_11_0_arm64.whl HTTP/1.1" 200 0 888ms

uv:

gunicorn.access: 172.17.0.1 "GET /index/torch/torch-2.2.1-cp312-none-macosx_11_0_arm64.whl HTTP/1.1" 200 0 1374ms

Which is maybe plausible given that we're unzipping and writing to disk -- not sure.

@charliermarsh
Member

We could also try increasing the buffer size.

@thundergolfer
Contributor Author

thundergolfer commented Mar 9, 2024

OK, from looking at trace logs for our internal mirror, I think the encoding strategy difference is key:

pip

2024-03-09T17:41:23.715190Z DEBUG pypi_mirror: cache hit: serving /tmp/pypi-mirror-worker-cache/packages/2c/df/5810707da6f2fd4be57f0cc417987c0fa16a2eecf0b1b71f82ea555dc619/torch-2.2.1-cp311-cp311-manylinux1_x86_64.whl from disk cache
2024-03-09T17:41:23.715200Z TRACE hyper::proto::h1::conn: flushed({role=server}): State { reading: KeepAlive, writing: Init, keep_alive: Busy }
2024-03-09T17:41:23.715260Z TRACE encode_headers: hyper::proto::h1::role: Server::encode status=200, body=Some(Unknown), req_method=Some(GET)
2024-03-09T17:41:23.715329Z DEBUG hyper::proto::h1::io: flushed 83 bytes
2024-03-09T17:41:23.715336Z TRACE hyper::proto::h1::conn: flushed({role=server}): State { reading: KeepAlive, writing: Body(Encoder { kind: Length(755552143), is_last: false }), keep_alive: Busy }
2024-03-09T17:41:23.715355Z TRACE hyper::proto::h1::encode: sized write, len = 8192
2024-03-09T17:41:23.715362Z TRACE hyper::proto::h1::io: buffer.queue self.len=0 buf.len=8192

uv

2024-03-09T17:40:37.658227Z DEBUG pypi_mirror: cache hit: serving /tmp/pypi-mirror-worker-cache/packages/2c/df/5810707da6f2fd4be57f0cc417987c0fa16a2eecf0b1b71f82ea555dc619/torch-2.2.1-cp311-cp311-manylinux1_x86_64.whl from disk cache
2024-03-09T17:40:37.658414Z TRACE hyper::proto::h1::conn: flushed({role=server}): State { reading: KeepAlive, writing: Init, keep_alive: Busy }
2024-03-09T17:40:37.658530Z TRACE encode_headers: hyper::proto::h1::role: Server::encode status=200, body=Some(Unknown), req_method=Some(GET)
2024-03-09T17:40:37.658576Z DEBUG hyper::proto::h1::io: flushed 108 bytes
2024-03-09T17:40:37.658591Z TRACE hyper::proto::h1::conn: flushed({role=server}): State { reading: KeepAlive, writing: Body(Encoder { kind: Chunked, is_last: false }), keep_alive: Busy }
2024-03-09T17:40:37.658755Z TRACE hyper::proto::h1::encode: encoding chunked 10B
2024-03-09T17:40:37.658759Z TRACE hyper::proto::h1::io: buffer.queue self.len=0 buf.len=15

Will work out why this difference occurs. (For context: hyper uses the sized Length encoder when it knows the body length up front, e.g. from a Content-Length header, and falls back to Chunked otherwise, so on the uv path the mirror evidently isn't reporting a length.)

@charliermarsh
Member

Ohh nice!

@thundergolfer
Contributor Author

thundergolfer commented Mar 9, 2024

I think this is another content-encoding related issue.

I added some extra trace! logs into hyper when it was running encode_headers and deciding on the encoding strategy.

uv

Server::encode head=MessageHead { version: HTTP/1.1, subject: 200, headers: {"content-encoding": "gzip"}, extensions: Extensions }, status=200, body=Some(Unknown), req_method=Some(GET)

pip

Server::encode head=MessageHead { version: HTTP/1.1, subject: 200, headers: {"content-length": "755552143"}, extensions: Extensions }, status=200, body=Some(Unknown), req_method=Some(GET)

Seeing gzip in the headers cued me to think this was just the same issue as #1978 showing up in a different place.

By making the same header change that's done in #1978, torch installs much, much faster against the mirror!

[screenshot: torch installing against the mirror with a ~15.64s wheel download]

A 15.64s download against the mirror is still slower than what uv gets against PyPI (around 10s), so there may be some additional change I can make to bring our mirror's performance to parity with PyPI. It really should be faster, as it's serving over the local network.

Checking the pip repository for its history of dealing with content-encoding, I see pypa/pip#1688.
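
For reference, the request-side change being described looks roughly like the following. This is a sketch only, using reqwest (which uv's download path is built on, per the reqwest::Response above); client and wheel_url are hypothetical placeholders, and the exact spot where uv sets this header isn't shown here.

```
// Sketch: advertise only `identity` so the server returns the .whl bytes as-is
// instead of wrapping the already-compressed archive in gzip.
// `client` (reqwest::Client) and `wheel_url` are hypothetical placeholders.
let response = client
    .get(wheel_url)
    .header(reqwest::header::ACCEPT_ENCODING, "identity")
    .send()
    .await?;
```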

@thundergolfer
Contributor Author

Looking at the flushes in our mirror after adding the identity content-encoding header, I can now see they're averaging around 8KiB, much higher than the ~1KiB before:

[screenshot: mirror debug logs showing flushes averaging around 8KiB]

Still lower than what pip was producing in our mirror, which went up to 128KiB.

@charliermarsh
Member

Can you try tweaking the buffer size on the BufReader? I think 8KiB might be the exact default.

@thundergolfer
Contributor Author

thundergolfer commented Mar 9, 2024

You mean doing this but with a non-default buffer size? #2220 (comment)

@charliermarsh
Member

Yeah. Alternatively I can change the buffer used inside async-zip-rs; we use a fork anyway.

@charliermarsh
Member

(I don’t have much intuition for how nested BufReaders interact.)

@thundergolfer
Contributor Author

thundergolfer commented Mar 9, 2024

Yep with this diff I get comparable performance to pip 🙂

diff --git a/crates/uv-extract/src/stream.rs b/crates/uv-extract/src/stream.rs
index 8d8dfb11..db3c42fb 100644
--- a/crates/uv-extract/src/stream.rs
+++ b/crates/uv-extract/src/stream.rs
@@ -18,7 +18,7 @@ pub async fn unzip<R: tokio::io::AsyncRead + Unpin>(
     target: impl AsRef<Path>,
 ) -> Result<(), Error> {
     let target = target.as_ref();
-    let mut reader = reader.compat();
+    let mut reader = futures::io::BufReader::with_capacity( 128 * 1024, reader.compat());
     let mut zip = async_zip::base::read::stream::ZipFileReader::new(&mut reader);

     let mut directories = FxHashSet::default();

Debug logging on hyper shows flushes of up to 128KiB.


[screenshot: hyper debug logs showing flushes of up to 128KiB]
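
As an aside on the nested-BufReader question: here's a toy illustration (blocking std::io, not uv code) of why the buffer sitting next to the source is the one that determines read sizes against that source. async_zip's internal 8KiB buffering downstream doesn't shrink the reads made against the network stream.

```
use std::io::{BufReader, Read};

/// Toy source that logs how large each read request against it is.
struct LoggingSource {
    remaining: usize,
}

impl Read for LoggingSource {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        let n = buf.len().min(self.remaining);
        println!("source read: requested {} bytes, served {}", buf.len(), n);
        self.remaining -= n;
        Ok(n) // buffer contents are left as zeroes; only the sizes matter here
    }
}

fn main() -> std::io::Result<()> {
    let source = LoggingSource { remaining: 512 * 1024 }; // pretend 512KiB "wheel"
    // The buffer adjacent to the source decides read sizes against it (128KiB here).
    let big = BufReader::with_capacity(128 * 1024, source);
    // A downstream 8KiB buffer (like async_zip's internal one) doesn't shrink those reads.
    let mut small = BufReader::with_capacity(8 * 1024, big);

    let mut sink = [0u8; 4096];
    while small.read(&mut sink)? != 0 {}
    Ok(())
}
```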

@thundergolfer
Contributor Author

Think this is enough to make a PR. It'd be two things:

  1. setting accept-encoding: identity, in the spirit of "Accept encoding identity" pypa/pip#1688
  2. increasing the buffer from 8KiB to 128KiB.

@charliermarsh
Member

Sounds great, thank you! The only thing I’ll want to test before merging is whether the buffer size changes benchmarks much / at all against PyPI.

@thundergolfer
Contributor Author

Makes sense. Looks like I could even check this myself by reading BENCHMARKS.md and going from there?

@charliermarsh
Member

Totally, up to you! I'd just do a release build on main, move it to uv-main, then do a release build on your branch, then run something like:

python -m scripts.bench \
        --uv-path ./target/release/uv \
        --uv-path ./target/release/uv-main \
        ./scripts/requirements/compiled/trio.txt --benchmark install-cold

@charliermarsh
Member

By the way, I can get this released pretty quickly.

thundergolfer added a commit to thundergolfer/uv that referenced this issue Mar 9, 2024
thundergolfer added a commit to thundergolfer/uv that referenced this issue Mar 10, 2024
charliermarsh pushed a commit that referenced this issue Mar 10, 2024
## Summary

Addressing the extremely slow performance detailed in
#2220. There are two changes to
increase download performance:

1. setting `accept-encoding: identity`, in the spirit of
pypa/pip#1688
2. increasing buffer from 8KiB to 128KiB. 

### 1. accept-encoding: identity

I think this related `pip` PR has a good explanation of what's going on:
pypa/pip#1688

```
  # We use Accept-Encoding: identity here because requests
  # defaults to accepting compressed responses. This breaks in
  # a variety of ways depending on how the server is configured.
  # - Some servers will notice that the file isn't a compressible
  #   file and will leave the file alone and with an empty
  #   Content-Encoding
  # - Some servers will notice that the file is already
  #   compressed and will leave the file alone and will add a
  #   Content-Encoding: gzip header
  # - Some servers won't notice anything at all and will take
  #   a file that's already been compressed and compress it again
  #   and set the Content-Encoding: gzip header
```

The `files.pythonhosted.org` server is the 1st kind. Example debug log I
added in `uv` when installing against PyPI:

<img width="1459" alt="image"
src="https://github.com/astral-sh/uv/assets/12058921/ef10d758-46aa-4c8e-9dba-47f33437401b">

(there is no `content-encoding` header in this response, the `whl`
hasn't been compressed, and there is a content-length header)

Our internal mirror is the third kind. It does seem sensible that our
mirror should be modified to act like the first kind, but `uv` should
handle all three cases like `pip` does.

### 2. buffer increase

In #2220 I observed that `pip`'s
downloading was causing up-to 128KiB flushes in our mirror.

After fix 1, `uv` was still only causing up-to 8KiB flushes, and was
slower to download than `pip`. Increasing this buffer from the default
8KiB led to a download performance improvement against our mirror and
the expected observed 128KiB flushes.

## Test Plan

Ran benchmarking as instructed by @charliermarsh 

<img width="1447" alt="image"
src="https://github.com/astral-sh/uv/assets/12058921/840d9c8d-4b98-4bfa-89f3-073a2dec1f23">

No performance improvement or regression.
@hmc-cs-mdrissi

I tested the fix on my slow large wheel and it works great. Installation time is now fast for that one wheel, and I guess my index server had the same issue here. Thanks to both of you for continuing to improve uv's performance.

@charliermarsh
Member

All @thundergolfer!

@charliermarsh
Member

I'll probably cut a release tomorrow with this in it.

@zanieb zanieb closed this as completed Mar 10, 2024