
run-example.sh fails with urllib3.exceptions.ProtocolError: Response ended prematurely #896

Closed
awan-10 opened this issue Apr 29, 2024 · 11 comments

@awan-10
Contributor

awan-10 commented Apr 29, 2024

When I modified “run_example.sh” and changed the backend to vllm, I got the error message shown below. I will do some checking to determine whether the error comes from the server side or the client side.

I notice this benchmark has three modes: “mii”, “vllm”, and “aml”; mii and vllm target serving frameworks, while the aml mode benchmarks an API server on Azure. Is it possible to run this script to benchmark a local API server? I am thinking of running vLLM serving with a separate command and using this benchmark to test the API server that vLLM started. That way I would have better control over how the vLLM server is started and could see all the error messages from the vLLM server if it fails.

(vllm) [gma@spr02 mii]$ bash ./run_vllm.sh
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Token indices sequence length is longer than the specified maximum sequence length for this model (5883 > 4096). Running this sequence through the model will result in indexing errors
warmup queue size: 37 (1070543)
Process Process-1:
Traceback (most recent call last):
  File "/home/gma/anaconda3/envs/vllm/lib/python3.11/site-packages/requests/models.py", line 816, in generate
    yield from self.raw.stream(chunk_size, decode_content=True)
  File "/home/gma/anaconda3/envs/vllm/lib/python3.11/site-packages/urllib3/response.py", line 1040, in stream
    yield from self.read_chunked(amt, decode_content=decode_content)
  File "/home/gma/anaconda3/envs/vllm/lib/python3.11/site-packages/urllib3/response.py", line 1184, in read_chunked
    self._update_chunk_length()
  File "/home/gma/anaconda3/envs/vllm/lib/python3.11/site-packages/urllib3/response.py", line 1119, in _update_chunk_length
    raise ProtocolError("Response ended prematurely") from None
urllib3.exceptions.ProtocolError: Response ended prematurely
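
For context, here is a minimal client-side sketch (not the benchmark's actual code; the endpoint URL and request payload are assumptions) showing how a streamed response that the server closes early surfaces this error:

# Minimal sketch, not the benchmark client: stream from a hypothetical local
# vLLM-style endpoint and catch a premature close of the chunked response.
import requests
from requests.exceptions import ChunkedEncodingError

url = "http://localhost:8000/generate"  # assumed server address and route

try:
    with requests.post(url, json={"prompt": "hello", "stream": True}, stream=True) as resp:
        for chunk in resp.iter_content(chunk_size=None):
            print(chunk)
except ChunkedEncodingError as err:
    # requests surfaces urllib3.exceptions.ProtocolError ("Response ended
    # prematurely") as ChunkedEncodingError when the server closes the
    # connection before the chunked stream is terminated.
    print("Server closed the stream early:", err)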

@awan-10
Contributor Author

awan-10 commented Apr 29, 2024

@delock - FYI. I created this issue so we can track and fix it. Please work with the folks assigned to this issue.

@lekurile
Contributor

Hello @delock,

Thank you for raising this issue. I ran a local vllm benchmark with the microsoft/Phi-3-mini-4k-instruct model using the following code:

# Run benchmark
python ./run_benchmark.py \
        --model microsoft/Phi-3-mini-4k-instruct \
        --tp_size 1 \
        --num_replicas 1 \
        --max_ragged_batch_size 768 \
        --mean_prompt_length 2600 \
        --mean_max_new_tokens 60 \
        --stream \
        --backend vllm \
        --overwrite_results

# Generate the plots
python ./src/plot_th_lat.py --data_dirs results_vllm/

echo "Find figures in ./plots/ and log outputs in ./results/"

I also had to add the "--trust-remote-code" argument to the vllm_cmd here:
https://github.com/microsoft/DeepSpeedExamples/blob/1be0fc77a62ef965e2dea920789f7df95a843820/benchmarks/inference/mii/src/server.py#L39
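
The change looks roughly like this; this is only a sketch, since the exact structure of vllm_cmd in src/server.py may differ, and host, port, and args here are assumed names:

# Sketch of adding --trust-remote-code to the vLLM server launch command.
# The surrounding variable names are assumptions, not the file's actual
# contents; --trust-remote-code is the vLLM flag that models such as
# microsoft/Phi-3-mini-4k-instruct needed at the time.
vllm_cmd = (
    "python",
    "-m",
    "vllm.entrypoints.api_server",
    "--host", host,
    "--port", str(port),
    "--model", args.model,
    "--tensor-parallel-size", str(args.tp_size),
    "--trust-remote-code",
)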

Here's the resulting plot:

To reproduce the issue you show above, can you please provide a reproduction script so I can test on my end?

To answer your question:

Is it possible to run this script to benchmark a local API server? I am thinking of running vLLM serving with a separate command and using this benchmark to test the API server that vLLM started. That way I would have better control over how the vLLM server is started and could see all the error messages from the vLLM server if it fails.

We can update the benchmarking script to accept an additional argument with information about an existing local server; when that information is provided, the script will not stand up a new server and will instead target the existing one.
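
For example, something along these lines could work (a sketch only; --server_host and --server_port are illustrative names, not the benchmark's actual interface):

# Sketch of the proposed option: accept host/port of an already-running
# server and skip launching a new one. Argument names are illustrative.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--server_host", type=str, default=None,
                    help="Host of an existing inference server to target.")
parser.add_argument("--server_port", type=int, default=None,
                    help="Port of the existing inference server.")
args, _ = parser.parse_known_args()

# Only stand up a new server when no existing one was specified.
launch_new_server = args.server_host is None
if not launch_new_server:
    base_url = f"http://{args.server_host}:{args.server_port}"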

@delock
Contributor

delock commented Apr 30, 2024

@awan-10 @lekurile Thanks for starting this thread. I hit this error when I tried to run this example on a Xeon server with CPU. I suspect this is a configuration issue. For now, I plan to modify the script to run only the client code and start the server from a separate command line, so I will be able to see more error messages and get a better understanding.

@delock
Contributor

delock commented Apr 30, 2024

Hi @lekurile
Now I can start the server from a separate command line and run the benchmark against this server with a reduced test size (max batch 128, avg prompt 128) to start with.
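
For reference, the reduced-size run looks roughly like this (illustrative only: the flags are reused from the example above, the model name is just a placeholder, and my local change to skip the server launch is not shown):

python ./run_benchmark.py \
        --model microsoft/Phi-3-mini-4k-instruct \
        --tp_size 1 \
        --num_replicas 1 \
        --max_ragged_batch_size 128 \
        --mean_prompt_length 128 \
        --mean_max_new_tokens 60 \
        --stream \
        --backend vllm \
        --overwrite_results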

However, I hit the following error during post-processing; I suspect it is due to the transformers version. Which transformers version are you using? Mine is transformers==4.40.1.

Traceback (most recent call last):
  File "/home/gma/DeepSpeedExamples/benchmarks/inference/mii/./run_benchmark.py", line 44, in <module>
    run_benchmark()
  File "/home/gma/DeepSpeedExamples/benchmarks/inference/mii/./run_benchmark.py", line 36, in run_benchmark
    print_summary(client_args, response_details)
  File "/home/gma/DeepSpeedExamples/benchmarks/inference/mii/src/utils.py", line 235, in print_summary
    ps = get_summary(vars(args), response_details)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gma/DeepSpeedExamples/benchmarks/inference/mii/src/postprocess_results.py", line 80, in get_summary
    [
  File "/home/gma/DeepSpeedExamples/benchmarks/inference/mii/src/postprocess_results.py", line 81, in <listcomp>
    (len(get_tokenizer().tokenize(r.prompt)) + len(get_tokenizer().tokenize(r.generated_tokens)))
                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gma/anaconda3/envs/vllm/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 396, in tokenize
    return self.encode_plus(text=text, text_pair=pair, add_special_tokens=add_special_tokens, **kwargs).tokens()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gma/anaconda3/envs/vllm/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3037, in encode_plus
    return self._encode_plus(
           ^^^^^^^^^^^^^^^^^^
  File "/home/gma/anaconda3/envs/vllm/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 576, in _encode_plus
    batched_output = self._batch_encode_plus(
                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gma/anaconda3/envs/vllm/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 504, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
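
My guess (illustrative only, not a confirmed diagnosis) is that in the streaming case r.generated_tokens is a list of token strings rather than a single string, which the fast tokenizer's tokenize() rejects; a guard like the following would sidestep the TypeError:

# Illustrative guard only, not the repo's actual fix: count streamed tokens
# directly when generated_tokens is already a list of token strings.
generated = r.generated_tokens
if isinstance(generated, list):
    num_generated_tokens = len(generated)
else:
    num_generated_tokens = len(get_tokenizer().tokenize(generated))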

@lekurile
Contributor

Hi @delock,

I'm using transformers==4.40.1 as well.

After #895 was committed to the repo, I'm seeing the same error on my end as well.

  File "/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 504, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

Can you please try checking out fab5d06 (one commit prior, with a detached HEAD) and running again? I'll look into this PR and see whether we need to revert it.
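
Something like this, run from the root of your DeepSpeedExamples clone:

git checkout fab5d06
cd benchmarks/inference/mii
# then rerun the benchmark as before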

Thanks,
Lev

@lekurile
Contributor

@delock, here's the PR fixing the tokens_per_sec metric to work for both the streaming and non-streaming cases:
#897

You should be able to get past the error above with this PR, but I'm curious whether you're still seeing any failures.

@delock
Contributor

delock commented May 1, 2024

Yes, with the latest version the benchmark can move forward. I will see whether it runs to completion.

@delock, here's the PR fixing the tokens_per_sec metric to work for both the streaming and non-streaming cases: #897

You should be able to get past the error above with this PR, but I'm curious whether you're still seeing any failures.

@delock
Contributor

delock commented May 1, 2024

Hi @lekurile, the benchmark now proceeds but hits some other errors when running on CPU. I'll check with the vLLM CPU engineers to investigate those errors. I also submitted a PR adding a flag that allows starting the server from a separate command line:
#900

@loadams
Contributor

loadams commented Jul 18, 2024

Thanks @delock - can we close this issue for now?

@delock
Contributor

delock commented Jul 19, 2024

Thanks @delock - can we close this issue for now?

Yes, this is no longer an issue now, thanks!

@loadams
Contributor

loadams commented Jul 19, 2024

Thanks!

@loadams loadams closed this as completed Jul 19, 2024