a compare with vllm 0.2.7 #965

white-wolf-tech · 2024-01-25T10:18:19Z

System Info

ubuntu22.04
one Nvidia A800
driver info: 470.141.10
cuda: 12.3
tensorrt: 9.2.0.5

Who can help?

No response

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

set Concurrency = 16.
when I use VLLM=0.2.7. use vllm asyncengine.
min response time is 367 ms
max response time is 676 ms

I build Tensorrt-LLM and tensorrtllm_backend from the main branch.
use tritonserver deploy the model. test result is:
min response is 379 ms
max response is 4418 ms

Tensorrt-LLM is much lowest than VLLM0.2.7?
I think Tensorrt-LLM should be faster than VLLM.
why? any suggestion?

Expected behavior

NULL

actual behavior

NULL

additional notes

Null

The text was updated successfully, but these errors were encountered:

kaiyux · 2024-01-26T06:53:27Z

@Coder-nlper Please share your commands to build the engines and benchmarks so that we can check if the comparison is apple-to-apple. Thanks.

white-wolf-tech · 2024-01-26T07:17:27Z

commands to build:

hf_model_path=/root/chatglm3-6b/
engine_dir=/root/trtllm/trtllmmodels/fp16/
CUDA_ID="1"

CUDA_VISIBLE_DEVICES=$CUDA_ID python3 build.py
--model_dir $hf_model_path
--log_level "info"
--output_dir $engine_dir/1-gpu
--world_size 1
--tp_size 1
--max_batch_size 50
--max_input_len 2048
--max_output_len 512
--max_beam_width 1
--enable_context_fmha
--use_inflight_batching
--paged_kv_cache
--remove_input_padding

test use ab command:

ip=localhost
port=60025
ab -n 64 -c 16 -p "post.txt" -T "application/json" -H "Content-Type: application/json" -H "Cache-Control: no-cache" -H "Postman-Token: d6xxs-sdf-sdf09d" "http://$ip:$port/model/generate"

white-wolf-tech · 2024-01-26T09:54:05Z

Driver is 470.141.10. is there any relationship with it?

kristiankielhofner · 2024-01-26T10:31:14Z

Driver is 470.141.10. is there any relationship with it?

I have to imagine that likely isn't ideal.

While it's supported per the Nvidia Frameworks Support Matrix you may want to try comparing to 535 for the current 23.10 based containers. Also be aware main has moved to 23.12 which would be 545 (CUDA 12.3).

white-wolf-tech · 2024-01-29T02:45:56Z

Driver is 470.141.10. is there any relationship with it?

I have to imagine that likely isn't ideal.

While it's supported per the Nvidia Frameworks Support Matrix you may want to try comparing to 535 for the current 23.10 based containers. Also be aware main has moved to 23.12 which would be 545 (CUDA 12.3).

but the latest is 535.x.x

Tlntin · 2024-02-02T09:57:35Z

Driver is 470.141.10. is there any relationship with it?

I have to imagine that likely isn't ideal.
While it's supported per the Nvidia Frameworks Support Matrix you may want to try comparing to 535 for the current 23.10 based containers. Also be aware main has moved to 23.12 which would be 545 (CUDA 12.3).

but the latest is 535.x.x

stable version is 535, dev version is 545, beta version is 550.
refer link

nv-guomingz · 2024-11-15T11:38:25Z

Hi @white-wolf-tech do u still have further issue or question now? If not, we'll close it soon.

white-wolf-tech added the bug Something isn't working label Jan 25, 2024

byshiue assigned kaiyux Jan 26, 2024

white-wolf-tech mentioned this issue Feb 2, 2024

大佬有没有对比和VLLM的推理效果？ Tlntin/Qwen-TensorRT-LLM#72

Open

tp-nan mentioned this issue Mar 13, 2024

[Feature Request] More realistic benchmark and throughput optimization #1292

Closed

nv-guomingz added the stale label Nov 15, 2024

white-wolf-tech closed this as completed Nov 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

a compare with vllm 0.2.7 #965

a compare with vllm 0.2.7 #965

white-wolf-tech commented Jan 25, 2024 •

edited

Loading

kaiyux commented Jan 26, 2024

white-wolf-tech commented Jan 26, 2024 •

edited

Loading

white-wolf-tech commented Jan 26, 2024

kristiankielhofner commented Jan 26, 2024

white-wolf-tech commented Jan 29, 2024

Tlntin commented Feb 2, 2024

nv-guomingz commented Nov 15, 2024

a compare with vllm 0.2.7 #965

a compare with vllm 0.2.7 #965

Comments

white-wolf-tech commented Jan 25, 2024 • edited Loading

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

actual behavior

additional notes

kaiyux commented Jan 26, 2024

white-wolf-tech commented Jan 26, 2024 • edited Loading

white-wolf-tech commented Jan 26, 2024

kristiankielhofner commented Jan 26, 2024

white-wolf-tech commented Jan 29, 2024

Tlntin commented Feb 2, 2024

nv-guomingz commented Nov 15, 2024

white-wolf-tech commented Jan 25, 2024 •

edited

Loading

white-wolf-tech commented Jan 26, 2024 •

edited

Loading