
a compare with vllm 0.2.7 #965

Closed
2 of 4 tasks
white-wolf-tech opened this issue Jan 25, 2024 · 7 comments
Labels: bug (Something isn't working), stale

Comments


white-wolf-tech commented Jan 25, 2024

System Info

Ubuntu 22.04
one NVIDIA A800
driver: 470.141.10
CUDA: 12.3
TensorRT: 9.2.0.5

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Set concurrency = 16.

With vLLM 0.2.7, using the vLLM async engine:
min response time: 367 ms
max response time: 676 ms

With TensorRT-LLM and tensorrtllm_backend built from the main branch, and the model deployed with Triton Server:
min response time: 379 ms
max response time: 4418 ms

Why is TensorRT-LLM so much slower than vLLM 0.2.7? I expected TensorRT-LLM to be faster. Any suggestions?
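The benchmark client for the vLLM side isn't shown in the issue. As a minimal sketch only, assuming the 16 concurrent requests were timed directly against AsyncLLMEngine (model path, prompt, and sampling settings below are placeholders):

# Minimal sketch (assumption): time 16 concurrent requests against vLLM's
# AsyncLLMEngine. The real benchmark client is not shown in the issue, so the
# prompt and sampling settings here are placeholders.
import asyncio
import time

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="/root/chatglm3-6b/", trust_remote_code=True)
)

async def timed_request(request_id: int, prompt: str) -> float:
    params = SamplingParams(temperature=0.0, max_tokens=512)
    start = time.perf_counter()
    # generate() is an async generator; iterate until the request finishes
    async for _ in engine.generate(prompt, params, request_id=str(request_id)):
        pass
    return (time.perf_counter() - start) * 1000.0  # latency in ms

async def main() -> None:
    prompt = "Hello"  # placeholder prompt
    latencies = await asyncio.gather(*(timed_request(i, prompt) for i in range(16)))
    print(f"min {min(latencies):.0f} ms, max {max(latencies):.0f} ms")

asyncio.run(main())

If the actual test instead went through an HTTP wrapper (as the ab command later in the thread suggests), the same wrapper should sit in front of both backends for the comparison to be meaningful.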

Expected behavior

NULL

actual behavior

NULL

additional notes

Null

white-wolf-tech added the bug (Something isn't working) label on Jan 25, 2024
kaiyux (Member) commented Jan 26, 2024

@Coder-nlper Please share the commands you used to build the engines and run the benchmarks, so that we can check whether the comparison is apples-to-apples. Thanks.

white-wolf-tech (Author) commented Jan 26, 2024

Commands to build:

hf_model_path=/root/chatglm3-6b/
engine_dir=/root/trtllm/trtllmmodels/fp16/
CUDA_ID="1"

CUDA_VISIBLE_DEVICES=$CUDA_ID python3 build.py \
    --model_dir $hf_model_path \
    --log_level "info" \
    --output_dir $engine_dir/1-gpu \
    --world_size 1 \
    --tp_size 1 \
    --max_batch_size 50 \
    --max_input_len 2048 \
    --max_output_len 512 \
    --max_beam_width 1 \
    --enable_context_fmha \
    --use_inflight_batching \
    --paged_kv_cache \
    --remove_input_padding

Test using the ab command:

ip=localhost
port=60025
ab -n 64 -c 16 -p "post.txt" -T "application/json" -H "Content-Type: application/json" -H "Cache-Control: no-cache" -H "Postman-Token: d6xxs-sdf-sdf09d" "http://$ip:$port/model/generate"
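The Triton-side model configuration isn't shown above. For the apples-to-apples check requested earlier, the tensorrtllm_backend config.pbtxt settings matter as much as the engine build flags. A minimal illustrative excerpt, assuming the stock inflight_batcher_llm model layout (values are examples, not necessarily the ones used here):

parameters: {
  key: "gpt_model_type"
  value: { string_value: "inflight_fused_batching" }  # matches --use_inflight_batching at build time
}
parameters: {
  key: "batch_scheduler_policy"
  value: { string_value: "max_utilization" }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: { string_value: "0.9" }
}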

white-wolf-tech (Author) commented

Driver is 470.141.10. Is there any relationship with it?

kristiankielhofner commented

> Driver is 470.141.10. Is there any relationship with it?

I have to imagine that likely isn't ideal.

While it's supported per the NVIDIA Frameworks Support Matrix, you may want to try comparing to 535 for the current 23.10-based containers. Also be aware that main has moved to 23.12, which would be 545 (CUDA 12.3).

white-wolf-tech (Author) commented

> Driver is 470.141.10. Is there any relationship with it?
>
> I have to imagine that likely isn't ideal.
>
> While it's supported per the NVIDIA Frameworks Support Matrix, you may want to try comparing to 535 for the current 23.10-based containers. Also be aware that main has moved to 23.12, which would be 545 (CUDA 12.3).

But the latest is 535.x.x.


Tlntin (Contributor) commented Feb 2, 2024

> Driver is 470.141.10. Is there any relationship with it?
>
> I have to imagine that likely isn't ideal.
> While it's supported per the NVIDIA Frameworks Support Matrix, you may want to try comparing to 535 for the current 23.10-based containers. Also be aware that main has moved to 23.12, which would be 545 (CUDA 12.3).
>
> But the latest is 535.x.x.

The stable driver version is 535, the dev version is 545, and the beta version is 550 (see the reference link).

nv-guomingz (Collaborator) commented

Hi @white-wolf-tech, do you still have any further issues or questions? If not, we'll close this soon.
