Cannot reach the throughput value described in your performance doc for FP16 LLaMA 7B #1255
Comments
If you don't set the proper value to prevent allocating too much memory to the KV cache, the actual inference batch size might not reach the maximum possible batch size.
As mentioned in the doc https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/perf_best_practices.md, we set the batch size to 256 at the engine-building stage. How can we reproduce your performance? Could you share your command and triton-trtllm parameters?
I'm confused too.
Please share your steps to get the performance.
Hi @felixslu, do you still have any further issues or questions? If not, we'll close this soon.
Background:
The performance doc https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance.md states:
LLaMA 7B, FP16, batch size 256, input_len 128, output_len 128, A100: throughput of 5,353 tok/s/GPU.
Problem:
Under the same conditions, we only reach 435 tok/s/GPU.
Library:
NGC 24.01
TensorRT-LLM - v0.7.0 branch
our build command
```
python3 build.py --model_dir $llm_model_path \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --use_inflight_batching \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --paged_kv_cache \
    --use_fused_mlp \
    --max_batch_size 256 \
    --output_dir $llm_model_path/trt_engines/bs256/fp16/1-gpu/
```
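As a quick sanity check on the build (a sketch that assumes `build.py` in v0.7.0 writes a `config.json` with a `builder_config` section into the output directory; the path and keys below are illustrative, not taken from this issue), the build-time limits can be read back like this:

```python
import json
from pathlib import Path

# Hypothetical path; substitute the --output_dir used in the build command above.
engine_dir = Path("trt_engines/bs256/fp16/1-gpu")

# v0.7.0 builds typically place a config.json next to the engine; the exact key
# layout may differ between versions, so treat this as a sketch.
config = json.loads((engine_dir / "config.json").read_text())
builder_cfg = config.get("builder_config", config)
print("max_batch_size :", builder_cfg.get("max_batch_size"))
print("max_input_len  :", builder_cfg.get("max_input_len"))
print("max_output_len :", builder_cfg.get("max_output_len"))
```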
our running params
```
MAX_BATCH_SIZE=512

python3 tools/fill_template.py -i $proj_model_repo/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:$MAX_BATCH_SIZE,preprocessing_instance_count:1
python3 tools/fill_template.py -i $proj_model_repo/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:$MAX_BATCH_SIZE,postprocessing_instance_count:1
python3 tools/fill_template.py -i $proj_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:$MAX_BATCH_SIZE,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i $proj_model_repo/ensemble/config.pbtxt triton_max_batch_size:$MAX_BATCH_SIZE
python3 tools/fill_template.py -i $proj_model_repo/tensorrt_llm/config.pbtxt triton_max_batch_size:$MAX_BATCH_SIZE,decoupled_mode:False,max_beam_width:1,engine_dir:$MODEL_DIR,max_attention_window_size:8192,kv_cache_free_gpu_mem_fraction:0.9,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,enable_trt_overlap:False,max_queue_delay_microseconds:600
```
We have not set "--max_tokens_in_paged_kvcache"; we just use "kv_cache_free_gpu_mem_fraction:0.9" so the remaining memory is used for the KV cache.
our GPU MEM
TOTAL: 78.4 GB, KV cache MEM: 27.4 GB
"[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size"
So maybe the max_num_sequences param has been set to 256.
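As a back-of-the-envelope sketch (assuming the standard LLaMA-7B shape of 32 layers, 32 KV heads, head dim 128, FP16, and treating the 27.4 GB figure above as the usable KV-cache pool; these assumptions are mine, not from the issue), 256 concurrent sequences of 256 tokens would need roughly 32 GiB, so they cannot all be resident at once:

```python
# Back-of-the-envelope KV-cache sizing for LLaMA-7B in FP16.
# Assumed model shape: 32 layers, 32 KV heads, head_dim 128, 2 bytes/element.
num_layers, num_kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2

kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # K and V
tokens_per_seq = 128 + 128          # input_len + output_len in this benchmark
batch_size = 256
kv_cache_budget_gib = 27.4          # reported KV cache MEM above, taken as GiB

needed_gib = kv_bytes_per_token * tokens_per_seq * batch_size / 1024**3
max_seqs = int(kv_cache_budget_gib * 1024**3 // (kv_bytes_per_token * tokens_per_seq))
print(f"KV cache needed for bs=256 : {needed_gib:.1f} GiB")  # ~32 GiB
print(f"Sequences that fit in 27.4 : {max_seqs}")            # ~219
```

If that estimate is roughly right, the scheduler tops out around ~219 resident sequences rather than 256, though that alone would not explain a 10x throughput gap.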
our machine
A800
**Using only one GPU card**
our test script
```
cd triton_backend/tools/inflight_batcher_llm/ && python3 benchmark_core_model.py -i grpc --request-rate 4 --max-input-len 1024 --num-requests 50 --exclude-input-in-output token-norm-dist --input-mean 128 --input-stdev 5 --output-mean 128 --output-stdev 2
```
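One more arithmetic check on the command above (an observation on my part, assuming `--request-rate` is interpreted as requests per second and `--output-mean` is the mean number of generated tokens per request): the offered load itself caps throughput far below the published figure, regardless of the engine:

```python
# Rough upper bound implied by the client-side request rate (assumed semantics:
# --request-rate = requests/second, --output-mean = generated tokens/request).
request_rate = 4        # requests / second
output_mean = 128       # tokens / request

max_output_tok_per_s = request_rate * output_mean
print(f"Offered-load ceiling: {max_output_tok_per_s} output tok/s")  # 512 tok/s
```

With only 50 requests at ~4 req/s, the server probably never builds up anything close to 256 concurrent sequences, so the measurement may not be comparable to a saturated-throughput number.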
our result
Question:
Why is the throughput gap so huge? Could you share your test method or point out our problem?