Cannot reach the throughput value described in your performance doc for FP16 LLaMA 7B #1255
Comments
If you don't set the proper value to prevent allocating too much memory to the KV cache, the actual inference batch size might not reach the maximum possible batch size.
As mentioned in the doc https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/perf_best_practices.md, we set the batch size to 256 at the engine-building stage. How can we reproduce your performance? Could you share your command and triton-trtllm parameters?
I'm confused too.
Please share your steps to get the performance.
Hi @felixslu, do you still have any further issues or questions? If not, we'll close this soon.
Background:
The performance doc https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance.md states:
LLaMA 7B, FP16, batch size 256, input_len 128, output_len 128, A100: throughput of 5,353 tok/s/GPU.
Problem:
Under the same conditions, we only reach 435 tok/s/GPU.
Library:
NGC 24.01
TensorRT-LLM - v0.7.0 branch
our build command
```
python3 build.py --model_dir $llm_model_path \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --use_inflight_batching \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --paged_kv_cache \
    --use_fused_mlp \
    --max_batch_size 256 \
    --output_dir $llm_model_path/trt_engines/bs256/fp16/1-gpu/
```
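As a quick sanity check on the build (a sketch that assumes `build.py` in v0.7.0 writes a `config.json` with a `builder_config` section into the output directory; the path and keys below are illustrative, not taken from this issue), the build-time limits can be read back like this:

```python
import json
from pathlib import Path

# Hypothetical path; substitute the --output_dir used in the build command above.
engine_dir = Path("trt_engines/bs256/fp16/1-gpu")

# v0.7.0 builds typically place a config.json next to the engine; the exact key
# layout may differ between versions, so treat this as a sketch.
config = json.loads((engine_dir / "config.json").read_text())
builder_cfg = config.get("builder_config", config)
print("max_batch_size :", builder_cfg.get("max_batch_size"))
print("max_input_len  :", builder_cfg.get("max_input_len"))
print("max_output_len :", builder_cfg.get("max_output_len"))
```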
our running params
```
MAX_BATCH_SIZE=512

python3 tools/fill_template.py -i $proj_model_repo/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:$MAX_BATCH_SIZE,preprocessing_instance_count:1
python3 tools/fill_template.py -i $proj_model_repo/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:$MAX_BATCH_SIZE,postprocessing_instance_count:1
python3 tools/fill_template.py -i $proj_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:$MAX_BATCH_SIZE,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i $proj_model_repo/ensemble/config.pbtxt triton_max_batch_size:$MAX_BATCH_SIZE
python3 tools/fill_template.py -i $proj_model_repo/tensorrt_llm/config.pbtxt triton_max_batch_size:$MAX_BATCH_SIZE,decoupled_mode:False,max_beam_width:1,engine_dir:$MODEL_DIR,max_attention_window_size:8192,kv_cache_free_gpu_mem_fraction:0.9,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,enable_trt_overlap:False,max_queue_delay_microseconds:600
```
We have not set "--max_tokens_in_paged_kvcache"; we just use "kv_cache_free_gpu_mem_fraction:0.9" so the remaining memory is used for the KV cache.
our GPU MEM
TOTAL: 78.4 GB, KV cache MEM: 27.4 GB
"[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size"
So maybe the max_num_sequences param has been set to 256.
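As a back-of-the-envelope sketch (assuming the standard LLaMA-7B shape of 32 layers, 32 KV heads, head dim 128, FP16, and treating the 27.4 GB figure above as the usable KV-cache pool; these assumptions are mine, not from the issue), 256 concurrent sequences of 256 tokens would need roughly 32 GiB, so they cannot all be resident at once:

```python
# Back-of-the-envelope KV-cache sizing for LLaMA-7B in FP16.
# Assumed model shape: 32 layers, 32 KV heads, head_dim 128, 2 bytes/element.
num_layers, num_kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2

kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # K and V
tokens_per_seq = 128 + 128          # input_len + output_len in this benchmark
batch_size = 256
kv_cache_budget_gib = 27.4          # reported KV cache MEM above, taken as GiB

needed_gib = kv_bytes_per_token * tokens_per_seq * batch_size / 1024**3
max_seqs = int(kv_cache_budget_gib * 1024**3 // (kv_bytes_per_token * tokens_per_seq))
print(f"KV cache needed for bs=256 : {needed_gib:.1f} GiB")  # ~32 GiB
print(f"Sequences that fit in 27.4 : {max_seqs}")            # ~219
```

If that estimate is roughly right, the scheduler tops out around ~219 resident sequences rather than 256, though that alone would not explain a 10x throughput gap.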
our machine
A800
**Using only one GPU card**
our test script
```
cd triton_backend/tools/inflight_batcher_llm/ && python3 benchmark_core_model.py -i grpc --request-rate 4 --max-input-len 1024 --num-requests 50 --exclude-input-in-output token-norm-dist --input-mean 128 --input-stdev 5 --output-mean 128 --output-stdev 2
```
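One more arithmetic check on the command above (an observation on my part, assuming `--request-rate` is interpreted as requests per second and `--output-mean` is the mean number of generated tokens per request): the offered load itself caps throughput far below the published figure, regardless of the engine:

```python
# Rough upper bound implied by the client-side request rate (assumed semantics:
# --request-rate = requests/second, --output-mean = generated tokens/request).
request_rate = 4        # requests / second
output_mean = 128       # tokens / request

max_output_tok_per_s = request_rate * output_mean
print(f"Offered-load ceiling: {max_output_tok_per_s} output tok/s")  # 512 tok/s
```

With only 50 requests at ~4 req/s, the server probably never builds up anything close to 256 concurrent sequences, so the measurement may not be comparable to a saturated-throughput number.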
our result
Question:
Why is the throughput gap so huge? Could you share your test method or point out our problem?