
Cannot reach the throughput value described in your performance doc for FP16 LLaMA-7B #1255

Closed
felixslu opened this issue Mar 8, 2024 · 5 comments
Labels: stale, triaged (Issue has been triaged by maintainers), Triton Backend


felixslu commented Mar 8, 2024

Background:

The performance doc https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance.md states that LLaMA-7B, FP16, batch size 256, input_len 128, output_len 128 on an A100 reaches a throughput of 5,353 tok/s/GPU.


Problem:

Under the same conditions, we only reach 435 tok/s/GPU.

Library:

NGC 24.01
TensorRT-LLM - v0.7.0 branch

our build command

```bash
python3 build.py --model_dir $llm_model_path \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --use_inflight_batching \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --paged_kv_cache \
    --use_fused_mlp \
    --max_batch_size 256 \
    --output_dir $llm_model_path/trt_engines/bs256/fp16/1-gpu/
```

our running parameters

```bash
MAX_BATCH_SIZE=512

python3 tools/fill_template.py -i $proj_model_repo/preprocessing/config.pbtxt \
    tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:$MAX_BATCH_SIZE,preprocessing_instance_count:1

python3 tools/fill_template.py -i $proj_model_repo/postprocessing/config.pbtxt \
    tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:$MAX_BATCH_SIZE,postprocessing_instance_count:1

python3 tools/fill_template.py -i $proj_model_repo/tensorrt_llm_bls/config.pbtxt \
    triton_max_batch_size:$MAX_BATCH_SIZE,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False

python3 tools/fill_template.py -i $proj_model_repo/ensemble/config.pbtxt \
    triton_max_batch_size:$MAX_BATCH_SIZE

python3 tools/fill_template.py -i $proj_model_repo/tensorrt_llm/config.pbtxt \
    triton_max_batch_size:$MAX_BATCH_SIZE,decoupled_mode:False,max_beam_width:1,engine_dir:$MODEL_DIR,max_attention_window_size:8192,kv_cache_free_gpu_mem_fraction:0.9,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,enable_trt_overlap:False,max_queue_delay_microseconds:600
```

We have not set `--max_tokens_in_paged_kv_cache`; we just use `kv_cache_free_gpu_mem_fraction:0.9` so that the remaining memory is used for the KV cache.

our GPU MEM

Total: 78.4 GB, KV-cache MEM: 27.4 GB
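
For context, a rough back-of-the-envelope sketch of how many concurrent 128+128-token sequences a 27.4 GB KV cache can hold, assuming standard LLaMA-7B dimensions (32 layers, 32 KV heads, head dim 128, FP16 cache) and ignoring paged-block granularity and other overheads:

```python
# KV-cache sizing sketch; the model dimensions below are assumed, not taken
# from the engine, and paged-block overhead is ignored.
num_layers, num_kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2  # FP16 = 2 bytes

# K and V for one token, summed over all layers.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(bytes_per_token // 1024)             # 512 KiB per token

kv_cache_bytes = 27.4 * 1024**3            # the 27.4 GB reported above
tokens_in_cache = int(kv_cache_bytes // bytes_per_token)
print(tokens_in_cache)                     # ~56,000 token slots

# Each request with input_len 128 + output_len 128 needs up to 256 slots,
# so the cache holds roughly this many sequences at once:
print(tokens_in_cache // 256)              # ~219, a bit below max_batch_size = 256
```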


"[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size"
So maybe the `max_num_sequences` parameter has been set to 256.


our machine

A800
Using only one GPU card.

our test script

```bash
cd triton_backend/tools/inflight_batcher_llm/ && python3 benchmark_core_model.py \
    -i grpc --request-rate 4 --max-input-len 1024 --num-requests 50 --exclude-input-in-output \
    token-norm-dist --input-mean 128 --input-stdev 5 --output-mean 128 --output-stdev 2
```
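
One thing worth double-checking here is the client-side load these flags can offer at most; the sketch below assumes `--request-rate` means requests per second and that throughput is reported in output tokens per second, which should be verified against benchmark_core_model.py:

```python
# Rough offered-load estimate for the benchmark flags above.
# Assumption (verify in benchmark_core_model.py): --request-rate is requests/s.
request_rate = 4      # --request-rate
output_mean  = 128    # --output-mean (tokens generated per request)
num_requests = 50     # --num-requests

offered_tok_per_s = request_rate * output_mean
print(offered_tok_per_s)               # 512 tok/s offered by the client

total_output_tokens = num_requests * output_mean
print(total_output_tokens)             # 6,400 tokens over ~12.5 s of traffic
# With this little concurrent work, the server may never build a 256-sequence
# batch, so a measured ~435 tok/s could reflect the offered load rather than
# the engine's peak throughput.
```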

our result

(screenshot of benchmark output; measured throughput is about 435 tok/s/GPU)

Question:

Why is the throughput gap so huge? Could you share your test method, or point out what we are doing wrong?

byshiue (Collaborator) commented Mar 12, 2024

`max_tokens_in_paged_kv_cache` is defined here.

`max_tokens_in_paged_kv_cache` and `kv_cache_free_gpu_mem_fraction` control the KV cache memory usage together. More details are described here.

If you don't set a proper value to prevent allocating too much memory for the KV cache, the real inference batch size might not reach the possible maximum batch size.
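
To illustrate the interaction, a minimal sketch based on my reading of the linked docs (not the actual allocator code; the bytes-per-token figure is a rough LLaMA-7B FP16 estimate):

```python
# Sketch of how the two knobs jointly bound the paged KV-cache pool.
def kv_pool_bytes(free_gpu_mem_bytes,
                  kv_cache_free_gpu_mem_fraction=0.9,
                  max_tokens_in_paged_kv_cache=None,
                  bytes_per_token=512 * 1024):   # ~LLaMA-7B FP16 estimate
    fraction_cap = kv_cache_free_gpu_mem_fraction * free_gpu_mem_bytes
    if max_tokens_in_paged_kv_cache is None:
        return fraction_cap                      # only the fraction limits the pool
    token_cap = max_tokens_in_paged_kv_cache * bytes_per_token
    return min(fraction_cap, token_cap)          # the tighter limit wins

# Example: ~30 GB free after engine load, fraction 0.9, max tokens unset.
print(kv_pool_bytes(30 * 1024**3) / 1024**3)     # -> 27.0 GB for the KV cache
```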

felixslu (Author) commented Mar 13, 2024

> `max_tokens_in_paged_kv_cache` is defined here.
>
> `max_tokens_in_paged_kv_cache` and `kv_cache_free_gpu_mem_fraction` control the KV cache memory usage together. More details are described here.
>
> If you don't set a proper value to prevent allocating too much memory for the KV cache, the real inference batch size might not reach the possible maximum batch size.

As mentioned in the doc https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/perf_best_practices.md:

First, we set the batch size to 256 at engine-build time.
Then, we set kv_cache_free_gpu_mem_fraction to 0.95.
Finally, the total memory is 78.4 GB and the KV-cache memory is 27.4 GB on the A100, so we believe we allocate as much memory as possible to the engine and the KV cache. But the actual throughput is 435 tok/s/GPU, far below your published value of 5,353 tok/s/GPU.

How can we reproduce your performance? Could you share your command and Triton/TRT-LLM parameters?

> Unless users clearly know the maximum number of tokens in the KV cache needed by the model, it is recommended to leave max_tokens_in_paged_kv_cache unset. For kv_cache_free_gpu_mem_fraction, if no other programs are executed on the same GPU, it is recommended to test with as high a value as 0.95 to target high throughput. Note that the kv_cache_free_gpu_mem_fraction parameter cannot be set to 1.0 because some amount of memory has to be reserved for inputs and outputs.

@jaywongs

I'm confused too

byshiue (Collaborator) commented Apr 24, 2024

Please share the steps you used to measure the performance.

nv-guomingz (Collaborator) commented:

Hi @felixslu, do you still have further issues or questions? If not, we'll close this soon.
