Inference time for Mixtral-8x7B model is slowing down with every new request #1097
Closed
Labels: bug
System Info
GPUs: 2x A100 (PCIe)
Who can help?
@kaiyux
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Using the sources from the branch corresponding to the v0.7.1 tag.
Building the model:
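The exact build command isn't included above. As a rough sketch only (not the command from this setup), a 2-way tensor-parallel FP16 build with the v0.7.1 examples would look something like the following; the paths are placeholders and the Mixtral example in that tag may require additional MoE-specific options, so the example README in the v0.7.1 tree is the authority here:

```bash
# Rough sketch of a TP=2 FP16 engine build using the v0.7.1 LLaMA-family example.
# Paths are placeholders; Mixtral may need extra MoE options (see the example README).
python3 examples/llama/build.py \
    --model_dir ./Mixtral-8x7B-Instruct-v0.1 \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --world_size 2 \
    --tp_size 2 \
    --output_dir ./trt_engines/mixtral-8x7b/fp16/tp2
```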
Following the steps from here to pack it into the Triton Inference Server.
Sending requests like the following:
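The request payload isn't shown above either. Against the default "ensemble" model from the standard tensorrtllm_backend setup, a request would look roughly like this (the model name and field names are assumptions taken from that example config and may differ in a customized model repository):

```bash
# Illustrative request against the default "ensemble" model of tensorrtllm_backend;
# field names follow the standard example config and may differ in a customized repo.
curl -s -X POST http://localhost:8000/v2/models/ensemble/generate \
  -d '{"text_input": "What is machine learning?", "max_tokens": 300, "bad_words": "", "stop_words": ""}'
```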
Expected behavior
TensorRT-LLM gives better performance than TGI (~2.5 RPS for the quantized model with 300 output tokens).
actual behavior
The load test ramps users up to 5 RPS over the first three minutes, so the above corresponds to only ~0.15 RPS.
As you can see, the response time increases quickly. Moreover, it does not drop back after the requests are processed; it feels like they stay stuck inside the model until the Triton container is restarted. GPU power draw also stays high after that load, even some time after the load is released.
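One way to confirm that the GPUs never return to idle is to keep polling utilization and power draw after the load generator has stopped, for example:

```bash
# Poll GPU utilization, power draw and memory every 5 s to check whether
# the GPUs go back to idle once the load has stopped.
nvidia-smi --query-gpu=index,utilization.gpu,power.draw,memory.used --format=csv -l 5
```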
additional notes
Using the pre-built Triton server image:
nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3
Not sure whether this is a Triton problem or a TensorRT-LLM one, though. Any pointer on where to look would be much appreciated. Thanks!