Inference time for Mixtral-8x7B model is slowing down with every new request #1097
Closed
Labels: bug
System Info
GPUs: 2x A100 (PCIe)
Who can help?
@kaiyux
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Using the sources from the branch corresponding to the v0.7.1 tag.
Building the model:
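The exact build command isn't included above. As a rough sketch only (not the command from this setup), a 2-way tensor-parallel FP16 build with the v0.7.1 examples would look something like the following; the paths are placeholders and the Mixtral example in that tag may require additional MoE-specific options, so the example README in the v0.7.1 tree is the authority here:

```bash
# Rough sketch of a TP=2 FP16 engine build using the v0.7.1 LLaMA-family example.
# Paths are placeholders; Mixtral may need extra MoE options (see the example README).
python3 examples/llama/build.py \
    --model_dir ./Mixtral-8x7B-Instruct-v0.1 \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --world_size 2 \
    --tp_size 2 \
    --output_dir ./trt_engines/mixtral-8x7b/fp16/tp2
```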
Following the steps from here to pack it into the Triton Inference Server.
Sending requests like the following:
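The request payload isn't shown above either. Against the default "ensemble" model from the standard tensorrtllm_backend setup, a request would look roughly like this (the model name and field names are assumptions taken from that example config and may differ in a customized model repository):

```bash
# Illustrative request against the default "ensemble" model of tensorrtllm_backend;
# field names follow the standard example config and may differ in a customized repo.
curl -s -X POST http://localhost:8000/v2/models/ensemble/generate \
  -d '{"text_input": "What is machine learning?", "max_tokens": 300, "bad_words": "", "stop_words": ""}'
```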
Expected behavior
TensorRT-LLM gives better performance than TGI (~2.5 RPS for the quantized model with 300 output tokens).
actual behavior
The load test ramps users up to 5 RPS over the first three minutes, so the above corresponds to only ~0.15 RPS.
As you can see, the response time increases quickly. Moreover, it does not drop back after the requests are processed; it feels like they stay stuck inside the model until the Triton container is restarted. GPU power draw also stays high after that load, even some time after the load is released.
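One way to confirm that the GPUs never return to idle is to keep polling utilization and power draw after the load generator has stopped, for example:

```bash
# Poll GPU utilization, power draw and memory every 5 s to check whether
# the GPUs go back to idle once the load has stopped.
nvidia-smi --query-gpu=index,utilization.gpu,power.draw,memory.used --format=csv -l 5
```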
additional notes
Using the pre-built Triton server image:
nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3
Not sure whether this is a Triton problem or a TensorRT-LLM one, though. Any pointer on where to look would be much appreciated. Thanks!