BUG in W4A8_awq-kv-FP8, W-fp8-A-fp8-kv-fp8, in the 0.17.0.post1 #2810

Open
white-wolf-tech opened this issue Feb 21, 2025 · 5 comments

white-wolf-tech commented Feb 21, 2025

System Info

GPU: L20
TensorRT-LLM: v0.17.0.post1
ModelOpt: 0.23.0

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I used the following code to perform model quantization, and everything seemed to be fine.

    python quantize.py --model_dir  $hf_input_dir \
                       --calib_dataset $calib_dataset_path \
                       --dtype 'auto' \
                       --qformat "w4a8_awq" \
                       --awq_block_size 128 \
                       --batch_size 32 \
                       --output_dir $input_temp_dir \
                       --kv_cache_dtype "fp8"

After completing calibration on the test set, I got the following warning when converting/building the model with TensorRT-LLM, and the output of the final compiled engine is incorrect.
The warning says these tensors were provided but not required, yet they were produced during quantization.
It looks like a mismatch between the TensorRT-LLM conversion code and the ModelOpt implementation is causing this.
Is there any solution to this problem?
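For reference, the build step being referred to is roughly the following (a sketch only: it assumes trtllm-build with flags similar to the FP8 build command shared later in this thread; the exact flags used for this W4A8 run were not posted):

    trtllm-build --checkpoint_dir $input_temp_dir \
            --output_dir $output_dir \
            --gemm_plugin auto \
            --kv_cache_type paged \
            --max_batch_size 512 \
            --max_num_tokens 16384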

warning info:

[TRT-LLM] [W] Provided but not required tensors: {'transformer.layers.22.mlp.fc.activation_scaling_factor', 'transformer.layers.19.mlp.fc.activation_scaling_factor', 'transformer.layers.14.mlp.fc.activation_scaling_factor', 'transformer.layers.25.mlp.fc.activation_scaling_factor', 'transformer.layers.25.mlp.gate.activation_scaling_factor', 'transformer.layers.5.mlp.fc.activation_scaling_factor', 'transformer.layers.5.attention.qkv.activation_scaling_factor', 'transformer.layers.8.mlp.gate.activation_scaling_factor', 'transformer.layers.32.mlp.fc.activation_scaling_factor', 'transformer.layers.23.mlp.fc.activation_scaling_factor', 'transformer.layers.31.attention.qkv.activation_scaling_factor', 'transformer.layers.33.mlp.gate.activation_scaling_factor', 'transformer.layers.2.mlp.gate.activation_scaling_factor', 'transformer.layers.14.attention.qkv.activation_scaling_factor', 'transformer.layers.18.mlp.gate.activation_scaling_factor', 'transformer.layers.19.attention.qkv.activation_scaling_factor', 'transformer.layers.20.attention.qkv.activation_scaling_factor', 'transformer.layers.24.attention.qkv.activation_scaling_factor', 'transformer.layers.7.mlp.fc.activation_scaling_factor', 'transformer.layers.17.attention.qkv.activation_scaling_factor', 'transformer.layers.28.mlp.gate.activation_scaling_factor', 'transformer.layers.6.mlp.fc.activation_scaling_factor', 'transformer.layers.15.attention.qkv.activation_scaling_factor', 'transformer.layers.10.attention.qkv.activation_scaling_factor', 'transformer.layers.21.mlp.gate.activation_scaling_factor', 'transformer.layers.17.mlp.fc.activation_scaling_factor', 'transformer.layers.23.attention.qkv.activation_scaling_factor', 'transformer.layers.27.mlp.fc.activation_scaling_factor', 'transformer.layers.28.attention.qkv.activation_scaling_factor', 'transformer.layers.26.attention.qkv.activation_scaling_factor', 'transformer.layers.13.attention.qkv.activation_scaling_factor', 'transformer.layers.5.mlp.gate.activation_scaling_factor', 'transformer.layers.11.mlp.fc.activation_scaling_factor', 'transformer.layers.13.mlp.fc.activation_scaling_factor', 'transformer.layers.8.attention.qkv.activation_scaling_factor', 'transformer.layers.9.attention.qkv.activation_scaling_factor', 'transformer.layers.26.mlp.fc.activation_scaling_factor', 'transformer.layers.0.mlp.fc.activation_scaling_factor', 'transformer.layers.18.mlp.fc.activation_scaling_factor', 'transformer.layers.24.mlp.gate.activation_scaling_factor', 'transformer.layers.34.mlp.fc.activation_scaling_factor', 'transformer.layers.12.attention.qkv.activation_scaling_factor', 'transformer.layers.4.mlp.gate.activation_scaling_factor', 'transformer.layers.2.mlp.fc.activation_scaling_factor', 'transformer.layers.4.attention.qkv.activation_scaling_factor', 'transformer.layers.25.attention.qkv.activation_scaling_factor', 'transformer.layers.0.mlp.gate.activation_scaling_factor', 'transformer.layers.2.attention.qkv.activation_scaling_factor', 'transformer.layers.4.mlp.fc.activation_scaling_factor', 'transformer.layers.16.attention.qkv.activation_scaling_factor', 'transformer.layers.16.mlp.gate.activation_scaling_factor', 'transformer.layers.32.attention.qkv.activation_scaling_factor', 'transformer.layers.33.mlp.fc.activation_scaling_factor', 'transformer.layers.24.mlp.fc.activation_scaling_factor', 'transformer.layers.1.mlp.fc.activation_scaling_factor', 'transformer.layers.17.mlp.gate.activation_scaling_factor', 'transformer.layers.15.mlp.gate.activation_scaling_factor', 
'transformer.layers.34.mlp.gate.activation_scaling_factor', 'transformer.layers.35.attention.qkv.activation_scaling_factor', 'transformer.layers.33.attention.qkv.activation_scaling_factor', 'transformer.layers.12.mlp.gate.activation_scaling_factor', 'transformer.layers.29.mlp.fc.activation_scaling_factor', 'transformer.layers.32.mlp.gate.activation_scaling_factor', 'transformer.layers.21.mlp.fc.activation_scaling_factor', 'transformer.layers.20.mlp.fc.activation_scaling_factor', 'transformer.layers.15.mlp.fc.activation_scaling_factor', 'transformer.layers.9.mlp.gate.activation_scaling_factor', 'transformer.layers.0.attention.qkv.activation_scaling_factor', 'transformer.layers.19.mlp.gate.activation_scaling_factor', 'transformer.layers.3.attention.qkv.activation_scaling_factor', 'transformer.layers.3.mlp.fc.activation_scaling_factor', 'transformer.layers.1.attention.qkv.activation_scaling_factor', 'transformer.layers.26.mlp.gate.activation_scaling_factor', 'transformer.layers.11.attention.qkv.activation_scaling_factor', 'transformer.layers.34.attention.qkv.activation_scaling_factor', 'transformer.layers.20.mlp.gate.activation_scaling_factor', 'transformer.layers.1.mlp.gate.activation_scaling_factor', 'transformer.layers.9.mlp.fc.activation_scaling_factor', 'transformer.layers.11.mlp.gate.activation_scaling_factor', 'transformer.layers.30.mlp.fc.activation_scaling_factor', 'transformer.layers.27.mlp.gate.activation_scaling_factor', 'transformer.layers.18.attention.qkv.activation_scaling_factor', 'transformer.layers.35.mlp.gate.activation_scaling_factor', 'transformer.layers.10.mlp.gate.activation_scaling_factor', 'transformer.layers.31.mlp.fc.activation_scaling_factor', 'transformer.layers.13.mlp.gate.activation_scaling_factor', 'transformer.layers.30.attention.qkv.activation_scaling_factor', 'transformer.layers.31.mlp.gate.activation_scaling_factor', 'transformer.layers.22.mlp.gate.activation_scaling_factor', 'transformer.layers.16.mlp.fc.activation_scaling_factor', 'transformer.layers.29.attention.qkv.activation_scaling_factor', 'transformer.layers.30.mlp.gate.activation_scaling_factor', 'transformer.layers.7.mlp.gate.activation_scaling_factor', 'transformer.layers.23.mlp.gate.activation_scaling_factor', 'transformer.layers.6.mlp.gate.activation_scaling_factor', 'transformer.layers.6.attention.qkv.activation_scaling_factor', 'transformer.layers.22.attention.qkv.activation_scaling_factor', 'transformer.layers.7.attention.qkv.activation_scaling_factor', 'transformer.layers.28.mlp.fc.activation_scaling_factor', 'transformer.layers.3.mlp.gate.activation_scaling_factor', 'transformer.layers.8.mlp.fc.activation_scaling_factor', 'transformer.layers.10.mlp.fc.activation_scaling_factor', 'transformer.layers.29.mlp.gate.activation_scaling_factor', 'transformer.layers.27.attention.qkv.activation_scaling_factor', 'transformer.layers.14.mlp.gate.activation_scaling_factor', 'transformer.layers.12.mlp.fc.activation_scaling_factor', 'transformer.layers.35.mlp.fc.activation_scaling_factor', 'transformer.layers.21.attention.qkv.activation_scaling_factor'}

Expected behavior

model's output is normal

actual behavior

model's output is wrong!

additional notes

NULL

white-wolf-tech added the bug label Feb 21, 2025
Barry-Delaney (Collaborator) commented

Hi @white-wolf-tech, the warning you posted is caused by a recent optimization for W4A8_AWQ. The optimization changes how we consume activation_scaling_factor (the values are now used directly through a pointer), which leaves this warning during the engine-building process, but it is harmless and won't affect correctness.
Could you please:

  1. Confirm whether the warning message came from calling trtllm-build; if so, the warning is fine.
  2. Provide more info about your conversion and build commands; details such as the model family and the converted checkpoint config would be appreciated (see the sketch below for one way to dump the checkpoint config).
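For the checkpoint config, something like this can be used (a sketch; it assumes the ModelOpt export wrote a config.json with a quantization section in $input_temp_dir):

    python3 -c "import json; c = json.load(open('$input_temp_dir/config.json')); print(json.dumps(c.get('quantization', {}), indent=2))"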

Barry-Delaney self-assigned this Feb 21, 2025
Barry-Delaney added the triaged label Feb 21, 2025
white-wolf-tech (Author) commented

Thank you for your reply. I ran the same experiment with FP8 quantization.
There was no such warning with FP8, but the output still had issues.
The quantization scheme I'm using now is W(FP8) A(FP8) KV(FP8).
When I loaded the engine built with trtllm-build through the TensorRT-LLM LLM API (which is similar to vLLM's API), the model output was normal.
When I served the same engine with Triton Server and the TensorRT-LLM backend, the output was garbled and incorrect, so it should have nothing to do with the quantization process.
The main issue probably lies in the Triton Server + TensorRT-LLM configuration.

white-wolf-tech (Author) commented

The model I'm using is Qwen2.5-3B. When I used the code in the "examples" folder and directly compiled the weights in FP16 or SmoothQuant (W8A8, INT8) format, the model ran normally.

white-wolf-tech changed the title from "BUG in W4A8_awq" to "BUG in W4A8_awq-kv-FP8, W-fp8-A-fp8-kv-fp8, in the 0.17.0.post1" Feb 26, 2025
white-wolf-tech (Author) commented

The following is the build script:

    python quantize.py --model_dir  $hf_input_dir \
                       --calib_dataset $calib_dataset_path \
                       --dtype 'auto' \
                       --qformat "fp8" \
                       --awq_block_size 128 \
                       --batch_size 32 \
                       --output_dir $input_temp_dir \
                       --kv_cache_dtype "fp8"
    
    trtllm-build --checkpoint_dir $input_temp_dir \
            --output_dir $output_dir \
            --max_batch_size 512 \
            --max_input_len 1024 \
            --max_seq_len 2048 \
            --max_beam_width 1 \
            --max_num_tokens 16384 \
            --gemm_plugin auto \
            --kv_cache_type paged \
            --remove_input_padding enable \
            --context_fmha enable \
            --use_paged_context_fmha enable \
            --use_fp8_context_fmha enable \
            --tokens_per_block 32 \
            --use_fused_mlp enable \
            --multiple_profiles enable \
            --reduce_fusion enable \
            --user_buffer enable \
            --workers 4 \
            --log_level  info | tee build.log

The build log is:

[TensorRT-LLM] TensorRT-LLM version: 0.17.0.post1
[02/26/2025-11:25:19] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[02/26/2025-11:25:19] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[02/26/2025-11:25:19] [TRT-LLM] [I] Set gemm_plugin to auto.
[02/26/2025-11:25:19] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[02/26/2025-11:25:19] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[02/26/2025-11:25:19] [TRT-LLM] [I] Set nccl_plugin to auto.
[02/26/2025-11:25:19] [TRT-LLM] [I] Set lora_plugin to None.
[02/26/2025-11:25:19] [TRT-LLM] [I] Set moe_plugin to auto.
[02/26/2025-11:25:19] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[02/26/2025-11:25:19] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[02/26/2025-11:25:19] [TRT-LLM] [I] Set low_latency_gemm_swiglu_plugin to None.
[02/26/2025-11:25:19] [TRT-LLM] [I] Set context_fmha to True.
[02/26/2025-11:25:19] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[02/26/2025-11:25:19] [TRT-LLM] [I] Set remove_input_padding to True.
[02/26/2025-11:25:19] [TRT-LLM] [I] Set reduce_fusion to True.
[02/26/2025-11:25:19] [TRT-LLM] [I] Set user_buffer to True.
[02/26/2025-11:25:19] [TRT-LLM] [I] Set tokens_per_block to 32.
[02/26/2025-11:25:19] [TRT-LLM] [I] Set use_paged_context_fmha to True.
[02/26/2025-11:25:19] [TRT-LLM] [I] Set use_fp8_context_fmha to True.
[02/26/2025-11:25:19] [TRT-LLM] [I] Set multiple_profiles to True.
[02/26/2025-11:25:19] [TRT-LLM] [I] Set paged_state to True.
[02/26/2025-11:25:19] [TRT-LLM] [I] Set streamingllm to False.
[02/26/2025-11:25:19] [TRT-LLM] [I] Set use_fused_mlp to True.
[02/26/2025-11:25:19] [TRT-LLM] [I] Set pp_reduce_scatter to False.
[02/26/2025-11:25:19] [TRT-LLM] [W] Implicitly setting QWenConfig.producer = {'name': 'modelopt', 'version': '0.23.2'}
[02/26/2025-11:25:19] [TRT-LLM] [W] Implicitly setting QWenConfig.share_embedding_table = False
[02/26/2025-11:25:19] [TRT-LLM] [W] Implicitly setting QWenConfig.residual_mlp = False
[02/26/2025-11:25:19] [TRT-LLM] [W] Implicitly setting QWenConfig.bias = False
[02/26/2025-11:25:19] [TRT-LLM] [W] Implicitly setting QWenConfig.rotary_pct = 1.0
[02/26/2025-11:25:19] [TRT-LLM] [W] Implicitly setting QWenConfig.rank = 0
[02/26/2025-11:25:19] [TRT-LLM] [W] Implicitly setting QWenConfig.decoder = qwen
[02/26/2025-11:25:19] [TRT-LLM] [W] Implicitly setting QWenConfig.rmsnorm = True
[02/26/2025-11:25:19] [TRT-LLM] [W] Implicitly setting QWenConfig.lm_head_bias = False
[02/26/2025-11:25:19] [TRT-LLM] [W] Implicitly setting QWenConfig.seq_length = 8192
[02/26/2025-11:25:19] [TRT-LLM] [W] Implicitly setting QWenConfig.qwen_type = qwen2
[02/26/2025-11:25:19] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_intermediate_size = 0
[02/26/2025-11:25:19] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_shared_expert_intermediate_size = 0
[02/26/2025-11:25:19] [TRT-LLM] [W] Implicitly setting QWenConfig.tie_word_embeddings = True
[02/26/2025-11:25:19] [TRT-LLM] [W] Implicitly setting QWenConfig.model_type = qwen
[02/26/2025-11:25:19] [TRT-LLM] [I] Compute capability: (8, 9)
[02/26/2025-11:25:19] [TRT-LLM] [I] SM count: 92
[02/26/2025-11:25:19] [TRT-LLM] [I] SM clock: 2520 MHz
[02/26/2025-11:25:19] [TRT-LLM] [I] int4 TFLOPS: 474
[02/26/2025-11:25:19] [TRT-LLM] [I] int8 TFLOPS: 237
[02/26/2025-11:25:19] [TRT-LLM] [I] fp8 TFLOPS: 237
[02/26/2025-11:25:19] [TRT-LLM] [I] float16 TFLOPS: 118
[02/26/2025-11:25:19] [TRT-LLM] [I] bfloat16 TFLOPS: 118
[02/26/2025-11:25:19] [TRT-LLM] [I] float32 TFLOPS: 59
[02/26/2025-11:25:19] [TRT-LLM] [I] Total Memory: 44 GiB
[02/26/2025-11:25:19] [TRT-LLM] [I] Memory clock: 9001 MHz
[02/26/2025-11:25:19] [TRT-LLM] [I] Memory bus width: 384
[02/26/2025-11:25:19] [TRT-LLM] [I] Memory bandwidth: 864 GB/s
[02/26/2025-11:25:19] [TRT-LLM] [I] PCIe speed: 16000 Mbps
[02/26/2025-11:25:19] [TRT-LLM] [I] PCIe link width: 16
[02/26/2025-11:25:19] [TRT-LLM] [I] PCIe bandwidth: 32 GB/s
[02/26/2025-11:25:19] [TRT-LLM] [I] Set dtype to bfloat16.
[02/26/2025-11:25:19] [TRT-LLM] [I] Set paged_kv_cache to True.
[02/26/2025-11:25:19] [TRT-LLM] [W] Overriding paged_state to False
[02/26/2025-11:25:19] [TRT-LLM] [I] Set paged_state to False.
[02/26/2025-11:25:19] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[02/26/2025-11:25:19] [TRT-LLM] [W] Overriding reduce_fusion to False
[02/26/2025-11:25:19] [TRT-LLM] [I] Set reduce_fusion to False.
[02/26/2025-11:25:19] [TRT-LLM] [W] Overriding user_buffer to False
[02/26/2025-11:25:19] [TRT-LLM] [I] Set user_buffer to False.
[02/26/2025-11:25:22] [TRT] [I] [MemUsageChange] Init CUDA: CPU -14, GPU +0, now: CPU 1814, GPU 19389 (MiB)
[02/26/2025-11:25:25] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +2773, GPU +446, now: CPU 4768, GPU 19835 (MiB)
[02/26/2025-11:25:25] [TRT-LLM] [I] Set nccl_plugin to None.
[02/26/2025-11:25:25] [TRT-LLM] [I] Total time of constructing network from module object 5.6981000900268555 seconds
[02/26/2025-11:25:25] [TRT-LLM] [I] Total optimization profiles added: 6
[02/26/2025-11:25:25] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[02/26/2025-11:25:25] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[02/26/2025-11:25:25] [TRT] [W] Unused Input: position_ids
[02/26/2025-11:25:25] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[02/26/2025-11:25:25] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[02/26/2025-11:25:25] [TRT] [I] Compiler backend is used during engine build.
[02/26/2025-11:25:28] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[02/26/2025-11:25:28] [TRT] [I] Detected 17 inputs and 1 output network tensors.
[02/26/2025-11:25:30] [TRT] [I] Total Host Persistent Memory: 75904 bytes
[02/26/2025-11:25:30] [TRT] [I] Total Device Persistent Memory: 0 bytes
[02/26/2025-11:25:30] [TRT] [I] Max Scratch Memory: 51412992 bytes
[02/26/2025-11:25:30] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 198 steps to complete.
[02/26/2025-11:25:30] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 8.65027ms to assign 21 blocks to 198 nodes requiring 191832064 bytes.
[02/26/2025-11:25:30] [TRT] [I] Total Activation Memory: 191830016 bytes
[02/26/2025-11:25:34] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[02/26/2025-11:25:34] [TRT] [I] Detected 17 inputs and 1 output network tensors.
[02/26/2025-11:25:34] [TRT] [I] Total Host Persistent Memory: 75904 bytes
[02/26/2025-11:25:34] [TRT] [I] Total Device Persistent Memory: 0 bytes
[02/26/2025-11:25:34] [TRT] [I] Max Scratch Memory: 51937280 bytes
[02/26/2025-11:25:34] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 198 steps to complete.
[02/26/2025-11:25:34] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 8.69935ms to assign 21 blocks to 198 nodes requiring 192421888 bytes.
[02/26/2025-11:25:34] [TRT] [I] Total Activation Memory: 192421888 bytes
[02/26/2025-11:25:38] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[02/26/2025-11:25:38] [TRT] [I] Detected 17 inputs and 1 output network tensors.
[02/26/2025-11:25:39] [TRT] [I] Total Host Persistent Memory: 75904 bytes
[02/26/2025-11:25:39] [TRT] [I] Total Device Persistent Memory: 0 bytes
[02/26/2025-11:25:39] [TRT] [I] Max Scratch Memory: 52985856 bytes
[02/26/2025-11:25:39] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 198 steps to complete.
[02/26/2025-11:25:39] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 8.74769ms to assign 21 blocks to 198 nodes requiring 193601536 bytes.
[02/26/2025-11:25:39] [TRT] [I] Total Activation Memory: 193601536 bytes
[02/26/2025-11:25:42] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[02/26/2025-11:25:42] [TRT] [I] Detected 17 inputs and 1 output network tensors.
[02/26/2025-11:25:43] [TRT] [I] Total Host Persistent Memory: 75904 bytes
[02/26/2025-11:25:43] [TRT] [I] Total Device Persistent Memory: 0 bytes
[02/26/2025-11:25:43] [TRT] [I] Max Scratch Memory: 55083008 bytes
[02/26/2025-11:25:43] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 198 steps to complete.
[02/26/2025-11:25:43] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 8.21975ms to assign 21 blocks to 198 nodes requiring 194388480 bytes.
[02/26/2025-11:25:43] [TRT] [I] Total Activation Memory: 194388480 bytes
[02/26/2025-11:25:47] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[02/26/2025-11:25:47] [TRT] [I] Detected 17 inputs and 1 output network tensors.
[02/26/2025-11:25:48] [TRT] [I] Total Host Persistent Memory: 75904 bytes
[02/26/2025-11:25:48] [TRT] [I] Total Device Persistent Memory: 0 bytes
[02/26/2025-11:25:48] [TRT] [I] Max Scratch Memory: 66846720 bytes
[02/26/2025-11:25:48] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 198 steps to complete.
[02/26/2025-11:25:48] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 8.28366ms to assign 21 blocks to 198 nodes requiring 199631360 bytes.
[02/26/2025-11:25:48] [TRT] [I] Total Activation Memory: 199631360 bytes
[02/26/2025-11:25:55] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[02/26/2025-11:25:55] [TRT] [I] Detected 17 inputs and 1 output network tensors.
[02/26/2025-11:25:55] [TRT] [I] Total Host Persistent Memory: 75904 bytes
[02/26/2025-11:25:55] [TRT] [I] Total Device Persistent Memory: 0 bytes
[02/26/2025-11:25:55] [TRT] [I] Max Scratch Memory: 1069547520 bytes
[02/26/2025-11:25:55] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 198 steps to complete.
[02/26/2025-11:25:55] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 7.36554ms to assign 21 blocks to 198 nodes requiring 1321214464 bytes.
[02/26/2025-11:25:55] [TRT] [I] Total Activation Memory: 1321214464 bytes
[02/26/2025-11:25:56] [TRT] [I] Total Weights Memory: 4036497412 bytes
[02/26/2025-11:25:56] [TRT] [I] Compiler backend is used during engine execution.
[02/26/2025-11:25:56] [TRT] [I] Engine generation completed in 30.5805 seconds.
[02/26/2025-11:25:56] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 3849 MiB
[02/26/2025-11:25:57] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:32
[02/26/2025-11:25:57] [TRT] [I] Serialized 5197 bytes of code generator cache.
[02/26/2025-11:25:57] [TRT] [I] Serialized 1088732 bytes of compilation cache.
[02/26/2025-11:25:57] [TRT] [I] Serialized 8 timing cache entries
[02/26/2025-11:25:57] [TRT-LLM] [I] Timing cache serialized to model.cache
[02/26/2025-11:25:57] [TRT-LLM] [I] Build phase peak memory: 15360.59 MB, children: 30.48 MB

With FP8 quantization, there is no warning like the one above.

I used the LLM API for a preliminary test:

from transformers import AutoTokenizer
from tqdm import tqdm
import tensorrt_llm
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import (KvCacheConfig,
                                 LookaheadDecodingConfig,
                                 MedusaDecodingConfig,
                                 QuantAlgo,
                                 QuantConfig,
                                 SchedulerConfig
                                 )

def main():
    hf_model_dir = "/data/models/Qwen2.5-3B"
    trt_engine_path = "/data/v0.17.0/code/models/qwen3b-trt/1gpu-w8a8kv8/"
    
    prompt = "Please help me calculate the factorial of 29"

    tokenizer = AutoTokenizer.from_pretrained(hf_model_dir, trust_remote_code=True)
    sampling_params = SamplingParams(temperature=0.0, max_tokens=128, end_id=151643)
    
    kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.9, enable_block_reuse=True,)
    
    llm = LLM(model=trt_engine_path,
              tokenizer=tokenizer,
              kv_cache_config=kv_cache_config,
              enable_chunked_prefill=True,
            )

    outputs = llm.generate([prompt], sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

if __name__ == '__main__':
    main()

I built a container from the image nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3 and ran the test with the above code inside it; the result was normal.
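The container was started roughly like this (a sketch only; the mount path and flags are assumptions, not the exact command used):

    docker run --rm -it --gpus all --net host \
        -v /data:/data \
        nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3 bash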

Then, when I deployed the compiled engine with Triton Server and the TensorRT-LLM backend, the output was incorrect. The template-filling script is as follows:

ENGINE_DIR=/code/models/qwen3b-trt/1gpu-w8a8kv8
TOKENIZER_DIR=/code/models/qwen_2_5_tokenizer
MODEL_FOLDER=code/triton_deploy/inflight_batcher_llm
TRITON_MAX_BATCH_SIZE=512
INSTANCE_COUNT=64
BLS_INSTANCE_COUNT=256
MAX_QUEUE_DELAY_MS=0
MAX_QUEUE_SIZE=0
FILL_SCRIPT=./fill_template.py
DECOUPLED_MODE=false

python3 ${FILL_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt \
triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},\
logits_datatype:TYPE_FP32

python3 ${FILL_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt \
tokenizer_dir:${TOKENIZER_DIR},\
triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},\
preprocessing_instance_count:${INSTANCE_COUNT}

python3 ${FILL_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt \
triton_backend:tensorrtllm,\
triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},\
decoupled_mode:${DECOUPLED_MODE},\
max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},\
max_queue_size:${MAX_QUEUE_SIZE},\
encoder_input_features_data_type:TYPE_FP16,\
logits_datatype:TYPE_FP32,\
engine_dir:${ENGINE_DIR},\
batching_strategy:inflight_fused_batching,\
batch_scheduler_policy:guaranteed_no_evict,\
kv_cache_free_gpu_mem_fraction:0.9,\
enable_kv_cache_reuse:true,\
enable_chunked_context:true,\
enable_context_fmha_fp32_acc:true,\
multi_block_mode:true,\
cuda_graph_mode:true

python3 ${FILL_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt \
tokenizer_dir:${TOKENIZER_DIR},\
triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},\
postprocessing_instance_count:${INSTANCE_COUNT},\
max_queue_size:${MAX_QUEUE_SIZE}

python3 ${FILL_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt \
triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},\
decoupled_mode:${DECOUPLED_MODE},\
logits_datatype:TYPE_FP32,\
bls_instance_count:${BLS_INSTANCE_COUNT}
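The server itself was then launched with the usual tensorrtllm_backend flow (a sketch; I am assuming the standard launch script and a world size of 1):

    python3 scripts/launch_triton_server.py --world_size 1 --model_repo=${MODEL_FOLDER}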

The output is a bunch of meaningless tokens, for example:

"xx.Componentlocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklock"

My current conclusion is that the TensorRT-LLM inference process itself is fine. So is there a problem in the integration with Triton Server, and if so, what could it be?
This problem has been bothering me for several days and is quite painful.
I'm really looking forward to your answers.
Thank you.

@Barry-Delaney @kaiyux

white-wolf-tech (Author) commented

I added a log print at the very beginning of the execute function in the model.py file inside the tensorrt_llm model directory, roughly like the sketch below.
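A minimal sketch of the kind of logging I mean (it assumes the Python-backend model.py under the tensorrt_llm model folder; the logger call is illustrative, not the exact line I added):

    import triton_python_backend_utils as pb_utils

    class TritonPythonModel:
        def execute(self, requests):
            # Added line: log as soon as execute() is entered, so the server log shows
            # whether Triton ever reaches this model when serving the ModelOpt engine.
            pb_utils.Logger.log_info(f"execute() entered with {len(requests)} request(s)")
            # ... the rest of the original execute() body is unchanged ...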
For the ModelOpt-quantized model, the tensorrt_llm model is never actually called at runtime; it is skipped directly, and in the response every token after the prompt is 1023.
However, unquantized models and models quantized with SmoothQuant, i.e. models that do not go through ModelOpt, all work normally.
This is really strange.
Is it possible that the ModelOpt package is not compatible with the current Triton Server TensorRT-LLM backend?
