pytorch backend run error with fp8 hf model #2825

Open
nickole2018 opened this issue Feb 26, 2025 · 0 comments
Labels
bug Something isn't working

Comments

@nickole2018

System Info

  • CPU architecture: x86_64
  • GPU: NVIDIA L40S, 46 GB
  • Libraries:
    • TensorRT-LLM version: 0.17.0.post1
    • torch version: 2.6.0a0+ecf3bae40a.nv25.01
  • Docker image: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer/examples/llm_ptq
scripts/huggingface_example.sh --model --quant fp8 --export_fmt hf

This produces the quantized FP8 model.
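(The --model flag above is missing its argument as posted; presumably the script was pointed at the HF checkpoint directory, along the lines of the following, where the path is a placeholder:)

scripts/huggingface_example.sh --model /path/to/hf-model --quant fp8 --export_fmt hf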
Then call generate through the PyTorch backend with the following script:

import argparse, time

from tensorrt_llm import SamplingParams
from tensorrt_llm._torch import LLM
from tensorrt_llm._torch.pyexecutor.config import PyTorchConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, PretrainedConfig, QuantoConfig

def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_dir',
                        type=str,
                        default='the quantized fp8 model dir')
    parser.add_argument('--tp_size', type=int, default=1)
    parser.add_argument('--enable_overlap_scheduler',
                        default=False,
                        action='store_true')
    parser.add_argument('--enable_chunked_prefill',
                        default=False,
                        action='store_true')
    parser.add_argument('--kv_cache_dtype', type=str, default='auto')
    args = parser.parse_args()
    return args

def main():
    args = parse_arguments()

    pytorch_config = PyTorchConfig(
        enable_overlap_scheduler=args.enable_overlap_scheduler,
        kv_cache_dtype=args.kv_cache_dtype)
    llm = LLM(model=args.model_dir,
              tensor_parallel_size=args.tp_size,
              enable_chunked_prefill=args.enable_chunked_prefill,
              pytorch_backend_config=pytorch_config)

    prompts = [
        "讲故事",  # "Tell a story"
    ]

    tokenizer = AutoTokenizer.from_pretrained(args.model_dir)
    max_tokens = 200
    num_beams = 4

    sampling_params = SamplingParams(end_id=tokenizer.eos_token_id,
                                     pad_id=tokenizer.pad_token_id,
                                     max_tokens=max_tokens,
                                     best_of=num_beams,
                                     repetition_penalty=1.0,
                                     temperature=0,
                                     use_beam_search=True)

    start = time.time()
    outputs = llm.generate(prompts, sampling_params)
    end = time.time()
    print(f"time: {end - start}")
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

if __name__ == '__main__':
    main()
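For reference, the script can be invoked as follows (the script name and model path are placeholders; tp_size and kv_cache_dtype fall back to their defaults of 1 and 'auto'):

python run_fp8_generate.py --model_dir /path/to/quantized-fp8-model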

Expected behavior

The generated text is produced correctly.

actual behavior

Running the script fails with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor.py", line 1568, in workers_main
executor = worker_cls(engine, executor_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor.py", line 817, in init
self.engine = _create_engine()
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor.py", line 811, in _create_engine
return unique_create_executor(engine,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/backend_registries/backend_registry.py", line 88, in unique_create_executor
engine = create_py_executor_by_config(executor_config.backend,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/backend_registries/backend_registry.py", line 59, in create_py_executor_by_config
py_executor = backend_registry[name].func(executor_config, checkpoint_dir,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/backend_registries/pytorch_model_registry.py", line 108, in create_pytorch_model_based_executor
kv_cache_max_tokens = estimate_max_kv_cache_tokens(model_engine,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/backend_registries/_util.py", line 93, in estimate_max_kv_cache_tokens
model_engine.forward(req, resource_manager)
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/pytorch_model_engine.py", line 943, in forward
return self._forward_step(inputs, gather_ids)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/pytorch_model_engine.py", line 980, in _forward_step
logits = self.model.forward(**inputs,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_utils.py", line 187, in forward
hidden_states = self.model(
^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_qwen.py", line 178, in forward
hidden_states, residual = decoder_layer(position_ids=position_ids,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_qwen.py", line 113, in forward
hidden_states = self.self_attn(
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/attention.py", line 104, in forward
qkv = self.qkv_proj(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 313, in forward
output = self.apply_linear(input, self.weight, self.bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 256, in apply_linear
output = torch.ops.trtllm.cublas_scaled_mm(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1123, in call
return self._op(*args, **(kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: bias is not support yet
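The failure originates in torch.ops.trtllm.cublas_scaled_mm, reached through the fused qkv_proj linear of the attention module. Qwen2-style checkpoints export their q_proj/k_proj/v_proj layers with bias tensors, so an FP8 GEMM path that cannot apply a bias hits exactly this error. Below is a minimal sketch to confirm the quantized checkpoint still carries those bias tensors (it assumes a safetensors export; the filename is a placeholder):

from safetensors import safe_open

# Placeholder filename; a sharded export uses model-00001-of-0000N.safetensors shards.
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for name in f.keys():
        # Qwen2 q/k/v projections carry bias, which the FP8 cublas path rejects.
        if "self_attn" in name and name.endswith(".bias"):
            print(name)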

additional notes

None

nickole2018 added the bug label Feb 26, 2025