Support weight only quantization from bfloat16 to int8? #110
Comments
Same question.
int8/int4 weight-only + bf16 activation is not supported in TensorRT-LLM now.
@byshiue Does TensorRT-LLM have any plans to support this feature? I would be very grateful.
We do not have such a plan right now, but we can consider it if there is sufficient demand for it.
Looking at https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/examples/llama/weight.py#L269 and your reply above, it seems that this feature is not currently supported. Do you have any plans to support it (weight_only + bf16)?
@Kelang-Tian This feature should be supported in the latest main branch; please give it a try.
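For anyone landing here and wondering what "int8 weight-only with bf16 activations" means in practice, here is a minimal PyTorch sketch of the idea: the weights are stored as int8 with per-output-channel scales, and the activations (and the GEMM output) stay in bf16. This is illustrative code for the technique only, not TensorRT-LLM's implementation; the function names are made up for the example.

```python
import torch

def quantize_weight_only_int8(weight_bf16: torch.Tensor):
    """Per-channel symmetric int8 quantization of a bf16 weight matrix.

    Returns int8 weights plus one bf16 scale per output row; at runtime the
    weights are dequantized (or the scale is fused into the GEMM) while the
    activations remain bf16.
    """
    w = weight_bf16.to(torch.float32)               # compute scales in fp32
    max_abs = w.abs().amax(dim=1, keepdim=True)     # one scale per output row
    scale = (max_abs / 127.0).clamp(min=1e-8)
    w_int8 = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return w_int8, scale.squeeze(1).to(torch.bfloat16)

def dequantize(w_int8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Reference (non-fused) reconstruction of the bf16 weights.
    return (w_int8.to(torch.float32) * scale.to(torch.float32).unsqueeze(1)).to(torch.bfloat16)

# Quick round-trip check on a random layer.
w = torch.randn(4096, 4096, dtype=torch.bfloat16)
w_q, s = quantize_weight_only_int8(w)
err = (dequantize(w_q, s).float() - w.float()).abs().mean().item()
print(f"mean abs reconstruction error: {err:.4f}")
```

The point of the scheme is that only the weights shrink to int8 (halving weight memory versus bf16) while the math effectively still runs at bf16 precision, which is why activation dtype support is a separate question from the quantization itself.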
Thanks! I built the engine, but then hit an error after cd /tensorrtllm_backend. My question is: do I need to check out tensorrtllm_backend on the ...
@Kelang-Tian It looks like an issue with the format of your config.pbtxt in tritonserver and is not related to this issue. Please create another issue in the tensorrtllm_backend repo and share your config.pbtxt.
May I ask if there are any difficulties in adapting multi-query attention (MQA) to int8 quantization? I would like to try SmoothQuant on Llama2 70B, but I am running into issues while modifying LLaMA's code based on the GPT code, because the GPT code also does not support MQA with int8 quantization.
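This does not answer the MQA kernel question, but since SmoothQuant came up: the core of the method is migrating quantization difficulty from activations to weights with a per-input-channel scale before int8 quantization. A minimal sketch, assuming weights laid out as [in_features, out_features] and a pre-collected max|X| calibration statistic (both assumptions for illustration, not necessarily the layout the repo uses):

```python
import torch

def smoothquant_scales(act_absmax: torch.Tensor,
                       weight: torch.Tensor,
                       alpha: float = 0.5) -> torch.Tensor:
    """Per-input-channel smoothing factors s_j = max|X_j|^alpha / max|W_j|^(1-alpha).

    act_absmax: [in_features] calibration statistic max|X_j| over tokens.
    weight:     [in_features, out_features] for Y = X @ W.
    """
    w_absmax = weight.abs().amax(dim=1)            # max over output channels
    s = act_absmax.pow(alpha) / w_absmax.pow(1.0 - alpha)
    return s.clamp(min=1e-5)

# Migrate difficulty from activations to weights:
#   X' = X / s,  W' = diag(s) @ W,  so X' @ W' == X @ W.
in_f, out_f = 8, 4
X = torch.randn(16, in_f) * torch.tensor([10.0] + [1.0] * (in_f - 1))  # one outlier channel
W = torch.randn(in_f, out_f)
s = smoothquant_scales(X.abs().amax(dim=0), W)
X_s, W_s = X / s, W * s.unsqueeze(1)
assert torch.allclose(X_s @ W_s, X @ W, atol=1e-4)
```

The rescaling itself is model-agnostic; the harder part for MQA is presumably the fused int8 attention and KV-cache kernels that have to understand the shared K/V heads, which is a kernel-support question rather than a checkpoint-conversion one.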
An error occurred while quantizing the bf16 model at tensorrt_llm_0.5.0/examples/gpt/weight.py:265.
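The actual error text is not shown above, so this is only a guess at one common cause: NumPy has no bfloat16 dtype, so numpy-based conversion scripts fail when .numpy() is called on a bf16 torch tensor unless it is cast first. A small sketch of the pitfall and the usual workarounds (not the poster's actual traceback):

```python
import torch

w = torch.randn(4, 4, dtype=torch.bfloat16)

# NumPy has no bfloat16 dtype, so calling w.numpy() directly raises an
# error about the unsupported BFloat16 scalar type.

# Typical workarounds before handing the weight to numpy-based code:
w_fp32 = w.to(torch.float32).numpy()   # upcast; safe for quantization math
w_bits = w.view(torch.int16).numpy()   # reinterpret the raw 16-bit pattern
print(w_fp32.dtype, w_bits.dtype)
```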