
Support weight only quantization from bfloat16 to int8? #110

Closed
Missmiaom opened this issue Oct 25, 2023 · 9 comments
Labels: feature request (New feature or request), triaged (Issue has been triaged by maintainers)

@Missmiaom

Missmiaom commented Oct 25, 2023

An error occurred while quantizing the bf16 model:

tensorrt_llm_0.5.0/examples/gpt/weight.py:265

elif use_weight_only:
    processed_torch_weights, torch_weight_scales = torch.ops.fastertransformer.symmetric_quantize_last_axis_of_batched_matrix(
        torch.tensor(t), plugin_weight_only_quant_type)

error:

can't convert np.ndarray of type numpy.void. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
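
For reference, a minimal sketch of one possible workaround (my own assumption, not an official fix): numpy has no native bfloat16 dtype, so bf16 weights dumped as raw bytes load as numpy.void, which torch.tensor() cannot convert. Reinterpreting the raw 16-bit payload as a torch.bfloat16 tensor sidesteps the conversion; the helper name below is hypothetical.

import numpy as np
import torch

def bf16_ndarray_to_torch(t: np.ndarray) -> torch.Tensor:
    # Reinterpret a raw 2-byte-per-element ndarray (e.g. dtype void16) as
    # bfloat16 without touching the underlying bits.
    assert t.dtype.itemsize == 2, "expected a 16-bit payload per element"
    raw = torch.from_numpy(np.ascontiguousarray(t).view(np.int16))
    return raw.view(torch.bfloat16)

# The 0.5.0 example passes torch.tensor(t) straight into the quantize op; one
# way to keep that call working is to cast to float16 first (assumption: the
# op accepts fp16, at the cost of a bf16 -> fp16 conversion of the weights):
# weight = bf16_ndarray_to_torch(t).to(torch.float16)
# processed, scales = torch.ops.fastertransformer.symmetric_quantize_last_axis_of_batched_matrix(
#     weight, plugin_weight_only_quant_type)
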
@wang90063

Same question

@byshiue
Collaborator

byshiue commented Oct 25, 2023

int8/int4 weight-only + bf16 activation is not supported in TensorRT-LLM at the moment.

@Missmiaom
Author

@byshiue Does TensorRT-LLM have any plans to support this feature? I would be very grateful.

@byshiue byshiue added the feature request New feature or request label Oct 25, 2023
@jdemouth-nvidia
Collaborator

We do not have such a plan right now but we can consider it if we have sufficient demand for it.

@ncomly-nvidia ncomly-nvidia added the triaged Issue has been triaged by maintainers label Nov 6, 2023
@ncomly-nvidia ncomly-nvidia self-assigned this Nov 6, 2023
@Kelang-Tian

We do not have such a plan right now but we can consider it if we have sufficient demand for it.

https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/examples/llama/weight.py#L269
I also encountered the same problem. I trained a model that uses bfloat16 for computation, and I also want to use weight-only quantization (int8 or 4-bit AWQ) in TRT-LLM.

According to your reply above, it seems that this feature is not currently supported. Do you have any plans to support it (weight_only + bf16)?

@byshiue
Collaborator

byshiue commented Nov 16, 2023

@Kelang-Tian This feature should be supported in the latest main branch; please give it a try.

@Kelang-Tian

@Kelang-Tian This feature should be supported in the latest main branch; please give it a try.

Thanks!
I built TensorRT-LLM from the main branch, built an engine, and successfully ran it with the Python runtime (bfloat16 + weight-only int8 quantization).
I have now hit a new problem: I want to serve the model through Triton. I used the following commands but received an error.

build engine

cd /code/tensorrt_llm/examples/gpt
export CUDA_VISIBLE_DEVICES=4,5,6,7
MODEL_PATH="/nvme4/trt-llm/models/test_b"
TOKENIZER_PATH="/nvme4/trt-llm/models/test_b"
MAX_INPUT_LEN=8192
MAX_OUTPUT_LEN=4096
MAX_BATCH_SIZE=32
WORLD_SIZE=2
DTYPE="bfloat16"
OUTPUT_DIR_BASE="/nvme4/trt-llm/models/test_b/trt_engines/${DTYPE}/${DTYPE}-TP${WORLD_SIZE}-bs${MAX_BATCH_SIZE}-IFB"
OUTPUT_DIR=${OUTPUT_DIR_BASE}"-w8"
python build.py --world_size ${WORLD_SIZE} \
    --model_dir ${MODEL_PATH} \
    --dtype ${DTYPE} \
    --max_batch_size ${MAX_BATCH_SIZE} \
    --max_input_len ${MAX_INPUT_LEN} \
    --max_output_len ${MAX_OUTPUT_LEN} \
    --use_gpt_attention_plugin ${DTYPE} \
    --use_gemm_plugin ${DTYPE} \
    --enable_context_fmha \
    --use_layernorm_plugin ${DTYPE} \
    --parallel_build \
    --output_dir ${OUTPUT_DIR} \
    --paged_kv_cache \
    --use_inflight_batching \
    --remove_input_padding \
    --use_weight_only

cd /tensorrtllm_backend
python3 scripts/launch_triton_server.py --world_size=2 --model_repo=/tensorrtllm_backend/triton_model_repo

Error

python3 scripts/launch_triton_server.py --world_size=${WORLD_SIZE} --model_repo=${MODEL_REPO}
root@51ad807dd001:/nvme4/kelang/tensorrtllm_backend# I1122 07:55:20.099326 131 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f66d4000000' with size 268435456
I1122 07:55:20.099790 132 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7fe8d6000000' with size 268435456
I1122 07:55:20.158816 131 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1122 07:55:20.158833 131 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I1122 07:55:20.158836 131 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I1122 07:55:20.158839 131 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I1122 07:55:20.163004 132 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1122 07:55:20.163019 132 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I1122 07:55:20.163022 132 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I1122 07:55:20.163025 132 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I1122 07:55:21.515833 131 model_lifecycle.cc:461] loading: tensorrt_llm:1
I1122 07:55:21.515871 131 model_lifecycle.cc:461] loading: preprocessing:1
I1122 07:55:21.515889 131 model_lifecycle.cc:461] loading: postprocessing:1
I1122 07:55:21.517820 132 model_lifecycle.cc:461] loading: tensorrt_llm:1
I1122 07:55:21.517854 132 model_lifecycle.cc:461] loading: preprocessing:1
I1122 07:55:21.517873 132 model_lifecycle.cc:461] loading: postprocessing:1
I1122 07:55:21.611330 132 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I1122 07:55:21.611882 132 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I1122 07:55:21.615021 131 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I1122 07:55:21.615510 131 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
E1122 07:55:21.632676 132 backend_model.cc:634] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal
E1122 07:55:21.632730 132 model_lifecycle.cc:621] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal
I1122 07:55:21.632747 132 model_lifecycle.cc:756] failed to load 'tensorrt_llm'
E1122 07:55:21.637088 131 backend_model.cc:634] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal
E1122 07:55:21.637147 131 model_lifecycle.cc:621] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal
I1122 07:55:21.637162 131 model_lifecycle.cc:756] failed to load 'tensorrt_llm'
I1122 07:55:22.168205 132 model_lifecycle.cc:818] successfully loaded 'postprocessing'
I1122 07:55:22.235315 131 model_lifecycle.cc:818] successfully loaded 'postprocessing'
I1122 07:55:23.375833 131 model_lifecycle.cc:818] successfully loaded 'preprocessing'
E1122 07:55:23.375902 131 model_repository_manager.cc:563] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal;
I1122 07:55:23.375975 131 server.cc:592]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1122 07:55:23.376027 131 server.cc:619]
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","shm-region-prefix-name":"prefix0_" |
| | | ,"default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1122 07:55:23.376075 131 server.cc:662]
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Model | Version | Status |
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{ |
| | | ', or a literal |
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1122 07:55:23.396928 132 model_lifecycle.cc:818] successfully loaded 'preprocessing'
E1122 07:55:23.396987 132 model_repository_manager.cc:563] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal;
I1122 07:55:23.397046 132 server.cc:592]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1122 07:55:23.397092 132 server.cc:619]
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","shm-region-prefix-name":"prefix1_" |
| | | ,"default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1122 07:55:23.397142 132 server.cc:662]
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Model | Version | Status |
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{ |
| | | ', or a literal |
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1122 07:55:23.768864 132 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA H800
I1122 07:55:23.768873 131 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA H800
I1122 07:55:23.768891 132 metrics.cc:817] Collecting metrics for GPU 1: NVIDIA H800
I1122 07:55:23.768897 132 metrics.cc:817] Collecting metrics for GPU 2: NVIDIA H800
I1122 07:55:23.768898 131 metrics.cc:817] Collecting metrics for GPU 1: NVIDIA H800
I1122 07:55:23.768902 132 metrics.cc:817] Collecting metrics for GPU 3: NVIDIA H800
I1122 07:55:23.768903 131 metrics.cc:817] Collecting metrics for GPU 2: NVIDIA H800
I1122 07:55:23.768908 131 metrics.cc:817] Collecting metrics for GPU 3: NVIDIA H800
I1122 07:55:23.770983 131 metrics.cc:710] Collecting CPU metrics
I1122 07:55:23.770985 132 metrics.cc:710] Collecting CPU metrics
I1122 07:55:23.771402 131 tritonserver.cc:2458]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.39.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | all_models/gpttest_b_TP2_bs32_IFB_bf16_int8w_main/ |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| cuda_memory_pool_byte_size{2} | 67108864 |
| cuda_memory_pool_byte_size{3} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 I1122 07:55:23.771404 132 tritonserver.cc:2458]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.39.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | all_models/gpttest_b_TP2_bs32_IFB_bf16_int8w_main/ |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| cuda_memory_pool_byte_size{2} | 67108864 |
| cuda_memory_pool_byte_size{3} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1122 07:55:23.771410 131 server.cc:293] Waiting for in-flight requests to complete.
I1122 07:55:23.771416 131 server.cc:309] Timeout 30: Found 0 model versions that have in-flight inferences
|
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1122 07:55:23.771414 132 server.cc:293] Waiting for in-flight requests to complete.
I1122 07:55:23.771426 132 server.cc:309] Timeout 30: Found 0 model versions that have in-flight inferences
I1122 07:55:23.771510 131 server.cc:324] All models are stopped, unloading models
I1122 07:55:23.771515 131 server.cc:331] Timeout 30: Found 2 live models and 0 in-flight non-inference requests
I1122 07:55:23.771523 132 server.cc:324] All models are stopped, unloading models
I1122 07:55:23.771528 132 server.cc:331] Timeout 30: Found 2 live models and 0 in-flight non-inference requests
I1122 07:55:24.771598 132 server.cc:331] Timeout 29: Found 2 live models and 0 in-flight non-inference requests
I1122 07:55:24.771597 131 server.cc:331] Timeout 29: Found 2 live models and 0 in-flight non-inference requests
Cleaning up...
Cleaning up...
Cleaning up...
Cleaning up...
I1122 07:55:25.034130 132 model_lifecycle.cc:603] successfully unloaded 'postprocessing' version 1
I1122 07:55:25.056283 131 model_lifecycle.cc:603] successfully unloaded 'postprocessing' version 1
I1122 07:55:25.263350 131 model_lifecycle.cc:603] successfully unloaded 'preprocessing' version 1
I1122 07:55:25.302367 132 model_lifecycle.cc:603] successfully unloaded 'preprocessing' version 1
I1122 07:55:25.771689 131 server.cc:331] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
I1122 07:55:25.771693 132 server.cc:331] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
error: creating server: Internal - failed to load all models

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[1482,1],0]
Exit code: 1

My question is: do I need to check out tensorrtllm_backend on the main branch and recompile it?

@byshiue
Collaborator

byshiue commented Nov 23, 2023

@Kelang-Tian It looks like an issue with the format of your config.pbtxt for Triton Server, and it is not related to this issue. Please create another issue in the tensorrtllm_backend repo and share your config.pbtxt.
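
For anyone hitting the same parse error: "unexpected end of input" at line 1, column 1 usually means the backend parsed an empty value, e.g. an unfilled ${...} template placeholder left in config.pbtxt or an engine directory without a readable config.json. A minimal, hypothetical check along those lines (the script name and the placeholder heuristic are my own assumptions, not part of the backend):

import json
import re
import sys

def check_triton_repo(config_pbtxt: str, engine_dir: str) -> None:
    # 1) Unfilled ${...} template placeholders in config.pbtxt leave the
    #    backend with empty parameter values.
    text = open(config_pbtxt).read()
    leftover = sorted(set(re.findall(r"\$\{[^}]+\}", text)))
    if leftover:
        print(f"unfilled placeholders in {config_pbtxt}: {leftover}")

    # 2) The backend reads the engine's config.json from the model directory;
    #    make sure it exists and parses as JSON.
    try:
        with open(f"{engine_dir}/config.json") as f:
            json.load(f)
        print(f"{engine_dir}/config.json is valid JSON")
    except (OSError, json.JSONDecodeError) as exc:
        print(f"problem with {engine_dir}/config.json: {exc}")

if __name__ == "__main__":
    # Usage (paths taken from the commands above):
    #   python3 check_triton_repo.py /tensorrtllm_backend/triton_model_repo/tensorrt_llm/config.pbtxt ${OUTPUT_DIR}
    check_triton_repo(sys.argv[1], sys.argv[2])
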

@whalefa1I

@Kelang-Tian This feature should be supported in the latest main branch; please give it a try.

May I ask whether there are any difficulties in adapting multi-query attention (MQA) to int8 quantization? I would like to try SmoothQuant on Llama 2 70B, and I am currently running into issues while modifying Llama's code based on the GPT code, because the GPT code also does not support MQA with int8 quantization.
