
Support weight only quantization from bfloat16 to int8? #110

Closed
Missmiaom opened this issue Oct 25, 2023 · 9 comments
Labels: feature request (New feature or request), triaged (Issue has been triaged by maintainers)

@Missmiaom

Missmiaom commented Oct 25, 2023

An error occurred while quantizing the bf16 model:

tensorrt_llm_0.5.0/examples/gpt/weight.py:265

elif use_weight_only:
    processed_torch_weights, torch_weight_scales = torch.ops.fastertransformer.symmetric_quantize_last_axis_of_batched_matrix(
        torch.tensor(t), plugin_weight_only_quant_type)

error:

can't convert np.ndarray of type numpy.void. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
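
For reference, a minimal sketch of one possible workaround (my own assumption, not an official fix): numpy has no native bfloat16 dtype, so bf16 weights dumped as raw bytes load as numpy.void, which torch.tensor() cannot convert. Reinterpreting the raw 16-bit payload as a torch.bfloat16 tensor sidesteps the conversion; the helper name below is hypothetical.

import numpy as np
import torch

def bf16_ndarray_to_torch(t: np.ndarray) -> torch.Tensor:
    # Reinterpret a raw 2-byte-per-element ndarray (e.g. dtype void16) as
    # bfloat16 without touching the underlying bits.
    assert t.dtype.itemsize == 2, "expected a 16-bit payload per element"
    raw = torch.from_numpy(np.ascontiguousarray(t).view(np.int16))
    return raw.view(torch.bfloat16)

# The 0.5.0 example passes torch.tensor(t) straight into the quantize op; one
# way to keep that call working is to cast to float16 first (assumption: the
# op accepts fp16, at the cost of a bf16 -> fp16 conversion of the weights):
# weight = bf16_ndarray_to_torch(t).to(torch.float16)
# processed, scales = torch.ops.fastertransformer.symmetric_quantize_last_axis_of_batched_matrix(
#     weight, plugin_weight_only_quant_type)
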
@wang90063

Same question

@byshiue
Collaborator

byshiue commented Oct 25, 2023

int8/int4 weight-only + bf16 activation is not supported in TensorRT-LLM at the moment.

@Missmiaom
Author

@byshiue Does TensorRT-LLM have any plans to support this feature? I would be very grateful.

@byshiue byshiue added the feature request New feature or request label Oct 25, 2023
@jdemouth-nvidia
Collaborator

We do not have such a plan right now but we can consider it if we have sufficient demand for it.

@ncomly-nvidia ncomly-nvidia added the triaged Issue has been triaged by maintainers label Nov 6, 2023
@ncomly-nvidia ncomly-nvidia self-assigned this Nov 6, 2023
@Kelang-Tian

We do not have such a plan right now but we can consider it if we have sufficient demand for it.

https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/examples/llama/weight.py#L269
I also encountered the same problem. I trained a model that uses bfloat16 for computation, and I also want to use weight-only quantization (int8 or 4-bit AWQ) in TRT-LLM.

According to your reply above, it seems that this feature is not currently supported. Do you have any plans to support it (weight_only + bf16)?

@byshiue
Collaborator

byshiue commented Nov 16, 2023

@Kelang-Tian This feature should be supported in the latest main branch; please give it a try.

@Kelang-Tian

@Kelang-Tian This feature should be supported in the latest main branch; please give it a try.

Thanks!
I built TensorRT-LLM from the main branch, built an engine, and successfully ran it with the Python runtime (bfloat16 + weight-only int8 quantization).
I have now hit a new problem: I want to serve the model through Triton. I used the following commands but received an error.

build engine

cd /code/tensorrt_llm/examples/gpt
export CUDA_VISIBLE_DEVICES=4,5,6,7
MODEL_PATH="/nvme4/trt-llm/models/test_b"
TOKENIZER_PATH="/nvme4/trt-llm/models/test_b"
MAX_INPUT_LEN=8192
MAX_OUTPUT_LEN=4096
MAX_BATCH_SIZE=32
WORLD_SIZE=2
DTYPE="bfloat16"
OUTPUT_DIR_BASE="/nvme4/trt-llm/models/test_b/trt_engines/${DTYPE}/${DTYPE}-TP${WORLD_SIZE}-bs${MAX_BATCH_SIZE}-IFB"
OUTPUT_DIR=${OUTPUT_DIR_BASE}"-w8"
python build.py --world_size ${WORLD_SIZE} \
    --model_dir ${MODEL_PATH} \
    --dtype ${DTYPE} \
    --max_batch_size ${MAX_BATCH_SIZE} \
    --max_input_len ${MAX_INPUT_LEN} \
    --max_output_len ${MAX_OUTPUT_LEN} \
    --use_gpt_attention_plugin ${DTYPE} \
    --use_gemm_plugin ${DTYPE} \
    --enable_context_fmha \
    --use_layernorm_plugin ${DTYPE} \
    --parallel_build \
    --output_dir ${OUTPUT_DIR} \
    --paged_kv_cache \
    --use_inflight_batching \
    --remove_input_padding \
    --use_weight_only

cd /tensorrtllm_backend
python3 scripts/launch_triton_server.py --world_size=2 --model_repo=/tensorrtllm_backend/triton_model_repo

Error

python3 scripts/launch_triton_server.py --world_size=${WORLD_SIZE} --model_repo=${MODEL_REPO}
root@51ad807dd001:/nvme4/kelang/tensorrtllm_backend# I1122 07:55:20.099326 131 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f66d4000000' with size 268435456
I1122 07:55:20.099790 132 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7fe8d6000000' with size 268435456
I1122 07:55:20.158816 131 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1122 07:55:20.158833 131 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I1122 07:55:20.158836 131 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I1122 07:55:20.158839 131 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I1122 07:55:20.163004 132 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1122 07:55:20.163019 132 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I1122 07:55:20.163022 132 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I1122 07:55:20.163025 132 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I1122 07:55:21.515833 131 model_lifecycle.cc:461] loading: tensorrt_llm:1
I1122 07:55:21.515871 131 model_lifecycle.cc:461] loading: preprocessing:1
I1122 07:55:21.515889 131 model_lifecycle.cc:461] loading: postprocessing:1
I1122 07:55:21.517820 132 model_lifecycle.cc:461] loading: tensorrt_llm:1
I1122 07:55:21.517854 132 model_lifecycle.cc:461] loading: preprocessing:1
I1122 07:55:21.517873 132 model_lifecycle.cc:461] loading: postprocessing:1
I1122 07:55:21.611330 132 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I1122 07:55:21.611882 132 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I1122 07:55:21.615021 131 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I1122 07:55:21.615510 131 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
E1122 07:55:21.632676 132 backend_model.cc:634] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal
E1122 07:55:21.632730 132 model_lifecycle.cc:621] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal
I1122 07:55:21.632747 132 model_lifecycle.cc:756] failed to load 'tensorrt_llm'
E1122 07:55:21.637088 131 backend_model.cc:634] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal
E1122 07:55:21.637147 131 model_lifecycle.cc:621] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal
I1122 07:55:21.637162 131 model_lifecycle.cc:756] failed to load 'tensorrt_llm'
I1122 07:55:22.168205 132 model_lifecycle.cc:818] successfully loaded 'postprocessing'
I1122 07:55:22.235315 131 model_lifecycle.cc:818] successfully loaded 'postprocessing'
I1122 07:55:23.375833 131 model_lifecycle.cc:818] successfully loaded 'preprocessing'
E1122 07:55:23.375902 131 model_repository_manager.cc:563] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal;
I1122 07:55:23.375975 131 server.cc:592]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1122 07:55:23.376027 131 server.cc:619]
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","shm-region-prefix-name":"prefix0_" |
| | | ,"default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1122 07:55:23.376075 131 server.cc:662]
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Model | Version | Status |
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{ |
| | | ', or a literal |
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1122 07:55:23.396928 132 model_lifecycle.cc:818] successfully loaded 'preprocessing'
E1122 07:55:23.396987 132 model_repository_manager.cc:563] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal;
I1122 07:55:23.397046 132 server.cc:592]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1122 07:55:23.397092 132 server.cc:619]
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","shm-region-prefix-name":"prefix1_" |
| | | ,"default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1122 07:55:23.397142 132 server.cc:662]
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Model | Version | Status |
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{ |
| | | ', or a literal |
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1122 07:55:23.768864 132 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA H800
I1122 07:55:23.768873 131 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA H800
I1122 07:55:23.768891 132 metrics.cc:817] Collecting metrics for GPU 1: NVIDIA H800
I1122 07:55:23.768897 132 metrics.cc:817] Collecting metrics for GPU 2: NVIDIA H800
I1122 07:55:23.768898 131 metrics.cc:817] Collecting metrics for GPU 1: NVIDIA H800
I1122 07:55:23.768902 132 metrics.cc:817] Collecting metrics for GPU 3: NVIDIA H800
I1122 07:55:23.768903 131 metrics.cc:817] Collecting metrics for GPU 2: NVIDIA H800
I1122 07:55:23.768908 131 metrics.cc:817] Collecting metrics for GPU 3: NVIDIA H800
I1122 07:55:23.770983 131 metrics.cc:710] Collecting CPU metrics
I1122 07:55:23.770985 132 metrics.cc:710] Collecting CPU metrics
I1122 07:55:23.771402 131 tritonserver.cc:2458]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.39.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | all_models/gpttest_b_TP2_bs32_IFB_bf16_int8w_main/ |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| cuda_memory_pool_byte_size{2} | 67108864 |
| cuda_memory_pool_byte_size{3} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 I1122 07:55:23.771404 132 tritonserver.cc:2458]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.39.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | all_models/gpttest_b_TP2_bs32_IFB_bf16_int8w_main/ |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| cuda_memory_pool_byte_size{2} | 67108864 |
| cuda_memory_pool_byte_size{3} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1122 07:55:23.771410 131 server.cc:293] Waiting for in-flight requests to complete.
I1122 07:55:23.771416 131 server.cc:309] Timeout 30: Found 0 model versions that have in-flight inferences
|
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1122 07:55:23.771414 132 server.cc:293] Waiting for in-flight requests to complete.
I1122 07:55:23.771426 132 server.cc:309] Timeout 30: Found 0 model versions that have in-flight inferences
I1122 07:55:23.771510 131 server.cc:324] All models are stopped, unloading models
I1122 07:55:23.771515 131 server.cc:331] Timeout 30: Found 2 live models and 0 in-flight non-inference requests
I1122 07:55:23.771523 132 server.cc:324] All models are stopped, unloading models
I1122 07:55:23.771528 132 server.cc:331] Timeout 30: Found 2 live models and 0 in-flight non-inference requests
I1122 07:55:24.771598 132 server.cc:331] Timeout 29: Found 2 live models and 0 in-flight non-inference requests
I1122 07:55:24.771597 131 server.cc:331] Timeout 29: Found 2 live models and 0 in-flight non-inference requests
Cleaning up...
Cleaning up...
Cleaning up...
Cleaning up...
I1122 07:55:25.034130 132 model_lifecycle.cc:603] successfully unloaded 'postprocessing' version 1
I1122 07:55:25.056283 131 model_lifecycle.cc:603] successfully unloaded 'postprocessing' version 1
I1122 07:55:25.263350 131 model_lifecycle.cc:603] successfully unloaded 'preprocessing' version 1
I1122 07:55:25.302367 132 model_lifecycle.cc:603] successfully unloaded 'preprocessing' version 1
I1122 07:55:25.771689 131 server.cc:331] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
I1122 07:55:25.771693 132 server.cc:331] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
error: creating server: Internal - failed to load all models

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[1482,1],0]
Exit code: 1

My question is: do I need to check out tensorrtllm_backend on the main branch and recompile it?

@byshiue
Collaborator

byshiue commented Nov 23, 2023

@Kelang-Tian It looks like an issue with the format of your config.pbtxt for Triton Server, and it is not related to this issue. Please create another issue in the tensorrtllm_backend repo and share your config.pbtxt.
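
For anyone hitting the same parse error: "unexpected end of input" at line 1, column 1 usually means the backend parsed an empty value, e.g. an unfilled ${...} template placeholder left in config.pbtxt or an engine directory without a readable config.json. A minimal, hypothetical check along those lines (the script name and the placeholder heuristic are my own assumptions, not part of the backend):

import json
import re
import sys

def check_triton_repo(config_pbtxt: str, engine_dir: str) -> None:
    # 1) Unfilled ${...} template placeholders in config.pbtxt leave the
    #    backend with empty parameter values.
    text = open(config_pbtxt).read()
    leftover = sorted(set(re.findall(r"\$\{[^}]+\}", text)))
    if leftover:
        print(f"unfilled placeholders in {config_pbtxt}: {leftover}")

    # 2) The backend reads the engine's config.json from the model directory;
    #    make sure it exists and parses as JSON.
    try:
        with open(f"{engine_dir}/config.json") as f:
            json.load(f)
        print(f"{engine_dir}/config.json is valid JSON")
    except (OSError, json.JSONDecodeError) as exc:
        print(f"problem with {engine_dir}/config.json: {exc}")

if __name__ == "__main__":
    # Usage (paths taken from the commands above):
    #   python3 check_triton_repo.py /tensorrtllm_backend/triton_model_repo/tensorrt_llm/config.pbtxt ${OUTPUT_DIR}
    check_triton_repo(sys.argv[1], sys.argv[2])
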

@whalefa1I

@Kelang-Tian This feature should be supported in the latest main branch; please give it a try.

May I ask whether there are any difficulties in adapting multi-query attention (MQA) to int8 quantization? I would like to try SmoothQuant on Llama 2 70B, and I am currently running into issues while modifying Llama's code based on the GPT code, because the GPT code also does not support MQA with int8 quantization.
