cache-type-k and cache-type-v parameters support #1305
Comments
Setting type_v quantization below f16 will result in an error: GGML_ASSERT: llama.cpp\ggml.c:7553: false
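For context: the assert is raised inside the llama.cpp build bundled with the wheel, which at that point could run with a quantized K cache but still required the V cache to stay at f16 (a quantized V cache later became tied to flash-attention support upstream). A minimal sketch of the configuration that avoids the assert, assuming the type_k/type_v keyword arguments added in v0.2.58 and assuming the GGML_TYPE_* constants are re-exported by the llama_cpp module:

```python
# Minimal sketch, assuming llama-cpp-python >= 0.2.58 (type_k/type_v kwargs)
# and a local GGUF model: quantize only the K cache and keep V at f16.
import llama_cpp

llm = llama_cpp.Llama(
    model_path="./models/model.gguf",  # hypothetical path
    n_ctx=4096,
    type_k=llama_cpp.GGML_TYPE_Q8_0,   # quantized key cache
    type_v=llama_cpp.GGML_TYPE_F16,    # a value cache below f16 is what triggered the GGML_ASSERT
)
```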
Limour-dev added a commit to Limour-dev/llama-cpp-python that referenced this issue on Mar 28, 2024
abetlen added a commit that referenced this issue on Apr 1, 2024
@riverzhou thanks to @Limour-dev, this should be in the latest v0.2.58 release by setting …
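The comment above is truncated; judging from the linked commits ("add KV cache quantization options", "Add server support"), the same release also wired these options into the bundled OpenAI-compatible server. A hedged sketch of launching it with a quantized K cache; the --type_k/--type_v flag names and their integer ggml_type values are an assumption inferred from those commits, not something confirmed in this thread:

```python
# Hedged sketch: start the llama_cpp.server module with KV cache quantization
# settings. The --type_k/--type_v flag names and the integer ggml_type values
# are assumptions based on the "Add server support" commit referenced above.
import subprocess

subprocess.run([
    "python", "-m", "llama_cpp.server",
    "--model", "./models/model.gguf",  # hypothetical path
    "--type_k", "8",                   # GGML_TYPE_Q8_0
    "--type_v", "1",                   # GGML_TYPE_F16 (see the assert caveat above)
])                                     # blocks while the server runs
```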
xhedit pushed a commit to xhedit/llama-cpp-conv that referenced this issue on Apr 6, 2024
* add KV cache quantization options abetlen#1220 abetlen#1305
* Add ggml_type
* Use ggml_type instead of string for quantization
* Add server support

Co-authored-by: Andrei Betlen <abetlen@gmail.com>
xhedit added a commit to xhedit/llama-cpp-conv that referenced this issue on Apr 6, 2024
* feat: add support for KV cache quantization options (abetlen#1307)
* add KV cache quantization options abetlen#1220 abetlen#1305
* Add ggml_type
* Use ggml_type instead of string for quantization
* Add server support
  Co-authored-by: Andrei Betlen <abetlen@gmail.com>
* fix: Changed local API doc references to hosted (abetlen#1317)
* chore: Bump version
* fix: last tokens passing to sample_repetition_penalties function (abetlen#1295)
  Co-authored-by: ymikhaylov <ymikhaylov@x5.ru>
  Co-authored-by: Andrei <abetlen@gmail.com>
* feat: Update llama.cpp
* fix: segfault when logits_all=False. Closes abetlen#1319
* feat: Binary wheels for CPU, CUDA (12.1 - 12.3), Metal (abetlen#1247)
* Generate binary wheel index on release
* Add total release downloads badge
* Update download label
* Use official cibuildwheel action
* Add workflows to build CUDA and Metal wheels
* Update generate index workflow
* Update workflow name
* feat: Update llama.cpp
* chore: Bump version
* fix(ci): use correct script name
* docs: LLAMA_CUBLAS -> LLAMA_CUDA
* docs: Add docs explaining how to install pre-built wheels.
* docs: Rename cuBLAS section to CUDA
* fix(docs): incorrect tool_choice example (abetlen#1330)
* feat: Update llama.cpp
* fix: missing logprobs in response, incorrect response type for functionary, minor type issues. Closes abetlen#1328 abetlen#1314
* fix: missing logprobs in response, incorrect response type for functionary, minor type issues. Closes abetlen#1328 Closes abetlen#1314
* feat: Update llama.cpp
* fix: Always embed metal library. Closes abetlen#1332
* feat: Update llama.cpp
* chore: Bump version

Co-authored-by: Limour <93720049+Limour-dev@users.noreply.github.com>
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
Co-authored-by: lawfordp2017 <lawfordp@gmail.com>
Co-authored-by: Yuri Mikhailov <bitsharp@gmail.com>
Co-authored-by: ymikhaylov <ymikhaylov@x5.ru>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
francomattar added a commit to francomattar/Python-llama-cpp that referenced this issue on Dec 13, 2024
nayanzin33sergey added a commit to nayanzin33sergey/Python-llama-cpp that referenced this issue on Dec 18, 2024
jamesdev9 pushed a commit to jamesdev9/python-llama-cpp that referenced this issue on Dec 22, 2024
NoBrainer242 pushed a commit to NoBrainer242/Python-App-llama that referenced this issue on Dec 27, 2024
NoBrainer24 added a commit to NoBrainer24/Python-App-llama- that referenced this issue on Dec 28, 2024
Original issue description: Upstream llama.cpp supports the cache-type-k and cache-type-v settings, but the llama-cpp-python server does not.
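For reference, upstream llama.cpp exposes these as the -ctk/--cache-type-k and -ctv/--cache-type-v string options (f16, q8_0, q4_0, and so on), while the Python bindings added through the commits above take ggml_type enum values. A hedged sketch of the correspondence, assuming the GGML_TYPE_* constants are re-exported at the top level of the llama_cpp package:

```python
# Hedged sketch: map upstream llama.cpp --cache-type-k/--cache-type-v names
# onto the ggml_type constants the Python bindings accept (constant names
# assumed to be those exposed by the llama_cpp module).
import llama_cpp

CACHE_TYPES = {
    "f32":  llama_cpp.GGML_TYPE_F32,
    "f16":  llama_cpp.GGML_TYPE_F16,
    "q8_0": llama_cpp.GGML_TYPE_Q8_0,
    "q5_1": llama_cpp.GGML_TYPE_Q5_1,
    "q5_0": llama_cpp.GGML_TYPE_Q5_0,
    "q4_1": llama_cpp.GGML_TYPE_Q4_1,
    "q4_0": llama_cpp.GGML_TYPE_Q4_0,
}

# Rough equivalent of `--cache-type-k q8_0 --cache-type-v f16` in upstream llama.cpp:
llm = llama_cpp.Llama(
    model_path="./models/model.gguf",  # hypothetical path
    type_k=CACHE_TYPES["q8_0"],
    type_v=CACHE_TYPES["f16"],
)
```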