cache-type-k and cache-type-p parameters support #1305

Closed

riverzhou opened this issue Mar 26, 2024 · 2 comments

Comments

riverzhou commented Mar 26, 2024

Upstream llama.cpp supports the cache-type-k and cache-type-v settings, but the llama-cpp-python server does not.

Limour-dev (Contributor) commented Mar 27, 2024

Setting type_v quantization below f16 will result in an error: GGML_ASSERT: llama.cpp\ggml.c:7553: false
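
For context, a minimal sketch (an assumption on my part, not code from the thread) of the configuration this describes, using the type_k/type_v parameters that the pull request below introduces; the model path and the q8_0 constant name are placeholders taken from the ggml type enum:

import llama_cpp

llm = llama_cpp.Llama(
    model_path="./models/model.gguf",    # placeholder model path
    type_k=llama_cpp.GGML_TYPE_Q8_0,     # quantizing the K cache works
    # type_v=llama_cpp.GGML_TYPE_Q8_0,   # a V-cache type below f16 hit the GGML_ASSERT above at the time
)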

Limour-dev added a commit to Limour-dev/llama-cpp-python that referenced this issue Mar 28, 2024
abetlen added a commit that referenced this issue Apr 1, 2024
* add KV cache quantization options

#1220
#1305

* Add ggml_type

* Use ggml_type instead of string for quantization

* Add server support

---------

Co-authored-by: Andrei Betlen <abetlen@gmail.com>
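
The "Add server support" item above exposes the same options through the server settings. A hedged sketch, assuming integer type_k/type_v fields (ggml type ids) were added to ModelSettings; the field names and the path below are inferred from the commit, not verified here:

from llama_cpp.server.settings import ModelSettings

settings = ModelSettings(
    model="./models/model.gguf",  # placeholder model path
    type_k=8,                     # ggml type id for q8_0 (assumed)
    type_v=8,                     # ggml type id for q8_0 (assumed)
)
print(settings.type_k, settings.type_v)
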
abetlen (Owner) commented Apr 1, 2024

@riverzhou Thanks to @Limour-dev, this should be in the latest v0.2.58 release; enable it by setting llama_cpp.Llama(..., type_k=llama_cpp.GGML_TYPE_I8).
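
A short sketch of the usage described here, assuming llama-cpp-python >= 0.2.58; the model path is a placeholder and GGML_TYPE_Q8_0 is one of the GGML_TYPE_* constants exported by llama_cpp (the comment above uses GGML_TYPE_I8):

import llama_cpp

llm = llama_cpp.Llama(
    model_path="./models/model.gguf",    # placeholder model path
    type_k=llama_cpp.GGML_TYPE_Q8_0,     # quantize the KV-cache keys to q8_0
)
out = llm("Q: What is the capital of France? A:", max_tokens=16)
print(out["choices"][0]["text"])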

abetlen closed this as completed Apr 1, 2024
xhedit pushed a commit to xhedit/llama-cpp-conv that referenced this issue Apr 6, 2024
xhedit added a commit to xhedit/llama-cpp-conv that referenced this issue Apr 6, 2024
* feat: add support for KV cache quantization options (abetlen#1307)

* fix: Changed local API doc references to hosted (abetlen#1317)

* chore: Bump version

* fix: last tokens passing to sample_repetition_penalties function (abetlen#1295)

Co-authored-by: ymikhaylov <ymikhaylov@x5.ru>
Co-authored-by: Andrei <abetlen@gmail.com>

* feat: Update llama.cpp

* fix: segfault when logits_all=False. Closes abetlen#1319

* feat: Binary wheels for CPU, CUDA (12.1 - 12.3), Metal (abetlen#1247)

* Generate binary wheel index on release

* Add total release downloads badge

* Update download label

* Use official cibuildwheel action

* Add workflows to build CUDA and Metal wheels

* Update generate index workflow

* Update workflow name

* feat: Update llama.cpp

* chore: Bump version

* fix(ci): use correct script name

* docs: LLAMA_CUBLAS -> LLAMA_CUDA

* docs: Add docs explaining how to install pre-built wheels.

* docs: Rename cuBLAS section to CUDA

* fix(docs): incorrect tool_choice example (abetlen#1330)

* feat: Update llama.cpp

* fix: missing logprobs in response, incorrect response type for functionary, minor type issues. Closes abetlen#1328 abetlen#1314

* fix: missing logprobs in response, incorrect response type for functionary, minor type issues. Closes abetlen#1328 Closes abetlen#1314

* feat: Update llama.cpp

* fix: Always embed metal library. Closes abetlen#1332

* feat: Update llama.cpp

* chore: Bump version

---------

Co-authored-by: Limour <93720049+Limour-dev@users.noreply.github.com>
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
Co-authored-by: lawfordp2017 <lawfordp@gmail.com>
Co-authored-by: Yuri Mikhailov <bitsharp@gmail.com>
Co-authored-by: ymikhaylov <ymikhaylov@x5.ru>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
francomattar added a commit to francomattar/Python-llama-cpp that referenced this issue Dec 13, 2024

nayanzin33sergey added a commit to nayanzin33sergey/Python-llama-cpp that referenced this issue Dec 18, 2024

jamesdev9 pushed a commit to jamesdev9/python-llama-cpp that referenced this issue Dec 22, 2024

NoBrainer242 pushed a commit to NoBrainer242/Python-App-llama that referenced this issue Dec 27, 2024

NoBrainer24 added a commit to NoBrainer24/Python-App-llama- that referenced this issue Dec 28, 2024