cache-type-k and cache-type-v parameters support #1305
Comments
Setting type_v quantization below f16 will result in an error: GGML_ASSERT: llama.cpp\ggml.c:7553: false
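For context: the assert is raised inside the llama.cpp build bundled with the wheel, which at that point could run with a quantized K cache but still required the V cache to stay at f16 (a quantized V cache later became tied to flash-attention support upstream). A minimal sketch of the configuration that avoids the assert, assuming the type_k/type_v keyword arguments added in v0.2.58 and assuming the GGML_TYPE_* constants are re-exported by the llama_cpp module:

```python
# Minimal sketch, assuming llama-cpp-python >= 0.2.58 (type_k/type_v kwargs)
# and a local GGUF model: quantize only the K cache and keep V at f16.
import llama_cpp

llm = llama_cpp.Llama(
    model_path="./models/model.gguf",  # hypothetical path
    n_ctx=4096,
    type_k=llama_cpp.GGML_TYPE_Q8_0,   # quantized key cache
    type_v=llama_cpp.GGML_TYPE_F16,    # a value cache below f16 is what triggered the GGML_ASSERT
)
```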
Limour-dev added a commit to Limour-dev/llama-cpp-python that referenced this issue on Mar 28, 2024
abetlen added a commit that referenced this issue on Apr 1, 2024
@riverzhou thanks to @Limour-dev, this should be in the latest v0.2.58 release by setting …
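The comment above is truncated; judging from the linked commits ("add KV cache quantization options", "Add server support"), the same release also wired these options into the bundled OpenAI-compatible server. A hedged sketch of launching it with a quantized K cache; the --type_k/--type_v flag names and their integer ggml_type values are an assumption inferred from those commits, not something confirmed in this thread:

```python
# Hedged sketch: start the llama_cpp.server module with KV cache quantization
# settings. The --type_k/--type_v flag names and the integer ggml_type values
# are assumptions based on the "Add server support" commit referenced above.
import subprocess

subprocess.run([
    "python", "-m", "llama_cpp.server",
    "--model", "./models/model.gguf",  # hypothetical path
    "--type_k", "8",                   # GGML_TYPE_Q8_0
    "--type_v", "1",                   # GGML_TYPE_F16 (see the assert caveat above)
])                                     # blocks while the server runs
```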
xhedit pushed a commit to xhedit/llama-cpp-conv that referenced this issue on Apr 6, 2024
* add KV cache quantization options abetlen#1220 abetlen#1305
* Add ggml_type
* Use ggml_type instead of string for quantization
* Add server support

Co-authored-by: Andrei Betlen <abetlen@gmail.com>
xhedit added a commit to xhedit/llama-cpp-conv that referenced this issue on Apr 6, 2024
* feat: add support for KV cache quantization options (abetlen#1307)
* add KV cache quantization options abetlen#1220 abetlen#1305
* Add ggml_type
* Use ggml_type instead of string for quantization
* Add server support
  Co-authored-by: Andrei Betlen <abetlen@gmail.com>
* fix: Changed local API doc references to hosted (abetlen#1317)
* chore: Bump version
* fix: last tokens passing to sample_repetition_penalties function (abetlen#1295)
  Co-authored-by: ymikhaylov <ymikhaylov@x5.ru>
  Co-authored-by: Andrei <abetlen@gmail.com>
* feat: Update llama.cpp
* fix: segfault when logits_all=False. Closes abetlen#1319
* feat: Binary wheels for CPU, CUDA (12.1 - 12.3), Metal (abetlen#1247)
* Generate binary wheel index on release
* Add total release downloads badge
* Update download label
* Use official cibuildwheel action
* Add workflows to build CUDA and Metal wheels
* Update generate index workflow
* Update workflow name
* feat: Update llama.cpp
* chore: Bump version
* fix(ci): use correct script name
* docs: LLAMA_CUBLAS -> LLAMA_CUDA
* docs: Add docs explaining how to install pre-built wheels.
* docs: Rename cuBLAS section to CUDA
* fix(docs): incorrect tool_choice example (abetlen#1330)
* feat: Update llama.cpp
* fix: missing logprobs in response, incorrect response type for functionary, minor type issues. Closes abetlen#1328 abetlen#1314
* fix: missing logprobs in response, incorrect response type for functionary, minor type issues. Closes abetlen#1328 Closes abetlen#1314
* feat: Update llama.cpp
* fix: Always embed metal library. Closes abetlen#1332
* feat: Update llama.cpp
* chore: Bump version

Co-authored-by: Limour <93720049+Limour-dev@users.noreply.github.com>
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
Co-authored-by: lawfordp2017 <lawfordp@gmail.com>
Co-authored-by: Yuri Mikhailov <bitsharp@gmail.com>
Co-authored-by: ymikhaylov <ymikhaylov@x5.ru>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
francomattar added a commit to francomattar/Python-llama-cpp that referenced this issue on Dec 13, 2024
nayanzin33sergey added a commit to nayanzin33sergey/Python-llama-cpp that referenced this issue on Dec 18, 2024
jamesdev9 pushed a commit to jamesdev9/python-llama-cpp that referenced this issue on Dec 22, 2024
NoBrainer242 pushed a commit to NoBrainer242/Python-App-llama that referenced this issue on Dec 27, 2024
NoBrainer24 added a commit to NoBrainer24/Python-App-llama- that referenced this issue on Dec 28, 2024
Original issue description: Upstream llama.cpp supports the cache-type-k and cache-type-v settings, but the llama-cpp-python server does not.
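For reference, upstream llama.cpp exposes these as the -ctk/--cache-type-k and -ctv/--cache-type-v string options (f16, q8_0, q4_0, and so on), while the Python bindings added through the commits above take ggml_type enum values. A hedged sketch of the correspondence, assuming the GGML_TYPE_* constants are re-exported at the top level of the llama_cpp package:

```python
# Hedged sketch: map upstream llama.cpp --cache-type-k/--cache-type-v names
# onto the ggml_type constants the Python bindings accept (constant names
# assumed to be those exposed by the llama_cpp module).
import llama_cpp

CACHE_TYPES = {
    "f32":  llama_cpp.GGML_TYPE_F32,
    "f16":  llama_cpp.GGML_TYPE_F16,
    "q8_0": llama_cpp.GGML_TYPE_Q8_0,
    "q5_1": llama_cpp.GGML_TYPE_Q5_1,
    "q5_0": llama_cpp.GGML_TYPE_Q5_0,
    "q4_1": llama_cpp.GGML_TYPE_Q4_1,
    "q4_0": llama_cpp.GGML_TYPE_Q4_0,
}

# Rough equivalent of `--cache-type-k q8_0 --cache-type-v f16` in upstream llama.cpp:
llm = llama_cpp.Llama(
    model_path="./models/model.gguf",  # hypothetical path
    type_k=CACHE_TYPES["q8_0"],
    type_v=CACHE_TYPES["f16"],
)
```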