CausalLM ERROR: byte not found in vocab #840
Comments
Same |
I am glad to know it's not just me. |
I am facing the same issue using the Q5_0.gguf. The missing byte in the vocab varies. Sometimes all the spaces between words are replaced with !, other times there are no spaces between words in the output. |
This has already been discussed in llama.cpp. The team behind CausalLM and TheBloke are aware of this issue, which is caused by the "non-standard" vocabulary the model uses. The last time I tried, CPU inference with the GGUF files was already working. According to the latest comments on one of the llama.cpp issues about this model, inference seems to be running fine on GPU too: ggml-org/llama.cpp#3740 |
Got it working. It's quite straightforward; just follow the steps below. Try reinstalling llama-cpp-python as follows (I would advise using a Python virtual environment, but that's a different topic):
CMAKE_ARGS='-DLLAMA_CUBLAS=on -DLLAMA_CUDA_MMV_Y=8 -DCMAKE_CUDA_ARCHITECTURES=native' CUDACXX=/usr/local/cuda-11.8/bin/nvcc FORCE_CMAKE=1 pip install git+https://github.com/abetlen/llama-cpp-python.git --force-reinstall --no-cache-dir --verbose --upgrade
Set CUDACXX to the path of the nvcc version you intend to use. I have versions 11.8 and 12 installed and succeeded with 11.8 (I didn't try building with version 12). Afterwards, generation with CausalLM 14B works smoothly; a minimal Python usage sketch follows the log below.
python llama_cpp_server_v2.py causallm
Model causallm is not in use, starting...
Starting model causallm
······················ Settings for causallm ······················
> model: /root/code/localai/models/causallm_14b.Q5_1.gguf
> model_alias: causallm
> seed: 4294967295
> n_ctx: 8192
> n_batch: 128
> n_gpu_layers: 45
> main_gpu: 0
> rope_freq_base: 0.0
> rope_freq_scale: 1.0
> mul_mat_q: 1
> f16_kv: 1
> logits_all: 1
> vocab_only: 0
> use_mmap: 1
> use_mlock: 1
> embedding: 1
> n_threads: 4
> last_n_tokens_size: 128
> numa: 0
> chat_format: chatml
> cache: 0
> cache_type: ram
> cache_size: 2147483648
> verbose: 1
> host: 0.0.0.0
> port: 8040
> interrupt_requests: 1
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
llama_model_loader: loaded meta data with 21 key-value pairs and 363 tensors from /root/code/localai/models/causallm_14b.Q5_1.gguf (version unknown)
llama_model_loader: - tensor 0: token_embd.weight q5_1 [ 5120, 152064, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_q.weight q5_1 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.attn_k.weight q5_1 [ 5120, 5120, 1, 1 ]
[...]
llm_load_print_meta: model ftype = mostly Q5_1
llm_load_print_meta: model params = 14.17 B
llm_load_print_meta: model size = 9.95 GiB (6.03 BPW)
llm_load_print_meta: general.name = causallm_14b
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151643 '<|endoftext|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_tensors: ggml ctx size = 0.12 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 557.00 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 9629.41 MB
...........................................................................................
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 6400.00 MB
llama_new_context_with_model: kv self size = 6400.00 MB
llama_new_context_with_model: compute buffer total size = 177.63 MB
llama_new_context_with_model: VRAM scratch buffer: 171.50 MB
llama_new_context_with_model: total VRAM used: 16200.92 MB (model: 9629.41 MB, context: 6571.50 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
INFO: Started server process [735011]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8040 (Press CTRL+C to quit)
INFO: 127.0.0.1:52746 - "OPTIONS / HTTP/1.0" 200 OK
INFO: 127.0.0.1:47452 - "OPTIONS / HTTP/1.0" 200 OK
INFO: 127.0.0.1:47456 - "OPTIONS / HTTP/1.0" 200 OK
INFO: 127.0.0.1:47468 - "OPTIONS / HTTP/1.0" 200 OK
INFO: 127.0.0.1:47484 - "OPTIONS / HTTP/1.0" 200 OK
INFO: 127.0.0.1:47500 - "OPTIONS / HTTP/1.0" 200 OK
INFO: 127.0.0.1:52300 - "OPTIONS / HTTP/1.0" 200 OK
INFO: 127.0.0.1:52302 - "OPTIONS / HTTP/1.0" 200 OK
·································· Prompt ChatML ··································
<|im_start|>system
You are a helpful assistant. You can help me by answering my questions. You can also ask me questions.<|im_end|>
<|im_start|>user
Is this chat working?<|im_end|>
<|im_start|>assistant
INFO: 127.0.0.1:52314 - "POST /v1/chat/completions HTTP/1.1" 200 OK
llama_print_timings: load time = 173.04 ms
llama_print_timings: sample time = 12.94 ms / 29 runs ( 0.45 ms per token, 2240.77 tokens per second)
llama_print_timings: prompt eval time = 172.94 ms / 64 tokens ( 2.70 ms per token, 370.07 tokens per second)
llama_print_timings: eval time = 434.08 ms / 28 runs ( 15.50 ms per token, 64.50 tokens per second)
llama_print_timings: total time = 1158.61 ms
INFO: 127.0.0.1:52328 - "OPTIONS / HTTP/1.0" 200 OK
INFO: 127.0.0.1:52334 - "OPTIONS / HTTP/1.0" 200 OK
Feel free to ping me if you don't succeed. |
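As referenced above, here is a minimal sketch for anyone who prefers calling the library directly instead of the server script. It mirrors the settings from the log (model path, context size, layer offload, ChatML format) and assumes the CUDA rebuild described earlier; the path and layer count are assumptions about your local setup and will likely need adjusting.

```python
# Minimal sketch (not the server script above): load the CausalLM GGUF with a
# CUDA-enabled build of llama-cpp-python and run one ChatML chat completion.
# model_path and n_gpu_layers are copied from the log and are assumptions
# about your local setup.
from llama_cpp import Llama

llm = Llama(
    model_path="/root/code/localai/models/causallm_14b.Q5_1.gguf",
    n_ctx=8192,
    n_gpu_layers=45,       # offload as many layers as your VRAM allows
    chat_format="chatml",  # CausalLM 14B expects the ChatML prompt format
    verbose=True,
)

result = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Is this chat working?"},
    ],
    max_tokens=64,
)
print(result["choices"][0]["message"]["content"])
```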
On Windows, use UTF-16 for the input, since UTF-8 does not work and multibyte characters get read as zeros. |
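In Python terms, a tiny sketch of that workaround, with prompt.txt standing in as a placeholder for whatever input file you are feeding the model:

```python
# Sketch: on Windows, decode the input file as UTF-16 so multibyte characters
# survive instead of being read as zeros. "prompt.txt" is a placeholder name.
with open("prompt.txt", "r", encoding="utf-16") as f:
    prompt = f.read()
```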
Odd, I'm still getting the crash with that error. I'm trying to apply this fix to the text-generation-webui Docker container, running the steps in Portainer on the container. Edit: never mind, nvcc was missing from that directory |
I'm assuming for AMD GPUs the command would be different (due to the lack of nvcc). I'm currently getting the error for llama-cpp-python 0.2.18 with a 6800 XT on Manjaro Linux (CPU works fine). |
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
Run the CausalLM model without errors
https://huggingface.co/TheBloke/CausalLM-14B-GGUF/tree/main
Current Behavior
Error when loading the model (llama-cpp-python installed through pip, not from source)
ggml-org/llama.cpp#3732
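A minimal repro sketch of the failing load, assuming one of the GGUF files from the link above (the Q5_1 quant here; the path is a placeholder) and a plain pip install of llama-cpp-python:

```python
# Sketch of the failing step: with a stock pip install, loading the CausalLM
# GGUF raises "ERROR: byte not found in vocab" because of the model's
# non-standard vocabulary. The path is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="./causallm_14b.Q5_1.gguf")
```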
Environment and Context
Docker container with the latest llama-cpp-python
Failure Information (for bugs)
Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.
Steps to Reproduce
Failure Logs