CausalLM ERROR: byte not found in vocab #840
Comments
Same |
I am glad to know it's not just me. |
I am facing the same issue using the Q5_0.gguf. The missing byte in the vocab varies. Sometimes all the spaces between words are replaced with !, other times there are no spaces between words in the output. |
This has already been discussed in llama.cpp. The team behind CausalLM and TheBloke are aware of this issue, which is caused by the "non-standard" vocabulary the model uses. The last time I tried, CPU inference with the GGUF files was already working. According to the latest comments on one of the llama.cpp issues about this model, inference seems to be running fine on GPU too: ggml-org/llama.cpp#3740 |
Got it working. It's quite straightforward; just follow the steps below. Try reinstalling llama-cpp-python as follows (I would advise using a Python virtual environment, but that's a different topic):
CMAKE_ARGS='-DLLAMA_CUBLAS=on -DLLAMA_CUDA_MMV_Y=8 -DCMAKE_CUDA_ARCHITECTURES=native' CUDACXX=/usr/local/cuda-11.8/bin/nvcc FORCE_CMAKE=1 pip install git+https://github.com/abetlen/llama-cpp-python.git --force-reinstall --no-cache-dir --verbose --upgrade
Set CUDACXX to the path of the nvcc version you intend to use. I have versions 11.8 and 12 installed and succeeded with 11.8 (I didn't try building with version 12). Afterwards, generation with CausalLM 14B works smoothly; a minimal Python usage sketch follows the log below.
python llama_cpp_server_v2.py causallm
Model causallm is not in use, starting...
Starting model causallm
······················ Settings for causallm ······················
> model: /root/code/localai/models/causallm_14b.Q5_1.gguf
> model_alias: causallm
> seed: 4294967295
> n_ctx: 8192
> n_batch: 128
> n_gpu_layers: 45
> main_gpu: 0
> rope_freq_base: 0.0
> rope_freq_scale: 1.0
> mul_mat_q: 1
> f16_kv: 1
> logits_all: 1
> vocab_only: 0
> use_mmap: 1
> use_mlock: 1
> embedding: 1
> n_threads: 4
> last_n_tokens_size: 128
> numa: 0
> chat_format: chatml
> cache: 0
> cache_type: ram
> cache_size: 2147483648
> verbose: 1
> host: 0.0.0.0
> port: 8040
> interrupt_requests: 1
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
llama_model_loader: loaded meta data with 21 key-value pairs and 363 tensors from /root/code/localai/models/causallm_14b.Q5_1.gguf (version unknown)
llama_model_loader: - tensor 0: token_embd.weight q5_1 [ 5120, 152064, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_q.weight q5_1 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.attn_k.weight q5_1 [ 5120, 5120, 1, 1 ]
[...]
llm_load_print_meta: model ftype = mostly Q5_1
llm_load_print_meta: model params = 14.17 B
llm_load_print_meta: model size = 9.95 GiB (6.03 BPW)
llm_load_print_meta: general.name = causallm_14b
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151643 '<|endoftext|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_tensors: ggml ctx size = 0.12 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 557.00 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 9629.41 MB
...........................................................................................
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 6400.00 MB
llama_new_context_with_model: kv self size = 6400.00 MB
llama_new_context_with_model: compute buffer total size = 177.63 MB
llama_new_context_with_model: VRAM scratch buffer: 171.50 MB
llama_new_context_with_model: total VRAM used: 16200.92 MB (model: 9629.41 MB, context: 6571.50 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
INFO: Started server process [735011]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8040 (Press CTRL+C to quit)
INFO: 127.0.0.1:52746 - "OPTIONS / HTTP/1.0" 200 OK
INFO: 127.0.0.1:47452 - "OPTIONS / HTTP/1.0" 200 OK
INFO: 127.0.0.1:47456 - "OPTIONS / HTTP/1.0" 200 OK
INFO: 127.0.0.1:47468 - "OPTIONS / HTTP/1.0" 200 OK
INFO: 127.0.0.1:47484 - "OPTIONS / HTTP/1.0" 200 OK
INFO: 127.0.0.1:47500 - "OPTIONS / HTTP/1.0" 200 OK
INFO: 127.0.0.1:52300 - "OPTIONS / HTTP/1.0" 200 OK
INFO: 127.0.0.1:52302 - "OPTIONS / HTTP/1.0" 200 OK
·································· Prompt ChatML ··································
<|im_start|>system
You are a helpful assistant. You can help me by answering my questions. You can also ask me questions.<|im_end|>
<|im_start|>user
Is this chat working?<|im_end|>
<|im_start|>assistant
INFO: 127.0.0.1:52314 - "POST /v1/chat/completions HTTP/1.1" 200 OK
llama_print_timings: load time = 173.04 ms
llama_print_timings: sample time = 12.94 ms / 29 runs ( 0.45 ms per token, 2240.77 tokens per second)
llama_print_timings: prompt eval time = 172.94 ms / 64 tokens ( 2.70 ms per token, 370.07 tokens per second)
llama_print_timings: eval time = 434.08 ms / 28 runs ( 15.50 ms per token, 64.50 tokens per second)
llama_print_timings: total time = 1158.61 ms
INFO: 127.0.0.1:52328 - "OPTIONS / HTTP/1.0" 200 OK
INFO: 127.0.0.1:52334 - "OPTIONS / HTTP/1.0" 200 OK
Feel free to ping me if you don't succeed. |
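As referenced above, here is a minimal sketch for anyone who prefers calling the library directly instead of the server script. It mirrors the settings from the log (model path, context size, layer offload, ChatML format) and assumes the CUDA rebuild described earlier; the path and layer count are assumptions about your local setup and will likely need adjusting.

```python
# Minimal sketch (not the server script above): load the CausalLM GGUF with a
# CUDA-enabled build of llama-cpp-python and run one ChatML chat completion.
# model_path and n_gpu_layers are copied from the log and are assumptions
# about your local setup.
from llama_cpp import Llama

llm = Llama(
    model_path="/root/code/localai/models/causallm_14b.Q5_1.gguf",
    n_ctx=8192,
    n_gpu_layers=45,       # offload as many layers as your VRAM allows
    chat_format="chatml",  # CausalLM 14B expects the ChatML prompt format
    verbose=True,
)

result = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Is this chat working?"},
    ],
    max_tokens=64,
)
print(result["choices"][0]["message"]["content"])
```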
On Windows, use UTF-16 for the input, since UTF-8 does not work and multibyte characters get read as zeros. |
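In Python terms, a tiny sketch of that workaround, with prompt.txt standing in as a placeholder for whatever input file you are feeding the model:

```python
# Sketch: on Windows, decode the input file as UTF-16 so multibyte characters
# survive instead of being read as zeros. "prompt.txt" is a placeholder name.
with open("prompt.txt", "r", encoding="utf-16") as f:
    prompt = f.read()
```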
Odd, I'm still getting the crash with that error. I'm trying to apply this fix to the text-generation-webui Docker container, running the steps in Portainer on the container. Edit: never mind, nvcc was missing from that directory |
I'm assuming for AMD GPUs the command would be different (due to the lack of nvcc). I'm currently getting the error for llama-cpp-python 0.2.18 with a 6800 XT on Manjaro Linux (CPU works fine). |
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
Run the CausalLM model without errors
https://huggingface.co/TheBloke/CausalLM-14B-GGUF/tree/main
Current Behavior
Error when loading the model (llama-cpp-python installed through pip, not from source)
ggml-org/llama.cpp#3732
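A minimal repro sketch of the failing load, assuming one of the GGUF files from the link above (the Q5_1 quant here; the path is a placeholder) and a plain pip install of llama-cpp-python:

```python
# Sketch of the failing step: with a stock pip install, loading the CausalLM
# GGUF raises "ERROR: byte not found in vocab" because of the model's
# non-standard vocabulary. The path is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="./causallm_14b.Q5_1.gguf")
```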
Environment and Context
Docker container with the latest llama-cpp-python
Failure Information (for bugs)
Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.
Steps to Reproduce
Failure Logs