Regression. Unable to run any model. CRASH!!! #12075
Comments
Please try a clean build folder, and also check for any stale ggml-vulkan-shaders.* files in the source tree.
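For reference, a minimal sketch of that check (the build directory name and a CMake Vulkan build are assumptions; adjust to your own setup):

rm -rf build                               # start from a fresh build directory
find . -name 'ggml-vulkan-shaders.*'       # look for stale generated shader files left in the source tree
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release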
My RPM spec file for building. It uses a clean folder every time.
Did you delete any stale shader files in the source tree? I was wondering if it was an issue like #11788 (comment) but somehow manifesting as a runtime shader compile failure rather than a build-time failure. If that's not it, can you try to bisect to a commit?
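If it helps, a typical bisect run for this kind of regression could look like the following (the known-good commit is a placeholder; rebuild and re-run the failing llama-server command at each step):

git bisect start
git bisect bad HEAD
git bisect good <last-known-good-commit>   # placeholder: any commit where the model still ran
# rebuild, run the failing command, then mark the result:
git bisect good        # or: git bisect bad
git bisect reset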
I added a
After the build
Even after rebuilding, the failure is the same. I will bisect when I get the time, or create a debug build and find the exact line number where it fails.
The out of range error happens much later, after a failed shader compile. There should have been a message to stderr about what shader failed to compile. I think bisecting will be helpful, but there's a good chance this is a driver bug of some sort.
There are no other messages.
BTW, I am using Vulkan drivers from Mesa. Is there any way I can dump the shader compilation logs?
It's the broken tests: you built with the Vulkan result-checking option enabled. This is not the first case of someone enabling this; I guess the confusion is that it looks like a post-build validation test run?
Yes, that was it. I had enabled it thinking it was a post-build validation test; I turned it on recently because of a few Vulkan-related crashes at runtime. I think we should disable this option until it is stable.
I already fixed it in a feature I'm working on, but it's not yet ready to be merged. Even if it had not crashed, though, your program would have just run the tests instead of whatever you wanted it to do.
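For anyone else hitting the same crash, rebuilding without the result-checking option avoids it. A minimal configure sketch, assuming a standard CMake Vulkan build and that the option discussed above is GGML_VULKAN_CHECK_RESULTS:

# leave the check-results option at its default (OFF); it is a developer
# debugging aid that runs the Vulkan op tests instead of your workload
cmake -B build -DGGML_VULKAN=ON -DGGML_VULKAN_CHECK_RESULTS=OFF
cmake --build build --config Release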
Name and Version
llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7600 (RADV NAVI33) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | matrix cores: KHR_coopmat
version: 4778 (a82c9e7)
built with clang version 18.1.1 for x86_64-unknown-linux-gnu
Operating systems
Linux
GGML backends
Vulkan
Hardware
RX 7600
Models
agentica-org_DeepScaleR-1.5B-Preview-Q8_0.gguf
Problem description & steps to reproduce
gdb llama-server
GNU gdb (GDB) 13.2
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
https://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from llama-server...
(No debugging symbols found in llama-server)
(gdb) set args -t 1 --ctx-size 0 --no-kv-offload --port 8999 --n-predict 2048 --gpu-layers 128 -m ./LLM/agentica-org_DeepScaleR-1.5B-Preview-Q8_0.gguf
(gdb) run
Starting program: /usr/local/bin/llama-server -t 1 --ctx-size 0 --no-kv-offload --port 8999 --n-predict 2048 --gpu-layers 128 -m ./LLM/agentica-org_DeepScaleR-1.5B-Preview-Q8_0.gguf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffed16b700 (LWP 31050)]
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7600 (RADV NAVI33) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | matrix cores: KHR_coopmat
[New Thread 0x7fffec96a700 (LWP 31051)]
build: 4778 (a82c9e7) with clang version 18.1.1 for x86_64-unknown-linux-gnu
system info: n_threads = 1, n_threads_batch = 1, total_threads = 8
system_info: n_threads = 1 (n_threads_batch = 1) / 8 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
[New Thread 0x7fffe7fff700 (LWP 31052)]
[New Thread 0x7fffe77fe700 (LWP 31053)]
[New Thread 0x7fffe6ffd700 (LWP 31054)]
[New Thread 0x7fffe67fc700 (LWP 31055)]
[New Thread 0x7fffe5ffb700 (LWP 31056)]
[New Thread 0x7fffe57fa700 (LWP 31057)]
[New Thread 0x7fffe4ff9700 (LWP 31058)]
[New Thread 0x7fffdbfff700 (LWP 31059)]
main: HTTP server is listening, hostname: 127.0.0.1, port: 8999, http threads: 7
main: loading model
srv load_model: loading model './LLM/agentica-org_DeepScaleR-1.5B-Preview-Q8_0.gguf'
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 7600 (RADV NAVI33)) - 7936 MiB free
llama_model_loader: loaded meta data with 51 key-value pairs and 339 tensors from ./LLM/agentica-org_DeepScaleR-1.5B-Preview-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepScaleR 1.5B Preview
llama_model_loader: - kv 3: general.organization str = Agentica Org
llama_model_loader: - kv 4: general.finetune str = Preview
llama_model_loader: - kv 5: general.basename str = DeepScaleR
llama_model_loader: - kv 6: general.size_label str = 1.5B
llama_model_loader: - kv 7: general.license str = mit
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = DeepSeek R1 Distill Qwen 1.5B
llama_model_loader: - kv 10: general.base_model.0.organization str = Deepseek Ai
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/deepseek-ai/De...
llama_model_loader: - kv 12: general.dataset.count u32 = 4
llama_model_loader: - kv 13: general.dataset.0.name str = NuminaMath CoT
llama_model_loader: - kv 14: general.dataset.0.organization str = AI MO
llama_model_loader: - kv 15: general.dataset.0.repo_url str = https://huggingface.co/AI-MO/NuminaMa...
llama_model_loader: - kv 16: general.dataset.1.name str = Omni MATH
llama_model_loader: - kv 17: general.dataset.1.organization str = KbsdJames
llama_model_loader: - kv 18: general.dataset.1.repo_url str = https://huggingface.co/KbsdJames/Omni...
llama_model_loader: - kv 19: general.dataset.2.name str = STILL 3 Preview RL Data
llama_model_loader: - kv 20: general.dataset.2.organization str = RUC AIBOX
llama_model_loader: - kv 21: general.dataset.2.repo_url str = https://huggingface.co/RUC-AIBOX/STIL...
llama_model_loader: - kv 22: general.dataset.3.name str = Competition_Math
llama_model_loader: - kv 23: general.dataset.3.organization str = Hendrycks
llama_model_loader: - kv 24: general.dataset.3.repo_url str = https://huggingface.co/hendrycks/comp...
llama_model_loader: - kv 25: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 26: qwen2.block_count u32 = 28
llama_model_loader: - kv 27: qwen2.context_length u32 = 131072
llama_model_loader: - kv 28: qwen2.embedding_length u32 = 1536
llama_model_loader: - kv 29: qwen2.feed_forward_length u32 = 8960
llama_model_loader: - kv 30: qwen2.attention.head_count u32 = 12
llama_model_loader: - kv 31: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 32: qwen2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 33: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-r1-qwen
llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 151646
llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 45: general.quantization_version u32 = 2
llama_model_loader: - kv 46: general.file_type u32 = 7
llama_model_loader: - kv 47: quantize.imatrix.file str = /models_out/DeepScaleR-1.5B-Preview-G...
llama_model_loader: - kv 48: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 49: quantize.imatrix.entries_count i32 = 196
llama_model_loader: - kv 50: quantize.imatrix.chunks_count i32 = 128
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q8_0: 198 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 1.76 GiB (8.50 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 1536
print_info: n_layer = 28
print_info: n_head = 12
print_info: n_head_kv = 2
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 6
print_info: n_embd_k_gqa = 256
print_info: n_embd_v_gqa = 256
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 8960
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 1.5B
print_info: model params = 1.78 B
print_info: general.name = DeepScaleR 1.5B Preview
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token = 151643 '<|end▁of▁sentence|>'
print_info: EOT token = 151643 '<|end▁of▁sentence|>'
print_info: PAD token = 151643 '<|end▁of▁sentence|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|end▁of▁sentence|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[New Thread 0x7fffdb35a700 (LWP 31060)]
[New Thread 0x7fffd2b59700 (LWP 31061)]
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors: CPU_Mapped model buffer size = 236.47 MiB
load_tensors: Vulkan0 model buffer size = 1564.62 MiB
............................................................................
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 131072
llama_init_from_model: n_ctx_per_seq = 131072
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 10000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: kv_size = 131072, offload = 0, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init: CPU KV buffer size = 3584.00 MiB
llama_init_from_model: KV self size = 3584.00 MiB, K (f16): 1792.00 MiB, V (f16): 1792.00 MiB
llama_init_from_model: Vulkan_Host output buffer size = 0.58 MiB
llama_init_from_model: Vulkan0 compute buffer size = 302.75 MiB
llama_init_from_model: Vulkan_Host compute buffer size = 3332.01 MiB
llama_init_from_model: graph nodes = 986
llama_init_from_model: graph splits = 58
common_init_from_params: setting dry_penalty_last_n to ctx_size = 131072
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[New Thread 0x7fffd37fe700 (LWP 31073)]
[New Thread 0x7fffd2358700 (LWP 31075)]
[New Thread 0x7fffd1b57700 (LWP 31076)]
[Thread 0x7fffd2358700 (LWP 31075) exited]
[Thread 0x7fffd37fe700 (LWP 31073) exited]
[New Thread 0x7fffd1356700 (LWP 31077)]
[New Thread 0x7fffd0b55700 (LWP 31078)]
[Thread 0x7fffd1b57700 (LWP 31076) exited]
[Thread 0x7fffd1356700 (LWP 31077) exited]
[Thread 0x7fffd0b55700 (LWP 31078) exited]
terminate called after throwing an instance of 'std::out_of_range'
what(): unordered_map::at
Thread 1 "llama-server" received signal SIGABRT, Aborted.
0x00007ffff5471e35 in raise () from /lib64/libc.so.6
(gdb) where
#0 0x00007ffff5471e35 in raise () from /lib64/libc.so.6
#1 0x00007ffff545c895 in abort () from /lib64/libc.so.6
#2 0x00007ffff56a2bf9 in __gnu_cxx::__verbose_terminate_handler ()
at ../../../../gcc/libstdc++-v3/libsupc++/vterminate.cc:95
#3 0x00007ffff56ae26a in __cxxabiv1::__terminate (handler=<optimized out>)
at ../../../../gcc/libstdc++-v3/libsupc++/eh_terminate.cc:48
#4 0x00007ffff56ae2d5 in std::terminate () at ../../../../gcc/libstdc++-v3/libsupc++/eh_terminate.cc:58
#5 0x00007ffff56ae527 in __cxxabiv1::__cxa_throw (obj=<optimized out>,
tinfo=0x7ffff58141c8 <typeinfo for std::out_of_range>, dest=0x7ffff56c3440 <std::out_of_range::~out_of_range()>)
at ../../../../gcc/libstdc++-v3/libsupc++/eh_throw.cc:98
#6 0x00007ffff56a5500 in std::__throw_out_of_range (__s=0x5555559b819a "unordered_map::at")
at ../../../../../gcc/libstdc++-v3/src/c++11/functexcept.cc:86
#7 0x00005555559194e7 in ggml_pipeline_allocate_descriptor_sets(std::shared_ptr<vk_device_struct>&) ()
#8 0x00005555559398d9 in void ggml_vk_test_matmul<unsigned short, float>(ggml_backend_vk_context*, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, int, int) ()
#9 0x0000555555914ff4 in ggml_backend_vk_graph_compute(ggml_backend*, ggml_cgraph*) ()
#10 0x0000555555965417 in ggml_backend_sched_graph_compute_async ()
#11 0x000055555576fd14 in llama_graph_compute(llama_context&, ggml_cgraph*, int, ggml_threadpool*) ()
#12 0x000055555576c177 in llama_decode ()
#13 0x0000555555744d82 in common_init_from_params(common_params&) ()
#14 0x00005555555b6d05 in server_context::load_model(common_params const&) ()
#15 0x0000555555584b18 in main ()
(gdb)
First Bad Commit
Relevant log output