Name and Version
❯ llama-cli --version
version: 4568 (a4417dd)
built with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.2.0
Operating systems
Mac Studio M2 Ultra (192GB)
Which llama.cpp modules do you know to be affected?
llama-server
Command line
Problem description & steps to reproduce
Description: I encountered an issue when trying to run multiple concurrent inferences with llama-server. Even though I set the --parallel flag to 8, the server processes only 6 inferences concurrently; the remaining requests stay in a waiting queue until one of the active inferences completes.
Steps to Reproduce:
Launch the Server:
Execute the following command:
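(The exact command did not survive in this report. A plausible invocation, reconstructed from the server log below — same model file, --ctx-size 81920, --parallel 8, flash attention enabled, listening on 127.0.0.1:8080 — would be roughly:)

llama-server -m DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf --ctx-size 81920 --parallel 8 --flash-attn --host 127.0.0.1 --port 8080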
Prepare the Requests:
Open 8 browser windows and navigate to http://127.0.0.1:8080/.
Send Prompts:
In each browser window, paste the desired prompt into the input textbox. After all windows have the prompt ready, click the “Send” button simultaneously in all 8 windows.
Observe the Behavior:
Only 6 requests begin processing immediately. The remaining 2 are queued and will only start once one of the initial 6 inferences finishes.
Expected Behavior: All 8 inferences should run concurrently as specified by the --parallel flag.
Actual Behavior: Even with --parallel set to 8 (and regardless of the --ctx-size value, e.g. 18), only 6 inferences run concurrently. The rest remain queued until one of the active requests completes. RAM usage is about 50 GB, which rules out a memory issue.
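To check whether the 6-request ceiling is imposed by the server or by the browser, the same load can be generated outside a browser. A rough shell sketch against the server's OpenAI-compatible endpoint (the prompt and output file names are placeholders):

# fire 8 chat-completion requests at once and wait for all of them
for i in $(seq 1 8); do
  curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"write a short poem about cats"}]}' \
    -o "response_$i.json" &
done
wait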
Any insights or suggestions would be appreciated. Thank you!
First Bad Commit
No response
Relevant log output
build: 4568 (a4417ddd) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.2.0
system info: n_threads = 16, n_threads_batch = 16, total_threads = 24
system_info: n_threads = 16 (n_threads_batch = 16) / 24 | Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 |
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 23
main: loading model
srv load_model: loading model 'DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf'
llama_model_load_from_file_impl: using device Metal (Apple M2 Ultra) - 147455 MiB free
llama_model_loader: loaded meta data with 32 key-value pairs and 292 tensors from DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek R1 Distill Llama 8B
llama_model_loader: - kv 3: general.organization str = Deepseek Ai
llama_model_loader: - kv 4: general.basename str = DeepSeek-R1-Distill-Llama
llama_model_loader: - kv 5: general.size_label str = 8B
llama_model_loader: - kv 6: llama.block_count u32 = 32
llama_model_loader: - kv 7: llama.context_length u32 = 131072
llama_model_loader: - kv 8: llama.embedding_length u32 = 4096
llama_model_loader: - kv 9: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 10: llama.attention.head_count u32 = 32
llama_model_loader: - kv 11: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 12: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 13: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 14: llama.attention.key_length u32 = 128
llama_model_loader: - kv 15: llama.attention.value_length u32 = 128
llama_model_loader: - kv 16: general.file_type u32 = 7
llama_model_loader: - kv 17: llama.vocab_size u32 = 128256
llama_model_loader: - kv 18: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 19: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 20: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 22: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 23: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 24: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 27: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 28: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 29: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 30: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - type f32: 66 tensors
llama_model_loader: - type q8_0: 226 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 7.95 GiB (8.50 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 4096
print_info: n_layer = 32
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 14336
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 8B
print_info: model params = 8.03 B
print_info: general.name = DeepSeek R1 Distill Llama 8B
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin▁of▁sentence|>'
print_info: EOS token = 128001 '<|end▁of▁sentence|>'
print_info: EOT token = 128001 '<|end▁of▁sentence|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: PAD token = 128004 '<|finetune_right_pad_id|>'
print_info: LF token = 128 'Ä'
print_info: EOG token = 128001 '<|end▁of▁sentence|>'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors: CPU_Mapped model buffer size = 532.31 MiB
load_tensors: Metal_Mapped model buffer size = 8137.65 MiB
llama_init_from_model: n_seq_max = 8
llama_init_from_model: n_ctx = 81920
llama_init_from_model: n_ctx_per_seq = 10240
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 1
llama_init_from_model: freq_base = 500000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (10240) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name: Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets = true
ggml_metal_init: has bfloat = true
ggml_metal_init: use bfloat = false
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 154618.82 MB
ggml_metal_init: skipping kernel_get_rows_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16 (not supported)
llama_kv_cache_init: kv_size = 81920, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
llama_kv_cache_init: Metal KV buffer size = 10240.00 MiB
llama_init_from_model: KV self size = 10240.00 MiB, K (f16): 5120.00 MiB, V (f16): 5120.00 MiB
llama_init_from_model: CPU output buffer size = 3.91 MiB
llama_init_from_model: Metal compute buffer size = 272.00 MiB
llama_init_from_model: CPU compute buffer size = 168.01 MiB
llama_init_from_model: graph nodes = 903
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 81920
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 8
slot init: id 0 | task -1 | new slot n_ctx_slot = 10240
slot init: id 1 | task -1 | new slot n_ctx_slot = 10240
slot init: id 2 | task -1 | new slot n_ctx_slot = 10240
slot init: id 3 | task -1 | new slot n_ctx_slot = 10240
slot init: id 4 | task -1 | new slot n_ctx_slot = 10240
slot init: id 5 | task -1 | new slot n_ctx_slot = 10240
slot init: id 6 | task -1 | new slot n_ctx_slot = 10240
slot init: id 7 | task -1 | new slot n_ctx_slot = 10240
main: model loaded
main: chat template, chat_template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '' + '\n' + tool['function']['arguments'] + '\n' + '' + '<|tool▁call▁end|>'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '' + '\n' + tool['function']['arguments'] + '\n' + '' + '<|tool▁call▁end|>'}}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}, example_format: 'You are a helpful assistant<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
I can reproduce this. It seems the 7th and 8th requests are not processed by the httplib handlers at all, so it's likely a misconfiguration or an issue with how we use httplib, but I'm not sure. Pinging @ngxson
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'test',
  baseURL: 'http://localhost:8080/v1',
});

async function cmpl(onToken) {
  const stream = await client.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: 'write a short poem about cats' }],
    stream: true,
  });
  for await (const chunk of stream) {
    // process.stdout.write(chunk.choices[0]?.delta?.content || '');
    onToken(chunk.choices[0]?.delta?.content || '');
  }
  // process.stdout.write('\n');
}

async function main() {
  const np = 8;
  const counters = Array(np).fill(0);
  const printCounters = () => { console.log(counters); };
  const wrapper = async (i) => {
    await cmpl((token) => {
      counters[i] += token.length;
      printCounters();
    });
  };
  await Promise.all(Array(np).fill(0).map((_, i) => wrapper(i)));
}

main();
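Assuming Node 18+ with the openai npm package installed (npm install openai), saving this as e.g. repro.mjs (the file name is arbitrary) and running node repro.mjs prints the per-request character counters; with the behavior described above, only six of the eight counters advance at any one time.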