Name and Version
❯ llama-cli --version
version: 4568 (a4417dd)
built with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.2.0
Operating systems
Mac Studio M2 Ultra (192GB)
Which llama.cpp modules do you know to be affected?
llama-server
Command line
Problem description & steps to reproduce
Description: I encountered an issue when trying to run multiple concurrent inferences with llama-server. Even though I set the --parallel flag to 8, the server processes only 6 inferences concurrently; the remaining requests stay in a waiting queue until one of the active inferences completes.
Steps to Reproduce:
Launch the Server:
Execute the following command:
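(The exact command did not survive in this report. A plausible invocation, reconstructed from the server log below — same model file, --ctx-size 81920, --parallel 8, flash attention enabled, listening on 127.0.0.1:8080 — would be roughly:)

llama-server -m DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf --ctx-size 81920 --parallel 8 --flash-attn --host 127.0.0.1 --port 8080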
Prepare the Requests:
Open 8 browser windows and navigate to http://127.0.0.1:8080/.
Send Prompts:
In each browser window, paste the desired prompt into the input textbox. After all windows have the prompt ready, click the “Send” button simultaneously in all 8 windows.
Observe the Behavior:
Only 6 requests begin processing immediately. The remaining 2 are queued and will only start once one of the initial 6 inferences finishes.
Expected Behavior: All 8 inferences should run concurrently as specified by the --parallel flag.
Actual Behavior: Even with --parallel set to 8 (and regardless of the --ctx-size value, e.g. 18), only 6 inferences run concurrently. The rest remain queued until one of the active requests completes. RAM usage is about 50 GB, which rules out a memory issue.
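To check whether the 6-request ceiling is imposed by the server or by the browser, the same load can be generated outside a browser. A rough shell sketch against the server's OpenAI-compatible endpoint (the prompt and output file names are placeholders):

# fire 8 chat-completion requests at once and wait for all of them
for i in $(seq 1 8); do
  curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"write a short poem about cats"}]}' \
    -o "response_$i.json" &
done
wait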
Any insights or suggestions would be appreciated. Thank you!
First Bad Commit
No response
Relevant log output
build: 4568 (a4417ddd) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.2.0
system info: n_threads = 16, n_threads_batch = 16, total_threads = 24
system_info: n_threads = 16 (n_threads_batch = 16) / 24 | Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 |
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 23
main: loading model
srv load_model: loading model 'DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf'
llama_model_load_from_file_impl: using device Metal (Apple M2 Ultra) - 147455 MiB free
llama_model_loader: loaded meta data with 32 key-value pairs and 292 tensors from DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek R1 Distill Llama 8B
llama_model_loader: - kv 3: general.organization str = Deepseek Ai
llama_model_loader: - kv 4: general.basename str = DeepSeek-R1-Distill-Llama
llama_model_loader: - kv 5: general.size_label str = 8B
llama_model_loader: - kv 6: llama.block_count u32 = 32
llama_model_loader: - kv 7: llama.context_length u32 = 131072
llama_model_loader: - kv 8: llama.embedding_length u32 = 4096
llama_model_loader: - kv 9: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 10: llama.attention.head_count u32 = 32
llama_model_loader: - kv 11: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 12: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 13: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 14: llama.attention.key_length u32 = 128
llama_model_loader: - kv 15: llama.attention.value_length u32 = 128
llama_model_loader: - kv 16: general.file_type u32 = 7
llama_model_loader: - kv 17: llama.vocab_size u32 = 128256
llama_model_loader: - kv 18: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 19: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 20: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 22: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 23: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 24: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 27: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 28: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 29: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 30: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - type f32: 66 tensors
llama_model_loader: - type q8_0: 226 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 7.95 GiB (8.50 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 4096
print_info: n_layer = 32
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 14336
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 8B
print_info: model params = 8.03 B
print_info: general.name = DeepSeek R1 Distill Llama 8B
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin▁of▁sentence|>'
print_info: EOS token = 128001 '<|end▁of▁sentence|>'
print_info: EOT token = 128001 '<|end▁of▁sentence|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: PAD token = 128004 '<|finetune_right_pad_id|>'
print_info: LF token = 128 'Ä'
print_info: EOG token = 128001 '<|end▁of▁sentence|>'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors: CPU_Mapped model buffer size = 532.31 MiB
load_tensors: Metal_Mapped model buffer size = 8137.65 MiB
llama_init_from_model: n_seq_max = 8
llama_init_from_model: n_ctx = 81920
llama_init_from_model: n_ctx_per_seq = 10240
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 1
llama_init_from_model: freq_base = 500000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (10240) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name: Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets = true
ggml_metal_init: has bfloat = true
ggml_metal_init: use bfloat = false
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 154618.82 MB
ggml_metal_init: skipping kernel_get_rows_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16 (not supported)
llama_kv_cache_init: kv_size = 81920, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
llama_kv_cache_init: Metal KV buffer size = 10240.00 MiB
llama_init_from_model: KV self size = 10240.00 MiB, K (f16): 5120.00 MiB, V (f16): 5120.00 MiB
llama_init_from_model: CPU output buffer size = 3.91 MiB
llama_init_from_model: Metal compute buffer size = 272.00 MiB
llama_init_from_model: CPU compute buffer size = 168.01 MiB
llama_init_from_model: graph nodes = 903
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 81920
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 8
slot init: id 0 | task -1 | new slot n_ctx_slot = 10240
slot init: id 1 | task -1 | new slot n_ctx_slot = 10240
slot init: id 2 | task -1 | new slot n_ctx_slot = 10240
slot init: id 3 | task -1 | new slot n_ctx_slot = 10240
slot init: id 4 | task -1 | new slot n_ctx_slot = 10240
slot init: id 5 | task -1 | new slot n_ctx_slot = 10240
slot init: id 6 | task -1 | new slot n_ctx_slot = 10240
slot init: id 7 | task -1 | new slot n_ctx_slot = 10240
main: model loaded
main: chat template, chat_template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '' + '\n' + tool['function']['arguments'] + '\n' + '' + '<|tool▁call▁end|>'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '' + '\n' + tool['function']['arguments'] + '\n' + '' + '<|tool▁call▁end|>'}}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}, example_format: 'You are a helpful assistant<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
I can reproduce this. It seems the 7th and 8th requests are not processed by the httplib handlers at all, so it's likely a misconfiguration or an issue with how we use httplib, but I'm not sure. Pinging @ngxson
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'test',
  baseURL: 'http://localhost:8080/v1',
});

async function cmpl(onToken) {
  const stream = await client.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: 'write a short poem about cats' }],
    stream: true,
  });
  for await (const chunk of stream) {
    // process.stdout.write(chunk.choices[0]?.delta?.content || '');
    onToken(chunk.choices[0]?.delta?.content || '');
  }
  // process.stdout.write('\n');
}

async function main() {
  const np = 8;
  const counters = Array(np).fill(0);
  const printCounters = () => { console.log(counters); };
  const wrapper = async (i) => {
    await cmpl((token) => {
      counters[i] += token.length;
      printCounters();
    });
  };
  await Promise.all(Array(np).fill(0).map((_, i) => wrapper(i)));
}

main();
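Assuming Node 18+ with the openai npm package installed (npm install openai), saving this as e.g. repro.mjs (the file name is arbitrary) and running node repro.mjs prints the per-request character counters; with the behavior described above, only six of the eight counters advance at any one time.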