
Load all MoE experts during warmup #11571

Open
fairydreaming wants to merge 2 commits into master
Conversation

fairydreaming
Collaborator

This PR is a somewhat crude hack that allows all experts in MoE models to be loaded during warmup.

The hacky part is the warmup detection: I explicitly examine the ubatch tokens to detect warmup.
I couldn't find a better way to do it; let me know if one exists.

If the model is warming up, n_expert_used is set to n_expert, which causes all experts to be loaded into memory during warmup.

Fixes #11163
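
A minimal sketch of the idea, to make the description concrete (the detection condition and all identifiers below are assumptions for illustration, not the exact symbols from the diff):

```cpp
// Sketch only: identifiers are illustrative, not the exact llama.cpp symbols.
// The warmup run decodes a tiny batch built from the BOS and EOS tokens, so a
// 2-token ubatch of exactly [BOS, EOS] is treated as the warmup pass here.
static bool is_warmup_ubatch(const llama_ubatch & ubatch, llama_token bos, llama_token eos) {
    return ubatch.n_tokens == 2 && ubatch.token[0] == bos && ubatch.token[1] == eos;
}

// When building the MoE FFN part of the graph, route through every expert during
// warmup so that each expert tensor is read once (and therefore paged into memory):
const uint32_t n_expert_used = is_warmup_ubatch(ubatch, bos, eos)
        ? hparams.n_expert        // warmup: activate all experts
        : hparams.n_expert_used;  // normal inference: only the top-k experts
```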

@cpumaxx
Contributor

cpumaxx commented Feb 3, 2025

A quick test with R1 on llama-server shows all experts loaded into memory during warmup. Inference started immediately once the web interface was available.
I will try a test on a non-MoE large model as well to make sure there are no regressions in that case.
Thanks for this fix!

@jukofyork
Contributor

I can confirm this is working for me, and it loads a couple of times faster than letting the model warm up "naturally" (I can see it uses ~2.5 cores instead of ~0.5, possibly because it avoids random access on the SSD?).

@ggerganov
Member

> The hacky part is the warmup detection: I explicitly examine the ubatch tokens to detect warmup. I couldn't find a better way to do it; let me know if one exists.

I'll consider adding proper support for this in #11213.

@fairydreaming
Collaborator Author

> > The hacky part is the warmup detection: I explicitly examine the ubatch tokens to detect warmup. I couldn't find a better way to do it; let me know if one exists.
>
> I'll consider adding proper support for this in #11213.

@ggerganov if you are going to work on warmup, take a look at this: #11733

TL;DR: using a 1-token sequence (instead of the current 2-token BOS + EOS batch) for warmup fixes a token-generation performance bottleneck (+80% tg t/s with llama-3.1 70b f16) on dual Epyc systems.
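
A rough sketch of the change described there, assuming the usual warmup code path that decodes a small throwaway batch (the bos/eos/lctx variables and the exact API calls are assumptions, not the actual patch):

```cpp
// Sketch of the idea in #11733 - not the actual patch; helper names may differ.
// Today the warmup batch holds two tokens, [BOS, EOS]; decoding a single-token
// sequence instead avoids the bottleneck reported on dual-Epyc / NUMA systems.
std::vector<llama_token> tmp;
tmp.push_back(bos);     // proposed: one token is enough to touch the weights
// tmp.push_back(eos);  // current behaviour: a second token in the same batch
llama_decode(lctx, llama_batch_get_one(tmp.data(), tmp.size()));
```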

Successfully merging this pull request may close the following issue:

Misc. bug: model warmup doesn't work correctly for MoE models