easy-llama wishlist #10

Open
oobabooga opened this issue Jan 24, 2025 · 12 comments

@oobabooga

oobabooga commented Jan 24, 2025

I'm interested in potentially replacing llama-cpp-python with easy-llama in my project, and have some questions about feature parity:

  1. Are all parameters in the params dict below available?

https://github.com/oobabooga/text-generation-webui/blob/096272f49e55357a364ed9016357b97829dae0fd/modules/llamacpp_model.py#L88

  2. Is it possible to get the logits after a certain input? As done here:

https://github.com/oobabooga/text-generation-webui/blob/096272f49e55357a364ed9016357b97829dae0fd/modules/llamacpp_model.py#L134

  3. Similar to 2. but more nuanced: is there a way to get the logits for every token position in an input at once? In llama-cpp-python, this is done by passing logits_all=True while loading the model, which reduces performance but makes all logits available as a matrix when you get them with model.eval_logits. I have used this feature to measure the perplexity of llama.cpp quants a while ago using the code here:

https://github.com/oobabooga/text-generation-webui/blob/096272f49e55357a364ed9016357b97829dae0fd/modules/llamacpp_hf.py#L133

  4. I have a llamacpp_HF wrapper that connects llama.cpp to HF text generation functions; at its core, all it does is update model.n_tokens to do prefix matching and evaluate new tokens by calling model.eval with a list containing only the new tokens. Can that be done with easy-llama? (A minimal sketch of this pattern follows the list below.) See:

https://github.com/oobabooga/text-generation-webui/blob/096272f49e55357a364ed9016357b97829dae0fd/modules/llamacpp_hf.py#L118

  5. Is speculative decoding implemented? There is a PR here https://github.com/oobabooga/text-generation-webui/pull/6669/files to add it, and having it in easy-llama would be great, especially if it could be done in a simple way by just passing new kwargs to its model loading and/or generation functions. I believe doing that for my llamacpp_HF wrapper would be very hard, so that's not something I have hopes for.
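
For reference (question 4), the wrapper's core pattern boils down to something like the minimal sketch below. Illustrative names only — it assumes a llama-cpp-python-style model with a writable n_tokens attribute and an eval() method, as used in the linked wrapper.

```python
from typing import Sequence

def eval_with_prefix_reuse(model, past_tokens: list[int], input_ids: Sequence[int]) -> list[int]:
    """Evaluate only the tokens that aren't already covered by the KV cache."""
    # longest common prefix between the previously evaluated tokens and the new input
    prefix_len = 0
    for old, new in zip(past_tokens, input_ids):
        if old != new:
            break
        prefix_len += 1

    # roll the cache position back to the shared prefix, then evaluate only the rest
    model.n_tokens = prefix_len
    new_tokens = list(input_ids[prefix_len:])
    if new_tokens:
        model.eval(new_tokens)

    # the caller keeps this as past_tokens for the next call
    return list(input_ids)
```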

If you are interested, a PR changing llama-cpp-python to easy-llama in my repository would be highly welcome once wheels are available. It would be a way to test the library as well. But I can also try to make the change myself.

@oobabooga
Author

oobabooga commented Jan 24, 2025

Also, one more thing:

  6. At some point, llama.cpp introduced a change that optimized prompt processing on newer GPUs, but it also increased VRAM usage slightly and made things slower for older GPUs. This is enabled by default in llama.cpp, but it can be disabled by compiling with -DGGML_CUDA_FORCE_MMQ=ON. I wonder if it's possible to make that an optional parameter in easy-llama, to avoid having to compile two versions of the library like I do.

@ddh0
Owner

ddh0 commented Jan 24, 2025

  1. Are all parameters in the params dict below available?

mul_mat_q is not available because it is no longer part of the llama API. numa is implemented in the libllama bindings but is not currently exposed at a higher level - but that wouldn't be hard to add at all. All of the other parameters there are available.

  2. Is it possible to get the logits after a certain input?

Yes, but it would have to be done manually using the low-level API, which would not be very fun. Currently, using a Llama, one can only get logits for the last token in the context. One option might be to record the logits token-by-token as the model generates. If that's not an effective solution, then support for logits_all could be implemented without too much headache.

  3. Is there a way to get the logits for every token position in an input at once?

Same answer as above, if I understand correctly.

  4. I have a llamacpp_HF wrapper that connects llama.cpp to HF text generation functions; at its core, all it does is update model.n_tokens to do prefix matching and evaluate new tokens by calling model.eval with a list containing only the new tokens. Can that be done with easy-llama?

The code you linked conceptually looks a lot like this snippet, which is part of the generate_single, generate, and stream methods of Llama:

# find how many tokens in the input are already in the KV cache
self.pos = self._first_valid_pos(input_tokens)

# remove all tokens that are past that point
self.kv_cache_seq_rm(0, self.pos, -1)

actual_input_tokens = input_tokens[self.pos:] # tokens after self.pos
self.context_tokens = input_tokens[:self.pos] # tokens already processed

n_actual_input_tokens = len(actual_input_tokens)
n_cache_hit_tokens = len(input_tokens) - n_actual_input_tokens

This finds the longest common prefix (here referred to as the first valid Llama.pos) and uses any tokens beyond that point as the input to the model (the previous tokens are known to already be in the KV cache). I think this code could be adapted to work with tensors instead of list[int], if that's what you mean.
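
If it helps, bridging this to the HF side would mostly mean flattening the incoming tensor before the prefix search — something like this (torch-based sketch, illustrative only):

```python
import torch

def tokens_from_hf_input(input_ids: torch.Tensor) -> list[int]:
    # HF generate() passes input_ids with shape (batch, seq_len); the llama.cpp side
    # wants a flat list[int] for a single sequence
    return input_ids[0].tolist()
```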

  5. Is speculative decoding implemented?

No, and I'm afraid it would take a lot of effort to implement. The reason is that it's not part of the libllama API, which easy_llama is built on top of. It's part of common.h and speculative.h, and I'd have to get my hands dirty with C++ to bring it in.

As far as I can tell, llama-cpp-python's LlamaPromptLookupDecoding class doesn't get around this issue either. It speeds up prompt processing using the draft model but does nothing during text generation. See lines 907 - 955 here.

EDIT: Nevermind, I was wrong. llama-cpp-python is doing this. So maybe it wouldn't be as hard as I thought originally. This is something I need to look into in the future.

  6. Option for CUDA_FORCE_MMQ?

I'll have to get back to you on this when I'm a little more comfortable with the distribution and installation process for easy_llama outside of my own machines. I need to figure that out. 😅 I imagine once I know what I'm doing this would be possible to add, but I'm not there yet.

If you are interested, a PR changing llama-cpp-python to easy-llama in my repository would be highly welcome once wheels are available.

The PR is coming, slowly but surely. 🐌 After v0.2.0 is released and the installation process is figured out, drafting that PR is next on the roadmap.


Let me know which of these features are must-haves and which ones can be worked around. At a minimum, support for logits_all would be easy to add.

@oobabooga
Author

oobabooga commented Jan 24, 2025

numa is implemented in the libllama bindings but is not currently exposed at a higher level

That would be a nice addition. I have never used it, but there has been demand for it in my repository at some point. It seems to speed things up to some extent on servers with multiple CPUs.

If that's not an effective solution, then support for logits_all could be implemented without too much headache.

If you could

  1. Make it an optional parameter
  2. Make it not hurt performance at all, just add a bit of extra memory usage

That would be a great feature to have. Having access to all logits is not only useful for perplexity evaluation, it's also necessary for speculative decoding.

The code you linked conceptually looks a lot like this snippet, which is part of the generate_single, generate, and stream methods of Llama:

Indeed, your code does the same thing. I would suggest adding an eval(tokens: Sequence[int]) method, just like the one in llama-cpp-python. That's an often-used function, and having to handle batches and call low-level functions each time is time-consuming. Maybe that would also avoid some code duplication in the easy-llama codebase?

On this, it would also be nice to have a kwarg like show_progress=False in that function which, if True, shows a tqdm progress bar for prompt evaluation like this one (useful for cases when prompt evaluation can take several minutes):

https://github.com/oobabooga/text-generation-webui/blob/096272f49e55357a364ed9016357b97829dae0fd/modules/llama_cpp_python_hijack.py#L67
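
Roughly what I have in mind (a sketch only — easy-llama's real batching internals will differ, and n_batch / _decode_batch here are placeholder names):

```python
from tqdm import tqdm

def eval(self, tokens: list[int], show_progress: bool = False) -> None:
    """Hypothetical eval(): push `tokens` into the KV cache in n_batch-sized chunks."""
    chunks = [tokens[i:i + self.n_batch] for i in range(0, len(tokens), self.n_batch)]
    if show_progress:
        chunks = tqdm(chunks, desc="Prompt evaluation", unit="batch")
    for chunk in chunks:
        self._decode_batch(chunk)  # placeholder for the low-level llama_decode call
```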

No, and I'm afraid would take a lot of effort to get it implemented. The reason is that it's not part of the libllama API which easy_llama is built on top of.

That's entirely fair. I figured it wasn't a part of libllama.

EDIT: Nevermind, I was wrong. llama-cpp-python is doing this. So maybe it wouldn't be as hard as I thought originally. This is something I need to look into in the future.

Aha, I didn't know that. Prompt lookup decoding is what the Transformers library has in its .generate() function. It's not really what people call speculative decoding, but it does make things significantly faster as well, and without the need for a draft model.
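
In case it's useful, the whole trick is: find the most recent place in the existing context where the last few generated tokens already occur, and propose whatever followed them as draft tokens to be verified against the model's real logits. A rough, library-agnostic sketch:

```python
def prompt_lookup_candidates(context: list[int], ngram_size: int = 3, max_draft: int = 10) -> list[int]:
    """Propose draft tokens by matching the trailing n-gram elsewhere in the context."""
    if len(context) <= ngram_size:
        return []
    tail = context[-ngram_size:]
    # search backwards for an earlier occurrence of the trailing n-gram
    for start in range(len(context) - ngram_size - 1, -1, -1):
        if context[start:start + ngram_size] == tail:
            draft = context[start + ngram_size : start + ngram_size + max_draft]
            if draft:
                return draft  # these still get verified against the model's own logits
    return []
```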

The PR is coming, slowly but surely. 🐌 After v0.2.0 is released and the installation process is figured out, drafting that PR is next on the roadmap.

Thank you so much in advance for that!

@iwr-redmond

iwr-redmond commented Jan 25, 2025

I agree that logits access would be very helpful to implement, because this functionality will be needed for compatibility with lm-format-enforcer. Packages like llama-cpp-agent and llm-axe, and it seems even ollama-py, all error out when a model doesn't produce valid JSON, an issue that can be avoided by using token-level supervision.

This would open the door to, for example, more reliable implementations of function calling and <thinking> tags in third-party packages that depend on easy-llama in the future.
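
For illustration, the core of that token-level supervision is just masking the next-token logits down to whatever the format allows before sampling. A rough, library-agnostic numpy sketch:

```python
import numpy as np

def constrained_pick(logits: np.ndarray, allowed_token_ids: set[int]) -> int:
    """Greedily pick the next token, restricted to the IDs the format enforcer allows."""
    allowed = list(allowed_token_ids)
    masked = np.full_like(logits, -np.inf)
    masked[allowed] = logits[allowed]
    return int(np.argmax(masked))
```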

@ddh0
Owner

ddh0 commented Jan 26, 2025

If you could make it an optional parameter and make it not hurt performance at all, just add a bit of extra memory usage ...

Yes, ultimately the decision to return or discard logits is made per-token, per-batch. Setting the logits_all parameter upon loading a model is no longer required in any case.
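
As a rough sketch of the underlying libllama mechanism (not easy_llama's public API — assume ctx, batch, start_pos, tokens, and logits_all are already set up):

```python
# each token in a llama_batch carries its own "return logits?" flag,
# so logits_all just means "flag every position" instead of only the last one
for i, tok in enumerate(tokens):
    batch.token[i] = tok
    batch.pos[i] = start_pos + i
    batch.logits[i] = logits_all or (i == len(tokens) - 1)

llama_decode(ctx, batch)  # process the whole batch

# one row of n_vocab logits per flagged position
rows = [llama_get_logits_ith(ctx, i) for i in range(len(tokens)) if batch.logits[i]]
```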

Is there ever a situation where you need the logits for all tokens in a batch (i.e. logits from prompt processing), and not just the current token (i.e. logits from text generation)?

@iwr-redmond

I'm not sufficiently familiar with the internals of LlamaCPP(-python) to know the answer. Hopefully @oobabooga is better skilled in this area (it certainly seems so!).

Having said that, the lm-format-enforcer sample notebook provides some idea of what it expects from llama-cpp-python. Perhaps that will be of assistance?

@oobabooga
Author

oobabooga commented Jan 26, 2025

Is there ever a situation where you need the logits for all tokens in a batch

Yes, the perplexity calculation depends on the logits at every position in the input, not just the last one.

For speculative decoding, it's necessary to evaluate the prompt with an additional draft token appended and retrieve the logits at position N+1, where N is the original prompt length. I remember I tried doing that with llama-cpp-python at some point and had to use logits_all=True to access those logits.

In both cases, the logits from prompt processing are used.
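
Concretely, the perplexity computation looks something like this (minimal numpy sketch, where logits is assumed to be the (n_tokens, n_vocab) matrix obtained from prompt processing):

```python
import numpy as np

def perplexity(logits: np.ndarray, tokens: list[int]) -> float:
    """Perplexity of `tokens`, given per-position logits of shape (len(tokens), n_vocab)."""
    nll = 0.0
    for i in range(len(tokens) - 1):
        row = logits[i]                             # logits predicting tokens[i + 1]
        log_probs = row - np.logaddexp.reduce(row)  # log-softmax
        nll -= log_probs[tokens[i + 1]]
    return float(np.exp(nll / (len(tokens) - 1)))
```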

@ddh0
Owner

ddh0 commented Jan 26, 2025

Yes, perplexity calculation formula depends on the logits across all positions in the input, not just the last one ... In both cases, the logits from prompt processing are used.

I see, thanks. I'll work on this.

@iwr-redmond

I fear we've killed Christmas (or v0.2.0's release schedule)

This is a wishlist, @ddh0, not a mandatory requirement for an initial version.

@ddh0
Owner

ddh0 commented Jan 28, 2025

Yeah no worries at all. Just pushed the latest, mostly working on logits_all and (hopefully) making usage intuitive

ddh0 added a commit that referenced this issue Jan 30, 2025
@ddh0
Owner

ddh0 commented Jan 30, 2025

Just pushed the implementation of logits_all. See this section of llama.py.

In short:

  • Llama.eval(toks, logits_all=False): get the logits for the last token in toks
  • Llama.eval(toks, logits_all=True): get the logits for all tokens in toks
  • Llama.generate_single(toks, return_logits=False): generate the next token ID
  • Llama.generate_single(toks, return_logits=True): get the logits for the next token
  • Llama.generate(toks, return_logits=False): generate multiple new token IDs
  • Llama.generate(toks, return_logits=True): get the logits for multiple new tokens
  • Llama.stream(toks, yield_logits=False): yield token IDs as they are generated
  • Llama.stream(toks, yield_logits=True): yield logits as they are generated
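
A rough usage sketch (return shapes and exact semantics may still change):

```python
# assuming `llama` is an already-loaded easy_llama Llama instance
toks = [1, 15043, 3186]                                # some token IDs (illustrative)

last_logits = llama.eval(toks)                         # logits for the final position only
all_logits  = llama.eval(toks, logits_all=True)        # one row of logits per input token
next_id     = llama.generate_single(toks)              # next token ID
next_logits = llama.generate_single(toks, return_logits=True)  # logits for the next token
```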

@oobabooga
Author

Nice @ddh0, thank you for that!
