easy-llama wishlist #10

Open
oobabooga opened this issue Jan 24, 2025 · 12 comments

@oobabooga

oobabooga commented Jan 24, 2025

I'm interested in potentially replacing llama-cpp-python with easy-llama in my project, and have some questions about feature parity:

  1. Are all parameters in the params dict below available?

https://github.com/oobabooga/text-generation-webui/blob/096272f49e55357a364ed9016357b97829dae0fd/modules/llamacpp_model.py#L88

  2. Is it possible to get the logits after a certain input? As done here:

https://github.com/oobabooga/text-generation-webui/blob/096272f49e55357a364ed9016357b97829dae0fd/modules/llamacpp_model.py#L134

  3. Similar to 2. but more nuanced: is there a way to get the logits for every token position in an input at once? In llama-cpp-python, this is done by passing logits_all=True while loading the model, which reduces performance but makes all logits available as a matrix when you get them with model.eval_logits. I have used this feature to measure the perplexity of llama.cpp quants a while ago using the code here:

https://github.com/oobabooga/text-generation-webui/blob/096272f49e55357a364ed9016357b97829dae0fd/modules/llamacpp_hf.py#L133

  4. I have a llamacpp_HF wrapper that connects llama.cpp to HF text generation functions; at its core, all it does is update model.n_tokens to do prefix matching and evaluate new tokens by calling model.eval with a list containing only the new tokens. Can that be done with easy-llama? (A minimal sketch of this pattern follows the list below.) See:

https://github.com/oobabooga/text-generation-webui/blob/096272f49e55357a364ed9016357b97829dae0fd/modules/llamacpp_hf.py#L118

  5. Is speculative decoding implemented? There is a PR here https://github.com/oobabooga/text-generation-webui/pull/6669/files to add it, and having it in easy-llama would be great, especially if it could be done in a simple way by just passing new kwargs to its model loading and/or generation functions. I believe doing that for my llamacpp_HF wrapper would be very hard, so that's not something I have hopes for.
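
For reference (question 4), the wrapper's core pattern boils down to something like the minimal sketch below. Illustrative names only — it assumes a llama-cpp-python-style model with a writable n_tokens attribute and an eval() method, as used in the linked wrapper.

```python
from typing import Sequence

def eval_with_prefix_reuse(model, past_tokens: list[int], input_ids: Sequence[int]) -> list[int]:
    """Evaluate only the tokens that aren't already covered by the KV cache."""
    # longest common prefix between the previously evaluated tokens and the new input
    prefix_len = 0
    for old, new in zip(past_tokens, input_ids):
        if old != new:
            break
        prefix_len += 1

    # roll the cache position back to the shared prefix, then evaluate only the rest
    model.n_tokens = prefix_len
    new_tokens = list(input_ids[prefix_len:])
    if new_tokens:
        model.eval(new_tokens)

    # the caller keeps this as past_tokens for the next call
    return list(input_ids)
```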

If you are interested, a PR changing llama-cpp-python to easy-llama in my repository would be highly welcome once wheels are available. It would be a way to test the library as well. But I can also try to make the change myself.

@oobabooga
Author

oobabooga commented Jan 24, 2025

Also, one more thing:

  6. At some point, llama.cpp introduced a change that optimized prompt processing on newer GPUs, but it also increased VRAM usage slightly and made things slower for older GPUs. This is enabled by default in llama.cpp, but it can be disabled by compiling with -DGGML_CUDA_FORCE_MMQ=ON. I wonder if it's possible to make that an optional parameter in easy-llama, to avoid having to compile two versions of the library like I do.

@ddh0
Owner

ddh0 commented Jan 24, 2025

  1. Are all parameters in the params dict below available?

mul_mat_q is not available because it is no longer part of the llama API. numa is implemented in the libllama bindings but is not currently exposed at a higher level - but that wouldn't be hard to add at all. All of the other parameters there are available.

  2. Is it possible to get the logits after a certain input?

Yes, but it would have to be done manually using the low-level API, which would not be very fun. Currently, using a Llama, one can only get logits for the last token in the context. One option might be to record the logits token-by-token as the model generates. If that's not an effective solution, then support for logits_all could be implemented without too much headache.

  3. Is there a way to get the logits for every token position in an input at once?

Same answer as above, if I understand correctly.

  4. I have a llamacpp_HF wrapper that connects llama.cpp to HF text generation functions; at its core, all it does is update model.n_tokens to do prefix matching and evaluate new tokens by calling model.eval with a list containing only the new tokens. Can that be done with easy-llama?

The code you linked conceptually looks a lot like this snippet, which is part of the generate_single, generate, and stream methods of Llama:

# find how many tokens in the input are already in the KV cache
self.pos = self._first_valid_pos(input_tokens)

# remove all tokens that are past that point
self.kv_cache_seq_rm(0, self.pos, -1)

actual_input_tokens = input_tokens[self.pos:] # tokens after self.pos
self.context_tokens = input_tokens[:self.pos] # tokens already processed

n_actual_input_tokens = len(actual_input_tokens)
n_cache_hit_tokens = len(input_tokens) - n_actual_input_tokens

This finds the longest common prefix (here referred to as the first valid Llama.pos) and uses any tokens beyond that point as the input to the model (the previous tokens are known to already be in the KV cache). I think this code could be adapted to work with tensors instead of list[int], if that's what you mean.
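
If it helps, bridging this to the HF side would mostly mean flattening the incoming tensor before the prefix search — something like this (torch-based sketch, illustrative only):

```python
import torch

def tokens_from_hf_input(input_ids: torch.Tensor) -> list[int]:
    # HF generate() passes input_ids with shape (batch, seq_len); the llama.cpp side
    # wants a flat list[int] for a single sequence
    return input_ids[0].tolist()
```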

  5. Is speculative decoding implemented?

No, and I'm afraid it would take a lot of effort to implement. The reason is that it's not part of the libllama API, which easy_llama is built on top of. It's part of common.h and speculative.h, and I'd have to get my hands dirty with C++ to bring it in.

As far as I can tell, llama-cpp-python's LlamaPromptLookupDecoding class doesn't get around this issue either. It speeds up prompt processing using the draft model but does nothing during text generation. See lines 907 - 955 here.

EDIT: Nevermind, I was wrong. llama-cpp-python is doing this. So maybe it wouldn't be as hard as I thought originally. This is something I need to look into in the future.

  6. Option for CUDA_FORCE_MMQ?

I'll have to get back to you on this when I'm a little more comfortable with the distribution and installation process for easy_llama outside of my own machines. I need to figure that out. 😅 I imagine once I know what I'm doing this would be possible to add, but I'm not there yet.

If you are interested, a PR changing llama-cpp-python to easy-llama in my repository would be highly welcome once wheels are available.

The PR is coming, slowly but surely. 🐌 After v0.2.0 is released and the installation process is figured out, drafting that PR is next on the roadmap.


Let me know which of these features are must-haves and which ones can be worked around. At a minimum, support for logits_all would be easy to add.

@oobabooga
Author

oobabooga commented Jan 24, 2025

numa is implemented in the libllama bindings but is not currently exposed at a higher level

That would be a nice addition. I have never used it, but there has been demand for it in my repository at some point. It seems to speed things up to some extent on servers with multiple CPUs.

If that's not an effective solution, then support for logits_all could be implemented without too much headache.

If you could

  1. Make it an optional parameter
  2. Make it not hurt performance at all, just add a bit of extra memory usage

That would be a great feature to have. Having access to all logits is not only useful for perplexity evaluation, it's also necessary for speculative decoding.

The code you linked conceptually looks a lot like this snippet, which is part of the generate_single, generate, and stream methods of Llama:

Indeed, your code does the same thing. I would suggest adding an eval(tokens: Sequence[int]) method, just like the one in llama-cpp-python. That's an often-used function, and having to handle batches and call low-level functions each time is time-consuming. Maybe that would also avoid some code duplication in the easy-llama codebase?

On this, it would also be nice to have a kwarg like show_progress=False in that function which, if True, shows a tqdm progress bar for prompt evaluation like this one (useful for cases when prompt evaluation can take several minutes):

https://github.com/oobabooga/text-generation-webui/blob/096272f49e55357a364ed9016357b97829dae0fd/modules/llama_cpp_python_hijack.py#L67
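
Roughly what I have in mind (a sketch only — easy-llama's real batching internals will differ, and n_batch / _decode_batch here are placeholder names):

```python
from tqdm import tqdm

def eval(self, tokens: list[int], show_progress: bool = False) -> None:
    """Hypothetical eval(): push `tokens` into the KV cache in n_batch-sized chunks."""
    chunks = [tokens[i:i + self.n_batch] for i in range(0, len(tokens), self.n_batch)]
    if show_progress:
        chunks = tqdm(chunks, desc="Prompt evaluation", unit="batch")
    for chunk in chunks:
        self._decode_batch(chunk)  # placeholder for the low-level llama_decode call
```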

No, and I'm afraid would take a lot of effort to get it implemented. The reason is that it's not part of the libllama API which easy_llama is built on top of.

That's entirely fair. I figured it wasn't a part of libllama.

EDIT: Nevermind, I was wrong. llama-cpp-python is doing this. So maybe it wouldn't be as hard as I thought originally. This is something I need to look into in the future.

Aha, I didn't know that. Prompt lookup decoding is what the Transformers library has in its .generate() function. It's not really what people call speculative decoding, but it does make things significantly faster as well, and without the need for a draft model.
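
In case it's useful, the whole trick is: find the most recent place in the existing context where the last few generated tokens already occur, and propose whatever followed them as draft tokens to be verified against the model's real logits. A rough, library-agnostic sketch:

```python
def prompt_lookup_candidates(context: list[int], ngram_size: int = 3, max_draft: int = 10) -> list[int]:
    """Propose draft tokens by matching the trailing n-gram elsewhere in the context."""
    if len(context) <= ngram_size:
        return []
    tail = context[-ngram_size:]
    # search backwards for an earlier occurrence of the trailing n-gram
    for start in range(len(context) - ngram_size - 1, -1, -1):
        if context[start:start + ngram_size] == tail:
            draft = context[start + ngram_size : start + ngram_size + max_draft]
            if draft:
                return draft  # these still get verified against the model's own logits
    return []
```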

The PR is coming, slowly but surely. 🐌 After v0.2.0 is released and the installation process is figured out, drafting that PR is next on the roadmap.

Thank you so much in advance for that!

@iwr-redmond

iwr-redmond commented Jan 25, 2025

I agree that logits access would be very helpful to implement, because this functionality will be needed for compatibility with lm-format-enforcer. Packages like llama-cpp-agent and llm-axe, and it seems even ollama-py, all error out when a model doesn't produce valid JSON, an issue that can be avoided by using token-level supervision.

This would open the door to, for example, more reliable implementations of function calling and <thinking> tags in third-party packages that depend on easy-llama in the future.
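
For illustration, the core of that token-level supervision is just masking the next-token logits down to whatever the format allows before sampling. A rough, library-agnostic numpy sketch:

```python
import numpy as np

def constrained_pick(logits: np.ndarray, allowed_token_ids: set[int]) -> int:
    """Greedily pick the next token, restricted to the IDs the format enforcer allows."""
    allowed = list(allowed_token_ids)
    masked = np.full_like(logits, -np.inf)
    masked[allowed] = logits[allowed]
    return int(np.argmax(masked))
```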

@ddh0
Owner

ddh0 commented Jan 26, 2025

If you could make it an optional parameter and make it not hurt performance at all, just add a bit of extra memory usage ...

Yes, ultimately the decision to return or discard logits is made per-token, per-batch. Setting the logits_all parameter upon loading a model is no longer required in any case.
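
As a rough sketch of the underlying libllama mechanism (not easy_llama's public API — assume ctx, batch, start_pos, tokens, and logits_all are already set up):

```python
# each token in a llama_batch carries its own "return logits?" flag,
# so logits_all just means "flag every position" instead of only the last one
for i, tok in enumerate(tokens):
    batch.token[i] = tok
    batch.pos[i] = start_pos + i
    batch.logits[i] = logits_all or (i == len(tokens) - 1)

llama_decode(ctx, batch)  # process the whole batch

# one row of n_vocab logits per flagged position
rows = [llama_get_logits_ith(ctx, i) for i in range(len(tokens)) if batch.logits[i]]
```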

Is there ever a situation where you need the logits for all tokens in a batch (i.e. logits from prompt processing), and not just the current token (i.e. logits from text generation)?

@iwr-redmond

I'm not sufficiently familiar with the internals of LlamaCPP(-python) to know the answer. Hopefully @oobabooga is better skilled in this area (it certainly seems so!).

Having said that, the lm-format-enforcer sample notebook provides some idea of what it expects from llama-cpp-python. Perhaps that will be of assistance?

@oobabooga
Author

oobabooga commented Jan 26, 2025

Is there ever a situation where you need the logits for all tokens in a batch

Yes, the perplexity calculation depends on the logits at every position in the input, not just the last one.

For speculative decoding, it's necessary to evaluate the prompt with an additional draft token appended and retrieve the logits at position N+1, where N is the original prompt length. I remember I tried doing that with llama-cpp-python at some point and had to use logits_all=True to access those logits.

In both cases, the logits from prompt processing are used.
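
Concretely, the perplexity computation looks something like this (minimal numpy sketch, where logits is assumed to be the (n_tokens, n_vocab) matrix obtained from prompt processing):

```python
import numpy as np

def perplexity(logits: np.ndarray, tokens: list[int]) -> float:
    """Perplexity of `tokens`, given per-position logits of shape (len(tokens), n_vocab)."""
    nll = 0.0
    for i in range(len(tokens) - 1):
        row = logits[i]                             # logits predicting tokens[i + 1]
        log_probs = row - np.logaddexp.reduce(row)  # log-softmax
        nll -= log_probs[tokens[i + 1]]
    return float(np.exp(nll / (len(tokens) - 1)))
```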

@ddh0
Owner

ddh0 commented Jan 26, 2025

Yes, perplexity calculation formula depends on the logits across all positions in the input, not just the last one ... In both cases, the logits from prompt processing are used.

I see, thanks. I'll work on this.

@iwr-redmond

I fear we've killed Christmas (or v0.2.0's release schedule)

This is a wishlist, @ddh0, not a mandatory requirement for an initial version.

@ddh0
Owner

ddh0 commented Jan 28, 2025

Yeah no worries at all. Just pushed the latest, mostly working on logits_all and (hopefully) making usage intuitive

ddh0 added a commit that referenced this issue Jan 30, 2025
@ddh0
Owner

ddh0 commented Jan 30, 2025

Just pushed the implementation of logits_all. See this section of llama.py.

In short:

  • Llama.eval(toks, logits_all=False): get the logits for the last token in toks
  • Llama.eval(toks, logits_all=True): get the logits for all tokens in toks
  • Llama.generate_single(toks, return_logits=False): generate the next token ID
  • Llama.generate_single(toks, return_logits=True): get the logits for the next token
  • Llama.generate(toks, return_logits=False): generate multiple new token IDs
  • Llama.generate(toks, return_logits=True): get the logits for multiple new tokens
  • Llama.stream(toks, yield_logits=False): yield token IDs as they are generated
  • Llama.stream(toks, yield_logits=True): yield logits as they are generated
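
A rough usage sketch (return shapes and exact semantics may still change):

```python
# assuming `llama` is an already-loaded easy_llama Llama instance
toks = [1, 15043, 3186]                                # some token IDs (illustrative)

last_logits = llama.eval(toks)                         # logits for the final position only
all_logits  = llama.eval(toks, logits_all=True)        # one row of logits per input token
next_id     = llama.generate_single(toks)              # next token ID
next_logits = llama.generate_single(toks, return_logits=True)  # logits for the next token
```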

@oobabooga
Author

Nice @ddh0, thank you for that!
