easy-llama wishlist #10
Also, one more thing:
Yes, but it would have to be done manually using the low-level API, which would not be very fun. Currently, using a
Same answer as above, if I understand correctly.
The code you linked conceptually looks a lot like this snippet from easy-llama:

```python
# find how many tokens in the input are already in the KV cache
self.pos = self._first_valid_pos(input_tokens)

# remove all tokens that are past that point
self.kv_cache_seq_rm(0, self.pos, -1)

actual_input_tokens = input_tokens[self.pos:]   # tokens after self.pos
self.context_tokens = input_tokens[:self.pos]   # tokens already processed

n_actual_input_tokens = len(actual_input_tokens)
n_cache_hit_tokens = n_tokens - n_actual_input_tokens
```

This finds the longest common prefix (here referred to as the first valid position) between the new input tokens and the tokens already in the KV cache, so only the tokens after that point need to be evaluated.
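For reference, here is a minimal sketch of what such a first-valid-position helper could look like. The method name `_first_valid_pos` and the `self.context_tokens` attribute come from the snippet above; the body is my own illustration rather than the actual easy-llama implementation.

```python
def _first_valid_pos(self, input_tokens: list[int]) -> int:
    """Length of the longest common prefix between the tokens already in
    the KV cache (self.context_tokens) and the new input tokens."""
    pos = 0
    for cached_tok, new_tok in zip(self.context_tokens, input_tokens):
        if cached_tok != new_tok:
            break
        pos += 1
    return pos
```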
No, and I'm afraid it would take a lot of effort to get it implemented. The reason is that it's not part of the libllama API itself.
EDIT: Nevermind, I was wrong. llama-cpp-python is doing this. So maybe it wouldn't be as hard as I thought originally. This is something I need to look into in the future.
I'll have to get back to you on this when I'm a little more comfortable with the distribution and installation process for easy_llama outside of my own machines. I need to figure that out. 😅 I imagine once I know what I'm doing this would be possible to add, but I'm not there yet.
The PR is coming, slowly but surely. 🐌 After v0.2.0 is released and the installation process is figured out, drafting that PR is next on the roadmap. Let me know which of these features are must-haves and which ones can be worked around. At a minimum, support for
That would be a nice addition. I have never used it, but there has been demand for it in my repository at some point. It seems to speed things up to some extent on servers with multiple CPUs.
If you could
That would be a great feature to have. Having access to all logits is not only useful for perplexity evaluation, it's also necessary for speculative decoding.
Indeed, your code does the same thing. I would suggest adding an
On this note, it would also be nice to have a kwarg like
That's entirely fair. I figured it wasn't a part of libllama.
Aha, I didn't know that. Prompt lookup decoding is what the Transformers library has in its
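For context on how prompt lookup decoding works: the most recent n-gram of the current sequence is searched for earlier in the prompt, and the tokens that followed that earlier occurrence are proposed as draft tokens to verify in a single batch. Below is a library-agnostic sketch of the candidate step; the function name and parameters are illustrative, not any library's actual API.

```python
def prompt_lookup_candidates(tokens: list[int], max_ngram: int = 3,
                             num_draft: int = 10) -> list[int]:
    """Propose draft tokens by matching the trailing n-gram of `tokens`
    against earlier occurrences and copying what followed the match."""
    for n in range(max_ngram, 0, -1):
        tail = tokens[-n:]
        # scan earlier occurrences of the trailing n-gram, most recent first
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == tail:
                draft = tokens[start + n:start + n + num_draft]
                if draft:
                    return draft  # to be verified by the target model in one pass
    return []  # no match: fall back to ordinary decoding
```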
Thank you so much in advance for that!
I agree that logits would be very helpful to implement, because this functionality will be needed for compatibility with lm-format-enforcer. Packages like llama-cpp-agent and llm-axe, and it seems even ollama-py, all error out when a model doesn't produce valid JSON, an issue that can be avoided by using token-level supervision. This would open the door to, for example, more reliable implementations of function calling and <thinking> tags in third-party packages that depend on easy-llama in the future.
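To make the idea of token-level supervision concrete: at each step the constraint engine knows which token IDs keep the output valid (for example, valid JSON), and everything else is masked out of the logits before sampling. This is a generic sketch, not lm-format-enforcer's actual interface.

```python
import numpy as np

def constrained_next_token(logits: np.ndarray, allowed_ids: set[int]) -> int:
    """Mask every token the schema forbids, then pick (here: greedily)
    from the remaining candidates so the output can never become invalid."""
    masked = np.full_like(logits, -np.inf)
    allowed = np.fromiter(allowed_ids, dtype=np.intp)
    masked[allowed] = logits[allowed]
    return int(np.argmax(masked))
```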
Yes, ultimately the decision to return or discard logits is made per-token, per-batch; setting the corresponding flag in the batch controls this.
Is there ever a situation where you need the logits for all tokens in a batch (i.e. logits from prompt processing), and not just the current token (i.e. logits from text generation)?
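To illustrate the per-token choice (purely schematic; the class and field names below are made up for illustration and are not the easy-llama or llama.cpp binding API): each token submitted in a batch carries its own flag saying whether logits should be kept for that position.

```python
from dataclasses import dataclass, field

@dataclass
class FakeBatch:
    tokens: list[int] = field(default_factory=list)
    positions: list[int] = field(default_factory=list)
    keep_logits: list[bool] = field(default_factory=list)  # one flag per token

def build_prompt_batch(prompt_tokens: list[int], logits_all: bool) -> FakeBatch:
    """With logits_all=False only the final prompt position requests logits
    (enough for ordinary generation); with logits_all=True every position
    does, which costs extra memory and compute but makes the prompt logits
    available for things like perplexity measurement."""
    batch = FakeBatch()
    last = len(prompt_tokens) - 1
    for i, tok in enumerate(prompt_tokens):
        batch.tokens.append(tok)
        batch.positions.append(i)
        batch.keep_logits.append(logits_all or i == last)
    return batch
```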
I'm not sufficiently familiar with the internals of LlamaCPP(-python) to know the answer. Hopefully @oobabooga is better skilled in this area (it certainly seems so!). Having said that, the lm-format-enforcer sample notebook provides some idea of what it expects from llama-cpp-python. Perhaps that will be of assistance?
Yes, the perplexity calculation depends on the logits across all positions in the input, not just the last one. For speculative decoding, it's necessary to evaluate the prompt with an additional draft token appended and retrieve the logits at position N+1, where N is the original prompt length. I remember I tried doing that with llama-cpp-python at some point and had to use logits_all=True to access those logits. In both cases, the logits from prompt processing are used.
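For reference, a minimal sketch of the perplexity side of this, assuming a full logits matrix of shape (n_tokens, vocab_size) where row i holds the predictions made after seeing tokens[: i + 1]. This is illustrative, not the exact code from the linked file.

```python
import numpy as np

def perplexity(logits: np.ndarray, tokens: list[int]) -> float:
    """PPL = exp(-1/N * sum_i log p(tokens[i+1] | tokens[:i+1]))."""
    log_probs = []
    for i in range(len(tokens) - 1):
        row = logits[i] - logits[i].max()     # numerically stable log-softmax
        log_z = np.log(np.exp(row).sum())
        log_probs.append(row[tokens[i + 1]] - log_z)
    return float(np.exp(-np.mean(log_probs)))
```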
I see, thanks. I'll work on this.
I fear we've killed Christmas (or 0.2.0's release schedule). This is a wishlist, @ddh0, not a mandatory requirement for an initial version.
Yeah, no worries at all. Just pushed the latest; mostly working on logits_all and (hopefully) making usage intuitive.
Just pushed the implementation of logits_all. In short:
Nice @ddh0, thank you for that!
I'm interested in potentially replacing llama-cpp-python with easy-llama in my project, and have some questions about feature parity:
1. Are the options in the params dict below available?
https://github.com/oobabooga/text-generation-webui/blob/096272f49e55357a364ed9016357b97829dae0fd/modules/llamacpp_model.py#L88
https://github.com/oobabooga/text-generation-webui/blob/096272f49e55357a364ed9016357b97829dae0fd/modules/llamacpp_model.py#L134
2. llama-cpp-python accepts logits_all=True while loading the model, which reduces performance but makes all logits available as a matrix when you get them with model.eval_logits. Is something equivalent available in easy-llama? I have used this feature to measure the perplexity of llama.cpp quants a while ago using the code here:
https://github.com/oobabooga/text-generation-webui/blob/096272f49e55357a364ed9016357b97829dae0fd/modules/llamacpp_hf.py#L133
3. I use model.n_tokens to do prefix matching, and evaluate new tokens by calling model.eval, taking as input a list containing the new tokens only (a sketch of the pattern follows this list). Can that be done with easy-llama? See:
https://github.com/oobabooga/text-generation-webui/blob/096272f49e55357a364ed9016357b97829dae0fd/modules/llamacpp_hf.py#L118
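For reference, the pattern in that last point looks roughly like this with llama-cpp-python. It is only a sketch: it uses the model.n_tokens and model.eval names mentioned above, and it assumes the cache position can be rewound by assigning model.n_tokens, which should be checked against the linked llamacpp_hf.py.

```python
def eval_with_prefix_reuse(model, past_tokens: list[int], prompt: list[int]) -> list[int]:
    """Evaluate only the part of `prompt` that differs from what was fed
    to `model` previously, reusing the KV cache for the shared prefix."""
    # longest common prefix between the previous and the new prompt
    prefix = 0
    for old_tok, new_tok in zip(past_tokens, prompt):
        if old_tok != new_tok:
            break
        prefix += 1
    prefix = min(prefix, model.n_tokens)   # never claim more than is cached
    model.n_tokens = prefix                # assumption: rewinds the cache position
    model.eval(prompt[prefix:])            # feed only the new tokens
    return prompt                          # caller keeps this as past_tokens next time
```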
If you are interested, a PR changing llama-cpp-python to easy-llama in my repository would be highly welcome once wheels are available. It would be a way to test the library as well. But I can also try to do the change myself.