
Fix phi 3 conversion #8262

Merged: 1 commit merged into ggml-org:master on Jul 3, 2024

Conversation

ngxson (Collaborator) commented Jul 2, 2024

Reflect the change in latest update: https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/commit/10d25dfa1593265daf4a3bac573ab76ccf61d60f

The only changes are:

  • Rope type name changed to longrope
  • Scaling factor list changed ==> models need to be re-converted

Related to #6849 (comment)

Example result after conversion:

make llama-cli -j && ./llama-cli -m ../_hf/Phi-3-mini-128k-instruct/ggml-model-q8_0.gguf -p "You are AI" -c 1024 -cnv

> What is 2+2
 2+2 equals 4. The operation here is a simple addition, which is one of the four basic arithmetic operations. Adding two units to two units results in four units.
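
Roughly, the conversion-side change amounts to the following hedged Python sketch (the writer object and helper name are hypothetical stand-ins, not the actual convert-hf-to-gguf.py code): the renamed "longrope" type is handled the same way as the old "su" type, and both factor lists are exported so the runtime can pick between them.

# Hedged sketch, hypothetical helper -- not the actual convert-hf-to-gguf.py code.
def export_rope_scaling(hparams, writer):
    rope_scaling = hparams.get("rope_scaling") or {}
    scaling_type = rope_scaling.get("type", "")
    if scaling_type not in ("su", "longrope"):  # "longrope" is the new name for "su"
        raise NotImplementedError(f"The rope scaling type {scaling_type!r} is not supported yet")
    # assumption: "writer" stands in for the GGUF writer; the two factor lists become
    # the long/short rope-factor tensors the runtime chooses from by context size
    writer.add_rope_factors("long", rope_scaling["long_factor"])
    writer.add_rope_factors("short", rope_scaling["short_factor"])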

ngxson requested a review from ggerganov on Jul 2, 2024, 21:18
github-actions bot added the python (python script changes) label on Jul 2, 2024
bartowski1182 (Contributor)

Is this all that's needed? I thought longrope wasn't supported.

ngxson (Collaborator, Author) commented Jul 3, 2024

Longrope support was added in #7225.

bartowski1182 (Contributor) commented Jul 3, 2024

Is that longrope? I thought it was supporting YaRN with a long factor; I could be way off base.

I guess my question is: with this, does it actually work as it should, or does this just make conversion work?

ann-brown

(I'm a bit confused because I'm on the fork and python convert-hf-to-gguf.py ../Phi-3-mini-128k-instruct is still giving me "NotImplementedError: The rope scaling type longrope is not supported yet" ... as is trying to load it directly in transformers ... and anyone else's GGUFs that they've managed to make are broken for me in LM Studio, though not for them, apparently. ... Mac thing?)

ngxson (Collaborator, Author) commented Jul 3, 2024

@bartowski1182 I ran a perplexity test yesterday (forgot to save the result), but the result looks good.

@ann-brown Maybe it's a problem with your git checkout. I converted Phi-3-mini-128k-instruct without any problem. Please double-check your git command before trying anything else.

ann-brown

git clone https://github.com/ngxson/llama.cpp.git llama.fork ?

ngxson (Collaborator, Author) commented Jul 3, 2024

@ann-brown It's on branch xsn/phi3-convert.

slaren (Member) commented Jul 3, 2024

It seems that the chat template has also been changed to support a system prompt.

ngxson (Collaborator, Author) commented Jul 3, 2024

Actually, we have (accidentally) had support for the system message from the beginning: https://github.com/ggerganov/llama.cpp/blob/f8d6a23804f3798ff2869da68c1223b618df09ec/src/llama.cpp#L19897-L19905

This is because the phi-3 template is mostly the same as the zephyr template, so we copied the code from zephyr, which already had system message support.
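
For reference, a minimal Python sketch of the prompt layout that template produces (the same <|system|>/<|user|>/<|assistant|>/<|end|> structure used in the test prompt later in this thread); this is only an illustration, not the C++ code linked above:

# Sketch of the phi-3 chat format (zephyr-like <|role|> ... <|end|> turns).
def format_phi3(messages):
    prompt = ""
    for msg in messages:  # each msg is {"role": "system"|"user"|"assistant", "content": "..."}
        prompt += f"<|{msg['role']}|>\n{msg['content']}<|end|>\n"
    return prompt + "<|assistant|>\n"  # trailing generation prompt

# Example:
# format_phi3([{"role": "system", "content": "You are AI"},
#              {"role": "user", "content": "What is 2+2"}])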

ngxson (Collaborator, Author) commented Jul 3, 2024

@bartowski1182 Could you confirm whether this fix works on your side? I'll merge once someone confirms. Thanks.

bartowski1182 (Contributor)

Sure, I'll check it out.

Did your perplexity test go past 4k tokens? I'll try to validate the model's coherency at ~10k+ just in case too.

ngxson (Collaborator, Author) commented Jul 3, 2024

@bartowski1182 No, I didn't test beyond -c 4096 (my laptop is a potato as always; I have to wait until next month for the new rig). Thanks for testing that.

ngxson (Collaborator, Author) commented Jul 3, 2024

Btw, the 128k "mini" version should work fine with contexts bigger than 4k. It's only from the "small" variant onward that they use sliding window and block-sparse attention, which will break on llama.cpp beyond 2k tokens.

slaren (Member) commented Jul 3, 2024

It doesn't seem to work with contexts above 4096. ppl with -c 8192 is very high.

[1]1139.2192,[2]697.7324,[3]697.8156,[4]681.5541,[5]573.3356,[6]575.8573,[7]547.5108,[8]572.5840

bartowski1182 (Contributor)

That's kind of what I expected; this isn't a longrope implementation, it just allows it to fall through to the old rope path if it's set to longrope.

ngxson (Collaborator, Author) commented Jul 3, 2024

Thanks for the test. Hmm, I'm looking at the commit on HF; the only change to that file is a new function _rope_scaling_adjustment, which simply forces self.rope_scaling["type"] = "longrope":

https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/blob/main/configuration_phi3.py#L187

Probably there are other changes under the hood.
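
A paraphrased sketch of what that adjustment amounts to, based on the description above (not the exact configuration_phi3.py code):

# Paraphrased sketch: normalize the legacy rope scaling type name to "longrope".
def _rope_scaling_adjustment(self):
    if self.rope_scaling is not None and "type" in self.rope_scaling:
        if self.rope_scaling["type"] == "su":  # assumption: only the old name is rewritten
            self.rope_scaling["type"] = "longrope"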

bartowski1182 (Contributor)

It's likely because it was trained with whatever the longrope method is, so while this may extend the context, it does so in the wrong way and the model will get confused.

slaren (Member) commented Jul 3, 2024

It is likely a bug in build_rope_factors and/or the value perplexity sets for n_seq_max. Forcing it to return rope_long works.

perplexity: calculating perplexity over 40 chunks, n_ctx=8192, batch_size=512, n_seq=1
perplexity: 2.38 seconds per pass - ETA 1.58 minutes
[1]7.1336,[2]5.1053,[3]5.5121,[4]5.8647,[5]5.9926,[6]6.0420,[7]6.0684,[8]6.3195,[9]5.9890,[10]5.8958,

bartowski1182 (Contributor)

Ooh, that's interesting... good find. Where is that change being made, sorry?

ngxson (Collaborator, Author) commented Jul 3, 2024

OK I see: probably because yarn is removed, there is no more n_ctx_orig_yarn, so it never returns model.layers[il].rope_long.

See: https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/commit/10d25dfa1593265daf4a3bac573ab76ccf61d60f.diff?file=modeling_phi3.py

@bartowski1182 In llama.cpp's build_rope_factors, we simply return model.layers[il].rope_long without any checks.

slaren (Member) commented Jul 3, 2024

I think the problem is that perplexity sets n_seq_max to 4, so it takes 16k context before it starts using the long rope factors.
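
Concretely, here is a hedged Python paraphrase of the selection logic being discussed (the real code is C++ in build_rope_factors; names are approximate, and the division by n_seq_max is the detail that matters here):

# Hedged paraphrase, not the actual C++: the context is split across sequences before
# being compared with the model's original training context, so n_seq_max changes
# which factor list is used.
def pick_rope_factors(n_ctx, n_seq_max, n_ctx_orig, rope_long, rope_short):
    n_ctx_per_seq = n_ctx // n_seq_max
    if n_ctx_per_seq > n_ctx_orig:  # assumption: strict comparison against the 4k training context
        return rope_long
    return rope_short

# llama-cli with -c 8192 uses a single sequence: 8192 > 4096, so the long factors apply.
# perplexity with n_seq_max = 4: 8192 // 4 = 2048 <= 4096, so the short factors are used,
# which matches the bad ppl numbers above.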

bartowski1182 (Contributor)

Ah, so it's more than just a conversion change; it'll need an update to the runtime as well? Does it work with this conversion change plus the new llama.cpp change, or does it need a different conversion AND llama.cpp changes?

slaren (Member) commented Jul 3, 2024

I think this is fine as is. It will work on programs that set n_seq_max correctly. We can fix the issue with perplexity in another PR.

ngxson (Collaborator, Author) commented Jul 3, 2024

You're right, @slaren: when running with llama-cli -m ... -c 8192, it always returns rope_long. So I assume it should work at least in llama-cli (not sure if it has an impact on llama-server with multiple sequences).

ngxson (Collaborator, Author) commented Jul 3, 2024

A small note: in the Hugging Face model they simply renamed Phi3SuScaledRotaryEmbedding to Phi3LongRoPEScaledRotaryEmbedding without changing any code inside, so "longrope" should just be "su" under another name.
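
For context, a hedged sketch of what that su/longrope scaling does with the factor lists, as I understand the method (not the actual Phi3LongRoPEScaledRotaryEmbedding code):

import math

# Hedged sketch: each RoPE inverse frequency is divided by a per-dimension factor from
# the long or short list, and attention magnitude is compensated with a single scalar.
def su_scaled_inv_freq(base_inv_freq, factors, n_ctx_orig, n_ctx_max):
    scaled = [f0 / f for f0, f in zip(base_inv_freq, factors)]
    scale = n_ctx_max / n_ctx_orig
    # assumption: the usual su/longrope-style magnitude correction
    mscale = 1.0 if scale <= 1.0 else math.sqrt(1.0 + math.log(scale) / math.log(n_ctx_orig))
    return scaled, mscale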

I'm merging this PR now.

ngxson merged commit 916248a into ggml-org:master on Jul 3, 2024 (8 checks passed)
bartowski1182 (Contributor)

I'm still suspicious, but I'll run some tests and report back if there's anything off. Maybe I'm just naturally too suspicious, though.

andysalerno (Contributor)

Will the "old" Phi-3s continue to run and convert properly with this change?

Also, thank you for the diligence; I think it really helps catch subtle issues like these.

jxy (Contributor) commented Jul 3, 2024

It looks good to me. I tried converting to Q8_0 from f32.

4k ppl

perplexity: calculating perplexity over 81 chunks, n_ctx=4096, batch_size=2048, n_seq=1
perplexity: 17.06 seconds per pass - ETA 23.03 minutes
[1]6.3298,[2]6.2972,[3]5.8147,[4]5.0729,[5]5.1476,[6]5.2719,[7]5.4058,[8]5.5532,

8k ppl

perplexity: calculating perplexity over 40 chunks, n_ctx=8192, batch_size=2048, n_seq=1
perplexity: 43.26 seconds per pass - ETA 28.83 minutes
[1]7.1303,[2]5.0979,[3]5.5071,[4]5.8604,

16k ppl

perplexity: calculating perplexity over 20 chunks, n_ctx=16384, batch_size=2048, n_seq=1
perplexity: 133.90 seconds per pass - ETA 44.63 minutes
[1]4.0957,[2]5.0421,

and tried a simple test

$ ./llama-cli -m models/Phi-3.1-mini-128k-instruct-Q8_0.gguf -ngl 99 -c 16384 -n -1 --temp 0 -fa -p "<|system|>
As a CS PhD, explain the patch given to you.<|end|>          
<|user|>
$(curl -L https://github.com/ggerganov/llama.cpp/pull/8227.patch)<|end|>
<|assistant|>
"

and got

  The patches provided by ngxson and others are a series of changes made to the `llama.cpp` file, which is part of the Llama model implementation. The changes are related to the addition and modification of features for sliding window attention (SWA) in the model. Here's a summary of the changes:

1. Patch 1: Added new enum values for `LLM_KV_CONTEXT_LENGTH_SWA` and `LLM_KV_EMBEDDING_LENGTH` to the `llm_kv` enum.
2. Patch 2: Fixed an issue with the `data_swa` variable being uninitialized.
3. Patch 3: Improved naming for the `sliding_window` parameter in the `set_gguf_parameters` function.
4. Patch 4: Replaced a list of tensors with a single tensor for the `inp_KQ_mask_SWA` variable.
5. Patch 5: Co-authored by Arlo Phoenix.
6. Patch 6: Updated the `n_sliding` parameter to `n_swa` for consistency with other parameters.
7. Patch 7: Improved the styling of the code.
8. Patch 8: Added a sanity check for the `query_pre_attn_scalar` parameter in the `set_gguf_parameters` function.
9. Patch 9: Fixed a typo in the README.md file.

Overall, these patches aim to improve the implementation of sliding window attention in the Llama model, making it more efficient and accurate. [end of text]

llama_print_timings:        load time =     862.99 ms
llama_print_timings:      sample time =       9.46 ms /   379 runs   (    0.02 ms per token, 40050.72 tokens per second)
llama_print_timings: prompt eval time =   69443.49 ms / 13841 tokens (    5.02 ms per token,   199.31 tokens per second)
llama_print_timings:        eval time =   44818.62 ms /   378 runs   (  118.57 ms per token,     8.43 tokens per second)
llama_print_timings:       total time =  114317.45 ms / 14219 tokens

Not brilliant, but alright for a small model.

MoonRide303 (Contributor) commented Jul 4, 2024

> I'm still suspicious, but I'll run some tests and report back if there's anything off. Maybe I'm just naturally too suspicious, though.

@bartowski1182 I did a custom test based on a 16k+ token input (the first 2133 lines of the NetHack Guidebook as a single message), and this updated version of Phi-3-mini-128k-instruct, freshly converted with llama.cpp b3291 (Q8_0 quant), seems to be working pretty well.

(The question and the model's answer were attached as screenshots.)

Launched with llama-server.exe -v -ngl 99 -m Phi-3-mini-128k-instruct-Q8_0.gguf -c 22528 (to fit into 16 GB of VRAM).

bartowski1182 (Contributor)

You are correct; I meant to update my comment. I did a test where I fed it some 20k tokens and it replied coherently, so my suspicions were completely unjustified. I appreciate you doing a good, thorough test to prove it, thank you!!

HydrogenBombaklot

Hi Bartowski, any update on Phi 3 Small?
