
Fix phi 3 conversion #8262

Merged: 1 commit merged into ggml-org:master on Jul 3, 2024

Conversation

ngxson (Collaborator) commented Jul 2, 2024

Reflect the change in latest update: https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/commit/10d25dfa1593265daf4a3bac573ab76ccf61d60f

The only changes are:

  • Rope type name changed to longrope
  • Scaling factor list changed ==> models need to be re-converted

Related to #6849 (comment)

Example result after conversion:

make llama-cli -j && ./llama-cli -m ../_hf/Phi-3-mini-128k-instruct/ggml-model-q8_0.gguf -p "You are AI" -c 1024 -cnv

> What is 2+2
 2+2 equals 4. The operation here is a simple addition, which is one of the four basic arithmetic operations. Adding two units to two units results in four units.
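
Roughly, the conversion-side change amounts to the following hedged Python sketch (the writer object and helper name are hypothetical stand-ins, not the actual convert-hf-to-gguf.py code): the renamed "longrope" type is handled the same way as the old "su" type, and both factor lists are exported so the runtime can pick between them.

# Hedged sketch, hypothetical helper -- not the actual convert-hf-to-gguf.py code.
def export_rope_scaling(hparams, writer):
    rope_scaling = hparams.get("rope_scaling") or {}
    scaling_type = rope_scaling.get("type", "")
    if scaling_type not in ("su", "longrope"):  # "longrope" is the new name for "su"
        raise NotImplementedError(f"The rope scaling type {scaling_type!r} is not supported yet")
    # assumption: "writer" stands in for the GGUF writer; the two factor lists become
    # the long/short rope-factor tensors the runtime chooses from by context size
    writer.add_rope_factors("long", rope_scaling["long_factor"])
    writer.add_rope_factors("short", rope_scaling["short_factor"])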

ngxson requested a review from ggerganov on Jul 2, 2024, 21:18
github-actions bot added the python (python script changes) label on Jul 2, 2024
bartowski1182 (Contributor)

Is this all that's needed? I thought longrope wasn't supported.

ngxson (Collaborator, Author) commented Jul 3, 2024

Longrope support was added in #7225.

bartowski1182 (Contributor) commented Jul 3, 2024

Is that longrope? I thought it was supporting YaRN with a long factor; I could be way off base.

I guess my question is: with this, does it actually work as it should, or does this just make conversion work?

ann-brown

(I'm a bit confused because I'm on the fork and python convert-hf-to-gguf.py ../Phi-3-mini-128k-instruct is still giving me "NotImplementedError: The rope scaling type longrope is not supported yet" ... as is trying to load it directly in transformers ... and anyone else's GGUFs that they've managed to make are broken for me in LM Studio, though not for them, apparently. ... Mac thing?)

ngxson (Collaborator, Author) commented Jul 3, 2024

@bartowski1182 I ran a perplexity test yesterday (forgot to save the result), but the result looks good.

@ann-brown Maybe it's a problem with your git checkout. I converted Phi-3-mini-128k-instruct without any problem. Please double-check your git command before trying anything else.

ann-brown

git clone https://github.com/ngxson/llama.cpp.git llama.fork ?

ngxson (Collaborator, Author) commented Jul 3, 2024

@ann-brown It's on branch xsn/phi3-convert.

slaren (Member) commented Jul 3, 2024

It seems that the chat template has also been changed to support a system prompt.

ngxson (Collaborator, Author) commented Jul 3, 2024

Actually, we have (accidentally) had support for the system message from the beginning: https://github.com/ggerganov/llama.cpp/blob/f8d6a23804f3798ff2869da68c1223b618df09ec/src/llama.cpp#L19897-L19905

This is because the phi-3 template is mostly the same as the zephyr template, so we copied the code from zephyr, which already had system message support.
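
For reference, a minimal Python sketch of the prompt layout that template produces (the same <|system|>/<|user|>/<|assistant|>/<|end|> structure used in the test prompt later in this thread); this is only an illustration, not the C++ code linked above:

# Sketch of the phi-3 chat format (zephyr-like <|role|> ... <|end|> turns).
def format_phi3(messages):
    prompt = ""
    for msg in messages:  # each msg is {"role": "system"|"user"|"assistant", "content": "..."}
        prompt += f"<|{msg['role']}|>\n{msg['content']}<|end|>\n"
    return prompt + "<|assistant|>\n"  # trailing generation prompt

# Example:
# format_phi3([{"role": "system", "content": "You are AI"},
#              {"role": "user", "content": "What is 2+2"}])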

ngxson (Collaborator, Author) commented Jul 3, 2024

@bartowski1182 Could you confirm whether this fix works on your side? I'll merge once someone confirms. Thanks.

bartowski1182 (Contributor)

Sure, I'll check it out.

Did your perplexity test go past 4k tokens? I'll try to validate the model's coherency at ~10k+ just in case too.

ngxson (Collaborator, Author) commented Jul 3, 2024

@bartowski1182 No, I didn't test beyond -c 4096 (my laptop is a potato as always; I have to wait until next month for the new rig). Thanks for testing that.

ngxson (Collaborator, Author) commented Jul 3, 2024

Btw, the 128k "mini" version should work fine with contexts bigger than 4k. It's only from the "small" variant onward that they use sliding window and block-sparse attention, which will break on llama.cpp beyond 2k tokens.

slaren (Member) commented Jul 3, 2024

It doesn't seem to work with contexts above 4096. ppl with -c 8192 is very high.

[1]1139.2192,[2]697.7324,[3]697.8156,[4]681.5541,[5]573.3356,[6]575.8573,[7]547.5108,[8]572.5840

bartowski1182 (Contributor)

That's kind of what I expected; this isn't a longrope implementation, it just allows it to fall through to the old rope path if it's set to longrope.

ngxson (Collaborator, Author) commented Jul 3, 2024

Thanks for the test. Hmm, I'm looking at the commit on HF; the only change to that file is a new function _rope_scaling_adjustment, which simply forces self.rope_scaling["type"] = "longrope":

https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/blob/main/configuration_phi3.py#L187

Probably there are other changes under the hood.
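
A paraphrased sketch of what that adjustment amounts to, based on the description above (not the exact configuration_phi3.py code):

# Paraphrased sketch: normalize the legacy rope scaling type name to "longrope".
def _rope_scaling_adjustment(self):
    if self.rope_scaling is not None and "type" in self.rope_scaling:
        if self.rope_scaling["type"] == "su":  # assumption: only the old name is rewritten
            self.rope_scaling["type"] = "longrope"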

bartowski1182 (Contributor)

It's likely because it was trained with whatever the longrope method is, so while this may extend the context, it does so in the wrong way and the model will get confused.

slaren (Member) commented Jul 3, 2024

It is likely a bug in build_rope_factors and/or the value perplexity sets for n_seq_max. Forcing it to return rope_long works.

perplexity: calculating perplexity over 40 chunks, n_ctx=8192, batch_size=512, n_seq=1
perplexity: 2.38 seconds per pass - ETA 1.58 minutes
[1]7.1336,[2]5.1053,[3]5.5121,[4]5.8647,[5]5.9926,[6]6.0420,[7]6.0684,[8]6.3195,[9]5.9890,[10]5.8958,

bartowski1182 (Contributor)

Ooh, that's interesting... good find. Where is that change being made, sorry?

ngxson (Collaborator, Author) commented Jul 3, 2024

OK I see: probably because yarn is removed, there is no more n_ctx_orig_yarn, so it never returns model.layers[il].rope_long.

See: https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/commit/10d25dfa1593265daf4a3bac573ab76ccf61d60f.diff?file=modeling_phi3.py

@bartowski1182 In llama.cpp's build_rope_factors, we simply return model.layers[il].rope_long without any checks.

slaren (Member) commented Jul 3, 2024

I think the problem is that perplexity sets n_seq_max to 4, so it takes 16k context before it starts using the long rope factors.
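
Concretely, here is a hedged Python paraphrase of the selection logic being discussed (the real code is C++ in build_rope_factors; names are approximate, and the division by n_seq_max is the detail that matters here):

# Hedged paraphrase, not the actual C++: the context is split across sequences before
# being compared with the model's original training context, so n_seq_max changes
# which factor list is used.
def pick_rope_factors(n_ctx, n_seq_max, n_ctx_orig, rope_long, rope_short):
    n_ctx_per_seq = n_ctx // n_seq_max
    if n_ctx_per_seq > n_ctx_orig:  # assumption: strict comparison against the 4k training context
        return rope_long
    return rope_short

# llama-cli with -c 8192 uses a single sequence: 8192 > 4096, so the long factors apply.
# perplexity with n_seq_max = 4: 8192 // 4 = 2048 <= 4096, so the short factors are used,
# which matches the bad ppl numbers above.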

bartowski1182 (Contributor)

Ah, so it's more than just a conversion change; it'll need an update to the runtime as well? Does it work with this conversion change plus the new llama.cpp change, or does it need a different conversion AND llama.cpp changes?

slaren (Member) commented Jul 3, 2024

I think this is fine as is. It will work on programs that set n_seq_max correctly. We can fix the issue with perplexity in another PR.

ngxson (Collaborator, Author) commented Jul 3, 2024

You're right, @slaren: when running with llama-cli -m ... -c 8192, it always returns rope_long. So I assume it should work at least in llama-cli (not sure if it has an impact on llama-server with multiple sequences).

ngxson (Collaborator, Author) commented Jul 3, 2024

A small note: in the Hugging Face model they simply renamed Phi3SuScaledRotaryEmbedding to Phi3LongRoPEScaledRotaryEmbedding without changing any code inside, so "longrope" should just be "su" under another name.
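
For context, a hedged sketch of what that su/longrope scaling does with the factor lists, as I understand the method (not the actual Phi3LongRoPEScaledRotaryEmbedding code):

import math

# Hedged sketch: each RoPE inverse frequency is divided by a per-dimension factor from
# the long or short list, and attention magnitude is compensated with a single scalar.
def su_scaled_inv_freq(base_inv_freq, factors, n_ctx_orig, n_ctx_max):
    scaled = [f0 / f for f0, f in zip(base_inv_freq, factors)]
    scale = n_ctx_max / n_ctx_orig
    # assumption: the usual su/longrope-style magnitude correction
    mscale = 1.0 if scale <= 1.0 else math.sqrt(1.0 + math.log(scale) / math.log(n_ctx_orig))
    return scaled, mscale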

I'm merging this PR now.

ngxson merged commit 916248a into ggml-org:master on Jul 3, 2024 (8 checks passed)
bartowski1182 (Contributor)

I'm still suspicious, but I'll run some tests and report back if there's anything off. Maybe I'm just naturally too suspicious, though.

andysalerno (Contributor)

Will the "old" Phi-3s continue to run and convert properly with this change?

Also, thank you for the diligence; I think it really helps catch subtle issues like these.

jxy (Contributor) commented Jul 3, 2024

It looks good to me. I tried converting to Q8_0 from f32.

4k ppl

perplexity: calculating perplexity over 81 chunks, n_ctx=4096, batch_size=2048, n_seq=1
perplexity: 17.06 seconds per pass - ETA 23.03 minutes
[1]6.3298,[2]6.2972,[3]5.8147,[4]5.0729,[5]5.1476,[6]5.2719,[7]5.4058,[8]5.5532,

8k ppl

perplexity: calculating perplexity over 40 chunks, n_ctx=8192, batch_size=2048, n_seq=1
perplexity: 43.26 seconds per pass - ETA 28.83 minutes
[1]7.1303,[2]5.0979,[3]5.5071,[4]5.8604,

16k ppl

perplexity: calculating perplexity over 20 chunks, n_ctx=16384, batch_size=2048, n_seq=1
perplexity: 133.90 seconds per pass - ETA 44.63 minutes
[1]4.0957,[2]5.0421,

and tried a simple test

$ ./llama-cli -m models/Phi-3.1-mini-128k-instruct-Q8_0.gguf -ngl 99 -c 16384 -n -1 --temp 0 -fa -p "<|system|>
As a CS PhD, explain the patch given to you.<|end|>          
<|user|>
$(curl -L https://github.com/ggerganov/llama.cpp/pull/8227.patch)<|end|>
<|assistant|>
"

and got

  The patches provided by ngxson and others are a series of changes made to the `llama.cpp` file, which is part of the Llama model implementation. The changes are related to the addition and modification of features for sliding window attention (SWA) in the model. Here's a summary of the changes:

1. Patch 1: Added new enum values for `LLM_KV_CONTEXT_LENGTH_SWA` and `LLM_KV_EMBEDDING_LENGTH` to the `llm_kv` enum.
2. Patch 2: Fixed an issue with the `data_swa` variable being uninitialized.
3. Patch 3: Improved naming for the `sliding_window` parameter in the `set_gguf_parameters` function.
4. Patch 4: Replaced a list of tensors with a single tensor for the `inp_KQ_mask_SWA` variable.
5. Patch 5: Co-authored by Arlo Phoenix.
6. Patch 6: Updated the `n_sliding` parameter to `n_swa` for consistency with other parameters.
7. Patch 7: Improved the styling of the code.
8. Patch 8: Added a sanity check for the `query_pre_attn_scalar` parameter in the `set_gguf_parameters` function.
9. Patch 9: Fixed a typo in the README.md file.

Overall, these patches aim to improve the implementation of sliding window attention in the Llama model, making it more efficient and accurate. [end of text]

llama_print_timings:        load time =     862.99 ms
llama_print_timings:      sample time =       9.46 ms /   379 runs   (    0.02 ms per token, 40050.72 tokens per second)
llama_print_timings: prompt eval time =   69443.49 ms / 13841 tokens (    5.02 ms per token,   199.31 tokens per second)
llama_print_timings:        eval time =   44818.62 ms /   378 runs   (  118.57 ms per token,     8.43 tokens per second)
llama_print_timings:       total time =  114317.45 ms / 14219 tokens

Not brilliant, but alright for a small model.

MoonRide303 (Contributor) commented Jul 4, 2024

> I'm still suspicious, but I'll run some tests and report back if there's anything off. Maybe I'm just naturally too suspicious, though.

@bartowski1182 I did a custom test based on a 16k+ token input (the first 2133 lines of the NetHack Guidebook as a single message), and this updated version of Phi-3-mini-128k-instruct, freshly converted with llama.cpp b3291 (Q8_0 quant), seems to be working pretty well.

(The question and the model's answer were attached as screenshots.)

Launched with llama-server.exe -v -ngl 99 -m Phi-3-mini-128k-instruct-Q8_0.gguf -c 22528 (to fit into 16 GB of VRAM).

bartowski1182 (Contributor)

You are correct; I meant to update my comment. I did a test where I fed it some 20k tokens and it replied coherently, so my suspicions were completely unjustified. I appreciate you doing a good, thorough test to prove it, thank you!!

HydrogenBombaklot

Hi Bartowski, any update on Phi 3 Small?
