Add phi3 128K model support #7225
Conversation
Nice! I'll add the Metal support in a day or two if it is not yet pushed
Thanks @ggerganov for your help. I did not have a device to test Metal on, so I did not implement that part.
Looking into this now.
I would like to refactor the …
The refactor makes the API look much cleaner, truly great.
I would prefer if the scaling factors were exported as a tensor rather than metadata; it would remove quite a bit of code and it would be more efficient.
Yup, would be better to have the factors as tensors. @liuwei-git would you like to give this a go?
llama.cpp (Outdated)
// choose long/short freq factors based on the context size
const auto n_ctx = llama_n_ctx(&lctx);
Would this work correctly with multiple sequences? Maybe something like llama_n_ctx(&lctx) / llama_n_seq_max(&lctx) would be correct in more cases, but still not in every case.
> Maybe something like llama_n_ctx(&lctx) / llama_n_seq_max(&lctx)

For Transformer-like models, this would always equal 1. llama_n_ctx(&lctx) / cparams.n_seq_max would be what you meant.
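To make the suggestion concrete, here is a minimal sketch of the per-sequence selection being discussed; the names (`n_ctx`, `n_seq_max`, `n_ctx_orig`) are illustrative stand-ins for the corresponding llama.cpp fields, not the exact code from this PR:

```cpp
#include <cstdint>

// Choose between the long and short RoPE frequency factors based on the
// context available to a single sequence rather than the total context.
// With n_seq_max parallel sequences, each one only sees a slice of the
// KV context, so dividing first avoids picking the long factors too early.
static bool use_long_rope_factors(uint32_t n_ctx, uint32_t n_seq_max, uint32_t n_ctx_orig) {
    const uint32_t n_ctx_per_seq = n_ctx / n_seq_max;

    // the long factors only apply beyond the original training context
    return n_ctx_per_seq > n_ctx_orig;
}
```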
I think this is going to cause the rope to always be run on the CPU, because the scheduler prefers running ops that use weights in the backend of the weights. I will fix that after this is merged.
Ok. Btw, do you see something that could affect the performance of …
Looking at the graphs, it seems that the load time increased, but the throughput looks similar. Maybe it was a fluke? I can't reproduce it on my system either.
The model seems to be doing rather poorly. I cannot tell if it's a tokenizer issue or just the model itself, but I quantized the 128k medium instruct model to Q8_0 and it's failing pretty simple logic questions. Perhaps it's just not good with rather basic math? I tried a temperature of 1 down to 0.6 and even down to 0 and it's still not faring well on logical questions. I was expecting more from a phi model, which leads me to think it may be some other underlying issue. The question it specifically struggled on was the roll-of-tape question quoted in the transcripts below.
It gave some pretty dumb answers, such as the tape being over 1300 cm thick, and kept trying to correct itself, giving equally incorrect answers.
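For reference, a quick sanity check (this worked answer is ours, not from the thread): unrolling preserves the cross-sectional area of the annulus, so the thickness is

$$ t = \frac{\pi\,(R^2 - r^2)}{L} = \frac{\pi\,\left((5\ \text{cm})^2 - (2.5\ \text{cm})^2\right)}{10000\ \text{cm}} \approx 0.0059\ \text{cm} \approx 0.06\ \text{mm} $$

which is far from both the "over 1300 cm" answer above and the 3.75 cm answer in the transcripts below.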
Did you try running a 16 bit gguf model and seeing how that performs?
I tried the prompt on the 128k mini instruct model (f16 and Q4_K_M from https://huggingface.co/AlessandroW/Phi-3-mini-128k-instruct-gguf) and both models performed similarly to an older ChatGPT version (https://neil.fraser.name/news/2023/02/17/). The quantized model replied:
<|user|> There is a roll of tape. The tape is 100 meters long when unrolled. When rolled up, the outer diameter is 10 cm, and the inner diameter is 5 cm. How thick is the tape?<|end|><|assistant|> To find the thickness of the tape, we need to calculate the volume of the tape when it is rolled up and then use that to find the thickness.
First, let's calculate the volume of the cylinder formed by the outer diameter. The formula for the volume of a cylinder is V = πr²h, where r is the radius and h is the height (or length in this case).
The outer radius (r_outer) is half of the outer diameter, so r_outer = 10 cm / 2 = 5 cm.
The length of the tape (h) is 100 meters, but we need to convert it to centimeters because the radius is in centimeters. So, h = 100 meters * 100 cm/meter = 10000 cm.
Now, let's calculate the volume of the outer cylinder (V_outer):
V_outer = π * (r_outer)² * h
V_outer = π * (5 cm)² * 10000 cm
V_outer = π * 25 cm² * 10000 cm
V_outer = 250000π cm³
Next, let's calculate the volume of the inner cylinder (V_inner), which represents the empty space inside the tape. The inner radius (r_inner) is half of the inner diameter, so r_inner = 5 cm / 2 = 2.5 cm.
The volume of the inner cylinder (V_inner) is:
V_inner = π * (r_inner)² * h
V_inner = π * (2.5 cm)² * 10000 cm
V_inner = π * 6.25 cm² * 10000 cm
V_inner = 62500π cm³
The volume of the tape itself (V_tape) is the difference between the outer and are not standard mathematical operations, and thus are not applicable in this context.
Since the question seems to be asking for the volume of the tape itself, we will subtract the inner volume from the outer volume to find the volume of the tape:
V_tape = V_outer - V_inner
V_tape = 250000π cm³ - 62500π cm³
V_tape = 187500π cm³
To get the numerical value, we use the approximation π ≈ 3.14159:
V_tape ≈ 187500 * 3.14159 cm³
V_tape ≈ 588746.25 cm³
Therefore, the volume of the tape itself is approximately 588746.25 cubic centimeters.<|endoftext|>
The f16 model replied:
<|user|> There is a roll of tape. The tape is 100 meters long when unrolled. When rolled up, the outer diameter is 10 cm, and the inner diameter is 5 cm. How thick is the tape?<|end|><|assistant|> To find the thickness of the tape, we need to calculate the volume of the tape when it is rolled up and then use that to find the thickness.
First, let's find the volume of the entire length of the tape if it were a cylinder with the outer diameter. The formula for the volume of a cylinder is V = πr²h, where r is the radius and h is the height (or length in this case).
The outer radius (R) is half of the outer diameter, so R = 10 cm / 2 = 5 cm.
The length (h) of the tape is 100 meters, which we need to convert to centimeters because the diameter is in centimeters. 100 meters = 100 * 100 cm = 10000 cm.
Now, let's calculate the volume of the outer cylinder (V_outer):
V_outer = π * R² * h
V_outer = π * (5 cm)² * 10000 cm
V_outer = π * 25 cm² * 10000 cm
V_outer = 250000π cm³
Next, we need to calculate the volume of the inner cylinder, which represents the space that is not occupied by the tape. The inner radius (r) is half of the inner diameter, so r = 5 cm / 2 = 2.5 cm.
The volume of the inner cylinder (V_inner) is:
V_inner = π * r² * h
V_inner = π * (2.5 cm)² * 10000 cm
V_inner = π * 6.25 cm² * 10000 cm
V_inner = 62500π cm³
Now, to find the volume of the tape itself, we subtract the volume of the inner cylinder from the volume of the outer cylinder:
V_tile = V_outer - V_inner
V_tile = 250000π cm³ - 62500π cm³
V_tile = 187500π cm³
Finally, to find the thickness of the tape, we divide the volume of the tape by the surface area of the inner cylinder (since the thickness will be uniform and we are considering the volume that the tape occupies over the inner surface area):
Surface area of the inner cylinder (A_inner) is:
A_inner = 2π * r * h
A_inner = 2π * (2.5 cm) * 10000 cm
A_inner = 50000π cm²
The thickness of the tape (t) is:
t = V_tile / A_inner
t = 187500π cm³ / 50000π cm²
t = 3.75 cm
So, the thickness of the tape is 3.75 cm
Since the thickness of the tape is not a practical length (it's too large and not realistic for a thin tape), we need to reconsider our calculation. The thickness should be in the same order of magnitude as the width of the tape, which is 0.2 cm. Let's correct this by using the correct formula for the volume of the cylindrical shell (the tape):
V_tile = A_inner * t
We can now solve for the thickness (t):
t = V_tile / A_inner
t = 187500π cm³ / (π * 50000 cm * 100 cm)
t = 187500 / (50000 * 100)
t = 0.375 cm
Therefore, the thickness of the tape is 0.375 cm.<|endoftext|> [end of text]
The fp16 looks a bit better; at least it gets to an answer. How does the PyTorch model's answer compare?
I guess I got lucky with a seed. Phi-3-medium (q4 from https://huggingface.co/bartowski/Phi-3-medium-128k-instruct-GGUF, temp 0.1): There is a roll of tape. The tape is 100 meters long when unrolled. When rolled up, the outer diameter is 10 cm, and the inner diameter is 5 cm. How thick is the tape? Answer:
* add phi3 128k support in convert-hf-to-gguf
* add phi3 128k support in cuda
* address build warnings on llama.cpp
* adjust index value in cuda long rope freq factors
* add long rope support in ggml cpu backend
* make freq factors only depend on ctx size
* remove unused rope scaling type 'su' from gguf converter
* fix lint warnings on convert-hf-to-gguf.py
* set to the short freq factor when context size is smaller than trained context size
* add one line of comments
* metal : support rope freq_factors
* ggml : update ggml_rope_ext API to support freq. factors
* backends : add dev messages to support rope freq. factors
* minor : style
* tests : update to use new rope API
* backends : fix pragma semicolons
* minor : cleanup
* llama : move rope factors from KV header to tensors
* llama : remove tmp assert
* cuda : fix compile warning
* convert : read/write n_head_kv
* llama : fix uninitialized tensors

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
ref #6849
The only difference between the phi3 4k and 128k models is in the rotary embedding. The 128k model adds long/short rope scaling factors (freq_factors) and an attn factor to each hidden dimension. The choice of the long or short factors is based on the total length of the input sequences, i.e. the KV context size.
The attn factor value is based on the positional embedding size (see the sketch below).
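As an illustration of that last point, here is a hedged sketch of how the attn factor can be derived from the two positional embedding sizes; it mirrors the Hugging Face Phi-3 implementation, the function name is ours, and the exact formula used in this PR should be checked against the code:

```cpp
#include <cmath>

// Derive the attention scaling factor from the extended and original
// context sizes, e.g. 131072 and 4096 for the phi3 128k models.
static float phi3_attn_factor(float n_ctx_long, float n_ctx_orig) {
    const float scale = n_ctx_long / n_ctx_orig;
    if (scale <= 1.0f) {
        return 1.0f; // no scaling needed within the original context
    }
    return sqrtf(1.0f + logf(scale) / logf(n_ctx_orig));
}
```

For the 128k mini model (131072 / 4096) this gives roughly 1.19.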
Workflow
convert-hf-to-gguf.py: write the long/short freq factors to gguf metadata for the phi3 model
llama.cpp: select the long or short factors based on the context size and pass them to the rope op
ggml: update the rope op to support long/short freq factors (a sketch of the updated call follows this list)
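For reference, a sketch of what the updated rope call looks like with the factors passed as an extra tensor; the parameter order and names here are approximate, and ggml.h after this PR is the authoritative source for the exact signature:

```cpp
// The new tensor argument carries the per-dimension frequency factors and
// may be NULL for models that do not use them (variable names illustrative).
struct ggml_tensor * cur = ggml_rope_ext(
        ctx0, Qcur, inp_pos,
        rope_factors,              // long or short freq factors, or NULL
        n_rot, rope_type, n_ctx_orig,
        freq_base, freq_scale,
        ext_factor, attn_factor,
        beta_fast, beta_slow);
```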
Test
Use convert-hf-to-gguf.py (e.g. python convert-hf-to-gguf.py <model_dir>) to convert to gguf format.