Add phi3 128K model support #7225
Conversation
Nice! I'll add the Metal support in a day or two if it is not yet pushed
Thanks @ggerganov for your help. I did not have a device to test Metal on, so I did not implement that part.
Looking into this now.
I would like to refactor the …
The refactor makes the API look much cleaner, truly great.
I would prefer if the scaling factors were exported as a tensor rather than metadata; it would remove quite a bit of code and it would be more efficient.
Yup, would be better to have the factors as tensors. @liuwei-git would you like to give this a go?
llama.cpp (Outdated)
// choose long/short freq factors based on the context size
const auto n_ctx = llama_n_ctx(&lctx);
Would this work correctly with multiple sequences? Maybe something like llama_n_ctx(&lctx) / llama_n_seq_max(&lctx) would be correct in more cases, but still not in every case.
> Maybe something like llama_n_ctx(&lctx) / llama_n_seq_max(&lctx)

For Transformer-like models, this would always equal 1. llama_n_ctx(&lctx) / cparams.n_seq_max would be what you meant.
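To make the suggestion concrete, here is a minimal sketch of the per-sequence selection being discussed; the names (`n_ctx`, `n_seq_max`, `n_ctx_orig`) are illustrative stand-ins for the corresponding llama.cpp fields, not the exact code from this PR:

```cpp
#include <cstdint>

// Choose between the long and short RoPE frequency factors based on the
// context available to a single sequence rather than the total context.
// With n_seq_max parallel sequences, each one only sees a slice of the
// KV context, so dividing first avoids picking the long factors too early.
static bool use_long_rope_factors(uint32_t n_ctx, uint32_t n_seq_max, uint32_t n_ctx_orig) {
    const uint32_t n_ctx_per_seq = n_ctx / n_seq_max;

    // the long factors only apply beyond the original training context
    return n_ctx_per_seq > n_ctx_orig;
}
```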
I think this is going to cause the rope to always be run on the CPU, because the scheduler prefers running ops that use weights in the backend of the weights. I will fix that after this is merged.
Ok. Btw, do you see something that could affect the performance of …
Looking at the graphs, it seems that the load time increased, but the throughput looks similar. Maybe it was a fluke? I can't reproduce it on my system either.
The model seems to be doing rather poorly. I cannot tell if it's a tokenizer issue or just the model itself, but I quantized the 128k medium instruct model to Q8_0 and it's failing pretty simple logic questions. Perhaps it's just not good with rather basic math? I tried a temperature of 1 down to 0.6 and even down to 0 and it's still not faring well on logical questions. I was expecting more from a phi model, which leads me to think it may be some other underlying issue. The question it specifically struggled on was the roll-of-tape question quoted in the transcripts below.
It gave some pretty dumb answers, such as the tape being over 1300 cm thick, and kept trying to correct itself, giving equally incorrect answers.
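For reference, a quick sanity check (this worked answer is ours, not from the thread): unrolling preserves the cross-sectional area of the annulus, so the thickness is

$$ t = \frac{\pi\,(R^2 - r^2)}{L} = \frac{\pi\,\left((5\ \text{cm})^2 - (2.5\ \text{cm})^2\right)}{10000\ \text{cm}} \approx 0.0059\ \text{cm} \approx 0.06\ \text{mm} $$

which is far from both the "over 1300 cm" answer above and the 3.75 cm answer in the transcripts below.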
Did you try running a 16 bit gguf model and seeing how that performs?
I tried the prompt on the 128k mini instruct model (f16 and Q4_K_M from https://huggingface.co/AlessandroW/Phi-3-mini-128k-instruct-gguf) and both models performed similarly to an older ChatGPT version (https://neil.fraser.name/news/2023/02/17/). The quantized model replied:
<|user|> There is a roll of tape. The tape is 100 meters long when unrolled. When rolled up, the outer diameter is 10 cm, and the inner diameter is 5 cm. How thick is the tape?<|end|><|assistant|> To find the thickness of the tape, we need to calculate the volume of the tape when it is rolled up and then use that to find the thickness.
First, let's calculate the volume of the cylinder formed by the outer diameter. The formula for the volume of a cylinder is V = πr²h, where r is the radius and h is the height (or length in this case).
The outer radius (r_outer) is half of the outer diameter, so r_outer = 10 cm / 2 = 5 cm.
The length of the tape (h) is 100 meters, but we need to convert it to centimeters because the radius is in centimeters. So, h = 100 meters * 100 cm/meter = 10000 cm.
Now, let's calculate the volume of the outer cylinder (V_outer):
V_outer = π * (r_outer)² * h
V_outer = π * (5 cm)² * 10000 cm
V_outer = π * 25 cm² * 10000 cm
V_outer = 250000π cm³
Next, let's calculate the volume of the inner cylinder (V_inner), which represents the empty space inside the tape. The inner radius (r_inner) is half of the inner diameter, so r_inner = 5 cm / 2 = 2.5 cm.
The volume of the inner cylinder (V_inner) is:
V_inner = π * (r_inner)² * h
V_inner = π * (2.5 cm)² * 10000 cm
V_inner = π * 6.25 cm² * 10000 cm
V_inner = 62500π cm³
The volume of the tape itself (V_tape) is the difference between the outer and are not standard mathematical operations, and thus are not applicable in this context.
Since the question seems to be asking for the volume of the tape itself, we will subtract the inner volume from the outer volume to find the volume of the tape:
V_tape = V_outer - V_inner
V_tape = 250000π cm³ - 62500π cm³
V_tape = 187500π cm³
To get the numerical value, we use the approximation π ≈ 3.14159:
V_tape ≈ 187500 * 3.14159 cm³
V_tape ≈ 588746.25 cm³
Therefore, the volume of the tape itself is approximately 588746.25 cubic centimeters.<|endoftext|>
The f16 model replied:
<|user|> There is a roll of tape. The tape is 100 meters long when unrolled. When rolled up, the outer diameter is 10 cm, and the inner diameter is 5 cm. How thick is the tape?<|end|><|assistant|> To find the thickness of the tape, we need to calculate the volume of the tape when it is rolled up and then use that to find the thickness.
First, let's find the volume of the entire length of the tape if it were a cylinder with the outer diameter. The formula for the volume of a cylinder is V = πr²h, where r is the radius and h is the height (or length in this case).
The outer radius (R) is half of the outer diameter, so R = 10 cm / 2 = 5 cm.
The length (h) of the tape is 100 meters, which we need to convert to centimeters because the diameter is in centimeters. 100 meters = 100 * 100 cm = 10000 cm.
Now, let's calculate the volume of the outer cylinder (V_outer):
V_outer = π * R² * h
V_outer = π * (5 cm)² * 10000 cm
V_outer = π * 25 cm² * 10000 cm
V_outer = 250000π cm³
Next, we need to calculate the volume of the inner cylinder, which represents the space that is not occupied by the tape. The inner radius (r) is half of the inner diameter, so r = 5 cm / 2 = 2.5 cm.
The volume of the inner cylinder (V_inner) is:
V_inner = π * r² * h
V_inner = π * (2.5 cm)² * 10000 cm
V_inner = π * 6.25 cm² * 10000 cm
V_inner = 62500π cm³
Now, to find the volume of the tape itself, we subtract the volume of the inner cylinder from the volume of the outer cylinder:
V_tile = V_outer - V_inner
V_tile = 250000π cm³ - 62500π cm³
V_tile = 187500π cm³
Finally, to find the thickness of the tape, we divide the volume of the tape by the surface area of the inner cylinder (since the thickness will be uniform and we are considering the volume that the tape occupies over the inner surface area):
Surface area of the inner cylinder (A_inner) is:
A_inner = 2π * r * h
A_inner = 2π * (2.5 cm) * 10000 cm
A_inner = 50000π cm²
The thickness of the tape (t) is:
t = V_tile / A_inner
t = 187500π cm³ / 50000π cm²
t = 3.75 cm
So, the thickness of the tape is 3.75 cm
Since the thickness of the tape is not a practical length (it's too large and not realistic for a thin tape), we need to reconsider our calculation. The thickness should be in the same order of magnitude as the width of the tape, which is 0.2 cm. Let's correct this by using the correct formula for the volume of the cylindrical shell (the tape):
V_tile = A_inner * t
We can now solve for the thickness (t):
t = V_tile / A_inner
t = 187500π cm³ / (π * 50000 cm * 100 cm)
t = 187500 / (50000 * 100)
t = 0.375 cm
Therefore, the thickness of the tape is 0.375 cm.<|endoftext|> [end of text]
The fp16 looks a bit better; at least it gets to an answer. How does the PyTorch model's answer compare?
I guess I got lucky with a seed. Phi-3-medium (q4 from https://huggingface.co/bartowski/Phi-3-medium-128k-instruct-GGUF, temp 0.1): There is a roll of tape. The tape is 100 meters long when unrolled. When rolled up, the outer diameter is 10 cm, and the inner diameter is 5 cm. How thick is the tape? Answer:
* add phi3 128k support in convert-hf-to-gguf
* add phi3 128k support in cuda
* address build warnings on llama.cpp
* adjust index value in cuda long rope freq factors
* add long rope support in ggml cpu backend
* make freq factors only depend on ctx size
* remove unused rope scaling type 'su' from gguf converter
* fix lint warnings on convert-hf-to-gguf.py
* set to the short freq factor when context size is smaller than trained context size
* add one line of comments
* metal : support rope freq_factors
* ggml : update ggml_rope_ext API to support freq. factors
* backends : add dev messages to support rope freq. factors
* minor : style
* tests : update to use new rope API
* backends : fix pragma semicolons
* minor : cleanup
* llama : move rope factors from KV header to tensors
* llama : remove tmp assert
* cuda : fix compile warning
* convert : read/write n_head_kv
* llama : fix uninitialized tensors

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
ref #6849
The only difference between the phi3 4k and 128k models is in the rotary embedding. The 128k model adds long/short rope scaling factors (freq_factors) and an attn factor to each hidden dimension. The choice of the long or short factors is based on the total length of the input sequences, i.e. the KV context size.
The attn factor value is based on the positional embedding size (see the sketch below).
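As an illustration of that last point, here is a hedged sketch of how the attn factor can be derived from the two positional embedding sizes; it mirrors the Hugging Face Phi-3 implementation, the function name is ours, and the exact formula used in this PR should be checked against the code:

```cpp
#include <cmath>

// Derive the attention scaling factor from the extended and original
// context sizes, e.g. 131072 and 4096 for the phi3 128k models.
static float phi3_attn_factor(float n_ctx_long, float n_ctx_orig) {
    const float scale = n_ctx_long / n_ctx_orig;
    if (scale <= 1.0f) {
        return 1.0f; // no scaling needed within the original context
    }
    return sqrtf(1.0f + logf(scale) / logf(n_ctx_orig));
}
```

For the 128k mini model (131072 / 4096) this gives roughly 1.19.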
Workflow
convert-hf-to-gguf.py: write the long/short freq factors to gguf metadata for the phi3 model
llama.cpp: select the long or short factors based on the context size and pass them to the rope op
ggml: update the rope op to support long/short freq factors (a sketch of the updated call follows this list)
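For reference, a sketch of what the updated rope call looks like with the factors passed as an extra tensor; the parameter order and names here are approximate, and ggml.h after this PR is the authoritative source for the exact signature:

```cpp
// The new tensor argument carries the per-dimension frequency factors and
// may be NULL for models that do not use them (variable names illustrative).
struct ggml_tensor * cur = ggml_rope_ext(
        ctx0, Qcur, inp_pos,
        rope_factors,              // long or short freq factors, or NULL
        n_rot, rope_type, n_ctx_orig,
        freq_base, freq_scale,
        ext_factor, attn_factor,
        beta_fast, beta_slow);
```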
Test
Use convert-hf-to-gguf.py (e.g. python convert-hf-to-gguf.py <model_dir>) to convert to gguf format.