
Add phi3 128K model support #7225

Merged - 22 commits merged into ggml-org:master on May 21, 2024

Conversation

@liuwei-git (Contributor) commented May 11, 2024

ref #6849

The only difference between the phi3 4k and 128k models is in the rotary embedding. The 128k model adds long/short rope scaling factors (freq_factors) and an attn factor for each hidden dimension. The choice between the long and short factors is based on the total length of the input sequences, i.e. the kv context size.

seq_len = torch.max(position_ids) + 1
if seq_len > self.original_max_position_embeddings:
    ext_factors = torch.tensor(self.long_factor, dtype=torch.float32, device=x.device)
else:
    ext_factors = torch.tensor(self.short_factor, dtype=torch.float32, device=x.device)

inv_freq_shape = torch.arange(0, self.dim, 2, dtype=torch.int64, device=x.device).float() / self.dim
self.inv_freq = 1.0 / (ext_factors * self.base**inv_freq_shape)
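
For readers without the surrounding class, a self-contained sketch of the same computation; the helper name and the example values below are illustrative, not taken from the actual model config:

import torch

def phi3_inv_freq(dim, base, position_ids, original_max_position_embeddings,
                  long_factor, short_factor):
    # Pick the long or short per-dimension factors based on the sequence length.
    seq_len = int(torch.max(position_ids)) + 1
    factors = long_factor if seq_len > original_max_position_embeddings else short_factor
    ext_factors = torch.tensor(factors, dtype=torch.float32)

    # Same formula as above: scaled inverse frequencies, one per rotary dimension pair.
    inv_freq_shape = torch.arange(0, dim, 2, dtype=torch.int64).float() / dim
    return 1.0 / (ext_factors * base ** inv_freq_shape)

# usage sketch: dim=8 gives 4 rotary pairs, so the factor lists have 4 entries
inv_freq = phi3_inv_freq(dim=8, base=10000.0,
                         position_ids=torch.arange(6000),
                         original_max_position_embeddings=4096,
                         long_factor=[1.05, 1.1, 1.2, 1.3],
                         short_factor=[1.0, 1.0, 1.0, 1.0])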

The attn factor value is based on the positional embedding size.

scale = self.max_position_embeddings / self.original_max_position_embeddings
if scale <= 1.0:
    scaling_factor = 1.0
else:
    scaling_factor = math.sqrt(1 + math.log(scale) / math.log(self.original_max_position_embeddings))
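
As a worked example, a minimal sketch assuming the commonly cited Phi-3 128k config values max_position_embeddings = 131072 and original_max_position_embeddings = 4096 (these numbers are not stated in this PR):

import math

max_position_embeddings = 131072          # assumed extended context
original_max_position_embeddings = 4096   # assumed original training context

scale = max_position_embeddings / original_max_position_embeddings  # 32.0
if scale <= 1.0:
    scaling_factor = 1.0
else:
    scaling_factor = math.sqrt(1 + math.log(scale) / math.log(original_max_position_embeddings))

print(scaling_factor)  # ~1.19, the attn factor applied on top of the rope rotation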

Workflow

  • convert-hf-to-gguf.py: Write long/short freq factors to gguf metadata for phi3 model

  • llama.cpp:

    • load the freq factors and attn factor from the metadata
    • pass the freq factors as an input tensor to the phi3 model graph, and use them as a source for the k/q rope tensors
    • choose the long or short freq factors based on the context size when setting the tensor value (see the sketch after this list)
  • ggml: update rope op to support long/short freq factors:

    • CPU
    • CUDA
    • Metal
    • SYCL
    • Vulkan
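
For reference, a minimal sketch of the long/short selection described above (Python for illustration; the function name and values are hypothetical, and llama.cpp implements the equivalent logic in C++ when filling the freq_factors tensor):

def select_freq_factors(n_ctx, n_ctx_orig_train, long_factor, short_factor):
    # n_ctx: the kv context size configured by the user
    # n_ctx_orig_train: the model's original training context, e.g. 4096
    if n_ctx > n_ctx_orig_train:
        return long_factor    # extended-context runs use the long factors
    return short_factor       # otherwise stay on the short factors

# usage sketch with illustrative values
factors = select_freq_factors(n_ctx=8192, n_ctx_orig_train=4096,
                              long_factor=[1.08, 1.17], short_factor=[1.0, 1.0])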

Test

    passkey phi3_128k_fp16.gguf 500

@mofosyne mofosyne added the model (Model specific) and Review Complexity : High (Generally require in-depth knowledge of LLMs or GPUs) labels May 12, 2024
@ggerganov (Member) left a comment

Nice! I'll add the Metal support in a day or two if it is not yet pushed

@liuwei-git (Contributor, Author) commented

Thanks @ggerganov for your help. I don't have a device to test Metal, so I didn't implement that part.

@ggerganov (Member) commented

Looking into this now

@ggerganov (Member) commented

I would like to refactor the ggml_rope_custom API and remove ggml_rope_with_freq_factors before merging - will push in a bit

@ggerganov ggerganov marked this pull request as ready for review May 16, 2024 10:33
@ggerganov ggerganov requested a review from slaren May 16, 2024 10:33
@liuwei-git (Contributor, Author) commented

I would like to refactor the ggml_rope_custom API and remove ggml_rope_with_freq_factors before merging - will push in a bit

The refactor makes the API look much cleaner, truly great.

@slaren (Member) commented May 16, 2024

I would prefer if the scaling factors were exported as a tensor rather than metadata; it would remove quite a bit of code and would be more efficient.

@ggerganov (Member) commented

Yup, would be better to have the factors as tensors. @liuwei-git would you like to give this a go?
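
For context, a minimal sketch of what the tensor export could look like on the conversion side; this assumes the gguf Python package's GGUFWriter interface, and the tensor names and values are illustrative rather than the ones the PR ended up using:

import numpy as np
from gguf import GGUFWriter

# Illustrative factors; the real ones come from the HF config's rope_scaling entry.
long_factor  = np.array([1.08, 1.17, 1.25, 1.33], dtype=np.float32)
short_factor = np.array([1.00, 1.00, 1.00, 1.00], dtype=np.float32)

writer = GGUFWriter("phi3-128k.gguf", "phi3")
# Store the factors as tensors instead of KV metadata (hypothetical tensor names).
writer.add_tensor("rope_factors_long.weight", long_factor)
writer.add_tensor("rope_factors_short.weight", short_factor)

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()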

llama.cpp Outdated
Comment on lines 11252 to 11253
// choose long/short freq factors based on the context size
const auto n_ctx = llama_n_ctx(&lctx);
Member comment:

Would this work correctly with multiple sequences? Maybe something like llama_n_ctx(&lctx) / llama_n_seq_max(&lctx) would be correct in more cases, but still not in every case.

Collaborator reply:

Maybe something like llama_n_ctx(&lctx) / llama_n_seq_max(&lctx)

For Transformer-like models, this would always equal 1.

llama_n_ctx(&lctx) / cparams.n_seq_max would be what you meant.

@github-actions github-actions bot added the python (python script changes), ggml (changes relating to the ggml tensor library for machine learning), and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels May 21, 2024
github-actions bot commented May 21, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 538 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8734.28ms p(95)=21554.83ms fails=, finish reason: stop=477 truncated=61
  • Prompt processing (pp): avg=104.82tk/s p(95)=509.78tk/s
  • Token generation (tg): avg=32.71tk/s p(95)=46.82tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=7528c705b0c741a68a1d85a523d827374c258195

[Charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, and requests_processing for llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 (duration=10m, 538 iterations)]

@mofosyne mofosyne added the merge ready (indicates that this may be ready to merge soon and is just holding out in case of objections) label May 21, 2024
@slaren (Member) commented May 21, 2024

I think this is going to cause the rope to always run on the CPU, because the scheduler prefers running ops that use weights on the backend where the weights are. I will fix that after this is merged.

@ggerganov (Member) commented May 21, 2024

Ok. Btw, do you see anything that could affect the performance of phi-2 (no rope factors)? The benchmark shows half the usual performance (217 iters) and I'm wondering if it is a fluke, because I can't reproduce it on my RTX 2060.

@slaren (Member) commented May 21, 2024

Looking at the graphs, it seems that the load time increased, but the throughput looks similar. Maybe it was a fluke?

I can't reproduce it on my system either.

| GPU | Model | Test | t/s master | t/s liuwei-git/master | Speedup |
| --- | --- | --- | --- | --- | --- |
| RTX 3090 Ti | phi2 3B Q8_0 | pp512 | 8543.90 | 8519.48 | 1.00 |
| RTX 3090 Ti | phi2 3B Q8_0 | tg128 | 185.05 | 184.47 | 1.00 |
| RTX 3090 Ti | phi2 3B Q8_0 | pp512+tg128 | 808.40 | 808.70 | 1.00 |

@dillfrescott commented May 21, 2024

The model seems to be doing rather poorly. I can't tell if it's a tokenizer issue or just the model itself, but I quantized the 128k medium instruct model to q8_0 and it's failing pretty simple logic questions. Perhaps it's just not good with rather basic math?

I tried temperatures from 1 down to 0.6 and even down to 0 and it's still not faring well on logical questions. I was expecting more from a phi model, which leads me to think it may be some other underlying issue.

The question I asked that it specifically struggled on was:

There is a roll of tape. The tape is 100 meters long when unrolled. When rolled up, the outer diameter is 10 cm, and the inner diameter is 5 cm. How thick is the tape?

It gave some pretty dumb answers such as the tape being over 1300 cm thick, and kept trying to correct itself, giving equally incorrect answers.

@ggerganov ggerganov merged commit 201cc11 into ggml-org:master May 21, 2024
62 of 73 checks passed
@RonanKMcGovern commented May 22, 2024 via email

@AlessandroW commented May 22, 2024

Did you try running a 16 bit gguf model and seeing how that performs?

I tried the prompt on the 128k mini instruct model (f16 and Q4_K_M from https://huggingface.co/AlessandroW/Phi-3-mini-128k-instruct-gguf) and both models performed similarly to an older ChatGPT version https://neil.fraser.name/news/2023/02/17/. The quantized model replied:

<|user|> There is a roll of tape. The tape is 100 meters long when unrolled. When rolled up, the outer diameter is 10 cm, and the inner diameter is 5 cm. How thick is the tape?<|end|><|assistant|> To find the thickness of the tape, we need to calculate the volume of the tape when it is rolled up and then use that to find the thickness.

First, let's calculate the volume of the cylinder formed by the outer diameter. The formula for the volume of a cylinder is V = πr²h, where r is the radius and h is the height (or length in this case).

The outer radius (r_outer) is half of the outer diameter, so r_outer = 10 cm / 2 = 5 cm.

The length of the tape (h) is 100 meters, but we need to convert it to centimeters because the radius is in centimeters. So, h = 100 meters * 100 cm/meter = 10000 cm.

Now, let's calculate the volume of the outer cylinder (V_outer):
V_outer = π * (r_outer)² * h
V_outer = π * (5 cm)² * 10000 cm
V_outer = π * 25 cm² * 10000 cm
V_outer = 250000π cm³

Next, let's calculate the volume of the inner cylinder (V_inner), which represents the empty space inside the tape. The inner radius (r_inner) is half of the inner diameter, so r_inner = 5 cm / 2 = 2.5 cm.

The volume of the inner cylinder (V_inner) is:
V_inner = π * (r_inner)² * h
V_inner = π * (2.5 cm)² * 10000 cm
V_inner = π * 6.25 cm² * 10000 cm
V_inner = 62500π cm³

The volume of the tape itself (V_tape) is the difference between the outer and are not standard mathematical operations, and thus are not applicable in this context.

Since the question seems to be asking for the volume of the tape itself, we will subtract the inner volume from the outer volume to find the volume of the tape:

V_tape = V_outer - V_inner
V_tape = 250000π cm³ - 62500π cm³
V_tape = 187500π cm³

To get the numerical value, we use the approximation π ≈ 3.14159:

V_tape ≈ 187500 * 3.14159 cm³
V_tape ≈ 588746.25 cm³

Therefore, the volume of the tape itself is approximately 588746.25 cubic centimeters.<|endoftext|>

The f16 model replied

<|user|> There is a roll of tape. The tape is 100 meters long when unrolled. When rolled up, the outer diameter is 10 cm, and the inner diameter is 5 cm. How thick is the tape?<|end|><|assistant|> To find the thickness of the tape, we need to calculate the volume of the tape when it is rolled up and then use that to find the thickness.

First, let's find the volume of the entire length of the tape if it were a cylinder with the outer diameter. The formula for the volume of a cylinder is V = πr²h, where r is the radius and h is the height (or length in this case).

The outer radius (R) is half of the outer diameter, so R = 10 cm / 2 = 5 cm.

The length (h) of the tape is 100 meters, which we need to convert to centimeters because the diameter is in centimeters. 100 meters = 100 * 100 cm = 10000 cm.

Now, let's calculate the volume of the outer cylinder (V_outer):
V_outer = π * R² * h
V_outer = π * (5 cm)² * 10000 cm
V_outer = π * 25 cm² * 10000 cm
V_outer = 250000π cm³

Next, we need to calculate the volume of the inner cylinder, which represents the space that is not occupied by the tape. The inner radius (r) is half of the inner diameter, so r = 5 cm / 2 = 2.5 cm.

The volume of the inner cylinder (V_inner) is:
V_inner = π * r² * h
V_inner = π * (2.5 cm)² * 10000 cm
V_inner = π * 6.25 cm² * 10000 cm
V_inner = 62500π cm³

Now, to find the volume of the tape itself, we subtract the volume of the inner cylinder from the volume of the outer cylinder:

V_tile = V_outer - V_inner
V_tile = 250000π cm³ - 62500π cm³
V_tile = 187500π cm³

Finally, to find the thickness of the tape, we divide the volume of the tape by the surface area of the inner cylinder (since the thickness will be uniform and we are considering the volume that the tape occupies over the inner surface area):

Surface area of the inner cylinder (A_inner) is:
A_inner = 2π * r * h
A_inner = 2π * (2.5 cm) * 10000 cm
A_inner = 50000π cm²

The thickness of the tape (t) is:
t = V_tile / A_inner
t = 187500π cm³ / 50000π cm²
t = 3.75 cm

So, the thickness of the tape is 3.75 cm

Since the thickness of the tape is not a practical length (it's too large and not realistic for a thin tape), we need to reconsider our calculation. The thickness should be in the same order of magnitude as the width of the tape, which is 0.2 cm. Let's correct this by using the correct formula for the volume of the cylindrical shell (the tape):

V_tile = A_inner * t

We can now solve for the thickness (t):

t = V_tile / A_inner
t = 187500π cm³ / (π * 50000 cm * 100 cm)
t = 187500 / (50000 * 100)
t = 0.375 cm

Therefore, the thickness of the tape is 0.375 cm.<|endoftext|> [end of text]

@RonanKMcGovern commented May 23, 2024 via email

@RachidAR commented

I guess I got lucky with a seed. Phi-3-medium (q4 from https://huggingface.co/bartowski/Phi-3-medium-128k-instruct-GGUF, temp 0.1):

There is a roll of tape. The tape is 100 meters long when unrolled. When rolled up, the outer diameter is 10 cm, and the inner diameter is 5 cm. How thick is the tape?

Answer:

To find out how thick the tape is, we can use the formula for the volume of a cylindrical shell (the shape that represents the unrolled tape). The formula is V = πh(R^2 - r^2), where h is the height or length of the cylinder, R is the outer radius and r is the inner radius.
First, let's convert all measurements to meters for consistency:
- Length (height) of tape, h = 100 m
- Outer diameter = 10 cm = 0.1 m
- Inner diameter = 5 cm = 0.05 m
Now we can find the outer and inner radii by dividing the diameters by 2:
- Outer radius, R = 0.1 m / 2 = 0.05 m
- Inner radius, r = 0.05 m / 2 = 0.025 m
Now we can plug these values into the formula for volume of a cylindrical shell:
V = πh(R^2 - r^2)
We know that V (volume) is equal to the length times the thickness, so let's call the thickness "t" and set up an equation:
100 m * t = π * 100 m * ((0.05 m)^2 - (0.025 m)^2)
Now we can solve for t:
t = (π * ((0.05 m)^2 - (0.025 m)^2)) / 100
t ≈ (3.1416 * (0.0025 m^2 - 0.000625 m^2)) / 100
t ≈ (3.1416 * 0.001875 m^2) / 100
t ≈ 0.0059 m
So the thickness of the tape is approximately 0.0059 meters, or about 5.9 millimeters.


teleprint-me pushed a commit to teleprint-me/llama.cpp that referenced this pull request May 23, 2024
* add phi3 128k support in convert-hf-to-gguf

* add phi3 128k support in cuda

* address build warnings on llama.cpp

* adjust index value in cuda long rope freq factors

* add long rope support in ggml cpu backend

* make freq factors only depend on ctx size

* remove unused rope scaling type 'su' from gguf converter

* fix lint warnings on convert-hf-to-gguf.py

* set to the short freq factor when context size is smaller than the trained context size

* add one line of comments

* metal : support rope freq_factors

* ggml : update ggml_rope_ext API to support freq. factors

* backends : add dev messages to support rope freq. factors

* minor : style

* tests : update to use new rope API

* backends : fix pragma semicolons

* minor : cleanup

* llama : move rope factors from KV header to tensors

* llama : remove tmp assert

* cuda : fix compile warning

* convert : read/write n_head_kv

* llama : fix uninitialized tensors

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@compilade compilade mentioned this pull request Jul 1, 2024
@ngxson ngxson mentioned this pull request Jul 3, 2024
2 tasks
Labels: examples, ggml, merge ready, model, Nvidia GPU, python, Review Complexity : High, SYCL, testing, Vulkan
9 participants