
ggml: aarch64: implement SVE kernels for q2_k_q8_k vector dot #12064

Open
Vithulep wants to merge 3 commits into master
Conversation

Vithulep
Contributor

This PR introduces support for SVE (Scalable Vector Extension) kernels for the q2_K_q8_K vector dot product on the Arm architecture. Similar SVE support was proposed in PRs #7433 and #11227.

This PR contains the SVE implementation of the vector dot product used to compute the Q2_K quantization. Accuracy and performance were measured by running a Q2_K-quantized mistral-7b-v01 model on Graviton 3E (Perf 21 XL).
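
For readers less familiar with SVE, the sketch below shows the general shape of a predicated SVE dot-product loop using ACLE intrinsics. It is a simplified illustration only, not the q2_K_q8_K kernel from this PR (which additionally unpacks the 2-bit quants and applies the per-block scales):

```c
// Minimal SVE dot-product sketch (illustration only, not the PR's kernel).
// Build with e.g.: gcc -O2 -march=armv8-a+sve -c sve_dot.c
#include <arm_sve.h>

float sve_dot_f32(const float *a, const float *b, int n) {
    svfloat32_t acc = svdup_n_f32(0.0f);
    for (int i = 0; i < n; i += (int) svcntw()) {
        // Predicate covers the remaining (possibly partial) lanes.
        svbool_t pg = svwhilelt_b32(i, n);
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        acc = svmla_f32_m(pg, acc, va, vb);   // acc += va * vb on active lanes
    }
    return svaddv_f32(svptrue_b32(), acc);    // horizontal sum of the accumulator
}
```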

Performance

With this PR, the SVE implementation is ~1.03x to ~1.09x faster than the NEON implementation.

  • Decoding Throughput (TPOT)

| Threads | NEON (original) | This PR (SVE) | Ratio |
|--------:|----------------:|--------------:|------:|
|       2 |            4.31 |          4.67 |  1.08 |
|       4 |            8.43 |          9.17 |  1.09 |
|       8 |           16.24 |         17.56 |  1.08 |
|      16 |           30.04 |         32.24 |  1.07 |
|      32 |           50.06 |         53.12 |  1.06 |
|      48 |           58.05 |         59.78 |  1.03 |

The command used to measure the performance:

```sh
./llama-bench -m ${PATH_TO_MODEL} -n 0 -n 16 -p 64 -t 2,4,8,16,32,48
```

Perplexity

I ran perplexity with both the NEON (original) and SVE (this PR) implementations; the summary is below.

| NEON (original)    | SVE (this PR)      |
|--------------------|--------------------|
| 3.1285 +/- 0.40252 | 3.1289 +/- 0.40320 |

This change does not appear to have any impact on accuracy.
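
For reference, perplexity in llama.cpp is measured with the perplexity tool; a representative invocation is shown below. The dataset and exact arguments used for the numbers above are not stated in this PR, so treat this command only as an assumed example:

```sh
./llama-perplexity -m ${PATH_TO_MODEL} -f ${PATH_TO_TEXT_FILE} -t 32
```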

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Feb 25, 2025
Comment on lines 4598 to 4603
```c
switch (vector_length) {
case 128:
    for (int i = 0; i < nb; ++i) {
        const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
        svfloat32_t d_broad = svdup_n_f32((float32_t)d);
        const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
```
Member


Use 4-space indentation in the switch cases:

Suggested change:

```diff
-switch (vector_length) {
-case 128:
-    for (int i = 0; i < nb; ++i) {
-        const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-        svfloat32_t d_broad = svdup_n_f32((float32_t)d);
-        const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+switch (vector_length) {
+    case 128:
+        for (int i = 0; i < nb; ++i) {
+            const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
+            svfloat32_t d_broad = svdup_n_f32((float32_t)d);
+            const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
```
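
For context on the switch (vector_length) shown above: the SVE kernels select a code path based on the vector register width reported by the hardware at run time. A minimal sketch of that dispatch pattern follows; the function name and comments are illustrative and not taken verbatim from this PR:

```c
#include <arm_sve.h>

// Illustrative dispatch on the runtime SVE vector length.
// svcntb() returns the SVE register width in bytes; multiplying by 8 gives bits.
static void dispatch_on_vector_length(void) {
    const int vector_length = (int) svcntb() * 8;
    switch (vector_length) {
        case 128:
            // 128-bit SVE code path
            break;
        case 256:
            // 256-bit SVE code path
            break;
        default:
            // fall back to a generic (e.g. NEON) implementation
            break;
    }
}
```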
