
PR: Refine ggml-qnn backend (QNN, Qualcomm Neural Network, aka Qualcomm AI Engine Direct) for latest ggml, whisper.cpp, llama.cpp #246

Closed
zhouwg opened this issue Feb 5, 2025 · 3 comments

Comments

@zhouwg
Member

zhouwg commented Feb 5, 2025


Required Info

  • Branch: master
  • Device Vendor: Xiaomi 14 (equipped with Qualcomm SM8650-AB Snapdragon 8 Gen 3)
  • Device OS Version: Android 14
  • Section: ggml-qnn

PR Description

There is a long story behind this ggml-qnn backend; please refer to:

Thanks to the big changes in software architecture in the latest upstream llama.cpp (especially the "backend scheduler" feature, which has been introduced and matured), this refined implementation works pretty well for ASR inference via whisper.cpp and LLM inference via llama.cpp on the Xiaomi 14 (equipped with the Qualcomm Snapdragon 8 Gen 3).

[2025-02-10] The source code of this PR is available in both project kantv and kantvai-ggmlqnn.
[2025-02-12] Created kantv.ai; this PR will be submitted to the upstream llama.cpp by a formal member of kantv-ai later, after another round of sanity checks and bug fixes.
[2025-02-13] This PR has been submitted to the upstream llama.cpp community by a formal member of the kantv-ai team.
[2025-02-24] Third PR in upstream: ggml-org/llama.cpp#12049

How to verify the PR

This PR can be verified easily with a standard Android APP from the master branch of project kantv (please see the screenshots below, taken on 02-05-2025), or with the official test-backend-ops or llama-cli command-line applications from llama.cpp in kantv-ai.

  git clone https://github.com/kantv-ai/llama.cpp
  cd llama.cpp
  git checkout kantvai-ggmlqnn
  ./scripts/build-run-android.sh build          # set up the local build environment automatically and build the entire project
  ./scripts/build-run-android.sh updateqnnlib   # upload Qualcomm's QNN binary runtime libs to the Android phone
  ./scripts/build-run-android.sh run_llamacli   # run llama-cli on the Android phone
  ./scripts/build-run-android.sh run_testop     # run test-backend-ops on the Android phone

We can confirm that this backend works as expected from the log output of "adb logcat | grep ggml-qnn".

General notes

  • Put the main logic in one single source file (ggml-qnn.cpp), because that makes it much easier for other experienced programmers to get involved in the dev activity, similar to what ggerganov did at the very beginning of ggml.c/llama.cpp, what Intel did at the very beginning of ggml-sycl.cpp, and what Qualcomm did at the very beginning of ggml-opencl.cpp. We should make this refined ggml-qnn backend work well before any code reconstruction, because code reconstruction is not the key point at the current phase. Please refer to the point of view from Oriol Vinyals, VP of Research & Deep Learning Lead at Google DeepMind:

Image

  • If a Chinese independent programmer wants to help with source code or participate in the dev activity of ggml-qnn, please follow this coding style and focus on the real key point of this ggml backend: how to utilize the Hexagon NPU maximally with the well-designed "backend scheduler" in the latest upstream llama.cpp. Please contribute ideas or code in this single source file, ggml-qnn.cpp, or at least please don't bring interference again and again (we can walk our way independently). Thanks for your cooperation.

  • The previous implementation and this refined implementation of ggml-qnn are mainly ported from executorch (the QNN backend implementation in executorch comes from Qualcomm), with breakthrough help from chiwwang@Qualcomm Technologies Inc. I also got meaningful help from XiaoMi-StableDiffusionOnDevice. Any other similar PRs in the upstream llama.cpp are greatly welcomed, so I can learn something from those PRs.

  • Of course, it would be great if Qualcomm's QTI/QuIC submitted an official PR with a ggml-qnn implementation to the upstream llama.cpp, similar to what Intel/Huawei/AMD/Moore Threads did in the upstream llama.cpp community.
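To illustrate the "backend scheduler" idea mentioned in the notes above: conceptually, the scheduler asks each backend which ops it can handle and falls back to the CPU backend for the rest. The following is a toy, hypothetical sketch of that split; the names Op, npu_supports, and schedule are invented for illustration, and the real ggml backend scheduler operates on ggml's own tensor types and per-backend supports_op() callbacks:

```cpp
#include <string>
#include <vector>

// Hypothetical op descriptor; the real ggml backend scheduler inspects
// ggml tensors rather than plain op names.
struct Op {
    std::string name;
};

// Toy capability check: pretend the QNN/NPU backend only handles
// MUL_MAT and ADD (purely illustrative, not the real supported-op set).
static bool npu_supports(const Op & op) {
    return op.name == "MUL_MAT" || op.name == "ADD";
}

// Assign each op to "NPU" when supported, otherwise fall back to "CPU",
// mirroring in spirit how the backend scheduler splits a compute graph.
static std::vector<std::string> schedule(const std::vector<Op> & graph) {
    std::vector<std::string> placement;
    placement.reserve(graph.size());
    for (const Op & op : graph) {
        placement.push_back(npu_supports(op) ? "NPU" : "CPU");
    }
    return placement;
}
```

The point of this design is that a backend never needs to implement every op: unsupported ops simply run on the CPU, and the backend can grow its supported-op set incrementally.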

Special notes

Thanks are due to a forked llama.cpp project by a Chinese programmer whom I don't know, from which I borrowed 5-7 functions. I'd like to cooperate with this programmer if he intends to cooperate with me on the ggml-qnn backend: code review in the upstream llama.cpp community is still welcome, but please read my "general notes" carefully; once again, code reconstruction via C++ is NOT the key point at the current phase, although I think this programmer's C++ skill is good. I don't want to say anything else about this programmer, because I don't want programmers from the US and EU to think this is another joke from China. There is also a long story behind this, and it brought an unrecoverable result for me: I was blocked in the upstream llama.cpp community because of

  • his unprofessional behavior in that PR in the upstream llama.cpp community: this programmer didn't bring any meaningful technical comments or suggestions (such as bug fixes in that PR, how to do type traits in the function ggml_qnn_mulmat, how to offload mulmat to the Hexagon NPU (he mentioned that a matrix transpose operation is required for this scenario), or how to manage NPU RPC memory effectively). So I personally think that although this programmer might be familiar with hardcore AI tech, he acted as a "language lawyer" in an influential open-source C++ project. (A personal aside: I haven't seen any programmers in the llama.cpp community whose C++ coding level exceeds that of the genius programmers Georgi Gerganov @ggerganov and Diego Devesa @slaren, both of whom I consider real AI experts and C++ masters.) I argued some stupid questions with this programmer again and again in that PR; that was my mistake.
  • my stupid comment in that PR from the upstream llama.cpp community; that was also my mistake.
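For readers unfamiliar with the matrix-transpose point raised above: offloading a mulmat may require changing the memory layout of one operand before handing it to the NPU. The sketch below is a standalone, hypothetical illustration of a plain row-major transpose; it is not the actual ggml-qnn or QNN SDK code, and the function name transpose is invented here:

```cpp
#include <cstddef>
#include <vector>

// Transpose a rows x cols matrix stored row-major into a cols x rows
// matrix, also row-major. A layout change like this is the kind of
// preprocessing a mulmat offload may need before dispatching to an NPU.
static std::vector<float> transpose(const std::vector<float> & m,
                                    std::size_t rows, std::size_t cols) {
    std::vector<float> t(m.size());
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            t[c * rows + r] = m[r * cols + c];
        }
    }
    return t;
}
```

For example, transpose({1, 2, 3, 4, 5, 6}, 2, 3) turns the 2x3 matrix [[1, 2, 3], [4, 5, 6]] into the 3x2 matrix [[1, 4], [2, 5], [3, 6]].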

BTW, my point of view on DeepSeek-R1: I'm not surprised that DeepSeek-R1 comes from China, because China has established the world's largest higher-education system, with 240 to 260 million people having received a university education; accordingly, there are many STEM geniuses or super smart guys in China whose IQ is above 145, similar to the original authors of the great llama.cpp. I agree that China's DeepSeek-R1 really brings a wake-up call to the US's AI & tech industry, but I also strongly agree with the point of view of Yann LeCun, Meta's Chief AI Scientist:

Image

One of the reasons for the above personal point of view comes from an impressive line in a popular American song: "I won't forget the ones who died, who gave that right to me." I personally think this is a universal emotion of human beings on our planet.

Finally, I personally think this Chinese programmer could contribute meaningful ideas or source code to our first or second PR in the upstream llama.cpp community, or to [llama.cpp in the kantv-ai team](https://github.com/kantv-ai/llama.cpp) directly, rather than hard-fork our first PR and then claim that our work is a duplicated effort. Anyway, we sincerely respect the freedom of everyone in the llama.cpp community, and we agree that's also the GGML way: try crazy ideas, build wild demos, and push the edge of what's possible.

[Updated on 02-16-2025] I suddenly found that I can access the upstream llama.cpp community again. Thanks very much; I'll never forget who gave that right to me.

@zhouwg zhouwg added enhancement New feature or request android QNN-backend labels Feb 5, 2025
@zhouwg
Member Author

zhouwg commented Feb 5, 2025

Thanks to the wonderful "backend scheduler" feature, which has been introduced and matured in the latest upstream llama.cpp, this PR works pretty well (as far as I understand) in a standard Android APP as expected, with whisper.cpp and llama.cpp, on the Xiaomi 14 (an Android smartphone equipped with the Qualcomm Snapdragon 8 Gen 3).

Image

Image

@zhouwg zhouwg changed the title Draft: Add QNN(Qualcomm Neural Network, aka Qualcomm AI Engine Direct) for latest ggml,whisper.cpp,llama.cpp Draft: Refine QNN(Qualcomm Neural Network, aka Qualcomm AI Engine Direct) backend for latest ggml,whisper.cpp,llama.cpp Feb 7, 2025
@zhouwg zhouwg added WIP and removed enhancement New feature or request labels Feb 7, 2025
@kantv-ai kantv-ai locked and limited conversation to collaborators Feb 7, 2025
@kantv-ai kantv-ai unlocked this conversation Feb 7, 2025
@kantv-ai kantv-ai locked as resolved and limited conversation to collaborators Feb 7, 2025
@zhouwg zhouwg self-assigned this Feb 7, 2025
@zhouwg zhouwg added the done label Feb 7, 2025
@zhouwg zhouwg closed this as completed Feb 7, 2025
@zhouwg zhouwg reopened this Feb 7, 2025
@zhouwg zhouwg closed this as completed Feb 7, 2025
@zhouwg zhouwg added WIP and removed done labels Feb 8, 2025
@zhouwg zhouwg reopened this Feb 8, 2025
@kantv-ai kantv-ai unlocked this conversation Feb 8, 2025
@zhouwg zhouwg reopened this Feb 9, 2025
zhouwg added several commits to zhouwg/llama.cpp that referenced this issue Feb 10, 2025
zhouwg added a commit that referenced this issue Feb 10, 2025
@zhouwg zhouwg closed this as completed Feb 10, 2025
zhouwg added several commits to zhouwg/llama.cpp that referenced this issue Feb 10, 2025
@zhouwg zhouwg reopened this Feb 11, 2025
@zhouwg
Member Author

zhouwg commented Feb 11, 2025

No description provided.
