[Idea]: Use Android NNAPI to accelerate inference on Android Devices #88
Comments
Would love to see this as well!
If there is community help, we can try to add support for NNAPI. Currently I don't have enough capacity to investigate this, but I think it is interesting and could unlock many applications. I will probably look into this in the future and hope there are some contributions in the meantime.
I'm trying to write an NNAPI backend (don't expect too much from my work, since I'm a complete newbie and will most likely not succeed). After some document reading, I found that unlike CL or VK, NNAPI doesn't provide a way to use accelerated matrix multiplication or shader-like compute on the GPU. The only thing you can do with it is upload a graph describing how layers are connected (including operands and weights). So it seems it doesn't really match the architecture llama.cpp currently has? If I'm wrong about that, please point me to a backend that uses a similar architecture so I can use it as a reference. A rough sketch of what the graph-building API looks like is below.
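For anyone curious, here is a minimal, untested sketch of what describing a single FULLY_CONNECTED (matmul + bias) node looks like with the NNAPI C API. The function name, dimensions and data pointers are placeholders for illustration only; a real backend would have to translate a whole ggml graph this way before compilation and execution.

```cpp
// Hypothetical sketch: build an NNAPI model containing one FULLY_CONNECTED op.
// Dimensions and weight/bias buffers are placeholders.
#include <android/NeuralNetworks.h>

bool build_fc_graph(const float* weights, const float* bias,
                    uint32_t rows, uint32_t cols) {
    ANeuralNetworksModel* model = nullptr;
    if (ANeuralNetworksModel_create(&model) != ANEURALNETWORKS_NO_ERROR) return false;

    uint32_t in_dims[2]  = {1, cols};        // operand 0: activations
    uint32_t w_dims[2]   = {rows, cols};     // operand 1: weights
    uint32_t b_dims[1]   = {rows};           // operand 2: bias
    uint32_t out_dims[2] = {1, rows};        // operand 4: output

    ANeuralNetworksOperandType in_type  = {ANEURALNETWORKS_TENSOR_FLOAT32, 2, in_dims,  0.0f, 0};
    ANeuralNetworksOperandType w_type   = {ANEURALNETWORKS_TENSOR_FLOAT32, 2, w_dims,   0.0f, 0};
    ANeuralNetworksOperandType b_type   = {ANEURALNETWORKS_TENSOR_FLOAT32, 1, b_dims,   0.0f, 0};
    ANeuralNetworksOperandType act_type = {ANEURALNETWORKS_INT32,          0, nullptr,  0.0f, 0};
    ANeuralNetworksOperandType out_type = {ANEURALNETWORKS_TENSOR_FLOAT32, 2, out_dims, 0.0f, 0};

    ANeuralNetworksModel_addOperand(model, &in_type);   // index 0
    ANeuralNetworksModel_addOperand(model, &w_type);    // index 1
    ANeuralNetworksModel_addOperand(model, &b_type);    // index 2
    ANeuralNetworksModel_addOperand(model, &act_type);  // index 3: fused activation
    ANeuralNetworksModel_addOperand(model, &out_type);  // index 4

    // Constant data (weights, bias, activation code) is baked into the graph.
    int32_t fuse_none = ANEURALNETWORKS_FUSED_NONE;
    ANeuralNetworksModel_setOperandValue(model, 1, weights,    rows * cols * sizeof(float));
    ANeuralNetworksModel_setOperandValue(model, 2, bias,       rows * sizeof(float));
    ANeuralNetworksModel_setOperandValue(model, 3, &fuse_none, sizeof(fuse_none));

    // One FULLY_CONNECTED node: inputs {0,1,2,3} -> output {4}.
    uint32_t op_inputs[4]  = {0, 1, 2, 3};
    uint32_t op_outputs[1] = {4};
    ANeuralNetworksModel_addOperation(model, ANEURALNETWORKS_FULLY_CONNECTED,
                                      4, op_inputs, 1, op_outputs);

    // Only the graph inputs/outputs are visible to the runtime, which then
    // decides on which accelerator (if any) the compiled graph runs.
    uint32_t graph_inputs[1]  = {0};
    uint32_t graph_outputs[1] = {4};
    ANeuralNetworksModel_identifyInputsAndOutputs(model, 1, graph_inputs, 1, graph_outputs);

    bool ok = ANeuralNetworksModel_finish(model) == ANEURALNETWORKS_NO_ERROR;
    ANeuralNetworksModel_free(model);
    return ok;
}
```

That graph-level interface is exactly why it feels like a mismatch: there is no way to hand NNAPI a single matmul buffer-to-buffer the way the CL/Vulkan backends do, so the whole compute graph would have to be mapped to NNAPI operations up front.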
@ggerganov maybe it's worth checking NNAPI via ONNX Runtime? WhisperRN runs smoothly with CoreML, but on Android even the tiny model is way too laggy to be usable on a budget device (for example, a Samsung A14 with 4 GB RAM).
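To illustrate the idea, this is a rough, untested sketch of attaching the NNAPI execution provider to an ONNX Runtime session in C++ (it assumes an Android build of ONNX Runtime with the NNAPI EP compiled in; the model path is a placeholder):

```cpp
// Hypothetical sketch: ONNX Runtime session with the NNAPI execution provider.
#include <onnxruntime_cxx_api.h>
#include <nnapi_provider_factory.h>

Ort::Session create_nnapi_session(Ort::Env& env, const char* model_path) {
    Ort::SessionOptions options;
    // 0 = default flags; options like NNAPI_FLAG_USE_FP16 can be OR-ed in.
    Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_Nnapi(options, 0));
    // Operators NNAPI cannot handle fall back to the default CPU provider.
    return Ort::Session(env, model_path, options);
}
```

This would of course mean exporting the model to ONNX rather than running the GGML graph directly, so it is more of a comparison point for how much the accelerator actually helps.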
@pax-k how do you define "laggy"? I am also investigating performance on the Android side. My Samsung S22 can transcribe a 30-second German voice message in about 3 seconds with the small Whisper model. I am also optimistic about the future, because I am quite sure Google's strong AI focus will improve the AI hardware in the next generations of Android phones. By running in profile mode I could reduce the inference time by almost a second, which I think is acceptable.
This is just an idea for you. Most modern smartphones come with some form of AI accelerator. I am aware that GGML-based projects like llama.cpp can compile and run on mobile devices, but there is probably performance left on the table. I think there is currently a gap for a mobile-optimized AI inference library with quantization support and the other tricks present in GGML. For reference: https://developer.android.com/ndk/guides/neuralnetworks