# Description

The RKLLM software stack helps users quickly deploy AI models to Rockchip chips. The overall framework is as follows:

<center class="half">
<div style="background-color:#ffffff;">

- The RKNPU kernel driver is responsible for interacting with the NPU hardware. It is open source and can be found in the Rockchip kernel code.

# Supported Platforms

- RK3588 Series
- RK3576 Series

# Supported Models

- [x] [LLAMA models](https://huggingface.co/meta-llama)
- [x] [TinyLLAMA models](https://huggingface.co/TinyLlama)
- [x] [Qwen models](https://huggingface.co/models?search=Qwen/Qwen)
- [x] [Phi models](https://huggingface.co/models?search=microsoft/phi)
- [x] [ChatGLM3-6B](https://huggingface.co/THUDM/chatglm3-6b/tree/103caa40027ebfd8450289ca2f278eac4ff26405)
- [x] [Gemma models](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315)
- [x] [InternLM2 models](https://huggingface.co/collections/internlm/internlm2-65b0ce04970888799707893c)
- [x] [MiniCPM models](https://huggingface.co/collections/openbmb/minicpm-65d48bf958302b9fd25b698f)

# Model Performance Benchmark

| Model          | dtype      | seqlen | max_context | new_tokens | TTFT (ms) | Tokens/s | Memory (GB) | Platform |
|:---------------|:-----------|:------:|:-----------:|:----------:|:---------:|:--------:|:-----------:|:--------:|
| TinyLLAMA-1.1B | w4a16      | 64     | 320         | 256        | 345.00    | 21.10    | 0.77        | RK3576   |
|                | w4a16_g128 | 64     | 320         | 256        | 410.00    | 18.50    | 0.80        | RK3576   |
|                | w8a8       | 64     | 320         | 256        | 140.46    | 24.21    | 1.25        | RK3588   |
|                | w8a8_g512  | 64     | 320         | 256        | 195.00    | 20.08    | 1.29        | RK3588   |
| Qwen2-1.5B     | w4a16      | 64     | 320         | 256        | 512.00    | 14.40    | 1.75        | RK3576   |
|                | w4a16_g128 | 64     | 320         | 256        | 550.00    | 12.75    | 1.76        | RK3576   |
|                | w8a8       | 64     | 320         | 256        | 206.00    | 16.46    | 2.47        | RK3588   |
|                | w8a8_g128  | 64     | 320         | 256        | 725.00    | 7.00     | 2.65        | RK3588   |
| Phi-3-3.8B     | w4a16      | 64     | 320         | 256        | 975.00    | 6.60     | 2.16        | RK3576   |
|                | w4a16_g128 | 64     | 320         | 256        | 1180.00   | 5.85     | 2.23        | RK3576   |
|                | w8a8       | 64     | 320         | 256        | 516.00    | 7.44     | 3.88        | RK3588   |
|                | w8a8_g512  | 64     | 320         | 256        | 610.00    | 6.13     | 3.95        | RK3588   |
| ChatGLM3-6B    | w4a16      | 64     | 320         | 256        | 1168.00   | 4.62     | 3.86        | RK3576   |
|                | w4a16_g128 | 64     | 320         | 256        | 1582.56   | 3.82     | 3.96        | RK3576   |
|                | w8a8       | 64     | 320         | 256        | 800.00    | 4.95     | 6.69        | RK3588   |
|                | w8a8_g128  | 64     | 320         | 256        | 2190.00   | 2.70     | 7.18        | RK3588   |
| Gemma2-2B      | w4a16      | 64     | 320         | 256        | 628.00    | 8.00     | 3.63        | RK3576   |
|                | w4a16_g128 | 64     | 320         | 256        | 776.20    | 7.40     | 3.63        | RK3576   |
|                | w8a8       | 64     | 320         | 256        | 342.29    | 9.67     | 4.84        | RK3588   |
|                | w8a8_g128  | 64     | 320         | 256        | 1055.00   | 5.49     | 5.14        | RK3588   |
| InternLM2-1.8B | w4a16      | 64     | 320         | 256        | 475.00    | 13.30    | 1.59        | RK3576   |
|                | w4a16_g128 | 64     | 320         | 256        | 572.00    | 11.95    | 1.62        | RK3576   |
|                | w8a8       | 64     | 320         | 256        | 205.97    | 15.66    | 2.38        | RK3588   |
|                | w8a8_g512  | 64     | 320         | 256        | 298.00    | 12.66    | 2.45        | RK3588   |
| MiniCPM3-4B    | w4a16      | 64     | 320         | 256        | 1397.00   | 4.80     | 2.70        | RK3576   |
|                | w4a16_g128 | 64     | 320         | 256        | 1645.00   | 4.39     | 2.80        | RK3576   |
|                | w8a8       | 64     | 320         | 256        | 702.18    | 6.15     | 4.65        | RK3588   |
|                | w8a8_g128  | 64     | 320         | 256        | 1691.00   | 3.42     | 5.06        | RK3588   |
| llama3-8B      | w4a16      | 64     | 320         | 256        | 1607.98   | 3.60     | 5.63        | RK3576   |
|                | w4a16_g128 | 64     | 320         | 256        | 2010.00   | 3.00     | 5.76        | RK3576   |
|                | w8a8       | 64     | 320         | 256        | 1128.00   | 3.79     | 9.21        | RK3588   |
|                | w8a8_g512  | 64     | 320         | 256        | 1281.35   | 3.05     | 9.45        | RK3588   |

- This performance data was collected with version 1.1.0, with each platform's CPU and NPU fixed at their maximum frequencies.
- The script for setting the frequencies is located in the scripts directory; a rough sketch of what such a script does is shown below.
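
For illustration only, here is a minimal Python sketch of the kind of frequency pinning such a script performs. The cpufreq and devfreq sysfs interfaces are standard Linux, but the NPU devfreq node name below is an assumption (it varies per SoC; check `/sys/class/devfreq/` on your board), and the SDK's own shell scripts remain the authoritative reference:

```python
# Sketch: pin CPU and NPU to their highest frequencies via sysfs.
# Must be run as root; node names marked as assumptions below.
import glob

# Standard Linux cpufreq: set every CPU policy to the "performance" governor.
for path in glob.glob("/sys/devices/system/cpu/cpufreq/policy*/scaling_governor"):
    with open(path, "w") as f:
        f.write("performance")

# Standard Linux devfreq: do the same for the NPU.
# "fdab0000.npu" is an assumed RK3588 node name; verify it on your board.
npu_node = "/sys/class/devfreq/fdab0000.npu"
with open(npu_node + "/governor", "w") as f:
    f.write("performance")

# Print the resulting NPU frequency for confirmation.
with open(npu_node + "/cur_freq") as f:
    print("NPU frequency:", f.read().strip())
```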

# Download

You can download the latest package, docker image, examples, documentation, and platform tools from [RKLLM_SDK](https://console.zbox.filez.com/l/RJJDmB) (fetch code: rkllm).

# Note

- The modifications in version 1.1 are significant, making it incompatible with models converted by older versions. Please use the latest toolchain for model conversion and inference; a conversion sketch follows this list.
- The supported Python versions are:
  - Python 3.8
  - Python 3.10
- Latest version: [v1.1.1](https://github.com/airockchip/rknn-llm/releases/tag/release-v1.1.1)
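
As a reference for the conversion step mentioned above, here is a minimal sketch using the RKLLM-Toolkit Python API as it appears in the SDK examples. The model path, output name, quantization dtype, and target platform are placeholders, and `build()` accepts additional parameters not shown here; consult the SDK documentation for the authoritative signatures:

```python
# Sketch: convert a Hugging Face model to the .rkllm format with RKLLM-Toolkit.
# Paths, dtype, and platform below are illustrative placeholders.
from rkllm.api import RKLLM

llm = RKLLM()

# Load a local Hugging Face checkpoint (placeholder path).
ret = llm.load_huggingface(model="./Qwen2-1.5B-Instruct")
assert ret == 0, "model load failed"

# Quantize and build; "w4a16_g128" exercises the group-wise
# quantization added in v1.1.0 (w4a16 with group size 128).
ret = llm.build(
    do_quantization=True,
    quantized_dtype="w4a16_g128",
    target_platform="rk3576",
)
assert ret == 0, "model build failed"

# Export the converted model for deployment with the RKLLM Runtime.
ret = llm.export_rkllm("./qwen2-1.5b-w4a16_g128.rkllm")
assert ret == 0, "model export failed"
```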

# RKNN Toolkit2

If you want to deploy additional AI models, we have introduced an SDK called RKNN-Toolkit2. For details, please refer to:

https://github.com/airockchip/rknn-toolkit2

# CHANGELOG

## v1.1.0

- Support group-wise quantization (w4a16 with group sizes of 32/64/128, w8a8 with group sizes of 128/256/512).
- Support joint inference with LoRA model loading.
- Support storage and preloading of prompt caches.