PowerServe

Supported models

| Model                   | CPU            | NPU | Note |
| ----------------------- | -------------- | --- | ---- |
| LLaMA 3.1 🦙🦙🦙        | ✔️             | ✔️  |      |
| LLaMA 3.2 🦙🦙🦙        | ✔️             | ✔️  |      |
| Qwen2                   | ✔️ [Need test] | ✔️  |      |
| Phi3                    | ✔️ [Need test] |     |      |
| InternVL2 (1B, 2B, 8B)  | ✔️             |     |      |

Table of Contents

  1. Prerequisites
  2. Model Preparation
  3. Compile PowerServe
  4. Prepare PowerServe Workspace
  5. Execution
  6. Known Issues

Prerequisites

pip install -r requirements.txt
git submodule update --init --recursive

To deploy on aarch64 devices with a Qualcomm NPU via QNN, both the Android NDK and the QNN SDK must be installed.

export NDK=<path-to-ndk>
export QNN_SDK_ROOT=<path-to-QNN>
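
As a quick sanity check, verify that both paths point to valid installations (the exact layout under the QNN SDK may differ between releases, so this listing is only illustrative):

# Sanity check (illustrative): the NDK toolchain file used by the build steps below and the QNN libraries should exist
ls "$NDK/build/cmake/android.toolchain.cmake"
ls "$QNN_SDK_ROOT/lib"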

Model Preparation

For CPU-only execution, only Models For CPU are required. For NPU execution, both Models For CPU and Models For NPU are required.

Taking the llama3.1-8b-instruct model as an example, the model folder is structured as follows:

-- models                       # Level-1 dir, where the server searches for different models.
    -- llama3.1-8b-instruct         # Level-2 dir, where the CLI searches for runtime configurations
        -- hparams.json                 # Hyperparameters: #threads, #batch_size and sampler configurations.
        -- workspace.json               # The definition of the model workspace structure, specifying the main model and the target model (if any).
        -- bin                          # The binaries for execution
            -- smart-config-generator
            -- smart-perplexity-test
            -- smart-run
            -- smart-server
        -- model_dir                    # The model weights in GGUF and QNN formats
            -- model.json               #
            -- vocab.gguf               # The vocab table of model
            -- ggml                     # GGUF model binaries
                -- weights.gguf
            -- qnn                      # QNN model binaries
                -- kv
                    -- *.raw
                    -- ...
                -- config.json          # Information about the QNN models and the QNN backend configuration
                -- llama3_1_8b_0.bin
                -- llama3_1_8b_1.bin
                -- llama3_1_8b_2.bin
                -- llama3_1_8b_3.bin
                -- lmhead.bin
        -- qnn_libs                     # Dependent libraries of QNN
            -- libQNNSystem.so
            -- libQNNHtp.so
            -- libQNNHtpV79.so
            -- libQNNHtpV79Skel.so
            -- libQNNHtpV79Stub.so

    -- qwen2_7b_instruct            # Level-2 dir of another model
        -- ...

Convert Models For CPU

# Under the root directory of PowerServe
python ./tools/gguf_export.py -m <hf-model> -o models/llama3.1-8b-instruct
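
Here <hf-model> is the path to the original Hugging Face checkpoint. A purely illustrative invocation, assuming the checkout sits under the repository root with the same folder name used by the QNN converter below:

# Illustrative only: the Hugging Face model directory path is an assumption
python ./tools/gguf_export.py -m ./Llama-3.1-8B-Instruct -o models/llama3.1-8b-instruct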

Convert Models For NPU

If you only want to run PowerServe on CPUs, this step can be skipped. For more details, please refer to QNN Model Conversion.

# Under the root directory of PowerServe
cd tools/qnn_converter

# This may take a long time...
python converter.py                                 \
    --model-folder Llama-3.1-8B-Instruct            \
    --model-name llama3_1_8b                        \
    --system-prompt-file system_prompt_llama.txt    \
    --prompt-file lab_intro_llama.md                \
    --batch-sizes 1 128                             \
    --artifact-name llama3_1_8b                     \
    --n-model-chunk 4                               \
    --output-folder ./llama3.1-8b-QNN               \
    --build-folder ./llama3.1-8b-QNN-tmp            \
    --soc 8gen4
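
The output folder is what gets passed to --qnn-path in the next step. A rough check that the conversion produced something (the exact file names follow --artifact-name and --n-model-chunk, so this listing is indicative rather than exact):

# Indicative listing: chunked QNN binaries plus the backend config referenced in the workspace tree above
ls ./llama3.1-8b-QNN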

Convert GGUF models and integrate them with QNN models

# Under the root directory of PowerServe
python ./tools/gguf_export.py -m <hf-model> --qnn-path tools/qnn_converter/llama3.1-8b-QNN -o ./llama3.1-8b-instruct-model

Compile PowerServe

The platform and ABI options vary across target devices. Double-check the configuration before building.

Build for Linux CPU

# Under the root directory of PowerServe
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build

Build for Android CPU

# Under the root directory of PowerServe
cmake -B build                                                      \
    -DCMAKE_BUILD_TYPE=Release                                      \
    -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=arm64-v8a                                         \
    -DANDROID_PLATFORM=android-35                                   \
    -DGGML_OPENMP=OFF                                               \
    -DSMART_WITH_QNN=OFF

cmake --build build

Build for Android QNN

# Under the root directory of PowerServe
cmake -B build                                                      \
    -DCMAKE_BUILD_TYPE=Release                                      \
    -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=arm64-v8a                                         \
    -DANDROID_PLATFORM=android-35                                   \
    -DGGML_OPENMP=OFF                                               \
    -DSMART_WITH_QNN=ON

cmake --build build

Prepare PowerServe Workspace

# Under the root directory of PowerServe
mkdir -p models

# Generate PowerServe Workspace
./smartserving create -m ./llama3.1-8b-instruct-model --exe-path ./build -o ./models/llama3.1-8b-instruct
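
When the binaries were built for Android, the generated workspace has to be copied onto the device before running the commands below. A minimal sketch using adb (the target directory /data/local/tmp/powerserve is only an example; any writable location works, e.g. a Termux home directory):

# Push the workspace to the device and make the bundled binaries executable (illustrative paths)
adb push ./models /data/local/tmp/powerserve/models
adb shell chmod +x /data/local/tmp/powerserve/models/llama3.1-8b-instruct/bin/*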

Execution

CLI

For more details, please refer to CLI App.

For pure CPU execution

# Under the root directory of PowerServe
./models/llama3.1-8b-instruct/bin/smart-run --work-folder ./models/llama3.1-8b-instruct --prompt "Once upon a time, there was a little girl named Lucy" --no-qnn

For NPU execution

# Under the root directory of PowerServe
export LD_LIBRARY_PATH=/system/lib64:/vendor/lib64 && ./models/llama3.1-8b-instruct/bin/smart-run --work-folder ./models/llama3.1-8b-instruct --prompt "Once upon a time, there was a little girl named Lucy"

Server

For more details, please refer to Server App.

# Under the root directory of PowerServe
./models/llama3.1-8b-instruct/bin/smart-server --model-folder ./models --host <ip-addr> --port <port>
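
Once the server is up, it can be queried over HTTP. The sketch below assumes an OpenAI-compatible chat completions route and uses placeholder host/port values; check the Server App documentation for the actual endpoints and request schema.

# Assumption: OpenAI-style /v1/chat/completions route; host, port and model name are placeholders
curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "llama3.1-8b-instruct", "messages": [{"role": "user", "content": "Hello!"}]}'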

Performance

  • QNN: 8gen3 phone + n_predicts = 256 + n_prompts = 1652
  • CPU: n_threads = 8 + n_predicts = 128 + n_prompts = 95
| Model            | CPU (Prefill / Decode)  | NPU (Prefill / Decode)   | Note |
| ---------------- | ----------------------- | ------------------------ | ---- |
| LLaMA3.1-8b-q4_0 | 19.88 / 6.75 tokens/s   | 559.81 / 11.21 tokens/s  |      |
| LLaMA3.2-1b-q4_0 | 127.86 / 38.04 tokens/s | 1764.96 / 57.53 tokens/s |      |
| Qwen2-7b-q4_0    | 21.35 / 7.20 tokens/s   | 654.94 / 12.11 tokens/s  | The output is normal |
| Qwen2-0.5b-q4_0  | 293.45 / 73.33 tokens/s | 3495.78 / 19.16 tokens/s | ❗️ The output of batch size=128 + cpu lm_head is normal, and the output of batch size=1+128 + cpu lm_head is chaotic |

Speculative Performance

  • gsm8k_1
Mark has a garden with flowers. He planted plants of three different colors in it. Ten of them are yellow, and there are 80% more of those in purple. There are only 25% as many green flowers as there are yellow and purple flowers. How many flowers does Mark have in his garden?
  • gsm8k_2
Alexis is applying for a new job and bought a new set of business clothes to wear to the interview. She went to a department store with a budget of $200 and spent $30 on a button-up shirt, $46 on suit pants, $38 on a suit coat, $11 on socks, and $18 on a belt. She also purchased a pair of shoes, but lost the receipt for them. She has $16 left from her budget. How much did Alexis pay for the shoes?
  • math
Mia is planning a camping trip in the Canadian Rockies and has a budget of $800 for equipment. She buys a tent for $120, which is 15% off the original price. She then purchases a sleeping bag for $80, which is 20% off. If she also needs to buy a backpack and a portable stove, and the total cost of these two items is $180, what percentage of her budget will she have left after all the purchases?
| Main model       | Draft model      | NPU (Prefill / Decode)  | Accept rate (%) | Dataset | Note |
| ---------------- | ---------------- | ----------------------- | --------------- | ------- | ---- |
| LLaMA3.1-8b-q4_0 | LLaMA3.2-1b-q4_0 | 263.35 / 21.27 tokens/s | 71.43           | gsm8k_1 | ❗️ Without a system prompt, one sentence will be repeated over and over again |
| LLaMA3.1-8b-q4_0 | LLaMA3.2-1b-q4_0 | 368.06 / 19.08 tokens/s | 73.2            | math    |      |
| LLaMA3.1-8b-q4_0 | LLaMA3.2-1b-q4_0 | 392.98 / 12.45 tokens/s | 43.65           | gsm8k_2 |      |

Known Issues

Model Conversion

  1. When exporting model to onnx: RuntimeError: The serialized model is larger than the 2GiB limit imposed by the protobuf library. Therefore the output file must be a file path, so that the ONNX external data can be written to the same directory. Please specify the output file name.

    The PyTorch version should be lower than 2.5.1. Reinstall PyTorch, for example:

    pip install torch==2.4.1
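
    After reinstalling, a quick check that the downgraded version is the one being picked up:

    # Should print a version below 2.5.1
    python -c "import torch; print(torch.__version__)"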

Execution

  1. When inferencing with QNN: Failed to open lib /vendor/lib64/libcdsprpc.so: dlopen failed: library "/vendor/lib64/libcdsprpc.so" needed or dlopened by "/data/data/com.termux/files/home/workspace/qnn/llama-3.2-1b-instruct/bin/smart-run" is not accessible for the namespace "(default)

    Use export LD_LIBRARY_PATH=/system/lib64:/vendor/lib64 before executing the program.

    This is because libcdsprpc.so depends on /system/lib64/libbinder.so rather than /vendor/lib64/libbinder.so. If the linker searches /vendor/lib64 first, it may find and link /vendor/lib64/libbinder.so, which does not contain the required function definitions.
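
    If in doubt, the dependency can be inspected on the device (readelf is part of binutils, which may need to be installed separately, e.g. via the Termux binutils package):

    # Lists the NEEDED entries of libcdsprpc.so; libbinder.so should be resolved from /system/lib64
    readelf -d /vendor/lib64/libcdsprpc.so | grep NEEDED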
