This guide demonstrates how to deploy a merged checkpoint created with the AI Workbench-LLaMa-Factory project using the llama.cpp library.
First, convert the merged checkpoint into a (quantized) GGUF binary. Then, run inference using the llama.cpp library or its Python bindings. GGUF binaries generated with this workflow are also compatible with applications such as LMStudio, jan.ai, and text-generation-webUI, as well as any other application that offers a llama.cpp execution backend.
> [!NOTE]
> Skip this section if you already have a functional local llama.cpp environment.
Build llama.cpp with CUDA acceleration by following the instructions here. Ensure you have the correct prerequisites installed.
Clone and build the llama.cpp repo:

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_CUDA=ON
cmake --build build --config Release
python3 -m pip install -r requirements.txt
```
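As a quick sanity check (a suggestion, not part of the original workflow; binary names and output paths vary across llama.cpp versions), confirm the build produced the executables before moving on:

```bash
# List the compiled binaries; with the Visual Studio generator on Windows they land under build/bin/Release
ls build/bin
```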
To run inference, first convert the Hugging Face checkpoint generated by LLaMa-Factory into the GGUF model format used by llama.cpp. Then quantize the model to the desired quantization level.
```
python convert-hf-to-gguf.py <path-to-hf-checkpoint> --outfile <path-to-output-gguf>
```
For example:
```
python convert-hf-to-gguf.py C:\models\codealpaca-merged --outfile C:\models\codealpaca.gguf
```
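As a variation (assuming your llama.cpp checkout's convert-hf-to-gguf.py supports the `--outtype` flag, as current upstream versions do), you can also pick the intermediate precision at conversion time:

```
python convert-hf-to-gguf.py C:\models\codealpaca-merged --outtype f16 --outfile C:\models\codealpaca_f16.gguf
```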
Quantize down to Q4:
```
cd build\bin\Release
quantize.exe C:\models\codealpaca.gguf C:\models\codealpaca_q4.gguf Q4_K_M
```
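On Linux or macOS the equivalent looks like the sketch below; the binary is named `quantize` in older llama.cpp builds and `llama-quantize` in newer ones, and the model paths are placeholders:

```bash
# Quantize the converted GGUF down to Q4_K_M; adjust the binary name and paths to your build
./build/bin/quantize ~/models/codealpaca.gguf ~/models/codealpaca_q4.gguf Q4_K_M
```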
To run inference from Python, install the llama-cpp-python bindings with CUDA support enabled:

```
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
```
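A minimal check (an optional step, not from the original guide) that the wheel built and installed correctly:

```
python -c "import llama_cpp; print(llama_cpp.__version__)"
```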
Then load the GGUF model and generate a completion:

```python
from llama_cpp import Llama

llm = Llama(
    model_path=r"C:\models\llama-model.gguf",
    n_gpu_layers=-1,  # Use GPU acceleration
    # seed=1337,      # Uncomment to set a specific seed
    # n_ctx=2048,     # Uncomment to increase the context window
)

# Generate a completion; can also call create_completion
output = llm(
    "Q: Name the planets in the solar system? A: ",  # Prompt
    max_tokens=32,      # Generate up to 32 tokens; set to None to generate up to the end of the context window
    stop=["Q:", "\n"],  # Stop generating just before the model would generate a new question
    echo=True,          # Echo the prompt back in the output
)
print(output)
```
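llama-cpp-python also exposes a higher-level chat interface. The sketch below is an illustration, not part of the original workflow: it assumes the merged model has (or falls back to) a chat template, and the file path and prompt are placeholders.

```python
from llama_cpp import Llama

# Load the quantized GGUF produced earlier; the path is a placeholder
llm = Llama(model_path=r"C:\models\codealpaca_q4.gguf", n_gpu_layers=-1)

# create_chat_completion formats the messages with the model's chat template
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function that reverses a string."},
    ],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```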