This project is a factory for NVIDIA NIM containers in which users and businesses can quantize many models and build their own TensorRT-LLM engines for optimized inference. It lets users and businesses with ample hardware but more modest business goals save compute power by quantizing LLMs into different sizes.
Over the past few years, generative AI models have popped up everywhere, from creating realistic responses to complex questions to generating images and music that impress art critics around the globe. However, some users and businesses still cannot use generative AI because of limited resources or compute costs that simply outweigh their business goals. This project enables them to quantize almost any AI model into different sizes and build an optimized inference engine using TensorRT-LLM. You will see how to quantize a model into one of the supported quantization formats (qformat), such as fp8, int4_sq, int4_awq, and many more.

From there, you will see how to build an inference engine using NVIDIA TensorRT-LLM. If you are interested in a more detailed explanation of building an optimized inference engine, take a look at the official documentation here. After that, you might want to get your hands on your favourite AI model and quantize it for integration into your own applications. Which model might it be? Llama? Nemotron? Let's get started!
- Operating System: Ubuntu 22.04
- CPU: No specific requirement; tested with an Intel Core i7 7th Gen CPU @ 2.30 GHz
- GPU requirements: Any NVIDIA training GPU, tested with 1x NVIDIA GTX-16GB
- NVIDIA driver requirements: Latest driver version (with CUDA 12.2)
- Storage requirements: 40GB
This section demonstrates how to use this project to run NVIDIA NIM Factory via NVIDIA AI Workbench.
- Hugging Face account: a username and access token for downloading models (some models may require access permission).
- Enough disk space to store the downloaded models.
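If you want to pull a model ahead of time (the Prepare Environment step below normally does this for you), a minimal sketch using the `huggingface_hub` client could look like the following. The repo ID, target folder, and token placeholder are illustrative assumptions, not values baked into the app:

```python
# Minimal sketch: pre-downloading a model with your Hugging Face token.
# The app's Prepare Environment step normally handles this; the repo ID,
# local folder, and token below are placeholders for illustration.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="gpt2",           # gated families such as Llama also need approved access
    local_dir="models/gpt2",  # assumed layout matching the "models" folder used later
    token="hf_xxx",           # your Hugging Face access token (read scope is enough)
)
```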
- Install and configure AI Workbench locally and open up AI Workbench. Select a location of your choice.
- Fork this repo into your own GitHub account.
- Inside AI Workbench:
  - Click Clone Project and enter the repo URL of your newly-forked repo.
  - AI Workbench will automatically clone the repo and build out the project environment, which can take several minutes to complete.
  - Upon `Build Complete`, select Open Backend-app at the top right of the AI Workbench window, then select Open Frontend-app to interact with the application in the browser. Alternatively, go to the Environment section of Workbench and start 1) backend-app and then 2) frontend-app.
- In the Frontend-app:
  - Choose your desired model family, such as Llama, Nemotron, GPT, etc. (Note: follow the order of the tabs. In our case, we chose the "GPT" model family and were then offered the model versions listed in the "Support Matrix" of the TensorRT-LLM documentation.)
  - Hugging Face credentials are optional as long as the model repository does not require special permission, as Llama does, so make sure you have obtained access if you want to select the Llama model family. Then click the Prepare Environment button at the top middle. If everything installs successfully, the TensorRT-LLM text on the right turns green; otherwise, it turns red.
- Click on the TensorRT-LLM tab next to Environment to proceed to quantization. Choose one of the quantization formats offered by TensorRT-LLM; in our case, we went with the int4_awq format. Default values and a description are given for each parameter. (In the future, we will offer a wider range of parameters.) Then click the Start Quantization button and watch the Quantization Window to follow the progress of the operation (a sketch of the command this step drives appears after this walkthrough). If quantization is successful, you will see a new model directory in the "models" folder, such as "quant_gpt2_int4_awq", and output like this at the end of the window:
  ```
  Inserted 147 quantizers
  Caching activation statistics for awq_lite...
  Searching awq_lite parameters...
  Padding vocab_embedding and lm_head for AWQ weights export
  current rank: 0, tp rank: 0, pp rank: 0
  ```
- Next, click on the Build Engine tab to start building the inference engine. Note: the value of the max input length parameter must not exceed the value of max_position_embeddings in the model's config.json file; otherwise, you will get an error. Then click Start Engine Building to start the process (a sketch of the command this step drives appears after this walkthrough). Watch the Build Window closely for any errors during the build. If the engine builds successfully, you will see output like this at the end:
  ```
  [03/12/2024-10:21:08] [TRT] [I] Engine generation completed in 35.9738 seconds.
  [03/12/2024-10:21:08] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 212 MiB, GPU 775 MiB
  [03/12/2024-10:21:08] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +775, now: CPU 0, GPU 775 (MiB)
  [03/12/2024-10:21:09] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 6600 MiB
  [03/12/2024-10:21:09] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:36
  [03/12/2024-10:21:09] [TRT-LLM] [I] Serializing engine to trtllm_quant_gpt2_int4_awq/trtllm-engine/trrank0.engine...
  [03/12/2024-10:21:11] [TRT-LLM] [I] Engine serialized. Total time: 00:00:02
  [03/12/2024-10:21:11] [TRT-LLM] [I] Total time of building all engines: 00:00:41
  ```
- After successfully downloading a model, quantizing it, and building an inference engine, the paths of each model are saved in the `model_paths.json` file, which is located in the `code/nim-factory-ui/backend` path of the project. In our case, the file should look like this:

  ```json
  {
    "base_models": {
      "gpt2": "/models/gpt2"
    },
    "quant_models": {
      "quant_gpt2_int4_awq": "/models/quant_gpt2_int4_awq"
    },
    "trtllm_engines": {
      "trtllm_quant_gpt2_int4_awq": "/models/trtllm_quant_gpt2_int4_awq"
    }
  }
  ```

  It can be used to track all the models when the application is deployed to remote servers.
- Check your "models" folder for an inference engine directory whose name starts with "trtllm". If it exists, you can run the following command manually to interact with the model (a short sketch that resolves the engine path from `model_paths.json` and launches this command appears after this walkthrough):

  ```
  python3 run.py --engine_dir models/trtllm_quant_gpt2_int4_awq
  ```

  At the moment, the Run Engine tab, which will let users chat comfortably with their inference engine, is under active development.
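For reference, the Start Quantization step above boils down to running TensorRT-LLM's quantization tooling on the downloaded checkpoint. The sketch below is an assumption about what such a call looks like for the GPT-2 / int4_awq example; the script path and flag names follow the TensorRT-LLM examples and can differ between releases, so treat it as an outline rather than the app's exact command.

```python
# Hypothetical sketch of the quantization step behind the Start Quantization
# button: it invokes TensorRT-LLM's example quantization script. The script
# path and flags follow the TensorRT-LLM examples and may differ per release.
import subprocess

subprocess.run(
    [
        "python3", "examples/quantization/quantize.py",  # script from the TensorRT-LLM repo
        "--model_dir", "models/gpt2",                    # downloaded Hugging Face model
        "--qformat", "int4_awq",                         # qformat chosen in the UI
        "--output_dir", "models/quant_gpt2_int4_awq",    # folder for the quantized checkpoint
    ],
    check=True,
)
```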
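Likewise, the Start Engine Building step wraps TensorRT-LLM's trtllm-build CLI. The directories below mirror this guide's GPT-2 example and the max input length value is illustrative; check the TensorRT-LLM documentation for the flag set in your release.

```python
# Hypothetical sketch of the engine-building step behind the Start Engine
# Building button, wrapping TensorRT-LLM's trtllm-build CLI. Flag names and
# defaults vary by release.
import subprocess

subprocess.run(
    [
        "trtllm-build",
        "--checkpoint_dir", "models/quant_gpt2_int4_awq",     # quantized checkpoint from the previous step
        "--output_dir", "models/trtllm_quant_gpt2_int4_awq",  # where the serialized engine is written
        "--max_input_len", "1024",                            # illustrative; keep <= max_position_embeddings
    ],
    check=True,
)
```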
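Finally, until the Run Engine tab is ready, a small helper like the one below can resolve a built engine from `model_paths.json` and launch the `run.py` command shown above. The key names mirror the example file in this guide, so adjust them to the engine you actually built.

```python
# Minimal sketch: look up a built engine in model_paths.json and start the
# interactive run.py script manually. Key names mirror the example file above.
import json
import subprocess

with open("code/nim-factory-ui/backend/model_paths.json") as f:
    paths = json.load(f)

engine_dir = paths["trtllm_engines"]["trtllm_quant_gpt2_int4_awq"]
subprocess.run(["python3", "run.py", "--engine_dir", engine_dir], check=True)
```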
Link: video link
Check out my achievement and project submission to the hackathon: hackathon_submission_link
This project is a demonstration of the first steps toward building your own "NIMs". A lot remains to be done to improve the application, such as advanced error handling and enabling more TensorRT-LLM features. We are actively developing the Run Engine tab so that users can interact with the built engine, despite the hardware resource shortage we are experiencing at the moment. The application serves as a good stepping stone for anyone who wants to discover the power of the TensorRT libraries and take advantage of them.