AMD GPU support via llama.cpp HIPBLAS #92
Comments
I've never used an AMD GPU before, so I can't answer questions about them. However, we're happy to consider your request that llamafile support them. I'd encourage anyone else who wants this to leave a comment saying so. That'll help us gauge the interest level and determine what to focus on.
I would also like this! Thanks! Your work is mind-blowing.
I'm also interested in using ROCm, as I don't want to pay double for Nvidia!
Agreed, I'd also love AMD support. In my opinion it also fits with the general mission of this project: don't let big tech companies get a monopoly on LLMs, and don't let Nvidia get a monopoly on AI computing. Thanks for all the hard work, this is great.
Thanks for the suggestions, folks. We're going to look into this. Any recommendations on a specific AMD card we should use for dev and testing? We could pick up an RX 7900 XTX (i.e. basically an RTX 4090 equivalent), but I want to make sure the model we pick is broadly representative in terms of hardware support.
AMD officially supports ROCm on only one or two consumer-level GPUs, the RX 7900 XTX being one of them, and only on a limited set of Linux distributions. However, by following the guide here on Fedora, I managed to get both an RX 7800 XT and the integrated GPU in a Ryzen 7840U running ROCm perfectly fine. Those are the mid and lower models of their RDNA3 lineup, so I think it's fair to say all RDNA3 cards would work. Judging from other ROCm-related topics on PyTorch, people with RDNA2 (RX 6xxx) series cards seem to be the majority, probably due to the competitive pricing after the crypto mining boom.
Any RDNA3 card should work fine; it's just the "official" support that's been limited to Navi 31-based GPUs. Ideally a CMake-based pipeline would be best for Windows support, but I'd understand if the makefile paradigm is what Cosmopolitan is built on.
So I managed to get llamafile compiling ggml-cuda.so using HIP, but it fails at runtime:
Attaching strace log. Using Ubuntu 22.04.2 LTS. EDIT: This error goes away after setting the execstack bit on the amdhip64 runtime library. EDIT 2: PR made at #122. Works fine on my Navi 31 machine running Ubuntu 22.04. I expect it to work fine on Windows as well, though that's not tested yet.
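For anyone debugging a similar runtime failure, here is a small standalone diagnostic, purely my own sketch (not llamafile code; the module path is a placeholder): it dlopens the HIP-built ggml-cuda module and prints the dynamic loader's error string, which is where complaints such as the executable-stack requirement tend to surface.

```cpp
// Hypothetical diagnostic: try to load a HIP-built GPU module and report why
// the dynamic loader rejected it. Build with: c++ dlopen_check.cpp -ldl
#include <cstdio>
#include <dlfcn.h>

int main(int argc, char **argv) {
    const char *path = argc > 1 ? argv[1] : "./ggml-cuda.so";  // placeholder path
    void *handle = dlopen(path, RTLD_LAZY | RTLD_GLOBAL);
    if (!handle) {
        std::fprintf(stderr, "dlopen(%s) failed: %s\n", path, dlerror());
        return 1;
    }
    std::printf("loaded %s OK\n", path);
    dlclose(handle);
    return 0;
}
```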
I don't have enough experience to give advice, but I can contribute testing on a Thinkpad T495 with
I would also like this feature added, since it is technically possible. I've used my 6750 XT successfully with llama.cpp on Linux.
User with a Radeon 6800 XT here; I'd be more than happy to test things out in order to get Radeon GPU support. I am using Debian GNU/Linux.
@github12101 @batfasturd could you try this PR on Linux? #122 I've compiled the binary; let me attach it.
My apologies, but I don't know how to install and launch this. I tried to run it, but it throws an error. Any guideline or tutorial would be great to have.
Can you share the error here?
Should this work on AMD 6000-series GPUs?
Just a quick update that we now have an RDNA2 card, and an RDNA3 card is on the way. @jart is actively working on adding this support!
I've just cut a 0.5 release which features outstanding AMD support for Windows users. Let me tell you how cool it is. If I download and run the LLaVA llamafile, being certain to pass the
Then it takes ~2 seconds to start up (on a cold start!) before a tab with the web UI pops up in my browser. I can then upload a picture and ask LLaVA a question, and I get a response back in a few seconds. That's a good deal, since I'm doing it with a $300 AMD Radeon RX 6800 graphics card. Here's the best part: support only depends on the graphics card driver. The HIP SDK is for developers, so I think it's nice that we won't need to ask users to install it in order to use llamafile. You can if you want to, in which case llamafile will compile a better GPU module that links hipBLAS (instead of tinyBLAS) the first time you run your llamafile. That will make inference go faster, although it takes about 30 seconds to run the clang++ compiler that comes with the ROCm SDK. I'm going to have a Linux computer with the RDNA3 card @stlhood mentioned soon, probably by mid-month. We should be able to have excellent AMD support there too, although installing the AMD HIP tools will need to be a requirement, since Linux and the BSDs don't have the same kind of binary friendliness.
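To make the tinyBLAS/hipBLAS split concrete, here is a rough sketch of the idea as I understand it, not llamafile's actual code: the file names, paths, and hipcc flags below are all assumptions. The host program prefers a hipBLAS-linked module when a ROCm compiler is available, and otherwise falls back to a prebuilt tinyBLAS module that only needs the graphics driver.

```cpp
// Hypothetical sketch of driver-only fallback vs. compiled hipBLAS module.
// Every path, file name, and flag here is illustrative, not llamafile's own.
#include <cstdio>
#include <cstdlib>
#include <dlfcn.h>

static void *load_gpu_module() {
    // If the ROCm compiler is installed, build the faster hipBLAS variant once.
    if (std::system("hipcc --version >/dev/null 2>&1") == 0) {
        int rc = std::system(
            "hipcc -x hip -O2 -fPIC -shared ggml-rocm-module.cpp "
            "-lhipblas -o /tmp/ggml-rocm.so");
        if (rc == 0)
            if (void *h = dlopen("/tmp/ggml-rocm.so", RTLD_LAZY | RTLD_GLOBAL))
                return h;
    }
    // Otherwise fall back to a prebuilt tinyBLAS module that only needs the driver.
    return dlopen("./ggml-tinyblas.so", RTLD_LAZY | RTLD_GLOBAL);
}

int main() {
    void *h = load_gpu_module();
    std::printf("GPU module %s\n", h ? "loaded" : "unavailable, falling back to CPU");
    return 0;
}
```

The point is just the decision order: hipBLAS when the toolchain exists, tinyBLAS when only the driver exists, CPU as the last resort.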
My 6900 XT on Windows "just worked", pulling 143 watts and producing a nice LLaVA 7B output. Thank you!!
Happy to hear it @bennmann! Also, there's more good news. I've just shipped llamafile v0.6, which adds support for AMD GPUs on Linux too. Unlike on Windows, Linux users need to install the AMD ROCm SDK; there's no prebuilt binary. llamafile will build your GPU support the first time you run your llamafile, using the hipcc compiler. I've tested it with a Radeon RX 7900 XTX. The v0.6 release also adds support for multiple GPUs. I know for certain it works with NVIDIA, and I have a second Radeon coming in the mail, so I'll be able to test that it works with AMD too. With that said, I think this issue is satisfactorily solved. Please report any issues or suboptimal experiences you have. The only real area of improvement I know we need at the moment is that our tinyBLAS kernels don't go as fast on AMD as they do on NVIDIA, where we developed them. We'll be looking into that soon. Note that this only impacts Windows users who haven't installed the HIP ROCm SDK on their computers. Installing the SDK is what you want if your goal is maximum performance, since the hipBLAS library doesn't come with the video drivers that Windows installs. Enjoy!
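If you're on Linux and want to confirm your ROCm install can actually drive hipBLAS before pointing llamafile at it, a tiny standalone sanity check like the one below can help. This is my own sketch, not part of llamafile, and the include path can vary between ROCm versions.

```cpp
// Minimal hipBLAS sanity check: multiply two 2x2 matrices on the GPU.
// Build (assuming a standard ROCm install): hipcc sgemm_check.cpp -lhipblas
#include <hip/hip_runtime.h>
#include <hipblas/hipblas.h>   // older ROCm releases use <hipblas.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 2;                        // 2x2 matrices, column-major
    std::vector<float> a = {1, 2, 3, 4};    // A
    std::vector<float> b = {5, 6, 7, 8};    // B
    std::vector<float> c(n * n, 0.0f);      // C = A * B

    float *da, *db, *dc;
    hipMalloc((void **)&da, sizeof(float) * a.size());
    hipMalloc((void **)&db, sizeof(float) * b.size());
    hipMalloc((void **)&dc, sizeof(float) * c.size());
    hipMemcpy(da, a.data(), sizeof(float) * a.size(), hipMemcpyHostToDevice);
    hipMemcpy(db, b.data(), sizeof(float) * b.size(), hipMemcpyHostToDevice);

    hipblasHandle_t handle;
    hipblasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C, all matrices n x n
    hipblasSgemm(handle, HIPBLAS_OP_N, HIPBLAS_OP_N, n, n, n,
                 &alpha, da, n, db, n, &beta, dc, n);
    hipMemcpy(c.data(), dc, sizeof(float) * c.size(), hipMemcpyDeviceToHost);

    std::printf("C = [%g %g; %g %g]\n", c[0], c[2], c[1], c[3]);  // expect [23 31; 34 46]
    hipblasDestroy(handle);
    hipFree(da); hipFree(db); hipFree(dc);
    return 0;
}
```

If it prints the expected matrix, hipBLAS and the driver are talking to the card correctly.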
I'm having an issue getting GPU support on Fedora 39 with the 0.6 release.
Meanwhile, I do have
Is there anything else I need?
Can confirm it's working really well with a 5700 XT on Windows 11.
Can confirm it works really well with an AMD Radeon RX 6700 XT on Windows 11 (~36 tokens/sec versus ~5.6 tokens/sec on CPU only)! Thank you for landing the AMD support! Note that Windows complained about it containing a trojan (Detected: Trojan:Win32/Sabsik.FL.A!ml). I checked that the sha256 matched the one declared on Hugging Face (9c37a9a8e3f067dea8c028db9525b399fc53b267667ed9c2a60155b1aa75) and went ahead with it, but that was a bit surprising. Am I the only one getting this warning? The parameters are "-ngl 35 --nocompile" for the tinyBLAS solution, but what are the parameters if I install ROCm?
Is there documentation somewhere that explains how to run llamafile on Ubuntu with an AMD GPU?
Tried to get my 6650 XT to work under Nobara (Fedora-based) by installing rocm-hip-sdk, and got this error after what I think was a failure to properly build on first launch:
Launching with the GPU again just gives me the last error now. No idea what else I'm missing or what I did wrong here, but it's certainly not an easy experience under Linux with AMD GPUs (previously it would just default to CPU mode, which is way too slow to be usable).
I'm getting a very similar bug:
I got it working out of the box with an AMD Radeon 6800M mobile GPU :)
Hello,
First of all, thank you for your work on llamafile; it seems like a great idea to simplify model usage.
It seems from the README that at this stage llamafile does not support AMD GPUs.
The cuda.c in the llamafile backend seems dedicated to CUDA, while ggml-cuda.h in llama.cpp has a GGML_USE_HIPBLAS option for ROCm support. ROCm is now officially supported by llama.cpp, according to their README section about hipBLAS.
I understand that ROCm support was maybe not priority #1 for llamafile, but I was wondering whether you had already tried the llama.cpp HIPBLAS option and have some insight into the work that would need to be done in llamafile in order to add this GPU family as a target.
From what I understand, llama.cpp would take care of the GPU side of things, and llamafile would need to be modified to JIT-compile llama.cpp with the correct flags, and it might need a specific toolchain for the compilation (at least the ROCm SDK).
Thanks for sharing your experience on this.
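For context on what the GGML_USE_HIPBLAS route involves, the fragment below is an illustrative reconstruction of the aliasing pattern ggml-cuda uses, with representative names only, not an exact excerpt: when the flag is defined, the CUDA runtime and cuBLAS symbols are #define'd to their HIP/hipBLAS equivalents, so the same source compiles for either vendor.

```cpp
// Illustrative reconstruction (representative names, not an exact excerpt):
// with GGML_USE_HIPBLAS defined, CUDA-named calls resolve to HIP/hipBLAS.
#ifdef GGML_USE_HIPBLAS
#include <hip/hip_runtime.h>
#include <hipblas/hipblas.h>
#define cudaMalloc      hipMalloc
#define cudaFree        hipFree
#define cudaSuccess     hipSuccess
#define cublasHandle_t  hipblasHandle_t
#define cublasCreate    hipblasCreate
#define cublasDestroy   hipblasDestroy
#else
#include <cuda_runtime.h>
#include <cublas_v2.h>
#endif

#include <cstdio>

int main() {
    // Written against the CUDA names; on ROCm the macros above redirect
    // every call to the matching HIP/hipBLAS function.
    float *buf = nullptr;
    if (cudaMalloc((void **)&buf, 1024 * sizeof(float)) != cudaSuccess) {
        std::puts("device allocation failed");
        return 1;
    }
    cublasHandle_t handle;
    cublasCreate(&handle);
    std::puts("BLAS handle created on the GPU backend");
    cublasDestroy(handle);
    cudaFree(buf);
    return 0;
}
```

Compile it with nvcc (linking cuBLAS) or with hipcc and -DGGML_USE_HIPBLAS (linking hipBLAS); the point is that one source tree can serve both backends, which is what makes the JIT-compile-with-the-right-flags approach viable.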