
AMD GPU support via llama.cpp HIPBLAS #92

Closed
jeromew opened this issue Dec 13, 2023 · 26 comments

Comments
@jeromew
Contributor

jeromew commented Dec 13, 2023

Hello,

First of all, thank you for your work on llamafile; it seems like a great idea for simplifying model usage.

It seems from the README that, at this stage, llamafile does not support AMD GPUs.
The cuda.c in the llamafile backend seems dedicated to CUDA, while ggml-cuda.h in llama.cpp has a GGML_USE_HIPBLAS option for ROCm support. ROCm is now officially supported by llama.cpp, according to their README section on hipBLAS.

I understand that ROCm support was perhaps not priority #1 for llamafile, but I was wondering if you had already tried llama.cpp's HIPBLAS option and had some insight into the work that would need to be done in llamafile in order to add this GPU family as a target.

From what I understand, llama.cpp would take care of the GPU side of things, and llamafile would need to be modified to JIT-compile llama.cpp with the correct flags, possibly requiring a specific toolchain for the compilation (at least the ROCm SDK).
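
For reference, a rough sketch of how standalone llama.cpp enables its ROCm backend (based on its hipBLAS instructions at the time; the option names and the example gfx target are assumptions that may differ between versions):

# sketch only: assumes ROCm under /opt/rocm and the LLAMA_HIPBLAS option of that era
make LLAMA_HIPBLAS=1
# or via CMake, picking an example GPU architecture:
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ cmake -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1030
cmake --build build --config Release

So llamafile would presumably need to pass similar defines when it JIT-compiles the GGML CUDA/HIP module.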

Thanks for sharing your experience on this

@jart
Collaborator

jart commented Dec 13, 2023

I've never used an AMD GPU before, so I can't answer questions about them. However, we're happy to consider your request that llamafile support them. I'd encourage anyone else who wants this to leave a comment saying so. That'll help us gauge the interest level and determine what to focus on.

@franalbani

I would also like this!

Thanks! Your work is mind-blowing.

@mildwood

I'm also interested in using ROCm as I don't want to pay double for Nvidia!

@jesserizzo

Agreed, I'd also love AMD support. In my opinion it also fits with the general mission of this project. Don't let big tech companies get a monopoly on LLMs, and also don't let Nvidia get a monopoly on AI computing. I dunno. Thanks for all the hard work, this is great.

@stlhood
Collaborator

stlhood commented Dec 18, 2023

Thanks for the suggestions, folks. We're going to look into this.

Any recommendations on a specific AMD card we should use for dev and testing? We could pick up an RX 7900 XTX (i.e. basically a RTX 4090 equivalent) but I want to make sure the model we pick is broadly representative in terms of hardware support.

@lovenemesis

AMD officially supports ROCm on only one or two consumer-level GPUs, the RX 7900 XTX being one of them, and only on a limited set of Linux distributions.

However, by following the guide here on Fedora, I managed to get both an RX 7800 XT and the integrated GPU inside the Ryzen 7840U running ROCm perfectly fine. Those are the mid and lower models of the RDNA3 lineup, so I think it's fair to say all RDNA3 cards should work.

Judging from other ROCm-related topics on PyTorch, it seems that people with RDNA2 (RX 6xxx) series cards are in the majority, probably due to the competitive pricing after the crypto-mining boom.

@jammm
Contributor

jammm commented Dec 18, 2023

Thanks for the suggestions, folks. We're going to look into this.

Any recommendations on a specific AMD card we should use for dev and testing? We could pick up an RX 7900 XTX (i.e. basically a RTX 4090 equivalent) but I want to make sure the model we pick is broadly representative in terms of hardware support.

Any RDNA3 card should work fine. It's just that "official" support has been limited to Navi31-based GPUs.
You can also compile and run llama.cpp just fine on Windows using the HIP SDK. My primary OS is Windows and I could get this port running myself, but I'm having difficulty setting up cosmos bash (I'm getting cosmo++ permission-denied errors). If someone can help me set up cosmos bash properly, I should be able to get llamafile up and running on RDNA3 GPUs within a couple of hours or so.

Ideally a CMake-based pipeline would be best for Windows support, but I'd understand if the makefile paradigm is what Cosmopolitan is built on.
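
For what it's worth, the kind of CMake invocation I have in mind on Windows looks roughly like this (a sketch only, assuming the HIP SDK's clang is on PATH and a Navi31 target; exact option names depend on the llama.cpp version):

cmake -B build -G Ninja -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1100
cmake --build build --config Release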

@jammm
Contributor

jammm commented Dec 18, 2023

So I managed to get llamafile compiling ggml-cuda.so using HIP, but it fails at runtime:

building ggml-cuda with nvcc -arch=native...
/usr/bin/hipcc -march=native --shared -use_fast_math -fPIC -O3 -march=native -mtune=native -DNDEBUG -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_USE_HIPBLAS -o /home/rpr/.llamafile/ggml-cuda.so.zmifde /home/rpr/.llamafile/ggml-cuda.cu -lhipblas -lrocblas
/home/rpr/.llamafile/ggml-cuda.so
libamdhip64.so.5: cannot enable executable stack as shared object requires: Invalid argument: failed to load library
warning: GPU offload not supported on this platform; GPU related options will be ignored
warning: you might need to install xcode (macos) or cuda (windows, linux, etc.) check the output above to see why support wasn't linked

normal dlopen() works fine with ggml-cuda.so. I'm not sure why cosmo_dlopen fails here. CC @jart any ideas?

Attaching strace log:
strace_llamafile_hip.txt

Using Ubuntu 22.04.2 LTS.

EDIT: Looks like this error goes away by clearing the executable-stack requirement on the amdhip64 runtime library:
sudo execstack -c /opt/rocm/lib/libamdhip64.so.5
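
(If anyone wants to check whether a given library actually requests an executable stack, inspecting its GNU_STACK program header is one way; a small sketch, assuming the default ROCm install path:)

readelf -lW /opt/rocm/lib/libamdhip64.so.5 | grep GNU_STACK
# flags ending in "E" (RWE) mean the library asks for an executable stack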

EDIT EDIT: PR made at #122. Works fine on my navi31 machine running Ubuntu 22.04. I expect it to work fine on Windows as well, though it's not tested yet.

@jammm mentioned this issue Dec 18, 2023
@franalbani

Thanks for the suggestions, folks. We're going to look into this.

Any recommendations on a specific AMD card we should use for dev and testing? We could pick up an RX 7900 XTX (i.e. basically a RTX 4090 equivalent) but I want to make sure the model we pick is broadly representative in terms of hardware support.

I don't have enough experience to give advice, but I can contribute testing on a ThinkPad T495 with AMD Radeon Vega 10 Graphics.

@batfasturd

I would also like this feature added since it is technically possible. I've used my 6750 XT successfully with llama.cpp on Linux.

@github12101

User with a Radeon 6800 XT here; I'll be more than happy to test things out in order to get Radeon GPU support. I am using Debian GNU/Linux.

@jammm
Contributor

jammm commented Dec 22, 2023

@github12101 @batfasturd try this PR for linux? #122

I've compiled the binary. Let me attach it
llamafile_hip_linux.zip

@github12101

@github12101 @batfasturd try this PR for linux? #122

I've compiled the binary. Let me attach it llamafile_hip_linux.zip

My apologies but I don't know how to install and launch this. I tried to run it, but it throws an error. Any guideline/tutorial would be great to have.

@jammm
Contributor

jammm commented Dec 22, 2023

@github12101 @batfasturd try this PR for linux? #122
I've compiled the binary. Let me attach it llamafile_hip_linux.zip

My apologies but I don't know how to install and launch this. I tried to run it, but it throws an error. Any guideline/tutorial would be great to have.

Can you share the error here?
In order for this to work, you need to have ROCm installed on your Linux system. Is that installed?

@jesserizzo

jesserizzo commented Dec 23, 2023

@github12101 @batfasturd try this PR for linux? #122

I've compiled the binary. Let me attach it llamafile_hip_linux.zip

Should this work on AMD 6000-series GPUs?
Edit: I tried it on my 6600 and there are no errors, but it doesn't seem to be doing anything. I thought I read somewhere that it only works on 7000-series GPUs, but one of the project maintainers commented on PR #122 that they were going to test on a 6800, so now I'm confused.

@stlhood
Collaborator

stlhood commented Jan 2, 2024

Just a quick update that we now have an RDNA2 card, and an RDNA3 card is on the way. @jart is actively working on adding this support!

@jart
Collaborator

jart commented Jan 5, 2024

I've just cut a 0.5 release which features outstanding AMD support for Windows users. Let me tell you about how cool it is. If I download and run the LLaVA llamafile, being certain to pass the -ngl 35 flag to turn GPU support on:

curl -L -o llava-v1.5-7b-q4.llamafile https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-q4.llamafile
.\llava-v1.5-7b-q4.llamafile -ngl 35 --nocompile

Then it takes ~2 seconds to start up (on a cold start!) before a tab with the web UI pops up in my browser. I can then upload a picture and ask LLaVA a question, and I get a response back in a few seconds. That's a good deal, since I'm doing it with a $300 AMD Radeon RX 6800 graphics card.

[screenshot of the LLaVA web UI]

Here's the best part. Support only depends on the graphics card driver. The HIP SDK is for developers, so I think it's nice that we won't need to ask users to install it in order to use llamafile. You can if you want to, in which case llamafile will compile a better GPU module that links hipBLAS (instead of tinyBLAS) the first time you run your llamafile. That will make inference go faster, although it takes about 30 seconds to run the clang++ command that comes with the ROCm SDK.
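
(A hedged example of that path: the assumption is that with the HIP SDK installed you simply drop --nocompile, so the hipBLAS-linked GPU module gets built on the first run and reused on later runs:)

.\llava-v1.5-7b-q4.llamafile -ngl 35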

I'm going to have a Linux computer with the RDNA3 card @stlhood mentioned soon, probably by mid-month. We should be able to have excellent AMD support there too, although installing the AMD HIP tools will need to be a requirement, since the Linux and BSD platforms don't have the same kind of binary friendliness.

@bennmann

bennmann commented Jan 9, 2024

I've just cut a 0.5 release which features outstanding AMD support for Windows users. Let me tell you about how cool it is. If I download and run the LLaVA llamafile, being certain to pass the -ngl 35 flag to turn GPU support on:

curl -L -o llava-v1.5-7b-q4.llamafile https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-q4.llamafile
.\llava-v1.5-7b-q4.llamafile -ngl 35 --nocompile

My 6900 XT on Windows "just worked", pulling 143 watts and producing a nice LLaVA 7B output. Thank you!!

{"timestamp":1704831422,"level":"INFO","function":"log_server_request","line":2741,"message":"request","remote_addr":"","remote_port":-1,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 released (153 tokens in cache)
slot 0 is processing [task id: 2]
slot 0 : in cache: 47 tokens | to process: 22 tokens
slot 0 : kv cache rm - [47, end)

print_timings: prompt eval time =      88.55 ms /    22 tokens (    4.02 ms per token,   248.46 tokens per second)
print_timings:        eval time =    7817.18 ms /   400 runs   (   19.54 ms per token,    51.17 tokens per second)
print_timings:       total time =    7905.73 ms
slot 0 released (470 tokens in cache)

@jart
Collaborator

jart commented Jan 9, 2024

Happy to hear it @bennmann!

Also, there's more good news. I've just shipped llamafile v0.6, which adds support for AMD GPUs on Linux too. Unlike on Windows, Linux users need to install the AMD ROCm SDK; there's no prebuilt binary. llamafile will build your GPU support the first time you run your llamafile, using the hipcc compiler. I've tested it with a Radeon RX 7900 XTX. The v0.6 release also adds support for multiple GPUs. I know for certain it works with NVIDIA. I have a second Radeon coming in the mail, so I'll be able to test that it works with AMD too.
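
(A minimal sketch of the Linux flow, assuming a default ROCm install under /opt/rocm; your distribution may put hipcc elsewhere:)

chmod +x llava-v1.5-7b-q4.llamafile
export PATH=/opt/rocm/bin:$PATH          # make sure hipcc is visible
./llava-v1.5-7b-q4.llamafile -ngl 35     # first run compiles the hipBLAS GPU module with hipcc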

With that said, I think this issue is satisfactorily solved. Please report any issues or suboptimal experiences you have. The only real area of improvement I know we need at the moment is that our tinyBLAS kernels don't go as fast on AMD as they do on NVIDIA, where we developed them. We'll be looking into that soon. Note that this only impacts Windows users who haven't installed the HIP ROCm SDK on their computers. Installing it is what you want if your goal is maximum performance, since the hipBLAS library doesn't come with the video drivers that Windows installs.

Enjoy!

@jart closed this as completed Jan 9, 2024
@lovenemesis

I'm having an issue getting GPU support on Fedora 39 with the 0.6 release.

HSA_OVERRIDE_GFX_VERSION=11.0.0 ./llamafile-0.6 -ngl 35 -m mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf 
initializing gpu module...
note: won't compile AMD GPU support because $HIP_PATH/bin/clang++ is missing
prebuilt binary /zip/ggml-rocm.so not found
prebuilt binary /zip/ggml-cuda.so not found
fatal error: --n-gpu-layers 35 was passed but no gpus were found

Meanwhile, I do have clang++ and hipcc available in $PATH.

sudo rpm -ql hipcc clang
/usr/bin/hipcc
/usr/bin/hipcc.pl
/usr/bin/hipconfig
/usr/bin/hipconfig.pl
/usr/share/perl5/vendor_perl/hipvars.pm
/usr/bin/clang
/usr/bin/clang++
/usr/bin/clang++-17
/usr/bin/clang-17
/usr/bin/clang-cl
/usr/bin/clang-cpp
/usr/lib/.build-id
/usr/lib/.build-id/32
/usr/lib/.build-id/32/e94d93e9ba24c19eb5ffdd7288d637d7cda793
/usr/lib/.build-id/32/e94d93e9ba24c19eb5ffdd7288d637d7cda793.1
/usr/lib/.build-id/32/e94d93e9ba24c19eb5ffdd7288d637d7cda793.2
/usr/lib/.build-id/32/e94d93e9ba24c19eb5ffdd7288d637d7cda793.3
/usr/share/licenses/clang
/usr/share/licenses/clang/LICENSE.TXT
/usr/share/man/man1/clang++-17.1.gz
/usr/share/man/man1/clang++.1.gz
/usr/share/man/man1/clang-17.1.gz
/usr/share/man/man1/clang.1.gz

Is there anything else I need?
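
(One thing that might be worth trying, purely as an assumption based on the "$HIP_PATH/bin/clang++ is missing" message above: on Fedora the ROCm tools live under /usr rather than /opt/rocm, so pointing HIP_PATH at that prefix could help:)

export HIP_PATH=/usr     # assumption: the prefix that contains bin/hipcc and bin/clang++
HSA_OVERRIDE_GFX_VERSION=11.0.0 ./llamafile-0.6 -ngl 35 -m mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf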

@AwesomeApple12

AwesomeApple12 commented Jan 10, 2024

Screenshot 2024-01-10 021429

I've just cut a 0.5 release which features outstanding AMD support for Windows users. Let me tell you about how cool it is. If I download and run the LLaVA llamafile, being certain to pass the -ngl 35 flag to turn GPU support on:

curl -L -o llava-v1.5-7b-q4.llamafile https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-q4.llamafile
.\llava-v1.5-7b-q4.llamafile -ngl 35 --nocompile

Can confirm it's working really well with a 5700 XT on Windows 11.

@jeromew
Contributor Author

jeromew commented Jan 10, 2024

Can confirm it works really well with an AMD Radeon RX 6700 XT on Windows 11 (~36 tokens/sec versus ~5.6 tokens/sec on CPU only)! Thank you for landing the AMD support!

Note that Windows complained about it containing a trojan:

Detected: Trojan:Win32/Sabsik.FL.A!ml
File: D:\user\dev\llava-v1.5-7b-q4.llamafile

I checked that the SHA-256 was equal to the one declared on Hugging Face (9c37a9a8e3f067dea8c028db9525b399fc53b267667ed9c2a60155b1aa75) and went through with it, but that was a bit surprising. Am I the only one getting this warning?

The parameters are "-ngl 35 --nocompile" for the tinyBLAS solution, but what are the parameters if I install ROCm?

@Amine-Smahi

Is there documentation somewhere to guide us on running llamafile on Ubuntu with an AMD GPU?

@Dark-Thoughts

Tried to get my 6650 XT to work under Nobara (Fedora-based) by installing rocm-hip-sdk, and got this error after it (I think) failed to build properly on first launch:

./mistral-7b-instruct-v0.2.Q5_K_M.llamafile -ngl 999
import_cuda_impl: initializing gpu module...
get_rocm_bin_path: note: amdclang++ not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/amdclang++ does not exist
get_rocm_bin_path: note: hipInfo not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/hipInfo does not exist
get_rocm_bin_path: note: /opt/rocm/bin/hipInfo does not exist
llamafile_log_command: /usr/bin/rocminfo
llamafile_log_command: hipcc -O3 -fPIC -shared -DNDEBUG --offload-arch=gfx1032 -march=native -mtune=native -use_fast_math -DGGML_BUILD=1 -DGGML_SHARED=1 -Wno-return-type -Wno-unused-result -DGGML_USE_HIPBLAS -DGGML_CUDA_MMV_Y=1 -DGGML_MULTIPLATFORM -DGGML_CUDA_DMMV_X=32 -DIGNORE4 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DIGNORE -o /home/*****/.llamafile/ggml-rocm.so.ikigfn /home/*****/.llamafile/ggml-cuda.cu -lhipblas -lrocblas
/home/*****/.llamafile/ggml-cuda.cu:408:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
}
^
/home/*****/.llamafile/ggml-cuda.cu:777:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
}
^
/home/*****/.llamafile/ggml-cuda.cu:5132:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
    mul_mat_q4_K(
    ^
/home/*****/.llamafile/ggml-cuda.cu:5132:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/*****/.llamafile/ggml-cuda.cu:5199:1: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
mul_mat_q5_K(
^
/home/*****/.llamafile/ggml-cuda.cu:5199:1: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/*****/.llamafile/ggml-cuda.cu:5268:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
    mul_mat_q6_K(
    ^
/home/*****/.llamafile/ggml-cuda.cu:5268:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/*****/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
static __global__ void soft_max_f32(const float * x, const float * y, float * dst, const int ncols_par, const int nrows_y, const float scale) {
                       ^
/home/*****/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/*****/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/*****/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/*****/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/*****/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
14 warnings generated when compiling for gfx1032.
/home/*****/.llamafile/ggml-cuda.cu:408:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
}
^
/home/*****/.llamafile/ggml-cuda.cu:777:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
}
^
2 warnings generated when compiling for host.
link_cuda_dso: note: dynamically linking /home/*****/.llamafile/ggml-rocm.so
ggml_cuda_link: welcome to ROCm SDK with hipBLAS
link_cuda_dso: GPU support linked

rocBLAS error: Cannot read /opt/rocm-5.6.1/lib/rocblas/library/TensileLibrary.dat: No such file or directory
Aborted (core dumped)

Launching with the GPU again just gives me the last error now. No idea what else I'm missing or what I did wrong here, but it's certainly not an easy experience under Linux with AMD GPUs (previously it would just default to CPU mode, which is way too slow to be usable).
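
(A possible workaround, offered as an assumption rather than something confirmed in this thread: rocBLAS typically only ships Tensile kernel libraries for officially supported targets such as gfx1030, so RDNA2 cards like the 6650 XT (gfx1032) are often run with a GFX version override:)

HSA_OVERRIDE_GFX_VERSION=10.3.0 ./mistral-7b-instruct-v0.2.Q5_K_M.llamafile -ngl 999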

@nameiwillforget

nameiwillforget commented Feb 25, 2024

I'm getting a very similar bug:

[alex@Arch wizard]$ sh wizardcoder-python-34b-v1.0.Q5_K_M.llamafile -ngl 9999 
import_cuda_impl: initializing gpu module...
get_rocm_bin_path: note: amdclang++ not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/amdclang++ does not exist
get_rocm_bin_path: note: /opt/rocm/bin/amdclang++ does not exist
get_rocm_bin_path: note: hipInfo not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/hipInfo does not exist
get_rocm_bin_path: note: /opt/rocm/bin/hipInfo does not exist
llamafile_log_command: /opt/rocm/bin/rocminfo
llamafile_log_command: hipcc -O3 -fPIC -shared -DNDEBUG --offload-arch=gfx1032 -march=native -mtune=native -use_fast_math -DGGML_BUILD=1 -DGGML_SHARED=1 -Wno-return-type -Wno-unused-result -DGGML_USE_HIPBLAS -DGGML_CUDA_MMV_Y=1 -DGGML_MULTIPLATFORM -DGGML_CUDA_DMMV_X=32 -DIGNORE4 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DIGNORE -o /home/alex/.llamafile/ggml-rocm.so.r2whcc /home/alex/.llamafile/ggml-cuda.cu -lhipblas -lrocblas
/home/alex/.llamafile/ggml-cuda.cu:408:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
}
^
/home/alex/.llamafile/ggml-cuda.cu:777:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
}
^
/home/alex/.llamafile/ggml-cuda.cu:5132:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
    mul_mat_q4_K(
    ^
/home/alex/.llamafile/ggml-cuda.cu:5132:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/alex/.llamafile/ggml-cuda.cu:5199:1: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
mul_mat_q5_K(
^
/home/alex/.llamafile/ggml-cuda.cu:5199:1: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/alex/.llamafile/ggml-cuda.cu:5268:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
    mul_mat_q6_K(
    ^
/home/alex/.llamafile/ggml-cuda.cu:5268:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/alex/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
static __global__ void soft_max_f32(const float * x, const float * y, float * dst, const int ncols_par, const int nrows_y, const float scale) {
                       ^
/home/alex/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/alex/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/alex/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/alex/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
/home/alex/.llamafile/ggml-cuda.cu:6034:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
14 warnings generated when compiling for gfx1032.
/home/alex/.llamafile/ggml-cuda.cu:408:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
}
^
/home/alex/.llamafile/ggml-cuda.cu:777:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
}
^
2 warnings generated when compiling for host.
link_cuda_dso: note: dynamically linking /home/alex/.llamafile/ggml-rocm.so
wizardcoder-python-34b-v1.0.Q5_K_M.llamafile: /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/rocclr/os/os_posix.cpp:321: static void amd::Os::currentStackInfo(unsigned char**, size_t*): Assertion `Os::currentStackPtr() >= *base - *size && Os::currentStackPtr() < *base && "just checking"' failed.
cosmoaddr2line /home/alex/.local/bin/wizardcoder-python-34b-v1.0.Q5_K_M.llamafile 7fe973ea932c 7fe973e1a0e0 7fe973e1a0e0 7fe973e1a0e0 7fe973e1a0e0 7fe973e1a0e0 7fe973e1a0e0 7fe973e1a0e0 7fe973e1a0e0 7fe973e1a0e0 7fe973e1a0e0 7fe973e1a0e0 7fe973e1a0e0

0x00007fe973ea932c: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0
0x00007fe973e1a0e0: ?? ??:0

10008004-10008006 rw-pa-       3x automap 192kB w/ 64kB hole
10008008-10008011 rw-pa-      10x automap 640kB w/ 14gB hole
10040060-10098eec r--s-- 364'173x automap 22gB w/ 96tB hole
6fd00004-6fd0000c rw-paF       9x zipos 576kB w/ 64gB hole
6fe00004-6fe00004 rw-paF       1x g_fds 64kB
# 22gB total mapped memory
/home/alex/.local/bin/wizardcoder-python-34b-v1.0.Q5_K_M.llamafile -m wizardcoder-python-34b-v1.0.Q5_K_M.gguf -c 0 -ngl 9999 
Aborted (core dumped)

@Trubador

I got it working out of the box with an AMD Radeon 6800M mobile GPU :)
It seems to run nicely, about 10 times faster than on the CPU.
