ggml-cuda.cu:1278: to_fp32_cuda != nullptr #7211
Comments
The error message is unhelpful, but this is just a consequence of there being no CUDA implementation for BF16. With 0 GPU layers the GPU is used for prompt processing only if the prompt is at least 32 tokens long, so that is most likely why it only crashes sometimes.
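For reference, widening BF16 to FP32 is essentially a bit shift, since a BF16 value is the upper 16 bits of the corresponding IEEE-754 float. Below is a minimal sketch of the kind of conversion kernel that is missing; the names (`bf16_to_fp32_kernel`, `bf16_to_fp32_cuda`) are hypothetical and not taken from llama.cpp:

```cuda
// Illustrative only: a minimal BF16 -> FP32 conversion kernel of the kind
// that is missing. Names and signatures are hypothetical, not llama.cpp's.
#include <cuda_runtime.h>
#include <stdint.h>

static __global__ void bf16_to_fp32_kernel(const uint16_t * src, float * dst, int64_t n) {
    const int64_t i = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) {
        return;
    }
    // A bf16 value is the upper 16 bits of the corresponding fp32 value,
    // so widening amounts to shifting it into the high half of a 32-bit word.
    const uint32_t bits = (uint32_t) src[i] << 16;
    dst[i] = __uint_as_float(bits);
}

static void bf16_to_fp32_cuda(const uint16_t * src, float * dst, int64_t n, cudaStream_t stream) {
    const int block_size = 256;
    const int num_blocks = (int) ((n + block_size - 1) / block_size);
    bf16_to_fp32_kernel<<<num_blocks, block_size, 0, stream>>>(src, dst, n);
}
```

A kernel like this would presumably also need to be registered wherever `to_fp32_cuda` is looked up, since the assertion at `ggml-cuda.cu:1278` fires precisely because that lookup returns `nullptr` for BF16.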
Ah, gotcha. Looking at the command-line options, I didn't notice anything to disable prompt processing on the GPU. Is that the case?
There is no CLI option; you have to compile without the CUDA build flags.
I am facing the same issue with the same model - see #7223 (now closed). Is BF16 unsupported by CUDA in general, or only by llama.cpp? EDIT: FYI, it works for me if I keep the batch size under 32 by setting the batch-size flag accordingly.
It's a llama.cpp issue.
+1 for BF16 CUDA llama.cpp support
Please add it.
I'm currently creating an imatrix from f32 that will end up taking about two days total. I assume BF16 would have cut that in half.
For ROCm support too.
Can anyone share a quick fix for this?
There is no fix. CUDA support for BF16 is simply not implemented. Your only options are to either implement it yourself or to use a llama.cpp version without CUDA. |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
I was trying this model: https://huggingface.co/ddh0/Meta-Llama-3-8B-Instruct-bf16-GGUF
Depending on the prompt, it sometimes works. When offloading layers to the GPU it seems to crash no matter the prompt.