Trouble running llama.cpp compiled for OpenMPI #3752
Comments
Inference with MPI is usually slower than without it.
MPI is mostly working in #3334, but I haven't rebased on master in a while and there's a KV cache bug. Performance using MPI will indeed be slower if you can fit the entire model in RAM, because the original implementation uses a simple ring pipeline architecture and splits layers over the nodes, not tensors, so only one node is running at once. I'm working on a way to optimize this using speculative inference and asynchronous computation on a different branch of mine.
#3228 is already fixed, but MPI is not working for me because of the following:
We wanted to run an LLM on a cluster of Android TV boxes (aarch64), which are cheap but have only 4 GB of RAM.
You can follow the progress on MPI in #3334. Right now that branch should work for the most part, but there's a KV cache synchronization issue that prevents more advanced usage like the speculative example. I've paused development of that branch while I finish my master's class semester project, but the issue has been solved in a different branch of mine if you need that functionality immediately.
Is there a branch where MPI works?
@ageorgios on my fork, there is a branch called
Hey @AutonomicPerfectionist, thanks for your amazing efforts! I tried to run
That's expected; on that branch I added the ability to set how many layers each MPI node should work on via the stated command line argument, but I haven't had the time to fix the original behavior of implicitly distributing the layers evenly when that argument isn't given. EDIT: I'll have to update that error message to actually reference the command line argument... You need to use the new --mpi-layer-split argument.
I've just tried this (sorry for the long unreadable file paths):
mpirun -hostfile llm_experiments/llama.cpp/hostnames -n 2 --map-by node --mca oob_base_verbose 100 --mca btl_tcp_if_include eth0 llm_experiments/llama_experimental/llama.cpp/main -m llm_experiments/jetson-containers/data/models/text-generation-webui/slimorca-13b.Q5_K_M.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e --mpi-layer-split 50.0,50.0
And got this crash:
Same when
Oops, sorry, by float representing percentage I meant something like 0.5,0.5. Also, I wouldn't expect GPU offloading to work correctly at the moment; I haven't tested it, but I expect things to break spectacularly since the MPI backend currently modifies the graph. I definitely plan on fixing that though, just need to get through the next week of semester tests first.
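For reference, a minimal sketch of the corrected invocation on two nodes (the hostfile and model paths here are shortened placeholders, not the exact ones used above) would look something like:
# hypothetical two-node run on the fork's layer-split branch, giving each node half the layers
mpirun -hostfile ./hostnames -n 2 --map-by node ./main -m ./models/slimorca-13b.Q5_K_M.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e --mpi-layer-split 0.5,0.5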
I tried 0.5 too, and got the same error 👀
Hmmm, I'll give it a look when I have time. Honestly, I'm not surprised it's broken; like I said, that branch is extremely volatile and I've only been testing with the speculative example. You can try the
Any updates on this bug or workarounds? I ran MPI on ARM64 and am running into a process-aborted error with the main branch (mpirun -hostfile mpihostfile -n 3 ./main -m ./models/gguf/llama-2-13b.Q4_0.gguf -p "Write a story about llamas").
I found that MPI was broken when PR #3228 was merged. Its TODOs section mentioned that MPI would be fixed in future PRs, but I could not find any branches or discussions indicating any work being done in this direction. Even the branches @AutonomicPerfectionist mentioned, mpi-speculative and mpi-heterogenous, are not present any more.
They are still present; they're on my fork, not the main repo. I'm in the process of rebasing on master, but it's taking a while due to the extensive changes to how backends work.
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
I have an old Blade CPU cluster with HS22s, each running dual Xeon E5540 CPUs. The best one has 72 GB of RAM. I have been trying to use llama.cpp with my existing OpenMPI install to distribute Mistral-7B across my cluster and see whether it makes any difference in inference rate.
I was inspired by the guy in #2164 who successfully ran llama.cpp across a bunch of Raspberry Pis, so it seems like it should be possible.
I ran
make CC=mpicc CXX=mpicxx LLAMA_MPI=1 -j
to compile it for compatibility with OpenMPI and then tried to run it on the model I downloaded:
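The exact command isn't captured here, but a minimal sketch of such an invocation (the hostfile name and model filename below are hypothetical placeholders) would be:
# hypothetical run spreading inference across the nodes listed in the hostfile
mpirun -hostfile ./hostfile -n 2 ./main -m ./models/mistral-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 128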
Current Behavior
It seemed to load the model and start setting things up but then bombed. Here's what I got:
I checked line 5876 in llama.cpp and the code surrounding it is this:
Line 5876 is the one that says GGML_ASSERT(false && "not implemented");
Environment and Context
System:
Ubuntu Server 22.04 LTS
HS22 with dual Xeon E5540 processors and 72GB RAM
Running OpenMPI V4.1.2
$ uname -a
Linux blade8 5.15.0-87-generic #97-Ubuntu SMP Mon Oct 2 21:09:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux