Trouble running llama.cpp compiled for OpenMPI #3752

Closed
Jdo300 opened this issue Oct 23, 2023 · 16 comments

Labels: bug (Something isn't working), stale

Comments

Jdo300 commented Oct 23, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I have an old blade CPU cluster of HS22s, each running dual Xeon E5540 CPUs; the best one has 72 GB of RAM. I have been trying to see whether I can use llama.cpp with my existing OpenMPI install to distribute Mistral-7B across the cluster, and whether that makes any difference in inference rate.

I was inspired by the user in #2164 who successfully ran llama.cpp across a bunch of Raspberry Pis, so it seems like this should be possible.

I ran
make CC=mpicc CXX=mpicxx LLAMA_MPI=1 -j

to compile it for compatibility with OpenMPI and then tried to run it on the model I downloaded:

mpirun -hostfile ~/hostfile -n 2 ./main -m ~/models/mistral-7b-v0.1.Q4_K_M.gguf -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -i --n-predict -2 -p "Hello, how are you?"

Current Behavior

It seemed to load the model and start setting things up, but then it crashed. Here's what I got:

cluster@blade8:~/llama.cpp$ mpirun -hostfile ~/hostfile -n 2 ./main -m ~/models/mistral-7b-v0.1.Q4_K_M.gguf -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -i --n-predict -2 -p "Hello, how are you?"
Log start
main: build = 1415 (6336701)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1698096051
Log start
main: build = 1415 (6336701)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1698096051
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /home/cluster/models/mistral-7b-v0.1.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]

...
llama_model_loader: - tensor  290:                    output.weight q6_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                       llama.context_length u32
llama_model_loader: - kv   3:                     llama.embedding_length u32
llama_model_loader: - kv   4:                          llama.block_count u32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32
llama_model_loader: - kv   7:                 llama.attention.head_count u32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv  10:                       llama.rope.freq_base f32
llama_model_loader: - kv  11:                          general.file_type u32
llama_model_loader: - kv  12:                       tokenizer.ggml.model str
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv  19:               general.quantization_version u32
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly Q4_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.07 GiB (4.83 BPW)
llm_load_print_meta: general.name   = mistralai_mistral-7b-v0.1
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.10 MB
llm_load_tensors: mem required  = 4165.46 MB
...............................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =  256.00 MB
llama_new_context_with_model: compute buffer total size = 162.13 MB
GGML_ASSERT: llama.cpp:5876: false && "not implemented"
[blade8:09703] *** Process received signal ***
[blade8:09703] Signal: Aborted (6)
[blade8:09703] Signal code:  (-6)
[blade8:09703] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f90ca49f520]
[blade8:09703] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f90ca4f39fc]
[blade8:09703] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f90ca49f476]
[blade8:09703] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f90ca4857f3]
[blade8:09703] [ 4] ./main(+0x6aa93)[0x55b88ad8ea93]
[blade8:09703] [ 5] ./main(+0xa0628)[0x55b88adc4628]
[blade8:09703] [ 6] ./main(+0x137a9)[0x55b88ad377a9]
[blade8:09703] [ 7] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f90ca486d90]
[blade8:09703] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f90ca486e40]
[blade8:09703] [ 9] ./main(+0x1b3c5)[0x55b88ad3f3c5]
[blade8:09703] *** End of error message ***
Aborted (core dumped)

I checked line 5876 in llama.cpp and the code surrounding it is this:

... other code
    GGML_ASSERT(n_tokens <= n_batch);

    int n_threads = n_tokens == 1 ? cparams.n_threads : cparams.n_threads_batch;
    GGML_ASSERT((!batch.token && batch.embd) || (batch.token && !batch.embd)); // NOLINT

    const int64_t t_start_us = ggml_time_us();

#ifdef GGML_USE_MPI
    // TODO: needs fix after #3228
    GGML_ASSERT(false && "not implemented");
    //ggml_mpi_eval_init(lctx.ctx_mpi, &n_tokens, &n_past, &n_threads);
#endif

    GGML_ASSERT(n_threads > 0);

    auto & kv_self = lctx.kv_self;

    GGML_ASSERT(!!kv_self.ctx);
... more other code

Line 5876 is the GGML_ASSERT(false && "not implemented"); statement.

Environment and Context

System:
Ubuntu Server 22.04 LTS
HS22 with dual Xeon E5540 processors and 72GB RAM
Running OpenMPI V4.1.2

  • Physical (or virtual) hardware you are using, e.g. for Linux:
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         40 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU           E5540  @ 2.53GHz
    CPU family:          6
    Model:               26
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           2
    Stepping:            5
    CPU max MHz:         2527.0000
    CPU min MHz:         1596.0000
    BogoMIPS:            5066.73
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse
                         36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm
                         constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_ts
                         c cpuid aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 x
                         tpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm pti ssbd ibrs ibpb stibp tp
                         r_shadow vnmi flexpriority ept vpid dtherm flush_l1d
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   256 KiB (8 instances)
  L1i:                   256 KiB (8 instances)
  L2:                    2 MiB (8 instances)
  L3:                    16 MiB (2 instances)
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-3,8-11
  NUMA node1 CPU(s):     4-7,12-15
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnera
                         ble
  Mds:                   Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Unknown: No mitigations
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional,
                          RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
  • Operating System, e.g. for Linux:

$ uname -a
Linux blade8 5.15.0-87-generic #97-Ubuntu SMP Mon Oct 2 21:09:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

  • SDK version, e.g. for Linux:
$ python3 --version
Python 3.10.12
$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Jdo300 added the bug label on Oct 23, 2023
shibe2 (Contributor) commented Oct 24, 2023

Inference with MPI is usually slower than without it.

AutonomicPerfectionist (Contributor) commented

MPI is mostly working in #3334, but I haven't rebased on master in a while and there's a KV cache bug. Performance using MPI will indeed be slower if you can fit the entire model in RAM, because the original implementation uses a simple ring pipeline architecture and splits layers over the nodes, not tensors, so only one node is running at a time. I'm working on a way to optimize this using speculative inference and asynchronous computation on a different branch of mine.
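
To make the layer-split pipeline concrete, here is a minimal standalone MPI sketch (not the actual ggml-mpi code; the layer count, activation width, and run_layer placeholder are assumptions): each rank evaluates a contiguous slice of layers and hands its activations to the next rank, which is why only one rank is busy at any moment.

// Rough sketch only: a layer-split ring pipeline over MPI ranks, where each
// rank owns a contiguous slice of layers and activations are handed from one
// rank to the next, so only one rank computes at a time.
#include <mpi.h>
#include <cstdio>
#include <vector>

// Stand-in for evaluating one transformer layer on an activation buffer.
static void run_layer(int layer, std::vector<float> &act) {
    (void) layer;
    for (float &x : act) x += 1.0f; // placeholder compute
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_layers = 32;   // e.g. a 7B model (assumed)
    const int n_embd   = 4096; // activation width (assumed)

    // Contiguous layer slice for this rank; the last rank takes the remainder.
    const int per_rank = n_layers / size;
    const int first    = rank * per_rank;
    const int last     = (rank == size - 1) ? n_layers : first + per_rank;

    std::vector<float> act(n_embd, 1.0f);

    // Everyone except rank 0 blocks until the previous stage sends its output.
    if (rank > 0) {
        MPI_Recv(act.data(), n_embd, MPI_FLOAT, rank - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    for (int l = first; l < last; ++l) {
        run_layer(l, act);
    }

    // Everyone except the last rank forwards its output down the pipeline.
    if (rank < size - 1) {
        MPI_Send(act.data(), n_embd, MPI_FLOAT, rank + 1, 0, MPI_COMM_WORLD);
    } else {
        std::printf("rank %d finished layers %d..%d\n", rank, first, last - 1);
    }

    MPI_Finalize();
    return 0;
}

Built with mpicxx and launched via mpirun -n 2 (or more), the ranks simply take turns on their layer slices, which is the hand-off pattern described above.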


xor2003 commented Nov 24, 2023

#3228 is already fixed but MPI is not working for me because of the following:

#ifdef GGML_USE_MPI
    ctx->ctx_mpi = ggml_mpi_init();

    if (ggml_mpi_rank(ctx->ctx_mpi) > 0) {
        // Enter a blocking eval loop with dummy input, letting rank=0 drive the process
        // TODO: needs fix after #3228
        GGML_ASSERT(false && "not implemented");

We wanted to run an LLM on a cluster of Android TV boxes (aarch64), which are cheap but have only 4 GB of RAM each.
A small LLM works on a single node, but MPI does not work :-(

AutonomicPerfectionist (Contributor) commented

You can follow the progress on MPI in #3334. Right now that branch should work for the most part, but there's a KV cache synchronization issue that prevents more advanced usage like the speculative example. I've paused development of that branch while I finish my master's class semester project, but the issue has been solved in a different branch of mine if you need that functionality immediately.

ageorgios commented

Is there a branch where MPI works?

AutonomicPerfectionist (Contributor) commented

@ageorgios on my fork, there is a branch called mpi-speculative that should work. Be warned though: this branch is for my Master's class project, so I've made very large changes to the codebase and may make more in the future. It is not guaranteed to work at any given point, and there may be bugs in areas I have not tested.

vvsotnikov commented

Hey @AutonomicPerfectionist, thanks for your amazing efforts! I tried to run mpi-speculative on two Nvidia Jetson Xavier NX, but I'm getting GGML_ASSERT: llama.cpp:8822: false && "Must have same number of split percentages as devices". For comparison, the version from master 45855b3 (the last commit before #3228 that broke MPI) works w/ the same setup, but CPU-only (passing -ngl 128 makes LLM produce garbage text like in #3099). I hope that helps you somehow :) Let me know if I can help you with further testing.

AutonomicPerfectionist (Contributor) commented Dec 8, 2023

That's expected: on that branch I added the ability to set how many layers each MPI node should work on via a new command-line argument, but I haven't had time to fix the original behavior of implicitly distributing the layers evenly when that argument isn't given.

EDIT: I'll have to update that error message to actually reference the command-line argument... You need to use the new --mpi-layer-split argument; the values are given as floats representing the percentage of layers, with each float separated by a comma. With multiple communicators, you can set the layer split separately for each one by using a slash to separate the groups of values. Note that user code needs to take that into account right now; only the speculation example does so.

vvsotnikov commented

I've just tried this (sorry for the long unreadable file paths):

mpirun -hostfile llm_experiments/llama.cpp/hostnames -n 2 --map-by node --mca oob_base_verbose 100 --mca btl_tcp_if_include eth0 llm_experiments/llama_experimental/llama.cpp/main -m llm_experiments/jetson-containers/data/models/text-generation-webui/slimorca-13b.Q5_K_M.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e --mpi-layer-split 50.0,50.0

And got this crash:

Building a website can be done in 10 simple steps:
Step 1:[arts-x-01:07176] *** Process received signal ***
[arts-x-01:07176] Signal: Segmentation fault (11)
[arts-x-01:07176] Signal code: Address not mapped (1)
[arts-x-01:07176] Failing at address: 0xaaaace084af0
[arts-x-01:07176] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffff9f8267c0]
[arts-x-01:07176] [ 1] /usr/lib/aarch64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_recv_request_progress_frag+0x3c)[0xffff73c86b64]
[arts-x-01:07176] [ 2] /usr/lib/aarch64-linux-gnu/openmpi/lib/openmpi3/mca_btl_tcp.so(+0x815c)[0xffff73caa15c]
[arts-x-01:07176] [ 3] /lib/aarch64-linux-gnu/libevent-2.1.so.7(+0x1ffac)[0xffff7d8a1fac]
[arts-x-01:07176] [ 4] /lib/aarch64-linux-gnu/libevent-2.1.so.7(event_base_loop+0x50c)[0xffff7d8a2a4c]
[arts-x-01:07176] [ 5] /lib/aarch64-linux-gnu/libopen-pal.so.40(+0x27b24)[0xffff7d990b24]
[arts-x-01:07176] [ 6] /lib/aarch64-linux-gnu/libopen-pal.so.40(opal_progress+0x94)[0xffff7d990c84]
[arts-x-01:07176] [ 7] /lib/aarch64-linux-gnu/libopen-pal.so.40(ompi_sync_wait_mt+0xb4)[0xffff7d997894]
[arts-x-01:07176] [ 8] /usr/lib/aarch64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_recv+0x6d4)[0xffff73c79d3c]
[arts-x-01:07176] [ 9] /lib/aarch64-linux-gnu/libmpi.so.40(MPI_Recv+0xf8)[0xffff9555d3a0]
[arts-x-01:07176] [10] llm_experiments/llama_experimental/llama.cpp/main(+0xb6dec)[0xaaaad1056dec]
[arts-x-01:07176] [11] llm_experiments/llama_experimental/llama.cpp/main(+0x6155c)[0xaaaad100155c]
[arts-x-01:07176] [12] llm_experiments/llama_experimental/llama.cpp/main(+0x66450)[0xaaaad1006450]
[arts-x-01:07176] [13] llm_experiments/llama_experimental/llama.cpp/main(+0xdc78)[0xaaaad0fadc78]
[arts-x-01:07176] [14] /lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0xe8)[0xffff950d8e10]
[arts-x-01:07176] [15] llm_experiments/llama_experimental/llama.cpp/main(+0x149a4)[0xaaaad0fb49a4]
[arts-x-01:07176] *** End of error message ***

The same crash happens when -ngl 128 is also passed.

AutonomicPerfectionist (Contributor) commented Dec 8, 2023

Oops, sorry: by a float representing a percentage I meant something like 0.5 for 50%. I definitely could've worded that better.
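
In other words, the earlier two-node command would presumably become (only the --mpi-layer-split values change):

mpirun -hostfile llm_experiments/llama.cpp/hostnames -n 2 --map-by node --mca oob_base_verbose 100 --mca btl_tcp_if_include eth0 llm_experiments/llama_experimental/llama.cpp/main -m llm_experiments/jetson-containers/data/models/text-generation-webui/slimorca-13b.Q5_K_M.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e --mpi-layer-split 0.5,0.5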

Also, I wouldn't expect GPU offloading to work correctly at the moment. I haven't tested it, but I expect things to break spectacularly since the MPI backend currently modifies the graph. I definitely plan on fixing that, though; I just need to get through the next week of semester tests first.

vvsotnikov commented

I tried 0.5 too, and got the same error 👀

AutonomicPerfectionist (Contributor) commented

Hmmm, I'll give it a look when I have time. Honestly, I'm not surprised it's broken; like I said, that branch is extremely volatile and I've only been testing with the speculative example. You can try the mpi-heterogenous branch; it's missing a KV cache synchronization fix but should otherwise work.


amgowda-oci commented Jan 4, 2024

Any updates on this bug, or any workarounds? I ran MPI on ARM64 and am running into a process-aborted error with the main branch (mpirun -hostfile mpihostfile -n 3 ./main -m ./models/gguf/llama-2-13b.Q4_0.gguf -p "Write a story about llamas").


mermutt commented Feb 11, 2024

I found that MPI was broken when PR #3228 was merged. Its TODO section mentioned that MPI would be fixed in future PRs, but I could not find any branches or discussions indicating work being done in this direction. Even the branches @AutonomicPerfectionist mentioned, mpi-speculative and mpi-heterogenous, are not present any more.
I'd be interested in hearing any updates in this area.

AutonomicPerfectionist (Contributor) commented

They are still present; they're on my fork, not the main repo. I'm in the process of rebasing on master, but it's taking a while due to the extensive changes to how backends work.

github-actions bot added the stale label on Mar 19, 2024

github-actions bot commented Apr 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Apr 2, 2024