
Very slow - any way to speed up? #300

Closed
luke-jr opened this issue Dec 21, 2022 · 22 comments
Labels: build (Build related issues) · help wanted (Extra attention is needed) · performance (CPU and memory usage - results and comparisons)

Comments

luke-jr commented Dec 21, 2022

Per #10,

Noting that the processing time is considerably shorter than the length of speech,

Yet even using 64 threads, it's taking days to process minutes of audio on my POWER9.

Has something changed since #10, or is there something I am doing wrong?

RndyP commented Dec 22, 2022

What is your platform?

ggerganov added the performance, build, and help wanted labels on Dec 22, 2022
ggerganov (Owner) commented:
@luke-jr
I'm not familiar with POWER9, but from a quick ChatGPT search, it seems this CPU has a RISC architecture:

[screenshot of ChatGPT output omitted]

Currently, whisper.cpp supports only x86 and ARM architectures. By "support", I mean that it uses the available SIMD instruction set to make the computation efficient. On other architectures, it falls back to non-SIMD computation, which is many times slower.

Adding support for Power ISA (or whatever the instruction set is called) should not be very difficult. The matrix multiplication routines in ggml.c need to be extended to support the respective instruction set and the corresponding compile flags added to the Makefile.

I don't have experience with this architecture, so hopefully someone contributes.
It will be very interesting to see what the performance is on these CPUs.
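
To give a rough idea, here is an untested sketch (illustrative only, not code from this repo) of what a Power ISA (VSX) variant of a ggml-style f32 dot product could look like. The vec_* intrinsics come from <altivec.h>; the function name and the -mvsx / -mcpu=power9 compile flags are assumptions, not something already in the Makefile.

#include <altivec.h>

// hypothetical VSX variant of a ggml-style f32 dot product (sketch only)
static void vec_dot_f32_vsx(const int n, float * s, const float * x, const float * y) {
    vector float sum = vec_splats(0.0f);

    const int n4 = n & ~3;                        // 4 floats per VSX register
    for (int i = 0; i < n4; i += 4) {
        const vector float vx = vec_xl(0, x + i); // unaligned vector loads
        const vector float vy = vec_xl(0, y + i);
        sum = vec_madd(vx, vy, sum);              // fused multiply-add
    }

    float total = vec_extract(sum, 0) + vec_extract(sum, 1)
                + vec_extract(sum, 2) + vec_extract(sum, 3);

    for (int i = n4; i < n; ++i) {                // scalar leftovers
        total += x[i]*y[i];
    }

    *s = total;
}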

luke-jr (Author) commented Dec 22, 2022

Yeah, I'm not surprised it isn't optimised for PPC64, but I wouldn't expect it to be orders of magnitude slower either. Real-time to days is a huge difference. :/

ggerganov (Owner) commented Dec 22, 2022

For example, on my Ryzen 9 5950X, if I remove the -mavx -mavx2 -mfma -mf16c flags I observe about 50x slower computation in the bench tool. Removing those flags is similar to what you have on PPC64 - no SIMD, no F16C support.

SIMD can make a huge difference

fitzsim (Contributor) commented Dec 23, 2022

ChatGPT is out-of-date regarding the Power ISA being proprietary. It is open source now, just like RISC-V. See https://openpowerfoundation.org/.

luke-jr (Author) commented Dec 23, 2022

After #320, ./main -m models/ggml-base.en.bin -f samples/jfk.wav takes 15.7 seconds.

Additional options              Time
-p 64                           77 s
-t 64                           28 s
-t 1                            59.6 s
-t 16                           5.8 s
-t 32                           5.1 s
-m models/ggml-large.bin -t 32  111.1 s

ChatGPT is out-of-date regarding the Power ISA being proprietary.

In my experience, ChatGPT tends to be wrong quite often.

ggerganov (Owner) commented:
@fitzsim @luke-jr
I am planning to merge a refactored version of the SIMD routines in ggml which I think will make things easier to maintain in the future. The PR is pretty much ready in #324

All instruction sets fit quite nicely in the proposed pattern, but I'm having a little trouble with the ppc64le stuff since I don't have a way to test it. So for the moment, I've special-cased it, which is not great.

If you are interested and have some free time, you can take a look at the implementation and see if you can fit it in the new pattern. Or at the very least - run a test and see that it still works after the changes.

Regarding the new performance: 5 s for jfk.wav using the base model still seems like a lot. I'm not sure why the performance is so bad.

fitzsim (Contributor) commented Dec 23, 2022

@ggerganov, sure, I'll try to fit the POWER9 optimizations into the main SIMD structure, some time after #324 lands in the master branch.

Agreed regarding 5s likely not being optimal. @luke-jr, can you add the whisper_print_timings lines to your table? They may contain hints about further optimization efforts.

luke-jr (Author) commented Dec 23, 2022

$ time ./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 32
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: adding 1607 extra tokens
whisper_model_load: mem_required  =  506.00 MB
whisper_model_load: ggml ctx size =  140.60 MB
whisper_model_load: memory size   =   22.83 MB
whisper_model_load: model size    =  140.54 MB

system_info: n_threads = 32 / 64 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 32 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   110.77 ms
whisper_print_timings:      mel time =    49.83 ms
whisper_print_timings:   sample time =     8.41 ms
whisper_print_timings:   encode time =  3631.16 ms / 605.19 ms per layer
whisper_print_timings:   decode time =  1374.97 ms / 229.16 ms per layer
whisper_print_timings:    total time =  5175.76 ms

real    0m5.187s
user    2m31.675s
sys     0m1.078s

fitzsim (Contributor) commented Dec 31, 2022

The remaining slowness seems to be in the fp16 (16-bit short) to fp32 conversion. Would it make sense to try a GGML_TYPE_F32 version of ggml-base.en.bin, to eliminate the conversion steps? Can someone outline the steps for trying that?

ggerganov (Owner) commented:
The steps are like this:

# we need this for the f32 conversion
git clone https://github.com/openai/whisper

# create f32 ggml model (assumes you have ~/.cache/whisper/base.en.pt downloaded from original repo)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
python3 models/convert-pt-to-ggml.py ~/.cache/whisper/base.en.pt ../whisper . 1

# use the new f32 model
make -j
./main -m ./ggml-model-f32.bin samples/jfk.wav

You need the following patch/hack in whisper.cpp to increase the memory buffers:

diff --git a/whisper.cpp b/whisper.cpp
index 84c2490..8709723 100644
--- a/whisper.cpp
+++ b/whisper.cpp
@@ -131,7 +131,7 @@ static const std::map<std::string, std::pair<int, std::string>> g_lang = {
     { "su",  { 98,  "sundanese",      } },
 };
 
-static const size_t MB = 1024*1024;
+static const size_t MB = 3*1024*1024;
 
 static const std::map<e_model, size_t> MEM_REQ_MODEL = {
     { MODEL_TINY,     74ull*MB },

RndyP commented Jan 2, 2023

I used the Visual Studio performance profiler to see where all the CPU time is spent. Half the time is in the SIMD code here:
[profiler screenshot omitted]

I reviewed the code for any obvious speed-up opportunities. Nothing major, except that I believe ax[] and ay[] are not necessary. You can write the summation like so:
sum[j] = GGML_F16_VEC_FMA(sum[j], GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR), GGML_F16_VEC_LOAD(y + i + j*GGML_F16_EPR));
This didn't help the times though; I think the optimizing compiler figures this out on its own.
The other thing that stands out is this:
[profiler screenshot omitted]
Not sure if this is an opportunity for improvement. Instead of looping with a while, might it be better to wait on an event?

fitzsim (Contributor) commented Jan 3, 2023

Thanks for the model instructions @ggerganov.

With the FP32 model and #366 I get:

$ time ./main -t 32 -m ../fp32-model/ggml-model-f32.bin samples/jfk.wav
whisper_model_load: loading model from '../fp32-model/ggml-model-f32.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 0
whisper_model_load: type          = 2
whisper_model_load: adding 1607 extra tokens
whisper_model_load: mem_required  = 1518.00 MB
whisper_model_load: ggml ctx size =  276.98 MB
whisper_model_load: memory size   =   22.83 MB
whisper_model_load: model size    =  276.92 MB

system_info: n_threads = 32 / 64 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 32 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   236.47 ms
whisper_print_timings:      mel time =    42.08 ms
whisper_print_timings:   sample time =     4.75 ms
whisper_print_timings:   encode time =  1945.92 ms / 324.32 ms per layer
whisper_print_timings:   decode time =   933.50 ms / 155.58 ms per layer
whisper_print_timings:    total time =  3163.23 ms

real	0m3.182s
user	1m17.748s
sys	0m0.607s

ggerganov (Owner) commented:
@fitzsim
Great work! Will take a look at the PRs in the following days and merge after I make sure the other platforms work correctly.

prsyahmi (Contributor) commented Jan 4, 2023

Hi, I tend to agree with @RndyP.

I profiled it a few weeks ago and found that you are using spin locks. I changed them to events and used WaitForMultipleObjects (I'm on Windows). CPU usage did tame down, but I didn't bother to benchmark it at the time.

These are the bench results for commit afe2db0.
The event-based version seems to perform better on my PC.

CPU: Intel(R) Core(TM) i7-8750H @ 2.20GHz

whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: adding 1607 extra tokens
whisper_model_load: mem_required  =  506.00 MB
whisper_model_load: ggml ctx size =  140.60 MB
whisper_model_load: memory size   =   22.83 MB
whisper_model_load: model size    =  140.54 MB

system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

Spinlock: All cores 100% CPU usage

whisper_print_timings:     load time =   312.28 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2975.37 ms / 495.89 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3288.11 ms

whisper_print_timings:     load time =   285.14 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2932.89 ms / 488.81 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3218.43 ms

whisper_print_timings:     load time =   267.65 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2930.10 ms / 488.35 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3198.02 ms

whisper_print_timings:     load time =   270.98 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2821.18 ms / 470.20 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3092.38 ms

Event: CPU usage tamed

whisper_print_timings:     load time =   308.21 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2791.27 ms / 465.21 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3099.88 ms

whisper_print_timings:     load time =   268.62 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2687.68 ms / 447.95 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  2956.58 ms

whisper_print_timings:     load time =   267.01 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2727.19 ms / 454.53 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  2994.49 ms

whisper_print_timings:     load time =   294.01 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2803.29 ms / 467.22 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3097.75 ms

whisper_print_timings:     load time =   293.43 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2876.54 ms / 479.42 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3170.70 ms

The more recent commit, f00509d, seems slower on my PC without any change to the code.
Spinlock:

whisper_print_timings:     load time =   268.64 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  3209.34 ms / 534.89 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3478.29 ms

whisper_print_timings:     load time =   270.09 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  3391.52 ms / 565.25 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3661.90 ms

whisper_print_timings:     load time =   310.33 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  3375.87 ms / 562.64 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3686.47 ms

ggerganov (Owner) commented Jan 5, 2023

Can you demonstrate the Event-based Windows implementation?
I tried waiting on condition_variable instead of spin locks, but it wasn't more efficient. Maybe I missed something.
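
For reference, a minimal sketch (names invented for illustration, not code from whisper.cpp) of the condition-variable approach being discussed: workers sleep until the main thread publishes work, instead of spinning on an atomic flag. The per-node wake-up latency of this scheme may be why it wasn't faster on short tasks.

#include <pthread.h>
#include <stdbool.h>

struct work_queue {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            has_work;
    bool            stop;
};

// worker side: block until there is work or a stop request
static void worker_wait(struct work_queue * q) {
    pthread_mutex_lock(&q->mutex);
    while (!q->has_work && !q->stop) {
        pthread_cond_wait(&q->cond, &q->mutex);
    }
    pthread_mutex_unlock(&q->mutex);
}

// main-thread side: publish work and wake all workers
static void post_work(struct work_queue * q) {
    pthread_mutex_lock(&q->mutex);
    q->has_work = true;
    pthread_cond_broadcast(&q->cond);
    pthread_mutex_unlock(&q->mutex);
}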

fitzsim (Contributor) commented Jan 5, 2023

@luke-jr
Now that #369 is merged, can you try bench with various arguments and post an updated table of results to #89? Then #300 can probably be closed.
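
For example (assuming the Makefile's bench target and the usual -m/-t flags):

make bench
./bench -m models/ggml-base.en.bin -t 16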

ggerganov (Owner) commented:
@fitzsim
We just merged an FP16 lookup table (#368) that is used when F16C intrinsics are not available.
I believe this will lead to a significant improvement on POWER9 platforms using the F16 models.
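
Roughly, the idea behind the lookup table is the following (a simplified sketch with illustrative names, not the actual ggml code): precompute all 65536 possible fp16-to-fp32 conversions once at startup, so the hot path becomes a single table load per element instead of per-element bit manipulation.

#include <math.h>
#include <stdint.h>

static float fp16_to_fp32_table[1 << 16];

// plain scalar fp16 -> fp32 conversion, used only to fill the table once
static float fp16_to_fp32_scalar(uint16_t h) {
    const uint32_t sign = (h >> 15) & 1;
    const uint32_t exp  = (h >> 10) & 0x1f;
    const uint32_t mant =  h        & 0x3ff;
    float v;
    if (exp == 0) {
        v = ldexpf((float) mant, -24);                       // zero / subnormal
    } else if (exp == 31) {
        v = mant ? NAN : INFINITY;                           // nan / inf
    } else {
        v = ldexpf((float)(mant | 0x400), (int) exp - 25);   // normal
    }
    return sign ? -v : v;
}

static void init_fp16_table(void) {
    for (uint32_t i = 0; i < (1 << 16); ++i) {
        fp16_to_fp32_table[i] = fp16_to_fp32_scalar((uint16_t) i);
    }
}

// hot path: one array load per element
static inline float fp16_to_fp32(uint16_t h) {
    return fp16_to_fp32_table[h];
}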

prsyahmi (Contributor) commented Jan 7, 2023

@ggerganov I'm using the WinAPI directly. My intention was to reduce CPU usage; maybe I'll try again with condition_variable and see if it makes any difference.

index c5780ed..7ad5be6 100644
--- "a/ggml.c"
+++ "b/ggml.c"
@@ -1118,7 +1118,44 @@ inline static void ggml_vec_mad_f16(const int n, ggml_fp16_t * restrict y, ggml_
 #endif
 }
 
-inline static void ggml_vec_scale_f32(const int n, float * y, const float   v) { for (int i = 0; i < n; ++i) y[i] *= v;          }
+//inline static void ggml_vec_scale_f32(const int n, float * y, const float   v) { for (int i = 0; i < n; ++i) y[i] *= v;          }
+inline static void ggml_vec_scale_f32(const int n, float * y, const float   v) {
+#if defined(__AVX__) || defined(__AVX2__)
+    // AVX 256-bit
+    const int n32 = (n & ~31);
+
+    const __m256 v4 = _mm256_set1_ps(v);
+
+    __m256 y0, y1, y2, y3;
+
+    for (int i = 0; i < n32; i += 32) {
+        y0 = _mm256_loadu_ps(y + i + 0);
+        y1 = _mm256_loadu_ps(y + i + 8);
+        y2 = _mm256_loadu_ps(y + i + 16);
+        y3 = _mm256_loadu_ps(y + i + 24);
+
+        y0 = _mm256_mul_ps(y0, v4);
+        y1 = _mm256_mul_ps(y1, v4);
+        y2 = _mm256_mul_ps(y2, v4);
+        y3 = _mm256_mul_ps(y3, v4);
+
+        _mm256_storeu_ps(y + i + 0, y0);
+        _mm256_storeu_ps(y + i + 8, y1);
+        _mm256_storeu_ps(y + i + 16, y2);
+        _mm256_storeu_ps(y + i + 24, y3);
+    }
+
+    // leftovers
+    for (int i = n32; i < n; ++i) {
+        y[i] *= v;
+    }
+#else
+    // scalar
+    for (int i = 0; i < n; ++i) {
+        y[i] *= v;
+    }
+#endif
+}
 inline static void ggml_vec_norm_f32 (const int n, float * s, const float * x) { ggml_vec_dot_f32(n, s, x, x); *s = sqrt(*s);   }
 inline static void ggml_vec_sqr_f32  (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = x[i]*x[i];   }
 inline static void ggml_vec_sqrt_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = sqrt(x[i]); }
@@ -1621,7 +1658,7 @@ struct ggml_tensor * ggml_new_tensor_impl(
     size_needed += sizeof(struct ggml_tensor);
 
     if (cur_end + size_needed + GGML_OBJECT_SIZE > ctx->mem_size) {
-        GGML_PRINT("%s: not enough space in the context's memory pool\n", __func__);
+        GGML_PRINT("%s: not enough space in the context's memory pool (%zu/%zu needed)\n", __func__, cur_end + size_needed + GGML_OBJECT_SIZE, ctx->mem_size);
         assert(false);
         return NULL;
     }
@@ -7010,7 +7047,7 @@ typedef int ggml_lock_t;
 
 #define ggml_lock_init(x)    UNUSED(x)
 #define ggml_lock_destroy(x) UNUSED(x)
-#define ggml_lock_lock(x)    UNUSED(x)
+#define ggml_lock_lock(x)    Sleep(1)
 #define ggml_lock_unlock(x)  UNUSED(x)
 
 #define GGML_LOCK_INITIALIZER 0
@@ -7035,6 +7072,9 @@ struct ggml_compute_state {
     struct ggml_tensor * node;
 
     struct ggml_compute_state_shared * shared;
+
+    HANDLE wait_handle;
+    HANDLE fin_handle;
 };
 
 // function used by each compute thread
@@ -7052,6 +7092,17 @@ thread_ret_t ggml_graph_compute_thread(void * data) {
     const int n_threads = state->shared->n_threads;
 
     while (true) {
+        WaitForSingleObject(state->wait_handle, INFINITE);
+        if (state->node) {
+            ggml_compute_forward(&state->params, state->node);
+            state->node = NULL;
+            SetEvent(state->fin_handle);
+        } else {
+            SetEvent(state->fin_handle);
+            break;
+        }
+
+        /*
         if (atomic_fetch_add(&state->shared->n_ready, 1) == n_threads - 1) {
             atomic_store(&state->shared->has_work, false);
         } else {
@@ -7086,6 +7137,7 @@ thread_ret_t ggml_graph_compute_thread(void * data) {
         } else {
             break;
         }
+        */
     }
 
     return 0;
@@ -7106,6 +7158,7 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
         /*.stop      =*/ false,
     };
     struct ggml_compute_state * workers = n_threads > 1 ? alloca(sizeof(struct ggml_compute_state)*(n_threads - 1)) : NULL;
+    HANDLE worker_handles[16];
 
     // create thread pool
     if (n_threads > 1) {
@@ -7125,7 +7178,12 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
                 },
                 .node   = NULL,
                 .shared = &state_shared,
+                .wait_handle = CreateEvent(NULL, FALSE, FALSE, NULL),
+                .fin_handle = CreateEvent(NULL, FALSE, FALSE, NULL),
             };
+
+            worker_handles[j] = workers[j].fin_handle;
+
             int rc = pthread_create(&workers[j].thrd, NULL, ggml_graph_compute_thread, &workers[j]);
             assert(rc == 0);
             UNUSED(rc);
@@ -7345,14 +7403,14 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
 
         // COMPUTE
         if (node->n_tasks > 1) {
-            if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
+            /*if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
                 atomic_store(&state_shared.has_work, false);
             }
 
             while (atomic_load(&state_shared.has_work)) {
                 ggml_lock_lock  (&state_shared.spin);
                 ggml_lock_unlock(&state_shared.spin);
-            }
+            }*/
 
             // launch thread pool
             for (int j = 0; j < n_threads - 1; j++) {
@@ -7364,16 +7422,17 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
                     .wdata = cgraph->work ? cgraph->work->data : NULL,
                 };
                 workers[j].node = node;
+                SetEvent(workers[j].wait_handle);
             }
 
-            atomic_fetch_sub(&state_shared.n_ready, 1);
+            /*atomic_fetch_sub(&state_shared.n_ready, 1);
 
             while (atomic_load(&state_shared.n_ready) > 0) {
                 ggml_lock_lock  (&state_shared.spin);
                 ggml_lock_unlock(&state_shared.spin);
             }
 
-            atomic_store(&state_shared.has_work, true);
+            atomic_store(&state_shared.has_work, true);*/
         }
 
         params.type = GGML_TASK_COMPUTE;
@@ -7381,7 +7440,8 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
 
         // wait for thread pool
         if (node->n_tasks > 1) {
-            if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
+            WaitForMultipleObjects(n_threads - 1, worker_handles, TRUE, INFINITE);
+            /*if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
                 atomic_store(&state_shared.has_work, false);
             }
 
@@ -7395,19 +7455,19 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
             while (atomic_load(&state_shared.n_ready) != 0) {
                 ggml_lock_lock  (&state_shared.spin);
                 ggml_lock_unlock(&state_shared.spin);
-            }
+            }*/
         }
 
         // FINALIZE
         if (node->n_tasks > 1) {
-            if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
+            /*if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
                 atomic_store(&state_shared.has_work, false);
             }
 
             while (atomic_load(&state_shared.has_work)) {
                 ggml_lock_lock  (&state_shared.spin);
                 ggml_lock_unlock(&state_shared.spin);
-            }
+            }*/
 
             // launch thread pool
             for (int j = 0; j < n_threads - 1; j++) {
@@ -7419,16 +7479,17 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
                     .wdata = cgraph->work ? cgraph->work->data : NULL,
                 };
                 workers[j].node = node;
+                SetEvent(workers[j].wait_handle);
             }
 
-            atomic_fetch_sub(&state_shared.n_ready, 1);
+            /*atomic_fetch_sub(&state_shared.n_ready, 1);
 
             while (atomic_load(&state_shared.n_ready) > 0) {
                 ggml_lock_lock  (&state_shared.spin);
                 ggml_lock_unlock(&state_shared.spin);
             }
 
-            atomic_store(&state_shared.has_work, true);
+            atomic_store(&state_shared.has_work, true);*/
         }
 
         params.type = GGML_TASK_FINALIZE;
@@ -7436,7 +7497,8 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
 
         // wait for thread pool
         if (node->n_tasks > 1) {
-            if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
+            WaitForMultipleObjects(n_threads - 1, worker_handles, TRUE, INFINITE);
+            /*if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
                 atomic_store(&state_shared.has_work, false);
             }
 
@@ -7450,7 +7512,7 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
             while (atomic_load(&state_shared.n_ready) != 0) {
                 ggml_lock_lock  (&state_shared.spin);
                 ggml_lock_unlock(&state_shared.spin);
-            }
+            }*/
         }
 
         // performance stats (node)
@@ -7470,6 +7532,7 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
         atomic_store(&state_shared.has_work, true);
 
         for (int j = 0; j < n_threads - 1; j++) {
+            SetEvent(workers[j].wait_handle);
             int rc = pthread_join(workers[j].thrd, NULL);
             assert(rc == 0);
             UNUSED(rc);

fitzsim (Contributor) commented Jan 8, 2023

@ggerganov
Yes, 87dd4a3 is about half-a-second faster on the jfk example, I guess due to the FP16 lookup table.

luke-jr (Author) commented Jan 8, 2023

@fitzsim I won't be in any position to do anything any time soon, unfortunately. (link)

jaybinks (Contributor) commented Jan 8, 2023 via email
