
Very slow - any way to speed up? #300

Closed
luke-jr opened this issue Dec 21, 2022 · 22 comments
Labels: build (Build related issues) · help wanted (Extra attention is needed) · performance (CPU and memory usage - results and comparisons)

Comments

luke-jr commented Dec 21, 2022

Per #10,

Noting that the processing time is considerably shorter than the length of speech,

Yet even using 64 threads, it's taking days to process minutes of audio on my POWER9.

Has something changed since #10, or is there something I am doing wrong?

RndyP commented Dec 22, 2022

What is your platform?

ggerganov added the performance, build, and help wanted labels on Dec 22, 2022
ggerganov (Owner) commented:
@luke-jr
I'm not familiar with POWER9, but from a quick ChatGPT search, it seems this CPU has a RISC architecture:

[screenshot of ChatGPT output omitted]

Currently, whisper.cpp supports only x86 and ARM architectures. By "support", I mean that it uses the available SIMD instruction set to make the computation efficient. On other architectures, it falls back to non-SIMD computation, which is many times slower.

Adding support for Power ISA (or whatever the instruction set is called) should not be very difficult. The matrix multiplication routines in ggml.c need to be extended to support the respective instruction set and the corresponding compile flags added to the Makefile.

I don't have experience with this architecture, so hopefully someone contributes.
It will be very interesting to see what the performance is on these CPUs.
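
To give a rough idea, here is an untested sketch (illustrative only, not code from this repo) of what a Power ISA (VSX) variant of a ggml-style f32 dot product could look like. The vec_* intrinsics come from <altivec.h>; the function name and the -mvsx / -mcpu=power9 compile flags are assumptions, not something already in the Makefile.

#include <altivec.h>

// hypothetical VSX variant of a ggml-style f32 dot product (sketch only)
static void vec_dot_f32_vsx(const int n, float * s, const float * x, const float * y) {
    vector float sum = vec_splats(0.0f);

    const int n4 = n & ~3;                        // 4 floats per VSX register
    for (int i = 0; i < n4; i += 4) {
        const vector float vx = vec_xl(0, x + i); // unaligned vector loads
        const vector float vy = vec_xl(0, y + i);
        sum = vec_madd(vx, vy, sum);              // fused multiply-add
    }

    float total = vec_extract(sum, 0) + vec_extract(sum, 1)
                + vec_extract(sum, 2) + vec_extract(sum, 3);

    for (int i = n4; i < n; ++i) {                // scalar leftovers
        total += x[i]*y[i];
    }

    *s = total;
}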

luke-jr (Author) commented Dec 22, 2022

Yeah, I'm not surprised it isn't optimised for PPC64, but I wouldn't expect it to be orders of magnitude slower either. Real-time to days is a huge difference. :/

ggerganov (Owner) commented Dec 22, 2022

For example, on my Ryzen 9 5950X, if I remove the -mavx -mavx2 -mfma -mf16c flags I observe about 50x slower computation in the bench tool. Removing those flags is similar to what you have on PPC64 - no SIMD, no F16C support.

SIMD can make a huge difference

fitzsim (Contributor) commented Dec 23, 2022

ChatGPT is out-of-date regarding the Power ISA being proprietary. It is open source now, just like RISC-V. See https://openpowerfoundation.org/.

luke-jr (Author) commented Dec 23, 2022

After #320, ./main -m models/ggml-base.en.bin -f samples/jfk.wav takes 15.7 seconds.

Additional options              Time
-p 64                           77 s
-t 64                           28 s
-t 1                            59.6 s
-t 16                           5.8 s
-t 32                           5.1 s
-m models/ggml-large.bin -t 32  111.1 s

ChatGPT is out-of-date regarding the Power ISA being proprietary.

In my experience, ChatGPT tends to be wrong quite often.

ggerganov (Owner) commented:
@fitzsim @luke-jr
I am planning to merge a refactored version of the SIMD routines in ggml which I think will make things easier to maintain in the future. The PR is pretty much ready in #324

All instruction sets fit quite nicely in the proposed pattern, but I'm having a little trouble with the ppc64le stuff since I don't have a way to test it. So for the moment, I've special-cased it, which is not great.

If you are interested and have some free time, you can take a look at the implementation and see if you can fit it in the new pattern. Or at the very least - run a test and see that it still works after the changes.

Regarding the new performance: 5 s for jfk.wav using the base model still seems like a lot. I'm not sure why the performance is so bad.

fitzsim (Contributor) commented Dec 23, 2022

@ggerganov, sure, I'll try to fit the POWER9 optimizations into the main SIMD structure, some time after #324 lands in the master branch.

Agreed regarding 5s likely not being optimal. @luke-jr, can you add the whisper_print_timings lines to your table? They may contain hints about further optimization efforts.

luke-jr (Author) commented Dec 23, 2022

$ time ./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 32
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: adding 1607 extra tokens
whisper_model_load: mem_required  =  506.00 MB
whisper_model_load: ggml ctx size =  140.60 MB
whisper_model_load: memory size   =   22.83 MB
whisper_model_load: model size    =  140.54 MB

system_info: n_threads = 32 / 64 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 32 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   110.77 ms
whisper_print_timings:      mel time =    49.83 ms
whisper_print_timings:   sample time =     8.41 ms
whisper_print_timings:   encode time =  3631.16 ms / 605.19 ms per layer
whisper_print_timings:   decode time =  1374.97 ms / 229.16 ms per layer
whisper_print_timings:    total time =  5175.76 ms

real    0m5.187s
user    2m31.675s
sys     0m1.078s

fitzsim (Contributor) commented Dec 31, 2022

The remaining slowness seems to be in the fp16 (16-bit short) to fp32 conversion. Would it make sense to try a GGML_TYPE_F32 version of ggml-base.en.bin, to eliminate the conversion steps? Can someone outline the steps for trying that?

ggerganov (Owner) commented:
The steps are like this:

# we need this for the f32 conversion
git clone https://github.com/openai/whisper

# create f32 ggml model (assumes you have ~/.cache/whisper/base.en.pt downloaded from original repo)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
python3 models/convert-pt-to-ggml.py ~/.cache/whisper/base.en.pt ../whisper . 1

# use the new f32 model
make -j
./main -m ./ggml-model-f32.bin samples/jfk.wav

You need the following patch/hack in whisper.cpp to increase the memory buffers:

diff --git a/whisper.cpp b/whisper.cpp
index 84c2490..8709723 100644
--- a/whisper.cpp
+++ b/whisper.cpp
@@ -131,7 +131,7 @@ static const std::map<std::string, std::pair<int, std::string>> g_lang = {
     { "su",  { 98,  "sundanese",      } },
 };
 
-static const size_t MB = 1024*1024;
+static const size_t MB = 3*1024*1024;
 
 static const std::map<e_model, size_t> MEM_REQ_MODEL = {
     { MODEL_TINY,     74ull*MB },

RndyP commented Jan 2, 2023

I used the Visual Studio performance profiler to see where all the CPU time is spent. Half the time is in the SIMD code here:
[profiler screenshot omitted]

I reviewed the code for any obvious speed-up opportunities. Nothing major, except that I believe ax[] and ay[] are not necessary. You can write the summation like so:
sum[j] = GGML_F16_VEC_FMA(sum[j], GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR), GGML_F16_VEC_LOAD(y + i + j*GGML_F16_EPR));
This didn't help the times though; I think the optimizing compiler figures this out on its own.
The other thing that stands out is this:
[profiler screenshot omitted]
Not sure if this is an opportunity for improvement. Instead of looping with a while, might it be better to wait on an event?

fitzsim (Contributor) commented Jan 3, 2023

Thanks for the model instructions @ggerganov.

With the FP32 model and #366 I get:

$ time ./main -t 32 -m ../fp32-model/ggml-model-f32.bin samples/jfk.wav
whisper_model_load: loading model from '../fp32-model/ggml-model-f32.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 0
whisper_model_load: type          = 2
whisper_model_load: adding 1607 extra tokens
whisper_model_load: mem_required  = 1518.00 MB
whisper_model_load: ggml ctx size =  276.98 MB
whisper_model_load: memory size   =   22.83 MB
whisper_model_load: model size    =  276.92 MB

system_info: n_threads = 32 / 64 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 32 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   236.47 ms
whisper_print_timings:      mel time =    42.08 ms
whisper_print_timings:   sample time =     4.75 ms
whisper_print_timings:   encode time =  1945.92 ms / 324.32 ms per layer
whisper_print_timings:   decode time =   933.50 ms / 155.58 ms per layer
whisper_print_timings:    total time =  3163.23 ms

real	0m3.182s
user	1m17.748s
sys	0m0.607s

ggerganov (Owner) commented:
@fitzsim
Great work! Will take a look at the PRs in the following days and merge after I make sure the other platforms work correctly.

prsyahmi (Contributor) commented Jan 4, 2023

Hi, I tend to agree with @RndyP.

I profiled it a few weeks ago and found that you are using spin locks. I changed them to events and used WaitForMultipleObjects (I'm on Windows). CPU usage did tame down, but I didn't bother to benchmark it at the time.

These are the bench results for commit afe2db0.
The event-based version seems to perform better on my PC.

CPU: Intel(R) Core(TM) i7-8750H @ 2.20GHz

whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: adding 1607 extra tokens
whisper_model_load: mem_required  =  506.00 MB
whisper_model_load: ggml ctx size =  140.60 MB
whisper_model_load: memory size   =   22.83 MB
whisper_model_load: model size    =  140.54 MB

system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

Spinlock: All cores 100% CPU usage

whisper_print_timings:     load time =   312.28 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2975.37 ms / 495.89 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3288.11 ms

whisper_print_timings:     load time =   285.14 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2932.89 ms / 488.81 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3218.43 ms

whisper_print_timings:     load time =   267.65 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2930.10 ms / 488.35 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3198.02 ms

whisper_print_timings:     load time =   270.98 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2821.18 ms / 470.20 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3092.38 ms

Event: CPU usage tamed

whisper_print_timings:     load time =   308.21 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2791.27 ms / 465.21 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3099.88 ms

whisper_print_timings:     load time =   268.62 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2687.68 ms / 447.95 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  2956.58 ms

whisper_print_timings:     load time =   267.01 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2727.19 ms / 454.53 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  2994.49 ms

whisper_print_timings:     load time =   294.01 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2803.29 ms / 467.22 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3097.75 ms

whisper_print_timings:     load time =   293.43 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2876.54 ms / 479.42 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3170.70 ms

The more recent commit, f00509d, seems slower on my PC without any change to the code.
Spinlock:

whisper_print_timings:     load time =   268.64 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  3209.34 ms / 534.89 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3478.29 ms

whisper_print_timings:     load time =   270.09 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  3391.52 ms / 565.25 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3661.90 ms

whisper_print_timings:     load time =   310.33 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  3375.87 ms / 562.64 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3686.47 ms

ggerganov (Owner) commented Jan 5, 2023

Can you demonstrate the Event-based Windows implementation?
I tried waiting on condition_variable instead of spin locks, but it wasn't more efficient. Maybe I missed something.
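
For reference, a minimal sketch (names invented for illustration, not code from whisper.cpp) of the condition-variable approach being discussed: workers sleep until the main thread publishes work, instead of spinning on an atomic flag. The per-node wake-up latency of this scheme may be why it wasn't faster on short tasks.

#include <pthread.h>
#include <stdbool.h>

struct work_queue {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            has_work;
    bool            stop;
};

// worker side: block until there is work or a stop request
static void worker_wait(struct work_queue * q) {
    pthread_mutex_lock(&q->mutex);
    while (!q->has_work && !q->stop) {
        pthread_cond_wait(&q->cond, &q->mutex);
    }
    pthread_mutex_unlock(&q->mutex);
}

// main-thread side: publish work and wake all workers
static void post_work(struct work_queue * q) {
    pthread_mutex_lock(&q->mutex);
    q->has_work = true;
    pthread_cond_broadcast(&q->cond);
    pthread_mutex_unlock(&q->mutex);
}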

fitzsim (Contributor) commented Jan 5, 2023

@luke-jr
Now that #369 is merged, can you try bench with various arguments and post an updated table of results to #89? Then #300 can probably be closed.
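
For example (assuming the Makefile's bench target and the usual -m/-t flags):

make bench
./bench -m models/ggml-base.en.bin -t 16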

ggerganov (Owner) commented:
@fitzsim
We just merged an FP16 lookup table (#368) that is used when F16C intrinsics are not available.
I believe this will lead to a significant improvement on POWER9 platforms using the F16 models.
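
Roughly, the idea behind the lookup table is the following (a simplified sketch with illustrative names, not the actual ggml code): precompute all 65536 possible fp16-to-fp32 conversions once at startup, so the hot path becomes a single table load per element instead of per-element bit manipulation.

#include <math.h>
#include <stdint.h>

static float fp16_to_fp32_table[1 << 16];

// plain scalar fp16 -> fp32 conversion, used only to fill the table once
static float fp16_to_fp32_scalar(uint16_t h) {
    const uint32_t sign = (h >> 15) & 1;
    const uint32_t exp  = (h >> 10) & 0x1f;
    const uint32_t mant =  h        & 0x3ff;
    float v;
    if (exp == 0) {
        v = ldexpf((float) mant, -24);                       // zero / subnormal
    } else if (exp == 31) {
        v = mant ? NAN : INFINITY;                           // nan / inf
    } else {
        v = ldexpf((float)(mant | 0x400), (int) exp - 25);   // normal
    }
    return sign ? -v : v;
}

static void init_fp16_table(void) {
    for (uint32_t i = 0; i < (1 << 16); ++i) {
        fp16_to_fp32_table[i] = fp16_to_fp32_scalar((uint16_t) i);
    }
}

// hot path: one array load per element
static inline float fp16_to_fp32(uint16_t h) {
    return fp16_to_fp32_table[h];
}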

prsyahmi (Contributor) commented Jan 7, 2023

@ggerganov I'm using the WinAPI directly. My intention was to reduce CPU usage; maybe I'll try again with condition_variable and see if it makes any difference.

index c5780ed..7ad5be6 100644
--- "a/ggml.c"
+++ "b/ggml.c"
@@ -1118,7 +1118,44 @@ inline static void ggml_vec_mad_f16(const int n, ggml_fp16_t * restrict y, ggml_
 #endif
 }
 
-inline static void ggml_vec_scale_f32(const int n, float * y, const float   v) { for (int i = 0; i < n; ++i) y[i] *= v;          }
+//inline static void ggml_vec_scale_f32(const int n, float * y, const float   v) { for (int i = 0; i < n; ++i) y[i] *= v;          }
+inline static void ggml_vec_scale_f32(const int n, float * y, const float   v) {
+#if defined(__AVX__) || defined(__AVX2__)
+    // AVX 256-bit
+    const int n32 = (n & ~31);
+
+    const __m256 v4 = _mm256_set1_ps(v);
+
+    __m256 y0, y1, y2, y3;
+
+    for (int i = 0; i < n32; i += 32) {
+        y0 = _mm256_loadu_ps(y + i + 0);
+        y1 = _mm256_loadu_ps(y + i + 8);
+        y2 = _mm256_loadu_ps(y + i + 16);
+        y3 = _mm256_loadu_ps(y + i + 24);
+
+        y0 = _mm256_mul_ps(y0, v4);
+        y1 = _mm256_mul_ps(y1, v4);
+        y2 = _mm256_mul_ps(y2, v4);
+        y3 = _mm256_mul_ps(y3, v4);
+
+        _mm256_storeu_ps(y + i + 0, y0);
+        _mm256_storeu_ps(y + i + 8, y1);
+        _mm256_storeu_ps(y + i + 16, y2);
+        _mm256_storeu_ps(y + i + 24, y3);
+    }
+
+    // leftovers
+    for (int i = n32; i < n; ++i) {
+        y[i] *= v;
+    }
+#else
+    // scalar
+    for (int i = 0; i < n; ++i) {
+        y[i] *= v;
+    }
+#endif
+}
 inline static void ggml_vec_norm_f32 (const int n, float * s, const float * x) { ggml_vec_dot_f32(n, s, x, x); *s = sqrt(*s);   }
 inline static void ggml_vec_sqr_f32  (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = x[i]*x[i];   }
 inline static void ggml_vec_sqrt_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = sqrt(x[i]); }
@@ -1621,7 +1658,7 @@ struct ggml_tensor * ggml_new_tensor_impl(
     size_needed += sizeof(struct ggml_tensor);
 
     if (cur_end + size_needed + GGML_OBJECT_SIZE > ctx->mem_size) {
-        GGML_PRINT("%s: not enough space in the context's memory pool\n", __func__);
+        GGML_PRINT("%s: not enough space in the context's memory pool (%zu/%zu needed)\n", __func__, cur_end + size_needed + GGML_OBJECT_SIZE, ctx->mem_size);
         assert(false);
         return NULL;
     }
@@ -7010,7 +7047,7 @@ typedef int ggml_lock_t;
 
 #define ggml_lock_init(x)    UNUSED(x)
 #define ggml_lock_destroy(x) UNUSED(x)
-#define ggml_lock_lock(x)    UNUSED(x)
+#define ggml_lock_lock(x)    Sleep(1)
 #define ggml_lock_unlock(x)  UNUSED(x)
 
 #define GGML_LOCK_INITIALIZER 0
@@ -7035,6 +7072,9 @@ struct ggml_compute_state {
     struct ggml_tensor * node;
 
     struct ggml_compute_state_shared * shared;
+
+    HANDLE wait_handle;
+    HANDLE fin_handle;
 };
 
 // function used by each compute thread
@@ -7052,6 +7092,17 @@ thread_ret_t ggml_graph_compute_thread(void * data) {
     const int n_threads = state->shared->n_threads;
 
     while (true) {
+        WaitForSingleObject(state->wait_handle, INFINITE);
+        if (state->node) {
+            ggml_compute_forward(&state->params, state->node);
+            state->node = NULL;
+            SetEvent(state->fin_handle);
+        } else {
+            SetEvent(state->fin_handle);
+            break;
+        }
+
+        /*
         if (atomic_fetch_add(&state->shared->n_ready, 1) == n_threads - 1) {
             atomic_store(&state->shared->has_work, false);
         } else {
@@ -7086,6 +7137,7 @@ thread_ret_t ggml_graph_compute_thread(void * data) {
         } else {
             break;
         }
+        */
     }
 
     return 0;
@@ -7106,6 +7158,7 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
         /*.stop      =*/ false,
     };
     struct ggml_compute_state * workers = n_threads > 1 ? alloca(sizeof(struct ggml_compute_state)*(n_threads - 1)) : NULL;
+    HANDLE worker_handles[16];
 
     // create thread pool
     if (n_threads > 1) {
@@ -7125,7 +7178,12 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
                 },
                 .node   = NULL,
                 .shared = &state_shared,
+                .wait_handle = CreateEvent(NULL, FALSE, FALSE, NULL),
+                .fin_handle = CreateEvent(NULL, FALSE, FALSE, NULL),
             };
+
+            worker_handles[j] = workers[j].fin_handle;
+
             int rc = pthread_create(&workers[j].thrd, NULL, ggml_graph_compute_thread, &workers[j]);
             assert(rc == 0);
             UNUSED(rc);
@@ -7345,14 +7403,14 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
 
         // COMPUTE
         if (node->n_tasks > 1) {
-            if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
+            /*if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
                 atomic_store(&state_shared.has_work, false);
             }
 
             while (atomic_load(&state_shared.has_work)) {
                 ggml_lock_lock  (&state_shared.spin);
                 ggml_lock_unlock(&state_shared.spin);
-            }
+            }*/
 
             // launch thread pool
             for (int j = 0; j < n_threads - 1; j++) {
@@ -7364,16 +7422,17 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
                     .wdata = cgraph->work ? cgraph->work->data : NULL,
                 };
                 workers[j].node = node;
+                SetEvent(workers[j].wait_handle);
             }
 
-            atomic_fetch_sub(&state_shared.n_ready, 1);
+            /*atomic_fetch_sub(&state_shared.n_ready, 1);
 
             while (atomic_load(&state_shared.n_ready) > 0) {
                 ggml_lock_lock  (&state_shared.spin);
                 ggml_lock_unlock(&state_shared.spin);
             }
 
-            atomic_store(&state_shared.has_work, true);
+            atomic_store(&state_shared.has_work, true);*/
         }
 
         params.type = GGML_TASK_COMPUTE;
@@ -7381,7 +7440,8 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
 
         // wait for thread pool
         if (node->n_tasks > 1) {
-            if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
+            WaitForMultipleObjects(n_threads - 1, worker_handles, TRUE, INFINITE);
+            /*if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
                 atomic_store(&state_shared.has_work, false);
             }
 
@@ -7395,19 +7455,19 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
             while (atomic_load(&state_shared.n_ready) != 0) {
                 ggml_lock_lock  (&state_shared.spin);
                 ggml_lock_unlock(&state_shared.spin);
-            }
+            }*/
         }
 
         // FINALIZE
         if (node->n_tasks > 1) {
-            if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
+            /*if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
                 atomic_store(&state_shared.has_work, false);
             }
 
             while (atomic_load(&state_shared.has_work)) {
                 ggml_lock_lock  (&state_shared.spin);
                 ggml_lock_unlock(&state_shared.spin);
-            }
+            }*/
 
             // launch thread pool
             for (int j = 0; j < n_threads - 1; j++) {
@@ -7419,16 +7479,17 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
                     .wdata = cgraph->work ? cgraph->work->data : NULL,
                 };
                 workers[j].node = node;
+                SetEvent(workers[j].wait_handle);
             }
 
-            atomic_fetch_sub(&state_shared.n_ready, 1);
+            /*atomic_fetch_sub(&state_shared.n_ready, 1);
 
             while (atomic_load(&state_shared.n_ready) > 0) {
                 ggml_lock_lock  (&state_shared.spin);
                 ggml_lock_unlock(&state_shared.spin);
             }
 
-            atomic_store(&state_shared.has_work, true);
+            atomic_store(&state_shared.has_work, true);*/
         }
 
         params.type = GGML_TASK_FINALIZE;
@@ -7436,7 +7497,8 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
 
         // wait for thread pool
         if (node->n_tasks > 1) {
-            if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
+            WaitForMultipleObjects(n_threads - 1, worker_handles, TRUE, INFINITE);
+            /*if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
                 atomic_store(&state_shared.has_work, false);
             }
 
@@ -7450,7 +7512,7 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
             while (atomic_load(&state_shared.n_ready) != 0) {
                 ggml_lock_lock  (&state_shared.spin);
                 ggml_lock_unlock(&state_shared.spin);
-            }
+            }*/
         }
 
         // performance stats (node)
@@ -7470,6 +7532,7 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
         atomic_store(&state_shared.has_work, true);
 
         for (int j = 0; j < n_threads - 1; j++) {
+            SetEvent(workers[j].wait_handle);
             int rc = pthread_join(workers[j].thrd, NULL);
             assert(rc == 0);
             UNUSED(rc);

fitzsim (Contributor) commented Jan 8, 2023

@ggerganov
Yes, 87dd4a3 is about half-a-second faster on the jfk example, I guess due to the FP16 lookup table.

luke-jr (Author) commented Jan 8, 2023

@fitzsim I won't be in any position to do anything any time soon, unfortunately. (link)

jaybinks (Contributor) commented Jan 8, 2023 via email
