Very slow - any way to speed up? #300

Per #10, whisper.cpp should work on POWER9. Yet even using 64 threads, it's taking days to process minutes of audio on my POWER9. Has something changed since #10, or is there something I am doing wrong?
What is your platform?
@luke-jr Currently, `ggml.c` is not optimized for that architecture. Adding support for Power ISA (or whatever the instruction set is called) should not be very difficult. The matrix multiplication routines in `ggml.c` would need SIMD implementations for it. I don't have experience with this architecture, so hopefully someone contributes.
Yeah, I'm not surprised it isn't optimised for PPC64, but I wouldn't expect it to be orders of magnitude slower either. Real-time to days is a huge difference. :/
For example, on my machine SIMD can make a huge difference.
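To make that concrete, here is a rough, untested sketch of what a Power ISA (VSX) version of a ggml-style vector scale could look like, using standard `altivec.h` intrinsics. This is an illustration of the idea, not code from whisper.cpp:

```c
#include <altivec.h>

// Rough sketch: VSX processes 4 floats per 128-bit register, so the main
// loop runs a quarter as many iterations as the scalar version.
static void vec_scale_f32_vsx(const int n, float * y, const float v) {
    const int n4 = n & ~3;                    // round down to a multiple of 4
    const vector float vv = vec_splats(v);    // broadcast the scale factor

    for (int i = 0; i < n4; i += 4) {
        vector float y0 = vec_xl(0, y + i);   // unaligned 4-float load
        y0 = vec_mul(y0, vv);
        vec_xst(y0, 0, y + i);                // store back
    }

    // scalar leftovers
    for (int i = n4; i < n; ++i) {
        y[i] *= v;
    }
}
```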
ChatGPT is out-of-date regarding the Power ISA being proprietary. It is open source now, just like RISC-V. See https://openpowerfoundation.org/. |
After #320,
In my experience, ChatGPT tends to be wrong quite often.
@fitzsim @luke-jr All instruction sets fit quite nicely in the proposed pattern, but I'm having a little trouble with the POWER9 code. If you are interested and have some free time, you can take a look at the implementation and see if you can fit it in the new pattern. Or at the very least, run a test and see that it still works after the changes. Regarding the new performance: 5s for the sample is likely still not optimal.
@ggerganov, sure, I'll try to fit the POWER9 optimizations into the main SIMD structure some time after #324 lands in the master branch. Agreed regarding 5s likely not being optimal. @luke-jr, can you add the `whisper_print_timings` lines to your table? They may contain hints about further optimization efforts.
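For readers following along, the "proposed pattern" refers to abstracting per-architecture intrinsics behind a common set of macros, so that each vector routine is written once against the macros. A hypothetical sketch of that idea (the macro names here are invented for illustration, not the actual ggml ones):

```c
#if defined(__AVX__)
#include <immintrin.h>
// AVX: 8 floats per 256-bit register
#define VEC_T            __m256
#define VEC_N            8
#define VEC_LOAD(p)      _mm256_loadu_ps(p)
#define VEC_STORE(p, x)  _mm256_storeu_ps(p, x)
#define VEC_MUL(a, b)    _mm256_mul_ps(a, b)
#define VEC_SET1(v)      _mm256_set1_ps(v)
#else
// scalar fallback: a "vector" of one float
#define VEC_T            float
#define VEC_N            1
#define VEC_LOAD(p)      (*(p))
#define VEC_STORE(p, x)  (*(p) = (x))
#define VEC_MUL(a, b)    ((a)*(b))
#define VEC_SET1(v)      (v)
#endif

// generic routine written once against the macros;
// a new architecture only has to supply its own macro block above
static void vec_scale(const int n, float * y, const float v) {
    const int nv = n - n % VEC_N;
    const VEC_T vv = VEC_SET1(v);
    for (int i = 0; i < nv; i += VEC_N) {
        VEC_STORE(y + i, VEC_MUL(VEC_LOAD(y + i), vv));
    }
    for (int i = nv; i < n; ++i) {
        y[i] *= v;  // leftovers
    }
}
```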
The remaining slowness seems to be in the short-to-fp32 conversion. Would it make sense to try a `GGML_TYPE_F32` version of `ggml-base.en.bin`, to eliminate the conversion steps? Can someone outline steps for trying that?
The steps are like this:

```bash
# we need this for the f32 conversion
git clone https://github.com/openai/whisper

# create f32 ggml model (assumes you have ~/.cache/whisper/base.en.pt downloaded from original repo)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
python3 models/convert-pt-to-ggml.py ~/.cache/whisper/base.en.pt ../whisper . 1

# use the new f32 model
make -j
./main -m ./ggml-model-f32.bin samples/jfk.wav
```

You also need the following patch/hack in `whisper.cpp` to make the preallocated memory buffers big enough for the F32 model:

```diff
diff --git a/whisper.cpp b/whisper.cpp
index 84c2490..8709723 100644
--- a/whisper.cpp
+++ b/whisper.cpp
@@ -131,7 +131,7 @@ static const std::map<std::string, std::pair<int, std::string>> g_lang = {
     { "su", { 98, "sundanese", } },
 };
 
-static const size_t MB = 1024*1024;
+static const size_t MB = 3*1024*1024;
 
 static const std::map<e_model, size_t> MEM_REQ_MODEL = {
     { MODEL_TINY, 74ull*MB },
```
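For context on what the F32 model buys you: with an F16 model, every weight has to be widened from a 16-bit to a 32-bit float before the vector routines can consume it. A minimal sketch of that per-element conversion cost (a stand-in for ggml's own helper, not the exact whisper.cpp code):

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

// stand-in for ggml's fp16 -> fp32 helper (handles normals, subnormals, inf/NaN)
static float half_to_float(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t mant = h & 0x3FF;
    uint32_t bits;

    if (exp == 0x1F) {
        bits = sign | 0x7F800000u | (mant << 13);              // inf / NaN
    } else if (exp != 0) {
        bits = sign | ((exp - 15 + 127) << 23) | (mant << 13); // normal
    } else if (mant != 0) {
        exp = 127 - 15 + 1;                                    // subnormal: renormalize
        while ((mant & 0x400) == 0) { mant <<= 1; exp--; }
        bits = sign | (exp << 23) | ((mant & 0x3FF) << 13);
    } else {
        bits = sign;                                           // +/- zero
    }

    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

// the widening pass an F16 model pays before each matrix multiplication;
// an F32 model stores weights as float and skips this loop entirely
static void widen_row(const uint16_t * w16, float * w32, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        w32[i] = half_to_float(w16[i]);
    }
}
```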
Thanks for the model instructions @ggerganov. With the FP32 model and #366 I get:
@fitzsim
Hi, I'm kind of agreeing with @RndyP. I profiled it a few weeks ago and found out that you are using spin locks. I changed them to events, using WaitForMultipleObjects (I'm on Windows). CPU usage did tame down, but I didn't bother to bench it at that time. These are the bench results for commit afe2db0:

- Spinlock: all cores 100% CPU usage
- Event: CPU usage tamed

The more recent one seems slower on my PC, without any change to the code: f00509d
Can you demonstrate the Event-based Windows implementation?
@ggerganov I'm using WinAPI directly. My intention was to reduce CPU usage; maybe I'll try again with condition_var and see if it makes any difference.

```diff
index c5780ed..7ad5be6 100644
--- "a/ggml.c"
+++ "b/ggml.c"
@@ -1118,7 +1118,44 @@ inline static void ggml_vec_mad_f16(const int n, ggml_fp16_t * restrict y, ggml_
 #endif
 }
 
-inline static void ggml_vec_scale_f32(const int n, float * y, const float v) { for (int i = 0; i < n; ++i) y[i] *= v; }
+//inline static void ggml_vec_scale_f32(const int n, float * y, const float v) { for (int i = 0; i < n; ++i) y[i] *= v; }
+inline static void ggml_vec_scale_f32(const int n, float * y, const float v) {
+#if defined(__AVX__) || defined(__AVX2__)
+    // AVX 256-bit
+    const int n32 = (n & ~31);
+
+    const __m256 v4 = _mm256_set1_ps(v);
+
+    __m256 y0, y1, y2, y3;
+
+    for (int i = 0; i < n32; i += 32) {
+        y0 = _mm256_loadu_ps(y + i + 0);
+        y1 = _mm256_loadu_ps(y + i + 8);
+        y2 = _mm256_loadu_ps(y + i + 16);
+        y3 = _mm256_loadu_ps(y + i + 24);
+
+        y0 = _mm256_mul_ps(y0, v4);
+        y1 = _mm256_mul_ps(y1, v4);
+        y2 = _mm256_mul_ps(y2, v4);
+        y3 = _mm256_mul_ps(y3, v4);
+
+        _mm256_storeu_ps(y + i + 0,  y0);
+        _mm256_storeu_ps(y + i + 8,  y1);
+        _mm256_storeu_ps(y + i + 16, y2);
+        _mm256_storeu_ps(y + i + 24, y3);
+    }
+
+    // leftovers
+    for (int i = n32; i < n; ++i) {
+        y[i] *= v;
+    }
+#else
+    // scalar
+    for (int i = 0; i < n; ++i) {
+        y[i] *= v;
+    }
+#endif
+}
 inline static void ggml_vec_norm_f32 (const int n, float * s, const float * x) { ggml_vec_dot_f32(n, s, x, x); *s = sqrt(*s); }
 inline static void ggml_vec_sqr_f32  (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = x[i]*x[i]; }
 inline static void ggml_vec_sqrt_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = sqrt(x[i]); }
@@ -1621,7 +1658,7 @@ struct ggml_tensor * ggml_new_tensor_impl(
         size_needed += sizeof(struct ggml_tensor);
 
         if (cur_end + size_needed + GGML_OBJECT_SIZE > ctx->mem_size) {
-            GGML_PRINT("%s: not enough space in the context's memory pool\n", __func__);
+            GGML_PRINT("%s: not enough space in the context's memory pool (%zu/%zu needed)\n", __func__, cur_end + size_needed + GGML_OBJECT_SIZE, ctx->mem_size);
             assert(false);
             return NULL;
         }
@@ -7010,7 +7047,7 @@ typedef int ggml_lock_t;
 
 #define ggml_lock_init(x)    UNUSED(x)
 #define ggml_lock_destroy(x) UNUSED(x)
-#define ggml_lock_lock(x)    UNUSED(x)
+#define ggml_lock_lock(x)    Sleep(1)
 #define ggml_lock_unlock(x)  UNUSED(x)
 
 #define GGML_LOCK_INITIALIZER 0
@@ -7035,6 +7072,9 @@ struct ggml_compute_state {
 
     struct ggml_tensor * node;
 
     struct ggml_compute_state_shared * shared;
+
+    HANDLE wait_handle;
+    HANDLE fin_handle;
 };
 
 // function used by each compute thread
@@ -7052,6 +7092,17 @@ thread_ret_t ggml_graph_compute_thread(void * data) {
     const int n_threads = state->shared->n_threads;
 
     while (true) {
+        WaitForSingleObject(state->wait_handle, INFINITE);
+        if (state->node) {
+            ggml_compute_forward(&state->params, state->node);
+            state->node = NULL;
+            SetEvent(state->fin_handle);
+        } else {
+            SetEvent(state->fin_handle);
+            break;
+        }
+
+        /*
         if (atomic_fetch_add(&state->shared->n_ready, 1) == n_threads - 1) {
             atomic_store(&state->shared->has_work, false);
         } else {
@@ -7086,6 +7137,7 @@ thread_ret_t ggml_graph_compute_thread(void * data) {
         } else {
             break;
         }
+        */
     }
 
     return 0;
@@ -7106,6 +7158,7 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
         /*.stop =*/ false,
     };
     struct ggml_compute_state * workers = n_threads > 1 ? alloca(sizeof(struct ggml_compute_state)*(n_threads - 1)) : NULL;
+    HANDLE worker_handles[16];
 
     // create thread pool
     if (n_threads > 1) {
@@ -7125,7 +7178,12 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
                 },
                 .node = NULL,
                 .shared = &state_shared,
+                .wait_handle = CreateEvent(NULL, FALSE, FALSE, NULL),
+                .fin_handle = CreateEvent(NULL, FALSE, FALSE, NULL),
             };
+
+            worker_handles[j] = workers[j].fin_handle;
+
             int rc = pthread_create(&workers[j].thrd, NULL, ggml_graph_compute_thread, &workers[j]);
             assert(rc == 0);
             UNUSED(rc);
@@ -7345,14 +7403,14 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
 
         // COMPUTE
         if (node->n_tasks > 1) {
-            if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
+            /*if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
                 atomic_store(&state_shared.has_work, false);
             }
 
             while (atomic_load(&state_shared.has_work)) {
                 ggml_lock_lock  (&state_shared.spin);
                 ggml_lock_unlock(&state_shared.spin);
-            }
+            }*/
 
             // launch thread pool
             for (int j = 0; j < n_threads - 1; j++) {
@@ -7364,16 +7422,17 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
                     .wdata = cgraph->work ? cgraph->work->data : NULL,
                 };
                 workers[j].node = node;
+                SetEvent(workers[j].wait_handle);
             }
 
-            atomic_fetch_sub(&state_shared.n_ready, 1);
+            /*atomic_fetch_sub(&state_shared.n_ready, 1);
 
             while (atomic_load(&state_shared.n_ready) > 0) {
                 ggml_lock_lock  (&state_shared.spin);
                 ggml_lock_unlock(&state_shared.spin);
             }
 
-            atomic_store(&state_shared.has_work, true);
+            atomic_store(&state_shared.has_work, true);*/
         }
 
         params.type = GGML_TASK_COMPUTE;
@@ -7381,7 +7440,8 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
 
         // wait for thread pool
         if (node->n_tasks > 1) {
-            if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
+            WaitForMultipleObjects(n_threads - 1, worker_handles, TRUE, INFINITE);
+            /*if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
                 atomic_store(&state_shared.has_work, false);
             }
 
@@ -7395,19 +7455,19 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
             while (atomic_load(&state_shared.n_ready) != 0) {
                 ggml_lock_lock  (&state_shared.spin);
                 ggml_lock_unlock(&state_shared.spin);
-            }
+            }*/
         }
 
         // FINALIZE
         if (node->n_tasks > 1) {
-            if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
+            /*if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
                 atomic_store(&state_shared.has_work, false);
             }
 
             while (atomic_load(&state_shared.has_work)) {
                 ggml_lock_lock  (&state_shared.spin);
                 ggml_lock_unlock(&state_shared.spin);
-            }
+            }*/
 
             // launch thread pool
             for (int j = 0; j < n_threads - 1; j++) {
@@ -7419,16 +7479,17 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
                     .wdata = cgraph->work ? cgraph->work->data : NULL,
                 };
                 workers[j].node = node;
+                SetEvent(workers[j].wait_handle);
             }
 
-            atomic_fetch_sub(&state_shared.n_ready, 1);
+            /*atomic_fetch_sub(&state_shared.n_ready, 1);
 
             while (atomic_load(&state_shared.n_ready) > 0) {
                 ggml_lock_lock  (&state_shared.spin);
                 ggml_lock_unlock(&state_shared.spin);
             }
 
-            atomic_store(&state_shared.has_work, true);
+            atomic_store(&state_shared.has_work, true);*/
         }
 
         params.type = GGML_TASK_FINALIZE;
@@ -7436,7 +7497,8 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
 
         // wait for thread pool
         if (node->n_tasks > 1) {
-            if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
+            WaitForMultipleObjects(n_threads - 1, worker_handles, TRUE, INFINITE);
+            /*if (atomic_fetch_add(&state_shared.n_ready, 1) == n_threads - 1) {
                 atomic_store(&state_shared.has_work, false);
             }
 
@@ -7450,7 +7512,7 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
             while (atomic_load(&state_shared.n_ready) != 0) {
                 ggml_lock_lock  (&state_shared.spin);
                 ggml_lock_unlock(&state_shared.spin);
-            }
+            }*/
         }
 
         // performance stats (node)
@@ -7470,6 +7532,7 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
         atomic_store(&state_shared.has_work, true);
 
         for (int j = 0; j < n_threads - 1; j++) {
+            SetEvent(workers[j].wait_handle);
             int rc = pthread_join(workers[j].thrd, NULL);
             assert(rc == 0);
             UNUSED(rc);
```
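As for the condition_var alternative mentioned above, a portable pthreads version could look roughly like the following sketch. The type and function names are invented for illustration, and this has not been tested against ggml's scheduler:

```c
#include <pthread.h>
#include <stdbool.h>

// Minimal sketch of the condition-variable alternative: workers sleep on a
// condvar instead of spinning, and the main thread signals when work arrives.
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            has_work;
    bool            stop;
} worker_sync;

// worker side: block (near-0% CPU) until there is work or a stop request
static void worker_wait(worker_sync * s) {
    pthread_mutex_lock(&s->mutex);
    while (!s->has_work && !s->stop) {
        pthread_cond_wait(&s->cond, &s->mutex);
    }
    s->has_work = false;  // consume the work flag
    pthread_mutex_unlock(&s->mutex);
}

// main-thread side: publish work and wake the worker
static void worker_signal(worker_sync * s) {
    pthread_mutex_lock(&s->mutex);
    s->has_work = true;
    pthread_cond_signal(&s->cond);
    pthread_mutex_unlock(&s->mutex);
}
```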
@ggerganov
Luke, I'm so sorry to hear the news. Good luck.

> On Sun, 8 Jan 2023, 2:35 pm, Luke Dashjr wrote:
> @fitzsim I won't be in any position to do anything any time soon, unfortunately. (link: https://www.fxstreet.com/cryptocurrencies/news/bitcoin-core-developer-loses-nearly-35-million-in-btc-changpeng-zhao-of-binance-offers-help-202301020843)