
Server: Use multi-task for embeddings endpoint #6001

Merged: 3 commits merged into ggml-org:master on Mar 13, 2024

Conversation

ngxson (Collaborator) commented Mar 11, 2024

As discussed in #5939 (review), using multi-task for multiple prompts in the embeddings endpoint is preferable.

I also made the code less verbose for this function. There may be other ways to make the code even shorter, but I prefer to keep it this way for readability (it is also easier to understand what the code does). Suggestions are welcome.

My next task will be to see whether we can reuse part of the code inside handle_completions for handle_chat_completions to avoid code duplication.
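
For illustration, here is a minimal self-contained sketch of the multi-task idea: a "content" field that may be either a single string or an array of strings is fanned out into one task per prompt, and the per-task results are collected in order. The helper run_single_task and the exact JSON layout are assumptions made for this example, not the actual server code.

#include <nlohmann/json.hpp>
#include <iostream>
#include <string>
#include <vector>

using json = nlohmann::json;

// Stand-in for submitting one embedding task and waiting for its result.
static json run_single_task(const std::string & prompt) {
    return json{{"prompt", prompt}, {"embedding", json::array({0.0, 0.0, 0.0})}};
}

int main() {
    json body = json::parse(R"({"content": ["first prompt", "second prompt"]})");
    const json & content = body["content"];

    // Normalize the input into a list of prompts: one task per prompt.
    std::vector<std::string> prompts;
    if (content.is_array()) {
        for (const auto & p : content) {
            prompts.push_back(p.get<std::string>());
        }
    } else {
        prompts.push_back(content.get<std::string>());
    }

    // "Multi-task": run one task per prompt and gather the results in order,
    // so the response array lines up with the request array.
    json results = json::array();
    for (const auto & prompt : prompts) {
        results.push_back(run_single_task(prompt));
    }

    std::cout << results.dump(2) << std::endl;
    return 0;
}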

ngxson requested a review from ggerganov on March 11, 2024 at 17:10
ggerganov (Member):
It does not work with the following request:

./server -m models/bert-bge-small/ggml-model-f16.gguf --embedding --port 6900 -v --log-format text -c 4096 -b 4096
curl -sS --data '{"content":"1234567890"}' http://127.0.0.1:6900/embedding | jq
{
  "error": {
    "code": 500,
    "message": "[json.exception.type_error.305] cannot use operator[] with a numeric argument with object",
    "type": "server_error"
  }
}
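
For reference, that message is the exception text nlohmann::json produces when a JSON object is indexed with a numeric key, which suggests the handler tried array-style indexing on an object-shaped request body. A minimal standalone reproduction (not taken from the server code):

#include <nlohmann/json.hpp>
#include <iostream>

int main() {
    nlohmann::json j = {{"content", "1234567890"}};   // a JSON object
    try {
        auto v = j[0];                                 // numeric index into an object
    } catch (const nlohmann::json::type_error & e) {
        // prints: [json.exception.type_error.305] cannot use operator[]
        // with a numeric argument with object
        std::cout << e.what() << std::endl;
    }
    return 0;
}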

ngxson (Collaborator, Author) commented Mar 11, 2024

@ggerganov That's weird, the command above works on my computer:

make clean && make server && ./server -m ../bert-bge-small.gguf --embedding --port 6900 -v --log-format text -c 4096 -b 4096
curl -sS --data '{"content":"1234567890"}' http://127.0.0.1:6900/embedding
{"embedding":[-0.03480084612965584,-0.07584927976131439,-0.006564121227711439,-0.017552589997649193,-0.05999302491545677,-0.0035985081922262907,0.029658228158950806,0.019941214472055435,-0.0001966610725503415,-0.0010271853534504771,0.0118120647............

Link to model: https://huggingface.co/ggml-org/models/blob/main/bert-bge-small/ggml-model-f16.gguf

Do you have the server log?

ngxson (Collaborator, Author) commented Mar 11, 2024

The Windows run failed while the Linux runs succeeded (I'm also using Linux). I suspect I did something compiler-dependent, so I rewrote the code a little bit to see whether it helps.

ngxson (Collaborator, Author) commented Mar 11, 2024

That turned out to be true: the Windows test passed. Sometimes compilers do weird things...
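
As an aside, a classic example of the kind of compiler-dependent behavior that can differ between MSVC and GCC/Clang is the unspecified evaluation order of function arguments; this is only an illustration of the general pitfall, not the actual bug in this PR:

#include <cstdio>

static int counter = 0;
static int next_value() { return ++counter; }

int main() {
    // Whether this prints "1 2" or "2 1" is up to the compiler: the two
    // calls to next_value() may be evaluated in either order.
    std::printf("%d %d\n", next_value(), next_value());
    return 0;
}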

Review thread on the following change (the call before, then after):

ctx_server.request_completion(id_task, -1, { {"prompt", prompt}, { "n_predict", 0}}, false, true);

ctx_server.request_completion(id_task, -1, {
    {"prompt", prompt},
    {"n_predict", 0},
}, false, true);
Collaborator:
n_predict is useless here

Collaborator (Author):
I'm not sure what the purpose of {"n_predict", 0} is here, so I just left it as-is (it has been there for a long time).

I think n_predict must be explicitly set to 0 because in embedding mode we're not generating new tokens, only evaluating them. Maybe this isn't the best place to set n_predict; it should perhaps be in request_completion.

But I'm not 100% sure about this part, so I'll wait for confirmation from @ggerganov.

Member:
Yes, it seems useless. If a problem arises from removing it, we should fix embedding processing so that it does not depend in any way on n_predict.

Collaborator (Author):
Yeah, it seems like removing it doesn't break anything. I pushed a new commit to remove it. I'll merge when the CI tests pass.
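
For reference, with {"n_predict", 0} dropped, the call presumably reduces to something like the following (a sketch based on the snippet above, not necessarily the exact final code):

ctx_server.request_completion(id_task, -1, { {"prompt", prompt} }, false, true);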

ngxson merged commit 99b71c0 into ggml-org:master on Mar 13, 2024
61 checks passed
ngxson (Collaborator, Author) commented Mar 13, 2024

@phymbert I noticed that the embedding test with bert-bge-small is quite slow, particularly with LLAMA_SANITIZE_THREAD (it usually takes 30 minutes to finish). I'm wondering if we can skip the embedding test on LLAMA_SANITIZE_THREAD while keeping it on LLAMA_SANITIZE_ADDRESS and LLAMA_SANITIZE_UNDEFINED.

Edit: I tried reducing n_threads to 2, but it doesn't help; here is the run on my forked repo: https://github.com/ngxson/llama.cpp/actions/runs/8263115740/job/22603962488

Edit 2: Also, since we're using hosted runners, speed is not guaranteed. Sometimes waiting for the server to start times out; I think we should increase the max attempts.

phymbert (Collaborator):
Yes, I am already on it; it will run only on the scheduled master workflow. @ggerganov, ok to remove it for PRs?

ggerganov (Member):
Yes

NeoZhangJianyu pushed a commit to NeoZhangJianyu/llama.cpp that referenced this pull request Mar 15, 2024
* use multitask for embd endpoint

* specify types

* remove redundant {"n_predict", 0}
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* use multitask for embd endpoint

* specify types

* remove redundant {"n_predict", 0}