
Server: Use multi-task for embeddings endpoint #6001

Merged: 3 commits merged into ggml-org:master on Mar 13, 2024

Conversation

ngxson (Collaborator) commented Mar 11, 2024

As discussed in #5939 (review), using multi-task for multiple prompts in the embeddings endpoint is preferable.

I also made the code less verbose for this function. There may be other ways to make the code even shorter, but I prefer to keep it this way for readability (it is also easier to understand what the code does). Suggestions are welcome.

My next task will be to see whether we can reuse part of the code inside handle_completions for handle_chat_completions to avoid code duplication.
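
For illustration, here is a minimal self-contained sketch of the multi-task idea: a "content" field that may be either a single string or an array of strings is fanned out into one task per prompt, and the per-task results are collected in order. The helper run_single_task and the exact JSON layout are assumptions made for this example, not the actual server code.

#include <nlohmann/json.hpp>
#include <iostream>
#include <string>
#include <vector>

using json = nlohmann::json;

// Stand-in for submitting one embedding task and waiting for its result.
static json run_single_task(const std::string & prompt) {
    return json{{"prompt", prompt}, {"embedding", json::array({0.0, 0.0, 0.0})}};
}

int main() {
    json body = json::parse(R"({"content": ["first prompt", "second prompt"]})");
    const json & content = body["content"];

    // Normalize the input into a list of prompts: one task per prompt.
    std::vector<std::string> prompts;
    if (content.is_array()) {
        for (const auto & p : content) {
            prompts.push_back(p.get<std::string>());
        }
    } else {
        prompts.push_back(content.get<std::string>());
    }

    // "Multi-task": run one task per prompt and gather the results in order,
    // so the response array lines up with the request array.
    json results = json::array();
    for (const auto & prompt : prompts) {
        results.push_back(run_single_task(prompt));
    }

    std::cout << results.dump(2) << std::endl;
    return 0;
}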

ngxson requested a review from ggerganov on March 11, 2024 at 17:10
ggerganov (Member):
It does not work with the following request:

./server -m models/bert-bge-small/ggml-model-f16.gguf --embedding --port 6900 -v --log-format text -c 4096 -b 4096
curl -sS --data '{"content":"1234567890"}' http://127.0.0.1:6900/embedding | jq
{
  "error": {
    "code": 500,
    "message": "[json.exception.type_error.305] cannot use operator[] with a numeric argument with object",
    "type": "server_error"
  }
}
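
For reference, that message is the exception text nlohmann::json produces when a JSON object is indexed with a numeric key, which suggests the handler tried array-style indexing on an object-shaped request body. A minimal standalone reproduction (not taken from the server code):

#include <nlohmann/json.hpp>
#include <iostream>

int main() {
    nlohmann::json j = {{"content", "1234567890"}};   // a JSON object
    try {
        auto v = j[0];                                 // numeric index into an object
    } catch (const nlohmann::json::type_error & e) {
        // prints: [json.exception.type_error.305] cannot use operator[]
        // with a numeric argument with object
        std::cout << e.what() << std::endl;
    }
    return 0;
}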

ngxson (Collaborator, Author) commented Mar 11, 2024

@ggerganov That's weird, the command above works on my computer:

make clean && make server && ./server -m ../bert-bge-small.gguf --embedding --port 6900 -v --log-format text -c 4096 -b 4096
curl -sS --data '{"content":"1234567890"}' http://127.0.0.1:6900/embedding
{"embedding":[-0.03480084612965584,-0.07584927976131439,-0.006564121227711439,-0.017552589997649193,-0.05999302491545677,-0.0035985081922262907,0.029658228158950806,0.019941214472055435,-0.0001966610725503415,-0.0010271853534504771,0.0118120647............

Link to model: https://huggingface.co/ggml-org/models/blob/main/bert-bge-small/ggml-model-f16.gguf

Do you have the server log?

ngxson (Collaborator, Author) commented Mar 11, 2024

The Windows run failed while the Linux runs succeeded (I'm also using Linux). I suspect I did something compiler-dependent, so I rewrote the code a little bit to see whether it helps.

ngxson (Collaborator, Author) commented Mar 11, 2024

That turned out to be true: the Windows test passed. Sometimes compilers do weird things...
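
As an aside, a classic example of the kind of compiler-dependent behavior that can differ between MSVC and GCC/Clang is the unspecified evaluation order of function arguments; this is only an illustration of the general pitfall, not the actual bug in this PR:

#include <cstdio>

static int counter = 0;
static int next_value() { return ++counter; }

int main() {
    // Whether this prints "1 2" or "2 1" is up to the compiler: the two
    // calls to next_value() may be evaluated in either order.
    std::printf("%d %d\n", next_value(), next_value());
    return 0;
}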

Review thread on the following change (the call before, then after):

ctx_server.request_completion(id_task, -1, { {"prompt", prompt}, { "n_predict", 0}}, false, true);

ctx_server.request_completion(id_task, -1, {
    {"prompt", prompt},
    {"n_predict", 0},
}, false, true);
Collaborator:
n_predict is useless here

Collaborator (Author):
I'm not sure what the purpose of {"n_predict", 0} is here, so I just left it as-is (it has been there for a long time).

I think n_predict must be explicitly set to 0 because in embedding mode we're not generating new tokens, only evaluating them. Maybe this isn't the best place to set n_predict; it should perhaps be in request_completion.

But I'm not 100% sure about this part, so I'll wait for confirmation from @ggerganov.

Member:
Yes, it seems useless. If a problem arises from removing it, we should fix embedding processing so that it does not depend in any way on n_predict.

Collaborator (Author):
Yeah, it seems like removing it doesn't break anything. I pushed a new commit to remove it. I'll merge when the CI tests pass.
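
For reference, with {"n_predict", 0} dropped, the call presumably reduces to something like the following (a sketch based on the snippet above, not necessarily the exact final code):

ctx_server.request_completion(id_task, -1, { {"prompt", prompt} }, false, true);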

ngxson merged commit 99b71c0 into ggml-org:master on Mar 13, 2024
61 checks passed
ngxson (Collaborator, Author) commented Mar 13, 2024

@phymbert I noticed that the embedding test with bert-bge-small is quite slow, particularly with LLAMA_SANITIZE_THREAD (it usually takes 30 minutes to finish). I'm wondering if we can skip the embedding test on LLAMA_SANITIZE_THREAD while keeping it on LLAMA_SANITIZE_ADDRESS and LLAMA_SANITIZE_UNDEFINED.

Edit: I tried reducing n_threads to 2, but it doesn't help; here is the run on my forked repo: https://github.com/ngxson/llama.cpp/actions/runs/8263115740/job/22603962488

Edit 2: Also, since we're using hosted runners, speed is not guaranteed. Sometimes waiting for the server to start times out; I think we should increase the max attempts.

phymbert (Collaborator):
Yes, I am already on it; it will run only on the scheduled master workflow. @ggerganov, ok to remove it for PRs?

ggerganov (Member):
Yes

NeoZhangJianyu pushed a commit to NeoZhangJianyu/llama.cpp that referenced this pull request Mar 15, 2024
* use multitask for embd endpoint

* specify types

* remove redundant {"n_predict", 0}
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* use multitask for embd endpoint

* specify types

* remove redundant {"n_predict", 0}