Server: Use multi-task for embeddings endpoint #6001
Conversation
It does not work with the following request:

```sh
./server -m models/bert-bge-small/ggml-model-f16.gguf --embedding --port 6900 -v --log-format text -c 4096 -b 4096
curl -sS --data '{"content":"1234567890"}' http://127.0.0.1:6900/embedding | jq
```

```json
{
  "error": {
    "code": 500,
    "message": "[json.exception.type_error.305] cannot use operator[] with a numeric argument with object",
    "type": "server_error"
  }
}
```
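For context on that error message: nlohmann::json, which the server uses for JSON handling, throws `type_error.305` whenever `operator[]` is called with a numeric index on a value that is not an array. A minimal sketch of the error class, assuming only nlohmann::json; this illustrates the failure mode, not the server's actual code path:

```cpp
#include <iostream>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

int main() {
    json body = json::parse(R"({"content":"1234567890"})");

    // Indexing an object with a number throws exactly the reported error:
    // "[json.exception.type_error.305] cannot use operator[] with a
    //  numeric argument with object"
    try {
        std::cout << body[0] << "\n";
    } catch (const json::type_error & e) {
        std::cerr << e.what() << "\n";
    }

    // The robust pattern: check the runtime type before numeric indexing.
    const json & content = body.at("content");
    json prompt = content.is_array() ? content[0] : content;
    std::cout << prompt << "\n";
    return 0;
}
```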
@ggerganov That's weird, the command above works on my computer.

Link to the model: https://huggingface.co/ggml-org/models/blob/main/bert-bge-small/ggml-model-f16.gguf

Do you have the server log?
The Windows run failed while the Linux runs succeeded (I'm also using Linux). I suspect I did something compiler-dependent, so I tried rewriting the code a little to see whether that fixes it.
That turned out to be true: the Windows test passed. Sometimes compilers do weird things...
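One plausible source of such compiler-sensitive behavior (my assumption, not confirmed in this thread) is nlohmann::json's initializer-list constructor, which has to infer whether a braced list means an object or an array. The library provides `json::object(...)` and `json::array(...)` to state the intent explicitly, which loosely matches the "specify types" commit in this PR. A hedged sketch:

```cpp
#include <nlohmann/json.hpp>

using json = nlohmann::json;

int main() {
    // Inferred: the library guesses "object" because every element is a
    // two-element list whose first item is a string.
    json guessed = { {"prompt", "hello"}, {"n_predict", 0} };

    // Explicit: no inference needed, the intended type is written in the code.
    json obj = json::object({ {"prompt", "hello"}, {"n_predict", 0} });
    json arr = json::array({ {"prompt", "hello"}, {"n_predict", 0} });

    return guessed == obj ? 0 : 1;
}
```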
examples/server/server.cpp (outdated)

```diff
-ctx_server.request_completion(id_task, -1, { {"prompt", prompt}, { "n_predict", 0}}, false, true);
+ctx_server.request_completion(id_task, -1, {
+    {"prompt", prompt},
+    {"n_predict", 0},
```
`n_predict` is useless here
I'm not sure what the purpose of `{"n_predict", 0}` is here, so I just left it as-is (it has been there for a long time). I think `n_predict` must be explicitly set to 0 because in embedding mode we don't generate new tokens, we only evaluate them. Maybe this isn't the best place to set `n_predict`; it should probably be in `request_completion`. But I'm not 100% sure about this part, so I'll wait for confirmation from @ggerganov.
Yes, it seems useless. If a problem arises from removing it, we should fix embedding processing to not depend in any way on `n_predict`.
Yeah, it seems like removing it doesn't break anything. I pushed a new commit to remove this. Will merge when the CI tests pass.
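For reference, a standalone sketch of the call shape after that commit. The `request_completion` parameter list is assumed from the diff above, and the stub exists only to make the snippet compile:

```cpp
#include <nlohmann/json.hpp>

using json = nlohmann::json;

struct server_stub {
    // Same parameter shape as the call in the diff above (assumed).
    void request_completion(int id_task, int id_multi, json data,
                            bool infill, bool embedding) {
        (void)id_task; (void)id_multi; (void)data; (void)infill; (void)embedding;
    }
};

int main() {
    server_stub ctx_server;
    json prompt = "1234567890";
    const int id_task = 0;

    // After removing the redundant field, only "prompt" remains:
    ctx_server.request_completion(id_task, -1, {
        {"prompt", prompt},
    }, false, true);
    return 0;
}
```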
@phymbert I noticed that the embedding test with […]

Edit: I tried reducing n_threads to 2, but it doesn't help. Here is the run on my forked repo: https://github.com/ngxson/llama.cpp/actions/runs/8263115740/job/22603962488

Edit 2: Also, since we're using a hosted runner, speed is not guaranteed. Sometimes […]
Yes, I am on it already; it will run only on scheduled master runs. @ggerganov OK to remove it for PRs?
Yes
* use multitask for embd endpoint
* specify types
* remove redundant `{"n_predict", 0}`
As discussed in #5939 (review), using multi-task for multiple prompts in the embeddings endpoint is preferable.

I also made the code less verbose for this function. There may be other ways to make the code even shorter, but I prefer to keep it this way for readability (it's also easier to understand what the code does). Suggestions are welcome.
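To make the multi-task idea concrete, here is a standalone sketch of the dispatch pattern. The normalization helper, the `submit` stand-in, and the parent/subtask id scheme are my assumptions for illustration, not necessarily the PR's actual code:

```cpp
// Standalone sketch of the "one task per prompt" dispatch, assuming
// nlohmann::json; submit() stands in for the server's task queue.
#include <iostream>
#include <nlohmann/json.hpp>
#include <vector>

using json = nlohmann::json;

// Normalize "content" (a single prompt or an array of prompts) into a list.
static std::vector<json> to_prompt_list(const json & content) {
    if (content.is_array()) {
        return std::vector<json>(content.begin(), content.end());
    }
    return { content };
}

int main() {
    json body = json::parse(R"({"content":["hello","world"]})");

    // Stand-in for ctx_server.request_completion(...): here we just print.
    auto submit = [](int id_task, int id_parent, const json & data) {
        std::cout << "task " << id_task << " (parent " << id_parent
                  << "): " << data.dump() << "\n";
    };

    // Fan out: one subtask per prompt under a shared parent id, so results
    // can be gathered together instead of looping sequentially per request.
    const int id_parent = 0;
    int next_id = 1;
    for (const json & prompt : to_prompt_list(body.at("content"))) {
        submit(next_id++, id_parent, json{ {"prompt", prompt} });
    }
    return 0;
}
```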
The next task for me will be to see whether we can reuse part of the code inside `handle_completions` for `handle_chat_completions`, to prevent code duplication.
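As a sketch of what that deduplication could look like (a purely hypothetical shape; the helper names here are mine, not existing code in server.cpp):

```cpp
// Hypothetical factoring: both endpoints funnel into one shared routine.
#include <iostream>
#include <nlohmann/json.hpp>
#include <string>

using json = nlohmann::json;

// Shared core: create the task, wait for results, write the response.
// (Stubbed here; in server.cpp this would talk to ctx_server.)
static void handle_completion_common(const json & data) {
    std::cout << "completion request: " << data.dump() << "\n";
}

// Hypothetical helper: convert OpenAI-style "messages" into a plain prompt.
static json chat_to_completion_params(const json & body) {
    std::string prompt;
    for (const json & msg : body.at("messages")) {
        prompt += msg.at("role").get<std::string>() + ": "
                + msg.at("content").get<std::string>() + "\n";
    }
    return json{ {"prompt", prompt} };
}

int main() {
    // /completion: the body already carries a prompt.
    handle_completion_common(json{ {"prompt", "hello"} });

    // /v1/chat/completions: convert first, then reuse the same core path.
    json chat_body = json::parse(R"({"messages":[{"role":"user","content":"hi"}]})");
    handle_completion_common(chat_to_completion_params(chat_body));
    return 0;
}
```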