[tracking] Shortfin usability issues #245

Open · 6 of 12 tasks
renxida opened this issue Oct 3, 2024 · 0 comments

Goal: make Shortfin-LLM a polished, high-level API for serving VMFB-format LLM models

Plan:

renxida changed the title from "[tracking] Production Grade Shortfin" to "[tracking] Production Grade Shortfin-LLM" on Nov 19, 2024
stbaione added a commit that referenced this issue on Jan 29, 2025: "…e_count export option" (#884)

# Changes
1. Allow token_ids to be returned from server
2. Stream token-by-token and add rid for batch scenario
3. Add device_block_count option for longer prompt lengths

What we really need is the `Standard API` task in #245, but this serves as a
temporary implementation so that we can handle returning token_ids and
handle streaming for batch requests.

# Description

## 1

The [reference
implementation](https://github.com/mlcommons/inference/tree/master/language/llama3.1-405b)
for MLPerf was already set up to process non-decoded tokens coming back
from the LLM server. However, shortfin currently doesn't have the
ability to return token_ids.

This adds an option for requesting raw token_ids back from the server.

For a normal (non-streaming) request, you receive the bytes of the
`json.dumps` of the token list:

```text
# Single
b'[21, 220, 22, 220, 23, 220, 24, 220, 605, 220, 806]'

# Batch
b'[[21, 220, 22, 220, 23, 220, 24, 220, 605, 220, 806], [845, 220, 1114, 220, 972, 220, 777, 220, 508, 220, 1691]]'
```
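
For illustration, here is a minimal client sketch of this flow. The `/generate` endpoint, the port, and the `return_token_ids` field name are assumptions for the sketch, not the confirmed request schema:

```python
import json

import requests

# Sketch: request raw token ids back from the shortfin LLM server.
# The `/generate` endpoint, port 8000, and the `return_token_ids` field
# are assumptions for illustration; check the server's request schema
# for the actual option name.
payload = {
    "text": "1 2 3 4 5 ",
    "sampling_params": {"max_completion_tokens": 12},
    "return_token_ids": True,  # hypothetical name for the new option
}

resp = requests.post("http://localhost:8000/generate", json=payload)

# The body is the bytes of `json.dumps` of the token list: a flat list
# for a single prompt, or a list of lists for a batch.
token_ids = json.loads(resp.content)
print(token_ids)
```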

## 2

I also change the way that we stream. Currently, we do this:

```text
1. b'data: Hello\n\n'
2. b'data: Hello how'
3. b'data: Hello how are you'
4. b'data: Hello how are you today?'
```

Each time we push a response to the stream, we repeat the entire
contents generated so far. This is pretty inefficient, especially for a
request with a high `max_completion_tokens`.

Instead we stream token-by-token, like this:

```text
# Text
1. b'data(rid1): Hello'
2. b'data(rid1): how'
3. b'data(rid1): are you'
4. b'data(rid1): today?'

# Tokens
1. b'data(rid1): 21\n\n'
2. b'data(rid1): 220\n\n'
3. b'data(rid1): 22\n\n'
4. b'data(rid1): 220\n\n'
```
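
As a rough sketch of how a client consumes the new format for a single request, the chunks can simply be concatenated rather than re-parsing a growing response. The endpoint and request fields are assumptions; the `data(ridN): ` framing is taken from the examples above:

```python
import requests

# Sketch: accumulate token-by-token stream chunks into the full completion.
# Endpoint, port, and request fields are assumptions for illustration.
payload = {
    "text": "Hello",
    "sampling_params": {"max_completion_tokens": 20},
    "stream": True,
}

pieces = []
with requests.post("http://localhost:8000/generate", json=payload, stream=True) as resp:
    for line in resp.iter_lines():  # one `data(ridN): ...` entry per chunk
        if not line:
            continue
        # Keep only the text after the `data(ridN): ` prefix.
        _, _, token_text = line.decode().partition(": ")
        pieces.append(token_text)

print("".join(pieces))
```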

## Why data(rid):?

Another issue that was already present: when streaming from the server,
we didn't have a way to tell which response aligned with which prompt.
Each chunk coming back would contain a token from response 1 or a token
from response 2, but there was no way to know which token belonged to
which request.

As a simple patch for this, I include the `rid` in each streamed chunk,
so we receive:

```text
b'data(rid1):  I\n\n'
b"data(rid2):  I'm\n\n"
b'data(rid1):  am\n\n'
b'data(rid2):  sure\n\n'
```
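
A batch client can then demultiplex the stream by `rid`. A minimal sketch, assuming the `data(ridN): token\n\n` framing shown above:

```python
import re
from collections import defaultdict

# Sketch: group streamed chunks by request id (rid), assuming the
# `data(ridN): token\n\n` framing shown above.
CHUNK_RE = re.compile(rb"^data\((?P<rid>[^)]+)\):\s?(?P<body>.*)$", re.DOTALL)

def demux(chunks):
    """Collect the per-request completion text from a batch stream."""
    per_request = defaultdict(list)
    for chunk in chunks:
        match = CHUNK_RE.match(chunk.strip())
        if match is None:
            continue
        per_request[match["rid"].decode()].append(match["body"].decode())
    return {rid: "".join(parts) for rid, parts in per_request.items()}

stream = [
    b"data(rid1):  I\n\n",
    b"data(rid2):  I'm\n\n",
    b"data(rid1):  am\n\n",
    b"data(rid2):  sure\n\n",
]
print(demux(stream))  # {'rid1': ' I am', 'rid2': " I'm sure"}
```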

## 3

Finally, I add an option in `export_paged_llm_v1` to make the
`device_block_count` configurable. This is needed for longer input
prompts.
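
For example, an export invocation might look like the following. The module path matches sharktank's `export_paged_llm_v1` example, but the exact spelling of the block-count flag (and the other flags shown) is an assumption and should be checked against `--help`:

```text
python -m sharktank.examples.export_paged_llm_v1 \
  --gguf-file=/path/to/model.gguf \
  --output-mlir=model.mlir \
  --output-config=config.json \
  --device-block-count=8192
```
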
renxida changed the title from "[tracking] Production Grade Shortfin-LLM" to "[tracking] Shortfin usability issues" on Feb 5, 2025