How to get the newly generated tokens only? #227

Closed
yunfeng-scale opened this issue Nov 1, 2023 · 4 comments
Labels: feature request, triaged

Comments

@yunfeng-scale

Hi, I'm trying to get only the newly generated tokens. It looks like the model currently returns all the token IDs, including the input.

I tried to update the postprocessing model to take REQUEST_INPUT_LEN from the preprocessing model; however, I'm receiving the error E1101 04:46:29.381834 3651 model_repository_manager.cc:563] Invalid argument: in ensemble ensemble, step of model 'ensemble' receives inputs originated from different decoupled models.

This appears to happen because the postprocessing model tries to take inputs from both the decoupled model tensorrt_llm and the non-decoupled model preprocessing. The ensemble loads if I set model_transaction_policy.decoupled of the tensorrt_llm model to false.

I could also do the tokenization outside of the model ensemble, but that duplicates work. Any suggestions?
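To illustrate that client-side workaround, here is a minimal sketch that re-tokenizes the prompt outside the ensemble and slices the echoed input IDs off the returned sequence. The tokenizer name and the assumption that the output begins with the input IDs are illustrative, not taken from the actual ensemble config:

```python
# Minimal sketch of the client-side workaround: re-tokenize the prompt and
# drop the echoed input IDs from the returned sequence. The tokenizer below
# is a placeholder; use the same tokenizer as the preprocessing model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model name

def strip_input_tokens(prompt: str, output_ids: list[int]) -> list[int]:
    """Return only the newly generated token IDs, assuming the backend
    prepends the prompt's token IDs to the returned sequence."""
    input_len = len(tokenizer.encode(prompt))
    return output_ids[input_len:]

# Usage (response_output_ids would come from the Triton client response):
# new_ids = strip_input_tokens("Hello, my name is", response_output_ids)
```

The cost is exactly the duplicated work mentioned above: the prompt gets tokenized once here and once again inside the preprocessing model.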

@juney-nvidia
Collaborator

@yunfeng-scale thanks for reporting this. Several customers have also reported a similar request. We already have an internal MR to support this, which is under review. Once it is done, it will be released, and there will be a release announcement mentioning it.

Thanks
June

@juney-nvidia added the triaged and feature request labels on Nov 1, 2023
@kaiyux
Member

kaiyux commented Nov 7, 2023

Hi @yunfeng-scale, we pushed an update to the main branch for both TensorRT-LLM and the TensorRT-LLM backend, including the feature to return only the newly generated tokens.

Closing; please feel free to re-open if you have any questions. Thanks.

@kaiyux closed this as completed on Nov 7, 2023
@yunfeng-scale
Author

@kaiyux would you mind linking the PR to this issue?

@kaiyux
Member

kaiyux commented Dec 11, 2023

@yunfeng-scale Please see the newly added parameter exclude_input_in_output in triton-inference-server/tensorrtllm_backend#101
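A minimal sketch of how this is typically enabled, as a parameter in the tensorrt_llm model's config.pbtxt. The exact placement and accepted values may differ between backend versions, so treat this as an illustration rather than the definitive configuration:

```
parameters: {
  key: "exclude_input_in_output"
  value: {
    string_value: "true"
  }
}
```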

Note that the team does its development in internal repos and syncs the changes to GitHub periodically, so there is no dedicated PR on GitHub that fixes this issue.

If you are still seeing the issue, please feel free to ask and we will reopen it. Thanks.
