How to get the newly generated tokens only? #227

Closed
yunfeng-scale opened this issue Nov 1, 2023 · 4 comments
Labels: feature request, triaged

Comments

@yunfeng-scale

Hi, I'm trying to get only the newly generated tokens. It looks like the model currently returns all the token IDs, including the input.

I tried to update the postprocessing model to take REQUEST_INPUT_LEN from the preprocessing model; however, I'm receiving the error E1101 04:46:29.381834 3651 model_repository_manager.cc:563] Invalid argument: in ensemble ensemble, step of model 'ensemble' receives inputs originated from different decoupled models.

This appears to happen because the postprocessing model tries to take inputs from both the decoupled model tensorrt_llm and the non-decoupled model preprocessing. The ensemble loads if I set model_transaction_policy.decoupled of the tensorrt_llm model to false.

I could also do the tokenization outside of the model ensemble, but that duplicates work. Any suggestions?
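To illustrate that client-side workaround, here is a minimal sketch that re-tokenizes the prompt outside the ensemble and slices the echoed input IDs off the returned sequence. The tokenizer name and the assumption that the output begins with the input IDs are illustrative, not taken from the actual ensemble config:

```python
# Minimal sketch of the client-side workaround: re-tokenize the prompt and
# drop the echoed input IDs from the returned sequence. The tokenizer below
# is a placeholder; use the same tokenizer as the preprocessing model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model name

def strip_input_tokens(prompt: str, output_ids: list[int]) -> list[int]:
    """Return only the newly generated token IDs, assuming the backend
    prepends the prompt's token IDs to the returned sequence."""
    input_len = len(tokenizer.encode(prompt))
    return output_ids[input_len:]

# Usage (response_output_ids would come from the Triton client response):
# new_ids = strip_input_tokens("Hello, my name is", response_output_ids)
```

The cost is exactly the duplicated work mentioned above: the prompt gets tokenized once here and once again inside the preprocessing model.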

@juney-nvidia
Collaborator

@yunfeng-scale thanks for reporting this. Several customers have also reported a similar request. We already have an internal MR to support this, which is under review. Once it is done, it will be released, and there will be a release announcement mentioning it.

Thanks
June

@juney-nvidia added the triaged and feature request labels on Nov 1, 2023
@kaiyux
Member

kaiyux commented Nov 7, 2023

Hi @yunfeng-scale, we pushed an update to the main branch for both TensorRT-LLM and the TensorRT-LLM backend, including the feature to return only the newly generated tokens.

Closing; please feel free to re-open if you have any questions. Thanks.

@kaiyux closed this as completed on Nov 7, 2023
@yunfeng-scale
Author

@kaiyux would you mind linking the PR to this issue?

@kaiyux
Member

kaiyux commented Dec 11, 2023

@yunfeng-scale Please see the newly added parameter exclude_input_in_output in triton-inference-server/tensorrtllm_backend#101
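A minimal sketch of how this is typically enabled, as a parameter in the tensorrt_llm model's config.pbtxt. The exact placement and accepted values may differ between backend versions, so treat this as an illustration rather than the definitive configuration:

```
parameters: {
  key: "exclude_input_in_output"
  value: {
    string_value: "true"
  }
}
```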

Note that the team does its development in internal repos and syncs the changes to GitHub periodically, so there is no dedicated PR on GitHub that fixes this issue.

If you are still seeing the issue, please feel free to ask and we will reopen it. Thanks.
