
Feature Request: Support for Per-Request Logits Post-Processor Registration #2809

Open
EmileDqy opened this issue Feb 21, 2025 · 0 comments

Issue

Currently, logits post-processors must be registered at initialization time through ExecutorConfig.logits_post_processor_map. This prevents runtime registration of post-processors and schemas on a per-request basis.

using LogitsPostProcessor = std::function<void(IdType, Tensor&, BeamTokens const&, StreamPtr const&, std::optional<IdType>)>;
using LogitsPostProcessorMap = std::unordered_map<std::string, LogitsPostProcessor>;
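To make the constraint concrete, here is a minimal, self-contained sketch of a name-keyed registry that is fixed at startup. `Logits`, `Processor`, `FrozenRegistry`, and `demoFrozenRegistry` are simplified stand-ins invented for illustration, not the TensorRT-LLM types; the real callback signature shown above also takes beam tokens, a stream, and an optional ID.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Simplified stand-ins (assumptions), not the real TensorRT-LLM types.
using IdType = unsigned long long;
using Logits = std::vector<float>;
using Processor = std::function<void(IdType, Logits&)>;
using ProcessorMap = std::unordered_map<std::string, Processor>;

// Hypothetical executor-like object: the map is taken at construction
// and there is no method to add entries afterwards.
class FrozenRegistry {
public:
    explicit FrozenRegistry(ProcessorMap map) : map_(std::move(map)) {}

    // Requests can only refer to processors by a name known at startup.
    bool apply(std::string const& name, IdType reqId, Logits& logits) const {
        auto it = map_.find(name);
        if (it == map_.end()) return false;  // name was not registered at init
        it->second(reqId, logits);
        return true;
    }

private:
    ProcessorMap const map_;
};

// Demo helper: registers one processor, then tries a known and an unknown name.
inline std::pair<bool, bool> demoFrozenRegistry() {
    ProcessorMap map;
    map["zero_first"] = [](IdType, Logits& l) { if (!l.empty()) l[0] = 0.0f; };
    FrozenRegistry registry(std::move(map));

    Logits logits{1.0f, 2.0f};
    bool known = registry.apply("zero_first", 1, logits);
    assert(logits[0] == 0.0f);  // the registered processor ran
    bool unknown = registry.apply("schema_built_at_request_time", 2, logits);
    return {known, unknown};
}
```

In this sketch, a request that needs a processor built from data only available at request time has no way to get it into the registry, which is the situation described above.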

I noticed Disaggregated Serving in the 2025 Roadmap, but I am unsure of its scope and whether it is related to this issue.

Impact

When running Triton Inference Server with the TensorRT-LLM backend in production, this limitation has several consequences:

  • Tight coupling between application and model deployment
  • The full set of validation schemas must be known and named at model build time
  • Application logic changes require model redeployment
  • Inference server cannot be schema-agnostic

Proposed Solution

Add support for single-use logits post-processors scoped to individual requests. The post-processor would be registered for the duration of the request only and automatically cleaned up afterwards, following the pattern established by vLLM and TGI's grammar implementations for constrained decoding. This functionality can be used in addition to the LogitsPostProcessorMap declared at initialization time, ensuring backward compatibility.
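One possible shape for this, as a sketch only: the request carries its own callable, which takes precedence over the init-time map and is destroyed together with the request. All names here (`Request`, `ownProcessor`, `applyPostProcessing`, `makeAllowList`) are hypothetical and the types are simplified stand-ins, not the TensorRT-LLM API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <limits>
#include <memory>
#include <optional>
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified stand-ins (assumptions), not the real TensorRT-LLM types.
using IdType = unsigned long long;
using Logits = std::vector<float>;
using Processor = std::function<void(IdType, Logits&)>;
using ProcessorMap = std::unordered_map<std::string, Processor>;

// Hypothetical request: may name an init-time processor or carry its own.
struct Request {
    IdType id;
    std::optional<std::string> processorName;  // looked up in the init-time map
    std::optional<Processor> ownProcessor;     // lives only as long as the request
};

// Prefer the request-scoped processor; otherwise fall back to the map.
// Returns false when a name was given but never registered.
bool applyPostProcessing(ProcessorMap const& global, Request const& req, Logits& logits) {
    if (req.ownProcessor) {
        (*req.ownProcessor)(req.id, logits);
        return true;
    }
    if (req.processorName) {
        auto it = global.find(*req.processorName);
        if (it == global.end()) return false;
        it->second(req.id, logits);
    }
    return true;
}

// Builds a processor that masks every token index not in `allowed`.
Processor makeAllowList(std::shared_ptr<std::set<std::size_t> const> allowed) {
    return [allowed](IdType, Logits& logits) {
        for (std::size_t i = 0; i < logits.size(); ++i) {
            if (allowed->count(i) == 0) {
                logits[i] = -std::numeric_limits<float>::infinity();
            }
        }
    };
}

// Demo: returns how many owners the schema has once the request is gone.
long demoScopedProcessor() {
    auto schema = std::make_shared<std::set<std::size_t>>(std::set<std::size_t>{2});
    ProcessorMap global;  // the init-time map stays empty and untouched
    Logits logits{0.1f, 0.2f, 0.3f};
    {
        Request req{7, std::nullopt, makeAllowList(schema)};
        bool ok = applyPostProcessing(global, req, logits);
        assert(ok);
        assert(std::isinf(logits[0]) && std::isinf(logits[1]));
        assert(logits[2] == 0.3f);
    }  // req is destroyed here, and its processor with it
    return schema.use_count();
}
```

Because the processor is owned by the request object, cleanup needs no explicit unregister call in this sketch, and requests that only set `processorName` keep behaving as before.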

TGI has recently integrated TensorRT-LLM but had to disable its grammar support due to this limitation. This suggests broader ecosystem benefits.

Example Use Case

A typical case is using LM Format Enforcer with dynamic schemas that depend on request context. The current workaround is to pre-register every possible schema combination at initialization time, and the number of combinations can grow exponentially with context complexity.
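As a rough illustration of that growth, assume (purely for the sake of the arithmetic) that the request context is a set of independent dimensions, each selecting one of a few schema variants. The helper name `schemasToPreRegister` is hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Number of schema variants that would have to be pre-registered when each
// context dimension independently selects one of choices[i] options.
// This independence model is an assumption made for illustration.
std::uint64_t schemasToPreRegister(std::vector<std::uint64_t> const& choices) {
    std::uint64_t total = 1;
    for (std::uint64_t c : choices) {
        total *= c;  // every option combines with every earlier combination
    }
    return total;
}
```

Under this model, ten on/off context fields already require 1,024 named entries in the map, and twenty require 1,048,576.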
