## Issue

Currently, logits post-processors must be registered at initialization time through `ExecutorConfig.logits_post_processor_map`. This prevents registering post-processors and schemas at runtime on a per-request basis.
```cpp
using LogitsPostProcessor = std::function<void(IdType, Tensor&, BeamTokens const&, StreamPtr const&, std::optional<IdType>)>;
using LogitsPostProcessorMap = std::unordered_map<std::string, LogitsPostProcessor>;
```
I noticed Disaggregated Serving in the 2025 Roadmap, but I am unsure of its scope and whether it is related to this issue.
## Impact

Using Triton Inference Server with the TensorRT-LLM backend in production:

- Tight coupling between application and model deployment
- The full set of validation schemas must be known and named at model build time
- Application logic changes require model redeployment
- The inference server cannot be schema-agnostic
## Proposed Solution
Add support for single-use logits post-processors scoped to individual requests. The post-processor would be registered for the duration of the request only and automatically cleaned up afterwards, following the pattern established by vLLM's and TGI's grammar implementations for constrained decoding. This functionality would work in addition to the `LogitsPostProcessorMap` declared at initialization time, ensuring backward compatibility.
## Example Use Case

Using LM Format Enforcer with dynamic schemas that depend on request context. The current workaround requires pre-registering all possible schema combinations at initialization time, which can grow exponentially with context complexity.