Failure of TensorRT 10.7 to eliminate concatenation with upstream custom layer #4345
Comments
You are correct; currently, the concatenation elimination pass is not supported for plugin nodes. I'll look into updating the TensorRT developer guide to mention this. Do you have a motivating use case where the time spent copying dominates the time saved by using a custom plugin?
Is there any trick to make concatenation elimination work even with plugin nodes? I don't mind a bit of hacking if it's not too complicated.

My network contains about 20 layers of a custom layer type implemented using a plugin. The application is latency-sensitive, with relatively low data and compute volume that does not saturate the GPU. The batch size is 1. The custom layer is stateful: each layer takes two inputs (the currently observed data/features and the current state) and produces two outputs (the regular output and the next state). Because each custom layer is stateful, the entire network is also stateful. During inference, the second network/graph input is the current network state, and the second network/graph output is the next network state, which is fed back into the network for the next inference step. (The first network/graph input and output are the regular features and output.) The network state is simply the combination of all layer states.

When constructing a model, I can choose the number of custom layers. To keep deployment simple, I concatenate all layer output states into a single network output state (a rough sketch of this topology follows below). This prevents the user of the engine from having to manage a variable number of inputs and outputs dictated by internal implementation details; the user only has to accommodate varying state sizes from different models.

On an L4, I observe the following: From ncu profiling: For a model with 21 custom layers, each inference involves 130 kernel launches (With …
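For concreteness, here is a rough Python sketch of the topology described above, using the TensorRT network definition API. The plugin name `MyStatefulLayer`, the shapes, and the creator fields are assumptions standing in for the real custom layer, not the actual model:

```python
import tensorrt as trt

# Rough sketch: N stateful plugin layers whose per-layer state outputs are all
# concatenated into a single network-level state output. "MyStatefulLayer" and
# the shapes below are placeholders, not the real plugin or model.
N_LAYERS, STATE_SIZE, FEAT_SIZE = 21, 32, 64

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)

features = network.add_input("features", trt.float32, (1, FEAT_SIZE))
state_in = network.add_input("state_in", trt.float32, (1, N_LAYERS * STATE_SIZE))

creator = trt.get_plugin_registry().get_plugin_creator("MyStatefulLayer", "1")

x, state_outs = features, []
for i in range(N_LAYERS):
    # Slice out this layer's share of the incoming network state.
    layer_state = network.add_slice(
        state_in, (0, i * STATE_SIZE), (1, STATE_SIZE), (1, 1)).get_output(0)
    plugin = creator.create_plugin(f"layer{i}", trt.PluginFieldCollection([]))
    node = network.add_plugin_v2([x, layer_state], plugin)
    x = node.get_output(0)                  # regular output feeds the next layer
    state_outs.append(node.get_output(1))   # this layer's next state

# All per-layer states become one state output; this concatenation (and the
# upstream slices) is what currently turns into extra copy kernels.
concat = network.add_concatenation(state_outs)
concat.axis = 1
network.mark_output(x)
network.mark_output(concat.get_output(0))
```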
I did further experimentation, focused on … I changed my plugin to accept the …
Collected this feature request in the 2025 TensorRT roadmap: support the concatenation elimination pass for plugins.
Thanks. There are two similar/related problems that cause the extra copying: downstream concatenation and upstream slicing. Originally, the GitHub issue title described only the former, but my later inspection revealed both problems, which I've described previously, and I ended up working around the latter by fusing the slicing into my plugin. I wonder whether you are including only the former problem (downstream concatenation) in the roadmap, or both?
Thanks for the clarification! I will include both downstream concatenation and upstream slicing in the roadmap.
Description
It seems that TensorRT cannot eliminate a concatenation layer if there is an upstream custom layer.
In a simple model that uses only standard operators, TensorRT engine building eliminates the concatenation, but after I replaced Add with a CustomAdd plugin that does the same thing as Add, engine building no longer eliminates the concatenation.
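To illustrate the shape of the comparison (this is not the actual sac16/sac16c model from the linked repo), here is a minimal ONNX graph in which an Add feeding a Concat is swapped for a custom op; the `CustomAdd` domain name and shapes are assumptions:

```python
import onnx
from onnx import helper, TensorProto

a = helper.make_tensor_value_info("a", TensorProto.FLOAT, [1, 16])
b = helper.make_tensor_value_info("b", TensorProto.FLOAT, [1, 16])
y = helper.make_tensor_value_info("y", TensorProto.FLOAT, [1, 32])

# Replace "CustomAdd" with "Add" (and drop the custom domain) to get the
# variant where TensorRT does eliminate the concatenation.
add = helper.make_node("CustomAdd", ["a", "b"], ["s"], domain="custom")
cat = helper.make_node("Concat", ["s", "a"], ["y"], axis=1)

graph = helper.make_graph([add, cat], "sac", [a, b], [y])
model = helper.make_model(graph, opset_imports=[
    helper.make_opsetid("", 17),
    helper.make_opsetid("custom", 1),
])
onnx.save(model, "sac_custom.onnx")
```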
This failure to eliminate concatenation diminishes the benefit of using plugins when a plugin outputs to a concatenation layer, especially in terms of reducing the number of kernels, since the surviving concatenation typically results in `copyVectorizedKernel` kernels being launched to do the copying. From the engine-building log, the failure appears to be related to a concept called "striding support", but I could not find any documentation on it, especially in relation to plugins.
My goal is for the concatenation to also be eliminated in the case involving custom layers, so that there are no unnecessary `copyVectorizedKernel` kernels. If the current behavior is by design, there should be documentation about this caveat regarding the use of plugins.
Environment
TensorRT Version: 10.7
NVIDIA GPU: RTX 3080
NVIDIA Driver Version: 565.57.01
CUDA Version: 12.7
CUDNN Version: N/A
Operating System: Ubuntu 24.04
Python Version (if applicable): 3.12 (but irrelevant)
Tensorflow Version (if applicable): N/A
PyTorch Version (if applicable): N/A
Baremetal or Container (if so, version): baremetal
Relevant Files
https://github.com/jchia/trt-copy contains all the details needed to reproduce the problem.
Steps To Reproduce
With the content of the repo at https://github.com/jchia/trt-copy, refer to https://github.com/jchia/trt-copy/blob/master/README.md.
The steps are:
The output of the engine-building steps indicates that concatenation is eliminated when Add is used but not when CustomAdd is used. Details are explained in the README.md.
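For reference, a minimal Python sketch of the kind of build that produces such a log is below; the plugin library path is hypothetical, and the repo's own scripts remain the authoritative repro:

```python
import ctypes
import tensorrt as trt

# Load the CustomAdd plugin library (path is hypothetical), then build with a
# verbose logger so the concatenation-elimination / striding messages appear.
ctypes.CDLL("./libcustomadd_plugin.so")

logger = trt.Logger(trt.Logger.VERBOSE)
trt.init_libnvinfer_plugins(logger, "")

builder = trt.Builder(logger)
network = builder.create_network(0)
parser = trt.OnnxParser(network, logger)
with open("sac16c.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit(1)

config = builder.create_builder_config()
config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED
engine = builder.build_serialized_network(network, config)
with open("sac16c.engine", "wb") as f:
    f.write(engine)
```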
In particular, the build log for the model with Add (sac16.onnx) contains these lines:
But for the model with CustomAdd (sac16c.onnx), there are these lines:
Commands or scripts:
Have you tried the latest release?: No
Can this model run on other frameworks? For example, run the ONNX model with ONNXRuntime (`polygraphy run <model.onnx> --onnxrt`): Haven't tried, but it runs on TensorRT, suboptimally.