
scene_detect not work: TensorRT EP execution context enqueue failed #74

Sg4Dylan opened this issue Aug 16, 2024 · 6 comments

@Sg4Dylan

Environment:

  • OS: Ubuntu Server 22.04 LTS
  • GPU: H100*2
  • docker-ce: 5:27.1.2
  • nvidia-container-toolkit: 1.16.1
  • image: styler00dollar/vsgan_tensorrt:latest (08/15/2024)
  • commit: 41f25e6

Code:

clip_sc = scene_detect(
    clip,
    fp16=True,
    thresh=0.985,
    model=3,  # same with model=12; a recompiled engine still doesn't work
    num_sessions=6  # same with num_sessions=1 or 2
)

Log:

2024-08-16 05:40:41.192673822 [E:onnxruntime:Default, tensorrt_execution_provider.h:84 log] [2024-08-16 05:40:41   ERROR] IExecutionContext::enqueueV3: Error Code 1: Cuda Runtime (invalid resource handle)
2024-08-16 05:40:41.192758318 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_main_graph_3554867279417518500_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_main_graph_3554867279417518500_0_0' Status Message: TensorRT EP execution context enqueue failed.
2024-08-16 05:40:41.261624123 [E:onnxruntime:Default, tensorrt_execution_provider.h:84 log] [2024-08-16 05:40:41   ERROR] IExecutionContext::enqueueV3: Error Code 1: Cuda Runtime (invalid resource handle)
2024-08-16 05:40:41.261704713 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_main_graph_3554867279417518500_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_main_graph_3554867279417518500_0_0' Status Message: TensorRT EP execution context enqueue failed.
2024-08-16 05:40:41.288012812 [E:onnxruntime:Default, tensorrt_execution_provider.h:84 log] [2024-08-16 05:40:41   ERROR] IExecutionContext::enqueueV3: Error Code 1: Cuda Runtime (invalid resource handle)
2024-08-16 05:40:41.288064561 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_main_graph_3554867279417518500_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_main_graph_3554867279417518500_0_0' Status Message: TensorRT EP execution context enqueue failed.
Error: Failed to retrieve frame 34 with error: 
Traceback (most recent call last):
  File "src/cython/vapoursynth.pyx", line 3216, in vapoursynth.publicFunction
  File "src/cython/vapoursynth.pyx", line 3218, in vapoursynth.publicFunction
  File "src/cython/vapoursynth.pyx", line 834, in vapoursynth.FuncData.__call__
  File "/workspace/tensorrt/src/scene_detect.py", line 175, in execute
    result = ort_session.run(None, {"input": in_sess})[0][0]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, input_feed, run_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running TRTKernel_graph_main_graph_3554867279417518500_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_main_graph_3554867279417518500_0_0' Status Message: TensorRT EP execution context enqueue failed.

It works when I uncomment CUDAExecutionProvider, but the speed is only half of trt8.6:

providers=[
    ("TensorrtExecutionProvider", options),
    "CUDAExecutionProvider",
],
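For reference, a minimal sketch of how that provider list is consumed by onnxruntime (a sketch, not the repo's exact code; the options values are only illustrative, the real ones appear later in the thread): at session creation the graph is partitioned, with subgraphs the TensorRT EP supports going to TensorRT and everything else falling back to the CUDA EP.

import onnxruntime as ort

# Illustrative TRT EP options (the repo's actual values are shown later in the thread)
options = {
    "device_id": 0,
    "trt_fp16_enable": True,
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": "/workspace/tensorrt",
}

# At session creation onnxruntime partitions the graph: subgraphs supported by
# the TensorRT EP run on TensorRT, the rest fall back to the CUDA EP.
ort_session = ort.InferenceSession(
    "sc_efficientnetv2b0+rife46_flow_1362_256_CHW_6ch_clamp_softmax_op17_fp16_sim.onnx",
    providers=[
        ("TensorrtExecutionProvider", options),
        "CUDAExecutionProvider",
    ],
)
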
@styler00dollar
Owner

It works fine on my 4090.

clip = scene_detect(
    clip,
    fp16=True,
    thresh=0.985,
    model=3,
    num_sessions=6
)

vspipe inference.py -p .
downloading: sc_efficientnetv2b0+rife46_flow_1362_256_CHW_6ch_clamp_softmax_op17_fp16_sim.onnx
100% [........................................................................] 22986288 / 22986288
Script evaluation done in 266.09 seconds
Output 2210 frames in 2.70 seconds (819.48 fps)

Did you delete all temp data? Certain files are not compatible across multiple TensorRT versions. Otherwise this may be multi-gpu related. Try setting num_sessions to 1. If that doesn't help, you could try adjusting parameters.

options = {}
options["device_id"] = 0
options["trt_engine_cache_enable"] = True
options["trt_timing_cache_enable"] = (
    True  # use the TensorRT timing cache to accelerate engine build time on a device with the same compute capability
)
options["trt_engine_cache_path"] = (
    "/workspace/tensorrt"  # "/home/user/Schreibtisch/VSGAN-tensorrt-docker/"
)
options["trt_fp16_enable"] = fp16
options["trt_max_workspace_size"] = 7000000000  # ~7 GB
options["trt_builder_optimization_level"] = 5

Since I only have one GPU I can't test this kind of setup. The latest docker also currently has trt10.3, not trt8.6.
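One thing that might be worth ruling out on the 2x H100 box (an untested guess, nothing confirmed in this thread): the CUDA runtime's "invalid resource handle" error is often a stream, event or context being used on a different device than the one it was created on. Pinning the whole process to a single GPU before anything CUDA-related initializes would at least show whether this is multi-GPU related.

import os

# Must be set before onnxruntime / CUDA initialize the GPU in this process.
# Hides the second H100 so the TRT engine, streams and contexts all live on
# the same physical device.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

options = {}
options["device_id"] = 0  # the only device still visible to the process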

@Sg4Dylan
Author

Thanks,

Did you delete all temp data? Certain files are not compatible across multiple TensorRT versions. Otherwise this may be multi-gpu related. Try setting num_sessions to 1. If that doesn't help, you could try adjusting parameters.

I've deleted the entire TRT cache and recompiled the engine when I started using the trt10.3 image (since an engine compiled with trt9.3 doesn't work with trt10.3).

In addition to the parameters mentioned at the beginning of the issue (model=3, num_sessions=1/2), I observed that on trt9.3 environments model=3 does not additionally require uncommenting CUDAExecutionProvider (model=12 still requires it, which is strange).
Also, the performance issues mentioned above may be caused by numactl; this needs further testing.

@Sg4Dylan
Author

Sg4Dylan commented Sep 6, 2024

Also, the performance issues mentioned above may be caused by numactl; this needs further testing.

The most likely cause of the performance problem turned out to be that the HPE BIOS did not have Turbo Mode or Legacy P-states enabled. With both enabled, performance reaches 90-95% of the Supermicro machine used as a baseline.


As for the TensorRT EP execution context enqueue failed issue, it may be caused by an old driver version (535.183.06); this needs further verification.

@Sg4Dylan
Author

As for the TensorRT EP execution context enqueue failed issue, it may be caused by an old driver version (535.183.06); this needs further verification.

Unfortunately this machine doesn't seem to be able to install drivers newer than version 535 (both 550 & 560 fail to install).

It works when I uncomment CUDAExecutionProvider, but the speed is only half of trt8.6:

providers=[
    ("TensorrtExecutionProvider", options),
    "CUDAExecutionProvider",
],

This method works on the latest commit with trt9.3, at almost the same performance as the older trt8.6.


On the latest trt10.4 the following error appears, which requires further troubleshooting:

2024-09-18 09:50:57.563774258 [E:onnxruntime:Default, tensorrt_execution_provider.h:88 log] [2024-09-18 09:50:57 ERROR] [timingCache.cpp::validate::905] Error Code 4: API Usage Error (Timing cache header mismatch:Incoming ITimingCache: UUID = GPU-bda92872-cdda-4070-9e11-7a17551ac1ea, commit = 09a8e4a4fa08dc04
Runtime device: UUID = GPU-3ad2bd86-5418-416e-b26e-8302494b6d41, commit = 72b2487a08d97679
The incoming cache will not be used!)
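That timing-cache message means the cached timing file was produced on a different GPU (note the two different UUIDs) and/or TensorRT build, so TensorRT refuses to reuse it; on its own it may only make the engine build slower rather than break inference. Two untested ways to get rid of it, reusing the option names from the snippet earlier in the thread (whether this is related to the enqueue failure at all is an assumption):

# Either stop reading/writing the timing cache entirely ...
options["trt_timing_cache_enable"] = False

# ... or delete the stale timing-cache file from trt_engine_cache_path
# ("/workspace/tensorrt" here; the exact file name depends on the ORT/TRT
# version) so it gets regenerated on the current GPU.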

@styler00dollar
Owner

Unfortunately this machine doesn't seem to be able to install drivers newer than version 535 (both 550 & 560 fail to install).

I use Nvidia driver 560 for my current 10.4 docker because otherwise the docker doesn't even start for me. 550 is too old.

nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.5, please update your driver to a newer version, or use an earlier cuda container: unknown

@Sg4Dylan
Author

Unfortunately this machine doesn't seem to be able to install drivers newer than version 535 (both 550 & 560 fail to install).

After installing the latest BIOS/VBIOS, the 550/560 drivers work fine.


It works when I uncomment CUDAExecutionProvider, but the speed is only half of trt8.6:

providers=[
    ("TensorrtExecutionProvider", options),
    "CUDAExecutionProvider",
],

This method is still necessary on the latest commit with trt9.3 & trt10.4 (drivers 550 & 560); otherwise the failure shown in the title occurs.
Also, some videos hang the engine on the latest commit (100% GPU utilization with no output).
