
scene_detect not work: TensorRT EP execution context enqueue failed #74

Sg4Dylan opened this issue Aug 16, 2024 · 6 comments

@Sg4Dylan

Environment:

  • OS: Ubuntu Server 22.04 LTS
  • GPU: H100*2
  • docker-ce: 5:27.1.2
  • nvidia-container-toolkit: 1.16.1
  • image: styler00dollar/vsgan_tensorrt:latest (08/15/2024)
  • commit: 41f25e6

Code:

clip_sc = scene_detect(
    clip,
    fp16=True,
    thresh=0.985,
    model=3,  # same with model=12; a recompiled engine still doesn't work
    num_sessions=6  # same with num_sessions=1 or 2
)

Log:

2024-08-16 05:40:41.192673822 [E:onnxruntime:Default, tensorrt_execution_provider.h:84 log] [2024-08-16 05:40:41   ERROR] IExecutionContext::enqueueV3: Error Code 1: Cuda Runtime (invalid resource handle)
2024-08-16 05:40:41.192758318 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_main_graph_3554867279417518500_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_main_graph_3554867279417518500_0_0' Status Message: TensorRT EP execution context enqueue failed.
2024-08-16 05:40:41.261624123 [E:onnxruntime:Default, tensorrt_execution_provider.h:84 log] [2024-08-16 05:40:41   ERROR] IExecutionContext::enqueueV3: Error Code 1: Cuda Runtime (invalid resource handle)
2024-08-16 05:40:41.261704713 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_main_graph_3554867279417518500_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_main_graph_3554867279417518500_0_0' Status Message: TensorRT EP execution context enqueue failed.
2024-08-16 05:40:41.288012812 [E:onnxruntime:Default, tensorrt_execution_provider.h:84 log] [2024-08-16 05:40:41   ERROR] IExecutionContext::enqueueV3: Error Code 1: Cuda Runtime (invalid resource handle)
2024-08-16 05:40:41.288064561 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_main_graph_3554867279417518500_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_main_graph_3554867279417518500_0_0' Status Message: TensorRT EP execution context enqueue failed.
Error: Failed to retrieve frame 34 with error: 
Traceback (most recent call last):
  File "src/cython/vapoursynth.pyx", line 3216, in vapoursynth.publicFunction
  File "src/cython/vapoursynth.pyx", line 3218, in vapoursynth.publicFunction
  File "src/cython/vapoursynth.pyx", line 834, in vapoursynth.FuncData.__call__
  File "/workspace/tensorrt/src/scene_detect.py", line 175, in execute
    result = ort_session.run(None, {"input": in_sess})[0][0]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, input_feed, run_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running TRTKernel_graph_main_graph_3554867279417518500_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_main_graph_3554867279417518500_0_0' Status Message: TensorRT EP execution context enqueue failed.

It works when I uncomment CUDAExecutionProvider, but the speed is only half of trt8.6:

providers=[
    ("TensorrtExecutionProvider", options),
    "CUDAExecutionProvider",
],
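For reference, a minimal sketch of how that provider list is consumed by onnxruntime (a sketch, not the repo's exact code; the options values are only illustrative, the real ones appear later in the thread): at session creation the graph is partitioned, with subgraphs the TensorRT EP supports going to TensorRT and everything else falling back to the CUDA EP.

import onnxruntime as ort

# Illustrative TRT EP options (the repo's actual values are shown later in the thread)
options = {
    "device_id": 0,
    "trt_fp16_enable": True,
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": "/workspace/tensorrt",
}

# At session creation onnxruntime partitions the graph: subgraphs supported by
# the TensorRT EP run on TensorRT, the rest fall back to the CUDA EP.
ort_session = ort.InferenceSession(
    "sc_efficientnetv2b0+rife46_flow_1362_256_CHW_6ch_clamp_softmax_op17_fp16_sim.onnx",
    providers=[
        ("TensorrtExecutionProvider", options),
        "CUDAExecutionProvider",
    ],
)
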
@styler00dollar
Owner

It works fine on my 4090.

clip = scene_detect(
    clip,
    fp16=True,
    thresh=0.985,
    model=3,
    num_sessions=6
)

vspipe inference.py -p .
downloading: sc_efficientnetv2b0+rife46_flow_1362_256_CHW_6ch_clamp_softmax_op17_fp16_sim.onnx
100% [........................................................................] 22986288 / 22986288
Script evaluation done in 266.09 seconds
Output 2210 frames in 2.70 seconds (819.48 fps)

Did you delete all temp data? Certain files are not compatible across multiple TensorRT versions. Otherwise this may be multi-gpu related. Try setting num_sessions to 1. If that doesn't help, you could try adjusting parameters.

options = {}
options["device_id"] = 0
options["trt_engine_cache_enable"] = True
options["trt_timing_cache_enable"] = (
    True  # use the TensorRT timing cache to accelerate engine build time on a device with the same compute capability
)
options["trt_engine_cache_path"] = (
    "/workspace/tensorrt"  # "/home/user/Schreibtisch/VSGAN-tensorrt-docker/"
)
options["trt_fp16_enable"] = fp16
options["trt_max_workspace_size"] = 7000000000  # ~7 GB
options["trt_builder_optimization_level"] = 5

Since I only have one GPU I can't test this kind of setup. The latest docker also currently has trt10.3, not trt8.6.
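One thing that might be worth ruling out on the 2x H100 box (an untested guess, nothing confirmed in this thread): the CUDA runtime's "invalid resource handle" error is often a stream, event or context being used on a different device than the one it was created on. Pinning the whole process to a single GPU before anything CUDA-related initializes would at least show whether this is multi-GPU related.

import os

# Must be set before onnxruntime / CUDA initialize the GPU in this process.
# Hides the second H100 so the TRT engine, streams and contexts all live on
# the same physical device.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

options = {}
options["device_id"] = 0  # the only device still visible to the process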

@Sg4Dylan
Author

Thanks,

Did you delete all temp data? Certain files are not compatible across multiple TensorRT versions. Otherwise this may be multi-gpu related. Try setting num_sessions to 1. If that doesn't help, you could try adjusting parameters.

I've deleted the entire TRT cache and recompiled the engine when I started using the trt10.3 image (since an engine compiled with trt9.3 doesn't work with trt10.3).

In addition to the parameters mentioned at the beginning of the issue (model=3, num_sessions=1/2), I observed that on trt9.3 environments model=3 does not additionally require uncommenting CUDAExecutionProvider (model=12 still requires it, which is strange).
Also, the performance issues mentioned above may be caused by numactl; this needs further testing.

@Sg4Dylan
Author

Sg4Dylan commented Sep 6, 2024

Also, the performance issues mentioned above may be caused by numactl; this needs further testing.

The most likely cause of the performance problem turned out to be that the HPE BIOS did not have Turbo Mode or Legacy P-states enabled. With both enabled, performance reaches 90-95% of the Supermicro machine used as a baseline.


As for the TensorRT EP execution context enqueue failed issue, it may be caused by an old driver version (535.183.06); this needs further verification.

@Sg4Dylan
Author

As for the TensorRT EP execution context enqueue failed issue, it may be caused by an old driver version (535.183.06); this needs further verification.

Unfortunately this machine doesn't seem to be able to install drivers newer than version 535 (both 550 & 560 fail to install).

It works when I uncomment CUDAExecutionProvider, but the speed is only half of trt8.6:

providers=[
    ("TensorrtExecutionProvider", options),
    "CUDAExecutionProvider",
],

This method works on the latest commit with trt9.3, at almost the same performance as the older trt8.6.


On the latest trt10.4 the following error appears, which requires further troubleshooting:

2024-09-18 09:50:57.563774258 [E:onnxruntime:Default, tensorrt_execution_provider.h:88 log] [2024-09-18 09:50:57 ERROR] [timingCache.cpp::validate::905] Error Code 4: API Usage Error (Timing cache header mismatch:Incoming ITimingCache: UUID = GPU-bda92872-cdda-4070-9e11-7a17551ac1ea, commit = 09a8e4a4fa08dc04
Runtime device: UUID = GPU-3ad2bd86-5418-416e-b26e-8302494b6d41, commit = 72b2487a08d97679
The incoming cache will not be used!)
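That timing-cache message means the cached timing file was produced on a different GPU (note the two different UUIDs) and/or TensorRT build, so TensorRT refuses to reuse it; on its own it may only make the engine build slower rather than break inference. Two untested ways to get rid of it, reusing the option names from the snippet earlier in the thread (whether this is related to the enqueue failure at all is an assumption):

# Either stop reading/writing the timing cache entirely ...
options["trt_timing_cache_enable"] = False

# ... or delete the stale timing-cache file from trt_engine_cache_path
# ("/workspace/tensorrt" here; the exact file name depends on the ORT/TRT
# version) so it gets regenerated on the current GPU.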

@styler00dollar
Owner

Unfortunately this machine doesn't seem to be able to install drivers newer than version 535 (both 550 & 560 fail to install).

I use Nvidia driver 560 for my current 10.4 docker because otherwise the docker doesn't even start for me. 550 is too old.

nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.5, please update your driver to a newer version, or use an earlier cuda container: unknown

@Sg4Dylan
Author

Unfortunately this machine doesn't seem to be able to install drivers newer than version 535 (both 550 & 560 fail to install).

After installing the latest BIOS/VBIOS, the 550/560 drivers work fine.


It works when I uncomment CUDAExecutionProvider, but the speed is only half of trt8.6:

providers=[
    ("TensorrtExecutionProvider", options),
    "CUDAExecutionProvider",
],

This method is still necessary on the latest commit with trt9.3 & trt10.4 (drivers 550 & 560); otherwise the failure shown in the title occurs.
Also, some videos hang the engine on the latest commit (100% GPU utilization with no output).
