
Batch size >= 65536 in xformers.ops.memory_efficient_attention gives CUDA error. #845

Open
comfyanonymous opened this issue Sep 3, 2023 · 9 comments
Labels: bug (Something isn't working)

Comments

@comfyanonymous

🐛 Bug

xformers raises a CUDA error like the following when the batch size is greater than or equal to 65536:

RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


To Reproduce

Steps to reproduce the behavior:

import xformers
import xformers.ops
import torch

q = torch.zeros([65536, 16, 80]).cuda()
k = torch.zeros([65536, 16, 80]).cuda()
v = torch.zeros([65536, 16, 80]).cuda()
out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None)

Expected behavior

Raise a NotImplementedError or a ValueError if the input sizes are not supported.

Environment

I can reproduce this with the above code on my 3090 Ti with xformers 0.0.21, and on the T4 GPU on free Google Colab with xformers 0.0.22.dev599.

@danthe3rd added the bug (Something isn't working) label on Sep 4, 2023
@danthe3rd (Contributor)

Hi,
Thanks for reporting this bug! We'll try to get this fixed asap.

@continue-revolution

Can confirm this happens to me as well (AnimateDiff) for xformers >= 0.0.21

If I run with xformers==0.0.20, things work well

@Yard1 commented Oct 5, 2023

Also running into this issue.

@xzqjack commented Oct 18, 2023

Same error here.

@samiede commented Oct 25, 2023

I do have the same issue as well!

@dianyo commented Dec 12, 2023

Hi @danthe3rd,

I've traced this back a bit into the CUDA code here. The problem seems to be that the batch size used in the attention layer determines how many blocks/threads are launched on the GPU, and when that exceeds what the GPU can support (by my reading, an A100 supports at most 32 x 2048 = 65536), the error occurs.

I also took a quick look at the PyTorch source code and found that it keeps a constraint constant (MAX_BLOCK_SIZE) to deal with large resource requests. Using similar logic might solve this issue.
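
As a rough illustration of that idea, a guard along these lines (the helper name `check_batch_size`, the constant, and the exact limit are assumptions, not existing xformers code) would turn the CUDA launch failure into the ValueError requested in the issue:

import torch

# Hypothetical pre-launch check; MAX_BATCH mirrors the 65535 cap on
# gridDim.z, which is the grid dimension the batch is mapped onto.
MAX_BATCH = 65535

def check_batch_size(q: torch.Tensor) -> None:
    # xformers expects the batch dimension first ([B, M, K] or [B, M, H, K]).
    if q.shape[0] > MAX_BATCH:
        raise ValueError(
            f"batch size {q.shape[0]} exceeds the supported maximum of "
            f"{MAX_BATCH}; split the input into smaller sub-batches"
        )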

@danthe3rd (Contributor)

Hey,
So if you want to have a look, this is because we run many blocks in parallel across 3 dimensions (x, y, z), and there is a limit of 65535 for dimensions y and z (source).
As you can see, we use dimension x for the number of queries, dimension y for the number of heads, and dimension z for the batch size.

__host__ dim3 getBlocksGrid() const {
  return dim3(
      ceil_div(num_queries, (int32_t)kQueriesPerBlock),
      num_heads,
      num_batches);
}

A proper solution would be to "flatten" these dimensions into the x axis and replace each occurrence of blockIdx.[x,y,z] and gridDim.[x,y,z] in the code. You would also have to do the same for Flash-Attention, so this would be a bit more complicated...
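
To make the numbers concrete, here is a small Python model of the mapping described above (queries_per_block and the example shapes are illustrative, not the exact kernel constants); the flattened variant shows how collapsing everything into the x dimension stays within the hardware limits:

# CUDA caps gridDim.y and gridDim.z at 65535, while gridDim.x allows 2**31 - 1
# on current GPUs.
CUDA_MAX_GRID_YZ = 65535

def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b

def blocks_grid(num_queries, num_heads, num_batches, queries_per_block=64):
    # Current mapping: x = query blocks, y = heads, z = batch.
    return (ceil_div(num_queries, queries_per_block), num_heads, num_batches)

def flattened_grid(num_queries, num_heads, num_batches, queries_per_block=64):
    # Proposed fix: fold all three axes into x; the kernel would recover the
    # (query block, head, batch) indices from blockIdx.x via div/mod.
    x = ceil_div(num_queries, queries_per_block) * num_heads * num_batches
    return (x, 1, 1)

grid = blocks_grid(num_queries=16, num_heads=16, num_batches=65536)
print(grid)                                         # (1, 16, 65536) -> gridDim.z too large
print(any(d > CUDA_MAX_GRID_YZ for d in grid[1:]))  # True
print(flattened_grid(16, 16, 65536))                # (1048576, 1, 1) -> fits in gridDim.x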

@Ir1d commented Apr 18, 2024

Hi, what is the workaround for this issue?

@guolinke commented Jun 1, 2024

A fast workaround is to split the input into several smaller sub-batches, each with a batch size below 65536.
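
For reference, a minimal sketch of that workaround (the helper name and chunk size are illustrative; it assumes attn_bias is None or has no batch dimension, so it can be passed through unchanged):

import torch
import xformers.ops

def chunked_memory_efficient_attention(q, k, v, attn_bias=None, max_batch=32768):
    # Run attention on sub-batches small enough for the kernel's grid limits,
    # then stitch the outputs back together along the batch axis.
    outs = []
    for start in range(0, q.shape[0], max_batch):
        end = start + max_batch
        outs.append(
            xformers.ops.memory_efficient_attention(
                q[start:end], k[start:end], v[start:end], attn_bias=attn_bias
            )
        )
    return torch.cat(outs, dim=0)

# Usage with the tensors from the reproduction above:
# out = chunked_memory_efficient_attention(q, k, v)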
