Default to hardware floating-point atomics. #604

pxl-th · 2024-02-27T08:50:25Z

Default to 'unsafe' hardware floating-point atomics.
TL;DR: instead of emulating them with a CAS (compare-and-swap) loop, use the hardware RMW (read-modify-write) instruction, which is significantly faster.
More details: link.

E.g. the atomic instruction in the generated assembly before and after this PR, for the following kernel:

function ker!(x)
    @inline @atomic x[1] += 1f0
    return
end

Before: global_atomic_cmpswap_b32 v0, v2, v[0:1], s[0:1] glc
After: global_atomic_add_f32 v0, v1, s[0:1]
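
For readers unfamiliar with the "before" pattern, here is a minimal CPU-side sketch of what emulating a floating-point atomic add with a CAS loop looks like. It is illustrative only, not the code AMDGPU.jl generates: `cas_add!` is a hypothetical helper name, and it uses Julia's `Base.Threads.Atomic` on the host instead of a GPU kernel. The read, compute, compare-and-swap, retry cycle is the work that a single hardware RMW instruction replaces.

```julia
using Base.Threads: Atomic, atomic_cas!

# Hypothetical helper (not part of AMDGPU.jl): atomic float add emulated
# with a compare-and-swap loop. Returns the value held before the add.
function cas_add!(a::Atomic{Float32}, v::Float32)
    old = a[]                                # initial read of the current value
    while true
        # Store old + v only if `a` still holds `old`; returns the value found.
        seen = atomic_cas!(a, old, old + v)
        seen === old && return old           # swap succeeded
        old = seen                           # another thread got there first; retry
    end
end

acc = Atomic{Float32}(0f0)
Threads.@threads for i in 1:1000
    cas_add!(acc, 1f0)
end
```

Under contention every failed swap costs another round trip to memory, which is presumably why the CAS-based version is so much slower when many threads update the same location.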

I'm inclined to make this the default because of the huge performance increase.
On the Nerf.jl benchmark this gives a ~2x performance improvement, and on the yet-unreleased GaussianSplatting.jl a 17x boost, matching CUDA performance.

However, on a per-kernel basis this can be disabled with:

@roc unsafe_fp_atomics=false f(...)

CC @luraess @OsKnoth

luraess · 2024-02-27T09:33:24Z

cc: @albert-de-montserrat

Default to hardware floating-point atomics.

72bac30

Fix

39d1660

pxl-th merged commit dbad788 into master Feb 27, 2024
1 check was pending

pxl-th deleted the pxl-th/unsafe-atomics branch February 27, 2024 15:15

pxl-th mentioned this pull request Feb 29, 2024

@atomic is slow within AMDGPU.jl #569

Closed
