Default to hardware floating-point atomics. #604
Merged
Default to 'unsafe' hardware floating-point atomics.
TL;DR: instead of emulating them via a CAS loop, use the hardware RMW (read-modify-write) instruction, which is significantly faster.
More details: link.
E.g. assembly atomic instruction before & after this PR for the following kernel:
Before (CAS loop): global_atomic_cmpswap_b32 v0, v2, v[0:1], s[0:1] glc
After (hardware atomic add): global_atomic_add_f32 v0, v1, s[0:1]
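For readers unfamiliar with the pattern that `global_atomic_cmpswap_b32` serves, here is a minimal CPU-side sketch of a CAS-loop floating-point add. It is an analogy in standard C++, not the code AMDGPU.jl generates; `cas_loop_add` and `accumulate_ones` are hypothetical names introduced only for illustration.

```cpp
#include <atomic>
#include <cassert>

// Emulated atomic float add (hypothetical helper, for illustration only):
// read the current value, compute the sum, and try to swap it in.
// If another thread changed the value in between, the swap fails and we retry.
float cas_loop_add(std::atomic<float>& target, float value) {
    float expected = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak writes the freshly observed value
    // into `expected`, so each retry recomputes the sum from up-to-date data.
    while (!target.compare_exchange_weak(expected, expected + value,
                                         std::memory_order_relaxed)) {
    }
    return expected;  // value held before the add
}

// Hypothetical driver: add 1.0f to an accumulator n times via the CAS loop.
float accumulate_ones(int n) {
    assert(n >= 0);
    std::atomic<float> acc{0.0f};
    for (int i = 0; i < n; ++i) {
        cas_loop_add(acc, 1.0f);
    }
    return acc.load();
}
```

Every update here costs a load plus at least one compare-and-swap, and more under contention. A hardware read-modify-write instruction such as `global_atomic_add_f32` performs the add in a single operation; the closest standard C++ counterpart is `fetch_add` on `std::atomic<float>`, available since C++20.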
I'm inclined to make this the default because of the huge performance increase.
On the Nerf.jl benchmark this gives a ~2x performance improvement, and on the yet-unreleased GaussianSplatting.jl a 17x boost, matching CUDA performance.
However, on a per-kernel basis this can be disabled with:
CC @luraess @OsKnoth