Lower Vector128/256/512.Shuffle to SHUFPS, VSHUFPS, VSHUFPD, VPERMILPS, ... when possible #105908
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Thanks for opening the issue here, I've assigned it to myself as this is an area I'm already working in/around. Some things to note...
This is no longer true; it is fixed in .NET 9.
This is only true for very simplistic cases. If you're doing this in a loop, for example, the indices parameter will be hoisted and come from a register instead. The CPU is also likely to prefetch and cache these RIP-relative constants as well, resulting in the memory load being effectively erased for typical code.
It's not as easy as one might think. There are many aspects that come into play, including register selection, neighboring instructions/ports, dependency chains, whether a memory load can be hoisted, whether new encoding semantics relevant to EVEX are possible, whether it crosses lanes or not, how it impacts loop and general code alignment, etc. There are some cases where we can produce better instruction selection and those are already known/being tracked; they just couldn't be completed in .NET 9. But, in general, this will be done by taking into account the right balance between complexity, maintainability, code size, and code perf across a range of microarchitectures; it may not always be what appears to have the fewest theoretical cycles.
Thank you for working on this, and improving things! :)
Just for context: in my case, I'm not using it in a loop. I also saw many cases where smaller code is actually better. I think the shown case is clear; this is why I provided benchmarks.
The following applies to `Vector128`, `Vector256`, and `Vector512`, but I will discuss only the `Vector256` case. Let's consider this example, which permutes a `Vector256`:
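For illustration, the semantics of an index-vector shuffle can be modeled in Python. The indices below are hypothetical (the issue's original snippet is not reproduced here); they are chosen so that both 128-bit lanes use the same 4-element pattern, which is the property that makes cheaper instructions applicable later on.

```python
# Model of Vector256<float>.Shuffle(v, indices): each output element is
# selected from the input by the corresponding index.
def shuffle(vector, indices):
    return [vector[i] for i in indices]

v = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
# Same pattern (2, 3, 0, 1) in the low and the high 128-bit lane:
indices = [2, 3, 0, 1, 6, 7, 4, 5]
print(shuffle(v, indices))  # [12.0, 13.0, 10.0, 11.0, 16.0, 17.0, 14.0, 15.0]
```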
Actual Behavior
The runtime used the `VPERM` instruction to permute the elements, which takes a vector of indices as a parameter. This means we have to load the constant from memory.

Expected Behavior
There are other, faster instructions (SHUFPD, SHUFPS, PSHUFD, VPSHUFD, VSHUFPS, VSHUFPD, VPERMILPS, VPERMILPD, VPERM2I128, VPERM2F128, VSHUFF64x2, VSHUFF32x4, VSHUFI32x4, VSHUFI64x2) which don't offer quite the same flexibility as the VPERM instruction, but should be emitted when they are applicable. These instructions take an imm8 control byte to describe the permutation. It is easy to detect whether one of these cheaper instructions is usable. In the case shown above, `SHUFPS` is applicable, because the permutations in the first four elements and the last four elements are the same.

When I manually rewrite the function to use `SHUFPS`, the runtime produces this faster and smaller code:
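The "easy to detect" claim above can be sketched in Python. This models a VPERMILPS-style check (single source, in-lane permute with an imm8 control byte, where each pair of bits selects one of the four elements within a 128-bit lane); it is an illustration of the detection idea, not the actual JIT logic.

```python
# Sketch: detect whether an 8-element shuffle is encodable as a single
# in-lane imm8 permute (VPERMILPS-style). Returns the imm8 control byte,
# or None when the two 128-bit lanes do not share the same pattern.
def permilps_imm8(indices):
    low, high = indices[:4], indices[4:]
    # All low-lane indices must stay in the low lane, and the high lane
    # must replicate the same in-lane pattern (offset by 4).
    if any(i > 3 for i in low) or [i - 4 for i in high] != low:
        return None
    imm8 = 0
    for pos, idx in enumerate(low):
        imm8 |= idx << (2 * pos)  # 2 control bits per output position
    return imm8

print(hex(permilps_imm8([2, 3, 0, 1, 6, 7, 4, 5])))  # 0x4e
print(permilps_imm8([0, 1, 2, 3, 3, 2, 1, 0]))       # None (lanes differ)
```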
In other cases the `.Shuffle` can be replaced with `Avx512F.Permute4x64`, `Avx512F.Permute4x32`, `Sse.Shuffle`, ...

Stretch Goal
Reversing a vector with `Vector256.Shuffle(v, Vector256.Create(7, 6, 5, 4, 3, 2, 1, 0));` could be special-cased. This runs approx. twice as fast on my machine, and is used in some places like `SpanHelpers.Reverse`.
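One plausible lowering for the reversal shuffle is a two-step sequence: swap the two 128-bit lanes (VPERM2F128-style), then reverse within each lane with an in-lane imm8 permute (VPERMILPS-style, pattern 3, 2, 1, 0). The Python below models the semantics of that decomposition only; the instruction choice the JIT would actually make is an assumption here, not confirmed by the issue.

```python
# Step 1: swap the low and high 128-bit lanes (4 floats each).
def swap_lanes(v):
    return v[4:] + v[:4]

# Step 2: reverse the 4 elements within each 128-bit lane.
def reverse_in_lanes(v):
    return v[:4][::-1] + v[4:][::-1]

v = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(reverse_in_lanes(swap_lanes(v)))  # [7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0]
```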
Effects
In my implementation of a SIMD-accelerated bitonic sorting network I saw a performance uplift of approx. 20% and a significant code size reduction.
Benchmark code
Context
- `Vector.Shuffle` is preferable over the x64-specific `Avx.Shuffle`. Similar goals are outlined in "Switching to Vectors from target dependent intrinsics" #101251.
- `Vector.Create` has to be called explicitly for each `Vector.Shuffle` call because otherwise suboptimal code is generated ("Vector256.Shuffle does not produce optimal codegen" #72793).