Lower Vector128/256/512.Shuffle to SHUFPS, VSHUFPS, VSHUFPD, VPERMILPS, ... when possible #105908
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Thanks for opening the issue here, I've assigned it to myself as this is an area I'm already working in/around. Some things to note...
This is no longer true; it is fixed in .NET 9.
This is only true for very simplistic cases. If you're doing this in a loop, for example, the indices parameter will be hoisted and come from a register instead. The CPU is also likely to prefetch and cache these RIP-relative constants as well, resulting in the memory load being effectively erased for typical code.
It's not as easy as one might think. There are many aspects that come into play, including register selection, neighboring instructions/ports, dependency chains, whether a memory load can be hoisted, whether new encoding semantics relevant to EVEX are possible, whether it crosses lanes or not, how it impacts loop and general code alignment, etc. There are some cases where we can produce better instruction selection and those are already known/being tracked; they just couldn't be completed in .NET 9. But, in general, this will be done by taking into account the right balance between complexity, maintainability, code size, and code perf across a range of microarchitectures; it may not always be what appears to have the fewest theoretical cycles.
Thank you for working on this, and improving things! :)
Just for context: in my case, I'm not using it in a loop. I also saw many cases where smaller code is actually better. I think the shown case is clear; this is why I provided benchmarks.
The following applies to `Vector128`, `Vector256`, and `Vector512`, but I will discuss only the `Vector256` case. Let's consider this example, which permutes a `Vector256`:
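For illustration, the semantics of an index-vector shuffle can be modeled in Python. The indices below are hypothetical (the issue's original snippet is not reproduced here); they are chosen so that both 128-bit lanes use the same 4-element pattern, which is the property that makes cheaper instructions applicable later on.

```python
# Model of Vector256<float>.Shuffle(v, indices): each output element is
# selected from the input by the corresponding index.
def shuffle(vector, indices):
    return [vector[i] for i in indices]

v = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
# Same pattern (2, 3, 0, 1) in the low and the high 128-bit lane:
indices = [2, 3, 0, 1, 6, 7, 4, 5]
print(shuffle(v, indices))  # [12.0, 13.0, 10.0, 11.0, 16.0, 17.0, 14.0, 15.0]
```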
Actual Behavior
The runtime used the `VPERM` instruction to permute the elements, which takes a vector of indices as a parameter. This means we have to load the constant from memory.

Expected Behavior
There are other, faster instructions (SHUFPD, SHUFPS, PSHUFD, VPSHUFD, VSHUFPS, VSHUFPD, VPERMILPS, VPERMILPD, VPERM2I128, VPERM2F128, VSHUFF64x2, VSHUFF32x4, VSHUFI32x4, VSHUFI64x2) which don't offer quite the same flexibility as the VPERM instruction, but should be emitted when they are applicable. These instructions take an imm8 control byte to describe the permutation. It is easy to detect whether one of these cheaper instructions is usable. In the case shown above, `SHUFPS` is applicable, because the permutations in the first four elements and the last four elements are the same.

When I manually rewrite the function to use `SHUFPS`, the runtime produces this faster and smaller code:
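The "easy to detect" claim above can be sketched in Python. This models a VPERMILPS-style check (single source, in-lane permute with an imm8 control byte, where each pair of bits selects one of the four elements within a 128-bit lane); it is an illustration of the detection idea, not the actual JIT logic.

```python
# Sketch: detect whether an 8-element shuffle is encodable as a single
# in-lane imm8 permute (VPERMILPS-style). Returns the imm8 control byte,
# or None when the two 128-bit lanes do not share the same pattern.
def permilps_imm8(indices):
    low, high = indices[:4], indices[4:]
    # All low-lane indices must stay in the low lane, and the high lane
    # must replicate the same in-lane pattern (offset by 4).
    if any(i > 3 for i in low) or [i - 4 for i in high] != low:
        return None
    imm8 = 0
    for pos, idx in enumerate(low):
        imm8 |= idx << (2 * pos)  # 2 control bits per output position
    return imm8

print(hex(permilps_imm8([2, 3, 0, 1, 6, 7, 4, 5])))  # 0x4e
print(permilps_imm8([0, 1, 2, 3, 3, 2, 1, 0]))       # None (lanes differ)
```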
In other cases the `.Shuffle` can be replaced with `Avx512F.Permute4x64`, `Avx512F.Permute4x32`, `Sse.Shuffle`, ...

Stretch Goal
Reversing a vector with `Vector256.Shuffle(v, Vector256.Create(7, 6, 5, 4, 3, 2, 1, 0));` could be special-cased. This runs approx. twice as fast on my machine, and is used in some places like `SpanHelpers.Reverse`.
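One plausible lowering for the reversal shuffle is a two-step sequence: swap the two 128-bit lanes (VPERM2F128-style), then reverse within each lane with an in-lane imm8 permute (VPERMILPS-style, pattern 3, 2, 1, 0). The Python below models the semantics of that decomposition only; the instruction choice the JIT would actually make is an assumption here, not confirmed by the issue.

```python
# Step 1: swap the low and high 128-bit lanes (4 floats each).
def swap_lanes(v):
    return v[4:] + v[:4]

# Step 2: reverse the 4 elements within each 128-bit lane.
def reverse_in_lanes(v):
    return v[:4][::-1] + v[4:][::-1]

v = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(reverse_in_lanes(swap_lanes(v)))  # [7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0]
```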
Effects
In my implementation of a SIMD-accelerated bitonic sorting network I saw a performance uplift of approx. 20% and a significant code size reduction.
Benchmark code
Context
- `Vector.Shuffle` is preferable over the x64-specific `Avx.Shuffle`. Similar goals are outlined in "Switching to Vectors from target dependent intrinsics" #101251.
- `Vector.Create` has to be called explicitly for each `Vector.Shuffle` call because otherwise suboptimal code is generated ("Vector256.Shuffle does not produce optimal codegen" #72793).