Switching to Vectors from target dependent instrinsics #101251

Open
DeepakRajendrakumaran opened this issue Apr 18, 2024 · 2 comments

Comments

@DeepakRajendrakumaran
Contributor

As part of the effort to convert target-dependent intrinsics in the .NET libraries to target-independent Vector* functions, I went through the intrinsics used in the .NET libraries. The list below covers what I found, along with some possible options for switching to cross-platform vectors if we either expand the Vector API or have the JIT optimize certain patterns where multiple Vector functions can achieve the same result.

  1. Base64 Encoder/Decoder
    a. Has AVX512 path
    b.

    private static unsafe OperationStatus DecodeFromUtf8(ReadOnlySpan<byte> utf8, Span<byte> bytes, out int bytesConsumed, out int bytesWritten, bool isFinalBlock, bool ignoreWhiteSpace)

    c.
    public static unsafe OperationStatus EncodeToUtf8(ReadOnlySpan<byte> bytes, Span<byte> utf8, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true)

    d. Cannot convert everything to Vector* without expanding Vector surface area

  2. ProbabilisticMap
    a. Has AVX512 paths
    b.

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    [CompExactlyDependsOn(typeof(Avx512Vbmi))]
    private static Vector512<byte> ContainsMask64CharsAvx512(Vector512<byte> charMap, ref char searchSpace0, ref char searchSpace1)
    {
        Vector512<ushort> source0 = Vector512.LoadUnsafe(ref searchSpace0);
        Vector512<ushort> source1 = Vector512.LoadUnsafe(ref searchSpace1);
        Vector512<byte> sourceLower = Avx512BW.PackUnsignedSaturate(
            (source0 & Vector512.Create((ushort)255)).AsInt16(),
            (source1 & Vector512.Create((ushort)255)).AsInt16());
        Vector512<byte> sourceUpper = Avx512BW.PackUnsignedSaturate(
            (source0 >>> 8).AsInt16(),
            (source1 >>> 8).AsInt16());
        Vector512<byte> resultLower = IsCharBitNotSetAvx512(charMap, sourceLower);
        Vector512<byte> resultUpper = IsCharBitNotSetAvx512(charMap, sourceUpper);
        return ~(resultLower | resultUpper);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    [CompExactlyDependsOn(typeof(Avx512Vbmi))]
    private static Vector512<byte> IsCharBitNotSetAvx512(Vector512<byte> charMap, Vector512<byte> values)
    {
        Vector512<byte> shifted = values >>> VectorizedIndexShift;
        Vector512<byte> bitPositions = Avx512BW.Shuffle(Vector512.Create(0x8040201008040201).AsByte(), shifted);
        Vector512<byte> index = values & Vector512.Create((byte)VectorizedIndexMask);
        Vector512<byte> bitMask = Avx512Vbmi.PermuteVar64x8(charMap, index);
        return Vector512.Equals(bitMask & bitPositions, Vector512<byte>.Zero);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    [CompExactlyDependsOn(typeof(Avx512Vbmi.VL))]
    private static Vector256<byte> ContainsMask32CharsAvx512(Vector256<byte> charMap, ref char searchSpace0, ref char searchSpace1)
    {
        Vector256<ushort> source0 = Vector256.LoadUnsafe(ref searchSpace0);
        Vector256<ushort> source1 = Vector256.LoadUnsafe(ref searchSpace1);
        Vector256<byte> sourceLower = Avx2.PackUnsignedSaturate(
            (source0 & Vector256.Create((ushort)255)).AsInt16(),
            (source1 & Vector256.Create((ushort)255)).AsInt16());
        Vector256<byte> sourceUpper = Avx2.PackUnsignedSaturate(
            (source0 >>> 8).AsInt16(),
            (source1 >>> 8).AsInt16());
        Vector256<byte> resultLower = IsCharBitNotSetAvx512(charMap, sourceLower);
        Vector256<byte> resultUpper = IsCharBitNotSetAvx512(charMap, sourceUpper);
        return ~(resultLower | resultUpper);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    [CompExactlyDependsOn(typeof(Avx512Vbmi.VL))]
    private static Vector256<byte> IsCharBitNotSetAvx512(Vector256<byte> charMap, Vector256<byte> values)
    {
        Vector256<byte> shifted = values >>> VectorizedIndexShift;
        Vector256<byte> bitPositions = Avx2.Shuffle(Vector256.Create(0x8040201008040201).AsByte(), shifted);
        Vector256<byte> index = values & Vector256.Create((byte)VectorizedIndexMask);
        Vector256<byte> bitMask = Avx512Vbmi.VL.PermuteVar32x8(charMap, index);
        return Vector256.Equals(bitMask & bitPositions, Vector256<byte>.Zero);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    [CompExactlyDependsOn(typeof(Avx2))]
    private static Vector256<byte> ContainsMask32CharsAvx2(Vector256<byte> charMapLower, Vector256<byte> charMapUpper, ref char searchSpace)
    {
        Vector256<ushort> source0 = Vector256.LoadUnsafe(ref searchSpace);
        Vector256<ushort> source1 = Vector256.LoadUnsafe(ref searchSpace, (nuint)Vector256<ushort>.Count);
        Vector256<byte> sourceLower = Avx2.PackUnsignedSaturate(
            (source0 & Vector256.Create((ushort)255)).AsInt16(),
            (source1 & Vector256.Create((ushort)255)).AsInt16());
        Vector256<byte> sourceUpper = Avx2.PackUnsignedSaturate(
            (source0 >>> 8).AsInt16(),
            (source1 >>> 8).AsInt16());
        Vector256<byte> resultLower = IsCharBitNotSetAvx2(charMapLower, charMapUpper, sourceLower);
        Vector256<byte> resultUpper = IsCharBitNotSetAvx2(charMapLower, charMapUpper, sourceUpper);
        return ~(resultLower | resultUpper);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    [CompExactlyDependsOn(typeof(Avx2))]
    private static Vector256<byte> IsCharBitNotSetAvx2(Vector256<byte> charMapLower, Vector256<byte> charMapUpper, Vector256<byte> values)
    {
        Vector256<byte> shifted = values >>> VectorizedIndexShift;
        Vector256<byte> bitPositions = Avx2.Shuffle(Vector256.Create(0x8040201008040201).AsByte(), shifted);
        Vector256<byte> index = values & Vector256.Create((byte)VectorizedIndexMask);
        Vector256<byte> bitMaskLower = Avx2.Shuffle(charMapLower, index);
        Vector256<byte> bitMaskUpper = Avx2.Shuffle(charMapUpper, index - Vector256.Create((byte)16));
        Vector256<byte> mask = Vector256.GreaterThan(index, Vector256.Create((byte)15));
        Vector256<byte> bitMask = Vector256.ConditionalSelect(mask, bitMaskUpper, bitMaskLower);
        return Vector256.Equals(bitMask & bitPositions, Vector256<byte>.Zero);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    [CompExactlyDependsOn(typeof(AdvSimd.Arm64))]
    [CompExactlyDependsOn(typeof(Sse2))]
    private static Vector128<byte> ContainsMask16Chars(Vector128<byte> charMapLower, Vector128<byte> charMapUpper, ref char searchSpace)
    {
        Vector128<ushort> source0 = Vector128.LoadUnsafe(ref searchSpace);
        Vector128<ushort> source1 = Vector128.LoadUnsafe(ref searchSpace, (nuint)Vector128<ushort>.Count);
        Vector128<byte> sourceLower = Sse2.IsSupported
            ? Sse2.PackUnsignedSaturate((source0 & Vector128.Create((ushort)255)).AsInt16(), (source1 & Vector128.Create((ushort)255)).AsInt16())
            : AdvSimd.Arm64.UnzipEven(source0.AsByte(), source1.AsByte());
        Vector128<byte> sourceUpper = Sse2.IsSupported
            ? Sse2.PackUnsignedSaturate((source0 >>> 8).AsInt16(), (source1 >>> 8).AsInt16())
            : AdvSimd.Arm64.UnzipOdd(source0.AsByte(), source1.AsByte());
        Vector128<byte> resultLower = IsCharBitNotSet(charMapLower, charMapUpper, sourceLower);
        Vector128<byte> resultUpper = IsCharBitNotSet(charMapLower, charMapUpper, sourceUpper);
        return ~(resultLower | resultUpper);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    [CompExactlyDependsOn(typeof(Sse2))]
    [CompExactlyDependsOn(typeof(Ssse3))]
    [CompExactlyDependsOn(typeof(AdvSimd))]
    [CompExactlyDependsOn(typeof(AdvSimd.Arm64))]
    [CompExactlyDependsOn(typeof(PackedSimd))]
    private static Vector128<byte> IsCharBitNotSet(Vector128<byte> charMapLower, Vector128<byte> charMapUpper, Vector128<byte> values)
    {
        Vector128<byte> shifted = values >>> VectorizedIndexShift;
        Vector128<byte> bitPositions = Vector128.ShuffleUnsafe(Vector128.Create(0x8040201008040201).AsByte(), shifted);
        Vector128<byte> index = values & Vector128.Create((byte)VectorizedIndexMask);
        Vector128<byte> bitMask;
        if (AdvSimd.Arm64.IsSupported)
        {
            bitMask = AdvSimd.Arm64.VectorTableLookup((charMapLower, charMapUpper), index);
        }
        else
        {
            Vector128<byte> bitMaskLower = Vector128.ShuffleUnsafe(charMapLower, index);
            Vector128<byte> bitMaskUpper = Vector128.ShuffleUnsafe(charMapUpper, index - Vector128.Create((byte)16));
            Vector128<byte> mask = Vector128.GreaterThan(index, Vector128.Create((byte)15));
            bitMask = Vector128.ConditionalSelect(mask, bitMaskUpper, bitMaskLower);
        }
        return Vector128.Equals(bitMask & bitPositions, Vector128<byte>.Zero);
    }

    c. Uses the following
    i. Avx512BW.PackUnsignedSaturate
    ii. Avx512Vbmi.PermuteVar64x8
    iii. Avx512BW.Shuffle
    d. Cannot upgrade: there is no Vector* replacement for PackUnsignedSaturate

  3. XxHashShared.cs
    a. No Avx512 path
    b.

    if (Vector256.IsHardwareAccelerated && BitConverter.IsLittleEndian)

    c. Uses Avx2.Multiply
    d. Cannot switch Intrinsic Multiply to vector multiply

  4. BitArray.cs
    a. Has AVX512 path
    b.

    if (Avx512F.IsSupported && (uint)m_length >= Vector512<byte>.Count)
    {
        Vector256<byte> upperShuffleMask_CopyToBoolArray256 = Vector256.Create(0x04040404_04040404, 0x05050505_05050505,
                                                                              0x06060606_06060606, 0x07070707_07070707).AsByte();
        Vector256<byte> lowerShuffleMask_CopyToBoolArray256 = Vector256.Create(lowerShuffleMask_CopyToBoolArray, upperShuffleMask_CopyToBoolArray);
        Vector512<byte> shuffleMask = Vector512.Create(lowerShuffleMask_CopyToBoolArray256, upperShuffleMask_CopyToBoolArray256);
        Vector512<byte> bitMask = Vector512.Create(0x80402010_08040201).AsByte();
        Vector512<byte> ones = Vector512.Create((byte)1);
        fixed (bool* destination = &boolArray[index])
        {
            for (; (i + Vector512<byte>.Count) <= (uint)m_length; i += (uint)Vector512<byte>.Count)
            {
                ulong bits = (ulong)(uint)m_array[i / (uint)BitsPerInt32] + ((ulong)m_array[(i / (uint)BitsPerInt32) + 1] << BitsPerInt32);
                Vector512<ulong> scalar = Vector512.Create(bits);
                Vector512<byte> shuffled = Avx512BW.Shuffle(scalar.AsByte(), shuffleMask);
                Vector512<byte> extracted = Avx512F.And(shuffled, bitMask);
                // The extracted bits can be anywhere between 0 and 255, so we normalise the value to either 0 or 1
                // to ensure compatibility with "C# bool" (0 for false, 1 for true, rest undefined)
                Vector512<byte> normalized = Avx512BW.Min(extracted, ones);
                Avx512F.Store((byte*)destination + i, normalized);
            }
        }
    }
    else if (Avx2.IsSupported && (uint)m_length >= Vector256<byte>.Count)
    {
        Vector256<byte> shuffleMask = Vector256.Create(lowerShuffleMask_CopyToBoolArray, upperShuffleMask_CopyToBoolArray);
        Vector256<byte> bitMask = Vector256.Create(0x80402010_08040201).AsByte();
        //Internal.Console.WriteLine(bitMask);
        Vector256<byte> ones = Vector256.Create((byte)1);
        fixed (bool* destination = &boolArray[index])
        {
            for (; (i + Vector256<byte>.Count) <= (uint)m_length; i += (uint)Vector256<byte>.Count)
            {
                int bits = m_array[i / (uint)BitsPerInt32];
                Vector256<int> scalar = Vector256.Create(bits);
                Vector256<byte> shuffled = Avx2.Shuffle(scalar.AsByte(), shuffleMask);
                Vector256<byte> extracted = Avx2.And(shuffled, bitMask);
                // The extracted bits can be anywhere between 0 and 255, so we normalise the value to either 0 or 1
                // to ensure compatibility with "C# bool" (0 for false, 1 for true, rest undefined)
                Vector256<byte> normalized = Avx2.Min(extracted, ones);
                Avx.Store((byte*)destination + i, normalized);
            }
        }
    }
    else if (Ssse3.IsSupported && ((uint)m_length >= Vector512<byte>.Count * 2u))

    c. Uses the following
    i. Avx2.Shuffle
    ii. Avx2.And
    iii. Avx2.Min
    iv. Avx.Store
    d. Shuffle with non-constant indices is problematic to convert, but should be fine once ShuffleUnsafe is implemented

  5. AsciiStringSearchValuesTeddyBase.cs / TeddyHelper.cs
    a. Has AVX512F path
    b.

    private int IndexOfAnyN3Avx2(ReadOnlySpan<char> span)
    {
        // See comments in 'IndexOfAnyN3Vector128' above.
        // This method is the same, but operates on 32 input characters at a time.
        Debug.Assert(span.Length >= CharsPerIterationAvx2 + MatchStartOffsetN3);
        ref char searchSpace = ref MemoryMarshal.GetReference(span);
        ref char lastSearchSpaceStart = ref Unsafe.Add(ref searchSpace, span.Length - CharsPerIterationAvx2);
        searchSpace = ref Unsafe.Add(ref searchSpace, MatchStartOffsetN3);
        Vector256<byte> n0Low = _n0Low._lower, n0High = _n0High._lower;
        Vector256<byte> n1Low = _n1Low._lower, n1High = _n1High._lower;
        Vector256<byte> n2Low = _n2Low._lower, n2High = _n2High._lower;
        Vector256<byte> prev0 = Vector256<byte>.AllBitsSet;
        Vector256<byte> prev1 = Vector256<byte>.AllBitsSet;

    Loop:
        ValidateReadPosition(span, ref searchSpace);
        Vector256<byte> input = TStartCaseSensitivity.TransformInput(LoadAndPack32AsciiChars(ref searchSpace));
        (Vector256<byte> result, prev0, prev1) = ProcessInputN3(input, prev0, prev1, n0Low, n0High, n1Low, n1High, n2Low, n2High);
        if (result != Vector256<byte>.Zero)
        {
            goto CandidateFound;
        }

    ContinueLoop:
        searchSpace = ref Unsafe.Add(ref searchSpace, CharsPerIterationAvx2);
        if (Unsafe.IsAddressGreaterThan(ref searchSpace, ref lastSearchSpaceStart))
        {
            if (Unsafe.AreSame(ref searchSpace, ref Unsafe.Add(ref lastSearchSpaceStart, CharsPerIterationAvx2)))
            {
                return -1;
            }
            // We're switching which characters we will process in the next iteration.
            // prev0 and prev1 no longer point to the characters just before the current input, so we must reset them.
            prev0 = Vector256<byte>.AllBitsSet;
            prev1 = Vector256<byte>.AllBitsSet;
            searchSpace = ref lastSearchSpaceStart;
        }
        goto Loop;

    CandidateFound:
        if (TryFindMatch(span, ref searchSpace, result, MatchStartOffsetN3, out int offset))
        {
            return offset;
        }
        goto ContinueLoop;
    }

    c. Related: TeddyHelper
    d. Uses the following
    i. PackUnsignedSaturate: no 1-to-1 equivalent
    ii. Shuffle: possible with ShuffleUnsafe
    iii. Permute2x128
    iv. AlignRight: no 1-to-1 equivalent
    v. PermuteVar8x64x2
  6. SpanHelpers.cs: consider all the span helpers under this umbrella
    a. Has AVX512F path
    b.

    // Avx2 branch also operates on Sse2 sizes, so check is combined.

    c. Uses the following
    i. Shuffle
    ii. Avx2.Permute2x128
    iii. PermuteVar8x32
    iv. Permute4x64
    v. Avx2.And
    vi. Avx2.MultiplyHigh
    vii. Avx2.MultiplyLow
    viii. Avx2.Or
    ix. Avx2.SubtractSaturate
    x. Avx2.CompareGreaterThan
    xi. Avx2.Subtract
    xii. Avx2.Add

  7. IndexOfAnyAsciiSearcher
    a. No AVX512F path: an implementation was tried but ran into issues
    b.

    if (Avx2.IsSupported && searchSpaceLength > 2 * Vector128<short>.Count)

    c. Uses the following
    i. PackUnsignedSaturate
    ii. Shuffle

  8. Matrix4x4.Impl
    a. No AVX512 path and, in some cases, no AVX path either
    b.


    c. Uses the following
    i. Shuffle/Permute: constant indices, so possibly convertible?
    ii. UnpackLow
    iii. UnpackHigh

  9. Ascii.Equality
    a. Avx512 path added
    b.

    else if (Avx.IsSupported && length >= (uint)Vector256<TLeft>.Count)

    c. Already uses Vector; switch the check?

  10. Ascii.Utility
    a. Has avx512 path
    b.

    private static bool VectorContainsNonAsciiChar(Vector256<ushort> utf16Vector)

    c. Uses TestZ/PackUnsignedSaturate – can possibly move to more efficient patterns similar to ‘HasMatch’

BitArray is currently the only one where conversion is feasible, and that depends on #99596.
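
To illustrate the BitArray case, here is a minimal sketch of the same shuffle / and / min pattern written only with cross-platform calls. It is not code from BitArray.cs: `BitExpansionSketch` and `ExpandBitsToBools` are hypothetical names, it uses the public `Vector128.Shuffle` in place of `ShuffleUnsafe`, it handles only 16 bits at a time, and it assumes a little-endian layout.

```csharp
using System;
using System.Runtime.Intrinsics;

static class BitExpansionSketch
{
    // Hypothetical helper: expands the 16 bits of 'bits' into 16 bytes that are each 0 or 1.
    public static Vector128<byte> ExpandBitsToBools(ushort bits)
    {
        // Broadcast the 16 bits; on little-endian, byte 0 is the low byte and byte 1 the high byte.
        Vector128<byte> source = Vector128.Create(bits).AsByte();

        // Lanes 0..7 read the low byte, lanes 8..15 read the high byte.
        Vector128<byte> shuffleMask = Vector128.Create(
            (byte)0, 0, 0, 0, 0, 0, 0, 0,
            1, 1, 1, 1, 1, 1, 1, 1);

        // Each lane isolates one bit: 0x01, 0x02, ..., 0x80, repeated for both halves.
        Vector128<byte> bitMask = Vector128.Create(0x80402010_08040201ul).AsByte();

        Vector128<byte> shuffled = Vector128.Shuffle(source, shuffleMask);
        Vector128<byte> extracted = shuffled & bitMask;

        // Normalise any non-zero value to 1, mirroring the Min step in the intrinsic version.
        return Vector128.Min(extracted, Vector128.Create((byte)1));
    }
}
```

Whether this generates code as good as the Avx2/Avx512BW paths depends on how the JIT handles the non-constant shuffle, which is the concern raised in item 4.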

Some patterns we can consider

  1. Sse2.Multiply – the Vector multiply does not work the same way. The Vector version keeps only the lower half of each product, whereas the intrinsic version (for SSE and AVX) widens the element type, e.g. uint -> ulong. So Widen -> Multiply might work.
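
A rough sketch of what such a pattern could look like, assuming the goal is to match `Sse2.Multiply(Vector128<uint>, Vector128<uint>)`, which multiplies the low 32 bits of each 64-bit element and returns full 64-bit products. `WideningMultiplySketch` and `MultiplyEvenLanesWidening` are hypothetical names, and the reinterpretation assumes little-endian. Note that `Vector128.WidenLower` widens lanes 0 and 1, while the intrinsic reads lanes 0 and 2, so this sketch masks instead of widening to keep the same lane selection.

```csharp
using System;
using System.Runtime.Intrinsics;

static class WideningMultiplySketch
{
    // Hypothetical helper: 32x32 -> 64-bit multiply of the even-indexed uint lanes (0 and 2).
    public static Vector128<ulong> MultiplyEvenLanesWidening(Vector128<uint> left, Vector128<uint> right)
    {
        // Viewing the input as ulong and clearing the upper 32 bits zero-extends lanes 0 and 2.
        Vector128<ulong> lowMask = Vector128.Create(0xFFFF_FFFFul);
        Vector128<ulong> leftWide = left.AsUInt64() & lowMask;
        Vector128<ulong> rightWide = right.AsUInt64() & lowMask;

        // The 64-bit element multiply keeps the low 64 bits, which is the full product here.
        return leftWide * rightWide;
    }
}
```

For this to be a real replacement, the JIT would need to recognise the mask-then-multiply (or widen-then-multiply) pattern and emit the single widening multiply instruction; otherwise the 64-bit multiply may be considerably slower on hardware without a native one.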
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Apr 18, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Apr 18, 2024
@tannergooding tannergooding added area-System.Runtime.Intrinsics and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Apr 18, 2024
Contributor

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

@tannergooding
Member

This is a bit of a meta issue as it applies to many areas across the BCL, but I've marked it as intrinsics since they all have to do with updating the intrinsic code.

CC. @jeffhandley
