
Tensor primitives divide int32 #111505

Open
wants to merge 27 commits into main

Conversation

alexcovington
Contributor

Add a vector path for int in TensorPrimitives.Divide.
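For background, a hedged sketch (not the PR's actual code) of why a vector path for int32 goes through double: x86 has no SIMD integer-divide instruction, so each lane is widened to double, divided, and truncated back. The class and method names below are purely illustrative.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Int32VectorDivideSketch
{
    // Illustrative only: assumes Avx.IsSupported; a real path also has to
    // surface DivideByZeroException / OverflowException, as discussed below.
    static Vector128<int> Divide(Vector128<int> dividend, Vector128<int> divisor)
    {
        // Widen each int32 lane to double, divide (divpd), then truncate back to int32.
        Vector256<double> num = Avx.ConvertToVector256Double(dividend);
        Vector256<double> den = Avx.ConvertToVector256Double(divisor);
        Vector256<double> quotient = num / den;
        return Avx.ConvertToVector128Int32WithTruncation(quotient);
    }
}
```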

Improves performance in microbenchmarks:

| Method        | Job        | Toolchain        | BufferLength | Mean       | Error    | StdDev   | Median     | Min        | Max        | Ratio | Allocated | Alloc Ratio |
|-------------- |----------- |----------------- |------------- |-----------:|---------:|---------:|-----------:|-----------:|-----------:|------:|----------:|------------:|
| Divide_Vector | Job-SUKFCS | Base             | 128          |   174.4 ns |  0.83 ns |  0.69 ns |   174.2 ns |   173.5 ns |   176.0 ns |  1.00 |         - |          NA |
| Divide_Vector | Job-OOXWSE | Diff             | 128          |   111.5 ns |  0.56 ns |  0.50 ns |   111.5 ns |   110.7 ns |   112.3 ns |  0.64 |         - |          NA |
|               |            |                  |              |            |          |          |            |            |            |       |           |             |
| Divide_Scalar | Job-SUKFCS | Base             | 128          |   138.9 ns |  0.88 ns |  0.78 ns |   138.5 ns |   137.8 ns |   140.5 ns |  1.00 |         - |          NA |
| Divide_Scalar | Job-OOXWSE | Diff             | 128          |   104.8 ns |  0.50 ns |  0.42 ns |   104.7 ns |   104.3 ns |   105.8 ns |  0.75 |         - |          NA |
|               |            |                  |              |            |          |          |            |            |            |       |           |             |
| Divide_Vector | Job-SUKFCS | Base             | 3079         | 4,038.8 ns | 13.47 ns | 11.94 ns | 4,034.4 ns | 4,025.5 ns | 4,062.2 ns |  1.00 |         - |          NA |
| Divide_Vector | Job-OOXWSE | Diff             | 3079         | 1,358.5 ns |  5.78 ns |  4.83 ns | 1,356.7 ns | 1,351.8 ns | 1,367.6 ns |  0.34 |         - |          NA |
|               |            |                  |              |            |          |          |            |            |            |       |           |             |
| Divide_Scalar | Job-SUKFCS | Base             | 3079         | 3,315.1 ns | 12.55 ns | 11.12 ns | 3,313.0 ns | 3,301.0 ns | 3,337.9 ns |  1.00 |         - |          NA |
| Divide_Scalar | Job-OOXWSE | Diff             | 3079         | 1,294.8 ns |  7.84 ns |  6.95 ns | 1,293.1 ns | 1,286.2 ns | 1,309.0 ns |  0.39 |         - |          NA |

@dotnet-policy-service bot added the community-contribution label on Jan 16, 2025
@@ -70,11 +72,80 @@ public static void Divide<T>(T x, ReadOnlySpan<T> y, Span<T> destination)
 internal readonly struct DivideOperator<T> : IBinaryOperator<T> where T : IDivisionOperators<T, T, T>
 {
     public static bool Vectorizable => typeof(T) == typeof(float)
-                                    || typeof(T) == typeof(double);
+                                    || typeof(T) == typeof(double)
+#if NET10_0_OR_GREATER
Member

Why is this limited to .NET 10? Are the APIs being used newly introduced only in .NET 10?

Contributor Author

I was running into issues with FloatRoundingMode not being available in .NET 8.0.

Member

That was added in 9, not 10.

@alexcovington
Contributor Author

alexcovington commented Jan 22, 2025

@tannergooding Would it be possible for us to take a similar approach to PopCount or IndexOfMax and provide both a hardware-specific and xplat path? This would just be for the conversions, something like this:

Vector256<double> num_pd;
Vector256<double> den_pd;

// Widen the four int32 lanes to double so the division can be done in double precision.
if (Avx.IsSupported)
{
    // Hardware-specific path: single-instruction conversion (vcvtdq2pd) when AVX is available.
    num_pd = Avx.ConvertToVector256Double(x.AsInt32());
    den_pd = Avx.ConvertToVector256Double(y.AsInt32());
}
else
{
    // Cross-platform fallback: widen int32 -> int64, then convert to double.
    num_pd = Vector256.ConvertToDouble(Vector256.WidenLower(x.AsInt32().ToVector256Unsafe()));
    den_pd = Vector256.ConvertToDouble(Vector256.WidenLower(y.AsInt32().ToVector256Unsafe()));
}

This provides an optimal path when Avx/Avx512 are available while providing an xplat path to fall back on for other platforms.

@tannergooding
Copy link
Member

Would it be possible for us to take a similar approach to PopCount or IndexOfMax and provide both a hardware-specific and xplat path

It's possible, but the general desire is instead to have the JIT recognize the xplat pattern and have it implicitly light up to the more optimized hardware-specific pattern. Oftentimes the patterns involved, particularly for things like Widen(...), Narrow(..., ...), or even WidenLower(WidenLower(...)), are relatively trivial to handle and simply haven't been added in yet. This simplifies the managed code and allows us to implicitly use even newer ISAs as they become available, without having to go back and touch the code again.

There are of course exceptions where the pattern is non-trivial; in such scenarios we want to leave a comment explaining why it's more complex and often have a tracking issue around it.
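As a rough illustration of the kind of xplat pattern being described (the names below are purely illustrative), writing against the cross-platform Widen API leaves the ISA-specific lowering to the JIT:

```csharp
using System.Runtime.Intrinsics;

static class WidenPatternSketch
{
    // Cross-platform widening written against the xplat API surface; the JIT
    // can recognize this pattern and lower it to the best widening instructions
    // the target ISA provides (for example pmovzxbw on x64 or ushll on Arm64).
    static (Vector128<ushort> Lower, Vector128<ushort> Upper) WidenBytes(Vector128<byte> value)
    {
        return Vector128.Widen(value);
    }
}
```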

@alexcovington
Copy link
Contributor Author

@tannergooding I've updated the draft to apply the optimization to operator /. Currently the draft only optimizes Vector128<int> and Vector256<int>, but I wanted to verify with you that the approach is correct before continuing further.

You also mentioned that it would be ideal to do something similar for other int sizes (byte/short/uint/etc.). Would that be best done in this PR or a separate PR?
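To make the scope concrete, here is a minimal sketch (illustrative names, not code from the PR) of the user-facing operation whose codegen is being improved:

```csharp
using System.Runtime.Intrinsics;

static class OperatorDivideSketch
{
    // Element-wise Vector128<int> division via operator /. With the change,
    // the JIT can emit a SIMD sequence for the int element type instead of
    // expanding to per-lane scalar division.
    static Vector128<int> DivideElementwise(Vector128<int> dividend, Vector128<int> divisor)
    {
        return dividend / divisor;
    }
}
```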

@alexcovington
Copy link
Contributor Author

@tannergooding Friendly ping

@tannergooding
Copy link
Member

You also mentioned that it would be ideal to do something similar for other int sizes (byte/short/uint/etc.). Would that be best done in this PR or a separate PR?

A separate PR is fine with me!

@tannergooding
Copy link
Member

Sorry for the delay, should be getting to this first thing tomorrow.

@tannergooding
Copy link
Member

This needs a weigh-in from @dotnet/jit-contrib

@tannergooding
Copy link
Member

This needs a weigh-in from @dotnet/jit-contrib

@alexcovington, so I talked through the approach being taken here with @jakobbotsch and got some input from @EgorBo as well.

There was some concern with the cost/overhead of introducing a net new GT_SIMD_DIV_BY_ZERO_CHECK node that is specific to SIMD and to one kind of check (div by zero). We talked through some options on how this cost could be justified by making it more reusable (particularly since PRs like #111543 have a similar overall need, conceptually).

We then landed on some options we thought would be feasible (mostly just minor tweaks to the node name and what parameters it takes), but then realized that the check is not sufficient "as is" because it isn't handling the other case of MinValue / -1, which also causes a hardware fault (and which is surfaced as an OverflowException in .NET). This additional complexity, coupled with it likely not being feasible for the checks to be hoisted today, halted that train of thought. We also realized that expanding this to ConvertToInteger(HWINTRINSIC Div(ConvertToFloat(dividend), ConvertToFloat(ValidityCheck(divisor)))) in import would make it much more difficult to do certain other optimizations, like transforming division into cheaper alternative sequences (like shifts) where possible.

What we landed on is that this likely needs to be handled differently from the approach that would've been more ideal, at least today. So instead of introducing a new node that checks the inputs are well formed, we should instead just import this as a regular HWINTRINSIC Div(dividend, divisor) node where the base type is the correct integral type. This HWINTRINSIC Div node should be marked as being side-effecting, much as other nodes like GT_DIV or HWINTRINSIC StoreFence are. We would then deal with emitting the appropriate comparison and throw ourselves in codegen (much as is being done for the new GT_SIMD_DIV_BY_ZERO_CHECK node you introduced, just done for the side-effecting intrinsic node instead). This isn't "ideal", but it's the closest to the right step for ensuring we can enable the right optimizations today and make incremental improvements in the future, when the other desired optimizations like hoisting the implied side effects here are more feasible.
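To make the two fault cases concrete, here is a hedged scalar sketch (not from the PR) of the per-element semantics the side-effecting vector division has to preserve:

```csharp
using System;

static class DivideSemanticsSketch
{
    // A scalar reference for the element-wise behavior of int32 division.
    static int DivideElement(int dividend, int divisor)
    {
        // Division by zero surfaces as DivideByZeroException.
        if (divisor == 0)
        {
            throw new DivideByZeroException();
        }

        // int.MinValue / -1 would produce 2147483648, which is not representable
        // as Int32, so it surfaces as OverflowException.
        if (dividend == int.MinValue && divisor == -1)
        {
            throw new OverflowException();
        }

        return dividend / divisor;
    }
}
```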

@tannergooding
Copy link
Member

Sorry for the churn here; the more natural approach to handling this just isn't panning out due to some limitations in the IR representation.

Happy to answer any follow-up questions or elaborate as needed.

@alexcovington
Copy link
Contributor Author

@tannergooding Thanks for the further review. Most of the feedback makes sense; I just have a few follow-up questions:

So instead of introducing a new node that checks the inputs are well formed, we should instead just import this as a regular HWINTRINSIC Div(dividend, divisor) node where the base type is the correct integral type

I'm a little confused by "regular HWINTRINSIC Div(dividend, divisor)": is this still building the division node with gtNewSimdBinOp and just using the integer type instead? Or is there a different way to construct the node that is more appropriate?

This HWINTRINSIC Div node should be marked as being side effecting

Just to confirm, we're marking the node during import?

@tannergooding
Copy link
Member

I'm a little confused by "regular HWINTRINSIC Div(dividend, divisor)": is this still building the division node with gtNewSimdBinOp and just using the integer type instead? Or is there a different way to construct the node that is more appropriate?

Right, still in gtNewSimdBinOp. It'd be similar to the floating-point path but we'd use NI_Vector*_Divide as the intrinsic ID (which likely requires updating the flags in hwintrinsiclistxarch.h to indicate this ID is valid to have) and track the simdBaseType as the appropriate integer type.

Just to confirm, we're marking the node during import?

I'd recommend looking at how we specially handle NI_X86Base_Pause and HW_Flag_SpecialSideEffect_Other. This gets handled in a few places via the HasSpecialSideEffect(..) check, which shows where we may want to consider special handling for the HWINTRINSIC Divide nodes.

NamedIntrinsic intrinsicId = this->AsHWIntrinsic()->GetHWIntrinsicId();
if (intrinsicId == NI_Vector128_op_Division || intrinsicId == NI_Vector256_op_Division)
{
    return true;
}
Member

This should probably assert that we're only encountering it for varTypeIsInteger(AsHWIntrinsic()->GetSimdBaseType()), just to avoid issues from potential future refactorings.

I'd expect we want to go ahead and include Vector512_op_Division here as well, even if it can't be encountered quite yet.

Contributor Author

Thanks, I've added the assert and included Vector512_op_Division.

// }
//
// Vector128<int> overflowMask =
// Vector128.Equals(op1, Vector128.Create(int.MaxValue)
Member

this should be MinValue, not MaxValue.

it's MinValue / -1 that overflows, since that produces 2147483648, which isn't representable.

Contributor Author

Thanks, fixed.

Comment on lines 1911 to 1953
if (typeSize == EA_16BYTE)
{
    simd16_t maxValueIntVec;
    maxValueIntVec.i32[0] = INT_MAX;
    maxValueIntVec.i32[1] = INT_MAX;
    maxValueIntVec.i32[2] = INT_MAX;
    maxValueIntVec.i32[3] = INT_MAX;

    simd16_t negOneIntVec;
    negOneIntVec.i32[0] = -1;
    negOneIntVec.i32[1] = -1;
    negOneIntVec.i32[2] = -1;
    negOneIntVec.i32[3] = -1;

    maxValueFld = emit->emitSimd16Const(maxValueIntVec);
    negOneFld = emit->emitSimd16Const(negOneIntVec);
}
else
{
    noway_assert(typeSize == EA_32BYTE);
    simd32_t maxValueIntVec;
    maxValueIntVec.i32[0] = INT_MAX;
    maxValueIntVec.i32[1] = INT_MAX;
    maxValueIntVec.i32[2] = INT_MAX;
    maxValueIntVec.i32[3] = INT_MAX;
    maxValueIntVec.i32[4] = INT_MAX;
    maxValueIntVec.i32[5] = INT_MAX;
    maxValueIntVec.i32[6] = INT_MAX;
    maxValueIntVec.i32[7] = INT_MAX;

    simd32_t negOneIntVec;
    negOneIntVec.i32[0] = -1;
    negOneIntVec.i32[1] = -1;
    negOneIntVec.i32[2] = -1;
    negOneIntVec.i32[3] = -1;
    negOneIntVec.i32[4] = -1;
    negOneIntVec.i32[5] = -1;
    negOneIntVec.i32[6] = -1;
    negOneIntVec.i32[7] = -1;

    maxValueFld = emit->emitSimd32Const(maxValueIntVec);
    negOneFld = emit->emitSimd32Const(negOneIntVec);
}
Member

tannergooding commented Feb 24, 2025

This could be simplified to a simd_t vecCns = {}; and then a for loop for elementCount iterations (given elementCount = simdSize / 4).

Then doing emitSimd16/32Const based on the simdSize (and accessing v128[0] or v256[0]).

Member

It might, alternatively, be cheaper to just check if the result of divpd is 2147483648. That gives us a single comparison instead and is the only case that will be out of range for int.
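A hedged managed-level sketch of that idea (illustrative only; the PR emits this check in codegen, not in C#): after the double division, the only quotient that can fall outside Int32 range is exactly 2147483648.0, so a single per-element comparison against that constant flags the overflow case.

```csharp
using System.Runtime.Intrinsics;

static class OverflowMaskSketch
{
    // int.MinValue / -1 is the only int32 division whose quotient
    // (2147483648.0) is not representable as Int32, so comparing the double
    // result against that single value is sufficient.
    static Vector256<double> OverflowMask(Vector256<double> quotient)
    {
        return Vector256.Equals(quotient, Vector256.Create(2147483648.0));
    }
}
```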

Member

Couldn't negOneIntVec just be simd32_t::AllBitsSet?

Member

Yes, that as well.

Member

tannergooding left a comment

CC @dotnet/jit-contrib for secondary review

@tannergooding
Member

CC @EgorBo, @jakobbotsch, this is ready for secondary review

Member

EgorBo left a comment

LGTM. I guess tests to cover overflow and divbyzero wouldn't hurt, but up to you. Actually, let me kick off a Fuzzlyn just in case

PS: For normal division we also check the divisor and dividend in morph and global assertion prop and leave hints for the emitter that the operation never actually overflows (e.g. see optAssertionProp_ModDiv). But it's just an idea for a follow-up.
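A hedged sketch of the kind of coverage being suggested (test and class names are illustrative, not the PR's actual tests), exercising the divide-by-zero and MinValue / -1 paths:

```csharp
using System;
using System.Runtime.Intrinsics;
using Xunit;

public class Int32VectorDivideFaultTests
{
    [Fact]
    public void DivideByZeroElementThrows()
    {
        Vector128<int> dividend = Vector128.Create(1, 2, 3, 4);
        Vector128<int> divisor = Vector128.Create(1, 0, 1, 1);

        // A zero divisor element surfaces as DivideByZeroException.
        Assert.Throws<DivideByZeroException>(() => { _ = dividend / divisor; });
    }

    [Fact]
    public void MinValueDividedByMinusOneThrows()
    {
        Vector128<int> dividend = Vector128.Create(int.MinValue, 1, 1, 1);
        Vector128<int> divisor = Vector128.Create(-1, 1, 1, 1);

        // int.MinValue / -1 surfaces as OverflowException.
        Assert.Throws<OverflowException>(() => { _ = dividend / divisor; });
    }
}
```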

@EgorBo
Member

EgorBo commented Feb 28, 2025

/azp run Fuzzlyn


Azure Pipelines successfully started running 1 pipeline(s).

Labels: area-System.Numerics.Tensors, community-contribution