
Tensor primitives divide int32 #111505

Open
wants to merge 27 commits into main

Conversation

alexcovington
Contributor

Add a vector path for int in TensorPrimitives.Divide.
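For background, a hedged sketch (not the PR's actual code) of why a vector path for int32 goes through double: x86 has no SIMD integer-divide instruction, so each lane is widened to double, divided, and truncated back. The class and method names below are purely illustrative.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Int32VectorDivideSketch
{
    // Illustrative only: assumes Avx.IsSupported; a real path also has to
    // surface DivideByZeroException / OverflowException, as discussed below.
    static Vector128<int> Divide(Vector128<int> dividend, Vector128<int> divisor)
    {
        // Widen each int32 lane to double, divide (divpd), then truncate back to int32.
        Vector256<double> num = Avx.ConvertToVector256Double(dividend);
        Vector256<double> den = Avx.ConvertToVector256Double(divisor);
        Vector256<double> quotient = num / den;
        return Avx.ConvertToVector128Int32WithTruncation(quotient);
    }
}
```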

Improves performance in microbenchmarks:

| Method        | Job        | Toolchain        | BufferLength | Mean       | Error    | StdDev   | Median     | Min        | Max        | Ratio | Allocated | Alloc Ratio |
|-------------- |----------- |----------------- |------------- |-----------:|---------:|---------:|-----------:|-----------:|-----------:|------:|----------:|------------:|
| Divide_Vector | Job-SUKFCS | Base             | 128          |   174.4 ns |  0.83 ns |  0.69 ns |   174.2 ns |   173.5 ns |   176.0 ns |  1.00 |         - |          NA |
| Divide_Vector | Job-OOXWSE | Diff             | 128          |   111.5 ns |  0.56 ns |  0.50 ns |   111.5 ns |   110.7 ns |   112.3 ns |  0.64 |         - |          NA |
|               |            |                  |              |            |          |          |            |            |            |       |           |             |
| Divide_Scalar | Job-SUKFCS | Base             | 128          |   138.9 ns |  0.88 ns |  0.78 ns |   138.5 ns |   137.8 ns |   140.5 ns |  1.00 |         - |          NA |
| Divide_Scalar | Job-OOXWSE | Diff             | 128          |   104.8 ns |  0.50 ns |  0.42 ns |   104.7 ns |   104.3 ns |   105.8 ns |  0.75 |         - |          NA |
|               |            |                  |              |            |          |          |            |            |            |       |           |             |
| Divide_Vector | Job-SUKFCS | Base             | 3079         | 4,038.8 ns | 13.47 ns | 11.94 ns | 4,034.4 ns | 4,025.5 ns | 4,062.2 ns |  1.00 |         - |          NA |
| Divide_Vector | Job-OOXWSE | Diff             | 3079         | 1,358.5 ns |  5.78 ns |  4.83 ns | 1,356.7 ns | 1,351.8 ns | 1,367.6 ns |  0.34 |         - |          NA |
|               |            |                  |              |            |          |          |            |            |            |       |           |             |
| Divide_Scalar | Job-SUKFCS | Base             | 3079         | 3,315.1 ns | 12.55 ns | 11.12 ns | 3,313.0 ns | 3,301.0 ns | 3,337.9 ns |  1.00 |         - |          NA |
| Divide_Scalar | Job-OOXWSE | Diff             | 3079         | 1,294.8 ns |  7.84 ns |  6.95 ns | 1,293.1 ns | 1,286.2 ns | 1,309.0 ns |  0.39 |         - |          NA |

@dotnet-policy-service bot added the community-contribution label on Jan 16, 2025
@@ -70,11 +72,80 @@ public static void Divide<T>(T x, ReadOnlySpan<T> y, Span<T> destination)
 internal readonly struct DivideOperator<T> : IBinaryOperator<T> where T : IDivisionOperators<T, T, T>
 {
     public static bool Vectorizable => typeof(T) == typeof(float)
-                                    || typeof(T) == typeof(double);
+                                    || typeof(T) == typeof(double)
+#if NET10_0_OR_GREATER
Member

Why is this limited to .NET 10? Are the APIs being used newly introduced only in .NET 10?

Contributor Author

I was running into issues with FloatRoundingMode not being available in .NET 8.0.

Member

That was added in 9, not 10.

@alexcovington
Contributor Author

alexcovington commented Jan 22, 2025

@tannergooding Would it be possible for us to take a similar approach to PopCount or IndexOfMax and provide both a hardware-specific and xplat path? This would just be for the conversions, something like this:

Vector256<double> num_pd;
Vector256<double> den_pd;

// Widen the four int32 lanes to double so the division can be done in double precision.
if (Avx.IsSupported)
{
    // Hardware-specific path: single-instruction conversion (vcvtdq2pd) when AVX is available.
    num_pd = Avx.ConvertToVector256Double(x.AsInt32());
    den_pd = Avx.ConvertToVector256Double(y.AsInt32());
}
else
{
    // Cross-platform fallback: widen int32 -> int64, then convert to double.
    num_pd = Vector256.ConvertToDouble(Vector256.WidenLower(x.AsInt32().ToVector256Unsafe()));
    den_pd = Vector256.ConvertToDouble(Vector256.WidenLower(y.AsInt32().ToVector256Unsafe()));
}

This provides an optimal path when Avx/Avx512 are available while providing an xplat path to fall back on for other platforms.

@tannergooding
Copy link
Member

Would it be possible for us to take a similar approach to PopCount or IndexOfMax and provide both a hardware-specific and xplat path

It's possible, but the general desire is instead to have the JIT recognize the xplat pattern and have it implicitly light up to the more optimized hardware-specific pattern. Oftentimes the patterns involved, particularly for things like Widen(...), Narrow(..., ...), or even WidenLower(WidenLower(...)), are relatively trivial to handle and simply haven't been added in yet. This simplifies the managed code and allows us to implicitly use even newer ISAs as they become available, without having to go back and touch the code again.

There are of course exceptions where the pattern is non-trivial; in such scenarios we want to leave a comment explaining why it's more complex and often have a tracking issue around it.
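As a rough illustration of the kind of xplat pattern being described (the names below are purely illustrative), writing against the cross-platform Widen API leaves the ISA-specific lowering to the JIT:

```csharp
using System.Runtime.Intrinsics;

static class WidenPatternSketch
{
    // Cross-platform widening written against the xplat API surface; the JIT
    // can recognize this pattern and lower it to the best widening instructions
    // the target ISA provides (for example pmovzxbw on x64 or ushll on Arm64).
    static (Vector128<ushort> Lower, Vector128<ushort> Upper) WidenBytes(Vector128<byte> value)
    {
        return Vector128.Widen(value);
    }
}
```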

@alexcovington
Copy link
Contributor Author

@tannergooding I've updated the draft to apply the optimization to operator /. Currently the draft only optimizes Vector128<int> and Vector256<int>, but I wanted to verify with you that the approach is correct before continuing further.

You also mentioned that it would be ideal to do something similar for other int sizes (byte/short/uint/etc.). Would that be best done in this PR or a separate PR?
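To make the scope concrete, here is a minimal sketch (illustrative names, not code from the PR) of the user-facing operation whose codegen is being improved:

```csharp
using System.Runtime.Intrinsics;

static class OperatorDivideSketch
{
    // Element-wise Vector128<int> division via operator /. With the change,
    // the JIT can emit a SIMD sequence for the int element type instead of
    // expanding to per-lane scalar division.
    static Vector128<int> DivideElementwise(Vector128<int> dividend, Vector128<int> divisor)
    {
        return dividend / divisor;
    }
}
```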

@alexcovington
Copy link
Contributor Author

@tannergooding Friendly ping

@tannergooding
Copy link
Member

You also mentioned that it would be ideal to do something similar for other int sizes (byte/short/uint/etc.). Would that be best done in this PR or a separate PR?

A separate PR is fine with me!

@tannergooding
Copy link
Member

Sorry for the delay, should be getting to this first thing tomorrow.

@tannergooding
Copy link
Member

This needs a weigh-in from @dotnet/jit-contrib

@tannergooding
Copy link
Member

This needs a weigh-in from @dotnet/jit-contrib

@alexcovington, so I talked through the approach being taken here with @jakobbotsch and got some input from @EgorBo as well.

There was some concern with the cost/overhead of introducing a net new GT_SIMD_DIV_BY_ZERO_CHECK node that is specific to SIMD and to one kind of check (div by zero). We talked through some options on how this cost could be justified by making it more reusable (particularly since PRs like #111543 have a similar overall need, conceptually).

We then landed on some options we thought would be feasible (mostly just minor tweaks to the node name and what parameters it takes), but then realized that the check is not sufficient "as is" because it isn't handling the other case of MinValue / -1, which also causes a hardware fault (and which is surfaced as an OverflowException in .NET). This additional complexity, coupled with it likely not being feasible for the checks to be hoisted today, halted that train of thought. We also realized that expanding this to ConvertToInteger(HWINTRINSIC Div(ConvertToFloat(dividend), ConvertToFloat(ValidityCheck(divisor)))) in import would make it much more difficult to do certain other optimizations, like transforming division into cheaper alternative sequences (like shifts) where possible.

What we landed on is that this likely needs to be handled differently from the approach that would've been more ideal, at least today. So instead of introducing a new node that checks the inputs are well formed, we should instead just import this as a regular HWINTRINSIC Div(dividend, divisor) node where the base type is the correct integral type. This HWINTRINSIC Div node should be marked as being side-effecting, much as other nodes like GT_DIV or HWINTRINSIC StoreFence are. We would then deal with emitting the appropriate comparison and throw ourselves in codegen (much as is being done for the new GT_SIMD_DIV_BY_ZERO_CHECK node you introduced, just done for the side-effecting intrinsic node instead). This isn't "ideal", but it's the closest to the right step for ensuring we can enable the right optimizations today and make incremental improvements in the future, when the other desired optimizations like hoisting the implied side effects here are more feasible.
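To make the two fault cases concrete, here is a hedged scalar sketch (not from the PR) of the per-element semantics the side-effecting vector division has to preserve:

```csharp
using System;

static class DivideSemanticsSketch
{
    // A scalar reference for the element-wise behavior of int32 division.
    static int DivideElement(int dividend, int divisor)
    {
        // Division by zero surfaces as DivideByZeroException.
        if (divisor == 0)
        {
            throw new DivideByZeroException();
        }

        // int.MinValue / -1 would produce 2147483648, which is not representable
        // as Int32, so it surfaces as OverflowException.
        if (dividend == int.MinValue && divisor == -1)
        {
            throw new OverflowException();
        }

        return dividend / divisor;
    }
}
```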

@tannergooding
Copy link
Member

Sorry for the churn here; the more natural approach to handling this just isn't panning out due to some limitations in the IR representation.

Happy to answer any follow-up questions or elaborate as needed.

@alexcovington
Copy link
Contributor Author

@tannergooding Thanks for the further review. Most of the feedback makes sense; I just have a few follow-up questions:

So instead of introducing a new node that checks the inputs are well formed, we should instead just import this as a regular HWINTRINSIC Div(dividend, divisor) node where the base type is the correct integral type

I'm a little confused by "regular HWINTRINSIC Div(dividend, divisor)": is this still building the division node with gtNewSimdBinOp and just using the integer type instead? Or is there a different way to construct the node that is more appropriate?

This HWINTRINSIC Div node should be marked as being side effecting

Just to confirm, we're marking the node during import?

@tannergooding
Copy link
Member

I'm a little confused by "regular HWINTRINSIC Div(dividend, divisor)": is this still building the division node with gtNewSimdBinOp and just using the integer type instead? Or is there a different way to construct the node that is more appropriate?

Right, still in gtNewSimdBinOp. It'd be similar to the floating-point path but we'd use NI_Vector*_Divide as the intrinsic ID (which likely requires updating the flags in hwintrinsiclistxarch.h to indicate this ID is valid to have) and track the simdBaseType as the appropriate integer type.

Just to confirm, we're marking the node during import?

I'd recommend looking at how we specially handle NI_X86Base_Pause and HW_Flag_SpecialSideEffect_Other. This gets handled in a few places via the HasSpecialSideEffect(..) check, which shows where we may want to consider special handling for the HWINTRINSIC Divide nodes.

NamedIntrinsic intrinsicId = this->AsHWIntrinsic()->GetHWIntrinsicId();
if (intrinsicId == NI_Vector128_op_Division || intrinsicId == NI_Vector256_op_Division)
{
    return true;
}
Member

This should probably assert that we're only encountering it for varTypeIsInteger(AsHWIntrinsic()->GetSimdBaseType()), just to avoid issues from potential future refactorings.

I'd expect we want to go ahead and include Vector512_op_Division here as well, even if it can't be encountered quite yet.

Contributor Author

Thanks, I've added the assert and included Vector512_op_Division.

// }
//
// Vector128<int> overflowMask =
// Vector128.Equals(op1, Vector128.Create(int.MaxValue)
Member

this should be MinValue, not MaxValue.

it's MinValue / -1 that overflows, since that produces 2147483648, which isn't representable.

Contributor Author

Thanks, fixed.

Comment on lines 1911 to 1953
if (typeSize == EA_16BYTE)
{
    simd16_t maxValueIntVec;
    maxValueIntVec.i32[0] = INT_MAX;
    maxValueIntVec.i32[1] = INT_MAX;
    maxValueIntVec.i32[2] = INT_MAX;
    maxValueIntVec.i32[3] = INT_MAX;

    simd16_t negOneIntVec;
    negOneIntVec.i32[0] = -1;
    negOneIntVec.i32[1] = -1;
    negOneIntVec.i32[2] = -1;
    negOneIntVec.i32[3] = -1;

    maxValueFld = emit->emitSimd16Const(maxValueIntVec);
    negOneFld = emit->emitSimd16Const(negOneIntVec);
}
else
{
    noway_assert(typeSize == EA_32BYTE);
    simd32_t maxValueIntVec;
    maxValueIntVec.i32[0] = INT_MAX;
    maxValueIntVec.i32[1] = INT_MAX;
    maxValueIntVec.i32[2] = INT_MAX;
    maxValueIntVec.i32[3] = INT_MAX;
    maxValueIntVec.i32[4] = INT_MAX;
    maxValueIntVec.i32[5] = INT_MAX;
    maxValueIntVec.i32[6] = INT_MAX;
    maxValueIntVec.i32[7] = INT_MAX;

    simd32_t negOneIntVec;
    negOneIntVec.i32[0] = -1;
    negOneIntVec.i32[1] = -1;
    negOneIntVec.i32[2] = -1;
    negOneIntVec.i32[3] = -1;
    negOneIntVec.i32[4] = -1;
    negOneIntVec.i32[5] = -1;
    negOneIntVec.i32[6] = -1;
    negOneIntVec.i32[7] = -1;

    maxValueFld = emit->emitSimd32Const(maxValueIntVec);
    negOneFld = emit->emitSimd32Const(negOneIntVec);
}
Member

tannergooding commented Feb 24, 2025

This could be simplified to a simd_t vecCns = {}; and then a for loop for elementCount iterations (given elementCount = simdSize / 4).

Then doing emitSimd16/32Const based on the simdSize (and accessing v128[0] or v256[0]).

Member

It might, alternatively, be cheaper to just check if the result of divpd is 2147483648. That gives us a single comparison instead and is the only case that will be out of range for int.
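A hedged managed-level sketch of that idea (illustrative only; the PR emits this check in codegen, not in C#): after the double division, the only quotient that can fall outside Int32 range is exactly 2147483648.0, so a single per-element comparison against that constant flags the overflow case.

```csharp
using System.Runtime.Intrinsics;

static class OverflowMaskSketch
{
    // int.MinValue / -1 is the only int32 division whose quotient
    // (2147483648.0) is not representable as Int32, so comparing the double
    // result against that single value is sufficient.
    static Vector256<double> OverflowMask(Vector256<double> quotient)
    {
        return Vector256.Equals(quotient, Vector256.Create(2147483648.0));
    }
}
```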

Member

Couldn't negOneIntVec just be simd32_t::AllBitsSet?

Member

Yes, that as well.

Member

tannergooding left a comment

CC @dotnet/jit-contrib for secondary review

@tannergooding
Member

CC @EgorBo, @jakobbotsch, this is ready for secondary review

Member

EgorBo left a comment

LGTM. I guess tests to cover overflow and divbyzero wouldn't hurt, but up to you. Actually, let me kick off a Fuzzlyn just in case

PS: For normal division we also check the divisor and dividend in morph and global assertion prop and leave hints for the emitter that the operation never actually overflows (e.g. see optAssertionProp_ModDiv). But it's just an idea for a follow-up.
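A hedged sketch of the kind of coverage being suggested (test and class names are illustrative, not the PR's actual tests), exercising the divide-by-zero and MinValue / -1 paths:

```csharp
using System;
using System.Runtime.Intrinsics;
using Xunit;

public class Int32VectorDivideFaultTests
{
    [Fact]
    public void DivideByZeroElementThrows()
    {
        Vector128<int> dividend = Vector128.Create(1, 2, 3, 4);
        Vector128<int> divisor = Vector128.Create(1, 0, 1, 1);

        // A zero divisor element surfaces as DivideByZeroException.
        Assert.Throws<DivideByZeroException>(() => { _ = dividend / divisor; });
    }

    [Fact]
    public void MinValueDividedByMinusOneThrows()
    {
        Vector128<int> dividend = Vector128.Create(int.MinValue, 1, 1, 1);
        Vector128<int> divisor = Vector128.Create(-1, 1, 1, 1);

        // int.MinValue / -1 surfaces as OverflowException.
        Assert.Throws<OverflowException>(() => { _ = dividend / divisor; });
    }
}
```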

@EgorBo
Member

EgorBo commented Feb 28, 2025

/azp run Fuzzlyn


Azure Pipelines successfully started running 1 pipeline(s).

Labels: area-System.Numerics.Tensors, community-contribution