Tensor primitives divide int32 #111505

Conversation
```diff
@@ -70,11 +72,80 @@ public static void Divide<T>(T x, ReadOnlySpan<T> y, Span<T> destination)
 internal readonly struct DivideOperator<T> : IBinaryOperator<T> where T : IDivisionOperators<T, T, T>
 {
     public static bool Vectorizable => typeof(T) == typeof(float)
-                                    || typeof(T) == typeof(double);
+                                    || typeof(T) == typeof(double)
+#if NET10_0_OR_GREATER
```
Why is this limited to .NET 10? Are the APIs being used newly introduced only in .NET 10?
I was running into issues with FloatRoundingMode not being available in .NET 8.0.
That was added in 9, not 10.
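For illustration, a sketch of what the relaxed guard could look like, assuming the int branch is what the new #if region adds (the full diff is not reproduced above, so the exact body is an assumption):

```csharp
using System.Numerics;

internal readonly struct DivideOperatorSketch<T> where T : IDivisionOperators<T, T, T>
{
    // Sketch only: FloatRoundingMode shipped with .NET 9, so the guard could
    // presumably target NET9_0_OR_GREATER rather than NET10_0_OR_GREATER.
    public static bool Vectorizable => typeof(T) == typeof(float)
                                    || typeof(T) == typeof(double)
#if NET9_0_OR_GREATER
                                    || typeof(T) == typeof(int)
#endif
        ;
}
```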
@tannergooding Would it be possible for us to take a similar approach to PopCount or IndexOfMax and provide both a hardware-specific and xplat path? This would just be for the conversions, something like this:

```csharp
Vector256<double> num_pd;
Vector256<double> den_pd;
if (Avx.IsSupported)
{
    num_pd = Avx.ConvertToVector256Double(x.AsInt32());
    den_pd = Avx.ConvertToVector256Double(y.AsInt32());
}
else
{
    num_pd = Vector256.ConvertToDouble(Vector256.WidenLower(x.AsInt32().ToVector256Unsafe()));
    den_pd = Vector256.ConvertToDouble(Vector256.WidenLower(y.AsInt32().ToVector256Unsafe()));
}
```

This provides an optimal path when Avx/Avx512 is available, while providing an xplat path to fall back on for other platforms.
It's possible, but the general desire is to instead have the JIT recognize the xplat pattern and implicitly light it up to the more optimized hardware-specific pattern. Oftentimes the patterns involved, particularly for things like …

There are of course exceptions where the pattern is non-trivial; in such scenarios we want to leave a comment explaining why it's more complex, and we often have a tracking issue around it.
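For reference, here is a self-contained sketch (not the PR's code; the helper name is made up) of the kind of xplat pattern under discussion, using only Vector128 APIs available since .NET 7. Converting int to double is exact, and truncating the correctly rounded double quotient matches integer division for every input pair except int.MinValue / -1:

```csharp
using System.Runtime.Intrinsics;

static class Int32DivideSketch
{
    // Hypothetical helper: int32 division via double. int.MinValue / -1 (and
    // division by zero) still need explicit handling, as discussed below.
    public static Vector128<int> DivideViaDouble(Vector128<int> x, Vector128<int> y)
    {
        (Vector128<long> xLo, Vector128<long> xHi) = Vector128.Widen(x);
        (Vector128<long> yLo, Vector128<long> yHi) = Vector128.Widen(y);

        Vector128<double> qLo = Vector128.ConvertToDouble(xLo) / Vector128.ConvertToDouble(yLo);
        Vector128<double> qHi = Vector128.ConvertToDouble(xHi) / Vector128.ConvertToDouble(yHi);

        // ConvertToInt64 truncates toward zero, matching scalar int division.
        return Vector128.Narrow(Vector128.ConvertToInt64(qLo), Vector128.ConvertToInt64(qHi));
    }
}
```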
@tannergooding I've updated the draft to perform the optimization to …

You also mentioned that it would be ideal to do something similar for other int sizes (…).
@tannergooding Friendly ping

A separate PR is fine with me!

Sorry for the delay, should be getting to this first thing tomorrow.

This needs weigh-in from @dotnet/jit-contrib
@alexcovington, so I talked through the approach being taken here with @jakobbotsch and got some input from @EgorBo as well.

There was some concern with the cost/overhead of introducing a net new …. We then landed on some options we thought would be feasible (mostly just minor tweaks to the node name and what parameters it takes), but then realized that the check is not sufficient "as is" because it isn't handling the other case of ….

What we landed on is that this likely needs to be handled differently from the approach that would've been more ideal, at least today. So instead of introducing a new node that checks the inputs are well formed, we should instead just import this as a regular …
Sorry for the churn here; the more natural approach to handle this just isn't panning out due to some limitations in the IR representation. Happy to answer any follow-up questions or elaborate as needed.
@tannergooding Thanks for the further review. Most of the feedback makes sense, I just have a few follow-up questions:

I'm a little confused by …

Just to confirm, we're marking the node during import?

Right, still in …

I'd recommend looking at how we specially handle …
```cpp
NamedIntrinsic intrinsicId = this->AsHWIntrinsic()->GetHWIntrinsicId();
if (intrinsicId == NI_Vector128_op_Division || intrinsicId == NI_Vector256_op_Division)
{
    return true;
```
This should probably assert that we're only encountering it for varTypeIsInteger(AsHWIntrinsic()->GetSimdBaseType()), just to avoid issues from potential future refactorings.

I'd expect we want to go ahead and include Vector512_op_Division here as well, even if it can't be encountered quite yet.
Thanks, I've added the assert and included Vector512_op_Division.
```csharp
// }
//
// Vector128<int> overflowMask =
//     Vector128.Equals(op1, Vector128.Create(int.MaxValue)
```
This should be MinValue, not MaxValue. It's MinValue / -1 that overflows, since that produces 2147483648, which isn't representable.
Thanks, fixed.
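A scalar illustration of the edge case being guarded (standard C# semantics; the C# specification requires the exception in both checked and unchecked contexts):

```csharp
using System;

// int.MinValue / -1 would be +2147483648, which does not fit in a 32-bit int,
// so the scalar division throws rather than producing a result.
int dividend = int.MinValue;
int divisor = -1;
try
{
    _ = dividend / divisor;
}
catch (OverflowException)
{
    Console.WriteLine("int.MinValue / -1 overflows"); // the vector path must mask this case
}
```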
```cpp
if (typeSize == EA_16BYTE)
{
    simd16_t maxValueIntVec;
    maxValueIntVec.i32[0] = INT_MAX;
    maxValueIntVec.i32[1] = INT_MAX;
    maxValueIntVec.i32[2] = INT_MAX;
    maxValueIntVec.i32[3] = INT_MAX;

    simd16_t negOneIntVec;
    negOneIntVec.i32[0] = -1;
    negOneIntVec.i32[1] = -1;
    negOneIntVec.i32[2] = -1;
    negOneIntVec.i32[3] = -1;

    maxValueFld = emit->emitSimd16Const(maxValueIntVec);
    negOneFld   = emit->emitSimd16Const(negOneIntVec);
}
else
{
    noway_assert(typeSize == EA_32BYTE);
    simd32_t maxValueIntVec;
    maxValueIntVec.i32[0] = INT_MAX;
    maxValueIntVec.i32[1] = INT_MAX;
    maxValueIntVec.i32[2] = INT_MAX;
    maxValueIntVec.i32[3] = INT_MAX;
    maxValueIntVec.i32[4] = INT_MAX;
    maxValueIntVec.i32[5] = INT_MAX;
    maxValueIntVec.i32[6] = INT_MAX;
    maxValueIntVec.i32[7] = INT_MAX;

    simd32_t negOneIntVec;
    negOneIntVec.i32[0] = -1;
    negOneIntVec.i32[1] = -1;
    negOneIntVec.i32[2] = -1;
    negOneIntVec.i32[3] = -1;
    negOneIntVec.i32[4] = -1;
    negOneIntVec.i32[5] = -1;
    negOneIntVec.i32[6] = -1;
    negOneIntVec.i32[7] = -1;

    maxValueFld = emit->emitSimd32Const(maxValueIntVec);
    negOneFld   = emit->emitSimd32Const(negOneIntVec);
}
```
This could be simplified to a simd_t vecCns = {}; and then a for loop running elementCount iterations (given elementCount = simdSize / 4), then doing emitSimd16/32Const based on the simdSize (and accessing v128[0] or v256[0]).
It might, alternatively, be cheaper to just check if the result of divpd is 2147483648. That gives us a single comparison instead, and it is the only case that will be out of range for int.
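Illustratively, that single comparison could look like the following against the quotient from the earlier conversion snippet (num_pd/den_pd are reused from that sketch; this is not the PR's code):

```csharp
// Setting aside division by zero, +2147483648.0 (from int.MinValue / -1) is the
// only finite double quotient of two int values that falls outside int range,
// so one equality test can replace the two input comparisons.
Vector256<double> quotient = num_pd / den_pd;
Vector256<double> overflowMask = Vector256.Equals(quotient, Vector256.Create(2147483648.0));
```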
Can't negOneIntVec just be simd32_t::AllBitsSet?
Yes, that as well.
…o verify base type is TYP_INT
CC @dotnet/jit-contrib for secondary review

CC @EgorBo, @jakobbotsch this is ready for secondary review
LGTM. I guess tests to cover overflow and div-by-zero wouldn't hurt, but up to you. Actually, let me kick off a Fuzzlyn run just in case.

PS: For normal division we also check the divisor and dividend in morph and global assertion propagation, and leave hints for the emitter that the operation never actually overflows (e.g. see optAssertionProp_ModDiv). But that's just an idea for a follow-up.
/azp run Fuzzlyn

Azure Pipelines successfully started running 1 pipeline(s).
Add a vector path for int in TensorPrimitives.Divide. Improves performance in microbenchmarks:
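For context, a small usage sketch of the public API this accelerates, assuming the generic Divide overload from recent System.Numerics.Tensors:

```csharp
using System;
using System.Numerics.Tensors;

int[] x = { 10, 20, -30, int.MaxValue };
int[] y = { 3, -4, 5, 2 };
int[] destination = new int[4];

// Element-wise destination[i] = x[i] / y[i]; this PR vectorizes the int path.
TensorPrimitives.Divide<int>(x, y, destination);
Console.WriteLine(string.Join(", ", destination)); // 3, -5, -6, 1073741823
```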