Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unreliable codegen around vbroadcastss in .NET 8 release builds #96156

Closed
twest820 opened this issue Dec 18, 2023 · 18 comments · Fixed by #105659
Closed

unreliable codegen around vbroadcastss in .NET 8 release builds #96156

twest820 opened this issue Dec 18, 2023 · 18 comments · Fixed by #105659
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@twest820
Copy link

Description

Code that runs fine in .NET 7 fails in .NET 8.

Reproduction Steps

This affects a couple helper methods which have been in place for years.

        public unsafe static Vector128<float> BroadcastScalarToVector128(float value)
        {
            return Avx.BroadcastScalarToVector128(&value);
        }

        public unsafe static Vector256<float> BroadcastScalarToVector256(float value)
        {
            return Avx.BroadcastScalarToVector256(&value);
        }

Expected behavior

Numerical simulations remain correct.

Actual behavior

Simulations experience numerical divergence as timesteps proceed, ultimately either completing with wrong results or crashing on ArgumentOutOfRangeExceptions raised by internal consistency checks detecting NaNs or other incorrect values resulting from things like numerical overflows in exponentiation.

Regression?

Yes. .NET 8 debug builds work fine. Flipping projects back to .NET 7 removes this problem. Rolling either of the affected repos I've detected so far back to .NET 7 and then moving projects forward to .NET 8 with no other changes also reproduces the error.

Known Workarounds

Use of Vector128.Create(value) or Vector256.Create(value) instead of calling Avx.BroadcastScalarToVectorNNN() directly has, so far, been successful in avoiding this issue.

Configuration

.NET 8, Win 10 22H2, AMD Zen 3. Yes, specific to .NET 8 and x64 intrinsics.

Other information

In .NET 7 the disassembly is

vmovss      dword ptr [rbp+18h],xmm1  
vbroadcastss xmm0,dword ptr [rbp+18h]
vmovss      dword ptr [rbp+18h],xmm1
vbroadcastss ymm0,dword ptr [rbp+18h]

In .NET 8 it's of the form

vmovss      dword ptr [rbp+18h],xmm1  
cmp         dword ptr [7FFE0A69A0F0h],0  
je          BroadcastScalarToVector128(Single)+025h (07FFE0A744805h)  
call        00007FFE697D0A10  
vbroadcastss xmm0,dword ptr [rbp+18h]  
vmovaps     xmmword ptr [rbp-20h],xmm0  
mov         rax,qword ptr [rbp+10h]  
vmovaps     xmm0,xmmword ptr [rbp-20h]  
vmovups     xmmword ptr [rax],xmm0  

However, the JIT from the apparent super quick workaround of

        public static Vector128<float> BroadcastScalarToVector128(float value)
        {
            return Vector128.Create(value);
        }

appears to be functionally identical, both locally

vmovss      dword ptr [rbp+18h],xmm1  
cmp         dword ptr [7FFDEE0E45C0h],0  
je          BroadcastScalarToVector128(Single)+025h (07FFDEE338F15h)  
call        00007FFE4D290A10  
vbroadcastss xmm0,dword ptr [rbp+18h]  
vmovaps     xmmword ptr [rbp-20h],xmm0  
mov         rax,qword ptr [rbp+10h]  
vmovaps     xmm0,xmmword ptr [rbp-20h]  
vmovups     xmmword ptr [rax],xmm0  

and in the surrounding call sites I've checked. After several hours' investigation my current best guess is there's some small fraction out of hundreds of calls where rbp isn't stable, leading vbroadcastss to sometimes load the wrong value.

I'm not particularly concerned about this issue—looking at what .NET 8 JIT does, removing the entire helper class as legacy code and changing to inline VectorNNN.Create() calls makes sense—but it strikes me as strange enough to be worth documenting, particularly should someone else happen to hit it. Our code's nearly all floating point so I don't have a read on whether the vpbroadcast instructions are also affected.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 18, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Dec 18, 2023
@ghost
Copy link

ghost commented Dec 18, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

Code that runs fine in .NET 7 fails in .NET 8.

Reproduction Steps

This affects a couple helper methods which have been in place for years.

        public unsafe static Vector128<float> BroadcastScalarToVector128(float value)
        {
            return Avx.BroadcastScalarToVector128(&value);
        }

        public unsafe static Vector256<float> BroadcastScalarToVector256(float value)
        {
            return Avx.BroadcastScalarToVector256(&value);
        }

Expected behavior

Numerical simulations remain correct.

Actual behavior

Simulations experience numerical divergence as timesteps proceed, ultimately either completing with wrong results or crashing on ArgumentOutOfRangeExceptions raised by internal consistency checks detecting NaNs or other incorrect values resulting from things like numerical overflows in exponentiation.

Regression?

Yes. .NET 8 debug builds work fine. Flipping projects back to .NET 7 removes this problem. Rolling either of the affected repos I've detected so far back to .NET 7 and then moving projects forward to .NET 8 with no other changes also reproduces the error.

Known Workarounds

Use of Vector128.Create(value) or Vector256.Create(value) instead of calling Avx.BroadcastScalarToVectorNNN() directly has, so far, been successful in avoiding this issue.

Configuration

.NET 8, Win 10 22H2, AMD Zen 3. Yes, specific to .NET 8 and x64 intrinsics.

Other information

In .NET 7 the disassembly is

vmovss      dword ptr [rbp+18h],xmm1  
vbroadcastss xmm0,dword ptr [rbp+18h]
vmovss      dword ptr [rbp+18h],xmm1
vbroadcastss ymm0,dword ptr [rbp+18h]

In .NET 8 it's of the form

vmovss      dword ptr [rbp+18h],xmm1  
cmp         dword ptr [7FFE0A69A0F0h],0  
je          BroadcastScalarToVector128(Single)+025h (07FFE0A744805h)  
call        00007FFE697D0A10  
vbroadcastss xmm0,dword ptr [rbp+18h]  
vmovaps     xmmword ptr [rbp-20h],xmm0  
mov         rax,qword ptr [rbp+10h]  
vmovaps     xmm0,xmmword ptr [rbp-20h]  
vmovups     xmmword ptr [rax],xmm0  

However, the JIT from the apparent super quick workaround of

        public static Vector128<float> BroadcastScalarToVector128(float value)
        {
            return Vector128.Create(value);
        }

appears to be functionally identical, both locally

vmovss      dword ptr [rbp+18h],xmm1  
cmp         dword ptr [7FFDEE0E45C0h],0  
je          BroadcastScalarToVector128(Single)+025h (07FFDEE338F15h)  
call        00007FFE4D290A10  
vbroadcastss xmm0,dword ptr [rbp+18h]  
vmovaps     xmmword ptr [rbp-20h],xmm0  
mov         rax,qword ptr [rbp+10h]  
vmovaps     xmm0,xmmword ptr [rbp-20h]  
vmovups     xmmword ptr [rax],xmm0  

and in the surrounding call sites I've checked. After several hours' investigation my current best guess is there's some small fraction out of hundreds of calls where rbp isn't stable, leading vbroadcastss to sometimes load the wrong value.

I'm not particularly concerned about this issue—looking at what .NET 8 JIT does, removing the entire helper class as legacy code and changing to inline VectorNNN.Create() calls makes sense—but it strikes me as strange enough to be worth documenting, particularly should someone else happen to hit it. Our code's nearly all floating point so I don't have a read on whether the vpbroadcast instructions are also affected.

Author: twest820
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@EgorBo
Copy link
Member

EgorBo commented Dec 19, 2023

@twest820 do you have a repro project?

@twest820
Copy link
Author

Not a simplified repro at this time, unfortunately. But, if this isn't something going sideways on my dev machine, the R3PGReferenceStands() test case should provide an open repro—clone master there and flip the projects to .NET 8. I've pushed quick fix in the corresponding develop branch, which is on .NET 8, that changes from Avx.BroadcastScalarToVector128(float*) to Avx2.BroadcastScalarToVector128(float). So it looks like something specific to the AVX form of vbroadcastss.

There's another repo that's also affected (at commit 80ca9ef) since it contains the same legacy helper methods. We don't have permission to release the data the tests run on, though, so it'd be harder to work with.

@BruceForstall
Copy link
Member

@tannergooding @dotnet/avx512-contrib

@JulieLeeMSFT JulieLeeMSFT added this to the 9.0.0 milestone Jan 4, 2024
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Jan 4, 2024
@threbejk
Copy link

We have encountered similar issue. After switching to .NET8, code that was using Avx.BroadcastScalarToVector256(double*) started giving wrong results, but only on some older cpus. (Intel Xeon E5-2690) The behavior seemed deterministic - the results were always same on a given computer. The suggested workaround (using Vector256.Create(value)) fixed the issue.

@tannergooding tannergooding removed their assignment Apr 18, 2024
@AndyAyersMS
Copy link
Member

@twest820 sorry it has taken us so long to look into this.

Is this the kind of failure you are seeing with 8.0?

      R3PGReferenceStands (588ms): Error Message: Assert.IsTrue failed. broadleaf_mix: AvailableSoilWater[4]: ratio = <1.0000024, 0, 1.5494434, 1.5494434> at iteration 0.
      Stack Trace:
         at BayesianPG.Test.AssertV.IsTrue(Vector128`1 condition, String message) in C:\repos\BayesianPG\UnitTests\AssertV.cs:line 11
         at BayesianPG.Test.Test3PG.VerifyArray(ThreePGpjsMix threePG, String variable, Vector128`1[] actual, Int32 expectedTimesteps, Single[] expectedValues, Single tolerance, Int32 maxTimestep, Int32 iteration) in C:\repos\Ba
      yesianPG\UnitTests\Test3PG.cs:line 343
         at BayesianPG.Test.Test3PG.VerifyStandTrajectory(ThreePGSimd128 threePG128, ThreePGStandTrajectory`2 expectedTrajectory, StandTrajectoryTolerance tolerances, Int32 iteration) in C:\repos\BayesianPG\UnitTests\Test3PG.cs:
      line 757
         at BayesianPG.Test.Test3PG.R3PGReferenceStands() in C:\repos\BayesianPG\UnitTests\Test3PG.cs:line 114
         at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
         at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr)

@AndyAyersMS
Copy link
Member

@threbejk if you have a repro you can share that would also be helpful.

@AndyAyersMS
Copy link
Member

The bug in the original repro seems to be the call at line 84 of InitializeParametersWeatherAndFirstMonth

https://github.com/OSU-MARS/BayesianPG/blob/23a5dcb29ad537c7f4666ae50f30cdbe2f5394f5/BayesianPG/ThreePG/ThreePGSimd128.cs#L79-L84

Codegen (good on left, bad on left) shows the vbroadcastss of the divided result is lost somehow

image

The issue is still present in .Net9.

Looks like the broadcast ends up getting marked as contained by the multiply and then gets dropped entirely.

@AndyAyersMS
Copy link
Member

Some forms of vmulps support embedded broadcasts for their second source operand. Perhaps we're just not generating the right variant? Let me look a bit deeper.

@AndyAyersMS
Copy link
Member

       FF15CF029600         call     [BayesianPG.Extensions.DateTimeExtensions:DaysInMonth(System.DateTime):int]
       C5F857C0             vxorps   xmm0, xmm0, xmm0
       C5FA2AC0             vcvtsi2ss xmm0, xmm0, eax
       C5FA104D0C           vmovss   xmm1, dword ptr [rbp+0x0C]
       C5F25EC0             vdivss   xmm0, xmm1, xmm0
       C5FA11457C           vmovss   dword ptr [rbp+0x7C], xmm0
       488B4B48             mov      rcx, gword ptr [rbx+0x48]
       488B4958             mov      rcx, gword ptr [rcx+0x58]
       488B4960             mov      rcx, gword ptr [rcx+0x60]
       418BD4               mov      edx, r12d
       8B85EC020000         mov      eax, dword ptr [rbp+0x2EC]
       448BC0               mov      r8d, eax
       C58859457C           vmulps   xmm0, xmm14, xmmword ptr [rbp+0x7C]

Seems like the broadcast versions require AVX512, and we're not checking for that.

Relevant code is

https://github.com/dotnet/runtime/blame/b764692d18e66561c7cff2101d642b0f789dae0c/src/coreclr/jit/lowerxarch.cpp#L8283-L8296

Looks like this changed in #84821

cc @tannergooding

@AndyAyersMS
Copy link
Member

Simple repro vs 9p6

using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using System.Runtime.CompilerServices;

class Runtime_96156
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public unsafe static Vector128<float> BroadcastScalarToVector128(float value)
    {
        return Avx.BroadcastScalarToVector128(&value);
    }

    public static void Main()
    {
        Vector128<float> c = Vector128.Create(1.0f);
        Vector128<float> r = Problem(2.0f, 0.5f, c);
        Console.WriteLine($"{r}");
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static Vector128<float> Problem(float a, float b, Vector128<float> c)
    {
        return Avx.Multiply(c, BroadcastScalarToVector128(a / b));
    }
}

D:\bugs\r96156>set DOTNET_TieredCompilation=0

D:\bugs\r96156>dotnet run  -c Release
<4, -1.0316779E-08, 4.591E-41, 2.7913741E+17>

D:\bugs\r96156>dotnet run
<4, 4, 4, 4>
; Assembly listing for method Runtime_96156:Problem(float,float,System.Runtime.Intrinsics.Vector128`1[float]):System.Runtime.Intrinsics.Vector128`1[float] (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data

G_M000_IG01:                ;; offset=0x0000
       push     rax

G_M000_IG02:                ;; offset=0x0001
       vdivss   xmm0, xmm1, xmm2
       vmovss   dword ptr [rsp+0x04], xmm0
       vmovups  xmm0, xmmword ptr [r9]
       lea      rax, [rsp+0x04]
       vmulps   xmm0, xmm0, xmmword ptr [rax]
       vmovups  xmmword ptr [rcx], xmm0
       mov      rax, rcx

G_M000_IG03:                ;; offset=0x0020
       add      rsp, 8
       ret

; Total bytes of code 37

@AndyAyersMS
Copy link
Member

Ah, looks like this just got fixed in main by #104517 (so should be fixed in P7).

@tannergooding you think we should try and backport a fix to 8.0 (given all the refactoring, maybe not a straight port)?

@AndyAyersMS
Copy link
Member

@tannergooding sorry to nag, curious what you think here about a backport.

@tannergooding
Copy link
Member

Ah sorry, I had seen this over the weekend and forgot to respond when I got in this morning.

I think a backport would be good, I'll see if I can get a minimal fix so we don't have to take the larger refactoring.

@AndyAyersMS
Copy link
Member

Since this is fixed in .NET 9, I am going to change the milestone to .NET 8.

@tannergooding
Copy link
Member

Got the backport up here: #105659

@AndyAyersMS
Copy link
Member

@twest820 this should be fixed in 8.0 servicing -- can you verify if you get a chance?

@twest820
Copy link
Author

Hi @AndyAyersMS, I moved the build to .NET 9.0 a while ago (can confirm reverting to the original unsafe code works there). @tannergooding's unit test coverage looks good to me, though.

FWIW, I've left the build here with the safe alternative

public static Vector128<float> BroadcastScalarToVector128(float value)
{
    Vector128<float> value128 = Vector128.CreateScalarUnsafe(value);
    return Avx2.BroadcastScalarToVector128(value128);
}

as we've stopped supporting processors without FMA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
8 participants