Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Peculiar string operation performance #13451

Closed
Suchiman opened this issue Sep 22, 2019 · 31 comments
Closed

Peculiar string operation performance #13451

Suchiman opened this issue Sep 22, 2019 · 31 comments
Assignees
Labels
Milestone

Comments

@Suchiman
Copy link
Contributor

So i got nerd sniped on IRC with how fast you can append characters to create a string.
I created the following benchmark to measure a few alternatives just before i got told it's for an old MVC4 app. Then I've noticed that the performance is consistently worse on .NET Core, in particular for StringBuilder and even more so if you supply a capacity for StringBuilder. I then suspected Tiered Compilation but i couldn't figure how to really disable it for BDN.NET, <TieredCompilation>false</TieredCompilation> had at best an effect on the InProcess Benchmark. Also it appears that BDN.NET or the API it is using for measuring the Allocated number is incorrect on .NET Core, the .NET Framework numbers seem more plausible (allocating an char[] of 50k elements and a string from that char[] certainly doesn't allocate exactly as much as just allocating the string).

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text;

namespace ConsoleApp37
{
    class Program
    {
        static void Main(string[] args)
        {
            BenchmarkRunner.Run<ConcatRace>();
        }
    }

    //[InProcess]
    [MemoryDiagnoser]
    public class ConcatRace
    {
        const int length = 50000;

        [Benchmark]
        public string StringBuilderAddChar()
        {
            var result = new StringBuilder();

            for (int i = 0; i < length; i++)
            {
                result.Append('s');
            }

            return result.ToString();
        }

        [Benchmark]
        public string StringBuilderAddCharOptimized()
        {
            var result = new StringBuilder(length);

            for (int i = 0; i < length; i++)
            {
                result.Append('s');
            }

            return result.ToString();
        }

        [Benchmark]
        public unsafe string PointerAddChar()
        {
            var result = new char[length];

            fixed (char* fixedPointer = result)
            {
                var pointer = fixedPointer;
                for (int i = 0; i < length; i++)
                {
                    *pointer++ = 's';
                }
            }

            return new string(result);
        }

        [Benchmark]
        public unsafe string NormalAddChar()
        {
            var result = new char[length];

            for (int i = 0; i < result.Length; i++)
            {
                result[i] = 's';
            }

            return new string(result);
        }

        [Benchmark]
        public unsafe string FastPointerAddChar()
        {
            var result = new string('\0', length);

            fixed (char* fixedPointer = result)
            {
                var pointer = fixedPointer;
                for (int i = 0; i < length; i++)
                {
                    *pointer++ = 's';
                }
            }

            return result;
        }
    }
}

.NET Framework 4.7.2

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.18362
Intel Core i5-4670K CPU 3.40GHz (Haswell), 1 CPU, 4 logical and 4 physical cores
  [Host]     : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.3815.0
  DefaultJob : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.3815.0

Method Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
StringBuilderAddChar 109.43 us 0.3330 us 0.2952 us 62.3779 31.1279 31.1279 209.02 KB
StringBuilderAddCharOptimized 100.69 us 0.5440 us 0.5089 us 62.3779 62.3779 62.3779 195.87 KB
PointerAddChar 33.79 us 0.2182 us 0.1934 us 62.4390 62.4390 62.4390 195.37 KB
NormalAddChar 38.40 us 0.1843 us 0.1539 us 62.4390 62.4390 62.4390 195.37 KB
FastPointerAddChar 24.62 us 0.2241 us 0.2096 us 31.2195 31.2195 31.2195 97.69 KB

.NET Core 3.0 RC1 (OutOfProcess)

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.18362
Intel Core i5-4670K CPU 3.40GHz (Haswell), 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=3.0.100-rc1-014190
  [Host]     : .NET Core 3.0.0-rc1-19456-20 (CoreCLR 4.700.19.45506, CoreFX 4.700.19.45604), 64bit RyuJIT
  DefaultJob : .NET Core 3.0.0-rc1-19456-20 (CoreCLR 4.700.19.45506, CoreFX 4.700.19.45604), 64bit RyuJIT

Method Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
StringBuilderAddChar 120.80 us 0.5277 us 0.4936 us 62.2559 31.0059 31.0059 208.55 KB
StringBuilderAddCharOptimized 142.44 us 1.1631 us 1.0879 us 62.2559 62.2559 62.2559 97.73 KB
PointerAddChar 35.33 us 0.2321 us 0.2057 us 62.4390 62.4390 62.4390 97.68 KB
NormalAddChar 40.27 us 0.4038 us 0.3777 us 62.4390 62.4390 62.4390 97.68 KB
FastPointerAddChar 29.78 us 0.1331 us 0.1245 us 31.2195 31.2195 31.2195 97.68 KB

.NET Core 3.0 RC1 (InProcess)

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.18362
Intel Core i5-4670K CPU 3.40GHz (Haswell), 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=3.0.100-rc1-014190
  [Host] : .NET Core 3.0.0-rc1-19456-20 (CoreCLR 4.700.19.45506, CoreFX 4.700.19.45604), 64bit RyuJIT

Job=InProcess  Toolchain=InProcessEmitToolchain  
Method Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
StringBuilderAddChar 121.68 us 0.4015 us 0.3352 us 62.3779 31.1279 31.1279 208.55 KB
StringBuilderAddCharOptimized 111.83 us 1.4062 us 1.3154 us 62.3779 62.3779 62.3779 97.73 KB
PointerAddChar 43.64 us 0.5115 us 0.4784 us 62.4390 62.4390 62.4390 97.68 KB
NormalAddChar 48.63 us 0.6182 us 0.5782 us 62.4390 62.4390 62.4390 97.68 KB
FastPointerAddChar 29.98 us 0.3981 us 0.3724 us 31.2195 31.2195 31.2195 97.68 KB
@jkotas
Copy link
Member

jkotas commented Sep 23, 2019

cc @adamsitnik

@danmoseley
Copy link
Member

Also reported here https://github.com/dotnet/coreclr/issues/25774

@danmoseley
Copy link
Member

Seems a duplicate of #13107. Lets continue discussion there.

@Suchiman
Copy link
Contributor Author

Suchiman commented Oct 1, 2019

@danmosemsft imo this is a superset of that other ticket which is exclusive to stringbuilder

@danmoseley
Copy link
Member

@Suchiman I see your observations about non StringBuilder appends but it's not clear to me that there's a problem with those?

@Suchiman
Copy link
Contributor Author

Suchiman commented Oct 1, 2019

They're consistently slower even if only by a small margin (between 5% and 18%) and then there's the incorrect measurement of Allocated

@danmoseley danmoseley reopened this Oct 1, 2019
@danmoseley
Copy link
Member

@Suchiman you're right -- I had read the tables in reverse.

@adamsitnik adamsitnik self-assigned this Oct 1, 2019
@adamsitnik
Copy link
Member

I am going to investigate it tomorrow

@adamsitnik
Copy link
Member

Hi @Suchiman

Also it appears that BDN.NET or the API it is using for measuring the Allocated number is incorrect on .NET Core

You are right. Before .NET Core 3.0 preview 6+ and BenchmarkDotNet 0.11.6, BenchmarkDotNet was using GC.GetAllocatedBytesForCurrentThread to get the number of allocated bytes. Recently we have switched to the new API added recently to .NET Core 3.0 - GC.GetTotalAllocatedBytes dotnet/BenchmarkDotNet#1155 and this solves the problem of missing allocations.

Using the latest version of BDN from our CI feed and .NET Core 3.0 I get the correct number of allocated bytes:

BenchmarkDotNet=v0.11.5.1159-nightly, OS=Windows 10.0.18362
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-alpha1-013817
[Host] : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), X64 RyuJIT
Job-XMGAOR : .NET Framework 4.8 (4.8.4010.0), X64 RyuJIT
Job-JHKCBR : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), X64 RyuJIT

Method Runtime Toolchain Mean Ratio Allocated
StringBuilderAddChar .NET 4.7.2 net472 124.16 us 1.00 208.77 KB
StringBuilderAddChar .NET Core 3.0 netcoreapp3.0 147.46 us 1.21 208.56 KB
StringBuilderAddCharOptimized .NET 4.7.2 net472 120.43 us 1.00 195.87 KB
StringBuilderAddCharOptimized .NET Core 3.0 netcoreapp3.0 140.72 us 1.17 195.41 KB
PointerAddChar .NET 4.7.2 net472 41.11 us 1.00 195.37 KB
PointerAddChar .NET Core 3.0 netcoreapp3.0 40.19 us 0.98 195.36 KB
NormalAddChar .NET 4.7.2 net472 42.01 us 1.00 195.37 KB
NormalAddChar .NET Core 3.0 netcoreapp3.0 44.21 us 1.05 195.36 KB
FastPointerAddChar .NET 4.7.2 net472 29.44 us 1.00 97.69 KB
FastPointerAddChar .NET Core 3.0 netcoreapp3.0 32.82 us 1.12 97.68 KB

i couldn't figure how to really disable it for BDN.NET

The benchmark should get promoted to Tier 1 during Warmup or Pilot stage, this is why BDN does not expose a possibility to disable Tiered JIT. However, you can achieve that by setting environment variables

@adamsitnik
Copy link
Member

When it comes to the most important aspect - for your benchmark .NET Core 3.0 is slower than .NET 4.8. I am going to investigate that.

@adamsitnik
Copy link
Member

Also, I wanted to mention that we track the performance of StringBuilder in dotnet/performance repository: https://github.com/dotnet/performance/blob/master/src/benchmarks/micro/corefx/System.Runtime/Perf.StringBuilder.cs

I've run all the benchmarks we have for 4.8 vs 3.0 and in general .NET Core 3.0 is slower than 4.8 only in a single case:

Method Runtime Toolchain length Mean Ratio Allocated
ctor .NET 4.7.2 net472 ? 15.09 ns 1.00 104 B
ctor .NET Core 3.0 netcoreapp3.0 ? 12.98 ns 0.88 104 B
StringConcat .NET 4.7.2 net472 ? 45,430,416.67 ns 1.00 275971747 B
StringConcat .NET Core 3.0 netcoreapp3.0 ? 48,247,462.50 ns 1.06 275701266 B
StringBuilderAppend .NET 4.7.2 net472 ? 129,789.88 ns 1.00 565002 B
StringBuilderAppend .NET Core 3.0 netcoreapp3.0 ? 126,277.46 ns 0.97 563552 B
Append_ValueTypes .NET 4.7.2 net472 ? 7,435,766.25 ns 1.00 2225087 B
Append_ValueTypes .NET Core 3.0 netcoreapp3.0 ? 2,285,843.80 ns 0.31 932616 B
StringBuilderToString .NET 4.7.2 net472 ? 79,553.35 ns 1.00 550032 B
StringBuilderToString .NET Core 3.0 netcoreapp3.0 ? 80,571.22 ns 1.01 550024 B
AppendMemory .NET 4.7.2 net472 ? 41,637.46 ns 1.00 235515 B
AppendMemory .NET Core 3.0 netcoreapp3.0 ? 29,821.27 ns 0.72 203040 B
Append .NET 4.7.2 net472 0 26,470.82 ns 1.00 105 B
Append .NET Core 3.0 netcoreapp3.0 0 31,749.11 ns 1.20 104 B
ctor_string .NET 4.7.2 net472 100 32.81 ns 1.00 273 B
ctor_string .NET Core 3.0 netcoreapp3.0 100 32.89 ns 1.01 272 B
Append .NET 4.7.2 net472 200 1,068,612.78 ns 1.00 4035648 B
Append .NET Core 3.0 netcoreapp3.0 200 926,691.15 ns 0.87 4028106 B
ctor_string .NET 4.7.2 net472 1000 200.28 ns 1.00 2078 B
ctor_string .NET Core 3.0 netcoreapp3.0 1000 197.00 ns 0.98 2072 B

@Suchiman
Copy link
Contributor Author

Suchiman commented Oct 3, 2019

BenchmarkDotNet was using GC.GetAllocatedBytesForCurrentThread to get the number of allocated bytes

Okay, but there's only a single thread involved here, why is this API not delivering the correct result?

@adamsitnik
Copy link
Member

adamsitnik commented Oct 21, 2019

I was able to narrow down the problem to a small repro that does not involve GC/memory allocations: creating a reusable StringBuilder with a capacity that never needs a resize:

public class Repro
{
    const int length = 50000;

    private StringBuilder _sb;

    [GlobalSetup]
    public void Setup() => _sb = new StringBuilder(length);

    [Benchmark]
    public StringBuilder StringBuilderAddCharOptimized()
    {
        var result = _sb.Clear(); // it just sents the internal index to 0

        for (int i = 0; i < length; i++)
        {
            result.Append('s');
        }

        return result;
    }
}

First of all, this benchmark is sensitive to loop head alignment. As crazy as it sounds, CLR is typically more lucky with the alignment in this particular case. This is not the first time such a problem happens with a microbenchmark: https://github.com/dotnet/coreclr/issues/17932

This can be checked with following BenchmarkDotNet config using an experimental feature flag COMPlus_JitAlignLoops exposed by CoreCLR:

static void Main(string[] args) =>
    BenchmarkRunner.Run<Repro>(
        DefaultConfig.Instance
            .With(Job.Default.With(CsProjCoreToolchain.NetCoreApp30).WithId("3.0 defaults"))
            .With(Job.Default.With(CsProjCoreToolchain.NetCoreApp30).With(new[] { new EnvironmentVariable("COMPlus_JitAlignLoops", "1") }).WithId("3.0 align loops")));
Method Job EnvironmentVariables Toolchain Mean
StringBuilderAddCharOptimized 3.0 align loops COMPlus_JitAlignLoops=1 .NET Core 3.0 106.81 us
StringBuilderAddCharOptimized 3.0 defaults Empty .NET Core 3.0 120.10 us

The more important, deterministic difference is a code-gen difference for 4.8 and 3.0 for the array indexer check:

obraz

I've observed, that if I copy-paste StringBuilder source code and run it as a regular .NET app (not a part of mscorelib), then both Frameworks produce the same assembly code for the Append(char) method.

I've disabled R2R and Zap:

obraz

And confirmed that this causes the difference:

obraz

@jkotas @AndyAyersMS I am definitely not a code-gen expert and I am not sure if I VTune is not showing me skids. Do you think that it makes sense that this code-gen difference can be causing 10% regression? (When I set both env vars the micro-benchmark tells me that there is still a 3% difference, but it can be some noise again)

I am going to send a PR that removes this array bound check, but if there is an R2R|PGO issue then this is not enough. I am not familiar with the internals of R2R|NGen, is it fine if I just create another issue with this small repro?

@AndyAyersMS
Copy link
Member

It looks like in 3.0 codegen we now miss some CSE opportunities and have some redundant loads. This might be the result of a correctness fix as we've had to make CSE more conservative in cases where arrays are obtained indirectly. We should investigate.

That being said, I would not expect the highlighted load to cause a big perf hit. That same location was just loaded. Likely VTune is actually indicating the branch above as the source of the cycle hit and the samples have skidded down to the load.

There's nothing in the disassembly above that indicates the branch should be slower, and the branch should be highly predictable. So it could be some nonlocal interaction (say a collision in the branch predictor with some other branch). Usually these aren't all that durable; this one seems to be. So not sure what is going on.

@jkotas
Copy link
Member

jkotas commented Oct 21, 2019

I am going to send a PR that removes this array bound check

I do not think you will be able to remove this bounds check without compromising security properties of StringBuilder.

@AndyAyersMS
Copy link
Member

@dotnet/jit-contrib we should investigate what lead to this codegen change.

@erozenfeld
Copy link
Member

I'll investigate.

@erozenfeld
Copy link
Member

The codegen difference in these cases is caused by IBC data. Ngen-ing mscorlib on the desktop is using IBC data while other scenarios mentioned above don't. If I set complus_DisableIBC=1 while ngen-ing mscorlib, I get the extra loads. The use of IBC data changes the flowgraph we get when performing CSE and that affects ref counts and CSE profitability checks. In one case we decide the CSE is profitable and in the other case we decide it's unprofitable:

Considering CSE dotnet/coreclr#1 {$242, $240} [def=100, use=50, cost= 4] CSE Expression:
N008 (  4,  4) CSE dotnet/coreclr#1 (def)[000003] ---XG-------              *  IND       ref    <l:$244, c:$243>
N007 (  2,  2)              [000035] -------N----              \--*  ADD       byref  $141
N005 (  1,  1)              [000002] ------------                 +--*  LCL_VAR   ref    V00 this         u:1 $80
N006 (  1,  1)              [000034] ------------                 \--*  CNS_INT   long   8 field offset Fseq[m_ChunkChars] $101

Moderate CSE Promotion (CSE never live at call) (250 >= 150)
cseRefCnt=250, aggressiveRefCnt=300, moderateRefCnt=150
defCnt=100, useCnt=50, cost=4, size=4
def_cost=2, use_cost=1, extra_no_cost=6, extra_yes_cost=0
CSE cost savings check (206 >= 250) fails
Did Not promote this CSE

vs.

Considering CSE dotnet/coreclr#1 {$242, $240} [def=100, use=99, cost= 4] CSE Expression:
N008 (  4,  4) CSE dotnet/coreclr#1 (def)[000003] ---XG-------              *  IND       ref    <l:$244, c:$243>
N007 (  2,  2)              [000035] -------N----              \--*  ADD       byref  $141
N005 (  1,  1)              [000002] ------------                 +--*  LCL_VAR   ref    V00 this         u:1 $80
N006 (  1,  1)              [000034] ------------                 \--*  CNS_INT   long   8 field offset Fseq[m_ChunkChars] $101

Moderate CSE Promotion (CSE never live at call) (299 >= 150)
cseRefCnt=299, aggressiveRefCnt=300, moderateRefCnt=150
defCnt=100, useCnt=99, cost=4, size=4
def_cost=2, use_cost=1, extra_no_cost=6, extra_yes_cost=0
CSE cost savings check (402 >= 299) passes

Promoting CSE:

I'll investigate whether we need to tweak ref counts calculations.

@erozenfeld
Copy link
Member

erozenfeld commented Oct 23, 2019

Looks like the difference that matters without IBC data we assign 0.5 bbWeight to both basic block targets of the first compare; with IBC data we assign 0.99 and 0.01:

G_M46015_IG02:
       8B4E18               mov      ecx, dword ptr [rsi+24]
       4C8B4608             mov      r8, gword ptr [rsi+8]
       413B4808             cmp      ecx, dword ptr [r8+8]
       7D20                 jge      SHORT G_M46015_IG04

G_M46015_IG03:          ;; bbWeight=0.50
       488B4E08             mov      rcx, gword ptr [rsi+8]
       448B4618             mov      r8d, dword ptr [rsi+24]
       418D4001             lea      eax, [r8+1]
       894618               mov      dword ptr [rsi+24], eax
       443B4108             cmp      r8d, dword ptr [rcx+8]
       7326                 jae      SHORT G_M46015_IG07
       4D63C0               movsxd   r8, r8d
       664289544110         mov      word  ptr [rcx+2*r8+16], dx
       EB12                 jmp      SHORT G_M46015_IG05

G_M46015_IG04:          ;; bbWeight=0.50
       0FB7D2               movzx    rdx, dx
       488BCE               mov      rcx, rsi
       41B801000000         mov      r8d, 1
       FF1500000000         call     [System.Text.StringBuilder:Append(ushort,int):ref:this]

vs.

G_M46015_IG02:
       8B4E18               mov      ecx, dword ptr [rsi+24]
       4C8B4608             mov      r8, gword ptr [rsi+8]
       413B4808             cmp      ecx, dword ptr [r8+8]
       7D1E                 jge      SHORT G_M46015_IG06

G_M46015_IG03:          ;; bbWeight=0.99
       8D4101               lea      eax, [rcx+1]
       894618               mov      dword ptr [rsi+24], eax
       413B4808             cmp      ecx, dword ptr [r8+8]
       7326                 jae      SHORT G_M46015_IG07
       4863C9               movsxd   rcx, ecx
       664189544810         mov      word  ptr [r8+2*rcx+16], dx

G_M46015_IG04:
       488BC6               mov      rax, rsi

G_M46015_IG05:
       4883C420             add      rsp, 32
       5E                   pop      rsi
       C3                   ret

G_M46015_IG06:          ;; bbWeight=0.01
       0FB7D2               movzx    rdx, dx
       488BCE               mov      rcx, rsi
       41B801000000         mov      r8d, 1
       FF1500000000         call     [System.Text.StringBuilder:Append(ushort,int):ref:this]
       EBE3                 jmp      SHORT G_M46015_IG04

@erozenfeld
Copy link
Member

That affects CSE profitability calculations.

@AndyAyersMS
Copy link
Member

Interesting -- do we have IBC data for the R2R compiled version in Core? Is it different somehow, or missing?

The fact that IBC only applies when prejitting may lead to more performance anomalies like this. Normally we'd expect jitted code to be faster than R2R'd code, but if IBC data hilights the hot paths only for prejitting, we may end up being less aggressive when jitting and possibly lose perf.

@erozenfeld
Copy link
Member

When I do a normal local build, IBC data is not used when compiling R2R System.Private.CoreLib. I'm not sure if that's intentional. However, it appears IBC data was used when building release 3.0 bits. Looking at System.Private.CoreLib.dll (from https://dotnet.microsoft.com/download/thank-you/dotnet-runtime-3.0.0-windows-x64-binaries) with R2RDump (below) I see no extra loads. @adamsitnik Where you using official release 3.0 core bits when doing your experiments?

  3aab00: 56                    push rsi
                                UWOP_PUSH_NONVOL RSI(6)
  3aab01: 48 83 ec 20           sub rsp, 32
                                UWOP_ALLOC_SMALL 32
  3aab05: 48 8b f1              mov rsi, rcx
  3aab08: 8b 4e 18              mov ecx, dword ptr [rsi + 24]
  3aab0b: 4c 8b 46 08           mov r8, qword ptr [rsi + 8]
  3aab0f: 41 3b 48 08           cmp ecx, dword ptr [r8 + 8]
  3aab13: 7d 1e                 jge 0x3aab33
  3aab15: 8d 41 01              lea eax, [rcx + 1]
  3aab18: 89 46 18              mov dword ptr [rsi + 24], eax
  3aab1b: 41 3b 48 08           cmp ecx, dword ptr [r8 + 8]
  3aab1f: 73 26                 jae 0x3aab47
  3aab21: 48 63 c9              movsxd rcx, ecx
  3aab24: 66 41 89 54 48 10     mov word ptr [r8 + 2*rcx + 16], dx
  3aab2a: 48 8b c6              mov rax, rsi
  3aab2d: 48 83 c4 20           add rsp, 32
  3aab31: 5e                    pop rsi
  3aab32: c3                    ret
  3aab33: 0f b7 d2              movzx edx, dx
  3aab36: 48 8b ce              mov rcx, rsi
  3aab39: 41 b8 01 00 00 00     mov r8d, 1
  3aab3f: ff 15 73 4a c6 ff     call qword ptr [0xf5b8]       // System.Text.StringBuilder System.Text.StringBuilder.Append(Char, Int32) (METHOD_ENTRY_DEF_TOKEN)
  3aab45: eb e3                 jmp 0x3aab2a
  3aab47: e8 8e dc c7 ff        call 0x287da
  3aab4c: cc                    int3

@erozenfeld
Copy link
Member

So, it appears there is no difference in the core vs. desktop performance (when using official core bits). Both official mscorlib and System.Private.CoreLib are built with IBC data and don't have the extra loads. The problem is that we may get worse code when re-jitting without IBC data.

@AndyAyersMS
Copy link
Member

Right, perf tests should (generally) now be measuring jitted code perf on Core.

So Core vs Desktop perf comparisons where NGEN / IBC provided benefits on Desktop may show up as "regressions" in Core vs Desktop numbers.

Would still love for us to find a way to pick up IBC data when jitting. I realize size on disk is a concern.
Would be nice to see numbers. Also it seems like the IBC data stream should be highly compressible.

Perhaps worth a prototype just to see what perf we'd gain today, perhaps that would be motivating. And more perf is possible as the jit isn't all that clever when it has IBC data (eg we don't read in IBC for inlinees...)

@erozenfeld
Copy link
Member

/cc: @davidwrighton

@adamsitnik
Copy link
Member

Where you using official release 3.0 core bits when doing your experiments

Yes, I've just double-checked that.

dotnet --info
PS C:\Users\adsitnik\source\repos\SbProfiling> dotnet --info
.NET Core SDK (reflecting any global.json):
 Version:   3.0.100
 Commit:    04339c3a26

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.18362
 OS Platform: Windows
 RID:         win10-x64
 Base Path:   C:\Program Files\dotnet\sdk\3.0.100\

Host (useful for support):
  Version: 3.0.0
  Commit:  7d57652f33

Project settings:

<PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFrameworks>net472;netcoreapp2.2;netcoreapp3.0</TargetFrameworks>
    <DebugType>pdbonly</DebugType>
    <DebugSymbols>true</DebugSymbols>
    <!--<TieredCompilation>false</TieredCompilation>-->
    <PlatformTarget>x64</PlatformTarget>
  </PropertyGroup>

@adamsitnik
Copy link
Member

So, it appears there is no difference in the core vs. desktop performance (when using official core bits). Both official mscorlib and System.Private.CoreLib are built with IBC data and don't have the extra loads. The problem is that we may get worse code when re-jitting without IBC data.

I've just run VTune again and confirmed the previous results

obraz

So the busiest method is StringBuilder.Append(char) from System.Private.CoreLib.dll (mind the lack of ni)

Does it mean that in Tier 0 we get a precompiled method which was trained using IBC and when the method is recompiled by Tiered JIT and promoted to Tier 1 we don't have this extra infromation and the resulting machine code is worse?

@erozenfeld
Copy link
Member

Does it mean that in Tier 0 we get a precompiled method which was trained using IBC and when the method is recompiled by Tiered JIT and promoted to Tier 1 we don't have this extra infromation and the resulting machine code is worse?

Yes, that is correct.

@adamsitnik
Copy link
Member

I wonder how common such problem can be. How hard would it be to generate a list of System.Private.CoreLib methods that are bigger after re-jitting compared to the output produced by R2R with IBC enabled?

Would it then make sense to extend the Tiered JIT with some simple heuristic like "if the re-jitted code size is bigger than R2R, don't replace it"?

@davidwrighton
Copy link
Member

@adamsitnik I would say that's a tricky heuristic to write. For instance, the tiered JIT is able to inline much more than the R2R code can, which has a high potential to increase code size in a nicely profitable fashion. I believe @briansull is looking at ways for R2R code to claim its good enough to not need recompilation, which I think is a better approach to the same problem.

OTOH, all of this has shown a place where we can provide measurable proof of improvement from IBC block counts in real world code. While I don't think its worthwhile to carry IBC data in its current form in R2R images, I do think we could carry block counts, or a reasonable facsimile thereof without incurring a tremendous size cost. As such I've filed dotnet/coreclr#27511. I think we should consider making such a change for 5.0

@adamsitnik
Copy link
Member

@davidwrighton thank you for a detailed answer. I forgot about inlining, which makes it non-trivial.

The source of the problem has been identified and now it's being tracked in https://github.com/dotnet/coreclr/issues/27511 so I am closing this issue

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 12, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

8 participants