Peculiar string operation performance #13451
Comments
cc @adamsitnik |
Also reported here https://github.com/dotnet/coreclr/issues/25774 |
Seems a duplicate of #13107. Let's continue discussion there. |
@danmosemsft IMO this is a superset of that other ticket, which is exclusive to StringBuilder |
@Suchiman I see your observations about non-StringBuilder appends, but it's not clear to me that there's a problem with those? |
They're consistently slower, even if only by a small margin (between 5% and 18%), and then there's the incorrect measurement of Allocated. |
@Suchiman you're right -- I had read the tables in reverse. |
I am going to investigate it tomorrow |
Hi @Suchiman
You are right. Before .NET Core 3.0 preview 6+ and BenchmarkDotNet 0.11.6, BenchmarkDotNet was using an allocation-measuring API that did not report precise numbers. Using the latest version of BDN from our CI feed and .NET Core 3.0, I get the correct number of allocated bytes: BenchmarkDotNet=v0.11.5.1159-nightly, OS=Windows 10.0.18362
The benchmark should get promoted to Tier 1 during the Warmup or Pilot stage, which is why BDN does not expose an option to disable the Tiered JIT. However, you can achieve that by setting environment variables. |
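For reference, one way to do that (a minimal sketch, assuming a recent BenchmarkDotNet where the AddJob/WithEnvironmentVariable config methods exist; older versions take EnvironmentVariable instances instead) is to pass the variable through a custom config so the child benchmark process runs without tiered compilation:

    using BenchmarkDotNet.Configs;
    using BenchmarkDotNet.Jobs;

    // Sketch: run the out-of-process benchmarks with COMPlus_TieredCompilation=0
    // so the benchmark method is compiled by the fully optimizing JIT from the start.
    public class NoTieringConfig : ManualConfig
    {
        public NoTieringConfig()
        {
            AddJob(Job.Default.WithEnvironmentVariable("COMPlus_TieredCompilation", "0"));
        }
    }

Applying it via [Config(typeof(NoTieringConfig))] on the benchmark class (or passing an instance to BenchmarkRunner.Run) should then take tiering out of the picture for the out-of-process toolchains.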
When it comes to the most important aspect - for your benchmark .NET Core 3.0 is slower than .NET 4.8. I am going to investigate that. |
Also, I wanted to mention that we track the performance of these APIs with dedicated benchmarks. I've run all the benchmarks we have for 4.8 vs 3.0, and in general .NET Core 3.0 is slower than 4.8 only in a single case:
|
Okay, but there's only a single thread involved here, so why is this API not delivering the correct result? |
I was able to narrow down the problem to a small repro that does not involve GC/memory allocations: a reusable StringBuilder is created once in GlobalSetup and only cleared and refilled inside the benchmark:

    public class Repro
    {
        const int length = 50000;

        private StringBuilder _sb;

        [GlobalSetup]
        public void Setup() => _sb = new StringBuilder(length);

        [Benchmark]
        public StringBuilder StringBuilderAddCharOptimized()
        {
            var result = _sb.Clear(); // it just sets the internal index to 0

            for (int i = 0; i < length; i++)
            {
                result.Append('s');
            }

            return result;
        }
    }

First of all, this benchmark is sensitive to loop head alignment. As crazy as it sounds, the CLR is typically luckier with the alignment in this particular case. This is not the first time such a problem has happened with a microbenchmark: https://github.com/dotnet/coreclr/issues/17932 This can be checked with the following BenchmarkDotNet config using an experimental feature flag.
The more important, deterministic difference is a code-gen difference between 4.8 and 3.0 for the array indexer check, which I've also observed when I copy-paste the relevant code. I've disabled R2R and Zap and confirmed that this causes the difference. @jkotas @AndyAyersMS I am definitely not a code-gen expert and I am not sure whether VTune is showing me skids. Do you think it makes sense that this code-gen difference could be causing a 10% regression? (When I set both env vars, the micro-benchmark tells me there is still a 3% difference, but that can be some noise again.) I am going to send a PR that removes this array bound check, but if there is an R2R|PGO issue then this is not enough. I am not familiar with the internals of R2R|NGen; is it fine if I just create another issue with this small repro? |
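For anyone reproducing the experiment, these are the knobs commonly used for this kind of comparison (an assumption about which settings "disabled R2R and Zap" refers to above); setting them on the parent process is enough, since BenchmarkDotNet's child processes inherit its environment:

    using System;

    static class ExperimentSetup
    {
        // Sketch: commonly used CLR environment variables for codegen experiments.
        public static void DisablePrecompiledCode()
        {
            Environment.SetEnvironmentVariable("COMPlus_ZapDisable", "1");        // ignore NGen images (.NET Framework)
            Environment.SetEnvironmentVariable("COMPlus_ReadyToRun", "0");        // ignore ReadyToRun code (.NET Core)
            Environment.SetEnvironmentVariable("COMPlus_TieredCompilation", "0"); // go straight to the optimizing JIT
        }
    }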
It looks like in 3.0 codegen we now miss some CSE opportunities and have some redundant loads. This might be the result of a correctness fix as we've had to make CSE more conservative in cases where arrays are obtained indirectly. We should investigate. That being said, I would not expect the highlighted load to cause a big perf hit. That same location was just loaded. Likely VTune is actually indicating the branch above as the source of the cycle hit and the samples have skidded down to the load. There's nothing in the disassembly above that indicates the branch should be slower, and the branch should be highly predictable. So it could be some nonlocal interaction (say a collision in the branch predictor with some other branch). Usually these aren't all that durable; this one seems to be. So not sure what is going on. |
I do not think you will be able to remove this bounds check without compromising security properties of StringBuilder. |
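To illustrate the point (a simplified stand-in, not the actual CoreLib StringBuilder source), the append fast path follows roughly this pattern, and the range check on the store is load-bearing for memory safety because the length and the array are separate mutable fields that a misbehaving caller could race on:

    using System;

    public sealed class MiniBuilder
    {
        private char[] _chunkChars;
        private int _chunkLength;

        public MiniBuilder(int capacity) => _chunkChars = new char[capacity];

        public MiniBuilder Append(char value)
        {
            char[] chars = _chunkChars;   // snapshot the array
            int index = _chunkLength;     // snapshot the length

            // The unsigned compare proves the store below is in range for the
            // snapshotted array, which is what allows the JIT to omit its own
            // bounds check. Generated code with no range check at all would not
            // be memory-safe if the fields were mutated concurrently (misuse,
            // but it must not corrupt memory).
            if ((uint)index < (uint)chars.Length)
            {
                chars[index] = value;
                _chunkLength = index + 1;
            }
            else
            {
                Grow(value); // slow path: resize, then retry the append
            }
            return this;
        }

        private void Grow(char value)
        {
            Array.Resize(ref _chunkChars, Math.Max(_chunkChars.Length * 2, 4));
            Append(value);
        }
    }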
@dotnet/jit-contrib we should investigate what led to this codegen change. |
I'll investigate. |
The codegen difference in these cases is caused by IBC data. Ngen-ing mscorlib on the desktop uses IBC data, while the other scenarios mentioned above don't. If I set complus_DisableIBC=1 while ngen-ing mscorlib, I get the extra loads. The use of IBC data changes the flowgraph we get when performing CSE, and that affects ref counts and CSE profitability checks. In one case we decide the CSE is profitable and in the other case we decide it's unprofitable:
vs.
I'll investigate whether we need to tweak the ref count calculations. |
Looks like the difference that matters is that without IBC data we assign 0.5 bbWeight to both basic block targets of the first compare; with IBC data we assign 0.99 and 0.01:
vs.
|
That affects CSE profitability calculations. |
Interesting -- do we have IBC data for the R2R compiled version in Core? Is it different somehow, or missing? The fact that IBC only applies when prejitting may lead to more performance anomalies like this. Normally we'd expect jitted code to be faster than R2R'd code, but if IBC data highlights the hot paths only for prejitting, we may end up being less aggressive when jitting and possibly lose perf. |
When I do a normal local build, IBC data is not used when compiling R2R System.Private.CoreLib. I'm not sure if that's intentional. However, it appears IBC data was used when building the release 3.0 bits. Looking at System.Private.CoreLib.dll (from https://dotnet.microsoft.com/download/thank-you/dotnet-runtime-3.0.0-windows-x64-binaries) with R2RDump (below), I see no extra loads. @adamsitnik Were you using official release 3.0 core bits when doing your experiments?
|
So, it appears there is no difference in the core vs. desktop performance (when using official core bits). Both official mscorlib and System.Private.CoreLib are built with IBC data and don't have the extra loads. The problem is that we may get worse code when re-jitting without IBC data. |
Right, perf tests should (generally) now be measuring jitted code perf on Core. So Core vs Desktop perf comparisons where NGEN / IBC provided benefits on Desktop may show up as "regressions" in Core vs Desktop numbers. Would still love for us to find a way to pick up IBC data when jitting. I realize size on disk is a concern. Perhaps worth a prototype just to see what perf we'd gain today, perhaps that would be motivating. And more perf is possible as the jit isn't all that clever when it has IBC data (eg we don't read in IBC for inlinees...) |
/cc: @davidwrighton |
Yes, I've just double-checked that.
dotnet --info
Project settings:
<PropertyGroup>
  <OutputType>Exe</OutputType>
  <TargetFrameworks>net472;netcoreapp2.2;netcoreapp3.0</TargetFrameworks>
  <DebugType>pdbonly</DebugType>
  <DebugSymbols>true</DebugSymbols>
  <!--<TieredCompilation>false</TieredCompilation>-->
  <PlatformTarget>x64</PlatformTarget>
</PropertyGroup> |
I've just run VTune again and confirmed the previous results; the busiest method is the same as before. Does it mean that in Tier 0 we get a precompiled method which was trained using IBC, and when the method is recompiled by the Tiered JIT and promoted to Tier 1 we don't have this extra information, so the resulting machine code is worse? |
Yes, that is correct. |
I wonder how common such a problem can be. How hard would it be to generate a list of System.Private.CoreLib methods that are bigger after re-jitting compared to the output produced by R2R with IBC enabled? Would it then make sense to extend the Tiered JIT with some simple heuristic like "if the re-jitted code size is bigger than R2R, don't replace it"? |
@adamsitnik I would say that's a tricky heuristic to write. For instance, the tiered JIT is able to inline much more than the R2R code can, which has a high potential to increase code size in a nicely profitable fashion. I believe @briansull is looking at ways for R2R code to claim it's good enough to not need recompilation, which I think is a better approach to the same problem. OTOH, all of this has shown a place where we can provide measurable proof of improvement from IBC block counts in real world code. While I don't think it's worthwhile to carry IBC data in its current form in R2R images, I do think we could carry block counts, or a reasonable facsimile thereof, without incurring a tremendous size cost. As such I've filed dotnet/coreclr#27511. I think we should consider making such a change for 5.0. |
@davidwrighton thank you for the detailed answer. I forgot about inlining, which makes it non-trivial. The source of the problem has been identified and it is now being tracked in https://github.com/dotnet/coreclr/issues/27511, so I am closing this issue. |
So I got nerd-sniped on IRC with how fast you can append characters to create a string.
I created the following benchmark to measure a few alternatives (a sketch of it is included below), just before I got told it's for an old MVC4 app. Then I noticed that the performance is consistently worse on .NET Core, in particular for StringBuilder, and even more so if you supply a capacity for the StringBuilder. I then suspected Tiered Compilation, but I couldn't figure out how to really disable it for BDN.NET; <TieredCompilation>false</TieredCompilation> had at best an effect on the InProcess benchmark. Also, it appears that BDN.NET, or the API it uses for measuring the Allocated number, is incorrect on .NET Core; the .NET Framework numbers seem more plausible (allocating a char[] of 50k elements and a string from that char[] certainly doesn't allocate exactly as much as just allocating the string).
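The benchmark code itself is not reproduced in this thread; a minimal sketch of the kind of comparison described (the method names and the exact set of variants are assumptions) could look like this:

    using System.Text;
    using BenchmarkDotNet.Attributes;

    [MemoryDiagnoser]
    public class AppendCharBenchmarks
    {
        private const int Length = 50_000;

        [Benchmark(Baseline = true)]
        public string StringBuilderDefault()
        {
            var sb = new StringBuilder();
            for (int i = 0; i < Length; i++) sb.Append('s');
            return sb.ToString();
        }

        [Benchmark]
        public string StringBuilderWithCapacity()
        {
            var sb = new StringBuilder(Length);
            for (int i = 0; i < Length; i++) sb.Append('s');
            return sb.ToString();
        }

        [Benchmark]
        public string CharArrayThenString()
        {
            var chars = new char[Length];
            for (int i = 0; i < Length; i++) chars[i] = 's';
            return new string(chars);
        }
    }

The char[] variant is the one whose Allocated column should clearly exceed a bare string allocation, which is why identical numbers for it looked suspicious.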
.NET Framework 4.7.2
.NET Core 3.0 RC1 (OutOfProcess)
.NET Core 3.0 RC1 (InProcess)