Peculiar string operation performance #13451
Comments
cc @adamsitnik |
Also reported here https://github.com/dotnet/coreclr/issues/25774 |
Seems a duplicate of #13107. Let's continue discussion there. |
@danmosemsft IMO this is a superset of that other ticket, which is exclusive to StringBuilder |
@Suchiman I see your observations about non-StringBuilder appends, but it's not clear to me that there's a problem with those? |
They're consistently slower, even if only by a small margin (between 5% and 18%), and then there's the incorrect measurement of Allocated. |
@Suchiman you're right -- I had read the tables in reverse. |
I am going to investigate it tomorrow |
Hi @Suchiman
You are right. Before .NET Core 3.0 preview 6+ and BenchmarkDotNet 0.11.6, BenchmarkDotNet was using an allocation-measuring API that did not report precise numbers. Using the latest version of BDN from our CI feed and .NET Core 3.0, I get the correct number of allocated bytes: BenchmarkDotNet=v0.11.5.1159-nightly, OS=Windows 10.0.18362
The benchmark should get promoted to Tier 1 during the Warmup or Pilot stage, which is why BDN does not expose an option to disable the Tiered JIT. However, you can achieve that by setting environment variables. |
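For reference, one way to do that (a minimal sketch, assuming a recent BenchmarkDotNet where the AddJob/WithEnvironmentVariable config methods exist; older versions take EnvironmentVariable instances instead) is to pass the variable through a custom config so the child benchmark process runs without tiered compilation:

    using BenchmarkDotNet.Configs;
    using BenchmarkDotNet.Jobs;

    // Sketch: run the out-of-process benchmarks with COMPlus_TieredCompilation=0
    // so the benchmark method is compiled by the fully optimizing JIT from the start.
    public class NoTieringConfig : ManualConfig
    {
        public NoTieringConfig()
        {
            AddJob(Job.Default.WithEnvironmentVariable("COMPlus_TieredCompilation", "0"));
        }
    }

Applying it via [Config(typeof(NoTieringConfig))] on the benchmark class (or passing an instance to BenchmarkRunner.Run) should then take tiering out of the picture for the out-of-process toolchains.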
When it comes to the most important aspect - for your benchmark .NET Core 3.0 is slower than .NET 4.8. I am going to investigate that. |
Also, I wanted to mention that we track the performance of these APIs with dedicated benchmarks. I've run all the benchmarks we have for 4.8 vs 3.0, and in general .NET Core 3.0 is slower than 4.8 only in a single case:
|
Okay, but there's only a single thread involved here, so why is this API not delivering the correct result? |
I was able to narrow down the problem to a small repro that does not involve GC/memory allocations: a reusable StringBuilder is created once in GlobalSetup and only cleared and refilled inside the benchmark:

    public class Repro
    {
        const int length = 50000;

        private StringBuilder _sb;

        [GlobalSetup]
        public void Setup() => _sb = new StringBuilder(length);

        [Benchmark]
        public StringBuilder StringBuilderAddCharOptimized()
        {
            var result = _sb.Clear(); // it just sets the internal index to 0

            for (int i = 0; i < length; i++)
            {
                result.Append('s');
            }

            return result;
        }
    }

First of all, this benchmark is sensitive to loop head alignment. As crazy as it sounds, the CLR is typically luckier with the alignment in this particular case. This is not the first time such a problem has happened with a microbenchmark: https://github.com/dotnet/coreclr/issues/17932 This can be checked with the following BenchmarkDotNet config using an experimental feature flag.
The more important, deterministic difference is a code-gen difference between 4.8 and 3.0 for the array indexer check, which I've also observed when I copy-paste the relevant code. I've disabled R2R and Zap and confirmed that this causes the difference. @jkotas @AndyAyersMS I am definitely not a code-gen expert and I am not sure whether VTune is showing me skids. Do you think it makes sense that this code-gen difference could be causing a 10% regression? (When I set both env vars, the micro-benchmark tells me there is still a 3% difference, but that can be some noise again.) I am going to send a PR that removes this array bound check, but if there is an R2R|PGO issue then this is not enough. I am not familiar with the internals of R2R|NGen; is it fine if I just create another issue with this small repro? |
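For anyone reproducing the experiment, these are the knobs commonly used for this kind of comparison (an assumption about which settings "disabled R2R and Zap" refers to above); setting them on the parent process is enough, since BenchmarkDotNet's child processes inherit its environment:

    using System;

    static class ExperimentSetup
    {
        // Sketch: commonly used CLR environment variables for codegen experiments.
        public static void DisablePrecompiledCode()
        {
            Environment.SetEnvironmentVariable("COMPlus_ZapDisable", "1");        // ignore NGen images (.NET Framework)
            Environment.SetEnvironmentVariable("COMPlus_ReadyToRun", "0");        // ignore ReadyToRun code (.NET Core)
            Environment.SetEnvironmentVariable("COMPlus_TieredCompilation", "0"); // go straight to the optimizing JIT
        }
    }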
It looks like in 3.0 codegen we now miss some CSE opportunities and have some redundant loads. This might be the result of a correctness fix as we've had to make CSE more conservative in cases where arrays are obtained indirectly. We should investigate. That being said, I would not expect the highlighted load to cause a big perf hit. That same location was just loaded. Likely VTune is actually indicating the branch above as the source of the cycle hit and the samples have skidded down to the load. There's nothing in the disassembly above that indicates the branch should be slower, and the branch should be highly predictable. So it could be some nonlocal interaction (say a collision in the branch predictor with some other branch). Usually these aren't all that durable; this one seems to be. So not sure what is going on. |
I do not think you will be able to remove this bounds check without compromising security properties of StringBuilder. |
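To illustrate the point (a simplified stand-in, not the actual CoreLib StringBuilder source), the append fast path follows roughly this pattern, and the range check on the store is load-bearing for memory safety because the length and the array are separate mutable fields that a misbehaving caller could race on:

    using System;

    public sealed class MiniBuilder
    {
        private char[] _chunkChars;
        private int _chunkLength;

        public MiniBuilder(int capacity) => _chunkChars = new char[capacity];

        public MiniBuilder Append(char value)
        {
            char[] chars = _chunkChars;   // snapshot the array
            int index = _chunkLength;     // snapshot the length

            // The unsigned compare proves the store below is in range for the
            // snapshotted array, which is what allows the JIT to omit its own
            // bounds check. Generated code with no range check at all would not
            // be memory-safe if the fields were mutated concurrently (misuse,
            // but it must not corrupt memory).
            if ((uint)index < (uint)chars.Length)
            {
                chars[index] = value;
                _chunkLength = index + 1;
            }
            else
            {
                Grow(value); // slow path: resize, then retry the append
            }
            return this;
        }

        private void Grow(char value)
        {
            Array.Resize(ref _chunkChars, Math.Max(_chunkChars.Length * 2, 4));
            Append(value);
        }
    }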
@dotnet/jit-contrib we should investigate what led to this codegen change. |
I'll investigate. |
The codegen difference in these cases is caused by IBC data. Ngen-ing mscorlib on the desktop uses IBC data, while the other scenarios mentioned above don't. If I set complus_DisableIBC=1 while ngen-ing mscorlib, I get the extra loads. The use of IBC data changes the flowgraph we get when performing CSE, and that affects ref counts and CSE profitability checks. In one case we decide the CSE is profitable and in the other case we decide it's unprofitable:
vs.
I'll investigate whether we need to tweak the ref count calculations. |
Looks like the difference that matters is that without IBC data we assign 0.5 bbWeight to both basic block targets of the first compare; with IBC data we assign 0.99 and 0.01:
vs.
|
That affects CSE profitability calculations. |
Interesting -- do we have IBC data for the R2R compiled version in Core? Is it different somehow, or missing? The fact that IBC only applies when prejitting may lead to more performance anomalies like this. Normally we'd expect jitted code to be faster than R2R'd code, but if IBC data highlights the hot paths only for prejitting, we may end up being less aggressive when jitting and possibly lose perf. |
When I do a normal local build, IBC data is not used when compiling R2R System.Private.CoreLib. I'm not sure if that's intentional. However, it appears IBC data was used when building the release 3.0 bits. Looking at System.Private.CoreLib.dll (from https://dotnet.microsoft.com/download/thank-you/dotnet-runtime-3.0.0-windows-x64-binaries) with R2RDump (below), I see no extra loads. @adamsitnik Were you using official release 3.0 core bits when doing your experiments?
|
So, it appears there is no difference in the core vs. desktop performance (when using official core bits). Both official mscorlib and System.Private.CoreLib are built with IBC data and don't have the extra loads. The problem is that we may get worse code when re-jitting without IBC data. |
Right, perf tests should (generally) now be measuring jitted code perf on Core. So Core vs Desktop perf comparisons where NGEN / IBC provided benefits on Desktop may show up as "regressions" in Core vs Desktop numbers. Would still love for us to find a way to pick up IBC data when jitting. I realize size on disk is a concern. Perhaps worth a prototype just to see what perf we'd gain today, perhaps that would be motivating. And more perf is possible as the jit isn't all that clever when it has IBC data (eg we don't read in IBC for inlinees...) |
/cc: @davidwrighton |
Yes, I've just double-checked that.
dotnet --info
Project settings:
<PropertyGroup>
  <OutputType>Exe</OutputType>
  <TargetFrameworks>net472;netcoreapp2.2;netcoreapp3.0</TargetFrameworks>
  <DebugType>pdbonly</DebugType>
  <DebugSymbols>true</DebugSymbols>
  <!--<TieredCompilation>false</TieredCompilation>-->
  <PlatformTarget>x64</PlatformTarget>
</PropertyGroup> |
I've just run VTune again and confirmed the previous results; the busiest method is the same as before. Does it mean that in Tier 0 we get a precompiled method which was trained using IBC, and when the method is recompiled by the Tiered JIT and promoted to Tier 1 we don't have this extra information, so the resulting machine code is worse? |
Yes, that is correct. |
I wonder how common such a problem can be. How hard would it be to generate a list of System.Private.CoreLib methods that are bigger after re-jitting compared to the output produced by R2R with IBC enabled? Would it then make sense to extend the Tiered JIT with some simple heuristic like "if the re-jitted code size is bigger than R2R, don't replace it"? |
@adamsitnik I would say that's a tricky heuristic to write. For instance, the tiered JIT is able to inline much more than the R2R code can, which has a high potential to increase code size in a nicely profitable fashion. I believe @briansull is looking at ways for R2R code to claim it's good enough to not need recompilation, which I think is a better approach to the same problem. OTOH, all of this has shown a place where we can provide measurable proof of improvement from IBC block counts in real world code. While I don't think it's worthwhile to carry IBC data in its current form in R2R images, I do think we could carry block counts, or a reasonable facsimile thereof, without incurring a tremendous size cost. As such I've filed dotnet/coreclr#27511. I think we should consider making such a change for 5.0. |
@davidwrighton thank you for the detailed answer. I forgot about inlining, which makes it non-trivial. The source of the problem has been identified and it is now being tracked in https://github.com/dotnet/coreclr/issues/27511, so I am closing this issue. |
So I got nerd-sniped on IRC with how fast you can append characters to create a string.
I created the following benchmark to measure a few alternatives (a sketch of it is included below), just before I got told it's for an old MVC4 app. Then I noticed that the performance is consistently worse on .NET Core, in particular for StringBuilder, and even more so if you supply a capacity for the StringBuilder. I then suspected Tiered Compilation, but I couldn't figure out how to really disable it for BDN.NET; <TieredCompilation>false</TieredCompilation> had at best an effect on the InProcess benchmark. Also, it appears that BDN.NET, or the API it uses for measuring the Allocated number, is incorrect on .NET Core; the .NET Framework numbers seem more plausible (allocating a char[] of 50k elements and a string from that char[] certainly doesn't allocate exactly as much as just allocating the string).
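The benchmark code itself is not reproduced in this thread; a minimal sketch of the kind of comparison described (the method names and the exact set of variants are assumptions) could look like this:

    using System.Text;
    using BenchmarkDotNet.Attributes;

    [MemoryDiagnoser]
    public class AppendCharBenchmarks
    {
        private const int Length = 50_000;

        [Benchmark(Baseline = true)]
        public string StringBuilderDefault()
        {
            var sb = new StringBuilder();
            for (int i = 0; i < Length; i++) sb.Append('s');
            return sb.ToString();
        }

        [Benchmark]
        public string StringBuilderWithCapacity()
        {
            var sb = new StringBuilder(Length);
            for (int i = 0; i < Length; i++) sb.Append('s');
            return sb.ToString();
        }

        [Benchmark]
        public string CharArrayThenString()
        {
            var chars = new char[Length];
            for (int i = 0; i < Length; i++) chars[i] = 's';
            return new string(chars);
        }
    }

The char[] variant is the one whose Allocated column should clearly exceed a bare string allocation, which is why identical numbers for it looked suspicious.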
.NET Framework 4.7.2
.NET Core 3.0 RC1 (OutOfProcess)
.NET Core 3.0 RC1 (InProcess)