JIT: Try more optimal fixpoint computation strategies for liveness #86527
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details: We can use an RPO traversal in liveness to try to minimize the number of iterations we need, as suggested by @AndyAyersMS in #86043 (comment). Another interesting experiment would be to compute strongly connected components and compute liveness within each SCC.
Since liveness runs backwards, we likely want just a plain postorder (successors before preds), not a reverse postorder (preds before successors).
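To illustrate why a postorder traversal suits a backward analysis, here is a minimal sketch of block-level liveness over a toy acyclic CFG. The block names and USE/DEF sets are made up for illustration; the point is that visiting each block after its successors lets one sweep propagate facts all the way back, so the fixpoint loop settles in two sweeps (one that changes things, one that confirms stability):

```python
# Toy CFG: A -> {B, C}, B -> D, C -> D (hypothetical blocks/variables).
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
use  = {"A": set(), "B": {"x"}, "C": set(), "D": {"y"}}
defs = {"A": {"x", "y"}, "B": set(), "C": {"y"}, "D": set()}

def postorder(root):
    """DFS postorder: every block appears after all its (reachable) successors."""
    order, seen = [], set()
    def dfs(b):
        seen.add(b)
        for s in succ[b]:
            if s not in seen:
                dfs(s)
        order.append(b)
    dfs(root)
    return order  # e.g. [D, B, C, A]

def liveness(order):
    live_in = {b: set() for b in succ}
    sweeps, changed = 0, True
    while changed:
        changed = False
        sweeps += 1
        for b in order:
            # live-out is the union of successors' live-in sets.
            live_out = set().union(*(live_in[s] for s in succ[b]))
            new_in = use[b] | (live_out - defs[b])
            if new_in != live_in[b]:
                live_in[b] = new_in
                changed = True
    return live_in, sweeps

live_in, sweeps = liveness(postorder("A"))
```

With this visit order the acyclic graph converges after a single changing sweep; a preds-first (reverse postorder) visit would instead need one extra sweep per edge in the longest path to push facts backwards.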
https://github.com/dotnet/runtime/compare/main...jakobbotsch:runtime:liveness-fixpoint-strategies?expand=1 has an experimental version of the SCC approach, computed with Tarjan's algorithm. While finding the SCCs themselves is not too bad (code-wise), actually iterating the successors that may affect liveness is very complicated and very expensive. I had to add a visitor-based successor function to not lose all the throughput wins. The throughput improvement is pretty modest:

Base: 97346934422, Diff: 97032570982, -0.3229%

```
VisitAllSuccessors<`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1> > : 200204050 : NA : 8.64% : +0.2057%
?fgLocalVarLivenessFindSCC@Compiler@@QEAAXPEAUSCCNode@@AEAU2@AEAHAEAV?$ArrayStack@PEAUSCCNode@@@@@Z : 156037841 : NA : 6.73% : +0.1603%
?Push@?$ArrayStack@PEAUGenTree@@@@QEAAXPEAUGenTree@@@Z : 146341436 : +72.21% : 6.31% : +0.1503%
VisitLivenessSuccessors<`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1> > : 125829706 : NA : 5.43% : +0.1293%
`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1>::operator() : 121235726 : NA : 5.23% : +0.1245%
VisitEHSuccessors<`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1> > : 85531421 : NA : 3.69% : +0.0879%
?fgLocalVarLivenessFindSCCs@Compiler@@QEAAXXZ : 82035143 : NA : 3.54% : +0.0843%
VisitSuccessorsEHSuccessors<`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1> > : 70136763 : NA : 3.03% : +0.0720%
?allocateMemory@ArenaAllocator@@QEAAPEAX_K@Z : 8903876 : +0.43% : 0.38% : +0.0091%
memset : 2889512 : +0.40% : 0.12% : +0.0030%
?Assign@?$BitSetOps@PEA_K$00PEAVCompiler@@VTrackedVarBitSetTraits@@@@SAXPEAVCompiler@@AEAPEA_KPEA_K@Z : -4796448 : -1.50% : 0.21% : -0.0049%
?fgGetHandlerLiveVars@Compiler@@QEAAPEA_KPEAUBasicBlock@@@Z : -7906158 : -10.88% : 0.34% : -0.0081%
?LivenessDLong@?$BitSetOps@PEA_K$00PEAVCompiler@@VTrackedVarBitSetTraits@@@@CAXPEAVCompiler@@AEAPEA_KQEA_K22@Z : -23881701 : -13.93% : 1.03% : -0.0245%
?ehGetBlockExnFlowDsc@Compiler@@QEAAPEAUEHblkDsc@@PEAUBasicBlock@@@Z : -32483225 : -17.34% : 1.40% : -0.0334%
?ClearD@?$BitSetOps@PEA_K$00PEAVCompiler@@VTrackedVarBitSetTraits@@@@SAXPEAVCompiler@@AEAPEA_K@Z : -62872657 : -25.52% : 2.71% : -0.0646%
?Advance@AllSuccessorIterPosition@@QEAAXPEAVCompiler@@PEAUBasicBlock@@@Z : -78804203 : -16.77% : 3.40% : -0.0810%
?NumSucc@BasicBlock@@QEAAIPEAVCompiler@@@Z : -99404116 : -14.97% : 4.29% : -0.1021%
?FindNextRegSuccTry@EHSuccessorIterPosition@@AEAAXPEAVCompiler@@PEAUBasicBlock@@@Z : -109928534 : -16.83% : 4.74% : -0.1129%
??0AllSuccessorIterPosition@@QEAA@PEAVCompiler@@PEAUBasicBlock@@@Z : -128716399 : -18.45% : 5.55% : -0.1322%
?GetSucc@BasicBlock@@QEAAPEAU1@IPEAVCompiler@@@Z : -145512470 : -14.04% : 6.28% : -0.1495%
?PerBlockAnalysis@LiveVarAnalysis@@AEAA_NPEAUBasicBlock@@_N1@Z : -618437434 : -32.10% : 26.68% : -0.6353%
```

The SCC computation would also need to be written in an iterative version. Doesn't seem like it's worth it, though maybe it'd be worth it to look at switching the rest of the JIT away from the iterator-based successor enumeration.
Some idle thoughts:
I tried to switch liveness to use a PO traversal as part of #95085, to see if it would make up for some of the throughput cost. The code is here: https://github.com/jakobbotsch/runtime/pull/new/dead-block-elimination-liveness-post-order

We might be able to be even more efficient by using the loop table, at least in the future if we end up with the more "general" loop finding in the middle end through #95251. If all cycles are natural loops, then we can iterate their SCCs to completion before proceeding with other blocks. However, iterating (only) the loop blocks repeatedly will be more expensive than iterating the full post-order, since the loop blocks are stored as a bit vector while the full post-order is much more efficient to access.
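The "iterate each SCC to completion before proceeding" idea can be sketched as follows. Assuming the SCCs are already available in successors-first order, each cyclic SCC is run to its own local fixpoint before any earlier block is visited, so the outer pass needs only a single sweep over the SCC list. Block names and USE/DEF sets here are hypothetical:

```python
# Toy CFG with a cycle: A -> B, B -> C, C -> {B, D} (B/C form a loop).
succ = {"A": ["B"], "B": ["C"], "C": ["B", "D"], "D": []}
use  = {"A": set(), "B": {"x"}, "C": set(), "D": {"r"}}
defs = {"A": {"x"}, "B": set(), "C": set(), "D": set()}

# SCCs listed successors-first (reverse topological order of the condensation),
# as e.g. Tarjan's algorithm naturally produces them.
sccs = [["D"], ["C", "B"], ["A"]]

live_in = {b: set() for b in succ}

def transfer(b):
    """Standard backward liveness transfer: in = use | (out - def)."""
    live_out = set().union(*(live_in[s] for s in succ[b]))
    return use[b] | (live_out - defs[b])

for scc in sccs:
    changed = True
    while changed:              # iterate this SCC to its own fixpoint...
        changed = False
        for b in scc:
            new_in = transfer(b)
            if new_in != live_in[b]:
                live_in[b] = new_in
                changed = True
    # ...before ever touching a predecessor SCC; one outer sweep suffices.
```

Singleton SCCs without a self-edge stabilize in one inner pass, so the extra inner loop only costs anything on actual cycles, which is where the repeated iteration is needed anyway.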
It would be nice to be able to efficiently use bit vectors as enumerable sets of blocks, but that requires keeping a mapping from bit vector indices back to blocks, and it would be nice if that mapping persisted and didn't need to be rebuilt all the time, so we could amortize its cost. Likely this means using …
I am using the post-order index for the bit vectors in #95251, coupled with storing the post-order itself in a reachable place (the DFS tree contains it and is referenced by the loop class). It means you can add and remove blocks from the flow graph itself while still being able to map back using the stored post-order array. Of course, if you are adding and removing blocks from the flow graph then you are invalidating the information in the DFS tree and loop structures, but that is to be expected, and each individual pass needs to deal with that. I think the DFS and loop identification is cheap enough that just rebuilding the structures at the end of the phase is something we might seriously consider.
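The index scheme described above can be sketched like this: each block's post-order number is its bit index, and the stored post-order array maps set bits back to blocks, so the mapping survives flow-graph edits as long as the array itself is retained. The names here are illustrative, not the JIT's actual data structures:

```python
# Hypothetical stored post-order: post_order[i] is the block with bit index i.
post_order = ["D", "C", "B", "A"]
index_of = {b: i for i, b in enumerate(post_order)}

def make_set(blocks):
    """Build a bit vector (as a Python int) from a collection of blocks."""
    bits = 0
    for b in blocks:
        bits |= 1 << index_of[b]
    return bits

def enumerate_blocks(bits):
    """Map set bits back to blocks via the stored post-order array."""
    while bits:
        low = bits & -bits                  # isolate lowest set bit
        yield post_order[low.bit_length() - 1]
        bits ^= low                         # clear it
```

Removing a block from the flow graph leaves the remaining bits' meaning intact (its slot in the array simply goes unused), which is what makes the persistent mapping attractive; only a fresh DFS invalidates the numbering wholesale.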
If recomputing the graph "metadata" becomes cheap enough then that's certainly very appealing, since, as you say, the other downside of a persistent representation is keeping it accurate over time.
This was fixed by #103809; we're now using an RPO during the dataflow.