
JIT: Try more optimal fixpoint computation strategies for liveness #86527

Closed
jakobbotsch opened this issue May 19, 2023 · 9 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
@jakobbotsch
Member

jakobbotsch commented May 19, 2023

We can use an RPO traversal in liveness to try to minimize the number of iterations we need, as suggested by @AndyAyersMS in #86043 (comment).

Another interesting experiment would be to compute strongly connected components and compute the fixpoint within each SCC.
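As a rough illustration of the traversal-order idea (plain Python, not JIT code; the CFG, block names, and use/def sets below are made up): for a backward problem like liveness, sweeping blocks in postorder visits successors before predecessors, so most facts propagate in a single sweep and the fixpoint loop terminates quickly even with a back edge.

```python
# Illustrative backward-liveness fixpoint; blocks and use/def sets are
# hypothetical, not taken from the JIT.

def postorder(succs, entry):
    """DFS postorder: a block is emitted after all of its DFS successors."""
    order, seen = [], {entry}
    def walk(b):
        for s in succs[b]:
            if s not in seen:
                seen.add(s)
                walk(s)
        order.append(b)
    walk(entry)
    return order

def liveness(succs, use, defs, entry):
    order = postorder(succs, entry)  # successors before predecessors
    live_in = {b: set() for b in succs}
    live_out = {b: set() for b in succs}
    sweeps, changed = 0, True
    while changed:
        changed, sweeps = False, sweeps + 1
        for b in order:
            out = set().union(*(live_in[s] for s in succs[b]))
            inn = use[b] | (out - defs[b])
            if out != live_out[b] or inn != live_in[b]:
                live_out[b], live_in[b] = out, inn
                changed = True
    return live_in, live_out, sweeps

# A -> B -> C -> D, with a C -> B back edge; x defined in A, used in C.
succs = {'A': ['B'], 'B': ['C'], 'C': ['B', 'D'], 'D': []}
use   = {'A': set(), 'B': set(), 'C': {'x'}, 'D': set()}
defs  = {'A': {'x'}, 'B': set(), 'C': set(), 'D': set()}
live_in, live_out, sweeps = liveness(succs, use, defs, 'A')
```

Even with the C → B cycle, the postorder sweep converges after a handful of passes; an RPO sweep (the reversed order) would be the analogous choice for a forward problem.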

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 19, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label May 19, 2023
@ghost

ghost commented May 19, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@jakobbotsch jakobbotsch added this to the Future milestone May 19, 2023
@jakobbotsch jakobbotsch removed the untriaged New issue has not been triaged by the area owner label May 19, 2023
@AndyAyersMS
Member

Since liveness runs backwards, we likely want just a plain postorder (successors before preds), not a reverse postorder (preds before successors).
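A tiny sketch of the distinction (illustrative Python with a made-up diamond CFG, not JIT code): postorder emits successors before predecessors, and RPO is simply that order reversed.

```python
def dfs_postorder(succs, entry):
    """Return blocks in DFS postorder: successors before predecessors."""
    order, seen = [], {entry}
    def walk(b):
        for s in succs[b]:
            if s not in seen:
                seen.add(s)
                walk(s)
        order.append(b)
    walk(entry)
    return order

# Hypothetical diamond CFG: A -> {B, C} -> D.
succs = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}
po  = dfs_postorder(succs, 'A')   # successors first: suits backward dataflow
rpo = list(reversed(po))          # predecessors first: suits forward dataflow
```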

@jakobbotsch
Member Author

https://github.com/dotnet/runtime/compare/main...jakobbotsch:runtime:liveness-fixpoint-strategies?expand=1 has an experimental version of the SCC approach, computed with Tarjan's algorithm.

While finding the SCCs themselves is not too bad code-wise, actually iterating the successors that may affect liveness is very complicated and very expensive. I had to add a visitor-based successor function to avoid losing all the throughput wins to the extra GetAllSuccs calls made while finding SCCs, and the successor code is still showing up at the top of the profile.

The throughput improvement is pretty modest:

Base: 97346934422, Diff: 97032570982, -0.3229%

VisitAllSuccessors<`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1> >                                    : 200204050  : NA      : 8.64%  : +0.2057%
?fgLocalVarLivenessFindSCC@Compiler@@QEAAXPEAUSCCNode@@AEAU2@AEAHAEAV?$ArrayStack@PEAUSCCNode@@@@@Z            : 156037841  : NA      : 6.73%  : +0.1603%
?Push@?$ArrayStack@PEAUGenTree@@@@QEAAXPEAUGenTree@@@Z                                                         : 146341436  : +72.21% : 6.31%  : +0.1503%
VisitLivenessSuccessors<`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1> >                               : 125829706  : NA      : 5.43%  : +0.1293%
`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1>::operator()                                             : 121235726  : NA      : 5.23%  : +0.1245%
VisitEHSuccessors<`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1> >                                     : 85531421   : NA      : 3.69%  : +0.0879%
?fgLocalVarLivenessFindSCCs@Compiler@@QEAAXXZ                                                                  : 82035143   : NA      : 3.54%  : +0.0843%
VisitSuccessorsEHSuccessors<`Compiler::fgLocalVarLivenessFindSCC'::`2'::<lambda_1> >                           : 70136763   : NA      : 3.03%  : +0.0720%
?allocateMemory@ArenaAllocator@@QEAAPEAX_K@Z                                                                   : 8903876    : +0.43%  : 0.38%  : +0.0091%
memset                                                                                                         : 2889512    : +0.40%  : 0.12%  : +0.0030%
?Assign@?$BitSetOps@PEA_K$00PEAVCompiler@@VTrackedVarBitSetTraits@@@@SAXPEAVCompiler@@AEAPEA_KPEA_K@Z          : -4796448   : -1.50%  : 0.21%  : -0.0049%
?fgGetHandlerLiveVars@Compiler@@QEAAPEA_KPEAUBasicBlock@@@Z                                                    : -7906158   : -10.88% : 0.34%  : -0.0081%
?LivenessDLong@?$BitSetOps@PEA_K$00PEAVCompiler@@VTrackedVarBitSetTraits@@@@CAXPEAVCompiler@@AEAPEA_KQEA_K22@Z : -23881701  : -13.93% : 1.03%  : -0.0245%
?ehGetBlockExnFlowDsc@Compiler@@QEAAPEAUEHblkDsc@@PEAUBasicBlock@@@Z                                           : -32483225  : -17.34% : 1.40%  : -0.0334%
?ClearD@?$BitSetOps@PEA_K$00PEAVCompiler@@VTrackedVarBitSetTraits@@@@SAXPEAVCompiler@@AEAPEA_K@Z               : -62872657  : -25.52% : 2.71%  : -0.0646%
?Advance@AllSuccessorIterPosition@@QEAAXPEAVCompiler@@PEAUBasicBlock@@@Z                                       : -78804203  : -16.77% : 3.40%  : -0.0810%
?NumSucc@BasicBlock@@QEAAIPEAVCompiler@@@Z                                                                     : -99404116  : -14.97% : 4.29%  : -0.1021%
?FindNextRegSuccTry@EHSuccessorIterPosition@@AEAAXPEAVCompiler@@PEAUBasicBlock@@@Z                             : -109928534 : -16.83% : 4.74%  : -0.1129%
??0AllSuccessorIterPosition@@QEAA@PEAVCompiler@@PEAUBasicBlock@@@Z                                             : -128716399 : -18.45% : 5.55%  : -0.1322%
?GetSucc@BasicBlock@@QEAAPEAU1@IPEAVCompiler@@@Z                                                               : -145512470 : -14.04% : 6.28%  : -0.1495%
?PerBlockAnalysis@LiveVarAnalysis@@AEAA_NPEAUBasicBlock@@_N1@Z                                                 : -618437434 : -32.10% : 26.68% : -0.6353%

It would also need to be rewritten in an iterative form. It doesn't seem worth it, though it might be worth looking at switching the rest of the JIT away from GetAllSuccs to this visitor-based function instead.
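Since the linked experiment uses Tarjan's algorithm and the comment above notes a recursive version would need to be made iterative, here is a generic iterative Tarjan sketch in Python (not the JIT's implementation; the graph is hypothetical). Each frame keeps its successor iterator so the DFS can resume after descending into a child.

```python
def tarjan_sccs(succs):
    """Iterative Tarjan's SCC algorithm; emits SCCs in reverse topological order."""
    index, low = {}, {}
    stack, on_stack = [], set()
    sccs, counter = [], 0

    for root in succs:
        if root in index:
            continue
        index[root] = low[root] = counter
        counter += 1
        stack.append(root)
        on_stack.add(root)
        work = [(root, iter(succs[root]))]
        while work:
            node, it = work[-1]
            advanced = False
            for s in it:
                if s not in index:          # tree edge: descend into s
                    index[s] = low[s] = counter
                    counter += 1
                    stack.append(s)
                    on_stack.add(s)
                    work.append((s, iter(succs[s])))
                    advanced = True
                    break
                if s in on_stack:           # edge back into the current stack
                    low[node] = min(low[node], index[s])
            if advanced:
                continue
            work.pop()                      # node's successors are exhausted
            if work:
                parent = work[-1][0]
                low[parent] = min(low[parent], low[node])
            if low[node] == index[node]:    # node is an SCC root
                scc = []
                while True:
                    w = stack.pop()
                    on_stack.discard(w)
                    scc.append(w)
                    if w == node:
                        break
                sccs.append(scc)
    return sccs

# Hypothetical graph: cycle A -> B -> C -> A, plus an exit edge C -> D.
g = {'A': ['B'], 'B': ['C'], 'C': ['A', 'D'], 'D': []}
sccs = tarjan_sccs(g)
```

A convenient property for a backward analysis like liveness: Tarjan emits SCCs in reverse topological order (here `D` before `{A, B, C}`), so each component's successors are already complete when the component's fixpoint is computed.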

@AndyAyersMS
Member

AndyAyersMS commented May 20, 2023

Some idle thoughts:

  • What is the overall cost of liveness (or, what is the % improvement in just liveness from the above)? If it is small, then none of the rest below will matter much. Alternatively, if liveness turns out to be cheap, we could treat that as an opportunity to further boost the tracked-var limit...
  • I wonder if it's worth memoizing the successor traversal. Say we keep a bbNum -> block map, and then a bit vector per node with the successor indices. Or we just allocate a bespoke per-block array that we allocate/populate the first time through.
  • I was thinking postorder might work out because it wouldn't impact much else, even if the wins are modest.
  • Within an SCC there is still the question of ensuring a good traversal order (so you would likely want to run postorder inside the SCC).
  • There should be various monotonicity guarantees within liveness (e.g. live sets never shrink as we iterate), and if we were sufficiently motivated (say, making the various bit set ops return a bool indicating whether any bit changed) we could operate directly on the block vectors rather than on scratch copies. That would save a bit of work zeroing, comparing, and copying bit vectors.
  • Another thought along those lines is to have blocks push their live-ins to their predecessors rather than having a block pull live-ins from its successors, since we have a convenient (though perhaps incomplete) list of predecessors. But then we'd need a per-block flag to detect whether any of those pushes changed the pred's live-out, so we would know to revisit that pred.
  • At some point (with sufficiently large index sets) we ought to consider sparse bit vectors. Only so many things can be live at any given place in the code. It might be interesting to compute density metrics and see what they tell us. I know of a sparse BV implementation we could leverage; however, it can cause a lot of allocation churn, so we'd probably also want some kind of node pool/free list.
    • [edit] It looks like hashBv already implements a sparse vector, but it is not used much and its interface diverges from our other bit vectors.
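The "monotonic, in-place" and "push to predecessors" ideas above can be sketched together (illustrative Python; the CFG, bit assignments, and names are made up, not JIT code): live sets are integer bit masks, the union reports whether any bit changed, and only predecessors whose live-out actually grew are revisited.

```python
# Push-based liveness: each block pushes its live-in to its predecessors'
# live-out; a change-detecting union drives the worklist.

X = 1 << 0  # bit for the single tracked variable 'x' (hypothetical)

def push_liveness(preds, use, defs, blocks):
    live_in = {b: 0 for b in blocks}
    live_out = {b: 0 for b in blocks}
    worklist = list(blocks)
    while worklist:
        b = worklist.pop()
        live_in[b] = use[b] | (live_out[b] & ~defs[b])
        for p in preds[b]:
            merged = live_out[p] | live_in[b]
            if merged != live_out[p]:    # monotonic union reports a change...
                live_out[p] = merged
                worklist.append(p)       # ...so only affected preds are revisited
    return live_in, live_out

# A -> B -> C -> D with a C -> B back edge; x defined in A, used in C.
preds = {'A': [], 'B': ['A', 'C'], 'C': ['B'], 'D': ['C']}
use   = {'A': 0, 'B': 0, 'C': X, 'D': 0}
defs  = {'A': X, 'B': 0, 'C': 0, 'D': 0}
live_in, live_out = push_liveness(preds, use, defs, ['A', 'B', 'C', 'D'])
```

Because live sets only grow, the in-place update needs no scratch copies or set comparisons beyond the single equality check inside the union.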

@jakobbotsch
Member Author

I tried switching liveness to a PO traversal as part of #95085, to see if it would make up for some of the throughput cost. The code is here: https://github.com/jakobbotsch/runtime/pull/new/dead-block-elimination-liveness-post-order
Sadly I didn't keep/share the detailed throughput results anywhere, but if I remember correctly, they did not show much of a gain.

We might be able to be even more efficient by using the loop table, at least in the future if we end up with the more "general" loop finding in the middle end through #95251. If all cycles are natural loops, then we can iterate each SCC to completion before proceeding with other blocks. However, iterating (only) the loop blocks repeatedly will be more expensive than iterating the full post-order, since the loop blocks are stored as a bit vector while the full post-order is much more efficient to access.

@AndyAyersMS
Member

It would be nice to be able to efficiently use bit vectors as enumerable sets of blocks, but that requires keeping a mapping from bit-vector indices back to blocks, and ideally that mapping would persist rather than being rebuilt all the time, so we could amortize its cost.

Likely this means using bbID or something similar as a BV index that never (or rarely) changes; since IDs may be sparse and grow over time, that suggests some kind of sparse-ish representation for BVs (maybe a compact known-size part with extensions?), and perhaps just a hash map for the id -> block part.

@jakobbotsch
Member Author

jakobbotsch commented Nov 27, 2023

I am using the post-order index for the bit vectors in #95251, coupled with storing the post-order itself in a reachable place (the DFS tree contains it and is referenced by the loop class). That means you can add and remove blocks from the flow graph while still being able to map back using the stored post-order array. Of course, if you are adding and removing blocks from the flow graph, then you are invalidating the information in the DFS tree and loop structures, but that is to be expected, and each individual pass needs to deal with it. I think the DFS and loop identification are cheap enough that just rebuilding the structures at the end of the phase is something we might seriously consider.

For example, I have a local change where I port loop cloning to the new representation; the throughput cost is around 0.4% for building the DFS and identifying loops, doing loop cloning, and then rebuilding the structures if any flowgraph modifications were made as part of loop cloning. I fully expect us to easily make up that 0.4% throughput by removing the old DFS, reachability, and dominator recomputations that we do all the time.
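A small sketch of the scheme described above (illustrative Python; the blocks and numbering are hypothetical, not JIT code): a set of blocks is a bit vector over post-order numbers, and the stored post-order array maps indices back to blocks.

```python
# Hypothetical: the DFS tree recorded a post-order of three blocks,
# so post_order[i] is the block whose post-order number is i.
post_order = ['C', 'B', 'A']

# A pass can describe a set of blocks compactly as a bit vector over
# post-order numbers (here, the blocks numbered 1 and 2).
loop_blocks = (1 << 1) | (1 << 2)

# Mapping indices back to blocks goes through the stored array; this works
# until the DFS tree itself is rebuilt after flow graph modifications.
members = [post_order[i] for i in range(len(post_order))
           if loop_blocks & (1 << i)]
```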

@AndyAyersMS
Member

If recomputing the graph "metadata" becomes cheap enough then that's certainly very appealing, since as you say the other downside of a persistent representation is keeping it accurate over time.

@jakobbotsch
Member Author

This was fixed by #103809; we now use an RPO traversal during the dataflow.

@jakobbotsch jakobbotsch modified the milestones: Future, 9.0.0 Jul 5, 2024
@jakobbotsch jakobbotsch self-assigned this Jul 5, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Aug 5, 2024