We have decided to add NodeIds to AST node types. Here are my suggestions on how we should approach this.
Motivating use cases
The motivating use cases are: storing data about nodes out of band, getting a node's parent, getting a node's children, and uniquely identifying nodes.
I feel storing data out of band is the strongest motivation. There are viable alternatives for the other 3:
Getting parents: Build a stack of ancestors in the visitor (like Traverse does); see the sketch after this list.
Getting children: Traverse down the AST.
Uniquely identify nodes: Use pointer equality (it's faster).
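For illustration, a minimal sketch of the ancestor-stack approach, assuming a visitor with enter/exit hooks; the type and method names here are invented, not the real Traverse API:

```rust
// Hypothetical ancestor tracker: the visitor pushes on enter and pops on exit,
// so a node's parents are available without storing IDs in the AST.
struct AncestorTracker<NodeKind> {
    ancestors: Vec<NodeKind>,
}

impl<NodeKind: Copy> AncestorTracker<NodeKind> {
    fn enter_node(&mut self, kind: NodeKind) {
        // Everything already on the stack is an ancestor of `kind`.
        self.ancestors.push(kind);
    }

    fn exit_node(&mut self) {
        self.ancestors.pop();
    }

    fn parent_of_current(&self) -> Option<NodeKind> {
        // Top of the stack is the current node; the entry below it is its parent.
        self.ancestors.iter().rev().nth(1).copied()
    }
}
```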
Proposed design
What kind of IDs?
I suggest that we have a single node ID for all nodes, rather than separate IDs (StatementId, ExpressionId etc). This aligns with the existing AstNodeId.
We may want in future to split into multiple IDs (StatementId etc). But for first implementation, a single ID is simpler.
What to call it?
NodeId. AstNodeId seems overly verbose. What other kinds of nodes do we have?
Rename the existing AstNodeId.
What gets an ID?
Only structs should have an ID. An enum (e.g. Statement) should not itself have an ID, but all of the enum's variant "payloads" (e.g. BlockStatement) will have an ID. Rationale:
It is difficult to find a place to store IDs in enums.
This aligns with our current AstNodeId.
Unclear if any use cases exist which require Statement or Expression to have its own ID.
We can use a trait GetNodeId implemented on Statement etc to get its "payload" type's ID.
In my opinion, it's best not to add IDs to enums initially; we'll discover whether that blocks any use cases, and whether the workarounds to support those use cases are painful/expensive.
ID type
Currently our IDs are wrappers around a NonMaxU32. That's motivated by making Option<Id> 4 bytes (vs Option<u32> which is 8 bytes).
This imposes some costs:
Every read or write of an ID has to convert from the internal representation to a zero-based ID. This is mostly a single XOR assembly instruction (id = internal_id ^ u32::MAX), but in some cases may have further knock-on effects. A small overhead, but it's a very hot path e.g. in linter.
In a post-semantic AST, you know that all IDs are Some, but constantly have to call unwrap on Option<Id>s. This complicates our code, and adds panicking branches to a lot of code.
Extra dependency on nonmax crate.
I propose an alternative:
Use 0 as a sentinel value equivalent to Option::<NodeId>::None.
If we have the need to store Option<NodeId> in places where extra bytes are undesirable, we can get a niche in NodeId with a different NonMaxU32 implementation.
(Ideally we would also change ScopeId, SymbolId and ReferenceId to be stored as plain u32s, instead of Option<NonMaxU32>s, for same reasons as above)
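A minimal sketch of what the plain-u32 version could look like, assuming real IDs would then start at 1 so that 0 stays free as the sentinel (all names illustrative):

```rust
// Hypothetical plain-u32 NodeId with 0 as the `None` sentinel.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct NodeId(u32);

impl NodeId {
    /// Plays the role of `Option::<NodeId>::None` without doubling the size.
    pub const NONE: NodeId = NodeId(0);

    pub fn is_none(self) -> bool {
        self == Self::NONE
    }

    /// Reads and writes are just the raw value - no decode step, unlike the
    /// current NonMaxU32-based IDs, which convert with roughly `id ^ u32::MAX`.
    pub fn index(self) -> u32 {
        self.0
    }
}
```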
When to create IDs
This is the tricky part.
Ideally we'd create IDs in parser and set the node_id field on AST nodes when they are created. This:
is the most performant solution, as you don't have to write to the node_id field twice (dummy value first, then real value later).
means node_id fields can be a plain NodeId, not Cell<NodeId>.
However, this is not currently feasible. We'll need to generate NodeIds in SemanticBuilder initially.
Why not in parser?
1. Order of IDs
Currently AstNodeIds are in traversal order. Program is node ID 0. So iterating over AstNodes::nodes gives you nodes in traversal order.
But the parser creates nodes in a different order. Program is the last node to be created, so would have the highest NodeId.
Note: The parser does not create nodes in reverse traversal order either. Numbered in order of node creation:
foo * bar + qux
_1_   _2_
____3____
            _4_
_______5_______
When traversing this AST, node IDs in visitation order are: 5, 3, 1, 2, 4.
It is unclear whether any linter rules rely on the current iteration order of AstNodes::nodes. If they do, they can be refactored so they don't, so this isn't a deal-breaker, but we'd need to do that first.
Don thinks that visiting nodes in parse order rather than traversal order might actually turn out to be a perf gain, as nodes would be visited in the same order that they're laid out in the arena - a perfect access pattern for CPU caches.
2. Unused node IDs
This is the hard problem.
In some places (e.g. when parsing what may or may not be an arrow function) the parser creates nodes, but then may rewind and discard them, and re-parse that section of code again. If the parser creates a NodeId for every AST node created (e.g. in AstBuilder), this would result in gaps in the sequence of NodeIds which appear in the AST.
Why is this a problem?
Later on SemanticBuilder will read NodeIds from the AST and insert data into side arrays indexed by NodeId (AstNodes::nodes, AstNodes::parent_ids).
If we cannot guarantee that there are no gaps in the sequence of NodeIds, then these side arrays are likely to end up containing sections of uninitialized bytes. There is nothing to stop a user creating a NodeId from an arbitrary u32 and then reading from AstNodes::parent_ids or AstNodes::nodes with that NodeId. If that NodeId hits an uninitialized "gap", that's instant UB. Bounds checks cannot prevent this, as gaps will be in the middle of the sequence, so the NodeId can be in bounds but still read uninitialized bytes.
How can we solve this?
Fill the side arrays at the start with zeros (assuming that's a legal bit pattern for the types stored in the arrays). This means all data is initialized so UB is not possible. But it's expensive writing zeros over a large chunk of memory.
Track in SemanticBuilder which IDs have been seen, and where the gaps are. At the end of SemanticBuilder's traversal, fill any gaps with zeros, so no uninitialized gaps remain. Unfortunately, because IDs will be encountered out of order, keeping track of which ranges of IDs have been seen, and where the gaps are, is non-trivial and probably expensive.
Fill a side array with bools to represent "node exists" or "node does not exist" for each NodeId. Initialize it with zeros, and set each entry to true when that NodeId is found. This array is only bools, so it will not take as much memory as the main arrays and is cheaper to set up. But the downside is that it imposes a cost on every read from the arrays - now every lookup into nodes or parent_ids needs to first check the node_exists array to make sure the node exists before reading from nodes.
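For concreteness, a rough sketch of that third option (field and method names are hypothetical, not the real oxc API):

```rust
#[derive(Clone, Copy)]
pub struct NodeId(u32);

pub struct ParentIds {
    parent_ids: Vec<NodeId>,
    // Zero-initialized; SemanticBuilder sets an entry to `true` when it sees that NodeId.
    node_exists: Vec<bool>,
}

impl ParentIds {
    pub fn get(&self, id: NodeId) -> Option<NodeId> {
        let index = id.0 as usize;
        // The extra check on every lookup is the cost described above.
        if self.node_exists[index] {
            Some(self.parent_ids[index])
        } else {
            None
        }
    }
}
```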
All of these are possible, but all come with a significant performance penalty.
I can imagine these replies:
"We can just make sure we don't look up non-existent NodeIds and tests will catch any bugs." I don't think that's a good idea. Undefined behavior is not a normal bug. It can cause strange behavior in unrelated and unexpected places. It may even cause tests to pass when they shouldn't, and UB is extremely hard to debug. e.g. inserting a dbg!() can magically make the problem you're trying to track down disappear. We've seen exacty this problem in transformer due to AstBuilder::copy.
"We can make sure the parser doesn't produce any gaps in NodeId sequence". We can try to do that, but can we statically prove it's impossible? I don't think we can, in which case we open the door to UB.
"Prevent user creating arbitrary NodeIds, so any NodeId that exists is always valid". Yes, but nothing prevents using a NodeId from one AST to index into the arrays relating to another AST. UB!
There is one possibility which could work, but is quite a sizeable undertaking (see below).
3. Cost of growing Vecs
Let's say we are creating NodeIds in parser, and also building the side arrays e.g. parent_ids in parser too.
How many nodes are there going to be? It's impossible to know at the outset. #4367 demonstrated that growing large Vecs because you didn't allocate enough capacity initially is very costly. And for large source files, the number of AST nodes can be huge (100,000+) so these Vecs are big.
Maybe we can just allocate initial capacity equal to the number of bytes in the source text + 2 (I think that's the maximum number of nodes you can get per byte of source, e.g. 0+0+0 = 5 bytes, 7 nodes). That will be way too big for normal JS code, but it guarantees no reallocations. Then shrink the arrays to fit after parsing is complete, to free the excess memory again.
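A rough sketch of that idea (illustrative code only, using the worst-case formula from above):

```rust
#[derive(Clone, Copy)]
struct NodeId(u32);

fn build_parent_ids(source_len: usize) -> Vec<NodeId> {
    // Assumed worst case: source bytes + 2 nodes, so the Vec never reallocates.
    let mut parent_ids: Vec<NodeId> = Vec::with_capacity(source_len + 2);

    // ... parsing pushes one entry per node; usually far fewer than the capacity ...

    // Free the excess afterwards. Whether this avoids a reallocation and copy
    // depends on the allocator - one of the things to check below.
    parent_ids.shrink_to_fit();
    parent_ids
}
```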
We would need to check:
Shrinking doesn't cause a reallocation and copy - this depends on the behavior of the system allocator.
This works OK on WASM, which doesn't have virtual memory.
So maybe this isn't a deal-breaker, but there are potential challenges.
What to do
I suggest that we tackle this in 3 phases:
Phase 1: Generate IDs in semantic.
Phase 2: Optimize.
Phase 3: Move generating IDs into parser.
Phase 1: Add NodeIds
node_id fields
Add node_id: Cell<NodeId> fields to all AST types.
In my view, the fields should be visible in the type defs and not "hidden away" by adding them in a macro. So we'd need to add them manually. But we could do it in #[ast] macro initially while we're testing this out.
node_id would ideally be the first field on each type (before span).
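As a sketch, an AST struct would then look roughly like this (simplified, not the real oxc definition):

```rust
use std::cell::Cell;

#[derive(Clone, Copy)]
pub struct NodeId(u32);

#[derive(Clone, Copy)]
pub struct Span { pub start: u32, pub end: u32 }

#[repr(C)]
pub struct BlockStatementSketch<Stmt> {
    /// First field, before `span`, so it sits at the same offset in every type.
    pub node_id: Cell<NodeId>,
    pub span: Span,
    pub body: Vec<Stmt>,
}
```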
GetNodeId trait
Similar to GetSpan. We can #[generate_derive(GetNodeId)].
If we make node_id the first field on each type, because the AST is now #[repr(C)], the memory offset of the field is guaranteed to be the same for every type. This makes GetNodeId::node_id extremely cheap even for the massive enums like Expression, as the compiler sees it needs no branches to fetch the ID: https://godbolt.org/z/PGfWx6T57
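A sketch of the trait and the shape of a derived impl for an enum (simplified stand-ins for the real types; the derive would generate the match):

```rust
use std::cell::Cell;

#[derive(Clone, Copy)]
pub struct NodeId(u32);

pub trait GetNodeId {
    fn node_id(&self) -> NodeId;
}

#[repr(C)]
pub struct BooleanLiteral { pub node_id: Cell<NodeId>, pub value: bool }

#[repr(C)]
pub struct NullLiteral { pub node_id: Cell<NodeId> }

pub enum ExpressionSketch {
    BooleanLiteral(Box<BooleanLiteral>),
    NullLiteral(Box<NullLiteral>),
}

impl GetNodeId for ExpressionSketch {
    fn node_id(&self) -> NodeId {
        // Because `node_id` is at offset 0 in every #[repr(C)] payload, the
        // compiler can collapse this match into a single branchless load.
        match self {
            ExpressionSketch::BooleanLiteral(e) => e.node_id.get(),
            ExpressionSketch::NullLiteral(e) => e.node_id.get(),
        }
    }
}
```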
Get rid of non-standard AstKinds
We have a few "special case" AstKinds e.g. FinallyClause. I think we should get rid of them. Either:
create a proper AST type FinallyClause
or alter the code which relies on this AstKind to instead identify if a node is in a finally clause by reading up the chain of parents.
This makes things simpler - every AST struct has a NodeId, every AST struct has an AstKind, every NodeId refers to an AstKind which matches the type of the node.
Make all AST node creation happen via an AstBuilder
Either:
mark the AST structs #[non_exhaustive],
or do something in the #[ast] macro.
Either way, this means they cannot be constructed by code outside the oxc_ast crate. Probably #[non_exhaustive] is the better solution.
2nd AstBuilder for transformer
Create a 2nd AstBuilder for use in transformer + minifier (AstNodeBuilder?). It:
Generates NodeIds and sets node_id on new nodes.
Is not Copy, as it needs to be used as &mut AstNodeBuilder.
Stored as a field on TraverseCtx, so only a single instance exists - no RefCells or unsafe required.
If we don't pass the "parser" version of AstBuilder to any transformers, then they have to create nodes via AstNodeBuilder, making it impossible to accidentally create any AST nodes without an ID.
AstNodeBuilder can also help us get other things right in the transformer. It can enforce that (see the sketch after this list):
New IdentifierReferences must have a valid ReferenceId.
New BindingIdentifiers must have a valid SymbolId.
Nodes with a scope_id field must be created with a valid ScopeId.
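A hypothetical sketch of what such an enforcing constructor could look like (none of these signatures are the real oxc API):

```rust
use std::cell::Cell;

#[derive(Clone, Copy)] pub struct NodeId(u32);
#[derive(Clone, Copy)] pub struct ReferenceId(u32);

pub struct IdentifierReferenceSketch {
    pub node_id: Cell<NodeId>,
    pub name: String,
    pub reference_id: Cell<Option<ReferenceId>>,
}

pub struct AstNodeBuilder {
    next_node_id: u32,
}

impl AstNodeBuilder {
    /// A ReferenceId is a required argument, so it's impossible to create an
    /// identifier reference without one - and every node gets a fresh NodeId.
    pub fn identifier_reference(
        &mut self,
        name: String,
        reference_id: ReferenceId,
    ) -> IdentifierReferenceSketch {
        let node_id = NodeId(self.next_node_id);
        self.next_node_id += 1;
        IdentifierReferenceSketch {
            node_id: Cell::new(node_id),
            name,
            reference_id: Cell::new(Some(reference_id)),
        }
    }
}
```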
Other IDs
I think we should leave the other ID fields (scope_id, symbol_id, reference_id) present in the AST, as they are now. @rzvxa suggested accessing them via side tables indexed by NodeId. I don't think this is ideal because:
Those fields are not so common, so they take up little memory.
Fetching ScopeId etc via a side table adds an extra indirection.
These side tables would be mostly empty space - most nodes do not have a SymbolId.
Transformer only has access to ScopeTree + SymbolTable, not Nodes.
Phase 2: Optimize
Type layouts
Because NodeId is a u32, the field order of types will not be optimal, and in most types it will have 4 bytes of padding after it. So every AST type will grow by 8 bytes.
To fix this, optimize field order in oxc_ast_tools, filling the 4-byte gap after NodeId with other u32 / bool / single-byte enum fields. As many types currently have 4 or more bytes padding, rearranging the fields will win back that 8 bytes in most cases, and adding NodeId will be free for many/most types.
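An illustrative toy example of the kind of reordering meant here - these are not real oxc types, and the sizes quoted assume a typical 64-bit target:

```rust
#[repr(C)]
struct Span { start: u32, end: u32 } // 8 bytes

// With this order there are 4 bytes of padding before `name` and 7 after
// `flag`: 40 bytes on a 64-bit target.
#[repr(C)]
struct Unoptimized {
    node_id: u32,
    span: Span,
    name: Box<str>, // 16 bytes, 8-byte aligned
    flag: bool,
}

// Pairing `flag` with `node_id` in the same 8-byte slot wins the padding
// back: 32 bytes on a 64-bit target.
#[repr(C)]
struct Optimized {
    node_id: u32,
    flag: bool,
    span: Span,
    name: Box<str>,
}
```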
AstNodes SoA
Convert AstNodes into full struct-of-arrays (split AstNode up).
AstNode will no longer be required (though we may need to keep it around until we've transitioned the linter to stop using it).
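Schematically, the split might look like this (placeholder types; the real component set would differ):

```rust
#[derive(Clone, Copy)]
pub struct NodeId(u32);

pub struct KindPlaceholder;  // stands in for AstKind<'a>
pub struct FlagsPlaceholder; // stands in for node flags

// Instead of one Vec<AstNode> of (kind, parent_id, flags) records,
// each component lives in its own Vec, all indexed by NodeId.
pub struct AstNodesSoA {
    kinds: Vec<KindPlaceholder>,
    parent_ids: Vec<NodeId>,
    flags: Vec<FlagsPlaceholder>,
}

impl AstNodesSoA {
    /// Walking parents now only touches the `parent_ids` array.
    pub fn parent_id(&self, id: NodeId) -> NodeId {
        self.parent_ids[id.0 as usize]
    }
}
```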
Access methods on NodeId
Rather than getting info about a node via methods on Semantic, I propose that we implement methods on NodeId itself.
itself.Along the same lines as: oxc-project/backlog#152
I feel this is cleaner and easier to read. It also avoids the huge pile-up of verbosely-named methods on Semantic.
Phase 3: Move generating NodeIds into parser
TODO: I'll write this up later.