RFC: AST node IDs #5688

overlookmotel · 2024-09-11T01:04:21Z

We have decided to add NodeIds to AST node types. Here are my suggestions on how we should approach this.

Motivating use cases

Store data about nodes out of band.
Get parent of a node via an ID -> parent ID table.
Get children of a node via an ID -> child IDs table.
Uniquely identify AST nodes (as requested by Rolldown, also useful for e.g. traverse: insert new statements(If any) after calling each walk_statement in walk_statements #4767).

I feel storing data out of band is the strongest motivation. There are viable alternatives for the other 3:

Getting parents: Build stack of ancestors in visitor (like Traverse does).
Getting children: Traverse down the AST.
Uniquely identify nodes: Use pointer equality (it's faster).

Proposed design

What kind of IDs?

I suggest that we have a single node ID for all nodes, rather than separate IDs (StatementId, ExpressionId etc). This aligns with the existing AstNodeId.

We may want in future to split into multiple IDs (StatementId etc). But for first implementation, a single ID is simpler.

What to call it?

NodeId. AstNodeId seems overly verbose. What other kinds of nodes do we have?

Rename the existing AstNodeId.

What gets an ID?

Only structs should have an ID. Enums e.g. Statement should not itself have an ID. But all the enum's variant "payloads" (e.g. BlockStatement) will have an ID. Rationale:

It is difficult to find a place to store IDs in enums.
This aligns with our current AstNodeId.
Unclear if any use cases exist which require Statement or Expression to have its own ID.
We can use a trait GetNodeId implemented on Statement etc to get its "payload" type's ID.

In my opinion, best to not add IDs to enums initially, and we'll discover if that blocks any use cases, and if the workarounds to support those use cases are painful/expensive.

ID type

Currently our IDs are wrappers around a NonMaxU32. That's motivated by making Option<Id> 4 bytes (vs Option<u32> which is 8 bytes).

This imposes some costs:

Every read or write of an ID has to convert from the internal representation to a zero-based ID. This is mostly a single XOR assembly instruction (id = internal_id ^ u32::MAX), but in some cases may have further knock-on effects. A small overhead, but it's a very hot path e.g. in linter.
In a post-semantic AST, you know that all IDs are Some, but constantly have to call unwrap on Option<Id>s. This complicates our code, and adds panicking branches to a lot of code.
Extra dependency on nonmax crate.

I propose an alternative:

Use 0 as a sentinel value equivalent to Option::<NodeId>::None.
If we have the need to store Option<NodeId> in places where extra bytes are undesirable, we can get a niche in NodeId with a different NonMaxU32 implementation.

(Ideally we would also change ScopeId, SymbolId and ReferenceId to be stored as plain u32s, instead of Option<NonMaxU32>s, for same reasons as above)

When to create IDs

This is the tricky part.

Ideally we'd create IDs in parser and set the node_id field on AST nodes when they are created. This:

is the most performant solution, as you don't have to write to the node_id field twice (dummy value first, then real value later).
means node_id fields can be a plain NodeId, not Cell<NodeId>.

However, this is not currently feasible. We'll need to generate NodeIds in SemanticBuilder initially.

Why not in parser?

1. Order of IDs

Currently AstNodeIds are in traversal order. Program is node ID 0. So iterating over AstNodes::nodes gives you nodes in traversal order.

But the parser creates nodes in a different order. Program is the last node to be created, so would have the highest NodeId.

Note: The parser does not create nodes in reverse travesal order either. Numbered in order of node creation:

foo * bar + qux
_1_   _2_
____3____
            _4_
_______5_______

When traversing this AST, node IDs in visitation order are: 5, 3, 1, 2, 4.

It is unclear if any linter rules rely on current iteration order of AstNodes::nodes or not. If they do, they can be refactored so they don't, so this isn't a deal-breaker, but we'd need to do that first.

Don thinks that visiting nodes in parse order rather than traversal order might actually turn out to be a perf gain, as nodes would be visited in same order that they're layed out in arena - perfect access pattern for CPU caches.

2. Unused node IDs

This is the hard problem.

In some places (e.g. when parsing what may or may not be an arrow function) the parser creates nodes, but then may rewind and discard them, and then re-parse that section of code again. If parser creates a NodeId for every AST node created (e.g. in AstBuilder), this would result in gaps in the sequence of NodeIds which appear in the AST.

Why is this a problem?

Later on SemanticBuilder will read NodeIds from the AST and insert data into side arrays indexed by NodeId (AstNodes::nodes, AstNodes::parent_ids).

If we cannot guarantee that there are no gaps in the sequence of NodeIds, then these side arrays are likely to end up containing sections of uninitialized bytes. There is nothing to stop user creating a NodeId from an arbitrary u32 and then reading from AstNodes::parent_ids or AstNodes::nodes with that NodeId. If that NodeId hits an uninitialized "gap", then that's instant UB. Bounds checks cannot prevent this, as gaps will be in middle of the sequence, so the NodeId can be in bounds, but still read uninitialized bytes.

How can we solve this?

Fill the side arrays at the start with zeros (assuming that's a legal bit pattern for the types stored in the arrays). This means all data is initialized so UB is not possible. But it's expensive writing zeros over a large chunk of memory.
Track in SemanticBuilder which IDs have been seen, and where gaps are. At the end of SemanticBuilder's traversal, fill any gaps with zeros, so no uninitialized gaps remain. Unfortunately, because IDs will be encountered out of order, keeping track of what ranges of IDs have been seen, and where gaps are is non-trivial, and probably expensive.
Fill a side array with bools to represent "node exists" or "node does not exist" for each NodeId. Initialize it with zeros, and set each entry to true when that NodeId is found. This array is only bools so will not take as much memory as the main arrays, and so is cheaper to set up. But downside is it imposes a cost on every read from the arrays - now every lookup into nodes or parent_ids needs to first check the node_exists array, to make sure the node exists before reading from nodes.

All of these are possible, but all come with a significant performance penalty.

I can imagine these replies:

"We can just make sure we don't look up non-existent NodeIds and tests will catch any bugs." I don't think that's a good idea. Undefined behavior is not a normal bug. It can cause strange behavior in unrelated and unexpected places. It may even cause tests to pass when they shouldn't, and UB is extremely hard to debug. e.g. inserting a dbg!() can magically make the problem you're trying to track down disappear. We've seen exacty this problem in transformer due to AstBuilder::copy.
"We can make sure the parser doesn't produce any gaps in NodeId sequence". We can try to do that, but can we statically prove it's impossible? I don't think we can, in which case we open the door to UB.
"Prevent user creating arbitrary NodeIds, so any NodeId that exists is always valid". Yes, but nothing prevents using a NodeId from one AST to index into the arrays relating to another AST. UB!

There is one possibility which could work, but is quite a sizeable undertaking (see below).

3. Cost of growing `Vec`s

Let's say we are creating NodeIds in parser, and also building the side arrays e.g. parent_ids in parser too.

How many nodes are there going to be? It's impossible to know at the outset. #4367 demonstrated that growing large Vecs because you didn't allocate enough capacity initially is very costly. And for large source files, the number of AST nodes can be huge (100,000+) so these Vecs are big.

Maybe we can just allocate capacity initially to the number of bytes in source text + 2 (I think that's the maximum number of nodes you can get per byte of source e.g. 0+0+0 = 5 bytes, 7 nodes). That will be way too big for normal JS code, but it guarantees no reallocations. Then shrink the arrays to fit after parsing is complete, to free the excess memory again.

We would need to check:

Shrinking doesn't cause a reallocation and copy - depends on behavior of system allocator.
This works OK on WASM, which doesn't have virtual memory.

So maybe this isn't a deal-breaker, but there are potential challenges.

What to do

I suggest that we tackle this in 3 phases:

Phase 1: Generate IDs in semantic.
Phase 2: Optimize.
Phase 3: Move generating IDs into parser.

Phase 1: Add `NodeId`s

`node_id` fields

Add node_id: Cell<NodeId> fields to all AST types.

In my view, the fields should be visible in the type defs and not "hidden away" by adding them in a macro. So we'd need to add them manually. But we could do it in #[ast] macro initially while we're testing this out.

node_id would ideally be the first field on each type (before span).

`GetNodeId` trait

Similar to GetSpan. We can #[generate_derive(GetNodeId)].

If we make node_id the first field on each type, because the AST is now #[repr(C)], the memory offset of the field is guaranteed the same for every type. This makes GetNodeId::node_id extremely cheap even for the massive enums like Expression, as compiler sees it needs no branches to fetch the ID: https://godbolt.org/z/PGfWx6T57

Get rid of non-standard `AstKind`s

We have a few "special case" AstKinds e.g. FinallyClause. I think we should get rid of them. Either:

create a proper AST type FinallyClause
or alter the code which relies on this AstKind to instead identify if a node is in a finally clause by reading up the chain of parents.

This makes things simpler - every AST struct has a NodeId, every AST struct has an AstKind, every NodeId refers to an AstKind which matches the type of the node.

Make all AST node creation happen via an `AstBuilder`

Either:

Mark all AST structs #[non_exhaustive].
Add a private zero-size field to all AST structs in #[ast] macro.

Either way, this means they cannot be constructed by code outside oxc_ast crate. Probably #[non_exhaustive] is better solution.

2nd `AstBuilder` for transformer

Create a 2nd AstBuilder for use in transformer + minifier (AstNodeBuilder?).

struct AstNodeBuilder<'a> {
    allocator: &'a Allocator,
    next_node_id: u32,
}

Contains a counter for next NodeId.
Sets node_id on new nodes.
Is not Copy, as needs to be used as &mut AstNodeBuilder.
Stored as a field on TraverseCtx, so only a single instance exists - no RefCells or unsafe required.

If we don't pass the "parser" version of AstBuilder to any transformers, then they have to create nodes via AstNodeBuilder, making it impossible to accidentally create any AST nodes without an ID.

AstNodeBuilder can also help us get other things right in the transformer. It can enforce that:

New IdentiferReferences must have a valid ReferenceId.
New BindingIdentifiers must have a valid SymbolId.
Nodes with a scope_id field must be created with a valid ScopeId.

Other IDs

I think we should leave the other ID fields (scope_id, symbol_id, reference_id) present in AST, as they are now. @rzvxa suggested accessing them via side tables indexed by NodeId. I don't think this is ideal because:

Those fields are not so common, so they take up little memory.
Fetching ScopeId etc via a side table adds an extra indirection.
These side tables would be mostly empty space - most nodes do not have a SymbolId.
Transformer only has access to ScopeTree + SymbolTable, not Nodes.

Phase 2: Optimize

Type layouts

Because NodeId is a u32, the field order of types will not be optimal, and in most types it will have 4 bytes padding after it. So every AST type will grow by 8 bytes.

To fix this, optimize field order in oxc_ast_tools, filling the 4-byte gap after NodeId with other u32 / bool / single-byte enum fields. As many types currently have 4 or more bytes padding, rearranging the fields will win back that 8 bytes in most cases, and adding NodeId will be free for many/most types.

`AstNodes` SoA

Convert AstNodes into full struct-of-arrays (split AstNode up).

AstNode will no longer be required (though we may need to keep it around until we've transitioned the linter to stop using it).

Access methods on `NodeId`

Rather than getting info about a node via methods on Semantic, I propose that we implement methods on NodeId itself.

// Before
let flags = ctx.semantic.nodes.get_flags(node_id);
// After
let flags = node_id.flags(ctx);

Along the same lines as: oxc-project/backlog#152

I feel this is cleaner and easier to read. It also avoids the huge pile-up of verbosely-named methods on Semantic.

Phase 3: Move generating `NodeId`s into parser

TODO: I'll write this up later.

The text was updated successfully, but these errors were encountered:

overlookmotel added A-semantic Area - Semantic A-ast Area - AST labels Sep 11, 2024

overlookmotel mentioned this issue Sep 11, 2024

AST Node ID #4188

Closed

Boshen changed the title ~~AST node IDs RFC~~ RFC: AST node IDs Sep 11, 2024

oxc-project locked and limited conversation to collaborators Sep 11, 2024

Boshen converted this issue into discussion #5689 Sep 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

This issue was moved to a discussion.

RFC: AST node IDs #5688

RFC: AST node IDs #5688

overlookmotel commented Sep 11, 2024

This issue was moved to a discussion.

This issue was moved to a discussion.

RFC: AST node IDs #5688

RFC: AST node IDs #5688

Comments

overlookmotel commented Sep 11, 2024

Motivating use cases

Proposed design

What kind of IDs?

What to call it?

What gets an ID?

ID type

When to create IDs

Why not in parser?

1. Order of IDs

2. Unused node IDs

3. Cost of growing Vecs

What to do

Phase 1: Add NodeIds

node_id fields

GetNodeId trait

Get rid of non-standard AstKinds

Make all AST node creation happen via an AstBuilder

2nd AstBuilder for transformer

Other IDs

Phase 2: Optimize

Type layouts

AstNodes SoA

Access methods on NodeId

Phase 3: Move generating NodeIds into parser

This issue was moved to a discussion.

3. Cost of growing `Vec`s

Phase 1: Add `NodeId`s

`node_id` fields

`GetNodeId` trait

Get rid of non-standard `AstKind`s

Make all AST node creation happen via an `AstBuilder`

2nd `AstBuilder` for transformer

`AstNodes` SoA

Access methods on `NodeId`

Phase 3: Move generating `NodeId`s into parser