-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Auto generate ast expression nodes #16285
base: main
Are you sure you want to change the base?
Conversation
63f4777
to
fd8c7cd
Compare
fd8c7cd
to
85c9400
Compare
Uhh, exciting. I do feel a bit conflicted about using |
|
@MichaReiser I will also use ungrammar on a separate branch to see the difference I just know the name nothing more. I was also unsure if I should extend the I first started by re-using the ASDL parser by python and then realized the ASDL is not offering more functionality here for generating Nodes and Enums. I decided to continue with |
RustPython used to use ASDL and, uff, that was painful. It's probably different because we actually parsed the official python grammar and derived the AST from it. The problem with that is that our AST diverged in many places. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do feel a bit conflicted about using
toml
to define our grammar over e.g. something like ungrammar but maybe it's the right choice and doesn't really matter?
To be honest I don't have a strong opinion between our hand-rolled TOML, ASDL, and ungrammar. To me, the main concerns are:
- Is the AST definition easy to understand and maintain?
- Is it easy to parse in our code generator?
Our hand-rolled TOML was great for the first proof-of-concept, since we could rely on Python's builtin tomllib
to do the bulk of the parsing.
I thought ASDL could be nice primarily because we could reuse a lot of the parsing logic from CPython, since (at least at the moment) we are using a Python script to parse the AST definition and generate our (Rust) code. (At first I hoped it would also have the benefit of letting us reuse the grammar itself from CPython, but as @MichaReiser points out, our AST has diverged from CPython's representation.)
RustPython used to use ASDL and, uff, that was painful.
What part was painful about it? Was the syntax itself too limited? Or was it cumbersome to parse and generate code from?
crates/ruff_python_ast/ast.toml
Outdated
ExprBoolOp = { fields = [ | ||
{ name = "op", type = "BoolOp" }, | ||
{ name = "values", type = "Expr", seq = true } | ||
]} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the AST definition easy to understand and maintain?
Given this concern, this syntax does start to look a bit unwieldy.
Is it easy to parse in our code generator?
But on the other hand, continuing to use TOML still makes it easier to iterate on the code generator itself.
So given all of this, I'd still lean towards using this TOML syntax. We can always revisit the syntax later if we feel there's something that e.g. ungrammar would give us that the TOML doesn't.
The main pain point was that it parsed python's official AST grammar and then did some hacky overrides in code for where we wanted to diverge. My other concern with using ASDL is that it isn't easy for us to extend if we need to and I always found it hard to read (e.g. could we extract field documentation?) |
good news 🎉 I was able to complete the generation for all expression nodes. There are a few hacky things I need to fix. We need a way to provide the derive attribute to structs. Right now I just added a
I initially thought only knowing a field is a sequence or not would be enough to insert it in a
I added a rule to automatically box if a field type is the group of the node it belongs to so we box every
I'm not sure if we refer to
The current code also uses I'm reading about ungrammar but I feel because of the extensibility we need in the generator using a custom file would be easier because we can add stuff without hacking it on top of something else. But I have to read it first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good news 🎉 I was able to complete the generation for all expression nodes. There are a few hacky things I need to fix.
This is starting to come together very nicely! Thank you for tackling this.
I'm reading about ungrammar but I feel because of the extensibility we need in the generator using a custom file would be easier because we can add stuff without hacking it on top of something else. But I have to read it first.
Given some of my style nits about cleaning up the TOML, I think it's fine to proceed with this PR without also looking for a way to migrate to ungrammar. I think that can be a separate experiment/PR — and in all honesty, a lower priority one, since this seems to be shaping up nicely as it currently is.
crates/ruff_python_ast/ast.toml
Outdated
ExprUnaryOp = { fields = [ | ||
{ name = "op", type = "UnaryOp" }, | ||
{ name = "operand", type = "Expr" } | ||
]} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as a style nit, the following are equivalent TOML (i.e., it shouldn't require changes to the generator) that I think might be a bit easier to read:
[Expr.nodes.ExprUnaryOp]
fields = [
{ name = "op", type = "UnaryOp" },
{ name = "operand", type = "Expr" }
]
or
[[Expr.nodes.ExprUnaryOp.fields]]
name = "op"
type = "UnaryOp"
[[Expr.nodes.ExprUnaryOp.fields]]
name = "operand"
type = "Expr"
I have a slight preference for the last one, since it eliminates the nested maps and lists from the TOML. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
crates/ruff_python_ast/generate.py
Outdated
inner = f"crate::{field.ty}" | ||
if field.ty == "bool" or field.ty.startswith("Box"): | ||
inner = field.ty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like your suggestion of adding a hard-coded list of types (defined near the top of this file) that don't require a crate::
qualifier
crates/ruff_python_ast/ast.toml
Outdated
{ name = "ctx", type = "ExprContext" } | ||
]} | ||
ExprName = { fields = [ | ||
{ name = "id", type = "name::Name" }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if we refer to
Name
just by it's name and import it in the file or like this:{ name = "id", type = "name::Name" },
I think it's worth ensuring that we can use just Name
here in the TOML, and importing it at the top of the generated Rust file is a perfectly fine way to do that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I defined the imports at the top of the generate.py
.
One additional improvement is to define the imports in a way we can check if a name is imported and not use crate
prefix for it.
For now I just used the global list of types we don't want to prefix with crate
(the other comment) instead. Because we only have one import and it might be too soon to generalize.
crates/ruff_python_ast/ast.toml
Outdated
{ name = "ops", type = "Box<[crate::CmpOp]>"}, | ||
{ name = "comparators", type = "Box<[Expr]>"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially thought only knowing a field is a sequence or not would be enough to insert it in a
Vec
but I found inExprCompare
one of the types isBox<[crate::CmpOp]>
Doing this as a hard-coded type without a seq
is just fine as a special case. We might also check whether they need to be a box-of-slice instead of a vec. The only benefit I can see is that we save one word of storage since we only need pointer+length instead of pointer+length+capacity. And if that's worth doing, why aren't we doing that for all of the other vecs? But we don't have to tackle that in this PR, this is fine as you have it.
Co-authored-by: Douglas Creager <dcreager@dcreager.net>
177cfed
to
ce4cf6f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like where this is going.
I do find the TOML somewhat hard to parse because of how verbose fields is. I'd suggest exploring using regular lists with inline tables to make the format more compact. We can still use the "full-table" layout in cases where it is necessary (because inline-tables need to be single line)
[[Expr.nodes.ExprBoolOp.fields]] | ||
name = "op" | ||
type = "BoolOp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use inline arrays? I find the syntax very verbose to read and find it hard to get a sense of a nodes fields
[[Expr.nodes.ExprBoolOp.fields]] | |
name = "op" | |
type = "BoolOp" | |
fields = [ | |
{ name = "op", type = "BoolOp" }, | |
{ name = "values", type = "Expr", seq = true }, | |
... | |
} |
The downside of this is that inline tables can't be multiline (this can be a problem with comments)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fairness to @Glyphack I had suggested the syntax he's using here #16285 (comment)!
Another possibility, since name
is required and unique, would be
[Expr.nodes.ExprBoolOp]
fields = {
op = { type = "BoolOp" },
values = { type = "Expr", seq = true }
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I didn't see that one.
The second doesn't work, I think, because toml inline tables don't allow line breaks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes unfortunately it's not possible to have line break inside an inline table.
https://toml.io/en/v1.0.0#inline-table
To make that work we need to have a table [Expr.nodes.ExprBoolOp.fields.op]
which takes it more verbose I think?
But I think instead of the inline table using an inline array would solve the problem of having all the fields in one section and we only need to write an extra "name = ..." does sound good to me.
But one thing to keep in mind is that if we use inline table for the field attributes and have tables inside attributes(for example if we apply this comment #16285 (comment)) it might get very messy example:
[Expr.nodes.ExprBoolOp]
fields = [
{ name = "values", type = { kind = "sequence", element-type = "Expr" } }
]
I feel we end up making type a more complex field in the end.
|
||
[[Expr.nodes.ExprCompare.fields]] | ||
name = "comparators" | ||
type = "Box<[Expr]>" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having to explicitly use Box<[Expr]>
here has the downside that it still requires manually updating all types when changing from one representation to another.
I also find that type = Expr
+ seq = true
reads somewhat strangely. It also allows for combination that aren't allowed. E.g. is seq = true
+ optional = true
supported? I wonder if we should do something like:
type = "Expr" // single item or escape hatch
type = { kind: "sequence", element-type: "Expr", slice: true }
type = { kind: "required", value: "Expr" }
type = { kind: "optiona", value: "Expr" }
I'm not very convinced it is better. Maybe you have ideas to improve it further?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the current format has no restrictions and definitely has some problems like You can use type Vec<Expr>
and also set the sequence. Honestly I was not sure how specific should I make the types in this stage.
I think your suggestion is needed in the end. For example when we want to provide functions to access data without accessing the fields the exact types are useful.
My feeling is that if you are also unsure we can leave this undecided to when it's needed?
Co-authored-by: Douglas Creager <dcreager@dcreager.net>
@@ -14,6 +14,16 @@ | |||
|
|||
import tomllib | |||
|
|||
imports = ["crate::name::Name"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One additional improvement is to define the imports in a way we can check if a name is imported and not use
crate
prefix for it.For now I just used the global list of types we don't want to prefix with
crate
(the other comment) instead. Because we only have one import and it might be too soon to generalize.
I agree that it's too soon to generalize. Given that, I don't think it's worth defining this as a Python array — just put the necessary use
statements directly into the Rust preamble below.
def requires_crate_prefix(ty: str) -> bool: | ||
no_crate_prefix_types = [ | ||
"Name", # Imported at the top | ||
"bool", | ||
] | ||
return not (ty in no_crate_prefix_types or ty.startswith("Box")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Come to think of it, once we have migrated all of the node types over to this mechanism, none of them will need crate::
qualifiers, since everything will be either (a) defined in the generated file, (b) in an use
statement at the top of the generated file, or (c) a primitive of some kind. So it might make sense to invert this into a list of types that do need a crate prefix. That would also eliminate the Box
special case.
Summary
Part of #15655
ast.toml
. I added similar fields toField
in ASDL to hold field informationTest Plan
Nothing outside the
ruff_python_ast
package should change.