Add initial specification for selector syntax. #239
Conversation
Wow, that is really a good read. My comments are only in regard to typos; the rest sounds great.
Marking "request changes" as a stand-in for "request discussion". Will do another pass over this once the first two pieces are clarified.
Awesome work as a whole!
'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's',
't', 'u', 'v', 'w', 'x', 'y', 'z', '~']
```
"Easy embedding inside URLs" implies "easy visual skimming" (perhaps with some initial training needed, just like e.g. regular expressions). Assuming a person reading this is proficient: are we comfortable with a case-sensitive, visually-collidable character set?
I am not particularly leaning one way or the other, but rather am bringing the point up for discussion.
This is certainly something we can use as an added constraint to consider when choosing the characters used for the short form. Currently, it only uses `f`, `i`, `r`, `u`, `c`, `F`, `*`, `.`, and `~`.

The listing of URL-safe characters is more of a technical constraint about which ASCII characters can be embedded in URL components without needing to be escaped.
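For illustration, a quick JS check (using the standard `encodeURIComponent`) shows which printable ASCII characters survive URL-component encoding without being escaped, which is exactly the constraint being described:

```javascript
// Collect every printable ASCII character, then keep only those that
// encodeURIComponent() leaves untouched (i.e. URL-component safe).
const printable = [];
for (let c = 0x20; c < 0x7f; c++) printable.push(String.fromCharCode(c));

const urlSafe = printable.filter((ch) => encodeURIComponent(ch) === ch);
console.log(urlSafe.join(''));
// Letters and digits survive, along with the punctuation !'()*-._~ —
// notably including the single quote and the short-form symbols above.
```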
selectors/selector-syntax.md (outdated):

Parentheses annotate structure and are sometimes required for ambiguous cases, such as unions (which contain an arbitrary number of selectors) or selectors with optional parameters of conflicting types.

However, the parser can usually infer the structure without them because most selectors have a fixed or semi-fixed arity and certain types are only allowed at certain places.
I'm a little a'twitch about this part about "fixed or semi-fixed arity".
It's true that in the Schema for selectors, most things have fixed "arity" and any optionals are clearly demarcated. However, it's also the case that we have some idea of how migrations within the Schema system will work if we add more fields or selector types. Here, it seems it might be slightly less clear; migration rules that work for the Schema won't automatically translate to this DSL.
Maybe that's fine. I suppose unless the DSL is exactly an IPLD codec (... in which case it's hardly delivering the kind of increased terseness that makes something earn the term "DSL" at all!), it's necessarily going to have different migration rules. And maybe there's no major problem with that.
Just thinking out loud.
Agreed, I'm not settled on this part yet either. I'm currently going through and implementing a full parser. I feel I'll get a much better sense of where it's ambiguous and be able to more clearly describe the grammar.
I have no objections to this, I think :)
I'm also not really reviewing for ergonomics though, as I feel ill suited to do so without an application in my mind's eye, which I'm pretty sparse on. And thus, "no objections" is about the greenest light I'm likely to give, if that makes sense. :P
After spending a day implementing a full parser, I've discovered the white-space rules are more interesting than I initially thought. The design goal was to make white-space 100% irrelevant to the parser. Most programming languages claim not to have significant white-space, but that's not entirely true for any real language. The first exception, obviously, is that white-space within strings needs to be preserved. This is easy enough: I simply turn off white-space stripping within quoted sections. The second case is one you typically don't realize: most languages use white-space as a token separator, for example between two adjacent string literals.

The parsing of such an example depends on the space being between the two string literals. If we really ignore white-space, then it should parse the same when the space is removed. Currently, this syntax really means it when it says no white-space, so the second version will be a single string. Another example is identifiers: in most languages two identifiers must be separated by white-space, and here removing the space merges them into one token, so we should avoid keywords that run together ambiguously. This spec does specify which reading it should be (it would always be the longest match).
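A tiny JS sketch (hypothetical `stripWhitespace` minifier, not the actual implementation) illustrates how removing white-space outside quoted sections merges two string tokens:

```javascript
// Strip all white-space that is not inside a single-quoted section.
function stripWhitespace(src) {
  let out = '';
  let inString = false;
  for (const ch of src) {
    if (ch === "'") inString = !inString;
    if (inString || !/\s/.test(ch)) out += ch;
  }
  return out;
}

// Two separate string literals...
console.log(stripWhitespace("'foo' 'bar'")); // → "'foo''bar'"
// ...which, under doubled-single-quote escaping, re-parses as the ONE
// string foo'bar rather than the two strings foo and bar.
```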
@warpfork One solution to the potential confusion with merged identifiers would be to preserve the white-space as tokens and tell the parser about them. The problem with this is that it would require those spaces to be preserved in compact mode. That could almost double the length of the minimized version and include lots of spaces, which can sometimes be problematic in URLs. I'll consider it further, though.
Definitely meant my remark on that as a "2 cents". You've probably already thought about it much more than I have.
The initial lexer implementation is now done. The description in the spec for parsing out identifiers seems to be working: https://github.com/creationix/sel-parse-zig/blob/887c3628e11b4ff751e68a9ade4c18a9bf4daf25/src/lexer.zig

In particular, there is a hard-coded list of identifiers expected here:

```
// Sorted by longest first, then lexical within same length.
const identifiers = .{
    "condition",
    "recursive",
    "recurse",
    "fields",
    "index",
    "match",
    "range",
    "union",
    "all",
    ".",
    "*",
    "~",
    "c",
    "f",
    "i",
    "r",
    "R",
    "u",
};
```

And then parsing those ends up being quite straightforward:

```
// Tokenize Identifiers
inline for (identifiers) |ident| {
    var matched = true;
    var i: u32 = 0;
    while (i < ident.len) {
        if (i >= input.len or ident[i] != input[i]) {
            matched = false;
            break;
        }
        i += 1;
    }
    if (matched) return Token{ .id = .Identifier, .slice = input[0..i] };
}
```
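The same longest-match rule can be sketched in a few lines of JS (hypothetical `matchIdentifier` helper; the keyword list is copied from the Zig lexer above). Because the list is sorted longest first, `recursive` wins over its prefix `recurse`:

```javascript
// Keywords sorted by longest first, then lexically within same length,
// mirroring the Zig lexer's identifier table.
const identifiers = [
  'condition', 'recursive', 'recurse', 'fields', 'index', 'match',
  'range', 'union', 'all', '.', '*', '~', 'c', 'f', 'i', 'r', 'R', 'u',
];

// Return the first (therefore longest) keyword that prefixes the input.
function matchIdentifier(input) {
  for (const ident of identifiers) {
    if (input.startsWith(ident)) return ident;
  }
  return null;
}

console.log(matchIdentifier('recursive(')); // → "recursive"
console.log(matchIdentifier('recurse)'));   // → "recurse"
```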
To get an idea of what the lexer tokens look like, here is the output for a sample selector:

```
0 Id.Identifier 9B `recursive`
9 Id.Open 1B `(`
10 Id.Unknown 1B `l`
11 Id.Identifier 1B `i`
12 Id.Unknown 1B `m`
13 Id.Identifier 1B `i`
14 Id.Unknown 2B `t=`
16 Id.Decimal 1B `5`
17 Id.Whitespace 3B `
`
20 Id.Identifier 6B `fields`
26 Id.Open 1B `(`
27 Id.Whitespace 5B `
`
32 Id.String 6B `'tree'`
38 Id.Open 1B `(`
39 Id.Whitespace 7B `
`
46 Id.Identifier 9B `recursive`
55 Id.Open 1B `(`
56 Id.Whitespace 9B `
`
65 Id.Identifier 3B `all`
68 Id.Open 1B `(`
69 Id.Identifier 7B `recurse`
76 Id.Close 1B `)`
77 Id.Whitespace 7B `
`
84 Id.Close 1B `)`
85 Id.Whitespace 5B `
`
90 Id.Close 1B `)`
91 Id.Whitespace 5B `
`
96 Id.String 9B `'parents'`
105 Id.Open 1B `(`
106 Id.Whitespace 7B `
`
113 Id.Identifier 3B `all`
116 Id.Open 1B `(`
117 Id.Identifier 7B `recurse`
124 Id.Close 1B `)`
125 Id.Whitespace 5B `
`
130 Id.Close 1B `)`
131 Id.Whitespace 3B `
`
134 Id.Close 1B `)`
135 Id.Whitespace 1B `
`
136 Id.Close 1B `)`
```
Selector syntax should be easy to read/author by humans.

This means it should be terser than the JSON or YAML representations of the IPLD data, but still verbose enough to have meaningful structure and keywords/symbols.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor style suggestion: collapse these two paragraphs into a bullet-point list hanging off the first sentence:
... easy to read/author by humans. This means it should:
- Be terser ...
- Allow flexibility ...
)
```

Depending on the context, we could omit the parentheses because the optional `stopAt` parameter is of type `Condition` and the parser likely expects something else after this node.
So there aren't any selector forms with possibly ambiguous lengths, so predicting a `stopAt` in the shortened selector syntax should be straightforward? I see `Matcher` in the selector schema also has two optional fields, `onlyIf` and `label`; could these get in the way?
It's hard to say. This is why I'm implementing syntax parsers. So far I've not come across any concrete use cases where the parentheses are actually needed.
selectors/selector-syntax.md (outdated):

#### Strings

Strings are quoted using single quotes; they can be escaped by doubling the single quote. You can include non-URL-safe characters between the quotes, but you will need to escape the entire selector properly when embedding it in a URL.
Is `'` going to be robust enough if we wanted to start escaping more? Say we wanted to handle `/` characters in fields but not have a selector confuse it with a path; `'` won't help us here I think. It's the kind of thing you'd use `\` for: `parent/parent/a\/b` where `"a/b"` is a map key at some node.
I don't understand the limitation. It can escape any string you want by simply doubling any `'` characters found in the original string. If you want to include `\/` in the string, that's fine.
If you're talking about embedding the selector in a URL component, then the forward slash is already escaped at the URL-embedding layer. It's not included in the safe character set because `encodeURIComponent("foo/bar")` is `foo%2Fbar`. This syntax assumes you'll use either `encodeURI` or `encodeURIComponent` when embedding a selector in such a situation. We avoid the characters that would be escaped to reduce size bloat, but otherwise they don't pose any issues.
I'm thinking about the case where we want special characters, not just raw characters, in the same way that `\n` is a special character in many `\` escaping schemes. My point about `/` is hypothetical: imagine a case where we say that `'foo/bar/baz'` is parsed according to the rules you have here, as a path shorthand. Right now, if you happen to have data with keys that contain the `/` character, then paths aren't going to work nicely for you (bad luck, choose better keys, buddy). But say we decide it's important to expand the range of characters that paths work for to include `/` characters inside of keys, so we want to be able to switch between `/` as a path separator and `/` as a character inside a key. A logical way of doing this is to escape it, turning it into something special that the parser has to treat differently ("oh, that's not a path separator because it's escaped, it must be part of the key string for this path segment"). How would you escape anything other than `'` when `'` is both your escaping character and your string terminator? If you have a key `b/a/z` at the end of your path, you can't write `'foo/bar/b'/a'/z'` because you've got termination confusion. If a character other than `'` were your escaping character, then it's clearer: `'foo/bar/b\/a\/z'`.
I see — you want the ability to embed things like newlines and tabs in the string without using actual newlines and tabs. In the current scheme, it's quite possible to include them, but it's going to look weird:

```
'this is a string
with a newline'
```

vs

```
'this is a string\nwith a newline'
```
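For reference, decoding the doubled-single-quote scheme being discussed is trivial; here is a minimal JS sketch (hypothetical `decodeSingleQuoted` helper). Note there is no escape for a newline, so the literal itself has to span lines:

```javascript
// Decode a single-quoted literal where '' inside the quotes means one '.
function decodeSingleQuoted(literal) {
  // Strip the outer quotes, then collapse each '' pair to a single '.
  return literal.slice(1, -1).replace(/''/g, "'");
}

console.log(decodeSingleQuoted("'it''s here'")); // → "it's here"

// A raw newline is legal inside the quotes but has no escaped form:
console.log(decodeSingleQuoted("'line one\nline two'"));
```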
I went ahead and switched to a more traditional string literal syntax. It's now single-quote only, but otherwise very similar to JSON, with backslash escaping for white-space, backslash, and single quote. I'm leaving unicode escaping out of this since unicode values can simply be included as-is and the JSON form of `\uxxxx` is only 16-bit.
I really like how the two forms are really just one form with shortenings (and no comments for the URL form), but we're trading that against ease of implementation: every environment that needs to use these would need a custom parser written. It doesn't seem difficult, but it's worth noting as we expand our language support (Go, JS, and now Rust, but also Filecoin is using selectors and being implemented in C++, and maybe others too, and this might be something they want at some point) that we're making that exchange.
@rvagg FWIW, I'm working on two concurrent implementations: one in vanilla JS and one in Zig, which can be compiled to WebAssembly or a C ABI library for use in virtually any language. I plan to tweak this spec with my findings from the two parsers to ensure it's not more difficult than it needs to be.
So while implementing multiple versions of the parser, I'm getting more and more convinced this spec needs to be more automated and formalized. I'm now spiking on generating a syntax grammar that can be automatically derived from the selector schema directly. The design will still be similar to what's proposed here, but it should be more consistent and less hand-crafted, to make tooling across the board easier.
Ok, the JS parser can now correctly compile all the sample selectors in this spec. The new approach worked well. Basically, I load the existing IPLD Schema for selectors and generate a parser from that. In order to match the proposed syntax, the parser generator accepts a list of "aliases" for various types. For example, this is the code in JS that generates the selector syntax parser:

```
const schema = schemaParse(readFileSync(`${__dirname}/selectors.ipldsch`, 'utf8'))
const parseSelectorEnvelope = makeParser(schema, "SelectorEnvelope", {
  Matcher: ['match', '.'],
  ExploreAll: ['all', '*'],
  ExploreFields: ['fields', 'f'],
  ExploreIndex: ['index', 'i'],
  ExploreRange: ['range', 'r'],
  ExploreRecursive: ['recursive', 'R'],
  ExploreUnion: ['union', 'u'],
  ExploreConditional: ['condition', 'c'],
  ExploreRecursiveEdge: ['recurse', '~'],
  RecursionLimit_None: ['none', 'n'],
})
```

The types mentioned already exist in the schema; I'm creating a semi-automated DSL by specifying the entry point and the long and short keywords for some types. Note that this library could be used to create a DSL for any IPLD data structure that has a schema.
I found a case where the parentheses are significant and can't always be removed when converting to short form. Consider the following selector:

```
union(
  union(
    match
  )
  match
)
```

Since a union contains an arbitrary number of selectors, the parser can't tell where the inner union's list ends without the parentheses:

```
# Correct short form
uu(.).

# Wrong short form
uu..
```

We could keep it simple and say the minimizer always preserves parentheses when encoding a list. I don't even know of any real-world use cases that use nested unions.
Another case where they are required is labeled matchers inside of fields:

```
# Fields with labels
fields(
  'with-label'(
    match('label')
  )
  'without-label'(
    match
  )
)

# Properly Minimized
f'with-label'(.'label')'without-label'.

# Another Properly Minimized
f'with-label'.('label')'without-label'.

# Broken minimized
f'with-label'.'label''without-label'.
```

This brings up another question about normalization of the minimized form. Are we OK with there being multiple correct short forms? Does it matter?
@rvagg, you were right! Strings are problematic too. The less common escaping method used for strings (two single quotes) works, but it also introduces cases where multiple tokens are merged. For example, consider the following:

```
fields
  'foo'
    match
      label='blue'
  'bar'
    match
```

This currently breaks because, once white-space is stripped, the `'blue'` label string and the `'bar'` field string merge into `'blue''bar'`, which lexes as a single escaped string. Sample:

```
fields
  'with-label'
    match
      label='my-label'
  'another-field'
    match

SyntaxError: "fields'foo'matchlabel='blue''bar'match"
                                               ^ Unexpected extra syntax at 33
```

I propose we switch to a more traditional string syntax with backslash escaping. It will add some bloat when URL-encoding, but overall it should be an improvement.
I don't know specifically for this, but we keep finding cases elsewhere where we don't have one-single-way and that being a problem. I don't know how it would show up here, but maybe if someone chose to encode a selector string rather than a full selector as per the schema, then having more than one way to say the same thing might be a problem. There are a lot of byte-shavers around who will look at this work, look at the full selector schema, and opt for the short string form. Would it be a big deal to make a "must be the simplest accurate representation" rule that would give us one-single-way?
We've learned a lot from this, but we're not quite sure how we want to handle simplified string representations for selectors and paths. Closing for now.
This is a proposal for a selector syntax that closely models the semantics already in the IPLD format of selectors. The rationale and constraints for the design are included in the document, as well as many examples and, hopefully, enough description of lexing/parsing behavior to make it unambiguous.
I would love feedback on what you like about this, what drives you crazy, and hopefully find out if this is a good direction.