This repository was archived by the owner on Jun 29, 2022. It is now read-only.

Add initial specification for selector syntax. #239

Closed
wants to merge 9 commits

Conversation

creationix
Contributor

This is a proposal for a selector syntax that closely models the semantics already in the IPLD format of selectors. The rationale and constraints for the design are included in the documents, along with many examples and hopefully enough description of lexing/parsing behavior to make it unambiguous.

I would love feedback on what you like about this, what drives you crazy, and hopefully find out if this is a good direction.

Member

@vmx vmx left a comment

Wow, that is a really good read. My comments are only in regard to typos; the rest sounds great.

Contributor

@ribasushi ribasushi left a comment

Marking "request changes" as stand-in for "request discussion". Will d another pass over this once the first two pieces are clarified

Awesome work as a whole!

```
'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's',
't', 'u', 'v', 'w', 'x', 'y', 'z', '~']
```
Contributor

"Easy embedding inside URLs" implies "easy visual skimming" ( perhaps with some initial training needed, just like e.g. regular expressions ). Assuming a person reading this is proficient: are we comfortable with a case sensitive, visually-collidable character set?

I am not particularly leaning one way or the other, but rather am bringing the point up for discussion.

Contributor Author

@creationix creationix Feb 10, 2020

This is certainly something we can use as an added constraint to consider when choosing the characters used for short form. Currently, it only uses f, i, r, u, c, F, *, ., and ~.

The listing of URL-safe characters is more of a technical constraint about which ASCII characters can be embedded in URL components without needing to be escaped.


Parentheses annotate structure and are sometimes required for ambiguous cases such as unions, which contain an arbitrary number of selectors, or selectors with optional parameters of conflicting types.

However, the parser can usually infer the structure without them because most selectors have a fixed or semi-fixed arity and certain types are only allowed at certain places.
Contributor

I'm a little a'twitch about this part about "fixed or semi-fixed arity".

It's true that in the Schema for selectors, most things have fixed "arity" and any optionals are clearly demarcated. However, it's also the case that we have some idea of how migrations within the Schema system will work if we add more fields or selector types. Here, it seems it might be slightly less clear; migration rules that work for the Schema won't automatically translate to this DSL.

Maybe that's fine. I suppose unless the DSL is exactly an IPLD codec (... in which case it's hardly delivering the kind of increased terseness that makes something earn the term "DSL" at all!), it's necessarily going to have different migration rules. And maybe there's no major problem with that.

Just thinking out loud.

Contributor Author

Agreed, I'm not settled on this part yet either. I'm currently going through and implementing a full parser. I feel I'll get a much better sense of where it's ambiguous and be able to more clearly describe the grammar.

Contributor

@warpfork warpfork left a comment

I have no objections to this, I think :)

I'm also not really reviewing for ergonomics though, as I feel ill suited to do so without an application in my mind's eye, which I'm pretty sparse on. And thus, "no objections" is about the greenest light I'm likely to give, if that makes sense. :P

@creationix
Contributor Author

creationix commented Feb 10, 2020

After spending a day implementing a full parser, I've discovered the white-space rules are more interesting than I initially thought. The design goal was to make white-space 100% irrelevant to the parser. Most programming languages claim to not have significant white-space, but that's not entirely true for any real language.

The first exception, obviously, is that white-space within strings needs to be preserved. This is easy enough: I simply turn off white-space stripping within quoted sections.

The second case is one you typically don't think about: most languages use white-space as token separators. For example:

# Escaped quotes
'This is ''one'' string with escaped quotes.'
# Escaped quotes or two strings?
'is this one string?' 'or two strings?'

The parsing of the second example depends on that space being between the two string literals. If we really ignore white-space, then it should parse the same if the space is removed.

Currently, this syntax really means it when it says no white-space, and the second example will parse as a single string.

Another example is identifiers. In most languages foo bar is clearly two identifiers while foobar is one. In this language, we ignore the space entirely and both look identical to the parser. It needs other clues to know where one identifier/selector starts and where another ends. This is accomplished by the language having a fixed set of identifiers and careful design to make sure that there is never ambiguity.

For example, we should avoid having a foo selector with short form f if there is also another selector whose short form is o. A developer may write foo and we might interpret it as f, o, o, or they might write f o o and we interpret it as foo.

This spec does specify which it should be (it would always be foo, because longer identifiers match first). But we also want to avoid these situations whenever possible, since humans are not compilers and may have different expectations, given that most or all existing languages don't really ignore white-space.
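To make that concrete, here's a rough JavaScript sketch of longest-match tokenization over a fixed identifier set, using the current keywords and short forms (simplified to identifiers only, and not the real implementation):

```
// Rough sketch: whitespace-insensitive, longest-match identifier lexing.
// The list is sorted longest-first so the longest identifier always wins.
const identifiers = [
  'condition', 'recursive', 'recurse', 'fields', 'index', 'match',
  'range', 'union', 'all', '.', '*', '~', 'c', 'f', 'i', 'r', 'R', 'u'
]

function lexIdentifiers (input) {
  const tokens = []
  let pos = 0
  while (pos < input.length) {
    if (/\s/.test(input[pos])) { pos++; continue } // whitespace is skipped entirely
    const ident = identifiers.find((id) => input.startsWith(id, pos))
    if (!ident) throw new SyntaxError(`Unexpected syntax at ${pos}`)
    tokens.push(ident)
    pos += ident.length
  }
  return tokens
}

// Because whitespace is ignored, these two inputs tokenize identically:
// lexIdentifiers('all recurse')  => [ 'all', 'recurse' ]
// lexIdentifiers('allrecurse')   => [ 'all', 'recurse' ]
```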

@creationix
Contributor Author

creationix commented Feb 10, 2020

@warpfork One solution to the potential confusion with merged identifiers would be to preserve the white-space as tokens and tell the parser about them. But the problem with this is that it would require those spaces to be preserved in compact mode. This could almost double the length of the minimized version and include lots of spaces, which can sometimes be problematic in URLs.

I'll consider it further though.

@warpfork
Contributor

Definitely meant my remark on that as a "2 cents". You've probably already thought about it much more than I have.

@creationix
Contributor Author

creationix commented Feb 17, 2020

The initial lexer implementation is now done. The description in the spec for parsing out identifiers seems to be working. https://github.com/creationix/sel-parse-zig/blob/887c3628e11b4ff751e68a9ade4c18a9bf4daf25/src/lexer.zig

In particular, there is a hard-coded list of identifiers expected here:

// Sorted by longest first, then lexical within same length.
const identifiers = .{
    "condition",
    "recursive",
    "recurse",
    "fields",
    "index",
    "match",
    "range",
    "union",
    "all",
    ".",
    "*",
    "~",
    "c",
    "f",
    "i",
    "r",
    "R",
    "u",
};

And then parsing those ends up being quite straightforward.

// Tokenize Identifiers
inline for (identifiers) |ident| {
    var matched = true;
    var i: u32 = 0;
    while (i < ident.len) {
        if (i >= input.len or ident[i] != input[i]) {
            matched = false;
            break;
        }
        i += 1;
    }
    if (matched) return Token{ .id = .Identifier, .slice = input[0..i] };
}

@creationix
Contributor Author

To get an idea of what the lexer tokens look like, this is the output of the following:

recursive(limit=5
  fields(
    'tree'(
      recursive(
        all(recurse)
      )
    )
    'parents'(
      all(recurse)
    )
  )
)
0       Id.Identifier   9B      `recursive`
9       Id.Open 1B      `(`
10      Id.Unknown      1B      `l`
11      Id.Identifier   1B      `i`
12      Id.Unknown      1B      `m`
13      Id.Identifier   1B      `i`
14      Id.Unknown      2B      `t=`
16      Id.Decimal      1B      `5`
17      Id.Whitespace   3B      `
  `
20      Id.Identifier   6B      `fields`
26      Id.Open 1B      `(`
27      Id.Whitespace   5B      `
    `
32      Id.String       6B      `'tree'`
38      Id.Open 1B      `(`
39      Id.Whitespace   7B      `
      `
46      Id.Identifier   9B      `recursive`
55      Id.Open 1B      `(`
56      Id.Whitespace   9B      `
        `
65      Id.Identifier   3B      `all`
68      Id.Open 1B      `(`
69      Id.Identifier   7B      `recurse`
76      Id.Close        1B      `)`
77      Id.Whitespace   7B      `
      `
84      Id.Close        1B      `)`
85      Id.Whitespace   5B      `
    `
90      Id.Close        1B      `)`
91      Id.Whitespace   5B      `
    `
96      Id.String       9B      `'parents'`
105     Id.Open 1B      `(`
106     Id.Whitespace   7B      `
      `
113     Id.Identifier   3B      `all`
116     Id.Open 1B      `(`
117     Id.Identifier   7B      `recurse`
124     Id.Close        1B      `)`
125     Id.Whitespace   5B      `
    `
130     Id.Close        1B      `)`
131     Id.Whitespace   3B      `
  `
134     Id.Close        1B      `)`
135     Id.Whitespace   1B      `
`
136     Id.Close        1B      `)`


Selector syntax should be easy to read/author by humans.

This means it should be terser than the JSON or YAML representations of the IPLD data, but still verbose enough to have meaningful structure and keywords/symbols.
Member

minor style suggestion: collapse these two paragraphs into a bullet-point list hanging off the first sentence:

... easy to read/author by humans. This means it should:

  • Be terser ...
  • Allow flexibility ...


Depending on the context, we could omit the parentheses because the optional `stopAt` parameter is of type `Condition` and the parser likely expects something else after this node.
Member

So there aren't any selector forms with possibly ambiguous lengths, meaning predicting a stopAt in the shortened selector syntax should be straightforward?
I see Matcher in the selector schema also has two optional fields, onlyIf and label; could these get in the way?

Contributor Author

It's hard to say. This is why I'm implementing syntax parsers. So far I've not come across any concrete use cases where the parentheses are actually needed.


#### Strings

Strings are quoted using single quotes; a literal single quote is escaped by doubling it. You can include non-URL-safe characters between the quotes, but you will need to escape the entire selector properly when embedding it in a URL.
Member

Is ' going to be robust enough if we wanted to start escaping more? Say we wanted to handle / characters in fields but not have a selector confuse them with a path; ' won't help us here, I think. It's the kind of thing you'd use \ for: parent/parent/a\/b where "a/b" is a map key at some node.

Contributor Author

I don't understand the limitation. It can escape any string you want by simply doubling any ' characters found in the original string. If you want to include \/ in the string, it's fine.
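For illustration, this is roughly what the doubled-quote scheme looks like in JavaScript (the helper names here are made up for the example):

```
// Doubled-single-quote escaping: any string content is representable,
// including slashes and backslashes, by doubling literal ' characters.
function quoteString (s) {
  return "'" + s.replace(/'/g, "''") + "'"
}

function unquoteString (quoted) {
  // assumes `quoted` starts and ends with a single quote
  return quoted.slice(1, -1).replace(/''/g, "'")
}

quoteString("a/b")            // => "'a/b'"
quoteString("it's fine")      // => "'it''s fine'"
unquoteString("'it''s fine'") // => "it's fine"
```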

Contributor Author

@creationix creationix Mar 8, 2020

If you're talking about embedding the selector in a URL component, then the forward slash is already escaped at the URL-embedding layer. It's not included in the safe character set because encodeURIComponent("foo/bar") is foo%2Fbar. This syntax assumes you'll use either encodeURI or encodeURIComponent when embedding a selector in such a situation. We avoid the characters that would be escaped to reduce size bloat, but otherwise they don't pose any issues.
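For example (these inputs are just character-set demonstrations, not necessarily valid selectors):

```
// The characters chosen for the short form are left alone by
// encodeURIComponent; only characters inside quoted strings, like '/',
// actually get escaped.
encodeURIComponent("f'foo'(.)*~") // => "f'foo'(.)*~"  (unchanged)
encodeURIComponent("'a/b'")       // => "'a%2Fb'"      (only the slash is escaped)
encodeURIComponent("foo/bar")     // => "foo%2Fbar"
```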

Member

I'm thinking about the case where we want special characters, not just raw characters, in the same way that \n is a special character in many \ escaping schemes. My point about / is hypothetical: imagine a case where we say that 'foo/bar/baz' is parsed according to the rules you have here, as a path shorthand. Right now, if you happen to have data with keys that contain the / character, then paths aren't going to work nicely for you (bad luck, choose better keys buddy). But say we decide that it's important to expand the range of characters that paths work for to include / characters inside of keys, so we want to be able to switch between / as a path separator and / as a character inside a key. A logical way of doing this is to escape it, turn it into something special that the parser has to treat differently ("oh, that's not a path separator because it's escaped, it must be part of the key string for this path segment"). How would you escape anything other than ' when ' is both your escaping character and your string terminator? If you have a key b/a/z at the end of your path, you can't: 'foo/bar/b'/a'/z' because you've got termination confusion. If a character other than ' was your escaping character then it's clearer: 'foo/bar/b\/a\/z'.

Contributor Author

I see, you want the ability to embed things like newlines and tabs in the string without using actual newlines and tabs. In the current scheme, it's quite possible to include them, but it's going to look weird.

'this is a string
with a newline'

vs

'this is a string\nwith a newline'

Contributor Author

@creationix creationix Mar 16, 2020

I went ahead and switched to a more traditional string literal syntax. It's now single-quote only, but otherwise very similar to JSON, with backslash escaping for whitespace, backslash, and single quote. I'm leaving Unicode escaping out of this since Unicode values can simply be included as-is and the JSON form of \uxxxx is only 16-bit.
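Roughly, the revised literal could be parsed like this (the exact escape set here, \n, \t, \r, \\ and \', is my reading of the description above rather than final spec text):

```
// Sketch of the revised single-quoted string literal with backslash escapes.
// Everything else, including Unicode, is included literally.
const ESCAPES = { n: '\n', t: '\t', r: '\r', '\\': '\\', "'": "'" }

function parseStringLiteral (input, pos) {
  if (input[pos] !== "'") throw new SyntaxError(`Expected string at ${pos}`)
  let value = ''
  let i = pos + 1
  while (i < input.length) {
    const ch = input[i]
    if (ch === '\\') {
      const esc = ESCAPES[input[i + 1]]
      if (esc === undefined) throw new SyntaxError(`Bad escape at ${i}`)
      value += esc
      i += 2
    } else if (ch === "'") {
      return { value, end: i + 1 } // position just past the closing quote
    } else {
      value += ch
      i += 1
    }
  }
  throw new SyntaxError('Unterminated string')
}

parseStringLiteral("'it\\'s'", 0) // => { value: "it's", end: 7 }
```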

@rvagg
Member

rvagg commented Mar 2, 2020

I really like how the two forms are really just one form, with shortenings (and no comments for the URL form), but we're trading that against ease of implementation, since every environment that needs to be able to use these would need to have a custom parser written. It doesn't seem difficult, but it's worth noting, as we expand our language support (Go, JS and now Rust, but Filecoin is also using selectors and is being implemented in C++, and maybe others too that might want this at some point), that we're making that exchange.

@creationix
Contributor Author

@rvagg FWIW, I'm working on two concurrent implementations: one in vanilla JS and one using zig-lang, which can be compiled to WebAssembly or a C ABI library for use in virtually any language. I plan to tweak this spec with my findings from the two parsers to ensure it's not more difficult than it needs to be.

@creationix
Contributor Author

So while implementing multiple versions of the parser, I'm getting more and more convinced this spec needs to be more automated and formalized. I'm now spiking on a syntax grammar that can be automatically derived from the selector schema directly. The design will still be similar to what's proposed here, but it should be more consistent and less hand-crafted, to make tooling across the board easier.

@creationix
Contributor Author

creationix commented Mar 16, 2020

Ok, the JS parser can now correctly compile all the sample selectors in this spec. The new approach worked well. Basically, I load the existing IPLD Schema for selectors and generate a parser from that. In order to match the proposed syntax, the parser generator accepts a list of "aliases" for various types. For example, this is the JS code that generates the selector syntax parser:

const schema = schemaParse(readFileSync(`${__dirname}/selectors.ipldsch`, 'utf8'))

const parseSelectorEnvelope = makeParser(schema, "SelectorEnvelope", {
  Matcher: ['match', '.'],
  ExploreAll: ['all', '*'],
  ExploreFields: ['fields', 'f'],
  ExploreIndex: ['index', 'i'],
  ExploreRange: ['range', 'r'],
  ExploreRecursive: ['recursive', 'R'],
  ExploreUnion: ['union', 'u'],
  ExploreConditional: ['condition', 'c'],
  ExploreRecursiveEdge: ['recurse', '~'],
  RecursionLimit_None: ['none', 'n'],
})

The types mentioned already exist in the schema; I'm creating a semi-automated DSL by specifying the entry point and the long and short keywords for some types.

Note that this library could be used to create a DSL for any IPLD data structure that has a schema.
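Hypothetically, using the generated parser then looks something like this (the exact shape of the returned object just follows the selector schema, so I'm only printing it):

```
// Hypothetical usage on the long-form sample from earlier in this thread.
// Whitespace is insignificant, so the indented form parses the same as the
// minimized one.
const selector = parseSelectorEnvelope(`
  recursive(limit=5
    fields(
      'tree'(recursive(all(recurse)))
      'parents'(all(recurse))
    )
  )
`)
console.log(JSON.stringify(selector, null, 2))
```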

@creationix
Contributor Author

creationix commented Mar 16, 2020

I found a case where the parentheses are significant and can't always be removed when converting to short form. Consider the following selector:

union(
  union(
    match
  )
  match
)

Since union (aka ExploreUnion) contains a list (aka [Selector]), it consumes an arbitrary number of selectors. If the parentheses are removed, the two matches will be put under the inner union. Therefore the short form of this is:

# Correct short form
uu(.).

# Wrong short form
uu..

We could keep it simple and say the minimizer always preserves parentheses when encoding a list. I don't even know of any real-world use cases that use ExploreUnion and would love to see the actual use cases. A smarter minimizer could do some basic analysis to know when the parentheses are required and only include them then.

@creationix
Contributor Author

creationix commented Mar 16, 2020

Another case where they are required is labeled matchers inside of fields.

# Fields with labels
fields(
  'with-label'(
    match('label')
  )
  'without-label'(
    match
  )
)

# Properly Minimized
f'with-label'(.'label')'without-label'.

# Another Properly Minimized
f'with-label'.('label')'without-label'.

# Broken minimized
f'with-label'.'label''without-label'.

This brings up another question about normalization of minimized form. Are we OK with there being multiple correct short forms? Does it matter?

@creationix
Contributor Author

creationix commented Mar 16, 2020

@rvagg, you were right!

Strings are problematic too. The less common escaping method used for strings (two single quotes) works, but it also introduces cases where multiple tokens are merged. For example, consider the following:

fields
  'foo'
    match 
      label='blue'
  'bar'
    match

This currently breaks because the 'blue' label and the 'bar' field are merged into a single label containing "blue'bar".

Sample:

    fields
      'with-label'
        match 
          label='my-label'
      'another-field'
        match


SyntaxError: "fields'foo'matchlabel='blue''bar'match"
                                               ^ Unexpected extra syntax at 33

I propose we switch to a more traditional string syntax with backslash escaping. It will add some bloat when URL encoding, but overall it should be an improvement.
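To illustrate the difference (assuming the backslash rules sketched in the review thread above):

```
# Doubled-quote escaping: these two adjacent literals lex as ONE string
'blue''bar'      # one string:  blue'bar

# Backslash escaping (proposed): the same input lexes as TWO strings,
# and a literal quote is written with a backslash instead
'blue''bar'      # two strings: blue, bar
'it\'s'          # one string:  it's
```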

@rvagg
Member

rvagg commented Mar 17, 2020

This brings up another question about normalization of minimized form. Are we OK with there being multiple correct short forms? Does it matter?

I don't know specifically for this, but we keep finding cases elsewhere where we don't have one-single-way and it turns out to be a problem. I don't know how that would show up here, but maybe if someone chose to encode a selector string rather than a full selector as per the schema, then having more than one way to say the same thing might be a problem. There are a lot of byte-shavers around that will look at this work, look at the full selector schema, and opt for a short string form.

Would it be a big deal to make a "must be the simplest accurate representation" rule that would give us one-single-way?

Re ExploreUnion, it's a selection strategy that is only useful when you have more than one selector to use, isn't it? For practical purposes you should always have >1 item in the list, making parens always necessary and your example overly simple. There are going to be cases of poorly crafted selectors that misuse it, but that shouldn't be the normal case.

@mikeal
Contributor

mikeal commented Oct 8, 2020

We’ve learned a lot from this but we’re not quite sure how we want to handle simplified string representations for selectors and paths. Closing for now.
