This repository was archived by the owner on Jun 29, 2022. It is now read-only.

Add initial specification for selector syntax. #239

Closed
wants to merge 9 commits into from
384 changes: 384 additions & 0 deletions selectors/selector-syntax.md
@@ -0,0 +1,384 @@
Specification: IPLD Selectors Syntax
=============================

**Status: Prescriptive - Draft**

Introduction
------------

### Motivation - What is Selectors Syntax

*Prerequisites: [Selectors](https://github.com/ipld/specs/blob/master/selectors/selectors.md).*

IPLD Selectors are represented as IPLD data nodes. This is great for embedding them in a structured way, but authoring or viewing them in this format is cumbersome. This syntax provides a textual DSL for reading and writing selectors in a more text-friendly format.

Tooling can be used to convert between formats and even various styles optimized for the use-case at hand.

#### URL Friendly

Selector syntax should embed easily inside URLs.

Where possible, this syntax restricts itself to the characters that can be embedded in URLs without escaping, namely this subset of ASCII:

```js
[ '!', "'", '(', ')', '*', '-', '.', '0', '1',
'2', '3', '4', '5', '6', '7', '8', '9', 'A',
'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J',
'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S',
'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '_', 'a',
'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's',
't', 'u', 'v', 'w', 'x', 'y', 'z', '~']
```
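As a quick sanity check, the set above appears to be exactly the characters that JavaScript's `encodeURIComponent` leaves untouched. The sketch below verifies this; the names `URL_SAFE` and `isUrlSafe` are illustrative and not part of the specification.

```js
// The characters listed above, written as a single string.
const URL_SAFE =
  "!'()*-." +
  "0123456789" +
  "ABCDEFGHIJKLMNOPQRSTUVWXYZ" +
  "_" +
  "abcdefghijklmnopqrstuvwxyz" +
  "~";

// Hypothetical helper: true if `text` survives encodeURIComponent unchanged.
function isUrlSafe(text) {
  return encodeURIComponent(text) === text;
}

// Check every listed character individually.
const allSafe = [...URL_SAFE].every(isUrlSafe);

console.log(allSafe);                    // every listed character is unescaped
console.log(isUrlSafe("R5f'parent'~")); // a terse selector built from the set
console.log(isUrlSafe("a b"));          // whitespace would need escaping
```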

Contributor

"Easy embedding inside URLs" implies "easy visual skimming" ( perhaps with some initial training needed, just like e.g. regular expressions ). Assuming a person reading this is proficient: are we comfortable with a case sensitive, visually-collidable character set?

I am not particularly leaning one way or the other, but rather am bringing the point up for discussion.

Contributor Author

@creationix, Feb 10, 2020

This is certainly something we can use as an added constraint to consider when choosing the characters used for short form. Currently, it only uses f, i, r, u, c, F, *, ., and ~.

The listing of URL-safe characters is more of a technical constraint: it records which ASCII characters can be embedded in URL components without needing to be escaped.


This also means it needs to be as terse as possible and not contain whitespace of any kind.

For example, this selector simulates a git shallow clone by recursively walking commit parents up to depth 5 and walking all of the tree graphs for each.

```ipldsel
# Starting at the commit block.
R5f'tree'R*~'parents'*~
```

#### Human Friendly

Selector syntax should be easy to read/author by humans.

This means it should be terser than the JSON or YAML representations of the IPLD data, but still verbose enough to have meaningful structure and keywords/symbols.

Member

minor style suggestion: collapse these two paragraphs into a bullet-point list hanging off the first sentence:

... easy to read/author by humans. This means it should:

  • Be terser ...
  • Allow flexibility ...


This means it should allow flexibility with whitespace as well as allowing optional symbols and annotations to make structure easier to see visually.

The exact same git shallow clone selector from above can also be written in the following style. (This is not another mode; it's the same syntax.)

```ipldsel
recursive(limit=5
  fields(
    'tree'(
      recursive(
        all(recurse)
      )
    )
    'parents'(
      all(recurse)
    )
  )
)
```

Examples
--------

### Deeply Nested Path

Based on [this example](example-selectors.md#deeply-nested-path).

A selector to extract the year:

#### Human Readable Style

This is the default style for human interfacing. It has clear structure and descriptive keywords.

```ipldsel
fields('characters'(
  fields('kathryn-janeway'(
    fields('birthday'(
      fields('year'(match))
    ))
  ))
))
```

#### URL Embeddable Style

This is the default style for maximum terseness. It minifies everything possible.

```ipldsel
f'characters'f'kathryn-janeway'f'birthday'f'year'.
```

### Getting a certain number of parent blocks in a blockchain

This is based on [this sample](example-selectors.md#getting-a-certain-number-of-parent-blocks-in-a-blockchain).

#### Parents Without Recursion

Direct and simple path traversal:

```ipldsel
# Long Form
fields('parent'(
  fields('parent'(
    fields('parent'(
      fields('parent'(
        fields('parent'(
          match
        ))
      ))
    ))
  ))
))

# Short Form
f'parent'f'parent'f'parent'f'parent'f'parent'.
```

#### Parents Using Recursion

```ipldsel
# Long Form
recursive(limit=5
  fields('parent'(
    recurse
  ))
)

# Short Form
R5f'parent'~
```

### Getting changes up to a certain one

Based on [this example](example-selectors.md#getting-changes-up-to-a-certain-one).

```ipldsel
# Long Form
recursive(
  limit=100
  fields(
    'prev'(recurse)
  )
  stopAt=... # Conditions are not specified yet
)

# Short Form
R100f'prev'~... # Conditions are not specified yet
```

### Retrieving data recursively

Based on [this example](example-selectors.md#retrieving-data-recursively).

The following selector visits all `links` and matches all `data` fields:

```ipldsel
# Long Form
recursive(limit=1000
  fields(
    'data'(match)
    'links'(
      all(
        fields('cid'(
          recurse
        ))
      )
    )
  )
)

# Short Form
R1000f'data'.'links'*f'cid'~
```

Syntax Specification
--------------------

Selectors Syntax is defined as a textual projection of the Selector AST and thus does not contain any of its own runtime semantics.

### Long and Short Keywords

Each selector type has both long and short names that can be used interchangeably as follows:

- Matcher can be `match` or `.`
- ExploreAll can be `all` or `*`
- ExploreFields can be `fields` or `f`
- ExploreIndex can be `index` or `i`
- ExploreRange can be `range` or `r`
- ExploreRecursive can be `recursive` or `R`
- ExploreUnion can be `union` or `u`
- ExploreConditional can be `condition` or `c`
- ExploreRecursiveEdge can be `recurse` or `~`

This mode-less flexibility, combined with tools to automatically translate in bulk between styles, makes it possible for a single syntax to work well for both human and URL-embedding use cases.
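The keyword list above can be captured as a lookup table, which is one way a style-translation tool might start. This is only a sketch: the names `KEYWORDS`, `toShort`, and `toLong` are hypothetical, and the table covers only the selector keywords listed in this section.

```js
// [selector type, long keyword, short keyword], as listed above.
const KEYWORDS = [
  ["Matcher", "match", "."],
  ["ExploreAll", "all", "*"],
  ["ExploreFields", "fields", "f"],
  ["ExploreIndex", "index", "i"],
  ["ExploreRange", "range", "r"],
  ["ExploreRecursive", "recursive", "R"],
  ["ExploreUnion", "union", "u"],
  ["ExploreConditional", "condition", "c"],
  ["ExploreRecursiveEdge", "recurse", "~"],
];

// Lookup maps in both directions.
const toShort = new Map(KEYWORDS.map(([, long, short]) => [long, short]));
const toLong = new Map(KEYWORDS.map(([, long, short]) => [short, long]));

console.log(toShort.get("recursive")); // "R"
console.log(toLong.get("~"));          // "recurse"
```

Because both maps end up with nine entries, no long or short name is reused across selector types, which is the property the interchangeable forms rely on.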

### Whitespace is Ignored

Whitespace is completely ignored by the parser except for inside quoted strings.

When extending this in the future, be aware that whitespace cannot be used as keyword boundaries (`"ab cd"` is identical to `"a bc d"`).
We should have enough space for dozens of long and short names, but will want to write a tool to automatically look for ambiguities as well as improve developer experience with auto formatters and smart highlighters.

### Parentheses are Usually Optional

Parentheses annotate structure and are sometimes required for ambiguous cases, such as unions, which contain an arbitrary number of selectors, or selectors with optional parameters of conflicting types.

However, the parser can usually infer the structure without them because most selectors have a fixed or semi-fixed arity and certain types are only allowed at certain places.

Contributor

I'm a little a'twitch about this part about "fixed or semi-fixed arity".

It's true that in the Schema for selectors, most things have fixed "arity" and any optionals are clearly demarcated. However, it's also the case that we have some idea of how migrations within the Schema system will work if we add more fields or selector types. Here, it seems it might be slightly less clear; migration rules that work for the Schema won't automatically translate to this DSL.

Maybe that's fine. I suppose unless the DSL is exactly an IPLD codec (... in which case it's hardly delivering the kind of increased terseness that makes something earn the term "DSL" at all!), it's necessarily going to have different migration rules. And maybe there's no major problem with that.

Just thinking out loud.

Contributor Author

Agreed, I'm not settled on this part yet either. I'm currently going through and implementing a full parser. I feel I'll get a much better sense of where it's ambiguous and be able to more clearly describe the grammar.


The best practice (and what the default formatting styles will enforce) is for human readable selectors to use parentheses liberally while URL embedding style will only contain the required ones.

### Parameters can be Named

Parameters can usually be inferred from their contextual position, but in some cases this is ambiguous and the name needs to be specified. In more cases, annotating them is simply good for human clarity.

For example, `recursive` has two required parameters and a 3rd optional one.

```ts
recursive(sequence: Selector, limit: int, stopAt?: Condition)
```

Written verbosely with parentheses, named parameters, and whitespace, it looks like this:

```ipldsel
recursive(
  limit=5
  sequence=...
)
```

Depending on the context, we could omit the parentheses because the optional `stopAt` parameter is of type `Condition` and the parser likely expects something else after this node.

Member

so there aren't any selector forms with possibly ambiguous lengths so predicting a stopAt in the shortened selector syntax should be straightforward?
I see Matcher in the selector schema also has two optional fields, onlyIf and label, could these get in the way?

Contributor Author

It's hard to say. This is why I'm implementing syntax parsers. So far I've not come across any concrete use cases where the parentheses are actually needed.


Also, we don't need to annotate `limit=` or `sequence=` since both are non-optional and have unique types. Notice that the order doesn't matter: we can put `limit` before `sequence` because their types are unambiguous.

Best practice is to annotate `limit`, but not `sequence` for human readable, and omit both for URL form.

```ipldsel
# Human Readable
recursive(limit=5 ...)
# URL Embeddable
R(5...)
```

### Literal Values

Some of the selectors accept literal values as parameters. These are currently `String`, `{String:Selector}`, and `int`.

#### Integers

Positive integers can be encoded using base 10:

```
123 # Decimal
```

#### Strings

Strings are quoted using single quotes. A single quote inside a string is escaped by doubling it. You can include non-URL-safe characters between the quotes, but you will need to escape the entire selector properly when embedding it in a URL.

Member

Is ' going to be robust enough if we wanted to start escaping more? Say we wanted to handle / characters in fields but not have a selector confuse it with a path, ' won't help us here I think, it's the kind of thing you'd use \ for: parent/parent/a\/b where "a/b" is a map key at some node.

Contributor Author

I don't understand the limitation. It can escape any string you want by simply doubling any ' characters found in the original string. If you want to include \/ in the string, it's fine.

Contributor Author

@creationix, Mar 8, 2020

If you're talking about embedding the selector in a URL component, then the forward slash is already escaped at the URL embedding layer. It's not included in the safe character set because encodeURIComponent("foo/bar") is foo%2Fbar. This syntax assumes you'll use either encodeURI or encodeURIComponent when embedding a selector in such a situation. We avoid the characters that would be escaped to reduce size bloat, but otherwise they don't pose any issues.

Member

I'm thinking about the case where we want special characters, not just raw characters, in the same way that \n is a special character with many \ escaping schemes. My point about / is hypothetical: imagine a case where we say that 'foo/bar/baz' is parsed according to the rules you have here, it's a path shorthand. Right now, if you happen to have data with keys that contain the / character then paths aren't going to work nicely for you (bad luck, choose better keys buddy). But say we decide that it's important that we expand the range of characters that paths work for to include / characters inside of keys, so we want to be able to switch between / for path separators and / as a character inside a key. A logical way of doing this is to escape it, turn it into something special that the parser has to treat differently ("oh, that's not a path separator because it's escaped, it must be part of the key string for this path segment"). How would you escape anything other than ' when ' is both your escaping character and your string terminator? If you have a key b/a/z at the end of your path, you can't write 'foo/bar/b'/a'/z' because you've got termination confusion. If a character other than ' was your escaping character then it's clearer: 'foo/bar/b\/a\/z'.

Contributor Author

I see, you want the ability to embed things like newlines and tabs in the string without using actual newlines and tabs. In the current scheme, it's quite possible to include them, but it's going to look weird.

'this is a string
with a newline'

vs

'this is a string\nwith a newline'

Contributor Author

@creationix, Mar 16, 2020

I went ahead and switched to a more traditional string literal syntax. It's now single quote only, but otherwise very similar to JSON with backslash escaping for whitespace, backslash and single quote. I'm leaving Unicode escaping out of this since Unicode values can simply be included as-is and the JSON form of \uxxxx is only 16-bit.


```
'Hello World'
'It''s a lovely day'
'Multiline
strings'
'two'_'strings'
```

#### Maps

We need to be able to encode the keys for the `fields` selector. This is done using multiple string literals followed by nested contents.

```ipldsel
fields(
  'foo'(...)
  'bar'(...)
)
```

### Whitespace and Comments

Comments are allowed in this syntax and will be preserved by auto-formatters when possible, but will be stripped when converting to URL style and are not included in the IPLD representation of the selector.

A comment starts at `#` and ends at end of line.

Parser Specification
--------------------

### String and Comment Modes

The selector text is initially treated as a stream of characters. For parsing purposes, strings and comments create modal changes to the rules.

- When in normal mode:
  - Finding `"'"` changes to string mode.
  - Finding `"#"` changes to comment mode. (Also discard it).
  - Discard whitespace, defined as `"\r"`, `"\n"`, `"\t"`, and `" "`.
- When in string mode:
  - Finding `"'"` changes back to normal mode.
  - Preserve all characters.
- When in comment mode:
  - Finding newline changes back to normal mode.
  - Discard all characters.

If comments and strings overlap, whichever comes first is the correct mode:

```ipldsel
# This is a comment 'this is not a string'
'This # is # a string' this is normal
this is also normal
```
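The three modes can be sketched as a small state machine. This is an illustration under assumptions, not a reference implementation: the function name `scanModes` and the chunk shape `{ type, text }` are hypothetical, and merging of adjacent string literals is left to a later step.

```js
// Split selector text into "normal" and "string" chunks using the
// normal / string / comment modes described above.
function scanModes(source) {
  const chunks = [];
  let mode = "normal";
  let buffer = "";

  // Emit the buffered text. Empty normal chunks are skipped, but an
  // empty string literal ('') is still a value and is kept.
  const flush = (type) => {
    if (type === "string" || buffer.length > 0) {
      chunks.push({ type, text: buffer });
    }
    buffer = "";
  };

  for (const ch of source) {
    if (mode === "normal") {
      if (ch === "'") {
        flush("normal");
        mode = "string";
      } else if (ch === "#") {
        mode = "comment"; // the "#" itself is discarded
      } else if (!" \t\r\n".includes(ch)) {
        buffer += ch; // whitespace is discarded, everything else kept
      }
    } else if (mode === "string") {
      if (ch === "'") {
        flush("string");
        mode = "normal";
      } else {
        buffer += ch; // preserve all characters, including "#"
      }
    } else if (ch === "\n") {
      mode = "normal"; // comment mode ends at the newline
    }
  }

  if (mode === "string") throw new SyntaxError("Unterminated string");
  flush("normal");
  return chunks;
}
```

Run on the three-line example above, this sketch yields one string chunk (`This # is # a string`) followed by one normal chunk with all whitespace removed.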

### Identifier Tokenization

The parser knows a fixed set of built-ins to look for: the long and short forms of the selectors, plus other built-ins. To keep the specification simple, text is tokenized by sorting all the identifiers longest first and trying each one in that order until one matches.

```ipldsel
# This will match `fields` first and not even try `f`.
fields...
```
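Longest-first matching can be sketched as follows. The identifier list here contains only the selector keywords from this document; other built-ins such as parameter names are omitted, and the function name `matchIdentifier` is hypothetical.

```js
// Long and short selector keywords known to this sketch.
const IDENTIFIERS = [
  "match", ".", "all", "*", "fields", "f", "index", "i", "range", "r",
  "recursive", "R", "union", "u", "condition", "c", "recurse", "~",
];

// Sort longest first so that `fields` is tried before `f`.
const BY_LENGTH = [...IDENTIFIERS].sort((a, b) => b.length - a.length);

// Return the identifier found at `offset` in `text`, or null if none matches.
function matchIdentifier(text, offset = 0) {
  for (const id of BY_LENGTH) {
    if (text.startsWith(id, offset)) return id;
  }
  return null;
}

console.log(matchIdentifier("fields'a'.")); // "fields"
console.log(matchIdentifier("f'a'."));      // "f"
```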

### Number Tokenization

Numbers are tokenized similarly to identifiers. If a single zero is followed by `x`, `o`, or `b` and then one or more digits belonging to that base, it is tokenized as a number in that base. Otherwise it is tokenized as a zero. Normal decimal numbers are also parsed greedily.

For example:

```ipldsel
123 # this is 123
0xdeg # this is 0xde or 222 with `g` leftover to tokenize.
0123 # this is 123
```
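A minimal sketch of these rules, assuming that hexadecimal digits may be upper or lower case (the text above does not say) and using a hypothetical `matchNumber` helper that reports how many characters were consumed:

```js
// Base prefixes and the digits allowed after each one.
const PREFIXES = new Map([
  ["x", { radix: 16, digits: /^[0-9a-fA-F]+/ }], // case-insensitivity is an assumption
  ["o", { radix: 8, digits: /^[0-7]+/ }],
  ["b", { radix: 2, digits: /^[01]+/ }],
]);

// Match a number at the start of `text`.
// Returns { value, length } or null if `text` does not start with a digit.
function matchNumber(text) {
  const prefix = text[0] === "0" ? PREFIXES.get(text[1]) : undefined;
  if (prefix) {
    const found = prefix.digits.exec(text.slice(2));
    if (found) {
      return {
        value: parseInt(found[0], prefix.radix),
        length: 2 + found[0].length,
      };
    }
    // No digits of that base follow: only the zero is a number.
    return { value: 0, length: 1 };
  }
  const decimal = /^[0-9]+/.exec(text);
  return decimal
    ? { value: parseInt(decimal[0], 10), length: decimal[0].length }
    : null;
}

console.log(matchNumber("0xdeg")); // { value: 222, length: 4 }, leaving "g"
```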

### String Tokenization

Strings are tokenized simply by switching modes based on the presence of `"'"` characters. Quote escaping is enabled by a rule that whenever two string literals are next to each other, they are combined into a single string with a single quote inserted between them.

```ipldsel
'I am a string' # "I am a string"
'I''m a string too' # "I'm a string too"
```
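The merging rule can be sketched like this. It is an illustration only: `readString` and `quoteString` are hypothetical names, and the sketch treats literals as adjacent only when their quotes touch directly, which is an assumption about what "next to each other" means.

```js
// Read one string value starting at `text[start]`, which must be a quote.
// Returns the decoded value and the index just past the closing quote.
function readString(text, start = 0) {
  if (text[start] !== "'") throw new SyntaxError("Expected a quote");
  let value = "";
  let i = start + 1;
  while (i < text.length) {
    if (text[i] !== "'") {
      value += text[i];
      i += 1;
    } else if (text[i + 1] === "'") {
      // Closing quote followed directly by an opening quote:
      // two adjacent literals, joined with a single quote.
      value += "'";
      i += 2;
    } else {
      return { value, end: i + 1 };
    }
  }
  throw new SyntaxError("Unterminated string");
}

// Inverse direction: wrap a value in quotes, doubling embedded quotes.
function quoteString(value) {
  return "'" + value.replace(/'/g, "''") + "'";
}

console.log(readString("'I''m a string too'").value); // I'm a string too
console.log(quoteString("It's a lovely day"));        // 'It''s a lovely day'
```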

### Parentheses and Parse Order

Arguments/parameters are consumed greedily by the innermost consumer. If the type doesn't match what it is looking for, then it is closed and the next in the stack gets a shot. If we run out of consumers and the value is unmatched, it's a syntax error. For example:

```ipldsel
fields 'fieldName' match
```

First we parse `fields`. This expects `{String:Selector}`, which to the parser is a stream of alternating `String` and `Selector` tokens. We put this on the stack and look at the next value. It's a `String`, which has no children. The consumer on the top of the stack is looking for a `String`, so it consumes this token. Then we read the next value. It's a `match`, which also has no children. The `fields` on the stack is now looking for a `Selector`, which this qualifies as, so it gets consumed next.

After that we reach the end of the stream and pop everything off the stack. Any consumer that still lacks a required parameter is now a syntax error.

We could have added parentheses to this, but they were not needed since the default parsing interpretation is what we wanted.

```ipldsel
# This is the same as above when parsed.
fields('fieldName'(match))
```

When parentheses are added, they constrain the nesting level at which tokens live. The level goes up with every `"("` and down with every `")"`. All parameters to a single consumer must have the same nesting level or they don't match.
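The greedy consumption described above can be sketched for a tiny subset of the language. This sketch makes several assumptions: it handles only `fields` and `match`, ignores parentheses entirely, uses recursion in place of an explicit consumer stack, and invents both the token shape (`{ kind, value }`) and the output shape.

```js
// Parse an already-tokenized selector. Only `fields` and `match` are
// supported here; parentheses and all other selectors are out of scope.
function parseSelector(tokens) {
  let pos = 0;

  function selector() {
    const token = tokens[pos++];
    if (token === undefined) {
      throw new SyntaxError("Missing required selector");
    }
    if (token.kind === "match") return { match: {} };
    if (token.kind === "fields") {
      const fields = {};
      // The innermost consumer greedily takes alternating String and
      // Selector values for as long as the next token is a String.
      while (pos < tokens.length && tokens[pos].kind === "string") {
        const key = tokens[pos++].value;
        fields[key] = selector();
      }
      return { fields };
    }
    throw new SyntaxError("Unexpected token: " + token.kind);
  }

  const result = selector();
  // A value that no consumer accepted is a syntax error.
  if (pos !== tokens.length) throw new SyntaxError("Unconsumed tokens");
  return result;
}

// Tokens for: fields 'fieldName' match
const tree = parseSelector([
  { kind: "fields" },
  { kind: "string", value: "fieldName" },
  { kind: "match" },
]);
console.log(JSON.stringify(tree)); // {"fields":{"fieldName":{"match":{}}}}
```

When the inner `fields` stops seeing strings, control returns to the enclosing consumer, which mirrors closing the top of the stack and giving the next consumer a shot.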

Known issues
------------

- Note that the status of this document is "Draft"!
- The "Condition" system is not fully specified -- it is a placeholder awaiting further design.
- The description of the lexing and parsing algorithm should be sufficient for unambiguous parsing, but more formal consideration is strongly recommended including tools to test for regressions as we add to this language.

Other related work
------------------

### Implementations

None yet.

### Design History

None yet.