Support for pop values higher than 1 #177

Closed
zharinov wants to merge 1 commit


zharinov

First of all, I'm happy to finally get to thank you for this library, which helps us a lot with our parsers for Renovate.

The problem we've encountered is how to easily parse different styles of string template literals:

  • com.fasterxml.jackson.core:jackson-annotations:$version
  • com.fasterxml.jackson.core:jackson-annotations:${version}

This PR implements support for pop values higher than 1, which seems to be enough to solve problems like this.

@nathan
Collaborator

nathan commented Aug 25, 2022

You don't need pop > 1 to tokenize the example you gave:

const lex = moo.states({
    main: {
        complex: {match: '${', push: 'interp'},
        simple: {match: '$', push: 'simple'},
        lit: {match: /[^$]+/u, lineBreaks: true},
    },
    simple: {
        simpleStuff: {match: /\w+/, pop: true},
    },
    interp: {
        complexClose: {match: '}', pop: true},
        complexStuff: {match: /[^}]+/u, lineBreaks: true},
    },
})

console.log([...lex.reset('com.fasterxml.jackson.core:jackson-annotations:$version')])
console.log([...lex.reset('com.fasterxml.jackson.core:jackson-annotations:${version}')])

Could you give a real-world example of something that can't be tokenized with the current version of moo? (My original states implementation almost supported pop > 1, but I couldn't think of any uses for it that weren't just complex/dubious versions of pop: 1 lexers.)

EDIT: I looked at the test case in the PR. You can do that with just next and pop: 1. (In your code the tpl state nexts to itself, which is a no-op.)

const lex = moo.states({
  main: {
    strstart: {match: '"', push: 'str'},
    ident:    /\w+/,
    space:    {match: /\s+/, lineBreaks: true},
  },
  str: {
    strend:   {match: '"', pop: true},
    tplstart: {match: '$', next: 'tpl'},
    content:  moo.fallback,
  },
  tpl: {
    strend:   {match: '"', pop: true},
    tplstart: '$',
    ident:    /\w+/,
    content:  moo.fallback,
  },
})
console.log(Array.from(lex.reset('"$foo $bar" baz'), x => x.type))

@zharinov
Author

zharinov commented Aug 25, 2022

Well, my edge case is quite specific, as I need to handle $foo.bar and ${foo.bar} in the same way: strstart tplstart sym dot sym tplend strend. I did my best to keep both variations as close as possible, but there may be undesired side effects here and there.

Now I'm thinking of constructing a single regex-based token type for the "simple" variation and post-processing its inner value with a simpler parser. It requires more code, but will be more precise.

Sorry for distracting you, I'll close this PR if you don't mind.

@zharinov zharinov closed this Aug 25, 2022
@zharinov
Author

And thank you for the quick response 😉

@nathan
Collaborator

nathan commented Aug 25, 2022

No worries! If you find yourself needing tokens that don't represent any characters in the input (like tplend in the un-braced example), it's a good sign you should post-process the token stream. The easiest way to do that is to use more specific token names like tplstartunbraced, then detect each tplstartunbraced, rename it to tplstart and insert a tplend after the last sym that follows it. (See, e.g., this post-processing example for whitespace sensitivity.) Matching too much in a single token and re-lexing is usually slower.
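
A minimal sketch of that post-processing step, assuming the lexer emits tokens with the types tplstartunbraced, sym and dot (the names come from this thread, not from moo itself). The helper closeUnbracedTemplates is hypothetical, and it works on any iterable of objects with a type field, so it is shown here on hand-written tokens rather than on real lexer output. The inserted tplend carries only type and value; a real version would probably also copy position fields such as offset, line and col from a neighbouring token.

```javascript
// Hypothetical post-processor: renames each `tplstartunbraced` to `tplstart`
// and inserts a zero-width `tplend` after the last `sym` that follows it.
// A template is taken to be: tplstartunbraced sym (dot sym)*
function* closeUnbracedTemplates(tokens) {
  let open = false     // true while inside an un-braced template
  let heldDot = null   // a dot held back until we know a sym follows it

  for (const tok of tokens) {
    if (open) {
      if (tok.type === 'sym') {
        // The held dot (if any) turned out to belong to the template.
        if (heldDot) { yield heldDot; heldDot = null }
        yield tok
        continue
      }
      if (tok.type === 'dot' && heldDot === null) {
        heldDot = tok
        continue
      }
      // Anything else ends the template: close it before a dangling dot.
      yield { type: 'tplend', value: '' }
      open = false
      if (heldDot) { yield heldDot; heldDot = null }
    }
    if (tok.type === 'tplstartunbraced') {
      open = true
      yield { ...tok, type: 'tplstart' }
    } else {
      yield tok
    }
  }

  // Input ended while a template was still open.
  if (open) {
    yield { type: 'tplend', value: '' }
    if (heldDot) yield heldDot
  }
}

// Hand-written tokens standing in for the lexer output of "$foo.bar"
const sampleTokens = [
  { type: 'strstart', value: '"' },
  { type: 'tplstartunbraced', value: '$' },
  { type: 'sym', value: 'foo' },
  { type: 'dot', value: '.' },
  { type: 'sym', value: 'bar' },
  { type: 'strend', value: '"' },
]
console.log(Array.from(closeUnbracedTemplates(sampleTokens), t => t.type).join(' '))
// → strstart tplstart sym dot sym tplend strend
```

Because the function is a generator, it can wrap the lexer directly (for example closeUnbracedTemplates(lex.reset(input))) and the parser sees the same token sequence for both the braced and un-braced forms.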
