-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle carriage returns in the lexer #770
Conversation
Switching to draft: this PR seems to break the catala-examples repo build. |
compiler/surface/lexer.cppo.ml
Outdated
@@ -772,11 +773,13 @@ let lex_raw (lexbuf : lexbuf) : token = | |||
(* Nested match for lower priority; `_` matches length 0 so we effectively retry the | |||
sub-match at the same point *) | |||
match%sedlex lexbuf with | |||
| Star (Compl '\r'), ("\r\n" | eof) | |||
| Star (Compl '\n'), ('\n' | eof) -> LAW_TEXT (Utf8.lexeme lexbuf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe strip the \r
from the lexbuf in the first case, so that we consistently get strings without it for law text ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm I'm afraid if you have a stray \r
further down in your file this could cause everything until then to be gobbled by the first regexp...
I guess we need to expand the regexp complement here ? Star (Plus (Compl "\r\n"), Opt ('\r', Compl '\n')), Opt '\r', '\n'
or something like that ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm Compl
only accepts single characters though. I'll figure something out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't this be sufficient ?
Star (Compl '\r' | Compl '\n'), Opt ('\r'), ('\n' | eof)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You want Compl "\r\n"
(or however this can be written in sedlex syntax), (Compl '\r' | Compl '\n')
would include anything.
Otherwise, the difference is that this would forbid \r
that appear outside of a newline (or just before eof)... but I guess that's actually fine with us, indeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed.. I must have been tired on friday. I pushed a new version. I encoded Compl "\r\n"
as any_but_eol = Sub (any, "\n\r")
which should be semantically equivalent. Otherwise, I think that's reasonable to forbid \r
chars outside of newlines but we could also allow them. I have no strong opinions. I also forbid them in this new version.
06d613c
to
827292c
Compare
827292c
to
959ef6b
Compare
@vincent-botbol is this ready for merge ? |
I think so. |
This PR allows the lexer to handle catala files that contains carriage-returns alongside line feeds. Currently, on windows, such files fails to parse entirely.
Following @AltGr suggestions, patching the lexer seems to be an easy fix. However, I'm not fully confident in this patch as there might be side-effects that I'm not aware of (e.g., lexing performances).