Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle carriage returns in the lexer #770

Merged
merged 1 commit into from
Jan 20, 2025
Merged

Conversation

vincent-botbol
Copy link
Contributor

This PR allows the lexer to handle catala files that contains carriage-returns alongside line feeds. Currently, on windows, such files fails to parse entirely.
Following @AltGr suggestions, patching the lexer seems to be an easy fix. However, I'm not fully confident in this patch as there might be side-effects that I'm not aware of (e.g., lexing performances).

@vincent-botbol vincent-botbol added ❌ bug Something isn't working 🔧 compiler Issue concerns the compiler 🤩 ide IDE plugins and tooling labels Jan 10, 2025
@vincent-botbol vincent-botbol marked this pull request as draft January 10, 2025 10:06
@vincent-botbol
Copy link
Contributor Author

Switching to draft: this PR seems to break the catala-examples repo build.

@@ -772,11 +773,13 @@ let lex_raw (lexbuf : lexbuf) : token =
(* Nested match for lower priority; `_` matches length 0 so we effectively retry the
sub-match at the same point *)
match%sedlex lexbuf with
| Star (Compl '\r'), ("\r\n" | eof)
| Star (Compl '\n'), ('\n' | eof) -> LAW_TEXT (Utf8.lexeme lexbuf)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe strip the \r from the lexbuf in the first case, so that we consistently get strings without it for law text ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm I'm afraid if you have a stray \r further down in your file this could cause everything until then to be gobbled by the first regexp...
I guess we need to expand the regexp complement here ? Star (Plus (Compl "\r\n"), Opt ('\r', Compl '\n')), Opt '\r', '\n' or something like that ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm Compl only accepts single characters though. I'll figure something out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this be sufficient ?

Star (Compl '\r' | Compl '\n'), Opt ('\r'), ('\n' | eof)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You want Compl "\r\n" (or however this can be written in sedlex syntax), (Compl '\r' | Compl '\n') would include anything.
Otherwise, the difference is that this would forbid \r that appear outside of a newline (or just before eof)... but I guess that's actually fine with us, indeed.

Copy link
Contributor Author

@vincent-botbol vincent-botbol Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed.. I must have been tired on friday. I pushed a new version. I encoded Compl "\r\n" as any_but_eol = Sub (any, "\n\r") which should be semantically equivalent. Otherwise, I think that's reasonable to forbid \r chars outside of newlines but we could also allow them. I have no strong opinions. I also forbid them in this new version.

@vincent-botbol vincent-botbol force-pushed the vbot@handle-carriage-return branch 2 times, most recently from 06d613c to 827292c Compare January 10, 2025 16:53
@vincent-botbol vincent-botbol force-pushed the vbot@handle-carriage-return branch from 827292c to 959ef6b Compare January 13, 2025 11:13
@AltGr
Copy link
Contributor

AltGr commented Jan 20, 2025

@vincent-botbol is this ready for merge ?

@vincent-botbol vincent-botbol marked this pull request as ready for review January 20, 2025 09:21
@vincent-botbol
Copy link
Contributor Author

@vincent-botbol is this ready for merge ?

I think so.

@AltGr AltGr merged commit ba3b268 into master Jan 20, 2025
5 checks passed
@AltGr AltGr deleted the vbot@handle-carriage-return branch January 20, 2025 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
❌ bug Something isn't working 🔧 compiler Issue concerns the compiler 🤩 ide IDE plugins and tooling
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants