move some invalid exponent detection into rustc_session #131656

richard-uk1 · 2024-10-13T14:02:43Z

This PR moves part of the exponent checks from rustc_lexer/rustc_parser into rustc_session.

This change does not affect which programs are accepted by the complier, or the diagnostics that are reported, with one main exception. That exception is that floats or ints with suffixes beginning with e are rejected after the token stream is passed to proc macros, rather than being rejected by the parser as was the case. This gives proc macro authors more consistent access to numeric literals: currently a proc macro could interpret 1m or 30s but not 7eggs or 3em. After this change all are handled the same. The lexer will still reject input if it contains e followed by a number, +/-, or _ if they are not followed by a valid integer literal (number + _), but this doesn't affect macro authors who just want to access alpha suffixes.

This PR is a continuation of #79912. In that PR, it was suggested that a new enum was used to indicate type of exponent (whether accepted or rejected). I originally took that approach with this PR, but it didn't seem necessary and made the changes more complex. I can try to go down that road instead if that's the consensus. It is also solving exactly the same problem as #111628.

TODO before ready for review (assuming approach is OK)

Some more UI tests (e.g. 1em)
Check that re-ordering of errors is OK, or how to change it. If OK, update output.
Consider improving UI for suffixes beginning with e (if suffix begins with 'e' suggest an exponential)
Currently if the character following the e is _, then the lexer tries to parse an exponent and fails if there are no digits after. The issue is that a valid integer can have any number of _s before the digit, meaning deciding whether a the suffix is a number or not requires unbounded lookahead. There are a few options here This now handles arbitrary _s, removing the special case.

~~Although I haven't marked this PR as 'ready for review' since there are outstanding stuff that need doing, I do want to get feedback on the approach.~~ Ready for review now.

Also do you want me to write an MCP

r: @petrochenkov, since they reviewed #79912.

rustbot · 2024-10-13T14:02:51Z

r? @petrochenkov

rustbot has assigned @petrochenkov.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

richard-uk1 · 2024-11-10T12:57:10Z

I've added some ui tests and I think this is now ready for review. It doesn't stop @petrochenkov changing the whole thing later if they so choose.

petrochenkov · 2024-12-10T18:51:22Z

Apologies for the delays.
This is a language change rather than a fix, and it's not very urgent, so I'll prioritize some other things in my review queue over it.
In any case I should get to in in February, when I return to work.

tests/ui/consts/custom-suffixes-exponent-like.rs

compiler/rustc_session/src/errors.rs

compiler/rustc_lexer/src/lib.rs

petrochenkov · 2025-02-21T18:13:54Z

I'd still prefer to see #111628 (comment) implemented, but I think this looks like a compatible subset, so we can go forward.

compiler/rustc_lexer/src/lib.rs

petrochenkov · 2025-02-21T18:42:57Z

There's also one subtle change in behavior here.
If we have a suffix that starts with _s and then continues with some unicode character that is

is_id_continue
not is_id_start, and
not an ascii digit,

then eat_literal_suffix fill fail to eat the tail of the suffix correctly because it expects a is_id_start character at the start.

For correct behavior we'll need to use self.eat_while(is_id_continue); instead of eat_literal_suffix if the suffix start position is overridden.
This should be testable with U+005F _ LOW LINE.

richard-uk1 · 2025-02-23T13:41:00Z

Hey @petrochenkov,

did you mean U+005F (underscore) as the test character? It wouldn't fit your criteria since it is a valid identifier start character. I've been using U+00B7 (middle dot) as it meets your criteria.
I checked and the nuber parser already fails to handle the case where there is a id_continue but not id_start character that isn't a digit in the suffix, e.g. (1_\u{00B7}). If I understand correctly, this means it's not a regression, so perhaps it should be fixed in a separate PR? Please LMK if I've misunderstood.

Other than the points above, it should be ready for review again.

petrochenkov · 2025-02-24T16:18:14Z

If I understand correctly, this means it's not a regression, so perhaps it should be fixed in a separate PR?

I agree that the issue can be addressed separately, because it only appears in cases that produced errors previously and still produce errors after this PR.

petrochenkov · 2025-02-24T16:41:48Z

@mattheww
I think I agree with the conclusion.
In literals that accept the _ fillers we should keep the status quo and just eagerly attach them to the "main part" of the literal (rather than to the suffix).

Technically we could support 1_· without backtracking, by lexing both the main part of the literal and the suffix part "in one go", rather than separately, and then determining the "delimitation point" between the parts.
That's exactly what this PR does to support the e suffixes.
1___· with multiple underscores would then be lexed as 1__ + _·, to keep it similar to 1__u8 where underscores clearly prefer the "main part" on the left and u8 is clearly the suffix.
But whether we should do it? Probably not, I don't see practical motivation.

However, when e suffixes are supported, 1e_· should be valid, because the suffix is e_·, there's no ambiguity here.

mattheww · 2025-02-26T23:52:22Z

@petrochenkov

What should happen for something like 1e2·?

In that case there's a prefix of the input which is an acceptable floating-point literal, but the longest tokenisable sequence will be the integer literal with suffix e2·

Co-authored-by: Vadim Petrochenkov <vadim.petrochenkov@gmail.com>

- factor out code that determines whether the string beggining with `e_` is an exponent or the start of the suffix - reduce use of mutable state

richard-uk1 · 2025-02-27T15:10:59Z

Ok ready for review. I'll try to tackle the issue with xid_continue-but-not-digit after if you like. Is there an issue for it?

petrochenkov · 2025-02-27T18:13:09Z

Looks like fixes for #131656 (comment) and #131656 (comment) were lost during a rebase.

petrochenkov · 2025-02-27T18:20:14Z

compiler/rustc_lexer/src/lib.rs

+                (Float { base, empty_exponent }, suffix_start)
+            }
+            ('e' | 'E', '_') => {
+                if let Some(suffix_start) = self.eat_underscore_exponent() {


Style nit: if let with short arms usually looks better as a match.

petrochenkov · 2025-02-27T18:24:02Z

compiler/rustc_lexer/src/lib.rs

                // might have stuff after the ., and if it does, it needs to start
                // with a number
                self.bump();
+                if !self.first().is_ascii_digit() {
+                    return (Float { base, empty_exponent: false }, None);


This looks redundant, the logic below does the same thing.

petrochenkov · 2025-02-27T18:24:08Z

compiler/rustc_lexer/src/lib.rs

                    }
+                    _ => Some(self.pos_within_token()),


Suggested change

_ => Some(self.pos_within_token()),

_ => None,

petrochenkov · 2025-02-27T18:26:41Z

@richard-uk1
Could you also squash commits and update the PR description?

petrochenkov · 2025-02-27T18:35:30Z

I'll try to tackle the issue with xid_continue-but-not-digit after if you like. Is there an issue for it?

There's no issue, could you make one?

We could discuss the questions like

What should happen for something like 1e2·?

on that issue too, this PR doesn't change the behavior here (it still produces an error), so it can be postponed.
(I'm personally not sure about it. Technically, this is unambiguously a suffix, but it's clearly useless, so if it's harder to lex...)

@rustbot author

rustbot assigned petrochenkov Oct 13, 2024

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Oct 13, 2024