Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

move some invalid exponent detection into rustc_session #131656

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

richard-uk1
Copy link
Contributor

@richard-uk1 richard-uk1 commented Oct 13, 2024

This PR moves part of the exponent checks from rustc_lexer/rustc_parser into rustc_session.

This change does not affect which programs are accepted by the complier, or the diagnostics that are reported, with one main exception. That exception is that floats or ints with suffixes beginning with e are rejected after the token stream is passed to proc macros, rather than being rejected by the parser as was the case. This gives proc macro authors more consistent access to numeric literals: currently a proc macro could interpret 1m or 30s but not 7eggs or 3em. After this change all are handled the same. The lexer will still reject input if it contains e followed by a number, +/-, or _ if they are not followed by a valid integer literal (number + _), but this doesn't affect macro authors who just want to access alpha suffixes.

This PR is a continuation of #79912. In that PR, it was suggested that a new enum was used to indicate type of exponent (whether accepted or rejected). I originally took that approach with this PR, but it didn't seem necessary and made the changes more complex. I can try to go down that road instead if that's the consensus. It is also solving exactly the same problem as #111628.

TODO before ready for review (assuming approach is OK)

  • Some more UI tests (e.g. 1em)
  • Check that re-ordering of errors is OK, or how to change it. If OK, update output.
  • Consider improving UI for suffixes beginning with e (if suffix begins with 'e' suggest an exponential)
  • Currently if the character following the e is _, then the lexer tries to parse an exponent and fails if there are no digits after. The issue is that a valid integer can have any number of _s before the digit, meaning deciding whether a the suffix is a number or not requires unbounded lookahead. There are a few options here This now handles arbitrary _s, removing the special case.

Although I haven't marked this PR as 'ready for review' since there are outstanding stuff that need doing, I do want to get feedback on the approach. Ready for review now.

Also do you want me to write an MCP

r: @petrochenkov, since they reviewed #79912.

@rustbot
Copy link
Collaborator

rustbot commented Oct 13, 2024

r? @petrochenkov

rustbot has assigned @petrochenkov.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Oct 13, 2024
@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@richard-uk1 richard-uk1 force-pushed the move_empty_exponent_to_rustc_session branch from 8b41315 to e8c244e Compare October 13, 2024 23:06
@rust-log-analyzer

This comment has been minimized.

@richard-uk1 richard-uk1 force-pushed the move_empty_exponent_to_rustc_session branch from a642bff to 82a8103 Compare November 10, 2024 12:06
@rust-log-analyzer

This comment has been minimized.

@richard-uk1 richard-uk1 marked this pull request as ready for review November 10, 2024 12:56
@richard-uk1
Copy link
Contributor Author

I've added some ui tests and I think this is now ready for review. It doesn't stop @petrochenkov changing the whole thing later if they so choose.

@rust-log-analyzer

This comment has been minimized.

@petrochenkov
Copy link
Contributor

Apologies for the delays.
This is a language change rather than a fix, and it's not very urgent, so I'll prioritize some other things in my review queue over it.
In any case I should get to in in February, when I return to work.

@petrochenkov
Copy link
Contributor

I'd still prefer to see #111628 (comment) implemented, but I think this looks like a compatible subset, so we can go forward.

@petrochenkov
Copy link
Contributor

There's also one subtle change in behavior here.
If we have a suffix that starts with _s and then continues with some unicode character that is

  • is_id_continue
  • not is_id_start, and
  • not an ascii digit,

then eat_literal_suffix fill fail to eat the tail of the suffix correctly because it expects a is_id_start character at the start.

For correct behavior we'll need to use self.eat_while(is_id_continue); instead of eat_literal_suffix if the suffix start position is overridden.
This should be testable with U+005F _ LOW LINE.

@petrochenkov petrochenkov added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 21, 2025
@richard-uk1 richard-uk1 force-pushed the move_empty_exponent_to_rustc_session branch from 633b1b0 to d13c474 Compare February 22, 2025 16:12
@rust-log-analyzer

This comment has been minimized.

@richard-uk1 richard-uk1 force-pushed the move_empty_exponent_to_rustc_session branch from d13c474 to 2c33df5 Compare February 22, 2025 16:40
@rust-log-analyzer

This comment has been minimized.

@richard-uk1 richard-uk1 force-pushed the move_empty_exponent_to_rustc_session branch from c3947a0 to cfb8823 Compare February 22, 2025 17:05
@rust-log-analyzer

This comment has been minimized.

@richard-uk1 richard-uk1 force-pushed the move_empty_exponent_to_rustc_session branch from cfb8823 to 79d5952 Compare February 23, 2025 13:30
@richard-uk1
Copy link
Contributor Author

richard-uk1 commented Feb 23, 2025

Hey @petrochenkov,

  • did you mean U+005F (underscore) as the test character? It wouldn't fit your criteria since it is a valid identifier start character. I've been using U+00B7 (middle dot) as it meets your criteria.
  • I checked and the nuber parser already fails to handle the case where there is a id_continue but not id_start character that isn't a digit in the suffix, e.g. (1_\u{00B7}). If I understand correctly, this means it's not a regression, so perhaps it should be fixed in a separate PR? Please LMK if I've misunderstood.

Other than the points above, it should be ready for review again.

@petrochenkov
Copy link
Contributor

If I understand correctly, this means it's not a regression, so perhaps it should be fixed in a separate PR?

I agree that the issue can be addressed separately, because it only appears in cases that produced errors previously and still produce errors after this PR.

@petrochenkov petrochenkov added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 24, 2025
@petrochenkov
Copy link
Contributor

@mattheww
I think I agree with the conclusion.
In literals that accept the _ fillers we should keep the status quo and just eagerly attach them to the "main part" of the literal (rather than to the suffix).

Technically we could support 1_· without backtracking, by lexing both the main part of the literal and the suffix part "in one go", rather than separately, and then determining the "delimitation point" between the parts.
That's exactly what this PR does to support the e suffixes.
1___· with multiple underscores would then be lexed as 1__ + , to keep it similar to 1__u8 where underscores clearly prefer the "main part" on the left and u8 is clearly the suffix.
But whether we should do it? Probably not, I don't see practical motivation.

However, when e suffixes are supported, 1e_· should be valid, because the suffix is e_·, there's no ambiguity here.

@rust-log-analyzer

This comment has been minimized.

@mattheww
Copy link
Contributor

@petrochenkov

What should happen for something like 1e2·?

In that case there's a prefix of the input which is an acceptable floating-point literal, but the longest tokenisable sequence will be the integer literal with suffix e2·

@richard-uk1 richard-uk1 force-pushed the move_empty_exponent_to_rustc_session branch from e2caf03 to 828cf96 Compare February 27, 2025 13:30
@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Feb 27, 2025
 - factor out code that determines whether the string beggining with `e_`
   is an exponent or the start of the suffix
 - reduce use of mutable state
@richard-uk1
Copy link
Contributor Author

Ok ready for review. I'll try to tackle the issue with xid_continue-but-not-digit after if you like. Is there an issue for it?

@petrochenkov
Copy link
Contributor

Looks like fixes for #131656 (comment) and #131656 (comment) were lost during a rebase.

(Float { base, empty_exponent }, suffix_start)
}
('e' | 'E', '_') => {
if let Some(suffix_start) = self.eat_underscore_exponent() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style nit: if let with short arms usually looks better as a match.

// might have stuff after the ., and if it does, it needs to start
// with a number
self.bump();
if !self.first().is_ascii_digit() {
return (Float { base, empty_exponent: false }, None);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks redundant, the logic below does the same thing.

}
_ => Some(self.pos_within_token()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_ => Some(self.pos_within_token()),
_ => None,

@petrochenkov
Copy link
Contributor

@richard-uk1
Could you also squash commits and update the PR description?

@petrochenkov
Copy link
Contributor

I'll try to tackle the issue with xid_continue-but-not-digit after if you like. Is there an issue for it?

There's no issue, could you make one?

We could discuss the questions like

What should happen for something like 1e2·?

on that issue too, this PR doesn't change the behavior here (it still produces an error), so it can be postponed.
(I'm personally not sure about it. Technically, this is unambiguously a suffix, but it's clearly useless, so if it's harder to lex...)

@rustbot author

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants