Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pcre2test: tighten \x{...} parsing in subject #504

Merged
merged 1 commit into from
Oct 2, 2024

Conversation

carenas
Copy link
Contributor

@carenas carenas commented Sep 29, 2024

Address an oddity I found while accidentally making a typo of \x{100 in pcre2test and that resulted in an unexpected match and diverging results from perltest.

Additionally fix the handling of overlong numbers as shown by:

PCRE2 version 10.44 2024-06-07 (8-bit)
  re> /\D/
data> \x{1234567890}
** Too many hex digits in \x{...} item; using only the first eight.
** Character \x{23456780} is greater than 255 and UTF-8 mode is not enabled.
** Truncation will probably give the wrong result.
 0: \x80

@carenas carenas marked this pull request as draft September 29, 2024 20:50
@carenas carenas marked this pull request as ready for review September 30, 2024 16:06
@zherczeg
Copy link
Collaborator

I think the test wants to convert the utf8 representation to \x{100} as a 16 bit value. Since this is a pcre2test change, it should be harmless.

@carenas
Copy link
Contributor Author

carenas commented Sep 30, 2024

I think the test wants to convert the utf8 representation to \x{100} as a 16 bit value

That is another oddity of the test, and even more so if you consider that it ALSO hardcodes UTF-8 for the non 8-bit libraries which have a clone of it in testinpu12 and that also make even less sense.

Agree though that it is harmless, but should we keep it?

@carenas
Copy link
Contributor Author

carenas commented Sep 30, 2024

The test was actually introduced in PCRE 4.0 and the bug was actually:

PCRE version 3.9 02-Jan-2002

  re> /\x{100}{3,4}/8SD
------------------------------------------------------------------
  0  14 Bra 0
  3   1 \xc4
  6     \x80{3}
 10     \x80{,1}
 14  14 Ket
 17     End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 196
Need char = 128
Study returned NULL

which could had been simplified to /\x{100}?/ and had a typo that wasn't even relevant.

Eventhough it is documented that invalid escapes will be reported,
the code would fallback in that case and result in a NUL being
generated whenever an incompete \x{ escape was being parsed.

Refactor the code to report the error instead and fix the logic used
for overlong numbers so that the truncation doesn't result in an
unexpected value being used.

There was an old (from PCRE 4.0) test that was affected but which is
no longer relevant, because it could only be triggered with invalid
UTF (which isn't supported), and that was therefore removed as a
result.

Additionally, it was found that the same syntax error was affecting
perltest so correct that as well by reporting syntax errors in the
subject lines.

While at it update related documentation for Perl's compatibility.
@PhilipHazel PhilipHazel merged commit c0d86f7 into PCRE2Project:master Oct 2, 2024
15 checks passed
@carenas carenas deleted the bsx branch October 2, 2024 11:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants