pcre2test: tighten \x{...} parsing in subject #504

carenas · 2024-09-29T20:31:31Z

Address an oddity I found after accidentally typing \x{100 (without the closing brace) in pcre2test, which resulted in an unexpected match and in results that diverged from perltest.

Additionally, fix the handling of overlong numbers, as shown by:

PCRE2 version 10.44 2024-06-07 (8-bit)
  re> /\D/
data> \x{1234567890}
** Too many hex digits in \x{...} item; using only the first eight.
** Character \x{23456780} is greater than 255 and UTF-8 mode is not enabled.
** Truncation will probably give the wrong result.
 0: \x80

zherczeg · 2024-09-30T19:44:33Z

I think the test wants to convert the UTF-8 representation to \x{100} as a 16-bit value. Since this is a pcre2test change, it should be harmless.

carenas · 2024-09-30T20:36:53Z

> I think the test wants to convert the UTF-8 representation to \x{100} as a 16-bit value

That is another oddity of the test, even more so when you consider that it ALSO hardcodes UTF-8 for the non-8-bit libraries, which have a clone of it in testinput12, where it makes even less sense.

I agree that it is harmless, but should we keep it?

carenas · 2024-09-30T21:22:30Z

The test was introduced in PCRE 4.0, and the bug it covered was:

PCRE version 3.9 02-Jan-2002

  re> /\x{100}{3,4}/8SD
------------------------------------------------------------------
  0  14 Bra 0
  3   1 \xc4
  6     \x80{3}
 10     \x80{,1}
 14  14 Ket
 17     End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 196
Need char = 128
Study returned NULL

which could have been simplified to /\x{100}?/ and had a typo that wasn't even relevant.
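For reference, the two bytes in that dump are the UTF-8 encoding of U+0100, and the repeat counts are attached to the trailing \x80 byte alone rather than to the whole character. A minimal sketch of the encoding step follows. `utf8_encode_short` is a hypothetical helper that handles only code points below U+0800; it is not a PCRE function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode a code point below U+0800 as UTF-8. Returns the number of bytes
   written (1 or 2), or 0 if the value is out of range for this sketch. */
size_t utf8_encode_short(uint32_t cp, unsigned char out[2])
{
  assert(out != NULL);
  if (cp < 0x80)
    {
    out[0] = (unsigned char)cp;                    /* plain ASCII */
    return 1;
    }
  if (cp < 0x800)
    {
    out[0] = (unsigned char)(0xC0 | (cp >> 6));    /* 110xxxxx lead byte */
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));  /* 10xxxxxx trail byte */
    return 2;
    }
  return 0;
}
```

Calling it with 0x100 writes 0xC4 followed by 0x80, that is 196 and 128 in decimal, which lines up with the "First char = 196" and "Need char = 128" lines.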

Even though it is documented that invalid escapes will be reported, the code would fall back in that case and generate a NUL whenever an incomplete \x{ escape was parsed. Refactor the code to report the error instead, and fix the logic used for overlong numbers so that truncation doesn't result in an unexpected value being used.

An old test (from PCRE 4.0) was affected but is no longer relevant, because it could only be triggered with invalid UTF (which isn't supported); it was therefore removed.

Additionally, the same syntax error was found to affect perltest, so correct that as well by reporting syntax errors in the subject lines. While at it, update the related documentation on Perl compatibility.

carenas marked this pull request as draft September 29, 2024 20:50

carenas force-pushed the bsx branch from e5352bc to 7fa877e Compare September 29, 2024 21:00

carenas marked this pull request as ready for review September 30, 2024 16:06

carenas force-pushed the bsx branch from 7fa877e to 07c5639 Compare September 30, 2024 19:37

carenas force-pushed the bsx branch from 07c5639 to a0adeef Compare October 1, 2024 00:14

PhilipHazel merged commit c0d86f7 into PCRE2Project:master Oct 2, 2024
15 checks passed

carenas deleted the bsx branch October 2, 2024 11:41

xandris mentioned this pull request Feb 10, 2025

Fix _encodeData regexp pear/Net_URL2#17

Open
