Customer report: \p{Han} unexpectedly failing to match certain combined-surrogate code points #108

Closed
siegel opened this issue Apr 21, 2022 · 3 comments

@siegel

siegel commented Apr 21, 2022

The following is from a customer report; I can reproduce it as described, and the proximate issue seems to be that the non-matching characters are not in fact UTF-16 singletons, but rather combined surrogate pairs. (For context: BBEdit's backing store is UTF-16, so it's using pcre2-16.)

I'm not sure whether this constitutes a bug, or an enhancement request; if \p{…} is intended to match surrogate pairs for any given character class, then I guess it's a bug; otherwise it would be an enhancement request. :-)

I've attached the customer's supplied file directly, but here is (substantially) the contents of same:

===
The following are regular characters. BBEdit regex \p{Han} and . can find them.

The following are surrogate-pair characters. BBEdit regex \p{Han} currently (v. 14.1) cannot find them. Using regex . BBEdit finds each individual codepoint separately, but not the whole character (both codepoints) at once.
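
Editor's note: for readers unfamiliar with surrogate pairs, here is a minimal Python sketch of how a code point above U+FFFF is split into two UTF-16 code units. It uses U+21C95, the code point tested later in this thread; the function name `to_surrogate_pair` is hypothetical and chosen only for illustration.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary-plane code point into UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x21C95)
print(f"U+21C95 -> {high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder.
encoded = chr(0x21C95).encode("utf-16-be")
assert encoded == bytes([high >> 8, high & 0xFF, low >> 8, low & 0xFF])
```

A matcher that treats every 16-bit unit as its own character sees two separate values here, which is consistent with `.` finding each half individually in the report above.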

@zherczeg
Collaborator

I tried the first surrogate pair with pcre2test:

$ ./pcre2test -16
PCRE2 version 10.40 2022-04-14
  re> /\p{Han}/utf
data> \x{21C95}
 0: \x{21c95}
data> a
No match

Seems to be working as intended.

@PhilipHazel
Collaborator

Did you perhaps forget to set the PCRE2_UTF option? Without that option, it won't recognize surrogate pairs.
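
Editor's note: the effect of that option can be pictured with a rough Python simulation. This is not PCRE2's implementation and does not call PCRE2. It only models the difference between reading a UTF-16 subject one code unit at a time and combining surrogate pairs first. The names `HAN_RANGES`, `is_han`, and `characters` are hypothetical, and the Han check is deliberately simplified to two ranges.

```python
from typing import Iterator, List

# Assumption: only CJK Unified Ideographs and Extension B are listed;
# the real Han script property covers more ranges than these.
HAN_RANGES = [(0x4E00, 0x9FFF), (0x20000, 0x2A6DF)]


def is_han(value: int) -> bool:
    """Simplified stand-in for a \\p{Han} test on a single character value."""
    return any(lo <= value <= hi for lo, hi in HAN_RANGES)


def characters(units: List[int], utf: bool) -> Iterator[int]:
    """Yield the character values a matcher would see in a list of 16-bit units.

    With utf=True, a high surrogate followed by a low surrogate is combined
    into one code point. With utf=False, every unit is its own character.
    Unpaired surrogates are passed through unchanged, which is a
    simplification made for this sketch.
    """
    i = 0
    while i < len(units):
        unit = units[i]
        if (utf and 0xD800 <= unit <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1


subject = [0xD847, 0xDC95]  # U+21C95 encoded as UTF-16
print([is_han(c) for c in characters(subject, utf=True)])   # [True]
print([is_han(c) for c in characters(subject, utf=False)])  # [False, False]
```

In the simulation, the two surrogate halves fail the Han test on their own, and only the combined code point passes, which mirrors the behavior described in the report.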

@siegel
Author

siegel commented Apr 23, 2022

Yes, that in fact is exactly the issue. I've made the appropriate change, and appreciate your corrective guidance. :-)
