Unicode code points above the Basic Multilingual Plane should throw an error. #683

Open
rebolbot opened this issue Mar 23, 2009 · 2 comments

@rebolbot
Collaborator

Submitted by: PeterWood

I understand that Rebol 3 only handles Unicode code points in the BMP – it doesn’t handle code points above the BMP properly. Instead it just uses the lower 16 bits of the code point and discards the rest, resulting in a code point within the BMP.
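The truncation described above can be checked with a little arithmetic. The following is only an illustrative Python sketch, not Rebol's actual implementation; the variable names are made up for this example.

```python
# Illustration only: mimic keeping the lower 16 bits of a code point.
code_point = 0x1D122                 # U+1D122 (F clef), decimal 119074
truncated = code_point & 0xFFFF      # discard everything above bit 15

# UTF-8 bytes of the truncated code point, as upper-case hex.
truncated_utf8 = chr(truncated).encode("utf-8").hex().upper()

print(truncated, hex(truncated))     # 53538 0xd122
print(truncated_utf8)                # ED84A2
```

The result, U+D122, falls inside the Hangul Syllables block, which is why a Hangul character shows up in the console output.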

I would expect either a script error (out of range?) or for the code point to be converted to the “unknown” character (code point FFFD).
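As a rough sketch of the two behaviours suggested here, the hypothetical helper below either raises an error or substitutes U+FFFD for anything above the BMP. The function name `to_bmp` and its `strict` flag are assumptions for illustration and do not correspond to anything in Rebol.

```python
REPLACEMENT = 0xFFFD   # U+FFFD REPLACEMENT CHARACTER ("unknown")
BMP_MAX = 0xFFFF       # highest code point in the Basic Multilingual Plane


def to_bmp(code_point: int, strict: bool = True) -> int:
    """Hypothetical helper: reject or replace code points above the BMP."""
    if code_point <= BMP_MAX:
        return code_point
    if strict:
        raise ValueError(f"code point U+{code_point:04X} is outside the BMP")
    return REPLACEMENT


# Lenient mode yields the replacement character instead of a wrong one.
replaced = to_bmp(0x1D122, strict=False)
print(replaced)                                       # 65533
print(chr(replaced).encode("utf-8").hex().upper())    # EFBFBD
```

Either outcome avoids silently turning the F clef into an unrelated Hangul syllable.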


; Test character: "𝄢" F clef, Unicode 119074, hex 1D122, UTF-16 D834 DD22, UTF-8 F09D84A2

    >> to-integer first to-string #{F09D84A2}
    == 53538  ;; HANGUL SYLLABLE TYAELP
    ; Should be 119074, or 65533 (unknown), or an error

    >> to-hex/size first to-string #{F09D84A2} 8
    == #0000D122
    ; Should be #0001D122, or #0000FFFD (unknown), or an error

    >> to-binary to-string #{F09D84A2}
    == #{ED84A2}  ;; UTF-8 for HANGUL SYLLABLE TYAELP
    ; Should be the same, or #{EFBFBD} (unknown), or an error

    >> "𝄢"         ;; unicode D834 DD22 UTF-8 F09D84A2
    == "턢"          ;; HANGUL SYLLABLE TYAELP

    >> enbase/base "𝄢" 16
    == "ED84A2"     ;; UTF-8 for HANGUL SYLLABLE TYAELP

    >> d: "^(D834)^(DD22)"
    == "??????"

    >> enbase/base d 16
    == "EDA0B4EDB4A2"
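The last result is what you get when each half of the surrogate pair is treated as its own code point and encoded to UTF-8 separately. A small Python sketch of that, using the `surrogatepass` error handler purely to reproduce the bytes (this is an illustration, not a claim about how Rebol encodes internally):

```python
# Illustration only: encode each surrogate half as a separate 3-byte sequence.
high, low = 0xD834, 0xDD22

pair_as_two = "".join(chr(u) for u in (high, low))
separate_bytes = pair_as_two.encode("utf-8", "surrogatepass").hex().upper()

# For comparison, the proper 4-byte UTF-8 encoding of U+1D122.
proper_bytes = "\U0001D122".encode("utf-8").hex().upper()

print(separate_bytes)   # EDA0B4EDB4A2
print(proper_bytes)     # F09D84A2
```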

CC – Data [ Version: alpha 37 Type: Bug Platform: All Category: n/a Reproduce: Always Fixed-in:none ]

@rebolbot
Collaborator Author

Submitted by: BrianH

The string is supposed to be converted internally to UCS-4 when a codepoint above the BMP is inserted, but AFAIK that behavior is not yet implemented. For now, only UCS-2 characters are supported in R3.

I would not expect the string "^(D834)^(DD22)" to be converted to a single codepoint – it is clearly two codepoints in the source, since REBOL strings are UCS encoded in these escape sequences and UTF-16 encoding is not supported. If you put all 8 hex characters in a single escape expression, then it would be considered a single codepoint. Note that this doesn’t work yet.
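For reference, this is how a UTF-16 decoder would combine the two values into one code point, which is exactly the step that does not happen when the escapes are read as two independent UCS code points. The helper name `combine_surrogates` is hypothetical; the formula is the standard UTF-16 one.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Hypothetical helper: standard UTF-16 surrogate pair arithmetic."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


combined = combine_surrogates(0xD834, 0xDD22)
print(hex(combined))   # 0x1d122
```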

Updated description and examples to show what R3 is doing, and to reflect that this is cross-platform.

@rebolbot
Collaborator Author

Submitted by: Carl

It is possible to add an error check for chars out of BMP.

Also, it is possible later to allow 32-bit chars internally, but since their use is very rare, it’s not a priority when compared to what it costs (in memory usage).
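To give a rough sense of the memory cost mentioned here, the sketch below compares the size of the same BMP-only text at 2 bytes and at 4 bytes per character. It uses Python's UTF-16 and UTF-32 encoders as stand-ins for 16-bit and 32-bit internal storage, which is an assumption for illustration and says nothing about R3's actual string layout.

```python
# Illustration only: fixed-width storage cost for BMP-only text.
text = "REBOL" * 1000                         # 5000 characters, all in the BMP

bytes_16bit = len(text.encode("utf-16-le"))   # 2 bytes per character
bytes_32bit = len(text.encode("utf-32-le"))   # 4 bytes per character

print(bytes_16bit, bytes_32bit)               # 10000 20000
```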
