Unicode code points above the Basic Multilingual Plane should throw an error. #683
Labels
Type.bug
Type.Unicode
Waiting for future
Issues and wishes which are closed, but will be nice to resolve later
Submitted by: PeterWood
I understand that Rebol 3 only handles Unicode code points in the BMP - it doesn't handle code points above the BMP properly. Instead it just uses the lower 16 bits of the code point and discards the rest, resulting in a code point within the BMP.
I would expect either a script error (out of range?) or for the code point to be converted to the "unknown" character (code point FFFD).
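To make the report concrete, here is a minimal Python sketch of the arithmetic involved. It only models the behaviour described above and is not taken from R3's source. The function names and the sample code point U+1D122 are chosen for illustration.

```python
# Illustrative model only: function names are hypothetical, not R3 internals.
BMP_MAX = 0xFFFF           # highest code point in the Basic Multilingual Plane
REPLACEMENT_CHAR = 0xFFFD  # the Unicode "unknown" (replacement) character


def truncate_to_16_bits(code_point: int) -> int:
    """Reported behaviour: keep the lower 16 bits, discard the rest."""
    return code_point & 0xFFFF


def check_bmp(code_point: int) -> int:
    """First expected alternative: reject anything above the BMP."""
    if code_point > BMP_MAX:
        raise ValueError(f"code point U+{code_point:X} is out of range")
    return code_point


def replace_non_bmp(code_point: int) -> int:
    """Second expected alternative: substitute U+FFFD."""
    return code_point if code_point <= BMP_MAX else REPLACEMENT_CHAR


cp = 0x1D122
print(f"U+{truncate_to_16_bits(cp):04X}")  # U+D122, an unrelated BMP character
print(f"U+{replace_non_bmp(cp):04X}")      # U+FFFD
```

The truncated result U+D122 is a valid but unrelated BMP code point, which is why the silent conversion is easy to miss.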
Imported from: CureCode [ Version: alpha 37 Type: Bug Platform: All Category: n/a Reproduce: Always Fixed-in:none ]
Imported from: metaeducation#683
Comments:
Submitted by: BrianH
The string is supposed to be converted internally to UCS-4 when a codepoint above the BMP is inserted, but AFAIK that behavior is not yet implemented. For now, only UCS-2 characters are supported in R3.
I would not expect the string "^(D834)^(DD22)" to be converted to a single codepoint - it is clearly two codepoints in the source, since REBOL strings are UCS encoded in these escape sequences and UTF-16 encoding is not supported. If you put all 8 hex characters in a single escape expression, then it would be considered a single codepoint. Note that this doesn't work yet.
Updated description and examples to show what R3 is doing, and to reflect that this is cross-platform.
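For reference, this is what a UTF-16 decoder would do with the pair D834 DD22, which the comment above says R3 does not do. It is a Python sketch of the standard surrogate-pair formula. The function name is hypothetical.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"{high:04X} is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"{low:04X} is not a low surrogate")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(f"U+{combine_surrogates(0xD834, 0xDD22):X}")  # U+1D122
```

Under the UCS reading described in the comment, the same two escapes stay as two separate 16-bit values, D834 and DD22.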
Submitted by: Carl
It is possible to add an error check for chars out of BMP.
Also, it is possible later to allow 32-bit chars internally, but since its usage is very rare, it's not a priority when compared to what it costs (in memory usage).
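A rough illustration of the memory trade-off, assuming fixed-width storage of 2 bytes per character versus 4 bytes per character. Python's UTF-16 and UTF-32 codecs are used here only as stand-ins for those widths. This says nothing about R3's actual string layout.

```python
def encoded_size(text: str, encoding: str) -> int:
    """Number of bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))


sample = "hello"  # BMP-only text, 5 characters
print(encoded_size(sample, "utf-16-le"))  # 10 bytes at 2 bytes per character
print(encoded_size(sample, "utf-32-le"))  # 20 bytes at 4 bytes per character
```

For BMP-only text the 32-bit form doubles the storage, which is the cost the comment weighs against the rare use of characters above the BMP.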