Unicode code points above the Basic Multilingual Plane should throw an error. #683

Siskin-Bot · 2020-02-15T16:23:24Z

Submitted by: PeterWood

I understand that Rebol 3 only handles Unicode code points in the BMP - it doesn't handle code points above the BMP properly. Instead it just uses the lower 16 bits of the code point and discards the rest, resulting in a code point within the BMP.

I would expect either a script error (out of range?) or for the code point to be converted to the "unknown" character (code point FFFD).

; Test character: "𝄢" F clef, Unicode 119074, hex 1D122, UTF-16 D834 DD22, UTF-8 F09D84A2

>> to-integer first to-string #{F09D84A2}
== 53538  ;; HANGUL SYLLABLE TYAELP
>> ; Should be 119074, or 65533 (unknown), or an error

>> to-hex/size first to-string #{F09D84A2} 8
== #0000D122
>> ; Should be #0001D122, or #0000FFFD (unknown), or an error

>> to-binary to-string #{F09D84A2}
== #{ED84A2}  ;; UTF-8 for HANGUL SYLLABLE TYAELP
>> ; Should be the same, or #{EFBFBD} (unknown), or an error

>> "&#119074;"         ;; unicode D834 DD22 UTF-8 F09D84A2
== "&#53538;"          ;; HANGUL SYLLABLE TYAELP
>> enbase/base "&#119074;" 16
== "ED84A2"     ;; UTF-8 for HANGUL SYLLABLE TYAELP
>> d: "^(D834)^(DD22)"
== "??????"
>> enbase/base d 16
== "EDA0B4EDB4A2"
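
For readers who want to check the numbers in the transcript, the truncation can be reproduced outside Rebol. The following is a minimal Python sketch of the arithmetic only; it assumes the behaviour is a plain mask to the lower 16 bits, as described in the report, and it is not taken from the R3 sources.

```python
# Illustrative arithmetic only; this is not Rebol's implementation.
# Assumption: R3 keeps the lower 16 bits of the code point, as the report describes.

utf8 = bytes.fromhex("F09D84A2")            # UTF-8 bytes of the F clef test character
decoded = ord(utf8.decode("utf-8"))         # the full code point
truncated = decoded & 0xFFFF                # discard everything above bit 15
reencoded = chr(truncated).encode("utf-8")  # UTF-8 of the resulting BMP character

print(decoded)                    # 119074 (hex 1D122)
print(truncated)                  # 53538  (hex D122, HANGUL SYLLABLE TYAELP)
print(reencoded.hex().upper())    # ED84A2
```

The three printed values match the `to-integer`, `to-hex` and `to-binary` results shown above.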

Imported from: CureCode [ Version: alpha 37 Type: Bug Platform: All Category: n/a Reproduce: Always Fixed-in: none ]
Imported from: metaeducation#683

Comments:

Rebolbot commented on Mar 25, 2009:

Submitted by: BrianH

The string is supposed to be converted internally to UCS4 when a codepoint above the BMP is inserted, but AFAIK that behavior is not yet implemented. For now, only UCS-2 characters are supported in R3.

I would not expect the string "^(D834)^(DD22)" to be converted to a single codepoint - it is clearly two codepoints in the source, since REBOL strings are UCS encoded in these escape sequences and UTF-16 encoding is not supported. If you put all 8 hex characters in a single escape expression, then it would be considered a single codepoint. Note that this doesn't work yet.
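
To illustrate the distinction, here is a small Python sketch (not Rebol code; the helper names `combine_surrogates` and `encode_separately` are made up for this example). It shows what the two escapes yield when each is encoded as its own code point, which matches the `EDA0B4EDB4A2` output in the report, and what a UTF-16 reading of the same pair would yield instead.

```python
# Illustrative sketch; helper names are hypothetical and not part of any Rebol API.

def combine_surrogates(high, low):
    """Interpret a UTF-16 surrogate pair as one supplementary code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def encode_separately(*units):
    """Encode each 16-bit value as its own code point, UTF-8 style.

    "surrogatepass" lets Python emit bytes for lone surrogates, which
    strict UTF-8 would otherwise reject.
    """
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)

two_codepoints = encode_separately(0xD834, 0xDD22)
one_codepoint = combine_surrogates(0xD834, 0xDD22)

print(two_codepoints.hex().upper())   # EDA0B4EDB4A2
print(hex(one_codepoint))             # 0x1d122
```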

Updated description and examples to show what R3 is doing, and to reflect that this is cross-platform.

Rebolbot commented on Mar 26, 2009:

Submitted by: Carl

It is possible to add an error check for chars out of BMP.

Also, it is possible later to allow 32-bit chars internally, but since their usage is very rare, it's not a priority when compared to what it costs (in memory usage).
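
As a rough illustration of the two outcomes requested in the report (an error, or substitution with U+FFFD), such a check could look like the Python sketch below. The function name `store_code_point` and its `on_overflow` parameter are hypothetical; this is not how R3 implements or plans to implement the check.

```python
# Hypothetical sketch of a range check for a 16-bit character store.
# Names and behaviour are assumptions for illustration, not R3 internals.

BMP_MAX = 0xFFFF       # highest code point that fits in 16 bits
REPLACEMENT = 0xFFFD   # REPLACEMENT CHARACTER, the "unknown" character

def store_code_point(cp, on_overflow="error"):
    """Return the value to store for cp in a BMP-only string."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    if cp <= BMP_MAX:
        return cp
    if on_overflow == "replace":
        return REPLACEMENT
    raise ValueError(f"code point {cp:#x} is outside the BMP")

print(hex(store_code_point(0x1D122, on_overflow="replace")))   # 0xfffd
print(chr(REPLACEMENT).encode("utf-8").hex().upper())          # EFBFBD
```

With the default `on_overflow="error"`, the same call raises instead of silently producing a different BMP character.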

Rebolbot mentioned this issue on Jan 24, 2018:
Allow Unicode code points as chars

Rebolbot added the Type.bug label on Jan 12, 2016

Siskin-Bot added the Type.bug label Feb 15, 2020

Siskin-Bot mentioned this issue Feb 15, 2020:
Allow Unicode code points as chars #2024 (Open)

Oldes added the Type.Unicode and Waiting for future labels Sep 24, 2021

Oldes mentioned this issue Aug 17, 2024:
Extend Unicode range #2618 (Open)
