Unicode code points above the Basic Multilingual Plane should throw an error. #683

rebolbot · 2009-03-23T04:13:12Z

Submitted by: PeterWood

I understand that Rebol 3 only handles Unicode code points in the BMP – it doesn’t handle code points above the BMP properly. Instead it just uses the lower 16 bits of the code point and discards the rest, resulting in a code point within the BMP.

I would expect either a script error (out of range?) or for the code point to be converted to the “unknown” character (code point FFFD).


; Test character: "𝄢" F clef, Unicode 119074, hex 1D122, UTF-16 D834 DD22, UTF-8 F09D84A2

>> to-integer first to-string #{F09D84A2}
== 53538  ;; HANGUL SYLLABLE TYAELP
; Should be 119074, or 65533 (unknown), or an error

>> to-hex/size first to-string #{F09D84A2} 8
== #0000D122
; Should be #0001D122, or #0000FFFD (unknown), or an error

>> to-binary to-string #{F09D84A2}
== #{ED84A2}  ;; UTF-8 for HANGUL SYLLABLE TYAELP
; Should be the same, or #{EFBFBD} (unknown), or an error

>> "𝄢"         ;; unicode D834 DD22 UTF-8 F09D84A2
== "턢"          ;; HANGUL SYLLABLE TYAELP

>> enbase/base "𝄢" 16
== "ED84A2"     ;; UTF-8 for HANGUL SYLLABLE TYAELP

>> d: "^(D834)^(DD22)"
== "??????"

>> enbase/base d 16
== "EDA0B4EDB4A2"
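The numbers in the transcript can be cross-checked outside Rebol. The following is a minimal Python sketch, not R3 code, that reproduces the reported values on the assumption that R3 keeps only the low 16 bits of the code point and encodes each 16-bit unit, surrogates included, as a three-byte UTF-8 sequence. All variable names are illustrative.

```python
# Illustrative Python (not Rebol): reproduces the arithmetic behind the report.

utf8_bytes = bytes.fromhex("F09D84A2")

# Correct decoding yields the F clef code point, U+1D122.
code_point = ord(utf8_bytes.decode("utf-8"))

# Assumed R3 behaviour: only the low 16 bits survive.
truncated = code_point & 0xFFFF

# Re-encoding the truncated value gives the UTF-8 form of U+D122.
reencoded = chr(truncated).encode("utf-8")

# What the replacement character U+FFFD would look like instead.
replacement = "\uFFFD".encode("utf-8")

# Two lone surrogates, each encoded as its own three-byte sequence.
# "surrogatepass" lets Python emit bytes that strict UTF-8 forbids.
surrogates = "\uD834\uDD22".encode("utf-8", "surrogatepass")

print(code_point, hex(code_point))
print(truncated, hex(truncated))
print(reencoded.hex().upper())
print(replacement.hex().upper())
print(surrogates.hex().upper())
```

The last value matches the `enbase/base d 16` result, which suggests the two escapes are stored as two separate 16-bit code points rather than as one surrogate pair.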

CC – Data [ Version: alpha 37 Type: Bug Platform: All Category: n/a Reproduce: Always Fixed-in: none ]

rebolbot · 2009-03-25T00:44:32Z

Submitted by: BrianH

The string is supposed to be converted internally to UCS4 when a codepoint above the BMP is inserted, but AFAIK that behavior is not yet implemented. For now, only UCS-2 characters are supported in R3.

I would not expect the string “^(D834)^(DD22)” to be converted to a single codepoint – it is clearly two codepoints in the source, since REBOL strings are UCS encoded in these escape sequences and UTF-16 encoding is not supported. If you put all 8 hex characters in a single escape expression, then it would be considered a single codepoint. Note that this doesn’t work yet.
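For contrast, this is what a UTF-16 interpretation of the two escapes would compute. It is a minimal Python sketch of the standard surrogate-pair formula, shown only to make the distinction concrete; `combine_surrogates` is a hypothetical helper, not anything R3 provides.

```python
# Illustrative Python (not Rebol): the standard UTF-16 surrogate-pair formula.

def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# Under UTF-16, D834 DD22 is one code point: U+1D122.
single = combine_surrogates(0xD834, 0xDD22)
print(hex(single))

# Under the UCS reading described above, the same escapes stay two code points.
two_code_points = [0xD834, 0xDD22]
print([hex(cp) for cp in two_code_points])
```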

Updated description and examples to show what R3 is doing, and to reflect that this is cross-platform.

rebolbot · 2009-03-26T06:55:33Z

Submitted by: Carl

It is possible to add an error check for chars out of BMP.
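Such a check could be as small as a range test. The Python sketch below is only an illustration of the two outcomes requested in the report (an error, or substitution of U+FFFD); the function name `to_bmp_char` and the `strict` flag are hypothetical and do not describe R3 internals.

```python
# Illustrative Python (not Rebol): a range check for code points outside the BMP.

BMP_MAX = 0xFFFF
REPLACEMENT_CHAR = 0xFFFD

def to_bmp_char(code_point: int, strict: bool = True) -> int:
    """Return the code point if it fits in the BMP.

    Otherwise raise an error (strict) or substitute U+FFFD (non-strict),
    instead of silently keeping the low 16 bits.
    """
    if 0 <= code_point <= BMP_MAX:
        return code_point
    if strict:
        raise ValueError(f"code point {code_point:#x} is outside the BMP")
    return REPLACEMENT_CHAR

print(hex(to_bmp_char(0xD122)))                  # in range, returned unchanged
print(hex(to_bmp_char(0x1D122, strict=False)))   # replaced by U+FFFD
```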

Also, it is possible later to allow 32-bit chars internally, but since its usage is very rare, it’s not a priority when compared to what it costs (in memory usage).
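As a rough illustration of that memory cost, the Python sketch below compares the storage for the same BMP-only text at 16 and 32 bits per character. It uses Python's UTF-16 and UTF-32 codecs as stand-ins for fixed-width storage and says nothing about how R3 actually lays out strings.

```python
# Illustrative Python (not Rebol): storage cost at 16 vs 32 bits per character.

text = "Rebol" * 1000  # 5000 characters, all inside the BMP

bytes_16bit = len(text.encode("utf-16-le"))  # 2 bytes per BMP character
bytes_32bit = len(text.encode("utf-32-le"))  # 4 bytes per character

print(bytes_16bit, bytes_32bit, bytes_32bit / bytes_16bit)
```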

rebolbot added the Type.bug label Jan 12, 2016

rebolbot mentioned this issue Jan 24, 2018

Allow Unicode code points as chars #2024

Closed

Siskin-Bot mentioned this issue Feb 15, 2020

Unicode code points above the Basic Multilingual Plane should throw an error. Oldes/Rebol-issues#683

Open
