Unicode code points above the Basic Multilingual Plane should throw an error. #683

Open
Siskin-Bot opened this issue Feb 15, 2020 · 0 comments
Labels
Type.bug, Type.Unicode, Waiting for future (issues and wishes which are closed, but will be nice to resolve later)

Comments

Collaborator

Siskin-Bot commented Feb 15, 2020

Submitted by: PeterWood

I understand that Rebol 3 only handles Unicode code points in the BMP - it doesn't handle code points above the BMP properly. Instead it just uses the lower 16 bits of the code point and discards the rest, resulting in a code point within the BMP.

I would expect either a script error (out of range?) or for the code point to be converted to the "unknown" character (code point FFFD).
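The reported behaviour and the two proposed alternatives can be sketched outside Rebol. The Python below only illustrates the arithmetic and is not Rebol's actual implementation; the function names `truncate_to_bmp`, `replace_non_bmp` and `reject_non_bmp` are hypothetical.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER ("unknown")
BMP_MAX = 0xFFFF      # highest code point in the Basic Multilingual Plane


def truncate_to_bmp(cp: int) -> int:
    """Reported behaviour: keep the lower 16 bits, discard the rest."""
    return cp & 0xFFFF


def replace_non_bmp(cp: int) -> int:
    """Alternative 1: map anything above the BMP to U+FFFD."""
    return cp if cp <= BMP_MAX else REPLACEMENT


def reject_non_bmp(cp: int) -> int:
    """Alternative 2: raise an out-of-range error."""
    if cp > BMP_MAX:
        raise ValueError(f"code point U+{cp:04X} is outside the BMP")
    return cp


f_clef = ord("\U0001D122")      # 119074
print(truncate_to_bmp(f_clef))  # 53538 (U+D122, a Hangul syllable)
print(replace_non_bmp(f_clef))  # 65533 (U+FFFD)
```

Truncation is the only one of the three that silently yields a different, valid character, which is why the result is hard to notice.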

; Test character: "𝄢" F clef, Unicode 119074, hex 1D122, UTF-16 D834 DD22, UTF-8 F09D84A2

>> to-integer first to-string #{F09D84A2}
== 53538  ;; HANGUL SYLLABLE TYAELP
>> ; Should be 119074, or 65533 (unknown), or an error

>> to-hex/size first to-string #{F09D84A2} 8
== #0000D122
>> ; Should be #0001D122, or #0000FFFD (unknown), or an error

>> to-binary to-string #{F09D84A2}
== #{ED84A2}  ;; UTF-8 for HANGUL SYLLABLE TYAELP
>> ; Should be the same, or #{EFBFBD} (unknown), or an error

>> "𝄢"         ;; unicode D834 DD22 UTF-8 F09D84A2
== "턢"          ;; HANGUL SYLLABLE TYAELP
>> enbase/base "𝄢" 16
== "ED84A2"     ;; UTF-8 for HANGUL SYLLABLE TYAELP
>> d: "^(D834)^(DD22)"
== "??????"
>> enbase/base d 16
== "EDA0B4EDB4A2"
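The byte sequences in the transcript above can be reproduced independently. This is a small Python sketch for cross-checking the hex values, not a model of how Rebol encodes strings; the `surrogatepass` error handler is used only to force Python to encode lone surrogates the way the last example shows.

```python
f_clef = "\U0001D122"

# Correct UTF-8 for U+1D122 (four bytes).
print(f_clef.encode("utf-8").hex().upper())  # F09D84A2

# UTF-8 for the truncated code point U+D122 (three bytes).
truncated = chr(ord(f_clef) & 0xFFFF)
print(truncated.encode("utf-8").hex().upper())  # ED84A2

# Each UTF-16 surrogate encoded separately as if it were a character.
surrogates = "\ud834\udd22"
print(surrogates.encode("utf-8", "surrogatepass").hex().upper())  # EDA0B4EDB4A2
```

The last value is not well-formed UTF-8: a strict decoder rejects surrogate code points encoded this way.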

Imported from: CureCode [ Version: alpha 37 Type: Bug Platform: All Category: n/a Reproduce: Always Fixed-in:none ]
Imported from: metaeducation#683

Comments:

Rebolbot commented on Mar 25, 2009:

Submitted by: BrianH

The string is supposed to be converted internally to UCS4 when a codepoint above the BMP is inserted, but AFAIK that behavior is not yet implemented. For now, only UCS-2 characters are supported in R3.

I would not expect the string "^(D834)^(DD22)" to be converted to a single codepoint - it is clearly two codepoints in the source, since REBOL strings are UCS encoded in these escape sequences and UTF-16 encoding is not supported. If you put all 8 hex characters in a single escape expression, then it would be considered a single codepoint. Note that this doesn't work yet.
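For comparison, this is the arithmetic a UTF-16 decoder would apply to combine the two values into one code point. Per the comment above, R3's escape sequences do not do this, so the sketch only shows what is being declined; `combine_surrogates` is a hypothetical helper name.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(hex(combine_surrogates(0xD834, 0xDD22)))  # 0x1d122
```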

Updated description and examples to show what R3 is doing, and to reflect that this is cross-platform.


Rebolbot commented on Mar 26, 2009:

Submitted by: Carl

It is possible to add an error check for chars out of BMP.

Also, it is possible later to allow 32 bit chars internally, but since its usage is very rare, it's not a priority when compared to what it costs (in memory usage).
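The memory cost mentioned here can be illustrated with fixed-width encodings. This Python sketch assumes a plain 2-bytes-per-char versus 4-bytes-per-char layout; the thread does not describe R3's actual string storage, so the numbers are only indicative.

```python
# 12,000 characters, all inside the BMP.
text = "hello, world" * 1000

ucs2_bytes = len(text.encode("utf-16-le"))  # 2 bytes per BMP character
ucs4_bytes = len(text.encode("utf-32-le"))  # 4 bytes per character

print(ucs2_bytes, ucs4_bytes)  # 24000 48000
```

Under that assumption, widening every string to 32 bits doubles its size even when no character above the BMP is present.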


Rebolbot mentioned this issue on Jan 24, 2018:
Allow Unicode code points as chars


Rebolbot added the Type.bug label on Jan 12, 2016


Oldes added the Type.Unicode and Waiting for future labels on Sep 24, 2021