Add UTF-16 output capabilities #291
When set to Unicode, the output conversion function can now handle characters outside the Basic Multilingual Plane, using UTF-16. The function detects whether a value passed in is a high surrogate and saves it in a static variable. If the next value is a valid low surrogate, the function returns the corresponding Unicode character; otherwise it discards the input.
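As a rough illustration of the surrogate handling described here, the sketch below keeps a pending high surrogate between calls and combines it with the low surrogate that follows. The names `makeUtf16Output`, `convert`, and `pendingHigh` are hypothetical and not taken from the patch; the closure variable stands in for the static variable, and the choice to drop a value that follows an unpaired high surrogate is one reading of "discard the input".

```javascript
// Hypothetical sketch, not the code from this pull request.
function makeUtf16Output() {
  // Plays the role of the "static variable" mentioned in the description.
  let pendingHigh = null;

  return function convert(word) {
    const unit = word & 0xFFFF; // treat the 16-bit word as unsigned

    // High surrogate (0xD800-0xDBFF): remember it and emit nothing yet.
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      pendingHigh = unit;
      return "";
    }

    // Low surrogate (0xDC00-0xDFFF): only valid right after a high surrogate.
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
      if (pendingHigh === null) {
        return ""; // stray low surrogate, discarded
      }
      const codePoint =
        0x10000 + ((pendingHigh - 0xD800) << 10) + (unit - 0xDC00);
      pendingHigh = null;
      return String.fromCodePoint(codePoint);
    }

    // Any other value: a BMP character, unless it interrupts a pending pair,
    // in which case this sketch assumes the input is discarded.
    const interruptedPair = pendingHigh !== null;
    pendingHigh = null;
    return interruptedPair ? "" : String.fromCharCode(unit);
  };
}
```

Feeding `0xD83D` and then `0xDE42` to the returned function yields an empty string followed by U+1F642 (🙂).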
MARIE uses 16-bit two's complement integers. When the output is set to BIN, the original conversion routine assumes an unsigned integer, so negative 16-bit two's complement numbers are displayed as zeros. This patch adds signed-integer conversion and grouping functions so that negative numbers produce the correct bit pattern.
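One way to get the two's complement bit pattern is to mask the value to 16 bits before converting it to base 2. The sketch below is only an assumption about how such a conversion and grouping could look; `toBinary16` and its `groupSize` parameter are hypothetical names, not the functions added by the patch.

```javascript
// Hypothetical sketch, not the code from this pull request.
function toBinary16(value, groupSize = 4) {
  // Masking with 0xFFFF maps a negative number to its unsigned 16-bit
  // two's complement equivalent, e.g. -1 becomes 0xFFFF.
  const bits = (value & 0xFFFF).toString(2).padStart(16, "0");

  // Split the 16 bits into groups for readability.
  const groups = [];
  for (let i = 0; i < bits.length; i += groupSize) {
    groups.push(bits.slice(i, i + groupSize));
  }
  return groups.join(" ");
}

console.log(toBinary16(-2)); // 1111 1111 1111 1110
console.log(toBinary16(5));  // 0000 0000 0000 0101
```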
Nice work! I don't know much about character encoding, but I hope that my review will improve the way students understand how UTF-16 works.
Also, you may want to relabel "UNICODE" to "UNICODE (UTF-16)" for the output mode in src/templates/index.ejs.
One question: how would you enter high and low surrogate values as input? Would students have to type in their hexadecimal representation to store the value in memory?
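For anyone who needs the hexadecimal words for a given character, the standard UTF-16 encoding rule can be sketched as below. This is only a helper for working out the values by hand; `toSurrogateHex` is a hypothetical name and is not part of MARIE.js.

```javascript
// Hypothetical helper, not part of MARIE.js.
function toSurrogateHex(codePoint) {
  const hex = (n) => n.toString(16).toUpperCase().padStart(4, "0");

  if (codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("not a Unicode code point");
  }

  // Characters inside the Basic Multilingual Plane fit in one 16-bit word.
  if (codePoint < 0x10000) {
    return [hex(codePoint)];
  }

  // Otherwise split the 20-bit offset into a high and a low surrogate.
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [hex(high), hex(low)];
}

console.log(toSurrogateHex(0x1F642)); // [ 'D83D', 'DE42' ]
console.log(toSurrogateHex(0xA9));    // [ '00A9' ]
```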
I've now fully implemented UTF-16BE (big-endian) and extensively commented the code. I've also added an example that outputs "Hello World!" and some other Unicode characters. The example also shows the handling (ignoring) of byte order marks. The code should be much cleaner and more readable now.
This patch adds full support for Unicode in the UTF-16BE (big-endian) encoding. It ignores byte order marks and incorrect surrogate sequences. It also adds a Unicode program to the examples list, which prints "Hello World", a couple of emojis outside the Basic Multilingual Plane, and a copyright sign as an example of a character inside the BMP.
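To make the "ignores byte order marks and incorrect surrogate sequences" behaviour concrete, here is a small batch decoder over a list of 16-bit words. It is a sketch under assumptions: `decodeUtf16Words` is a hypothetical name, only `0xFEFF` is treated as a byte order mark, and a value that follows an unpaired high surrogate is dropped along with it.

```javascript
// Hypothetical sketch, not the code from this pull request.
function decodeUtf16Words(words) {
  let out = "";
  let high = null; // pending high surrogate, if any

  for (const word of words) {
    const unit = word & 0xFFFF;

    if (unit === 0xFEFF) {            // byte order mark: ignored
      high = null;
    } else if (unit >= 0xD800 && unit <= 0xDBFF) {
      high = unit;                    // wait for the matching low surrogate
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      if (high !== null) {
        out += String.fromCodePoint(
          0x10000 + ((high - 0xD800) << 10) + (unit - 0xDC00)
        );
      }                               // a stray low surrogate is dropped
      high = null;
    } else if (high !== null) {
      high = null;                    // broken pair: drop this value too
    } else {
      out += String.fromCharCode(unit);
    }
  }
  return out;
}

// BOM, "Hi", copyright sign, a surrogate pair, then a stray low surrogate.
console.log(decodeUtf16Words([0xFEFF, 0x48, 0x69, 0xA9, 0xD83D, 0xDE42, 0xDE42]));
```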
976a91a to a0547f4
Looks good to me. Good job on making a Unicode example demonstrating the new feature!
Referring to issue #288:
When set to Unicode, the output conversion function now can handle
characters outside the Basic Multilingual Plane, using UTF-16.
The output conversion function will detect whether a value passed in
is a high surrogate value and save it in a static variable.
If the next character is a correct low surrogate, the function will
return the correct Unicode character; otherwise it will discard the
input.
The following code should output a smiley 🙂: