Fix for UTF-8 partials in function `ConhostConnection::_OutputThread`. #1850

german-one · 2019-07-06T20:39:22Z

Summary of the Pull Request

ConhostConnection::_OutputThread shall take care of partial UTF-8 characters generated while buffering the stream read.

References

This PR may partially fix the occurrence of � characters as seen on screenshots in the following issues:
#386 #455 #666

PR Checklist

Closes #xxx
CLA signed. If not, go over here and sign the CLA
Tests added/passed
Requires documentation to be updated
I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: Bad characters occasionally displayed when writing lots of identical UTF-8 lines #386

Detailed Description of the Pull Request / Additional comments

Code points are represented by a sequence of 1 to 4 bytes in UTF-8. Whenever UTF-8 text is buffered, code points that consist of multiple bytes may get split at the buffer boundaries. When the buffer is converted into a string of wchar_t, those partials are invalid input and MultiByteToWideChar replaces them with U+FFFD characters. The implementation needs to check whether or not the buffer ends with a partial character. If so, only convert the code points which are complete, and save the partial code units in a cache that gets prepended to the next chunk of text.
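As a rough illustration of the caching idea described above, here is a minimal, portable sketch. It is not the code from this PR: the names `TrailingPartialLength` and `PartialCache` are hypothetical, and it works on `std::string` chunks rather than on a pipe buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not from the PR): returns how many bytes at the end of
// `text` form an incomplete UTF-8 sequence, or 0 if the text ends on a code
// point boundary (or ends in something invalid, which is left to the converter).
inline std::size_t TrailingPartialLength(std::string_view text) noexcept
{
    // A partial is at most 3 bytes long: a 4-byte lead plus 2 continuation bytes.
    const std::size_t limit = text.size() < 3 ? text.size() : 3;
    for (std::size_t back = 1; back <= limit; ++back)
    {
        const unsigned char unit = static_cast<unsigned char>(text[text.size() - back]);
        if ((unit & 0b1000'0000) == 0)
        {
            return 0; // ASCII byte: nothing is pending
        }
        if ((unit & 0b1100'0000) == 0b1000'0000)
        {
            continue; // continuation byte: keep walking back to the lead byte
        }
        // Lead byte: determine the sequence length it announces.
        std::size_t expected = 0;
        if ((unit & 0b1110'0000) == 0b1100'0000) expected = 2;
        else if ((unit & 0b1111'0000) == 0b1110'0000) expected = 3;
        else if ((unit & 0b1111'1000) == 0b1111'0000) expected = 4;
        return back < expected ? back : 0;
    }
    return 0;
}

// Hypothetical cache: hands out only complete code points and keeps the
// trailing partial so it can be prepended to the next chunk.
class PartialCache
{
public:
    std::string Feed(std::string_view chunk)
    {
        std::string text = _pending;
        text.append(chunk);
        const std::size_t partial = TrailingPartialLength(text);
        assert(partial <= text.size());
        _pending.assign(text, text.size() - partial, partial);
        text.resize(text.size() - partial);
        return text;
    }

private:
    std::string _pending;
};
```

Feeding `"S\xF0\x90"` followed by `"\x8D\x88T"` returns `"S"` first and the complete four-byte code point plus `"T"` second, so the converter never sees a split sequence.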

Remark:
The PR includes the removal of the UTF-8 Byte Order Mark if it is present at the beginning of the stream read.

Validation Steps Performed

Corresponding file UTF8OutPipeReaderTests.cpp added to the Types.Unit.Tests project.

DHowett-MSFT

Thanks for working on this! I’m not sure about this approach. The partials detection is great, but this duplicates the reading and closing logic to support a stream that starts with a BOM. There is no scenario where a pseudoconsole connection (like ConhostConnection) will receive a BOM on its output pipe. If it does, that BOM was produced by the connected application and must be preserved.

Have you encountered an incoming BOM as the first three bytes on the output pipe?

german-one · 2019-07-07T00:14:17Z

Sorry I'm struggling with the update 😅 I'm still a GitHub newbie.

Have you encountered an incoming BOM as the first three bytes on the output pipe?

No. It was rather a guess. I removed the BOM check.

german-one · 2019-07-07T11:42:18Z

Found a misleading comment which was a leftover from one of my early tests.
If you still have any concerns please let me know.

DHowett-MSFT

I'm alright with this. It's more clever than my solution (using a static array of bitmasks instead of hardcoding each comparison), but it's largely the same. That brings me a bit of comfort. 😄

german-one · 2019-07-08T15:45:54Z

You gave detailed instructions. That made it difficult for me to fail entirely 😃

BTW, why not ask contributors for support with small tasks like that? It's one of the advantages of open source. Admittedly, I would have a hard time finding the position in the code where things need to be implemented or updated. That's where you are experienced. So if someone opens an issue and you already have a rough idea, then just leave a hint like "we have to have a look at this file or review this class or that function". I think this has the potential to save you some hours of coding and testing. And it would help to close issues even faster.

DHowett-MSFT · 2019-07-08T15:56:29Z

We’ve got the “Help-Wanted” tag for that! Not all of those issues have guidance or suggestions on how to proceed, of course, but we’re getting there. Thanks! 🙂

zadjii-msft

Overall this seems good, but I'm not sure that I know enough about UTF-8 to be the second signoff on it. Maybe @miniksa could give it a look?

zadjii-msft · 2019-07-08T17:13:48Z

src/cascadia/TerminalConnection/ConhostConnection.cpp

+                for (DWORD dwSequenceLen{ 1UL }, stop{ dwRead < 4UL ? dwRead : 4UL }; dwSequenceLen < stop; ++dwSequenceLen, --backIter)
+                {
+                    // If Lead Byte found
+                    if ((*backIter & 0b11000000) == 0b11000000)


Can we get some names for these bit masks? I'd be surprised if they're not somewhere else in the codebase already.

At least I'm not aware of any common names. Wikipedia description for the meaning of these bits.
But I'll search the code base for these values. Maybe you already defined names that could be re-used here. (Perhaps in an enum?)

terminal/src/host/utf8ToWideCharParser.cpp

Lines 15 to 20 in 9b92986

const byte NonAsciiBytePrefix = 0x80;

const byte ContinuationByteMask = 0xC0;
const byte ContinuationBytePrefix = 0x80;

const byte MostSignificantBitMask = 0x80;

Different names for the same value. Doesn't include all values used here.

terminal/src/host/ut_host/Utf8ToWideCharParserTests.cpp

Lines 296 to 298 in 9b92986

VERIFY_IS_TRUE(parser._IsLeadByte(0xC0)); // 2 byte sequence
VERIFY_IS_TRUE(parser._IsLeadByte(0xE0)); // 3 byte sequence
VERIFY_IS_TRUE(parser._IsLeadByte(0xF0)); // 4 byte sequence

Better. But names are only in the comments.

I'm personally okay without the names 😄

I can't agree more.
Yesterday I fiddled with an enum and member names like LeadByteTwoByteSequence. Then I used these names in the array and noticed that they are ambiguous. E.g. we use value 0b11100000 as bitmask in the AND operation and compare the result with 0b11000000. In the end each of these values would need two names. For example something like MaskLeadByteTwoByteSequence and IsLeadByteThreeByteSequence for value 0b11100000. Since we can't use two names for an array element, we would need a second array to disambiguate the element names. I don't like it because it's a mess and it doesn't improve the readability imo. When I tried it yesterday I actually confused myself...
In contrast, the binary literals are quite self-explanatory.

DHowett-MSFT · 2019-07-09T16:57:56Z

src/cascadia/TerminalConnection/ConhostConnection.cpp

@@ -234,16 +235,16 @@ namespace winrt::Microsoft::Terminal::TerminalConnection::implementation
            const BYTE* const endPtr{ buffer + dwRead };
            const BYTE* backIter{ endPtr - 1 };
            // If the last byte in the buffer was a byte belonging to a UTF-8 multi-byte character
-            if ((*backIter & 0b10000000) == 0b10000000)
+            if (WI_AreAllFlagsSet(*backIter & 0b10000000, 0b10000000))


This is actually pretty cool: WI_AreAllFlagsSet lets you skip the & 0b000 part. This would just be WI_AreAllFlagsSet(*backIter, 0b10000000)

You are right. That was dumb of me.
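To make the point about the redundant mask concrete, here is a small sketch. `AreAllFlagsSet` is a hypothetical stand-in with the meaning described in the comment above (all bits of `flags` are set in `value`); it is not the WIL macro itself, and `PreMaskIsRedundant` is likewise only an illustration.

```cpp
#include <cassert> // used by the check in the note below

// Stand-in for the semantics discussed above: true if every bit in `flags`
// is also set in `value`. Not the WIL macro, just the same idea.
constexpr bool AreAllFlagsSet(unsigned value, unsigned flags) noexcept
{
    return (value & flags) == flags;
}

// Compares the pre-masked form with the direct form for every byte value.
inline bool PreMaskIsRedundant() noexcept
{
    for (unsigned b = 0; b <= 0xFF; ++b)
    {
        const bool withPreMask = AreAllFlagsSet(b & 0b1000'0000, 0b1000'0000);
        const bool direct = AreAllFlagsSet(b, 0b1000'0000);
        if (withPreMask != direct)
        {
            return false;
        }
    }
    return true;
}
```

A check such as `assert(PreMaskIsRedundant());` holds because the helper already applies the mask internally, so masking the value beforehand changes nothing.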

DHowett-MSFT · 2019-07-09T17:02:37Z

src/cascadia/TerminalConnection/ConhostConnection.cpp

                    {
                        // If the Lead Byte indicates that the last bytes in the buffer is a partial UTF-8 code point then cache them
-                        if ((*backIter & bitmasks[dwSequenceLen]) != bitmasks[dwSequenceLen - 1])
+                        if (WI_IsAnyFlagClear(*backIter & bitmasks[dwSequenceLen], bitmasks[dwSequenceLen - 1]))


This one deserves a comment -- since you're masking with the sequence prefix and checking the bits from the next sequence prefix down, you may want to call that out specifically. It took me three reviews to figure out why you were doing this 😄

I'm assuming it's because 0b11100000 is the prefix for a 3-byte sequence, but for it to be a 3-byte introducer those three high bits must be 110 (not 111), and we're just using the off-by-one array index because those mask/check values are luckily adjacent?

I begin with dwSequenceLen = 1 which is the reason why index dwSequenceLen - 1 can't be off by one in the array.
Actually only the WI_IsAnyFlagClear(*backIter & 0b11000000, 0) in the first iteration is tricky to understand. I checked before that a lead byte was found (that is, the two highest bits are 1). WI_IsAnyFlagClear will always return true here and the byte is getting cached.
The rest behaves exactly as you said. We need 0b111'00000 as mask to compare the result with 0b110'00000 etc. And that's the reason why naming these bytes would be ambiguous as I explained in the conversation above.

@DHowett-MSFT Now that I read it twice I get this uncomfortable feeling. Should the first argument be the second and vice versa?

Hmm.

WI_IsAnyFlagClear(val, flags) Any bit specified by flags is not set in val.

I would prefer the direct equality comparison here (which you had before without WI_). You want to make sure that all bit values match within the mask.

Alright. I figured something is going wrong here. Thanks!
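The difference between the flag test and the direct comparison can be shown with two example lead bytes. This is only a sketch: `IsAnyFlagClear` is a hypothetical stand-in that follows the description quoted above (any bit specified by `flags` is not set in `val`), and `MaskedValueDiffers` is an illustrative name for the plain equality check.

```cpp
#include <cassert> // used by the checks in the note below

// Stand-in for the quoted semantics: true if any bit in `flags` is not set in `val`.
constexpr bool IsAnyFlagClear(unsigned val, unsigned flags) noexcept
{
    return (val & flags) != flags;
}

// Direct comparison: true if the masked bits are not exactly `expected`.
constexpr bool MaskedValueDiffers(unsigned val, unsigned mask, unsigned expected) noexcept
{
    return (val & mask) != expected;
}

// 0xC3 starts a 2-byte sequence (110x'xxxx), 0xE2 starts a 3-byte sequence (1110'xxxx).
constexpr unsigned twoByteLead = 0xC3;
constexpr unsigned threeByteLead = 0xE2;

// The equality check tells the two lead bytes apart.
static_assert(!MaskedValueDiffers(twoByteLead, 0b1110'0000, 0b1100'0000), "2-byte lead matches 110");
static_assert(MaskedValueDiffers(threeByteLead, 0b1110'0000, 0b1100'0000), "3-byte lead does not match 110");

// The flag test cannot: it only asks whether the 1-bits of 0b1100'0000 are set
// and ignores that the third bit must be 0 for a 2-byte lead.
static_assert(!IsAnyFlagClear(twoByteLead & 0b1110'0000, 0b1100'0000), "flags set for 2-byte lead");
static_assert(!IsAnyFlagClear(threeByteLead & 0b1110'0000, 0b1100'0000), "flags also set for 3-byte lead");
```

Because a UTF-8 prefix includes a required zero bit, a check that only looks at set bits gives the same answer for both bytes, which is why the equality comparison is the safer choice here.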

DHowett-MSFT · 2019-07-09T18:21:32Z

src/cascadia/TerminalConnection/ConhostConnection.cpp

@@ -4,7 +4,11 @@
 #include "pch.h"
 #include "ConhostConnection.h"
 #include "windows.h"
+#include "wil/common.h"


I think we already have wil/common, string_view, algorithm and type_traits in our precompiled header through pch.h -> LibraryIncludes.h!

I'm used to including the related headers directly. It's difficult for me to trace through other header files to figure out which ones are already included indirectly.
Before I push the next commit I should probably check if it compiles without string_view, algorithm, and type_traits included directly.

german-one · 2019-07-09T20:01:41Z

I hope I was able to address all change requests 🙂

DHowett-MSFT

I like this, but I need a second reviewer. 😄

german-one · 2019-07-14T00:15:28Z

Figuring out how to get the TAEF working was driving me crazy 🤪

DHowett-MSFT · 2019-07-14T00:17:26Z

@german-one Thanks for doing so much for this pull request! 😄 We really appreciate it.

miniksa

Just a few tips for the tests that I'd like you to review before I sign off. Otherwise, this is looking good.

I'm sorry TAEF and unit testing was trouble for you, but you've done an excellent job here at working with us, learning this, and helping us identify areas where we can improve our communication as well. I really appreciate your efforts here.

miniksa · 2019-07-15T19:47:49Z

src/types/ut_types/UTF8OutPipeReaderTests.cpp

+        if (threadHandle == nullptr)
+        {
+            CloseHandle(writeTo);
+            return static_cast<HRESULT>(-1L);


Generally we do something like "return E_FAIL;" instead of casting a -1 to HRESULT.

miniksa · 2019-07-15T19:48:42Z

src/types/ut_types/UTF8OutPipeReaderTests.cpp

+
+        ThreadData data{ writeTo, utf8TestString };
+
+        HANDLE threadHandle{ CreateThread(nullptr, 0, WritePipeThread, &data, 0, nullptr) }; // create a thread that writes to the pipe


If you hold threadHandle in a wil::unique_handle, then it will automatically CloseHandle on it when the object goes out of scope and you can simplify some of your early-return-on-failures below to things like

RETURN_HR_IF_NULL(E_FAIL, threadHandle.get());

miniksa · 2019-07-15T19:52:39Z

src/types/ut_types/UTF8OutPipeReaderTests.cpp

+        // Test 1:
+        //                                                           ||
+        utf8TestString.replace(bufferSize - 6, 12, "S\xF0\x90\x8D\x88TUVWXYZ");
+        if (SUCCEEDED(RunTest(utf8TestString)))


You could just do VERIFY_SUCCEEDED for these instead of storing a count of tests and successful tests.

You can call VERIFY multiple times per test. If any VERIFY fails, the whole test fails from TAEF's point of view.

VERIFY_SUCCEEDED(RunTest(utf8TestString));

german-one · 2019-07-15T20:45:24Z

german-one · 2019-07-15T20:57:58Z

I'm sorry TAEF and unit testing was trouble for you

Be assured that I fully understand that code needs proper testing and verification. I'm aware that the console and terminal will be released to billions of machines out there. I certainly don't want to be the culprit for causing trouble all around the world. 😉 So don't worry. I'm learning something new every day. Thanks for your patience.

german-one · 2019-07-16T00:13:55Z

I had skipped that detail. But now I think I addressed all of your requests.

miniksa

Excellent. Thank you very much! This looks good to merge.

#1850)

* Fix for UTF-8 partials in functions `ConhostConnection::_OutputThread` and `ApiRoutines::WriteConsoleOutputCharacterAImpl`

The implementation needs to check whether or not the buffer ends with a partial character. If so, only convert the code points which are complete, and save the partial code units in a cache that gets prepended to the next chunk of text.

* Utf8OutPipeReader class added
* Unit Test added
* use specific macros and WIL classes
* avoid possible deadlock caused by unclosed pipe handle

(cherry picked from commit fa5b9b0)

vblazhkun · 2019-07-23T19:45:42Z

It seems, there are still some issues left (blinking and improperly rendered characters):

DHowett-MSFT · 2019-07-23T23:30:37Z

Fortunately, that is not related to partial encoded utf-8 codepoints 😄

vblazhkun · 2019-07-23T23:33:21Z

@DHowett-MSFT Shall I open a new issue then?

DHowett-MSFT · 2019-07-23T23:35:45Z

Only if you can express exactly why those characters are "improperly rendered." Most of the people on my team don't know what they're supposed to look like.

HOWEVER: There's a chance it's already covered by a bunch of existing issues. We have known deficiencies in rendering Thai text, in rendering things with composing characters, in rendering things where there's a mismatch in number of cells to number of codepoints (we only support one codepoint -> 1 or 2 cells right now)

vblazhkun · 2019-07-23T23:45:22Z

So, I assume, the blinking is Okay? ;)

Here is the Notepad rendering:

DHowett-MSFT · 2019-07-23T23:46:44Z

Nah I mean: If your report is "they don't look right," we need to know why. If your report is "they're blinking and they absolutely should not be," that's completely and totally reasonable 😄

german-one · 2019-07-28T11:43:29Z

@DHowett-MSFT

We have known deficiencies [...] in rendering things with composing characters [...]

Now that I read that, I think solving this problem should also be a task for UTF8OutPipeReader::Read. Something like "if the last complete code point was not a Combining Diacritical Mark (U+0300..036F, U+1AB0..1AFF, U+1DC0..1DFF, U+20D0..20FF), put it in the cache, too". That's because we don't know if the partial or the upcoming character is a combining character.
(I just hope that winrt::to_hstring would make precomposed wide characters out of them afterwards.)

Do you think I should try to update the code accordingly?
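The range check proposed in this comment could be sketched as follows. The function name `IsCombiningDiacriticalMark` is hypothetical and the ranges are exactly the four listed above; this is only an illustration of the test, not code from the PR, and it does not cover combining characters outside those blocks.

```cpp
#include <cassert> // used by the checks in the note below

// Hypothetical helper: true if `codePoint` lies in one of the four
// Combining Diacritical Marks ranges named in the comment above.
constexpr bool IsCombiningDiacriticalMark(char32_t codePoint) noexcept
{
    return (codePoint >= 0x0300 && codePoint <= 0x036F) ||
           (codePoint >= 0x1AB0 && codePoint <= 0x1AFF) ||
           (codePoint >= 0x1DC0 && codePoint <= 0x1DFF) ||
           (codePoint >= 0x20D0 && codePoint <= 0x20FF);
}
```

For example, `assert(IsCombiningDiacriticalMark(U'\u0301'));` holds for the combining acute accent, while `IsCombiningDiacriticalMark(U'A')` is false.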

DHowett-MSFT · 2019-07-28T17:46:09Z

@german-one I'd hold off on that, actually. We shouldn't do this just yet, because we need to make sure that conhost (the console host, which doesn't use WinRT at all; it also serves as the API server for the legacy Win32 console APIs) knows to store combining characters and grapheme clusters in the buffer. If we just handle them at the Windows Terminal layer there will be a mismatch/miscount in the number of cells on the screen.

german-one · 2019-07-28T23:33:05Z

Well, I already successfully tried to keep combining characters complete. And I think that, regardless of the behavior of conhost, combining characters must not be split. Correct me if I'm wrong, but won't that still be the task for the Windows Terminal if it reads the pipe?

FWIW winrt::to_hstring doesn't make precomposed wide characters out of them. Thus, forwarding only complete combining characters still doesn't solve anything. But it might be a precondition for further improvements.
Just get back to me whenever you think the time has come ...

ghost · 2019-08-03T01:48:28Z

🎉 Windows Terminal Preview v0.3.2142.0 has been released which incorporates this pull request. 🎉

german-one added 5 commits June 30, 2019 13:46

Cache UTF-8 partials of ConhostConnection output pipe

ea46579

Revert "Cache UTF-8 partials of ConhostConnection output pipe"

7c83c38

This reverts commit ea46579.

Revert "Fix for UTF-8 partials in functions `ConhostConnection::_Outp…

a424a16

…utThread` and `ApiRoutines::WriteConsoleOutputCharacterAImpl`" This reverts commit 7a4a814.

DHowett-MSFT suggested changes Jul 6, 2019

ghost added the Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something label Jul 6, 2019

german-one added 2 commits July 7, 2019 01:23

Fix for UTF-8 partials in function ConhostConnection::_OutputThread

d0e6e82

ghost removed the Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something label Jul 6, 2019

Fix for UTF-8 partials in function ConhostConnection::_OutputThread

d98c589

Fix for UTF-8 partials in function `ApiRoutines::WriteConsoleOutputCh…

7df6609

…aracterAImpl`

Fix for UTF-8 partials in function ConhostConnection::_OutputThread

507f526

DHowett-MSFT approved these changes Jul 8, 2019

DHowett-MSFT requested a review from miniksa July 8, 2019 03:34

zadjii-msft reviewed Jul 8, 2019

Fix for UTF-8 partials in function ConhostConnection::_OutputThread

7d8c4b1

DHowett-MSFT suggested changes Jul 9, 2019

ghost added Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something and removed Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something labels Jul 9, 2019

Fix for UTF-8 partials in function ConhostConnection::_OutputThread

f08f31c

DHowett-MSFT reviewed Jul 9, 2019

german-one added 2 commits July 9, 2019 20:44

Fix for UTF-8 partials in function ConhostConnection::_OutputThread

bb299d7

Fix for UTF-8 partials in function ConhostConnection::_OutputThread

c395f3b

DHowett-MSFT approved these changes Jul 9, 2019

Utf8OutPipeReader class added

e35a78f

Unit Test added

31db29b

german-one mentioned this pull request Jul 14, 2019

Docs should be updated to address how to use TAEF for the development of unit tests #1962

Closed

miniksa requested changes Jul 15, 2019

ghost added the Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something label Jul 15, 2019

ghost removed the Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something label Jul 15, 2019

use specific macros and WIL classes

d09ac87

avoid possible deadlock caused by unclosed pipe handle

f23bcd9

miniksa approved these changes Jul 16, 2019

miniksa merged commit fa5b9b0 into microsoft:master Jul 16, 2019

DHowett-MSFT mentioned this pull request Jul 17, 2019

Unicode box drawing rendering issues #1991

Closed

german-one mentioned this pull request Aug 3, 2019

Correctly process composite characters in Terminal #2228

Closed

5 tasks

zadjii-msft mentioned this pull request Jan 4, 2022

Bad characters occasionally displayed when writing lots of identical UTF-8 lines #386

Closed
