-
Notifications
You must be signed in to change notification settings - Fork 8.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for UTF-8 partials in function ConhostConnection::_OutputThread
.
#1850
Conversation
This reverts commit ea46579.
…` and `ApiRoutines::WriteConsoleOutputCharacterAImpl` The implementation needs to check whether or not the buffer ends with a partial character. If so, only convert the code points which are complete, and save the partial code units in a cache that gets prepended to the next chunk of text.
…utThread` and `ApiRoutines::WriteConsoleOutputCharacterAImpl`" This reverts commit 7a4a814.
The implementation needs to check whether or not the buffer ends with a partial character. If so, only convert the code points which are complete, and save the partial code units in a cache that gets prepended to the next chunk of text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! I’m not sure about this approach. The partials detection is great, but this duplicates the reading and closing logic to support a stream that starts with a BOM. There is no scenario where a pseudoconsole connection (like ConhostConnection) will receive a BOM on its output pipe. If it does, that BOM was produced by the connected application and must be preserved.
Have you encountered an incoming BOM as the first three bytes on the output pipe?
The implementation needs to check whether or not the buffer ends with a partial character. If so, only convert the code points which are complete, and save the partial code units in a cache that gets prepended to the next chunk of text.
Sorry I'm struggling with the update 😅 I'm still a GitHub newbie.
No. It was rather a guess. I removed the BOM check. |
Found a misleading comment which was a leftover from one of my early tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm alright with this. It's more clever than my solution (using a static array of bitmasks instead of hardcoding each comparison), but it's largely the same. That brings me a bit of comfort. 😄
You gave detailed instrunctions. That made it difficult for me to fail entirely 😃 BTW Why not asking contributors for support for small tasks like that. It's one of the advantages of open source. Although I would have a hard time to find the position in the code where to implement or update things. That's where you are experienced. So if someone opens an issue and you already have a rough idea then just leave a hint like "we have to have a look at this file or review this class or that function". I think this has the potential to save you some hours of coding and testing. And it would help to close issues even faster. |
We’ve got the “Help-Wanted” tag for that! Not all of those issues have guidance or suggestions on how to proceed, of course, but we’re getting there. Thanks! 🙂 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this seems good, but I'm not sure that I know enough about UTF-8 to be the second signoff on it. Maybe @miniksa could give it a look?
for (DWORD dwSequenceLen{ 1UL }, stop{ dwRead < 4UL ? dwRead : 4UL }; dwSequenceLen < stop; ++dwSequenceLen, --backIter) | ||
{ | ||
// If Lead Byte found | ||
if ((*backIter & 0b11000000) == 0b11000000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get some names for these bit masks? I'd be surprised if they're not somewhere else in the codebase already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least I'm not aware of any common names. Wikipedia description for the meaning of these bits.
But I'll search the code base for these values. Maybe you already defined names that could be re-used here. (Perhaps in an enum
?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
terminal/src/host/utf8ToWideCharParser.cpp
Lines 15 to 20 in 9b92986
const byte NonAsciiBytePrefix = 0x80; | |
const byte ContinuationByteMask = 0xC0; | |
const byte ContinuationBytePrefix = 0x80; | |
const byte MostSignificantBitMask = 0x80; |
Different names for the same value. Doesn't include all values used here.
terminal/src/host/ut_host/Utf8ToWideCharParserTests.cpp
Lines 296 to 298 in 9b92986
VERIFY_IS_TRUE(parser._IsLeadByte(0xC0)); // 2 byte sequence | |
VERIFY_IS_TRUE(parser._IsLeadByte(0xE0)); // 3 byte sequence | |
VERIFY_IS_TRUE(parser._IsLeadByte(0xF0)); // 4 byte sequence |
Better. But names are only in the comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm personally okay without the names 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't agree more.
Yesterday I fiddled with an enum
and member names like LeadByteTwoByteSequence
. Then I used these names in the array and noticed that they are ambiguous. E.g. we use value 0b11100000
as bitmask in the AND operation and compare the result with 0b11000000
. In the end each of these values would need two names. For example something like MaskLeadByteTwoByteSequence
and IsLeadByteThreeByteSequence
for value 0b11100000
. Since we can't use two names for an array element, we would need a second array to disambiguate the element names. I don't like it because it's a mess and it doesn't improve the readability imo. When I tried it yesterday I actually confused myself...
In contrast, the binary literals are quite self-explanatory.
@@ -234,16 +235,16 @@ namespace winrt::Microsoft::Terminal::TerminalConnection::implementation | |||
const BYTE* const endPtr{ buffer + dwRead }; | |||
const BYTE* backIter{ endPtr - 1 }; | |||
// If the last byte in the buffer was a byte belonging to a UTF-8 multi-byte character | |||
if ((*backIter & 0b10000000) == 0b10000000) | |||
if (WI_AreAllFlagsSet(*backIter & 0b10000000, 0b10000000)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually pretty cool: WI_AreAllFlagsSet
lets you skip the & 0b000
part. This would just be WI_AreAllFlagsSet(*backIter, 0b10000000)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. That was dumb of me.
{ | ||
// If the Lead Byte indicates that the last bytes in the buffer is a partial UTF-8 code point then cache them | ||
if ((*backIter & bitmasks[dwSequenceLen]) != bitmasks[dwSequenceLen - 1]) | ||
if (WI_IsAnyFlagClear(*backIter & bitmasks[dwSequenceLen], bitmasks[dwSequenceLen - 1])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one deserves a comment -- since you're masking with the sequence prefix and checking the bits from the next sequence prefix down, you may want to call that out specifically. It took me three reviews to figure out why you were doing this 😄
I'm assuming it's because 0b11100000
is the prefix for a 3-byte sequence, but for it to be a 3-byte introducer those three high bits must be 110
(not 111
), and we're just using the off-by-one array index because those mask/check values are luckily adjacent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I begin with dwSequenceLen = 1
which is the reason why index dwSequenceLen - 1
can't be off by one in the array.
Actually only the WI_IsAnyFlagClear(*backIter & 0b11000000, 0)
in the first iteration is tricky to understand. I checked before that a lead byte was found (that is, the two highest bits are 1). WI_IsAnyFlagClear
will always return true
here and the byte is getting cached.
The rest behaves exactly as you said. We need 0b111'00000
as mask to compare the result with 0b110'00000
etc. And that's the reason why naming these bytes would be ambiguous as I explained in the conversation above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DHowett-MSFT Now that I read it twice I get this uncomfortable feeling. Should the first argument the second and vice versa?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm.
WI_IsAnyFlagClear(val, flags)
Any bit specified by flags
is not set in val
.
I would prefer the direct equality comparison here (which you had before without WI_
). You want to make sure that all bit values match within the mask.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright. I figured something is going wrong here. Thanks!
@@ -4,7 +4,11 @@ | |||
#include "pch.h" | |||
#include "ConhostConnection.h" | |||
#include "windows.h" | |||
#include "wil/common.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we already have wil/common, string_view, algorithm and type_traits in our precompiled header through pch.h
-> LibraryIncludes.h
!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm used to including the related headers. It's difficult for me to find the way across other header files to figure out which have been already included over detours.
Before I push the next commit I should probably check if it compiles without string_view, algorithm, and type_traits included directly.
I hope I was able to address all change requests 🙂 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this, but I need a second reviewer. 😄
@german-one Thanks for doing so much for this pull request! 😄 We really appreciate it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few tips for the tests that I'd like you to review before I sign off. Otherwise, this is looking good.
I'm sorry TAEF and unit testing was trouble for you, but you've done an excellent job here at working with us, learning this, and helping us identify areas where we can improve our communication as well. I really appreciate your efforts here.
if (threadHandle == nullptr) | ||
{ | ||
CloseHandle(writeTo); | ||
return static_cast<HRESULT>(-1L); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally we do something like "return E_FAIL;" instead of casting a -1 to HRESULT.
|
||
ThreadData data{ writeTo, utf8TestString }; | ||
|
||
HANDLE threadHandle{ CreateThread(nullptr, 0, WritePipeThread, &data, 0, nullptr) }; // create a thread that writes to the pipe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you hold threadHandle in a wil::unique_handle, then it will automatically CloseHandle on it when the object goes out of scope and you can simplify some of your early-return-on-failures below to things like
RETURN_HR_IF_NULL(E_FAIL, threadHandle.get());
// Test 1: | ||
// || | ||
utf8TestString.replace(bufferSize - 6, 12, "S\xF0\x90\x8D\x88TUVWXYZ"); | ||
if (SUCCEEDED(RunTest(utf8TestString))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could just do VERIFY_SUCCEEDED
for these instead of storing a count of tests and successful tests.
You can call VERIFY multiple times per test. If any VERIFY fails, the whole test fails from TAEF's point of view.
VERIFY_SUCCEEDED(RunTest(utf8TestString));
Be assured that I fully understand that code needs proper testing and verification. I'm aware that the console and terminal will be released to billions of machines out there. I certainly don't want to be the culprit for causing trouble all around the world. 😉 So don't worry. I'm learning something new every day. Thanks four your patience. |
I had skipped that detail. But now I think I addressed all of your requests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent. Thank you very much! This looks good to merge.
#1850) * Fix for UTF-8 partials in functions `ConhostConnection::_OutputThread` and `ApiRoutines::WriteConsoleOutputCharacterAImpl` The implementation needs to check whether or not the buffer ends with a partial character. If so, only convert the code points which are complete, and save the partial code units in a cache that gets prepended to the next chunk of text. * Utf8OutPipeReader class added * Unit Test added * use specific macros and WIL classes * avoid possible deadlock caused by unclosed pipe handle (cherry picked from commit fa5b9b0)
microsoft#1850) * Fix for UTF-8 partials in functions `ConhostConnection::_OutputThread` and `ApiRoutines::WriteConsoleOutputCharacterAImpl` The implementation needs to check whether or not the buffer ends with a partial character. If so, only convert the code points which are complete, and save the partial code units in a cache that gets prepended to the next chunk of text. * Utf8OutPipeReader class added * Unit Test added * use specific macros and WIL classes * avoid possible deadlock caused by unclosed pipe handle
Fortunately, that is not related to partial encoded utf-8 codepoints 😄 |
@DHowett-MSFT Shall I open a new issue then? |
Only if you can express exactly why those characters are "improperly rendered." Most of the people on my team don't know what they're supposed to look like. HOWEVER: There's a chance it's already covered by a bunch of existing issues. We have known deficiencies in rendering Thai text, in rendering things with composing characters, in rendering things where there's a mismatch in number of cells to number of codepoints (we only support one codepoint -> 1 or 2 cells right now) |
Nah I mean: If your report is "they don't look right," we need to know why. If your report is "they're blinking and they absolutely should not be," that's completely and totally reasonable 😄 |
Now that I read that, solving this problem should be also a task for Do you think I should try to update the code accordingly? |
@german-one I'd hold off on that, actually. We shouldn't do this just yet, because we need to make sure that conhost (the console host, which doesn't use WinRT at all; it also serves as the API server for the legacy Win32 console APIs) knows to store combining characters and grapheme clusters in the buffer. If we just handle them at the Windows Terminal layer there will be a mismatch/miscount in the number of cells on the screen. |
Well, I already successfully tried to keep combining characters complete. And I think that, regardless of the behavior of conhost, combining characters must not be split. Correct me if I'm wrong, but won't that still be the task for the Windows Terminal if it reads the pipe? FWIW |
🎉 Handy links: |
Summary of the Pull Request
ConhostConnection::_OutputThread
shall take care of partial UTF-8 characters generated while buffering the stream read.References
This PR may partially fix the occurrence of � characters as seen on screenshots in the following issues:
#386 #455 #666
PR Checklist
Detailed Description of the Pull Request / Additional comments
Code points are represented by a sequence of 1 to 4 bytes in UTF-8. Whenever UTF-8 text is getting buffered, code points that consist of multiple bytes may get split at the buffer boundaries. The buffer gets converted into a string of wchar_t where those partials are invalid and make
MultiByteToWideChar
replacing them with U+FFFD characters. The implementation needs to check whether or not the buffer ends with a partial character. If so, only convert the code points which are complete, and save the partial code units in a cache that gets prepended to the next chunk of text.Remark:The PR includes the removal of the UTF-8 Byte Order Mark if it is present at the beginning of the stream read.
Validation Steps Performed
Corresponding file
UTF8OutPipeReaderTests.cpp
added to theTypes.Unit.Tests
project.