Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for UTF-8 partials in function ConhostConnection::_OutputThread. #1850

Merged
merged 21 commits into from
Jul 16, 2019
Merged

Fix for UTF-8 partials in function ConhostConnection::_OutputThread. #1850

merged 21 commits into from
Jul 16, 2019

Conversation

german-one
Copy link
Contributor

@german-one german-one commented Jul 6, 2019

Summary of the Pull Request

ConhostConnection::_OutputThread shall take care of partial UTF-8 characters generated while buffering the stream read.

References

This PR may partially fix the occurrence of � characters as seen on screenshots in the following issues:
#386 #455 #666

PR Checklist

Detailed Description of the Pull Request / Additional comments

Code points are represented by a sequence of 1 to 4 bytes in UTF-8. Whenever UTF-8 text is getting buffered, code points that consist of multiple bytes may get split at the buffer boundaries. The buffer gets converted into a string of wchar_t where those partials are invalid and make MultiByteToWideChar replacing them with U+FFFD characters. The implementation needs to check whether or not the buffer ends with a partial character. If so, only convert the code points which are complete, and save the partial code units in a cache that gets prepended to the next chunk of text.

Remark:
The PR includes the removal of the UTF-8 Byte Order Mark if it is present at the beginning of the stream read.

Validation Steps Performed

Corresponding file UTF8OutPipeReaderTests.cpp added to the Types.Unit.Tests project.

…` and `ApiRoutines::WriteConsoleOutputCharacterAImpl`

The implementation needs to check whether or not the buffer ends with a partial character. If so, only convert the code points which are complete, and save the partial code units in a cache that gets prepended to the next chunk of text.
…utThread` and `ApiRoutines::WriteConsoleOutputCharacterAImpl`"

This reverts commit 7a4a814.
The implementation needs to check whether or not the buffer ends with a partial character. If so, only convert the code points which are complete, and save the partial code units in a cache that gets prepended to the next chunk of text.
Copy link
Contributor

@DHowett-MSFT DHowett-MSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! I’m not sure about this approach. The partials detection is great, but this duplicates the reading and closing logic to support a stream that starts with a BOM. There is no scenario where a pseudoconsole connection (like ConhostConnection) will receive a BOM on its output pipe. If it does, that BOM was produced by the connected application and must be preserved.

Have you encountered an incoming BOM as the first three bytes on the output pipe?

@ghost ghost added the Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something label Jul 6, 2019
The implementation needs to check whether or not the buffer ends with a partial character. If so, only convert the code points which are complete, and save the partial code units in a cache that gets prepended to the next chunk of text.
@ghost ghost removed the Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something label Jul 6, 2019
@german-one
Copy link
Contributor Author

Sorry I'm struggling with the update 😅 I'm still a GitHub newbie.

Have you encountered an incoming BOM as the first three bytes on the output pipe?

No. It was rather a guess. I removed the BOM check.

@german-one
Copy link
Contributor Author

Found a misleading comment which was a leftover from one of my early tests.
If you still have any concerns please let me know.

Copy link
Contributor

@DHowett-MSFT DHowett-MSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm alright with this. It's more clever than my solution (using a static array of bitmasks instead of hardcoding each comparison), but it's largely the same. That brings me a bit of comfort. 😄

@DHowett-MSFT DHowett-MSFT requested a review from miniksa July 8, 2019 03:34
@german-one
Copy link
Contributor Author

You gave detailed instrunctions. That made it difficult for me to fail entirely 😃

BTW Why not asking contributors for support for small tasks like that. It's one of the advantages of open source. Although I would have a hard time to find the position in the code where to implement or update things. That's where you are experienced. So if someone opens an issue and you already have a rough idea then just leave a hint like "we have to have a look at this file or review this class or that function". I think this has the potential to save you some hours of coding and testing. And it would help to close issues even faster.

@DHowett-MSFT
Copy link
Contributor

We’ve got the “Help-Wanted” tag for that! Not all of those issues have guidance or suggestions on how to proceed, of course, but we’re getting there. Thanks! 🙂

Copy link
Member

@zadjii-msft zadjii-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this seems good, but I'm not sure that I know enough about UTF-8 to be the second signoff on it. Maybe @miniksa could give it a look?

for (DWORD dwSequenceLen{ 1UL }, stop{ dwRead < 4UL ? dwRead : 4UL }; dwSequenceLen < stop; ++dwSequenceLen, --backIter)
{
// If Lead Byte found
if ((*backIter & 0b11000000) == 0b11000000)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get some names for these bit masks? I'd be surprised if they're not somewhere else in the codebase already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least I'm not aware of any common names. Wikipedia description for the meaning of these bits.
But I'll search the code base for these values. Maybe you already defined names that could be re-used here. (Perhaps in an enum?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const byte NonAsciiBytePrefix = 0x80;
const byte ContinuationByteMask = 0xC0;
const byte ContinuationBytePrefix = 0x80;
const byte MostSignificantBitMask = 0x80;

Different names for the same value. Doesn't include all values used here.

VERIFY_IS_TRUE(parser._IsLeadByte(0xC0)); // 2 byte sequence
VERIFY_IS_TRUE(parser._IsLeadByte(0xE0)); // 3 byte sequence
VERIFY_IS_TRUE(parser._IsLeadByte(0xF0)); // 4 byte sequence

Better. But names are only in the comments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm personally okay without the names 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't agree more.
Yesterday I fiddled with an enum and member names like LeadByteTwoByteSequence. Then I used these names in the array and noticed that they are ambiguous. E.g. we use value 0b11100000 as bitmask in the AND operation and compare the result with 0b11000000. In the end each of these values would need two names. For example something like MaskLeadByteTwoByteSequence and IsLeadByteThreeByteSequence for value 0b11100000. Since we can't use two names for an array element, we would need a second array to disambiguate the element names. I don't like it because it's a mess and it doesn't improve the readability imo. When I tried it yesterday I actually confused myself...
In contrast, the binary literals are quite self-explanatory.

@@ -234,16 +235,16 @@ namespace winrt::Microsoft::Terminal::TerminalConnection::implementation
const BYTE* const endPtr{ buffer + dwRead };
const BYTE* backIter{ endPtr - 1 };
// If the last byte in the buffer was a byte belonging to a UTF-8 multi-byte character
if ((*backIter & 0b10000000) == 0b10000000)
if (WI_AreAllFlagsSet(*backIter & 0b10000000, 0b10000000))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually pretty cool: WI_AreAllFlagsSet lets you skip the & 0b000 part. This would just be WI_AreAllFlagsSet(*backIter, 0b10000000)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. That was dumb of me.

{
// If the Lead Byte indicates that the last bytes in the buffer is a partial UTF-8 code point then cache them
if ((*backIter & bitmasks[dwSequenceLen]) != bitmasks[dwSequenceLen - 1])
if (WI_IsAnyFlagClear(*backIter & bitmasks[dwSequenceLen], bitmasks[dwSequenceLen - 1]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one deserves a comment -- since you're masking with the sequence prefix and checking the bits from the next sequence prefix down, you may want to call that out specifically. It took me three reviews to figure out why you were doing this 😄

I'm assuming it's because 0b11100000 is the prefix for a 3-byte sequence, but for it to be a 3-byte introducer those three high bits must be 110 (not 111), and we're just using the off-by-one array index because those mask/check values are luckily adjacent?

Copy link
Contributor Author

@german-one german-one Jul 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I begin with dwSequenceLen = 1 which is the reason why index dwSequenceLen - 1 can't be off by one in the array.
Actually only the WI_IsAnyFlagClear(*backIter & 0b11000000, 0) in the first iteration is tricky to understand. I checked before that a lead byte was found (that is, the two highest bits are 1). WI_IsAnyFlagClear will always return true here and the byte is getting cached.
The rest behaves exactly as you said. We need 0b111'00000 as mask to compare the result with 0b110'00000 etc. And that's the reason why naming these bytes would be ambiguous as I explained in the conversation above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DHowett-MSFT Now that I read it twice I get this uncomfortable feeling. Should the first argument the second and vice versa?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm.

WI_IsAnyFlagClear(val, flags) Any bit specified by flags is not set in val.

I would prefer the direct equality comparison here (which you had before without WI_). You want to make sure that all bit values match within the mask.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright. I figured something is going wrong here. Thanks!

@ghost ghost added Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something and removed Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something labels Jul 9, 2019
@@ -4,7 +4,11 @@
#include "pch.h"
#include "ConhostConnection.h"
#include "windows.h"
#include "wil/common.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we already have wil/common, string_view, algorithm and type_traits in our precompiled header through pch.h -> LibraryIncludes.h!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm used to including the related headers. It's difficult for me to find the way across other header files to figure out which have been already included over detours.
Before I push the next commit I should probably check if it compiles without string_view, algorithm, and type_traits included directly.

@german-one
Copy link
Contributor Author

I hope I was able to address all change requests 🙂

Copy link
Contributor

@DHowett-MSFT DHowett-MSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this, but I need a second reviewer. 😄

@german-one
Copy link
Contributor Author

Figuring out how to get the TAEF working was driving me crazy 🤪

image

@DHowett-MSFT
Copy link
Contributor

@german-one Thanks for doing so much for this pull request! 😄 We really appreciate it.

Copy link
Member

@miniksa miniksa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few tips for the tests that I'd like you to review before I sign off. Otherwise, this is looking good.

I'm sorry TAEF and unit testing was trouble for you, but you've done an excellent job here at working with us, learning this, and helping us identify areas where we can improve our communication as well. I really appreciate your efforts here.

if (threadHandle == nullptr)
{
CloseHandle(writeTo);
return static_cast<HRESULT>(-1L);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally we do something like "return E_FAIL;" instead of casting a -1 to HRESULT.


ThreadData data{ writeTo, utf8TestString };

HANDLE threadHandle{ CreateThread(nullptr, 0, WritePipeThread, &data, 0, nullptr) }; // create a thread that writes to the pipe
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you hold threadHandle in a wil::unique_handle, then it will automatically CloseHandle on it when the object goes out of scope and you can simplify some of your early-return-on-failures below to things like

RETURN_HR_IF_NULL(E_FAIL, threadHandle.get());

// Test 1:
// ||
utf8TestString.replace(bufferSize - 6, 12, "S\xF0\x90\x8D\x88TUVWXYZ");
if (SUCCEEDED(RunTest(utf8TestString)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could just do VERIFY_SUCCEEDED for these instead of storing a count of tests and successful tests.

You can call VERIFY multiple times per test. If any VERIFY fails, the whole test fails from TAEF's point of view.

        VERIFY_SUCCEEDED(RunTest(utf8TestString));

@ghost ghost added the Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something label Jul 15, 2019
@german-one
Copy link
Contributor Author

image

@ghost ghost removed the Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something label Jul 15, 2019
@german-one
Copy link
Contributor Author

german-one commented Jul 15, 2019

I'm sorry TAEF and unit testing was trouble for you

Be assured that I fully understand that code needs proper testing and verification. I'm aware that the console and terminal will be released to billions of machines out there. I certainly don't want to be the culprit for causing trouble all around the world. 😉 So don't worry. I'm learning something new every day. Thanks four your patience.

@german-one
Copy link
Contributor Author

german-one commented Jul 16, 2019

I had skipped that detail. But now I think I addressed all of your requests.

Copy link
Member

@miniksa miniksa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent. Thank you very much! This looks good to merge.

@miniksa miniksa merged commit fa5b9b0 into microsoft:master Jul 16, 2019
DHowett-MSFT pushed a commit that referenced this pull request Jul 17, 2019
#1850)

* Fix for UTF-8 partials in functions `ConhostConnection::_OutputThread` and `ApiRoutines::WriteConsoleOutputCharacterAImpl`

The implementation needs to check whether or not the buffer ends with a partial character. If so, only convert the code points which are complete, and save the partial code units in a cache that gets prepended to the next chunk of text.

* Utf8OutPipeReader class added
* Unit Test added
* use specific macros and WIL classes
* avoid possible deadlock caused by unclosed pipe handle

(cherry picked from commit fa5b9b0)
mcpiroman pushed a commit to mcpiroman/terminal that referenced this pull request Jul 23, 2019
microsoft#1850)

* Fix for UTF-8 partials in functions `ConhostConnection::_OutputThread` and `ApiRoutines::WriteConsoleOutputCharacterAImpl`

The implementation needs to check whether or not the buffer ends with a partial character. If so, only convert the code points which are complete, and save the partial code units in a cache that gets prepended to the next chunk of text.

* Utf8OutPipeReader class added
* Unit Test added
* use specific macros and WIL classes
* avoid possible deadlock caused by unclosed pipe handle
@vblazhkun
Copy link

It seems, there are still some issues left (blinking and improperly rendered characters):

out

@DHowett-MSFT
Copy link
Contributor

Fortunately, that is not related to partial encoded utf-8 codepoints 😄

@vblazhkun
Copy link

@DHowett-MSFT Shall I open a new issue then?

@DHowett-MSFT
Copy link
Contributor

Only if you can express exactly why those characters are "improperly rendered." Most of the people on my team don't know what they're supposed to look like.

HOWEVER: There's a chance it's already covered by a bunch of existing issues. We have known deficiencies in rendering Thai text, in rendering things with composing characters, in rendering things where there's a mismatch in number of cells to number of codepoints (we only support one codepoint -> 1 or 2 cells right now)

@vblazhkun
Copy link

So, I assume, the blinking is Okay? ;)

Here is the Notepad rendering:

image

@DHowett-MSFT
Copy link
Contributor

Nah I mean: If your report is "they don't look right," we need to know why. If your report is "they're blinking and they absolutely should not be," that's completely and totally reasonable 😄

@german-one
Copy link
Contributor Author

german-one commented Jul 28, 2019

@DHowett-MSFT

We have known deficiencies [...] in rendering things with composing characters [...]

Now that I read that, solving this problem should be also a task for UTF8OutPipeReader::Read I think. Something like "if the last complete code point was no Combining Diacritical Mark (U+0300..036F, U+1AB0..1AFF, U+1DC0..1DFF, U+20D0..20FF), put it in the cache, too". That's because we don't know if the partial or the upcoming character is a combining character.
(I just hope that winrt::to_hstring would make precomposed wide characters out of them afterwards.)

Do you think I should try to update the code accordingly?

@DHowett-MSFT
Copy link
Contributor

@german-one I'd hold off on that, actually. We shouldn't do this just yet, because we need to make sure that conhost (the console host, which doesn't use WinRT at all; it also serves as the API server for the legacy Win32 console APIs) knows to store combining characters and grapheme clusters in the buffer. If we just handle them at the Windows Terminal layer there will be a mismatch/miscount in the number of cells on the screen.

@german-one
Copy link
Contributor Author

german-one commented Jul 28, 2019

Well, I already successfully tried to keep combining characters complete. And I think that, regardless of the behavior of conhost, combining characters must not be split. Correct me if I'm wrong, but won't that still be the task for the Windows Terminal if it reads the pipe?

FWIW winrt::to_hstring doesn't make precomposed wide characters out of them. Thus, forwarding only complete combining characters still doesn't solve anything. But it might be a precondition for further improvements.
Just get back to me whenever you think the time has come ...

@ghost
Copy link

ghost commented Aug 3, 2019

🎉Windows Terminal Preview v0.3.2142.0 has been released which incorporates this pull request.:tada:

Handy links:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants