-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updating Unicode files to use 14.0.0 #66362
Conversation
Tagging subscribers to this area: @dotnet/area-system-globalization Issue Detailsfixes #44423 cc: @GrabYourPitchforks @tarekgh @stephentoub As part #61048 I needed to generate a casing table using the latest Unicode data (14.0) which wasn't yet being consumed by dotnet/runtime. This change will make the required changes to update to 14.0 so that in a follow-up I can add the required changes that will generate the casing table for Regex that I require. I basically followed the steps outlined both in the GenUnicodeProp readme as well as the STEW tools project to make these changes and then tested locally that both the Globalization and the STEW tests passed after the changes. Please let me know if I'm missing anything else as part of this change.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but others should review as well. I'm intrigued that there are no tests being edited; does that mean we don't have any that depend on the data that's changing?
I'm not super familiar with the tests so can't tell for sure, but looks like System.Text.Encodings.Web.Tests project imports the modified source file runtime/src/libraries/System.Text.Encodings.Web/tests/System.Text.Encodings.Web.Tests.csproj Line 19 in 2db5f07
It also seems like we are embedding the source UnicodeData.txt file into that test assembly, and it is loading it in a specific test and doing some validation:
|
We also seem to be doing something similar in the Globalization tests: runtime/src/libraries/System.Globalization/tests/System/Globalization/CharUnicodeInfoTestData.cs Lines 12 to 25 in 2db5f07
|
why including these tables? I think you need to remove Refers to: src/libraries/System.Private.CoreLib/src/System/Globalization/CharUnicodeInfoData.cs:1416 in 2db5f07. [](commit_id = 2db5f07, deletion_comment = False) |
@joperezr the test projects include the Unicode text file from the package path runtime/src/libraries/System.Globalization/tests/System.Globalization.Tests.csproj Line 123 in 2db5f07
|
I was only following the instructions over at https://github.com/dotnet/runtime/blob/2db5f079aa06a7bf453c1b8898dcb134214f9449/src%2Fcoreclr%2FSystem.Private.CoreLib%2FTools%2FGenUnicodeProp%2FReadme.md It isn't clear in the document weather or not CasingData should be includded by default in CharUnicodeInfoData.cs so I assumed it was. If it is not supposed to be, I can run it again without that parameter. |
Yes, after making my changes I built the relevant test projects and ensured that the resources were still embedded (meaning that the packagepath had not changed) |
If there's any changes to upper/lower casing, we should also update src\coreclr\pal\src\locale\unicodedata.cpp. IIRC, this file is generated by running src\coreclr\pal\src\locale\unicodedata.cs, giving it a path to UnicodeData.txt. unicodedata.cpp is used by e.g. the reflection stack to implement BindingFlags.IgnoreCase. |
I hope one day we consolidate this table with the table we generate inside corelib. The one in corelib is more optimized in size and access code speed too. |
Until they're consolidated, what prevents us from using the same approach on the managed side? |
nothing. It needs someone to spend some time doing it. I recall I had some offline discussion with @jkotas before when I was optimizing the ordinal operations, but it was not a priority to do anything in the runtime at that time. In addition, the current perf for this runtime table is acceptable as no one complained about it. |
The north start is to move all case-insensitive comparisons to C# and delete the casing table in the unmanaged runtime. |
@GrabYourPitchforks @tarekgh @jkotas @MichalStrehovsky I added as part of the PR in the last commit one markdown file that is making note of all of the instructions that I followed when updating to the 14.0 version, so that there is a simple guide to follow for next update. Could you PTAL to that comment to make sure I didn't miss anything that needs to be checked/updated and that I covered all steps correctly? |
src/libraries/System.Private.CoreLib/src/System/Globalization/Updating-Unicode-Versions.md
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I passed offline comment to Jose. Otherwise, LGTM.
Could you please also move the tool under libraries while you are on it? The tool is not coreclr specific, and so it should not live under coreclr. The tool is updating the tables in the shared CoreLib part under libraries and so it should live there too. |
src/libraries/System.Text.Encodings.Web/tools/GenUnicodeRanges/Program.cs
Outdated
Show resolved
Hide resolved
src/coreclr/System.Private.CoreLib/Tools/GenUnicodeProp/Updating-Unicode-Versions.md
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
fixes #44423
cc: @GrabYourPitchforks @tarekgh @stephentoub
As part #61048 I needed to generate a casing table using the latest Unicode data (14.0) which wasn't yet being consumed by dotnet/runtime. This change will make the required changes to update to 14.0 so that in a follow-up I can add the required changes that will generate the casing table for Regex that I require. I basically followed the steps outlined both in the GenUnicodeProp readme as well as the STEW tools project to make these changes and then tested locally that both the Globalization and the STEW tests passed after the changes. Please let me know if I'm missing anything else as part of this change.