Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updating Unicode files to use 14.0.0 #66362

Merged
merged 7 commits into from
Mar 10, 2022
Merged

Conversation

joperezr
Copy link
Member

@joperezr joperezr commented Mar 8, 2022

fixes #44423

cc: @GrabYourPitchforks @tarekgh @stephentoub

As part #61048 I needed to generate a casing table using the latest Unicode data (14.0) which wasn't yet being consumed by dotnet/runtime. This change will make the required changes to update to 14.0 so that in a follow-up I can add the required changes that will generate the casing table for Regex that I require. I basically followed the steps outlined both in the GenUnicodeProp readme as well as the STEW tools project to make these changes and then tested locally that both the Globalization and the STEW tests passed after the changes. Please let me know if I'm missing anything else as part of this change.

@ghost
Copy link

ghost commented Mar 8, 2022

Tagging subscribers to this area: @dotnet/area-system-globalization
See info in area-owners.md if you want to be subscribed.

Issue Details

fixes #44423

cc: @GrabYourPitchforks @tarekgh @stephentoub

As part #61048 I needed to generate a casing table using the latest Unicode data (14.0) which wasn't yet being consumed by dotnet/runtime. This change will make the required changes to update to 14.0 so that in a follow-up I can add the required changes that will generate the casing table for Regex that I require. I basically followed the steps outlined both in the GenUnicodeProp readme as well as the STEW tools project to make these changes and then tested locally that both the Globalization and the STEW tests passed after the changes. Please let me know if I'm missing anything else as part of this change.

Author: joperezr
Assignees: joperezr
Labels:

area-System.Globalization

Milestone: -

@joperezr joperezr requested a review from tarekgh March 8, 2022 22:15
Copy link
Member

@stephentoub stephentoub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but others should review as well. I'm intrigued that there are no tests being edited; does that mean we don't have any that depend on the data that's changing?

@joperezr
Copy link
Member Author

joperezr commented Mar 8, 2022

I'm not super familiar with the tests so can't tell for sure, but looks like System.Text.Encodings.Web.Tests project imports the modified source file UnicodeRangesTests.generated which uses for at least one of its tests:

<Compile Include="..\src\System\Text\Unicode\UnicodeHelpers.generated.cs" Link="System\Text\Unicode\UnicodeHelpers.generated.cs" />

It also seems like we are embedding the source UnicodeData.txt file into that test assembly, and it is loading it in a specific test and doing some validation:

string[] allLines = new StreamReader(typeof(UnicodeHelpersTests).GetTypeInfo().Assembly.GetManifestResourceStream(UnicodeDataFileName)).ReadAllLines();

@joperezr
Copy link
Member Author

joperezr commented Mar 8, 2022

We also seem to be doing something similar in the Globalization tests:

private static readonly Lazy<List<CharUnicodeInfoTestCase>> s_testCases = new Lazy<List<CharUnicodeInfoTestCase>>(() =>
{
List<CharUnicodeInfoTestCase> testCases = new List<CharUnicodeInfoTestCase>();
string fileName = "UnicodeData.txt";
Stream stream = typeof(CharUnicodeInfoTestData).GetTypeInfo().Assembly.GetManifestResourceStream(fileName);
using (StreamReader reader = new StreamReader(stream))
{
while (!reader.EndOfStream)
{
Parse(testCases, reader.ReadLine());
}
}
return testCases;
});

@tarekgh
Copy link
Member

tarekgh commented Mar 8, 2022

    private static ReadOnlySpan<byte> TitlecaseValues => new byte[468]

why including these tables? I think you need to remove TitlecaseValues and CaseFoldValues tables?


Refers to: src/libraries/System.Private.CoreLib/src/System/Globalization/CharUnicodeInfoData.cs:1416 in 2db5f07. [](commit_id = 2db5f07, deletion_comment = False)

@tarekgh
Copy link
Member

tarekgh commented Mar 8, 2022

@joperezr the test projects include the Unicode text file from the package path

<EmbeddedResource Include="$(PkgSystem_Private_Runtime_UnicodeData)\contentFiles\any\any\$(UnicodeUcdVersion).0\ucd\UnicodeData.txt">
. Ensure this path is pointing at the correct version.

@joperezr
Copy link
Member Author

joperezr commented Mar 8, 2022

why including these tables? I think you need to remove TitlecaseValues and CaseFoldValues tables?

I was only following the instructions over at https://github.com/dotnet/runtime/blob/2db5f079aa06a7bf453c1b8898dcb134214f9449/src%2Fcoreclr%2FSystem.Private.CoreLib%2FTools%2FGenUnicodeProp%2FReadme.md

It isn't clear in the document weather or not CasingData should be includded by default in CharUnicodeInfoData.cs so I assumed it was. If it is not supposed to be, I can run it again without that parameter.

@joperezr
Copy link
Member Author

joperezr commented Mar 8, 2022

Ensure this path is pointing at the correct version.

Yes, after making my changes I built the relevant test projects and ensured that the resources were still embedded (meaning that the packagepath had not changed)

@MichalStrehovsky
Copy link
Member

If there's any changes to upper/lower casing, we should also update src\coreclr\pal\src\locale\unicodedata.cpp. IIRC, this file is generated by running src\coreclr\pal\src\locale\unicodedata.cs, giving it a path to UnicodeData.txt.

unicodedata.cpp is used by e.g. the reflection stack to implement BindingFlags.IgnoreCase.

@tarekgh
Copy link
Member

tarekgh commented Mar 9, 2022

If there's any changes to upper/lower casing, we should also update src\coreclr\pal\src\locale\unicodedata.cpp

I hope one day we consolidate this table with the table we generate inside corelib. The one in corelib is more optimized in size and access code speed too.

@stephentoub
Copy link
Member

The one in corelib is more optimized in size and access code speed too.

Until they're consolidated, what prevents us from using the same approach on the managed side?

@tarekgh
Copy link
Member

tarekgh commented Mar 9, 2022

Until they're consolidated, what prevents us from using the same approach on the managed side?

nothing. It needs someone to spend some time doing it. I recall I had some offline discussion with @jkotas before when I was optimizing the ordinal operations, but it was not a priority to do anything in the runtime at that time. In addition, the current perf for this runtime table is acceptable as no one complained about it.

@jkotas
Copy link
Member

jkotas commented Mar 9, 2022

I recall I had some offline discussion with @jkotas before when I was optimizing the ordinal operations, but it was not a priority to do anything in the runtime at that time.

The north start is to move all case-insensitive comparisons to C# and delete the casing table in the unmanaged runtime.

@joperezr
Copy link
Member Author

joperezr commented Mar 9, 2022

@GrabYourPitchforks @tarekgh @jkotas @MichalStrehovsky I added as part of the PR in the last commit one markdown file that is making note of all of the instructions that I followed when updating to the 14.0 version, so that there is a simple guide to follow for next update. Could you PTAL to that comment to make sure I didn't miss anything that needs to be checked/updated and that I covered all steps correctly?

Copy link
Member

@tarekgh tarekgh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I passed offline comment to Jose. Otherwise, LGTM.

@jkotas
Copy link
Member

jkotas commented Mar 10, 2022

Could you please also move the tool under libraries while you are on it? The tool is not coreclr specific, and so it should not live under coreclr. The tool is updating the tables in the shared CoreLib part under libraries and so it should live there too.

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@joperezr joperezr merged commit 458524a into dotnet:main Mar 10, 2022
@joperezr joperezr deleted the UpdateUnicode branch March 10, 2022 22:28
@ghost ghost locked as resolved and limited conversation to collaborators Apr 10, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update .NET 7 Unicode data to version 14.0.0
5 participants