Updating Unicode files to use 14.0.0 #66362

Merged 7 commits on Mar 10, 2022
2 changes: 1 addition & 1 deletion THIRD-PARTY-NOTICES.TXT
@@ -34,7 +34,7 @@ License notice for Unicode data

https://www.unicode.org/license.html

- Copyright © 1991-2020 Unicode, Inc. All rights reserved.
+ Copyright © 1991-2022 Unicode, Inc. All rights reserved.
Distributed under the Terms of Use in https://www.unicode.org/copyright.html.

Permission is hereby granted, free of charge, to any person obtaining
11 changes: 11 additions & 0 deletions src/coreclr/pal/src/locale/unicodedata.cpp
@@ -5,6 +5,7 @@

//
// THIS FILE IS GENERATED. DO NOT HAND EDIT.
// IF YOU NEED TO UPDATE UNICODE VERSION FOLLOW THE GUIDE AT src/libraries/System.Private.CoreLib/Tools/GenUnicodeProp/Updating-Unicode-Versions.md
//

CONST UnicodeDataRec UnicodeData[] = {
@@ -1783,6 +1784,7 @@ CONST UnicodeDataRec UnicodeData[] = {
{ 0x2C2C, UPPER_CASE, 0x2C5C },
{ 0x2C2D, UPPER_CASE, 0x2C5D },
{ 0x2C2E, UPPER_CASE, 0x2C5E },
{ 0x2C2F, UPPER_CASE, 0x2C5F },
{ 0x2C30, LOWER_CASE, 0x2C00 },
{ 0x2C31, LOWER_CASE, 0x2C01 },
{ 0x2C32, LOWER_CASE, 0x2C02 },
@@ -1830,6 +1832,7 @@ CONST UnicodeDataRec UnicodeData[] = {
{ 0x2C5C, LOWER_CASE, 0x2C2C },
{ 0x2C5D, LOWER_CASE, 0x2C2D },
{ 0x2C5E, LOWER_CASE, 0x2C2E },
{ 0x2C5F, LOWER_CASE, 0x2C2F },
{ 0x2C60, UPPER_CASE, 0x2C61 },
{ 0x2C61, LOWER_CASE, 0x2C60 },
{ 0x2C62, UPPER_CASE, 0x26B },
@@ -2213,6 +2216,8 @@ CONST UnicodeDataRec UnicodeData[] = {
{ 0xA7BD, LOWER_CASE, 0xA7BC },
{ 0xA7BE, UPPER_CASE, 0xA7BF },
{ 0xA7BF, LOWER_CASE, 0xA7BE },
{ 0xA7C0, UPPER_CASE, 0xA7C1 },
{ 0xA7C1, LOWER_CASE, 0xA7C0 },
{ 0xA7C2, UPPER_CASE, 0xA7C3 },
{ 0xA7C3, LOWER_CASE, 0xA7C2 },
{ 0xA7C4, UPPER_CASE, 0xA794 },
@@ -2222,6 +2227,12 @@ CONST UnicodeDataRec UnicodeData[] = {
{ 0xA7C8, LOWER_CASE, 0xA7C7 },
{ 0xA7C9, UPPER_CASE, 0xA7CA },
{ 0xA7CA, LOWER_CASE, 0xA7C9 },
{ 0xA7D0, UPPER_CASE, 0xA7D1 },
{ 0xA7D1, LOWER_CASE, 0xA7D0 },
{ 0xA7D6, UPPER_CASE, 0xA7D7 },
{ 0xA7D7, LOWER_CASE, 0xA7D6 },
{ 0xA7D8, UPPER_CASE, 0xA7D9 },
{ 0xA7D9, LOWER_CASE, 0xA7D8 },
{ 0xA7F5, UPPER_CASE, 0xA7F6 },
{ 0xA7F6, LOWER_CASE, 0xA7F5 },
{ 0xAB53, LOWER_CASE, 0xA7B3 },
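The records added in this hunk pair each cased character with its simple mapping to the other case. As a rough illustration of how a sorted table of this shape could be consulted, here is a minimal sketch. The struct layout, the `CaseKind` enum, and the helper names `FindRecord`, `ToLowerSimple` and `ToUpperSimple` are assumptions made for the example and are not taken from the PAL sources.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>

// Hypothetical stand-ins for the PAL definitions; the real UnicodeDataRec
// layout and the real casing constants may differ.
enum CaseKind : std::uint8_t { UPPER_CASE, LOWER_CASE };

struct UnicodeDataRec {
    char16_t character;  // code unit this record describes
    CaseKind kind;       // whether `character` is upper or lower case
    char16_t opposite;   // simple mapping to the other case
};

// A few records copied from the diff above, kept sorted by `character`.
static const UnicodeDataRec kSample[] = {
    { 0x2C2F, UPPER_CASE, 0x2C5F },
    { 0x2C5F, LOWER_CASE, 0x2C2F },
    { 0xA7C0, UPPER_CASE, 0xA7C1 },
    { 0xA7C1, LOWER_CASE, 0xA7C0 },
    { 0xA7D0, UPPER_CASE, 0xA7D1 },
    { 0xA7D1, LOWER_CASE, 0xA7D0 },
};

// Binary-search the sorted table; returns nullptr when `c` has no record.
const UnicodeDataRec* FindRecord(char16_t c) {
    const UnicodeDataRec* first = std::begin(kSample);
    const UnicodeDataRec* last = std::end(kSample);
    // Binary search is only valid if the table is sorted by character.
    assert(std::is_sorted(first, last,
        [](const UnicodeDataRec& a, const UnicodeDataRec& b) {
            return a.character < b.character;
        }));
    const UnicodeDataRec* it = std::lower_bound(first, last, c,
        [](const UnicodeDataRec& rec, char16_t value) {
            return rec.character < value;
        });
    return (it != last && it->character == c) ? it : nullptr;
}

// Characters without a record, or already in the requested case, map to themselves.
char16_t ToLowerSimple(char16_t c) {
    const UnicodeDataRec* rec = FindRecord(c);
    return (rec != nullptr && rec->kind == UPPER_CASE) ? rec->opposite : c;
}

char16_t ToUpperSimple(char16_t c) {
    const UnicodeDataRec* rec = FindRecord(c);
    return (rec != nullptr && rec->kind == LOWER_CASE) ? rec->opposite : c;
}
```

With this sample, `ToLowerSimple(0x2C2F)` yields `0x2C5F`, one of the pairs this PR adds, while a character that is missing from the sample table is returned unchanged.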
1 change: 1 addition & 0 deletions src/coreclr/pal/src/locale/unicodedata.cs
@@ -20,6 +20,7 @@ static void Main(string[] args)
Console.WriteLine();
Console.WriteLine("//");
Console.WriteLine("// THIS FILE IS GENERATED. DO NOT HAND EDIT.");
Console.WriteLine("// IF YOU NEED TO UPDATE UNICODE VERSION FOLLOW THE GUIDE AT src/libraries/System.Private.CoreLib/Tools/GenUnicodeProp/Updating-Unicode-Versions.md");
Console.WriteLine("//");
Console.WriteLine();

@@ -1,7 +1,7 @@
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<AllowUnsafeBlocks>true</AllowUnsafeBlocks>
- <UnicodeUcdVersion>13.0</UnicodeUcdVersion>
+ <UnicodeUcdVersion>14.0</UnicodeUcdVersion>
<TargetFramework>$(NetCoreAppCurrent)</TargetFramework>
</PropertyGroup>
<ItemGroup>
@@ -5,7 +5,7 @@
<IncludeRemoteExecutor>true</IncludeRemoteExecutor>
<!-- This test project is Windows only as it forces the use of NLS as the Globalization platform -->
<TargetFramework>$(NetCoreAppCurrent)-windows</TargetFramework>
- <UnicodeUcdVersion>13.0</UnicodeUcdVersion>
+ <UnicodeUcdVersion>14.0</UnicodeUcdVersion>
</PropertyGroup>
<ItemGroup>
<!-- Include tests from System.Globalization.Tests -->
@@ -4,7 +4,7 @@
<TestRuntime>true</TestRuntime>
<IncludeRemoteExecutor>true</IncludeRemoteExecutor>
<TargetFramework>$(NetCoreAppCurrent)</TargetFramework>
- <UnicodeUcdVersion>13.0</UnicodeUcdVersion>
+ <UnicodeUcdVersion>14.0</UnicodeUcdVersion>
</PropertyGroup>
<ItemGroup>
<Compile Include="AssemblyInfo.cs" />
@@ -3,7 +3,7 @@
<PropertyGroup>
<OutputType>Exe</OutputType>
<TargetFramework>netcoreapp3.1</TargetFramework>
- <UnicodeUcdVersion>13.0</UnicodeUcdVersion>
+ <UnicodeUcdVersion>14.0</UnicodeUcdVersion>
</PropertyGroup>

<ItemGroup>
@@ -110,6 +110,7 @@ private static void Main(string[] args)

file.Write(" // THE FOLLOWING DATA IS AUTO GENERATED BY GenUnicodeProp program UNDER THE TOOLS FOLDER\n");
file.Write(" // PLEASE DON'T MODIFY BY HAND\n");
file.Write(" // IF YOU NEED TO UPDATE UNICODE VERSION FOLLOW THE GUIDE AT src/libraries/System.Private.CoreLib/Tools/GenUnicodeProp/Updating-Unicode-Versions.md\n");

PrintAssertTableLevelsBitCountRoutine("CategoryCasing", file, categoryCasingTableLevelBits);

@@ -5,10 +5,10 @@ Before running this tool, ensure the following are all in sync:
- The package at https://github.com/dotnet/runtime-assets/tree/master/src/System.Private.Runtime.UnicodeData contains
the up-to-date Unicode data you want to process.

- - The <SystemPrivateRuntimeUnicodeDataVersion> element in $(REPOROOT)\eng\Versions.props contains the correct version
+ - The <SystemPrivateRuntimeUnicodeDataVersion> element in $(REPOROOT)/eng/Versions.props contains the correct version
of the package mentioned above.

- - The <UnicodeUcdVersion> element in .\GenUnicodeProp.csproj contains the UCD version of the files you wish to process.
+ - The <UnicodeUcdVersion> element in ./GenUnicodeProp.csproj contains the UCD version of the files you wish to process.

Once this has been configured, from this directory, invoke:

@@ -18,5 +18,5 @@ If you want to include casing data (simple case mappings + case folding) in the

> `dotnet run -- -IncludeCasingData`

- Then move the generated CharUnicodeInfoData.cs file to $(LIBRARIESROOT)\System.Private.CoreLib\src\System\Globalization,
+ Then move the generated CharUnicodeInfoData.cs file to $(LIBRARIESROOT)/System.Private.CoreLib/src/System/Globalization,
overwriting the file in that directory, and commit it. DO NOT commit the file to this directory.
@@ -0,0 +1,47 @@
# Instructions for updating Unicode version in dotnet/runtime

## Table of Contents
- [Instructions for updating Unicode version in dotnet/runtime](#instructions-for-updating-unicode-version-in-dotnetruntime)
- [Table of Contents](#table-of-contents)
- [Overview](#overview)
- [Add the new Unicode files into the runtime-assets repo](#add-the-new-unicode-files-into-the-runtime-assets-repo)
- [Ingest the created package into dotnet/runtime repo](#ingest-the-created-package-into-dotnetruntime-repo)
- [Update dotnet/runtime libraries to consume the new Unicode changes](#update-dotnetruntime-libraries-to-consume-the-new-unicode-changes)

## Overview

Several places in this repository need to be updated when we ingest a new version of Unicode, because different runtime libraries depend on data that can change with each update (e.g., new characters being added or casing information changing). The sections below list the steps to follow when ingesting a new version of Unicode into dotnet/runtime.

## Add the new Unicode files into the runtime-assets repo

1. First, add the Unicode data to a location that the dotnet/runtime repo can later ingest. That location is a package that we build in the runtime-assets repo. The Unicode data can be downloaded from the [Unicode website](https://www.unicode.org/), specifically from `https://www.unicode.org/Public/14.0.0/` (replace 14.0.0 with the version that you want to ingest). Go into the `ucd` folder and download the following files:
    - CaseFolding.txt
    - PropList.txt
    - UnicodeData.txt
    - auxiliary/GraphemeBreakProperty.txt
    - auxiliary/GraphemeBreakTest.txt
    - emoji/emoji-data.txt
    - extracted/DerivedBidiClass.txt
    - extracted/DerivedName.txt

2. Once you have downloaded all those files, create a fork of the repo https://github.com/dotnet/runtime-assets and send a PR which creates a folder at `src/System.Private.Runtime.UnicodeData/<YourUnicodeVersion>` and places all of the downloaded files from step 1 there. You can look at a sample PR that did this for Unicode 14.0.0 here: https://github.com/dotnet/runtime-assets/pull/179
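As a convenience for step 1, the download locations can be derived from the version string. The following is only a sketch: the helper name `BuildUcdUrls` is hypothetical, and it assumes the listed files sit under the `ucd` folder at the same relative paths for the version you are ingesting.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Relative paths of the files listed in step 1, under the `ucd` folder.
static const char* const kUcdFiles[] = {
    "CaseFolding.txt",
    "PropList.txt",
    "UnicodeData.txt",
    "auxiliary/GraphemeBreakProperty.txt",
    "auxiliary/GraphemeBreakTest.txt",
    "emoji/emoji-data.txt",
    "extracted/DerivedBidiClass.txt",
    "extracted/DerivedName.txt",
};

// Hypothetical helper: builds one download URL per file for `version`
// (for example "14.0.0"). It only formats strings; it does not download.
std::vector<std::string> BuildUcdUrls(const std::string& version) {
    assert(!version.empty());
    const std::string base = "https://www.unicode.org/Public/" + version + "/ucd/";
    std::vector<std::string> urls;
    for (const char* file : kUcdFiles) {
        urls.push_back(base + file);
    }
    return urls;
}
```

Calling `BuildUcdUrls("14.0.0")` returns eight URLs in the same order as the list in step 1, which can then be passed to whatever download tool you prefer.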


## Ingest the created package into dotnet/runtime repo

Dependency flow should do this automatically, so in theory no user action is needed. We still call it out in these instructions because a problem during ingestion would block the rest of the process. The process works as follows: after the PR in the runtime-assets repo is merged, a new build is triggered in the runtime-assets pipeline, which produces the new Unicode package. Once that build finishes (and assuming it succeeds), it triggers the subscription that dotnet/runtime has against the runtime-assets repo, which generates a dependency PR (like [this one](https://github.com/dotnet/runtime/pull/65843)) that ingests the new package version into dotnet/runtime.

## Update dotnet/runtime libraries to consume the new Unicode changes

1. Follow the [instructions to run GenUnicodeProp](./Readme.md), which generate a new `CharUnicodeInfoData.cs` file and tell you where to copy it. After compiling the GenUnicodeProp tool, inspect the produced assembly (for example with a tool like ILSpy) and make sure that it contains all of the updated resources embedded into it, since those embedded resources are what is used to produce `CharUnicodeInfoData.cs`.
2. Follow the [instructions on how to update System.Text.Encodings.Web](../../../System.Text.Encodings.Web/tools/updating-encodings.md) projects. Those instructions will help you generate the files `UnicodeHelpers.generated.cs` and `UnicodeRangesTests.generated.cs`, which are consumed by both the test and the implementation projects for System.Text.Encodings.Web.
3. Search the repo for all .csproj files that define the `<UnicodeUcdVersion>` property and update it to the new version. A project that defines this property is very likely consuming the runtime-assets package in some form, so it needs to be updated to consume the new version. At the time of writing, the project files that need to be updated are:
    - GenUnicodeProp.csproj
    - TestUtilities.Unicode.csproj
    - System.Globalization.Tests.csproj
    - System.Globalization.Nls.Tests.csproj
    - System.Text.Encodings.Web.Tests.csproj
4. If the new Unicode data contains casing changes, we also need to update the `src/coreclr/pal/src/locale/unicodedata.cpp` file. This file is used by most of the reflection stack whenever you specify `BindingFlags.IgnoreCase`. To regenerate the contents of `unicodedata.cpp`, run the program located at `src/coreclr/pal/src/locale/unicodedata.cs`, passing the full path to the new UnicodeData.txt as a parameter.
5. If the new Unicode data changes which character class a specific character belongs to, or adds new characters, you may need to update the serialized Unicode character class data that `System.Text.RegularExpressions` uses for the `NonBacktracking` engine. The telltale sign is failing tests in the `System.Text.RegularExpressions.Tests` test project. If some tests do fail (which means you need to update the serialized mappings), edit the file `src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/RegexExperiment.cs`, set the `Enabled` bool to `true`, and re-run the RegexTests. This generates two files in your `%temp%` directory: `IgnoreCaseRelation.cs` and `UnicodeCategoryRanges.cs`. Copy these files to the folder `src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/Unicode/`, overwriting the existing ones.
6. Finally, update the license for the Unicode data in our [Third party notices](../../../../../THIRD-PARTY-NOTICES.TXT) by copying the contents of `https://www.unicode.org/license.html` into the section of our notices that has the Unicode license.
7. That's it. Commit all of the changed files and send a PR to dotnet/runtime with the updates. If you had to do anything special that is not noted in this document, PLEASE UPDATE THESE INSTRUCTIONS to facilitate future updates.
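To make step 4 more concrete, the sketch below shows how the simple case mappings can be read from one line of UnicodeData.txt, where fields are separated by `;`, field 12 is the simple uppercase mapping and field 13 is the simple lowercase mapping. It is not the generator in `unicodedata.cs`; the names `CaseMapping`, `ParseLine` and `FormatRecord` are hypothetical, and classifying a character by which mapping is present is an assumption made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <optional>
#include <string>
#include <vector>

struct CaseMapping {
    std::uint32_t codePoint;
    std::optional<std::uint32_t> simpleUpper;  // field 12 of UnicodeData.txt
    std::optional<std::uint32_t> simpleLower;  // field 13 of UnicodeData.txt
};

// Split on ';' and keep empty fields (most mapping fields are empty).
std::vector<std::string> SplitFields(const std::string& line) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type end = line.find(';', start);
        if (end == std::string::npos) {
            fields.push_back(line.substr(start));
            break;
        }
        fields.push_back(line.substr(start, end - start));
        start = end + 1;
    }
    return fields;
}

// Empty field means "no mapping". std::stoul throws on non-hex input,
// which is acceptable for a sketch but not for production code.
std::optional<std::uint32_t> ParseHex(const std::string& field) {
    if (field.empty()) return std::nullopt;
    return static_cast<std::uint32_t>(std::stoul(field, nullptr, 16));
}

// Returns nullopt when the line does not have the 15 fields of a record.
std::optional<CaseMapping> ParseLine(const std::string& line) {
    const std::vector<std::string> fields = SplitFields(line);
    if (fields.size() != 15) return std::nullopt;
    const std::optional<std::uint32_t> cp = ParseHex(fields[0]);
    if (!cp) return std::nullopt;
    return CaseMapping{*cp, ParseHex(fields[12]), ParseHex(fields[13])};
}

// Renders a row in the style of the table in unicodedata.cpp, or an empty
// string when there is no simple case mapping. Assumption: a character with
// a lowercase mapping is treated as UPPER_CASE, and vice versa.
std::string FormatRecord(const CaseMapping& m) {
    char buffer[64];
    if (m.simpleLower) {
        std::snprintf(buffer, sizeof(buffer), "{ 0x%X, UPPER_CASE, 0x%X },",
                      static_cast<unsigned>(m.codePoint),
                      static_cast<unsigned>(*m.simpleLower));
    } else if (m.simpleUpper) {
        std::snprintf(buffer, sizeof(buffer), "{ 0x%X, LOWER_CASE, 0x%X },",
                      static_cast<unsigned>(m.codePoint),
                      static_cast<unsigned>(*m.simpleUpper));
    } else {
        return std::string();
    }
    return buffer;
}
```

For example, parsing the line `0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;` and formatting the result gives `{ 0x41, UPPER_CASE, 0x61 },`. Comparing rows produced this way against the regenerated `unicodedata.cpp` can be a quick sanity check that new casing pairs were picked up.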