Fix #1339: Change encoding of .cs files to utf-8 without BOM #1470

krwq · 2015-04-22T18:55:13Z

Conversion script:
For each .cs file:

Read all bytes
Analyze BOM to get encoding
If encoding not any of those then end: Encoding.UTF32; Encoding.BigEndianUnicode; Encoding.Unicode
Get string from bytes read from the file (encoding.GetString(bytes))
Get utf-8 (no BOM) bytes from the string ((new Utf8Encoding(false)).GetBytes(str))
Save to file

I'm not sure to reliably check if we didn't lose anything although I did run all tests (including outerloop) and made sure no test is testing anything related to encoding (looking at the file name)

akoeplinger · 2015-04-22T19:35:43Z

Hmm, the files still have a BOM for me when I checkout your branch. But at least they are in UTF8 now :)

krwq · 2015-04-22T20:03:37Z

Ok found the issue, will update once the tests finish. Will also remove BOM from other utf-8 .cs files

akoeplinger · 2015-04-22T20:07:42Z

@krwq don't bother to invest much time in it, there are many files with a BOM in the repo and this didn't cause any problems yet ;)

krwq · 2015-04-22T20:10:36Z

I already converted every single file since I'm doing it already :P Let me just finish running tests :)

stephentoub · 2015-04-24T14:45:07Z

LGTM. @ellismg, any concerns with this?

ellismg · 2015-04-24T15:03:24Z

Nope. Seems like a good idea to me.

stephentoub · 2015-04-24T15:05:12Z

@krwq, looks like you'll need to resolve conflicts or rebase before we can merge in the UI.

krwq · 2015-04-24T15:17:29Z

Done, will merge after the merge build is finished.

sharwell · 2015-04-25T05:45:42Z

❗ The files need a byte order mark.

sharwell · 2015-04-25T05:49:03Z

If you don't put a byte order mark and the files do not contain any characters outside of 7-bit ASCII, then many Windows editors will treat them as CP1252 instead of UTF-8 if someone adds a character which is supported in CP1252. If this happens the file will not be a valid UTF-8 file.

Adding the byte order mark guarantees that this situation will not happen. While the byte order mark for a UTF-8 file is optional (viewed as pointless by some), it is an important part of ensuring that all editors treat the files in the repository equally.

sharwell · 2015-04-25T05:56:51Z

Example situation that looks innocent but would break the code:

A new developer opens a source file in their favorite text editor (might not be Visual Studio), and adds a reference to §x.y of ECMA-335. The § gets saved back as 0xA7 (single byte CP1252 because that's the default on most US Windows installations), and now you have a file that isn't valid UTF-8. Maybe the diff will show it, but maybe it won't. But you can be sure the next time you try to read the file as UTF-8 you will get an exception, because the UTF-8 encoding for § is two bytes (0xC2 0xA7).

You do not want to get in the business of keeping track of every user's editor of preference and how to override the default encoding it uses for 7-bit ASCII files. And if you put the byte order mark back, you don't have to.

ellismg · 2015-04-25T06:04:17Z

/cc: @jaredpar. IIRC, Roslyn has moved to a UTF-8 No BOM world as well.

krwq · 2015-04-25T16:34:59Z

@sharwell, I thought we already added a guideline to use \uXXXX
after this:
https://github.com/dotnet/corefx/issues/77

sharwell · 2015-04-25T18:03:45Z

@krwq The escape sequence guideline would be a workaround for a problem that is already solved. Also, it still leaves you explaining to people what they can and can't do. If you put the BOM back, everything will just always work and no one has to think about it. Also, the escape sequence pattern doesn't work in comments. You wouldn't be able to use § in a regular comment, and you'd have to use § for a documentation comment. Either way it defeats the purpose of using the character for clarity in referencing another document.

krwq · 2015-04-25T23:14:09Z

I think that encoding is something which we can easily lose so we shouldn't be using any special characters anyway. I don't have a strong opinion if we should be using BOM or not but I believe we should be consistent with whatever we choose.

dsplaisted · 2015-04-28T23:13:21Z

Based on @sharwell's comments, it seems to me like we should have the BOM.

tarekgh · 2015-04-29T00:06:41Z

I think we should have the BOM too. for Roslyn, I think they still use 1252 encoding for compatibility. as long as the BOM not hurting then it is good idea to keep it.

krwq · 2015-04-29T04:00:50Z

If no objections will update PR in 1-2 days (let's make sure everyone agrees)

jasonmalinowski · 2015-04-29T16:37:33Z

Visual Studio defaults to adding BOMs. Enforcing no-BOM would be fighting against it and might cause lots of pain. Yes, we need a way to make that configurable, but that's the world for now.

ellismg · 2015-04-29T22:45:44Z

Does this mean that we are going to require every file in the repo to have a BOM or we will continue to be inconsistent? Most other editors default to writing UTF8 without a BOM (Including the newly released Visual Studio Code).

Given that VS always writes it and we don't want to swim up stream, I would be fine keeping the existing BOMs (and allow new ones to be introduced) under the following conditions:

We continue to not use non ASCII characters, and enforce this either via tooling or PRs.
We don't include the addition or removal of a BOM in diffs.

Alternately we make a guideline that we ALWAYS use a BOM and ask folks to configure their editors appropriately (.editorconfig might be able to help here, although they say to shy away from the utf-with-bom encoding directive) and write some bots or something to go clean up the inevitable mistakes.

Thoughts?

jasonmalinowski · 2015-04-29T22:58:46Z

So as far as BOM vs. not-BOM, I don't really care. My preference is we get a time machine and invent UTF-8 100 years ago and just get everybody to use it from day 1 of computing. But alas...

If VS adds BOMs, and non-VS doesn't add BOMs, honestly just use your gut to decide which group will hate you less and go with that. I don't have a good sense of which editors people are using for corefx, but that'd inform the decision.

jaredpar · 2015-04-30T14:56:25Z

The policy Roslyn adopted was simply:

If it needs a BOM to correctly decode the file then add a BOM, else please consider removing the BOM.

The reason we adopted this policy was because it was already the defacto policy in the code base. Even though Roslyn is a product primarily written in Visual Studio the majority of files we were looking at didn't have a BOM.

Could be we simply tested incorrectly. But we've had a soft policy of removing BOMs for ~6 months now and there haven't been any real issues with it.

akoeplinger · 2015-05-07T20:00:58Z

I think we shouldn't modify the the files that already have a BOM right now. As @jasonmalinowski says, VS automatically adds a BOM for new files and other tools don't, so we're always going the be inconsistent.
@jaredpar out of interest, I ran grep -rl $'\xEF\xBB\xBF' . | wc -l against roslyn and it told me 6809 files have a BOM there.

The main reason for this PR and #1339 was moving off of UTF-16, we should focus on that until BOM/no-BOM becomes a problem.

ghost · 2015-05-28T01:53:27Z

@krwq, I think the verdict is to have files with BOM (VS way) is the way for now. Please rebase and merge this, so at least one weird issue is solved: github-linguist/linguist#2310 [GH-ligust's misclassification of .cs file with UTF-16 and hence the incorrect language stats in homepage of this repo].

stephentoub · 2015-06-10T18:31:32Z

@krwq, should we close this out for now? Lots has changed in the repo since this was opened.

mellinoe · 2015-06-19T19:32:37Z

I know Krzysztof is on vacation this week, but yeah, we should probably close this and then re-open it on against all of the new stuff if we still want to do it.

stephentoub · 2015-06-19T19:36:09Z

Ok, I'm closing this out for now. Thanks.

joshfree added the 2 - In Progress label Apr 22, 2015

dnfclas added the cla-already-signed label Apr 22, 2015

krwq force-pushed the 1339_encodingutf8 branch 2 times, most recently from dbf6b89 to aa97830 Compare April 22, 2015 20:49

krwq force-pushed the 1339_encodingutf8 branch from aa97830 to 7a49856 Compare April 24, 2015 15:16

Fix #1339: Change encoding of .cs files to utf-8 without BOM

4d4b4e2

krwq force-pushed the 1339_encodingutf8 branch from 7a49856 to 4d4b4e2 Compare April 24, 2015 18:29

stephentoub closed this Jun 19, 2015

mmitche removed the 2 - In Progress label Jun 19, 2015

akoeplinger mentioned this pull request Jul 13, 2015

Convert UTF-16 encoded files into UTF8 #2346

Merged

karelz modified the milestone: 1.0.0-rtm Dec 3, 2016

krwq deleted the 1339_encodingutf8 branch March 4, 2017 19:30

sharwell mentioned this pull request Jun 5, 2018

Shall we unify the file encodings of C# source files? dotnet/machinelearning#302

Open

KirillOsenkov mentioned this pull request Jul 16, 2018

Normalize whitespace in VSEditor files. mono/monodevelop#5403

Closed

sharwell mentioned this pull request Dec 7, 2018

charset utf-8 dotnet/winforms#205

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix #1339: Change encoding of .cs files to utf-8 without BOM #1470

Fix #1339: Change encoding of .cs files to utf-8 without BOM #1470

krwq commented Apr 22, 2015

akoeplinger commented Apr 22, 2015

krwq commented Apr 22, 2015

akoeplinger commented Apr 22, 2015

krwq commented Apr 22, 2015

stephentoub commented Apr 24, 2015

ellismg commented Apr 24, 2015

stephentoub commented Apr 24, 2015

krwq commented Apr 24, 2015

sharwell commented Apr 25, 2015

sharwell commented Apr 25, 2015

sharwell commented Apr 25, 2015

ellismg commented Apr 25, 2015

krwq commented Apr 25, 2015

sharwell commented Apr 25, 2015

krwq commented Apr 25, 2015

dsplaisted commented Apr 28, 2015

tarekgh commented Apr 29, 2015

krwq commented Apr 29, 2015

jasonmalinowski commented Apr 29, 2015

ellismg commented Apr 29, 2015

jasonmalinowski commented Apr 29, 2015

jaredpar commented Apr 30, 2015

akoeplinger commented May 7, 2015

ghost commented May 28, 2015

stephentoub commented Jun 10, 2015

mellinoe commented Jun 19, 2015

stephentoub commented Jun 19, 2015

Fix #1339: Change encoding of .cs files to utf-8 without BOM #1470

Fix #1339: Change encoding of .cs files to utf-8 without BOM #1470

Conversation

krwq commented Apr 22, 2015

akoeplinger commented Apr 22, 2015

krwq commented Apr 22, 2015

akoeplinger commented Apr 22, 2015

krwq commented Apr 22, 2015

stephentoub commented Apr 24, 2015

ellismg commented Apr 24, 2015

stephentoub commented Apr 24, 2015

krwq commented Apr 24, 2015

sharwell commented Apr 25, 2015

sharwell commented Apr 25, 2015

sharwell commented Apr 25, 2015

ellismg commented Apr 25, 2015

krwq commented Apr 25, 2015

sharwell commented Apr 25, 2015

krwq commented Apr 25, 2015

dsplaisted commented Apr 28, 2015

tarekgh commented Apr 29, 2015

krwq commented Apr 29, 2015

jasonmalinowski commented Apr 29, 2015

ellismg commented Apr 29, 2015

jasonmalinowski commented Apr 29, 2015

jaredpar commented Apr 30, 2015

akoeplinger commented May 7, 2015

ghost commented May 28, 2015

stephentoub commented Jun 10, 2015

mellinoe commented Jun 19, 2015

stephentoub commented Jun 19, 2015