Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.
/ corefx Public archive

Fix #1339: Change encoding of .cs files to utf-8 without BOM #1470

Closed
wants to merge 1 commit into from

Conversation

krwq
Copy link
Member

@krwq krwq commented Apr 22, 2015

Conversion script:
For each .cs file:

  • Read all bytes
  • Analyze BOM to get encoding
  • If encoding not any of those then end: Encoding.UTF32; Encoding.BigEndianUnicode; Encoding.Unicode
  • Get string from bytes read from the file (encoding.GetString(bytes))
  • Get utf-8 (no BOM) bytes from the string ((new Utf8Encoding(false)).GetBytes(str))
  • Save to file

I'm not sure to reliably check if we didn't lose anything although I did run all tests (including outerloop) and made sure no test is testing anything related to encoding (looking at the file name)

@akoeplinger
Copy link
Member

Hmm, the files still have a BOM for me when I checkout your branch. But at least they are in UTF8 now :)

@krwq
Copy link
Member Author

krwq commented Apr 22, 2015

Ok found the issue, will update once the tests finish. Will also remove BOM from other utf-8 .cs files

@akoeplinger
Copy link
Member

@krwq don't bother to invest much time in it, there are many files with a BOM in the repo and this didn't cause any problems yet ;)

@krwq
Copy link
Member Author

krwq commented Apr 22, 2015

I already converted every single file since I'm doing it already :P Let me just finish running tests :)

@krwq krwq force-pushed the 1339_encodingutf8 branch 2 times, most recently from dbf6b89 to aa97830 Compare April 22, 2015 20:49
@stephentoub
Copy link
Member

LGTM. @ellismg, any concerns with this?

@ellismg
Copy link
Contributor

ellismg commented Apr 24, 2015

Nope. Seems like a good idea to me.

@stephentoub
Copy link
Member

@krwq, looks like you'll need to resolve conflicts or rebase before we can merge in the UI.

@krwq krwq force-pushed the 1339_encodingutf8 branch from aa97830 to 7a49856 Compare April 24, 2015 15:16
@krwq
Copy link
Member Author

krwq commented Apr 24, 2015

Done, will merge after the merge build is finished.

@krwq krwq force-pushed the 1339_encodingutf8 branch from 7a49856 to 4d4b4e2 Compare April 24, 2015 18:29
@sharwell
Copy link
Member

❗ The files need a byte order mark.

@sharwell
Copy link
Member

If you don't put a byte order mark and the files do not contain any characters outside of 7-bit ASCII, then many Windows editors will treat them as CP1252 instead of UTF-8 if someone adds a character which is supported in CP1252. If this happens the file will not be a valid UTF-8 file.

Adding the byte order mark guarantees that this situation will not happen. While the byte order mark for a UTF-8 file is optional (viewed as pointless by some), it is an important part of ensuring that all editors treat the files in the repository equally.

@sharwell
Copy link
Member

Example situation that looks innocent but would break the code:

A new developer opens a source file in their favorite text editor (might not be Visual Studio), and adds a reference to §x.y of ECMA-335. The § gets saved back as 0xA7 (single byte CP1252 because that's the default on most US Windows installations), and now you have a file that isn't valid UTF-8. Maybe the diff will show it, but maybe it won't. But you can be sure the next time you try to read the file as UTF-8 you will get an exception, because the UTF-8 encoding for § is two bytes (0xC2 0xA7).

You do not want to get in the business of keeping track of every user's editor of preference and how to override the default encoding it uses for 7-bit ASCII files. And if you put the byte order mark back, you don't have to.

@ellismg
Copy link
Contributor

ellismg commented Apr 25, 2015

/cc: @jaredpar. IIRC, Roslyn has moved to a UTF-8 No BOM world as well.

@krwq
Copy link
Member Author

krwq commented Apr 25, 2015

@sharwell, I thought we already added a guideline to use \uXXXX
after this:
https://github.com/dotnet/corefx/issues/77

@sharwell
Copy link
Member

@krwq The escape sequence guideline would be a workaround for a problem that is already solved. Also, it still leaves you explaining to people what they can and can't do. If you put the BOM back, everything will just always work and no one has to think about it. Also, the escape sequence pattern doesn't work in comments. You wouldn't be able to use § in a regular comment, and you'd have to use § for a documentation comment. Either way it defeats the purpose of using the character for clarity in referencing another document.

@krwq
Copy link
Member Author

krwq commented Apr 25, 2015

I think that encoding is something which we can easily lose so we shouldn't be using any special characters anyway. I don't have a strong opinion if we should be using BOM or not but I believe we should be consistent with whatever we choose.

@dsplaisted
Copy link
Member

Based on @sharwell's comments, it seems to me like we should have the BOM.

@tarekgh
Copy link
Member

tarekgh commented Apr 29, 2015

I think we should have the BOM too. for Roslyn, I think they still use 1252 encoding for compatibility. as long as the BOM not hurting then it is good idea to keep it.

@krwq
Copy link
Member Author

krwq commented Apr 29, 2015

If no objections will update PR in 1-2 days (let's make sure everyone agrees)

@jasonmalinowski
Copy link
Member

Visual Studio defaults to adding BOMs. Enforcing no-BOM would be fighting against it and might cause lots of pain. Yes, we need a way to make that configurable, but that's the world for now.

@ellismg
Copy link
Contributor

ellismg commented Apr 29, 2015

Does this mean that we are going to require every file in the repo to have a BOM or we will continue to be inconsistent? Most other editors default to writing UTF8 without a BOM (Including the newly released Visual Studio Code).

Given that VS always writes it and we don't want to swim up stream, I would be fine keeping the existing BOMs (and allow new ones to be introduced) under the following conditions:

  1. We continue to not use non ASCII characters, and enforce this either via tooling or PRs.
  2. We don't include the addition or removal of a BOM in diffs.

Alternately we make a guideline that we ALWAYS use a BOM and ask folks to configure their editors appropriately (.editorconfig might be able to help here, although they say to shy away from the utf-with-bom encoding directive) and write some bots or something to go clean up the inevitable mistakes.

Thoughts?

@jasonmalinowski
Copy link
Member

So as far as BOM vs. not-BOM, I don't really care. My preference is we get a time machine and invent UTF-8 100 years ago and just get everybody to use it from day 1 of computing. But alas...

If VS adds BOMs, and non-VS doesn't add BOMs, honestly just use your gut to decide which group will hate you less and go with that. I don't have a good sense of which editors people are using for corefx, but that'd inform the decision.

@jaredpar
Copy link
Member

The policy Roslyn adopted was simply:

If it needs a BOM to correctly decode the file then add a BOM, else please consider removing the BOM.

The reason we adopted this policy was because it was already the defacto policy in the code base. Even though Roslyn is a product primarily written in Visual Studio the majority of files we were looking at didn't have a BOM.

Could be we simply tested incorrectly. But we've had a soft policy of removing BOMs for ~6 months now and there haven't been any real issues with it.

@akoeplinger
Copy link
Member

I think we shouldn't modify the the files that already have a BOM right now. As @jasonmalinowski says, VS automatically adds a BOM for new files and other tools don't, so we're always going the be inconsistent.
@jaredpar out of interest, I ran grep -rl $'\xEF\xBB\xBF' . | wc -l against roslyn and it told me 6809 files have a BOM there.

The main reason for this PR and #1339 was moving off of UTF-16, we should focus on that until BOM/no-BOM becomes a problem.

@ghost
Copy link

ghost commented May 28, 2015

@krwq, I think the verdict is to have files with BOM (VS way) is the way for now. Please rebase and merge this, so at least one weird issue is solved: github-linguist/linguist#2310 [GH-ligust's misclassification of .cs file with UTF-16 and hence the incorrect language stats in homepage of this repo].

@stephentoub
Copy link
Member

@krwq, should we close this out for now? Lots has changed in the repo since this was opened.

@mellinoe
Copy link
Contributor

I know Krzysztof is on vacation this week, but yeah, we should probably close this and then re-open it on against all of the new stuff if we still want to do it.

@stephentoub
Copy link
Member

Ok, I'm closing this out for now. Thanks.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.