-
Notifications
You must be signed in to change notification settings - Fork 4.9k
Fix #1339: Change encoding of .cs files to utf-8 without BOM #1470
Conversation
Hmm, the files still have a BOM for me when I checkout your branch. But at least they are in UTF8 now :) |
Ok found the issue, will update once the tests finish. Will also remove BOM from other utf-8 .cs files |
@krwq don't bother to invest much time in it, there are many files with a BOM in the repo and this didn't cause any problems yet ;) |
I already converted every single file since I'm doing it already :P Let me just finish running tests :) |
dbf6b89
to
aa97830
Compare
LGTM. @ellismg, any concerns with this? |
Nope. Seems like a good idea to me. |
@krwq, looks like you'll need to resolve conflicts or rebase before we can merge in the UI. |
Done, will merge after the merge build is finished. |
❗ The files need a byte order mark. |
If you don't put a byte order mark and the files do not contain any characters outside of 7-bit ASCII, then many Windows editors will treat them as CP1252 instead of UTF-8 if someone adds a character which is supported in CP1252. If this happens the file will not be a valid UTF-8 file. Adding the byte order mark guarantees that this situation will not happen. While the byte order mark for a UTF-8 file is optional (viewed as pointless by some), it is an important part of ensuring that all editors treat the files in the repository equally. |
Example situation that looks innocent but would break the code: A new developer opens a source file in their favorite text editor (might not be Visual Studio), and adds a reference to §x.y of ECMA-335. The § gets saved back as 0xA7 (single byte CP1252 because that's the default on most US Windows installations), and now you have a file that isn't valid UTF-8. Maybe the diff will show it, but maybe it won't. But you can be sure the next time you try to read the file as UTF-8 you will get an exception, because the UTF-8 encoding for § is two bytes (0xC2 0xA7). You do not want to get in the business of keeping track of every user's editor of preference and how to override the default encoding it uses for 7-bit ASCII files. And if you put the byte order mark back, you don't have to. |
/cc: @jaredpar. IIRC, Roslyn has moved to a UTF-8 No BOM world as well. |
@sharwell, I thought we already added a guideline to use \uXXXX |
@krwq The escape sequence guideline would be a workaround for a problem that is already solved. Also, it still leaves you explaining to people what they can and can't do. If you put the BOM back, everything will just always work and no one has to think about it. Also, the escape sequence pattern doesn't work in comments. You wouldn't be able to use § in a regular comment, and you'd have to use |
I think that encoding is something which we can easily lose so we shouldn't be using any special characters anyway. I don't have a strong opinion if we should be using BOM or not but I believe we should be consistent with whatever we choose. |
Based on @sharwell's comments, it seems to me like we should have the BOM. |
I think we should have the BOM too. for Roslyn, I think they still use 1252 encoding for compatibility. as long as the BOM not hurting then it is good idea to keep it. |
If no objections will update PR in 1-2 days (let's make sure everyone agrees) |
Visual Studio defaults to adding BOMs. Enforcing no-BOM would be fighting against it and might cause lots of pain. Yes, we need a way to make that configurable, but that's the world for now. |
Does this mean that we are going to require every file in the repo to have a BOM or we will continue to be inconsistent? Most other editors default to writing UTF8 without a BOM (Including the newly released Visual Studio Code). Given that VS always writes it and we don't want to swim up stream, I would be fine keeping the existing BOMs (and allow new ones to be introduced) under the following conditions:
Alternately we make a guideline that we ALWAYS use a BOM and ask folks to configure their editors appropriately (.editorconfig might be able to help here, although they say to shy away from the utf-with-bom encoding directive) and write some bots or something to go clean up the inevitable mistakes. Thoughts? |
So as far as BOM vs. not-BOM, I don't really care. My preference is we get a time machine and invent UTF-8 100 years ago and just get everybody to use it from day 1 of computing. But alas... If VS adds BOMs, and non-VS doesn't add BOMs, honestly just use your gut to decide which group will hate you less and go with that. I don't have a good sense of which editors people are using for corefx, but that'd inform the decision. |
The policy Roslyn adopted was simply:
The reason we adopted this policy was because it was already the defacto policy in the code base. Even though Roslyn is a product primarily written in Visual Studio the majority of files we were looking at didn't have a BOM. Could be we simply tested incorrectly. But we've had a soft policy of removing BOMs for ~6 months now and there haven't been any real issues with it. |
I think we shouldn't modify the the files that already have a BOM right now. As @jasonmalinowski says, VS automatically adds a BOM for new files and other tools don't, so we're always going the be inconsistent. The main reason for this PR and #1339 was moving off of UTF-16, we should focus on that until BOM/no-BOM becomes a problem. |
@krwq, I think the verdict is to have files with BOM (VS way) is the way for now. Please rebase and merge this, so at least one weird issue is solved: github-linguist/linguist#2310 [GH-ligust's misclassification of .cs file with UTF-16 and hence the incorrect language stats in homepage of this repo]. |
@krwq, should we close this out for now? Lots has changed in the repo since this was opened. |
I know Krzysztof is on vacation this week, but yeah, we should probably close this and then re-open it on against all of the new stuff if we still want to do it. |
Ok, I'm closing this out for now. Thanks. |
Conversion script:
For each .cs file:
I'm not sure to reliably check if we didn't lose anything although I did run all tests (including outerloop) and made sure no test is testing anything related to encoding (looking at the file name)