What encoding is used for string types? #45
This has been a concern for me since they started their own support for py3. The docs sort of indicate one byte per character (ASCII or ISO-8859-1/latin-1 being the plausible options), and latin-1 was selected in pysunspec because it 'made the tests pass' (all byte sequences are valid latin-1, AFAIK). IIRC many of the string fields are only 8 bytes or so, which leaves little room for i18n via UTF-8. Mostly I think this is an area that is not clearly understood, and thus the docs are not clear. I would vote for longer strings and probably UTF-8, but given the present field sizes you almost have to leave the encoding unspecified at the SunSpec level and learn it out of band from the manufacturer, which allows a byte-length-optimized encoding to be chosen. Of course, that is a mess to deal with.
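The "all byte sequences are valid latin-1" point above can be sketched in a few lines. This is an illustrative snippet, not pysunspec code: it shows why decoding register bytes as latin-1 never raises, while ASCII and UTF-8 reject many byte sequences.

```python
# Every byte value 0x00-0xFF maps to a code point in latin-1,
# so decoding arbitrary bytes never fails ("made the tests pass").
raw = bytes(range(256))  # all possible byte values

text = raw.decode("latin-1")
assert len(text) == 256  # one character per byte, always

# ASCII rejects anything >= 0x80.
try:
    raw.decode("ascii")
except UnicodeDecodeError:
    print("ascii: bytes >= 0x80 are invalid")

# UTF-8 rejects many sequences, e.g. a lone continuation byte.
try:
    b"\x80".decode("utf-8")
except UnicodeDecodeError:
    print("utf-8: 0x80 on its own is not a valid sequence")
```

So latin-1 is the only one of the three that round-trips arbitrary register contents, which explains the test behavior without saying anything about what encoding devices actually use.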
This was an oversight in the original spec, but most people are treating it as ASCII. It is one byte per character, and the chosen character encoding is latin-1 to get the largest character set while still maintaining backwards compatibility.
Hmm, ok. So the current status is that essentially only the lower 128 values of a byte are implicitly defined, as ASCII. My proposal would be to define the standard to use UTF-8 for strings. With UTF-8 the current ASCII characters remain unchanged (i.e. 1 char = 1 byte), so it is backwards compatible with everything I have seen so far. UTF-8 is also such a common standard that all programming languages and environments support it. A good read about character encoding: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ @silvia2019 @altendky What do you think?
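The backwards-compatibility claim above can be checked directly: for pure-ASCII text, UTF-8 produces byte-for-byte the same output as ASCII and latin-1, and the encodings only diverge for non-ASCII characters. The device name below is a made-up example value.

```python
# For ASCII-only content, UTF-8 is identical to ASCII and latin-1
# (1 char = 1 byte), so existing register contents are unaffected.
ascii_name = "Inverter-01"  # hypothetical ASCII-only device name
assert ascii_name.encode("utf-8") == ascii_name.encode("ascii")
assert ascii_name.encode("utf-8") == ascii_name.encode("latin-1")

# Non-ASCII characters are where the encodings diverge:
# latin-1 stores U+00E9 as one byte, UTF-8 as two.
assert "é".encode("latin-1") == b"\xe9"
assert "é".encode("utf-8") == b"\xc3\xa9"
```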
I agree that anything but UTF-* is very much 'western' or even 'English' oriented (and that that is a negative thing). And yes, UTF-8 is what I would choose as well (vs. UTF-16-LE or whatever Windows uses, etc.). Practically speaking, though, the strings are too short to support other character sets well, since non-ASCII text gets even fewer usable characters, so I think the field lengths should be reviewed as well. I guess that's a 'yes, I like UTF-8' plus 'consider longer strings'. Given that the documentation only specified one byte per character but no encoding, I think backwards compatibility can be set aside here: you had to know the proper encoding out of band anyway, and once you have that you don't also need to know how many bytes per character, so it was really completely undefined. I'll take your word for it that ASCII is what tended to be used in practice.
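The "fewer usable characters" concern can be made concrete with a small sketch. The 8-byte field size and the helper function below are illustrative, taken from the rough figure mentioned earlier in the thread, not from the SunSpec spec:

```python
FIELD_BYTES = 8  # illustrative field size from the discussion

def chars_that_fit(text: str, limit: int = FIELD_BYTES) -> int:
    """Count how many leading characters of `text` fit in `limit` bytes of UTF-8."""
    used = 0
    for i, ch in enumerate(text):
        used += len(ch.encode("utf-8"))
        if used > limit:
            return i
    return len(text)

# ASCII: 1 byte per character, so 8 characters fit in 8 bytes.
assert chars_that_fit("ABCDEFGHIJ") == 8
# Latin letters with accents: 2 bytes each in UTF-8, only 4 fit.
assert chars_that_fit("éééééééééé") == 4
# CJK characters: 3 bytes each in UTF-8, only 2 fit.
assert chars_that_fit("日本語テスト") == 2
```

Which is the argument for pairing a UTF-8 mandate with longer string fields: the same field holds half or a third as many characters for non-Latin scripts.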
I'm implementing some software to read SunSpec and I have not been able to find the character encoding that is used for string-typed fields.
The documentation I have been able to find only shows a simple example that is limited to the English letters of ASCII.
Given only this example, it is still possible that the real encoding is something like UTF-8.
Does anyone know where I can find the chosen encoding?