List of remarks of current draft 04 #95

Closed
ben221199 opened this issue May 23, 2023 · 16 comments
@ben221199

ben221199 commented May 23, 2023

As seen in #83, I made some remarks on draft 03. In this issue I will list things that are not fully handled by the current draft (04) and some additional things I have in mind:

  1. As mentioned in List of remarks of current draft 03 #83 (point 3), I talked about the Discord variant of the Twitter snowflake. Today, I was thinking about UUIDv7. I think it is better to allow other time offsets next to 1970-01-01. Why? If Twitter decides to convert their IDs to UUIDs, they can choose the current form of UUIDv7. However, because Discord has another offset (2015-01-01), if they want to use the current form of UUIDv7, they also have to convert all their timestamps with an additional 45-year offset. I don't think this is desirable. I propose some changes on this (see also the sketch after this list):
    -UUID Version 7 features a time-ordered value field derived from the
    -widely implemented and well known Unix Epoch timestamp source, the
    -number of milliseconds since midnight 1 Jan 1970 UTC, leap seconds
    -excluded.  UUIDv7 generally has improved entropy characteristics over
    -UUIDv1 or UUIDv6.
    +UUID Version 7 features a time-ordered value field containing the number
    +of milliseconds, leap seconds excluded, since a defined offset, in most
    +cases the RECOMMENDED offset of midnight 1 Jan 1970 UTC. If a system
    +requires a different offset to be used, another offset MAY be used.
    +UUIDv7 generally has improved entropy characteristics over UUIDv1 or UUIDv6.

    -UUIDv7 values are created by allocating a Unix timestamp in
    +UUIDv7 values are created by allocating a timestamp in
     milliseconds in the most significant 48 bits and filling the
     remaining 74 bits, jointly, excluding the required version and
     variant bits, with a combination of the following subfields, in this
     order from the most significant bits to the least:

  2. There is a difference between UUID and GUID. Not in the wire format, but in the representation format. If a UUID/GUID has the hexadecimal value 00 11 22 33 44 55 66 77 88 99 AA BB CC DD EE FF, the UUID will be written like 00112233-4455-6677-8899-AABBCCDDEEFF, whereas the GUID will be written like {33221100-5544-7766-8899-AABBCCDDEEFF}. Apart from the { and }, it is also visible that some parts are flipped. This can be seen in HxD (a well-known hex editor), for example: [screenshot of HxD's Data Inspector]
    I think it is worth noting that GUIDs have this effect.

  3. I think we still have to talk about the name of the Max UUID. I understand the people who say "It is the maximum value". However, the makers of RFC 4122 went with "Nil UUID" when talking about the "minimum value". I think we should stay in the same jargon when naming things. So either we rename "Nil UUID" to "Min UUID" or something, or we change "Max UUID" to "Omni UUID". I think we should go for the Nil/Omni pair rather than the Min/Max pair. For example, Min/Max doesn't make sense when looking at a UUID as a 128-bit SIGNED integer: all bits set is not the maximum and all bits clear is not the minimum, see https://en.wikipedia.org/wiki/Two%27s_complement. I would rather go for the Nil/Omni pair:

    Nil (https://en.wiktionary.org/wiki/nil#Latin) is an alternative form of "nihil" and means "nothing" in Latin. This fits, because of all bits, NOTHING is set.
    Omni (https://en.wiktionary.org/wiki/omni#Latin) is a declension of "omnis" and means "all" in Latin. This fits, because ALL bits are set.
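
To make point 1 concrete, here is a minimal sketch (mine, not from the draft) of a UUIDv7-shaped layout with a hypothetical epoch_offset_ms parameter; 0 gives the draft's fixed Unix epoch, while 1420070400000 would give the Discord-style 2015-01-01 offset discussed above:

#include <chrono>
#include <cstdint>
#include <random>

// Sketch of a UUIDv7-shaped value. epoch_offset_ms is hypothetical: 0 gives
// the draft's fixed Unix epoch; the proposal above would also allow e.g.
// 1420070400000 (2015-01-01, the Discord offset mentioned in this thread).
void make_uuidv7(uint8_t out[16], uint64_t epoch_offset_ms = 0) {
    using namespace std::chrono;
    uint64_t ms = duration_cast<milliseconds>(
        system_clock::now().time_since_epoch()).count() - epoch_offset_ms;

    // 48-bit millisecond timestamp in the most significant 6 octets.
    for (int i = 0; i < 6; ++i)
        out[i] = (ms >> (40 - 8 * i)) & 0xFF;

    // Fill the remaining 10 octets with randomness (good enough for a sketch).
    std::random_device rd;
    for (int i = 6; i < 16; ++i)
        out[i] = rd() & 0xFF;

    out[6] = (out[6] & 0x0F) | 0x70;  // version 7 in the high nibble of octet 6
    out[8] = (out[8] & 0x3F) | 0x80;  // RFC 4122 variant bits "10" in octet 8
}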

@LiosK
Contributor

LiosK commented May 23, 2023

  1. Those who need a timestamp epoch other than 1970 or 1582 must use UUIDv8. It doesn't make sense to allow different epochs within the scope of UUIDv7, because the standard has to coordinate all the implementations to produce sortable values.
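
To illustrate: such a custom-epoch value can be built as a UUIDv8 instead, so it doesn't masquerade as a spec-compliant v7. A hypothetical variant, reusing make_uuidv7 from the sketch above:

// Hypothetical: same layout as the UUIDv7 sketch above, but with the version
// nibble overwritten to 8 (the custom/experimental version) and a custom epoch.
void make_custom_epoch_uuidv8(uint8_t out[16]) {
    const uint64_t kDiscordEpochMs = 1420070400000ULL;  // 2015-01-01 00:00:00 UTC
    make_uuidv7(out, kDiscordEpochMs);
    out[6] = (out[6] & 0x0F) | 0x80;  // version 8 instead of 7
}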

@kyzer-davis
Collaborator

kyzer-davis commented May 31, 2023

  1. Agree with @LiosK here.
  2. I am not sure it is worth noting that level of detail.
  3. See:
    3a. @bradleypeabody, you had said to me you wanted to add some fluff text to the Nil/Max sections; can you hit a PR on this topic before the close of WGLC (June 4th, for review by June 9th)? Edit: saw we did add this at the end of both sections already in draft-02.
    3b. As for Omni vs Max: I don't care, I just want to get the document finished. I just don't want to change Nil, since it is referenced everywhere all over the net, in docs, and in misc. code.

@ben221199
Author

  1. I don't know if my point was fully understood. Of course I want all implementations to be coordinated to produce sortable values, but I can imagine that companies, like Discord, will say "Yeah, we will use UUIDv7 with another offset", like they did with Twitter's snowflake.

    I see that both Twitter and Discord have an offset other than 1970. For Twitter it is Thursday 4 November 2010 01:42:54.657 (1288834974657) and for Discord it is Thursday 1 January 2015 00:00:00 (1420070400000). In my mind Twitter used 1970, which is why I came up with the idea. Because that isn't the case, it is indeed better to use UUIDv8 for this for now, and I agree with @LiosK.

    (Companies can always choose to patch UUIDv7 with their own offset, but they have to clearly define this offset in their documentation, like Discord did for Snowflake. However, doing this means not fully following the spec, so it is likely that not all implementations will support custom offsets. This is not a problem for the people making the spec (although it could be mentioned briefly), but more an issue for the company that defined the new offset.)


  2. I think one or two sentences mentioning it should be enough. It is just a representation format, but it can be important when serializing and deserializing.

  3. Changes of changelog + max/omni changes #98 seems fine.

@kyzer-davis
Collaborator

For item 2, do you have an application example (or two) that does that?
Personally I feel like they shouldn't be doing that... but at least looking at the app and docs myself will help me think of any text I may be able to put in there on the topic.

@ben221199
Author

Well, Raymond Chen made a blog post on it: https://devblogs.microsoft.com/oldnewthing/20220928-00/.

The editor HxD (download here: https://mh-nexus.de/en/hxd/) has a Data Inspector that supports GUIDs (but not UUIDs).
Also mentioning this post: https://stackoverflow.com/questions/10190817/guid-byte-order-in-net
And the Wikipedia article: https://en.wikipedia.org/wiki/Universally_unique_identifier#Encoding

Imagine having the following binary data in a hex editor:

00 11 22 33 44 55 66 77 88 99 AA BB CC DD EE FF

Normally, when reading this data, we use a big-endian representation format, so the UUID will have the following representation: 00112233-4455-6677-8899-AABBCCDDEEFF

However, Microsoft has a GUID struct:

struct GUID
{
    uint32_t Data1;     // first 4 octets, read as a native-endian integer
    uint16_t Data2;     // next 2 octets
    uint16_t Data3;     // next 2 octets
    uint8_t  Data4[8];  // last 8 octets, kept as individual bytes
};

The 128-bit data is still read the same. However, the representation format is different. A GUID is printed using the struct above, with each integer field interpreted as little-endian, so:

  • A uint32_t, so 00 11 22 33. In little-endian it becomes 33 22 11 00 when printed.
  • A uint16_t, so 44 55. In little-endian it becomes 55 44 when printed.
  • A uint16_t, so 66 77. In little-endian it becomes 77 66 when printed.
  • Then an array of uint8_t, 8 times. A single octet has no endianness, so the last 8 octets are printed in the same order as seen in the hex editor.

The representation becomes: {33221100-5544-7766-8899-AABBCCDDEEFF}
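
Here is a minimal sketch of this field-wise little-endian printing (print_as_guid is a hypothetical helper, not a Windows API):

#include <cstdint>
#include <cstdio>

// Prints the 16 bytes the way the GUID struct above would: Data1, Data2 and
// Data3 are read as little-endian integers, the trailing 8 octets as-is.
void print_as_guid(const uint8_t b[16]) {
    uint32_t d1 = b[0] | b[1] << 8 | b[2] << 16 | (uint32_t)b[3] << 24;
    uint16_t d2 = b[4] | b[5] << 8;
    uint16_t d3 = b[6] | b[7] << 8;
    printf("{%08X-%04X-%04X-%02X%02X-%02X%02X%02X%02X%02X%02X}\n",
           d1, d2, d3, b[8], b[9], b[10], b[11], b[12], b[13], b[14], b[15]);
}

int main() {
    const uint8_t bytes[16] = {0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77,
                               0x88, 0x99, 0xAA, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF};
    print_as_guid(bytes);  // prints {33221100-5544-7766-8899-AABBCCDDEEFF}
}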

Maybe that is also why there is a difference between UUID and GUID, to distinguish the difference in encoding? I don't know, but it sounds reasonable to me.

(I think I will do a big rewrite of the Wikipedia article soon.)

@kyzer-davis
Collaborator

Thanks. I think I can add a bullet to the ones I added last update, state that the MS implementation of GUIDs leverages little-endian, cite that MS post, and also generally discourage that practice while we are at it.

@ben221199
Author

I think it isn't that simple to discourage a UUID format that is used by one of the biggest tech companies in the world. I will try to investigate the problem a little bit more, so we can describe the situation in the best way possible.

@ben221199
Author

I did some testing on GUID in C++:

#include <windows.h>   // CLSID, CLSIDFromString; link with ole32.lib
#include <cstdio>
#include <cstring>
#include <fstream>

CLSID StringToGUID(){
	CLSID clsid;
	const wchar_t* clsid_str = L"{5C98B8E4-2B3D-12B1-ADE2-0000F86456B2}";
	CLSIDFromString(clsid_str, &clsid);
	return clsid;
}

int main(){
	CLSID cls = StringToGUID();

	// Copy the raw in-memory bytes of the struct.
	char data[16];
	memcpy(&data, &cls, 16);

	// Write every byte as two uppercase hex digits.
	std::ofstream myfile;
	myfile.open("example.dat");
	for(int i=0;i<16;++i){
		char c[3];  // two hex digits + terminating NUL
		sprintf(c, "%02X", data[i] & 0xFF);
		myfile << c;
	}
	myfile.close();
	return 0;
}

When viewing the content of example.dat, the following value is visible: E4B8985C3D2BB112ADE20000F86456B2.

Comparing it with the input string, it is definitely little-endian. Note that CLSIDFromString is a Windows function: https://learn.microsoft.com/en-us/windows/win32/api/combaseapi/nf-combaseapi-clsidfromstring. Discouraging this representation format seems like a bad idea to me in this case.


Imagine having a UUID with the bytes 00 11 22 33 44 55 66 77 88 99 AA BB CC DD EE FF saved in memory or in a file. In big-endian there is no trouble: the UUID is always represented in the same (network) byte order, regardless of the fields of the wire format.

However, when going to GUID, the same sequence of bytes in memory will be rendered differently. The representation format is a little different, because little-endian is used.

Actually, if you look very closely at the whole thing, it is not the representation format that behaves as if it is little-endian, but the binary format. In the case of 5C98B8E4-2B3D-12B1-ADE2-0000F86456B2, the Data1 value of the GUID struct is 1553512676. In hexadecimal format that is 0x5C98B8E4. That is the same order as the representation format!

So, it is like the following:

UUID: Representation format <--(no conversion)--> Field values <--(no conversion)--> Wire format
GUID: Representation format <--(no conversion)--> Field values <--(per-field endianness conversion)--> Wire format

Note that GUID uses a Data1/Data2/Data3/Data4 struct that is linked to the Microsoft variant, but non-Microsoft variants are also used in this little-endian world. I have to investigate the handling of the version field in that case some more, because the version field is located in the part that is visibly influenced by little-endian.
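
(At the field-value level the version looks unaffected, though; a tiny sketch, assuming the struct layout above:)

#include <cstdint>

// Once Data3 has been decoded into a native uint16_t, its top nibble is the
// UUID version, whatever the on-disk byte order was. For the CLSID used in
// the test above, Data3 is 0x12B1, so the version nibble is 1.
int uuid_version(uint16_t data3) {
    return (data3 >> 12) & 0xF;
}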

I will also make a table explaining it in some way:

Name              | Endianness    | Description
------------------|---------------|-----------------------------------------------------------
UUID (initially)  | Big-endian    | Originally the Apollo UUID, which was saved in binary as big-endian.
UUID (currently)  | Both          | Name for all 128-bit values that follow the specs, like RFC 4122, including the Microsoft variant.
GUID (initially)  | Little-endian | Name for the specific Microsoft variant.
GUID (currently)  | Both          | Same as UUID currently; GUID and UUID are used interchangeably at the moment.

So if you have a struct like this:

struct GUID
{
    uint32_t Data1;
    uint16_t Data2;
    uint16_t Data3;
    uint8_t  Data4[8];
};

In the big-endian case, you DON'T HAVE TO flip bytes when encoding to and decoding from the representation format and the wire format.
In the little-endian case, you only HAVE TO flip bytes when encoding to and decoding from the wire format. Between field values and the representation format, things stay the same.
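
A minimal sketch of that per-field flip, going from field values to the big-endian wire format (GUID_ is a local stand-in for Microsoft's struct, to avoid clashing with the Windows typedef):

#include <cstdint>
#include <cstring>

struct GUID_ {
    uint32_t Data1;
    uint16_t Data2;
    uint16_t Data3;
    uint8_t  Data4[8];
};

// Serializes the field values to the big-endian (RFC 4122 "UUID") wire
// format. On a little-endian machine, memcpy'ing the struct instead would
// produce the flipped "GUID on disk" layout described above.
void guid_to_wire(const GUID_& g, uint8_t out[16]) {
    out[0] = g.Data1 >> 24; out[1] = g.Data1 >> 16;
    out[2] = g.Data1 >> 8;  out[3] = g.Data1;
    out[4] = g.Data2 >> 8;  out[5] = g.Data2;
    out[6] = g.Data3 >> 8;  out[7] = g.Data3;
    memcpy(out + 8, g.Data4, 8);  // single octets: nothing to flip
}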

@ben221199
Author

Let's take IID_IActivation as an example. It has the following value: 4d9f4ab8-7d1c-11cf-861e-0020af6e7c57.


UUID

Using https://www.uuidtools.com/decode, it tells us it is the DCE variant, version 1.

The UUID is saved in the same byte order.


GUID

Using the GUID struct, the value is parsed. The fields have the following values:

  • Data1: 0x4d9f4ab8
  • Data2: 0x7d1c
  • Data3: 0x11cf
  • Data4: [0x86, 0x1e, 0x00, 0x20, 0xaf, 0x6e, 0x7c, 0x57] (array of 8 items)

Note that we still have the same order as the representation format.

The GUID is saved in little-endian, so the bytes should be flipped per field, as seen above:

  • 0x4d9f4ab8 => 0xb84a9f4d
  • 0x7d1c => 0x1c7d
  • 0x11cf => 0xcf11
  • The array stays as-is, because it consists of separate octets and single octets don't have endianness.

In the end, concatenate: 0xb84a9f4d + 0x1c7d + 0xcf11 + array => b84a9f4d1c7dcf11861e0020af6e7c57 (as seen in the hex editor).
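
A quick sanity check of this flip, reusing the GUID_ struct from the sketch above and assuming a little-endian host:

#include <cstdio>
#include <cstring>

// On a little-endian host, memcpy of the struct reproduces the on-disk bytes.
int main() {
    GUID_ g = {0x4d9f4ab8, 0x7d1c, 0x11cf,
               {0x86, 0x1e, 0x00, 0x20, 0xaf, 0x6e, 0x7c, 0x57}};
    unsigned char raw[16];
    memcpy(raw, &g, sizeof g);
    for (unsigned char b : raw) printf("%02x", b);  // b84a9f4d1c7dcf11861e0020af6e7c57
    printf("\n");
}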


Conclusion

Maybe I have used too many words for it, but we are NOT talking about a new representation format, as I have found out by now. We are talking about Microsoft having a different way of writing UUIDs to disk. It is not big-endian, like the normal UUID, but also not a fully flipped 16-byte value. It is something in between, because they use their own GUID struct for chunking.

I propose not to add it as a new representation format. It simply isn't one, and I was wrong about that at the beginning of this issue.

I propose we add a few short sentences about Microsoft using a slightly different way of writing the UUID to disk/memory, and further say that defining how is out of scope for this specification, because it actually has to do with Microsoft's variant, which is out of scope too.

Something like this:

When saving a UUID in its binary form, this is normally done by sequencing all fields in big-endian order. Microsoft uses a slightly different method when saving GUIDs, but this is out of scope for this specification. This different method doesn't have any effect on the representation format, so parsing human-readable UUIDs should not give a problem.

@ben221199
Author

Defining this different method will be done in the historical RFC that @kyzer-davis and I will try to make after the successor of RFC 4122 is published.

@kyzer-davis
Collaborator

When saving a UUID in its binary form, this is normally done by sequencing all fields in big-endian order. Microsoft uses a slightly different method when saving GUIDs, but this is out of scope for this specification. This different method doesn't have any effect on the representation format, so parsing human-readable UUIDs should not give a problem.

Defining this different method will be done in the historical RFC that @kyzer-davis and I will try to make after the successor of RFC 4122 is published.

Just reviewed:
So do we want the MS GUID text, or are we good with covering that topic in the historical doc?
Let me know so I can get the latest Draft 05 up to IETF for the last week of WGLC review.

If I add the text, I will add an asterisk next to [Microsoft] in the first section 4 bullet and then cite that text at the end of section 4.

@ben221199
Author

I think we should maybe only mention the fact that Microsoft's GUID has a slightly different binary output than the original UUID to take into account, but nothing more than that. The details will be in the historical RFC. The text I made previously (When saving ... a problem.) seems good enough to me, but you can make some modifications to it. I would suggest adding the text after paragraph 2 in section 4.

What do you mean by adding an asterisk next to [Microsoft] in the first section 4 bullet? (I assume you mean Some UUID ... curly braces.) The binary form of Microsoft's GUID doesn't have anything to do with the curly braces.

@kyzer-davis
Collaborator

What do you mean by adding an asterisk next to [Microsoft] in the first section 4 bullet? (I assume you mean Some UUID ... curly braces.) The binary form of Microsoft's GUID doesn't have anything to do with the curly braces.

Yeah, I was trying to tie it to MS so it flows nicely, but since we are talking about big-endian encoding in paragraph 2, it can work there. Let me whip up a PR real fast.

@ben221199
Author

Sure. I have already started improving the Wikipedia article.

@kyzer-davis
Collaborator

@ben221199, take a peek at 408c5a7

If that looks good, I will merge down and get this back to IETF under draft 05, closing this thread.

@ben221199
Author

Seems good to me. I think that DCOM (Distributed Component Object Model) is better than COM (https://nl.wikipedia.org/wiki/Distributed_component_object_model), but I also see COM appearing in some docs. I see that Raymond uses COM, so let's stick with that. Perfect.
