Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for "non traditional" dataclasses #495

Closed
provinzkraut opened this issue Jul 26, 2023 · 5 comments · Fixed by #501
Closed

Support for "non traditional" dataclasses #495

provinzkraut opened this issue Jul 26, 2023 · 5 comments · Fixed by #501

Comments

@provinzkraut
Copy link

Description

I ran into an issue encoding EdgeDB's Objects with msgspec, where it would always come out as an empty dict.

However, according to their docs, these support the dataclass protocol, and inspecting them manually reveals this to be the case. All conversion / inspection methods from the dataclasses module (asdict, astuple, fields, etc.) work as expected on them.

I checked how msgspec encodes these and I believe the issue might have something to do with the fact that msgspec tries to serialise any object that reports being a dataclass (i.e. has the respective __dataclass_fields__ set) [1], using its __dict__ and __slots__ [2]. This however seems to cause issues because the EdgeDB Object does not meet this expectation.

I realise this might be a bit tricky, but at the moment I can't really think of any workaround, since msgspec thinks these are a supported type, therefore preventing the use of a custom enc_hook.

Do you see a way for msgspec to either fully support dataclasses like these or raise an exception so that one could use an enc_hook?

@jcrist
Copy link
Owner

jcrist commented Jul 26, 2023

That's unfortunate. You're correct in your assessment of the cause here.

Two options:

  • We could change how we're encoding dataclasses to iterate over fields and call getattr to retrieve each one. This will have a performance cost, but I'm not sure how much. It's also tricky since the type information needed to implement this is cached on the type (for performance reasons), but we currently only do that caching when decoding not encoding. So encoding would have to now check for cached type info, create/cache it if missing, all while handling race conditions. There are correctness benefits to doing things this way though. Currently we serialize any attribute lacking a leading underscore, regardless if it's a declared field on the dataclass. Switching to rely on the dataclass fields instead would avoid this issue.

  • We might be able to improve our is-this-a-dataclass check to exclude edgedb's fake dataclasses. While they do work with the builtin dataclasses methods, there are quite a few hacks needed to make this work (although msgspec is also full of similar hacks :)). Excluding these as not-really-dataclasses feels reasonable. This would let you use the existing enc_hook method to encode these objects. This would be slower than option 1, but would avoid slowing down serialization of standard dataclasses.

In the short term I suspect that option 2 will be easier (if it's possible). In the long term option 1 may be a better choice.

@jcrist
Copy link
Owner

jcrist commented Jul 28, 2023

Ok, I did some experimenting with edgedb last night. It's annoying that these look like dataclasses but don't make use of any of the standard python dataclass machinery. In short:

  • All edgedb.Object instances are direct instances of edgedb.Object. There's no subclassing for various types. In this respect they're more like an AttrDict that happens to quack like a dataclass.
  • You cannot instantiate these classes yourself, they can only be returned from edgedb api calls. So we don't need to worry about decoding into them, only encoding.
  • To make them look like dataclasses they define both instance and class values for __dataclass_fields__. The instance values have the true field information, the class attribute (edgedb.Object.__dataclass_fields__) is an empty dict and only exists to satisfy the dataclasses.is_dataclass call.

If you look at the dataclass docs there's no mention of __dataclass_fields__ anywhere - the existence of this attribute is an implementation detail for dataclasses, not a protocol that other projects are supposed to emulate.

Anyway, two options for how to support these here:

Option 1

Any dataclass-like thing that looks magical (custom tp_getattro, no __dict__, ...) takes an alternate path that looks up fields in __dataclass_fields__. This makes no changes to the common path, and only exists to support the edgedb.Object types. Most users won't run into this distinction as most users are treating dataclasses as simple containers. The downside here is that it's harder to document the behavior since it adds another branch to the how-are-dataclass-like-things-encoded logic.

Option 2

We redo our dataclass encoding support to always lookup fields in __dataclass_fields__. This is a uniform path for anything that has a __dataclass_fields__ attribute. For performance reasons we still do some hackery to infer if something is a ClassVar/InitVar (since those aren't filtered out of __dataclass_fields__), but that's not too bad.

The nice thing about this design is it fixes a wart on our current dataclass support. Currently we encode any attributes on a dataclass that don't start with a _. In the common case this is what the user likely wants (and is also what orjson does). But in the presence of weird things (intentional _ variables, functools.cached_property, ...) this can result in unexpected behavior. With this new design this issue is fixed - we only encode attributes that were actually declared on the dataclass. I also like that it's a uniform path for all dataclass-like things, which makes it easy to document.

The downside is it comes at a performance cost. I wrote up a POC implementation - here's a quick benchmark comparing msgspec & orjson for encoding dataclasses with slots=False and slots=True:

Currently

Encoding a dataclass with 8 fields
* slots = False
  - orjson: 271.9 ns
  - msgspec: 214.2 ns
* slots = True
  - orjson: 671.1 ns
  - msgspec: 212.8 ns

With this change

Encoding a dataclass with 8 fields
* slots = False
  - orjson: 270.1 ns
  - msgspec: 263.3 ns
* slots = True
  - orjson: 672.6 ns
  - msgspec: 322.9 ns

Currently we're faster than orjson for encoding all dataclasses. With this change we're still always faster for slots=True dataclasses, but slower for larger slots=False classes (the default). If you model the cost of encoding a dataclass as y = m * nfields + b, with this change our constant b is still lower than orjson's, but the slope m is now higher. The crossover point (on my machine) seems to be nfields=8.


I could go either way here and would love a second opinion @provinzkraut if you have the time.

@provinzkraut
Copy link
Author

* To make them look like dataclasses they define both instance and class values for `__dataclass_fields__`. The instance values have the true field information, the class attribute (`edgedb.Object.__dataclass_fields__`) is an empty dict and only exists to satisfy the `dataclasses.is_dataclass` call.

Does that information need to necessarily be pulled from the class? I'm thinking that for decoding, having the instance __dataclass_fields__ should be enough. Am I missing something there?


I could go either way here and would love a second opinion @provinzkraut if you have the time.

My initial thought would be that option 2 is preferable, as it is the "correct" way of doing things, even though it comes at a performance cost. Would you be able to make the POC implementation of this available somewhere so I can play around with it a bit?

@jcrist
Copy link
Owner

jcrist commented Jul 31, 2023

I'm thinking that for decoding, having the instance __dataclass_fields__ should be enough. Am I missing something there?

No, that's sufficient for encoding - these bullets were mostly just notes on their implementation on how they differed from the standard library.

Would you be able to make the POC implementation of this available somewhere so I can play around with it a bit?

Sure thing - I pushed up a WIP PR at #501. Only the JSON encoder has been updated (and still no tests/docs), but I think the implementation there is solid if this is the path we want to take.

@provinzkraut
Copy link
Author

@jcrist you're too fast for me (=

Anyway, I got to run some tests with your WIP, specifically comparing its performance to orjson. I have a project that has about ~170 dataclasses of varying sizes, complexity, levels of nesting, slots, no slots. I tested the serialization of those with real world data and found that on average, msgspec performs the same as orjson here.

I was planning to have more detailed stats for this, but seeing as you already completed the work on this it seems like that's not needed anymore.

Thank you once more for the fast turnaround! ❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants