Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MNT: Suggestion how to simplify serialization of funcs #320

Conversation

BenjaminBossan
Copy link
Collaborator

Description

It looks like the state we save for functions contains unnecessary information (the class name, which is not used) and duplicate information (the module name appears twice). This change gets rid of that unnecessary information.

This also allows to remove the need for ufunc_get_state, which can now be replaced by function_get_state.

Comment

Maybe I'm missing something and there is a reason for the previous structure, but if there is, there should be a test to check whatever is covered by that, or at the very least a comment to explain.

Btw. can't remember who wrote this, could be my fault :)

Description

It looks like the state we save for functions contains unnecessary
information (the class name, which is not used) and duplicate
information (the module name appears twice). This change gets rid of
that unnecessary information.

This also allows to remove the need for ufunc_get_state, which can now
be replaced by function_get_state.

Comment

Maybe I'm missing something and there is a reason for the previous
structure, but if there is, there should be a test to check whatever is
covered by that, or at the very least a comment to explain.

Btw. can't remember who wrote this, could be my fault :)
@BenjaminBossan
Copy link
Collaborator Author

@skops-dev/maintainers ready for review

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice cleanup

@@ -200,13 +201,9 @@ def _construct(self):
# get_state method for them here. The load is the same as other functions.
def ufunc_get_state(obj: Any, save_context: SaveContext) -> dict[str, Any]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you want to remove this then?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes, good catch, done

@adrinjalali
Copy link
Member

We finally got here. This simplification means users with a new skops version won't be able to load a model dumped with an older skops version. So we need to increase the file format version with this, and somehow route loading to the right methods.

The "easiest" way would be to have a separate folder/module for each version, I'm not sure how we can do better.

@BenjaminBossan
Copy link
Collaborator Author

Aha, yes, totally right.

The "easiest" way would be to have a separate folder/module for each version, I'm not sure how we can do better.

I assume that is to avoid cluttering the code with if protocol == ...? As long as there are only few changes, I could live with this clutter, depending on how long we want to keep supporting older protocols. If we want to support them for a long time or even forever, I see how that would make the code hard to maintain. But that's probably unnecessary, given how sklearn itself doesn't give any such guarantees.

It would be nice if we could automatically dispatch not only on type but also protocol. Then we could have different versions for, say, FunctionNode registered and the right one would be dispatched to automatically. However, this doesn't work because we hard-code which Node class should be used for a given state (though some dict magic could be used to address this) and because we define the mapping at root level, so it is fixed once skops is loaded. What would you suggest?

@adrinjalali
Copy link
Member

given how sklearn itself doesn't give any such guarantees.

The comparison here is with pickle, not scikit-learn, and I think you can load old pickle files pretty much forever. At least for now, I'm okay with not supporting our first versions forever, but we need to have a clear and easy way for people to convert their files to a newer version.

adding a version to the dispatcher sounds like a very good idea to me, do you think you could come up with a prototype for that?

@BenjaminBossan
Copy link
Collaborator Author

The comparison here is with pickle, not scikit-learn

Hmm, yes and no. Of course, we're positioning ourselves to replace pickle for this purpose. But if a user wants to load a 5 year old sklearn model dumped with skops, they would either be on an unsupported sklearn version or they would load with a new sklearn version, which gives no guarantees that the model still works.

I think it would provide tremendous value if we could give long guarantees, longer than sklearn would. But I think we should allow ourselves a "breaking" release (skops 1.0) as we should expect the format to still change quite a lot.

adding a version to the dispatcher sounds like a very good idea to me, do you think you could come up with a prototype for that?

Not sure, given the problems I mentioned above, I'll think about it.

BenjaminBossan added a commit to BenjaminBossan/skops that referenced this pull request Mar 21, 2023
Taking the changes from skops-dev#320 to how functions are persisted and adding
them here. Therefore, this PR supersedes skops-dev#320.

That change is added here because it is the perfect test case for the
update route of the skops protocol.
@BenjaminBossan
Copy link
Collaborator Author

Closed in favor of #322

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants