Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hide serialization/deserialization logic from the public API #117

Open
jngrad opened this issue Feb 24, 2025 · 0 comments
Open

Hide serialization/deserialization logic from the public API #117

jngrad opened this issue Feb 24, 2025 · 0 comments

Comments

@jngrad
Copy link
Member

jngrad commented Feb 24, 2025

The internal state of the pyMBE object is currently stored in a single data.frame. The checkpointing mechanism (write_pmb_df() and read_pmb_df()) relies on several helper methods to serialize and deserialize that data.frame, which are currently part of the public API. This is most likely unintentional. This has a negative impact on the project usability and sustainability. First, users can directly call these methods, and in doing so, might potentially leave the internal data.frame in an inconsistent state; these helper methods should instead be private methods (e.g. with a leading underscore in their name). Second, these helper functions are written as stateful methods (they take self as argument), which means the way they serialize/deserialize data can depend on the state of data members of the class; they should be rewritten as stateless methods, e.g. take cls as argument, to make it impossible for them to change behavior during the pyMBE object lifetime. Third, the exact data format used to store the data.frame should be an implementation detail and allowed to change; right now, if we want to refactor the checkpointing logic in any meaningful way, we must release a new major version of pyMBE, because this would be a silent API change of the public serialization/deserialization helper methods. This was an issue in #113, where I replaced the unsafe ast.literal_eval() calls by equivalent json.loads() calls, but had to do so while maintaining the exact same method names, argument lists and method behavior, so as to abide by our semantic versioning guidelines.

Proposed solution:

  • make all serialization/deserialization helper methods stateless by rewriting them as class methods (@classmethod) or static methods (@staticmethod), depending on whether they need to be able to call other helper methods in the same class or not
  • move these helper methods to a new private class called _Checkpointing or similar to fully encapsulate the code

With the aforementioned checkpointing class, we could in the future explore other checkpointing mechanisms without changing the public API. The binary compatibility might change, though, for example a new checkpointing mechanism that relies on SQL can no longer read an older data.frame saved as CSV. There is no easy way to handle this, even in ESPResSo we have no mechanism to deal with this. We could hardcode the release number in the data.frame, but then changes to the internal checkpointing mechanism must be reflected in the public API. We could hardcode the commit revision in the data.frame, but this would be quite disruptive to users who also develop code. To my knowledge, other software that deal with this kind of issue often use a version number, which is manually incremented every time a serialization function changes how data is written to a file. This version number isn't necessarily part of the public API.

Right now we enforce a CSV file format whose cells are encoded in JSON. Before #113, the cells could contain any executable Python code, which was quite dangerous. We could explore other data formats like SQL. We could also explore maintaining multiple data.frame objects, one for each feature (bonds, particles, interactions, etc.), instead of maintaining the current data.frame that mixes all features together in a sparse matrix (whose null elements are pd.NA, which is confusing and makes checkpointing a lot more difficult).


Regarding private methods in Python: the easiest way is to prefix a function or class name with a single leading underscore. Two leading underscores are reserved for overridable Python class methods like __init__(). The leading underscore tells Sphinx to ignore this class/method/function/attribute in the generated documentation, and tells the Python interpreter to skip them in the autocompletion feature. Here is a MWE tested in the Python interpreter and Jupyter Console (pip install jupyter-console):

class MyClass:
    DELICIOUS_PIE = 3.14
    _FORBIDDEN_PIE = 3.15
    def test(self): pass
    def _zest(self): pass

obj = MyClass()
# in the line below, use Tab autocompletion
obj.

Output:

obj.DELICIOUS_PIE  obj.test()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants