-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEAT CLI conversion #249
FEAT CLI conversion #249
Conversation
No detailed review from me yet, just some more general things to discuss:
|
Ready for review @skops-dev/maintainers :) |
Addressing @BenjaminBossan's comments:
|
My impression is that some of the things we discuss depend on what the general expectations of a command line tool are. I've never seen the use of (To be clear, I don't think we need rich, it's just in case we want to make the output prettier.) Regarding output dir, we could keep it (but I wonder if
On the other hand, having a way to indicate the output file name is something I'd expect from similar tools, that's why I would have added it (but we could do that later). Regarding the |
As for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need an empty __init__.py
file, do we?
Nothing major here, looks pretty good to me. I don't have a strong opinion on print vs logging, but if we're using logging
, we should certainly get an instance for this library, and under that a CLI one, instead of using the global instance.
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Addressed your PR comments, feel free to have another look @adrinjalali :) As it stands:
I still want to add in documentation to cover the additions, I could maybe use someone pointing me in the right direction to write up changes in the docs folder as I've not got much experience writing documentation :') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks quite good, but there still some points to discuss. (Interesting how many design decisions still have to be made for such a seemingly simple task.)
but if we're using logging, we should certainly get an instance for this library,
I think this has not been addressed. IIUC, Adrin means that we should have a logger instance, i.e. logger = logging.getLogger(...)
and call logger.info(...)
etc.
Personally, I still find it strange to use logging here, but if you two agree, it's fine by me.
Users can pass in multiple input files and output file names
Really not sure if we need this or whether it makes our code more complicated then needs be? If users want to convert multiple files, there are already ways to achieve this in the command line, e.g. through globbing or a small bash script.
I still want to add in documentation to cover the additions, I could maybe use someone pointing me in the right direction to write up changes in the docs folder as I've not got much experience writing documentation :')
I'm sure you can do it, read the existing docs for persistence and then try to put yourself in the shoes of a user, what information would be most important to them? The outcome doesn't have to be perfect on first try, we can iterate in this review.
Could you please also add an entry to changes.rst
?
One more minor thing, when I call skops --help
, I get this error:
$ skops --help
usage: Skops {convert}
Skops: error: the following arguments are required: function
On holidays, so I'll either review next year or happy for this to be merged, but one think which I think is important is for us to have the |
I'll be on holiday too until next year, but will take a look from time to time when I have the opportunity. |
Ok, phew, should now be ready to review again @skops-dev/maintainers :) A couple of notes:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty good, I tested it locally and it works as expected. Thanks also for adding to the docs.
Btw. I checked and it seems we already have a transient dependency on click
, so it would theoretically not cost us anything to build our CLI with it. I assume that after all this work, you're not interested in rewriting it with click
, which is fine, I just wanted to mention it.
There still a few points left that I commented on, please take a look. Also, I wonder about this comment:
but if we're using logging, we should certainly get an instance for this library,
I think this has not been addressed. IIUC, Adrin means that we should have a logger instance, i.e. logger = logging.getLogger(...) and call logger.info(...) etc.
"""Takes in verbosity from a CLI entrypoint (number of times -v specified), | ||
and sets the logger to the required log level""" | ||
|
||
all_levels = [logging.WARNING, logging.INFO, logging.DEBUG] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, warning level logs would always be shown, even if I don't add -v
. IMO, this should not be the case. So I would prefer that without -v
, the log level should be ERROR
or CRITICAL
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the python docs:
The default level is WARNING, which means that only events of this level and above will be tracked, unless the logging package is configured to do otherwise.
I followed this because it's the default for python packages.
We can adjust the default level for our CLI, or we could also add a -q, --quiet
option to reduce the log level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is true for logging in general, but for CLIs, there is the convention to not print anything if there is no issue. Ideally, I would like to avoid having a --quiet
option, because it increases complexity and because it's unclear how it interacts with -v
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, good point.
I would raise that it could be considered an issue if untrusted types were present.
If a user had a file with untrusted types in it, and we converted it silently, they wouldn't know they needed to specify anything as trusted when using it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, this is more of an issue for the person loading the file, and they will know (unless they use trusted=True
). Personally, I would not report it by default, but I can live with the current way.
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Thanks for the review @BenjaminBossan, I've made adjustments along the lines of most of your comments, let me know if there's anything I've missed. WRT logging, I've moved to using a That way, if we do implement any different logging elsewhere in our codebase, it automatically keeps the config specified for whatever command we're calling. Let me know if you think there's a better way to achieve this; Also, the log format currently looks like:
How does this look to all? We can add things like time, function call, etc at a later date if we need it. Also, when we implement logging somewhere else, might make sense to move this formatting out to a common module. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for implementing the changes, this looks great to me now. Let's wait for one more review before merging.
WRT logging, I've moved to using a logger instance, but still set a basic config at the top level for our selected entrypoint, as I felt that the config should be set at the top level for whatever entrypoint we're using.
Also, when we implement logging somewhere else, might make sense to move this formatting out to a common module.
I think it's good for the time being.
Also, the log format currently looks like:
INFO : No unknown types found in abc.
How does this look to all? We can add things like time, function call, etc at a later date if we need it.
Yes, let's keep it simple for now.
|
||
Parameters | ||
---------- | ||
input_file : os.PathLike |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hard to say if a user who doesn't know the type would guess that str
is a path-like. Anyway, I won't object to leaving it as is, especially since it means that the type annotation and docstring are identical, I just wanted to bring up the possible friction here.
skops/cli/_convert.py
Outdated
f"While converting {model_name}, " | ||
"the following unknown types were found: " | ||
f"{untrusted_str}. " | ||
f"When loading {output_file}, add 'trusted=True' to the skops.load call. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this touches on a more general point of how we should deal with trusted=True
. I posted on discord about it, which is a better place to discuss than here :)
"""Takes in verbosity from a CLI entrypoint (number of times -v specified), | ||
and sets the logger to the required log level""" | ||
|
||
all_levels = [logging.WARNING, logging.INFO, logging.DEBUG] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is true for logging in general, but for CLIs, there is the convention to not print anything if there is no issue. Ideally, I would like to avoid having a --quiet
option, because it increases complexity and because it's unclear how it interacts with -v
.
def _convert_file( | ||
input_file: os.PathLike, | ||
output_file: os.PathLike, | ||
logger: logging.Logger = logging.getLogger(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should instantiate the logger just once at the root of the module, and then just set logger=logger
here. Is there any disadvantage to that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, we could, although that would mean it gets initialised even if we import functions from the file to use elsewhere.
I don't think it would really impact much, I just personally dislike writing anything other than definitions at the top level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that would mean it gets initialised even if we import functions from the file to use elsewhere.
That's true. On the other hand it's now being initialized each time the function is called. Given that CLI usage consists of creating a new process anyway, it makes indeed no difference either way. If we ever get to a point where we use the logger elsewhere, we might have to rethink this.
def _convert_file( | ||
input_file: os.PathLike, | ||
output_file: os.PathLike, | ||
logger: logging.Logger = logging.getLogger(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that would mean it gets initialised even if we import functions from the file to use elsewhere.
That's true. On the other hand it's now being initialized each time the function is called. Given that CLI usage consists of creating a new process anyway, it makes indeed no difference either way. If we ever get to a point where we use the logger elsewhere, we might have to rethink this.
"""Takes in verbosity from a CLI entrypoint (number of times -v specified), | ||
and sets the logger to the required log level""" | ||
|
||
all_levels = [logging.WARNING, logging.INFO, logging.DEBUG] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, this is more of an issue for the person loading the file, and they will know (unless they use trusted=True
). Personally, I would not report it by default, but I can live with the current way.
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
@BenjaminBossan do we still want one more review before merging this in? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for the hard work. This LGTM and can be merged if @adrinjalali is also fine with it.
As mentioned in some of the comments, there are some things that I think can be improved, but they are not blocking this PR, as they can be added later without BC breakage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @E-Aho . I think we might need improve docs when we get back to this, but happy to merge this and have it out.
Addresses #241
This PR adds in an entrypoint,
skops convert
, which allows users to automatically convert files from the command line.I didn't end up adding
inspect
, as it felt odd to me to inspect a .pkl file which would require loading it intoskops
format based on how we currently have the safety checks implemented. Happy to implement an entrypoint to do it if other maintainers think it would be useful, it would essentially just need to not save the outputskops
in the current conversion method.