
Add {numpy, scipy}.ufuncs to the trusted list #224

Closed
adrinjalali opened this issue Nov 29, 2022 · 20 comments · Fixed by #295 or #336
Labels
good first issue (Good for newcomers), persistence (Secure persistence feature)

Comments

@adrinjalali (Member)

I think most (if not all) ufuncs are actually safe, so we can add them to the default trusted list and reduce the size of untrusted_types.

@adrinjalali added the persistence (Secure persistence feature) label on Nov 29, 2022
@BenjaminBossan (Collaborator)

I would also assume that those ufuncs are safe, since they operate on a very specific set of inputs and should have no side effects, but I'm not sure how to verify this without actually going through the code.

Just a quick check I did:

>>> import numpy as np
>>> import scipy.special
>>> npufuncs = [getattr(np, a) for a in dir(np) if isinstance(getattr(np, a), np.ufunc)]
>>> len(npufuncs)
91
>>> spufuncs = [getattr(scipy.special, a) for a in dir(scipy.special) if isinstance(getattr(scipy.special, a), np.ufunc)]
>>> len(spufuncs)
231

Not sure if that's all.

@adrinjalali (Member, Author)

@rgommers would you know if we can consider all these ufuncs safe in this context, and whether @BenjaminBossan's snippet captures all of them?

Context: we've created an alternative to the pickle persistence format, where users have to pass the types they trust for a file to load. Right now, none of the ufuncs are trusted, even though they are used inside sklearn estimators. We're trying to reduce the noise on the user side by trusting certain funcs/objects by default.

@rgommers commented Dec 1, 2022

Ufuncs should be stateless, so safe to obtain from numpy or scipy.

"they operate on a very specific set of inputs" isn't quite true, at least in terms of Python types, given that they accept anything that can be coerced to a numpy.ndarray. But that should not be relevant I'd think.

The list isn't quite complete for scipy. There are also some ufuncs in scipy.stats._boost.
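Ufuncs living in a private subpackage such as scipy.stats._boost won't show up via dir() on the top-level package, but they can be found by walking the package tree. A minimal sketch of that idea, demonstrated on the stdlib json package so it runs without scipy installed (for skops, scipy.stats would take json's place):

```python
# Hedged sketch: discover submodules of a package by walking its tree.
# Demonstrated on the stdlib `json` package; swapping in scipy.stats
# would surface private submodules like scipy.stats._boost.
import pkgutil

import json


def list_submodules(pkg):
    """Return the dotted names of all submodules of `pkg`."""
    return sorted(
        name
        for _, name, _ in pkgutil.walk_packages(
            pkg.__path__, prefix=pkg.__name__ + "."
        )
    )


submodules = list_submodules(json)
print(submodules)  # includes 'json.decoder', 'json.encoder', 'json.scanner'
```

Each submodule found this way can then be imported and scanned for np.ufunc instances.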

@E-Aho added the good first issue (Good for newcomers) label on Dec 3, 2022
@anamika-yadav99

Hi @adrinjalali! I'm an undergrad from India. I was exploring the repository and found it interesting. I want to contribute to the project. Can I take up this issue?

@adrinjalali (Member, Author)

@anamika-yadav99 yes of course :) You can have a look at #237 for a similar pull request.

@BenjaminBossan (Collaborator)

@anamika-yadav99 Do you still plan to work on this?

@anamika-yadav99

Hi @BenjaminBossan! Apologies, give me until this weekend. I'll raise a PR.

@BenjaminBossan (Collaborator)

@anamika-yadav99 a friendly ping to ask if you still plan on working on this :)

@omar-araboghli (Contributor)

Hi, I would be happy to contribute, if @BenjaminBossan and @anamika-yadav99 would give the 🟢 light!

@BenjaminBossan (Collaborator)

@omar-araboghli Since we haven't heard back in a while from Anamika, I think it's safe to say that you can take this task.

@omar-araboghli (Contributor)

Hi @BenjaminBossan and @adrinjalali, I have run some experiments to make sure we're including all ufuncs from both numpy and scipy. I compared @BenjaminBossan's snippet (v1) with my version (v2), which traverses all modules and inspects the objects typed as np.ufunc. I found additional ufuncs that are not mentioned in the documentation. For some reason, the ufuncs under scipy.stats._boost are not captured either. Still looking into that.

Please check the following snippet and output, and advise whether it's safe to include the inspected ufuncs. The traversal is inspired by sklearn.utils.all_estimators().

import importlib
import inspect
import pkgutil
from types import ModuleType
from typing import Optional, Type

# `get_type_name(obj)` is assumed to return the fully qualified dotted name
# of `obj`, e.g. "numpy.absolute" (as in the skops utilities).


def all_typed_instances(
    root_module: ModuleType,
    base_type: Type,
    modules_to_ignore: Optional[list[str]] = None,
) -> list[str]:
    if not root_module or not base_type:
        return []

    _MODULES_TO_IGNORE = [
        "tests",
        "setup",
        "conftest",
    ]

    _SUBSTRINGS_TO_IGNORE = [
        ".__",
        "test",
    ]

    if modules_to_ignore is None:
        modules_to_ignore = _MODULES_TO_IGNORE
    else:
        modules_to_ignore += _MODULES_TO_IGNORE

    typed_instances = []
    for _, module_name, _ in pkgutil.walk_packages(
        root_module.__path__, prefix=f"{root_module.__name__}."
    ):
        # Skip test modules, setup scripts, and dunder modules.
        module_parts = module_name.split(".")
        if (
            any(part in modules_to_ignore for part in module_parts)
            or any(substring in module_name for substring in _SUBSTRINGS_TO_IGNORE)
        ):
            continue

        # Some submodules fail to import (e.g. missing optional
        # dependencies); skip those rather than aborting the walk.
        try:
            module = importlib.import_module(module_name)
        except Exception:
            continue

        typed_members = [
            get_type_name(member) for _, member in inspect.getmembers(module)
            if isinstance(member, base_type)
            and get_type_name(member).startswith(root_module.__name__)
        ]
        typed_instances.extend(typed_members)

    return sorted(set(typed_instances))

----------------------------------------------------

numpy_v1 = sorted(set([
    get_type_name(getattr(np, attr))
    for attr in dir(np)
    if isinstance(getattr(np, attr), np.ufunc)
    and get_type_name(getattr(np, attr)).startswith("numpy")
]))

numpy_v2 = all_typed_instances(np, np.ufunc)

scipy_v1 = sorted(set([
    get_type_name(getattr(scipy.special, attr))
    for attr in dir(scipy.special)
    if isinstance(getattr(scipy.special, attr), np.ufunc)
    and get_type_name(getattr(scipy.special, attr)).startswith("scipy")
]))

scipy_v2 = all_typed_instances(scipy, np.ufunc)

----------------------------------------------------

>>> set(numpy_v1).difference(numpy_v2)
set()
>>> set(numpy_v2).difference(numpy_v1)
{'numpy.linalg._umath_linalg.lstsq_n', 'numpy.linalg._umath_linalg.eigvalsh_lo', 'numpy.core._multiarray_umath.clip', 'numpy.linalg._umath_linalg.inv', 'numpy.linalg._umath_linalg.svd_m', 'numpy.core._multiarray_umath._ones_like', 'numpy.linalg._umath_linalg.svd_n', 'numpy.linalg._umath_linalg.svd_m_s', 'numpy.linalg._umath_linalg.eig', 'numpy.linalg._umath_linalg.svd_n_f', 'numpy.linalg._umath_linalg.eigvals', 'numpy.linalg._umath_linalg.solve', 'numpy.linalg._umath_linalg.qr_complete', 'numpy.linalg._umath_linalg.eigh_lo', 'numpy.linalg._umath_linalg.qr_r_raw_n', 'numpy.linalg._umath_linalg.solve1', 'numpy.linalg._umath_linalg.eigvalsh_up', 'numpy.linalg._umath_linalg.slogdet', 'numpy.linalg._umath_linalg.qr_reduced', 'numpy.linalg._umath_linalg.svd_m_f', 'numpy.linalg._umath_linalg.eigh_up', 'numpy.linalg._umath_linalg.qr_r_raw_m', 'numpy.linalg._umath_linalg.svd_n_s', 'numpy.core._multiarray_umath._arg', 'numpy.linalg._umath_linalg.lstsq_m', 'numpy.linalg._umath_linalg.det', 'numpy.linalg._umath_linalg.cholesky_lo'}
>>> len(numpy_v1)
86
>>> len(numpy_v2)
113

----------------------------------------------------

>>> set(scipy_v1).difference(scipy_v2)
set()
>>> set(scipy_v2).difference(scipy_v1)
{'scipy.stats.invgauss_ufunc._invgauss_cdf', 'scipy.special._ufuncs._riemann_zeta', 'scipy.special._ufuncs._smirnovc', 'scipy.stats.ncx2_ufunc._ncx2_kurtosis_excess', 'scipy.stats.skewnorm_ufunc._skewnorm_ppf', 'scipy.special._ufuncs._cospi', 'scipy.stats.nct_ufunc._nct_mean', 'scipy.stats.ncx2_ufunc._ncx2_isf', 'scipy.stats.skewnorm_ufunc._skewnorm_sf', 'scipy.special._ufuncs._zeta', 'scipy.stats.beta_ufunc._beta_isf', 'scipy.stats.binom_ufunc._binom_cdf', 'scipy.stats.nct_ufunc._nct_cdf', 'scipy.stats.nbinom_ufunc._nbinom_sf', 'scipy.stats.hypergeom_ufunc._hypergeom_cdf', 'scipy.special._ufuncs._spherical_kn_d', 'scipy.stats.ncx2_ufunc._ncx2_variance', 'scipy.special._ufuncs._struve_power_series', 'scipy.stats.hypergeom_ufunc._hypergeom_isf', 'scipy.stats.nbinom_ufunc._nbinom_mean', 'scipy.special._ufuncs._log1pmx', 'scipy.stats.binom_ufunc._binom_sf', 'scipy.special._ufuncs._ellip_harm', 'scipy.special._ufuncs._spherical_yn', 'scipy.stats.invgauss_ufunc._invgauss_mean', 'scipy.stats.invgauss_ufunc._invgauss_ppf', 'scipy.stats.hypergeom_ufunc._hypergeom_variance', 'scipy.stats.skewnorm_ufunc._skewnorm_variance', 'scipy.stats.nct_ufunc._nct_sf', 'scipy.special._ufuncs._cosine_cdf', 'scipy.stats.ncx2_ufunc._ncx2_sf', 'scipy.special._ufuncs._smirnovp', 'scipy.stats.ncx2_ufunc._ncx2_ppf', 'scipy.stats.binom_ufunc._binom_kurtosis_excess', 'scipy.stats.skewnorm_ufunc._skewnorm_kurtosis_excess', 'scipy.stats.hypergeom_ufunc._hypergeom_skewness', 'scipy.stats.ncf_ufunc._ncf_sf', 'scipy.stats.nct_ufunc._nct_isf', 'scipy.special._ufuncs._kolmogc', 'scipy.stats.skewnorm_ufunc._skewnorm_isf', 'scipy.stats.ncx2_ufunc._ncx2_cdf', 'scipy.stats.hypergeom_ufunc._hypergeom_kurtosis_excess', 'scipy.stats.nbinom_ufunc._nbinom_kurtosis_excess', 'scipy.special._ufuncs._spherical_jn_d', 'scipy.stats.binom_ufunc._binom_skewness', 'scipy.stats.ncx2_ufunc._ncx2_mean', 'scipy.stats.nct_ufunc._nct_skewness', 'scipy.stats.beta_ufunc._beta_variance', 'scipy.special._ufuncs._lgam1p', 
'scipy.stats.binom_ufunc._binom_isf', 'scipy.special._ufuncs._spherical_in', 'scipy.stats.beta_ufunc._beta_sf', 'scipy.stats.nbinom_ufunc._nbinom_isf', 'scipy.stats.ncf_ufunc._ncf_pdf', 'scipy.stats.hypergeom_ufunc._hypergeom_ppf', 'scipy.special._ufuncs._lambertw', 'scipy.special._ufuncs._igam_fac', 'scipy.stats.invgauss_ufunc._invgauss_isf', 'scipy.special._ufuncs._smirnovci', 'scipy.stats.beta_ufunc._beta_pdf', 'scipy.stats.ncf_ufunc._ncf_cdf', 'scipy.stats.skewnorm_ufunc._skewnorm_skewness', 'scipy.stats.nct_ufunc._nct_kurtosis_excess', 'scipy.special._ufuncs._spherical_kn', 'scipy.stats.skewnorm_ufunc._skewnorm_pdf', 'scipy.stats.nbinom_ufunc._nbinom_skewness', 'scipy.stats.invgauss_ufunc._invgauss_variance', 'scipy.stats.beta_ufunc._beta_ppf', 'scipy.stats.skewnorm_ufunc._skewnorm_mean', 'scipy.special._ufuncs._spherical_yn_d', 'scipy.stats.invgauss_ufunc._invgauss_skewness', 'scipy.stats.ncf_ufunc._ncf_kurtosis_excess', 'scipy.stats.invgauss_ufunc._invgauss_kurtosis_excess', 'scipy.stats.invgauss_ufunc._invgauss_pdf', 'scipy.special._ufuncs._lanczos_sum_expg_scaled', 'scipy.special._ufuncs._cosine_invcdf', 'scipy.stats.nct_ufunc._nct_variance', 'scipy.stats.nbinom_ufunc._nbinom_pdf', 'scipy.stats.binom_ufunc._binom_variance', 'scipy.special._ufuncs._sf_error_test_function', 'scipy.stats.ncf_ufunc._ncf_skewness', 'scipy.stats.ncx2_ufunc._ncx2_skewness', 'scipy.stats.nct_ufunc._nct_pdf', 'scipy.stats.nbinom_ufunc._nbinom_cdf', 'scipy.special._ufuncs._spherical_jn', 'scipy.stats.binom_ufunc._binom_ppf', 'scipy.special._ufuncs._factorial', 'scipy.stats.ncx2_ufunc._ncx2_pdf', 'scipy.stats.ncf_ufunc._ncf_mean', 'scipy.stats.beta_ufunc._beta_skewness', 'scipy.stats.invgauss_ufunc._invgauss_sf', 'scipy.special._ufuncs._spherical_in_d', 'scipy.stats.beta_ufunc._beta_cdf', 'scipy.special._ufuncs._kolmogp', 'scipy.stats.hypergeom_ufunc._hypergeom_sf', 'scipy.stats.hypergeom_ufunc._hypergeom_pdf', 'scipy.special._ufuncs._struve_bessel_series', 
'scipy.special._ufuncs._kolmogci', 'scipy.stats.ncf_ufunc._ncf_variance', 'scipy.stats.nbinom_ufunc._nbinom_variance', 'scipy.stats.binom_ufunc._binom_pdf', 'scipy.stats.nct_ufunc._nct_ppf', 'scipy.stats.binom_ufunc._binom_mean', 'scipy.stats.beta_ufunc._beta_mean', 'scipy.stats.hypergeom_ufunc._hypergeom_mean', 'scipy.stats.skewnorm_ufunc._skewnorm_cdf', 'scipy.stats.ncf_ufunc._ncf_ppf', 'scipy.special._ufuncs._struve_asymp_large_z', 'scipy.special._ufuncs._sinpi', 'scipy.stats.beta_ufunc._beta_kurtosis_excess', 'scipy.stats.ncf_ufunc._ncf_isf', 'scipy.stats.nbinom_ufunc._nbinom_ppf'}
>>> len(scipy_v1)
230
>>> len(scipy_v2)
342

@BenjaminBossan (Collaborator)

Thanks a lot @omar-araboghli for looking into this. I tried to run your snippet locally but there were a lot of missing imports. Could you be so kind as to provide a code block that works as is?

Regarding the results, I think we're collecting too many functions here. To check, I picked the first and last results from the lists of differences you provided for numpy and scipy:

  • numpy.linalg._umath_linalg.lstsq_n is a "private" function that is not supposed to be used directly, but seems to be a helper function to numpy.linalg.lstsq.
  • numpy.linalg._umath_linalg.cholesky_lo is a "private" function that is not supposed to be used directly, but seems to be a helper function to numpy.linalg.cholesky.
  • scipy.stats.invgauss_ufunc._invgauss_cdf I couldn't find any usage (but it seems to be related to boost).
  • scipy.stats.nbinom_ufunc._nbinom_pp is a "private" function that is not supposed to be used directly, but seems to be a helper function to scipy.stats.nbinom (via scipy.stats._discrete_distns.nbinom_gen).

The problem with adding those is that given their "private" nature, they are much more likely to change in a way that creates maintenance burden for us (being removed, renamed, moved to a different module) while being extremely unlikely to be used by any user directly. And I would assume that if a user wants to use those private functions, they are very knowledgeable, so they should not have a problem adding them to the trusted argument.

Therefore, I would rather be more conservative and only add "public" functions. Do you want to investigate how to filter out the "private" functions?
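The underscore-based filter being asked for here can be sketched as a simple check over a dotted path: a name counts as private if any component is underscore-prefixed. The sample names below are taken from the diff output earlier in this thread; the helper name is hypothetical:

```python
# Hedged sketch: filter out "private" dotted names, i.e. any path in
# which some component starts with an underscore.
def is_public(dotted_name: str) -> bool:
    """True if no component of the dotted path is underscore-prefixed."""
    return not any(part.startswith("_") for part in dotted_name.split("."))


candidates = [
    "numpy.absolute",
    "numpy.linalg._umath_linalg.lstsq_n",
    "scipy.special.erf",
    "scipy.stats.invgauss_ufunc._invgauss_cdf",
]
public = [name for name in candidates if is_public(name)]
print(public)  # ['numpy.absolute', 'scipy.special.erf']
```

This drops both privately named functions and anything reached through a privately named module.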

omar-araboghli added a commit to omar-araboghli/skops that referenced this issue Feb 6, 2023
@omar-araboghli (Contributor)

Hi @BenjaminBossan and thank you for the quick review! I just committed these quick changes on my fork here, and you can also check them in this notebook, though with different numpy/scipy versions.

Thanks for the clarification about the results. I was, and still am, a bit sceptical about the private functions, since @rgommers mentioned _boost as a module containing ufuncs of interest to us. Isn't it a private module? Public functions from a private module should be imported from a public module and caught by the above all_typed_instances() function, but they weren't (or maybe something is still wrong with my implementation).

As a generic approach, would we settle for collecting everything except private and test functions/objects? Should private modules also be dropped?

The filtering can be done with

    _MODULES_TO_IGNORE = [
        "tests",
        "setup",
        "conftest"
    ]

    _SUBSTRINGS_TO_IGNORE = [
        ".__",
        "test"
    ]

@rgommers commented Feb 6, 2023

Note that in the most recent version of SciPy, we consistently use underscores for private things. There's some non-underscored shims for backwards compat, but those all raise warnings and are not present in __all__. So just using dir(scipy) and then dir(scipy.special) (and other submodules) should give you a clean interface to everything that is public.
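The dir()-plus-__all__ approach can be sketched as follows, with __all__ taken as the authoritative list of public names and a fall-back to non-underscored dir() entries. It is demonstrated on the stdlib json module so the snippet stays self-contained; for skops, the same helper would be applied to scipy and its public submodules:

```python
# Hedged sketch: collect a module's public interface, preferring __all__
# over dir(). Demonstrated on the stdlib `json` module; scipy and
# scipy.special would be the real targets.
import json


def public_names(module):
    """Names the module explicitly exports, else non-underscored dir() entries."""
    exported = getattr(module, "__all__", None)
    if exported is not None:
        return sorted(exported)
    return sorted(n for n in dir(module) if not n.startswith("_"))


names = public_names(json)
print(names)  # includes 'dumps' and 'loads'
```

Combined with an isinstance(obj, np.ufunc) check on each name, this yields only the ufuncs that scipy intends to be public.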

@BenjaminBossan (Collaborator)

@omar-araboghli Thanks for providing more context and for the draft PR, now I understand better what's going on.

I think the suggestion by Ralf of using dir(...) is a very good one. This way, we capture the functions that are really intended for usage by scipy users. Maybe we will miss something this way that could be caught with the introspective approach you suggest, but I think that's fine. After all, users can always add those types by hand. And in this case, I feel like robustness is more important than completeness, i.e. even if users may need to add some funcs by hand, they can be sure the same code still works after a scipy upgrade, even if a private function was renamed or moved. Therefore, I'd go with Ralf's suggestion.

I hope we can use the same logic for numpy, but if it's easier to start with scipy and figure out numpy later, we can just have two separate PRs.

@rgommers commented Feb 7, 2023

I hope we can use the same logic for numpy,

Unfortunately not yet, the NumPy public/private split is a huge mess. I plan to fix that for NumPy 2.0.

@BenjaminBossan (Collaborator)

Thanks for the update. @omar-araboghli so how about starting with scipy first and doing numpy after that?

@omar-araboghli (Contributor) commented Feb 9, 2023

Thanks for the discussions. I will take that up tomorrow, or by the weekend at the latest, and raise questions if any come up.

omar-araboghli added a commit to omar-araboghli/skops that referenced this issue Feb 10, 2023
…nd embedding the test within estimators_fitted.
omar-araboghli added a commit to omar-araboghli/skops that referenced this issue Feb 13, 2023
@BenjaminBossan (Collaborator)

Sorry, this should not have been closed, since numpy ufuncs are still not trusted by default.

Numpy dtypes could also be added to the types trusted by default.

@omar-araboghli (Contributor)

I thought a separate issue would be created. I will work on this soon-ish. Thanks for pinging!

omar-araboghli added a commit to omar-araboghli/skops that referenced this issue Apr 1, 2023
omar-araboghli added a commit to omar-araboghli/skops that referenced this issue Apr 1, 2023
omar-araboghli added a commit to omar-araboghli/skops that referenced this issue Apr 1, 2023
omar-araboghli added a commit to omar-araboghli/skops that referenced this issue Apr 1, 2023
omar-araboghli added a commit to omar-araboghli/skops that referenced this issue Apr 11, 2023
omar-araboghli added a commit to omar-araboghli/skops that referenced this issue Apr 13, 2023
omar-araboghli added a commit to omar-araboghli/skops that referenced this issue May 14, 2023
omar-araboghli added a commit to omar-araboghli/skops that referenced this issue May 14, 2023