
Add {numpy, scipy}.ufuncs to the trusted list #224

Closed
adrinjalali opened this issue Nov 29, 2022 · 20 comments · Fixed by #295 or #336
Labels
good first issue (Good for newcomers), persistence (Secure persistence feature)

Comments

@adrinjalali (Member)

I think most (if not all) ufuncs are actually safe, so we can add them to the default trusted list and reduce the size of untrusted_types.

@adrinjalali added the persistence (Secure persistence feature) label on Nov 29, 2022
@BenjaminBossan (Collaborator)

I would also assume that those ufuncs are safe, since they operate on a very specific set of inputs and should have no side effects, but I'm not sure how to verify this without actually going through the code.

Just a quick check I did:

>>> import numpy as np
>>> import scipy.special
>>> npufuncs = [getattr(np, a) for a in dir(np) if isinstance(getattr(np, a), np.ufunc)]
>>> len(npufuncs)
91
>>> spufuncs = [getattr(scipy.special, a) for a in dir(scipy.special) if isinstance(getattr(scipy.special, a), np.ufunc)]
>>> len(spufuncs)
231

Not sure if that's all.

@adrinjalali (Member, Author)

@rgommers would you know if we can consider all these ufuncs safe in this context, and whether @BenjaminBossan's snippet captures all of them?

Context: we've created an alternative to the pickle persistence format, where users have to pass the types they trust for a file to load. Right now, none of the ufuncs are trusted, even though they are used inside sklearn estimators. We're trying to reduce the noise on the user side by trusting certain funcs/objects by default.

@rgommers commented Dec 1, 2022

Ufuncs should be stateless, so safe to obtain from numpy or scipy.

"they operate on a very specific set of inputs" isn't quite true, at least in terms of Python types, given that they accept anything that can be coerced to a numpy.ndarray. But that should not be relevant I'd think.

The list isn't quite complete for scipy. There are also some ufuncs in scipy.stats._boost.
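Ufuncs living in a private subpackage such as scipy.stats._boost won't show up via dir() on the top-level package, but they can be found by walking the package tree. A minimal sketch of that idea, demonstrated on the stdlib json package so it runs without scipy installed (for skops, scipy.stats would take json's place):

```python
# Hedged sketch: discover submodules of a package by walking its tree.
# Demonstrated on the stdlib `json` package; swapping in scipy.stats
# would surface private submodules like scipy.stats._boost.
import pkgutil

import json


def list_submodules(pkg):
    """Return the dotted names of all submodules of `pkg`."""
    return sorted(
        name
        for _, name, _ in pkgutil.walk_packages(
            pkg.__path__, prefix=pkg.__name__ + "."
        )
    )


submodules = list_submodules(json)
print(submodules)  # includes 'json.decoder', 'json.encoder', 'json.scanner'
```

Each submodule found this way can then be imported and scanned for np.ufunc instances.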

@E-Aho added the good first issue (Good for newcomers) label on Dec 3, 2022
@anamika-yadav99

Hi @adrinjalali! I'm an undergrad from India. I was exploring the repository and found it interesting. I want to contribute to the project. Can I take up this issue?

@adrinjalali (Member, Author)

@anamika-yadav99 yes of course :) You can have a look at #237 for a similar pull request.

@BenjaminBossan (Collaborator)

@anamika-yadav99 Do you still plan to work on this?

@anamika-yadav99

Hi @BenjaminBossan! Apologies, give me until this weekend. I'll raise a PR.

@BenjaminBossan (Collaborator)

@anamika-yadav99 a friendly ping to ask if you still plan on working on this :)

@omar-araboghli (Contributor)

Hi, I would be happy to contribute, if @BenjaminBossan and @anamika-yadav99 would give the 🟢 light!

@BenjaminBossan (Collaborator)

@omar-araboghli Since we haven't heard back in a while from Anamika, I think it's safe to say that you can take this task.

@omar-araboghli (Contributor)

Hi @BenjaminBossan and @adrinjalali, I have run some experiments to make sure we're including all ufuncs from both numpy and scipy. I compared @BenjaminBossan's snippet (v1) with my version (v2), which traverses all modules and inspects the objects typed as np.ufunc. I found additional ufuncs that are not mentioned in the documentation. For some reason, the ufuncs under scipy.stats._boost are not captured either. Still looking into that.

Please check the following snippet and output, and advise whether it's safe to include the inspected ufuncs. The traversal is inspired by sklearn.utils.all_estimators().

import importlib
import inspect
import pkgutil
from types import ModuleType
from typing import Optional, Type

# `get_type_name(obj)` is assumed to return the fully qualified dotted name
# of `obj`, e.g. "numpy.absolute" (as in the skops utilities).


def all_typed_instances(
    root_module: ModuleType,
    base_type: Type,
    modules_to_ignore: Optional[list[str]] = None,
) -> list[str]:
    if not root_module or not base_type:
        return []

    _MODULES_TO_IGNORE = [
        "tests",
        "setup",
        "conftest",
    ]

    _SUBSTRINGS_TO_IGNORE = [
        ".__",
        "test",
    ]

    if modules_to_ignore is None:
        modules_to_ignore = _MODULES_TO_IGNORE
    else:
        modules_to_ignore += _MODULES_TO_IGNORE

    typed_instances = []
    for _, module_name, _ in pkgutil.walk_packages(
        root_module.__path__, prefix=f"{root_module.__name__}."
    ):
        # Skip test modules, setup scripts, and dunder modules.
        module_parts = module_name.split(".")
        if (
            any(part in modules_to_ignore for part in module_parts)
            or any(substring in module_name for substring in _SUBSTRINGS_TO_IGNORE)
        ):
            continue

        # Some submodules fail to import (e.g. missing optional
        # dependencies); skip those rather than aborting the walk.
        try:
            module = importlib.import_module(module_name)
        except Exception:
            continue

        typed_members = [
            get_type_name(member) for _, member in inspect.getmembers(module)
            if isinstance(member, base_type)
            and get_type_name(member).startswith(root_module.__name__)
        ]
        typed_instances.extend(typed_members)

    return sorted(set(typed_instances))

----------------------------------------------------

numpy_v1 = sorted(set([
    get_type_name(getattr(np, attr))
    for attr in dir(np)
    if isinstance(getattr(np, attr), np.ufunc)
    and get_type_name(getattr(np, attr)).startswith("numpy")
]))

numpy_v2 = all_typed_instances(np, np.ufunc)

scipy_v1 = sorted(set([
    get_type_name(getattr(scipy.special, attr))
    for attr in dir(scipy.special)
    if isinstance(getattr(scipy.special, attr), np.ufunc)
    and get_type_name(getattr(scipy.special, attr)).startswith("scipy")
]))

scipy_v2 = all_typed_instances(scipy, np.ufunc)

----------------------------------------------------

>>> set(numpy_v1).difference(numpy_v2)
set()
>>> set(numpy_v2).difference(numpy_v1)
{'numpy.linalg._umath_linalg.lstsq_n', 'numpy.linalg._umath_linalg.eigvalsh_lo', 'numpy.core._multiarray_umath.clip', 'numpy.linalg._umath_linalg.inv', 'numpy.linalg._umath_linalg.svd_m', 'numpy.core._multiarray_umath._ones_like', 'numpy.linalg._umath_linalg.svd_n', 'numpy.linalg._umath_linalg.svd_m_s', 'numpy.linalg._umath_linalg.eig', 'numpy.linalg._umath_linalg.svd_n_f', 'numpy.linalg._umath_linalg.eigvals', 'numpy.linalg._umath_linalg.solve', 'numpy.linalg._umath_linalg.qr_complete', 'numpy.linalg._umath_linalg.eigh_lo', 'numpy.linalg._umath_linalg.qr_r_raw_n', 'numpy.linalg._umath_linalg.solve1', 'numpy.linalg._umath_linalg.eigvalsh_up', 'numpy.linalg._umath_linalg.slogdet', 'numpy.linalg._umath_linalg.qr_reduced', 'numpy.linalg._umath_linalg.svd_m_f', 'numpy.linalg._umath_linalg.eigh_up', 'numpy.linalg._umath_linalg.qr_r_raw_m', 'numpy.linalg._umath_linalg.svd_n_s', 'numpy.core._multiarray_umath._arg', 'numpy.linalg._umath_linalg.lstsq_m', 'numpy.linalg._umath_linalg.det', 'numpy.linalg._umath_linalg.cholesky_lo'}
>>> len(numpy_v1)
86
>>> len(numpy_v2)
113

----------------------------------------------------

>>> set(scipy_v1).difference(scipy_v2)
set()
>>> set(scipy_v2).difference(scipy_v1)
{'scipy.stats.invgauss_ufunc._invgauss_cdf', 'scipy.special._ufuncs._riemann_zeta', 'scipy.special._ufuncs._smirnovc', 'scipy.stats.ncx2_ufunc._ncx2_kurtosis_excess', 'scipy.stats.skewnorm_ufunc._skewnorm_ppf', 'scipy.special._ufuncs._cospi', 'scipy.stats.nct_ufunc._nct_mean', 'scipy.stats.ncx2_ufunc._ncx2_isf', 'scipy.stats.skewnorm_ufunc._skewnorm_sf', 'scipy.special._ufuncs._zeta', 'scipy.stats.beta_ufunc._beta_isf', 'scipy.stats.binom_ufunc._binom_cdf', 'scipy.stats.nct_ufunc._nct_cdf', 'scipy.stats.nbinom_ufunc._nbinom_sf', 'scipy.stats.hypergeom_ufunc._hypergeom_cdf', 'scipy.special._ufuncs._spherical_kn_d', 'scipy.stats.ncx2_ufunc._ncx2_variance', 'scipy.special._ufuncs._struve_power_series', 'scipy.stats.hypergeom_ufunc._hypergeom_isf', 'scipy.stats.nbinom_ufunc._nbinom_mean', 'scipy.special._ufuncs._log1pmx', 'scipy.stats.binom_ufunc._binom_sf', 'scipy.special._ufuncs._ellip_harm', 'scipy.special._ufuncs._spherical_yn', 'scipy.stats.invgauss_ufunc._invgauss_mean', 'scipy.stats.invgauss_ufunc._invgauss_ppf', 'scipy.stats.hypergeom_ufunc._hypergeom_variance', 'scipy.stats.skewnorm_ufunc._skewnorm_variance', 'scipy.stats.nct_ufunc._nct_sf', 'scipy.special._ufuncs._cosine_cdf', 'scipy.stats.ncx2_ufunc._ncx2_sf', 'scipy.special._ufuncs._smirnovp', 'scipy.stats.ncx2_ufunc._ncx2_ppf', 'scipy.stats.binom_ufunc._binom_kurtosis_excess', 'scipy.stats.skewnorm_ufunc._skewnorm_kurtosis_excess', 'scipy.stats.hypergeom_ufunc._hypergeom_skewness', 'scipy.stats.ncf_ufunc._ncf_sf', 'scipy.stats.nct_ufunc._nct_isf', 'scipy.special._ufuncs._kolmogc', 'scipy.stats.skewnorm_ufunc._skewnorm_isf', 'scipy.stats.ncx2_ufunc._ncx2_cdf', 'scipy.stats.hypergeom_ufunc._hypergeom_kurtosis_excess', 'scipy.stats.nbinom_ufunc._nbinom_kurtosis_excess', 'scipy.special._ufuncs._spherical_jn_d', 'scipy.stats.binom_ufunc._binom_skewness', 'scipy.stats.ncx2_ufunc._ncx2_mean', 'scipy.stats.nct_ufunc._nct_skewness', 'scipy.stats.beta_ufunc._beta_variance', 'scipy.special._ufuncs._lgam1p', 
'scipy.stats.binom_ufunc._binom_isf', 'scipy.special._ufuncs._spherical_in', 'scipy.stats.beta_ufunc._beta_sf', 'scipy.stats.nbinom_ufunc._nbinom_isf', 'scipy.stats.ncf_ufunc._ncf_pdf', 'scipy.stats.hypergeom_ufunc._hypergeom_ppf', 'scipy.special._ufuncs._lambertw', 'scipy.special._ufuncs._igam_fac', 'scipy.stats.invgauss_ufunc._invgauss_isf', 'scipy.special._ufuncs._smirnovci', 'scipy.stats.beta_ufunc._beta_pdf', 'scipy.stats.ncf_ufunc._ncf_cdf', 'scipy.stats.skewnorm_ufunc._skewnorm_skewness', 'scipy.stats.nct_ufunc._nct_kurtosis_excess', 'scipy.special._ufuncs._spherical_kn', 'scipy.stats.skewnorm_ufunc._skewnorm_pdf', 'scipy.stats.nbinom_ufunc._nbinom_skewness', 'scipy.stats.invgauss_ufunc._invgauss_variance', 'scipy.stats.beta_ufunc._beta_ppf', 'scipy.stats.skewnorm_ufunc._skewnorm_mean', 'scipy.special._ufuncs._spherical_yn_d', 'scipy.stats.invgauss_ufunc._invgauss_skewness', 'scipy.stats.ncf_ufunc._ncf_kurtosis_excess', 'scipy.stats.invgauss_ufunc._invgauss_kurtosis_excess', 'scipy.stats.invgauss_ufunc._invgauss_pdf', 'scipy.special._ufuncs._lanczos_sum_expg_scaled', 'scipy.special._ufuncs._cosine_invcdf', 'scipy.stats.nct_ufunc._nct_variance', 'scipy.stats.nbinom_ufunc._nbinom_pdf', 'scipy.stats.binom_ufunc._binom_variance', 'scipy.special._ufuncs._sf_error_test_function', 'scipy.stats.ncf_ufunc._ncf_skewness', 'scipy.stats.ncx2_ufunc._ncx2_skewness', 'scipy.stats.nct_ufunc._nct_pdf', 'scipy.stats.nbinom_ufunc._nbinom_cdf', 'scipy.special._ufuncs._spherical_jn', 'scipy.stats.binom_ufunc._binom_ppf', 'scipy.special._ufuncs._factorial', 'scipy.stats.ncx2_ufunc._ncx2_pdf', 'scipy.stats.ncf_ufunc._ncf_mean', 'scipy.stats.beta_ufunc._beta_skewness', 'scipy.stats.invgauss_ufunc._invgauss_sf', 'scipy.special._ufuncs._spherical_in_d', 'scipy.stats.beta_ufunc._beta_cdf', 'scipy.special._ufuncs._kolmogp', 'scipy.stats.hypergeom_ufunc._hypergeom_sf', 'scipy.stats.hypergeom_ufunc._hypergeom_pdf', 'scipy.special._ufuncs._struve_bessel_series', 
'scipy.special._ufuncs._kolmogci', 'scipy.stats.ncf_ufunc._ncf_variance', 'scipy.stats.nbinom_ufunc._nbinom_variance', 'scipy.stats.binom_ufunc._binom_pdf', 'scipy.stats.nct_ufunc._nct_ppf', 'scipy.stats.binom_ufunc._binom_mean', 'scipy.stats.beta_ufunc._beta_mean', 'scipy.stats.hypergeom_ufunc._hypergeom_mean', 'scipy.stats.skewnorm_ufunc._skewnorm_cdf', 'scipy.stats.ncf_ufunc._ncf_ppf', 'scipy.special._ufuncs._struve_asymp_large_z', 'scipy.special._ufuncs._sinpi', 'scipy.stats.beta_ufunc._beta_kurtosis_excess', 'scipy.stats.ncf_ufunc._ncf_isf', 'scipy.stats.nbinom_ufunc._nbinom_ppf'}
>>> len(scipy_v1)
230
>>> len(scipy_v2)
342

@BenjaminBossan (Collaborator)

Thanks a lot @omar-araboghli for looking into this. I tried to run your snippet locally but there were a lot of missing imports. Could you be so kind as to provide a code block that works as is?

Regarding the results, I think we're collecting too many functions here. To check, I picked the first and last results from the lists of differences you provided for numpy and scipy:

  • numpy.linalg._umath_linalg.lstsq_n is a "private" function that is not supposed to be used directly, but seems to be a helper function to numpy.linalg.lstsq.
  • numpy.linalg._umath_linalg.cholesky_lo is a "private" function that is not supposed to be used directly, but seems to be a helper function to numpy.linalg.cholesky.
  • scipy.stats.invgauss_ufunc._invgauss_cdf I couldn't find any usage (but it seems to be related to boost).
  • scipy.stats.nbinom_ufunc._nbinom_pp is a "private" function that is not supposed to be used directly, but seems to be a helper function to scipy.stats.nbinom (via scipy.stats._discrete_distns.nbinom_gen).

The problem with adding those is that given their "private" nature, they are much more likely to change in a way that creates maintenance burden for us (being removed, renamed, moved to a different module) while being extremely unlikely to be used by any user directly. And I would assume that if a user wants to use those private functions, they are very knowledgeable, so they should not have a problem adding them to the trusted argument.

Therefore, I would rather be more conservative and only add "public" functions. Do you want to investigate how to filter out the "private" functions?
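The underscore-based filter being asked for here can be sketched as a simple check over a dotted path: a name counts as private if any component is underscore-prefixed. The sample names below are taken from the diff output earlier in this thread; the helper name is hypothetical:

```python
# Hedged sketch: filter out "private" dotted names, i.e. any path in
# which some component starts with an underscore.
def is_public(dotted_name: str) -> bool:
    """True if no component of the dotted path is underscore-prefixed."""
    return not any(part.startswith("_") for part in dotted_name.split("."))


candidates = [
    "numpy.absolute",
    "numpy.linalg._umath_linalg.lstsq_n",
    "scipy.special.erf",
    "scipy.stats.invgauss_ufunc._invgauss_cdf",
]
public = [name for name in candidates if is_public(name)]
print(public)  # ['numpy.absolute', 'scipy.special.erf']
```

This drops both privately named functions and anything reached through a privately named module.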

omar-araboghli added a commit to omar-araboghli/skops that referenced this issue Feb 6, 2023
@omar-araboghli (Contributor)

Hi @BenjaminBossan and thank you for the quick review! I just committed these quick changes on my fork here, and you can also check them in this notebook, though with different numpy/scipy versions.

Thanks for the clarification about the results. I was, and still am, a bit sceptical about the private functions, since @rgommers mentioned _boost as a module containing ufuncs of interest to us. Isn't it a private module? Public functions from a private module should be imported from a public module and caught by the above all_typed_instances() function, but they weren't (or maybe something is still wrong with my implementation).

As a generic approach, would we settle for collecting everything except private and test functions/objects? Should private modules also be dropped?

The filtering can be done with

    _MODULES_TO_IGNORE = [
        "tests",
        "setup",
        "conftest"
    ]

    _SUBSTRINGS_TO_IGNORE = [
        ".__",
        "test"
    ]

@rgommers commented Feb 6, 2023

Note that in the most recent version of SciPy, we consistently use underscores for private things. There's some non-underscored shims for backwards compat, but those all raise warnings and are not present in __all__. So just using dir(scipy) and then dir(scipy.special) (and other submodules) should give you a clean interface to everything that is public.
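The dir()-plus-__all__ approach can be sketched as follows, with __all__ taken as the authoritative list of public names and a fall-back to non-underscored dir() entries. It is demonstrated on the stdlib json module so the snippet stays self-contained; for skops, the same helper would be applied to scipy and its public submodules:

```python
# Hedged sketch: collect a module's public interface, preferring __all__
# over dir(). Demonstrated on the stdlib `json` module; scipy and
# scipy.special would be the real targets.
import json


def public_names(module):
    """Names the module explicitly exports, else non-underscored dir() entries."""
    exported = getattr(module, "__all__", None)
    if exported is not None:
        return sorted(exported)
    return sorted(n for n in dir(module) if not n.startswith("_"))


names = public_names(json)
print(names)  # includes 'dumps' and 'loads'
```

Combined with an isinstance(obj, np.ufunc) check on each name, this yields only the ufuncs that scipy intends to be public.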

@BenjaminBossan (Collaborator)

@omar-araboghli Thanks for providing more context and for the draft PR, now I understand better what's going on.

I think the suggestion by Ralf of using dir(...) is a very good one. This way, we capture the functions that are really intended for usage by scipy users. Maybe we will miss something this way that could be caught with the introspective approach you suggest, but I think that's fine. After all, users can always add those types by hand. And in this case, I feel like robustness is more important than completeness, i.e. even if users may need to add some funcs by hand, they can be sure the same code still works after a scipy upgrade, even if a private function was renamed or moved. Therefore, I'd go with Ralf's suggestion.

I hope we can use the same logic for numpy, but if it's easier to start with scipy and figure out numpy later, we can just have two separate PRs.

@rgommers commented Feb 7, 2023

I hope we can use the same logic for numpy,

Unfortunately not yet, the NumPy public/private split is a huge mess. I plan to fix that for NumPy 2.0.

@BenjaminBossan (Collaborator)

Thanks for the update. @omar-araboghli so how about starting with scipy first and doing numpy after that?

@omar-araboghli (Contributor) commented Feb 9, 2023

Thanks for the discussions. I will take that up tomorrow, or by the weekend at the latest, and raise questions if any come up.

omar-araboghli added a commit to omar-araboghli/skops that referenced this issue Feb 10, 2023
…nd embedding the test within estimators_fitted.
omar-araboghli added a commit to omar-araboghli/skops that referenced this issue Feb 13, 2023
@BenjaminBossan (Collaborator)

Sorry, this should not have been closed, since numpy ufuncs are still not trusted by default.

Numpy dtypes could also be added to the types trusted by default.

@omar-araboghli (Contributor)

I thought a separate issue would be created. I will work on this soon-ish. Thanks for pinging!

omar-araboghli added a commit to omar-araboghli/skops that referenced this issue Apr 1, 2023
omar-araboghli added a commit to omar-araboghli/skops that referenced this issue Apr 1, 2023
omar-araboghli added a commit to omar-araboghli/skops that referenced this issue Apr 1, 2023
omar-araboghli added a commit to omar-araboghli/skops that referenced this issue Apr 1, 2023
omar-araboghli added a commit to omar-araboghli/skops that referenced this issue Apr 11, 2023
omar-araboghli added a commit to omar-araboghli/skops that referenced this issue Apr 13, 2023
omar-araboghli added a commit to omar-araboghli/skops that referenced this issue May 14, 2023
omar-araboghli added a commit to omar-araboghli/skops that referenced this issue May 14, 2023