
Calculation of confidence intervals in NormalInferenceResults becomes very slow when passing big dataframes #878

Closed
gdaiha opened this issue Apr 30, 2024 · 2 comments

Comments

@gdaiha
Contributor

gdaiha commented Apr 30, 2024

return np.array([_safe_norm_ppf(alpha / 2, loc=p, scale=err)

The conf_int(alpha) function becomes very slow when passed big dataframes (a large number of rows).
I think this happens because it calls _safe_norm_ppf once per row inside a list comprehension.

In my tests, 400k rows take about 2 minutes and 1M rows take about 5 minutes.

I think we could try at least two things:

  • Parallelize this calculation, as we do in many other steps inside the library. I searched the library and did not find any place that already calls this from inside an outer parallel loop, so nested parallel overhead should not be a concern.
  • Find a way to do this calculation in a vectorized, NumPy-friendly way (I have not found how to do this yet).

Any other suggestions? @kbattocchi

@gdaiha
Contributor Author

gdaiha commented Apr 30, 2024

After looking more closely at how _safe_norm_ppf is built, I think the only fix needed is to remove the outer loop and pass the arrays directly:

_safe_norm_ppf(1 - alpha / 2, loc=self.point_estimate, scale=self.stderr), _safe_norm_ppf(alpha / 2, loc=self.point_estimate, scale=self.stderr)
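A minimal sketch of what the loop-free interval could look like. The helper `norm_ppf_or_loc` and the function `conf_int_vectorized` are hypothetical names, not the library's code. The sketch assumes that the "safe" part means falling back to the point estimate wherever the standard error is zero (where `scipy.stats.norm.ppf` would otherwise return NaN); the actual `_safe_norm_ppf` may behave differently.

```python
import numpy as np
from scipy import stats


def norm_ppf_or_loc(q, loc, scale):
    """Hypothetical stand-in for a 'safe' normal quantile function.

    Assumption: where scale is not positive, return loc unchanged
    instead of the NaN that scipy.stats.norm.ppf would produce.
    """
    loc, scale = np.broadcast_arrays(
        np.asarray(loc, dtype=float), np.asarray(scale, dtype=float)
    )
    out = loc.copy()  # writable copy; also the fallback value
    positive = scale > 0
    if positive.any():
        # One vectorized call over all rows with a usable standard error.
        out[positive] = stats.norm.ppf(q, loc=loc[positive], scale=scale[positive])
    return out


def conf_int_vectorized(point_estimate, stderr, alpha=0.05):
    """Return (lower, upper) bounds for every row without a Python loop."""
    lower = norm_ppf_or_loc(alpha / 2, point_estimate, stderr)
    upper = norm_ppf_or_loc(1 - alpha / 2, point_estimate, stderr)
    return lower, upper


point = np.array([0.0, 1.0, 2.0])
se = np.array([1.0, 0.0, 0.5])
lower, upper = conf_int_vectorized(point, se, alpha=0.05)
print(lower, upper)
```

The row with a zero standard error collapses to its point estimate on both sides, while the other rows are handled by a single array call.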

A simple simulation between the two methods:

import scipy.stats
import numpy as np

from time import perf_counter
from contextlib import contextmanager

size = 10000

@contextmanager
def catchtime():
    # Yield a callable that returns the elapsed time so far,
    # and print the total elapsed time when the block exits.
    start = perf_counter()
    yield lambda: perf_counter() - start
    print(f'Time: {perf_counter() - start:.3f} seconds')

loc_array = np.random.normal(loc=0, scale=0.1, size=size)
scale_array = np.abs(np.random.normal(loc=0, scale=0.001, size=size))

# Vectorized: one ppf call per quantile over the whole arrays.
print('Method without loop')
with catchtime() as t:
    a = (scipy.stats.norm.ppf(q=0.05, loc=loc_array, scale=scale_array),
         scipy.stats.norm.ppf(q=0.95, loc=loc_array, scale=scale_array))

print(a)

# Looped: one ppf call per row and per quantile.
print('Method with loop')
with catchtime() as t:
    b = (np.array([scipy.stats.norm.ppf(q=0.05, loc=loc, scale=scale)
                   for loc, scale in zip(loc_array, scale_array)]),
         np.array([scipy.stats.norm.ppf(q=0.95, loc=loc, scale=scale)
                   for loc, scale in zip(loc_array, scale_array)]))

print(b)

That prints:

Method without loop

Time: 0.003 seconds

(array([ 0.05063444, -0.00281694, 0.09580098, ..., -0.0103138 ,
0.05915183, 0.09202236]), array([ 0.05225244, -0.00042525, 0.09687321, ..., -0.00561028,
0.06531339, 0.0939663 ]))

Method with loop

Time: 2.423 seconds

(array([ 0.05063444, -0.00281694, 0.09580098, ..., -0.0103138 ,
0.05915183, 0.09202236]), array([ 0.05225244, -0.00042525, 0.09687321, ..., -0.00561028,
0.06531339, 0.0939663 ]))
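The printed arrays look identical, but the agreement can also be checked numerically. The sketch below is an assumption-labeled check, not part of the original simulation: it uses a seeded generator and a smaller size so it runs quickly, and it also compares against the closed form for a normal quantile, loc + scale * z_q, where z_q is the standard normal quantile.

```python
import numpy as np
from scipy import stats

# Seeded generator and smaller size than the simulation above (assumptions
# made only so this check is quick and reproducible).
rng = np.random.default_rng(0)
loc_array = rng.normal(loc=0, scale=0.1, size=1000)
scale_array = np.abs(rng.normal(loc=0, scale=0.001, size=1000))

# Vectorized call over the whole arrays.
vectorized = stats.norm.ppf(0.05, loc=loc_array, scale=scale_array)

# Row-by-row call, as in the original list comprehension.
looped = np.array([
    stats.norm.ppf(0.05, loc=loc, scale=scale)
    for loc, scale in zip(loc_array, scale_array)
])

# Closed form: shift and scale the standard normal quantile.
closed_form = loc_array + scale_array * stats.norm.ppf(0.05)

print(np.allclose(vectorized, looped))
print(np.allclose(vectorized, closed_form))
```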


@gdaiha
Contributor Author

gdaiha commented May 7, 2024

Issue solved by #879

@gdaiha gdaiha closed this as completed May 7, 2024