Calculation of confidence intervals in NormalInferenceResults becomes very slow when passing big dataframes #878

gdaiha · 2024-04-30T18:01:20Z

Line 1073 in a0054d1

return np.array([_safe_norm_ppf(alpha / 2, loc=p, scale=err)

The conf_int(alpha) function becomes very slow when it is given large dataframes (a large number of rows).
I think this happens because the bounds are computed one row at a time in a list comprehension.

In my tests, 400k rows take about 2 minutes and 1M rows about 5 minutes.

I think we could try at least two things:

Parallelize this calculation, as we do in many other steps inside the library. I searched the library and did not find any place that calls this from inside an outer parallel loop, so nested parallel overhead should not be a concern.
Find a way to do this calculation in a NumPy-friendly (vectorized) way, although I have not yet found how.
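One possible NumPy-friendly route for the second idea, as a minimal sketch: for a normal distribution the quantile is loc + scale * z, where z is the standard normal quantile, so z only needs to be computed once. The helper name conf_int_closed_form and its parameters are hypothetical and are not part of the library.

```python
import numpy as np
from scipy.stats import norm


def conf_int_closed_form(point_estimate, stderr, alpha=0.05):
    """Hypothetical helper: normal confidence bounds without a Python loop."""
    point_estimate = np.asarray(point_estimate, dtype=float)
    stderr = np.asarray(stderr, dtype=float)
    # One scalar quantile of N(0, 1); everything else is array arithmetic.
    z = norm.ppf(1 - alpha / 2)
    lower = point_estimate - z * stderr
    upper = point_estimate + z * stderr
    return lower, upper


lower, upper = conf_int_closed_form([0.0, 1.0], [1.0, 2.0], alpha=0.05)
```

With a zero standard error this sketch returns the point estimate itself for both bounds, whereas scipy.stats.norm.ppf returns nan when scale is 0.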

Any other suggestions? @kbattocchi

gdaiha · 2024-04-30T19:33:07Z

After looking more closely at how _safe_norm_ppf is built, I think the only fix needed is to remove the outer loop and pass the arrays directly:

_safe_norm_ppf(1 - alpha / 2, loc=self.point_estimate, scale=self.stderr), _safe_norm_ppf(alpha / 2, loc=self.point_estimate, scale=self.stderr)
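To illustrate what an array-aware, zero-safe quantile function could look like, here is a rough sketch. It is a hypothetical stand-in written for this comment, not the library's _safe_norm_ppf, and it assumes that a zero standard error should simply return the point estimate.

```python
import numpy as np
from scipy.stats import norm


def safe_norm_ppf_vectorized(q, loc, scale):
    """Hypothetical zero-safe normal ppf operating on whole arrays."""
    loc = np.asarray(loc, dtype=float)
    scale = np.asarray(scale, dtype=float)
    loc, scale = np.broadcast_arrays(loc, scale)
    # Assumption: where scale == 0 the quantile collapses to loc.
    out = loc.copy()
    positive = scale > 0
    # A single vectorized call covers every entry with a positive scale.
    out[positive] = norm.ppf(q, loc=loc[positive], scale=scale[positive])
    return out


loc = np.array([0.0, 1.0, 2.0])
scale = np.array([1.0, 0.0, 2.0])
res = safe_norm_ppf_vectorized(0.975, loc, scale)
```

The boolean mask keeps the call free of Python-level loops while avoiding the nan that scipy.stats.norm.ppf produces for a zero scale.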

A simple simulation between the two methods:

import numpy as np
import scipy.stats  # import the submodule explicitly; a bare "import scipy" may not load it

from time import perf_counter
from contextlib import contextmanager

size = 10000

@contextmanager
def catchtime():
    # Yield a callable returning the elapsed seconds; print the total on exit.
    start = perf_counter()
    yield lambda: perf_counter() - start
    print(f'Time: {perf_counter() - start:.3f} seconds')

# Random point estimates and non-negative standard errors.
loc_array = np.random.normal(loc=0, scale=0.1, size=size)
scale_array = np.abs(np.random.normal(loc=0, scale=0.001, size=size))

print('Method without loop')
with catchtime() as t:
    # One vectorized call per bound over the whole arrays.
    a = (scipy.stats.norm.ppf(q=0.05, loc=loc_array, scale=scale_array),
         scipy.stats.norm.ppf(q=0.95, loc=loc_array, scale=scale_array))

print(a)

print('Method with loop')
with catchtime() as t:
    # One scalar call per row, as in the current list comprehension.
    b = (np.array([scipy.stats.norm.ppf(q=0.05, loc=loc, scale=scale)
                   for loc, scale in zip(loc_array, scale_array)]),
         np.array([scipy.stats.norm.ppf(q=0.95, loc=loc, scale=scale)
                   for loc, scale in zip(loc_array, scale_array)]))

print(b)

That prints:

Method without loop

Time: 0.003 seconds

(array([ 0.05063444, -0.00281694, 0.09580098, ..., -0.0103138 ,
0.05915183, 0.09202236]), array([ 0.05225244, -0.00042525, 0.09687321, ..., -0.00561028,
0.06531339, 0.0939663 ]))

Method with loop

Time: 2.423 seconds

(array([ 0.05063444, -0.00281694, 0.09580098, ..., -0.0103138 ,
0.05915183, 0.09202236]), array([ 0.05225244, -0.00042525, 0.09687321, ..., -0.00561028,
0.06531339, 0.0939663 ]))

[ 0.05063444 -0.00281694 0.09580098 ... -0.0103138 0.05915183
0.09202236]
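The printed arrays look identical, but the agreement can also be checked numerically. A small sketch of such a check, using a fixed seed and a smaller size than the benchmark (both choices are arbitrary assumptions for illustration):

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)  # fixed seed so the check is reproducible
size = 1000
loc_array = rng.normal(loc=0, scale=0.1, size=size)
scale_array = np.abs(rng.normal(loc=0, scale=0.001, size=size))

# Vectorized call over the whole arrays.
vectorized = scipy.stats.norm.ppf(q=0.05, loc=loc_array, scale=scale_array)

# Row-by-row calls, mirroring the list comprehension.
looped = np.array([scipy.stats.norm.ppf(q=0.05, loc=loc, scale=scale)
                   for loc, scale in zip(loc_array, scale_array)])

# Raises AssertionError if the two methods disagree beyond floating-point tolerance.
np.testing.assert_allclose(vectorized, looped)
```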

gdaiha · 2024-05-07T11:40:59Z

Issue solved by #879

gdaiha mentioned this issue May 1, 2024

Optimizing NormalInferenceResults confidence interval method speed #879

Merged

gdaiha closed this as completed May 7, 2024
