
numDeriv::jacobian is slow when applied to every observation #36

Closed
vincentarelbundock opened this issue Aug 16, 2016 · 14 comments

vincentarelbundock commented Aug 16, 2016

Thanks for your work on this!

I was playing around with this cool package and wondering why margins was so slow when applied to a small logistic glm with 100 observations. Basic profiling suggests that the function spends about 80% of its time taking the Jacobian in the delta_once function.

I can't quite find my way around the code yet, but it looks like it's calculating the variance for every observation, which may not be necessary in every use case.

Any thoughts on how to speed things up? If you point me in the right direction, I may be able to PR something.
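
For reference, a minimal sketch of the kind of profiling that surfaces this, using base R's Rprof() on a made-up logistic glm (the data and model here are illustrative, not the original example):

# Toy setup: small logistic glm, roughly the size described above
library(margins)

set.seed(42)
n <- 100
d <- data.frame(x = rnorm(n), y = rbinom(n, 1, 0.5))
mod <- glm(y ~ x, data = d, family = binomial())

# Profile a single margins() call and inspect where the time goes
prof_file <- tempfile()
Rprof(prof_file)
m <- margins(mod)
Rprof(NULL)
head(summaryRprof(prof_file)$by.total)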

leeper (Owner) commented Aug 16, 2016

Definitely. If you're only interested in the effects (without their variances), you can use the marginal_effects() function, which is quite fast on its own.

I've been meaning to add a toggle to turn off the calculation of unit-specific variances (and, in fact, I think it should be off by default, since it's probably not needed in most cases). I'll make that change shortly.
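
A quick sketch of that fast path, using mtcars as a stand-in model; marginal_effects() returns the unit-level effects without any variance computation:

# Effects only, no variances: marginal_effects() skips the delta-method step
library(margins)

mod <- glm(am ~ hp + wt, data = mtcars, family = binomial())

mfx <- marginal_effects(mod)  # one dy/dx column per term, one row per observation
head(mfx)
colMeans(mfx)                 # average marginal effects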

leeper added this to the CRAN Release milestone Aug 16, 2016
leeper self-assigned this Aug 16, 2016
vincentarelbundock (Author) commented:

Sounds great. Thanks for the marginal_effects pointer. Will give it a spin.

leeper closed this as completed in 427f0b2 Aug 18, 2016
leeper (Owner) commented Aug 18, 2016

Just pushed an update where this is set to FALSE by default - let me know what kind of performance improvements you see with it.
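
A sketch of how the toggle might be used; the unit_ses argument name below is an assumption rather than something confirmed in this thread, so check ?margins for the actual interface:

# NOTE: the argument name `unit_ses` is assumed here, not confirmed by the thread;
# see ?margins for the real toggle.
library(margins)

mod <- glm(am ~ hp + wt, data = mtcars, family = binomial())

m_fast <- margins(mod)                   # unit-level variances off (new default)
m_full <- margins(mod, unit_ses = TRUE)  # opt back in to per-observation variances
summary(m_fast)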

vincentarelbundock (Author) commented:

Thanks!

I'm not quite sure what the expectation should be here, as I've never used Stata's margins command, but this still seems unreasonably long to me (at least for my kind of practical use, where I often go back and forth between data munging and estimating).

What's your expectation for reasonable time?

> library(tictoc)
> library(margins)
> N = 1000
> y = sample(c(0, 1), N, replace=TRUE)
> x = rnorm(N)
> mod = glm(y ~ x, family=binomial())
> tic()
> margins(mod)

       x 
0.005662 

> toc()
63.547 sec elapsed
> 

leeper (Owner) commented Aug 22, 2016

What are the specs on your machine? I have not seen times like that.

vincentarelbundock (Author) commented:

That was on a 27" iMac bought in 2014. This morning, on my Linux machine (AMD FX(tm)-8350 eight-core processor + 32 GB RAM), microbenchmark gives me:

> microbenchmark(margins(mod), times=3)
Unit: seconds
         expr      min       lq     mean   median       uq      max neval
 margins(mod) 29.87201 29.99878 30.40625 30.12555 30.67337 31.22119     3

leeper (Owner) commented Aug 22, 2016

Thanks. I think #37 will produce some (hopefully substantial) performance enhancements, so please hold out for that.

vincentarelbundock (Author) commented:

Sounds good. I guess I'm just not sure why you're using numerical approximations at all when the derivative is well known (at least in the logit case).

leeper (Owner) commented Aug 22, 2016

Generality. If you scroll through the git history here, you'll see an approach using symbolic derivatives. It works in simple cases but not in many others, especially when any of the following occurs:

  • factor variables
  • I() expressions or similar transformations (relevel(), center(), etc.) in model formulae
  • some other edge cases

So, the choice is between an approach (symbolic derivatives) that only works for a limited set of common models and an approach (numerical derivatives) that works for any formula and any model type. I think the latter is preferable because it can be optimized gradually once the code works as intended, whereas the former simply won't work at all in lots of common cases.
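
To make the trade-off concrete, here is a sketch of the symbolic route for the simplest logit case; the mtcars model and the dydx_hp column name (margins' usual naming convention) are illustrative assumptions, and this closed form is exactly what breaks down once factors or I() terms enter the formula:

# Plain logit with one continuous regressor: dp/dx = beta_x * p * (1 - p),
# so the marginal effect can be written down symbolically.
library(margins)

mod <- glm(am ~ hp, data = mtcars, family = binomial())
p   <- predict(mod, type = "response")

analytic  <- unname(coef(mod)["hp"] * p * (1 - p))  # closed-form unit-level dy/dx
numerical <- marginal_effects(mod)$dydx_hp          # margins' numerical derivative

all.equal(analytic, numerical, tolerance = 1e-6)    # should agree up to numerical error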

vincentarelbundock (Author) commented:

[insert thumbs up emoticon]

leeper (Owner) commented Aug 22, 2016

Indeed, this looks pretty good: 43fe90e

vincentarelbundock (Author) commented:

Yeah, that's a whole other ballgame:

> library(microbenchmark)
> library(margins)
> N = 1000
> y = sample(c(0, 1), N, replace=TRUE)
> x = rnorm(N)
> mod = glm(y ~ x, family=binomial())
> microbenchmark(margins(mod), times=3)
Unit: milliseconds
         expr      min       lq    mean   median       uq      max neval
 margins(mod) 37.10477 37.64591 44.0599 38.18706 47.53747 56.88788     3

vincentarelbundock (Author) commented:

nearly 800x faster

leeper (Owner) commented Aug 22, 2016

Nice.
