Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vector-valued vectors with var in combine bug #2316

Closed
pdeffebach opened this issue Jul 15, 2020 · 5 comments · Fixed by #2357
Closed

Vector-valued vectors with var in combine bug #2316

pdeffebach opened this issue Jul 15, 2020 · 5 comments · Fixed by #2357
Labels
Milestone

Comments

@pdeffebach
Copy link
Contributor

I admit I went hunting for this after I saw that var was special-cased in combine.

Here is a MWE

julia> df = DataFrame(v = [rand(3) for i in 1:10], a = rand(1:2, 10));

julia> var(df.v)
3-element Array{Float64,1}:
 0.09855997546717102
 0.06939062184456897
 0.12074306756096125

julia> combine(groupby(df, :a), :v => var)
ERROR: MethodError: no method matching zero(::Type{Array{Float64,1}})
@bkamins bkamins added the bug label Jul 15, 2020
@bkamins bkamins added this to the 1.0 milestone Jul 15, 2020
@bkamins
Copy link
Member

bkamins commented Jul 15, 2020

Thank you for reporting it. It is a bug. We should update isagg to check the element type of a vector passed and use the fast method only if it matches the implementation.

@AliMalik9599
Copy link

Hi @bkamins I am interested in working on this issue. Where can I find the isagg function and any advice on getting started would be greatly appreciated. Thanks!

@bkamins
Copy link
Member

bkamins commented Aug 3, 2020

https://github.com/JuliaData/DataFrames.jl/blob/master/src/groupeddataframe/splitapplycombine.jl#L1079

you would have to redesign it to additionally pass the eltype of the vector selected by first(p). Then the allowed range of eltype that make isagg to use the fast path should be narrowed down.

Then we should also remove the check first(p) isa ColumnIndex as it is inefficient:

julia> @time combine(gdf, [:x] => sum);
  0.009956 seconds (1.11 k allocations: 7.666 MiB)

julia> @time combine(gdf, :x => sum);
  0.001730 seconds (163 allocations: 14.281 KiB)

(it is better to just check if a single column is passed no matter how it gets selected - here just make sure to correctly handle AsTable as it should not trigger this fast path)

As usual - the hardest thing to do will be:

  1. writing tests
  2. doing performance benchmarking if all is OK

@AliMalik9599
Copy link

What should the range of eltypes be for fast path? Also, what does the function first() do? Thanks for the help

@bkamins
Copy link
Member

bkamins commented Aug 3, 2020

first - takes the first element of the collection.

As for the range of types - I think Union{Missing, Number} should be OK except for minimum and maximum when it should be Union{Missing, Real}.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants