Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more splatting and appending of arguments in by #1620

Closed
wants to merge 40 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
40 commits
Select commit Hold shift + click to select a range
7649508
First restart
pdeffebach Feb 13, 2019
9d2aa65
Get tests to pass
pdeffebach Feb 13, 2019
05959f6
add tests for multiple vector inputs
pdeffebach Feb 13, 2019
858ad17
Add Tuple argument for explicitness
pdeffebach Feb 23, 2019
34fc1c5
Same for `by`
pdeffebach Feb 23, 2019
a6f98bf
re-write docstring
pdeffebach Mar 3, 2019
be71aca
Remove tuples and add documentation
pdeffebach Mar 10, 2019
32f0f14
Add back in tuples to code and docs
pdeffebach Mar 10, 2019
5c843a6
Rephrase.
nalimilan Mar 12, 2019
fca4c5f
Getting back to work
pdeffebach Apr 4, 2019
315d8bc
¿ ¿ Int64 ¿ Int64 ¿
pdeffebach Apr 4, 2019
368c585
Merge branch 'vector_grouping' of https://github.com/pdeffebach/DataF…
pdeffebach Apr 4, 2019
c24cfbc
Fix bug (computer no longer freaking out)
pdeffebach Apr 4, 2019
16f3b54
Merge branch 'master' into vector_grouping
pdeffebach Apr 23, 2019
53ae557
Respond to review
pdeffebach Apr 23, 2019
174df20
remove colwise
pdeffebach Apr 23, 2019
8a1ecb3
Minor fix
pdeffebach Apr 23, 2019
e877e3b
Fixes
nalimilan Apr 25, 2019
087796f
Update src/groupeddataframe/grouping.jl
nalimilan Apr 30, 2019
48aa400
Added deleted sentence
pdeffebach Apr 30, 2019
982bf5c
Actually add back deleted sentence
pdeffebach Apr 30, 2019
e24d893
Remove deprecated test
pdeffebach May 3, 2019
9002528
Update deprecation
pdeffebach May 3, 2019
22f8521
Add back old behavior of combine(gd)
pdeffebach May 3, 2019
c3f351f
Get rid of Vararg and NT method
pdeffebach May 20, 2019
6ba86e5
Add back in Vararg
pdeffebach May 21, 2019
ddb77b2
Tests for tuples`
pdeffebach Jun 10, 2019
eacf310
Add more test_throws
pdeffebach Jun 10, 2019
145b3c8
Fixing conflicts pt 1
pdeffebach Jun 10, 2019
5fdb1b7
Merge branch 'master' into vector_grouping
pdeffebach Jun 10, 2019
fa5340f
allow for :b => mean, (:b, :c) => first
pdeffebach Jun 10, 2019
2773b82
Tests for symble and tuple arguments
pdeffebach Jun 10, 2019
44cd60c
Add TODO for type narrowing problem
pdeffebach Jun 10, 2019
c6d569b
splatting to fix names
pdeffebach Jun 11, 2019
ad142db
Test fixes
pdeffebach Jun 11, 2019
dd537b8
Milan's suggestion of first(p)
pdeffebach Jun 11, 2019
0b9dba5
Fix short circuit
pdeffebach Jun 11, 2019
3c6633a
add ColumnIndex
pdeffebach Jun 11, 2019
efb8ff7
Merge branch 'master' into vector_grouping
pdeffebach Jul 16, 2019
4991204
Merge remote-tracking branch 'pdeffebach/vector_grouping' into vector…
pdeffebach Jul 16, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
129 changes: 93 additions & 36 deletions src/groupeddataframe/grouping.jl
Original file line number Diff line number Diff line change
Expand Up @@ -322,26 +322,30 @@ function Base.map(f::Any, gd::GroupedDataFrame)
end

"""
combine(gd::GroupedDataFrame, cols => f...)
combine(gd::GroupedDataFrame, (cols => f)...)
combine(gd::GroupedDataFrame, [cols1 => f1, cols2 => f2]...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit confusing (i.e. I am starting to read this PR and it is not clear for me what is the API). Typically we use a type signature. Do you mean here ::AbstractVector{<:Pair{Symbol, Callable}}? Can multiple vectors be passed and can this vector be mixed with non vector cols=>f syntax? If for all question the answer is yes maybe write combine(gd::GroupedDataFrame, args...) and later explain what each entry of args can be?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see below that we allow mixing, so probably the args approach would be cleanest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think cols can be either a Symbol or a Tuple of Symbols. Yes f should be Callable. But I thought Julia was no longer using the word Callable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Base.Callable is defined still. But as I have written later probably it is best to write agrs and explain later the details.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add args..., but having a relatively readable list of common cases sounds useful to me.

combine(gd::GroupedDataFrame; (colname = cols => f)...)
combine(gd::GroupedDataFrame, f)
combine(f, gd::GroupedDataFrame)

Transform a [`GroupedDataFrame`](@ref) into a `DataFrame`.

If the last argument(s) consist(s) in one or more `cols => f` pair(s), or if
`colname = cols => f` keyword arguments are provided, `cols` must be
a column name or index, or a vector or tuple thereof, and `f` must be a callable.
A pair or a (named) tuple of pairs can also be provided as the first or last argument.
If `cols` is a single column index, `f` is called with a `SubArray` view into that
column for each group; else, `f` is called with a named tuple holding `SubArray`
views into these columns.
The last argument(s) in `combine` can be either:

If the last argument is a callable `f`, it is passed a [`SubDataFrame`](@ref) view for each group,
and the returned `DataFrame` then consists of the returned rows plus the grouping columns.
Note that this second form is much slower than the first one due to type instability.
A method is defined with `f` as the first argument, so do-block
notation can be used.
* One or several `cols => f` pairs, or vectors or tuples of such pairs (mixing is allowed). `cols`
must be a column name or index in `gd`, or a vector or tuple thereof. `f` must be callable.
If `cols` is a single column index, `f` is called with a `SubArray` view into that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe: "f is passed a view into ..."?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the wording.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the docstring for map should be kept in sync.

column for each group; else, `f` is called with a named tuple holding `SubArray`
views into these columns.
* A named tuple of `colname = cols => f` pairs or keyword arguments of such pairs,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a single NamedTuple only allowed?

where `colname` indicates the name of the column to be created in the new `DataFrame`.
Pairs must obey the same rules as above.
* A callable `f` taking a `SubDataFrame` view for each group. The returned `DataFrame`
then consists of the returned rows plus the grouping columns.
Note that this form is much slower than the others due to type instability.

A method is defined with `f` as the first argument, so do-block notation can be used.
In that case `f` can also be a named tuple of pairs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we allow a NamedTuple of Pairs for this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay this is hard for me to wrap my head around.

  1. NamedTuple for combine(n::NamedTuple, gd) works because we define the keyword argument method as follows:
function combine(gd::GroupedDataFrame; f...)
    if length(f) == 0
        Base.depwarn("combine(gd) is deprecated, use DataFrame(gd) instead", :combine)
        combine(identity, gd)
    else
        combine(values(f), gd)
    end
end

However now I see that because we splay the f... there for the keyword arguments, the following also works:

combine(gd, (n = :b => first, ))

At least I think this is why it is behaving like that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation.

The core of my question is if we want to allow this. My intuitive understanding was that the form combine(f, gd) was intended only for the case where f is a function to allow do syntax (and if we did not want this syntax we would disallow putting f before gd as normally gd should be the first argument).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we probably don't need to allow named tuples here.


`f` can return a single value, a row or multiple rows. The type of the returned value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The restriction is if we write colname = cols=f it must return a column. Right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the Pairs, including named tuple stuff, you can only return a scalar or a vector. The vectors get stacked. All pairs must return something of the same length.

Tuples of vectors do not get splatted. For instance combine(gd, :b => x -> (x, x .+ 1)) does not create two columns.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. This is my understanding, but I could not find where we specify this in the docstring.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's mentioned below:

As a special case, if a tuple or vector of pairs is passed as the first argument, each function
is required to return a single value or vector, which will produce each a separate column.

It's really hard to find a good way to organize all that stuff given how many possible combinations there are.

determines the shape of the resulting data frame:
Expand Down Expand Up @@ -407,6 +411,34 @@ julia> combine(:c => sum, gd)
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │

julia> combine(gd, [:b, :c] .=> sum)
8×2 DataFrame
│ Row │ a │ x1 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 1 │ 7 │
│ 3 │ 2 │ 3 │
│ 4 │ 2 │ 7 │
│ 5 │ 3 │ 5 │
│ 6 │ 3 │ 9 │
│ 7 │ 4 │ 5 │
│ 8 │ 4 │ 9 │

julia> combine(gd, [:b, :c] .=> sum, :c => min)
8×2 DataFrame
│ Row │ a │ x1 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 1 │ 7 │
│ 3 │ 2 │ 3 │
│ 4 │ 2 │ 7 │
│ 5 │ 3 │ 5 │
│ 6 │ 3 │ 9 │
│ 7 │ 4 │ 5 │
│ 8 │ 4 │ 9 │

julia> combine(df -> sum(df.c), gd) # Slower variant
4×2 DataFrame
│ Row │ a │ x1 │
Expand Down Expand Up @@ -436,9 +468,11 @@ function combine(f::Any, gd::GroupedDataFrame)
return gd.parent[1:0, gd.cols]
end
end

combine(gd::GroupedDataFrame, f::Any) = combine(f, gd)
combine(gd::GroupedDataFrame, f::Pair...) = combine(f, gd)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you remove this method, I guess you can also drop Tuple{Vararg{Pair}} from the signature of _combine?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That comment still applies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. It still works without this. It's deleted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out it is needed. combine(f, (:c => sum, :d => mean)) breaks without it.

I thought

combine(gd::GroupedDataFrame, f::Union{Pair, AbstractVector{<:Pair}}...) 

would capture it, but it does not. If we wanted to avoid the dispatch we could add a method for combine which collects a tuple of pairs into a Vector.

combine(gd::GroupedDataFrame, f::Pair) = combine(f, gd)

combine(gd::GroupedDataFrame, f::Union{Pair, AbstractVector{<:Pair}}...) =
combine(reduce(vcat, f), gd)

function combine(gd::GroupedDataFrame; f...)
if length(f) == 0
Expand Down Expand Up @@ -673,19 +707,22 @@ function do_f(f, x...)
end
end

function _combine(f::Union{AbstractVector{<:Pair}, Tuple{Vararg{Pair}},
function _combine(f::Union{AbstractVector{<:Pair},
Tuple{Vararg{Pair}},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old version was OK.

NamedTuple{<:Any, <:Tuple{Vararg{Pair}}}},
gd::GroupedDataFrame)
res = map(f) do p
agg = check_aggregate(last(p))
if agg isa AbstractAggregate && p isa Pair{<:ColumnIndex}

if agg isa AbstractAggregate && p isa Pair && first(p) isa ColumnIndex
incol = gd.parent[!, first(p)]
idx = gd.idx[gd.starts]
outcol = agg(incol, gd)
return idx, outcol
else
fun = do_f(last(p))
if p isa Pair{<:ColumnIndex}

if p isa Pair && first(p) isa ColumnIndex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if p isa Pair && first(p) isa ColumnIndex
if p isa Pair && first(p) isa ColumnIndex

incols = gd.parent[!, first(p)]
else
df = gd.parent[!, collect(first(p))]
Expand All @@ -705,7 +742,7 @@ function _combine(f::Union{AbstractVector{<:Pair}, Tuple{Vararg{Pair}},
if f isa NamedTuple
nams = collect(Symbol, propertynames(f))
else
nams = [f[i] isa Pair{<:ColumnIndex} ?
nams = [f[i] isa Pair && first(f[i]) isa ColumnIndex ?
Symbol(names(gd.parent)[index(gd.parent)[first(f[i])]],
'_', funname(last(f[i]))) :
Symbol('x', i)
Expand Down Expand Up @@ -924,7 +961,8 @@ function _combine_with_first!(first::Union{AbstractDataFrame,
end

"""
by(df::AbstractDataFrame, keys, cols => f...; sort::Bool = false)
by(df::AbstractDataFrame, keys, (cols => f)...; sort::Bool = false)
by(df::AbstractDataFrame, keys, [cols1 => f1, cols2 => f2]...; sort::Bool = false)
by(df::AbstractDataFrame, keys; (colname = cols => f)..., sort::Bool = false)
by(df::AbstractDataFrame, keys, f; sort::Bool = false)
by(f, df::AbstractDataFrame, keys; sort::Bool = false)
Expand All @@ -934,19 +972,22 @@ based on grouping columns `keys`, and return a `DataFrame`.

`keys` can be either a single column index, or a vector thereof.

If the last argument(s) consist(s) in one or more `cols => f` pair(s), or if
`colname = cols => f` keyword arguments are provided, `cols` must be
a column name or index, or a vector or tuple thereof, and `f` must be a callable.
A pair or a (named) tuple of pairs can also be provided as the first or last argument.
If `cols` is a single column index, `f` is called with a `SubArray` view into that
column for each group; else, `f` is called with a named tuple holding `SubArray`
views into these columns.
The third through last arguments in `combine` can can be either

If the last argument is a callable `f`, it is passed a [`SubDataFrame`](@ref) view for each group,
and the returned `DataFrame` then consists of the returned rows plus the grouping columns.
Note that this second form is much slower than the first one due to type instability.
A method is defined with `f` as the first argument, so do-block
notation can be used.
* One or several `cols => f` pairs, or vectors or tuples of such pairs (mixing is allowed). `cols`
must be a column name or index in `gd`, or a vector or tuple thereof. `f` must be callable.
If `cols` is a single column index, `f` is called with a `SubArray` view into that
column for each group; else, `f` is called with a named tuple holding `SubArray`
views into these columns.
* A named tuple of `colname = cols => f` pairs or keyword arguments of such pairs,
where `colname` indicates the name of the column to be created in the new `DataFrame`.
Pairs must obey the same rules as above.
* A callable `f` taking a `SubDataFrame` view for each group. The returned `DataFrame`
then consists of the returned rows plus the grouping columns.
Note that this form is much slower than the others due to type instability.

A method is defined with `f` as the first argument, so do-block notation can be used.
In that case `f` can also be a named tuple of pairs.

`f` can return a single value, a row or multiple rows. The type of the returned value
determines the shape of the resulting data frame:
Expand Down Expand Up @@ -1002,6 +1043,20 @@ julia> by(df, :a, :c => sum)
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │

julia> combine(gd, [:b, :c] .=> sum, :c => min)
8×2 DataFrame
│ Row │ a │ x1 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 3 │
│ 2 │ 1 │ 7 │
│ 3 │ 2 │ 3 │
│ 4 │ 2 │ 7 │
│ 5 │ 3 │ 5 │
│ 6 │ 3 │ 9 │
│ 7 │ 4 │ 5 │
│ 8 │ 4 │ 9 │

julia> by(df, :a, d -> sum(d.c)) # Slower variant
4×2 DataFrame
│ Row │ a │ x1 │
Expand Down Expand Up @@ -1062,12 +1117,14 @@ julia> by(df, :a, (:b, :c) => x -> (minb = minimum(x.b), sumc = sum(x.c)))
"""
by(d::AbstractDataFrame, cols::Any, f::Any; sort::Bool = false) =
combine(f, groupby(d, cols, sort = sort))

by(f::Any, d::AbstractDataFrame, cols::Any; sort::Bool = false) =
by(d, cols, f, sort = sort)
by(d::AbstractDataFrame, cols::Any, f::Pair; sort::Bool = false) =
combine(f, groupby(d, cols, sort = sort))
by(d::AbstractDataFrame, cols::Any, f::Pair...; sort::Bool = false) =
combine(f, groupby(d, cols, sort = sort))

by(d::AbstractDataFrame, cols::Any, f::Union{Pair, AbstractVector{<:Pair}}...;
sort::Bool = false) =
combine(reduce(vcat, f), groupby(d, cols, sort = sort))

by(d::AbstractDataFrame, cols::Any; sort::Bool = false, f...) =
combine(values(f), groupby(d, cols, sort = sort))

Expand Down
6 changes: 2 additions & 4 deletions test/deprecated.jl
Original file line number Diff line number Diff line change
Expand Up @@ -50,10 +50,8 @@ end
end
end

# deprecated combine

df = DataFrame(a=[1, 1, 2, 2, 2], b=1:5)
gd = groupby(df, :a)
df = DataFrame(a=[1, 1, 2, 2, 2], b=1:5)
gd = groupby(df, :a)
@test combine(gd) == combine(identity, gd)

@testset "categorical constructor" begin
Expand Down
33 changes: 32 additions & 1 deletion test/grouping.jl
Original file line number Diff line number Diff line change
Expand Up @@ -626,7 +626,10 @@ end
by(df, :a, (:b => sum, :c => sum,)) ==
by(df, :a, [:b => sum, :c => sum]) ==
by(df, :a, b_sum = :b => sum, c_sum = :c => sum) ==
by(d -> (b_sum=sum(d.b), c_sum=sum(d.c)), df, :a)
by(d -> (b_sum=sum(d.b), c_sum=sum(d.c)), df, :a) ==
by(df, :a, [:b => sum], :c => sum) ==
by(df, :a, [:b => sum], [:c => sum]) ==
by(df, :a, [:b, :c] .=> sum) ==
by(df, :a, d -> (b_sum=sum(d.b), c_sum=sum(d.c)))

@test by(df, :a, :b => vexp, :c => identity) ==
Expand Down Expand Up @@ -663,6 +666,9 @@ end
@test combine(gd, :b => sum, :c => sum) ==
combine(gd, (:b => sum, :c => sum,)) ==
combine(gd, [:b => sum, :c => sum]) ==
combine(gd, [:b => sum], :c => sum) ==
combine(gd, [:b => sum], [:c => sum]) ==
combine(gd, [:b, :c] .=> sum) ==
combine(gd, b_sum = :b => sum, c_sum = :c => sum) ==
combine((:b, :c) => x -> (b_sum=sum(x.b), c_sum=sum(x.c)), gd) ==
combine(gd, (:b, :c) => x -> (b_sum=sum(x.b), c_sum=sum(x.c))) ==
Expand Down Expand Up @@ -765,6 +771,31 @@ end
end
end

# Test that multiple tuples in by and combine throw errs
@testset "Tuple errors" begin
df = DataFrame(a = repeat([1, 3, 2, 4], outer=[2]),
b = repeat([2, 1], outer=[4]))
gd = groupby(df, :a)
@test_throws MethodError combine(gd, (:b => first, ), (:b => last))
@test_throws MethodError combine(gd, (:b => first, ), :b => last)
@test_throws MethodError combine(gd, (:b => first, ), [:b => last])
@test_throws MethodError combine(gd, (:b => first, ), [:b => last], :b => length)
end

@testset "Symbol argument and typle argument" begin
df = DataFrame(a = repeat([1, 3, 2, 4], outer=[2]),
b = repeat([2, 1], outer=[4]))
gd = groupby(df, :a)

function bar(x)
first(x.a)
end

@test combine(gd, :b => first, (:a, :b) => bar) ==
combine(gd; b_first = :b => first, x2 = :a => first)
end


struct TestType end
Base.isless(::TestType, ::Int) = true
Base.isless(::Int, ::TestType) = false
Expand Down