Add more splatting and appending of arguments in by #1620
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -322,26 +322,30 @@ function Base.map(f::Any, gd::GroupedDataFrame) | |||||
end | ||||||
|
||||||
""" | ||||||
combine(gd::GroupedDataFrame, cols => f...) | ||||||
combine(gd::GroupedDataFrame, (cols => f)...) | ||||||
combine(gd::GroupedDataFrame, [cols1 => f1, cols2 => f2]...) | ||||||
combine(gd::GroupedDataFrame; (colname = cols => f)...) | ||||||
combine(gd::GroupedDataFrame, f) | ||||||
combine(f, gd::GroupedDataFrame) | ||||||
|
||||||
Transform a [`GroupedDataFrame`](@ref) into a `DataFrame`. | ||||||
|
||||||
If the last argument(s) consist(s) in one or more `cols => f` pair(s), or if | ||||||
`colname = cols => f` keyword arguments are provided, `cols` must be | ||||||
a column name or index, or a vector or tuple thereof, and `f` must be a callable. | ||||||
A pair or a (named) tuple of pairs can also be provided as the first or last argument. | ||||||
If `cols` is a single column index, `f` is called with a `SubArray` view into that | ||||||
column for each group; else, `f` is called with a named tuple holding `SubArray` | ||||||
views into these columns. | ||||||
The last argument(s) in `combine` can be either: | ||||||
|
||||||
If the last argument is a callable `f`, it is passed a [`SubDataFrame`](@ref) view for each group, | ||||||
and the returned `DataFrame` then consists of the returned rows plus the grouping columns. | ||||||
Note that this second form is much slower than the first one due to type instability. | ||||||
A method is defined with `f` as the first argument, so do-block | ||||||
notation can be used. | ||||||
* One or several `cols => f` pairs, or vectors or tuples of such pairs (mixing is allowed). `cols` | ||||||
must be a column name or index in `gd`, or a vector or tuple thereof. `f` must be callable. | ||||||
If `cols` is a single column index, `f` is called with a `SubArray` view into that | ||||||
Review comment: maybe: "
Review comment: Changed the wording.
Review comment: Note that the docstring for
||||||
column for each group; else, `f` is called with a named tuple holding `SubArray` | ||||||
views into these columns. | ||||||
* A named tuple of `colname = cols => f` pairs or keyword arguments of such pairs, | ||||||
Review comment: Is a single
||||||
where `colname` indicates the name of the column to be created in the new `DataFrame`. | ||||||
Pairs must obey the same rules as above. | ||||||
* A callable `f` taking a `SubDataFrame` view for each group. The returned `DataFrame` | ||||||
then consists of the returned rows plus the grouping columns. | ||||||
Note that this form is much slower than the others due to type instability. | ||||||
|
||||||
A method is defined with `f` as the first argument, so do-block notation can be used. | ||||||
In that case `f` can also be a named tuple of pairs. | ||||||
Review comment: why do we allow a
Review comment: Okay this is hard for me to wrap my head around.
However now I see that because we splat the
At least I think this is why it is behaving like that.
Review comment: Thanks for the explanation. The core of my question is if we want to allow this. My intuitive understanding was that the form
Review comment: Yes, we probably don't need to allow named tuples here.
||||||
|
||||||
`f` can return a single value, a row or multiple rows. The type of the returned value | ||||||
Review comment: The restriction is if we write
Review comment: For the Tuples of vectors do not get splatted. For instance
Review comment: Thank you. This is my understanding, but I could not find where we specify this in the docstring.
Review comment: That's mentioned below:
It's really hard to find a good way to organize all that stuff given how many possible combinations there are.
||||||
determines the shape of the resulting data frame: | ||||||
|
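To make the `cols => f` forms described in the docstring above concrete, here is a minimal Base-only sketch of what `f` receives in each case. The names `table`, `groups`, and `apply` are hypothetical stand-ins invented for this illustration; they are not DataFrames.jl internals, and the real implementation differs.

```julia
# Toy stand-ins (assumptions, not DataFrames.jl internals):
# a "table" is a NamedTuple of equal-length vectors, and
# `groups` lists the row indices that belong to each group.
table  = (a = [1, 1, 2, 2], b = [2, 1, 2, 1], c = [1, 2, 3, 4])
groups = [[1, 2], [3, 4]]

# Single column: `f` receives a SubArray view of that column for the group.
apply(p::Pair{Symbol}, tbl, rows) =
    last(p)(view(getproperty(tbl, first(p)), rows))

# Several columns: `f` receives a named tuple of SubArray views.
function apply(p::Pair{<:Tuple{Vararg{Symbol}}}, tbl, rows)
    cols  = first(p)
    views = NamedTuple{cols}(map(c -> view(getproperty(tbl, c), rows), cols))
    return last(p)(views)
end

# One result per group for each form.
sums  = [apply(:c => sum, table, rows) for rows in groups]
diffs = [apply((:b, :c) => x -> sum(x.c) - sum(x.b), table, rows) for rows in groups]
```

The third form in the docstring, a bare callable receiving a `SubDataFrame`, is not modelled here.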
@@ -407,6 +411,34 @@ julia> combine(:c => sum, gd) | |||||
│ 3 │ 3 │ 10 │ | ||||||
│ 4 │ 4 │ 12 │ | ||||||
|
||||||
julia> combine(gd, [:b, :c] .=> sum) | ||||||
8×2 DataFrame | ||||||
│ Row │ a │ x1 │ | ||||||
│ │ Int64 │ Int64 │ | ||||||
├─────┼───────┼───────┤ | ||||||
│ 1 │ 1 │ 3 │ | ||||||
│ 2 │ 1 │ 7 │ | ||||||
│ 3 │ 2 │ 3 │ | ||||||
│ 4 │ 2 │ 7 │ | ||||||
│ 5 │ 3 │ 5 │ | ||||||
│ 6 │ 3 │ 9 │ | ||||||
│ 7 │ 4 │ 5 │ | ||||||
│ 8 │ 4 │ 9 │ | ||||||
|
||||||
julia> combine(gd, [:b, :c] .=> sum, :c => min) | ||||||
8×2 DataFrame | ||||||
│ Row │ a │ x1 │ | ||||||
│ │ Int64 │ Int64 │ | ||||||
├─────┼───────┼───────┤ | ||||||
│ 1 │ 1 │ 3 │ | ||||||
│ 2 │ 1 │ 7 │ | ||||||
│ 3 │ 2 │ 3 │ | ||||||
│ 4 │ 2 │ 7 │ | ||||||
│ 5 │ 3 │ 5 │ | ||||||
│ 6 │ 3 │ 9 │ | ||||||
│ 7 │ 4 │ 5 │ | ||||||
│ 8 │ 4 │ 9 │ | ||||||
|
||||||
julia> combine(df -> sum(df.c), gd) # Slower variant | ||||||
4×2 DataFrame | ||||||
│ Row │ a │ x1 │ | ||||||
|
@@ -436,9 +468,11 @@ function combine(f::Any, gd::GroupedDataFrame) | |||||
return gd.parent[1:0, gd.cols] | ||||||
end | ||||||
end | ||||||
|
||||||
combine(gd::GroupedDataFrame, f::Any) = combine(f, gd) | ||||||
combine(gd::GroupedDataFrame, f::Pair...) = combine(f, gd) | ||||||
Review comment: If you remove this method, I guess you can also drop
Review comment: That comment still applies.
Review comment: yes. It still works without this. It's deleted
Review comment: It turns out it is needed. I thought
would capture it, but it does not. If we wanted to avoid the dispatch we could add a method for
||||||
combine(gd::GroupedDataFrame, f::Pair) = combine(f, gd) | ||||||
|
||||||
combine(gd::GroupedDataFrame, f::Union{Pair, AbstractVector{<:Pair}}...) = | ||||||
combine(reduce(vcat, f), gd) | ||||||
|
||||||
function combine(gd::GroupedDataFrame; f...) | ||||||
bkamins marked this conversation as resolved.
|
||||||
if length(f) == 0 | ||||||
|
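The varargs method in this hunk flattens its arguments with `reduce(vcat, f)`. A small Base-only sketch of that flattening step follows; `flatten_specs` is a hypothetical helper written for this note, not a function from the PR.

```julia
# Hypothetical helper mirroring the shape of the varargs signature:
# each argument is either a single Pair or a vector of Pairs.
flatten_specs(specs::Union{Pair, AbstractVector{<:Pair}}...) = reduce(vcat, specs)

# Mixing a bare pair with a vector of pairs yields one flat vector.
mixed = flatten_specs(:b => sum, [:c => minimum, :d => maximum])

# With a single argument, `reduce` returns that argument unchanged.
single = flatten_specs(:b => sum)
```

With a single argument `reduce` never calls `vcat`, so one pair stays a `Pair` and does not become a one-element vector. Code that consumes the result therefore has to accept both a `Pair` and a vector of pairs.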
@@ -673,19 +707,22 @@ function do_f(f, x...) | |||||
end | ||||||
end | ||||||
|
||||||
function _combine(f::Union{AbstractVector{<:Pair}, Tuple{Vararg{Pair}}, | ||||||
function _combine(f::Union{AbstractVector{<:Pair}, | ||||||
Tuple{Vararg{Pair}}, | ||||||
Review comment: The old version was OK.
||||||
NamedTuple{<:Any, <:Tuple{Vararg{Pair}}}}, | ||||||
gd::GroupedDataFrame) | ||||||
res = map(f) do p | ||||||
agg = check_aggregate(last(p)) | ||||||
if agg isa AbstractAggregate && p isa Pair{<:ColumnIndex} | ||||||
|
||||||
if agg isa AbstractAggregate && p isa Pair && first(p) isa ColumnIndex | ||||||
incol = gd.parent[!, first(p)] | ||||||
idx = gd.idx[gd.starts] | ||||||
outcol = agg(incol, gd) | ||||||
return idx, outcol | ||||||
else | ||||||
fun = do_f(last(p)) | ||||||
if p isa Pair{<:ColumnIndex} | ||||||
|
||||||
if p isa Pair && first(p) isa ColumnIndex | ||||||
Review comment: Suggested change
|
||||||
incols = gd.parent[!, first(p)] | ||||||
else | ||||||
df = gd.parent[!, collect(first(p))] | ||||||
|
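The `_combine` hunk replaces the type-level test `p isa Pair{<:ColumnIndex}` with a value-level test on `first(p)`. The sketch below shows how the two checks can disagree once a pair is stored with abstract type parameters, for example after being pushed into a `Vector{Pair{Any, Any}}`. `ColIdx` is a hypothetical stand-in for DataFrames' `ColumnIndex`, and this is offered as a plausible motivation, not as the authors' stated reason.

```julia
# Hypothetical stand-in for the column-index type used in the diff.
ColIdx = Union{Integer, Symbol}

narrow = :c => sum                 # Pair{Symbol, typeof(sum)}

# Storing the pair in a loosely typed vector converts it to Pair{Any, Any}.
store = Pair{Any, Any}[]
push!(store, :c => sum)
wide = store[1]

# Type-level check: inspects the Pair's type parameters.
by_type(p) = p isa Pair{<:ColIdx}

# Value-level check: inspects the actual first element.
by_value(p) = p isa Pair && first(p) isa ColIdx
```

The value-level check keeps working regardless of how the containing vector was typed, at the cost of a runtime test on `first(p)`.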
@@ -705,7 +742,7 @@ function _combine(f::Union{AbstractVector{<:Pair}, Tuple{Vararg{Pair}}, | |||||
if f isa NamedTuple | ||||||
nams = collect(Symbol, propertynames(f)) | ||||||
else | ||||||
nams = [f[i] isa Pair{<:ColumnIndex} ? | ||||||
nams = [f[i] isa Pair && first(f[i]) isa ColumnIndex ? | ||||||
Symbol(names(gd.parent)[index(gd.parent)[first(f[i])]], | ||||||
'_', funname(last(f[i]))) : | ||||||
Symbol('x', i) | ||||||
|
@@ -924,7 +961,8 @@ function _combine_with_first!(first::Union{AbstractDataFrame, | |||||
end | ||||||
|
||||||
""" | ||||||
by(df::AbstractDataFrame, keys, cols => f...; sort::Bool = false) | ||||||
by(df::AbstractDataFrame, keys, (cols => f)...; sort::Bool = false) | ||||||
by(df::AbstractDataFrame, keys, [cols1 => f1, cols2 => f2]...; sort::Bool = false) | ||||||
by(df::AbstractDataFrame, keys; (colname = cols => f)..., sort::Bool = false) | ||||||
by(df::AbstractDataFrame, keys, f; sort::Bool = false) | ||||||
by(f, df::AbstractDataFrame, keys; sort::Bool = false) | ||||||
|
@@ -934,19 +972,22 @@ based on grouping columns `keys`, and return a `DataFrame`. | |||||
|
||||||
`keys` can be either a single column index, or a vector thereof. | ||||||
|
||||||
If the last argument(s) consist(s) in one or more `cols => f` pair(s), or if | ||||||
nalimilan marked this conversation as resolved.
|
||||||
`colname = cols => f` keyword arguments are provided, `cols` must be | ||||||
a column name or index, or a vector or tuple thereof, and `f` must be a callable. | ||||||
A pair or a (named) tuple of pairs can also be provided as the first or last argument. | ||||||
If `cols` is a single column index, `f` is called with a `SubArray` view into that | ||||||
column for each group; else, `f` is called with a named tuple holding `SubArray` | ||||||
views into these columns. | ||||||
The third through last arguments in `by` can be either: | ||||||
|
||||||
If the last argument is a callable `f`, it is passed a [`SubDataFrame`](@ref) view for each group, | ||||||
and the returned `DataFrame` then consists of the returned rows plus the grouping columns. | ||||||
Note that this second form is much slower than the first one due to type instability. | ||||||
A method is defined with `f` as the first argument, so do-block | ||||||
notation can be used. | ||||||
* One or several `cols => f` pairs, or vectors or tuples of such pairs (mixing is allowed). `cols` | ||||||
must be a column name or index in `df`, or a vector or tuple thereof. `f` must be callable. | ||||||
If `cols` is a single column index, `f` is called with a `SubArray` view into that | ||||||
column for each group; else, `f` is called with a named tuple holding `SubArray` | ||||||
views into these columns. | ||||||
* A named tuple of `colname = cols => f` pairs or keyword arguments of such pairs, | ||||||
where `colname` indicates the name of the column to be created in the new `DataFrame`. | ||||||
Pairs must obey the same rules as above. | ||||||
* A callable `f` taking a `SubDataFrame` view for each group. The returned `DataFrame` | ||||||
then consists of the returned rows plus the grouping columns. | ||||||
Note that this form is much slower than the others due to type instability. | ||||||
|
||||||
A method is defined with `f` as the first argument, so do-block notation can be used. | ||||||
In that case `f` can also be a named tuple of pairs. | ||||||
|
||||||
`f` can return a single value, a row or multiple rows. The type of the returned value | ||||||
determines the shape of the resulting data frame: | ||||||
|
@@ -1002,6 +1043,20 @@ julia> by(df, :a, :c => sum) | |||||
│ 3 │ 3 │ 10 │ | ||||||
│ 4 │ 4 │ 12 │ | ||||||
|
||||||
julia> combine(gd, [:b, :c] .=> sum, :c => min) | ||||||
8×2 DataFrame | ||||||
│ Row │ a │ x1 │ | ||||||
│ │ Int64 │ Int64 │ | ||||||
├─────┼───────┼───────┤ | ||||||
│ 1 │ 1 │ 3 │ | ||||||
│ 2 │ 1 │ 7 │ | ||||||
│ 3 │ 2 │ 3 │ | ||||||
│ 4 │ 2 │ 7 │ | ||||||
│ 5 │ 3 │ 5 │ | ||||||
│ 6 │ 3 │ 9 │ | ||||||
│ 7 │ 4 │ 5 │ | ||||||
│ 8 │ 4 │ 9 │ | ||||||
|
||||||
julia> by(df, :a, d -> sum(d.c)) # Slower variant | ||||||
4×2 DataFrame | ||||||
│ Row │ a │ x1 │ | ||||||
|
@@ -1062,12 +1117,14 @@ julia> by(df, :a, (:b, :c) => x -> (minb = minimum(x.b), sumc = sum(x.c))) | |||||
""" | ||||||
nalimilan marked this conversation as resolved.
|
||||||
by(d::AbstractDataFrame, cols::Any, f::Any; sort::Bool = false) = | ||||||
combine(f, groupby(d, cols, sort = sort)) | ||||||
|
||||||
by(f::Any, d::AbstractDataFrame, cols::Any; sort::Bool = false) = | ||||||
by(d, cols, f, sort = sort) | ||||||
by(d::AbstractDataFrame, cols::Any, f::Pair; sort::Bool = false) = | ||||||
combine(f, groupby(d, cols, sort = sort)) | ||||||
by(d::AbstractDataFrame, cols::Any, f::Pair...; sort::Bool = false) = | ||||||
combine(f, groupby(d, cols, sort = sort)) | ||||||
|
||||||
by(d::AbstractDataFrame, cols::Any, f::Union{Pair, AbstractVector{<:Pair}}...; | ||||||
sort::Bool = false) = | ||||||
combine(reduce(vcat, f), groupby(d, cols, sort = sort)) | ||||||
|
||||||
by(d::AbstractDataFrame, cols::Any; sort::Bool = false, f...) = | ||||||
combine(values(f), groupby(d, cols, sort = sort)) | ||||||
|
||||||
|
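The keyword form of `by` slurps the `colname = cols => f` arguments into `f...` and forwards `values(f)`. Here is a minimal Base-only sketch of that mechanism. `collect_specs` is a hypothetical function name, and only the keyword handling mirrors the diff.

```julia
# `sort` is a regular keyword; every other keyword is slurped into `f`.
# `values(f)` unwraps the slurped keywords into a plain NamedTuple.
collect_specs(; sort::Bool = false, f...) = (sort = sort, specs = values(f))

out = collect_specs(total = :c => sum, sort = true, low = :b => minimum)
```

Because `sort` is declared explicitly, it is not captured in `f`, so a new column cannot be named `sort` through this form.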
Review comment: This is a bit confusing (i.e. I am starting to read this PR and it is not clear to me what the API is). Typically we use a type signature. Do you mean `::AbstractVector{<:Pair{Symbol, Callable}}` here? Can multiple vectors be passed, and can this vector be mixed with the non-vector `cols => f` syntax? If the answer to all of these questions is yes, maybe write `combine(gd::GroupedDataFrame, args...)` and later explain what each entry of `args` can be?

Review comment: I can see below that we allow mixing, so the `args` approach would probably be cleanest.

Review comment: I think `cols` can be either a `Symbol` or a `Tuple` of `Symbol`s. Yes, `f` should be `Callable`. But I thought Julia was no longer using the word `Callable`.

Review comment: `Base.Callable` is still defined. But as I have written later, it is probably best to write `args` and explain the details later.

Review comment: We can add `args...`, but having a relatively readable list of common cases sounds useful to me.