Fix various bugs in split/apply/combine in 0.21 release #2280

bkamins · 2020-06-10T22:35:43Z

In this PR I fix some corner cases that were bugs in the 0.21 release. It should be merged before #2279, as it should be backported.

What it covers:

documentation improvements describing what happens (and should happen) in corner cases, especially when something is "empty", where "something" is a GroupedDataFrame, a DataFrame, a list of transformations, or groupcols
fix a bug in GroupedDataFrame indexing that caused a stack overflow
fix a bug in the creation of a GroupedDataFrame with ungroup=false (previously, the wrong columns could be shown as grouping columns)
fix cases where "empty" inputs (listed above) are passed, especially when keepkeys=false or ungroup=false (there are several corner cases here); these cases most likely do not occur in real usage, but we should still return a correct result.

In general, handling all these corner cases correctly is tricky, so reviewing this PR may be challenging (thank you in advance for the effort). I will also try to improve test coverage if I find other corner cases that are still not covered.

src/abstractdataframe/selection.jl

src/groupeddataframe/splitapplycombine.jl

test/grouping.jl

nalimilan · 2020-06-19T08:43:41Z

src/groupeddataframe/splitapplycombine.jl

            # in this case we are sure that the result GroupedDataFrame has the
-            # same structure as the source
-            # we do not copy data as it should be safe - we never mutate fields of gd


When do we mutate fields now?

We do not mutate fields now.
This is preparation for filter!. But I prefer to make a copy here, as it is not very expensive, and to be safe: if someone introduces mutation of GroupedDataFrame in the future (even if we skip filter! for now, as I assume we will), this will not get forgotten (it is easy to forget, in such a large codebase, that this line assumes the fields are not mutated).

I'd rather avoid making copies preventively as long as we never mutate fields. Especially if we drop filter! for now.

OK - changed. I have added an extensive implementation note, as it is currently quite tricky to implement new functionality for GroupedDataFrame.

src/groupeddataframe/splitapplycombine.jl

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins · 2020-06-19T10:03:16Z

Code that works with a nonzero number of groups is likely to try to access new columns and therefore fail later.

This is exactly what I was afraid of. I will open a separate issue or PR to track this.

src/groupeddataframe/splitapplycombine.jl

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins · 2020-06-22T20:33:36Z

Only the coverage check fails.

nalimilan

Just two suggestions.

src/groupeddataframe/groupeddataframe.jl

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins · 2020-06-23T17:07:51Z

Thank you!

nalimilan · 2020-06-24T08:55:58Z

src/groupeddataframe/groupeddataframe.jl

@@ -34,15 +50,19 @@ end
 function Base.getproperty(gd::GroupedDataFrame, f::Symbol)
    if f in (:idx, :starts, :ends)
        # Group indices are computed lazily the first time they are accessed
+        Threads.lock(gd.lazy_lock)


Thinking about this again, shouldn't this line be moved inside if getfield(gd, f) === nothing? Ideally threads wouldn't have to lock each other out when accessing indices once they have been computed. To prevent threads from computing the fields multiple times (which would be wasteful, though probably not problematic), we could check getfield(gd, f) === nothing again after calling lock.

Yes - I was also thinking about it. The solution is exactly as you say:

we could check getfield(gd, f) === nothing again after calling lock.

I just wanted to have a simpler implementation as lock/unlock is cheap:

julia> function f(l)
           lock(l)
           unlock(l)
       end
f (generic function with 1 method)

julia> l = Threads.ReentrantLock()
ReentrantLock(nothing, Base.GenericCondition{Base.Threads.SpinLock}(Base.InvasiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), 0)

julia> f(l)

julia> @benchmark f($l)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     26.066 ns (0.00% GC)
  median time:      26.087 ns (0.00% GC)
  mean time:        27.893 ns (0.00% GC)
  maximum time:     77.285 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     993

and with threads often things are more tricky than they seem.

I will open a PR with the change you propose and we can discuss there.
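For readers following the thread, the "check, lock, check again" idea discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this PR or the follow-up: `LazyIndex`, `get_idx`, and the `ncomputed` counter are hypothetical names, `sortperm` stands in for the real group-index computation, and the `@atomic` field requires Julia 1.7 or later (newer than the Julia versions current when this PR was written).

```julia
# Hypothetical stand-in for GroupedDataFrame: `idx` is computed lazily on
# first access. The @atomic field annotation assumes Julia 1.7+.
mutable struct LazyIndex
    groups::Vector{Int}
    @atomic idx::Union{Nothing, Vector{Int}}
    lazy_lock::ReentrantLock
    ncomputed::Threads.Atomic{Int}   # counts how often the expensive step ran
end

LazyIndex(groups) =
    LazyIndex(groups, nothing, ReentrantLock(), Threads.Atomic{Int}(0))

function get_idx(li::LazyIndex)
    # Fast path: once the field is set, no lock is taken.
    idx = @atomic li.idx
    idx === nothing || return idx
    lock(li.lazy_lock) do
        # Second check under the lock: another task may have computed
        # the field while this one was waiting.
        idx2 = @atomic li.idx
        if idx2 === nothing
            Threads.atomic_add!(li.ncomputed, 1)
            idx2 = sortperm(li.groups)   # placeholder for the real computation
            @atomic li.idx = idx2
        end
        return idx2
    end
end

# Several tasks request the index concurrently.
li = LazyIndex([3, 1, 2])
tasks = [Threads.@spawn get_idx(li) for _ in 1:8]
results = fetch.(tasks)
```

The atomic read and write are there so that the unlocked fast path is not a data race; whether the extra complexity is worth it compared with always locking depends on how contended the lock is in practice.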

bkamins added 2 commits June 10, 2020 13:10

improve docstrings, allow empty combine(gdf)

6281f83

fix bugs and add tests

3a99848

bkamins added the bug, priority, grouping, non-breaking (the proposed change is not breaking), and backport labels Jun 10, 2020

bkamins added this to the 1.0 milestone Jun 10, 2020

bkamins mentioned this pull request Jun 10, 2020

add filter and filter! to GroupedDataFrame #2279

Merged

bkamins added 2 commits June 17, 2020 17:56

make GroupedDataFrame more threadsafe

36ea388

correct line order

a2cea21

bkamins requested a review from nalimilan June 17, 2020 16:00

nalimilan reviewed Jun 19, 2020


Apply suggestions from code review

8e92887

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

fixes after code review

387640c

nalimilan reviewed Jun 19, 2020


bkamins and others added 2 commits June 19, 2020 17:50

Update src/groupeddataframe/splitapplycombine.jl

76d7ce0

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

assume GroupedDataFrame is not mutated

8818342

nalimilan approved these changes Jun 23, 2020


Apply suggestions from code review

233aa52

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins merged commit e607175 into JuliaData:master Jun 23, 2020

bkamins deleted the fix_select_combine_empty branch June 23, 2020 17:07

nalimilan reviewed Jun 24, 2020

