-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix various bugs in split/apply/combine in 0.21 release #2280
Conversation
# in this case we are sure that the result GroupedDataFrame has the | ||
# same structure as the source | ||
# we do not copy data as it should be safe - we never mutate fields of gd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When do we mutate fields now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we do not mutate fields now.
It is a preparation for filter!
. But I prefer to make a copy here as it is not very expensive, and be safe that in the future changing if someone introduces mutation to GroupedDataFrame
(even if we skip filter!
for now as I assume we will do) it does not get forgotten (as it is easy to forget that this line in such a large codebase assumes that it is not mutated).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather avoid making copies preventively as long as we never mutate fields. Especially if we drop filter!
for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - changed. I have made an extensive implementation note, as GroupedDataFrame
is currently quite tricky to implement new functionality for.
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
This is what I was afraid of exactly. I will open a separate issue or PR to track this. |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
coverage only fails. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just two suggestions.
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Thank you! |
@@ -34,15 +50,19 @@ end | |||
function Base.getproperty(gd::GroupedDataFrame, f::Symbol) | |||
if f in (:idx, :starts, :ends) | |||
# Group indices are computed lazily the first time they are accessed | |||
Threads.lock(gd.lazy_lock) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this again, shouldn't this line be moved inside if getfield(gd, f) === nothing
? Ideally threads wouldn't have to lock each other when accessing indices once they have been computed. To avoid threads from computing the fields multiple times (which would be wasteful, though probably not problematic), we could check getfield(gd, f) === nothing
again after calling lock
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - I was also thinking about it. The solution is exactly as you say:
we could check
getfield(gd, f) === nothing
again after calling lock.
I just wanted to have a simpler implementation as lock
/unlock
is cheap:
julia> function f(l)
lock(l)
unlock(l)
end
f (generic function with 1 method)
julia> l = Threads.ReentrantLock()
ReentrantLock(nothing, Base.GenericCondition{Base.Threads.SpinLock}(Base.InvasiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), 0)
julia> f(l)
julia> @benchmark f($l)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 26.066 ns (0.00% GC)
median time: 26.087 ns (0.00% GC)
mean time: 27.893 ns (0.00% GC)
maximum time: 77.285 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 993
and with threads often things are more tricky than they seem.
I will open a PR with the change you propose and we can discuss there.
In this PR I fix some corner cases that were bugs in 0.21 release. It should be merged before #2279 as it should be backported.
What it covers:
GroupedDataFrame
,DataFrame
, list of transformations, orgroupcols
)GroupDataFrame
indexing that caused StackOverflowGroupedDataFrame
withungroup=false
(wrong columns could be shown as grouping columns in the past)keepkeys=false
orungroup=false
(there are several corner cases here); most likely in real usage scenarios these cases do not happen, but still we should provide a correct result here.In general correctly handling all these corner cases is tricky so probably if someone would be willing to review it it might be challenging (so thank you in advance for the efforts). I will also try to improve test coverage for this if I find other "corner cases" still not covered.