Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combine and optimize mad! and mad #839

Merged
merged 1 commit into from
Dec 3, 2022

Conversation

LilithHafner
Copy link
Contributor

Generators don't have eltype (have eltype Any) so this explicitly makes an Any[] which, ironically, isn't even reused in the following code due to the isconcretetype check. A standard collect is better.

julia> @btime mad((i for i in 1:100)) # before
  13.956 μs (321 allocations: 9.59 KiB)
37.065055462640046

julia> @btime mad((i for i in 1:100)) # after
  1.621 μs (5 allocations: 4.38 KiB)
37.065055462640046

@LilithHafner
Copy link
Contributor Author

Bump, this seems pretty clearcut to me.

@andreasnoack
Copy link
Member

Maybe a good idea to add a test that counts the allocations of this case.

@mschauer
Copy link
Member

Trying to understand the code. Is median! guaranteed to preserve the elements, just change their order?

@LilithHafner
Copy link
Contributor Author

Is median! guaranteed to preserve the elements, just change their order?

No. median! "may overwrite the input vector." as per the docstring. However, its implementation does satisfy this property and as this code is in StatsBase.jl, it is okay to depend on that implementation detail. Perhaps we should strengthen the claim in median!'s docstring if folks outside of StatsBase are depending on that behavior.

@LilithHafner
Copy link
Contributor Author

Maybe a good idea to add a test that counts the allocations of this case.

Can add a rough test, but this calls median! which calls partialsort! which does not have a guaranteed limit on number of allocations.

@mschauer
Copy link
Member

mschauer commented Nov 25, 2022

And then, next question, if the center is float and the array is integer, how can x2 = collect(x) can hold the differences between the elements and the center? E.g. mad(Integer[1,1,2,2])?

@LilithHafner
Copy link
Contributor Author

Good catch. For collections that are not AbstractArrays but do have concrete eltypes that cannot hold their own middle this will error. I'll have to make a larger diff to cover all cases without any regressions.

@andreasnoack
Copy link
Member

E.g. mad(Integer[1,1,2,2])?

I don't think there is an issue with that. There is another branch https://github.com/JuliaStats/StatsBase.jl/pull/839/files#diff-f1f51914b0e1897b4ede4d6024d011c6aa91c7e74e483bae63a33f1e892d8eb9R545 that doesn't work in-place if the element type isn't concrete.

@LilithHafner
Copy link
Contributor Author

The counterexample is mad(Iterators.repeated(4, 10)).

@mschauer
Copy link
Member

There is also no point in collecting anything into x2 if c is provided. Maybe it's better to rewrite than to repair.

@LilithHafner
Copy link
Contributor Author

LilithHafner commented Nov 25, 2022

I also arrived at that conclusion and have done a rewrite. Indeed, I have avoided collecting anything into x2 in all cases.

@LilithHafner
Copy link
Contributor Author

I'd like to rebase this after #841 so that the diff can be more clear here.

@LilithHafner LilithHafner changed the title Remove harmful type specification Combine and optimize mad! and mad Nov 25, 2022
@LilithHafner
Copy link
Contributor Author

Less source code, more tests, more concrete types, fewer allocations, faster runtime, and hopefully not terribly buggy :) What more can you ask for?

This is ready for review!

c = center === nothing ? median!(x) : center
T = promote_type(typeof(c), eltype(x))
U = eltype(x)
x2 = U == T ? x : isconcretetype(U) && isconcretetype(T) && sizeof(U) == sizeof(T) ? reinterpret(T, x) : similar(x, T)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me just understand this: If U is an Int and T is Float64, this would allow to overwrite the Int array by reinterpreting it filling it with intermediate floats, but x interpreted as Ints is garbage afterwords. We should warn that x cannot be used afterwards in mad!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring begins with "Compute the median absolute deviation (MAD) of array x around center (by default, around the median), overwriting x in the process." so this warning does exist.

x can be reused, it has simply been overwritten with gibberish.

Previously, mad! began with x .= abs.(x .- center) and then partially sorted x, which makes the contents of x fairly unusable, though not quite as bizarre as x can end up now.

We can make the warning more visible if you'd like, but it's also contained in the function name itself. mad! clearly mutates its input, the warning is just that it probably mutates it in a way you aren't expecting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I am happy with "overwriting x in the process"

@LilithHafner
Copy link
Contributor Author

Is there anything else needed from me here?

@mschauer
Copy link
Member

mschauer commented Dec 2, 2022

No, if there is no protest I will merge this.

@mschauer mschauer merged commit 97a6bd4 into JuliaStats:master Dec 3, 2022
@LilithHafner LilithHafner deleted the patch-2 branch December 3, 2022 11:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants