Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

'Shortcut' in fitting Histogram if binning is uniform / even #563

Closed
Moelf opened this issue Feb 29, 2020 · 13 comments
Closed

'Shortcut' in fitting Histogram if binning is uniform / even #563

Moelf opened this issue Feb 29, 2020 · 13 comments

Comments

@Moelf
Copy link
Contributor

Moelf commented Feb 29, 2020

StatsBase.jl/src/hist.jl

Lines 229 to 235 in 3762c78

@inline function _edge_binindex(edge::AbstractVector, closed::Symbol, x::Real)
if closed == :right
searchsortedfirst(edge, x) - 1
else
searchsortedlast(edge, x)
end
end

if binning is even, we can do a O(N) edge finding, this is useful when fitting large number of events.

@Moelf
Copy link
Contributor Author

Moelf commented Feb 29, 2020

after some fiddling I have concluded this should be done atfit(Histogram ....) level, for example, if nbins and edges = (a, b) are both provided, do something like:

function myfit(edge::AbstractVector, closed::Symbol, x::Array{<:Real})    
    diffs = diff(edge)        
    diff1 = diffs[1]        
    if all(isapprox.(diff1, diffs))        
        weights = zeros(length(edge)-1)       
        @views if closed == :right                   
            for i in x                            
                weights[ceil(Int, (i-edge[1]) / diff1)] += 1        
            end                      
        else                                                                  
            for i in x                                                                                                              
                weights[floor(Int, (i-edge[1]) / diff1) + 1] += 1                                                                   
            end      
        end                                                                                                                         
        return edge, weights, closed                                                                                                
    end                                                                                                                             
end 

Potentially we would have some gain:

julia> @benchmark Histogram(myfit(0:0.001:1, :right, $(rand(10^8)))...)
BenchmarkTools.Trial: 
  memory estimate:  20.52 KiB
  allocs estimate:  8
  --------------
  minimum time:     781.907 ms (0.00% GC)
  median time:      789.584 ms (0.00% GC)
  mean time:        790.228 ms (0.00% GC)
  maximum time:     799.641 ms (0.00% GC)
  --------------
  samples:          7
  evals/sample:     1

julia> @benchmark fit(Histogram, $(rand(10^8)), 0:0.001:1)
BenchmarkTools.Trial: 
  memory estimate:  8.03 KiB
  allocs estimate:  3
  --------------
  minimum time:     2.624 s (0.00% GC)
  median time:      2.637 s (0.00% GC)
  mean time:        2.637 s (0.00% GC)
  maximum time:     2.650 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1


julia> a = rand(10^5);

julia> fit(Histogram, a, 0:0.1:1, closed=:right) == Histogram(myfit(0:0.1:1, :right, a)...)
true

julia> fit(Histogram, a, 0:0.1:1, closed=:left) == Histogram(myfit(0:0.1:1, :left, a)...)
true

@Moelf
Copy link
Contributor Author

Moelf commented Jun 15, 2020

bump for pointer to where to implement this since it looks like append! is called recursively in the current implementation;

we shouldn't miss a ~ x10 improvement

@Nosferican
Copy link
Contributor

For large histogram fitting at scale it might also be good to make it work by passing the value-count pairs as well. That is probably a different approach for histogram with numerous data... currently I use the value-count pairs and run a bar plot since histogram can't handle that scale.

@Moelf
Copy link
Contributor Author

Moelf commented Oct 29, 2020

@Nosferican huh? how did you obtain the value-count pair in the first place? also this is not only for 1D but also higher dimension.

@Nosferican
Copy link
Contributor

Something like countmap would be the most straightforward.

@Moelf
Copy link
Contributor Author

Moelf commented Oct 29, 2020

won't work if I have continuous data.

@Nosferican
Copy link
Contributor

Aye. For continuous data you need to cut/bin of some kind which depends on the algorithm used.

@Moelf
Copy link
Contributor Author

Moelf commented Oct 29, 2020

update:
it seems that the cause for the slowness is _edge_binindex thus the searchsorted, compare the following:
1.

function fit(::Type{Histogram{T}}, v::AbstractVector, edg::AbstractVector; closed::Symbol=:left) where {T}
    diffs = diff(edg)
    diff1 = diffs[1]
    weights = zeros(length(edg)-1)
    _edg = if all(isapprox.(diff1, diffs))
        edg[begin]:diff1:edg[end]
    else
        edg
    end
    if closed == :right
        for x in v
            weights[ceil(Int, (x-edg[1]) / diff1)] += 1
        end
    else
        for x in v
            weights[floor(Int, (x-edg[1]) / diff1) + 1] += 1
        end
    end
    Histogram(edg, weights, closed)
end
julia> @btime fit(Histogram, $(rand(10^6)), 0:0.1:1)
  2.162 ms (5 allocations: 528 bytes)

if we replace the ceil(Int, (x-edg[1]) / diff1) by one of the following:

            # weights[searchsortedfirst(_edg, x) - 1 ] += 1
            # weights[_edge_binindex(_edg, closed, x)] += 1

we immediately get slow down

julia> @btime fit(Histogram, $(rand(10^6)), 0:0.1:1)
  28.270 ms (5 allocations: 528 bytes)

@Moelf
Copy link
Contributor Author

Moelf commented Oct 29, 2020

basically it comes down to we shouldn't use base searchsorted:

julia> @btime searchsortedfirst(0.2:0.5:30, 12, Base.Order.ForwardOrdering())
  120.920 ns (0 allocations: 0 bytes)
25

julia> @btime ceil(Int, (12-0.2) / 0.5)
  0.020 ns (0 allocations: 0 bytes)
24

@mschauer
Copy link
Member

I’d rather say let’s check if search sorted could be made faster here

@oschulz
Copy link
Contributor

oschulz commented Nov 7, 2020

@Moelf if binning is even, we can do a O(N) edge finding, this is useful when fitting large number of events.

We could, of course, add a method _edge_binindex(edge::AbstractRange, closed::Symbol, x::Real). But _edge_binindex should already run in O(1) for AbstractRange because searchsortedfirst and similar are already specialized for AbstractRange:

https://github.com/JuliaLang/julia/blob/28330a2fef4d9d149ba0fd3ffa06347b50067647/base/sort.jl#L232

@oschulz
Copy link
Contributor

oschulz commented Nov 7, 2020

I wonder why searchsortedfirst takes so long, though - but it should run in constant time.

@Moelf
Copy link
Contributor Author

Moelf commented Aug 31, 2021

#613 (comment)

@Moelf Moelf closed this as completed Aug 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants