'Shortcut' in fitting Histogram if binning is uniform / even #563

Moelf · 2020-02-29T05:09:57Z

Lines 229 to 235 in 3762c78

    
           @inline function _edge_binindex(edge::AbstractVector, closed::Symbol, x::Real) 
        
               if closed == :right 
        
                   searchsortedfirst(edge, x) - 1 
        
               else 
        
                   searchsortedlast(edge, x) 
        
               end 
        
           end

if binning is even, we can do a O(N) edge finding, this is useful when fitting large number of events.

The text was updated successfully, but these errors were encountered:

Moelf · 2020-02-29T07:15:14Z

after some fiddling I have concluded this should be done atfit(Histogram ....) level, for example, if nbins and edges = (a, b) are both provided, do something like:

function myfit(edge::AbstractVector, closed::Symbol, x::Array{<:Real})    
    diffs = diff(edge)        
    diff1 = diffs[1]        
    if all(isapprox.(diff1, diffs))        
        weights = zeros(length(edge)-1)       
        @views if closed == :right                   
            for i in x                            
                weights[ceil(Int, (i-edge[1]) / diff1)] += 1        
            end                      
        else                                                                  
            for i in x                                                                                                              
                weights[floor(Int, (i-edge[1]) / diff1) + 1] += 1                                                                   
            end      
        end                                                                                                                         
        return edge, weights, closed                                                                                                
    end                                                                                                                             
end

Potentially we would have some gain:

julia> @benchmark Histogram(myfit(0:0.001:1, :right, $(rand(10^8)))...)
BenchmarkTools.Trial: 
  memory estimate:  20.52 KiB
  allocs estimate:  8
  --------------
  minimum time:     781.907 ms (0.00% GC)
  median time:      789.584 ms (0.00% GC)
  mean time:        790.228 ms (0.00% GC)
  maximum time:     799.641 ms (0.00% GC)
  --------------
  samples:          7
  evals/sample:     1

julia> @benchmark fit(Histogram, $(rand(10^8)), 0:0.001:1)
BenchmarkTools.Trial: 
  memory estimate:  8.03 KiB
  allocs estimate:  3
  --------------
  minimum time:     2.624 s (0.00% GC)
  median time:      2.637 s (0.00% GC)
  mean time:        2.637 s (0.00% GC)
  maximum time:     2.650 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1


julia> a = rand(10^5);

julia> fit(Histogram, a, 0:0.1:1, closed=:right) == Histogram(myfit(0:0.1:1, :right, a)...)
true

julia> fit(Histogram, a, 0:0.1:1, closed=:left) == Histogram(myfit(0:0.1:1, :left, a)...)
true

Moelf · 2020-06-15T08:14:00Z

bump for pointer to where to implement this since it looks like append! is called recursively in the current implementation;

we shouldn't miss a ~ x10 improvement

Nosferican · 2020-10-29T10:20:46Z

For large histogram fitting at scale it might also be good to make it work by passing the value-count pairs as well. That is probably a different approach for histogram with numerous data... currently I use the value-count pairs and run a bar plot since histogram can't handle that scale.

Moelf · 2020-10-29T13:57:25Z

@Nosferican huh? how did you obtain the value-count pair in the first place? also this is not only for 1D but also higher dimension.

Nosferican · 2020-10-29T14:10:00Z

Something like countmap would be the most straightforward.

Moelf · 2020-10-29T14:11:10Z

won't work if I have continuous data.

Nosferican · 2020-10-29T14:13:24Z

Aye. For continuous data you need to cut/bin of some kind which depends on the algorithm used.

Moelf · 2020-10-29T16:25:35Z

update:
it seems that the cause for the slowness is _edge_binindex thus the searchsorted, compare the following:
1.

function fit(::Type{Histogram{T}}, v::AbstractVector, edg::AbstractVector; closed::Symbol=:left) where {T}
    diffs = diff(edg)
    diff1 = diffs[1]
    weights = zeros(length(edg)-1)
    _edg = if all(isapprox.(diff1, diffs))
        edg[begin]:diff1:edg[end]
    else
        edg
    end
    if closed == :right
        for x in v
            weights[ceil(Int, (x-edg[1]) / diff1)] += 1
        end
    else
        for x in v
            weights[floor(Int, (x-edg[1]) / diff1) + 1] += 1
        end
    end
    Histogram(edg, weights, closed)
end

julia> @btime fit(Histogram, $(rand(10^6)), 0:0.1:1)
  2.162 ms (5 allocations: 528 bytes)

if we replace the ceil(Int, (x-edg[1]) / diff1) by one of the following:

            # weights[searchsortedfirst(_edg, x) - 1 ] += 1
            # weights[_edge_binindex(_edg, closed, x)] += 1

we immediately get slow down

julia> @btime fit(Histogram, $(rand(10^6)), 0:0.1:1)
  28.270 ms (5 allocations: 528 bytes)

Moelf · 2020-10-29T16:56:26Z

basically it comes down to we shouldn't use base searchsorted:

julia> @btime searchsortedfirst(0.2:0.5:30, 12, Base.Order.ForwardOrdering())
  120.920 ns (0 allocations: 0 bytes)
25

julia> @btime ceil(Int, (12-0.2) / 0.5)
  0.020 ns (0 allocations: 0 bytes)
24

mschauer · 2020-10-29T18:42:29Z

I’d rather say let’s check if search sorted could be made faster here

oschulz · 2020-11-07T22:34:15Z

@Moelf if binning is even, we can do a O(N) edge finding, this is useful when fitting large number of events.

We could, of course, add a method _edge_binindex(edge::AbstractRange, closed::Symbol, x::Real). But _edge_binindex should already run in O(1) for AbstractRange because searchsortedfirst and similar are already specialized for AbstractRange:

https://github.com/JuliaLang/julia/blob/28330a2fef4d9d149ba0fd3ffa06347b50067647/base/sort.jl#L232

oschulz · 2020-11-07T22:36:14Z

I wonder why searchsortedfirst takes so long, though - but it should run in constant time.

Moelf · 2021-08-31T14:22:52Z

#613 (comment)

Moelf mentioned this issue Oct 29, 2020

faster bin index for uniform edge #613

Closed

Moelf closed this as completed Aug 31, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

'Shortcut' in fitting Histogram if binning is uniform / even #563

'Shortcut' in fitting Histogram if binning is uniform / even #563

Moelf commented Feb 29, 2020 •

edited

Loading

Moelf commented Feb 29, 2020 •

edited

Loading

Moelf commented Jun 15, 2020 •

edited

Loading

Nosferican commented Oct 29, 2020

Moelf commented Oct 29, 2020

Nosferican commented Oct 29, 2020

Moelf commented Oct 29, 2020

Nosferican commented Oct 29, 2020

Moelf commented Oct 29, 2020

Moelf commented Oct 29, 2020 •

edited

Loading

mschauer commented Oct 29, 2020

oschulz commented Nov 7, 2020 •

edited

Loading

oschulz commented Nov 7, 2020

Moelf commented Aug 31, 2021

'Shortcut' in fitting Histogram if binning is uniform / even #563

'Shortcut' in fitting Histogram if binning is uniform / even #563

Comments

Moelf commented Feb 29, 2020 • edited Loading

Moelf commented Feb 29, 2020 • edited Loading

Moelf commented Jun 15, 2020 • edited Loading

Nosferican commented Oct 29, 2020

Moelf commented Oct 29, 2020

Nosferican commented Oct 29, 2020

Moelf commented Oct 29, 2020

Nosferican commented Oct 29, 2020

Moelf commented Oct 29, 2020

Moelf commented Oct 29, 2020 • edited Loading

mschauer commented Oct 29, 2020

oschulz commented Nov 7, 2020 • edited Loading

oschulz commented Nov 7, 2020

Moelf commented Aug 31, 2021

Moelf commented Feb 29, 2020 •

edited

Loading

Moelf commented Feb 29, 2020 •

edited

Loading

Moelf commented Jun 15, 2020 •

edited

Loading

Moelf commented Oct 29, 2020 •

edited

Loading

oschulz commented Nov 7, 2020 •

edited

Loading