
Perf: avoid (un)premultiply when using overlayWith without alpha channel #573

Closed
strarsis opened this issue Sep 14, 2016 · 27 comments

@strarsis

strarsis commented Sep 14, 2016

It appears that calling overlayWith multiple times on the same sharp instance results
in only the last overlaid image actually being applied.
The workaround, writing to a raw buffer and loading it into a new sharp instance
each time a new image is overlaid, seems wasteful.

@lovell
Owner

lovell commented Sep 14, 2016

Your experience is as expected, but perhaps not documented clearly enough. The discussion at #241 is the best place for this.

"would be wasteful"

It's probably not as bad as you think, but happy to be corrected by evidence to the contrary.

@strarsis
Author

strarsis commented Sep 15, 2016

I prepared a benchmark for sharp and canvas; in both cases image files are
placed/drawn/overlaid onto an existing base/page image and the result is written back to disk:
https://github.com/strarsis/sharp-overlay-benchmark

The single sharp operations are nicely fast, but writing to a raw buffer and reading it back into a new sharp instance every time, which is required for overlaying multiple image files, seems to be the bottleneck.
(Please note that my benchmark code could be inefficient/terrible and itself causing performance issues.)

I get 0.70 ops/sec ±1.79% (7 runs sampled) for the canvas test on the test system.
The sharp test starts fast, then slows down exponentially towards the end (memory related?),
while with smaller and/or fewer input files it is very fast, comparable to the canvas test.

Setup: Debian Jessie amd64, node v6.3.0, sharp@0.16.0 and canvas@1.5.0.

@lovell
Owner

lovell commented Sep 16, 2016

Thanks for these benchmark tests; the bottleneck is the premultiply/unpremultiply roundtrip per overlay. The "faded out using the alpha value" statement in the cairo_paint_with_alpha docs suggests cairo will suffer from this too.

A possible performance improvement to sharp would be to skip the (un)premultiply part of the process entirely when the overlay image does not contain an alpha channel. Let's use this issue to track that work as it will benefit everyone using overlayWith, be that either once or multiple times per input image.
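To make the cost concrete, here is a hypothetical per-pixel sketch in plain node (not sharp's actual libvips-based implementation) of the premultiply/unpremultiply roundtrip an alpha composite requires. When the overlay has no alpha channel (effectively alpha = 255 everywhere), both functions reduce to the identity, which is exactly the work that can be skipped:

```javascript
// Premultiply one RGBA pixel: scale each colour channel by its alpha.
function premultiply([r, g, b, a]) {
  const f = a / 255;
  return [Math.round(r * f), Math.round(g * f), Math.round(b * f), a];
}

// Unpremultiply reverses the scaling (lossy where alpha is small).
function unpremultiply([r, g, b, a]) {
  if (a === 0) return [0, 0, 0, 0];
  const f = 255 / a;
  return [Math.round(r * f), Math.round(g * f), Math.round(b * f), a];
}
```

With a fully opaque pixel (a = 255) the scale factor is 1, so skipping the roundtrip entirely produces identical output with no per-pixel arithmetic.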

@lovell lovell reopened this Sep 16, 2016
@lovell lovell changed the title Support multiple times overlayWith per instance Perf: avoid (un)premultiply when using overlayWith without alpha channel Sep 16, 2016
@lovell lovell added this to the v0.17.0 milestone Oct 13, 2016
@lovell lovell removed this from the v0.17.0 milestone Nov 4, 2016
@sraka1

sraka1 commented Dec 22, 2016

@lovell Any updates on this? I'm trying to move some tasks from running in a background queue (current) to being on-the-fly. I do about 10-100 overlayWith statements per job and it takes a considerable/non-negligible amount of time, even on a powerful machine!

@lovell
Owner

lovell commented Dec 22, 2016

@sraka1 Not yet, but I'd be happy to help with and accept a PR if you'd like to work on it.

@lovell
Owner

lovell commented Feb 11, 2017

Commit 35859fd on the avoid-premultiply-for-overlay-without-alpha branch should do this.

@strarsis @sraka1 Are you able to verify this improves performance?

npm install lovell/sharp#avoid-premultiply-for-overlay-without-alpha

@lovell lovell added this to the v0.17.3 milestone Feb 11, 2017
@strarsis
Author

strarsis commented Feb 11, 2017

@lovell: I installed the patched sharp and ran the previously used benchmark with it:
With sharp#avoid-premultiply-for-overlay-without-alpha:
sharp x 0.02 ops/sec ±0.00% (0 runs sampled)

With current sharp release (unpatched):
Terminal session crashes after some time.

There is a definite improvement in memory consumption and stability,
but it is still much slower than canvas (0.70 ops/sec for canvas vs. 0.02 ops/sec for sharp).

You can also try it out yourself here: https://github.com/strarsis/sharp-overlay-benchmark/tree/sharp-patch
Clone it, run npm install and run the sharp test script; the patched sharp is already included in that branch.

@lovell
Owner

lovell commented Feb 12, 2017

@strarsis Thanks for checking. I've thought of a (hopefully) much faster approach for the non-alpha overlay case; leave it with me.

@lovell
Owner

lovell commented Mar 25, 2017

Commit 5c5b708 on the avoid-premultiply-for-overlay-without-alpha branch should improve things and even adds a test or two.

If the overlay image does not contain an alpha channel, and premultiplication is not otherwise required, a faster image "insert" will take place instead. This will now always be the case when overlaying non-alpha onto non-alpha.
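A rough idea of why a plain "insert" is so much cheaper, sketched in plain node (hypothetical helper, not sharp's actual implementation): when neither image has an alpha channel, compositing reduces to copying the overlay's rows straight into the base raw buffer, with no per-pixel blending at all.

```javascript
// Insert a raw overlay buffer into a raw base buffer at (left, top).
// No premultiply, no blending: just one Buffer.copy per overlay row.
function insertRaw(base, baseWidth, overlay, ovWidth, ovHeight, left, top, channels) {
  for (let row = 0; row < ovHeight; row++) {
    const src = row * ovWidth * channels;
    const dst = ((top + row) * baseWidth + left) * channels;
    overlay.copy(base, dst, src, src + ovWidth * channels);
  }
  return base;
}
```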

@lovell
Owner

lovell commented Mar 25, 2017

@strarsis You'll need to set chnls = 3 in test-sharp.js#L19 to see the benefits - I see about 0.5 ops/s, a 25x improvement :)

@strarsis
Author

strarsis commented Mar 25, 2017

Yes, it is faster now. Interestingly, it also seems to cache the previous operations:
when I cancel the test-sharp test and start it again,
it rushes to the row/column step at which it was cancelled before.

But in further test cycles it suddenly slows down a lot towards the middle/end.
Edit: after some more test cycles the terminal session started to hang;
memory is quickly saturated.

@lovell
Owner

lovell commented Mar 26, 2017

@strarsis Thanks for confirming, I noticed memory usage was high too. My first thought is that sharpCreateEmpty could be modified to use the new create feature, which should reduce memory consumption a little.

@sraka1 Are you able to test with the commit referred to above to see if it helps the performance of your use case?

@lovell
Owner

lovell commented Mar 27, 2017

The change here is slightly breaking in that it's possible for output images that used to have 4 channels to now end up with only 3, so this will have to wait until v0.18.0.

@lovell lovell modified the milestones: v0.18.0, v0.17.3 Mar 27, 2017
@strarsis
Author

strarsis commented Apr 4, 2017

Updated the sharp dependency and channel count, plus some minor improvements in the benchmark code: https://github.com/strarsis/sharp-overlay-benchmark

With a larger number of images my system hangs because of excessive memory consumption;
I even had to cold-reboot it once.

@lovell
Owner

lovell commented Apr 5, 2017

Commit 7d901fc adds this to the ridge WIP branch.

@strarsis
Author

strarsis commented Apr 6, 2017

@lovell: With the ridge branch, performance increased to 0.07 ops/sec. It runs very fast at first, then shows sporadic slowdowns/halts towards the end; it is faster after an initial "warm up", but there are heavy memory issues after the 2nd consecutive run.
Benchmark branch using ridge can be found here:
https://github.com/strarsis/sharp-overlay-benchmark/tree/ridge

@matAtWork

I'm attempting to get this technique to work in order to do multiple overlayWith calls (to stitch together stripes from headless Chrome, which can only screenshot images < 16384 pixels high). In my case the overlay sources have 4 channels. Is there a way to avoid the (un)premultiply (i.e. ignore the alpha channel), since I know in advance it will be full of zeros?

My code, if it helps:

    // (Runs inside an async function; client is a Chrome DevTools Protocol client,
    // and width, height, viewportHeight and tempFile are defined by the caller.)
    var slices = [] ;
    for (var view = 0; view < height; view += viewportHeight) {
        await client.Emulation.setVisibleSize({width: width, height: Math.min(viewportHeight, height - view)});
        await client.Emulation.forceViewport({x: 0, y: view, scale: 1});
        const capture = await client.Page.captureScreenshot('png', 100, true);
        // Collect each PNG slice together with its intended overlay position.
        slices.push([Buffer.from(capture.data, 'base64'), {
            top: view,
            left: 0
        }]) ;
    }

    // Start with a blank 3-channel base image.
    var img = sharp(null, {
        create: {
            channels: 3,
            background: "#808080",
            width: width,
            height: height
        }
    }) ;

    // Overlay one slice at a time, round-tripping through a raw buffer.
    for (var i = 0; i < slices.length; i++) {
        var composite = img.overlayWith.apply(img, slices[i]).flatten().raw().toBuffer() ;
        img = sharp(await composite, {
            raw: {
                channels: 3,
                width: width,
                height: height
            }
        }) ;
    }

    await img.jpeg().toFile(tempFile + ".jpg") ;

Each loop around the overlayWith takes around 1 sec (Mac OS X), with viewportHeight set to (only!) 512 (width is 1148). Stepping into the code (0.18.0-alpha), the buffer returned by the toBuffer() Promise always has 4 channels, and the call to flatten doesn't seem to make any difference, which is why I assume the "(un)premultiply" optimisation is not taking place.

On a side note (I know this is discussed in various issues), the variety and potential inefficiency in all this means supporting multiple overlayWith calls is a must in my opinion, even if it requires a new/updated API, e.g. Promise<sharp> sharp.overlayNow(...)

Specifically, whatever optimisations can be performed, such as buffer reuse or the (un)premultiply skip, are in general way beyond simple devs like me :)

@strarsis
Author

@matAtWork: Are you using the ridge branch of sharp which contains these new optimizations?

$ npm install github:lovell/sharp#ridge

@matAtWork

Yep

@strarsis
Author

@matAtWork: Also set chnls = 3 as described in #573 (comment) .

@matAtWork

Yeah, I saw that. The issue is that although I create the destination with channels: 3, the source images are derived from Buffers containing PNG data (from Chrome), which always seem to yield sharp() instances with 4 channels, and I can't find a way to force them down to three, which I could do given the alpha is always full of zeros. The number of channels in the call to sharp(pngBuffer, { channels: 3, ... }) seems to have no effect (unsurprisingly, given the raw data is the raw data?)

@matAtWork

Actually, forget that last bit: channels:3 doesn't work because the PNG buffer always renders into RGBA. That is probably what is defeating the optimisation. So the question becomes the more general "How do I get 3-channel raw data out of sharp?"

@strarsis
Copy link
Author

strarsis commented May 17, 2017

[removed]

@matAtWork

I've tried various combinations of flatten already. It seems to "apply" any alpha to the underlying pixel data (which is a no-op in my case), but toBuffer() still renders 4-channel raw data, whereas I need it to produce 3-channel RGB data. I'll try fudging the data by hand to see if it speeds up the overlay, at least.
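For the fudging-by-hand route, a minimal sketch in plain node of stripping the alpha channel from a raw RGBA buffer (hypothetical helper, under the assumption that the data is tightly packed RGBA):

```javascript
// Strip the alpha channel from a raw RGBA buffer, yielding raw RGB.
function rgbaToRgb(rgba) {
  const pixels = rgba.length / 4;
  const rgb = Buffer.allocUnsafe(pixels * 3);
  for (let i = 0; i < pixels; i++) {
    rgba.copy(rgb, i * 3, i * 4, i * 4 + 3); // copy R, G, B; skip A
  }
  return rgb;
}
```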

@strarsis
Author

strarsis commented May 17, 2017

@matAtWork: Also see #680 , #534 .

@matAtWork

Sorry if this has all gone a bit off-topic, but I found the fastest solution (by a long, long way) was simply to append the raw buffers (either 3 or 4 channel) produced by sharp(pngSlices) in memory using node, and then to use sharp() again to generate an image based on the big buffer and write it to disk.

When I say faster, I mean around 10x faster:

var slices = [] ;
for (var view = 0; view < height; view += viewportHeight) {
    await client.Emulation.setVisibleSize({width: width, height: Math.min(viewportHeight, height - view)});
    await client.Emulation.forceViewport({x: 0, y: view, scale: 1});
    const capture = await client.Page.captureScreenshot('png', 100, true);
    // Gather each slice produced by Chrome (in PNG format) into Promises
    // for raw buffers decoded by sharp().
    slices.push(sharp(Buffer.from(capture.data, 'base64')).raw().toBuffer({resolveWithObject: true})) ;
}

slices = await Promise.all(slices) ;

// Concat all the raw data together.
// (Assumes height is an exact multiple of viewportHeight, so every slice
// has the same byte size.)
var bufSize = width * viewportHeight * slices[0].info.channels ;
var composite = Buffer.allocUnsafe(bufSize * slices.length) ;
slices.forEach((s, i) => s.data.copy(composite, i * bufSize)) ;
// Get sharp() to produce the final image from the concatenated raw data.
await sharp(composite, {
    raw: {
        width: width,
        height: height,
        channels: slices[0].info.channels
    }
}).jpeg().toFile(tempFile + ".jpg") ;

So the solution here is to avoid overlayWith, and its complications, altogether.

Thanks for all your hints/tips, they definitely helped me understand the issues a lot better.

@lovell
Owner

lovell commented May 30, 2017

v0.18.0 now available with this improvement, thanks all for the feedback.

@lovell lovell closed this as completed May 30, 2017