
Perf: avoid (un)premultiply when using overlayWith without alpha channel #573

Closed
strarsis opened this issue Sep 14, 2016 · 27 comments

@strarsis

strarsis commented Sep 14, 2016

It appears that calling overlayWith multiple times on the same sharp instance results
in only the last overlaid image actually being applied.
The workaround, writing to a raw buffer and loading it into a new sharp instance
each time a new image is overlaid, seems wasteful.

@lovell
Owner

lovell commented Sep 14, 2016

Your experience is as expected, but perhaps not documented clearly enough. The discussion at #241 is the best place for this.

"would be wasteful"

It's probably not as bad as you think, but happy to be corrected by evidence to the contrary.

@strarsis
Author

strarsis commented Sep 15, 2016

I prepared a benchmark for sharp and canvas; in both cases image files are
placed/drawn/overlaid onto an existing base/page image and the result is written back to disk:
https://github.com/strarsis/sharp-overlay-benchmark

The single sharp operations are nicely fast, but writing to a raw buffer and reading it back into a new sharp instance every time, which is required for overlaying multiple image files, seems to be the bottleneck.
(Please note that my benchmark code could be inefficient/terrible and itself causing performance issues.)

I get 0.70 ops/sec ±1.79% (7 runs sampled) for the canvas test on the test system.
The sharp test starts fast, then slows down exponentially towards the end (memory related?),
while with smaller and/or fewer input files it is very fast, comparable to the canvas test.

Setup: Debian Jessie amd64, node v6.3.0, sharp@0.16.0 and canvas@1.5.0.

@lovell
Owner

lovell commented Sep 16, 2016

Thanks for these benchmark tests; the bottleneck is the premultiply/unpremultiply roundtrip per overlay. The "faded out using the alpha value" statement in the cairo_paint_with_alpha docs suggests cairo will suffer from this too.

A possible performance improvement to sharp would be to skip the (un)premultiply part of the process entirely when the overlay image does not contain an alpha channel. Let's use this issue to track that work as it will benefit everyone using overlayWith, be that either once or multiple times per input image.
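To make the cost concrete, here is a hypothetical per-pixel sketch in plain node (not sharp's actual libvips-based implementation) of the premultiply/unpremultiply roundtrip an alpha composite requires. When the overlay has no alpha channel (effectively alpha = 255 everywhere), both functions reduce to the identity, which is exactly the work that can be skipped:

```javascript
// Premultiply one RGBA pixel: scale each colour channel by its alpha.
function premultiply([r, g, b, a]) {
  const f = a / 255;
  return [Math.round(r * f), Math.round(g * f), Math.round(b * f), a];
}

// Unpremultiply reverses the scaling (lossy where alpha is small).
function unpremultiply([r, g, b, a]) {
  if (a === 0) return [0, 0, 0, 0];
  const f = 255 / a;
  return [Math.round(r * f), Math.round(g * f), Math.round(b * f), a];
}
```

With a fully opaque pixel (a = 255) the scale factor is 1, so skipping the roundtrip entirely produces identical output with no per-pixel arithmetic.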

@lovell lovell reopened this Sep 16, 2016
@lovell lovell changed the title Support multiple times overlayWith per instance Perf: avoid (un)premultiply when using overlayWith without alpha channel Sep 16, 2016
@lovell lovell added this to the v0.17.0 milestone Oct 13, 2016
@lovell lovell removed this from the v0.17.0 milestone Nov 4, 2016
@sraka1

sraka1 commented Dec 22, 2016

@lovell Any updates on this? I'm trying to move some tasks from running in a background queue (current) to being on-the-fly. I do about 10-100 overlayWith statements per job and it takes a considerable/non-negligible amount of time, even on a powerful machine!

@lovell
Owner

lovell commented Dec 22, 2016

@sraka1 Not yet, but I'd be happy to help with and accept a PR if you'd like to work on it.

@lovell
Owner

lovell commented Feb 11, 2017

Commit 35859fd on the avoid-premultiply-for-overlay-without-alpha branch should do this.

@strarsis @sraka1 Are you able to verify this improves performance?

npm install lovell/sharp#avoid-premultiply-for-overlay-without-alpha

@lovell lovell added this to the v0.17.3 milestone Feb 11, 2017
@strarsis
Author

strarsis commented Feb 11, 2017

@lovell: I installed the patched sharp and ran the previously used benchmark with it:
With sharp#avoid-premultiply-for-overlay-without-alpha:
sharp x 0.02 ops/sec ±0.00% (0 runs sampled)

With current sharp release (unpatched):
Terminal session crashes after some time.

There is a definite improvement in memory consumption and stability,
but it is still much slower than canvas (0.70 ops/sec for canvas vs. 0.02 ops/sec for sharp).

You can also try it out yourself here: https://github.com/strarsis/sharp-overlay-benchmark/tree/sharp-patch
Clone it, run npm install and run the sharp test script; the patched sharp is already included in that branch.

@lovell
Owner

lovell commented Feb 12, 2017

@strarsis Thanks for checking. I've thought of a (hopefully) much faster approach for the non-alpha overlay case; leave it with me.

@lovell
Owner

lovell commented Mar 25, 2017

Commit 5c5b708 on the avoid-premultiply-for-overlay-without-alpha branch should improve things and even adds a test or two.

If the overlay image does not contain an alpha channel, and premultiplication is not otherwise required, a faster image "insert" will take place instead. This will now always be the case when overlaying non-alpha onto non-alpha.
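A rough idea of why a plain "insert" is so much cheaper, sketched in plain node (hypothetical helper, not sharp's actual implementation): when neither image has an alpha channel, compositing reduces to copying the overlay's rows straight into the base raw buffer, with no per-pixel blending at all.

```javascript
// Insert a raw overlay buffer into a raw base buffer at (left, top).
// No premultiply, no blending: just one Buffer.copy per overlay row.
function insertRaw(base, baseWidth, overlay, ovWidth, ovHeight, left, top, channels) {
  for (let row = 0; row < ovHeight; row++) {
    const src = row * ovWidth * channels;
    const dst = ((top + row) * baseWidth + left) * channels;
    overlay.copy(base, dst, src, src + ovWidth * channels);
  }
  return base;
}
```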

@lovell
Owner

lovell commented Mar 25, 2017

@strarsis You'll need to set chnls = 3 in test-sharp.js#L19 to see the benefits - I see about 0.5 ops/s, a 25x improvement :)

@strarsis
Author

strarsis commented Mar 25, 2017

Yes, it is faster now. Interestingly, it also seems to cache the previous operations:
when I cancel the test-sharp test and start it again,
it rushes to the row/column step at which it was cancelled before.

But in further test cycles it suddenly slows down a lot towards the middle/end.
Edit: after some more test cycles the terminal session started to hang;
memory is quickly saturated.

@lovell
Owner

lovell commented Mar 26, 2017

@strarsis Thanks for confirming, I noticed memory usage was high too. My first thought is that sharpCreateEmpty could be modified to use the new create feature, which should reduce memory consumption a little.

@sraka1 Are you able to test with the commit referred to above to see if it helps the performance of your use case?

@lovell
Owner

lovell commented Mar 27, 2017

The change here is slightly breaking in that it's possible for output images that used to have 4 channels to now end up with only 3, so this will have to wait until v0.18.0.

@lovell lovell modified the milestones: v0.18.0, v0.17.3 Mar 27, 2017
@strarsis
Author

strarsis commented Apr 4, 2017

Updated the sharp dependency and channel count, plus some minor improvements in the benchmark code: https://github.com/strarsis/sharp-overlay-benchmark

With a larger number of images my system hangs because of excessive memory consumption;
I even had to cold-reboot it once.

@lovell
Owner

lovell commented Apr 5, 2017

Commit 7d901fc adds this to the ridge WIP branch.

@strarsis
Author

strarsis commented Apr 6, 2017

@lovell: With the ridge branch, performance increased to 0.07 ops/sec. It runs very fast at first, then shows sporadic slowdowns/halts towards the end; it is faster after an initial "warm up", but there are heavy memory issues after the 2nd consecutive run.
Benchmark branch using ridge can be found here:
https://github.com/strarsis/sharp-overlay-benchmark/tree/ridge

@matAtWork

I'm attempting to get this technique to work in order to do multiple overlayWith calls (to stitch together stripes from headless Chrome, which can only screenshot images < 16384 pixels high). In my case the overlay sources have 4 channels. Is there a way to avoid the (un)premultiply (i.e. ignore the alpha channel), since I know in advance it will be full of zeros?

My code, if it helps:

    // (Runs inside an async function; client is a Chrome DevTools Protocol client,
    // and width, height, viewportHeight and tempFile are defined by the caller.)
    var slices = [] ;
    for (var view = 0; view < height; view += viewportHeight) {
        await client.Emulation.setVisibleSize({width: width, height: Math.min(viewportHeight, height - view)});
        await client.Emulation.forceViewport({x: 0, y: view, scale: 1});
        const capture = await client.Page.captureScreenshot('png', 100, true);
        // Collect each PNG slice together with its intended overlay position.
        slices.push([Buffer.from(capture.data, 'base64'), {
            top: view,
            left: 0
        }]) ;
    }

    // Start with a blank 3-channel base image.
    var img = sharp(null, {
        create: {
            channels: 3,
            background: "#808080",
            width: width,
            height: height
        }
    }) ;

    // Overlay one slice at a time, round-tripping through a raw buffer.
    for (var i = 0; i < slices.length; i++) {
        var composite = img.overlayWith.apply(img, slices[i]).flatten().raw().toBuffer() ;
        img = sharp(await composite, {
            raw: {
                channels: 3,
                width: width,
                height: height
            }
        }) ;
    }

    await img.jpeg().toFile(tempFile + ".jpg") ;

Each loop around the overlayWith takes around 1 sec (Mac OS X), with viewportHeight set to (only!) 512 (width is 1148). Stepping into the code (0.18.0-alpha), the buffer returned by the toBuffer() Promise always has 4 channels, and the call to flatten doesn't seem to make any difference, which is why I assume the "(un)premultiply" optimisation is not taking place.

On a side note (I know this is discussed in various issues), the variety and potential inefficiency in all this means supporting multiple overlayWith calls is a must in my opinion, even if it requires a new/updated API, e.g. Promise<sharp> sharp.overlayNow(...)

Specifically, whatever optimisations can be performed, such as buffer reuse or the (un)premultiply skip, are in general way beyond simple devs like me :)

@strarsis
Author

@matAtWork: Are you using the ridge branch of sharp which contains these new optimizations?

$ npm install github:lovell/sharp#ridge

@matAtWork

Yep

@strarsis
Author

@matAtWork: Also set chnls = 3 as described in #573 (comment) .

@matAtWork

Yeah, I saw that. The issue is that although I create the destination with channels: 3, the source images are derived from Buffers containing PNG data (from Chrome), which always seem to yield sharp() instances with 4 channels, and I can't find a way to force them down to three, which I could do given the alpha is always full of zeros. The number of channels in the call to sharp(pngBuffer, { channels: 3, ... }) seems to have no effect (unsurprisingly, given the raw data is the raw data?)

@matAtWork

Actually, forget that last bit: channels:3 doesn't work because the PNG buffer always renders into RGBA. That is probably what is defeating the optimisation. So the question becomes the more general "How do I get 3-channel raw data out of sharp?"

@strarsis
Copy link
Author

strarsis commented May 17, 2017

[removed]

@matAtWork

I've tried various combinations of flatten already. It seems to "apply" any alpha to the underlying pixel data (which is a no-op in my case), but toBuffer() still renders 4-channel raw data, whereas I need it to produce 3-channel RGB data. I'll try fudging the data by hand to see if it speeds up the overlay, at least.
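For the fudging-by-hand route, a minimal sketch in plain node of stripping the alpha channel from a raw RGBA buffer (hypothetical helper, under the assumption that the data is tightly packed RGBA):

```javascript
// Strip the alpha channel from a raw RGBA buffer, yielding raw RGB.
function rgbaToRgb(rgba) {
  const pixels = rgba.length / 4;
  const rgb = Buffer.allocUnsafe(pixels * 3);
  for (let i = 0; i < pixels; i++) {
    rgba.copy(rgb, i * 3, i * 4, i * 4 + 3); // copy R, G, B; skip A
  }
  return rgb;
}
```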

@strarsis
Author

strarsis commented May 17, 2017

@matAtWork: Also see #680 , #534 .

@matAtWork

Sorry if this has all gone a bit off-topic, but I found the fastest solution (by a long, long way) was simply to append the raw buffers (either 3 or 4 channel) produced by sharp(pngSlices) in memory using node, and then to use sharp() again to generate an image based on the big buffer and write it to disk.

When I say faster, I mean around 10x faster:

var slices = [] ;
for (var view = 0; view < height; view += viewportHeight) {
    await client.Emulation.setVisibleSize({width: width, height: Math.min(viewportHeight, height - view)});
    await client.Emulation.forceViewport({x: 0, y: view, scale: 1});
    const capture = await client.Page.captureScreenshot('png', 100, true);
    // Gather each slice produced by Chrome (in PNG format) into Promises
    // for raw buffers decoded by sharp().
    slices.push(sharp(Buffer.from(capture.data, 'base64')).raw().toBuffer({resolveWithObject: true})) ;
}

slices = await Promise.all(slices) ;

// Concat all the raw data together.
// (Assumes height is an exact multiple of viewportHeight, so every slice
// has the same byte size.)
var bufSize = width * viewportHeight * slices[0].info.channels ;
var composite = Buffer.allocUnsafe(bufSize * slices.length) ;
slices.forEach((s, i) => s.data.copy(composite, i * bufSize)) ;
// Get sharp() to produce the final image from the concatenated raw data.
await sharp(composite, {
    raw: {
        width: width,
        height: height,
        channels: slices[0].info.channels
    }
}).jpeg().toFile(tempFile + ".jpg") ;

So the solution here is to avoid overlayWith, and its complications, altogether.

Thanks for all your hints/tips, they definitely helped me understand the issues a lot better.

@lovell
Owner

lovell commented May 30, 2017

v0.18.0 now available with this improvement, thanks all for the feedback.

@lovell lovell closed this as completed May 30, 2017