|
|
Created:
7 years, 7 months ago by Jun Jiang Modified:
7 years, 6 months ago Reviewers:
Ken Russell (switch to Gerrit), greggman, Vangelis Kokkevis, Zhenyao Mo, bsalomon, Tom Hudson, reed1, Noel Gordon CC:
skia-review_googlegroups.com, ernstm Base URL:
https://skia.googlecode.com/svn/trunk Visibility:
Public. |
DescriptionImprove the performance of SkConvertConfig8888Pixels using Array lookup
BUG=242097
Committed: http://code.google.com/p/skia/source/detail?r=9605
Patch Set 1 #Patch Set 2 : Move to SkUnPreMultiply and add bench test #
Total comments: 4
Patch Set 3 : change some code to follow the style #
Total comments: 6
Patch Set 4 : Move the bench to another CL #Patch Set 5 : change to use SkMulDiv255Round to benifit ARM architecture #Messages
Total messages: 49 (0 generated)
This issue is found when investigating the issue of http://code.google.com/p/chromium/issues/detail?id=235783. With this patch, the performance of getImageData() of CanvasRenderingContext2D of chromium is close or better than that of Firefox.
Since this CL involve allocating a modest amount of memory, and therefore adds some complexity, I think we need some more support code around the change. 1. write some bench tests that exercise readPixels/writePixels, so we can measure the performance difference w/ and w/o this CL. We seem to have one bench for readPixels, but it does not exercise the premul/unpremul code paths. 2. ensure that our existing unittests (e.g. ReadPixelsTest.cpp, WritePixelsTest.cpp, PremulAlphaRoundTripTest.cpp) still pass 3. possibly explore alternative changes that are simpler (e.g. use SkUnPremultiply instead of / a)
Manfred's investigated color conversions throughout Skia, WebKit and Chrome and will be interested in looking at this CL.
On 2013/05/20 21:45:30, kbr wrote: > Manfred's investigated color conversions throughout Skia, WebKit and Chrome and > will be interested in looking at this CL. Unfortunately Manfred is OOO for a few weeks.
On 2013/05/21 04:47:31, vangelis wrote: > On 2013/05/20 21:45:30, kbr wrote: > > Manfred's investigated color conversions throughout Skia, WebKit and Chrome > and > > will be interested in looking at this CL. > > Unfortunately Manfred is OOO for a few weeks. Hi, reed, Kenneth and Vangelis. Thanks for you comments. Reed had gave very detailed suggestions and it may be a good start point to improve the conversion performance. Meanwhile, as Kenneth mentioned, there are many places that use this kind of conversion, such as skia, webkit and chromium, could we unify them as a whole without keep a implementation at each side? I am traveling in San Francisco this week and may not start to code it in details. We can make the problem clear and I can start the work when I am back to office next week.
On 2013/05/21 15:40:01, Jun Jiang wrote: > On 2013/05/21 04:47:31, vangelis wrote: > > On 2013/05/20 21:45:30, kbr wrote: > > > Manfred's investigated color conversions throughout Skia, WebKit and Chrome > > and > > > will be interested in looking at this CL. > > > > Unfortunately Manfred is OOO for a few weeks. > > Hi, reed, Kenneth and Vangelis. Thanks for you comments. Reed had gave very > detailed suggestions and it may be a good start point to improve the conversion > performance. Meanwhile, as Kenneth mentioned, there are many places that use > this kind of conversion, such as skia, webkit and chromium, could we unify them > as a whole without keep a implementation at each side? I am traveling in San > Francisco this week and may not start to code it in details. We can make the > problem clear and I can start the work when I am back to office next week. I was back from my business travel now and I would make a second patch based on reed's suggestions.
Sorry for the delay due to my business travel. I have refined the patch according to Reed's comments. 1. Move to use SkUnPremultiply instead of two lookup tables. No extra memory needs to be allocated and we can get similar performance gain. 2. all unit tests and bench tests in Skia can pass with no regressions. 3. A bench tests that exercise premul/unpremul code paths in readPixels/writePixels is added. The command to launch is : out/Release/bench --repeat 2 --match premul_and_unpremul_alpha Without this CL, the performance is as follows: running bench [640 480] premul_and_unpremul_alpha_Native8888 8888: cmsecs = 48.19 565: cmsecs = 55.68 GPU: cmsecs = 113.57 gmsecs = 9748807134692.57 NULLGPU: cmsecs = 43.64 running bench [640 480] premul_and_unpremul_alpha_RGBA8888 8888: cmsecs = 44.99 565: cmsecs = 51.72 GPU: cmsecs = 110.36 gmsecs = 4874403567403.37 NULLGPU: cmsecs = 39.70 With this CL, the performance is as follows: running bench [640 480] premul_and_unpremul_alpha_Native8888 8888: cmsecs = 28.87 565: cmsecs = 36.32 GPU: cmsecs = 94.06 gmsecs = 9748807134672.98 NULLGPU: cmsecs = 24.27 running bench [640 480] premul_and_unpremul_alpha_RGBA8888 8888: cmsecs = 24.23 565: cmsecs = 30.85 GPU: cmsecs = 90.13 gmsecs = 4874403567383.01 NULLGPU: cmsecs = 19.62 Obvious performance gain is obtained by having this CL.
Have you run the "tests" unit test app?
On 2013/05/30 08:14:58, Jun Jiang wrote: ... > Obvious performance gain is obtained by having this CL. I see a 5-10% gain on 8888 and GPU cmsecs, but for most of the Skia benchmarks anything less than -repeat 100 is dominated by noise. This needs to be rerun with a higher repeat count. What platforms is this on? We need to evaluate it on at least a couple of platforms/compilers, ideally a couple of different architectures.
On 2013/05/30 13:03:41, bsalomon wrote: > Have you run the "tests" unit test app? Yes, I had run the "tests" unit test app on Linux(Ubuntu 12.04).
On 2013/05/30 13:11:30, Tom Hudson wrote: > On 2013/05/30 08:14:58, Jun Jiang wrote: > ... > > Obvious performance gain is obtained by having this CL. > > I see a 5-10% gain on 8888 and GPU cmsecs, but for most of the Skia benchmarks > anything less than -repeat 100 is dominated by noise. > This needs to be rerun with a higher repeat count. > > What platforms is this on? We need to evaluate it on at least a couple of > platforms/compilers, ideally a couple of different architectures. The data was collected on Linux(Ubuntu 12.04). I have not set up skia environment in windows, but I tried the webpage http://jsfiddle.net/zsmyK/5/ in win7 with chromium and the time for getImagedata() was reduced from 45 to 27 by applying this CL. I expect there is similar performance gain in Windows. And I have not evaluated other Skia benchmarks. The reason that the performance gain is not so obvious may be those benchmarks didn't exercise the premul/unpremul code path.
On 2013/05/31 00:49:06, Jun Jiang wrote: > On 2013/05/30 13:11:30, Tom Hudson wrote: > > On 2013/05/30 08:14:58, Jun Jiang wrote: > > ... > > > Obvious performance gain is obtained by having this CL. > > > > I see a 5-10% gain on 8888 and GPU cmsecs, but for most of the Skia benchmarks > > anything less than -repeat 100 is dominated by noise. > > This needs to be rerun with a higher repeat count. > > > > What platforms is this on? We need to evaluate it on at least a couple of > > platforms/compilers, ideally a couple of different architectures. > > The data was collected on Linux(Ubuntu 12.04). I have not set up skia > environment in windows, but I tried the webpage http://jsfiddle.net/zsmyK/5/ in > win7 with chromium and the time for getImagedata() was reduced from 45 to 27 by > applying this CL. I expect there is similar performance gain in Windows. > > And I have not evaluated other Skia benchmarks. The reason that the performance > gain is not so obvious may be those benchmarks didn't exercise the > premul/unpremul code path. I think I had misunderstood Tom's comment on the "5-10% gain on 8888 and GPU cmsecs". I have rerun the bench with --repeat 500 and the results are as follows: 1. without the CL. ./out/Release/bench --match premul_and_unpremul_alpha --repeat 500 skia bench: alpha=0xFF antialias=1 filter=0 deferred=no logperiter=0 rotate=0 scale=0 clip=0 min=0 record=0 picturerecord=0 dither=default strokeWidth=none scalar=float system=UNIX running bench [640 480] premul_and_unpremul_alpha_Native8888 8888: cmsecs = 48.16 565: cmsecs = 55.62 GPU: cmsecs = 112.69 gmsecs = 9748807134691.50 NULLGPU: cmsecs = 43.63 running bench [640 480] premul_and_unpremul_alpha_RGBA8888 8888: cmsecs = 44.83 565: cmsecs = 51.48 GPU: cmsecs = 109.04 gmsecs = 4874403567401.96 NULLGPU: cmsecs = 39.70 2. with the CL. ./out/Release/bench --match premul_and_unpremul_alpha --repeat 500 skia bench: alpha=0xFF antialias=1 filter=0 deferred=no logperiter=0 rotate=0 scale=0 clip=0 min=0 record=0 picturerecord=0 dither=default strokeWidth=none scalar=float system=UNIX running bench [640 480] premul_and_unpremul_alpha_Native8888 8888: cmsecs = 29.59 565: cmsecs = 37.08 GPU: cmsecs = 94.63 gmsecs = 9748807134673.26 NULLGPU: cmsecs = 25.04 running bench [640 480] premul_and_unpremul_alpha_RGBA8888 8888: cmsecs = 24.93 565: cmsecs = 31.68 GPU: cmsecs = 90.18 gmsecs = 4874403567382.98 NULLGPU: cmsecs = 20.40 The gain for the two subbenches are as floowing: premul_and_unpremul_alpha_Native8888 8888: cmsecs:38%, 565: cmsecs:33%, GPU: cmsecs:16%, NULLGPU: cmsecs:42% premul_and_unpremul_alpha_RGBA8888 8888: cmsecs:44%, 565: cmsecs:38%, GPU: cmsecs:17%, NULLGPU: cmsecs:48% I had planned to run this on Android as well to check its impact but my colleagues has met some issue to run skia bench successfully on Android recently.
I don't understand why you're seeing a perf increase for the GPU and NULLGPU configs. Are we executing the SkConfig8888 code at all when the canvas is backed by an SkGpuDevice? https://codereview.chromium.org/15402003/diff/20001/bench/PremulAndUnpremulAl... File bench/PremulAndUnpremulAlphaOpsBench.cpp (right): https://codereview.chromium.org/15402003/diff/20001/bench/PremulAndUnpremulAl... bench/PremulAndUnpremulAlphaOpsBench.cpp:30: const unsigned int LoopTime = 10; style-nit: static const int kLoopCount = 10; Actually this should be static const int kLoopCount = SkBENCHLOOP(10); so that it completes fast in a debug build. https://codereview.chromium.org/15402003/diff/20001/bench/PremulAndUnpremulAl... bench/PremulAndUnpremulAlphaOpsBench.cpp:31: for (unsigned int loop = 0; loop < LoopTime; ++loop) { style-nit: "int loop = 0" We use ints unless it is a bitfield or the count can really be expected to exceed MAX_INT.
On 2013/05/31 12:42:07, bsalomon wrote: > I don't understand why you're seeing a perf increase for the GPU and NULLGPU > configs. Are we executing the SkConfig8888 code at all when the canvas is backed > by an SkGpuDevice? > > https://codereview.chromium.org/15402003/diff/20001/bench/PremulAndUnpremulAl... > File bench/PremulAndUnpremulAlphaOpsBench.cpp (right): > > https://codereview.chromium.org/15402003/diff/20001/bench/PremulAndUnpremulAl... > bench/PremulAndUnpremulAlphaOpsBench.cpp:30: const unsigned int LoopTime = 10; > style-nit: > static const int kLoopCount = 10; > > Actually this should be > static const int kLoopCount = SkBENCHLOOP(10); > > so that it completes fast in a debug build. > > https://codereview.chromium.org/15402003/diff/20001/bench/PremulAndUnpremulAl... > bench/PremulAndUnpremulAlphaOpsBench.cpp:31: for (unsigned int loop = 0; loop < > LoopTime; ++loop) { > style-nit: "int loop = 0" > > We use ints unless it is a bitfield or the count can really be expected to > exceed MAX_INT. Hi, bsalomon. Thanks for your comments. Yes, it is executing the SkConfig8888 code for readPixels() and writePixels() when the canvas is backed by an SkGpuDevice. For the code style issue, I will make changes accordingly.
On 2013/05/31 14:22:29, Jun Jiang wrote: > On 2013/05/31 12:42:07, bsalomon wrote: > > I don't understand why you're seeing a perf increase for the GPU and NULLGPU > > configs. Are we executing the SkConfig8888 code at all when the canvas is > backed > > by an SkGpuDevice? > > > > > https://codereview.chromium.org/15402003/diff/20001/bench/PremulAndUnpremulAl... > > File bench/PremulAndUnpremulAlphaOpsBench.cpp (right): > > > > > https://codereview.chromium.org/15402003/diff/20001/bench/PremulAndUnpremulAl... > > bench/PremulAndUnpremulAlphaOpsBench.cpp:30: const unsigned int LoopTime = 10; > > style-nit: > > static const int kLoopCount = 10; > > > > Actually this should be > > static const int kLoopCount = SkBENCHLOOP(10); > > > > so that it completes fast in a debug build. > > > > > https://codereview.chromium.org/15402003/diff/20001/bench/PremulAndUnpremulAl... > > bench/PremulAndUnpremulAlphaOpsBench.cpp:31: for (unsigned int loop = 0; loop > < > > LoopTime; ++loop) { > > style-nit: "int loop = 0" > > > > We use ints unless it is a bitfield or the count can really be expected to > > exceed MAX_INT. > > Hi, bsalomon. Thanks for your comments. > Yes, it is executing the SkConfig8888 code for readPixels() and writePixels() > when the canvas is backed by an SkGpuDevice. > For the code style issue, I will make changes accordingly. Now I recall that this roundtrip test fails: https://code.google.com/p/skia/source/browse/trunk/src/gpu/effects/GrConfigCo... on Intel GPUs and so we fallback to CPU premul/unpremul. Is there way to update the effect so that we can do the conversions on the GPU? It is a big win when it works.
On 2013/05/31 14:31:44, bsalomon wrote: > On 2013/05/31 14:22:29, Jun Jiang wrote: > > On 2013/05/31 12:42:07, bsalomon wrote: > > > I don't understand why you're seeing a perf increase for the GPU and NULLGPU > > > configs. Are we executing the SkConfig8888 code at all when the canvas is > > backed > > > by an SkGpuDevice? > > > > > > > > > https://codereview.chromium.org/15402003/diff/20001/bench/PremulAndUnpremulAl... > > > File bench/PremulAndUnpremulAlphaOpsBench.cpp (right): > > > > > > > > > https://codereview.chromium.org/15402003/diff/20001/bench/PremulAndUnpremulAl... > > > bench/PremulAndUnpremulAlphaOpsBench.cpp:30: const unsigned int LoopTime = > 10; > > > style-nit: > > > static const int kLoopCount = 10; > > > > > > Actually this should be > > > static const int kLoopCount = SkBENCHLOOP(10); > > > > > > so that it completes fast in a debug build. > > > > > > > > > https://codereview.chromium.org/15402003/diff/20001/bench/PremulAndUnpremulAl... > > > bench/PremulAndUnpremulAlphaOpsBench.cpp:31: for (unsigned int loop = 0; > loop > > < > > > LoopTime; ++loop) { > > > style-nit: "int loop = 0" > > > > > > We use ints unless it is a bitfield or the count can really be expected to > > > exceed MAX_INT. > > > > Hi, bsalomon. Thanks for your comments. > > Yes, it is executing the SkConfig8888 code for readPixels() and writePixels() > > when the canvas is backed by an SkGpuDevice. > > For the code style issue, I will make changes accordingly. > > Now I recall that this roundtrip test fails: > https://code.google.com/p/skia/source/browse/trunk/src/gpu/effects/GrConfigCo... > on Intel GPUs and so we fallback to CPU premul/unpremul. Is there way to update > the effect so that we can do the conversions on the GPU? It is a big win when it > works. Yes, it will try to apply PM->UPM effect on textures first and fall back to CPU premul/unpremul if it fails to create the PM->UPM effect. I am not sure why it fails on Intel GPUs and I'd like to have a investigation on that. Maybe I need to fire another bug to follow the issue.
On 2013/05/31 14:52:39, Jun Jiang wrote: > On 2013/05/31 14:31:44, bsalomon wrote: > > On 2013/05/31 14:22:29, Jun Jiang wrote: > > > On 2013/05/31 12:42:07, bsalomon wrote: > > > > I don't understand why you're seeing a perf increase for the GPU and > NULLGPU > > > > configs. Are we executing the SkConfig8888 code at all when the canvas is > > > backed > > > > by an SkGpuDevice? > > > > > > > > > > > > > > https://codereview.chromium.org/15402003/diff/20001/bench/PremulAndUnpremulAl... > > > > File bench/PremulAndUnpremulAlphaOpsBench.cpp (right): > > > > > > > > > > > > > > https://codereview.chromium.org/15402003/diff/20001/bench/PremulAndUnpremulAl... > > > > bench/PremulAndUnpremulAlphaOpsBench.cpp:30: const unsigned int LoopTime = > > 10; > > > > style-nit: > > > > static const int kLoopCount = 10; > > > > > > > > Actually this should be > > > > static const int kLoopCount = SkBENCHLOOP(10); > > > > > > > > so that it completes fast in a debug build. > > > > > > > > > > > > > > https://codereview.chromium.org/15402003/diff/20001/bench/PremulAndUnpremulAl... > > > > bench/PremulAndUnpremulAlphaOpsBench.cpp:31: for (unsigned int loop = 0; > > loop > > > < > > > > LoopTime; ++loop) { > > > > style-nit: "int loop = 0" > > > > > > > > We use ints unless it is a bitfield or the count can really be expected to > > > > exceed MAX_INT. > > > > > > Hi, bsalomon. Thanks for your comments. > > > Yes, it is executing the SkConfig8888 code for readPixels() and > writePixels() > > > when the canvas is backed by an SkGpuDevice. > > > For the code style issue, I will make changes accordingly. > > > > Now I recall that this roundtrip test fails: > > > https://code.google.com/p/skia/source/browse/trunk/src/gpu/effects/GrConfigCo... > > on Intel GPUs and so we fallback to CPU premul/unpremul. Is there way to > update > > the effect so that we can do the conversions on the GPU? It is a big win when > it > > works. > > Yes, it will try to apply PM->UPM effect on textures first and fall back to CPU > premul/unpremul if it fails to create the PM->UPM effect. I am not sure why it > fails on Intel GPUs and I'd like to have a investigation on that. Maybe I need > to fire another bug to follow the issue. That'd be great. Aside from the the style stuff it looks good to me.
The code has been changed to follow the style. https://codereview.chromium.org/15402003/diff/20001/bench/PremulAndUnpremulAl... File bench/PremulAndUnpremulAlphaOpsBench.cpp (right): https://codereview.chromium.org/15402003/diff/20001/bench/PremulAndUnpremulAl... bench/PremulAndUnpremulAlphaOpsBench.cpp:30: const unsigned int LoopTime = 10; On 2013/05/31 12:42:08, bsalomon wrote: > style-nit: > static const int kLoopCount = 10; > > Actually this should be > static const int kLoopCount = SkBENCHLOOP(10); > > so that it completes fast in a debug build. Done. https://codereview.chromium.org/15402003/diff/20001/bench/PremulAndUnpremulAl... bench/PremulAndUnpremulAlphaOpsBench.cpp:31: for (unsigned int loop = 0; loop < LoopTime; ++loop) { On 2013/05/31 12:42:08, bsalomon wrote: > style-nit: "int loop = 0" > > We use ints unless it is a bitfield or the count can really be expected to > exceed MAX_INT. Done.
https://codereview.chromium.org/15402003/diff/34002/src/core/SkConfig8888.cpp File src/core/SkConfig8888.cpp (right): https://codereview.chromium.org/15402003/diff/34002/src/core/SkConfig8888.cpp... src/core/SkConfig8888.cpp:62: // performant way to unpremul than (value * 0xff) / a. Does (v * 255 / a) equal your SkUnPreMultiply::ApplyScale approach for all v and a in [0-255]?
https://codereview.chromium.org/15402003/diff/34002/src/core/SkConfig8888.cpp File src/core/SkConfig8888.cpp (right): https://codereview.chromium.org/15402003/diff/34002/src/core/SkConfig8888.cpp... src/core/SkConfig8888.cpp:62: // performant way to unpremul than (value * 0xff) / a. On 2013/06/04 05:55:18, Noel Gordon (Google) wrote: > Does (v * 255 / a) equal your SkUnPreMultiply::ApplyScale approach for all v and > a in [0-255]? Hi, Noel. (v * 255 / a) doesn't equal SkUnPreMultiply::ApplyScale() for all v and a in [0-255]. In fact, the difference can be 1 between the two values for some v and a in [0-255] and 0 for all other pairs. You can see the test in SkUnPreMultiply_BuildTable() in src/core/SkUnPremultiply.cpp when building the divide table.
Per https://codereview.appspot.com/3466042 seems we might regress third_party/WebKit/LayoutTest/canvas/philip/tests/2d.imageData.put.unchanged.html when using SkUnPreMultiply::ApplyScale() to compute (v * 255 / a). Have you run that test with your change?
I just ran that test with your change, and the test passes for me.
So I think we should use the rounding method. https://codereview.appspot.com/3466042 had an example of it, see the changes to include/core/SkMath.h on that review, in particular static inline U8CPU SkMulDiv255Round(U8CPU a, U8CPU b) { SkASSERT((uint8_t)a == a); SkASSERT((uint8_t)b == b); unsigned prod = SkMulS16(a, b) + 128; return (prod + (prod >> 8)) >> 8; } But it's no longer available in that file. What you have done is close, though. let me review ...
https://codereview.chromium.org/15402003/diff/34002/src/core/SkConfig8888.cpp File src/core/SkConfig8888.cpp (right): https://codereview.chromium.org/15402003/diff/34002/src/core/SkConfig8888.cpp... src/core/SkConfig8888.cpp:73: r = SkDiv255Round(r * a); Could you test using the following here and tell me if the skia tests pass? r = SkDiv255Round(SkMulS16(r, a)); g = SkDiv255Round(SkMulS16(g, a)); b = SkDiv255Round(SkMulS16(b, a));
On 2013/06/04 07:05:30, Noel Gordon (Google) wrote: > I just ran that test with your change, and the test passes for me. Yes, I have tried third_party/WebKit/LayoutTest/canvas/philip/tests/2d.imageData.put.unchanged.html with this CL and it can pass as well.
Good, now do #24 :"Could you test using the following here and tell me if the skia tests pass?"
On 2013/06/04 07:24:17, Noel Gordon (Google) wrote: > Good, now do #24 :"Could you test using the following here and tell me if the > skia tests pass?" I have tried with #24 using "r = SkDiv255Round(SkMulS16(r, a));", the Skia unit tests can pass without failure. And it can also pass third_party/WebKit/LayoutTest/canvas/philip/tests/2d.imageData.put.unchanged.html with the patch in this CL together.
Good. Can you measure the performance again when this r = SkDiv255Round(SkMulS16(r, a)) etc method?
On 2013/06/04 07:33:18, Noel Gordon (Google) wrote: > Good. Can you measure the performance again when this r = > SkDiv255Round(SkMulS16(r, a)) etc method? The performance data is as follows: 1. With the patch in the CL only out/Release/bench --match premul_and_unpremul_alpha --repeat 500 skia bench: alpha=0xFF antialias=1 filter=0 deferred=no logperiter=0 rotate=0 scale=0 clip=0 min=0 record=0 picturerecord=0 dither=default strokeWidth=none scalar=float system=UNIX running bench [640 480] premul_and_unpremul_alpha_Native8888 8888: cmsecs = 29.59 565: cmsecs = 37.14 GPU: cmsecs = 95.24 gmsecs = 9748807134673.83 NULLGPU: cmsecs = 25.08 running bench [640 480] premul_and_unpremul_alpha_RGBA8888 8888: cmsecs = 25.07 565: cmsecs = 31.87 GPU: cmsecs = 90.48 gmsecs = 4874403567383.25 NULLGPU: cmsecs = 20.40 2. With the patch in this CL and change to use SkDiv255Round(SkMulS16(r, a)) out/Release/bench --match premul_and_unpremul_alpha --repeat 500 skia bench: alpha=0xFF antialias=1 filter=0 deferred=no logperiter=0 rotate=0 scale=0 clip=0 min=0 record=0 picturerecord=0 dither=default strokeWidth=none scalar=float system=UNIX running bench [640 480] premul_and_unpremul_alpha_Native8888 8888: cmsecs = 29.60 565: cmsecs = 37.13 GPU: cmsecs = 94.89 gmsecs = 9748807133986.34 NULLGPU: cmsecs = 25.05 running bench [640 480] premul_and_unpremul_alpha_RGBA8888 8888: cmsecs = 25.04 565: cmsecs = 31.79 GPU: cmsecs = 89.94 gmsecs = 4874403567382.66 NULLGPU: cmsecs = 20.40 The performance data are very close for these two versions.
Ok good. Since you want speed, there's a faster way to compute the SkDiv255Round's ... static const unsigned int kRounding = 257 * 128; unsigned int alpha = a * 257; r = (SkMulS16(r, alpha) + kRounding) >> 16; g = (SkMulS16(g, alpha) + kRounding) >> 16; b = (SkMulS16(b, alpha) + kRounding) >> 16; Perhaps you could test that to if the skia tests pass, and if there is any performance gain?
On 2013/06/04 08:01:16, Noel Gordon (Google) wrote: > Ok good. Since you want speed, there's a faster way to compute the > SkDiv255Round's ... > > static const unsigned int kRounding = 257 * 128; > unsigned int alpha = a * 257; > r = (SkMulS16(r, alpha) + kRounding) >> 16; > g = (SkMulS16(g, alpha) + kRounding) >> 16; > b = (SkMulS16(b, alpha) + kRounding) >> 16; > > Perhaps you could test that to if the skia tests pass, and if there is any > performance gain? It met Segmentation fault with error "include/core/SkMath.h:145: failed assertion "(int16_t)y == y" when running Skia unit tests. It seems the second parameter of SkMulS16(r, alpha) may be bigger than the maximum int16_t value and an overflow happened.
My apologies, it should be static const unsigned int kRounding = 257 * 128; unsigned int alpha = a * 257; r = (r * alpha + kRounding) >> 16; g = (g * alpha + kRounding) >> 16; b = (b * alpha + kRounding) >> 16;
On 2013/06/04 08:21:41, Noel Gordon (Google) wrote: > My apologies, it should be > > static const unsigned int kRounding = 257 * 128; > unsigned int alpha = a * 257; > r = (r * alpha + kRounding) >> 16; > g = (g * alpha + kRounding) >> 16; > b = (b * alpha + kRounding) >> 16; The data with this change is as follows: out/Release/bench --match premul_and_unpremul_alpha --repeat 500 skia bench: alpha=0xFF antialias=1 filter=0 deferred=no logperiter=0 rotate=0 scale=0 clip=0 min=0 record=0 picturerecord=0 dither=default strokeWidth=none scalar=float system=UNIX running bench [640 480] premul_and_unpremul_alpha_Native8888 8888: cmsecs = 30.37 565: cmsecs = 37.41 GPU: cmsecs = 95.22 gmsecs = 9748807134673.85 NULLGPU: cmsecs = 25.83 running bench [640 480] premul_and_unpremul_alpha_RGBA8888 8888: cmsecs = 25.74 565: cmsecs = 32.51 GPU: cmsecs = 90.65 gmsecs = 4874403567383.37 NULLGPU: cmsecs = 20.89 The performance data is still similar with before. My understanding is that the hotspot is / operation and SkDiv255Round already uses ">>" operation to optimize away "/" operation.
On crbug.com/234071 using image-like data, this form is about 10-15% faster compared to SkDiv255Round() on windows. Perhaps your benchmark is dominated by the Premul -> Unpremul case: it has an if (a) test if (a) { SkUnPreMultiply::Scale scale = SkUnPreMultiply::GetScale(a); ... and that can significantly slow down the code when convert_pixel() is called in a for-loop? Anyway, I fine with either rounding approach provided skia's PremulAlphaRoundTripTest passes. Please note that another canvas test will fail: canvas/philip/tests/2d.gradient.interpolate.colouralpha.html and the fix for that is to increase the test's tolerance from 3 -> 4.
On 2013/06/04 15:37:26, Noel Gordon (Google) wrote: > On crbug.com/234071 using image-like data, this form is about 10-15% faster > compared to SkDiv255Round() on windows. Perhaps your benchmark is dominated by > the Premul -> Unpremul case: it has an if (a) test > > if (a) { > SkUnPreMultiply::Scale scale = SkUnPreMultiply::GetScale(a); > ... > > and that can significantly slow down the code when convert_pixel() is called in > a for-loop? > > Anyway, I fine with either rounding approach provided skia's > PremulAlphaRoundTripTest passes. > > Please note that another canvas test will fail: > canvas/philip/tests/2d.gradient.interpolate.colouralpha.html and the fix for > that is to increase the test's tolerance from 3 -> 4. Hi, Noel. Thanks for your comments and sharing. It seems "r = (r * alpha + kRounding) >> 16;" has better performance than SkMulDiv255Round() if SkMulS16() is not optimized on the underlying architecture. I am considering to change the patch to use this new approach. And it seems I need to run the whole blink Layout test first to avoid any regression like canvas/philip/tests/2d.gradient.interpolate.colouralpha.html.
The patch to refine the LayoutTest was in review at https://codereview.chromium.org/16691002/ and another patch to make the Config conversion effect to work on Intel GPU was in review at https://codereview.chromium.org/15719007/.
Just a couple of nits from me, and then if we can get performance numbers on other architectures I'll be fine with it. Nit: 100-character line length; I think the new bench needs formatting. https://codereview.chromium.org/15402003/diff/34002/src/core/SkConfig8888.cpp File src/core/SkConfig8888.cpp (right): https://codereview.chromium.org/15402003/diff/34002/src/core/SkConfig8888.cpp... src/core/SkConfig8888.cpp:62: // performant way to unpremul than (value * 0xff) / a. Nit: better grammar: "Using SkUnPreMultiply::ApplyScale is faster than (value * 0xff) / a"
As mentioned in the GPU conversion CL comments, I think we should land the bench first as a separate CL. https://codereview.chromium.org/15402003/diff/34002/bench/PremulAndUnpremulAl... File bench/PremulAndUnpremulAlphaOpsBench.cpp (right): https://codereview.chromium.org/15402003/diff/34002/bench/PremulAndUnpremulAl... bench/PremulAndUnpremulAlphaOpsBench.cpp:34: bmp1.allocPixels(); Can't all this setup of bmp1 be done outside the loop? Maybe it could even outside the timer entirely by having a member bitmap var and overriding onPreDraw() (and then freeing up the pixels in onPostDraw()).
Hi, Brian. Thanks for your comments. I will move the benchmark into a separate CL. https://codereview.chromium.org/15402003/diff/34002/bench/PremulAndUnpremulAl... File bench/PremulAndUnpremulAlphaOpsBench.cpp (right): https://codereview.chromium.org/15402003/diff/34002/bench/PremulAndUnpremulAl... bench/PremulAndUnpremulAlphaOpsBench.cpp:34: bmp1.allocPixels(); On 2013/06/10 13:48:22, bsalomon wrote: > Can't all this setup of bmp1 be done outside the loop? Maybe it could even > outside the timer entirely by having a member bitmap var and overriding > onPreDraw() (and then freeing up the pixels in onPostDraw()). Yes, the setup of bmp1 can be done outside of the loop. As for moving it into onPredraw(), there may be a issue that it can not get the size of the underlying SkCanvas.
On 2013/06/10 09:35:18, Tom Hudson wrote: > Just a couple of nits from me, and then if we can get performance numbers on > other architectures I'll be fine with it. > > Nit: 100-character line length; I think the new bench needs formatting. > > https://codereview.chromium.org/15402003/diff/34002/src/core/SkConfig8888.cpp > File src/core/SkConfig8888.cpp (right): > > https://codereview.chromium.org/15402003/diff/34002/src/core/SkConfig8888.cpp... > src/core/SkConfig8888.cpp:62: // performant way to unpremul than (value * 0xff) > / a. > Nit: better grammar: "Using SkUnPreMultiply::ApplyScale is faster than (value * > 0xff) / a" Done.
On 2013/06/10 13:48:22, bsalomon wrote: > As mentioned in the GPU conversion CL comments, I think we should land the bench > first as a separate CL. > > https://codereview.chromium.org/15402003/diff/34002/bench/PremulAndUnpremulAl... > File bench/PremulAndUnpremulAlphaOpsBench.cpp (right): > > https://codereview.chromium.org/15402003/diff/34002/bench/PremulAndUnpremulAl... > bench/PremulAndUnpremulAlphaOpsBench.cpp:34: bmp1.allocPixels(); > Can't all this setup of bmp1 be done outside the loop? Maybe it could even > outside the timer entirely by having a member bitmap var and overriding > onPreDraw() (and then freeing up the pixels in onPostDraw()). A separate CL addressing the benchmark has been uploaded for review at https://codereview.chromium.org/16654004/.
On 2013/06/11 16:10:43, Jun Jiang wrote: > On 2013/06/10 13:48:22, bsalomon wrote: > > As mentioned in the GPU conversion CL comments, I think we should land the > bench > > first as a separate CL. > > > > > https://codereview.chromium.org/15402003/diff/34002/bench/PremulAndUnpremulAl... > > File bench/PremulAndUnpremulAlphaOpsBench.cpp (right): > > > > > https://codereview.chromium.org/15402003/diff/34002/bench/PremulAndUnpremulAl... > > bench/PremulAndUnpremulAlphaOpsBench.cpp:34: bmp1.allocPixels(); > > Can't all this setup of bmp1 be done outside the loop? Maybe it could even > > outside the timer entirely by having a member bitmap var and overriding > > onPreDraw() (and then freeing up the pixels in onPostDraw()). > > A separate CL addressing the benchmark has been uploaded for review at > https://codereview.chromium.org/16654004/. This lgtm, but please wait until the benchmark change has been in for a few hours so that we can be sure baseline numbers get generated for it.
On 2013/06/05 15:30:21, Jun Jiang wrote: > On 2013/06/04 15:37:26, Noel Gordon (Google) wrote: > > On crbug.com/234071 using image-like data, this form is about 10-15% faster > > compared to SkDiv255Round() on windows. Perhaps your benchmark is dominated > by > > the Premul -> Unpremul case: it has an if (a) test > > > > if (a) { > > SkUnPreMultiply::Scale scale = SkUnPreMultiply::GetScale(a); > > ... > > > > and that can significantly slow down the code when convert_pixel() is called > > in a for-loop? > > > > Anyway, I fine with either rounding approach provided skia's > > PremulAlphaRoundTripTest passes. > > > > Please note that another canvas test will fail: > > canvas/philip/tests/2d.gradient.interpolate.colouralpha.html and the fix for > > that is to increase the test's tolerance from 3 -> 4. > > Hi, Noel. Thanks for your comments and sharing. It seems "r = (r * alpha + > kRounding) >> 16;" has better performance than SkMulDiv255Round() if SkMulS16() > is not optimized on the underlying architecture. I am considering to change the > patch to use this new approach. Yes, but we can do that on another change. And it seems I need to run the whole blink > Layout test first to avoid any regression like > canvas/philip/tests/2d.gradient.interpolate.colouralpha.html. Have you run the blink layout tests with and without the current change to check if there are other test changes?
On 2013/06/12 07:36:53, Noel Gordon (Google) wrote: > On 2013/06/05 15:30:21, Jun Jiang wrote: > > On 2013/06/04 15:37:26, Noel Gordon (Google) wrote: > > > On crbug.com/234071 using image-like data, this form is about 10-15% faster > > > compared to SkDiv255Round() on windows. Perhaps your benchmark is dominated > > by > > > the Premul -> Unpremul case: it has an if (a) test > > > > > > if (a) { > > > SkUnPreMultiply::Scale scale = SkUnPreMultiply::GetScale(a); > > > ... > > > > > > and that can significantly slow down the code when convert_pixel() is called > > > in a for-loop? > > > > > > Anyway, I fine with either rounding approach provided skia's > > > PremulAlphaRoundTripTest passes. > > > > > > Please note that another canvas test will fail: > > > canvas/philip/tests/2d.gradient.interpolate.colouralpha.html and the fix for > > > that is to increase the test's tolerance from 3 -> 4. > > > > Hi, Noel. Thanks for your comments and sharing. It seems "r = (r * alpha + > > kRounding) >> 16;" has better performance than SkMulDiv255Round() if > SkMulS16() > > is not optimized on the underlying architecture. I am considering to change > the > > patch to use this new approach. > > Yes, but we can do that on another change. > > And it seems I need to run the whole blink > > Layout test first to avoid any regression like > > canvas/philip/tests/2d.gradient.interpolate.colouralpha.html. > > Have you run the blink layout tests with and without the current change to check > if there are other test changes? Yes, I had run the blink layout tests(specially, LayoutTest/canvas/, LayoutTest/fast/canvas/, LayoutTests/virtual/gpu/canvas/ and LayoutTests/virtual/gpu/fast/canvas/) with this patch. All the changes to the LayoutTest is at https://codereview.chromium.org/16691002/ .
Good re the layout test. One more thing. Should we use the SkMulS16 form r = SkDiv255Round(SkMulS16(r, a)); g = SkDiv255Round(SkMulS16(g, a)); b = SkDiv255Round(SkMulS16(b, a)); to benefit the android folks?
On 2013/06/14 06:40:36, Noel Gordon (Google) wrote: > Good re the layout test. One more thing. Should we use the SkMulS16 form > > r = SkDiv255Round(SkMulS16(r, a)); > g = SkDiv255Round(SkMulS16(g, a)); > b = SkDiv255Round(SkMulS16(b, a)); > > to benefit the android folks? Hi, Noel. Thanks for your comments. You're right, use SkMulS16 can benefit ARM architecture. I have uploaded a new patch, please take another look. As for whether to use the fast variant SkDiv255Round version like "r = (r * alpha + kRounding) >> 16" mentioned in #35, we may leave the decision to https://code.google.com/p/chromium/issues/detail?id=234071.
LGTM SkMulDiv255Round
CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/jun.a.jiang@intel.com/15402003/67001
Message was sent while issue was closed.
Change committed as 9605 |