|
|
Created:
7 years, 4 months ago by Yang Gu Modified:
7 years, 4 months ago CC:
skia-review_googlegroups.com Base URL:
https://skia.googlecode.com/svn/trunk Visibility:
Public. |
DescriptionExperiments on calculating reciprocal of square root
BUG=
Committed: http://code.google.com/p/skia/source/detail?r=10671
Patch Set 1 #Patch Set 2 : Change existing benchmark code #Patch Set 3 : Better calculate reciprocal of square root #
Total comments: 2
Messages
Total messages: 17 (0 generated)
Reciprocal of square root is often calculated here and there in the code, and we have several ways to do it: 1. Directly calculate 1/sqrt(x) (mark 'portable' below) 2. Use fast InvSqrt(http://en.wikipedia.org/wiki/Fast_inverse_square_root), and we also have a bench test implement this (SkFastInvSqrt() in bench/MathBench.cpp) ('fastinv') 3. Use intrinsic function of specific platform. ('intrinsic') I finished a patch to implement it using SSE2 intrinsic function on IA platform. But experiments surprised me that the second way (fast InvSqrt) is best. Below are the test data on IA desktop, IA phone and Nexus4. IA Desktop: running bench [640 480] rsqrtf__intrinsic NONRENDERING: cmsecs = 14.28 running bench [640 480] rsqrtf__fastinv NONRENDERING: cmsecs = 14.23 running bench [640 480] rsqrtf__portable NONRENDERING: cmsecs = 26.78 IA phone: running bench [640 480] rsqrtf__intrinsic NONRENDERING: cmsecs = 220.19 running bench [640 480] rsqrtf__fastinv NONRENDERING: cmsecs = 100.13 running bench [640 480] rsqrtf__portable NONRENDERING: cmsecs = 307.69 Nexus4: running bench [640 480] rsqrtf__intrinsic NONRENDERING: cmsecs = 239.81 running bench [640 480] rsqrtf__fastinv NONRENDERING: cmsecs = 96.19 running bench [640 480] rsqrtf__portable NONRENDERING: cmsecs = 238.76 I'd like to have your opinion on this. Is this worthy of a patch to make the calculation faster? If so, do you think if we should choose 2nd approach?
I'm not sure how important this is for Skia drawing operations. That said, have you tried inlining the impl for your two new functions (portable and not) and see how that affects your benches? For names, I find rsqrt a little two terse. sk_float_invsqrt might be more readable (or something like it)
https://code.google.com/p/skia/issues/detail?id=885 We've always seen "slow" inverse sqrt as faster than the "fast" variant in the past on desktop machines, but we knew fast was actually faster on (ARM) Android. We just never followed through with a way to change it per-platform.
Glad to know you already put this in your radar. I inlined the two new functions, and here comes the data (I manually sorted it). IA Desktop: running bench [640 480] rsqrtf_fastinv_inline NONRENDERING: cmsecs = 14.15 running bench [640 480] rsqrtf_intrinsic_inline NONRENDERING: cmsecs = 14.18 running bench [640 480] rsqrtf_intrinsic NONRENDERING: cmsecs = 14.18 running bench [640 480] rsqrtf_portable_inline NONRENDERING: cmsecs = 25.96 running bench [640 480] rsqrtf_portable NONRENDERING: cmsecs = 26.43 IA Phone: running bench [640 480] rsqrtf_intrinsic_inline NONRENDERING: cmsecs = 122.63 running bench [640 480] rsqrtf_fastinv_inline NONRENDERING: cmsecs = 127.62 running bench [640 480] rsqrtf_intrinsic NONRENDERING: cmsecs = 222.65 running bench [640 480] rsqrtf_portable_inline NONRENDERING: cmsecs = 240.16 running bench [640 480] rsqrtf_portable NONRENDERING: cmsecs = 315.18 Nexus4: running bench [640 480] rsqrtf_fastinv_inline NONRENDERING: cmsecs = 96.27 running bench [640 480] rsqrtf_portable_inline NONRENDERING: cmsecs = 165.66 running bench [640 480] rsqrtf_portable NONRENDERING: cmsecs = 238.85 Notes: 1. The data for IA phone is a bit different with previous data. I double checked it, and they are stable. The only difference is this time I tested 5 cases. And if I used same test suite as Nexus4, the result of fastinv would become 100 again (see below). This might be due to cache miss or other reasons, but I think the trend is similar. running bench [640 480] rsqrtf_fastinv_inline NONRENDERING: cmsecs = 100.14 running bench [640 480] rsqrtf_portable_inline NONRENDERING: cmsecs = 232.70 running bench [640 480] rsqrtf_portable NONRENDERING: cmsecs = 307.70 2. I could not produce the situation that on desktop, slowIsqrt is better than fastIsqrt. My data showed fast (fastinv_inline, 14.15) is about 2x of the performance of slow (portable_inline, 25.96) 3. For IA (desktop and phone), fastinv is on par with intrinsic (inlined) solution. 4. For desktop, intrinsic and intrinsic_inline is same, so maybe compiler auto-inlines the code, which is a difference with cross-compiler. 5. For all platform, fastinv (fastinv_inline) is about 2x performance comparing to slow one (portable_inline). Given these, do you think we can implement this using fastinv?
Invite Tom to review
Let's try to figure out why this is giving different results than our pre-existing inverse sqrt benchmarks. For example, on my Linux development machine: $ out/Release/bench --match sqrt --repeat 100 skia bench: alpha=0xFF antialias=1 filter=0 deferred=no logperiter=0 rotate=0 scale=0 clip=0 min=0 record=0 picturerecord=0 dither=default strokeWidth=none scalar=float system=UNIX running bench [640 480] math_fastIsqrt NONRENDERING: cmsecs = 10.33 running bench [640 480] math_slowIsqrt NONRENDERING: cmsecs = 8.07 that math_fastIsqrt should be the same math as your __fastinv, and math_slowIsqrt as your __portable. What values do you get for those on the machines you're testing with? What's the flaw in our pre-existing performance test that makes yours more representative?
As I wrote a new benchmark, and tested data against it, so the difference of two benchmarks (mine and existing one) result in quite different results. Below are some experiments I did to understand the reason. Without changing any code, the results at my side are: Desktop (i7-3770K @ 3.50GHz): running bench [640 480] math_fastIsqrt NONRENDERING: cmsecs = 5.96 running bench [640 480] math_slowIsqrt NONRENDERING: cmsecs = 5.56 IA phone: running bench [640 480] math_fastIsqrt NONRENDERING: cmsecs = 43.97 running bench [640 480] math_slowIsqrt NONRENDERING: cmsecs = 51.99 Nexus4: running bench [640 480] math_fastIsqrt NONRENDERING: cmsecs = 36.85 running bench [640 480] math_slowIsqrt NONRENDERING: cmsecs = 47.02 Observation: 1. I can somewhat reproduce your result that slowIsqrt is faster than fastIsqrt on desktop, while it's on the contrary on mobile. ===================================== Then I compared my benchmark with existing one, and found I used rand.nextF() as source data, instead of rand.nextSScalar1(). nextSScalar1 would return negative value, and cut out some instructions, which is unexpected. So I changed to rand.nextRangeF(1, 10000). Below are test results: Desktop: running bench [640 480] math_fastIsqrt NONRENDERING: cmsecs = 0.45 running bench [640 480] math_slowIsqrt NONRENDERING: cmsecs = 5.41 IA phone: running bench [640 480] math_fastIsqrt NONRENDERING: cmsecs = 17.76 running bench [640 480] math_slowIsqrt NONRENDERING: cmsecs = 38.76 Nexus4: running bench [640 480] math_fastIsqrt NONRENDERING: cmsecs = 33.36 running bench [640 480] math_slowIsqrt NONRENDERING: cmsecs = 29.29 Observation: 1. The result for desktop changed a lot. Now fastIsqrt is much better than slowIsqrt. 2. The result of Nexus4 became abnormal. 3. The ratio on desktop and IA phone has big difference. ===================================== I continued to look at difference, and found in my benchmark, in order to avoid unexpected optimization of compiler, I used some tricks (uploaded as patch set 2). Desktop: running bench [640 480] math_fastIsqrt NONRENDERING: cmsecs = 1.39 running bench [640 480] math_slowIsqrt NONRENDERING: cmsecs = 5.50 IA phone: running bench [640 480] math_fastIsqrt NONRENDERING: cmsecs = 14.34 running bench [640 480] math_slowIsqrt NONRENDERING: cmsecs = 38.33 Nexus4: running bench [640 480] math_fastIsqrt NONRENDERING: cmsecs = 18.09 running bench [640 480] math_slowIsqrt NONRENDERING: cmsecs = 29.99 Observation: 1. The results seem to make sense than before. 2. I think this is the correct way to measure the performance, though it introduces some (constant) overhead. Any comments?
It sounds like a large chunk of the difference is that your new code and the revised benchmarks are only passing nonnegative numbers to sqrt. That sounds to me like something it's safe to optimize for. I think a change from nextSScalar1() to nextRangeF() is good, although we should probably set the min to be 0 instead of 1. Thanks for putting in all the research work. Your Patch Set 2 left out all the new code you'd written for intrinsics, and your new benchmarks; was that intentional?
According to my test last time, intrinsic solution is on par with fastinv solution, so I think fastinv can be our final choice and simply benefits all platforms. Patch set 2 is just to show how I changed the benchmark to get the results. I uploaded patch set 3 as a formal patch of this change, in which I replaced two original benchmarks with a new converged one, so that we may extend it if needed (e.g., we may need a double version). Below are test results against it at my side: IA desktop: running bench [640 480] math_inv_sqrt_float_fast NONRENDERING: cmsecs = 2.22 running bench [640 480] math_inv_sqrt_float_slow NONRENDERING: cmsecs = 27.91 IA phone: running bench [640 480] math_inv_sqrt_float_fast NONRENDERING: cmsecs = 70.13 running bench [640 480] math_inv_sqrt_float_slow NONRENDERING: cmsecs = 190.19 Nexus4: running bench [640 480] math_inv_sqrt_float_fast NONRENDERING: cmsecs = 92.62 running bench [640 480] math_inv_sqrt_float_slow NONRENDERING: cmsecs = 149.13 Please have a review, thanks! Once we are clear with this patch, I will go over source code to replace existing calculations of reciprocal of square root with this fast solution.
lgtm
CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/yang.gu@intel.com/21755002/18001
Message was sent while issue was closed.
Change committed as 10671
Message was sent while issue was closed.
https://codereview.chromium.org/21755002/diff/18001/include/core/SkMath.h File include/core/SkMath.h (right): https://codereview.chromium.org/21755002/diff/18001/include/core/SkMath.h#new... include/core/SkMath.h:176: static inline float SkFloatInvSqrt(float x) { Need dox for a public API 1. Should probably be renamed to somehow reflect that this is an approximation of the inverse. 2. Why is this public right now? What is the driving case for it? 3. Possibly document why this works at all
Message was sent while issue was closed.
https://codereview.chromium.org/21755002/diff/18001/include/core/SkMath.h File include/core/SkMath.h (right): https://codereview.chromium.org/21755002/diff/18001/include/core/SkMath.h#new... include/core/SkMath.h:176: static inline float SkFloatInvSqrt(float x) { On 2013/08/12 14:05:54, reed1 wrote: > Need dox for a public API > > 1. Should probably be renamed to somehow reflect that this is an approximation > of the inverse. We may add "fast" into name (e.g., SkFloatFastInvSqrt) as this algorithm is often referred as fast invsqrt(http://en.wikipedia.org/wiki/Fast_inverse_square_root). I want to keep "float" in name as we may need a version of double (which uses Newton's method twice). However, I think current name is OK (SkFloatFastInvSqrt or so is a bit too long), and many libraries just use invsqrt or rsqrt as name. For example, in code of Quake 3 (origin of the algorithm though it has deeper root), the snippet is float InvSqrt (float x){ float xhalf = 0.5f*x; int i = *(int*)&x; i = 0x5f3759df - (i>>1); x = *(float*)&i; x = x*(1.5f - xhalf*x*x); return x; } And in hackersdelight, the code is float rsqrt(float x0) { union {int ix; float x;}; // union {int ihalf; float xhalf;}; // For alternative halving step. x = x0; // x can be viewed as int. // ihalf = ix - 0x00800000; // Alternative to line below, for x not a denorm. float xhalf = 0.5f*x; // ix = 0x5f3759df - (ix >> 1); // Initial guess (traditional), // but slightly better: ix = 0x5f375a82 - (ix >> 1); // Initial guess. x = x*(1.5f - xhalf*x*x); // Newton step. // x = x*(1.5008908 - xhalf*x*x); // Newton step for a balanced error. return x; } What's more, we may also evolve this API to be a bit more complicated. According to my test result, the performance of intrinsic solution for IA is on par with this fast solution, but the intrinsic solution may have better precision. We may have this API be used everywhere in Skia to calculate the reciprocal square root, without caring if it will lose precision. I'm working on this right now. Is it safe to do this everywhere, that is, is there any place that requires high precision that this algorithm can't meet? > 2. Why is this public right now? What is the driving case for it? The driven reason is performance. Actually we already had such benchmark in MathBench, and @tomhudson created an issue at https://code.google.com/p/skia/issues/detail?id=885. After some investigation, I think there are some issues with original benchmark. After the correction of benchmark, I saw an obvious speedup on all platforms (IA desktop, IA phone and ARM phone (Nexus4)). Please refer my previous post for detailed data. > 3. Possibly document why this works at all How about adding following comment on this API? (I referred a paper http://www.lomont.org/Math/Papers/2003/InvSqrt.pdf) /** * Return the approximate reciprocal square root of an IEEE * float (single precision). * The algorithm has obvious speedup than regular floating * point division, with a small sacrifice of precision. * According to some paper, its maximal relative error is * 0.175228. */ Should I add a patch set for this?
Message was sent while issue was closed.
Since there is no caller in skia, I think we should move this into the bench itself. If/when we have places in skia that use it, we can reconsider adding it to the core. Since we've lifted the impl from other sources, I think we should document that.
Message was sent while issue was closed.
On 2013/08/13 03:33:31, Yang Gu wrote: > > 2. Why is this public right now? What is the driving case for it? > The driven reason is performance. Actually we already had such benchmark in > MathBench, and @tomhudson created an issue at > https://code.google.com/p/skia/issues/detail?id=885. After some investigation, I > think there are some issues with original benchmark. After the correction of > benchmark, I saw an obvious speedup on all platforms (IA desktop, IA phone and > ARM phone (Nexus4)). Please refer my previous post for detailed data. I think Mike's point is that performance on the microbenchmarks is irrelevant. Performance only matters when it surfaces on real workloads. Unfortunately, on bug 885 I didn't make a note of which real Android workloads had shown heavy use of sqrt(). We'd have to dig to find them.
Message was sent while issue was closed.
My argue is that even if we find the real Android workloads, we may also question its generality. Actually the merged patch just moved code from MathBench to core (I only renamed and reformatted it). I'm OK to add comments to it, but I really don't know its original source. I think we can point it to code of Quake 3, which is the origin of this algorithm. For me, this patch is reasonable as the calculation of reciprocal square root happens at many places of Skia. So my plan is: 1) Have an API in core (the merged patch). 2) Replace original calculations in Skia with this new API. I definitely need your green light for step 1, so that I may continue to work on step 2. Below is a list (maybe not complete) I plan to work at step 2: experimental/Intersection/SkAntiEdge.cpp double dist = fabs(numer) / sqrt(denom); experimental/Intersection/LineParameters.h double normal = sqrt(normalSquared()); double reciprocal = 1 / normal; experimental/Intersection/ConvexHull_Test.cpp double length = sqrt(dx * dx + dy * dy); double invLength = 1 / length; experimental/Intersection/CubicUtilities.cpp double theta = acos(R / sqrt(Q3)); experimental/Intersection/DataTypes.h const double FLT_EPSILON_SQRT = sqrt(FLT_EPSILON); const double FLT_EPSILON_INVERSE = 1 / FLT_EPSILON; experimental/Intersection/CubicToQuadratics.cpp double dist = sqrt(dx * dx + dy * dy); double tDiv3 = precision / (adjust * dist); experimental/AndroidPathRenderer/AndroidPathRenderer.cpp float scaleX = sk_float_sqrt(m00 * m00 + m01 * m01); float scaleY = sk_float_sqrt(m10 * m10 + m11 * m11); inverseScaleX = (scaleX != 0) ? (1.0f / scaleX) : 1.0f; inverseScaleY = (scaleY != 0) ? (1.0f / scaleY) : 1.0f; samplecode/SampleAARects.cpp canvas->translate(SkFloatToScalar(20.0f / sqrtf(2.f)), SkFloatToScalar(20.0f / sqrtf(2.f))); src/effects/SkEmbossMaskFilter.cpp mag = SkScalarSqrt(mag); for (int i = 0; i < 3; i++) { v[i] = SkScalarDiv(v[i], mag); } src/views/SkTouchGesture.cpp double dist0 = sqrt(dx*dx + dy*dy); double scale = dist1 / dist0; src/pathops/SkPathOpsCubic.cpp double theta = acos(R / sqrt(Q3)); src/pathops/SkLineParameters.h double normal = sqrt(normalSquared()); double reciprocal = 1 / normal; src/pathops/SkPathOpsTypes.h const double FLT_EPSILON_SQRT = sqrt(FLT_EPSILON); const double FLT_EPSILON_INVERSE = 1 / FLT_EPSILON; src/utils/SkMatrix44.cpp double scale = 1 / sqrt(len2); src/utils/SkCamera.cpp float mag = sk_float_sqrt(fX*fX + fY*fY + fZ*fZ); float scale = 1.0f / mag; src/animator/SkAnimateActive.cpp SkScalar originalDistance = SkScalarSqrt(originalSum); SkScalar workingDistance = SkScalarSqrt(workingSum); existing->fState[index].fDuration = (SkMSec) SkScalarMulDiv(fState[index].fDuration, workingDistance, originalDistance); src/core/SkGeometry.cpp SkScalar root = SkScalarSqrt(tmp2[1].fZ); dst[0].fW = tmp2[0].fZ / root; dst[1].fW = tmp2[2].fZ / root; src/core/SkBitmapProcState.cpp SkScalar levelScale = SkScalarInvert(SkScalarSqrt(scaleSqd)); src/core/SkStrokerPriv.cpp sinHalfAngle = SkScalarSqrt(SkScalarHalf(SK_Scalar1 + dotProd)); mid.setLength(SkScalarDiv(radius, sinHalfAngle)); src/core/SkPoint.cpp mag = sk_float_sqrt(mag2); scale = 1 / mag; |