Issue 21755002: Experiments on calculating reciprocal of square root

Yang Gu

Reciprocal of square root is often calculated here and there in the code, and we ...

7 years, 4 months ago (2013-08-02 08:28:47 UTC) #1

reed1

I'm not sure how important this is for Skia drawing operations. That said, have you ...

7 years, 4 months ago (2013-08-02 13:53:42 UTC) #2

tomhudson

https://code.google.com/p/skia/issues/detail?id=885 We've always seen "slow" inverse sqrt as faster than the "fast" variant in the ...

7 years, 4 months ago (2013-08-02 14:20:09 UTC) #3

Yang Gu

7 years, 4 months ago (2013-08-05 10:12:23 UTC) #4

Glad to know this is already on your radar. I inlined the two new
functions, and here is the data (I sorted it manually, fastest first).
IA Desktop:
running bench [640 480]        rsqrtf_fastinv_inline  NONRENDERING: cmsecs = 14.15
running bench [640 480]      rsqrtf_intrinsic_inline  NONRENDERING: cmsecs = 14.18
running bench [640 480]             rsqrtf_intrinsic  NONRENDERING: cmsecs = 14.18
running bench [640 480]       rsqrtf_portable_inline  NONRENDERING: cmsecs = 25.96
running bench [640 480]              rsqrtf_portable  NONRENDERING: cmsecs = 26.43

IA Phone:
running bench [640 480]      rsqrtf_intrinsic_inline  NONRENDERING: cmsecs = 122.63
running bench [640 480]        rsqrtf_fastinv_inline  NONRENDERING: cmsecs = 127.62
running bench [640 480]             rsqrtf_intrinsic  NONRENDERING: cmsecs = 222.65
running bench [640 480]       rsqrtf_portable_inline  NONRENDERING: cmsecs = 240.16
running bench [640 480]              rsqrtf_portable  NONRENDERING: cmsecs = 315.18

Nexus4:
running bench [640 480]        rsqrtf_fastinv_inline  NONRENDERING: cmsecs = 96.27
running bench [640 480]       rsqrtf_portable_inline  NONRENDERING: cmsecs = 165.66
running bench [640 480]              rsqrtf_portable  NONRENDERING: cmsecs = 238.85

Notes:
1. The IA phone data differs somewhat from my previous data. I double-checked
it, and the numbers are stable. The only difference is that this time I tested
5 cases. If I use the same test suite as on Nexus4 (3 cases), the fastinv result
goes back to about 100 (see below). This might be due to cache misses or other
reasons, but I think the trend is similar.
running bench [640 480]        rsqrtf_fastinv_inline  NONRENDERING: cmsecs = 100.14
running bench [640 480]       rsqrtf_portable_inline  NONRENDERING: cmsecs = 232.70
running bench [640 480]              rsqrtf_portable  NONRENDERING: cmsecs = 307.70

2. I could not reproduce the situation where slowIsqrt is better than fastIsqrt
on desktop. My data shows fast (fastinv_inline, 14.15) at about 1.8x the
performance of slow (portable_inline, 25.96).

3. On IA (desktop and phone), fastinv is on par with the inlined intrinsic
solution.
4. On desktop, intrinsic and intrinsic_inline perform the same, so the compiler
may be auto-inlining the code, which differs from the cross-compiler's behavior.
5. On all platforms, fastinv (fastinv_inline) is roughly 1.7x to 1.9x as fast as
the slow one (portable_inline).

Given these, do you think we can implement this using fastinv?
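
The thread does not say which instruction the "intrinsic" variant wraps. One plausible reading on IA is the SSE reciprocal square root estimate, and the sketch below assumes exactly that. The names `rsqrt_portable` and `rsqrt_intrinsic` are made up for illustration and are not the function names in the patch.

```cpp
#include <cassert>
#include <cmath>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_RSQRT 1
#endif

// Hypothetical name: reference version at full float precision.
static inline float rsqrt_portable(float x) {
    assert(x > 0.0f);
    return 1.0f / std::sqrt(x);
}

// Hypothetical name: hardware estimate where SSE is available.
static inline float rsqrt_intrinsic(float x) {
    assert(x > 0.0f);
#ifdef HAVE_SSE_RSQRT
    // RSQRTSS estimate of 1/sqrt(x); relative error is below 1.5 * 2^-12.
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
#else
    return rsqrt_portable(x);  // no SSE on this target: use the exact form
#endif
}
```

If this reading is right, the estimate is a single instruction with no Newton step, which would be consistent with it matching fastinv in speed once inlined.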

tomhudson

Let's try to figure out why this is giving different results than our pre-existing inverse ...

7 years, 4 months ago (2013-08-08 10:17:59 UTC) #6

Yang Gu

7 years, 4 months ago (2013-08-09 06:15:15 UTC) #7

I wrote a new benchmark and collected my data with it, so the differences
between the two benchmarks (mine and the existing one) lead to quite different
results. Below are some experiments I ran to understand why.
Without changing any code, the existing benchmark gives these results on my side:
Desktop (i7-3770K @ 3.50GHz):
running bench [640 480]               math_fastIsqrt  NONRENDERING: cmsecs =  5.96
running bench [640 480]               math_slowIsqrt  NONRENDERING: cmsecs =  5.56
IA phone:
running bench [640 480]               math_fastIsqrt  NONRENDERING: cmsecs = 43.97
running bench [640 480]               math_slowIsqrt  NONRENDERING: cmsecs = 51.99
Nexus4:
running bench [640 480]               math_fastIsqrt  NONRENDERING: cmsecs = 36.85
running bench [640 480]               math_slowIsqrt  NONRENDERING: cmsecs = 47.02

Observation:
1. I can roughly reproduce your result that slowIsqrt is faster than fastIsqrt
on desktop, while the opposite holds on mobile.

=====================================
Then I compared my benchmark with the existing one and found that I used
rand.nextF() as source data, whereas the existing one uses rand.nextSScalar1().
nextSScalar1 can return negative values, which unexpectedly cuts out some
instructions. So I changed it to rand.nextRangeF(1, 10000). Below are the test
results:
Desktop:
running bench [640 480]               math_fastIsqrt  NONRENDERING: cmsecs =  0.45
running bench [640 480]               math_slowIsqrt  NONRENDERING: cmsecs =  5.41

IA phone:
running bench [640 480]               math_fastIsqrt  NONRENDERING: cmsecs = 17.76
running bench [640 480]               math_slowIsqrt  NONRENDERING: cmsecs = 38.76

Nexus4:
running bench [640 480]               math_fastIsqrt  NONRENDERING: cmsecs = 33.36
running bench [640 480]               math_slowIsqrt  NONRENDERING: cmsecs = 29.29

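Observation:
1. The desktop result changed a lot. fastIsqrt is now much faster than
slowIsqrt (0.45 vs. 5.41).
2. The Nexus4 result became abnormal: fastIsqrt (33.36) is now slower than
slowIsqrt (29.29).
3. The fast/slow ratio differs greatly between desktop (about 12x) and IA phone
(about 2.2x).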
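
To illustrate why the input range matters, here is a minimal sketch. It assumes `std::mt19937` as a stand-in for the Skia random generator (it is not what the benchmark uses), and the helper names are invented. A negative input makes `1/sqrt(x)` a NaN, so a benchmark fed signed values is partly timing a different code path than one fed values in [1, 10000).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Stand-in for rand.nextRangeF(1, 10000): strictly positive inputs only.
std::vector<float> make_positive_inputs(std::size_t count, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(1.0f, 10000.0f);
    std::vector<float> data(count);
    for (float& v : data) {
        v = dist(gen);
        assert(v >= 1.0f);  // every input is a valid argument for sqrt
    }
    return data;
}

// Count inputs for which 1/sqrt(x) is not a finite number.
std::size_t count_non_finite(const std::vector<float>& data) {
    std::size_t bad = 0;
    for (float x : data) {
        if (!std::isfinite(1.0f / std::sqrt(x))) {
            ++bad;
        }
    }
    return bad;
}
```

Running `count_non_finite` over the benchmark's input array before timing is a cheap way to confirm that both variants see only meaningful inputs.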

=====================================
I kept looking for differences and found that my benchmark uses some tricks to
prevent unwanted compiler optimizations. I applied them to the existing
benchmark (uploaded as patch set 2).

Desktop:
running bench [640 480]               math_fastIsqrt  NONRENDERING: cmsecs =  1.39
running bench [640 480]               math_slowIsqrt  NONRENDERING: cmsecs =  5.50

IA phone:
running bench [640 480]               math_fastIsqrt  NONRENDERING: cmsecs = 14.34
running bench [640 480]               math_slowIsqrt  NONRENDERING: cmsecs = 38.33

Nexus4:
running bench [640 480]               math_fastIsqrt  NONRENDERING: cmsecs = 18.09
running bench [640 480]               math_slowIsqrt  NONRENDERING: cmsecs = 29.99

Observation:
1. The results make more sense than before.
2. I think this is the correct way to measure the performance, though it
introduces some (constant) overhead.
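
The exact tricks are in patch set 2 and are not reproduced in this thread. As a rough sketch of one common approach (an assumption, not necessarily what the patch does), every result can be folded into an accumulator that is finally stored to a `volatile` variable, so the compiler cannot discard the calls. The names `time_ms`, `slow_isqrt` and `g_sink` are illustrative.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Written once per run so the compiler must treat the results as observable.
static volatile float g_sink = 0.0f;

// Time `loops` passes of fn over `data`, in milliseconds. Every result feeds
// the accumulator, so the calls cannot be removed as dead code. The extra add
// is the constant overhead, paid equally by every variant being compared.
template <typename Fn>
double time_ms(Fn fn, const std::vector<float>& data, int loops) {
    assert(loops > 0 && !data.empty());
    const auto start = std::chrono::steady_clock::now();
    float acc = 0.0f;
    for (int i = 0; i < loops; ++i) {
        for (float x : data) {
            acc += fn(x);
        }
    }
    g_sink = acc;
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Baseline playing the role of math_slowIsqrt.
inline float slow_isqrt(float x) { return 1.0f / std::sqrt(x); }
```

Because the accumulate step costs the same for each variant, it shifts all timings by a similar amount and should leave the ordering intact, though it does compress the ratios somewhat.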


Any comments?

tomhudson

It sounds like a large chunk of the difference is that your new code and ...

7 years, 4 months ago (2013-08-09 09:48:58 UTC) #8

Yang Gu

According to my test last time, intrinsic solution is on par with fastinv solution, so ...

7 years, 4 months ago (2013-08-12 05:20:33 UTC) #9

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/yang.gu@intel.com/21755002/18001

7 years, 4 months ago (2013-08-12 08:30:19 UTC) #11

reed1

https://codereview.chromium.org/21755002/diff/18001/include/core/SkMath.h File include/core/SkMath.h (right): https://codereview.chromium.org/21755002/diff/18001/include/core/SkMath.h#newcode176 include/core/SkMath.h:176: static inline float SkFloatInvSqrt(float x) { Need dox for ...

7 years, 4 months ago (2013-08-12 14:05:54 UTC) #13

Yang Gu

7 years, 4 months ago (2013-08-13 03:33:31 UTC) #14

Message was sent while issue was closed.

https://codereview.chromium.org/21755002/diff/18001/include/core/SkMath.h
File include/core/SkMath.h (right):

https://codereview.chromium.org/21755002/diff/18001/include/core/SkMath.h#new...
include/core/SkMath.h:176: static inline float SkFloatInvSqrt(float x) {
On 2013/08/12 14:05:54, reed1 wrote:
> Need dox for a public API
> 
> 1. Should probably be renamed to somehow reflect that this is an approximation
> of the inverse.
We could add "fast" to the name (e.g., SkFloatFastInvSqrt), as this algorithm is
often referred to as fast inverse square root
(http://en.wikipedia.org/wiki/Fast_inverse_square_root). I want to keep "float"
in the name, as we may need a double version (which would use Newton's method
twice). However, I think the current name is OK (SkFloatFastInvSqrt or similar
is a bit too long), and many libraries just use invsqrt or rsqrt as the name.
For example, in the Quake 3 code (the origin of the algorithm, though it has
deeper roots), the snippet is
float InvSqrt (float x){
    float xhalf = 0.5f*x;
    int i = *(int*)&x;
    i = 0x5f3759df - (i>>1);
    x = *(float*)&i;
    x = x*(1.5f - xhalf*x*x);
    return x;
}

And in hackersdelight, the code is
float rsqrt(float x0) {
   union {int ix; float x;};
// union {int ihalf; float xhalf;}; // For alternative halving step.

   x = x0;                      // x can be viewed as int.
// ihalf = ix - 0x00800000;     // Alternative to line below, for x not a denorm.
   float xhalf = 0.5f*x;
// ix = 0x5f3759df - (ix >> 1); // Initial guess (traditional),
//                                 but slightly better:
   ix = 0x5f375a82 - (ix >> 1); // Initial guess.
   x = x*(1.5f - xhalf*x*x);    // Newton step.
// x = x*(1.5008908 - xhalf*x*x);  // Newton step for a balanced error.
   return x;
}
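
Both quoted snippets read the float's bits through a pointer cast or a union, which C++ does not define (strict aliasing, or reading an inactive union member). A minimal sketch of the same computation with `std::memcpy`, which compilers typically reduce to a register move, follows. `fast_inv_sqrt` is a placeholder name, not the one used in the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Placeholder name. Same arithmetic as the Quake 3 snippet, with the bit
// reinterpretation done through memcpy instead of a pointer cast.
inline float fast_inv_sqrt(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");
    assert(x > 0.0f);                   // only meaningful for positive inputs
    const float xhalf = 0.5f * x;
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof i);      // read the float's bit pattern
    i = 0x5f3759dfu - (i >> 1);         // initial guess
    float y;
    std::memcpy(&y, &i, sizeof y);      // reinterpret the guess as a float
    return y * (1.5f - xhalf * y * y);  // one Newton step
}
```

For example, `fast_inv_sqrt(4.0f)` lands slightly below 0.5, since a single Newton step from this guess always underestimates.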

What's more, we may also evolve this API to be a bit more sophisticated.
According to my test results, the intrinsic solution on IA performs on par with
this fast solution, but the intrinsic solution may have better precision.
We could use this API everywhere in Skia that calculates a reciprocal square
root, provided the loss of precision does not matter. I'm working on this right
now. Is it safe to do this everywhere, or is there any place that requires
higher precision than this algorithm can provide?


> 2. Why is this public right now? What is the driving case for it?
The driving reason is performance. Actually, we already had such a benchmark in
MathBench, and @tomhudson created an issue at
https://code.google.com/p/skia/issues/detail?id=885. After some investigation, I
think there are some issues with the original benchmark. After correcting the
benchmark, I saw an obvious speedup on all platforms (IA desktop, IA phone and
ARM phone (Nexus4)). Please refer to my previous post for detailed data.

> 3. Possibly document why this works at all
How about adding the following comment to this API? (I referred to the paper
http://www.lomont.org/Math/Papers/2003/InvSqrt.pdf)
/**
 *  Return the approximate reciprocal square root of an IEEE
 *  float (single precision).
 *  The algorithm is noticeably faster than a regular floating
 *  point square root and division, at a small cost in precision.
 *  According to the paper above, its maximal relative error is
 *  about 0.175% (0.00175228).
 */
 */
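
The error bound quoted from the paper can be checked empirically. The sketch below (with illustrative names, and not code from the patch) sweeps the floats in [1, 4) and records the worst relative error against a double-precision reference. The magic constant is a parameter so that the traditional value and the Hacker's Delight value can be compared.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// One Newton step on the bit-trick guess; `magic` selects the constant.
inline float approx_inv_sqrt(float x, std::uint32_t magic) {
    assert(x > 0.0f);
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof i);
    i = magic - (i >> 1);
    float y;
    std::memcpy(&y, &i, sizeof y);
    return y * (1.5f - 0.5f * x * y * y);
}

// Largest relative error over [1, 4), sampling every `stride`-th float.
// Two binades are enough: scaling the input by 4 halves the guess exactly,
// so the error pattern repeats.
inline double max_relative_error(std::uint32_t magic, std::uint32_t stride) {
    assert(stride > 0);
    const std::uint32_t first = 0x3f800000u;  // bit pattern of 1.0f
    const std::uint32_t last = 0x40800000u;   // bit pattern of 4.0f
    double worst = 0.0;
    for (std::uint32_t bits = first; bits < last; bits += stride) {
        float x;
        std::memcpy(&x, &bits, sizeof x);
        const double exact = 1.0 / std::sqrt(static_cast<double>(x));
        const double err = std::fabs(approx_inv_sqrt(x, magic) - exact) / exact;
        if (err > worst) {
            worst = err;
        }
    }
    return worst;
}
```

Calling `max_relative_error(0x5f3759dfu, 64u)` and `max_relative_error(0x5f375a82u, 64u)` should both give a fraction near 0.00175, i.e. a fraction of a percent rather than tens of percent.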

Should I add a patch set for this?

reed1

Since there is no caller in skia, I think we should move this into the ...

7 years, 4 months ago (2013-08-13 12:37:02 UTC) #15

tomhudson

On 2013/08/13 03:33:31, Yang Gu wrote: > > 2. Why is this public right now? ...

7 years, 4 months ago (2013-08-13 21:39:25 UTC) #16

Yang Gu

7 years, 4 months ago (2013-08-14 03:17:08 UTC) #17

Message was sent while issue was closed.

My argument is that even if we find real Android workloads, we may still
question their generality. Actually, the merged patch just moved code from
MathBench to core (I only renamed and reformatted it). I'm OK with adding
comments to it, but I really don't know its original source. I think we can
point to the Quake 3 code, which is the origin of this algorithm.
To me, this patch is reasonable, as the reciprocal square root is calculated in
many places in Skia. So my plan is: 1) Have an API in core (the merged patch).
2) Replace the original calculations in Skia with this new API.
I definitely need your green light for step 1 so that I can continue to work on
step 2. Below is a (possibly incomplete) list of places I plan to work on in
step 2:

experimental/Intersection/SkAntiEdge.cpp
    double dist = fabs(numer) / sqrt(denom);

experimental/Intersection/LineParameters.h
    double normal = sqrt(normalSquared());
    double reciprocal = 1 / normal;

experimental/Intersection/ConvexHull_Test.cpp
    double length = sqrt(dx * dx + dy * dy);
    double invLength = 1 / length;

experimental/Intersection/CubicUtilities.cpp
    double theta = acos(R / sqrt(Q3));

experimental/Intersection/DataTypes.h
    const double FLT_EPSILON_SQRT = sqrt(FLT_EPSILON);
    const double FLT_EPSILON_INVERSE = 1 / FLT_EPSILON;

experimental/Intersection/CubicToQuadratics.cpp
    double dist = sqrt(dx * dx + dy * dy);
    double tDiv3 = precision / (adjust * dist);

experimental/AndroidPathRenderer/AndroidPathRenderer.cpp
    float scaleX = sk_float_sqrt(m00 * m00 + m01 * m01);
    float scaleY = sk_float_sqrt(m10 * m10 + m11 * m11);
    inverseScaleX = (scaleX != 0) ? (1.0f / scaleX) : 1.0f;
    inverseScaleY = (scaleY != 0) ? (1.0f / scaleY) : 1.0f;

samplecode/SampleAARects.cpp
    canvas->translate(SkFloatToScalar(20.0f / sqrtf(2.f)),
                      SkFloatToScalar(20.0f / sqrtf(2.f)));

src/effects/SkEmbossMaskFilter.cpp
    mag = SkScalarSqrt(mag);
    for (int i = 0; i < 3; i++) {
        v[i] = SkScalarDiv(v[i], mag);
    }

src/views/SkTouchGesture.cpp
    double dist0 = sqrt(dx*dx + dy*dy);
    double scale = dist1 / dist0;

src/pathops/SkPathOpsCubic.cpp
    double theta = acos(R / sqrt(Q3));

src/pathops/SkLineParameters.h
    double normal = sqrt(normalSquared());
    double reciprocal = 1 / normal;

src/pathops/SkPathOpsTypes.h
    const double FLT_EPSILON_SQRT = sqrt(FLT_EPSILON);
    const double FLT_EPSILON_INVERSE = 1 / FLT_EPSILON;

src/utils/SkMatrix44.cpp
    double scale = 1 / sqrt(len2);

src/utils/SkCamera.cpp
    float mag = sk_float_sqrt(fX*fX + fY*fY + fZ*fZ);
    float scale = 1.0f / mag;

src/animator/SkAnimateActive.cpp
    SkScalar originalDistance = SkScalarSqrt(originalSum);
    SkScalar workingDistance = SkScalarSqrt(workingSum);
    existing->fState[index].fDuration = (SkMSec)
        SkScalarMulDiv(fState[index].fDuration, workingDistance, originalDistance);

src/core/SkGeometry.cpp
    SkScalar root = SkScalarSqrt(tmp2[1].fZ);
    dst[0].fW = tmp2[0].fZ / root;
    dst[1].fW = tmp2[2].fZ / root;


src/core/SkBitmapProcState.cpp
    SkScalar levelScale = SkScalarInvert(SkScalarSqrt(scaleSqd));

src/core/SkStrokerPriv.cpp
    sinHalfAngle = SkScalarSqrt(SkScalarHalf(SK_Scalar1 + dotProd));
    mid.setLength(SkScalarDiv(radius, sinHalfAngle));

src/core/SkPoint.cpp
    mag = sk_float_sqrt(mag2);
    scale = 1 / mag;
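
As a sketch of what step 2 might look like at a float call site of the SkCamera.cpp or SkPoint.cpp kind: the square root and division collapse into one call. `Vec3`, `normalize_exact`, `normalize_approx` and `inv_sqrt_approx` are invented for illustration. The last one stands in for the proposed core API, and its body is the Quake 3 form rather than a copy of the Skia implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Stand-in for the proposed core API (hypothetical body, Quake 3 form).
inline float inv_sqrt_approx(float x) {
    assert(x > 0.0f);
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof i);
    i = 0x5f3759dfu - (i >> 1);
    float y;
    std::memcpy(&y, &i, sizeof y);
    return y * (1.5f - 0.5f * x * y * y);
}

struct Vec3 { float x, y, z; };  // illustrative type, not a Skia class

// Current pattern: square root followed by a division.
inline Vec3 normalize_exact(Vec3 v) {
    const float mag = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    const float scale = 1.0f / mag;
    return {v.x * scale, v.y * scale, v.z * scale};
}

// Proposed pattern: one call to the approximation, no sqrt and no divide.
inline Vec3 normalize_approx(Vec3 v) {
    const float scale = inv_sqrt_approx(v.x * v.x + v.y * v.y + v.z * v.z);
    return {v.x * scale, v.y * scale, v.z * scale};
}
```

Note that several entries in the list above compute in `double` or define constants, so a single-precision approximation would change their results more than it would for the float sites. Those would probably need a case-by-case look.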

Issue 21755002: Experiments on calculating reciprocal of square root (Closed)

Patch Set 1 #

Patch Set 2 : Change existing benchmark code #

Patch Set 3 : Better calculate reciprocal of square root #