Created: 4 years, 3 months ago by octavian.soldea
Modified: 3 years, 10 months ago
CC: v8-reviews_googlegroups.com, ofrobots
Target Ref: refs/pending/heads/master
Project: v8
Visibility: Public
Description
The function CopyCharsUnsigned(uint8_t* dest, const uint8_t* src, size_t chars) is now an explicit specialization of a template function in which I prefer using MemCopy, a function which maps to AVX instructions on the x64 architecture (on Intel platforms).
The patch modifies the file utils.h. I would like to ask for a review from Jochen Eisinger (jochen@chromium.org).
The submission has practically no effect on V8 performance (I ran the V8 benchmark suite). However, it matters because I have seen an 11% improvement in the HTTP parsing benchmark of Node.js.
For context, this change was first submitted to Node.js as pull request 7898 (see https://github.com/nodejs/node/pull/7898) and was redirected to V8.
BUG=v8:5395
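For illustration only, the specialization the description refers to looks roughly like the sketch below. It is reconstructed from the review discussion that follows (which notes that the actual patch keeps the sizeof and kMinComplexMemCopy checks), not the literal Patch Set 1 diff; kMinComplexMemCopy and MemCopy are the existing helpers in src/utils.h.

// Sketch of an explicit specialization for the uint8_t -> uint8_t case.
// Copies of at least kMinComplexMemCopy bytes defer to MemCopy, which can
// dispatch to an AVX-optimized memcpy on x64; shorter copies use a byte loop.
void CopyCharsUnsigned(uint8_t* dest, const uint8_t* src, size_t chars) {
  uint8_t* limit = dest + chars;
  if (chars >= static_cast<size_t>(kMinComplexMemCopy)) {
    MemCopy(dest, src, chars);
  } else {
    while (dest < limit) *dest++ = *src++;
  }
}

The practical effect of the patch therefore hinges on the value of kMinComplexMemCopy, which is the point the review below converges on.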
Patch Set 1
Total comments: 1
Patch Set 2: Changing kMinComplexMemCopy to 8 on the x64 architecture.

Messages
Total messages: 23 (8 generated)
Description was changed. (The old and the new text both match the description above, with BUG= still empty at this point.)
octavian.soldea@intel.com changed reviewers: + jochen eisinger <jochen@chromium.org>
octavian.soldea@intel.com changed required reviewers: + jochen eisinger <jochen@chromium.org>
jochen@chromium.org changed reviewers: + jkummerow@chromium.org, jochen@chromium.org; - jochen eisinger <jochen@chromium.org>
jochen@chromium.org changed required reviewers: - jochen eisinger <jochen@chromium.org>
hi, thanks for sending this patch.

Could you please format the CL description according to https://www.chromium.org/developers/contributing-code#Writing_change_list_des... ?

I'm not really familiar with this piece. Jakob might be a better reviewer?
Sorry, I'm not happy with this patch in its current form. See one comment on the implementation below. However, I also have higher-level concerns.

Your benchmark results seem rather surprising to me. You say that you're seeing a 10% improvement for "characteristic lengths such as fields=512", while the performance for 4 fields actually regresses by about 5%. How does the number of "fields" relate to the number of characters copied?

As you've seen in the code, there is a threshold already, precisely because what's fastest depends on the copy size: MemCopy is used if chars >= kMinComplexMemCopy == 16 * kPointerSize == 128 (on x64). Given that, for large copy sizes (>= 128 characters), this patch doesn't change anything (or am I missing something?). It may well be possible that we should update this threshold on x64, but dropping it entirely is probably not the best approach. Can you suggest a reasonable value for the constant?

Are there /any/ real-world workloads (on Node or Chrome) that are affected by this, or is it purely to look good on microbenchmarks?

https://codereview.chromium.org/2291773002/diff/1/src/utils.h
File src/utils.h (right):

https://codereview.chromium.org/2291773002/diff/1/src/utils.h#newcode1221
src/utils.h:1221: if (sizeof(*dest) == sizeof(*src)) {
Well, this check is pretty pointless, because always sizeof(*dest) == sizeof(*src) == sizeof(uint8_t) == 1. Comparing this with the function right above, it seems what you really want is to change the value of kMinComplexMemCopy (to 0?) on x64. Why don't you just do that?
octavian.soldea can you please create an issue on V8's issue tracker with the data and more background?
hablich@chromium.org changed reviewers: + ofrobots@google.com
hablich@chromium.org changed reviewers: - ofrobots@google.com
On 2016/08/30 13:51:18, Hablich wrote:
> octavian.soldea can you please create an issue on V8's issue tracker with the
> data and more background?

And not in docx please :-). Txt is probably best.
On 2016/08/30 13:55:21, Hablich wrote:
> And not in docx please :-). Txt is probably best.

Hello

Thank you very much for the feedback. I am working on all these concerns and I will come back with answers as soon as possible. (This includes code changes, of course.)

Best regards,
Octavian
On 2016/08/30 18:00:40, octavian.soldea wrote:
> Thank you very much for the feedback. I am working on all these concerns and I
> will come back with answers as soon as possible.

Hello

Following the reviewers' recommendations, I propose a value of 8 for the kMinComplexMemCopy constant. I have added a performance analysis at https://bugs.chromium.org/p/v8/issues/detail?id=5395. I searched for better numbers; this is the best value I found. I attached files in three formats (csv, tsv, and pdf); they contain the same performance results in different layouts.

While V8 performance is practically unchanged, Node.js behaves better on typical HTTP requests. Moreover, the new code does not change the behaviour of Node.js for small requests.

I would also like to thank the reviewers for pointing out the sizeof issue. They are perfectly right, and the current code reflects this change as well.

Thank you once again. I am looking forward to hearing from you.

Best regards,
Octavian
Description was changed: BUG= was updated from empty to BUG=v8:5395; the rest of the description is unchanged.
On 2016/09/15 14:07:39, octavian.soldea wrote:
> [...]
> Thank you once again. I am looking forward to hearing from you.

Dear Reviewers

I would like to ask whether you have had a chance to review my proposed change. I am looking forward to hearing from you.

Best regards,
Octavian
How well does this alternative proposal work for you? https://chromiumcodereview.appspot.com/2438583002

I would prefer not to make utils.h any more complicated than it already is. If anything, we should simplify the whole memcopy-related code.
On 2016/10/19 16:16:22, Jakob Kummerow wrote: > How well does this alternative proposal work for you? > https://chromiumcodereview.appspot.com/2438583002 > > I would prefer not to make utils.h any more complicated than it already is. If > anything, we should simplify the whole memcopy related code. Dear Reviewer Thank you for your message. I am working on collecting data and come back asap. Best regards, Octavian
On 2016/11/02 18:33:48, octavian.soldea wrote:
> Thank you for your message. I am collecting data and will come back as soon as
> possible.

Dear Reviewer

Following a thorough analysis, I conclude that the code proposed last time is superior to the other choices. Please find the evidence below.

From a personal communication with Matteo Collina, who ran a comparison test designed by Andreas Madsen, one can see that our patch, 2291773002, provides better performance than the proposal in 2438583002. This is the performance analysis from Matteo (2291773002 vs. master):

$ node benchmark/compare.js --old ./node --new ./patched-node --runs 5 --filter parser http > compare-patch-v8-2291773002-http-parser.csv
$ cat compare-patch-v8-2291773002-http-parser.csv | Rscript benchmark/compare.R
                                          improvement  significant  p.value
http/bench-parser.js n=100000 fields=16        8.83 %          ***  3.211221e-20
http/bench-parser.js n=100000 fields=32       10.23 %          ***  5.008009e-25
http/bench-parser.js n=100000 fields=4         2.39 %           **  1.624678e-03
http/bench-parser.js n=100000 fields=8         5.12 %          ***  2.419546e-07

The comparison with the one proposed in 2438583002 (old is 2438583002, new is 2291773002):

$ node benchmark/compare.js --old ./node --new ./patched-node --runs 30 --filter parser http > compare-patch-v8-2438583002-http-parser.csv
$ cat compare-patch-v8-2438583002-http-parser.csv | Rscript benchmark/compare.R
                                          improvement  significant  p.value
http/bench-parser.js n=100000 fields=16        1.87 %            *  0.014321166
http/bench-parser.js n=100000 fields=32        2.16 %           **  0.004791449
http/bench-parser.js n=100000 fields=4         0.94 %               0.410998592
http/bench-parser.js n=100000 fields=8         1.20 %               0.135580498

Matteo also provided measurements for the proposed fix 2438583002 vs. master:

$ cat compare-patch-v8-2438583002-http-parser-master.csv | Rscript benchmark/compare.R
                                          improvement  significant  p.value
http/bench-parser.js n=100000 fields=16        2.85 %               8.337515e-02
http/bench-parser.js n=100000 fields=32        5.30 %          ***  3.567017e-04
http/bench-parser.js n=100000 fields=4         3.59 %          ***  9.237691e-05
http/bench-parser.js n=100000 fields=8         3.81 %               6.819173e-02

On top of Matteo's experiments, we have compared behavior on the Let's Chat benchmark and obtained the following results (real time; committed = 2291773002, proposed review = 2438583002):

            baseline  committed  proposed   baseline/    baseline/     proposed/
                                  review     committed    proposed      committed
n=10          28.134    27.954    28.076    1.00643915   1.002065821   1.004364313
n=20          55.477    55.461    55.545    1.000288491  0.9987757674  1.001514578
n=10 c=10     27.523    27.481    27.518    1.001528329  1.000181699   1.001346385
n=20 c=10     54.554    54.368    54.58     1.00342113   0.999523635   1.003899353
n=40 c=10    108.056   107.863   108.213    1.001789307  0.9985491577  1.003244857

The numbers in the baseline, committed, and proposed review columns are the real time measured for the n and c given in the first column. The parameter n is the number of requests; c is the number of concurrent requests (by default, c = 1).
One can see that, consistently, 2291773002 is superior to the baseline and to 2438583002, by a factor of somewhere between 0.1% and 0.5%.

We have also studied the effect of the template specialization alone. We ran the bench parser with n=100000 fields=512 (several times) and added up the requests per second across rounds. We measured 40720.66876 for the baseline and 40736.70174 for a version in which we inserted only the specialization of the template CopyCharsUnsigned. Specifically, void CopyCharsUnsigned(uint8_t* dest, const uint8_t* src, size_t chars) is exactly as in 2291773002, but keeping the old value const int kMinComplexMemCopy = 16 * kPointerSize. While the advantage of the template specialization alone is minor, the improvement is consistent (we performed repeated measurements). The reason we ask to keep the specialization is to keep the constant we propose specifically for x64 (without affecting the default). If the ARM architecture has architecture-specific code, we do not see any reason why x64 should not have its own specialization with the indicated constant.

Furthermore, we have analyzed the performance of 2291773002 versus 2438583002 on V8 and Node.js. For V8 we present sums of 10 consecutive runs (committed = 2291773002, proposed review = 2438583002):

              baseline  committed  proposed   committed/    committed/
                                    review     baseline      proposed
Richards        207975    208066    208025    1.000437553   1.000197092
DeltaBlue       410845    408167    408855    0.9934817267  0.9983172518
Crypto          192897    191252    191678    0.9914721328  0.9977775227
RayTrace        506692    504248    506468    0.995176557   0.9956167023
EarleyBoyer     312616    313181    313275    1.001807329   0.9996999441
RegExp           38323     38944     39018    1.016204368   0.9981034394
Splay           114061    114551    113713    1.004295947   1.00736943
NavierStokes    190674    190993    190408    1.001673013   1.00307235
Score           195971    196080    196075    1.000556205   1.0000255

In conclusion, the patches do not influence the V8 benchmark suite. However, when we ran the bench parser with different arguments, we got the following results (requests per second; the first column is the fields value):

fields   baseline    committed   proposed    committed/   committed/
                                  review      baseline     proposed
2        484626.95   514699.97   511345.6    1.062053957  1.006559888
4        407165.17   429887.95   420631.59   1.055807278  1.02200586
8        292921.2    313528.47   309092.17   1.0703509    1.014352677
16       197917.09   216019.11   206786.14   1.091462642  1.044649849
32       110473.44   119500.96   114584.84   1.081716655  1.042903756
64        60586.46    67277.23    63861.42   1.11043342   1.053487849
128       31910.09    35882.44    34103.25   1.124485703  1.052170688
256       16191       18462.42    17312.59   1.140289049  1.066415828
512        8204.49     9388.71     8852.18   1.144338039  1.060609929
1024       4092.53     4733.6      4411.79   1.156643934  1.072943182
2048       1998.4      2330.76     2178.85   1.16631305   1.069720265

Here, the command line was ./node benchmark/http/bench-parser.js n=constant fields=variable. One can see the performance benefit: 2291773002 shows a 10-15% benefit, while 2438583002 shows a 5-10% benefit compared to the baseline. These measurements were done on a Haswell server.

Best regards,
Octavian
My point still stands: the explicit specialization you provide doesn't do anything. It is the same as the generic implementation. If you see a difference, then either you're measuring noise (or artifacts like alignment), or your compiler is really bad.

Look: Here's the original, generic, platform-independent implementation:

template <typename sourcechar, typename sinkchar>
void CopyCharsUnsigned(sinkchar* dest, const sourcechar* src, size_t chars) {
  sinkchar* limit = dest + chars;
  if ((sizeof(*dest) == sizeof(*src)) &&
      (chars >= static_cast<int>(kMinComplexMemCopy / sizeof(*dest)))) {
    MemCopy(dest, src, chars * sizeof(*dest));
  } else {
    while (dest < limit) *dest++ = static_cast<sinkchar>(*src++);
  }
}

If we request an automatic specialization for uint8_t, what the compiler generates internally as a first step is:

void CopyCharsUnsigned(uint8_t* dest, const uint8_t* src, size_t chars) {
  uint8_t* limit = dest + chars;
  if ((sizeof(*dest) == sizeof(*src)) &&
      (chars >= static_cast<int>(kMinComplexMemCopy / sizeof(*dest)))) {
    MemCopy(dest, src, chars * sizeof(*dest));
  } else {
    while (dest < limit) *dest++ = static_cast<uint8_t>(*src++);
  }
}

Since the compiler knows that sizeof(uint8_t) == 1, and static_cast from uint8_t to uint8_t is a no-op, this simplifies further to:

void CopyCharsUnsigned(uint8_t* dest, const uint8_t* src, size_t chars) {
  uint8_t* limit = dest + chars;
  if ((1 == 1) &&
      (chars >= static_cast<int>(kMinComplexMemCopy / 1))) {
    MemCopy(dest, src, chars * 1);
  } else {
    while (dest < limit) *dest++ = *src++;
  }
}

Now if the trivial expressions "1 == 1", "/ 1", "* 1" are constant-folded away, you end up exactly with the specialization you provided manually. (With one minor exception: you still include "/ sizeof(*dest)" in your code, even though you know that dest has type uint8_t so the division is a no-op.)

Am I missing something?

Which leads me to conclude that if the manual specialization does indeed produce different code than automatic instantiation of the templatized version, then it's a bug in your compiler.

Further, we strongly dislike platform-specific code because it's a maintenance nightmare. If you want to add any "#ifdef *_ARCH_*" code, you need a pretty strong reason that the given platform really benefits a lot. Copying platform-independent code to a platform-specific #ifdef does not count as significant benefit.

So it all seems to boil down to you saying that kMinComplexMemCopy = 8 is better than kMinComplexMemCopy = 16. Fine. It's still a one-liner then.

(I'd also still be curious to hear if there are any other scenarios than the bench-parser.js microbenchmark that benefit from this change. It's easy to "prove" almost anything with a microbenchmark. It's also easy to make the wrong engineering decisions based on microbenchmarks -- maybe a bigger kMinComplexMemCopy constant is better in real workloads? That's the primary reason why I'd tend to be careful with changing it; 128 -> 16 seems safer than 128 -> 8 just because it's a smaller step.)
On 2016/11/24 10:25:20, Jakob Kummerow wrote:
> My point still stands: the explicit specialization you provide doesn't do
> anything. It is the same as the generic implementation.
> [...]

Dear Reviewer

Thank you for your message. I really appreciate your explanations. I am preparing an answer and will come back as soon as possible.

Best regards,
Octavian
On 2016/11/24 14:33:18, octaviansoldea wrote:
> Thank you for your message. I really appreciate your explanations. I am
> preparing an answer and will come back as soon as possible.

Hi Jakob,

Thank you for the feedback in your previous answers.

Taking many constraints into account, I understand the problems related to the specialization and accept these comments. While kMinComplexMemCopy = 16 already performs better than the baseline, I would prefer to set kMinComplexMemCopy = 8. Here is an experiment comparing kMinComplexMemCopy = 8, kMinComplexMemCopy = 16, and the baseline.

Benchmarking these three versions with n=1000000 and fields = 4, 256, and 512, I measured the following requests per second (each line is one fields value; K8 is kMinComplexMemCopy = 8, K16 is kMinComplexMemCopy = 16):

             K8        K16       Baseline
fields=4     811,728   786,777   784,156
fields=256    34,844    33,364    29,990
fields=512    17,616    16,865    15,397

Therefore, I would like to ask you to approve kMinComplexMemCopy = 8.

I am looking forward to hearing from you.

Wishing you a Happy New Year!
Octavian
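For reference, the one-liner this boils down to would look roughly like the following sketch in src/utils.h. The V8_HOST_ARCH_X64 guard and the exact placement are assumptions, not the literal Patch Set 2 diff:

#if V8_HOST_ARCH_X64
// Sketch only: lower the threshold on x64 so that MemCopy (and the
// AVX-optimized memcpy behind it) is used for copies of 8 bytes or more.
const int kMinComplexMemCopy = 8;
#else
const int kMinComplexMemCopy = 16 * kPointerSize;
#endif

Keeping the #else branch at the existing 16 * kPointerSize default would leave other architectures unaffected, matching the intent stated earlier of not changing the default.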
On 2017/01/10 03:40:29, octavian.soldea wrote:
> [...]
> Therefore, I would like to ask you to approve kMinComplexMemCopy = 8.

Hi Jakob,

To give a more complete answer, I would like to come back with several arguments. The compiler I used is gcc version 5. To analyze the behavior of the code, I performed a comparative analysis of the generated assembly.

When expanding the function CopyCharsUnsigned for kMinComplexMemCopy = 128 (baseline), 8, and 16, the very first assembly instruction is cmp $0x[value]. The only difference I have seen is that the immediate [value] of the cmp instruction differs: it is 7f, 7, and f respectively. Therefore, the difference is when the memcpy function is called. At that point CopyCharsUnsigned calls the memcpy implemented in the standard C library, i.e. __memcpy_avx_unaligned, which is highly optimized with AVX instructions on x64. The more we use memcpy, the better; therefore, there is an advantage in using it. Furthermore, when dropping the kMinComplexMemCopy constant entirely, the assembly expands into a direct call to memcpy, without a size test. This is the best option, in my humble opinion.

Could you please make a decision regarding this change? I am looking forward to hearing from you.

Best regards,
Octavian
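A minimal standalone reduction along these lines makes the cmp-immediate observation easy to reproduce with g++ -O2 -S. The file name is hypothetical and the code is a sketch, not the CL itself:

// threshold_repro.cc -- hypothetical standalone reduction, not from the CL.
// Build with: g++ -O2 -S threshold_repro.cc
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static const size_t kMinComplexMemCopy = 8;  // try 8, 16, or 128

void CopyCharsUnsigned(uint8_t* dest, const uint8_t* src, size_t chars) {
  if (chars >= kMinComplexMemCopy) {
    // Large copies go through the libc memcpy; on AVX-capable x64 CPUs glibc
    // can resolve this to __memcpy_avx_unaligned.
    memcpy(dest, src, chars);
  } else {
    // Small copies use a simple byte loop.
    uint8_t* limit = dest + chars;
    while (dest < limit) *dest++ = *src++;
  }
}

Rebuilding with the constant set to 8, 16, or 128 typically changes only the compare immediate (0x7, 0xf, or 0x7f) emitted ahead of the memcpy call, matching the analysis above.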