Created: 4 years, 3 months ago by octavian.soldea
Modified: 3 years, 10 months ago
CC: v8-reviews_googlegroups.com, ofrobots
Target Ref: refs/pending/heads/master
Project: v8
Visibility: Public
Description
The function CopyCharsUnsigned(uint8_t* dest, const uint8_t* src, size_t chars) is now an explicit specialization of a template function in which I prefer using MemCopy, a function which maps to AVX instructions on the x64 architecture (on Intel platforms).
The patch modifies the file utils.h. I would like to ask for a review from Jochen Eisinger (jochen@chromium.org).
The submission has practically no effect on V8 performance (I ran the V8 benchmark suite). However, it matters because I have seen an 11% improvement in the HTTP parsing benchmark of Node.js.
For context, this change was first submitted to Node.js as pull request 7898 (see https://github.com/nodejs/node/pull/7898) and was redirected to V8.
BUG=v8:5395
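For illustration only, the specialization the description refers to looks roughly like the sketch below. It is reconstructed from the review discussion that follows (which notes that the actual patch keeps the sizeof and kMinComplexMemCopy checks), not the literal Patch Set 1 diff; kMinComplexMemCopy and MemCopy are the existing helpers in src/utils.h.

// Sketch of an explicit specialization for the uint8_t -> uint8_t case.
// Copies of at least kMinComplexMemCopy bytes defer to MemCopy, which can
// dispatch to an AVX-optimized memcpy on x64; shorter copies use a byte loop.
void CopyCharsUnsigned(uint8_t* dest, const uint8_t* src, size_t chars) {
  uint8_t* limit = dest + chars;
  if (chars >= static_cast<size_t>(kMinComplexMemCopy)) {
    MemCopy(dest, src, chars);
  } else {
    while (dest < limit) *dest++ = *src++;
  }
}

The practical effect of the patch therefore hinges on the value of kMinComplexMemCopy, which is the point the review below converges on.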
Patch Set 1
Total comments: 1
Patch Set 2: Changing kMinComplexMemCopy to 8 on the x64 architecture.

Messages
Total messages: 23 (8 generated)
Description was changed. (The old and the new text both match the description above, with BUG= still empty at this point.)
octavian.soldea@intel.com changed reviewers: + jochen eisinger <jochen@chromium.org>
octavian.soldea@intel.com changed required reviewers: + jochen eisinger <jochen@chromium.org>
jochen@chromium.org changed reviewers: + jkummerow@chromium.org, jochen@chromium.org; - jochen eisinger <jochen@chromium.org>
jochen@chromium.org changed required reviewers: - jochen eisinger <jochen@chromium.org>
hi, thanks for sending this patch.

Could you please format the CL description according to https://www.chromium.org/developers/contributing-code#Writing_change_list_des... ?

I'm not really familiar with this piece. Jakob might be a better reviewer?
Sorry, I'm not happy with this patch in its current form. See one comment on the implementation below. However, I also have higher-level concerns.

Your benchmark results seem rather surprising to me. You say that you're seeing a 10% improvement for "characteristic lengths such as fields=512", while the performance for 4 fields actually regresses by about 5%. How does the number of "fields" relate to the number of characters copied?

As you've seen in the code, there is a threshold already, precisely because what's fastest depends on the copy size: MemCopy is used if chars >= kMinComplexMemCopy == 16 * kPointerSize == 128 (on x64). Given that, for large copy sizes (>= 128 characters), this patch doesn't change anything (or am I missing something?). It may well be possible that we should update this threshold on x64, but dropping it entirely is probably not the best approach. Can you suggest a reasonable value for the constant?

Are there /any/ real-world workloads (on Node or Chrome) that are affected by this, or is it purely to look good on microbenchmarks?

https://codereview.chromium.org/2291773002/diff/1/src/utils.h
File src/utils.h (right):

https://codereview.chromium.org/2291773002/diff/1/src/utils.h#newcode1221
src/utils.h:1221: if (sizeof(*dest) == sizeof(*src)) {
Well, this check is pretty pointless, because always sizeof(*dest) == sizeof(*src) == sizeof(uint8_t) == 1. Comparing this with the function right above, it seems what you really want is to change the value of kMinComplexMemCopy (to 0?) on x64. Why don't you just do that?
octavian.soldea can you please create an issue on V8's issue tracker with the data and more background?
hablich@chromium.org changed reviewers: + ofrobots@google.com
hablich@chromium.org changed reviewers: - ofrobots@google.com
On 2016/08/30 13:51:18, Hablich wrote:
> octavian.soldea can you please create an issue on V8's issue tracker with the
> data and more background?

And not in docx please :-). Txt is probably best.
On 2016/08/30 13:55:21, Hablich wrote:
> And not in docx please :-). Txt is probably best.

Hello

Thank you very much for the feedback. I am working on all these concerns and I will come back with answers as soon as possible. (This includes code changes, of course.)

Best regards,
Octavian
On 2016/08/30 18:00:40, octavian.soldea wrote:
> Thank you very much for the feedback. I am working on all these concerns and I
> will come back with answers as soon as possible.

Hello

Following the reviewers' recommendations, I propose a value of 8 for the kMinComplexMemCopy constant. I have added a performance analysis at https://bugs.chromium.org/p/v8/issues/detail?id=5395. I searched for better numbers; this is the best value I found. I attached files in three formats (csv, tsv, and pdf); they contain the same performance results in different layouts.

While V8 performance is practically unchanged, Node.js behaves better on typical HTTP requests. Moreover, the new code does not change the behaviour of Node.js for small requests.

I would also like to thank the reviewers for pointing out the sizeof issue. They are perfectly right, and the current code reflects this change as well.

Thank you once again. I am looking forward to hearing from you.

Best regards,
Octavian
Description was changed: BUG= was updated from empty to BUG=v8:5395; the rest of the description is unchanged.
On 2016/09/15 14:07:39, octavian.soldea wrote:
> [...]
> Thank you once again. I am looking forward to hearing from you.

Dear Reviewers

I would like to ask whether you have had a chance to review my proposed change. I am looking forward to hearing from you.

Best regards,
Octavian
How well does this alternative proposal work for you? https://chromiumcodereview.appspot.com/2438583002

I would prefer not to make utils.h any more complicated than it already is. If anything, we should simplify the whole memcopy-related code.
On 2016/10/19 16:16:22, Jakob Kummerow wrote: > How well does this alternative proposal work for you? > https://chromiumcodereview.appspot.com/2438583002 > > I would prefer not to make utils.h any more complicated than it already is. If > anything, we should simplify the whole memcopy related code. Dear Reviewer Thank you for your message. I am working on collecting data and come back asap. Best regards, Octavian
On 2016/11/02 18:33:48, octavian.soldea wrote:
> Thank you for your message. I am collecting data and will come back as soon as
> possible.

Dear Reviewer

Following a thorough analysis, I conclude that the code proposed last time is superior to the other choices. Please find the evidence below.

From a personal communication with Matteo Collina, who ran a comparison test designed by Andreas Madsen, one can see that our patch, 2291773002, provides better performance than the proposal in 2438583002. This is the performance analysis from Matteo (2291773002 vs. master):

$ node benchmark/compare.js --old ./node --new ./patched-node --runs 5 --filter parser http > compare-patch-v8-2291773002-http-parser.csv
$ cat compare-patch-v8-2291773002-http-parser.csv | Rscript benchmark/compare.R
                                          improvement  significant  p.value
http/bench-parser.js n=100000 fields=16        8.83 %          ***  3.211221e-20
http/bench-parser.js n=100000 fields=32       10.23 %          ***  5.008009e-25
http/bench-parser.js n=100000 fields=4         2.39 %           **  1.624678e-03
http/bench-parser.js n=100000 fields=8         5.12 %          ***  2.419546e-07

The comparison with the one proposed in 2438583002 (old is 2438583002, new is 2291773002):

$ node benchmark/compare.js --old ./node --new ./patched-node --runs 30 --filter parser http > compare-patch-v8-2438583002-http-parser.csv
$ cat compare-patch-v8-2438583002-http-parser.csv | Rscript benchmark/compare.R
                                          improvement  significant  p.value
http/bench-parser.js n=100000 fields=16        1.87 %            *  0.014321166
http/bench-parser.js n=100000 fields=32        2.16 %           **  0.004791449
http/bench-parser.js n=100000 fields=4         0.94 %               0.410998592
http/bench-parser.js n=100000 fields=8         1.20 %               0.135580498

Matteo also provided measurements for the proposed fix 2438583002 vs. master:

$ cat compare-patch-v8-2438583002-http-parser-master.csv | Rscript benchmark/compare.R
                                          improvement  significant  p.value
http/bench-parser.js n=100000 fields=16        2.85 %               8.337515e-02
http/bench-parser.js n=100000 fields=32        5.30 %          ***  3.567017e-04
http/bench-parser.js n=100000 fields=4         3.59 %          ***  9.237691e-05
http/bench-parser.js n=100000 fields=8         3.81 %               6.819173e-02

On top of Matteo's experiments, we have compared behavior on the Let's Chat benchmark and obtained the following results (real time; committed = 2291773002, proposed review = 2438583002):

            baseline  committed  proposed   baseline/    baseline/     proposed/
                                  review     committed    proposed      committed
n=10          28.134    27.954    28.076    1.00643915   1.002065821   1.004364313
n=20          55.477    55.461    55.545    1.000288491  0.9987757674  1.001514578
n=10 c=10     27.523    27.481    27.518    1.001528329  1.000181699   1.001346385
n=20 c=10     54.554    54.368    54.58     1.00342113   0.999523635   1.003899353
n=40 c=10    108.056   107.863   108.213    1.001789307  0.9985491577  1.003244857

The numbers in the baseline, committed, and proposed review columns are the real time measured for the n and c given in the first column. The parameter n is the number of requests; c is the number of concurrent requests (by default, c = 1).
One can see that, consistently, 2291773002 is superior to the baseline and to 2438583002, by a factor of somewhere between 0.1% and 0.5%.

We have also studied the effect of the template specialization alone. We ran the bench parser with n=100000 fields=512 (several times) and added up the requests per second across rounds. We measured 40720.66876 for the baseline and 40736.70174 for a version in which we inserted only the specialization of the template CopyCharsUnsigned. Specifically, void CopyCharsUnsigned(uint8_t* dest, const uint8_t* src, size_t chars) is exactly as in 2291773002, but keeping the old value const int kMinComplexMemCopy = 16 * kPointerSize. While the advantage of the template specialization alone is minor, the improvement is consistent (we performed repeated measurements). The reason we ask to keep the specialization is to keep the constant we propose specifically for x64 (without affecting the default). If the ARM architecture has architecture-specific code, we do not see any reason why x64 should not have its own specialization with the indicated constant.

Furthermore, we have analyzed the performance of 2291773002 versus 2438583002 on V8 and Node.js. For V8 we present sums of 10 consecutive runs (committed = 2291773002, proposed review = 2438583002):

              baseline  committed  proposed   committed/    committed/
                                    review     baseline      proposed
Richards        207975    208066    208025    1.000437553   1.000197092
DeltaBlue       410845    408167    408855    0.9934817267  0.9983172518
Crypto          192897    191252    191678    0.9914721328  0.9977775227
RayTrace        506692    504248    506468    0.995176557   0.9956167023
EarleyBoyer     312616    313181    313275    1.001807329   0.9996999441
RegExp           38323     38944     39018    1.016204368   0.9981034394
Splay           114061    114551    113713    1.004295947   1.00736943
NavierStokes    190674    190993    190408    1.001673013   1.00307235
Score           195971    196080    196075    1.000556205   1.0000255

In conclusion, the patches do not influence the V8 benchmark suite. However, when we ran the bench parser with different arguments, we got the following results (requests per second; the first column is the fields value):

fields   baseline    committed   proposed    committed/   committed/
                                  review      baseline     proposed
2        484626.95   514699.97   511345.6    1.062053957  1.006559888
4        407165.17   429887.95   420631.59   1.055807278  1.02200586
8        292921.2    313528.47   309092.17   1.0703509    1.014352677
16       197917.09   216019.11   206786.14   1.091462642  1.044649849
32       110473.44   119500.96   114584.84   1.081716655  1.042903756
64        60586.46    67277.23    63861.42   1.11043342   1.053487849
128       31910.09    35882.44    34103.25   1.124485703  1.052170688
256       16191       18462.42    17312.59   1.140289049  1.066415828
512        8204.49     9388.71     8852.18   1.144338039  1.060609929
1024       4092.53     4733.6      4411.79   1.156643934  1.072943182
2048       1998.4      2330.76     2178.85   1.16631305   1.069720265

Here, the command line was ./node benchmark/http/bench-parser.js n=constant fields=variable. One can see the performance benefit: 2291773002 shows a 10-15% benefit, while 2438583002 shows a 5-10% benefit compared to the baseline. These measurements were done on a Haswell server.

Best regards,
Octavian
My point still stands: the explicit specialization you provide doesn't do anything. It is the same as the generic implementation. If you see a difference, then either you're measuring noise (or artifacts like alignment), or your compiler is really bad.

Look: Here's the original, generic, platform-independent implementation:

template <typename sourcechar, typename sinkchar>
void CopyCharsUnsigned(sinkchar* dest, const sourcechar* src, size_t chars) {
  sinkchar* limit = dest + chars;
  if ((sizeof(*dest) == sizeof(*src)) &&
      (chars >= static_cast<int>(kMinComplexMemCopy / sizeof(*dest)))) {
    MemCopy(dest, src, chars * sizeof(*dest));
  } else {
    while (dest < limit) *dest++ = static_cast<sinkchar>(*src++);
  }
}

If we request an automatic specialization for uint8_t, what the compiler generates internally as a first step is:

void CopyCharsUnsigned(uint8_t* dest, const uint8_t* src, size_t chars) {
  uint8_t* limit = dest + chars;
  if ((sizeof(*dest) == sizeof(*src)) &&
      (chars >= static_cast<int>(kMinComplexMemCopy / sizeof(*dest)))) {
    MemCopy(dest, src, chars * sizeof(*dest));
  } else {
    while (dest < limit) *dest++ = static_cast<uint8_t>(*src++);
  }
}

Since the compiler knows that sizeof(uint8_t) == 1, and static_cast from uint8_t to uint8_t is a no-op, this simplifies further to:

void CopyCharsUnsigned(uint8_t* dest, const uint8_t* src, size_t chars) {
  uint8_t* limit = dest + chars;
  if ((1 == 1) &&
      (chars >= static_cast<int>(kMinComplexMemCopy / 1))) {
    MemCopy(dest, src, chars * 1);
  } else {
    while (dest < limit) *dest++ = *src++;
  }
}

Now if the trivial expressions "1 == 1", "/ 1", "* 1" are constant-folded away, you end up exactly with the specialization you provided manually. (With one minor exception: you still include "/ sizeof(*dest)" in your code, even though you know that dest has type uint8_t so the division is a no-op.)

Am I missing something?

Which leads me to conclude that if the manual specialization does indeed produce different code than automatic instantiation of the templatized version, then it's a bug in your compiler.

Further, we strongly dislike platform-specific code because it's a maintenance nightmare. If you want to add any "#ifdef *_ARCH_*" code, you need a pretty strong reason that the given platform really benefits a lot. Copying platform-independent code to a platform-specific #ifdef does not count as significant benefit.

So it all seems to boil down to you saying that kMinComplexMemCopy = 8 is better than kMinComplexMemCopy = 16. Fine. It's still a one-liner then.

(I'd also still be curious to hear if there are any other scenarios than the bench-parser.js microbenchmark that benefit from this change. It's easy to "prove" almost anything with a microbenchmark. It's also easy to make the wrong engineering decisions based on microbenchmarks -- maybe a bigger kMinComplexMemCopy constant is better in real workloads? That's the primary reason why I'd tend to be careful with changing it; 128 -> 16 seems safer than 128 -> 8 just because it's a smaller step.)
On 2016/11/24 10:25:20, Jakob Kummerow wrote:
> My point still stands: the explicit specialization you provide doesn't do
> anything. It is the same as the generic implementation.
> [...]

Dear Reviewer

Thank you for your message. I really appreciate your explanations. I am preparing an answer and will come back as soon as possible.

Best regards,
Octavian
On 2016/11/24 14:33:18, octaviansoldea wrote:
> Thank you for your message. I really appreciate your explanations. I am
> preparing an answer and will come back as soon as possible.

Hi Jakob,

Thank you for the feedback in your previous answers.

Taking many constraints into account, I understand the problems related to the specialization and accept these comments. While kMinComplexMemCopy = 16 already performs better than the baseline, I would prefer to set kMinComplexMemCopy = 8. Here is an experiment comparing kMinComplexMemCopy = 8, kMinComplexMemCopy = 16, and the baseline.

Benchmarking these three versions with n=1000000 and fields = 4, 256, and 512, I measured the following requests per second (each line is one fields value; K8 is kMinComplexMemCopy = 8, K16 is kMinComplexMemCopy = 16):

             K8        K16       Baseline
fields=4     811,728   786,777   784,156
fields=256    34,844    33,364    29,990
fields=512    17,616    16,865    15,397

Therefore, I would like to ask you to approve kMinComplexMemCopy = 8.

I am looking forward to hearing from you.

Wishing you a Happy New Year!
Octavian
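For reference, the one-liner this boils down to would look roughly like the following sketch in src/utils.h. The V8_HOST_ARCH_X64 guard and the exact placement are assumptions, not the literal Patch Set 2 diff:

#if V8_HOST_ARCH_X64
// Sketch only: lower the threshold on x64 so that MemCopy (and the
// AVX-optimized memcpy behind it) is used for copies of 8 bytes or more.
const int kMinComplexMemCopy = 8;
#else
const int kMinComplexMemCopy = 16 * kPointerSize;
#endif

Keeping the #else branch at the existing 16 * kPointerSize default would leave other architectures unaffected, matching the intent stated earlier of not changing the default.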
On 2017/01/10 03:40:29, octavian.soldea wrote:
> [...]
> Therefore, I would like to ask you to approve kMinComplexMemCopy = 8.

Hi Jakob,

To give a more complete answer, I would like to come back with several arguments. The compiler I used is gcc version 5. To analyze the behavior of the code, I performed a comparative analysis of the generated assembly.

When expanding the function CopyCharsUnsigned for kMinComplexMemCopy = 128 (baseline), 8, and 16, the very first assembly instruction is cmp $0x[value]. The only difference I have seen is that the immediate [value] of the cmp instruction differs: it is 7f, 7, and f respectively. Therefore, the difference is when the memcpy function is called. At that point CopyCharsUnsigned calls the memcpy implemented in the standard C library, i.e. __memcpy_avx_unaligned, which is highly optimized with AVX instructions on x64. The more we use memcpy, the better; therefore, there is an advantage in using it. Furthermore, when dropping the kMinComplexMemCopy constant entirely, the assembly expands into a direct call to memcpy, without a size test. This is the best option, in my humble opinion.

Could you please make a decision regarding this change? I am looking forward to hearing from you.

Best regards,
Octavian
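A minimal standalone reduction along these lines makes the cmp-immediate observation easy to reproduce with g++ -O2 -S. The file name is hypothetical and the code is a sketch, not the CL itself:

// threshold_repro.cc -- hypothetical standalone reduction, not from the CL.
// Build with: g++ -O2 -S threshold_repro.cc
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static const size_t kMinComplexMemCopy = 8;  // try 8, 16, or 128

void CopyCharsUnsigned(uint8_t* dest, const uint8_t* src, size_t chars) {
  if (chars >= kMinComplexMemCopy) {
    // Large copies go through the libc memcpy; on AVX-capable x64 CPUs glibc
    // can resolve this to __memcpy_avx_unaligned.
    memcpy(dest, src, chars);
  } else {
    // Small copies use a simple byte loop.
    uint8_t* limit = dest + chars;
    while (dest < limit) *dest++ = *src++;
  }
}

Rebuilding with the constant set to 8, 16, or 128 typically changes only the compare immediate (0x7, 0xf, or 0x7f) emitted ahead of the memcpy call, matching the analysis above.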