Created: 4 years, 2 months ago by fbarchard1
Modified: 4 years, 2 months ago
Reviewers: hubbe
CC: chromium-reviews
Target Ref: refs/pending/heads/master
Project: chromium
Visibility: Public.
Description
libyuv r1629 roll for AVX2 optimized HalfFloatPlane and ExtractAlpha.
Important changes included:
AVX2 and NEON halffloat conversion.
Add F16C cpu detection.
Support for I411 removed.
Side by side UV for I420 output improves memory coherency.
HalfFloat AVX2 ported from SSE2 using same magic number method, which
is 20% faster than vcvtps2ph method and produces identical results.
HalfFloat Neon version adapted from inner loop of vectorized C, but folds shift and narrow into
one instruction and uses element multiply instead of vector to save a
register and dup instruction. Neon version is also full performance with -Os.
This CL enables -O2 for libyuv_neon as well.
ExtractAlpha ported to AVX2.
ARGB4444ToI420 ported to MSA.
F16C cpu detection for AVX hardware that has halffloat conversion support.
Change log:
https://chromium.googlesource.com/libyuv/libyuv/+log/198bce39..550cf829
Full changes
https://chromium.googlesource.com/libyuv/libyuv/+/198bce39..550cf829
TEST=TestHalfFloatPlane_denormal
BUG=libyuv:560, libyuv:650, libyuv:572, libyuv:645, libyuv:649
R=hubbe@chromium.org
CQ_INCLUDE_TRYBOTS=master.tryserver.blink:linux_precise_blink_rel
Committed: https://crrev.com/6f32b27bf0f9e369fe4e94c499cec6afe2abb56e
Cr-Commit-Position: refs/heads/master@{#426849}
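The description lists F16C cpu detection for AVX hardware. As a rough illustration only, the sketch below reads the F16C bit from CPUID leaf 1 (ECX bit 29, next to the AVX bit 28). The names `EcxReportsF16C` and `CpuHasF16C` are hypothetical and are not libyuv's API; the `<cpuid.h>` path is GCC/Clang on x86 only, and a complete check would also confirm OS support for AVX state via XGETBV, which is omitted here.

```cpp
#include <cstdint>
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>  // GCC/Clang helper for the CPUID instruction
#endif

// CPUID leaf 1, ECX feature bits.
constexpr uint32_t kEcxAvx = 1u << 28;
constexpr uint32_t kEcxF16c = 1u << 29;

// Hypothetical helper: true only when both AVX and F16C are reported,
// since the F16C conversions operate on AVX registers.
constexpr bool EcxReportsF16C(uint32_t ecx) {
  return (ecx & (kEcxAvx | kEcxF16c)) == (kEcxAvx | kEcxF16c);
}

// Hypothetical runtime query; returns false on non-x86 or unsupported
// compilers. OS support for AVX state (XGETBV) is not checked in this sketch.
bool CpuHasF16C() {
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
  unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
    return false;
  }
  return EcxReportsF16C(ecx);
#else
  return false;
#endif
}
```

Keeping the bit decoding in a separate pure function makes it possible to test the logic without depending on the host CPU.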
Patch Set 1
Patch Set 2 : avoid reading off end of buffer VideoResourceUpdaterTest.MakeHalfFloatTest
Total comments: 2
Patch Set 3 : avoid reading off end of buffer VideoResourceUpdaterTest.MakeHalfFloatTest
Patch Set 4 : unittest change removed
Messages
Total messages: 22 (15 generated)
The CQ bit was checked by fbarchard@google.com to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The previous roll was rolled back due to a test failure in cc_unittests on Mac for HalfFloat.
Manual test run:
  ninja -j7 -v -C out/Release cc_unittests
  out/Release/cc_unittests --gtest_filter=*MakeHalfFloatTest
The test needs to run on an AVX2 cpu or under an emulator. The new code uses the same method for all cpus, so the test should pass. To make it fail on purpose, enable the F16 version of the function in row.h.
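For readers unfamiliar with the "magic number method" that the description says is shared across cpus, here is a minimal scalar sketch. It assumes the magic constant is 2^-112, which rebiases the float32 exponent to the float16 bias so that a 13-bit right shift of the float bits yields the half-float bits (truncating, not rounding). The function names `ToHalfMagic` and `HalfFloatRowSketch` are made up for this illustration and are not the functions in row.h. The denormal case relies on the FPU not flushing denormals to zero.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumed magic constant: 2^-112. Multiplying by it shifts the float32
// exponent bias (127) down to the float16 bias (15).
constexpr float kScaleBias = 1.9259299444e-34f;

// Hypothetical helper: converts one 16-bit sample, scaled by `scale`,
// to IEEE half-float bits. The low 13 mantissa bits are truncated.
inline uint16_t ToHalfMagic(uint16_t value, float scale) {
  const float mult = kScaleBias * scale;
  const float scaled = static_cast<float>(value) * mult;
  uint32_t bits = 0;
  std::memcpy(&bits, &scaled, sizeof(bits));  // well-defined bit copy
  return static_cast<uint16_t>(bits >> 13);
}

// Hypothetical row function: converts `width` samples from src to dst.
void HalfFloatRowSketch(const uint16_t* src, uint16_t* dst, float scale,
                        int width) {
  assert(width >= 0);
  for (int i = 0; i < width; ++i) {
    dst[i] = ToHalfMagic(src[i], scale);
  }
}
```

With a scale of 1.0f, an input of 1 produces 0x3C00, the half-float encoding of 1.0. Because float32 denormals line up with float16 denormals after the rebias, small results also convert without a special case.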
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_android_rel_ng on master.tryserver.chromium.android (JOB_FAILED, https://build.chromium.org/p/tryserver.chromium.android/builders/linux_androi...)
Description was changed: appended CQ_INCLUDE_TRYBOTS=master.tryserver.blink:linux_precise_blink_rel after R=hubbe@chromium.org. The title at this point was "libyuv r1628 for improved HalfFloat and ExtractAlpha AVX2 support"; the rest of the text is unchanged.
https://codereview.chromium.org/2425423006/diff/20001/cc/resources/video_reso...
File cc/resources/video_resource_updater_unittest.cc (right):

https://codereview.chromium.org/2425423006/diff/20001/cc/resources/video_reso...
cc/resources/video_resource_updater_unittest.cc:558: if (i < num_values - 1) {
Why is this change needed?
It seems it will just use the previous expected precision for the last number.
That doesn't seem bad, but why do we need it?
The CQ bit was checked by fbarchard@google.com to run a CQ dry run
Description was changed: title changed from "libyuv r1628 for improved HalfFloat and ExtractAlpha AVX2 support" to "libyuv r1629 for improved HalfFloat and ExtractAlpha AVX2 support". The rest of the text is unchanged.
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The AVX2 version had a bug in the gcc version. This is fixed in 1629.

https://codereview.chromium.org/2425423006/diff/20001/cc/resources/video_reso...
File cc/resources/video_resource_updater_unittest.cc (right):

https://codereview.chromium.org/2425423006/diff/20001/cc/resources/video_reso...
cc/resources/video_resource_updater_unittest.cc:558: if (i < num_values - 1) {
On 2016/10/20 18:55:11, hubbe wrote:
> Why is this change needed?
> It seems it will just use the previous expected precision for the last number.
> That doesn't seem bad, but why do we need it?

Done. unittest does not need a change.
lgtm
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Description was changed: added "ARGB4444ToI420 ported to MSA.", "F16C cpu detection for AVX hardware that has halffloat conversion support.", and the "Change log:" and "Full changes" links (https://chromium.googlesource.com/libyuv/libyuv/+log/198bce39..550cf829 and https://chromium.googlesource.com/libyuv/libyuv/+/198bce39..550cf829). The rest of the text is unchanged.
Description was changed: title changed from "libyuv r1629 for improved HalfFloat and ExtractAlpha AVX2 support" to "libyuv r1629 roll for AVX2 optimized HalfFloatPlane and ExtractAlpha."; added the "Important changes included:" list (AVX2 and NEON halffloat conversion. Add F16C cpu detection. Support for I411 removed. Side by side UV for I420 output improves memory coherency.); BUG= extended from libyuv:560 to libyuv:560, libyuv:650, libyuv:572, libyuv:645, libyuv:649. The rest of the text is unchanged.
The CQ bit was checked by fbarchard@google.com
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Message was sent while issue was closed.
Description was changed: line breaks only; no visible change to the text.
Message was sent while issue was closed.
Committed patchset #4 (id:60001)
Message was sent while issue was closed.
Description was changed: appended "Committed: https://crrev.com/6f32b27bf0f9e369fe4e94c499cec6afe2abb56e" and "Cr-Commit-Position: refs/heads/master@{#426849}". The rest of the text is unchanged.
Message was sent while issue was closed.
Patchset 4 (id:??) landed as https://crrev.com/6f32b27bf0f9e369fe4e94c499cec6afe2abb56e Cr-Commit-Position: refs/heads/master@{#426849}