Created: 4 years, 5 months ago by shiyu.zhang
Modified: 4 years, 1 month ago
CC: Pan, Weiliang, tianyou.li
Base URL: https://chromium.googlesource.com/v8/v8.git@master
Target Ref: refs/pending/heads/master
Project: v8
Visibility: Public.
Description
Add basic instruction latency modeling for ia32 and x64 respectively.
The bigcore shares the same instruction latency table as the smallcore (ATOM).
The more accurate latency modeling will benefit the instruction scheduler for
ia32 and x64 without introducing extra regressions.
Committed: https://crrev.com/1b08c7a777d613ee433886749c94c86fce9d20b2
Cr-Commit-Position: refs/heads/master@{#40493}
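For context, the latency model the description refers to is a per-opcode table that the backend hands to TurboFan's instruction scheduler. Below is a minimal, self-contained sketch of the shape such a table takes; the opcode names and cycle counts are invented for illustration and are not the values added by this CL.

// Self-contained sketch of a per-opcode latency table. The opcode names and
// cycle counts are invented for illustration, not the values in this CL.
#include <cstdio>

enum class ArchOpcode { kAdd32, kImul32, kIdiv32, kFloat64Mul, kFloat64Div, kLoad };

// Estimated cycles until an instruction's result is available. The scheduler
// uses these estimates to start long-latency instructions early so their
// latency overlaps independent work.
int GetInstructionLatency(ArchOpcode opcode) {
  switch (opcode) {
    case ArchOpcode::kIdiv32:
    case ArchOpcode::kFloat64Div:
      return 20;  // division: long latency, the most worth hiding
    case ArchOpcode::kFloat64Mul:
      return 5;
    case ArchOpcode::kImul32:
    case ArchOpcode::kLoad:
      return 3;
    default:
      return 1;  // cheap ALU ops and anything unmodeled
  }
}

int main() {
  std::printf("idiv latency: %d\n", GetInstructionLatency(ArchOpcode::kIdiv32));
  std::printf("add latency:  %d\n", GetInstructionLatency(ArchOpcode::kAdd32));
}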
Patch Set 1: Add instruction latency modeling only for Atom platform
Patch Set 2: Big-core shares the same latency model as Atom
Patch Set 3: Rebase on Sep. 6
Patch Set 4: Rebase on Oct. 8
Messages
Total messages: 25 (14 generated)
Description was changed from

==========
[turbofan] Enable instruction scheduling for Intel Atom platform

Enable instruction scheduling in turbofan for ia32 and x64 on Atom. Basic instruction latency modeling is added for ia32 and x64.

Instruction scheduling can introduce performance improvement on Atom. For example, after turn on FLAG_turbo_instruction_scheduling: Octane2-zlib improves 5% on ia32 and 1.25% on x64, bench_skinning improves 12% on ia32 and 10% on x64, JetStream-float-mm.c improves 4% on both ia32 and x64, JetStream-n-body.c improves 5% on both ia32 and x64.

BUG=
==========

to

==========
[turbofan] Enable instruction scheduling for ia32 and x64 platform

Enable instruction scheduling in turbofan for ia32 & x64. Basic instruction latency modeling is added for ia32 and x64 respectively.

Instruction scheduling can introduce performance improvement on Atom. For example, after turn on FLAG_turbo_instruction_scheduling: Octane2-zlib improves 5% on ia32 and 1.25% on x64, bench_skinning improves 12% on ia32 and 10% on x64, JetStream-float-mm.c improves 4% on both ia32 and x64, JetStream-n-body.c improves 5% on both ia32 and x64.

BUG=
==========
Description was changed from

==========
[turbofan] Enable instruction scheduling for ia32 and x64 platform

Enable instruction scheduling in turbofan for ia32 & x64. Basic instruction latency modeling is added for ia32 and x64 respectively.

Instruction scheduling can introduce performance improvement on Atom. For example, after turn on FLAG_turbo_instruction_scheduling: Octane2-zlib improves 5% on ia32 and 1.25% on x64, bench_skinning improves 12% on ia32 and 10% on x64, JetStream-float-mm.c improves 4% on both ia32 and x64, JetStream-n-body.c improves 5% on both ia32 and x64.

BUG=
==========

to

==========
[turbofan] Enable instruction scheduling for ia32 and x64 platform

Add basic instruction latency modeling for ia32 and x64 respectively. Fix an instruction selector related test case error caused by instruction scheduling. The bigcore and smallcore (ATOM) share same instruction latency table per Danno's guide for better validation confidence.

Instruction scheduling introduces performance improvement on Atom since scheduled instruction helps less-OOO ATOM processor to get better instruction latency hidden. For example, after turn on FLAG_turbo_instruction_scheduling on Atom: Octane2-zlib improves 5% on ia32 and 1.3% on x64, bench_skinning improves 12% on ia32 and 10% on x64, JetStream-float-mm.c improves 4% on both ia32 and x64, JetStream-n-body.c improves 5% on both ia32 and x64.

Improvement on bigcore platform (Haswell) is also observed, for example: bench_copy improves 7.5% on ia32 and 2% on x64. However, two regressions occur on bigcore platform (Haswell): bench_corrections regresses 7% on ia32, bench_skinning regresses 8% on x64. The most likely reason for the regressions is that the benefit from instruction latency hidden is less in such strong-OOO processor, as well as the scheduled code may increase the register pressure. The accumulated result of those 2 exceptional cases is regression. The balance between instruction latency hidden and register pressure when implementing instruction scheduling is worth studying.
==========
shiyu.zhang@intel.com changed reviewers: + danno@chromium.org
Please take a look. Thanks!
Thanks for the patch! This matches my expectations about the implementation we discussed, but the big-core regressions seem quite large, so I am hesitant to land this as-is. I have some ideas about how to mitigate the register allocation problems that might be the root cause of the slowdown; let's discuss in today's Chrome/Intel meeting.
Description was changed from

==========
[turbofan] Enable instruction scheduling for ia32 and x64 platform

Add basic instruction latency modeling for ia32 and x64 respectively. Fix an instruction selector related test case error caused by instruction scheduling. The bigcore and smallcore (ATOM) share same instruction latency table per Danno's guide for better validation confidence.

Instruction scheduling introduces performance improvement on Atom since scheduled instruction helps less-OOO ATOM processor to get better instruction latency hidden. For example, after turn on FLAG_turbo_instruction_scheduling on Atom: Octane2-zlib improves 5% on ia32 and 1.3% on x64, bench_skinning improves 12% on ia32 and 10% on x64, JetStream-float-mm.c improves 4% on both ia32 and x64, JetStream-n-body.c improves 5% on both ia32 and x64.

Improvement on bigcore platform (Haswell) is also observed, for example: bench_copy improves 7.5% on ia32 and 2% on x64. However, two regressions occur on bigcore platform (Haswell): bench_corrections regresses 7% on ia32, bench_skinning regresses 8% on x64. The most likely reason for the regressions is that the benefit from instruction latency hidden is less in such strong-OOO processor, as well as the scheduled code may increase the register pressure. The accumulated result of those 2 exceptional cases is regression. The balance between instruction latency hidden and register pressure when implementing instruction scheduling is worth studying.
==========

to

==========
Add basic instruction latency modeling for ia32 and x64 respectively. The bigcore shares same instruction latency table as smallcore (ATOM). The accurate latency modeling will benefit the instruction scheduler for ia32 and x64 without introducing extra regression.

Instruction scheduling introduces performance improvement on Atom since scheduled instruction helps less-OOO ATOM processor to get better instruction latency hidden. For example, after turn on FLAG_turbo_instruction_scheduling on Atom: Octane2-zlib improves 5% on ia32, bench_skinning improves 12% on ia32 and 10% on x64, JetStream-float-mm.c improves 4% on both ia32 and x64, JetStream-n-body.c improves 5% on both ia32 and x64.

Improvement on bigcore platform (Haswell) is also observed, for example: bench_copy improves 6% on ia32 and 2% on x64. However, one regression occurs on bigcore platform (Haswell): bench_skinning regresses 8% on x64. This regression exists whether this patch applies or not. The most likely reason for the regression is that the benefit from instruction latency hidden is less in such strong-OOO processor, as well as the scheduled code may increase the register pressure. The accumulated result of this exceptional case is regression. The balance between instruction latency hidden and register pressure when implementing instruction scheduling is worth studying.
==========
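To make the latency-hiding vs. register-pressure trade-off discussed above concrete, here is a small, self-contained model of a simple in-order pipeline. All latencies and the instruction sequences are invented for the example; this is not TurboFan code. Placing independent work between a load and its consumer removes stall cycles, but it also keeps the load's result live across that work, which is where the extra register pressure on a big, strongly out-of-order core comes from.

// Toy illustration of latency hiding on an in-order, single-issue model.
// Latencies and instruction sequences are invented; not TurboFan code.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Op {
  const char* name;
  int latency;  // cycles until this op's result is ready
  int input;    // index of the op whose result this op consumes, or -1
};

// Issue one op per cycle, but stall an op until its input's result is ready.
int Cycles(const std::vector<Op>& seq) {
  std::vector<int> done(seq.size(), 0);
  int cycle = 0, total = 0;
  for (size_t i = 0; i < seq.size(); ++i) {
    int start = cycle;
    if (seq[i].input >= 0) start = std::max(start, done[seq[i].input]);
    done[i] = start + seq[i].latency;
    total = std::max(total, done[i]);
    cycle = start + 1;
  }
  return total;
}

int main() {
  // Unscheduled: the add consumes the load's result immediately and stalls.
  std::vector<Op> unscheduled = {
      {"load", 3, -1}, {"add", 1, 0}, {"mov", 1, -1}, {"mov", 1, -1}};
  // Scheduled: the independent moves fill the load's latency, but the load's
  // result now stays live across them, i.e. register pressure goes up.
  std::vector<Op> scheduled = {
      {"load", 3, -1}, {"mov", 1, -1}, {"mov", 1, -1}, {"add", 1, 0}};
  std::printf("unscheduled: %d cycles\n", Cycles(unscheduled));  // 6
  std::printf("scheduled:   %d cycles\n", Cycles(scheduled));    // 4
}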
shiyu.zhang@intel.com changed reviewers: - danno@chromium.org
Description was changed from

==========
Add basic instruction latency modeling for ia32 and x64 respectively. The bigcore shares same instruction latency table as smallcore (ATOM). The accurate latency modeling will benefit the instruction scheduler for ia32 and x64 without introducing extra regression.

Instruction scheduling introduces performance improvement on Atom since scheduled instruction helps less-OOO ATOM processor to get better instruction latency hidden. For example, after turn on FLAG_turbo_instruction_scheduling on Atom: Octane2-zlib improves 5% on ia32, bench_skinning improves 12% on ia32 and 10% on x64, JetStream-float-mm.c improves 4% on both ia32 and x64, JetStream-n-body.c improves 5% on both ia32 and x64.

Improvement on bigcore platform (Haswell) is also observed, for example: bench_copy improves 6% on ia32 and 2% on x64. However, one regression occurs on bigcore platform (Haswell): bench_skinning regresses 8% on x64. This regression exists whether this patch applies or not. The most likely reason for the regression is that the benefit from instruction latency hidden is less in such strong-OOO processor, as well as the scheduled code may increase the register pressure. The accumulated result of this exceptional case is regression. The balance between instruction latency hidden and register pressure when implementing instruction scheduling is worth studying.
==========

to

==========
Add basic instruction latency modeling for ia32 and x64 respectively. The bigcore shares same instruction latency table as smallcore (ATOM). The accurate latency modeling will benefit the instruction scheduler for ia32 and x64 without introducing extra regression.
==========
After enabling instruction scheduling, the following performance improvements are observed on Atom. With the tuned latency model, some of the improvements are larger and a few regressions are fixed.

                        ia32                  x64
                        Untuned    Tuned      Untuned    Tuned
Octane2-NavierStokes    3.3%       4.3%       5.9%       9.7%
Octane2-Zlib            4.9%       5.1%       0.8%       1.1%
Bench_skinning          -7.5%      12.9%      8.6%       10.4%
JetStream-float-mm.c    4.5%       4.5%       4.3%       4.3%
JetStream-n-body.c      -1%        5.3%       0.4%       6.1%
shiyu.zhang@intel.com changed reviewers: + danno@chromium.org
Hi Danno, thanks for the comments. Here is the instruction latency model for ia32 and x64 that I think is necessary for the instruction scheduler. Please take a look. Thanks!
Thanks for the patch. LGTM!
The CQ bit was checked by shiyu.zhang@intel.com
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: v8_presubmit on master.tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_presubmit/builds/27014)
shiyu.zhang@intel.com changed reviewers: + bmeurer@chromium.org, jarin@chromium.org
mstarzinger@chromium.org changed reviewers: + mstarzinger@chromium.org
LGTM (rubber-stamped).
The CQ bit was checked by shiyu.zhang@intel.com
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Message was sent while issue was closed.
Description was changed from

==========
Add basic instruction latency modeling for ia32 and x64 respectively. The bigcore shares same instruction latency table as smallcore (ATOM). The accurate latency modeling will benefit the instruction scheduler for ia32 and x64 without introducing extra regression.
==========

to

==========
Add basic instruction latency modeling for ia32 and x64 respectively. The bigcore shares same instruction latency table as smallcore (ATOM). The accurate latency modeling will benefit the instruction scheduler for ia32 and x64 without introducing extra regression.
==========
Message was sent while issue was closed.
Committed patchset #4 (id:60001)
Message was sent while issue was closed.
Description was changed from

==========
Add basic instruction latency modeling for ia32 and x64 respectively. The bigcore shares same instruction latency table as smallcore (ATOM). The accurate latency modeling will benefit the instruction scheduler for ia32 and x64 without introducing extra regression.
==========

to

==========
Add basic instruction latency modeling for ia32 and x64 respectively. The bigcore shares same instruction latency table as smallcore (ATOM). The accurate latency modeling will benefit the instruction scheduler for ia32 and x64 without introducing extra regression.

Committed: https://crrev.com/1b08c7a777d613ee433886749c94c86fce9d20b2
Cr-Commit-Position: refs/heads/master@{#40493}
==========
Message was sent while issue was closed.
Patchset 4 (id:??) landed as https://crrev.com/1b08c7a777d613ee433886749c94c86fce9d20b2
Cr-Commit-Position: refs/heads/master@{#40493}