
Reducing the Cost of Conditional Transfers of Control by Using Comparison Specifications

May 30, 2006


◆ Authors and Affiliations

• William Kreahling - Western Carolina University

• Stephen Hines - Florida State University

• David Whalley - Florida State University

• Gary Tyson - Florida State University

LCTES 2006 slide 1


◆ Introduction

• Conditional transfers of control are expensive.

– consume a large number of cycles
– cause pipeline flushes
– inhibit other code-improving transformations

• Conditional transfers of control can be broken into three portions.

– comparison (boolean test)
– calculation of branch target address
– actual transfer of control

• Most prior work focuses on the branch target address or the branch itself.

• This research focuses on the comparison portion of conditional transfers of control.


◆ Separate Instructions

• comparison instruction sets a register

• accessed by the branch instruction

• Advantage: freedom to encode all the necessary information

• Disadvantages

– two instructions needed
– may stall at the comparison instruction


◆ Single Instruction

• single instruction performs compare and branch

• Advantages

– only one instruction
– branch reached sooner, prediction made sooner

• Disadvantages

– fewer bits allocated for branch target address
– may limit constant that can be compared


◆ Comparison Specifications with Cbranches

• Decouple the specification of the values to be compared from the actual comparison.

– encoding flexibility of separate compare and branch instructions
– efficiency of a single compare-and-branch instruction

• New Instructions

– comparison specification (cmpspec)
– compare and branch (cbranch)


◆ New Hardware

• comparison register file

• read/write ports for this file

• forwarding hardware

– cmpspec → cbranch

• separate adder for calculating branch target address


◆ Overview of Decode Stage

[Figure: decode-stage datapath. Labels: PC, InstMem, IF Stage, IF/ID, ID Stage (First Half: CmpRegs; Second Half: GPRegs), ID/EX]

• Comparison register file is accessed in first half of stage.

• GP register file is accessed in second half of stage to get actual values.

• Values to be compared are passed to the execute stage.

• Constants may also be stored in comparison register file.


◆ Experimental Environment

• VPO compiler

• classic five-stage in-order pipeline

• Arm port of the SimpleScalar Simulator

• modified GNU tools (assembler)


◆ Old Vs. New

1 r[2]=MEM;
2 IC=r[2]?r[3];
3 PC=IC<0,L6;

(a) Original RTLs

1 r[2]=MEM;
2 c[0]=2,3;
3 PC=c[0]<,L6;

(b) New RTLs

• (a) comparison on line 2, branch on line 3

• (b) cmpspec on line 2, cbranch on line 3
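The mapping from the original RTLs to the new ones can be sketched as a small rewrite pass. This is only an illustration, not the VPO implementation: the function name `to_cmpspec_form` and the regular expressions are hypothetical, and only register-register comparisons are handled.

```python
import re

# Hypothetical patterns for the two RTL shapes shown on the slide.
CMP = re.compile(r"IC=r\[(\d+)\]\?r\[(\d+)\];")
BR = re.compile(r"PC=IC(<=|>=|==|!=|<|>)0,(\w+);")

def to_cmpspec_form(rtls, creg=0):
    """Rewrite compare/branch RTL pairs into cmpspec/cbranch RTLs."""
    out = []
    for rtl in rtls:
        m = CMP.fullmatch(rtl)
        if m:
            # The cmpspec records register numbers, not register values.
            out.append(f"c[{creg}]={m.group(1)},{m.group(2)};")
            continue
        m = BR.fullmatch(rtl)
        if m:
            # The cbranch carries the relational operator and the target.
            out.append(f"PC=c[{creg}]{m.group(1)},{m.group(2)};")
            continue
        out.append(rtl)
    return out

original = ["r[2]=MEM;", "IC=r[2]?r[3];", "PC=IC<0,L6;"]
print(to_cmpspec_form(original))
```

Applied to the RTLs in (a), the sketch yields the three RTLs in (b).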


◆ Pipeline Diagrams

1 r[2]=MEM;
2 IC=r[2]?r[3];
3 PC=IC<0,L6;

(a) Original RTLs

1 r[2]=MEM;
2 c[0]=2,3;
3 PC=c[0]<,L6;

(b) New RTLs

Pipeline diagram for (a), by cycle:

inst        0     1     2     3     4     5     6     7
1) load     IF    ID    EX    MEM   WB
2) cmp            IF    ID    stall EX    MEM   WB
3) branch               IF    stall ID    EX    MEM   WB

Pipeline diagram for (b), by cycle:

inst        0     1     2     3     4     5     6     7
1) load     IF    ID    EX    MEM   WB
2) cmpspec        IF    ID    EX    MEM   WB
3) cbranch              IF    ID    EX    MEM   WB
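The stall behaviour in the two diagrams can be reproduced with a very small timing model. This is a sketch under stated assumptions, not the SimpleScalar model used in the study: a loaded value is usable by a consumer's EX stage two cycles after the load's EX, an ALU result one cycle after, and the cbranch is assumed to obtain c[0] through the cmpspec → cbranch forwarding path without stalling. The names `schedule`, `ORIGINAL`, and `NEW` are hypothetical.

```python
def schedule(program):
    """program: list of (name, dests, srcs, is_load) in program order.
    Returns {name: {stage: cycle}} for an in-order five-stage pipeline."""
    ready = {}      # register -> first cycle a consumer may be in EX
    prev = None
    timing = {}
    for name, dests, srcs, is_load in program:
        if prev is None:
            fetch, decode = 0, 1
        else:
            # IF frees up when the previous instruction enters ID;
            # ID frees up when the previous instruction enters EX.
            fetch = max(prev["IF"] + 1, prev["ID"])
            decode = max(fetch + 1, prev["EX"])
        execute = max([decode + 1] + [ready.get(s, 0) for s in srcs])
        stages = {"IF": fetch, "ID": decode, "EX": execute,
                  "MEM": execute + 1, "WB": execute + 2}
        for d in dests:
            ready[d] = execute + (2 if is_load else 1)
        timing[name] = stages
        prev = stages
    return timing

ORIGINAL = [("load", ["r2"], [], True),
            ("cmp", ["IC"], ["r2", "r3"], False),
            ("branch", [], ["IC"], False)]
# The cmpspec reads no GP registers; the cbranch reads r2 and r3 itself.
NEW = [("load", ["r2"], [], True),
       ("cmpspec", ["c0"], [], False),
       ("cbranch", [], ["r2", "r3"], False)]

print(schedule(ORIGINAL)["branch"]["WB"])   # 7
print(schedule(NEW)["cbranch"]["WB"])       # 6
```

Under these assumptions the original sequence finishes in cycle 7 and the cmpspec sequence in cycle 6, matching the diagrams.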


◆ Loop-Invariant Code Motion

L3:
r[2]=MEM;
IC=r[1]?r[2];
PC=IC<0,L3;

(a) Original Code

L3:
r[2]=MEM;
c[0]=1,2;
PC=c[0]<,L3;

(b) Code with Cmpspec

c[0]=1,2;
L3:
r[2]=MEM;
PC=c[0]<,L3;

(c) Cmpspec out of Loop

• cmpspecs within loops can typically be moved into loop preheaders

• pay cost once, when loop is entered

• values within registers being compared may change, cmpspec does not
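The "pay cost once" point can be made concrete by counting executed instructions for versions (a) and (c). The iteration count of 100 and the helper `dynamic_count` are hypothetical, chosen only for illustration.

```python
def dynamic_count(preheader, body, iterations):
    """Instructions executed: preheader once, body once per iteration."""
    return len(preheader) + len(body) * iterations

ITERATIONS = 100  # hypothetical trip count

original = dynamic_count([], ["r[2]=MEM;", "IC=r[1]?r[2];", "PC=IC<0,L3;"], ITERATIONS)
hoisted = dynamic_count(["c[0]=1,2;"], ["r[2]=MEM;", "PC=c[0]<,L3;"], ITERATIONS)

print(original, hoisted)  # 300 201
```

The hoisting is legal because the cmpspec records register numbers rather than values, so the load of r[2] inside the loop does not invalidate it.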


◆ Pipeline Diagram

c[0]=1,2;
L3:
r[2]=MEM;
PC=c[0]<,L3;

(c) Cmpspec out of Loop

Pipeline diagram, by cycle:

inst        0     1     2     3     4     5     6
1) load     IF    ID    EX    MEM   WB
2) cbranch        IF    ID    stall EX    MEM   WB


◆ Loop-Invariant Code Motion – cont

L2:
c[0]=2,3;
PC=c[0]==,L6;
...
c[0]=5,12;
PC=c[0]!=,L5;
...
// br to L2;

(a) Before Renaming

L2:
c[0]=2,3;
PC=c[0]==,L6;
...
c[1]=5,12;
PC=c[1]!=,L5;
...
// br to L2;

(b) After Renaming

c[0]=2,3;
c[1]=5,12;
L2:
PC=c[0]==,L6;
...
PC=c[1]!=,L5;
...
// br to L2;

(c) After Code Motion

• cmpspecs usually reference c[0]

• when a conflict occurs, rename a comparison register

• if no free registers remain, the cmpspec stays inside the loop
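A minimal sketch of renaming followed by code motion, assuming the loop body is a list of RTL strings without its label. This is not the VPO algorithm: the function `rename_and_hoist` is hypothetical, and it is deliberately conservative (all cmpspecs are hoisted or none are).

```python
import re

SPEC = re.compile(r"c\[(\d+)\]=(\d+,\d+);")
CBR = re.compile(r"PC=c\[(\d+)\](.+);")

def rename_and_hoist(body, num_cregs):
    """Return (preheader, new_body). Each distinct cmpspec gets its own
    comparison register and moves to the preheader when enough exist."""
    specs = []                          # distinct operand strings, in order
    for rtl in body:
        m = SPEC.fullmatch(rtl)
        if m and m.group(2) not in specs:
            specs.append(m.group(2))
    if len(specs) > num_cregs:
        return [], list(body)           # no free registers: stay in the loop
    new_reg = {ops: i for i, ops in enumerate(specs)}
    current = {}                        # original creg -> renamed creg
    new_body = []
    for rtl in body:
        m = SPEC.fullmatch(rtl)
        if m:
            current[m.group(1)] = new_reg[m.group(2)]
            continue                    # hoisted, so dropped from the body
        m = CBR.fullmatch(rtl)
        if m:
            if m.group(1) not in current:
                return [], list(body)   # value set outside the loop: give up
            rtl = f"PC=c[{current[m.group(1)]}]{m.group(2)};"
        new_body.append(rtl)
    preheader = [f"c[{i}]={ops};" for ops, i in new_reg.items()]
    return preheader, new_body

body = ["c[0]=2,3;", "PC=c[0]==,L6;", "...",
        "c[0]=5,12;", "PC=c[0]!=,L5;", "...", "// br to L2;"]
print(rename_and_hoist(body, num_cregs=2))
```

With two comparison registers available the result corresponds to (c); with only one, the body is returned unchanged.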


◆ Common Subexpression Elimination

IC=r[2]?r[3];
PC=IC<0,L5;
...
IC=r[2]?r[3];
PC=IC>0,L5;

(a) Original Instructions

c[0]=2,3;
PC=c[0]<,L5;
...
c[0]=2,3;
PC=c[0]>,L5;

(b) New Instructions

c[0]=2,3;
PC=c[0]<,L5;
...
PC=c[0]>,L5;

(c) After CSE

• CSE eliminates instructions that compute values already available

• normally, comparison instructions cannot be eliminated

• in contrast, cmpspecs can often be eliminated
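Why cmpspecs are easy CSE candidates can be sketched in a few lines: a cmpspec is redundant whenever its comparison register already holds the same specification, regardless of what has happened to the general-purpose registers in between. The function below is a hypothetical straight-line illustration, assuming elided code ("...") neither writes comparison registers nor contains labels.

```python
import re

SPEC = re.compile(r"c\[(\d+)\]=(\d+,\d+);")

def eliminate_redundant_cmpspecs(rtls):
    """Drop a cmpspec if its register already holds the same operands."""
    held = {}                 # comparison register -> operands it holds
    out = []
    for rtl in rtls:
        m = SPEC.fullmatch(rtl)
        if m:
            creg, ops = m.group(1), m.group(2)
            if held.get(creg) == ops:
                continue      # same specification already in place
            held[creg] = ops
        elif rtl.endswith(":"):
            held.clear()      # label: other paths may arrive here
        out.append(rtl)
    return out

new_instructions = ["c[0]=2,3;", "PC=c[0]<,L5;", "...", "c[0]=2,3;", "PC=c[0]>,L5;"]
print(eliminate_redundant_cmpspecs(new_instructions))
```

Applied to (b), the second cmpspec disappears and the result matches (c).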


◆ CSE – Reversing Conditions

c[2]=2,3;
c[3]=3,2;
L2:
PC=c[2]>,L6;
...
PC=c[3]<,L5;
...
// br to L2;

(a) Original Code

c[2]=2,3;
c[3]=2,3;
L2:
PC=c[2]>,L6;
...
PC=c[3]>,L5;
...
// br to L2;

(b) Reversed Condition

c[2]=2,3;
L2:
PC=c[2]>,L6;
...
PC=c[2]>,L5;
...
// br to L2;

(c) After CSE
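The reversal step relies on the identity that `a < b` holds exactly when `b > a`. One way to expose such duplicates is to put the two register indices of every cmpspec into a canonical order and adjust the operator in the cbranch accordingly. The helper names below are hypothetical, and ascending index order is just one possible convention.

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt, "<=": operator.le,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}
# Operator to use after the two operands have been swapped.
SWAP = {"<": ">", ">": "<", "<=": ">=", ">=": "<=", "==": "==", "!=": "!="}

def canonicalize(first, second, op):
    """Order the register indices ascending, adjusting the operator."""
    if first > second:
        return second, first, SWAP[op]
    return first, second, op

def holds(regs, first, second, op):
    """Evaluate the comparison on a register file given as a dict."""
    return OPS[op](regs[first], regs[second])

print(canonicalize(3, 2, "<"))   # (2, 3, '>')
```

After canonicalization, `c[3]=3,2` with `<` becomes `2,3` with `>`, which is the same specification as `c[2]` and can be removed by CSE.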


◆ CSE – Constant off by one

c[2]=2,0;
c[3]=2,1;
L2:
PC=c[2]#>,L6;
...
PC=c[3]#<,L5;
...
// br to L2;

(a) Original Code

c[2]=2,0;
c[3]=2,0;
L2:
PC=c[2]#>,L6;
...
PC=c[3]#<=,L5;
...
// br to L2;

(b) After Modification

c[2]=2,0;
L2:
PC=c[2]#>,L6;
...
PC=c[2]#<=,L5;
...
// br to L2;

(c) After CSE
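The modification in (b) uses the integer identity that `x < k + 1` holds exactly when `x <= k`, so a constant that is off by one can be matched by changing the relational operator. A minimal sketch follows; the table `SHIFT` and the function `retarget` are hypothetical names, and overflow at the ends of the constant range is ignored.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

# (operator, delta) -> equivalent operator once the constant is k + delta.
SHIFT = {("<", -1): "<=", (">=", -1): ">", ("<=", 1): "<", (">", 1): ">="}

def retarget(op, k, target):
    """Return an operator equivalent to `x op k` that compares against
    `target` instead, or None if no off-by-one rewrite applies."""
    if target == k:
        return op
    return SHIFT.get((op, target - k))

print(retarget("<", 1, 0))   # <=
```

In the slide's example, `#<` against the constant 1 becomes `#<=` against the constant 0, which makes the two cmpspecs identical.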


◆ CSE – Identical Cmpspecs

c[4]=2,1;
PC=c[4]<=,L6;
...
c[4]=2,1;
PC=c[4]#==,L5;
...

(a) Identical Bit Pattern

c[4]=2,1;
PC=c[4]<=,L6;
...
PC=c[4]#==,L5;
...

(b) After CSE


◆ Register Encoding & New Instructions

Comparison Register

Bits 15-12 | Bits 11-4 | Bits 3-0
reg num    | unused    | reg num

Bits 15-12 | Bits 11-0
reg num    | constant

New Instructions

cmpspec <creg>,index1,val; Assigns an index and either a second index or a constant.

cbr <creg><rel op>, <label>; Comparison register contains indices

cbri <creg><rel op>, <label>; Comparison register contains an index and a constant

[l/s]cfd <reg>,{register list}; CISC inst - loads/stores comparison registers from/to the stack
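The two comparison register layouts can be sketched as bit packing on a 16-bit word. The field positions follow the layout above as far as it can be recovered; treating the constant as an unsigned 12-bit value in bits 11-0 is an assumption, and the function names are hypothetical.

```python
def encode_regs(reg_a, reg_b):
    """Register-register form: bits 15-12 and bits 3-0 hold register numbers."""
    if not (0 <= reg_a < 16 and 0 <= reg_b < 16):
        raise ValueError("register numbers must fit in 4 bits")
    return (reg_a << 12) | reg_b

def encode_const(reg, const):
    """Register-constant form: bits 15-12 hold the register number,
    the low bits hold the constant (assumed unsigned, 12 bits)."""
    if not (0 <= reg < 16 and 0 <= const < (1 << 12)):
        raise ValueError("operand out of range")
    return (reg << 12) | const

def decode(word, has_const):
    """Return (register, second operand); the cbr/cbri opcode decides
    whether the low bits are read as a register number or a constant."""
    reg = (word >> 12) & 0xF
    low = word & 0xFFF if has_const else word & 0xF
    return reg, low

print(hex(encode_regs(2, 1)), hex(encode_const(2, 1)))  # 0x2001 0x2001
```

Because the interpretation is chosen by the branch opcode, `c[4]=2,1` yields the same bit pattern whether 1 is a register number or a constant, which is what lets CSE merge the two cmpspecs in the identical bit pattern example.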


◆ Benchmarks Tested

Name          Description
adpcm         adaptive pulse modulation encoder
basicmath     simple math calculations
bitcount      bit manipulations
blowfish      block encryption
crc32         cyclic redundancy check
dijkstra      shortest path problem
fft           fast Fourier transform
ijpeg         image compression
ispell        spell checker
lame          MP3 encoder
patricia      routing using reduced trees
qsort         quick sort of strings
rsynth        text-to-speech analysis
sha           exchange of cryptographic keys
stringsearch  search words
susan         image recognition
tiff          convert a color TIFF image to b/w


◆ Results


◆ Dynamic Micro-Op Counts

• Average savings 5.6%

– Greatest savings came from adpcm at roughly 18%.
– ispell was around 4% worse.

• lack of profile data

– saves and restores of comparison registers
– loop preheader executing more than loop body

• Majority of savings comes from loop-invariant code motion: 5.3%.

• CSE contributes another 0.3%.


◆ Execution Cycles

• Large portion of savings comes from not stalling at the cmpspec: 5.2%.

– Greatest savings came from stringsearch at roughly 18%.
– Loss of roughly 3% with qsort.

• Loop-invariant code motion contributes around 0.9%.

• CSE contributes about 0.1%.


◆ Branch Prediction

• higher misprediction penalty for cbranches (like implicit branches)

• benefits of new instructions outweigh misprediction penalty

• more efficient modern branch predictors can be used


◆ Misprediction Rates

                     bimodal-128  gshare-256  gshare-512  gshare-1024
Micro-ops Reduced    5.6%         5.7%        5.7%        5.8%
Cycles Reduced       5.2%         5.2%        5.4%        6.0%
Misprediction Rate   10%          9.9%        8.1%        6.9%


◆ Future Work

• Profiling could be better used to guide optimizations like loop-invariant code motion.

– cases where the loop preheader is executed more frequently than the loop body

• With better analysis there should be more opportunities for CSE on cmpspecs.

• Implement technique on the Thumb.

• Implement loop unrolling in VPO.


◆ Conclusions

• Contributions

– Specification of the comparison is decoupled from the comparison itself.

– Execution cycles are decreased because the processor does useful work during the cmpspec.

– Optimizations that cannot be applied to traditional comparisons can be applied to cmpspecs.

• Summary

– 5.6% reduction in dynamic instruction counts
– 5.2% reduction in execution cycles


◆ The End

Questions?


◆ Delayed Branches

• One or more instructions following the branch are executed regardless of whether the branch is taken or not taken.

• Compiler needs to fill the delay slots.

• Filled with no-ops if the compiler cannot find a suitable instruction.

• Moving an instruction from before the branch always does useful work.

• Using an instruction from after the branch is trickier.

• In some architectures, instructions in delay slots can be nullified.


◆ Branch Prediction

• Process of deciding which instruction to execute following a branch, before the outcome of the branch is known.

• Branch prediction buffer – low-order bits of the branch instruction's address are used to index into a table.

• Prediction bit is used to predict the outcome of the branch.
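A branch prediction buffer of this kind can be sketched as a small table of single bits. This is a minimal illustration: the class name, the table size, and treating `pc` as an instruction index (rather than a byte address) are assumptions.

```python
class OneBitPredictor:
    """Table of prediction bits indexed by low-order bits of the branch address."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        self.table = [False] * (1 << index_bits)   # False = predict not taken

    def predict(self, pc):
        return self.table[pc & self.mask]

    def update(self, pc, taken):
        # Remember only the most recent outcome for this table entry.
        self.table[pc & self.mask] = taken

p = OneBitPredictor()
p.update(0x40, True)
print(p.predict(0x40), p.predict(0x50))  # True True
```

The second lookup shows aliasing: 0x40 and 0x50 share the same low-order four bits, so they share a prediction bit.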


◆ Correlating or 2-level predictors

• Use the behavior of multiple instances of previous branches to make a prediction.

• Generalized: use the behavior of the last m branches to choose among 2^m predictors, each having n bits.

• GAg, PAg, GAp, PAp, Gshare

– G: Global, P: Per-address (1st level)
– A: Adaptive
– g: global, p: per-address (2nd level)


◆ 2-level Predictors

[Figure: organization of the GAg, GAp, PAg, and PAp predictors; a k-bit shift register indexes a table of 2^k 2-bit counters]
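The simplest of these organizations, GAg, can be sketched as one global k-bit shift register selecting among 2^k saturating 2-bit counters. The class name, the default k, and the initial counter value are assumptions made for illustration.

```python
class GAgPredictor:
    """Global history register indexing a global table of 2-bit counters."""

    def __init__(self, k=4):
        self.k = k
        self.history = 0                    # k-bit global shift register
        self.counters = [1] * (1 << k)      # 0-1 predict not taken, 2-3 taken

    def predict(self):
        return self.counters[self.history] >= 2

    def update(self, taken):
        c = self.counters[self.history]
        self.counters[self.history] = min(c + 1, 3) if taken else max(c - 1, 0)
        # Shift the outcome into the history, keeping only k bits.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.k) - 1)

g = GAgPredictor(k=2)
outcomes = [i % 2 == 0 for i in range(40)]   # taken, not taken, taken, ...
correct = []
for taken in outcomes:
    correct.append(g.predict() == taken)
    g.update(taken)
print(all(correct[20:]))  # True
```

A single 2-bit counter would mispredict an alternating branch about half the time, while the history-indexed table learns the pattern after a short warm-up.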


◆ Gshare

• Most recent branch outcomes are recorded in the BHR (Branch History Register)

• BHR is a single shift register shared by all branches

• BHR is XORed with the branch address to find the entry in the Pattern History Table
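The gshare indexing scheme can be sketched as follows. Using 2-bit saturating counters as Pattern History Table entries is an assumption (the slide does not say what each entry holds), as are the class name and table size.

```python
class Gshare:
    """BHR XORed with the branch address selects a PHT entry."""

    def __init__(self, bits=8):
        self.mask = (1 << bits) - 1
        self.bhr = 0                        # shared branch history register
        self.pht = [1] * (1 << bits)        # assumed 2-bit counters

    def _index(self, pc):
        return (pc ^ self.bhr) & self.mask

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.pht[i] = min(self.pht[i] + 1, 3) if taken else max(self.pht[i] - 1, 0)
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask

g = Gshare(bits=4)
g.bhr = 0b1010
print(bin(g._index(0b0110)))  # 0b1100
```

XORing spreads branches with the same history but different addresses across different table entries, which reduces interference compared with indexing by history alone.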


◆ Tournament Predictors

• Tournament or hybrid predictors combine two or more prediction methods.

• Different methods work better for different branches.

• Array of two-bit saturating counters is used to determine which prediction method to use.

• Each predictor makes a prediction each time.

• McFarling's experiments showed that bimodal and gshare in combination worked better than either separately.
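The selection mechanism can be sketched with a chooser table that is nudged toward whichever component was right when the two disagree. The class names are hypothetical, and the `Static` component is only a stand-in so the example stays self-contained; real designs would plug in bimodal and gshare predictors.

```python
class Tournament:
    """Chooser table of 2-bit counters selecting between two predictors."""

    def __init__(self, first, second, index_bits=4):
        self.first, self.second = first, second
        self.mask = (1 << index_bits) - 1
        self.choice = [1] * (1 << index_bits)   # 0-1 favour first, 2-3 second

    def predict(self, pc):
        use_second = self.choice[pc & self.mask] >= 2
        return (self.second if use_second else self.first).predict(pc)

    def update(self, pc, taken):
        # Both components predict every time; only disagreements move the chooser.
        ok1 = self.first.predict(pc) == taken
        ok2 = self.second.predict(pc) == taken
        i = pc & self.mask
        if ok2 and not ok1:
            self.choice[i] = min(self.choice[i] + 1, 3)
        elif ok1 and not ok2:
            self.choice[i] = max(self.choice[i] - 1, 0)
        self.first.update(pc, taken)
        self.second.update(pc, taken)

class Static:
    """Stand-in component that always predicts the same direction."""
    def __init__(self, direction):
        self.direction = direction
    def predict(self, pc):
        return self.direction
    def update(self, pc, taken):
        pass

t = Tournament(Static(False), Static(True))
for _ in range(3):
    t.update(0, True)    # branch 0 is always taken
    t.update(1, False)   # branch 1 is never taken
print(t.predict(0), t.predict(1))  # True False
```

Each branch ends up using the component that suits it, which is the property the tournament scheme is built around.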


◆ Markov Predictors

• Techniques common in the field of data compression are used in branch prediction.

• Work done by Chen et al. shows that correlating predictors are a simplification of an optimal predictor used in data compression.

• Prediction by Partial Matching.

• Not feasible to build optimal predictors given the current level of technology.


◆ Neural Methods for Branch Predictions

• Simple neural methods are used as alternatives to the commonly used 2-bit counters.

• Perceptron predictors consider longer histories than 2-bit predictors using the same resources.

• Experiments show better results than McFarling-style hybrid predictors.

• Very complex hardware needed, feasibility in question.
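A single perceptron over the global history gives the flavour of the approach. This is a simplified sketch: practical designs keep a table of perceptrons selected by branch address, and the history length and training threshold used here are arbitrary illustrative values.

```python
class Perceptron:
    """One perceptron: prediction is the sign of a weighted sum of history bits."""

    def __init__(self, history_len=8, threshold=16):
        self.weights = [0] * (history_len + 1)   # weights[0] is the bias
        self.history = [-1] * history_len        # +1 taken, -1 not taken
        self.threshold = threshold

    def _output(self):
        return self.weights[0] + sum(
            w * h for w, h in zip(self.weights[1:], self.history))

    def predict(self):
        return self._output() >= 0

    def update(self, taken):
        y = self._output()
        t = 1 if taken else -1
        # Train on a misprediction or when the output is not yet confident.
        if (y >= 0) != taken or abs(y) <= self.threshold:
            self.weights[0] += t
            for i, h in enumerate(self.history, start=1):
                self.weights[i] += t * h
        self.history = self.history[1:] + [t]

p = Perceptron()
results = []
for i in range(240):
    taken = i % 2 == 0                 # alternating branch
    results.append(p.predict() == taken)
    p.update(taken)
print(all(results[200:]))  # True
```

Storage grows linearly with history length (one weight per history bit), whereas a counter table indexed by history grows exponentially, which is why longer histories fit in the same budget.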


◆ Branch Target Buffer

• Delay can occur while calculating the address of the branch target.

• BTB acts as a small cache of branch target addresses.

• If the branch instruction's address is not in the BTB, a prediction of not-taken occurs.
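The lookup behaviour can be sketched as a small tagged cache. The direct-mapped organization, the entry count, and the method names are assumptions made for illustration.

```python
class BranchTargetBuffer:
    """Direct-mapped cache from branch address to branch target address."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries      # each slot: (branch_pc, target)

    def lookup(self, pc, fallthrough):
        """Return the predicted next PC for the instruction at pc."""
        slot = self.table[pc % self.entries]
        if slot is not None and slot[0] == pc:
            return slot[1]                 # hit: use the cached target
        return fallthrough                 # miss: predict not taken

    def record_taken(self, pc, target):
        self.table[pc % self.entries] = (pc, target)

btb = BranchTargetBuffer()
print(btb.lookup(100, fallthrough=101))   # 101
btb.record_taken(100, 40)
print(btb.lookup(100, fallthrough=101))   # 40
```

The stored tag matters: address 116 maps to the same slot as 100 here, and without the tag check it would wrongly inherit the cached target.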


◆ Branch Registers

• Uses traditional registers to hold the branch target address.

• Calculation of the branch target address is separated from the instruction that uses it.

• This new instruction is exposed to other compiler optimizations (e.g., loop-invariant code motion).


◆ Predication

• Conditional execution of an instruction based upon a boolean source operand.

• Predicated instructions are fetched regardless of their predicate value.

• Reduce the number of branches.

• Eliminate frequently mispredicted branches.


◆ Loop Transformations

• Loop Unrolling

– replicate loop body n times
– reduce overhead of loop, reduce number of branches

• Loop Unswitching

– applied to loops that have a branch with an invariant condition
– loop is replicated inside forks of the branch
– reduce loop overhead and enable parallelization
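Both transformations are easiest to see at source level. The functions below are hypothetical examples written by hand, not compiler output; an unroll factor of 4 is an arbitrary choice.

```python
def scale_original(values, double):
    out = []
    for v in values:
        if double:                  # invariant condition tested every iteration
            out.append(v * 2)
        else:
            out.append(v)
    return out

def scale_unswitched(values, double):
    out = []
    if double:                      # tested once; loop replicated in each fork
        for v in values:
            out.append(v * 2)
    else:
        for v in values:
            out.append(v)
    return out

def sum_unrolled(values):
    """Sum with the loop body replicated 4 times, plus a cleanup loop."""
    total, i, n = 0, 0, len(values)
    while i + 4 <= n:               # one loop branch per four elements
        total += values[i] + values[i + 1] + values[i + 2] + values[i + 3]
        i += 4
    while i < n:                    # leftover elements
        total += values[i]
        i += 1
    return total

data = list(range(10))
print(scale_unswitched(data, True) == scale_original(data, True))  # True
print(sum_unrolled(data))  # 45
```

In both cases the number of conditional branches executed per element drops, at the cost of larger code.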


◆ Avoiding Conditional Branches

• compiler tries to determine if branches can be avoided

• find path from point after a conditional branch, back to branch where comparison is not affected.

• intraprocedurally and interprocedurally
