IA-64 Register Model: Stack & Rotation
Dale Morris, Architect
Hewlett Packard Co.
Philosophy
Large register files
– Most processors have lots of registers
Explicit control over register-renaming
– Most processors have register renaming
IA-64 makes the register names SW-visible & makes the renaming explicit
Outline
Register Stack
– Register Stack Engine
Register Rotation
– Loop Branches
– Modulo-Scheduling of Loops
Summary
Register Stack
Motivation:
– Automatic save/restore of GRs on procedure call/return
– Cache traffic reduction
– Latency hiding of register spill/fill
GR Stack Frame
[Figure: the general register file. r0-r31 are static. The stacked registers r32-r127 hold the current frame: locals (which include the inputs) followed by outputs; stacked registers beyond the frame are illegal. The Current Frame Marker (CFM) holds two fields: size of frame (sof) and size of locals (sol).]
GR Stack Frame - Example
[Figure: a frame with CFM sof=21, sol=14. The locals start at r32, the outputs start at r46 and the frame ends at r52.]
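As a hedged illustration of how sof and sol partition the stacked registers, the Python sketch below derives the register ranges of the example frame. The function name `frame_layout` and the constants are illustrative only; the facts taken from the slides are that stacked registers run from r32 to r127 and that locals precede outputs.

```python
FIRST_STACKED = 32   # first stacked general register (r32)
LAST_STACKED = 127   # last general register (r127)


def frame_layout(sof, sol):
    """Return (locals, outputs) as ranges of register numbers.

    sof is the size of the frame and sol the size of the locals,
    as held in the Current Frame Marker.
    """
    max_frame = LAST_STACKED - FIRST_STACKED + 1
    if not 0 <= sol <= sof <= max_frame:
        raise ValueError("need 0 <= sol <= sof <= %d" % max_frame)
    locals_ = range(FIRST_STACKED, FIRST_STACKED + sol)
    outputs = range(FIRST_STACKED + sol, FIRST_STACKED + sof)
    return locals_, outputs


if __name__ == "__main__":
    loc, out = frame_layout(sof=21, sol=14)
    print("locals  r%d-r%d" % (loc[0], loc[-1]))
    print("outputs r%d-r%d" % (out[0], out[-1]))
```

With sof=21 and sol=14 the outputs come out as r46 through r52, which matches the numbers on the slide.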
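GR Stack Frame - Call
[Figure: before the call, the caller has CFM sof=21, sol=14 (locals from r32, outputs r46-r52) and the PFM is undefined (xx). After the call, the caller's CFM has been copied to the PFM (sof=21, sol=14) and the new CFM is sof=7, sol=0: the callee's frame is the seven output registers, now named r32-r38.]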
GR Stack Frame - Allocate
[Figure: continuing the call sequence, the callee executes alloc. Its CFM changes from sof=7, sol=0 to sof=19, sol=16, while the PFM stays at sof=21, sol=14. The resized frame has locals from r32, which include the inputs, and outputs r48-r50.]
GR Stack Frame - Return
[Figure: the full call, alloc, return sequence. On return, the CFM is restored from the PFM to sof=21, sol=14, so the caller again sees locals from r32 and outputs r46-r52; the PFM still reads sof=21, sol=14.]
Instructions
br.call
– Copies CFM to PFM
– Creates new frame with only output regs
– Saves local regs from previous frame
alloc
– Resizes current frame
– Saves PFM to a GR
Instructions (cont.)
mov to PFS
– Restores PFM from a GR
br.ret
– Restores CFM from PFM
– Restores local regs for previous frame
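The frame-marker bookkeeping described for br.call, alloc, mov to PFS and br.ret can be sketched as a small Python model. This is a simplified illustration, not the architecture's definition: the class and method names are hypothetical, the `bottom` field is an invented stand-in for the physical position of r32, and the split of the 16 locals into 7 inputs plus 9 locals is an assumption made only so the numbers match the call/alloc/return figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameMarker:
    sof: int  # size of frame
    sol: int  # size of locals


class RegisterStackModel:
    """Toy model of CFM/PFM updates; names are illustrative."""

    def __init__(self, sof, sol):
        self.cfm = FrameMarker(sof, sol)
        self.pfm = None      # undefined ("xx") until the first call
        self.bottom = 0      # hypothetical physical offset of r32

    def br_call(self):
        # Copy CFM to PFM; the new frame holds only the caller's outputs.
        self.pfm = self.cfm
        self.bottom += self.cfm.sol
        self.cfm = FrameMarker(self.cfm.sof - self.cfm.sol, 0)

    def alloc(self, inputs, locals_, outputs):
        # Resize the current frame and hand back the PFM to keep in a GR.
        saved_pfm = self.pfm
        self.cfm = FrameMarker(inputs + locals_ + outputs, inputs + locals_)
        return saved_pfm

    def mov_to_pfs(self, saved_pfm):
        # Restore the PFM from the value kept in a GR.
        self.pfm = saved_pfm

    def br_ret(self):
        # Restore CFM from PFM and re-expose the caller's locals.
        self.cfm = self.pfm
        self.bottom -= self.cfm.sol


if __name__ == "__main__":
    rs = RegisterStackModel(sof=21, sol=14)
    rs.br_call()
    print("after call :", rs.cfm, rs.pfm)
    saved = rs.alloc(inputs=7, locals_=9, outputs=3)
    print("after alloc:", rs.cfm)
    rs.mov_to_pfs(saved)
    rs.br_ret()
    print("after ret  :", rs.cfm)
```

The model ignores the RSE and nested saves of the PFM; it only tracks how the two markers change across one call.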
Leaf Procedure Optimization
No need to save/restore PFM
Can always use scratch static GRs
Can omit alloc if:
– Not many registers needed
– Register rotation not needed
Register Stack Engine
Automatically spills/fills registers to/from memory as needed
Registers saved on a Backing Store Stack
Spills/fills NaT bits as well
Reg Stack & Backing Store
[Figure: A calls B calls C. The physical stacked registers hold procA's locals (sol_a), procB's locals (sol_b) and procC's current frame (sof_c), with unallocated registers around them. The Backing Store holds procA's ancestors, procA and procB. Calls and returns move the current frame through the physical stacked registers, and the RSE performs the loads/stores between the registers and the Backing Store.]
Register Stack: Summary
Exposes register renaming to SW
Avoids register spill when few needed
Hides register spill/fill
Programmable sizes
– only use as many registers as you need
Register Rotation
Motivation:
– pipeline-schedule loops onto HW
– remove extraneous work from loop
– minimize start-up overhead
– small code footprint
– maximum computational throughput with few instructions
GR Stack Frame w/ Rotation
[Figure: the general register file with r0-r31 static and the stacked frame of locals and outputs (sof registers in total, sol of them locals) starting at r32, with a rotating region of size sor (Size of Rotating) marked within the frame. The Current Frame Marker (CFM) fields shown are sof, sol, sor, rrb.gr, rrb.fr and rrb.pr.]
GR Rotation
Size of rotating region is a multiple of 8
Rotating region overlays current frame
– Starts at r32
– Overlay allows rotation & stack renaming in a single level of adders
– Must copy input registers before loop
FR Rotation
[Figure: the floating-point register file. Registers 0-31 are static and registers 32-127 rotate.]
Upper 3/4 of register file rotates
Predicate Rotation
[Figure: the predicate register file. Registers 0-15 are static and registers 16-63 rotate.]
Upper 3/4 of register file rotates
Register Rotation & RRB
Separate Rotating Register Base for each: GRs, FRs, PRs
Loop branches decrement all register rotating bases (RRB)
Instructions contain a “virtual” register number
– RRB + virtual register number = physical register number.
[Animation, five frames: a loop on an IA-64 processor loads the words "Palm", "Springs", "is", "Sunny" from memory one per iteration (ld) and stores each of them one iteration later (st). Each frame shows the RRB and the labels on the displayed window of registers.]
Frame 1: RRB=0; labels 32-36; ld1 R35
Frame 2: RRB=0; labels 32-36; ld2 R34, st1 R35
Frame 3: RRB=-1; labels 127, 32-35; ld3 R34, st2 R35
Frame 4: RRB=-2; labels 126, 127, 32-34; ld4 R34, st3 R35
Frame 5: RRB=-3; labels 125-127, 32-33; st4 R35
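The rule "RRB + virtual register number = physical register number" needs a wrap-around at the edge of the rotating region, which is why register 127 appears next to register 32 in the animation. The Python sketch below shows one way to write that mapping. The function name and its default parameters are assumptions for illustration: the defaults model a 96-register rotating region starting at 32, and the predicate case passes 16 and 48 instead; for GRs the region size would be sor rather than a fixed 96.

```python
def physical_register(virtual, rrb, first_rotating=32, size=96):
    """Map a virtual register number to a physical one.

    Registers below first_rotating are static and are not renamed.
    Inside the rotating region the RRB is added modulo the region size.
    """
    if not first_rotating <= virtual < first_rotating + size:
        return virtual
    offset = (virtual - first_rotating + rrb) % size
    return first_rotating + offset


if __name__ == "__main__":
    for rrb in (0, -1, -2, -3):
        window = [physical_register(v, rrb) for v in range(32, 37)]
        print("RRB=%2d  virtual r32..r36 -> physical %s" % (rrb, window))
```

Because each loop branch decrements the RRB, a value written through virtual register N in one iteration is read through virtual register N+1 in the next, which is what lets a store in a later iteration pick up the result of an earlier load without a copy.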
Loop Branches
br.cloop uses LC for simple, non-pipelined loops
– decrements LC and loops until LC is 0
br.ctop uses LC and EC for pipelined counted loops
br.wtop uses branch predicate and EC for pipelined “while” loops
br.cexit, br.wexit used for unrolled, pipelined loops
br.ctop
Function (simplified):
  if (LC>0)      {LC--; pr[63]=1; rrb--; loop;}
  else if (EC>1) {EC--; pr[63]=0; rrb--; loop;}
  else           {EC--; pr[63]=0; rrb--; fall_through;}
LC counts main loop iterations
EC counts pipeline stages for drain
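To see how LC and EC together determine how often the loop body runs, the simplified br.ctop function can be stepped in Python. This sketch follows only the three-way rule given on the slide and assumes EC starts at 1 or more; the function name and the returned tuple are illustrative choices.

```python
def run_ctop_loop(lc, ec):
    """Step the simplified br.ctop rule until the branch falls through.

    Returns (passes, p63_writes, rrb): how many times the loop body
    ran, the values written to pr[63] by each branch, and the final RRB.
    """
    passes = 0
    rrb = 0
    p63_writes = []
    while True:
        passes += 1              # one execution of the loop body
        if lc > 0:
            lc -= 1
            p63_writes.append(1)
            taken = True
        elif ec > 1:
            ec -= 1
            p63_writes.append(0)
            taken = True
        else:
            ec -= 1
            p63_writes.append(0)
            taken = False
        rrb -= 1                 # every case rotates the registers
        if not taken:
            return passes, p63_writes, rrb


if __name__ == "__main__":
    passes, writes, rrb = run_ctop_loop(lc=3, ec=4)
    print(passes, writes, rrb)
```

Under this rule the body runs (LC + 1) + (EC - 1) times: LC + 1 passes that start a new source iteration, followed by EC - 1 passes that only drain the pipeline.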
Software Pipelining
Overlapping execution of different loop iterations
[Figure: loop iterations executed one after another vs. overlapped iterations, which give more iterations in the same amount of time.]
Especially Useful for Integer Code With Small Number of Loop Iterations
Software Pipelining
Traditional architectures use loop unrolling
– High overhead: extra code for loop body, prologue, and epilogue
Synergistic use of IA-64 features:
– Full Predication
– Special branches
– Register rotation: removes loop copy overhead
– Predicate rotation: removes prologue & epilogue
Pipelined Loop Example
DAXPY inner loop
– dy[i] = dy[i] + (da * dx[i])
– 2 loads, 1 fma, 1 store / iteration
Machine assumptions
– can do 2 loads, 1 store, 1 fma, 1 br / cycle
– load latency of 2 clocks
– fma latency of 1 clock
Example: Pipeline
Each column represents 1 source iteration
[Figure: pipeline diagram with three operations per source iteration: load dx,dy; tmp = dy + da * dx; store dy.]
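Example Code
        .rotf dx[3], dy[3], tmp[2]       // name the rotating FRs
        mov ar.lc = 3                    // #iterations-1
        mov ar.ec = 4                    // #stages
        mov pr.rot = 0x10000             // set p16, clear the other rotating predicates
        ;;
looptop:
  (p16) ldfd dx[0] = [dxsp],8            // stage 1: load dx, post-increment by 8
  (p16) ldfd dy[0] = [dysp],8            // stage 1: load dy, post-increment by 8
  (p18) fma.d tmp[0] = da, dx[2], dy[2]  // stage 3: tmp = da * dx + dy
  (p19) stfd [dydp] = tmp[1],8           // stage 4: store dy, post-increment by 8
        br.ctop looptop
        ;;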
Loop Execution
[Animation: each frame shows the loop body, (p16) ldx, (p16) ldy, (p18) fma, (p19) st, repeated once per pass executed so far, next to the displayed rotating predicates and the RRB, LC and EC values.]
Execution Sequence
Step            RRB  LC  EC  Predicates shown after the step
Initialization    0   3   4  p63:0 p16:1 p17:0 p18:0 p19:0
Branch 1         -1   2   4  p62:0 p63:1 p16:1 p17:0 p18:0
Branch 2         -2   1   4  p61:0 p62:1 p63:1 p16:1 p17:0
Branch 3         -3   0   4  p60:0 p61:1 p62:1 p63:1 p16:1
Branch 4         -4   0   3  p59:0 p60:0 p61:1 p62:1 p63:1
Branch 5         -5   0   2  p58:0 p59:0 p60:0 p61:1 p62:1
Branch 6         -6   0   1  p57:0 p58:0 p59:0 p60:0 p61:1
Branch 7         -7   0   0  p56:0 p57:0 p58:0 p59:0 p60:0  (fall through)
Branches 1-3 write 1 to (p63); branches 4-7 write 0.
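The execution sequence above can be reproduced with a small Python simulation of the four-stage DAXPY loop. This is a sketch under stated assumptions rather than a model of real hardware: the predicates and floating-point registers are kept in virtual order and shifted by one on every branch (equivalent to decrementing the RRB), the offsets `DX`, `DY` and `TMP` are a hypothetical placement for the names declared by `.rotf`, and memory is modelled as plain Python lists.

```python
def pipelined_daxpy(da, dx, dy):
    """Simulate the software-pipelined loop; return (new_dy, trace)."""
    n = len(dx)
    if n == 0:
        return [], []
    lc, ec = n - 1, 4            # mov ar.lc / mov ar.ec
    preds = [0] * 48             # preds[k] models virtual predicate p(16+k)
    preds[0] = 1                 # mov pr.rot = 0x10000 sets p16
    fr = [0.0] * 96              # fr[k] models a virtual rotating FR
    DX, DY, TMP = 0, 3, 6        # assumed layout of dx[3], dy[3], tmp[2]
    out = list(dy)
    load_i = store_i = 0
    trace = []
    while True:
        ops = []
        if preds[0]:             # (p16) ldfd dx[0], ldfd dy[0]
            fr[DX + 0] = dx[load_i]
            fr[DY + 0] = dy[load_i]
            load_i += 1
            ops.append("ld")
        if preds[2]:             # (p18) fma.d tmp[0] = da, dx[2], dy[2]
            fr[TMP + 0] = da * fr[DX + 2] + fr[DY + 2]
            ops.append("fma")
        if preds[3]:             # (p19) stfd [dydp] = tmp[1]
            out[store_i] = fr[TMP + 1]
            store_i += 1
            ops.append("st")
        trace.append(ops)
        # br.ctop (simplified rule from the slides)
        if lc > 0:
            lc -= 1
            p63, taken = 1, True
        elif ec > 1:
            ec -= 1
            p63, taken = 0, True
        else:
            ec -= 1
            p63, taken = 0, False
        preds[47] = p63                   # write virtual p63 ...
        preds = [preds[-1]] + preds[:-1]  # ... then rotate: p63 becomes p16
        fr = [fr[-1]] + fr[:-1]           # f(k) becomes f(k+1)
        if not taken:
            return out, trace


if __name__ == "__main__":
    result, trace = pipelined_daxpy(2.0, [1.0, 2.0, 3.0, 4.0],
                                    [10.0, 20.0, 30.0, 40.0])
    for step, ops in enumerate(trace, 1):
        print("pass %d: %s" % (step, " ".join(ops)))
    print(result)
```

For four source iterations the body runs seven times: the first passes only load, the middle passes run all stages, and the last passes only drain the fma and store stages, with no separate prologue or epilogue code.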
Pipelining & Latency
Suppose we change the latencies
– load latency of 6 clocks
– fma latency of 4 clocks
Example: New Pipeline
Each column represents 1 source iteration
[Figure: pipeline diagram with three operations per source iteration: load dx,dy; tmp = dy + da * dx; store dy.]
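Updated Loop
        .rotf dx[7], dy[7], tmp[5]       // more rotating FRs to cover the longer latencies
        mov ar.lc = 3                    // #iterations-1
        mov ar.ec = 11                   // #stages
        mov pr.rot = 0x10000             // set p16, clear the other rotating predicates
        ;;
looptop:
  (p16) ldfd dx[0] = [dxsp],8            // stage 1: load dx
  (p16) ldfd dy[0] = [dysp],8            // stage 1: load dy
  (p22) fma.d tmp[0] = da, dx[6], dy[6]  // stage 7: 6 stages after the loads
  (p26) stfd [dydp] = tmp[4],8           // stage 11: 4 stages after the fma
        br.ctop looptop
        ;;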
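Comparing the two versions of the loop, only the stage count, the qualifying predicates and the rotating register counts change with the latencies. The Python sketch below derives those numbers, assuming one new source iteration starts every cycle so that each clock of latency becomes one pipeline stage. The function and key names are illustrative.

```python
FIRST_ROTATING_PREDICATE = 16


def pipeline_parameters(load_latency, fma_latency):
    """Derive loop setup values from the two latencies (in clocks)."""
    load_stage = 0
    fma_stage = load_stage + load_latency    # fma waits for the loads
    store_stage = fma_stage + fma_latency    # store waits for the fma
    return {
        "ec": store_stage + 1,                            # mov ar.ec
        "fma_pred": FIRST_ROTATING_PREDICATE + fma_stage,
        "store_pred": FIRST_ROTATING_PREDICATE + store_stage,
        "rot_dx": load_latency + 1,                       # dx[...] and dy[...]
        "rot_tmp": fma_latency + 1,                       # tmp[...]
        "dx_index": load_latency,                         # operand dx[k] of the fma
        "tmp_index": fma_latency,                         # operand tmp[k] of the store
    }


if __name__ == "__main__":
    for latencies in ((2, 1), (6, 4)):
        print(latencies, pipeline_parameters(*latencies))
```

The loop body itself stays at five instructions in both cases; the longer latencies cost only more stages and more rotating registers.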
Rotation: Summary
Loop pipelining maximizes performance; minimizes overhead
– Avoids code expansion of unrolling and code explosion of prologue and epilogue
– Smaller code means fewer cache misses
– Greater performance improvements in higher latency conditions
Reduced overhead allows S/W pipelining of small loops with unknown trip counts
– Typical of integer scalar codes
Register Model Summary
GR Stack
– Overlap call/ret operations with real work
– RSE hides spills/fills
GR, FR, PR Rotation
– General acceleration for all types of loops
SW-visible resources
– Large named register files & renaming
HW simplicity and explicit control