IA-64 Register Model: Stack & Rotation
Dale Morris, Architect
Hewlett Packard Co.

Date post: 18-Dec-2015

Philosophy

Large register files
– Most processors have lots of registers

Explicit control over register renaming
– Most processors have register renaming

IA-64 makes the register names SW-visible and makes the renaming explicit

Outline

Register Stack
– Register Stack Engine

Register Rotation
– Loop Branches
– Modulo-Scheduling of Loops

Summary

Register Stack

Motivation:
– Automatic save/restore of GRs on procedure call/return
– Cache traffic reduction
– Latency hiding of register spill/fill

General Registers

[Figure: the general register file. GR0-GR31 are static; GR32-GR127 are stacked.]

GR Stack Frame

[Figure: GR0-GR31 are static. The stack frame starts at GR32 and holds the locals (including the inputs) followed by the outputs; stacked registers beyond the frame are illegal. The Current Frame Marker (CFM) holds two fields: size of frame (sof) and size of locals (sol).]

GR Stack Frame - Example

[Figure: example frame with CFM sol = 14, sof = 21. Locals (loc) start at r32; outputs (out) run from r46 through r52.]
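The frame layout above can be sketched as a small classifier. This is only an illustration of how sol and sof partition the register names; the function name `classify_gr` and the string labels are hypothetical, not part of the architecture.

```python
FIRST_STACKED = 32   # GR0-GR31 are static, per the slides
LAST_GR = 127


def classify_gr(reg, sol, sof):
    """Classify general register number `reg` under a CFM with the given sol/sof.

    Illustrative helper only: locals occupy the first `sol` stacked names,
    outputs the next `sof - sol`, and anything past the frame is illegal.
    """
    if not 0 <= reg <= LAST_GR:
        raise ValueError("general registers are numbered 0..127")
    if reg < FIRST_STACKED:
        return "static"
    offset = reg - FIRST_STACKED
    if offset < sol:
        return "local"
    if offset < sof:
        return "output"
    return "illegal"
```

With the example values (sol = 14, sof = 21), r45 is the last local, r46 through r52 are outputs, and r53 is outside the frame.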

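GR Stack Frame - Call

[Figure: effect of call. Before: the caller's frame has loc from r32 and out r46-r52; CFM sol = 14, sof = 21; PFM = x, x. After call: the callee's frame is only the caller's outputs, renamed to r32-r38 (out); CFM sol = 0, sof = 7; PFM sol = 14, sof = 21.]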

GR Stack Frame - Allocate

[Figure: effect of alloc in the callee. CFM changes from sol = 0, sof = 7 to sol = 16, sof = 19; PFM stays at sol = 14, sof = 21. The resized frame has loc from r32 (the inputs come first) and out r48-r50.]

GR Stack Frame - Return

[Figure: effect of return. CFM is restored from PFM to sol = 14, sof = 21, and the caller's frame is visible again: loc from r32, out r46-r52.]

Instructions

br.call
– Copies CFM to PFM
– Creates new frame with only output regs
– Saves local regs from previous frame

alloc
– Resizes current frame
– Saves PFM to a GR

Instructions (cont.)

mov to PFS
– Restores PFM from a GR

br.ret
– Restores CFM from PFM
– Restores local regs for previous frame
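The frame-marker bookkeeping described for br.call, alloc, mov to PFS and br.ret can be sketched as a tiny state machine. This is a minimal model that tracks only sol and sof, not register contents; the class and method names are hypothetical, and the split of the callee's 16 locals into 7 inputs plus 9 other locals is an assumption made to reproduce the slide numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameMarker:
    sol: int  # size of locals
    sof: int  # size of frame


class FrameState:
    """Tracks CFM and PFM only (hypothetical model, no register values)."""

    def __init__(self, sol, sof):
        self.cfm = FrameMarker(sol, sof)
        self.pfm = None  # "x, x" on the slides: not yet meaningful

    def br_call(self):
        # Copy CFM to PFM; the new frame is just the caller's outputs.
        self.pfm = self.cfm
        self.cfm = FrameMarker(0, self.cfm.sof - self.cfm.sol)

    def alloc(self, inputs, locals_, outputs):
        # Resize the current frame and hand back the old PFM,
        # as alloc saves PFM to a GR.
        saved_pfm = self.pfm
        sol = inputs + locals_
        self.cfm = FrameMarker(sol, sol + outputs)
        return saved_pfm

    def mov_to_pfs(self, saved_pfm):
        # Restore PFM from the value saved by alloc.
        self.pfm = saved_pfm

    def br_ret(self):
        # Restore CFM from PFM.
        self.cfm = self.pfm


state = FrameState(sol=14, sof=21)   # caller from the example slide
state.br_call()
after_call = state.cfm               # sol=0, sof=7
saved = state.alloc(inputs=7, locals_=9, outputs=3)
after_alloc = state.cfm              # sol=16, sof=19
state.mov_to_pfs(saved)
state.br_ret()
after_ret = state.cfm                # sol=14, sof=21
```

The three snapshots correspond to the Call, Allocate and Return slides.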

Leaf Procedure Optimization

No need to save/restore PFM
Can always use scratch static GRs
Can omit alloc if:
– Not many registers needed
– Register rotation not needed

Register Stack Engine (RSE)

Automatically spills/fills registers to/from memory as needed
Registers saved on a Backing Store Stack
Spills/fills NaT bits as well

Reg Stack & Backing Store

[Figure: A calls B calls C. The physical stacked registers hold procA (sol_a), procB (sol_b) and the current frame procC (sof_c), with unallocated registers on either side; call and return move the current frame in opposite directions. The Backing Store is labeled with procA's ancestors, procA and procB. RSE loads/stores move registers between the physical stacked registers and the Backing Store.]
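As a rough illustration of the spill/fill idea in the figure, the following toy model counts registers only. It assumes 96 physical stacked registers, spills the oldest ancestor registers when a new frame does not fit, and fills them back on return. The class name, its fields and the purely on-demand policy are assumptions for illustration; the actual RSE is free to spill and fill at other times and also handles NaT bits.

```python
class RegisterStackModel:
    """Toy count-based model of the register stack and backing store."""

    def __init__(self, physical=96):
        self.physical = physical      # assumed number of physical stacked regs
        self.sol = 0
        self.sof = 0
        self.saved = []               # (sol, sof) of each caller
        self.resident_ancestors = 0   # ancestor locals still in registers
        self.backing_store = 0        # ancestor locals spilled to memory
        self.spills = 0
        self.fills = 0

    def alloc(self, sol, sof):
        if not 0 <= sol <= sof <= self.physical:
            raise ValueError("invalid frame size")
        self.sol, self.sof = sol, sof
        overflow = self.resident_ancestors + sof - self.physical
        if overflow > 0:
            # Spill the oldest ancestor registers to make room.
            self.resident_ancestors -= overflow
            self.backing_store += overflow
            self.spills += overflow

    def call(self):
        # Caller's locals become ancestor state; outputs form the new frame.
        self.saved.append((self.sol, self.sof))
        self.resident_ancestors += self.sol
        self.sof -= self.sol
        self.sol = 0

    def ret(self):
        sol, sof = self.saved.pop()
        missing = sol - self.resident_ancestors
        if missing > 0:
            # Some of the caller's locals were spilled: fill them back.
            self.backing_store -= missing
            self.resident_ancestors += missing
            self.fills += missing
        self.resident_ancestors -= sol
        self.sol, self.sof = sol, sof
```

For example, if procA, procB and procC each allocate a 48-register frame with 40 locals, the third alloc needs 80 + 48 = 128 registers, so the model spills 32 of procA's registers; they are filled back only when procB returns to procA.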

Register Stack: Summary

Exposes register renaming to SW
Avoids register spill when few needed
Hides register spill/fill
Programmable sizes
– only use as many registers as you need

Register Rotation

Motivation:
– pipeline-schedule loops onto HW
– remove extraneous work from loop
– minimize start-up overhead
– small code footprint
– maximum computational throughput with few instructions

GR Stack Frame w/ Rotation

[Figure: GR0-GR31 are static; the frame (locals, then outputs, sof registers in total) starts at GR32. The Current Frame Marker (CFM) fields shown are sof, sol, sor (size of rotating), rrb.gr, rrb.fr and rrb.pr.]

GR Rotation

Size of rotating region is a multiple of 8
Rotating region overlays current frame
– Starts at r32
– Overlay allows rotation & stack renaming in a single level of adders
– Must copy input registers before loop

FR Rotation

[Figure: FR0-FR31 are static; FR32-FR127 rotate.]

Upper 3/4 of register file rotates

Predicate Rotation

[Figure: PR0-PR15 are static; PR16-PR63 rotate.]

Upper 3/4 of register file rotates

Register Rotation & RRB

Separate Rotating Register Base for each: GRs, FRs, PRs
Loop branches decrement all register rotating bases (RRB)
Instructions contain a "virtual" register number
– RRB + virtual register number = physical register number.

[Animation: the words "Palm", "Springs", "is", "Sunny" are loaded one per step and stored one step later, while the register labels shift as RRB is decremented.]

Slide  RRB  Instructions shown   Register labels shown
1       0   ld1 R35              35, 34, 33, 32, 36
2       0   ld2 R34, st1 R35     35, 34, 33, 32, 36
3      -1   ld3 R34, st2 R35     34, 33, 32, 127, 35
4      -2   ld4 R34, st3 R35     33, 32, 127, 126, 34
5      -3   st4 R35              32, 127, 126, 125, 33
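The renaming rule on these slides (RRB + virtual register number = physical register number, with wraparound inside the rotating region) can be sketched as follows. The rotating region r32-r127 matches what the animation draws; the function names and the dictionary used as a register file are assumptions for illustration.

```python
ROT_BASE = 32   # first rotating register, per the slides
ROT_SIZE = 96   # assumed region r32-r127, as drawn in the animation


def physical(virtual, rrb, base=ROT_BASE, size=ROT_SIZE):
    """Map a virtual register number to a physical one under a given RRB."""
    if virtual < base:
        return virtual  # static registers are not renamed
    return base + (virtual - base + rrb) % size


def load_then_store(words):
    """Each pass loads into R34 and stores from R35, then rotates.

    The value written as R34 in one pass is visible as R35 in the next,
    so no register-to-register copy is needed.
    """
    regs = {}
    stored = []
    rrb = 0
    for i in range(len(words) + 1):
        if i > 0:                       # st R35: value loaded one pass ago
            stored.append(regs[physical(35, rrb)])
        if i < len(words):              # ld R34: next word
            regs[physical(34, rrb)] = words[i]
        rrb -= 1                        # the loop branch decrements RRB
    return stored


result = load_then_store(["Palm", "Springs", "is", "Sunny"])
```

Here `physical(32, -1)` wraps to 127, which is why register 127 appears next to 32 once RRB becomes -1.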

Loop Branches

br.cloop uses LC for simple, non-pipelined loops
– decrements LC and loops until LC is 0

br.ctop uses LC and EC for pipelined counted loops

br.wtop uses branch predicate and EC for pipelined "while" loops

br.cexit, br.wexit used for unrolled, pipelined loops

br.ctop

Function (simplified):

    if (LC > 0)      {LC--; pr[63]=1; rrb--; loop;}
    else if (EC > 1) {EC--; pr[63]=0; rrb--; loop;}
    else             {EC--; pr[63]=0; rrb--; fall_through;}

LC counts main loop iterations
EC counts pipeline stages for drain
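The simplified br.ctop function above translates directly into a few lines of Python. The sketch below follows the slide's pseudo-code only (it does not model cases the slide leaves out, such as EC starting at 0); the names `br_ctop` and `trace` are hypothetical.

```python
def br_ctop(lc, ec, rrb):
    """One execution of the simplified br.ctop from the slide.

    Returns (lc, ec, rrb, p63, taken), where p63 is the value written to
    pr[63] before the rotation and taken says whether the branch loops.
    """
    if lc > 0:
        return lc - 1, ec, rrb - 1, 1, True
    if ec > 1:
        return lc, ec - 1, rrb - 1, 0, True
    return lc, ec - 1, rrb - 1, 0, False


def trace(lc, ec):
    """Run br.ctop until it falls through; record (rrb, lc, ec, p63)."""
    rrb = 0
    rows = []
    while True:
        lc, ec, rrb, p63, taken = br_ctop(lc, ec, rrb)
        rows.append((rrb, lc, ec, p63))
        if not taken:
            return rows


rows = trace(lc=3, ec=4)
```

Starting from LC = 3 and EC = 4, the branch executes seven times: three times in the LC phase, three times draining with EC, and once more to fall through with RRB = -7, LC = 0, EC = 0.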

Software Pipelining

Overlapping execution of different loop iterations

[Figure: sequential loop execution vs. software-pipelined execution.]

More iterations in same amount of time

Especially useful for integer code with small number of loop iterations

Software Pipelining

Traditional architectures use loop unrolling
– High overhead: extra code for loop body, prologue, and epilogue

Synergistic use of IA-64 features:
– Full predication
– Special branches
– Register rotation: removes loop copy overhead
– Predicate rotation: removes prologue & epilogue

Pipelined Loop Example

DAXPY inner loop
– dy[i] = dy[i] + (da * dx[i])
– 2 loads, 1 fma, 1 store / iteration

Machine assumptions
– can do 2 loads, 1 store, 1 fma, 1 br / cycle
– load latency of 2 clocks
– fma latency of 1 clock

Example: Pipeline

Each column represents 1 source iteration

[Figure: pipeline diagram whose stages are "load dx,dy", "tmp = dy + da * dx" and "store dy".]

Example Code

        .rotf dx[3], dy[3], tmp[2]      // rotating FR names
        mov ar.lc = 3                   // #iterations-1
        mov ar.ec = 4                   // #stages
        mov pr.rot = 0x10000            // p16 = 1, other rotating predicates = 0
        ;;
    looptop:
        (p16) ldfd dx[0] = [dxsp],8     // stage 1: load dx[i]
        (p16) ldfd dy[0] = [dysp],8     // stage 1: load dy[i]
        (p18) fma.d tmp[0] = da, dx[2], dy[2]   // stage 3: da * dx + dy
        (p19) stfd [dydp] = tmp[1],8    // stage 4: store the result
        br.ctop looptop
        ;;
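To see that the rotating names and stage predicates in this loop really compute DAXPY, here is a small Python simulation. It is a sketch under assumptions: one pass of the loop body per step, the simplified br.ctop shown earlier, rotating regions f32-f127 and p16-p63, and dx, dy, tmp placed at f32, f35 and f38 (the placement is a guess at what `.rotf` assigns). The names `RotatingFile` and `pipelined_daxpy` are hypothetical.

```python
class RotatingFile:
    """Registers [base, base + size) addressed through a rotating base."""

    def __init__(self, base, size, fill):
        self.base = base
        self.size = size
        self.regs = [fill] * size
        self.rrb = 0

    def _index(self, virtual):
        return (virtual - self.base + self.rrb) % self.size

    def read(self, virtual):
        return self.regs[self._index(virtual)]

    def write(self, virtual, value):
        self.regs[self._index(virtual)] = value


def pipelined_daxpy(da, dx, dy):
    """Simulate the slide's loop; returns (updated dy, number of passes)."""
    n = len(dx)
    if n < 1 or len(dy) != n:
        raise ValueError("dx and dy must be non-empty and equally long")
    fr = RotatingFile(32, 96, 0.0)      # f32-f127 rotate
    pr = RotatingFile(16, 48, 0)        # p16-p63 rotate
    DX, DY, TMP = 32, 35, 38            # assumed .rotf dx[3], dy[3], tmp[2]
    lc, ec = n - 1, 4                   # mov ar.lc, mov ar.ec
    pr.write(16, 1)                     # mov pr.rot = 0x10000
    out = list(dy)
    load_i = store_i = passes = 0
    while True:
        passes += 1
        if pr.read(16):                 # (p16) both loads
            fr.write(DX, dx[load_i])
            fr.write(DY, dy[load_i])
            load_i += 1
        if pr.read(18):                 # (p18) fma.d tmp[0] = da, dx[2], dy[2]
            fr.write(TMP, da * fr.read(DX + 2) + fr.read(DY + 2))
        if pr.read(19):                 # (p19) stfd [dydp] = tmp[1]
            out[store_i] = fr.read(TMP + 1)
            store_i += 1
        # br.ctop (simplified)
        if lc > 0:
            lc, p63, taken = lc - 1, 1, True
        elif ec > 1:
            ec, p63, taken = ec - 1, 0, True
        else:
            ec, p63, taken = ec - 1, 0, False
        pr.write(63, p63)
        fr.rrb -= 1
        pr.rrb -= 1
        if not taken:
            return out, passes


result, passes = pipelined_daxpy(2.0, [1.0, 2.0, 3.0, 4.0],
                                 [10.0, 20.0, 30.0, 40.0])
```

Four source iterations take seven passes through the one-cycle loop body: four to start iterations and three more to drain the pipeline, with no separate prologue or epilogue code.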

Loop Execution

Loop body, one pass per cycle: (p16) ldx  (p16) ldy  (p18) fma  (p19) st

Execution Sequence

[Animation: each slide adds one more pass of the loop body and shows the state after the next br.ctop. The last column lists the window of predicate bits drawn on each slide.]

Step            RRB  LC  EC  Written to p63  Predicate bits shown
Initialization    0   3   4  n/a             63:0 16:1 17:0 18:0 19:0
Branch 1         -1   2   4  1               62:0 63:1 16:1 17:0 18:0
Branch 2         -2   1   4  1               61:0 62:1 63:1 16:1 17:0
Branch 3         -3   0   4  1               60:0 61:1 62:1 63:1 16:1
Branch 4         -4   0   3  0               59:0 60:0 61:1 62:1 63:1
Branch 5         -5   0   2  0               58:0 59:0 60:0 61:1 62:1
Branch 6         -6   0   1  0               57:0 58:0 59:0 60:0 61:1
Branch 7         -7   0   0  0               56:0 57:0 58:0 59:0 60:0 (fall through)

Pipelining & Latency

Suppose we change the latencies
– load latency of 6 clocks
– fma latency of 4 clocks

Example: New Pipeline

Each column represents 1 source iteration

[Figure: pipeline diagram whose stages are "load dx,dy", "tmp = dy + da * dx" and "store dy".]

Updated Loop

        .rotf dx[7], dy[7], tmp[5]      // rotating FR names
        mov ar.lc = 3                   // #iterations-1
        mov ar.ec = 11                  // #stages
        mov pr.rot = 0x10000            // p16 = 1, other rotating predicates = 0
        ;;
    looptop:
        (p16) ldfd dx[0] = [dxsp],8     // stage 1: load dx[i]
        (p16) ldfd dy[0] = [dysp],8     // stage 1: load dy[i]
        (p22) fma.d tmp[0] = da, dx[6], dy[6]   // stage 7: da * dx + dy
        (p26) stfd [dydp] = tmp[4],8    // stage 11: store the result
        br.ctop looptop
        ;;
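Only the register indices, stage predicates and EC differ between the two versions of the loop, and all of them follow from the two latencies. The helper below is a sketch of that arithmetic, assuming one pass of the loop body per cycle as in the machine assumptions; the function name and dictionary keys are hypothetical.

```python
def pipeline_params(load_latency, fma_latency, first_stage_pred=16):
    """Derive the loop's rotation parameters from the two latencies.

    Assumes a new iteration starts every cycle, so a value produced in
    stage s is read `latency` rotations later under a shifted name.
    """
    fma_pred = first_stage_pred + load_latency
    store_pred = fma_pred + fma_latency
    return {
        "rotf_dx": load_latency + 1,      # dx[0] .. dx[load_latency]
        "rotf_dy": load_latency + 1,
        "rotf_tmp": fma_latency + 1,      # tmp[0] .. tmp[fma_latency]
        "fma_pred": fma_pred,             # predicate guarding the fma
        "fma_src_index": load_latency,    # fma reads dx[k], dy[k]
        "store_pred": store_pred,         # predicate guarding the store
        "store_src_index": fma_latency,   # store reads tmp[k]
        "ec": load_latency + fma_latency + 1,
    }


original = pipeline_params(2, 1)
updated = pipeline_params(6, 4)
```

With latencies of 2 and 1 this gives p18, p19 and EC = 4; with 6 and 4 it gives p22, p26 and EC = 11, matching the two listings. The loop body itself stays the same size.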

Rotation: Summary

Loop pipelining maximizes performance; minimizes overhead
– Avoids code expansion of unrolling and code explosion of prologue and epilogue
– Smaller code means fewer cache misses
– Greater performance improvements in higher latency conditions

Reduced overhead allows S/W pipelining of small loops with unknown trip counts
– Typical of integer scalar codes

Register Model Summary

GR Stack
– Overlap call/ret operations with real work
– RSE hides spills/fills

GR, FR, PR Rotation
– General acceleration for all types of loops

SW-visible resources
– Large named register files & renaming

HW simplicity and explicit control
