Closely-Coupled Timing-Directed Partitioning in HAsim

Closely-Coupled Timing-Directed Partitioning in HAsim. Michael Pellauer† (pellauer@csail.mit.edu), Murali Vijayaraghavan†, Michael Adler‡, Arvind†, Joel Emer†‡. †MIT CS and AI Lab, Computation Structures Group. ‡Intel Corporation, VSSAD Group. To Appear In: ISPASS 2008.

Closely-Coupled Timing-Directed Partitioning in HAsim

Michael Pellauer† (pellauer@csail.mit.edu), Murali Vijayaraghavan†, Michael Adler‡, Arvind†, Joel Emer†‡

†MIT CS and AI Lab, Computation Structures Group
‡Intel Corporation, VSSAD Group

To Appear In: ISPASS 2008

Motivation

We want to simulate target platforms quickly
We also want to construct simulators quickly
Partitioned simulators are a known technique from traditional performance models:

Functional Partition
• ISA
• Off-chip communication

Timing Partition
• Micro-architecture
• Resource contention
• Dependencies

Interaction

• Simplifies timing model
• Amortize functional model design effort over many models
• Functional Partition can be extremely FPGA-optimized

Different Partitioning Schemes

As categorized by Mauer, Hill and Wood:

Source: [MAUER 2002], ACM SIGMETRICS

We believe that a timing-directed solution will ultimately lead to the best performance

Both partitions upon the FPGA

Functional Partition in Software Asim

• Get Instruction (at a given Address)
• Get Dependencies
• Get Instruction Results
• Read Memory*
• Speculatively Write Memory* (locally visible)
• Commit or Abort instruction
• Write Memory* (globally visible)

* Optional depending on instruction type
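To make the operation list concrete, here is a minimal sketch of how a timing model might drive a functional partition one operation at a time. The class name `FunctionalPartition`, the method names, the tuple-based toy instruction format, and the register and memory dictionaries are all assumptions made for illustration; they are not the Asim or HAsim interface.

```python
class FunctionalPartition:
    """Toy functional partition. All names and the instruction format
    (op, dest, src, src_or_imm) are hypothetical, not the Asim API."""

    def __init__(self, program):
        self.program = program    # address -> instruction tuple
        self.regs = {}            # architectural register values
        self.memory = {}          # globally visible memory
        self.store_buffer = {}    # speculative, locally visible stores

    def get_instruction(self, addr):
        return self.program[addr]

    def get_dependencies(self, inst):
        op, dest, a, b = inst
        if op == "add":
            return [a, b]
        if op == "store":
            return [dest, a]      # value register, address register
        return [a]                # load: address register only

    def get_results(self, inst):
        op, dest, a, b = inst
        if op == "add":
            return self.regs.get(a, 0) + self.regs.get(b, 0)
        return self.regs.get(a, 0) + b   # load/store: effective address

    def read_memory(self, addr):
        # Speculative stores are visible locally before they are global.
        if addr in self.store_buffer:
            return self.store_buffer[addr]
        return self.memory.get(addr, 0)

    def speculative_write(self, addr, value):
        self.store_buffer[addr] = value

    def commit(self, inst, value):
        op, dest, _, _ = inst
        if op != "store":
            self.regs[dest] = value

    def abort(self):
        self.store_buffer.clear()        # discard speculative stores

    def write_memory(self, addr):
        self.memory[addr] = self.store_buffer.pop(addr)


# The timing model decides *when* each operation happens; here it simply
# calls them back to back for an add followed by a store.
fp = FunctionalPartition({0: ("add", "r3", "r1", "r2"),
                          4: ("store", "r3", "r0", 16)})
fp.regs.update({"r0": 0, "r1": 2, "r2": 5})
add = fp.get_instruction(0)
fp.commit(add, fp.get_results(add))
store = fp.get_instruction(4)
addr = fp.get_results(store)
fp.speculative_write(addr, fp.regs["r3"])   # locally visible only
fp.commit(store, addr)
fp.write_memory(addr)                       # now globally visible
```

In this sketch the optional memory operations are only called for the store, mirroring the footnote that they depend on the instruction type.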

Execution in Phases

F D X R C

F D X W C W

F D X C

The Emer Assertion:

All data dependencies can be represented via these phases

F D X R A

F D X X C W
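The phase strings above can be read as sequences of functional-partition operations. The sketch below expands a phase string into operation names. The letter meanings (F = Get Instruction, D = Get Dependencies, X = Get Instruction Results, R = Read Memory, W = Write Memory, C = Commit, A = Abort) are an assumed reading of the diagram, and treating a W before the commit as the speculative write and a W after it as the global write is likewise an assumption; the slide does not spell either out.

```python
# Assumed meaning of each phase letter (not stated on the slide).
PHASES = {
    "F": "Get Instruction",
    "D": "Get Dependencies",
    "X": "Get Instruction Results",
    "R": "Read Memory",
    "C": "Commit",
    "A": "Abort",
}


def expand(sequence):
    """Expand a phase string such as 'F D X W C W' into operation names."""
    ops = []
    committed = False
    for letter in sequence.split():
        if letter == "W":
            # Assumption: W before C is local/speculative, W after C is global.
            ops.append("Write Memory (globally visible)" if committed
                       else "Speculatively Write Memory (locally visible)")
        else:
            ops.append(PHASES[letter])
            if letter == "C":
                committed = True
    return ops


for op in expand("F D X W C W"):
    print(op)
```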

Detailed Example: 3 Different Timing Models

Executing the same instruction sequence:

Functional Partition in Hardware?

Requirements
• Support these operations in hardware
• Allow for out-of-order execution, speculation, rollback

Challenges
• Minimize operation execution times
• Pipeline wherever possible
• Tradeoff between BRAM/multiport RAMs
• Race conditions due to extreme parallelism

Functional Partition As Pipeline

Conveys concept well, but poor performance

[Diagram: the Timing Model drives a Functional Partition drawn as a pipeline with stages Token Gen, Fet, Dec, Exe, Mem, LCom, GCom, alongside Memory State and Register State (RegFile).]

Implementation: Large Scoreboards in BRAM

Series of tables in BRAM
• Store information about each in-flight instruction

Tables are indexed by “token”
• Also used by the timing partition to refer to each instruction
• New operation “getToken” to allocate a space in the tables
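The token-indexed tables can be pictured with a small software model. This is only a sketch under stated assumptions: the class name `TokenTables`, the particular tables kept (`address`, `instruction`, `result`), the round-robin allocation policy, and the `free_token` operation are all hypothetical. The slides name only the “getToken” operation and say that the tables live in BRAM.

```python
class TokenTables:
    """Software model of token-indexed scoreboard tables (hypothetical layout)."""

    def __init__(self, num_tokens):
        self.num_tokens = num_tokens
        self.in_use = [False] * num_tokens
        self.next_token = 0
        # One table per kind of per-instruction information,
        # every table indexed by the same token.
        self.address = [None] * num_tokens
        self.instruction = [None] * num_tokens
        self.result = [None] * num_tokens

    def get_token(self):
        """Allocate a slot in every table for a new in-flight instruction."""
        for offset in range(self.num_tokens):
            token = (self.next_token + offset) % self.num_tokens
            if not self.in_use[token]:
                self.in_use[token] = True
                self.next_token = (token + 1) % self.num_tokens
                return token
        raise RuntimeError("no free token: too many in-flight instructions")

    def free_token(self, token):
        """Release the slot once the instruction has committed or aborted."""
        self.in_use[token] = False
        self.address[token] = None
        self.instruction[token] = None
        self.result[token] = None


tables = TokenTables(num_tokens=4)
tok = tables.get_token()
tables.address[tok] = 0x1000      # the timing partition refers to the
tables.instruction[tok] = "add"   # instruction only by its token
tables.free_token(tok)
```

Because both partitions name an instruction by its token, the timing partition never needs to hold the instruction's data itself.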

Implementing the Operations

See paper for details (also extra slides)

Assessment: Three Timing Models

Unpipelined Target

MIPS R10K-like out-of-order superscalar

5-Stage Pipeline

Assessment: Target Performance

Targets have idealized memory hierarchy

[Chart: Target Processor CPI. Y-axis: Model Cycles per Instruction (CPI), 0 to 3.5. X-axis: benchmarks median, multiply, qsort, towers, vvadd, plus the average. Series: Unpipelined, 5-Stage, Out-of-Order.]

Assessment: Simulator Performance

Some correspondence between target and functional partition is very helpful

[Chart: Simulation Rate. Y-axis: FPGA-Cycles per Model Cycle (FMR), 0 to 45. X-axis: benchmarks median, multiply, qsort, towers, vvadd, plus the average. Series: Unpipelined, 5-Stage, Out-of-Order.]

Assessment: Reuse and Physical Stats

Where is functionality implemented:

[Table: rows are the designs Unpipelined, 5-Stage, and Out-of-Order; columns are IMem, Program Counter, Branch Predictor, Scoreboard/ROB, RegFile, Maptable/Freelist, ALU, DMem, Store Buffer, and Snapshots/Rollback, with "Functional Partition" as a further label. Of the cell contents, only N/A entries remain: five for Unpipelined and one for 5-Stage.]

FPGA usage (Virtex IIPro 70, using ISE 8.1i):

                        Unpipelined    5-Stage        Out-of-Order
FPGA Slices             6,599 (20%)    9,220 (28%)    22,873 (69%)
Block RAMs              18 (5%)        25 (7%)        25 (7%)
Clock Speed             98.8 MHz       96.9 MHz       95.0 MHz
Average FMR             41.1           7.49           15.6
Simulation Rate         2.4 MHz        14 MHz         6 MHz
Average Simulator IPS   2.4 MIPS       5.1 MIPS       4.7 MIPS
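Since FMR is the number of FPGA cycles spent per model cycle, the simulation rate in model cycles per second should be roughly the FPGA clock speed divided by the FMR. The short sketch below applies that relation to the clock speeds and average FMR values listed above; the function and variable names are illustrative only.

```python
def simulation_rate_mhz(fpga_clock_mhz, fmr):
    """Model cycles per microsecond = FPGA cycles per microsecond / FMR."""
    return fpga_clock_mhz / fmr


# (clock speed in MHz, average FMR) as listed on the slide
designs = {
    "Unpipelined": (98.8, 41.1),
    "5-Stage": (96.9, 7.49),
    "Out-of-Order": (95.0, 15.6),
}

for name, (clock, fmr) in designs.items():
    print(f"{name}: {simulation_rate_mhz(clock, fmr):.1f} MHz")
```

This gives about 2.4, 12.9, and 6.1 MHz. The first and last agree with the listed simulation rates; the 5-Stage value comes out somewhat below the listed 14 MHz, which may reflect rounding or a different way of averaging across benchmarks.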

Future Work: Simulating Multicores

Scheme 1: Duplicate both partitions

[Diagram: Timing Models A, B, C, and D, each paired with its own Func Reg + Datapath, all connected to one Functional Memory State. Label: Interaction occurs here.]

Scheme 2: Cluster Timing Partitions

[Diagram: Timing Models A, B, C, and D sharing a single Functional Reg State + Datapath and one Functional Memory State. Labels: Interaction still occurs here. Use a context ID to reference all state lookups.]
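The context-ID idea in Scheme 2 can be sketched in software: several timing models share one functional datapath, and each register-state lookup carries the ID of the core it belongs to. The class name `SharedFunctionalState`, the method names, the 32-register default, and the choice to keep memory as a single address space shared by all contexts are assumptions for illustration, not the HAsim design.

```python
class SharedFunctionalState:
    """One functional partition serving several timing models (hypothetical)."""

    def __init__(self, num_contexts, num_regs=32):
        # Register state is replicated per context and indexed by context ID.
        self.regs = [[0] * num_regs for _ in range(num_contexts)]
        # Assumption: memory is one address space shared by every context.
        self.memory = {}

    def read_reg(self, ctx, reg):
        return self.regs[ctx][reg]

    def write_reg(self, ctx, reg, value):
        self.regs[ctx][reg] = value

    def load(self, addr):
        return self.memory.get(addr, 0)

    def store(self, addr, value):
        self.memory[addr] = value


state = SharedFunctionalState(num_contexts=4)
state.write_reg(ctx=0, reg=1, value=42)   # timing model A
state.write_reg(ctx=1, reg=1, value=7)    # timing model B, same register number
state.store(0x100, state.read_reg(0, 1))  # cores interact through memory
```

Registers with the same number stay separate per context, while a store by one context is visible to loads from any other.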

Future Work: Simulating Multicores

Scheme 3: Perform multiplexing of timing models themselves
• Leverage HAsim A-Ports in Timing Model
• Out of scope of today’s talk

[Diagram: Timing Models A, B, C, and D multiplexed onto one Functional Reg State + Datapath and one Functional Memory State. Labels: Interaction still occurs here. Use a context ID to reference all state lookups.]

Future Work: Unifying with the UT-FAST model

UT-FAST is Functional-First
• This can be unified into Timing-Directed
• Just do “execute-at-fetch”

[Diagram labels: a functional emulator running in software; the FPGA; an execution stream in one direction and resteers in the other; Func Partition, Timing Partition, and Emulator.]

Summary

Described a scheme for closely-coupled timing-directed partitioning
• Both partitions are suitable for on-FPGA implementation

Demonstrated such a scheme’s benefits:
• Very Good Reuse, Very Good Area/Clock Speed
• Good FPGA-to-Model Cycle Ratio:
  Caveat: Assuming some correspondence between timing model and functional partitions (recall the unpipelined target)

We plan to extend this using contexts for hardware multiplexing [Chung 07]

Future: rare complex operations (such as syscalls) could be done in software using virtual channels

Questions?

pellauer@csail.mit.edu

Extra Slides

Functional Partition Fetch

Functional Partition Decode

Functional Partition Execute

Functional Partition Back End

Timing Model: Unpipelined

5-Stage Pipeline Timing Model

Out-Of-Order Superscalar Timing Model