Post on 17-Jan-2016
Multiscalar Processors
Presented by Matthew Misler
Gurindar S. Sohi, Scott E. Breach, T. N. Vijaykumar
University of Wisconsin-Madison
ISCA ‘95
Scalar Processors
2
Instruction Queue
addu $20, $20, 16
ld $23, SYMVAL-16($20)
move $17, $21
beq $17, $0, SKIPINNER
ld $8, LELE($17)
Execution Unit
Superscalar Processors
3
Instruction Queue
addu $20, $20, 16
ld $23, SYMVAL-16($20)
move $17, $21
beq $17, $0, SKIPINNER
ld $8, LELE($17)
Execution Unit
Fetch-Execute
– Paradigm has been around for about 60 years
– Superscalar processors execute instructions out of order
– Sometimes the re-ordering is done in hardware
– Sometimes in software
– Sometimes both
– Partial ordering
4
Control Flow Graphs
– Segments are split on control dependencies (conditional branches)
5
Sequential “Walk”
– Walk through the CFG with enough parallelism
– Use speculative execution and branch prediction to raise the level of parallelism
– Sequential semantics must be preserved
– Can still execute out of order, but with in-order commit
6
Multiscalars and Tasks
– CFG broken down into tasks
– Multiscalar steps through the CFG at the task level
– No inspection of instructions within a task
– Each task is assigned to one ‘processing unit’
– Multiple tasks can execute in parallel
7
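The task-level walk can be sketched as a toy loop. Task names, ring size, and the round-robin assignment below are illustrative assumptions for the sketch, not details from the paper.

```python
# Toy sketch of a sequencer walking a CFG at task granularity and
# handing tasks to a ring of processing units. The sequencer never
# looks inside a task; it only steps task by task.

def assign_tasks(task_sequence, num_units):
    """Assign each dynamic task to the next unit around the ring."""
    assignments = []
    for i, task in enumerate(task_sequence):
        unit = i % num_units  # unidirectional ring: round-robin order
        assignments.append((task, unit))
    return assignments

walk = ["T0", "T1", "T2", "T3", "T4"]  # hypothetical dynamic task order
print(assign_tasks(walk, 3))
# [('T0', 0), ('T1', 1), ('T2', 2), ('T3', 0), ('T4', 1)]
```

Because tasks commit in ring order, sequential semantics are preserved even though the units execute their tasks concurrently.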
Multiscalar Microarchitecture
– Sequencer
– Queue of processing units
– Unidirectional ring
– Each unit has an instruction cache, processing element, and register file
– Interconnect
– Data banks
– Each bank has an address resolution buffer and a data cache
8
Multiscalar Microarchitecture
9
Outline
– Multiscalar Microarchitecture
– Tasks
– Multiscalars In-Depth
– Distribution of Cycles
– Comparison to Other Paradigms
– Performance
– Conclusion
10
Tasks
– Sequencer distributes a task to a processing unit
– Unit fetches and executes the task until completion
– The instruction window is bounded by
– The first instruction in the earliest executing task
– The last instruction in the latest executing task
– So? Instruction windows can be huge
12
Tasks Example
14
A B C D E (CFG blocks)
A B C B B C D (head unit)
A B B C D (middle unit)
A B C B C D E (tail unit)
Tasks
– Hold true to sequential semantics inside each block
– Enforce sequential order overall on tasks
– The circular queue takes care of this part
– In the previous example:
– Head of queue does A B C B B C D
– Middle unit does A B B C D
– Tail of the queue does A B C B C D E
15
Tasks
– Registers
– Create mask: registers the task may produce for a future task
– Forward values down the ring
– Accum mask: union of the create masks of active tasks
– Memory
– If there is a known producer-consumer relationship, synchronize on the loads and stores
16
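The create and accum masks can be sketched as plain bitmasks over register numbers. The register numbers below are illustrative, not from the paper.

```python
# Sketch of create/accum masks as bitmasks over a small register file.

def create_mask(written_regs):
    """Bitmask of registers a task may write (and must forward)."""
    m = 0
    for r in written_regs:
        m |= 1 << r
    return m

def accum_mask(create_masks):
    """Union of the create masks of all currently active tasks."""
    m = 0
    for cm in create_masks:
        m |= cm
    return m

t0 = create_mask([1, 3])  # task 0 may write $1 and $3
t1 = create_mask([3, 5])  # task 1 may write $3 and $5
print(bin(accum_mask([t0, t1])))  # 0b101010 -> $1, $3, $5 pending
```

A later task consuming a register set in the accum mask knows it must wait for a forwarded value rather than read its local copy.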
Tasks
– Memory (cont’d)
– Unknown producer-consumer relationship
– Conservative approach: wait
– Aggressive approach: speculate
– The conservative approach means sequential operation
– The aggressive approach requires dynamic checking, squashing, and recovery
17
Multiscalar Programs
– Code for the tasks
– Small changes to the existing ISA
– Add specification of tasks; no major overhaul
– Structure of the CFG and tasks
– Communication between tasks
19
Control Flow Graph Structure
– Successors
– Task descriptor
– Producing and consuming values
– Forward register information on the last update
– Compiler can mark instructions: operate and forward
– Stopping conditions
– Special condition, evaluate conditions, complete
– All of these can be viewed as tag bits
20
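Viewed as data, a task descriptor might look like the following record. The field names are my own for illustration; the real hardware encodes this information as tag bits alongside the task's code.

```python
# Hypothetical task-descriptor layout (field names are invented; the
# paper's encoding packs equivalent information as tag bits).

from dataclasses import dataclass

@dataclass
class TaskDescriptor:
    first_pc: int          # address of the task's first instruction
    successors: tuple      # possible next tasks, used for prediction
    create_mask: int       # registers this task may produce and forward
    stop_conditions: tuple = ()  # how the task knows it is complete

d = TaskDescriptor(first_pc=0x400100,
                   successors=(0x400200, 0x400300),
                   create_mask=0b1010)
print(hex(d.first_pc))  # 0x400100
```

The sequencer needs only this record, never the task body, to fetch, predict a successor, and build the accum mask.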
Multiscalar Hardware
– Walks through the CFG
– Assigns tasks to processing units
– Executes tasks in a ‘sequential’ order
– Sequencer fetches the task descriptors
– Using the address of the first instruction
– Specifying the create masks
– Constructing the accum mask
– Using the task descriptor, predict the successor
21
Multiscalar Hardware
22
– Data banks
– Updates to the cache are not speculative
– Use of the Address Resolution Buffer (ARB)
– Detects violations of dependencies
– Initiates corrective actions
– If it runs out of space, squash tasks
– The head of the queue does not use the ARB
– It can stall rather than squash
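A minimal sketch of the ARB's dependence checking, assuming tasks are numbered in sequential (ring) order. Everything here is heavily simplified and hypothetical; the real ARB is an associative hardware structure.

```python
# Toy ARB-style check: a speculative later task's load is violated if
# an earlier task subsequently stores to the same address.

class ARB:
    def __init__(self):
        self.loads = {}  # address -> set of task ids that loaded it

    def load(self, task, addr, memory):
        """Record the speculative load and return the current value."""
        self.loads.setdefault(addr, set()).add(task)
        return memory.get(addr, 0)

    def store(self, task, addr, value, memory):
        """Perform the store; report later tasks that used a stale value."""
        memory[addr] = value
        # Any later task that already loaded this address read too early
        # and must be squashed (along with its successors).
        return sorted(t for t in self.loads.get(addr, ()) if t > task)

arb, mem = ARB(), {}
arb.load(task=2, addr=0x10, memory=mem)   # task 2 loads speculatively
squash = arb.store(task=1, addr=0x10, value=7, memory=mem)
print(squash)  # [2] -> task 2 violated sequential memory order
```

This is why the head task needs no ARB entries: it is non-speculative, so no earlier store can ever invalidate one of its loads.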
Multiscalar Hardware
23
– Remember the earlier architectural picture?
Multiscalar Hardware
24
– It’s not the only possible architecture
– Possible design with shared functional units
– Possible design with the ARB and data cache on the same side as the processing units
– Scaling the interconnect is non-trivial
– Glossed over in the paper
Distribution of Cycles
26
– Wasted cycles:
– Non-useful computation: squashed work
– No computation: waiting on dependencies
– Idle: no assigned task
Distribution of Cycles
– Non-useful computation cycles
– Determine useless computation early
– Validate predictions early: check if the next task is predicted correctly
– E.g. test for loop exit at the start of the loop
– Tasks violating sequentiality are squashed
– To avoid this, try to synchronize memory communication with register communication
– Could delay the load for a number of cycles
– Can use signal-wait synchronization
27
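Signal-wait synchronization between a producing store and a dependent load can be sketched with an event flag. Threads stand in for processing units here; this is an analogy for the idea, not the hardware mechanism.

```python
# Signal-wait sketch: the consumer waits for an explicit signal instead
# of speculating, so there is no squash risk for this known dependence.

import threading

ready = threading.Event()
shared = {}

def producer_task():
    shared["x"] = 42   # the store the later task depends on
    ready.set()        # signal: the value is now safe to load

def consumer_task(results):
    ready.wait()       # wait for the signal rather than load early
    results.append(shared["x"])

results = []
c = threading.Thread(target=consumer_task, args=(results,))
c.start()
producer_task()
c.join()
print(results)  # [42]
```

The trade-off from the slide: waiting costs a few cycles, but it avoids the much larger cost of squashing and re-executing the violating task.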
Distribution of Cycles
– No computation cycles (in contrast with having no assigned task)
– Dependencies within the same task
– Dependencies between tasks (earlier/later)
– Load balancing
28
Comparison to Other Paradigms
30
– Branch prediction
– Sequencer only needs to predict branches across tasks
– Wide instruction window
– A superscalar must check every windowed instruction for readiness to issue; a multiscalar inspects relatively few at a time
Comparison to Other Paradigms
31
– Issue logic
– Superscalar processors have n² issue logic
– Multiscalar logic is distributed
– Each processing unit issues instructions independently
– Loads and stores
– Normally sequence numbers are needed for managing the buffers
– In a multiscalar, the loads and stores are independent
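The n² point can be made concrete with a back-of-envelope count of pairwise dependence comparisons in a centralized issue window. This is my own illustration of the scaling argument, not the paper's circuit.

```python
# Each ordered (earlier producer, later consumer) pair in a centralized
# window needs a dependence comparison, so cost grows quadratically.

def pairwise_checks(window_size):
    """Number of producer/consumer pairs in a window of n instructions."""
    n = window_size
    return n * (n - 1) // 2

print(pairwise_checks(4), pairwise_checks(32))  # 6 496
```

Distributing issue across units keeps each unit's window (and so each quadratic term) small, which is the multiscalar advantage claimed above.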
Comparison to Other Paradigms
32
– Superscalar processors need to discover the CFG as they decode branches
– A multiscalar only requires the compiler to split code into tasks
– Multiprocessors require all dependences to be known or conservatively provided for
– If the compiler can split the code into tasks, a multiscalar can execute them in parallel
Performance
34
– Simulated
– 5-stage pipeline
– Functional unit latency table (figure omitted)
Performance
35
– Memory
– Non-blocking loads and stores
– 10-cycle latency for the first 4 words
– 1 cycle for each additional 4 words
– Instruction cache: 1 cycle for 4 words
– 10+3 cycles for a miss
– Data cache: 1 word per cycle (multiscalar)
– 10+3 cycles plus bus contention for a miss
– 1024-entry cache of task descriptors
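One possible reading of the latency numbers above, written as a small helper. This is my interpretation of the slide's "10 cycles for the first 4 words, 1 cycle per additional 4 words", not the paper's exact model.

```python
# Assumed memory latency model: 10 cycles for the first 4-word group,
# plus 1 cycle for each additional 4-word group.

import math

def load_latency(words):
    """Cycles to transfer `words` words under the assumed model."""
    groups = math.ceil(words / 4)
    return 10 + (groups - 1)

print(load_latency(4), load_latency(8), load_latency(16))  # 10 11 13
```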
Performance
36
– Instruction count increases by 12.2% on average
Performance – In-Order
37
Performance – Out-of-Order
38
Performance – Summary
39
– Most of the benchmarks achieve speedup
– E.g. an average of 1.924 on a 1-way in-order 4-unit multiscalar
– Worst case: 0.86 speedup (a slowdown)
– Many squashes from prediction and memory-order violations in gcc and xlisp
– Leads to almost sequential execution
– Keep in mind the 12.2% increase in instruction count
Conclusion
41
– Divide the CFG into tasks
– Assign tasks to processing units
– Walk the CFG in task-sized steps
– Shows performance gains