Post on 17-Jan-2016
Multiscalar Processors
Presented by Matthew Misler
Gurindar S. Sohi, Scott E. Breach, T. N. Vijaykumar
University of Wisconsin-Madison
ISCA ‘95
Scalar Processors
2
Instruction Queue
addu $20, $20, 16
ld $23, SYMVAL-16($20)
move $17, $21
beq $17, $0, SKIPINNER
ld $8, LELE($17)
Execution Unit
Superscalar Processors
3
Instruction Queue
addu $20, $20, 16
ld $23, SYMVAL-16($20)
move $17, $21
beq $17, $0, SKIPINNER
ld $8, LELE($17)
Execution Unit
Fetch-Execute
– Paradigm has been around for about 60 years
– Superscalar processors execute instructions out of order
– Sometimes the re-ordering is done in hardware
– Sometimes in software
– Sometimes both
– Partial ordering
4
Control Flow Graphs
– Segments are split on control dependencies (conditional branches)
5
Sequential “Walk”
– Walk through the CFG with enough parallelism
– Use speculative execution and branch prediction to raise the level of parallelism
– Sequential semantics must be preserved
– Can still execute out of order, but with in-order commit
6
Multiscalars and Tasks
– CFG broken down into tasks
– Multiscalar steps through the CFG at the task level
– No inspection of instructions within a task
– Each task is assigned to one ‘processing unit’
– Multiple tasks can execute in parallel
7
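The task-level walk can be sketched as a toy loop. Task names, ring size, and the round-robin assignment below are illustrative assumptions for the sketch, not details from the paper.

```python
# Toy sketch of a sequencer walking a CFG at task granularity and
# handing tasks to a ring of processing units. The sequencer never
# looks inside a task; it only steps task by task.

def assign_tasks(task_sequence, num_units):
    """Assign each dynamic task to the next unit around the ring."""
    assignments = []
    for i, task in enumerate(task_sequence):
        unit = i % num_units  # unidirectional ring: round-robin order
        assignments.append((task, unit))
    return assignments

walk = ["T0", "T1", "T2", "T3", "T4"]  # hypothetical dynamic task order
print(assign_tasks(walk, 3))
# [('T0', 0), ('T1', 1), ('T2', 2), ('T3', 0), ('T4', 1)]
```

Because tasks commit in ring order, sequential semantics are preserved even though the units execute their tasks concurrently.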
Multiscalar Microarchitecture
– Sequencer
– Queue of processing units
– Unidirectional ring
– Each unit has an instruction cache, processing element, and register file
– Interconnect
– Data banks
– Each bank has an address resolution buffer and a data cache
8
Multiscalar Microarchitecture
9
Outline
– Multiscalar Microarchitecture
– Tasks
– Multiscalars In-Depth
– Distribution of Cycles
– Comparison to Other Paradigms
– Performance
– Conclusion
10
Tasks
– Sequencer distributes a task to a processing unit
– Unit fetches and executes the task until completion
– The instruction window is bounded by
– The first instruction in the earliest executing task
– The last instruction in the latest executing task
– So? Instruction windows can be huge
12
Tasks Example
14
A B C D E (CFG blocks)
A B C B B C D (head unit)
A B B C D (middle unit)
A B C B C D E (tail unit)
Tasks
– Hold true to sequential semantics inside each block
– Enforce sequential order overall on tasks
– The circular queue takes care of this part
– In the previous example:
– Head of queue does A B C B B C D
– Middle unit does A B B C D
– Tail of the queue does A B C B C D E
15
Tasks
– Registers
– Create mask: registers the task may produce for a future task
– Forward values down the ring
– Accum mask: union of the create masks of active tasks
– Memory
– If there is a known producer-consumer relationship, synchronize on the loads and stores
16
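The create and accum masks can be sketched as plain bitmasks over register numbers. The register numbers below are illustrative, not from the paper.

```python
# Sketch of create/accum masks as bitmasks over a small register file.

def create_mask(written_regs):
    """Bitmask of registers a task may write (and must forward)."""
    m = 0
    for r in written_regs:
        m |= 1 << r
    return m

def accum_mask(create_masks):
    """Union of the create masks of all currently active tasks."""
    m = 0
    for cm in create_masks:
        m |= cm
    return m

t0 = create_mask([1, 3])  # task 0 may write $1 and $3
t1 = create_mask([3, 5])  # task 1 may write $3 and $5
print(bin(accum_mask([t0, t1])))  # 0b101010 -> $1, $3, $5 pending
```

A later task consuming a register set in the accum mask knows it must wait for a forwarded value rather than read its local copy.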
Tasks
– Memory (cont’d)
– Unknown producer-consumer relationship
– Conservative approach: wait
– Aggressive approach: speculate
– The conservative approach means sequential operation
– The aggressive approach requires dynamic checking, squashing, and recovery
17
Multiscalar Programs
– Code for the tasks
– Small changes to the existing ISA
– Add specification of tasks; no major overhaul
– Structure of the CFG and tasks
– Communication between tasks
19
Control Flow Graph Structure
– Successors
– Task descriptor
– Producing and consuming values
– Forward register information on the last update
– Compiler can mark instructions: operate and forward
– Stopping conditions
– Special condition, evaluate conditions, complete
– All of these can be viewed as tag bits
20
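Viewed as data, a task descriptor might look like the following record. The field names are my own for illustration; the real hardware encodes this information as tag bits alongside the task's code.

```python
# Hypothetical task-descriptor layout (field names are invented; the
# paper's encoding packs equivalent information as tag bits).

from dataclasses import dataclass

@dataclass
class TaskDescriptor:
    first_pc: int          # address of the task's first instruction
    successors: tuple      # possible next tasks, used for prediction
    create_mask: int       # registers this task may produce and forward
    stop_conditions: tuple = ()  # how the task knows it is complete

d = TaskDescriptor(first_pc=0x400100,
                   successors=(0x400200, 0x400300),
                   create_mask=0b1010)
print(hex(d.first_pc))  # 0x400100
```

The sequencer needs only this record, never the task body, to fetch, predict a successor, and build the accum mask.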
Multiscalar Hardware
– Walks through the CFG
– Assigns tasks to processing units
– Executes tasks in a ‘sequential’ order
– Sequencer fetches the task descriptors
– Using the address of the first instruction
– Specifying the create masks
– Constructing the accum mask
– Using the task descriptor, predict the successor
21
Multiscalar Hardware
22
– Data banks
– Updates to the cache are not speculative
– Use of the Address Resolution Buffer (ARB)
– Detects violations of dependencies
– Initiates corrective actions
– If it runs out of space, squash tasks
– The head of the queue does not use the ARB
– It can stall rather than squash
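A minimal sketch of the ARB's dependence checking, assuming tasks are numbered in sequential (ring) order. Everything here is heavily simplified and hypothetical; the real ARB is an associative hardware structure.

```python
# Toy ARB-style check: a speculative later task's load is violated if
# an earlier task subsequently stores to the same address.

class ARB:
    def __init__(self):
        self.loads = {}  # address -> set of task ids that loaded it

    def load(self, task, addr, memory):
        """Record the speculative load and return the current value."""
        self.loads.setdefault(addr, set()).add(task)
        return memory.get(addr, 0)

    def store(self, task, addr, value, memory):
        """Perform the store; report later tasks that used a stale value."""
        memory[addr] = value
        # Any later task that already loaded this address read too early
        # and must be squashed (along with its successors).
        return sorted(t for t in self.loads.get(addr, ()) if t > task)

arb, mem = ARB(), {}
arb.load(task=2, addr=0x10, memory=mem)   # task 2 loads speculatively
squash = arb.store(task=1, addr=0x10, value=7, memory=mem)
print(squash)  # [2] -> task 2 violated sequential memory order
```

This is why the head task needs no ARB entries: it is non-speculative, so no earlier store can ever invalidate one of its loads.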
Multiscalar Hardware
23
– Remember the earlier architectural picture?
Multiscalar Hardware
24
– It’s not the only possible architecture
– Possible design with shared functional units
– Possible design with the ARB and data cache on the same side as the processing units
– Scaling the interconnect is non-trivial
– Glossed over in the paper
Distribution of Cycles
26
– Wasted cycles:
– Non-useful computation: squashed work
– No computation: waiting on dependencies
– Idle: no assigned task
Distribution of Cycles
– Non-useful computation cycles
– Determine useless computation early
– Validate predictions early: check if the next task is predicted correctly
– E.g. test for loop exit at the start of the loop
– Tasks violating sequentiality are squashed
– To avoid this, try to synchronize memory communication with register communication
– Could delay the load for a number of cycles
– Can use signal-wait synchronization
27
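Signal-wait synchronization between a producing store and a dependent load can be sketched with an event flag. Threads stand in for processing units here; this is an analogy for the idea, not the hardware mechanism.

```python
# Signal-wait sketch: the consumer waits for an explicit signal instead
# of speculating, so there is no squash risk for this known dependence.

import threading

ready = threading.Event()
shared = {}

def producer_task():
    shared["x"] = 42   # the store the later task depends on
    ready.set()        # signal: the value is now safe to load

def consumer_task(results):
    ready.wait()       # wait for the signal rather than load early
    results.append(shared["x"])

results = []
c = threading.Thread(target=consumer_task, args=(results,))
c.start()
producer_task()
c.join()
print(results)  # [42]
```

The trade-off from the slide: waiting costs a few cycles, but it avoids the much larger cost of squashing and re-executing the violating task.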
Distribution of Cycles
– No computation cycles (in contrast with having no assigned task)
– Dependencies within the same task
– Dependencies between tasks (earlier/later)
– Load balancing
28
Comparison to Other Paradigms
30
– Branch prediction
– Sequencer only needs to predict branches across tasks
– Wide instruction window
– A superscalar must check every windowed instruction for readiness to issue; a multiscalar inspects relatively few at a time
Comparison to Other Paradigms
31
– Issue logic
– Superscalar processors have n² issue logic
– Multiscalar logic is distributed
– Each processing unit issues instructions independently
– Loads and stores
– Normally sequence numbers are needed for managing the buffers
– In a multiscalar, the loads and stores are independent
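The n² point can be made concrete with a back-of-envelope count of pairwise dependence comparisons in a centralized issue window. This is my own illustration of the scaling argument, not the paper's circuit.

```python
# Each ordered (earlier producer, later consumer) pair in a centralized
# window needs a dependence comparison, so cost grows quadratically.

def pairwise_checks(window_size):
    """Number of producer/consumer pairs in a window of n instructions."""
    n = window_size
    return n * (n - 1) // 2

print(pairwise_checks(4), pairwise_checks(32))  # 6 496
```

Distributing issue across units keeps each unit's window (and so each quadratic term) small, which is the multiscalar advantage claimed above.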
Comparison to Other Paradigms
32
– Superscalar processors need to discover the CFG as they decode branches
– A multiscalar only requires the compiler to split code into tasks
– Multiprocessors require all dependences to be known or conservatively provided for
– If the compiler can split the code into tasks, a multiscalar can execute them in parallel
Performance
34
– Simulated
– 5-stage pipeline
– Functional unit latency table (figure omitted)
Performance
35
– Memory
– Non-blocking loads and stores
– 10-cycle latency for the first 4 words
– 1 cycle for each additional 4 words
– Instruction cache: 1 cycle for 4 words
– 10+3 cycles for a miss
– Data cache: 1 word per cycle (multiscalar)
– 10+3 cycles plus bus contention for a miss
– 1024-entry cache of task descriptors
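One possible reading of the latency numbers above, written as a small helper. This is my interpretation of the slide's "10 cycles for the first 4 words, 1 cycle per additional 4 words", not the paper's exact model.

```python
# Assumed memory latency model: 10 cycles for the first 4-word group,
# plus 1 cycle for each additional 4-word group.

import math

def load_latency(words):
    """Cycles to transfer `words` words under the assumed model."""
    groups = math.ceil(words / 4)
    return 10 + (groups - 1)

print(load_latency(4), load_latency(8), load_latency(16))  # 10 11 13
```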
Performance
36
– Instruction count increases by 12.2% on average
Performance – In-Order
37
Performance – Out-of-Order
38
Performance – Summary
39
– Most of the benchmarks achieve speedup
– E.g. an average of 1.924 on a 1-way in-order 4-unit multiscalar
– Worst case: 0.86 speedup (a slowdown)
– Many squashes from prediction and memory-order violations in gcc and xlisp
– Leads to almost sequential execution
– Keep in mind the 12.2% increase in instruction count
Conclusion
41
– Divide the CFG into tasks
– Assign tasks to processing units
– Walk the CFG in task-sized steps
– Shows performance gains