Post on 01-Jan-2016
Superscalar Processors
• Scalar processors: one instruction per cycle
• Superscalar processors: multiple instruction pipelines are used
• Purpose: to exploit more instruction-level parallelism in user programs
• Only independent instructions can be executed in parallel
Superscalar Processors
• Here, the instruction decoding and execution resources are increased
• Example: A dual pipeline superscalar processor
Superscalar Processor - Example
• Can issue two instructions per cycle
• There are two pipelines with four processing stages: fetch, decode, execute and store
• The two instruction streams come from a single I-cache
• Assume each stage requires one cycle, except the execution stage
Superscalar Processor - Example
• The four functional units of the execution stage are:

Functional Unit   Number of stages
Adder             2
Multiplier        3
Logic             1
Load              1

• Functional units are shared on a dynamic basis
• Look-ahead window: used for out-of-order instruction issue
Superscalar Performance
• Time required by the scalar base machine is
  T(1,1) = k + N - 1
• The ideal execution time required by an m-issue superscalar machine is
  T(m,1) = k + (N - m)/m
  where k is the time required to execute the first m instructions and (N - m)/m is the time required to execute the remaining (N - m) instructions
Superscalar Performance
• The ideal speedup of the superscalar machine is
  S(m,1) = T(1,1) / T(m,1) = m(N + k - 1) / (N + m(k - 1))
• As N → ∞, the speedup S(m,1) → m
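These formulas can be checked numerically; a minimal sketch in Python (the function names are illustrative, not from the notes):

```python
# Ideal execution time and speedup of an m-issue superscalar machine,
# following the formulas above (k pipeline stages, N instructions).

def t_base(k, N):
    return k + N - 1                  # T(1,1)

def t_superscalar(m, k, N):
    return k + (N - m) / m            # T(m,1)

def s_superscalar(m, k, N):
    return t_base(k, N) / t_superscalar(m, k, N)

# For a 2-issue machine with k = 4 stages, the speedup approaches m = 2
# as N grows:
for N in (10, 100, 1_000_000):
    print(N, round(s_superscalar(2, 4, N), 4))
```

With N = 10 the speedup is only 13/8 = 1.625; the ideal limit m is approached only for long independent instruction streams.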
Superpipeline Processors
• In a superpipelined processor of degree n, the pipeline cycle time is 1/n of base cycle.
Superpipeline Performance
• Time to execute N instructions for a superpipelined machine of degree n with k stages is
  T(1,n) = k + (N - 1)/n
• Speedup is given as
  S(1,n) = T(1,1) / T(1,n) = n(k + N - 1) / (nk + N - 1)
• As N → ∞, S(1,n) → n
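The same numerical check works here; a short sketch with illustrative names:

```python
# Speedup of a superpipelined machine of degree n (pipeline cycle time
# is 1/n of the base cycle), per the formulas above.

def t_base(k, N):
    return k + N - 1                  # T(1,1)

def t_superpipelined(n, k, N):
    return k + (N - 1) / n            # T(1,n)

def s_superpipelined(n, k, N):
    return t_base(k, N) / t_superpipelined(n, k, N)

# With k = 4 and n = 3, the speedup tends to n = 3 for large N:
print(round(s_superpipelined(3, 4, 1_000_000), 3))
```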
Superpipelined Superscalar Processors
• This machine executes m instructions every cycle with a pipeline cycle 1/n of the base cycle.
Superpipelined Superscalar Performance
• Time taken to execute N independent instructions on a superpipelined superscalar machine of degree (m,n) is
  T(m,n) = k + (N - m)/(mn)
• The speedup over the base machine is
  S(m,n) = T(1,1) / T(m,n) = mn(k + N - 1) / (mnk + N - m)
• As N → ∞, S(m,n) → mn
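Note that T(m,n) contains the earlier formulas as special cases: with n = 1 it reduces to T(m,1), and with m = 1 to T(1,n). A small sketch (names are illustrative) demonstrating all three limits:

```python
# Execution time and speedup of a superpipelined superscalar machine of
# degree (m, n); T(m,1) and T(1,n) fall out as special cases.

def t(m, n, k, N):
    return k + (N - m) / (m * n)      # T(m,n); note T(1,1) = k + N - 1

def s(m, n, k, N):
    return t(1, 1, k, N) / t(m, n, k, N)

k, N = 4, 1_000_000
print(round(s(2, 3, k, N), 2))        # tends to m*n = 6 for large N
print(round(s(2, 1, k, N), 2))        # pure superscalar limit: m = 2
print(round(s(1, 3, k, N), 2))        # pure superpipelined limit: n = 3
```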
Superscalar Processors
• Rely on spatial parallelism: multiple operations running on separate hardware concurrently
• Achieved by duplicating hardware resources such as execution units and register file ports
• Requires more transistors

Superpipelined Processors
• Rely on temporal parallelism: overlapping multiple operations on common hardware
• Achieved through more deeply pipelined execution units with faster clock cycles
• Requires faster transistors
Systolic Architecture
• Conventional architectures operate through load and store operations on memory.
• This requires many memory references, which slows down the system.
Systolic Architecture
• In systolic processing, the data to be processed flows through various operation stages and is finally put back in memory.
Systolic Architecture
• The basic architecture consists of processing elements (PEs) that are simple and identical in behavior at all instants.
• Each PE may have some registers and an ALU.
• PEs are interlinked in a manner dictated by the requirements of the specific algorithm, e.g. a 2D mesh, hexagonal arrays, etc.
Systolic Architecture
• PEs at the boundary of the structure are connected to memory.
• Data picked up from memory is circulated among the PEs that require it in a rhythmic manner, and the result is fed back to memory; hence the name systolic.
• Example: multiplication of two n x n matrices
Example : Multiplication of two n x n matrices
• Every element of the input is picked up n times from memory, since it contributes to n elements of the output.
• To reduce this memory traffic, the systolic architecture ensures that each element is fetched only once
• Consider an example where n = 3
Matrix Multiplication

a11 a12 a13     b11 b12 b13     c11 c12 c13
a21 a22 a23  *  b21 b22 b23  =  c21 c22 c23
a31 a32 a33     b31 b32 b33     c31 c32 c33

Conventional Method: O(n^3)

For I = 1 to N
  For J = 1 to N
    For K = 1 to N
      C[I,J] = C[I,J] + A[I,K] * B[K,J];
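The triple loop translates directly to Python; a quick sketch, checked against the 3 x 3 matrix used in the worked example later in these notes:

```python
# Conventional O(n^3) matrix multiplication, as in the triple loop above.

def matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]   # n**3 multiply-adds in total
    return C

A = [[3, 4, 2],
     [2, 5, 3],
     [3, 2, 5]]
print(matmul(A, A))   # [[23, 36, 28], [25, 39, 34], [28, 32, 37]]
```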
Systolic Method
This will run in O(n) time!
To run in O(n) time we need n x n processing units; in our example n = 3, so we need 9 PEs:

P1 P2 P3
P4 P5 P6
P7 P8 P9
For systolic processing, the input data needs to be rearranged as follows:

Flip columns 1 & 3 of A:     Flip rows 1 & 3 of B:
a13 a12 a11                  b31 b32 b33
a23 a22 a21                  b21 b22 b23
a33 a32 a31                  b11 b12 b13

and finally stagger the data sets for input.
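The flip-and-stagger step can be sketched in Python (a hypothetical helper of my own; a 0 marks a cycle in which no operand enters). Flipping means the first element of each row/column enters the array first; staggering delays row i and column j by i and j cycles respectively:

```python
# Build the staggered input streams for an n x n systolic array:
# row i of A is delayed by i cycles (a_i1 enters first), and column j
# of B is delayed by j cycles (b_1j enters first).

def staggered_streams(A, B):
    n = len(A)
    a_streams = [[0] * i + A[i] + [0] * (n - 1 - i) for i in range(n)]
    b_streams = [[0] * j + [B[k][j] for k in range(n)] + [0] * (n - 1 - j)
                 for j in range(n)]
    return a_streams, b_streams

A = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
a_in, b_in = staggered_streams(A, A)
for row in a_in:
    print(row)
# [3, 4, 2, 0, 0]
# [0, 2, 5, 3, 0]
# [0, 0, 3, 2, 5]
```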
At every tick of the global system clock, data is passed to each processor from two different directions, then it is multiplied and the result is saved in a register.
[Figure: the reversed, staggered rows of A enter the PE array from the left (a11 first) and the reversed, staggered columns of B enter from the top (b11 first).]
3 4 2     3 4 2     23 36 28
2 5 3  *  2 5 3  =  25 39 34
3 2 5     3 2 5     28 32 37

Using a systolic array.
Tracing the computation, each Pi accumulates one product per clock tick as its operands arrive:

Clock tick 1: P1 = 3*3 = 9
Clock tick 2: P1 = 9 + 4*2 = 17, P2 = 3*4 = 12, P4 = 2*3 = 6
Clock tick 3: P1 = 17 + 2*3 = 23, P2 = 12 + 4*5 = 32, P3 = 3*2 = 6, P4 = 6 + 5*2 = 16, P5 = 2*4 = 8, P7 = 3*3 = 9
Clock tick 4: P2 = 32 + 2*2 = 36, P3 = 6 + 4*3 = 18, P4 = 16 + 3*3 = 25, P5 = 8 + 5*5 = 33, P6 = 2*2 = 4, P7 = 9 + 2*2 = 13, P8 = 3*4 = 12
Clock tick 5: P3 = 18 + 2*5 = 28, P5 = 33 + 3*2 = 39, P6 = 4 + 5*3 = 19, P7 = 13 + 5*3 = 28, P8 = 12 + 2*5 = 22, P9 = 3*2 = 6
Clock tick 6: P6 = 19 + 3*5 = 34, P8 = 22 + 2*5 = 32, P9 = 6 + 2*3 = 12
Clock tick 7: P9 = 12 + 5*5 = 37

After clock tick 7 the accumulators hold the complete product:

P1 P2 P3     23 36 28
P4 P5 P6  =  25 39 34
P7 P8 P9     28 32 37
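The whole trace can be reproduced with a small cycle-accurate simulation (a sketch of my own, not from the notes): a-values move left to right, b-values move top to bottom, and every PE multiplies its current pair of operands and accumulates on each tick.

```python
# Tick-by-tick simulation of an n x n systolic array for matrix multiply.
# acc[i][j] plays the role of P(3*i + j + 1) in the trace above.

def systolic_matmul(A, B):
    n = len(A)
    acc = [[0] * n for _ in range(n)]     # per-PE accumulator
    a_reg = [[0] * n for _ in range(n)]   # operand moving left-to-right
    b_reg = [[0] * n for _ in range(n)]   # operand moving top-to-bottom
    for t in range(3 * n - 2):            # 3n - 2 ticks for an n x n array
        # shift operands one PE right / down (reverse order avoids overwrite)
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for i in range(n - 1, 0, -1):
            for j in range(n):
                b_reg[i][j] = b_reg[i - 1][j]
        # inject the staggered boundary inputs (row i / column j lag by i / j)
        for i in range(n):
            a_reg[i][0] = A[i][t - i] if 0 <= t - i < n else 0
        for j in range(n):
            b_reg[0][j] = B[t - j][j] if 0 <= t - j < n else 0
        # every PE multiplies its pair of operands and accumulates
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc

A = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
print(systolic_matmul(A, A))  # [[23, 36, 28], [25, 39, 34], [28, 32, 37]]
```

At tick t, PE(i,j) holds A[i][t-i-j] and B[t-i-j][j] (or zeros), so over 3n - 2 ticks it accumulates exactly the n products of c(i,j), matching both the trace and the conventional method.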