Date post: | 10-May-2019 |
Upload: | duongkhanh |
(a) Three different code fragments for adding a list of four numbers.

    (i)                   (ii)                  (iii)
    1. load R1, @1000     1. load R1, @1000     1. load R1, @1000
    2. load R2, @1008     2. add R1, @1004      2. add R1, @1004
    3. add R1, @1004      3. add R1, @1008      3. load R2, @1008
    4. add R2, @100C      4. add R1, @100C      4. add R2, @100C
    5. add R1, R2         5. store R1, @2000    5. add R1, R2
    6. store R1, @2000                          6. store R1, @2000

(b) Execution schedule for code fragment (i) above.
[Schedule: the six instructions of fragment (i) plotted against instruction cycles 0 to 8.]
    IF: Instruction Fetch
    ID: Instruction Decode
    OF: Operand Fetch
    E: Instruction Execute
    WB: Write-back
    NA: No Action

(c) Hardware utilization trace for schedule in (b).
[Trace: adder utilization by clock cycle, distinguishing full issue slots from empty issue slots and marking horizontal waste and vertical waste.]
Figure 2.1 Example of a two-way superscalar execution of instructions.
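The three fragments in Figure 2.1(a) differ only in instruction order and register use, so they should leave the same sum at @2000. The sketch below checks this with a small interpreter. The interpreter, its tuple encoding, and the sample memory contents are assumptions made for illustration; they are not part of the figure.

```python
# Hypothetical mini-interpreter for the three fragments of Figure 2.1(a).
# Memory contents below are made-up sample values, not taken from the figure.

def run(fragment, memory):
    """Execute (opcode, register, operand) tuples and return the final memory."""
    mem = dict(memory)
    regs = {}

    def read(operand):
        # "@addr" names a memory word; anything else names a register.
        return mem[operand] if operand.startswith("@") else regs[operand]

    for opcode, reg, operand in fragment:
        if opcode == "load":
            regs[reg] = read(operand)
        elif opcode == "add":
            regs[reg] += read(operand)
        elif opcode == "store":
            mem[operand] = regs[reg]
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return mem


FRAGMENT_I = [
    ("load", "R1", "@1000"), ("load", "R2", "@1008"),
    ("add", "R1", "@1004"), ("add", "R2", "@100C"),
    ("add", "R1", "R2"), ("store", "R1", "@2000"),
]
FRAGMENT_II = [
    ("load", "R1", "@1000"), ("add", "R1", "@1004"),
    ("add", "R1", "@1008"), ("add", "R1", "@100C"),
    ("store", "R1", "@2000"),
]
FRAGMENT_III = [
    ("load", "R1", "@1000"), ("add", "R1", "@1004"),
    ("load", "R2", "@1008"), ("add", "R2", "@100C"),
    ("add", "R1", "R2"), ("store", "R1", "@2000"),
]

memory = {"@1000": 1, "@1004": 2, "@1008": 3, "@100C": 4}
results = [run(f, memory)["@2000"]
           for f in (FRAGMENT_I, FRAGMENT_II, FRAGMENT_III)]
print(results)  # [10, 10, 10]
```

The interpreter says nothing about timing; it only confirms that the three orderings compute the same value, which is why a scheduler is free to choose among them.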
(a) Column major data access.
(b) Row major data access.
Figure 2.2 Multiplying a matrix with a vector: (a) multiplying column-by-column, keeping a running sum; (b) computing each element of the result as a dot product of a row of the matrix with the vector.
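The two access patterns in Figure 2.2 can be written side by side. This is a minimal pure-Python sketch; the function names and the sample matrix are illustrative assumptions.

```python
def matvec_by_columns(A, b):
    """(a) Walk A one column at a time, keeping a running sum in y."""
    y = [0] * len(A)
    for j in range(len(b)):           # column j of A is scaled by b[j]
        for i in range(len(A)):
            y[i] += A[i][j] * b[j]
    return y


def matvec_by_rows(A, b):
    """(b) Each result element is the dot product of one row of A with b."""
    return [sum(a * x for a, x in zip(row, b)) for row in A]


A = [[1, 2],
     [3, 4]]
b = [5, 6]
print(matvec_by_columns(A, b))  # [17, 39]
print(matvec_by_rows(A, b))     # [17, 39]
```

Both functions return the same vector; they differ only in the order in which the elements of A are touched.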
[Diagram: (a) a global control unit driving processing elements (PEs) that are joined by an interconnection network; (b) PEs, each paired with its own control unit, joined by an interconnection network.]
PE: Processing Element
Figure 2.3 A typical SIMD architecture (a) and a typical MIMD architecture (b).
(a) The conditional statement:

    if (B == 0)
        C = A;
    else
        C = A/B;

(b) Execution of the statement in two steps:

                      Processor 0    Processor 1    Processor 2    Processor 3
    Initial values    A=5 B=0 C=0    A=4 B=2 C=0    A=1 B=1 C=0    A=0 B=0 C=0
    Step 1            A=5 B=0 C=5    Idle           Idle           A=0 B=0 C=0
    Step 2            Idle           A=4 B=2 C=2    A=1 B=1 C=1    Idle

Figure 2.4 Executing a conditional statement on an SIMD computer with four processors: (a) the conditional statement; (b) the execution of the statement in two steps.
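The two-step execution in Figure 2.4 amounts to running each branch under an activity mask. The sketch below reproduces the figure's values; the function name and the use of integer division for A/B are assumptions.

```python
def simd_conditional(A, B, C):
    """Run `if (B == 0) C = A; else C = A/B;` in two masked steps."""
    C = list(C)
    active = [b == 0 for b in B]      # mask computed once, on every processor

    # Step 1: processors with B == 0 execute C = A; the others sit idle.
    for pe in range(len(A)):
        if active[pe]:
            C[pe] = A[pe]

    # Step 2: the remaining processors execute C = A/B; the first group idles.
    for pe in range(len(A)):
        if not active[pe]:
            C[pe] = A[pe] // B[pe]    # integer division, assumed for this example
    return C


A = [5, 4, 1, 0]
B = [0, 2, 1, 0]
C = [0, 0, 0, 0]
print(simd_conditional(A, B, C))  # [5, 2, 1, 0]
```

Each step leaves part of the machine idle, which is the cost of conditional code on an SIMD computer that the figure illustrates.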
[Diagram: three panels, (a) to (c), showing processors (P), caches (C), and memories (M) attached to an interconnection network.]
Figure 2.5 Typical shared-address-space architectures: (a) Uniform-memory-access shared-address-space computer; (b) Uniform-memory-access shared-address-space computer with caches and memories; (c) Non-uniform-memory-access shared-address-space computer with local memory only.
[Diagram: a static network and an indirect network. Legend: processing node (P); network interface/switch; switching element.]
Figure 2.6 Classification of interconnection networks: (a) a static network; and (b) a dynamic network.
[Diagram: Processor 0 and Processor 1 attached to shared memory through address and data lines; in (b) each processor also has a cache/local memory.]
Figure 2.7 Bus-based interconnects (a) with no local caches; (b) with local memory/caches.
[Diagram: a grid of switching elements joining processing elements 0 to p-1 with memory banks 0 to b-1.]
Figure 2.8 A completely non-blocking crossbar network connecting p processors to b memory banks.
[Diagram: processors 0 to p-1 connected to memory banks 0 to b-1 through a multistage interconnection network with stages 1 to n.]
Figure 2.9 The schematic of a typical multistage interconnection network.
    Output       Input it is wired from
    0 (000)      = left_rotate(000)
    1 (001)      = left_rotate(100)
    2 (010)      = left_rotate(001)
    3 (011)      = left_rotate(101)
    4 (100)      = left_rotate(010)
    5 (101)      = left_rotate(110)
    6 (110)      = left_rotate(011)
    7 (111)      = left_rotate(111)

Figure 2.10 A perfect shuffle interconnection for eight inputs and outputs.
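The left_rotate labels in Figure 2.10 describe a one-bit circular left shift of the input's binary address. A minimal sketch follows; the `bits` parameter and the printing loop are illustrative assumptions.

```python
def left_rotate(i, bits):
    """Rotate the `bits`-bit address i left by one position."""
    msb = (i >> (bits - 1)) & 1                 # bit that wraps around
    return ((i << 1) & ((1 << bits) - 1)) | msb


for source in range(8):
    target = left_rotate(source, 3)
    print(f"input {source:03b} -> output {target:03b}")
```

For eight inputs this wires input 100 to output 001 and input 011 to output 110, matching the labels in the figure.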
Figure 2.11 Two switching configurations of the 2 × 2 switch: (a) Pass-through; (b) Cross-over.
[Diagram: inputs 000 to 111 and outputs 000 to 111.]
Figure 2.12 A complete omega network connecting eight inputs and eight outputs.
[Diagram: the omega network with inputs 000 to 111 and outputs 000 to 111; the contested link is marked AB.]
Figure 2.13 An example of blocking in omega network: one of the messages (010 to 111 or 110 to 100) is blocked at link AB.
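The conflict in Figure 2.13 can be reproduced with a small model of an omega network in which every stage applies a perfect shuffle and then a 2 × 2 switch that sets the lowest address bit to the next destination bit, most significant first. This is a sketch under that assumed model; the function names are hypothetical, and the link numbering is the model's own rather than the figure's A and B labels.

```python
def omega_route(src, dst, bits):
    """Return the link address a message occupies after each stage."""
    mask = (1 << bits) - 1
    pos = src
    links = []
    for stage in range(bits):
        # Perfect shuffle: rotate the current address left by one bit.
        pos = ((pos << 1) & mask) | (pos >> (bits - 1))
        # The switch output is chosen by the stage-th most significant bit of dst.
        dst_bit = (dst >> (bits - 1 - stage)) & 1
        pos = (pos & ~1) | dst_bit
        links.append(pos)
    return links


def conflicting_stages(msg_a, msg_b, bits):
    """Stages (0-based) after which both messages need the same link."""
    route_a = omega_route(*msg_a, bits)
    route_b = omega_route(*msg_b, bits)
    return [s for s, (a, b) in enumerate(zip(route_a, route_b)) if a == b]


print(omega_route(0b010, 0b111, 3))                              # [5, 3, 7]
print(omega_route(0b110, 0b100, 3))                              # [5, 2, 4]
print(conflicting_stages((0b010, 0b111), (0b110, 0b100), 3))     # [0]
```

In this model both messages need link 101 when leaving the first stage, so only one of them can proceed, which is consistent with the blocking described in the caption.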
Figure 2.14 (a) A completely-connected network of eight nodes; (b) a star-connected network of nine nodes.
Figure 2.16 Two and three dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound link (2-D torus); and (c) a 3-D mesh with no wraparound.
[Diagram: 0-D, 1-D, 2-D, 3-D, and 4-D hypercubes. Nodes carry binary labels: 0 and 1 for the 1-D hypercube, 00 to 11 for the 2-D hypercube, 000 to 111 for the 3-D hypercube, and 0000 to 1111 for the 4-D hypercube.]
Figure 2.17 Construction of hypercubes from hypercubes of lower dimension.
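The construction in Figure 2.17 is recursive: a d-dimensional hypercube is two copies of the (d-1)-dimensional one, labeled with a leading 0 and a leading 1, with corresponding nodes linked. A minimal sketch, with an illustrative function name and edge representation:

```python
def hypercube_edges(d):
    """Edge set of a d-dimensional hypercube, built from two (d-1)-D copies."""
    if d == 0:
        return set()                      # a single node and no links
    lower = hypercube_edges(d - 1)
    high = 1 << (d - 1)                   # the new leading bit
    edges = set(lower)                                     # copy labeled 0...
    edges |= {(u | high, v | high) for u, v in lower}      # copy labeled 1...
    edges |= {(u, u | high) for u in range(high)}          # join matching nodes
    return edges


for d in range(5):
    print(d, 2 ** d, len(hypercube_edges(d)))   # dimension, nodes, links
```

Every edge produced this way joins two labels that differ in exactly one bit, and the 4-D hypercube ends up with 16 nodes and 32 links.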
[Legend: processing nodes; switching nodes.]
Figure 2.18 Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.
[Diagram: processing nodes (P) in a dynamic network, with labels A, B, and C.]
Figure 2.20 Bisection width of a dynamic network is computed by examining various equi-partitions of the processing nodes and selecting the minimum number of edges crossing the partition. In this case, each partition yields an edge cut of four. Therefore, the bisection width of this graph is four.
[Diagram: processors P0 and P1 and memory, shown before and after a write. Both processors execute load x and hold x = 1; a processor then executes write #3, x. (a) Invalidate: the writer holds x = 3 while the other copies still read x = 1 and are marked invalid. (b) Update: all copies become x = 3.]
Figure 2.21 Cache coherence in multiprocessor systems: (a) Invalidate protocol; (b) Update protocol for shared variables.
[State diagram: states Invalid, Shared, and Dirty; transitions labeled read, write, read/write, C_read, C_write, and flush.]
Figure 2.22 State diagram of a simple three-state coherence protocol.
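The state diagram in Figure 2.22 can be held as a transition table. The extracted text keeps the state names and edge labels but not the arrow endpoints, so the table below is one plausible assignment and should be checked against the original figure. Here read and write are taken to be local accesses and C_read and C_write to be coherence actions caused by another processor; leaving the state unchanged for unlisted events is also an assumption.

```python
# Assumed transitions for a three-state (Invalid / Shared / Dirty) protocol.
TRANSITIONS = {
    ("Invalid", "read"): "Shared",
    ("Invalid", "write"): "Dirty",
    ("Shared", "read"): "Shared",
    ("Shared", "write"): "Dirty",
    ("Shared", "C_write"): "Invalid",
    ("Dirty", "read"): "Dirty",
    ("Dirty", "write"): "Dirty",
    ("Dirty", "C_read"): "Shared",
    ("Dirty", "C_write"): "Invalid",
    ("Dirty", "flush"): "Invalid",
}


def next_state(state, event):
    """Look up the next state; events not in the table leave it unchanged."""
    return TRANSITIONS.get((state, event), state)


state = "Invalid"
trace = []
for event in ["read", "write", "C_read", "write", "write"]:
    state = next_state(state, event)
    trace.append(state)
print(trace)  # ['Shared', 'Dirty', 'Shared', 'Dirty', 'Dirty']
```

The sample event sequence follows one cached variable through a read, a local update, a read by another processor, and two more local updates.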
Time runs downward. States: S = Shared, D = Dirty, I = Invalid.

    Instruction at   Instruction at   Variables and their      Variables and their      Variables and their
    Processor 0      Processor 1      states at Processor 0    states at Processor 1    states in Global mem.
    --------------   --------------   ----------------------   ----------------------   ----------------------
                                                                                        x = 5, D   y = 12, D
    read x           read y           x = 5, S                 y = 12, S                x = 5, S   y = 12, S
    x = x + 1        y = y + 1        x = 6, D                 y = 13, D                x = 5, I   y = 12, I
    read y           read x           x = 6, S   y = 13, S     x = 6, S   y = 13, S     x = 6, S   y = 13, S
    x = x + y        y = x + y        x = 19, D  y = 13, I     x = 6, I   y = 19, D     x = 6, I   y = 13, I
    x = x + 1        y = y + 1        x = 20, D                y = 20, D                x = 6, I   y = 13, I

Figure 2.23 Example of parallel program execution with the simple three-state coherence protocol discussed in Section ??.
[Diagram: three processors, each with a cache, tags, and snoop hardware (Snoop H/W), sharing an address/data bus to memory; a Dirty indicator is also shown.]
Figure 2.24 A simple snoopy bus-based cache coherence system.
[Diagram: (a) processors with caches connected through an interconnection network to memory and a directory holding data, state, and presence bits; (b) processors with caches, each with its own memory and presence bits/state, connected through an interconnection network.]
Figure 2.25 Architecture of typical directory-based systems: (a) a centralized directory; and (b) a distributed directory.
[Diagram: three panels, each plotting nodes P0 to P3 against time.]
(a) A single message sent over a store-and-forward network.
(b) The same message broken into two parts and sent over the network.
(c) The same message broken into four parts and sent over the network.
Figure 2.26 Passing a message from node P0 to P3 (a) through a store-and-forward communication network; (b) and (c) extending the concept to cut-through routing. The shaded regions represent the time that the message is in transit. The startup time associated with this message transfer is assumed to be zero.
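The effect shown in Figure 2.26 can be estimated with a simple pipeline model: the message is split into equal parts, each part is forwarded only after it has fully arrived at a node, and different parts can occupy different links at the same time. The function name, the parameters, and the 4-word message below are assumptions for illustration, with zero startup and per-hop time as in the caption.

```python
def transit_time(message_words, links, parts=1, word_time=1.0):
    """Time for a message to cross `links` hops when sent as `parts` equal pieces."""
    part_time = message_words / parts * word_time
    # The first part needs `links` hops; each later part arrives one slot after it.
    return (links + parts - 1) * part_time


# P0 to P3 is three links; a 4-word message is an arbitrary example size.
for parts in (1, 2, 4):
    print(parts, transit_time(4, links=3, parts=parts))
# 1 12.0
# 2 8.0
# 4 6.0
```

With one part the time is the store-and-forward cost (message time multiplied by the number of links); as the number of parts grows, the total approaches the time to send the message over a single link plus a small per-link term.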
[Diagram: nodes A, B, C, and D with flit buffers holding flits from messages 0, 1, 2, and 3; arrows mark the desired direction of message traversal.]
Figure 2.27 An example of deadlock in a cut-through routing network.
[Diagram: a three-dimensional hypercube with nodes 000 to 111, drawn three times with Ps and Pd marked.]
Step 1 (010 → 110)    Step 2 (110 → 111)
Figure 2.28 Routing a message from node Ps (010) to node Pd (111) in a three-dimensional hypercube using E-cube routing.
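E-cube routing in Figure 2.28 corrects the bits in which the current node and the destination differ, one dimension at a time, in a fixed order. The sketch below leaves that order as a parameter, because the step labels in the figure (010, then 110, then 111) correspond to fixing the highest differing bit first under the usual bit numbering, while other descriptions start from the lowest bit. The function and parameter names are hypothetical.

```python
def ecube_route(src, dst, dims, high_bit_first=False):
    """Return the list of nodes visited from src to dst in a dims-D hypercube."""
    order = range(dims - 1, -1, -1) if high_bit_first else range(dims)
    node = src
    path = [src]
    for k in order:
        if (node ^ dst) & (1 << k):      # labels differ in dimension k
            node ^= 1 << k               # cross the link along dimension k
            path.append(node)
    return path


def as_bits(path, dims):
    return [format(n, f"0{dims}b") for n in path]


print(as_bits(ecube_route(0b010, 0b111, 3, high_bit_first=True), 3))
# ['010', '110', '111']
print(as_bits(ecube_route(0b010, 0b111, 3), 3))
# ['010', '011', '111']
```

Either order takes two hops here, because 010 XOR 111 = 101 has two set bits; the path length always equals the number of differing bits.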
[Diagram: four 4 × 4 grids.
(a) Nodes of the underlying architecture, numbered 1 to 16.
(b) Processes a to p and their interactions.
(c) Intuitive mapping, row by row: a b c d / e f g h / i j k l / m n o p.
(d) Random mapping, row by row: k h m i / j p o b / d e a n / c g f, with l in the remaining position.]
Figure 2.29 Impact of process mapping on performance: (a) underlying architecture; (b) processes and their interactions; (c) an intuitive mapping of processes to nodes; and (d) a random mapping of processes to nodes.
(a)
    1-bit Gray code: 0, 1
    2-bit Gray code: 00, 01, 11, 10
    (each code is obtained by reflecting the previous one along the marked line)

    8-processor ring    3-bit Gray code    3-D hypercube
    0                   000                0
    1                   001                1
    2                   011                3
    3                   010                2
    4                   110                6
    5                   111                7
    6                   101                5
    7                   100                4

(b) [Diagram: the ring drawn on a three-dimensional hypercube with nodes 000 to 111.]

Figure 2.30 (a) A three-bit reflected Gray code ring; and (b) its embedding into a three-dimensional hypercube.
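The reflected construction in Figure 2.30(a) can be sketched directly: the d-bit code is the (d-1)-bit code followed by its reverse, with a leading 1 on the reversed half. The function names are illustrative; the closed form `i ^ (i >> 1)` is a standard equivalent, included here as a cross-check.

```python
def reflected_gray(bits):
    """List the bits-bit reflected Gray code in ring order."""
    if bits == 0:
        return [0]
    prev = reflected_gray(bits - 1)
    high = 1 << (bits - 1)
    # Keep the previous code, then append its mirror image with the new top bit set.
    return prev + [high | g for g in reversed(prev)]


def gray(i):
    """Closed form for the i-th reflected Gray code word."""
    return i ^ (i >> 1)


ring = reflected_gray(3)
print(ring)                                  # [0, 1, 3, 2, 6, 7, 5, 4]
print([format(g, "03b") for g in ring])
```

Ring processor i is placed on hypercube node `ring[i]`. Consecutive entries, including the last and the first, differ in one bit, so every ring link maps onto a single hypercube link.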
(a) Mesh node (i, j) and its hypercube label:

    (0,0) 00 00    (0,1) 00 01    (0,2) 00 11    (0,3) 00 10
    (1,0) 01 00    (1,1) 01 01    (1,2) 01 11    (1,3) 01 10
    (2,0) 11 00    (2,1) 11 01    (2,2) 11 11    (2,3) 11 10
    (3,0) 10 00    (3,1) 10 01    (3,2) 10 11    (3,3) 10 10

    Processors in a column have identical two least-significant bits.
    Processors in a row have identical two most-significant bits.

(b) Mesh node (i, j) and its hypercube label:

    (0,0) 0 00    (0,1) 0 01    (0,2) 0 11    (0,3) 0 10
    (1,0) 1 00    (1,1) 1 01    (1,2) 1 11    (1,3) 1 10

    [Diagram: the corresponding three-dimensional hypercube with nodes 000 to 111.]

Figure 2.31 (a) A 4 × 4 mesh illustrating the mapping of mesh nodes to the nodes in a four-dimensional hypercube; and (b) a 2 × 4 mesh embedded into a three-dimensional hypercube.
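The labels in Figure 2.31 are the Gray code of the row index followed by the Gray code of the column index. A minimal sketch, with hypothetical function names and a `col_bits` parameter for the number of bits used by the column part:

```python
def gray(i):
    """i-th word of the reflected Gray code."""
    return i ^ (i >> 1)


def mesh_to_hypercube(i, j, col_bits):
    """Hypercube label for mesh node (i, j): Gray(i) followed by Gray(j)."""
    return (gray(i) << col_bits) | gray(j)


# (a) 4 x 4 mesh into a four-dimensional hypercube.
for i in range(4):
    print("  ".join(f"({i},{j}) {mesh_to_hypercube(i, j, 2):04b}" for j in range(4)))

# (b) 2 x 4 mesh into a three-dimensional hypercube.
for i in range(2):
    print("  ".join(f"({i},{j}) {mesh_to_hypercube(i, j, 2):03b}" for j in range(4)))
```

Because neighboring Gray code words differ in one bit, mesh neighbors in the same row or column land on hypercube nodes that are directly linked.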
(a) Mapping a linear array into a 2D mesh (congestion 1).
(b) Inverting the mapping: mapping a 2D mesh into a linear array (congestion 5).
Figure 2.32 (a) Embedding a 16 node linear array into a 2-D mesh; and (b) the inverse of the mapping. Solid lines correspond to links in the linear array and normal lines to links in the mesh.
(a) CPU (1 GF)  (b) Chip (32 GF)  (c) Board (2 TF)  (d) Tower (16 TF)  (e) Blue Gene (1 PF)
Figure 2.34 The hierarchical architecture of Blue Gene.
[Diagram: (a) a node made up of a processor (P), control, memory, and a router; (b) the network topology.]
Figure 2.35 Interconnection network of the Cray T3E: (a) node architecture; (b) network topology.
[Diagram: three panels titled 32 Processor Configuration, 128 Processor Configuration, and 512 Processor Configuration. Labeled components: processor, C-Brick, R-Brick, I/P/D/X Brick, crossbar, memory/directory, and metarouter. Annotations: "1 R-Brick, 4 C-Bricks, and 16 processors at each vertex", "To 4 C-Bricks", "To 8 other R-Bricks", "To meta-router", and "128 processors".]
Figure 2.36 Architecture of the SGI Origin 3000 family of servers.
[Diagram: two panels titled Sun Ultra 6000 (6 - 30 processors) and Starfire Ultra 1000 (up to 64 processors). Labeled components across the panels: system boards, address bus, 32 byte data bus, four address buses, and a 16 x 16 non-blocking crossbar.]
Figure 2.37 Architecture of the Sun Enterprise family of servers.
[Diagram: hosts, each attached through an interface, linked by a set of switches, with a fiber to an external network.]
Figure 2.38 A typical connection pattern for switches and hosts in a Myrinet. The figure also illustrates routing of messages between hosts. At any point of time, multiple pairs of processors with non-conflicting paths may communicate with each other.
[Diagram: processing nodes labeled 000 to 111 on each side of the network.]
Figure 2.39 A Butterfly network with eight processing nodes.
Figure 2.41 The construction of a 4 × 4 mesh of trees: (a) a 4 × 4 grid, (b) complete binary trees imposed over individual rows, (c) complete binary trees imposed over each column, and (d) the complete 4 × 4 mesh of trees.