Session 26: Parallel Processing 2
Course: H0344/Organisasi dan Arsitektur Komputer (Computer Organization and Architecture)
Year: 2005
Version: 1/1
Learning Outcomes
By the end of this session, students are expected to be able to:
• Explain the working principles of parallel processing
Outline
• Multiple Processor Organization
• Symmetric Multiprocessors
• Cache Coherence and the MESI Protocol
• Clusters
• Non-uniform Memory Access
• Vector Computation
Cache Coherence and the MESI Protocol
The cache coherence problem: multiple copies of the same data can exist in different caches simultaneously, and if processors are allowed to update their own copies freely, an inconsistent view of memory can result.
Solutions:
• Software solutions
• Hardware solutions (cache coherence protocols):
  • Directory protocols
  • Snoopy protocols
Cache Coherence and the MESI Protocol
MESI cache line states:

                               M (Modified)     E (Exclusive)    S (Shared)          I (Invalid)
This cache line valid?         Yes              Yes              Yes                 No
The memory copy is ...         Out of date      Valid            Valid               -
Copies exist in other caches?  No               No               Maybe               Maybe
A write to this line ...       Does not go      Does not go      Goes to bus and     Goes directly
                               to bus           to bus           updates cache       to bus
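The table above can be expressed as a small transition function. The following Python sketch (illustrative, not part of the original slides) uses the standard MESI event abbreviations that also appear in the state diagram on the next slide:

```python
# A minimal sketch of the MESI state machine summarized in the table above.
# Processor-side events: RH (read hit), RMS/RME (read miss with/without a
# copy in another cache), WH (write hit), WM (write miss).
# Bus-snooping events: SHR (snoop hit on read), SHW (snoop hit on write).

PROCESSOR_TRANSITIONS = {
    ("I", "RMS"): "S",  # read miss; another cache supplies a shared copy
    ("I", "RME"): "E",  # read miss; no other cache holds the line
    ("I", "WM"): "M",   # write miss; line fetched for exclusive modification
    ("S", "RH"): "S",
    ("S", "WH"): "M",   # write hit on a shared line; other copies invalidated
    ("E", "RH"): "E",
    ("E", "WH"): "M",   # silent upgrade: no bus traffic needed
    ("M", "RH"): "M",
    ("M", "WH"): "M",
}

SNOOP_TRANSITIONS = {
    ("M", "SHR"): "S",  # dirty line observed by a reader: write back, then share
    ("E", "SHR"): "S",
    ("S", "SHR"): "S",
    ("M", "SHW"): "I",  # another cache writes the line: invalidate our copy
    ("E", "SHW"): "I",
    ("S", "SHW"): "I",
}

def next_state(state, event):
    """Next MESI state of a cache line, given a processor or snoop event."""
    if (state, event) in PROCESSOR_TRANSITIONS:
        return PROCESSOR_TRANSITIONS[(state, event)]
    return SNOOP_TRANSITIONS[(state, event)]
```

For example, a write hit on a Shared line moves it to Modified (after invalidating the other copies), while a snooped write invalidates the local copy.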
Cache Coherence and the MESI Protocol
MESI state transition diagram (states Invalid, Shared, Exclusive, Modified):
(a) Line in cache at initiating processor: transitions labeled RH (read hit), RMS (read miss, shared), RME (read miss, exclusive), WH (write hit), and WM (write miss).
(b) Line in snooping cache: transitions labeled SHR (snoop hit on read) and SHW (snoop hit on write).
Clusters
Four benefits that can be achieved with clustering:
• Absolute scalability
• Incremental scalability
• High availability
• Superior price/performance
Clusters
Cluster configurations (each server has processors P, memory M, and I/O modules, with the servers joined by a high-speed message link):
(a) Standby server with no shared disk.
(b) Shared disk: in addition to the message link, both servers are cabled to a shared RAID disk subsystem.
Clustering methods: benefits and limitations

Passive standby
  Description: A secondary server takes over in case of primary server failure.
  Benefits: Easy to implement.
  Limitations: High cost, because the secondary server is unavailable for other processing tasks.

Active secondary
  Description: The secondary server is also used for processing tasks.
  Benefits: Reduced cost, because secondary servers can be used for processing.
  Limitations: Increased complexity.

Separate servers
  Description: Separate servers have their own disks. Data are continuously copied from the primary to the secondary server.
  Benefits: High availability.
  Limitations: High network and server overhead due to copying operations.

Servers connected to disks
  Description: Servers are cabled to the same disks, but each server owns its disks. If one server fails, its disks are taken over by the other server.
  Benefits: Reduced network and server overhead due to elimination of copying operations.
  Limitations: Usually requires disk mirroring or RAID technology to compensate for the risk of disk failure.

Servers share disks
  Description: Multiple servers simultaneously share access to disks.
  Benefits: Low network and server overhead. Reduced risk of downtime caused by disk failure.
  Limitations: Requires lock manager software. Usually used with disk mirroring or RAID technology.
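The failover decision in the passive-standby method can be illustrated with a heartbeat check. This is a toy sketch, not from the slides; the 3-second timeout and the function name are hypothetical example choices:

```python
import time

# Toy illustration of passive standby: the secondary server promotes itself
# once the primary's heartbeat goes stale. HEARTBEAT_TIMEOUT is an arbitrary
# example value, not a figure from the slides.

HEARTBEAT_TIMEOUT = 3.0  # seconds

def should_fail_over(last_heartbeat, now=None):
    """True if the primary has not sent a heartbeat within the timeout."""
    if now is None:
        now = time.monotonic()
    return (now - last_heartbeat) > HEARTBEAT_TIMEOUT
```

The cost noted in the table follows directly: the secondary does nothing but run this check until the primary fails.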
Clusters
Operating system design issues:
• Failure management
• Load balancing
• Parallel computation:
  • Parallelizing compiler
  • Parallelized applications
  • Parametric computing
Non-uniform Memory Access
• Uniform memory access (UMA)
• Non-uniform memory access (NUMA)
• Cache-coherent NUMA (CC-NUMA)
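What makes access "non-uniform" is that a reference to another node's memory is much slower than a local one, so average access time depends on the fraction of references that stay local. A sketch of that weighted average (the latency numbers in the test are hypothetical, chosen only for illustration):

```python
# Illustrative NUMA timing model: average access time as a weighted mix of
# local and remote reference latencies. The latencies passed in are
# hypothetical example values, not figures from the slides.

def avg_access_time(local_fraction, t_local_ns, t_remote_ns):
    """Average memory access time given the fraction of local references."""
    return local_fraction * t_local_ns + (1.0 - local_fraction) * t_remote_ns
```

The model shows why NUMA software tries to keep data near the processors that use it: as the local fraction drops, the average latency climbs toward the remote latency.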
Non-uniform Memory Access
CC-NUMA organization: N multiprocessor nodes joined by an interconnection network. Each node contains m processors (each with its own L1 cache), a shared main memory, I/O modules, and a directory that keeps track of cached copies of lines from the node's local memory.
Vector Computation

(a) Scalar processing:
    DO 100 I = 1, N
      DO 100 J = 1, N
        C(I, J) = 0.0
        DO 100 K = 1, N
          C(I, J) = C(I, J) + A(I, K) * B(K, J)
100 CONTINUE

(b) Vector processing:
    DO 100 I = 1, N
      C(I, J) = 0.0 (J = 1, N)
      DO 100 K = 1, N
        C(I, J) = C(I, J) + A(I, K) * B(K, J) (J = 1, N)
100 CONTINUE

(c) Parallel processing:
    DO 50 J = 1, N - 1
      FORK 100
 50 CONTINUE
    J = N
100 DO 200 I = 1, N
      C(I, J) = 0.0
      DO 200 K = 1, N
        C(I, J) = C(I, J) + A(I, K) * B(K, J)
200 CONTINUE
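All three Fortran fragments above compute the same matrix product C = A * B; they differ only in how the work is organized. A Python sketch of the three organizations (illustrative only; the vector and parallel versions mimic the whole-row operation and the FORK construct, respectively):

```python
from concurrent.futures import ThreadPoolExecutor

def matmul_scalar(A, B):
    """(a) Scalar: one element C(I, J) at a time, three nested loops."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_vector(A, B):
    """(b) Vector: the inner J loop becomes one whole-row vector operation."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = A[i][k]
            # one "vector instruction" updating all J at once
            C[i] = [c + aik * b for c, b in zip(C[i], B[k])]
    return C

def matmul_parallel(A, B):
    """(c) Parallel: like FORK 100, each column J is an independent task."""
    n = len(A)
    def column(j):
        return [sum(A[i][k] * B[k][j] for k in range(n)) for i in range(n)]
    with ThreadPoolExecutor() as pool:
        cols = list(pool.map(column, range(n)))
    # reassemble columns into rows
    return [[cols[j][i] for j in range(n)] for i in range(n)]
```

The parallel version exploits the fact that distinct columns of C share no data, so the forked iterations need no synchronization until the join.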
Vector Computation
Approaches to vector computation hardware:
(a) Pipelined ALU: operands stream from memory through an input register into a single pipelined ALU, and results return through an output register to memory.
(b) Parallel ALUs: operands from the input register are distributed across multiple independent ALUs, whose results are collected in the output register.
Vector Computation
Pipelined floating-point addition zi = xi + yi proceeds through four stages: C (compare exponents), S (shift significand), A (add significands), N (normalize result).
(a) Pipelined ALU: successive operand pairs (x1, y1), (x2, y2), ... enter the pipeline one per cycle, so after the pipeline fills, one result zi emerges per cycle.
(b) Four parallel ALUs: operand pairs are distributed round-robin across four ALUs (x1..x4 in the first wave, x5..x8 in the next, and so on), producing up to four results at a time.
Vector Computation
Example: elementwise multiplication of two vectors of complex numbers (real parts AR, BR; imaginary parts AI, BI):

    DO 100 J = 1, 50
      CR(J) = AR(J) * BR(J) - AI(J) * BI(J)
100 CI(J) = AR(J) * BI(J) + AI(J) * BR(J)

(a) Storage to storage:
    Operation                    Cycles
    AR(J) * BR(J) -> T1(J)       3
    AI(J) * BI(J) -> T2(J)       3
    T1(J) - T2(J) -> CR(J)       3
    AR(J) * BI(J) -> T3(J)       3
    AI(J) * BR(J) -> T4(J)       3
    T3(J) + T4(J) -> CI(J)       3
    TOTAL                        18

(b) Register to register:
    Operation                    Cycles
    AR(J) -> V1(J)               1
    BR(J) -> V2(J)               1
    V1(J) * V2(J) -> V3(J)       1
    AI(J) -> V4(J)               1
    BI(J) -> V5(J)               1
    V4(J) * V5(J) -> V6(J)       1
    V3(J) - V6(J) -> V7(J)       1
    V7(J) -> CR(J)               1
    V1(J) * V5(J) -> V8(J)       1
    V4(J) * V2(J) -> V9(J)       1
    V8(J) + V9(J) -> V0(J)       1
    V0(J) -> CI(J)               1
    TOTAL                        12
Vector Computation

(c) Memory to register:
    Operation                        Cycles
    AR(J) -> V1(J)                   1
    V1(J) * BR(J) -> V2(J)           1
    AI(J) -> V3(J)                   1
    V3(J) * BI(J) -> V4(J)           1
    V2(J) - V4(J) -> V5(J)           1
    V5(J) -> CR(J)                   1
    V1(J) * BI(J) -> V6(J)           1
    V3(J) * BR(J) -> V7(J)           1
    V6(J) + V7(J) -> V8(J)           1
    V8(J) -> CI(J)                   1
    TOTAL                            10

(d) Compound instruction:
    Operation                        Cycles
    AR(J) -> V1(J)                   1
    V1(J) * BR(J) -> V2(J)           1
    AI(J) -> V3(J)                   1
    V2(J) - V3(J) * BI(J) -> V2(J)   1
    V2(J) -> CR(J)                   1
    V1(J) * BI(J) -> V4(J)           1
    V4(J) + V3(J) * BR(J) -> V5(J)   1
    V5(J) -> CI(J)                   1
    TOTAL                            8
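All four instruction sequences above compute the same per-element result: with A = AR + i*AI and B = BR + i*BI, CR and CI are the real and imaginary parts of the complex product A * B. A quick check of that identity (a sketch of the arithmetic, not of the vector hardware):

```python
# Per-element complex product, exactly as the loop computes it:
#   CR = AR*BR - AI*BI, CI = AR*BI + AI*BR.

def complex_product(ar, ai, br, bi):
    """Real and imaginary parts of (ar + i*ai) * (br + i*bi)."""
    cr = ar * br - ai * bi
    ci = ar * bi + ai * br
    return cr, ci
```

The sequences (a) through (d) therefore differ only in operand location (memory vs. vector registers) and in how many vector instructions the same six arithmetic operations are packed into.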