SCALABLE PARALLEL COMPUTING
CENG 546, Dr. Esma Yıldırım
Copyright © 2012, Elsevier Inc. All rights reserved. 2 - 2
What is a computing cluster?
A computing cluster is a collection of interconnected stand-alone/complete computers that cooperate as a single, integrated computing resource. A cluster exploits parallelism at the job level and supports distributed computing with higher availability.
A typical cluster:
• Merges multiple system images into a single-system image (SSI) at certain functional levels
• Applies low-latency communication protocols
• Is more loosely coupled than an SMP with an SSI
What is a Commodity Cluster?
A distributed/parallel computing system constructed entirely from commodity subsystems:
• All subcomponents can be acquired commercially and separately
• Computing elements (nodes) are employed as fully operational, standalone mainstream systems
• Two major subsystems: compute nodes and the system area network (SAN)
• Employs industry-standard interfaces for integration
• Uses industry-standard software for the majority of services
• Incorporates additional middleware for interoperability among elements
• Uses software for coordinated parallel programming of the elements
Multicomputer Clusters
A cluster is a network of computers supported by middleware and interacting by message passing:
• PC clusters (most Linux clusters)
• Workstation clusters (NOW, COW)
• Server clusters or server farms
• Clusters of SMPs or ccNUMA systems
• Cluster-structured massively parallel processors (MPPs): about 85% of the Top-500 systems
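The message-passing interaction described above can be sketched on a single machine; here Python's `multiprocessing.Pipe` stands in for cluster middleware such as MPI, and the helper names (`worker`, `offload_sum`) are illustrative, not part of any standard API.

```python
# Minimal message-passing sketch: two processes exchange data over an
# explicit channel, the way cluster nodes do over a network.
from multiprocessing import Pipe, Process

def worker(conn):
    """Receive a list of numbers over the channel, send back their sum."""
    data = conn.recv()    # blocking receive, analogous to MPI_Recv
    conn.send(sum(data))  # explicit send, analogous to MPI_Send
    conn.close()

def offload_sum(data):
    """Ship `data` to a worker process and collect the result."""
    parent_conn, child_conn = Pipe()
    p = Process(target=worker, args=(child_conn,))
    p.start()
    parent_conn.send(data)      # distribute work to the "node"
    result = parent_conn.recv() # gather the result
    p.join()
    return result

if __name__ == "__main__":
    print(offload_sum([1, 2, 3, 4]))  # 10
```

Note that, unlike shared memory, nothing is exchanged implicitly: every byte that moves between the two processes goes through an explicit send/receive pair.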
Operational Benefits of Clustering
• System availability (HA): a cluster offers inherent high availability due to the redundancy of hardware, operating systems, and applications.
• Hardware fault tolerance: a cluster has some degree of redundancy in most system components, including both hardware and software modules.
• OS and application reliability: multiple copies of the OS and applications run concurrently, and this redundancy lets surviving copies take over when one fails.
• Scalability: servers can be added to a cluster, or more clusters to a network, as application needs arise.
• High performance: running cluster-enabled programs yields higher throughput.
Scalability
The ability to deliver proportionally greater sustained performance through increased system resources.
• Strong scaling: fixed-size application problem; the application size remains constant as the system size increases.
• Weak scaling: variable-size application problem; the application size scales proportionally with the system size.
• Capability computing: in its purest form, strong scaling; marketing claims tend toward this class.
• Capacity computing: throughput computing, including job-stream workloads; in its simplest form, weak scaling.
• Cooperative computing: interacting and coordinating concurrent processes; not a widely used term; also called "coordinated computing".
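The strong/weak distinction above can be quantified with two standard models, Amdahl's law (fixed problem size) and Gustafson's law (problem grows with the machine). The 10% serial fraction used below is an illustrative assumption, not a value from the slides.

```python
# Strong vs. weak scaling for a program with serial fraction s
# running on n processors.

def amdahl_speedup(s, n):
    """Strong scaling (fixed problem size): Amdahl's law."""
    return 1.0 / (s + (1.0 - s) / n)

def gustafson_speedup(s, n):
    """Weak scaling (problem grows with the machine): Gustafson's law."""
    return s + (1.0 - s) * n

s = 0.10  # assumed serial fraction
for n in (4, 64, 1024):
    print(n, round(amdahl_speedup(s, n), 2), round(gustafson_speedup(s, n), 2))
# Strong-scaling speedup saturates near 1/s = 10x no matter how many
# processors are added, while weak-scaling speedup keeps growing with n.
```

This is why capability (strong-scaling) claims are the harder ones to deliver on, and why marketing figures often quote the weak-scaling regime.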
Performance Metrics
• Peak floating-point operations per second (flops); peak instructions per second (ips)
• Sustained throughput: average performance over a period of time
  flops, Mflops, Gflops, Tflops, Pflops (Mega-, Giga-, Tera-, Petaflops); ips, Mips; ops, Mops, …
• Cycles per instruction (cpi); alternatively, instructions per cycle (ipc)
• Memory access latency: measured in cycles (or seconds) per access
• Memory access bandwidth: bytes per second (Bps) or bits per second (bps); e.g., gigabytes per second (GBps, GB/s)
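The metrics above are related by simple arithmetic; the hardware numbers in this sketch (clock rate, cpi, bandwidth) are assumed for illustration, not taken from the slides.

```python
# Relating the performance metrics: ipc is the reciprocal of cpi,
# and sustained instruction throughput is clock rate times ipc.

clock_hz = 2.5e9      # 2.5 GHz clock (assumed)
cpi = 1.25            # average cycles per instruction (assumed)

ipc = 1.0 / cpi       # instructions per cycle = 1 / cpi
ips = clock_hz * ipc  # sustained instructions per second
print(ips / 1e9, "Gips")  # 2.0 Gips

# Bytes vs. bits: 1 GB/s = 8 Gb/s, so GBps and Gbps figures differ by 8x.
bandwidth_GBps = 12.8               # memory bandwidth in GB/s (assumed)
print(bandwidth_GBps * 8, "Gbps")   # 102.4 Gbps
```

Keeping bits (bps) and bytes (Bps) straight matters: quoting a link's bit rate as a byte rate inflates it eightfold.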
Basic Uniprocessor Architecture Elements
• I/O interface
• Memory interface
• Cache hierarchy
• Register sets
• Control
• Execution pipeline
• Arithmetic logic units
Multiprocessor
A general class of system that integrates multiple processors into an interconnected ensemble (MIMD: Multiple Instruction stream, Multiple Data stream). Different memory models:
• Distributed memory: nodes support separate address spaces
• Shared memory: symmetric multiprocessor; UMA (uniform memory access); cache coherent
• Distributed shared memory: NUMA (non-uniform memory access); cache coherent
• PGAS (partitioned global address space): NUMA; not cache coherent
• Hybrid: an ensemble of distributed-shared-memory nodes, i.e., a Massively Parallel Processor (MPP)
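The shared vs. distributed address-space distinction above can be illustrated on a single machine, with threads standing in for shared-memory processors and OS processes for distributed-memory nodes; this is a sketch of the memory models only, not cluster code.

```python
# Shared vs. distributed memory: threads share one address space,
# while processes each get a separate (private) one.
import multiprocessing
import threading

counter = 0

def thread_incr():
    global counter
    counter += 1  # shared memory: the update is visible to all threads

def proc_incr():
    global counter
    counter += 1  # separate address space: only the child's copy changes

def run_demo():
    global counter
    counter = 0
    t = threading.Thread(target=thread_incr)
    t.start(); t.join()
    p = multiprocessing.Process(target=proc_incr)
    p.start(); p.join()
    return counter  # 1: the thread's update is seen, the process's is not

if __name__ == "__main__":
    print(run_demo())
```

In a real distributed-memory machine the process's update would have to be communicated back explicitly by message passing; in a DSM/PGAS system the hardware or runtime provides the illusion of one address space across such nodes.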
Massively Parallel Processor (MPP)
• General class of large-scale multiprocessor; represents the largest systems (e.g., IBM BG/L, Cray XT3)
• Distinguished by memory strategy: distributed memory, or distributed shared memory (cache coherent, or a partitioned global address space)
• Custom interconnect network
• Potentially heterogeneous: may incorporate accelerators to boost peak performance
DM-MPP (Distributed-Memory MPP)
IBM Blue Gene/L
IBM BlueGene/L Supercomputer
The world's fastest message-passing MPP when built in 2005; built jointly by IBM and LLNL teams and funded by the US DoE ASCI research program.
Symmetric Multiprocessor (SMP)
• Building block for large MPPs
• Multiple processors: 2 to 32, now multicore
• Uniform memory access (UMA) shared memory: every processor has equal access, in equal time, to all banks of main memory
• Cache coherent: multiple copies of a variable are maintained consistent by hardware
SMP - UMA
SMP Node Diagram
[Diagram: an SMP node with microprocessors (MP), each with private L1/L2 caches and a shared L3; memory banks M1..Mn-1 behind a memory controller; storage (S); and NICs; external interfaces include USB peripherals, JTAG, Ethernet, and PCI-e.]
Legend: MP: microprocessor; L1, L2, L3: caches; M1..: memory banks; S: storage; NIC: network interface card
DSM - NUMA
Distributed Shared Memory (DSM): Non-Uniform Memory Access (NUMA)
Commodity Clusters vs “Constellations”
[Diagram: two 64-processor machines, each attached to a system area network: a constellation of 4 nodes with 16 processors each (4 × 16X), and a commodity cluster of 16 nodes with 4 processors each (16 × 4X).]
• An ensemble of N nodes, each comprising p computing elements
• The p elements are tightly coupled via shared memory (e.g., SMP, DSM)
• The N nodes are loosely coupled, i.e., distributed memory
• In a constellation, p is greater than N
• The distinction is which layer gives us the most power through parallelism
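The p > N criterion above fits in one line of code; `classify` is an illustrative helper name, not standard terminology.

```python
# Classify a machine of n_nodes loosely coupled nodes, each with
# p_per_node tightly coupled processors: per the slide's criterion,
# a constellation has more processors per node than nodes (p > N).
def classify(n_nodes, p_per_node):
    return "constellation" if p_per_node > n_nodes else "commodity cluster"

# The two 64-processor examples from the diagram:
print(classify(4, 16))   # N=4,  p=16 -> constellation
print(classify(16, 4))   # N=16, p=4  -> commodity cluster
```

The practical consequence is where the programmer looks for parallelism: in a constellation most of it lives inside a node (shared memory), in a commodity cluster across nodes (message passing).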
System Stack
Science Problems: Environmental Modeling, Physics, Computational Chemistry, etc.
Application: Coastal Modeling, Black hole simulations, etc.
Algorithms: PDE, Gaussian Elimination, 12 Dwarves, etc.
Program Source Code
Programming Languages: Fortran, C, C++, UPC, Fortress, X10, etc.
Compilers: Intel C/C++/Fortran Compilers, PGI C/C++/Fortran, IBM XLC, XLC++, XLF, etc.
Runtime Systems: Java Runtime, MPI, etc.
Operating Systems: Linux, Unix, AIX, etc.
Systems Architecture: Vector, SIMD array, MPP, Commodity Cluster
Firmware: Motherboard chipset, BIOS, NIC drivers, etc.
Microarchitectures: Intel/AMD x86, SUN SPARC, IBM Power 5/6
Logic Design: RTL
Circuit Design: ASIC, FPGA, Custom VLSI
Device Technology: NMOS, CMOS, TTL, Optical
(Model of Computation: the layers of the stack above.)
Historical Top-500 List
Clusters Dominate Top-500