Presentation Overview
1. ccNUMA Basics
2. SGI's ccNUMA Implementation (O2K)
3. Supporting OS Technology
4. SGI's NextGen ccNUMA (O3K) (brief)
5. Q&A
ccNUMA
cc: cache coherent
NUMA: Non-Uniform Memory Access
• Memory is physically distributed throughout the system
• Memory and peripherals are globally addressable
• Local memory accesses are faster than remote accesses (hence "Non-Uniform Memory Access")
• Local accesses on different nodes do not interfere with each other
Typical SMP Model
[Diagram: several processors, each with its own snoopy cache, sharing main memory and I/O over a single central bus.]
Typical MPP Model
[Diagram: independent nodes, each with its own processor, main memory, I/O, and operating system image, connected by an interconnect network (e.g., GSN, 100BaseT, Myrinet).]
Scalable Cache Coherent Memory
[Diagram: shared-memory systems (SMP) are easy to program but hard to scale; massively parallel systems (MPP) are easy to scale but hard to program; scalable shared-memory systems (ccNUMA) aim to be both easy to program and easy to scale.]
Origin ccNUMA vs other Architectures
> Single Address Space
> Modular Design
> All aspects scale as system grows
> Low-latency, high-bandwidth global memory

[Chart: Origin compared with conventional SMP, other NUMA, and clusters/MPP on the attributes above.]
Origin ccNUMA Advantage
[Diagram: interconnection bisections of four topologies: fixed-bus SMP, other NUMA, clusters/MPP, and the Origin 2000 ccNUMA router fabric.]
IDC: NUMA is the future
Source: High Performance Technical Computing Market: Review and Forecast, 1997-2002, International Data Corporation, September 1998.

Architecture type    1996 share   1997 share   Change
Bus-based SMP        54.7%        41.0%        -13.7 pts.
NUMA SMP              3.9%        20.8%        +16.9 pts.
Message Passing      16.4%        15.3%         -1.1 pts.
Switch-based SMP     13.0%        12.1%         -0.9 pts.
Uni-processor         9.4%         5.5%         -3.9 pts.
NUMA (uni-node)       1.5%         5.3%         +3.8 pts.

"Buses are the preferred approach for SMP implementations because of their relatively low cost. However, scalability is limited by the performance of the bus."

"NUMA SMP ... appears to be the preferred memory architecture for next-generation systems."
- IDC, September 1998
SGI’s First Commercial ccNUMA Implementation
Origin 2000 Architecture
History of Multiprocessing at SGI
[Chart: CPU count (2 to 256+) by year]
1993: Challenge, 2-36 CPUs
1996: Origin 2000 ccNUMA introduced, 2-32 CPUs
1997: Origin 2000, 2-64 CPUs
1998-1999: Origin 2000, 2-256 CPUs
2000: Origin 3000, 2-1024 CPUs
Origin 2000 Logical Diagram: 32-CPU Hypercube (3D)

[Diagram: 16 two-processor nodes (N) attached through 8 routers (R) arranged as a 3D hypercube.]
Origin 2000 Node Board
Basic Building Block
[Diagram: node board with two processors (each with its own cache) attached to a Hub ASIC, which connects to main memory and its directory; an extended directory supports systems larger than 32 processors.]
MIPS R12000 CPU
• 64-bit RISC design, 0.25-micron CMOS process
• Single-chip, four-way superscalar RISC dataflow architecture
• 5 fully pipelined execution units
• Supports speculative and out-of-order execution
• 8MB L2 cache on Origin 2000 (4MB on Origin 200)
• 32KB 2-way set-associative instruction and data caches
• 2,048-entry branch prediction table
• 48-entry active list
• 32-entry two-way set-associative Branch Target Address Cache (BTAC)
• Doubled L2 way-prediction table for improved L2 hit rate
• Improved branch prediction using a global history mechanism
• Improved performance-monitoring support
• Maintains code and instruction set compatibility with R10000
Memory Hierarchy
1. Local CPU registers
2. Local CPU cache: 5 ns
3. Local memory: 318 ns
4. Remote memory: 554 ns
5. Remote caches
Directory-Based Cache Coherency

Cache coherency: the system hardware guarantees that every cached copy remains a true reflection of the memory data, without software intervention.

Directory bits consist of two parts:
a. An 8-bit integer identifying the node that has exclusive ownership of the data
b. A bit map recording which nodes have copies of the data in their caches
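To make the two-part entry concrete, here is a minimal Fortran sketch (illustrative only, not SGI's actual hardware encoding) that models a directory entry as an owner id plus a sharer bit map, walking through the same sequence as the cache example on the next slide:

      program direx
c     owner   - id of the node with exclusive ownership
c     sharers - bit map, one bit per node holding a cached copy
      integer owner, sharers, node
      owner = 0
      sharers = 0
c     nodes 1 and 2 read the line: set their sharer bits
      sharers = ior(sharers, ishft(1, 1))
      sharers = ior(sharers, ishft(1, 2))
c     node 2 writes the line: it becomes exclusive owner, so
c     every other sharer bit must be cleared (copies invalidated)
      owner = 2
      sharers = ishft(1, 2)
c     does node 1 still hold a valid copy?
      node = 1
      if (iand(sharers, ishft(1, node)) .ne. 0) then
        print *, 'node', node, 'holds a valid copy'
      else
        print *, 'node', node, 'copy was invalidated'
      endif
      end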
Cache Example
1. Data is read into cache by a thread on CPU 0.
2. Threads on CPUs 1 and 2 read the data into their caches.
3. The thread on CPU 2 updates the data in its cache (the cache line is set exclusive).
4. Eventually the cache line is invalidated in the other caches.
Router and Interconnect Fabric

• 6-way non-blocking crossbar (9.3 Gbytes/sec)
• Link Level Protocol (LLP) uses CRC error checking
• 1.56 Gbytes/sec (peak, full duplex) per port
• Packet delivery prioritization (credits, aging)
• Uses an internal routing table and supports wormhole routing
• Internal buffers (SSR/SSD) down-convert 390MHz external signaling to the core frequency
• Three ports connect to external 100-conductor NumaLink cables

[Diagram: global switch interconnect; 16 nodes (N) attached through 8 routers (R) in a hypercube fabric.]
Origin 2000 Module
Basic Building Block
[Diagram: a module's midplane connecting four node boards (each: two processors with caches, Hub, main memory, and directory) through two router boards and XBOW I/O crossbars.]
Modules become Systems
• Deskside (1 module): 2-8 CPUs
• Rack (2 modules): 16 CPUs
• Multi-rack (4 modules): 32 CPUs
• Etc., up to 128 CPUs
Origin 2000 Grows to Single Rack
Single Rack System
• 2-16 CPUs
• 32GB memory
• 24 XIO I/O slots

[Diagram: 8 nodes connected through 4 routers.]
Origin 2000 Grows to Multi-Rack
Multi-Rack System
• 17-32 CPUs
• 64GB memory
• 48 XIO I/O slots
• 32-processor hypercube building block

[Diagram: 16 nodes connected through 8 routers in a hypercube.]
Origin 2000 Grows to Large Systems
Large Multi-Rack Systems
• 2-256 CPUs
• 512GB memory
• 384 I/O slots

[Diagram: 32-processor hypercube building blocks combined into larger systems.]
Bisection Bandwidth as System Grows
CPUs  Bisection BW    Bisection BW      Router hops  Router hops  Latency  Latency
      total (GB/sec)  per CPU (GB/sec)  (max)        (avg)        (max)    (avg)
2     n/a             n/a               n/a          n/a          343 ns   343 ns
4     n/a             n/a               n/a          n/a          554 ns   441 ns
8     1.56            0.195             1            0.75         759 ns   623 ns
16    3.12            0.195             2            1.63         759 ns   691 ns
32    6.24            0.195             3            2.19         836 ns   674 ns
64    12.5            0.195             5            2.97         1067 ns  851 ns
128   25.0            0.195             6            3.98         1169 ns  959 ns

Total bisection bandwidth grows linearly with CPU count, so per-CPU bisection bandwidth holds constant at 0.195 GB/sec.
Memory Latency as System Grows
CPUs  Bisection BW    Bisection BW      Router hops  Router hops  Latency  Latency
      total (GB/sec)  per CPU (GB/sec)  (max)        (avg)        (max)    (avg)
2     n/a             n/a               n/a          n/a          343 ns   343 ns
4     n/a             n/a               n/a          n/a          554 ns   441 ns
8     1.56            0.195             1            0.75         759 ns   623 ns
16    3.12            0.195             2            1.63         759 ns   691 ns
32    6.24            0.195             3            2.19         836 ns   674 ns
64    12.5            0.195             5            2.97         1067 ns  851 ns
128   25.0            0.195             6            3.98         1169 ns  959 ns

Average memory latency grows from 343 ns at 2 CPUs to 959 ns at 128 CPUs: less than 3x for a 64x larger system.
Origin 2000 Bandwidth Scales

[Chart: STREAM Triad bandwidth versus CPU count (0-64 CPUs, 0-25,000 on the bandwidth axis). Series: SGI Origin 2000/300MHz, SGI Origin 2000/250MHz, Sun UE10000, Compaq/DEC 8400, HP/Convex V2500, HP/Convex SPP.]
Performance on HPC job mix
[Chart: SPECfp_rate95 versus CPU count (8-128 CPUs, 0-35,000 on the result axis). Series: SGI Origin at 195MHz, 250MHz, and 300MHz; IBM at 120MHz and 160MHz; DEC; Sun E and UE; HP V, SPP, and X.]
Enabling Technologies
IRIX: NUMA Aware OS and System Utilities
Default Memory Placement

Memory is allocated on a "first-touch" basis:
- on the node where the process that first touches (defines) the page is running
- or as close to it as possible (to minimize latency)
- developers should therefore initialize work areas in the newly created threads that will use them (see the sketch below)

The IRIX scheduler maintains process affinity:
- jobs are re-scheduled on the processor where they last ran
- or on the other CPU in the same node
- or as close as possible (to minimize latency)
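A minimal sketch of first-touch-friendly initialization (illustrative; the array name and sizes are arbitrary): the array is initialized inside a parallel loop, so each page is first touched, and therefore placed, on the node whose threads will later use it.

      program ftouch
      integer n, i
      parameter (n = 8*1024*1024)
      real a(n)
c     parallel first touch: pages of a() are allocated on the
c     nodes where the initializing threads run
c$doacross local(i), shared(a)
      do i = 1, n
        a(i) = 0.0
      enddo
c     later parallel loops over a() with the same iteration
c     mapping now find their pages in local memory
      end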
Alternatives to “first-touch” policy
Round-Robin Allocation
- Data is distributed at run time among all the nodes used for execution
- Enabled with: setenv _DSM_ROUND_ROBIN
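Example csh usage (a sketch; a.out stands in for any libmp executable):

setenv MP_SET_NUM_THREADS 4
setenv _DSM_ROUND_ROBIN
./a.out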
Dynamic Page Migration
• IRIX can track run-time memory access patterns and dynamically copy pages to a new node.
• This is an expensive operation: it requires a daemon, TLB invalidations, and the memory copy itself.
• setenv _DSM_MIGRATION ON
• setenv _DSM_MIGRATION_LEVEL 90
Explicit Placement: source directives
      integer i, j, n, niters
      parameter (n = 8*1024*1024, niters = 1000)
c-----Note that the distribute directive is used after the arrays
c-----are declared.
      real a(n), b(n), q
c$distribute a(block), b(block)
c-----initialization
      do i = 1, n
        a(i) = 1.0 - 0.5*i
        b(i) = -10.0 + 0.01*(i*i)
      enddo
c-----real work
      do it = 1, niters
        q = 0.01*it
c$doacross local(i), shared(a,b,q), affinity(i) = data(a(i))
        do i = 1, n
          a(i) = a(i) + q*b(i)
        enddo
      enddo
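These directives take effect when the program is built with MIPSpro's multiprocessing option; a sketch (the source file name is hypothetical):

f77 -mp distribute_ex.f
./a.out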
Explicit Placement: dprof / dplace
• Used for applications that don't use libmp (e.g., explicit sproc, fork, pthreads, MPI)
• dprof: profiles an application's memory access pattern
• dplace can (see the sketch below):
  – Change the page size used
  – Enable page migration
  – Specify the topology used by the threads of a parallel program
  – Indicate resource affinities
  – Assign memory ranges to particular nodes
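A hedged sketch of dplace usage (pi.place is a hypothetical file name; see the dplace(1) man page for the exact placement-file statements):

# contents of pi.place: ask for 2 memories in a cube topology
# and 4 threads, spread across those memories
memories 2 in topology cube
threads 4
distribute threads across memories

# run the program under this placement
dplace -place pi.place ./a.out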
SGI 3rd Generation ccNUMA Implementation
Origin 3000 Family
Compute Module vs. Bricks
8P12 Compute Module (Origin 2000)
vs.
System "Bricks" (Origin 3000):
• C-Brick (compute)
• R-Brick (router)
• P-Brick (PCI I/O)
Feeds and Speeds
                        Origin 2000         Origin 3000
Node density            2 CPUs/node         4 CPUs/node
Memory density          4 GBytes/node       8 GBytes/node
Ports/router            6 ports             6 or 8 ports
Interconnect bandwidth  1.6 GBytes/sec      3.2 GBytes/sec
                        (full duplex)       (full duplex)
CPU technology          MIPS                MIPS or IA64
Taking Advantage of Multiple CPUs
Parallel Programming Models Available on Origin Family
Many Different Models and Tools To Choose From
• Automatic Parallelization Option: compiler flags
• Compiler source directives: OpenMP, c$doacross, etc.
• Explicit multi-threading: pthreads, sproc
• Message-passing APIs: MPI, PVM
Computing Value of π: Simple Serial
      program compute_pi
      integer n, i
      double precision w, x, sum, pi, f, a
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      print *, 'Enter number of intervals:'
      read *, n
c     calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
      do i = 1, n
        x = w * (i - 0.5d0)
        sum = sum + f(x)
      end do
      pi = w * sum
      print *, 'computed pi =', pi
      stop
      end
Automatic Parallelization Option
• Add-on option for the SGI MIPSpro compilers
• The compiler searches for loops that it can parallelize

f77 -apo compute_pi.f77
setenv MP_SET_NUM_THREADS 4
./a.out
OpenMP Source Directives
      program compute_pi
      integer n, i
      double precision w, x, sum, pi, f, a
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      print *, 'Enter number of intervals:'
      read *, n
c     calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
!$OMP PARALLEL DO PRIVATE(x), SHARED(w), REDUCTION(+:sum)
      do i = 1, n
        x = w * (i - 0.5d0)
        sum = sum + f(x)
      end do
!$OMP END PARALLEL DO
      pi = w * sum
      print *, 'computed pi =', pi
      stop
      end
Message Passing Interface (MPI)
      program compute_pi
      include 'mpif.h'
      integer n, i, myid, numprocs, rc, ierr
      double precision w, x, sum, mypi, pi, f, a
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
      if (myid .eq. 0) then
        print *, 'Enter number of intervals:'
        read *, n
      endif
      call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
c     calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
      do i = myid+1, n, numprocs
        x = w * (i - 0.5d0)
        sum = sum + f(x)
      end do
Message Passing Interface (MPI)
      mypi = w * sum
c     collect all the partial sums
      call MPI_REDUCE(mypi, pi, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0,
     $                MPI_COMM_WORLD, ierr)
c     node 0 prints the answer
      if (myid .eq. 0) then
        print *, 'computed pi =', pi
      endif
      call MPI_FINALIZE(rc)
      stop
      end
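To build and launch the example (a sketch assuming SGI's MPT; the source file name and process count are hypothetical):

f77 -o compute_pi compute_pi.f -lmpi
mpirun -np 4 ./compute_pi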