IBM Research
© 2004 IBM Corporation
The BlueGene/L Supercomputer
Manish Gupta
IBM Thomas J. Watson Research Center
What is BlueGene/L?
- One of the world’s fastest supercomputers
- A new approach to the design of scalable parallel systems
  - The current approach to large systems is to build clusters of large SMPs (NEC Earth Simulator, ASCI machines, Linux clusters)
    - Expensive switches for high performance
    - High electrical power consumption: low computing power density
    - Significant amount of resources devoted to improving single-thread performance
  - Blue Gene follows a more modular approach, with a simple building block (or cell) that can be replicated ad infinitum as necessary – aggregate performance is what matters
    - System-on-a-chip offers cost/performance advantages
    - Integrated networks for scalability
    - Familiar software environment, simplified for HPC
BlueGene/L Compute System-on-a-Chip ASIC
[Block diagram: two PowerPC 440 cores (one acting as I/O processor), each with 32KB/32KB L1 caches and a "Double FPU", connected over a 4:1 PLB to small L2s, a multiported shared SRAM buffer, and a shared L3 directory (with ECC) for 4 MB of embedded DRAM usable as L3 cache or memory; a DDR controller with ECC drives 144-bit-wide external DDR (256 MB); on-chip network interfaces include the torus (6 out + 6 in links at 1.4 Gb/s each), the tree (3 out + 3 in links at 2.8 Gb/s each), the global interrupt (4 global barriers or interrupts), Gbit Ethernet, and JTAG access; internal buses are 128–256 bits wide with bandwidths of roughly 2.7–22 GB/s; 5.6 GFlop/s peak per node.]
BlueGene/L
- Chip (2 processors): 2.8/5.6 GF/s, 4 MB
- Compute Card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
- Node Board (32 chips, 4x4x2; 16 compute cards): 90/180 GF/s, 8 GB DDR
- Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
- System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR
Milestones (a quick check of the peak figures follows below):
- October 2003: BG/L half-rack prototype, 500 MHz, 512 nodes / 1024 processors, 2 TFlop/s peak, 1.4 TFlop/s sustained
- April 2004: BlueGene/L 500 MHz 4-rack prototype, 4096 compute nodes, 64 I/O nodes, 16 TF/s peak, 11.68 TF/s sustained
- September 2004: BlueGene/L 700 MHz 8-rack prototype, 36.01 TF/s sustained
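A quick consistency check on the peak figures above (arithmetic added here, not on the original slide): each core's double FPU can retire two fused multiply-adds, i.e. four flops, per cycle, so

  per processor: 4 flops/cycle x 700 MHz = 2.8 GFlop/s; per node (2 processors): 5.6 GFlop/s
  per cabinet: 1024 nodes x 5.6 GFlop/s ~ 5.7 TFlop/s; full system: 64 cabinets x 5.7 TFlop/s ~ 360 TFlop/s
  at the 500 MHz prototype clock a node peaks at 4 GFlop/s, so 512 nodes ~ 2 TFlop/s and 4096 nodes ~ 16 TFlop/s, matching the milestones above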
BlueGene/L Networks
- 3-Dimensional Torus
  - Interconnects all compute nodes (65,536)
  - Virtual cut-through hardware routing
  - 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
  - 1 µs latency between nearest neighbors, 5 µs to the farthest
  - Communications backbone for computations
  - 0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth
- Global Tree
  - One-to-all broadcast functionality
  - Reduction operations functionality
  - 2.8 Gb/s of bandwidth per link
  - Latency of one-way tree traversal 2.5 µs
  - Interconnects all compute and I/O nodes (1024)
- Low-Latency Global Barrier and Interrupt
  - Latency of round trip 1.3 µs
- Ethernet
  - Incorporated into every node ASIC
  - Active in the I/O nodes (1:64)
  - All external communication (file I/O, control, user interaction, etc.)
- Control Network
How to Make System Software Scale to 64K nodes?
- Take existing software and keep on scaling it until we succeed?
- or start from scratch?
  - New languages and programming paradigms
Problems with “Existing Software” Approach
- Reliability: if software fails on any node (independently) once a month, a node failure would be expected on a 64K-node system every 40 seconds (see the estimate below)
- Interference effect: was about to send a message, but oops, got swapped out…
- Resource limitations: reserve a few buffers for every potential sender to hold early messages…
- The optimization point is different: what about small messages?
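To make the reliability bullet concrete (a back-of-the-envelope estimate, assuming independent node failures): a month is roughly 30 x 24 x 3600 ~ 2.6 million seconds, so

  system MTBF ~ node MTBF / N ~ 2,592,000 s / 65,536 ~ 40 s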
Interference problem
[Timeline diagram, built up over three slides: a sender S and receiver R exchange messages; when the receiving process has been swapped out by its local OS, the sender must wait until it is swapped back in, so one node's scheduling interference delays the whole exchange.]
Problems with “New Software” Approach
- Sure, message passing is tedious – but has anything else been proven to scale?
- Do you really want me to throw away my 1-million-line MPI program and start fresh?
- If I start fresh, what's the guarantee my "new" way of programming won't be rendered obsolete by future innovations?
Our Solution
- Simplicity
  - Avoid features not absolutely necessary for high-performance computing
  - Use simplicity to achieve both efficiency and reliability
- New organization of familiar functionality
  - Same interface, new implementation
  - Hierarchical organization
  - Message passing provides the foundation
  - Research on higher-level programming models builds on that base
BlueGene/L Software Hierarchical Organization
- Compute nodes are dedicated to running the user application, and almost nothing else – a simple compute node kernel (CNK)
- I/O nodes run Linux and provide a more complete range of OS services – files, sockets, process launch, debugging, and termination
- The service node performs system management services (e.g., heartbeating, monitoring errors) – largely transparent to application/system software
[Diagram: compute nodes form the application volume, I/O nodes the operational surface, and the service node the control surface.]
Blue Gene/L System Software Architecture
[System diagram: the machine is organized into processing sets (Pset 0 … Pset 1023), each consisting of one I/O node running Linux and the ciod daemon plus 64 compute nodes (C-Node 0 … C-Node 63) running the CNK, connected internally by the torus and tree networks; a functional Ethernet connects the I/O nodes to front-end nodes and file servers; a control Ethernet connects the service node (MMCS, scheduler, console, DB2) to the hardware through IDo chips, JTAG, and I2C.]
Programming Models and Development Environment
- Familiar aspects
  - SPMD model – Fortran, C, C++ with MPI (MPI-1 + a subset of MPI-2)
    - Full language support
    - Automatic SIMD FPU exploitation
  - Linux development environment
    - User interacts with the system through front-end nodes running Linux – compilation, job submission, debugging
    - Compute Node Kernel provides the look and feel of a Linux environment – POSIX system calls (with restrictions)
  - Tools – support for debuggers (Etnus TotalView), hardware performance monitors (HPMLib), trace-based visualization (Paraver)
- Restrictions (lead to significant scalability benefits)
  - Strictly space sharing – one parallel job (user) per partition of the machine, one process per processor of a compute node
  - Virtual memory constrained to physical memory size – implies no demand paging, only static linking
- Other issues: mapping of applications to the torus topology
  - More important for larger (multi-rack) systems
  - Working on techniques to provide transparent support
Execution Modes for Compute Node
- Communication coprocessor mode: CPU 0 executes the user application while CPU 1 handles communications
  - Preferred mode of operation for communication-intensive and memory-bandwidth-intensive codes
  - Requires coordination between the CPUs, which is handled in libraries
  - Computation offload feature (optional): CPU 1 also executes some parts of the user application offloaded by CPU 0
    - Can be selectively used for compute-bound parallel regions
    - Asynchronous coroutine model (co_start / co_join) – see the sketch after this list
    - Needs a careful sequence of cache-line flush, invalidate, and copy operations to deal with the lack of L1 cache coherence in hardware
- Virtual node mode: CPU 0 and CPU 1 each handle both computation and communication
  - Two MPI processes on each node, one bound to each processor
  - Distributed-memory semantics – the lack of L1 coherence is not a problem
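A minimal sketch of the computation-offload pattern described above. The co_start/co_join names come from the slide, but the signatures and the cache-maintenance helpers below are assumptions made for illustration only – this is not the actual BG/L runtime API.

/* Hypothetical declarations, for illustration only. */
typedef void (*co_fn_t)(void *arg);
void co_start(co_fn_t fn, void *arg);            /* run fn(arg) on CPU 1        */
void co_join(void);                              /* wait for CPU 1 to finish    */
void flush_dcache_range(void *p, long bytes);    /* assumed helpers for the     */
void invalidate_dcache_range(void *p, long bytes);/* non-coherent L1 caches     */

enum { N = 1024 };

static void scale_upper_half(void *arg)
{
    double *a = (double *)arg;
    for (int i = N / 2; i < N; i++)
        a[i] *= 2.0;                             /* CPU 1's share of the work   */
}

void scale_array(double *a)
{
    /* CPU 0 flushes its dirty lines so CPU 1 sees current data, and later
       invalidates before reading what CPU 1 produced, because the two L1
       caches are not kept coherent by hardware. */
    flush_dcache_range(a + N / 2, (N / 2) * sizeof(double));
    co_start(scale_upper_half, a);               /* offload the upper half      */
    for (int i = 0; i < N / 2; i++)
        a[i] *= 2.0;                             /* CPU 0 does the lower half   */
    co_join();
    invalidate_dcache_range(a + N / 2, (N / 2) * sizeof(double));
}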
The BlueGene/L MPICH2 organization (with ANL)
[Diagram: MPI and PMI sit on top of MPICH2's Abstract Device Interface; alongside the standard CH3 device (socket, shared-memory, and "simple" uniprocessor message passing, with mpd process management) there is a BG/L-specific bgltorus device providing point-to-point, collectives, datatypes, and topology, plus bgltorus process management and debug support via the CIO protocol; the bgltorus device is built on the Message Layer, which drives the Torus, Tree, and GI (global interrupt) packet-layer devices.]
Performance Limiting Factors in the MPI Design
Hardware factors:
- Torus network link bandwidth: 0.25 bytes/cycle/link (theoretical), 0.22 bytes/cycle/link (effective); 12 x 0.22 = 2.64 bytes/cycle/node (see the cross-check below)
- Streaming memory bandwidth: 4.3 bytes/cycle/CPU – memory copies are expensive
- CPU/network interface: 204 cycles to read a packet, 50–100 cycles to write a packet; alignment restrictions – handling badly aligned data is expensive
- Short FIFOs: the network needs frequent attention
- Network ordering semantics and routing: deterministic routing is in-order but gives bad torus performance; adaptive routing gives excellent network performance but out-of-order packets; in-order semantics is expensive
Software factors:
- Dual-core setup and memory coherency: explicit coherency management via the "blind device" and cache-flush primitives; requires communication between the processors, best done in large chunks; the coprocessor cannot manage MPI data structures
- CNK is single-threaded; MPICH2 is not thread-safe
- Context switches are expensive; interrupt-driven execution is slow
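The per-link and per-node figures can be cross-checked against the numbers quoted on the networks slide (simple arithmetic added here, assuming the 700 MHz clock):

  0.25 bytes/cycle/link x 700 MHz = 175 MB/s = 1.4 Gb/s per link (the raw link rate)
  12 links x 175 MB/s = 2.1 GB/s of raw torus bandwidth per node
  at the effective rate, 12 x 0.22 bytes/cycle = 2.64 bytes/cycle/node ~ 1.85 GB/s usable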
Packetization and packet alignment
- Constraint: the torus hardware only handles 16-byte-aligned data
- When the sender/receiver alignments are the same:
  - the head and tail are transmitted in a single "unaligned" packet
  - aligned packets go directly to/from the torus FIFOs
- When alignments differ, an extra memory copy is needed
  - sometimes the torus read operation can be combined with the re-alignment operation
(A sketch of this scheme follows below.)
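A minimal sketch of the packetization scheme described above, for the matching-alignment case; send_unaligned_packet and send_aligned_chunks are hypothetical placeholders, not the actual packet-layer API.

#include <stdint.h>

void send_unaligned_packet(const char *head, long hlen,
                           const char *tail, long tlen);   /* hypothetical */
void send_aligned_chunks(const char *p, long bytes);        /* hypothetical */

/* Split a send buffer so that the 16-byte-aligned middle can go straight
   to the torus FIFOs, while the unaligned head and tail travel together
   in one "unaligned" packet that the receiver copies into place. If the
   receiver's alignment differs, it must perform an extra memory copy (or
   fold the re-alignment into the torus read). */
void send_buffer(const char *buf, long len)
{
    uintptr_t addr = (uintptr_t)buf;
    long head = (long)((16 - (addr % 16)) % 16); /* bytes before the first 16B boundary */
    if (head > len) head = len;
    long body = (len - head) & ~15L;             /* whole 16-byte chunks */
    long tail = len - head - body;

    if (head || tail)
        send_unaligned_packet(buf, head, buf + head + body, tail);
    if (body)
        send_aligned_chunks(buf + head, body);   /* goes directly to the torus FIFOs */
}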
The BlueGene/L Message Layer
- Looks very much like LAPI, GAMA – just a lot simpler ;-)
- Simplest function: deliver a buffer of bytes from one node to another
- Can do this using one of many protocols (an illustrative selection sketch follows):
  - One-packet protocol
  - Rendezvous protocol
  - Eager protocol
  - Adaptive eager protocol!
  - Virtual node mode copy protocol!
  - Collective function protocols!
  - … and others
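For illustration, protocol selection in a message layer like this is typically a size-based switch; the threshold values and names below are assumptions for this sketch, not the actual BG/L values.

/* Pick a delivery protocol by message size (illustrative thresholds). */
enum protocol { ONE_PACKET, EAGER, RENDEZVOUS };

enum protocol choose_protocol(long bytes)
{
    const long MAX_PAYLOAD = 240;        /* fits in a single torus packet (assumed)   */
    const long EAGER_LIMIT = 10 * 1024;  /* beyond this, buffering unexpected data at
                                            the receiver gets expensive, so handshake */
    if (bytes <= MAX_PAYLOAD) return ONE_PACKET;  /* one-packet protocol              */
    if (bytes <= EAGER_LIMIT) return EAGER;       /* send the data immediately        */
    return RENDEZVOUS;                            /* request/acknowledge, then send   */
}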
Optimizing point-to-point communication (short messages: 0-10 KBytes)
- The thing to watch is overhead: bandwidth, CPU load, co-processor, network load (network load is not a factor – there is not enough network traffic)
- The BlueGene/L network requires 16-byte-aligned loads and stores, so memory copies are needed to resolve alignment issues
- Routing is a compromise:
  - Deterministic routing ensures good latency but creates network hotspots
  - Adaptive routing avoids hotspots but doubles latency
  - Currently, deterministic routing is more advantageous at up to 4K nodes
  - The balance may change as we scale to 64K nodes: shorter messages, more traffic
- Protocol overheads (see the consistency check below):

  protocol      cycles      µs
  short          2,350     3.35
  eager          4,000     5.71
  rendezvous    11,000    15.71
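The two columns of the table are consistent with the 700 MHz clock (a check added here): 2,350 cycles / 700 MHz ~ 3.36 µs; 4,000 / 700 MHz ~ 5.71 µs; 11,000 / 700 MHz ~ 15.7 µs.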
Optimizing collective performance: Barrier and short-message Allreduce
- Barrier is implemented as an all-broadcast in each dimension
- The BG/L torus hardware can send "deposit" packets along a line, giving a low-latency broadcast
- Since the packets are short, the likelihood of conflicts is low
- Latency = O(xsize + ysize + zsize)
- Allreduce for very short messages is implemented with a similar multi-phase algorithm (a sketch follows below)
- Implemented by Yili Zheng (summer student)
[Diagram: Phase 1, Phase 2, Phase 3 – one broadcast phase per torus dimension.]
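A minimal sketch of the multi-phase pattern, assuming a hypothetical line_allbroadcast() primitive that uses the torus deposit feature to deliver one packet to every node on a line; it is meant only to illustrate why the latency grows as O(xsize + ysize + zsize), and is not the actual BG/L implementation.

enum { MAX_LINE = 64 };   /* longest torus dimension (assumed) */

/* Hypothetical primitive: broadcast 'value' to every node on this node's line
   in dimension 'dim' (deposit packets) and collect the values contributed by
   all nodes on that line, including our own. Returns the count. */
int line_allbroadcast(int dim, double value, double *vals);

double torus_allreduce_sum(double local)
{
    double partial = local;
    for (int dim = 0; dim < 3; dim++) {   /* Phase 1: X, Phase 2: Y, Phase 3: Z */
        double vals[MAX_LINE];
        int n = line_allbroadcast(dim, partial, vals);
        partial = 0.0;
        for (int i = 0; i < n; i++)
            partial += vals[i];           /* after phase k, 'partial' is the sum
                                             over this node's k-dimensional slab */
    }
    return partial;                       /* every node now holds the global sum */
}

A barrier follows the same phase structure with no payload to combine.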
Barrier and short message Allreduce: Latency and Scaling
[Charts: MPI_Barrier performance comparison – latency (microseconds, 0–50) vs. number of nodes (2–1024; 1024 is in virtual node mode) for the BGL implementation vs. default MPICH2, i.e. barrier latency vs. machine size; and short-message Allreduce latency vs. message size.]
Dual FPU Architecture
- Designed with input from compiler and library developers
- SIMD instructions over both register files
  - FMA operations over double-precision data
  - More general operations available with cross and replicated operands – useful for complex arithmetic, matrix multiply, FFT
- Parallel (quadword) loads/stores
  - Fastest way to transfer data between the processors and memory
  - Data needs to be 16-byte aligned
  - Load/store with swapped order available – useful for matrix transpose
Strategy to Exploit SIMD FPU
- Automatic code generation by the compiler
- The user can help the compiler via pragmas and intrinsics
  - Pragma for data alignment: __alignx(16, var)
  - Pragmas for parallelism
    - Disjoint: #pragma disjoint (*a, *b)
    - Independent: #pragma ibm independent_loop
  - Intrinsics
    - An intrinsic function is defined for each parallel floating-point operation
    - E.g.: D = __fpmadd(B, C, A) => fpmadd rD, rA, rC, rB
    - Control over instruction selection; the compiler retains responsibility for register allocation and scheduling
- Use library routines where available
  - Dense matrix BLAS – e.g., DGEMM, DGEMV, DAXPY
  - FFT
  - MASS, MASSV
IBM Compiler Architecture
[Diagram: the C, C++, and Fortran front ends emit Wcode; TPO performs compile-step optimization and, at link time, whole-program (IPA) optimization, consuming PDF (profile-directed feedback) information from instrumented runs and producing IPA objects and partitions; TOBEY generates the optimized objects, which the system linker combines with other objects and libraries into the final EXE/DLL.]
Example: Vector Add
void vadd(double* a, double* b, double* c, int n)
{
int i;
for (i=0; i<n; i++)
{
c[i] = a[i] + b[i];
}
}
Compiler transformations for Dual FPU
void vadd(double* a, double* b, double* c, int n)
{
int i;
for (i=0; i<n-1; i+=2)
{
c[i] = a[i] + b[i];
c[i+1] = a[i+1] + b[i+1];
}
for (; i<n; i++) c[i] = a[i] + b[i];
}
Compiler transformations for Dual FPU
void vadd(double* a, double* b, double* c, int n)
{
int i;
for (i=0; i<n-1; i+=2)
{
c[i] = a[i] + b[i];
c[i+1] = a[i+1] + b[i+1];
}
for (; i<n; i++) c[i] = a[i] + b[i];
}
LFPL  (pa, sa) = (a[i], a[i+1])
LFPL  (pb, sb) = (b[i], b[i+1])
FPADD (pc, sc) = (pa+pb, sa+sb)
SFPL  (c[i], c[i+1]) = (pc, sc)
Pragmas and Advanced Compilation Techniques
void vadd(double* a, double* b, double* c, int n)
{
#pragma disjoint(*a, *b, *c)
  __alignx(16, a+0);
  __alignx(16, b+0);
  __alignx(16, c+0);
  int i;
  for (i=0; i<n; i++)
  {
    c[i] = a[i] + b[i];
  }
}

Now available (using TPO):
- Interprocedural pointer alignment analysis
- Loop transformations to enable SIMD code generation in the absence of compile-time alignment information
  - loop versioning
  - loop peeling
Coming soon
LINPACK summary
[Chart: LINPACK performance, GFlop/s (0–40,000) vs. number of nodes (1024, 2048, 4096, 8192).]
- Pass 1 hardware (@ 500 MHz)
  - #4 on the June 2004 TOP500 list: 11.68 TFlop/s on 4096 nodes, 71% of peak
- Pass 2 hardware (@ 700 MHz)
  - #8 on the June 2004 TOP500 list: 8.65 TFlop/s on 2048 nodes
  - Improved recently to 8.87 TFlop/s (would have been #7), 77% of peak
- Achieved 36.01 TFlop/s with 8192 nodes on 9/16/04, beating the Earth Simulator – 78% of peak
Cache coherence: a war story
- Buffer 1 is sent from (by CPU 0); Buffer 2, adjacent in memory, is received into (by CPU 1) and must not be touched by the main processor
- CPU 0 streams Buffer 1 into the network with a tight loop:
    loop: ld  …, buffer
          st  …, network
          bdnz loop
- On the last iteration:
  - the branch predictor predicts the branch taken, so the ld executes speculatively
  - the resulting cache miss fetches the first line of the forbidden buffer area into the cache
  - the system then executes the branch and rolls back the speculative load, but does not roll back the cache-line fetch (because it is nondestructive)
- Conclusion: CPU 0 ends up with stale data in its cache – but only when the cache line actually survives until it is used (see the sketch below)
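A minimal C rendering of the scenario (buffer names and sizes are assumptions for this sketch, not the actual BG/L code):

enum { BUF_WORDS = 1024 };

struct shared_area {
    double send_buf[BUF_WORDS];   /* Buffer 1: streamed to the network by CPU 0 */
    double recv_buf[BUF_WORDS];   /* Buffer 2: filled by CPU 1; CPU 0 must not
                                     have it cached while CPU 1 writes it       */
};

void drain_to_network(volatile double *net_fifo, struct shared_area *a)
{
    for (int i = 0; i < BUF_WORDS; i++)
        *net_fifo = a->send_buf[i];
    /* On the final iteration the branch predictor assumes another pass, so the
       load of a->send_buf[BUF_WORDS] -- which is recv_buf[0] -- issues
       speculatively. The load itself is rolled back, but the cache-line fill is
       not, so CPU 0 can later read a stale copy of recv_buf unless it
       explicitly invalidates that line first. */
}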
HPC Challenge: Random Access Updates (GUP/s)
[Chart: HPC Challenge Random Access update rate (GUP/s, 0–0.006 scale), MPI version, comparing Cray X1/ORNL (64), HP Alpha/PSC (128), HP Itanium2/OSC (128), and BG/L/YKT (64).]
HPC Challenge: Latency (usec)
[Chart: HPC Challenge random-ring latency (microseconds, 0–40 scale), comparing Cray X1/ORNL (64), HP Alpha/PSC (128), HP Itanium2/OSC (128), and BG/L/YKT (64).]
Measured MPI Send Bandwidth and Latency
[Chart: measured MPI send bandwidth (MB/s, 0–1000 scale, at 700 MHz) vs. message size (1 byte to 1 MB), for a node sending to 1 through 6 neighbors simultaneously.]
Latency @ 700 MHz = 3.3 + 0.090 x (Manhattan distance) + 0.045 x (midplane hops) µs
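Plugging in a couple of illustrative cases (arithmetic added here): a nearest neighbor one hop away in the same midplane costs about 3.3 + 0.090 x 1 ~ 3.4 µs; a node 10 hops away across 2 midplane boundaries costs about 3.3 + 0.090 x 10 + 0.045 x 2 ~ 4.3 µs.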
Noise measurements (from Adolfy Hoisie)
Ref: Blue Gene: A Performance and Scalability Report at the 512-Processor Milestone, PAL/LANL, LA-UR-04-1114, March 2004.
SPPM on fixed grid size (BG/L 700 MHz)
[Chart: SPPM scaling (128**3 grid, real*8) – processing rate (points/sec/iter, 0–10,000) vs. number of BG/L nodes / p655 processors (1–1000), comparing p655 1.7 GHz, BG/L virtual node mode (VNM), and BG/L coprocessor mode (COP).]
ASCI Purple Benchmarks – UMT2K
- UMT2K: unstructured-mesh radiation transport
- Strong scaling – the problem size is fixed
- Excellent scalability up to 128 nodes
- Load-balancing problems when scaling up to 512 nodes; algorithmic changes are needed in the original program
[Chart: speedup relative to 8 nodes (0–40) at 8, 32, 128, 256, and 512 nodes, BGL (500 MHz).]
SAGE on fixed grid size (BG/L 700 MHz)
[Chart: SAGE scaling (timing_h input, 32K cells/node) – processing rate (cells/node/sec, 0–9,000) vs. number of BG/L nodes / p655 processors (1–1000), comparing BG/L virtual node mode (VNM) and coprocessor mode (COP).]
Effect of mapping on SAGE
[Chart: SAGE scaling (timing_h) – processing rate (cells/sec/cpu, 0–8,000) vs. number of processors (1–1000), comparing a random map, the default map, and a heuristic map of MPI ranks onto the torus.]
Miranda results (by LLNL)
[Chart: Miranda scaling on BG/L (500 MHz) and MCR – running time (s, 0.1–100, log scale) vs. processor count (1–1000). (Charles Crab)]
ParaDis on BG/L vs. MCR (Linux cluster; peak 11.6 TF/s, LINPACK 7.634 TF/s)
Courtesy: Kim Yates
MCR is a large (11.2 TF) tightly coupled Linux cluster: 1,152 nodes, each with two 2.4-GHz Pentium 4 Xeon processors and 4 GB of memory.
Study of dislocation dynamics in metals
CPMD History
- Born at IBM Zurich from the original Car-Parrinello code in 1993
- Developed at many other sites over the years (more than 150,000 lines of code); it has many unique features, e.g. path-integral MD, QM/MM interfaces, TD-DFT and LR calculations
- Since 2001, distributed free of charge to academic institutions (www.cpmd.org); more than 5,000 licenses in more than 50 countries
CPMD results (BG/L 500 MHz)
[Chart: CPMD, 216-atom SiC supercell – running time (s, 1–100, log scale) vs. number of processors (8–1024), comparing BG/L, JS20, and p690.]
QCD CG Inverter – Wilson fermions (21 CG iterations, 16x4x4x4 local lattice)
[Chart: sustained performance as a percentage of peak (roughly 10–30%) vs. number of CPUs (0–4500); the theoretical maximum is ~75%.]
BGL: 1 TF/rack vs. QCDOC: ½ TF/rack
BlueGene/L software team
YORKTOWN:
• 10 people (+ students)
• Activities in all areas of system software
• Focus on development of MPI and new features
• Does some testing, but depends on Rochester
ROCHESTER:
• 15 people (plus performance & test)
• Activities in all areas of system software
• Most of the development
• Follows a process between research and product
• Main test center
HAIFA:
• 4 people
• Focus on job scheduling
• LoadLeveler
• Interfacing with Poughkeepsie
INDIA RESEARCH LAB:
• 3 people
• Checkpoint/restart
• Runtime error verification
• Benchmarking
TORONTO:
• Fortran95, C, C++ compilers
Conclusions
- Using low-power processors and chip-level integration is a promising path to supercomputing
- We have developed a BG/L system software stack with a Linux-like personality for user applications
  - Custom solution (CNK) on compute nodes for highest performance
  - Linux solution on I/O nodes for flexibility and functionality
- Encouraging performance results – NAS Parallel Benchmarks, ASCI Purple benchmarks, LINPACK, and early applications show good performance
- Many challenges ahead, particularly in performance and reliability
- Looking for collaborations
  - Work with a broader class of applications on BG/L – investigate scaling issues
  - Research on higher-level programming models
Backup
Principles of BlueGene/L system software design
- Simplicity
  - Need an operating environment for 64K nodes, 128K processors
  - Limited-purpose machine – enables simplifications
  - Reliability through simplicity
- Efficiency
  - Dedicated hardware for different functions – enables simplicity
  - Simplicity enables efficiency by dedicating hardware to function
  - High performance without sacrificing security
- Familiarity
  - Standard programming languages and libraries
  - Enough functionality to deliver a familiar system without sacrificing simplicity or high performance
Simplicity
- Strictly space sharing
  - One job (one user) per electrical partition of the machine
  - One process per compute node in the application volume
  - One thread of execution per processor
- Dedicate compute nodes to running applications
  - More comprehensive system services (I/O, process control, debugging) are offloaded to I/O nodes in the functional surface
  - System control and monitoring are offloaded to the service node in the control surface
- Hierarchical organization for management and operation
  - Single point of control at the service node
  - Form processing sets (psets), each a collection of compute nodes under the control of an I/O node (for the LLNL machine, pset = 1 I/O node + 64 compute nodes)
  - Each processing set is under the control of the Linux image in its I/O node
  - Interact with the system as a cluster of I/O nodes
- Flat view for application programs – a collection of compute processes
Efficiency
- Dedicated processor for each application-level thread
  - Deterministic, guaranteed execution
  - Maximum performance for each thread
  - Physical memory directly mapped to the application address space – no TLB misses (also, no paging)
  - Statically linked executables only
- System services executed on dedicated I/O nodes
  - No daemons interfering with application execution
  - No asynchronous events in the computational volume
- User-mode access to the communication network
  - Electrically isolated partition dedicated to one job + compute node dedicated to one process = no protection necessary!
  - User-mode communication = no context switching to supervisor mode during application execution (except for I/O)
Familiarity
- Fortran, C, C++ with MPI
  - Full language support
  - Automatic SIMD FPU exploitation
- Linux development environment
  - User interacts with the system through front-end nodes running Linux – compilation, job submission, debugging
  - Compute Node Kernel provides the look and feel of a Linux environment – POSIX system calls (with restrictions)
- Tools – support for debuggers, hardware performance monitors, trace-based visualization