IBM Research
© 2004 IBM Corporation
The BlueGene/L Supercomputer
Manish Gupta
IBM Thomas J. Watson Research Center
What is BlueGene/L?
- One of the world’s fastest supercomputers
- A new approach to the design of scalable parallel systems
  - The current approach to large systems is to build clusters of large SMPs (NEC Earth Simulator, ASCI machines, Linux clusters)
    - Expensive switches for high performance
    - High electrical power consumption: low computing power density
    - Significant amount of resources devoted to improving single-thread performance
  - Blue Gene follows a more modular approach, with a simple building block (or cell) that can be replicated ad infinitum as necessary – aggregate performance is what matters
    - System-on-a-chip offers cost/performance advantages
    - Integrated networks for scalability
    - Familiar software environment, simplified for HPC
BlueGene/L Compute System-on-a-Chip ASIC
[Block diagram: two PowerPC 440 cores (one acting as I/O processor), each with 32KB/32KB L1 caches and a "Double FPU", connected over a 4:1 PLB to small L2s, a multiported shared SRAM buffer, and a shared L3 directory (with ECC) for 4 MB of embedded DRAM usable as L3 cache or memory; a DDR controller with ECC drives 144-bit-wide external DDR (256 MB); on-chip network interfaces include the torus (6 out + 6 in links at 1.4 Gb/s each), the tree (3 out + 3 in links at 2.8 Gb/s each), the global interrupt (4 global barriers or interrupts), Gbit Ethernet, and JTAG access; internal buses are 128–256 bits wide with bandwidths of roughly 2.7–22 GB/s; 5.6 GFlop/s peak per node.]
BlueGene/L
- Chip (2 processors): 2.8/5.6 GF/s, 4 MB
- Compute Card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
- Node Board (32 chips, 4x4x2; 16 compute cards): 90/180 GF/s, 8 GB DDR
- Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
- System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR
Milestones (a quick check of the peak figures follows below):
- October 2003: BG/L half-rack prototype, 500 MHz, 512 nodes / 1024 processors, 2 TFlop/s peak, 1.4 TFlop/s sustained
- April 2004: BlueGene/L 500 MHz 4-rack prototype, 4096 compute nodes, 64 I/O nodes, 16 TF/s peak, 11.68 TF/s sustained
- September 2004: BlueGene/L 700 MHz 8-rack prototype, 36.01 TF/s sustained
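A quick consistency check on the peak figures above (arithmetic added here, not on the original slide): each core's double FPU can retire two fused multiply-adds, i.e. four flops, per cycle, so

  per processor: 4 flops/cycle x 700 MHz = 2.8 GFlop/s; per node (2 processors): 5.6 GFlop/s
  per cabinet: 1024 nodes x 5.6 GFlop/s ~ 5.7 TFlop/s; full system: 64 cabinets x 5.7 TFlop/s ~ 360 TFlop/s
  at the 500 MHz prototype clock a node peaks at 4 GFlop/s, so 512 nodes ~ 2 TFlop/s and 4096 nodes ~ 16 TFlop/s, matching the milestones above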
BlueGene/L Networks
- 3-Dimensional Torus
  - Interconnects all compute nodes (65,536)
  - Virtual cut-through hardware routing
  - 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
  - 1 µs latency between nearest neighbors, 5 µs to the farthest
  - Communications backbone for computations
  - 0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth
- Global Tree
  - One-to-all broadcast functionality
  - Reduction operations functionality
  - 2.8 Gb/s of bandwidth per link
  - Latency of one-way tree traversal 2.5 µs
  - Interconnects all compute and I/O nodes (1024)
- Low-Latency Global Barrier and Interrupt
  - Latency of round trip 1.3 µs
- Ethernet
  - Incorporated into every node ASIC
  - Active in the I/O nodes (1:64)
  - All external communication (file I/O, control, user interaction, etc.)
- Control Network
How to Make System Software Scale to 64K nodes?
- Take existing software and keep on scaling it until we succeed?
- or start from scratch?
  - New languages and programming paradigms
Problems with “Existing Software” Approach
- Reliability: if software fails on any node (independently) once a month, a node failure would be expected on a 64K-node system every 40 seconds (see the estimate below)
- Interference effect: was about to send a message, but oops, got swapped out…
- Resource limitations: reserve a few buffers for every potential sender to hold early messages…
- The optimization point is different: what about small messages?
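To make the reliability bullet concrete (a back-of-the-envelope estimate, assuming independent node failures): a month is roughly 30 x 24 x 3600 ~ 2.6 million seconds, so

  system MTBF ~ node MTBF / N ~ 2,592,000 s / 65,536 ~ 40 s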
Interference problem
[Timeline diagram, built up over three slides: a sender S and receiver R exchange messages; when the receiving process has been swapped out by its local OS, the sender must wait until it is swapped back in, so one node's scheduling interference delays the whole exchange.]
Problems with “New Software” Approach
- Sure, message passing is tedious – but has anything else been proven to scale?
- Do you really want me to throw away my 1-million-line MPI program and start fresh?
- If I start fresh, what's the guarantee my "new" way of programming won't be rendered obsolete by future innovations?
Our Solution
- Simplicity
  - Avoid features not absolutely necessary for high-performance computing
  - Use simplicity to achieve both efficiency and reliability
- New organization of familiar functionality
  - Same interface, new implementation
  - Hierarchical organization
  - Message passing provides the foundation
  - Research on higher-level programming models builds on that base
BlueGene/L Software Hierarchical Organization
- Compute nodes are dedicated to running the user application, and almost nothing else – a simple compute node kernel (CNK)
- I/O nodes run Linux and provide a more complete range of OS services – files, sockets, process launch, debugging, and termination
- The service node performs system management services (e.g., heartbeating, monitoring errors) – largely transparent to application/system software
[Diagram: compute nodes form the application volume, I/O nodes the operational surface, and the service node the control surface.]
Blue Gene/L System Software Architecture
[System diagram: the machine is organized into processing sets (Pset 0 … Pset 1023), each consisting of one I/O node running Linux and the ciod daemon plus 64 compute nodes (C-Node 0 … C-Node 63) running the CNK, connected internally by the torus and tree networks; a functional Ethernet connects the I/O nodes to front-end nodes and file servers; a control Ethernet connects the service node (MMCS, scheduler, console, DB2) to the hardware through IDo chips, JTAG, and I2C.]
Programming Models and Development Environment
- Familiar aspects
  - SPMD model – Fortran, C, C++ with MPI (MPI-1 + a subset of MPI-2)
    - Full language support
    - Automatic SIMD FPU exploitation
  - Linux development environment
    - User interacts with the system through front-end nodes running Linux – compilation, job submission, debugging
    - Compute Node Kernel provides the look and feel of a Linux environment – POSIX system calls (with restrictions)
  - Tools – support for debuggers (Etnus TotalView), hardware performance monitors (HPMLib), trace-based visualization (Paraver)
- Restrictions (lead to significant scalability benefits)
  - Strictly space sharing – one parallel job (user) per partition of the machine, one process per processor of a compute node
  - Virtual memory constrained to physical memory size – implies no demand paging, only static linking
- Other issues: mapping of applications to the torus topology
  - More important for larger (multi-rack) systems
  - Working on techniques to provide transparent support
Execution Modes for Compute Node
- Communication coprocessor mode: CPU 0 executes the user application while CPU 1 handles communications
  - Preferred mode of operation for communication-intensive and memory-bandwidth-intensive codes
  - Requires coordination between the CPUs, which is handled in libraries
  - Computation offload feature (optional): CPU 1 also executes some parts of the user application offloaded by CPU 0
    - Can be selectively used for compute-bound parallel regions
    - Asynchronous coroutine model (co_start / co_join) – see the sketch after this list
    - Needs a careful sequence of cache-line flush, invalidate, and copy operations to deal with the lack of L1 cache coherence in hardware
- Virtual node mode: CPU 0 and CPU 1 each handle both computation and communication
  - Two MPI processes on each node, one bound to each processor
  - Distributed-memory semantics – the lack of L1 coherence is not a problem
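A minimal sketch of the computation-offload pattern described above. The co_start/co_join names come from the slide, but the signatures and the cache-maintenance helpers below are assumptions made for illustration only – this is not the actual BG/L runtime API.

/* Hypothetical declarations, for illustration only. */
typedef void (*co_fn_t)(void *arg);
void co_start(co_fn_t fn, void *arg);            /* run fn(arg) on CPU 1        */
void co_join(void);                              /* wait for CPU 1 to finish    */
void flush_dcache_range(void *p, long bytes);    /* assumed helpers for the     */
void invalidate_dcache_range(void *p, long bytes);/* non-coherent L1 caches     */

enum { N = 1024 };

static void scale_upper_half(void *arg)
{
    double *a = (double *)arg;
    for (int i = N / 2; i < N; i++)
        a[i] *= 2.0;                             /* CPU 1's share of the work   */
}

void scale_array(double *a)
{
    /* CPU 0 flushes its dirty lines so CPU 1 sees current data, and later
       invalidates before reading what CPU 1 produced, because the two L1
       caches are not kept coherent by hardware. */
    flush_dcache_range(a + N / 2, (N / 2) * sizeof(double));
    co_start(scale_upper_half, a);               /* offload the upper half      */
    for (int i = 0; i < N / 2; i++)
        a[i] *= 2.0;                             /* CPU 0 does the lower half   */
    co_join();
    invalidate_dcache_range(a + N / 2, (N / 2) * sizeof(double));
}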
The BlueGene/L MPICH2 organization (with ANL)
[Diagram: MPI and PMI sit on top of MPICH2's Abstract Device Interface; alongside the standard CH3 device (socket, shared-memory, and "simple" uniprocessor message passing, with mpd process management) there is a BG/L-specific bgltorus device providing point-to-point, collectives, datatypes, and topology, plus bgltorus process management and debug support via the CIO protocol; the bgltorus device is built on the Message Layer, which drives the Torus, Tree, and GI (global interrupt) packet-layer devices.]
Performance Limiting Factors in the MPI Design
Hardware factors:
- Torus network link bandwidth: 0.25 bytes/cycle/link (theoretical), 0.22 bytes/cycle/link (effective); 12 x 0.22 = 2.64 bytes/cycle/node (see the cross-check below)
- Streaming memory bandwidth: 4.3 bytes/cycle/CPU – memory copies are expensive
- CPU/network interface: 204 cycles to read a packet, 50–100 cycles to write a packet; alignment restrictions – handling badly aligned data is expensive
- Short FIFOs: the network needs frequent attention
- Network ordering semantics and routing: deterministic routing is in-order but gives bad torus performance; adaptive routing gives excellent network performance but out-of-order packets; in-order semantics is expensive
Software factors:
- Dual-core setup and memory coherency: explicit coherency management via the "blind device" and cache-flush primitives; requires communication between the processors, best done in large chunks; the coprocessor cannot manage MPI data structures
- CNK is single-threaded; MPICH2 is not thread-safe
- Context switches are expensive; interrupt-driven execution is slow
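The per-link and per-node figures can be cross-checked against the numbers quoted on the networks slide (simple arithmetic added here, assuming the 700 MHz clock):

  0.25 bytes/cycle/link x 700 MHz = 175 MB/s = 1.4 Gb/s per link (the raw link rate)
  12 links x 175 MB/s = 2.1 GB/s of raw torus bandwidth per node
  at the effective rate, 12 x 0.22 bytes/cycle = 2.64 bytes/cycle/node ~ 1.85 GB/s usable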
Packetization and packet alignment
- Constraint: the torus hardware only handles 16-byte-aligned data
- When the sender/receiver alignments are the same:
  - the head and tail are transmitted in a single "unaligned" packet
  - aligned packets go directly to/from the torus FIFOs
- When alignments differ, an extra memory copy is needed
  - sometimes the torus read operation can be combined with the re-alignment operation
(A sketch of this scheme follows below.)
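A minimal sketch of the packetization scheme described above, for the matching-alignment case; send_unaligned_packet and send_aligned_chunks are hypothetical placeholders, not the actual packet-layer API.

#include <stdint.h>

void send_unaligned_packet(const char *head, long hlen,
                           const char *tail, long tlen);   /* hypothetical */
void send_aligned_chunks(const char *p, long bytes);        /* hypothetical */

/* Split a send buffer so that the 16-byte-aligned middle can go straight
   to the torus FIFOs, while the unaligned head and tail travel together
   in one "unaligned" packet that the receiver copies into place. If the
   receiver's alignment differs, it must perform an extra memory copy (or
   fold the re-alignment into the torus read). */
void send_buffer(const char *buf, long len)
{
    uintptr_t addr = (uintptr_t)buf;
    long head = (long)((16 - (addr % 16)) % 16); /* bytes before the first 16B boundary */
    if (head > len) head = len;
    long body = (len - head) & ~15L;             /* whole 16-byte chunks */
    long tail = len - head - body;

    if (head || tail)
        send_unaligned_packet(buf, head, buf + head + body, tail);
    if (body)
        send_aligned_chunks(buf + head, body);   /* goes directly to the torus FIFOs */
}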
The BlueGene/L Message Layer
- Looks very much like LAPI, GAMA – just a lot simpler ;-)
- Simplest function: deliver a buffer of bytes from one node to another
- Can do this using one of many protocols (an illustrative selection sketch follows):
  - One-packet protocol
  - Rendezvous protocol
  - Eager protocol
  - Adaptive eager protocol!
  - Virtual node mode copy protocol!
  - Collective function protocols!
  - … and others
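For illustration, protocol selection in a message layer like this is typically a size-based switch; the threshold values and names below are assumptions for this sketch, not the actual BG/L values.

/* Pick a delivery protocol by message size (illustrative thresholds). */
enum protocol { ONE_PACKET, EAGER, RENDEZVOUS };

enum protocol choose_protocol(long bytes)
{
    const long MAX_PAYLOAD = 240;        /* fits in a single torus packet (assumed)   */
    const long EAGER_LIMIT = 10 * 1024;  /* beyond this, buffering unexpected data at
                                            the receiver gets expensive, so handshake */
    if (bytes <= MAX_PAYLOAD) return ONE_PACKET;  /* one-packet protocol              */
    if (bytes <= EAGER_LIMIT) return EAGER;       /* send the data immediately        */
    return RENDEZVOUS;                            /* request/acknowledge, then send   */
}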
Optimizing point-to-point communication (short messages: 0-10 KBytes)
- The thing to watch is overhead: bandwidth, CPU load, co-processor, network load (network load is not a factor – there is not enough network traffic)
- The BlueGene/L network requires 16-byte-aligned loads and stores, so memory copies are needed to resolve alignment issues
- Routing is a compromise:
  - Deterministic routing ensures good latency but creates network hotspots
  - Adaptive routing avoids hotspots but doubles latency
  - Currently, deterministic routing is more advantageous at up to 4K nodes
  - The balance may change as we scale to 64K nodes: shorter messages, more traffic
- Protocol overheads (see the consistency check below):

  protocol      cycles      µs
  short          2,350     3.35
  eager          4,000     5.71
  rendezvous    11,000    15.71
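The two columns of the table are consistent with the 700 MHz clock (a check added here): 2,350 cycles / 700 MHz ~ 3.36 µs; 4,000 / 700 MHz ~ 5.71 µs; 11,000 / 700 MHz ~ 15.7 µs.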
Optimizing collective performance: Barrier and short-message Allreduce
- Barrier is implemented as an all-broadcast in each dimension
- The BG/L torus hardware can send "deposit" packets along a line, giving a low-latency broadcast
- Since the packets are short, the likelihood of conflicts is low
- Latency = O(xsize + ysize + zsize)
- Allreduce for very short messages is implemented with a similar multi-phase algorithm (a sketch follows below)
- Implemented by Yili Zheng (summer student)
[Diagram: Phase 1, Phase 2, Phase 3 – one broadcast phase per torus dimension.]
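A minimal sketch of the multi-phase pattern, assuming a hypothetical line_allbroadcast() primitive that uses the torus deposit feature to deliver one packet to every node on a line; it is meant only to illustrate why the latency grows as O(xsize + ysize + zsize), and is not the actual BG/L implementation.

enum { MAX_LINE = 64 };   /* longest torus dimension (assumed) */

/* Hypothetical primitive: broadcast 'value' to every node on this node's line
   in dimension 'dim' (deposit packets) and collect the values contributed by
   all nodes on that line, including our own. Returns the count. */
int line_allbroadcast(int dim, double value, double *vals);

double torus_allreduce_sum(double local)
{
    double partial = local;
    for (int dim = 0; dim < 3; dim++) {   /* Phase 1: X, Phase 2: Y, Phase 3: Z */
        double vals[MAX_LINE];
        int n = line_allbroadcast(dim, partial, vals);
        partial = 0.0;
        for (int i = 0; i < n; i++)
            partial += vals[i];           /* after phase k, 'partial' is the sum
                                             over this node's k-dimensional slab */
    }
    return partial;                       /* every node now holds the global sum */
}

A barrier follows the same phase structure with no payload to combine.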
Barrier and short message Allreduce: Latency and Scaling
[Charts: MPI_Barrier performance comparison – latency (microseconds, 0–50) vs. number of nodes (2–1024; 1024 is in virtual node mode) for the BGL implementation vs. default MPICH2, i.e. barrier latency vs. machine size; and short-message Allreduce latency vs. message size.]
Dual FPU Architecture
- Designed with input from compiler and library developers
- SIMD instructions over both register files
  - FMA operations over double-precision data
  - More general operations available with cross and replicated operands – useful for complex arithmetic, matrix multiply, FFT
- Parallel (quadword) loads/stores
  - Fastest way to transfer data between the processors and memory
  - Data needs to be 16-byte aligned
  - Load/store with swapped order available – useful for matrix transpose
Strategy to Exploit SIMD FPU
- Automatic code generation by the compiler
- The user can help the compiler via pragmas and intrinsics
  - Pragma for data alignment: __alignx(16, var)
  - Pragmas for parallelism
    - Disjoint: #pragma disjoint (*a, *b)
    - Independent: #pragma ibm independent_loop
  - Intrinsics
    - An intrinsic function is defined for each parallel floating-point operation
    - E.g.: D = __fpmadd(B, C, A) => fpmadd rD, rA, rC, rB
    - Control over instruction selection; the compiler retains responsibility for register allocation and scheduling
- Use library routines where available
  - Dense matrix BLAS – e.g., DGEMM, DGEMV, DAXPY
  - FFT
  - MASS, MASSV
IBM Compiler Architecture
[Diagram: the C, C++, and Fortran front ends emit Wcode; TPO performs compile-step optimization and, at link time, whole-program (IPA) optimization, consuming PDF (profile-directed feedback) information from instrumented runs and producing IPA objects and partitions; TOBEY generates the optimized objects, which the system linker combines with other objects and libraries into the final EXE/DLL.]
Example: Vector Add
void vadd(double* a, double* b, double* c, int n)
{
int i;
for (i=0; i<n; i++)
{
c[i] = a[i] + b[i];
}
}
Compiler transformations for Dual FPU
void vadd(double* a, double* b, double* c, int n)
{
int i;
for (i=0; i<n-1; i+=2)
{
c[i] = a[i] + b[i];
c[i+1] = a[i+1] + b[i+1];
}
for (; i<n; i++) c[i] = a[i] + b[i];
}
Compiler transformations for Dual FPU
void vadd(double* a, double* b, double* c, int n)
{
int i;
for (i=0; i<n-1; i+=2)
{
c[i] = a[i] + b[i];
c[i+1] = a[i+1] + b[i+1];
}
for (; i<n; i++) c[i] = a[i] + b[i];
}
LFPL  (pa, sa) = (a[i], a[i+1])
LFPL  (pb, sb) = (b[i], b[i+1])
FPADD (pc, sc) = (pa+pb, sa+sb)
SFPL  (c[i], c[i+1]) = (pc, sc)
Pragmas and Advanced Compilation Techniques
void vadd(double* a, double* b, double* c, int n)
{
#pragma disjoint(*a, *b, *c)
  __alignx(16, a+0);
  __alignx(16, b+0);
  __alignx(16, c+0);
  int i;
  for (i=0; i<n; i++)
  {
    c[i] = a[i] + b[i];
  }
}

Now available (using TPO):
- Interprocedural pointer alignment analysis
- Loop transformations to enable SIMD code generation in the absence of compile-time alignment information
  - loop versioning
  - loop peeling
Coming soon
LINPACK summary
[Chart: LINPACK performance, GFlop/s (0–40,000) vs. number of nodes (1024, 2048, 4096, 8192).]
- Pass 1 hardware (@ 500 MHz)
  - #4 on the June 2004 TOP500 list: 11.68 TFlop/s on 4096 nodes, 71% of peak
- Pass 2 hardware (@ 700 MHz)
  - #8 on the June 2004 TOP500 list: 8.65 TFlop/s on 2048 nodes
  - Improved recently to 8.87 TFlop/s (would have been #7), 77% of peak
- Achieved 36.01 TFlop/s with 8192 nodes on 9/16/04, beating the Earth Simulator – 78% of peak
Cache coherence: a war story
- Buffer 1 is sent from (by CPU 0); Buffer 2, adjacent in memory, is received into (by CPU 1) and must not be touched by the main processor
- CPU 0 streams Buffer 1 into the network with a tight loop:
    loop: ld  …, buffer
          st  …, network
          bdnz loop
- On the last iteration:
  - the branch predictor predicts the branch taken, so the ld executes speculatively
  - the resulting cache miss fetches the first line of the forbidden buffer area into the cache
  - the system then executes the branch and rolls back the speculative load, but does not roll back the cache-line fetch (because it is nondestructive)
- Conclusion: CPU 0 ends up with stale data in its cache – but only when the cache line actually survives until it is used (see the sketch below)
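A minimal C rendering of the scenario (buffer names and sizes are assumptions for this sketch, not the actual BG/L code):

enum { BUF_WORDS = 1024 };

struct shared_area {
    double send_buf[BUF_WORDS];   /* Buffer 1: streamed to the network by CPU 0 */
    double recv_buf[BUF_WORDS];   /* Buffer 2: filled by CPU 1; CPU 0 must not
                                     have it cached while CPU 1 writes it       */
};

void drain_to_network(volatile double *net_fifo, struct shared_area *a)
{
    for (int i = 0; i < BUF_WORDS; i++)
        *net_fifo = a->send_buf[i];
    /* On the final iteration the branch predictor assumes another pass, so the
       load of a->send_buf[BUF_WORDS] -- which is recv_buf[0] -- issues
       speculatively. The load itself is rolled back, but the cache-line fill is
       not, so CPU 0 can later read a stale copy of recv_buf unless it
       explicitly invalidates that line first. */
}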
HPC Challenge: Random Access Updates (GUP/s)
[Chart: HPC Challenge Random Access update rate (GUP/s, 0–0.006 scale), MPI version, comparing Cray X1/ORNL (64), HP Alpha/PSC (128), HP Itanium2/OSC (128), and BG/L/YKT (64).]
HPC Challenge: Latency (usec)
[Chart: HPC Challenge random-ring latency (microseconds, 0–40 scale), comparing Cray X1/ORNL (64), HP Alpha/PSC (128), HP Itanium2/OSC (128), and BG/L/YKT (64).]
Measured MPI Send Bandwidth and Latency
[Chart: measured MPI send bandwidth (MB/s, 0–1000 scale, at 700 MHz) vs. message size (1 byte to 1 MB), for a node sending to 1 through 6 neighbors simultaneously.]
Latency @ 700 MHz = 3.3 + 0.090 x (Manhattan distance) + 0.045 x (midplane hops) µs
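Plugging in a couple of illustrative cases (arithmetic added here): a nearest neighbor one hop away in the same midplane costs about 3.3 + 0.090 x 1 ~ 3.4 µs; a node 10 hops away across 2 midplane boundaries costs about 3.3 + 0.090 x 10 + 0.045 x 2 ~ 4.3 µs.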
Noise measurements (from Adolfy Hoisie)
Ref: Blue Gene: A Performance and Scalability Report at the 512-Processor Milestone, PAL/LANL, LA-UR-04-1114, March 2004.
SPPM on fixed grid size (BG/L 700 MHz)
[Chart: SPPM scaling (128**3 grid, real*8) – processing rate (points/sec/iter, 0–10,000) vs. number of BG/L nodes / p655 processors (1–1000), comparing p655 1.7 GHz, BG/L virtual node mode (VNM), and BG/L coprocessor mode (COP).]
ASCI Purple Benchmarks – UMT2K
- UMT2K: unstructured-mesh radiation transport
- Strong scaling – the problem size is fixed
- Excellent scalability up to 128 nodes
- Load-balancing problems when scaling up to 512 nodes; algorithmic changes are needed in the original program
[Chart: speedup relative to 8 nodes (0–40) at 8, 32, 128, 256, and 512 nodes, BGL (500 MHz).]
SAGE on fixed grid size (BG/L 700 MHz)
[Chart: SAGE scaling (timing_h input, 32K cells/node) – processing rate (cells/node/sec, 0–9,000) vs. number of BG/L nodes / p655 processors (1–1000), comparing BG/L virtual node mode (VNM) and coprocessor mode (COP).]
Effect of mapping on SAGE
[Chart: SAGE scaling (timing_h) – processing rate (cells/sec/cpu, 0–8,000) vs. number of processors (1–1000), comparing a random map, the default map, and a heuristic map of MPI ranks onto the torus.]
Miranda results (by LLNL)
[Chart: Miranda scaling on BG/L (500 MHz) and MCR – running time (s, 0.1–100, log scale) vs. processor count (1–1000). (Charles Crab)]
ParaDis on BG/L vs. MCR (Linux cluster; peak 11.6 TF/s, LINPACK 7.634 TF/s)
Courtesy: Kim Yates
MCR is a large (11.2 TF) tightly coupled Linux cluster: 1,152 nodes, each with two 2.4-GHz Pentium 4 Xeon processors and 4 GB of memory.
Study of dislocation dynamics in metals
CPMD History
- Born at IBM Zurich from the original Car-Parrinello code in 1993
- Developed at many other sites over the years (more than 150,000 lines of code); it has many unique features, e.g. path-integral MD, QM/MM interfaces, TD-DFT and LR calculations
- Since 2001, distributed free of charge to academic institutions (www.cpmd.org); more than 5,000 licenses in more than 50 countries
CPMD results (BG/L 500 MHz)
[Chart: CPMD, 216-atom SiC supercell – running time (s, 1–100, log scale) vs. number of processors (8–1024), comparing BG/L, JS20, and p690.]
QCD CG Inverter – Wilson fermions (21 CG iterations, 16x4x4x4 local lattice)
[Chart: sustained performance as a percentage of peak (roughly 10–30%) vs. number of CPUs (0–4500); the theoretical maximum is ~75%.]
BGL: 1 TF/rack vs. QCDOC: ½ TF/rack
BlueGene/L software team
YORKTOWN:
• 10 people (+ students)
• Activities in all areas of system software
• Focus on development of MPI and new features
• Does some testing, but depends on Rochester
ROCHESTER:
• 15 people (plus performance & test)
• Activities in all areas of system software
• Most of the development
• Follows a process between research and product
• Main test center
HAIFA:
• 4 people
• Focus on job scheduling
• LoadLeveler
• Interfacing with Poughkeepsie
INDIA RESEARCH LAB:
• 3 people
• Checkpoint/restart
• Runtime error verification
• Benchmarking
TORONTO:
• Fortran95, C, C++ compilers
Conclusions
- Using low-power processors and chip-level integration is a promising path to supercomputing
- We have developed a BG/L system software stack with a Linux-like personality for user applications
  - Custom solution (CNK) on compute nodes for highest performance
  - Linux solution on I/O nodes for flexibility and functionality
- Encouraging performance results – NAS Parallel Benchmarks, ASCI Purple benchmarks, LINPACK, and early applications show good performance
- Many challenges ahead, particularly in performance and reliability
- Looking for collaborations
  - Work with a broader class of applications on BG/L – investigate scaling issues
  - Research on higher-level programming models
Backup
Principles of BlueGene/L system software design
- Simplicity
  - Need an operating environment for 64K nodes, 128K processors
  - Limited-purpose machine – enables simplifications
  - Reliability through simplicity
- Efficiency
  - Dedicated hardware for different functions – enables simplicity
  - Simplicity enables efficiency by dedicating hardware to function
  - High performance without sacrificing security
- Familiarity
  - Standard programming languages and libraries
  - Enough functionality to deliver a familiar system without sacrificing simplicity or high performance
Simplicity
- Strictly space sharing
  - One job (one user) per electrical partition of the machine
  - One process per compute node in the application volume
  - One thread of execution per processor
- Dedicate compute nodes to running applications
  - More comprehensive system services (I/O, process control, debugging) are offloaded to I/O nodes in the functional surface
  - System control and monitoring are offloaded to the service node in the control surface
- Hierarchical organization for management and operation
  - Single point of control at the service node
  - Form processing sets (psets), each a collection of compute nodes under the control of an I/O node (for the LLNL machine, pset = 1 I/O node + 64 compute nodes)
  - Each processing set is under the control of the Linux image in its I/O node
  - Interact with the system as a cluster of I/O nodes
- Flat view for application programs – a collection of compute processes
Efficiency
- Dedicated processor for each application-level thread
  - Deterministic, guaranteed execution
  - Maximum performance for each thread
  - Physical memory directly mapped to the application address space – no TLB misses (also, no paging)
  - Statically linked executables only
- System services executed on dedicated I/O nodes
  - No daemons interfering with application execution
  - No asynchronous events in the computational volume
- User-mode access to the communication network
  - Electrically isolated partition dedicated to one job + compute node dedicated to one process = no protection necessary!
  - User-mode communication = no context switching to supervisor mode during application execution (except for I/O)
Familiarity
- Fortran, C, C++ with MPI
  - Full language support
  - Automatic SIMD FPU exploitation
- Linux development environment
  - User interacts with the system through front-end nodes running Linux – compilation, job submission, debugging
  - Compute Node Kernel provides the look and feel of a Linux environment – POSIX system calls (with restrictions)
- Tools – support for debuggers, hardware performance monitors, trace-based visualization