
Implementing Optimized Collective Communication Routines on the IBM BlueGene/L Supercomputer

CS 425 term project
By Sam Miller
samm@scl.ameslab.gov
April 18, 2005


Outline

• What is BlueGene/L? (5 slides)
• Hardware (3 slides)
• Communication Networks (2 slides)
• Software (2 slides)
• MPI and MPICH (1 slide)
• Collective Algorithms (5 slides)
• Better Collective Algorithms! (12 slides)
• Performance
• Conclusion


Abbreviations Today

• BGL = BlueGene/L
• CNK = Compute Node Kernel
• MPI = Message Passing Interface
• MPICH2 = MPICH 2.0 from Argonne Labs
• ASIC = Application-Specific Integrated Circuit
• ALU = Arithmetic Logic Unit
• IBM = International Biscuit Makers (duh)


What is BGL 1/2

• Massively parallel distributed memory cluster of embedded processors

• 65,536 nodes! 131,072 processors!
• Low power requirements
• Relatively small, compared to predecessors
• Half system installed at LLNL
• Other systems going online too


What is BGL 2/2

• BlueGene/L at LLNL (360 Tflops)
  – 2,500 square feet, half a tennis court

• Earth Simulator (40 Tflops)
  – 35,000 square feet, requires an entire building


Hardware 1/3

• CPU is the PowerPC 440
  – Designed for embedded applications
  – Low power, low clock frequency (700 MHz)
  – 32-bit :-(

• FPU is a custom 64-bit design
  – Each PPC 440 core has two of these
  – The two FPUs operate in parallel
  – At 700 MHz this is 2.8 Gflops per PPC 440 core
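(Presumably the 2.8 Gflops figure comes from each of the two FPUs completing one fused multiply-add, i.e. 2 floating-point operations, per cycle: 2 FPUs x 2 flops x 0.7 GHz = 2.8 Gflops per core.)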


Hardware 2/3

• BGL ASIC
  – Two PPC 440 cores, four FPUs
  – L1, L2, L3 caches
  – DDR memory controller
  – Logic for 5 separate communication networks
  – This forms one compute node


Hardware 3/3

• To build the entire 65,536-node system:
  – Two ASICs with 256 or 512 MB DDR RAM form a compute card
  – Sixteen compute cards form a node board
  – Sixteen node boards form a midplane
  – Two midplanes form a rack
  – Sixty-four racks bring us to: 2 x 16 x 16 x 2 x 64 = 65,536!



Communication Networks 1/2

• Five different networks
  – 3D torus
    • Primary network for the MPI library
  – Global tree
    • Used for collectives on MPI_COMM_WORLD
    • Used by compute nodes to communicate with I/O nodes
  – Global interrupt
    • 1.5 usec latency over the entire 65k-node system!
  – JTAG
    • Used for node bootup and servicing
  – Gigabit Ethernet
    • Used by I/O nodes


Communication Networks 2/2

• Torus
  – 6 neighbors have bi-directional links at 154 MB/sec
  – Guarantees reliable, deadlock-free delivery
  – Chosen due to high-bandwidth nearest-neighbor connectivity
  – Used in prior supercomputers, such as the Cray T3E
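As a rough illustration of the torus topology, the sketch below computes the six neighbor coordinates of a node, with wraparound at the edges; the coord_t type, the helper name, and the X/Y/Z dimension parameters are my own, not BGL system software.

    /* Illustrative only: 3D torus neighbor arithmetic. Each node at (x, y, z)
     * has six neighbors, and coordinates wrap around at the torus edges. */
    typedef struct { int x, y, z; } coord_t;

    static void torus_neighbors(coord_t c, int X, int Y, int Z, coord_t nbr[6])
    {
        nbr[0] = (coord_t){ (c.x + 1) % X,     c.y,               c.z };            /* +x */
        nbr[1] = (coord_t){ (c.x + X - 1) % X, c.y,               c.z };            /* -x */
        nbr[2] = (coord_t){ c.x,               (c.y + 1) % Y,     c.z };            /* +y */
        nbr[3] = (coord_t){ c.x,               (c.y + Y - 1) % Y, c.z };            /* -y */
        nbr[4] = (coord_t){ c.x,               c.y,               (c.z + 1) % Z };  /* +z */
        nbr[5] = (coord_t){ c.x,               c.y,               (c.z + Z - 1) % Z };  /* -z */
    }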


Software 1/2

• Compute nodes run a stripped-down, Linux-like kernel called CNK
  – Two threads, 1 per CPU
  – No context switching, no virtual memory
  – Standard glibc interface, easy to port
  – 5,000 lines of C++

• I/O nodes run standard PPC Linux
  – They have disk access
  – They run a daemon called the console I/O daemon (ciod)


Software 2/2

• Network software has 3 layers
  – Topmost is the MPI library
  – Middle is the message layer
    • Allows transmission of arbitrary buffer sizes
  – Bottom is the packet layer
    • Very simple
    • Stateless interface to the torus, tree, and GI hardware
    • Facilitates sending & receiving packets
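A purely hypothetical sketch of what a stateless packet-layer interface could look like follows; these names, fields, and signatures are illustrative only and are not the actual BGL packet layer API.

    /* Hypothetical interface sketch, not the real BGL packet layer. */
    #define TORUS_PAYLOAD_BYTES 240   /* payload size quoted later in the talk */

    typedef struct {
        int  dest_x, dest_y, dest_z;          /* destination torus coordinates */
        char payload[TORUS_PAYLOAD_BYTES];    /* raw payload bytes             */
    } torus_packet_t;

    /* Inject one packet into a torus send FIFO; no per-connection state is kept. */
    int torus_packet_send(const torus_packet_t *pkt);

    /* Pull the next packet from a reception FIFO; returns 0 if none is pending. */
    int torus_packet_recv(torus_packet_t *pkt);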


MPICH

• Developed by Argonne National Labs
• Open-source, freely available, standards-compliant MPI implementation
• Used by many vendors
• Chosen by IBM due to its use of the Abstract Device Interface (ADI) and its design for scalability


Collective Algorithms 1/5

• Collectives can be implemented with basic sends and receives (see the sketch below)
  – Better algorithms exist

• Default MPICH2 collectives perform poorly on BGL
  – Assume a crossbar network, poor node mapping
  – Point-to-point messages incur high overhead
  – No knowledge of network-specific features
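To make the first bullet concrete, here is a minimal sketch of a broadcast built purely from point-to-point calls (a linear variant; the function name is mine, and real MPICH2 uses smarter schemes such as a binomial tree):

    #include <mpi.h>

    /* Naive broadcast: the root sends the whole buffer to every other rank. */
    static void naive_bcast(void *buf, int count, MPI_Datatype type,
                            int root, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank == root) {
            for (int dst = 0; dst < size; dst++)
                if (dst != root)
                    MPI_Send(buf, count, type, dst, 0, comm);
        } else {
            MPI_Recv(buf, count, type, root, 0, comm, MPI_STATUS_IGNORE);
        }
    }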


Collective Algorithms 2/5

• Optimization is tricky (see the dispatch sketch below)
  – Message size and communicator shape are the deciding factors
  – Large messages == optimize bandwidth
  – Short messages == optimize latency

• I will not talk about short-message collectives further today

• If an optimized algorithm isn't available, BGL falls back on the default MPICH2 one
  – It will work, because point-to-point messages work
  – Performance will suck, however
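A minimal sketch of the size-based dispatch idea, assuming a made-up 2048-byte cutoff and two hypothetical helpers (short_bcast, long_bcast); note that the decision here is purely local to each rank, which is exactly the problem the next slide raises:

    #include <mpi.h>
    #include <stddef.h>

    /* Hypothetical stand-ins for a latency-optimized and a bandwidth-optimized
     * broadcast; not BGL's actual routines. */
    int short_bcast(void *buf, int count, MPI_Datatype type, int root, MPI_Comm comm);
    int long_bcast(void *buf, int count, MPI_Datatype type, int root, MPI_Comm comm);

    #define BCAST_SHORT_CUTOFF 2048   /* illustrative cutoff, not BGL's value */

    int my_bcast(void *buf, int count, MPI_Datatype type, int root, MPI_Comm comm)
    {
        int type_size;
        MPI_Type_size(type, &type_size);

        if ((size_t)count * (size_t)type_size <= BCAST_SHORT_CUTOFF)
            return short_bcast(buf, count, type, root, comm);  /* optimize latency   */
        return long_bcast(buf, count, type, root, comm);       /* optimize bandwidth */
    }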


Collective Algorithms 3/5

• The decision to use an optimized collective algorithm is made locally
  – What is wrong with this?

• Example:

    char buf[100], buf2[20000];
    if (rank == 0) MPI_Bcast(buf, 100, …);
    else           MPI_Bcast(buf2, 20000, …);

  – Not legal according to the MPI standard, but…
  – What if one node uses the optimized algorithm and the others use the MPICH2 algorithm?
    • Deadlock, or worse (see the complete example below)
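A self-contained version of the slide's example is sketched below; the datatype, root, and communicator arguments are my assumed completion of the elided "…" above:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        char buf[100], buf2[20000];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* The ranks disagree on the broadcast size, which the MPI standard
         * forbids. If rank 0 locally picks an optimized algorithm while the
         * other ranks fall back to the default MPICH2 one, the mismatch can
         * end in deadlock, or worse. */
        if (rank == 0)
            MPI_Bcast(buf,  100,   MPI_CHAR, 0, MPI_COMM_WORLD);
        else
            MPI_Bcast(buf2, 20000, MPI_CHAR, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }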


Collective Algorithms 4/5

• Solution to the previous problem:
  – Make optimization decisions globally
  – This incurs a slight latency hit
  – Thus, it is only used when the offsetting increase in bandwidth matters, e.g. long-message collectives


Collective Algorithms 5/5

• Remainder of slides
  – MPI_Bcast
  – MPI_Reduce, MPI_Allreduce
  – MPI_Alltoall, MPI_Alltoallv

• Using both the tree and torus networks
  – Tree operates only on MPI_COMM_WORLD
    • Has a built-in ALU, but only fixed point :-(
  – Torus has a deposit-bit feature; requires a rectangular communicator shape (for most algorithms)


Broadcast 1/3

• MPICH2
  – Binomial tree for short messages
  – Scatter then allgather for large messages (see the sketch below)
  – Performs poorly on BGL due to high CPU overhead and lack of topology awareness

• Torus
  – Uses the deposit-bit feature
  – For an n-dimensional mesh, 1/n of the message is sent in each direction concurrently

• Tree
  – Does not use the ALU
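A minimal sketch of the scatter-then-allgather idea used by MPICH2 for large messages, expressed with standard MPI collectives; the helper name is mine, and it assumes the byte count divides evenly by the communicator size:

    #include <mpi.h>

    static void scatter_allgather_bcast(char *buf, int count, int root, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        int chunk = count / size;   /* assumes count % size == 0 */

        /* Step 1: root scatters the buffer so rank r holds the slice that
         * belongs at offset r * chunk. */
        if (rank == root)
            MPI_Scatter(buf, chunk, MPI_CHAR, MPI_IN_PLACE, chunk, MPI_CHAR,
                        root, comm);
        else
            MPI_Scatter(NULL, chunk, MPI_CHAR, buf + rank * chunk, chunk, MPI_CHAR,
                        root, comm);

        /* Step 2: every rank contributes its slice and collects the others,
         * so all ranks end up with the complete buffer. */
        MPI_Allgather(MPI_IN_PLACE, chunk, MPI_CHAR, buf, chunk, MPI_CHAR, comm);
    }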


Broadcast 2/3

• Red lines represent one spanning tree of half the message

• Blue lines represent another spanning tree of the other message half


Broadcast 3/3


Reduce & Allreduce 1/4

• Reduce is essentially a reverse broadcast
• Allreduce is a reduce followed by a broadcast (see the sketch below)
• Torus
  – Can't use the deposit-bit feature
  – CPU bound; bandwidth is poor
  – Solution: a Hamiltonian path; huge latency penalty, but great bandwidth
• Tree
  – Natural choice for reductions on integers!
  – Floating-point performance is bad
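A minimal sketch of the "reduce followed by broadcast" decomposition of allreduce mentioned above (the helper name is mine; sendbuf and recvbuf are assumed to be distinct buffers):

    #include <mpi.h>

    static int reduce_bcast_allreduce(const void *sendbuf, void *recvbuf, int count,
                                      MPI_Datatype type, MPI_Op op, MPI_Comm comm)
    {
        const int root = 0;

        /* Combine every rank's contribution onto the root... */
        MPI_Reduce(sendbuf, recvbuf, count, type, op, root, comm);

        /* ...then fan the result back out so every rank has it. */
        return MPI_Bcast(recvbuf, count, type, root, comm);
    }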


Reduce & Allreduce 2/4

• Hamiltonian path for 4x4x4 cube


Reduce & Allreduce 3/4


Reduce & Allreduce 4/4


Alltoall and Alltoallv 1/5

• MPICH2 has 4 algorithms
  – Yes, 4 separate ones
  – BGL performance is bad due to network hot spots and CPU overhead

• Torus
  – No communicator size restriction!
  – Does not use the deposit-bit feature

• Tree
  – No alltoall tree algorithm; it would not make sense


Alltoall and Alltoallv 2/5

• BGL optimized torus algorithm (see the sketch below)
  – Uses randomized packet injection
  – Each node creates a destination list
  – Each node has the same seed value but a different offset

• Bad memory performance?
  – Yes!
  – Torus payload is 240 bytes (8 cache lines)
  – Multiple packets in adjacent cache lines to each destination are injected before advancing
    • Measurements showed 2 packets to be optimal
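An illustrative sketch of the randomized destination ordering described above: every rank shuffles the same list with a shared seed, then starts at a rank-dependent offset so injection is spread across the torus. The seed value and the use of rand_r() are my assumptions, standing in for whatever PRNG the real code uses.

    #include <stdlib.h>

    static void make_dest_list(int *dest, int nprocs, int my_rank)
    {
        unsigned int seed = 12345u;   /* same (made-up) seed on every rank */
        int *perm = malloc(nprocs * sizeof *perm);

        for (int i = 0; i < nprocs; i++)
            perm[i] = i;

        /* Fisher-Yates shuffle driven by the shared seed, so every rank
         * computes the identical permutation. */
        for (int i = nprocs - 1; i > 0; i--) {
            int j = rand_r(&seed) % (i + 1);
            int tmp = perm[i]; perm[i] = perm[j]; perm[j] = tmp;
        }

        /* Each rank starts at a different offset into the shared permutation,
         * so ranks inject toward different destinations at any given moment
         * and no single torus link becomes a hot spot. */
        for (int k = 0; k < nprocs; k++)
            dest[k] = perm[(my_rank + k) % nprocs];

        free(perm);
    }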


Alltoall and Alltoallv 3/5


Alltoall and Alltoallv 4/5


Alltoall and Alltoallv 5/5


Conclusion

• Optimized collectives on BGL are off to a good start
  – Better performance than MPICH2
  – They exploit knowledge about network features
  – They avoid performance penalties like memory copies and network hot spots

• Much work remains
  – Short-message collectives
  – Non-rectangular communicators for the torus network
  – Tree collectives using communicators other than MPI_COMM_WORLD
  – Other collectives: scatter, gather, etc.


Questions?