Page 1: Implementing Optimized Collective Communication Routines on the IBM BlueGene/L Supercomputer

CS 425 term project
By Sam Miller

[email protected]
April 18, 2005

Page 2: Outline

• What is BlueGene/L? (5 slides)
• Hardware (3 slides)
• Communication Networks (2 slides)
• Software (2 slides)
• MPI and MPICH (1 slide)
• Collective Algorithms (5 slides)
• Better Collective Algorithms! (12 slides)
• Performance
• Conclusion

Page 3: Abbreviations Today

• BGL = BlueGene/L
• CNK = Compute Node Kernel
• MPI = Message Passing Interface
• MPICH2 = MPICH 2.0 from Argonne Labs
• ASIC = Application Specific Integrated Circuit
• ALU = Arithmetic Logic Unit
• IBM = International Biscuit Makers (duh)

Page 4: What is BGL 1/2

• Massively parallel distributed memory cluster of embedded processors

• 65,536 nodes! 131,072 processors!
• Low power requirements
• Relatively small, compared to predecessors
• Half system installed at LLNL
• Other systems going online too

Page 5: What is BGL 2/2

• BlueGene/L at LLNL (360 Tflops)
  – 2,500 square feet, half a tennis court
• Earth Simulator (40 Tflops)
  – 35,000 square feet, requires an entire building


Page 9: Hardware 1/3

• CPU is PowerPC 440
  – Designed for embedded applications
  – Low power, low clock frequency (700 MHz)
  – 32 bit :-(

• FPU is custom 64-bit
  – Each PPC 440 core has two of these
  – The two FPUs operate in parallel
  – @ 700 MHz this is 2.8 Gflops per PPC 440 core (2 FPUs × 2 flops per cycle from the fused multiply-add × 0.7 GHz)

Page 10: Hardware 2/3

• BGL ASIC
  – Two PPC 440 cores, four FPUs
  – L1, L2, L3 caches
  – DDR memory controller
  – Logic for 5 separate communications networks
  – This forms one compute node

Page 11: Hardware 3/3

• To build the entire 65,536 node system
  – Two ASICs with 256 or 512 MB DDR RAM form a compute card
  – Sixteen compute cards form a node board
  – Sixteen node boards form a midplane
  – Two midplanes form a rack
  – Sixty four racks brings us to:
  – 2 x 16 x 16 x 2 x 64 = 65,536!


Page 13: Communication Networks 1/2

• Five different networks
  – 3D torus
    • Primary for MPI library
  – Global tree
    • Used for collectives on MPI_COMM_WORLD
    • Used by compute nodes to communicate with I/O nodes
  – Global interrupt
    • 1.5 usec latency over entire 65k node system!
  – JTAG
    • Used for node bootup and servicing
  – Gigabit Ethernet
    • Used by I/O nodes

Page 14: Communication Networks 2/2

• Torus
  – 6 neighbors have bi-directional links at 154 MB/sec
  – Guarantees reliable, deadlock free delivery
  – Chosen due to high bandwidth nearest neighbor connectivity
  – Used in prior supercomputers, such as the Cray T3E

Page 15: Software 1/2

• Compute node runs a stripped-down, Linux-like kernel called CNK
  – Two threads, 1 per CPU
  – No context switching, no VM
  – Standard glibc interface, easy to port
  – 5000 lines of C++

• I/O nodes run standard PPC Linux
  – They have disk access
  – Run the console I/O daemon (ciod)

Page 16: Software 2/2

• Network software has 3 layers
  – Topmost is MPI Library
  – Middle is Message Layer
    • Allows transmission of arbitrary buffer sizes
  – Bottom is Packet layer (illustrative sketch below)
    • Very simple
    • Stateless interface to torus, tree, and GI hardware
    • Facilitates sending & receiving packets
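
To make the layering concrete, here is a purely illustrative C sketch of what a stateless, fixed-size-packet interface in the spirit of the packet layer might look like. The type and function names are hypothetical, invented for this sketch; they are not the real BGL packet-layer API.

    #include <stdint.h>

    /* Hypothetical sketch only; NOT the actual BGL packet-layer API.
     * The packet layer is stateless and deals in fixed-size packets; it
     * knows nothing about messages, tags, or MPI semantics. */
    #define TORUS_PAYLOAD_BYTES 240      /* torus payload size quoted on the alltoall slides */

    typedef struct {
        uint8_t dest_x, dest_y, dest_z;  /* torus coordinates of the destination node */
        uint8_t payload[TORUS_PAYLOAD_BYTES];
    } torus_packet_t;

    /* The message layer above this would split an arbitrary user buffer
     * into packets like these; the MPI library above that maps MPI
     * semantics (communicators, tags, collectives) onto such messages. */
    int torus_packet_send(const torus_packet_t *pkt);  /* hypothetical */
    int torus_packet_recv(torus_packet_t *pkt);        /* hypothetical */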

Page 17: MPICH

• Developed by Argonne National Labs
• Open source, freely available, standards compliant MPI implementation
• Used by many vendors
• Chosen by IBM due to use of the Abstract Device Interface (ADI) and design for scalability

Page 18: Collective Algorithms 1/5

• Collectives can be implemented with basic sends and receives (see the sketch below)
  – Better algorithms exist

• Default MPICH2 collectives perform poorly on BGL
  – Assume a crossbar network, poor node mapping
  – Point-to-point messages incur high overhead
  – No knowledge of network-specific features
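
As a concrete illustration of the first bullet, here is a minimal broadcast written only with point-to-point calls. It is a naive sketch (the root sends to every rank in turn), not the MPICH2 or BGL implementation, which is exactly why better algorithms matter.

    #include <mpi.h>

    /* Naive linear broadcast built from point-to-point operations only.
     * The root sends the whole buffer to every other rank; everyone else
     * posts a single receive.  O(P) serialized sends at the root, so it
     * scales poorly compared to tree-based or network-aware collectives. */
    static void naive_bcast(void *buf, int count, MPI_Datatype type,
                            int root, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank == root) {
            for (int dest = 0; dest < size; dest++)
                if (dest != root)
                    MPI_Send(buf, count, type, dest, 0, comm);
        } else {
            MPI_Recv(buf, count, type, root, 0, comm, MPI_STATUS_IGNORE);
        }
    }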

Page 19: Collective Algorithms 2/5

• Optimization is tricky
  – Message size and communicator shape are the deciding factors
  – Large messages == optimize bandwidth
  – Short messages == optimize latency
• I will not talk about short message collectives further today
• If an optimized algorithm isn't available, BGL falls back on default MPICH2
  – It will work because point-to-point messages work
  – Performance will suck, however

Page 20: Collective Algorithms 3/5

• Conditions for selecting the optimized collective algorithm are made locally
  – What is wrong with this?
• Example (a complete version appears below):

      char buf[100], buf2[20000];
      if (rank == 0)
          MPI_Bcast(buf, 100, …);
      else
          MPI_Bcast(buf2, 20000, …);

  – Not legal according to the MPI standard, but…
  – What if one node uses the optimized algorithm and the others use the MPICH2 algorithm?
    • Deadlock - or worse
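
For reference, a complete, compilable version of the example above; the arguments elided with "…" on the slide are filled in with plausible values (MPI_CHAR, root 0, MPI_COMM_WORLD) purely for illustration. The program is deliberately illegal MPI.

    #include <mpi.h>

    /* Deliberately ILLEGAL MPI usage (mismatched broadcast counts), shown
     * only to illustrate why the algorithm cannot be chosen from local
     * information alone. */
    int main(int argc, char **argv)
    {
        char buf[100], buf2[20000];
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            MPI_Bcast(buf,  100,   MPI_CHAR, 0, MPI_COMM_WORLD);
        else
            MPI_Bcast(buf2, 20000, MPI_CHAR, 0, MPI_COMM_WORLD);
        /* If rank 0 locally selects a short-message algorithm while the
         * other ranks select a long-message one, the ranks are no longer
         * executing the same protocol: deadlock, or worse. */

        MPI_Finalize();
        return 0;
    }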

Page 21: Collective Algorithms 4/5

• Solution to the previous problem:
  – Make optimization decisions globally (see the sketch below)
  – This incurs a slight latency hit
  – Thus, only used when offsetting increases in bandwidth are important, e.g. long message collectives
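
One way to make the decision globally, sketched with a standard MPI call (this illustrates the general idea, not necessarily the exact mechanism used on BGL): every rank proposes an algorithm from its local information, and a short allreduce forces agreement on the most conservative proposal. That allreduce is the latency hit mentioned above.

    #include <mpi.h>

    /* Hypothetical protocol IDs; a smaller value is a more conservative choice. */
    enum { ALG_DEFAULT_MPICH2 = 0, ALG_OPTIMIZED_TORUS = 1 };

    /* Agree globally on which algorithm to run: take the minimum of all
     * local proposals so every rank ends up on the same code path. */
    static int choose_algorithm(int local_choice, MPI_Comm comm)
    {
        int global_choice;
        MPI_Allreduce(&local_choice, &global_choice, 1, MPI_INT, MPI_MIN, comm);
        return global_choice;
    }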

Page 22: Collective Algorithms 5/5

• Remainder of slides
  – MPI_Bcast
  – MPI_Reduce, MPI_Allreduce
  – MPI_Alltoall, MPI_Alltoallv
• Using both the tree and torus networks
  – Tree operates only on MPI_COMM_WORLD
    • Has a built in ALU, but only fixed point :-(
  – Torus has deposit bit feature, requires rectangular communicator shape (for most algorithms)

Page 23: Broadcast 1/3

• MPICH2
  – Binomial tree for short messages
  – Scatter then Allgather for large messages (sketched below)
  – Performs poorly on BGL due to high CPU overhead and lack of topology awareness
• Torus
  – Uses the deposit bit feature
  – For an n-dimension mesh, 1/n of the message is sent in each direction concurrently
• Tree
  – Does not use the ALU
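
The "scatter then allgather" long-message broadcast can be sketched with standard MPI collectives. This is a simplified sketch that assumes the byte count divides evenly by the communicator size; it shows the shape of the algorithm, not the MPICH2 source.

    #include <mpi.h>

    /* Long-message broadcast as scatter + allgather, using standard MPI
     * collectives.  The root scatters one chunk of the message to each
     * rank, then an in-place allgather gives every rank the full buffer. */
    static void scatter_allgather_bcast(char *buf, int count, int root,
                                        MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        int chunk = count / size;           /* assume count % size == 0 */

        if (rank == root)
            MPI_Scatter(buf, chunk, MPI_CHAR, MPI_IN_PLACE, chunk, MPI_CHAR,
                        root, comm);
        else
            MPI_Scatter(NULL, chunk, MPI_CHAR, buf + rank * chunk, chunk,
                        MPI_CHAR, root, comm);

        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                      buf, chunk, MPI_CHAR, comm);
    }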

Page 24: Broadcast 2/3

• Red lines represent one spanning tree of half the message

• Blue lines represent another spanning tree of the other message half

Page 25: Broadcast 3/3

Page 26: Reduce & Allreduce 1/4

• Reduce is essentially a reverse broadcast
• Allreduce is a reduce followed by a broadcast (see the sketch below)
• Torus
  – Can't use the deposit bit feature
  – CPU bound, bandwidth is poor
  – Solution: Hamiltonian path; huge latency penalty, but great bandwidth
• Tree
  – Natural choice for reduction using integers!
  – Floating point performance is bad
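
The observation that an allreduce is a reduce followed by a broadcast, written out with standard MPI calls (a naive reference composition, not the optimized BGL path):

    #include <mpi.h>

    /* Naive allreduce: reduce everything to a root, then broadcast the
     * result back out.  Optimized implementations avoid the second pass
     * over the network, but this is the semantic composition. */
    static void reduce_then_bcast_allreduce(const double *in, double *out,
                                            int count, MPI_Comm comm)
    {
        const int root = 0;
        MPI_Reduce(in, out, count, MPI_DOUBLE, MPI_SUM, root, comm);
        MPI_Bcast(out, count, MPI_DOUBLE, root, comm);
    }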

Page 27: Reduce & Allreduce 2/4

• Hamiltonian path for 4x4x4 cube (one possible enumeration is sketched below)
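
One simple way to enumerate a Hamiltonian path through a 4x4x4 grid is a "snake" (boustrophedon) ordering, sketched below. This is just one valid such path, chosen for clarity; it is not necessarily the path the BGL reduce implementation uses.

    #include <stdio.h>

    /* Enumerate a Hamiltonian path through an X x Y x Z grid by sweeping
     * along x, reversing direction on alternate rows and planes so that
     * consecutive nodes are always nearest neighbours. */
    static void snake_path(int X, int Y, int Z)
    {
        for (int z = 0; z < Z; z++) {
            for (int yy = 0; yy < Y; yy++) {
                int y = (z % 2 == 0) ? yy : Y - 1 - yy;       /* reverse y on odd planes */
                for (int xx = 0; xx < X; xx++) {
                    int row = z * Y + yy;                     /* global row index        */
                    int x = (row % 2 == 0) ? xx : X - 1 - xx; /* reverse x on odd rows   */
                    printf("(%d,%d,%d)\n", x, y, z);
                }
            }
        }
    }

    int main(void) { snake_path(4, 4, 4); return 0; }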

Page 28: Reduce & Allreduce 3/4

Page 29: Reduce & Allreduce 4/4

Page 30: Alltoall and Alltoallv 1/5

• MPICH2 has 4 algorithms
  – Yes, 4 separate ones
  – BGL performance is bad due to network hot spots and CPU overhead
• Torus
  – No communicator size restriction!
  – Does not use the deposit bit feature
• Tree
  – No alltoall tree algorithm; it would not make sense

Page 31: Alltoall and Alltoallv 2/5

• BGL optimized torus algorithm (sketched below)
  – Uses randomized packet injection
  – Each node creates a destination list
  – Each node has the same seed value, different offset
• Bad memory performance?
  – Yes!
  – Torus payload is 240 bytes (8 cache lines)
  – Multiple packets in adjacent cache lines to each destination are injected before advancing
    • Measurements showed 2 packets to be optimal
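
A sketch of the randomized destination ordering described above: every rank builds the same pseudo-random permutation of destination ranks from a shared seed, then walks it starting at a rank-dependent offset so that the ranks do not all hit the same destination at once. The helper below is illustrative; the real implementation works at the packet level, injecting a couple of 240-byte packets per destination before advancing.

    #include <stdlib.h>

    /* Build a pseudo-random permutation of the destination ranks with a
     * Fisher-Yates shuffle.  Every rank uses the SAME seed, so every rank
     * computes the same permutation; each rank then starts injecting at a
     * different offset (e.g. its own rank) to avoid network hot spots. */
    static void make_destination_list(int *order, int nranks, unsigned shared_seed)
    {
        for (int i = 0; i < nranks; i++)
            order[i] = i;

        srand(shared_seed);                      /* identical on every rank */
        for (int i = nranks - 1; i > 0; i--) {   /* Fisher-Yates shuffle    */
            int j = rand() % (i + 1);
            int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }
        /* The injection loop (not shown) would visit
         * order[(my_rank + k) % nranks] for k = 0 .. nranks-1, sending two
         * packets to each destination before moving to the next. */
    }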

Page 32: Alltoall and Alltoallv 3/5

Page 33: Alltoall and Alltoallv 4/5

Page 34: Alltoall and Alltoallv 5/5

Page 35: Conclusion

• Optimized collectives on BGL are off to a good start
  – Superior performance compared to MPICH2
  – Exploit knowledge about network features
  – Avoid performance penalties like memory copies and network hot spots
• Much work remains
  – Short message collectives
  – Non-rectangular communicators for the torus network
  – Tree collectives using communicators other than MPI_COMM_WORLD
  – Other collectives: scatter, gather, etc.

Page 36: Questions?

