Collective Communication on Architectures that Support Simultaneous Communication over Multiple Links
Ernie Chan
Authors
Ernie Chan Robert van de Geijn
Department of Computer Sciences
The University of Texas at Austin
William Gropp Rajeev Thakur
Mathematics and Computer Science Division
Argonne National Laboratory
Testbed Architecture
IBM Blue Gene/L: 3D torus point-to-point interconnect network
One rack: 1024 dual-processor nodes, two 8 x 8 x 8 midplanes
Special feature to send simultaneously: use multiple calls to MPI_Isend
Outline
Testbed Architecture
Model of Parallel Computation
Sending Simultaneously
Collective Communication
Generalized Algorithms
Performance Results
Conclusion
Model of Parallel Computation
Target Architectures: distributed-memory parallel architectures
Indexing: p computational nodes, indexed 0 … p - 1
Logically Fully Connected: a node can send directly to any other node
Model of Parallel Computation
Old Model of Communicating Between Nodes: unidirectional sending or receiving
Model of Parallel Computation
Old Model of Communicating Between Nodes: simultaneous sending and receiving
Model of Parallel Computation
Communicating Between Nodes: a node can send or receive with 2N other nodes simultaneously along its 2N different links
Model of Parallel Computation
Communicating Between Nodes: a node cannot perform a bidirectional exchange on any link while sending or receiving simultaneously with multiple nodes
Model of Parallel Computation
Cost of Communication
α + nβ
α: startup time (latency)
n: number of bytes to communicate
β: transmission time per byte (inverse bandwidth)
Outline
Testbed Architecture
Model of Parallel Computation
Sending Simultaneously
Collective Communication
Generalized Algorithms
Performance Results
Conclusion
Sending Simultaneously
Old Cost of Communication with Sends to Multiple Nodes: cost to send to m separate nodes
(α + nβ) m
Sending Simultaneously
New cost of communication with simultaneous sends:

(α + nβ) m

can be replaced with

(α + nβ) + (α + nβ) (m - 1) τ

Cost of one send, plus the cost of the (m - 1) extra sends discounted by τ, where 0 ≤ τ ≤ 1
Sending Simultaneously
Benchmarking Sending Simultaneously
Log-log timing graphs
Midplane: 512 nodes
Sending simultaneously with 1 – 6 neighbors
Message sizes from 8 bytes to 4 MB
Outline
Testbed Architecture
Model of Parallel Computation
Sending Simultaneously
Collective Communication
Generalized Algorithms
Performance Results
Conclusion
Generalized Algorithms
Short-Vector Algorithms: Minimum-Spanning Tree
Long-Vector Algorithms: Bucket Algorithm
Generalized Algorithms
Minimum-Spanning Tree: disjoint partitions on an N-dimensional mesh
[Figure: 4 x 4 mesh of nodes 0 – 15]
Generalized Algorithms
Minimum-Spanning Tree: divide dimensions by a decrementing counter starting from N + 1
[Figure: the same 4 x 4 mesh of nodes 0 – 15, split along one dimension]
Generalized Algorithms
Minimum-Spanning Tree: now divide into 2N + 1 partitions
[Figure: the 4 x 4 mesh of nodes 0 – 15 divided into 2N + 1 disjoint partitions]
Outline
Testbed Architecture
Model of Parallel Computation
Sending Simultaneously
Collective Communication
Generalized Algorithms
Performance Results
Conclusion
Conclusion
IBM Blue Gene/L supports the ability to send simultaneously over multiple links
Benchmarking, together with checking against the cost model, verifies this claim
The new generalized algorithms show clear performance gains