Scalable Reconfigurable Interconnects
Ali Pinar
Lawrence Berkeley National Laboratory
joint work with Shoaib Kamil, Lenny Oliker, and John Shalf
CSCAPES Workshop, Santa Fe, June 11, 2008
Ultra-scale systems rely on increased concurrency.
[Chart: total number of processors across the Top 15 systems, per Top500 list from Jun-93 through Jun-06]
Huge increases in concurrency since 2004.
How to connect huge numbers of processors?
What is a good interconnect for ultra-scale systems?
Mesh/torus networks provide limited performance.
Fat-trees are widely used due to their flexibility: 94 of the top 100 Top500 systems in 2004, and 72 of the top 100 in 2007.
The cost of a fat-tree scales as O(P lg P); for large numbers of processors, the cost of the interconnect dominates the cost of the compute power.
[Figure: fat-tree and torus topologies]
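A back-of-the-envelope sketch of that scaling gap (the cost functions and constants below are illustrative assumptions, not measured hardware costs):

```python
import math

def fattree_cost(p):
    # An idealized binary fat-tree carries O(P) links at each of its
    # lg(P) levels, so total link cost grows as P * lg(P).
    return p * math.log2(p)

def torus_cost(p, links_per_node=6):
    # A 3D torus uses a fixed number of links per node, so cost is O(P).
    return p * links_per_node

for p in (1024, 32768, 262144):
    print(p, round(fattree_cost(p)), torus_cost(p))
```

The fat-tree's per-processor link count keeps growing with lg P while the torus's stays constant, which is why the interconnect eventually dominates the system cost.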
Step-by-step approach
Characterize the communication requirements of applications: replaces theoretical metrics with practical ones.
Minimize the interconnection requirements:
- Choice of subdomains
- Task-to-processor mapping
- Scheduling of messages
Design alternative interconnects:
- Static networks: fit-trees
- Reconfigurable networks
Static Applications
Name      Lines  Discipline        Problem & Method                                  Structure
Cactus    84k    Astrophysics      Einstein's Theory of GR via Finite Differencing   Grid
LBMHD     1500   Plasma Physics    Magneto-Hydrodynamics via Lattice-Boltzmann       Lattice/Grid
GTC       5000   Magnetic Fusion   Vlasov-Poisson Equation via Particle-in-Cell      Particle/Grid
MADbench  5000   Cosmology         CMB Analysis via Newton-Raphson                   Dense Matrix
ELBM3D    3000   Fluid Dynamics    Fluid Dynamics via Lattice-Boltzmann              Lattice/Grid
Beam3D    23k    Particle Physics  Poisson's Equation via Particle-in-Cell and FFT   Particle/Grid
Static Applications
Most messages are small.
Employ a separate network for low-bandwidth messages.
Most fat-tree ports are not utilized.
More than 50% of the ports of a fat-tree are not used.
Clever task-to-processor allocation yields better results.
Hops reduced by an average of 25%; improved latency!
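The effect of task placement on hop counts can be illustrated with a toy example; the mesh size, communication pattern, and both mappings below are hypothetical:

```python
def hops(mapping, messages, cols):
    # Total Manhattan distance traveled by all messages on a 2D mesh,
    # where mapping[task] = processor index, laid out row-major.
    total = 0
    for src, dst, vol in messages:
        (r1, c1) = divmod(mapping[src], cols)
        (r2, c2) = divmod(mapping[dst], cols)
        total += vol * (abs(r1 - r2) + abs(c1 - c2))
    return total

# Hypothetical 4-task ring on a 2x2 mesh: neighbors exchange one unit each.
messages = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 0, 1)]
naive = {0: 0, 1: 3, 2: 1, 3: 2}  # tasks scattered across the mesh
tuned = {0: 0, 1: 1, 2: 3, 3: 2}  # ring mapped onto mesh neighbors
print(hops(naive, messages, cols=2), hops(tuned, messages, cols=2))  # -> 6 4
```

Even in this tiny case, placing communicating tasks on adjacent processors cuts total hops by a third; the 25% average above comes from the same idea applied to real application traces.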
Do we need the fat-tree bandwidth?
We need the flexibility of a fat-tree, but not its full bandwidth.
The bandwidth requirement can be decreased with careful placement of tasks.
Proposed alternative: fit-trees
Idea: Analyze the communication requirements of apps and design the interconnect for what is really needed.
Even all-to-all communication does not need a fat-tree.
All-to-all communication is the bottleneck for FFT.
Clever scheduling of messages reduces the bandwidth requirement.
Conventional algorithms for all-to-all communication do not distribute communication evenly.
The savings are even more pronounced in FFT with a 2D decomposition.
[Figure: tree level used at each communication step, for standard, randomized, and optimal all-to-all schedules]
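One standard way to spread all-to-all traffic evenly is a pairwise-exchange schedule; the sketch below is illustrative of that idea and not necessarily the scheduling scheme from the talk:

```python
def pairwise_all_to_all_schedule(p):
    # Round r pairs rank i with partner i XOR r (p must be a power of two).
    # Every round is a perfect matching, so no link level is oversubscribed
    # while others sit idle.
    return [[(i, i ^ r) for i in range(p)] for r in range(1, p)]

for round_ in pairwise_all_to_all_schedule(4):
    print(round_)
```

With 4 ranks this yields three rounds, each a disjoint set of exchanges; a naive "everyone sends to rank 0 first" ordering would instead concentrate all traffic on one subtree at a time.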
Fit-trees: the network should fit the application
Key observation: the scalability of an application is related to the locality of its computation.
Implication: the required bandwidth decreases as we go higher in the tree.
Fitness ratio (f): the ratio of the bandwidths provided at two successive levels. 2D domains: f ≈ 1.4; 3D domains: f ≈ 1.2.
[Figure: a fat-tree provides full bandwidth N at every level, while a fit-tree scales the bandwidth down by the fitness ratio f at each successive level]
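A sketch of how the fitness ratio translates into link cost, under the illustrative assumption that aggregate bandwidth shrinks geometrically by f at each level of a binary tree:

```python
import math

def tree_link_cost(p, f):
    # Total link count for a binary tree over p leaves where the aggregate
    # bandwidth shrinks by the fitness ratio f at each level up from the
    # leaves. f = 1 reproduces a full fat-tree (bandwidth p at every level).
    levels = int(math.log2(p))
    return sum(p / f**level for level in range(levels))

p = 4096
for f in (1.0, 1.2, 1.4):
    print(f, round(tree_link_cost(p, f)))
```

With f = 1 the cost is the fat-tree's P lg P; any f > 1 turns the per-level sum into a convergent geometric series, so the cost approaches O(P) as the tree grows, which is the scalability claim above.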
Fit-trees provide scalability
HFAST: Hybrid Flexibly-Assignable Switch Topology
Use Layer-1 (circuit) switches to configure Layer-2 (packet) switches at run time, with an O(10-100 ms) reconfiguration cost.
The hardware to do so exists (optical networks); Layer-1 switches are cheaper per port because they make no dynamic decisions, like a telephone switchboard.
Collective communication uses a separate low-latency, low-bandwidth tree network (as in IBM BlueGene).
How to use HFAST
Improved task-to-processor assignments, even at run time:
- Migrate processes with little overhead.
- Adapt to changing communication requirements.
- Avoid defragmentation at the system level.
Build an interconnect for each application:
- Avoid overprovisioning the communication resources.
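One way to make "an interconnect for each application" concrete is to size packet-switch ports from the application's communication graph; the function below is a hypothetical sketch of that degree-based provisioning, not the HFAST algorithm itself:

```python
from collections import defaultdict

def ports_needed(messages, threshold=0):
    # Per-task degree in the communication graph: each neighbor exchanging
    # more than `threshold` units needs one dedicated packet-switch port,
    # so the maximum degree bounds the ports to provision per node.
    nbrs = defaultdict(set)
    for src, dst, vol in messages:
        if vol > threshold:
            nbrs[src].add(dst)
            nbrs[dst].add(src)
    return max(len(v) for v in nbrs.values())

# Hypothetical 4-task ring: every task talks to exactly two neighbors.
ring = [(0, 1, 5), (1, 2, 5), (2, 3, 5), (3, 0, 5)]
print(ports_needed(ring))  # -> 2
```

Raising the threshold routes small messages to the separate low-bandwidth tree network, shrinking the degree that the circuit-switched fabric must support.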
Processor allocation for adaptive applications
We obtain 41% and 53% of the ideal savings in hops.
Conclusions
The massive concurrency of ultrascale machines will require new interconnects; we cannot afford to overprovision the resources.
There is no magic solution that is good for all applications: flexibility or reconfigurability is necessary.
The technology for reconfigurable networks is available.
We need to reduce the resource requirements:
- design networks for typical workloads;
- design methods to build a network for a given application.