Scalable Reconfigurable Interconnects
Ali Pinar
Lawrence Berkeley National Laboratory
joint work with Shoaib Kamil, Lenny Oliker, and John Shalf
CSCAPES Workshop, Santa Fe, June 11, 2008
Ultra-scale systems rely on increased concurrency.
[Chart: total number of processors across the Top 15 systems, per Top500 list from Jun-93 through Jun-06]
Huge increases in concurrency since 2004.
How to connect huge numbers of processors?
What is a good interconnect for ultra-scale systems?
Mesh/torus networks provide limited performance.
Fat-trees are widely used due to their flexibility: 94 of the top 100 Top500 systems in 2004, and 72 of the top 100 in 2007.
The cost of a fat-tree scales as O(P lg P); for large numbers of processors, the cost of the interconnect dominates the cost of the compute power.
[Figure: fat-tree and torus topologies]
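A back-of-the-envelope sketch of that scaling gap (the cost functions and constants below are illustrative assumptions, not measured hardware costs):

```python
import math

def fattree_cost(p):
    # An idealized binary fat-tree carries O(P) links at each of its
    # lg(P) levels, so total link cost grows as P * lg(P).
    return p * math.log2(p)

def torus_cost(p, links_per_node=6):
    # A 3D torus uses a fixed number of links per node, so cost is O(P).
    return p * links_per_node

for p in (1024, 32768, 262144):
    print(p, round(fattree_cost(p)), torus_cost(p))
```

The fat-tree's per-processor link count keeps growing with lg P while the torus's stays constant, which is why the interconnect eventually dominates the system cost.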
Step-by-step approach
Characterize the communication requirements of applications: replaces theoretical metrics with practical ones.
Minimize the interconnection requirements:
- Choice of subdomains
- Task-to-processor mapping
- Scheduling of messages
Design alternative interconnects:
- Static networks: fit-trees
- Reconfigurable networks
Static Applications
Name      Lines  Discipline        Problem & Method                                  Structure
Cactus    84k    Astrophysics      Einstein's Theory of GR via Finite Differencing   Grid
LBMHD     1500   Plasma Physics    Magneto-Hydrodynamics via Lattice-Boltzmann       Lattice/Grid
GTC       5000   Magnetic Fusion   Vlasov-Poisson Equation via Particle-in-Cell      Particle/Grid
MADbench  5000   Cosmology         CMB Analysis via Newton-Raphson                   Dense Matrix
ELBM3D    3000   Fluid Dynamics    Fluid Dynamics via Lattice-Boltzmann              Lattice/Grid
Beam3D    23k    Particle Physics  Poisson's Equation via Particle-in-Cell and FFT   Particle/Grid
Static Applications
Most messages are small.
Employ a separate network for low-bandwidth messages.
Most fat-tree ports are not utilized.
More than 50% of the ports of a fat-tree are not used.
Clever task-to-processor allocation yields better results.
Hops reduced by an average of 25%; improved latency!
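The effect of task placement on hop counts can be illustrated with a toy example; the mesh size, communication pattern, and both mappings below are hypothetical:

```python
def hops(mapping, messages, cols):
    # Total Manhattan distance traveled by all messages on a 2D mesh,
    # where mapping[task] = processor index, laid out row-major.
    total = 0
    for src, dst, vol in messages:
        (r1, c1) = divmod(mapping[src], cols)
        (r2, c2) = divmod(mapping[dst], cols)
        total += vol * (abs(r1 - r2) + abs(c1 - c2))
    return total

# Hypothetical 4-task ring on a 2x2 mesh: neighbors exchange one unit each.
messages = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 0, 1)]
naive = {0: 0, 1: 3, 2: 1, 3: 2}  # tasks scattered across the mesh
tuned = {0: 0, 1: 1, 2: 3, 3: 2}  # ring mapped onto mesh neighbors
print(hops(naive, messages, cols=2), hops(tuned, messages, cols=2))  # -> 6 4
```

Even in this tiny case, placing communicating tasks on adjacent processors cuts total hops by a third; the 25% average above comes from the same idea applied to real application traces.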
Do we need the fat-tree bandwidth?
We need the flexibility of a fat-tree, but not its full bandwidth.
The bandwidth requirement can be decreased with careful placement of tasks.
Proposed alternative: fit-trees
Idea: Analyze the communication requirements of apps and design the interconnect for what is really needed.
Even all-to-all communication does not need a fat-tree.
All-to-all communication is the bottleneck for FFT.
Clever scheduling of messages reduces the bandwidth requirement.
Conventional algorithms for all-to-all communication do not distribute communication evenly.
The savings are even more pronounced in FFT with a 2D decomposition.
[Figure: tree level used at each communication step, for standard, randomized, and optimal all-to-all schedules]
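One standard way to spread all-to-all traffic evenly is a pairwise-exchange schedule; the sketch below is illustrative of that idea and not necessarily the scheduling scheme from the talk:

```python
def pairwise_all_to_all_schedule(p):
    # Round r pairs rank i with partner i XOR r (p must be a power of two).
    # Every round is a perfect matching, so no link level is oversubscribed
    # while others sit idle.
    return [[(i, i ^ r) for i in range(p)] for r in range(1, p)]

for round_ in pairwise_all_to_all_schedule(4):
    print(round_)
```

With 4 ranks this yields three rounds, each a disjoint set of exchanges; a naive "everyone sends to rank 0 first" ordering would instead concentrate all traffic on one subtree at a time.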
Fit-trees: the network should fit the application
Key observation: the scalability of an application is related to the locality of its computation.
Implication: the required bandwidth decreases as we go higher in the tree.
Fitness ratio (f): the ratio of the bandwidths provided at two successive levels. 2D domains: f ≈ 1.4; 3D domains: f ≈ 1.2.
[Figure: a fat-tree provides full bandwidth N at every level, while a fit-tree scales the bandwidth down by the fitness ratio f at each successive level]
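A sketch of how the fitness ratio translates into link cost, under the illustrative assumption that aggregate bandwidth shrinks geometrically by f at each level of a binary tree:

```python
import math

def tree_link_cost(p, f):
    # Total link count for a binary tree over p leaves where the aggregate
    # bandwidth shrinks by the fitness ratio f at each level up from the
    # leaves. f = 1 reproduces a full fat-tree (bandwidth p at every level).
    levels = int(math.log2(p))
    return sum(p / f**level for level in range(levels))

p = 4096
for f in (1.0, 1.2, 1.4):
    print(f, round(tree_link_cost(p, f)))
```

With f = 1 the cost is the fat-tree's P lg P; any f > 1 turns the per-level sum into a convergent geometric series, so the cost approaches O(P) as the tree grows, which is the scalability claim above.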
Fit-trees provide scalability
HFAST: Hybrid Flexibly-Assignable Switch Topology
Use Layer-1 (circuit) switches to configure Layer-2 (packet) switches at run time, with an O(10-100 ms) reconfiguration cost.
The hardware to do so exists (optical networks); Layer-1 switches are cheaper per port because they make no dynamic decisions, like a telephone switchboard.
Collective communication uses a separate low-latency, low-bandwidth tree network (as in IBM BlueGene).
How to use HFAST
Improved task-to-processor assignments, even at run time:
- Migrate processes with little overhead.
- Adapt to changing communication requirements.
- Avoid defragmentation at the system level.
Build an interconnect for each application:
- Avoid overprovisioning the communication resources.
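One way to make "an interconnect for each application" concrete is to size packet-switch ports from the application's communication graph; the function below is a hypothetical sketch of that degree-based provisioning, not the HFAST algorithm itself:

```python
from collections import defaultdict

def ports_needed(messages, threshold=0):
    # Per-task degree in the communication graph: each neighbor exchanging
    # more than `threshold` units needs one dedicated packet-switch port,
    # so the maximum degree bounds the ports to provision per node.
    nbrs = defaultdict(set)
    for src, dst, vol in messages:
        if vol > threshold:
            nbrs[src].add(dst)
            nbrs[dst].add(src)
    return max(len(v) for v in nbrs.values())

# Hypothetical 4-task ring: every task talks to exactly two neighbors.
ring = [(0, 1, 5), (1, 2, 5), (2, 3, 5), (3, 0, 5)]
print(ports_needed(ring))  # -> 2
```

Raising the threshold routes small messages to the separate low-bandwidth tree network, shrinking the degree that the circuit-switched fabric must support.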
Processor allocation for adaptive applications
We obtain 41% and 53% of the ideal savings in hops.
Conclusions
The massive concurrency of ultrascale machines will require new interconnects; we cannot afford to overprovision the resources.
There is no magic solution that is good for all applications: flexibility or reconfigurability is necessary.
The technology for reconfigurable networks is available.
We need to reduce the resource requirements:
- design networks for typical workloads;
- design methods to build a network for a given application.