
Scalable Reconfigurable Interconnects

Page 1: Scalable Reconfigurable Interconnects

Scalable Reconfigurable Interconnects

Ali Pinar

Lawrence Berkeley National Laboratory

joint work with Shoaib Kamil, Lenny Oliker, and John Shalf

CSCAPES Workshop, Santa Fe, June 11, 2008

Page 2: Scalable Reconfigurable Interconnects

Ultra-scale systems rely on increased concurrency.

[Figure: Total # of Processors in Top15, per list from Jun-93 to Jun-06. Axes: List, Processors (0 to 350,000).]

Huge increases in concurrency since 2004.

How to connect huge numbers of processors?

Page 3: Scalable Reconfigurable Interconnects

What is a good interconnect for ultra-scale systems?

Mesh/torus networks provide limited performance.

Fat-trees are widely used due to their flexibility.
  94 of 100 of Top500 in 2004
  72 of 100 of Top500 in 2007

The cost of a fat-tree scales as O(P lg P).
The cost of the interconnect dominates the cost of compute power for large numbers of processors.

[Figure: fat-tree and torus topologies]
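As a rough illustration of the O(P lg P) cost noted on this slide, the sketch below counts switch ports in a simplified full-bandwidth tree. The model is an assumption, not taken from the slides: each switch of radix `radix` uses half its ports downward and half upward, and every level carries the full bandwidth of all P processors. The function names are hypothetical.

```python
def fat_tree_levels(num_procs, radix):
    """Levels needed so that (radix/2)**levels >= num_procs (assumed model)."""
    down = radix // 2
    if down < 2:
        raise ValueError("radix must be at least 4")
    levels, capacity = 1, down
    while capacity < num_procs:
        capacity *= down
        levels += 1
    return levels


def fat_tree_switch_ports(num_procs, radix):
    """Approximate port count: P down-facing plus P up-facing ports per level."""
    return 2 * num_procs * fat_tree_levels(num_procs, radix)


for procs in (64, 4096):
    ports = fat_tree_switch_ports(procs, radix=8)
    # Ports per processor grow with the number of levels, i.e. with log P.
    print(procs, ports, ports / procs)
```

Under these assumptions the number of switch ports per processor grows with the tree depth, which is why the interconnect cost grows faster than the processor count.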

Page 4: Scalable Reconfigurable Interconnects

Step-by-step approach

Characterize the communication requirements of applications.
  Replace theoretical metrics with practical ones.

Minimize the interconnection requirements.
  Choice of subdomains
  Task-to-processor mapping
  Scheduling of messages

Design alternative interconnects.
  Static networks: fit-trees
  Reconfigurable networks

Page 5: Scalable Reconfigurable Interconnects

Static Applications

Name        Lines  Discipline        Problem & Method                                  Structure
Cactus      84k    Astrophysics      Einstein’s Theory of GR via Finite Differencing   Grid
LBMHD       1500   Plasma Physics    Magneto-Hydrodynamics via Lattice-Boltzmann       Lattice/Grid
GTC         5000   Magnetic Fusion   Vlasov-Poisson Equation via Particle-in-Cell      Particle/Grid
MADbench    5000   Cosmology         CMB Analysis via Newton-Raphson                   Dense Matrix
ELBM3D      3000   Fluid Dynamics    Fluid Dynamics via Lattice-Boltzmann              Lattice/Grid
BeamBeam3D  23k    Particle Physics  Poisson’s Equation via Particle-in-Cell and FFT   Particle/Grid

Page 6: Scalable Reconfigurable Interconnects

Static Applications


Page 7: Scalable Reconfigurable Interconnects

Most messages are small

Employ a separate network for low bandwidth messages
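A minimal sketch of how a message-size profile could be split between a low-bandwidth network and a high-bandwidth one. The threshold value and the sample message sizes are hypothetical placeholders, not measurements from the talk.

```python
def split_by_size(message_sizes, threshold_bytes):
    """Summarize how many messages, and how many bytes, fall below a size threshold."""
    small = [size for size in message_sizes if size < threshold_bytes]
    total_bytes = sum(message_sizes)
    return {
        "small_count_fraction": len(small) / len(message_sizes),
        "small_byte_fraction": sum(small) / total_bytes,
    }


# Hypothetical trace: many tiny messages and a few large ones.
trace = [64] * 90 + [1_000_000] * 10
summary = split_by_size(trace, threshold_bytes=2048)
print(summary)
```

In a trace shaped like this one, the small messages make up most of the message count but almost none of the bytes, which is the situation where a separate low-bandwidth network is attractive.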

Page 8: Scalable Reconfigurable Interconnects

Most fat-tree ports are not utilized

>50% of the ports of a fat-tree are not used

Page 9: Scalable Reconfigurable Interconnects

Clever task-to-processor allocation yields better results.

Hops reduced by an average of 25%; improved latency!
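To illustrate why placement affects hop counts, here is a small sketch that compares two task placements on a binary tree. It is not the allocation method used in the talk: the tree model (hops equal twice the height of the lowest common ancestor), the chain-shaped communication pattern, and both placements are assumptions chosen for illustration.

```python
def tree_hops(leaf_a, leaf_b):
    """Hops between two leaves of a binary tree: up to the common ancestor and back down."""
    return 2 * (leaf_a ^ leaf_b).bit_length()


def total_hops(edges, placement):
    """Sum of hops over all communicating task pairs under a task-to-leaf placement."""
    return sum(tree_hops(placement[u], placement[v]) for u, v in edges)


num_tasks = 8
# Hypothetical communication pattern: each task talks to its chain neighbor.
edges = [(i, i + 1) for i in range(num_tasks - 1)]

in_order = list(range(num_tasks))                   # neighbors placed on nearby leaves
strided = [(3 * i) % num_tasks for i in range(num_tasks)]  # neighbors scattered

print(total_hops(edges, in_order), total_hops(edges, strided))
```

The same communication pattern costs noticeably fewer hops when neighboring tasks share low-level switches, which is the effect a careful task-to-processor mapping exploits.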

Page 10: Scalable Reconfigurable Interconnects

Do we need the fat-tree bandwidth?

We need the flexibility of a fat tree, but not the full bandwidth.

The bandwidth requirement can be decreased with careful placement of tasks.

Proposed alternative: Fit trees

Idea: Analyze the communication requirements of apps and design the interconnect for what is really needed.

Page 11: Scalable Reconfigurable Interconnects

Even all-to-all communication does not need a fat-tree.

All-to-all communication is the bottleneck for FFT.
Clever scheduling of messages reduces the bandwidth requirement.
Conventional algorithms for all-to-all communication do not distribute communication evenly.
The savings are even more pronounced in FFT with 2D decomposition.

[Figure: panels Standard, Randomized, Optimal; axes: Communication Step, level]
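The uneven distribution can be seen in a small sketch. It assumes one common textbook schedule (in step s, rank i sends to rank (i + s) mod P) and a binary tree whose root separates the lower and upper halves of the ranks. It does not reproduce the randomized or optimal schedules from the slide, and the function names are hypothetical.

```python
def root_crossings(num_procs, shift):
    """Messages crossing the tree root when rank i sends to (i + shift) % num_procs."""
    half = num_procs // 2
    return sum(
        1
        for src in range(num_procs)
        if (src < half) != ((src + shift) % num_procs < half)
    )


def shift_schedule_profile(num_procs):
    """Root-link load for every step of the shifted all-to-all schedule."""
    return [root_crossings(num_procs, shift) for shift in range(1, num_procs)]


profile = shift_schedule_profile(8)
peak = max(profile)
mean = sum(profile) / len(profile)
print(profile, peak, round(mean, 2))
```

In this model the peak load on the root is well above the per-step average, so a network sized for the peak is underused in most steps. A schedule that spreads the root-crossing messages evenly across steps would need less top-level bandwidth for the same total traffic.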

Page 12: Scalable Reconfigurable Interconnects

Fit-trees: the network should fit the application

Key observation: the scalability of an application is related to the locality of its computation.

Implication: required bandwidth decreases as we go higher in the tree.

Fitness ratio (f): ratio of the bandwidth between two successive layers
  2D domains: f ≈ 1.4
  3D domains: f ≈ 1.2

[Figure: fit-tree (labels f N and N) versus fat-tree (labels N and N)]
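A minimal sketch of how the fitness ratio shrinks bandwidth toward the top of the tree. It assumes each layer provides 1/f of the bandwidth of the layer below it, with f = 1 standing in for a fat-tree. The function name, the bottom-layer bandwidth, and the number of levels are hypothetical.

```python
def bandwidth_by_level(bottom_bandwidth, fitness_ratio, levels):
    """Bandwidth of each layer, bottom first, dividing by the fitness ratio per layer."""
    if fitness_ratio < 1:
        raise ValueError("fitness ratio is assumed to be at least 1")
    return [bottom_bandwidth / fitness_ratio ** level for level in range(levels)]


levels = 6
fat_tree = bandwidth_by_level(1024, 1.0, levels)   # same bandwidth at every layer
fit_2d = bandwidth_by_level(1024, 1.4, levels)     # f from the slide for 2D domains
fit_3d = bandwidth_by_level(1024, 1.2, levels)     # f from the slide for 3D domains

for name, layers in (("fat-tree", fat_tree), ("fit-tree 2D", fit_2d), ("fit-tree 3D", fit_3d)):
    print(name, round(sum(layers)))
```

Summing the layers gives a crude proxy for the amount of switching hardware. Under this assumption the fit-tree totals fall further below the fat-tree total as more levels are added.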

Page 13: Scalable Reconfigurable Interconnects

Fit-trees provide scalability

Page 14: Scalable Reconfigurable Interconnects

HFAST: Hybrid Flexibly-Assignable Switch Topology

Use Layer-1 (circuit) switches to configure Layer-2 (packet) switches at run-time (reconfiguration costs O(10-100 ms)).

Hardware to do so exists (optical networks).

Layer-1 switches are cheaper per port (no dynamic decisions, like a telephone switchboard).

Collective communication uses a separate low-latency, low-bandwidth tree network (like IBM BlueGene).
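A back-of-the-envelope sketch of when a reconfiguration pays for itself, using the 10-100 ms reconfiguration cost quoted on this slide. The per-iteration saving is a hypothetical input, not a number from the talk, and the function name is an assumption.

```python
import math


def break_even_iterations(reconfig_cost_ms, saving_per_iteration_ms):
    """Iterations needed before the time saved covers one reconfiguration."""
    if saving_per_iteration_ms <= 0:
        return None  # reconfiguring never pays off
    return math.ceil(reconfig_cost_ms / saving_per_iteration_ms)


# Hypothetical saving of 0.5 ms of communication time per application iteration.
for cost_ms in (10, 100):
    print(cost_ms, break_even_iterations(cost_ms, saving_per_iteration_ms=0.5))
```

This suggests reconfiguration is best suited to communication patterns that stay stable over many iterations, rather than to message-by-message changes.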

Page 15: Scalable Reconfigurable Interconnects

How to use HFAST

Improved task-to-processor assignments, even at runtime
Migrate processes with little overhead
Adapt to changing communication requirements
Avoid defragmentation at the system level
Build an interconnect for each application
Avoid overprovisioning the communication resources

Page 16: Scalable Reconfigurable Interconnects

Processor allocation for adaptive applications

We obtain 41% of ideal and 53% of ideal hops savings.

Page 17: Scalable Reconfigurable Interconnects

Conclusions

Massive concurrencies of ultrascale machines will require new interconnects.
We cannot afford to overprovision the resources.
There is no magic solution that is good for all applications.
Flexibility or reconfigurability is necessary.

The technology for reconfigurable networks is available.

We need to
  reduce the resource requirements,
  design networks for typical workloads,
  design methods to build networks for a given application.

