Computational Infrastructure for Lattice Gauge Theory
LQCD Optimized Clusters
Chip Watson, Jefferson Lab
Motivation
• Moore's Law delivers increases in processor price/performance on the order of 60% per year
• A high volume market has driven the cost of CPUs and components extremely low, with newer components available every few months, allowing increased capability each year at constant investment
• Home video gaming has encouraged the development of multimedia extensions; these small vector processors on commodity processors deliver super-scalar performance, exceeding 8.2 Gflops sustained (single precision, on a cache-resident matrix multiply problem) on a 2.66 GHz Pentium 4 (a minimal sketch of this style of vectorization follows this list) – scaling this to a cluster is the challenge!
• Cluster interconnects are maturing, allowing ever larger clusters to be constructed from semi-commodity parts
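The Gflops figure above comes from the SSE vector unit on the Pentium 4. As a minimal, hedged illustration (not the benchmark code behind the 8.2 Gflops number), the sketch below shows the style of 4-wide single-precision SSE arithmetic on cache-resident data that such a benchmark relies on; the array size, alignment attribute, and data values are arbitrary choices for the example.

```c
/* Minimal sketch, NOT the benchmark behind the 8.2 Gflops figure:
 * single-precision multiply-accumulate over cache-resident arrays using
 * SSE intrinsics, four floats per instruction.  Array size and data are
 * arbitrary; compile with gcc -msse (or any SSE-capable x86 target). */
#include <xmmintrin.h>
#include <stdio.h>

#define N 4096                       /* small enough to stay in L2 cache */

static float a[N] __attribute__((aligned(16)));
static float b[N] __attribute__((aligned(16)));
static float c[N] __attribute__((aligned(16)));

int main(void)
{
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; c[i] = 0.0f; }

    /* c += a * b, four elements per SSE instruction */
    for (int i = 0; i < N; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);
        __m128 vb = _mm_load_ps(&b[i]);
        __m128 vc = _mm_load_ps(&c[i]);
        vc = _mm_add_ps(vc, _mm_mul_ps(va, vb));
        _mm_store_ps(&c[i], vc);
    }

    printf("c[0] = %.1f\n", c[0]);   /* expect 2.0 */
    return 0;
}
```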
Cluster Architectures
Clusters can be constructed from a number of components available on the market today…

Nodes:
– Single CPU or multi-processor
– Intel, AMD, PowerPC, … ; various motherboards & chipsets
– Various cache sizes – this is significant, because cache memory is much faster, and so cache-resident problems run much faster

Cluster Interconnect:
• Switched:
  – Commodity switched (100 Mb ethernet, gigE)
  – Semi-commodity switched (Myrinet, Quadrics, …)
• Mesh (not typically used, but well matched to LQCD):
  – Commodity cards used as a mesh (gigE)
  – Commodity chips on a custom card (FPGAs)
Cluster Architectures - 2
Optimal architecture changes with the market and with the size of machine desired.
Switched ethernet: OK for very small clusters (costs of large high-performance switches are excessive)
Myrinet: good solution for up to 128 nodes (soon 256); above this point, add 25% to cost
Gigabit ethernet mesh: 6 (3D) or 8 (4D) links per node yields higher bandwidth, but needs custom software (1st version done, optimizations remain)
Other options will certainly emerge in time (e.g. Infiniband). FNAL is also evaluating an Orca FPGA board from a data acquisition system development: 8 × 2 Gbps links.
LQCD Optimized Clusters
Why is this different from NERSC, NCSA, …?
• Extremely low memory requirements (save 10%)
• Low disk size and file I/O requirements (save 10%)
• Don't buy leading edge processors: our interest is greatest science / dollar (capacity) at the teraflops scale, not greatest single box capability; thus, defer Itanium-2 deployment for now (save 10% - 50%)
• Be sensitive to non-linear scaling of applications on clusters, and stop at an appropriate trade-off point, deploying several clusters if necessary (save 10% - 50%)
• Deploy mesh-optimized high bandwidth networks, taking advantage of LQCD's regular mesh communications pattern (save 20% - 30%)
• Result: science / dollar that is better by a factor of 2 - 5!
Cluster Strategy
Clusters allow us to take advantage of the very latest developments in processor design, memory sub-systems, and interconnect technology. However…
– While CPUs (and cache speed) accelerate at 60% / year (Moore's Law), memory speed generally advances less rapidly and with fewer discrete steps, ~40% / year => the performance ratio of in-cache to out-of-cache is growing
– Implications: we want to run as many applications in cache as possible (2x - 4x gain today)
  => a large cluster used for a single application
  => very high message rates (> 10 kHz!) due to the high performance
– Interconnects track external bus speeds, and server class motherboards will support processor evolutions for the next 2-3 years (PCI-X, PCI-Express)
– SciDAC software (QMP: QCD Message Passing) isolates applications from changes in cluster interconnects (the communication pattern it abstracts is sketched below)
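QMP's own API is not reproduced in these slides, so the sketch below uses plain MPI instead to illustrate the regular nearest-neighbor exchange pattern on a periodic 4-D process grid that QMP abstracts for LQCD codes. The buffer size and the single exchanged direction are illustrative assumptions; a real Dirac operator exchanges faces in all four directions every iteration, which is where the very high message rates come from.

```c
/* Sketch (plain MPI, not the QMP API itself) of the regular nearest-neighbor
 * exchange pattern that QMP abstracts: each node trades lattice faces with
 * its +x and -x neighbors on a periodic 4-D process grid. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int dims[4] = {0, 0, 0, 0};       /* let MPI factor the machine into 4-D */
    int periods[4] = {1, 1, 1, 1};    /* torus, as in the gigE mesh          */
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 4, dims);

    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 4, dims, periods, 1, &grid);

    int up, down;
    MPI_Cart_shift(grid, 0, 1, &down, &up);   /* neighbors in the x direction */

    double send_face[1024], recv_face[1024];  /* stand-in for a lattice face  */
    for (int i = 0; i < 1024; i++) send_face[i] = 1.0;

    /* exchange faces with both x neighbors; a real Dirac operator does this
       in all four directions on every application */
    MPI_Sendrecv(send_face, 1024, MPI_DOUBLE, up,   0,
                 recv_face, 1024, MPI_DOUBLE, down, 0,
                 grid, MPI_STATUS_IGNORE);
    MPI_Sendrecv(send_face, 1024, MPI_DOUBLE, down, 1,
                 recv_face, 1024, MPI_DOUBLE, up,   1,
                 grid, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```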
SciDAC Prototype Clusters
The SciDAC project is funding a sequence of cluster prototypes which allow us to track industry developments and trends, while also deploying critical compute resources.
Myrinet + Pentium 4
• 48 dual 2.0 GHz P4 at FNAL (Spring 2002)
• 128 single 2.0 GHz P4 at JLab (Summer 2002)
• 128 dual 2.4 GHz P4 at FNAL (Fall 2002)
Gigabit Ethernet Mesh + Pentium 4
• 256 (8x8x4) single 2.66 GHz P4 at JLab (now, 2003)
Additional Technology Evaluations at FNAL for 2003
• Itanium 2
• AMD Opteron
• IBM PowerPC 970
• Infiniband switch
FNAL Clusters: Myrinet cluster [photos]
Jefferson Lab Clusters: Myrinet cluster and gigE mesh [photos]
Industry Processor Trends
Pentium 4
• 3.2 GHz Prescott (0.09 micron process) soon, 2003:
  – 800 MHz front side bus: 67% above 533 MHz P4 performance
  – 1 MB cache: 2x the size of the current P4
  – lower power for higher clock speed: lower operating costs / Mflops
  – anticipate $1 / Mflops up to 1 Teraflops (Domain Wall, single precision)
• 3.2 GHz Nocona (0.09 micron Xeon) late 2003 / early 2004:
  – Xeon version of Prescott, commodity dual processor (cost effectiveness of 2nd processor to be determined, promised to be better)
  – 800 MHz FSB ("Lindenhurst") in 1H 2004
• 4-5 GHz Pentium 4's in the pipeline, including a 2x faster bus
IA-64 / Itanium 2
– too expensive for at least the next 1-2 years
– today adds $1000+ per CPU, but may allow some larger problems to run 2x faster in the larger cache (to be studied in the coming year)
AMD
• Opteron
  – high I/O bandwidth potential in 3 links, could make quads commodity
  – NUMA architecture offers higher bandwidth for multi-processors
Blade Architectures
– higher packing density, potentially lower cost
– I/O limitations (typically 1-2 connections), may be adequate in 2nd generation systems using Infiniband (if costs fall)
Industry I/O Trends
Recent trend:
• I/O bus: PCI-32, PCI-64, PCI-X (1 GByte/sec); PCI-Express will continue this trend with a 4x bandwidth boost in 2004
• Links: 100 Mb ethernet, Myrinet, gigE, Myrinet-2000, Infiniband
Current situation:
• Server class motherboards have 2-3 PCI-X buses (home market motherboards don't yet have PCI-X!)
• Myrinet is a good solution for up to 500 MB/s total bandwidth (single PCI-X card), but cost is high
• GigE appears to be a better solution for higher bandwidth (6-8 bi-directional links spread across 2-3 PCI-X buses); the limit is the software overhead for communications (trading performance against cost)
• Infiniband 4X delivers 800 MB/sec; 12X is coming (needs the faster PCI-Express bus, or needs to bypass PCI); in some time frame it may become the most cost effective (rough bandwidth bookkeeping follows below)
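A quick, hedged bit of bandwidth bookkeeping for the options above. The per-link and per-bus figures are the nominal numbers quoted in this talk (and a nominal ~125 MB/s per gigE link); real usable rates are lower once protocol and driver overheads are paid.

```c
/* Quick bandwidth bookkeeping for the interconnect options above.  The
 * per-link and per-bus figures are nominal numbers, not measured rates. */
#include <stdio.h>

int main(void)
{
    struct { const char *option; int links; double mb_per_s; } opt[] = {
        { "Myrinet (one PCI-X card)",   1, 500.0 },
        { "GigE mesh, 6 links (3D)",    6, 125.0 },
        { "GigE mesh, 8 links (4D)",    8, 125.0 },
        { "Infiniband 4X",              1, 800.0 },
    };
    const double pci_x_mb_per_s = 1000.0;   /* ~1 GB/s per PCI-X bus */

    for (int i = 0; i < 4; i++) {
        double total = opt[i].links * opt[i].mb_per_s;
        printf("%-26s ~%5.0f MB/s aggregate = %.2f PCI-X buses worth\n",
               opt[i].option, total, total / pci_x_mb_per_s);
    }
    return 0;
}
```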
Cluster Scalability
A key challenge for clusters is scaling to large systems & problems, particularly to exploit cache performance.
• PCI 32/33 runs out of steam at around 128 nodes; systems of modest size can be built today which achieve < $1 / Mflops (Fodor)
• PCI-X based systems are more expensive, but more scalable; price/performance is worse (though it is falling like Moore's law), but larger systems are possible. Currently worse by a factor of 2 for small systems.
• High performance network costs are significant, $1400+ / node for Myrinet. Myrinet is capable of ~490 MB/s (245 each way); Infiniband is twice this (800 MB/sec), but is 20% more expensive.
  – This bandwidth would support clusters of up to 512 - 1024 boxes with good efficiency on lattice sizes of high interest today
  – A cluster of this size could run some problems in cache, with performance at the level of multiple teraflops (see the rough arithmetic below)
• Easiest way to scale: multiple machines!
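The "multiple teraflops" claim follows from simple aggregate arithmetic, shown below. The assumed per-node rate (~2.5 Gflops sustained, in cache, single precision) is an illustrative figure consistent with the per-node numbers quoted elsewhere in this talk, not a guarantee for every application.

```c
/* Rough aggregate-performance arithmetic for the cluster sizes above.
 * The per-node rate is an illustrative assumption. */
#include <stdio.h>

int main(void)
{
    const double per_node_gflops = 2.5;     /* assumed in-cache sustained rate */
    const int    nodes[] = { 128, 256, 512, 1024 };

    for (int i = 0; i < 4; i++)
        printf("%4d nodes x %.1f Gflops/node ~ %.2f Tflops aggregate\n",
               nodes[i], per_node_gflops, nodes[i] * per_node_gflops / 1000.0);
    return 0;
}
```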
Performance: 2002
[Chart: Itanium 2, MILC code, SU3 (preliminary). Mflops vs lattice size per run (2x2x2x2 up to 12x12x12x12), y-axis scale 0 - 2500 Mflops. Curves: SMP, 1 CPU; MPI, 1 CPU; MPI, 2 CPU (same node); MPI, 2 CPU (different nodes, lowfat); MPI, 2 CPU (different nodes, normal).]
Implications
1. Performance of N nodes can exceed N times the performance of a single node (super-linear speed-up), as the problem is divided to the point where it fits into cache.
2. Price/performance on clusters will therefore be a function of problem size (or cluster size).
3. The current "sweet spot" local volume per processor is 4^4 for the Pentium 4 (soon to be 4^3 x 8), and 4 x 8^3 for the Itanium 2 (2 MB cache). Hence, larger physics problems run better on larger machines. (A rough cache-footprint check for these volumes follows below.)
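A back-of-envelope check of why those local volumes are cache "sweet spots": the per-site storage counts below are illustrative assumptions for Wilson-like fields in single precision, not numbers from the talk, but they show a 4^4 local volume fitting comfortably in a current Pentium 4 cache while 4 x 8^3 calls for the larger Itanium 2 cache.

```c
/* Back-of-envelope check on the "sweet spot" local volumes.  The per-site
 * storage counts are illustrative assumptions (Wilson-like fields, single
 * precision), not numbers from the talk. */
#include <stdio.h>

int main(void)
{
    const double bytes_per_link   = 18 * 4;   /* 3x3 complex matrix, single prec. */
    const double bytes_per_spinor = 24 * 4;   /* 4 spins x 3 colors, complex      */
    const double bytes_per_site   = 4 * bytes_per_link     /* gauge links         */
                                  + 3 * bytes_per_spinor;  /* src, dst, temp      */

    const int  dims[][4] = { {4,4,4,4}, {4,4,4,8}, {4,8,8,8} };
    const char *labels[] = { "4^4    ", "4^3 x 8", "4 x 8^3" };

    for (int v = 0; v < 3; v++) {
        long sites = 1;
        for (int d = 0; d < 4; d++) sites *= dims[v][d];
        printf("%s : %4ld sites, ~%4.0f KB working set\n",
               labels[v], sites, sites * bytes_per_site / 1024.0);
    }
    /* Compare to ~512 KB (current P4), 1 MB (Prescott), 2 MB (Itanium 2). */
    return 0;
}
```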
Cluster Performance Measurements & Model
Measured performance of the Dirac operator for a single node, a 1D mesh (ring), a 2D mesh (toroid), and a 3D mesh. Dashed lines are from a model calculation, and show good agreement.
[Chart: MFlop/s vs LQCD problem size per processor (local volumes from 2*2*2*2 up to 16^4) for single node, 1D, 2D, and 3D meshes, with the corresponding model curves overlaid. Regions marked on the chart: network I/O limited (small local volumes), in cache (optimum performance), and memory bandwidth limited (large local volumes). Y-axis scale 0 - 3000 MFlop/s.]
Performance Model Extrapolations
This model is empirical, with 2 I/O overhead parameters tweaked to yield the best match – these should ideally be measured independently. Extrapolations of the model are useful for planning purposes (running scenarios with faster processors, faster memory, faster links…).
• The cluster performance model includes many factors:
  – Lattice size (bigger is more efficient for the network, smaller allows faster processing in cache)
  – Processor speed
  – Memory bandwidth (affects the efficiency of the Nth processor)
  – Cache size
  – Link bandwidth
  – Link latency (mostly for global sums)
• Assumptions for extrapolations:
  – Moore's Law (60% processor improvement per year)
  – 2x step changes in link speed about every 2 years, achieving 50% of bus bandwidth (PCI-X, PCI-Express)
  – Quad processor servers become "commodity" by FY05
(A minimal sketch of a model of this general form follows below.)
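The tuned model itself is not published in these slides; the sketch below is a minimal stand-in showing the general form of such a model: the per-node rate is the minimum of a compute-limited and a network-limited rate, with the compute rate dropping to a memory-bandwidth-limited value once the local volume spills out of cache. Every constant is an illustrative assumption, not a parameter from the talk.

```c
/* Minimal stand-in for an empirical cluster performance model of the kind
 * described above.  Every constant is an illustrative assumption. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* assumed machine parameters */
    const double peak_mflops   = 2500.0;    /* cache-resident rate               */
    const double mem_mflops    =  870.0;    /* rate when streaming from memory   */
    const double link_bytes_s  =  250.0e6;  /* usable link bandwidth, one way    */
    const double cache_bytes   =  512.0e3;  /* L2 cache size                     */

    /* assumed Dirac-operator cost model (single precision) */
    const double flops_per_site      = 1300.0; /* arithmetic per lattice site    */
    const double bytes_per_site      =  600.0; /* resident working set per site  */
    const double bytes_per_face_site =   48.0; /* half-spinor sent per face site */

    for (int L = 2; L <= 16; L *= 2) {
        double sites   = pow(L, 4);
        double surface = 8.0 * pow(L, 3);     /* two faces in each of 4 dims     */

        /* compute-limited rate: full speed only if local volume fits in cache  */
        double compute = (sites * bytes_per_site <= cache_bytes) ? peak_mflops
                                                                 : mem_mflops;
        /* network-limited rate: flops done per byte shipped, times link speed  */
        double network = flops_per_site * sites * link_bytes_s
                       / (surface * bytes_per_face_site) / 1.0e6;

        printf("local volume %2d^4: ~%4.0f Mflops/node  (compute %4.0f, network %5.0f)\n",
               L, compute < network ? compute : network, compute, network);
    }
    return 0;
}
```

Even with these made-up constants, the shape matches the measured chart: network-limited at very small local volumes, an in-cache optimum around 4^4, and memory-bandwidth-limited behavior at larger volumes. Link latency (important for global sums) is omitted here but belongs in a full model.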
[Chart: Performance per Dollar for Typical LQCD Applications. Mflops / $ (log scale, 10^-2 to 10^1) vs year (1990 - 2010). Labeled points include QCDSP, QCDOC (the Columbia and BNL machines), vector supercomputers (including the Japanese Earth Simulator), the JLab SciDAC prototype cluster, and a 1st simulation of a super cluster. Cluster design goals noted on the chart: commodity compute nodes (leverage the marketplace & Moore's law); large enough to run the problem in cache; low latency, high bandwidth network to exploit full I/O capability (& keep up with cache performance).]
Note: QCDOC is more scalable in 2004 than clusters, and delivers superior double precision performance
Four Year Plan
2003
– 256 node 8x8x4 gigabit ethernet mesh @ JLab
– (256?) node @ FNAL (Infiniband, alternate processor? – tbd)
2004
– Additional 128 to 256 node prototypes at JLab and FNAL to explore the latest options, possibly including a custom NIC
2005
– Large clusters of scale 3-4 Tflops
– Reference machines: 8x8x16 gigabit ethernet mesh, or 1024 node Infiniband 12X switch; 4-way SMP Xeon (800 MHz FSB, 1.25 MB cache, 4 GHz dual processor core)
– Expect $0.40 to $0.60 / Mflops on Domain Wall
2006
– Large clusters of scale 5-6 Tflops
Summary
• Optimized LQCD Clusters appear to be a very cost effective solution for LQCD.
• The National LQCD Collaboration is effectively advancing the development of large, specialized LQCD Clusters as part of its long-term strategy.