Computational Infrastructure for Lattice Gauge Theory
LQCD Optimized Clusters
Chip Watson, Jefferson Lab
Motivation
• Moore's Law delivers increases in processor price/performance on the order of 60% per year
• A high volume market has driven the cost of CPUs and components extremely low, with newer components available every few months, allowing increased capability each year at constant investment
• Home video gaming has encouraged the development of multimedia extensions; these small vector processors on commodity processors deliver super-scalar performance, exceeding 8.2 Gflops sustained (single precision, on a cache-resident matrix multiply problem) on a 2.66 GHz Pentium 4 (a minimal sketch of this style of vectorization follows this list) – scaling this to a cluster is the challenge!
• Cluster interconnects are maturing, allowing ever larger clusters to be constructed from semi-commodity parts
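The Gflops figure above comes from the SSE vector unit on the Pentium 4. As a minimal, hedged illustration (not the benchmark code behind the 8.2 Gflops number), the sketch below shows the style of 4-wide single-precision SSE arithmetic on cache-resident data that such a benchmark relies on; the array size, alignment attribute, and data values are arbitrary choices for the example.

```c
/* Minimal sketch, NOT the benchmark behind the 8.2 Gflops figure:
 * single-precision multiply-accumulate over cache-resident arrays using
 * SSE intrinsics, four floats per instruction.  Array size and data are
 * arbitrary; compile with gcc -msse (or any SSE-capable x86 target). */
#include <xmmintrin.h>
#include <stdio.h>

#define N 4096                       /* small enough to stay in L2 cache */

static float a[N] __attribute__((aligned(16)));
static float b[N] __attribute__((aligned(16)));
static float c[N] __attribute__((aligned(16)));

int main(void)
{
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; c[i] = 0.0f; }

    /* c += a * b, four elements per SSE instruction */
    for (int i = 0; i < N; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);
        __m128 vb = _mm_load_ps(&b[i]);
        __m128 vc = _mm_load_ps(&c[i]);
        vc = _mm_add_ps(vc, _mm_mul_ps(va, vb));
        _mm_store_ps(&c[i], vc);
    }

    printf("c[0] = %.1f\n", c[0]);   /* expect 2.0 */
    return 0;
}
```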
Cluster Architectures
Clusters can be constructed from a number of components available on the market today…

Nodes:
– Single CPU or multi-processor
– Intel, AMD, PowerPC, … ; various motherboards & chipsets
– Various cache sizes – this is significant, because cache memory is much faster, and so cache-resident problems run much faster

Cluster Interconnect:
• Switched:
  – Commodity switched (100 Mb ethernet, gigE)
  – Semi-commodity switched (Myrinet, Quadrics, …)
• Mesh (not typically used, but well matched to LQCD):
  – Commodity cards used as a mesh (gigE)
  – Commodity chips on a custom card (FPGAs)
Cluster Architectures - 2
Optimal architecture changes with the market and with the size of machine desired.
Switched ethernet: OK for very small clusters (costs of large high-performance switches are excessive)
Myrinet: good solution for up to 128 nodes (soon 256); above this point, add 25% to cost
Gigabit ethernet mesh: 6 (3D) or 8 (4D) links per node yields higher bandwidth, but needs custom software (1st version done, optimizations remain)
Other options will certainly emerge in time (e.g. Infiniband). FNAL is also evaluating an Orca FPGA board from a data acquisition system development: 8 × 2 Gbps links.
LQCD Optimized Clusters
Why is this different from NERSC, NCSA, …?
• Extremely low memory requirements (save 10%)
• Low disk size and file I/O requirements (save 10%)
• Don't buy leading edge processors: our interest is greatest science / dollar (capacity) at the teraflops scale, not greatest single box capability; thus, defer Itanium-2 deployment for now (save 10% - 50%)
• Be sensitive to non-linear scaling of applications on clusters, and stop at an appropriate trade-off point, deploying several clusters if necessary (save 10% - 50%)
• Deploy mesh-optimized high bandwidth networks, taking advantage of LQCD's regular mesh communications pattern (save 20% - 30%)
• Result: science / dollar that is better by a factor of 2 - 5!
Cluster Strategy
Clusters allow us to take advantage of the very latest developments in processor design, memory sub-systems, and interconnect technology. However…
– While CPUs (and cache speed) accelerate at 60% / year (Moore's Law), memory speed generally advances less rapidly and with fewer discrete steps, ~40% / year => the performance ratio of in-cache to out-of-cache is growing
– Implications: we want to run as many applications in cache as possible (2x - 4x gain today)
  => a large cluster used for a single application
  => very high message rates (> 10 kHz!) due to the high performance
– Interconnects track external bus speeds, and server class motherboards will support processor evolutions for the next 2-3 years (PCI-X, PCI-Express)
– SciDAC software (QMP: QCD Message Passing) isolates applications from changes in cluster interconnects (the communication pattern it abstracts is sketched below)
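QMP's own API is not reproduced in these slides, so the sketch below uses plain MPI instead to illustrate the regular nearest-neighbor exchange pattern on a periodic 4-D process grid that QMP abstracts for LQCD codes. The buffer size and the single exchanged direction are illustrative assumptions; a real Dirac operator exchanges faces in all four directions every iteration, which is where the very high message rates come from.

```c
/* Sketch (plain MPI, not the QMP API itself) of the regular nearest-neighbor
 * exchange pattern that QMP abstracts: each node trades lattice faces with
 * its +x and -x neighbors on a periodic 4-D process grid. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int dims[4] = {0, 0, 0, 0};       /* let MPI factor the machine into 4-D */
    int periods[4] = {1, 1, 1, 1};    /* torus, as in the gigE mesh          */
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 4, dims);

    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 4, dims, periods, 1, &grid);

    int up, down;
    MPI_Cart_shift(grid, 0, 1, &down, &up);   /* neighbors in the x direction */

    double send_face[1024], recv_face[1024];  /* stand-in for a lattice face  */
    for (int i = 0; i < 1024; i++) send_face[i] = 1.0;

    /* exchange faces with both x neighbors; a real Dirac operator does this
       in all four directions on every application */
    MPI_Sendrecv(send_face, 1024, MPI_DOUBLE, up,   0,
                 recv_face, 1024, MPI_DOUBLE, down, 0,
                 grid, MPI_STATUS_IGNORE);
    MPI_Sendrecv(send_face, 1024, MPI_DOUBLE, down, 1,
                 recv_face, 1024, MPI_DOUBLE, up,   1,
                 grid, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```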
SciDAC Prototype Clusters
The SciDAC project is funding a sequence of cluster prototypes which allow us to track industry developments and trends, while also deploying critical compute resources.
Myrinet + Pentium 4
• 48 dual 2.0 GHz P4 at FNAL (Spring 2002)
• 128 single 2.0 GHz P4 at JLab (Summer 2002)
• 128 dual 2.4 GHz P4 at FNAL (Fall 2002)
Gigabit Ethernet Mesh + Pentium 4
• 256 (8x8x4) single 2.66 GHz P4 at JLab (now, 2003)
Additional Technology Evaluations at FNAL for 2003
• Itanium 2
• AMD Opteron
• IBM PowerPC 970
• Infiniband switch
FNAL Clusters: Myrinet cluster [photos]
Jefferson Lab Clusters: Myrinet cluster and gigE mesh [photos]
Industry Processor Trends
Pentium 4
• 3.2 GHz Prescott (0.09 micron process) soon, 2003:
  – 800 MHz front side bus: 67% above 533 MHz P4 performance
  – 1 MB cache: 2x the size of the current P4
  – lower power for higher clock speed: lower operating costs / Mflops
  – anticipate $1 / Mflops up to 1 Teraflops (Domain Wall, single precision)
• 3.2 GHz Nocona (0.09 micron Xeon) late 2003 / early 2004:
  – Xeon version of Prescott, commodity dual processor (cost effectiveness of 2nd processor to be determined, promised to be better)
  – 800 MHz FSB ("Lindenhurst") in 1H 2004
• 4-5 GHz Pentium 4's in the pipeline, including a 2x faster bus
IA-64 / Itanium 2
– too expensive for at least the next 1-2 years
– today adds $1000+ per CPU, but may allow some larger problems to run 2x faster in the larger cache (to be studied in the coming year)
AMD
• Opteron
  – high I/O bandwidth potential in 3 links, could make quads commodity
  – NUMA architecture offers higher bandwidth for multi-processors
Blade Architectures
– higher packing density, potentially lower cost
– I/O limitations (typically 1-2 connections), may be adequate in 2nd generation systems using Infiniband (if costs fall)
Industry I/O Trends
Recent trend:
• I/O bus: PCI-32, PCI-64, PCI-X (1 GByte/sec); PCI-Express will continue this trend with a 4x bandwidth boost in 2004
• Links: 100 Mb ethernet, Myrinet, gigE, Myrinet-2000, Infiniband
Current situation:
• Server class motherboards have 2-3 PCI-X buses (home market motherboards don't yet have PCI-X!)
• Myrinet is a good solution for up to 500 MB/s total bandwidth (single PCI-X card), but cost is high
• GigE appears to be a better solution for higher bandwidth (6-8 bi-directional links spread across 2-3 PCI-X buses); the limit is the software overhead for communications (trading performance against cost)
• Infiniband 4X delivers 800 MB/sec; 12X is coming (needs the faster PCI-Express bus, or needs to bypass PCI); in some time frame it may become the most cost effective (rough bandwidth bookkeeping follows below)
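A quick, hedged bit of bandwidth bookkeeping for the options above. The per-link and per-bus figures are the nominal numbers quoted in this talk (and a nominal ~125 MB/s per gigE link); real usable rates are lower once protocol and driver overheads are paid.

```c
/* Quick bandwidth bookkeeping for the interconnect options above.  The
 * per-link and per-bus figures are nominal numbers, not measured rates. */
#include <stdio.h>

int main(void)
{
    struct { const char *option; int links; double mb_per_s; } opt[] = {
        { "Myrinet (one PCI-X card)",   1, 500.0 },
        { "GigE mesh, 6 links (3D)",    6, 125.0 },
        { "GigE mesh, 8 links (4D)",    8, 125.0 },
        { "Infiniband 4X",              1, 800.0 },
    };
    const double pci_x_mb_per_s = 1000.0;   /* ~1 GB/s per PCI-X bus */

    for (int i = 0; i < 4; i++) {
        double total = opt[i].links * opt[i].mb_per_s;
        printf("%-26s ~%5.0f MB/s aggregate = %.2f PCI-X buses worth\n",
               opt[i].option, total, total / pci_x_mb_per_s);
    }
    return 0;
}
```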
Cluster Scalability
A key challenge for clusters is scaling to large systems & problems, particularly to exploit cache performance.
• PCI 32/33 runs out of steam at around 128 nodes; systems of modest size can be built today which achieve < $1 / Mflops (Fodor)
• PCI-X based systems are more expensive, but more scalable; price/performance is worse (though it is falling like Moore's law), but larger systems are possible. Currently worse by a factor of 2 for small systems.
• High performance network costs are significant, $1400+ / node for Myrinet. Myrinet is capable of ~490 MB/s (245 each way); Infiniband is twice this (800 MB/sec), but is 20% more expensive.
  – This bandwidth would support clusters of up to 512 - 1024 boxes with good efficiency on lattice sizes of high interest today
  – A cluster of this size could run some problems in cache, with performance at the level of multiple teraflops (see the rough arithmetic below)
• Easiest way to scale: multiple machines!
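The "multiple teraflops" claim follows from simple aggregate arithmetic, shown below. The assumed per-node rate (~2.5 Gflops sustained, in cache, single precision) is an illustrative figure consistent with the per-node numbers quoted elsewhere in this talk, not a guarantee for every application.

```c
/* Rough aggregate-performance arithmetic for the cluster sizes above.
 * The per-node rate is an illustrative assumption. */
#include <stdio.h>

int main(void)
{
    const double per_node_gflops = 2.5;     /* assumed in-cache sustained rate */
    const int    nodes[] = { 128, 256, 512, 1024 };

    for (int i = 0; i < 4; i++)
        printf("%4d nodes x %.1f Gflops/node ~ %.2f Tflops aggregate\n",
               nodes[i], per_node_gflops, nodes[i] * per_node_gflops / 1000.0);
    return 0;
}
```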
Performance: 2002
[Chart: Itanium 2, MILC code, SU3 (preliminary). Mflops vs lattice size per run (2x2x2x2 up to 12x12x12x12), y-axis scale 0 - 2500 Mflops. Curves: SMP, 1 CPU; MPI, 1 CPU; MPI, 2 CPU (same node); MPI, 2 CPU (different nodes, lowfat); MPI, 2 CPU (different nodes, normal).]
Implications
1. Performance of N nodes can exceed N times the performance of a single node (super-linear speed-up), as the problem is divided to the point where it fits into cache.
2. Price/performance on clusters will therefore be a function of problem size (or cluster size).
3. The current "sweet spot" local volume per processor is 4^4 for the Pentium 4 (soon to be 4^3 x 8), and 4 x 8^3 for the Itanium 2 (2 MB cache). Hence, larger physics problems run better on larger machines. (A rough cache-footprint check for these volumes follows below.)
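A back-of-envelope check of why those local volumes are cache "sweet spots": the per-site storage counts below are illustrative assumptions for Wilson-like fields in single precision, not numbers from the talk, but they show a 4^4 local volume fitting comfortably in a current Pentium 4 cache while 4 x 8^3 calls for the larger Itanium 2 cache.

```c
/* Back-of-envelope check on the "sweet spot" local volumes.  The per-site
 * storage counts are illustrative assumptions (Wilson-like fields, single
 * precision), not numbers from the talk. */
#include <stdio.h>

int main(void)
{
    const double bytes_per_link   = 18 * 4;   /* 3x3 complex matrix, single prec. */
    const double bytes_per_spinor = 24 * 4;   /* 4 spins x 3 colors, complex      */
    const double bytes_per_site   = 4 * bytes_per_link     /* gauge links         */
                                  + 3 * bytes_per_spinor;  /* src, dst, temp      */

    const int  dims[][4] = { {4,4,4,4}, {4,4,4,8}, {4,8,8,8} };
    const char *labels[] = { "4^4    ", "4^3 x 8", "4 x 8^3" };

    for (int v = 0; v < 3; v++) {
        long sites = 1;
        for (int d = 0; d < 4; d++) sites *= dims[v][d];
        printf("%s : %4ld sites, ~%4.0f KB working set\n",
               labels[v], sites, sites * bytes_per_site / 1024.0);
    }
    /* Compare to ~512 KB (current P4), 1 MB (Prescott), 2 MB (Itanium 2). */
    return 0;
}
```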
Cluster Performance Measurements & Model
Measured performance of the Dirac operator for a single node, a 1D mesh (ring), a 2D mesh (toroid), and a 3D mesh. Dashed lines are from a model calculation, and show good agreement.
[Chart: MFlop/s vs LQCD problem size per processor (local volumes from 2*2*2*2 up to 16^4) for single node, 1D, 2D, and 3D meshes, with the corresponding model curves overlaid. Regions marked on the chart: network I/O limited (small local volumes), in cache (optimum performance), and memory bandwidth limited (large local volumes). Y-axis scale 0 - 3000 MFlop/s.]
Performance Model Extrapolations
This model is empirical, with 2 I/O overhead parameters tweaked to yield the best match – these should ideally be measured independently. Extrapolations of the model are useful for planning purposes (running scenarios with faster processors, faster memory, faster links…).
• The cluster performance model includes many factors:
  – Lattice size (bigger is more efficient for the network, smaller allows faster processing in cache)
  – Processor speed
  – Memory bandwidth (affects the efficiency of the Nth processor)
  – Cache size
  – Link bandwidth
  – Link latency (mostly for global sums)
• Assumptions for extrapolations:
  – Moore's Law (60% processor improvement per year)
  – 2x step changes in link speed about every 2 years, achieving 50% of bus bandwidth (PCI-X, PCI-Express)
  – Quad processor servers become "commodity" by FY05
(A minimal sketch of a model of this general form follows below.)
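The tuned model itself is not published in these slides; the sketch below is a minimal stand-in showing the general form of such a model: the per-node rate is the minimum of a compute-limited and a network-limited rate, with the compute rate dropping to a memory-bandwidth-limited value once the local volume spills out of cache. Every constant is an illustrative assumption, not a parameter from the talk.

```c
/* Minimal stand-in for an empirical cluster performance model of the kind
 * described above.  Every constant is an illustrative assumption. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* assumed machine parameters */
    const double peak_mflops   = 2500.0;    /* cache-resident rate               */
    const double mem_mflops    =  870.0;    /* rate when streaming from memory   */
    const double link_bytes_s  =  250.0e6;  /* usable link bandwidth, one way    */
    const double cache_bytes   =  512.0e3;  /* L2 cache size                     */

    /* assumed Dirac-operator cost model (single precision) */
    const double flops_per_site      = 1300.0; /* arithmetic per lattice site    */
    const double bytes_per_site      =  600.0; /* resident working set per site  */
    const double bytes_per_face_site =   48.0; /* half-spinor sent per face site */

    for (int L = 2; L <= 16; L *= 2) {
        double sites   = pow(L, 4);
        double surface = 8.0 * pow(L, 3);     /* two faces in each of 4 dims     */

        /* compute-limited rate: full speed only if local volume fits in cache  */
        double compute = (sites * bytes_per_site <= cache_bytes) ? peak_mflops
                                                                 : mem_mflops;
        /* network-limited rate: flops done per byte shipped, times link speed  */
        double network = flops_per_site * sites * link_bytes_s
                       / (surface * bytes_per_face_site) / 1.0e6;

        printf("local volume %2d^4: ~%4.0f Mflops/node  (compute %4.0f, network %5.0f)\n",
               L, compute < network ? compute : network, compute, network);
    }
    return 0;
}
```

Even with these made-up constants, the shape matches the measured chart: network-limited at very small local volumes, an in-cache optimum around 4^4, and memory-bandwidth-limited behavior at larger volumes. Link latency (important for global sums) is omitted here but belongs in a full model.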
[Chart: Performance per Dollar for Typical LQCD Applications. Mflops / $ (log scale, 10^-2 to 10^1) vs year (1990 - 2010). Labeled points include QCDSP, QCDOC (the Columbia and BNL machines), vector supercomputers (including the Japanese Earth Simulator), the JLab SciDAC prototype cluster, and a 1st simulation of a super cluster. Cluster design goals noted on the chart: commodity compute nodes (leverage the marketplace & Moore's law); large enough to run the problem in cache; low latency, high bandwidth network to exploit full I/O capability (& keep up with cache performance).]
Note: QCDOC is more scalable in 2004 than clusters, and delivers superior double precision performance
Four Year Plan
2003
– 256 node 8x8x4 gigabit ethernet mesh @ JLab
– (256?) node @ FNAL (Infiniband, alternate processor? – tbd)
2004
– Additional 128 to 256 node prototypes at JLab and FNAL to explore the latest options, possibly including a custom NIC
2005
– Large clusters of scale 3-4 Tflops
– Reference machines: 8x8x16 gigabit ethernet mesh, or 1024 node Infiniband 12X switch; 4-way SMP Xeon (800 MHz FSB, 1.25 MB cache, 4 GHz dual processor core)
– Expect $0.40 to $0.60 / Mflops on Domain Wall
2006
– Large clusters of scale 5-6 Tflops
Summary
• Optimized LQCD Clusters appear to be a very cost effective solution for LQCD.
• The National LQCD Collaboration is effectively advancing the development of large, specialized LQCD Clusters as part of its long-term strategy.