A Strategy for the Future of High Performance Computing?
Advanced Computing Laboratory, Los Alamos National Laboratory
Pete Beckman
Observations: The US Supercomputing Industry
All US high-performance vendors are building clusters of SMPs (with the exception of Tera)
Each company (IBM, SGI, Compaq, HP, and Sun) has a different version of Unix
Each company attempts to scale system software designed for database, internet, and technical servers
This fractured market forces five different parallel file systems, fast messaging implementations, etc.
Supercomputer companies tend to go out of business
New Limitations
People used to say: “The number of Tflops available is limited only by the amount of money you wish to spend.”
The reality: we are at a point where our ability to build machines from components exceeds our ability to administer, program, and run them
But we do it anyway. Many large clusters are being installed...
Scalable System Software is currently the weak link
Software for Tflop clusters of SMPs is hard:
System administration, configuration, booting, management, and monitoring
Scalable smart-NIC messaging (zero copy)
Cluster/global/parallel file systems
Job queuing and running
I/O (scratch, prefetch, NASD)
Fault tolerance and on-the-fly reconfiguration
Why use Linux for clusters of SMPs, and as a basis for system software research?
The OS for scalable clusters needs more research
Open Source! (it’s more than just geek chic)
No lawyers, no NDAs, no worries mate!
Visible code improves faster
The whole environment, or just the mods, can be distributed
Scientific collaboration is just a URL away...
Small, well-designed, stable, mature kernel: ~240K lines of code without device drivers
/proc filesystem and dynamically loadable modules (see the module sketch after this list)
The OS is extendable, optimizable, tunable
Linux is a lot of fun (Shagadelic, Baby!)
Did I mention no lawyers?
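To make that extendability concrete, here is a minimal loadable-module sketch in the classic 2.2-era style (init_module/cleanup_module entry points); the file name and printk text are our own illustration, not code from any real tool:

    /* hello_mod.c -- minimal 2.2-era loadable kernel module (illustrative).
       Build against the kernel headers; load with insmod, remove with rmmod. */
    #include <linux/module.h>
    #include <linux/kernel.h>

    int init_module(void)
    {
        printk("<1>hello_mod: loaded, the kernel is now extended\n");
        return 0;               /* a nonzero return would abort the load */
    }

    void cleanup_module(void)
    {
        printk("<1>hello_mod: unloaded\n");
    }

This open, module-based extension path is exactly what lets a lab add HPC-specific features without waiting on (or negotiating with) a vendor.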
Isn’t Open Source hype? Do you really need it?
A very quick example
Supermon and Superview: High Performance Cluster Monitoring Tools
Ron Minnich, Karen Reid, Matt Sottile
The problem: get really fast stats from a very large cluster
Monitor hundreds of nodes at rates up to 100 Hz
Monitor at 10 Hz without significant impact on the application
Monitor hardware performance counters
Collect a wide range of kernel information (disk blocks, memory, interrupts, etc.)
Solution
Modify the kernel so all the parameters can be grabbed without going through /proc
Tightly coupled clusters can get real-time monitoring stats
This is not of general use to the desktop and web-server markets
Stats for 100 nodes take about 20 ms
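To see why bypassing /proc matters at these rates, here is a minimal timing sketch (our own illustration, not Supermon code): it re-reads /proc/stat in a loop and reports the average cost of one text-formatted sample, i.e., the per-sample overhead that Supermon's kernel modification avoids.

    /* proc_cost.c -- time repeated reads of /proc/stat (illustrative). */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    int main(void)
    {
        char buf[8192];
        struct timeval t0, t1;
        int i, fd, iters = 1000;
        double us;

        gettimeofday(&t0, NULL);
        for (i = 0; i < iters; i++) {
            fd = open("/proc/stat", O_RDONLY);  /* open/read/close per    */
            if (fd < 0) { perror("open"); return 1; }
            read(fd, buf, sizeof(buf));         /* sample, plus in-kernel */
            close(fd);                          /* text formatting, every */
        }                                       /* single time            */
        gettimeofday(&t1, NULL);
        us = ((t1.tv_sec - t0.tv_sec) * 1e6
              + (t1.tv_usec - t0.tv_usec)) / iters;
        printf("average /proc/stat sample: %.1f us\n", us);
        return 0;
    }

Multiply that per-sample cost by hundreds of nodes at 10-100 Hz and the text-parsing path becomes the bottleneck, which is the case for a binary, kernel-resident interface.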
Superview: the Java tool for Supermon
Where should we concentrate our efforts?
Some areas for improvement….
Scalable Linux System Software
Software: The hard part
Linux environments (page 1)
Compilers:
F90 (PGI, Absoft, Compaq)
F77 (GNU, PGI, Absoft, Compaq, Fujitsu)
HPF (PGI, Compaq?)
C/C++ (PGI, KAI, GNU, Compaq, Fujitsu)
OpenMP (PGI)
Metrowerks CodeWarrior for C, C++ (Fortran?)
Debuggers:
Totalview… maybe, real soon now, almost?
gdb, DDD, etc.
Software: The hard part
Linux environments (page 2)
Message Passing:
MPICH, PVM, MPI (MSTI), Nexus
OS Bypass: ST, FM, AM, PM, GM, VIA, Portals, etc.
Fast Interconnects: Myrinet, GigE, HiPPI, SCI
Shared Memory Programming:
Pthreads, Tulip-Threads, etc.
Parallel Performance Tools:
TAU, Vampir, PGI PGProf, Jumpshot, etc.
Software: The hard part
Linux environments (page 3)
File Systems & I/O:
e2fs (native), NFS
PVFS, Coda, GFS
MPI-IO, ROMIO (see the sketch below)
Archival Storage:
HPSS & ADSM clients
Job Control:
LSF, PBS, Maui
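Of the I/O options above, MPI-IO (as implemented by ROMIO) is the portable programming interface; here is a minimal collective-write sketch (our own illustration, with a made-up file name) in which every rank writes its own block of one shared scratch file:

    /* mpiio_write.c -- each rank writes one block of a shared file. */
    #include <mpi.h>

    #define N 1024

    int main(int argc, char **argv)
    {
        int rank, i, buf[N];
        MPI_File fh;
        MPI_Offset off;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (i = 0; i < N; i++)
            buf[i] = rank;                   /* fill with this rank's id */

        /* Collective open and collective write: the MPI-IO layer can
           coalesce per-rank blocks into large, well-aligned requests. */
        MPI_File_open(MPI_COMM_WORLD, "scratch.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        off = (MPI_Offset)rank * N * sizeof(int);
        MPI_File_write_at_all(fh, off, buf, N, MPI_INT, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }

The collective call is the point: it gives the library the global picture it needs to drive a parallel file system efficiently, which a per-process write() cannot.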
Software: The hard part
Linux environments (page 4)
Libraries and Frameworks:
BLAS, OVERTURE, POOMA, Atlas
Alpha math libraries (Compaq)
System Administration:
Building and booting tools
Cfengine
Monitoring and management tools
Configuration database
SGI Project Accounting
Software for Linux clusters
A report card (current status)
Compilers: A
Parallel debuggers: I
Message passing: A-
Shared memory prog.: A
Parallel performance tools: C+
File Systems: D
Archival Storage: C
Job Control: B-
Math Libraries: B
Summary of the most important areas
First Priority:
Cluster management, administration, images, monitoring, etc.
Cluster/parallel/global file systems
Continued work on scalable messaging
Faster, more scalable SMP
Virtual memory optimized for HPC
TCP/IP improvements
Wish List:
NIC boot, BIOS NVRAM, serial console
OS bypass standards in the kernel
Tightly-coupled scheduling, accounting
Newest drivers
Honest cluster costs: publish the numbers
How many sysadmins and programmers are required for support?
What are the service and replacement costs?
How much was hardware integration?
How many users can you support, and at what levels?
How much was the hardware?
Tera-Scale SMP Cluster Architecture
[Diagram: units of compute nodes on a gigabit multistage interconnection fabric, with Gigabit Ethernet, control nodes, and network-attached secure disks.]
Example Cluster Costs
Compute Nodes 59%, Myrinet 30%, Ethernet 4%, Control Fabric 4%, Misc 3%
Linux/Myrinet Cluster Cost Model
[Chart: cost (in thousands) versus cluster size in dual-CPU nodes, 0 to 2048; series: Compute Nodes, People (yearly: few projects), People (yearly: many projects), Compile/RAID Servers, Grand Total, Hardware.]
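For a back-of-the-envelope version of such a model, here is a sketch that applies the percentage breakdown from the cost slide above; the total hardware budget is a placeholder, not a published figure:

    /* cost_model.c -- toy hardware-cost breakdown (illustrative).
       Percentages are from the "Example Cluster Costs" slide; the
       $3M total is a hypothetical placeholder. */
    #include <stdio.h>

    int main(void)
    {
        double total = 3000.0;   /* placeholder budget, in thousands */
        printf("Compute nodes : %8.0f\n", total * 0.59);
        printf("Myrinet       : %8.0f\n", total * 0.30);
        printf("Ethernet      : %8.0f\n", total * 0.04);
        printf("Control fabric: %8.0f\n", total * 0.04);
        printf("Misc          : %8.0f\n", total * 0.03);
        return 0;
    }

Note what the pie chart omits and the cost-model chart adds: the people lines, which can rival the hardware lines as the cluster grows.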
Let someone else put it together
Compaq, Dell, Penguin Computing, Alta Tech, VA Linux, DCG, Paralogic, Microway
Ask about support
Cluster Benchmarking: Lies, Damn Lies, and the Top500
Make MPI zero-byte messaging a special case (improves latency numbers; see the ping-pong sketch below)
Convert multiply flops to addition, recount flops
Hire a Linpack consultant to help you achieve “the number” the vendor promised
“We unloaded the trucks, and 24 hours later we calculated the size of the galaxy in acres.”
For $15K and 3 rolls of duct tape I built a supercomputer in my cubicle…
Vendor-published Linpack, latency, and bandwidth numbers are worthless
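The zero-byte trick is easy to see in the standard ping-pong microbenchmark; here is a minimal sketch (our own, not any vendor's code). An MPI implementation that special-cases count == 0 can quote an impressive figure from this loop while gaining real applications nothing, so latency should be reported across a range of message sizes:

    /* pingpong.c -- 0-byte round-trip latency between ranks 0 and 1. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, i, iters = 10000;
        char buf[1];
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < iters; i++) {
            if (rank == 0) {        /* note the count of 0: no payload */
                MPI_Send(buf, 0, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, 0, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, 0, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, 0, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("0-byte one-way latency: %.2f us\n",
                   (t1 - t0) / (2.0 * iters) * 1e6);
        MPI_Finalize();
        return 0;
    }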
Plug-in Framework for Cluster Benchmarks
MPI Message Matching
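As a rough sketch of the kind of test a message-matching benchmark runs (our illustration, not the framework's actual code): rank 0 posts a long queue of receives and rank 1 sends in reverse tag order, so every incoming message must be searched against the posted-receive queue, exposing the cost of the MPI library's matching logic:

    /* matching.c -- stress the posted-receive matching queue. */
    #include <mpi.h>
    #include <stdio.h>

    #define NREQ 1000

    int main(int argc, char **argv)
    {
        int rank, i, buf[NREQ];
        MPI_Request req[NREQ];
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* Post a long queue of receives, one distinct tag each. */
            for (i = 0; i < NREQ; i++)
                MPI_Irecv(&buf[i], 1, MPI_INT, 1, i,
                          MPI_COMM_WORLD, &req[i]);
            t0 = MPI_Wtime();
            MPI_Waitall(NREQ, req, MPI_STATUSES_IGNORE);
            t1 = MPI_Wtime();   /* rough timing; a sketch, not a metric */
            printf("matched %d messages in %.3f ms\n",
                   NREQ, (t1 - t0) * 1e3);
        } else if (rank == 1) {
            /* Send in reverse tag order: each message traverses the
               posted-receive queue before it finds its match. */
            for (i = NREQ - 1; i >= 0; i--)
                MPI_Send(&i, 1, MPI_INT, 0, i, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }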
Price Performance of Single Myrinet Nodes Compared to Dual SMP Myrinet Nodes
[Chart: price-performance ratio versus processor count (0 to 60) for the CG, EP, LU, IS, and SP benchmarks; regions of the plot are labeled “Dual SMP nodes have better price-performance” and “Single CPU nodes have better price-performance.”]
Price Performance: 100BT compared to Myrinet
[Chart: ratio of price-performance (100BT / Myrinet) versus processor count (0 to 60) for the IS, EP, CG, and MG benchmarks; regions of the plot are labeled “100BT nodes have better price-performance” and “Myrinet nodes have better price-performance.”]
Conclusions
Lots of Linux clusters will be at SC99
The Big 5 vendors do not have the critical mass to develop the system software for multi-teraflop clusters
The HPC community (labs, vendors, universities, etc.) needs to work together
The hardware consolidation is nearly over; the software consolidation is on its way
A Linux-based “commodity” Open Source strategy could provide a mechanism for:
open vendor collaboration
academic and laboratory participation
one Open Source software environment
News and Announcements:
The next Extreme Linux conference will be in Williamsburg in October. The call for papers will be out soon; start preparing those technical papers…
There will be several cluster tutorials at SC99. Remy Evard, Bill Saphir, and Pete Beckman will be running one focused on system administration and the user environment for large clusters.