A Strategy for the Future of High Performance Computing?
Advanced Computing Laboratory, Los Alamos National Laboratory
Pete Beckman
Observations: The US Supercomputing Industry
All US high-performance vendors are building clusters of SMPs (with the exception of Tera)
Each company (IBM, SGI, Compaq, HP, and Sun) has a different version of Unix
Each company attempts to scale system software designed for database, internet, and technical servers
This fractured market forces five different parallel file systems, fast messaging implementations, etc.
Supercomputer companies tend to go out of business
New Limitations
People used to say: “The number of Tflops available is limited only by the amount of money you wish to spend.”
The reality: we are at a point where our ability to build machines from components exceeds our ability to administer, program, and run them
But we do it anyway. Many large clusters are being installed...
Scalable System Software is currently the weak link
Software for Tflop clusters of SMPs is hard:
System administration, configuration, booting, management, and monitoring
Scalable smart-NIC messaging (zero copy)
Cluster/global/parallel file systems
Job queuing and running
I/O (scratch, prefetch, NASD)
Fault tolerance and on-the-fly reconfiguration
Why use Linux for clusters of SMPs, and as a basis for system software research?
The OS for scalable clusters needs more research
Open Source! (it’s more than just geek chic)
No lawyers, no NDAs, no worries mate!
Visible code improves faster
The whole environment, or just the mods, can be distributed
Scientific collaboration is just a URL away...
Small, well-designed, stable, mature kernel: ~240K lines of code without device drivers
/proc filesystem and dynamically loadable modules (see the module sketch after this list)
The OS is extendable, optimizable, tunable
Linux is a lot of fun (Shagadelic, Baby!)
Did I mention no lawyers?
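To make that extendability concrete, here is a minimal loadable-module sketch in the classic 2.2-era style (init_module/cleanup_module entry points); the file name and printk text are our own illustration, not code from any real tool:

    /* hello_mod.c -- minimal 2.2-era loadable kernel module (illustrative).
       Build against the kernel headers; load with insmod, remove with rmmod. */
    #include <linux/module.h>
    #include <linux/kernel.h>

    int init_module(void)
    {
        printk("<1>hello_mod: loaded, the kernel is now extended\n");
        return 0;               /* a nonzero return would abort the load */
    }

    void cleanup_module(void)
    {
        printk("<1>hello_mod: unloaded\n");
    }

This open, module-based extension path is exactly what lets a lab add HPC-specific features without waiting on (or negotiating with) a vendor.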
Isn’t Open Source hype? Do you really need it?
A very quick example
Supermon and Superview: High Performance Cluster Monitoring Tools
Ron Minnich, Karen Reid, Matt Sottile
The problem: get really fast stats from a very large cluster
Monitor hundreds of nodes at rates up to 100 Hz
Monitor at 10 Hz without significant impact on the application
Monitor hardware performance counters
Collect a wide range of kernel information (disk blocks, memory, interrupts, etc.)
Solution
Modify the kernel so all the parameters can be grabbed without going through /proc
Tightly coupled clusters can get real-time monitoring stats
This is not of general use to the desktop and web-server markets
Stats for 100 nodes take about 20 ms
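To see why bypassing /proc matters at these rates, here is a minimal timing sketch (our own illustration, not Supermon code): it re-reads /proc/stat in a loop and reports the average cost of one text-formatted sample, i.e., the per-sample overhead that Supermon's kernel modification avoids.

    /* proc_cost.c -- time repeated reads of /proc/stat (illustrative). */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    int main(void)
    {
        char buf[8192];
        struct timeval t0, t1;
        int i, fd, iters = 1000;
        double us;

        gettimeofday(&t0, NULL);
        for (i = 0; i < iters; i++) {
            fd = open("/proc/stat", O_RDONLY);  /* open/read/close per    */
            if (fd < 0) { perror("open"); return 1; }
            read(fd, buf, sizeof(buf));         /* sample, plus in-kernel */
            close(fd);                          /* text formatting, every */
        }                                       /* single time            */
        gettimeofday(&t1, NULL);
        us = ((t1.tv_sec - t0.tv_sec) * 1e6
              + (t1.tv_usec - t0.tv_usec)) / iters;
        printf("average /proc/stat sample: %.1f us\n", us);
        return 0;
    }

Multiply that per-sample cost by hundreds of nodes at 10-100 Hz and the text-parsing path becomes the bottleneck, which is the case for a binary, kernel-resident interface.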
Superview: the Java tool for Supermon
Where should we concentrate our efforts?
Some areas for improvement….
Scalable Linux System Software
Software: The hard part
Linux environments (page 1)
Compilers:
F90 (PGI, Absoft, Compaq)
F77 (GNU, PGI, Absoft, Compaq, Fujitsu)
HPF (PGI, Compaq?)
C/C++ (PGI, KAI, GNU, Compaq, Fujitsu)
OpenMP (PGI)
Metrowerks CodeWarrior for C, C++ (Fortran?)
Debuggers:
Totalview… maybe, real soon now, almost?
gdb, DDD, etc.
Software: The hard part
Linux environments (page 2)
Message Passing:
MPICH, PVM, MPI (MSTI), Nexus
OS Bypass: ST, FM, AM, PM, GM, VIA, Portals, etc.
Fast Interconnects: Myrinet, GigE, HiPPI, SCI
Shared Memory Programming:
Pthreads, Tulip-Threads, etc.
Parallel Performance Tools:
TAU, Vampir, PGI PGProf, Jumpshot, etc.
Software: The hard part
Linux environments (page 3)
File Systems & I/O:
e2fs (native), NFS
PVFS, Coda, GFS
MPI-IO, ROMIO (see the sketch below)
Archival Storage:
HPSS & ADSM clients
Job Control:
LSF, PBS, Maui
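Of the I/O options above, MPI-IO (as implemented by ROMIO) is the portable programming interface; here is a minimal collective-write sketch (our own illustration, with a made-up file name) in which every rank writes its own block of one shared scratch file:

    /* mpiio_write.c -- each rank writes one block of a shared file. */
    #include <mpi.h>

    #define N 1024

    int main(int argc, char **argv)
    {
        int rank, i, buf[N];
        MPI_File fh;
        MPI_Offset off;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (i = 0; i < N; i++)
            buf[i] = rank;                   /* fill with this rank's id */

        /* Collective open and collective write: the MPI-IO layer can
           coalesce per-rank blocks into large, well-aligned requests. */
        MPI_File_open(MPI_COMM_WORLD, "scratch.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        off = (MPI_Offset)rank * N * sizeof(int);
        MPI_File_write_at_all(fh, off, buf, N, MPI_INT, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }

The collective call is the point: it gives the library the global picture it needs to drive a parallel file system efficiently, which a per-process write() cannot.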
Software: The hard part
Linux environments (page 4)
Libraries and Frameworks:
BLAS, OVERTURE, POOMA, Atlas
Alpha math libraries (Compaq)
System Administration:
Building and booting tools
Cfengine
Monitoring and management tools
Configuration database
SGI Project Accounting
Software for Linux clusters
A report card (current status)
Compilers: A
Parallel debuggers: I
Message passing: A-
Shared memory prog.: A
Parallel performance tools: C+
File Systems: D
Archival Storage: C
Job Control: B-
Math Libraries: B
Summary of the most important areas
First Priority:
Cluster management, administration, images, monitoring, etc.
Cluster/parallel/global file systems
Continued work on scalable messaging
Faster, more scalable SMP
Virtual memory optimized for HPC
TCP/IP improvements
Wish List:
NIC boot, BIOS NVRAM, serial console
OS bypass standards in the kernel
Tightly-coupled scheduling, accounting
Newest drivers
Honest cluster costs: publish the numbers
How many sysadmins and programmers are required for support?
What are the service and replacement costs?
How much was hardware integration?
How many users can you support, and at what levels?
How much was the hardware?
Tera-Scale SMP Cluster Architecture
[Diagram: units of compute nodes on a gigabit multistage interconnection fabric, with Gigabit Ethernet, control nodes, and network-attached secure disks.]
Example Cluster Costs
Compute Nodes 59%, Myrinet 30%, Ethernet 4%, Control Fabric 4%, Misc 3%
Linux/Myrinet Cluster Cost Model
[Chart: cost (in thousands) versus cluster size in dual-CPU nodes, 0 to 2048; series: Compute Nodes, People (yearly: few projects), People (yearly: many projects), Compile/RAID Servers, Grand Total, Hardware.]
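For a back-of-the-envelope version of such a model, here is a sketch that applies the percentage breakdown from the cost slide above; the total hardware budget is a placeholder, not a published figure:

    /* cost_model.c -- toy hardware-cost breakdown (illustrative).
       Percentages are from the "Example Cluster Costs" slide; the
       $3M total is a hypothetical placeholder. */
    #include <stdio.h>

    int main(void)
    {
        double total = 3000.0;   /* placeholder budget, in thousands */
        printf("Compute nodes : %8.0f\n", total * 0.59);
        printf("Myrinet       : %8.0f\n", total * 0.30);
        printf("Ethernet      : %8.0f\n", total * 0.04);
        printf("Control fabric: %8.0f\n", total * 0.04);
        printf("Misc          : %8.0f\n", total * 0.03);
        return 0;
    }

Note what the pie chart omits and the cost-model chart adds: the people lines, which can rival the hardware lines as the cluster grows.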
Let someone else put it together
Compaq, Dell, Penguin Computing, Alta Tech, VA Linux, DCG, Paralogic, Microway
Ask about support
Cluster Benchmarking: Lies, Damn Lies, and the Top500
Make MPI zero-byte messaging a special case (improves latency numbers; see the ping-pong sketch below)
Convert multiply flops to addition, recount flops
Hire a Linpack consultant to help you achieve “the number” the vendor promised
“We unloaded the trucks, and 24 hours later we calculated the size of the galaxy in acres.”
For $15K and 3 rolls of duct tape I built a supercomputer in my cubicle…
Vendor-published Linpack, latency, and bandwidth numbers are worthless
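The zero-byte trick is easy to see in the standard ping-pong microbenchmark; here is a minimal sketch (our own, not any vendor's code). An MPI implementation that special-cases count == 0 can quote an impressive figure from this loop while gaining real applications nothing, so latency should be reported across a range of message sizes:

    /* pingpong.c -- 0-byte round-trip latency between ranks 0 and 1. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, i, iters = 10000;
        char buf[1];
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < iters; i++) {
            if (rank == 0) {        /* note the count of 0: no payload */
                MPI_Send(buf, 0, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, 0, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, 0, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, 0, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("0-byte one-way latency: %.2f us\n",
                   (t1 - t0) / (2.0 * iters) * 1e6);
        MPI_Finalize();
        return 0;
    }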
Plug-in Framework for Cluster Benchmarks
MPI Message Matching
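As a rough sketch of the kind of test a message-matching benchmark runs (our illustration, not the framework's actual code): rank 0 posts a long queue of receives and rank 1 sends in reverse tag order, so every incoming message must be searched against the posted-receive queue, exposing the cost of the MPI library's matching logic:

    /* matching.c -- stress the posted-receive matching queue. */
    #include <mpi.h>
    #include <stdio.h>

    #define NREQ 1000

    int main(int argc, char **argv)
    {
        int rank, i, buf[NREQ];
        MPI_Request req[NREQ];
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* Post a long queue of receives, one distinct tag each. */
            for (i = 0; i < NREQ; i++)
                MPI_Irecv(&buf[i], 1, MPI_INT, 1, i,
                          MPI_COMM_WORLD, &req[i]);
            t0 = MPI_Wtime();
            MPI_Waitall(NREQ, req, MPI_STATUSES_IGNORE);
            t1 = MPI_Wtime();   /* rough timing; a sketch, not a metric */
            printf("matched %d messages in %.3f ms\n",
                   NREQ, (t1 - t0) * 1e3);
        } else if (rank == 1) {
            /* Send in reverse tag order: each message traverses the
               posted-receive queue before it finds its match. */
            for (i = NREQ - 1; i >= 0; i--)
                MPI_Send(&i, 1, MPI_INT, 0, i, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }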
Price Performance of Single Myrinet Nodes Compared to Dual SMP Myrinet Nodes
[Chart: price-performance ratio versus processor count (0 to 60) for the CG, EP, LU, IS, and SP benchmarks; regions of the plot are labeled “Dual SMP nodes have better price-performance” and “Single CPU nodes have better price-performance.”]
Price Performance: 100BT compared to Myrinet
[Chart: ratio of price-performance (100BT / Myrinet) versus processor count (0 to 60) for the IS, EP, CG, and MG benchmarks; regions of the plot are labeled “100BT nodes have better price-performance” and “Myrinet nodes have better price-performance.”]
Conclusions
Lots of Linux clusters will be at SC99
The Big 5 vendors do not have the critical mass to develop the system software for multi-teraflop clusters
The HPC community (labs, vendors, universities, etc.) needs to work together
The hardware consolidation is nearly over; the software consolidation is on its way
A Linux-based “commodity” Open Source strategy could provide a mechanism for:
open vendor collaboration
academic and laboratory participation
one Open Source software environment
News and Announcements:
The next Extreme Linux conference will be in Williamsburg in October. The call for papers will be out soon; start preparing those technical papers…
There will be several cluster tutorials at SC99. Remy Evard, Bill Saphir, and Pete Beckman will be running one focused on system administration and the user environment for large clusters.