New Breakthroughs in Linux Supercomputing
August 1, 2003
Andy Fenselau, Product Line Director, [email protected]
Agenda
• What’s the Problem?: Scaling Linux
– Economic & Technology Strategies
– Real-world Deployment Options
• SGI’s Solution
• So What?: Early Results
• Who Cares?:
– Developers
– End-Users
– Economic Buyers
What is “Supercomputing?”
High-Productivity Computing
Advanced Visualization
Ultrafast Storage Solutions
Changing Economics in HPC
[Chart: HPC cost per hour, 1990–2005, across the vector → RISC → COTS/open-standards transitions. By 2002, hardware is ~$1/hr while ISV application software is ~$5/hr and IT & engineering personnel are ~$45/hr: costs are now in ISV software and people.]
Technology Drivers in Technical Computing
– There’s always a larger problem
– Datasets are growing exponentially
– Elements of the problems tend to become more interrelated – CPU performance, memory access, communications bandwidth and latency
– Parallel processing over single-processor performance
– Productivity computing – realized gains vs. theoretical peak
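One way to read “realized gains vs. theoretical peak” quantitatively (illustrative notation, not from the original slides): sustained efficiency is the fraction of the machine’s nominal peak that real workloads actually deliver,

\[
\text{efficiency} \;=\; \frac{R_{\text{sustained}}}{R_{\text{peak}}},
\qquad
R_{\text{peak}} \;=\; N_{\text{CPU}} \times f_{\text{clock}} \times \text{flops per cycle}.
\]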
Critical Components of High-Productivity Computing
HPC technologies that shorten time to solution:
• Balance-able, scalable performance
• Low-latency memory access
• Operating environment optimized for HPC
• Enabled by system-, resource-, and data-management tools
• Easily deployable with ongoing investment protection
Compute Resources As Independent Variables
[Diagram: workloads mapped against CPU, memory, and I/O requirements]
• Web serving: small, integrated system
• Genomics: compute cycles
• Signal processing: networking and compute
• Database/CRM/ERP: storage
• Media streaming: access storage and networking
• Traditional supercomputer: compute, networking, storage
Deployment Options
Two questions you need to answer:
• What do your users need?
– Architectural scaling of processing, memory, and I/O for individual and collective workloads
– OS and software support for real HPC system and data management
• What could your users do?
– Workflow/process improvements
– Workflow/process breakthroughs
– Capability breakthroughs
[Chart: architecture choices plotted against application complexity (single application → mix of applications) and HPC resource requirements (1–10 users, departmental; 10–100 users, data center): small node/cluster (~16 CPUs), moderate node/cluster (~64 CPUs), fat-node cluster (~128 CPUs), and large SMP (~256 CPUs).]
Choosing the Right Architecture: Small vs. Fat Nodes
• Four common challenges of small-node clusters

Choosing the Right Architecture: Supercluster vs. Small-Node Clusters
Complaint → SGI® Altix™ 3000 solution:
• Capability and performance suffer from “islands of memory”: job/data geometries are compromised to single-node memory limits, and total memory is over-provisioned. → Nodes of up to 512GB local shared memory; independently scalable memory bricks; up to 4TB global shared memory across nodes with superior latency and bandwidth.
• Performance suffers from the networking and disk I/O limitations of loosely coupled, switched interconnect fabrics. → 6.4GB/sec SGI® NUMAlink™ fabric and superior XSCSI I/O.
• Productivity on real workloads suffers from weak system-, resource-, and data-management tools. → Porting of leading HPC tools from IRIX® to Linux®.
• Total cost of ownership suffers from high system administration and software licensing costs. → A balanced, high-productivity, scalable solution.
Choosing the Right Architecture: 32-bit vs. 64-bit Workloads
How are your users’ needs—and the nature of the problems they are tackling—evolving?
• Memory size and addressability
• I/O performance and scalability
• Higher processor counts for problematic applications
• Resource flexibility for mixed workloads
• Manageability from an operational point of view
• Storage volume and manageability
• Operational scheduling
• Operational flexibility to accommodate mixed applications (programming models, etc.), algorithmic constraints, and other future-proofing
Traditional Cluster Bottlenecks

Segment | Software | CPU | Memory BW | I/O BW | Comm. BW | Latency | Scalability
Explicit finite difference | MM5 | H | M | L | L | H | ~100–500p
Semi-implicit finite difference | HIRLAM | H | M | L | H | H | ~100–500p
Spectral climate models | CCM3/CAM | H | M | L | H | M | ~64–128p
Spectral weather models | NOGAPS, IFS, ALADIN | H | M | L | H | M | ~200p
Coupled climate models | CCSM2, FMS | H | M | L | H | H | ~100p

(System resource benefits/requirements: H = high, M = medium, L = low)
SGI’s Solution
Altix 3000: Scaling Linux to New Altitudes
SGI® Altix™ 3000 Overview
Built like a cluster, works like a supercomputer
• First Linux® node with 64 CPUs in a single-OS image
• First clusters with global shared memory across multiple nodes
• First Linux solution with HPC system- and data-management tools
• World-record performance for floating-point calculations, memory performance, I/O bandwidth, and real technical applications
The Best of Both Worlds

Small-node clusters:
+ High scalability (scale out)
+ RAS, open source
+ Large development community
+ Inexpensive commodity hardware
- Difficult to program and administer
- Small node sizes, memory, and I/O limitations
- Poor total cost of ownership

Supercomputers:
+ High scalability (scale up)
+ Large memory and I/O handling
+ Easy to program and administer
+ Robust software productivity tools
- Moderate to expensive
- Few, specialized applications

☺ Best value
☺ Built-in interconnect
The Best of Both Worlds
✪ High scalability (scale out)
✪ RAS, open source
✪ Large development community
✪ Inexpensive commodity hardware
✪ High scalability (scale up)
✪ Large memory and I/O handling
✪ Easy to program and administer
✪ Robust software productivity tools
☺ Global shared memory across cluster nodes (new capability)
Benefits of Shared Memory: Accelerating Time to Solution

Traditional clusters (commodity interconnect) vs. the SGI® Altix™ 3000 family:
• No support for shared-memory programming models → Support for all major parallel programming models
• Large data sets require disk swapping → Large data sets fit into shared memory
• Overprovisioning of hardware, memory, and software → Larger nodes mean lower total costs
• Load balancing requires communication between nodes → Efficient load balancing; no need to move data
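A minimal sketch (not from the slides) of what the shared-memory programming support on the Altix side of the comparison looks like in practice: with OpenMP on a single-system-image machine, one large array lives once in globally shared memory and each thread works on its own slice in place, with no per-node copies and no explicit message passing. The array size and the arithmetic are placeholders.

```c
/* Minimal OpenMP sketch: one shared array, many threads, no data movement.
 * Compile with e.g. `cc -fopenmp`; the size and the "work" are placeholders. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long n = 1L << 26;                /* ~512 MiB of doubles: one shared copy */
    double *data = malloc(n * sizeof *data);
    if (!data) { perror("malloc"); return 1; }

    double sum = 0.0;

    /* All threads see the same 'data'; each updates its slice in place. */
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (long i = 0; i < n; i++) {
        data[i] = (double)i * 0.5;          /* stand-in for real initialization/work */
        sum += data[i];
    }

    printf("threads=%d  sum=%g\n", omp_get_max_threads(), sum);
    free(data);
    return 0;
}
```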
SGI® Altix™ 3000 Family: Hardware Overview
• Global shared memory
• Fast SGI® NUMAflex™ interconnect
• CXFS™ shared filesystem
• Ultrafast SAN storage
• Bricks: Itanium® 2 C-brick (CPU and memory), M-brick (memory), IX-brick (base I/O module), PX-brick (PCI-X expansion), R-brick (router interconnect), D-brick2 (disk expansion)
Steps to Scaling Linux®
• Target specific applications (HPC, databases, etc.) and remove bottlenecks using various tools
– lockmeter and kernprof
– PenguinoMeter
– Performance Co-Pilot™, hardware performance counters, etc.
• User workload vs. kernel workload
• Leverage technologies and experience from IRIX®
– XSCSI, XFS™, XVM, cpusets, dplace, thread synchronization (fetchops), raw I/O, etc.
• Run on in-house prototype hardware
Linux® SW Development Strategy
• Lead community efforts in areas of expertise
– NUMA, scaling, I/O performance, visualization, APIs, etc.
– Linux scalability effort, Linux on large systems (Atlas), etc.
• Open-sourcing kernel changes
• Community acceptance (contribute changes)
SGI and the Open Source Community
• Linux kernel:
– CPU/memory placement
– Kernel debugging
– Kernel profiling
– Kernel lock metering
– NUMA memory support
• Resource management:
– Comprehensive system accounting
– Process aggregates
• Community projects:
– Linux on Large Systems Foundry
– Linux Scalability Effort project
• Filesystem and storage:
– Linux FailSafe™
– XFS™ journaling filesystem
– File alteration/inode monitor
• Graphics:
– GLX™ OpenGL® extensions
– Open Inventor™ object-oriented toolkit for 3D
• Other projects:
– Performance Co-Pilot™
– Digital media audio file library
– Linux kernel crash dumps

A 10-year history of participation and contribution
Real HPC Workloads for Real HPC Users

Standard Linux® Distribution
• Runs standard 64-bit Linux applications
• Red Hat® Enterprise Linux® AS 2.1 binary compatible*
• Easy to develop and administer
• Intel® compilers and tools

SGI® Open-Source Enhancements
• Enabling features and functionality
• Contributing SGI expertise in scalability, NUMA
• XFS™ high-performance filesystem

SGI ProPack™ HPC Value-Add Enhancements
• Optimized high-productivity computing
• System management: partitioning, Performance Co-Pilot™, high-availability FailSafe™
• Resource management: CPU sets/memory placement, MPT, array services, SCSL math libraries
• Data management: CXFS™, hierarchical storage management tools (DMF/TMF), XVM

*SGI Advanced Linux Environment 2.1 is based on Red Hat Enterprise Linux AS 2.1 but is not sponsored by or endorsed by Red Hat, Inc. in any way.
SGI Provides Linux® Support Directly
[Diagram: the customer reports a bug to SGI; SGI Engineering generates a patch or workaround and passes the fix on to the appropriate party (the community or the Linux distributor).]
Developer Tools: Richest HPC Linux Environment
• Rapid evolution
– SGI knowledge of compilers
– Intel knowledge of processors
• Leverage open source
– Many apps available
– We test to verify
• Differentiation
– Enhanced ISV app performance from SGI libraries (MPT and SCSL)
– Only on Altix
• Engagement with premier tools ISVs
– Etnus (TotalView)
– Pallas (Vampir)
[Diagram: the tools ecosystem spans SGI, Intel, ISVs, and the open-source community.]
World-Record Results
Performance, efficiency, price/performance:
• Fastest Linux I/O performance: 7GB/sec
• Unsurpassed Linux® scalability on real-world applications
• World-record memory bandwidth: STREAM Triad
• World-record 16, 32, and 64P compute performance: SPEC® fp_rate base 2000, SPEC® int_rate base 2000, Linpack NxN
World-Record CPU Throughput: Floating-Point Results
[Bar chart: SPECfp_rate_base2000 at 8, 32, and 64 CPUs for the Sun Fire 15K (1.2GHz), HP AlphaServer GS1280 (1.15GHz), IBM® p690/655 (1.7/1.5GHz), SGI® Altix™ 3000 (1.3GHz), and SGI® Altix™ 3000 (1.5GHz).]
• World-record result for 64- and 32-processor systems
• SGI’s 1.5GHz, 32P result is 2x better performance than the IBM eServer p690, 1.7GHz
• SGI’s 1.3GHz, 64P result is 1.95x better than the Sun Fire 15K, 1.2GHz
• SGI is the only vendor providing high-performance 64-processor Linux® systems
• SGI’s architecture is the most efficient choice for mixed-workload environments
World-Record CPU Throughput: Integer Results
[Bar chart: SPECint_rate_base2000 at 8, 32, and 64 CPUs for the Sun Fire 15K (1.2GHz), HP AlphaServer GS1280 (1.15GHz), IBM® p690/655 (1.7/1.5GHz), SGI® Altix™ 3000 (1.3GHz), and SGI® Altix™ 3000 (1.5GHz).]
• World-record result for 64-, 32-, and 8-processor systems
• SGI’s 1.3GHz, 64P result is 1.54x better performance than the Sun Fire 15K, 64P
• SGI is the only vendor providing high-performance 64-processor Linux® systems
World-Leading Linpack HPC (NxN) Performance
[Bar chart: Linpack NxN results at 32, 64, and 128 CPUs for the HP Superdome™ (750MHz), IBM p690 (1.3GHz), IBM® eServer™ p690 (1.7GHz), SGI® Altix™ 3000 (1.3GHz), and SGI® Altix™ 3000 (1.5GHz).]
• World-record performance and efficiency for comparable configurations
• SGI’s 1.3GHz, 128P result is 1.46x better performance than the IBM eServer p690 1.3GHz, 128P
• Achieved 86% of peak performance at 32P, versus IBM p690 1.7GHz, 32P (66%) and an Athlon Myrinet cluster (58%)
• Efficiency comparable to the fastest computer today, the “Earth Simulator” (87.5%)
• In the first 3 months of shipment, 6 Altix 128P systems are squarely in the middle of the Top 500
Source: http://www.netlib.org/benchmark/performance.ps, Jun 3, 2003, and SGI and IBM performance reports
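For context, the “86% of peak” figure follows the usual Linpack efficiency definition, efficiency = Rmax / Rpeak. As an illustrative reconstruction (assuming the 1.3GHz Itanium® 2 parts, which can retire 4 floating-point operations per cycle):

\[
R_{\text{peak}} = 32 \times 1.3\,\text{GHz} \times 4 \;\approx\; 166\ \text{GFLOPS},
\qquad
R_{\max} \;\approx\; 0.86 \times 166 \;\approx\; 143\ \text{GFLOPS}.
\]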
World-Record Memory: STREAM Triad Results
[Bar chart: STREAM Triad bandwidth (GB/sec) at 16, 32, 64, and 128* CPUs for the HP Superdome™, IBM® eServer™ p690 (1.7GHz), and SGI® Altix™ 3000 (1.3GHz).]
• World-record STREAM 64P result for a microprocessor-based system, and fifth overall
• 1.56x better performance than the IBM eServer p690 at 32P
• The emerging bottleneck in high-performance computing is the system’s ability to efficiently access information; SGI’s architecture delivers
* The 128-CPU result uses MPI code to run on an Altix supercluster with two 64P nodes; for smaller CPU counts, OpenMP code was used. Cluster results are not eligible for the STREAM Top 20 list.
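The Triad figures above come from the kernel a[i] = b[i] + q·c[i] run over arrays far larger than cache. A rough OpenMP sketch of that kernel follows; it is illustrative only (array size, timing, and the derived GB/s figure are not the official STREAM benchmark code).

```c
/* Rough sketch of the STREAM "triad" kernel with OpenMP -- illustrative,
 * not the official benchmark.  Each element costs two loads and one store. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1L << 25)   /* 32M elements per array; must be much larger than cache */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) { perror("malloc"); return 1; }

    const double q = 3.0;
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + q * c[i];             /* the triad */
    double t1 = omp_get_wtime();

    double gbytes = 3.0 * sizeof(double) * N / 1e9;   /* bytes touched per pass */
    printf("triad: %.3f s, ~%.1f GB/s\n", t1 - t0, gbytes / (t1 - t0));

    free(a); free(b); free(c);
    return 0;
}
```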
SGI Altix 3000 Scalability for Compute-Intensive Applications
[Chart: speedup vs. number of processors (1–64) for Gaussian (CCM), Amber (CCM), Fasta (BIO), Star-CD (CFD), Vectis (CFD), LS-Dyna (EFEA), TAU (CFD), HTC-Blast (BIO), Fastx (BIO), MM5 (CWO), CASTEP (CCM), and GAMESS (CCM), plotted against ideal scaling; higher is better. Status: February 24, 2003.]
World-record Linux scalability on Altix 3000
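The speedup plotted in the chart is the usual ratio of single-processor runtime to p-processor runtime, with the “Ideal” curve being S(p) = p:

\[
S(p) \;=\; \frac{T(1)}{T(p)},
\qquad
\text{parallel efficiency} \;=\; \frac{S(p)}{p}.
\]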
Who Cares?: Developers Do
Key Applications: Scalability and Implementation
[Chart: number of CPUs (log scale, 1–512) used per application across segments, for SMP and MPP implementations. Altix™ 3000 is marked at 32–512p; IBM p690, NEC TX7, Unisys ES7000, Origin® 3000, HP RX5670, and an SSI limit are marked for comparison. Per-segment ranges shown include 1–32p, 4–16p, 4–32p, 8–64p, 16–64p, and 32–512p.
Applications plotted, by segment:
• CSM/FEA: Nastran, Ansys, Abaqus, Marc, Pam-Crash, LS-Dyna, Radioss
• CFD: Powerflow, Fluent, StarHPC, CFX, Fire, Vectis, Pam-Flow
• CCM (chemistry): Gaussian, Gamess, Amber, CASTEP, NWChem, Charmm, CNX, NAMD, ADF, VASP, Dmol
• BIO: BLAST, FASTA, ClustalW, HMMER, Wise2, D2_Cluster
• CWO (climate/weather/ocean): MM5, HIRLAM, ALADIN, CCSM2, WRF, CCM3, FVCCM, IFS, POP, ETA
• SPI (seismic): ProMAX, Omega, GeoCluster, FOCUS, GeoDepth, SeisUP
• RES (reservoir): Eclipse, VIP, POWERS]
Linux HPC Applications Strategy

Enablers → Revenue drivers

System-level differentiations: SSI, memory bandwidth, communication latency and bandwidth, XVM, dplace, cpusets, CXFS™
Code-level differentiations: SCSL, sparse direct solver, extreme reordering partitioning, parallel MPYAD, FFIO, HTC drivers, MPT, XPMEM

Revenue-driver applications by segment:
• FEA: Nastran, Abaqus, Ansys, Marc, LS-Dyna, Pam-Crash, Radioss
• CFD: Fluent, StarHPC, CFX5, Powerflow, Fire, Vectis
• Chemistry: Gaussian, Amber, Castep, CNX, Dmol, Charmm, GAMESS
• BIO: FASTA, BLAST, ClustalW, HMMER, Wise2, D2-Cluster, PHRAP
• Seismic: ProMAX, Omega, Epos, Geocluster, SeisUP
• Reservoir: VIP, Eclipse, Mores
• Weather: MM5, IFS, HIRLAM, ESA, ALADIN, CCM3, NCEP, POP, LM
Application Readiness
[Chart: applications ready to benchmark and certified applications, by month, January–June 2003; both counts grow over the period, with roughly 50 applications ready to benchmark by June.]
Who Cares?: End-Users Do
Real Workflow/Throughput Comparisons

Throughput analysis:
System | No. of CPUs | CPUs/job | Time | No. of jobs
96P cluster | 96 | 16 | 0.18 | 6
16P Altix | 16 | 4 | 0.18 | 4

System | No. of CPUs | CPUs/job | Time | Experiments/year
IA32 cluster | 96 | 96 | 3 weeks | 17
16P Altix | 16 | 16 | 8 days | 45
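The experiments-per-year column appears to follow from simply dividing the calendar year by the per-experiment runtime:

\[
\frac{365\ \text{days}}{21\ \text{days per experiment}} \approx 17.4,
\qquad
\frac{365\ \text{days}}{8\ \text{days per experiment}} \approx 45.6,
\]

consistent with the 17 and 45 experiments/year quoted above.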
Drug Discovery Case Study
• Incyte Genomics: drug discovery/bioinformatics
• 20 databases, 6GB–10GB
• On a Beowulf cluster:
– Each node needs maximum memory
– Start a job by loading the database from disk, with repeat I/O calls for datasets >4GB
• On an SGI® Altix™ 3000 supercluster:
– Databases are loaded once into the memory pool
– Average memory per node
– Jobs run immediately, referring to the database in shared memory
Conclusion:
• The SGI® Altix™ 3000 supercluster can replace the Beowulf cluster at a CPU ratio of 15:1
• Faster time to solution
• Reduced TCO: fewer nodes, fewer processors, less memory, easier management, easier upgrades, lower software costs, less power consumption, less rack space, easier administration
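A rough illustration of the “load once, reference in shared memory” idea (the file name, format, and trivial pattern scan below are hypothetical, not Incyte’s actual workflow): on a shared-memory system, every worker process can mmap() the same database file read-only and reference a single in-memory copy instead of re-reading it from disk on each node.

```c
/* Hypothetical sketch: map a large reference database once and scan it.
 * All processes mapping the same file share one physical in-memory copy. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "sequences.db";      /* hypothetical database file */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* One read-only, shared mapping; no per-job reload from disk. */
    const char *db = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (db == MAP_FAILED) { perror("mmap"); return 1; }

    /* Trivial stand-in for a real search: count occurrences of a 4-mer. */
    const char *pat = "GATC";
    size_t plen = strlen(pat), hits = 0;
    for (off_t i = 0; i + (off_t)plen <= st.st_size; i++)
        if (memcmp(db + i, pat, plen) == 0)
            hits++;

    printf("%zu hits in %lld bytes\n", hits, (long long)st.st_size);
    munmap((void *)db, st.st_size);
    close(fd);
    return 0;
}
```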
Manufacturing Case Study
Altix 3000 for Multi-Job, Multi-CPU Throughput
[Chart: elapsed time in seconds for a standalone 8-way LS-Dyna job versus each of four concurrent 8-way jobs run as a throughput mix on the same system.]
System: Altix 3000, 32P, 64GB, XVM, SGI® TP900 SAN
Individual jobs in the throughput mix are between 0.4% and 1.8% slower than the standalone case.
Who Cares?: Economic Buyers Do
What is “Price/Performance”?
It’s simple math … Or is it?
• What is “price”?
– Total cost of hardware
– Total cost of acquisition
– Total cost of ownership
• What is “performance”?
– Peak GFLOPS
– Single-job/single-processor application benchmarks
– Multijob/multiprocessor application benchmarks
– Broader productivity or workload benchmarks
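Two of the many possible ratios, just to illustrate how much the answer depends on the definitions chosen (notation illustrative, not from the slides):

\[
\frac{\text{hardware price}}{\text{peak GFLOPS}}
\qquad\text{versus}\qquad
\frac{\text{total cost of ownership over the system's life}}{\text{real workload completed over that life}}.
\]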
Just the Beginning

IPF Roadmap
• Compatible with the next-generation Intel® Itanium® 2 processor family

Hardware
• Advanced supercomputing technologies and capabilities
• Mid-range offering for departmental and database servers
• Superclusters scaling to thousands of processors
• Next-generation NUMAlink interconnect
• Advanced multipipe graphics

Software/Linux
• Further scalability for standard Linux
• Ongoing improvement of superior system-, data-, and resource-management tools
• Further development of global shared-memory capabilities
• Open-source and partner contributions
• Big data handling and I/O capabilities
• Compiler and development tool improvements