New Breakthroughs in Linux Supercomputing
August 1, 2003
Andy Fenselau, Product Line Director, [email protected]
Agenda
• What’s the Problem?: Scaling Linux
– Economic & Technology Strategies
– Real-world Deployment Options
• SGI’s Solution
• So What?: Early Results
• Who Cares?:
– Developers
– End-Users
– Economic Buyers
What is “Supercomputing?”
High-Productivity Computing
Advanced Visualization
Ultrafast Storage Solutions
Changing Economics in HPC
[Chart: HPC cost per hour, 1990–2005, across the vector → RISC → COTS/open-standards transitions. By 2002, hardware is ~$1/hr while ISV application software is ~$5/hr and IT & engineering personnel are ~$45/hr: costs are now in ISV software and people.]
Technology Drivers in Technical Computing
– There’s always a larger problem
– Datasets are growing exponentially
– Elements of the problems tend to become more interrelated – CPU performance, memory access, communications bandwidth and latency
– Parallel processing over single-processor performance
– Productivity computing – realized gains vs. theoretical peak
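One way to read “realized gains vs. theoretical peak” quantitatively (illustrative notation, not from the original slides): sustained efficiency is the fraction of the machine’s nominal peak that real workloads actually deliver,

\[
\text{efficiency} \;=\; \frac{R_{\text{sustained}}}{R_{\text{peak}}},
\qquad
R_{\text{peak}} \;=\; N_{\text{CPU}} \times f_{\text{clock}} \times \text{flops per cycle}.
\]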
Critical Components of High-Productivity Computing
HPC technologies that shorten time to solution:
• Balance-able, scalable performance
• Low-latency memory access
• Operating environment optimized for HPC
• Enabled by system-, resource-, and data-management tools
• Easily deployable with ongoing investment protection
Compute Resources As Independent Variables
[Diagram: workloads mapped against CPU, memory, and I/O requirements]
• Web serving: small, integrated system
• Genomics: compute cycles
• Signal processing: networking and compute
• Database/CRM/ERP: storage
• Media streaming: access storage and networking
• Traditional supercomputer: compute, networking, storage
Deployment Options
Two questions you need to answer:
• What do your users need?
– Architectural scaling of processing, memory, and I/O for individual and collective workloads
– OS and software support for real HPC system and data management
• What could your users do?
– Workflow/process improvements
– Workflow/process breakthroughs
– Capability breakthroughs
[Chart: architecture choices plotted against application complexity (single application → mix of applications) and HPC resource requirements (1–10 users, departmental; 10–100 users, data center): small node/cluster (~16 CPUs), moderate node/cluster (~64 CPUs), fat-node cluster (~128 CPUs), and large SMP (~256 CPUs).]
Choosing the Right Architecture: Small vs. Fat Nodes
• Four common challenges of small-node clusters

Choosing the Right Architecture: Supercluster vs. Small-Node Clusters
Complaint → SGI® Altix™ 3000 solution:
• Capability and performance suffer from “islands of memory”: job/data geometries are compromised to single-node memory limits, and total memory is over-provisioned. → Nodes of up to 512GB local shared memory; independently scalable memory bricks; up to 4TB global shared memory across nodes with superior latency and bandwidth.
• Performance suffers from the networking and disk I/O limitations of loosely coupled, switched interconnect fabrics. → 6.4GB/sec SGI® NUMAlink™ fabric and superior XSCSI I/O.
• Productivity on real workloads suffers from weak system-, resource-, and data-management tools. → Porting of leading HPC tools from IRIX® to Linux®.
• Total cost of ownership suffers from high system administration and software licensing costs. → A balanced, high-productivity, scalable solution.
Choosing the Right Architecture: 32-bit vs. 64-bit Workloads
How are your users’ needs—and the nature of the problems they are tackling—evolving?
• Memory size and addressability
• I/O performance and scalability
• Higher processor counts for problematic applications
• Resource flexibility for mixed workloads
• Manageability from an operational point of view
• Storage volume and manageability
• Operational scheduling
• Operational flexibility to accommodate mixed applications (programming models, etc.), algorithmic constraints, and other future-proofing
Traditional Cluster Bottlenecks

Segment | Software | CPU | Memory BW | I/O BW | Comm. BW | Latency | Scalability
Explicit finite difference | MM5 | H | M | L | L | H | ~100–500p
Semi-implicit finite difference | HIRLAM | H | M | L | H | H | ~100–500p
Spectral climate models | CCM3/CAM | H | M | L | H | M | ~64–128p
Spectral weather models | NOGAPS, IFS, ALADIN | H | M | L | H | M | ~200p
Coupled climate models | CCSM2, FMS | H | M | L | H | H | ~100p

(System resource benefits/requirements: H = high, M = medium, L = low)
SGI’s Solution
Altix 3000: Scaling Linux to New Altitudes
SGI® Altix™ 3000 Overview
Built like a cluster, works like a supercomputer
• First Linux® node with 64 CPUs in a single-OS image
• First clusters with global shared memory across multiple nodes
• First Linux solution with HPC system- and data-management tools
• World-record performance for floating-point calculations, memory performance, I/O bandwidth, and real technical applications
The Best of Both Worlds

Small-node clusters:
+ High scalability (scale out)
+ RAS, open source
+ Large development community
+ Inexpensive commodity hardware
- Difficult to program and administer
- Small node sizes, memory, and I/O limitations
- Poor total cost of ownership

Supercomputers:
+ High scalability (scale up)
+ Large memory and I/O handling
+ Easy to program and administer
+ Robust software productivity tools
- Moderate to expensive
- Few, specialized applications

☺ Best value
☺ Built-in interconnect
The Best of Both Worlds
✪ High scalability (scale out)
✪ RAS, open source
✪ Large development community
✪ Inexpensive commodity hardware
✪ High scalability (scale up)
✪ Large memory and I/O handling
✪ Easy to program and administer
✪ Robust software productivity tools
☺ Global shared memory across cluster nodes (new capability)
Benefits of Shared Memory: Accelerating Time to Solution

Traditional clusters (commodity interconnect) vs. the SGI® Altix™ 3000 family:
• No support for shared-memory programming models → Support for all major parallel programming models
• Large data sets require disk swapping → Large data sets fit into shared memory
• Overprovisioning of hardware, memory, and software → Larger nodes mean lower total costs
• Load balancing requires communication between nodes → Efficient load balancing; no need to move data
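A minimal sketch (not from the slides) of what the shared-memory programming support on the Altix side of the comparison looks like in practice: with OpenMP on a single-system-image machine, one large array lives once in globally shared memory and each thread works on its own slice in place, with no per-node copies and no explicit message passing. The array size and the arithmetic are placeholders.

```c
/* Minimal OpenMP sketch: one shared array, many threads, no data movement.
 * Compile with e.g. `cc -fopenmp`; the size and the "work" are placeholders. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long n = 1L << 26;                /* ~512 MiB of doubles: one shared copy */
    double *data = malloc(n * sizeof *data);
    if (!data) { perror("malloc"); return 1; }

    double sum = 0.0;

    /* All threads see the same 'data'; each updates its slice in place. */
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (long i = 0; i < n; i++) {
        data[i] = (double)i * 0.5;          /* stand-in for real initialization/work */
        sum += data[i];
    }

    printf("threads=%d  sum=%g\n", omp_get_max_threads(), sum);
    free(data);
    return 0;
}
```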
SGI® Altix™ 3000 Family: Hardware Overview
• Global shared memory
• Fast SGI® NUMAflex™ interconnect
• CXFS™ shared filesystem
• Ultrafast SAN storage
• Bricks: Itanium® 2 C-brick (CPU and memory), M-brick (memory), IX-brick (base I/O module), PX-brick (PCI-X expansion), R-brick (router interconnect), D-brick2 (disk expansion)
Steps to Scaling Linux®
• Target specific applications (HPC, databases, etc.) and remove bottlenecks using various tools
– lockmeter and kernprof
– PenguinoMeter
– Performance Co-Pilot™, hardware performance counters, etc.
• User workload vs. kernel workload
• Leverage technologies and experience from IRIX®
– XSCSI, XFS™, XVM, cpusets, dplace, thread synchronization (fetchops), raw I/O, etc.
• Run on in-house prototype hardware
Linux® SW Development Strategy
• Lead community efforts in areas of expertise
– NUMA, scaling, I/O performance, visualization, APIs, etc.
– Linux scalability effort, Linux on large systems (Atlas), etc.
• Open-sourcing kernel changes
• Community acceptance (contribute changes)
SGI and the Open Source Community
• Linux kernel:
– CPU/memory placement
– Kernel debugging
– Kernel profiling
– Kernel lock metering
– NUMA memory support
• Resource management:
– Comprehensive system accounting
– Process aggregates
• Community projects:
– Linux on Large Systems Foundry
– Linux Scalability Effort project
• Filesystem and storage:
– Linux FailSafe™
– XFS™ journaling filesystem
– File alteration/inode monitor
• Graphics:
– GLX™ OpenGL® extensions
– Open Inventor™ object-oriented toolkit for 3D
• Other projects:
– Performance Co-Pilot™
– Digital media audio file library
– Linux kernel crash dumps

A 10-year history of participation and contribution
Real HPC Workloads for Real HPC Users

Standard Linux® Distribution
• Runs standard 64-bit Linux applications
• Red Hat® Enterprise Linux® AS 2.1 binary compatible*
• Easy to develop and administer
• Intel® compilers and tools

SGI® Open-Source Enhancements
• Enabling features and functionality
• Contributing SGI expertise in scalability, NUMA
• XFS™ high-performance filesystem

SGI ProPack™ HPC Value-Add Enhancements
• Optimized high-productivity computing
• System management: partitioning, Performance Co-Pilot™, high-availability FailSafe™
• Resource management: CPU sets/memory placement, MPT, array services, SCSL math libraries
• Data management: CXFS™, hierarchical storage management tools (DMF/TMF), XVM

*SGI Advanced Linux Environment 2.1 is based on Red Hat Enterprise Linux AS 2.1 but is not sponsored by or endorsed by Red Hat, Inc. in any way.
SGI Provides Linux® Support Directly
[Diagram: the customer reports a bug to SGI; SGI Engineering generates a patch or workaround and passes the fix on to the appropriate party (the community or the Linux distributor).]
Developer Tools: Richest HPC Linux Environment
• Rapid evolution
– SGI knowledge of compilers
– Intel knowledge of processors
• Leverage open source
– Many apps available
– We test to verify
• Differentiation
– Enhanced ISV app performance from SGI libraries (MPT and SCSL)
– Only on Altix
• Engagement with premier tools ISVs
– Etnus (TotalView)
– Pallas (Vampir)
[Diagram: the tools ecosystem spans SGI, Intel, ISVs, and the open-source community.]
World-Record Results
Performance, efficiency, price/performance:
• Fastest Linux I/O performance: 7GB/sec
• Unsurpassed Linux® scalability on real-world applications
• World-record memory bandwidth: STREAM Triad
• World-record 16, 32, and 64P compute performance: SPEC® fp_rate base 2000, SPEC® int_rate base 2000, Linpack NxN
World-Record CPU Throughput: Floating-Point Results
[Bar chart: SPECfp_rate_base2000 at 8, 32, and 64 CPUs for the Sun Fire 15K (1.2GHz), HP AlphaServer GS1280 (1.15GHz), IBM® p690/655 (1.7/1.5GHz), SGI® Altix™ 3000 (1.3GHz), and SGI® Altix™ 3000 (1.5GHz).]
• World-record result for 64- and 32-processor systems
• SGI’s 1.5GHz, 32P result is 2x better performance than the IBM eServer p690, 1.7GHz
• SGI’s 1.3GHz, 64P result is 1.95x better than the Sun Fire 15K, 1.2GHz
• SGI is the only vendor providing high-performance 64-processor Linux® systems
• SGI’s architecture is the most efficient choice for mixed-workload environments
World-Record CPU Throughput: Integer Results
[Bar chart: SPECint_rate_base2000 at 8, 32, and 64 CPUs for the Sun Fire 15K (1.2GHz), HP AlphaServer GS1280 (1.15GHz), IBM® p690/655 (1.7/1.5GHz), SGI® Altix™ 3000 (1.3GHz), and SGI® Altix™ 3000 (1.5GHz).]
• World-record result for 64-, 32-, and 8-processor systems
• SGI’s 1.3GHz, 64P result is 1.54x better performance than the Sun Fire 15K, 64P
• SGI is the only vendor providing high-performance 64-processor Linux® systems
World-Leading Linpack HPC (NxN) Performance
[Bar chart: Linpack NxN results at 32, 64, and 128 CPUs for the HP Superdome™ (750MHz), IBM p690 (1.3GHz), IBM® eServer™ p690 (1.7GHz), SGI® Altix™ 3000 (1.3GHz), and SGI® Altix™ 3000 (1.5GHz).]
• World-record performance and efficiency for comparable configurations
• SGI’s 1.3GHz, 128P result is 1.46x better performance than the IBM eServer p690 1.3GHz, 128P
• Achieved 86% of peak performance at 32P, versus IBM p690 1.7GHz, 32P (66%) and an Athlon Myrinet cluster (58%)
• Efficiency comparable to the fastest computer today, the “Earth Simulator” (87.5%)
• In the first 3 months of shipment, 6 Altix 128P systems are squarely in the middle of the Top 500
Source: http://www.netlib.org/benchmark/performance.ps, Jun 3, 2003, and SGI and IBM performance reports
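For context, the “86% of peak” figure follows the usual Linpack efficiency definition, efficiency = Rmax / Rpeak. As an illustrative reconstruction (assuming the 1.3GHz Itanium® 2 parts, which can retire 4 floating-point operations per cycle):

\[
R_{\text{peak}} = 32 \times 1.3\,\text{GHz} \times 4 \;\approx\; 166\ \text{GFLOPS},
\qquad
R_{\max} \;\approx\; 0.86 \times 166 \;\approx\; 143\ \text{GFLOPS}.
\]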
World-Record Memory: STREAM Triad Results
[Bar chart: STREAM Triad bandwidth (GB/sec) at 16, 32, 64, and 128* CPUs for the HP Superdome™, IBM® eServer™ p690 (1.7GHz), and SGI® Altix™ 3000 (1.3GHz).]
• World-record STREAM 64P result for a microprocessor-based system, and fifth overall
• 1.56x better performance than the IBM eServer p690 at 32P
• The emerging bottleneck in high-performance computing is the system’s ability to efficiently access information; SGI’s architecture delivers
* The 128-CPU result uses MPI code to run on an Altix supercluster with two 64P nodes; for smaller CPU counts, OpenMP code was used. Cluster results are not eligible for the STREAM Top 20 list.
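The Triad figures above come from the kernel a[i] = b[i] + q·c[i] run over arrays far larger than cache. A rough OpenMP sketch of that kernel follows; it is illustrative only (array size, timing, and the derived GB/s figure are not the official STREAM benchmark code).

```c
/* Rough sketch of the STREAM "triad" kernel with OpenMP -- illustrative,
 * not the official benchmark.  Each element costs two loads and one store. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1L << 25)   /* 32M elements per array; must be much larger than cache */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) { perror("malloc"); return 1; }

    const double q = 3.0;
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + q * c[i];             /* the triad */
    double t1 = omp_get_wtime();

    double gbytes = 3.0 * sizeof(double) * N / 1e9;   /* bytes touched per pass */
    printf("triad: %.3f s, ~%.1f GB/s\n", t1 - t0, gbytes / (t1 - t0));

    free(a); free(b); free(c);
    return 0;
}
```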
SGI Altix 3000 Scalability for Compute-Intensive Applications
[Chart: speedup vs. number of processors (1–64) for Gaussian (CCM), Amber (CCM), Fasta (BIO), Star-CD (CFD), Vectis (CFD), LS-Dyna (EFEA), TAU (CFD), HTC-Blast (BIO), Fastx (BIO), MM5 (CWO), CASTEP (CCM), and GAMESS (CCM), plotted against ideal scaling; higher is better. Status: February 24, 2003.]
World-record Linux scalability on Altix 3000
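The speedup plotted in the chart is the usual ratio of single-processor runtime to p-processor runtime, with the “Ideal” curve being S(p) = p:

\[
S(p) \;=\; \frac{T(1)}{T(p)},
\qquad
\text{parallel efficiency} \;=\; \frac{S(p)}{p}.
\]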
Who Cares?: Developers Do
Key Applications: Scalability and Implementation
[Chart: number of CPUs (log scale, 1–512) used per application across segments, for SMP and MPP implementations. Altix™ 3000 is marked at 32–512p; IBM p690, NEC TX7, Unisys ES7000, Origin® 3000, HP RX5670, and an SSI limit are marked for comparison. Per-segment ranges shown include 1–32p, 4–16p, 4–32p, 8–64p, 16–64p, and 32–512p.
Applications plotted, by segment:
• CSM/FEA: Nastran, Ansys, Abaqus, Marc, Pam-Crash, LS-Dyna, Radioss
• CFD: Powerflow, Fluent, StarHPC, CFX, Fire, Vectis, Pam-Flow
• CCM (chemistry): Gaussian, Gamess, Amber, CASTEP, NWChem, Charmm, CNX, NAMD, ADF, VASP, Dmol
• BIO: BLAST, FASTA, ClustalW, HMMER, Wise2, D2_Cluster
• CWO (climate/weather/ocean): MM5, HIRLAM, ALADIN, CCSM2, WRF, CCM3, FVCCM, IFS, POP, ETA
• SPI (seismic): ProMAX, Omega, GeoCluster, FOCUS, GeoDepth, SeisUP
• RES (reservoir): Eclipse, VIP, POWERS]
Linux HPC Applications Strategy

Enablers → Revenue drivers

System-level differentiations: SSI, memory bandwidth, communication latency and bandwidth, XVM, dplace, cpusets, CXFS™
Code-level differentiations: SCSL, sparse direct solver, extreme reordering partitioning, parallel MPYAD, FFIO, HTC drivers, MPT, XPMEM

Revenue-driver applications by segment:
• FEA: Nastran, Abaqus, Ansys, Marc, LS-Dyna, Pam-Crash, Radioss
• CFD: Fluent, StarHPC, CFX5, Powerflow, Fire, Vectis
• Chemistry: Gaussian, Amber, Castep, CNX, Dmol, Charmm, GAMESS
• BIO: FASTA, BLAST, ClustalW, HMMER, Wise2, D2-Cluster, PHRAP
• Seismic: ProMAX, Omega, Epos, Geocluster, SeisUP
• Reservoir: VIP, Eclipse, Mores
• Weather: MM5, IFS, HIRLAM, ESA, ALADIN, CCM3, NCEP, POP, LM
Application Readiness
[Chart: applications ready to benchmark and certified applications, by month, January–June 2003; both counts grow over the period, with roughly 50 applications ready to benchmark by June.]
Who Cares?: End-Users Do
Real Workflow/Throughput Comparisons

Throughput analysis:
System | No. of CPUs | CPUs/job | Time | No. of jobs
96P cluster | 96 | 16 | 0.18 | 6
16P Altix | 16 | 4 | 0.18 | 4

System | No. of CPUs | CPUs/job | Time | Experiments/year
IA32 cluster | 96 | 96 | 3 weeks | 17
16P Altix | 16 | 16 | 8 days | 45
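The experiments-per-year column appears to follow from simply dividing the calendar year by the per-experiment runtime:

\[
\frac{365\ \text{days}}{21\ \text{days per experiment}} \approx 17.4,
\qquad
\frac{365\ \text{days}}{8\ \text{days per experiment}} \approx 45.6,
\]

consistent with the 17 and 45 experiments/year quoted above.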
Drug Discovery Case Study
• Incyte Genomics: drug discovery/bioinformatics
• 20 databases, 6GB–10GB
• On a Beowulf cluster:
– Each node needs maximum memory
– Start a job by loading the database from disk, with repeat I/O calls for datasets >4GB
• On an SGI® Altix™ 3000 supercluster:
– Databases are loaded once into the memory pool
– Average memory per node
– Jobs run immediately, referring to the database in shared memory
Conclusion:
• The SGI® Altix™ 3000 supercluster can replace the Beowulf cluster at a CPU ratio of 15:1
• Faster time to solution
• Reduced TCO: fewer nodes, fewer processors, less memory, easier management, easier upgrades, lower software costs, less power consumption, less rack space, easier administration
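A rough illustration of the “load once, reference in shared memory” idea (the file name, format, and trivial pattern scan below are hypothetical, not Incyte’s actual workflow): on a shared-memory system, every worker process can mmap() the same database file read-only and reference a single in-memory copy instead of re-reading it from disk on each node.

```c
/* Hypothetical sketch: map a large reference database once and scan it.
 * All processes mapping the same file share one physical in-memory copy. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "sequences.db";      /* hypothetical database file */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* One read-only, shared mapping; no per-job reload from disk. */
    const char *db = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (db == MAP_FAILED) { perror("mmap"); return 1; }

    /* Trivial stand-in for a real search: count occurrences of a 4-mer. */
    const char *pat = "GATC";
    size_t plen = strlen(pat), hits = 0;
    for (off_t i = 0; i + (off_t)plen <= st.st_size; i++)
        if (memcmp(db + i, pat, plen) == 0)
            hits++;

    printf("%zu hits in %lld bytes\n", hits, (long long)st.st_size);
    munmap((void *)db, st.st_size);
    close(fd);
    return 0;
}
```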
Manufacturing Case Study
Altix 3000 for Multi-Job, Multi-CPU Throughput
[Chart: elapsed time in seconds for a standalone 8-way LS-Dyna job versus each of four concurrent 8-way jobs run as a throughput mix on the same system.]
System: Altix 3000, 32P, 64GB, XVM, SGI® TP900 SAN
Individual jobs in the throughput mix are between 0.4% and 1.8% slower than the standalone case.
Who Cares?: Economic Buyers Do
What is “Price/Performance”?
It’s simple math … Or is it?
• What is “price”?
– Total cost of hardware
– Total cost of acquisition
– Total cost of ownership
• What is “performance”?
– Peak GFLOPS
– Single-job/single-processor application benchmarks
– Multijob/multiprocessor application benchmarks
– Broader productivity or workload benchmarks
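Two of the many possible ratios, just to illustrate how much the answer depends on the definitions chosen (notation illustrative, not from the slides):

\[
\frac{\text{hardware price}}{\text{peak GFLOPS}}
\qquad\text{versus}\qquad
\frac{\text{total cost of ownership over the system's life}}{\text{real workload completed over that life}}.
\]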
Just the Beginning

IPF Roadmap
• Compatible with the next-generation Intel® Itanium® 2 processor family

Hardware
• Advanced supercomputing technologies and capabilities
• Mid-range offering for departmental and database servers
• Superclusters scaling to thousands of processors
• Next-generation NUMAlink interconnect
• Advanced multipipe graphics

Software/Linux
• Further scalability for standard Linux
• Ongoing improvement of superior system-, data-, and resource-management tools
• Further development of global shared-memory capabilities
• Open-source and partner contributions
• Big data handling and I/O capabilities
• Compiler and development tool improvements