
Simulation Science in Grid Environments: Integrated Adaptive Software Systems

Lennart Johnsson

Advanced Computing Research Laboratory, Department of Computer Science, University of Houston
and
Department of Numerical Analysis and Computer Science, Royal Institute of Technology, Stockholm

Outline

• Technology drivers
• Sample applications
• Domain-specific software environments
• High-performance software

Cost of Computing

In 2010, the compute power of today’s top-of-the-line PC can be found in $1 consumer electronics

By 2010, today’s most powerful computers (the power of 10,000 PCs, at a cost of $100M) will cost a few hundred thousand dollars

[Chart: Functions per chip (Mtransistors), 0–100,000, from 1999 (180 nm) through 2014 (30 nm); series: DRAM at production, DRAM at introduction, MPU cost-performance at production and at introduction, MPU high-performance at production, ASIC at production.]

[Chart: Cost per Mtransistor ($/Mtransistor), from 1999 (180 nm) through 2011 (40 nm); series: DRAM (cost × 100) at introduction and at production, MPU cost-performance at introduction and at production, MPU high-performance at production.]

Source: SIA Roadmap

In 2010, $1 will buy enough disk space to store:
• 10,000 books
• 35 hrs of CD-quality audio
• 2 min of DVD-quality video

[Chart: Average price of storage ($/MByte, log scale from 0.001 to 1000), 1980–2010, for 3.5", 2.5", and 1" HDD, DRAM, Flash, and the range of paper/film, with individual product data points (IBM Ultrastar, Deskstar, Travelstar, and Microdrive families; Seagate, Quantum, Maxtor, and Toshiba drives; 8KB–128MB Flash parts) and DataQuest 2000 projections for 1" HDD and Flash. Source: Ed Grochowski, IBM Almaden.]

Growth of Cell vs. Internet

[Chart: cellular subscriptions vs. Internet hosts, in millions (0–800), 1992–2001.]

Access Technologies

Computing Platforms 2001 ⇒ 2030

Personal Computers O[$1,000]
– 10^9 Flops/sec in 2001 ⇒ 10^15 – 10^17 Flops/sec by 2030

Supercomputers O[$100,000,000]
– 10^13 Flops/sec in 2001 ⇒ 10^18 – 10^20 Flops/sec by 2030

Number of Computers [global population ~10^10]
– SCs ⇒ 10^-8 – 10^-6 per person ⇒ 10^2 – 10^4 systems
– PCs ⇒ 0.1x – 10x per person ⇒ 10^9 – 10^11 systems
– Embedded ⇒ 10x – 10^5x per person ⇒ 10^11 – 10^15 systems
– Nanocomputers ⇒ 0x – 10^10 per person ⇒ 0 – 10^20 systems

Available Flops Planetwide by 2030
– 10^24 – 10^30 Flops/sec [assuming classical models of computation]

Courtesy Rick Stevens

MEMS - Biosensors

http://www.darpa.mil/mto/mems/presentations/memsatdarpa3.pdf

MEMS – Jet Engine Application

http://www.darpa.mil/mto/mems/presentations/memsatdarpa3.pdf

Smart Dust - UCB RF Mote

[Photos: RF Mote, RF Mini Mote I, RF Mini Mote II, Laser Mote, Laser Mote with CCD, IrDA Mote, Sensor]

http://robotics.eecs.berkeley.edu/~pister/SmartDust/

Polymer Radio Frequency Identification Transponder

http://www.research.philips.com/pressmedia/pictures/polelec.html

Optical Communication costs

Larry Roberts, Caspian Networks

Fiber Optic Communication

In 2010 …

A million books can be sent across the Pacific for $1 in 8 seconds

All books in the American Research Libraries can be sent across the Pacific in about 1 hr for $500

Fiberoptic Communication Milestones

First Laser 1960

First room temperature laser, ~1970

Continuous mode commercial lasers, ~1980

Tunable lasers, ~1990

Commercial fiberoptic WANs, 1985

10 Tbps/strand demonstrated in 2000 (10% of fiber peak capacity). (10 Tbps is enough bandwidth to transmit a million high-definition resolution movies simultaneously, or over 100 million phone calls).

WAN fiberoptic cables often carry 384 strands (192 transmit/receive pairs) of fiber, for a cable capacity of about 2 Pbps at 10 Tbps per strand. Several such cables are typically deployed in the same conduit/right-of-way.

Pacific Capacity

Atlantic Capacity

[Chart: US research backbone capacity (Mbit/s, 0–10,000), 1986–2001, for NSFnet, vBNS, Internet2 Abilene, and TeraGrid; link speeds OC-3 → OC-12 → OC-48 → OC-192, doubling every year.]

[Map: DTF 40Gb TeraGrid backbone linking San Diego (SDSC), Los Angeles, San Francisco, Seattle, Portland, Vancouver, Chicago, NYC, and Atlanta; Chicago-area I-WIRE sites UIC, ANL, NCSA/UIUC, UC, IIT, and NU/Starlight (Star Tap); peerings with SURFnet, CA*net4, AMPATH, and NTON; partners PSC, IU, U Wisconsin.]

Charlie Catlett, Argonne National Laboratory

• State-funded infrastructure to support networking and applications research
  – $6.5M total funding
    • $4M FY00-01 (in hand)
    • $2.5M FY02 (approved 1-June-01)
    • Possible additional $1M in FY03-05
  – Application driven
    • Access Grid: telepresence & media
    • Computational Grids: Internet computing
    • Data Grids: information analysis
  – New technologies proving ground
    • Optical switching
    • Dense wave division multiplexing
    • Ultra-high-speed SONET
    • Wireless
    • Advanced middleware infrastructure

CalREN-2

CA*net 4 Architecture

[Map: CA*net 4 nodes at Victoria, Vancouver, Calgary, Edmonton, Saskatoon, Regina, Winnipeg, Thunder Bay, Windsor, Toronto, Ottawa, Montreal, Quebec, Fredericton, Charlottetown, Halifax, and St. John's, with cross-border links to Seattle, Chicago, New York, and Boston; legend: CANARIE GigaPOP, ORAN DWDM, Carrier DWDM, CA*net 4 node, possible future CA*net 4 node.]

Bill St. Arnaud, CANARIE

Wavelength Disk Drives

Computer data continuously circulates around the WDD

[Map: WDD nodes on CA*net 3/4 at Vancouver, Calgary, Regina, Winnipeg, Toronto, Ottawa, Montreal, Fredericton, Charlottetown, Halifax, and St. John's.]

GEANT

Nordic Grid Networks

[Map: link capacities of 0.155, 0.622, 2.5, and 10 Gbps.]

SURFnet4 Topology

Grid Applications

Grid Application Projects

ODIN

PAMELA

March 28, 2000 Fort Worth Tornado

Courtesy Kelvin Droegemeier

In 1988 … NEXRAD Was Becoming a Reality


Courtesy Kelvin Droegemeier

Houston, TX

Environmental Studies

Neptune Undersea Grid

Air Quality Measurementand Control

Surface data, radar data, balloon data, satellite data

Real-time data

NCAR

Digital Mammography
• About 40 million mammograms/yr in the USA (estimates: 32–48 million)
• About 250,000 new breast cancer cases detected each year
• Over 10,000 (analogue) units
• Resolution: up to about 25 microns/pixel
• Image size: up to about 4k x 6k pixels (example: 4096 x 5624)
• Dynamic range: 12 bits
• Image size: about 48 Mbytes
• Images per patient: 4
• Data set size per patient: about 200 Mbytes
• Data set per year: about 10 Pbytes
• Data set per unit, if digital: about 1 Tbyte/yr on average
• Data rates per unit: 4 Gbytes/operating day, or 0.5 Gbytes/hr, or about 1 Mbps
• Computation: 100 ops/pixel = 10 Mflops/unit, 100 Gflops total; 1,000 ops/pixel = 1 Tflops total
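The volume and rate figures above follow from the image geometry; here is a quick C check of that arithmetic (a sketch only: it assumes each 12-bit pixel is stored in 2 bytes and treats the ~40 million annual mammograms as one 4-image exam each, both simplifications not stated on the slide):

```c
/* Sanity check of the digital-mammography data volumes quoted above.
   Assumptions (not from the slide): 12-bit pixels stored in 2 bytes;
   40 million exams/yr of 4 images each; 10,000 units sharing the load. */
#include <stdio.h>

int main(void) {
    double pixels     = 4096.0 * 5624.0;          /* example image size     */
    double image_mb   = pixels * 2.0 / 1e6;       /* ~46 MB  (slide: ~48)   */
    double patient_mb = image_mb * 4.0;           /* ~184 MB (slide: ~200)  */
    double yearly_pb  = patient_mb * 40e6 / 1e9;  /* ~7.4 PB (slide: ~10)   */
    double unit_tb    = yearly_pb * 1e3 / 1e4;    /* ~0.7 TB/yr (slide: ~1) */

    printf("image: %.0f MB, patient: %.0f MB\n", image_mb, patient_mb);
    printf("per year: %.1f PB, per unit: %.2f TB/yr\n", yearly_pb, unit_tb);
    return 0;
}
```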

E-Science: Data Gathering, Analysis, Simulation, and Collaboration

LHC

CMS

Simulated Higgs Decay

Molecular Dynamics

Jim Briggs

University of Houston

Molecular Dynamics Simulations

SimDB

Simulation Data Base

SimDB Architecture

[Image: JEOL 3000-FEG electron cryomicroscope (liquid He stage; NSF support); scale bar 500 Å]

Biological Imaging

No. of Particles Needed for 3-D Reconstruction

Resolution:    8.5 Å     4.5 Å
B = 100 Å²:    6,000     5,000,000
B = 50 Å²:     3,000     150,000

[Image: 8.5 Å structure of the HSV-1 capsid]

EMEN Database
• Archival
• Data Mining
• Management

Vitrification Robot

EMAN reconstruction loop: Particle Selection & Power Spectrum Analysis → Initial 3D Model → Classify Particles → Align, Average, Deconvolute → Build New 3D Model → Reproject 3D Model → (iterate)

Tele-Microscopy: Osaka, Japan (Mark Ellisman, UCSD)

GEMSviz at iGRID 2000

[Diagram: connectivity between the University of Houston and Paralleldatorcentrum, KTH Stockholm via STAR TAP, NORDUnet, APAN, INET.]

Computational Steering

GrADS – Grid Application Development Software

Grids – Contract Development

Grids – Contract Development

Grids – Contract Development

Grids – Application Launch

Grids – Library Evaluation

Grids – Performance Models

Grids – Library Evaluation

Grids – Library Evaluation

Cactus on the Grid

Cactus – Job Migration

Cactus – Migration Architecture

Cactus – Migration example

Adaptive Software

• Diversity of execution environments
  – Growing complexity of modern microprocessors
    • Deep memory hierarchies
    • Out-of-order execution
    • Instruction-level parallelism
  – Growing diversity of platform characteristics
    • SMPs
    • Clusters (employing a range of interconnect technologies)
    • Grids (heterogeneity, wide range of characteristics)
• Wide range of application needs
  – Dimensionality and sizes
  – Data structures and data types
  – Languages and programming paradigms

Challenges

• Algorithmic
  – High arithmetic efficiency required
    • Low floating-point vs. load/store ratio
  – Unfavorable data access patterns (big 2^n strides; see the sketch after this list)
    • Application owns the data structures/layout
  – Additions/multiplications unbalanced
• Version explosion
  – Verification
  – Maintenance
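The stride problem is easy to see in code: a radix-2 FFT stage reads pairs of elements a large power of two apart, so on set-associative caches both streams of every butterfly can map to the same cache sets. A minimal generic illustration (textbook-style stage, not UHFFT source):

```c
/* One radix-2 butterfly stage over n = 2^k complex points: the two
   inputs of every butterfly are n/2 elements apart, a large power-of-2
   stride that maps both streams onto the same cache sets/TLB entries.
   Generic illustration, not UHFFT source code. */
#include <complex.h>

void radix2_stage(double complex *x, int n) {
    int half = n / 2;                    /* stride: a big power of two */
    for (int i = 0; i < half; i++) {
        double complex a = x[i];
        double complex b = x[i + half];  /* can conflict with x[i] in cache */
        x[i]        = a + b;
        x[i + half] = a - b;
    }
}
```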


Opportunities

• Multiple algorithms with comparable numerical properties for many functions
• Improved software techniques and hardware performance
• Integrated performance monitors, models and databases
• Run-time code construction
• Automatic algorithm selection – polyalgorithmic functions (CMSSL, FFTW, ATLAS, SPIRAL, …); see the dispatch sketch after this list
• Exploit multiple precision options
• Code generation from high-level descriptions (WASSEM, CMSSL, CM-Convolution-Compiler, FFTW, UHFFT, SPIRAL, …)
• Integrated performance monitoring, modeling and analysis
• Judicious choice between compile-time and run-time analysis and code construction
• Automated installation process
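The dispatch sketch mentioned above is nothing library-specific, just the generic pattern the polyalgorithmic libraries share: candidate kernels with comparable numerical properties are timed on the actual machine, and the fastest is the one called from then on. Stub kernels and hypothetical names here, for illustration:

```c
/* Generic polyalgorithmic selection: time each candidate once on this
   machine/problem size, then dispatch through the winner. The kernels
   are trivial stubs; a real library would plug in e.g. radix-2,
   radix-4, and split-radix FFTs with equivalent numerics. */
#include <stdio.h>
#include <stddef.h>
#include <time.h>

typedef void (*kernel_fn)(double *data, size_t n);

static void kernel_a(double *d, size_t n) { for (size_t i = 0; i < n; i++) d[i] += 1.0; }
static void kernel_b(double *d, size_t n) { for (size_t i = 0; i < n; i++) d[i] *= 1.0001; }

static double time_kernel(kernel_fn f, double *d, size_t n) {
    clock_t t0 = clock();
    for (int rep = 0; rep < 2000; rep++) f(d, n);  /* repeat for a stable reading */
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void) {
    static double data[4096];
    kernel_fn candidates[] = { kernel_a, kernel_b };
    size_t n_cand = sizeof candidates / sizeof candidates[0];

    size_t best = 0;
    double best_t = time_kernel(candidates[0], data, 4096);
    for (size_t i = 1; i < n_cand; i++) {          /* pick the fastest */
        double t = time_kernel(candidates[i], data, 4096);
        if (t < best_t) { best_t = t; best = i; }
    }
    printf("selected kernel %zu (%.4f s)\n", best, best_t);
    candidates[best](data, 4096);                  /* dispatch through winner */
    return 0;
}
```

In a library like those named above, the selection result would be cached in a performance database keyed by problem size, so the search cost is paid once per installation or per new size.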

Approach

• Program preparation at installation (platform dependent)
• Integrated performance models (in progress) and databases
• Algorithm selection at run time from a set defined at installation
• Automatic multiple-precision constant generation
• Program construction at run time based on the application and performance predictions

The UHFFT

Run-time:
• Input parameters: size, dimension, …
• Initialization: select best plan (factorization)
• Execution: calculate one or more FFTs
• Performance monitoring and database update
(A usage sketch of this plan-then-execute flow follows below.)

Installation (performance tuning methodology):
• Input parameters: system specifics, user options
• UHFFT code generator → library of FFT modules
• Performance database
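From the caller's side, the run-time flow above is the familiar plan-then-execute pattern (as in FFTW). A self-contained sketch with hypothetical names, not UHFFT's real API: plan creation stands in for the factorization search against the performance database, and execution reuses the chosen plan.

```c
/* Plan-then-execute usage sketch. Names are hypothetical, not UHFFT's
   actual interface. Plan creation stands in for the factorization
   search; execution reuses the plan across many transforms. */
#include <complex.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { size_t n; } fft_plan;   /* real plan: chosen factorization */

static fft_plan *fft_plan_create(size_t n) {
    fft_plan *p = malloc(sizeof *p);     /* real version searches plans here */
    p->n = n;
    return p;
}

/* O(n^2) DFT as a stand-in for executing the plan's codelet schedule. */
static void fft_execute(const fft_plan *p, const double complex *in,
                        double complex *out) {
    double pi = acos(-1.0);
    for (size_t k = 0; k < p->n; k++) {
        double complex s = 0.0;
        for (size_t j = 0; j < p->n; j++)
            s += in[j] * cexp(-2.0 * pi * I * (double)(j * k) / (double)p->n);
        out[k] = s;
    }
}

static void fft_plan_destroy(fft_plan *p) { free(p); }

int main(void) {
    enum { N = 8 };
    double complex in[N] = { 1, 1, 1, 1, 0, 0, 0, 0 }, out[N];
    fft_plan *p = fft_plan_create(N);    /* pay the planning cost once...   */
    fft_execute(p, in, out);             /* ...then execute many transforms */
    printf("X[0] = %.1f%+.1fi\n", creal(out[0]), cimag(out[0]));
    fft_plan_destroy(p);
    return 0;
}
```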

Codelet Efficiency

[Charts: codelet efficiency, radix-4 codelet efficiency, and radix-8 codelet efficiency on Intel PIV 1.8 GHz, AMD Athlon 1.4 GHz, and PowerPC G4 867 MHz.]
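A "codelet" in these charts is a small fixed-size FFT emitted by the code generator as straight-line code with precomputed constants. A hand-written size-4 example in that style (for illustration only; actual UHFFT codelets are machine-generated):

```c
/* The shape of a generated codelet: a fixed-size transform, fully
   unrolled, no loops or branches. Hand-written size-4 DIT FFT for
   illustration; real UHFFT codelets are emitted by the generator. */
#include <complex.h>

void fft4(const double complex *in, double complex *out) {
    /* stage 1: two size-2 butterflies over even/odd inputs */
    double complex e0 = in[0] + in[2];
    double complex e1 = in[0] - in[2];
    double complex o0 = in[1] + in[3];
    double complex o1 = in[1] - in[3];
    /* stage 2: combine; the only twiddle is W_4 = -i on o1 */
    out[0] = e0 + o0;
    out[1] = e1 - I * o1;
    out[2] = e0 - o0;
    out[3] = e1 + I * o1;
}
```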

Plan Performance, 32-bit Architectures

[Chart: Power3 (222 MHz, 888 Mflops peak) plan performance in MFLOPS (0–350) for the size-16 plans (16), (2,8), (4,4), (8,2), (2,2,4), (2,4,2), (4,2,2), and (2,2,2,2).]
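The plan labels in the chart are ordered factorizations of 16 into codelet radices. A small recursive enumerator reproduces that list (a sketch; the real planner also scores each plan against the codelet timing database):

```c
/* Enumerate FFT plans for a given size as ordered factorizations into
   available codelet radices; enumerate(16, ...) yields exactly the
   plans on the Power3 chart: (16), (2,8), (2,2,4), (2,2,2,2), (2,4,2),
   (4,4), (4,2,2), (8,2). Sketch only: no performance scoring here. */
#include <stdio.h>

static const int radices[] = { 2, 4, 8, 16 };    /* codelet sizes available */

static void enumerate(int n, int *plan, int depth) {
    if (n == 1) {                                /* complete plan: print it */
        for (int i = 0; i < depth; i++)
            printf(i ? " %d" : "(%d", plan[i]);
        printf(")\n");
        return;
    }
    for (size_t i = 0; i < sizeof radices / sizeof radices[0]; i++) {
        if (n % radices[i] == 0) {               /* peel a radix, recurse */
            plan[depth] = radices[i];
            enumerate(n / radices[i], plan, depth + 1);
        }
    }
}

int main(void) {
    int plan[16];
    enumerate(16, plan, 0);
    return 0;
}
```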

Itanium …

Tested configurations:

Processor             Clock     Peak        Cache structure
Intel Itanium         800 MHz   3.2 GFlops  L1: 16K+16K (data+instruction); L2: 96K; L3: 2-4M (off-die)
Intel Itanium 2       900 MHz   3.6 GFlops  L1: 16K+16K (data+instruction); L2: 256K; L3: 1.5M (on-die)
Intel Itanium 2       1000 MHz  4 GFlops    L1: 16K+16K (data+instruction); L2: 256K; L3: 3M (on-die)
Sun UltraSparc-III    750 MHz   1.5 GFlops  L1: 64K+32K+2K+2K (data+instruction+prefetch+write); L2: up to 8M (off-die)
Sun UltraSparc-III    1050 MHz  2.1 GFlops  L1: 64K+32K+2K+2K (data+instruction+prefetch+write); L2: up to 8M (off-die)

Memory Hierarchy

L1I and L1D                Itanium                            Itanium-2 (McKinley)
Size:                      16KB + 16KB                        16KB + 16KB
Line size/associativity:   32B / 4-way                        64B / 4-way
Latency:                   1 cycle                            1 cycle
Write policies:            write-through, no write-allocate   write-through, no write-allocate

Unified L2                 Itanium                            Itanium-2 (McKinley)
Size:                      96KB                               256KB
Line size/associativity:   64B / 6-way                        128B / 8-way
Integer latency:           min 6 cycles                       min 5 cycles
FP latency:                min 9 cycles                       min 6 cycles
Write policies:            write-back, write-allocate         write-back, write-allocate

Unified L3                 Itanium                            Itanium-2 (McKinley)
Size:                      4MB or 2MB (off-chip)              3MB or 1.5MB (on-chip)
Line size/associativity:   64B / 4-way                        128B / 12-way
Integer latency:           min 21 cycles                      min 12 cycles
FP latency:                min 24 cycles                      min 13 cycles
Bandwidth:                 16B/cycle                          32B/cycle
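Latency figures like these are what an installation-time tuner can measure for itself with a dependent-load (pointer-chasing) loop. A minimal sketch of that standard technique (buffer sizes, stride, and iteration counts are illustrative, and clock() granularity plus hardware prefetching make it a rough measurement):

```c
/* Pointer-chasing load-latency probe: every load depends on the
   previous one, so time per iteration approximates the latency of the
   cache level the buffer fits in. Minimal sketch; prefetching and
   clock() granularity make this a rough measurement. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const long iters = 20 * 1000 * 1000;
    for (size_t kb = 8; kb <= 8192; kb *= 2) {     /* sweep L1 -> L3/memory  */
        size_t n = kb * 1024 / sizeof(void *);
        void **buf = malloc(n * sizeof *buf);
        for (size_t i = 0; i < n; i++)             /* chase in one-cache-line */
            buf[i] = &buf[(i + 8) % n];            /* steps (8 ptrs = 64 B)   */
        void **p = buf;
        clock_t t0 = clock();
        for (long i = 0; i < iters; i++)
            p = (void **)*p;                       /* dependent load chain */
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("%5zu KB: %6.2f ns/load\n", kb, sec * 1e9 / (double)iters);
        if (p == NULL) return 1;                   /* keep the chain live */
        free(buf);
    }
    return 0;
}
```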

Itanium Comparison

Workstation     HP i2000                    HP zx2000
Processor:      800 MHz Intel Itanium       900 MHz Intel Itanium 2 (McKinley)
Bus speed:      133 MHz                     400 MHz
Bus width:      64 bit                      128 bit
OS:             64-bit Red Hat Linux 7.1    HP version of 64-bit RH Linux 7.2
Compiler:       Intel 6.0                   Intel 6.0
Memory:         2 GB SDRAM (133 MHz)        2 GB DDR SDRAM (266 MHz)
Chipset:        Intel 82460GX               HP zx1

HP zx1 Chipset

Features:
• 2-way and 4-way
• Low-latency connection to DDR memory: directly (112 ns latency), or through up to 12 scalable memory expanders (+25 ns latency)
• Up to 64 GB of DDR today (256 GB in the future)
• AGP 4x today (8x in future versions)
• 1-8 I/O adapters supporting PCI, PCI-X, AGP

[Diagram: 2-way block diagram]

UHFFT Codelet Performance

[Charts: codelet performance for radix-2, radix-3, radix-4, radix-5, radix-6, radix-7, radix-13, and radix-64.]

The UHFFT: Summary
• Code generator written in C
• Code is generated at installation
• Codelet library is tuned to the underlying architecture
• The whole library can be easily customized through parameter specification
  – No need for laborious manual changes in the source
  – Existing code-generation infrastructure allows easy library extensions
• Future:
  – Inclusion of vector/streaming instruction-set extensions for various architectures
  – Implementation of new scheduling/optimization algorithms
  – New codelet types and better execution routines
  – Unified algorithm specification on all levels

Acknowledgements

Dave Angulo, Ruth Aydt, Fran Berman, Andrew Chien, Keith Cooper, Holly Dail, Jack Dongarra, Ian Foster, Sridhar Gullapalli, Lennart Johnsson, Ken Kennedy, Carl Kesselman, Chuck Koelbel, Bo Liu, Chuang Liu, Xin Liu, Anirban Mandal, Mark Mazina, John Mellor-Crummey, Celso Mendes, Graziano Obertelli, Alex Olugbile, Mitul Patel, Dan Reed, Martin Swany, Linda Torczon, Satish Vadhiyar, Shannon Whitmore, Rich Wolski, Huaxia Xia, Lingyun Yang, Asim Yarkin, …

GrADS contributors

Funding: NSF Next Generation Software initiative, Los Alamos Computer Science Institute

AcknowledgementsSimDB Contributors:

Matin Abdullah

Michael Feig

Lennart Johnsson

Seonah Kim

Prerna Kohsla

Gillian Lynch

Montgomery Pettitt

Funding:

NPACI (NSF)

Texas Learning and Computation Center

Acknowledgements

UHFFT Contributors

Dragan Mirkovic

Rishad Mahasoom

Fredrick Mwandia

Nils Smeds

Funding:

Alliance (NSF)

LACSI (DoE)