OAK RIDGE NATIONAL LABORATORY
U. S. DEPARTMENT OF ENERGY

The Center for Computational Sciences

100 TF Sustained on Cray X Series

SOS 8

April 13, 2004

James B. White III (Trey)

[email protected]

Disclaimer

The opinions expressed here do not necessarily represent those of the CCS, ORNL, DOE, the Executive Branch of the Federal Government of the United States of America, or even UT-Battelle.

Disclaimer (cont.)

Graph-free, chart-free environment
For graphs and charts:
  http://www.csm.ornl.gov/evaluation/PHOENIX/

100 Real TF on Cray Xn

Who needs capability computing?
Application requirements
Why Xn?
Laundry, Clean and Otherwise
Rants
  Custom vs. Commodity
  MPI
  CAF
  Cray

Who needs capability computing?

OMB?
Politicians?
Vendors?
Center directors?
Computer scientists?

Who needs capability computing?

Application scientists
  According to scientists themselves

Personal Communications

Fusion: General Atomics, Iowa, ORNL, PPPL, Wisconsin
Climate: LANL, NCAR, ORNL, PNNL
Materials: Cincinnati, Florida, NC State, ORNL, Sandia, Wisconsin
Biology: NCI, ORNL, PNNL
Chemistry: Auburn, LANL, ORNL, PNNL
Astrophysics: Arizona, Chicago, NC State, ORNL, Tennessee

Scientists Need Capability

Climate scientists need simulation fidelity to support policy decisions
  All we can say now is that humans cause warming
Fusion scientists need to simulate fusion devices
  All we can do now is model decoupled subprocesses at disparate time scales
Materials scientists need to design new materials
  Just starting to reproduce known materials

Scientists Need Capability

Biologists need to simulate proteins and protein pathways
  Baby steps with smaller molecules
Chemists need similar increases in complexity
Astrophysicists need to simulate nucleogenesis (high-res, 3D CFD, 6D neutrinos, long times)
  Low-res, 3D CFD, approximate 3D neutrinos, short times

Why Scientists Might Resist

Capacity also needed
Software isn’t ready
Coerced to run capability-sized jobs on inappropriate systems

Capability Requirements

Sample DOE SC applications
  Climate: POP, CAM
  Fusion: AORSA, Gyro
  Materials: LSMS, DCA-QMC

Parallel Ocean Program (POP)

Baroclinic
  3D, nearest neighbor, scalable
  Memory-bandwidth limited
Barotropic
  2D implicit system, latency bound
Ocean-only simulation
  Higher resolution
  Faster time steps
As ocean component for CCSM
  Atmosphere dominates

Community Atmosphere Model (CAM)

Atmosphere component for CCSM
Higher resolution?
  Physics changes, parameterization must be retuned, model must be revalidated
  Major effort, rare event
  Spectral transform not dominant
Dramatic increases in computation per grid point
  Dynamic vegetation, carbon cycle, atmospheric chemistry, …
Faster time steps

All-Orders Spectral Algorithm (AORSA)

Radio-frequency fusion-plasma simulation
Highly scalable
  Dominated by ScaLAPACK
  Still in weak-scaling regime
But…
  Expanded physics reducing ScaLAPACK dominance
  Developing sparse formulation

Gyro

Continuum gyrokinetic simulation of fusion-plasma microturbulence
1D data decomposition
Spectral method - high communication volume
Some need for increased resolution
More iterations

Locally Self-Consistent Multiple Scattering (LSMS)

Calculates electronic structure of large systems
One atom per processor
Dominated by local DGEMM
First real application to sustain a TF
But… moving to sparse formulation with a distributed solve for each atom

Dynamic Cluster Approximation (DCA-QMC)

Simulates high-temp superconductors
Dominated by DGER (BLAS2)
  Memory-bandwidth limited
Quantum Monte Carlo, but…
  Fixed start-up per process
  Favors fewer, faster processors
  Needs powerful processors to avoid parallelizing each Monte-Carlo stream
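
A rough way to see why a fixed start-up per process favors fewer, faster processors: the sketch below assumes each process runs one independent Monte Carlo stream and pays the same warm-up cost before producing useful samples. The function name `qmc_wall_time` and every number in it are hypothetical, chosen only for illustration; they are not measurements of DCA-QMC.

```python
def qmc_wall_time(total_samples, processes, samples_per_second, warmup_samples):
    """Wall-clock time for independent Monte Carlo streams, one per process.

    Every process first pays the same fixed warm-up cost, then takes an equal
    share of the useful samples. All parameters are illustrative assumptions.
    """
    samples_per_process = warmup_samples + total_samples / processes
    return samples_per_process / samples_per_second


# Two hypothetical machines with the same aggregate rate (1000 samples/s).
many_slow = qmc_wall_time(1_000_000, processes=1000, samples_per_second=1.0,
                          warmup_samples=10_000)
few_fast = qmc_wall_time(1_000_000, processes=100, samples_per_second=10.0,
                         warmup_samples=10_000)

print(f"1000 slow processors: {many_slow:.0f} s")
print(f" 100 fast processors: {few_fast:.0f} s")
```

Under these assumptions the warm-up is replicated on every process, so it is only shortened by making each processor faster, not by adding more of them.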

Few DOE SC Applications

Weak-ish scaling
Dense linear algebra
But moving to sparse

Many DOE SC Applications

“Strong-ish” scaling
  Limited increase in gridpoints
  Major increase in expense per gridpoint
  Major increase in time steps
Fewer, more-powerful processors
High memory bandwidth
High-bandwidth, low-latency communication
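
The requirements above can be illustrated with a toy cost model. It assumes that time steps run strictly one after another, that the work inside a step divides evenly across processors, and that each step ends with one global reduction costing a latency proportional to log2 of the processor count. The function `run_time` and all of its inputs are hypothetical, not figures from any of the applications named in this talk.

```python
import math


def run_time(gridpoints, flops_per_point, time_steps, processors,
             flops_per_second, hop_latency):
    """Estimated wall time for a fixed grid advanced through sequential steps.

    Compute per step shrinks with more (or faster) processors; the assumed
    tree-style reduction per step grows with log2(processors).
    """
    compute = gridpoints * flops_per_point / (processors * flops_per_second)
    communicate = hop_latency * math.log2(processors)
    return time_steps * (compute + communicate)


# Same aggregate peak (about 16.4 Tflop/s), split two different ways.
many_slow = run_time(1_000_000, 10_000, 100_000, processors=16384,
                     flops_per_second=1_000_000_000, hop_latency=5e-5)
few_fast = run_time(1_000_000, 10_000, 100_000, processors=1024,
                    flops_per_second=16_000_000_000, hop_latency=5e-5)

print(f"16384 slower processors: {many_slow:.1f} s")
print(f" 1024 faster processors: {few_fast:.1f} s")
```

In this model the compute time is identical for both machines; the whole difference comes from the latency term, which is multiplied by the number of time steps.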

Why X1?

“Strong-ish” scaling
  Limited increase in gridpoints
  Major increase in expense per gridpoint
  Major increase in time steps
Fewer, more-powerful processors
High memory bandwidth
High-bandwidth, low-latency communication

Tangent: Strongish* Scaling

Firm
Semistrong
Unweak
Strongoidal
MSTW (More Strong Than Weak)
JTSoS (Just This Side of Strong)
WNS (Well-Nigh Strong)
Seak, Steak, Streak, Stroak, Stronk
Weag, Weng, Wong, Wrong, Twong

* Greg Lindahl, Vendor Scum

X1 for 100 TF Sustained?

Uh, no
OS not scalable, fault-resilient enough for 10^4 processors
That “price/performance” thing
That “power & cooling” thing

Xn for 100 TF Sustained

For DOE SC applications, YES

Most-promising candidate

-or-

Least-implausible candidate

Why X, again?

Most-powerful processors
  Reduce need for scalability
  Obey Amdahl’s Law
High memory bandwidth
  See above
Globally addressable memory
  Lowest, most hide-able latency
  Scale latency-bound applications
High interconnect bandwidth
  Scale bandwidth-bound applications
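
For reference, Amdahl's Law bounds the speedup on N processors by 1 / ((1 - p) + p / N), where p is the fraction of the run time that parallelizes. A minimal sketch follows; the 99% parallel fraction is an arbitrary example value, not a measurement of any application discussed here.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Amdahl's Law: speedup when only part of the run time parallelizes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# With 1% serial work the speedup can never exceed 100, however many
# processors are added, which is why per-processor power still matters.
for n in (10, 100, 1000, 10_000):
    print(f"{n:>6} processors: speedup {amdahl_speedup(0.99, n):.1f}")
```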

The Bad News

Scalar performance
“Some tuning required”
Ho-hum MPI latency
  See Rants

Scalar Performance

Compilation is slow
Amdahl’s Law for single processes
  Parallelization -> Vectorization
Hard to port GNU tools
  GCC? Are you kidding?
  GCC compatibility, on the other hand…
Black Widow will be better

“Some Tuning Required”

Vectorization requires:
  Independent operations
  Dependence information
  Mapping to vector instructions
Applications take a wide spectrum of steps to inhibit this
  May need a couple of compiler directives
  May need extensive rewriting
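
To make "independent operations" concrete, here is a small sketch in plain Python (standing in for the Fortran or C loops a vectorizing compiler would see). The two function names are made up for this example. A loop is a vectorization candidate when its iterations can run in any order; a recurrence cannot be reordered without changing the answer.

```python
def scale_and_add(a, b, c, order):
    """Independent operations: a[i] depends only on b[i] and c[i]."""
    for i in order:
        a[i] = 2.0 * b[i] + c[i]
    return a


def running_sum(a, b, order):
    """Loop-carried dependence: a[i] needs the freshly computed a[i - 1]."""
    for i in order:
        a[i] = a[i - 1] + b[i]
    return a


b = [0.0, 1.0, 1.0, 1.0]
c = [1.0, 1.0, 1.0, 1.0]

# Independent loop: any iteration order gives the same answer.
same_fwd = scale_and_add([0.0] * 4, b, c, range(4))
same_rev = scale_and_add([0.0] * 4, b, c, reversed(range(4)))

# Recurrence: reordering the iterations changes the answer.
rec_fwd = running_sum([1.0, 0.0, 0.0, 0.0], b, range(1, 4))
rec_rev = running_sum([1.0, 0.0, 0.0, 0.0], b, reversed(range(1, 4)))

print("independent loop order-insensitive:", same_fwd == same_rev)
print("recurrence order-insensitive:", rec_fwd == rec_rev)
```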

Application Results

Awesome
Indifferent
Recalcitrant
Hopeless

Awesome Results

256-MSP X1 already showing unique capability

Apps bound by memory bandwidth, interconnect bandwidth, interconnect latency

POP, Gyro, DCA-QMC, AGILE-BOLTZTRAN, VH1, Amber, …

Many examples from DoD

Indifferent Results

Cray X1 is brute-force fast, but not cost effective

Dense linear algebra
  Linpack, AORSA, LSMS

Recalcitrant Results

Inherent algorithms are fine
Source code or ongoing code mods don’t vectorize
Significant code rewriting done, ongoing, or needed
CLM, CAM, Nimrod, M3D

Aside: How to Avoid Vectorization

Use pointers to add false dependencies
Put deep call stacks inside loops
Put debug I/O operations inside compute loops
Did I mention using pointers?
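
The pointer problem can be sketched without pointers. Below, two offsets into one Python list stand in for two pointers that may or may not refer to overlapping memory; the function names are invented for this example. When the regions overlap, running the loop element by element and running it as one vector operation (all loads, then all stores) give different answers, so a compiler that cannot rule out overlap has to assume a dependence even when none exists.

```python
def add_shifted_scalar(data, dst, src, n):
    """Element by element, as written: data[dst + i] = data[src + i] + 1."""
    for i in range(n):
        data[dst + i] = data[src + i] + 1
    return data


def add_shifted_vector(data, dst, src, n):
    """All loads first, then all stores, as one vector operation would do."""
    loaded = [data[src + i] for i in range(n)]
    for i in range(n):
        data[dst + i] = loaded[i] + 1
    return data


# Disjoint regions: both execution orders agree, so vectorizing is safe.
disjoint_scalar = add_shifted_scalar([0] * 8, dst=4, src=0, n=4)
disjoint_vector = add_shifted_vector([0] * 8, dst=4, src=0, n=4)

# Overlapping regions: the results differ, so vectorizing would be wrong.
overlap_scalar = add_shifted_scalar([0] * 5, dst=1, src=0, n=4)
overlap_vector = add_shifted_vector([0] * 5, dst=1, src=0, n=4)

print("disjoint agree:", disjoint_scalar == disjoint_vector)
print("overlap agree:", overlap_scalar == overlap_vector)
```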

Aside: Software Design

In general, we don’t know how to systematically design efficient, maintainable HPC software

Vectorization imposes constraints on software design
  Bad: Existing software must be rewritten
  Good: Resulting software often faster on modern superscalar systems

“Some tuning required” for X series
  Bad: You must tune
  Good: Tuning is systematic, not a Black Art

Vectorization “constraints” may help us develop effective design patterns for HPC software

Hopeless Results

Dominated by unvectorizable algorithms
Some benchmark kernels of questionable relevance
No known DOE SC applications

Summary

DOE SC scientists do need 100 TF and beyond of sustained application performance

Cray X series is the least-implausible option for scaling DOE SC applications to 100 TF of sustained performance and beyond

“Custom” Rant

“Custom vs. Commodity” is Red Herring
  CMOS is commodity
  Memory is commodity
  Wires are commodity
Cooling is independent of vector vs. scalar
  PNNL liquid-cooling clusters
  Vector systems may move to air-cooling
All vendors do custom packaging
Real issue: Software

MPI Rant

Latency-bound apps often limited by “MPI_Allreduce(…, MPI_SUM, …)”
  Not ping pong!
  An excellent abstraction that is eminently optimizable
Some apps are limited by point-to-point
  Remote load/store implementations (CAF, UPC) have performance advantages over MPI
  But MPI could be implemented using load/store, inlined, and optimized
  On the other hand, easier to avoid pack/unpack with load/store model
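
Why a sum allreduce is a different latency problem from ping pong: it is a chain of dependent exchanges, not one message. The sketch below simulates recursive doubling, one common allreduce algorithm, on a list of per-rank values. It is an illustration only; nothing here claims that this is how any particular MPI library implements MPI_Allreduce, and the function name is made up.

```python
def allreduce_sum(values):
    """Simulate a recursive-doubling sum across len(values) ranks.

    Returns (per-rank results, number of communication rounds). The sketch
    assumes the rank count is a power of two.
    """
    ranks = len(values)
    if ranks == 0 or ranks & (ranks - 1) != 0:
        raise ValueError("this sketch assumes a power-of-two rank count")
    current = list(values)
    rounds = 0
    distance = 1
    while distance < ranks:
        # Every rank exchanges its partial sum with the partner whose rank
        # differs in one bit, then both add what they received.
        current = [current[r] + current[r ^ distance] for r in range(ranks)]
        distance *= 2
        rounds += 1
    return current, rounds


results, rounds = allreduce_sum(list(range(8)))
print("every rank holds:", results[0], "after", rounds, "rounds")
```

Each round has to finish before the next can start, so the cost grows with log2 of the rank count times the per-message latency, and every rank pays it on every call.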

Co-Array-Fortran Rant

No such thing as one-sided communication
  It’s all two sided: send+receive, sync+put+sync, sync+get+sync
  Same parallel algorithms
CAF mods can be highly nonlocal
  Adding CAF in a subroutine can have implications on the argument types, and thus on the callers, the callers’ callers, etc.
  Rarely the case for MPI
We use CAF to avoid MPI-implementation performance inadequacies
  Avoiding nonlocality by cheating with Cray pointers

Cray Rant

Cray XD1 (OctigaBay) follows in tradition of T3E
  Very promising architecture
  Dumb name
Interesting competitor with Red Storm

Questions?

James B. White III (Trey)

[email protected]

http://www.csm.ornl.gov/evaluation/PHOENIX/

