
1

Chip-Multiprocessors & You

John Dennis (dennis@ucar.edu)

March 16, 2007


2

Intel “Tera Chip”

- 80-core chip, 1 teraflop
- 3.16 GHz / 0.95 V / 62 W
- Process: 45 nm technology, high-k
- 2D mesh network; each processor has a 5-port router
- Connects to “3D memory”

3

Outline

- Chip-Multiprocessors
- Parallel I/O library (PIO)
- Fun with Large Processor Counts
  - POP
  - CICE

4

Moore’s Law

- Most things are twice as nice every ~18 months: transistor count, processor speed, DRAM density
- Historical result:
  - Solve a problem twice as large in the same time
  - Solve the same-size problem in half the time
- --> Inactivity leads to progress!

5

The advent of Chip-multiprocessors

Moore’s Law gone bad!


6

New implications of Moore’s Law

- Every 18 months:
  - Number of cores per socket doubles
  - Memory density doubles
  - Clock speed may increase slightly
- 18 months from now:
  - 8 cores per socket
  - Slight increase in clock speed (~15%)
  - Same memory per core!!

7

New implications of Moore’s Law (con’t)

- Inactivity leads to no progress! Possible outcomes:
  - Same problem size / same parallelism: solve the problem ~15% faster
  - Bigger problem size:
    - Scalable memory? More processors enable a ~2x reduction in time to solution
    - Non-scalable memory? May limit the number of processors that can be used; waste 1/2 of the cores on a socket just to use its memory?
- All components of an application must scale to benefit from Moore’s Law increases!
- The memory footprint problem will not solve itself!

8

Questions?

9

Parallel I/O library (PIO)

John Dennis (dennis@ucar.edu), Ray Loy (rloy@mcs.anl.gov)

March 16, 2007


10

Introduction

- All component models need parallel I/O
- Serial I/O is bad!
  - Increased memory requirement
  - Typically a negative impact on performance
- Primary developers: J. Dennis, R. Loy
- Necessary for POP BGW runs

11

Design goals

- Provide parallel I/O for all component models
- Encapsulate complexity into the library
- Simple interface for component developers to implement

12

Design goals (con’t)

- Extensible for future I/O technology
- Backward compatible (node=0)
- Support for multiple formats: {sequential, direct} binary, netCDF
- Preserve the format of input/output files
- Supports 1D, 2D and 3D arrays
  - Currently XY; extensible to XZ or YZ

13

Terms and Concepts

- PnetCDF [ANL]:
  - High-performance I/O
  - Different interface
  - Stable
- netCDF4 + HDF5 [NCSA]:
  - Same interface
  - Needs the HDF5 library
  - Less stable
  - Lower performance
  - No support on Blue Gene

14

Terms and Concepts (con’t)

- Processor stride: allows matching of a subset of MPI I/O nodes to the system hardware (see the sketch below)
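A minimal sketch of the stride idea, assuming nothing about PIO's actual interface (the function name and arguments here are hypothetical): given the communicator size, the number of I/O tasks, and a stride, pick which ranks perform I/O.

    # Hypothetical illustration, not PIO's API: select I/O ranks by stride.
    def io_ranks(num_tasks, num_iotasks, stride, base=0):
        """Return the MPI ranks that act as I/O tasks: base, base+stride, ..."""
        ranks = [base + i * stride for i in range(num_iotasks)]
        if ranks and ranks[-1] >= num_tasks:
            raise ValueError("num_iotasks * stride exceeds the number of tasks")
        return ranks

    # Example: 512 tasks, one I/O task per group of 32
    print(io_ranks(num_tasks=512, num_iotasks=16, stride=32))
    # [0, 32, 64, ..., 480]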

15

Terms and Concepts (con’t)

- I/O decomp vs. COMP decomp
  - IO decomp == COMP decomp: MPI-IO + message aggregation
  - IO decomp != COMP decomp: need a rearranger (MCT)
- No component-specific info in the library
- Pair with existing communication technology
- 1D arrays in the library; the component must flatten 2D and 3D arrays (see the toy rearranger below)
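A toy, single-process sketch of the rearranger idea (my illustration; the real rearrangement is MPI communication handled by MCT): values laid out in the compute decomposition are re-ordered into the I/O decomposition by matching global degree-of-freedom (DOF) indices.

    import numpy as np

    # Toy illustration only: reorder compute-decomposed data into I/O order
    # by matching global DOF indices.
    def rearrange(comp_data, comp_dof, io_dof):
        """Return data ordered by io_dof, given data ordered by comp_dof."""
        by_global = dict(zip(comp_dof, comp_data))   # global index -> value
        return np.array([by_global[g] for g in io_dof])

    comp_dof = [3, 1, 4, 2]              # global points owned, in compute order
    comp_data = [30.0, 10.0, 40.0, 20.0]
    io_dof = [1, 2, 3, 4]                # I/O decomposition wants contiguous order
    print(rearrange(comp_data, comp_dof, io_dof))    # [10. 20. 30. 40.]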

16

Component Model ‘issues’

- POP & CICE: missing blocks
  - Update of neighbors’ halos
  - Who writes the missing blocks?
  - Asymmetry between read and write
  - ‘Sub-block’ decompositions are not rectangular
- CLM:
  - Decomposition not rectangular
  - Who writes the missing data?

17

What works

- Binary I/O [direct]
  - Tested on POWER5, BGL
  - Rearrange with MCT + MPI-IO
  - MPI-IO with no rearrangement
- netCDF
  - Rearrange with MCT [new]; reduced memory
- PnetCDF
  - Rearrange with MCT
  - No rearrangement
  - Tested on POWER5, BGL

18

What works (con’t)

- Prototype added to POP2
  - Reads restart and forcing files correctly
  - Writes binary restart files correctly
  - Necessary for BGW runs
- Prototype implementation in HOMME [J. Edwards]
  - Writes netCDF history files correctly
- POPIO benchmark
  - 2D array [3600 x 2400] (70 Mbyte)
  - Test code for correctness and performance
  - Tested on 30K BGL processors in Oct 2006
- Performance
  - POWER5: 2-3x the serial I/O approach
  - BGL: mixed

19

Complexity / Remaining Issues

- Multiple ways to express a decomposition:
  - GDOF: global degree of freedom --> (MCT, MPI-IO)
  - Subarrays: start + count (PnetCDF)
- Limited expressiveness
  - Will not support ‘sub-block’ in POP & CICE, CLM
- Need a common language for the interface between the component model and the library (the sketch below contrasts the two descriptions)
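A small illustration (my own, not the library interface) of the two ways to describe the same rectangular piece of a 10-wide global array: a PnetCDF-style start + count subarray versus a flat list of global DOF indices. The DOF list can also describe irregular regions, which is exactly the expressiveness the subarray form lacks.

    # Illustration only: one rectangular block, described two ways.
    nx_global = 10
    start, count = (4, 2), (3, 2)     # subarray style: 1-based starts and extents

    # Equivalent global-DOF list (1-based, row-major over the global grid):
    gdof = [(start[1] - 1 + j) * nx_global + (start[0] + i)
            for j in range(count[1]) for i in range(count[0])]
    print(gdof)    # [14, 15, 16, 24, 25, 26]

    # An irregular ('sub-block') region is just a different DOF list,
    # but it has no single start + count representation.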

20

Conclusions

- Working prototype
  - POP2 for binary I/O
  - HOMME for netCDF
- PIO telecon: discuss progress every 2 weeks
- Work in progress; multiple efforts underway; accepting help
- http://swiki.ucar.edu/ccsm/93
- In the CCSM Subversion repository

21

Fun with Large Processor Counts: POP, CICE

John Dennis (dennis@ucar.edu)

March 16, 2007

22

Motivation

- Can the Community Climate System Model (CCSM) be a petascale application?
  - Use 10-100K processors per simulation
- Increasingly common access to large systems
  - ORNL Cray XT3/4: 20K [2-3 weeks]
  - ANL Blue Gene/P: 160K [Jan 2008]
  - TACC Sun: 55K [Jan 2008]
- Petascale for the masses?
  - Lag time behind the Top 500 list [4-5 years]
  - @ NCAR before 2015

23

Outline

- Chip-Multiprocessors
- Parallel I/O library (PIO)
- Fun with Large Processor Counts
  - POP
  - CICE

24

Status of POP

- Access to 17K Cray XT4 processors
  - 12.5 years/day [current record]
  - 70% of time in the solver
- Won a BGW cycle allocation
  - “Eddy Stirring: The Missing Ingredient in Nailing Down Ocean Tracer Transport” [J. Dennis, F. Bryan, B. Fox-Kemper, M. Maltrud, J. McClean, S. Peacock]
  - 110 rack-days / 5.4M CPU hours
  - 20-year 0.1° POP simulation
  - Includes a suite of dye-like tracers
  - Simulate the eddy diffusivity tensor

25

Status of POP (con’t)

- Allocation will occur over ~7 days
- Run in production on 30K processors
- Needs parallel I/O to write history files
- Start runs in 4-6 weeks

26

Outline

- Chip-Multiprocessors
- Parallel I/O library (PIO)
- Fun with Large Processor Counts
  - POP
  - CICE

27

Status of CICE

- Tested CICE @ 1/10°
  - 10K Cray XT4 processors
  - 40K IBM Blue Gene processors [BGW days]
- Use weighted space-filling curves (wSFC)
  - erfc-based weighting
  - climatology-based weighting

28

POP (gx1v3) + Space-filling curve

29

Space-filling curve partition for 8 processors

30

Weighted Space-filling curves

- Estimate the work for each grid block:

  Work_i = w0 + P_i * w1

  where:
  - w0: fixed work for all blocks
  - w1: work if the block contains sea ice
  - P_i: probability that block i contains sea ice
- For our experiments: w0 = 2, w1 = 10

31

Probability Function

- Error function:

  P_i = erfc((μ - max(|lat_i|)) / σ)

  where:
  - lat_i: maximum latitude in block i
  - μ: mean sea-ice extent
  - σ: variance in sea-ice extent
- μ_NH = 70°, μ_SH = 60°, σ = 5°
- (a numerical sketch follows)
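A runnable sketch of the erfc-based weighting as reconstructed above (my reading of the slide; note that erfc ranges from 0 to 2, so P_i is treated here simply as a relative weight), combined with the block-work estimate Work_i = w0 + P_i * w1 from the previous slide.

    from math import erfc

    W0, W1, SIGMA = 2.0, 10.0, 5.0      # w0, w1 and sigma from the slides

    def ice_probability(max_abs_lat, hemisphere="NH"):
        """erfc((mu - max|lat_i|)/sigma) with mu = 70 (NH) or 60 (SH) degrees."""
        mu = 70.0 if hemisphere == "NH" else 60.0
        return erfc((mu - max_abs_lat) / SIGMA)

    def block_work(max_abs_lat, hemisphere="NH"):
        return W0 + ice_probability(max_abs_lat, hemisphere) * W1

    print(round(block_work(0.0), 2))          # equatorial block: ~2.0 (w0 only)
    print(round(block_work(85.0, "NH"), 2))   # polar block: ~22.0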

32

1° CICE4 on 20 processors

- Small domains @ high latitudes
- Large domains @ low latitudes

33

0.1° CICE4

- Developed at LANL
- Finite-difference sea-ice model
- Shares grid and infrastructure with POP
- Reuse techniques from the POP work
- Computational grid: [3600 x 2400 x 20]
- Computational load imbalance creates problems:
  - ~15% of the grid has sea ice
  - Use weighted space-filling curves?
- Evaluate using a benchmark: 1 day / initial run / 30-minute timestep / no forcing

34

CICE4 @ 0.1°

35

Timings for 1°, npes=160, NH=70°

Load imbalance: Hudson Bay is south of 70°

36

Timings for 1°, npes=160, NH=55°

37

Better Probability Function

- Climatological function:

  P_i = 1.0 if (Σ_j φ_ij) / n_i ≥ 0.1, else 0.0

  where:
  - φ_ij: climatological maximum sea-ice extent [satellite observation]
  - n_i: the number of points within block i with non-zero φ_ij
- (a numerical sketch follows)
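A runnable sketch of the climatology-based probability as reconstructed above (my reading of the slide): a block is weighted as ice-covered if the mean of φ over its points with non-zero climatological ice extent reaches 0.1.

    import numpy as np

    # Illustration of the climatological P_i described above.
    def climatological_p(phi_block, threshold=0.1):
        """P_i = 1.0 if sum(phi)/n_ice >= threshold, else 0.0 (0.0 if no ice points)."""
        phi = np.asarray(phi_block, dtype=float)
        n_ice = np.count_nonzero(phi)     # points with non-zero climatological extent
        if n_ice == 0:
            return 0.0
        return 1.0 if phi.sum() / n_ice >= threshold else 0.0

    print(climatological_p([[0.0, 0.0], [0.0, 0.0]]))   # never any ice -> 0.0
    print(climatological_p([[0.0, 0.6], [0.9, 0.0]]))   # (0.6 + 0.9) / 2 = 0.75 -> 1.0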

38

Timings for 1°, npes=160, climate-based

Reduces dynamics sub-cycling time by 28%!

39

Acknowledgements / Questions?

Thanks to: D. Bailey (NCAR), F. Bryan (NCAR), T. Craig (NCAR), J. Edwards (IBM), E. Hunke (LANL), B. Kadlec (CU), E. Jessup (CU), P. Jones (LANL), K. Lindsay (NCAR), W. Lipscomb (LANL), M. Taylor (SNL), H. Tufo (NCAR), M. Vertenstein (NCAR), S. Weese (NCAR), P. Worley (ORNL)

Computer time:
- Blue Gene/L time: NSF MRI grant, NCAR, University of Colorado, IBM (SUR) program, BGW Consortium Days, IBM Research (Watson)
- Cray XT3/4 time: ORNL, Sandia

40

Partitioning with Space-filling Curves

- Map 2D -> 1D
- Variety of sizes:
  - Hilbert (Nb = 2^n)
  - Peano (Nb = 3^m)
  - Cinco (Nb = 5^p)
  - Hilbert-Peano (Nb = 2^n 3^m)
  - Hilbert-Peano-Cinco (Nb = 2^n 3^m 5^p)
- Partition the resulting 1D array (see the sketch below)
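A toy sketch of the final partitioning step (my illustration; the actual partitioner may differ): once blocks are ordered along the space-filling curve, the 1D list of block weights is cut into contiguous pieces of roughly equal total weight, which is what lets the weighted-SFC approach balance load.

    # Greedy illustration: assign each block (in curve order) to a partition,
    # cutting the 1D weight list into contiguous, roughly equal-weight pieces.
    def partition_1d(weights, nparts):
        total = sum(weights)
        target, acc, part, owner = total / nparts, 0.0, 0, []
        for w in weights:
            if acc > 0 and acc + w > target and part < nparts - 1:
                part, acc = part + 1, 0.0
            owner.append(part)
            acc += w
        return owner

    # Ice-heavy blocks (weight 12) at the curve's ends, ice-free blocks (2) between
    weights = [12, 2, 2, 2, 2, 2, 2, 12]
    print(partition_1d(weights, 2))   # [0, 0, 0, 0, 1, 1, 1, 1] -> 18 vs 18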

41

Scalable data structures

A common problem among applications:
- WRF: serial I/O [fixed]; duplication of lateral boundary values
- POP & CICE: serial I/O
- CLM: serial I/O; duplication of grid info

42

Scalable data structures (con’t)

- CAM: serial I/O; lookup tables
- CPL: serial I/O; duplication of grid info
- The memory footprint problem will not solve itself!

43

Remove Land blocks

44

Case Study: Memory use in CLM

- CLM configuration:
  - 1x1.25 grid
  - No RTM
  - MAXPATCH_PFT = 4
  - No CN, DGVM
- Measure stack and heap on 32-512 BG/L processors

45

Memory use of CLM on BGL

46

Motivation (con’t)

Multiple efforts underway:
- CAM scalability + high-resolution coupled simulation [A. Mirin]
- Sequential coupler [M. Vertenstein, R. Jacob]
- Single-executable coupler [J. Wolfe]
- CCSM on Blue Gene [J. Wolfe, R. Loy, R. Jacob]
- HOMME in CAM [J. Edwards]

47

Outline

- Chip-Multiprocessors
- Fun with Large Processor Counts
  - POP
  - CICE
  - CLM
- Parallel I/O library (PIO)

48

Status of CLM

- Work of T. Craig:
  - Elimination of global memory
  - Reworking of the decomposition algorithms
  - Addition of PIO
- Short-term goal:
  - Participate in BGW days, June 2007
  - Investigate scalability at 1/10°

49

Status of CLM memory usage

- May 1, 2006: memory usage increases with processor count
  - Can run 1x1.25 on 32-512 BGL processors
- July 10, 2006: memory usage scales to an asymptote
  - Can run 1x1.25 on 32-2K BGL processors
  - ~350 persistent global arrays [24 Gbytes/proc @ 1/10 degree]
- January 2007: ~150 persistent global arrays
  - 1/2-degree runs on 32-2K BGL processors
  - [10.5 Gbytes/proc @ 1/10 degree]
- February 2007: 18 persistent global arrays [1.2 Gbytes/proc @ 1/10 degree]
- Target: no persistent global arrays; 1/10-degree runs on a single rack of BGL
- (a quick consistency check follows)
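A quick back-of-the-envelope check of those per-process figures (my arithmetic, assuming each persistent global array is a full 3600 x 2400 double-precision 1/10-degree field):

    # Rough consistency check of the memory-per-process numbers above.
    nx, ny, bytes_per_value = 3600, 2400, 8
    per_array_gb = nx * ny * bytes_per_value / 1e9     # ~0.069 GB per global array

    for n_arrays in (350, 150, 18):
        print(n_arrays, "arrays ->", round(n_arrays * per_array_gb, 1), "GB/proc")
    # 350 -> 24.2, 150 -> 10.4, 18 -> 1.2  (close to the slide's 24, 10.5, 1.2)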

50

Proposed Petascale Experiment

- Ensemble of 10 runs / 200 years
- Petascale configuration:
  - CAM (30 km, L66)
  - POP @ 0.1°: 12.5 years / wall-clock day [17K Cray XT4 processors]
  - Sea ice @ 0.1°: 42 years / wall-clock day [10K Cray XT3 processors]
  - Land model @ 0.1°
- Sequential design (105 days per run): 32K BGL / 10K XT3 processors
- Concurrent design (33 days per run): 120K BGL / 42K XT3 processors

51

POPIO benchmark on BGW

52

CICE results (con’t)

- Correct weighting increases the simulation rate
- wSFC works best at high resolution
- Variable-sized domains:
  - Large domains at low latitude -> higher boundary-exchange cost
  - Small domains at high latitude -> lower floating-point cost
- Optimal balance of computational and communication cost?
- Work in progress!