
1

Chip-Multiprocessors & You

John Dennis (dennis@ucar.edu)

March 16, 2007


2

Intel “Tera Chip”

- 80-core chip, 1 teraflop
- 3.16 GHz / 0.95 V / 62 W
- Process: 45 nm technology, high-k
- 2D mesh network; each processor has a 5-port router
- Connects to “3D memory”

3

Outline

- Chip-Multiprocessors
- Parallel I/O library (PIO)
- Fun with Large Processor Counts
  - POP
  - CICE

4

Moore’s Law

- Most things are twice as nice every ~18 months: transistor count, processor speed, DRAM density
- Historical result:
  - Solve a problem twice as large in the same time
  - Solve the same-size problem in half the time
- --> Inactivity leads to progress!

5

The advent of Chip-multiprocessors

Moore’s Law gone bad!


6

New implications of Moore’s Law

- Every 18 months:
  - Number of cores per socket doubles
  - Memory density doubles
  - Clock speed may increase slightly
- 18 months from now:
  - 8 cores per socket
  - Slight increase in clock speed (~15%)
  - Same memory per core!!

7

New implications of Moore’s Law (con’t)

- Inactivity leads to no progress! Possible outcomes:
  - Same problem size / same parallelism: solve the problem ~15% faster
  - Bigger problem size:
    - Scalable memory? More processors enable a ~2x reduction in time to solution
    - Non-scalable memory? May limit the number of processors that can be used; waste 1/2 of the cores on a socket just to use its memory?
- All components of an application must scale to benefit from Moore’s Law increases!
- The memory footprint problem will not solve itself!

8

Questions?

9

Parallel I/O library (PIO)

John Dennis (dennis@ucar.edu), Ray Loy (rloy@mcs.anl.gov)

March 16, 2007


10

Introduction

- All component models need parallel I/O
- Serial I/O is bad!
  - Increased memory requirement
  - Typically a negative impact on performance
- Primary developers: J. Dennis, R. Loy
- Necessary for POP BGW runs

11

Design goals

- Provide parallel I/O for all component models
- Encapsulate complexity into the library
- Simple interface for component developers to implement

12

Design goals (con’t)

- Extensible for future I/O technology
- Backward compatible (node=0)
- Support for multiple formats: {sequential, direct} binary, netCDF
- Preserve the format of input/output files
- Supports 1D, 2D and 3D arrays
  - Currently XY; extensible to XZ or YZ

13

Terms and Concepts

- PnetCDF [ANL]:
  - High-performance I/O
  - Different interface
  - Stable
- netCDF4 + HDF5 [NCSA]:
  - Same interface
  - Needs the HDF5 library
  - Less stable
  - Lower performance
  - No support on Blue Gene

14

Terms and Concepts (con’t)

- Processor stride: allows matching of a subset of MPI I/O nodes to the system hardware (see the sketch below)
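A minimal sketch of the stride idea, assuming nothing about PIO's actual interface (the function name and arguments here are hypothetical): given the communicator size, the number of I/O tasks, and a stride, pick which ranks perform I/O.

    # Hypothetical illustration, not PIO's API: select I/O ranks by stride.
    def io_ranks(num_tasks, num_iotasks, stride, base=0):
        """Return the MPI ranks that act as I/O tasks: base, base+stride, ..."""
        ranks = [base + i * stride for i in range(num_iotasks)]
        if ranks and ranks[-1] >= num_tasks:
            raise ValueError("num_iotasks * stride exceeds the number of tasks")
        return ranks

    # Example: 512 tasks, one I/O task per group of 32
    print(io_ranks(num_tasks=512, num_iotasks=16, stride=32))
    # [0, 32, 64, ..., 480]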

15

Terms and Concepts (con’t)

- I/O decomp vs. COMP decomp
  - IO decomp == COMP decomp: MPI-IO + message aggregation
  - IO decomp != COMP decomp: need a rearranger (MCT)
- No component-specific info in the library
- Pair with existing communication technology
- 1D arrays in the library; the component must flatten 2D and 3D arrays (see the toy rearranger below)
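A toy, single-process sketch of the rearranger idea (my illustration; the real rearrangement is MPI communication handled by MCT): values laid out in the compute decomposition are re-ordered into the I/O decomposition by matching global degree-of-freedom (DOF) indices.

    import numpy as np

    # Toy illustration only: reorder compute-decomposed data into I/O order
    # by matching global DOF indices.
    def rearrange(comp_data, comp_dof, io_dof):
        """Return data ordered by io_dof, given data ordered by comp_dof."""
        by_global = dict(zip(comp_dof, comp_data))   # global index -> value
        return np.array([by_global[g] for g in io_dof])

    comp_dof = [3, 1, 4, 2]              # global points owned, in compute order
    comp_data = [30.0, 10.0, 40.0, 20.0]
    io_dof = [1, 2, 3, 4]                # I/O decomposition wants contiguous order
    print(rearrange(comp_data, comp_dof, io_dof))    # [10. 20. 30. 40.]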

16

Component Model ‘issues’

- POP & CICE: missing blocks
  - Update of neighbors’ halos
  - Who writes the missing blocks?
  - Asymmetry between read and write
  - ‘Sub-block’ decompositions are not rectangular
- CLM:
  - Decomposition not rectangular
  - Who writes the missing data?

17

What works

- Binary I/O [direct]
  - Tested on POWER5, BGL
  - Rearrange with MCT + MPI-IO
  - MPI-IO with no rearrangement
- netCDF
  - Rearrange with MCT [new]; reduced memory
- PnetCDF
  - Rearrange with MCT
  - No rearrangement
  - Tested on POWER5, BGL

18

What works (con’t)

- Prototype added to POP2
  - Reads restart and forcing files correctly
  - Writes binary restart files correctly
  - Necessary for BGW runs
- Prototype implementation in HOMME [J. Edwards]
  - Writes netCDF history files correctly
- POPIO benchmark
  - 2D array [3600 x 2400] (70 Mbyte)
  - Test code for correctness and performance
  - Tested on 30K BGL processors in Oct 2006
- Performance
  - POWER5: 2-3x the serial I/O approach
  - BGL: mixed

19

Complexity / Remaining Issues

- Multiple ways to express a decomposition:
  - GDOF: global degree of freedom --> (MCT, MPI-IO)
  - Subarrays: start + count (PnetCDF)
- Limited expressiveness
  - Will not support ‘sub-block’ in POP & CICE, CLM
- Need a common language for the interface between the component model and the library (the sketch below contrasts the two descriptions)
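A small illustration (my own, not the library interface) of the two ways to describe the same rectangular piece of a 10-wide global array: a PnetCDF-style start + count subarray versus a flat list of global DOF indices. The DOF list can also describe irregular regions, which is exactly the expressiveness the subarray form lacks.

    # Illustration only: one rectangular block, described two ways.
    nx_global = 10
    start, count = (4, 2), (3, 2)     # subarray style: 1-based starts and extents

    # Equivalent global-DOF list (1-based, row-major over the global grid):
    gdof = [(start[1] - 1 + j) * nx_global + (start[0] + i)
            for j in range(count[1]) for i in range(count[0])]
    print(gdof)    # [14, 15, 16, 24, 25, 26]

    # An irregular ('sub-block') region is just a different DOF list,
    # but it has no single start + count representation.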

20

Conclusions

- Working prototype
  - POP2 for binary I/O
  - HOMME for netCDF
- PIO telecon: discuss progress every 2 weeks
- Work in progress; multiple efforts underway; accepting help
- http://swiki.ucar.edu/ccsm/93
- In the CCSM Subversion repository

21

Fun with Large Processor Counts: POP, CICE

John Dennis (dennis@ucar.edu)

March 16, 2007

22

Motivation

- Can the Community Climate System Model (CCSM) be a petascale application?
  - Use 10-100K processors per simulation
- Increasingly common access to large systems
  - ORNL Cray XT3/4: 20K [2-3 weeks]
  - ANL Blue Gene/P: 160K [Jan 2008]
  - TACC Sun: 55K [Jan 2008]
- Petascale for the masses?
  - Lag time behind the Top 500 list [4-5 years]
  - @ NCAR before 2015

23

Outline

- Chip-Multiprocessors
- Parallel I/O library (PIO)
- Fun with Large Processor Counts
  - POP
  - CICE

24

Status of POP

- Access to 17K Cray XT4 processors
  - 12.5 years/day [current record]
  - 70% of time in the solver
- Won a BGW cycle allocation
  - “Eddy Stirring: The Missing Ingredient in Nailing Down Ocean Tracer Transport” [J. Dennis, F. Bryan, B. Fox-Kemper, M. Maltrud, J. McClean, S. Peacock]
  - 110 rack-days / 5.4M CPU hours
  - 20-year 0.1° POP simulation
  - Includes a suite of dye-like tracers
  - Simulate the eddy diffusivity tensor

25

Status of POP (con’t)

- Allocation will occur over ~7 days
- Run in production on 30K processors
- Needs parallel I/O to write history files
- Start runs in 4-6 weeks

26

Outline

- Chip-Multiprocessors
- Parallel I/O library (PIO)
- Fun with Large Processor Counts
  - POP
  - CICE

27

Status of CICE

- Tested CICE @ 1/10°
  - 10K Cray XT4 processors
  - 40K IBM Blue Gene processors [BGW days]
- Use weighted space-filling curves (wSFC)
  - erfc-based weighting
  - climatology-based weighting

28

POP (gx1v3) + Space-filling curve

29

Space-filling curve partition for 8 processors

30

Weighted Space-filling curves

- Estimate the work for each grid block:

  Work_i = w0 + P_i * w1

  where:
  - w0: fixed work for all blocks
  - w1: work if the block contains sea ice
  - P_i: probability that block i contains sea ice
- For our experiments: w0 = 2, w1 = 10

31

Probability Function

- Error function:

  P_i = erfc((μ - max(|lat_i|)) / σ)

  where:
  - lat_i: maximum latitude in block i
  - μ: mean sea-ice extent
  - σ: variance in sea-ice extent
- μ_NH = 70°, μ_SH = 60°, σ = 5°
- (a numerical sketch follows)
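A runnable sketch of the erfc-based weighting as reconstructed above (my reading of the slide; note that erfc ranges from 0 to 2, so P_i is treated here simply as a relative weight), combined with the block-work estimate Work_i = w0 + P_i * w1 from the previous slide.

    from math import erfc

    W0, W1, SIGMA = 2.0, 10.0, 5.0      # w0, w1 and sigma from the slides

    def ice_probability(max_abs_lat, hemisphere="NH"):
        """erfc((mu - max|lat_i|)/sigma) with mu = 70 (NH) or 60 (SH) degrees."""
        mu = 70.0 if hemisphere == "NH" else 60.0
        return erfc((mu - max_abs_lat) / SIGMA)

    def block_work(max_abs_lat, hemisphere="NH"):
        return W0 + ice_probability(max_abs_lat, hemisphere) * W1

    print(round(block_work(0.0), 2))          # equatorial block: ~2.0 (w0 only)
    print(round(block_work(85.0, "NH"), 2))   # polar block: ~22.0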

32

1° CICE4 on 20 processors

- Small domains @ high latitudes
- Large domains @ low latitudes

33

0.1° CICE4

- Developed at LANL
- Finite-difference sea-ice model
- Shares grid and infrastructure with POP
- Reuse techniques from the POP work
- Computational grid: [3600 x 2400 x 20]
- Computational load imbalance creates problems:
  - ~15% of the grid has sea ice
  - Use weighted space-filling curves?
- Evaluate using a benchmark: 1 day / initial run / 30-minute timestep / no forcing

34

CICE4 @ 0.1°

35

Timings for 1°, npes=160, NH=70°

Load imbalance: Hudson Bay is south of 70°

36

Timings for 1°, npes=160, NH=55°

37

Better Probability Function

- Climatological function:

  P_i = 1.0 if (Σ_j φ_ij) / n_i ≥ 0.1, else 0.0

  where:
  - φ_ij: climatological maximum sea-ice extent [satellite observation]
  - n_i: the number of points within block i with non-zero φ_ij
- (a numerical sketch follows)
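A runnable sketch of the climatology-based probability as reconstructed above (my reading of the slide): a block is weighted as ice-covered if the mean of φ over its points with non-zero climatological ice extent reaches 0.1.

    import numpy as np

    # Illustration of the climatological P_i described above.
    def climatological_p(phi_block, threshold=0.1):
        """P_i = 1.0 if sum(phi)/n_ice >= threshold, else 0.0 (0.0 if no ice points)."""
        phi = np.asarray(phi_block, dtype=float)
        n_ice = np.count_nonzero(phi)     # points with non-zero climatological extent
        if n_ice == 0:
            return 0.0
        return 1.0 if phi.sum() / n_ice >= threshold else 0.0

    print(climatological_p([[0.0, 0.0], [0.0, 0.0]]))   # never any ice -> 0.0
    print(climatological_p([[0.0, 0.6], [0.9, 0.0]]))   # (0.6 + 0.9) / 2 = 0.75 -> 1.0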

38

Timings for 1°, npes=160, climate-based

Reduces dynamics sub-cycling time by 28%!

39

Acknowledgements / Questions?

Thanks to: D. Bailey (NCAR), F. Bryan (NCAR), T. Craig (NCAR), J. Edwards (IBM), E. Hunke (LANL), B. Kadlec (CU), E. Jessup (CU), P. Jones (LANL), K. Lindsay (NCAR), W. Lipscomb (LANL), M. Taylor (SNL), H. Tufo (NCAR), M. Vertenstein (NCAR), S. Weese (NCAR), P. Worley (ORNL)

Computer time:
- Blue Gene/L time: NSF MRI grant, NCAR, University of Colorado, IBM (SUR) program, BGW Consortium Days, IBM Research (Watson)
- Cray XT3/4 time: ORNL, Sandia

40

Partitioning with Space-filling Curves

- Map 2D -> 1D
- Variety of sizes:
  - Hilbert (Nb = 2^n)
  - Peano (Nb = 3^m)
  - Cinco (Nb = 5^p)
  - Hilbert-Peano (Nb = 2^n 3^m)
  - Hilbert-Peano-Cinco (Nb = 2^n 3^m 5^p)
- Partition the resulting 1D array (see the sketch below)
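A toy sketch of the final partitioning step (my illustration; the actual partitioner may differ): once blocks are ordered along the space-filling curve, the 1D list of block weights is cut into contiguous pieces of roughly equal total weight, which is what lets the weighted-SFC approach balance load.

    # Greedy illustration: assign each block (in curve order) to a partition,
    # cutting the 1D weight list into contiguous, roughly equal-weight pieces.
    def partition_1d(weights, nparts):
        total = sum(weights)
        target, acc, part, owner = total / nparts, 0.0, 0, []
        for w in weights:
            if acc > 0 and acc + w > target and part < nparts - 1:
                part, acc = part + 1, 0.0
            owner.append(part)
            acc += w
        return owner

    # Ice-heavy blocks (weight 12) at the curve's ends, ice-free blocks (2) between
    weights = [12, 2, 2, 2, 2, 2, 2, 12]
    print(partition_1d(weights, 2))   # [0, 0, 0, 0, 1, 1, 1, 1] -> 18 vs 18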

41

Scalable data structures

A common problem among applications:
- WRF: serial I/O [fixed]; duplication of lateral boundary values
- POP & CICE: serial I/O
- CLM: serial I/O; duplication of grid info

42

Scalable data structures (con’t)

- CAM: serial I/O; lookup tables
- CPL: serial I/O; duplication of grid info
- The memory footprint problem will not solve itself!

43

Remove Land blocks

44

Case Study: Memory use in CLM

- CLM configuration:
  - 1x1.25 grid
  - No RTM
  - MAXPATCH_PFT = 4
  - No CN, DGVM
- Measure stack and heap on 32-512 BG/L processors

45

Memory use of CLM on BGL

46

Motivation (con’t)

Multiple efforts underway:
- CAM scalability + high-resolution coupled simulation [A. Mirin]
- Sequential coupler [M. Vertenstein, R. Jacob]
- Single-executable coupler [J. Wolfe]
- CCSM on Blue Gene [J. Wolfe, R. Loy, R. Jacob]
- HOMME in CAM [J. Edwards]

47

Outline

- Chip-Multiprocessors
- Fun with Large Processor Counts
  - POP
  - CICE
  - CLM
- Parallel I/O library (PIO)

48

Status of CLM

- Work of T. Craig:
  - Elimination of global memory
  - Reworking of the decomposition algorithms
  - Addition of PIO
- Short-term goal:
  - Participate in BGW days, June 2007
  - Investigate scalability at 1/10°

49

Status of CLM memory usage

- May 1, 2006: memory usage increases with processor count
  - Can run 1x1.25 on 32-512 BGL processors
- July 10, 2006: memory usage scales to an asymptote
  - Can run 1x1.25 on 32-2K BGL processors
  - ~350 persistent global arrays [24 Gbytes/proc @ 1/10 degree]
- January 2007: ~150 persistent global arrays
  - 1/2-degree runs on 32-2K BGL processors
  - [10.5 Gbytes/proc @ 1/10 degree]
- February 2007: 18 persistent global arrays [1.2 Gbytes/proc @ 1/10 degree]
- Target: no persistent global arrays; 1/10-degree runs on a single rack of BGL
- (a quick consistency check follows)
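A quick back-of-the-envelope check of those per-process figures (my arithmetic, assuming each persistent global array is a full 3600 x 2400 double-precision 1/10-degree field):

    # Rough consistency check of the memory-per-process numbers above.
    nx, ny, bytes_per_value = 3600, 2400, 8
    per_array_gb = nx * ny * bytes_per_value / 1e9     # ~0.069 GB per global array

    for n_arrays in (350, 150, 18):
        print(n_arrays, "arrays ->", round(n_arrays * per_array_gb, 1), "GB/proc")
    # 350 -> 24.2, 150 -> 10.4, 18 -> 1.2  (close to the slide's 24, 10.5, 1.2)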

50

Proposed Petascale Experiment

- Ensemble of 10 runs / 200 years
- Petascale configuration:
  - CAM (30 km, L66)
  - POP @ 0.1°: 12.5 years / wall-clock day [17K Cray XT4 processors]
  - Sea ice @ 0.1°: 42 years / wall-clock day [10K Cray XT3 processors]
  - Land model @ 0.1°
- Sequential design (105 days per run): 32K BGL / 10K XT3 processors
- Concurrent design (33 days per run): 120K BGL / 42K XT3 processors

51

POPIO benchmark on BGW

52

CICE results (con’t)

- Correct weighting increases the simulation rate
- wSFC works best at high resolution
- Variable-sized domains:
  - Large domains at low latitude -> higher boundary-exchange cost
  - Small domains at high latitude -> lower floating-point cost
- Optimal balance of computational and communication cost?
- Work in progress!