nci.org.au | @NCInews

Scaling Weather, Climate and Environmental Science Applications, and experiences with Intel Knights Landing

Ben Evans, Dale Roberts

17th Workshop on High Performance Computing in Meteorology
NCI program to accelerate HPC Scaling and Optimisation
© National Computational Infrastructure 2016
• Modelling Extreme & High Impact events – BoM
• NWP, Climate Coupled Systems & Data Assimilation – BoM, CSIRO, Research Collaboration
• Hazards – Geoscience Australia, BoM, States
• Geophysics, Seismic – Geoscience Australia, Universities
• Monitoring the Environment & Ocean – ANU, BoM, CSIRO, GA, Research, Fed/State
• International research – International agencies and Collaborative Programs
Tropical Cyclones
Cyclone Winston, 20-21 Feb 2016

Volcanic Ash
Manam Eruption, 31 July 2015

Bush Fires
Wye Valley and Lorne Fires, 25-31 Dec 2015

Flooding
St George, QLD, February 2011
Ben Evans, ECMWF, Oct 2016
ACCESS Model: NWP, Climate, Ocean, ESM, Seasonal and Multi-year
[Diagram: ACCESS coupled system – Coupler linking Atmosphere, Ocean and sea-ice, Atmospheric chemistry, Carbon, and Terrestrial components]
Core Model
• Atmosphere – UM 10.5+
• Ocean – MOM 5.1 (for most models); NEMO 3.6 (for GC3 seasonal-only)
• Sea-Ice – CICE5
• Coupler – OASIS-MCT

Carbon cycle (ACCESS-ESM)
• Terrestrial – CABLE
• Bio-geochemical
• Coupled to modified ACCESS1.3

Aerosols and Chemistry
• UKCA

Wave
• WW3
Additional priority codes – Australian storm surge model using ROMS
NCI Research Data Collections: model, data processing and analysis

Data Collections | Approx. Capacity
• CMIP5, CORDEX, ACCESS Models: 5 PB
• Satellite Earth Obs (LANDSAT, Himawari-8, Sentinels, plus MODIS, InSAR, …): 2 PB
• Digital Elevation, Bathymetry, Onshore/Offshore Geophysics: 1 PB
• Seasonal Climate: 700 TB
• Bureau of Meteorology Observations: 350 TB
• Bureau of Meteorology Ocean-Marine: 350 TB
• Terrestrial Ecosystem: 290 TB
• Reanalysis products: 100 TB

1. Climate/ESS Model Assets and Data Products
2. Earth and Marine Observations and Data Products
3. Geoscience Collections
4. Terrestrial Ecosystems Collections
5. Water Management and Hydrology Collections

http://geonetwork.nci.org.au
Himawari-8 Observations, Data Assimilation and Analysis

Captured at JMA, processed after acquisition at BoM, and made available at NCI.
Data products are still to be generated; the first stage was to make the image data available.
10-minute capture-and-process cycle; the data then also needs to be made available for broad analysis.
Earth Observation Time Series Analysis

• Over 300,000 Landsat scenes (spatial/temporal), allowing flexible, efficient, large-scale in-situ analysis
• Spatially-regular, time-stamped, band-aggregated tiles presented as temporal stacks

[Figure: spatially partitioned tiles → temporal analysis]
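The temporal-stack idea above can be sketched in a few lines of numpy. This is an illustration only, not the AGDC API: same-extent, time-stamped tiles become one (time, y, x) array, so any per-pixel temporal statistic is a single reduction along the leading axis.

```python
import numpy as np

def temporal_stack(tiles):
    """Stack equally-shaped 2D tiles along a new leading time axis."""
    return np.stack(tiles, axis=0)  # shape: (time, y, x)

def per_pixel_median(stack):
    """A simple temporal statistic: per-pixel median over all epochs."""
    return np.median(stack, axis=0)
```

The same layout supports any axis-0 reduction (min, max, percentile, change counts), which is what makes the tiled presentation efficient for continental-scale analysis.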
Continental-Scale Water Observations from Space

WOfS water detection
• 27 years of data from LS5 & LS7 (1987-2014)
• 25 m nominal pixel resolution
• Approx. 300,000 individual source ARG-25 scenes in approx. 20,000 passes
• Entire 27 years of 1,312,087 ARG-25 tiles => 93×10^12 pixels visited
• 0.75 PB of data
• 3 hrs at NCI (elapsed time) to compute
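A back-of-envelope check of the figures above: 93×10^12 pixels visited in roughly 3 elapsed hours implies a sustained machine-wide rate of about 8.6 billion pixels per second.

```python
def pixels_per_second(total_pixels, elapsed_hours):
    """Sustained pixel-visit rate for a whole-of-archive run."""
    return total_pixels / (elapsed_hours * 3600.0)

# WOfS run quoted above: 93e12 pixels in ~3 hours
rate = pixels_per_second(93e12, 3)  # ~8.6e9 pixels/s
```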
Enable global and continental scale … and scale down to local/catchment/plot

• Water availability and usage over time
• Catchment zone
• Vegetation changes
• Data fusion with point-clouds and local or other measurements
• Statistical techniques on key variables

Preparing for:
• Better programmatic access
• Machine/Deep Learning
• Better integration through Semantic/Linked data technologies
Emerging Petascale Geophysics codes

- Assess priority Geophysics areas
  - 3D/4D Geophysics: Magnetotellurics, AEM
  - Hydrology, Groundwater, Carbon Sequestration
  - Forward and Inverse Seismic models and analysis (onshore and offshore)
  - Natural Hazard and Risk models: Tsunami, Ash-cloud
- Issues
  - Data across domains, data resolution (points, lines, grids), data coverage
  - Model maturity for running at scale
  - Ensemble, Uncertainty analysis and Inferencing
NCI High Performance Optimisation and Scaling activities, 2014-17

• Objectives:
  • Upscale and increase performance of high-priority national codes – particularly Weather and Climate
• Year 1
  • Characterise, optimise and tune critical applications for higher resolution
  • Best-practice configuration for improved throughput
  • Establish analysis toolsets and methodology
• Year 2
  • Characterise, optimise and tune next-generation high-priority applications
  • Select high-priority geophysics codes and exemplar HPC codes for scalability
  • Parallel algorithm review and I/O optimisation methods to enable better scaling
  • Established TIHP Optimisation work package for UM codes (Selwood, Evans)
• Year 3
  • Assess a broader set of community codes for scalability
  • Updated hardware (many-core), memory/data latency/bandwidths, energy efficiency
  • Communication libraries, maths libraries
Common methodology and approach for analysis

• Analyse code to establish strengths and weaknesses
  • Full code analysis including hotspots and algorithm choices
  • Expose the model to more extreme scaling – e.g., realistic higher resolution
  • Analyse and compare different software stacks
  • Decomposition strategies for nodes, and node tuning
  • Parallel algorithms and MPI communication patterns, e.g. halo analysis, grid exchanges
  • I/O techniques: evaluate serial and parallel techniques
  • Future hardware technologies
Scaling/Optimisation: BoM Research-to-Operations high-profile cases

Domain: Atmosphere
- Yr1 (2014/5): APS1 (UM 8.2-4)
  • Global N320 L70 (40 km) and pre-APS2 N512 L70 (25 km)
  • Regional N768 L70 (~17 km)
  • City 4.5 km
- Yr2 (2015/6): UM 10.x (PS36)
  • Global N768 L70 (~17 km)
  • Regional, City
- Yr3 (2016/7): APS3 prep
  • UM 10.x latest
  • ACCESS-G (Global) N1024 L70/L85 (12 km) or N1280 L70/L85 (10 km)
  • ACCESS-GE Global Ensemble (N216 L70) (~60 km)
  • ACCESS-TC 4 km
  • ACCESS-R (Regional) 12 km
  • ACCESS-C (City) 1.5 km
NCI-Fujitsu Scaling/Optimisation: BoM Research-to-Operations high-profile cases

Domain: Data assimilation
- Yr1 (2014/5): 4D-VAR v30 • N216 L70, N320 L70
- Yr2 (2015/6): 4D-VAR latest for Global at N320 L70 • enKF-C

Domain: Ocean
- Yr1 (2014/5): MOM 5.1 • OFAM3 • 0.25°, L50
- Yr2 (2015/6): MOM 5.1 • 0.1°, L50 • OceanMAPS 3.1 (MOM5) with enKF-C
- Yr3 (2016/7): MOM5/6 0.1° and 0.03° • ROMS (Regional) • Storm Surge (2D) • eReefs (3D)

Domain: Wave
- WaveWatch3 v4.18 (v5.08 beta) • AusWave-G 0.4° • AusWave-R 0.1°
NCI-Fujitsu Scaling/Optimisation: BoM Research-to-Operations high-profile cases (cont.)

Domain: Coupled Systems
- Yr1 (2014/5): Climate: ACCESS-CM
  • GA6 (UM 8.5) + MOM5 with OASIS-MCT
  • Global N96 L38 (135 km), 1° and 0.25° ocean
- Yr2 (2015/6): Climate: ACCESS-CM cont.
- Yr3 (2016/7): Climate: ACCESS-CM2 (AOGCM)
  • GA7 (UM 10.4+ and GLOMAP aerosol)
  • Global N96 L85, N216 L85 (~60 km)
  • MOM 5.1 0.25°
  • CABLE2
  • UKCA aerosols
  Earth System: ACCESS-ESM2
  • ACCESS-CM2 + Terrestrial biochemistry – CASA-CNP
  • Oceanic biogeochemistry – WOMBAT
  • Atmospheric chemistry – UKCA

Domain: Seasonal Climate
- Yr1 (2014/5): ACCESS-S1 – UK GC2 with OASIS3 • N216 L85 (~60 km) • NEMO 0.25°
- Yr2 (2015/6): GC2 – NCI profiling methodology applied for MPMD
- Yr3 (2016/7): Multi-week and Seasonal Climate: ACCESS-S2 / UK GC3 • Atmos N216 L85 (60 km) and NEMO 3.6 0.25° L75
HPC General Scaling and Optimisation Approach

Domain: Profiling Methodology
- Yr2 (2015/6): Create methodology for profiling codes
- Yr3 (2016/7): Updates to methodology based on application across more codes

Domain: I/O profiling
- Yr2: • Baseline profiling for comparison of NetCDF3, NetCDF4, HDF5 and GeoTIFF, and API options (e.g. GDAL) • Profiling comparison of I/O performance of Lustre, NFS • Compare MPI-IO vs POSIX vs HDF5 on Lustre
- Yr3: • Advanced profiling of HDF5 and NetCDF4 for compression algorithms, multithreading, cache management • Profiling analysis of other data formats, e.g. GRIB, astronomy FITS, SEG-Y, BAM

Domain: Accelerator Technology Investigation
- Yr3: • Intel Phi (Knights Landing) • AMD GPU

Domain: Profiling tools suite
- Yr2: Review major open-source profiling tools
- Yr3: Investigation of profilers for accelerators
HPC General Scaling and Optimisation Approach (cont.)

Domain: Compute Node Performance Analysis
- Yr2: • Partially committing nodes • Hardware hyper-threading • Memory bandwidth • Interconnect bandwidth
- Yr3: • Evaluating energy efficiency vs performance of next-generation processors • Broadwell improvements • Memory speed • Vectorisation • OpenMP coverage

Domain: Software Stacks
- Yr2: • OpenMPI vs Intel MPI analysis • Intel compiler versions
- Yr3: • OpenMPI vs Intel MPI analysis • Intel compiler versions • Maths libraries

Domain: Analysis of other Earth Systems & Geophysics priority codes and algorithms
- Yr2: • Initial analysis of MPI communications • Commence analysis of high-priority/high-profile HPC codes
- Yr3: • Detailed analysis of MPI-communication-dependent algorithms • Survey of codes and algorithms used
NCI contributions to UM collaboration so far

• UM 10.4+ I/O Server now using MPI-IO
  • Immediately valuable for NWP (UK Met, Aus, …)
  • Critical for next-generation processors (i.e., KNL)
• UM 10.5+ OpenMP coverage
  • Increased performance
  • Critical for both current and next architectures, especially with increasing memory bandwidth issues
Knights Landing nodes at NCI

KNLs
• 32 Intel Xeon Phi 7230 processors
  – 64 cores / 256 threads per socket, 1 socket per node
  – 16 GB MCDRAM on-package (380+ GB/s bandwidth)
  – 192 GB DDR4-2400 MHz (115.2 GB/s)
• EDR InfiniBand interconnect between KNLs (100 Gb/s)
• FDR InfiniBand links to main Lustre storage (56 Gb/s)

Kepler K80 GPUs also available.

Raijin is a Fujitsu Primergy cluster:
• 57,472 cores (Intel Xeon Sandy Bridge technology, 2.6 GHz) in 3592 compute nodes
• InfiniBand FDR interconnect
• 10 PB of Lustre for short-term scratch space
• 30 PB for data collections storage
Basic KNL first impressions

• KNL pros
  – Full x86 compatibility – applications 'just work' without need for major code changes
  – AVX 512-bit instruction set gives a performance boost for well-vectorised applications
  – Potential to process 2 vector operations per cycle
• KNL cons
  – Cores are significantly slower than typical Xeon processors
    • 1.3 GHz KNL vs. 2.5+ GHz for typical Haswell/Broadwell Xeons
    • Simpler architecture means fewer instructions processed per cycle
  – Profiling is difficult, and the hardware does not fully expose what is needed

• Need to understand much more about our applications and their multi-phasic nature
• Deep work on I/O, memory pressure, and interprocessor comms
• Relearn how to project the value of the processors
• Use this experience to look at other emerging technologies in parallel
Experimenting with KNL characteristics

• Australian Geoscience Data Cube LANDSAT processing pipeline
  – Process a series of observations from the LANDSAT 8 satellite
• NOAA Method of Splitting Tsunami (MOST) model
  – Wave propagation due to a magnitude 7.5 earthquake in the Sunda subduction zone
• UKMO Unified Model v10.5
  – N96 AMIP global model
  – N512 NWP global model
• These were not chosen as the best codes for KNL, but as ones that were both important and that we could "quickly" explore.
Landsat NBAR data processing pipeline: KNL vs. Sandy Bridge, single thread

• Same executable, run on both architectures (i.e. no AVX-512 instructions)
• Separately recompiled with -xMIC-AVX512
[Chart: relative runtime of Landsat processing pipeline tasks (ModTrans, CalcGrids, TCBand, Packager, InterpBand, Extract) on Sandy Bridge, KNL, and KNL w/ AVX-512]

• Most tasks took longer to complete on KNL
  • LANDSAT pipeline tasks are mostly point-wise kernels or I/O-bound
  • Little opportunity for the compiler to vectorise
  • AVX operations run at a lower clock speed on the KNL
• The 'ModTrans' and 'TCBand' tasks are exceptions
  • ModTrans was relatively well vectorised
  • TCBand (Terrain Correction) was converted from point-wise kernels to vector-wise kernels
  • Noted that they are faster than SnB (normalised for clock speed)
NOAA MOST Tsunami code – single-threaded performance

Time spent on vectorisation is the important first step to CPU performance on KNL.
[Charts: MOST average timestep on Sandy Bridge, KNL original, and KNL vectorised – raw time (s), and time scaled by CPU clock speed]
• While the original MOST code is not vectorised, it does run on KNL
• Replaced key routines with vectorised versions
• Compared both raw performance and performance normalised by clock speed
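The point-wise to vector-wise rewrite described above can be illustrated in numpy (a hypothetical stand-in for the MOST Fortran routines, not the actual code): the arithmetic is unchanged, but expressing it over whole array slices lets it execute as vector operations.

```python
import numpy as np

def update_pointwise(h, u, dt, dx):
    """Point-wise kernel: an explicit loop over interior points."""
    out = h.copy()
    for i in range(1, len(h) - 1):
        out[i] = h[i] - dt / dx * 0.5 * (u[i + 1] - u[i - 1])
    return out

def update_vectorised(h, u, dt, dx):
    """Same update written over array slices, amenable to AVX-512."""
    out = h.copy()
    out[1:-1] = h[1:-1] - dt / dx * 0.5 * (u[2:] - u[:-2])
    return out
```

Both produce identical results; only the form changes, which is also why such rewrites remain useful on non-KNL hardware.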
LANDSAT processing pipeline – comparing node-for-node performance

• Parallelism in AGDC LANDSAT is obtained through the 'Luigi' Python scheduler
  – Task dependencies are tracked within scenes; embarrassingly parallel across scenes
  – For 20 scenes, 2620 tasks in total
• An 'ideal' combination of tasks was built (with and without AVX-512 instructions)
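The per-scene parallelism described above can be sketched with a plain worker pool; this is a minimal illustration (Luigi adds dependency tracking on top of this), and `process_scene` is a hypothetical stand-in for the real per-scene tasks.

```python
from concurrent.futures import ThreadPoolExecutor

def process_scene(scene_id):
    # Stand-in for the real per-scene pipeline (ModTrans, TCBand, ...).
    return scene_id, f"scene-{scene_id}-done"

def run_scenes(scene_ids, workers=4):
    """Scenes are independent, so a pool of workers maps over them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process_scene, scene_ids))
```

Varying `workers` is the software analogue of the 16/64/128-worker comparison on the chart below: more workers only help while there are enough independent scenes to keep them busy.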
[Chart: AGDC LANDSAT processing of 20 scenes – KNL 128 workers: 45:39; KNL 64 workers: 52:29; Sandy Bridge 16 workers: 43:43 (minutes)]
• Knights Landing is slower than Sandy Bridge in this case
  – Node-for-node performance is competitive
  – Vectorisation can yet improve – noted that 128 tasks outperform 64 tasks by over 20%
NOAA MOST – OpenMP performance on KNL

• Parallelism in NOAA MOST is obtained through OpenMP
• Good scaling on top of already good single-threaded performance
  – Over 90% efficiency going from 1 thread to fully occupying a node
• Does not benefit from oversubscription
  – Likely due to the subdomains becoming quite small at high thread counts
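A rough model of why oversubscription stops paying off: with a fixed grid, more threads means smaller subdomains, so a fixed-width halo becomes a larger share of each thread's work. The numbers here are illustrative, not taken from MOST.

```python
def interior_fraction(nrows, threads, halo=1):
    """Interior share of each thread's strip in a 1D decomposition
    of an nrows-deep domain with fixed-width halo rows."""
    rows = nrows / threads          # rows owned by each thread
    return rows / (rows + 2 * halo) # interior vs. interior + halos
```

For a 256-row domain, 16 threads leave each thread ~89% interior work, while 128 threads leave only 50%: half of every strip is halo overhead, so extra threads no longer translate into speedup.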
[Chart: NOAA MOST scaling factor vs thread count (1 to 128)]
NOAA MOST: KNL vs Sandy Bridge node-for-node

• 3x faster node-for-node after vectorisation
• Note that our experiments show MOST may be very performant on GPUs
[Chart: NOAA MOST average timestep (s), full node – Sandy Bridge 16 threads, KNL 64 threads, KNL vectorised 64 threads]
Unified Model

• The UM (10.5) is parallelised using both MPI and OpenMP
• Initially chose the AMIP N96 global model
  – Useful for performance evaluation, as it runs on a single node with no complex I/O traffic
  – Find the best decomposition on a single KNL node (to compare with the best decomposition on a single Sandy Bridge node)

Outcomes:
• Overcommitting the KNL proves beneficial to performance with the UM
• All 64-thread jobs are outperformed by 128- and 256-thread jobs
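The decompositions tested below (4x8, 4x16, 8x8) are simply factorisations of the MPI task count into an east-west by north-south grid. A small helper to enumerate the candidates for a given task count (an illustration, not a UM utility):

```python
def decompositions(ntasks):
    """All (ew, ns) factor pairs of an MPI task count."""
    return [(ew, ntasks // ew) for ew in range(1, ntasks + 1)
            if ntasks % ew == 0]
```

Each task count then combines with an OpenMP thread count per task to give the 64-, 128- and 256-thread node totals compared in the chart.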
[Chart: N96 AMIP runtime (s) for 4x8, 4x16 and 8x8 decompositions with 64, 128 and 256 threads]
Software stacks for the UM

• Intel MPI consistently outperforms OpenMPI for the UM on KNL
• But Intel MPI lacks some of the fine-grained control we need:
  • The ability to specify individual cores in a rankfile
  • Seemingly unable to bind to 'none' – important for explicit binding with numactl
  • Can't report binding with the same detail as OpenMPI
• We used versions 15 or 16 of the Intel Fortran/C/C++ compilers
  – '-x' compiler options to enable or disable AVX-512, in order to test the effectiveness of the longer vector registers and any issues
  • LANDSAT processing slows with AVX-512 enabled
  • Some instability in the UM when AVX-512 is enabled
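The per-core control noted above can be scripted: Open MPI accepts a rankfile whose `rank N=host slot=a-b` lines pin each rank to an explicit core range. A sketch of generating one (the hostname is hypothetical):

```python
def make_rankfile(host, nranks, cores_per_rank):
    """Emit Open MPI rankfile lines pinning each rank to a
    contiguous block of cores on the given host."""
    lines = []
    for r in range(nranks):
        lo = r * cores_per_rank
        hi = lo + cores_per_rank - 1
        lines.append(f"rank {r}={host} slot={lo}-{hi}")
    return "\n".join(lines)
```

The resulting file is passed to `mpirun` via `--rankfile`; this is the kind of explicit placement that matters on a 64-core/256-thread KNL socket.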
[Chart: N96 AMIP runtime (s) at 128 threads – OpenMPI vs Intel MPI for 4x8, 4x16 and 8x8 decompositions]
UM decomposition comparison

• Best-performing decompositions:
  • KNL: 4x16, with 2 OpenMP threads per MPI task
  • SnB: 2x8
• About 20% faster than the best decomposition on Sandy Bridge
  • Despite the model input I/O stage taking 5x longer on KNL
• The larger MPI decomposition limits multi-node scalability for the UM on KNL
  • Hybrid parallelism can help here
  • More threads per MPI task means smaller decompositions
  • Many threading improvements to come in UM 10.6+
[Chart: N96 AMIP runtime comparison (s) – KNL 4x16 with 2 OMP threads vs Sandy Bridge 2x8]
UM N512 global NWP on KNL vs SnB

• Uses the same decomposition: 16x64, 2 threads per MPI task, a total of 2048 threads
• But 16 KNL nodes vs 64 SnB nodes means the model uses 33% fewer node-hours on KNL
• MPI:
  • Task layout is important on KNL
  • The N512 job uses the UM's I/O server feature, where all I/O tasks can run on a separate node
  • When the I/O tasks are interleaved with the model, runtime increases -> need to separate I/O
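The node-hour claim above is simple arithmetic: 16 KNL nodes at a longer elapsed time against 64 SnB nodes at the baseline time. The ~2.7x elapsed-time ratio is read off the relative-runtime chart, so treat it as approximate.

```python
def node_hours_ratio(nodes_a, time_a, nodes_b, time_b):
    """Ratio of node-hours consumed by configuration A vs B."""
    return (nodes_a * time_a) / (nodes_b * time_b)

# 16 KNL nodes, ~2.7x elapsed time, vs 64 SnB nodes at baseline:
# 16 * 2.7 / (64 * 1.0) ~= 0.67, i.e. roughly 33% fewer node-hours.
```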
[Charts: relative runtime of N512 global NWP – Sandy Bridge vs KNL, and with KNL interleaved I/O]
UM N96

• KNL has a two-stage memory hierarchy
  • 16 GB MCDRAM on-package (Cache or Flat mode)
  • 192 GB DDR4
• All our UM tests shown so far have been in 'cache' mode
• N96 AMIP global occupies just over 16 GB RSS when run in a 4x16 decomposition
• Can additional performance be extracted in 'flat' mode?
[Chart: relative runtime of N96 AMIP with different memory binding settings – default bind (DDR4), MCDRAM bind, cache mode]
• No MPI distribution performs the binding correctly, so we launch MPI processes using numactl
• Both default binding (DDR4) and MCDRAM binding are slower than cache mode
• The loss of performance when run on DDR4 implies that the UM is still memory-bound, even on the slower KNL cores
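A consistency check on the memory-bound conclusion: for a purely bandwidth-limited kernel, runtime scales with 1/bandwidth. Using the node figures quoted earlier (MCDRAM 380+ GB/s, DDR4 115.2 GB/s), the available headroom is large enough to explain the DDR4 slowdown.

```python
def bandwidth_bound_speedup(bw_fast_gbs, bw_slow_gbs):
    """Best-case speedup of a bandwidth-limited loop when moved
    from the slow to the fast memory tier."""
    return bw_fast_gbs / bw_slow_gbs

# MCDRAM vs DDR4 on the KNL nodes described earlier: ~3.3x headroom,
# which is why falling back to DDR4 binding costs so much.
```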
Profiling with Score-P

• Score-P is an instrumenting profiler
• Issue: instrumenting on KNL is very costly
  • Entering and exiting instrumented areas seems to cost a fixed number of cycles
  • Cycles take much longer on a KNL
• Compare with limited instrumentation of key 'control' subroutines
  • Allows identification of key code areas (e.g. convection, radiation, etc.), but nothing within those areas
• Partial instrumentation is better, but:
  • If an OpenMP parallel section is not instrumented, time spent in threads other than the main thread is lost
  • Can't analyse thread efficiency this way
[Chart: runtime (s) of the N96 AMIP job – no profiling, Score-P enabled, Score-P partial instrumentation]
Profiling through sampling – experiences so far

• Sampling profilers can be used instead:
  – OpenSpeedShop
  – HPCToolkit
  – Intel VTune (can't be used with OpenMPI)

OpenSpeedShop
• Profiling the UM with OpenSpeedShop produces negligible overhead
• Potential issue with the sampling rate, but in practice good agreement

Intel VTune
• Around 10% overhead in MOST with Intel VTune
• Some features are not available on KNL (e.g. Advanced Hotspots)
Summary of Knights Landing experiences so far

• KNLs look like a promising technology and are worth more investigation
  – Well-vectorised workloads are essential to performance on KNL
    • Unvectorised workloads see KNL outperformed node-for-node by Sandy Bridge
    • Well-vectorised workloads run significantly faster
  – Nodes are more energy efficient
  – Code changes are more generally useful, so not specifically targeted at KNL
  – Hybrid parallelism and reducing MPI task management is needed for large-scale jobs
• Data-intensive I/O needs more attention for performance – especially parallel I/O
  – Parallel I/O available through NetCDF and HDF5
• Profiling applications is still difficult
  – Instrumented profilers can't be used until the overhead can be reduced
  – Sampling profilers may be missing events
  – Some missing functionality
• Helpful for understanding more details of the behaviour of codes
• How does it compare to GPU and other emerging chip technologies?