nci.org.au | @NCInews

Scaling Weather, Climate and Environmental Science Applications, and experiences with Intel Knights Landing

Ben Evans, Dale Roberts

17th Workshop on High Performance Computing in Meteorology
NCI program to accelerate HPC Scaling and Optimisation
© National Computational Infrastructure 2016
• Modelling Extreme & High Impact events – BoM
• NWP, Climate Coupled Systems & Data Assimilation – BoM, CSIRO, Research Collaboration
• Hazards – Geoscience Australia, BoM, States
• Geophysics, Seismic – Geoscience Australia, Universities
• Monitoring the Environment & Ocean – ANU, BoM, CSIRO, GA, Research, Fed/State
• International research – International agencies and Collaborative Programs
Tropical Cyclones
Cyclone Winston, 20-21 Feb 2016

Volcanic Ash
Manam Eruption, 31 July 2015

Bush Fires
Wye Valley and Lorne Fires, 25-31 Dec 2015

Flooding
St George, QLD, February 2011
Ben Evans, ECMWF, Oct 2016
ACCESS Model: NWP, Climate, Ocean, ESM, Seasonal and Multi-year
[Diagram: ACCESS coupled system – Coupler linking Atmosphere, Ocean and sea-ice, Atmospheric chemistry, Carbon, and Terrestrial components]
Core Model
• Atmosphere – UM 10.5+
• Ocean – MOM 5.1 (for most models); NEMO 3.6 (for GC3 seasonal-only)
• Sea-Ice – CICE5
• Coupler – OASIS-MCT

Carbon cycle (ACCESS-ESM)
• Terrestrial – CABLE
• Bio-geochemical
• Coupled to modified ACCESS1.3

Aerosols and Chemistry
• UKCA

Wave
• WW3
Additional priority codes – Australian storm surge model using ROMS
NCI Research Data Collections: model, data processing and analysis

Data Collections | Approx. Capacity
• CMIP5, CORDEX, ACCESS Models: 5 PB
• Satellite Earth Obs (LANDSAT, Himawari-8, Sentinels, plus MODIS, InSAR, …): 2 PB
• Digital Elevation, Bathymetry, Onshore/Offshore Geophysics: 1 PB
• Seasonal Climate: 700 TB
• Bureau of Meteorology Observations: 350 TB
• Bureau of Meteorology Ocean-Marine: 350 TB
• Terrestrial Ecosystem: 290 TB
• Reanalysis products: 100 TB

1. Climate/ESS Model Assets and Data Products
2. Earth and Marine Observations and Data Products
3. Geoscience Collections
4. Terrestrial Ecosystems Collections
5. Water Management and Hydrology Collections

http://geonetwork.nci.org.au
Himawari-8 Observations, Data Assimilation and Analysis

Captured at JMA, processed after acquisition at BoM, and made available at NCI.
Data products are still to be generated; the first stage was to make the image data available.
10-minute capture-and-process cycle; the data then also needs to be made available for broad analysis.
Earth Observation Time Series Analysis

• Over 300,000 Landsat scenes (spatial/temporal), allowing flexible, efficient, large-scale in-situ analysis
• Spatially-regular, time-stamped, band-aggregated tiles presented as temporal stacks

[Figure: spatially partitioned tiles → temporal analysis]
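The temporal-stack idea above can be sketched in a few lines of numpy. This is an illustration only, not the AGDC API: same-extent, time-stamped tiles become one (time, y, x) array, so any per-pixel temporal statistic is a single reduction along the leading axis.

```python
import numpy as np

def temporal_stack(tiles):
    """Stack equally-shaped 2D tiles along a new leading time axis."""
    return np.stack(tiles, axis=0)  # shape: (time, y, x)

def per_pixel_median(stack):
    """A simple temporal statistic: per-pixel median over all epochs."""
    return np.median(stack, axis=0)
```

The same layout supports any axis-0 reduction (min, max, percentile, change counts), which is what makes the tiled presentation efficient for continental-scale analysis.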
Continental-Scale Water Observations from Space

WOfS water detection
• 27 years of data from LS5 & LS7 (1987-2014)
• 25 m nominal pixel resolution
• Approx. 300,000 individual source ARG-25 scenes in approx. 20,000 passes
• Entire 27 years of 1,312,087 ARG-25 tiles => 93×10^12 pixels visited
• 0.75 PB of data
• 3 hrs at NCI (elapsed time) to compute
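A back-of-envelope check of the figures above: 93×10^12 pixels visited in roughly 3 elapsed hours implies a sustained machine-wide rate of about 8.6 billion pixels per second.

```python
def pixels_per_second(total_pixels, elapsed_hours):
    """Sustained pixel-visit rate for a whole-of-archive run."""
    return total_pixels / (elapsed_hours * 3600.0)

# WOfS run quoted above: 93e12 pixels in ~3 hours
rate = pixels_per_second(93e12, 3)  # ~8.6e9 pixels/s
```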
Enable global and continental scale … and scale down to local/catchment/plot

• Water availability and usage over time
• Catchment zone
• Vegetation changes
• Data fusion with point-clouds and local or other measurements
• Statistical techniques on key variables

Preparing for:
• Better programmatic access
• Machine/Deep Learning
• Better integration through Semantic/Linked data technologies
Emerging Petascale Geophysics codes

- Assess priority Geophysics areas
  - 3D/4D Geophysics: Magnetotellurics, AEM
  - Hydrology, Groundwater, Carbon Sequestration
  - Forward and Inverse Seismic models and analysis (onshore and offshore)
  - Natural Hazard and Risk models: Tsunami, Ash-cloud
- Issues
  - Data across domains, data resolution (points, lines, grids), data coverage
  - Model maturity for running at scale
  - Ensemble, Uncertainty analysis and Inferencing
NCI High Performance Optimisation and Scaling activities, 2014-17

• Objectives:
  • Upscale and increase performance of high-priority national codes – particularly Weather and Climate
• Year 1
  • Characterise, optimise and tune critical applications for higher resolution
  • Best-practice configuration for improved throughput
  • Establish analysis toolsets and methodology
• Year 2
  • Characterise, optimise and tune next-generation high-priority applications
  • Select high-priority geophysics codes and exemplar HPC codes for scalability
  • Parallel algorithm review and I/O optimisation methods to enable better scaling
  • Established TIHP Optimisation work package for UM codes (Selwood, Evans)
• Year 3
  • Assess a broader set of community codes for scalability
  • Updated hardware (many-core), memory/data latency/bandwidths, energy efficiency
  • Communication libraries, maths libraries
Common methodology and approach for analysis

• Analyse code to establish strengths and weaknesses
  • Full code analysis including hotspots and algorithm choices
  • Expose the model to more extreme scaling – e.g., realistic higher resolution
  • Analyse and compare different software stacks
  • Decomposition strategies for nodes, and node tuning
  • Parallel algorithms and MPI communication patterns, e.g. halo analysis, grid exchanges
  • I/O techniques: evaluate serial and parallel techniques
  • Future hardware technologies
Scaling/Optimisation: BoM Research-to-Operations high-profile cases

Domain: Atmosphere
- Yr1 (2014/5): APS1 (UM 8.2-4)
  • Global N320 L70 (40 km) and pre-APS2 N512 L70 (25 km)
  • Regional N768 L70 (~17 km)
  • City 4.5 km
- Yr2 (2015/6): UM 10.x (PS36)
  • Global N768 L70 (~17 km)
  • Regional, City
- Yr3 (2016/7): APS3 prep
  • UM 10.x latest
  • ACCESS-G (Global) N1024 L70/L85 (12 km) or N1280 L70/L85 (10 km)
  • ACCESS-GE Global Ensemble (N216 L70) (~60 km)
  • ACCESS-TC 4 km
  • ACCESS-R (Regional) 12 km
  • ACCESS-C (City) 1.5 km
NCI-Fujitsu Scaling/Optimisation: BoM Research-to-Operations high-profile cases

Domain: Data assimilation
- Yr1 (2014/5): 4D-VAR v30 • N216 L70, N320 L70
- Yr2 (2015/6): 4D-VAR latest for Global at N320 L70 • enKF-C

Domain: Ocean
- Yr1 (2014/5): MOM 5.1 • OFAM3 • 0.25°, L50
- Yr2 (2015/6): MOM 5.1 • 0.1°, L50 • OceanMAPS 3.1 (MOM5) with enKF-C
- Yr3 (2016/7): MOM5/6 0.1° and 0.03° • ROMS (Regional) • Storm Surge (2D) • eReefs (3D)

Domain: Wave
- WaveWatch3 v4.18 (v5.08 beta) • AusWave-G 0.4° • AusWave-R 0.1°
NCI-Fujitsu Scaling/Optimisation: BoM Research-to-Operations high-profile cases (cont.)

Domain: Coupled Systems
- Yr1 (2014/5): Climate: ACCESS-CM
  • GA6 (UM 8.5) + MOM5 with OASIS-MCT
  • Global N96 L38 (135 km), 1° and 0.25° ocean
- Yr2 (2015/6): Climate: ACCESS-CM cont.
- Yr3 (2016/7): Climate: ACCESS-CM2 (AOGCM)
  • GA7 (UM 10.4+ and GLOMAP aerosol)
  • Global N96 L85, N216 L85 (~60 km)
  • MOM 5.1 0.25°
  • CABLE2
  • UKCA aerosols
  Earth System: ACCESS-ESM2
  • ACCESS-CM2 + Terrestrial biochemistry – CASA-CNP
  • Oceanic biogeochemistry – WOMBAT
  • Atmospheric chemistry – UKCA

Domain: Seasonal Climate
- Yr1 (2014/5): ACCESS-S1 – UK GC2 with OASIS3 • N216 L85 (~60 km) • NEMO 0.25°
- Yr2 (2015/6): GC2 – NCI profiling methodology applied for MPMD
- Yr3 (2016/7): Multi-week and Seasonal Climate: ACCESS-S2 / UK GC3 • Atmos N216 L85 (60 km) and NEMO 3.6 0.25° L75
HPC General Scaling and Optimisation Approach

Domain: Profiling Methodology
- Yr2 (2015/6): Create methodology for profiling codes
- Yr3 (2016/7): Updates to methodology based on application across more codes

Domain: I/O profiling
- Yr2: • Baseline profiling for comparison of NetCDF3, NetCDF4, HDF5 and GeoTIFF, and API options (e.g. GDAL) • Profiling comparison of I/O performance of Lustre, NFS • Compare MPI-IO vs POSIX vs HDF5 on Lustre
- Yr3: • Advanced profiling of HDF5 and NetCDF4 for compression algorithms, multithreading, cache management • Profiling analysis of other data formats, e.g. GRIB, astronomy FITS, SEG-Y, BAM

Domain: Accelerator Technology Investigation
- Yr3: • Intel Phi (Knights Landing) • AMD GPU

Domain: Profiling tools suite
- Yr2: Review major open-source profiling tools
- Yr3: Investigation of profilers for accelerators
HPC General Scaling and Optimisation Approach (cont.)

Domain: Compute Node Performance Analysis
- Yr2: • Partially committing nodes • Hardware hyper-threading • Memory bandwidth • Interconnect bandwidth
- Yr3: • Evaluating energy efficiency vs performance of next-generation processors • Broadwell improvements • Memory speed • Vectorisation • OpenMP coverage

Domain: Software Stacks
- Yr2: • OpenMPI vs Intel MPI analysis • Intel compiler versions
- Yr3: • OpenMPI vs Intel MPI analysis • Intel compiler versions • Maths libraries

Domain: Analysis of other Earth Systems & Geophysics priority codes and algorithms
- Yr2: • Initial analysis of MPI communications • Commence analysis of high-priority/high-profile HPC codes
- Yr3: • Detailed analysis of MPI-communication-dependent algorithms • Survey of codes and algorithms used
NCI contributions to UM collaboration so far

• UM 10.4+ I/O Server now using MPI-IO
  • Immediately valuable for NWP (UK Met, Aus, …)
  • Critical for next-generation processors (i.e., KNL)
• UM 10.5+ OpenMP coverage
  • Increased performance
  • Critical for both current and next architectures, especially with increasing memory bandwidth issues
Knights Landing nodes at NCI

KNLs
• 32 Intel Xeon Phi 7230 processors
  – 64 cores / 256 threads per socket, 1 socket per node
  – 16 GB MCDRAM on-package (380+ GB/s bandwidth)
  – 192 GB DDR4-2400 MHz (115.2 GB/s)
• EDR InfiniBand interconnect between KNLs (100 Gb/s)
• FDR InfiniBand links to main Lustre storage (56 Gb/s)

Kepler K80 GPUs also available.

Raijin is a Fujitsu Primergy cluster:
• 57,472 cores (Intel Xeon Sandy Bridge technology, 2.6 GHz) in 3592 compute nodes
• InfiniBand FDR interconnect
• 10 PB of Lustre for short-term scratch space
• 30 PB for data collections storage
Basic KNL first impressions

• KNL pros
  – Full x86 compatibility – applications 'just work' without need for major code changes
  – AVX 512-bit instruction set gives a performance boost for well-vectorised applications
  – Potential to process 2 vector operations per cycle
• KNL cons
  – Cores are significantly slower than typical Xeon processors
    • 1.3 GHz KNL vs. 2.5+ GHz for typical Haswell/Broadwell Xeons
    • Simpler architecture means fewer instructions processed per cycle
  – Profiling is difficult, and the hardware does not fully expose what is needed

• Need to understand much more about our applications and their multi-phasic nature
• Deep work on I/O, memory pressure, and interprocessor comms
• Relearn how to project the value of the processors
• Use this experience to look at other emerging technologies in parallel
Experimenting with KNL characteristics

• Australian Geoscience Data Cube LANDSAT processing pipeline
  – Process a series of observations from the LANDSAT 8 satellite
• NOAA Method of Splitting Tsunami (MOST) model
  – Wave propagation due to a magnitude 7.5 earthquake in the Sunda subduction zone
• UKMO Unified Model v10.5
  – N96 AMIP global model
  – N512 NWP global model
• These were not chosen as the best codes for KNL, but as ones that were both important and that we could "quickly" explore.
Landsat NBAR data processing pipeline: KNL vs. Sandy Bridge, single thread

• Same executable, run on both architectures (i.e. no AVX-512 instructions)
• Separately recompiled with -xMIC-AVX512
[Chart: relative runtime of Landsat processing pipeline tasks (ModTrans, CalcGrids, TCBand, Packager, InterpBand, Extract) on Sandy Bridge, KNL, and KNL w/ AVX-512]

• Most tasks took longer to complete on KNL
  • LANDSAT pipeline tasks are mostly point-wise kernels or I/O-bound
  • Little opportunity for the compiler to vectorise
  • AVX operations run at a lower clock speed on the KNL
• The 'ModTrans' and 'TCBand' tasks are exceptions
  • ModTrans was relatively well vectorised
  • TCBand (Terrain Correction) was converted from point-wise kernels to vector-wise kernels
  • Noted that they are faster than SnB (normalised for clock speed)
NOAA MOST Tsunami code – single-threaded performance

Time spent on vectorisation is the important first step to CPU performance on KNL.
[Charts: MOST average timestep on Sandy Bridge, KNL original, and KNL vectorised – raw time (s), and time scaled by CPU clock speed]
• While the original MOST code is not vectorised, it does run on KNL
• Replaced key routines with vectorised versions
• Compared both raw performance and performance normalised by clock speed
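The point-wise to vector-wise rewrite described above can be illustrated in numpy (a hypothetical stand-in for the MOST Fortran routines, not the actual code): the arithmetic is unchanged, but expressing it over whole array slices lets it execute as vector operations.

```python
import numpy as np

def update_pointwise(h, u, dt, dx):
    """Point-wise kernel: an explicit loop over interior points."""
    out = h.copy()
    for i in range(1, len(h) - 1):
        out[i] = h[i] - dt / dx * 0.5 * (u[i + 1] - u[i - 1])
    return out

def update_vectorised(h, u, dt, dx):
    """Same update written over array slices, amenable to AVX-512."""
    out = h.copy()
    out[1:-1] = h[1:-1] - dt / dx * 0.5 * (u[2:] - u[:-2])
    return out
```

Both produce identical results; only the form changes, which is also why such rewrites remain useful on non-KNL hardware.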
LANDSAT processing pipeline – comparing node-for-node performance

• Parallelism in AGDC LANDSAT is obtained through the 'Luigi' Python scheduler
  – Task dependencies are tracked within scenes; embarrassingly parallel across scenes
  – For 20 scenes, 2620 tasks in total
• An 'ideal' combination of tasks was built (with and without AVX-512 instructions)
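The per-scene parallelism described above can be sketched with a plain worker pool; this is a minimal illustration (Luigi adds dependency tracking on top of this), and `process_scene` is a hypothetical stand-in for the real per-scene tasks.

```python
from concurrent.futures import ThreadPoolExecutor

def process_scene(scene_id):
    # Stand-in for the real per-scene pipeline (ModTrans, TCBand, ...).
    return scene_id, f"scene-{scene_id}-done"

def run_scenes(scene_ids, workers=4):
    """Scenes are independent, so a pool of workers maps over them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process_scene, scene_ids))
```

Varying `workers` is the software analogue of the 16/64/128-worker comparison on the chart below: more workers only help while there are enough independent scenes to keep them busy.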
[Chart: AGDC LANDSAT processing of 20 scenes – KNL 128 workers: 45:39; KNL 64 workers: 52:29; Sandy Bridge 16 workers: 43:43 (minutes)]
• Knights Landing is slower than Sandy Bridge in this case
  – Node-for-node performance is competitive
  – Vectorisation can yet improve – noted that 128 tasks outperform 64 tasks by over 20%
NOAA MOST – OpenMP performance on KNL

• Parallelism in NOAA MOST is obtained through OpenMP
• Good scaling on top of already good single-threaded performance
  – Over 90% efficiency going from 1 thread to fully occupying a node
• Does not benefit from oversubscription
  – Likely due to the subdomains becoming quite small at high thread counts
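A rough model of why oversubscription stops paying off: with a fixed grid, more threads means smaller subdomains, so a fixed-width halo becomes a larger share of each thread's work. The numbers here are illustrative, not taken from MOST.

```python
def interior_fraction(nrows, threads, halo=1):
    """Interior share of each thread's strip in a 1D decomposition
    of an nrows-deep domain with fixed-width halo rows."""
    rows = nrows / threads          # rows owned by each thread
    return rows / (rows + 2 * halo) # interior vs. interior + halos
```

For a 256-row domain, 16 threads leave each thread ~89% interior work, while 128 threads leave only 50%: half of every strip is halo overhead, so extra threads no longer translate into speedup.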
[Chart: NOAA MOST scaling factor vs thread count (1 to 128)]
NOAA MOST: KNL vs Sandy Bridge node-for-node

• 3x faster node-for-node after vectorisation
• Note that our experiments show MOST may be very performant on GPUs
[Chart: NOAA MOST average timestep (s), full node – Sandy Bridge 16 threads, KNL 64 threads, KNL vectorised 64 threads]
Unified Model

• The UM (10.5) is parallelised using both MPI and OpenMP
• Initially chose the AMIP N96 global model
  – Useful for performance evaluation, as it runs on a single node with no complex I/O traffic
  – Find the best decomposition on a single KNL node (to compare with the best decomposition on a single Sandy Bridge node)

Outcomes:
• Overcommitting the KNL proves beneficial to performance with the UM
• All 64-thread jobs are outperformed by 128- and 256-thread jobs
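The decompositions tested below (4x8, 4x16, 8x8) are simply factorisations of the MPI task count into an east-west by north-south grid. A small helper to enumerate the candidates for a given task count (an illustration, not a UM utility):

```python
def decompositions(ntasks):
    """All (ew, ns) factor pairs of an MPI task count."""
    return [(ew, ntasks // ew) for ew in range(1, ntasks + 1)
            if ntasks % ew == 0]
```

Each task count then combines with an OpenMP thread count per task to give the 64-, 128- and 256-thread node totals compared in the chart.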
[Chart: N96 AMIP runtime (s) for 4x8, 4x16 and 8x8 decompositions with 64, 128 and 256 threads]
Software stacks for the UM

• Intel MPI consistently outperforms OpenMPI for the UM on KNL
• But Intel MPI lacks some of the fine-grained control we need:
  • The ability to specify individual cores in a rankfile
  • Seemingly unable to bind to 'none' – important for explicit binding with numactl
  • Can't report binding with the same detail as OpenMPI
• We used versions 15 or 16 of the Intel Fortran/C/C++ compilers
  – '-x' compiler options to enable or disable AVX-512, in order to test the effectiveness of the longer vector registers and any issues
  • LANDSAT processing slows with AVX-512 enabled
  • Some instability in the UM when AVX-512 is enabled
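The per-core control noted above can be scripted: Open MPI accepts a rankfile whose `rank N=host slot=a-b` lines pin each rank to an explicit core range. A sketch of generating one (the hostname is hypothetical):

```python
def make_rankfile(host, nranks, cores_per_rank):
    """Emit Open MPI rankfile lines pinning each rank to a
    contiguous block of cores on the given host."""
    lines = []
    for r in range(nranks):
        lo = r * cores_per_rank
        hi = lo + cores_per_rank - 1
        lines.append(f"rank {r}={host} slot={lo}-{hi}")
    return "\n".join(lines)
```

The resulting file is passed to `mpirun` via `--rankfile`; this is the kind of explicit placement that matters on a 64-core/256-thread KNL socket.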
[Chart: N96 AMIP runtime (s) at 128 threads – OpenMPI vs Intel MPI for 4x8, 4x16 and 8x8 decompositions]
UM decomposition comparison

• Best-performing decompositions:
  • KNL: 4x16, with 2 OpenMP threads per MPI task
  • SnB: 2x8
• About 20% faster than the best decomposition on Sandy Bridge
  • Despite the model input I/O stage taking 5x longer on KNL
• The larger MPI decomposition limits multi-node scalability for the UM on KNL
  • Hybrid parallelism can help here
  • More threads per MPI task means smaller decompositions
  • Many threading improvements to come in UM 10.6+
[Chart: N96 AMIP runtime comparison (s) – KNL 4x16 with 2 OMP threads vs Sandy Bridge 2x8]
UM N512 global NWP on KNL vs SnB

• Uses the same decomposition: 16x64, 2 threads per MPI task, a total of 2048 threads
• But 16 KNL nodes vs 64 SnB nodes means the model uses 33% fewer node-hours on KNL
• MPI:
  • Task layout is important on KNL
  • The N512 job uses the UM's I/O server feature, where all I/O tasks can run on a separate node
  • When the I/O tasks are interleaved with the model, runtime increases -> need to separate I/O
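The node-hour claim above is simple arithmetic: 16 KNL nodes at a longer elapsed time against 64 SnB nodes at the baseline time. The ~2.7x elapsed-time ratio is read off the relative-runtime chart, so treat it as approximate.

```python
def node_hours_ratio(nodes_a, time_a, nodes_b, time_b):
    """Ratio of node-hours consumed by configuration A vs B."""
    return (nodes_a * time_a) / (nodes_b * time_b)

# 16 KNL nodes, ~2.7x elapsed time, vs 64 SnB nodes at baseline:
# 16 * 2.7 / (64 * 1.0) ~= 0.67, i.e. roughly 33% fewer node-hours.
```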
[Charts: relative runtime of N512 global NWP – Sandy Bridge vs KNL, and with KNL interleaved I/O]
UM N96

• KNL has a two-stage memory hierarchy
  • 16 GB MCDRAM on-package (Cache or Flat mode)
  • 192 GB DDR4
• All our UM tests shown so far have been in 'cache' mode
• N96 AMIP global occupies just over 16 GB RSS when run in a 4x16 decomposition
• Can additional performance be extracted in 'flat' mode?
[Chart: relative runtime of N96 AMIP with different memory binding settings – default bind (DDR4), MCDRAM bind, cache mode]
• No MPI distribution performs the binding correctly, so we launch MPI processes using numactl
• Both default binding (DDR4) and MCDRAM binding are slower than cache mode
• The loss of performance when run on DDR4 implies that the UM is still memory-bound, even on the slower KNL cores
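A consistency check on the memory-bound conclusion: for a purely bandwidth-limited kernel, runtime scales with 1/bandwidth. Using the node figures quoted earlier (MCDRAM 380+ GB/s, DDR4 115.2 GB/s), the available headroom is large enough to explain the DDR4 slowdown.

```python
def bandwidth_bound_speedup(bw_fast_gbs, bw_slow_gbs):
    """Best-case speedup of a bandwidth-limited loop when moved
    from the slow to the fast memory tier."""
    return bw_fast_gbs / bw_slow_gbs

# MCDRAM vs DDR4 on the KNL nodes described earlier: ~3.3x headroom,
# which is why falling back to DDR4 binding costs so much.
```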
Profiling with Score-P

• Score-P is an instrumenting profiler
• Issue: instrumenting on KNL is very costly
  • Entering and exiting instrumented areas seems to cost a fixed number of cycles
  • Cycles take much longer on a KNL
• Compare with limited instrumentation of key 'control' subroutines
  • Allows identification of key code areas (e.g. convection, radiation, etc.), but nothing within those areas
• Partial instrumentation is better, but:
  • If an OpenMP parallel section is not instrumented, time spent in threads other than the main thread is lost
  • Can't analyse thread efficiency this way
[Chart: runtime (s) of the N96 AMIP job – no profiling, Score-P enabled, Score-P partial instrumentation]
Profiling through sampling – experiences so far

• Sampling profilers can be used instead:
  – OpenSpeedShop
  – HPCToolkit
  – Intel VTune (can't be used with OpenMPI)

OpenSpeedShop
• Profiling the UM with OpenSpeedShop produces negligible overhead
• Potential issue with the sampling rate, but in practice good agreement

Intel VTune
• Around 10% overhead in MOST with Intel VTune
• Some features are not available on KNL (e.g. Advanced Hotspots)
Summary of Knights Landing experiences so far

• KNLs look like a promising technology and are worth more investigation
  – Well-vectorised workloads are essential to performance on KNL
    • Unvectorised workloads see KNL outperformed node-for-node by Sandy Bridge
    • Well-vectorised workloads run significantly faster
  – Nodes are more energy efficient
  – Code changes are more generally useful, so not specifically targeted at KNL
  – Hybrid parallelism and reducing MPI task management is needed for large-scale jobs
• Data-intensive I/O needs more attention for performance – especially parallel I/O
  – Parallel I/O available through NetCDF and HDF5
• Profiling applications is still difficult
  – Instrumented profilers can't be used until the overhead can be reduced
  – Sampling profilers may be missing events
  – Some missing functionality
• Helpful for understanding more details of the behaviour of codes
• How does it compare to GPU and other emerging chip technologies?