Technology and Architectures for Future Large-Scale Computing Systems
Rick Stevens
Argonne National Laboratory
The University of Chicago
Argonne's IBM Blue Gene/P – 556 TF/s
Cray XT5 at ORNL – 1.6 Pflop/s
Jaguar                          Total      XT5        XT4
Peak Performance (TF)           1,645      1,382      263
AMD Opteron Cores               181,504    150,176    31,328
System Memory (TB)              362        300        62
Disk Bandwidth (GB/s)           284        240        44
Disk Space (TB)                 10,750     10,000     750
Interconnect Bandwidth (TB/s)   532        374        157
The systems will be combined after acceptance of the new XT5 upgrade. Each system will be linked to the file system through 4x-DDR InfiniBand.
Traditional Sources of Performance Improvement are Flat-Lining (2004)
• New Constraints – 15 years of exponential clock rate growth has ended
• Moore's Law reinterpreted:
  – How do we use all of those transistors to keep performance increasing at historical rates?
  – Industry response: the number of cores per chip doubles every 18 months instead of clock frequency!
Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith
Transistors continue to scale. Clock has leveled off (2-4 GHz). Power has leveled off (~100 W-200 W). Performance per clock: 2-4 ops/clock.
Multicore comes in a wide variety:
– Multiple parallel general-purpose processors (GPPs)
– Multiple application-specific processors (ASPs)
"The Processor is the new Transistor" [Rowen]
Intel 4004 (1971): 4-bit processor, 2,312 transistors, ~100 KIPS, 10 micron PMOS, 11 mm² chip
Sun Niagara: 8 GPP cores (32 threads)
[Block diagram: Intel IXP2800 network processor – Intel XScale core, 16 MEv2 microengines, QDR SRAM and RDRAM channels, scratch memory, hash unit, 64-bit/66 MHz PCI, and SPI4/CSIX interfaces]
Intel Network Processor: 1 GPP core, 16 ASPs (128 threads)
IBM Cell: 1 GPP (2 threads), 8 ASPs
Picochip DSP: 1 GPP core, 248 ASPs
Cisco CRS-1: 188 Tensilica GPPs
What's Next?
Source: Jack Dongarra, ISC 2008
E3 Advanced Architectures – Findings (2007)
• Exascale systems are likely feasible by 2017 ± 2
• 10-100 million processing elements (mini-cores), with chips as dense as 1,000 cores per socket; clock rates will grow slowly
• 3D chip packaging is likely
• Large-scale optics-based interconnects
• 10-100 PB of aggregate memory
• Tens of thousands of I/O channels to 10-100 exabytes of secondary storage; disk bandwidth-to-storage ratios not optimal for HPC use
• Hardware- and software (OS)-based fault management
• Simulation and multiple point designs will be required to advance our understanding of the design space
• Achievable performance per watt will likely be the primary metric of progress
Top Technical Challenges
• Power consumption – proc/mem, I/O, optical, memory, delivery
• Chip-to-chip interface scaling – pin/wire count ⇒ 3D packaging
• Package-to-package interfaces – scalable interconnects
• Cost pressure in optics and memory
• Fault tolerance – FIT rates and fault management concepts
  – e.g., in-place spares, chipkill, etc.
  – Reliability of irregular logic, design practice
Exascale systems built from today's technology would consume over a gigawatt.
Exascale systems built from today's technology would cost over a gigabuck.
Exascale systems built from today's technology would have an MTBF of 10 minutes.
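For a sense of where the 10-minute figure comes from, a back-of-the-envelope series-reliability estimate, assuming (hypothetically) a per-node MTBF of about five years and a system on the scale of the ~200,000-node strawman described later:

\[
\mathrm{MTBF}_{\mathrm{system}} \approx \frac{\mathrm{MTBF}_{\mathrm{node}}}{N}
\approx \frac{5~\mathrm{yr} \times 5.3\times 10^{5}~\mathrm{min/yr}}{2\times 10^{5}~\mathrm{nodes}}
\approx 13~\mathrm{minutes}
\]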
Looking Out to Exascale: Concurrency will be doubling every 18 months
Systems Scaling Projections
ITRS Roadmap for Logic Devices
Power density 20x-30x that of 2005. Voltage assumption is 0.5 V-0.7 V.
DARPA Exascale Study
Concluded that it will be a major challenge to get to sustained exaops performance levels by 2020
Total System Concurrency: the largest machines now have combined system-level concurrency approaching 1M.
Thread-Level Concurrency: each thread must do multiple operations per clock cycle, e.g., via the Blue Gene "double hummer" FPU, vector machines, or superscalar issue (a minimal sketch follows).
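As an illustration (a minimal sketch, not taken from the slides): a DAXPY-style loop whose iterations each contain a multiply and an add; with SIMD or fused multiply-add issue, such as the Blue Gene "double hummer" FPU, a single thread can retire the 2-4 floating-point operations per clock assumed above.

    /* y[i] = a*x[i] + y[i]: one multiply and one add per iteration.
       A vectorizing compiler (or a dual-issue/FMA FPU) turns this into
       multiple floating-point operations retired per clock cycle,
       which is what the thread-level concurrency target demands. */
    void daxpy(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }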
Parallelism and Locality Trends: algorithms are pushing in the opposite direction from the hardware; e.g., advanced algorithms are becoming less "local" while hardware is becoming more parallel and more local.
Power-Constrained Clock Rate: Clock = Power_Density / (Capacitance_per_device × Transistor_Density × Vdd²)
Power density (i.e., cooling) will limit the clock rate as feature sizes shrink.
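To make the constraint concrete, hold power density fixed in the relation above and assume (for illustration) that per-device capacitance and supply voltage do not change across a process generation:

\[
\mathrm{Clock} \propto \frac{1}{C_{\mathrm{device}} \cdot D_{\mathrm{transistor}} \cdot V_{dd}^{2}}
\quad\Rightarrow\quad
D_{\mathrm{transistor}} \to 2\,D_{\mathrm{transistor}} \;\Rightarrow\; \mathrm{Clock} \to \tfrac{1}{2}\,\mathrm{Clock}
\]

With Vdd already near its practical floor at the assumed 0.5 V-0.7 V, the clock rate is the remaining lever.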
Gflops per watt must grow 1000x (0.1 ⇒ 100) to achieve exascale.
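The 1000x figure follows directly from the power budget; restating the slide's endpoints for a 10^18 flop/s machine:

\[
\frac{10^{18}~\mathrm{flop/s}}{0.1\times 10^{9}~\mathrm{flop/s/W}} = 10~\mathrm{GW},
\qquad
\frac{10^{18}~\mathrm{flop/s}}{100\times 10^{9}~\mathrm{flop/s/W}} = 10~\mathrm{MW}
\;\;(\approx 10~\mathrm{pJ~per~flop})
\]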
Power and FPU Scaling Needed to Reach Exa-operations/s (flops only)
[Chart: picojoules per FLOP vs. FPUs required per exaflop]
Interconnect Technology Roadmap: long-distance interconnects will be optical
Optical may replace copper for on-carrier interconnects
Interconnect power will grow to 10%-30% of total power
Heat Removal Approaches: it is highly likely that all future machines at scale will be liquid cooled
High-End Packaging Options
3D Packaging Examples
Secondary Storage Projections (scratch at 25x, archive at 200x)
20,000 to >1,000,000 disk drives for secondary storage. Phase-change memory may replace disk in the 2018 timeframe.
Evolution of Storage
Aggressive Strawman
~200,000 nodes, each with ~750 cores = ~150M cores. Memory is 16 GB for a ~5 Tflop node with ~1,000 threads. Total system memory is ~4 PB.
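A quick consistency check of the strawman arithmetic:

\[
2\times 10^{5}~\mathrm{nodes} \times 750~\mathrm{cores} = 1.5\times 10^{8}~\mathrm{cores},\qquad
2\times 10^{5} \times 5~\mathrm{Tflop/s} = 1~\mathrm{Eflop/s},\qquad
2\times 10^{5} \times 16~\mathrm{GB} \approx 3~\mathrm{PB}
\]

The memory product lands near the quoted ~4 PB of total system memory.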
Systems Scaling Projections
The Bottom Line
• Levels of concurrency (10^6 ⇒ 10^9): 1000x
• Clock rate of core (1-4 GHz ⇒ 1-4 GHz): 1x
• RAM per core (1-2 GB now to <1 GB): <1x
• Total number of cores (200K ⇒ 100M): 500x
• Number of cores per node (8 ⇒ >64-512): >8x-64x
• Aggressive fault management in HW and SW
• I/O channels (>10^3 ⇒ 10^5): >100x
• Power consumption (10 MW ⇒ 40 MW-150 MW): 4x-16x
• Programming model (MPI ⇒ MPI+X)
Parallel Programming Models: Twenty Years and Counting
• In large-scale scientific computing today, essentially all codes are message-passing based. Additionally, many will use some form of multithreading on SMP or multicore nodes.
• Multicore is challenging programming models, but a dominant model to augment message passing has not yet emerged.
• There is a need to identify new hierarchical programming models that will be stable over the long term and can support the concurrency-doubling pressure (a hybrid "MPI + X" sketch follows below).
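A minimal sketch of the hybrid "MPI + X" structure referred to above, with X = OpenMP (illustrative only; the numeric kernel is a placeholder): MPI ranks span nodes while OpenMP threads use the cores within each node.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;

        /* Ask for an MPI library that tolerates threaded ranks. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        double local_sum = 0.0, global_sum = 0.0;

        /* One MPI rank per node; OpenMP threads exploit the cores within it. */
        #pragma omp parallel for reduction(+:local_sum)
        for (int i = 0; i < 1000000; i++)
            local_sum += 1.0 / (1.0 + (double)i);

        /* Message passing handles the inter-node part of the reduction. */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("ranks=%d threads/rank=%d sum=%f\n",
                   nranks, omp_get_max_threads(), global_sum);

        MPI_Finalize();
        return 0;
    }

Built with, e.g., mpicc -fopenmp. The design point is that inter-node communication stays in MPI while intra-node concurrency moves to threads, which is how the bullet's "concurrency-doubling pressure" is absorbed without growing the MPI rank count.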
Quasi-Mainstream Programming Models
• C, Fortran, C++ and MPI
• OpenMP, pthreads
• (CUDA, RapidMind, Cn) OpenCL
• PGAS (UPC, CAF, Titanium)
• HPCS languages (Chapel, Fortress, X10)
• HPC research languages and runtimes
• HLL (Parallel MATLAB, gridMathematica, etc.)
Source: Programming Models, Salishan Conference, April 2009
[Diagram: Potential Migration Paths – Programming Models. Starting points include C/C++/Fortran/Java (Base), Base and MPI, Base/OpenMP, Base/OpenMP and MPI, Base/OpenMP+ and MPI, CUDA, libspe, ALF, Base/OpenCL, Base/OpenCL and MPI, Charm++, RapidMind, GEDAE/streaming models, and PGAS/APGAS, targeting clusters with accelerators. Arrows are labeled with goals such as scale beyond one node, harness accelerators, make portable/open, reduce MPI tasks, and productivity/maintainability. Legend: green = open, widely available; blue = somewhere in between; red = proprietary (*OpenCL availability predicted).]
From Vivek Sarkar, SC08 Exascale Workshop