Technology and Architectures for Future Large-Scale Computing Systems
Rick Stevens
Argonne National Laboratory
The University of Chicago
Argonne's IBM Blue Gene/P – 556 TF/s
Cray XT5 at ORNL – 1.6 Pflop/s
Jaguar                          Total      XT5        XT4
Peak Performance (TF)           1,645      1,382      263
AMD Opteron Cores               181,504    150,176    31,328
System Memory (TB)              362        300        62
Disk Bandwidth (GB/s)           284        240        44
Disk Space (TB)                 10,750     10,000     750
Interconnect Bandwidth (TB/s)   532        374        157
The systems will be combined after acceptance of the new XT5 upgrade. Each system will be linked to the file system through 4x-DDR InfiniBand.
Traditional Sources of Performance Improvement are Flat-Lining (2004)
• New Constraints – 15 years of exponential clock rate growth has ended
• Moore's Law reinterpreted:
  – How do we use all of those transistors to keep performance increasing at historical rates?
  – Industry response: the number of cores per chip doubles every 18 months instead of clock frequency!
Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith
Transistors continue to scale. Clock has leveled off (2-4 GHz). Power has leveled off (~100 W-200 W). Performance per clock: 2-4 ops/clock.
Multicore comes in a wide variety:
– Multiple parallel general-purpose processors (GPPs)
– Multiple application-specific processors (ASPs)
"The Processor is the new Transistor" [Rowen]
Intel 4004 (1971): 4-bit processor, 2,312 transistors, ~100 KIPS, 10 micron PMOS, 11 mm² chip
Sun Niagara: 8 GPP cores (32 threads)
[Block diagram: Intel IXP2800 network processor – Intel XScale core, 16 MEv2 microengines, QDR SRAM and RDRAM channels, scratch memory, hash unit, 64-bit/66 MHz PCI, and SPI4/CSIX interfaces]
Intel Network Processor: 1 GPP core, 16 ASPs (128 threads)
IBM Cell: 1 GPP (2 threads), 8 ASPs
Picochip DSP: 1 GPP core, 248 ASPs
Cisco CRS-1: 188 Tensilica GPPs
What's Next?
Source: Jack Dongarra, ISC 2008
E3 Advanced Architectures – Findings (2007)
• Exascale systems are likely feasible by 2017 ± 2
• 10-100 million processing elements (mini-cores), with chips as dense as 1,000 cores per socket; clock rates will grow slowly
• 3D chip packaging is likely
• Large-scale optics-based interconnects
• 10-100 PB of aggregate memory
• Tens of thousands of I/O channels to 10-100 exabytes of secondary storage; disk bandwidth-to-storage ratios not optimal for HPC use
• Hardware- and software (OS)-based fault management
• Simulation and multiple point designs will be required to advance our understanding of the design space
• Achievable performance per watt will likely be the primary metric of progress
Top Technical Challenges
• Power consumption – proc/mem, I/O, optical, memory, delivery
• Chip-to-chip interface scaling – pin/wire count ⇒ 3D packaging
• Package-to-package interfaces – scalable interconnects
• Cost pressure in optics and memory
• Fault tolerance – FIT rates and fault management concepts
  – e.g., in-place spares, chipkill, etc.
  – Reliability of irregular logic, design practice
Exascale systems built from today's technology would consume over a gigawatt.
Exascale systems built from today's technology would cost over a gigabuck.
Exascale systems built from today's technology would have an MTBF of 10 minutes.
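For a sense of where the 10-minute figure comes from, a back-of-the-envelope series-reliability estimate, assuming (hypothetically) a per-node MTBF of about five years and a system on the scale of the ~200,000-node strawman described later:

\[
\mathrm{MTBF}_{\mathrm{system}} \approx \frac{\mathrm{MTBF}_{\mathrm{node}}}{N}
\approx \frac{5~\mathrm{yr} \times 5.3\times 10^{5}~\mathrm{min/yr}}{2\times 10^{5}~\mathrm{nodes}}
\approx 13~\mathrm{minutes}
\]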
Looking Out to Exascale: Concurrency will be doubling every 18 months
Systems Scaling Projections
ITRS Roadmap for Logic Devices
Power density 20x-30x that of 2005. Voltage assumption is 0.5 V-0.7 V.
DARPA Exascale Study
Concluded that it will be a major challenge to get to sustained exaops performance levels by 2020
Total System Concurrency: the largest machines now have combined system-level concurrency approaching 1M.
Thread-Level Concurrency: each thread must do multiple operations per clock cycle, e.g., via the Blue Gene "double hummer" FPU, vector machines, or superscalar issue (a minimal sketch follows).
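As an illustration (a minimal sketch, not taken from the slides): a DAXPY-style loop whose iterations each contain a multiply and an add; with SIMD or fused multiply-add issue, such as the Blue Gene "double hummer" FPU, a single thread can retire the 2-4 floating-point operations per clock assumed above.

    /* y[i] = a*x[i] + y[i]: one multiply and one add per iteration.
       A vectorizing compiler (or a dual-issue/FMA FPU) turns this into
       multiple floating-point operations retired per clock cycle,
       which is what the thread-level concurrency target demands. */
    void daxpy(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }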
Parallelism and Locality Trends: algorithms are pushing in the opposite direction from the hardware; e.g., advanced algorithms are becoming less "local" while hardware is becoming more parallel and more local.
Power-Constrained Clock Rate: Clock = Power_Density / (Capacitance_per_device × Transistor_Density × Vdd²)
Power density (i.e., cooling) will limit the clock rate as feature sizes shrink.
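To make the constraint concrete, hold power density fixed in the relation above and assume (for illustration) that per-device capacitance and supply voltage do not change across a process generation:

\[
\mathrm{Clock} \propto \frac{1}{C_{\mathrm{device}} \cdot D_{\mathrm{transistor}} \cdot V_{dd}^{2}}
\quad\Rightarrow\quad
D_{\mathrm{transistor}} \to 2\,D_{\mathrm{transistor}} \;\Rightarrow\; \mathrm{Clock} \to \tfrac{1}{2}\,\mathrm{Clock}
\]

With Vdd already near its practical floor at the assumed 0.5 V-0.7 V, the clock rate is the remaining lever.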
Gflops per watt must grow 1000x (0.1 ⇒ 100) to achieve exascale.
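The 1000x figure follows directly from the power budget; restating the slide's endpoints for a 10^18 flop/s machine:

\[
\frac{10^{18}~\mathrm{flop/s}}{0.1\times 10^{9}~\mathrm{flop/s/W}} = 10~\mathrm{GW},
\qquad
\frac{10^{18}~\mathrm{flop/s}}{100\times 10^{9}~\mathrm{flop/s/W}} = 10~\mathrm{MW}
\;\;(\approx 10~\mathrm{pJ~per~flop})
\]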
Power and FPU Scaling Needed to Reach Exa-operations/s (flops only)
[Chart: picojoules per FLOP vs. FPUs required per exaflop]
Interconnect Technology Roadmap: long-distance interconnects will be optical
Optical may replace copper for on-carrier interconnects
Interconnect power will grow to 10%-30% of total power
Heat Removal Approaches: it is highly likely that all future machines at scale will be liquid cooled
High-End Packaging Options
3D Packaging Examples
Secondary Storage Projections (scratch at 25x, archive at 200x)
20,000 to >1,000,000 disk drives for secondary storage. Phase-change memory may replace disk in the 2018 timeframe.
Evolution of Storage
Aggressive Strawman
~200,000 nodes, each with ~750 cores = ~150M cores. Memory is 16 GB for a ~5 Tflop node with ~1,000 threads. Total system memory is ~4 PB.
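A quick consistency check of the strawman arithmetic:

\[
2\times 10^{5}~\mathrm{nodes} \times 750~\mathrm{cores} = 1.5\times 10^{8}~\mathrm{cores},\qquad
2\times 10^{5} \times 5~\mathrm{Tflop/s} = 1~\mathrm{Eflop/s},\qquad
2\times 10^{5} \times 16~\mathrm{GB} \approx 3~\mathrm{PB}
\]

The memory product lands near the quoted ~4 PB of total system memory.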
Systems Scaling Projections
The Bottom Line
• Levels of concurrency (10^6 ⇒ 10^9): 1000x
• Clock rate of core (1-4 GHz ⇒ 1-4 GHz): 1x
• RAM per core (1-2 GB now to <1 GB): <1x
• Total number of cores (200K ⇒ 100M): 500x
• Number of cores per node (8 ⇒ >64-512): >8x-64x
• Aggressive fault management in HW and SW
• I/O channels (>10^3 ⇒ 10^5): >100x
• Power consumption (10 MW ⇒ 40 MW-150 MW): 4x-16x
• Programming model (MPI ⇒ MPI+X)
Parallel Programming Models: Twenty Years and Counting
• In large-scale scientific computing today, essentially all codes are message-passing based. Additionally, many will use some form of multithreading on SMP or multicore nodes.
• Multicore is challenging programming models, but a dominant model to augment message passing has not yet emerged.
• There is a need to identify new hierarchical programming models that will be stable over the long term and can support the concurrency-doubling pressure (a hybrid "MPI + X" sketch follows below).
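A minimal sketch of the hybrid "MPI + X" structure referred to above, with X = OpenMP (illustrative only; the numeric kernel is a placeholder): MPI ranks span nodes while OpenMP threads use the cores within each node.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;

        /* Ask for an MPI library that tolerates threaded ranks. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        double local_sum = 0.0, global_sum = 0.0;

        /* One MPI rank per node; OpenMP threads exploit the cores within it. */
        #pragma omp parallel for reduction(+:local_sum)
        for (int i = 0; i < 1000000; i++)
            local_sum += 1.0 / (1.0 + (double)i);

        /* Message passing handles the inter-node part of the reduction. */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("ranks=%d threads/rank=%d sum=%f\n",
                   nranks, omp_get_max_threads(), global_sum);

        MPI_Finalize();
        return 0;
    }

Built with, e.g., mpicc -fopenmp. The design point is that inter-node communication stays in MPI while intra-node concurrency moves to threads, which is how the bullet's "concurrency-doubling pressure" is absorbed without growing the MPI rank count.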
Quasi-Mainstream Programming Models
• C, Fortran, C++ and MPI
• OpenMP, pthreads
• (CUDA, RapidMind, Cn) OpenCL
• PGAS (UPC, CAF, Titanium)
• HPCS languages (Chapel, Fortress, X10)
• HPC research languages and runtimes
• HLL (Parallel MATLAB, gridMathematica, etc.)
Source: Programming Models, Salishan Conference, April 2009
[Diagram: Potential Migration Paths – Programming Models. Starting points include C/C++/Fortran/Java (Base), Base and MPI, Base/OpenMP, Base/OpenMP and MPI, Base/OpenMP+ and MPI, CUDA, libspe, ALF, Base/OpenCL, Base/OpenCL and MPI, Charm++, RapidMind, GEDAE/streaming models, and PGAS/APGAS, targeting clusters with accelerators. Arrows are labeled with goals such as scale beyond one node, harness accelerators, make portable/open, reduce MPI tasks, and productivity/maintainability. Legend: green = open, widely available; blue = somewhere in between; red = proprietary (*OpenCL availability predicted).]
From Vivek Sarkar, SC08 Exascale Workshop