Date posted: 29-Aug-2018
(Not going to talk about any of this stuff right now, but happy to in the hallway track.)
• Finished PhD at Cambridge in 2006
• Worked in industrial research (AT&T and Intel)
• Two startups (XenSource and Coho Data)
• Associate prof at UBC
• Three kids
• I went heli-skiing last Friday.
Here's what I am going to do
• Make some pretty obvious observations about technology directions.
• Draw some dodgy and highly speculative conclusions from those observations.
• Try to influence your research.
• Disclaimer: this is not a conference talk, nor is it 5 conference talks stapled together.
• Another disclaimer: I'm going to give you more problems than solutions.
Section 5: Evaluation.
• (At the end of the day, all systems papers are about performance.)
• Probably because it's one of the only things we know how to measure.
• There are two types of performance results:
1. Small improvements in a very large system.
2. Speedups that are so significant that they change functionality.
• Google and Facebook and Amazon and Microsoft are probably a lot better at solving meaningful problems with their systems than you are.
Here are the high-level trends/ideas behind this talk
1. Diminishing scarcity.
2. Practical/sensible to own your own hardware again.
3. The software we have is turning out to be a bigger, slower, more onerous burden than the hardware it runs on.
• It is a poor match for the changing performance and failure characteristics of hardware.
• It is a poor match for the operational needs of users.
Consequences of these ideas
• The goalposts are moving in terms of what we design systems for.
• Human costs associated with running our systems are a bigger expense and inconvenience, at all levels, than the piecewise performance of components.
• They are actually a barrier.
• The end of scarcity marks the beginning of a push for efficient predictability.
• This is why storage customers buy flash. It's also a hard systems problem.
This is a Google datacenter circa 2001. GFS (2003): the largest deployments had over 1,000 storage nodes, hundreds of clients, and 300TB of storage space.
http://itq.nl/intels-take-open-compute-project-rack-scale-architecture/
https://www.supermicro.com/solutions/SRSD.cfm
What is "rack scale"?
• Everything in a rack will share a high-performance bus.
• Within a rack, optical interconnects are expected to reach terabit bandwidth in the near term, with sub-microsecond latencies.
• The server as we know it will be completely disaggregated.
• CPUs, GPUs, storage, network interfaces, and volatile memory will each move to independent physical enclosures. Arbitrary composition and independent scale.
• Rack resources will be very dense.
• Like, really dense.
• As a ballpark, within a rack we are likely to see thousands of cores, tens of petabytes of persistent memory, and terabytes of RAM.
• In short, a single datacenter rack with a capital value in the low millions of dollars will be as capable as an entire first-generation (e.g. 2003-era) "warehouse" datacenter from the public cloud providers.
What's changing?
1. Storage is becoming dense.
• Problematically dense!
2. The memory hierarchy is having an identity crisis.
3. Application latency is a cruel taskmaster.
Dense Nonvolatile Capacity
• Flash vendors have finally started to relax about the durability problem.
• The jaw-dropping bit: we will see 4PB in 1u within a small number of years.
• At a price that approaches spinning disk.
• The bad news: in the immediate term, interconnection will be a problem.
• And in the longer term it may not get a whole lot better.
Trends

SSD      Cap/1u   Throughput per TB
2TB      64TB     312 MB/s/TB
8TB      256TB    78 MB/s/TB
32TB     1PB      20 MB/s/TB
128TB    4PB      5 MB/s/TB
• NVMe device: x4 PCIe. Broadwell CPU: 40 PCIe lanes.
• TOR cross-rack links are typically oversubscribed at 3 or 4:1.
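The trend in the table above is simple arithmetic: per-device bandwidth stays roughly flat while per-device capacity quadruples each generation, so bandwidth per stored terabyte collapses. A quick sketch; the ~625 MB/s per-device figure and the 32-drives-per-1u chassis are assumptions back-derived from the table, not vendor specs:

```python
# Assumed fixed per-SSD bandwidth and drive count, consistent with the table.
DEVICE_BW_MBPS = 625   # roughly constant across SSD generations (assumption)
DRIVES_PER_1U = 32     # assumed 1u chassis drive count

for ssd_tb in (2, 8, 32, 128):
    cap_tb = ssd_tb * DRIVES_PER_1U          # total capacity per 1u
    bw_per_tb = DEVICE_BW_MBPS / ssd_tb      # bandwidth per stored TB
    print(f"{ssd_tb:>4} TB SSD -> {cap_tb:>5} TB/1u, {bw_per_tb:6.1f} MB/s per TB")
```

The per-TB numbers this produces (312.5, 78.1, 19.5, 4.9) match the table's 312/78/20/5 row by row, which is the point: capacity scales, the pipe out of each device does not.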
This is very different from all the storage systems that we've built in the past.
• No seek penalty.
• Means that background I/O is actually reasonable to do.
• Migration for performance.
• Alternate representations (e.g. materialized views, intentional duplication), often for performance.
• Metadata all day long.
• Sprinkler heads are a problem.
• 4PB is an awfully scary failure domain.
• Sensible application of erasure coding needs five or more nodes.
• East/west traffic is constrained.
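To see where "five or more nodes" comes from: a k+m erasure code spreads k data fragments plus m parity fragments across distinct nodes, tolerates any m node failures, and costs (k+m)/k in raw capacity. A sketch with illustrative parameter choices (the function name and the specific k/m pairs are mine, not from the talk):

```python
def erasure_profile(k: int, m: int) -> dict:
    """Basic properties of a k-data / m-parity erasure code."""
    return {
        "nodes_required": k + m,          # fragments must land on distinct nodes
        "failures_tolerated": m,
        "capacity_overhead": (k + m) / k, # raw bytes stored per logical byte
    }

for k, m in [(2, 1), (4, 2), (10, 4)]:
    print(f"{k}+{m}:", erasure_profile(k, m))
```

Even the modest 4+2 configuration needs 6 independent failure domains, but buys 2-failure tolerance at 1.5x overhead versus 3x for triple replication; hence codes only become "sensible" once you have five-plus nodes to spread fragments across.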
• Capacity-motivated deletion is silly in most cases.
• But real deletion probably needs to be encryption-based.
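Encryption-based deletion ("crypto-shredding") works by giving every object its own key and destroying the key to delete the object: the ciphertext sitting on 4PB of flash becomes unreadable without ever being rewritten. A minimal sketch; the SHA-256 counter-mode keystream is only there to keep it stdlib-runnable, and a real system would use a proper AEAD cipher (e.g. AES-GCM):

```python
import hashlib
import os

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256 counter-mode keystream (illustrative only)."""
    out = bytearray()
    for block in range((len(data) + 31) // 32):
        pad = hashlib.sha256(key + block.to_bytes(8, "big")).digest()
        chunk = data[block * 32:(block + 1) * 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)

keys: dict = {}                      # key store: tiny, and easy to truly erase

def put(obj_id, plaintext: bytes) -> bytes:
    keys[obj_id] = os.urandom(32)    # fresh per-object key
    return keystream_xor(keys[obj_id], plaintext)   # ciphertext goes to flash

def delete(obj_id) -> None:
    del keys[obj_id]                 # ciphertext on flash is now garbage

ct = put("o1", b"sensitive record")
assert keystream_xor(keys["o1"], ct) == b"sensitive record"
delete("o1")                         # no key, no data
```

The design point: deletion touches only the small key store, not the huge, throughput-starved media, which matches the "capacity-motivated deletion is silly" observation above.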
Persistent Memory
• Everyone is excited about 3D XPoint.
• (What the heck is 3D XPoint?)
• Bad news: persistent RAM is a total PITA.
• Because it's not really persistent RAM: RAM as you think about it is a total illusion.
• It's really a super-duper fast disk.
• In fact, it's a super-duper fast *single* *unreliable* disk. But more on this in a sec.
• But wait, this doesn't mean that XPoint isn't a spectacularly good idea.
• With it, RAM is about to break through the memory wall (core-to-capacity ratio).
• Technologies like XPoint will give us a multiplier on working set.
• Persistence will massively speed up restart times, especially for read-only data.
One more spanner: Disaggregation.
• Some significant amount of memory is about to move off-host.
• Nobody seems to agree on how this is going to happen.
• "Remote" memory vs. a shared physical bus vs. rack-scale CC-NUMA.
• All of these things are interesting in two ways.
• First, failure domains are very different... in ways that apps and OSes are NOT used to reasoning about.
• Second, they afford an entirely new (and exciting!) form of dynamism.
• MapReduce and Spark have a good but very coarse-grained notion of partitioning.
• These systems have the potential to be so much more dynamic.
• Same for scale-out datastores.
• Same for state replication and HA.
So what's going to happen here?
• Total chaos.
• Persistent memory looks like a really fast disk. Disaggregated memory looks like an extension of the cache hierarchy.
• Our view of memory, locality, and persistence is in trouble.
• Interfaces and abstractions really need to change in support of this.
• One prediction: the file system and virtual memory will merge.
• Loads of reasons to do this -- serialization overheads, reboots, sharing.
• But still many open questions.
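The file-system/virtual-memory merge already has a primitive ancestor in `mmap`: map a file and mutate it as ordinary memory, with no serialization step between the in-memory and on-disk forms. A minimal sketch (file name and sizes are arbitrary; real persistent memory would additionally need careful flush ordering for crash consistency):

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "state.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)        # reserve one page of "persistent" state

with open(path, "r+b") as f:
    mem = mmap.mmap(f.fileno(), 4096)
    mem[0:5] = b"hello"            # a plain memory write...
    mem.flush()                    # ...made durable, with no write() call
    mem.close()

with open(path, "rb") as f:
    print(f.read(5))               # the bytes survive process restart
```

This illustrates the "no serialization overhead" and "fast restart" arguments: the persistent representation *is* the in-memory representation.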
Latency
• Tell me if you've heard this one before: CPUs aren't getting faster.
• I/O is getting faster and wider.
• Latency is becoming a dominant metric.
• It has a direct impact on, e.g., purchase probability.
• But it's a much harder metric to work with than throughput.
• Shrinking I/O latencies result in increased computational density.
• Because I/O wait goes away (e.g. in a DBMS).
• But a latency focus imposes a lot of constraints on software design.
• Especially tail-latency SLOs.
• You need to reason about the slow path as a common case.
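Why the slow path becomes the common case: a request fanned out to N shards finishes at the *slowest* shard, so even rare per-shard stragglers dominate at scale. A back-of-the-envelope sketch (the 1% straggler probability is an illustrative assumption):

```python
# Probability that at least one of N independent shards is slow,
# given each shard is slow with probability p_slow.
p_slow = 0.01

for n_shards in (1, 10, 100):
    p_request_slow = 1 - (1 - p_slow) ** n_shards
    print(f"{n_shards:>3} shards -> {p_request_slow:.1%} of requests hit a straggler")
```

At 100 shards, a 1-in-100 per-shard hiccup hits roughly 63% of requests, which is why tail-latency SLOs force you to engineer the slow path rather than hope it stays rare.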
[Figure: "The Cost of Contention" -- throughput (K IOPS) vs. number of cores (1-16), contention-free design vs. a single lock.]
[Figure: Decibel architecture -- per-core Decibel logic running in userspace over DPDK (TCP) and SPDK (block I/O), bypassing the kernel to reach the hardware queues directly.]
Decibel (NSDI '17)
• How should we structure a storage system to provide virtual local disks?
• Partition like crazy, crusade against latency, push all unnecessary functionality up the stack.
• This generalizes to applications.
[Figure: Decibel performance, 70/30 mixed workload -- throughput (K IOPS) and latency (μs) vs. number of cores (1-16) for Local, Decibel (DPDK), and Decibel (Legacy); 422 vs. 450 vs. 490 μs.]
Everything hurts latency
• Redundancy is a good example of why this gets hard.
• For in-memory systems, network RTT will approach media store time.
• So a remote write doubles the cost.
• Worse: replication at lower layers of the system is invariably amplified.
• This is why emerging datastores don't do it.
• A real latency focus drives software architecture in a very specific direction.
• Contention is a source of hard-to-reason-about performance variance.
• So avoid contention at all costs. Design it out up front.
• (If you do this right, you benefit from not having to hire developers that understand locking.)
• Doing this right means designing data- and code-level partitioning very carefully.
• Less academically rewarding than OCC and lock-freedom, but see the parenthetic point above.
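One way to "design contention out up front" is static ownership: shard the data so that each core or thread owns its partition outright and no lock is ever taken, instead of serializing every worker through one shared structure. A toy sketch of that layout (class and method names are illustrative, not from Decibel):

```python
from dataclasses import dataclass, field

@dataclass
class Partition:
    # Owned by exactly one core/thread: no locks, no cross-core
    # cache-line traffic, latency depends on this partition alone.
    store: dict = field(default_factory=dict)

    def put(self, k, v):
        self.store[k] = v

    def get(self, k):
        return self.store.get(k)

class PartitionedStore:
    def __init__(self, n_partitions: int):
        self.parts = [Partition() for _ in range(n_partitions)]

    def _owner(self, k) -> Partition:
        # Static ownership: a key always maps to the same partition,
        # so requests can be routed to the owning core up front.
        return self.parts[hash(k) % len(self.parts)]

    def put(self, k, v):
        self._owner(k).put(k, v)

    def get(self, k):
        return self._owner(k).get(k)

s = PartitionedStore(16)   # e.g. one partition per core
s.put("alpha", 1)
print(s.get("alpha"))
```

The price is exactly what the slide says: the data- and code-level partitioning (the `_owner` routing, rebalancing, cross-partition operations) has to be designed very carefully, because nothing saves you after the fact.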
Here are the high-level trends/ideas
1. Diminishing scarcity.
2. Practical/sensible to own your own hardware again.
3. Software needs to change.