Storage Locality and Counting Words

Counting the words in a long text
• 218718 words long
• 17150 different words
• How many times does each word occur?
Task: count the number of occurrences of each word in a very long text.
• Input: "Call me Ishmael. Some years ago -- never mind how long precisely -- having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail…"
• Moby Dick: about 1.3 MByte
• Desired output:
• Call: 354
• Me: 53423
• Ishmael: 1322
• …
Simple solution
• Iterate over the words. Update a counter for the current word.
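A minimal sketch of this approach (the function name and whitespace tokenization are illustrative choices, not part of the original slides):

```python
def count_words(text):
    """Count occurrences of each word using a dictionary of counters."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts

print(count_words("Call me Ishmael call"))  # {'call': 2, 'me': 1, 'ishmael': 1}
```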
Let's use a sorted list
Sort-based solution
• Sort the words
• Iterate over the sorted list
• Count occurrences of the same word
• Switch on word boundary
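The steps above can be sketched as follows (an illustrative implementation; the inner loop advances until it hits a word boundary in the sorted list):

```python
def count_words_sorted(words):
    """Count occurrences by sorting, then scanning runs of equal words."""
    words = sorted(words)
    counts = []
    i = 0
    while i < len(words):
        j = i
        while j < len(words) and words[j] == words[i]:
            j += 1                      # advance until the word boundary
        counts.append((words[i], j - i))  # one entry per run of equal words
        i = j
    return counts

print(count_words_sorted(["the", "cat", "the"]))  # [('cat', 1), ('the', 2)]
```

Because equal words are adjacent after sorting, each word's counter is touched in one contiguous burst, which is exactly the locality benefit discussed below.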
Summary
• Sorting improves memory locality for word counting
• Improved memory locality reduces run-time
• Why? Because computer memory is organized in a hierarchy.
Storage Latency: Small and Fast vs. Large and Slow

[Figure: the CPU computes C = A * B; a timeline shows Latency 1 through Latency 4, one per step below, which add up to the total latency.]

Latencies
1. Read A
2. Read B
3. C = A * B
4. Write C
• With big data, most of the latency is memory latency (steps 1, 2, 4), not computation (step 3).
• The memory holding A, B and C may be:
• Main Memory (RAM)
• Spinning disk
• Remote computer
Storage Types

Non-local storage access
[Figure: the CPU computes C = A * B, but A, B and C sit at scattered memory addresses, so every access pays the full latency.]

Local storage access
[Figure: the arrays X and Y each occupy a contiguous run of addresses, which the following loops scan sequentially.]

A = 0
for i in range(100000):
    A += X[i]
for i in range(100000):
    A -= Y[i]
Summary
• The major source of latency in data analysis is reading and writing to storage.
• Different types of storage offer different latency, capacity and price.
• Big data analytics revolves around methods for organizing storage and computation in ways that maximize speed while minimizing cost.
• Next: Caches and the Memory Hierarchy.
Caches and the Memory Hierarchy

Latency, size and price of computer memory
• Given a budget, we need to trade off:
• $10: Fast & Small vs. $10: Slow & Large
Cache: The basic idea

[Figure: main memory (slow & large) holds addresses 0-79; the cache (fast & small), next to the CPU, holds a handful of them, e.g. 12, 67, 50-53, 32-33.]
Cache Hit

[Figure: the CPU requests an address that is already in the cache, so the request is served without accessing main memory.]
Cache Miss

[Figure: the CPU requests an address that is not in the cache, forcing an access to main memory.]
Cache Miss Service: 1) Choose a byte to drop

[Figure: the cache selects a victim entry (67) to make room.]
Cache Miss Service: 2) Write back

[Figure: the victim's value (67) is written back to its location in main memory.]
Cache Miss Service: 3) Read in

[Figure: the requested byte (47) is read from main memory into the freed cache slot.]
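The three service steps can be sketched with a toy byte-level cache (class name, FIFO eviction, and sizes are illustrative choices, not how real hardware is built):

```python
class ToyCache:
    """A toy fully-associative cache of single bytes (illustrative only)."""
    def __init__(self, memory, capacity=4):
        self.memory = memory      # backing store: address -> value
        self.capacity = capacity
        self.cache = {}           # cached bytes: address -> value
        self.order = []           # FIFO order used to pick a victim

    def read(self, addr):
        if addr in self.cache:                # cache hit: no memory access
            return self.cache[addr]
        if len(self.cache) >= self.capacity:
            victim = self.order.pop(0)        # 1) choose a byte to drop
            self.memory[victim] = self.cache.pop(victim)  # 2) write back
        self.cache[addr] = self.memory[addr]  # 3) read in
        self.order.append(addr)
        return self.cache[addr]

mem = {a: a * 10 for a in range(80)}
c = ToyCache(mem)
print(c.read(47))  # 470 (miss: serviced from memory)
print(c.read(47))  # 470 (hit: served from the cache)
```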
Access Locality
• The cache is effective if most accesses are hits, i.e. the cache hit rate is high.
• Temporal locality: multiple accesses to the same address within a short time period.
Spatial locality
• Spatial locality: multiple accesses to close-together addresses in a short time period.
• Examples: the difference between the two sums; counting words by sorting.
• Benefiting from spatial locality:
• Memory is partitioned into blocks/lines rather than single bytes.
• Moving a block of memory takes much less time than moving each byte individually.
• Memory locations that are close to each other are likely to fall in the same block,
• resulting in more cache hits.
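One way to see this is a toy simulation of a block-based cache (block size, cache size, and the access patterns are made-up parameters): sequential addresses reuse each block many times, scattered addresses do not.

```python
def hit_rate(addresses, block_size=8, num_blocks=4):
    """Fraction of accesses whose block is already cached (FIFO eviction)."""
    cached, order, hits = set(), [], 0
    for addr in addresses:
        block = addr // block_size        # blocks, not bytes, are cached
        if block in cached:
            hits += 1
        else:
            if len(cached) >= num_blocks:
                cached.discard(order.pop(0))
            cached.add(block)
            order.append(block)
    return hits / len(addresses)

sequential = list(range(256))                      # strong spatial locality
scattered = [(i * 97) % 256 for i in range(256)]   # little spatial locality
print(hit_rate(sequential))  # 0.875 (one miss per 8-byte block, then 7 hits)
print(hit_rate(scattered))   # far lower: blocks are rarely reused in time
```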
Cache: Lines/Blocks

[Figure: the cache now holds whole blocks (e.g. 50-53 and 32-35) rather than single bytes. This supports spatial locality.]
Unsorted word count / poor locality
• Consider the memory accesses to the dictionary D:
• Counting without sorting: D[the]=12332, …, D[but]=943, …, D[vernacular]=10, …, D[for]=…
• Temporal locality for very common words like "the"
• No spatial locality
Sorted word count / good locality
• Entries to D are added one at a time:
1. D[lines]=33
2. D[lines]=33, D[lingered]=5
3. D[lines]=33, D[lingered]=5, D[lingering]=8
• Assuming new entries are added at the end, this gives spatial locality.
• Spatial locality makes code run faster.
Summary
• Caching reduces storage latency by bringing relevant data close to the CPU.
• This requires that code exhibits access locality:
• Temporal locality: accessing the same location multiple times.
• Spatial locality: accessing neighboring locations.
The Memory Hierarchy
• Real systems have several levels of storage types:
• Top of the hierarchy: small and fast storage close to the CPU
• Bottom of the hierarchy: large and slow storage further from the CPU
• Caching is used to transfer data between different levels of the hierarchy.
• The programmer/compiler is oblivious:
• The hardware provides an abstraction: memory looks like a single large array.
• But performance depends on the program's access pattern.
Computer clusters extend the memory hierarchy
• A data processing cluster is simply many computers linked through an ethernet connection.
• Storage is shared.
• Locality: data should reside on the computer that will use it.
• "Caching" is replaced by "shuffling".
• The abstraction is the Spark RDD.
Sizes and latencies in a typical memory hierarchy:

Level                 Size (bytes)   Latency   Block size
CPU (Registers)       1KB            300ps     64B
L1 Cache              64KB           1ns       64B
L2 Cache              256KB          5ns       64B
L3 Cache              4MB            20ns      64B
Main Memory           4-16GB         100ns     32KB
Disk Storage          4-16TB         2-10ms    64KB
Local Area Network    16TB-10PB      2-10ms    1.5-64KB

Sizes span roughly 12 orders of magnitude; latencies span roughly 6 orders of magnitude.
Summary
• Memory hierarchy: combining storage banks with different latencies.
• Clusters: multiple computers, connected by ethernet, that share their storage.
A short history of affordable massive computing

Supercomputers
• Cray, Deep Blue, Blue Gene…
• Specialized hardware
• Very expensive
• Created to solve specialized important problems
Data Centers
• The physical aspect of "the cloud"
• Collection of commodity computers
• VAST number of computers (100,000s)
• Created to provide computation for large and small organizations.
• Computation as a commodity.
Making History: Google 2003
• Larry Page and Sergey Brin develop a method for storing very large files on multiple commodity computers.
• Each file is broken into fixed-size chunks.
• Each chunk is stored on multiple chunk servers.
• The locations of the chunks are managed by the master.
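The chunk-and-replicate idea can be sketched as follows (function name and the round-robin placement are illustrative simplifications, not the actual placement policy):

```python
def chunk_and_replicate(data, chunk_size, num_servers, copies=2):
    """Split data into fixed-size chunks and place each chunk's copies
    on distinct servers (round-robin placement; illustrative only)."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    placement = {}
    for idx in range(len(chunks)):
        # each copy of chunk idx goes to a different server
        placement[idx] = [(idx + c) % num_servers for c in range(copies)]
    return chunks, placement

chunks, placement = chunk_and_replicate(b"abcdefghij", chunk_size=4,
                                        num_servers=3, copies=2)
print(len(chunks))   # 3 chunks: b"abcd", b"efgh", b"ij"
print(placement[0])  # [0, 1] -> chunk 0 stored on servers 0 and 1
```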
HDFS: Chunking files

[Figure: File1 and File2 are each split into chunks (File1 Chunk1, File1 Chunk2, File2 Chunk1, File2 Chunk2); each chunk is then copied, producing Copy1 and Copy2 of every chunk, plus a Copy3 of File1 Chunk2.]
HDFS: Distributing chunks

[Figure: the chunk copies are distributed across the servers, so that no single server holds more than one copy of the same chunk.]
Properties of GFS/HDFS
• Commodity hardware: low cost per byte of storage.
• Locality: data stored close to the CPU.
• Redundancy: can recover from server failures.
• Simple abstraction: looks to the user like a standard file system (files, directories, etc.). The chunk mechanism is hidden.
Redundancy

Parallelism
• Task: sum all of the numbers in File1. Assume File1 contains a list of numbers.
• Serial computation: do everything on one computer.
• Parallel method: process each chunk on a separate computer, then combine the results.
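The parallel method can be sketched as follows (threads on one machine stand in for separate computers; the chunking parameters are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(numbers, num_chunks=4):
    """Sum each chunk in a separate worker, then combine the partial sums."""
    size = (len(numbers) + num_chunks - 1) // num_chunks
    chunks = [numbers[i:i + size] for i in range(0, len(numbers), size)]
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        partial = list(pool.map(sum, chunks))  # one partial sum per chunk
    return sum(partial)                        # combine step

print(parallel_sum(list(range(1000))))  # 499500
```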
Locality
• Because of redundancy, it is likely that at any moment there exists an available worker that holds the chunk the master wishes to process.
Map-Reduce
• HDFS is a storage abstraction.
• Map-Reduce is a computation abstraction that works well with HDFS.
• It allows the programmer to specify a parallel computation without knowing how the hardware is organized.
• We will describe Map-Reduce, using Spark, in a later section.
Spark
• Developed by Matei Zaharia, AMPLab, 2014
• Hadoop uses a shared file system (disk)
• Spark uses shared memory -- faster, lower latency
• Will be used in this course
• Recall word count by sorting; we will redo it using map-reduce!
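As a preview, the word count can be expressed in map-reduce style in plain Python (this is a sketch of the pattern, not Spark's actual API):

```python
from itertools import groupby
from operator import itemgetter

def map_reduce_wordcount(text):
    # Map: emit a (word, 1) pair for every word.
    pairs = [(w, 1) for w in text.lower().split()]
    # Shuffle: group pairs by key (sorting brings equal keys together).
    pairs.sort(key=itemgetter(0))
    # Reduce: sum the counts within each group.
    return {w: sum(c for _, c in grp)
            for w, grp in groupby(pairs, key=itemgetter(0))}

print(map_reduce_wordcount("to be or not to be"))
# {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

Note that the shuffle step is exactly the sort from the earlier sort-based word count.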
Summary
• Big data analysis is performed on large clusters of commodity computers.
• HDFS (Hadoop file system): break files into chunks, make copies, distribute them randomly.
• Hadoop Map-Reduce: a computation abstraction that works well with HDFS.
• Spark: sharing memory instead of sharing disk.