Caches and Memory Hierarchy: Review
UCSB CS240A, Fall 2017
Motivation
• Most applications on a single processor run at only 10-20% of the processor peak
• Most of the single-processor performance loss is in the memory system
  – Moving data takes much longer than arithmetic and logic
• Parallel computing with low single machine performance is not good enough.
• Understand high performance computing and cost in a single machine setting
• Review of cache/memory hierarchy
Typical Memory Hierarchy
[Figure: on-chip components (control, datapath, register file, instruction cache, data cache), followed by second-level cache (SRAM), third-level cache (SRAM), main memory (DRAM), and secondary memory (disk or flash)]
Values are listed from the register file toward secondary memory:
Speed (cycles):  ½'s    1's    10's   100's  1,000,000's
Size (bytes):    100's  10K's  M's    G's    T's
Cost/bit:        highest (register file) to lowest (secondary memory)
• Principle of locality + memory hierarchy presents the programmer with ≈ as much memory as is available in the cheapest technology at ≈ the speed offered by the fastest technology
Idealized Uniprocessor Model
• Processor names bytes, words, etc. in its address space
  – These represent integers, floats, pointers, arrays, etc.
• Operations include
  – Read and write into very fast memory called registers
  – Arithmetic and other logical operations on registers
• Order specified by program
  – Read returns the most recently written data
  – Compiler and architecture translate high level expressions into “obvious” lower level instructions
  – Hardware executes instructions in order specified by compiler
• Idealized Cost
  – Each operation has roughly the same cost (read, write, add, multiply, etc.)

A = B + C  ⇒
    Read address(B) to R1
    Read address(C) to R2
    R3 = R1 + R2
    Write R3 to Address(A)
Uniprocessors in the Real World
• Real processors have
  – registers and caches
    • small amounts of fast memory
    • store values of recently used or nearby data
    • different memory ops can have very different costs
  – parallelism
    • multiple “functional units” that can run in parallel
    • different orders, instruction mixes have different costs
  – pipelining
    • a form of parallelism, like an assembly line in a factory
• Why is this your problem?
  • In theory, compilers and hardware “understand” all this and can optimize your program; in practice they don’t.
  • They won’t know about a different algorithm that might be a much better “match” to the processor
Memory Hierarchy
• Most programs have a high degree of locality in their accesses
  – spatial locality: accessing things near previous accesses
  – temporal locality: reusing an item that was previously accessed
• Memory hierarchy tries to exploit locality to improve average memory access time
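[Figure: processor (control, datapath, registers, on-chip cache) backed by successively larger and slower levels]

Level                           Speed     Size
Registers / on-chip cache       1 ns      KB
Second-level cache (SRAM)       10 ns     MB
Main memory (DRAM)              100 ns    GB
Secondary storage (Disk)        1-10 ms   TB
Tertiary storage (Disk/Tape)    10 sec    PB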
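To make the effect of spatial locality concrete, here is a minimal sketch that counts misses in a toy direct-mapped cache for two traversal orders of the same matrix. The function name `count_misses`, the cache geometry (64 blocks of 64 bytes), and the 256 × 256 matrix of 8-byte elements are all assumptions chosen for illustration; they do not come from the slides.

```python
def count_misses(addresses, num_blocks=64, block_bytes=64):
    """Count misses in a toy direct-mapped cache (hypothetical geometry)."""
    resident = [None] * num_blocks      # block number held by each cache entry
    misses = 0
    for addr in addresses:
        block = addr // block_bytes     # block number of this byte address
        index = block % num_blocks      # cache entry the block maps to
        if resident[index] != block:    # not present: miss, then install it
            resident[index] = block
            misses += 1
    return misses

n = 256                                 # n x n matrix, 8-byte elements, row-major layout
row_order = [(i * n + j) * 8 for i in range(n) for j in range(n)]
col_order = [(i * n + j) * 8 for j in range(n) for i in range(n)]

print("row-by-row misses:      ", count_misses(row_order))
print("column-by-column misses:", count_misses(col_order))
```

Walking the matrix row by row touches consecutive addresses, so each 64-byte block is fetched once and then reused for the next seven elements. Walking it column by column jumps 2048 bytes per access, and in this toy cache every access misses.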
Review: Cache in Modern Computer Architecture
[Figure: processor (control; datapath with PC, registers, and arithmetic & logic unit) connected through a cache to memory (program and data) over the processor-memory interface (address, write data, read data, bytes); input and output attach through I/O-memory interfaces]
Cache Basics
• Cache is fast (expensive) memory which keeps a copy of data in main memory; it is hidden from software
  – Simplest example: data at memory address xxxxx1101 is stored at cache location 1101
• Memory data is divided into blocks
  – The cache accesses memory one block (cache line) at a time
  – Cache line length: # of bytes loaded together in one entry
• Cache is divided into a number of sets
  – A cache block can be hosted in only one set.
• Cache hit: in-cache memory access (cheap)
• Cache miss: need to access the next, slower level of the memory hierarchy
Memory Block-Addressing Example
[Figure: the same memory shown with byte addresses, word addresses (2 LSBs are 0), and 8-byte block addresses (3 LSBs are 0); each address splits into a block # and a byte offset in block (0 to 7)]
Processor Address Fields used by Cache Controller
• Block Offset: byte address within block
  – B is number of bytes per block
• Set Index: selects which set. S is the number of sets
• Tag: remaining portion of processor address
• Size of Tag = Address size - log2(S) - log2(B)

Processor Address:  | Tag | Set Index | Block offset |
Cache Size C = Associativity N × # of Sets S × Cache Block Size B
• Associativity N represents # items that can be held per set
• Example: cache size 16K, 8 bytes per block → 2K blocks → if N = 1, S = 2K, so the set index uses 11 bits.
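As a rough illustration of the address fields and the size formula, the following sketch computes the field widths and splits an address into tag, set index, and block offset. The function names `field_widths` and `split_address` and the default 32-bit address width are assumptions made for this example, and all sizes are assumed to be powers of two.

```python
def field_widths(cache_bytes, block_bytes, ways, addr_bits=32):
    """Return (tag_bits, index_bits, offset_bits) for C = N * S * B."""
    sets = cache_bytes // (ways * block_bytes)     # S = C / (N * B)
    offset_bits = block_bytes.bit_length() - 1     # log2(B)
    index_bits = sets.bit_length() - 1             # log2(S)
    return addr_bits - index_bits - offset_bits, index_bits, offset_bits

def split_address(addr, cache_bytes, block_bytes, ways, addr_bits=32):
    """Split addr into (tag, set_index, block_offset)."""
    _, index_bits, offset_bits = field_widths(cache_bytes, block_bytes, ways, addr_bits)
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# The 16K direct-mapped example with 8-byte blocks: 2K sets, 11 index bits.
print(field_widths(16 * 1024, 8, 1))   # (18, 11, 3)
```

Raising the associativity while keeping C and B fixed shrinks S, which moves bits from the set index into the tag.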
Block Number Aliasing Example
12-bit memory addresses, 16-byte blocks

Address         Block #   Block # mod 8        Block # mod 2
                          (3-bit set index)    (1-bit set index)
010100100000    82        2                    0
010100110000    83        3                    1
010101000000    84        4                    0
010101010000    85        5                    1
010101100000    86        6                    0
010101110000    87        7                    1
010110000000    88        0                    0
010110010000    89        1                    1
010110100000    90        2                    0
010110110000    91        3                    1
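The aliasing pattern can be reproduced with a few lines of arithmetic. This is a small sketch under the slide's parameters (12-bit addresses, 16-byte blocks); the helper name `block_and_set` is made up for the example.

```python
BLOCK_BYTES = 16   # 16-byte blocks: the low 4 address bits are the block offset

def block_and_set(addr, num_sets):
    """Return (block number, set index) for a byte address."""
    block = addr // BLOCK_BYTES
    return block, block % num_sets

# Walk the ten block-aligned addresses from 0b010100100000 to 0b010110110000.
for addr in range(0b010100100000, 0b010110110000 + 1, BLOCK_BYTES):
    block, set8 = block_and_set(addr, 8)   # 3-bit set index
    _, set2 = block_and_set(addr, 2)       # 1-bit set index
    print(f"{addr:012b}  {block:3d}  {set8}  {set2}")
```

With 8 sets, blocks 82 and 90 both land in set 2, so they compete for the same cache location; with only 2 sets, every other block aliases.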
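Direct-Mapped Cache: N = 1, S = Number of Blocks = 2^10
• 4-byte blocks, cache size = 1K words (or 4 KB)
• 32-bit address: 20-bit Tag (bits 31-12), 10-bit Index (bits 11-2), 2-bit Byte offset (bits 1-0)
[Figure: cache with entries at index 0 to 1023, each holding a Valid bit, a 20-bit Tag, and 32 bits of Data; a comparator matches the stored Tag against the upper address bits to produce Hit]
• Valid bit ensures something useful is in the cache for this index
• Compare Tag with upper part of Address to see if a Hit
• Read data from cache instead of memory if a Hit
Cache Size C = Associativity N × # of Sets S × Cache Block Size B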
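The hit check in a direct-mapped cache can be sketched in software as follows. The class name `DirectMappedCache` is hypothetical, and the model tracks only valid bits and tags (no data), using the 4-byte blocks and 1024 entries of the example.

```python
class DirectMappedCache:
    """Toy direct-mapped cache: 1024 entries of one 4-byte block each."""
    OFFSET_BITS = 2     # 4-byte blocks
    INDEX_BITS = 10     # 2**10 = 1024 entries

    def __init__(self):
        entries = 1 << self.INDEX_BITS
        self.valid = [False] * entries
        self.tags = [0] * entries

    def access(self, addr):
        """Return True on a hit; on a miss, install the block and return False."""
        index = (addr >> self.OFFSET_BITS) & ((1 << self.INDEX_BITS) - 1)
        tag = addr >> (self.OFFSET_BITS + self.INDEX_BITS)
        if self.valid[index] and self.tags[index] == tag:
            return True                 # valid entry with a matching tag
        self.valid[index] = True        # miss: replace whatever was there
        self.tags[index] = tag
        return False

cache = DirectMappedCache()
print(cache.access(0x1000), cache.access(0x1000))   # miss, then hit
```

Two addresses that differ by a multiple of 4 KB share an index but not a tag, so they evict each other even when the rest of the cache is empty.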
Cache Organizations
• “Fully Associative”: block can go anywhere
  – N = number of blocks. S = 1
• “Direct Mapped”: block goes one place
  – N = 1. S = cache capacity in terms of number of blocks
• “N-way Set Associative”: N places for a block
Cache Size C = N × # of Sets S × Size B
Associativity N represents # items that can be held per set
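The three organizations differ only in how the same number of blocks is split between N and S. A minimal sketch, assuming a hypothetical helper `organization` in which `ways=None` stands for fully associative:

```python
def organization(cache_bytes, block_bytes, ways=None):
    """Return (N, S) such that cache_bytes = N * S * block_bytes."""
    blocks = cache_bytes // block_bytes
    n = blocks if ways is None else ways    # fully associative: N = number of blocks
    if blocks % n != 0:
        raise ValueError("number of blocks must be a multiple of the associativity")
    return n, blocks // n

C, B = 4 * 1024, 4                          # 4 KB cache with 4-byte blocks
print("direct mapped:    ", organization(C, B, ways=1))
print("4-way set assoc.: ", organization(C, B, ways=4))
print("fully associative:", organization(C, B))
```

For this 4 KB cache the results are (1, 1024), (4, 256), and (1024, 1): the product N × S stays at 1024 blocks in every case.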
Four-Way Set-Associative Cache
• 2^8 = 256 sets each with four ways (each with one block)
• 32-bit address: 22-bit Tag, 8-bit Set Index, 2-bit Byte offset
[Figure: Way 0 to Way 3, each with entries 0 to 255 holding a Valid bit (V), Tag, and Data; the Set Index selects one entry in every way, the stored tags are compared with the address Tag, and a 4x1 select outputs Hit and 32 bits of Data]
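A set-associative lookup can be sketched by keeping a small collection of tags per set. The class name `SetAssociativeCache` is made up for this example, and the LRU eviction within a set is an assumption used only to make the model complete; the slide itself does not fix a replacement policy.

```python
from collections import OrderedDict

class SetAssociativeCache:
    """Toy N-way cache (default: 256 sets, 4 ways, 4-byte blocks), tags only."""

    def __init__(self, num_sets=256, ways=4, block_bytes=4):
        self.num_sets = num_sets
        self.ways = ways
        self.block_bytes = block_bytes
        # One ordered dict of tags per set, least recently used first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, addr):
        """Return True on a hit; on a miss, install the block and return False."""
        block = addr // self.block_bytes
        index, tag = block % self.num_sets, block // self.num_sets
        tags = self.sets[index]
        if tag in tags:
            tags.move_to_end(tag)        # now the most recently used way
            return True
        if len(tags) == self.ways:
            tags.popitem(last=False)     # set is full: evict the LRU way (assumed policy)
        tags[tag] = None
        return False

cache = SetAssociativeCache()
conflicting = [k * 1024 for k in range(4)]    # four blocks that all map to set 0
print([cache.access(a) for a in conflicting + conflicting])
```

Unlike the direct-mapped case, four blocks that share a set index can all stay resident, so the second pass over `conflicting` hits every time.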
How to find if a data address is in the cache?
(0b means binary number)
• Assume block size is 8 bytes → the last 3 bits of the address are the offset.
• Set index is 2 bits.
• Given address 0b1001011, where do we find this item in the cache?
  – 0b1001011 → block number 0b1001 (the offset is 0b011).
  – Set index is 2 bits (block number mod 4) → set number 0b01.
  – Tag = 0b10.
• If direct-mapped cache, there is only one block in set #1.
• If 4 ways, there could be 4 blocks in set #1.
• Use tag 0b10 to compare against what is in the set.
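The same decoding can be checked mechanically. This is a minimal sketch with the slide's parameters (3 offset bits, 2 set-index bits); the names `decode` and `is_hit`, and the example set contents, are invented for illustration.

```python
OFFSET_BITS = 3    # 8-byte blocks
INDEX_BITS = 2     # 4 sets

def decode(addr):
    """Return (tag, set_index, offset) for a byte address."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    block = addr >> OFFSET_BITS                  # block number
    set_index = block & ((1 << INDEX_BITS) - 1)  # block number mod 4
    tag = block >> INDEX_BITS
    return tag, set_index, offset

def is_hit(sets, addr):
    """sets[i] lists the tags currently stored in set i (one per way)."""
    tag, set_index, _ = decode(addr)
    return tag in sets[set_index]

tag, set_index, offset = decode(0b1001011)
print(f"{tag:02b} {set_index:02b} {offset:03b}")   # prints 10 01 011

# Hypothetical contents of a 4-way cache: set #1 holds tags 0b00, 0b10, 0b11.
sets = [[], [0b00, 0b10, 0b11], [], []]
print(is_hit(sets, 0b1001011))                     # prints True
```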
Cache Replacement Policies
• Random Replacement
  – Hardware randomly selects a cache entry to evict
• Least-Recently Used
  – Hardware keeps track of access history
  – Replace the entry that has not been used for the longest time
  – For 2-way set-associative cache, need one bit per set for LRU replacement
• Example of a Simple “Pseudo” LRU Implementation
  – Assume 64 Fully Associative entries (Entry 0, Entry 1, ..., Entry 63)
  – Hardware replacement pointer points to one cache entry
  – Whenever access is made to the entry the pointer points to:
    • Move the pointer to the next entry
  – Otherwise: do not move the pointer
  – (example of “not-most-recently used” replacement policy)
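The pointer-based scheme is simple enough to model directly. In this sketch the class name `ReplacementPointer` is hypothetical, and wrapping from the last entry back to Entry 0 is an assumption, since the slide only says "the next entry".

```python
class ReplacementPointer:
    """Not-most-recently-used replacement for a fully associative cache."""

    def __init__(self, entries=64):
        self.entries = entries
        self.pointer = 0                 # entry that would be replaced next

    def on_access(self, entry):
        """Record an access; move the pointer only if it points at this entry."""
        if entry == self.pointer:
            # Assumed: the pointer wraps around after the last entry.
            self.pointer = (self.pointer + 1) % self.entries

    def victim(self):
        """Entry to evict on the next miss."""
        return self.pointer

policy = ReplacementPointer()
for entry in (0, 5, 1, 1, 2):
    policy.on_access(entry)
print(policy.victim())                   # prints 3
```

The pointer never rests on the entry that was just accessed, which is what makes this "not most recently used" rather than true LRU: it needs one pointer for the whole cache instead of a full access history.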
Handling Data Writing
• Store instructions write to memory, changing values
• Need to make sure cache and memory have same values on writes: 2 policies
  – 1) Write-through policy: write cache and write through the cache to memory
    • Every write eventually gets to memory
    • Too slow, so include Write Buffer to allow processor to continue once data in Buffer
    • Buffer updates memory in parallel to processor
  – 2) Write-back policy
Write-Through Cache
• Write values both in cache and in memory
• Write buffer stops CPU from stalling if memory cannot keep up
• Write buffer may have multiple entries to absorb bursts of writes
• What if store misses in cache?
[Figure: processor connected to cache, and cache connected to memory, each by 32-bit address and 32-bit data lines; a write buffer of (Addr, Data) entries sits between cache and memory]
Handling Stores with Write-Back
2) Write-Back Policy: write only to cache and then write cache block back to memory when the block is evicted from cache
  – Writes collected in cache, only single write to memory per block
  – Include bit to see if wrote to block or not, and then only write back if bit is set
    • Called “Dirty” bit (writing makes it “dirty”)
Write-Back Cache
• Store/cache hit, write data in cache only & set dirty bit
  – Memory has stale value
• Store/cache miss, read data from memory, then update and set dirty bit
  – “Write-allocate” policy
• Load/cache hit, use value from cache
• On any miss, write back evicted block, only if dirty. Update cache with new block and clear dirty bit.
[Figure: processor connected to cache, and cache connected to memory, each by 32-bit address and 32-bit data lines; every cache entry carries a dirty bit (D)]
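The write-back rules can be traced with a small model that counts memory traffic. This is a sketch only: the class name `WriteBackCache`, the direct-mapped organization, and the tiny geometry (4 sets of 8-byte blocks) are assumptions, and the model tracks tags and dirty bits but no data.

```python
class WriteBackCache:
    """Toy direct-mapped, write-back, write-allocate cache."""

    def __init__(self, num_sets=4, block_bytes=8):
        self.num_sets = num_sets
        self.block_bytes = block_bytes
        self.lines = [None] * num_sets   # each entry: [tag, dirty] or None
        self.mem_reads = 0               # blocks fetched from memory
        self.mem_writes = 0              # dirty blocks written back

    def _lookup(self, addr):
        block = addr // self.block_bytes
        index, tag = block % self.num_sets, block // self.num_sets
        line = self.lines[index]
        if line is None or line[0] != tag:        # miss
            if line is not None and line[1]:
                self.mem_writes += 1              # write back evicted block, only if dirty
            self.mem_reads += 1                   # fetch the new block
            line = self.lines[index] = [tag, False]   # install with dirty bit clear
        return line

    def load(self, addr):
        self._lookup(addr)

    def store(self, addr):
        self._lookup(addr)[1] = True     # write in cache only and set dirty bit

cache = WriteBackCache()
for _ in range(10):
    cache.store(0)                       # ten stores to the same block
cache.load(32)                           # maps to the same set: evicts the dirty block
print(cache.mem_reads, cache.mem_writes) # prints 2 1
```

Ten stores to one block cost a single memory write at eviction time, where a write-through cache would have sent all ten to memory.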
Write-Through vs. Write-Back
• Write-Through:
  – Simpler control logic
  – More predictable timing simplifies processor control logic
  – Easier to make reliable, since memory always has copy of data (big idea: Redundancy!)
• Write-Back
  – More complex control logic
  – More variable timing (0, 1, 2 memory accesses per cache access)
  – Usually reduces write traffic
  – Harder to make reliable, sometimes cache has only copy of data
Cache (Performance) Terms
• Hit rate: fraction of accesses that hit in the cache
• Miss rate: 1 - Hit rate
• Miss penalty: time to replace a block from lower level in memory hierarchy to cache
• Hit time: time to access cache memory (including tag comparison)
Average Memory Access Time (AMAT)
• Average Memory Access Time (AMAT) is the average time to access memory considering both hits and misses in the cache

AMAT = Time for a hit + Miss rate × Miss penalty
Given a 0.2 ns clock, a miss penalty of 50 clock cycles, a miss rate of 2% per instruction, and a cache hit time of 1 clock cycle, what is AMAT?
AMAT = 1 cycle + 0.02 × 50 cycles = 2 cycles = 0.4 ns.
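The AMAT arithmetic is easy to check in code. A minimal sketch, where the function name `amat_cycles` is chosen for this example:

```python
def amat_cycles(hit_time, miss_rate, miss_penalty):
    """AMAT = time for a hit + miss rate * miss penalty (all times in cycles)."""
    return hit_time + miss_rate * miss_penalty

CLOCK_NS = 0.2                                    # clock period from the example
cycles = amat_cycles(hit_time=1, miss_rate=0.02, miss_penalty=50)
print(f"AMAT = {cycles:.1f} cycles = {cycles * CLOCK_NS:.1f} ns")
```

Because the miss penalty is 50 times the hit time, even a 2% miss rate doubles the average access time, which is why small changes in miss rate matter so much.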