Caches and Memory Hierarchy: Review (UCSB CS240A, Fall 2017)

Transcript
  • 1

    Caches and Memory Hierarchy: Review

    UCSB CS240A, Fall 2017

  • 2

    Motivation

    • Most applications on a single processor run at only 10-20% of the processor peak

    • Most of the single-processor performance loss is in the memory system
      – Moving data takes much longer than arithmetic and logic

    • Parallel computing with low single-machine performance is not good enough.

    • Understand high-performance computing and cost in a single-machine setting

    • Review of cache/memory hierarchy

  • 3

    Typical Memory Hierarchy

    [Diagram: on-chip components (RegFile, Instr Cache, Data Cache) next to the control and datapath, then Second-Level Cache (SRAM), Third-Level Cache (SRAM), Main Memory (DRAM), and Secondary Memory (Disk or Flash)]

    Speed (cycles):  ½'s      1's      10's     100's    1,000,000's
    Size (bytes):    100's    10K's    M's      G's      T's
    Cost/bit:        highest                             lowest

    • Principle of locality + memory hierarchy presents the programmer with ≈ as much memory as is available in the cheapest technology at ≈ the speed offered by the fastest technology

  • 4

    Idealized Uniprocessor Model

    • Processor names bytes, words, etc. in its address space
      – These represent integers, floats, pointers, arrays, etc.

    • Operations include
      – Read and write into very fast memory called registers
      – Arithmetic and other logical operations on registers

    • Order specified by program
      – Read returns the most recently written data
      – Compiler and architecture translate high-level expressions into "obvious" lower-level instructions
      – Hardware executes instructions in the order specified by the compiler

    • Idealized cost
      – Each operation has roughly the same cost (read, write, add, multiply, etc.)

    A = B + C  ⇒   Read address(B) to R1
                   Read address(C) to R2
                   R3 = R1 + R2
                   Write R3 to address(A)

  • 5

    Uniprocessors in the Real World

    • Real processors have
      – registers and caches
        • small amounts of fast memory
        • store values of recently used or nearby data
        • different memory ops can have very different costs
      – parallelism
        • multiple "functional units" that can run in parallel
        • different orders, instruction mixes have different costs
      – pipelining
        • a form of parallelism, like an assembly line in a factory

    • Why is this your problem?
      • In theory, compilers and hardware "understand" all this and can optimize your program; in practice they don't.
      • They won't know about a different algorithm that might be a much better "match" to the processor

  • 6

    Memory Hierarchy

    • Most programs have a high degree of locality in their accesses
      – spatial locality: accessing things nearby previous accesses
      – temporal locality: reusing an item that was previously accessed

    • Memory hierarchy tries to exploit locality to improve average memory access time

    [Diagram: processor (control, datapath, registers, on-chip cache), second-level cache (SRAM), main memory (DRAM), secondary storage (disk), tertiary storage (disk/tape)]

    Speed:  1 ns    10 ns    100 ns    1-10 ms    10 sec
    Size:   KB      MB       GB        TB         PB

  • 7

    Review: Cache in Modern Computer Architecture

    [Diagram: Processor (Control, Datapath, PC, Registers, ALU) with program and data caches, connected over the processor-memory interface to Memory (bytes, addresses, read/write data), with input and output devices on the I/O-memory interfaces]

  • 8

    Cache Basics

    • A cache is fast (expensive) memory which keeps a copy of data in main memory; it is hidden from software
      – Simplest example: data at memory address xxxxx1101 is stored at cache location 1101

    • Memory data is divided into blocks
      – The cache accesses memory one block (cache line) at a time
      – Cache line length: # of bytes loaded together in one entry

    • The cache is divided into a number of sets
      – A cache block can be hosted in one set.

    • Cache hit: in-cache memory access; cheap
    • Cache miss: need to access the next, slower level of cache

  • 9

    Memory Block-addressing Example

    [Diagram: the same memory addresses shown three ways: as byte addresses, as word (4-byte) addresses whose 2 LSBs are 0, and as 8-byte block addresses whose 3 LSBs are 0; each block # groups byte offsets 0-7 within the block]

  • 10

    Processor Address Fields Used by the Cache Controller

    • Block offset: byte address within the block
      – B is the number of bytes per block
    • Set index: selects which set. S is the number of sets
    • Tag: remaining portion of the processor address

    • Size of tag = address size - log2(S) - log2(B)

    Processor address:  | Tag | Set Index | Block Offset |

    Cache size C = associativity N × # of sets S × cache block size B
    Example: cache size 16 KB with 8 bytes per block → 2K blocks → if N = 1, then S = 2K, so the set index uses 11 bits.

    Associativity N represents the # of items that can be held per set
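
    As a concrete illustration of this split, here is a minimal C sketch (not from the slides; the struct, function name, and test address are invented) that extracts the block offset, set index, and tag from an address, assuming the block size B and the number of sets S are powers of two:

        #include <stdint.h>
        #include <stdio.h>

        /* Decompose an address into (tag, set index, block offset) for a cache
         * with S sets and B-byte blocks, both powers of two. */
        typedef struct {
            uint64_t tag;
            uint64_t set;
            uint64_t offset;
        } addr_fields;

        static addr_fields split_address(uint64_t addr, uint64_t S, uint64_t B) {
            addr_fields f;
            f.offset = addr & (B - 1);          /* low log2(B) bits  */
            f.set    = (addr / B) & (S - 1);    /* next log2(S) bits */
            f.tag    = addr / B / S;            /* remaining bits    */
            return f;
        }

        int main(void) {
            /* Slide example: 16 KB direct-mapped cache, 8-byte blocks -> S = 2048 sets.
             * The address below is arbitrary. */
            addr_fields f = split_address(0x12345678, 2048, 8);
            printf("tag=%llx set=%llu offset=%llu\n",
                   (unsigned long long)f.tag,
                   (unsigned long long)f.set,
                   (unsigned long long)f.offset);
            return 0;
        }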

  • 11

    Block Number Aliasing Example
    (12-bit memory addresses, 16-byte blocks)

    Address        Block #   Block # mod 8       Block # mod 2
                             (3-bit set index)   (1-bit set index)
    010100100000   82        2                   0
    010100110000   83        3                   1
    010101000000   84        4                   0
    010101010000   85        5                   1
    010101100000   86        6                   0
    010101110000   87        7                   1
    010110000000   88        0                   0
    010110010000   89        1                   1
    010110100000   90        2                   0
    010110110000   91        3                   1
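
    The table can be reproduced mechanically; the following short C loop (a sketch, not part of the slides) derives the block number and both set indices from the same 12-bit addresses:

        #include <stdio.h>

        int main(void) {
            /* The ten 12-bit addresses from the table above, 16-byte blocks. */
            unsigned addrs[] = { 0x520, 0x530, 0x540, 0x550, 0x560,
                                 0x570, 0x580, 0x590, 0x5A0, 0x5B0 };
            for (int i = 0; i < 10; i++) {
                unsigned block = addrs[i] / 16;          /* drop the 4-bit block offset */
                printf("block %u -> mod 8 = %u, mod 2 = %u\n",
                       block, block % 8, block % 2);     /* 3-bit vs 1-bit set index    */
            }
            return 0;
        }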

  • 12

    Direct-Mapped Cache

    • 4-byte blocks, cache size = 1K words (or 4 KB)
    • Direct-mapped: N = 1, S = number of blocks = 2^10
    • Address split: 20-bit tag, 10-bit index, 2-bit byte offset
    • The valid bit ensures there is something useful in the cache for this index
    • Compare the tag with the upper part of the address to see if it is a hit
    • Read data from the cache instead of memory if it is a hit

    [Diagram: a 1024-entry array of (valid, tag, data); the 10-bit index selects an entry, a comparator checks the stored 20-bit tag against the address tag, and the hit signal gates the 32-bit data output]

    Cache size C = associativity N × # of sets S × cache block size B
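
    The lookup logic of such a direct-mapped cache can be sketched in a few lines of C (an illustrative model only; the entry layout and names are invented, and data refill is omitted):

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        #define NUM_SETS   1024u   /* S = 2^10 blocks, one block per set (N = 1) */
        #define BLOCK_SIZE 4u      /* 4-byte blocks -> 2-bit byte offset         */

        typedef struct {
            bool     valid;
            uint32_t tag;
            uint8_t  data[BLOCK_SIZE];
        } dm_entry;

        static dm_entry cache[NUM_SETS];

        /* Return true on a hit for the given 32-bit byte address. */
        static bool dm_lookup(uint32_t addr) {
            uint32_t index = (addr / BLOCK_SIZE) % NUM_SETS;  /* 10-bit index */
            uint32_t tag   = (addr / BLOCK_SIZE) / NUM_SETS;  /* 20-bit tag   */
            return cache[index].valid && cache[index].tag == tag;
        }

        int main(void) {
            cache[5].valid = true;      /* pretend a block with tag 7 is cached in set 5 */
            cache[5].tag   = 7;
            uint32_t addr  = (7u * NUM_SETS + 5u) * BLOCK_SIZE;  /* maps to set 5, tag 7 */
            printf("%s\n", dm_lookup(addr) ? "hit" : "miss");
            return 0;
        }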

  • 13

    Cache Organizations

    • "Fully associative": a block can go anywhere
      – N = number of blocks, S = 1
    • "Direct mapped": a block goes in one place
      – N = 1, S = cache capacity in number of blocks
    • "N-way set associative": N places for a block

    Cache size C = associativity N × # of sets S × block size B
    Associativity N represents the # of items that can be held per set

  • 14

    Four-Way Set-Associative Cache

    • 2^8 = 256 sets, each with four ways (each way holding one block)
    • Address split: 22-bit tag, 8-bit set index, 2-bit byte offset

    [Diagram: four ways (Way 0 - Way 3), each a 256-entry array of (valid, tag, data); the set index selects one entry per way, four comparators check the stored tags against the address tag, and a 4-to-1 multiplexer selects the 32-bit data from the hitting way]
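
    Extending the earlier direct-mapped sketch to this organization, a set-associative lookup simply searches all N ways within the selected set (again an illustrative model with invented names):

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        #define WAYS       4u
        #define NUM_SETS   256u   /* 2^8 sets -> 8-bit set index */
        #define BLOCK_SIZE 4u     /* 2-bit byte offset           */

        typedef struct {
            bool     valid;
            uint32_t tag;
        } sa_way;

        static sa_way cache[NUM_SETS][WAYS];

        /* Return true if the address hits in any of the 4 ways of its set. */
        static bool sa_lookup(uint32_t addr) {
            uint32_t set = (addr / BLOCK_SIZE) % NUM_SETS;
            uint32_t tag = (addr / BLOCK_SIZE) / NUM_SETS;   /* 22-bit tag */
            for (unsigned w = 0; w < WAYS; w++)
                if (cache[set][w].valid && cache[set][w].tag == tag)
                    return true;
            return false;
        }

        int main(void) {
            cache[10][2].valid = true;   /* pretend a block with tag 3 sits in way 2 of set 10 */
            cache[10][2].tag   = 3;
            uint32_t addr = (3u * NUM_SETS + 10u) * BLOCK_SIZE;
            printf("%s\n", sa_lookup(addr) ? "hit" : "miss");
            return 0;
        }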

  • 15

    How to find if a data address is in the cache?

    • Assume a block size of 8 bytes → the last 3 bits of the address are the offset.
    • The set index is 2 bits.
    • Given address 0b1001011, where do we find this item in the cache?

    (0b denotes a binary number)

  • 16

    How to find if a data address is in the cache?

    • Assume a block size of 8 bytes → the last 3 bits of the address are the offset.
    • The set index is 2 bits.
    • 0b1001011 → block number 0b1001.
    • Set index is 2 bits (block number mod 4).
    • Set number → 0b01.
    • Tag = 0b10.
    • If direct-mapped, there is only one block in set #1.
    • If 4-way set associative, there could be 4 blocks in set #1.
    • Use the tag 0b10 to compare against what is in the set.

    (0b denotes a binary number)
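
    This decomposition of 0b1001011 can be checked mechanically; the following small sketch (illustrative only) reproduces the slide's arithmetic:

        #include <assert.h>
        #include <stdio.h>

        int main(void) {
            unsigned addr  = 0x4B;          /* 0b1001011 = 75                  */
            unsigned block = addr >> 3;     /* drop 3 offset bits: 0b1001 = 9  */
            unsigned set   = block & 0x3;   /* 2-bit set index:    0b01        */
            unsigned tag   = block >> 2;    /* remaining bits:     0b10        */
            assert(block == 9 && set == 1 && tag == 2);
            printf("offset=%u set=%u tag=%u\n", addr & 0x7, set, tag);
            return 0;
        }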

  • 17

    Cache Replacement Policies

    • Random replacement
      – Hardware randomly selects a cache entry to evict

    • Least-recently used (LRU)
      – Hardware keeps track of access history
      – Replace the entry that has not been used for the longest time
      – For a 2-way set-associative cache, one bit is needed for LRU replacement

    • Example of a simple "pseudo"-LRU implementation
      – Assume 64 fully associative entries
      – A hardware replacement pointer points to one cache entry
      – Whenever an access is made to the entry the pointer points to: move the pointer to the next entry
      – Otherwise: do not move the pointer
      – (an example of a "not-most-recently-used" replacement policy)

    [Diagram: replacement pointer cycling over Entry 0 ... Entry 63]
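
    A software model of that pointer-based "not-most-recently-used" scheme might look like the following (a sketch with invented names, assuming the 64 fully associative entries from the slide):

        #include <stdio.h>

        #define ENTRIES 64

        static unsigned replacement_ptr = 0;   /* entry that would be evicted next */

        /* Call on every cache access to entry 'idx'. */
        static void touch(unsigned idx) {
            if (idx == replacement_ptr)        /* pointed-to entry was just used,  */
                replacement_ptr = (replacement_ptr + 1) % ENTRIES;  /* so skip it  */
        }

        /* On a miss, evict the entry the pointer currently selects. */
        static unsigned victim(void) {
            return replacement_ptr;
        }

        int main(void) {
            touch(0);   /* access to the pointed-to entry advances the pointer */
            touch(5);   /* access elsewhere leaves it alone                    */
            printf("next victim: entry %u\n", victim());   /* prints 1 */
            return 0;
        }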

  • 18

    Handling Data Writing

    • Store instructions write to memory, changing values

    • Need to make sure cache and memory have the same values on writes: 2 policies
      – 1) Write-through policy: write the cache and write through the cache to memory
        • Every write eventually gets to memory
        • Too slow, so include a write buffer to allow the processor to continue once data is in the buffer
        • The buffer updates memory in parallel to the processor
      – 2) Write-back policy

  • 19

    Write-Through Cache

    • Write both values: in the cache and in memory
    • The write buffer stops the CPU from stalling if memory cannot keep up
    • The write buffer may have multiple entries to absorb bursts of writes
    • What if a store misses in the cache?

    [Diagram: processor, cache, and memory connected by 32-bit address and data buses, with a write buffer of (addr, data) entries between the cache and memory]
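
    The write buffer's role can be sketched as a small producer/consumer model in C (illustrative only; the buffer size, function names, and example stores are invented, and the cache array itself is omitted):

        #include <stdint.h>
        #include <stdio.h>

        #define WB_ENTRIES 4

        /* Conceptual write-through model: every store is queued for memory;
         * the memory side drains the queue in parallel with the processor. */
        typedef struct { uint32_t addr, data; } wb_entry;

        static wb_entry write_buffer[WB_ENTRIES];
        static unsigned wb_count = 0;

        /* Processor side: a store stalls only if the write buffer is full. */
        static int store_write_through(uint32_t addr, uint32_t data) {
            if (wb_count == WB_ENTRIES)
                return 0;                  /* buffer full -> CPU would stall     */
            write_buffer[wb_count++] = (wb_entry){ addr, data };
            return 1;                      /* cache copy is also updated on a hit */
        }

        /* Memory side: drain one buffered write per memory cycle. */
        static void memory_drain_one(void) {
            if (wb_count > 0) {
                printf("mem[%u] = %u\n", (unsigned)write_buffer[0].addr,
                       (unsigned)write_buffer[0].data);
                for (unsigned i = 1; i < wb_count; i++)
                    write_buffer[i - 1] = write_buffer[i];
                wb_count--;
            }
        }

        int main(void) {
            store_write_through(1022, 99);   /* arbitrary example stores */
            store_write_through(252, 7);
            memory_drain_one();
            memory_drain_one();
            return 0;
        }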

  • 20

    Handling Stores with Write-Back

    2) Write-back policy: write only to the cache, and then write the cache block back to memory when the block is evicted from the cache
      – Writes are collected in the cache; only a single write to memory per block
      – Include a bit to track whether the block has been written to, and then only write it back if the bit is set
        • Called the "dirty" bit (writing makes the block "dirty")

  • 21

    Write-Back Cache

    • Store / cache hit: write the data in the cache only and set the dirty bit
      – Memory has a stale value
    • Store / cache miss: read the data from memory, then update it and set the dirty bit
      – "Write-allocate" policy
    • Load / cache hit: use the value from the cache
    • On any miss, write back the evicted block, but only if it is dirty. Update the cache with the new block and clear the dirty bit.

    [Diagram: processor, cache with per-block dirty bits, and memory connected by 32-bit address and data buses]
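
    A compact model of the write-back, write-allocate behavior above (a sketch only; a direct-mapped cache with invented types, one word per block, and a flat array standing in for memory) could be:

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        #define SETS 1024u

        typedef struct {
            bool     valid, dirty;
            uint32_t tag, data;              /* one word per block, for simplicity */
        } line;

        static line     cache[SETS];
        static uint32_t memory[1u << 20];    /* backing store, block-addressed     */

        static void store_write_back(uint32_t block_addr, uint32_t value) {
            uint32_t set = block_addr % SETS, tag = block_addr / SETS;
            line *l = &cache[set];

            if (!(l->valid && l->tag == tag)) {          /* miss                        */
                if (l->valid && l->dirty)                /* write back evicted block    */
                    memory[l->tag * SETS + set] = l->data;
                l->data  = memory[block_addr];           /* write-allocate: fetch block */
                l->tag   = tag;
                l->valid = true;
            }
            l->data  = value;                            /* update in cache only        */
            l->dirty = true;                             /* memory is now stale         */
        }

        int main(void) {
            memory[42] = 7;
            store_write_back(42, 99);            /* miss: allocate, then dirty the line   */
            store_write_back(42 + SETS, 5);      /* same set, new tag: 99 is written back */
            printf("memory[42] = %u\n", (unsigned)memory[42]);   /* prints 99 */
            return 0;
        }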

  • 22

    Write-Through vs. Write-Back

    • Write-through:
      – Simpler control logic
      – More predictable timing simplifies the processor control logic
      – Easier to make reliable, since memory always has a copy of the data (big idea: redundancy!)

    • Write-back:
      – More complex control logic
      – More variable timing (0, 1, or 2 memory accesses per cache access)
      – Usually reduces write traffic
      – Harder to make reliable; sometimes the cache has the only copy of the data

  • 23

    Cache (Performance) Terms

    • Hit rate: fraction of accesses that hit in the cache
    • Miss rate: 1 - hit rate
    • Miss penalty: time to replace a block from the lower level of the memory hierarchy into the cache
    • Hit time: time to access cache memory (including the tag comparison)

  • 24

    Average Memory Access Time (AMAT)

    • Average Memory Access Time (AMAT) is the average time to access memory considering both hits and misses in the cache

    AMAT = time for a hit + miss rate × miss penalty

    Example: given a 0.2 ns clock, a miss penalty of 50 clock cycles, a miss rate of 2% per instruction, and a cache hit time of 1 clock cycle, what is the AMAT?
    AMAT = 1 cycle + 0.02 × 50 cycles = 2 cycles = 0.4 ns.
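
    The AMAT formula is easy to check numerically; the following small C program (an illustration, not part of the slides) reproduces the example above:

        #include <stdio.h>

        int main(void) {
            double clock_ns     = 0.2;   /* cycle time                  */
            double hit_cycles   = 1.0;   /* cache hit time              */
            double miss_rate    = 0.02;  /* 2% of accesses miss         */
            double miss_penalty = 50.0;  /* cycles to fetch from memory */

            double amat_cycles = hit_cycles + miss_rate * miss_penalty;
            printf("AMAT = %.1f cycles = %.1f ns\n",
                   amat_cycles, amat_cycles * clock_ns);   /* 2.0 cycles = 0.4 ns */
            return 0;
        }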

