
Lecture 10: Memory Hierarchy - GitHub Pages

Date post: 23-Feb-2022
Transcript

Lecture 10: Memory Hierarchy -- Memory Technology and Principle of Locality

CSCE 513 Computer Architecture

Department of Computer Science and Engineering
Yonghong Yan

[email protected]
https://passlab.github.io/CSCE513

Topics for Memory Hierarchy

• Memory Technology and Principle of Locality
  – CAQA: 2.1, 2.2, B.1
  – COD: 5.1, 5.2
• Cache Organization and Performance
  – CAQA: B.1, B.2
  – COD: 5.2, 5.3
• Cache Optimization
  – 6 Basic Cache Optimization Techniques
    • CAQA: B.3
  – 10 Advanced Optimization Techniques
    • CAQA: 2.3
• Virtual Memory and Virtual Machine
  – CAQA: B.4, 2.4; COD: 5.6, 5.7
  – Skip for this course

The Big Picture: Where are We Now?

• Memory system
  – Supplying data on time for computation (speed)
  – Large enough to hold everything needed (capacity)

[Figure: the five classic components: processor (control + datapath), memory, input, output]

Overview

• Programmers want unlimited amounts of memory with low latency
• Fast memory technology is more expensive per bit than slower memory
• Solution: organize the memory system into a hierarchy
  – Entire addressable memory space available in the largest, slowest memory
  – Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor
• Temporal and spatial locality ensure that nearly all references can be found in smaller memories
  – Gives the illusion of a large, fast memory being presented to the processor

Memory Hierarchy

[Figure: levels of the memory hierarchy]

Memory Technology

• Random Access: access time is the same for all locations
• DRAM: Dynamic Random Access Memory
  – High density, low power, cheap, slow
  – Dynamic: needs to be "refreshed" regularly
  – 50 ns – 70 ns, $20 – $75 per GB
• SRAM: Static Random Access Memory
  – Low density, high power, expensive, fast
  – Static: content will last "forever" (until power is lost)
  – 0.5 ns – 2.5 ns, $2000 – $5000 per GB
• Magnetic disk
  – 5 ms – 20 ms, $0.20 – $2 per GB

Ideal memory:
• Access time of SRAM
• Capacity and cost/GB of disk

Static RAM (SRAM): 6-Transistor Cell – 1 Bit

• Write:
  1. Drive bit lines (bit = 1, bit' = 0)
  2. Select row
• Read:
  1. Precharge bit and bit' to Vdd or Vdd/2 => make sure equal!
  2. Select row
  3. Cell pulls one line low
  4. Sense amp on column detects difference between bit and bit'

[Figure: 6-transistor SRAM cell with word (row select) line and complementary bit/bit' lines; the upper transistors can be replaced with pull-ups to save area]

Dynamic RAM (DRAM): 1-Transistor Memory Cell

• Write:
  1. Drive bit line
  2. Select row
• Read:
  1. Precharge bit line to Vdd
  2. Select row
  3. Cell and bit line share charges
     • Very small voltage changes on the bit line
  4. Sense (fancy sense amp)
     • Can detect changes of ~1 million electrons
  5. Write: restore the value
• Refresh
  1. Just do a dummy read to every cell

[Figure: 1-transistor DRAM cell with row select line, bit line, and storage capacitor]

Performance: Latency and Bandwidth

• Performance of Main Memory:
  – Latency: Cache Miss Penalty
    • Access Time: time between request and word arrives
    • Cycle Time: time between requests
  – Bandwidth: I/O & Large Block Miss Penalty (L2)
• Main Memory is DRAM: Dynamic Random Access Memory
  – Needs to be refreshed periodically (8 ms)
  – Addresses divided into 2 halves (memory as a 2D matrix):
    • RAS or Row Access Strobe and CAS or Column Access Strobe
• Cache uses SRAM: Static Random Access Memory
  – No refresh (6 transistors/bit vs. 1 transistor)

Size: DRAM/SRAM 4–8; Cost/Cycle time: SRAM/DRAM 8–16

Stacked/Embedded DRAMs

• Stacked DRAMs in same package as processor
  – High Bandwidth Memory (HBM)

Flash Memory

• Type of EEPROM
• Types: NAND (denser) and NOR (faster)
• NAND Flash:
  – Reads are sequential, reads entire page (0.5 to 4 KiB)
  – 25 us for first byte, 40 MiB/s for subsequent bytes
  – SDRAM: 40 ns for first byte, 4.8 GB/s for subsequent bytes
  – 2 KiB transfer: 75 us vs. 500 ns for SDRAM, 150X slower
  – 300 to 500X faster than magnetic disk

NAND Flash Memory

• Must be erased (in blocks) before being overwritten
• Nonvolatile, can use as little as zero power
• Limited number of write cycles (~100,000)
• $2/GiB, compared to $20–40/GiB for SDRAM and $0.09/GiB for magnetic disk
• Phase-Change/Memristor Memory
  – Possibly 10X improvement in write performance and 2X improvement in read performance

CPU-Memory Performance Gap: Latency

CPU-DRAM memory latency gap => Memory Wall

[Figure: processor-memory performance gap over time (grows 50% / year)]

CPU-Memory Performance Gap: Bandwidth

• Memory hierarchy design becomes more crucial with recent multi-core processors
• Aggregate peak bandwidth grows with # cores:
  – Intel Core i7 can generate two references per core per clock
  – Four cores and 3.2 GHz clock
    • 25.6 billion 64-bit data references/second +
    • 12.8 billion 128-bit instruction references/second
    • = 409.6 GB/s!
  – DRAM bandwidth is only 8% of this (34.1 GB/s)
• Requires:
  – Multi-port, pipelined caches
  – Two levels of cache per core
  – Shared third-level cache on chip

Memory Hierarchy

• Keep most recently accessed data and its adjacent data in the smaller/faster caches that are closer to the processor
• Mechanisms for replacing data

[Figure: memory hierarchy, from processor (control + datapath) outward: registers; on-chip cache; 2nd/3rd-level cache (SRAM); main memory (DRAM); secondary storage (disk); tertiary storage (tape). Speed: ~1 ns, 10s ns, 100s ns, 10,000,000s ns (10s ms), 10,000,000,000s ns (10s sec). Size: 100s bytes, Ks, Ms, Gs, Ts]

Why Hierarchy Works

• The Principle of Locality:
  – Programs access a relatively small portion of the address space at any instant of time.

[Figure: probability of reference vs. address space (0 to 2^n - 1), with references clustered in a small region]

The Principle of Locality

• Programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves
• Spatial locality: items with nearby addresses tend to be referenced close together in time
• Temporal locality: recently referenced items are likely to be referenced in the near future
• Data
  – Reference array elements in succession (stride-1 reference pattern): spatial locality
  – Reference sum each iteration: temporal locality
• Instructions
  – Reference instructions in sequence: spatial locality
  – Cycle through loop repeatedly: temporal locality

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;

Memory Hierarchy of a Computer System

• By taking advantage of the principle of locality:
  – Present the user with as much memory as is available in the cheapest technology.
  – Provide access at the speed offered by the fastest technology.

[Figure: the same memory hierarchy diagram as before: registers, on-chip cache, 2nd/3rd-level cache (SRAM), main memory (DRAM), secondary storage (disk), tertiary storage (tape)]

Vector/Matrix and Array in C

• C has row-major storage for multi-dimensional arrays
  – A[2][2] is followed by A[2][3]
• 3-dimensional array
  – B[3][100][100]

int A[4][4]

• Stepping through columns in one row:
    for (i = 0; i < 4; i++)
        sum += A[0][i];
  accesses successive elements
• Stepping through rows in one column:
    for (i = 0; i < 4; i++)
        sum += A[i][0];
  stride-4 access

Locality Example

• Claim: Being able to look at code and get a qualitative sense of its locality is a key skill for a professional programmer
• Question: Does this function have good locality?

    int sumarrayrows(int a[M][N])
    {
        int i, j, sum = 0;

        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

Locality Example

• Question: Does this function have good locality?

    int sumarraycols(int a[M][N])
    {
        int i, j, sum = 0;

        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

Locality Example

• Question: Can you permute the loops so that the function scans the 3-d array a[] with a stride-1 reference pattern (and thus has good spatial locality)?

    int sumarray3d(int a[M][N][N])
    {
        int i, j, k, sum = 0;

        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < M; k++)
                    sum += a[k][i][j];
        return sum;
    }

Review: Memory Technology and Hierarchy

Technology Challenge: Memory Wall

Program Behavior: Principle of Locality
[Figure: probability of reference vs. address space (0 to 2^n - 1)]

Architecture Approach: Memory Hierarchy

Your Code: Exploit Locality, and Work With Memory Storage Type

    int sumarrayrows(int a[M][N])
    {
        int i, j, sum = 0;

        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

Memory Hierarchy

• Capacity: Register << SRAM << DRAM
• Latency: Register << SRAM << DRAM
• Bandwidth: on-chip >> off-chip

On a data access:
  if data is in fast memory => low latency access (SRAM)
  if data is not in fast memory => high latency access (DRAM)

History: Alpha 21164 (1994)

https://en.wikipedia.org/wiki/Alpha_21164

History: Further Back

"Ideally one would desire an indefinitely large memory capacity such that any particular ... word would be immediately available. ... We are ... forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible."

A. W. Burks, H. H. Goldstine, and J. von Neumann, Preliminary Discussion of the Logical Design of an Electronic Computing Instrument, 1946

Next: Memory Hierarchy - Cache Organization

• Keep most recently accessed data and its adjacent data in the smaller/faster caches that are closer to the processor
• Mechanisms for replacing data

[Figure: the same memory hierarchy diagram as before]

More Examples for Locality Discussion

Sources of Locality

• Temporal locality
  – Code within a loop
  – Same instructions fetched repeatedly
• Spatial locality
  – Data arrays
  – Local variables in stack
  – Data allocated in chunks (contiguous bytes)

    for (i = 0; i < N; i++) {
        A[i] = B[i] + C[i] * a;
    }

    int sumarrayrows(int a[M][N])
    {
        int i, j, sum = 0;

        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

Miss rate = 1/4 = 25%

    int sumarraycols(int a[M][N])
    {
        int i, j, sum = 0;

        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

Miss rate = 100%

Writing Cache Friendly Code

• Repeated references to variables are good (temporal locality)
• Stride-1 reference patterns are good (spatial locality)
• Examples:
  – cold cache, 4-byte words, 4-word cache blocks

Matrix Multiplication Example

• Major cache effects to consider
  – Total cache size
    • Exploit temporal locality and blocking
  – Block size
    • Exploit spatial locality
• Description:
  – Multiply N x N matrices
  – O(N^3) total operations
  – Accesses
    • N reads per source element
    • N values summed per destination
      – but may be able to hold in register

    /* ijk */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

Variable sum held in register

Miss Rate Analysis for Matrix Multiply

• Assume:
  – Cache line size = 32 bytes (big enough for 4 64-bit words)
  – Matrix dimension (N) is very large
    • Approximate 1/N as 0.0
  – Cache is not even big enough to hold multiple rows
• Analysis method:
  – Look at access pattern of inner loop

Matrix Multiplication (ijk)

Matrix Multiplication (jik)

Matrix Multiplication (kij)

Matrix Multiplication (ikj)

Matrix Multiplication (jki)

Matrix Multiplication (kji)

Summary of Matrix Multiplication

ijk (& jik):

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

• 2 loads, 0 stores
• misses/iter = 1.25

kij (& ikj):

    for (k = 0; k < n; k++) {
        for (i = 0; i < n; i++) {
            r = a[i][k];
            for (j = 0; j < n; j++)
                c[i][j] += r * b[k][j];
        }
    }

• 2 loads, 1 store
• misses/iter = 0.5

jki (& kji):

    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++) {
            r = b[k][j];
            for (i = 0; i < n; i++)
                c[i][j] += a[i][k] * r;
        }
    }

• 2 loads, 1 store
• misses/iter = 2.0

