CS 230: Computer Organization and Assembly Language
Aviral Shrivastava
Department of Computer Science and Engineering
School of Computing and Informatics
Arizona State University
Slides courtesy: Prof. Yann Hang Lee, ASU; Prof. Mary Jane Irwin, PSU; Ande Carle, UCB
Announcements
• Alternate Project
  – Due today
• Real examples
• Finals
  – Tuesday, Dec 08, 2009
  – Please come on time (you'll need all the time)
  – Open book, notes, and internet
  – No communication with any other human
Time, Time, Time
• Making a single-cycle implementation is very easy
  – The difficulty and excitement are in making it fast
• Two fundamental methods to make computers fast
  – Pipelining
  – Caches
[Figure: single-cycle datapath — PC and Instruction Memory feed register addresses into the Register File; read data goes through the ALU to address the Data Memory, with read-data and write-data paths back to the Register File]
Effect of high memory latency
• Single-cycle implementation
  – Cycle time becomes very large
  – Operations that do not need memory also slow down
[Figure: the same single-cycle datapath, with the slow Instruction Memory and Data Memory accesses on the critical path]
Effect of high memory latency
[Figure: multi-cycle datapath — a single memory for instructions and data, with intermediate registers (IR, MDR, A, B, ALUout) between the PC, Register File, and ALU]
• Multi-cycle implementation
  – Cycle time becomes long
• But
  – Can make the memory access itself take multiple cycles
  – Avoids penalizing instructions that do not use memory
Effects of high memory latency
[Figure: pipelined datapath — IM, Reg, ALU, DM, Reg stages]
• Pipelined implementation
  – Cycle time becomes long
• But
  – Can make the memory access itself take multiple cycles
  – Avoids penalizing instructions that do not use memory
  – Can overlap execution of other instructions with a memory operation
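To make the single-cycle vs. pipelined contrast concrete, here is a small Python sketch. The stage latencies are hypothetical round numbers chosen for illustration, not figures from these slides.

```python
# Hypothetical stage latencies in ns (illustrative only, not from the slides).
stage_ns = {"IM": 200, "Reg": 100, "ALU": 200, "DM": 200, "WB": 100}

def single_cycle_total(n_instr):
    # Single cycle: the clock must cover the entire path, so every
    # instruction pays the sum of all stage latencies.
    cycle = sum(stage_ns.values())  # 800 ns
    return n_instr * cycle

def pipelined_total(n_instr):
    # Pipelined: the clock only covers the slowest stage; once the
    # pipeline fills, one instruction completes per cycle.
    cycle = max(stage_ns.values())  # 200 ns
    n_stages = len(stage_ns)
    return (n_instr + n_stages - 1) * cycle

print(single_cycle_total(1000))  # 800000 ns
print(pipelined_total(1000))     # 200800 ns
```

With these latencies the pipeline is roughly 4x faster on a long run of instructions, limited by the 200 ns memory stages — which is exactly why memory latency still matters.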
Kinds of Memory
Technology       Capacity      Access time    Cost
CPU registers    100s bytes    < 10s ns
SRAM             K bytes       10-20 ns       $.00003/bit
DRAM             M bytes       50-100 ns      $.00001/bit
Disk             G bytes       ms             10^-6 cents/bit
Tape             infinite      sec-min
(Flip-flops, SRAM, DRAM, disk, tape: faster toward the top, larger toward the bottom)
Memories
• CPU registers, latches
  – Flip-flops: very fast, but very small
• SRAM – static RAM
  – Very fast, low power, but small
  – Data persists as long as there is power
• DRAM – dynamic RAM
  – Very dense
  – Like vanishing ink: the data disappears with time
  – Need to refresh the contents
Flip Flops
• Fastest form of memory
  – Store data using logic gates with feedback (no capacitors involved)
• SR, JK, T, and D flip-flops
SRAM Cell
[Figure: SRAM cell — the computer scientist's view (a stored bit) vs. the electrical engineering view (cross-coupled inverters connected to bit lines b and b')]
A 4-bit SRAM
[Figure: a 4-bit SRAM — one word line selects four SRAM cells; write drivers for Din 0-3, gated by WrEn, and precharge logic drive the bit lines]
A 16x4 Static RAM (SRAM)
[Figure: a 16x4 SRAM — a 4-bit address (A0-A3) feeds an address decoder that asserts one of 16 word lines (Word 0 through Word 15); each word line enables a row of four SRAM cells; sense amps produce Dout 0-3, while write drivers with WrEn and precharge drive Din 0-3 onto the bit lines]
Dynamic RAM (DRAM)
• The value is stored on a capacitor
  – Discharges with time
  – Needs to be refreshed regularly
  – A dummy read recharges the capacitor
• Very high density
  – The newest process technology is tried on DRAMs first
• Intel became popular because of DRAM
  – Was the biggest vendor of DRAM
[Figure: DRAM cell — a pass transistor, gated by the word line, connects the capacitor to the bit line]
Why Not Only DRAM?
• Not large enough for some things
  – Backed up by storage (disk)
  – Virtual memory, paging, etc.
  – Will get back to this
• Not fast enough for processor accesses
  – Takes hundreds of cycles to return data
  – OK in very regular applications
    • Can use SW pipelining, vectors
  – Not OK in most other applications
Is there a problem with DRAM?
[Figure: processor-DRAM memory gap (latency), 1980-2000 — processor performance grows 60%/yr (2x every 1.5 years, "Moore's Law") while DRAM performance grows 9%/yr (2x every 10 years); the processor-memory performance gap grows 50% per year]
Memory Hierarchy Analogy: Library (1/2)
• You're writing an Anthropology term paper at a table in Hayden
• Hayden Library is equivalent to disk
  – Essentially limitless capacity
  – Very slow to retrieve a book
• The table is memory
  – Smaller capacity: you must return a book when the table fills up
  – Easier and faster to find a book there once you've already retrieved it
Memory Hierarchy Analogy: Library (2/2)
• Open books on the table are the cache
  – Smaller capacity: only a few open books fit on the table; again, when the table fills up, you must close a book
  – Much, much faster to retrieve data
• Illusion created: the whole library is open on the tabletop
  – Keep as many recently used books open on the table as possible, since they are likely to be used again
  – Also keep as many books on the table as possible, since that is faster than going to the library
Memory Hierarchy: Goals
• Fact: large memories are slow; fast memories are small
• How do we create a memory that gives the illusion of being large, cheap, and fast (most of the time)?
Memory Hierarchy: Insights
• Temporal locality (locality in time)
  => Keep the most recently accessed data items closer to the processor
• Spatial locality (locality in space)
  => Move blocks consisting of contiguous words to the upper levels
[Figure: the upper-level memory exchanges blocks (Blk X, Blk Y) with the lower-level memory; the processor reads from and writes to the upper level]
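Both kinds of locality show up in ordinary loop code. The Python sketch below contrasts a row-major matrix traversal (good spatial locality: consecutive accesses touch adjacent elements) with a column-major one (consecutive accesses jump between rows). In Python the timing effect is muted by the list-of-lists layout, so treat this as an illustration of the access pattern rather than a benchmark; in C the difference is dramatic.

```python
N = 100
matrix = [[i * N + j for j in range(N)] for i in range(N)]

def sum_row_major(m):
    # Good spatial locality: the inner loop walks consecutive elements
    # of one row, so nearby accesses fall in the same cache blocks.
    total = 0
    for row in m:
        for x in row:
            total += x
    return total

def sum_col_major(m):
    # Poor spatial locality: consecutive accesses jump from row to row,
    # touching a different block on (almost) every access.
    total = 0
    for j in range(len(m[0])):
        for i in range(len(m)):
            total += m[i][j]
    return total

# Same result either way; only the memory access pattern differs.
print(sum_row_major(matrix) == sum_col_major(matrix))  # True
```

The repeated use of `total` in the inner loop is temporal locality at work: a recently used value stays in a register or the cache.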
Memory Hierarchy: Solution
Level          Capacity     Access time            Cost                      Staging/xfer unit              Managed by
Registers      100s bytes   < 10s ns                                        instr. operands (1-8 bytes)    program/compiler
Cache          K bytes      10-100 ns              1-0.1 cents/bit           blocks (8-128 bytes)           cache controller
Main memory    M bytes      200-500 ns             $.0001-.00001 cents/bit   pages (4K-16K bytes)           OS
Disk           G bytes      10 ms (10,000,000 ns)  10^-5 - 10^-6 cents/bit   files (Mbytes)                 user/operator
Tape           infinite     sec-min                10^-8 cents/bit
(Upper levels are faster; lower levels are larger. Our current focus: the cache.)
Memory Hierarchy: Terminology
• Hit: the data appears in some block in the upper level (Block X)
  – Hit rate: fraction of memory accesses found in the upper level
  – Hit time: time to access the upper level, which consists of
    • RAM access time + time to determine hit/miss
• Miss: the data needs to be retrieved from a block in the lower level (Block Y)
  – Miss rate = 1 - (hit rate)
  – Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor
  – Hit time << miss penalty
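These terms combine into the standard average memory access time (AMAT) formula, AMAT = hit time + miss rate x miss penalty. The formula is textbook material rather than something stated on this slide; the numbers below reuse the figures from the worked example that follows (2-cycle cache, 10% miss rate, 100-cycle memory latency).

```python
def amat(hit_time, miss_rate, miss_penalty):
    # Average memory access time: every access pays the hit time to
    # check the upper level; misses additionally pay the miss penalty.
    return hit_time + miss_rate * miss_penalty

print(amat(2, 0.10, 100))  # 12.0 cycles per memory access on average
```

Note the modeling choice: here the miss penalty is the cost paid on top of the hit check, which is the usual textbook convention.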
Memory Hierarchy: Show Me Numbers
• Consider an application where
  – 30% of instructions are loads/stores
  – Memory latency = 100 cycles
  – Time to execute 100 instructions = 70*1 + 30*100 = 3070 cycles
• Add a cache with a 2-cycle latency
  – Suppose the hit rate is 90%
  – Time to execute 100 instructions = 70*1 + 27*2 + 3*100 = 70 + 54 + 300 = 424 cycles
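The arithmetic above can be checked with a short Python sketch. The function names are mine; the model matches the slide, where non-memory instructions take 1 cycle and a cache miss costs the full memory latency.

```python
def cycles_without_cache(n_instr, mem_frac, mem_latency):
    # Non-memory instructions take 1 cycle; every load/store pays
    # the full memory latency.
    mem_ops = round(n_instr * mem_frac)
    return (n_instr - mem_ops) * 1 + mem_ops * mem_latency

def cycles_with_cache(n_instr, mem_frac, mem_latency, cache_latency, hit_rate):
    # Hits pay the cache latency; misses pay the full memory latency.
    mem_ops = round(n_instr * mem_frac)
    hits = round(mem_ops * hit_rate)
    misses = mem_ops - hits
    return (n_instr - mem_ops) * 1 + hits * cache_latency + misses * mem_latency

print(cycles_without_cache(100, 0.30, 100))        # 3070
print(cycles_with_cache(100, 0.30, 100, 2, 0.90))  # 424
```

Even a modest 90% hit rate cuts execution time by more than 7x, which is why the rest of the course focuses on caches.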
Yoda says…
"You will find only what you bring in."