18-447: Computer Architecture Lecture 25: Main Memory Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013
Transcript
Page 1: 18-447: Computer Architecture, Lecture 25: Main Memory, Prof. Onur Mutlu, Carnegie Mellon University

18-447: Computer Architecture

Lecture 25: Main Memory

Prof. Onur Mutlu

Carnegie Mellon University

Spring 2013, 4/3/2013

Page 2:

Reminder: Homework 5 (Today)

Due April 3 (Wednesday!)

Topics: Vector processing, VLIW, Virtual memory, Caching

Page 3:

Reminder: Lab Assignment 5 (Friday)

Lab Assignment 5

Due Friday, April 5

Modeling caches and branch prediction at the microarchitectural level (cycle level) in C

Extra credit: Cache design optimization

Size, block size, associativity

Replacement and insertion policies

Cache indexing policies

Anything else you would like

Page 4:

Heads Up: Midterm II in Two Weeks

April 17

Similar format as Midterm I

Page 5:

Last Lecture

Wrap up virtual memory – cache interaction

Virtually-indexed physically-tagged caches

Solutions to the synonym problem

Improving cache (and memory hierarchy) performance

Cheaper alternatives to more associativity

Blocking and code reorganization

Memory-level-parallelism (MLP) aware cache replacement

Enabling multiple accesses in parallel

Page 6:

Today

Enabling multiple accesses in parallel

Main memory

Page 7:

Improving Basic Cache Performance

Reducing miss rate

More associativity

Alternatives/enhancements to associativity

Victim caches, hashing, pseudo-associativity, skewed associativity

Better replacement/insertion policies

Software approaches

Reducing miss latency/cost

Multi-level caches

Critical word first

Subblocking/sectoring

Better replacement/insertion policies

Non-blocking caches (multiple cache misses in parallel)

Multiple accesses per cycle

Software approaches

Page 8:

Review: Memory Level Parallelism (MLP)

Memory Level Parallelism (MLP) means generating and servicing multiple memory accesses in parallel [Glew’98]

Several techniques to improve MLP (e.g., out-of-order execution)

MLP varies. Some misses are isolated and some parallel

How does this affect cache replacement?

[Timeline figure: miss A is isolated; misses B and C overlap in time (parallel misses)]

Page 9:

Review: Fewest Misses = Best Performance

[Figure: access sequence S1 P4 P3 P2 P1 P1 P2 P3 P4 S2 S3. Belady's OPT replacement: Misses=4, Stalls=4. MLP-aware replacement: Misses=6, Stalls=2, saving cycles because the parallel P misses overlap]

Page 10:

Reading: MLP-Aware Cache Replacement

How do we incorporate MLP into replacement decisions?

Qureshi et al., “A Case for MLP-Aware Cache Replacement,” ISCA 2006.

Required reading for this week

Page 11:

Enabling Multiple Outstanding Misses

Page 12:

Handling Multiple Outstanding Accesses

Non-blocking or lockup-free caches

Kroft, “Lockup-Free Instruction Fetch/Prefetch Cache Organization," ISCA 1981.

Question: If the processor can generate multiple cache accesses, can the later accesses be handled while a previous miss is outstanding?

Idea: Keep track of the status/data of misses that are being handled in Miss Status Handling Registers (MSHRs)

A cache access checks MSHRs to see if a miss to the same block is already pending.

If pending, a new request is not generated

If pending and the needed data available, data forwarded to later load

Requires buffering of outstanding miss requests

Page 13:

Non-Blocking Caches (and MLP)

Enable cache access when there is a pending miss

Enable multiple misses in parallel

Memory-level parallelism (MLP)

generating and servicing multiple memory accesses in parallel

Why generate multiple misses?

Enables latency tolerance: overlaps latency of different misses

How to generate multiple misses?

Out-of-order execution, multithreading, runahead, prefetching

[Timeline figure: miss A is isolated; misses B and C overlap in time (parallel misses)]

Page 14:

Miss Status Handling Register

Also called “miss buffer”

Keeps track of

Outstanding cache misses

Pending load/store accesses that refer to the missing cache block

Fields of a single MSHR entry

Valid bit

Cache block address (to match incoming accesses)

Control/status bits (prefetch, issued to memory, which subblocks have arrived, etc)

Data for each subblock

For each pending load/store

Valid, type, data size, byte in block, destination register or store buffer entry address

Page 15:

Miss Status Handling Register Entry

Page 16:

MSHR Operation

On a cache miss:

Search MSHRs for a pending access to the same block

Found: Allocate a load/store entry in the same MSHR entry

Not found: Allocate a new MSHR

No free entry: stall

When a subblock returns from the next level in memory

Check which loads/stores are waiting for it

Forward data to the load/store unit

Deallocate load/store entry in the MSHR entry

Write subblock in cache or MSHR

If last subblock, deallocate the MSHR (after writing the block in cache)
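The allocate/merge/stall decisions above can be sketched in a few lines. This is a minimal Python sketch, not the slide's hardware; all class and method names are hypothetical, and subblock tracking is omitted for brevity:

```python
class MSHR:
    """One miss-status entry: one missing block plus its waiting accesses."""
    def __init__(self, block_addr):
        self.block_addr = block_addr
        self.waiting = []          # pending load/store descriptors

class MSHRFile:
    def __init__(self, num_entries):
        self.entries = {}          # block_addr -> MSHR
        self.num_entries = num_entries

    def on_miss(self, block_addr, access):
        """Returns 'merged', 'allocated', or 'stall', matching the three cases."""
        if block_addr in self.entries:             # pending miss to same block
            self.entries[block_addr].waiting.append(access)
            return "merged"                        # no new memory request issued
        if len(self.entries) == self.num_entries:  # no free MSHR entry
            return "stall"
        mshr = MSHR(block_addr)                    # allocate a new MSHR
        mshr.waiting.append(access)
        self.entries[block_addr] = mshr
        return "allocated"                         # request sent to next level

    def on_fill(self, block_addr):
        """Block returned from memory: free the MSHR, forward data to waiters."""
        mshr = self.entries.pop(block_addr)
        return mshr.waiting                        # hand to the load/store unit
```

For example, a second load to an already-pending block is merged rather than generating a new request, and a miss with all entries occupied stalls the pipeline.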

Page 17:

Non-Blocking Cache Implementation

When to access the MSHRs?

In parallel with the cache?

After cache access is complete?

MSHRs need not be on the critical path of hit requests

Which one below is the common case?

Cache miss, MSHR hit

Cache hit

Page 18:

Enabling High Bandwidth Caches

(and Memories in General)

Page 19:

Multiple Instructions per Cycle

Can generate multiple cache accesses per cycle

How do we ensure the cache can handle multiple accesses in the same clock cycle?

Solutions:

true multi-porting

virtual multi-porting (time sharing a port)

multiple cache copies

banking (interleaving)

Page 20:

Handling Multiple Accesses per Cycle (I)

True multiporting

Each memory cell has multiple read or write ports

+ Truly concurrent accesses (no conflicts regardless of address)

-- Expensive in terms of latency, power, area

What about read and write to the same location at the same time?

Peripheral logic needs to handle this

Page 21:

Peripheral Logic for True Multiporting

Page 22:

Peripheral Logic for True Multiporting

Page 23:

Handling Multiple Accesses per Cycle (I)

Virtual multiporting

Time-share a single port

Each access needs to be (significantly) shorter than clock cycle

Used in Alpha 21264

Is this scalable?

Page 24:

Handling Multiple Accesses per Cycle (II)

Multiple cache copies

Stores update both caches

Loads proceed in parallel

Used in Alpha 21164

Scalability?

Store operations form a bottleneck

Area proportional to “ports”

[Figure: two cache copies; Cache Copy 1 serves loads and stores on Port 1, Cache Copy 2 serves loads on Port 2, and stores update both copies]

Page 25:

Handling Multiple Accesses per Cycle (III)

Banking (Interleaving)

Bits in the address determine which bank an address maps to

Address space partitioned into separate banks

Which bits to use for “bank address”?

+ No increase in data store area

-- Cannot satisfy multiple accesses to the same bank

-- Crossbar interconnect needed at input/output

Bank conflicts

Two accesses are to the same bank

How can these be reduced?

Hardware? Software?

[Figure: Bank 0 holds even addresses; Bank 1 holds odd addresses]

Page 26:

General Principle: Interleaving

Interleaving (banking)

Problem: a single monolithic memory array takes long to access and does not enable multiple accesses in parallel

Goal: Reduce the latency of memory array access and enable multiple accesses in parallel

Idea: Divide the array into multiple banks that can be accessed independently (in the same cycle or in consecutive cycles)

Each bank is smaller than the entire memory storage

Accesses to different banks can be overlapped

Issue: How do you map data to different banks? (i.e., how do you interleave data across banks?)
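One common answer to the mapping question is low-order interleaving: take the bank index from the address bits just above the block offset, so consecutive blocks fall in consecutive banks. A minimal sketch, with illustrative bit positions (not a specific machine's layout):

```python
def bank_index(addr, block_bits, num_bank_bits):
    """Low-order interleaving: the bank index comes from the address bits
    just above the block offset (block_bits), so consecutive blocks map
    to consecutive banks."""
    return (addr >> block_bits) & ((1 << num_bank_bits) - 1)
```

With 64B blocks (6 offset bits) and 8 banks, addresses 0x40 and 0x80 land in banks 1 and 2, and the mapping wraps around every 8 blocks.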

Page 27:

Main Memory

Page 28:

Main Memory in the System

[Die floorplan figure: CORE 0-3, each with a private L2 cache (L2 CACHE 0-3), a SHARED L3 CACHE, the DRAM INTERFACE, DRAM BANKS, and the DRAM MEMORY CONTROLLER]

Page 29:

The Memory Chip/System Abstraction

Page 30:

Review: Memory Bank Organization

Read access sequence:

1. Decode row address & drive word-lines

2. Selected bits drive bit-lines

• Entire row read

3. Amplify row data

4. Decode column address & select subset of row

• Send to output

5. Precharge bit-lines

• For next access

Page 31:

Review: SRAM (Static Random Access Memory)

[Figure: SRAM bank. An (n+m)-bit address splits into n row bits and m column bits; the 2^n-row x 2^m-column bit-cell array (n ~ m to minimize overall latency) drives differential bitline pairs (bitline, _bitline) through sense amps and a column mux to a 1-bit output; row select drives the selected wordline]

Read Sequence

1. address decode

2. drive row select

3. selected bit-cells drive bitlines

(entire row is read together)

4. diff. sensing and col. select

(data is ready)

5. precharge all bitlines

(for next read or write)

Access latency dominated by steps 2 and 3

Cycling time dominated by steps 2, 3 and 5

- Step 2 is proportional to 2^m

- Steps 3 and 5 are proportional to 2^n

Page 32:

Review: DRAM (Dynamic Random Access Memory)

[Figure: DRAM bank. RAS latches the n row bits and CAS the m column bits of the address; the 2^n-row x 2^m-column bit-cell array (n ~ m to minimize overall latency) drives single-ended bitlines through sense amps and a column mux; each 1T-1C cell is gated by the row enable line]

A DRAM die comprises multiple such arrays

Bits stored as charge on a node capacitance (non-restorative)

- bit cell loses charge when read

- bit cell loses charge over time

Read Sequence

1~3 same as SRAM

4. a “flip-flopping” sense amp amplifies and regenerates the bitline, data bit is mux’ed out

5. precharge all bitlines

Refresh: A DRAM controller must periodically read all rows within the allowed refresh time (10s of ms) so that the charge in the cells is restored
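The refresh requirement turns into simple arithmetic: if every row must be refreshed once per window, evenly spaced per-row refreshes are window/rows apart. A sketch, using a 64 ms window and 8192 rows as illustrative (not slide-given) numbers:

```python
def refresh_interval_us(refresh_window_ms, num_rows):
    """Evenly spaced refreshes: each row must be refreshed once per window,
    so consecutive row refreshes are window/num_rows apart (in microseconds)."""
    return refresh_window_ms * 1000 / num_rows
```

With a 64 ms window and 8192 rows, the controller issues a row refresh roughly every 7.8 us, and each refresh makes the bank briefly unavailable for normal accesses.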

Page 33:

Review: DRAM vs. SRAM

DRAM

Slower access (capacitor)

Higher density (1T 1C cell)

Lower cost

Requires refresh (power, performance, circuitry)

Manufacturing requires putting capacitor and logic together

SRAM

Faster access (no capacitor)

Lower density (6T cell)

Higher cost

No need for refresh

Manufacturing compatible with logic process (no capacitor)

Page 34:

Some Fundamental Concepts (I)

Physical address space

Maximum size of main memory: total number of uniquely identifiable locations

Physical addressability

Minimum size of data in memory that can be addressed

Byte-addressable, word-addressable, 64-bit-addressable

Addressability depends on the abstraction level of the implementation

Alignment

Does the hardware support unaligned access transparently to software?

Interleaving

Page 35:

Some Fundamental Concepts (II)

Interleaving (banking)

Problem: a single monolithic memory array takes long to access and does not enable multiple accesses in parallel

Goal: Reduce the latency of memory array access and enable multiple accesses in parallel

Idea: Divide the array into multiple banks that can be accessed independently (in the same cycle or in consecutive cycles)

Each bank is smaller than the entire memory storage

Accesses to different banks can be overlapped

Issue: How do you map data to different banks? (i.e., how do you interleave data across banks?)

Page 36:

Interleaving

Page 37:

Interleaving Options

Page 38:

Some Questions/Concepts

Remember CRAY-1 with 16 banks

11 cycle bank latency

Consecutive words in memory in consecutive banks (word interleaving)

1 access can be started (and finished) per cycle

Can banks be operated fully in parallel?

Multiple accesses started per cycle?

What is the cost of this?

We have seen it earlier (today)

Modern superscalar processors have L1 data caches with multiple, fully-independent banks
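The CRAY-1 numbers above can be checked with a small cycle-level sketch (my illustrative model, not the machine's exact timing): with word interleaving, a bank is revisited only every num_banks accesses, so as long as num_banks exceeds the bank latency, one sequential access can start every cycle.

```python
def sequential_finish_cycle(num_banks, bank_latency, num_accesses):
    """Word-interleaved sequential accesses, at most one start per cycle;
    a bank stays busy for bank_latency cycles after an access starts.
    Returns the cycle in which the last access starts."""
    busy_until = [0] * num_banks
    cycle = 0
    for i in range(num_accesses):
        bank = i % num_banks
        # start next cycle, or wait until the target bank frees up
        cycle = max(cycle + 1, busy_until[bank] + 1)
        busy_until[bank] = cycle + bank_latency - 1
    return cycle
```

With 16 banks and an 11-cycle bank latency, 16 sequential accesses start in 16 consecutive cycles (no bank conflict); shrink to 8 banks and the same stream stalls, finishing later.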

Page 39:

The Bank Abstraction

Page 40:

Rank

Page 41:

The DRAM Subsystem

Page 42:

DRAM Subsystem Organization

Channel

DIMM

Rank

Chip

Bank

Row/Column

Page 43:

The DRAM Bank Structure

Page 44:

Page Mode DRAM

A DRAM bank is a 2D array of cells: rows x columns

A “DRAM row” is also called a “DRAM page”

“Sense amplifiers” also called “row buffer”

Each address is a <row,column> pair

Access to a “closed row”

Activate command opens row (placed into row buffer)

Read/write command reads/writes column in the row buffer

Precharge command closes the row and prepares the bank for next access

Access to an “open row”

No need for activate command

Page 45:

DRAM Bank Operation

[Animated figure: row decoder, column mux, and row buffer. Accesses (Row 0, Column 0), (Row 0, Column 1), and (Row 0, Column 85) hit in the row buffer once Row 0 is open; the next access (Row 1, Column 0) is a row-buffer CONFLICT, requiring Row 0 to be closed and Row 1 activated]

Page 46:

The DRAM Chip

Consists of multiple banks (2-16 in Synchronous DRAM)

Banks share command/address/data buses

The chip itself has a narrow interface (4-16 bits per read)

Page 47:

128M x 8-bit DRAM Chip

Page 48:

DRAM Rank and Module

Rank: Multiple chips operated together to form a wide interface

All chips comprising a rank are controlled at the same time

Respond to a single command

Share address and command buses, but provide different data

A DRAM module consists of one or more ranks

E.g., DIMM (dual inline memory module)

This is what you plug into your motherboard

If we have chips with 8-bit interface, to read 8 bytes in a single access, use 8 chips in a DIMM
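The chip-count arithmetic in that last bullet generalizes directly: the number of chips per rank is the bus width divided by the per-chip interface width. A one-line sketch:

```python
def chips_per_rank(bus_bits, chip_bits):
    """A rank gangs narrow chips side by side until they fill the data bus:
    e.g., a 64-bit bus built from 8-bit chips needs 8 chips."""
    assert bus_bits % chip_bits == 0, "chip width must divide the bus width"
    return bus_bits // chip_bits
```

So a 64-bit DIMM rank needs 8 x8 chips, or 16 x4 chips, all responding to the same command in lockstep.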

Page 49:

A 64-bit Wide DIMM (One Rank)

[Figure: eight DRAM chips share the command bus; each contributes 8 bits of the 64-bit data bus]

Page 50:

A 64-bit Wide DIMM (One Rank)

Advantages:

Acts like a high-capacity DRAM chip with a wide interface

Flexibility: the memory controller does not need to deal with individual chips

Disadvantages:

Granularity: accesses cannot be smaller than the interface width

Page 51:

Multiple DIMMs


Advantages:

Enables even higher capacity

Disadvantages:

Interconnect complexity and energy consumption can be high

Page 52:

DRAM Channels

2 Independent Channels: 2 Memory Controllers (Above)

2 Dependent/Lockstep Channels: 1 Memory Controller with wide interface (Not Shown above)

Page 53:

Generalized Memory Structure

Page 54:

Generalized Memory Structure

Page 55:

The DRAM Subsystem

The Top Down View

Page 56:

DRAM Subsystem Organization

Channel

DIMM

Rank

Chip

Bank

Row/Column

Page 57:

The DRAM subsystem

[Figure: a processor with two memory channels, each channel connecting to DIMMs (dual in-line memory modules)]

Page 58:

Breaking down a DIMM

DIMM (Dual in-line memory module)

[Figure: side, front, and back views of a DIMM]

Page 59:

Breaking down a DIMM

DIMM (Dual in-line memory module)

[Figure: the front of the DIMM holds Rank 0 (a collection of 8 chips); the back holds Rank 1]

Page 60:

Rank

[Figure: Rank 0 (front) and Rank 1 (back) share the memory channel's Data <0:63> and Addr/Cmd buses; chip select CS <0:1> picks which rank responds]

Page 61:

Breaking down a Rank

[Figure: Rank 0's Data <0:63> bus splits across Chips 0-7; Chip 0 drives bits <0:7>, Chip 1 bits <8:15>, ..., Chip 7 bits <56:63>]

Page 62:

Breaking down a Chip

[Figure: Chip 0's 8-bit interface <0:7> is shared by its multiple internal banks (Bank 0, ...)]

Page 63:

Breaking down a Bank

[Figure: Bank 0 contains 16K rows (row 0 ... row 16k-1) of 2kB each; an activated row is held in the row buffer, from which 1B columns are read out over the 8-bit interface <0:7>]

Page 64:

DRAM Subsystem Organization

Channel

DIMM

Rank

Chip

Bank

Row/Column

Page 65:

Example: Transferring a cache block

[Figure: physical memory space from 0x00 to 0xFFFF…F; the 64B cache block at address 0x40 maps to Channel 0, DIMM 0, Rank 0]

Page 66:

[Animation frame: within Rank 0, Chips 0-7 each supply 8 bits (<0:7>, <8:15>, ..., <56:63>) of Data <0:63>]

Page 67:

[Animation frame: the controller reads Row 0, Col 0]

Page 68:

[Animation frame: Row 0, Col 0 delivers the first 8B of the block over the 64-bit data bus]

Page 69:

[Animation frame: the controller reads Row 0, Col 1]

Page 70:

[Animation frame: Row 0, Col 1 delivers the next 8B]

Page 71:

[Animation frame: the remaining columns are read the same way]

A 64B cache block takes 8 I/O cycles to transfer.

During the process, 8 columns are read sequentially.
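That 8-beat count is just block size divided by bus width; a tiny sketch makes the arithmetic explicit:

```python
def transfer_beats(block_bytes, bus_bytes):
    """Bus beats (I/O cycles) needed to move one cache block over the
    data bus: each beat moves bus_bytes, so 64B over an 8B bus takes 8."""
    assert block_bytes % bus_bytes == 0
    return block_bytes // bus_bytes
```

Each beat corresponds to reading one more column out of the open row buffer.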

Page 72:

Latency Components: Basic DRAM Operation

CPU → controller transfer time

Controller latency

Queuing & scheduling delay at the controller

Access converted to basic commands

Controller → DRAM transfer time

DRAM bank latency

Simple CAS if row is “open” OR

RAS + CAS if array precharged OR

PRE + RAS + CAS (worst case)

DRAM → CPU transfer time (through controller)
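The three bank-latency cases above map to three command sequences; here is a minimal Python sketch of the controller's decision (names and structure are my own, not a real controller's):

```python
def bank_access_commands(open_row, target_row):
    """Command sequence for one bank access, per the three cases:
    row hit (CAS only), row closed (RAS + CAS), row conflict (PRE + RAS + CAS).
    open_row is None when the bank's array is precharged (no row open)."""
    if open_row == target_row:
        return ["CAS"]                 # row hit: column access only
    if open_row is None:
        return ["RAS", "CAS"]          # array precharged: activate, then read
    return ["PRE", "RAS", "CAS"]       # conflict: close old row first (worst case)
```

Summing per-command timings over such sequences is essentially how a cycle-level DRAM bank model computes access latency.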

Page 73:

Multiple Banks (Interleaving) and Channels

Multiple banks

Enable concurrent DRAM accesses

Bits in address determine which bank an address resides in

Multiple independent channels serve the same purpose

But they are even better because they have separate data buses

Increased bus bandwidth

Enabling more concurrency requires reducing

Bank conflicts

Channel conflicts

How to select/randomize bank/channel indices in address?

Lower order bits have more entropy

Randomizing hash functions (XOR of different address bits)

Page 74:

How Multiple Banks/Channels Help

Page 75:

Multiple Channels

Advantages

Increased bandwidth

Multiple concurrent accesses (if independent channels)

Disadvantages

Higher cost than a single channel

More board wires

More pins (if on-chip memory controller)

Page 76:

Address Mapping (Single Channel)

Single-channel system with 8-byte memory bus

2GB memory, 8 banks, 16K rows & 2K columns per bank

Row interleaving

Consecutive rows of memory in consecutive banks

Cache block interleaving

Consecutive cache block addresses in consecutive banks

64 byte cache blocks

Accesses to consecutive cache blocks can be serviced in parallel

How about random accesses? Strided accesses?

Row interleaving: Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)

Cache block interleaving: Row (14 bits) | High Column (8 bits) | Bank (3 bits) | Low Col. (3 bits) | Byte in bus (3 bits)
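The two mappings can be expressed as a field split of the physical address. A sketch assuming the bit widths on this slide (3-bit byte offset, 11-bit column, 3-bit bank, 14-bit row); the helper name is my own:

```python
def split_fields(addr, fields):
    """Split an address into named fields. fields lists (name, width)
    pairs from least to most significant bit."""
    out = {}
    for name, width in fields:
        out[name] = addr & ((1 << width) - 1)
        addr >>= width
    return out

# Row-interleaved layout from the slide, LSB -> MSB
row_interleaved = [("byte", 3), ("column", 11), ("bank", 3), ("row", 14)]
```

For cache block interleaving, the bank field would instead sit between the low 3 and high 8 column bits, so consecutive 64B blocks hit consecutive banks.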

Page 77:

Bank Mapping Randomization

DRAM controller can randomize the address mapping to banks so that bank conflicts are less likely

[Diagram: the 3-bit bank field of the address is XORed with 3 other address bits to produce the 3-bit bank index]
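One common randomizing hash XORs the bank bits with low-order row bits, so accesses that stride through rows at a fixed bank offset still spread across banks. A sketch with illustrative bit positions (assumptions, not the slide's exact wiring):

```python
def randomized_bank(addr, bank_shift=14, row_lo_shift=17, bits=3):
    """XOR-based bank index: the 3 bank bits (at bank_shift) XORed with
    3 low-order row bits (at row_lo_shift). Bit positions are illustrative."""
    mask = (1 << bits) - 1
    return ((addr >> bank_shift) & mask) ^ ((addr >> row_lo_shift) & mask)
```

Two addresses that differ only in their row bits, which would conflict under the plain mapping, can now land in different banks.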

Page 78:

Address Mapping (Multiple Channels)

Where are consecutive cache blocks?

[Diagram: the row-interleaved and cache-block-interleaved layouts repeated with a channel bit C inserted at different positions; where C sits determines whether consecutive cache blocks fall in the same channel or in different channels]

Page 79:

Interaction with Virtual-to-Physical Mapping

Operating System influences where an address maps to in DRAM

Operating system can control which bank/channel/rank a virtual page is mapped to.

It can perform page coloring to minimize bank conflicts

Or to minimize inter-application interference

VA: Virtual Page number (52 bits) | Page offset (12 bits)

PA: Physical Frame number (19 bits) | Page offset (12 bits)

PA: Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)
