Memory Scaling: A Systems Architecture Perspective Onur Mutlu [email protected] August 6, 2013 MemCon 2013
Transcript
Page 1:

Memory Scaling: A Systems Architecture Perspective

Onur Mutlu [email protected] August 6, 2013 MemCon 2013

Page 2:

The Main Memory System

- Main memory is a critical component of all computing systems: server, mobile, embedded, desktop, sensor

- Main memory system must scale (in size, technology, efficiency, cost, and management algorithms) to maintain performance growth and technology scaling benefits

[Diagram: Processor and caches → Main Memory → Storage (SSD/HDD)]

Page 3:

Memory System: A Shared Resource View


Page 4:

State of the Main Memory System

- Recent technology, architecture, and application trends
  - lead to new requirements
  - exacerbate old requirements

- DRAM and memory controllers, as we know them today, are (and will be) unlikely to satisfy all requirements

- Some emerging non-volatile memory technologies (e.g., PCM) enable new opportunities: memory+storage merging

- We need to rethink the main memory system
  - to fix DRAM issues and enable emerging technologies
  - to satisfy all requirements

Page 5:

Agenda

- Major Trends Affecting Main Memory
- The DRAM Scaling Problem and Solution Directions
  - Tolerating DRAM: New DRAM Architectures
  - Enabling Emerging Technologies: Hybrid Memory Systems
- How Can We Do Better?
- Summary

Page 6:

Major Trends Affecting Main Memory (I)

- Need for main memory capacity, bandwidth, QoS increasing

- Main memory energy/power is a key system design concern

- DRAM technology scaling is ending

Page 7:

Major Trends Affecting Main Memory (II)

- Need for main memory capacity, bandwidth, QoS increasing
  - Multi-core: increasing number of cores/agents
  - Data-intensive applications: increasing demand/hunger for data
  - Consolidation: cloud computing, GPUs, mobile, heterogeneity

- Main memory energy/power is a key system design concern

- DRAM technology scaling is ending

Page 8:

Example: The Memory Capacity Gap

- Memory capacity per core expected to drop by 30% every two years
- Trends are even worse for memory bandwidth per core!

[Chart: core count doubling ~every 2 years while DRAM DIMM capacity doubles only ~every 3 years]

Page 9:

Major Trends Affecting Main Memory (III)

- Need for main memory capacity, bandwidth, QoS increasing

- Main memory energy/power is a key system design concern
  - ~40-50% of energy spent in the off-chip memory hierarchy [Lefurgy, IEEE Computer 2003]
  - DRAM consumes power even when not used (periodic refresh)

- DRAM technology scaling is ending

Page 10:

Major Trends Affecting Main Memory (IV)

- Need for main memory capacity, bandwidth, QoS increasing

- Main memory energy/power is a key system design concern

- DRAM technology scaling is ending
  - ITRS projects DRAM will not scale easily below X nm
  - Scaling has provided many benefits: higher capacity (density), lower cost, lower energy

Page 11:

Agenda

- Major Trends Affecting Main Memory
- The DRAM Scaling Problem and Solution Directions
  - Tolerating DRAM: New DRAM Architectures
  - Enabling Emerging Technologies: Hybrid Memory Systems
- How Can We Do Better?
- Summary

Page 12:

The DRAM Scaling Problem

- DRAM stores charge in a capacitor (charge-based memory)
  - Capacitor must be large enough for reliable sensing
  - Access transistor should be large enough for low leakage and high retention time
  - Scaling beyond 40-35nm (2013) is challenging [ITRS, 2009]

- DRAM capacity, cost, and energy/power are hard to scale

Page 13:

Solutions to the DRAM Scaling Problem

- Two potential solutions
  - Tolerate DRAM (by taking a fresh look at it)
  - Enable emerging memory technologies to eliminate/minimize DRAM

- Do both
  - Hybrid memory systems

Page 14:

Solution 1: Tolerate DRAM

- Overcome DRAM shortcomings with
  - System-DRAM co-design
  - Novel DRAM architectures, interface, functions
  - Better waste management (efficient utilization)

- Key issues to tackle
  - Reduce refresh energy
  - Improve bandwidth and latency
  - Reduce waste
  - Enable reliability at low cost

- Liu, Jaiyen, Veras, Mutlu, “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.
- Kim, Seshadri, Lee+, “A Case for Exploiting Subarray-Level Parallelism in DRAM,” ISCA 2012.
- Lee+, “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,” HPCA 2013.
- Liu+, “An Experimental Study of Data Retention Behavior in Modern DRAM Devices,” ISCA 2013.
- Seshadri+, “RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data,” CMU Tech Report 2013.

Page 15:

Solution 2: Emerging Memory Technologies

- Some emerging resistive memory technologies seem more scalable than DRAM (and they are non-volatile)
- Example: Phase Change Memory
  - Expected to scale to 9nm (2022 [ITRS])
  - Expected to be denser than DRAM: can store multiple bits/cell

- But emerging technologies have shortcomings as well
  - Can they be enabled to replace/augment/surpass DRAM?

- Lee, Ipek, Mutlu, Burger, “Architecting Phase Change Memory as a Scalable DRAM Alternative,” ISCA 2009, CACM 2010, Top Picks 2010.
- Meza, Chang, Yoon, Mutlu, Ranganathan, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters 2012.
- Yoon, Meza et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD 2012.
- Kultursay+, “Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative,” ISPASS 2013.

Page 16:

Hybrid Memory Systems

Meza+, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters, 2012.
Yoon, Meza et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD 2012 Best Paper Award.

[Diagram: CPU with a DRAM controller and a PCM controller. DRAM: fast and durable, but small, leaky, volatile, high-cost. Phase Change Memory (or Tech. X): large, non-volatile, low-cost, but slow, wears out, and has high active energy]

Hardware/software manage data allocation and movement to achieve the best of multiple technologies

Page 17:

An Orthogonal Issue: Memory Interference

[Diagram: four cores sharing main memory]

Cores interfere with each other when accessing shared main memory

Page 18:

An Orthogonal Issue: Memory Interference

- Problem: Memory interference between cores is uncontrolled
  → unfairness, starvation, low performance
  → uncontrollable, unpredictable, vulnerable system

- Solution: QoS-Aware Memory Systems
  - Hardware designed to provide a configurable fairness substrate
    - Application-aware memory scheduling, partitioning, throttling
  - Software designed to configure the resources to satisfy different QoS goals

- QoS-aware memory controllers and interconnects can provide predictable performance and higher efficiency

Page 19:

Agenda

- Major Trends Affecting Main Memory
- The DRAM Scaling Problem and Solution Directions
  - Tolerating DRAM: New DRAM Architectures
  - Enabling Emerging Technologies: Hybrid Memory Systems
- How Can We Do Better?
- Summary

Page 20:

Tolerating DRAM: Example Techniques

- Retention-Aware DRAM Refresh: Reducing Refresh Impact
- Tiered-Latency DRAM: Reducing DRAM Latency
- RowClone: Accelerating Page Copy and Initialization
- Subarray-Level Parallelism: Reducing Bank Conflict Impact

Page 21:

DRAM Refresh

- DRAM capacitor charge leaks over time

- The memory controller needs to refresh each row periodically to restore charge
  - Activate each row every N ms
  - Typical N = 64 ms

- Downsides of refresh (a back-of-the-envelope estimate follows)
  - Energy consumption: each refresh consumes energy
  - Performance degradation: DRAM rank/bank unavailable while refreshed
  - QoS/predictability impact: (long) pause times during refresh
  - Refresh rate limits DRAM capacity scaling
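To make the performance cost concrete, here is a minimal back-of-the-envelope sketch. The timing parameters are illustrative assumptions for a DDR3-era device (tREFI and tRFC are not given on the slides):

#include <stdio.h>

/* Fraction of time a DRAM rank is unavailable due to refresh.
   With auto-refresh, one refresh command is issued every tREFI and
   occupies the rank for tRFC. Assumed values: tREFI = 7.8 us
   (64 ms spread over 8192 refresh commands) and tRFC = 350 ns
   (plausible for a high-density DDR3 device). */
int main(void) {
    double tREFI_ns = 64e6 / 8192.0;  /* refresh command interval */
    double tRFC_ns  = 350.0;          /* refresh cycle time (assumed) */
    printf("Rank busy refreshing: %.1f%% of the time\n",
           100.0 * tRFC_ns / tREFI_ns);
    return 0;
}

This prints about 4.5%; as chips get denser, tRFC grows, which is why the following slides project refresh overheads rising toward ~46%.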

Page 22:

Refresh Overhead: Performance

[Chart: refresh degrades performance by ~8% on today's devices, projected to grow to ~46% for future high-density devices]

Page 23:

Refresh Overhead: Energy

[Chart: refresh consumes ~15% of DRAM energy on today's devices, projected to grow to ~47% for future high-density devices]

Page 24:

Retention Time Profile of DRAM

[Chart: distribution of DRAM cell retention times; most rows retain data far longer than the worst-case 64 ms assumption, and only a small fraction of weak rows need frequent refresh]

Page 25:

RAIDR: Eliminating Unnecessary Refreshes

- Observation: Most DRAM rows can be refreshed much less often without losing data [Kim+, EDL'09][Liu+, ISCA'13]

- Key idea: Refresh rows containing weak cells more frequently, other rows less frequently
  1. Profiling: profile the retention time of all rows
  2. Binning: store rows in retention-time bins in the memory controller; efficient storage with Bloom filters (only 1.25KB for 32GB of memory), sketched below
  3. Refreshing: memory controller refreshes rows in different bins at different rates

- Results: 8-core, 32GB, SPEC, TPC-C, TPC-H
  - 74.6% refresh reduction @ 1.25KB storage
  - ~16%/20% DRAM dynamic/idle power reduction
  - ~9% performance improvement
  - Benefits increase with DRAM capacity

Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.
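The binning step can be approximated with a small Bloom filter per bin. Below is a minimal sketch under assumed parameters (one filter marking rows that need the full 64 ms rate, two hash functions, everything else refreshed at a relaxed 256 ms rate); it illustrates the idea rather than reproducing RAIDR's exact design:

#include <stdint.h>
#include <stdio.h>

#define FILTER_BITS 8192                    /* assumed 1KB filter */

static uint8_t weak_filter[FILTER_BITS / 8];

/* Two cheap hash functions over the row address (illustrative). */
static uint32_t h1(uint32_t row) { return (row * 2654435761u) % FILTER_BITS; }
static uint32_t h2(uint32_t row) { return (row * 40503u + 12345u) % FILTER_BITS; }

static void set_bit(uint32_t i) { weak_filter[i / 8] |= (uint8_t)(1u << (i % 8)); }
static int  get_bit(uint32_t i) { return (weak_filter[i / 8] >> (i % 8)) & 1; }

/* Profiling found a weak cell in this row: put the row in the fast bin. */
void mark_weak(uint32_t row) { set_bit(h1(row)); set_bit(h2(row)); }

/* Bloom filters give occasional false positives (a strong row refreshed
   at the fast rate) but never false negatives, which is the safe
   direction for refresh correctness. */
int needs_fast_refresh(uint32_t row) {
    return get_bit(h1(row)) && get_bit(h2(row));
}

int main(void) {
    mark_weak(42);  /* profiled weak row */
    printf("row 42: refresh every %d ms\n", needs_fast_refresh(42) ? 64 : 256);
    printf("row 7:  refresh every %d ms\n", needs_fast_refresh(7)  ? 64 : 256);
    return 0;
}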

Page 26:

Going Forward

- How to find and expose weak memory cells/rows
  - Early analysis of modern DRAM chips: Liu+, “An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms,” ISCA 2013.

- Low-cost system-level tolerance of DRAM errors

- Tolerating cell-to-cell interference at the system level
  - For both DRAM and Flash. Early analysis of Flash chips: Cai+, “Program Interference in MLC NAND Flash Memory: Characterization, Modeling, and Mitigation,” ICCD 2013.

Page 27:

Tolerating DRAM: Example Techniques

- Retention-Aware DRAM Refresh: Reducing Refresh Impact
- Tiered-Latency DRAM: Reducing DRAM Latency
- RowClone: Accelerating Page Copy and Initialization
- Subarray-Level Parallelism: Reducing Bank Conflict Impact

Page 28:

DRAM Latency-Capacity Trend

[Chart: DRAM capacity (Gb) and latency (tRC, ns) from 2000 to 2011; capacity grew ~16X while latency improved by only ~20%]

DRAM latency continues to be a critical bottleneck

Page 29:

What Causes the Long Latency?

[Diagram: a DRAM chip is a cell array organized into subarrays, plus I/O, connected to the channel]

DRAM Latency = Subarray Latency + I/O Latency, and the subarray latency is dominant

Page 30:

Why is the Subarray So Slow?

[Diagram: a subarray is a grid of cells (each a capacitor plus an access transistor on a wordline) connected by bitlines to sense amplifiers, driven by a row decoder; each bitline connects 512 cells to one large sense amplifier]

- Long bitline
  - Amortizes sense amplifier cost → small area
  - Large bitline capacitance → high latency & power

Page 31:

Trade-Off: Area (Die Size) vs. Latency

[Diagram: long bitline → smaller die area; short bitline → faster access]

Page 32:

Trade-Off: Area (Die Size) vs. Latency

[Chart: normalized DRAM area vs. latency (ns) for 32, 64, 128, 256, and 512 cells/bitline; commodity DRAM uses long bitlines (512 cells/bitline: cheaper but slower), while fancy low-latency DRAM uses short bitlines (faster but costlier)]

Page 33:

Approximating the Best of Both Worlds

- Long bitline: small area, but high latency
- Short bitline: low latency, but large area
- Our proposal: get the small area of the long bitline and the low latency of the short bitline; a short bitline is fast, so we need isolation: add isolation transistors

Page 34:

Approximating the Best of Both Worlds

- Long bitline: small area, high latency
- Short bitline: low latency, large area
- Our proposal, Tiered-Latency DRAM: small area using a long bitline, with low latency

Page 35:

Tiered-Latency DRAM

- Divide a bitline into two segments (near and far) with an isolation transistor

[Diagram: sense amplifier → near segment → isolation transistor → far segment]

Lee+, “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,” HPCA 2013.

Page 36:

Commodity DRAM vs. TL-DRAM

- DRAM latency (tRC): relative to commodity DRAM (52.5 ns), the near segment is ~56% faster and the far segment ~23% slower
- DRAM power: the near segment uses ~51% less, the far segment ~49% more
- DRAM area overhead: ~3%, mainly due to the isolation transistors

Page 37:

Trade-Off: Area (Die Size) vs. Latency

[Chart: normalized DRAM area vs. latency (ns) for 32-512 cells/bitline, with TL-DRAM's near and far segments marked; the near segment approaches short-bitline latency while the design as a whole keeps long-bitline area]

Page 38:

Leveraging Tiered-Latency DRAM

- TL-DRAM is a substrate that can be leveraged by the hardware and/or software
- Many potential uses (a sketch of the first follows)
  1. Use the near segment as a hardware-managed inclusive cache to the far segment
  2. Use the near segment as a hardware-managed exclusive cache to the far segment
  3. Profile-based page mapping by the operating system
  4. Simply replace DRAM with TL-DRAM
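As an illustration of use 1, here is a minimal sketch of controller-side lookup logic for an inclusive near-segment cache. The structures and parameters (near_tag, NEAR_ROWS, direct-mapped placement) are hypothetical simplifications, not the mechanism evaluated in the HPCA 2013 paper:

#include <stdint.h>

#define NEAR_ROWS 32          /* assumed near-segment rows per subarray */

/* Tag store: which far-segment row currently lives in each near row. */
static int32_t near_tag[NEAR_ROWS];

/* Direct-mapped placement for simplicity. */
static int near_slot(int32_t far_row) { return far_row % NEAR_ROWS; }

/* Returns the row to activate: a near row on a hit (low tRC), otherwise
   the far row (high tRC) after installing a copy in the near segment.
   Inclusive: the far segment always keeps its own copy. */
int32_t access_row(int32_t far_row, int *hit) {
    int slot = near_slot(far_row);
    if (near_tag[slot] == far_row) {   /* near-segment hit */
        *hit = 1;
        return slot;
    }
    *hit = 0;
    near_tag[slot] = far_row;          /* cache the row (in-DRAM copy) */
    return far_row;
}

int main(void) {
    for (int i = 0; i < NEAR_ROWS; i++) near_tag[i] = -1;
    int hit;
    access_row(100, &hit);   /* miss: served by the slow far segment */
    access_row(100, &hit);   /* hit: served by the fast near segment */
    return hit ? 0 : 1;
}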

Page 39:

Performance & Power Consumption

[Charts: normalized performance and normalized power vs. core count (channel count); using the near segment as a cache improves performance by 12.4%, 11.5%, and 10.7% on 1-, 2-, and 4-channel systems, and reduces power by 23%, 24%, and 26%]

Using the near segment as a cache improves performance and reduces power consumption

Page 40:

Tolerating DRAM: Example Techniques

- Retention-Aware DRAM Refresh: Reducing Refresh Impact
- Tiered-Latency DRAM: Reducing DRAM Latency
- RowClone: Accelerating Page Copy and Initialization
- Subarray-Level Parallelism: Reducing Bank Conflict Impact

Page 41:

Today's Memory: Bulk Data Copy

[Diagram: data moves Memory → MC → L3 → L2 → L1 → CPU and back]

1) High latency
2) High bandwidth utilization
3) Cache pollution
4) Unwanted data movement

Page 42:

Future: RowClone (In-Memory Copy)

[Diagram: data is copied inside memory, without moving through MC/L3/L2/L1/CPU]

1) Low latency
2) Low bandwidth utilization
3) No cache pollution
4) No unwanted data movement

Page 43:

DRAM Operation (load one byte)

[Diagram: a DRAM array of 4-Kbit rows with a 4-Kbit row buffer, 8-bit data pins, and the memory bus]

Step 1: Activate row; the entire 4-Kbit row is transferred into the row buffer
Step 2: Read; one byte is transferred onto the memory bus through the 8-bit data pins

Page 44:

RowClone: In-DRAM Row Copy (and Initialization)

[Diagram: the same DRAM array and 4-Kbit row buffer]

Step 1: Activate row A; the entire source row is transferred into the row buffer
Step 2: Activate row B; the row buffer's contents are transferred into the destination row

An entire row is copied with two back-to-back activates, and no data crosses the memory bus (see the command sketch below).
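A minimal sketch of what a memory controller's copy routine might look like with such a command sequence. The command functions (activate, precharge) are hypothetical placeholders for controller-issued DRAM commands, and this models only the intra-subarray case:

#include <stdio.h>

/* Hypothetical controller primitives: issue DRAM commands to a bank. */
static void activate(int bank, int row)  { printf("ACT  bank %d row %d\n", bank, row); }
static void precharge(int bank)          { printf("PRE  bank %d\n", bank); }

/* RowClone-style intra-subarray copy: the first ACTIVATE pulls the
   source row into the row buffer; a second ACTIVATE to the destination
   row (no intervening PRECHARGE) makes the sense amplifiers drive the
   buffered data into the destination cells. */
void rowclone_copy(int bank, int src_row, int dst_row) {
    activate(bank, src_row);   /* row buffer <- source row      */
    activate(bank, dst_row);   /* destination row <- row buffer */
    precharge(bank);           /* close the bank                */
}

int main(void) {
    /* Copy a whole row: one short command sequence instead of moving
       every cache block over the memory bus. */
    rowclone_copy(0, 12, 34);
    return 0;
}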

Page 45:

RowClone: Latency and Energy Savings

[Chart: normalized latency and energy of bulk copy for Baseline and the Inter-Subarray, Inter-Bank, and Intra-Subarray RowClone modes; intra-subarray copy reduces latency by 11.6x and energy by 74x]

Seshadri et al., “RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data,” CMU Tech Report 2013.

Page 46:

RowClone: Overall Performance

[Chart: system performance improvement with RowClone on copy- and initialization-intensive workloads]

Page 47:

Goal: Ultra-Efficient Heterogeneous Architectures

[Diagram: a heterogeneous chip with CPU cores, a mini-CPU core, a video core, an imaging core, GPU (throughput) cores, an LLC, and a memory controller, plus specialized compute capability in memory, connected over the memory bus]

Slide credit: Prof. Kayvon Fatahalian, CMU

Page 48:

Enabling Ultra-Efficient (Visual) Search

- What is the right partitioning of computation capability?
- What is the right low-cost memory substrate?
- What memory technologies are the best enablers?
- How do we rethink/ease (visual) search algorithms/applications?

[Diagram: processor core with cache, connected over the memory bus to main memory holding a database of images; a query vector goes in, results come out]

Picture credit: Prof. Kayvon Fatahalian, CMU

Page 49:

Tolerating DRAM: Example Techniques

- Retention-Aware DRAM Refresh: Reducing Refresh Impact
- Tiered-Latency DRAM: Reducing DRAM Latency
- RowClone: In-Memory Page Copy and Initialization
- Subarray-Level Parallelism: Reducing Bank Conflict Impact

Page 50:

SALP: Reducing DRAM Bank Conflicts

- Problem: Bank conflicts are costly for performance and energy
  - serialized requests, wasted energy (thrashing of row buffer, busy wait)
- Goal: Reduce bank conflicts without adding more banks (low cost)
- Key idea: Exploit the internal subarray structure of a DRAM bank to parallelize bank conflicts across different subarrays
  - Slightly modify the DRAM bank to reduce subarray-level hardware sharing

- Results on Server, Stream/Random, SPEC
  - 19% reduction in dynamic DRAM energy
  - 13% improvement in row hit rate
  - 17% performance improvement
  - 0.15% DRAM area overhead

[Charts: normalized dynamic energy (-19%) and row-buffer hit rate (+13%) for Baseline vs. MASA]

Kim, Seshadri+, “A Case for Exploiting Subarray-Level Parallelism in DRAM,” ISCA 2012.

Page 51:

Agenda

- Major Trends Affecting Main Memory
- The DRAM Scaling Problem and Solution Directions
  - Tolerating DRAM: New DRAM Architectures
  - Enabling Emerging Technologies: Hybrid Memory Systems
- How Can We Do Better?
- Summary

Page 52:

Solution 2: Emerging Memory Technologies

- Some emerging resistive memory technologies seem more scalable than DRAM (and they are non-volatile)

- Example: Phase Change Memory
  - Data is stored by changing the phase of the material
  - Data is read by detecting the material's resistance
  - Expected to scale to 9nm (2022 [ITRS])
  - Prototyped at 20nm (Raoux+, IBM JRD 2008)
  - Expected to be denser than DRAM: can store multiple bits/cell

- But emerging technologies have (many) shortcomings
  - Can they be enabled to replace/augment/surpass DRAM?

Page 53:

Phase Change Memory: Pros and Cons

- Pros over DRAM
  - Better technology scaling (capacity and cost)
  - Non-volatility
  - Low idle power (no refresh)

- Cons
  - Higher latencies: ~4-15x DRAM (especially write)
  - Higher active energy: ~2-50x DRAM (especially write)
  - Lower endurance (a cell dies after ~10^8 writes)

- Challenges in enabling PCM as a DRAM replacement/helper:
  - Mitigate PCM shortcomings
  - Find the right way to place PCM in the system

Page 54:

PCM-Based Main Memory (I)

- How should PCM-based (main) memory be organized?

- Hybrid PCM+DRAM [Qureshi+ ISCA'09, Dhiman+ DAC'09]:
  - How to partition/migrate data between PCM and DRAM

Page 55:

PCM-Based Main Memory (II)

- How should PCM-based (main) memory be organized?

- Pure PCM main memory [Lee et al., ISCA'09, Top Picks'10]:
  - How to redesign the entire hierarchy (and cores) to overcome PCM shortcomings

Page 56:

An Initial Study: Replace DRAM with PCM

- Lee, Ipek, Mutlu, Burger, “Architecting Phase Change Memory as a Scalable DRAM Alternative,” ISCA 2009.
  - Surveyed prototypes from 2003-2008 (e.g., IEDM, VLSI, ISSCC)
  - Derived “average” PCM parameters for F=90nm

Page 57:

Results: Naïve Replacement of DRAM with PCM

- Replace DRAM with PCM in a 4-core, 4MB L2 system
- PCM organized the same as DRAM: row buffers, banks, peripherals
- 1.6x delay, 2.2x energy, 500-hour average lifetime

- Lee, Ipek, Mutlu, Burger, “Architecting Phase Change Memory as a Scalable DRAM Alternative,” ISCA 2009.

Page 58:

Architecting PCM to Mitigate Shortcomings

- Idea 1: Use multiple narrow row buffers in each PCM chip
  → reduces array reads/writes → better endurance, latency, energy

- Idea 2: Write into the array at cache-block or word granularity
  → reduces unnecessary wear

[Diagram: DRAM-style organization (one wide row buffer) vs. the proposed PCM organization (multiple narrow row buffers)]

Page 59:

Results: Architected PCM as Main Memory

- 1.2x delay, 1.0x energy, 5.6-year average lifetime
- Scaling improves energy, endurance, density

- Caveat 1: Worst-case lifetime is much shorter (no guarantees)
- Caveat 2: Intensive applications see large performance and energy hits
- Caveat 3: Optimistic PCM parameters?

Page 60:

Hybrid Memory Systems

Meza+, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters, 2012.
Yoon, Meza et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD 2012 Best Paper Award.

[Diagram: CPU with a DRAM controller and a PCM controller. DRAM: fast and durable, but small, leaky, volatile, high-cost. Phase Change Memory (or Tech. X): large, non-volatile, low-cost, but slow, wears out, and has high active energy]

Hardware/software manage data allocation and movement to achieve the best of multiple technologies

Page 61:

One Option: DRAM as a Cache for PCM

- PCM is main memory; DRAM caches memory rows/blocks
  - Benefits: reduced latency on DRAM cache hit; write filtering

- Memory controller hardware manages the DRAM cache
  - Benefit: eliminates system software overhead

- Three issues:
  - What data should be placed in DRAM versus kept in PCM?
  - What is the granularity of data movement?
  - How to design a low-cost hardware-managed DRAM cache?

- Two solutions:
  - Locality-aware data placement [Yoon+, ICCD 2012]
  - Cheap tag stores and dynamic granularity [Meza+, IEEE CAL 2012]

Page 62:

DRAM vs. PCM: An Observation

- Row buffers are the same in DRAM and PCM
- Row buffer hit latency is the same in DRAM and PCM
- Row buffer miss latency is small in DRAM, large in PCM

- Accessing the row buffer in PCM is fast
- What incurs high latency is the PCM array access → avoid this

[Diagram: CPU with a DRAM controller in front of a DRAM cache (N ns row hit, fast row miss) and a PCM controller in front of PCM main memory (N ns row hit, slow row miss), each with banks and row buffers]

Page 63:

Row-Locality-Aware Data Placement

- Idea: Cache in DRAM only those rows that
  - frequently cause row buffer conflicts → because row-conflict latency is smaller in DRAM
  - are reused many times → to reduce cache pollution and bandwidth waste

- Simplified rule of thumb (sketched below):
  - Streaming accesses: better to place in PCM
  - Other accesses (with some reuse): better to place in DRAM

- Yoon et al., “Row Buffer Locality-Aware Data Placement in Hybrid Memories,” ICCD 2012 Best Paper Award.
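A minimal sketch of such a placement policy in the memory controller. The counters and thresholds are illustrative assumptions, not the exact mechanism from the paper:

#include <stdint.h>
#include <stdio.h>

/* Per-row statistics a controller might track (illustrative). */
struct row_stats {
    uint32_t row_buffer_misses;  /* conflicts observed for this row  */
    uint32_t accesses;           /* total accesses (reuse indicator) */
};

/* Assumed thresholds; a real design would tune or adapt these. */
#define MISS_THRESHOLD  4
#define REUSE_THRESHOLD 8

/* Returns 1 if the row is worth caching in DRAM, 0 to leave it in PCM.
   Streaming rows (little reuse, few conflicts) stay in PCM because a
   PCM row-buffer hit is as fast as a DRAM one; rows with many conflicts
   and much reuse go to DRAM, where a row miss is far cheaper. */
int place_in_dram(const struct row_stats *s) {
    return s->row_buffer_misses >= MISS_THRESHOLD
        && s->accesses >= REUSE_THRESHOLD;
}

int main(void) {
    struct row_stats streaming = { .row_buffer_misses = 1, .accesses = 2 };
    struct row_stats reused    = { .row_buffer_misses = 6, .accesses = 20 };
    printf("streaming row -> %s\n", place_in_dram(&streaming) ? "DRAM" : "PCM");
    printf("reused row    -> %s\n", place_in_dram(&reused)    ? "DRAM" : "PCM");
    return 0;
}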

Page 64:

Row-Locality-Aware Data Placement: Results

[Chart: normalized weighted speedup for Server, Cloud, and Avg workloads under the FREQ, FREQ-Dyn, RBLA, and RBLA-Dyn policies, with improvements of 10%, 14%, and 17%]

Memory energy efficiency and fairness also improve correspondingly

Page 65:

Hybrid vs. All-PCM/DRAM

[Charts: normalized weighted speedup, maximum slowdown, and performance per watt for 16GB all-PCM, the RBLA-Dyn hybrid, and 16GB all-DRAM]

The hybrid achieves 31% better performance than all-PCM, within 29% of all-DRAM performance

Page 66:

Agenda

- Major Trends Affecting Main Memory
- The DRAM Scaling Problem and Solution Directions
  - Tolerating DRAM: New DRAM Architectures
  - Enabling Emerging Technologies: Hybrid Memory Systems
- How Can We Do Better?
- Summary

Page 67:

Principles (So Far)

- Better cooperation between devices and the system
  - Expose more information about devices to upper layers
  - More flexible interfaces

- Better-than-worst-case design
  - Do not optimize for the worst case
  - The worst case should not determine the common case

- Heterogeneity in design
  - Enables a more efficient design (no one size fits all)

Page 68:

Other Opportunities with Emerging Technologies

- Merging of memory and storage
  - e.g., a single interface to manage all data

- New applications
  - e.g., ultra-fast checkpoint and restore

- More robust system design
  - e.g., reducing data loss

- Processing tightly coupled with memory
  - e.g., enabling efficient search and filtering

Page 69:

Coordinated Memory and Storage with NVM (I)

- The traditional two-level storage model is a bottleneck with NVM
  - Volatile data in memory → a load/store interface
  - Persistent data in storage → a file system interface
  - Problem: operating system (OS) and file system (FS) code to locate, translate, and buffer data becomes a performance and energy bottleneck with fast NVM stores

[Diagram: two-level store; the processor and caches reach main memory via load/store and virtual-memory address translation, but reach storage (SSD/HDD) through the operating system and file system (fopen, fread, fwrite, ...)]

Page 70:

Coordinated Memory and Storage with NVM (II)

- Goal: Unify memory and storage management in a single unit to eliminate wasted work to locate, transfer, and translate data
  - Improves both energy and performance
  - Simplifies the programming model as well (contrast sketched below)

[Diagram: unified memory/storage; the processor and caches issue loads/stores to a Persistent Memory Manager, which manages persistent (e.g., phase-change) memory and provides feedback]

Meza+, “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.
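To illustrate the programming-model difference, here is a hedged sketch contrasting the two paths. The persistent-region allocator (pmem_alloc) is a hypothetical stand-in for whatever interface a persistent memory manager would expose; the stub below only illustrates the shape of the API:

#include <stdio.h>

/* Two-level store: persistent data sits behind the file system, so each
   update pays for system calls, FS buffering, and data copying. */
void update_two_level(int value) {
    FILE *f = fopen("counter.dat", "wb");  /* OS + FS locate/translate */
    if (!f) return;
    fwrite(&value, sizeof value, 1, f);    /* copy through FS buffers */
    fclose(f);
}

/* Hypothetical single-level API: map a named persistent region into the
   address space. A real persistent memory manager would back this with
   NVM and guarantee durability; this stub just returns a buffer. */
void *pmem_alloc(const char *name, size_t size) {
    static char backing[64];
    (void)name;
    return size <= sizeof backing ? backing : NULL;
}

/* Single-level store: updating persistent data is just a store. */
void update_single_level(int value) {
    static int *counter;
    if (!counter)
        counter = pmem_alloc("counter", sizeof *counter);
    if (counter)
        *counter = value;
}

int main(void) {
    update_two_level(42);
    update_single_level(42);
    return 0;
}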

Page 71:

Performance Benefits of a Single-Level Store

[Chart: normalized execution time (user CPU, user memory, syscall CPU, syscall I/O) for an HDD-based two-level store, an NVM-based two-level store, and persistent memory; the single-level persistent memory design is ~5X faster than the NVM two-level store (0.009 vs. 0.044 normalized)]

Results for PostMark

Page 72:

Energy Benefits of a Single-Level Store

[Chart: fraction of total energy (user CPU, syscall CPU, DRAM, NVM, HDD) for the same three designs; the single-level persistent memory design uses ~5X less energy than the NVM two-level store (0.013 vs. 0.065 normalized)]

Results for PostMark

Page 73:

Agenda

- Major Trends Affecting Main Memory
- The DRAM Scaling Problem and Solution Directions
  - Tolerating DRAM: New DRAM Architectures
  - Enabling Emerging Technologies: Hybrid Memory Systems
- How Can We Do Better?
- Summary

Page 74:

Summary: Main Memory Scaling

- Main memory scaling problems are a critical bottleneck for system performance, efficiency, and usability

- Solution 1: Tolerate DRAM with novel architectures
  - RAIDR: retention-aware refresh
  - TL-DRAM: tiered-latency DRAM
  - RowClone: fast page copy and initialization
  - SALP: subarray-level parallelism

- Solution 2: Enable emerging memory technologies
  - Replace DRAM with NVM by architecting NVM chips well
  - Hybrid memory systems with automatic data management
  - Coordinated management of memory and storage

- Software/hardware/device cooperation is essential for effective scaling of main memory

Page 75:

More Material: Slides, Papers, Videos

- These slides are a very short version of the Scalable Memory Systems course at ACACES 2013

- Website for course slides, papers, and videos
  - http://users.ece.cmu.edu/~omutlu/acaces2013-memory.html
  - http://users.ece.cmu.edu/~omutlu/projects.htm
  - Includes extended lecture notes and readings

- Overview reading
  - Onur Mutlu, “Memory Scaling: A Systems Architecture Perspective,” Proceedings of the 5th International Memory Workshop (IMW), Monterey, CA, May 2013.

Page 76:

Thank you.

Feel free to email me with any feedback

[email protected]


Page 77:

Memory Scaling: A Systems Architecture Perspective

Onur Mutlu [email protected] August 6, 2013 MemCon 2013

Page 78:

Backup Slides


Page 79:

Backup Slides Agenda

- Building Large DRAM Caches for Hybrid Memories
- Memory QoS and Predictable Performance
- Subarray-Level Parallelism (SALP) in DRAM
- Coordinated Memory and Storage with NVM

Page 80:

Building Large Caches for Hybrid Memories


Page 81:

One Option: DRAM as a Cache for PCM

- PCM is main memory; DRAM caches memory rows/blocks
  - Benefits: reduced latency on DRAM cache hit; write filtering

- Memory controller hardware manages the DRAM cache
  - Benefit: eliminates system software overhead

- Three issues:
  - What data should be placed in DRAM versus kept in PCM?
  - What is the granularity of data movement?
  - How to design a low-cost hardware-managed DRAM cache?

- Two ideas:
  - Locality-aware data placement [Yoon+, ICCD 2012]
  - Cheap tag stores and dynamic granularity [Meza+, IEEE CAL 2012]

Page 82:

The Problem with Large DRAM Caches

- A large DRAM cache requires a large metadata (tag + block-based information) store
- How do we design an efficient DRAM cache?

[Diagram: CPU with one memory controller to DRAM (small, fast cache) and one to PCM (high capacity); a LOAD X first consults the metadata store ("X → DRAM") and then accesses X in the DRAM cache]

Page 83:

Idea 1: Store Tags in Main Memory

- Store tags in the same row as data in DRAM
  - Data and metadata can be accessed together (layout sketched below)

[Diagram: a DRAM row holding Tag0, Tag1, Tag2 alongside cache blocks 0, 1, and 2]

- Benefit: no on-chip tag storage overhead
- Downsides:
  - Cache hit determined only after a DRAM access
  - Cache hit requires two DRAM accesses
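A minimal sketch of what such a row layout could look like, under assumed sizes (64B blocks, a 2KB row, 8B of metadata per block). This illustrates the idea only; it is not the exact format from the cited work:

#include <stdint.h>
#include <stdio.h>

#define BLOCK_BYTES    64
#define BLOCKS_PER_ROW 28      /* leaves room for tags in a 2KB row */

/* Per-block metadata stored in the same DRAM row as the data. */
struct tag_entry {
    uint64_t pcm_row : 48;     /* which PCM row this block caches */
    uint64_t valid   : 1;
    uint64_t dirty   : 1;
    uint64_t unused  : 14;
};

/* One DRAM cache row: tags first, then data blocks. A single row
   activation brings both into the row buffer, so the tag check and the
   data read are two column accesses to the same open row. */
struct dram_cache_row {
    struct tag_entry tags[BLOCKS_PER_ROW];        /* 28 * 8B  = 224B  */
    uint8_t blocks[BLOCKS_PER_ROW][BLOCK_BYTES];  /* 28 * 64B = 1792B */
};

int main(void) {
    printf("row footprint: %zu bytes\n", sizeof(struct dram_cache_row));
    return 0;
}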

Page 84:

Idea 2: Cache Tags in On-Chip SRAM

- Recall Idea 1: store all metadata in DRAM
  - to reduce metadata storage overhead

- Idea 2: cache frequently accessed metadata in on-chip SRAM
  - Cache only a small amount to keep the SRAM size small

Page 85:

Idea 3: Dynamic Data Transfer Granularity

- Some applications benefit from caching more data
  - They have good spatial locality
- Others do not
  - Large granularity wastes bandwidth and reduces cache utilization

- Idea 3: a simple dynamic caching granularity policy (sketched below)
  - Cost-benefit analysis to determine the best DRAM cache block size

- Meza, Chang, Yoon, Mutlu, Ranganathan, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters, 2012.
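A hedged sketch of such a cost-benefit decision. The counters and the benefit model are illustrative assumptions standing in for the paper's mechanism:

#include <stdio.h>

/* Periodically choose between caching whole rows or single blocks,
   based on how much of each fetched row was actually used. */
struct epoch_stats {
    unsigned rows_fetched;     /* rows brought into the DRAM cache */
    unsigned blocks_used;      /* distinct cached blocks later hit */
    unsigned blocks_per_row;   /* e.g., 32 blocks per row          */
};

/* Benefit of large granularity ~ fraction of each row that gets used;
   cost ~ the bandwidth wasted fetching the unused remainder.
   The 0.5 threshold is an assumed tuning point. */
int use_row_granularity(const struct epoch_stats *s) {
    if (s->rows_fetched == 0) return 1;    /* default: cache rows */
    double used_fraction = (double)s->blocks_used /
                           ((double)s->rows_fetched * s->blocks_per_row);
    return used_fraction >= 0.5;
}

int main(void) {
    struct epoch_stats spatial = { 100, 2400, 32 };  /* 75% of rows used */
    struct epoch_stats sparse  = { 100,  320, 32 };  /* 10% of rows used */
    printf("spatial workload -> %s granularity\n",
           use_row_granularity(&spatial) ? "row" : "block");
    printf("sparse workload  -> %s granularity\n",
           use_row_granularity(&sparse) ? "row" : "block");
    return 0;
}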

Page 86:

TIMBER Performance

[Chart: normalized weighted speedup for the SRAM, Region, TIM, TIMBER, and TIMBER-Dyn tag-management schemes; TIMBER-Dyn comes within 6% of the performance of a large SRAM tag store]

Meza, Chang, Yoon, Mutlu, Ranganathan, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters, 2012.

Page 87:

TIMBER Energy Efficiency

[Chart: normalized performance per watt (for the memory system) for SRAM, Region, TIM, TIMBER, and TIMBER-Dyn; TIMBER-Dyn improves memory-system energy efficiency by 18%]

Meza, Chang, Yoon, Mutlu, Ranganathan, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters, 2012.

Page 88:

Hybrid Main Memory: Research Topics

- Many research topics from the technology layer to the algorithms layer

[Diagram: the computing stack; Problems, Algorithms, Programs, Runtime System (VM, OS, MM), ISA, Microarchitecture, Logic, Devices, with the User on top]

- Enabling NVM and hybrid memory
  - How to maximize performance?
  - How to maximize lifetime?
  - How to prevent denial of service?

- Exploiting emerging technologies
  - How to exploit non-volatility?
  - How to minimize energy consumption?
  - How to minimize cost?
  - How to exploit NVM on chip?

Page 89:

Security Challenges of Emerging Technologies

1. Limited endurance → wearout attacks
2. Non-volatility → data persists in memory after powerdown → easy retrieval of privileged or private information
3. Multiple bits per cell → information leakage (via side channels)

Page 90:

Memory QoS


Page 91:

Trend: Many Cores on Chip

- Simpler and lower power than a single large core
- Large-scale parallelism on chip

Examples: AMD Barcelona (4 cores), Intel Core i7 (8 cores), IBM Cell BE (8+1 cores), IBM POWER7 (8 cores), Sun Niagara II (8 cores), Nvidia Fermi (448 “cores”), Intel SCC (48 cores, networked), Tilera TILE Gx (100 cores, networked)

Page 92:

Many Cores on Chip

- What we want:
  - N times the system performance with N times the cores

- What do we get today?

Page 93:

Unfair Slowdowns due to Interference

[Diagram: two cores sharing memory, running matlab (Core 1) and gcc (Core 2); one application, the memory performance hog, slows down the other regardless of assigned low/high priority]

Moscibroda and Mutlu, “Memory performance attacks: Denial of memory service in multi-core systems,” USENIX Security 2007.

Page 94:

Uncontrolled Interference: An Example

[Diagram: a multi-core chip with CORE 1 (matlab) and CORE 2 (gcc), each with an L2 cache, connected via an interconnect to the DRAM memory controller and shared DRAM banks 0-3; unfairness arises in the shared DRAM memory system]

Page 95:

A Memory Performance Hog

STREAM (streaming):

// initialize large arrays A, B
for (j = 0; j < N; j++) {
    index = j * linesize;
    A[index] = B[index];
    …
}

- Sequential memory access
- Very high row buffer locality (96% hit rate)
- Memory intensive

RANDOM (random):

// initialize large arrays A, B
for (j = 0; j < N; j++) {
    index = rand();
    A[index] = B[index];
    …
}

- Random memory access
- Very low row buffer locality (3% hit rate)
- Similarly memory intensive

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.

Page 96:

What Does the Memory Hog Do?

[Diagram: a DRAM bank with row decoder, row buffer, and column mux; the memory request buffer holds many T0 (STREAM) requests to Row 0 interleaved with T1 (RANDOM) requests to rows 16, 111, 5, ...; because Row 0 stays open, T0's row-buffer hits are serviced first]

Row size: 8KB, cache block size: 64B
128 (8KB/64B) requests of T0 are serviced before a single T1 request (scheduler sketch below)

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.
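The scheduling policy behind this effect is row-hit-first (FR-FCFS) scheduling. Here is a minimal sketch of such a scheduler loop, an illustration of the policy rather than any specific controller's implementation:

#include <stdio.h>

struct request {
    int thread;     /* which core/thread issued it        */
    int row;        /* DRAM row it targets                */
    long arrival;   /* arrival time, for FCFS tiebreaking */
    int valid;
};

#define QSIZE 8

/* FR-FCFS: prefer requests that hit the currently open row (they are
   much faster to service); among equals, pick the oldest. A thread
   with high row-buffer locality (STREAM) keeps winning, starving a
   low-locality thread (RANDOM). */
int pick_next(struct request q[], int open_row) {
    int best = -1;
    for (int i = 0; i < QSIZE; i++) {
        if (!q[i].valid) continue;
        if (best < 0) { best = i; continue; }
        int i_hit = (q[i].row == open_row), b_hit = (q[best].row == open_row);
        if (i_hit != b_hit ? i_hit : q[i].arrival < q[best].arrival)
            best = i;
    }
    return best;
}

int main(void) {
    struct request q[QSIZE] = {
        { .thread = 1, .row = 16, .arrival = 0, .valid = 1 },  /* RANDOM */
        { .thread = 0, .row = 0,  .arrival = 1, .valid = 1 },  /* STREAM */
        { .thread = 0, .row = 0,  .arrival = 2, .valid = 1 },  /* STREAM */
    };
    int n = pick_next(q, /* open_row = */ 0);
    printf("serviced first: thread %d (row %d)\n", q[n].thread, q[n].row);
    return 0;
}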

Page 97:

Effect of the Memory Performance Hog

[Chart: when run together on a dual-core system, STREAM slows down only 1.18X while RANDOM slows down 2.82X; similar unfairness occurs when STREAM runs with gcc or Virtual PC]

Results on Intel Pentium D running Windows XP (similar results for Intel Core Duo and AMD Turion, and on Fedora Linux)

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.

Page 98:

Greater Problem with More Cores

- Vulnerable to denial of service (DoS)
- Unable to enforce priorities or SLAs
- Low system performance

An uncontrollable, unpredictable system

Page 99:

Distributed DoS in Networked Multi-Core Systems

[Diagram: a 64-core chip with cores connected via packet-switched routers; attackers on cores 1-8 inflict a ~5000X slowdown on a stock option pricing application running on cores 9-64]

Grot, Hestness, Keckler, Mutlu, “Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip,” MICRO 2009.

Page 100:

How Do We Solve The Problem?

- Inter-thread interference is uncontrolled in all memory resources
  - Memory controller
  - Interconnect
  - Caches

- We need to control it
  - i.e., design an interference-aware (QoS-aware) memory system

Page 101:

QoS-Aware Memory Systems: Challenges

- How do we reduce inter-thread interference?
  - Improve system performance and core utilization
  - Reduce request serialization and core starvation

- How do we control inter-thread interference?
  - Provide mechanisms to enable system software to enforce QoS policies
  - While providing high system performance

- How do we make the memory system configurable/flexible?
  - Enable flexible mechanisms that can achieve many goals
    - Provide fairness or throughput when needed
    - Satisfy performance guarantees when needed

Page 102:

Designing QoS-Aware Memory Systems: Approaches

- Smart resources: design each shared resource to have a configurable interference control/reduction mechanism
  - QoS-aware memory controllers [Mutlu+ MICRO'07] [Moscibroda+, USENIX Security'07] [Mutlu+ ISCA'08, Top Picks'09] [Kim+ HPCA'10] [Kim+ MICRO'10, Top Picks'11] [Ebrahimi+ ISCA'11, MICRO'11] [Ausavarungnirun+, ISCA'12]
  - QoS-aware interconnects [Das+ MICRO'09, ISCA'10, Top Picks'11] [Grot+ MICRO'09, ISCA'11, Top Picks'12]
  - QoS-aware caches

- Dumb resources: keep each resource free-for-all, but reduce/control interference by injection control or data mapping
  - Source throttling to control access to the memory system [Ebrahimi+ ASPLOS'10, ISCA'11, TOCS'12] [Ebrahimi+ MICRO'09] [Nychis+ HotNets'10]
  - QoS-aware data mapping to memory controllers [Muralidhara+ MICRO'11]
  - QoS-aware thread scheduling to cores

Page 103:

A Mechanism to Reduce Memory Interference

- Memory Channel Partitioning
  - Idea: System software maps badly interfering applications' pages to different channels [Muralidhara+, MICRO'11]
    - Separates data of low/high-intensity and low/high row-locality applications
    - Especially effective in reducing interference between threads with "medium" and "heavy" memory intensity
  - 11% higher performance over existing systems (200 workloads); a placement sketch follows

[Diagram: with conventional page mapping, App A (Core 0) and App B (Core 1) share banks on both channels 0 and 1; with channel partitioning, each application gets its own channel and banks, eliminating inter-application interference at the channel]
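A minimal sketch of the OS-side placement decision. The profile fields and the cutoff are illustrative assumptions; the MICRO'11 scheme classifies applications online by memory intensity and row-buffer locality:

#include <stdio.h>

struct app_profile {
    double mpki;          /* memory intensity: misses per kilo-instruction */
    double rb_hit_rate;   /* row-buffer locality (used by the full scheme) */
};

#define HEAVY_MPKI 10.0   /* assumed cutoff for "memory-intensive" */

/* Give memory-intensive applications their own channel so their long
   row-buffer streaks cannot delay light applications; the full scheme
   also considers row-buffer locality when grouping applications. */
int pick_channel(const struct app_profile *a) {
    return (a->mpki >= HEAVY_MPKI) ? 1 : 0;
}

int main(void) {
    struct app_profile gcc_like    = {  2.0, 0.40 };
    struct app_profile stream_like = { 40.0, 0.96 };
    printf("gcc-like    -> channel %d\n", pick_channel(&gcc_like));
    printf("stream-like -> channel %d\n", pick_channel(&stream_like));
    return 0;
}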

Page 104:

Designing QoS-Aware Memory Systems: Approaches

- Smart resources: design each shared resource to have a configurable interference control/reduction mechanism
  - QoS-aware memory controllers [Mutlu+ MICRO'07] [Moscibroda+, USENIX Security'07] [Mutlu+ ISCA'08, Top Picks'09] [Kim+ HPCA'10] [Kim+ MICRO'10, Top Picks'11] [Ebrahimi+ ISCA'11, MICRO'11] [Ausavarungnirun+, ISCA'12] [Subramanian+, HPCA'13]
  - QoS-aware interconnects [Das+ MICRO'09, ISCA'10, Top Picks'11] [Grot+ MICRO'09, ISCA'11, Top Picks'12]
  - QoS-aware caches

- Dumb resources: keep each resource free-for-all, but reduce/control interference by injection control or data mapping
  - Source throttling to control access to the memory system [Ebrahimi+ ASPLOS'10, ISCA'11, TOCS'12] [Ebrahimi+ MICRO'09] [Nychis+ HotNets'10] [Nychis+ SIGCOMM'12]
  - QoS-aware data mapping to memory controllers [Muralidhara+ MICRO'11]
  - QoS-aware thread scheduling to cores [Das+ HPCA'13]

Page 105:

QoS-Aware Memory Scheduling

n  How to schedule requests to provide
q  High system performance
q  High fairness to applications
q  Configurability to system software

n  Memory controller needs to be aware of threads

[Figure: multiple cores share a memory controller in front of memory; the memory controller resolves memory contention by scheduling requests.]

Page 106:

QoS-Aware Memory Scheduling: Evolution

n  Stall-time fair memory scheduling [Mutlu+ MICRO'07]
q  Idea: Estimate and balance thread slowdowns
q  Takeaway: Proportional thread progress improves performance, especially when threads are "heavy" (memory intensive)

n  Parallelism-aware batch scheduling [Mutlu+ ISCA'08, Top Picks'09]
q  Idea: Rank threads and service in rank order (to preserve bank parallelism); batch requests to prevent starvation
q  Takeaway: Preserving within-thread bank-parallelism improves performance; request batching improves fairness

n  ATLAS memory scheduler [Kim+ HPCA'10]
q  Idea: Prioritize threads that have attained the least service from the memory scheduler
q  Takeaway: Prioritizing "light" threads improves performance (sketched below)
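A toy C++ sketch of the least-attained-service idea behind ATLAS, assuming per-thread counters of attained memory service; the names are illustrative, not the actual ATLAS hardware.

#include <algorithm>
#include <cstdint>
#include <vector>

struct ThreadState {
    int      id;
    uint64_t attainedService;  // cycles of memory service received so far
};

// Threads that have attained the least service are ranked highest, so
// "light" threads are served quickly and leave the memory system sooner.
void rankByLeastAttainedService(std::vector<ThreadState>& threads) {
    std::sort(threads.begin(), threads.end(),
              [](const ThreadState& a, const ThreadState& b) {
                  return a.attainedService < b.attainedService;
              });
}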

Page 107:

Throughput vs. Fairness

n  Throughput-biased approach: prioritize less memory-intensive threads (higher priority to less intensive threads)
q  Good for throughput, but the most intensive thread can starve → unfairness

n  Fairness-biased approach: threads take turns accessing memory
q  No thread starves, but less intensive threads are not prioritized → reduced throughput

A single policy for all threads is insufficient.

Page 108:

Achieving the Best of Both Worlds

n  For throughput: prioritize memory-non-intensive threads (higher priority)

n  For fairness:
q  Unfairness is caused by memory-intensive threads being prioritized over each other → shuffle thread ranking
q  Memory-intensive threads have different vulnerability to interference → shuffle asymmetrically

Page 109:

Thread Cluster Memory Scheduling [Kim+ MICRO'10]

1.  Group threads into two clusters
2.  Prioritize the non-intensive cluster
3.  Use different policies for each cluster

[Figure: threads in the system are divided into a memory-non-intensive cluster, which is prioritized and managed for throughput, and a memory-intensive cluster, which is managed for fairness.]

Kim+, "Thread Cluster Memory Scheduling," MICRO 2010.

Page 110:

TCM: Quantum-Based Operation

n  During each quantum (~1M cycles), monitor thread behavior:
1. Memory intensity
2. Bank-level parallelism
3. Row-buffer locality

n  At the beginning of the next quantum:
q  Perform clustering (sketched below)
q  Compute niceness of intensive threads

n  Within a quantum, thread ranking is shuffled at shuffle intervals (~1K cycles)

Kim+, “Thread Cluster Memory Scheduling,” MICRO 2010.
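A minimal C++ sketch of the beginning-of-quantum clustering step, assuming per-thread bandwidth counters from the previous quantum; the logic is simplified relative to the MICRO'10 paper and the names are illustrative.

#include <algorithm>
#include <vector>

struct ThreadStats {
    int    id;
    double bwConsumed;        // memory bandwidth used last quantum
    bool   intensiveCluster;  // set by clustering below
};

void clusterThreads(std::vector<ThreadStats>& ts, double clusterThreshold) {
    // Sort from least to most memory-intensive.
    std::sort(ts.begin(), ts.end(),
              [](const ThreadStats& a, const ThreadStats& b) {
                  return a.bwConsumed < b.bwConsumed;
              });
    double total = 0.0, running = 0.0;
    for (const ThreadStats& t : ts) total += t.bwConsumed;
    // The least intensive threads, up to ClusterThreshold of total
    // bandwidth, form the prioritized non-intensive cluster.
    for (ThreadStats& t : ts) {
        running += t.bwConsumed;
        t.intensiveCluster = (running > clusterThreshold * total);
    }
}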

Page 111:

TCM: Throughput and Fairness

[Figure: maximum slowdown (lower is better fairness) vs. weighted speedup (higher is better system throughput) for FRFCFS, STFM, PAR-BS, ATLAS, and TCM; 24 cores, 4 memory controllers, 96 workloads.]

TCM, a heterogeneous scheduling policy, provides the best fairness and system throughput.

Page 112:

TCM: Fairness-Throughput Tradeoff

[Figure: maximum slowdown vs. weighted speedup as TCM's ClusterThreshold configuration parameter is varied; the resulting curve dominates FRFCFS, STFM, PAR-BS, and ATLAS.]

TCM allows a robust fairness-throughput tradeoff.

Page 113:

More on TCM

n  Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter, "Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior" Proceedings of the 43rd International Symposium on Microarchitecture (MICRO), pages 65-76, Atlanta, GA, December 2010. Slides (pptx) (pdf)

Page 114:

Memory Control in CPU-GPU Systems

n  Observation: Heterogeneous CPU-GPU systems require memory schedulers with large request buffers

n  Problem: Existing monolithic application-aware memory scheduler designs are hard to scale to large request buffer sizes

n  Solution: Staged Memory Scheduling (SMS) decomposes the memory controller into three simple stages:
1) Batch formation: maintains row-buffer locality
2) Batch scheduler: reduces interference between applications
3) DRAM command scheduler: issues requests to DRAM

n  Compared to state-of-the-art memory schedulers:
q  SMS is significantly simpler and more scalable
q  SMS provides higher performance and fairness

Ausavarungnirun+, "Staged Memory Scheduling," ISCA 2012.

Page 115:

Key Idea: Decouple Tasks into Stages

n  Idea: Decouple the functional tasks of the memory controller
q  Partition tasks across several simpler HW structures (stages); a simplified sketch follows this list

1) Maximize row-buffer hits
q  Stage 1: Batch formation
q  Within each application, groups requests to the same row into batches

2) Manage contention between applications
q  Stage 2: Batch scheduler
q  Schedules batches from different applications

3) Satisfy DRAM timing constraints
q  Stage 3: DRAM command scheduler
q  Issues requests from the already-scheduled order to each bank
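A condensed C++ sketch of the three stages, assuming simple FIFOs; the structure and names are illustrative, not the ISCA'12 design.

#include <cstdint>
#include <cstdlib>
#include <deque>
#include <utility>
#include <vector>

struct Request { int source; int row; int bank; };
using Batch = std::vector<Request>;  // Stage 1 groups same-row requests

struct StagedScheduler {
    std::vector<std::deque<Batch>>   perSource;  // Stage 1 output, per source
    std::vector<std::deque<Request>> perBank;    // Stage 3 per-bank FIFOs
    int rrNext = 0;

    StagedScheduler(int sources, int banks)
        : perSource(sources), perBank(banks) {}

    // Stage 2: with probability sjfProb pick the source with the shortest
    // oldest batch (favors latency-sensitive CPU apps); otherwise pick
    // sources round-robin (favors fairness and GPU bandwidth).
    void scheduleOneBatch(double sjfProb) {
        int chosen = -1;
        if ((double)std::rand() / RAND_MAX < sjfProb) {
            std::size_t best = SIZE_MAX;
            for (std::size_t s = 0; s < perSource.size(); ++s)
                if (!perSource[s].empty() &&
                    perSource[s].front().size() < best) {
                    best = perSource[s].front().size();
                    chosen = (int)s;
                }
        } else {
            for (std::size_t k = 0; k < perSource.size(); ++k) {
                std::size_t s = (rrNext + k) % perSource.size();
                if (!perSource[s].empty()) {
                    chosen = (int)s; rrNext = (int)s + 1; break;
                }
            }
        }
        if (chosen < 0) return;
        Batch b = std::move(perSource[chosen].front());
        perSource[chosen].pop_front();
        // Stage 3 issues these in order, subject to DRAM timing (not shown).
        for (const Request& r : b) perBank[r.bank].push_back(r);
    }
};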

Page 116:

SMS: Staged Memory Scheduling

[Figure: requests from Cores 1-4 and the GPU enter per-source batch formation (Stage 1); a batch scheduler picks among formed batches (Stage 2); a DRAM command scheduler with per-bank FIFOs for Banks 1-4 issues to DRAM (Stage 3), replacing a single monolithic scheduler.]

Page 117:

SMS: Staged Memory Scheduling

[Figure: the same three-stage organization, highlighting batch formation (Stage 1), the batch scheduler (Stage 2), and the DRAM command scheduler over Banks 1-4 (Stage 3).]

Page 118:

SMS: Staged Memory Scheduling

[Figure: Stage 2's batch scheduler alternates between two policies, shortest job first (SJF) and round-robin (RR), when dispatching batches to the per-bank DRAM command schedulers.]

Ausavarungnirun+, "Staged Memory Scheduling," ISCA 2012.

Page 119:

SMS Complexity

n  Compared to a row-hit-first scheduler, SMS consumes*
q  66% less area
q  46% less static power

n  Reduction comes from:
q  Monolithic scheduler → stages of simpler schedulers
q  Each stage has a simpler scheduler (considers fewer properties at a time to make the scheduling decision)
q  Each stage has simpler buffers (FIFO instead of out-of-order)
q  Each stage has a portion of the total buffer size (buffering is distributed across stages)

* Based on a Verilog model using a 180nm library

Page 120:

SMS Performance

[Figure: system performance (normalized, 0 to 1) vs. GPUweight from 0.001 to 1000; the "Best Previous Scheduler" curve is the pointwise best of ATLAS, TCM, and FR-FCFS at each weight.]

Page 121:

SMS Performance

n  At every GPU weight, SMS outperforms the best previous scheduling algorithm for that weight

[Figure: the same plot with SMS added; the SMS curve lies above the best-previous-scheduler curve across all GPU weights.]

Page 122:

CPU-GPU Performance Tradeoff

[Figure: GPU frame rate and CPU weighted speedup as the batch scheduler's SJF probability is varied over 1, 0.5, 0.1, 0.05, and 0; lowering the SJF probability raises GPU frame rate while lowering CPU performance, exposing a configurable tradeoff.]

Page 123:

More on SMS

n  Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel Loh, and Onur Mutlu, "Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems" Proceedings of the 39th International Symposium on Computer Architecture (ISCA), Portland, OR, June 2012. Slides (pptx)

Page 124:

Stronger Memory Service Guarantees [HPCA'13]

n  Uncontrolled memory interference slows down applications unpredictably
n  Goal: Estimate and control slowdowns

n  MISE: An accurate slowdown estimation model
q  Request service rate is a good proxy for performance
q  Slowdown = Request Service Rate Alone / Request Service Rate Shared (sketched below)
q  Request Service Rate Alone is estimated by giving an application highest priority in accessing memory
q  Average slowdown estimation error of MISE: 8.2% (3000 data points)

n  Memory controller leverages MISE to control slowdowns
q  To provide soft slowdown guarantees
q  To minimize maximum slowdown

Subramanian+, "MISE," HPCA 2013.
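In code form, the estimation reduces to a ratio of measured service rates; a small C++ sketch with illustrative counter names. The alone service rate is sampled during brief epochs in which the memory controller gives the application highest priority.

struct AppMemCounters {
    double requestsAlone,  cyclesAlone;   // sampled at highest priority
    double requestsShared, cyclesShared;  // measured in normal operation
};

double estimateSlowdown(const AppMemCounters& c) {
    double rateAlone  = c.requestsAlone  / c.cyclesAlone;
    double rateShared = c.requestsShared / c.cyclesShared;
    // Slowdown = request service rate alone / request service rate shared
    return rateAlone / rateShared;
}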

Page 125:

More on MISE

n  Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen, and Onur Mutlu, "MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems" Proceedings of the 19th International Symposium on High-Performance Computer Architecture (HPCA), Shenzhen, China, February 2013. Slides (pptx)

Page 126:

Memory QoS in a Parallel Application

n  Threads in a multithreaded application are inter-dependent
n  Some threads can be on the critical path of execution due to synchronization; some threads are not
n  How do we schedule requests of inter-dependent threads to maximize multithreaded application performance?

n  Idea: Estimate limiter threads likely to be on the critical path and prioritize their requests; shuffle priorities of non-limiter threads to reduce memory interference among them [Ebrahimi+, MICRO'11]

n  Hardware/software cooperative limiter thread estimation (sketched below):
q  Thread executing the most contended critical section
q  Thread that is falling behind the most in a parallel for loop

Ebrahimi+, "Parallel Application Memory Scheduling," MICRO 2011.
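An illustrative C++ sketch of limiter-thread estimation in this spirit, assuming runtime counters for critical-section contention and loop progress; all names are hypothetical.

#include <cstdint>
#include <vector>

struct ThreadProgress {
    int      id;
    uint64_t contendedCsCycles;  // time spent in contended critical sections
    uint64_t loopItersDone;      // progress through the current parallel loop
};

int estimateLimiter(const std::vector<ThreadProgress>& ts) {
    int limiter = ts.empty() ? -1 : ts[0].id;
    uint64_t worstCs = 0;
    for (const ThreadProgress& t : ts)
        if (t.contendedCsCycles > worstCs) {
            worstCs = t.contendedCsCycles; limiter = t.id;
        }
    if (worstCs == 0) {  // no contended critical section: pick the laggard
        uint64_t fewest = UINT64_MAX;
        for (const ThreadProgress& t : ts)
            if (t.loopItersDone < fewest) {
                fewest = t.loopItersDone; limiter = t.id;
            }
    }
    return limiter;  // the memory scheduler prioritizes this thread's requests
}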

Page 127:

More on PAMS

n  Eiman Ebrahimi, Rustam Miftakhutdinov, Chris Fallin, Chang Joo Lee, Onur Mutlu, and Yale N. Patt, "Parallel Application Memory Scheduling" Proceedings of the 44th International Symposium on Microarchitecture (MICRO), Porto Alegre, Brazil, December 2011. Slides (pptx)

Page 128:

Summary: Memory QoS Approaches and Techniques

n  Approaches: Smart vs. dumb resources
q  Smart resources: QoS-aware memory scheduling
q  Dumb resources: Source throttling; channel partitioning
q  Both approaches are effective in reducing interference
q  No single best approach for all workloads

n  Techniques: Request scheduling, source throttling, memory partitioning
q  All techniques are effective in reducing interference
q  Can be applied at different levels: hardware vs. software
q  No single best technique for all workloads

n  Combined approaches and techniques are the most powerful
q  Integrated Memory Channel Partitioning and Scheduling [MICRO'11]

Page 129:

SALP: Reducing DRAM Bank Conflict Impact

Kim, Seshadri, Lee, Liu, and Mutlu, "A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM," ISCA 2012.

Page 130:

SALP: Reducing DRAM Bank Conflicts

n  Problem: Bank conflicts are costly for performance and energy
q  Serialized requests, wasted energy (thrashing of row buffer, busy wait)

n  Goal: Reduce bank conflicts without adding more banks (low cost)

n  Key idea: Exploit the internal subarray structure of a DRAM bank to parallelize bank conflicts to different subarrays
q  Slightly modify the DRAM bank to reduce subarray-level hardware sharing

[Figure: summary results with SALP: −19% dynamic DRAM energy, +13% row-buffer hit rate.]

Kim, Seshadri+, "A Case for Exploiting Subarray-Level Parallelism in DRAM," ISCA 2012.

Page 131:

SALP: Key Ideas

n  A DRAM bank consists of mostly-independent subarrays
q  Subarrays share some global structures to reduce cost

n  Key idea of SALP: Minimally reduce sharing of global structures (illustrated in the sketch below)
q  Reduce sharing of the global decoder → enables pipelined access to subarrays
q  Reduce sharing of the global row buffer → utilizes multiple local row buffers
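To illustrate, a toy C++ address decomposition shows when SALP can overlap two requests that conflict in the same bank; the mapping below is hypothetical, as real address mappings are device-specific.

#include <cstdint>

struct DramLoc { uint32_t bank, subarray, row; };

DramLoc decodeRow(uint64_t globalRow, uint32_t banks,
                  uint32_t subarraysPerBank, uint32_t rowsPerSubarray) {
    DramLoc d;
    d.bank     = (uint32_t)(globalRow % banks);  // toy interleaving
    uint64_t r = globalRow / banks;
    d.subarray = (uint32_t)((r / rowsPerSubarray) % subarraysPerBank);
    d.row      = (uint32_t)(r % rowsPerSubarray);
    return d;
}

// A bank conflict that SALP can parallelize across local row buffers:
bool salpCanOverlap(const DramLoc& a, const DramLoc& b) {
    return a.bank == b.bank && a.subarray != b.subarray;
}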

Page 132:

SALP: Reduce Sharing of Global Decoder

[Figure: a bank in which the global decoder drives multiple subarrays, each with its own local row buffer above the shared global row buffer.]

Instead of a global latch, have per-subarray latches.

Page 133:

SALP: Reduce Sharing of Global Row-Buffer

[Figure: local row buffers connect to the global row buffer through global bitlines via switches controlled by designated (D) single-bit latches.]

Selectively connect local row buffers to the global row buffer using a designated single-bit latch.

Page 134:

SALP: Baseline Bank Organization

[Figure: baseline bank with a global decoder and a single latch, per-subarray local row buffers, global bitlines, and a global row buffer.]

Page 135:

SALP: Proposed Bank Organization

[Figure: per-subarray latches at the global decoder and designated-bit (D) latches that selectively connect each local row buffer to the global bitlines.]

Overhead of SALP in a DRAM chip: 0.15%
1. Global latch → per-subarray local latches
2. Designated-bit latches and a wire to selectively enable a subarray

Page 136:

SALP: Results

n  Wide variety of systems with different numbers of channels, banks, ranks, and subarrays
n  Server, streaming, random-access, and SPEC workloads

n  Dynamic DRAM energy reduction: 19%
q  DRAM row hit rate improvement: 13%

n  System performance improvement: 17%
q  Within 3% of ideal (all independent banks)

n  DRAM die area overhead: 0.15%
q  vs. 36% overhead of independent banks

Page 137:

More on SALP

n  Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, and Onur Mutlu, "A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM" Proceedings of the 39th International Symposium on Computer Architecture (ISCA), Portland, OR, June 2012. Slides (pptx)

Page 138:

Coordinated Memory and Storage with NVM

Meza, Luo, Khan, Zhao, Xie, and Mutlu, "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory," WEED 2013.

Page 139:

Overview

n  Traditional systems have a two-level storage model
q  Access volatile data in memory with a load/store interface
q  Access persistent data in storage with a file system interface
q  Problem: Operating system (OS) and file system (FS) code and buffering for storage lead to energy and performance inefficiencies

n  Opportunity: New non-volatile memory (NVM) technologies can help provide fast (similar to DRAM), persistent storage (similar to Flash)
q  Unfortunately, OS and FS code can easily become energy efficiency and performance bottlenecks if we keep the traditional storage model

n  This work makes a case for hardware/software cooperative management of storage and memory within a single level
q  We describe the idea of a Persistent Memory Manager (PMM) for efficiently coordinating storage and memory, and quantify its benefit
q  And examine questions and challenges to address to realize PMM

Page 140:

A Tale of Two Storage Levels

n  Two-level storage arose in systems due to the widely different access latencies and methods of the commodity storage devices
q  Fast, low-capacity, volatile DRAM → working storage
q  Slow, high-capacity, non-volatile hard disk drives → persistent storage

n  Data from slow storage media is buffered in fast DRAM
q  After that, it can be manipulated by programs → programs cannot directly access persistent storage
q  It is the programmer's job to translate this data between the two formats of the two-level storage (files and data structures)

n  Locating, transferring, and translating data and formats between the two levels of storage can waste significant energy and performance

Page 141:

Opportunity: New Non-Volatile Memories

n  Emerging memory technologies provide the potential for unifying storage and memory (e.g., Phase-Change, STT-RAM, RRAM)
q  Byte-addressable (can be accessed like DRAM)
q  Low latency (comparable to DRAM)
q  Low power (idle power better than DRAM)
q  High capacity (closer to Flash)
q  Non-volatile (can enable persistent storage)
q  May have limited endurance (but better than Flash)

n  Can provide fast access to both volatile data and persistent storage

n  Question: If such devices are used, is it efficient to keep a two-level storage model?

Page 142:

Eliminating Traditional Storage Bottlenecks

[Figure: normalized total energy for three systems running PostMark. HDD baseline: today's DRAM + HDD with the two-level storage model (1.0). NVM baseline: HDD replaced with PCM-like NVM, keeping the two-level storage model (0.065). Persistent memory: HDD and DRAM replaced with PCM-like NVM and all OS+FS overhead eliminated (0.013).]

Page 143:

Where is Energy Spent in Each Model?

[Figure: fraction of total energy (user CPU, syscall CPU, DRAM, NVM, HDD) for the HDD baseline, NVM baseline, and persistent memory, running PostMark. In the HDD baseline, HDD access wastes energy. In the NVM baseline, FS/OS overhead becomes important, and the buffering overhead of the two-level model adds DRAM energy. Persistent memory has no FS/OS overhead and no additional buffering overhead in DRAM.]

Page 144:

Our Proposal: Coordinated HW/SW Memory and Storage Management

n  Goal: Unify memory and storage to eliminate wasted work to locate, transfer, and translate data
q  Improve both energy and performance
q  Simplify the programming model as well

Page 145:

Our Proposal: Coordinated HW/SW Memory and Storage Management

Before: Traditional Two-Level Store

[Figure: the processor and caches access main memory through a load/store interface (virtual memory, address translation), while storage (SSD/HDD) is reached through the operating system and file system (fopen, fread, fwrite, …).]

Page 146:

Our Proposal: Coordinated HW/SW Memory and Storage Management

After: Coordinated HW/SW Management

[Figure: the processor and caches issue loads and stores directly to a Persistent Memory Manager, which manages persistent (e.g., phase-change) memory and provides feedback to the system.]

Page 147:

The Persistent Memory Manager (PMM)

n  Exposes a load/store interface to access persistent data (a sketch of a possible interface follows)
q  Applications can directly access persistent memory → no conversion, translation, or location overhead for persistent data

n  Manages data placement, location, persistence, and security
q  To get the best of multiple forms of storage

n  Manages metadata storage and retrieval
q  This can lead to overheads that need to be managed

n  Exposes hooks and interfaces for system software
q  To enable better data placement and management decisions
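The WEED 2013 paper does not fix a concrete API; purely as a sketch, a software-visible PMM interface might look like the C++ below (every name and signature here is hypothetical).

#include <cstddef>

class PersistentMemoryManager {
public:
    virtual ~PersistentMemoryManager() = default;

    // Map a persistent object into the address space by handle; the data
    // is then accessed with ordinary loads and stores (no read()/write()).
    virtual void* open(const char* handle, std::size_t size) = 0;

    // Optional placement hints so the PMM can spread data across
    // DRAM/NVM/Flash/HDD according to access behavior.
    enum class Hint { HotReadMostly, HotReadWrite, ColdArchival };
    virtual void hint(void* object, Hint h) = 0;
};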

Page 148:

The Persistent Memory Manager

n  Persistent Memory Manager
q  Exposes a load/store interface to access persistent data
q  Manages data placement, location, persistence, security
q  Manages metadata storage and retrieval
q  Exposes hooks and interfaces for system software

n  Example program manipulating a persistent object:

int main(void) {
    // data in file.dat is persistent
    int *myData = new PersistentObject("file.dat");  // create persistent object and its handle
    myData = new int[64];                            // allocate a persistent array and assign
}

void updateValue(int n, int value) {
    int *myData = PersistentObject.open("file.dat");
    myData[n] = value;  // value is persistent, via the load/store interface
}

Page 149:

Putting Everything Together

[Figure: applications issue loads and stores to a Persistent Memory Manager. The PMM spans software (data layout, persistence, metadata, security, ...) and hardware, takes hints from SW/OS/runtime, and controls a heterogeneous array of devices: DRAM, Flash, NVM, HDD.]

The PMM uses access and hint information to allocate, locate, migrate, and access data in the heterogeneous array of devices.

Page 150:

Opportunities and Benefits

n  We’ve identified at least five opportunities and benefits of a unified storage/memory system that gets rid of the two-level model:

1.  Eliminating system calls for file operations

2.  Eliminating file system operations

3.  Efficient data mapping/location among heterogeneous devices

4.  Providing security and reliability in persistent memories

5.  Hardware/software cooperative data management

Page 151:

Evaluation Methodology

n  Hybrid real-system / simulation-based approach
q  System calls are executed on the host machine (functional correctness) and timed to accurately model their latency in the simulator
q  The rest of execution is simulated in Multi2Sim (enables hardware-level exploration)

n  Power evaluated using McPAT and memory power models

n  16 cores, 4-wide issue, 128-entry instruction window, 1.6 GHz

n  Volatile memory: 4GB DRAM, 4KB page size, 100-cycle latency

n  Persistent memory
q  HDD (measured): 4ms seek latency, 6Gbps bus rate
q  NVM (modeled after PCM): 4KB page size, 160-/480-cycle (read/write) latency

Page 152:

Evaluated Systems

n  HDD Baseline (HB)
q  Traditional system with volatile DRAM memory and persistent HDD storage
q  Overheads of operating system and file system code and buffering

n  HDD without OS/FS (HW)
q  Same as HDD Baseline, but with the ideal elimination of all OS/FS overheads
q  System calls take 0 cycles (but HDD access takes normal latency)

n  NVM Baseline (NB)
q  Same as HDD Baseline, but HDD is replaced with NVM
q  Still has OS/FS overheads of the two-level storage model

n  Persistent Memory (PM)
q  Uses only NVM (no DRAM) to ensure full-system persistence
q  All data accessed using loads and stores
q  Does not waste energy on system calls
q  Data is manipulated directly on the NVM device

Page 153:

Evaluated Workloads

n  Unix utilities that manipulate files
q  cp: copy a large file from one location to another
q  cp -r: copy files in a directory tree from one location to another
q  grep: search for a string in a large file
q  grep -r: search for a string recursively in a directory tree

n  PostMark: an I/O-intensive benchmark from NetApp
q  Emulates typical access patterns for email, news, web commerce

n  MySQL Server: a popular database management system
q  OLTP-style queries generated by Sysbench
q  MySQL (simple): single, random read to an entry
q  MySQL (complex): reads/writes 1 to 100 entries per transaction

Page 154:

Performance Results

[Figure: normalized execution time, broken into user CPU, user memory, syscall CPU, and syscall I/O, for the HDD, NVM, and PM systems across cp, cp -r, grep, grep -r, PostMark, MySQL (simple), and MySQL (complex).]

The workloads that see the greatest improvement from using a persistent memory are those that spend a large portion of their time executing system call code due to the two-level storage model.

Page 155:

Energy Results: NVM to PMM

[Figure: normalized energy consumption, broken into user CPU, syscall CPU, DRAM, NVM, and HDD, for the HDD, NVM, and PM systems across the same workloads.]

Between systems with and without OS/FS code, energy improvements come from: 1. reduced code footprint, 2. reduced data movement.

Large energy reductions with a PMM over the NVM-based system.

Page 156:

Scalability Analysis: Effect of PMM Latency

[Figure: normalized execution time, broken into user CPU, user memory, syscall CPU, syscall I/O, and PMM time, comparing the NVM baseline (NB) against PMM latencies of 1, 10, and 50 cycles per access across the same workloads; the tallest bar is clipped at 1.53.]

Even if each PMM access takes a non-overlapped 50 cycles (conservative), PMM still provides an overall improvement compared to the NVM baseline.

Future research should target keeping PMM latencies in check.

Page 157:

New Questions and Challenges

n  We identify and discuss several open research questions:
q  Q1. How to tailor applications for systems with persistent memory?
q  Q2. How can hardware and software cooperate to support a scalable, persistent single-level address space?
q  Q3. How to provide efficient backward compatibility (for two-level stores) on persistent memory systems?
q  Q4. How to mitigate potential hardware performance and energy overheads?

Page 158:

Single-Level Stores: Summary and Conclusions

n  The traditional two-level storage model is inefficient in terms of performance and energy
q  Due to OS/FS code and buffering needed to manage two models
q  Especially so in future devices with NVM technologies, as we show

n  New non-volatile memory based persistent memory designs that use a single-level storage model to unify memory and storage can alleviate this problem

n  We quantified the performance and energy benefits of such a single-level persistent memory/storage design
q  Showed significant benefits from reduced code footprint, data movement, and system software overhead on a variety of workloads

n  Such a design requires more research to answer the questions we have posed and to enable efficient persistent memory managers → can lead to a fundamentally more efficient storage system

Page 159:

End of Backup Slides
