Rethinking DRAM Power Modes for Energy Proportionality · 2012-12-08

Rethinking DRAM Power Modes for Energy Proportionality

Krishna Malladi¹, Ian Shaeffer², Liji Gopalakrishnan², David Lo¹, Benjamin Lee³, Mark Horowitz¹

¹Stanford University, ²Rambus Inc., ³Duke University

ktej@stanford.edu

Main Memory in Datacenters

Server power is the main energy bottleneck in datacenters
- PUE of ~1.1, so the rest of the system is energy efficient

Significant main memory (DRAM) power
- 25-40% of server power across all utilization points
- Low dynamic range, so no energy proportionality

Outline

Inefficiencies of DRAM interfaces

Energy proportionality via fast DRAM interfaces
- MemBlaze
- MemCorrect
- MemDrowsy

DDR3 Energy & Powermodes

DDR3 optimized for high bandwidth
- High-speed interface with DLLs, CLKs, ODTs
- Very high static power in active-idle

Hard to power down to deep states
- Long, impractical wakeup time to power up the interface
- Insufficient idleness in workloads, so significant active-idle time

88%!

Power Mode        DIMM Idle Power (W)   Exit Latency (ns)
Active Idle       5.36                  0
Fast Powerdown    2.79                  20
Deep Powerdown    0.92                  768
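
The power-mode table can be turned into relative savings and a residency-weighted average. The sketch below is only illustrative: the power and latency figures are copied from the table, while the function names and the residency split are assumptions made for this example and do not come from the talk.

```python
# Idle power (W) and exit latency (ns) per DIMM, copied from the table.
POWER_MODES = {
    "active_idle": (5.36, 0),
    "fast_powerdown": (2.79, 20),
    "deep_powerdown": (0.92, 768),
}

def savings_vs_active_idle(mode):
    """Fractional idle-power reduction of `mode` relative to active idle."""
    active_w = POWER_MODES["active_idle"][0]
    return 1.0 - POWER_MODES[mode][0] / active_w

def average_idle_power(residency):
    """Residency-weighted idle power; `residency` maps mode -> time fraction."""
    if abs(sum(residency.values()) - 1.0) > 1e-9:
        raise ValueError("residency fractions must sum to 1")
    return sum(POWER_MODES[mode][0] * frac for mode, frac in residency.items())

for mode in POWER_MODES:
    print(f"{mode}: {savings_vs_active_idle(mode):.0%} below active idle")

# Hypothetical residency split, for illustration only.
example = {"active_idle": 0.5, "fast_powerdown": 0.5}
print(f"average idle power: {average_idle_power(example):.3f} W")
```

Under these numbers, fast powerdown removes roughly 48% of the idle power and deep powerdown roughly 83%, at the price of the exit latencies listed in the table.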

Path to Energy-Proportionality

Reduce active-idle power

Reduce time in active-idle
- Increase time in power-down

Reduce power-down power

DRAM Interfaces

Bits are short
- Sampling window is only 625 ps

Data (DQ) and clock (CLK) signals are forwarded to the DRAM
- Write data is aligned to clock edges
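
The 625 ps figure is simply the duration of one bit on a data pin. As a small worked example, the sketch below derives it from a transfer rate. The slide does not name the data rate, so pairing 625 ps with 1600 MT/s (DDR3-1600) is an assumption; the function name is also invented for this example.

```python
def unit_interval_ps(transfers_per_second):
    """Duration of one bit on a data pin, in picoseconds."""
    return 1e12 / transfers_per_second

# 1600 MT/s is assumed here because it reproduces the 625 ps window.
print(unit_interval_ps(1600e6))

# DDR3-1333 (nominally 4000/3 MT/s), the speed grade listed in the methodology.
print(round(unit_interval_ps(4000e6 / 3), 1))
```

Any timing skew between DQS and CLK has to stay well inside this window, which is why the interface needs active alignment.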

Dynamic chip variations affect reads
- PVT variations cause misaligned DQS and CLK signals
- Non-deterministic read timing causes incorrect sampling

On-chip DLLs
- Adjust delay to match chip temperature and voltage variations
- Align DQS and DQ to CLK

Power hungry with a long settling time, hence poor power modes

Live with Slow Powerup

S/W mechanisms
- Batch requests, subset ranks, or predict idleness
- Degrades application performance
- Degraded device density

H/W mechanisms
- Statically disable DLLs in BIOS, which statically lowers bandwidth
- Worse performance

Use current deep powermodes
- Long memory wake-up latency

With Wakeup = 1 µs

E-D (energy-delay) curves are flat
- Can’t win with long wakeups
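
A toy model can show why long wake-ups do not pay off. The sketch below is not the evaluation methodology from the talk: the timeout policy, the idle gaps, the timer value, and the choice to bill the wake-up period at active-idle power are all assumptions made for illustration. Only the 5.36 W and 0.92 W figures come from the DDR3 power-mode table.

```python
def idle_energy_and_delay(gaps_ns, timer_ns, p_idle_w, p_down_w, wakeup_ns):
    """Return (energy in nJ, added delay in ns) over a list of idle gaps.

    Toy policy: stay in active idle for `timer_ns`, then power down. The
    request that ends the gap waits `wakeup_ns`, billed at active-idle power.
    """
    energy_nj = 0.0
    delay_ns = 0.0
    for gap in gaps_ns:
        if gap <= timer_ns:
            energy_nj += p_idle_w * gap               # never powered down
        else:
            energy_nj += p_idle_w * timer_ns          # waiting for the timer
            energy_nj += p_down_w * (gap - timer_ns)  # time spent powered down
            energy_nj += p_idle_w * wakeup_ns         # wake-up cost
            delay_ns += wakeup_ns
    return energy_nj, delay_ns

GAPS_NS = [100, 1000]  # hypothetical idle gaps
TIMER_NS = 200         # hypothetical powerdown timer
always_on_nj = 5.36 * sum(GAPS_NS)

for wakeup_ns in (1000, 100):
    energy, delay = idle_energy_and_delay(GAPS_NS, TIMER_NS, 5.36, 0.92, wakeup_ns)
    print(f"wakeup={wakeup_ns} ns: {energy:.0f} nJ "
          f"(always on: {always_on_nj:.0f} nJ), +{delay:.0f} ns delay")
```

With these made-up gaps, a 1 µs wake-up costs more energy than never powering down, while a 100 ns wake-up roughly halves the idle energy and adds far less delay.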

Faster Wakeups

Powerups should be much smaller

100ns

Fast DRAM Wakeups

Enabling deep powerdown needs low-latency wakeups

Rearchitect the interface to reduce wakeup latency
- Fast wakeup with MemBlaze

Retain the interface but power down aggressively
- Speculative wakeup with MemCorrect
- Lazy wakeup with MemDrowsy

Fast Wakeup with MemBlaze

No DLL
- Periodic timing reference signal stores the DRAM offset in the controller
- Current-mode logic (CML) clocking has fewer variations

Fast turn-on of datapath
- Capacitive boosting quickly restores bias values
- Exit latency ~10 ns

MemBlaze DRAM + Controller

Integrated into DRAMs. Fabricated and tested
- More details in the paper

Silicon Results

Methodology

Workloads
- Memcached
  - Key/value pairs with 100B and 10KB values
  - Zipf popularity distribution with exponential inter-arrival times
- Yahoo! Cloud Benchmark (YCSB), SPECjbb
  - Multiprogrammed (MP) and Multithreaded (MT)
- SPEC CPU 2006, SPEC OMP 2001, PARSEC
  - High BW (HB), Medium BW (MB), Low BW (LB)

Architecture
- 8 OoO Nehalem cores at 3 GHz, 8 MB shared L3 cache
- 32 GB DRAM, 2 Gb DDR3-1333 chips
- Fast powerdown baseline, 15-cycle powerdown timer

MemBlaze Evaluation

66% lower memory energy with MemBlaze fastlock
- No performance penalty

Speculative Wakeup with MemCorrect

Fast wakeup
- Use deep power-down, which powers off the DLL and CLK
- Transfer speculatively before the long DLL recalibration

Error detection/correction
- Detector fires if the power-down period accumulated large skew
- Corrector waits for recalibration before transfer

MemCorrect Evaluation

Vary the probability of correct timing (p)
- 40% energy savings (esp. for datacenters)
- Small p exposes the recalibration latency

Degrades performance for high-BW apps
- Increases energy/bit
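
The effect of p on exposed latency can be written as a simple expectation. The sketch below assumes that a correct speculation pays only a short wake-up and that a detected error pays the full recalibration time. The function name and the 10 ns speculative latency are hypothetical; 768 ns is the deep-powerdown exit latency from the DDR3 power-mode table, used here as a stand-in for the recalibration time.

```python
def expected_wakeup_ns(p_correct, speculative_ns, recalibration_ns):
    """Expected wake-up latency when speculation succeeds with probability p."""
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must lie in [0, 1]")
    return p_correct * speculative_ns + (1.0 - p_correct) * recalibration_ns

# Sweep p with a hypothetical 10 ns speculative wake-up.
for p in (1.0, 0.9, 0.5, 0.0):
    print(f"p={p:.1f}: {expected_wakeup_ns(p, 10.0, 768.0):.1f} ns")
```

As p shrinks, the expectation approaches the full recalibration latency, which matches the slide's point that small p exposes it.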

Lazy Wakeup with MemDrowsy

Fast wakeup
- Wake up from deep powerdown
- Transfer at a lower rate before DLL recalibration completes

Reduced sampling rate
- Lower data rate for READs during calibration time (~700 ns)

Transfer each bit multiple times
- Wider sampling window
- Eliminates timing uncertainty
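
Sending each bit several times widens the sampling window by the same factor and stretches every burst accordingly. The sketch below illustrates that trade-off under stated assumptions: a 750 ps bit time (approximately DDR3-1333, the speed grade in the methodology) and the standard DDR3 burst length of 8. The function name and the parameter `z` as a repetition count are assumptions for this example.

```python
def drowsy_transfer(z, bit_time_ps=750.0, burst_length=8):
    """Return (sampling window in ps, burst duration in ns) when each bit is sent z times."""
    if z < 1:
        raise ValueError("z must be at least 1")
    window_ps = z * bit_time_ps
    burst_ns = burst_length * window_ps / 1000.0
    return window_ps, burst_ns

for z in (1, 2, 4):
    window_ps, burst_ns = drowsy_transfer(z)
    print(f"Z={z}: window {window_ps:.0f} ps, burst {burst_ns:.1f} ns")
```

The wider window tolerates more skew while the DLL is still recalibrating, but the bus stays busy longer per transfer.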

MemDrowsy Evaluation

Vary the sampling reduction rate (Z)
- 40% energy savings for datacenter apps
- High Z harms both performance and energy/bit

Energy per bit increases from wake-ups and higher bus activity
- Z=2 is more realistic

MemCorrect + MemDrowsy

Combine MemCorrect and MemDrowsy
- If an error is detected, halve the sampling rate instead of backing off
- ≤10% performance penalty
- 50% energy/bit savings
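
The combined policy can be sketched as a small decision function. This is an illustration of the rule on the slide and not the controller design: the names `Transfer`, `choose_transfer`, and `expected_burst_ns` are hypothetical, and the 6 ns full-rate burst (8 transfers at 750 ps) is an assumption.

```python
from enum import Enum

class Transfer(Enum):
    FULL_RATE = "full rate"
    HALF_RATE = "half rate"

def choose_transfer(detector_fired, recalibrated):
    """Pick the transfer mode for a request that arrives after a wake-up."""
    if recalibrated or not detector_fired:
        return Transfer.FULL_RATE  # timing is trusted: transfer normally
    return Transfer.HALF_RATE      # error detected: halve the rate, do not back off

def expected_burst_ns(p_correct, full_rate_burst_ns=6.0):
    """Expected burst time if a detected error doubles the burst duration."""
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must lie in [0, 1]")
    return p_correct * full_rate_burst_ns + (1.0 - p_correct) * 2.0 * full_rate_burst_ns

print(choose_transfer(detector_fired=True, recalibrated=False).value)
print(expected_burst_ns(0.5))
```

In this model the penalty for a wrong speculation is one slower burst, not a full recalibration wait.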

Conclusion

DDR3 is energy-disproportional
- DRAMs dissipate high static power

DDR3 interfaces are efficiency bottlenecks
- High active-idle power
- Long wake-ups from power modes

Re-architect interfaces with MemBlaze
- Or use MemCorrect + MemDrowsy

Provide fast wake-up from power modes
- Energy efficiency improves by 40-70%
- Performance impact is ≤ 10%

Thank you for your attention!

Questions?

ktej@stanford.edu