Rethinking DRAM Power Modes for Energy Proportionality
Krishna Malladi1, Ian Shaeffer2, Liji Gopalakrishnan2, David Lo1, Benjamin Lee3, Mark Horowitz1
Stanford University1, Rambus Inc2, Duke University3
Main Memory in Datacenters
- Server power is the main energy bottleneck in datacenters
  - With a PUE of ~1.1, the rest of the system is energy efficient
- Main memory (DRAM) power is significant
  - 25-40% of server power across all utilization points
  - Low dynamic range, so no energy proportionality
Outline
- Inefficiencies of DRAM interfaces
- Energy-proportionality via fast DRAM interfaces
  - MemBlaze
  - MemCorrect
  - MemDrowsy
DDR3 Energy & Powermodes
- DDR3 is optimized for high bandwidth
  - High-speed interface with DLLs, CLKs, ODTs
  - Very high static power in active-idle
- Hard to power down to deep states
  - Long, impractical wakeup time to power up the interface
  - Insufficient idleness in workloads, so significant active-idle time
88%!
Power Mode       DIMM Idle Power (W)   Exit Latency (ns)
Active Idle      5.36                  0
Fast Powerdown   2.79                  20
Deep Powerdown   0.92                  768
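To make the table concrete, here is a minimal sketch that computes average idle DIMM power for a given split of idle time across the three modes. Only the power numbers come from the table above; the function name and the example residency split are assumptions for illustration, not measurements from the talk.

```python
# Idle power per DIMM (W), taken from the table above.
IDLE_POWER_W = {
    "active_idle": 5.36,
    "fast_powerdown": 2.79,
    "deep_powerdown": 0.92,
}

def average_idle_power(residency):
    """Weighted average power for a dict of mode -> fraction of idle time."""
    total = sum(residency.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("residency fractions must sum to 1")
    return sum(IDLE_POWER_W[mode] * frac for mode, frac in residency.items())

# Hypothetical residency split (an assumption, not data from the talk).
example = {"active_idle": 0.5, "fast_powerdown": 0.5, "deep_powerdown": 0.0}
print(round(average_idle_power(example), 3))
```

Shifting residency from active idle toward the deeper modes is what lowers the average, which is why the exit latency column matters: a mode that cannot be entered often contributes little.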
Path to Energy-Proportionality
- Reduce active-idle power
- Reduce time in active-idle
  - Increase time in power-down
- Reduce power-down power
DRAM Interfaces
- Bits are short
  - Sampling window is only 625 ps
- Data (DQ) and clock (CLK) signals are forwarded to the DRAM
  - Write data is aligned to clock edges
- Dynamic chip variations affect reads
  - PVT variations lead to misaligned DQS and CLK signals
  - Non-deterministic read timing leads to incorrect sampling
- On-chip DLLs
  - Adjust delay to match chip temperature and voltage variations
  - Align DQS and DQ to CLK
  - Power hungry with a long settling time, hence poor powermodes
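The 625 ps sampling window is consistent with one bit time at 1600 MT/s. The slide does not name the data rate, so treat that pairing as an assumption; the helper name below is hypothetical.

```python
def bit_time_ps(transfers_per_second):
    """One bit time (unit interval) in picoseconds for a given data rate."""
    return 1e12 / transfers_per_second

# Assumed data rate of 1600 MT/s; this reproduces the 625 ps window.
print(bit_time_ps(1600e6))
```

Any skew between DQS and CLK has to fit inside this window, which is why the interface relies on DLLs to track variation.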
Live with Slow Powerup
- S/W mechanisms
  - Batch requests, or use a subset of ranks, or predict idleness
  - Degrades application performance
  - Degraded device density
- H/W mechanisms
  - Statically disable DLLs in BIOS, which statically lowers bandwidth
  - Worse performance
- Use current deep powermodes
  - Long memory wake-up latency
With Wakeup = 1 µs
- Energy-delay (E-D) curves are flat
- Can't win with long wakeups
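One way to see why long wakeups cannot win is a per-gap energy model. The sketch below is a back-of-the-envelope model, not the simulator behind the E-D curves: the function name, the 100 ns powerdown timer, the 10 ns fast exit, and the assumption that the DIMM draws active-idle power while waking are all illustrative. The power values and the 768 ns exit latency are the DDR3 table values quoted earlier in the talk.

```python
P_ACTIVE_IDLE_W = 5.36     # DDR3 active-idle DIMM power from the talk
P_DEEP_POWERDOWN_W = 0.92  # DDR3 deep-powerdown DIMM power from the talk

def idle_gap_cost(gap_ns, timer_ns, exit_ns):
    """Return (energy_nJ, added_delay_ns) for one idle gap of gap_ns.

    The DIMM waits timer_ns in active idle, then enters deep powerdown.
    The next request pays exit_ns of latency; wakeup is assumed (not
    measured) to draw active-idle power. W * ns gives nJ.
    """
    if gap_ns <= timer_ns:
        return P_ACTIVE_IDLE_W * gap_ns, 0.0
    energy = P_ACTIVE_IDLE_W * timer_ns
    energy += P_DEEP_POWERDOWN_W * (gap_ns - timer_ns)
    energy += P_ACTIVE_IDLE_W * exit_ns
    return energy, float(exit_ns)

stay_awake_nj = P_ACTIVE_IDLE_W * 1000               # never power down, 1 us gap
slow_nj, slow_delay = idle_gap_cost(1000, 100, 768)  # DDR3 deep powerdown exit
fast_nj, fast_delay = idle_gap_cost(1000, 100, 10)   # hypothetical fast exit
print(stay_awake_nj, slow_nj, fast_nj)
```

Under these assumptions, a 1 µs gap costs more energy with the 768 ns exit than never powering down at all, and it adds latency on top. With a 10 ns exit the same gap costs roughly a quarter of the stay-awake energy.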
Faster Wakeups
- Powerup latency should be much shorter
- 100 ns
Fast DRAM Wakeups
- Enabling deep powerdown needs low-latency wakeups
- Rearchitect the interface to reduce wakeup latency
  - Fast wakeup with MemBlaze
- Retain the interface but power down aggressively
  - Speculative wakeup with MemCorrect
  - Lazy wakeup with MemDrowsy
Fast Wakeup with MemBlaze
- No DLL
  - Periodic timing reference signal stores the DRAM offset in the controller
  - Current-mode logic (CML) clocking has fewer variations
- Fast turn-on of datapath
  - Capacitive boosting quickly restores bias values
- Exit latency ~10 ns
MemBlaze DRAM + Controller
- Integrated into DRAMs; fabricated and tested
- More details in the paper
Silicon Results
Methodology
- Workloads
  - Memcached
    - Key/value pairs with 100 B and 10 KB values
    - Zipf popularity distribution with exponential inter-arrival times
  - Yahoo! Cloud Serving Benchmark (YCSB), SPECjbb
  - Multiprogrammed (MP) and Multithreaded (MT)
    - SPEC CPU 2006, SPEC OMP 2001, PARSEC
    - High BW (HB), Medium BW (MB), Low BW (LB)
- Architecture
  - 8 OoO Nehalem cores at 3 GHz, 8 MB shared L3 cache
  - 32 GB DRAM, 2 Gb DDR3-1333 chips
  - Fast powerdown baseline, 15-cycle powerdown timer
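The Memcached workload description (Zipf popularity, exponential inter-arrival times) can be sketched with the Python standard library. This is an illustrative generator, not the harness used in the evaluation: the function names, the Zipf exponent `s`, the key count, and the mean gap are all assumptions.

```python
import random

def zipf_weights(n_keys, s=1.0):
    """Unnormalized Zipf weights: the key at rank r gets weight 1 / r**s."""
    return [1.0 / (rank ** s) for rank in range(1, n_keys + 1)]

def generate_requests(n_requests, n_keys, mean_gap_ns, seed=0, s=1.0):
    """Return a list of (arrival_time_ns, key_index) pairs.

    Keys are drawn with Zipf-distributed popularity; gaps between
    consecutive requests are exponentially distributed.
    """
    rng = random.Random(seed)
    keys = rng.choices(range(n_keys), weights=zipf_weights(n_keys, s), k=n_requests)
    now = 0.0
    requests = []
    for key in keys:
        now += rng.expovariate(1.0 / mean_gap_ns)  # rate = 1 / mean gap
        requests.append((now, key))
    return requests

# Hypothetical parameters: 1000 requests over 100 keys, 500 ns mean gap.
trace = generate_requests(1000, 100, 500.0)
print(trace[0])
```

The gaps between arrivals in such a trace are the idle periods that a powerdown policy has to exploit.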
MemBlaze Evaluation
- 66% lower memory energy with MemBlaze fastlock
- No performance penalty
Speculative Wakeup with MemCorrect
- Fast wakeup
  - Use deep power-down, which powers off the DLL and CLK
  - Transfer speculatively before the long DLL recalibration
- Error detection/correction
  - Detector fires if the power-down period accumulated large skew
  - Corrector waits for recalibration before transfer
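The speculate-then-correct behaviour can be summarized with a simple expected-latency model. This is a sketch under stated assumptions, not the MemCorrect hardware: the function names are hypothetical, and the fast and recalibration latencies passed in below are illustrative placeholders rather than figures reported for MemCorrect.

```python
def memcorrect_wakeup_ns(detector_fired, fast_exit_ns, recal_ns):
    """Latency of one wakeup: transfer speculatively unless the detector fires."""
    return recal_ns if detector_fired else fast_exit_ns

def expected_wakeup_ns(p_correct, fast_exit_ns, recal_ns):
    """Average wakeup latency when timing is correct with probability p_correct."""
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be in [0, 1]")
    return p_correct * fast_exit_ns + (1.0 - p_correct) * recal_ns

# Illustrative values only: 20 ns speculative exit, 768 ns full recalibration.
for p in (1.0, 0.9, 0.5):
    print(p, expected_wakeup_ns(p, 20.0, 768.0))
```

As the probability of correct timing falls, the average wakeup latency approaches the full recalibration latency.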
MemCorrect Evaluation
- Vary the probability of correct timing (p)
  - 40% energy savings (esp. for datacenters)
  - Small p exposes the recalibration latency
    - Degrades performance for high-BW apps
    - Increases energy/bit
Lazy Wakeup with MemDrowsy
- Fast wakeup
  - Wake up from deep powerdown
  - Transfer at a lower rate before DLL recalibration completes
- Reduced sampling rate
  - Lower data rate for READs during calibration time (~700 ns)
  - Transfer each bit multiple times for a wider sampling window
  - Eliminates timing uncertainty
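The idea of sending each bit several times to widen the sampling window can be illustrated with a toy model. This is a software analogy, not the MemDrowsy circuit: the function names, the slot-based bus model, and the `skew_slots` parameter (a stand-in for timing uncertainty, measured in full-rate bit times) are assumptions.

```python
def drowsy_encode(bits, z):
    """Hold each bit on the bus for z consecutive full-rate slots."""
    return [bit for bit in bits for _ in range(z)]

def drowsy_decode(symbols, z, skew_slots=0):
    """Sample once per group of z slots, offset by a timing skew.

    The nominal sample point is slot z // 2 of each group; skew_slots
    moves it earlier or later. Indices are clamped to the bus trace.
    """
    last = len(symbols) - 1
    out = []
    for start in range(0, len(symbols), z):
        idx = min(max(start + z // 2 + skew_slots, 0), last)
        out.append(symbols[idx])
    return out

data = [1, 0, 1, 1, 0]
# With Z = 4, a one-slot skew still lands inside the right group.
print(drowsy_decode(drowsy_encode(data, 4), 4, skew_slots=1) == data)
# At full rate (Z = 1), the same skew samples the neighbouring bit.
print(drowsy_decode(drowsy_encode(data, 1), 1, skew_slots=1) == data)
```

The cost is visible in the encoded length: the bus carries Z times as many slots for the same data, so the effective data rate drops by a factor of Z.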
MemDrowsy Evaluation
- Vary the sampling reduction rate (Z)
  - 40% energy savings for datacenter apps
  - High Z harms both performance and energy/bit
    - Energy per bit increases from wake-ups and higher bus activity
  - Z=2 is more realistic
MemCorrect + MemDrowsy
- Combine MemCorrect and MemDrowsy
  - If an error is detected, halve the sampling rate instead of backing off
  - ≤10% performance penalty
  - 50% energy/bit savings
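A rough way to compare the combined scheme with backoff is the average data rate available during the recalibration window. The model below is a simplification with hypothetical names: it assumes timing is correct with probability p, that backoff transfers nothing until recalibration completes, and that the fallback runs at 1/Z of full rate with Z = 2.

```python
def relative_rate_during_recal(p_correct, fallback_z=None):
    """Average data rate (1.0 = full rate) while the DLL recalibrates.

    fallback_z=None models backoff (no transfer on a detected error);
    fallback_z=2 models halving the sampling rate instead.
    """
    fallback_rate = 0.0 if fallback_z is None else 1.0 / fallback_z
    return p_correct * 1.0 + (1.0 - p_correct) * fallback_rate

for p in (0.9, 0.5, 0.1):
    print(p, relative_rate_during_recal(p), relative_rate_during_recal(p, fallback_z=2))
```

In this model the combined scheme never drops below half rate, whereas backoff degrades toward zero as p falls.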
Conclusion
- DDR3 is energy-disproportional
  - DRAMs dissipate high static power
- DDR3 interfaces are efficiency bottlenecks
  - High active-idle power
  - Long wake-ups from power modes
- Re-architect interfaces with MemBlaze
  - Or use MemCorrect + MemDrowsy
- Provide fast wake-up from power modes
  - Energy efficiency improves by 40-70%
  - Performance impact is ≤10%