1
Decoupled DIMM: Building High-Bandwidth Memory System Using Low Speed DRAM Devices
Hongzhong Zheng1, Jiang Lin3, Zhao Zhang2, and Zhichun Zhu1
1Department of ECE, University of Illinois at Chicago
2Department of ECE, Iowa State University
3Austin Research Lab, IBM Corp.
2
Outline
Challenges in DRAM memory system designs: bandwidth, capacity, thermal, and power
Motivation and background
Decoupled DIMM architecture
Memory performance, cost, and/or power optimization
Experimental methodology
Result analysis
Conclusion
3
Challenges in DRAM memory system designs
Multi-core processors: increasing demands on memory bandwidth, capacity, power, and thermal budget
Advancements on memory systems: DDR/DDR2/DDR3, Rambus XDR, FB-DIMM, MetaRAM, Registered DIMM
4
Memory Channel Design Challenges
Building a high-bandwidth channel is expensive: a high-bandwidth channel is costly and power-hungry, and a high-density DRAM device is costly and arrives late.
Limited by DRAM device technology: channel bandwidth evolves no faster than the DRAM devices themselves.

Example (DRAM/channel roadmap):

| Channel   | Device (MT/s) | BW/CH (GB/s) | 4GB-x4-DR (W) | 4GB-x4-DR ($) | I/O roadmap (1Gb) | I/O roadmap (4Gb) | Total power (W) | Total cost ($) |
| DDR2-667  | 667  | 5.3  | 10.8 | 83  | 2004 | 2005 | 65  | 498  |
| DDR2-800  | 800  | 6.4  | 12.9 | 109 | 2006 | 2007 | 78  | 654  |
| DDR3-800  | 800  | 6.4  | 8.0  | 133 | 2007 | 2008 | 48  | 800  |
| DDR3-1066 | 1066 | 8.5  | 9.9  | 180 | 2008 | 2009 | 59  | 1080 |
| DDR3-1333 | 1333 | 10.6 | 11   | 243 | 2009 | 2010 | 66  | 1458 |
| DDR3-1600 | 1600 | 12.8 | N/A  | N/A | 2010 | 2011 | N/A | N/A  |
| DDR3-2133 | 2133 | 17   | N/A  | N/A | 2012 | 2013 | N/A | N/A  |

Notes: Kingston 4GB registered ECC DIMM; power based on 2Gbit-x4 Micron device at 80% channel utilization. Totals assume a 3-channel, 24GB system; Xeon 2.66GHz: $1000.
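The total power and total cost columns are the per-DIMM figures scaled by the six DIMMs of the 3-channel, 24 GB example. A minimal sketch of that arithmetic (values copied from the table above):

```python
# Totals for the slide's 3-channel, 24 GB example: 3 channels x 2 DIMMs/channel
# = six 4 GB DIMMs, so totals are roughly 6x the per-DIMM figures.
dimms = 3 * 2

per_dimm = {  # name: (power in W, cost in $) per 4GB-x4-DR DIMM, from the table
    "DDR2-667":  (10.8,  83),
    "DDR3-1066": ( 9.9, 180),
    "DDR3-1333": (11.0, 243),
}

for name, (watts, dollars) in per_dimm.items():
    # e.g. DDR2-667: ~65 W and $498 total, matching the table
    print(name, round(dimms * watts), dimms * dollars)
```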
5
Conventional Memory Channel Organization
Channel speed is bound to DRAM device speed: Rank BW = Channel BW. This matching is not necessary when there are multiple ranks per channel.
With multiple ranks per channel, ∑ Rank BW > Channel BW, so the DRAM devices are NOT fully utilized; the channel is the bandwidth bottleneck.
[Figure: memory controller driving a DDRx data and command bus at 1066 MT/s (x64, 8.5 GB/s); 2 DIMMs/channel, 2 ranks/DIMM; each rank also presents a 1066 MT/s x64 interface]
∑ Rank BW (34 GB/s) > Channel BW (8.5 GB/s)
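The bandwidth mismatch above is simple arithmetic; a small sketch with the figure's numbers:

```python
# Peak bandwidth of a DDRx channel: transfer rate (MT/s) x bus width (bytes).
# Conventional organization from the slide: 1066 MT/s, x64 (8-byte) bus,
# 2 DIMMs/channel x 2 ranks/DIMM = 4 ranks, each capable of the full rate.
def peak_bw_gbs(mts, bytes_per_transfer=8):
    return mts * bytes_per_transfer / 1000.0  # GB/s

channel_bw = peak_bw_gbs(1066)       # ~8.5 GB/s on the shared bus
ranks_bw = 4 * peak_bw_gbs(1066)     # ~34 GB/s aggregate device bandwidth
print(channel_bw, ranks_bw)          # the channel, not the devices, is the bottleneck
```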
6
High-Speed I/O Technology Is Available
High-speed I/O is improving faster than DRAM speed; the slow evolution of DRAM speed is the bottleneck for building high-bandwidth memory channels.
DRAM is optimized for capacity and cost, NOT for speed.
[Figure: DRAM I/O bandwidth vs. high-speed I/O bandwidth (ITRS), in Gb/s/pin over 1995-2025; DRAM I/O grows from 667 Mb/s to 1333 Mb/s while high-speed I/O reaches 6.4, 11, and 15 Gb/s]
7
Decoupled DIMM
High-bandwidth channel + low-speed DRAM devices?
Memory channel design freed from the DRAM evolution bottleneck, with benefits in performance, cost, and/or power efficiency.
Design consideration: no changes to DRAM devices.
Decoupled DIMM: add a bridge chip (synchronization buffer) to each DIMM/rank, breaking the unnecessary bandwidth matching by separating the two clock domains (channel vs. DRAM) and decoupling DRAM I/O technology from channel I/O technology.
8
Decoupled DIMM Design
[Figure: single DDR2/3 channel; memory controller drives the DDRx data and command bus at 2133 MT/s (x64); each rank sits behind a synchronization buffer (SYB) and runs at 1066 MT/s/rank (x64); Channel BW > Rank BW]
Building a high-bandwidth channel using low-speed DRAM devices.
Synchronization buffer (SYB): separates the two clock domains, buffers data and commands, introduces a small latency penalty, and breaks the bandwidth matching.
Channel BW > Rank BW: DDR3-1066 devices feed a 2133 MT/s channel.
DRAM freq. : channel freq. ratios: 1:m (e.g., 1:2, 1:3) or n:m (e.g., 2:3, 3:5).
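The supported frequency ratios determine what channel rate a given device can feed; a minimal sketch of that relationship (the deck nominally rounds 2x1066 up to 2133 MT/s):

```python
from fractions import Fraction

# Decoupled DIMM runs the channel faster than the devices; the SYB bridges the
# two clock domains at a fixed ratio device:channel. Illustrative sketch only.
def channel_rate(device_mts, ratio):
    """Channel transfer rate for a device rate and a device:channel ratio."""
    return device_mts / ratio

# Ratios named on the slide: 1:2, 1:3 (1:m) and 2:3, 3:5 (n:m)
for r in [Fraction(1, 2), Fraction(1, 3), Fraction(2, 3), Fraction(3, 5)]:
    print(r, channel_rate(1066, r))  # e.g. 1:2 -> 2132 MT/s (nominal 2133)
```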
9
Significantly Increasing Memory Throughput
Example: 2CH-2D-2R, DDR3-1066 devices; channel at 1066 MT/s vs. channel at 2133 MT/s (workload: swim+applu+art+lucas).
[Chart: average channel throughput (GB/s) and average rank utilization for D1066-B1066 vs. D1066-B2133]
Doubling the channel BW significantly improves memory throughput: 88% higher throughput (6.7 GB/s), with rank utilization rising from 22% (1066 MT/s/CH) to 41% (2133 MT/s/CH).
10
Benefits: Building a High-Bandwidth Channel Using Low-Speed DRAM Devices
High performance with high bandwidth: Channel BW > DRAM BW.
Low cost and high density: low-speed DRAM devices are cheap and dense, yet feed a high-BW channel.
Power/energy efficiency: operate DRAM at low speed while keeping high channel BW.
More DIMMs per channel: buffering CMD/data reduces the electrical load of each DIMM.
Good reliability: standard voltage supply with a high-BW channel.
11
Synchronization Buffer Design
[Figure: SYB block diagram. Channel side: x64 DDRx data interface to/from the DDRx bus; DDRx control interface receiving CMD/address from the DDRx bus; DLL (delay/phase-locked loop) supplying the clock to the DRAM devices. Device side: x8 data interface to/from the DRAM devices; control interface forwarding CMD/address to the DRAM devices. Internally, RD, WR, and CMD entries buffer data and commands.]
Components: DDRx data interface with the bus; DDRx control interface; delay/phase-locked loop; data interface with DRAM devices; control interface with DRAM devices; data/CMD entries inside the SYB.
12
Memory Access Scheduling
The two-level bus with the SYB extends the data transfer time, since the SYB relays commands and data. For example, with DRAM devices : channel = 1 : 2 (2133 MT/s channel with DDR3-1066 devices), the latency penalty is 2 device cycles: 1 cycle of CMD delay + 1 cycle of data delay.
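The relay penalty above is easy to put in nanoseconds. A sketch, assuming the penalty model stated on the slide (one device cycle each for command and data; the cycle count here is an illustration, not the paper's exact timing model):

```python
# SYB relay penalty: the bridge adds one device clock cycle to forward the
# command and one to forward the data (2 device cycles total at a 1:2 ratio).
def syb_penalty_ns(device_clock_mhz, cmd_cycles=1, data_cycles=1):
    device_cycle_ns = 1000.0 / device_clock_mhz
    return (cmd_cycles + data_cycles) * device_cycle_ns

# DDR3-1066 devices clock internally at 533 MHz (1066 MT/s, double data rate),
# so the added latency is roughly 2 x 1.88 ns.
print(syb_penalty_ns(533))  # ~3.75 ns
```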
13
DIMM Power Breakdown of a Memory-Intensive Workload (2CH-2D-2R-x8, swim+applu+art+lucas)
[Chart: average power (W) broken down into background, operation, read/write, I/O with channel, and SYB overhead]
Background: related to power-state transitions and power-management policies.
Operation: activation + precharge.
Read/write.
I/O power: driving output + termination.
SYB overhead.

Power Saving of Decoupled DIMM with Given Channel Bandwidth
[Chart: D1600-B1600 vs. D800-B1600 with 2GB DIMMs; throughput 22 GB/s vs. 20 GB/s; per-component power reductions of 23%, 31%, 24%, and 15% (765 mW)]
14
Energy Saving by Decoupled DIMM

| Parameter | 1600MT/s channel & DDR3-1600 | 1600MT/s channel & DDR3-800 | Comments |
| BW (MB/s/channel) | 12800 | 12800 | Same channel BW |
| Device freq. (MHz) | 800 | 400 | DRAM devices operating at low speed |
| Tpre, Tact, Tcol (ns) | 13.75 | 15 | Small change in operation delay |
| Operating cur. (mA) | 120 | 90 | 25% power reduction on each operation |
| Background: active standby cur. (mA) | 65 | 50 | >23% power reduction on background, applied most of the time |
| Tbl data burst time (ns) | 5 | 10 | 2x data burst time with low-speed devices |
| Read/write cur. (mA) | 250 | 130 | Nearly half of the read/write power |
| SYB latency overhead (ns) | 0 | 2.50 | SYB latency overhead for one more I/O |
| SYB power overhead (mW) | 0 | 382/rank | SYB power overhead for one more I/O |

Operation energy saving: 25% power reduction with only a slight change in operation delay.
Background energy saving: >23% power reduction, applied most of the time.
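Per-operation energy follows from the table's currents and delays. A sketch of that comparison, assuming a typical DDR3 supply of 1.5 V (the voltage is an assumption, not stated on the slide):

```python
# Per-operation energy scales with current x voltage x time. Currents and
# delays are the table's values; VDD = 1.5 V is an assumed DDR3 supply.
VDD = 1.5  # volts (assumption)

def op_energy_nj(current_ma, time_ns):
    return current_ma * 1e-3 * VDD * time_ns  # mA * V * ns -> nJ

e_1600 = op_energy_nj(120, 13.75)  # DDR3-1600: 120 mA over 13.75 ns
e_800  = op_energy_nj(90, 15)      # DDR3-800:   90 mA over 15 ns
print(1 - e_800 / e_1600)          # ~18% less energy per operation
```

So the 25% current reduction outweighs the slightly longer operation delay, which is the point the slide is making.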
15
Experimental Methodology
M5 plus a detailed memory performance and power simulator.
Multi-programmed workloads formed from SPEC CPU2000.
Power model based on the Micron power calculator.
Power management policy: transition to a low-power mode when a rank has no pending requests after 7.5 ns.
CC-Slow: cache-line interleaving, close-page mode, precharge power-down slow low-power mode (128 mW, 11.25 ns exit latency).
PO-Fast: page interleaving, open-page mode, active power-down low-power mode (578 mW, 7.5 ns exit latency).
16
Major Simulation Parameters

| Parameter | Value |
| Processor | 4 cores, 3.2 GHz, 4-issue per core, 16-stage pipeline |
| Functional units | 4 IntALU, 2 IntMult, 2 FPALU, 1 FPMult |
| IQ, ROB and LSQ size | IQ 64, ROB 196, LQ 32, SQ 32 |
| Physical register num | 228 Int, 228 FP |
| Branch predictor | Hybrid, 8K global + 2K local, 16-entry RAS, 4K-entry 4-way BTB |
| L1 caches (per core) | 64KB Inst/64KB Data, 2-way, 64B line, hit latency: 1-cycle Inst / 3-cycle Data |
| L2 cache (shared) | 4MB, 4-way, 64B line, 15-cycle hit latency |
| MSHR entries | Inst: 8, Data: 32, L2: 64 |
| Memory | 4/2/1 channels, 2 DIMMs/channel, 2 ranks/DIMM, 8 banks/rank, 1GB/rank |
| Memory controller | 128-entry buffer, 15 ns overhead |
| DDR3 channel bandwidth | 800/1066/1333/1600 MT/s (mega-transfers/s), 8 bytes/channel |
| DDR3 DRAM latency | DDR3-800: 6-6-6, DDR3-1066: 8-8-8, DDR3-1333: 10-10-10, DDR3-1600: 11-11-11 |
17
Workload Applications
MEM-1 swim,applu,art,lucas
MEM-2 fma3d,mgrid,galgel,equake
MEM-3 swim,applu,galgel,equake
MEM-4 art,lucas,mgrid,fma3d
MDE-1 ammp,gap,wupwise,vpr
MDE-2 mcf,parser,twolf,facerec
MDE-3 apsi,bzip2,ammp,gap
MDE-4 wupwise,vpr,mcf,parser
ILP-1 vortex,gcc,sixtrack,mesa
ILP-2 perlbmk,crafty,gzip,eon
ILP-3 vortex,gcc,gzip,eon
ILP-4 sixtrack,mesa,perlbmk,crafty
Workloads
Multi-programmed workloads randomly selected from SPEC CPU2000:
MEM (memory-intensive), MDE (moderate), ILP (compute-intensive).
Simulation points are picked by SimPoint.
Performance metrics: weighted speedup and harmonic mean of normalized IPCs.
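Both metrics have standard definitions; a sketch with hypothetical per-core IPC values (the numbers below are illustrative, not measurements from the paper):

```python
# Weighted speedup: sum over programs of IPC_shared / IPC_alone.
# Harmonic mean of normalized IPCs: a fairness-sensitive variant of the same.
def weighted_speedup(ipc_shared, ipc_alone):
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

def hmean_norm_ipc(ipc_shared, ipc_alone):
    norm = [s / a for s, a in zip(ipc_shared, ipc_alone)]
    return len(norm) / sum(1.0 / x for x in norm)

shared = [0.8, 0.5, 1.2, 0.9]  # hypothetical per-core IPCs in a 4-program mix
alone  = [1.0, 1.0, 1.5, 1.0]  # hypothetical IPCs when each runs alone
print(weighted_speedup(shared, alone), hmean_norm_ipc(shared, alone))
```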
18
Average Performance Impact of Decoupled DIMM with Different Memory Configurations
[Chart: average performance of Decoupled DIMM with given DRAM devices; normalized weighted speedup of D1066-B1066, D1066-B2133, and D2133-B2133 for MEM/MDE/ILP workloads under 1CH-2D-2R, 2CH-2D-2R, and 4CH-2D-2R configurations. Annotated gains: 79%, 55%, 25% (MEM, across 1/2/4 channels) and 12%, 5%, 5%; losses relative to D2133-B2133 are only -10%, -9%, -8%.]
19
Performance Comparison of Decoupled DIMM Design with Conventional DDR3-1066/1333/1600 Designs
Trade-offs of the Decoupled DIMM design (2CH-2D-2R):
[Chart: normalized weighted speedup of D1066-B1066, D1333-B1333, D1600-B1600, D1066-B2133, and D1333-B2667 for MEM-1, MDE-1, ILP-1 and the MEM/MDE/ILP averages. Annotated gains: 7% (small impact), 16%, 19%, 37%, 47%, 55%, 76%, 83%, 111%; D1066-B2133 vs. D1333-B1333: 36%; D1333-B2667 vs. D1600-B1600: 28%.]
20
Power and Performance Impact with Given Channel Bandwidth (2CH-2D-2R)
[Chart: performance of Decoupled DIMM with given channel bandwidth; normalized weighted speedup of D1600-B1600, D1333-B1600, D1066-B1600, and D800-B1600 for MEM-AVG, MDE-AVG, and ILP-AVG. Performance impact is small: -8.1%, -2.5%, -0.7%.]
[Chart: power (W) of the same configurations for MEM-AVG, MDE-AVG, and ILP-AVG; Decoupled DIMM saves 16%, 10%, and 8% of power.]
21
Performance Impact with Given System Bandwidth
[Chart: performance impact of Decoupled DIMM with 34 GB/s system bandwidth; normalized weighted speedup of 34GB/s 4CH-2D-1R D1066-B1066 vs. 34GB/s 2CH-2D-2R D1066-B2133 for MEM-AVG, MDE-AVG, and ILP-AVG. Performance impact: -4.4%, -4.1%, -3.5%.]
22
Related Work on Decoupled DIMM
Novel memory architectures (most related work): Mini-Rank [Zheng:MICRO2008], Threaded Memory Module [Ware:ICCD2006], Fully-Buffered DIMM [Intel2005], Registered DIMM, MetaRAM [http://www.metaram.com]
Memory system performance evaluation and analysis: DRAM/Rambus [Burger:ISCA1996, Cuppu:ISCA1999, Cuppu:ISCA2001], FBD [Ganesh:HPCA2007]
Memory access scheduling for performance and fairness: memory access reordering [McKee:HPCA1995, Rixner:ISCA2000, Hur:MICRO2004, Zhu:HPCA2005, Nesbit:MICRO2006, Mutlu:MICRO2007, Mutlu:ISCA2008, Ipek:ISCA2008]
DRAM low-power mode optimizations: low-power mode management for optimizing background power [Lebeck:ASPLOS2000, Delaluz:HPCA2001, Fan:ISLPED2001, Delaluz:DAC2002, Huang:USENIX2003, Li:ASPLOS2004, Zhou:ASPLOS2004, Pandey:HPCA2006]
23
Decoupled DIMM Summary
Cost-effective high-bandwidth memory system design: build a high-bandwidth memory channel using low-speed DRAM devices.
Significant benefits in performance, cost, and power efficiency:
Given the DRAM devices, a higher-bandwidth channel.
Given the channel bandwidth, power/energy savings.
Given the system bandwidth, cost effectiveness with fewer channels.
Small changes only: a synchronization buffer on the DIMM (DRAM device design untouched) and minor changes to memory request scheduling.