CACTI-IO: CACTI With Off-Chip Power-Area-Timing Models
Norman P. Jouppi¥, Andrew B. Kahng†‡, Naveen Muralimanohar¥, Vaishnav Srinivas†
November 6th, 2012
ECE† and CSE‡ Departments, University of California, San Diego
Hewlett-Packard Laboratories¥, Palo Alto
(2)
Agenda
• Introduction
• Need for off-chip power-area-timing models
• CACTI-IO models
• Case studies using CACTI-IO:
  • High-capacity DDR3 configurations
  • 3-D stacking
  • LPDDRx for servers
• Summary
(3)
Memory Subsystem Performance
• Latency/Access times: The Memory Wall
  • Modern architectures try to hide the latency impact
• Capacity: Need for large server main memory
• Bandwidth: The Memory Bandwidth Limit
  • Latency hiding techniques do not help
  • Off-chip limits bandwidth
Source: Rogers et al., "Scaling the Bandwidth Wall: Challenges in and Avenues for CMP Scaling"
(9)
Memory Subsystem Power
• Memory subsystem power a significant portion
• DRAM, Buffers, Caches, Interconnect/IO/PHY
• Off-chip IO power is a key component
Source: Economou et al., "Full-System Power Analysis and Modeling for Server Environments"
(16)
Off-chip Performance
• Memory bandwidth limited by off-chip interface
• Source-synchronous signaling
• Signal/Power Integrity: ISI, Crosstalk, Supply Noise
• Pincount
(20)
Off-chip Power
• Off-chip power is a significant portion of memory subsystem power
  • Higher off-chip capacitance and voltages
  • Terminations and Vref-biased receivers
  • Clocking elements
(21)
Off-chip PAT Models For Architects
• Off-chip models for full-system simulator
  • Simulators today do not account for IO/PHY power
  • Accurate off-chip power and performance numbers
  • Co-optimize off-chip & on-chip power/performance
  • Explore new off-chip topologies and technologies
[Diagram labels: Full System Simulator; Off-Chip Power/Area/Timing Models; On-Chip Power/Area/Timing Models; Accurate Off-chip Power/Performance; Optimal On-chip and Off-chip Configuration]
(22)
CACTI-IO
• CACTI well known for memory architects
• CACTI-IO includes off-chip PAT models
• CACTI-IO config file includes off-chip parameters
• CACTI-IO Tech Report available
# Memory State (R=Read, W=Write, I=Idle or S=Sleep)
//-iostate "R"
-iostate "W"
//-iostate "I"
//-iostate "S"
# Is ECC Enabled (Y=Yes, N=No)
-dram_ecc "N"
#Address bus timing
//-addr_timing 0.5 //DDR, for LPDDR2 and LPDDR3
-addr_timing 1.0 //SDR for DDR3, Wide-IO
//-addr_timing 2.0 //2T timing
//-addr_timing 3.0 //3T timing
# Bandwidth (Gbytes per second, this is the effective bandwidth)
-bus_bw 12.8 GBps
# Memory Density (Gbit per memory/DRAM die)
-mem_density 2 Gb
# IO frequency (MHz) (frequency of the external memory interface).
-bus_freq 800 MHz
# Duty Cycle (fraction of time in the Memory State defined above)
-duty_cycle 1.0
# Activity factor for Data (0->1 transitions) per cycle (for DDR, need to account for the higher activity in this parameter. E.g. max. activity factor for DDR is 1.0, for SDR is 0.5)
-activity_dq 1.0
# Activity factor for Control/Address (0->1 transitions) per cycle (for DDR, need to account for the higher activity in this parameter. E.g. max. activity factor for DDR is 1.0, for SDR is 0.5)
-activity_ca 0
# Number of DQ pins
-num_dq 1
# Number of DQS pins
-num_dqs 0 //8 differential pairs
# Number of CA pins
-num_ca 0
# Number of CLK pins
-num_clk 2 //1 differential pair
# Number of Physical Ranks
-num_mem_dq 2 //Number of ranks (loads on DQ and DQS) per DIMM or buffer chip
# Width of the Memory Data Bus
-mem_data_width 1 //x4 or x8 or x16 or x32 memories
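The config listing above follows a simple convention: `#` starts a comment line, `//` comments out the rest of a line, and each active line is a `-name value` pair. A minimal sketch of reading such a file is shown below. The function name `parse_cacti_io_config` and the sample text are hypothetical and written only for illustration; this is not the parser that ships with CACTI-IO.

```python
def parse_cacti_io_config(text):
    """Collect active '-name value' lines into a dict of strings.

    Hypothetical helper for illustration, not part of CACTI-IO itself.
    """
    params = {}
    for raw in text.splitlines():
        # Drop everything after '//' (commented-out options, trailing notes).
        line = raw.split("//", 1)[0].strip()
        # Skip blanks, '#' comment lines, and anything that is not an option.
        if not line or line.startswith("#") or not line.startswith("-"):
            continue
        key, _, value = line[1:].partition(" ")
        params[key] = value.strip().strip('"')
    return params


# Sample lines copied from the slide's config excerpt.
SAMPLE = """\
# IO frequency (MHz) (frequency of the external memory interface).
-bus_freq 800 MHz
//-iostate "R"
-iostate "W"
-num_clk 2 //1 differential pair
"""

params = parse_cacti_io_config(SAMPLE)
print(params)
```

Values are kept as strings, units included, so a caller would still need to split `800 MHz` into a number and a unit before using it.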
(24)
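Dynamic Power
• Dynamic Power (switching lumped caps)
$$P_{dyn} = N_{pins} D_c \alpha \left( \sum_i C_i V_{SW_i} V_{dd} \right) f$$
• Interconnect Power
$$P_{int} = N_{pins} D_c \alpha E_{int} f$$
$$E_{int} = \begin{cases} t_L V_{SW} V_{dd} / Z_0 & \text{if } 2 t_L \le t_b \\ t_b V_{SW} V_{dd} / Z_0 & \text{if } 2 t_L > t_b \end{cases}$$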
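The dynamic and interconnect power terms on this slide can be sketched in a few lines of Python. The function and argument names are illustrative assumptions, as is the reading of D_c as duty cycle and alpha as activity factor (suggested by the `-duty_cycle` and `-activity_dq` entries in the config file). The numbers in the example are placeholders, not values from the talk.

```python
def interconnect_energy(t_l, t_b, v_sw, v_dd, z0):
    """Interconnect energy term E_int, using the slide's two cases.

    t_l: line flight time (s), t_b: bit time (s), z0: line impedance (ohm).
    """
    if 2 * t_l <= t_b:
        return t_l * v_sw * v_dd / z0
    return t_b * v_sw * v_dd / z0


def dynamic_power(n_pins, duty, alpha, loads, v_dd, f):
    """P_dyn = N_pins * D_c * alpha * sum_i(C_i * V_SWi * V_dd) * f.

    loads is a list of (capacitance_F, swing_V) pairs for the lumped caps.
    """
    switched = sum(c * v_sw * v_dd for c, v_sw in loads)
    return n_pins * duty * alpha * switched * f


def interconnect_power(n_pins, duty, alpha, e_int, f):
    """P_int = N_pins * D_c * alpha * E_int * f."""
    return n_pins * duty * alpha * e_int * f


# Placeholder numbers chosen only to exercise the formulas.
e_short = interconnect_energy(t_l=1.0, t_b=4.0, v_sw=1.0, v_dd=2.0, z0=50.0)
e_long = interconnect_energy(t_l=3.0, t_b=4.0, v_sw=1.0, v_dd=2.0, z0=50.0)
p_dyn = dynamic_power(n_pins=2, duty=0.5, alpha=1.0,
                      loads=[(1e-12, 1.0)], v_dd=1.0, f=1e9)
print(e_short, e_long, p_dyn)
```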
(25)
Termination Power
• DQ:
  • Multi rank
  • Few termination types
  • READ and WRITE
  • Assume 50% 0's, 1's
  • Includes Rx, Tx
• CA:
  • Fly-by
  • VDD/2 termination
(26)
PHY Power
• Reference generators
• Vref-biased receivers
• Clock distribution
• DLL/PLL
• Phase Rotators
(27)
Performance: Eye Compliance
• Timing Budget: Tx, Channel, and Rx (setup/hold)
• Voltage Budget: Tx (VOL/VOH), Channel, Rx (VIL/VIH)
(28)
Channel Jitter
• DOE for topology parameters
• Ron/Rtt/Cdram some of the key parameters
• Linear interpolation of Taguchi array
(29)
Timing Budget
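$$T_{jitter} = \sum_i DJ_i + \sqrt{\sum_i RJ_i^2}$$
[Two DOE interpolation equations relating T_jitter to the factors F_i, their nominal values F_i0, and T_jitter_avg are not recoverable from the extracted text.]
$$\frac{T_{ck}}{4} - T_{error} - T_{jitter}^{setup} - T_{skew}^{setup} > T_{DS}$$
$$\frac{T_{ck}}{4} - T_{error} - T_{jitter}^{hold} - T_{skew}^{hold} > T_{DH}$$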
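A small worked sketch of the timing budget may help. It assumes the common convention that deterministic jitter terms add linearly and random jitter terms add in root-sum-square, and that a quarter of the clock period minus error, jitter, and skew must exceed the DRAM setup (or hold) time. Function names and all numbers are illustrative, not taken from the talk.

```python
import math


def total_jitter(dj_terms, rj_terms):
    """Sum deterministic jitter linearly, random jitter in root-sum-square."""
    return sum(dj_terms) + math.sqrt(sum(r * r for r in rj_terms))


def timing_margin(t_ck, t_error, t_jitter, t_skew, t_required):
    """Slack left after T_ck/4 - T_error - T_jitter - T_skew - T_required.

    Pass the setup-side terms with t_required = T_DS, or the hold-side
    terms with t_required = T_DH. A positive result meets the budget.
    """
    return t_ck / 4 - t_error - t_jitter - t_skew - t_required


# Illustrative values in picoseconds (1250 ps matches an 800 MHz clock).
t_jit = total_jitter(dj_terms=[10.0, 20.0], rj_terms=[3.0, 4.0])
setup_slack = timing_margin(t_ck=1250.0, t_error=50.0, t_jitter=t_jit,
                            t_skew=20.0, t_required=100.0)
print(t_jit, setup_slack)
```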
(30)
Voltage Budget
$$V_N = K_N V_{SW} + V_{NI}$$
$$K_N = K_{xtalk} + K_{ISI} + K_{SSO} \quad (K_N \text{ for DOE})$$
$$\frac{V_{SW}}{2} - V_N > V_M$$
$$V_M = |V_{ref} - V_{IL/H}|$$
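The voltage budget can be checked numerically in the same spirit. The sketch below assumes the noise coefficient K_N is the plain sum of crosstalk, ISI, and SSO coefficients, that total noise is K_N times the swing plus a swing-independent term V_NI, and that the eye passes when half the swing minus the noise exceeds the receiver margin V_M. Names and numbers are illustrative assumptions.

```python
def voltage_noise(k_xtalk, k_isi, k_sso, v_sw, v_ni):
    """V_N = K_N * V_SW + V_NI, with K_N assumed to be a simple sum."""
    k_n = k_xtalk + k_isi + k_sso
    return k_n * v_sw + v_ni


def eye_passes(v_sw, v_n, v_m):
    """True when V_SW/2 - V_N exceeds the required receiver margin V_M."""
    return v_sw / 2 - v_n > v_m


# Illustrative values in volts.
v_n = voltage_noise(k_xtalk=0.05, k_isi=0.10, k_sso=0.05, v_sw=1.0, v_ni=0.02)
print(v_n, eye_passes(v_sw=1.0, v_n=v_n, v_m=0.2))
```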
(31)
Area
$$Area_{IO} = N_{IO} \left( A_0 + \frac{k_0}{\min(R_{ON}, 2 R_{TT})} \right) + N_{IO} \frac{1}{R_{ON}} \left( k_1 f + k_2 f^2 + k_3 f^3 \right)$$
• Driver area depends on RON and RTT
• Predriver stages fanout to driver
• Fixed area for ESD and controls
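The three bullets above map onto three terms of an area expression: a fixed part A_0 (ESD and controls), a driver part that grows as the effective resistance min(R_ON, 2·R_TT) shrinks, and a predriver part that scales with frequency and with 1/R_ON. The sketch below is one reading of that structure. The fitting constants k_0 to k_3 and A_0 are technology-dependent and the values used here are placeholders, not data from the talk.

```python
def io_area(n_io, a0, k0, k1, k2, k3, r_on, r_tt, f):
    """IO area as fixed + driver + frequency-dependent predriver terms."""
    # Fixed area plus driver area set by the stronger of R_ON and 2*R_TT.
    per_io_static = a0 + k0 / min(r_on, 2 * r_tt)
    # Predriver area: cubic polynomial in frequency, scaled by 1/R_ON.
    per_io_predriver = (k1 * f + k2 * f**2 + k3 * f**3) / r_on
    return n_io * per_io_static + n_io * per_io_predriver


# Placeholder constants chosen so the arithmetic is easy to follow.
area = io_area(n_io=2, a0=1.0, k0=68.0, k1=34.0, k2=0.0, k3=0.0,
               r_on=34.0, r_tt=20.0, f=1.0)
print(area)
```

With these numbers min(R_ON, 2·R_TT) is 34, so each IO contributes 1 + 2 of fixed and driver area plus 1 of predriver area.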
(32)
Validation
• CACTI-IO models account for off-chip power, area and timing
• Validation against SPICE
  • Within 15% error across all the simulations
  • Lookup tables validated by construction
(33)
Power for LPDDR2 DQ Single-Lane
Total IO Power
(34)
Power for DDR3 DQ Single-Lane
Termination Power
Total IO Power
(36)
Case Studies Using CACTI-IO
• We present three case studies:
  • High-capacity DDR3 configurations
  • 3-D configurations
  • BOOM (Buffered Output On Module): LPDDRx for servers
• Compare the configurations for:
  • Capacity
  • Bandwidth
  • IO Power Efficiency
• BOOM case study with IO+DRAM power
(40)
Case Study 1: High-capacity DDR3
• RDIMM, LRDIMM, BoB (Buffer on Board)
• BoB uses serial bus to host
• LRDIMM offers highest capacity
• BoB offers best bandwidth and power efficiency per GB of capacity
(41)
Case Study 2: 3-D Stacking
• TSS based
• Peak bandwidth of 176 GB/s for Micron's Hybrid Memory Cube (HMC)
• Power efficiency varies by around 2X
Source: Micron
(42)
BOOM: LPDDRx for servers
• BOOM (Buffered Output On Module) architecture from Hewlett-Packard:
  • Buffer chip on the board
  • LPDDRx memories (lower speed, power)
  • Wider bus from the buffer to the DRAMs
• Achieves better power efficiency using LPDDRx memories
• Still meets performance using buffer
(43)
BOOM Topology
(44)
Case Study 3: BOOM
• 50% increase in IO efficiency with LPDDRx
• No terminations with wider, slower buses
• Serial bus from the buffer offers more savings
(45)
BOOM: IO+DRAM Power
(46)
BOOM: IO+DRAM Power
• IO power a significant portion of the combined power (DRAM+IO): 50-60%
• IO Idle power a very significant contributor
• LPDDR2 unterminated signaling reduces idle power
• BOOM-N4-L-400 w/ serial bus to host provides a 3.4X energy savings (DRAM+IO) over the BOOM-N2-D-800
• Combining IO+DRAM allows for correct optimizations
(47)
Optimizing Fanout
• IO power vs. number of ranks while capacity and bandwidth are constant
• Slower and wider provides better power
• Die area and clock distribution go up as bus gets wider, so 200-400MHz seems like a sweet spot
$$BW = 2 f W_B$$
$$Capacity = N_R \cdot (W_B / W_M)$$
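The trade-off on this slide can be made concrete by holding bandwidth and capacity fixed and sweeping the bus frequency. The sketch below assumes a double-data-rate bus (bandwidth = 2 · f · bus width) and counts capacity in DRAM dies, so that ranks = dies · W_M / W_B. The 12.8 GB/s and 800 MHz figures match the config-file example earlier in the deck. The x8 device width and the 32-die capacity are assumptions for illustration.

```python
def bus_width_bits(bw_bytes_per_s, f_hz):
    """Bus width W_B needed for a given bandwidth on a DDR bus."""
    return bw_bytes_per_s * 8 / (2 * f_hz)


def ranks(capacity_dies, w_b, w_m):
    """Ranks N_R when each rank holds W_B / W_M dies of width W_M."""
    return capacity_dies * w_m / w_b


BW = 12.8e9          # bytes per second, held constant
CAPACITY_DIES = 32   # assumed total number of DRAM dies, held constant
W_M = 8              # assumed x8 devices

for f_mhz in (800, 400, 200):
    w_b = bus_width_bits(BW, f_mhz * 1e6)
    print(f_mhz, "MHz:", w_b, "bit bus,", ranks(CAPACITY_DIES, w_b, W_M), "ranks")
```

Halving the frequency doubles the bus width and halves the number of ranks loading each data line, which is the "slower and wider" direction the slide describes.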
(49)
Summary
• Introduced CACTI-IO with off-chip models
• CACTI-IO models include
  • IO/Interconnect dynamic and termination power
  • PHY power
  • Voltage/Timing budgets for eye compliance
  • IO area
• 3 case studies show the capabilities of CACTI-IO
  • Calculate off-chip power/area/timing
  • Combine on-chip and off-chip power
  • Identify key configuration choices and optimizations
• Ongoing work:
  • Extend the models to other types of off-chip memory and off-chip configurations, including PCRAM
Thank You!