Chip Multi-Threading Keeps the Data Center Cool
David Greenhill, Distinguished Engineer, Sun Microsystems
[email protected]
Introduction
• Market forces are driving the following:
  > More performance
  > More threaded workloads
  > Power-limited designs
• Technology is driving:
  > Higher power each generation
  > Less performance from frequency gains
  > Good scaling of I/O bandwidth through SERDES
➔ This is exploited by Sun Chip Multi-Threaded (CMT) systems
Comparing Modern CPU Design Techniques
[Figure: compute (C) and memory (M) phases over time for three cases: no parallelism, instruction-level parallelism, and thread-level parallelism, where overlapping one thread's compute with another thread's memory accesses yields the time saved.]
● ILP offers limited headroom
● TLP provides greater performance efficiency
CMT – Multithreaded Cores
[Figure: eight cores, each time-slicing four threads; while one thread waits on memory latency, another thread computes, so each core's pipeline stays busy.]
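To make the latency-hiding idea concrete, here is a minimal toy model (my own illustrative sketch, not Sun's simulator): one core issues single-cycle instructions from whichever thread is ready, and every instruction is followed by a fixed memory stall. The thread count and stall length are assumptions chosen only to show the trend.

# Toy model of fine-grained multithreading hiding memory latency.
# Assumed numbers: each instruction takes 1 cycle to issue, then the
# issuing thread stalls 3 cycles on memory (illustrative only).

def utilization(n_threads, stall=3, cycles=10_000):
    ready_at = [0] * n_threads      # cycle at which each thread may issue again
    busy = 0
    for cycle in range(cycles):
        for t in range(n_threads):  # pick any ready thread this cycle
            if ready_at[t] <= cycle:
                busy += 1                        # one instruction issued
                ready_at[t] = cycle + 1 + stall  # thread now waits on memory
                break
    return busy / cycles

for n in (1, 2, 4):
    print(f"{n} thread(s): {utilization(n):.0%} pipeline utilization")
# 1 thread: 25%; 2 threads: 50%; 4 threads: 100% of issue slots used,
# which is the "time saved" shown in the figure above.

Even though every individual thread still spends most of its time waiting on memory, four threads are enough in this model to keep the pipeline fully busy.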
Niagara Micrograph and Overview
[Die micrograph: eight SPARC cores (0-7) surrounding a central crossbar, with L2 tag, data, and buffer banks 0-3, DRAM controllers 0,2 and 1,3, four DDR2 channels (DDR2_0 to DDR2_3), the FPU, a clock/test unit, and the JBUS I/O bridge.]
Features:
• 8 64-bit multithreaded SPARC cores
• Shared 3MB L2 cache
• 16KB I-cache per core
• 8KB D-cache per core
• 4 144-bit DDR2 channels
• 3.2 GB/s JBUS I/O
Technology:
• TI's 90nm CMOS process
• 9LM Cu interconnect
• 63 Watts @ 1.2GHz / 1.2V
• Die size: 378mm²
• 279M transistors
• Flip-chip ceramic LGA
CMOS Power
(Ignoring 2nd-order terms)
Power = aCV²F + Leakage(P, V, T)
a = activity factor
C = capacitance of nodes
V = voltage
F = chip frequency
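As a quick numeric sketch of the dynamic term (the parameter values below are illustrative assumptions, not measured Niagara figures), note how the V² dependence makes voltage the dominant lever:

# Dynamic CMOS power: P = a * C * V^2 * F (leakage term omitted).
# All parameter values here are illustrative assumptions.

def dynamic_power(a, c_farads, v_volts, f_hz):
    return a * c_farads * v_volts ** 2 * f_hz

p_full = dynamic_power(a=0.15, c_farads=30e-9, v_volts=1.2, f_hz=1.2e9)
p_lowv = dynamic_power(a=0.15, c_farads=30e-9, v_volts=0.6, f_hz=1.2e9)
print(f"{p_full:.1f} W vs {p_lowv:.1f} W")  # ~7.8 W vs ~1.9 W: halving V cuts dynamic power 4x

That quadratic leverage is why the voltage curves on the next few slides matter so much for chip power.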
"Niagara" T1 Chip Power
• Fully static design
• Fine-granularity clock gating for datapaths (30% of flops disabled)
• Lower 1.5 P/N width ratio for library cells
• Interconnect wire classes optimized for power × delay
• SRAM activation control
63W @ 1.2GHz / 1.2V, which is < 2 Watts per thread (63W across 8 cores × 4 threads = 32 threads)
[Pie chart of the power breakdown across SPARC cores, leakage, L2 data/tag/buffer units, crossbar, floating point, misc units, wires & repeaters, global clock, and I/Os:]
• Cores: 26%
• Leakage: 25%
• Wires & Rptrs: 17%
• L2 Cache: 12%
• IOs: 11%
• Xbar: 6%
Personal View of Processor Designs
[Chart: supply voltage (1.00V to 5.00V) versus year (1984 to 2005) for various processor designs, with labeled points for the Inmos Transputer, UltraSPARC, and Niagara.]
Personal View of Processor Designs
[Same voltage-versus-year chart, annotated: the trend of the mid-1990s was not sustainable.]
Personal View of Processor Designs
[Chart: voltage (1.00V to 5.00V) and power (10W to 150W) versus year (1984 to 2005), with points for the Inmos Transputer, UltraSPARC, and Niagara, annotated "Chip Multithreading Effect" at the Niagara point.]
Niagara Low Power
• Niagara low-power style:
  > No speculation
  > No out-of-order execution
  > No complex branch prediction
  > No predication
  > Short pipeline
  > Moderate clock frequency
  > Static CMOS design
  > Threading to cover memory latency
• Typical competitor:
  > Lots of speculation, out-of-order execution, etc.
  > Wide issue
  > Deep pipelines
  > High frequency
  > Lots of dynamic circuits
  > Long stall when memory is accessed
CoolThreads™ Advantages
[Thermal maps: junction temperatures of 59°C and 66°C across the die, contrasted with a 107°C hot spot.]
• Improved reliability with lower and more uniform junction temperatures
  – Increased lifetime
  – Total failure rate reduced by ~8X (vs. 105°C); see the sketch below
• Optimized performance/reliability trade-off
  – Frequency guardbands due to CHC, NBTI, etc. reduced by > 55%
  – Reduced design margins (EM/NBTI)
  – Less variation across die
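For intuition on where a figure like ~8X can come from, here is a hedged first-order sketch using Arrhenius temperature acceleration of failure rate; the 0.6 eV activation energy is my assumption for illustration, not a number from the talk:

import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant, eV/K

def arrhenius_factor(t_cool_c, t_hot_c, ea_ev=0.6):
    """Failure-rate ratio between hot and cool junction temperatures
    under a first-order Arrhenius model (Ea is an assumed value)."""
    t_cool = t_cool_c + 273.15
    t_hot = t_hot_c + 273.15
    return math.exp((ea_ev / K_BOLTZMANN_EV) * (1 / t_cool - 1 / t_hot))

# 105°C reference versus the ~66°C seen on the Niagara die:
print(f"~{arrhenius_factor(66, 105):.1f}x")  # ~8.3x, near the slide's ~8X claim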
Data Center Constraints
• Many data centers are maxed out
  > Some constrained by cooling limits
  > Others by electrical substations
• Getting new buildings & equipment is expensive
• In some locations, e.g. financial centers, it's impossible
• Performance of the data center is constrained by the performance/watt of the servers
Niagara-2
• Double throughput versus UltraSPARC T1
  > Maintain SPARC binary compatibility
  > http://opensparc.sunsource.net/nonav/index.html
• Improve throughput/watt
• Improve single-thread performance
• Integrate important SOC components
  > Networking
  > Cryptography
Niagara-2 Chip Overview
• 8 SPARC cores, 8 threads each
• Shared 4MB L2 cache, 8 banks, 16-way associative
• Four dual-channel FBDIMM memory controllers
• Two 10/1 Gb Ethernet ports with onboard packet classification and filtering
• One PCI-E x8 1.0 port
• 711 signal I/Os, 1831 total
SPARC Core Block Diagram
[Block diagram: IFU feeding two integer execution units (EXU0, EXU1), plus LSU, SPU, TLU, FGU, and MMU/HWTW, connected through a gasket to the crossbar/L2.]
• IFU – Instruction Fetch Unit
  > 16KB I$, 32B lines, 8-way SA
  > 64-entry fully-associative ITLB
• EXU0/1 – Integer Execution Units
  > 4 threads share each unit
  > Executes one integer instruction/cycle
• LSU – Load/Store Unit
  > 8KB D$, 16B lines, 4-way SA
  > 128-entry fully-associative DTLB
• FGU – Floating-Point/Graphics Unit
• SPU – Stream Processing Unit
  > Cryptographic acceleration
• TLU – Trap Logic Unit
  > Updates machine state, handles exceptions and interrupts
• MMU – Memory Management Unit
  > Hardware tablewalk (HWTW)
  > 8KB, 64KB, 4MB, 256MB pages
DRAM Data Rates Versus Time
[Chart: DRAM data rate (Mb/s, log scale from 100 to 10,000) versus introduction date (01/97 to 01/07), with points for SDRAM-100, SDRAM-133, Rambus-800, Rambus-1066, DDR-266, DDR-400, DDR2-667, DDR3-1066, XDR-3200, FBDIMM1, and FBDIMM2.]
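To put a number on the chart's trend, here is a quick sketch using two endpoints the talk itself supplies (SDRAM-100 around 1997, and the 4 Gb/s FBDIMM rate quoted for Niagara 2 on the next slide); the years are my approximate readings, so the result is only indicative:

import math

# Approximate endpoints read from the chart and the next slide.
rate0, year0 = 100.0, 1997    # SDRAM-100, Mb/s
rate1, year1 = 4000.0, 2006   # FBDIMM at 4 Gb/s, in Mb/s

cagr = (rate1 / rate0) ** (1 / (year1 - year0)) - 1
doubling = math.log(2) / math.log(1 + cagr)
print(f"~{cagr:.0%}/year growth, doubling roughly every {doubling:.1f} years")
# ~51%/year, doubling roughly every 1.7 years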
DRAM Issues
• Niagara 1 uses DDR2-400
• Niagara 2 uses FBDIMM at 4 Gb/s
  > The higher data rate is good
  > AMB power & cost is a problem
• Need to amortize the serialization cost across more memory devices:
  > Stacking technologies
  > More DRAM per DIMM
  > Other configurations of buffers to fan out to DDR DIMMs
Niagara & Niagara II Packages
Future Packaging Requirements
• Challenges for future packages: similar pin counts, but:
  > Data rates keep increasing to match higher processor performance
  > Costs are getting squeezed, particularly in entry-level servers
At the System Level

Yesterday's vertical system vs. today's horizontal system:

                E10000 (1997)    T2000 (2005)
Processors      32 x US2         1 x US T1
Volume          77.4 ft³         0.85 ft³
Weight          2000 lbs         37 lbs
Power           13,456 W         ~300 W
Heat             52,000 BTU/hr    1,364 BTU/hr

91x smaller, 54x lighter, 44x cooler
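As an arithmetic check, the headline ratios follow directly from the table (treating "44x cooler" as the power ratio; the ~300 W figure is approximate):

e10000 = {"volume_ft3": 77.4, "weight_lb": 2000, "power_w": 13456}
t2000  = {"volume_ft3": 0.85, "weight_lb": 37,   "power_w": 300}

for key in e10000:
    print(f"{key}: {e10000[key] / t2000[key]:.0f}x")
# volume_ft3: 91x, weight_lb: 54x, power_w: 45x
# (~the slide's 44x cooler, given the ~300 W estimate)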
T2000
[Full-slide photo of the T2000 server.]
Conclusion
• Chip multithreading does keep the data center cool
• Two generations of the Niagara processor
• Future challenges:
  > Very high data rate interfaces
  > Managing power is an ongoing challenge at the CPU, memory, and system levels