The Engine of SOC Design
Application-Specific Supercomputing
New Building Blocks Enable New Systems Efficiency
Chris Rowen
President and CEO
Tensilica, Inc.
2 © 2006 Tensilica Inc.
Outline
• Fundamental Shift in Silicon Scaling
• A Quick History of Processor Design
• A New Method: Automatic Processor Generation
• Tradeoffs in Efficiency versus Generality
• Example of Large Processor Array for Embedded
• An Architecture for Very Large Scale Compute
• Conclusions: Essentials of the New HPC
  • The Power Problem is Real
  • Work on Compute Efficiency – GFLOPS per Watt
  • Work on Communications Efficiency – Useful Messages/Data per Watt
Inflexion Point in Clock Frequency
[Figure: performance relative to 2003 (2003 = 1.0, log scale from 1 to 10) by year, 2003 to 2013. Two curves: transistor-constrained clock (21%) and power-constrained clock (4%).]
Based on International Technology Roadmap for Semiconductors 12/03
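The gap between the two clock curves can be estimated with a short sketch. It assumes that the 21% and 4% labels are compound annual growth rates, which the chart does not state explicitly; the function name is illustrative only.

```python
# Assumption: the chart's 21% and 4% labels are compound annual growth rates.
def relative_clock(rate_per_year: float, years: int) -> float:
    """Clock frequency relative to the starting year after compounding."""
    return (1.0 + rate_per_year) ** years


years = 2013 - 2003  # span of the chart's x axis
transistor_limited = relative_clock(0.21, years)
power_limited = relative_clock(0.04, years)

print(f"transistor-constrained clock: {transistor_limited:.1f}x")
print(f"power-constrained clock:      {power_limited:.1f}x")
print(f"gap after {years} years:      {transistor_limited / power_limited:.1f}x")
```

Under that assumption the transistor-constrained curve reaches roughly 6.7x over the decade while the power-constrained curve reaches only about 1.5x, which is consistent with the 1 to 10 axis range.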
High-End Processors hit MHz ceiling
Clock < 4GHz
Basic Implications:
1. Get performance from better architecture instead of more MHz
2. Use multiple processors
The history of the microprocessor
[Figure: block diagrams tracing processor evolution. The simplest design has a PC, instruction memory, instruction decode, registers, an ALU and data memory. Later designs add a microcode memory with its own μPC and μI decode; a register file; instruction and data TLBs, tags, memory management and cache miss engines; a floating-point register file, FP decode, FP multiplier and FP ALU; branch memory and PC prediction; a cache-coherent, non-blocking cache miss engine with added cache state; rename register files, result queues and an out-of-order completion engine; and per-thread PCs with thread select feeding ALU0 to ALU3.]
Relative Impact of Processor Features
[Figure: processor features placed by thread integer performance against processor area and power.]
Features shown:
• Basic micro-controller
• Data width: 4 8 16 32…
• General register file
• Micro-code
• Pipelined load/store arch
• Cache and memory protect
• Floating point
• Branch prediction
• Superscalar (static or dyn.)
• Out-of-order execution
• SIMD multimedia ALU
• Symmetric multi-processing
• Simultaneous multi-thread
Intel’s Own Assessment
[Figure: relative performance and efficiency (i486 = 1.0, axis 0 to 18) by processor generation (year): i486 (1989), Pentium™ Processor (1993), Pentium™ 3 Processor (1999), Pentium™ 4 Processor (2000). Series: Performance (SPECint), Die Size, Power, Power Efficiency (Perf/Power), Transistor Speed.]
Power and Area increase more rapidly than Performance
[Figure: energy per instruction (nJ, log scale from 1 to 100) against relative scalar performance (log scale from 1 to 10). Points: Pentium 1993, Pentium Pro 1995, Pentium 4 2001, Pentium 4 2005, Pentium-M 2003, Pentium-M 2005, Core Duo 2006, Goal.]
The New World: Automatic Processor Generation
Processor configuration:
1. Select from menu
2. Automatic instruction discovery (XPRES Compiler)
3. Explicit instruction description (TIE)
[Figure: the processor configuration feeds the Tensilica Processor Generator, which produces an application-optimized processor implementation (RTL) containing Base CPU, Apps Datapaths, OCD, Timer, FPU, Extended Registers and Cache, together with tailored SW tools: compiler, debugger, simulators, OS ports.]
Build Almost Any Processor
[Figure: Xtensa processor block diagram. Legend: Base ISA Feature; Configurable Function; Optional Function; Optional & Configurable; User Defined Features (TIE).]
Blocks shown:
• Instruction side: Instruction Fetch / PC, Instruction ROM, Instruction RAM, Instruction Cache, Instruction MMU, Extended Instruction Align, Decode, Dispatch
• Execution: Instruction Decode/Dispatch, Base Register File, Base ALU, MAC 16 DSP, MUL 16/32, Floating Point, Vectra DSP, User Defined Register Files, User Defined Execution Units and Interfaces
• Data side: Data Load/Store Unit, User Defined Data Load/Store Units, Data ROMs, Data RAMs, Data Cache, Data MMU, Write Buffer
• Control and debug: Processor Controls, Interrupt Control, Interrupts, Timers, Exception Support, Exception Handling Registers, Data Address Watch Registers, Instruction Address Watch Registers, Trace, TRACE Port, JTAG, JTAG Tap Control, On Chip Debug
• External Interface: Xtensa Processor Interface (PIF) Control, Xtensa Local Memory Interface, User Defined Queues and Wires
Covering Breadth of Processor Demands
[Figure: chart with axes labeled "Data Processing Efficiency Needs" and "Computing Complex". Regions shown: Hard-wired logic, GP CPU, DSP, Xtensa Configurable Processors.]
Simple Recipe
Highly efficient baseline core
+ Complete instruction set extensibility
+ Highly efficient interconnect support
Baseline core:
• 90nm:
  • 0.1mm2, <40μW per MHz in 90nm technology
  • Energy efficient: <25mW @450MHz
• 65nm:
  • Baseline at 65nm: ~50K MIPS per watt
Comms
• High-bandwidth local memory system: 128-256b per cycle
• Nearest-neighbor data-streaming queues (instruction or memory mapped)
• Efficient master or slave bus interface supporting DMA or direct fetch of memory
Full double-precision
• IEEE floating point ISA
• 2-way SIMD, 3-way superscalar
• Alternatives:
  • 4-way superscalar
  • 4-way SIMD
  • True sequential vectors
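The baseline-core power figures can be cross-checked with a few lines of arithmetic. This is a minimal sketch that treats the quoted bounds as exact values and assumes power scales linearly with clock frequency; the constant and function names are illustrative.

```python
# Figures quoted on the slide, treated here as exact values for the arithmetic.
UW_PER_MHZ_90NM = 40.0         # "<40 uW per MHz" bound for the 90nm baseline core
CLOCK_MHZ = 450.0              # clock used for the "<25mW @450MHz" claim
MIPS_PER_WATT_65NM = 50_000.0  # "~50K MIPS per watt" at 65nm


def core_power_mw(uw_per_mhz: float, mhz: float) -> float:
    """Core power in mW, assuming power scales linearly with frequency."""
    return uw_per_mhz * mhz / 1000.0


power_mw = core_power_mw(UW_PER_MHZ_90NM, CLOCK_MHZ)
uw_per_mips = 1e6 / MIPS_PER_WATT_65NM  # microwatts needed per MIPS at 65nm

print(f"90nm core at {CLOCK_MHZ:.0f} MHz: {power_mw:.0f} mW")
print(f"65nm baseline: {uw_per_mips:.0f} uW per MIPS")
```

At the 40μW per MHz bound the core draws 18 mW at 450 MHz, which sits inside the quoted <25mW budget, and 50K MIPS per watt corresponds to 20 μW per MIPS.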
Benefit of Processor Generation
[Figure: four bar charts of EEMBC benchmark scores per MHz, one panel per domain.
Panels: Consumer (ConsumerMarks per MHz, axis 0.0 to 2.0); DSP (TeleMarks per MHz, axis 0.0 to 0.5); Networking (NetMarks per MHz, axis 0.00 to 0.14); Office Automation (OAMarks per MHz, axis 0 to 4).
Processors compared: MIPS32b (NEC VR4122), ARM1136 (Freescale iMX31), MIPS64b (NEC VR5000), ARM1020E, MIPS64 20Kc, Xtensa out-of-box, Xtensa auto-optimized, Xtensa optimized.
Highest bar label in each panel: 2.0 (Consumer), 0.47 (DSP), 0.123 (Networking), 4.2 (Office Automation).]
Source: EEMBC Certified Benchmarks
A Breakthrough in Low Power
25-50x lower power at same performance
[Figure: scatter plot of performance (ARM1136 @ 333 MHz = 1.0, axis 0 to 12) against power (core mW, axis 0 to 200). Labels: ARM; MIPS; Baseline Tensilica Processors; Domain-Optimized Tensilica Processors; Tensilica Xtensa.]
Performance is the EEMBC benchmark aggregate for Consumer, Telecom, Office and Network, based on ARM1136J-S (Freescale i.MX31), ARM1026EJ-S, Tensilica Diamond 570T, T1050 and T1030, MIPS 20K and NEC VR5000. MIPS M4K, MIPS 4Ke, MIPS 4Ks, MIPS 24K, ARM 968E-S, ARM 966E-S, ARM926EJ-S and ARM7TDMI-S are scaled by ratio of Dhrystone MIPS within architecture family. All power figures from vendor websites, 2/23/2006.
Processor Flexibility and Efficiency
[Figure: performance or efficiency (vertical axis, spanning 100-1000x) against flexibility or generality (horizontal axis). Points shown: Hard-wired Logic, Application-specific Xtensa, Domain-specific Xtensa, General-purpose Xtensa, Traditional processor.]
• Instruction set extensions vary from general-purpose to highly task-specific
• Configurable processors break the Efficiency vs. Flexibility tradeoff
True Multi-Processor System-on-Chip
192 Xtensas per chip in Cisco CRS-1 Terabit Router
192 Xtensa network processing cores per Silicon Packet Processor
Up to 400,000 processors per system
Complete 32b processor
16 clusters per chip
12 processors per cluster
Example Parallel Architectures
CRS-1: Massive general-purpose throughput
Routing task runs to completion on each processor:
• IPv4 Unicast
• MPLS–3 Labels
• Link Bundling (v4)
• Load Balancing L3 (v4)
• 1 Policer Check
• Marking
• TE/FRR
• Sampled Netflow
• WRED
• ACL
• IPv4 Multicast
• IPv6 Unicast
• Per prefix accounting
• GRE/L2TPv3 Tunneling
• RPF check (loose/strict) v4
• Load Balancing V3 (v6)
• Link Bundling (v6)
• Congestion Control
18mm x 18mm, IBM 0.13μm
18M gates
8Mbit SRAMs
50,000 general purpose MIPS
175 Gb/s memory bandwidth
Programmability also means
• Ability to juggle feature ordering
• Support for heterogeneous mixes of feature chains
• Rapid introduction of new features
[Figure labels: 96Gb/s, 96Gb/s]
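The CRS-1 figures quoted on these two slides can be related to each other with a short calculation. This is a rough sketch: it assumes the 50,000 MIPS and 175 Gb/s figures are spread evenly across the cores of one chip, and that every processor in a maximum-size system sits on a 192-core chip. Neither assumption is stated in the slides.

```python
import math

CLUSTERS_PER_CHIP = 16
CORES_PER_CLUSTER = 12
CHIP_MIPS = 50_000.0        # "50,000 general purpose MIPS"
CHIP_MEM_GBIT_S = 175.0     # "175 Gb/s memory bandwidth"
MAX_SYSTEM_CORES = 400_000  # "Up to 400,000 processors per system"

cores_per_chip = CLUSTERS_PER_CHIP * CORES_PER_CLUSTER
mips_per_core = CHIP_MIPS / cores_per_chip               # assumes an even split
mem_gbit_s_per_core = CHIP_MEM_GBIT_S / cores_per_chip   # assumes an even split
chips_for_max_system = math.ceil(MAX_SYSTEM_CORES / cores_per_chip)

print(f"cores per chip:       {cores_per_chip}")
print(f"MIPS per core:        {mips_per_core:.0f}")
print(f"memory Gb/s per core: {mem_gbit_s_per_core:.2f}")
print(f"chips for max system: {chips_for_max_system}")
```

The cluster arithmetic reproduces the 192 cores per chip, and the even-split assumption gives roughly 260 MIPS per core.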
Basic Processor Efficiency
The Usual List of Suspects

Metric | Intel Itanium2 | Intel Xeon 5100 Woodcrest | AMD Opteron K8L | IBM BlueGene/L (PowerPC 440 ASIC) | IBM/Sony/Toshiba Cell | Xtensa-based SIMD/LIW Scientific Engine
DP Operations per Cycle per Processor | 4 | 4 | 4 | 4 | 0.6 per SPE | 4
Cycles per second (GHz) | 1.4 | 3.0 | 3.0 | 0.7 | 3.2 | 0.65
Processors per IC | 1 | 2 | 2 | 2 | 8 | 32
Aggregate DP GFLOPS per IC | 5.6 | 24 | 24 | 5.6 | 15 | 83
Approx IC Power (Watts) | 130 | 65 | 80 | 10 | 30 | 12
IC GFLOPS/Watt | 0.06 | 0.4 | 0.3 | 0.6 | 0.5 | 7

Source: Vendor websites, www.geek.com, www.answers.com
DP FP pipelines in FPGA: 15.9 GFLOPS @ 25W (Xilinx Virtex-4 LX200): 0.63 GFLOPS/W
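The aggregate and efficiency rows of this comparison follow from the three rows of inputs: operations per cycle, clock rate and processors per IC. The sketch below recomputes them, reading the table values in column order. The `Chip` class and its field names are illustrative, and recomputed values can differ from the slide's rounded entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    name: str
    dp_ops_per_cycle: float  # double-precision operations per cycle per processor
    ghz: float               # clock rate
    processors: int          # processors (or SPEs) per IC
    watts: float             # approximate IC power

    @property
    def gflops(self) -> float:
        """Aggregate peak DP GFLOPS per IC."""
        return self.dp_ops_per_cycle * self.ghz * self.processors

    @property
    def gflops_per_watt(self) -> float:
        return self.gflops / self.watts


# Inputs as quoted on the slide.
CHIPS = [
    Chip("Intel Itanium2", 4, 1.4, 1, 130),
    Chip("Intel Xeon 5100 Woodcrest", 4, 3.0, 2, 65),
    Chip("AMD Opteron K8L", 4, 3.0, 2, 80),
    Chip("IBM BlueGene/L", 4, 0.7, 2, 10),
    Chip("IBM/Sony/Toshiba Cell", 0.6, 3.2, 8, 30),
    Chip("Xtensa-based scientific engine", 4, 0.65, 32, 12),
]

for chip in CHIPS:
    print(f"{chip.name:32s} {chip.gflops:6.1f} GFLOPS "
          f"{chip.gflops_per_watt:5.2f} GFLOPS/W")
```

Peak figures of this kind ignore memory stalls and issue restrictions, so they bound sustained performance rather than predict it.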
Example Parallel Architectures
Optimized Scientific Compute Processor
• Optimized for general-purpose double-precision computation on well-structured local data.
• Three-issue “FLIX” VLIW:
  1. 128b load-store (with update), FP convert, general integer ops
  2. general integer op
  3. 2-way SIMD FP op (IEEE add, sub, mul, mul-add, mul-sub)
• Free intermixing of 16b, 24b and 64b FLIX instructions
• 8-stage pipe with zero-overhead loops
• 32 entry windowed address register file
• 16 entry SIMD FP register file (16 x 2 x 64b)
• 32KB I cache + 8KB D cache + 64-128KB data RAM
• Local instruction and data cache plus large local data RAM (64K-128KB), dual-ported between processor and closely-coupled DMA engine
• Core: 220K gates = 1mm2, <100mW @650MHz
[Figure: 64b instruction word divided into three slots.]
• LS, convert, integer slot: LDI[U], LVI[U], LVX[U], SD{H,L}I[U], SVI[U], SVX[U], TRUNC{H,L}, CEIL{H,L}, ROUND{H,L}, FLOOR{H,L}, FLOAT{H,L}, UFLOAT{H,L}, VMOV, VNEG, VABS + 45 integer ALU and LS
• Integer ops slot: 27 integer ALU
• SIMD FP ops slot: VADD, VSUB, VMUL, VMADD, VMSUB, RADD, SWAP, V[compare], VMOV, VMOVEQZ, VMOVGEZ, VMOVLTZ, VMOVNEZ, VMOVT, VMOVF
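To illustrate why a 2-way SIMD multiply-add slot is counted as four floating-point operations per cycle, here is a small software model. The `vmadd` semantics (per-lane `acc + a * b`) are an assumption made for illustration and are not taken from the instruction set documentation. The model also assumes one such operation issues per cycle, and the final lane sum is done in plain Python rather than with any specific instruction.

```python
from typing import Sequence, Tuple

Vec2 = Tuple[float, float]  # models one 2 x 64b SIMD FP register (two double lanes)

FLOPS_PER_VMADD = 4  # 2 lanes x (1 multiply + 1 add)
CLOCK_GHZ = 0.65     # 650 MHz core clock from the slide


def vmadd(acc: Vec2, a: Vec2, b: Vec2) -> Vec2:
    """Hypothetical 2-way SIMD multiply-add: per-lane acc + a * b."""
    return (acc[0] + a[0] * b[0], acc[1] + a[1] * b[1])


def dot(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, int]:
    """Dot product of even-length vectors; returns (result, vmadd ops issued)."""
    if len(xs) != len(ys) or len(xs) % 2 != 0:
        raise ValueError("inputs must have equal, even length")
    acc: Vec2 = (0.0, 0.0)
    issued = 0
    for i in range(0, len(xs), 2):
        acc = vmadd(acc, (xs[i], xs[i + 1]), (ys[i], ys[i + 1]))
        issued += 1
    return acc[0] + acc[1], issued  # final lane sum done outside the SIMD model


result, issued = dot([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0])
peak_gflops = FLOPS_PER_VMADD * CLOCK_GHZ  # assumes one vmadd issued every cycle

print(result, issued, peak_gflops)
```

A four-element dot product needs two such operations, and one per cycle at 650 MHz gives a peak of about 2.6 DP GFLOPS per core under these assumptions.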
Example Parallel Architectures
Local Interconnect Structure
[Figure: each CPU core has an instruction cache, data cache, data RAM and DMA engine, with message paths to its North, South, East and West neighbor CPUs and a connection to a group-of-4 bus. A 9x9 crossbar connects to eight XDR XIO interfaces, each attached to external DRAM. Also shown: a central broadcast, boot & debug processor; an optional on-chip eDRAM buffer; and PCIe.]
Example Parallel Architectures
Petascale Climate Modeling System
Technical/Economic Challenges
1. Variance in data-reference/communication patterns for codes
2. Potential for extreme scalability via large-scale processing arrays
3. Size, cost and maintenance strongly correlated to system power dissipation
4. General-purpose CPUs optimized for integer applications – unimpressive performance per $, per watt
Parallel Climate Modeling
Speculation by Lenny Oliker and Michael Wehner of Lawrence Berkeley Lab: a much more parallel climate model
• 1.5km grid for Earth
• 20,000,000 domains
• 500 MFLOPS/domain
• 500 MB/s per domain
• Complex algorithms require general-purpose programmability in double precision floating point
• 2D communications mesh @~20MB/s per domain
System Architecture Approach
• Highly suitable for distributed array computation
• Two design challenges:
  • Total memory bandwidth: 5-10 peta-bytes/s – many parallel local DRAM channels
  • Power: GFLOPS/W best predictor of system cost, size
• Best off-the-shelf processor (IBM Cell) is about 0.5 DP GFLOPS/W
• Domain-specific processor approach offers significant potential advantage (>10 DP GFLOPS/W)
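The system-level requirements follow from multiplying the per-domain figures by the domain count. The sketch below does that arithmetic and then estimates compute power at the two efficiency points quoted. It assumes decimal units (1 MB = 10^6 bytes) and that the machine sustains its rated GFLOPS/W, and it ignores memory and communications power. The names are illustrative.

```python
DOMAINS = 20_000_000
MFLOPS_PER_DOMAIN = 500.0
MEMORY_MB_S_PER_DOMAIN = 500.0
MESH_MB_S_PER_DOMAIN = 20.0

# Decimal units assumed throughout: 1 MB = 1e6 bytes, 1 PFLOPS = 1e15 FLOPS.
total_pflops = DOMAINS * MFLOPS_PER_DOMAIN * 1e6 / 1e15
memory_pb_s = DOMAINS * MEMORY_MB_S_PER_DOMAIN * 1e6 / 1e15
mesh_tb_s = DOMAINS * MESH_MB_S_PER_DOMAIN * 1e6 / 1e12


def compute_power_mw(pflops: float, gflops_per_watt: float) -> float:
    """Compute-only power in megawatts at a given efficiency."""
    watts = pflops * 1e6 / gflops_per_watt  # 1 PFLOPS = 1e6 GFLOPS
    return watts / 1e6


print(f"compute:          {total_pflops:.0f} PFLOPS")
print(f"memory bandwidth: {memory_pb_s:.0f} PB/s")
print(f"mesh traffic:     {mesh_tb_s:.0f} TB/s")
print(f"power at 0.5 GFLOPS/W: {compute_power_mw(total_pflops, 0.5):.0f} MW")
print(f"power at 10 GFLOPS/W:  {compute_power_mw(total_pflops, 10.0):.0f} MW")
```

The totals come to 10 PFLOPS and 10 PB/s, the upper end of the quoted 5-10 peta-bytes/s range. Moving from 0.5 to 10 DP GFLOPS/W cuts the compute power estimate from about 20 MW to about 1 MW.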
Example Parallel Architectures
10 PetaFLOPS System Concept: 3.8M processors
• ~150 m2
• <5MWatts
• ~$100M
• 120 racks @ ~25KW, power + comms
• 32 boards per rack
• 32 chip + memory clusters per board (2.7 TFLOPS @ 1000W)
VLIW CPU:
• 128b load-store + 2 DP MUL/ADD + integer op/DMA per cycle
• Synthesizable at 650MHz in commodity 65nm
• 1mm2 core, ~3mm2 with inst cache, data cache, data RAM, DMA interface, 0.25mW/MHz
• Double precision SIMD FP: 4 ops/cycle (2.7 GFLOPS)
• Vectorizing compiler, cycle-accurate simulator, debugger GUI
• 8 channel DMA for streaming from on/off chip DRAM
• Nearest neighbor 2D communications grid
[Figure: board-level cluster showing a processor array chip surrounded by RAM. 8 DRAM per processor chip: 50 GB/s.]
[Figure: processor chip floorplan with an array of 32 CPUs, each with data RAM (D), instruction cache (I) and DMA. The detail of one CPU shows 64-128K D, 32K I, a 2x128b path and 8 chan DMA. Also shown: optional 8MB embedded DRAM, four external DRAM interfaces, a master processor and comm link control.]
32 processors per 65nm chip
83 GFLOPS @ 12W
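The system totals can be rebuilt from the packaging hierarchy quoted on this slide: racks, boards, chips and processors. This is a minimal sketch using only the slide's figures; it counts processor-chip power only and makes no attempt to model memory, communications or power-delivery losses.

```python
RACKS = 120
BOARDS_PER_RACK = 32
CHIPS_PER_BOARD = 32
CPUS_PER_CHIP = 32
CHIP_GFLOPS = 83.0  # "83 GFLOPS @ 12W" per processor chip
CHIP_WATTS = 12.0

chips = RACKS * BOARDS_PER_RACK * CHIPS_PER_BOARD
cpus = chips * CPUS_PER_CHIP
board_tflops = CHIPS_PER_BOARD * CHIP_GFLOPS / 1e3
system_pflops = chips * CHIP_GFLOPS / 1e6
chip_power_mw = chips * CHIP_WATTS / 1e6  # processor chips only, in megawatts

print(f"processor chips:    {chips:,}")
print(f"processors:         {cpus:,}")
print(f"per board:          {board_tflops:.2f} TFLOPS")
print(f"system peak:        {system_pflops:.1f} PFLOPS")
print(f"chip power (total): {chip_power_mw:.2f} MW")
```

The hierarchy yields about 3.9 million processors, close to the headline 3.8M, along with roughly 2.7 TFLOPS per board and just over 10 PFLOPS for the system. The processor chips alone account for about 1.5 MW, so most of the <5MW budget would go to everything else on the boards and in the racks.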
Conclusions
Essentials of the New HPC
• The practical and economic limitations of power are real: Figure out how to live with 5MW
• Cost of building/operating system now much greater than cost of design: Build systems around particular application domains or communications paradigms
• Demonstrated potential to improve compute and local communications energy efficiency by 1-2 orders of magnitude
• Next challenge: global communications efficiency:
  • Develop algorithms with reduced global comms needs
  • Develop ultra-low-energy global comms mechanisms