A Dynamically Reconfigurable Processor Architecture
Masa Motomura
System ULSI Development Division, NEC Corporation
MICROPROCESSOR FORUM 2002, Session 4-2, Oct. 16
Contents
- Introduction
- DRP architecture
- DRP compiler
- DRP application development system
- Sample applications
- Future roadmap and summary
Introduction: What is DRP?
- Dynamically Reconfigurable Processor => DRP
- H/W-like performance by customized datapath configurations using array of PEs
- S/W-like scalability by dynamic reconfiguration of customized datapath planes
- C compiler is available
- Based on in-house high-level synthesis tool, called Cyber
[Figure: DRP Core (variable array size): a State Transition Controller and an array of processing elements (PEs) and memories (Mems), drawn as 8x8 PEs bordered by Mems. Dynamic reconfiguration switches among Datapath Planes 1, 2, and 3. Other labels: Logic, Memory, Bus, Function Unit.]
DRP Architecture
- Fine-grained processor-based programmable array architecture
- Architected for stream data processing, such as NW packet, motion/still picture, and wireless data streams, etc.
- A simple sequencer
- Array of configurable data memories
- Array of byte-oriented processing elements
- Fully programmable inter-PE wiring resources
[Figure: DRP Core (variable array size): State Transition Controller, 8x8 array of PEs, and bordering Mems, with the four callouts listed above]
DRP Architecture
- Finite state machine that controls dynamic datapath reconfiguration
- Sequencer is built-in, and is decoupled from PE array
- Datapath is dynamically reconfigured during runtime
- Customized datapath/memory organization at any time
- Customized datapath plane for each state
- Customized memory structures for each datapath plane (FIFO/table/scratch-pad, etc.)
[Figure: DRP Core (variable array size) with states C0, C1, C2, C3; dynamic reconfiguration switches among the datapath planes for states C0, C1, C2, and C3]
Processing Element (PE)
- ALU: ordinary byte arithmetic/logic operations
- DMU (data management unit): handles byte select, shift, mask, constant generation, etc., as well as bit manipulations
- An instruction dictates ALU/DMU operations and inter-PE connections
- Source/destination operands can be either from/to
  - its own register file
  - other PEs (i.e., flow through)
- Instruction pointer (IP) is provided from STC (state transition controller)
[Figure: PE block diagram: Data_in (8b x2), Data_out (8b), Flag_in, Flag_out, Instructions, ALU, DMU, Register File, data wires, flag wires, IP]
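As a rough illustration of the PE described above, the following Python sketch models one byte-wide PE whose instruction memory is indexed by an externally supplied instruction pointer. The `Instr` fields, the opcode names, and the ALU-then-DMU ordering are all assumptions made for the example; the slide does not give the actual instruction format.

```python
from dataclasses import dataclass, field

MASK = 0xFF  # PEs are byte-oriented: every result is truncated to 8 bits


@dataclass
class Instr:
    """Hypothetical instruction: one ALU op followed by one DMU op."""
    alu: str      # "add", "sub", "and", "or", "xor", or "pass" (forwards a)
    dmu: str      # "none", "shl", "shr", or "mask"
    dmu_arg: int  # shift amount or mask value
    dest: str     # "out" (flow through to other PEs) or a register name


@dataclass
class PE:
    instrs: list                              # instruction memory, indexed by IP
    regs: dict = field(default_factory=dict)  # register file

    def step(self, ip, a, b):
        """Execute the instruction selected by the STC-provided IP."""
        ins = self.instrs[ip]
        alu_ops = {
            "add": a + b, "sub": a - b, "and": a & b,
            "or": a | b, "xor": a ^ b, "pass": a,
        }
        value = alu_ops[ins.alu] & MASK
        if ins.dmu == "shl":
            value = (value << ins.dmu_arg) & MASK
        elif ins.dmu == "shr":
            value >>= ins.dmu_arg
        elif ins.dmu == "mask":
            value &= ins.dmu_arg
        if ins.dest != "out":
            self.regs[ins.dest] = value  # keep the result in the register file
        return value


pe = PE(instrs=[
    Instr("add", "none", 0, "out"),    # IP 0: byte add, flow through
    Instr("xor", "mask", 0x0F, "r0"),  # IP 1: xor, keep the low nibble in r0
])
```

Calling `pe.step(0, 200, 100)` wraps around at 8 bits, while `pe.step(1, 0xAB, 0xFF)` also writes its result to the hypothetical register `r0`.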
Computation on DRP Core
- Instruction Pointer (IP) from STC identifies a datapath plane
- When IP changes, datapath plane switches instantaneously
- PE instructions as a collection behave like an extreme VLIW
- Sequencing through instructions => Dynamic reconfiguration
- Spatial computation using a customized datapath plane
[Figure: PE array configured with Add, Sel, and Cmp operations; each PE (ALU, DMU) holds instructions 0, 1, 2; IP = "1" selects instruction 1]
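The "extreme VLIW" view can be sketched in a few lines of Python: every PE holds a small table of instructions, and one shared IP picks the same slot in all of them, so the set of selected instructions forms the datapath plane. The two-PE array, its wiring, and the names `PE0`, `PE1`, `plane`, and `cycle` are hypothetical and only loosely echo the Add/Cmp/Sel labels in the figure.

```python
def add(x, y):
    return (x + y) & 0xFF          # byte-wide add

def cmp_gt(x, y):
    return 1 if x > y else 0       # one-bit flag result

# Hypothetical two-PE array. List index = instruction slot chosen by the IP.
PE0 = [cmp_gt, add]                # slot 0: compare, slot 1: add
PE1 = ["select", "pass"]           # slot 0: pick a or b by flag, slot 1: forward

def plane(ip):
    """The 'VLIW word' for this IP: one instruction per PE."""
    return (PE0[ip].__name__, PE1[ip])

def cycle(ip, a, b):
    """Run one cycle of the datapath plane selected by ip."""
    t = PE0[ip](a, b)
    if PE1[ip] == "select":
        return a if t else b       # plane 0 computes max(a, b)
    return t                       # plane 1 computes (a + b) mod 256
```

Changing `ip` from 0 to 1 turns the same two PEs from a max unit into an adder, which is the sense in which sequencing through instructions amounts to reconfiguration.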
Example: IPSec on DRP Core
[Figure: Data In and Control (task selection by descriptor) feed multiple datapath planes (AES, 3DES, MD5, SHA-1, Compress), switched by dynamic reconfiguration, producing Data Out]
Different datapath plane for different set of algorithmic processing
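Task selection by descriptor can be pictured as a lookup from a task field to a datapath plane. The sketch below is a software analogy only: standard-library calls stand in for the planes, the `descriptor` dictionary layout is invented for the example, and AES and 3DES are left out because the Python standard library has no block cipher.

```python
import hashlib
import zlib

# Hypothetical mapping from a descriptor's task field to a "datapath plane".
PLANES = {
    "MD5": lambda data: hashlib.md5(data).digest(),
    "SHA-1": lambda data: hashlib.sha1(data).digest(),
    "Compress": lambda data: zlib.compress(data),
}

def process(descriptor, data):
    """Select a plane by descriptor, then pass the data through it."""
    plane = PLANES[descriptor["task"]]
    return plane(data)
```

For example, `process({"task": "SHA-1"}, payload)` and `process({"task": "Compress"}, payload)` run different processing on the same input path, selected purely by the descriptor.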
IPSec on DRP Core - Continued
[Figure: AES datapath (labels: K0, Kr, 8, 128, S, MixColumn, r<10) and MD5 datapath (labels: 32, 128, r<64, g, +, R, X[k], T[i]), each mapped by Place & Route onto the PE array and switched by dynamic reconfiguration]
Example: Packet Processing
[Figure: timeline of an input packet arriving in 32b, 64b, ... chunks, one cycle per datapath plane; dynamic reconfiguration steps through planes such as Checksum, Field Extract, Reassemble, Field Check, and Table Lookup]
Different hardware processes each chunk of the input packet
The compiler automatically schedules cycle-by-cycle datapath planes, statically
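To make the chunk-by-chunk checksum step concrete, here is a minimal sketch of the one's-complement Internet checksum (RFC 1071) consumed one chunk at a time. The slide does not say which checksum is computed, so RFC 1071 is an assumption, and the function name and the choice of 32-bit chunks are illustrative.

```python
def internet_checksum(chunks):
    """One's-complement sum over 16-bit words (RFC 1071), chunk by chunk.

    Each element of `chunks` is a bytes object of even length, standing in
    for the slice of the packet that arrives in one cycle.
    """
    total = 0
    for chunk in chunks:
        for i in range(0, len(chunk), 2):
            total += (chunk[i] << 8) | chunk[i + 1]
            total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF


# A 20-byte IPv4 header with its checksum field zeroed, fed as 32-bit chunks.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
chunks = [header[i:i + 4] for i in range(0, len(header), 4)]
checksum = internet_checksum(chunks)
```

Because the running sum only needs the current chunk and the previous total, the same small piece of logic can be reused on every cycle in which a new chunk arrives.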
DRP Compiler
- High-level synthesis: generates finite state machine and associated datapath planes from the C source code
  - Modify Cyber to cope with DRP architectural features
- Mapper: maps RTL for each datapath plane to individual PEs and memories
- Place&Router: physically locates the PEs and memories, and mutually connects them
[Figure: Compile C source code into DRP object code. Inputs: C source, Verilog RTL source (optional). Stages: High Level Synthesis, Mapper, Place&Router. Intermediate results: finite state machine and datapath plane RTLs. Output: DRP object code (STC code and PE/memory array code)]
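The idea of turning a computation into states with one datapath plane per state can be illustrated with generic as-soon-as-possible (ASAP) scheduling. This is a textbook technique used here as a stand-in; the slides do not describe the algorithm Cyber actually uses, and the function name and input format are hypothetical.

```python
def asap_schedule(ops):
    """ops maps an operation name to the list of operations it depends on.

    Returns a list of control steps. Each step is the sorted list of
    operations that can run together: one candidate datapath plane per state.
    """
    step_of = {}

    def step(name):
        # An operation runs one step after the latest of its dependencies.
        if name not in step_of:
            step_of[name] = 1 + max((step(d) for d in ops[name]), default=-1)
        return step_of[name]

    for name in ops:
        step(name)
    planes = [[] for _ in range(max(step_of.values()) + 1)]
    for name, s in step_of.items():
        planes[s].append(name)
    return [sorted(p) for p in planes]


# out = (a + b) * (c - d), written as a small dependency graph.
schedule = asap_schedule({"add": [], "sub": [], "mul": ["add", "sub"], "out": ["mul"]})
```

Here the two independent operations share the first state, and the dependent ones follow in later states, giving a three-state machine with three planes.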
Programming Model
- "Finite state machine with datapath planes"
- DRP is architected based on a clear picture of how C code is compiled into hardware
C source code:
    if ((a>b)||(c!=d))
        DataPath_Func_1();
    else if (e==f)
        DataPath_Func_2();
    else
        DataPath_Func_3();
Callouts on the code: Bit-oriented processing; Control structure => Finite State Machine; Byte-oriented processing
[Figure: DRP-Core with states C0, C1, C2, C3 and the datapath planes for states C0, C1, C2, and C3]
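One way to read this slide is that the branch conditions become state transitions and each called function becomes a datapath plane. The Python sketch below follows that reading; the assignment of branches to states C1 to C3 and the bodies of the three placeholder functions are assumptions, since the slide shows neither.

```python
def datapath_func_1(x):
    return (x + 1) & 0xFF      # placeholder byte operation

def datapath_func_2(x):
    return x ^ 0xFF            # placeholder byte operation

def datapath_func_3(x):
    return x                   # placeholder: pass through

# Assumed state assignment: C0 evaluates the conditions, C1..C3 hold the planes.
PLANES = {"C1": datapath_func_1, "C2": datapath_func_2, "C3": datapath_func_3}

def next_state(a, b, c, d, e, f):
    """Control structure: condition flags choose the next state."""
    if (a > b) or (c != d):
        return "C1"
    if e == f:
        return "C2"
    return "C3"

def run(a, b, c, d, e, f, data):
    """Pick the state from the conditions, then apply that state's plane."""
    state = next_state(a, b, c, d, e, f)
    return state, PLANES[state](data)
```

The conditions reduce to single-bit flags, while the planes operate on byte-wide data, which matches the bit-oriented and byte-oriented callouts.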
1st Implementation: DRP Tile
- Tile is a minimum unit of DRP array
  - 8x8 PEs
  - 1-p HMEM and 2-p VMEM
  - A PE stores 16 instructions
- Reconfiguration of a tile is controlled by its STC
- External port can attach extended functional units, such as
  - External memory controllers
  - Peripheral bus controllers (like PCI)
  - Complex operation units (like multiplier, divider, etc.)
[Figure: tile layout: PE array; VMEM: 8b x 256w, 1-R, 1-R/W; HMEM: 8b x 8kw, 1-R/W; STC; VM-CTR; HM-CTR; External Port]
1st Silicon: DRP-1
- 8 Tiles: 512 PEs, 160kb VMEMs, 2Mb HMEMs
- 8 STCs
- 8 32b Multipliers
- 1 SDRAM/SRAM/CAM controller
- 1 PCI controller
- 4 PLLs
[Figure: DRP-1 Chip Block diagram: tiles (PE, VMEM, HMEM, STC, VM-CTR, HM-CTR), extended functional units (MUL, MC to SDRAM/SRAM/CAM, PCIC with PCI IF), PLLs, and CLK, Data, Ctrl, Program, and Test signals]
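The chip totals can be cross-checked against the per-tile figures with a little arithmetic. The sketch assumes that "kb" means 1024 bits and "Mb" means 1024 kb; the memory counts it derives are not stated on the slides.

```python
KB = 1024                    # assumption: "kb" on the slides means 1024 bits
MB = 1024 * KB               # assumption: "Mb" means 1024 kb

TILES = 8
PES_PER_TILE = 8 * 8         # 8x8 PEs per tile
VMEM_BITS = 8 * 256          # VMEM: 8b x 256 words
HMEM_BITS = 8 * 8 * 1024     # HMEM: 8b x 8k words

total_pes = TILES * PES_PER_TILE        # matches the stated 512 PEs
vmem_count = 160 * KB // VMEM_BITS      # derived number of VMEMs on the chip
hmem_count = 2 * MB // HMEM_BITS        # derived number of HMEMs on the chip
```

Under these unit assumptions each VMEM is 2 kb and each HMEM is 64 kb, so the stated totals of 160 kb and 2 Mb would correspond to 80 VMEMs and 32 HMEMs across the eight tiles.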
Current Status: HW Platform
- 0.15µm CMOS 8-Al process, 696-pin TBGA package
- 33-133MHz clock
- 22M logic transistors (44K for a PE)
- 1.5Mb configuration memories
- Fully functional w/o respin
- DRP-1 board is now up and running
- Demonstration of DRP as an off-loading engine to a host MPU through PCI or HyperTransport I/F
[Figure: DRP-1 die photo, labeled: PE (Processing Element), State Transition Controller, PCI Controller, Memory Controller, 2-port Memory (2kb), 1-port Memory (64kb)]
[Figure: DRP-1 board: FPGA and DRP-1 with PCI, MC, CAM, SRAM, SDRAM, and HT to/from host MPU]
Current Status: Compiler Suite
[Figure: compiler suite screenshots, labeled: High Level Synthesis View, Place&Router View, Top Window, Hierarchy Browser, Memory Access Scheduling, Report Summary, Source Code Editor/Viewer, C code, Verilog RTL, Scheduled Data Flow Graph, Critical Path Delay Analysis, PE, Scheduled State Transition Diagram]
DRP application development system (HW+SW) is ready
Example: IPv4 Packet Forwarding
- RFC1812 (Header check, checksum, longest prefix match, header modification)
- 8x16 PEs, 100MHz operation
- Table look up: 5 round index search using 100MHz SRAM
- 6 pipe-stages, 5 cycles each. 100MHz / 5 cycles => 20M packets/second
[Figure: PE array with dynamic reconfiguration (5 datapath planes)]
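The throughput figure follows from the clock rate and the cycles spent per pipeline stage. The short calculation below assumes the pipeline accepts a new packet every 5 cycles, which is what the stated number implies; the latency it derives is not given on the slide.

```python
CLOCK_HZ = 100_000_000       # 100 MHz operation
CYCLES_PER_STAGE = 5         # each pipe-stage takes 5 cycles
STAGES = 6                   # 6 pipe-stages

# One packet leaves the pipeline every CYCLES_PER_STAGE cycles.
packets_per_second = CLOCK_HZ // CYCLES_PER_STAGE

# Derived, not stated on the slide: time for one packet to traverse all stages.
latency_cycles = STAGES * CYCLES_PER_STAGE
latency_ns = latency_cycles * 1_000_000_000 // CLOCK_HZ
```

So the rate is set by the slowest stage (5 cycles), while the six stages together would contribute about 30 cycles of latency per packet.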
Example: AES Encryption
- Critical Path: 6.6ns. Throughput: 1.76Gbps (ECB mode)
- Fits in 8x16 PEs, single context (80 ALU/DMUs, 96 RFUs, 16 vMEMs)
[Figure: AES datapath with the critical path marked. Labels: K0, Kr, 8, 128, ByteSub (8-bit table look-up x16), S, ShiftRow, MixColumn (r<16), (r<10), (r=10: Skip), Output]
[Figure: P & R results]
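The two reported numbers can be related by a quick calculation. It assumes that one 128-bit block takes 11 cycles at the critical-path-limited clock (an initial key addition plus 10 rounds, one cycle each); that cycle count is an inference, not something the slide states.

```python
CRITICAL_PATH_NS = 6.6       # from the slide
BLOCK_BITS = 128             # AES block size
CYCLES_PER_BLOCK = 11        # assumption: key addition + 10 rounds, 1 cycle each

max_clock_mhz = 1e3 / CRITICAL_PATH_NS
# Bits per nanosecond is numerically equal to Gbps.
throughput_gbps = BLOCK_BITS / (CYCLES_PER_BLOCK * CRITICAL_PATH_NS)
```

Under that assumption the result rounds to the 1.76 Gbps quoted for ECB mode, at a clock of roughly 150 MHz.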
DRP Roadmap
- Faster time to market
- Longer time in market
- Lower risk ASIC design
[Figure: roadmap by year, 2002 to 2004, with rows for DRP Core and DRP-based Product. Process and clock milestones shown: 150nm 100-133MHz; 150nm 133-166MHz; 130nm (Cu) 166-200MHz; 90nm (Cu) 250MHz; one entry is marked "(Prototype)". Items shown: DRP 1st Silicon; application development system; DRP ASSP (four DRP cores, HT*); DRP ASSP(2) (four DRP cores, HT*, SPI4/5); embedded core for NEC ASICs; programmable off-load engine to host MPU over HyperTransport or PCI]
* PCI-X/Express may also be available
Summary
- DRP architecture features
  - Array of PEs and memories for programmable datapath
  - FSM-based sequencer, a state transition controller, for the control of dynamic datapath reconfiguration
- DRP application development system (DRP-1 chip & board, and the compiler suite) is ready
- Sample application development is underway
- Launch of DRP-based product planned during Y’03