Hardware / Software Co-Design for FPGAs
L. Liu
Department of Computer Science, ETH Zürich
Fall semester, 2012
Reconfigurable Computing Systems (252-2210-00L)
Discussion in Last Lecture
• Performance and cost overhead in the previous trading system implementation
Embedded System Design
• HW / SW co-design process
• Given the function, performance and cost requirements:
  - A hardware configuration must be chosen that supports the requirements, especially the hard real-time and cost requirements
    - Build a network of processing elements (CPUs, engines)
    - Build the memory and peripheral interfaces
  - A software design must be created to make efficient use of the hardware
    - Divide the system function into communicating processes
Hardware Design in the Trading System
[Figure: hardware design of the trading system. Twelve TRM cores (TRM0 to TRM11) are arranged in a ring, each attached to its own TRMRing node; the diagram labels every node with the bit pattern 01111110. RS232 and LCD interfaces are attached to the system.]
Software Design in the Trading System
Embedded System Design Tasks
• Partitioning the function to be implemented into smaller, interacting pieces;
• Allocating those partitions to processors or other hardware units, where the function may be implemented directly in hardware or in software running on a processor;
• Scheduling the times at which functions are executed, which is important when several functional partitions share one hardware unit;
• Mapping a generic functional description into an implementation on a particular set of components, either as software suitable for a given processor or as logic that can be implemented from the given hardware libraries.
Source:
Wayne H. Wolf, “Hardware-Software Co-Design of Embedded Systems”.
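To make the allocation and scheduling tasks concrete, here is a minimal Python sketch. It is not taken from Wolf's text: it uses a simple greedy heuristic that assigns each partition to the currently least-loaded processing element and records when it runs there. The partition names, costs and unit names are hypothetical.

```python
def allocate(tasks, units):
    """Greedy allocation and scheduling sketch (illustrative heuristic only).

    tasks: list of (name, cost) pairs, the partitions of the function.
    units: list of processing element names.
    Returns (schedule, makespan), where schedule maps each unit to a list
    of (task, start, end) entries.
    """
    load = {unit: 0 for unit in units}
    schedule = {unit: [] for unit in units}
    # Place the most expensive partitions first.
    for name, cost in sorted(tasks, key=lambda task: -task[1]):
        unit = min(units, key=lambda u: load[u])  # least-loaded unit so far
        schedule[unit].append((name, load[unit], load[unit] + cost))
        load[unit] += cost
    return schedule, max(load.values())


# Hypothetical partitions and processing elements.
schedule, makespan = allocate(
    [("a", 4), ("b", 3), ("c", 2), ("d", 1)], ["cpu0", "cpu1"]
)
print(schedule)
print(makespan)  # 5
```

A real co-design flow would also weigh communication cost and hard real-time deadlines, which this sketch ignores.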
A Top-Down Design Flow for an Embedded System
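[Figure: top-down design flow. specification -> system architecture -> parallel hardware and software tracks -> integration -> system testing.
Hardware track: behavior, register-transfer, logic, physical.
Software track: processes, modules, high-level language, object code.
Annotations in the diagram: communicating processes; structural description; detailed logic structure.]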
Source:
Wayne H. Wolf, “Hardware-Software Co-Design of Embedded Systems”.
Cost Evaluation Results for the Trading System
• Cost
  - FPGA resource usage
    - LUTs: 11586 (40%)
    - BRAMs: 48 (80%)
    - DSPs: 12 (25%)
  - Power consumption
    - Quiescent power (W): 0.453
    - Dynamic power (W): 0.414
    - Total power (W): 0.867
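As a quick consistency check on these figures, the short Python sketch below adds the two power components and estimates the device capacity implied by each usage percentage. The implied capacities are only rough estimates, because the percentages on the slide are rounded; the slide itself does not state them.

```python
# Power figures from the slide, in watts.
quiescent_w, dynamic_w = 0.453, 0.414
total_w = round(quiescent_w + dynamic_w, 3)
print(total_w)  # 0.867

# Resource usage from the slide: (units used, percent of device).
usage = {"LUTs": (11586, 40), "BRAMs": (48, 80), "DSPs": (12, 25)}

# Capacity implied by each rounded percentage (an estimate, not a datasheet value).
implied = {name: round(used * 100 / pct) for name, (used, pct) in usage.items()}
print(implied)  # {'LUTs': 28965, 'BRAMs': 60, 'DSPs': 48}
```

The BRAM figure stands out: at 80% it is the resource closest to exhaustion.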
Reasons for the Performance / Cost Overhead
• The parallelism granularity of the application does not match that of the hardware
• The communication structure of the application does not match the interconnect architecture of the hardware
Development Overhead
• Two-ladder development scheme
[Figure: traditional HW/SW co-design for embedded systems. System specification and SW/HW partitioning lead to two ladders: (1) program the microcontroller in Oberon, followed by compilation; (2) program the system-specific hardware in HDL, followed by synthesis. Result: microcontroller + machine code + specific hardware (e.g. DSP).]
Our Solution
• A single FPGA solution (currently)
• Application parallelism granularity matched to hardware parallelism granularity
• Communication structure matched to hardware interconnect architecture
Custom System on Button Push
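[Figure: a system design written as high-level program code is turned into electronic circuits on programmable hardware (FPGA). The path between the two is formed by a computing model, a programming language, and a toolchain consisting of compiler, synthesizer, hardware library and simulator.]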
[Figure: the toolchain maps a streaming application (kernels, streams) onto an FPGA (cells, channels), where cells are soft processors or hardware engines.]
Automatic System-on-Chip Design Flow
Hardware Library
Computation Components
• General purpose minimal machine: TRM
• Vector machine: VTRM
• MAC, 1D, 2D filter
Communication Components
• FIFOs
  - 32 * 128
  - 512 * 128
  - 32, 64, 128, 1k * 32
Storage Components
• DDR2 controller
• configurable BRAMs
• CF controller
I/O Components
• UART controller
• LCD controller
• SPI, I2C controller
• VGA, DVI controller
THE ACTIVE CELLS COMPUTING MODEL
Source: Slides from Felix Friedrich
Motivation for a new computing model (1)
Traditional model of hardware / software co-design
• Complicated development process.
• Several levels of technical knowledge required.
No OS! Hardware generated by tools!
Active Cell Components
• Active Cell
  - Object with private state space
  - Integrated control thread(s)
  - Connected via channels
• Cell Net
  - Network of communicating cells: a distributed system on a chip
Active Cells
• Scope and environment for a running isolated process.
• Cells do not immediately share memory.
• Defined as types with port parameters.

type
  Adder = cell (in1, in2: port in; result: port out);  (* communication ports *)
  var summand1, summand2: integer;
  begin
    receive(in1, summand1);  (* blocking receive *)
    receive(in2, summand2);
    send(result, summand1 + summand2)  (* non-blocking send *)
  end Adder;

[Figure: the Adder cell with input ports in1 and in2 and output port result.]
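The behavior of the Adder cell can be imitated in ordinary software. The Python sketch below is only an analogy, not how Active Cells is implemented: the `Channel` class is a hypothetical stand-in for a hardware FIFO, the cell body runs in its own thread, `receive` blocks until data arrives, and `send` returns immediately because the queue is unbounded.

```python
import queue
import threading


class Channel:
    """Hypothetical software stand-in for a channel between two cells."""

    def __init__(self):
        self._fifo = queue.Queue()  # unbounded, so send never blocks here

    def send(self, value):
        self._fifo.put(value)

    def receive(self):
        return self._fifo.get()  # blocks until a value is available


def adder(in1, in2, result):
    """Body of the Adder cell: private state, communication only via ports."""
    summand1 = in1.receive()
    summand2 = in2.receive()
    result.send(summand1 + summand2)


in1, in2, result = Channel(), Channel(), Channel()
cell = threading.Thread(target=adder, args=(in1, in2, result))
cell.start()

in1.send(3)
in2.send(4)
total = result.receive()
cell.join()
print(total)  # 7
```

The cell shares no variables with its environment; the only way to influence it is to write to its input ports.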
Further Configurations: Cell Capabilities
• Cells can be parametrized further, being provided with further capabilities or non-default values.

type
  Filter = cell {Vector, DataMemory(2048), DDR2}
    (in: port in (64); result: port out);
  var ...
  begin
    (* ... filter action ... *)
  end Filter;
....

• {Vector, DataMemory(2048), DDR2}: the cell is a VectorTRM with 2k of data memory and has access to DDR2 memory.
• port in (64): this port is implemented with a (bit-)width of 64.
Engine Cell Made From Hardware
• Special cells are provided as prefabricated hardware components (Engines).

type
  Convolver2d = cell {Engine}
    (in: port in (64); result: port out);
  end Convolver2d;
Hierarchic Composition: Cell Nets
• Cellnets consist of a set of cells that can be connected over their ports.
  - Allocation of cells: new statement
  - Connection of cells: connect statement
• Cellnets can provide ports; ports of cells can be delegated to the ports of the net.
  - Delegation of cells: delegate statement
• Terminal (or closed) Cellnets can be deployed to hardware.
Example of a terminal Cellnet
cellnet Example;
import RS232;
type
  UserInterface = cell {RS232} (out1, out2: port out; in: port in)
    (*...*)
  end UserInterface;
  Adder = cell (in1, in2: port in; result: port out)
    (* ... *)
  end Adder;
var interface: UserInterface; adder: Adder
begin
  new(interface);
  new(adder);
  connect(interface.out1, adder.in1);
  connect(interface.out2, adder.in2);
  connect(adder.result, interface.in);
end Example.

[Figure: interpreting the cellnet code yields two instances, interface (UserInterface, attached to RS232) and adder (Adder). interface.out1 and interface.out2 feed adder.in1 and adder.in2; adder.result feeds interface.in.]
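One way to picture what "terminal (or closed)" means for the Example net is to check that no port is left open. The Python sketch below assumes that reading of the term, which the slides do not spell out; the dictionary layout and the function name `unconnected_ports` are hypothetical.

```python
# Cells of the Example net with the direction of each port.
cells = {
    "interface": {"out1": "out", "out2": "out", "in": "in"},
    "adder": {"in1": "in", "in2": "in", "result": "out"},
}

# One (source, sink) pair per connect statement.
channels = [
    ("interface.out1", "adder.in1"),
    ("interface.out2", "adder.in2"),
    ("adder.result", "interface.in"),
]


def unconnected_ports(cells, channels):
    """Return all ports that no channel touches, checking directions on the way."""
    used = set()
    for source, sink in channels:
        for endpoint, expected in ((source, "out"), (sink, "in")):
            cell, port = endpoint.split(".")
            if cells[cell][port] != expected:
                raise ValueError(f"{endpoint} is not an '{expected}' port")
            used.add(endpoint)
    every = {f"{cell}.{port}" for cell, ports in cells.items() for port in ports}
    return sorted(every - used)


print(unconnected_ports(cells, channels))  # []
```

Dropping the last connect statement would leave `adder.result` and `interface.in` open, so under this assumption the net would no longer be closed.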
Building Components via Hierarchic Composition
module SimpleCells;
import RS232;
type
  Adder = cell (in1, in2: port in; result: port out)
    (* ... *)
  end Adder;
  Multiplier = cell (in1, in2: port in; result: port out)
    (* ... *)
  end Multiplier;
  ScalarProduct* = cellnet (vX, wX, vY, wY: port in; result: port out)
  var adder: Adder; mul1, mul2: Multiplier;
  begin
    new(mul1); new(mul2); new(adder);
    delegate(vX, mul1.in1); delegate(wX, mul1.in2);
    delegate(vY, mul2.in1); delegate(wY, mul2.in2);
    connect(mul1.result, adder.in1); connect(mul2.result, adder.in2);
    delegate(result, adder.result)
  end ScalarProduct;
end SimpleCells.

[Figure: the interpreted ScalarProduct cellnet. Its ports vX and wX are delegated to mul1 (Multiplier), vY and wY to mul2 (Multiplier); both multiplier results feed adder (Adder), whose result port is delegated to the result port of the net.]
Example of a wired Cellnet
cellnet Test;
import SimpleCells, RS232;
type
  Norm* = cellnet (vX, vY: port in; result: port out)
  type
    Dup* = cell (in: port in; out1, out2: port out)
    var val: longint;
    begin
      loop receive(in, val); send(out1, val); send(out2, val) end
    end Dup;
  var s: SimpleCells.ScalarProduct; dup1, dup2: Dup;
  begin
    new(s); new(dup1); new(dup2);
    connect(dup1.out1, s.vX); connect(dup1.out2, s.wX);
    connect(dup2.out1, s.vY); connect(dup2.out2, s.wY);
    delegate(vX, dup1.in); delegate(vY, dup2.in);
    delegate(result, s.result);
  end Norm;
  (* cellnet Test continues on the Flattening slide *)

[Figure: the interpreted Norm cellnet. Its ports vX and vY are delegated to dup1 and dup2 (Norm.Dup); each Dup copies its input to two outputs, which feed the vX/wX and vY/wY ports of s (SimpleCells.ScalarProduct); s.result is delegated to the result port of norm (Norm).]
Flattening
  Calculator* = cell {RS232} (in: port in; outX, outY: port out)
  var result: longint; vX, vY, wX, wY: longint;
  begin
    loop
      RS232.ReceiveInteger(vX);
      RS232.ReceiveInteger(vY);
      send(outX, vX); send(outY, vY);
      receive(in, result);
      RS232.SendInteger(result);
    end;
  end Calculator;
var calculator: Calculator; norm: Norm;
begin
  new(calculator); new(norm);
  connect(calculator.outX, norm.vX);
  connect(calculator.outY, norm.vY);
  connect(norm.result, calculator.in);
end Test.

[Figure: flattening. The hierarchical net (calculator connected to norm, which contains dup1, dup2 and s) is flattened into six cells with qualified names: norm.s.adder (Adder), norm.s.mul1 and norm.s.mul2 (Multiplier), norm.dup1 and norm.dup2 (Norm.Dup), and calculator (Calculator, attached to RS232). The six cells are placed on Core1 to Core6.]
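Flattening can be sketched as a recursive walk over the net hierarchy that prefixes instance names and resolves delegated ports to the ports of the primitive cells behind them. The Python below is an illustration under assumptions, not the compiler's actual algorithm: the dictionary encoding of nets and the function name `flatten` are hypothetical, while the instance, port and type names come from the slides.

```python
scalar_product = {
    "instances": {"mul1": "Multiplier", "mul2": "Multiplier", "adder": "Adder"},
    "connect": [("mul1.result", "adder.in1"), ("mul2.result", "adder.in2")],
    "delegate": {"vX": "mul1.in1", "wX": "mul1.in2", "vY": "mul2.in1",
                 "wY": "mul2.in2", "result": "adder.result"},
}
norm = {
    "instances": {"s": scalar_product, "dup1": "Norm.Dup", "dup2": "Norm.Dup"},
    "connect": [("dup1.out1", "s.vX"), ("dup1.out2", "s.wX"),
                ("dup2.out1", "s.vY"), ("dup2.out2", "s.wY")],
    "delegate": {"vX": "dup1.in", "vY": "dup2.in", "result": "s.result"},
}
test_net = {
    "instances": {"calculator": "Calculator", "norm": norm},
    "connect": [("calculator.outX", "norm.vX"), ("calculator.outY", "norm.vY"),
                ("norm.result", "calculator.in")],
    "delegate": {},
}


def flatten(net, prefix=""):
    """Return (cells, channels, ports) with fully qualified names."""
    cells, channels, alias = {}, [], {}
    for name, kind in net["instances"].items():
        path = prefix + name
        if isinstance(kind, dict):  # nested cellnet: flatten it first
            sub_cells, sub_channels, sub_ports = flatten(kind, path + ".")
            cells.update(sub_cells)
            channels.extend(sub_channels)
            for port, target in sub_ports.items():
                alias[f"{path}.{port}"] = target  # net port -> primitive cell port
        else:
            cells[path] = kind

    def resolve(port):
        return alias.get(prefix + port, prefix + port)

    channels.extend((resolve(a), resolve(b)) for a, b in net["connect"])
    ports = {port: resolve(target) for port, target in net["delegate"].items()}
    return cells, channels, ports


cells, channels, ports = flatten(test_net)
print(len(cells), len(channels))  # 6 9
print(sorted(cells))
```

The six resulting cells correspond to the six cores in the figure, and every channel now connects two primitive cells directly, for example `calculator.outX` to `norm.dup1.in`.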
COMPILER IMPLEMENTATION
Hybrid Compilation
Code body | Role          | Compilation method
Cell      | Program logic | Software compilation
Cell Net  | Architecture  | Hardware compilation
The Build Process
[Figure: the build process for the Calculator example. Stages and artifacts as labeled in the diagram:]
• Modules: SimpleCells.Mdf, Calculator.Mdf
• Compilation, Interpretation and Linking
  - Intermediate code files: SimpleCells.Fil, Calculator.Fil
  - Cell/network specifications: SimpleCells.spec, Calculator.spec
• Assembling IR-Code & Linking
  - Code and data: Calculator.Adder.code, Calculator.Adder.data, ..., Calculator.Controller.code, Calculator.Controller.data
• Post processing: Hardware Generation
  - Verilog top module: Calculator.v
  - Block memory configuration: Calculator.bmm
  - TCL script for project implementation: Calculator.tcl
  - Bitstream downloader batch: Calculator.bat
Intermediate Hardware Specification
Active Cells Module

cellnet Calculator;
import RS232;
type
  Adder* = cell (in1, in2: port in; result: port out)
  var v1, v2: longint;
  ......
var controller: Controller; norm: Norm;
begin
  new(controller); new(norm);
  connect(controller.outX, norm.vX);
  connect(controller.outY, norm.vY);
  connect(norm.result, controller.in);
end Calculator.

IR HW Specification (obtained from the module by interpretation)

name=Calculator
instructionSet=TRM
imports=0
types=5
  0 name=Adder instructionMemorySize=1 dataMemorySize=2
    ports=3
      0 name=in1 direction=in adr=-48 width=32
      .....
  4 name=Controller instructionMemorySize=1 dataMemorySize=2
  ...
devices=1
  0 name=RS232 adr=-60
instances=6
  0 name=controller type=Controller
  1 name=norm.s.mul1 type=Multiplier
channels=9
  0 name=channel0 outInstance=norm.dup1 outPort=in inInstance=controller inPort=outX size=32 width=32
  1 name=channel1 outInstance=norm.dup2 outPort=in inInstance=controller inPort=outY size=32 width=32
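The record lines of the IR specification are simple enough to be read by a few lines of code. The Python sketch below assumes the layout visible on the slide (an index followed by space-separated key=value fields); the real file format may differ in details that the excerpt does not show, and `parse_record` is a hypothetical helper name.

```python
def parse_record(line):
    """Parse one IR record such as '0 name=in1 direction=in adr=-48 width=32'."""
    index, *fields = line.split()
    record = {"index": int(index)}
    for field in fields:
        key, value = field.split("=", 1)
        # Keep names as strings, convert plain (possibly negative) integers.
        record[key] = int(value) if value.lstrip("-").isdigit() else value
    return record


port = parse_record("0 name=in1 direction=in adr=-48 width=32")
print(port)
# {'index': 0, 'name': 'in1', 'direction': 'in', 'adr': -48, 'width': 32}

channel = parse_record(
    "1 name=channel1 outInstance=norm.dup2 outPort=in "
    "inInstance=controller inPort=outY size=32 width=32"
)
print(channel["outInstance"], channel["size"])  # norm.dup2 32
```

A hardware generator could iterate over such records to emit one module instance per entry in the instances and channels sections.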
Verilog Top Module
module Calculator (
  input CLKBN, CLKBP, rstIn,
  input RxD,
  output TxD
);
...
TRM #(.IMB(1), .DMB(2)) Calculator_controller(.clk(clk), .....);
...
RS232 instCalculator_Controller_RS232(.clk(clk) .... );
....
ParChannel #(.Size(32)) Calculator_channel0(.clk(clk),....);
....
assign Calculator_controller_inbus = (Calculator_controller_ioadr == 16)? Calculator_channel2_outData:
...
endmodule

(generated from the IR specification)
Conclusions
Use of a dedicated computing model
• Decreases the debugging time
• Improves productivity when developing FPGA-based on-chip systems
• Increases the number of potential FPGA users by providing a high-level programming model
FPGA-based Low Power SoC Architecture
• Separate system control/configuration from data processing
• Address both issues in an efficient manner
Building Blocks
• Hardware: meets the performance requirement with low power consumption
  - Computation engines
  - On-chip interconnect
  - I/O controller
• Software: offers flexibility and dynamic configurability
  - General purpose soft-core processor + program
Case Study: a dynamically configurable feedback comb filter
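For orientation, the textbook feedback comb filter computes y[n] = x[n] + g * y[n - M], where M is the delay in samples and g the feedback gain. The Python sketch below shows only that reference behavior; it says nothing about how the case study maps the filter onto engines and a soft-core processor. The parameter names `delay` and `gain` are chosen here for illustration, and they are the values a dynamically configurable version would let the software change at run time.

```python
def feedback_comb(samples, delay, gain):
    """Textbook feedback comb filter: y[n] = x[n] + gain * y[n - delay].

    Output samples before the start of the signal are taken to be zero.
    """
    output = []
    for n, sample in enumerate(samples):
        feedback = output[n - delay] if n >= delay else 0.0
        output.append(sample + gain * feedback)
    return output


# Impulse response with a delay of 2 samples and a gain of 0.5.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(feedback_comb(impulse, delay=2, gain=0.5))
# [1.0, 0.0, 0.5, 0.0, 0.25, 0.0, 0.125]
```

The impulse response shows echoes every `delay` samples, each scaled by another factor of `gain`; the filter is stable as long as the magnitude of the gain is below 1.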