26/04/16 1
VLSI Programming 2016: Lecture 3
Course: 2IMN35
Teachers: Kees van Berkel, Rudolf Mak
Lab: Kees van Berkel, Rudolf Mak, Alok Lele
www: http://www.win.tue.nl/~wsinmak/Education/2IMN35/
Lecture 3: FPGAs, Verilog, lab assignment 1
VLSI Programming (2IMN35): time table 2016
Columns: 2016 | in | Tue: h5-h8; MF.07 | out | 2016 | in | Thu: h1-h4; Gemini-Z3A-08/10/13 | out
Entries per date, in slide order:
19-Apr: introduction, DSP graphs, bounds, …
21-Apr: pipelining, retiming, transposition, J-slow, unfolding; T1+T2
26-Apr: tools installed; Introductions to FPGA and Verilog; L1: audio filter simulation; L1 L2
28-Apr: T1+T2; unfolding, look-ahead, strength reduction; L1 cntd; T3+T4
3-May: folding; L2: audio filter on XUP board
5-May: (no entries)
10-May: T3+T4; DSP processors; L2 cntd; L3
12-May: L3: sequential FIR + strength-reduced FIR
17-May: L3 cntd
19-May: L3 cntd; L4
24-May: systolic computation; T5
26-May: L3; L4
31-May: T5; L4: audio sample rate convertor
2-Jun: L4 cntd; L5
7-Jun: L5: 1024x audio sample rate convertor
9-Jun: L4; L5 cntd
14-Jun: (no entries)
16-Jun: L5; deadline report L5
Note on course literature
Lectures VLSI programming are loosely based on:
• Keshab K. Parhi. VLSI Digital Signal Processing Systems: Design and Implementation. Wiley-Interscience, 1999.
• This book is recommended, but not mandatory.
Accompanying slides can be found on:
• http://www.ece.umn.edu/users/parhi/slides.html
• http://www.win.tue.nl/~cberkel/2IN35/
Mandatory reading:
• Edward A. Lee and David G. Messerschmitt. Synchronous Data Flow. Proc. of the IEEE, Vol. 75, No. 9, Sept 1987, pp 1235-1245.
• Keshab K. Parhi. High-Level Algorithm and Architecture Transformations for DSP Synthesis. Journal of VLSI Signal Processing, 9, 121-143 (1995), Kluwer Academic Publishers.
Outline Lecture 3
• Introduction to FPGAs
• Introduction to Verilog
• Introduction to Lab assignment 1
• Hands on!
FPGA IC on a Xilinx XUP Board (Atlys)
[Figure: the board, with the label "Xilinx Spartan 6 FPGA"]
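Building an FPGA: Logic First
[Figure: a 16 words x 1 bit memory with a 4-bit address input [A,B,C,D] and output F]
• A 4-input lookup table (LUT) can implement any function of 4 inputs.
• For example, a 1-bit adder needs 2 LUTs:
[Figure: two LUTs sharing the inputs A, B, and Ci; one is labeled A⊕B⊕Ci and produces S, the other is labeled A.B.Ci and produces Co]
Xilinx slide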
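The two-LUT adder can be sketched in Verilog as follows. This is a minimal illustration, not the Xilinx primitive: the module names lut4, full_adder, and full_adder_tb, the INIT parameter, and the port names are assumptions made for this example. Each LUT is modelled as a 16 x 1-bit memory whose address is formed by the four inputs; the fourth input is tied to 0 because the adder only needs three.

```verilog
// 4-input LUT modelled as a 16 words x 1 bit memory (hypothetical model).
// Bit k of INIT is the output for address k = {d, c, b, a}.
module lut4 #(parameter [15:0] INIT = 16'h0000)
             (input a, input b, input c, input d, output f);
  wire [15:0] mem = INIT;
  assign f = mem[{d, c, b, a}];
endmodule

// 1-bit full adder built from two LUTs that share the inputs a, b, ci.
module full_adder (input a, input b, input ci, output s, output co);
  // sum = a ^ b ^ ci: output 1 at addresses 1, 2, 4, 7
  lut4 #(.INIT(16'h0096)) sum_lut   (.a(a), .b(b), .c(ci), .d(1'b0), .f(s));
  // carry = majority(a, b, ci): output 1 at addresses 3, 5, 6, 7
  lut4 #(.INIT(16'h00E8)) carry_lut (.a(a), .b(b), .c(ci), .d(1'b0), .f(co));
endmodule

// Exhaustive check of all 8 input combinations.
module full_adder_tb;
  reg a, b, ci;
  wire s, co;
  integer i, errors;
  full_adder dut (.a(a), .b(b), .ci(ci), .s(s), .co(co));
  initial begin
    errors = 0;
    for (i = 0; i < 8; i = i + 1) begin
      {ci, b, a} = i[2:0];
      #1;
      if ({co, s} !== (a + b + ci)) errors = errors + 1;
    end
    $display("mismatches: %0d", errors);
    $finish;
  end
endmodule
```

Changing only the INIT values turns the same lut4 into any other function of its inputs, which is the point of the lookup-table approach.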
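Add FF to make a Logic Cell
[Figure: logic cell. A 4-bit input In addresses a 16 words x 1 bit memory; a flip-flop (FF) with Clk, CE, and Rst inputs follows it; configuration memory bits (M) are shown; the output is Out.]
Xilinx slide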
Arithmetic, Distributed RAM
[Figure: the logic cell extended with carry logic (Cin, Cout) and with Din and WE inputs to the 16 words x 1 bit memory]
• Fast carry ripple to neighbor.
Xilinx slide
Add Interconnect
[Figure: a group of logic cells, each with a 4-bit input, connected to a routing bundle labeled 40]
• Group logic cells to reduce overhead.
• Add H, V routing channels with switchboxes.
• Add input, output MUXing between logic and routing.
Xilinx slide
Build an Array
[Figure: the group of logic cells with its interconnect, replicated to form an array]
Xilinx slide
Putting the ‘R’ in Reconfigurable Computing
• Fine-grained FPGAs are the platform of choice for Reconfigurable Computing.
[Figure labels: State, Configuration, User Logic, Configuration RAM]
Xilinx slide
Add Bells & Whistles
[Figure labels: Hard Processor, I/O, BRAM, Gigabit Serial, Multiplier (18 bit x 18 bit, 36 bit result), Programmable Termination, Clock Mgmt]
Xilinx slide
Spartan DSP slice
• Useful for P := P + A × (B + D) and sub-expressions like P := A × B
• Note: A, B, D [18b], C [48b], multiplier output [36b], and P [48b]
EEtimes slide
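The expression P := P + A × (B + D) can be written as a behavioral Verilog sketch. The module name mac_preadd, the clear input, and the port names are assumptions for illustration; whether a synthesis tool maps this description onto a DSP slice depends on the tool and its settings. The pre-adder result is kept at 18 bits here, so B + D is assumed not to overflow.

```verilog
// Multiply-accumulate with pre-adder: p := p + a * (b + d)
// (hypothetical module; widths follow the slide: 18b operands, 36b product, 48b P)
module mac_preadd (
  input                    clk,
  input                    clear,   // synchronous clear of the accumulator (assumed)
  input  signed [17:0]     a, b, d,
  output reg signed [47:0] p);

  wire signed [17:0] pre  = b + d;     // pre-adder, assumed not to overflow 18 bits
  wire signed [35:0] prod = a * pre;   // 18b x 18b multiplier, 36b result

  always @(posedge clk)
    if (clear) p <= 0;
    else       p <= p + prod;          // 48b accumulator
endmodule

module mac_preadd_tb;
  reg clk = 0, clear = 1;
  reg signed [17:0] a = 3, b = 4, d = -1;
  wire signed [47:0] p;
  mac_preadd dut (clk, clear, a, b, d, p);
  always #5 clk = ~clk;                // rising edges at t = 5, 15, 25, ...
  initial begin
    #12 clear = 0;                     // accumulator was cleared at t = 5
    #20 $display("p = %0d", p);        // two accumulations of 3 * (4 - 1): p = 18
    $finish;
  end
endmodule
```

Dropping the pre-adder (d = 0) and the feedback gives the sub-expression P := A × B mentioned above.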
Spartan-6 FPGA
• http://www.xilinx.com/support/documentation/data_sheets/ds160.pdf
• 1 slice = 4 LUTs [6-input each] + 8 flipflops
• 1 DSP slice = 18b×18b multiplier + adder + accumulator
• 1 BRAM = 1k × 18b (OR 2 × 0.5k × 18b)
Atlys board, based on Xilinx Spartan 6
[Figure: the board, with the label "Xilinx Spartan 6 FPGA"]
FPGA comparison table [Xilinx]
| Feature | Spartan-6 | Artix-7 | Kintex-7 | Virtex-7 | Kintex UltraScale | Kintex UltraScale+ | Virtex UltraScale | Virtex UltraScale+ |
|---|---|---|---|---|---|---|---|---|
| Feature size [nm] | 45 | 28 | 28 | 28 | 20 | 16 | 20 | 16 |
| Logic Cells (K) | 147 | 215 | 478 | 1,955 | 1,161 | 915 | 4,433 | 2,863 |
| UltraRAM (Mb) | - | - | - | - | - | 36.0 | - | 432.0 |
| Block RAM (Mb) | 4.8 | 13 | 34 | 68 | 76 | 34.5 | 132.9 | 94.5 |
| DSP Slices | 180 | 740 | 1,920 | 3,600 | 5,520 | 3,528 | 2,880 | 11,904 |
| DSP Performance [GMACs] | 140 | 930 | 2,845 | 5,335 | 8,180 | 6,287 | 4,268 | 21,213 |
| Transceiver Count | 8 | 16 | 32 | 96 | 64 | 76 | 120 | 1 |
| Maximum Transceiver Speed (Gb/s) | 3.2 | 6.6 | 12.5 | 28.05 | 16.3 | 32.75 | 30.5 | 32.75 |
| Total Transceiver bw (full duplex) (Gb/s) | 50 | 211 | 800 | 2,784 | 2,086 | 2,478 | 5,886 | 8,384 |
| Memory Interface (DDR3) | 800 | 1,066 | 1,866 | 1,866 | 2,133 | 2,133 | 2,133 | 2,133 |
| PCI Express® | x1 gen1 | x4 gen2 | x8 gen2 | x8 gen3 | x8 gen3 | x16 gen3 | x8 gen3 | x16 gen3 |
| I/O Pins | 576 | 500 | 500 | 1,200 | 832 | 572 | 1,456 | 832 |
| I/O Voltage | 1.2–3.3V | 1.2–3.3V | 1.2–3.3V | 1.2–3.3V | 1.0–3.3V | 1.0–3.3V | 1.0–3.3V | 1.0–1.8V |
Introduction to Verilog
Verilog (IEEE Std. 1364-1995).
• Verilog is a Hardware Description Language (HDL)
• Verilog is a text-based way to describe and exchange designs
• Verilog designs can be simulated,
• … and mapped onto gate-level designs (“logic synthesis”),
• … and subsequently translated to silicon/fpga primitives.
• Berkeley tutorial “CS61c: Verilog Tutorial” by J. Wawrzynek
• Verilog Golden Reference Guide by Doulos
(VHDL is an alternative HDL;
Verilog is easier to learn and use, mainly due to its C-like syntax)
Verilog
• Despite C-like syntax, ...
• … Verilog is NOT an imperative programming language (C, C++, Java, Pascal, FORTRAN…)
• Implicit notion of global time (e.g. picoseconds)
• time units can be used to express delays (“postpone by N units”)
• action can be triggered by events
• Popular language to describe digital circuits (e.g. circuits derived from data-flow graphs) as well as their test environments
Mux2: a 2-way multiplexor
Mux2: a 2-way multiplexor (behavioral)
module mux2 (in0, in1, select, out);
input in0, in1, select;
output out;
assign out = select ? in1 : in0 ;
endmodule // mux2
The assign statement above is Verilog’s continuous assignment.
Alternative, with delay of 3 time units:
assign #3 out = select & in1 | ~select & in0 ;
Mux2: a 2-way multiplexor (gate level)
module mux2 (in0, in1, select, out);
  input in0, in1, select;
  output out;
  wire s0, w0, w1;
  not #1 (s0, select);        // inverter, with 1 unit delay
  and #1 (w0, s0, in0),       // and gate, with 1 unit delay
         (w1, select, in1);   // and gate, with 1 unit delay
  or  #1 (out, w0, w1);       // OR gate, with 1 unit delay
endmodule // mux2
Mux2: a 2-way multiplexor (test bench)
module testmux;
  reg a, b, s;
  reg expected;
  wire f;
  mux2 myMux (.select(s), .in0(a), .in1(b), .out(f));
  initial begin
    #0  s=0; a=0; b=1; expected=0;
    #10 a=1; b=0; expected=1;
    #10 s=1; a=0; b=1; expected=1;
    #10 $stop;
  end
  initial
    $monitor("select=%b in0=%b in1=%b out=%b, expected out=%b time=%d",
             s, a, b, f, expected, $time);
endmodule // testmux
Mux2: a 2-way multiplexor (test results)
select=0 in0=0 in1=1 out=0, expected out=0 time=0
select=0 in0=1 in1=0 out=1, expected out=1 time=10
select=1 in0=0 in1=1 out=1, expected out=1 time=20
Behavioral model of 4-bit Register
// positive edge-triggered,
// synchronous active-high reset.
module reg4 (CLK,Q,D,RST);
input [3:0] D;
input CLK, RST;
output [3:0] Q;
reg [3:0] Q;
always @ (posedge CLK)
  // non-blocking assignment with intra-assignment delay:
  // D is sampled at the clock edge, Q changes 1 time unit later
  if (RST) Q <= #1 0; else Q <= #1 D;
endmodule // reg4
Two possible assignment syntaxes: a = b (blocking) and a <= b (non-blocking)
a <= b
b <= a
swaps the values of a and b
a = b
b = a
simply sets both a and b to the previous value of b
Beware!
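The difference can be observed in simulation with a small sketch. The module name swap_demo and the extra register pair c, d are made up for this illustration: the same pair of statements is run once with non-blocking and once with blocking assignments, on the same clock edge.

```verilog
module swap_demo;
  reg clk = 0;
  reg [3:0] a = 1, b = 2;   // updated with non-blocking assignments
  reg [3:0] c = 1, d = 2;   // updated with blocking assignments

  always @(posedge clk) begin
    a <= b;   // both right-hand sides are read first,
    b <= a;   // then both registers are updated: a = 2, b = 1
  end

  always @(posedge clk) begin
    c = d;    // c becomes 2 immediately
    d = c;    // d reads the new value of c: c = 2, d = 2
  end

  initial begin
    #1 clk = 1;   // a single rising clock edge
    #1 $display("non-blocking: a=%0d b=%0d", a, b);
       $display("blocking:     c=%0d d=%0d", c, d);
    $finish;
  end
endmodule
```

This is why registers in clocked always blocks, such as the one in FIRstage later in this lecture, are written with <=.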
Designing a clock signal
…
reg CLK; // clock is state variable!
…
initial
begin
CLK=1'b0; // clock initially 0 (low)
forever
#5 CLK = ~CLK; // clock period = 10
end
A 22-stage FIR filter
[Figure: chain of 22 stages, each with a multiplier (×), an adder (+), and a delay register (D); coefficients h0, h1, …, h20, h21; input x(n), delayed samples x(n-1), …, x(n-20), x(n-21); the adder chain starts from the constant “0” and ends in the output y(n)]
• Comprising 22 identical FIR stages
FIRstage
• .. as building block of the 22-stage FIR filter
module FIRstage
…
reg signed [0:DWIDTH-1] x;
assign a_out = x;
assign b_out = b_in + (a_in * h_in);
always @(posedge clk) begin
  if (enabled) x <= a_in;
end
endmodule
[Figure: FIRstage with inputs ain, bin, and hin, register x, a multiplier (×), an adder (+), and outputs aout and bout]
Module FIRstage
module FIRstage
  #(parameter DWIDTH = 16,
    parameter DDWIDTH = 2 * DWIDTH)
   (input clk,
    input enabled,
    input signed [0:DWIDTH-1] a_in,
    input signed [0:DDWIDTH-1] b_in,
    output signed [0:DWIDTH-1] a_out,
    output signed [0:DDWIDTH-1] b_out,
    input signed [0:DWIDTH-1] h_in);

  reg signed [0:DWIDTH-1] x;      // Internal registers and wires

  assign a_out = x;
  assign b_out = b_in + (a_in * h_in);

  always @(posedge clk) begin     // Process for the internal register
    if (enabled) x <= a_in;
  end
endmodule
Module FIR (parameters and interface)
module FIR
  #(parameter NR_STAGES = 22,
    parameter DWIDTH = 16,
    parameter CWIDTH = NR_STAGES * DWIDTH,   // filter coefficients
    parameter DDWIDTH = 2 * DWIDTH)
   (input clk,
    input enabled,
    input signed [0:DWIDTH-1] a_in,
    output signed [0:DWIDTH-1] b_out,
    input [0:CWIDTH-1] h_in);                // 22x16 wires
  // Generate and connect NR_STAGES filter stages (next slide)
endmodule
Module FIR (body)
  wire signed [0:DWIDTH-1] a [0:NR_STAGES];    // Internal registers, wires
  wire signed [0:DDWIDTH-1] b [0:NR_STAGES];

  generate                                     // Generate filter stages
    genvar i;
    for (i = 0; i < NR_STAGES; i = i + 1) begin : stage
      FIRstage #(DWIDTH, DDWIDTH)
        comp (clk, enabled, a[i], b[i], a[i+1], b[i+1],
              h_in[i*DWIDTH:(i+1)*DWIDTH-1]);
    end
  endgenerate

  assign b[0] = 0;
  assign a[0] = a_in;                          // connect stages to FIR interface
  assign b_out = b[NR_STAGES][0:DWIDTH-1];
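A 22-stage FIR filter
[Figure: chain of 22 stages, each with a multiplier (×), an adder (+), and a delay register (D) clocked by clk; coefficients h0, h1, …, h20, h21; input x(n), delayed samples x(n-1), …, x(n-20), x(n-21); the adder chain starts from the constant “0” and ends in the output y(n)]
• 22 registers are clocked simultaneously, always @(posedge clk)
• … and 22 multiply-adds run synchronously,
• … at a rate of 44.1 kHz (audio)
• critical path = 1 multiplication + 22 additions (non-optimal)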
A 22-stage FIR filter
[Figure: chain of 22 stages, each with a multiplier (×), an adder (+), and a delay register (D) clocked by clk; coefficients h0, h1, …, h20, h21; input x(n), delayed samples x(n-1), …, x(n-20), x(n-21); the adder chain starts from the constant “0” and ends in the output y(n)]
• Transposed / retimed version of this filter can easily run at 100 MHz on an FPGA: maximum fsample = fclock = 100 MHz
• With fclock = 100 MHz and fsample = 44.1 kHz the HW utilization is only 44.1 kHz / 100000 kHz = 0.044%
• Filter can also be realized with 1 adder + 1 multiplier (L3)
2IMN35: reporting guidelines 2016 (1)
1. Submit one report per team (2 students)
2. Respect deadlines:
   • Assignment L3: Thursday May 26, 2016
   • Assignment L4: Thursday June 9, 2016
   • Assignment L5: Thursday June 16, 2016
3. Make sure that assignments L3, L4, and L5 are demonstrated to and signed off by Alok, Rudolf, or Kees.
4. Report on lab assignments L3, L4, and L5.
5. Submit the reports using Peach (paper copies will not be accepted).
2IMN35: reporting guidelines 2016 (2)
General guidelines (for each assignment), to be followed strictly:
6. Analyze the specifications and requirements.
7. Present/motivate key ideas/decisions, design options, alternatives, trade-offs.
8. Draw architecture block diagram (= picture!).
9. Explain functional correctness of your Verilog programs (include your complete Verilog programs in an appendix).
10. Explain #clock cycles per sample time Ts. Include waveforms.
11. Report, analyze & explain FPGA-resource usage and utilization {#multipliers, #BRAMS, #LUTs} in relation to your design.
12. Report, analyze & explain (min) sample time Ts and (max) sample frequency fs, both after synthesis and after placement & routing.
2IMN35: reporting guidelines 2016 (3)
13. Include simulation results: waveforms in both the time domain and the frequency domain (apply FFT) (assignments 3 and 4 only).
14. Include answers to the inline questions.
15. Annotate all graphs to include, for both axes:
   - quantity (weight, distance, duration, …)
   - unit (ounce, light year, century, …)
   - linear/log/... (ok to assume linear)
Lab assignment 1
Lab assignment 1:
• Today: start
• Tue May 3: completion
Lab assignment 2:
• Tue May 3: start
THANK YOU