26/04/16 1

VLSI Programming 2016: Lecture 3

Course: 2IMN35

Teachers: Kees van Berkel c.h.v.berkel@tue.nl Rudolf Mak r.h.mak@tue.nl

Lab: Kees van Berkel, Rudolf Mak, Alok Lele

www: http://www.win.tue.nl/~wsinmak/Education/2IMN35/

Lecture 3: FPGAs, Verilog, lab assignment 1

VLSI Programming (2IMN35): time table 2016

Columns (once for Tuesday, once for Thursday): 2016 (date), in, session, out.
Tue: h5-h8; MF.07. Thu: h1-h4; Gemini-Z3A-08/10/13.

• 19-Apr: introduction, DSP graphs, bounds, …
• 21-Apr: pipelining, retiming, transposition, J-slow, unfolding; T1+T2
• 26-Apr: tools installed; Introductions to FPGA and Verilog; L1: audio filter simulation; L1, L2
• 28-Apr: T1+T2; unfolding, look-ahead, strength reduction; L1 cntd; T3+T4
• 3-May: folding; L2: audio filter on XUP board
• 5-May: (no entries)
• 10-May: T3+T4; DSP processors; L2 cntd; L3
• 12-May: L3: sequential FIR + strength-reduced FIR
• 17-May: L3 cntd
• 19-May: L3 cntd; L4
• 24-May: systolic computation; T5
• 26-May: L3; L4
• 31-May: T5; L4: audio sample rate convertor
• 2-Jun: L4 cntd; L5
• 7-Jun: L5: 1024x audio sample rate convertor
• 9-Jun: L4; L5 cntd
• 14-Jun: (no entries)
• 16-Jun: L5; deadline report L5

Note on course literature

Lectures VLSI programming are loosely based on:

• Keshab K. Parhi. VLSI Digital Signal Processing Systems, Design and Implementation. Wiley Inter-Science 1999.

• This book is recommended, but not mandatory.

Accompanying slides can be found on:

• http://www.ece.umn.edu/users/parhi/slides.html

• http://www.win.tue.nl/~cberkel/2IN35/

Mandatory reading:

• Edward A. Lee and David G. Messerschmitt. Synchronous Data Flow. Proc. of the IEEE, Vol. 75, No. 9, Sept 1987, pp 1235-1245.

• Keshab K. Parhi. High-Level Algorithm and Architecture Transformations for DSP Synthesis. Journal of VLSI Signal Processing, 9, 121-143 (1995), Kluwer Academic Publishers.

Outline Lecture 3

•  Introduction to FPGAs

•  Introduction to Verilog

•  Introduction to Lab assignment 1

•  Hands on!

FPGA IC on a Xilinx XUP Board (Atlys)

[Figure: photo of the Atlys board, with the Xilinx Spartan 6 FPGA labeled.]

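Building an FPGA: Logic First

• A 4-input lookup table (LUT) can implement any function of 4 inputs.

• For example, a 1-bit adder needs 2 LUTs.

[Figure: a 4-input LUT drawn as a 16 words x 1 bit memory with inputs [A,B,C,D] and output F; a 1-bit adder with inputs A, B, Ci and outputs S and Co, built from two LUTs labeled A⊕B⊕Ci and A.B.Ci.]

Xilinx slide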
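The LUT idea can be sketched in Verilog. This is a minimal illustration, not a Xilinx primitive: the module names lut4 and full_adder_lut and the INIT parameter are assumptions made for this example. The LUT is modeled as a 16 words x 1 bit memory whose address is formed by the four inputs, and the 1-bit adder uses one LUT for the sum and one for the carry-out (here the usual majority function of A, B and Ci).

```verilog
// Sketch only: lut4, full_adder_lut and INIT are hypothetical names.
module lut4 #(parameter [15:0] INIT = 16'h0000)
             (input [3:0] addr, output f);
  wire [15:0] mem = INIT;   // contents of the 16 words x 1 bit memory
  assign f = mem[addr];     // the four inputs select one stored bit
endmodule

module full_adder_lut (input a, input b, input ci,
                       output s, output co);
  // Address = {1'b0, ci, b, a}; the fourth LUT input is unused.
  // 16'h9696: bit i is the parity of the low three address bits (sum).
  lut4 #(.INIT(16'h9696)) sum_lut   (.addr({1'b0, ci, b, a}), .f(s));
  // 16'hE8E8: bit i is 1 when at least two of a, b, ci are 1 (carry).
  lut4 #(.INIT(16'hE8E8)) carry_lut (.addr({1'b0, ci, b, a}), .f(co));
endmodule
```

Changing only the INIT value turns the same lut4 into any other function of its four inputs, which is the point of the slide.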

Add FF to make a Logic Cell

[Figure: logic cell: a 4-input LUT (16 words x 1 bit memory) followed by a flip-flop (FF) with CE and RST; configuration memory bits (M); signals In, Out, Clk, CE, Rst.]

Xilinx slide

Arithmetic, Distributed RAM

• Fast carry ripple to neighbor.

[Figure: logic cell extended with carry logic (Cin, Cout) and with Din and WE inputs on the 16 words x 1 bit memory; FF with CE and RST; configuration memory bits (M).]

Xilinx slide

Add Interconnect

• Group logic cells to reduce overhead.

• Add H, V routing channels with switchboxes.

• Add input, output MUXing between logic and routing.

[Figure: a group of logic cells, each with 4 inputs, connected to the routing channels.]

Xilinx slide

Build an Array

[Figure: array of grouped logic cells with interconnect.]

Xilinx slide

Putting the ‘R’ in Reconfigurable Computing

• Fine-grained FPGAs are the platform of choice for Reconfigurable Computing.

[Figure: User Logic (State) on top of Configuration RAM (Configuration).]

Xilinx slide

Add Bells & Whistles

[Figure: FPGA array extended with Hard Processor, I/O, BRAM, Gigabit Serial, Multiplier (18 Bit x 18 Bit, 36 Bit result), Programmable Termination, Clock Mgmt.]

Xilinx slide

Spartan DSP slice

• Useful for P := P + A × (B + D) and sub-expressions like P := A × B

• Note: A, B, C, D [18b], multiplier output [36b], and P [48b]

EEtimes slide
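The expression P := P + A × (B + D) can be written as a small behavioral module. This is a sketch under assumptions, not a model of the actual DSP slice: the module name mac_preadd, the ports rst and ce, and the choice to let the pre-adder result wrap at 18 bits are illustrative only. The widths follow the note on the slide (18-bit operands, 36-bit product, 48-bit P).

```verilog
// Sketch only: module and port names are hypothetical.
module mac_preadd (input clk, input rst, input ce,
                   input  signed [17:0] a,
                   input  signed [17:0] b,
                   input  signed [17:0] d,
                   output reg signed [47:0] p);
  wire signed [17:0] pre  = b + d;     // pre-adder (assumed to wrap at 18 bits)
  wire signed [35:0] prod = a * pre;   // 18b x 18b signed multiply, 36b result
  always @(posedge clk)
    if (rst)     p <= 0;               // synchronous clear of the accumulator
    else if (ce) p <= p + prod;        // P := P + A * (B + D), sign-extended to 48b
endmodule
```

Tying d to zero and clearing p before each operation gives the sub-expression P := A × B. Whether a synthesis tool maps such a description onto one DSP slice depends on the tool and its settings.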

Spartan-6 FPGA

•  http://www.xilinx.com/support/documentation/data_sheets/ds160.pdf

•  1 slice = 4 LUTs [6-input each] + 8 flipflops

•  1 DSP slice = 18b×18b multiplier + adder + accumulator

•  1 BRAM = 1k × 18b (OR 2 × 0.5k × 18b)


Atlys board, based on Xilinx Spartan 6

[Figure: photo of the Atlys board, with the Xilinx Spartan 6 FPGA labeled.]

FPGA comparison table [Xilinx]

Feature | Spartan-6 | Artix-7 | Kintex-7 | Virtex-7 | Kintex UltraScale | Kintex UltraScale+ | Virtex UltraScale | Virtex UltraScale+
Feature size [nm] | 45 | 28 | 28 | 28 | 20 | 20 | 16 | 16
Logic Cells (K) | 147 | 215 | 478 | 1,955 | 1,161 | 915 | 4,433 | 2,863
UltraRAM (Mb) | - | - | - | - | - | 36.0 | - | 432.0
Block RAM (Mb) | 4.8 | 13 | 34 | 68 | 76 | 34.5 | 132.9 | 94.5
DSP Slices | 180 | 740 | 1,920 | 3,600 | 5,520 | 3,528 | 2,880 | 11,904
DSP Performance [GMACs] | 140 | 930 | 2,845 | 5,335 | 8,180 | 6,287 | 4,268 | 21,213
Transceiver Count | 8 | 16 | 32 | 96 | 64 | 76 | 120 | 1
Maximum Transceiver Speed (Gb/s) | 3.2 | 6.6 | 12.5 | 28.05 | 16.3 | 32.75 | 30.5 | 32.75
Total Transceiver bw (full duplex) (Gb/s) | 50 | 211 | 800 | 2,784 | 2,086 | 2,478 | 5,886 | 8,384
Memory Interface (DDR3) | 800 | 1,066 | 1,866 | 1,866 | 2,133 | 2,133 | 2,133 | 2,133
PCI Express® | x1 gen1 | x4 gen2 | x8 gen2 | x8 gen3 | x8 gen3 | x16 gen3 | x8 gen3 | x16 gen3
I/O Pins | 576 | 500 | 500 | 1,200 | 832 | 572 | 1,456 | 832
I/O Voltage | 1.2–3.3V | 1.2–3.3V | 1.2–3.3V | 1.2–3.3V | 1.0–3.3V | 1.0–3.3V | 1.0–3.3V | 1.0–1.8V

Introduction to Verilog


Verilog (IEEE Std. 1364-1995).

• Verilog is a Hardware Description Language (HDL)

• Verilog is a text-based way to describe and exchange designs

• Verilog designs can be simulated,

• … and mapped onto gate-level designs (“logic synthesis”),

• … and subsequently translated to silicon/fpga primitives.

• Berkeley tutorial “CS61c: Verilog Tutorial” by J. Wawrzynek

• Verilog Golden Reference Guide by Doulos

(VHDL is an alternative HDL; Verilog is easier to learn and use, mainly due to its C-like syntax)

Verilog

• Despite C-like syntax, ...

• … Verilog is NOT an imperative programming language (C, C++, Java, Pascal, FORTRAN…)

• Implicit notion of global time (e.g. picoseconds)

• time units can be used to express delays (“postpone by N units”)

• action can be triggered by events

• Popular language to describe digital circuits (e.g. circuits derived from data-flow graphs) as well as their test environments


Mux2: a 2-way multiplexor


Mux2: a 2-way multiplexor (behavioral)

module mux2 (in0, in1, select, out);
  input in0, in1, select;
  output out;
  assign out = select ? in1 : in0;
endmodule // mux2

The assign statement is Verilog’s continuous assignment.

Alternative, with delay of 3 time units:

  assign #3 out = select & in1 | ~select & in0;

Mux2: a 2-way multiplexor (gate level)

module mux2 (in0, in1, select, out);
  input in0, in1, select;
  output out;
  wire s0, w0, w1;
  not #1 (s0, select);       // inverter, with 1 unit delay
  and #1 (w0, s0, in0),      // and gate, with 1 unit delay
         (w1, select, in1);  // and gate, with 1 unit delay
  or  #1 (out, w0, w1);      // OR gate, with 1 unit delay
endmodule // mux2

Mux2: a 2-way multiplexor (test bench)

module testmux;
  reg a, b, s;
  reg expected;
  wire f;
  mux2 myMux (.select(s), .in0(a), .in1(b), .out(f));
  initial begin
    #0  s=0; a=0; b=1; expected=0;
    #10      a=1; b=0; expected=1;
    #10 s=1; a=0; b=1; expected=1;
    #10 $stop;
  end
  initial
    $monitor("select=%b in0=%b in1=%b out=%b, expected out=%b time=%d",
             s, a, b, f, expected, $time);
endmodule // testmux

Mux2: a 2-way multiplexor (test results)

select=0 in0=0 in1=1 out=0, expected out=0 time=0
select=0 in0=1 in1=0 out=1, expected out=1 time=10
select=1 in0=0 in1=1 out=1, expected out=1 time=20
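The test bench above prints the expected value next to the actual output and leaves the comparison to the reader. As a possible variation, the sketch below loops over all 8 input combinations and compares the output against the intended function itself. The module name testmux_all and the variables i and errors are assumptions for this example; the behavioral mux2 is repeated so that the example is self-contained.

```verilog
module mux2 (in0, in1, select, out);   // behavioral mux2, as on the slide
  input in0, in1, select;
  output out;
  assign out = select ? in1 : in0;
endmodule

module testmux_all;                     // hypothetical test bench name
  reg a, b, s;
  wire f;
  integer i, errors;
  mux2 myMux (.select(s), .in0(a), .in1(b), .out(f));
  initial begin
    errors = 0;
    for (i = 0; i < 8; i = i + 1) begin
      {s, b, a} = i;                    // low 3 bits of i drive the inputs
      #10;                              // give the output time to settle
      if (f !== (s ? b : a)) begin      // compare against the specification
        errors = errors + 1;
        $display("MISMATCH select=%b in0=%b in1=%b out=%b", s, a, b, f);
      end
    end
    $display("%0d mismatches", errors);
    $finish;
  end
endmodule
```

The 10-unit wait is longer than the delays in both the 3-unit and the gate-level variants, so the same test bench can be reused for them.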

Behavioral model of 4-bit Register

// positive edge-triggered,
// synchronous active-high reset.
module reg4 (CLK, Q, D, RST);
  input [3:0] D;
  input CLK, RST;
  output [3:0] Q;
  reg [3:0] Q;
  always @ (posedge CLK)
    if (RST) #1 Q = 0; else #1 Q = D;
endmodule // reg4

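Two possible assignment syntaxes: a = b (blocking) and a <= b (non-blocking)

a <= b;
b <= a;

swaps the values of a and b: both right-hand sides are evaluated before either variable is updated.

a = b;
b = a;

simply sets both a and b to the previous value of b: the first assignment completes before the second one starts.

Beware!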
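The difference can be observed in simulation. The following is a minimal sketch; the module name assign_demo and the variables c and d (used for the blocking pair so that the two cases do not interfere) are assumptions for this example.

```verilog
module assign_demo;       // hypothetical module name
  reg a, b;               // updated with non-blocking assignments
  reg c, d;               // updated with blocking assignments
  reg clk;

  always @(posedge clk) begin
    a <= b;               // both right-hand sides are sampled first,
    b <= a;               // then a and b are updated: the values swap
  end

  always @(posedge clk) begin
    c = d;                // c is overwritten immediately ...
    d = c;                // ... so d receives the value it already had
  end

  initial begin
    clk = 0; a = 0; b = 1; c = 0; d = 1;
    #5 clk = 1;           // single rising clock edge
    #1 $display("non-blocking: a=%b b=%b  blocking: c=%b d=%b", a, b, c, d);
    $finish;
  end
endmodule
```

After the clock edge, a and b have exchanged their values (a=1, b=0), whereas c and d both hold the old value of d (c=1, d=1).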

Designing a clock signal

reg CLK; // clock is state variable!
initial
begin
  CLK = 1'b0; // clock initially 0 (low)
  forever
    #5 CLK = ~CLK; // clock period = 10
end
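The clock generator and the 4-bit register can be combined in one small test bench. This is a sketch under assumptions: the test bench name testreg4, the instance name dut, and the stimulus values and times are chosen for illustration only. The reg4 module is repeated so that the example is self-contained.

```verilog
module reg4 (CLK, Q, D, RST);        // register from the earlier slide
  input [3:0] D;
  input CLK, RST;
  output [3:0] Q;
  reg [3:0] Q;
  always @(posedge CLK)
    if (RST) #1 Q = 0; else #1 Q = D;
endmodule

module testreg4;                     // hypothetical test bench name
  reg CLK, RST;
  reg [3:0] D;
  wire [3:0] Q;
  reg4 dut (.CLK(CLK), .Q(Q), .D(D), .RST(RST));

  initial begin
    CLK = 1'b0;                      // clock initially 0 (low)
    forever #5 CLK = ~CLK;           // clock period = 10
  end

  initial begin
    RST = 1; D = 4'hA;               // hold reset across the first rising edge
    #12 RST = 0;                     // release reset between two edges
    #10 D = 4'h5;                    // change the input between two edges
    #10 $finish;
  end

  initial
    $monitor("time=%0d RST=%b D=%h Q=%h", $time, RST, D, Q);
endmodule
```

Changing RST and D away from the rising clock edges avoids races between the stimulus and the register.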

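A 22-stage FIR filter

• Comprising 22 identical FIR stages

[Figure: 22-stage FIR filter with input x(n) and output y(n); each stage has a multiplier (×), an adder (+) and a delay register (D); delayed samples x(n-1), …, x(n-20), x(n-21); coefficients h0, h1, …, h20, h21; the adder chain starts from “0”.]

FIRstage

• … as building block of the 22-stage FIR filter

[Figure: FIRstage with inputs ain, bin, hin, register x, a multiplier (×) and an adder (+), and outputs aout, bout.]

module FIRstage
  …
  reg signed [0:DWIDTH-1] x;

  assign a_out = x;
  assign b_out = b_in + (a_in * h_in);

  always @(posedge clk) begin
    if (enabled) x <= a_in;
  end
endmodule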

Module FIRstage

module FIRstage
  #(parameter DWIDTH = 16,
    parameter DDWIDTH = 2 * DWIDTH)
  (input clk,
   input enabled,
   input signed [0:DWIDTH-1] a_in,
   input signed [0:DDWIDTH-1] b_in,
   output signed [0:DWIDTH-1] a_out,
   output signed [0:DDWIDTH-1] b_out,
   input signed [0:DWIDTH-1] h_in);

  reg signed [0:DWIDTH-1] x; // Internal registers and wires

  assign a_out = x;
  assign b_out = b_in + (a_in * h_in);

  always @(posedge clk) begin // Process for the internal register
    if (enabled) x <= a_in;
  end
endmodule

Module FIR (parameters and interface)

module FIR
  #(parameter NR_STAGES = 22,
    parameter DWIDTH = 16,
    parameter CWIDTH = NR_STAGES * DWIDTH, // filter coefficients
    parameter DDWIDTH = 2 * DWIDTH)
  (input clk,
   input enabled,
   input signed [0:DWIDTH-1] a_in,
   output signed [0:DWIDTH-1] b_out,
   input [0:CWIDTH-1] h_in); // 22x16 wires

  // Generate and connect NR_STAGES filter stages (next slide)

endmodule

Module FIR (body)

  wire signed [0:DWIDTH-1] a [0:NR_STAGES]; // Internal registers, wires
  wire signed [0:DDWIDTH-1] b [0:NR_STAGES];

  generate // Generate filter stages
    genvar i;
    for (i = 0; i < NR_STAGES; i = i + 1) begin : stage
      FIRstage #(DWIDTH, DDWIDTH)
        comp (clk, enabled, a[i], b[i], a[i+1], b[i+1],
              h_in[i*DWIDTH:(i+1)*DWIDTH-1]);
    end
  endgenerate

  assign b[0] = 0; // connect stages to FIR interface
  assign a[0] = a_in;
  assign b_out = b[NR_STAGES][0:DWIDTH-1];
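As a cross-check, the wiring of the stages can be related to the usual FIR sum. The derivation below is a sketch under assumptions: a_i(n) and b_i(n) are notation introduced here for the values on wires a[i] and b[i] at sample n, h_i stands for the i-th 16-bit slice of h_in, enabled is assumed to be asserted once per sample, and the final selection of DWIDTH bits for b_out is ignored.

```latex
\begin{align*}
a_0(n) &= x(n), &
a_{i+1}(n) &= a_i(n-1)
  \quad\Longrightarrow\quad a_i(n) = x(n-i), \\
b_0(n) &= 0, &
b_{i+1}(n) &= b_i(n) + h_i \, a_i(n), \\
y(n) &= b_{22}(n) = \sum_{i=0}^{21} h_i \, x(n-i).
\end{align*}
```

The first line corresponds to the register x in each FIRstage, the second to its multiply-add, and the last to reading b[NR_STAGES] with NR_STAGES = 22.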

A 22-stage FIR filter

• 22 registers are clocked simultaneously, always @(posedge clk)

• … and 22 multiply-adds run synchronously,

• … at a rate of 44.1 kHz (audio)

• critical path = 1 multiplication + 22 additions (not optimal)

[Figure: 22-stage FIR filter with input x(n), output y(n), coefficients h0, h1, …, h20, h21, and a common clk driving all D registers.]

A 22-stage FIR filter

• A transposed / retimed version of this filter can easily run at 100 MHz on an FPGA: maximum fsample = fclock = 100 MHz

• With fclock = 100 MHz and fsample = 44.1 kHz the HW utilization is only 44.1 kHz / 100000 kHz = 0.044%

• The filter can also be realized with 1 adder + 1 multiplier (L3)

[Figure: 22-stage FIR filter with input x(n), output y(n), coefficients h0, h1, …, h20, h21, and a common clk driving all D registers.]
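One possible shape of a stage for the transposed version is sketched below. This is an illustration under assumptions, not the course's reference solution: the module name FIRstageT and the register name acc are hypothetical, and the port list simply mirrors FIRstage. The input sample is passed on unregistered to all stages, and the register sits behind the adder, so the critical path of one stage is 1 multiplication + 1 addition.

```verilog
// Sketch only: FIRstageT and acc are hypothetical names.
module FIRstageT
  #(parameter DWIDTH = 16,
    parameter DDWIDTH = 2 * DWIDTH)
  (input clk,
   input enabled,
   input  signed [0:DWIDTH-1]  a_in,   // input sample, shared by all stages
   input  signed [0:DDWIDTH-1] b_in,   // registered partial sum of previous stage
   output signed [0:DWIDTH-1]  a_out,
   output signed [0:DDWIDTH-1] b_out,
   input  signed [0:DWIDTH-1]  h_in);

  reg signed [0:DDWIDTH-1] acc;        // register placed after the adder

  assign a_out = a_in;                 // no register on the sample path
  assign b_out = acc;

  always @(posedge clk) begin
    if (enabled) acc <= b_in + (a_in * h_in);
  end
endmodule
```

If such stages are chained in the same way as FIRstage, the coefficients have to be supplied in reverse order (the first stage gets h21, the last stage h0), and the output appears one clock cycle later than in the original structure.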

2IN35: reporting guidelines 2016 (1)

1.  Submit one report per team (2 students)

2.  Respect deadlines:
    • Assignment L3: Thursday May 26, 2016
    • Assignment L4: Thursday June 9, 2016
    • Assignment L5: Thursday June 16, 2016

3.  Make sure that assignments L3, L4, and L5 are demonstrated to and signed off by Alok, Rudolf, or Kees.

4.  Report on lab assignments L3, L4, and L5.

5.  Submit the reports using Peach (paper copies will not be accepted).

2IN35: reporting guidelines 2016 (2)

General guidelines (each assignment), to be followed strictly:

6.  Analyze the specifications and requirements.

7.  Present/motivate key ideas/decisions, design options, alternatives, trade-offs.

8.  Draw architecture block diagram (= picture!).

9.  Explain functional correctness of your Verilog programs (include your complete Verilog programs in an appendix).

10. Explain #clock cycles per sample time Ts. Include waveforms.

11. Report, analyze & explain FPGA-resource usage and utilization {#multipliers, #BRAMS, #LUTs} in relation to your design.

12. Report, analyze & explain (min) sample time Ts and (max) sample frequency fs, both after synthesis and after placement & routing.

2IN35: reporting guidelines 2016 (3)

13. Include simulation results: waveforms both in the time domain and in the frequency domain (apply FFT) (assignments 3 and 4 only).

14. Include answers to the inline questions.

15. Annotate all graphs to include, for both axes:
    - quantity (weight, distance, duration, …)
    - unit (ounce, light year, century, …)
    - linear/log/... (ok to assume linear)

Lab assignment 1

Lab assignment 1:

• Today: start

• Tue May 3: completion

Lab assignment 2:

• Tue May 3: start


THANK YOU