
Reconfigurable Computing

Reconfigurable Architectures

Chapter 3.1

Prof. Dr.-Ing. Jürgen Teich

Lehrstuhl für Hardware-Software-Co-Design

Early Work

Gerald Estrin Fix-Plus Machine

Vision of a restructurable computer system

Pragmatic problem studies predict gains in computation speeds in a variety of computational tasks when executed on appropriate problem-oriented configurations of the variable structure computer. The economic feasibility of the system is based on utilization of essentially the same hardware in a variety of special purpose structures. This capability is achieved by programmed or physical restructuring of a part of the hardware.

G. Estrin, B. Bussel, R. Turn, J. Bibb (UCLA 1963)

Gerald Estrin Fix-Plus Machine

Fixed plus Variable structure computer

Proposed by G. Estrin in 1959

Consists of three parts:
• A high-speed general purpose computer (the fixed part, F).
• A variable part (V) consisting of high-speed digital substructures of various sizes, which can be reorganized into problem-oriented special purpose configurations.
• The supervisory control (SC), which coordinates operations between the fixed module and the variable module.

Speed gain over the IBM 7090: 2.5 to 1000

Source: G. Estrin et al., Parallel Processing in a Restructurable Computer System, IEEE Transactions on Electronic Computers, vol. 12, no. 5, pp. 747-755, 1963.

Gerald Estrin Fix-Plus Machine

The Fixed Part (F)
• Was initially an IBM 7090, but could be any general purpose computer

The Variable Part (V)
• Made up of a set of problem-specific optimized functional units in the basic configuration (trigonometric functions, logarithms, exponentials, n-th powers, roots, complex arithmetic, hyperbolic functions, matrix operations)

Two types of basic building blocks
• The first basic element contains four amplifiers and associated input logic for signal inversion, amplification, or high-speed storage
• The second basic block consists of ten diodes and four output drivers and is used for combinational logic

[Figure: the basic blocks]

Source: G. Estrin et al., Parallel Processing in a Restructurable Computer System, IEEE Transactions on Electronic Computers, vol. 12, no. 5, pp. 747-755, 1963.

Gerald Estrin Fix-Plus Machine

[Figures: *the mother board, *the wiring harness]

The basic modules can be inserted into any of 36 positions on a mother board.

The connection between the modules is established through a wiring harness.

Function reconfiguration means changing some modules.

Routing reconfiguration means changing parts of the wiring harness.

*Source: G. Estrin et al., Parallel Processing in a Restructurable Computer System, IEEE Transactions on Electronic Computers, vol. 12, no. 5, pp. 747-755, 1963.

Gerald Estrin Fix-Plus Machine

[Figure: Estrin at work.]

Substantial effort on manual reconfiguration

Source: C. Bobda, Introduction to Reconfigurable Computing, Springer, 2007.

The Rammig Machine

Goal

Investigation of a system which, with no manual or mechanical interference, permits the building, changing, processing and destruction of real (not simulated) digital hardware

Franz J. Rammig (University of Dortmund 1977)

The concept resulted in the construction of a hardware editor

Useful to observe a circuit under test (hardware emulation)

The Rammig Machine

Implementation
• Outputs of modules are connected to selectors, and selector outputs are connected to module inputs.
• Software-controlled module interconnection

Two main problems to solve:
• Because the circuit is not hard-wired, a distortion of the behaviour is possible during reconfiguration
• The timing is controlled by the circuit instead of being dictated by an observation mechanism

A time-control must therefore be provided by delay circuits and inertial-delay circuits

Source: F.J. Rammig, A concept for the editing of hardware resulting in an automatic hardware-editor, Design Automation Conference, pp. 187-193, 1977.
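The selector-based interconnection can be illustrated with a small software model. This is only a sketch under simplifying assumptions, not Rammig's implementation: the names `evaluate`, `MODULES`, `NAND_SETTING` and `NOR_SETTING` are hypothetical, modules are purely combinational, they are evaluated in a fixed order, and the timing problems listed above are ignored.

```python
def evaluate(modules, selection, primary_inputs):
    """Evaluate all modules once, in dictionary order.

    modules:        name -> (function, number of inputs)
    selection:      (module name, input index) -> name of the selected source
    primary_inputs: name -> value of an external input signal
    """
    signals = dict(primary_inputs)
    for name, (func, n_inputs) in modules.items():
        # Each module input is fed by a selector that picks one source signal.
        args = [signals[selection[(name, i)]] for i in range(n_inputs)]
        signals[name] = func(*args)
    return signals


# The modules themselves never change; only the selector settings do.
MODULES = {
    "and": (lambda x, y: x & y, 2),
    "or": (lambda x, y: x | y, 2),
    "not": (lambda x: 1 - x, 1),
}

# Selector settings that make the "not" module output NAND(a, b).
NAND_SETTING = {
    ("and", 0): "a", ("and", 1): "b",
    ("or", 0): "a", ("or", 1): "b",
    ("not", 0): "and",
}

# Changing a single selector turns the same hardware into NOR(a, b).
NOR_SETTING = dict(NAND_SETTING)
NOR_SETTING[("not", 0)] = "or"

inputs = {"a": 1, "b": 0}
nand_out = evaluate(MODULES, NAND_SETTING, inputs)["not"]
nor_out = evaluate(MODULES, NOR_SETTING, inputs)["not"]
```

In this model, "editing the hardware" amounts to writing a new value into one entry of the selection table, which mirrors the software-controlled interconnection described on the slide.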

Programmable Logic

PALs and PLAs

• Pre-fabricated building block of many AND/OR gates (or NOR, NAND)

• "Personalized" by making or breaking connections between the gates

Programmable Array Block Diagram for Sum of Products Form

[Figure: the inputs feed a dense array of AND gates that forms the product terms; the product terms feed a dense array of OR gates that forms the outputs]

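Example:

Key to Success: Shared Product Terms

Equations

$F_0 = A + \overline{B}\,\overline{C}$
$F_1 = A\,\overline{C} + A\,B$
$F_2 = \overline{B}\,\overline{C} + A\,B$
$F_3 = \overline{B}\,C + A$

Personality Matrix

Product term                    | Inputs A B C | Outputs F0 F1 F2 F3
$A\,B$                          | 1 1 -        | 0 1 1 0
$\overline{B}\,C$               | - 0 1        | 0 0 0 1
$A\,\overline{C}$               | 1 - 0        | 0 1 0 0
$\overline{B}\,\overline{C}$    | - 0 0        | 1 0 1 0
$A$                             | 1 - -        | 1 0 0 1

Input Side: 1 = asserted in term, 0 = negated in term, - = does not participate

Output Side: 1 = term connected to output, 0 = no connection to output

Reuse of terms: $A\,B$, $\overline{B}\,\overline{C}$ and $A$ each feed two outputs.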
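The personality matrix can be evaluated mechanically, which makes the sharing of product terms easy to check. The following is a minimal sketch, assuming the matrix is stored as strings; the names `AND_PLANE`, `OR_PLANE`, `eval_pla` and `reference` are illustrative and not part of the original slides. The complemented literals are read off the input side of the matrix.

```python
from itertools import product

# Input side: one row per product term, one column per input (A, B, C).
# '1' = asserted in term, '0' = negated in term, '-' = does not participate.
AND_PLANE = ["11-", "-01", "1-0", "-00", "1--"]

# Output side: one row per product term, one column per output (F0..F3).
# '1' = term connected to output, '0' = no connection.
OR_PLANE = ["0110", "0001", "0100", "1010", "1001"]


def eval_pla(inputs):
    """Return (F0, F1, F2, F3) for an input tuple (A, B, C)."""
    # AND array: a term is true if every participating literal matches.
    terms = [
        all(lit == "-" or int(lit) == bit for lit, bit in zip(row, inputs))
        for row in AND_PLANE
    ]
    # OR array: an output is true if any connected term is true.
    n_outputs = len(OR_PLANE[0])
    return tuple(
        int(any(t and row[j] == "1" for t, row in zip(terms, OR_PLANE)))
        for j in range(n_outputs)
    )


def reference(a, b, c):
    """The four equations written out directly."""
    nb, nc = 1 - b, 1 - c
    return (a | (nb & nc), (a & nc) | (a & b), (nb & nc) | (a & b), (nb & c) | a)


# The matrix and the equations agree on all eight input combinations.
all_match = all(eval_pla(abc) == reference(*abc) for abc in product((0, 1), repeat=3))
```

Only five AND rows are needed for four two-term outputs because three of the rows are connected to two output columns each.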

PALs and PLAs

Example Continued - Unprogrammed device

All possible connections are available before programming

[Figure: unprogrammed array with inputs A, B, C and outputs F0, F1, F2, F3]

PALs and PLAs

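Example Continued - Programmed part

Unwanted connections are "blown"

Note: some array structures work by making connections rather than breaking them

[Figure: programmed array with inputs A, B, C, product terms $A\,B$, $\overline{B}\,C$, $A\,\overline{C}$, $\overline{B}\,\overline{C}$, $A$, and outputs F0, F1, F2, F3]

PALs and PLAs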

Alternative representation for high fan-in structures

Short-hand notation so we don't have to draw all the wires!

x at junction indicates a connection

Notation for implementation

$F_0 = A\,B + \overline{A}\,\overline{B}$
$F_1 = C\,\overline{D} + \overline{C}\,D$

[Figure: unprogrammed device and programmed device with inputs A, B, C, D]

PALs and PLAs

Design Example

Multiple functions of A, B, C

$F_1 = A\,B\,C$
$F_2 = A + B + C$
$F_3 = \overline{A\,B\,C}$
$F_4 = \overline{A + B + C}$
$F_5 = A \oplus B \oplus C$
$F_6 = \overline{A \oplus B \oplus C}$

[Figure: programmed array with inputs A, B, C, the shared product terms, and outputs F1 to F6]

PALs and PLAs

What is the difference between Programmable Array Logic (PAL) and Programmable Logic Array (PLA)?

PAL concept (implemented by Monolithic Memories): the AND array is programmable, the OR array is fixed at fabrication

A given column of the OR array has access to only a subset of the possible product terms

PLA concept: both AND and OR arrays are programmable

PALs and PLAs

Design Example: BCD to Gray Code Converter

Truth Table

A 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

B 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1

C 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1

D 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

W 0 0 0 0 0 1 1 1 1 1 X X X X X X

X 0 0 0 0 1 1 0 0 0 0 X X X X X X

Y 0 0 1 1 1 1 1 1 0 0 X X X X X X

Z 0 1 1 0 0 0 0 1 1 0 X X X X X X

K-maps (columns: AB, rows: CD)

K-map for W
CD \ AB | 00 01 11 10
00      |  0  0  X  1
01      |  0  1  X  1
11      |  0  1  X  X
10      |  0  1  X  X

K-map for X
CD \ AB | 00 01 11 10
00      |  0  1  X  0
01      |  0  1  X  0
11      |  0  0  X  X
10      |  0  0  X  X

K-map for Y
CD \ AB | 00 01 11 10
00      |  0  1  X  0
01      |  0  1  X  0
11      |  1  1  X  X
10      |  1  1  X  X

K-map for Z
CD \ AB | 00 01 11 10
00      |  0  0  X  1
01      |  1  0  X  0
11      |  0  1  X  X
10      |  1  0  X  X

Minimized Functions:

$W = A + B\,D + B\,C$
$X = B\,\overline{C}$
$Y = B + C$
$Z = \overline{A}\,\overline{B}\,\overline{C}\,D + B\,C\,D + A\,\overline{D} + \overline{B}\,C\,\overline{D}$
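The minimized functions can be checked against the truth table for the ten defined input codes. The sketch below assumes the complemented literals $X = B\,\overline{C}$ and $Z = \overline{A}\,\overline{B}\,\overline{C}\,D + B\,C\,D + A\,\overline{D} + \overline{B}\,C\,\overline{D}$, which are the ones consistent with the truth table and K-maps; the function names are illustrative.

```python
# Truth-table outputs for inputs ABCD = 0000 .. 1001.
# The codes 1010 .. 1111 are don't-cares and are not checked.
W = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
X = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
Y = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
Z = [0, 1, 1, 0, 0, 0, 0, 1, 1, 0]


def minimized(n):
    """Evaluate the minimized sum-of-products forms for input code n."""
    a, b, c, d = (n >> 3) & 1, (n >> 2) & 1, (n >> 1) & 1, n & 1
    na, nb, nc, nd = 1 - a, 1 - b, 1 - c, 1 - d
    w = a | (b & d) | (b & c)
    x = b & nc
    y = b | c
    z = (na & nb & nc & d) | (b & c & d) | (a & nd) | (nb & c & nd)
    return w, x, y, z


def matches_truth_table():
    """True if the minimized forms reproduce every defined table row."""
    return all(minimized(n) == (W[n], X[n], Y[n], Z[n]) for n in range(10))
```

Calling `matches_truth_table()` confirms that the four expressions cover exactly the 1-entries of the defined rows, so the don't-cares were used only to enlarge the groups.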

PALs and PLAs

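Programmed PAL:

4 product terms per each OR gate

Minimized Functions:

$W = A + B\,D + B\,C$
$X = B\,\overline{C}$
$Y = B + C$
$Z = \overline{A}\,\overline{B}\,\overline{C}\,D + B\,C\,D + A\,\overline{D} + \overline{B}\,C\,\overline{D}$

Product terms per output (unused terms are programmed to 0):

Output | Term 1                                     | Term 2    | Term 3             | Term 4
W      | $A$                                        | $B\,D$    | $B\,C$             | 0
X      | $B\,\overline{C}$                          | 0         | 0                  | 0
Y      | $B$                                        | $C$       | 0                  | 0
Z      | $\overline{A}\,\overline{B}\,\overline{C}\,D$ | $B\,C\,D$ | $A\,\overline{D}$ | $\overline{B}\,C\,\overline{D}$

[Figure: programmed PAL with inputs A, B, C, D and outputs W, X, Y, Z]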

Complex Programmable Logic Devices

• Complex PLDs (CPLDs) typically combine PAL combinational logic with flip-flops
– Organized into logic blocks connected by an interconnect matrix
– Combinational or registered output
• Usually enough logic for simple counters, state machines, decoders, etc.
• CPLD logic is not enough for complex operations
• FPGAs have much more logic than CPLDs
• e.g., Xilinx CoolRunner-II, etc.

Xilinx Coolrunner CPLD

[Figure: CoolRunner-II architecture: function blocks with macrocells for data storage, connected through an interconnection matrix]

Source: Xilinx, Inc. DS090: CoolRunner-II CPLD Family, 2008

Field Programmable Gate Arrays (FPGAs)

Introduced in 1985 by Xilinx

Roughly seen, an FPGA consists of:
• A set of programmable macro cells
• A programmable interconnection network
• Programmable inputs/outputs

Subparts of a (complex) function are implemented in macro cells, which are then connected to build the complete function

The I/O can be programmed to drive the macro cells' inputs or to be driven by the macro cells' outputs

Unlike a traditional application-specific integrated circuit (ASIC), the function is specified by the user after the device is manufactured

Physical structure and programming method are vendor-dependent

[Figure: FPGA with programmable macro cells, programmable routing and programmable I/O]

FPGA Structure

Typical organization

Symmetrical Array
• 2D array of processing elements (PEs) embedded in an interconnection network
• Interconnection points at the horizontal-vertical intersections

Row-based
• Rows of processing elements
• Horizontal routing via horizontal channels
• Channels divided into segments
• Vertical connections via dedicated vertical tracks (not shown in the figure)

[Figures: symmetrical array, row-based]

FPGA Structure

Typical organization (cont.)

Sea of gates
• 2D array of processing elements
• No space is set aside between the PEs for routing
• Connections are made on a separate layer on top of the cells

Hierarchical
• Hierarchically placed macro cells
• Low-level macro cells are grouped to build the higher-level PEs

[Figures: sea of gates, hierarchical]

FPGA Programming Technologies

SRAM (LUT-based)
• An SRAM is used to store all possible values of a function
• The value of a function for a given input is retrieved using the inputs as the SRAM address
• An SRAM implementing a function is called a look-up table (LUT)
• A new function is implemented by writing new values into the LUT
• An SRAM-based FPGA can therefore be reprogrammed (configured) on the fly
• Since a LUT is volatile, the LUT configuration is lost when the system is switched off

FPGA Programming Technologies

Anti-fuse
• An anti-fuse normally presents a high-impedance state
• It can be "fused" into a low-impedance state when programmed by a high voltage
• The anti-fuses used in FPGAs from different vendors differ in construction
• Small area
• Lower resistance and parasitic capacitance than transistors -> reduced routing delays
• No re-programming possible

FPGA Programming Technologies

Poly-diffusion anti-fuse: Actel PLICE (programmable low-impedance circuit element)
• Poly-silicon terminal
• Oxide-Nitride-Oxide dielectric
• Melting the dielectric establishes the connection
Source: Microsemi Corp.

Metal anti-fuse: QuickLogic ViaLink
• 2 metal terminal layers (Titanium-Tungsten)
• Programming points isolated by an amorphous silicon film
Source: Quicklogic Corp.

FPGA Programming Technologies

EEPROM (Flash)
• The same technology as that used in EPROM and EEPROM memories.
• EPROMs can be erased, but only as a whole.
• EEPROM can be selectively re-programmed in-circuit.
• EPROM's resistors consume static power.
• EEPROM requires more chip area and multiple voltage sources.

LUT
• LUTs are used as function generators in SRAM-based FPGAs
• A k-input LUT can implement up to $2^{2^k}$ different functions
• A k-input LUT has $2^k$ SRAM locations
• A function is implemented by writing all possible values that the function can take into the LUT
• The input values are used to address the LUT and retrieve the value of the function corresponding to the input values

Example: a XOR b

a b | a XOR b
0 0 | 0
0 1 | 1
1 0 | 1
1 1 | 0

[Figure: a 2-input LUT with inputs a, b holding the contents 0110 and producing a xor b]
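The look-up mechanism can be sketched in a few lines of Python. This is an illustrative model only, assuming that the first input is the most significant address bit; the class `LUT` and the helper `function_count` are hypothetical names, not a vendor API.

```python
from itertools import product


class LUT:
    """A k-input look-up table with 2**k one-bit storage locations."""

    def __init__(self, k, contents):
        if len(contents) != 2 ** k:
            raise ValueError("a k-input LUT needs exactly 2**k entries")
        self.k = k
        self.contents = list(contents)

    @classmethod
    def from_function(cls, k, func):
        """Configure the LUT by tabulating func over all 2**k input patterns."""
        return cls(k, [func(*bits) for bits in product((0, 1), repeat=k)])

    def read(self, *inputs):
        """Use the input bits as an address and return the stored value."""
        address = 0
        for bit in inputs:
            address = (address << 1) | bit
        return self.contents[address]


def function_count(k):
    """Number of distinct Boolean functions of k inputs."""
    return 2 ** (2 ** k)


# The XOR example: contents 0110 for the addresses 00, 01, 10, 11.
xor_lut = LUT(2, [0, 1, 1, 0])

# Reconfiguring means nothing more than storing a different bit pattern.
and_lut = LUT.from_function(2, lambda a, b: a & b)
```

Each of the $2^k$ locations can independently hold 0 or 1, which is where the count of $2^{2^k}$ functions comes from: 16 for a 2-input LUT and 65536 for a 4-input LUT.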

FPGA Function generators

LUT-Realization

A LUT is basically a multiplexer that evaluates the truth table stored in the configuration SRAM cells (it can be seen as a one-bit-wide ROM).

[Figure: multiplexer addressed by A0 to A3 that reads the configuration SRAM cells (config data, config enable, Vdd); the SRAM cells are optimized for lowest power, the multiplexer is a tradeoff between power and speed; a configuration cell selects between the LUT output F and the state flip-flop output Q for the connection to the switch matrix]

Source: All images from Xilinx, Inc. Virtex-II Platform FPGA Handbook.

Configuration examples


A3 A2 A1 A0

0: 0 0 0 0 0 0 0 0 0

1: 0 0 0 1 0 0 0 1 0

2: 0 0 1 0 0 0 1 0 0

3: 0 0 1 1 0 1 1 1 0

4: 0 1 0 0 0 1 0 0 0

5: 0 1 0 1 0 1 0 1 0

6: 0 1 1 0 0 1 1 0 0

7: 0 1 1 1 0 1 1 1 0

8: 1 0 0 0 0 0 0 0 1

9: 1 0 0 1 0 0 0 1 1

A: 1 0 1 0 0 0 1 0 1

B: 1 0 1 1 0 1 1 1 1

C: 1 1 0 0 0 1 1 0 1

D: 1 1 0 1 0 1 1 1 1

E: 1 1 1 0 0 1 1 0 1

F: 1 1 1 1 1 1 1 1 1

FPGA Function generators

LUT Example: Implement the function

F = ABD + BCD + ABC

using:
• 2-input LUTs
• 3-input LUTs
• 4-input LUTs

[Figure: realizations of F with 2-input LUTs (intermediate signals AB, DB, CD), 3-input LUTs (intermediate signals ABD, BCD, ABC) and 4-input LUTs]

FPGA Fabric

• On real FPGAs: a cluster of LUTs per switch matrix (e.g., eight LUTs and a switch matrix form a configurable logic block on Xilinx FPGAs)

[Figure: possible configuration of an FPGA fabric: LUTs with flip-flops and clock inputs, switch matrices built from switch matrix multiplexers, I/O elements (enable, data_in, data_out) at the I/O pads, and configuration SRAM cells]

FPGA Routing

• The routing is implemented using pass transistor logic
• If we only count transistors, the configuration SRAM cells cost more transistors than the routing logic

[Figure: a) LUT with flip-flop, b) switch matrix multiplexer controlled by configuration bits]

The ACTEL ACT3 Family (row-based)

Row-based FPGA
• Module rows separated by routing channels
• MUX-based macro cells
– C-Module: 4x1 MUX + 1 OR + 1 AND
– S-Module: 4x1 MUX + 1 OR + 1 AND + 1 flip-flop
• I/O placed at the sides of the device

Source: All images from Microsemi Corp. Accelerator Series FPGAs – ACT3 Family, 2012.

The ACTEL ACT3 Family (row-based)

Channels are composed of several segmented routing tracks
• Minimum length = module pair width
• Maximum length = row width
• Long segment if segment width > 3

Connections are anti-fuse based
• Horizontal-to-vertical (XF)
• Horizontal-to-horizontal (HF)
• Vertical-to-vertical (VF)
• Fast vertical connection (FF)

Tracks for module inputs are segmented by pass transistors (inactive during normal operation)

Vertical inputs span the channels above and below

Source: All images from Microsemi Corp. Accelerator Series FPGAs – ACT3 Family, 2012.

The ACTEL ACT3 Family (row-based)


Module outputs have dedicated channels which extend vertically to two channels above and two channels below, except at the bottom and the top

Source: Microsemi Corp. Accelerator Series FPGAs – ACT3 Family, 2012.

The Xilinx Virtex Family (symmetrical array)

Symmetrical-array based FPGA

Macro cells are configurable logic blocks (CLBs), placed at the row-column intersections.

Additional modules exist:
• Block RAM for internal use
• Digital clock manager (DCM) for user-specific clock frequency generation
• Embedded multipliers or DSP units
• Global clock multiplexers
• Input/output blocks (IOBs) for off-chip communication

Source: Xilinx, Inc. Virtex-II Platform: Complete Datasheet.

Macro cells are CLBs. A CLB contains 2 identical slices on Virtex 6

[Figure: bottom-left corner of the FPGA; one variant in which the 2 slices are split into two columns of 1 slice each, and one in which the 2 slices are combined in one column]

Source: All images from Xilinx, Inc. Virtex-II Platform: Complete Datasheet.

On Virtex 6, 1 slice contains:
• 4x 6-input LUT
• 8x FF for storing LUT results
• MUX to feed the LUT output either to an FF or to the output
• Carry in and carry out, which help to construct fast adder circuits using neighbouring CLBs

Source: Xilinx, Inc. Virtex-II Platform: Complete Datasheet.

The Xilinx Virtex Family (symmetrical array)

A CLB accesses the general routing matrix via a switch matrix

Fast connection lines are used for local connections

A switch matrix connects the CLB terminals to the routing resources using multiplexers

4 horizontal resources per CLB for on-chip tri-state buses

Each CLB has two tri-state drivers (TBUFs) that can drive on-chip buses

Each TBUF has its own control pin and its own input pin

Newer Virtex devices use AND-OR based logic for buses, i.e., timing is more predictable

Source: All images from Xilinx, Inc. Virtex-II Platform: Complete Datasheet.

The Xilinx Virtex Family (symmetrical array)

IOB for off-chip communication

Programmability allows the use of an IOB by any CLB.

Connection can be input, output or bidirectional.

6 IOB latches for double data rate (DDR) transmission.

One of the DDR registers can be used as input, output or tri-state.

DDR accomplished by the two registers on each path clocked by rising or falling edge from different clock nets.

The two clock signals generated by the DCM.

Source: Xilinx, Inc. Virtex-II Platform: Complete Datasheet.

The Actel ProAsic Family (sea-of-gates)

Sea-of-gates style (sea-of-tiles)

Macro cells are EEPROM-based tiles

Four levels of hierarchical routing resources:
• Local resources connect a tile to one of its 8 neighbours
• Long-line resources provide routing for long distances and high fan-out (spanning 1, 2 or 4 tiles) and run both horizontally and vertically
• Very-long-line resources span the entire device
• Global network (clocks, reset)

Source: All images from Microsemi Corp. ProAsicPLUS Flash Family FPGAs Datasheet.

The Altera Flex family (hierarchical)

Hierarchical FPGA

Logic elements (LEs) are grouped into logic array blocks (LABs) on the higher level
• 10 LE / LAB for the FLEX8000
• LABs arranged as an array on the device

An LE contains:
• 1 4-input LUT
• 1 FF
• carry-in, carry-out
• MUX
• additional logic

Source: All images from Altera, Inc. FLEX 10k Embedded Programmable Logic Device Family Datasheet.

The Altera Flex family (hierarchical)

FastTrack interconnect provides on-chip routing resource

Connections among LEs and adjacent LABs via local interconnect signals

Connection inside each row of LAB is done by a dedicated row interconnect

Each column of LAB is served by a dedicated column interconnect.

LEs can drive the row or column channels

Column interconnect can drive row interconnect.

A signal from the column interconnect must be routed to the row interconnect before entering an LAB

LEs can drive global signals (clocks, reset, asynchronous clear, high fanout, etc.)

Source: Altera, Inc. FLEX 10k Embedded Programmable Logic Device Family Datasheet.

The Altera Flex family (hierarchical)

Programmable IO Element (IOE) allows on-chip and off-chip programmable communication

An IOE can be programmed as input, output or bidirectional.

IOE receives data from adjacent interconnect (can be driven by row or column interconnect)

IOE receives its chip enable (ce) from an adjacent LE.

One pin per output element (OE) -> possible open drain emulation

Open drain emulation is provided by:

Driving the data input low

Toggling the OE of each IOE

Source: Altera, Inc. FLEX 10k Embedded Programmable Logic Device Family Datasheet.
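The open-drain emulation described above (data input tied low, output enable toggled) can be modelled behaviourally. This is a sketch under the assumption that an external pull-up resistor holds the line high whenever no driver is enabled; the function names are illustrative and do not correspond to any Altera primitive.

```python
def tristate_pad(data, oe, pull_up=1):
    """Level seen on a pad: the driven value if enabled, else the pull-up level."""
    return data if oe else pull_up


def open_drain_pad(value):
    """Emulate an open-drain output with a tri-state driver.

    The data input is held at 0. The output enable is asserted only when
    a 0 should appear on the line; for a 1 the driver is released.
    """
    return tristate_pad(data=0, oe=(value == 0))


def shared_line(values):
    """Wired-AND of several open-drain drivers on one pulled-up line."""
    # Any enabled driver pulls the line low; otherwise the pull-up wins.
    return 0 if any(v == 0 for v in values) else 1
```

Because no driver ever actively drives a 1, several outputs can share one line without contention, and the line carries the AND of all requested values.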

Hybrid FPGAs

The Xilinx Virtex 5 FXT

Basic structure: Virtex 5

Additional features:
• Up to 2 hard-core embedded IBM PowerPC 440 RISC processors running at up to 400 MHz
• DSP48E slices with a 25 x 18 two's complement multiplier and MAC unit
• Dual-ported RAM
• Integrated Endpoint Block for PCI Express compliance
• Embedded high-speed serial RocketIO multi-gigabit transceivers

Source: Xilinx Inc.: Virtex-II Pro™ Platform FPGAs: Functional Description

Hybrid FPGAs

The Altera Excalibur

Specific features:

One ARM922T 32 bit RISC processor running at 200 MHz

Embedded multipliers

Internal single and dual-ported RAM and SDRAM controller

Expansion bus interface for Flash-RAM connection

Embedded SignalTap logic analyzer


Source: Altera, Inc. Excalibur Device Overview Datasheet.

Tabula Spacetime Architecture

Tabula's 3D Architecture
• 8 configuration planes
• Reconfiguration @ 1.6 GHz
• Within-netlist reconfiguration (uses forwarding registers called "time via")

[Figure: 8 folds @ 1.6 GHz; 200 MHz user clock; 400 MHz user clock]

Source: T. R. Halfhill, Tabula's Time Machine. In Microprocessor Report, Reed Electronic Group, 2010.

Tabula Spacetime Architecture

Example: 32-bit adder; a) conventional, b) time-multiplexed

[Figure: a) conventional ripple-carry adder built from full adders (FA) with inputs a0 to a15 and b0 to b15, carry-in '0' and outputs s0 to s15; b) time-multiplexed adder in which the basic logic elements are reused across the micro configurations mc1 to mc4 (axes x, y, t), with a time forwarding latch between micro configurations]

Source: D. Koch, Partial Reconfiguration on FPGAs: Architecture, Tools and Applications, Springer, 2013.

Tabula Spacetime Architecture

Example: 32-bit adder; c) scheduling

Each micro configuration has its own set of SRAM cells (which requires area on the die; savings are possible in the logic)

Rapid reconfiguration consumes power (millions of configuration bits)

Better suited to the latest CMOS processes (where static power dominates dynamic power)

[Figure: c) schedule over clk: configuration and execution of mc1 to mc4, one per microcycle]

Source: D. Koch, Partial Reconfiguration on FPGAs: Architecture, Tools and Applications, Springer, 2013.
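The idea of time-multiplexing an adder can be illustrated in software. The sketch below is not Tabula's implementation: it assumes a 16-bit adder folded into 4 micro configurations of 4 full adders each, with the carry handed from one micro configuration to the next (the role of the time forwarding latch). The function names, the default widths and the assumption that the user clock equals the fabric clock divided by the number of folds are illustrative.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return s, carry_out


def add_time_multiplexed(a, b, width=16, folds=4):
    """Add two width-bit numbers by reusing width // folds full adders.

    Each loop iteration over mc stands for one micro configuration; the
    variable carry survives between iterations like a forwarding latch.
    """
    bits_per_fold = width // folds
    carry = 0
    result = 0
    for mc in range(folds):
        for i in range(bits_per_fold):
            pos = mc * bits_per_fold + i
            s, carry = full_adder((a >> pos) & 1, (b >> pos) & 1, carry)
            result |= s << pos
    return result, carry


def user_clock_mhz(fabric_clock_mhz, folds):
    """User-visible clock if one user cycle takes `folds` fabric cycles."""
    return fabric_clock_mhz / folds
```

With a 1.6 GHz fabric clock, 8 folds give a 200 MHz user clock; under the same assumption, 4 folds would give 400 MHz. The area saving comes from needing only a quarter of the full adders, at the price of extra configuration storage per micro configuration.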

