Reconfigurable Computing

Reconfigurable Architectures

Chapter 3.2

Prof. Dr.-Ing. Jürgen Teich

Lehrstuhl für Hardware-Software-Co-Design


Coarse-Grained Reconfigurable Devices


Recall:

1. Brief historical development (Estrin Fix-Plus and Rammig machine)

2. Programmable Logic

1. PALs and PLAs

2. CPLDs

3. FPGAs

1. Technology

2. Architecture by means of an example

1. Actel

2. Xilinx

3. Altera


Once again: General purpose vs Special purpose

With LUTs as function generators, FPGAs can be seen as general-purpose devices.

Like any general-purpose device, they are flexible but often inefficient.

Flexible because any n-variable Boolean function can be implemented using an n-input LUT.

Inefficient because complex functions must be spread over many LUTs at different locations. The LUTs are connected through the routing matrix, which increases signal delays.

A LUT implementation is usually slower than direct wiring.
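
The flexibility claim can be illustrated with a minimal Python sketch of an n-input LUT: the configuration is simply the truth table, and the inputs act as the address into it. The helper names `make_lut` and `lut_eval` and the majority function are assumptions chosen for illustration only; they do not come from any vendor tool.

```python
from itertools import product


def make_lut(func, n):
    """Build the configuration (truth table) of an n-input LUT for func.

    Entry k holds func applied to the bits of k, first input = most
    significant address bit.
    """
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=n)]


def lut_eval(table, *inputs):
    """Evaluate the LUT: the inputs form the address into the truth table."""
    address = 0
    for bit in inputs:
        address = (address << 1) | (bit & 1)
    return table[address]


# Hypothetical 3-input function, used only as an example configuration.
majority = make_lut(lambda a, b, c: (a & b) | (a & c) | (b & c), 3)
```

Any other 3-variable function is obtained by loading a different 8-entry table into the same structure, which is the sense in which the LUT is general purpose.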


Once again: General purpose vs Special purpose

Example: Implement the function F, a sum of three product terms over the inputs A, B, C and D, using 2-input LUTs.

LUTs are grouped in logic blocks (LBs), with two 2-input LUTs per LB.

Connections inside an LB are efficient (direct).

Connections between LBs are slow (they pass through the connection matrix).

[Figure: F = ABD + ACD + ABC (complement bars over individual literals are not recoverable from the transcript) mapped onto 2-input LUTs grouped in logic blocks, with signals between blocks routed through the connection matrix.]

Once again: General purpose vs Special purpose

Idea: Implement frequently used blocks as hard-core modules in the device.

[Figure: the example function F with inputs A, B, C and D; the product terms are realized as hard-core blocks placed next to the connection matrix.]

Coarse grained reconfigurable devices

Overcome the inefficiency of FPGAs by providing coarse grained functional units (adders, multipliers, integrators, etc.), efficiently implemented

Advantage: Very efficient in terms of speed (no need for connections over connection matrices for basic operators)

Advantage: Direct wiring instead of LUT implementation

A coarse grained device is usually an array of identical, programmable processing elements (PEs), each capable of executing a few operations such as addition and multiplication.

Depending on the manufacturer, the functional units communicate via buses or can be directly connected using programmable routing matrices.


Coarse grained reconfigurable devices

Memory exists between and inside the PEs.

Other functional units are provided, depending on the manufacturer.

A PE is usually a tiny 8-bit, 16-bit or 32-bit ALU which can be configured to execute only one operation during a given period (until the next configuration).

Communication among the PEs can be either packet oriented (on buses) or point-to-point (using crossbar switches).

Since each vendor has its own implementation approach, these devices are studied by means of a few examples: PACT XPP, Quicksilver ACM, NEC DRP, and TCPA.
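
The notion of a PE that executes a single operation until it is reconfigured can be sketched as follows. This is a toy model under stated assumptions: the class name `ProcessingElement`, the operation set and the default 16-bit width are illustrative and do not describe any of the devices listed above.

```python
import operator


class ProcessingElement:
    """Toy coarse-grained PE: a small ALU fixed to one operation until
    it is reconfigured."""

    # Illustrative operation set; real devices offer their own.
    OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

    def __init__(self, width=16):
        self.mask = (1 << width) - 1  # results wrap to the word width
        self.op = None

    def configure(self, op_name):
        """Load a new configuration: select the single active operation."""
        self.op = self.OPS[op_name]

    def execute(self, a, b):
        """Apply the configured operation to two operands."""
        if self.op is None:
            raise RuntimeError("PE has not been configured")
        return self.op(a, b) & self.mask


pe = ProcessingElement(width=8)
pe.configure("add")
wrapped_sum = pe.execute(200, 100)  # 300 wraps to 44 in 8 bits
```

Between two calls to `configure`, every call to `execute` performs the same operation, which mirrors the behavior described in the slide.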


The PACT XPP – Overall structure

XPP (Extreme Processing Platform) is a hierarchical structure consisting of:

• An array of Processing Array Elements (PAEs) grouped in clusters called Processing Arrays (PAs)
• Processing Array Clusters (PACs), where PAC = Processing Array (PA) + Configuration Manager (CM)
• A hierarchical configuration tree
– Local CMs manage the configuration at the PA level
– The local CMs access the local configuration memory, while the supervisor CM (SCM) accesses external memory and supervises the whole configuration process on the device

Source: V. Baumgarten et al., PACT XPP: A Self-Reconfigurable Data Processing Architecture, Journal of Supercomputing, 2003.

Source: PACT XPP Technologies

The PACT XPP – The Processing Array Elements

• A communication network
• Memory elements beside the PACs
• A set of I/Os

The PAE: two types of PAE

• The ALU PAE
• The RAM PAE

The ALU PAE:

• Contains an ALU which can be configured to perform basic operations
• Back Register (BREG) provides routing channels for data and events from bottom to top
• Forward Register (FREG) provides routing channels from top to bottom

[Figure: The ALU PAE]

Source: PACT XPP Technologies, XPP-III Processor Overview, 2012.

The PACT XPP - The Processing Array Elements

• DataFlow Register (DF-REG) can be used at the object outputs for buffering data
• Input registers can be preloaded with configuration data

The RAM PAE:

1. Differs from the ALU-PAE only in its function: instead of an ALU, a RAM-PAE contains a dual-ported RAM.
2. Useful for data storage
3. Data is written or read after an address has been received at the RAM inputs
4. BREG, FREG, and DF-REG of the RAM-PAE have the same function as in the ALU-PAE

[Figure: The RAM PAE]

Source: PACT XPP Technologies, XPP-III Processor Overview, 2012.

The PACT XPP - Routing

Routing in PACT XPP: two independent networks

• One for data transmission
• The other for event transmission

A configuration bus exists besides the data and event networks (very little information is available about the configuration bus).

All objects can be connected to horizontal routing channels using switch-objects.

Vertical routing channels are provided by the BREG and FREG:

• BREGs route from bottom to top
• FREGs route from top to bottom

[Figure: Horizontal routing channels and vertical routing channels]

Source: PACT XPP Technologies, XPP-III Processor Overview, white paper, 2012.

The PACT XPP - Interface

Interfaces are available inside the chip.

The number and type of interfaces vary from device to device.

On the XPP42-A1, there are 6 internal interfaces:

• 4 identical general-purpose I/O on-chip interfaces (bottom left, upper left, upper right, and bottom right)
• One configuration manager
• One JTAG (Joint Test Action Group, IEEE Standard 1149.1) boundary-scan interface for testing purposes (not shown in the picture)

[Figure: Interfaces]

Source: PACT XPP Technologies

The PACT XPP - Interface

The I/O interfaces can operate independently of each other. There are two operation modes:

• The RAM mode
• The streaming mode

RAM mode:

• Each port can access external static RAM (SRAM).
• Control signals for the SRAM transactions are available.
• No additional logic is required.

Source: PACT XPP Technologies

The PACT XPP - Interface

Streaming mode:

1. For high-speed streaming of data to and from the device
2. Each I/O element provides two bidirectional ports for data streaming
3. Handshake signals are used to synchronize data packets with the external port
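
The role of the handshake in streaming mode can be sketched with a generic valid/ready model. The slides do not name the actual XPP handshake signals or their timing, so the `valid` and `ready` names and the one-packet-per-cycle behavior below are assumptions made purely for illustration.

```python
def stream(packets, consumer_ready):
    """Transfer packets over a handshaked port.

    A packet moves only in a cycle where the sender has data (valid) and
    the receiver signals ready. consumer_ready is a per-cycle sequence of
    booleans. Returns (received_packets, cycles_used).
    """
    received = []
    pending = list(packets)
    cycles = 0
    for ready in consumer_ready:
        if not pending:
            break
        cycles += 1
        valid = True  # the sender has data as long as pending is non-empty
        if valid and ready:
            received.append(pending.pop(0))
    return received, cycles


# The receiver stalls in the second cycle, so three packets need four cycles.
result = stream([1, 2, 3], [True, False, True, True, False])
```

No packet is lost or duplicated when the receiver stalls, which is the synchronization property the handshake provides.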

Source: PACT XPP Technologies

The Quicksilver ACM - Architecture

Structure: fractal-like

• Hierarchical groups of four nodes with full communication among the nodes
• 4 lower-level nodes are grouped in a higher-level node
• The lowest level consists of 4 heterogeneous processing nodes
• The connection is done in a Matrix Interconnect Network (MIN)
• A system controller
• Various I/O
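
The grouping of four nodes per level can be sketched as a 4-ary tree. In the sketch below the number of levels, the node numbering and the function names are assumptions for illustration; the slides do not state how many levels a particular ACM chip has.

```python
def node_count(levels, fanout=4):
    """Number of lowest-level nodes in a hierarchy with the given depth."""
    return fanout ** levels


def path_to_node(index, levels, fanout=4):
    """Child chosen at each level, from the root down to the node."""
    return [(index // fanout ** level) % fanout
            for level in reversed(range(levels))]


def levels_to_climb(a, b, fanout=4):
    """Levels two nodes must climb before they share a common group."""
    climb = 0
    while a != b:
        a //= fanout
        b //= fanout
        climb += 1
    return climb


total_nodes = node_count(3)            # three levels of groups of four
route = path_to_node(27, 3)            # 27 = 1*16 + 2*4 + 3
same_group = levels_to_climb(0, 3)     # siblings in the lowest group
far_apart = levels_to_climb(0, 4)      # neighbors in adjacent groups
```

Nodes in the same lowest-level group communicate within one level of the hierarchy, while nodes in different groups have to go through a higher level.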

Source: B. Plunkett et al., Adapt2400 ACM architecture overview, QuickSilver Technology, Inc., 2004.

The Quicksilver ACM – The processing node

An ACM processing node consists of:

• An algorithmic engine. It is unique to each node type and defines the operation performed by the node.
• The node memory for data storage at the node level.
• A node wrapper which is common to all nodes. It is used to hide the complexity of the heterogeneous architecture.

Source: B. Plunkett et al., Adapt2400 ACM architecture overview, QuickSilver Technology, Inc., 2004.

The Quicksilver ACM – The processing node

Four types of nodes exist:

• The Programmable Scalar Node (PSN) provides a standard 32-bit RISC architecture with 32-bit general-purpose registers
• The Adaptive Execution Node (AXN) provides variable-size MAC and ALU operations
• The Domain Bit Manipulation (DBM) node provides bit manipulation and byte-oriented operations
• The External Memory Controller node provides DDR RAM, SRAM, and memory random-access DMA control interfaces

[Figure: ACM PSN-Node]

Source: B. Plunkett et al., Adapt2400 ACM architecture overview, QuickSilver Technology, Inc., 2004.

The Quicksilver ACM – The processing node

[Figure: ACM DBM-Node and ACM AXN-Node]

Source: B. Plunkett et al., Adapt2400 ACM architecture overview, QuickSilver Technology, Inc., 2004.

The Quicksilver ACM – The processing node

The node wrapper envelops the algorithmic engine and presents an identical interface to neighbouring nodes. It features:

1. A MIN interface to support communication among nodes via the MIN
2. A hardware task manager for task management at the node level
3. A DMA engine
4. Dedicated I/O circuitry
5. Memory controllers
6. Data distributors and aggregators

[Figure: The ACM Node Wrapper]

Source: B. Plunkett et al., Adapt2400 ACM architecture overview, QuickSilver Technology, Inc., 2004.

The Quicksilver ACM - The MIN

The Matrix Interconnect Network (MIN) is the communication medium in an ACM chip.

1. Hierarchically organized. The MIN at a given level connects many lower-level MINs
2. The MIN-Root is used for:
   1. Off-chip communication
   2. Configuration
3. Supports communication among nodes
4. Provides services such as point-to-point dataflow streaming, real-time broadcasting, DMA, etc.

[Figure: Example of ACM chip configuration]

Source: B. Plunkett et al., Adapt2400 ACM architecture overview, QuickSilver Technology, Inc., 2004.

The Quicksilver ACM - The System Controller

The system controller is in charge of system management:

• Loads tasks into the nodes' ready-to-run queues for execution
• Statically or dynamically sets up the communication channels between the processing nodes
• Carries out the reconfiguration of nodes on a clock cycle-by-clock cycle basis

The ACM chip features a set of I/O interface controllers such as:

• PCI
• PLL
• SDRAM and SRAM

[Figure: The system controller and the interface controllers]

Source: B. Plunkett et al., Adapt2400 ACM architecture overview, QuickSilver Technology, Inc., 2004.

The NEC DRP – Architecture

The NEC Dynamically Reconfigurable Processor (DRP) consists of:

• A set of byte-oriented processing elements (PEs)
• A programmable interconnection network for communication among the PEs
• A sequencer, which can be programmed as a finite state machine (FSM) to control the reconfiguration process
• Memory around the device for storing configuration and computation data
• Various interfaces

Source: C. Bobda, Introduction to Reconfigurable Computing, Springer, 2007. Original image adapted from M. Motomura: A dynamically reconfigurable processor architecture, Microprocessor Forum, 2002.

The NEC DRP - The Processing Element

• ALU: ordinary byte arithmetic/logic operations
• DMU (data management unit): handles byte select, shift, mask, constant generation, etc., as well as bit manipulations
• An instruction dictates ALU/DMU operations and inter-PE connections
• Source/destination operands can be from/to either
– the PE's own register file
– other PEs (i.e., flow through)
• The instruction pointer (IP) is provided by the STC (state transition controller)

Adapted from: M. Suzuki et al., Stream Applications on the Dynamically Reconfigurable Processor, International Conference on Field-Programmable Technology, IEEE, 2004.

The NEC DRP – Reconfiguration Process

• The instruction pointer (IP) from the STC identifies a datapath plane
• Spatial computation using a customized datapath plane
• When the IP changes, the datapath plane switches instantaneously
• Taken together, the PE instructions behave like an extreme VLIW
• Sequencing through instructions => dynamic reconfiguration
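
The idea that a single instruction pointer selects one datapath plane across all PEs can be sketched as follows. This is a simplified model under stated assumptions: the `PE` class, the two stored instructions per PE and the operand values are hypothetical, and inter-PE connections are left out.

```python
import operator


class PE:
    """A PE stores several instructions; the instruction pointer picks one."""

    def __init__(self, instructions):
        self.instructions = instructions  # list of two-operand callables

    def run(self, ip, a, b):
        return self.instructions[ip](a, b)


def run_plane(pes, ip, operands):
    """Apply the instruction selected by ip in every PE.

    All PEs receive the same ip, so one value of ip defines one complete
    datapath plane.
    """
    return [pe.run(ip, a, b) for pe, (a, b) in zip(pes, operands)]


# Hypothetical array of two PEs, each holding two instructions.
pes = [PE([operator.add, operator.sub]), PE([max, min])]
operands = [(7, 3), (7, 3)]

plane0 = run_plane(pes, 0, operands)  # add / max plane
plane1 = run_plane(pes, 1, operands)  # sub / min plane
```

Changing `ip` from 0 to 1 swaps the function of every PE at once, which corresponds to switching the datapath plane in the description above.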

[Figure: Multiple datapath planes (AES, 3DES, MD5, SHA-1, Compress) between Data In and Data Out; the control block selects the task by descriptor and switches planes by dynamic reconfiguration.]

The NEC DRP – Reconfiguration Process

[Figure: PE array in which each PE contains an ALU, a DMU and several stored instructions (0, 1, 2); with IP = "1", the PEs are configured as Add, Sel and Cmp operators.]

1. Identify the instruction to be executed
2. Decode the instruction in the ALU plane
3. Configure the ALU plane according to the instruction

Tightly Coupled Processor Arrays (TCPA)

• Processor elements (PEs) with VLIW (very long instruction word) architecture
• Weakly programmable
– Small local instruction memory
– Limited, parametrizable instruction set focused on digital signal processing
• Data-flow-oriented control path, no global address space, data streaming over the processing field
• Regular interconnect network
• Application areas: digital signal processing, e.g., mobile communication, HDTV, multimedia, ...


Tightly-Coupled Processor Arrays (TCPA)

Source: D. Kissler et al., A Dynamically Reconfigurable Weakly Programmable Processor Array Architecture Template, International Workshop on Reconfigurable Communication Centric System-on-Chips (ReCoSoC), 2006.

TCPA – Interconnect Network

• Basic structure: grid
• Dynamically reconfigurable
• By using a bypass, more than one hop is possible in a single clock cycle
• The interconnect wrapper is responsible for switching

Adapted from: D. Kissler et al., A Dynamically Reconfigurable Weakly Programmable Processor Array Architecture Template, International Workshop on Reconfigurable Communication Centric System-on-Chips (ReCoSoC), 2006.

TCPA – Network Example – 4D Hypercube


TCPA – Network Example – 2D Torus
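
The two example topologies, 4D hypercube and 2D torus, can be characterized by their neighbor relations. The sketch below uses the standard textbook definitions; the node numbering and function names are assumptions, and it says nothing about how the TCPA interconnect wrappers are actually configured to realize these networks.

```python
def hypercube_neighbors(node, dims=4):
    """Neighbors of a hypercube node differ in exactly one address bit."""
    return sorted(node ^ (1 << d) for d in range(dims))


def torus_neighbors(x, y, width, height):
    """Left, right, upper and lower neighbors on a grid whose edges wrap."""
    return [
        ((x - 1) % width, y),
        ((x + 1) % width, y),
        (x, (y - 1) % height),
        (x, (y + 1) % height),
    ]


cube = hypercube_neighbors(5)          # node 0101 in a 16-node hypercube
corner = torus_neighbors(0, 0, 4, 4)   # corner PE of a 4 x 4 array
```

In both cases every PE has four neighbors; the torus differs from a plain grid only in the wrap-around links at the array border.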


TCPA – Dynamic Reconfiguration

• Multicast scheme for partial dynamic reconfiguration
• Differential reconfiguration (program/connections) also possible

Source: D. Kissler, Power-Efficient Tightly-Coupled Processor Arrays for Digital Signal Processing, PhD Dissertation, 2012.

24 Core TCPA – Lehrstuhl für Informatik 12

• 24 × 16-bit cores
• Technology
– CMOS 1.0 V
– 9 metal layers
– 90 nm standard cell layout
• FUs/PE
– 2×Add, 2×Mul, 1×Shift, 1×DPU
• Registers/PE: 15
• Instruction memory
– 1024×32 bit = 4 kB
• Clock frequency: 200 MHz
• Peak performance: 24 GOPS
• Power consumption
– 133 mW @ 200 MHz (hybrid clock gating)
• Power efficiency: 180 MOPS/mW
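
The figures above can be cross-checked with a few lines of arithmetic. All input values are taken from the slide; the variable names are illustrative. The derived value of 5 operations per core per cycle is only what the quoted peak performance implies; reading it as the adder, multiplier and shift units issuing in parallel is an assumption, not something the slide states.

```python
CORES = 24
CLOCK_HZ = 200e6
PEAK_OPS = 24e9            # 24 GOPS
POWER_W = 0.133            # 133 mW at 200 MHz
IMEM_WORDS, IMEM_WORD_BITS = 1024, 32

# Instruction memory size in bytes: 1024 words of 32 bits each.
imem_bytes = IMEM_WORDS * IMEM_WORD_BITS // 8

# Operations per core and clock cycle implied by the peak performance.
ops_per_core_per_cycle = PEAK_OPS / (CORES * CLOCK_HZ)

# Power efficiency in MOPS per mW.
mops_per_mw = (PEAK_OPS / 1e6) / (POWER_W * 1e3)
```

The results (4096 bytes, 5 operations per core per cycle, roughly 180 MOPS/mW) agree with the numbers quoted on the slide.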
