SCORES: A Scalable and Parametric Streams-Based Communication Architecture for Modular...

Post on 18-Jan-2018


SCORES: A Scalable and Parametric Streams-Based Communication Architecture for Modular Reconfigurable Systems

Abelardo Jara-Berrocal, Ann Gordon-Ross

NSF Center for High-Performance Reconfigurable Computing (CHREC)

Department of Electrical and Computer Engineering

University of Florida

2 of 16

Introduction – Parallel Computation

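[Figure: three-step design flow.
1. System formulation: high-performance application.
2. Application decomposition: task graph with seven numbered tasks (labels: Source, FIR, FFT, Matrix, IFFT, Angle, Sink). Edges indicate communication volume (edge labels: 4000, 15000, 15000, 82500, 40000, 4000, 15000).
3. Task allocation / system placement: allocation labels "1, 7", "Data", "2, 6", "4", "3, 5" shown over the modules uProc, MEM, DSP1, ASIC, DSP2.]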

To leverage parallel computation speedups, a system can be decomposed into smaller tasks

Parallel communication

How do designers provide efficient module communication?

Problem: Speedup can be limited by inefficient communication!

Profile 1: DSP: 0.5 ms, uProc: 2.2 ms

Profile 2: ASIC: 0.5 ms, DSP: 2.5 ms

3 of 16

Communication Architectures

[Figure: a) Bus connecting uProc, MEM, DSP1, ASIC, DSP2; b) Network-on-Chip connecting the same modules through NoC nodes]

Bus

• Advantages:
– Very well known
– Smaller hardware overhead
– SoC standards: CoreConnect®, AMBA®, Wishbone

• Disadvantages:
– Does not scale well as the number of modules increases
– High power consumption due to long wires
– Cross-talk issues

Network-on-Chip (NoC)

• Advantages:
– Scalable
– Very high bandwidth
– Wires are broken into smaller segments
– Multiple simultaneous parallel communications

• Disadvantages:
– Significant area overhead, exacerbated by store-and-forward routers
– Interfaces between modules and nodes are not standard: specific signals and handshaking protocols for each design

4 of 16

General NoC architecture

• NoC node: routers (packet switching) or switches (circuit switching)
• NoC link
• NoC interface
• NoC topology: varies across designs, commonly 2D mesh or torus [1]

[Figure: modules (MEM, uProc, DSP1, ASIC, DSP2, I/O slave) attached to NoC nodes through NoC interfaces and connected by NoC links]

[1] Salminen et al. Survey of Network-on-Chip Proposals. White Paper. OCP-IP, March 2008

5 of 16

Motivation

• Relevant NoC metrics:
– Throughput
– Latency
– Area
– Power

• 2D mesh NoC:
– High throughput
– Low latency
– High communication parallelism

• Due to these advantages, some commercial 2D NoCs for ASICs have appeared:
– Arteris®

• How about NoC implementations in FPGAs?
– FPGAs are increasingly used in digital designs: reconfigurable, lower cost than ASICs
– NoC area overhead becomes a problem: a 3x3 2D mesh NoC consumed 28.72% of a Xilinx V2P30 [2] (for a maximum throughput of 9.5 Gbps for the complete 3x3 2D NoC)
– The problem is exacerbated on low-capacity, low-cost FPGA devices

[Figure: 3x3 2D mesh NoC with nodes N1 to N9, each with an attached module; Arteris NoC]

[2] B. Sethuraman, P. Bhattacharya, J. Khan, Ranga Vemuri: LiPaR: A light-weight parallel router for FPGA-based networks-on-chip. ACM Great Lakes Symposium on VLSI 2005: 452-457

6 of 16

SCORES - Contributions

• SCORES = Scalable Communication Architecture for Reconfigurable Embedded Systems

• Main contributions:
– High throughput / bandwidth: circuit switching scheme
– Low area overhead: linear topology
– Multiple clock domains
– Scalability: VHDL model with numerous architectural parameters, which allows customization for the communication needs of different SoCs

• Implemented in Xilinx VLX25 FPGA

[Figure: reconfigurable device (FPGA) with Modules 1 to 3 in different clock domains (clk1, clk2, clk3), each attached through an interface to SCORES (scores-clk)]

7 of 16

SCORES – Top Level Design

• SCORES main components:
– Switches: communication nodes inside SCORES
– Interfaces: communication between modules and SCORES
– Channels: communication links between switches and other switches or interfaces

• Modules access interfaces through local input ports and local output ports

[Figure: reconfigurable device (FPGA) with Modules 1 to 3 (clk1, clk2, clk3), each connected through an interface to a switch inside SCORES (clk)]

8 of 16

SCORES – Parametric Architecture

[Figure: Modules 1 to 4, each attached through an interface to a switch inside SCORES]

• kl: number of left switch channels
• kr: number of right switch channels
• ki: number of local input ports to the interface
• ko: number of local output ports from the interface

• Additional parameters:
– N: number of modules
– W: width of a channel in bits

Parameters enable SCORES to conform to custom communication requirements
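As a rough illustration of how the six parameters above fit together, the sketch below collects them in one record. The class name `ScoresParams`, the field names, and the rule that every value must be a positive integer are assumptions made for this example; the slides only say that the parameters live in a VHDL model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoresParams:
    """Hypothetical container for the architectural parameters named on the slide."""
    n_modules: int   # N: number of modules
    width_bits: int  # W: width of a channel in bits
    kl: int          # number of left switch channels
    kr: int          # number of right switch channels
    ki: int          # number of local input ports to the interface
    ko: int          # number of local output ports from the interface

    def __post_init__(self):
        # Assumption: every parameter must be a positive integer.
        for name, value in vars(self).items():
            if not isinstance(value, int) or value < 1:
                raise ValueError(f"{name} must be a positive integer, got {value!r}")


# Switch configuration reported in the results slides: 32-bit channels,
# 2 left and 2 right switch channels, 1 local input and 1 local output port.
# N = 4 matches the four-module figure and is an illustrative choice.
params = ScoresParams(n_modules=4, width_bits=32, kl=2, kr=2, ki=1, ko=1)
print(params)
```

Grouping the values this way makes it easy to sweep one parameter while holding the others fixed, which is how the results slides vary Kr, Kl, Ki, and Ko.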

9 of 16

SCORES – Terminology

• Producer: module which transmits data

• Consumer: module which receives data

• Streaming Data Channel (SDC):
– Dedicated path between a producer and a consumer
– Dynamically created and destroyed inside SCORES
– Bidirectional path: data flows from producer to consumer, and control synchronization signals flow from consumer to producer

[Figure: Modules 1 to 4 with their interfaces; a Streaming Data Channel (SDC) connects a producer to a consumer]

10 of 16

SCORES – Communication Phases

• Three communication phases

• Phase I: Channel establishment
– Producer requests a path to the consumer
– Path is iteratively created inside switches between the producer and the consumer
– If a switch has no available channels, it sends a DENY signal to the producer, and the producer can drop or maintain the request
– If successful, the Streaming Data Channel (SDC) is created between the producer and the consumer

[Figure: Modules 1 to 4 with their interfaces; an SDC being established between a producer and a consumer]
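The hop-by-hop establishment in Phase I can be sketched with a toy software model. Everything here is an assumption made for illustration: the `LinearFabric` class, the per-switch channel counters, and the choice to undo a partial path as soon as a switch answers DENY (the slides say only that the producer may drop or maintain its request). The actual design is a set of hardware FSMs, not this code.

```python
DENY = "DENY"


class LinearFabric:
    """Toy model of a linear chain of switches (illustrative assumptions only)."""

    def __init__(self, n_switches, kl, kr):
        # free_right[i]: unused channels from switch i toward switch i + 1
        # free_left[i]:  unused channels from switch i toward switch i - 1
        self.free_right = [kr] * n_switches
        self.free_left = [kl] * n_switches

    def _free(self, step):
        return self.free_right if step == 1 else self.free_left

    def establish(self, producer, consumer):
        """Reserve one channel per hop; return (step, hops) or DENY."""
        if producer == consumer:
            raise ValueError("producer and consumer must differ")
        step = 1 if consumer > producer else -1
        free = self._free(step)
        hops = []
        for switch in range(producer, consumer, step):
            if free[switch] == 0:
                # No available channel here: undo the partial path
                # (assumed behaviour when the producer drops the request).
                for s in hops:
                    free[s] += 1
                return DENY
            free[switch] -= 1
            hops.append(switch)
        return step, hops

    def release(self, sdc):
        """Give back every channel held by a previously established path."""
        step, hops = sdc
        free = self._free(step)
        for s in hops:
            free[s] += 1


fabric = LinearFabric(n_switches=4, kl=1, kr=1)
first = fabric.establish(0, 2)    # reserves right channels at switches 0 and 1
second = fabric.establish(1, 3)   # switch 1 has no free right channel left
fabric.release(first)
third = fabric.establish(1, 3)    # succeeds once the first path is released
print(first, second, third)
```

With more than one right channel per switch (a larger `kr`), the second request would succeed while the first path is still held, which is the case the parametric design is meant to cover.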

11 of 16

SCORES – Communication Phases

• Phase II: Streaming transmission
– Pipelined operation
– If the consumer buffer is full, the consumer asserts “Full” to inform the producer to pause transmission
– Interfaces are built around asynchronous FIFOs, which eases crossing different clock domains

• Phase III: Channel release
– Producer deasserts its request
– Path between the producer and the consumer is iteratively destroyed

[Figure: Modules 1 to 4 with their interfaces; an SDC between a producer and a consumer, with a register stage on the path]
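The “Full” handshake in Phase II is a form of backpressure, and a small cycle-based simulation can make it concrete. This is a simplified sketch, not the SCORES hardware: the function name `stream`, the single shared cycle counter, and the `consumer_period` parameter standing in for a slower consumer clock domain are all assumptions.

```python
from collections import deque


def stream(words, fifo_depth, consumer_period):
    """Push words through a bounded FIFO drained by a possibly slower consumer.

    Returns (received_words, stalled_cycles), where stalled_cycles counts the
    cycles in which the producer had data but saw "Full" asserted.
    """
    if fifo_depth < 1 or consumer_period < 1:
        raise ValueError("fifo_depth and consumer_period must be >= 1")
    fifo = deque()
    pending = deque(words)
    received = []
    stalled = 0
    cycle = 0
    while pending or fifo:
        # Consumer side: takes one word every `consumer_period` cycles.
        if fifo and cycle % consumer_period == 0:
            received.append(fifo.popleft())
        # "Full" is asserted when the FIFO cannot accept another word.
        full = len(fifo) >= fifo_depth
        if pending:
            if full:
                stalled += 1                    # producer pauses transmission
            else:
                fifo.append(pending.popleft())  # producer sends one word
        cycle += 1
    return received, stalled


data = list(range(6))
fast = stream(data, fifo_depth=2, consumer_period=1)
slow = stream(data, fifo_depth=2, consumer_period=3)
print(fast[1], slow[1])  # stall counts for a matched and a slower consumer
```

In both runs every word arrives in order; only the number of stalled producer cycles changes, which is the point of pausing rather than dropping data.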

12 of 16

SCORES – Simultaneous Data Transfers

[Figure: Switches 1 to 4, each with an attached interface, showing input registers, MUXes, and a free channel]

• Set of FSM controllers running at each switch
• Allows SCORES to establish and operate multiple SDCs in parallel

13 of 16

Results – Clock Frequency

[Figure: three plots of frequency (MHz) versus (a) number of right switch channels (Kr), with 1 left switch channel; (b) number of left and right switch channels (Kr, Kl), with 1 local input and 1 local output port per switch; (c) number of local input and output ports (Ki, Ko) per switch, with 1 left and 1 right switch channel]

• The maximum SCORES throughput is determined by the maximum achieved SCORES clock frequency

Customized SCORES switch with 32-bit channels, 2 left and right switch channels, and 1 local input and 1 local output port operates at 254 MHz (throughput = 8.0 Gbps, post place-and-route timing report).
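The link between clock frequency and throughput can be checked with a short calculation. The sketch assumes that a channel transfers one W-bit word per clock cycle, which the slides do not state explicitly; the function name is also hypothetical.

```python
def channel_throughput_gbps(freq_mhz, width_bits):
    """Throughput of one channel in Gbps, assuming one word per clock cycle."""
    return freq_mhz * 1e6 * width_bits / 1e9


# Reported switch: 254 MHz clock, 32-bit channels.
estimate = channel_throughput_gbps(254, 32)
print(round(estimate, 3))
```

Under that assumption the product of 254 MHz and 32 bits is about 8.1 Gbps, close to the 8.0 Gbps reported on the slide. The slide does not say how its figure was rounded or derived.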

14 of 16

Results - Area

[Figure: three plots of area (slices) versus (a) number of right switch channels (Kr), with 1 left switch channel; (b) number of left and right switch channels (Kr, Kl), with 1 local input and 1 local output port per switch; (c) number of local input and output ports (Ki, Ko) per switch, with 1 left and 1 right switch channel]

Customized SCORES switch with 32-bit channels, 2 left and right switch channels, and 1 local input and 1 local output port consumes 315 slices (1.41% of Virtex 4 VLX25)

15 of 16

Conclusions

• We developed SCORES (Scalable Communication Architecture for Reconfigurable Embedded Systems), a highly parametric communication architecture

• SCORES contributions:
– Low area overhead (315 slices for a 32-bit switch with multiple ports)
– Modules can run at different and independent clock frequencies
– Highly parametric design, which enables architecture optimization

• Future work:
– Optimization of switch FSM controllers
– Development of algorithms for module placement inside SCORES
– Tools for automatic determination of SCORES parameter values

16 of 16

Questions