SCORES: A Scalable and Parametric Streams-Based Communication Architecture for Modular Reconfigurable Systems
Abelardo Jara-Berrocal, Ann Gordon-Ross
NSF Center for High-Performance Reconfigurable Computing (CHREC)
Department of Electrical and Computer Engineering
University of Florida
Introduction – Parallel Computation
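[Figure: three-step design flow. 1. System formulation: a high-performance application. 2. Application decomposition: a task graph of seven tasks, numbered 1 to 7, labeled Source, FIR, FFT, Matrix, IFFT, Angle, and Sink; edges indicate communication volume (values shown: 4000, 15000, 15000, 82500, 40000, 4000, 15000). 3. Task allocation / system placement onto modules: tasks 1 and 7 on uProc, data in MEM, tasks 2 and 6 on DSP1, task 4 on ASIC, tasks 3 and 5 on DSP2.]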
To leverage parallel computation speedups, a system can be decomposed into smaller tasks
Parallel communication
How do designers provide efficient module communication?
Problem: Speedup can be limited by inefficient communication!
Profile 1: DSP: 0.5 ms, uProc: 2.2 ms
Profile 2: ASIC: 0.5 ms, DSP: 2.5 ms
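To make the communication problem concrete, the following minimal Python sketch sums the data volume that has to cross between modules for a given task allocation. The allocation follows the placement row of the slide as it appears to read; the edge list and its volumes are hypothetical placeholders, since the figure's edge endpoints are not recoverable from the text, and the function name `inter_module_traffic` is likewise an assumption.

```python
from collections import defaultdict

# Task-to-module allocation, as the slide's placement row appears to read.
ALLOCATION = {1: "uProc", 7: "uProc", 2: "DSP1", 6: "DSP1",
              4: "ASIC", 3: "DSP2", 5: "DSP2"}

# (producer task, consumer task, volume): HYPOTHETICAL edges and volumes,
# chosen only for illustration and not taken from the slide's task graph.
EDGES = [(1, 2, 100), (2, 3, 200), (3, 4, 300), (4, 5, 400),
         (5, 6, 500), (6, 7, 600), (3, 5, 50)]


def inter_module_traffic(edges, allocation):
    """Return the volume exchanged between each ordered pair of distinct modules."""
    traffic = defaultdict(int)
    for src, dst, volume in edges:
        src_mod, dst_mod = allocation[src], allocation[dst]
        if src_mod != dst_mod:  # edges inside one module need no interconnect
            traffic[(src_mod, dst_mod)] += volume
    return dict(traffic)


traffic = inter_module_traffic(EDGES, ALLOCATION)
total_cross_module = sum(traffic.values())
```

Only the edge between tasks 3 and 5 stays inside a module (both run on DSP2 under this allocation); every other edge loads the communication architecture, which is why an inefficient interconnect can erase the speedup gained from decomposition.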
Communication Architectures
[Figure: a) Bus: uProc, MEM, DSP1, ASIC, and DSP2 share one bus. b) Network-on-Chip: the same modules are connected through NoC nodes.]
Bus
• Advantages
  – Very well known
  – Smaller hardware overhead
  – SoC standards: CoreConnect®, AMBA®, Wishbone
• Disadvantages
  – Does not scale well as the number of modules increases
  – High power consumption due to long wires
  – Cross-talk issues
Network-on-Chip (NoC)
• Advantages
  – Scalable
  – Very high bandwidth
    – Wires are broken into smaller segments
    – Multiple simultaneous parallel communications
• Disadvantages
  – Significant area overhead
    – Exacerbated by store-and-forward routers
  – Interfaces between modules and nodes are not standard
    – Specific signals and handshaking protocols for each design
General NoC architecture
[Figure: general NoC architecture. Modules (uProc, MEM, DSP1, DSP2, ASIC, I/O Slave) attach through NoC interfaces to NoC nodes, which are joined by NoC links.]
• NoC node
  – Routers (packet switching)
  – Switches (circuit switching)
• NoC topology
  – Varies across designs
  – Commonly 2D mesh or torus [1]
[1] Salminen et al. Survey of Network-on-Chip Proposals. White Paper. OCP-IP, March 2008
Motivation
• Relevant NoC metrics:
  • Throughput
  • Latency
  • Area
  • Power
• 2D Mesh NoC
  • High throughput
  • Low latency
  • High communication parallelism
• Due to these advantages, some commercial 2D NoCs for ASICs have appeared:
  • Arteris®
• How about NoC implementations in FPGAs?
  • FPGAs are increasingly used in digital designs
    – Reconfigurable
    – Lower cost than ASICs
  • NoC area overhead becomes a problem
    – A 3x3 2D mesh NoC consumed 28.72% of the area of a Xilinx V2P30 [2] (for a maximum throughput of 9.5 Gbps for the complete 3x3 2D NoC)
  • Problem is exacerbated with low-capacity, low-cost FPGA devices
[Figure: 3x3 2D mesh NoC with nodes N1 to N9, each node attached to a module; labeled "Arteris NoC".]
[2] B. Sethuraman, P. Bhattacharya, J. Khan, Ranga Vemuri: LiPaR: A light-weight parallel router for FPGA-based networks-on-chip. ACM Great Lakes Symposium on VLSI 2005: 452-457
SCORES - Contributions
• SCORES = Scalable Communication Architecture for Reconfigurable Embedded Systems
• Main contributions:
  • High throughput / bandwidth
    – Circuit switching scheme
  • Low area overhead
    – Linear topology
  • Multiple clock domains
  • Scalability
    – VHDL model with numerous architectural parameters
    – Allows customization for the communication needs of different SoCs
[Figure: reconfigurable device (FPGA) with Module 1, Module 2, and Module 3, each attached to SCORES through an interface. SCORES runs on scores-clk while the modules run on clk1, clk2, and clk3 (different clock domains).]
Implemented in Xilinx VLX25 FPGA
SCORES – Top Level Design
• SCORES main components:
  • Switches – communication nodes inside SCORES
  • Interfaces – communication between modules and SCORES
  • Channels – communication links between switches and other switches or interfaces
• Modules access interfaces through local input ports and local output ports
[Figure: reconfigurable device (FPGA) with Module 1, Module 2, and Module 3 (clk1, clk2, clk3), each connected through an interface to a switch inside SCORES (clk).]
SCORES – Parametric Architecture
[Figure: Modules 1 to 4, each attached through an interface to a switch inside SCORES.]
• kl – number of left switch channels
• kr – number of right switch channels
• ko – number of local output ports from the interface
• ki – number of local input ports to the interface
Additional parameters
• N – number of modules
• W – width of a channel in bits
Parameters enable SCORES to conform to custom communication requirements
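The parameter set above can be captured in a small configuration record. The following is a minimal Python sketch, not the authors' VHDL model: the class name `ScoresParams`, its field names, the "at least 1" validation rule, and the `right_link_bits` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoresParams:
    """Hypothetical container for the SCORES architectural parameters."""
    n_modules: int   # N: number of modules
    width_bits: int  # W: width of a channel in bits
    kl: int          # number of left switch channels
    kr: int          # number of right switch channels
    ki: int          # number of local input ports to the interface
    ko: int          # number of local output ports from the interface

    def __post_init__(self):
        # Assumed sanity rule: every parameter must be a positive count.
        for name, value in vars(self).items():
            if value < 1:
                raise ValueError(f"{name} must be at least 1, got {value}")

    @property
    def right_link_bits(self) -> int:
        """Data bits carried by all right-hand channels of one switch."""
        return self.kr * self.width_bits


# Example: 32-bit channels, 2 left and 2 right channels, 1 local port each
# way; N = 4 is taken from the four modules drawn in the figure.
example = ScoresParams(n_modules=4, width_bits=32, kl=2, kr=2, ki=1, ko=1)
```

Keeping the parameters in one immutable record mirrors the idea of a single parametric model that is customized per SoC rather than rewritten.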
SCORES – Terminology
[Figure: Modules 1 to 4 attached to SCORES through interfaces, with a Streaming Data Channel (SDC) drawn from a producer module to a consumer module.]
• Producer: module which transmits data
• Consumer: module which receives data
• Streaming Data Channel (SDC):
  • Dedicated path between a producer and a consumer
  • Dynamically created and destroyed inside SCORES
  • Bidirectional path
    • Data flows from producer to consumer
    • Control synchronization signals flow from consumer to producer
SCORES – Communication Phases
[Figure: Modules 1 to 4 attached to SCORES through interfaces, with a Streaming Data Channel (SDC) drawn from a producer module to a consumer module.]
• Three communication phases
• Phase I: Channel establishment
  • Producer requests a path to the consumer
  • Path iteratively created inside switches between the producer and the consumer
  • If a switch has no available channels
    – Sends a DENY signal to the producer
    – Producer can drop or maintain the request
  • If successful, the Streaming Data Channel (SDC) is created between the producer and the consumer
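Phase I can be illustrated with a minimal Python sketch under stated assumptions: the switches form a linear chain, each link between adjacent switches has one fixed pool of channels (a simplification of the separate kl and kr parameters), and a denied request is dropped so that its partially built path is released. The class `LinearScores`, the method `request`, and the "GRANT" return value are hypothetical names; only DENY comes from the slides.

```python
class LinearScores:
    """Toy model of hop-by-hop channel establishment on a linear chain."""

    def __init__(self, n_switches, channels_per_link):
        # free[i] counts the free channels between switch i and switch i + 1.
        self.free = [channels_per_link] * (n_switches - 1)
        self.sdcs = {}  # (producer, consumer) -> list of reserved links

    def _links(self, producer, consumer):
        """Links crossed when walking from the producer to the consumer."""
        lo, hi = sorted((producer, consumer))
        links = list(range(lo, hi))
        return links if producer < consumer else links[::-1]

    def request(self, producer, consumer):
        """Reserve one channel per hop; return 'GRANT' or 'DENY'."""
        reserved = []
        for link in self._links(producer, consumer):
            if self.free[link] == 0:
                # No channel at this switch: undo the partial path (the
                # producer "drops" the request in this simplified model).
                for done in reserved:
                    self.free[done] += 1
                return "DENY"
            self.free[link] -= 1
            reserved.append(link)
        self.sdcs[(producer, consumer)] = reserved
        return "GRANT"


# With one channel per link, an SDC from switch 0 to 3 blocks a 1 -> 2 request.
single = LinearScores(n_switches=4, channels_per_link=1)
first = single.request(0, 3)
second = single.request(1, 2)

# With two channels per link, both SDCs coexist.
double = LinearScores(n_switches=4, channels_per_link=2)
both = (double.request(0, 3), double.request(1, 2))
```

The second scenario shows why the number of switch channels is exposed as a parameter: adding channels per link trades area for the ability to hold more paths at once.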
SCORES – Communication Phases
[Figure: Modules 1 to 4 attached to SCORES through interfaces, with a Streaming Data Channel (SDC) between a producer and a consumer passing through a register.]
• Phase II: Streaming transmission
  • Pipelined operation
  • If consumer buffer is full
    – Consumer asserts “Full” to inform producer to pause transmission
  • Interfaces built around asynchronous FIFOs
    – Eases crossing different clock domains
• Phase III: Channel release
  • Producer deasserts its request
  • Path between the producer and the consumer is iteratively destroyed
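The "Full" backpressure of Phase II can be sketched as a small cycle-based simulation in Python. This is an illustration under simplifying assumptions, not the SCORES implementation: a single cycle counter stands in for the separate clock domains, the asynchronous FIFO is modeled as a bounded queue, and the names `stream`, `fifo_depth`, and `consumer_period` are hypothetical.

```python
from collections import deque


def stream(words, fifo_depth, consumer_period):
    """Simulate a producer feeding a consumer through a bounded FIFO.

    The producer offers one word per cycle and pauses while the FIFO is
    full; the consumer removes one word every `consumer_period` cycles.
    Returns (received words, producer stall cycles, total cycles).
    """
    fifo = deque()
    pending = deque(words)
    received = []
    stalled_cycles = 0
    cycle = 0
    while pending or fifo:
        # Consumer side: drain one word on its own (slower) schedule.
        if fifo and cycle % consumer_period == 0:
            received.append(fifo.popleft())
        # "Full" is asserted when the buffer has no room left.
        full = len(fifo) >= fifo_depth
        if pending:
            if full:
                stalled_cycles += 1  # producer pauses transmission
            else:
                fifo.append(pending.popleft())
        cycle += 1
    return received, stalled_cycles, cycle


slow_rx, slow_stalls, _ = stream(range(8), fifo_depth=2, consumer_period=2)
fast_rx, fast_stalls, _ = stream(range(8), fifo_depth=2, consumer_period=1)
```

With a consumer that drains every second cycle the producer stalls repeatedly, while a consumer that keeps pace never asserts "Full"; in both cases the words arrive in order, which is the property the handshake is meant to preserve.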
SCORES – Simultaneous Data Transfers
[Figure: Switch 1 to Switch 4, each with an attached interface, input registers, and MUXes; a free channel is marked.]
• Set of FSM controllers running at each switch
• Allows SCORES to establish and operate multiple SDCs in parallel
Results – Clock Frequency
[Figure: three plots of frequency (MHz) versus:
• Number of right switch channels (Kr) (1 left switch channel)
• Number of left and right switch channels (Kr, Kl) (1 local input and 1 local output port per switch)
• Number of local input and output ports (Ki, Ko) per switch (1 left and 1 right switch channel)]
• The achieved maximum SCORES clock frequency determines the maximum SCORES throughput
• A customized SCORES switch with 32-bit channels, 2 left and right switch channels, and 1 local input and 1 local output port operates at 254 MHz (throughput = 8.0 Gbps, post place-and-route timing report).
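The link between clock frequency and throughput can be checked with a short calculation. The sketch below assumes that a channel transfers one W-bit word per clock cycle, which is consistent with pipelined streaming but is not stated on the slide; the function name `channel_throughput_gbps` is hypothetical.

```python
def channel_throughput_gbps(freq_mhz, width_bits):
    """Upper bound on one channel's throughput, assuming one word per cycle."""
    return freq_mhz * 1e6 * width_bits / 1e9


# Reported configuration: 32-bit channels at 254 MHz.
estimate = channel_throughput_gbps(254, 32)
```

Under this assumption the product is about 8.1 Gbps, close to the 8.0 Gbps reported from the post place-and-route timing report.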
Results - Area
[Figure: three plots of area (slices) versus:
• Number of right switch channels (Kr) (1 left switch channel)
• Number of left and right switch channels (Kr, Kl) (1 local input and 1 local output port per switch)
• Number of local input and output ports (Ki, Ko) per switch (1 left and 1 right switch channel)]
• A customized SCORES switch with 32-bit channels, 2 left and right switch channels, and 1 local input and 1 local output port consumes 315 slices (1.41% of Virtex 4 VLX25)
Conclusions
• We developed SCORES (Scalable Communication Architecture for Reconfigurable Embedded Systems), a highly parametric communication architecture
• SCORES contributions:
  – Low area overhead (315 slices for a 32-bit switch with multiple ports)
  – Modules can run at different and independent clock frequencies
  – Highly parametric design, which enables architecture optimization
• Future work
  – Optimization of switch FSM controllers
  – Development of algorithms for module placement inside SCORES
  – Tools for automatic determination of SCORES parameter values
Questions