
High-Level Interconnect Architectures for FPGAs

Nick Barrow-Williams

Introduction

Continued shrinking of device dimensions introduces new design challenges

Moving data around a chip can now be the limiting factor of performance

Existing interconnection solutions do not scale well


Why do existing solutions not scale?

Global connections are longer

Wire depth is increased to counter the decrease in width

Parasitic capacitance increases, slowing signal propagation


Why do existing solutions not scale?

Existing system-level connections use buses

Buses increase resource efficiency and decrease wiring congestion

Not suitable for a large number of modules

A network-based alternative would offer higher aggregate bandwidth


Why design for FPGA systems?

FPGA silicon area already dominated by wiring

Global wires are limited in number

Increasing gate count only increases wiring congestion


The Solution: Network-on-Chip

Use technologies from network systems

Replace inefficient global wiring with high-level interconnection network

Create scalable systems to handle large numbers of modules


Existing Solutions

Most existing systems are for ASIC designs

Stanford Interconnect, RAW, SCALE, SPIN

PNoC: a solution for FPGAs, but complex and with a high hardware cost

Other simulated solutions exist, but few have been implemented


Proposal: Two network systems

Existing solutions use either packet switching or circuit switching techniques

Design, implement, test and synthesise one of each to compare performance and hardware cost

Map solutions to an FPGA platform to evaluate hardware cost in current generation systems


Network Architecture Design

Topology: simple, scalable, two-dimensional

Solution: 2D mesh topology


Network Architecture Design

Routing algorithm options:

Deterministic: data always follows the same path through the network; simple hardware; sensitive to congestion

Adaptive: paths through the network can change according to load; complex hardware; avoids congestion


Network Architecture Design

When choosing a routing algorithm we must avoid:

Deadlock. Solution: use unidirectional wiring and allow each node to make two connections

Livelock. Solution: use deterministic routing
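The slides do not name the deterministic algorithm used; a common choice for a 2D mesh is dimension-ordered (XY) routing. The following is a minimal VHDL sketch of an address-to-port decoder under that assumption, with hypothetical 2-bit coordinates and a five-port (north, east, south, west, local) router, not the project's actual decoder:

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  entity xy_decoder is
    generic (
      X_LOCAL : integer := 0;  -- this router's column
      Y_LOCAL : integer := 0   -- this router's row
    );
    port (
      dest_x   : in  unsigned(1 downto 0);          -- destination column from the header
      dest_y   : in  unsigned(1 downto 0);          -- destination row from the header
      port_req : out std_logic_vector(4 downto 0)   -- one-hot: north, east, south, west, local
    );
  end entity;

  architecture rtl of xy_decoder is
  begin
    -- Dimension-ordered routing: correct X first, then Y, then deliver locally.
    process (dest_x, dest_y)
    begin
      if to_integer(dest_x) > X_LOCAL then
        port_req <= "01000";               -- go east
      elsif to_integer(dest_x) < X_LOCAL then
        port_req <= "00010";               -- go west
      elsif to_integer(dest_y) > Y_LOCAL then
        port_req <= "10000";               -- go north
      elsif to_integer(dest_y) < Y_LOCAL then
        port_req <= "00100";               -- go south
      else
        port_req <= "00001";               -- local ejection port
      end if;
    end process;
  end architecture;

Because the X coordinate is always corrected before the Y coordinate, every packet between a given source and destination follows the same path, which is the deterministic behaviour described above.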


Network Architecture Design

Flow control methods:

Circuit switched: a circuit request propagates through the network, the path is reserved to the destination, a grant signal propagates back, and the data is sent before the circuit is deallocated

Packet switched: packets carry a header, body and tail; wormhole routing forwards the header and body without waiting for the tail, so buffers are needed to store stalled packets
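As an illustration of the packet format described above, here is a small VHDL package defining header, body and tail flit tags. The tag and payload widths are assumptions for the sketch, not the values used in the project:

  library ieee;
  use ieee.std_logic_1164.all;

  package flit_types is
    -- Assumed 2-bit flit-type tag plus an 8-bit payload; the real widths
    -- used in the project are not given in the slides.
    constant FLIT_HEADER : std_logic_vector(1 downto 0) := "01";
    constant FLIT_BODY   : std_logic_vector(1 downto 0) := "10";
    constant FLIT_TAIL   : std_logic_vector(1 downto 0) := "11";

    type flit_t is record
      kind    : std_logic_vector(1 downto 0);  -- header, body or tail
      payload : std_logic_vector(7 downto 0);  -- header carries the destination; body/tail carry data
    end record;
  end package;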


Router Design

Each router contains a number of modules:

FIFOs (only present in the packet-switched router; see the sketch after this list)

Address to port-request decoder

Arbiter

Control finite state machines

Crossbar
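A minimal synchronous FIFO of the kind the packet-switched router needs to buffer stalled wormhole packets; the width and depth generics here are illustrative assumptions, not the project's figures:

  library ieee;
  use ieee.std_logic_1164.all;

  entity packet_fifo is
    generic (
      WIDTH : integer := 10;   -- flit width in bits (assumed)
      DEPTH : integer := 4     -- buffer depth in flits (assumed)
    );
    port (
      clk   : in  std_logic;
      rst   : in  std_logic;
      wr_en : in  std_logic;
      din   : in  std_logic_vector(WIDTH-1 downto 0);
      rd_en : in  std_logic;
      dout  : out std_logic_vector(WIDTH-1 downto 0);
      full  : out std_logic;
      empty : out std_logic
    );
  end entity;

  architecture rtl of packet_fifo is
    type mem_t is array (0 to DEPTH-1) of std_logic_vector(WIDTH-1 downto 0);
    signal mem    : mem_t;
    signal rd_ptr : integer range 0 to DEPTH-1 := 0;
    signal wr_ptr : integer range 0 to DEPTH-1 := 0;
    signal count  : integer range 0 to DEPTH   := 0;
  begin
    process (clk)
    begin
      if rising_edge(clk) then
        if rst = '1' then
          rd_ptr <= 0; wr_ptr <= 0; count <= 0;
        else
          if wr_en = '1' and count < DEPTH then
            mem(wr_ptr) <= din;
            wr_ptr <= (wr_ptr + 1) mod DEPTH;
          end if;
          if rd_en = '1' and count > 0 then
            rd_ptr <= (rd_ptr + 1) mod DEPTH;
          end if;
          -- single occupancy update so a simultaneous read and write cancel out
          if (wr_en = '1' and count < DEPTH) and not (rd_en = '1' and count > 0) then
            count <= count + 1;
          elsif (rd_en = '1' and count > 0) and not (wr_en = '1' and count < DEPTH) then
            count <= count - 1;
          end if;
        end if;
      end if;
    end process;

    dout  <= mem(rd_ptr);                        -- show-ahead read of the head flit
    full  <= '1' when count = DEPTH else '0';
    empty <= '1' when count = 0     else '0';
  end architecture;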


Circuit Switched Router Structure

[Figure: circuit-switched router block diagram, with Request In/Out, Grant In/Out and Data In/Out ports connected through address-to-port decoders, arbiters, control FSMs and a crossbar]
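The request/grant/data handshake in the diagram suggests a per-port control FSM. The sketch below is an assumed three-state version (idle, requesting, connected), written only to illustrate the circuit-switched flow control described earlier, not the project's actual state machine:

  library ieee;
  use ieee.std_logic_1164.all;

  entity cs_port_fsm is
    port (
      clk, rst     : in  std_logic;
      req_in       : in  std_logic;  -- circuit request arriving at this port
      grant_in     : in  std_logic;  -- grant returning from the downstream router
      release_in   : in  std_logic;  -- asserted when the data transfer has finished
      req_out      : out std_logic;  -- request forwarded towards the destination
      grant_out    : out std_logic;  -- grant propagated back towards the source
      circuit_held : out std_logic   -- crossbar connection currently reserved
    );
  end entity;

  architecture rtl of cs_port_fsm is
    type state_t is (IDLE, REQUESTING, CONNECTED);
    signal state : state_t := IDLE;
  begin
    process (clk)
    begin
      if rising_edge(clk) then
        if rst = '1' then
          state <= IDLE;
        else
          case state is
            when IDLE =>
              if req_in = '1' then
                state <= REQUESTING;   -- forward the request downstream
              end if;
            when REQUESTING =>
              if grant_in = '1' then
                state <= CONNECTED;    -- path now reserved end to end
              end if;
            when CONNECTED =>
              if release_in = '1' then
                state <= IDLE;         -- deallocate the circuit after the data
              end if;
          end case;
        end if;
      end if;
    end process;

    req_out      <= '1' when state = REQUESTING else '0';
    grant_out    <= '1' when state = CONNECTED  else '0';
    circuit_held <= '1' when state = CONNECTED  else '0';
  end architecture;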


Packet Switched Router Structure

[Figure: packet-switched router block diagram, with Request, Write, Full, Grant and Data signals passing between the input FIFOs, address-to-port decoders, arbiters, control FSMs and the crossbar]


Router Implementation and Testing

Both routers were coded using VHDL

Simulation and testing used a combination of ModelSim and Xilinx ISE 9.1

Ad-hoc tests used for individual modules

VHDL testbench used for system verification
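A skeleton of the kind of VHDL testbench described here, with clock and reset generation; the entity names, clock period and mesh instantiation are placeholders, not the project's actual files:

  library ieee;
  use ieee.std_logic_1164.all;

  entity mesh_tb is
  end entity;

  architecture sim of mesh_tb is
    constant CLK_PERIOD : time := 10 ns;  -- assumed clock period
    signal clk : std_logic := '0';
    signal rst : std_logic := '1';
  begin
    -- Clock generator
    clk <= not clk after CLK_PERIOD / 2;

    -- Reset generator: hold reset for a few cycles, then release
    rst <= '0' after 5 * CLK_PERIOD;

    -- The mesh network under test would be instantiated here, for example:
    --   dut: entity work.mesh_network port map (clk => clk, rst => rst, ...);

    stimulus: process
    begin
      wait until rst = '0';
      -- Source processes would read the command file here and inject packets;
      -- sink processes compare what arrives against the expected output table.
      wait for 100 * CLK_PERIOD;
      report "Simulation finished" severity note;
      wait;
    end process;
  end architecture;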


Testbench Structure

[Figure: testbench structure, comprising the mesh network under test, a ReadInput block, input tables (TestTable and OutputTable), Source, Sink and Compare blocks, clock and reset generators, a cycle counter, a command file and an output file]

Sample result output:

Success: ID: 1  Source: (0,3)  Dest: (1,0)  Hops: 4  Latency: 34
Success: ID: 2  Source: (0,2)  Dest: (1,0)  Hops: 3  Latency: 27
Success: ID: 3  Source: (3,2)  Dest: (1,1)  Hops: 3  Latency: 22
Success: ID: 4  Source: (1,3)  Dest: (0,1)  Hops: 3  Latency: 22
Success: ID: 5  Source: (3,0)  Dest: (3,1)  Hops: 1  Latency: 12

Sample command file:

#START  SOURCE  DEST  SIZE  ID#
------------------------------------------------------
2   3 0   0 1   8   1
3   2 0   0 1   2   2
3   2 3   1 1   2   3
4   3 1   1 0   8   4
5   0 3   1 3   7   5
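A sketch of how a ReadInput block might parse such a command file using std.textio. The file name and the column order (start cycle, source y/x, destination y/x, size, ID) are assumptions based on the sample above, not the project's actual parser:

  use std.textio.all;

  entity read_input is
  end entity;

  architecture sim of read_input is
  begin
    process
      -- "commands.txt" is a hypothetical file name; the real name is not given in the slides
      file cmd_file : text open read_mode is "commands.txt";
      variable l    : line;
      variable start_cycle, src_y, src_x, dst_y, dst_x, size, id : integer;
    begin
      readline(cmd_file, l);   -- skip the "#START SOURCE DEST SIZE ID#" header
      readline(cmd_file, l);   -- skip the separator row of dashes
      while not endfile(cmd_file) loop
        readline(cmd_file, l);
        read(l, start_cycle);              -- injection cycle
        read(l, src_y);  read(l, src_x);   -- source coordinates
        read(l, dst_y);  read(l, dst_x);   -- destination coordinates
        read(l, size);                     -- packet size
        read(l, id);                       -- packet ID
        report "Read command for packet " & integer'image(id);
      end loop;
      wait;
    end process;
  end architecture;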


Synthesis

Each router was synthesised for a Virtex-4 LX platform

Post-synthesis verification

Resource usage

Timing


Circuit Switched Resource Usage

[Charts: LUT and flip-flop usage]

Total of 586 4-input LUTs

~0.1% of a Virtex-5

Total of 202 flip-flops


Packet Switched Resource Usage

[Charts: LUT and flip-flop usage]

Total of 786 4-input LUTs

+34% compared to the circuit-switched router

Total of 237 flip-flops


Timing Results

            Circuit Switched    Packet Switched
Max Freq    126.330 MHz         144.533 MHz
Setup time  5.308 ns            6.125 ns
Hold time   0.272 ns            0.272 ns

Critical path is through the arbiter in both designs
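The slides do not show the arbiter itself. The sketch below is a simple fixed-priority arbiter, included only to illustrate why the grant logic tends to dominate the critical path as the number of requesting ports grows; it is not the project's arbiter:

  library ieee;
  use ieee.std_logic_1164.all;

  entity fixed_priority_arbiter is
    generic (
      N : integer := 5  -- one request line per input port (assumed)
    );
    port (
      req   : in  std_logic_vector(N-1 downto 0);
      grant : out std_logic_vector(N-1 downto 0)  -- one-hot grant
    );
  end entity;

  architecture rtl of fixed_priority_arbiter is
  begin
    process (req)
      variable granted : boolean;
    begin
      grant   <= (others => '0');
      granted := false;
      -- Lowest index wins; the chain of comparisons is the kind of logic
      -- that lengthens the critical path as the port count grows.
      for i in 0 to N-1 loop
        if req(i) = '1' and not granted then
          grant(i) <= '1';
          granted  := true;
        end if;
      end loop;
    end process;
  end architecture;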


Project Appraisal

Maintaining an accurate software simulation proved difficult

A great deal was learnt during the implementation of the circuit switched network

HDL implementations are only prototypes

Testbench provides a good framework but more time is needed to gather performance data


Conclusions

Possible to build low-complexity network-on-chip systems suitable for FPGAs

Latency has to be traded for throughput

Hard to collect performance data without application driven benchmarks

Both networks are viable so why not use both?


Future Work

Cycle accurate software simulations

Application driven benchmarking

Serial transmission

Power efficiency

Industry standard solution
