NoCs: It Is About the Memory and the Programming Model
Ivo Bolsens, Sr. VP and CTO
Xilinx, Inc., 2009
FPGA Platform
• Optimized FPGA feature mix for various applications
  – LXT: General Logic + Serial
  – SXT: Rich DSP & BRAM + Serial
  – HXT: Highest Bandwidth Serial
• Ultimate flexibility
  – Change FPGA feature mix at any time during your design / product lifecycle
[Die diagram: SelectIO Logic, Clock Manager, DSP, Serial Transceiver, BRAM, PCI Express / EMAC]
[Chart: FPGA Capacity Trends — Number of LCs (log scale, 1E+02 to 1E+09) vs. Year (1985–2025); Historical Data and Largest Xilinx FPGA; ITRS projection for 2013: 2.6M LCs (3.1B transistors)]
[Chart: FPGA Performance Trends — System Speed (MHz, log scale 10–10000) vs. Year (1985–2025); historical FPGA data. 2007: 325 MHz typical, 500 MHz max; 2013: 500 MHz typical, 750 MHz max]
[Chart: Price Per Logic Cell — $/LC (log scale, 0.0001–1) vs. Year (1990–2015)]
Circuit Switched Point to Point
Example: Xilinx Virtex FPGA
• Staggered, Segmented Routing
• Wide Reach with Few Hops
• Circuit Switched Interconnect
  – Dedicated path A => B
• Guaranteed timing (frequency and latency)
• Static scheduling (requires place and route)
Circuit Switching Guarantees Timing
Focus: Real Time Design
Each Square = 1 CLB (Configurable Logic Block)
Each Hop = Active Buffer Switch
The FPGA Ecosystem
Digital Logic • Computer Architecture • Embedded Systems • Configurable Computing • Digital Signal Processing • FEC Coding • Encryption • Networking • Robotics • High Performance Computing • Video • Image & Video Processing • Dynamically Reconfigurable Systems • Hardware Compilation • Hardware-Software Co-Design • Speech Recognition • Programmable Hardware Architectures • Surveillance • CAD Tools
Processor Metrics
FPGA:
• Massively parallel with pipelined throughput
• Distributed, granular memory architecture

| Performance Metric | Intel Xeon 7350 (Quad Core) | Xilinx Virtex-5 SX240T | Delta |
| Theoretical Issue Rate | 47 Billion 64-bit Ops/Sec | 2.59 Trillion 64-bit Ops/Sec | 55.1X |
| Integer Operators | Classical 8/16/32/64-bit | Programmable to any bit size | — |
| FLOPs (Mul+Add) | 94 Gflop/s SP | 204 Gflop/s SP | 2.2X |
| Pipeline Depth | 14 Stages | Programmable to any depth | — |
| BW to Memory | CPU to MCH (FSB): 8.5 GB/s @ 1066 MHz | FPGA to MCH (FSB): 8.5 GB/s @ 1066 MHz | = |
| L2 Cache BW vs. FPGA to Local Memory BW (opt.) | 94 GB/Sec | 28.8 GB/Sec | 0.3X |
| L1 Cache BW vs. Block RAM BW | 188 GB/Sec | 3.3 TB/Sec | 9.6X |
| Register File BW vs. LUTRAM BW | 750 GB/Sec | 7.5 TB/Sec | 13.9X |
| Power | 130 W | 30 W | 0.2X |
FPGA System Interconnect Evolution
2005 (FSB / PCIe): Circuit Switched
• Co-processing
  – Non-coherent accelerator
  – Software managed memory consistency
  – IO Device programming model
  – DMA engines
Device Centric Shared Memory Programming: User Managed Memory Coherency
1. flushSourceToMem()
2. setupDMA()
3. HW Process()
   – A. if DMA'event …
   – B. DMAreadFromMainMem()
   – C. HWcomputeProcess()
   – D. DMAwriteToMainMem()
4. SignalDoneIRQ()
5. waitForHWDone()
6. rebuildCacheFromMem()
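The steps above can be sketched in software. In this minimal Python simulation the function names mirror the slide's hypothetical API, implemented as logging stubs; the point it illustrates is that with user-managed coherency, every coherency action is explicit CPU work.

```python
# Simulation of the user-managed-coherency offload flow (hypothetical
# names from the slide, implemented as logging stubs).

log = []

def flush_source_to_mem():      log.append("flush caches -> main memory")
def setup_dma():                log.append("program DMA descriptors")
def hw_process():
    # Hardware side, triggered by the DMA event.
    log.append("DMA read from main memory")
    log.append("HW compute")
    log.append("DMA write to main memory")
def signal_done_irq():          log.append("raise done IRQ")
def wait_for_hw_done():         log.append("CPU wakes on IRQ")
def rebuild_cache_from_mem():   log.append("re-read results into cache")

# Nothing here is coherent by itself -- software sequences every step.
flush_source_to_mem()
setup_dma()
hw_process()
signal_done_irq()
wait_for_hw_done()
rebuild_cache_from_mem()
print(len(log))  # 8 explicit steps for a single offload
```

Contrast this with the hardware-managed model of the later slides, where the flush and rebuild steps disappear because the fabric keeps caches consistent.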
[Diagram: PCI Express tree — CPU and Memory at the Root Complex; Graphics on a 16x link; switches fanning out to x1, x2, and x8 endpoints and a legacy endpoint behind a PCI bridge. Numbered markers 1–5 trace the DMA flow.]
FPGA System Interconnect Evolution
2005 (FSB / PCIe): Circuit Switched
• Co-processing
  – Non-coherent accelerator
  – Software managed memory consistency
  – IO Device programming model
  – DMA engines
2008 (FSB / PCIe): Transaction Based
• Peer processing
  – Coherent accelerator
  – Hardware managed memory consistency
  – Shared memory programming model
Xeon 7300 System Platforms
Hybrid SMP & DSM + Accelerator: Convey HC-1 (2008)
• Socket Filler Module
• Bridge FPGA
• Implements FSB Protocol
• Full Snoop Support
• FPGA Based Compute Accelerator
• Pre-Defined Vector Instruction Set
• Shared Memory Programming Model
• ANSI C Support
• Accelerator Cache Memory
• 80 GB/s BW
• Snoop Coherent with System Memory
• Direct Cache Access CPU <-> FPGA
[Diagram: eight MC LX155 memory-controller FPGAs]
Source: Convey Computer, 2008
FPGA System Interconnect Evolution
2005 (FSB / PCIe): Circuit Switched
• Co-processing
  – Non-coherent accelerator
  – Software managed memory consistency
  – IO Device programming model
  – DMA engines
2008 (FSB / PCIe): Transaction Based
• Peer processing
  – Coherent accelerator
  – Hardware managed memory consistency
  – Central memory
  – Shared memory programming model
2009 (QPI / PCIe): Packet Switched
• Scalable peer processing
  – Scalable coherency
  – Directory + Snooping
  – Distributed memory
  – Shared memory programming model
Point To Point
Examples:
• AMD HyperTransport
• Intel QuickPath
Key Facts:
• Narrower buses, higher frequency
• Multiple active masters (N links => N × throughput)
• Single Hop (fully connected)
• Scalable topologies (packet switched interconnect)
  – 1D Ring, 2D Mesh, 3D Torus
Point to Point:
Makes Interconnect Scalable
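The "N links => N × throughput" claim can be made concrete with a back-of-envelope comparison (illustrative numbers, not vendor figures): a shared bus carries one transfer at a time, while every point-to-point link can be active concurrently.

```python
# Aggregate throughput: shared bus vs. point-to-point links
# (illustrative model; link_gbps and node counts are made up).

def bus_throughput(link_gbps, n_nodes):
    # Only one master drives a shared bus at a time,
    # so aggregate throughput is flat regardless of node count.
    return link_gbps

def p2p_throughput(link_gbps, n_links):
    # Each point-to-point link carries an independent transfer,
    # so aggregate throughput scales with the link count.
    return link_gbps * n_links

print(bus_throughput(10, 8))   # 10 -- flat as the system grows
print(p2p_throughput(10, 8))   # 80 -- N links => N x throughput
```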
Distributed Shared Memory (DSM): AMD Hammer (Opteron) (2002)
Source: AMD, HotChips 14, Fall 2002
• Distributed Shared Memory (DSM)
• Per-node Memory Controller
• Global Address Space
• Snoopy Coherency
[Axes: Programming Model — Interconnect — Memory Model]
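Snoopy coherency can be illustrated with a minimal write-invalidate sketch: every write is broadcast and all other caches drop their copy, so the next read refetches the fresh value. This is a toy protocol for intuition only, not AMD's actual Hammer protocol (which uses MOESI states).

```python
# Toy write-invalidate snooping cache (illustration, not MOESI).

class Cache:
    def __init__(self, name):
        self.name, self.lines = name, {}    # addr -> cached value

    def read(self, addr, memory):
        if addr not in self.lines:          # miss: fill from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]

    def write(self, addr, value, memory, peers):
        for p in peers:                     # snoop: invalidate other copies
            p.lines.pop(addr, None)
        self.lines[addr] = value
        memory[addr] = value                # write-through, for simplicity

memory = {0x100: 1}
c0, c1 = Cache("cpu0"), Cache("cpu1")
c0.read(0x100, memory)                      # both caches hold addr 0x100
c1.read(0x100, memory)
c0.write(0x100, 42, memory, peers=[c1])     # broadcast invalidates cpu1
print(c1.read(0x100, memory))               # 42 -- refilled from memory
```

The broadcast in `write` is exactly what a shared bus gives you "for free" and what packet-switched interconnects must recreate with directories, as the 2009 column of the evolution slide notes.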
Taxonomy of Large Multiprocessors
Multiprocessors
├─ Shared Address Space
│  ├─ Symmetric Shared Memory
│  └─ Distributed Shared Memory
│     ├─ Cache Coherent (ccNUMA)
│     └─ Non Coherent
└─ Distributed Address Space
   ├─ Commodity Cluster or Custom Cluster
   └─ Uniform Cluster or Cluster of SMPs or DSMs
Source: Asanovic, UCB, CS252 Class Notes, Fall 2007
Distributed Address Space: Hybrid CPU + FPGA Systems
[Diagram: two mirrored boards. On each, X86 SW processes with local memory sit on an FSB; an FSB bridge connects into an on-FPGA NoC linking µB (MicroBlaze) and PPC processors and NoC HW MPEs; the boards interconnect over GT/GTX serial I/O. Each node has a unique address space.]
Source: Arches Computing, 2009
• Multiple Private Memory Spaces
• Multiple Compute Nodes: X86, Embedded CPUs, FPGA HW
Distributed Address Space: Intel Polaris (2007)
Source: Intel, HotChips 19, Fall 2007
80-Tile Many-Core CPU With Tiled, Distributed, Private Memories
• 8x10 Tiled Design
• 160 SP FP Units (two FPMACs per tile)
• Private Memory per Tile
• 5 Ported Router per Tile
• 2D NOC Mesh Interconnect
• Wormhole Routing
• Message Passing Instructions
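On a 2D mesh like Polaris's, a common routing policy is dimension-order (XY) routing: travel fully along X first, then along Y, which is deadlock-free on a mesh. The sketch below illustrates that policy in general; it is not claimed to be Intel's exact router algorithm.

```python
# Dimension-order (XY) routing on a 2D mesh -- a standard NoC routing
# policy, shown here as a generic sketch.

def xy_route(src, dst):
    (x, y), (dx, dy) = src, dst
    hops = []
    while x != dx:                       # resolve the X dimension first
        x += 1 if dx > x else -1
        hops.append((x, y))
    while y != dy:                       # then the Y dimension
        y += 1 if dy > y else -1
        hops.append((x, y))
    return hops

path = xy_route((0, 0), (7, 9))          # corner to corner of an 8x10 mesh
print(len(path))                         # 16 hops = 7 + 9
```

Wormhole flow control then pipelines each packet's flits along this path, so latency grows with hop count plus packet length rather than their product.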
Taxonomy of Networks
Networks
├─ Shared Bus
└─ Point to Point
   ├─ Fully Connected
   │  ├─ Circuit Switched
   │  └─ Transaction Switched
   └─ Switched
      ├─ Switching: Circuit Switched or Packet Switched
      └─ Topologies: 1D Ring; 2D Mesh (Regular, Staggered, Hierarchical); Torus and other 3D; Central Xbar; Distributed Xbar
Shared Bus
Key Facts:
• All Masters Connected to All Slaves (limits frequency)
• Only one active Master (limits throughput)
• Broadcast "built in by design" (snooping simple to implement)
• Used when wiring resources are scarce
• Split transactions allow pipelined phases (helps throughput somewhat)
  – request, snoop, data response
[Diagram: four µP + cache nodes sharing one bus to a central memory]
Throughput and Scalability Issues
Examples:
• Intel FSB (early generations)
• ARM AMBA 1/2 (before AXI)
• IBM CoreConnect PLB/OPB
Point To Point
Examples:
• AMD HyperTransport
• Intel QuickPath
• ARM AXI
Key Facts:
• Narrower buses, higher frequency
• Multiple active masters (N links => N × throughput)
• Single Hop (fully connected)
• Scalable topologies (packet switched interconnect)
  – 1D Ring, 2D Mesh, 3D Torus
Point to Point:
Makes Interconnect Scalable
On-Chip Interconnect Drivers
[Table: Shared Bus vs. Point To Point vs. Network On Chip, rated on Latency, Bandwidth, Performance Scaling, IP Reuse, Quality of Service, and Resource Usage. Resource Usage row: Shared Bus = Best, Point To Point = Square, NOC = Linear.]
Point To Point Is Best For Small-Medium Systems
NOC Is Needed As Complexity Grows
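The Square-vs-Linear resource trend follows from simple link counting: fully connecting N nodes point-to-point needs N(N-1)/2 links, while a 2D mesh NoC needs roughly 2N.

```python
# Link counts behind the "Square" vs. "Linear" resource-usage ratings.

def fully_connected_links(n):
    # Every node pairs with every other node: grows as N^2.
    return n * (n - 1) // 2

def mesh_links(rows, cols):
    # Horizontal links in each row plus vertical links in each column:
    # grows linearly in the node count.
    return rows * (cols - 1) + cols * (rows - 1)

print(fully_connected_links(64))         # 2016 links
print(mesh_links(8, 8))                  # 112 links for the same 64 nodes
```

At 64 nodes the fully connected fabric already needs about 18× the links of a mesh, which is why the slide concludes that a NoC is needed as complexity grows.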
SOC Integration Trends
MCU — Circuit Switched (e.g. Generic MCU)
– CPU + peripherals
– Industrial, Motor Control, Display Interface
– Freescale, Renesas, Microchip
SOC — NOC Transaction Fabric (e.g. TI OMAP 3430 with Multi-Layer AXI3)
– CPU + peripherals
– Integrated Accelerators (Video, Networking)
– Sonics
– TI OMAP3, Samsung, ST Nomadik
SMP — NOC Transaction Fabric With Coherency (e.g. ARM Cortex A9 SMP with Accelerator Coherency Port)
– MP CPU with coherency
– Accelerators with coherency
– Peripherals
– Intel Atom, TI OMAP4
Taxonomy of Programming Models
Programming Models
├─ Streaming
│  ├─ Packetized
│  └─ Endless
├─ Shared Memory (ShMem)
│  ├─ User Managed Memory Spaces
│  └─ HW Managed Memory Consistency
│     ├─ SMP & DSM
│     └─ UPC / PGAS (Distributed Address Spaces)
└─ Message Passing
   ├─ Two Sided (Eager or Rendezvous)
   └─ One Sided
Shared Memory Programming on FPGAs: Convey HC-1 (2008)
FPGA Accelerators Today:
• ANSI C Programming
• Standard C Compilers
• Pointers
• Flat, Virtual Memory
• Run-Time Scheduler
• HW Managed Memory Consistency
Abstracted Away:
• HDL Design
• Timing Closure
• Fixed, Static Scheduling
• DMA Engine Programming
• SW Managed Memory Regions
Source: Convey Computer, 2008
Message Passing in Embedded: Arches (2009)
[Diagram: two mirrored nodes. On each, X86 MPI SW processes with memory connect over the FSB through an MPI FSB Bridge to an on-FPGA fabric hosting a µB MPI SW process, a PPC MPI SW process, and an HW MPE running an MPI HW "process"; the two FPGAs link via an MPI GT/GTX Serial I/O Bridge.]
• Standard MPI Programming Model & API
• Light Weight Message Passing Protocol Implementation
• Focused on Embedded Systems
• Explicit Rank to Node Binding Support
Source: Arches Computing, 2009
Message Passing Portability: Same Standard MPI API, Different Cores
Rank 0:
main() {
…
MPI_Send()
MPI_Send()
MPI_Send()
MPI_Send()
MPI_Recv()
MPI_Recv()
MPI_Recv()
MPI_Recv()
…
}
Rank 1:
main() {
…
MPI_Recv()
Compute()
MPI_Send()
…
}
Rank 2:
main() {
…
MPI_Recv()
Compute()
MPI_Send()
…
}
Rank 3:
main() {
…
MPI_Recv()
Compute()
MPI_Send()
…
}
Rank 4:
( HDL )
Process ( ) {
…
MPE_Recv()
Compute()
MPE_Send()
…
}
Ranks 0, 1: X86 — Rank 2: FPGA Soft RISC (MicroBlaze) — Rank 3: FPGA Hard RISC (PowerPC) — Rank 4: FPGA Hardware Engine
Source: Arches Computing, 2009
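The scatter/gather pattern on this slide can be simulated in plain Python, with queues standing in for the MPI/MPE transports (the `send`/`recv` helpers are hypothetical stand-ins, not real MPI calls). The point of the portability argument survives the translation: rank 0's code never changes whether a worker rank runs on a CPU or an FPGA engine.

```python
# Queue-based simulation of the rank 0 scatter / ranks 1-4 compute /
# rank 0 gather pattern (send/recv are stand-ins for MPI_Send/MPI_Recv).

from queue import Queue

inbox = {r: Queue() for r in range(5)}   # one mailbox per rank

def send(dst, data):   inbox[dst].put(data)
def recv(rank):        return inbox[rank].get()

def worker(rank):
    # Ranks 1-4 all run the same recv / Compute / send loop body;
    # squaring stands in for the slide's Compute() step.
    x = recv(rank)
    send(0, x * x)

for r, v in zip(range(1, 5), [1, 2, 3, 4]):
    send(r, v)                           # rank 0: four MPI_Send()s
for r in range(1, 5):
    worker(r)                            # run each worker once
results = sorted(recv(0) for _ in range(4))
print(results)                           # [1, 4, 9, 16]
```

Swapping a worker's implementation from software to a hardware MPE changes only which engine drains the mailbox, exactly as ranks 1-3 (C) and rank 4 (HDL) differ only inside their process bodies.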
UPC on FPGAs
What
• RAMP Blue (UC Berkeley, 2007)
• 1008 MicroBlaze Cores @ 100 MHz
HW
• 12 Cores per FPGA (Virtex-II Pro, 130nm)
• 21 Boards (4 FPGAs per board + 1 Control FPGA)
Memory
• Distributed Memories
• Each Core has its own address space
• Message passing between cores
SW
• UPC running on top of Linux
Takeaways
• Memory Coherency Going Embedded
  – Multi-Core CPUs with on-chip coherent NOCs
  – Convey: Coherent, Shared Memory, X86-FPGA System
• Message Passing Going Embedded
  – On-chip Coherency Too Expensive For Many-Core CPUs
  – Arches: Message Passing on Hybrid X86-FPGA Systems
• Processors Evolve to Match Computing Needs
  – uC, Multi-Core, Many-Core Machines
• Memory Models to Match Application Needs
  – FPGAs Support SMP, DSM, Message Passing & Coherency
• Mainstream Programming Models
  – C programmed, Runtime scheduled, Instruction set based FPGAs
  – MPI API lightweight implementation for FPGAs
Challenge:
What memory and programming model do you want to see on FPGAs?
Knowledge Community
• A wide association of people with a common technology interest, with the intent to
  – Progress their knowledge
  – Enhance the skills of all members
  – Preserve a legacy of lessons learned
Grow Knowledge Community: Wireless, Wired, Computing
Grow User Community: Hardware Platform, Reference Designs, Open Source Repository
RAMP
• Multi-FPGA system for parallel computing research
• Multi-university collaboration between top schools such as UC Berkeley, Stanford, MIT, University of Texas at Austin, etc.
• Microsoft Research Labs
• RAMP BEE3, multi-FPGA board