
NOCs : It is about the memory and the programming...

Date post: 23-May-2018
NOCs : It is about the memory and the programming model Ivo Bolsens, Sr. VP and CTO Xilinx, Inc 2009
Transcript
Page 1: NOCs : It is about the memory and the programming …circuit.ucsd.edu/~nocs2009/talks/xilinxatnocs2009.pdfHardware-software co-design Speech Recognition ... Graphics : 16X SWITCH SWITCH

NOCs : It is about the memory and the programming model

Ivo Bolsens, Sr. VP and CTO

Xilinx, Inc 2009


FPGA Platform

• Optimized FPGA feature mix for various applications
– LXT: General Logic + Serial
– SXT: Rich DSP & BRAM + Serial
– HXT: Highest Bandwidth Serial

• Ultimate flexibility
– Change FPGA feature mix at any time during your design / product lifecycle

[Figure: FPGA feature blocks. Labels: SelectIO, Logic, Clock Manager, DSP, Serial Transceiver, BRAM, PCI Express / EMAC]


FPGA Capacity Trends

[Chart: number of LCs (log scale, 1.00E+02 to 1.00E+09) versus year (1985 to 2025). Legend: Historical Data, Largest Xilinx FPGA. Annotation: ITRS 2013: 2.6M LCs (3.1B transistors)]


FPGA Performance Trends

[Chart: system speed in MHz (log scale, 10 to 10000) versus year (1985 to 2025), historical FPGA data. Annotations: 2007: 325 MHz typical, 500 MHz max; 2013: 500 MHz typical, 750 MHz max]


Price Per Logic Cell

[Chart: $ / LC (log scale, 0.0001 to 1) versus year (1990 to 2015)]


Circuit Switched Point to Point

Example: Xilinx Virtex FPGA

• Staggered, Segmented Routing
• Wide Reach with Few Hops
• Circuit Switched Interconnect
– Dedicated path A=>B
• Guaranteed timing (frequency and latency)
• Static scheduling (requires place and route)

Circuit Switching Guarantees Timing

Focus: Real Time Design

Each Square = 1 CLB (Configurable Logic Block)

Each Hop = Active Buffer Switch


The FPGA Ecosystem

Digital Logic
Computer Architecture
Embedded systems
Configurable Computing
Digital Signal Processing
FEC Coding
Encryption
Networking
Robotics
High Performance Computing
Video
Image & Video Processing
Dynamically reconfigurable systems
Hardware compilation
Hardware-software co-design
Speech Recognition
Programmable hardware architectures
Surveillance
CAD tools


Processor Metrics

FPGA:

• Massively parallel with pipelined throughput

• Distributed, granular memory architecture

Performance Metric                           | Intel Xeon 7350 (Quad Core)          | Xilinx Virtex-5 SX240T                | Delta
Theoretical Issue Rate                       | 47 Billion 64 bit Ops/Sec            | 2.59 Trillion 64 bit Ops/Sec          | 55.1X
Integer Operators                            | Classical 8/16/32/64 bit             | Programmable to any bit size          |
FLOPs (Mul+Add)                              | 94 Gflop/s SP                        | 204 Gflop/s SP                        | 2.2X
Pipeline Depth                               | 14 Stages                            | Programmable to any depth             |
BW to Memory                                 | CPU to MCH (FSB): 8.5 GB/s @ 1066MHz | FPGA to MCH (FSB): 8.5 GB/s @ 1066MHz | =
L2 Cache BW / FPGA to Local Memory BW (opt.) | 94 GB/Sec                            | 28.8 GB/Sec                           | 0.3X
L1 Cache BW / Block RAM BW                   | 188 GB/Sec                           | 3.3 TB/Sec                            | 9.6X
Register File BW / LUTRAM BW                 | 750 GB/Sec                           | 7.5 TB/Sec                            | 13.9X
Power                                        | 130W                                 | 30W                                   | 0.2X


FPGA System Interconnect Evolution

• Co-processing
– Non-coherent accelerator
– Software managed memory consistency
– IO Device programming model
– DMA engines

[Timeline: 2005; interconnects: FSB, PCIe]

• Circuit Switched


Device Centric Shared Memory Programming: User Managed Memory Coherency

1. flushSourceToMem()
2. setupDMA()
3. HW Process()
– A. if DMA’event …
– B. DMAreadFromMainMem()
– C. HWcomputeProcess()
– D. DMAwriteToMainMem()
4. SignalDoneIRQ()
5. waitForHWDone()
6. rebuildCacheFromMem()
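The six steps above can be sketched as a single-process C model. This is a minimal illustration under stated assumptions, not the driver code behind the slide: `toy_system_t`, the buffer size, and the doubling "computation" are hypothetical, and a real system would use cache-maintenance instructions, DMA descriptors, and an interrupt handler where this sketch uses `memcpy` and flags.

```c
#include <assert.h>
#include <string.h>

#define BUF_WORDS 4  /* assumed buffer size, for illustration only */

/* Toy model: the CPU cache and the accelerator each hold a private copy,
 * and main memory is the only place where the two exchange data. */
typedef struct {
    int main_mem[BUF_WORDS];   /* shared main memory */
    int cpu_cache[BUF_WORDS];  /* CPU-side cached copy of the buffer */
    int dev_buf[BUF_WORDS];    /* accelerator-local buffer */
    int dma_armed;             /* set by setup_dma(), consumed by hardware */
    int done_irq;              /* set by hardware when the job completes */
} toy_system_t;

/* 1. Write cached data back so the device sees current values. */
void flush_source_to_mem(toy_system_t *s) {
    memcpy(s->main_mem, s->cpu_cache, sizeof s->main_mem);
}

/* 2. Arm the DMA engine (here just a flag). */
void setup_dma(toy_system_t *s) {
    s->dma_armed = 1;
    s->done_irq = 0;
}

/* 3. Hardware side: A. check DMA event, B. read, C. compute, D. write back,
 * then 4. raise the done interrupt. */
void hw_process(toy_system_t *s) {
    if (!s->dma_armed) {
        return;                                           /* A */
    }
    memcpy(s->dev_buf, s->main_mem, sizeof s->dev_buf);   /* B */
    for (int i = 0; i < BUF_WORDS; i++) {
        s->dev_buf[i] *= 2;                               /* C (placeholder) */
    }
    memcpy(s->main_mem, s->dev_buf, sizeof s->main_mem);  /* D */
    s->dma_armed = 0;
    s->done_irq = 1;                                      /* 4 */
}

/* 5. Poll for completion; returns nonzero once the hardware is done. */
int wait_for_hw_done(const toy_system_t *s) {
    return s->done_irq;
}

/* 6. Replace the stale cached copy with the contents of main memory. */
void rebuild_cache_from_mem(toy_system_t *s) {
    memcpy(s->cpu_cache, s->main_mem, sizeof s->cpu_cache);
}

/* Host-side sequence for one offload; returns 0 on success, -1 otherwise. */
int offload(toy_system_t *s) {
    flush_source_to_mem(s);
    setup_dma(s);
    hw_process(s);
    if (!wait_for_hw_done(s)) {
        return -1;
    }
    rebuild_cache_from_mem(s);
    return 0;
}
```

In this model, skipping step 1 hands the accelerator whatever stale data main memory held, and skipping step 6 leaves the CPU reading its pre-offload copy. Both are the programmer's responsibility, which is the point of "user managed" coherency.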

[Diagram: PCI Express topology with CPU, Root Complex, Memory, Graphics (16X), three switches, x1, x2 and x8 end points, a legacy end point, and a PCI bridge to PCI. Numbered callouts 1 to 5 mark the steps above, with DMA at step 2]


FPGA System Interconnect Evolution

• Co-processing
– Non-coherent accelerator
– Software managed memory consistency
– IO Device programming model
– DMA engines

• Peer processing
– Coherent accelerator
– Hardware managed memory consistency
– Shared memory programming model

[Timeline: 2005, 2008; interconnects: FSB and PCIe for both]

• Circuit Switched  • Transaction Based


Xeon 7300 System Platforms


Hybrid SMP & DSM + Accelerator
Convey HC-1 (2008)

• Socket Filler Module
• Bridge FPGA
• Implements FSB Protocol
• Full Snoop Support
• FPGA Based Compute Accelerator
• Pre-Defined Vector Instruction Set
• Shared Memory Programming Model
• ANSI C Support
• Accelerator Cache Memory
• 80 GB/s BW
• Snoop Coherent with System Memory
• Direct Cache Access CPU<->FPGA

Source: Convey Computer, 2008

[Diagram labels: MC LX155, repeated eight times]


FPGA System Interconnect Evolution

• Co-processing
– Non-coherent accelerator
– Software managed memory consistency
– IO Device programming model
– DMA engines

• Peer processing
– Coherent accelerator
– Hardware managed memory consistency
– Central memory
– Shared memory programming model

• Scalable peer processing
– Scalable coherency
– Directory + Snooping
– Distributed memory
– Shared memory programming model

[Timeline: 2005, 2008, 2009; interconnects: FSB, FSB, QPI, with PCIe alongside each]

• Circuit Switched  • Transaction Based  • Packet Switched


Point To Point

Examples:

• AMD HyperTransport
• Intel QuickPath

Key Facts:

• Narrower buses, higher frequency
• Multiple active masters (N links => N * Throughput)
• Single Hop (fully connected)
• Scalable topologies (packet switched interconnect)
– 1D Ring, 2D Mesh, 3D Torus

Point to Point:

Makes Interconnect Scalable


Distributed Shared Memory (DSM)
AMD Hammer (Opteron) (2002)

Source: AMD, HotChips14, Fall 2002

• Distributed Shared Memory (DSM)

• Per-node Memory Controller

• Global Address Space

• Snoopy Coherency


Programming Model

Interconnect

Memory Model


Taxonomy of Large Multiprocessors

Multiprocessors
  Shared Address Space
    Symmetric Shared Memory
    Distributed Shared Memory
      Cache Coherent (ccNUMA)
      Non Coherent
  Distributed Address Space
    Commodity Cluster
    Custom Cluster
      Uniform Cluster
      Cluster of SMPs or DSMs

Source: Asanovic, UCB, CS252 Class Notes, Fall 2007


Distributed Address Space: Hybrid CPU + FPGA Systems

[Diagram, shown for two nodes: X86 SW processes and Memory on an FSB; an FSB bridge into an FPGA NoC; NoC-attached µB and PPC processors, HW MPEs, and GT/GTX Serial I/O]

Source: ArchesComputing, 2009

• Multiple Private Memory Spaces
• Multiple Compute Nodes: X86, Embedded CPUs, FPGA HW

Unique Address Spaces


Distributed Address Space: Intel Polaris (2007)

Source: Intel, HotChips19, Fall 2007

160 Node Many-Core CPU With Tiled, Distributed, Private Memories

• 8x10 Tiled Design
• 160 SP FP Tiles
• Private Memory per Tile
• 5 Ported Router per Tile
• 2D NOC Mesh Interconnect
• Wormhole Routing
• Message Passing Instructions
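To make the 2D mesh and the five-port router concrete, here is a minimal C sketch of path selection on an 8x10 mesh. The slide names the topology and wormhole routing but not the routing algorithm, so dimension-order (XY) routing, the row-major tile numbering, and the 8-columns-by-10-rows orientation are all assumptions made for illustration. Function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed orientation: 8 columns by 10 rows, tiles numbered row-major. */
#define MESH_COLS 8
#define MESH_ROWS 10
#define MESH_TILES (MESH_COLS * MESH_ROWS)

/* Number of router-to-router hops between two tiles under XY routing. */
int xy_hops(int src, int dst) {
    assert(src >= 0 && src < MESH_TILES);
    assert(dst >= 0 && dst < MESH_TILES);
    int dx = abs(dst % MESH_COLS - src % MESH_COLS);
    int dy = abs(dst / MESH_COLS - src / MESH_COLS);
    return dx + dy;
}

/* Next tile on the path: correct the X coordinate first, then Y.
 * Returns cur itself when the packet has arrived (local port). */
int xy_next_hop(int cur, int dst) {
    assert(cur >= 0 && cur < MESH_TILES);
    assert(dst >= 0 && dst < MESH_TILES);
    int cx = cur % MESH_COLS, cy = cur / MESH_COLS;
    int tx = dst % MESH_COLS, ty = dst / MESH_COLS;
    if (cx < tx) return cur + 1;          /* east port */
    if (cx > tx) return cur - 1;          /* west port */
    if (cy < ty) return cur + MESH_COLS;  /* port toward the next row */
    if (cy > ty) return cur - MESH_COLS;  /* port toward the previous row */
    return cur;                           /* local tile port */
}
```

The five possible outcomes of `xy_next_hop` (four neighbours plus local delivery) line up with a five-ported router. Under these assumptions a corner-to-corner packet, tile 0 to tile 79, takes 16 hops: 7 in X and 9 in Y. Wormhole routing itself concerns flit-level flow control along the chosen path, which this sketch does not model.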


Taxonomy of Networks

Networks

Shared Bus Point to Point

Circuit Switched

Fully Connected

Transaction Switched

Switched

1D Ring

2D Mesh

Torus, Other 3D

Central Xbar

Distributed Xbar

Circuit Switched

Packet Switched

Regular Mesh

Staggered Mesh

Hierarchical Mesh


Shared Bus

Key Facts:

• All Masters Connected to All Slaves (limits frequency)
• Only one active Master (limits throughput)
• Broadcast “built in by design” (snooping simple to implement)
• Used when wiring resources are scarce
• Split transactions allow pipelined phases (helps throughput somewhat)
– request, snoop, data response

[Diagram: four µPs, each with a cache ($), on one Bus to Central Memory]

Throughput and Scalability Issues

Examples:

• Intel FSB (early generations)
• ARM AMBA 1/2 (before AXI)
• IBM CoreConnect PLB/OPB


Point To Point

Examples:

• AMD HyperTransport
• Intel QuickPath
• ARM AXI

Key Facts:

• Narrower buses, higher frequency
• Multiple active masters (N links => N * Throughput)
• Single Hop (fully connected)
• Scalable topologies (packet switched interconnect)
– 1D Ring, 2D Mesh, 3D Torus

Point to Point:

Makes Interconnect Scalable


On-Chip Interconnect Drivers

[Table: Shared Bus, Point To Point, and Network On Chip compared on Latency, Bandwidth, Performance Scaling, IP Reuse, Quality of Service, and Resource Usage. Text entries: Best, Square, Linear]

Point To Point Is Best For Small-Medium Systems

NOC Is Needed As Complexity Grows
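One way to read the Best / Square / Linear entries is as the growth of wiring resources with node count; that reading is an assumption, since the table layout is not recoverable here. The link counts themselves are standard: a fully connected point-to-point fabric needs one link per pair of nodes, while a ring or 2D mesh needs a number of links proportional to the node count. A minimal C sketch, with hypothetical function names:

```c
#include <assert.h>

/* Fully connected point-to-point: one dedicated link per pair of nodes,
 * so the count grows with the square of n. */
long fully_connected_links(long n) {
    assert(n >= 1);
    return n * (n - 1) / 2;
}

/* 1D ring: each node links to its successor (needs at least 3 nodes). */
long ring_links(long n) {
    assert(n >= 3);
    return n;
}

/* 2D mesh without wraparound: horizontal plus vertical neighbour links,
 * which grows linearly with rows * cols. */
long mesh2d_links(long rows, long cols) {
    assert(rows >= 1 && cols >= 1);
    return rows * (cols - 1) + cols * (rows - 1);
}
```

For 64 nodes these give 2016 links fully connected, 64 for a ring, and 112 for an 8 by 8 mesh, which is consistent with point to point suiting small and medium systems and a NOC being needed as complexity grows.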


SOC Integration Trends

MCU
– CPU + peripherals
– Industrial, Motor Control, Display Interface
– FreeScale, Renesas, MicroChip

SOC
– CPU + peripherals
– Integrated Accelerators (Video, Networking)
– Sonics
– TI OMAP3, Samsung, ST Nomadik

SMP
– MP CPU with coherency
– Accelerators with coherency
– Peripherals
– Intel Atom, TI OMAP4

[Figures: Generic MCU (Circuit Switched); TI OMAP 3430 With Multi-Layer AXI3 (NOC Transaction Fabric); ARM Cortex A9 SMP With Accelerator Coherency Port (NOC Transaction Fabric With Coherency)]


Taxonomy of Programming Models

Programming Models

Streaming

ShMem

Packetized

User Managed Memory Spaces

Endless

HW Managed Memory Consistency

Message Passing

Two Sided

One Sided

Eager

Rendezvous

SMP & DSM

UPC / PGAS (Distributed Address Spaces)


Shared Memory Programming on FPGAs: Convey HC-1 (2008)

FPGA Accelerators Today:
• ANSI C Programming
• Standard C Compilers
• Pointers
• Flat, Virtual Memory
• Run-Time Scheduler
• HW Managed Memory Consistency

Abstracted Away:
• HDL Design
• Timing Closure
• Fixed, Static Scheduling
• DMA Engine Programming
• SW Managed Memory Regions

Source: Convey Computer, 2008


Message Passing in Embedded: Arches (2009)

[Diagram, shown for two nodes: X86 MPI SW processes and Memory on an FSB; an MPI FSB Bridge into the FPGA; µB and PPC MPI SW processes; an HW MPE running an MPI HW “Process”; an MPI GT/GTX Serial I/O Bridge]

• Standard MPI Programming Model & API
• Light Weight Message Passing Protocol Implementation
• Focused on Embedded Systems
• Explicit Rank to Node Binding Support

Source: ArchesComputing, 2009


Message Passing Portability: Same standard MPI API, different cores

Rank 0 (X86):
main() {
    MPI_Send()
    MPI_Send()
    MPI_Send()
    MPI_Send()
    MPI_Recv()
    MPI_Recv()
    MPI_Recv()
    MPI_Recv()
}

Rank 1 (X86):
main() {
    MPI_Recv()
    Compute()
    MPI_Send()
}

Rank 2 (FPGA Soft RISC, MicroBlaze):
main() {
    MPI_Recv()
    Compute()
    MPI_Send()
}

Rank 3 (FPGA Hard RISC, PowerPC):
main() {
    MPI_Recv()
    Compute()
    MPI_Send()
}

Rank 4 (FPGA Hardware Engine, HDL):
Process ( ) {
    MPE_Recv()
    Compute()
    MPE_Send()
}

Source: ArchesComputing, 2009
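The communication pattern in the rank listings (rank 0 sends four work items, each worker receives, computes and replies, rank 0 collects four results) can be modelled in a single process. The sketch below is not MPI and not the Arches implementation: `toy_send`, `toy_recv`, the one-slot mailboxes, and the squaring `compute` are hypothetical stand-ins chosen so the example runs without an MPI runtime.

```c
#include <assert.h>

#define NUM_WORKERS 4                 /* ranks 1..4 in the listing */
#define NUM_RANKS (NUM_WORKERS + 1)   /* plus rank 0, the coordinator */

/* Toy transport: one single-message slot per (source, destination) pair.
 * This stands in for the MPI library and the interconnect. */
static int slot_full[NUM_RANKS][NUM_RANKS];
static int slot_value[NUM_RANKS][NUM_RANKS];

/* Hypothetical stand-in for MPI_Send / MPE_Send. */
static void toy_send(int src, int dst, int value) {
    assert(src >= 0 && src < NUM_RANKS);
    assert(dst >= 0 && dst < NUM_RANKS);
    assert(!slot_full[src][dst]);      /* slot must be free */
    slot_value[src][dst] = value;
    slot_full[src][dst] = 1;
}

/* Hypothetical stand-in for MPI_Recv / MPE_Recv. */
static int toy_recv(int dst, int src) {
    assert(src >= 0 && src < NUM_RANKS);
    assert(dst >= 0 && dst < NUM_RANKS);
    assert(slot_full[src][dst]);       /* a real receive would block here */
    slot_full[src][dst] = 0;
    return slot_value[src][dst];
}

/* Placeholder for Compute() in the listing. */
static int compute(int x) {
    return x * x;
}

/* Ranks 1..4: the same receive, compute, send body on every core type. */
static void worker(int rank) {
    int x = toy_recv(rank, 0);
    toy_send(rank, 0, compute(x));
}

/* Rank 0: send one work item per worker, then collect the results.
 * The workers run in between because this model is single-threaded. */
void run_rank0(const int *in, int *out) {
    for (int r = 1; r <= NUM_WORKERS; r++) {
        toy_send(0, r, in[r - 1]);
    }
    for (int r = 1; r <= NUM_WORKERS; r++) {
        worker(r);
    }
    for (int r = 1; r <= NUM_WORKERS; r++) {
        out[r - 1] = toy_recv(0, r);
    }
}
```

The portability argument is visible in `worker`: its body does not depend on what kind of core executes it, only on the send and receive interface. In a real deployment each rank would run concurrently on its own core or hardware engine.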


UPC on FPGAs

What
• RAMP Blue (UC Berkeley, 2007)
• 1008 MicroBlaze Cores @ 100 MHz

HW
• 12 Cores per FPGA (Virtex-II Pro, 130nm)
• 21 Boards (4 FPGAs per board + 1 Control FPGA)

Memory
• Distributed Memories
• Each Core has its own address space
• Message passing between cores

SW
• UPC running on top of Linux


Take Aways

• Memory Coherency Going Embedded
– Multi-Core CPUs with on-chip coherent NOCs
– Convey: Coherent, Shared Memory, X86-FPGA System

• Message Passing Going Embedded
– On-chip Coherency Too Expensive For Many-Core CPUs
– Arches: Message Passing on Hybrid X86-FPGA Systems

• Processors Evolve to Match Computing Needs
– uC, Multi-Core, Many-Core Machines

• Memory Models to Match Application Needs
– FPGAs Support SMP, DSM, Message Passing & Coherency

• Mainstream Programming Models
– C programmed, Runtime scheduled, Instruction set based FPGAs
– MPI API lightweight implementation for FPGAs

Challenge :

What memory and programming model do you want to see on FPGAs?


Knowledge Community

• A wide association of people with common technology interest with the intent to
– Progress their knowledge
– Enhance the skills of all members
– Preserve a legacy of lessons learned


Grow Knowledge Community

Wireless

Grow User Community

Hardware Platform

Reference Designs

Open Source Repository

Wired

Computing

STANFORD

UC BERKELEY


Multi-FPGA system for parallel computing research

Multi-university collaboration between top schools such as Berkeley, Stanford, MIT, University of Texas Austin, etc.

Microsoft Research Labs

RAMP BEE3, multi-FPGA board

