NoCs: It Is About the Memory and the Programming Model
Ivo Bolsens, Sr. VP and CTO
Xilinx, Inc., 2009
FPGA Platform
• Optimized FPGA feature mix for various applications
  – LXT: General Logic + Serial
  – SXT: Rich DSP & BRAM + Serial
  – HXT: Highest Bandwidth Serial
• Ultimate flexibility
  – Change FPGA feature mix at any time during your design / product lifecycle
[Die diagram: SelectIO Logic, Clock Manager, DSP, Serial Transceiver, BRAM, PCI Express / EMAC]
[Chart: FPGA Capacity Trends — Number of LCs (log scale, 1E+02 to 1E+09) vs. Year (1985–2025); Historical Data and Largest Xilinx FPGA; ITRS projection for 2013: 2.6M LCs (3.1B transistors)]
[Chart: FPGA Performance Trends — System Speed (MHz, log scale 10–10000) vs. Year (1985–2025); historical FPGA data. 2007: 325 MHz typical, 500 MHz max; 2013: 500 MHz typical, 750 MHz max]
[Chart: Price Per Logic Cell — $/LC (log scale, 0.0001–1) vs. Year (1990–2015)]
Circuit Switched Point to Point
Example: Xilinx Virtex FPGA
• Staggered, Segmented Routing
• Wide Reach with Few Hops
• Circuit Switched Interconnect
  – Dedicated path A => B
• Guaranteed timing (frequency and latency)
• Static scheduling (requires place and route)
Circuit Switching Guarantees Timing
Focus: Real Time Design
Each Square = 1 CLB (Configurable Logic Block)
Each Hop = Active Buffer Switch
The FPGA Ecosystem
Digital Logic • Computer Architecture • Embedded Systems • Configurable Computing • Digital Signal Processing • FEC Coding • Encryption • Networking • Robotics • High Performance Computing • Video • Image & Video Processing • Dynamically Reconfigurable Systems • Hardware Compilation • Hardware-Software Co-Design • Speech Recognition • Programmable Hardware Architectures • Surveillance • CAD Tools
Processor Metrics
FPGA:
• Massively parallel with pipelined throughput
• Distributed, granular memory architecture

| Performance Metric | Intel Xeon 7350 (Quad Core) | Xilinx Virtex-5 SX240T | Delta |
| Theoretical Issue Rate | 47 Billion 64-bit Ops/Sec | 2.59 Trillion 64-bit Ops/Sec | 55.1X |
| Integer Operators | Classical 8/16/32/64-bit | Programmable to any bit size | — |
| FLOPs (Mul+Add) | 94 Gflop/s SP | 204 Gflop/s SP | 2.2X |
| Pipeline Depth | 14 Stages | Programmable to any depth | — |
| BW to Memory | CPU to MCH (FSB): 8.5 GB/s @ 1066 MHz | FPGA to MCH (FSB): 8.5 GB/s @ 1066 MHz | = |
| L2 Cache BW vs. FPGA to Local Memory BW (opt.) | 94 GB/Sec | 28.8 GB/Sec | 0.3X |
| L1 Cache BW vs. Block RAM BW | 188 GB/Sec | 3.3 TB/Sec | 9.6X |
| Register File BW vs. LUTRAM BW | 750 GB/Sec | 7.5 TB/Sec | 13.9X |
| Power | 130 W | 30 W | 0.2X |
FPGA System Interconnect Evolution
2005 (FSB / PCIe): Circuit Switched
• Co-processing
  – Non-coherent accelerator
  – Software managed memory consistency
  – IO Device programming model
  – DMA engines
Device Centric Shared Memory Programming: User Managed Memory Coherency
1. flushSourceToMem()
2. setupDMA()
3. HW Process()
   – A. if DMA'event …
   – B. DMAreadFromMainMem()
   – C. HWcomputeProcess()
   – D. DMAwriteToMainMem()
4. SignalDoneIRQ()
5. waitForHWDone()
6. rebuildCacheFromMem()
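The steps above can be sketched in software. In this minimal Python simulation the function names mirror the slide's hypothetical API, implemented as logging stubs; the point it illustrates is that with user-managed coherency, every coherency action is explicit CPU work.

```python
# Simulation of the user-managed-coherency offload flow (hypothetical
# names from the slide, implemented as logging stubs).

log = []

def flush_source_to_mem():      log.append("flush caches -> main memory")
def setup_dma():                log.append("program DMA descriptors")
def hw_process():
    # Hardware side, triggered by the DMA event.
    log.append("DMA read from main memory")
    log.append("HW compute")
    log.append("DMA write to main memory")
def signal_done_irq():          log.append("raise done IRQ")
def wait_for_hw_done():         log.append("CPU wakes on IRQ")
def rebuild_cache_from_mem():   log.append("re-read results into cache")

# Nothing here is coherent by itself -- software sequences every step.
flush_source_to_mem()
setup_dma()
hw_process()
signal_done_irq()
wait_for_hw_done()
rebuild_cache_from_mem()
print(len(log))  # 8 explicit steps for a single offload
```

Contrast this with the hardware-managed model of the later slides, where the flush and rebuild steps disappear because the fabric keeps caches consistent.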
[Diagram: PCI Express tree — CPU and Memory at the Root Complex; Graphics on a 16x link; switches fanning out to x1, x2, and x8 endpoints and a legacy endpoint behind a PCI bridge. Numbered markers 1–5 trace the DMA flow.]
FPGA System Interconnect Evolution
2005 (FSB / PCIe): Circuit Switched
• Co-processing
  – Non-coherent accelerator
  – Software managed memory consistency
  – IO Device programming model
  – DMA engines
2008 (FSB / PCIe): Transaction Based
• Peer processing
  – Coherent accelerator
  – Hardware managed memory consistency
  – Shared memory programming model
Xeon 7300 System Platforms
Hybrid SMP & DSM + Accelerator: Convey HC-1 (2008)
• Socket Filler Module
• Bridge FPGA
• Implements FSB Protocol
• Full Snoop Support
• FPGA Based Compute Accelerator
• Pre-Defined Vector Instruction Set
• Shared Memory Programming Model
• ANSI C Support
• Accelerator Cache Memory
• 80 GB/s BW
• Snoop Coherent with System Memory
• Direct Cache Access CPU <-> FPGA
[Diagram: eight MC LX155 memory-controller FPGAs]
Source: Convey Computer, 2008
FPGA System Interconnect Evolution
2005 (FSB / PCIe): Circuit Switched
• Co-processing
  – Non-coherent accelerator
  – Software managed memory consistency
  – IO Device programming model
  – DMA engines
2008 (FSB / PCIe): Transaction Based
• Peer processing
  – Coherent accelerator
  – Hardware managed memory consistency
  – Central memory
  – Shared memory programming model
2009 (QPI / PCIe): Packet Switched
• Scalable peer processing
  – Scalable coherency
  – Directory + Snooping
  – Distributed memory
  – Shared memory programming model
Point To Point
Examples:
• AMD HyperTransport
• Intel QuickPath
Key Facts:
• Narrower buses, higher frequency
• Multiple active masters (N links => N × throughput)
• Single Hop (fully connected)
• Scalable topologies (packet switched interconnect)
  – 1D Ring, 2D Mesh, 3D Torus
Point to Point:
Makes Interconnect Scalable
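The "N links => N × throughput" claim can be made concrete with a back-of-envelope comparison (illustrative numbers, not vendor figures): a shared bus carries one transfer at a time, while every point-to-point link can be active concurrently.

```python
# Aggregate throughput: shared bus vs. point-to-point links
# (illustrative model; link_gbps and node counts are made up).

def bus_throughput(link_gbps, n_nodes):
    # Only one master drives a shared bus at a time,
    # so aggregate throughput is flat regardless of node count.
    return link_gbps

def p2p_throughput(link_gbps, n_links):
    # Each point-to-point link carries an independent transfer,
    # so aggregate throughput scales with the link count.
    return link_gbps * n_links

print(bus_throughput(10, 8))   # 10 -- flat as the system grows
print(p2p_throughput(10, 8))   # 80 -- N links => N x throughput
```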
Distributed Shared Memory (DSM): AMD Hammer (Opteron) (2002)
Source: AMD, HotChips 14, Fall 2002
• Distributed Shared Memory (DSM)
• Per-node Memory Controller
• Global Address Space
• Snoopy Coherency
[Axes: Programming Model — Interconnect — Memory Model]
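Snoopy coherency can be illustrated with a minimal write-invalidate sketch: every write is broadcast and all other caches drop their copy, so the next read refetches the fresh value. This is a toy protocol for intuition only, not AMD's actual Hammer protocol (which uses MOESI states).

```python
# Toy write-invalidate snooping cache (illustration, not MOESI).

class Cache:
    def __init__(self, name):
        self.name, self.lines = name, {}    # addr -> cached value

    def read(self, addr, memory):
        if addr not in self.lines:          # miss: fill from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]

    def write(self, addr, value, memory, peers):
        for p in peers:                     # snoop: invalidate other copies
            p.lines.pop(addr, None)
        self.lines[addr] = value
        memory[addr] = value                # write-through, for simplicity

memory = {0x100: 1}
c0, c1 = Cache("cpu0"), Cache("cpu1")
c0.read(0x100, memory)                      # both caches hold addr 0x100
c1.read(0x100, memory)
c0.write(0x100, 42, memory, peers=[c1])     # broadcast invalidates cpu1
print(c1.read(0x100, memory))               # 42 -- refilled from memory
```

The broadcast in `write` is exactly what a shared bus gives you "for free" and what packet-switched interconnects must recreate with directories, as the 2009 column of the evolution slide notes.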
Taxonomy of Large Multiprocessors
Multiprocessors
├─ Shared Address Space
│  ├─ Symmetric Shared Memory
│  └─ Distributed Shared Memory
│     ├─ Cache Coherent (ccNUMA)
│     └─ Non Coherent
└─ Distributed Address Space
   ├─ Commodity Cluster or Custom Cluster
   └─ Uniform Cluster or Cluster of SMPs or DSMs
Source: Asanovic, UCB, CS252 Class Notes, Fall 2007
Distributed Address Space: Hybrid CPU + FPGA Systems
[Diagram: two mirrored boards. On each, X86 SW processes with local memory sit on an FSB; an FSB bridge connects into an on-FPGA NoC linking µB (MicroBlaze) and PPC processors and NoC HW MPEs; the boards interconnect over GT/GTX serial I/O. Each node has a unique address space.]
Source: Arches Computing, 2009
• Multiple Private Memory Spaces
• Multiple Compute Nodes: X86, Embedded CPUs, FPGA HW
Distributed Address Space: Intel Polaris (2007)
Source: Intel, HotChips 19, Fall 2007
80-Tile Many-Core CPU With Tiled, Distributed, Private Memories
• 8x10 Tiled Design
• 160 SP FP Units (two FPMACs per tile)
• Private Memory per Tile
• 5 Ported Router per Tile
• 2D NOC Mesh Interconnect
• Wormhole Routing
• Message Passing Instructions
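On a 2D mesh like Polaris's, a common routing policy is dimension-order (XY) routing: travel fully along X first, then along Y, which is deadlock-free on a mesh. The sketch below illustrates that policy in general; it is not claimed to be Intel's exact router algorithm.

```python
# Dimension-order (XY) routing on a 2D mesh -- a standard NoC routing
# policy, shown here as a generic sketch.

def xy_route(src, dst):
    (x, y), (dx, dy) = src, dst
    hops = []
    while x != dx:                       # resolve the X dimension first
        x += 1 if dx > x else -1
        hops.append((x, y))
    while y != dy:                       # then the Y dimension
        y += 1 if dy > y else -1
        hops.append((x, y))
    return hops

path = xy_route((0, 0), (7, 9))          # corner to corner of an 8x10 mesh
print(len(path))                         # 16 hops = 7 + 9
```

Wormhole flow control then pipelines each packet's flits along this path, so latency grows with hop count plus packet length rather than their product.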
Taxonomy of Networks
Networks
├─ Shared Bus
└─ Point to Point
   ├─ Fully Connected
   │  ├─ Circuit Switched
   │  └─ Transaction Switched
   └─ Switched
      ├─ Switching: Circuit Switched or Packet Switched
      └─ Topologies: 1D Ring; 2D Mesh (Regular, Staggered, Hierarchical); Torus and other 3D; Central Xbar; Distributed Xbar
Shared Bus
Key Facts:
• All Masters Connected to All Slaves (limits frequency)
• Only one active Master (limits throughput)
• Broadcast "built in by design" (snooping simple to implement)
• Used when wiring resources are scarce
• Split transactions allow pipelined phases (helps throughput somewhat)
  – request, snoop, data response
[Diagram: four µP + cache nodes sharing one bus to a central memory]
Throughput and Scalability Issues
Examples:
• Intel FSB (early generations)
• ARM AMBA 1/2 (before AXI)
• IBM CoreConnect PLB/OPB
Point To Point
Examples:
• AMD HyperTransport
• Intel QuickPath
• ARM AXI
Key Facts:
• Narrower buses, higher frequency
• Multiple active masters (N links => N × throughput)
• Single Hop (fully connected)
• Scalable topologies (packet switched interconnect)
  – 1D Ring, 2D Mesh, 3D Torus
Point to Point:
Makes Interconnect Scalable
On-Chip Interconnect Drivers
[Table: Shared Bus vs. Point To Point vs. Network On Chip, rated on Latency, Bandwidth, Performance Scaling, IP Reuse, Quality of Service, and Resource Usage. Resource Usage row: Shared Bus = Best, Point To Point = Square, NOC = Linear.]
Point To Point Is Best For Small-Medium Systems
NOC Is Needed As Complexity Grows
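The Square-vs-Linear resource trend follows from simple link counting: fully connecting N nodes point-to-point needs N(N-1)/2 links, while a 2D mesh NoC needs roughly 2N.

```python
# Link counts behind the "Square" vs. "Linear" resource-usage ratings.

def fully_connected_links(n):
    # Every node pairs with every other node: grows as N^2.
    return n * (n - 1) // 2

def mesh_links(rows, cols):
    # Horizontal links in each row plus vertical links in each column:
    # grows linearly in the node count.
    return rows * (cols - 1) + cols * (rows - 1)

print(fully_connected_links(64))         # 2016 links
print(mesh_links(8, 8))                  # 112 links for the same 64 nodes
```

At 64 nodes the fully connected fabric already needs about 18× the links of a mesh, which is why the slide concludes that a NoC is needed as complexity grows.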
SOC Integration Trends
MCU — Circuit Switched (e.g. Generic MCU)
– CPU + peripherals
– Industrial, Motor Control, Display Interface
– Freescale, Renesas, Microchip
SOC — NOC Transaction Fabric (e.g. TI OMAP 3430 with Multi-Layer AXI3)
– CPU + peripherals
– Integrated Accelerators (Video, Networking)
– Sonics
– TI OMAP3, Samsung, ST Nomadik
SMP — NOC Transaction Fabric With Coherency (e.g. ARM Cortex A9 SMP with Accelerator Coherency Port)
– MP CPU with coherency
– Accelerators with coherency
– Peripherals
– Intel Atom, TI OMAP4
Taxonomy of Programming Models
Programming Models
├─ Streaming
│  ├─ Packetized
│  └─ Endless
├─ Shared Memory (ShMem)
│  ├─ User Managed Memory Spaces
│  └─ HW Managed Memory Consistency
│     ├─ SMP & DSM
│     └─ UPC / PGAS (Distributed Address Spaces)
└─ Message Passing
   ├─ Two Sided (Eager or Rendezvous)
   └─ One Sided
Shared Memory Programming on FPGAs: Convey HC-1 (2008)
FPGA Accelerators Today:
• ANSI C Programming
• Standard C Compilers
• Pointers
• Flat, Virtual Memory
• Run-Time Scheduler
• HW Managed Memory Consistency
Abstracted Away:
• HDL Design
• Timing Closure
• Fixed, Static Scheduling
• DMA Engine Programming
• SW Managed Memory Regions
Source: Convey Computer, 2008
Message Passing in Embedded: Arches (2009)
[Diagram: two mirrored nodes. On each, X86 MPI SW processes with memory connect over the FSB through an MPI FSB Bridge to an on-FPGA fabric hosting a µB MPI SW process, a PPC MPI SW process, and an HW MPE running an MPI HW "process"; the two FPGAs link via an MPI GT/GTX Serial I/O Bridge.]
• Standard MPI Programming Model & API
• Light Weight Message Passing Protocol Implementation
• Focused on Embedded Systems
• Explicit Rank to Node Binding Support
Source: Arches Computing, 2009
Message Passing Portability: Same Standard MPI API, Different Cores
Rank 0:
main() {
…
MPI_Send()
MPI_Send()
MPI_Send()
MPI_Send()
MPI_Recv()
MPI_Recv()
MPI_Recv()
MPI_Recv()
…
}
Rank 1:
main() {
…
MPI_Recv()
Compute()
MPI_Send()
…
}
Rank 2:
main() {
…
MPI_Recv()
Compute()
MPI_Send()
…
}
Rank 3:
main() {
…
MPI_Recv()
Compute()
MPI_Send()
…
}
Rank 4:
( HDL )
Process ( ) {
…
MPE_Recv()
Compute()
MPE_Send()
…
}
Ranks 0, 1: X86 — Rank 2: FPGA Soft RISC (MicroBlaze) — Rank 3: FPGA Hard RISC (PowerPC) — Rank 4: FPGA Hardware Engine
Source: Arches Computing, 2009
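The scatter/gather pattern on this slide can be simulated in plain Python, with queues standing in for the MPI/MPE transports (the `send`/`recv` helpers are hypothetical stand-ins, not real MPI calls). The point of the portability argument survives the translation: rank 0's code never changes whether a worker rank runs on a CPU or an FPGA engine.

```python
# Queue-based simulation of the rank 0 scatter / ranks 1-4 compute /
# rank 0 gather pattern (send/recv are stand-ins for MPI_Send/MPI_Recv).

from queue import Queue

inbox = {r: Queue() for r in range(5)}   # one mailbox per rank

def send(dst, data):   inbox[dst].put(data)
def recv(rank):        return inbox[rank].get()

def worker(rank):
    # Ranks 1-4 all run the same recv / Compute / send loop body;
    # squaring stands in for the slide's Compute() step.
    x = recv(rank)
    send(0, x * x)

for r, v in zip(range(1, 5), [1, 2, 3, 4]):
    send(r, v)                           # rank 0: four MPI_Send()s
for r in range(1, 5):
    worker(r)                            # run each worker once
results = sorted(recv(0) for _ in range(4))
print(results)                           # [1, 4, 9, 16]
```

Swapping a worker's implementation from software to a hardware MPE changes only which engine drains the mailbox, exactly as ranks 1-3 (C) and rank 4 (HDL) differ only inside their process bodies.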
UPC on FPGAs
What
• RAMP Blue (UC Berkeley, 2007)
• 1008 MicroBlaze Cores @ 100 MHz
HW
• 12 Cores per FPGA (Virtex-II Pro, 130nm)
• 21 Boards (4 FPGAs per board + 1 Control FPGA)
Memory
• Distributed Memories
• Each Core has its own address space
• Message passing between cores
SW
• UPC running on top of Linux
Takeaways
• Memory Coherency Going Embedded
  – Multi-Core CPUs with on-chip coherent NOCs
  – Convey: Coherent, Shared Memory, X86-FPGA System
• Message Passing Going Embedded
  – On-chip Coherency Too Expensive For Many-Core CPUs
  – Arches: Message Passing on Hybrid X86-FPGA Systems
• Processors Evolve to Match Computing Needs
  – uC, Multi-Core, Many-Core Machines
• Memory Models to Match Application Needs
  – FPGAs Support SMP, DSM, Message Passing & Coherency
• Mainstream Programming Models
  – C programmed, Runtime scheduled, Instruction set based FPGAs
  – MPI API lightweight implementation for FPGAs
Challenge:
What memory and programming model do you want to see on FPGAs?
Knowledge Community
• A wide association of people with a common technology interest, with the intent to
  – Progress their knowledge
  – Enhance the skills of all members
  – Preserve a legacy of lessons learned
Grow Knowledge Community: Wireless, Wired, Computing
Grow User Community: Hardware Platform, Reference Designs, Open Source Repository
RAMP
• Multi-FPGA system for parallel computing research
• Multi-university collaboration between top schools such as UC Berkeley, Stanford, MIT, University of Texas at Austin, etc.
• Microsoft Research Labs
• RAMP BEE3, multi-FPGA board