© Copyright 2019 Xilinx
Manuel Uhm
Director, Silicon Marketing
Chair of the Board, Wireless Innovation Forum (SDR Forum v2.0)
Jason Vidmar
Sr. System Architect – MILCOM / SATCOM / Machine Learning
AI and SDR:
Software Meets Hardware Again…
SDR Evolution
Figure 1: How successive generations of SDRs have come to dominate the radio industry and will continue to evolve.
Source: Manuel Uhm, Software-Defined Radio: To Infinity and Beyond, Military Embedded Systems, October 2016
Key semiconductor technology drivers:
• Moore’s Law
• FPGAs
• RFICs
• Analog/Digital Integration
AI Evolution
Source: Verhaert, 2019 Perspective on Artificial Intelligence Evolution
Key semiconductor technology drivers:
• Moore’s Law
• GPUs
• FPGAs
• ASICs
SDR & AI Payload Convergence
Cognitive Radar
Cognitive SIGINT
Cognitive EW
Cognitive Radio
Multi-Mission Situationally Aware Payload: Enabled by SDR and AI Technology
End of the Line for Processor Performance?
MOORE’S LAW: End of “PPA” Improvement
AMDAHL’S LAW: Multicore Hits Limit
DENNARD SCALING: Power Density Rises
Moving Forward: Domain-Specific Architectures (DSAs)
Source: John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 6/e. 2018
Figure: 40 Years of Processor Performance (performance vs. VAX 11-780, log scale from 1 to 100,000; years 1980 to 2015)
• CISC: 2x / 3.5 yrs (22%/yr)
• RISC: 2x / 1.5 yrs (52%/yr)
• End of Dennard Scaling, Multicore: 2x / 3.5 yrs (23%/yr)
• Amdahl’s Law: 2x / 6 yrs (12%/yr)
• End of the line? 2x / 20 yrs (3%/yr)
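The chart pairs each doubling time with an annual growth rate. As a rough cross-check, the sketch below converts between the two under a simple compound-growth assumption; the function names are illustrative and not taken from the source.

```python
import math

def annual_rate(doubling_years: float) -> float:
    """Compound annual growth rate implied by a doubling time in years."""
    return 2.0 ** (1.0 / doubling_years) - 1.0

def doubling_time(rate: float) -> float:
    """Doubling time in years implied by a compound annual growth rate."""
    return math.log(2.0) / math.log(1.0 + rate)

# (era, doubling time in years, %/yr as quoted on the chart)
ERAS = [
    ("CISC", 3.5, 22),
    ("RISC", 1.5, 52),
    ("Multicore", 3.5, 23),
    ("Amdahl's Law", 6.0, 12),
    ("End of the line?", 20.0, 3),
]

for name, years, quoted in ERAS:
    computed = annual_rate(years) * 100.0
    print(f"{name:18s} 2x/{years:>4}yrs -> {computed:5.1f}%/yr (chart: {quoted}%/yr)")
```

The two figures on the chart appear to be rounded independently: doubling every 1.5 years works out to roughly 59%/yr, while 52%/yr corresponds to doubling about every 1.66 years.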
Evolving Processor Landscape
Why ACAP?
Figure: Performance/Power Efficiency vs. Number of Applications for ASICs, ASSPs, FPGAs, General Purpose Processors, and ACAPs (Domain Specific Architecture)
The Adaptive Compute Acceleration Platform
ADAPTIVE
• Diverse Workloads in Milliseconds
• Future-Proof for New Algorithms
COMPUTE ACCELERATION
• Scalar Engines, Adaptable Engines, Intelligent Engines
• Multi-core Processing System, Programmable Logic, DSP (Vector-based & Fabric-based)
PLATFORM
• Development Tools, HW/SW Libraries, Run-time Stack
• SW Programmable Silicon Infrastructure
Enabling Data Scientists, SW Developers, HW Developers
Hardware Adaptable: Accelerating the Whole Application
Scalar | Adaptable | Intelligent
Arm Dual-Core Cortex-A72, Arm Dual-Core Cortex-R5F: Scalar, Sequential & App Processing
Flexible Parallel Compute, Data manipulation
AI Engines: ML & Signal Processing; Vector, Compute Intensive
Custom Memory Hierarchy
NETWORK-ON-CHIP: Any-to-Any Connectivity
TB/s of Bandwidth PL-to-AI Engine
128 GB/s of Memory B/W per Core
I/O
Delivering Deterministic Performance & Low Latency
Robust Device & Run-time Security
Heterogeneous Processing For Tactical Edge Systems (Example Applications):
• Adaptive Beamforming
• AJ
• Tactical Networking
• SAR Backprojection
• Spectrum Processing
• Machine Learning
Applications are combined into Domain Specific Architectures (DSAs)
Versal ACAP: A Platform for Software and Hardware Developers
User Application: C, C++, Python
Frameworks
Fully Software Programmable with Hardware Design Path
OS Drivers
Versal ACAP Device & Integrated Shell
Evaluation & Deployment Boards
Runtime
Scout
Vivado
Software Platform
Hardware Platform
IP Libraries
Possible Platform Example: Multi-Mission Situationally Aware UAV Payload with Versal ACAP
UAV Platform
Multi-Mission Applications: Comms, Radar, SIGINT, EW
ML Overlay
Scalar Engines
VERSAL ACAP
Versal ACAP Eval Board
AI Engines
Adaptable Engines
xfopenCV DSPlib
Xilinx Runtime (XRT)
Frameworks
Versal ACAP Roadmap
• HBM: Memory Integration
• AI Edge: Lowest power AI
• Premium: 112G SerDes, 600G Cores
• AI Core: AI Inference Throughput
• Prime: Broadest Application
• AI RF: AI with Integrated RF
Advanced SDR: Technologies and Challenges
Trends in SDR Pushing the Compute Boundary
[CAPACITY] 5G: 100X Complexity vs. 4G
Figure: 5G vs. 4G improvement factors of 10X, 20X, 10X, 100X, 100X and 3X (axis labels not recoverable)
Source: ETRI RWS-150029, 5G Vision and Enabling Technologies, Dec. 2015.
[RESILIENCY] Operations in Contested Spectrum
[AUTONOMY] Rise of Deep Learning (Dawn of Next Wave of AI)
300,000X! AlexNet to AlphaGo Zero
Source: “AI and Compute,” OpenAI. May 2018.
Enabling Technologies
˃ Direct-RF / High-IF Sampling Data Converters
˃ Array Antennas
˃ Compute Optimizations for Deep Learning
Array Antennas
Figure: Controlled Reception Pattern Array (CRPA) beam patterns (source: gpsworld.com).
Figure: mMIMO Spatial Multiplexing and Beamforming (5G) (Matheus, 2016).
Deep Learning Classification
Figure: Image Input gives a Classification Result such as “Cat”, “Dog”, “Bird”…; Non-image Input (RF) gives a Classification Result such as “QPSK”, “BPSK”, “8PSK”…
Animation credit: Philip Leone, Univ. of Sydney. Presentation.
Advanced SDR: Compute Comparisons
Space Time Adaptive Processing
Application Example: Beamforming/Nulling (Comms / Anti-Jam)
$W = R_{xx}^{-1} b$ (figure label: Steering Vector)
Covariance Matrix Decomposition: QR, Cholesky, etc.
• Complex-valued
• Higher Precision Desirable (e.g., SPFP32)
• Typical FLOPS: up to MFLOPS per Decomposition
References: “Implementing a Real-Time Beamformer on an FPGA Platform.” Xilinx. Xcell Journal 60.
See also: Xilinx WP452 “Adaptive Beamforming for Radar: Floating-Point QRD+WBS in an FPGA”

Deep Learning Inference (Conv. Nets)
Application: Modulation Recognition, Waveform Classification
$Y = X * K$ (Convolutional Layer Processing; figure labels X, K, Y)
• Real-valued
• Lower Precision Desirable (e.g., INT8)
• Typical OPS: 7.6 GOPS (Resnet-50 unpruned)
Figure: Resnet-50 visualization. Kaggle.com
References: “Applied Deep Learning - Part 4: Convolutional Neural Networks”, Towards Data Science (blog).
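To make the beamforming/nulling expression W = Rxx^-1 * b concrete, here is a minimal pure-Python sketch. It assumes b is the steering vector toward the look direction and builds Rxx analytically from unit noise plus one jammer. The array size, angles, jammer power, and helper names are all made up for illustration, and a plain Gaussian-elimination solve stands in for the QR or Cholesky decomposition named on the slide.

```python
import cmath
import math

def steering_vector(n_elems, angle_deg, spacing=0.5):
    """Uniform linear array response; spacing in wavelengths (assumed 0.5)."""
    phase = 2.0 * math.pi * spacing * math.sin(math.radians(angle_deg))
    return [cmath.exp(1j * phase * k) for k in range(n_elems)]

def solve(a, b):
    """Solve a @ x = b (complex) by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0j] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def response(w, a):
    """Magnitude of the array output w^H a for a signal with steering vector a."""
    return abs(sum(wk.conjugate() * ak for wk, ak in zip(w, a)))

N = 4                            # number of antenna elements (assumed)
b = steering_vector(N, 0.0)      # look direction (assumed broadside)
aj = steering_vector(N, 40.0)    # jammer direction (assumed)
JNR = 1000.0                     # jammer-to-noise power ratio (assumed)

# Rxx = unit noise power on the diagonal + jammer power * aj aj^H
rxx = [[(1.0 if i == k else 0.0) + JNR * aj[i] * aj[k].conjugate()
        for k in range(N)] for i in range(N)]

w = solve(rxx, b)                # W = Rxx^-1 * b
print("gain toward look direction:", response(w, b))
print("gain toward jammer:        ", response(w, aj))
```

The weights keep most of the gain toward the look direction while placing a deep null on the jammer. Every value is complex, which is why this workload favors higher-precision arithmetic than INT8 inference.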
AI Engine: Multi-Precision Math Support
Real Data Types: MACs / Cycle (per core)
  32x32 SPFP              8
  32x32 Real              8
  32x16 Real             16
  16x16 Real             32
  16x8 Real              64
  8x8 Real              128
Complex Data Types: MACs / Cycle (per core)
  32x32 Complex           2
  32x16 Complex           4
  16x16 Complex           8
  16 Complex x 16 Real   16
Optimized For:
• Linear Algebra: Matrix-Matrix Mult, Matrix-Vector Mult
• Convolution: FIR Filters, 2-D Filters
• Transforms: FFTs/IFFTs, DCT, etc
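As a rough illustration of how the MACs/cycle figures translate into kernel cost, the sketch below counts the multiply-accumulates in a direct-form FIR filter and divides by the per-core rate for each real data type. The rates are read from the chart above; the filter, the sample data, and the function names are made up, and the cycle count is an idealized lower bound that ignores data movement and pipeline overhead.

```python
import math

# Real-data MACs per cycle per core, as read from the chart.
MACS_PER_CYCLE = {
    "8x8": 128,
    "16x8": 64,
    "16x16": 32,
    "32x16": 16,
    "32x32": 8,
}

def fir(samples, taps):
    """Direct-form FIR filter; returns (outputs, number of MACs performed)."""
    out, macs = [], 0
    for n in range(len(taps) - 1, len(samples)):
        acc = 0
        for k, h in enumerate(taps):
            acc += h * samples[n - k]   # one multiply-accumulate
            macs += 1
        out.append(acc)
    return out, macs

def min_cycles(macs, precision):
    """Lower bound on cycles if every vector lane is busy on every cycle."""
    return math.ceil(macs / MACS_PER_CYCLE[precision])

samples = list(range(1, 1025))   # 1024 made-up integer samples
taps = [1, 2, 3, 2, 1, 2, 3, 2]  # 8 made-up integer taps
y, macs = fir(samples, taps)
for precision in MACS_PER_CYCLE:
    print(f"{precision:>6}: at least {min_cycles(macs, precision)} cycles for {macs} MACs")
```

Halving the operand width doubles the MAC rate in this table, so the same filter needs roughly half the cycles at each step down in precision.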
AI Engine: Scalar Unit, Vector Unit, Load Units and Memory
32-bit Scalar RISC Processor
Scalar Unit: Scalar Register File, Scalar ALU, Non-linear Functions
Vector Unit (Vector Processor, 512-bit SIMD Datapath): Vector Register File, Fixed-Point Vector Unit, Floating-Point Vector Unit
Instruction Fetch & Decode Unit
AGU, AGU, AGU
Load Unit A, Load Unit B, Store Unit
Highly Parallel
Memory Interface
Stream Interface
Local, Shareable Memory
• 32KB Local, 128KB Addressable
Instruction Parallelism: VLIW
7+ operations / clock cycle
• 2 Vector Loads / 1 Mult / 1 Store
• 2 Scalar Ops / Stream Access
Data Parallelism: SIMD
Multiple vector lanes
• Vector Datapath
• 8 / 16 / 32-bit & SPFP operands
Up to 128 MACs / Clock Cycle per Core (INT 8)
8 FLOPs / Clock Cycle (32SPFP)
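For a sense of scale, the per-core figures can be multiplied out to a device-level peak. This back-of-the-envelope sketch assumes a clock of exactly 1 GHz (the deck says "1GHz+"), the 400 AI Engines quoted for the VC1902 in the Summary, and the common convention of counting one MAC as two operations. The constant and function names are illustrative only, and real kernels will land below these ceilings.

```python
CLOCK_HZ = 1.0e9            # assumption: exactly 1 GHz; the deck says "1GHz+"
NUM_ENGINES = 400           # VC1902 figure quoted in the Summary
INT8_MACS_PER_CYCLE = 128   # per core, as quoted on the slide
SPFP_FLOPS_PER_CYCLE = 8    # per core, as quoted on the slide

def peak_int8_tops(engines=NUM_ENGINES, clock_hz=CLOCK_HZ):
    """Peak INT8 tera-operations per second, counting each MAC as two ops."""
    return engines * INT8_MACS_PER_CYCLE * 2 * clock_hz / 1e12

def peak_spfp_gflops(engines=NUM_ENGINES, clock_hz=CLOCK_HZ):
    """Peak single-precision GFLOPS from the quoted FLOPs per cycle."""
    return engines * SPFP_FLOPS_PER_CYCLE * clock_hz / 1e9

print(f"Peak INT8: {peak_int8_tops():.1f} TOPS")
print(f"Peak SPFP: {peak_spfp_gflops():.0f} GFLOPS")
```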
AI Engine: Terminology
Versal ACAP
AI Engine Array
Figure: grid of AI Engines, each paired with a Memory block
AI Engine Tile
• Interconnect
• ISA-based Vector Processor: AI Vector Extensions, 5G Vector Extensions
• Local Memory
• Data Mover
AI Engine: 1GHz+ VLIW / SIMD vector processor
• Scalar Unit: Scalar Register File, Scalar ALU, Non-linear Functions
• Vector Unit: Vector Register File, Fixed-Point Vector Unit, Floating-Point Vector Unit
• Instruction Fetch & Decode Unit
• AGU, AGU, AGU
• Load Unit A, Load Unit B, Store Unit
• Memory Interface
• Stream Interface
AI Inference Mapping on Versal™ ACAP
Scalar | Adaptable | Intelligent
Figure labels:
• NETWORK-ON-CHIP
• Arm Dual-Core Cortex-A72, Arm Dual-Core Cortex-R5
• AI Engines: Convolution Layers, Fully Connected Layers
• PL: Weight Buffer (URAM), Activation Buffer (URAM), Max Pool, ReLU
• External Memory (e.g., DDR)
• I/O
˃ Custom memory hierarchy
˃ Buffer on-chip vs off-chip; Reduce latency and power
˃ Stream Multi-cast on AI interconnect
˃ Weights and Activations
˃ Read once: reduce memory bandwidth
˃ AI-optimized vector instructions (128 INT8 mults/cycle)
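A = Activations
W = Weights
$$
\begin{bmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{bmatrix}
\times
\begin{bmatrix} W_{00} & W_{01} \\ W_{10} & W_{11} \end{bmatrix}
=
\begin{bmatrix} A_{00} W_{00} + A_{01} W_{10} & \cdots \\ A_{10} W_{00} + A_{11} W_{10} & \cdots \end{bmatrix}
$$
(4x8) × (8x4) = (4x4)
Figure: AI Engines linked by a Cascade Stream, with A00, A10 and W00 shown as engine inputs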
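The block product above can be sketched in plain Python. This is a minimal model, not the actual AI Engine mapping. It assumes the whole product is (4x8) × (8x4), split into 2x2 blocks so that each block multiply stands in for one engine, with the running partial sum playing the role of the value passed along the cascade stream. The data and helper names are made up.

```python
def matmul(a, b):
    """Reference matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def block(m, r0, c0, rows, cols):
    """Extract a rows x cols sub-matrix starting at (r0, c0)."""
    return [row[c0:c0 + cols] for row in m[r0:r0 + rows]]

def add(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]

def blocked_matmul(a, w):
    """(4x8) x (8x4) computed as 2x2 blocks of 2x4 and 4x2 tiles."""
    c = [[0] * 4 for _ in range(4)]
    for bi in range(2):              # block row of A and C
        for bj in range(2):          # block column of W and C
            partial = [[0, 0], [0, 0]]   # value carried along the cascade
            for bk in range(2):      # engines chained along the inner dimension
                a_blk = block(a, 2 * bi, 4 * bk, 2, 4)   # A[bi][bk], 2x4
                w_blk = block(w, 4 * bk, 2 * bj, 4, 2)   # W[bk][bj], 4x2
                partial = add(partial, matmul(a_blk, w_blk))
            for i in range(2):
                for j in range(2):
                    c[2 * bi + i][2 * bj + j] = partial[i][j]
    return c

A = [[r * 8 + c for c in range(8)] for r in range(4)]         # made-up activations
W = [[(r + c) % 3 - 1 for c in range(4)] for r in range(8)]   # made-up weights
print(blocked_matmul(A, W) == matmul(A, W))
```

Each weight block is used by two output blocks (for example, W00 feeds both the A00 and A10 rows), which is where multicasting a value that is read once saves memory bandwidth.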
Program Directly From High-level ML Frameworks
Frameworks
AI Engine Delivers High Compute Efficiency
Vector Processor Efficiency (Peak Kernel Theoretical Performance)
  ML Convolutions   95%   Block-based Matrix Multiplication (32×64) × (64×32)
  FFT               80%   1024-pt FFT/iFFT
  DPD               98%   Volterra-based forward-path DPD
˃ Adaptable, non-blocking interconnect
Flexible data movement architecture
Avoids interconnect “bottlenecks”
˃ Adaptable memory hierarchy
Local, distributed, shareable = extreme bandwidth
No cache misses or data replication
Extend to PL memory (BRAM, URAM)
˃ Transfer data while AI Engine Computes
Overlap Compute and Communication
Figure: timeline in which each Comm (transfer) phase runs in parallel with a Compute phase
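A simple timing model shows why overlapping transfers with computation helps. It assumes double buffering, so the transfer for block i+1 proceeds while block i is being processed. The block count and cycle figures are made up, and the function names are illustrative.

```python
def serial_time(n_blocks, comm, compute):
    """Each block is transferred and then processed, with no overlap."""
    return n_blocks * (comm + compute)

def overlapped_time(n_blocks, comm, compute):
    """Double buffering: block i+1 is transferred while block i is processed."""
    if n_blocks == 0:
        return 0
    # First transfer, then (n-1) steps paced by the slower stage, then last compute.
    return comm + (n_blocks - 1) * max(comm, compute) + compute

N_BLOCKS, COMM, COMPUTE = 8, 30, 100   # made-up cycle counts
t_serial = serial_time(N_BLOCKS, COMM, COMPUTE)
t_overlap = overlapped_time(N_BLOCKS, COMM, COMPUTE)
busy = N_BLOCKS * COMPUTE / t_overlap  # fraction of time spent computing

print(f"serial: {t_serial} cycles, overlapped: {t_overlap} cycles")
print(f"compute utilization with overlap: {busy:.1%}")
```

When a transfer is shorter than the compute phase, only the first transfer is exposed, and utilization approaches 100% as the number of blocks grows.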
Summary
˃ The evolution of processing for AI is following a similar
track to SDR where hardware and software need to be
tightly coupled
˃ The drive for more Capacity, Autonomy and Resiliency in
advanced SDRs carries high compute demands and calls for
mixed-precision processing capabilities
˃ Moore’s Law is running out of steam, which means the
goal of a SWaP-friendly, multi-mission, situationally aware
payload requires advancements in processing beyond
just process technology
˃ ACAPs are a response to this new reality
Xilinx VC1902 Versal ACAP with 400 AI Engines.
First shipment June 2019.
Visit https://www.xilinx.com/products/silicon-devices/acap/versal.html for datasheets, whitepapers,
and product tables.
Adaptable.
Intelligent.
THANK YOU!
Contact Info: