Microsoft’s Production Configurable Cloud
Derek Chiou
Microsoft Azure Cloud Silicon
UT Austin
H2RC Nov 14, 2016 1
Today’s Data Centers
• O(100K) servers/data center
• Very dense, maximize number of servers
• Tens of MegaWatts
• Strict power and cooling requirements
• Secure, hot, noisy
• Incrementally upgraded
  • 3 year server depreciation, upgraded quarterly
• Applications change very rapidly (weekly, monthly)
• Many advantages including economies of scale, data all in one place, etc.
• At data center scales, don’t need to get an order of magnitude improvement to make sense
• Positive ROI at large scale easier to achieve
• How can we improve efficiencies?
Efficiency via Specialization
[Figure labels: ASICs, FPGAs. Source: Bob Brodersen, Berkeley Wireless group]
What Does a Data Center Server With an FPGA Look Like?
Depends on your point of view
“Offload” Accelerator view of Server
[Diagram labels: CPU, DRAM, NIC, network, Storage, Acc (x3)]
Intel MCP
Our View of a Data Center Computer
[Diagram labels: network, FPGA, CPU (x2), DRAM (x3), Acc, Storage]
Benefits
• Software receives packets slowly
  • Interrupt or polling
  • Parse packet, start right work
• FPGA processes every packet anyway
  • Packet arrival is an event that FPGA deals with
  • Identify FPGA work, pass CPU work to CPU
• Map common case work to FPGA
  • Processor never sees packet
  • Can read/modify system memory to keep app state consistent
• CPU is complexity offload engine for FPGA!
• Many possibilities
  • Distributed machine learning
  • Software defined networking
  • Memcached get
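The dispatch idea on this slide can be sketched in software. This is a minimal illustration only, assuming a memcached-style get as the common case. The names `Packet`, `Host`, and `fpga_on_packet` are hypothetical and do not come from the Catapult design.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    kind: str        # hypothetical packet classes, e.g. "memcached_get"
    key: str = ""


@dataclass
class Host:
    # Stands in for system memory holding application state.
    memory: dict = field(default_factory=dict)
    # Packets the FPGA hands to software.
    cpu_queue: list = field(default_factory=list)


def fpga_on_packet(pkt: Packet, host: Host):
    """Handle the common case in the 'FPGA'; pass everything else to the CPU.

    Returns the reply when the FPGA served the packet, otherwise None
    after queuing the packet for software.
    """
    if pkt.kind == "memcached_get" and pkt.key in host.memory:
        # Common case: the processor never sees this packet.
        return host.memory[pkt.key]
    # Complex or rare case: the CPU acts as the complexity offload engine.
    host.cpu_queue.append(pkt)
    return None


host = Host(memory={"a": "1"})
hit = fpga_on_packet(Packet("memcached_get", "a"), host)
miss = fpga_on_packet(Packet("memcached_get", "b"), host)
```

Here the hit is answered without touching `cpu_queue`, while the miss is queued for software.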
Converged Bing/Azure Architecture
[Diagram: WCS 2.0 Server Blade with Catapult V2. Labels: CPU (x2), QPI, FPGA, NIC, DRAM (x3), Gen3 2x8, Gen3 x8, Switch, QSFP (x3), 40Gb/s (x2)]
[Photo: WCS Gen4.1 Blade with Mellanox NIC and Catapult FPGA. Labels: Pikes Peak, WCS Tray Backplane, Option Card Mezzanine Connectors, Catapult v2 Mezzanine card]
• Completely flexible architecture
  1. Local compute accelerator
  2. Remote compute accelerator
  3. Network/storage accelerator
Bing Document Ranking Flow
[Diagram: Query -> Selection as a Service (SaaS 1 through SaaS 48) -> Selected Documents -> Ranking as a Service (RaaS 1 through RaaS 48) -> 10 blue links. Labels also include IFM 1 through IFM 44.]
Selection-as-a-Service (SaaS)
- Find all docs that contain query terms
- Filter and select candidate documents for ranking
Ranking-as-a-Service (RaaS)
- Compute scores for how relevant each selected document is for the search query
- Sort the scores and return the results
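The two-stage flow (selection, then ranking) can be sketched as below. This is a toy illustration, not Bing's implementation. The scoring function (a sum of query-term counts) and the requirement that a document contain every query term are assumptions made for the example.

```python
def select(query_terms, docs):
    """Selection stage: keep documents that contain every query term (assumed)."""
    return [d for d in docs if all(t in d.lower().split() for t in query_terms)]


def score(query_terms, doc):
    """Toy relevance score: total occurrences of the query terms (assumed)."""
    words = doc.lower().split()
    return sum(words.count(t) for t in query_terms)


def rank(query_terms, candidates, top_k=10):
    """Ranking stage: score each candidate, sort by score, return the top results."""
    ordered = sorted(candidates, key=lambda d: score(query_terms, d), reverse=True)
    return ordered[:top_k]


docs = [
    "fpga configuration guide",
    "fpga fpga configuration",
    "cooking recipes",
]
terms = ["fpga", "configuration"]
results = rank(terms, select(terms, docs))
```

The default `top_k=10` mirrors the "10 blue links" returned at the end of the flow.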
FE: Feature Extraction
Query: “FPGA Configuration”
NumberOfOccurrences_0 = 7
NumberOfOccurrences_1 = 4
NumberOfTuples_0_1 = 1
[Diagram: {Query, Document} -> ~4K Dynamic Features -> ~2K Synthetic Features -> L2 Score -> Document Score]
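The feature names on this slide suggest per-term and per-pair counts. The sketch below computes them for a query and a document. It assumes that `NumberOfOccurrences_i` counts occurrences of query term i and that `NumberOfTuples_i_j` counts places where term i is immediately followed by term j. The slide does not define either feature precisely, and `extract_features` is a hypothetical name.

```python
def extract_features(query, document):
    """Count per-term occurrences and adjacent-pair tuples (assumed definitions)."""
    terms = query.lower().split()
    words = document.lower().split()
    features = {}
    # One occurrence count per query term.
    for i, term in enumerate(terms):
        features[f"NumberOfOccurrences_{i}"] = words.count(term)
    # One tuple count per pair of consecutive query terms.
    for i in range(len(terms) - 1):
        pair = (terms[i], terms[i + 1])
        features[f"NumberOfTuples_{i}_{i + 1}"] = sum(
            1 for a, b in zip(words, words[1:]) if (a, b) == pair
        )
    return features


feats = extract_features(
    "FPGA Configuration",
    "fpga configuration and fpga tools",
)
```

For this short document the result is 2 occurrences of term 0, 1 occurrence of term 1, and 1 adjacent tuple.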
Feature Extraction Accelerator
[Diagram labels: PCIe, Compressed Document, Stream Preprocessing FSM, Distribution latches, Control/Data Tokens, Feature Gathering Network, Free Form Expression (FFE)]
Bing Production Results
[Figure: 99.9% query latency versus queries/sec, software vs. FPGA]
[Figure: HW vs. SW latency and load. Series: average software load, 99.9% software latency, 99.9% FPGA latency, average FPGA query load]
Feature Extraction FPGA faster than needed
• Single feature extraction FPGA much faster than single server
• Wasted capacity and/or wasted FPGA resources
• Two choices
  • Somehow reduce performance and save FPGA resources
  • Allow multiple servers to use single FPGA?
• Use network to transfer requests and return responses
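The sharing argument is simple arithmetic, sketched below. All numbers are hypothetical placeholders chosen for illustration; the slides give no throughput figures.

```python
import math


def fpgas_needed(num_servers, per_server_qps, fpga_qps):
    """FPGAs required when servers share accelerators over the network."""
    return math.ceil(num_servers * per_server_qps / fpga_qps)


def dedicated_utilization(per_server_qps, fpga_qps):
    """Fraction of one FPGA used when it serves only its own server."""
    return per_server_qps / fpga_qps


# Hypothetical numbers: 48 servers at 1,000 queries/sec each,
# and one FPGA able to sustain 8,000 queries/sec.
shared = fpgas_needed(48, 1000, 8000)
util = dedicated_utilization(1000, 8000)
```

With these made-up numbers, a dedicated FPGA runs at 12.5% utilization, while pooling needs 6 FPGAs for 48 servers.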
Inter-FPGA communication
[Diagram: servers, each with a NIC and an FPGA, under two ToR switches; switch tiers CS0 to CS3 and SP0 to SP3 above; levels labeled L0 and L1/L2]
• FPGAs can encapsulate their own UDP packets
• Low-latency inter-FPGA communication (LTL)
• Can provide strong network primitives
• But this topology opens up other opportunities
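To make "encapsulate their own UDP packets" concrete, the sketch below wraps a message in a small transport header of the kind that would travel as a UDP payload. The header layout (connection id, flags, sequence number) is entirely hypothetical. The slides do not describe the real LTL header format.

```python
import struct

# Hypothetical header: 16-bit connection id, 16-bit flags, 32-bit sequence
# number, all in network byte order. This is not the real LTL format.
HEADER = struct.Struct("!HHI")


def encapsulate(conn_id, seq, payload, flags=0):
    """Prefix the payload with the header; the result is the UDP payload."""
    return HEADER.pack(conn_id, flags, seq) + payload


def decapsulate(datagram):
    """Split a received UDP payload back into header fields and message."""
    conn_id, flags, seq = HEADER.unpack_from(datagram)
    return conn_id, flags, seq, datagram[HEADER.size:]


datagram = encapsulate(conn_id=7, seq=42, payload=b"score-request")
fields = decapsulate(datagram)
```

A sequence number and connection id are the kind of state a reliable transport would need on top of UDP, which is why they appear in this illustrative header.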
Lightweight Transport Layer (LTL) Latencies
[Figure: round-trip latency (us, 0 to 25) versus number of reachable hosts/FPGAs (1 to 1,000,000, log scale). Series: LTL average latency and LTL 99.9th percentile for LTL L0 (same TOR), LTL L1, and LTL L2, with markers at 10K, 100K, and 250K; 6x8 torus latency shown for comparison (can reach up to 48 FPGAs). Insets: example L0 latency histogram, example L1 latency histogram, and examples of L2 latency histograms for different pairs of FPGAs]
Hardware Acceleration as a Service Across Data Center (or even across Internet)
[Diagram: ToR and CS switches connecting groups of machines labeled Bing Ranking SW, Bing Ranking HW, HPC, Speech to text, and Large-scale deep learning]
BrainWave: Scaling FPGAs To Ultra-Large DNN Models
• Distribute NN models across as many FPGAs as needed (up to thousands)
• Use HaaS and LTL to manage multi-FPGA execution
• Very close to live production
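One simple way to picture distributing a model across FPGAs is a greedy split of consecutive layers by on-chip capacity, sketched below. This is an illustrative assumption, not BrainWave's actual partitioning scheme. The layer sizes, the capacity, and the name `partition_layers` are hypothetical.

```python
def partition_layers(layer_sizes, capacity):
    """Greedily assign consecutive layers to devices without exceeding capacity.

    Returns a list of groups, each a list of layer indices for one device.
    """
    groups, current, used = [], [], 0
    for index, size in enumerate(layer_sizes):
        if size > capacity:
            raise ValueError(f"layer {index} does not fit on one device")
        if used + size > capacity:
            # Current device is full: close its group and start a new one.
            groups.append(current)
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        groups.append(current)
    return groups


# Hypothetical layer sizes and per-FPGA capacity, in arbitrary units.
plan = partition_layers([4, 3, 5, 2, 6], capacity=8)
```

Keeping layers contiguous means each device only exchanges activations with its neighbors in the pipeline, which is where a low-latency transport between FPGAs would matter.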
BrainWave Publicly Demoed
• Ignite 2016
• Translation DNN running on FPGAs
• 2 orders of magnitude lower latency than CPU implementation
• < 10% of power
FPGA SmartNIC for Cloud Networking
• Azure runs Software Defined Networking on the hosts
• Software Load Balancer, Virtual Networks – new features each month
• Before, we relied on ASICs to scale and to be COGS-competitive at 40G+
  • But 12 to 18 month ASIC cycle + time to roll out new HW is too slow to keep up with SDN
• SmartNIC gives us the agility of SDN with the speed and COGS of HW
  • Base SmartNIC provides common functions like crypto, GFT, QoS, RDMA on all hosts
[Diagram: VM connected to the 50G SmartNIC through SR-IOV (host bypass), alongside the VMSwitch path. VFP in the VMSwitch holds Rule/Action layers: SLB Decap, SLB NAT, VNET, ACL, Metering. A Transposition Engine combines the matched actions (Decap*, DNAT*, Rewrite*, Allow*, Meter*) into a GFT Flow/Action entry on the SmartNIC, e.g. flow 1.2.3.1->1.3.4.1, 62362->80 with action Decap, DNAT, Rewrite, Meter. SmartNIC blocks: QoS, Crypto, RDMA, GFT]
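The Flow/Action table in the diagram can be read as a cache: the first packet of a flow is evaluated against every rule layer, and the combined action is stored per flow so later packets skip the layers. The sketch below illustrates that reading. The layer functions and the `FlowTable` class are hypothetical stand-ins, not the VFP or GFT APIs.

```python
from typing import NamedTuple


class FlowKey(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int


# Hypothetical rule layers; each returns an action name or None.
def slb_decap(key): return "Decap"
def slb_nat(key): return "DNAT" if key.dst_port == 80 else None
def vnet(key): return "Rewrite"
def acl(key): return None          # an allowed flow adds no action here
def metering(key): return "Meter"


class FlowTable:
    def __init__(self, layers):
        self.layers = layers
        self.cache = {}            # flow key -> combined action tuple
        self.slow_path_lookups = 0

    def process(self, key):
        """Return the combined action, evaluating the layers only on a miss."""
        if key not in self.cache:
            self.slow_path_lookups += 1
            actions = []
            for layer in self.layers:
                action = layer(key)
                if action is not None:
                    actions.append(action)
            self.cache[key] = tuple(actions)
        return self.cache[key]


table = FlowTable([slb_decap, slb_nat, vnet, acl, metering])
flow = FlowKey("1.2.3.1", "1.3.4.1", 62362, 80)
first = table.process(flow)    # evaluated against all layers
second = table.process(flow)   # served from the flow cache
```

Both calls return the same combined action, but only the first one walks the rule layers.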
Azure Accelerated Networking
• SR-IOV turned on
  • VM accesses NIC hardware directly, sends messages with no OS/hypervisor call
• FPGA determines flow of each packet, rewrites header to make data center compatible
• Reduces latency to roughly bare metal
• Azure now has the fastest public cloud network
  • 25Gb/s at 25us latency
• Fast crypto developed
[Diagram: two network paths compared. Labels on one path: VM, Guest OS, Hypervisor, VFP, NIC. Labels on the other path: VM, GFT/FPGA, NIC]
We Are Hiring and Collaborating
• We are hiring FPGA and software folks
• Academic engagements
  • Research.Microsoft.com/catapult
  • Will provide boards to a limited number of academics (1 page proposal)
  • Will be giving access to clusters of up to 48 at TACC
  • Research grants
  • Internships
• Please contact me if you’re interested
  • [email protected]
  • [email protected]
Will Configurable Clouds Change the World?
• Being deployed for all new Azure and Bing machines
  • Many other properties as well
• Ability to reprogram a datacenter’s hardware
  • Specialized compute acceleration
  • Networking, storage, security
  • Can turn homogeneous machines into specialized SKUs dynamically
• Hyperscale performance with low latency communication
  • Exa-ops of performance with an O(10us) diameter
• What should we do with the world’s most powerful configurable fabric?