
Microsoft’s Production Configurable Cloud

Derek Chiou

Microsoft Azure Cloud Silicon

UT Austin

H2RC Nov 14, 2016 1

Today’s Data Centers

• O(100K) servers/data center

• Very dense, maximize number of servers

• Tens of MegaWatts

• Strict power and cooling requirements

• Secure, hot, noisy

• Incrementally upgraded
  • 3 year server depreciation, upgraded quarterly

• Applications change very rapidly (weekly, monthly)

• Many advantages including economies of scale, data all in one place, etc.

• At data center scales, don’t need to get an order of magnitude improvement to make sense

• Positive ROI at large scale easier to achieve

• How can we improve efficiencies?


Efficiency via Specialization

[Figure: chart with regions labeled FPGAs and ASICs. Source: Bob Brodersen, Berkeley Wireless group]

What Does a Data Center Server With an FPGA Look Like?

Depends on your point of view

Classic View of Computer

[Diagram: CPU with DRAM, network, and storage]

Networking View of Computer

[Diagram: CPU with DRAM, network, and storage, plus an accelerator (Acc) labeled network “offload”]

“Offload” Accelerator view of Server

[Diagram: CPU with DRAM, NIC, network, and storage, plus three accelerators (Acc)]

Our View of a Data Center Computer

[Diagram labels: network, FPGA, two CPUs, DRAM (three instances), Acc, storage, Intel MCP]

Benefits

• Software receives packets slowly
  • Interrupt or polling
  • Parse packet, start right work
• FPGA processes every packet anyway
  • Packet arrival is an event that FPGA deals with
  • Identify FPGA work, pass CPU work to CPU
• Map common case work to FPGA
  • Processor never sees packet
  • Can read/modify system memory to keep app state consistent
• CPU is complexity offload engine for FPGA!
• Many possibilities
  • Distributed machine learning
  • Software defined networking
  • Memcached get
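The division of labor on this slide can be sketched as a small dispatch model. This is a toy software illustration, not the production design: the `Packet` fields, the `Node` class, and the memcached-style `get`/`set` operations are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    op: str          # hypothetical operation tag, e.g. "get" or "set"
    key: str
    value: str = ""


@dataclass
class Node:
    # Stands in for system memory that both the FPGA and the CPU can access.
    memory: dict = field(default_factory=dict)
    # Packets the "FPGA" could not handle and passed up to the CPU.
    cpu_queue: list = field(default_factory=list)

    def on_packet(self, pkt):
        """Every packet arrives here first, as it would at the FPGA."""
        if pkt.op == "get":
            # Common case: answered directly, the CPU never sees the packet.
            return self.memory.get(pkt.key)
        # Everything else is "complexity" handed to the CPU.
        self.cpu_queue.append(pkt)
        return None

    def cpu_drain(self):
        """CPU handles the less common work and updates shared state."""
        for pkt in self.cpu_queue:
            if pkt.op == "set":
                self.memory[pkt.key] = pkt.value
        self.cpu_queue.clear()
```

In this model a `set` is queued for the CPU, while a later `get` for the same key is served without touching the CPU queue.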

Converged Bing/Azure Architecture

[Diagram: WCS 2.0 server blade with Catapult V2. Labels: two CPUs, FPGA, NIC, DRAM (three instances), QPI, Gen3 2x8, Gen3 x8, QSFP (three instances), 40Gb/s (two links), switch]

[Photo: WCS Gen4.1 blade with Mellanox NIC and Catapult FPGA. Labels: Pikes Peak, WCS tray backplane, option card mezzanine connectors, Catapult v2 mezzanine card]

• Completely flexible architecture
  1. Local compute accelerator
  2. Remote compute accelerator
  3. Network/storage accelerator

Network Connectivity (IP)


Case 1: Local compute accelerator

Bing Ranking as a Service


Bing Document Ranking Flow

[Diagram: Query, then Selection as a Service (SaaS), then Selected Documents, then Ranking as a Service (RaaS), then 10 blue links. Labels: SaaS 1 to SaaS 48, RaaS 1 to RaaS 48, and repeated groups of IFM 1 to IFM 44]

• Selection-as-a-Service (SaaS)
  • Find all docs that contain query terms
  • Filter and select candidate documents for ranking
• Ranking-as-a-Service (RaaS)
  • Compute scores for how relevant each selected document is for the search query
  • Sort the scores and return the results
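The two stages can be sketched as a select-then-rank pipeline. This is a minimal illustration only: the function names are hypothetical, selection assumes a document must contain every query term, and the score is a plain term count standing in for the real relevance model.

```python
def select(query_terms, docs):
    """Selection: keep documents that contain every query term (assumption)."""
    selected = {}
    for doc_id, text in docs.items():
        words = text.lower().split()
        if all(term in words for term in query_terms):
            selected[doc_id] = text
    return selected


def rank(query_terms, selected, top_k=10):
    """Ranking: score each selected document, sort, return the best top_k ids."""
    def score(text):
        # Toy relevance score: total occurrences of the query terms.
        words = text.lower().split()
        return sum(words.count(term) for term in query_terms)

    scored = [(score(text), doc_id) for doc_id, text in selected.items()]
    # Highest score first; ties broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]
```

The default `top_k=10` mirrors the "10 blue links" returned at the end of the flow.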

FE: Feature Extraction

Query: “FPGA Configuration”

{Query, Document}

NumberOfOccurrences_0 = 7
NumberOfOccurrences_1 = 4
NumberOfTuples_0_1 = 1

[Diagram labels: ~4K dynamic features, ~2K synthetic features, L2 score, document score]
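A minimal sketch of how counts like these might be computed for a {query, document} pair. The interpretation is an assumption: `NumberOfOccurrences_i` is taken to be how often query term i appears in the document, and `NumberOfTuples_i_j` how often term i is immediately followed by term j. The function name and whitespace tokenization are illustrative only.

```python
def count_features(query, document):
    """Count per-term occurrences and adjacent-term tuples (assumed meaning)."""
    terms = query.lower().split()
    words = document.lower().split()
    features = {}

    # One occurrence counter per query term, indexed by term position.
    for i, term in enumerate(terms):
        features[f"NumberOfOccurrences_{i}"] = words.count(term)

    # One tuple counter per pair of consecutive query terms.
    for i in range(len(terms) - 1):
        adjacent = zip(words, words[1:])
        features[f"NumberOfTuples_{i}_{i + 1}"] = sum(
            1 for first, second in adjacent
            if first == terms[i] and second == terms[i + 1]
        )
    return features
```

Punctuation and stemming are ignored here; a real extractor would handle both.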

Feature Extraction Accelerator

[Diagram labels: PCIe, distribution latches, control/data tokens, compressed document, stream preprocessing FSM, feature gathering network, Free Form Expression (FFE)]

Bing Production Results

[Figure: 99.9% query latency versus queries/sec, HW vs. SW latency and load. Series: average software load, 99.9% software latency, average FPGA query load, 99.9% FPGA latency]
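For readers unfamiliar with tail-latency metrics, here is a minimal sketch of a 99.9th percentile computed with the nearest-rank method. The method choice and the function name are assumptions for illustration; the slide does not say how the production numbers were computed.

```python
import math
from fractions import Fraction


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Exact rational arithmetic avoids float rounding at ranks such as 99.9%.
    rank = math.ceil(Fraction(str(pct)) * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]
```

A few slow queries barely move the average but dominate the 99.9th percentile, which is why the tail is the number plotted.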

Case 2: Remote accelerator


Feature Extraction FPGA faster than needed

• Single feature extraction FPGA much faster than single server

• Wasted capacity and/or wasted FPGA resources

• Two choices
  • Somehow reduce performance and save FPGA resources
  • Allow multiple servers to use single FPGA?

• Use network to transfer requests and return responses
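The sharing argument comes down to simple arithmetic, sketched below. The function name, the throughput figures, and the utilization target are all hypothetical; the slide gives no numbers.

```python
def servers_per_fpga(fpga_rate, server_rate, target_utilization_pct=80):
    """Number of servers one shared FPGA can serve at a target utilization.

    fpga_rate and server_rate are requests per second (hypothetical units);
    integer arithmetic keeps the result exact.
    """
    if fpga_rate < 0 or server_rate <= 0:
        raise ValueError("rates must be positive")
    return (fpga_rate * target_utilization_pct) // (100 * server_rate)
```

If one FPGA sustained ten times the request rate a single server generates, this model would let eight servers share it at 80% utilization, leaving the other FPGAs free for different work.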

Inter-FPGA communication

[Diagram: two ToR switches, each above four FPGA/NIC/server stacks; higher switch tiers labeled CS0 to CS3 and SP0 to SP3; network levels marked L0 and L1/L2]

• FPGAs can encapsulate their own UDP packets

• Low-latency inter-FPGA communication (LTL)

• Can provide strong network primitives

• But this topology opens up other opportunities
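A minimal sketch of what encapsulating a message for transport inside a UDP datagram can look like. The 8-byte header layout (connection id, sequence number, payload length) is entirely hypothetical and is not the LTL wire format, which the slide does not describe.

```python
import struct

# Hypothetical header, network byte order: 2-byte connection id,
# 4-byte sequence number, 2-byte payload length (8 bytes total).
HEADER = struct.Struct("!HIH")


def encapsulate(conn_id, seq, payload):
    """Prefix the payload with the header; the result is the UDP payload."""
    return HEADER.pack(conn_id, seq, len(payload)) + payload


def decapsulate(datagram):
    """Split a received UDP payload back into header fields and message."""
    conn_id, seq, length = HEADER.unpack_from(datagram)
    payload = datagram[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated datagram")
    return conn_id, seq, payload
```

The bytes returned by `encapsulate` could be handed to an ordinary UDP socket with `sendto`; a sequence number like the one here is the kind of field a transport layer can use for ordering and retransmission.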

Lightweight Transport Layer (LTL) Latencies

[Figure: round-trip latency (us, 0 to 25) versus number of reachable hosts/FPGAs (log scale, 1 to 1,000,000, with marks at 10K, 100K, and 250K). Series: LTL average latency and LTL 99.9th percentile for LTL L0 (same TOR), LTL L1, and LTL L2, compared with 6x8 torus latency (the 6x8 torus can reach up to 48 FPGAs). Insets: example L0 latency histogram, example L1 latency histogram, and examples of L2 latency histograms for different pairs of FPGAs]

Hardware Acceleration as a Service Across Data Center (or even across Internet)

[Diagram: ToR and CS switches connecting FPGA pools labeled Bing Ranking SW, HPC, Bing Ranking HW, Speech to text, and Large-scale deep learning]

BrainWave: Scaling FPGAs To Ultra-Large DNN Models

• Distribute NN models across as many FPGAs as needed (up to thousands)

• Use HaaS and LTL to manage multi-FPGA execution

• Very close to live production
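One simple way to picture "as many FPGAs as needed" is a greedy split of consecutive layers by size. This is a hypothetical illustration: the slide does not describe how BrainWave partitions models, and `partition_layers`, the layer sizes, and the per-FPGA capacity are made-up names and units.

```python
def partition_layers(layer_sizes, fpga_capacity):
    """Greedily assign consecutive layers to FPGAs without exceeding capacity.

    Returns a list with one entry per FPGA, each a list of layer indices.
    """
    assignments, current, used = [], [], 0
    for index, size in enumerate(layer_sizes):
        if size > fpga_capacity:
            raise ValueError(f"layer {index} does not fit on one FPGA")
        if used + size > fpga_capacity:
            # Current FPGA is full: start filling the next one.
            assignments.append(current)
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        assignments.append(current)
    return assignments
```

The number of FPGAs is simply the length of the result, so larger models use more devices without any change to the scheme.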

BrainWave Publicly Demoed

• Ignite 2016

• Translation DNN running on FPGAs

• 2 orders of magnitude lower latency than CPU implementation

• < 10% of power

Case 3: Networking accelerator


FPGA SmartNIC for Cloud Networking

• Azure runs Software Defined Networking on the hosts
  • Software Load Balancer, Virtual Networks – new features each month

• Before, we relied on ASICs to scale and to be COGS-competitive at 40G+
  • But 12 to 18 month ASIC cycle + time to roll out new HW is too slow to keep up with SDN

• SmartNIC gives us the agility of SDN with the speed and COGS of HW
  • Base SmartNIC provides common functions like crypto, GFT, QoS, RDMA on all hosts

[Diagram: VM, VMSwitch, and VFP with rule/action layers SLB Decap, SLB NAT, VNET, ACL, and Metering (actions Decap*, DNAT*, Rewrite*, Allow*, Meter*). A transposition engine and rewrite stage produce one flow action per flow in GFT, for example flow 1.2.3.1->1.3.4.1, 62362->80 with actions Decap, DNAT, Rewrite, Meter. The 50G SmartNIC also holds QoS, Crypto, and RDMA, and is reached from the VM through SR-IOV (host bypass)]
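The layered rules and the per-flow action cache in the diagram can be sketched as a match-action pipeline with a flow table in front of it. This is a toy model under stated assumptions: the `Layer` and `FlowTable` classes, the predicates, and the example rules are hypothetical, and the real GFT and VFP interfaces are not shown on the slide.

```python
class Layer:
    """One match-action layer, e.g. an ACL: the first matching rule wins."""

    def __init__(self, name, rules):
        self.name = name
        self.rules = rules  # list of (predicate, action) pairs

    def match(self, flow):
        for predicate, action in self.rules:
            if predicate(flow):
                return action
        return None  # no rule in this layer applies


class FlowTable:
    """Caches the composed actions per flow so later packets skip the layers."""

    def __init__(self, layers):
        self.layers = layers
        self.cache = {}             # flow tuple -> composed actions
        self.slow_path_lookups = 0  # packets that walked every layer

    def process(self, flow):
        actions = self.cache.get(flow)
        if actions is None:
            # First packet of a flow: evaluate every layer in order (slow path).
            self.slow_path_lookups += 1
            matched = (layer.match(flow) for layer in self.layers)
            actions = tuple(a for a in matched if a is not None)
            self.cache[flow] = actions
        return actions


def example_table():
    """Hypothetical rules; a flow is (src_ip, dst_ip, src_port, dst_port)."""
    to_port_80 = lambda flow: flow[3] == 80
    any_flow = lambda flow: True
    return FlowTable([
        Layer("SLB Decap", [(to_port_80, "Decap")]),
        Layer("SLB NAT", [(to_port_80, "DNAT")]),
        Layer("VNET", [(any_flow, "Rewrite")]),
        Layer("ACL", [(to_port_80, "Allow")]),
        Layer("Metering", [(any_flow, "Meter")]),
    ])
```

Only the first packet of a flow walks all the layers; every later packet is a single table lookup, which is the part that suits hardware.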

Azure Accelerated Networking

• SR-IOV turned on
  • VM accesses NIC hardware directly, sends messages with no OS/hypervisor call

• FPGA determines flow of each packet, rewrites header to make data center compatible

• Reduces latency to roughly bare metal

• Azure now has the fastest public cloud network
  • 25Gb/s at 25us latency

• Fast crypto developed

[Diagram labels: one path with VM, Guest OS, Hypervisor, VFP, and NIC; another with VM, NIC, and GFT/FPGA]

We Are Hiring and Collaborating

• We are hiring FPGA and software folks

• Academic engagements
  • Research.Microsoft.com/catapult
  • Will provide boards to a limited number of academics (1 page proposal)
  • Will be giving access to clusters of up to 48 at TACC
  • Research grants
  • Internships

• Please contact me if you’re interested
  • [email protected]
  • [email protected]

Will Configurable Clouds Change the World?

• Being deployed for all new Azure and Bing machines
  • Many other properties as well

• Ability to reprogram a datacenter’s hardware
  • Specialized compute acceleration
  • Networking, storage, security
  • Can turn homogeneous machines into specialized SKUs dynamically

• Hyperscale performance with low latency communication
  • Exa-ops of performance with an O(10us) diameter

• What should we do with the world’s most powerful configurable fabric?
