SDN: Google's B4 and Traffic Engineering
Outline
1 B4: Experience with a Globally-Deployed Software Defined WAN
2 Achieving High Utilization with Software-Driven WAN
Introduction
Modern WANs are critical to performance and reliability.
Typically provisioned to 30-40% average utilization (2-3x bandwidth-cost over-provisioning).
Result: high overheads on top of already high bandwidth requirements.
Introduction
Google's WAN is one of the largest in the Internet.
Delivers a range of services: search, video, cloud computing, etc.
Architecturally, two distinct WANs:
1 User-facing network: peers with other Internet domains; carries user traffic.
2 B4:
◮ connectivity between data centers
◮ 90% of internal traffic runs on this network,
e.g., asynchronous data copies, end-user data replication, etc.
Why two different WANs? Different requirements (e.g., priority, latency, etc.).
Internet traffic continues to grow rapidly, but Google's WAN traffic grows even faster.
Introduction
SDN approach for the data-center WAN interconnect.
Motivation:
◮ deploy routing and TE protocols customized to Google's unique requirements
Design goals:
◮ treat failures as common events
◮ switches provide a programmatic interface under central control
Introduction
Why an SDN-based solution?
Limitations of traditional WAN architectures:
Elastic bandwidth demands: the majority of traffic is elastic and tolerant of transient failures.
Moderate number of sites: a few dozen data centers.
End application control: control the network at every level with more flexibility, reducing over-provisioning of resources.
Cost sensitivity: nearly impossible to match the growing demand with traditional approaches.
Others include the success of SDN and OpenFlow, rapid iteration on novel protocols, improved capacity planning, scalability, flexibility, etc.
Introduction
Manage switches using SDN principles.
SDN application: support standard routing protocols + a centralized TE service.
◮ Edge servers make decisions based on resource availability.
◮ Use multipath forwarding based on application priority.
◮ Dynamically reallocate bandwidth on link/switch failures.
This allows B4 to achieve:
◮ near 100% utilization on many B4 links
◮ 70% average utilization across all links
(i.e., a 2-3x efficiency improvement over standard practice)
Design - Overview
Logically, a three-layered architecture.
B4 WAN: consists of multiple sites; within each site, the switch hardware layer forwards traffic.
Site Controller layer: Network Control Servers (NCS) hosting both OpenFlow Controllers (OFCs) and Network Control Applications (NCAs).
- OFC maintains network state based on NCA directives
- Paxos for fault tolerance of individual servers
Global layer: logically centralized applications such as the SDN Gateway and the central TE server.
- enables central control of the entire network
- the SDN Gateway provides abstractions to the TE server
Design - Overview
Options for integrating existing routing protocols with centralized traffic engineering:
Approach 1: build one integrated, centralized service combining both routing and TE.
Approach 2: build routing and centralized TE as separate, independent services.
Which one would you prefer?
Design - Overview
Approach 2: build routing and centralized TE as separate, independent services.
Why?
Focus on SDN infrastructure development.
Debug the SDN architecture before adding new features.
The TE layer sits on top of the routing protocols.
A BIG RED BUTTON to disable TE (falling back to shortest-path forwarding).
Design - Switch Design
Conventional designs need deep buffers, large forwarding tables, and hardware support for high availability.
For B4, Google avoids these requirements by:
◮ adjusting transmission rates through careful endpoint management
◮ exploiting the modest number of data centers + abstraction, yielding smaller forwarding tables
◮ moving software functionality from switches to upper layers
Need for custom switches:
switches that export low-level control over forwarding behavior.
Design - Switch Design
High-radix switch: deploying fewer, larger switches yields easier management and software scalability.
B4 switches: built from multiple merchant silicon switch chips in a two-stage Clos topology.
Figure: High-radix switch
Design - Network Control Functionality
The majority of the functionality runs on the NCS.
Paxos handles leader election for all control functionality:
◮ failure detection
◮ new leader election
Modified Onix for the OFC:
◮ the OFC maintains the Network Information Base (NIB),
e.g., topology info, trunk configurations, link status, etc.
Design - Routing
How to integrate OpenFlow-based switches with existing routing protocols?
Google chose the Quagga stack for BGP/ISIS on the NCS.
Developed an SDN application called the Routing Application Proxy (RAP).
RAP provides connectivity between Quagga and the OF switches for:
◮ BGP/ISIS route updates
◮ routing-protocol packets flowing between switches and Quagga
◮ interface updates from the switches to Quagga
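To make the relaying concrete, here is a minimal Python sketch of how a RAP-like shim might translate a route learned by Quagga into switch flow entries; the RIB-entry format and flow-rule fields are illustrative assumptions, not Google's implementation:

def rib_to_flow_entries(rib_entry):
    # Translate one RIB entry (a plain dict here) into per-switch flow
    # entries; one entry per next hop learned via BGP/ISIS.
    flows = []
    for next_hop, out_port in rib_entry["next_hops"]:
        flows.append({
            "match": {"ipv4_dst": rib_entry["prefix"]},  # LPM match on dest
            "actions": [("set_next_hop", next_hop), ("output", out_port)],
        })
    return flows

# Hypothetical route: prefix reachable via one next hop on port 3.
print(rib_to_flow_entries({"prefix": "10.1.0.0/16",
                           "next_hops": [("10.0.0.2", 3)]}))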
Traffic Engineering
Goal: share bandwidth among competing applications/flow groups.
Objective function: max-min fair allocation.
Traffic Engineering
Notions:
Network Topology: a graph representing sites as vertices and site-to-site connectivity as edges.
Flow Group (FG): applications are aggregated into flow groups, each defined by a {source site, dest site, QoS} tuple.
Tunnel (T): a site-level path in the network, i.e., a sequence of sites (A ⇒ B ⇒ C).
Tunnel Group (TG): maps an FG to a set of tunnels (T) and corresponding split weights.
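To make these notions concrete, a minimal Python sketch of the data model (field names and the QoS label are illustrative assumptions, not B4's actual schema):

from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class FlowGroup:
    # Aggregated application traffic: {source site, dest site, QoS}.
    src_site: str
    dst_site: str
    qos: str

@dataclass(frozen=True)
class Tunnel:
    # A site-level path, e.g., ("A", "B", "C").
    sites: Tuple[str, ...]

@dataclass
class TunnelGroup:
    # Maps an FG onto tunnels with split weights summing to 1.
    fg: FlowGroup
    tunnels: List[Tunnel]
    weights: List[float]

tg = TunnelGroup(fg=FlowGroup("A", "C", "best-effort"),
                 tunnels=[Tunnel(("A", "C")), Tunnel(("A", "B", "C"))],
                 weights=[0.75, 0.25])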
Traffic Engineering
Figure: Overview of Traffic Engineering
TE - Bandwidth Functions
Associate a bandwidth function with every application.
Admin-specified static weights define the slope of the function.
Allocate bandwidth based on each flow's relative priority (fair share).
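A minimal sketch of such a function, assuming a simple linear shape: it maps the fair share a flow is awarded to the bandwidth it should receive, with the admin weight as the slope and the application's demand as a cap (the cap and all numbers are illustrative):

def bandwidth_function(weight: float, demand: float):
    # Returns a map from fair share to bandwidth: a line with slope
    # `weight`, capped at the application's demand.
    def bw(fair_share: float) -> float:
        return min(weight * fair_share, demand)
    return bw

app_hi = bandwidth_function(weight=10.0, demand=15.0)  # high-priority app
app_lo = bandwidth_function(weight=1.0, demand=15.0)   # low-priority app
# At the same fair share of 1.0, the high-priority application is
# entitled to 10x the bandwidth of the low-priority one.
print(app_hi(1.0), app_lo(1.0))  # 10.0 1.0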
TE - Max-Min Fair Allocation
Formal definition:
Resources are allocated to sources in order of increasing demand.
No source gets a resource share larger than its demand.
Sources with unsatisfied demands get an equal share of the resource.
S. Keshav, An Engineering Approach to Computer Networking, Addison-Wesley, Reading, MA, 1997, pp. 215-217.
TE - Max-Min Fair Allocation
Figure: Example of Max-Min Fair Allocation
1 Assign (10 Mbps / 4 flows) = 2.5 Mbps per flow.
2 Sum the over-assigned amounts (the residual): flow 1, with a demand of 2 Mbps, is over-assigned 0.5 Mbps.
3 Assign (residual / number of under-assigned flows) = 0.5/3 ≈ 0.167 Mbps extra to each remaining flow.
4 Repeat steps 2 and 3 with the new residual until no residual is left or no allocation exceeds its demand.
Final assignment: Flow 1 = 2 Mbps, Flow 2 = 2.6 Mbps, Flow 3 = 2.7 Mbps, Flow 4 = 2.7 Mbps.
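A minimal Python sketch of this waterfall procedure (it also takes weights, used on the next slide). The demands of flows 3 and 4 are not given on the slide, so the values below are assumptions chosen to be large enough not to bind:

def weighted_max_min(capacity, demands, weights=None):
    # Repeatedly split the residual capacity among unsatisfied flows in
    # proportion to their weights; satisfied flows return their excess.
    n = len(demands)
    weights = weights or [1.0] * n
    alloc, active, residual = [0.0] * n, set(range(n)), capacity
    while residual > 1e-9 and active:
        unit = residual / sum(weights[i] for i in active)
        residual = 0.0
        for i in list(active):
            share = alloc[i] + unit * weights[i]
            if share >= demands[i]:        # demand met: free the excess
                residual += share - demands[i]
                alloc[i] = demands[i]
                active.remove(i)
            else:
                alloc[i] = share
    return alloc

# Demands of flows 3 and 4 are assumed (anything above 2.7 Mbps works).
print(weighted_max_min(10, [2, 2.6, 4, 5]))
# -> approximately [2, 2.6, 2.7, 2.7] (Mbps), matching the example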
TE - Weighted Max-Min Fair Allocation
Figure: Example of Weighted Max-Min Fair Allocation
1 Normalize weights (so that the smallest weight is 1): W = [5, 8, 1, 2].
2 Unit share = (total resource / sum of normalized weights) = 16/16 = 1.
3 Assign every flow (unit share × its normalized weight) units of resource.
4 Calculate the over-assigned resources and repeat steps 1-3 with this residual.
Final assignment: Flow 1 = 4 Mbps, Flow 2 = 2 Mbps, Flow 3 = 4 Mbps, Flow 4 = 6 Mbps.
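The weighted_max_min sketch above covers this case. With the slide's weights and assumed demands of [4, 2, 10, 6] Mbps (flow 3's demand is not given; any value above 4 works), it reproduces the final assignment:

# Weighted variant; flow 3's demand of 10 Mbps is an assumption.
print(weighted_max_min(16, [4, 2, 10, 6], weights=[5, 8, 1, 2]))
# -> approximately [4, 2, 4, 6] (Mbps)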
TE - Optimization
The LP-optimal solution for allocating fair shares to FGs is expensive and does not scale.
The B4 team designed their own algorithm, achieving at least 99% of optimal utilization while running 25x faster than the LP.
Two main components:
1 Tunnel Group Generation: allocates bandwidth to FGs using their bandwidth functions, prioritizing bottleneck edges.
2 Tunnel Group Quantization: adjusts split ratios in each TG to match the granularity supported by switch hardware tables.
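A minimal sketch of the quantization step under simplifying assumptions: round ideal splits down to multiples of a hardware quantum (B4 uses 1/4, per the evaluation later), then hand leftover quanta to the tunnels with the largest rounding error. This greedy rounding is an illustrative stand-in, not B4's published algorithm:

from fractions import Fraction

def quantize_splits(splits, quantum=Fraction(1, 4)):
    # Round each split down to a multiple of `quantum`, keeping the
    # total equal to 1 by redistributing the leftover quanta.
    down = [s // quantum * quantum for s in splits]
    leftover = int((1 - sum(down)) / quantum)     # quanta left to place
    by_error = sorted(range(len(splits)),
                      key=lambda i: splits[i] - down[i], reverse=True)
    for i in by_error[:leftover]:
        down[i] += quantum
    return down

# Ideal splits 0.55/0.30/0.15 become 0.5/0.25/0.25 with a quantum of 1/4.
print(quantize_splits([Fraction(11, 20), Fraction(3, 10), Fraction(3, 20)]))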
TE Protocol & OF - TE State and OpenFlow
Three modes of a B4 switch:
1 Encapsulating switch
2 Transit switch
3 Decapsulating switch
TE Protocol & OF - TE State and OpenFlow
The encapsulating (source) switch maps packets to an FG using <dest ip> and forwards them to the corresponding TG.
The TG hashes each packet to a tunnel T in the desired ratio.
Each site in the path maintains per-tunnel forwarding rules.
The source site encapsulates the packet with an outer header (i.e., the tunnel ID).
Transit switches match rules on the tunnel ID and forward accordingly.
The decapsulating switch terminates the tunnel based on the tunnel ID.
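A minimal sketch of the hashing step (the hash function and flow-key fields are illustrative; real switches hash packet headers in hardware):

import hashlib

def pick_tunnel(flow_key: str, tunnels, weights):
    # Deterministically map a flow to a tunnel so that, across many
    # flows, traffic splits roughly in the ratio given by `weights`.
    # Hashing the flow key keeps all packets of one flow on one path.
    h = int(hashlib.md5(flow_key.encode()).hexdigest(), 16)
    point = (h % 10**6) / 10**6 * sum(weights)
    for tunnel, w in zip(tunnels, weights):
        if point < w:
            return tunnel
        point -= w
    return tunnels[-1]

# 75%/25% split between a direct and an indirect tunnel.
print(pick_tunnel("10.0.0.1:10.1.0.2:443", ["A=>C", "A=>B=>C"], [0.75, 0.25]))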
TE Protocol & OF - Composing Routing and TE
B4 supports two routing services:
1 shortest-path routing (uses the Longest Prefix Match (LPM) table)
2 TE (uses the Access Control List (ACL) table)
Different flows and groups are mapped to the appropriate table.
ACL entries take strict precedence over LPM entries.
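A minimal sketch of the lookup precedence (the table formats are simplified stand-ins: a dict of exact destination matches for the ACL, a prefix list for the LPM table):

import ipaddress

def forward(dst_ip, acl_table, lpm_table):
    # TE-installed ACL entries win; otherwise fall back to the longest
    # prefix match installed by shortest-path routing.
    if dst_ip in acl_table:
        return acl_table[dst_ip]
    addr, best = ipaddress.ip_address(dst_ip), None
    for prefix, next_hop in lpm_table:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, next_hop)
    return best[1] if best else None

acl = {"10.1.0.5": "tunnel-42"}   # installed by TE
lpm = [("10.1.0.0/16", "port-1"), ("10.0.0.0/8", "port-2")]
print(forward("10.1.0.5", acl, lpm))  # -> tunnel-42 (ACL wins)
print(forward("10.1.9.9", acl, lpm))  # -> port-1 (LPM fallback)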
TE Protocol & OF - Coordinating TE State Across Sites
Figure: Overview of Traffic Engineering
The TE server coordinates T/TG/FG rule installation across multiple OFCs.
The Traffic Engineering Database (TED) captures the state needed to forward packets along multiple paths.
The TED is a <key, value> data store.
The TE server computes a per-site TED and generates TE ops for the OFCs.
TE ops add, modify, or delete TED entries at the OFCs.
Each OFC converts TE ops into flow-programming instructions and sends them to all devices in its site.
Finally, the OFC responds to the original TE op.
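A minimal sketch of turning a TED diff into ops (the entry keys and op format are illustrative assumptions):

def ted_diff_to_ops(current, target):
    # Compare an OFC's current TED with the TE server's target TED and
    # emit add/modify/delete ops; both TEDs are {key: value} dicts.
    ops = [("add" if k not in current else "modify", k, v)
           for k, v in target.items() if current.get(k) != v]
    ops += [("delete", k, None) for k in current if k not in target]
    return ops

current = {"Tunnel/A-B-C": {"egress_port": 3}}
target = {"Tunnel/A-B-C": {"egress_port": 4},           # repaired path
          "TG/A:C": {"splits": {"Tunnel/A-B-C": 1.0}}}  # new tunnel group
print(ted_diff_to_ops(current, target))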
TE Protocol & OF - Dependencies and Failures
Dependencies among ops:
◮ to avoid packet drops, ops cannot all run simultaneously,
e.g., a tunnel must be configured at all sites before any TG/FG referencing it.
Synchronizing the TED between TE and OFC:
◮ requires a common TED view
◮ the TE session supports this synchronization
◮ TE synchronizes the TED to persistent storage, to handle simultaneous failures
Ordering issues:
◮ site-specific sequence IDs are assigned to TE ops
◮ this enables ordering among operations
TE op failures:
◮ caused by RPC failures, OFC rejection, etc.
◮ a dirty/clean bit is kept for each TED entry
◮ this enables resuming TE ops from the point of failure
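A minimal sketch of how an OFC might enforce ordering with session-scoped sequence IDs (the acceptance rule is an illustrative reconstruction, not the published protocol):

class OfcSession:
    # Accept TE ops only in increasing sequence order within a session;
    # a new session ID (e.g., after a TE server restart) resets it.
    def __init__(self):
        self.session_id, self.last_seq = None, 0

    def accept(self, session_id: int, seq: int) -> bool:
        if session_id != self.session_id:
            self.session_id, self.last_seq = session_id, 0
        if seq <= self.last_seq:          # stale or duplicate op
            return False
        self.last_seq = seq
        return True

ofc = OfcSession()
print(ofc.accept(1, 1), ofc.accept(1, 2), ofc.accept(1, 1))
# -> True True False (the replayed op with seq 1 is rejected)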
Evaluation - Deployment and Evolution
Network traffic doubled over the course of 2012.
Evaluation - Deployment and Evolution
Observations:
1 Topology aggregation significantly reduces path churn and system load.
2 Edge removals happen multiple times a day.
3 WAN links are susceptible to frequent port flaps and benefit from dynamic centralized management.
Evaluation - TE Ops Performance
100x reduction in the number of TE ops by caching recently used tunnels.
Reduction in failed ops.
Reduced latency.
Evaluation - TE Ops Performance
Notes:
TG ops run for every topology change or change in demand.
Growth in the number of TG ops is due to the addition of network sites.
The reduction in TG op failures is due to optimizations.
Evaluation - Impact of Failures
Figure: Impact of failure between two sites
Failure of a transit router requires a longer convergence time (≈ 3.3 s):
◮ multipath table entries must be updated for potentially several tunnels
◮ each update op is slow
Evaluation - TE Algorithm Evaluation
Throughput improves as more paths are available.
Adding more paths and using finer-granularity traffic splitting gives TE more flexibility, but consumes more hardware table resources.
B4's deployment uses TE with a quantum of 1/4 and 4 paths.
Evaluation - Link Utilization
Utilization close to 100%.
Ability to mix priority classes across all edges.
Use separate edges for different classes.
Evaluation - Link Utilization
Figure: Per-link utilization within a trunk, demonstrating the effectiveness of hashing
For at least 75% of site-to-site edges, the max/min ratio of link utilization is:
◮ 1.05 without failures (i.e., within 5% of optimal)
◮ 2.0 with failures
Conclusion
B4 now serves more traffic than Google's public-facing WAN, with a higher growth rate.
SDN delivered cost-effective WAN bandwidth, running many links at 100% utilization.
A hybrid approach is an effective way to introduce SDN into existing deployments.
Leveraging control at the edge increases WAN utilization and improves fault tolerance.