Subhasree Mandal July 9, 2015 Lessons Learned from B4, Google’s SDN WAN
Page 1: B4, Google’s SDN WAN Lessons Learned frombadri/552dir/papers/scheduling/atc15... · Benefits of B4-SDN/TE Lessons learnt on SDN in three key areas Fast producer/slow ... Controller

Subhasree Mandal, July 9, 2015

Lessons Learned from B4, Google’s SDN WAN


Google Innovations in Networking

[Figure: timeline with ticks at 2006, 2008, 2010, 2012, and 2014, placing B4, Google Global Cache, BwE, Jupiter, gRPC, Onix, Freedome, Watchtower, QUIC, and Andromeda.]


More Than the Sum of Parts

Google Networking works together as an integrated whole

• B4: WAN interconnect

• GGC: edge presence

• Jupiter: building scale datacenter network

• Freedome: campus-level interconnect

• Andromeda: isolated, high-performance slices of the physical network

Publications in INFOCOM 2012, SIGCOMM 2013, SIGCOMM 2014, CoNEXT 2014, EuroSys 2014, SIGCOMM 2015


Motivation for SDN B4


Motivation for Backend Backbone

Data centers deployed across the world
● Serve content with geographic locality
● Replicate content for fault tolerance

Need a network to connect these data centers to one another
● Not on the public Internet
● Cost-effective network for high-volume traffic
● Application-specific variation in SLO
● Bursty/bulk traffic (not smooth/diurnal)

WAN Intensive Apps: YouTube, Web Search, Google+, Maps, AppEngine, Photos and Hangouts, Android/Chrome Updates


Two Backbones

B4: 10x growth in last 3.5 years!

Two separate backbones:
● B2: Carries Internet-facing traffic → Growing faster than the Internet
● B4: Inter-datacenter traffic → More traffic than B2, growing faster than B2

[Figure: B4 traffic over time, Jul 2012 to Jan 2015.]


Growth vs Cost

Does cost per bit/sec go down with additional scale?
● Consider analogies with compute or storage

Networking cost/bit doesn't naturally decrease with size
● Quadratic complexity in pairwise interactions and broadcast overhead of all-to-all communication requires more expensive equipment
● Manual management and configuration of individual elements
● Complexity of automated configuration to deal with non-standard vendor configuration APIs


SDN to Solve It

● Faster innovation: separate smarts out of embedded devices
  ○ Leverage powerful compute in Google servers
  ○ Faster feature roll-outs on controllers
  ○ Less frequent switch firmware upgrade
  ○ Easier hardware upgrade/replacement
● Efficient network management
  ○ Manage fabric, rather than collection of devices
● Cost effective: opportunity for centralized Traffic Engineering (TE)
  ○ Higher overall throughput, via better utilization of deployed hardware
    ■ Need not overprovision
  ○ Leverage multi-objective multi-commodity flow optimization algorithms
    ■ More optimal throughput and faster convergence ...


Topics for Today

● Background for Traffic Engineering (TE)
● B4-SDN/TE Architecture with OpenFlow protocol
● Benefits of B4-SDN/TE
● Lessons learnt on SDN in three key areas
  ○ Performance: Fast producer/slow consumer: flow control to the rescue
  ○ Availability: Robust control plane connectivity and stable mastership are critical
  ○ Scale: SDN is a natural fit for abstraction and hierarchy


Background for Centralized Traffic Engineering


Convergence After Failure

● Flows: R1->R6: 20; R2->R6: 20; R4->R6: 20

[Figure: topology of routers R1 to R6 with link labels 40, 20, 20, 20, 20, 60, and 20; legend: shortest path, 2nd shortest path, 3rd shortest path, 4th shortest path.]


Convergence After Failure

● Flows: R1->R6: 20; R2->R6: 20; R4->R6: 20
● R5-R6 link fails
  ○ R1, R2, R4 autonomously try for next best path
  ○ R1, R2, R4 push 20 all together

[Figure: R1 to R6 topology with link labels 40, 20, 20, 20, 20, 60, and 20, labelled "No Traffic Engineering".]


Convergence After Failure

● Flows: R1->R6: 20; R2->R6: 20; R4->R6: 20
● R5-R6 link fails
  ○ R1, R2, R4 autonomously try for next best path
  ○ R1 wins; R2, R4 retry for next best path
  ○ R2 wins this round; R4 retries again
  ○ R4 finally gets third best path!

[Figure: R1 to R6 topology with link labels 40, 20, 20, 20, 20, 60, and 20, labelled "Distributed Traffic Engineering Protocols".]


Centralized Traffic Engineering

● Simple topology
● Flows:
  ○ R1->R6: 20; R2->R6: 20; R4->R6: 20
● R5-R6 link fails
  ○ R5 informs TE, which programs routers in one shot
  ○ Leads to faster realization of target optimum

[Figure: R1 to R6 topology with link labels 40, 20, 20, 20, 20, 60, and 20, labelled "Centralized Traffic Engineering Protocols".]
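The contrast between the distributed retries and the one-shot central programming can be sketched as a toy model. This is only an illustration under stated assumptions: the three alternate paths (`alt_1` to `alt_3`), their capacities, the shared preference order, and the name-order tie-break are all hypothetical and are not taken from the slides' topology. The central planner here is a simple greedy pass, not the optimization B4 actually runs.

```python
# Toy model only: paths, capacities, and tie-breaking are hypothetical.
FLOW_DEMAND = 20
PATH_CAPACITY = {"alt_1": 20, "alt_2": 20, "alt_3": 20}
# Every router ranks the alternates identically, so they collide.
PREFERENCES = {
    "R1": ["alt_1", "alt_2", "alt_3"],
    "R2": ["alt_1", "alt_2", "alt_3"],
    "R4": ["alt_1", "alt_2", "alt_3"],
}


def distributed_rounds(preferences, capacity, demand):
    """Routers retry independently; return (placement, rounds needed)."""
    residual = dict(capacity)
    placed = {}
    attempt = {router: 0 for router in preferences}
    rounds = 0
    while len(placed) < len(preferences):
        rounds += 1
        for router in sorted(preferences):  # name order decides who wins
            if router in placed:
                continue
            if attempt[router] >= len(preferences[router]):
                raise RuntimeError(f"{router} ran out of candidate paths")
            path = preferences[router][attempt[router]]
            if residual[path] >= demand:
                residual[path] -= demand
                placed[router] = path
            else:
                attempt[router] += 1  # lost this round, retry next best path
    return placed, rounds


def centralized_one_shot(preferences, capacity, demand):
    """Plan every flow against a global view before programming anything."""
    residual = dict(capacity)
    plan = {}
    for router in sorted(preferences):
        for path in preferences[router]:
            if residual[path] >= demand:
                residual[path] -= demand
                plan[router] = path
                break
    return plan


if __name__ == "__main__":
    placement, rounds = distributed_rounds(PREFERENCES, PATH_CAPACITY, FLOW_DEMAND)
    print("distributed:", placement, "after", rounds, "rounds")
    print("centralized:", centralized_one_shot(PREFERENCES, PATH_CAPACITY, FLOW_DEMAND))
```

In this model both approaches end at the same placement, but the distributed version needs one round per collision, while the central plan is complete before the first router is programmed.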


Advantages of Centralized TE

● Better network utilization with global picture
● Converges faster to target optimum on failure
● Allows more control and specifying intent
  ○ Deterministic behavior simplifies planning vs. overprovisioning for worst case variability
● Can mirror production event streams for testing
  ○ Supports innovation and robust SW development
● Controller uses modern server hardware
  ○ 50x (!) better performance


B4 Architecture


B4 Site: SDN Architecture

Traditional WAN integrated with SDN: still speaking ISIS/BGP

Unit of management is a site = fabric

[Figure: SITE-A with switches numbered 1 to 6, each drawn as silicon plus an OF agent; boxes labelled "protocol", numbered 1 to 6, sit with the Master SDN controller; a Standby SDN controller is linked to the master by a heartbeat exchange; SITE-B and SITE-C are shown as neighboring sites.]


Traffic Engineering Overlay

● Per QoS Traffic Engineering (TE)
  ○ Demand based use of longer paths
  ○ Max-min fair bandwidth allocation
  ○ Per app loss/latency/throughput consideration
● TE paths are overlaid on ISIS/BGP routes
  ○ Higher priority flow rules for TE

[Figure: sites A, B, and C with link labels 80 Gbps and 240 Gbps and the ISIS shortest path marked; a table labelled "OpenFlow 1.0 Rules" shows priority TE flows and BGP/ISIS flows.]
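The overlay idea, TE rules installed at higher priority than the ISIS/BGP-derived rules, can be sketched as a small flow-table lookup. This is a minimal sketch, not the B4 data model: the `FlowRule` type, the priority values, the prefixes, and the action strings are all hypothetical.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

# Hypothetical priority levels: TE rules outrank routing-derived rules.
TE_PRIORITY = 200
ROUTING_PRIORITY = 100


@dataclass(frozen=True)
class FlowRule:
    priority: int  # larger value wins
    prefix: str    # destination prefix the rule matches
    action: str    # illustrative label, not a real OpenFlow action


def lookup(table, dst):
    """Return the matching rule with the highest priority, or None."""
    addr = ip_address(dst)
    matches = [rule for rule in table if addr in ip_network(rule.prefix)]
    if not matches:
        return None
    # Break priority ties with the longest prefix.
    return max(matches, key=lambda r: (r.priority, ip_network(r.prefix).prefixlen))


TABLE = [
    FlowRule(ROUTING_PRIORITY, "10.2.0.0/16", "ISIS shortest path"),
    FlowRule(TE_PRIORITY, "10.2.0.0/16", "TE path"),
    FlowRule(ROUTING_PRIORITY, "10.3.0.0/16", "ISIS shortest path"),
]

if __name__ == "__main__":
    print(lookup(TABLE, "10.2.1.1").action)  # TE rule shadows the routing rule
    print(lookup(TABLE, "10.3.1.1").action)  # no TE rule, routing rule applies
```

A consequence of this layering is that withdrawing the TE rule leaves the routing-derived rule in place, so traffic falls back to the shortest path rather than being dropped.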


Control Plane Architecture

[Figure: labelled components are a TE server (Global Optimizer) with demand, topology, prefixes, and TE pathing; an SDN Gateway; a Bandwidth Enforcer with demand collection and admission control toward hosts; and, at SITE-A, SITE-B, and SITE-C, Master and Standby SDN controllers with a TE App and protocols, above switches drawn as silicon plus OF agent.]


Benefits of SDN B4 with Centralized Traffic Engineering


Benefits of TE Over Shortest Path

● ~20% increase in throughput over SPF
● Larger benefits during capacity crunch
● Lowers the requirement for bandwidth provisioning

[Figure: throughput improvement over SPF (%), y-axis 0 to 30, x-axis Jul 2014 to Jan 2015; annotations: "20%" and "Helps more during capacity crunch".]


Other Benefits

Software and hardware feature roll outs decoupled
● Software timescale feature roll out
  ○ Hitless SW upgrades and new features
    ■ No packet loss and no capacity degradation
    ■ Most feature releases do not touch the switch
● Slower HW upgrades
  ○ 3 generations of HW under same SDN architecture


Lesson on Performance


Controller to Switch Messaging

Initial simple-minded assumptions
● OpenFlow protocol:
  ○ Flow and control packet (ISIS/BGP/ARP/...) requests sent from controller to OF agent (OFA) sequentially
● OF agent (OFA) can process them in order
● System is always in consistent state

But ...


Messages Backlogged and Delayed!

● Fast server: queue build-up on controller and switch due to slow switch CPU
● Flow rules generated in bursts
● Flow programming in HW is slow
● Single OpenFlow connection between controller and OFA
● Flow rules cause HOL blocking for packets
● Packets delayed
● Protocols time out; reconvergence produces more flow rules

Vicious Cycle of Protocol Instability!!!

[Figure: message path from SDN controller to OFA to embedded switch stack, with packet (P) and flow (F) messages sharing one queue: "P FFFF P FF".]
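The head-of-line blocking described above can be illustrated with a small single-queue model. The service times and the hold timer below are hypothetical numbers chosen only to make the effect visible; they are not measurements from B4.

```python
# Hypothetical per-message service times on the switch CPU, in milliseconds.
SERVICE_MS = {"F": 10, "P": 1}  # F = flow programming, P = protocol packet I/O
HOLD_TIMER_MS = 300             # hypothetical protocol timeout


def finish_times(queue, service_ms=SERVICE_MS):
    """Serve messages strictly in order; return (kind, finish time) pairs."""
    clock = 0
    finished = []
    for kind in queue:
        clock += service_ms[kind]
        finished.append((kind, clock))
    return finished


def packet_delay_ms(flow_burst):
    """Delay seen by one protocol packet queued behind a burst of flow rules."""
    queue = ["F"] * flow_burst + ["P"]
    return finish_times(queue)[-1][1]


def protocol_times_out(flow_burst):
    """True when the packet is served after the hold timer has expired."""
    return packet_delay_ms(flow_burst) > HOLD_TIMER_MS


if __name__ == "__main__":
    for burst in (0, 10, 50):
        print(burst, packet_delay_ms(burst), protocol_times_out(burst))
```

Under these assumed numbers a burst of 50 flow rules delays the packet past the timer; the resulting timeout would trigger reconvergence and another burst, which is the cycle the slide describes.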


Lesson: Mitigation with Flow Control

● Separate queue for packet IO and flow request
● Strict priority for packet IO over flow programming
● Limit queue depth in OFA: token based flow control
● Systematic queue drop discipline
● Asynchronous OFA
● Packet IO out of flow processing pipeline

[Figure: SDN controller, OFA, and embedded switch stack; separate packet (P) and flow (F) queues feed a strict priority scheduler; flow control limits the OFA queue to depth N; queued flow requests are marked "superseded" or "aged out!!!"; the asynchronous OFA is split into DMA for packet I/O and Flow Processing.]
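A minimal sketch of three of the mitigations (separate queues, strict priority for packet IO, and a token-limited flow queue in which a newer request supersedes an older one for the same flow) might look like the following. The class name, method names, and the use of a flow key are assumptions made for illustration; aging out stale requests and the asynchronous OFA are left out to keep the example short.

```python
from collections import OrderedDict, deque


class OfaQueues:
    """Hypothetical OFA front end: two queues, strict priority, bounded depth."""

    def __init__(self, flow_tokens):
        self.packets = deque()
        self.flows = OrderedDict()  # flow key -> latest pending request
        self.tokens = flow_tokens   # N outstanding flow requests allowed

    def offer_packet(self, packet):
        self.packets.append(packet)

    def offer_flow(self, key, rule):
        """Return True if accepted; False asks the controller to hold back."""
        if key in self.flows:
            self.flows[key] = rule  # superseded: overwrite, no extra token
            return True
        if self.tokens == 0:
            return False
        self.tokens -= 1
        self.flows[key] = rule
        return True

    def next_message(self):
        """Strict priority: packet IO is always served before flow requests."""
        if self.packets:
            return ("P", self.packets.popleft())
        if self.flows:
            key, rule = self.flows.popitem(last=False)
            self.tokens += 1  # token returns to the controller
            return ("F", (key, rule))
        return None


if __name__ == "__main__":
    q = OfaQueues(flow_tokens=2)
    print(q.offer_flow("a", "v1"), q.offer_flow("b", "v1"), q.offer_flow("c", "v1"))
    q.offer_packet("isis-hello")
    print(q.next_message())
```

Because the flow queue is bounded, a burst backs up in the controller, where superseded work can be discarded cheaply, instead of in the slow switch CPU.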


Lesson on Availability


Outages!!!

[Figure: "Postmortem Bugs by Category" for 2012, 2013, and 2014, with categories Unstable mastership, Operational Procedure/Tools, Core Software Bugs, and Unsupported Software, and a "Worst Offender" callout.]

[Figure: "Deployment Growth", sites for 2012, 2013, and 2014.]


Control Plane Connectivity: Mastership

Initial naive design:
● Symmetry between buildings
● Each building can run independently, even if the other one is down
● N+1 controller redundancy sufficient for upgrades, failures etc.

[Figure: Master SDN controller and Standby SDN controller, each with a TE App and protocols, linked by a heartbeat exchange, above switches drawn as silicon plus OF agent.]


Control Network: Unstable Mastership

● Both controllers declare mastership:
  ○ Gateway and OFAs can observe mastership flapping frequently
  ○ Declared master has partial reachability to switches
● Reported topology changes, pathing changes, flow programming fails

Non-transitive reachability => Packets dropped!!

[Figure: TE server and Gateway above the Master and Standby SDN controllers (each with a TE App) and their OF agents and silicon, with "Unstable reachability" marked on the control network.]
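How a heartbeat-only design can end with two masters is easy to reproduce in a toy model. The election rule below (a controller claims mastership unless it hears a higher-ranked peer) is an assumption made for illustration; the slides do not spell out the original rule, and the controller names are hypothetical.

```python
def heartbeat_masters(controllers, hears):
    """Return every controller that would declare itself master.

    controllers: names in rank order, highest rank first.
    hears: set of (listener, speaker) pairs whose heartbeats get through.
    """
    masters = []
    for index, controller in enumerate(controllers):
        higher_ranked = controllers[:index]
        if not any((controller, peer) in hears for peer in higher_ranked):
            masters.append(controller)
    return masters


CONTROLLERS = ["ctrl_a", "ctrl_b"]  # hypothetical names

if __name__ == "__main__":
    healthy = {("ctrl_a", "ctrl_b"), ("ctrl_b", "ctrl_a")}
    partitioned = set()  # heartbeats lost, both controllers still running
    print(heartbeat_masters(CONTROLLERS, healthy))      # one master
    print(heartbeat_masters(CONTROLLERS, partitioned))  # both claim mastership
```

Losing the heartbeat path does not mean the peer is down, so with non-transitive reachability each controller may still reach a different subset of switches while both believe they are master.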


Lesson: Robust Control Reachability

● Multiple independent domains per site: connected only through dataplane
  ○ Each domain is unit for safe modular upgrade and maintenance
● Paxos: quorum-based robust master election within each domain
● Also removes single point of failure in each site

[Figure: Domain 1 to Domain 4, each containing an SDN controller (SDN cntrl) with TE App and protocols, each marked "Paxos".]
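The property that quorum-based election buys can be shown without implementing Paxos. The sketch below is not Paxos; it only checks the majority condition, under the assumptions of a hypothetical five-replica domain and a deterministic lowest-name tie-break.

```python
def elect(partition, cluster_size):
    """Return a master for this partition, or None if it lacks a majority."""
    if 2 * len(partition) <= cluster_size:
        return None         # no quorum: refuse to act as master
    return min(partition)   # hypothetical deterministic choice


def masters_across(partitions, cluster_size):
    """Collect the masters elected by each side of a network split."""
    elected = [elect(part, cluster_size) for part in partitions]
    return [master for master in elected if master is not None]


if __name__ == "__main__":
    split = [{"n1", "n2"}, {"n3", "n4", "n5"}]
    print(masters_across(split, cluster_size=5))  # only the majority side elects
```

Two disjoint partitions cannot both hold a strict majority, so at most one master exists however the control network splits, which is what the heartbeat-only design could not guarantee.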


Lessons on Scaling


Flat Topology Scales Poorly

● As B4 grows: more sites deployed
● As compute per site grows:
  ○ More capacity required per site
  ○ Larger switches OR more switches
● Larger switches: loss of large capacity on switch failure
● More switches: more nodes and links to manage
  ○ ISIS and TE will hit scaling issues, converge too slowly...!!!


Lesson: Hierarchical Topology

Best of both worlds with SDN
● Topology abstractions by domain controllers
  ○ Supernode: tightly connected nodes/switches
  ○ Supertrunks: links between supernodes
● Domain controllers compute
  ○ intra-domain routing
  ○ impairment due to internal failure

[Figure: "physical topology: domain controller view", showing domain X and domain Y with link bundles labelled xN and x2N.]


Lesson: Hierarchical Topology

Best of both worlds with SDN
● Topology in terms of supertrunk capacity
● TE and ISIS/BGP work on supernodes

Reduces global controller-visible topology complexity by over 100x

[Figure: "abstract topology: global controller view", showing domain X, domain Y, supernode-1, supernode-2, and a supertrunk, with bundles labelled xN and x2N.]
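The abstraction step, collapsing switch-level links into supertrunks between supernodes, can be sketched as a simple aggregation. The switch names, capacities, and domain assignment below are hypothetical, and summing capacities is an assumed way of deriving "supertrunk capacity"; it ignores the impairment that domain controllers are said to compute for internal failures.

```python
from collections import defaultdict


def abstract_topology(links, domain_of):
    """Collapse physical links into supertrunks between supernodes.

    links: iterable of (switch_a, switch_b, capacity).
    domain_of: mapping from switch name to its supernode (domain) name.
    Returns {(supernode_a, supernode_b): total capacity}; intra-domain
    links stay hidden from the global view.
    """
    supertrunks = defaultdict(int)
    for switch_a, switch_b, capacity in links:
        node_a, node_b = domain_of[switch_a], domain_of[switch_b]
        if node_a == node_b:
            continue  # internal link: handled by the domain controller
        supertrunks[tuple(sorted((node_a, node_b)))] += capacity
    return dict(supertrunks)


# Hypothetical two-domain example.
DOMAIN_OF = {"x1": "X", "x2": "X", "y1": "Y", "y2": "Y"}
LINKS = [
    ("x1", "x2", 100),  # inside domain X
    ("y1", "y2", 100),  # inside domain Y
    ("x1", "y1", 40),
    ("x2", "y2", 40),
]

if __name__ == "__main__":
    print(abstract_topology(LINKS, DOMAIN_OF))
```

The global controller then sees two nodes and one link instead of four switches and four links, and the saving grows with the number of switches per domain.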


Conclusions

● SDN is beneficial in the real world
  ○ Centralized TE delivered up to 30% additional throughput!
  ○ Decoupled software and hardware rollout
● Lessons to work in practice
  ○ System performance: Flow control between components
  ○ Availability: Robust reachability for master election
  ○ Scale: Hierarchical topology abstraction


References

● Upward Max Min Fairness: INFOCOM 2012

● B4: Experience with a Globally-Deployed Software Defined WAN: SIGCOMM 2013

● Bandwidth Enforcer: Flexible Hierarchical Bandwidth Allocation for WAN Distributed Computing: SIGCOMM 2015


Google Platforms Networking

Hiring
● Interns
● Full time engineers

Locations worldwide:
● Mountain View
● New York
● Sydney

Inspiration and creativity to build Google’s infrastructure:
● Scale that gives the edge
● Research turned into real life production solution

Software, Hardware, Test, Technology

Thank You!!

