Software-defined networking: Change is hard
Ratul Mahajan
with Chi-Yao Hong, Rohan Gandhi, Xin Jin, Harry Liu, Vijay Gill, Srikanth Kandula, Mohan Nanduri, Roger Wattenhofer, Ming Zhang
Inter-DC WAN: A critical, expensive resource
[Figure: map of WAN sites: Hong Kong, Seoul, Seattle, Los Angeles, New York, Miami, Dublin, Barcelona]
But it is highly inefficient
One cause of inefficiency: Lack of coordination
Another cause of inefficiency: Local, greedy resource allocation
[Figure: example with elements A to H, shown under local, greedy allocation and under globally optimal allocation]
[Latency inflation with MPLS-based traffic engineering, IMC 2011]
SWAN: Software-driven WAN
Goals: Highly efficient WAN; Flexible sharing policies
Key design elements: Coordinate across services; Centralize resource allocation
[Achieving high utilization with software-driven WAN, SIGCOMM 2013]
SWAN overview
[Figure: SWAN architecture. Components: SWAN controller, service broker, service hosts, network agent, WAN. Labels: traffic demand, BW allocation, rate limiting, network config., topology and traffic]
Key design challenges
Scalably computing BW allocations
Avoiding congestion during network updates
Working with limited switch memory
Congestion during network updates
Congestion-free network updates
Computing congestion-free update plan
Leave scratch capacity s on each link: ensures a plan with at most ⌈1/s⌉ − 1 steps
Find a plan with the minimal number of steps using an LP: search for a feasible plan with 1, 2, …, max steps
Use scratch capacity for background traffic
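The scratch-capacity idea can be illustrated with a minimal Python sketch. It is not the LP from the slide: it assumes a scratch fraction s on every link, takes the step bound to be ⌈1/s⌉ − 1, and builds the intermediate states by simple linear interpolation between the current and target allocations. The data model (a state maps each flow to its rate per link) and all function names are hypothetical.

```python
import math
from fractions import Fraction

def transition_load(before, after, link):
    """Worst-case load on `link` while switches move from `before` to `after`.

    Switches apply changes asynchronously, so each flow may momentarily send
    at either its old or its new rate; bound the load by the per-flow maximum.
    """
    flows = set(before) | set(after)
    return sum(
        max(before.get(f, {}).get(link, 0), after.get(f, {}).get(link, 0))
        for f in flows
    )

def is_congestion_free(plan, capacity):
    """True if no link can exceed its capacity during any step of `plan`."""
    return all(
        transition_load(before, after, link) <= cap
        for before, after in zip(plan, plan[1:])
        for link, cap in capacity.items()
    )

def interpolated_plan(current, target, scratch):
    """Build a plan with ceil(1/scratch) - 1 steps by linear interpolation."""
    steps = max(1, math.ceil(1 / Fraction(scratch)) - 1)
    flows = set(current) | set(target)
    plan = []
    for k in range(steps + 1):
        frac = Fraction(k, steps)
        state = {}
        for f in flows:
            old, new = current.get(f, {}), target.get(f, {})
            state[f] = {
                link: old.get(link, 0) + frac * (new.get(link, 0) - old.get(link, 0))
                for link in set(old) | set(new)
            }
        plan.append(state)
    return plan

# Two links of capacity 12 with one third left as scratch (8 usable).
capacity = {"top": 12, "bottom": 12}
scratch = Fraction(1, 3)
current = {"F1": {"top": 8}, "F2": {"bottom": 8}}
target = {"F1": {"bottom": 8}, "F2": {"top": 8}}  # the two flows swap links

one_shot = [current, target]
staged = interpolated_plan(current, target, scratch)
print(is_congestion_free(one_shot, capacity))                  # one-step swap
print(is_congestion_free(staged, capacity), len(staged) - 1)   # staged swap
```

In this example the one-step swap can put 16 units on a link of capacity 12, while the two-step plan moves half of each flow at a time and never exceeds 12. An LP-based planner could find shorter plans than interpolation in cases where the full bound is not needed.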
SWAN provides congestion-free updates
[Figure: complementary CDFs of oversubscription ratio and of extra traffic (MB)]
SWAN comes close to optimal
[Figure: throughput (relative to optimal) for SWAN, SWAN w/o rate control, and MPLS TE]
Deploying SWAN
[Figure: WAN and data center under partial deployment and under full deployment]
The challenge of data plane updates in SDN
Not just about congestion: blackholes, loops, packet coherence, …
Real-world is even messier
[Figure: CDFs of latency (seconds) in Google’s B4 and in our controlled experiments]
Many resulting questions of interest
Fundamental: What consistency properties can be maintained, and how? Are property strength and ease of maintenance related?
Practical: How to quickly and safely update the data plane? Impacts failure recovery time, network utilization, flow response time
Minimal dependencies for a consistency property
[On consistent updates in software-defined networks, HotNets 2013]
Dependency scope: None | Self | Downstream subset | Downstream all | Global
Eventual consistency: Always guaranteed
Blackhole freedom: Impossible | Add before remove
Loop freedom: Impossible | Rule dependency forest | Rule dependency tree
Packet coherence: Impossible | Flow version numbers | Global version numbers
Congestion freedom: Impossible | Staged partial moves
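As a small illustration of the "add before remove" entry for blackhole freedom, the Python sketch below replays two orderings of the same rule change on a single switch and checks every intermediate rule table. The rule model (a pair of flow and next hop, with no priorities or wildcards) and the function names are hypothetical simplifications, not taken from the paper.

```python
def tables_during(initial, operations):
    """Yield the rule table before any operation and after each one, in order."""
    table = set(initial)
    yield frozenset(table)
    for action, rule in operations:
        if action == "add":
            table.add(rule)
        else:
            table.discard(rule)
        yield frozenset(table)

def blackhole_free(initial, operations, flow):
    """True if `flow` matches at least one rule in every intermediate table."""
    return all(
        any(match == flow for match, _next_hop in table)
        for table in tables_during(initial, operations)
    )

old_rule = ("F1", "S2")  # forward flow F1 to switch S2
new_rule = ("F1", "S3")  # forward flow F1 to switch S3

add_first = [("add", new_rule), ("remove", old_rule)]
remove_first = [("remove", old_rule), ("add", new_rule)]

print(blackhole_free({old_rule}, add_first, "F1"))     # True
print(blackhole_free({old_rule}, remove_first, "F1"))  # False
```

With the remove-first ordering there is a moment when the table holds no rule for F1, so its packets would be dropped. The add-first ordering always leaves at least one matching rule in place.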
Fast, consistent network updates
[Figure: update pipeline. Components: desired state generator, update planner. Labels: routing policy, consistency property, current network state, target network state, update plan]
Forward fault correction: computes states that are robust to common faults
Dionysus: dynamically schedules network updates
Overview of forward fault correction
Control and data plane faults cause congestion
Today, reactive data plane updates are needed to remove congestion
FFC handles faults proactively: guarantees absence of congestion for up to k faults
Main challenge: too many possible faults. Constraint reduction technique based on sorting networks
[Traffic engineering with forward fault correction, SIGCOMM 2014 (to appear)]
Congestion due to control plane faults
[Figure: current state and target state]
FFC for control plane faults
[Figure: current state, vulnerable target state, robust target state (k=1), robust target state (k=2)]
Congestion due to data plane faults
[Figure: pre-failure and post-failure traffic distribution]
FFC for data plane faults
[Figure: vulnerable traffic distribution and robust traffic distribution (k=1)]
FFC guarantee needs too many constraints
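$\forall S \in F_k: \sum_{s \in S} T_l(s) \le C_l$
$F_k$: $\{S \mid S$ is a set of up to $k$ faulty switches$\}$
$T_l(s)$: additional traffic on link $l$ when switch $s$ is faulty
$C_l$: spare capacity of link $l$ in the absence of faults
Number of constraints is $\sum_{j=1}^{k} \binom{n}{j}$ for each link, with $n$ switches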
Efficient solution using sorting networks
$x^{(m)}$: $m$th largest variable in the array
Use a bubble sort network to compute linear expressions for the $k$ largest variables
$O(nk)$ constraints
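The reduction can be checked numerically with a short Python sketch. It assumes the per-switch extra traffic values on one link are non-negative, in which case the worst fault set of up to k switches is simply the k largest values. The sketch runs k passes of a bubble sort network on concrete numbers and counts comparators; in the actual LP the wires would be variables and each comparator would be encoded with linear constraints, which this sketch does not attempt. All names are hypothetical.

```python
from itertools import combinations

def k_largest_by_bubble_network(values, k):
    """Run k bubble passes; return (the k largest values, comparators used)."""
    wires = list(values)
    n = len(wires)
    passes = min(k, n)
    comparators = 0
    for p in range(passes):
        # Pass p moves the largest remaining value to position n - 1 - p.
        for i in range(n - 1 - p):
            lo = min(wires[i], wires[i + 1])
            hi = max(wires[i], wires[i + 1])
            wires[i], wires[i + 1] = lo, hi
            comparators += 1
    return wires[n - passes:], comparators

def worst_case_bruteforce(extra, k):
    """Max total extra traffic over every fault set of up to k switches."""
    return max(
        sum(subset)
        for size in range(k + 1)
        for subset in combinations(extra, size)
    )

# Extra traffic on one link when each of six switches is faulty (made-up numbers).
extra = [3, 0, 7, 2, 5, 1]
k = 2
spare = 12

top, comparators = k_largest_by_bubble_network(extra, k)
print(top, comparators)                          # [5, 7] 9
print(sum(top) == worst_case_bruteforce(extra, k))
print(sum(top) <= spare)                         # link is safe for up to k faults
```

The comparator count is (n − 1) + (n − 2) + … + (n − k), which grows as O(nk), whereas the brute-force check enumerates every subset of up to k switches.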
FFC performance in practice
[Figure: FFC performance for single-priority traffic and for multi-priority traffic]
Overview of dynamic update scheduling
Current schedulers pre-compute a static update schedule
Can get unlucky with switch delays
Dynamic scheduling adapts to actual conditions
Main challenge: Tractably exploring “safe” schedules
[Dionysus: Dynamic scheduling of network updates, SIGCOMM 2014 (to appear)]
Downside of static schedules
[Figure: five-switch network (S1 to S5) carrying flows F1: 5, F2: 5, F3: 10, F4: 5, shown in its current state and target state; static plans A and B order the moves of F1 to F4 differently, with timelines of the updates at switches S1 to S4]
Downside of static schedules
[Figure: the same current state and target state, comparing static plan A, static plan B, and a dynamic plan]
Dynamic plan: low update time regardless of latency variability
Challenge in dynamic scheduling
Tractably explore valid orderings: exponential number of orderings; cannot completely avoid planning
[Figure: five-switch network (S1 to S5) carrying flows F1 to F5, shown in its current state and target state]
Dionysus pipeline
[Figure: Dionysus pipeline. Components: dependency graph generator, update scheduler. Labels: consistency property, current network state, target network state, dependency graph]
Dionysus dependency graph
Nodes: updates and resources
Edges: dependencies among nodes
[Figure: dependency graph for a five-switch network (S1 to S5) carrying flows F1 to F5, shown in its current state and target state]
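To make the idea of updates and resources as graph nodes concrete, here is a minimal Python simulation of dynamic scheduling. It is a sketch under simplifying assumptions, not the Dionysus scheduler: each update needs free bandwidth on some links before it can start and frees bandwidth on other links once it completes, and per-update switch delays are supplied as input. The update names, link names, and numbers are made up.

```python
import heapq

def dynamic_schedule(updates, free, delay):
    """Issue each update as soon as the resources it needs are free.

    updates: name -> (needs, frees), each a dict of link -> bandwidth
    free:    link -> bandwidth that is free right now
    delay:   name -> time the switch takes to apply the update
    Returns (finish time, order in which updates were issued).
    """
    free = dict(free)
    pending = set(updates)
    running = []  # min-heap of (completion time, name)
    now, issued = 0, []
    while pending or running:
        # Issue every pending update whose resource needs are met right now.
        for name in sorted(pending):
            needs, _frees = updates[name]
            if all(free[link] >= bw for link, bw in needs.items()):
                for link, bw in needs.items():
                    free[link] -= bw
                pending.remove(name)
                heapq.heappush(running, (now + delay[name], name))
                issued.append(name)
        if not running:
            raise RuntimeError("deadlock: no pending update can proceed")
        # Advance to the next completion and release what that update frees.
        now, done = heapq.heappop(running)
        for link, bw in updates[done][1].items():
            free[link] += bw
    return now, issued

updates = {
    "move_A": ({"Y": 5}, {"X": 5}),  # needs room on Y, which move_B frees
    "move_B": ({"Z": 5}, {"Y": 5}),
    "move_C": ({"W": 5}, {"V": 5}),  # independent of the other two
}
free = {"V": 0, "W": 5, "X": 0, "Y": 0, "Z": 5}

print(dynamic_schedule(updates, free, {"move_A": 1, "move_B": 1, "move_C": 3}))
print(dynamic_schedule(updates, free, {"move_A": 1, "move_B": 3, "move_C": 1}))
```

In the first run, move_A starts at time 1, as soon as move_B finishes, and the whole update takes 3 time units. In the second run, move_B is the slow one and the update takes 4. In both cases the order adapts to the observed completion times rather than to a schedule fixed in advance.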
Dionysus scheduling
NP-complete problem with capacity and memory constraints
Approach:
  Critical path scheduling
  Treat strongly connected components as virtual nodes and favor them
  Rate limit flows to resolve deadlocks
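Critical path scheduling can be sketched in a few lines of Python. The sketch assumes the dependency graph is already acyclic (for example, after strongly connected components have been collapsed into virtual nodes) and measures each update by the longest chain of updates that must follow it. The graph encoding and function names are hypothetical.

```python
def critical_path_lengths(dependents):
    """Length of the longest dependency chain starting at each update.

    dependents: update -> list of updates that can only run after it.
    The graph must be acyclic.
    """
    memo = {}

    def length(node):
        if node not in memo:
            memo[node] = 1 + max(
                (length(child) for child in dependents.get(node, [])),
                default=0,
            )
        return memo[node]

    return {node: length(node) for node in dependents}

def pick_next(ready, lengths):
    """Among updates that can run now, prefer the longest critical path."""
    return max(ready, key=lambda update: (lengths[update], update))

# A must precede B, which must precede C; D is independent.
dag = {"A": ["B"], "B": ["C"], "C": [], "D": []}
lengths = critical_path_lengths(dag)
print(lengths)                         # {'A': 3, 'B': 2, 'C': 1, 'D': 1}
print(pick_next(["A", "D"], lengths))  # A
```

Starting A first is the better choice here because two more updates are waiting behind it, while nothing depends on D.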
Dionysus leads to faster updates
Median improvement over static scheduling (SWAN): 60-80%
Dionysus reduces congestion due to failures
99th percentile improvement over static scheduling (SWAN): 40%
Summary
SDN enables new network operating points such as high utilization
But it also poses a new challenge: fast, consistent data plane updates