Date post: 28-Mar-2015
zUpdate: Updating Data Center Networks with Zero Loss. Hongqiang Harry Liu (Yale University), Xin Wu (Duke University), Ming Zhang, Lihua Yuan, Roger Wattenhofer, Dave Maltz (Microsoft)
Transcript
Page 1: zUpdate: Updating Data Center Networks with Zero Loss

Hongqiang Harry Liu (Yale University), Xin Wu (Duke University)

Ming Zhang, Lihua Yuan, Roger Wattenhofer, Dave Maltz (Microsoft)

Page 2: DCN is constantly in flux

[Figure: switches and traffic flows, with the labels "Upgrade", "Reboot", and "New Switch"]

Page 3: DCN is constantly in flux

[Figure: switches, virtual machines, and traffic flows]

Page 4: Network updates are painful for operators

Bob: an operator, performing a switch upgrade

Two weeks before the update, Bob has to:
• Coordinate with application owners
• Prepare a detailed update plan
• Review and revise the plan with colleagues

On the night of the update, Bob executes the plan by hand, but:
• Application alerts are triggered unexpectedly
• Switch failures force him to backpedal several times

Eight hours later, Bob is still stuck with the update:
• No sleep overnight
• Numerous application complaints
• No quick fix in sight

Bob: "Holy C**p"

Pain points: complex planning, unexpected performance faults, laborious process

Page 5: Congestion-free DCN update is the key

• Applications want network updates to be seamless
  • Reachability
  • Low network latency (propagation, queuing)
  • No packet drops

• Congestion-free updates are hard
  • Many switches are involved
  • Multi-step plan
  • Different scenarios have distinct requirements
  • Interactions between network and traffic demand changes

[Callout: Congestion]

Page 6: A Clos network with ECMP

[Figure: Clos topology with ToR, AGG, and CORE layers; flows of 600 are split into 300 + 300 and then into 150 + 150]

• All switches: Equal-Cost Multi-Path (ECMP)
• Link capacity: 1000
• Link load shown: 620 + 150 + 150 = 920
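As a rough illustration of the numbers on this slide, the sketch below splits a flow equally across its next hops, as ECMP does, and adds up the load on one link. The function name `ecmp_split` and the hop names are hypothetical, and the mapping of the 620, 150, and 150 terms to particular flows is read off the slide's arithmetic rather than its figure.

```python
def ecmp_split(load, next_hops):
    """Split a flow's load equally across the given next hops (ECMP)."""
    share = load / len(next_hops)
    return {hop: share for hop in next_hops}

# A 600-unit flow is halved at each of two ECMP stages (hop names are made up).
at_agg = ecmp_split(600, ["agg_a", "agg_b"])                  # 300 each
at_core = ecmp_split(at_agg["agg_b"], ["core_a", "core_b"])   # 150 each

LINK_CAPACITY = 1000
# 620 is the load already on the link; the two 150 terms come from two split flows.
total = 620 + at_core["core_a"] + 150
print(total, total <= LINK_CAPACITY)
```

With every switch splitting equally, the link stays below its capacity of 1000.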

Page 7: Switch upgrade: a naïve solution triggers congestion

[Figure: Clos topology with ToR, AGG, and CORE layers; AGG1 is drained and a flow of 600 is shown]

• Link capacity: 1000
• Drain AGG1
• Link load before the drain: 620 + 150 + 150 = 920
• Link load after the drain: 620 + 150 + 300 = 1070, which exceeds the link capacity

Page 8: Switch upgrade: a smarter solution seems to be working

[Figure: Clos topology with ToR, AGG, and CORE layers; AGG1 is drained and one flow is split 500 / 100]

• Link capacity: 1000
• Drain AGG1
• Link load with plain ECMP: 620 + 300 + 150 = 1070
• Link load with weighted ECMP: 620 + 300 + 50 = 970
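The weighted-ECMP arithmetic on this slide can be sketched as follows. The helper `weighted_split` and the path names are hypothetical; the 5:1 weights are chosen only because they reproduce the 500 / 100 split shown on the slide.

```python
def weighted_split(load, weights):
    """Split a flow's load across next hops in proportion to the given weights."""
    total = sum(weights.values())
    return {hop: load * w / total for hop, w in weights.items()}

# Unequal weights move most of the 600-unit flow away from the loaded path.
first_stage = weighted_split(600, {"path_a": 5, "path_b": 1})         # 500 and 100
second_stage = weighted_split(first_stage["path_b"], {"x": 1, "y": 1})  # 50 each

LINK_CAPACITY = 1000
link_load = 620 + 300 + second_stage["x"]
print(link_load, link_load <= LINK_CAPACITY)
```

The third term drops from 150 to 50, so the final state fits within the capacity of 1000.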

Page 9: Traffic distribution transition

[Figure: the same Clos topology (ToR, AGG, CORE) shown before and after the update]

• Initial traffic distribution (congestion-free): 300, 300, 300, 300
• Final traffic distribution (congestion-free): 0, 600, 500, 100
• Is the transition between the two simple? NO! Switch updates are asynchronous.

Page 10: Asynchronous changes can cause transient congestion

[Figure: Clos topology with ToR, AGG, and CORE layers; AGG1 is drained; ToR1 already sends 600 on one path while ToR5 still splits 300 / 300]

• Link capacity: 1000
• Drain AGG1
• When ToR1 is changed but ToR5 is not yet, the link load is 620 + 300 + 150 = 1070
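To make the transient state concrete, the sketch below enumerates the four combinations of "ToR1 updated or not" and "ToR5 updated or not". The per-ToR contributions (150 to 300 for ToR1, 150 to 50 for ToR5) are taken from the arithmetic on the slides; attributing them to these two switches is an assumption of this sketch.

```python
from itertools import product

CAPACITY = 1000
BACKGROUND = 620  # load on the link that does not change during the update

# Contribution of each ToR's traffic to the link before and after its own update.
CONTRIB = {
    "ToR1": {"old": 150, "new": 300},
    "ToR5": {"old": 150, "new": 50},
}

def loads_by_state():
    """Link load for every combination of old/new state at ToR1 and ToR5."""
    states = {}
    for s1, s5 in product(("old", "new"), repeat=2):
        states[(s1, s5)] = BACKGROUND + CONTRIB["ToR1"][s1] + CONTRIB["ToR5"][s5]
    return states

states = loads_by_state()
congested = [state for state, load in states.items() if load > CAPACITY]
print(states)
print(congested)
```

Only the mixed state in which ToR1 has switched and ToR5 has not exceeds the capacity, even though both the initial and the final states are congestion-free.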

Page 11: Solution: introducing an intermediate step

[Figure: the same Clos topology (ToR, AGG, CORE) shown in its initial, intermediate, and final states]

• Initial: 300, 300, 300, 300
• Intermediate: 200, 400, 450, 150
• Final: 0, 600, 500, 100
• The transition from initial to intermediate is congestion-free regardless of asynchronization.
• The transition from intermediate to final is congestion-free regardless of asynchronization.
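The effect of the intermediate step can be checked numerically. This is a sketch under two assumptions that the slides suggest but do not state: the four numbers are read as a (ToR1, ToR5) pair of two-way splits, and the link under test carries a fixed 620 plus half of the second share of each split (which reproduces the 920, 970, and 1070 totals from the earlier slides).

```python
CAPACITY = 1000
BACKGROUND = 620

# (ToR1 split, ToR5 split) as printed on the slide.
INITIAL = ((300, 300), (300, 300))
INTERMEDIATE = ((200, 400), (450, 150))
FINAL = ((0, 600), (500, 100))

def contribution(split):
    """Assumed model: half of the second share crosses the link being checked."""
    return split[1] / 2

def worst_case(before, after):
    """Highest load when each ToR may independently be in its old or new state."""
    return BACKGROUND + sum(
        max(contribution(b), contribution(a)) for b, a in zip(before, after)
    )

one_step = worst_case(INITIAL, FINAL)
step_1 = worst_case(INITIAL, INTERMEDIATE)
step_2 = worst_case(INTERMEDIATE, FINAL)
print(one_step, step_1, step_2)
```

Under this model the direct transition can reach 1070, while each of the two smaller steps stays within 1000 in every mixed state.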

Page 12: How zUpdate performs congestion-free update

[Figure: workflow connecting the operator, zUpdate, and the data center network]

• Operator: update scenario and update requirements
• Data center network: current traffic distribution
• zUpdate: target traffic distribution and intermediate traffic distributions

Page 13: Key technical issues

• Describing traffic distribution

• Representing update requirements

• Defining conditions for congestion-free transition

• Computing an update plan

• Implementing an update plan

Page 14: Describing traffic distribution

[Figure: flow f crossing ToR switch s1, AGG switches s2 and s3, and CORE switches s4 and s5]

• $l^{f}_{v,u}$: flow f's load on the link from switch v to u
• Traffic distribution: the set of $l^{f}_{v,u}$ over all flows f and all links $e_{v,u}$
• Example loads shown for f: 600 in total, 300 on each of two links, then 150 on each of the links after them
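One simple way to hold a traffic distribution in code is a mapping from (flow, v, u) to load. The sketch below uses the 600 / 300 / 150 values from the slide; the exact set of links between s1 to s5 is an assumption, and the helper names are hypothetical.

```python
from collections import defaultdict

# distribution[(flow, v, u)] = load of `flow` on the link from switch v to u.
distribution = {
    ("f", "s1", "s2"): 300, ("f", "s1", "s3"): 300,
    ("f", "s2", "s4"): 150, ("f", "s2", "s5"): 150,
    ("f", "s3", "s4"): 150, ("f", "s3", "s5"): 150,
}

def link_loads(dist):
    """Total load per link, summed over all flows."""
    totals = defaultdict(int)
    for (_flow, v, u), load in dist.items():
        totals[(v, u)] += load
    return dict(totals)

def conserves_flow(dist, flow, switch):
    """True if `flow` leaves `switch` with the same load it entered with."""
    inbound = sum(l for (f, _v, u), l in dist.items() if f == flow and u == switch)
    outbound = sum(l for (f, v, _u), l in dist.items() if f == flow and v == switch)
    return inbound == outbound

print(link_loads(distribution)[("s1", "s2")])
print(conserves_flow(distribution, "f", "s2"))
```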

Page 15: Representing update requirements

[Figure: flow f crossing ToR switch s1, AGG switches s2 and s3, and CORE switches s4 and s5]

To upgrade switch s2:
• Drain s2
• Constraint: f's load on the links into s2 = 0

To restore ECMP:
• When s2 recovers
• Constraint: f's load on the link into s2 = f's load on the link into s3
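The two requirements can be written as predicates over a traffic distribution. This is a minimal sketch that assumes the distribution is stored as a mapping from (flow, v, u) to load; `is_drained` and `is_ecmp` are hypothetical names, and the example values are illustrative.

```python
def is_drained(dist, switch):
    """Upgrade requirement: no flow puts load on any link into `switch`."""
    return all(load == 0 for (_f, _v, u), load in dist.items() if u == switch)

def is_ecmp(dist, flow, switch):
    """ECMP requirement: `flow` is split equally over the listed links leaving `switch`."""
    out = [load for (f, v, _u), load in dist.items() if f == flow and v == switch]
    return len(set(out)) <= 1

# While s2 is being upgraded, all of f's traffic from s1 goes through s3.
drained = {("f", "s1", "s2"): 0, ("f", "s1", "s3"): 600}
# After s2 recovers, s1 splits f equally again.
restored = {("f", "s1", "s2"): 300, ("f", "s1", "s3"): 300}

print(is_drained(drained, "s2"), is_ecmp(restored, "f", "s1"))
```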

Page 16: Switch asynchronization exponentially inflates the possible load values

Transition from old traffic distribution to new traffic distribution

[Figure: flow f from ingress to egress across switches 1 to 8; link $e_{7,8}$ carries load $l^{f}_{7,8}$]

• Asynchronous updates can result in $2^{5}$ possible load values on link $e_{7,8}$ during the transition.
• In large networks, it is infeasible to check every possible load value against the link capacity.
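A small simulation shows how the states multiply. The 8-switch topology and the split ratios below are hypothetical (the figure's actual links are not recoverable from the transcript); five switches change their rules, so there are 2**5 mixed states to consider for link (7, 8).

```python
from itertools import product

# Hypothetical DAG; each rule maps next hop -> fraction of incoming traffic.
OLD = {1: {2: 0.5, 3: 0.5}, 2: {4: 0.5, 5: 0.5}, 3: {5: 0.5, 6: 0.5},
       4: {7: 1.0}, 5: {7: 0.5, 8: 0.5}, 6: {8: 1.0}, 7: {8: 1.0}}
# New rules for the five switches that change during the update.
NEW = {1: {2: 1.0}, 2: {4: 1.0}, 3: {5: 1.0}, 4: {7: 0.5, 8: 0.5}, 5: {7: 1.0}}
ORDER = [1, 2, 3, 4, 5, 6, 7]  # topological order, ingress first

def load_on_link(rules, demand, link):
    """Push `demand` from the ingress through `rules`; return the load on `link`."""
    inflow = {ORDER[0]: demand}
    load = 0.0
    for v in ORDER:
        for u, frac in rules[v].items():
            sent = inflow.get(v, 0.0) * frac
            inflow[u] = inflow.get(u, 0.0) + sent
            if (v, u) == link:
                load = sent
    return load

def loads_over_all_states(demand=100.0, link=(7, 8)):
    """Load on `link` for every subset of switches that has already updated."""
    changing = sorted(NEW)
    result = {}
    for flipped in product((False, True), repeat=len(changing)):
        rules = dict(OLD)
        for sw, done in zip(changing, flipped):
            if done:
                rules[sw] = NEW[sw]
        result[flipped] = load_on_link(rules, demand, link)
    return result

states = loads_over_all_states()
print(len(states), states[(False,) * 5], states[(True,) * 5], max(states.values()))
```

In this toy topology the all-old and all-new states put the same load on the link, yet some mixed states put twice as much on it, which is why checking only the endpoints is not enough.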

Page 17: Two-phase commit reduces the possible load values to two

Transition from old traffic distribution to new traffic distribution

[Figure: flow f from ingress to egress across switches 1 to 8, with a "version flip" label]

• With two-phase commit, f's load on link $e_{v,u}$ has only two possible values throughout a transition: $l^{f}_{v,u}(\text{old})$ or $l^{f}_{v,u}(\text{new})$
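The idea can be sketched with versioned rule tables: every switch keeps its old rule while the new one is installed, and traffic follows whichever version the ingress stamps on it. The topology, the ratios, and all names below are hypothetical, and the model ignores packets that are in flight at the moment of the flip.

```python
# tables[switch][version] maps next hop -> fraction of traffic (made-up topology).
tables = {
    1: {"old": {2: 0.5, 3: 0.5}},
    2: {"old": {4: 1.0}},
    3: {"old": {5: 1.0}},
}
NEW_RULES = {1: {2: 1.0}, 2: {4: 0.25, 5: 0.75}, 3: {5: 1.0}}
ORDER = [1, 2, 3]  # topological order, ingress first

def load_on_link(version, demand, link):
    """Forward traffic stamped with `version`; every switch uses that version's rule."""
    inflow = {ORDER[0]: demand}
    load = 0.0
    for v in ORDER:
        for u, frac in tables[v][version].items():
            sent = inflow.get(v, 0.0) * frac
            inflow[u] = inflow.get(u, 0.0) + sent
            if (v, u) == link:
                load = sent
    return load

observed = []
ingress_version = "old"
# Phase 1: install new-version rules in an arbitrary order; traffic is still stamped "old".
for switch in (2, 1, 3):
    tables[switch]["new"] = NEW_RULES[switch]
    observed.append(load_on_link(ingress_version, 100.0, (2, 4)))
# Phase 2: the ingress flips the version it stamps on packets.
ingress_version = "new"
observed.append(load_on_link(ingress_version, 100.0, (2, 4)))
print(observed)
```

Whatever order the switches are updated in, link (2, 4) only ever carries its old load or its new load, never a mixture of the two rule sets.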

Page 18: Flow asynchronization exponentially inflates the possible load values

[Figure: flows f1 and f2 across switches 1 to 8, sharing link $e_{7,8}$]

Possible values of f1 + f2 on link $e_{7,8}$:
• $l^{f_1}_{7,8}(\text{old}) + l^{f_2}_{7,8}(\text{old})$
• $l^{f_1}_{7,8}(\text{old}) + l^{f_2}_{7,8}(\text{new})$
• $l^{f_1}_{7,8}(\text{new}) + l^{f_2}_{7,8}(\text{old})$
• $l^{f_1}_{7,8}(\text{new}) + l^{f_2}_{7,8}(\text{new})$

Asynchronous updates to N independent flows can result in $2^{N}$ possible load values on link $e_{7,8}$.
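The growth with the number of flows is easy to reproduce. In this sketch `possible_loads` is a hypothetical helper and the per-flow loads are made-up values; each flow independently contributes either its old or its new load to the link.

```python
from itertools import product

def possible_loads(old, new):
    """All loads a link can see when each flow is independently old or new.

    `old[i]` and `new[i]` are flow i's load on the link before and after the update.
    """
    choices = list(zip(old, new))
    return [sum(combo) for combo in product(*choices)]

# Three flows give 2**3 = 8 combinations.
loads = possible_loads(old=[100, 200, 400], new=[300, 50, 0])
print(len(loads), sorted(loads))
```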

Page 19: Handling flow asynchronization

[Figure: flows f1 and f2 across switches 1 to 8, sharing link $e_{7,8}$]

[Congestion-free transition constraint] There is no congestion throughout a transition if and only if:

$$\forall e_{v,u}: \sum_{\forall f} \max\left\{ l^{f}_{v,u}(\text{old}),\ l^{f}_{v,u}(\text{new}) \right\} \le c_{v,u}$$

where $c_{v,u}$ is the capacity of link $e_{v,u}$.

Basic idea: the largest of the four possible loads on $e_{7,8}$,
• $l^{f_1}_{7,8}(\text{old}) + l^{f_2}_{7,8}(\text{old})$
• $l^{f_1}_{7,8}(\text{old}) + l^{f_2}_{7,8}(\text{new})$
• $l^{f_1}_{7,8}(\text{new}) + l^{f_2}_{7,8}(\text{old})$
• $l^{f_1}_{7,8}(\text{new}) + l^{f_2}_{7,8}(\text{new})$

equals $\max\{ l^{f_1}_{7,8}(\text{old}), l^{f_1}_{7,8}(\text{new}) \} + \max\{ l^{f_2}_{7,8}(\text{old}), l^{f_2}_{7,8}(\text{new}) \}$.
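The constraint replaces an exponential enumeration with one sum per link. The sketch below checks it and cross-checks the result against brute force; the function names, the data layout (per-link lists of per-flow loads), and the example numbers are assumptions made for illustration.

```python
from itertools import product

def worst_case_load(old, new):
    """Sum over flows of max(old load, new load) on one link."""
    return sum(max(o, n) for o, n in zip(old, new))

def congestion_free(old_dist, new_dist, capacity):
    """Check the constraint on every link.

    `old_dist[link]` and `new_dist[link]` list per-flow loads in the same flow order.
    """
    return all(
        worst_case_load(old_dist[link], new_dist[link]) <= capacity[link]
        for link in capacity
    )

def brute_force_max(old, new):
    """Largest load over all 2**N old/new combinations, for cross-checking."""
    return max(sum(combo) for combo in product(*zip(old, new)))

old_dist = {("7", "8"): [400, 300]}
new_dist = {("7", "8"): [100, 500]}
print(worst_case_load([400, 300], [100, 500]), brute_force_max([400, 300], [100, 500]))
print(congestion_free(old_dist, new_dist, {("7", "8"): 1000}))
```

The sum of per-flow maxima matches the brute-force maximum because each flow can be chosen independently, so the check costs one pass over the flows instead of 2**N evaluations.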

Page 20: Computing congestion-free transition plan

Formulated as linear programming:
• Constant: current traffic distribution
• Variable: intermediate traffic distributions (one per intermediate step)
• Variable: target traffic distribution
• Constraint: congestion-free (between each pair of consecutive distributions)
• Constraint: update requirements (on the target traffic distribution)
• Constraint: deliver all traffic, flow conservation
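The max in the congestion-free constraint is not linear as written. One standard way to express it in a linear program, shown here as a sketch rather than as the authors' exact formulation, is to introduce an auxiliary variable $m^{f}_{v,u}$ (a name chosen for this note) that upper-bounds both loads of a flow on a link:

```latex
\begin{aligned}
& m^{f}_{v,u} \ge l^{f}_{v,u}(\text{old}), \qquad m^{f}_{v,u} \ge l^{f}_{v,u}(\text{new}) && \forall f,\ \forall e_{v,u} \\
& \sum_{\forall f} m^{f}_{v,u} \le c_{v,u} && \forall e_{v,u}
\end{aligned}
```

These inequalities are feasible exactly when $\sum_{\forall f} \max\{ l^{f}_{v,u}(\text{old}), l^{f}_{v,u}(\text{new}) \} \le c_{v,u}$, since the smallest admissible $m^{f}_{v,u}$ is the maximum of the two loads.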

Page 21: Implementing an update plan

• Computation time
• Switch table size limit
• Update overhead
• Failure during transition
• Traffic demand variation

[Figure: flows divided into critical flows and other flows]
• Critical flows (flows traversing bottleneck links): weighted ECMP
• Other flows: ECMP
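Selecting the critical flows might look like the sketch below. The slide only says that critical flows are those traversing bottleneck links; the rule used here for what counts as a bottleneck (total load above a threshold fraction of capacity) is an assumption of this sketch, as are the names and the example loads.

```python
def critical_flows(loads, capacity, threshold=1.0):
    """Return the flows that place load on a bottleneck link.

    `loads[link][flow]` is the flow's load on the link. A link counts as a
    bottleneck here when its total load exceeds `threshold * capacity[link]`;
    that rule is an assumption made for this sketch.
    """
    bottlenecks = {
        link for link, per_flow in loads.items()
        if sum(per_flow.values()) > threshold * capacity[link]
    }
    return {
        flow
        for link in bottlenecks
        for flow, load in loads[link].items()
        if load > 0
    }

loads = {"link_a": {"f1": 600, "f2": 500}, "link_b": {"f3": 200}}
capacity = {"link_a": 1000, "link_b": 1000}
critical = critical_flows(loads, capacity)
print(sorted(critical))
```

Only the flows in `critical` would need flow-specific weighted-ECMP rules; the rest can stay on plain ECMP, which keeps the number of switch rules small.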

Page 22: Evaluations

• Testbed experiments

• Large-scale trace-driven simulations

Page 23: Testbed setup

[Figure: Clos testbed topology with ToR, AGG, and CORE layers, fed by a traffic generator; AGG1 is drained]

• Switch: OpenFlow 1.0
• Link: 10 Gbps
• Traffic: ToR5: 6 Gbps, ToR8: 6 Gbps, ToR6,7: 6.2 Gbps

Page 24: zUpdate achieves congestion-free switch upgrade

[Plot: real-time link utilization (y-axis, 0.8 to 1.05) over time in seconds (x-axis, 0 to 25) for links CORE1-AGG3 and CORE3-AGG4]

[Figure: the testbed topology in its initial, intermediate, and final states]
• Initial: 3 Gbps, 3 Gbps, 3 Gbps, 3 Gbps
• Intermediate: 2 Gbps, 4 Gbps, 4.5 Gbps, 1.5 Gbps
• Final: 0, 6 Gbps, 5 Gbps, 1 Gbps

Page 25: One-step update causes transient congestion

[Plot: real-time link utilization (y-axis, 0.7 to 1.1) over time in seconds (x-axis, -1 to 15) for links CORE1-AGG3 and CORE3-AGG4]

[Figure: the testbed topology in its initial and final states]
• Initial: 3 Gbps, 3 Gbps, 3 Gbps, 3 Gbps
• Final: 0, 6 Gbps, 5 Gbps, 1 Gbps

Page 26: Large-scale trace-driven simulations

[Figure: a production DCN topology with ToR, AGG, and CORE layers and a new switch; legend: flows, test flows (1%)]

Page 27: zUpdate beats alternative solutions

[Bar chart: transition loss rate and post-transition loss rate (%), y-axis 0 to 15, for each scheme below]

Scheme            #step
zUpdate           2
zUpdate-OneStep   1
ECMP-OneStep      1
ECMP-Planned      300+

Page 28: Conclusion

• Switch and flow asynchronization can cause severe congestion during DCN updates

• We present zUpdate for congestion-free DCN updates
  • Novel algorithms to compute update plan
  • Practical implementation on commodity switches
  • Evaluations in real DCN topology and update scenarios

The End

Page 29: Thanks & Questions?

Page 30: Updating DCN is a painful process

This is Bob, an operator, performing a switch upgrade.

Interactive applications ask:
• Any performance disruption?
• How bad will the latency be?
• How long will the disruption last?
• What servers will be affected?

Bob: "Uh?…"

Page 31: Network update: a tussle between applications and operators

• Applications want network updates to be fast and seamless
  • Updates can happen on demand
  • No performance disruption during an update

• Network update is time consuming
  • Nowadays, an update is planned and executed by hand
  • Rolling back in unplanned cases

• Network update is risky
  • Human errors
  • Accidents

Page 32: Challenges in congestion-free DCN update

• Many switches are involved

• Multi-step plan

• Different scenarios have distinct requirements
  • Switch upgrade/failure recovery
  • New switch on-boarding
  • Load balancer reconfiguration
  • VM migration

• Coordination between changes in routing (network) and traffic demand (application)

Help!

Page 33: Related work

• SWAN [SIGCOMM’13]
  • Maximizing the network utilization
  • Tunnel-based traffic engineering

• Reitblatt et al. [SIGCOMM’12]
  • Control plane consistency during network updates
  • Per-packet and per-flow consistency cannot guarantee "no congestion"

• Raza et al. [ToN’2011], Ghorbani et al. [HotSDN’12]
  • Only a specific scenario (IGP update, VM migration)
  • One link weight change or one VM migration at a time

