OFRewind: Enabling Record & Replay Troubleshooting for Networks Andreas Wundsam • Dan Levin Srini Seetharaman • Anja Feldmann An-Institut der Technischen Universität Berlin USENIX ATC 2011
Page 1:

OFRewind: Enabling Record & Replay

Troubleshooting for Networks

Andreas Wundsam • Dan Levin • Srini Seetharaman • Anja Feldmann

An-Institut der Technischen Universität Berlin

USENIX ATC 2011

Page 2:

Quick 101

classical switch

Page 3:

Quick 101

OpenFlow switch

PKT_IN

FLOW_MOD

entry

Page 4:

OpenFlow entry

Rule | Action | Stats

Rule (match fields, + mask): Switch Port, MAC src, MAC dst, Eth type, VLAN ID, IP Src, IP Dst, IP Prot, TCP sport, TCP dport

Action:

1. Forward packet to port(s)

2. Encapsulate and forward to controller

3. Drop packet

4. Send to normal processing pipeline

Stats: Packet + byte counters

(Figure from the Openflow Intro Presentation, N. McKeown)
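The Rule / Action / Stats structure of a flow entry can be sketched roughly as follows. This is only an illustration: the names `FlowEntry`, `matches`, and `lookup` are hypothetical, a rule field set to `None` stands in for a masked (wildcarded) field, and priorities and the actual OpenFlow wire format are ignored. The sketch assumes a table miss is handed to the controller as a PKT_IN.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical names throughout: this sketches the Rule / Action / Stats
# idea, not the OpenFlow wire format or any switch implementation.

@dataclass
class FlowEntry:
    rule: Dict[str, Optional[object]]  # header field -> value, None = wildcard
    action: str                        # e.g. "forward:2", "controller", "drop", "normal"
    packet_count: int = 0              # Stats: packet counter
    byte_count: int = 0                # Stats: byte counter

    def matches(self, pkt: Dict[str, object]) -> bool:
        """True if every non-wildcard rule field equals the packet's value."""
        return all(v is None or pkt.get(k) == v for k, v in self.rule.items())


def lookup(table, pkt, length):
    """Return the action of the first matching entry and update its counters.

    Assumption: a table miss is sent to the controller (PKT_IN).
    """
    for entry in table:
        if entry.matches(pkt):
            entry.packet_count += 1
            entry.byte_count += length
            return entry.action
    return "controller"
```

With a single entry matching `ip_dst == "10.0.0.2"` and a wildcarded `tcp_dport`, a packet to that address is forwarded and counted, while any other packet falls through to the controller.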

Page 5:

Back to the topic of my talk: OFRewind!

Page 6:

Motivating use case

[Figure: CPU utilization (%) of an OpenFlow switch, plotted with SET CONFIG message arrivals, 20:00 to 00:00, Nov-06-2009 to Nov-07-2009]

Page 7:

[Figure: CPU utilization (%) vs. arrivals of PKT_IN msgs, 20:00 to 00:00, Nov-06-2009 to Nov-07-2009]

No correlation!

Page 8:

[Figure: CPU utilization (%) vs. arrivals of FLOW_MOD msgs, 20:00 to 00:00, Nov-06-2009 to Nov-07-2009]

No correlation!

Page 9:

No correlation!

[Figures: CPU utilization (%) plotted against arrivals of each message type (PACKET IN, PACKET OUT, FLOW EXPIRED, FLOW MOD, STATS REPLY, STATS REQUEST), 20:00 to 00:00, Nov-06-2009 to Nov-07-2009]

Page 10:

Clueless...

• Switch is a black box component

• Can't inspect internal state, source code

• No analytical explanation for the behavior

• Message arrivals do not correlate with symptoms

• Existing interfaces (CLI, SNMP) too coarse-grained

Page 11:

Troubleshooting networks is hard

huge, critical black boxes timing / races

Page 12:

A solution?

Record

In production

Trouble-shoot

Replay

Reproduce at convenient location / pace

Page 13:

Existing approaches

Endhost Replay Debugging

Fully deterministic replay, via binary instrumentation / virtualization

✘ no black boxes

✘ scalability?

TCPDump / TCPReplay et al.

Capture/Replay events

✘ Single vantage point, no network wide view

✘ Scalability due to dataplane datarates

Page 14:

Existing approaches

Full recording of all events feasible?

Page 15:

However...

• Not all traffic is equal (ctrl plane: 1% traffic, 95-99% bugs!)*

• Behavior of many network devices: largely deterministic w.r.t. control plane network events

* Altekar / Stoica, 2010

Page 16:

Go Network* Wide / Always On!

Record: events + traffic

• selective: record important traffic (control)

• skip/aggregate less important traffic (data plane)

Replay: reinject events + traffic

• "best effort replay"

• replay partial recordings

• reproduce problem at a chosen time / location

* controller domain
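A minimal sketch of the selective-recording idea: control-plane messages are stored in full, while data-plane packets are only aggregated into per-flow counters. The class and method names (`SelectiveRecorder`, `on_control`, `on_data`) are hypothetical and are not taken from OFRecord.

```python
from collections import defaultdict


class SelectiveRecorder:
    """Hypothetical recorder: full control-plane log, aggregated data plane."""

    def __init__(self):
        # Each control entry: (timestamp, switch, msg_type, payload)
        self.control_log = []
        # flow key -> [packet count, byte count]
        self.data_summary = defaultdict(lambda: [0, 0])

    def on_control(self, ts, switch, msg_type, payload):
        # Control traffic is low-volume, so every message is kept verbatim.
        self.control_log.append((ts, switch, msg_type, payload))

    def on_data(self, flow_key, length):
        # Data-plane traffic is only counted, never stored.
        counters = self.data_summary[flow_key]
        counters[0] += 1
        counters[1] += length
```

The storage cost of the data plane then grows with the number of flows rather than the number of packets, which is the trade-off the slide describes as "skip/aggregate".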

Page 17:

Replay Tweaking

Localize problems through:

• Device mapping: different devices / versions; investigate regressions / vendor implementation issues

• Time dilation: scale time; investigate timing issues

• Trace bisection: iteratively replay subselected traffic; localize events that trigger failure
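Two of these tweaks, time dilation and subselection of traffic, can be sketched as simple transformations of a recorded trace. The trace layout (a time-sorted list of `(timestamp, msg_type, payload)` tuples) and the function names `dilate` and `subselect` are assumptions made for this illustration, not OFReplay's interface.

```python
def dilate(trace, factor):
    """Scale the gaps between messages by `factor`.

    A factor above 1 slows the replay down, below 1 speeds it up.
    `trace` is a list of (timestamp_seconds, msg_type, payload), sorted by time.
    """
    if not trace:
        return []
    t0 = trace[0][0]
    return [(t0 + (ts - t0) * factor, mtype, payload)
            for ts, mtype, payload in trace]


def subselect(trace, keep_types):
    """Keep only the messages whose type is in `keep_types`."""
    return [m for m in trace if m[1] in keep_types]
```

For example, `dilate(trace, 2.0)` doubles every inter-message gap while keeping the first timestamp fixed, and `subselect(trace, {"STATS_REQ"})` yields a trace containing only STATS_REQ messages.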

Page 18:

Goals

✓ Record a controller domain

✓ Scalable, selective, consistent

✓ Even with black boxes

✓ Coordinated replay

✓ Replay tweaking

✓ Localize problems

Page 19:

Non-Goals

✘ Root cause analysis

✘ Automatic configuration of what to record

✘ Fully deterministic replay

Page 20:

Introducing the tool

Page 21:

System design

2 components of 2 modules each:

Page 22:

OFRecord

[Figure: OFRecord architecture; labels: OpenFlow controller, OFRecord, sw1, sw2, sw3, c1 to c6, DataStores, p1, p2, pm]

Page 23:

OFReplay

[Figure: OFReplay architecture; labels: OFReplay, sw1, sw2, sw3, DataStores, p1, p2, pm]

Page 24:

OFReplay

[Figure: OFReplay architecture; labels: OpenFlow controller, OFReplay]

Page 25:

Typical Usage

• Deploy OFRecord in production environment -> proxy to 'regular' controller

• Always-on recording: OF messages, control plane, data plane summaries

• Alter selection rules as necessary

• Deploy OFReplay in lab environment

• Localize bugs / validate bug fixes

Page 26:

Case studies

1. Debugging Black box components

• CPU inflation in an OpenFlow switch

2. Debugging OpenFlow controllers

• NOX problem

+ Others (see poster/paper)

Page 27:

Back to CPU inflation

• Replay and bisect the trace by message type

[Figure: CPU utilization (%) and SET CONFIG message arrivals, 20:00 to 00:00, Nov-06-2009 to Nov-07-2009]

Page 28:

Back to CPU inflation

• Replay and bisect the trace by message type

• When replaying STATS_REQ msgs...

STATS_REQ msgs reproduce the problem, even though there is no correlation in arrival times

[Figure (Record): CPU utilization (%) and STATS REQUEST arrivals, 20:00 to 00:00, Nov-06-2009 to Nov-07-2009]

[Figure (Replay): replayed traffic characteristics; CPU usage (%) and flow setup time (ms) over time, 08:06 to 10:36]
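Bisecting a trace by message type can be sketched as a halving search over the set of message types. This is an illustrative sketch, not OFRewind's procedure: `bisect_by_type` and the `reproduces` oracle are hypothetical names, and the oracle stands in for "replay this subtrace against the device and check whether the symptom (here, the CPU inflation) appears". The sketch assumes that some subset of types is sufficient on its own to trigger the problem.

```python
def bisect_by_type(trace, reproduces):
    """Narrow down the message types needed to trigger a failure.

    `trace` is a list of (timestamp, msg_type, payload) tuples.
    `reproduces(subtrace)` is a caller-supplied oracle that replays the
    subtrace and returns True if the symptom shows up.
    """
    candidates = sorted({m[1] for m in trace})
    while len(candidates) > 1:
        half = len(candidates) // 2
        left, right = candidates[:half], candidates[half:]
        if reproduces([m for m in trace if m[1] in left]):
            candidates = left
        elif reproduces([m for m in trace if m[1] in right]):
            candidates = right
        else:
            # The failure needs types from both halves; stop narrowing.
            break
    return candidates
```

Each iteration costs at most two replays, so a trace with a handful of message types needs only a few replay runs to isolate a single triggering type.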

Page 29:

Debugging controllers: NOX problem

• Problem record: Messages initiated by one specific device don't reach NOX controller module

• Not reproducible at the lab

Page 30:

Debugging controllers: NOX problem

• Record at end user site

• Replay at lab towards NOX

• Use host-level debugging to analyze NOX behavior

Page 31:

Debugging controllers: NOX problem

• Trigger: specific source MAC address

00:26:55:da:3a:40

• NOX has an 'intelligent' MAC address parser that handles both binary and ASCII MAC addresses

• 0x3a is the ASCII code for ':' and appears as a byte in the binary form of this MAC :)

0x3a == ':'
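The class of bug can be illustrated with a small sketch. This is not NOX's code: `naive_parse_mac` is a hypothetical parser that guesses the input format from its content, which is enough to show why a binary MAC containing the byte 0x3a is misread as ASCII text.

```python
def naive_parse_mac(raw: bytes) -> bytes:
    """Hypothetical 'intelligent' parser: guess the format from the content.

    Any input containing ':' is treated as ASCII text such as
    b"00:26:55:da:3a:40"; everything else is taken as 6 raw bytes.
    """
    if b":" in raw:  # the ambiguous heuristic
        parts = raw.split(b":")
        return bytes(int(p, 16) for p in parts)
    return raw


# The MAC from the slide in binary form; its fifth byte is 0x3a, i.e. ':'.
binary_mac = bytes.fromhex("002655da3a40")
```

Here the ASCII form `b"00:26:55:da:3a:40"` parses to `binary_mac` as intended, but passing `binary_mac` itself takes the ASCII branch and fails with a `ValueError`, because the bytes around the embedded ':' are not hexadecimal text. A binary MAC without a 0x3a byte passes through unchanged, which is why only one specific device triggered the problem.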

Page 32:

Performance Evaluation

• Record: production environment

• OFRecord controller performance

• Impact on switch performance

• Replay: lab environment

• Timing accuracy

Page 33:

OFRecord controller performance

Median # Flows handled by different controllers (measured with cbench)

[Figure: flow rate/s vs. # switches (0 to 60) for flowvisor, nox-pyswitch, nox-switch, ofrecord, ofrecord-data, of-simple; curve annotations: "NOX, Flowvisor, OFRecord" and "SimpleController"]

Page 34:

Impact on switch performance

• Single UDP packet flows created using hping

• sent to switches of two different vendors

• measure # flows successfully forwarded

• compare OFRecord vs. SimpleCtrl

[Figure: flows rec/s vs. flows sent/s (5 to 5000) for of-record, of-record-data, and of-simple on Vendor A and Vendor B switches; annotations: "Vendor A saturates", "Vendor B breaks down"]

OFRecord: limited switch performance penalty

Page 35:

End-to-end performance

Rate [Flows/s]    Drop %    sd (timing) [ms]
5                 0         4.5
10                0         15.6
20                0         21.1
50                0         23.4
100               0         10.9
200               0         13.9
400               19        15.8

Page 36:

Summary

New Primitives:

Selective, consistent, multigranularity Network Recording

&

Adaptive, coordinated, best-effort Network Replay

• Combined in OFRewind, an OpenFlow-based tool for Network Record & Replay

• http://www.openflow.org/wk/index.php/OFRewind

• Enables practical record and replay of network domains

• reproduce problems at convenient time and place

Page 37:

Future work

• Scale to larger topology sizes, more complex networks

• Extend to production quality tool

• Improve timing for very fast flow rates

• Automated regression tests through standard sets of traces

Page 38:

Thank you.
