Troubleshooting SDN Control Software with Minimal Causal Sequences

Date post: 26-Feb-2016
Colin Scott, Andreas Wundsam, Barath Raghavan, Aurojit Panda, Andrew Or, Jefferson Lai, Eugene Huang, Zhi Liu, Ahmed El-Hassany, Sam Whitlock, Hrishikesh B. Acharya, Kyriakos Zarifis, Arvind Krishnamurthy, Scott Shenker. Troubleshooting SDN Control Software with Minimal Causal Sequences
Transcript
Page 1: Troubleshooting SDN Control Software with Minimal Causal Sequences

Colin Scott, Andreas Wundsam, Barath Raghavan, Aurojit Panda, Andrew Or, Jefferson Lai, Eugene Huang, Zhi Liu, Ahmed El-Hassany, Sam Whitlock, Hrishikesh B. Acharya, Kyriakos Zarifis, Arvind Krishnamurthy, Scott Shenker

Troubleshooting SDN Control Software with Minimal Causal Sequences

Page 2: Troubleshooting SDN Control Software with Minimal Causal Sequences

SDN is a Distributed System

[Diagram: Controller 1, Controller 2, Controller N]

Page 3: Troubleshooting SDN Control Software with Minimal Causal Sequences

Distributed Systems are Bug-Prone

Distributed correctness faults:
• Race conditions
• Atomicity violations
• Deadlock
• Livelock
• …

+ Normal software bugs

Page 4: Troubleshooting SDN Control Software with Minimal Causal Sequences

Example Bug (Floodlight, 2012)

[Diagram: message sequence among Master, Backup, and Switch. Labels: Ping, Pong, Ping; Crash; Link Failure; Notify; ACK; Notify; Master; Blackhole persists!]

Page 5: Troubleshooting SDN Control Software with Minimal Causal Sequences

Best Practice: Logs

Human analysis of log files

Page 6: Troubleshooting SDN Control Software with Minimal Causal Sequences

Best Practice: Logs

[Diagram: message sequence among Master, Backup, and Switch. Labels: Ping, Pong, Ping; Crash; Link Failure; Notify; ACK; Notify; Master; Blackhole persists!]

Page 7: Troubleshooting SDN Control Software with Minimal Causal Sequences

Best Practice: Logs

[Diagram: Controllers A, B, and C; Switches 1 through 9; a question mark]

Page 8: Troubleshooting SDN Control Software with Minimal Causal Sequences

Our Goal

Allow developers to focus on fixing the underlying bug

Page 9: Troubleshooting SDN Control Software with Minimal Causal Sequences

Problem Statement

Identify a minimal sequence of inputs that triggers the bug, in a blackbox fashion

Page 10: Troubleshooting SDN Control Software with Minimal Causal Sequences

Why minimization?

G. A. Miller. The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information. Psychological Review ’56.

Smaller event traces are easier to understand

Page 11: Troubleshooting SDN Control Software with Minimal Causal Sequences

Minimal Causal Sequence

Output: V (i.e. violation occurs)

Page 12: Troubleshooting SDN Control Software with Minimal Causal Sequences

Minimal Causal Sequence

[Diagram: Controllers A, B, and C; Switches 1 through 9; a question mark]

Page 13: Troubleshooting SDN Control Software with Minimal Causal Sequences

Minimal Causal Sequence

[Diagram: message sequence among Master, Backup, and Switch. Labels: Ping, Pong, Ping; Crash; Link Failure; Notify; ACK; Notify; Master; Blackhole persists!]

Page 14: Troubleshooting SDN Control Software with Minimal Causal Sequences

Outline

• What are we trying to do?

• How do we do it?

• Does it work?

Page 15: Troubleshooting SDN Control Software with Minimal Causal Sequences

Where Bugs are Found

• Symptoms found:
  • On developer’s local machine (unit and integration tests)

Page 16: Troubleshooting SDN Control Software with Minimal Causal Sequences

Where Bugs are Found

• Symptoms found:
  • On developer’s local machine (unit and integration tests)
  • In production environment

Page 17: Troubleshooting SDN Control Software with Minimal Causal Sequences

Where Bugs are Found

• Symptoms found:
  • On developer’s local machine (unit and integration tests)
  • In production environment
  • On quality assurance testbed

Page 18: Troubleshooting SDN Control Software with Minimal Causal Sequences

Approach: Modify Testbed

[Diagram: QA Testbed containing a Test Coordinator; Control Software containing Controller 1 through Controller N]

Page 19: Troubleshooting SDN Control Software with Minimal Causal Sequences

Testbed Observables

• Invariant violation detected by testbed
• Event sequence:
  • External events (link failures, host migrations, ...) injected by testbed
  • Internal events (message deliveries) observed by testbed (incomplete)

Page 20: Troubleshooting SDN Control Software with Minimal Causal Sequences

Approach: Delta Debugging [1]

1. A. Zeller et al. Simplifying and Isolating Failure-Inducing Input. IEEE TSE ’02

[Diagram: Replay of event subsequences, each marked ✔, ✗, or ?]

Events (link failures, crashes, host migrations) injected by test orchestrator
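A minimal Python sketch of the delta debugging loop (ddmin, as described in the Zeller reference above) may help make the approach concrete. It is an illustration under assumptions, not the system's implementation: `triggers_violation` is a hypothetical stand-in for one replay in the testbed that reports whether the invariant violation reappears, and the sketch ignores internal events and scheduling.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def split(events: Sequence[T], n: int) -> List[List[T]]:
    """Split events into n contiguous chunks of near-equal size."""
    chunks, start = [], 0
    for i in range(n):
        end = start + (len(events) - start) // (n - i)
        chunks.append(list(events[start:end]))
        start = end
    return chunks


def ddmin(events: Sequence[T],
          triggers_violation: Callable[[List[T]], bool]) -> List[T]:
    """Shrink events to a subsequence that still triggers the violation.

    triggers_violation is hypothetical: it stands for one testbed replay.
    The result is locally minimal with respect to the chunks tried.
    """
    events = list(events)
    n = 2  # current granularity: number of chunks
    while len(events) >= 2:
        chunks = split(events, n)
        reduced = False
        # First try each chunk on its own.
        for chunk in chunks:
            if triggers_violation(chunk):
                events, n, reduced = chunk, 2, True
                break
        # Then try removing one chunk at a time.
        if not reduced:
            for i in range(len(chunks)):
                rest = [e for j, c in enumerate(chunks) if j != i for e in c]
                if triggers_violation(rest):
                    events, n, reduced = rest, max(n - 1, 2), True
                    break
        # Nothing worked: refine the granularity, or stop at single events.
        if not reduced:
            if n >= len(events):
                break
            n = min(len(events), 2 * n)
    return events
```

For example, if a violation needs both event 3 and event 7 out of ten recorded inputs, `ddmin(list(range(10)), lambda s: 3 in s and 7 in s)` narrows the input down to those two events.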

Page 21: Troubleshooting SDN Control Software with Minimal Causal Sequences

Key Point

Must Carefully Schedule Replay Events To Achieve Minimization!

Page 22: Troubleshooting SDN Control Software with Minimal Causal Sequences

Challenges

• Asynchrony

• Divergent execution

• Non-determinism

Page 23: Troubleshooting SDN Control Software with Minimal Causal Sequences

Challenge: Asynchrony

• Asynchrony definition:
  • No fixed upper bound on relative speed of processors
  • No fixed upper bound on time for messages to be delivered

Dwork & Lynch. Consensus in the Presence of Partial Synchrony. JACM ’88

Page 24: Troubleshooting SDN Control Software with Minimal Causal Sequences

Challenge: Asynchrony

Need to maintain original event order

[Diagram: message sequence among Master, Backup, and Switch. Labels: Ping, Pong, Ping; Crash; Link Failure; port_status; ACK; port_status; Master; Timeout; Timeout; Blackhole persists!]

Page 25: Troubleshooting SDN Control Software with Minimal Causal Sequences

Challenge: Asynchrony

Need to maintain original event order

[Diagram: message sequence among Master, Backup, and Switch. Labels: Ping, Pong, Ping; Link Failure; port_status; Master; Timeout; New Routing Table!; Crash; Blackhole avoided!]

Page 26: Troubleshooting SDN Control Software with Minimal Causal Sequences

Coping with Asynchrony

Use interposition to maintain causal dependencies
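One way to picture interposition is a layer that holds every message the system sends until the replayer releases it, so deliveries keep their recorded position relative to injected inputs. The sketch below is a simplified illustration under assumptions: `Event`, `Interposer`, `replay`, and the `inject` callback are hypothetical names, and a real replayer would wait for a missing message up to a timeout instead of skipping it immediately.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Event:
    kind: str   # "input": injected by the testbed; "internal": a message delivery
    label: str


class Interposer:
    """Holds every message the system sends until the replayer releases it."""

    def __init__(self) -> None:
        self.pending: List[str] = []    # sent but not yet delivered
        self.delivered: List[str] = []  # delivery order actually enforced

    def intercept(self, label: str) -> None:
        self.pending.append(label)

    def deliver(self, label: str) -> bool:
        if label not in self.pending:
            return False                # message never appeared in this run
        self.pending.remove(label)
        self.delivered.append(label)
        return True


def replay(trace: List[Event], inject: Callable[[str], None],
           interposer: Interposer) -> List[Event]:
    """Walk the recorded trace in order, so each delivery keeps its original
    position relative to the injected inputs."""
    executed = []
    for event in trace:
        if event.kind == "input":
            inject(event.label)         # may cause the system to send messages
            executed.append(event)
        elif interposer.deliver(event.label):
            executed.append(event)
        # An absent internal event is skipped in this sketch.
    return executed
```

In the failover example, this is what would keep a buffered port_status message from being delivered after the crash when the recorded trace delivered it before.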

Page 27: Troubleshooting SDN Control Software with Minimal Causal Sequences

Challenge: Divergence

• Asynchrony
• Divergent execution
  • Syntactic Changes
  • Absent Events
  • Unexpected Events
• Non-determinism

Page 28: Troubleshooting SDN Control Software with Minimal Causal Sequences

Divergence: Absent Internal Events

Prune Earlier Input..

[Diagram: message sequence among Master, Backup, and Switch. Labels: Policy change; Host Migration; Ping, Pong, Ping; Crash; Link Failure; Notify; ACK; Notify; Master]

Page 29: Troubleshooting SDN Control Software with Minimal Causal Sequences

Divergence: Absent Internal Events

Some Events No Longer Appear

[Diagram: message sequence among Master, Backup, and Switch. Labels: Policy change; Host Migration; Ping, Pong, Ping; Crash; Link Failure; Notify; Master]

Page 30: Troubleshooting SDN Control Software with Minimal Causal Sequences

Solution: Peek Ahead

Infer which internal events will occur

[Diagram: message sequence among Master, Backup, and Switch. Labels: Policy change; Host Migration; Ping, Pong, Ping; Crash; Link Failure; Notify; Master]
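A rough Python sketch of the peek-ahead idea follows, heavily simplified and resting on assumptions. `observe` is a hypothetical testbed hook that replays a prefix of inputs from a clean start and returns the internal events seen after the last input; `peek_ahead` keeps only those that also appear in the original log, so the replayer knows which internal events to wait for. Input labels are assumed to be unique.

```python
from typing import Callable, Dict, List, Sequence, Set


def peek_ahead(inputs: Sequence[str],
               original_internal: Set[str],
               observe: Callable[[List[str]], List[str]]) -> Dict[str, List[str]]:
    """For each input in the candidate subsequence, replay the prefix ending
    at that input and record which originally logged internal events follow it.

    observe is a hypothetical hook; input labels are assumed unique.
    """
    expected: Dict[str, List[str]] = {}
    prefix: List[str] = []
    for inp in inputs:
        prefix.append(inp)
        seen = observe(list(prefix))
        # Keep only events known from the original trace; anything else is
        # an unexpected event and is not something to wait for.
        expected[inp] = [e for e in seen if e in original_internal]
    return expected
```

With a toy `observe` in which the ACK only follows a link failure when an earlier policy change is still in the prefix, pruning the policy change makes the ACK disappear from the expected events, which matches the absent-event case on the preceding slides.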

Page 31: Troubleshooting SDN Control Software with Minimal Causal Sequences

Challenge: Non-determinism

• Asynchrony

• Divergent execution

• Non-determinism

Page 32: Troubleshooting SDN Control Software with Minimal Causal Sequences

Coping With Non-Determinism

• Replay multiple times per subsequence

• Assuming i.i.d., probability of not finding bug modeled by:

• If not i.i.d., override gettimeofday(), multiplex sockets, interpose on logging statements
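The formula for the i.i.d. case did not survive in this transcript. As an assumption, not a reconstruction of the slide: if each replay of a subsequence reproduces the violation independently with probability p, then r replays all miss it with probability (1 - p)^r. The sketch below computes that quantity and the number of replays needed to push it under a chosen threshold; the function names are illustrative.

```python
import math


def miss_probability(p: float, replays: int) -> float:
    """Probability that `replays` independent replays all fail to trigger
    the bug, when each one triggers it with probability p (assumed model)."""
    return (1.0 - p) ** replays


def replays_needed(p: float, max_miss: float) -> int:
    """Smallest replay count whose miss probability is at most max_miss."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if not 0.0 < max_miss < 1.0:
        raise ValueError("max_miss must be in (0, 1)")
    if p == 1.0:
        return 1  # a deterministic bug shows up on the first replay
    return math.ceil(math.log(max_miss) / math.log(1.0 - p))
```

Under this model, a bug that shows up in half of all replays needs 5 replays per subsequence to bring the miss probability below 5%.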

Page 33: Troubleshooting SDN Control Software with Minimal Causal Sequences

Approach Recap

• Replay events in QA testbed
• Apply delta debugging to inputs
• Asynchrony: interpose on messages
• Divergence: infer absent events
• Non-determinism: replay multiple times

Page 34: Troubleshooting SDN Control Software with Minimal Causal Sequences

Outline

• What are we trying to do?

• How do we do it?

• Does it work?

Page 35: Troubleshooting SDN Control Software with Minimal Causal Sequences

Evaluation Methodology

• Evaluate on 5 open source SDN controllers (Floodlight, NOX, POX, Frenetic, ONOS)
• Quantify minimization for:
  • Synthetic bugs
  • Bugs found in the wild
• Qualitatively relay experience troubleshooting with MCSes

Page 36: Troubleshooting SDN Control Software with Minimal Causal Sequences

Case Studies

[Bar chart: Input size vs. MCS size per case study. Y-axis: Number of Input Events (0 to 400). Off-scale bar labels: 1596, 719. One case is marked "Not replayable".]

Discovered Bugs: Pyretic Loop, POX Premature PacketIn, POX In-Flight Blackhole, POX Migration Blackhole, NOX Discovery Loop, Floodlight Loop, ONOS Database Locking

Known Bugs: Floodlight Failover, ONOS Master Election, POX Load Balancer

Synthetic Bugs: Delicate Timer Interleaving, Reactive Routing Trigger, Overlapping Flow Entries, Null Pointer, Multithreaded Race Condition, Memory Leak, Memory Corruption

Substantial minimization except for 1 case

Conservative input sizes

17 case studies total

Page 37: Troubleshooting SDN Control Software with Minimal Causal Sequences

Comparison to Naïve Replay

• Naïve replay: ignore internal events
• Naïve replay often not able to replay at all
  • 5 / 7 discovered bugs not replayable
  • 1 / 7 synthetic bugs not replayable
• Naïve replay did better in one case
  • 2 event MCS vs. 7 event MCS with our techniques

Page 38: Troubleshooting SDN Control Software with Minimal Causal Sequences

Qualitative Results

• 15 / 17 MCSes useful for debugging
  • 1 non-replayable case (not surprising)
  • 1 misleading MCS (expected)

Page 39: Troubleshooting SDN Control Software with Minimal Causal Sequences

Related Work

Page 40: Troubleshooting SDN Control Software with Minimal Causal Sequences

Conclusion

• Possible to automatically minimize execution traces for SDN control software

• System (23K+ lines of Python) evaluated on 5 open source SDN controllers (Floodlight, NOX, POX, Frenetic, ONOS) and one proprietary controller

• Currently generalizing, formalizing approach

ucb-sts.github.com/sts/

Page 41: Troubleshooting SDN Control Software with Minimal Causal Sequences

Backup

Page 42: Troubleshooting SDN Control Software with Minimal Causal Sequences

Related work

• Thread Schedule Minimization
  • Isolating Failure-Inducing Thread Schedules. SIGSOFT ’02.
  • A Trace Simplification Technique for Effective Debugging of Concurrent Programs. FSE ’10.
• Program Flow Analysis
  • Enabling Tracing of Long-Running Multithreaded Programs via Dynamic Execution Reduction. ISSTA ’07.
  • Toward Generating Reducible Replay Logs. PLDI ’11.
• Best-Effort Replay of Field Failures
  • A Technique for Enabling and Supporting Debugging of Field Failures. ICSE ’07.
  • Triage: Diagnosing Production Run Failures at the User’s Site. SOSP ’07.

Page 43: Troubleshooting SDN Control Software with Minimal Causal Sequences

Bugs are costly and time consuming

• Software bugs cost US economy $59.5 Billion in 2002 [1]
• Developers spend ~50% of their time debugging [2]
• Best developers devoted to debugging

1. National Institute of Standards and Technology 2002 Annual Report
2. P. Godefroid et al., Concurrency at Microsoft - An Exploratory Study. CAV ’08

Page 44: Troubleshooting SDN Control Software with Minimal Causal Sequences

Ongoing work

• Formal analysis of approach
• Apply to other distributed systems (databases, consensus protocols)
• Investigate effectiveness of various interposition points
• Integrate STS into ONOS (ON.Lab) development workflow

Page 45: Troubleshooting SDN Control Software with Minimal Causal Sequences

Scalability

Page 46: Troubleshooting SDN Control Software with Minimal Causal Sequences

Case Studies

[Bar chart: MCS size vs. Naïve MCS per case study, for case studies 1 through 17 grouped into Discovered Bugs, Known Bugs, and Synthetic Bugs. Y-axis: Number of Input Events (0 to 35). Seven bars are marked "Not replayable"; other annotations: "inflated", "non-replayable", "misleading (expected)".]

Techniques provide notable benefit vs. naïve replay

15 / 17 MCSes useful for debugging

Page 47: Troubleshooting SDN Control Software with Minimal Causal Sequences

Case Studies

Page 48: Troubleshooting SDN Control Software with Minimal Causal Sequences

Runtime

Page 49: Troubleshooting SDN Control Software with Minimal Causal Sequences

Coping with Non-Determinism

Page 50: Troubleshooting SDN Control Software with Minimal Causal Sequences

Replay Requirements

• Need to maintain original happens-before relation
• Includes internal events
  • Message Deliveries
  • State Transitions

Page 51: Troubleshooting SDN Control Software with Minimal Causal Sequences

Naïve Replay Approach

[Timeline: t1 t2 t3 t4 t5 t6 t7 t8 t9 t10]

Schedule events according to wall-clock time

Page 52: Troubleshooting SDN Control Software with Minimal Causal Sequences

Complexity

Best Case
- Delta Debugging: Ω(log n) replays
- Each replay: O(n) events
- Total: Ω(n log n)

Worst Case
- Delta Debugging: O(n) replays
- Each replay: O(n) events
- Total: O(n^2)

Page 53: Troubleshooting SDN Control Software with Minimal Causal Sequences

Assumptions of Delta Debugging

Page 54: Troubleshooting SDN Control Software with Minimal Causal Sequences

Local vs. Global Minimality

Page 55: Troubleshooting SDN Control Software with Minimal Causal Sequences

Forensic Analysis of Production Logs

Logs need to capture causality: Lamport Clocks or accurate NTP

Need clear mapping between input/internal events and simulated events

Must remove redundantly logged events

Might employ causally consistent snapshots to cope with length of logs
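To illustrate how Lamport clocks let logs capture causality, here is a small Python sketch. It is a generic textbook version, not taken from the system: `LamportClock` and `merge_logs` are illustrative names, and log entries are assumed to be (timestamp, node, message) tuples.

```python
from typing import List, Tuple

LogEntry = Tuple[int, str, str]  # (Lamport timestamp, node name, message)


class LamportClock:
    """Logical clock: if event a happens before event b, a's timestamp is
    smaller than b's."""

    def __init__(self) -> None:
        self.time = 0

    def local_event(self) -> int:
        self.time += 1
        return self.time

    def send(self) -> int:
        # Sending is an event; the returned timestamp travels with the message.
        return self.local_event()

    def receive(self, sender_time: int) -> int:
        # Jump past the sender's timestamp so the receive orders after the send.
        self.time = max(self.time, sender_time) + 1
        return self.time


def merge_logs(*logs: List[LogEntry]) -> List[LogEntry]:
    """Merge per-node logs into one sequence ordered by timestamp, breaking
    ties by node name. The result is consistent with happens-before."""
    return sorted(entry for log in logs for entry in log)
```

Merging logs this way yields an event order that respects message causality even when the nodes' wall clocks disagree.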

Page 56: Troubleshooting SDN Control Software with Minimal Causal Sequences

Instrumentation Complexity

Code to override gettimeofday(), interpose on logging statements, and multiplex sockets:

415 LOC for POX (Python)
722 LOC for Floodlight (Java)

Page 57: Troubleshooting SDN Control Software with Minimal Causal Sequences

Improvements

• Many improvements:
  • Parallelize delta debugging
  • Smarter delta debugging time splits
  • Apply program flow analysis to further prune
  • Compress time (override gettimeofday)

Page 58: Troubleshooting SDN Control Software with Minimal Causal Sequences

Divergence: Syntactic Changes

Prune Earlier Input..

[Diagram: message sequence among Master, Backup, and Switch. Labels: Ping Seq=3, Pong Seq=4, Ping Seq=5; Crash; Link Failure; port_status xid=12; ACK; port_status xid=13; Master; Timeout; Timeout]

Page 59: Troubleshooting SDN Control Software with Minimal Causal Sequences

Divergence: Syntactic Changes

Sequence Numbers Differ!

[Diagram: message sequence among Master, Backup, and Switch. Labels: Ping Seq=2, Pong Seq=3, Ping Seq=4; Crash; Link Failure; port_status xid=11; port_status xid=12; ACK; Master; Timeout; Timeout]

Page 60: Troubleshooting SDN Control Software with Minimal Causal Sequences

Solution: Equivalence Classes

Mask Over Extraneous Fields
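Masking can be sketched as computing a fingerprint over only the fields that matter, so that two messages differing only in sequence numbers fall into the same equivalence class. In the sketch below, messages are assumed to be flat dictionaries, and the masked field names `seq` and `xid` are taken from the labels on the preceding diagrams; which fields are safe to mask in a given controller is an assumption.

```python
from typing import Any, Dict, FrozenSet, Tuple

# Fields treated as extraneous. Assumption: sequence numbers and transaction
# ids, as seen in the diagrams (Seq=..., xid=...).
MASKED_FIELDS: FrozenSet[str] = frozenset({"seq", "xid"})


def fingerprint(message: Dict[str, Any],
                masked: FrozenSet[str] = MASKED_FIELDS) -> Tuple[Tuple[str, Any], ...]:
    """Return a hashable summary of the message with masked fields dropped.
    Messages with equal fingerprints are treated as the same event."""
    return tuple(sorted((k, v) for k, v in message.items() if k not in masked))


def same_event(a: Dict[str, Any], b: Dict[str, Any]) -> bool:
    """True if the two messages are functionally equivalent under the mask."""
    return fingerprint(a) == fingerprint(b)
```

With this mask, a port_status message with xid=12 in the original run matches the port_status with xid=11 in the pruned run, while a port_status for a different port does not.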

Page 61: Troubleshooting SDN Control Software with Minimal Causal Sequences

Solution: Peek ahead

Page 62: Troubleshooting SDN Control Software with Minimal Causal Sequences

Divergence: Unexpected Events

Prune Input..

[Diagram: message sequence among Master, Backup, and Switch. Labels: Ping, Pong, Ping, …; Crash; Master]

Page 63: Troubleshooting SDN Control Software with Minimal Causal Sequences

Divergence: Unexpected Events

Unexpected Events Appear

[Diagram: message sequence among Master, Backup, and Switch. Labels: Ping, Pong, Ping, …; Crash; Master; LLDP]

Page 64: Troubleshooting SDN Control Software with Minimal Causal Sequences

Solution: Empirical Heuristic

Theory:
• Divergent paths lead to exponential possibilities

Practice:
• Allow unexpected events through

