Date post: 22-Dec-2015
Finding Liveness Bugs in Systems Code

C. Killian, J. Anderson, R. Jhala, A. Vahdat

UC San Diego

Jan 2003: Slammer Worm Exploits Buffer Overflow

August 2003: North American Blackout Caused by Race Condition

Software is

Unreliable

Why?

Many Domains

Many Reasons

One Common Factor

Intended Properties

Actual Code

Developer Misses Corner Cases

Intended Properties

Actual Code

Properties drift away

Code Evolves

Key to Reliability: Connect Properties & Code

How?

For each domain:
1. Formalize Properties
2. Automate Analysis
3. Build Tools

Domains

- Device Drivers
- Distributed Systems
- Configuration Management
- PL/Databases
- Web 2.0 Security

Concurrent, Distributed Systems

Stock Exchanges, Telecoms, Commuter Rail

System: nodes exchanging messages

Concurrent, Distributed Systems

Execution
1. Node gets message event
2. Executes event handler
   - Updates node state
   - Sends new messages
3. Repeat…
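
The execution loop above can be sketched in a few lines. This is a minimal illustration, not Mace code: the `Node` class, its `handle` method, the "hello" payload and the `run` driver are all hypothetical names chosen for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node whose state is just the set of peers it knows."""
    name: str
    peers: set = field(default_factory=set)

    def handle(self, sender, payload):
        """Event handler: update local state, return new messages to send."""
        outgoing = []
        if payload == "hello" and sender not in self.peers:
            self.peers.add(sender)                          # updates node state
            outgoing.append((self.name, sender, "hello"))   # sends new message
        return outgoing


def run(nodes, initial_messages, max_steps=100):
    """Deliver queued messages one at a time until the queue drains."""
    queue = deque(initial_messages)
    steps = 0
    while queue and steps < max_steps:
        sender, receiver, payload = queue.popleft()             # 1. node gets message event
        queue.extend(nodes[receiver].handle(sender, payload))   # 2. executes event handler
        steps += 1                                              # 3. repeat
    return steps


nodes = {name: Node(name) for name in ("n1", "n2")}
steps = run(nodes, [("n1", "n2", "hello")])
print(steps, nodes["n1"].peers, nodes["n2"].peers)
```

Here messages are delivered in FIFO order; a real system, as the next slide notes, may reorder or lose them.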

Challenges

- Nodes enter, leave, fail
- Messages are reordered, lost

System must stay available
- Eventually, all nodes regroup
- Eventually, all packets delivered

Pastry

[Rowstron & Druschel ‘01]

- Key-Value Store
- Distributed Across Nodes
- Organized in Ring Topology

Nodes Leave and Rejoin

1. Node leaves
2. Ring detects, reconnects
3. Node returns
4. Asks for neighbors
5. Rejoins neighbors

But Sometimes...

Asks for neighbors

Query bounces back!

Node forever unable to rejoin...

How to find?

How to reproduce?

How to fix?

For each domain:
1. Formalize Properties
2. Automate Analysis
3. Build Tools

Distributed Systems:

1. Formalize Properties: System, Liveness, Bugs

System: nodes exchanging messages

States and Transitions

State: snapshot of system

[Figure: two states of a system with nodes 1 and 2, joined by a transition labeled event@1]

At each state, the scheduler:
1. Chooses a node n
2. Chooses an event @n
3. Executes the event code (C++)

The Space of System Executions

[Figure: tree of states of nodes 1 and 2 rooted at the initial state, with edges labeled event@1, event@2, fail@1, fail@2]

At each state, the scheduler:
1. Chooses a node n
2. Chooses an event @n
3. Executes the event code (C++)

An Execution = Sequence of Choices

[Figure, repeated over four slides: tree of states of nodes 1 and 2, with edges labeled event@1, event@2, fail@1, fail@2]

1. Formalize Properties: System, Liveness, Bugs

System must stay available

System must recover from failure

Desired Properties

Eventually all nodes regroup

(Despite Failures,...)

Eventually some good happens

Liveness Properties

- Eventually all data delivered
- Eventually “P is true”
- Eventually $\forall n, m \; \exists i : n.\mathrm{fwd}^{i} = m$ (“nodes form ring”)
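
As an illustration of the ring property, the predicate "every node reaches every node by following fwd pointers" can be checked on a snapshot of the system. This is a sketch under assumptions: the snapshot is a plain dictionary from node name to its fwd pointer, and `forms_ring` is a hypothetical helper, not part of Mace or MaceMC.

```python
def forms_ring(fwd):
    """True if, for all nodes n and m, some i >= 1 gives fwd^i(n) == m."""
    nodes = set(fwd)
    for start in nodes:
        seen, cur = set(), start
        for _ in range(len(nodes)):      # i ranges over 1..|nodes|
            cur = fwd.get(cur)
            if cur is None:              # pointer leaves the known node set
                return False
            seen.add(cur)
        if seen != nodes:
            return False
    return True


ring = {"A": "B", "B": "C", "C": "A"}     # A -> B -> C -> A
split = {"A": "B", "B": "A", "C": "A"}    # C is never reached
print(forms_ring(ring), forms_ring(split))
```

The liveness property then says that this predicate eventually becomes true along an execution; the check itself only classifies a single state as live or not.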

1. Formalize Properties: System, Liveness, Bugs

Live States

Some good happened: P is true

[Figure: the initial state and the region of live states]

Live Executions

[Figure: an execution from the initial state into the live states]

Liveness Violations

Execution never reaches a live state

Liveness Violations: Two Kinds

1. Unlucky Executions

At each step, some choice leads to liveness, but the scheduler keeps making bad choices

2. Dead Executions

Execution reaches dead states: no execution can reach live states

Recovery is impossible

Idea 1: Focus on dead execution bugs

- Because very severe (unluckiness may be fixed by the scheduler)
- Because very plentiful
- Because we can

Yet difficult to find, reproduce, fix

1. Formalize Properties: System, Liveness, Bugs

Distributed Systems:
1. Formalize Properties
2. Automate Analysis
3. Build Tools

2. Automate Checking

Systematically search for dead executions

“Model Checking”

How to tell if an execution is dead?

Model Checking [Verisoft 97, Cmc 04, Chess 07]

Systematically explore executions from the initial state

Execution = sequence of choices

Iterate: over sequences of choices
Until: hit a bad state

Safety bugs:
• Null dereference
• Buffer overflow
• Assertion failure

How to find liveness bugs?

Until: hit a dead state

How to tell if a state is dead? The property only says which states are live
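
The "iterate over sequences of choices until a bad state" loop can be sketched as a bounded depth-first search. The toy system below is an assumption made for illustration (two counters, one per node, with `event` incrementing and `fail` resetting); `successors`, `is_bad` and `search` are hypothetical names, and the "bad state" stands in for an assertion failure.

```python
def successors(state):
    """Toy system: one counter per node. A choice is (kind, node)."""
    for node in (0, 1):
        bumped = list(state)
        bumped[node] += 1
        yield ("event", node), tuple(bumped)   # event@node increments
        reset = list(state)
        reset[node] = 0
        yield ("fail", node), tuple(reset)     # fail@node resets


def is_bad(state):
    """Stand-in for an assertion failure: counters differ by more than 2."""
    return abs(state[0] - state[1]) > 2


def search(initial, max_depth):
    """Depth-first iteration over sequences of choices, until a bad state."""
    stack = [(initial, [])]
    while stack:
        state, path = stack.pop()
        if is_bad(state):
            return path                        # the choices taken are the error trace
        if len(path) < max_depth:
            for choice, nxt in successors(state):
                stack.append((nxt, path + [choice]))
    return None


trace = search((0, 0), max_depth=3)
print(trace)
```

The returned sequence of choices is enough to replay the failing execution. Note that this only works for safety bugs, where a single state can be recognized as bad; that is exactly the gap the following slides address for liveness.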

Idea 2: Random Walks

[Figure: live states and dead states]

How to tell if a state is dead? The property only says which states are live

Execute long random walks from the state:
- From a dead state: Pr[reaching live] = 0
- From any other state: Pr[reaching live] = 1

Executions and Random Walks

At each execution step:
1. Scheduler picks node n
2. Scheduler picks event @n
3. Executes event code

Random Walk: scheduler picks randomly, from some probability distribution over nodes and events
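
A random-walk deadness test might look like the following sketch. Everything here is assumed for illustration: the system is abstracted to a small hand-written state graph (`GRAPH`), `LIVE` marks the states where the property holds, and the walk picks successors uniformly rather than from a tuned distribution.

```python
import random

# Hypothetical state graph: state -> possible successor states.
GRAPH = {
    "init":    ["joining", "init"],
    "joining": ["ring", "init"],
    "ring":    ["ring", "stale"],
    "stale":   ["stale", "stale2"],   # no path from here back to "ring"
    "stale2":  ["stale"],
}
LIVE = {"ring"}


def walk_reaches_live(start, max_steps, rng):
    """One random walk: at each step the scheduler picks a successor at random."""
    state = start
    for _ in range(max_steps):
        if state in LIVE:
            return True
        state = rng.choice(GRAPH[state])
    return state in LIVE


def probably_dead(start, walks=20, max_steps=1000, seed=0):
    """Report a state as (probably) dead if no walk from it reaches a live state."""
    rng = random.Random(seed)
    return not any(walk_reaches_live(start, max_steps, rng) for _ in range(walks))


print(probably_dead("init"), probably_dead("stale"))
```

The answer for a truly dead state is always "dead"; for a recoverable state the test can only err if every walk is unlucky, which becomes vanishingly unlikely as walks get long relative to the typical time to reach liveness.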

Algorithm = Search + Random Walks

1. Systematic Search: find candidates
2. Random Walk: test if candidate is dead

Iterate

If walk length >> avg. steps to liveness, then a non-live walk is likely a liveness bug!

[Figure: walks annotated “100k Events” and “1k Events”]

How to pinpoint the bug? 100,000-step execution (2 Gb log file)

Idea 3: The Critical Transition

System transitions from a recoverable to a dead state

How to find the critical transition without knowing the dead states?

Binary search using random walks

[Figure: live states, dead states and the critical transition between them]

The critical transition pinpoints the bug
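
The binary search can be sketched as below, assuming deadness is monotone along an execution (once dead, always dead), that the first state is recoverable and that the last one is dead. `critical_transition` is a hypothetical helper; the `is_dead` oracle would be a random-walk test in practice and is replaced here by a stub that also counts how often it is called.

```python
def critical_transition(trace, is_dead):
    """Return k such that trace[k - 1] is recoverable and trace[k] is dead."""
    lo, hi = 0, len(trace) - 1        # invariant: trace[lo] recoverable, trace[hi] dead
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_dead(trace[mid]):
            hi = mid                  # critical transition is at or before mid
        else:
            lo = mid                  # critical transition is after mid
    return hi


# Stub execution of 100,000 steps that becomes dead at step 61,234 (made-up number).
trace = list(range(100_000))
calls = 0


def is_dead(state):
    global calls
    calls += 1
    return state >= 61_234


k = critical_transition(trace, is_dead)
print(k, calls)
```

Instead of reading a 100,000-step log, the developer looks at one event, found with a logarithmic number of deadness tests.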

Recap

Dead Executions: system has shot itself (but doesn't know it)

Systematic Search: finds candidate dead states

Random Walks: determine if a candidate is dead

Critical Transition: event that makes recovery impossible

For each domain:
1. Formalize Properties
2. Automate Analysis
3. Build Tools

Distributed Systems:

MaceMC [NSDI 07]

First Liveness Checker For Systems Code

[Figure: MaceMC takes a Mace (C++) system and liveness properties, and reports liveness bugs and the critical transition]

Implementation Details (1/2)

Random Walk Bias
Assign “likely” events higher weight, e.g. application > network > timer > fail

Bugs not missed: random walk only tests deadness

Live state reached sooner: error traces shorter, simpler
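
One way to bias the random walk is a weighted choice over event kinds. The concrete weights below are invented for the example; the slides only give the ordering application > network > timer > fail, and `pick_event` is a hypothetical helper.

```python
import random

# Hypothetical weights that respect application > network > timer > fail.
WEIGHTS = {"application": 8, "network": 4, "timer": 2, "fail": 1}


def pick_event(enabled, rng):
    """Pick one enabled event kind, favoring the 'likely' kinds."""
    kinds = list(enabled)
    return rng.choices(kinds, weights=[WEIGHTS[k] for k in kinds], k=1)[0]


rng = random.Random(42)
counts = {kind: 0 for kind in WEIGHTS}
for _ in range(15_000):
    counts[pick_event(WEIGHTS, rng)] += 1
print(counts)
```

Because every kind keeps a nonzero weight, failures are still explored during the walk, only less often.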

Implementation Details (2/2)

Prefix-Based Search
Restart search after reaching liveness
Analyzes effect of failures in “steady” state

Systems Analyzed

- RandTree: random overlay tree with max degree.
- MaceTransport: user-level, reliable messaging service.
- Pastry: key-based routing, using an overlay ring.
- Chord: key-based routing, using an overlay ring.

Liveness Properties

- RandTree: eventually, all nodes form a single tree.
- MaceTransport: eventually, all messages acknowledged.
- Pastry: eventually, all nodes form a ring.
- Chord: eventually, all nodes form a ring.

Pastry Bug Understood

Node forever unable to rejoin...

[Figure: ring with nodes A, B and C]

1. B sends C a message about A
2. A leaves; ring reforms
3. A returns
4. C receives the (stale) message about A and updates its routing information: Critical Transition!
5. A's rejoin requests are bounced back

A forever unable to rejoin...

“Dropped JoinRequest on rapid rejoin problem: There was a problem with nodes not being able to quickly rejoin if they used the same NodeId. Didn’t find the cause of this bug, but can no longer reproduce.”

(FreePastry README, “Changes since 1.4.2”)

Also in Original Implementation

A “Protocol Level” Bug

Sample Bug: RandTree

Nodes with Child, Parent pointers

Property: eventually nodes form a tree

[Figure: nodes A, B and C]

1. C requests to join under A
2. A sends ack
3. C fails and restarts
4. C ignores ack from A
5. C joins under B

Bug: system stuck as a DAG!
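
The scenario can be replayed on a toy data structure to see why the result is a DAG rather than a tree. This is one plausible reading of the slide, not RandTree's actual code: it assumes A records C as a child when it sends the ack and never learns that C restarted. `TreeNode` and `parents_claiming` are hypothetical names.

```python
class TreeNode:
    """Hypothetical RandTree-like node with Parent and Child pointers."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = set()


def parents_claiming(nodes, child):
    """Names of all nodes that list `child` among their children."""
    return sorted(n.name for n in nodes if child.name in n.children)


a, b, c = TreeNode("A"), TreeNode("B"), TreeNode("C")

# 1-2. C requests to join under A; A records C as a child and sends an ack.
a.children.add(c.name)
# 3. C fails and restarts, losing its local state before the ack arrives.
c.parent = None
# 4. C ignores the ack from A (it no longer remembers asking).
# 5. C joins under B.
b.children.add(c.name)
c.parent = b.name

claimed_by = parents_claiming([a, b, c], c)
print(claimed_by)
```

In a tree every node is claimed by at most one parent; here C is claimed by two, and under the stated assumption no later event removes A's stale child pointer.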

MaceMC finds failure and reordering bugs (hard to catch via regular testing)

Liveness Bugs Yield Safety Assertions

Dead States
Violate a priori unknown safety assertions

Critical Transition
Helps identify dead states
Yields new safety assertions and bugs

New Safety Property: Chord

Nodes with Fwd, Back pointers

Property: eventually nodes form a ring

Critical transition to dead state where: n.back = n, n.fwd = m

New safety property: IF n.back = n THEN n.fwd = n
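
Once stated, the new safety property is cheap to check on every state. The sketch below assumes a snapshot represented as a dictionary from node name to its `back` and `fwd` pointers; `violates_new_safety` is a hypothetical helper, not MaceMC's property language.

```python
def violates_new_safety(nodes):
    """Names of nodes with back == self but fwd != self."""
    return sorted(
        name
        for name, ptrs in nodes.items()
        if ptrs["back"] == name and ptrs["fwd"] != name
    )


# State after the critical transition: n.back = n, n.fwd = m.
dead_state = {
    "n": {"back": "n", "fwd": "m"},
    "m": {"back": "n", "fwd": "n"},
}
# A freshly started single-node ring satisfies the property.
singleton = {"n": {"back": "n", "fwd": "n"}}

print(violates_new_safety(dead_state), violates_new_safety(singleton))
```

Checked as an assertion, this turns a liveness bug that needs long random walks to detect into a safety bug that the systematic search flags at the offending state.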

Hard to determine low-level safety properties in advance

Easy to determine high-level liveness properties in advance

MaceMC helps find safety assertions using liveness violations

Scorecard

System          Bugs   Liveness   Safety
MaceTransport     11          5        6
RandTree          17         12        5
Pastry             5          5        0
Chord             19          9       10
Total             52         31       21

- Several “protocol level” bugs
- Routinely used by Mace programmers
- Zero false alarms

For each domain:
1. Formalize Properties
2. Automate Analysis
3. Build Tools

Distributed Systems

Other Domains

Software Verification

How to prove device drivers use kernel API properly?

Lazy Abstraction, Craig Interpolation, Liquid Types

[popl02, pldi04, icse04...]
[pldi08, pldi09, popl10]
[popl04, tacas06]

Multithread Analysis

How to prevent and control thread interference?

- Scalable Race Detection
- Multithread Analysis = Sequential Analysis x Race Detection
- Lock Allocation

[popl07]
[pldi08]
[fse07]

Config Management

How to avoid “DLL hell”?
Q: Can I install emacs?
A: If you have X11 or Xorg (not both)

NP-Complete
Encode and solve via SAT [icse 07]

SAT solvers in Eclipse, Suse-Linux

Web 2.0 Security

How to prevent JavaScript from doing mischief?

Staged Analysis for JavaScript [pldi 09]

Tracking Info Flow in Browser [?]

For each domain:
1. Formalize Properties
2. Automate Analysis
3. Build Tools

Analysis Connects Properties & Code
Analyze tricky corner cases
+ Re-analyze as code evolves

Improve Software Reliability

“ucsd progsys”(people, papers, code, demos, etc.)


