Date post: | 22-Dec-2015 |
Intended Properties
Actual Code
Properties drift away
Code Evolves
Key to Reliability: Connect Properties & Code
Execution
1. Node gets message event
2. Executes event handler
   - Updates node state
   - Sends new messages
3. Repeat…
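The execution loop above can be sketched in a few lines. This is only an illustration, not Mace code: the `Node` class, the `(ttl, reply_to)` message shape, and the `handle`/`run` helpers are hypothetical names chosen for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node: local state plus an inbox of pending message events."""
    name: str
    seen: int = 0                                   # toy node state: messages handled
    inbox: deque = field(default_factory=deque)


def handle(node, msg, nodes):
    """Event handler: update node state, then possibly send new messages."""
    ttl, reply_to = msg
    node.seen += 1                                  # updates node state
    if ttl > 0:                                     # sends new messages
        nodes[reply_to].inbox.append((ttl - 1, node.name))


def run(nodes, max_steps=100):
    """Repeat: pick a node with a pending event and execute its handler."""
    steps = 0
    while steps < max_steps:
        ready = [n for n in nodes.values() if n.inbox]
        if not ready:
            break
        node = ready[0]                             # fixed choice; a scheduler could pick any
        handle(node, node.inbox.popleft(), nodes)
        steps += 1
    return steps


nodes = {"1": Node("1"), "2": Node("2")}
nodes["1"].inbox.append((3, "2"))                   # initial message event at node 1
steps = run(nodes)
print(steps, nodes["1"].seen, nodes["2"].seen)
```

The fixed choice `ready[0]` is the simplest possible scheduler; the rest of the talk is about what happens when that choice is varied.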
Concurrent, Distributed Systems
Challenges
- Nodes enter, leave, fail
- Messages are reordered, lost
System must stay available
- Eventually, all nodes regroup
- Eventually, all packets delivered
States and Transitions
State: snapshot of system
[Figure: two-node state (nodes 1 and 2) with a transition labeled event@1]
At each state, scheduler chooses
1. Node n
2. Event @n
3. Executes code (C++)
The Space of System Executions
[Figure: tree of two-node states rooted at the initial state; edges labeled event@1, event@2, fail@1, fail@2]
At each state, scheduler chooses
1. Node n
2. Event @n
3. Executes code (C++)
An Execution = Sequence of Choices
[Figure: one path through the tree of two-node states; edges labeled event@1, event@2, fail@1, fail@2]
Eventually all nodes regroup
(Despite Failures,...)
Eventually some good happens
Liveness Properties
Eventually all data delivered
Eventually “P is true”
Eventually ∀n,m ∃i: n.fwd^i = m (“nodes form ring”)
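As a rough illustration, the ring property can be written as an executable predicate over a snapshot of forward pointers. The dictionary representation of `fwd` and the helper names `forms_ring` and `eventually` are assumptions made for this sketch, and "eventually" is only checked over a finite recorded execution.

```python
def forms_ring(fwd):
    """True if for all n, m there is some i with fwd^i(n) = m.

    `fwd` maps each node name to its forward pointer (assumed to be a known node).
    """
    nodes = set(fwd)
    for n in nodes:
        reached = set()
        cur = n
        for _ in range(len(nodes)):      # i = 1..|nodes| is enough in a ring
            cur = fwd[cur]
            reached.add(cur)
        if reached != nodes:
            return False
    return True


def eventually(prop, execution):
    """Finite-trace reading of 'eventually P': some recorded state satisfies P."""
    return any(prop(state) for state in execution)


ring = {"a": "b", "b": "c", "c": "a"}
split = {"a": "b", "b": "a", "c": "c"}   # c points to itself, so no single ring
print(forms_ring(ring), forms_ring(split), eventually(forms_ring, [split, ring]))
```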
1. Unlucky Executions
[Figure labels: Initial State, Live States]
At each step, some choice leads to liveness
But scheduler keeps making bad choices
2. Dead Executions
Execution reaches dead states
No execution can reach live states
Recovery is impossible
Idea 1: Focus on dead execution bugs
- Because very severe (unluckiness may be fixed by scheduler)
- Because very plentiful
- Because we can
Yet difficult to find, reproduce, fix
Distributed Systems:
1. Formalize Properties
2. Automate Analysis
3. Build Tools
1. Formalize Properties: System Liveness Bugs
2. Automate Checking: Systematically search for dead executions (“Model Checking”)
How to tell if an execution is dead?
Model Checking
Systematically Explore Executions
Initial State
Iterate: over sequences of choices
Execution = Sequence of choices
[Verisoft 97, Cmc 04, Chess 07]
Model Checking
Iterate: over sequences of choices
Until: hit a bad state
Safety Bugs
• Null dereference
• Buffer overflow
• Assertion failure
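A minimal sketch of this loop, assuming the system is exposed through three hypothetical callbacks (`choices`, `step`, `is_bad`) and that states are hashable. It uses a depth-bounded breadth-first search; the cited tools use their own search strategies, so this only illustrates the idea.

```python
from collections import deque


def model_check(initial, choices, step, is_bad, max_depth):
    """Explore sequences of choices breadth-first until a bad state is hit.

    Returns the sequence of choices leading to a bad state, or None if no
    bad state is reachable within max_depth steps.
    """
    queue = deque([(initial, [])])
    visited = {initial}
    while queue:
        state, path = queue.popleft()
        if is_bad(state):
            return path
        if len(path) == max_depth:
            continue
        for choice in choices(state):
            nxt = step(state, choice)
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [choice]))
    return None


# Toy system (hypothetical): each node counts its events; an assertion fails at (2, 1).
def choices(state):
    return ["event@1", "event@2"]


def step(state, choice):
    x1, x2 = state
    return (x1 + 1, x2) if choice == "event@1" else (x1, x2 + 1)


def is_bad(state):
    return state == (2, 1)


trace = model_check((0, 0), choices, step, is_bad, max_depth=5)
print(trace)
```

Because the search is breadth-first, the returned trace is a shortest sequence of choices that reaches the bad state, which keeps error traces short.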
Model Checking
Iterate: over sequences of choices
Until: hit a bad state
• Null dereference
• Buffer overflow
• Assertion failure
How to find liveness bugs?
Until: hit a dead state
How to tell if a state is dead?
Property only says which states are live
[Verisoft 97, Cmc 04, Chess 07]
Idea 2: Random Walks
Execute long random walks from state
- From dead states: Pr[reaching live] = 0
- From all other states: Pr[reaching live] = 1
How to tell if a state is dead?
Property only says which states are live
Executions and Random Walks
At each execution step,
1. Scheduler picks node n
2. Scheduler picks event @n
3. Executes event code
Random Walk: Scheduler picks randomly
From some prob. dist. over nodes, events
Algorithm = Search + Random Walks
1. Systematic Search: find candidates
2. Random Walk: test if candidate is dead
Iterate
If walk length >> avg. steps to liveness
Then non-live walk is likely liveness bug!
[Figure labels: 100k events; 1k events]
How to pinpoint the bug?
100,000-step execution (2 GB log file)
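The two phases can be sketched on a toy state graph. Everything here is an assumption made for illustration: the `GRAPH` and `LIVE` data, the helper names, and the choice to run several walks per candidate (a single long walk from a recoverable state can itself wander into the dead region).

```python
import random

# Hypothetical toy system: s2 and s3 form a dead cycle that never reaches "live".
GRAPH = {
    "s0": ["s1", "s2"],
    "s1": ["s0", "live"],
    "s2": ["s3"],
    "s3": ["s2"],
    "live": ["live"],
}
LIVE = {"live"}


def reachable(initial, graph, depth):
    """Systematic search: every state reachable within `depth` steps."""
    frontier, seen = {initial}, {initial}
    for _ in range(depth):
        frontier = {t for s in frontier for t in graph[s]} - seen
        seen |= frontier
    return seen


def walk_reaches_live(state, graph, live, walk_len, rng):
    """One random walk: True if a live state is reached within walk_len steps."""
    for _ in range(walk_len):
        if state in live:
            return True
        state = rng.choice(graph[state])
    return state in live


def likely_dead(initial, graph, live, depth=3, walks=50, walk_len=200, seed=0):
    """Candidates from which no random walk reached a live state."""
    rng = random.Random(seed)
    dead = set()
    for s in sorted(reachable(initial, graph, depth)):
        if not any(walk_reaches_live(s, graph, live, walk_len, rng)
                   for _ in range(walks)):
            dead.add(s)
    return dead


dead = likely_dead("s0", GRAPH, LIVE)
print(sorted(dead))
```

The walk length (200) is far larger than the handful of steps a recoverable state needs to reach liveness here, which is the condition under which a non-live walk is a credible bug report.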
Idea 3: The Critical Transition
[Figure: execution crossing from live states into dead states at the critical transition]
System transitions from a recoverable to a dead state
How to find the critical transition without knowing the dead states?
Critical transition pinpoints the bug
Recap
Dead Executions: System has shot itself (but doesn't know it)
Systematic Search: Finds candidate dead states
Random Walks: Determine if candidate is dead
Critical Transition: Event that makes recovery impossible
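The slides do not say how the critical transition is located inside a long dead execution. One plausible approach, sketched here purely as an assumption, is a binary search over the recorded states: once a state is dead every later state is dead too, so random walks can be used as the "is this state dead?" test at each probe. The toy `GRAPH`, `LIVE`, and all function names are hypothetical.

```python
import random

# Hypothetical toy system: s2 and s3 form a dead cycle that never reaches "live".
GRAPH = {
    "s0": ["s1", "s2"],
    "s1": ["s0", "live"],
    "s2": ["s3"],
    "s3": ["s2"],
    "live": ["live"],
}
LIVE = {"live"}


def appears_dead(state, rng, walks=50, walk_len=200):
    """Random-walk test: True if no walk from `state` reaches a live state."""
    for _ in range(walks):
        cur = state
        for _step in range(walk_len):
            if cur in LIVE:
                break
            cur = rng.choice(GRAPH[cur])
        if cur in LIVE:
            return False
    return True


def critical_transition(execution, rng):
    """Binary search for the step that enters the dead region.

    Assumes execution[0] is recoverable and execution[-1] is dead.
    Returns (i, j): execution[i] is recoverable, execution[j] is dead, j == i + 1.
    """
    lo, hi = 0, len(execution) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if appears_dead(execution[mid], rng):
            hi = mid
        else:
            lo = mid
    return lo, hi


execution = ["s0", "s1", "s0", "s2", "s3", "s2"]   # recorded dead execution
i, j = critical_transition(execution, random.Random(0))
print(execution[i], "->", execution[j])
```

With binary search, a 100,000-step execution needs only about 17 probes rather than one random-walk test per step.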
MaceMC [NSDI 07]
First Liveness Checker For Systems Code
Input: Mace (C++) system, liveness properties
Output: liveness bugs, critical transition
Implementation Details (1/2)
Random Walk Bias
- Assign “likely” events higher weight, e.g. application > network > timer > fail
- Bugs not missed: random walk only tests deadness
- Live state reached sooner: error traces shorter, simpler
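A biased scheduler step might look like the following. The numeric weights are invented for illustration; the slide gives only the ordering application > network > timer > fail. The sketch uses the standard library's `random.choices`, which accepts per-item weights.

```python
import random
from collections import Counter

# Hypothetical weights; only their ordering comes from the slide.
WEIGHTS = {"application": 8, "network": 4, "timer": 2, "fail": 1}


def pick_event(enabled, rng):
    """Pick one enabled (node, kind) pair, favoring 'likely' event kinds."""
    weights = [WEIGHTS[kind] for _node, kind in enabled]
    return rng.choices(enabled, weights=weights, k=1)[0]


enabled = [(1, "application"), (2, "network"), (1, "timer"), (2, "fail")]
rng = random.Random(0)
counts = Counter(pick_event(enabled, rng)[1] for _ in range(15000))
print(counts.most_common())
```

Every enabled event keeps a non-zero weight, so failures still occur during walks; they are just rarer than application events.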
Implementation Details (2/2)
Prefix-Based Search
- Restart search after reaching liveness
- Analyzes effect of failures in “steady” state
Systems Analyzed
RandTree: Random overlay tree with max degree.
MaceTransport: User-level, reliable messaging service.
Pastry: Key-based routing, using an overlay ring.
Chord: Key-based routing, using an overlay ring.
Liveness Properties
RandTree: Random overlay tree with max degree. Eventually, all nodes form a single tree.
MaceTransport: User-level, reliable transport service. Eventually, all messages acknowledged.
Pastry: Key-based routing, using an overlay ring. Eventually, all nodes form a ring.
Chord: Key-based routing, using an overlay ring. Eventually, all nodes form a ring.
Pastry Bug Understood
[Figure: nodes A, B, C]
C receives (stale) message about A
Updates routing information
Critical Transition!
“Dropped JoinRequest on rapid rejoin problem: There was a problem with nodes not being able to quickly rejoin if they used the same NodeId. Didn’t find the cause of this bug, but can no longer reproduce.”
(FreePastry README, “Changes since 1.4.2”)
Also in Original Implementation
A “Protocol Level” Bug
Sample Bug: RandTree
[Figure: nodes A, B, C]
1. C requests to join under A
2. A sends ack
3. C fails and restarts
4. C ignores ack from A
5. C joins under B
Bug: System stuck as a DAG!
Liveness Bugs Yield Safety Assertions
Dead States: Violate a priori unknown safety assertions
Critical Transition
- Helps identify dead states
- Yields new safety assertions and bugs
New Safety Property: Chord
Nodes with Fwd, Back pointers
Property: Eventually nodes form a ring
Critical Transition: to dead state where n.back = n, n.fwd = m
New Safety Property: IF n.back = n THEN n.fwd = n
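Written as a check that could run after every transition, the new safety property might look like this. The `ChordNode` class and its string-valued pointers are a hypothetical stand-in for the real Mace data structures.

```python
from dataclasses import dataclass


@dataclass
class ChordNode:
    """Hypothetical node record with forward and back pointers (node names)."""
    name: str
    fwd: str
    back: str


def violates_safety(node):
    """Negation of: IF n.back = n THEN n.fwd = n."""
    return node.back == node.name and node.fwd != node.name


def check(nodes):
    """Names of all nodes that violate the safety property in this snapshot."""
    return [n.name for n in nodes if violates_safety(n)]


# Dead state from the slide: n.back = n but n.fwd = m.
snapshot = [ChordNode("n", fwd="m", back="n"), ChordNode("m", fwd="n", back="n")]
print(check(snapshot))
```

A node that is alone in the ring (`fwd` and `back` both pointing to itself) satisfies the property, so the check does not fire on a freshly started single-node system.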
Hard to determine low-level safety properties in advance
Easy to determine high-level liveness properties in advance
Scorecard
System          Bugs  Liveness  Safety
MaceTransport   11    5         6
RandTree        17    12        5
Pastry          5     5         0
Chord           19    9         10
Total           52    31        21
Several “protocol level” bugs
Routinely used by Mace programmers
Zero false alarms
For each domain:
1. Formalize Properties
2. Automate Analysis
3. Build Tools
Distributed Systems
Other Domains
Software Verification
How to prove device drivers use kernel API properly?
- Lazy Abstraction
- Craig Interpolation
- Liquid Types
[popl02, pldi04, icse04...] [pldi08, pldi09, popl10] [popl04, tacas06]
Multithread Analysis
How to prevent and control thread interference?
- Scalable Race Detection
- Multithread Analysis = Sequential Analysis x Race Detection
- Lock Allocation
[popl07] [pldi08] [fse07]
Config Management
How to avoid “DLL hell”?
Q: Can I install emacs?
A: If you have X11 or Xorg (not both)
NP-Complete
Encode and solve via SAT [icse 07]
SAT solvers in Eclipse, Suse-Linux
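The emacs question can be encoded as a tiny satisfiability problem. The encoding below is an assumption for illustration: emacs needs at least one of X11 or Xorg, and the two display packages conflict with each other. It enumerates all assignments by brute force, which is fine for three variables; a real tool would hand the same constraints to a SAT solver.

```python
from itertools import product

PACKAGES = ["emacs", "x11", "xorg"]


def consistent(cfg):
    """Hypothetical constraints: emacs -> (x11 or xorg), and not (x11 and xorg)."""
    needs_display = (not cfg["emacs"]) or cfg["x11"] or cfg["xorg"]
    no_conflict = not (cfg["x11"] and cfg["xorg"])
    return needs_display and no_conflict


def solutions(require):
    """All consistent install sets that include the required package."""
    found = []
    for bits in product([False, True], repeat=len(PACKAGES)):
        cfg = dict(zip(PACKAGES, bits))
        if cfg[require] and consistent(cfg):
            found.append(cfg)
    return found


installs = solutions("emacs")
for cfg in installs:
    print(sorted(p for p, on in cfg.items() if on))
```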
Web 2.0 Security
How to prevent JavaScript from doing mischief?
- Staged Analysis for JavaScript [pldi 09]
- Tracking Info Flow in Browser [?]
For each domain:
1. Formalize Properties
2. Automate Analysis
3. Build Tools
Analysis Connects Properties & Code
+ Analyze tricky corner cases
+ Re-analyze as code evolves
Improve Software Reliability