Designing For Failure
CS 241 Internet Services
© 1999-2001 Armando Fox
[email protected]
© 2001
Stanford
Administrivia
Take-home part of midterm will be posted later, due 2/26
Open book, open notes, etc.
No collaboration with other humans
Use the Internet to do research, not look up the answers
Possible paper submission targets (see course Web page)
DIMACS Workshop on Pervasive Networking, Rutgers University, NJ, May 21-22. Deadline: April 1
UbiComp 2001, Atlanta, Sep 30-Oct 2. Deadline: April 13
IEEE Computer special issue on pervasive computing. Deadline: May 1
I’m still available to meet with projects and/or help you pull together the writeup.
Outline
Mission: Under the assumption that hardware and software will fail, how do we “design for modular restartability”?
User expectations and failure semantics: case studies and lessons
TACC server and MS Tiger video server: using soft state
Cirrus banking network, TWA flight reservations system: starting from user expectations and “out of band” (offline) error handling
Search engines, transformation proxies, online aggregation: exploiting harvest/yield tradeoffs
Mars Pathfinder: protecting a system with orthogonal mechanisms
Designing For Failure: Philosophy
Start with some “givens”
Hardware does fail
Latent bugs do occur
Nondeterministic/hard-to-reproduce bugs will happen
Requirements:
Maintain availability (possibly with degraded performance) when these things happen
Isolate faults so they don’t bring the whole system down
Question: What is “availability” from the end user’s point of view?
Specifically… what constitutes correct/acceptable behavior?
User Expectations & Failure Semantics
Determine expected/common failure modes of the system
Find “cheap” mechanisms to address them
Analysis: invariants are your friends
Performance: keep the common case fast
Robustness: minimize dependencies on other components to avoid the “domino effect”
Determine effects on performance & semantics
Example: “reload semantics” of the Internet
Example: acceptable imprecision/staleness of information
How about unexpected/unsupported failure modes?
Anti-Example: NFS
Original philosophy: “Stateless is good”
Implementation revealed: performance was bad
Caching retrofitted as a hack
Attribute cache for stat()
Stateless → soft state!
Network file locking added later
Inconsistent lock state if server or client crashes
User’s view of data semantics wasn’t considered from the get-go. Result: brain-dead semantics.
Example: Multicast/SRM
Expected failure: periodic router death/partition
Cheap solution:
Each router stores only local mcast routing info
Soft state with periodic refreshes
Absence of “reachability beacons” used to infer failure of upstream nodes; can look for another
Receiving duplicate refreshes is idempotent
Unsupported failure mode: frequent or prolonged router failure
Example: TACC Platform
Expected failure mode: load balancer death (or partition)
Cheap solution: LB state is all soft
LB periodically beacons its existence & location
Workers connect to it and send periodic load reports
If LB crashes, state is restored within a few seconds
Effect on data semantics
Cached, stale state in FEs allows temporary operation during LB restart
Empirically and analytically, this is not so bad
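The refresh-and-expire pattern can be sketched in a few lines. This is a hypothetical sketch, not the actual TACC code; `ttl` stands in for the beacon/report timeout:

```python
import time

class SoftStateTable:
    """Soft-state worker registry in the spirit of the TACC load balancer:
    all state is rebuilt from periodic load reports, so recovery == restart."""
    def __init__(self, ttl):
        self.ttl = ttl          # seconds before an unrefreshed entry expires
        self.entries = {}       # worker -> (load, time of last report)

    def report(self, worker, load, now=None):
        # A duplicate report is idempotent: it only refreshes the timestamp.
        self.entries[worker] = (load, now if now is not None else time.time())

    def live_workers(self, now=None):
        now = now if now is not None else time.time()
        # Expiry replaces explicit garbage collection: dead workers age out.
        self.entries = {w: (l, t) for w, (l, t) in self.entries.items()
                        if now - t < self.ttl}
        return {w: l for w, (l, t) in self.entries.items()}
```

If the table is lost in a crash, a restarted LB repopulates it within one report interval, which is exactly why no special recovery code is needed.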
Example: TACC Workers
Expected failure: worker death (or partition)
Cheap solution:
Invariant: workers are restartable
Use a retry count to handle pathological inputs
Effect on system behavior/semantics
Temporary extra latency for retry
Unexpected failure mode: livelock!
Timers retrofitted; what should be the behavior on timeout? (Kill the request, or kill the worker?)
User-perceived effects?
A Lesson From TACC: Soft State
+ No special recovery code
Recovery == restart
“Recovery” paths exercised during normal operation
+ Leaving the group (or dying) doesn’t require special code
Timeouts/expirations mean no garbage collection
“Passive” entities will not consume system resources
– Staleness or other imprecision in answer
– Possible lack of atomicity/ordering of state updates
When might you care about this?
When to Use Soft State
Optimization that doesn’t impact correctness
Caching, load balancing (e.g. TACC)
Temporary state loss maskable by higher-level mechanisms
User expectations and failure semantics allow it
The Internet is a best-effort network (“reload semantics”)
Microsoft Tiger example (coming up)
Currently application-specific; goal is to produce a catalog of techniques and quantitative rules for applying them
Example: MS Tiger Video Fileserver
ATM-connected disks “walk down” a static schedule
“Coherent hallucination”: global schedule, but each disk only knows its local piece
Pieces are passed around the ring, bucket-brigade style
[Diagram: cubs 0-4 on an ATM interconnect to the WAN; the schedule advances over time]
Example: Tiger, cont’d.
Once failure is detected, bypass the failed cub
Why not just detect first, then bypass schedule info to the successor?
If a cub fails, its successor is already prepared to take over that schedule slot
Example: Tiger, cont’d.
Failed cub permanently removed from ring
Example: Tiger, cont’d.
Expected failure mode: death of a cub (node)
Cheap solution:
All files are mirrored using a decluster factor
Cub passes its chunk of the schedule to both its successor and second-successor
Effect on performance/semantics
No user-visible latency to recover from failure!
What’s the cost of this?
What would be an analogous TACC mechanism?
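The takeover rule can be sketched as a toy function (illustrative only, not the actual Tiger scheduler): a slot is served by its home cub, else by the successor or second-successor that already holds the mirrored data and schedule chunk.

```python
def slot_owner(slot, cubs_alive, ring_size):
    """Tiger-style takeover sketch: who serves a schedule slot when cubs die.
    The slot's home cub serves it; on failure, the successor (which was
    handed the schedule chunk in advance) takes over, then the
    second-successor. Both mirrors lost == unsupported failure mode."""
    for hop in range(3):                 # primary, successor, second-successor
        cub = (slot + hop) % ring_size
        if cub in cubs_alive:
            return cub
    return None                          # user-visible frame loss
```

Because the successor already has both the data and the schedule chunk before any failure, the handoff adds no user-visible recovery latency.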
Example: Tiger, cont’d.
Unsupported failure modes
Interconnect failure
Failure of both primary and secondary copies of a block
User-visible effect: frame loss or complete hosage
Points worth mentioning
Schedule is global but not centralized (coherent hallucination)
State is not soft but not really “durable” either (probabilistically hard? Asymptotically hard?)
Example: CIRRUS Banking Network
CIRRUS network is a “smart switch” that runs on Tandem NonStop nodes (deployed early 80’s)
2-phase commit for cash withdrawal:
1. Withdrawal request from ATM
2. “xact in progress” logged at bank
3. [Commit point] cash dispensed, confirmation sent to bank
4. Ack from bank
Non-obvious failure mode #1 after commit point: CIRRUS switch resends #3 until a reply is received
If the reply indicates an error, manual cleanup is needed
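Resending #3 until a reply arrives only works if the bank treats duplicate confirmations idempotently, keyed by transaction id. A hedged sketch of that interaction (hypothetical names; not the CIRRUS protocol code):

```python
class BankLedger:
    """Receiving side of message #3: duplicates must be acked without
    recording the withdrawal twice (idempotent, keyed by transaction id)."""
    def __init__(self):
        self.confirmed = set()

    def on_confirmation(self, xact_id):
        if xact_id in self.confirmed:
            return "ack"            # resend: ack again, change nothing
        self.confirmed.add(xact_id)
        return "ack"

def resend_until_acked(bank, xact_id, drop_first_n):
    """Switch side after the commit point: cash is already dispensed, so the
    only option is to retry #3 forever. The first few sends are 'lost' here
    to simulate the failure mode."""
    tries = 0
    while True:
        tries += 1
        if tries > drop_first_n:    # this attempt finally gets through
            return tries, bank.on_confirmation(xact_id)
```

At-least-once delivery plus an idempotent receiver gives the effect of exactly-once recording, which is why the automatic path covers this failure mode and only an error reply drops to manual cleanup.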
ATM Example, continued
Non-obvious failure mode #2: cash is dispensed, but the CIRRUS switch never sees message #3 from the ATM
Notify bank: cash was not dispensed
“Reboot” ATM-to-CIRRUS net
Query the net log to see if the net thinks #3 was sent; if so, and #3 arrived before end of business day, create an Adjustment Record at both sides
Otherwise, resolve out-of-band using physical evidence of the xact (tape logs, video cam, etc.)
Role of end-user semantics in picking the design point for auto vs. manual recovery (end of business day)
ATM and TWA Reservations Example
Non-obvious failure mode #3: malicious ATM fakes message #3.
Not caught or handled! "In practice, this won't go unnoticed in the banking industry and/or by our customers.”
TWA Reservations System (~1985)
“Secondary” DB of reservations sold; points into “primary” DB of complete seating inventory
Dangling pointers and double-bookings resolved offline
Explicitly trades availability for throughput: “...a 90% utilization rate would make it impossible for us to log all our transactions”
Lessons from CIRRUS and TWA
Start from the user’s point of view of expected system behavior
Handle failures relative to that
CIRRUS: what would the user do if she received no cash, but the withdrawal was recorded?
TWA: what is the (business) cost of processing fewer customers per day in order to maintain strong consistency?
Harvest and Yield
Yield: probability of completing a query
Harvest: (application-specific) fidelity of the answer
Fraction of data represented?
Precision?
Semantic proximity?
Harvest/yield questions:
When can we trade harvest for yield to improve availability?
How to measure the harvest “threshold” below which a response is not useful?
Application decomposition to improve “degradation tolerance” (and therefore availability)
Example 1: Search Engine
Stripe the database randomly across all nodes, replicate high-priority data
Random striping: worst case == average case
Replication: high-priority data unlikely to be lost
Harvest: fraction of nodes reporting
Questions…
Why not just wait for all nodes to report back?
Should harvest be reported to the end user?
What is the “useful” harvest threshold?
Is nondeterminism a problem?
Trade harvest for yield/throughput
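The tradeoff can be made concrete with a small sketch (hypothetical names): answer from whichever stripes reported before the deadline, compute the harvest, and fail the query only when harvest falls below the useful threshold.

```python
def degraded_query(node_results, total_nodes, harvest_threshold):
    """Trade harvest for yield: node_results maps node id -> hits from that
    node's stripe; nodes that timed out or died are simply absent.
    Below the threshold the answer isn't useful, so fail (losing yield);
    above it, return a degraded-but-useful answer (keeping yield)."""
    harvest = len(node_results) / total_nodes
    if harvest < harvest_threshold:
        return None, harvest
    hits = [h for hits in node_results.values() for h in hits]
    return hits, harvest
```

With random striping, losing any one node costs roughly the same small slice of harvest, which is what makes the worst case look like the average case.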
Example 2: TranSend
Recall: lossy on-the-fly Web image compression, extensively parameterized (per user, device, etc.)
Harvest: “semantic fidelity” of what you get
Worst case: the original image
Intermediate case: “close” image that has been previously computed and cached
Metrics for semantic fidelity?
Trade harvest for yield/throughput
[Diagram: desired vs. delivered image quality]
Synergies: H/Y and Cluster Computing
Common thread: trade harvest for yield or throughput
Search engine: harvest == fraction of nodes reporting
TranSend: harvest == semantic similarity between delivered & requested results
Both cases: a direct function of availability of computing power
Harvest vs. yield neat tricks
Migrating a cluster with zero downtime
Managing user classes
Orthogonal Mechanism Example: SFI
Philosophy: design with unexpected failure in mind
There are “expected failures” and “unexpected failures”
Minimize assumptions
Design systems to allow independent failure
In real life (hardware)
Mechanical interlock systems
Microprocessor hardware timeouts
In real life (software)
Security and safety
Deadlock detection and shootdown
Kinds of Software-Based Fault Isolation
Put each piece of code in its own fault domain
Domain = code + data, each aligned on a power-of-2 boundary
Configure MMU to fault on accesses/jumps outside the fault domain
Binary rewriting for non-statically-verifiable accesses
E.g. jump through a register, or indirect access
Insert code before each potentially-unsafe operation (like runtime type/bounds checking)
System calls (or any cross-domain calls) are redirected through a protected jump table, to an arbitrator
Can isolate pieces of user code from each other without using the (heavyweight) system-call mechanism
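The power-of-2 alignment is what makes the inserted check cheap. A sketch of the address-sandboxing arithmetic (illustrative; real SFI emits this as a couple of machine instructions before each unsafe access):

```python
def sandbox_address(addr, domain_base, domain_bits):
    """SFI-style sandboxing of a computed (non-statically-verifiable) address.
    A fault domain is 2**domain_bits bytes aligned on a power-of-2 boundary,
    so keeping the low bits and pinning the high bits to the domain's base
    guarantees the access lands inside the domain."""
    mask = (1 << domain_bits) - 1       # low bits: offset within the domain
    return domain_base | (addr & mask)  # high bits: forced to this domain
```

Note the check doesn’t detect a bad address, it silently redirects it into the domain; that is safety for everyone else, not correctness for the offending code.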
Other Examples of Orthogonality
Orthogonal security/safety
Software fault isolation
Sandboxing untrusted code with Janus
IP firewalls
Deadlock detection & recovery in databases
Note: not deadlock avoidance!
Timeouts and related guards
Hardware orthogonal security
Fuses and hardware interlocks
Recall the Therac-25
Timeout Example: Space Shuttle
STS-2, second shuttle mission (1981)
Four redundant IBM processors running custom software
All four wedged identically during simulation of a “transatlantic abort” scenario
Cause: systematic bug that caused a branch to garbage code (computed-GOTO variable not properly reinitialized)
Would have to enter the same code path twice to trigger it
Display unit showed an error code that means “no bits being sent to me”
Insight: someone noticed something was wrong (the display)
“What Really Happened On Mars”
Sources: various posts to Risks Digest, including one from the CTO of Wind River Systems, which makes the VxWorks operating system that controls the Pathfinder.
Background concepts
Threads and mutual exclusion
Mutexes and priority inversion
Priority inheritance
How the failure was diagnosed and fixed
Lessons
What Really Happened
Dramatis personae
Low-priority thread A: infrequent, short-running meteorological data collection, using the bus mutex
High-priority thread B: bus manager, using the bus mutex
Medium-priority thread C: long-running communications task (that doesn’t need the mutex)
Priority inversion scenario
A is scheduled, and grabs the bus mutex
B is scheduled, and blocks waiting for A to release the mutex
C is scheduled while B is waiting for the mutex
C has higher priority than A, so it prevents A from running (and therefore B as well)
Watchdog timer notices B hasn’t run, concludes something is wrong, and reboots
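The scenario, and the priority-inheritance fix that was switched on remotely, can be shown with a one-step toy scheduler (a sketch with hypothetical names, not VxWorks code):

```python
def who_runs(prios, holder, blocked_on_mutex, inheritance):
    """Pick the thread to run (higher number = higher priority).
    holder: thread holding the bus mutex; blocked_on_mutex: threads waiting
    on it. With priority inheritance, the holder temporarily runs at the
    highest priority of anyone it is blocking."""
    eff = dict(prios)
    if inheritance and blocked_on_mutex:
        eff[holder] = max(eff[holder],
                          max(prios[t] for t in blocked_on_mutex))
    runnable = [t for t in prios if t not in blocked_on_mutex]
    return max(runnable, key=lambda t: eff[t])
```

Without inheritance, medium-priority C wins and starves A (and therefore B) until the watchdog reboots; with inheritance, A briefly runs at B’s priority, releases the mutex, and B proceeds.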
Lessons They Learned
Extensive logging facilities (trace of system events) allowed diagnosis
Debugging facility (ability to execute C commands that tweak the running system… kind of like gdb) allowed the problem to be fixed
Debugging facility was intended to be turned off prior to production use
Similar facility in certain commercial CPU designs!
Lessons We Should Learn
The (potentially) biggest gotcha in modular, loosely-coupled design: complexity has big pitfalls
Concurrency, race conditions, and mutual exclusion are hard to debug
C.A.R. Hoare: “The unavoidable price of reliability is simplicity.”
It’s more important to get it extensible than to get it right
Even in the hardware world!
Knowledge of end-to-end semantics can be critical
Why Are Orthos Appealing?
Small state space
Behavior of any one mechanism is simple to predict (usually)
Allows us to enforce at least some simple invariants
And invariants are your friends
An appealing application of the end-to-end argument?
Orthogonality at runtime
SFI and Janus
Virtual memory systems (lots of security work based on this)
Process-level resource management (“kill -9”)
Runtime Orthogonality: Pros and Cons
E.g. VM-based (recoverable memory, sandboxing…)
+ Uses high-confidence machine-level mechanisms
Based on hardware-level mechanisms, e.g. VM, traps
In practice, hardware implementation errors for these are extremely rare (why?)
+ Can be used with arbitrary “legacy” code
– No onus on programmer to make potential error conditions explicit (e.g. assertions)
So the runtime has no idea what to do to “recover”
– Doesn’t guarantee correct behavior, only safety to others (so it helps modularity, but not necessarily correctness)
Another Ortho: Dataflow Analysis
Potentially unsafe operations must always be denied, to be conservative
If done statically, renders code impotent
Idea: quarantine the data that may be “contaminated” by the user (taintperl works this way)

print STDERR "Enter file name:";
$x = <STDIN>;              # $x is tainted (user input)
# ...more code...
$z = "/tmp/safe_file.txt"; # $z is clean
$y = "$sysdir/$x";         # $y is tainted
system("cat $y");          # disallowed!
system("cat $z");          # OK
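The same quarantine idea can be sketched outside Perl. This is a hypothetical Python rendering of the mechanism; Perl’s real taint mode is built into the interpreter, not a library:

```python
class Tainted(str):
    """Marker type: a string derived from user input carries a taint mark."""
    pass

def taint_concat(*parts):
    # Taint propagates through dataflow: any tainted ingredient
    # taints the result.
    out = "".join(parts)
    return Tainted(out) if any(isinstance(p, Tainted) for p in parts) else out

def safe_system(cmd):
    # The orthogonal guard: deny the unsafe operation on tainted data,
    # regardless of what the rest of the program intended.
    if isinstance(cmd, Tainted):
        raise ValueError("insecure dependency: tainted data in command")
    return "ran: " + cmd
```

As with the other orthogonal mechanisms, this guarantees safety (no user-controlled string reaches the shell), not correctness of the denied code path.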
KISSO Principle: Keep It Simple, Soft, and Orthogonal
Synergy
Synergy between soft state and simplicity
Easier recovery
May allow a design that centralizes some state without being a single point of failure
Synergy between orthos and restartability
Restartable components are easily protected by orthos
But loosely-coupled distributed systems are hard to debug
…then again, so are big monolithic programs
And ensembles are more likely amenable to parallel development