
Designing For Failure

CS 241 Internet Services
© 1999-2001 Armando Fox
[email protected]


Administrivia

Take-home part of midterm will be posted later, due 2/26
  Open book, open notes, etc.
  No collaboration with other humans
  Use the Internet to do research, not to look up the answers
Possible paper submission targets (see course Web page)
  DIMACS Workshop on Pervasive Networking, Rutgers University, NJ, May 21-22. Deadline: April 1
  UbiComp 2001, Atlanta, Sep 30-Oct 2. Deadline: April 13
  IEEE Computer special issue on pervasive computing. Deadline: May 1
I'm still available to meet with projects and/or help you pull together the writeup.


Outline

Mission: Under the assumption that hardware and software will fail, how do we "design for modular restartability"?
User expectations and failure semantics: case studies and lessons
  TACC server and MS Tiger video server: using soft state
  Cirrus banking network, TWA flight reservations system: starting from user expectations and "out of band" (offline) error handling
  Search engines, transformation proxies, online aggregation: exploiting harvest/yield tradeoffs
  Mars Pathfinder: protecting a system with orthogonal mechanisms


Designing For Failure: Philosophy

Start with some "givens"
  Hardware does fail
  Latent bugs do occur
  Nondeterministic/hard-to-reproduce bugs will happen
Requirements:
  Maintain availability (possibly with degraded performance) when these things happen
  Isolate faults so they don't bring the whole system down
Question: What is "availability" from the end user's point of view?
  Specifically... what constitutes correct/acceptable behavior?


User Expectations & Failure Semantics

Determine expected/common failure modes of system
Find "cheap" mechanisms to address them
  Analysis: Invariants are your friends
  Performance: keep the common case fast
  Robustness: minimize dependencies on other components to avoid "domino effect"
Determine effects on performance & semantics
  Example: "Reload semantics" of the Internet
  Example: acceptable imprecisions/staleness of information
How about unexpected/unsupported failure modes?


Anti-Example: NFS

Original philosophy: "Stateless is good"
Implementation revealed: performance was bad
Caching retrofitted as a hack
  Attribute cache for stat()
  Stateless -> soft state!
Network file locking added later
  Inconsistent lock state if server or client crashes
User's view of data semantics wasn't considered from the get-go. Result: brain-dead semantics.


Example: Multicast/SRM

Expected failure: periodic router death/partition
Cheap solution:
  Each router stores only local mcast routing info
  Soft state with periodic refreshes
  Absence of "reachability beacons" used to infer failure of upstream nodes; can look for another
  Receiving duplicate refreshes is idempotent
Unsupported failure mode: frequent or prolonged router failure
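
To make the soft-state idea concrete, here is a minimal sketch (invented names and constants, not the actual SRM or router code): entries live only as long as they are refreshed, so a crashed upstream node simply ages out, and a duplicate refresh is a harmless overwrite.

    import time

    REFRESH_PERIOD = 5            # seconds between beacons (assumed value)
    TTL = 3 * REFRESH_PERIOD      # entry expires after a few missed beacons

    routes = {}                   # upstream node -> last refresh time (soft state)

    def on_refresh(upstream):
        # Idempotent: a duplicate beacon just rewrites the timestamp.
        routes[upstream] = time.time()

    def expire():
        # No explicit teardown protocol: silence *is* the failure signal.
        now = time.time()
        for upstream, seen in list(routes.items()):
            if now - seen > TTL:
                del routes[upstream]   # infer failure; caller can look for another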


Example: TACC Platform

Expected failure mode: Load balancer death (or partition)
Cheap solution: LB state is all soft
  LB periodically beacons its existence & location
  Workers connect to it and send periodic load reports
  If LB crashes, state is restored within a few seconds
Effect on data semantics
  Cached, stale state in FE's allows temporary operation during LB restart
  Empirically and analytically, this is not so bad
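
A sketch of why the restart is free (assumed interface, not the real TACC code): the same periodic load reports that drive normal balancing repopulate a freshly restarted LB within one reporting period.

    import time

    class LoadBalancer:
        """All state is soft: nothing survives a crash, and nothing needs to."""
        def __init__(self):
            self.loads = {}                    # worker -> (load, last report time)

        def on_load_report(self, worker, load):
            # The message that drives normal balancing *is* the recovery
            # path: no special recovery code anywhere.
            self.loads[worker] = (load, time.time())

        def pick_worker(self, staleness_s=10):
            fresh = {w: l for w, (l, t) in self.loads.items()
                     if time.time() - t < staleness_s}  # ignore stale reports
            return min(fresh, key=fresh.get) if fresh else None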


Example: TACC Workers

Expected failure: Worker death (or partition)
Cheap solution:
  Invariant: workers are restartable
  Use retry count to handle pathological inputs
Effect on system behavior/semantics
  Temporary extra latency for retry
Unexpected failure mode: livelock!
  Timers retrofitted; what should be the behavior on timeout? (Kill request, or kill worker?)
  User-perceived effects?
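
A hypothetical front-end dispatch loop (invented types and interface) showing why the retry count matters: without the cap, one pathological input would serially kill every worker that touches it.

    class WorkerDied(Exception): pass
    class Timeout(Exception): pass
    class GiveUp(Exception): pass

    MAX_RETRIES = 3

    def dispatch(request, workers, timeout_s=30):
        """Restart is the only recovery action; retry count bounds the damage."""
        for attempt in range(MAX_RETRIES):
            worker = workers.pick()
            try:
                return worker.run(request, timeout=timeout_s)
            except WorkerDied:
                workers.restart(worker)     # invariant: workers are restartable
            except Timeout:
                # The retrofitted question: is the *request* pathological
                # (kill the request) or is the *worker* wedged (kill the
                # worker)? This sketch pessimistically restarts the worker.
                workers.restart(worker)
        raise GiveUp("pathological input: %r" % request)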


A Lesson From TACC: Soft StateA Lesson From TACC: Soft State

+ No special recovery codeNo special recovery code Recovery == restartRecovery == restart

““Recovery” paths exercised during normal operationRecovery” paths exercised during normal operation

+ Leaving the group (or dying) doesn’t require Leaving the group (or dying) doesn’t require special codespecial code Timeouts/expirations mean no garbage collectionTimeouts/expirations mean no garbage collection

““Passive” entities will not consume system resourcesPassive” entities will not consume system resources

– Staleness or other imprecision in answerStaleness or other imprecision in answer

– Possible lack of atomicity/ordering of state updatesPossible lack of atomicity/ordering of state updates When might you care about this?When might you care about this?


When to Use Soft State

Optimization that doesn't impact correctness
  caching, load balancing (e.g. TACC)
Temporary state loss maskable by higher-level mechanisms
User expectations and failure semantics allow it
  The Internet is a best-effort network ("reload semantics")
  Microsoft Tiger example (coming up)
Currently application-specific; goal is to produce a catalog of techniques and quantitative rules for applying them


Example: MS Tiger Video Fileserver

ATM-connected disks "walk down" static schedule
"Coherent hallucination": global schedule, but each disk only knows its local piece
Pieces are passed around ring, bucket-brigade style

(Figure: disks 0-4 on an ATM interconnect to the WAN; schedule slots advance over time.)


Example: Tiger, cont'd.

Once failure is detected, bypass failed cub
  Why not just detect first, then bypass schedule info to successor?
If cub fails, its successor is already prepared to take over that schedule slot

(Figure: the same ring, with the failed cub bypassed.)


Example: Tiger, cont'd.

Failed cub permanently removed from ring

(Figure: the ring reformed without the failed cub.)


Example: Tiger, cont'd.

Expected failure mode: death of a cub (node)
Cheap solution:
  All files are mirrored using decluster factor
  Cub passes chunk of the schedule to both its successor and second-successor
Effect on performance/semantics
  No user-visible latency to recover from failure!
  What's the cost of this?
  What would be an analogous TACC mechanism?
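
A toy rendition of the bucket brigade (invented names and methods, not Tiger's actual code): each cub forwards its schedule slice to both downstream neighbors, so when a cub dies its successor already holds the slice and serves the slot with no recovery latency.

    def pass_schedule(ring, slices):
        """ring: list of cubs in schedule order; slices: cub -> schedule slice."""
        n = len(ring)
        for i, cub in enumerate(ring):
            chunk = slices[cub]
            succ, succ2 = ring[(i + 1) % n], ring[(i + 2) % n]
            succ.prepare(chunk)      # normal bucket-brigade handoff
            succ2.prepare(chunk)     # insurance: second-successor is warm too

    def on_slot(ring, i, slices):
        cub = ring[i % len(ring)]
        if cub.alive():
            cub.serve(slices[cub])
        else:
            # Successor already holds the dead cub's slice (and the mirrored
            # data, thanks to declustering): no user-visible pause.
            ring[(i + 1) % len(ring)].serve(slices[cub])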


Example: Tiger, cont'd.

Unsupported failure modes
  Interconnect failure
  Failure of both primary and secondary copies of a block
  User-visible effect: frame loss or complete hosage
Notable differences from TACC
  Schedule is global but not centralized (coherent hallucination)
  State is not soft but not really "durable" either (probabilistically hard? Asymptotically hard?)




Example: CIRRUS Banking Network

CIRRUS network is a "smart switch" that runs on Tandem NonStop nodes (deployed early 80's)
2-phase commit for cash withdrawal:
  1. Withdrawal request from ATM
  2. "xact in progress" logged at bank
  3. [Commit point] cash dispensed, confirmation sent to bank
  4. Ack from bank
Non-obvious failure mode #1 after commit point: CIRRUS switch resends #3 until reply received
  If reply indicates error, manual cleanup needed
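
A sketch of the post-commit rule (invented interface, not the actual switch code): once cash is out of the machine, the confirmation can never be abandoned, only retried; an error reply escalates to humans because software cannot un-dispense cash.

    import time

    def confirm_after_commit(switch, txn, retry_interval_s=5):
        """Past the commit point, the only safe automatic action is to
        resend message #3 until *some* reply arrives."""
        while True:
            reply = switch.send_confirmation(txn)   # message #3
            if reply is None:                       # no reply yet: keep trying
                time.sleep(retry_interval_s)
                continue
            if reply.is_error():
                # Cash is already dispensed; hand off to manual cleanup.
                return escalate_to_manual_cleanup(txn, reply)  # hypothetical helper
            return reply                            # message #4: bank's ack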


ATM Example, continued

Non-obvious failure mode #2: Cash is dispensed, but CIRRUS switch never sees message #3 from ATM
  Notify bank: cash was not dispensed
  "Reboot" ATM-to-CIRRUS net
  Query net log to see if net thinks #3 was sent; if so, and #3 arrived before end of business day, create Adjustment Record at both sides
  Otherwise, resolve out-of-band using physical evidence of xact (tape logs, video cam, etc)
Role of end-user semantics for picking design point for auto vs. manual recovery (end of business day)
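
The recovery decision as a tiny function (hypothetical names; a reading of the slide, not CIRRUS's code): the business-day cutoff is the designed boundary between automatic reconciliation and out-of-band, human resolution.

    def reconcile_lost_confirmation(net_log, txn, end_of_business_day):
        sent = net_log.says_sent(txn.confirmation_id)     # did #3 leave the ATM?
        if sent and sent.arrival_time <= end_of_business_day:
            create_adjustment_record(txn, side="bank")    # hypothetical helpers:
            create_adjustment_record(txn, side="cirrus")  # both sides must agree
        else:
            # Past the cutoff, software gives up gracefully: physical
            # evidence (tape logs, cameras) settles it offline.
            open_manual_case(txn)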


ATM and TWA Reservations Example

Non-obvious failure mode #3: malicious ATM fakes message #3.
  Not caught or handled! "In practice, this won't go unnoticed in the banking industry and/or by our customers."
TWA Reservations System (~1985)
  "Secondary" DB of reservations sold; points into "primary" DB of complete seating inventory
  Dangling pointers and double-bookings resolved offline
  Explicitly trades availability for throughput: "...a 90% utilization rate would make it impossible for us to log all our transactions"


Lessons from CIRRUS and TWALessons from CIRRUS and TWA

Start from user’s point of view of expected system Start from user’s point of view of expected system behaviorbehavior

Handle failures relative to thatHandle failures relative to that CIRRUS: what would user do if she received CIRRUS: what would user do if she received nono cash, but cash, but

withdrawal was recorded?withdrawal was recorded?

TWA: what is the (business) cost of processing fewer TWA: what is the (business) cost of processing fewer customers per day in order to maintain strong customers per day in order to maintain strong consistency?consistency?


Harvest and Yield

Yield: probability of completing a query
Harvest: (application-specific) fidelity of the answer
  Fraction of data represented?
  Precision?
  Semantic proximity?
Harvest/yield questions:
  When can we trade harvest for yield to improve availability?
  How to measure harvest "threshold" below which response is not useful?
  Application decomposition to improve "degradation tolerance" (and therefore availability)
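
One concrete reading of the two definitions, in counter form (a sketch; the names are invented):

    def yield_fraction(completed_queries, offered_queries):
        # Yield: what fraction of queries got *any* answer at all.
        return completed_queries / offered_queries

    def harvest_fraction(data_reflected, data_total):
        # Harvest, in its "fraction of data represented" flavor: how much
        # of the underlying data the answer actually reflects.
        return data_reflected / data_total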


Example 1: Search Engine

Stripe database randomly across all nodes, replicate high-priority data
  Random striping: worst case == average case
  Replication: high-priority data unlikely to be lost
Harvest: fraction of nodes reporting
Questions...
  Why not just wait for all nodes to report back?
  Should harvest be reported to end user?
  What is the "useful" harvest threshold?
  Is nondeterminism a problem?
Trade harvest for yield/throughput
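
A minimal scatter-gather sketch (invented node interface; assumes each node.search returns a list and Python 3.9+): answer with whoever reports within the deadline, and measure harvest as the fraction of nodes heard from.

    import concurrent.futures

    def search(nodes, query, deadline_s=0.5):
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(nodes))
        futures = [pool.submit(node.search, query) for node in nodes]
        done, _ = concurrent.futures.wait(futures, timeout=deadline_s)
        pool.shutdown(wait=False, cancel_futures=True)  # never block on stragglers
        results = [r for f in done for r in f.result()]
        harvest = len(done) / len(nodes)
        return results, harvest   # a partial answer now beats a full answer never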


Example 2: TranSend

Recall: lossy on-the-fly Web image compression, extensively parameterized (per user, device, etc.)
Harvest: "semantic fidelity" of what you get
  Worst case: the original image
  Intermediate case: "close" image that has been previously computed and cached
Metrics for semantic fidelity?
Trade harvest for yield/throughput

(Figure: the desired vs. the delivered image; fallbacks range from a cached "close" version to the original.)
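
A sketch of the fallback chain (invented interfaces, not TranSend's code): when the exact requested variant can't be produced in time, serve the nearest cached variant, or at worst the original. Harvest drops, but the query still completes.

    class Overloaded(Exception): pass

    def serve_image(cache, distiller, url, params):
        """Degrade harvest, never yield."""
        try:
            return distiller.transform(url, params)        # exact: full harvest
        except (Overloaded, TimeoutError):
            close = cache.nearest_variant(url, params)     # previously computed
            if close is not None:
                return close                               # "close": partial harvest
            return fetch_original(url)                     # worst case: the original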


Synergies: H/Y and Cluster Synergies: H/Y and Cluster ComputingComputing

Common thread: trade harvest for yield or Common thread: trade harvest for yield or throughputthroughput Search engine: harvest == fraction of nodes reportingSearch engine: harvest == fraction of nodes reporting

TranSend: harvest == semantic similarity between TranSend: harvest == semantic similarity between delivered & requested resultsdelivered & requested results

Both cases: a direct function of availability of computing Both cases: a direct function of availability of computing powerpower

Harvest vs. Yield neat tricksHarvest vs. Yield neat tricks Migrating a cluster with zero downtimeMigrating a cluster with zero downtime

Managing user classesManaging user classes


Orthogonal Mechanism Example: SFI

Philosophy: Design with unexpected failure in mind
  There are "expected failures" and "unexpected failures"
  Minimize assumptions
  Design systems to allow independent failure
In real life (hardware)
  Mechanical interlock systems
  Microprocessor hardware timeouts
In real life (software)
  Security and safety
  Deadlock detection and shootdown


Kinds of Software-Based Fault Kinds of Software-Based Fault IsolationIsolation

Put each piece of code in its own Put each piece of code in its own fault domainfault domain Domain=code + data, each aligned on power-of-2 boundaryDomain=code + data, each aligned on power-of-2 boundary

Configure MMU to fault on accesses/jumps outside fault Configure MMU to fault on accesses/jumps outside fault domaindomain

Binary rewriting for non-statically-verifiable accessesBinary rewriting for non-statically-verifiable accesses E.g. jump through a register, or indirect accessE.g. jump through a register, or indirect access

Insert code before each potentially-unsafe operation (like Insert code before each potentially-unsafe operation (like runtime type/bounds checking)runtime type/bounds checking)

System calls (or any cross-domain calls) are redirected System calls (or any cross-domain calls) are redirected through a protected jump table, to an through a protected jump table, to an arbitratorarbitrator

Can isolate pieces of user code Can isolate pieces of user code from each otherfrom each other without without using the (heavyweight) system-call mechanismusing the (heavyweight) system-call mechanism
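
One flavor of the inserted check, in arithmetic form (Wahbe-style address masking; the constants are made up, and real rewriters emit this as a couple of machine instructions): because domains are power-of-2 aligned, an address can be forced into the domain with a mask and an OR.

    SEGMENT_BASE = 0x4000_0000          # example fault domain base (aligned)
    SEGMENT_SIZE = 0x0010_0000          # 1 MiB -> 20 offset bits
    OFFSET_MASK  = SEGMENT_SIZE - 1

    def sandboxed_address(addr):
        # The check inserted before an unverifiable store or jump: keep only
        # the offset bits, then force the segment bits. The result *cannot*
        # name memory outside the fault domain (a wild pointer is corrupted
        # into a safe one rather than trapped).
        return SEGMENT_BASE | (addr & OFFSET_MASK)

    assert sandboxed_address(0xDEAD_BEEF) >> 20 == SEGMENT_BASE >> 20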


Other Examples of Orthogonality

Orthogonal Security/Safety
  Software fault isolation
  Sandboxing untrusted code with Janus
  IP firewalls
Deadlock detection & recovery in databases
  Note: not deadlock avoidance!
  Timeouts and related guards
Hardware orthogonal security
  Fuses and hardware interlocks
  Recall the Therac-25


Timeout example: Space ShuttleTimeout example: Space Shuttle

STS-2, second shuttle mission (1981)STS-2, second shuttle mission (1981) Four redundant IBM processors running custom softwareFour redundant IBM processors running custom software

All All fourfour wedged identically during simulation of a “Trans- wedged identically during simulation of a “Trans-atlantic abort” scenarioatlantic abort” scenario

Cause: systematic bug that caused a branch to garbage Cause: systematic bug that caused a branch to garbage code (computed-GOTO variable not properly reinitialized)code (computed-GOTO variable not properly reinitialized)

Would have to enter same code path Would have to enter same code path twicetwice to trigger it to trigger it

Display unit displayed error code that means “no bits Display unit displayed error code that means “no bits being sent to me”being sent to me”

Insight: Insight: someonesomeone noticed something was wrong noticed something was wrong (display)(display)


"What Really Happened On Mars"

Sources: various posts to Risks Digest, including one from the CTO of Wind River Systems, which makes the VxWorks operating system that controls the Pathfinder.
Background concepts
  Threads and mutual exclusion
  Mutexes and priority inversion
  Priority inheritance
How the failure was diagnosed and fixed
Lessons


What Really Happened

Dramatis personae
  Low-priority thread A: infrequent, short-running meteorological data collection, using bus mutex
  High-priority thread B: bus manager, using bus mutex
  Medium-priority thread C: long-running communications task (that doesn't need the mutex)
Priority inversion scenario
  A is scheduled, and grabs bus mutex
  B is scheduled, and blocks waiting for A to release mutex
  C is scheduled while B is waiting for mutex
  C has higher priority than A, so it prevents A from running (and therefore B as well)
  Watchdog timer notices B hasn't run, concludes something is wrong, reboots
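
The fix was to enable priority inheritance on the mutex. In sketch form (toy types; threads are assumed to carry prio, base_prio, and blocked_on attributes): the waiter donates its priority to the holder, so medium-priority C can no longer starve A, and therefore B.

    class Mutex:
        def __init__(self):
            self.owner = None

    def lock(mutex, thread):
        if mutex.owner is None:
            mutex.owner = thread
        else:
            thread.blocked_on = mutex
            # Priority inheritance: A (the owner) temporarily runs at B's
            # (the waiter's) priority, so C can't preempt it.
            mutex.owner.prio = max(mutex.owner.prio, thread.prio)

    def unlock(mutex):
        mutex.owner.prio = mutex.owner.base_prio   # shed inherited priority
        mutex.owner = None
        # (A real kernel would also wake the highest-priority waiter here.)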


Lessons They LearnedLessons They Learned

Extensive logging facilities (trace of system Extensive logging facilities (trace of system events) allowed diagnosisevents) allowed diagnosis

Debugging facility (ability to execute C commands Debugging facility (ability to execute C commands that tweak the running system…kind of like that tweak the running system…kind of like gdbgdb) ) allowed problem to be fixedallowed problem to be fixed

Debugging facility was intended to be Debugging facility was intended to be turned offturned off prior to production useprior to production use Similar facility in certain commercial CPU designs!Similar facility in certain commercial CPU designs!


Lessons We Should Learn

The (potentially) biggest gotcha in modular, loosely-coupled design: Complexity has big pitfalls
  Concurrency, race conditions, and mutual exclusion are hard to debug
  C.A.R. Hoare: "The unavoidable price of reliability is simplicity."
It's more important to get it extensible than to get it right
  Even in the hardware world!
Knowledge of end-to-end semantics can be critical


Why Are Orthos Appealing?

Small state space
  Behavior of any one mechanism is simple to predict (usually)
Allows us to enforce at least some simple invariants
  And invariants are your friends
An appealing application of the end-to-end argument?
Orthogonality At Runtime
  SFI and Janus
  Virtual memory systems (lots of security work based on this)
  Process-level resource management ("kill -9")


Runtime Orthogonality: Pros and Cons

E.g. VM-based (recoverable memory, sandboxing...)
+ Use high-confidence machine-level mechanisms
  Based on hardware-level mechanisms, e.g. VM, traps
  In practice, hardware implementation errors for these are extremely rare (why?)
+ Can be used with arbitrary "legacy" code
- No onus on programmer to make potential error conditions explicit (e.g. assertions)
  So runtime has no idea what to do to "recover"
- Doesn't guarantee correct behavior--only safety to others (so helps modularity, but not necessarily correctness)


Another Ortho: Dataflow Analysis

Potentially unsafe operations must always be denied, to be conservative
  If done statically, renders code impotent
Idea: quarantine the data that may be "contaminated" by user (taintperl works this way)

    print STDERR "Enter file name:";
    $x = <STDIN>;               # $x is tainted (user input)
    ...more code...
    $z = "/tmp/safe_file.txt";  # $z is clean
    $y = "$sysdir/$x";          # $y is tainted
    system("cat $y");           # disallowed!
    system("cat $z");           # OK


KISSO Principle: Keep It Simple, Soft, and Orthogonal

Synergy between soft state and simplicity
  easier recovery
  may allow a design that centralizes some state without being a single point of failure
Synergy between orthos and restartability
  Restartable components are easily protected by orthos
But loosely-coupled distributed systems are hard to debug
  ...then again, so are big monolithic programs
  And ensembles are more likely amenable to parallel development


Travel Photo: Ski Joring

