CS444A: Software for Critical Systems

Staff

Prof. David L. Dill

Prof. Armando Fox

Topic

The engineering of software for applications where failure is unacceptable ... for some value of “failure” and “unacceptable”.

Costs of failure exceed value of the software

Critical software is growing in importance

Computers are getting exponentially smaller, cheaper, faster, and better connected.

Communications are improving at least as fast.

Increased use of critical software is irresistible
- Automation of tasks that were previously manual or infeasible.
- Sophisticated control replacing simple control.
- Replacing mechanical, analog, digital hardware.

Software is growing

Software will replace mechanical, analog, and digital hardware
- Cheaper to copy.
- Easier to manufacture.
- Easier to upgrade.
- Provides more functionality.

Software will replace manual processes
- Cheaper and more reliable than human workers
- Relieves them of tedious tasks
- Faster and more predictable

Complexity is increasing

COTS is coming to software
- Large projects increasingly use commercial off-the-shelf components
- Commodity hardware, OS’s, tools, other building blocks
- Example: Mars Pathfinder

This is good and bad
- COTS reduces development cost & development time
- Sophisticated “building blocks” allow creation of more complex systems
- But they are often brittle: intra-component and inter-component failure modes are poorly understood
- Composition of pieces that were designed separately sometimes leads to unexpected failure modes

Software will be used in safety-critical applications

All of the above reasons (esp. cost)

Software can make systems safer
- TCAS (aircraft collision avoidance system)

Software can enhance system performance
- Fly-by-wire, antilock braking

Software can perform life-saving functions
- Computer-controlled pacemakers

Subtopics

Successful engineering of software encompasses many different issues

Relationship of software to the larger system

Software development processes

Software design

Algorithms

Programming practices

Goal: Best Of Both Worlds

Traditional safety-engineering perspective
- Formal verification, requirements specification, related formal methods
- Traditional hazard/fault analysis
- Fault tolerance

Systems perspective
- Design techniques and programming practices
- As much “folklore” as formal
- Especially recent experience in Internet-scale mission-critical systems

Formal Methodology Outline

Safety engineering of systems
- Hazard identification
- Hazard avoidance
- Standards

Requirements specification and tools
- Specification for reactive systems
- Model checking
- Logical specification (Z, VDM?)
- Theorem proving

Fault tolerance
- Fault models
- Fault tolerant protocols

Etc.
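To make "model checking" concrete, here is a minimal sketch of explicit-state checking: enumerate every reachable state of a small model and test an invariant in each one. The model (two processes that test a shared flag and then set it, with no lock) and all names in the code are assumptions made for illustration; they are not taken from the course material or from any real tool.

```python
from collections import deque

# Program-counter values for each process in this hypothetical model.
IDLE, READY, CRITICAL = 0, 1, 2


def successors(state):
    """Yield every state reachable in one step. State = (pc0, pc1, flag)."""
    pcs, flag = state[:2], state[2]
    for i in (0, 1):
        pc = pcs[i]
        if pc == IDLE and flag == 0:
            new_pc, new_flag = READY, flag      # saw the flag clear
        elif pc == READY:
            new_pc, new_flag = CRITICAL, 1      # sets the flag too late
        elif pc == CRITICAL:
            new_pc, new_flag = IDLE, 0          # leaves and clears the flag
        else:
            continue                            # blocked: flag is set
        new_pcs = list(pcs)
        new_pcs[i] = new_pc
        yield (new_pcs[0], new_pcs[1], new_flag)


def find_violation(initial, invariant):
    """Breadth-first search; return a shortest trace to a bad state, or None."""
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


def mutual_exclusion(state):
    """Invariant: the two processes are never both in the critical section."""
    return not (state[0] == CRITICAL and state[1] == CRITICAL)


trace = find_violation((IDLE, IDLE, 0), mutual_exclusion)
```

The search returns a counterexample trace in which both processes pass the test before either sets the flag. That interleaving is rare in testing, but exhaustive exploration finds it directly.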

The Case for the Systems Perspective

Many visible success stories
- The Internet
- Mars Pathfinder
- Gargantuan-scale 24x7 mission critical systems: Wal-Mart, financial exchanges, Visa, CIRRUS banking network…

Some spectacular failures
- Therac-25 (today)

System design combines engineering judgment and “folklore” with formal methodology

The Role of the Internet

The distributed system from hell
- Evolved over >25 years, lots of legacy code layers
- Widely distributed, both geographically and administratively
- Transient failure (hardware & software) is a way of life
- Yet, it mostly works...
- What great ideas can we steal?

The Internet is a good testbed for new approaches to reliability
- “Internet scale” implies large size, exponential growth, and 24x7 operational requirements
- People don’t die (usually) when systems go down
- Strong financial incentive spurs industrial deployment :-)

Systems Track Outline

Conceptual vocabulary, research landscape

Fault isolation, fault containment, orthogonal guard mechanisms

Transactions, replication, consistency

State maintenance

Availability vs. consistency tradeoffs, harvest and yield

Application-level vs. OS-level mechanisms

Systems case studies
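As a rough illustration of "harvest and yield" from the outline above: yield is usually taken to mean the fraction of offered queries that complete, and harvest the fraction of the full data set reflected in the answers. The sketch below assumes a store partitioned into equally sized shards; the class, field names, and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueryStats:
    """Counters for one measurement window (all names are illustrative)."""
    offered: int            # queries that arrived
    completed: int          # queries that received an answer
    shards_total: int       # partitions holding the full data set
    shards_answering: int   # partitions that contributed to the answers


def yield_fraction(stats: QueryStats) -> float:
    """Fraction of offered queries that were answered at all."""
    return stats.completed / stats.offered


def harvest_fraction(stats: QueryStats) -> float:
    """Fraction of the data reflected in each answer (equal-size shards assumed)."""
    return stats.shards_answering / stats.shards_total


# One shard out of ten is down. Two possible responses to the fault:
answer_partially = QueryStats(offered=1000, completed=1000,
                              shards_total=10, shards_answering=9)
reject_affected = QueryStats(offered=1000, completed=900,
                             shards_total=10, shards_answering=10)
```

The first policy keeps yield at 1.0 and lets harvest drop to 0.9; the second keeps harvest at 1.0 and lets yield drop to 0.9. Which trade is acceptable depends on the application.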

Goals

Identify recurrent design philosophies that work well

Taxonomize the “folklore” in software systems design

Identify fertile crossover areas to the “formal world”

Example: Software failures in the Therac-25

Motivation

The "Therac-25" is a classic case study in engineering failure -- like the Tacoma Narrows bridge, the Challenger disaster, etc.

Illustrates many problems and issues of software safety.

Shows how not to do it.

Related to assignment.

The Machine

The Therac-25 is a linear accelerator used for radiation therapy (e.g. cancer treatment).

Safety issues:
- overdose: Patient is injured or dies from radiation burns.
- underdose: Serious disease is not treated properly, patient may be injured or die because of this.

Therac-25 much more dependent on software for safety than its predecessors (Therac-20, Therac-6)
- "Hardware interlocks" replaced by software.

Technical details

Dual-mode machine: electrons or X-rays (photons).

X-rays generated when electron beam collides with target.

- This is inefficient, so electron beam must be very powerful.

Different modes require turntable to be properly positioned with targets, spreaders, etc. between beam and patient.

Accidents

Machine reliably treated thousands of patients, but occasionally weird things would happen.

There were at least 6 accidents.

Kennestone 1985:
- Patient treated for breast cancer is unexpectedly burned.
- Est. 15K-20K rad dose (a 500 rad whole-body dose is fatal about 50% of the time).
- Patient lost breast, shoulder and arm paralyzed.
- Patient sued, settled out of court.
- FDA not informed until much later.

Another accident

Tyler 1986:
- Patient to be treated with electron beam.
- Operator said to treat with X-ray, then corrected.
- Patient felt "electric shock".
- Operator saw "malfunction 54" and under-dose reading, so said "proceed" to zap patient again.
- Patient overdosed a second time (in arm) as he was trying to escape.
- Patient died horribly of radiation overdose 5 months later.

Software issues

No locks on shared variables (race conditions).

Control flow bug: some newly entered data can be ignored.

Timing sensitivity in user interface.

Wrap-around on counters.
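The wrap-around item can be illustrated with a short sketch. It assumes a fixed-width counter that is incremented to mean "a safety check is still needed" and tested against zero. The 8-bit width, the function names, and the loop are assumptions made for illustration, not a reconstruction of the actual Therac-25 code.

```python
from typing import List

COUNTER_BITS = 8  # assumed width; hypothetical, chosen for illustration


def next_flag(flag: int) -> int:
    """Increment a fixed-width counter that is (mis)used as a boolean flag."""
    return (flag + 1) % (1 << COUNTER_BITS)


def passes_that_skip_check(num_passes: int) -> List[int]:
    """Return the pass numbers on which a 'flag != 0' check would be skipped."""
    flag = 0
    skipped = []
    for n in range(1, num_passes + 1):
        flag = next_flag(flag)   # intended meaning: "check still needed"
        if flag == 0:            # wrap-around makes the flag read as "clear"
            skipped.append(n)
    return skipped


def set_flag() -> int:
    """Safer alternative: assign a constant instead of incrementing."""
    return 1
```

With an 8-bit counter the check is silently skipped on every 256th pass, which is hard to catch by testing. Assigning a constant, as in `set_flag`, removes the wrap-around entirely.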

User interface issues

“Malfunction 54” (patient might have received overdose or under-dose).

No indication about patient safety with error messages.

“Proceed” button continues after error message

- one patient overdosed twice.

System issues

Inadequate mechanical checks on turntable

- 3 microswitches for position sensing.

- 1-bit error in encoding makes position inaccurate.

- potentiometer installed later to sense position.

No independent hardware to suppress beam.

Dosage measurement devices (ion chambers) report inaccurate results for very high doses.

Therac-20 had same bugs, but no accidents because of independent protective systems.
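The remark about a 1-bit error in the position encoding can be sketched as follows. The position names and the 3-bit codes below are hypothetical (the slides do not give the real encoding). The point is only that if two valid codes differ in a single bit, one stuck or flipped switch turns one valid position into another, and the error cannot be detected.

```python
from itertools import combinations

# Hypothetical 3-bit microswitch codes for three turntable positions.
FRAGILE_CODES = {"electron": 0b001, "xray": 0b011, "field_light": 0b111}

# Alternative hypothetical codes with pairwise Hamming distance 2.
SAFER_CODES = {"electron": 0b001, "xray": 0b010, "field_light": 0b100}


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def single_bit_confusions(codes):
    """Pairs of positions whose codes differ by exactly one bit."""
    return sorted(
        (p, q)
        for (p, a), (q, b) in combinations(codes.items(), 2)
        if hamming(a, b) == 1
    )


def decode(codes, reading: int):
    """Return the position for a reading, or None if the code is invalid."""
    for name, code in codes.items():
        if code == reading:
            return name
    return None
```

With `FRAGILE_CODES`, flipping one bit of the "electron" reading decodes as "xray". With `SAFER_CODES`, any single-bit error produces a reading that `decode` rejects, so the fault is at least detectable.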

Management issues

Software complacency

- software errors not modelled in fault trees.

- users told “no possibility of overdose”.

Absurdly low probabilities assigned to SW failure.

Guesswork in analyzing observed failures

- blamed microswitches on turntable.

- no actual failures found in microswitches.

- problem was probably software.

Inadequate software processes

- unclear safety analyses.

- no audit trails.

- inadequate testing.
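A small fault-tree calculation shows why the probability assigned to software failure matters. The sketch assumes independent basic events and uses made-up probabilities; none of the numbers come from the actual Therac-25 analysis.

```python
from math import prod


def or_gate(probs):
    """Probability that at least one of several independent events occurs."""
    return 1.0 - prod(1.0 - p for p in probs)


def and_gate(probs):
    """Probability that all of several independent events occur."""
    return prod(probs)


# Hypothetical basic events leading to "wrong beam configuration".
P_HARDWARE_FAULT = 1e-6
P_SOFTWARE_OPTIMISTIC = 1e-11   # an "absurdly low" assigned value
P_SOFTWARE_REALISTIC = 1e-3     # an assumed, more cautious value

top_optimistic = or_gate([P_HARDWARE_FAULT, P_SOFTWARE_OPTIMISTIC])
top_realistic = or_gate([P_HARDWARE_FAULT, P_SOFTWARE_REALISTIC])

# With an independent hardware interlock (assumed to fail 1 time in 10,000),
# an overdose needs both the wrong configuration AND the interlock failure.
P_INTERLOCK_FAILS = 1e-4
overdose_with_interlock = and_gate([top_realistic, P_INTERLOCK_FAILS])
```

Under the optimistic figure the software term vanishes from the top-level result, so the analysis appears to be dominated by hardware. Under the cautious figure the software term dominates by about three orders of magnitude, and the AND gate shows how much an independent interlock would buy back.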

Regulatory and legal issues

FDA, Canadian regulators not heavily involved

- no software regulation in med. devices (at that time).

- not notified of incidents (no requirement to do so).

- inadequate investigation of early incidents.

When FDA got involved, the machine got fixed.

(speculation) Out-of-court settlements impeded dissemination of information about hazards.

A more Armando-like example?