Post on 22-Dec-2015
1
Software in Practice
A series of four lectures on why software projects fail, and what you can do about it
Martyn Thomas
Founder: Praxis High Integrity Systems Ltd
Visiting Professor of Software Engineering, Oxford University Computing Laboratory
2/25
Lecture 2: Software Failures
Developing software is very difficult.
It is easy to make mistakes ... and they are unlikely to be found by testing.
Errors can be introduced in every phase of software development: requirements capture, specification, design, programming, building, error correction, modification, re-use ...
3/25
Finding faults by testing?

type Alert is (Warning, Caution, Advisory);

function RingBell(Event : Alert) return Boolean
   -- return True for Event = Warning or Event = Caution,
   -- return False for Event = Advisory
is
   Result : Boolean;
begin
   if Event = Warning then
      Result := True;
   elsif Event = Advisory then
      Result := False;
   end if;
   return Result;
end RingBell;

-- C130J code: Caution returns uninitialised (usually TRUE, as required).
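One way to let the compiler find the RingBell fault before any test is run is to write the function as a case statement, which Ada requires to cover every value of the type. The sketch below is illustrative only: it is not the C130J code or its actual fix, and the wrapper procedure name Ring_Bell_Demo is an assumption made for the example.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Ring_Bell_Demo is

   type Alert is (Warning, Caution, Advisory);

   --  Return True for Warning or Caution, False for Advisory.
   --  A case statement must cover every value of Alert, so leaving
   --  out Caution is rejected at compile time instead of becoming a
   --  run-time read of an uninitialised variable.
   function RingBell (Event : Alert) return Boolean is
   begin
      case Event is
         when Warning | Caution => return True;
         when Advisory          => return False;
      end case;
   end RingBell;

begin
   --  The input space has only three values, so print all of them.
   for E in Alert loop
      Put_Line (Alert'Image (E) & " -> " & Boolean'Image (RingBell (E)));
   end loop;
end Ring_Bell_Demo;
```

Removing Caution from the first alternative should make the compiler reject the unit, whereas the original if/elsif version compiles and usually passes its tests.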
4/25
Taurus
Taurus was a £50m system to provide electronic share trading for the London Stock Exchange in 1991, removing paper share certificates. (This would revolutionise the job of share registrars).
It overran, and a recovery strategy was put in place. It reached 85% complete, and a date for cut-over was announced later the same year. A few weeks later, the project was cancelled.
City firms had wasted £350m on new systems to interface to Taurus.
5/25
Taurus: a requirements problem
The system was over-complicated and had failed to reconcile conflicting requirements, especially those from the share registrars.
6/25
This lesson has not been learnt ...
No public-sector civil project has ever been put out to tender with a formal specification.
For example, eFDP took two years to agree a set of requirements. The remaining difficulties were put in the requirements as six-month “design studies”. Four weeks after the RfP, the project was abandoned.
7/25
Nancy Leveson’s Torpedo: gaps in the specification
How to stop a torpedo blowing up the launch ship?
If it malfunctions or starts to come back:
  sink it
  blow it up
On live test, a torpedo failed whilst still in the torpedo tube … …
9/25
LAS: The Manual System
LAS (London Ambulance Service) covers 600 square miles, carries >5000 patients each day, and handles 2000-2500 calls daily, including 1300-1600 emergency calls. It has 750 ambulances.
Emergency call written on a form. Location looked up on a map. Form and map co-ordinates placed on a conveyor belt to central dispatch, who remove duplicates and route to a zone to contact an ambulance.
This took ~3 minutes and 200 staff. Decision to implement Computer-Aided Dispatch.
10/25
LAS: Computer Aided Dispatch (CAD) version 1
1980s. £7.5 million spent. System built but failed its load test and was abandoned. LAS sued the Supplier, who had not understood the requirement properly.
1990: Requirements started for Version 2. New CAD to be “fully automated”.
Automatic lookup of location; automatic selection of the best ambulance.
No similar system in existence.
11/25
LAS: CAD Version 2
New system much more complex than Version 1: CAD + Map Display + Automatic Vehicle Location Service (AVLS).
Andersen Consulting had estimated that a package solution without AVLS, if one existed, would cost £1.5m and take 19 months to implement.
This seems to have become the project budget for a custom system.
12/25
LAS: Version 2 bids
35 companies looked at the tender and 19 bid; most said it needed more time and money than the budget allowed.
The only bidder who promised to meet all the requirements on time and within budget was a consortium of Apricot (hardware), Systems Options (SO - a small software house) and Datatrak (AVLS).
SO bid only £35K to develop the CAD software! Total bid: £937,463.
The next lowest bid was £700K more!
13/25
LAS: Version 2 development
Phase 1 system (no radio messaging): client and server lock-ups.
Phase 2 system (with radio messaging): unstable, overloaded at shift change, radio blackspots, unable to cope with staff taking the “wrong” vehicle.
Managers decided to go live on 26 October 1992, ignoring independent review.
14/25
LAS: Result
26 October: control room reconfigured to use CAD. No manual backup system.
System progressively lost track of ambulances.
Screens filled with exception messages, which scrolled off and were lost.
System delayed incidents, waiting for ambulances, so the public called again, increasing the workload.
Several or zero ambulances sent to each incident.
Staff stress caused operator errors.
Network congestion, slowdown, system collapse.
Oct 27th: semi-manual operation, but system crashed through memory leak. System abandoned.
16/25
Therac 25
(not the system on the previous slide)
A system for treatment of tumours.
Mode 1: low energy electron beam treatment.
Mode 2: very high energy beam (25 MeV) with a thick metal plate in front, for X-rays.
Therac-20 had a mechanical switch to change beam, and an interlock to stop change to high energy without the plate.
Therac 25 interlock was in software.
17/25
Therac-25 User Interface
Set up treatment time.
Electron beam: type e.
X-ray beam: type x. System puts the plate in place before switching beam to X-rays.
System: “Beam Ready”. Operator types b to start treatment.
Operator station in a different room from the patient, to protect staff from radiation.
18/25
Therac: Accident
Ray Cox, oil worker, on the table for his regular e-beam treatment for a tumour on his shoulder.
Operator goes to the other room, types x, realises the mistake, and types “edit”, e, “enter”, all within 8 seconds.
System says “Malfunction”. Operator cleared the error, got “beam ready” and hit b. Same error message, so tried again. Twice.
Ray felt a painful jolt, not like previous treatments. He shouted in pain but no-one heard. The third time, he got off the table and went to find the nurse.
19/25
Therac 25: outcome
Ray Cox died of radiation overdose 4 months later.
Meanwhile another patient experienced the same accident, but this time a technician realised there was a problem and reported it.
The same problem had occurred in Georgia, Canada and Washington.
20/25
Therac: what went wrong?
The operator’s actions exposed a race condition in the (multi-tasking) code.
The result was a full-power beam without the plate in place. 125-fold overdose!
The particular sequence of actions had never occurred in testing.
Made worse because audio intercom and video link both out of service. System error messages not informative (and usually meant treatment had not occurred).
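A race condition of this general kind arises when related pieces of state (here, beam energy and plate position) can be changed or read separately by concurrent tasks. The Ada sketch below is a hypothetical model, not the Therac-25 design or code: the names Beam_Mode, Machine, Select_Mode and Fire are assumptions made for illustration. It keeps both pieces of state inside one protected object so that they change together, and checks them together before firing. A software check like this is not a substitute for an independent hardware interlock.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Interlock_Demo is

   type Beam_Mode is (Electron, X_Ray);

   --  Hypothetical model, not the Therac-25 design. Energy level and
   --  plate position live in one protected object, so they are always
   --  updated together and no caller can see one without the other.
   protected Machine is
      procedure Select_Mode (Mode : Beam_Mode);
      procedure Fire (Fired : out Boolean);
   private
      High_Energy : Boolean := False;
      Plate_In    : Boolean := False;
   end Machine;

   protected body Machine is

      procedure Select_Mode (Mode : Beam_Mode) is
      begin
         High_Energy := Mode = X_Ray;
         Plate_In    := Mode = X_Ray;
      end Select_Mode;

      procedure Fire (Fired : out Boolean) is
      begin
         --  Software interlock: refuse high energy without the plate.
         Fired := (not High_Energy) or else Plate_In;
      end Fire;

   end Machine;

   Fired : Boolean;

begin
   Machine.Select_Mode (X_Ray);     --  operator types x by mistake
   Machine.Select_Mode (Electron);  --  quick correction to e
   Machine.Fire (Fired);
   Put_Line ("Beam fired: " & Boolean'Image (Fired));
end Interlock_Demo;
```

Protected procedures run with exclusive access to the object, so a mode change cannot be interleaved with the check in Fire; that is the property the sketch is meant to show.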
21/25
Therac: Failings
Safety case claimed a probability of 10^-11 for “computer selects wrong energy”. No evidence for the claim.
No low-complexity protection system (fuse and/or interlock).
Poor software engineering.
Poor investigation of reported accidents.
Manufacturer did not consider a possible software fault until several accidents had occurred.
23/25
Ariane V: Explosion
Initial launch exploded.
Failure traced to the inertial navigation system (INS).
Overflow on conversion from 64-bit floating point to 16-bit integer; exception not trapped.
Primary and back-up INS both failed for the same reason, and stopped.
Loss of INS led to auto-destruction.
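The conversion step can be sketched in Ada as below. This is a minimal illustration, not the Ariane flight code: the type Int16, the helpers To_Int16_Saturating and Convert_Unprotected, and the sample values are all assumptions, and Long_Float is assumed to be a 64-bit type, as it is on common compilers. It contrasts an unprotected conversion, which raises Constraint_Error when run-time checks are enabled, with one possible protection (clamping to the 16-bit range).

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Conversion_Demo is

   type Int16 is range -2**15 .. 2**15 - 1;

   --  Hypothetical protected conversion: a finite out-of-range input
   --  is clamped to the nearest bound instead of raising an exception.
   function To_Int16_Saturating (X : Long_Float) return Int16 is
   begin
      if X >= Long_Float (Int16'Last) then
         return Int16'Last;
      elsif X <= Long_Float (Int16'First) then
         return Int16'First;
      else
         return Int16 (X);  --  rounds to the nearest integer
      end if;
   end To_Int16_Saturating;

   --  Hypothetical unprotected conversion: the range check on
   --  Int16 (X) fails for large X and raises Constraint_Error.
   --  Here the exception is handled locally; in the failure
   --  described above it was not trapped.
   procedure Convert_Unprotected (X : Long_Float) is
      Result : Int16;
   begin
      Result := Int16 (X);
      Put_Line ("converted:" & Int16'Image (Result));
   exception
      when Constraint_Error =>
         Put_Line ("Constraint_Error: value does not fit in 16 bits");
   end Convert_Unprotected;

begin
   Put_Line ("saturated:" & Int16'Image (To_Int16_Saturating (1_234.4)));
   Put_Line ("saturated:" & Int16'Image (To_Int16_Saturating (40_000.0)));
   Convert_Unprotected (1_234.4);
   Convert_Unprotected (40_000.0);
end Conversion_Demo;
```

Whether clamping, an exception handler, or a wider type is the right protection depends on the system requirements; the point is only that the out-of-range case has to be decided explicitly.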
24/25
Ariane V: cause of failure
INS software re-used from Ariane IV.
Ariane IV flight profile guaranteed this parameter could not overflow.
Ariane V specification was different, in a way that affected the requirements for the INS.
A formal specification would have caught this fault.
25/25
Conclusions (1)
Software development is hard - all sorts of things go wrong.
It is an engineering task. You dare not do without discipline and rigour.
Even the best people make mistakes. That’s why we use reviews, checklists, type-checkers and other static analysis tools, testing, and proof.
26/25
Conclusions (2)
A safety-critical software team must have:
Good domain knowledge
Excellent systems engineering / software engineering knowledge, skills, processes
Good knowledge of safety assessment principles, standards, practice and law
… and finally ...