
Computational Methods for Finding Patterns of Human and System ‘Failure’ in Mishap Reports

Chris Johnson

University of Glasgow, Scotland. http://www.dcs.gla.ac.uk/~johnson

UCD: 12th December 2003

A: Detection and Notification

B: Data gathering

C: Reconstruction

D: Analysis

E: Recommendations and Monitoring

F: Reporting and Exchange

Johnson, Le Galo and Blaize; European Incident Reporting Requirements in Air Traffic Management, EUROCONTROL, 2000.

[Chart: staff rate incidents on 0–10 scales (bad to good) along nine risk-perception dimensions: Could the incident have been anticipated by risk managers? Could the incident have been anticipated by participants? How severe was the incident? How much is such an incident feared by staff? How confident are you in avoiding such incidents? How risky was the incident? How easy is it to control the outcome of such incidents? How visible was the incident? How much effort is necessary to avoid future incidents?]

• NASA safety managers complain that the web-based Program Compliance Assurance and Status System is too cumbersome.

• Personnel use the Lessons Learned Information System only on an ad hoc basis.

• Hazard reports are rarely communicated effectively, nor are the databases used by engineers and managers capable of translating operational experiences into effective risk management practices. (CAIB, p. 189)

• “Centers and contractors used the Problem Reporting and Corrective Action database differently, preventing comparisons across the database.”

• Probabilistic information retrieval (see the sketch after this list):
– avoids the problem of codification;
– but issues of precision and recall.

• Conversational case-based reasoning:
– extended form of US Navy’s NACODAE system;
– flexible precision and recall.

• Word sense disambiguation etc.
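As a rough illustration of the retrieval idea (not the actual tool described in the talk), the sketch below ranks free-text mishap reports against a query using TF-IDF term weighting; the class and method names are invented:

```java
import java.util.*;

// Minimal sketch: rank free-text mishap reports against a query using
// TF-IDF weighted term overlap. Illustrative only, not the talk's system.
public class ReportRetrieval {

    static Map<String, Integer> termFreq(String doc) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : doc.toLowerCase().split("\\W+"))
            if (!t.isEmpty()) tf.merge(t, 1, Integer::sum);
        return tf;
    }

    // Higher scores mean the report shares more rare terms with the query.
    static double score(String query, String report, List<String> corpus) {
        Map<String, Integer> qtf = termFreq(query);
        Map<String, Integer> rtf = termFreq(report);
        double s = 0.0;
        for (String term : qtf.keySet()) {
            long df = corpus.stream()
                            .filter(d -> termFreq(d).containsKey(term))
                            .count();
            if (df == 0) continue;
            double idf = Math.log((double) corpus.size() / df);
            s += qtf.get(term) * rtf.getOrDefault(term, 0) * idf;
        }
        return s;
    }

    public static void main(String[] args) {
        List<String> corpus = List.of(
            "ATC came through with a late clearance change",
            "Maintenance failure on the hydraulic pump",
            "Weather poor but visibility adequate on final approach");
        corpus.stream()
              .sorted(Comparator.comparingDouble(
                      (String r) -> -score("ATC clearance", r, corpus)))
              .forEach(System.out::println);
    }
}
```

Free text is matched without forcing reports into a fixed coding scheme, which is the benefit claimed above; the cost is exactly the precision/recall trade-off also noted.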

FAA GAIN lacks computational support.

Someone must address this opportunity…

Meta-Level Concerns for Aerospace

Linda, JavaSpaces and Middleware for Incident Reporting

Tuple space shared across UK, US and Australia nodes:

<A320, 12/12/2003, “ATC came through…”>
<B777, 1/12/2003, “On final approach…”>
<“Weather poor but …”>
<B737, “Maintenance failure on …”>
<A320, “No clearance…”>

Concurrency and distribution

Overloading of matching operators: the same tuple space queried with templates such as

<A320, ?, ?>
<?, ?, match(CRM)>

Linda, JavaSpaces and Middleware for Incident Reporting

Leases and persistence: entries written to the same distributed tuple space under leases, retrieved with templates such as

<A320, ?, ?>
<?, ?, match(CRM)>

A JavaSpaces sketch of these ideas follows below.
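A minimal JavaSpaces sketch, assuming a running Jini/JavaSpaces service; the IncidentReport entry type is invented for illustration, and the IR-style match(CRM) operator goes beyond stock JavaSpaces matching, which only tests non-null fields for exact equality:

```java
import net.jini.core.entry.Entry;
import net.jini.core.lease.Lease;
import net.jini.space.JavaSpace;

// Illustrative entry type: JavaSpaces entries need public fields and a
// public no-arg constructor; null fields act as wildcards in templates.
class IncidentReport implements Entry {
    public String aircraft;   // e.g. "A320"
    public String date;       // e.g. "12/12/2003"
    public String narrative;  // free-text report
    public IncidentReport() {}
    public IncidentReport(String a, String d, String n) {
        aircraft = a; date = d; narrative = n;
    }
}

class SpaceDemo {
    // The space would be obtained from a Jini lookup service (omitted).
    static void demo(JavaSpace space) throws Exception {
        // Persistence via leases: Lease.FOREVER keeps the entry until taken;
        // a finite lease (in ms) would let stale reports expire automatically.
        space.write(new IncidentReport("A320", "12/12/2003", "ATC came through..."),
                    null, Lease.FOREVER);

        // Template <A320, ?, ?>: null fields match any value, so this
        // reads some A320 report regardless of its date or narrative.
        IncidentReport tmpl = new IncidentReport("A320", null, null);
        IncidentReport hit =
            (IncidentReport) space.read(tmpl, null, JavaSpace.NO_WAIT);

        // <?, ?, match(CRM)> would need an overloaded matcher: stock
        // JavaSpaces matches fields by equality only, so IR-style matching
        // has to be layered on top (e.g. take candidates, score the text).
    }
}
```

Because writers and readers never address each other directly, national reporting systems can join or leave the space without disturbing the rest, which is the concurrency-and-distribution point of the earlier slide.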

Linda, JavaSpaces and Middleware for Incident Reporting

So does the software say something new and useful?

Look, I’m not blaming you, I’m just suing you…

• Medical errors lead to 45,000–100,000 deaths per year (US); compare road traffic accidents = 43,000, AIDS = 16,000.

• Additional care costs $15 billion:
– 45% have some mishap;
– 17% have a prolonged hospital stay.

Case Study 1: FDA Telemedicine

Courtesy: Univ. of Virginia, Office of Telemedicine

• SE Virginia medical centres: 1 nurse monitors the system; 49 remote patients; 5 ICUs at 3 centres.

• Staff account for 50–80% of the ICU budget.

Courtesy: NASA Telemedicine Instrumentation Pack project

MDR report sections:

A: MDR Report Identifier
B: Event Information
E: Professional Information
F: Distributor Information
G: Manufacturer Information
H: Device Information

Master Event Data File, Section A (MDR Report Identifier): MDR Report Key; MDR Event Key; Report Number; Source Code; Number of devices; Date received; Number of patients.

Master Event Data File, Section G (Manufacturer Information): MDR Report Key; Manufacturer’s Name; Manufacturer’s Address; Source Type; Date manufacturer received report.

Master Event Data File, Section H (Device Information): MDR Report Key; Made when?; Single use device?; Remedial Action; Use code; Correction number; Event type; Format Identifier.

Device Data File: MDR Report Key; Device Event Key; Device Seq. Number; Device available for examination?; Brand Name; Generic Name; Age? …

Patient Data File: MDR Report Key; Patient Seq. Number; Date report received; Sequence and treatment; Patient Outcome.

Text Data File: MDR Report Key; Text key; Text type; Patient Seq. number; Report date; Text.

An illustrative data model follows below.
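To make the joins concrete, here is a sketch of two of these files as Java records; the field names follow the slide, while the types and class names are assumptions:

```java
import java.util.List;

// Illustrative only: field names follow the slide, types are guesses.
record MasterEventA(String mdrReportKey, String mdrEventKey, String reportNumber,
                    String sourceCode, int numberOfDevices, String dateReceived,
                    int numberOfPatients) {}

record TextRecord(String mdrReportKey, String textKey, String textType,
                  int patientSeqNumber, String reportDate, String text) {}

class Maude {
    // Every file carries the MDR Report Key, so the narratives, devices
    // and patients for one event are recovered by joining on that key.
    static List<TextRecord> narrativesFor(MasterEventA report,
                                          List<TextRecord> texts) {
        return texts.stream()
                    .filter(t -> t.mdrReportKey().equals(report.mdrReportKey()))
                    .toList();
    }
}
```

It is these Text Data File narratives, pulled out by report key, that the retrieval and clustering below work over.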

Findings from MAUDE: Safety Culture and Telemedical Mishaps

• Introduction of telemedicine implies:
– fewer clinical staff, more technical staff;
– technical staff don’t understand devices/procedures?

• Increasing reliance on vendors’ guidance:
– vendors in turn rely on manufacturers;
– communication often breaks down or is too slow.

• No common ‘safety culture’:
– many incidents stem from poor communication;
– strong parallels with NASA (CAIB Chapter 7).

Cluster 1: Configuration

• EASI™ software derives 12-lead ECG data from 5 leads on the patient.

TECH NOTED EASI 12-LEAD DISPLAY ON CENTRAL STATION FROM TRANSMITTER THAT WASNT EASI CAPABLE.

CUSTOMER REPLACED TRANSMITTER, RELOADED CENTRAL STATION SOFTWARE, CONFIRMED ALL SIGNALS WERE CORRECTLY TRANSMITTED AND LABELED.

CUSTOMER DID NOT UNDERSTAND DIFFERENCE BETWEEN STANDARD ECG AND EASI.

CUSTOMER WAS RETRAINED TO FURTHER THEIR UNDERSTANDING OF DIFFERENCE. (MDR TEXT KEY: 1379795)

• Fewer electrodes reduce work for nurses and improve patient comfort.

• Social implications: clinicians and support staff rely on suppliers’ explanations.

• Symptomatic of system safety problems:
– manufacturers gain insights that should have been caught earlier in development.

• Retraining is proposed, but with no idea of the systemic causes of human ‘error’?

DURING INVESTIGATION, ENGINEERS CONFIGURED A SYSTEM IN SAME SETUP AS CUSTOMER. FOUND MAINFRAME RECEIVERS CAN RECEIVE INCORRECT BIT TO MISIDENTIFY TRANSMITTER AS EASI CAPABLE…

• Report doesn’t state how to prevent mis-configuration.

Cluster 1: Configuration

Cluster 2: Sub-contractors

• End-user frustrated by device unreliability and manufacturers’ response:

SEVERAL UNITS RETURNED FOR REPAIR HAD FAN UPGRADES TO ALLEVIATE TEMP PROBLEMS. HOWEVER, THEY FAILED IN USE AGAIN AND WERE RETURNED FOR REPAIR…

AGAIN SALESMAN STATED ITS NOT A THERMAL PROBLEM ITS A PROBLEM WITH X’s Circuit Board.

X ENGINEER STATED Device HAS ALWAYS BEEN HOT INSIDE, RUNNING AT 68°C AND THEIR product ONLY RATED AT 70°C…

ANOTHER TRANSPONDER STARTED TO BURN…SENT FOR REPAIR. SHORTLY AFTER MONITOR BEGAN RESETTING FOR NO REASON… (MDR TEXT KEY: 1370547)

• Manufacturers felt the reports were not safety-related:
– “reports relate to end-user frustration regarding product reliability (not safety)”.

• Telemedicine applications are developed by groups of suppliers:
– flexibility and cost savings during development, manufacture, marketing;
– problems if incidents stem from sub-components not manufactured by the suppliers;
– incident reports must be propagated back along the supply chain.

• Manufacturer states the problems stem from a subcontractor’s circuit board:
– more problems after the faulty board is replaced; customer returns the unit again;
– connectors to the PCB not properly seated but still pass the acceptance test?
– connector not seated completely during the initial repair, gradually loosening over time?

Cluster 2: Subcontractors

• “Fly-fix-fly” approach undermines attempts to improve patient safety.

• Confused dialogue between clinician, vendor, manufacturer…
– end-user may see technical issues as a form of excuse (e.g. PCB connectors)…

• Device repairs not only rectify problems, they introduce new ones:
– compounds end-user uncertainty and distrust of device reliability;
– communication fails and the shared safety culture erodes over time.

Cluster 2: Subcontractors

Cluster 3: Modification Induced Bugs

IN SOFTWARE RELEASE VF2, IF PATIENT IN "AUTOADMIT" MODE, PARAMETER DATA AUTOMATICALLY COLLECTED AND STORED IN THE SYSTEMS DATABASE,

IF THE PATIENT LATER REMOVED (BUT NOT DISCHARGED) FROM ORIGINAL BED/NETWORK LOCATION, DATA COLLECTION TEMPORARILY DEACTIVATED (EG DURING MOVE FOR TREATMENT).

PROBLEM OCCURS WHEN NEW PATIENT ADMITTED TO SAME BED/NETWORK LOCATION BUT ORIGINAL PATIENT NOT DISCHARGED WHILE CONNECTED TO THAT LOCATION.

NEW PATIENT ADMISSION STORES DATA IN DATABASE CORRECTLY. HOWEVER, IN PARALLEL, INCORRECTLY APPENDS NEW PATIENT DATA ON TOP OF OLD PATIENT'S RECORD…

(MDR TEXT KEY: 1340560)
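The narrative reads like a keying bug: monitoring data is filed under the bed/network location rather than under the patient, so an un-discharged patient’s record absorbs the newcomer’s data. A speculative reconstruction, with all names invented:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Speculative reconstruction of the MDR 1340560 failure mode; the real
// system's design is not described in the report.
class BedsideStore {
    // Records keyed by bed/network location, not by patient identity.
    private final Map<String, List<String>> recordByLocation = new HashMap<>();

    // BUG: auto-admit appends to whatever record occupies the location.
    // If the previous patient was moved but never discharged, the new
    // patient's samples land on top of the old patient's record.
    void autoAdmit(String location, String sample) {
        recordByLocation.computeIfAbsent(location, k -> new ArrayList<>())
                        .add(sample);
    }

    // A fix would key records by admission identity, or refuse auto-admit
    // while the location is still bound to an un-discharged patient.
}
```

Whatever the actual code looked like, the point of the cluster stands: the defect arrives with a software modification (release VF2) and only manifests under a particular sequence of admissions, which is hard to catch without pooled incident data.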

Safety Culture and Telemedical Mishaps

• The software identifies 40–50% more US telemedical mishaps over 6 months.

• Analysis of the reports suggests no ‘quick fixes’, but:
– regulators need to focus on dialogue between manufacturers and users;
– consider detailed training requirements for telemedicine before approval;
– especially look at end-user maintenance and configuration issues;
– introduce training in safety and risk management for support staff?

• Joint US/UK AHRQ presentation in Washington.
– Things are only going to get worse…

Da Vinci, the first robotic surgical aid approved by the FDA: New York Presbyterian Hospital uses it to repair atrial septal defects.

Case Study 2: Inter-Industry Comparisons

Cluster 1: Programming Errors

• Pilot didn’t check the First Officer’s programming of the FMC.

• “ATC informed us we were off course ... it took minutes to figure out what happened. ATC vectored us back onto the departure and gave us a climb clearance. ATC also pointed out traffic, but we never saw it. We aren’t sure if our error caused a conflict.

• First Officer programmed the FMC. I checked the Route Page to see if it matched our clearance. It showed the correct departure and transition. I did not check the Legs Pages to see if all fixes were there. I will next time!

• We made an error programming the FMC, then became complacent… I should have done a more complete check of the First Officer’s programming.”

• Computer flight plan was route ABC.

• ATC clearance was via route D-E-F.

• The original flight plan should have been destroyed, so as not to accidentally revert to the old route.

• “First Officer very experienced and I had complete trust that he was capable of loading the correct waypoints, but both he and I failed to use a visible method of marking the computer flight plan.

• 99% of the time, the cleared route is the same as the computer flight plan, but not always, as I found out the hard way. ATC caught my error.”

Cluster 1: Programming Errors

• Container ship grounds; it sailed the same route every week.

• 4 deck officers, good visibility, 2 radars and GPS.

• Charts had courses in black ink that couldn’t be erased.

• At 0243 altered course to 237°; position plotted.

• 45 minutes later, the ship grounds at full speed.

• The watch officer had set the auto-steering to the wrong course.

• 237 was written next to the reciprocal 157 for the return voyage.

Cluster 1: Programming Errors

• During the descent, we were doing some HF radio checks, and forgot to arm the altitude select mode on the flight director. As a result, we descended through our altitude....

• We promptly returned to FL280. As a crew, we are very diligent and disciplined about altitude assignments.

• But in this case, because our attention was diverted from the task at hand, we flew through our assigned altitude. It was that classic trap: both crew members distracted by something and nobody flying the airplane.

Cluster 2: Warnings as Safety Nets

• 3 crew on a fishing vessel; 2 cook, pump bilges, and maintain watch.

• Skipper asleep on the deck of the wheelhouse.

• Vessel’s planned track passed 0.35 miles from a rig.

• Automated radar alarm system set to 0.3 miles.

• VHF off; the skipper said there was too much distracting traffic.

• Rig asked the stand-by safety vessel for help; it came alongside the boat.

• Nobody appeared on the bridge or deck, even after horns were sounded.

• ‘Abandon platform stations’ was called as a precautionary measure.

• Skipper protested on being woken: everything was “under control”.

• The radar warning system is a safety net or final safeguard.

Cluster 2: Warnings as Safety Nets

Conclusions

• Must make better use of lessons-learned systems.

• Use tuple spaces and IR to search for key issues:
– distributed and persistent architectures for retrieval;
– avoids the need for standardised formats;
– can be used within and between industries.

• Caveats:
– does it tell us anything new?
– how valid are inter-industry comparisons?
– how do we get from clusters to recommendations?

Questions?