
Socio-technical systems failure (LSCITS EngD 2012)

Description: Discusses socio-technical issues in systems failure.
Page 1: Socio-technical systems  failure (LSCITS EngD 2012)

Systems failure – a socio-technical perspective

Page 2: Socio-technical systems  failure (LSCITS EngD 2012)

Complex software systems

• Multi-purpose. Organisational systems that support different functions within an organisation

• System of systems. Usually distributed and normally constructed by integrating existing systems/components/services

• Unlimited. Not subject to limitations derived from the laws of physics (so, no natural constraints on their size)

• Data intensive. System data orders of magnitude larger than code; long-lifetime data

• Dynamic. Changing quickly in response to changes in the business environment

Page 3: Socio-technical systems  failure (LSCITS EngD 2012)

Systems of systems

• Operational independence

• Managerial independence

• Multiple stakeholder viewpoints

• Evolutionary development

• Emergent behaviour

• Geographic distribution

Page 4: Socio-technical systems  failure (LSCITS EngD 2012)

Complex system realities

• There is no definitive specification of what the system should ‘do’ and it is practically impossible to create such a specification

• The complexity of the system is such that it is not ‘understandable’ as a whole

• It is likely that, at all times, some parts of the system will not be fully operational

• Actors responsible for different parts of the system are likely to have conflicting goals

Page 5: Socio-technical systems  failure (LSCITS EngD 2012)

System failure

Page 6: Socio-technical systems  failure (LSCITS EngD 2012)

System dependability model

• System fault: a system characteristic that can (but need not) lead to a system error

• System error: an erroneous system state that can (but need not) lead to a system failure

• System failure: externally-observed, unexpected and undesirable system behaviour
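
A minimal sketch of this fault → error → failure chain, in Python and entirely hypothetical (the BedRegister class is invented for illustration, not part of the original slides):

```python
# Hypothetical sketch: fault -> error -> failure in a bed-count report.

class BedRegister:
    def __init__(self, total_beds):
        self.total_beds = total_beds
        self.occupied = 0

    def admit(self, n):
        # FAULT: the update uses the wrong sign - a latent defect in the code.
        self.occupied -= n          # should be: self.occupied += n

    def empty_beds(self):
        # ERROR: once admit() has run, the internal state is wrong,
        # but nothing has been observed externally yet.
        return self.total_beds - self.occupied


register = BedRegister(total_beds=20)
register.admit(5)

# FAILURE: only when someone relies on the reported value does the erroneous
# state become externally-observed, unexpected and undesirable behaviour.
print(register.empty_beds())  # reports 25 empty beds in a 20-bed ward
```

The defect exists from the moment the code is written (fault), the wrong occupancy count exists as soon as admit() runs (error), but there is no failure until the misleading report is observed and acted upon.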

Page 7: Socio-technical systems  failure (LSCITS EngD 2012)

A hospital system

• A hospital system is designed to maintain information about available beds for incoming patients and to provide information about the number of beds to the admissions unit.

• It is assumed that the hospital has a number of empty beds and this changes over time. The variable B reflects the number of empty beds known to the system.

• Sometimes the system reports that the number of empty beds is the actual number available; sometimes the system reports that fewer than the actual number are available.

• In circumstances where the system reports that an incorrect number of beds are available, is this a failure?

Page 8: Socio-technical systems  failure (LSCITS EngD 2012)

What is failure?

• Technical, engineering view: a failure is ‘a deviation from a specification’.

• An oracle can examine a specification, observe a system’s behaviour and detect failures.

• Failure is an absolute - the system has either failed or it hasn’t

Page 9: Socio-technical systems  failure (LSCITS EngD 2012)

Bed management system

• The percentage of system users who considered the system’s incorrect reporting of the number of available beds to be a failure was 0%.

• Mostly, the number did not matter so long as it was greater than 1. What mattered was whether or not patients could be admitted to the hospital.

• When the hospital was very busy (available beds = 0), then people understood that it was practically impossible for the system to be accurate.

• They used other methods to find out whether or not a bed was available for an incoming patient.
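
A hedged sketch of the two views set out on the previous three slides, in Python; the function names and the 'at least one bed' simplification are illustrative, not taken from the original study:

```python
# Hypothetical sketch of the bed-reporting behaviour described above.
# reported_B is the number of empty beds known to the system; 'actual'
# is the real number of empty beds in the hospital at the same moment.

def spec_oracle(reported_B, actual):
    """Engineering view: any deviation from the actual count is a failure."""
    return reported_B != actual              # True means 'failure' by the specification

def admissions_view(reported_B, actual):
    """Users' view: what matters is whether a patient can be admitted."""
    return (reported_B > 0) == (actual > 0)  # True means the report was good enough

# The system under-reports: 3 beds reported, 5 actually empty.
reported_B, actual = 3, 5
print(spec_oracle(reported_B, actual))      # True  -> a failure by the specification
print(admissions_view(reported_B, actual))  # True  -> not a failure in the users' judgement
```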

Page 10: Socio-technical systems  failure (LSCITS EngD 2012)

Failure is a judgement

• Specifications are a gross simplification of reality for complex systems.

• Users don’t read and don’t care about specifications

• Whether or not system behaviour should be considered to be a failure depends on the observer’s judgement

• This judgement depends on:

– The observer’s expectations

– The observer’s knowledge and experience

– The observer’s role

– The observer’s context or situation

– The observer’s authority

Page 11: Socio-technical systems  failure (LSCITS EngD 2012)

Failures are inevitable

• Technical reasons

– When systems are composed of opaque and uncontrolled components, the behaviour of these components cannot be completely understood

– Failures often can be considered to be failures in data rather than failures in behaviour

• Socio-technical reasons

– Changing contexts of use mean that the judgement on what constitutes a failure changes as the effectiveness of the system in supporting work changes

– Different stakeholders will interpret the same behaviour in different ways because of different interpretations of ‘the problem’

Page 12: Socio-technical systems  failure (LSCITS EngD 2012)

Conflict inevitability

• Impossible to establish a set of requirements where stakeholder conflicts are all resolved

• Therefore, successful operation of a system for one set of stakeholders will inevitably mean ‘failure’ for another set of stakeholders

• Groups of stakeholders in organisations are often in perennial conflict (e.g. managers and clinicians in a hospital). The support delivered by a system depends on the power held at some time by a stakeholder group.

Page 13: Socio-technical systems  failure (LSCITS EngD 2012)

Normal failures

• ‘Failures’ are not just catastrophic events but normal, everyday system behaviour that disrupts normal work and that means that people have to spend more time on a task than necessary

• A system failure occurs when a direct or indirect user of a system has to carry out extra work, over and above that normally required to carry out some task, in response to some inappropriate or unexpected system behaviour

• This extra work constitutes the cost of recovery from system failure

Page 14: Socio-technical systems  failure (LSCITS EngD 2012)

The Swiss Cheese model

Page 15: Socio-technical systems  failure (LSCITS EngD 2012)

Failure trajectories

• Failures rarely have a single cause. Generally, they arise because several events occur simultaneously

– Loss of data in a critical system

• User mistypes command and instructs data to be deleted

• System does not check and ask for confirmation of destructive action

• No backup of data available

• A failure trajectory is a sequence of undesirable events that coincide in time, usually initiated by some human action. It represents a failure in the defensive layers in the system
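
A minimal, hypothetical sketch of the data-loss trajectory above (Python; the function and its arguments are invented for illustration): each defensive layer on its own is enough to stop the loss, so the failure only occurs when all of the layers have holes at the same time.

```python
# Hypothetical sketch: a failure trajectory through three defensive layers.
# Data is lost only if every layer has a 'hole' at the same moment.

def delete_data(command_is_mistyped, asks_confirmation, backup_exists):
    if not command_is_mistyped:
        return "no incident"          # layer 1: the user typed the intended command
    if asks_confirmation:
        return "recovered: user cancelled at the confirmation prompt"  # layer 2 holds
    if backup_exists:
        return "recovered: data restored from backup"                  # layer 3 holds
    return "FAILURE: data lost"       # all defensive layers failed together

# One hole open at a time -> no failure; all holes aligned -> failure.
print(delete_data(True, True,  False))   # recovered at the confirmation prompt
print(delete_data(True, False, True))    # recovered from backup
print(delete_data(True, False, False))   # FAILURE: data lost
```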

Page 16: Socio-technical systems  failure (LSCITS EngD 2012)

Vulnerabilities and defences

• Vulnerabilities

– Faults in the (socio-technical) system which, if triggered by a human or technical error, can lead to system failure

– e.g. missing check on input validity

• Defences

– System features that avoid, tolerate or recover from human error

– Type checking that disallows allocation of incorrect types of value

• When an adverse event happens, the key question is not ‘whose fault was it’ but ‘why did the system defences fail?’
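
As a hedged illustration of the ‘missing check on input validity’ example (hypothetical Python, not from the slides): the undefended version lets a human slip corrupt the system state, while the defended version tolerates the same slip.

```python
# Hypothetical sketch: an input-validity check as a defensive layer.

def set_bed_count_undefended(state, value):
    # Vulnerability: no validity check, so a mistyped input such as "-3"
    # is silently accepted and becomes an erroneous system state.
    state["empty_beds"] = int(value)

def set_bed_count_defended(state, value):
    # Defence: reject anything that cannot be a sensible bed count, so the
    # human error is caught at the boundary rather than propagated.
    if not value.isdigit():
        raise ValueError(f"'{value}' is not a valid bed count")
    state["empty_beds"] = int(value)

state = {"empty_beds": 10}
set_bed_count_undefended(state, "-3")     # accepted: the state now says -3 empty beds
try:
    set_bed_count_defended(state, "-3")   # rejected: the defence holds
except ValueError as err:
    print(err)
```

When the defence is missing, the question ‘why did the system defences fail?’ has a direct answer: there was nothing between the human error and the system state.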

Page 17: Socio-technical systems  failure (LSCITS EngD 2012)

Reason’s Swiss Cheese Model

Page 18: Socio-technical systems  failure (LSCITS EngD 2012)

Active failures

• Active failures

– Active failures are the unsafe acts committed by people who are in direct contact with the system, or failures in the system technology.

– Active failures have a direct and usually short-lived effect on the integrity of the defences.

• Latent conditions

– Fundamental vulnerabilities in one or more layers of the socio-technical system, such as system faults, system and process misfit, alarm overload, inadequate maintenance, etc.

– Latent conditions may lie dormant within the system for many years before they combine with active failures and local triggers to create an accident opportunity.

Page 19: Socio-technical systems  failure (LSCITS EngD 2012)

Defensive layers

• Complex IT systems should have many defensive layers:

– some are engineered - alarms, physical barriers, automatic shutdowns,

– others rely on people - surgeons, anaesthetists, pilots, control room operators,

– and others depend on procedures and administrative controls.

• In an ideal world, each defensive layer would be intact.

• In reality, they are more like slices of Swiss cheese, having many holes - although unlike in the cheese, these holes are continually opening, shutting, and shifting their location.

Page 20: Socio-technical systems  failure (LSCITS EngD 2012)

Dynamic vulnerabilities

• While some vulnerabilities are static (e.g. programming errors), others are dynamic and depend on the context where the system is used.

• For example:

– vulnerabilities may be related to human actions whose performance is dependent on workload, state of mind, etc. An operator may be distracted and forget to check something

– vulnerabilities may depend on configuration – checks may depend on particular programs being up and running, so if program A is running in a system then a check may be made, but if program B is running, the check is not made
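
A small hypothetical sketch of the configuration-dependent case (Python; ‘program A’ and ‘program B’ are just the placeholders used above): the check exists in the codebase, but whether it is actually on the execution path depends on which program the running configuration uses.

```python
# Hypothetical sketch: a dynamic vulnerability that depends on configuration.
# The check itself is correct; the hole opens only when program B is the one
# deployed, because B never calls the check.

def validity_check(record):
    return record.get("patient_id") is not None

def program_a(record):
    # Configuration 1: the check is on the path, so bad input is rejected.
    if not validity_check(record):
        return "rejected: invalid record"
    return "stored"

def program_b(record):
    # Configuration 2: the same check exists but is never invoked, so the
    # vulnerability is open whenever B is the running program.
    return "stored"

bad_record = {"patient_id": None}
print(program_a(bad_record))   # rejected: invalid record
print(program_b(bad_record))   # stored  <- vulnerability opened by the configuration
```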

Page 21: Socio-technical systems  failure (LSCITS EngD 2012)

Recovering from failure

Page 22: Socio-technical systems  failure (LSCITS EngD 2012)

Coping with failure

• People are good at coping with unexpected situations when things go wrong.

– They can take the initiative, adopt responsibilities and, where necessary, break the rules or step outside the normal process of doing things.

– People can prioritise and focus on the essence of a problem

Page 23: Socio-technical systems  failure (LSCITS EngD 2012)

Recovery strategies

• Local knowledge

– Who to call; who knows what; where things are

• Process reconfiguration

– Doing things in a different way from that defined in the ‘standard’ process

– Work-arounds, breaking the rules (safe violations)

• Redundancy and diversity

– Maintaining copies of information in different forms from that maintained in a software system

– Informal information annotation

– Using multiple communication channels

• Trust

– Relying on others to cope

Page 24: Socio-technical systems  failure (LSCITS EngD 2012)

Design for recovery

• Holistic systems engineering

– Software systems design has to be seen as part of a wider process of socio-technical systems engineering

• We cannot build ‘correct’ systems

– We must therefore design systems to allow the broader socio-technical systems to recognise, diagnose and recover from failures

• Extend current systems to support recovery

• Develop recovery support systems as an integral part of systems of systems

Page 25: Socio-technical systems  failure (LSCITS EngD 2012)

Recovery strategy

• Designing for recovery is a holistic approach to system design and not (just) the identification of ‘recovery requirements’

• Should support the natural ability of people and organisations to cope with problems

– Ensure that system design decisions do not increase the amount of recovery work required

– Make system design decisions that make it easier to recover from problems (i.e. reduce extra work required)

• Earlier recognition of problems

• Visibility to make hypotheses easier to formulate

• Flexibility to support recovery actions

Page 26: Socio-technical systems  failure (LSCITS EngD 2012)

Key points

• Failures are inevitable in complex systems because multiple stakeholders see these systems in different ways and because there is no single manager of these systems

• Failures are a judgement – they are not absolute – but depend on the system observer

• The Swiss cheese model is a failure model based on active failures (trigger events) and latent errors (system vulnerabilities).

• People have developed strategies for coping with failure and systems should not be designed to make coping more difficult.

