Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay....

Critical Systems Development

IS301 – Software Engineering

Lecture #27 – 2004-11-03

M. E. Kabay, PhD, CISSP

Assoc. Prof. Information Assurance

Division of Business & Management, Norwich University

mailto:mkabay@norwich.edu V: 802.479.7937

First, take a deep breath. You are about to enter the fire-hose zone.

Objectives

To explain how fault tolerance and fault avoidance contribute to the development of dependable systems

To describe characteristics of dependable software processes

To introduce programming techniques for fault avoidance

To describe fault tolerance mechanisms and their use of diversity and redundancy

Topics covered

Dependable processes

Dependable programming

Fault tolerance

Fault-tolerant architectures

Dependable Software Development

Programming techniques for building dependable software systems.

Software Dependability

In general, software customers expect all software to be dependable

For non-critical applications, may be willing to accept some system failures

Some applications have very high dependability requirements

Special programming techniques required

Dependability Achievement

Fault avoidance

Software developed so that human error is avoided and system faults are minimized

Development process organized so that faults in software are detected and repaired before delivery to customer

Fault tolerance

Software designed so that faults in delivered software do not result in system failure

Diversity and Redundancy

Redundancy

Keep more than one version of a critical component available so that if one fails, a backup is available.

Diversity

Provide the same functionality in different ways so that they will not fail in the same way.

However, adding diversity and redundancy adds complexity, and this can increase the chances of error.

Some engineers advocate simplicity and extensive verification & validation (V&V) as a more effective route to software dependability.

Diversity and Redundancy Examples

Redundancy. Where availability is critical (e.g., in e-commerce systems), companies normally keep backup servers and switch to these automatically if failure occurs.

Diversity. To provide resilience against external attacks, different servers may be implemented using different operating systems (e.g., Windows and Linux).

Fault Minimization

Current methods of software engineering now allow for production of fault-free software

Fault-free software means it conforms to its specification

Does NOT mean software which will always perform correctly

Why not?

Because of specification errors.

Cost of Producing Fault-Free Software (1)

Very high

Cost-effective only in exceptional situations

Which?

May be cheaper to accept software faults

But who will bear costs?

Users? Manufacturers? Both?

Will the risk-sharing be with full knowledge?

Cost of Producing Fault-Free Software (2)

The Pareto Principle

If curve really is asymptotic to 100%, cost

approach

Cost of Producing Fault-Free Software (3)

[Figure: axis "Number of residual errors" with categories Many, Few, Very few]

Just a different way of looking at it.

Validation activities

Requirements inspections.

Requirements management.

Model checking.

Design and code inspection.

Static analysis.

Test planning and management.

Configuration management, discussed in Chapter 29, is also essential.

Safe Programming

Faults in programs are usually a consequence of programmers making mistakes.

These mistakes occur because people lose track of the relationships among program variables.

Some programming constructs are more error-prone than others so avoiding their use reduces programmer mistakes.

Fault-Free Software Development

Needs precise (preferably formal) specification

Requires organizational commitment to quality

Information hiding and encapsulation in software design essential

Use programming language with strict typing and run-time checking

Avoid error-prone constructs

Use dependable and repeatable development process

Structured Programming

First discussed in 1970s

Programming without goto

While loops and if statements as only control statements

Top-down design

Important because it promoted thought and discussion about programming

Programs easier to read and understand than old spaghetti code

Error-Prone Constructs (1)

Floating-point numbers

Inherently imprecise – and machine-dependent

Imprecision may lead to invalid comparisons

Pointers

Pointers referring to wrong memory areas can corrupt data

Aliasing can make programs difficult to understand and change

Dynamic memory allocation

Run-time allocation can cause memory overflow
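
The floating-point point can be illustrated with a minimal Python sketch. The helper name nearly_equal and the tolerance values are illustrative assumptions, not part of the lecture; suitable tolerances depend on the application.

```python
import math

def nearly_equal(a: float, b: float,
                 rel_tol: float = 1e-9, abs_tol: float = 1e-12) -> bool:
    """Compare two floats within a tolerance instead of using ==."""
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)

total = 0.1 + 0.2

# Exact comparison fails because 0.1 and 0.2 have no exact binary representation.
print(total == 0.3)              # False
# Tolerance-based comparison gives the intended answer.
print(nearly_equal(total, 0.3))  # True
```

The design choice here is to make the tolerance explicit, so that a reviewer can see and question it during code inspection.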

Error-Prone Constructs (2)

Parallelism

Can result in subtle timing errors (race conditions) because of unforeseen interaction between parallel processes

Recursion

Errors in recursion can cause memory overflow

Interrupts

Interrupts can cause critical operation to be terminated and make program difficult to understand

Similar to goto statements
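
One common way to use parallelism "with great care" is to protect shared state with a lock. The sketch below is an assumption-laden illustration (the names SafeCounter and run_workers are hypothetical): several threads update one counter, and the lock makes each read-modify-write step indivisible so that no updates are lost.

```python
import threading

class SafeCounter:
    """Counter whose read-modify-write update is protected by a lock."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        # Only one thread at a time may execute the update.
        with self._lock:
            self._value += 1

    @property
    def value(self) -> int:
        with self._lock:
            return self._value

def run_workers(counter: SafeCounter, workers: int = 4,
                increments: int = 10_000) -> int:
    """Start several threads that all increment the same counter."""
    def work() -> None:
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value

print(run_workers(SafeCounter()))
```

Without the lock, the final count could depend on thread timing, which is exactly the kind of subtle error the slide warns about.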

Error-Prone Constructs (3)

Inheritance

Code not localized

Can result in unexpected behavior when changes made

Can be hard to understand

Difficult to debug problems

All of these constructs don’t have to be absolutely eliminated

But must be used with great care

Reliable Software Processes

Well-defined, repeatable software process:

Reduces software faults

Does not depend entirely on individual skills – can be enacted by different people

Process activities should include significant verification and validation

Process Validation Activities

Requirements inspections

Requirements management

Model checking

Design and code inspection

Static analysis

Test planning and management

Configuration management also essential

Fault Tolerance

Critical software systems must be fault tolerant

System can continue operating in spite of software failure

Fault tolerance required where there are high availability requirements or system failure costs are very high

Even “fault-free” systems need fault tolerance

May be specification errors, or validation may be incorrect

Fault Tolerance Actions

Fault detection

Detect that incorrect system state has occurred

Damage assessment

Identify parts of system state affected by fault

Fault recovery

Return to known safe state

Fault repair

Prevent recurrence of fault

Identify underlying problem

If not transient*, then fix errors of design, implementation, documentation or training that led to error

* E.g., hardware failure

Approaches to Fault Tolerance

Defensive programming

Programmers assume faults in code

Check state after modifications to ensure consistency

Fault-tolerant architectures

HW & SW system architectures support redundancy and fault tolerance

Controller detects problems and supports fault recovery

Complementary rather than opposing techniques
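
Defensive programming, as described above, can be sketched as a state check that runs after every modification. The Inventory class and its invariant are hypothetical examples chosen for illustration, not taken from the lecture.

```python
class InvariantError(Exception):
    """Raised when the state check after a modification fails."""

class Inventory:
    """Hypothetical stock record.

    Invariant: on_hand and reserved are non-negative and sum to total.
    """

    def __init__(self, total: int) -> None:
        self.total = total
        self.on_hand = total
        self.reserved = 0
        self._check()

    def _check(self) -> None:
        # The programmer assumes the code may be faulty and verifies the state.
        consistent = (
            self.on_hand >= 0
            and self.reserved >= 0
            and self.on_hand + self.reserved == self.total
        )
        if not consistent:
            raise InvariantError(
                f"inconsistent state: on_hand={self.on_hand}, "
                f"reserved={self.reserved}, total={self.total}"
            )

    def reserve(self, quantity: int) -> None:
        self.on_hand -= quantity
        self.reserved += quantity
        self._check()  # check state after the modification
```

This sketch only detects the inconsistency; returning the object to a safe state is a separate recovery step.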

Fault Detection (1)

Strictly-typed languages

E.g., Java and Ada

Many errors trapped at compile-time

Some classes of error can only be discovered at run-time

Fault detection:

Detecting erroneous system state

Throwing exception to manage detected fault
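
A run-time fault that no compiler can trap is an empty input to an averaging routine. The sketch below, with hypothetical names (EmptySampleError, mean, safe_mean), shows the pattern on the slide: detect the erroneous state, throw an exception, and manage the fault in a handler.

```python
from typing import Sequence

class EmptySampleError(Exception):
    """Raised when there is nothing to average."""

def mean(samples: Sequence[float]) -> float:
    # Detect the erroneous state before it turns into a division by zero.
    if not samples:
        raise EmptySampleError("no samples to average")
    return sum(samples) / len(samples)

def safe_mean(samples: Sequence[float], default: float) -> float:
    """Manage the detected fault by substituting a caller-supplied default."""
    try:
        return mean(samples)
    except EmptySampleError:
        return default

print(safe_mean([1.0, 2.0, 3.0], default=0.0))
print(safe_mean([], default=-1.0))
```

Whether substituting a default is acceptable depends on the application; in a safety-critical system the handler might instead move the system to a safe state.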

Fault Detection (2)

Preventative fault detection

Check conditions before making changes

If bad state detected, don’t make change

Retrospective fault detection

Check validity after system state has been changed

Used when incorrect sequence of correct actions leads to erroneous state, or when preventative fault detection involves too much overhead
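
The two styles of fault detection can be contrasted in a short sketch. The Tank class and both fill functions are hypothetical, chosen only to show where the check sits relative to the state change.

```python
class Tank:
    """Hypothetical tank with a fixed capacity."""

    def __init__(self, capacity: float) -> None:
        self.capacity = capacity
        self.level = 0.0

def fill_preventative(tank: Tank, amount: float) -> bool:
    """Check conditions first; if the result would be bad, make no change."""
    if amount < 0 or tank.level + amount > tank.capacity:
        return False
    tank.level += amount
    return True

def fill_retrospective(tank: Tank, amount: float) -> bool:
    """Change the state, then check validity and undo the change if invalid."""
    previous = tank.level
    tank.level += amount
    if not (0.0 <= tank.level <= tank.capacity):
        tank.level = previous
        return False
    return True
```

In this small case both give the same result. The retrospective form matters when validity can only be judged from the resulting state, or when checking every precondition would cost too much.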

Damage Assessment

Analyze system state

Judge extent of corruption caused by system failure

Assess what parts of state space have been affected by failure

Generally based on ‘validity functions’

Can be applied to state elements

Assess if their value within allowed range
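
A validity function can be as simple as a range check applied to each state element. In the sketch below, the names in_range and assess_damage and the example state elements are assumptions made for illustration.

```python
from typing import Callable, Dict, List

Validity = Callable[[float], bool]

def in_range(low: float, high: float) -> Validity:
    """Build a validity function that accepts values in [low, high]."""
    return lambda value: low <= value <= high

def assess_damage(state: Dict[str, float],
                  checks: Dict[str, Validity]) -> List[str]:
    """Return the names of state elements that are missing or invalid."""
    return sorted(
        name
        for name, is_valid in checks.items()
        if name not in state or not is_valid(state[name])
    )

checks = {"speed": in_range(0, 120), "fuel": in_range(0, 1)}
print(assess_damage({"speed": 500, "fuel": 0.5}, checks))
```

The returned list identifies which parts of the state were affected, which is the input that the recovery step needs.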

Damage Assessment Techniques

Checksums

Used for damage assessment in data transmission

Verify integrity after transmission

Redundant pointers

Check integrity of data structures

E.g., databases

Watch-dog timers

Check for non-terminating processes

If no response after certain time, there’s a problem
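
The checksum technique can be sketched with the CRC-32 function from Python's standard zlib module. The framing format (payload followed by a 4-byte big-endian checksum) and the function names frame and is_intact are assumptions for this example, not a real protocol.

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def is_intact(message: bytes) -> bool:
    """Recompute the checksum after transmission and compare."""
    if len(message) < 4:
        return False
    payload, received = message[:-4], message[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received

sent = frame(b"dose=4")
# Simulate corruption in transit by flipping one bit of the first byte.
corrupted = bytes([sent[0] ^ 0x01]) + sent[1:]

print(is_intact(sent))       # True
print(is_intact(corrupted))  # False
```

A checksum of this kind only detects damage; repairing it needs the extra redundancy discussed under forward recovery.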

Fault Recovery

Forward recovery

Apply repairs to corrupted system state

Domain knowledge required to compute possible state corrections

Forward recovery usually application specific

Backward recovery

Restore system state to known safe state

Simpler than forward recovery

Details of safe state maintained and replaces corrupted system state

Forward Recovery

Data communications

Add redundancy to coded data

Use to repair data corrupted during transmission

Redundant pointers

E.g., doubly-linked lists

Damaged list / file may be repaired if enough links are still valid

Often used for database and file system repair
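
The redundant-pointer idea can be sketched for a doubly-linked list. This example assumes the forward (next) links are still valid and acyclic, and uses them to rebuild damaged backward (prev) links; the names Node, build and repair_prev_links are hypothetical.

```python
from typing import Iterable, Optional

class Node:
    def __init__(self, value: int) -> None:
        self.value = value
        self.next: Optional["Node"] = None
        self.prev: Optional["Node"] = None

def build(values: Iterable[int]) -> Optional[Node]:
    """Build a doubly-linked list and return its head."""
    head: Optional[Node] = None
    tail: Optional[Node] = None
    for value in values:
        node = Node(value)
        if tail is None:
            head = node
        else:
            tail.next = node
            node.prev = tail
        tail = node
    return head

def repair_prev_links(head: Optional[Node]) -> int:
    """Rebuild prev links from the next chain; return how many were repaired.

    Assumption: the next links are intact and contain no cycle.
    """
    repaired = 0
    previous: Optional[Node] = None
    node = head
    while node is not None:
        if node.prev is not previous:
            node.prev = previous  # forward recovery: correct the state in place
            repaired += 1
        previous = node
        node = node.next
    return repaired
```

The repair works only because each link is stored twice. If both directions were damaged at the same node, this sketch could not recover it.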

Backward Recovery

Transaction processing often uses conservative methods to avoid problems

Complete computations, then apply changes

Keep original data in buffers

Periodic checkpoints allow system to 'roll back' to correct state
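
The three points above can be sketched together in one small class. CheckpointedStore and its methods are hypothetical names; a real transaction system would also need durable storage and concurrency control, which this sketch leaves out.

```python
import copy
from typing import Callable, Dict

class CheckpointedStore:
    """In-memory store with commit-after-compute and rollback to a checkpoint."""

    def __init__(self, data: Dict[str, int]) -> None:
        self._data = dict(data)
        self._checkpoint = copy.deepcopy(self._data)

    def checkpoint(self) -> None:
        """Record the current state as the known safe state."""
        self._checkpoint = copy.deepcopy(self._data)

    def rollback(self) -> None:
        """Replace the current state with the last checkpoint."""
        self._data = copy.deepcopy(self._checkpoint)

    def apply(self, updates: Dict[str, int],
              is_valid: Callable[[Dict[str, int]], bool]) -> bool:
        """Complete the computation on a working copy, then apply the changes."""
        working = copy.deepcopy(self._data)  # original stays untouched
        working.update(updates)
        if not is_valid(working):
            return False
        self._data = working
        return True

    def snapshot(self) -> Dict[str, int]:
        return copy.deepcopy(self._data)
```

A short usage example: create the store, apply a valid update, reject an invalid one, then call rollback() to return to the state at the last checkpoint.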

Recovery Blocks (1)

[Figure: recovery blocks. Try algorithm 1, then apply the acceptance test to test for success. If the acceptance test fails, re-try with algorithm 2 and then algorithm 3, retesting after each. Continue execution if acceptance test succeeds; signal exception if all algorithms fail.]

Recovery Blocks (2)

Force different algorithm to be used for each version so they reduce probability of common errors

However, design of acceptance test difficult as it must be independent of computation used

Problems with approach for real-time systems because of sequential operation of redundant versions
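
A recovery block can be sketched as a loop over alternate algorithms guarded by one acceptance test. Everything named here (recovery_block, faulty_sort, insertion_sort, is_sorted_permutation) is a hypothetical illustration. The acceptance test checks properties of the result rather than repeating the computation, which is one way to keep it independent.

```python
from collections import Counter
from typing import Callable, List, Sequence

class AllAlternatesFailed(Exception):
    """Signalled when no alternate passes the acceptance test."""

def recovery_block(alternates: Sequence[Callable[[List[int]], List[int]]],
                   acceptance_test: Callable[[List[int], List[int]], bool],
                   data: List[int]) -> List[int]:
    for algorithm in alternates:
        try:
            # Each alternate starts from a fresh copy of the original input.
            result = algorithm(list(data))
        except Exception:
            continue  # a crash counts as a failed alternate
        if acceptance_test(data, result):
            return result
    raise AllAlternatesFailed("no alternate passed the acceptance test")

def faulty_sort(items: List[int]) -> List[int]:
    return list(items)  # deliberately wrong: returns the input unsorted

def insertion_sort(items: List[int]) -> List[int]:
    out: List[int] = []
    for x in items:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out

def is_sorted_permutation(original: List[int], result: List[int]) -> bool:
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and Counter(original) == Counter(result)

print(recovery_block([faulty_sort, insertion_sort],
                     is_sorted_permutation, [3, 1, 2]))
```

The alternates run one after another, so the worst-case time is the sum of all of them. That is the real-time problem noted on the slide.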

Homework

Study Chapter 18 in detail using SQ3R

Required

By next Wed 10 Nov 2004

For 30 points

20.1, 20.2, 20.4 – 20.6, 20.9 (@5) and pay attention to demands for examples

OPTIONAL

By Wed 17 Nov 2004

For up to 14 extra points, any or all of

20.10 (@3), 20.11 (@3) – details please

20.12 (@8) – detailed answers to all parts of this question

DISCUSSION
