
Critical Systems Development

IS301 – Software Engineering

Lecture #19 – 2003-10-28

M. E. Kabay, PhD, CISSP

Dept of Computer Information Systems

Norwich University


Copyright © 2003 M. E. Kabay. All rights reserved.

First, take a deep breath. You are about to enter the fire-hose zone.


Acknowledgement

Most of the material in this presentation is based directly on slides kindly provided by Prof. Ian Sommerville on his Web site at http://www.software-engin.com

Used with Sommerville’s permission, as extended by him for all non-commercial educational use

Copyright in Kabay’s name applies solely to appearance and minor changes in Sommerville’s work or to original materials, and is used solely to prevent commercial exploitation of this material


Dependable Software Development

Programming techniques for building dependable software systems.


Software Dependability

In general, software customers expect all software to be dependable

For non-critical applications, customers may be willing to accept some system failures

Some applications have very high dependability requirements and need special programming techniques


Dependability Achievement

Fault avoidance

Software developed so that human error is avoided and system faults are minimised

Development process organised so that faults in the software are detected and repaired before delivery to the customer

Fault tolerance

Software designed so that faults in the delivered software do not result in system failure


Fault Minimisation

Current methods of software engineering now allow for the production of fault-free software

Fault-free software means software that conforms to its specification

It does NOT mean software that will always perform correctly

Why not?

Because of specification errors.


Cost of Producing Fault-Free Software (1)

Very high: cost-effective only in exceptional situations. Which?

May be cheaper to accept software faults. But who will bear the costs?

Users? Manufacturers? Both?

Will the risk-sharing be with full knowledge?


Cost of Producing Fault-Free Software (2)

The Pareto Principle

[Figure: total % of errors fixed plotted against costs; 20%, 80% and 100% marked]

If the curve really is asymptotic to 100%, the cost may approach infinity


Cost of Producing Fault-Free Software (3)

[Figure: cost per error detected (y-axis) against number of residual errors (x-axis: many, few, very few)]
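Why the cost may approach infinity, and why each error gets dearer to find, can be worked through with a simple model. The model is an assumption for illustration, not taken from the slides: suppose the number of residual errors R falls exponentially with the amount c spent on fault removal, where N is the initial number of errors and k > 0 is a hypothetical detection-efficiency constant.

```latex
% Assumed model: residual errors decay exponentially with cost c
R(c) = N e^{-kc}, \qquad f(c) = 1 - \frac{R(c)}{N} = 1 - e^{-kc}

% Cost needed to fix a fraction p of all errors
c(p) = -\frac{1}{k}\ln(1 - p) \;\longrightarrow\; \infty \quad \text{as } p \to 1

% Cost per additional error detected, since dR/dc = -kR
\frac{dc}{d(-R)} = \frac{1}{kR} \quad \text{(grows without bound as } R \to 0\text{)}

% Example with k = 1
c(0.80) = \ln 5 \approx 1.61, \quad c(0.99) = \ln 100 \approx 4.61, \quad c(0.9999) = \ln 10^{4} \approx 9.21
```

Under this model, each extra "nine" of errors fixed costs the same additional amount, so the total never reaches 100% for finite cost.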


Fault-Free Software Development

Needs precise (preferably formal) specification

Requires organizational commitment to quality

Information hiding and encapsulation in software design essential

Use programming language with strict typing and run-time checking

Avoid error-prone constructs

Use dependable and repeatable development process


Structured Programming

First discussed in the 1970s

Programming without goto

While loops and if statements as the only control statements

Top-down design

Important because it promoted thought and discussion about programming

Programs easier to read and understand than old spaghetti code


Error-Prone Constructs (1)

Floating-point numbers

Inherently imprecise – and machine-dependent

Imprecision may lead to invalid comparisons

Pointers

Pointers referring to the wrong memory area can corrupt data

Aliasing can make programs difficult to understand and change

Dynamic memory allocation

Run-time allocation can cause memory overflow
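The floating-point and aliasing hazards above can be seen in a few lines of Java. This is a minimal sketch: the class name `ErrorProneDemo`, the helper `nearlyEqual` and its tolerance of `1e-9` are illustrative assumptions, and a suitable tolerance depends on the magnitudes involved.

```java
import java.util.Arrays;

// Illustrates two error-prone constructs: floating-point comparison and aliasing.
class ErrorProneDemo {

    // Compare two doubles within an absolute tolerance instead of using ==.
    // The tolerance is an assumption; choose it to suit the application.
    static boolean nearlyEqual(double a, double b, double tolerance) {
        return Math.abs(a - b) <= tolerance;
    }

    public static void main(String[] args) {
        double sum = 0.1 + 0.2;
        System.out.println(sum == 0.3);                   // false: binary rounding error
        System.out.println(nearlyEqual(sum, 0.3, 1e-9));  // true

        // Aliasing: two names refer to the same array object, so a change
        // made through one name is visible through the other.
        int[] limits = {10, 20, 30};
        int[] alias = limits;
        alias[0] = 99;
        System.out.println(Arrays.toString(limits));      // [99, 20, 30]

        // A defensive copy breaks the alias.
        int[] copy = limits.clone();
        copy[1] = -1;
        System.out.println(Arrays.toString(limits));      // [99, 20, 30]
    }
}
```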


Error-Prone Constructs (2)

Parallelism

Can result in subtle timing errors (race conditions) because of unforeseen interaction between parallel processes

Recursion

Errors in recursion can cause memory overflow

Interrupts

Interrupts can cause a critical operation to be terminated and make the program difficult to understand

Similar to goto statements
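A minimal Java sketch of the race condition described above: two threads increment a shared counter. The unsynchronised `int` field can lose updates because `count++` is a separate read, add and write, whereas `java.util.concurrent.atomic.AtomicInteger` performs the increment atomically. The class `RaceDemo` and method `run` are hypothetical names, and the number of lost updates varies from run to run.

```java
import java.util.concurrent.atomic.AtomicInteger;

class RaceDemo {
    static int unsafeCount = 0;                               // shared, unsynchronised
    static final AtomicInteger safeCount = new AtomicInteger();

    // Two threads each increment both counters perThread times.
    // Returns {unsafe total, safe total}.
    static int[] run(int perThread) {
        unsafeCount = 0;
        safeCount.set(0);
        Runnable work = () -> {
            for (int i = 0; i < perThread; i++) {
                unsafeCount++;                // read-modify-write: updates can be lost
                safeCount.incrementAndGet();  // atomic: no lost updates
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] {unsafeCount, safeCount.get()};
    }

    public static void main(String[] args) {
        int[] r = run(1_000_000);
        System.out.println("unsafe: " + r[0]);  // often less than 2000000
        System.out.println("safe:   " + r[1]);  // always 2000000
    }
}
```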


Error-Prone Constructs (3)

Inheritance

Code not localised

Can result in unexpected behaviour when changes are made

Can be hard to understand

Problems are difficult to debug

These constructs need not be eliminated entirely, but they must be used with great care
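The inheritance hazard (code not localised, unexpected behaviour when changes are made) can be illustrated with a hypothetical base class whose `addAll` happens to call the overridable `add`. A subclass that counts additions then counts each item twice. `NameList` and `CountingNameList` are invented for this sketch; the same trap can arise with any base class whose internal self-calls are not documented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class: addAll is implemented by calling add.
class NameList {
    protected final List<String> names = new ArrayList<>();

    void add(String name) { names.add(name); }

    void addAll(List<String> batch) {
        for (String n : batch) add(n);   // self-use of an overridable method
    }
}

// Subclass written without knowing that addAll calls add.
class CountingNameList extends NameList {
    int addCount = 0;

    @Override void add(String name) {
        addCount++;
        super.add(name);
    }

    @Override void addAll(List<String> batch) {
        addCount += batch.size();
        super.addAll(batch);   // calls the overridden add(), so items are counted twice
    }
}

class InheritanceDemo {
    public static void main(String[] args) {
        CountingNameList list = new CountingNameList();
        list.addAll(List.of("Ada", "Grace", "Edsger"));
        System.out.println(list.addCount);   // 6, not 3
    }
}
```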


Information Hiding

Information should only be exposed to those parts of the program that need to access it.

Create objects or abstract data types which maintain state and operations on state

Reduces faults:

Less accidental corruption of information

‘Firewalls’ make problems less likely to spread to other parts of the program

All information localised:

Programmer less likely to make errors

Reviewers more likely to find errors


Example: Queue Specification in Java

interface Queue {

    public void put(Object o);
    public void remove(Object o);
    public int size();

} // Queue

Fig. 18.2, p. 398

• Users can put, remove, query size

• But the implementation of the queue is concealed (private)
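A minimal sketch of a class that could sit behind the `Queue` interface, assuming the interface exactly as in Fig. 18.2. The storage is a private `ArrayDeque`, so other code can only reach it through `put`, `remove` and `size`. The class names are hypothetical, and because the figure does not say what `remove(Object o)` means, this sketch assumes it deletes the first occurrence of `o`.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// The interface from Fig. 18.2, repeated so this sketch compiles on its own.
interface Queue {
    public void put(Object o);
    public void remove(Object o);
    public int size();
}

// Hypothetical implementation: the representation is private.
class ListQueue implements Queue {
    private final Deque<Object> items = new ArrayDeque<>();

    @Override public void put(Object o) {
        if (o == null) throw new IllegalArgumentException("null not allowed");
        items.addLast(o);
    }

    // Assumption: remove(o) deletes the first occurrence of o.
    @Override public void remove(Object o) {
        items.removeFirstOccurrence(o);
    }

    @Override public int size() {
        return items.size();
    }
}

class QueueDemo {
    public static void main(String[] args) {
        Queue q = new ListQueue();
        q.put("a");
        q.put("b");
        q.remove("a");
        System.out.println(q.size());   // 1
        // q.items would not compile here: the representation is hidden.
    }
}
```

Because clients depend only on the interface, the `ArrayDeque` could later be swapped for another structure without changing any calling code.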


Example: Signal Declaration in Java

class Signal {

    static public final int red = 1;
    static public final int amber = 2;
    static public final int green = 3;

    // ... other declarations here ...

}

Fig. 18.3, p. 398

• Define constants once, as class-level (static) globals

• Refer to Signal.red, Signal.green etc.

• Avoids risk of accidentally using the wrong value in a parameter
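Later versions of Java (from Java 5) offer enum types, which go one step further than integer constants: a method that takes a `Signal` cannot be passed an arbitrary `int` such as 4. This swaps in a language feature not used in Fig. 18.3. `SignalDemo`, `setLamp` and the GREEN, AMBER, RED cycle in `next()` are illustrative assumptions, not a real signalling sequence.

```java
// Hypothetical enum version of the Signal constants.
enum Signal {
    RED, AMBER, GREEN;

    // Assumed cycle for illustration only: GREEN -> AMBER -> RED -> GREEN.
    Signal next() {
        switch (this) {
            case GREEN: return AMBER;
            case AMBER: return RED;
            default:    return GREEN;
        }
    }
}

class SignalDemo {
    // Only a Signal can be passed here; setLamp(4) would not compile.
    static String setLamp(Signal s) {
        return "Lamp set to " + s;
    }

    public static void main(String[] args) {
        System.out.println(setLamp(Signal.GREEN.next()));   // Lamp set to AMBER
    }
}
```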


Reliable Software Processes

Well-defined, repeatable software process:

Reduces software faults

Does not depend entirely on individual skills – can be enacted by different people

Process activities should include significant verification and validation


Process Validation Activities

Requirements inspections

Requirements management

Model checking

Design and code inspection

Static analysis

Test planning and management

Configuration management also essential


Fault Tolerance

Critical software systems must be fault tolerant: the system can continue operating in spite of software failure

Fault tolerance is required where there are:

High availability requirements, or

Very high system failure costs

Even “fault-free” systems need fault tolerance:

There may be specification errors, or

Validation may be incorrect


Fault Tolerance Actions

Fault detection

Detect that an incorrect system state has occurred

Damage assessment

Identify parts of system state affected by fault

Fault recovery

Return to known safe state

Fault repair

Prevent recurrence of fault

Identify underlying problem

If not transient*, then fix errors of design, implementation, documentation or training that led to error

* Transient fault: e.g., hardware failure


Approaches to Fault Tolerance

Defensive programming

Programmers assume faults in code

Check state after modifications to ensure consistency

Fault-tolerant architectures

HW & SW system architectures support redundancy and fault tolerance

Controller detects problems and supports fault recovery

Complementary rather than opposing techniques


Fault Detection (1)

Strictly typed languages (e.g., Java and Ada): many errors trapped at compile time

Some classes of error can only be discovered at run time

Fault detection: detecting an erroneous system state and throwing an exception to manage the detected fault
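Fault detection by throwing an exception can be sketched in Java with a hypothetical value type, `Percentage`, whose invariant is that its value lies between 0 and 100. The check runs before the state is changed, so an invalid update is rejected and the old value survives; the caller manages the detected fault in a `catch` block. `InvalidStateException` is an invented checked exception, not a library class.

```java
// Hypothetical exception signalling an attempt to create an erroneous state.
class InvalidStateException extends Exception {
    InvalidStateException(String msg) { super(msg); }
}

// Hypothetical value type whose invariant is 0 <= value <= 100.
class Percentage {
    private int value;

    Percentage(int initial) throws InvalidStateException { set(initial); }

    // Detect the invalid value and throw, instead of silently storing it.
    void set(int v) throws InvalidStateException {
        if (v < 0 || v > 100) {
            throw new InvalidStateException("percentage out of range: " + v);
        }
        value = v;
    }

    int get() { return value; }
}

class PercentageDemo {
    public static void main(String[] args) {
        try {
            Percentage p = new Percentage(40);
            p.set(140);                                     // detected fault
        } catch (InvalidStateException e) {
            System.out.println("Handled: " + e.getMessage()); // manage the fault
        }
    }
}
```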


Fault Detection (2)

Preventative fault detection

Check conditions before making changes

If bad state detected, don’t make change

Retrospective fault detection

Check validity after system state has been changed

Used when:

An incorrect sequence of correct actions leads to an erroneous state, or

Preventative fault detection involves too much overhead
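Retrospective fault detection checks the state after the change. In this sketch, an untrusted sort routine (passed in as a `UnaryOperator<int[]>`) runs on a copy, and the result is then checked to be in order and to contain the same elements as the input. The names `RetrospectiveCheck`, `checkedSort` and `sameElements` are hypothetical, and throwing `IllegalStateException` is just one possible way to report the detected fault.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

class RetrospectiveCheck {

    // Validity check applied after the change: is the array in order?
    static boolean isSorted(int[] a) {
        for (int i = 1; i < a.length; i++) {
            if (a[i - 1] > a[i]) return false;
        }
        return true;
    }

    // True if b is a permutation of a (same elements, same multiplicities).
    static boolean sameElements(int[] a, int[] b) {
        if (a.length != b.length) return false;
        Map<Integer, Integer> counts = new HashMap<>();
        for (int x : a) counts.merge(x, 1, Integer::sum);
        for (int x : b) {
            Integer c = counts.get(x);
            if (c == null || c == 0) return false;
            counts.put(x, c - 1);
        }
        return true;
    }

    // Runs the routine on a copy, then checks the new state before accepting it.
    static int[] checkedSort(int[] original, UnaryOperator<int[]> sortRoutine) {
        int[] result = sortRoutine.apply(original.clone());
        if (!isSorted(result) || !sameElements(original, result)) {
            throw new IllegalStateException("sort produced an invalid result");
        }
        return result;
    }

    public static void main(String[] args) {
        int[] data = {3, 1, 2};
        UnaryOperator<int[]> goodSort = a -> { Arrays.sort(a); return a; };
        UnaryOperator<int[]> faultySort = a -> new int[] {1, 2, 2};   // loses the 3
        System.out.println(Arrays.toString(checkedSort(data, goodSort)));   // [1, 2, 3]
        try {
            checkedSort(data, faultySort);
        } catch (IllegalStateException e) {
            System.out.println("Detected: " + e.getMessage());
        }
        System.out.println(Arrays.toString(data));   // [3, 1, 2]: original untouched
    }
}
```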


Damage Assessment

Analyse system state

Judge extent of corruption caused by system failure

Assess what parts of state space have been affected by failure

Generally based on ‘validity functions’

Can be applied to state elements to assess whether their value is within the allowed range
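A validity function applied to each state element can be sketched as a range check. This assumes, purely for illustration, that the state is an array of temperature readings whose allowed range is -40 to 125; `DamageAssessment` and its methods are hypothetical names. The result lists which parts of the state appear to be corrupted.

```java
import java.util.ArrayList;
import java.util.List;

class DamageAssessment {
    // Hypothetical allowed range for each state element (e.g. temperatures in °C).
    static final double MIN = -40.0, MAX = 125.0;

    // Validity function for a single state element.
    static boolean isValid(double reading) {
        return !Double.isNaN(reading) && reading >= MIN && reading <= MAX;
    }

    // Apply the validity function to every element and report the
    // indices of elements that appear to be corrupted.
    static List<Integer> corruptedIndices(double[] state) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < state.length; i++) {
            if (!isValid(state[i])) bad.add(i);
        }
        return bad;
    }

    public static void main(String[] args) {
        double[] readings = {21.5, 22.0, 9999.0, 21.8, Double.NaN};
        System.out.println(corruptedIndices(readings));   // [2, 4]
    }
}
```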


Damage Assessment Techniques

Checksums

Used for damage assessment in data transmission

Verify integrity after transmission

Redundant pointers

Check integrity of data structures, e.g., databases

Watch-dog timers

Check for non-terminating processes

If no response after a certain time, there’s a problem
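Checksum-based damage assessment can be sketched with the standard `java.util.zip.CRC32` class: the sender computes a CRC-32 over the data, and the receiver recomputes it and compares. A mismatch shows that the data was damaged but does not say how to repair it. `ChecksumDemo`, `checksum` and `intact` are hypothetical names. A CRC guards against accidental corruption, not deliberate tampering.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

class ChecksumDemo {
    // Sender side: compute a CRC-32 over the payload and send it with the data.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // Receiver side: recompute and compare; a mismatch means the data was damaged.
    static boolean intact(byte[] received, long expectedChecksum) {
        return checksum(received) == expectedChecksum;
    }

    public static void main(String[] args) {
        byte[] message = "SET VALVE 3 OPEN".getBytes(StandardCharsets.US_ASCII);
        long sent = checksum(message);

        byte[] corrupted = message.clone();
        corrupted[4] ^= 0x01;                           // flip one bit "in transit"

        System.out.println(intact(message, sent));      // true
        System.out.println(intact(corrupted, sent));    // false
    }
}
```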


Fault Recovery

Forward recovery

Apply repairs to corrupted system state

Domain knowledge required to compute possible state corrections

Forward recovery usually application specific

Backward recovery

Restore system state to known safe state

Simpler than forward recovery

Details of safe state maintained and used to replace corrupted system state


Forward Recovery

Data communications

Add redundancy to coded data

Use it to repair data corrupted during transmission

Redundant pointers

E.g., doubly-linked lists

Damaged list / file may be repaired if enough links are still valid

Often used for database and filesystem repair
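Forward recovery with redundant pointers can be sketched on a doubly-linked list: if some forward (`next`) pointers are damaged but the backward (`prev`) pointers and the tail reference are intact, the forward chain can be rebuilt by walking from the tail. `Node`, `DList` and `repairForwardLinks` are hypothetical names, and the sketch assumes the damage is confined to the `next` pointers.

```java
// Hypothetical node: each link is stored twice, as next and as prev.
class Node {
    final int value;
    Node next, prev;
    Node(int value) { this.value = value; }
}

class DList {
    Node head, tail;

    void append(int value) {
        Node n = new Node(value);
        if (tail == null) {
            head = tail = n;
        } else {
            tail.next = n;
            n.prev = tail;
            tail = n;
        }
    }

    // Walks the forward pointers only, so damage to them shows up here.
    String forward() {
        StringBuilder sb = new StringBuilder();
        for (Node n = head; n != null; n = n.next) sb.append(n.value).append(' ');
        return sb.toString().trim();
    }

    // Forward recovery: rebuild every next pointer (and head) from the
    // prev chain, assuming the prev pointers and tail are still valid.
    void repairForwardLinks() {
        if (tail == null) { head = null; return; }
        Node current = tail;
        current.next = null;
        while (current.prev != null) {
            current.prev.next = current;
            current = current.prev;
        }
        head = current;
    }

    public static void main(String[] args) {
        DList list = new DList();
        for (int v : new int[] {1, 2, 3, 4}) list.append(v);
        list.head.next.next = null;            // simulate a damaged forward pointer
        System.out.println(list.forward());    // 1 2
        list.repairForwardLinks();
        System.out.println(list.forward());    // 1 2 3 4
    }
}
```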


Backward Recovery

Transaction processing often uses conservative methods to avoid problems

Complete computations, then apply changes

Keep original data in buffers

Periodic checkpoints allow system to 'roll back' to correct state
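Backward recovery can be sketched with a hypothetical in-memory account store: `checkpoint` saves a copy of a known good state, `rollback` restores it, and `transfer` computes its changes on a buffer copy and commits them only if the result is valid. The class and method names are illustrative; a real transaction system would also need durable storage and concurrency control.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical account store illustrating checkpoints and rollback.
class AccountStore {
    private Map<String, Integer> balances = new HashMap<>();
    private Map<String, Integer> saved = new HashMap<>();

    // Save a copy of the current (assumed correct) state.
    void checkpoint() { saved = new HashMap<>(balances); }

    // Backward recovery: discard the current state and restore the checkpoint.
    void rollback() { balances = new HashMap<>(saved); }

    int balance(String account) { return balances.getOrDefault(account, 0); }

    void deposit(String account, int amount) { balances.merge(account, amount, Integer::sum); }

    // Complete the computation on a buffer, then apply it only if valid.
    boolean transfer(String from, String to, int amount) {
        Map<String, Integer> buffer = new HashMap<>(balances);
        buffer.merge(from, -amount, Integer::sum);
        buffer.merge(to, amount, Integer::sum);
        if (buffer.get(from) < 0) return false;   // invalid: original state untouched
        balances = buffer;                         // commit
        return true;
    }
}

class RecoveryDemo {
    public static void main(String[] args) {
        AccountStore store = new AccountStore();
        store.deposit("A", 100);
        store.checkpoint();                                   // known safe state

        System.out.println(store.transfer("A", "B", 500));   // false: rejected
        store.deposit("A", -1_000);                           // simulated faulty update
        store.rollback();                                     // restore safe state
        System.out.println(store.balance("A"));               // 100
    }
}
```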


Key Points

Fault-tolerant software can continue execution in the presence of software faults

Fault tolerance requires failure detection, damage assessment, recovery and repair

Defensive programming approach to fault tolerance relies on inclusion of redundant checks in program

Exception handling facilities simplify process of defensive programming


Fault Tolerant Architecture

Defensive programming cannot cope with faults that involve interactions between hardware and software

Misunderstandings of requirements may mean that checks and the associated code are incorrect

Where systems have high availability requirements, specific architecture designed to support fault tolerance may be required.

Must tolerate both hardware and software failure


Hardware Fault Tolerance

Depends on triple-modular redundancy (TMR): 3 replicated identical components

Receive same input; outputs compared

If 1 output is different, it is discarded; component failure assumed

Assumes:

Most faults result from component failures

Few design faults

Low probability of simultaneous component failures
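The output comparator of a TMR system can be sketched in software as a majority vote over three readings. In this hypothetical `TmrVoter`, the agreed value is returned together with the number of the outvoted component (A1, A2 or A3), which a fault manager could then switch out. If all three disagree, the sketch throws an exception, because TMR assumes simultaneous failures are unlikely.

```java
// Hypothetical majority voter for triple-modular redundancy.
class TmrVoter {

    // Result of one vote: the agreed value and the outvoted component.
    static final class Vote {
        final int value;
        final int faultyComponent;   // 1, 2 or 3; -1 if all three agree
        Vote(int value, int faultyComponent) {
            this.value = value;
            this.faultyComponent = faultyComponent;
        }
    }

    static Vote vote(int a1, int a2, int a3) {
        if (a1 == a2 && a2 == a3) return new Vote(a1, -1);
        if (a1 == a2) return new Vote(a1, 3);   // A3 outvoted
        if (a1 == a3) return new Vote(a1, 2);   // A2 outvoted
        if (a2 == a3) return new Vote(a2, 1);   // A1 outvoted
        // No majority: more than one component has failed.
        throw new IllegalStateException("no majority");
    }

    public static void main(String[] args) {
        Vote v = vote(42, 17, 42);
        System.out.println(v.value + " (faulty: A" + v.faultyComponent + ")");  // 42 (faulty: A2)
    }
}
```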


Hardware Reliability With TMR

[Figure: replicated components A1, A2 and A3 feed an output comparator, which is linked to a fault manager]


Fault Tolerant Software Architectures

Assumptions of TMR for HW not true for software:

SW design flaws more common than in HW

Cannot replicate the same component in software: the copies would have common design faults, so simultaneous component failure would be virtually inevitable

Software systems must therefore be diverse


Design Diversity

Different versions of system designed and implemented in different ways; they ought to have different failure modes

Different approaches to design (e.g., object-oriented and function-oriented)

Implementation in different programming languages

Use of different tools and development environments

Use of different algorithms in implementation


Software Analogies to TMR (1)

N-version programming

Same specification implemented in a number of different versions by different teams

All versions compute simultaneously; majority output selected using a voting system

Most commonly-used approach, e.g., in Airbus 320 control systems
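N-version programming can be sketched at small scale. The assumed specification here is the integer square root (the largest `r` with `r * r <= n`, for non-negative `int` values of `n`). Three versions use different algorithms (floating-point with correction, binary search, Newton's method), and a voter selects the majority output. `NVersion`, `v1`, `v2`, `v3` and `vote` are hypothetical names; in a real system the versions would come from separate teams and run in parallel rather than in sequence as here.

```java
import java.util.List;
import java.util.function.IntUnaryOperator;

// Assumed specification: isqrt(n) is the largest r with r * r <= n,
// for 0 <= n <= Integer.MAX_VALUE.
class NVersion {

    // Version 1: floating-point square root, corrected with exact integer checks.
    static int v1(int n) {
        long r = (long) Math.sqrt(n);
        while (r * r > n) r--;
        while ((r + 1) * (r + 1) <= n) r++;
        return (int) r;
    }

    // Version 2: binary search for the largest r in [0, 46340] with r * r <= n
    // (46340 * 46340 is the largest square that fits in an int).
    static int v2(int n) {
        long lo = 0, hi = 46340;
        while (lo < hi) {
            long mid = (lo + hi + 1) / 2;
            if (mid * mid <= n) lo = mid; else hi = mid - 1;
        }
        return (int) lo;
    }

    // Version 3: integer Newton iteration, which decreases until it converges.
    static int v3(int n) {
        if (n < 2) return n;
        long x = n;
        long y = (x + 1) / 2;
        while (y < x) {
            x = y;
            y = (x + n / x) / 2;
        }
        return (int) x;
    }

    // Voter: returns the value produced by a strict majority of the versions.
    static int vote(int n, List<IntUnaryOperator> versions) {
        int[] results = new int[versions.size()];
        for (int i = 0; i < results.length; i++) {
            results[i] = versions.get(i).applyAsInt(n);
        }
        for (int candidate : results) {
            int agree = 0;
            for (int r : results) if (r == candidate) agree++;
            if (2 * agree > results.length) return candidate;
        }
        throw new IllegalStateException("no majority for input " + n);
    }

    public static void main(String[] args) {
        List<IntUnaryOperator> versions = List.of(NVersion::v1, NVersion::v2, NVersion::v3);
        System.out.println(vote(99, versions));    // 9

        // A faulty third version is outvoted by the other two.
        List<IntUnaryOperator> oneFaulty = List.of(NVersion::v1, NVersion::v2, x -> x / 2);
        System.out.println(vote(99, oneFaulty));   // 9
    }
}
```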


N-version programming (1)

[Figure: N-version programming; Versions 1, 2 and 3 (the N versions) feed an output comparator, which produces the agreed result]


N-version Programming (2)

Different system versions designed and implemented by different teams

Assumes low probability of teams making the same mistakes

Algorithms used should be different

Problem: versions may not be different enough

Some empirical evidence (research) that teams commonly:

Misinterpret specifications in the same way

Choose the same algorithms for their systems


Software Analogies to TMR (2)

Recovery blocks

Explicitly different versions of same specification written and executed in sequence

Acceptance test used to select output to be transmitted


Recovery Blocks (1)

[Figure: recovery blocks; try Algorithm 1 and apply the acceptance test; if the acceptance test fails, retry with Algorithm 2 and retest, then Algorithm 3; continue execution if the acceptance test succeeds; signal an exception if all algorithms fail]
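A recovery block can be sketched in Java: alternatives are tried in sequence, the first result that passes the acceptance test is used, and an exception is signalled if all fail. This assumes side-effect-free alternatives; a full recovery block would also restore the saved state before each retry. `RecoveryBlock` and `execute` are hypothetical names. In the usage example, the acceptance test checks that `r * r` is close to `x` without recomputing the square root.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

class RecoveryBlock {

    // Try each alternative in turn; return the first result accepted by the test.
    static <T> T execute(List<Supplier<T>> alternatives, Predicate<T> acceptanceTest) {
        for (Supplier<T> alternative : alternatives) {
            try {
                T result = alternative.get();
                if (acceptanceTest.test(result)) {
                    return result;   // acceptance test succeeds: continue execution
                }
            } catch (RuntimeException e) {
                // A crashing alternative is treated like a failed acceptance test.
            }
            // Otherwise retry with the next alternative.
        }
        throw new IllegalStateException("all alternatives failed the acceptance test");
    }

    public static void main(String[] args) {
        double x = 2.0;
        Supplier<Double> faulty = () -> x / 2;                  // hypothetical buggy version
        Supplier<Double> newton = () -> {
            double r = x;
            for (int i = 0; i < 50; i++) r = (r + x / r) / 2;
            return r;
        };
        Supplier<Double> library = () -> Math.sqrt(x);
        // Acceptance test: checks the result without repeating the computation.
        Predicate<Double> accept = r -> Math.abs(r * r - x) <= 1e-9 * Math.max(1.0, x);
        double root = execute(List.of(faulty, newton, library), accept);
        System.out.println(root);
    }
}
```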


Recovery Blocks (2)

Forcing a different algorithm to be used for each version reduces the probability of common errors

However, design of the acceptance test is difficult, as it must be independent of the computation used

Problems with this approach for real-time systems because of the sequential operation of the redundant versions


Problems with Design Diversity

Teams not culturally diverse, so they tend to tackle problems in the same way

Characteristic errors

Different teams make the same mistakes

Some parts of an implementation are more difficult than others, so all teams tend to make mistakes in the same place

Specification errors

If there is an error in the specification, it is reflected in all implementations

Can be addressed to some extent by using multiple specification representations


Specification Dependency

Both approaches to software redundancy are susceptible to specification errors: if the specification is incorrect, the system could fail

This is also a problem with hardware, but software specifications are usually more complex than hardware specifications and harder to validate

This has been addressed in some cases by developing separate software specifications from the same user specification


Software Redundancy Needed?

Software faults are not inevitable (unlike hardware faults, which are an inevitable consequence of the physical world) (WHY?)

Reducing software complexity may improve reliability and availability (WHY?)

Redundant software is much more complex:

Scope for a range of additional errors that affect system reliability

Ironically, these are caused by the existence of the fault-tolerance controllers


Key Points

Dependability in system can be achieved through fault avoidance and fault tolerance

Some programming language constructs such as gotos, recursion and pointers inherently error-prone

Data typing allows many potential faults to be trapped at compile time.


Key Points

Fault-tolerant architectures rely on replicated hardware and software components

They include mechanisms to detect a faulty component and to switch it out of the system

N-version programming and recovery blocks two different approaches to designing fault-tolerant software architectures

Design diversity essential for software redundancy


Homework

Study Chapter 18 in detail using SQ3R for next Tuesday, 4 Nov 2003

Complete exercises 18.1, 18.3-18.5 & 18.7-18.9 for 21 points

For next class (Thursday 30 Oct 2003), apply S-Q to chapter 19 on Verification and Validation. This is increasingly hard stuff, so STUDY before coming to class.

OPTIONAL: For 3 extra points per question, answer any of questions 18.2, 18.6, 18.10 and 18.11 by 11 Nov.


DISCUSSION
