
Digitized by the Internet Archive in 2011
http://www.archive.org/details/reliablecomputerOOsiew

RELIABLE COMPUTER SYSTEMS: DESIGN AND EVALUATION
THIRD EDITION

Daniel P. Siewiorek
Carnegie Mellon University
Pittsburgh, Pennsylvania

Robert S. Swarz
Worcester Polytechnic Institute
Worcester, Massachusetts

A K Peters
Natick, Massachusetts

Editorial, Sales, and Customer Service Office
A K Peters, Ltd.
63 South Avenue
Natick, MA 01760

Copyright 1998 by A K Peters, Ltd.

All rights reserved. No part of the material protected by this copyright notice may be reproduced or utilized in any form, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without written permission from the copyright owner.

Trademark products mentioned in the book are listed on page 890.

Library of Congress Cataloging-in-Publication Data

Siewiorek, Daniel P.
  Reliable computer systems : design and evaluation / Daniel P. Siewiorek, Robert S. Swarz. 3rd ed.
  p. cm.
  First ed. published under title: The theory and practice of reliable system design.
  Includes bibliographical references and index.
  ISBN 1-56881-092-X
  1. Electronic digital computers - Reliability. 2. Fault-tolerant computing. I. Swarz, Robert S. II. Siewiorek, Daniel P. Theory and practice of reliable system design. III. Title.
  QA76.5.S537 1998
  004-dc21                                        98-202237
                                                  CIP

Printed in the United States of America
02 01 00 99 98          10 9 8 7 6 5 4 3 2 1

CREDITS

Figure 1-3: Eugene Foley, "The Effects of Microelectronics Revolution on Systems and Board Test," Computers, Vol. 12, No. 10 (October 1979). Copyright 1979 IEEE. Reprinted by permission.

Figure 1-6: S. Russell Craig, "Incoming Inspection and Test Programs," Electronics Test (October 1980). Reprinted by permission.

Credits are continued on pages 885-890, which are considered a continuation of the copyright page.

To Karon and Lonnie

A Special Remembrance: During the development of this book, a friend, colleague, and fault-tolerant pioneer passed away. Dr. Wing N. Toy documented his 37 years of experience in designing several generations of fault-tolerant computers for the Bell System electronic switching systems described in Chapter 8. We dedicate this book to Dr. Toy in the confidence that his writings will continue to influence designs produced by those who learn from these pages.

CONTENTS

Preface

I   THE THEORY OF RELIABLE SYSTEM DESIGN

1   FUNDAMENTAL CONCEPTS
    Physical Levels in a Digital System
    Temporal Stages of a Digital System
    Cost of a Digital System
    Summary
    References

2   FAULTS AND THEIR MANIFESTATIONS
    System Errors
    Fault Manifestations
    Fault Distributions
    Distribution Models for Permanent Faults: The MIL-HDBK-217 Model
    Distribution Models for Intermittent and Transient Faults
    Software Fault Models
    Summary
    References
    Problems

3   RELIABILITY TECHNIQUES
    Steven A. Elkind and Daniel P. Siewiorek
    System-Failure Response Stages
    Hardware Fault-Avoidance Techniques
    Hardware Fault-Detection Techniques
    Hardware Masking Redundancy Techniques
    Hardware Dynamic Redundancy Techniques
    Software Reliability Techniques
    Summary
    References
    Problems

4   MAINTAINABILITY AND TESTING TECHNIQUES
    Specification-Based Diagnosis
    Symptom-Based Diagnosis
    Summary
    References
    Problems

5   EVALUATION CRITERIA
    Stephen McConnel and Daniel P. Siewiorek
    Introduction
    Survey of Evaluation Criteria: Hardware
    Survey of Evaluation Criteria: Software
    Reliability Modeling Techniques: Combinatorial Models
    Examples of Combinatorial Modeling
    Reliability and Availability Modeling Techniques: Markov Models
    Examples of Markov Modeling
    Availability Modeling Techniques
    Software Assistance for Modeling Techniques
    Applications of Modeling Techniques to Systems Designs
    Summary
    References
    Problems

6   FINANCIAL CONSIDERATIONS
    Fundamental Concepts
    Cost Models
    Summary
    References
    Problems

II  THE PRACTICE OF RELIABLE SYSTEM DESIGN
    Fundamental Concepts
    General-Purpose Computing
    High-Availability Systems
    Long-Life Systems
    Critical Computations

7   GENERAL-PURPOSE COMPUTING
    Introduction
    Generic Computer
    DEC
    IBM
    The DEC Case: RAMP in the VAX Family
    Daniel P. Siewiorek
        The VAX Architecture
        First-Generation VAX Implementations
        Second-Generation VAX Implementations
        References
    The IBM Case Part I: Reliability, Availability, and Serviceability in IBM 308X and IBM 3090 Processor Complexes
    Daniel P. Siewiorek
        Technology
        Manufacturing
        Overview of the 3090 Processor Complex
        References
    The IBM Case Part II: Recovery Through Programming: MVS Recovery Management
    C.T. Connolly
        Introduction
        RAS Objectives
        Overview of Recovery Management
        MVS/XA Hardware Error Recovery
        MVS/XA Serviceability Facilities
        Availability
        Summary
        Reference
        Bibliography

8   HIGH-AVAILABILITY SYSTEMS
    Introduction
    AT&T Switching Systems
    Tandem Computers, Inc.
    Stratus Computers, Inc.
    References
    The AT&T Case Part I: Fault-Tolerant Design of AT&T Telephone Switching System Processors
    W.N. Toy
        Introduction
        Allocation and Causes of System Downtime
        Duplex Architecture
        Fault Simulation Techniques
        First-Generation ESS Processors
        Second-Generation Processors
        Third-Generation 3B20D Processor
        Summary
        References
    The AT&T Case Part II: Large-Scale Real-Time Program Retrofit Methodology in AT&T 5ESS Switch
    L.C. Toy
        5ESS Switch Architecture Overview
        Software Replacement
        Summary
        References
    The Tandem Case: Fault Tolerance in Tandem Computer Systems
    Joel Bartlett, Wendy Bartlett, Richard Carr, Dave Garcia, Jim Gray, Robert Horst, Robert Jardine, Doug Jewett, Dan Lenoski, and Dix McGuire
        Hardware
        Integrity S2
        Processor Module Implementation Details
        Maintenance Facilities and Practices
        Software
        Operations
        Summary and Conclusions
        References
    The Stratus Case: The Stratus Architecture
    Steven Webber
        Stratus Solutions to Downtime
        Issues of Fault Tolerance
        System Architecture Overview
        Recovery Scenarios
        Stratus Software
        Architecture Tradeoffs
        Service Strategies
        Summary

9   LONG-LIFE SYSTEMS
    Introduction
    Generic Spacecraft
    Deep-Space Planetary Probes
    Other Noteworthy Spacecraft Designs
    References
    The Galileo Case: Galileo Orbiter Fault Protection System
    Robert W. Kocsis
        The Galileo Spacecraft
        Attitude and Articulation Control Subsystem
        Command and Data Subsystem
        AACS/CDS Interactions
        Sequences and Fault Protection
        Fault-Protection Design Problems and Their Resolution
        Summary
        References

10  CRITICAL COMPUTATIONS
    Introduction
    C.vmp
    SIFT
    The C.vmp Case: A Voted Multiprocessor
    Daniel P. Siewiorek, Vittal Kini, Henry Mashburn, Stephen McConnel, and Michael Tsao
        System Architecture
        Issues of Processor Synchronization
        Performance Measurements
        Operational Experiences
        References
    The SIFT Case: Design and Analysis of a Fault-Tolerant Computer for Aircraft Control
    John H. Wensley, Leslie Lamport, Jack Goldberg, Milton W. Green, Karl N. Levitt, P.M. Melliar-Smith, Robert E. Shostak, and Charles B. Weinstock
        Motivation and Background
        SIFT Concept of Fault Tolerance
        The SIFT Hardware
        The Software System
        The Proof of Correctness
        Summary
        References
        Appendix: Sample Special Specification

III A DESIGN METHODOLOGY AND EXAMPLE OF DEPENDABLE SYSTEM DESIGN

11  A DESIGN METHODOLOGY
    Daniel P. Siewiorek and David Johnson
    Introduction
    A Design Methodology for Dependable System Design
    The VAXft 310 Case: A Fault-Tolerant System by Digital Equipment Corporation
    William Bruckert and Thomas Bissett
        Defining Design Goals and Requirements for the VAXft 310
        VAXft 310 Overview
        Details of VAXft 310 Operation
        Summary

APPENDIXES

APPENDIX A  Error-Correcting Codes for Semiconductor Memory Applications: A State-of-the-Art Review
    C.L. Chen and M.Y. Hsiao
    Introduction
    Binary Linear Block Codes
    SEC-DED Codes
    SEC-DED-SBD Codes
    SBC-DBD Codes
    DEC-TED Codes
    Extended Error Correction
    Conclusions
    References

APPENDIX B  Arithmetic Error Codes: Cost and Effectiveness Studies for Application in Digital System Design
    Algirdas Avizienis
    Methodology of Code Evaluation
    Fault Effects in Binary Arithmetic Processors
    Low-Cost Radix-2 Arithmetic Codes
    Multiple Arithmetic Error Codes
    References

APPENDIX C  Design for Testability: A Survey
    Thomas W. Williams and Kenneth P. Parker
    Introduction
    Design for Testability
    Ad-Hoc Design for Testability
    Structured Design for Testability
    Self-Testing and Built-in Tests
    Conclusion
    References

APPENDIX D  Summary of MIL-HDBK-217E Reliability Model
    Failure Rate Model and Factors
    Reference

APPENDIX E  Algebraic Solutions to Markov Models
    Jeffrey P. Hansen
    Solution of MTTF Models
    Complete Solution for Three- and Four-State Models
    Solutions to Commonly Encountered Markov Models
    References

GLOSSARY
REFERENCES
CREDITS
TRADEMARKS
INDEX

PREFACE

System reliability has been a major concern since the beginning of the electronic digital computer age. The earliest computers were constructed of components such as relays and vacuum tubes that would fail to operate correctly as often as once every hundred thousand or million cycles. This error rate was far too large to ensure correct completion of even modest calculations requiring tens of millions of operating cycles. The Bell relay computer (c. 1944) performed a computation twice and compared results; it also employed error-detecting codes. The first commercial computer, the UNIVAC (c. 1951), utilized extensive parity checking and two arithmetic logic units (ALUs) in a match-and-compare mode. Today, interest in reliability pervades the computer industry, from large mainframe manufacturers to semiconductor fabricators who produce not only reliability-specific chips (such as for error-correcting codes) but also entire systems.

Computer designers have to be students of reliability, and so do computer system users. Our dependence on computing systems has grown so great that it is becoming difficult or impossible to return to less sophisticated mechanisms. When an airline seat selection computer "crashes," for example, the airline can no longer revert to assigning seats from a manual checklist; since the addition of round-trip check-in service, there is no way of telling which seats have been assigned to passengers who have not yet checked in without consulting the computer. The last resort is a free-for-all rush for seats. The computer system user must be able to understand the advantages and limitations of the state of the art in reliability design; determine the impact of those advantages and limitations upon the application or computation at hand; and specify the requirements for the system's reliability so that the application or computation can be successfully completed.

The literature on reliability has been slow to evolve. During the 1950s reliability was the domain of industry, and the quality of the design often depended on the cleverness of an individual engineer. Notable exceptions are the work of Shannon [1948] and Hamming [1950] on communication through noisy (hence error-inducing) channels, and of Moore and Shannon [1956] and von Neumann [1956] on redundancy that survives component failures. Shannon and Hamming inaugurated the field of coding theory, a cornerstone in contemporary systems design. Moore, Shannon, and von Neumann laid the foundation for development and mathematical evaluation of redundancy techniques.

During the 1960s the design of reliable systems received systematic treatment in industry. Bell Telephone Laboratories designed and built an Electronic Switching System (ESS), with a goal of only two hours' downtime in 40 years [Downing, Nowak, and Tuomenoksa, 1964]. The IBM System/360 computer family had extensive serviceability features [Carter et al., 1964]. Reliable design also found increasing use in the aerospace industry, and a triplicated computer helped man land on the moon [Cooper and Chow, 1976; Dickinson, Jackson, and Randa, 1964]. The volume of literature also increased. In 1962 a Symposium on Redundancy Techniques held in Washington, D.C., led to the first comprehensive book on the topic [Wilcox and Mann, 1962]. Later, Pierce [1965] published a book generalizing and analyzing the Quadded Redundancy technique proposed by Tryon and reported in Wilcox and Mann [1962]. A community of reliability theoreticians and practitioners was developing.

During the 1970s interest in system reliability expanded explosively. Companies were formed whose major product was a reliable system (such as Tandem). Due to the effort of Algirdas Avizienis and other pioneers, a Technical Committee on Fault-Tolerant Computing (TCFTC) was formulated within the Institute of Electrical and Electronics Engineers (IEEE). Every year since 1971, the TCFTC has held an International Symposium on Fault-Tolerant Computing.

In 1982, when the first edition of The Theory and Practice of Reliable System Design was published, the time was ripe for a book on the design of reliable computing structures. The book was divided into two parts, the first being devoted to the fundamental concepts and theory and the second being populated with a dozen chapters that represented detailed case studies. The second edition follows the same basic structure, but is divided into three parts. Part I deals with the theory and Parts II and III with the practice of reliable design. The appendices provide detailed information on coding theory, design for testability, and the MIL-HDBK-217 component reliability model.

In recent years, the number of reliability and redundancy techniques has continued to expand, along with renewed emphasis on software techniques, application of older techniques to newer areas, and in-depth analytical evaluation to compare and contrast many techniques. In Part I, Chapters 3 and 5 have been expanded to include these new results. More case studies have been developed on the frequency and manifestation of hardware and software system failures. Chapter 2 has been updated to include summaries of this new material. Likewise, Chapter 4 has been enlarged to cover testing techniques commencing with prototypes through manufacturing, field installation, and field repair. The new additions to Part I have resulted in over a 50 percent increase in the number of references cited in the second edition over the first edition.

Part II of the second edition has undergone an even more dramatic change. In the first edition, Part II surveyed twelve different computer systems, ranging from one-of-a-kind research vehicles to mass-produced general-purpose commercial systems. The commercial systems focused on error detection and retry and represented three of the case studies. Four case studies represented one-of-a-kind research systems. Three other systems sought limited deployment in aerospace and message-switching applications. Only two of the case studies represented wider-spread deployment of fault-tolerant systems numbering in the thousands. Furthermore, each case study represented almost a unique architecture with little agreement as to the dominant approach for building fault-tolerant systems.

In the intervening years between the first and second editions, fault tolerance has established itself as a major segment of the computing market. The number of deployed fault-tolerant systems is measured in the tens of thousands. Manufacturers are developing the third- and fourth-generation systems so that we can look back at the evolutionary trajectory of these "fault-tolerant computer families." There has also been a convergence with respect to the system architecture of preference. While the commercial systems still depend upon error detection and retry, the high-reliability systems rely upon triplication and voting, and the high-availability systems depend upon duplication and matching. The case studies have been reduced to nine in order for more space to be devoted to technical details as well as evolutionary family growth. Two case studies represent general-purpose commercial systems, three represent research and aerospace systems, and four represent high-availability systems. The approaches used in each of these three application areas can be compared and contrasted. Of special interest are the subtle variations upon duplication and matching used by all four high-availability architectures. In total, almost 50 percent of the material in the second edition is new with respect to the first edition.

This book has three audiences. The first is the advanced undergraduate student interested in reliable design; as prerequisites, this student should have had courses in introductory programming, computer organization, digital design, and probability. In 1983, the IEEE Computer Society developed a model program in computer science and engineering. This program consisted of nine core modules, four laboratory modules, and fifteen advanced subject areas. One of those advanced subject areas was "fault-tolerant computing." Table P-1 illustrates how this book can be used in support of the module on fault-tolerant computing.

TABLE P-1  Mapping of the book to modules in Subject Area 20: Fault-Tolerant Computing, of the 1983 IEEE Computer Society Model Undergraduate Program in Computer Science and Engineering

Module                                                      Appropriate Chapter
1. Need for Fault-Tolerant Systems: Applications, fault     Ch. 1, Fundamental Concepts;
   avoidance, fault tolerance, levels of implementation     Ch. 3, Reliability Techniques
2. Faults and Their Manifestations: Sources,                Ch. 2, Faults and Their Manifestations
   characteristics, effects, modeling
3. Error Detection: Duplication, timeouts, parity checks    Ch. 3, Reliability Techniques
4. Protective Redundancy: Functional replication,           Ch. 3, Reliability Techniques
   information redundancy, temporal methods
5. Fault-Tolerant Software: N-version programming,          Ch. 3, Reliability Techniques
   recovery blocks, specification validation, proof,
   mutation
6. Measures of Fault Tolerance: Reliability models,         Ch. 5, Evaluation Criteria;
   coverage, availability, maintainability                  Ch. 6, Financial Considerations
7. Case Studies                                             Introduction to Part II and further
                                                            examples from Chapters 7 to 11 as
                                                            time permits

The second audience is the graduate student seeking a second course in reliable design, perhaps as a prelude to engaging in research. The more advanced portions of Part I and the system examples of Part II should be augmented by other books and current research literature as suggested in Table P-2. A project, such as design of a dual system with a mean-time-to-failure that is an order of magnitude greater than nonredundant systems while minimizing life-cycle costs, would help to crystallize the material for students. An extensive bibliography provides access to the literature.

TABLE P-2  Proposed structure for a graduate course

Chapters                                   Augmentation
Ch. 1, Fundamental Concepts;               Ross [1972] and/or Shooman [1968] for random variables and
Ch. 2, Faults and Their Manifestations     statistical parameter estimation; ARINC [1964] for data
                                           collection and analysis
Ch. 3, Reliability and Availability        Appendix A; Peterson and Weldon [1972] for coding theory;
Techniques                                 Sellers, Hsiao, and Bearnson [1968b] for error-detection
                                           techniques; Proceedings of the Annual IEEE International
                                           Symposium on Fault-Tolerant Computing; special issues of the
                                           IEEE Transactions on Computers on Fault-Tolerant Computing
                                           (e.g., November 1971, March 1973, July 1974, May 1975, June
                                           1976, June 1980, July 1982, 1986, April 1990); special issues
                                           of Computer on Fault-Tolerant Computing (e.g., March 1980,
                                           July 1984, July 1990)
Ch. 4, Maintainability and Testing         Breuer and Friedman [1976] for testing; Proceedings of the
Techniques                                 Cherry Hill Test Conference; special issues of Computer on
                                           Testing (e.g., October 1979); ARINC [1964] for maintenance
                                           analysis
Ch. 5, Evaluation Criteria                 Ross [1972], Howard [1971], Shooman [1968], Craig [1964] for
                                           Markov models and their solutions
Ch. 6, Financial Considerations            Phister [1979]
Part II                                    October 1978 special issue of the Proceedings of the IEEE

The third audience is the practicing engineer. A major goal of this book is to provide enough concepts to enable the practicing engineer to incorporate comprehensive reliability techniques into his or her next design. Part I provides a taxonomy of reliability techniques and the mathematical models to evaluate them. Design techniques are illustrated through the series of articles in Part II, which describe actual implementations of reliable computers. These articles were written by the system designers. The final chapter provides a methodology for reliable system design and illustrates how this methodology can be applied in an actual design situation (the DEC VAXft 310).

Acknowledgments. The authors wish to express deep gratitude to many colleagues in the fault-tolerant computing community. Without their contributions and assistance this book could not have been written. We are especially grateful to the authors of the papers who shared their design insights with us.

Special thanks go to Joel Bartlett (DEC-Western), Wendy Bartlett (Tandem), Thomas Bissett (DEC), Doug Bossen (IBM), William Bruckert (DEC), Richard Carr (Tandem), Kate Connolly (IBM), Stanley Dickstein (IBM), Dave Garcia (Tandem), Jim Gray (Tandem), Jeffrey P. Hansen (CMU), Robert Horst (Tandem), M.Y. Hsiao (IBM), Robert Jardine (Tandem), Doug Jewett (Tandem), Robert W. Kocsis (Jet Propulsion Lab.), Dan Lenoski (Tandem), Dix McGuire (Tandem), Bob Meeker (IBM), Dick Merrall (IBM), Larry Miller (IBM), Louise Nielsen (IBM), Les Parker (IBM), Frank Sera (IBM), Mandakumar Tendolkar (IBM), Liane Toy (AT&T), Wing Toy (AT&T), and Steven Webber (Stratus).

Jim Franck and John Shebell of Digital provided material and insight for Chapters 4 and 6, respectively. Jim Gray provided data on Tandem system failures that have been included in Chapter 2. Jeff Hansen, David Lee, and Michael Schuette provided material on mathematical modeling, computer aids, and techniques. Comments from several reviewers and students were particularly helpful.

Special thanks are due to colleagues at both Carnegie-Mellon University and Digital Equipment Corporation (DEC) for providing an environment conducive to generating and testing ideas, especially Steve Director, Dean of the Engineering College, and Nico Habermann, Dean of the School of Computer Science. The entire staff of Digital Press provided excellent support for a timely production. The professionalism of the staff at Technical Texts is deeply appreciated, as they provided invaluable assistance throughout the production of the book. A special acknowledgment is also due to Sylvia Dovner, whose countless suggestions and attention to details contributed towards her goal of a "user friendly" book. The manuscript provided many unforeseen "challenges," and Sylvia's perseverance was the glue that held the project together. That the book exists today is due in no small part to Sylvia's efforts.

This book would not have been possible without the patience and diligence of Mrs. Laura Forsyth, who typed, retyped, and mailed the many drafts of the manuscript. Her activities as a "traffic controller" were vital to the project.

Finally, the support and understanding of our families is the central ingredient that made this book possible. From the occupation of the dining room table for weeks at a time for reorganizing text or double-checking page proofs to missing social events or soccer games, their patience and sacrifice over the last five years enabled the project to draw to a successful conclusion.

REFERENCES*

ARINC [1964]; Breuer and Friedman [1976]; Carter et al. [1964]; Cooper and Chow [1976]; Craig [1964]; Dickinson, Jackson, and Randa [1964]; Downing, Nowak, and Tuomenoksa [1964]; Hamming [1950]; Howard [1971]; Moore and Shannon [1956]; Peterson and Weldon [1972]; Phister [1979]; Pierce [1965]; Ross [1972]; Sellers, Hsiao, and Bearnson [1968b]; Shannon [1948]; Shooman [1968]; von Neumann [1956]; Wilcox and Mann [1962].

*For full citations of the shortened references at the end of each chapter, see References at the back of the book.

I   THE THEORY OF RELIABLE SYSTEM DESIGN

Part I of this book presents the many disciplines required to construct a reliable computing system. Chapter 1 explains the motivation for reliable systems and provides the theoretical framework for their design, fabrication, and maintenance. It presents the hierarchy of physical levels into which a computer system is customarily partitioned and introduces the stages into which the life of a computer system is divided. Chapter 1 also provides a detailed discussion of two stages in a system's life: manufacturing and operation. Lastly, the chapter identifies several of the costs of ownership for a computer system and specifies some of the parameters that the designer can control to increase customer satisfaction.

Chapter 2 discusses errors and fault manifestations in a computer system. A review of applicable probability theory is presented as an aid to understanding the mathematics of the various fault distributions. Common techniques for matching empirical data to fault distributions, such as the maximum likelihood estimator, linear regression, and the chi-square goodness-of-fit test, are discussed. Chapter 2 also introduces methods for estimating permanent failure rates, including the MIL-HDBK-217 procedure, a widely used mathematical model of permanent faults in electronic equipment, and the life-cycle testing and data analysis approaches. It addresses the problem of finding an appropriate distribution for intermittent and transient errors by analyzing field data from computer systems of diverse manufacturers.

Chapter 3 deals with reliability techniques, or ways to improve the mean time to failure. It presents a comprehensive taxonomy of reliability and availability techniques. There is also a catalog of techniques, along with evaluation criteria for both hardware and software.

Chapter 4 deals with maintainability techniques, or ways to improve the mean time to repair of a failed computer system. It provides a taxonomy of testing and maintenance techniques, and describes ways to detect and correct sources of errors at each stage of a computer's life cycle. Specific strategies for testing during the manufacturing phase are discussed. The chapter explains several logic-level acceptance tests, such as exclusive-OR testing, signature analysis, Boolean difference, path sensitization, and the D-algorithm. It also introduces a discipline, called design for testability, which attempts to define properties of easy-to-test systems. The chapter concludes with a discussion of symptom-directed diagnosis, which utilizes operational life data to predict and diagnose failures.

How can a reliable or maintainable design be mathematically evaluated? That is, if a system is supposed to be down no more than two hours in 40 years, how can one avoid waiting that long to confirm success? Chapter 5 defines a host of evaluation criteria, establishes the underlying mathematics, and presents deterministic models and simulation techniques. Simple series-parallel models are introduced as a method for evaluating the reliability of nonredundant systems and systems with standby sparing. Next, several types of combinatorial (failure-to-exhaustion) models are described. The chapter also introduces ways of reducing nonseries, nonparallel models to more tractable forms. Chapter 5 continues with Markov models, which define various system states and express the probability of going from one state to another. In these models, the probability depends only on the present state and is independent of how the present state was reached. After describing several other simulation and modeling techniques, the chapter concludes with a case study of an effort to make a more reliable version of a SUN workstation using the techniques defined in Chapter 3.

Finally, Chapter 6 is concerned with the financial considerations inherent in the design, purchase, and operation of a computer system. The discussion adopts two major viewpoints: that of the maintenance provider and that of the system's owner/operator. An explanation of the various sources of maintenance costs, such as labor and materials, is followed by an overview of the field service business. Several maintenance cost models are suggested, along with a method for assessing the value of maintainability features. The chapter describes two of the many ways of modeling the life-cycle costs of owning and operating a computer system; these cost models are essential to the system designer in understanding the financial motivations of the customer.

1

FUNDAMENTAL CONCEPTS

Historically, reliable computers have been limited to military, industrial, aerospace, and communications applications in which the consequence of computer failure is significant economic impact and/or loss of life. Reliability is of critical importance wherever a computer malfunction could have catastrophic results, as in the space shuttle, aircraft flight-control systems, hospital patient monitors, and power system control.

Reliability techniques have become of increasing interest to general-purpose computer systems because of several recent trends, four of which are presented here:

1. Harsher Environments: With the advent of microprocessors, computer systems have been moved from the clean environments of computer rooms into industrial environments. The cooling air contains more particulate matter; temperature and humidity vary widely and are frequently subject to spontaneous changes; the primary power supply fluctuates; and there is electromagnetic interference.

2. Novice Users: As computers proliferate, the typical user knows less about proper operation of the system. Consequently, the system has to be able to tolerate more inadvertent user abuse.

3. Increasing Repair Costs: As hardware costs continue to decline and labor costs escalate, a user cannot afford frequent calls for field service. Figure 1-1 depicts the relation between cost of ownership and the addition of reliability and maintainability features. Note that as hardware costs increase, service costs decrease because of fewer and shorter field service calls.

4. Larger Systems: As systems become larger, there are more components that can fail. Because the overall failure rate of a system is directly related to the sum of the failure rates of its individual components, designs that tolerate faults resulting from component failure can keep the system failure rate at an acceptable level.

As the need for reliability has increased in the industrial world, so has the interest in fault tolerance. Manufacturers of large mainframe computers, such as IBM, Unisys, and Amdahl, now use fault-tolerant techniques both to improve reliability and to assist field service personnel in fault isolation. Minicomputer manufacturers incorporate fault-tolerant features in their designs, and some companies, such as Tandem, have been formed solely to market fault-tolerant computers.

Fault-tolerant computing is the correct execution of a specified algorithm in the presence of defects. The effect of defects can be overcome by the use of redundancy.

[Figure 1-1. Cost of ownership as a function of reliability and maintainability features: acquisition cost rises and service cost falls as features are added, yielding a minimum total cost of ownership at an intermediate point.]

This redundancy can be either temporal (repeated executions) or physical (replicated hardware or software). At the highest level, fault-tolerant systems are categorized as either highly available or highly reliable.

Availability: The availability of a system as a function of time, A(t), is the probability that the system is operational at the instant of time, t. If the limit of this function exists as t goes to infinity, it expresses the expected fraction of time that the system is available to perform useful computations. Activities such as preventive maintenance and repair reduce the time that the system is available to the user. Availability is typically used as a figure of merit in systems in which service can be delayed or denied for short periods without serious consequences.

Reliability: The reliability of a system as a function of time, R(t), is the conditional probability that the system has survived the interval [0, t], given that the system was operational at time t = 0. Reliability is used to describe systems in which repair cannot take place (as in satellite computers), systems in which the computer is serving a critical function and cannot be lost even for the duration of a repair (as in flight computers on aircraft), or systems in which the repair is prohibitively expensive. In general, it is more difficult to build a highly reliable computing system than a highly available system because of the more stringent requirements imposed by the reliability definition. An even more stringent definition than R(t), sometimes used in aerospace applications, is the maximum number of failures anywhere in the system that the system can tolerate and still function correctly.

This chapter describes the basic concepts in a three-dimensional reliability framework. This framework allows the various constraints, techniques, and decisions in the design of reliable systems to be mapped. The first dimension in the framework is the physical hierarchy, which ranges from primitive components to complex systems. The second dimension is the time in the system's life, which includes various stages from concept through manufacturing and operation. The third dimension is the cost of the system relative to customer satisfaction and physical resources. This framework is the foundation for all techniques and approaches to reliable systems that are covered in subsequent chapters of this book.
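The chapter defines A(t) and R(t) without giving closed forms at this point. As a minimal sketch, assuming the common constant-failure-rate and constant-repair-rate model (an assumption of this sketch, not a statement from the text), the definitions above reduce to the familiar expressions:

```latex
% Constant-failure-rate sketch (assumed exponential failure law with rate \lambda
% and exponential repair law with rate \mu); illustrative only.
\[
  R(t) = e^{-\lambda t}, \qquad
  \mathrm{MTTF} = \int_0^\infty R(t)\,dt = \frac{1}{\lambda}, \qquad
  A_\infty = \lim_{t\to\infty} A(t)
           = \frac{\mu}{\lambda+\mu}
           = \frac{\mathrm{MTTF}}{\mathrm{MTTF}+\mathrm{MTTR}}.
\]
```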

PHYSICAL LEVELS IN A DIGITAL SYSTEM*

The first dimension in the reliability framework pertains to the physical levels in a digital system. Digital computer systems are enormously complex, and some hierarchical concept must be used to manage this complexity. In the hierarchy, each level contains only information important to its level and suppresses unnecessary information about lower levels. System designers frequently utilize a hierarchy in which the levels coincide with the system's physical boundaries, as listed in Table 1-1.

TABLE 1-1  Hierarchical levels for digital computers

Level/Sublevel                 Components
PMS (highest level)            Processors; memories; switches; controllers; transducers; data operators; links
Program
  ISP                          Memory state; processor state; effective address calculation; instruction decode; instruction execution
  High-level language          Software
Logic
  Register transfer
    Data                       Registers; operators; data paths
    Control
      Hardwired                Sequential logic machines
      Microprogramming         Microsequencer; microstore
  Switching circuit
    Sequential                 Flip-flops; latches; delays
    Combinatorial              Gates; encoders/decoders; data operators
Circuit (lowest level)         Resistors; capacitors; inductors; power sources; diodes; transistors

Circuit Level: The circuit level consists of such components as resistors, capacitors, inductors, and power sources. The metrics of system behavior include voltage, current, flux, and charge. The circuit level is not the lowest possible level at which to describe a digital system. Various electromagnetic and quantum mechanical phenomena underlie circuit theory, and the operation of electromechanical system devices (such as disks) requires more than circuit theory to model their operation.

Logic Level: The logic level is unique to digital systems. The switching-circuit sublevel is composed of such things as gates and data operators built out of gates. This sublevel is further subdivided into sequential and combinatorial logic circuits, with the fundamental difference being the absence of memory elements in combinatorial circuits. The register transfer sublevel, the next higher level, deals with registers and functional transfers of information among registers. This sublevel is frequently further subdivided into a data part and a control part. The data part is composed of registers, operators, and data paths. The control part provides the time-dependent stimuli that cause transfers between registers to take place. In some computers, the control part is implemented as a hard-wired state machine. With the availability of low-cost read-only memories (ROMs), microprogramming is now a more popular way to implement the control function.

Program Level: The program level is unique to digital computers. At this level, a sequence of instructions in the device is interpreted, and it causes action upon a data structure. This is the instruction set processor (ISP) sublevel. The ISP description is used in turn to create software components that are easily manipulated by programmers: the high-level-language sublevel. The result is software, such as operating systems, run-time systems, application programs, and application systems.

PMS (Processor, Memory, Switch) Level: Finally, the various elements (input/output devices, memories, mass storage, communications, and processors) are interconnected to form a complete system.

*This discussion is adapted from Siewiorek, Bell, and Newell, 1982.

TEMPORAL STAGES OF A DIGITAL SYSTEM

The second dimension in the reliability framework is that of time. The point at which a technique or methodology is applied during the life cycle of a system may be more important than the physical level. From a user's viewpoint, a digital system can be treated as a "black box" that produces outputs in response to input stimuli. Table 1-2 lists the numerous stages in the life of the box as it progresses from concept to final implementation. These stages include specification of input/output relationships, logic design, prototype debugging, manufacturing, installation, and field operation. Deviations from intended behavior, or errors, can occur at any stage as a result of incomplete specifications, incorrect

TABLE 1-2  Stages in the life of a system

Stage                      Error Sources                          Error Detection Techniques
Specification and design   Algorithm design; formal               Simulation; consistency checks
                           specifications
Prototype                  Algorithm design; wiring and           Stimulus/response testing
                           assembly; timing; component failure
Manufacture                Wiring and assembly; component         System testing; diagnostics
                           failure
Installation               Assembly; component failure            System testing; diagnostics
Operational life           Component failure; operator errors;    Diagnostics
                           environmental fluctuations

implementation of a specification into a logic design, and assembly mistakes during prototyping or manufacturing.

During the system's operational life, errors can result from change in the physical state or damage to hardware. Physical changes may be triggered by environmental factors such as fluctuations in temperature or power supply voltage, static discharge, and even alpha-particle emissions. Inconsistent states can also be caused by both operator errors and design errors in hardware or software. Operational causes of outage are relatively evenly distributed among hardware, software, maintenance actions, operations, and environment. Table 1-3 depicts the distribution of outages from seven different studies. As illustrated by the table, substantial gains in reliability will result only when all sources of outage are addressed. For example, complete elimination of hardware-caused outages will only increase time between errors by about 25 percent.

Design errors, whether in hardware or software, are those caused by improper translation of a concept into an operational realization. Closely tied to the human creative process, design errors are difficult to predict. Gathering statistical information about the phenomenon is difficult because each design error occurs only once per system. The rapid rate of development in hardware technology constantly changes the set of design trade-offs, further complicating the study of hardware design errors. In the last decade, there has been some progress in the use of redundancy (using additional resources beyond the minimum required to perform the task successfully) to control software design errors.

Any source of error can appear at any stage; however, it is usually assumed that certain sources of error predominate at particular stages. Furthermore, error-detection techniques can be tailored to the manifestation of fault sources. Thus, at each stage of system life there is a primary methodology for detecting errors.

TABLE 1-3  Probability of operational outage caused by various sources

Sources of outage (rows): hardware, software, maintenance, operations, environment.
Studies (columns): AT&T switching systems [Toy, 1978]; Bellcore(a); Japanese commercial users(a); Tandem [Gray, 1985]; Tandem [Gray, 1987]; Northern Telecom; mainframe users [AM, 1986].
[The individual probabilities in the table body are not recoverable from this transcription.]

Note: Dashes indicate that no separate value was reported for that category in the cited study.
a. Data shows the fraction of downtime attributed to each source. Downtime is defined as any service disruption that exceeds 30 seconds duration. The Bellcore data represented a 3.5-minute downtime per year per system.
b. Total is split between procedural errors (0.30) and recovery deficiencies (0.35).
c. 47 percent of the hardware failures occurred because the second unit failed before the first unit could be replaced.
d. Data applies to recovery software.
e. Total is split between procedural errors (0.42) and operational software (0.02).
f. Study only reported probability of vendor-related outage (i.e., 0.75 is split between vendor hardware, software, and maintenance).
g. Of the total amount, 0.15 is attributed to power.

Two important stages in the life of a system, the manufacturing stage and the operational life stage, are discussed in the following subsections. A third important stage, design, is the subject of the remaining chapters in Part I.

The Manufacturing Stage

A careless manufacturing process can make even the most careful design useless. The manufacturing stage begins with the final portion of the prototype stage in a process called design maturity testing.

Design Maturity Testing. A design maturity test (DMT) estimates the mean time to failure (MTTF) for a new product before the product is committed to volume manufacturing. The DMT is conducted to isolate and correct repetitive systemic problems that, if left in the design, would result in higher service costs and customer dissatisfaction. The DMT is accomplished by operating a set of sample devices for a prolonged time (typically 6 to 8 units for 2 to 4 months) to simulate actual field operation. In cases in which the duty cycle of the equipment is less than 100 percent, the duty cycle under test may be increased to 100 percent to accelerate testing. As failures are observed and recorded, they are classified according to such factors as failure mode, time, or environmental cause. Similar failures are then ranked in groups by decreasing frequency of occurrence.

This procedure establishes priorities for eliminating the causes. After the fundamental cause of the failure is found and corrective design action is taken, the operation of the modified or repaired test samples provides a closed-loop evaluation of the efficacy of the change. Repeating the procedure improves the design of the test samples until their estimated MTTF meets the specifications with a certain statistical confidence.

The progress of the test can be monitored with a chart prepared in advance for the product under test, as shown in Figure 1-2. It provides an objective criterion for judging the MTTF of a product with a predetermined statistical risk. The chart, which is based on four parameters relating to the upper bound of the MTTF, the minimum acceptable MTTF, and the risks to both consumer and producer, is divided into three areas: accept, reject, or continue testing. When the performance line crosses into the accept region, the test samples' MTTF is at least equal to the minimum acceptable MTTF (with the predetermined risk of error), and the design should be accepted. If the performance line crosses into the reject region, the MTTF of the design is probably lower than the acceptable minimum with its corresponding probability of error; testing should be suspended until the design has been sufficiently improved and it can reasonably be expected to pass the test.

[Figure 1-2. Reliability demonstration chart for monitoring the progress of a design maturity test: cumulative failures plotted against unit test hours, with reject, continue-testing, and accept regions. From data in von Alven, 1964.]

The DMT is a time-consuming, costly process, as illustrated in Chapter 4. Many manufacturers are replacing it by a reliability growth test, as described in Chapter 4.
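The text does not give the construction of the demonstration chart. One common basis for such accept/reject/continue charts is Wald's sequential probability ratio test for exponential lifetimes; the sketch below uses that formulation with hypothetical parameter values (theta0, theta1, alpha, beta are assumptions, not numbers from the book):

```python
# Hedged sketch of a sequential MTTF demonstration test (Wald SPRT for exponential
# lifetimes).  The four chart parameters match those named in the text: an upper test
# MTTF, a minimum acceptable MTTF, and producer's / consumer's risks.
import math

theta0 = 2000.0   # upper test MTTF, hours (hypothetical)
theta1 = 1000.0   # minimum acceptable MTTF, hours (hypothetical)
alpha = 0.10      # producer's risk
beta = 0.10       # consumer's risk

accept_bound = math.log(beta / (1 - alpha))    # below this: accept the design
reject_bound = math.log((1 - beta) / alpha)    # above this: reject the design

def decide(unit_test_hours: float, failures: int) -> str:
    """Classify the current (unit test hours, failures) point on the chart."""
    llr = failures * math.log(theta0 / theta1) - unit_test_hours * (1 / theta1 - 1 / theta0)
    if llr >= reject_bound:
        return "reject"
    if llr <= accept_bound:
        return "accept"
    return "continue testing"

# Hypothetical trajectory of cumulative unit test hours and observed failures:
for hours, fails in [(1000, 1), (3000, 2), (6000, 2), (9000, 3)]:
    print(hours, fails, decide(hours, fails))
```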

Incoming Inspection. Incoming inspection is an attempt to cull weak or defective components prior to assembly or fabrication into subsystems, as shown in Figure 1-3. All semiconductor processes yield a certain number of defective devices. Even after the semiconductor manufacturer has detected and removed these defective devices, failures will continue to occur for a time known as the infant mortality period. This period is typically 20 weeks or fewer, during which the rate of failures continues to decline. At the end of this period, failures tend to stabilize at a constant rate for a long time, sometimes 25 years or more. Ultimately the failure rate begins to rise again, in a period known as the wear-out period. This variation in failure rate as a function of time is illustrated by the bathtub-shaped curve shown in Figure 1-4. As shown in Figure 1-5, the failure rate can be considered to be the sum of three factors: (1) infant mortality, which decreases with time; (2) steady-state stress, which is constant with time; and (3) wear-out, which increases with time. Chapter 2 describes the Weibull model for estimating the impact of infant mortality failures during early product life.

[Figure 1-3. Typical steps in the manufacture of a digital system: incoming component inspection; printed circuit board fabrication; board assembly; backplane assembly; printed circuit board test; board test; backplane inspection and functional test; system assembly; system test. From Foley, 1979; copyright 1979 IEEE.]

[Figure 1-4. Bathtub-shaped curve depicting component failure rate as a function of time: infant mortality period (approximately 20 weeks), normal lifetime (5 to 25 years), and wear-out period.]

[Figure 1-5. Factors that contribute to the failure rate of a component over time.]

The cost of component failure depends upon the level at which the failure is detected: the higher the level, the more expensive the repair. Fault detection at the semiconductor component level minimizes cost. Fault detection at the next highest level, the board, has been estimated at $5; at the system test level, $50; and at the field service level, $500 [Russell, 1980]. The level at which the incoming test program detects initial and infant mortality failures is a function of the test program a computer manufacturer has chosen.

Example. Even relatively low semiconductor failure rates can cause substantial board yield problems, which are aggravated by the density of the board. Consider a board with 40 semiconductor devices that have an initial failure rate of 1 percent:

    Probability board not defective = (0.99)^40 = 0.669

The benefits of an incoming inspection program can be easily quantified. The value of culling bad semiconductor components before they are inserted into the board is the most easily measured benefit. Board/system test savings, inventory reduction, and service personnel savings depend on the particular strategy used. To calculate the value of removing defective components at incoming inspection, multiply the number of bad parts found by the cost of detecting, isolating, and repairing failures at higher levels of integration. The following formula estimates the total savings:

    D = 5B + 50S + 500F

where

    D = dollar savings
    B = number of failures at board test level
    S = number of failures at system test level
    F = number of failures in the field

This formula can be translated into annual savings by considering total component volume and mean failure rate data:

    Potential annual savings = annual component volume
        x [(% initial failures)(% failures detected at board level x $5
              + % failures detected at system level x $50
              + % failures detected in the field x $500)
           + (% infancy failures)(% failures detected at system level x $50
              + % failures detected in the field x $500)]

Typical savings for 100 percent incoming inspection can be estimated and compared with the cost of the automatic test equipment required to carry out such testing. Figure 1-6 shows the potential annual savings as a function of annual component volumes. A family of curves is shown for overall failure rates of 0.8, 1.2, 2.0, and 4.0 percent.

[Figure 1-6. Potential annual savings from screening and testing as a function of annual component volume (in thousands), for total failure rates of 0.8, 1.2, 2.0, and 4.0 percent, compared with ATE total costs and facility yearly operating cost. From Craig, 1980.]
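A small illustrative calculation of the board-yield figure and the savings formula given above; the failure counts passed to the savings function are hypothetical, while the $5/$50/$500 repair costs are the levels quoted in the text:

```python
# Board yield with 40 devices at a 1 percent initial failure rate, and
# the incoming-inspection savings formula D = 5B + 50S + 500F.
devices_per_board = 40
initial_failure_rate = 0.01

board_yield = (1 - initial_failure_rate) ** devices_per_board
print(f"Probability board not defective = {board_yield:.3f}")   # about 0.669

def savings(board_failures: int, system_failures: int, field_failures: int) -> float:
    """Dollars saved by culling defective parts at incoming inspection."""
    return 5 * board_failures + 50 * system_failures + 500 * field_failures

# Hypothetical counts of defective parts that would otherwise escape to each level:
print(f"Savings = ${savings(board_failures=200, system_failures=30, field_failures=5):,.0f}")
```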

Process Maturity Testing. The term process includes all manufacturing steps to acquire parts, assemble, fabricate, inspect, and test a product during volume production. The rationale for process maturity testing (PMT) is that newly manufactured products contain some latent defects built in by the process that produced them. A large number of units, usually the first 120 off the production line, are operated for 96 hours, often in lot sizes convenient to the particular production process. They are operated (burned in) in a manner that simulates the normal production process environment as closely as possible. If the burn-in and production process environments differ significantly, appropriate test results must be adjusted accordingly. Infant mortality characteristics may fluctuate significantly throughout the test lot. The composite of these individual failure characteristics is considered the normal infancy for the device. The end of the burn-in period for production equipment is determined by the normal infancy curve thus derived from the PMT. The objective is to ship products of consistently good quality and acceptable MTTF after a minimum burn-in period. Typical production burn-in times are 20 to 40 hours.

PMT is used to identify several classes of failures. Infancy failures are generally caused by parts that were defective from the time they were received. In largely solid-state devices, component problems will remain in this category until they are identified and controlled by either incoming inspection or changes implemented by the component vendor. Manufacturing/inspection failures are generally failures repaired by readjustments or retouching. Examples include parts damaged by the assembly process or defects that bypassed the normal incoming test procedures. Engineering failures are recurrent problems in the design that have not yet been corrected or new problems that have not yet been resolved because of lack of experience. Residual failures are problems that have not yet recurred and for which there is no corrective action except to repair them when they occur. These are the truly random failures.

Experience has shown that the three major recurring problems usually account for 75 percent of all failures. It is reasonable to expect that the correction of the top four to six recurring problems will yield a tenfold improvement in MTTF. The current trend is to have the manufacturing line produce the DMT units, so that the data derived during DMT can be used to identify and remove process-related defects. In this case PMT is redundant and unnecessary.

The Operational Life Stage

Over the years, with the accumulation of experience in the manufacture of semiconductor components, the failure rate per logic device has steadily declined. Figure 1-7 depicts the number of failures per million hours for bipolar technology as a function of the number of gates on a chip. The Mil Model 217A curves were derived from 1965 data. The curves for Mil Models 217B, 217C, 217D, and 217E (see Appendix E) were generated from 1974, 1979, 1982, and 1986 reliability prediction models, respectively. Actual failure data are also plotted to calibrate the Mil models. The curve labeled field data was derived from a year-long reliability study of a sample of video terminals [Harrahy, 1977]. The curve labeled life cycle data was derived from elevated temperature testing of chips, followed by the application of a mathematical model that translated the failure rates to ambient temperatures [Siewiorek et al., 1978b]. Finally, the improvement in the 3000-gate Motorola MC 6800 is plotted [Queyssac, 1979]. In general, the Mil Model 217 is conservative, especially with respect to large-scale integration (LSI) and random-access memory (RAM) chips. See Chapter 2 for a more detailed discussion.

Two trends are noteworthy. First, there is more than an order of magnitude decrease in failure rate per gate. Plots of failure per bit of bipolar random-access memory indicate that the failure rates per gate and per bit are comparable for comparable levels of integration. Obviously, the chip failure rate is a function of chip complexity and is not a constant. Failure rate per function (gate or bit) decreases by one order of magnitude over two orders of magnitude of gate complexity and by two to three orders of magnitude of memory complexity. The failure rate does not decrease in direct proportion to increases in complexity.

The second trend is that the failure rate predicted by the Mil model decreases with time. Each model predicts an increase in failure rate per function beyond a particular complexity, presumably because of the immaturity of the fabrication process at that scale of integration at that time.*

*The switch from a polynomial to an exponential function in number of gates occurs at 100 in 217B and 1000 in 217C, reflecting the improvements in the fabrication process over time.

[Figure 1-7. Failure rate per gate as a function of chip complexity for bipolar technology, showing the Mil Model 217A (1965) and later Mil model curves together with field and life-cycle data.]

When the expected count in a category is smaller than 5, it may be necessary to pool categories. A reasonable level of confidence is 0.05.

Example 1. Data are collected from the file system of a time-sharing system about the transient faults in 8 disk drives in an effort to discover whether the time between transient errors follows an exponential distribution. The estimated value of λ_total is 0.1344 (time in minutes), corresponding to an MTBF of about 7 minutes. The number of observed errors is 877 in a 5-day interval. Table 2-12a shows the observed errors by division into time categories and the expected number of errors in each time category according to an exponential distribution. For instance, the first row in the table means that 548 errors were observed with times between errors of 0 to 5 minutes, while an exponential distribution with λ = 0.1344 gives the expected number of errors in that range as 429.20 (given that the total number of failures is 877). The remaining categories have to be pooled until no E_i is smaller than 5. The result of this operation is shown in Table 2-12b. The number of degrees of freedom is m = 8 - 1 - 1 = 6 because there are eight different categories, and one parameter (λ) has been estimated from the data. For 6 degrees of freedom, χ²_0.05 = 12.592. Since χ² > χ²_0.05, the hypothesis of an exponential distribution must be rejected.


TABLE 2-12 Data on transient faults for the time-sharing file system (Example 1)

a. Collected Data. Observed and expected errors in 5-minute categories: 0-5 (548 observed, 429.20 expected), 5-10 (148, 219.15), 10-15 (63, 111.89), 15-20 (35, 57.13), 20-25 (28, 29.17), 25-30 (18, 14.89), 30-35 (12, 7.60), followed by 5-minute categories extending beyond 100 minutes with progressively smaller observed counts and expected values (3.88, 1.98, 1.01, 0.52, 0.26, ... down to less than 0.001).

b. Pooled Categories (no E_i smaller than 5)

Time Category (mins)   Observed Errors, O_i   Expected Errors, E_i   f_i = (O_i - E_i)^2/E_i
0-5                    548                    429.20                 32.88
5-10                   148                    219.15                 23.10
10-15                   63                    111.89                 21.36
15-20                   35                     57.13                  8.57
20-25                   28                     29.17                  0.04
25-30                   18                     14.89                  0.64
30-35                   12                      7.60                  2.53
35-∞                    25                      7.93                 36.74
                                                          Total χ² = 125.86
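The pooling-and-χ² arithmetic in Example 1 is straightforward to mechanize. The following Python sketch is illustrative only: it assumes the pooled category boundaries and observed counts of Table 2-12b, recomputes the expected counts from the fitted exponential distribution, and forms the χ² statistic; small differences from the tabulated values are due to rounding.

```python
import math

# Pooled observed counts from Table 2-12b (transient faults, Example 1)
observed = [548, 148, 63, 35, 28, 18, 12, 25]
# Category boundaries in minutes; the last category is open-ended (35 to infinity)
bounds = [0, 5, 10, 15, 20, 25, 30, 35, math.inf]
lam = 0.1344        # estimated rate parameter, per minute
n = sum(observed)   # total number of observed errors (877)

def expo_cdf(t, lam):
    """CDF of the exponential distribution, F(t) = 1 - exp(-lam*t)."""
    return 1.0 if math.isinf(t) else 1.0 - math.exp(-lam * t)

# Expected count in each category: n * [F(upper) - F(lower)]
expected = [n * (expo_cdf(hi, lam) - expo_cdf(lo, lam))
            for lo, hi in zip(bounds, bounds[1:])]

# Chi-square statistic and degrees of freedom
# (number of categories - 1 - number of estimated parameters)
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
dof = len(observed) - 1 - 1
print(f"chi-square = {chi2:.2f} with {dof} degrees of freedom")
# The 0.05 critical value for 6 degrees of freedom is 12.592;
# a statistic this large rejects the exponential hypothesis.
```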

Example 2. The times between crashes of a time-sharing system (see Table 2-13) have been recorded for one month of system operation. The goal is to find out whether the distribution of time between crashes follows a Weibull distribution. The maximum likelihood estimates of the Weibull parameters are λ = 0.0888 and α = 0.98 (time units in hours), corresponding to a mean time between crashes of about 11 hours. Table 2-13a gives the observed counts in several ranges of time between crashes. After pooling categories so that no E_i is smaller than 5, Table 2-13b is obtained. The number of degrees of freedom is m = 9 - 2 - 1 = 6. For a χ² random variable with 6 degrees of freedom, χ²_0.05 = 12.592. Because χ² < χ²_0.05, the hypothesis that the distribution of the time to crash is a Weibull is accepted.
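A corresponding sketch for Example 2 follows (again illustrative only). It assumes the Weibull distribution in the form F(t) = 1 - exp(-(λt)^α) and a total of 60 observed crashes, the column sum of the pooled table; because the quoted parameter values are rounded, the computed expected counts come out close to, but not exactly equal to, the tabulated E_i.

```python
import math

def weibull_cdf(t, lam=0.0888, alpha=0.98):
    """Weibull CDF in the assumed form F(t) = 1 - exp(-(lam*t)**alpha)."""
    return 1.0 if math.isinf(t) else 1.0 - math.exp(-((lam * t) ** alpha))

# Pooled category boundaries (hours) from Table 2-13b; last category is open-ended
bounds = [0, 2, 4, 6, 8, 11, 15, 20, 28, math.inf]
n_crashes = 60  # total crashes observed in the month (sum of the O_i column)

expected = [n_crashes * (weibull_cdf(hi) - weibull_cdf(lo))
            for lo, hi in zip(bounds, bounds[1:])]
print([round(e, 2) for e in expected])
# The chi-square statistic is formed exactly as in the exponential example.
# With 9 categories and 2 estimated parameters there are 9 - 2 - 1 = 6 degrees
# of freedom; the statistic (about 8) is below 12.592, so the Weibull
# hypothesis is accepted.
```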

Another goodness-of-fit statistical test is the Kolmogorov-Smirnov test. The Kolmogorov-Smirnov test has been developed for known parameters or for the exponential distribution [Lilliefors, 1969]. If the parameters of the distribution are estimated from the experimental data or the distribution is not exponential, the Kolmogorov-Smirnov test may give extremely conservative results.

DISTRIBUTION MODELS FOR PERMANENT FAULTS: THE MIL-HDBK-217 MODEL

The Reliability Analysis Center has extensively studied statistics on electronic component failures. The data have led to the development of a widely used reliability model of chip failures, presented in MIL-HDBK-217,* which is periodically updated, starting with 217A in 1965 and progressing to model 217E of 1986. The component failure data in this section is compared to the MIL-HDBK-217 model that was current at the time of the data collection.

* A more detailed explanation of the model is found in Appendix D.


TABLE 2-13 Data on time between crashes in one month for the time-sharing file system (Example 2)

a. Collected Data. Observed counts in narrow time categories (0-1, 1-2, ..., up through 29-38 and 38-75 hours), each with a small number of crashes.

b. Pooled Categories (no E_i smaller than 5)

Time Category (hrs)   Observed Errors, O_i   Expected Errors, E_i   f_i = (O_i - E_i)^2/E_i
0-2                    9                      9.97                   0.09
2-4                    7                      8.17                   0.16
4-6                   12                      6.79                   3.97
6-8                    2                      5.67                   2.37
8-11                   9                      6.80                   0.70
11-15                  5                      6.66                   0.41
15-20                  5                      5.61                   0.06
20-28                  6                      5.14                   0.14
28-∞                   5                      5.13                   0.003
                                                           Total χ² = 7.95

For MIL-HDBK-217E, reliability is assumed to be an exponential distribution with the failure rate for a single chip taking the form

λ = π_L π_Q (C_1 π_T π_V + C_2 π_E)

where

π_L = learning factor, based on the maturity of the fabrication process (assumes a value of 1 or 10)
π_Q = quality factor, based on incoming screening of components (values range from 0.25 to 20)
π_T = temperature factor, based on the ambient operating temperature and the type of semiconductor process (values range from 0.1 to 1000)
π_E = environmental factor, based on the operating environment (values range from 0.38 to 220)
π_V = voltage stress derating factor for CMOS devices (values range from 1 to over 10 as a function of supply voltage and temperature; value is 1 for other technologies)
C_1, C_2 = complexity factors, based on the number of gates (for random logic) or bits (for memory) in the component and the number of pins
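The way these factors combine can be shown with a short sketch. The factor values below are hypothetical placeholders chosen from within the ranges just quoted, not entries from the handbook tables:

```python
def mil_hdbk_217_lambda(pi_l, pi_q, pi_t, pi_v, pi_e, c1, c2):
    """Single-chip failure rate in the MIL-HDBK-217E form
    lambda = pi_L * pi_Q * (C1 * pi_T * pi_V + C2 * pi_E)."""
    return pi_l * pi_q * (c1 * pi_t * pi_v + c2 * pi_e)

# Hypothetical factor values, each within the ranges given in the text:
lam = mil_hdbk_217_lambda(
    pi_l=1.0,    # mature fabrication process
    pi_q=2.0,    # moderate incoming screening
    pi_t=0.5,    # benign ambient temperature
    pi_v=1.0,    # non-CMOS technology
    pi_e=1.0,    # benign operating environment
    c1=0.01,     # die complexity factor (hypothetical)
    c2=0.005,    # package/pin complexity factor (hypothetical)
)
print(f"lambda = {lam:.4f} failures per million hours")  # about 0.02 here
```

With these placeholder values the result falls in the 0.01-1.0 failures per million hours range quoted below as typical for components.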

Since new component types are continually being introduced and because the learning curve for any component type changes as field experience accumulates, there is some question of the accuracy of this MIL-HDBK-217 model, particularly with regard to rapidly changing technologies such as MOS RAMs and ROMs.

Typical component failure rates are in the range of 0.01-1.0 per million hours. Thus, tens of millions of component hours are required to gain statistically significant results. Two separate approaches can be used to gather sufficient data for comparison


with the MIL-HDBK-217 model: life-cycle testing of components and analysis of field data on failure rates.

Life-Cycle Testing

Life-cycle testing involves a small number of components in a controlled environment. Frequently, temperature is elevated to accelerate failure mechanisms. An acceleration factor is then used to equate 1 hour at elevated temperature to a number of hours at ambient temperature. The acceleration factor is usually derived from the Arrhenius equation:

R = A e^(-Ea/kT)

where

R = reaction rate constant
A = a constant
Ea = activation energy in electron-volts
k = Boltzmann's constant
T = absolute temperature

Example. A life-cycle test is designed whereby components will be heated to 125°C. We want to know how many hours at 25°C ambient will be represented by each hour at 125°C. Let R(125) be the reaction rate at 125°C and R(25) the reaction rate at 25°C. The temperature acceleration factor is thus R(125)/R(25). To convert from °C to K, we add 273. Using the Arrhenius equation for R, we get

R(125)/R(25) = [A e^(-Ea/398k)] / [A e^(-Ea/298k)]

In MIL-HDBK-217B, Ea is assumed to be 0.7 eV for MOS devices. Boltzmann's constant is 0.8625 x 10^-4 eV/K. Thus,

R(125)/R(25) = e^[(-0.7/(0.8625 x 10^-4))((1/398) - (1/298))] = e^6.843 = 937

Hence 1 hour at 125°C is equivalent to 937 hours at 25°C.
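The acceleration-factor calculation generalizes readily to other temperatures and activation energies. The following sketch is illustrative only; it uses the Arrhenius form and the value of Boltzmann's constant given above:

```python
import math

BOLTZMANN_EV_PER_K = 0.8625e-4  # Boltzmann's constant in eV/K, as used in the text

def acceleration_factor(t_test_c, t_ambient_c, ea_ev):
    """Arrhenius acceleration factor R(T_test)/R(T_ambient), temperatures in Celsius.

    R = A * exp(-Ea / (k*T)), so the constant A cancels in the ratio.
    """
    t_test_k = t_test_c + 273.0
    t_ambient_k = t_ambient_c + 273.0
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) *
                    (1.0 / t_ambient_k - 1.0 / t_test_k))

# MOS devices per MIL-HDBK-217B: Ea = 0.7 eV
print(round(acceleration_factor(125, 25, 0.7)))   # about 937, as in the example
# Bipolar devices per MIL-HDBK-217B: Ea = 0.41 eV
print(round(acceleration_factor(125, 25, 0.41)))  # about 55; the graph in Figure 2-9
                                                  # reads roughly 60
```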

Because of the exponential in the Arrhenius equation, accelerating factors can become quite large. They can, however, be plotted as a graph for easy look-up. Figure 2-9 depicts acceleration factors for various activation energies. For the life-cycle test example, curve 3 applies. The acceleration factor for any two temperatures is found by taking the ratio of the two acceleration factors with respect to 25°C. For example, 1 hour at 125°C for an activation energy of 0.41 eV is equal to about 10 hours at 85°C (900/90 = 10). Consider conversion of time at 125°C to time at 50°C. MIL-HDBK-217B for bipolar devices assumes an activation energy of 0.41 eV (curve 2 in Figure 2-9).

FIGURE 2-9 Graph of failure rate acceleration factors, plotted against temperature in degrees Centigrade and reciprocal absolute temperature (Kelvin x 10^-3) [From Thielman, 1975; reprinted by permission of Signetics]

NOTES: The curves correspond to different assumed activation energies. One curve is calculated from the Signetics failure rate vs. temperature graph [Signetics, 1975], which uses acceleration factors of 15 (+85°C), 100 (+150°C), 200 (+175°C), 350 (+200°C), 970 (+250°C), and 2100 (+300°C) relative to a +25°C equivalent ambient temperature and equates to an "activation energy" Ea = 0.41 eV. A curve calculated from MIL-HDBK-217B (20 September 1974) equates to Ea = 0.41 eV and applies to all bipolar digital devices (except ECL) in the normal mode of operation; another curve, Ea = 0.70 eV, applies to all linear, all MOS, and bipolar ECL devices. A curve calculated from MIL-STD-883A (15 November 1974) equates to Ea = 1.02 eV. A curved plot shows the "rule of thumb" that failure rates (hence acceleration factors) double every 10°C. All competitor data available to Signetics fell within two boundaries, which equate to Ea = 0.23 eV (lower bound) and Ea = 1.92 eV (upper bound).


The acceleration factor at 125°C is approximately 60, while at 50°C it is about 4 (relative to 25°C). Thus, the effective acceleration is 60/4, or 15. For MIL-STD-883A (used to qualify components for procurement) an activation energy of 1.02 eV is assumed (curve 4 in Figure 2-9). The effective acceleration factor is 1,000 (20,000/20 = 1,000). The accelerating factors differ by over a factor of 60.

The Arrhenius equation assumes only one activation energy, and the reaction rate is assumed to be a uniform function of temperature. Assuming a straight line (on a semilog scale) can result in substantial errors. Figure 2-10 illustrates the nonlinear behavior. Consider the three test points, 150°C, 125°C, and 85°C. Drawing a best-fit straight line through these points in Figure 2-10 on the 1970 curve yields a failure rate of about 0.0002 at 25°C, whereas the 25°C observed point is 0.0013, too low by a factor of 7. The same three points on the 1975 curve suggest a failure rate of 0.06 instead of 0.0017, which is too high by a factor of 35. In summary, data from accelerated life-cycle testing must be reviewed carefully,

FIGURE 2-10 Nonlinear plots of failure rate acceleration factors; failure rate versus temperature (85°C to 200°C) for curves from different years [From Thielman, 1975; reprinted by permission of Signetics]

where s is the block length. When one column or two adjacent columns have unidirectional errors, Usas derived expressions for the total number of possible undetectable errors, U_1col and U_2col, as functions of the column width b and the block length s. With these formulas, Usas showed the low-cost residue code to be superior to the single-precision checksum.

Arithmetic Codes. An arithmetic code, A, has the property that A(b * c) = A(b) * A(c), where b and c are noncoded operands, * is one of a set of arithmetic operations (such as addition and multiplication), and A(x) is the arithmetic code word for x. Thus, the set of code words in A is closed with respect to a specific set of arithmetic operations. Such a code can be used to detect or correct errors and to check the results of


arithmetic operations.* Some operations (such as logical operations), however, cannot be checked by arithmetic codes and must be performed on unencoded operands. This section provides an introduction to three classes of arithmetic codes: AN, residue-m, and inverse residue-m arithmetic codes. Appendix B, a paper by Avizienis [1971], examines the three classes in detail, and other sources of information are Rao [1974]; Sellers, Hsiao, and Bearnson [1968b]; and Avizienis [1973].

The simplest arithmetic codes are the AN codes. These codes are formed by multiplying the data word by a number that is not a power of the radix of the representation (such as 2 for binary). The redundancy is determined by the multiplier chosen, called the modulus. AN codes are invariant with respect to unsigned arithmetic. If the code chosen has A = 2^a - 1 and a length that is a multiple of a bits, it is also invariant (using one's-complement algorithms) with respect to the operations of addition and left and right arithmetic shifting. Additionally, complementation and sign detection are the same [Avizienis, 1973].

An example of a single-error-detecting AN code is the 3N code. An n-bit word is encoded simply by multiplying it by 3. This adds at most 2 bits of redundancy, and encoding can be done quickly and inexpensively in parallel with an (n + 1)-bit adder (Figure 3-18). Error checking is performed by confirming that the received word is evenly divisible by 3, and can be accomplished with a relatively simple combinational logic decoder. Although there is one more bit than in bit-per-word parity for roughly the same coverage, the operation of other system functions (such as ALU and address calculations) can be checked. The hardware cost is a (2/n) x 100 percent memory element increase, an (n + 1)-bit adder for encoding, a combinational decoding circuit, and extra control circuitry. The delay on reads results from a small number of gate delays, and on writes from the delay of the adder. Avizienis [1973] presents algorithms for operations involving AN codes, and discusses in detail the design of a 15N code arithmetic processing unit used in an early version of the JPL-STAR computer (see Avizienis et al. [1971]).
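A minimal sketch of 3N encoding and checking follows (illustrative only; a hardware encoder would use the (n + 1)-bit adder described above, since 3N = N + 2N, rather than a general multiplier):

```python
def encode_3n(data_word: int) -> int:
    """Encode a data word in the 3N code.  In hardware this is the sum
    data + (data << 1), computed by an (n + 1)-bit adder."""
    return 3 * data_word

def check_3n(code_word: int) -> bool:
    """A received word is accepted only if it is evenly divisible by 3."""
    return code_word % 3 == 0

word = 0b1011010
code = encode_3n(word)
assert check_3n(code)

# Any single-bit error changes the word by +/- 2**i, which is never a
# multiple of 3, so every single-bit error is detected:
for i in range(code.bit_length() + 1):
    assert not check_3n(code ^ (1 << i))
```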

Residue codes are a class of separable arithmetic codes. In the residue-m code, the residue of a data word N is defined as R(N) = N mod m. The code word is formed by concatenating N with R(N) to produce N|R (the vertical bar denotes concatenation). The received word N'|R' is checked by comparing R(N') with R'. If they are equal, no error has occurred. Figure 3-19 is a block diagram of a residue-m code arithmetic unit.

A variant of the residue-m code is the inverse residue-m code. The separate check quantity, Q, is formed as Q = m - (N mod m). The inverse residue code has greater coverage of repeated-use faults than does the residue code. A repeated-use fault occurs when a chain of operations is performed sequentially on the same faulty hardware before checking is performed. For example, iterative operations such as multiplication and division are subject to repeated-use faults. Both the residue-m and inverse residue-m codes can be used with either one's-complement or two's-complement arithmetic.

* Other codes are not invariant with respect to arithmetic operations. For some separable linear codes other than arithmetic codes, the check symbol portion of the result can be produced by a prediction circuit. Usually such circuits are complex. Wakerly [1978] details check symbol prediction for parity-check codes and checksum codes.


FIGURE 3-18 Simple encoder for 3N single-error-detecting arithmetic code (an n-bit data word feeding an (n + 1)-bit adder)

The JPL-STAR computer [Avizienis et al., 1971] uses an inverse residue-15 code. Elsewhere, Avizienis [1973] describes the adaptation of 2's-complement arithmetic for use with an inverse residue code.

In both the AN and residue codes, the detection operations can be complex, except when the check moduli (A for AN codes, m for residue-m codes) are of the form 2^a - 1. The check operation in this case can be performed using an a-bit adder with end-around carry, serially adding a-bit bytes of the data word (or code word for AN codes) [Avizienis, 1971, 1973]. In effect, this operation performs the division of the word by the check modulus. The operation can also be implemented in a faster, parallel fashion. Arithmetic codes with check moduli of this form are called low-cost arithmetic codes.
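The low-cost check operation is easy to sketch in software (illustrative only). The function below computes a residue modulo 2^a - 1 by adding a-bit bytes with end-around carry, as described above, and uses it to check a separable residue code word N|R with m = 2^a - 1:

```python
def low_cost_residue(word: int, a: int) -> int:
    """Residue of `word` modulo 2**a - 1, computed by serially adding a-bit
    bytes with end-around carry, as in the low-cost check described above."""
    modulus = (1 << a) - 1
    acc = 0
    while word:
        acc += word & modulus                 # add the next a-bit byte
        word >>= a
        acc = (acc & modulus) + (acc >> a)    # fold the carry back in (end-around carry)
    return 0 if acc == modulus else acc       # the all-ones pattern represents zero

def check_residue_code(n_prime: int, r_prime: int, a: int) -> bool:
    """Check a received residue code word N'|R' with m = 2**a - 1 by
    recomputing R(N') and comparing it with the received check symbol R'."""
    return low_cost_residue(n_prime, a) == r_prime

# Example with m = 15 (a = 4), the modulus of the inverse residue-15 code
# mentioned above for the JPL-STAR computer:
n = 0b1101_0110_0011
r = n % 15
assert low_cost_residue(n, 4) == r
assert check_residue_code(n, r, 4)
assert not check_residue_code(n ^ (1 << 5), r, 4)   # a single-bit error is caught
```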

FIGURE 3-19 Block diagram of an arithmetic unit using residue-m code (the residues of the operands, R(N1) and R(N2), are processed in a residue generator in parallel with the arithmetic function, and the result is compared with the residue of the arithmetic result, R(N1 * N2))


Cyclic Codes. In cyclic codes, any cyclic (end-around) shift of a code word produces another code word. Cyclic codes are easily implemented using linear-feedback shift registers, which are made from XOR gates and memory elements. These codes find frequent (though not exclusive) use in serial applications such as sequential-access devices (tapes, bubble memories, and disks) as well as data links. Sometimes encoding is performed independently and in parallel over several serial-bit streams, as in a multiple-wire bus. The bits of each byte are transmitted simultaneously. The cyclic redundancy check (CRC) check bits for each bit stream are generated for the duration of the block transmission and are appended to the end of the block.

In discussions of cyclic codes, the term (n,k) code is often used. In this expression, n is the number of bits in the entire code word, while k is the number of data bits. Thus, in an (n,k) separable code there are (n - k) check bits concatenated with the data bits to form the code words. The (n,k) cyclic codes can detect all single errors in a code word, all burst errors (multiple adjacent faults) of length b ≤ (n - k), and many other patterns of errors, depending on the particular code. A cyclic code is uniquely and completely characterized by its generator polynomial G(X), a polynomial of degree (n - k) or greater, with the coefficients either 0 or 1 for a binary code. This section introduces some of these codes; a complete discussion of these and other polynomial-based codes can be found in Tang and Chien [1969] and Peterson and Weldon [1972].

CRC Codes. Given the check polynomial G(X) for an (n - k) separable code, a linear-feedback shift register encoder/decoder for the CRC codes can be easily derived.* The block check register (BCR) contains the check bits at the end of the encoding process, during which the data bits have been simultaneously transmitted and fed to the input of the BCR. The BCR is an r-bit shift register, where r = (n - k), the degree of G(X). In Figure 3-20, the register shifts to the right, and its memory cells are labeled (r - 1), (r - 2), ..., 1, 0, from left to right. The shift register is broken to the right of each cell i, where i = (r - j) and j is the degree of a nonzero term in G(X). At each of these points, an XOR gate is inserted, and the gate output is connected to the input of the cell on the right side of the break. The output of the gate to the right of cell 0 is connected to the input of the leftmost memory cell (cell r - 1) and to one of the inputs of each of the other gates. The remaining input of each gate is connected to the output of the memory cell to its left. The second input of the rightmost gate is connected to the serial data input. The result is a feedback path, whose value is the XOR of BCR bit 0 and the current data bit. Figure 3-20 thus shows the BCR for a cyclic code with

G(X) = X^12 + X^11 + X^3 + X^2 + X + 1

This CRC-12 code is often used with 6-bit bytes of data because the check bits fit evenly into two 6-bit bytes. The XOR gates are placed to the right of the five shift

* The following discussion is
Recommended