A Dissertation entitled
A Cost Effective Methodology for Quantitative Evaluation of
Software Reliability using Static Analysis
by
Walter W. Schilling, Jr.
Submitted as partial fulfillment of the requirements for the Doctor of Philosophy in Engineering
Advisor: Dr. Mansoor Alam
Graduate School
The University of Toledo
December 2007
The University of Toledo
College of Engineering
I HEREBY RECOMMEND THAT THE DISSERTATION PREPARED UNDER MY
SUPERVISION BY Walter W. Schilling, Jr.
ENTITLED A Cost Effective Methodology for Quantitative Evaluation of Software
Reliability using Static Analysis
BE ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR
THE DEGREE OF DOCTOR OF PHILOSOPHY IN ENGINEERING
_____________________________________________________________________
Dissertation Advisor: Dr. Mansoor Alam
Recommendation concurred by:
_______________________________________ Dr. Mohsin Jamali
_______________________________________ Dr. Vikram Kapoor
_______________________________________ Dr. Henry Ledgard
_______________________________________ Dr. Hilda Standley
_______________________________________ Mr. Michael Mackin
_______________________________________ Mr. Joseph Ponyik
_____________________________________________________________________
Dean, College of Engineering
Committee on Final Examination
Copyright © 2007
All Rights Reserved. This document is copyrighted material. Under copyright
law, no parts of this document may be reproduced, stored in a retrieval system,
or transmitted in any form or by any means, electronic, mechanical, photocopying,
recording, or otherwise without prior written permission of the author.
Typeset using the LaTeX documentation system with the MiKTeX package developed
by Christian Schenk.
All trademarks are the property of their respective holders and are hereby ac-
knowledged.
An Abstract of
A Cost Effective Methodology for Quantitative Evaluation of Software
Reliability using Static Analysis
Walter W. Schilling, Jr.
Submitted in partial fulfillment of the requirements for the
Doctor of Philosophy in Engineering
The University of Toledo
December 2007
Software reliability represents an increasing risk to overall system reliability. As
systems have become larger and more complex, mission critical and safety critical
systems have increasingly had functionality controlled exclusively through software.
This change has resulted in a shift of the root cause of systems failure from hardware
to software. Market forces have encouraged projects to reuse existing software as well
as purchase COTS solutions. This has made the usage of existing reliability models
difficult. Traditional software reliability models require significant testing data to be
collected during development. If this data is not collected in a disciplined manner
or is not made available to software engineers, these modeling techniques can not be
applied. It is imperative that practical reliability modeling techniques be developed
to address these issues. This dissertation puts forth a practical method for estimating
software reliability.
The proposed software reliability model combines static analysis of existing source
code modules, functional testing with execution path capture, and a series of Bayesian
Belief Networks. Static analysis is used to detect faults within the source code which
may lead to failure. Code coverage is used to determine which paths within the source
code are executed as well as the execution rate. The Bayesian Belief Networks combine
these parameters and estimate the reliability for each method. A second series of
Bayesian Belief Networks then combines the data for each method to determine the
overall reliability for the system.
In order to use this model, the SOSART tool is developed. This tool serves as
a reliability modeling tool and a bug finding meta tool suitable for comparing the
results of different static analysis tools.
Verification of the model is presented through multiple experimental instances.
Validation is first demonstrated through the application to a series of Open Source
software packages. A second validation is provided using the Tempest Web Server,
developed by NASA Glenn Research Center.
Dedication
I would like to dedicate this dissertation to my wife Laura, whose help and assis-
tance have allowed me to persevere through the struggles of doctoral studies.
Acknowledgments
I would like to take this opportunity to express my thanks to the many persons
who have assisted my research as a doctoral student.
First and foremost, I would like to recognize the software companies whom I am
indebted to for the usage of their tools in my research. In particular, this includes
Gimpel Software, developer of the PC-Lint static analysis tool; Fortify Software,
developer of the Fortify SCA security analysis tool; Programming Research Limited,
developer of the QAC, QAC++, and QAJ tools; and SofCheck, developer of the
SofCheck static analysis tools. Through their academic licensing programs, I was able
to use these tools at greatly reduced costs for my research, and without their support,
it would have been virtually impossible to conduct successful experimentation.
Second, I would like to acknowledge the Ohio Supercomputing Center in Colum-
bus, OH. My research included a grant of computing on the supercomputing cluster
in Columbus which was used for reliability research and web reliability calculation.
I am indebted to the NASA Glenn Research Center in Cleveland and the Flight
Software Engineering Branch, specifically Michael Mackin, Joseph Ponyik, and Kevin
Carmichael. Their assistance during my one summer on site proved vital in the
development of this practical and applied software reliability model.
I am also indebted to the Ohio Space Grant Consortium, located in Cleveland,
Ohio. Their doctoral fellowship supported my graduate studies and made it possible
for me to complete this work.
I am also indebted to Dr. Mohsin Jamali, Dr. Vikram Kapoor, Dr. Henry
Ledgard, Dr. Hilda Standley, and Dr. Afzal Upal from the University of Toledo who
at various times served on my dissertation committee. As a last statement, I would
like to express my gratitude to my advisor, Dr. Mansoor Alam.
Contents
Abstract iv
Dedication vi
Acknowledgments vii
Contents ix
List of Figures xiv
List of Tables xvii
1 Introduction and Key Contributions 1
1.1 The Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Key Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Dissertation Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Software Field Failure Case Studies 12
2.1 COTS System Failure . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Embedded System Failures . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Hard Limits Exceeded . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3 Survey of Static Analysis Tools and Techniques 28
3.1 General Purpose Static Analysis Tools . . . . . . . . . . . . . . . . . 31
3.1.1 Commercial Tools . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.2 Academic and Research Tools . . . . . . . . . . . . . . . . . . 37
3.2 Security Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.1 Commercial Tools . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.2 Academic and Research Tools . . . . . . . . . . . . . . . . . . 43
3.3 Style Checking Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.1 Academic Tools . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4 Teaching Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.1 Commercial Tools . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.2 Academic Tools . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4 Proposed Software Reliability Model 48
4.1 Understanding Faults and Failures . . . . . . . . . . . . . . . . . . . 49
4.1.1 Classifying Software Faults . . . . . . . . . . . . . . . . . . . . 51
4.2 What Causes a Fault to Become a Failure . . . . . . . . . . . . . . . 53
4.2.1 Example Statically Detectable Faults . . . . . . . . . . . . . . 53
4.2.2 When Does a Fault Manifest Itself as a Failure . . . . . . . . . 58
4.3 Measuring Code Coverage . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3.1 The Static Analysis Premise . . . . . . . . . . . . . . . . . . . 70
5 Static Analysis Fault Detectability 83
5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.2.1 Basic Fault Detection Capabilities . . . . . . . . . . . . . . . . 88
5.2.2 The Impact of False Positives and Style Rules . . . . . . . . . 91
5.3 Applicable Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6 Bayesian Belief Network 96
6.1 General Reliability Model Overview . . . . . . . . . . . . . . . . . . . 97
6.2 Developed Bayesian Belief Network . . . . . . . . . . . . . . . . . . . 98
6.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.2.2 Confirming Fault Validity and Determining Fault Risk . . . . 99
6.2.3 Assessing Fault Manifestation Likelihood . . . . . . . . . . . . 103
6.2.4 Determining Reliability . . . . . . . . . . . . . . . . . . . . . . 108
6.3 Multiple Faults In a Code Block . . . . . . . . . . . . . . . . . . . . . 110
6.4 Combining Code Blocks to Obtain Net Reliability . . . . . . . . . . . 114
7 Method Combinatorial Network 117
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.2 Problems with Markov Models . . . . . . . . . . . . . . . . . . . . . . 122
7.3 BBNs and the Cheung Model . . . . . . . . . . . . . . . . . . . . . . 123
7.4 Method Combinatorial BBN . . . . . . . . . . . . . . . . . . . . . . . 124
7.5 Experimental Model Validation . . . . . . . . . . . . . . . . . . . . . 127
7.6 Extending the Bayesian Belief Network . . . . . . . . . . . . . . . . . 129
7.7 Extended Network Verification . . . . . . . . . . . . . . . . . . . . . . 131
7.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
8 The Software Static Analysis Reliability Tool 135
8.1 Existing Software Reliability Analysis Tools . . . . . . . . . . . . . . 135
8.2 Sosart Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
8.3 SOSART Implementation Metrics . . . . . . . . . . . . . . . . . . . . 140
8.4 External Software Packages Used within SOSART . . . . . . . . . . . 143
8.5 SOSART Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
8.5.1 Commencing Analysis . . . . . . . . . . . . . . . . . . . . . . 144
8.5.2 Obtaining Program Execution Profiles . . . . . . . . . . . . . 149
8.5.3 Importing Static Analysis Warnings . . . . . . . . . . . . . . . 154
8.5.4 Historical Database . . . . . . . . . . . . . . . . . . . . . . . . 158
8.5.5 Exporting Graphics . . . . . . . . . . . . . . . . . . . . . . . . 161
8.5.6 Printing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8.5.7 Project Saving . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8.5.8 GUI Capabilities . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.5.9 Reliability Report . . . . . . . . . . . . . . . . . . . . . . . . . 164
9 Model Validation using Open Source Software 171
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
9.2 STREW Metrics and GERT . . . . . . . . . . . . . . . . . . . . . . . 172
9.3 Real Estate Program Analysis . . . . . . . . . . . . . . . . . . . . . . 174
9.4 JSUnit Program Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 177
9.5 Jester Program Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 180
9.6 Effort Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
10 Model Validation using Tempest 185
10.1 Tempest Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
10.2 Evaluating Tempest . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
10.2.1 Java Web Tester . . . . . . . . . . . . . . . . . . . . . . . . . 189
10.2.2 Initial Experimental Setup . . . . . . . . . . . . . . . . . . . . 191
10.2.3 Final Experiment Setup . . . . . . . . . . . . . . . . . . . . . 193
10.2.4 Measured Reliability . . . . . . . . . . . . . . . . . . . . . . . 196
10.2.5 Analyzing the Source code for Statically Detectable Faults . . 198
10.2.6 SOSART Reliability Assessment . . . . . . . . . . . . . . . . . 200
11 Conclusions and Future Directions 202
11.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
11.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
Bibliography 210
A Fault Taxonomy 239
B SOSART Requirements 243
B.1 Functional Requirements . . . . . . . . . . . . . . . . . . . . . . . . . 243
B.2 Development Process requirements . . . . . . . . . . . . . . . . . . . 246
B.3 Implementation Requirements . . . . . . . . . . . . . . . . . . . . . . 247
List of Figures
2-1 Source code which caused AT&T Long Distance Outage. . . . . . . . 22
4-1 The Relationship between Faults and Failures. . . . . . . . . . . . . . 51
4-2 Venn Diagram classifying software bugs. . . . . . . . . . . . . . . . . 52
4-3 Source code exhibiting uninitialized variable. . . . . . . . . . . . . . . 54
4-4 source code exhibiting statically detectable faults. . . . . . . . . . . . 55
4-5 PC Lint for buffer overflow.c source file. . . . . . . . . . . . . . . . . 55
4-6 Source exhibiting loop overflow and out of bounds array access. . . . 57
4-7 Source exhibiting statically detectable mathematical error. . . . . . . 63
4-8 GNU gcov output from testing prime number source code. . . . . . . 64
4-9 Control flow graph for calculate distance to next prime number method. . 65
4-10 Source exhibiting uninitialized variable. . . . . . . . . . . . . . . . . . 66
4-11 Control flow graph for do walk method. . . . . . . . . . . . . . . . . . 67
4-12 Source code which determines if a timer has expired. . . . . . . . . . 71
4-13 Translation of timer expiration routine from C to Java. . . . . . . . . 72
4-14 Flowchart for check timer routine.. . . . . . . . . . . . . . . . . . . . 73
4-15 gcov output for functional testing of timer routine. . . . . . . . . . . 77
4-16 Modified timer source code to output block trace. . . . . . . . . . . . 79
4-17 Rudimentary trace output file. . . . . . . . . . . . . . . . . . . . . . . 80
4-18 gdb script for generating path coverage output trace. . . . . . . . . . 81
6-1 BBN for faults, code coverage, and reliability. . . . . . . . . . . . . . 99
6-2 BBN relating two statically detectable faults. . . . . . . . . . . . . . . 111
6-3 BBN Combining four statically detectable faults. . . . . . . . . . . . 115
6-4 BBN Network to combine multiple blocks with multiple faults. . . . . 116
7-1 A program flow graph. . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7-2 Program flow graph with internal loop. . . . . . . . . . . . . . . . . . 121
7-3 Basic BBN for modeling a Markov Model. . . . . . . . . . . . . . . . 125
7-4 A program flow graph. . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7-5 Extended BBN for modeling a Markov Model. . . . . . . . . . . . . . 130
7-6 Extended BBN allowing up to n nodes to be assessed for reliability . 131
7-7 Extended program flow graph. . . . . . . . . . . . . . . . . . . . . . . 132
8-1 Analysis menu used to import Java source code files. . . . . . . . . . 145
8-2 Summary panel for imported class.. . . . . . . . . . . . . . . . . . . . 146
8-3 Source Code Panel for program. . . . . . . . . . . . . . . . . . . . . . 147
8-4 Basic Activity Diagram for program. . . . . . . . . . . . . . . . . . . 148
8-5 Basic Tracepoint Panel for program. . . . . . . . . . . . . . . . . . . 149
8-6 Java Tracer command line usage. . . . . . . . . . . . . . . . . . . . . 150
8-7 Java Tracer execution example. . . . . . . . . . . . . . . . . . . . . . 150
8-8 XML file showing program execution for HTTPString.java class. . . . 152
8-9 Execution trace within the SOSART tool. . . . . . . . . . . . . . . . 153
8-10 Taxonomy assignment panel. . . . . . . . . . . . . . . . . . . . . . . . 155
8-11 A second taxonomy definition panel. . . . . . . . . . . . . . . . . . . 155
8-12 Imported Static Analysis Warnings. . . . . . . . . . . . . . . . . . . . 166
8-13 Static Analysis Verification panel. . . . . . . . . . . . . . . . . . . . . 167
8-14 Static Analysis fault report. . . . . . . . . . . . . . . . . . . . . . . . 168
8-15 Analyze menu of SOSART tool. . . . . . . . . . . . . . . . . . . . . . 168
8-16 SOSART Program Configuration Panel. . . . . . . . . . . . . . . . . 169
8-17 SOSART Reliability Report Panel. . . . . . . . . . . . . . . . . . . . 169
8-18 Textual Export of Reliability report for analyzed source. . . . . . . . 170
10-1 Flow diagram showing relationship between Tempest, experiments, and laptop
web browser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
10-2 Web tester GUI Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
10-3 OCARNet Lab topology. . . . . . . . . . . . . . . . . . . . . . . . . . 192
10-4 Network topology for test setup. . . . . . . . . . . . . . . . . . . . . . 194
10-5 DummyLogger.java class. . . . . . . . . . . . . . . . . . . . . . . . . . 195
10-6 Modified NotFoundException.java file. . . . . . . . . . . . . . . . . . 196
List of Tables
1.1 The cost of Internet Downtime . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Software Failure Root Cause Descriptions . . . . . . . . . . . . . . . . 13
3.1 Summary of static analysis tools . . . . . . . . . . . . . . . . . . . . . 30
4.1 Relationship between faults and failures in different models . . . . . . 49
4.2 Discrete Paths through sample function . . . . . . . . . . . . . . . . . 74
4.3 Execution Coverage of various paths . . . . . . . . . . . . . . . . . . 76
4.4 Execution Coverage of various paths . . . . . . . . . . . . . . . . . . 80
5.1 Static Analysis Fault Categories for Validation Suite . . . . . . . . . . 85
5.2 Summary of fault detections. . . . . . . . . . . . . . . . . . . . . . . . 89
5.3 Static Analysis Detection Rate by Tool Count . . . . . . . . . . . . . 90
5.4 Correlation between warning tool detections. . . . . . . . . . . . . . . 90
5.5 Static Analysis Tool False Positive and Stylistic Rule Detections . . . 91
5.6 Correlation between false positive and stylistic rule detections . . . . 92
5.7 Percentage of warnings detected as valid based upon tool and warning. 95
6.1 Bayesian Belief Network State Definitions . . . . . . . . . . . . . . . 100
6.2 Network Combinatorial States . . . . . . . . . . . . . . . . . . . . . . 112
6.3 Network Worst Case Combinatorial Probabilities . . . . . . . . . . . 112
6.4 Network Typical Combinatorial Probabilities . . . . . . . . . . . . . . 113
7.1 Bayesian Belief Network States Defined for Reliability . . . . . . . . . 126
7.2 Bayesian Belief Network States Defined for Execution Rate . . . . . . 126
7.3 Markov Model Parameter Ranges . . . . . . . . . . . . . . . . . . . . 128
7.4 Differences between the Markov Model reliability values and the BBN
Predicted Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.5 Test Error ranges and counts . . . . . . . . . . . . . . . . . . . . . . . 129
7.6 Test error relative to normal distribution . . . . . . . . . . . . . . . . 129
7.7 Extended Bayesian Belief Network States Defined for Execution Rate 131
7.8 Markov Model Parameter Ranges . . . . . . . . . . . . . . . . . . . . 133
7.9 Differences between the Markov Model reliability values and the BBN
Predicted Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.10 Test Error ranges and counts . . . . . . . . . . . . . . . . . . . . . . . 134
7.11 Test error relative to normal distribution . . . . . . . . . . . . . . . . 134
8.1 SOSART Development Metrics . . . . . . . . . . . . . . . . . . . . . 141
8.2 SOSART Overview Metrics . . . . . . . . . . . . . . . . . . . . . . . 142
8.3 A listing of SOSART supported static analysis tools. . . . . . . . . . 154
9.1 Real Estate Overview Metrics . . . . . . . . . . . . . . . . . . . . . . 174
9.2 RealEstate Class Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 175
9.3 Real Estate STREW Metrics Reliability Parameters . . . . . . . . . . 176
9.4 RealEstate Class Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 177
9.5 JSUnit Overview Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 178
9.6 JSUnit Class Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
9.7 JSUnit STREW Metrics Reliability Parameters . . . . . . . . . . . . 179
9.8 JSUnit Static Analysis Findings . . . . . . . . . . . . . . . . . . . . . 179
9.9 Jester Overview Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 180
9.10 Jester Class Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
9.11 Jester STREW Metrics Reliability Parameters . . . . . . . . . . . . . 181
9.12 Jester Static Analysis Findings . . . . . . . . . . . . . . . . . . . . . 182
9.13 Software Complete Review Effort Estimates . . . . . . . . . . . . . . 183
9.14 Software Reliability Modeling Actual Effort . . . . . . . . . . . . . . 184
10.1 Tempest Overview Metrics . . . . . . . . . . . . . . . . . . . . . . . . 188
10.2 Tempest Class Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 188
10.3 Tempest Configuration Parameters . . . . . . . . . . . . . . . . . . . 189
10.4 Tempest Test Instance Configurations . . . . . . . . . . . . . . . . . . 193
10.5 Tempest Field Measured Reliabilities . . . . . . . . . . . . . . . . . . 197
10.6 Tempest Rule Violation Count with All Rules Enabled . . . . . . . . 198
10.7 Tempest Rule Violation Densities with All Rules Enabled . . . . . . . 199
10.8 Static Analysis Rule Configuration Metrics . . . . . . . . . . . . . . . 199
10.9 Tempest Rule Violation Count with Configured Rulesets. . . . . . . . 200
10.10 Tempest Estimated Reliabilities using SOSART . . . . . . . . . . . 200
A.1 SoSART Static Analysis Fault Taxonomy . . . . . . . . . . . . . . . . 239
Chapter 1
Introduction and Key
Contributions
“The most significant problem facing the data processing business to-
day is the software problem that is manifested in two major complaints:
Software is too expensive and software is unreliable.”[Mye76]
This statement, leading into Myers book Software Reliability: Principles and
Practices was written in 1976. Yet, even today, this statement is equally valid, and
thus, it forms a perfect introductory quotation for this dissertation.
The problems of software reliability are not new. The first considerations of soft-
ware reliability, in fact, were made during the late 1960s as system downtime began
to become a significant issue[Sho84]. The earliest effort towards the development of
a reliability model was a Markov birth-death model developed in 1967[Sho84]. How-
ever, beyond a few specialized application environments, software failure was regarded
as an unavoidable nuisance. This has been especially true in the area of embedded
systems, where cost and production factors often outweigh quality and reliability
issues.
Increased systemic reliance upon software has begun to change this attitude, and
software reliability is beginning to be significantly considered in each new product
development. Software is becoming increasingly responsible for the safety-critical
functionality in the medical, transportation, and nuclear energy fields. The software
content of embedded systems doubles every 18 months[Hot01], and many automotive
products, telephones, routers, and consumer appliances incorporate 100 KB or more
of firmware. The latest airplanes under development contain over 5 million lines of
code, and even older aircraft contain upwards of 1 million lines of code[Sha06]. A
study of airworthiness directives indicated that 13 out of 33 issued for the period
1984-1994, or 39%, were directly related to software problems[Sho96]. The medical
field faces similar complexity and reliability problems. 79% of medical device recalls
can be attributed to software defects[Ste02]. In today’s consumer electronic products,
software has become the principal source for reliability problems, and it is reported
that software driven outages exceed hardware outages by a factor of ten[EKN98].
Aside from the inconvenience and potential safety hazards related to software fail-
ures, there is a huge economic impact as well. A 2002 study by the National Institute
of Standards and Technology found that software defects cost the American economy $59.5 billion
annually [Tas02]. For the fiscal year 2003, the Department of Defense is estimated to
have spent $21 billion on software development. Upwards of $8 billion, or 40% of the
total, was spent on reworking software due to quality and reliability issues[Sch04b].
Table 1.1: The cost of downtime per hour in 2000[Pat02]
Brokerage operations | $6,450,000
Credit card authorization | $2,600,000
Ebay | $225,000
Amazon.com | $180,000
Package shipping services | $150,000
Home shopping channel | $113,000
Catalog sales center | $90,000
Airline reservation center | $89,000
Cellular service activation | $41,000
On-line network fees | $25,000
ATM service fees | $14,000
But it is not only these large systems that have enormous economic costs. Delays in
the Denver airport automated luggage system due to software problems cost $1.1 mil-
lion per day[Nta97]. A single automotive software error led to a recall of 2.2 million
vehicles and expenses in excess of $20 million[LB05].
In the embedded systems world, software development costs have skyrocketed to
the point that modern embedded software typically costs between $15 and $30 per line
of source code[Gan01]. But, the development costs can be considered small when one
considers the economic cost for system downtime, which ranges from several thousand
dollars per hour to several million dollars per hour depending upon the organization,
as is shown in Table 1.1.
These economic costs, coupled with the associated legal liabilities, have made soft-
ware reliability an area of extreme importance. That being said, economic pressures,
including decreased time to market, stock market return on investment demands, and
shortages of skilled programmers have led to business decisions being made that may
work against increasing reliability. In many projects, software reuse has been consid-
ered as a mechanism to solve both the problems of decreased delivery schedules and
increasing software cost. However, software reliability has not necessarily increased
with reuse. The Ariane 5[JM97]1 and Therac-25[LT93] failures can be directly related
to improper software reuse.
In today’s software development environment, the development of even a rela-
tively simple software system is often partitioned to multiple contractors or vendors,
typically referred to as suppliers. Each supplier is responsible for delivering their
piece of software. Each piece is then integrated to form a final product. Over the
course of a product lifecycle, a given vendor may release hundreds of versions of their
component. Each time a new release is made, the integration team must make a de-
cision as to whether or not it is safe to integrate the given component into the overall
product. Unfortunately, there is often little concrete knowledge to base this decision
upon, and the exponential growth in releases by COTS vendors has been shown to
cause a reduction in net software reliability over time[MMZC06].
The Capability Maturity Model (CMM) was developed by Carnegie Mellon Uni-
versity to assist in assessing the capabilities of government contractors to deliver
software on time and on budget. As such, software development companies are as-
sessed on a scale of 1 to 5 relative to their development capabilities, with 5 being
the best and 1 being the lowest. However, due to many issues, the usage of CMM
assessment has been problematic in filtering capable companies from incapable com-
panies, as is discussed in O’Connell and Saiedian [OS00] and [Koc04]. It has also had
problems related to inconsistent assessment by different assessors, as is documented
1 The Ariane 5 failure is discussed in further detail in Chapter 2 of this dissertation.
in Saiedian and Kuzara[SK95]. Thus, an engineer cannot rely on CMM assessments
to determine if it is acceptable to integrate a new release of a software component
into a product.
1.1 The Problem
Traditional software reliability models require significant data collection during
development and testing, including the operational time between failures, the severity
of the failures, and other metrics. This data is then applied to the project to determine
if an adequate software reliability has been achieved. While these methods have been
applied successfully to many projects, there are often occasions where the failure data
has not been collected in an adequate fashion to obtain relevant results. This is often
the case when reusing software which has been developed previously or purchasing
COTS components for usage within a project. In the reuse scenario, the development
data may have been lost or never collected. In the case of COTS software, the requisite
development data may be proprietary and unavailable to the customer.
This poses a dilemma for a software engineer wishing to reuse a piece of software or
purchase a piece of software from a vendor. Internal standards for release vary greatly,
and from the outside, it is impossible to know where on the software reliability curve
the existing software actually stands. One company might release early on the curve,
resulting in more failures occurring in the field, whereas another company might
release later in the curve, resulting in fewer field defects. Further complicating this
decision are licensing issues. In traditional software reliability models, whenever a
failure occurs it is nearly immediately repaired and a new version is released. With
COTS and open source code, this immediate response generally does not occur, and
the software must be used “as is”. Licensing agreements may also restrict a customer
from fixing a known defect which leads to a failure or risk support termination if such
a fix is attempted.
Independent verification of third party software can be used as a mechanism to
aid in assessing externally developed software. However, independent verification of
third party code can be a costly and time-consuming endeavor. Huisman et al. [HJv00]
report that the verification of one Java class, namely the Java Vector class, required
nearly 700 man hours of effort.
This leads to the following engineering problem:
How can one economically ensure that software delivered from an ex-
ternal vendor is of sufficient reliability to ensure that the final delivered
product obtains an acceptable level of reliability?
It is this very problem that this dissertation intends to address.
1.2 Key Contributions
Thus far, we have discussed the problems of software reliability and ensuring that
reused software is of acceptable quality. A key emphasis has been on the aspect
of the economic costs of software failure. In this section, our intent is to outline the
key contributions to the software engineering body of knowledge that this dissertation
provides as a partial solution to the problem of software reliability modeling of reused
and COTS software.
In order to address the needs of software practitioners, any developed model must
be readily understood, easily applied, and generate no more than a small increase
in development costs. Without meeting these criteria, the likelihood of adoption by
development organizations is decreased.
Static analysis can operate in a manner similar to that of a compiler. An un-
derstanding of compiler usage is a required skill for all software engineers developing
embedded systems software, and thus, by using static analysis tools to detect stati-
cally detectable faults within source code modules, the additional knowledge required
of a practitioner to apply this model is minimized. Static analysis tools offer yet
another benefit. Because the interface to static analysis tools is similar to that of a
compiler, the operation of static analysis tools can be highly automated through the
usage of a build process. This automation not only adds to the repeatability of the
analysis, but also serves to obtain a cost effective analysis.
Research contributions of this dissertation are:
1. A demonstration of the effectiveness of static analysis tools.
This dissertation shows that for the Java programming language, through the
usage of an appropriate set of static analysis tools, a significant number of
commonly encountered programming mistakes can be detected and analyzed.
2. The design of a Bayesian Belief Network which relates software reliability to
statically detectable faults and program execution profiles.
In carrying out this research, a Bayesian Belief Network was developed using
expert knowledge which relates the risk of failure to statically detected faults.
3. The design of a Bayesian Belief Network which relates method execution rates
and reliability and obtains results comparable to those obtained through Markov
Modeling.
This network serves to combine the reliabilities of multiple methods based upon
their execution frequency and their individual reliability measures.
4. The design and development of a new software tool, SOSART, which allows the
application of the described model to existing software modules.
The SOSART tool simplifies the application of this model to existing source
code by providing the user a convenient interface to analyze detected faults as
well as capture execution profiles.
5. The successful demonstration of the reliability model through two different sets
of experiments which shows that the reliability model developed here can accu-
rately predict software reliability.
In the first set of experiments, readily available Open Source software is an-
alyzed for reliability using the STREW metrics suite. The software is then
re-analyzed using the SOSART method and the results are compared. The
second experiment applies the model to an existing medium scale embedded
systems project and compares the results to actual reliability data obtained
from program operation.
1.3 Dissertation Overview
Chapter 1 has provided a synopsis of the software reliability problem and a jus-
tification for the research that follows. Next, the problem statement is clearly and
succinctly stated. An overview of the key contributions of this research to the software
engineering body of knowledge is presented as well.
Chapter 2 provides case studies of software failure. Understanding how software
fails in the field is vital in order to create a practical software reliability model which
is applicable to today’s competitive software development environments. In this case,
the failures discussed have been chosen to be ones which occurred due to a software
fault which was statically detectable through the usage of the appropriate static
analysis tools. The discussion of these failures reinforces the justification given in Chapter 1
for the research objectives.
Chapter 3 provides an overview of static analysis tools. The research upon which
this dissertation is based relies heavily upon the capabilities of static analysis tools
to reliably detect injected faults at the source code level. In order to understand
the assessment capabilities of static analysis tools, this chapter provides a detailed
overview of the static analysis tools which are currently available.
Chapter 4 presents the key areas of contribution of our research by presenting key
concepts for our reliability model, an analysis of faults, and how certain faults lead to
failures. Then we discuss how to measure code coverage. Code coverage plays a major
role both in achieving software reliability through testing and in revealing
faults which manifest themselves as failures. This chapter then concludes with the
details of our proposed software reliability model, which combines static analysis of
existing source code modules, black-box testing of the existing source code module
while observing code coverage, and a series of Bayesian Belief Networks to combine
the data into a meaningful reliability measure.
Chapter 5 discusses the results of our experimentation with static analysis tools
to determine their effectiveness at detecting commonly injected source code faults.
While extensive studies have been done for the C, C++, and Ada languages, there
have been very few studies on the effectiveness of Java static analysis tools. This
research works with the Java language. This chapter presents both the experiment
design as well as the results of the experiment, indicating that static analysis can be
an effective mechanism for detecting faults which may cause run time failures in the
Java language.
Chapters 6 and 7 discuss in detail the developed Bayesian Belief Networks which
combine the results of static analysis execution with program execution traces ob-
tained during limited testing. In general, there are three sets of Bayesian Belief
Networks within the model. One Bayesian belief network combines information to de-
termine the risk of failure from a single statically detectable fault. A second Bayesian
Belief network combines statically detectable faults which are co-located within a
single code block to obtain the reliability of the given block. A third Bayesian Belief
Network is then used to combine the code blocks together to assess the net reliability
for each method.
Chapter 8 discusses the Software Static Analysis Reliability Toolkit (SOSART)
which has been designed to allow the usage of the proposed model in assessing the
reliability of existing software packages. The SOSART tool, developed entirely in the
Java language, interacts with the static analysis tools to allow statically detectable
faults to be combined with execution traces and analyzed using the proposed relia-
bility model. Without such a tool, the usage of this model is impractical due to the
large data processing overhead.
Chapter 9 discusses the experiments conducted on Open Source software pack-
ages which used the STREW metrics suite to assess reliability. In this chapter, the
methodology used is discussed, as well as the results of applying the methodology.
The chapter concludes with a discussion on the effort to apply this model in compar-
ison to 100% source code review.
Chapter 10 discusses the application of this model to a specific embedded reusable
component. In this case, the large scale component is assessed for its reliability and
then the reliability of this component is compared with the reliability obtained from
actual field execution and testing. This chapter highlights the economic advantages
of this model versus a complete peer review of the source code for reliability purposes.
This chapter also demonstrates this model’s overall effectiveness.
Chapter 11 summarizes the work and provides some suggestions for future work.
Chapter 2
Software Field Failure Case
Studies1
“The lessons of cogent case histories are timeless, and hence a height-
ened awareness among today’s designers of classic pitfalls of design logic
and errors of design judgment can help prevent the same mistakes from
being repeated in new designs.”[Pet94]
Before one can understand how to make software reliable, one must understand
how software fails. Leveson [Lev94] and Holloway [Hol99] both advocate that in order
to understand how to model risk for future programs, it is important to study and
understand past failures. Therefore, we begin with an in-depth study of computer
systems failures, specifically of embedded systems. The failures are summarized in
Table 2.1.
1 Portions of this chapter appeared in Schilling [Sch05].
Table 2.1: Software Failure Root Cause Descriptions
Program | Cause | Is fault statically detectable?
Air Traffic Communications | Failure to reboot communications system. Reboot required by defect in usage of GetTickCount() API call of Win32 subsystem. | Possibly
Mars Spirit Rover | File system expanded beyond available RAM. | Yes
2003 US Blackout | Race condition within event handling system, allowing two threads to simultaneously write to a data structure. | Yes
Patriot Missile Failure | Error induced in algorithmic computation due to rounding of decimal number. | Yes
STS-2 | Uninitialized code jump. | Yes
Ariane 5 | Unhandled exception raised by variable overflow in typecast, resulting in computer shutdown. | Yes
Milstar 3 | Improper filter coefficient entered into configuration tables. | Possibly
Mars PathFinder | Priority inversion between tasks resulted in reset of CPU. | Possibly
USS Yorktown | Improper user input resulted in division by zero. This caused cascading error which shut down the ship's propulsion system. | Yes
Clementine | Software error resulted in stuck open thruster. | Possibly
GeoSat | Improper sign in telemetry table resulted in momentum and torque being applied in the wrong direction. | Possibly
Near Earth Asteroid Rendezvous Spacecraft | Transient lateral acceleration of the craft exceeded firmware threshold, resulting in shutdown of thruster. Craft entered safe mode, but race conditions in software occurred, leading to unnecessary thruster firing and a loss of fuel. | Possibly
AT&T Long Distance Outages | Missing break statement in case statement. | Yes
Sony, Hitachi, Philips, RCA Television Failures | Buffer overflow in microprocessor software caused by two extra bits in transmission stream. | Yes
Diebold ATM Failure | Nachi worm infected Diebold ATM machines. | Yes
TacSat-2 satellite launch delay | Improper sign in equation. | Possibly
It is important to note that there are significant levels of commonality amongst
these failures. Many of these failures are directly attributable to the reuse of previ-
ously developed software with insufficient testing. Furthermore, many of these failures
can be attributed to a fault which can be readily and easily detected with existing
static analysis tools.
2.1 COTS System Failure
Air Traffic Control System Failure
On Tuesday September 14, 2004 at about 5 p.m. Pacific Time, air traffic con-
trollers lost contact with 400 airplanes which were flying in the Southwestern United
States. Planes were still visible on radar, but all voice communications between the
air traffic controllers and pilots was lost. This loss of communication was caused by
the failure of the Voice Switching and Control System (VSCS) which integrates all
air traffic controller communications into a single system. The failures lasted three
hours, and resulted in 800 flights being disrupted across the Western United States.
In at least five instances, planes violated minimum separation distances for planes
flying at high altitudes, but luckily no fatalities occurred[Bro04] [Gep04].
The cause was the failure of a newly integrated system enhancement intended
to provide operational availability of 0.9999999[Mey01]. Soon after installation, a
problem was detected within the VCSU installation, as the system crashed after 49.7
days of operation. After rebooting the system, operations returned to normal. As a
workaround, the FAA instituted a maintenance reboot every 30 days to the system.
This mandatory reboot was necessitated by a design flaw within the software.
The internal software of the VCSU relies on a 32-bit counter which counts down from
2^32 to 0 as the software operates. This takes 2^32 ms, or 49.7103 days, to complete a
countdown[Cre05]. This tick, part of the Win32 system and accessed by a call to the
GetTickCount() API, is used to provide a periodic pulse to the system. When the
counter reaches 0, the tick counter wraps around from 0 to 4,294,967,296, and in
doing so, the periodic pulse fails to occur. In the case of the Southern California air
traffic control system, the mandatory reboot did not occur and the tick count reached
zero, shutting down the system in its entirety. The backup system attempted to start,
but was unable to handle the load of traffic, resulting in the complete communications
failure[Wal04] [Gep04].
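The arithmetic behind the 49.7-day window is straightforward to reproduce. The sketch below is an illustration of this defect class only, not the VSCS source; the constants, variable names, and the broken deadline check are assumptions chosen for demonstration. It shows why a 32-bit millisecond tick exhausts its range after roughly 49.7 days, and how a deadline comparison that treats the tick as a non-wrapping quantity silently stops firing once the counter wraps.

// Illustration only (not the VSCS code): the 49.7-day rollover of a 32-bit
// millisecond tick, and a deadline check that breaks once the counter wraps.
public class TickRollover {
    static final int PERIOD_MS = 10;

    public static void main(String[] args) {
        // 2^32 distinct millisecond values last roughly 49.71 days.
        System.out.printf("2^32 ms = %.4f days%n", Math.pow(2, 32) / 1000 / 3600 / 24);

        int lastPulse = -6;   // unsigned value 4,294,967,290: six ticks before the wrap
        int now = 4;          // ten milliseconds later, after the 32-bit counter has wrapped

        // Broken: the deadline is kept as a non-wrapping 64-bit value, but the tick wrapped,
        // so "now" never appears to reach it and the periodic pulse silently stops.
        long deadline = Integer.toUnsignedLong(lastPulse) + PERIOD_MS;   // 4,294,967,300
        boolean brokenDue = Integer.toUnsignedLong(now) >= deadline;     // false

        // Wrap-safe: subtract first so the two's-complement wrap cancels out, then compare.
        boolean safeDue = (now - lastPulse) >= PERIOD_MS;                // true

        System.out.println("broken check says pulse due:    " + brokenDue);
        System.out.println("wrap-safe check says pulse due: " + safeDue);
    }
}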
USS Yorktown Failure
On September 21, 1997, the USS Yorktown, the prototype for United States Navy
Smart Ship program, suffered a catastrophic software failure while steaming in the
Atlantic. As a result of an invalid user input, the entire ship was left dead in the water
for two hours and forty five minutes while a complete reboot of systems occurred.
The software consisted of two major segments, a Windows NT user interface
portion and a monitoring and control surveillance system portion. Windows NT
was mandated by the Navy’s IT-21 report, “Information Technology for the 21st
Century”[wir98]. The monitoring and control surveillance system code was developed
in Ada[Sla98a]. In order to meet Navy schedules and deployment deadlines for the
Smart Ship program, installation was completed in eight months[Sla98a]2.
The failure of the Yorktown system began with a petty officer entering an incorrect
calibration value into the Remote Database Manager. This zero entry resulted in a
divide by zero, causing the database system to crash. Through the ATM network, the
crash propagated to all workstations on the network. Restoration of the ship required
rebooting of all computers[Sla98b].
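The failure class is simple to illustrate. The sketch below is a hypothetical Java fragment, not the Yorktown code (which was Ada running alongside Windows NT), and the method and field names are invented. It shows how an unvalidated operator entry of zero becomes a division-by-zero exception that escapes the data-entry layer, and how a boundary check contains it instead.

// Hypothetical sketch of the failure class; names are illustrative, not from the
// Yorktown software. An unvalidated zero entry used as a divisor throws an
// exception that propagates out of the data layer.
public class CalibrationEntry {

    // Unchecked version: an operator entry of zero raises ArithmeticException here,
    // and the exception escapes into whatever called the data layer.
    static int scaleUnchecked(int rawReading, int calibrationDivisor) {
        return rawReading / calibrationDivisor;
    }

    // Defensive version: reject the invalid entry at the input boundary instead.
    static int scaleChecked(int rawReading, int calibrationDivisor) {
        if (calibrationDivisor == 0) {
            throw new IllegalArgumentException("calibration divisor must be non-zero");
        }
        return rawReading / calibrationDivisor;
    }

    public static void main(String[] args) {
        int operatorEntry = 0;   // the invalid value typed at the console
        try {
            System.out.println(scaleChecked(1250, operatorEntry));
        } catch (IllegalArgumentException e) {
            System.out.println("entry rejected: " + e.getMessage());
        }
        // scaleUnchecked(1250, operatorEntry) would instead throw ArithmeticException
        // and propagate upward, the analogue of the cascading failure described above.
    }
}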
Diebold ATM Lockups
In August of 2003, automated teller machines at two financial institutions failed
due to the Nachi worm in the first confirmed case of malicious code penetrating
ATM machines. The ATM machines operated on top of the Windows XP Embedded
2The typical schedule for such a development effort is approximately three years [Sla98a].
operating system which was vulnerable to an RPC DCOM security bug exploited by
Nachi and the Blaster virus. In this case, a patch had been available from Microsoft
for over a month, but had not been installed on the machines due to an in-depth set
of regression tests required by Diebold before deploying software patches[Pou03].
The Nachi worm spread through a buffer overflow vulnerability within Microsoft
Windows. In this particular instance, an infected machine would transmit a set of
packets to a remote machine using port 135. When the packets were processed a
buffer overflow would occur on the receiving machine crashing the RPC service and
allowing the worm to propagate to the second machine[McA].
2.2 Embedded System Failures
Midwest Power Failure
On August 14, 2003, the worst electrical outage in North American history
occurred, as 50 million customers in eight US states and Canada lost power. The
economic impact of this failure has been estimated to range between $4.5 and $10
billion[ELC04]. While there were many contributing factors to the outage, buried
within the causes was the failure of a monitoring alarm system [Pou04a].
First Energy’s energy management system used to monitor the state of its elec-
trical grid failed silently. This system had over 100 installations worldwide [Jes04]
and the software was approximately four million lines of C code[Pou04b]. A thorough
analysis of the Alarm and Event processing routines comprised of approximately one
million lines of C and C++ code yielded a subtle race condition which allowed two
asynchronous threads to obtain write access to a common data structure. Once this
happened, the data structure was corrupted and the alarm application went into an
infinite loop. This caused a queue overflow on the server hosting the alarm process,
resulting in a crash. A backup system kicked in, but it too was overwhelmed by the
number of unprocessed messages and it also soon crashed[Pou04b][Pou04a].
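The defect class reproduces readily in miniature. The sketch below is illustrative Java, not the C/C++ alarm code, and the field names are invented. Two threads perform unsynchronized read-modify-write updates on a shared structure, and most runs end with lost updates, the same silent corruption of shared state described above.

// Miniature illustration of the defect class (not the alarm system code): two threads
// write a shared structure with no synchronization, breaking an invariant that
// single-threaded testing would never expose.
public class AlarmQueueRace {
    // A stand-in for the shared data structure; no lock guards these fields.
    static class SharedRecord {
        int writesByThreadA;
        int writesByThreadB;
        int totalWrites;      // invariant: should equal the sum of the two fields above
    }

    public static void main(String[] args) throws InterruptedException {
        SharedRecord shared = new SharedRecord();

        Runnable writerA = () -> {
            for (int i = 0; i < 100_000; i++) {
                shared.writesByThreadA++;   // read-modify-write, not atomic
                shared.totalWrites++;       // second unguarded write to the same structure
            }
        };
        Runnable writerB = () -> {
            for (int i = 0; i < 100_000; i++) {
                shared.writesByThreadB++;
                shared.totalWrites++;
            }
        };

        Thread a = new Thread(writerA);
        Thread b = new Thread(writerB);
        a.start(); b.start();
        a.join(); b.join();

        // Expected: totalWrites == 200,000. Lost updates routinely break the invariant.
        System.out.println("A=" + shared.writesByThreadA + " B=" + shared.writesByThreadB
                + " total=" + shared.totalWrites);
    }
}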
Patriot Missile Interception Failure
On February 25, 1991, an incoming Iraqi Scud missile struck an American Army
barracks. An American Patriot missile battery in Dhahran, Saudi
Arabia was tasked with intercepting incoming missiles, but due to a software defect,
this did not occur[Arn00].
The Patriot missile defense system operated by detecting airborne objects which
potentially match the electronic signature for a foreign object[Car92]. In calculating
the location to apply the range gate, the Patriot computer system used two funda-
mental data items: time, expressed as a 24 bit integer, and velocity expressed as a
24-bit fixed point decimal number. The prediction algorithm used in the source code
multiplied the current time stamp by 1/10 when calculating the next location.
In order to fit into the 24 bit registers of the Patriot computer system, the binary
expansion for 1/10 was truncated to 0.00011001100110011001100, yielding an error of
0.0000000000000000000000011001100... binary, or approximately 0.000000095 deci-
mal. This error was cumulative with the time since the unit was initialized[Arn00].
The Alpha battery had been operating without reset for over 100 consecutive
hours. At 100 hours, the time inaccuracy was 0.3433 seconds. The incoming Iraqi
Scud missile cruised at 1,676 metres per second, and during this time error, it had
traveled more than one half of a kilometer [Car92]. The battery did not engage
the Scud missile, and an Army barracks was hit, resulting in 28 deaths and over 100
injuries.
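The magnitude of the drift follows directly from the figures above. The sketch below is a back-of-the-envelope reconstruction, not the Patriot code, assuming the constant 1/10 is chopped after 23 fractional bits (an assumption that reproduces the 0.000000095 error quoted above). It accumulates the representation error over 100 hours of 0.1-second ticks and converts the resulting clock error into tracking distance at the quoted Scud velocity.

// Back-of-the-envelope reconstruction of the Patriot timing drift. Illustrative only.
public class PatriotDrift {
    public static void main(String[] args) {
        double exact = 0.1;
        // Chop the binary expansion of 1/10 after 23 fractional bits (assumption that
        // matches the 0.000000095 error quoted above).
        double chopped = Math.floor(exact * (1 << 23)) / (1 << 23);
        double errorPerTick = exact - chopped;              // about 9.5e-8 seconds per tick

        long ticks = 100L * 3600 * 10;                      // 100 hours of 0.1-second ticks
        double clockError = errorPerTick * ticks;           // about 0.34 seconds accumulated
        double missDistance = clockError * 1676;            // Scud velocity in metres per second

        System.out.printf("error in chopped 1/10: %.3e%n", errorPerTick);
        System.out.printf("clock error at 100 h:  %.4f s%n", clockError);
        System.out.printf("tracking offset:       %.0f m%n", missDistance);
    }
}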
STS-2 Simulation Failure
In October of 1981, a catastrophic space shuttle software failure was discovered
during a training scenario. In this training scenario, the objective of the crew was to
abort the launch and land in Spain. This required the dumping of excess fuel from
the orbiter. When the crew initiated the abort, all four redundant flight computers
failed and became unresponsive. The displays each showed a large X, indicating that
the display I/O routines were not receiving data.
The failure was traced to a single fault, namely a single, common routine used to
dump fuel from the orbiter. In this particular scenario, the routine had been invoked
once during the simulated ascent, and then was canceled. Later on, the routine was
called again. However, not all variables had been re-initialized. One of the variables
was used to compute the offset address for a “GOTO” statement. This caused the
code to branch to an invalid address, resulting in the simultaneous lockup of all four
computers. A complete analysis of the shuttle code found 17 other instances of this
fault[Lad96].
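The underlying mechanism, state that survives a cancelled invocation and poisons the next one, can be sketched briefly. The fragment below is an illustrative Java analogue, not the shuttle flight code; Java has no computed GOTO, so a stale table index stands in for the offset address, and all names are invented.

// Illustrative analogue of the STS-2 defect (not the shuttle code): state left over
// from a cancelled invocation is reused on the next call. A stale table index here
// stands in for the stale GOTO offset in the shuttle software.
public class FuelDumpSequencer {
    private static final Runnable[] STEPS = {
            () -> System.out.println("open dump valves"),
            () -> System.out.println("start dump timer"),
            () -> System.out.println("close dump valves")
    };

    private int nextStep;   // survives between invocations; never reset on cancel

    void runDump(boolean cancelMidway) {
        // BUG: nextStep should be reset to 0 here for every fresh invocation.
        while (nextStep < STEPS.length) {
            STEPS[nextStep].run();
            nextStep++;
            if (cancelMidway && nextStep == 2) {
                return;       // operator cancels; nextStep keeps its stale value
            }
        }
        // A later invocation resumes at the stale index, skipping the steps that
        // establish a valid starting state.
    }

    public static void main(String[] args) {
        FuelDumpSequencer seq = new FuelDumpSequencer();
        seq.runDump(true);    // first, cancelled invocation
        System.out.println("-- second invocation --");
        seq.runDump(false);   // resumes mid-sequence instead of restarting
    }
}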
Ariane 5 Maiden Launch Failure
On June 4, 1996, the maiden flight of the Ariane 5 rocket occurred. 39 seconds
into flight, a self destruct mechanism on the rocket activated, destroying the launcher
and all four payload satellites. The root cause for the failure was traced to a stat-
ically detectable programming error where a 64 bit floating point number was cast
into a 16 bit integer. As the code was implemented in Ada, this caused a runtime
exception which was not handled properly, resulting in the computer shutting down
and the ultimate loss of the vehicle and payload. Had this occurred in a less strongly
typed language, the program most likely would have continued executing without
incident[Hat99a]. In reviewing the source code for the module, it was found that
there were at least three other instances of unprotected typecasts present within the
code which could have caused a similar failure [Lio96].
The portion of the guidance software which failed in Ariane 5 had actually been
reused from Ariane 4 without review and integration testing[Lio96]. On the Ariane
4 program, a software “hack” had been implemented to keep a pre-launch alignment
task executing after lift-off[Gle96]. This had the effect of saving time whenever a
hold occurred. When the Ariane 5 software was developed, no one removed this
unnecessary feature. Thus, even after launch was initiated, pre-launch alignment
calculations occurred[And96]. The variable that overflowed was actually in the pre-
launch alignment calculations and served no purpose after launch.
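A Java analogue of the failing conversion is shown below; it is an illustration, not the Ariane flight code, which was written in Ada, and the variable names are invented. In Ada the overflowing conversion raised an unhandled exception; the Java narrowing cast instead wraps silently, which illustrates the language contrast noted above.

// Java analogue of the failing conversion (the flight code was Ada): a 64-bit value
// narrowed to 16 bits. Ada raised an unhandled exception; the Java cast wraps silently.
public class HorizontalBiasCast {

    // Unguarded narrowing: values outside the 16-bit range wrap silently.
    static short toOperand(double horizontalBias) {
        return (short) horizontalBias;
    }

    // Guarded version: detect the overflow and handle it instead of shutting down.
    static short toOperandChecked(double horizontalBias) {
        if (horizontalBias > Short.MAX_VALUE || horizontalBias < Short.MIN_VALUE) {
            throw new ArithmeticException("value out of 16-bit range: " + horizontalBias);
        }
        return (short) horizontalBias;
    }

    public static void main(String[] args) {
        double inFlightValue = 65_000.0;   // larger than the trajectories the code was written for
        System.out.println("unguarded cast: " + toOperand(inFlightValue));   // wraps to -536
        try {
            toOperandChecked(inFlightValue);
        } catch (ArithmeticException e) {
            System.out.println("guarded cast:   " + e.getMessage());
        }
    }
}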
Clementine Mission Failure
While in operation, the Clementine orbiter was classified by management as highly
successful. However, for all the success the orbiter was achieving, significant prob-
lems were also occurring, as over 3000 floating point exceptions were detected during
execution[Gan02b] and ground controllers were required to perform hardware resets
on the vehicle at least 16 times[Lee94].
All of these problems came to a climax in May, 1994 when another floating point
exception occurred. Ground controllers attempted to revive the system by sending
software reset commands to the craft, but these were ignored. After 20 minutes, a
hardware reset was successful at bringing the Clementine probe back on line [Lee94].
However, all of the vehicle's hydrazine propellant had been expended[Gan02b].
Software for the Clementine orbiter was developed through a spiral model pro-
cess, resulting in iterative releases. Code was developed using both the C and Ada
languages. In order to protect against random thruster firing, designers had im-
plemented a software thruster timeout. However, this protection was defeated by
a hangup within the firmware. Evidence from the vehicle indicates that following
the exception, the microprocessor locked up due to a software defect. While in a
hung-up state, the processor erroneously turned on a thruster, dump-
ing fuel and imparting an 80 RPM spin on the craft. This thruster fired until the
hydrazine propellant was exhausted[Gan02b]. The software to control the built in
watchdog mechanism which could have detected the runaway source code was not
implemented due to time constraints [Lee94].
Milstar Launch Failure
A $433.1 million Lockheed Martin Titan IV B rocket was launched on April 30,
1999, from Cape Canaveral Air Station in Florida [Hal99] carrying the third Milstar
satellite destined for Geosynchronous orbit[Pav99]. The first nine minutes of the
launch were nominal. However, very quickly afterward an anomaly was detected in
the flight path as the vehicle was unstable in the roll orientation.
The instability against the roll axis returned during the second engine burn causing
excess roll commands that saturated pitch and yaw controls, making them unstable.
This resulted in vehicle tumble until engine shutdown. The vehicle then could not
obtain the correct velocity or the correct transfer orbit. Trying to stabilize the ve-
hicle, the RCS system exhausted all remaining propellant[Pav99]. The vehicle again
tumbled when the third engine firing occurred, resulting in the vehicle being launched
into a low elliptical orbit instead of the intended geostationary orbit[Pav99].
An accident investigation board determined that a single filter constant in the
Inertial Navigation Unit (INU) software file caused the error. In one filter table,
instead of −1.992476, a value of −0.1992476 had been entered. Thus, the values for
roll detection were effectively zeroed causing the instability within the roll system.
These tables were not under configuration management through a version control
system (VCS). This incorrect entry had been made in February of 1999, and had
gone undetected by both the independent quality assurance processes as well as the
independent validation and verification organizations [Neu99][Pav99].
As a footnote to this incident, and paralleling the Ariane 5 failure, the roll filter
which was mis-configured was not even necessary for correct flight behavior. The
filter had been requested early in the development of the first Milstar satellite when
there was a concern that fuel sloshing within the Milstar Satellite might effect launch
trajectory. Subsequently, this filter was deemed unnecessary. However, it was left in
place for consistency purposes [Pav99].
Long Distance System Crash
On January 15 of 1990, AT&T experienced the largest long distance outage on
record. 60 thousand people were left without telephone service as 114 switching nodes
within the AT&T system failed. All told, a net total of between $60 to $75 million was
lost. The culprit for this failure was a missing line of code within a failure recovery
routine[Dew90].
...
switch (message)
{
    case INCOMING_MESSAGE:
        if (sending_switch == OUT_OF_SERVICE)
        {
            if (ring_write_buffer == EMPTY)
                send_in_service_to_smm(3B);
            else
                break; /* Whoops */
        }
        process_incoming_message();
        break;
    ...
}
do_optional_database_work();
...
Figure 2-1: Source code which caused AT&T Long Distance Outage[Hat99b]
When a single node crashed on the system, a message indicating that the given
node was out of service was sent to adjacent nodes so that the adjacent node could
route the traffic around the failed node. However, because of a misplaced “break”
statement within a C "case" construct, the neighboring nodes themselves crashed upon
receiving an out of service message from an adjacent node. The second node then
transmitted an out of service message to its adjacent nodes, resulting in the domino
failure of the long distance network[Dew90].
NASA Mars Rover Watchdog Reset
On January 21, 2004 Jet Propulsion Laboratory Engineers struggled to estab-
lish communications with the Mars Spirit Rover. After 18 days of nearly flawless
operation on the Martian surface, a serious anomaly with the robotic vehicle had
developed, rendering the vehicle inoperable. While initial speculation on the Rover
failure pointed toward a hardware failure, the root cause for the failure turned out to
be software. The executable code that had been loaded into the Rover at launch had
serious shortcomings. A new code was uploaded via radio to the rover during flight.
In doing so, a new directory structure was uploaded into the file system while leaving
the old system intact.
Eventually, the Rover attempted to allocate more files than RAM would allow,
raising an exception and resulting in a diagnostic code being written to the file system
before rebooting the system. This scenario continued, causing the rover to get stuck
in an infinite series of resets[Hac04].
Mars Pathfinder File System Overflow
Mars PathFinder touched down on the surface of Mars on July 4, 1997, but
all communication was suddenly lost on September 27, 1997. The system began
encountering periodic total system resets, resulting in lost data and the cancellation
of any pending ground commands[Woe99]. The problem was quickly traced to a
watchdog reset[Gan02a] caused by a priority inversion within the system. New code
was uploaded to PathFinder, and the spacecraft recovered to complete a highly
successful mission[Ree97].

The Mars PathFinder failure prompted significant research in real time tasking,
leading to the development of the Java PathFinder [HP00] and JLint [Art01] static
analysis tools.
GEOSAT Sign Inversion
The United States Navy launched the 370 kg GeoSAT Follow-On (GFO) satellite
on February 10, 1998, from Vandenberg Air Force Base on board a Taurus rocket.
Immediately following launch, there were serious problems with attitude control as
the vehicle simply tumbled in space. Subsequent analysis of the motion equations
programmed into the vehicle indicated that momentum and torque were being applied
in the wrong direction. Somewhere in the development of the vehicle, the sign of a
coefficient had been inverted[Hal03], resulting in forces being applied in the direction
opposite of what was necessary.
TacSat-2 Sign Inversion
The December 2006 launch of the TacSat-2 satellite on board an Air Force Minotaur I
rocket was delayed due to software issues. While the investigation is not complete,
indications are that the problem may be related to a missing minus sign within
a mathematical equation. If TacSat-2 had been launched, the software defect would
have prevented the satellite's attitude control system from turning the solar panels
closer than 45 degrees to the sun's rays, resulting in an eventual loss of power to the
satellite.
2.3 Hard Limits Exceeded
Near Earth Asteroid Rendezvous Spacecraft Threshold Exceeded
The NASA Near Earth Asteroid Rendezvous (NEAR) mission was launched from
Cape Canaveral on board a Delta-2 rocket on February 17, 1996. Its main mission was
to rendezvous with the asteroid 433 Eros[SWA+00]. Problems occurred on December
20, 1998, when the spacecraft was to fire its main engine in order to place the vehicle
in orbit around Eros. The engine started successfully, but the burn was aborted almost
immediately. Communications with the spacecraft were lost for 27 hours[Gan00], during
which the spacecraft performed 15 automatic momentum dumps, fired its thrusters
several thousand times, and burned 96 kg of propellant[Hof99].
The root cause of the engine abort was quickly discovered. Sensors on board the
spacecraft had detected a transient lateral acceleration which exceeded a defined
constant in the control software. The software did not appropriately filter the input,
and thus the engine was shut down. The spacecraft then executed a set of automated
scripts intended to place the craft into safe mode. These scripts, however, did not
properly start the reaction control wheels used to control attitude in the absence of
thrusters[Gan00]. A set of race conditions occurred and several untested exception
handlers executed, both exacerbated by low batteries. Over the next few hours,
7900 seconds of thruster firings were logged before the craft reached sun-safe
mode[Gan00].

As a part of the investigation, some of the 80,000 lines of source code were inspected,
and 17 faults were discovered[Hof99]. Complicating the situation was the fact that
there turned out to be two different versions of flight software 1.11, one onboard the
craft and one readily available on the ground, as the flight code had been stored on a
network server in an uncontrolled environment.
Television Sound Loss
The television transmission standard has changed little since its initial development
during the 1930s. Recently, there has been a trend toward increased usage of
Extended Data Services (XDS), such as closed captioning, automatic time of day
synchronization, program guides, and other information broadcast digitally during
the vertical blanking interval. XDS services are decoded on the television set by a
microcontroller[Sop01].

Television transmission standards are strictly regulated so that television receivers
can easily be mass produced cheaply. Problems occur, however, when a transmission
is out of specification. In one particular instance, a device generating the digital data
stream for closed captioning on a transmitter had a periodic yet random failure
whereby two extra bits would be erroneously inserted into the data stream[Sop01].
On certain models, receiving these erroneous bits caused a buffer overflow within the
software, resulting in a complete loss of video image, the color tint being set to
maximum green, or muted audio. In each case, the only mechanism for recovery was
to unplug the television set and allow the microcontroller to reset itself to the default
settings[Sop01].
Chapter 3
Survey of Static Analysis Tools and Techniques
“Static Program analysis consists of automatically discovering proper-
ties of a program that hold for all possible execution paths of the program.”[BV03]
Static analysis of source code is a technique commonly used during implementation
and review to detect software implementation errors. Static analysis has been
shown to reduce software defects by a factor of six [XP04], as well as to detect 60% of
post-release failures[Sys02]. Static analysis has been shown to outperform other quality
assurance methods, such as model checking[Eng05][ME03]. Static analysis can
detect errors such as buffer overflows and security vulnerabilities[VBKM00], memory
leaks[Rai05], timing anomalies (race conditions, deadlocks, and livelocks)[Art01], further
security vulnerabilities[LL05], and other common programming mistakes. Faults
caught with static analysis tools early in the development cycle, before testing
commences, can be 5 to 10 times cheaper to repair than faults found at a later phase[Hol04].

1Portions of this chapter appeared in Schilling and Alam[SA06c].
Static analysis of source code does not represent new technology. Static analysis
tools are highly regarded within certain segments of industry for being able to
quickly detect software faults [Gan04]. Static analysis is routinely used in mission
critical source code development, such as in the aircraft[Har99] and rail transit[Pol] domains.
Robert Glass reports that static analysis can remove upwards of 91% of errors within
source code [Gla99b] [Gla99a]. It has also been found effective at detecting pieces
of dead or unused source code in embedded systems [Ger04] and buffer overflow
vulnerabilities[LE01]. Richardson[Ric00] and Giessen[Gie98] provide overviews of the
concept of static analysis, including the philosophy and practical issues related to
its use.
Recent papers dealing with static analysis tools have shown a statistically signif-
icant relationship between the faults detected during automated inspection and the
actual number of field failures occurring in a specific product[NWV+04]. Static anal-
ysis has been used to determine testing effort by Ostrand et al.[OWB04]. Nagappan
et al. [NWV+04] and Zheng et al.[ZWN+06] discuss the application of static analysis
to large scale industrial projects, while Schilling and Alam[SA05b] cite the benefits of
using static analysis in an academic setting. Integration of static analysis tools into a
software development process has been discussed by Schilling and Alam[SA06c] and
Barriault and Lalo[BL06].
Table 3.1: Summary of static analysis tools

Software Tool | Domain | Responsible Party | Languages Checked | Platforms
CGS | Academic | NASA | C | Linux
Checkstyle | Academic | Open source, hosted on SourceForge | Java | OS independent
CodeSonar | Commercial | GrammaTech | C, C++ | Windows
CodeSurfer | Commercial | GrammaTech | C, C++ | Windows
Coverity Prevent | Commercial | Coverity, Inc. | C, C++ | Linux, UNIX, Windows, Mac OS X
CQual | Academic | University of California at Berkeley (GPL) | C, C++ | UNIX, Linux
ESC-Java | Academic | Software Engineering with Applied Formal Methods Group, Department of Computer Science, University College Dublin | Java | Linux, Mac OS X, Windows, Solaris
ESP | Commercial | Microsoft | C, C++ | Windows
FindBugs | Academic | University of Maryland | Java | Any JVM compatible platform
FlawFinder | GPL | David A. Wheeler | C, C++ | UNIX
Fortify Source Code Analysis (SCA) | Commercial | Fortify Software | Java, C, C++ | Windows, Solaris, Linux, Mac OS X, HP-UX, IBM AIX
Gauntlet | Academic | US Military Academy | Java | Windows
ITS4 | Commercial | Cigital | C, C++ | Linux, Solaris, Windows
Java PathFinder | Academic | NASA Ames | Java | Any JVM compatible platform
JiveLint | Commercial | Sureshot Software | Java | Windows
JLint | Academic | Konstantin Knizhnik, Cyrille Artho | Java | Windows, Linux
JPaX | Academic | NASA | Java | Not documented
Klocwork K7 | Commercial | Klocwork | Java, C, C++ | Sun Solaris, Linux, Windows
Lint4j | Academic | jutils.com | Java | Any JDK system
MOPS | Academic | University of California, Berkeley | C | UNIX
PC-Lint / FlexLint | Commercial | Gimpel Software | C, C++ | DOS, Windows, OS/2, UNIX (FlexLint only)
PMD | Academic | Available from SourceForge with BSD license | Java | Any JVM compatible platform
Polyspace C Verifier | Commercial | Polyspace | Ada, C, C++ | Windows, UNIX
PREfix / PREfast | Commercial | Microsoft | C, C++, C# | Windows
QAC / QAC++ | Commercial | Programming Research Limited | C, C++ | Windows, UNIX
RATS | Academic | Secure Software | C, C++ | Windows, UNIX
Safer C Toolkit | Commercial | Oakwood Computing | C | Windows, Linux
SLAM | Academic | Microsoft | C | Windows
SofCheck Inspector for Java | Commercial | SofCheck | Java | Windows, UNIX, Linux
Splint | Academic | University of Virginia, Department of Computer Science | C | Windows, UNIX, Linux

Static analysis tools have two important characteristics: soundness and completeness.
A static analysis tool is defined to be complete if it detects all faults present
within a given source code module. A static analysis tool is deemed to be sound if it
never gives a spurious warning. A static analysis tool is said to generate a false posi-
tive if a spurious warning is detected within source code. A static analysis tool is said
to generate a false negative if a fault is missed during analysis. In practice, nearly
all static analysis tools are unsound and incomplete, as most tools generate false
positives and false negatives[Art01]. A discussion on the importance of soundness is
provided by Xie et al.[XNHA05] and Godefroid[God05].
For all of the advantages of static analysis tools, there have been very few inde-
pendent comparison studies between tools. Rutar et al. [RAF04] compare the results
of using the FindBugs, JLint, and PMD tools on Java source code. Forristal[For05]
compares 12 commercial and open source tools for effectiveness, but the analysis is
based only on security aspects and security scanners, not the broader range of static
analysis tools available. Lu et al.[LLQ+05], as well as Meftah[Mef05], propose
benchmark suites for bug detection tools, but these do not specifically target static
analysis tools.
In order to evaluate which static analysis tools have the potential for usage in
software reliability modeling, it was important to obtain information about the
currently existing tools. Table 3.1 provides a summary of the tools which are discussed
in the following sections.
3.1 General Purpose Static Analysis Tools
3.1.1 Commercial Tools
Lint
Lint[Joh78] is one of the earliest and most widely used static analysis tools for the C
and C++ languages. Lint checks programs for a large set of syntax and semantic
errors. Newer versions of Lint include value tracking, which can detect subtle
initialization and value misuse problems; inter-function value tracking, which tracks
values across function calls during analysis; strong type checking; user-defined semantic
checking; usage verification, which can detect unused macros, typedefs, classes,
members, and declarations; and flow verification for uninitialized variables. Lint can also
handle the verification of common safer programming subsets, including the MISRA
(Motor Industry Software Reliability Association) C Standards [MIS04] [MIS98] and
the Scott Meyers Effective C++ Series of Standards [Mey92]. Lint also supports code
portability checks which can be used to verify that there are no known portability
issues with a given set of source code[Rai05]. A handbook on using Lint to verify C
programs has been written by Darwin[Dar88].
In addition to the basic Lint tool, several add-on companion programs exist to
aid in the execution of the Lint program. ALOA[Hol04] automatically collects a set
of metrics from the Lint execution which can be used to aid in quality analysis of
source code. ALOA provides an overall lint score, which is a weighted sum of all Lint
warnings encountered, as well as breakdowns by source code module of the number
and severity of faults discovered.
QAC, QAC++, QAJ
QAC, and its companion tools QA C++, QAJ, and QA Fortran, have been de-
veloped by Programming Research Limited. Each tool is a deep flow static analyzer
tailored to the given languages. These tools are capable of detecting language im-
plementation errors, inconsistencies, obsolescent features and programming standard
transgressions through code analysis. Version 2.0 of the tool issues over 800 warn-
ing and error messages, including warnings regarding non-portable code constructs,
overly complex code, or code which violates the ISO/IEC 14882:2003 Programming
languages – C++ standard[ISO03]. Code which relies upon unspecified, undefined,
or implementation defined behavior will also be appropriately flagged.
The QAC and QAC++ family of tools are capable of validating several different
coding standards. QAC can validate the MISRA C coding standard [MIS98] [MIS04],
while QAC++ can validate against the High Integrity C++ Coding standard[Pro].
Polyspace C Verifier
The Polyspace Ada Verifier was developed as a result of the Ariane 501 launch fail-
ure and can analyze large Ada programs and reliably detect runtime errors. Polyspace
C++ and Polyspace C verifiers have subsequently been developed to analyze these
languages[VB04].
The Polyspace tools rely on a technique referred to as abstract interpretation.
Abstract interpretation is a theory which formally constructs approximations of the
semantics of programming languages. It extends data-flow analysis by providing a
theoretical framework for mathematically justifying data-flow analyzers, designing
new data-flow analyses, and handling particular infinite sets of properties[Pil03].
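To make the idea concrete, the sketch below illustrates interval abstraction, one of the abstract domains commonly used by such analyzers: each variable is represented by a lower and upper bound rather than a concrete value. This is a toy illustration only, not the Polyspace algorithm; the interval_t type and the helper functions are hypothetical.

#include <stdio.h>

/* Toy interval domain: each variable is abstracted by the range [lo, hi]. */
typedef struct { long lo; long hi; } interval_t;

/* Abstract addition: the sum of two intervals contains every concrete sum. */
static interval_t interval_add(interval_t a, interval_t b)
{
    interval_t r = { a.lo + b.lo, a.hi + b.hi };
    return r;
}

/* Abstract check: could a value drawn from this interval ever be zero? */
static int may_be_zero(interval_t x)
{
    return (x.lo <= 0) && (x.hi >= 0);
}

int main(void)
{
    interval_t a = { 1, 10 };            /* a is known to lie in [1, 10]    */
    interval_t b = { -5, 5 };            /* b is only known to lie in [-5, 5] */
    interval_t sum = interval_add(a, b); /* therefore sum lies in [-4, 15]  */

    printf("sum in [%ld, %ld]\n", sum.lo, sum.hi);
    if (may_be_zero(sum))
    {
        printf("warning: a division by sum could be a division by zero\n");
    }
    return 0;
}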
Polyspace C analysis has suffered from scalability issues. Venet and Brat[VB04]
indicate that the C Verifier was limited to analyzing 20 to 40 KLOC in a given
instance, and this analysis took upwards of 8 to 12 hours to obtain 20% of the total
warnings within the source code. This requires overnight runs and batch processing,
making it difficult for software developers to determine whether their changes have
corrected the discovered problems[BDG+04]. Zitser et al. [ZLL04] discuss a case
in which Polyspace executed for four days to analyze a 145,000 LOC program before
aborting with an internal error. This level of performance is problematic for large
programs. Aliasing has also posed a significant problem for the Polyspace tool[BK03].
PREfix and PREfast
PREfix operates as a compile time tool that simulates a C or C++ program in
operation, catching runtime problems before the program actually executes, as well
as matching against a list of common logic and syntactic errors. PREfix performs extensive
deep flow static analysis and requires significant installation of both client and server
packages in order to operate. Current versions include both a database server and a
graphical user interface and are typically integrated into the main build process.
PREfix mainly targets memory errors such as uninitialized memory, buffer overflows,
NULL-pointer de-references, and memory leaks. PREfix is a path-sensitive
static analyzer that employs symbolic evaluation of execution paths. Path sensitivity
ensures that the program paths analyzed are only those that can actually be taken
during execution, which helps to reduce the number of false positives found during
static analysis. Path sensitivity, however, does cause problems: exponential path
blowup due to control constructs, and potentially infinite paths due to loops, can make
exhaustive analysis impractical. To avoid this problem, PREfix explores only a
representative set of paths, which can be configured by the user[BPS00]. PREfix running
time therefore scales linearly with program size due to the fixed cutoff on the number
of paths[Rai05].
The development of PREfix has shown that the bulk of errors come from interactions
between two or more procedures; thus, for maximum effectiveness, static
analyzers should be interprocedural[Rai05]. PREfix has also shown that in commercial
C and C++ code, approximately 90% of errors are attributable to the interaction
of multiple functions. Furthermore, these problems are only revealed under rare error
conditions.
PREfast is a simpler tool developed by Microsoft based upon the results of applying
PREfix to a large number of internal developments. This tool performs a
simpler intra-procedural analysis that detects fewer defects, has a higher noise rate,
and only generates local XML output files.
Nagappan and Ball[NB05] discuss the usage of PREfix and PREfast at Microsoft.
Microsoft has also developed the PREsharp defect detection tool for C#. PRE-
sharp performs analysis equivalent to PREfast on C# code.
SofCheck
SofCheck Inspector is a static analysis tool produced by SofCheck, Inc. for ana-
lyzing Java source code. The tool is designed to detect a large array of programming
errors including the misuse of pointers, array indices which go out of bounds, buffer
overruns, numeric overflows, numeric wraparounds, dimensional unit mismatch, stor-
age leaks, and the improper use of Application Programming Interfaces.
SofCheck works by thoroughly characterizing each element of the program in terms
of its inputs, outputs, heap allocations, preconditions, and postconditions. Preconditions
are based upon what needs to be present to prevent a run-time failure, and
postconditions are based upon every possible output when the element is executed.
Output is then provided as annotated source code in browsable HTML format. Included
within the annotated source code is the characterization of each method. SofCheck
also includes a built-in history system which allows regression testing on source code
to be conducted, such that the tool can verify that fixed faults are completely
removed and that no new faults are introduced.
Based upon the company's materials, SofCheck Inspector averages an analysis speed
of approximately 1000 lines per minute, depending upon the CPU speed, available RAM,
and complexity of the source code. Currently, SofCheck Inspector only works with
the Java programming language; however, work is underway to develop versions
compatible with Ada, C, C++, and C#.
KlocWork K7
Klocwork was derived from a Nortel Networks tool to evaluate massive code bases
used in telephone switches. As such, it efficiently detects access problems, denial of
service vulnerabilities, buffer overflows, injection flaws, DNS spoofing, ignored return
values, mobile code security injection flaws, broken session management, insecure
storage, cross-site scripting, unvalidated user input, improper error handling, and
broken access control. K7 can perform analysis based on Java source code and bytecodes,
which allows even third party libraries to be analyzed for possible defects.
CodeSonar
CodeSonar from GrammaTech, Inc. is a deep flow static analyzer for C and
C++ source code. The tool is capable of detecting many common programming
errors, including null-pointer de-references, divide-by-zeros, buffer overruns, buffer
underruns, double-frees, use-after-frees, and frees of non-heap memory.

CodeSurfer[Gra00][AT01] has been used by Ganapathy et al.[GJC+03] to detect
the presence of buffer overflows within C source code. In this work, CodeSurfer was
extended through the use of supplemental plugins to generate, analyze, and solve
constraints within the implemented source code. This extended tool was applied to
the wu-ftpd daemon as well as the sendmail program.
JiveLint
JiveLint by Sureshot Software is a static analysis tool for the Java programming
language. JiveLint has three fundamental goals: to improve source code quality by
pointing out dangerous source code constructs; to improve readability, maintainability
and debugging through enforced coding and naming conventions; and to communicate
knowledge of how to write high quality code. JiveLint is a stand-alone Windows
application which does not require the Java language to be installed and is Windows
95/98/ME/2000/NT/XP compatible. As a very modestly priced commercial product,
very little information on the analysis techniques is available for JiveLint.
3.1.2 Academic and Research Tools
C Global Surveyor
The NASA C Global Surveyor (CGS) project was intended to develop an efficient
static analysis tool. The tool was brought about to overcome deficiencies in existing
static analysis tools, such as the Polyspace C Verifier, which suffers significantly from
scalability issues[VB04]. This tool would then be used to reduce the occurrence of
runtime errors within NASA developed software.
To improve performance, CGS was designed to allow distributed processing for
larger projects, in which the analysis can run on multiple computers. CGS results are
reported to a centralized SQL database. While CGS can analyze any ISO C program,
its analysis algorithms have been precisely tuned for the Mars PathFinder programs.
This tuning results in a warning rate of less than 10%[VB04].
CGS has been applied to several NASA Jet Propulsion Laboratory projects, in-
cluding the Mars PathFinder mission (135K lines of code) and the Deep Space One
mission (280K lines of code). CGS is currently being extended to handle C++ pro-
grams for the Mars Science Laboratory mission. This requires significant advances in
the analysis of pointers in the context of dynamic data structures.
ESP
ESP is a method developed by the Program Analysis group at the Center for
Software Excellence of Microsoft for static detection of protocol errors in large C/C++
programs. ESP requires the user to develop a specification for the high level protocol
that the existing source code is intended to satisfy. The tool then compares the
behavior of the source code as implemented with the requisite specification. The
output is either a guarantee that the code satisfies the protocol, or a browsable list
of execution traces that lead to violations of the protocol.
ESP has been used by Microsoft to verify the I/O properties of the GNU C
compiler, approximately 150,000 lines of C code. ESP has also been used to validate
the Windows OS kernel for security vulnerabilities.
JLint
JLint is a static analysis program for the Java language initially written by Kon-
stantin Knizhnik and extended by Cyrille Artho[AB01][Art01]. JLint checks Java
code through the use of data flow analysis, abstract interpretation, and the construc-
tion of lock graphs.
JLint is designed as two separate programs which interact with each other during
analysis, the AntiC syntax analyzer and the JLint semantic analyzer[KA].
JLint has been applied to space exploration software by NASA Ames Research
Center and shown to be effective in all applications thus far. Details of this experience
are provided in Artho and Havelund[AH04]. Rutar et al. [RAF04] compare JLint
with several other analysis tools for Java. JLint has also been applied to large scale
industrial programs in [AB01].
Lint4j
Lint4j (“Lint for Java”) is a static analyzer that detects locking and threading
issues, performance and scalability problems, and checks complex contracts such as
Java serialization by performing type, data flow, and lock graph analysis. In many
regards, Lint4j is quite similar in scope to JLint. The checks within Lint4j represent
the most common problems encountered while implementing products designed for
performance and scalability. General areas of problems detected are based upon those
found in Monk et al.[MKBD00], Bloch[Blo01], Allen[All02], Larman et al.[LG99],
and Gosling et al.[GJSB00]. Lint4j is written in pure Java and will therefore
execute on any platform on which the Java JDK or JRE 1.4 has been installed.
Java PathFinder
The Java PathFinder (JPF) program is a static analysis model checking tool
developed by the Robust Software Engineering Group (RSE) at NASA Ames Research
Center and available under Open Source Licensing agreement from Sourceforge. This
software is an explicit-state model checker which analyzes Java bytecode classes for
deadlocks, assertion violations and general linear-time temporal logic properties. The
user can provide custom property classes and write listener-extensions to implement
other property checks, such as race conditions. JPF uses a custom Java Virtual
Machine to simulate execution of the programs during the analysis phase.
Java PathFinder has been applied by NASA Ames to several projects. Havelund
and Pressburger [HP00] discuss the general application of an early version of the Java
PathFinder tool. Brat et al. [BDG+04] provide a detailed description of the results
of applying Java PathFinder to Martian Rover software.
ESC-Java
The Extended Static Checker for Java (ESC/Java) was developed at the Compaq
Systems Research Center (SRC) as a tool for detecting common errors in Java pro-
grams such as null dereference errors, array bounds errors, type cast errors, and race
conditions. ESC/Java is neither sound nor complete.
ESC/Java uses program verification technology and includes an annotation
language which programmers can use to express design decisions as light-weight
specifications. ESC/Java checks each class and each routine separately, allowing
ESC/Java to be applied to code that references libraries without the need for library
source code.
The initial version of ESC/Java supported the Java 1.2 language set. ESC/Java2
is based upon the initial ESC/Java tool but has been modernized to support JML and
Java 1.4, as well as to support checking frame conditions, annotations containing
method calls, and additional static checks.
ESC/Java has proven very successful at analyzing programs which include
annotations from the very beginning. However, adding ESC/Java annotations to an
existing program has proven to be an error prone and daunting task. To alleviate some of
these difficulties, Compaq Systems Research Center developed a tool, referred to as
Houdini, to aid in annotating source code. Houdini infers a set of candidate ESC/Java
annotations for a given program. ESC/Java is then run on each candidate annotation
to verify or refute the validity of each candidate assumption, generating the
appropriate warnings as necessary. The Houdini tool is described in
Flanagan and Leino[FL01].
FindBugs
FindBugs is a lightweight static analysis tool for the Java language with a reputation
for uncovering common errors in Java programs[CM04]. FindBugs automatically
detects common programming mistakes through the use of “bug patterns”, which
are code idioms that commonly represent mistakes in software. General usage of the
FindBugs tool is described in Grindstaff[Gri04a]. In addition to the built in detectors
of the FindBugs program, the tool can be extended through the development of
customized detectors, as described in Grindstaff[Gri04b].
In practice, the rate of false warnings reported by FindBugs is generally less than
50%. Rutar et al. [RAF04] compare the results of using FindBugs versus other Java
tools and report similar results. Wagner et al. [WJKT05] generally concur with this
assessment as well.
3.2 Security Tools
3.2.1 Commercial Tools
ITS4
ITS4 is a static vulnerability scanner for the C and C++ languages developed by
Cigital. The tool was developed as a replacement for a series of grep scans on source
code used to detect security vulnerabilities as part of Cigital’s consulting practice.
Output from the tool includes a complete report of results as well as suggested fixes
for each uncovered vulnerability[VBKM00].
The analysis performed is quite fast from a performance standpoint, with an
analysis of the 57,000 LOC of sendmail-8.9.3 taking approximately 6 seconds. ITS4,
however, does suffer from its simplistic nature resulting in a significant number of
false positives. ITS4 has been applied to the Linux Kernel by Majors[Maj03].
Fortify SCA
Fortify SCA is a static analysis tool produced by Fortify Software aimed at aiding
in the validation of software from a security perspective. The core of the tool includes
the Fortify Global Analysis Engine. This consists of five different static analysis en-
gines which find violations of secure coding guidelines. The Data Flow analyzer is
responsible for tracking tainted input across application architecture tiers and pro-
gramming language boundaries. The Semantic Analyzer detects usage of functions
deemed to be vulnerable as well as the context of their usage. The control flow an-
alyzer tracks the sequencing of programming operations with the intent of detecting
incorrect coding constructs. The Configuration Analyzer detects vulnerabilities in
the interactions between the structural configuration of the program and the code
architecture. The Structural Analyzer identifies security vulnerabilities brought on
by the chosen code structures. These engines can also be expanded by writing custom
rules.
While for the purposes of this dissertation the focus is on Java, Fortify SCA supports
an extensive set of programming languages, including ASP.NET, C/C++, C#, Java,
JSP, PL/SQL, T-SQL, VB.NET, XML, and other .NET languages.
3.2.2 Academic and Research Tools
LCLint and SPLint
LCLint was a product of the MIT Lab for Computer Science and the DEC Research
Center and was designed to analyze a C program which has been annotated with
additional LCL formal specifications within the source code. In addition to detecting
many of the standard syntactical issues, LCLint detects violations of abstraction
boundaries, undocumented uses of global variables, undocumented modification of
state visible to clients, and missing initialization of an actual parameter or use of an
uninitialized formal parameter[EGHT94].
Splint is the successor to LCLint, as the focus was changed to include secure pro-
grams. The name is extracted from “SPecification Lint” and “Secure Programming
Lint”. Splint extends LCLint to include checking for de-referencing a null pointer, us-
ing possibly undefined storage or returning storage that is not properly defined, type
mismatches, violations of information hiding, memory management errors, danger-
ous aliasing, modifications and global variable uses inconsistent with specified inter-
faces, problematic control flow (likely infinite loops), fall through cases or incomplete
switches, suspicious statements, buffer overflow vulnerabilities, dangerous macro
implementations or invocations, and violations of customized naming conventions[EL03].
Splint has been compared to other dynamic tools in Hewett and DiPalma [HD03].
LCLint and Splint can only analyze C source code.
Flawfinder
Flawfinder was developed by David A. Wheeler to analyze C and C++ source
code for potential security flaws. Flawfinder is given a listing of target files to be
processed and generates a list of potential security flaws sorted on the basis of their
risk. As with most static analysis tools, Flawfinder generates
both false positives and false negatives as it scans the given source code[Whe04].
RATS
The Rough Auditing Tool for Security (RATS) is a basic lexical analysis tool for
C and C++, similar in operation to ITS4 and Flawfinder. As implied by its name,
RATS only performs a rough analysis of source code for security vulnerabilities and
will not find all errors. It is also hampered by flagging a significant number of false
positives[CM04].
SLAM
The SLAM project from Microsoft is designed to allow the safety verification of
C code. The Microsoft tool accomplishes this by placing a strong emphasis upon
verifying API usage rules. SLAM does not require the programmer to annotate
the source program, and it minimizes false positive error messages through a process
known as “counterexample-driven refinement”. The SLAM project is intended to check
temporal safety properties.

SLAM has been extensively used within Microsoft for the verification of Windows
XP device drivers; driver behavior, including the usage of kernel API calls, has been
checked using this tool[BR02]. The SLAM analysis engine is the core of Microsoft's
Static Driver Verifier (SDV), available in beta form as part of the Windows Software
Developers Kit.
MOPS
MOPS (MOdelchecking Programs for Security properties) was a tool developed
by Hao Chen in collaboration with David Wagner to find security bugs in C programs
and to verify compliance with rules of defensive programming. MOPS was targeted
towards developers of security-critical programs and auditors reviewing the security
of existing C code. MOPS was designed to check for violations of temporal safety
properties that dictate the order of operations in a sequence.
3.3 Style Checking Tools
3.3.1 Academic Tools
PMD
PMD, like JLint and FindBugs, is a static analysis tool for Java. However, unlike
these other tools, it does not contain a dataflow component as part of its analysis. In-
stead, it searches for stylistic conventions which occur in suspicious locations[RAF04].
PMD also includes the capability to detect near-duplicate code[Jel04].
PMD allows users to create extensions to the tool to detect additional bug pat-
terns. New bug patterns can be written in either Java or XPath[RAF04].
PMD is mainly concerned with infelicities of design or style. As such, it has a low
hit rate for detecting bugs. Furthermore, enabling all rule sets in PMD generates a
significant amount of noise relative to the number of real issues.
Checkstyle
Checkstyle is a Java style analyzer that verifies whether Java source code is compliant
with predefined stylistic rules. Similar to PMD, Checkstyle has a very low hit rate for
detecting bugs within Java software. However, it does spare code reviewers the tedious
effort of verifying coding standards compliance.
3.4 Teaching Tools
3.4.1 Commercial Tools
Safer C Toolkit
The Safer C toolset (SCT) was developed by Oakwood Computing Associates
based upon extensive analysis of the failure modes of C code and the 1995 publi-
cation Safer C: Developing for High-Integrity and Safety-Critical Systems
[Hat95], as well as feedback from teaching 2500 practicing engineers the concepts of
safer programming subsets. The key intent was to provide a tool which was both
educational to the user and practical for use with development projects.
3.4.2 Academic Tools
Gauntlet
The Gauntlet tool for Java has been developed by the United States Military
Academy for use in an introductory Information Technology course. The intent of
the tool was to act as a pre-compiler, statically analyzing the source code before it
is sent to the Java compiler and translating the top 50 common errors into layman’s
terms for the students. Gauntlet was developed based upon four years of background
teaching students introductory programming using Java[FCJ04].
Chapter 4
Proposed Software Reliability Model
This dissertation has thus far provided justification for a new software reliability
model. The first chapter provided a brief introduction to the problem as well as an
overview of the key objectives for this research. The second chapter provided numerous
case studies showing that catastrophic system failure can be attributed to software
faults. The third chapter introduced the concept of static analysis and provided a
literature survey of currently existing static analysis tools for the C, C++, and
Java languages. This chapter presents the relevant details of the proposed software
reliability model.
As software does not suffer from age related failure in the traditional sense, all
faults which lead to failure are present when the software is released. In a theoreti-
cal sense, if all faults can be detected in the released software, and these faults can
1Portions of this chapter appeared in Schilling and Alam[SA05a][SA06d][SA06b].
then be assigned a probability of manifesting themselves during software operation,
an appropriate estimation of the software reliability can be obtained. The difficulty
is reliably detecting the software faults and assigning the appropriate failure prob-
abilities. It is understood that it is impossible to prove that a computer program
is correct, as this problem is equivalent to the unsolvable Halting problem[Sip97].
However, it is believed that it is possible to develop a reliability model based upon
static analysis, limited testing, and a series of Bayesian Belief Networks which can be
used for assessing the reliability of existing modules.
4.1 Understanding Faults and Failures
It is often the case that the terms fault and failure are used interchangeably. This
is incorrect, as each term has a distinct and specific meaning. Unfortunately, sources
are not in agreement on the relationship between the two. Different models for this
relationship are shown in Table 4.1.
Table 4.1: Relationship between faults and failures in different models

Source | Model
ANSI / IEEE 729-1983 | error ⇒ fault ⇒ failure
Fenton | error ⇒ fault ⇒ failure
Shooman | fault ⇒ error ⇒ failure
IEC 1508 | fault ⇒ error ⇒ failure
Hatton | error ⇒ fault or defect or bug ⇒ failure
Nagappan, Ball, and Zeller[NBZ06] | defect ⇒ failure
Schilling | human error ⇒ fault ⇒ failure
For the purposes of this dissertation, a human makes a mistake during software
development, resulting in a software fault being injected into the source code. The
fault represents a static property of the source code. Faults are initiated by a software
developer making an error, either through omission or through some other developer
action.
The development of software is a labor intensive process, and as such, program-
mers make mistakes, resulting in faults being injected during development into each
and every software product. The majority of faults are injected during the imple-
mentation phase[NIS06]. The injection rate varies with the developer, the implementation
language chosen, and the software development process used. Boland[Bol02] reports
that the rate is approximately one defect for every ten lines of code developed,
and Hatton[Hat95] reports the best software as having approximately five defects
per thousand lines of code. These injected defects are removed through the software
development process, principally through review and testing.
A software failure is a dynamic property and represents an unexpected departure of
the software package from expected operational characteristics. If a piece of software
never executes, then it can not cause a failure. Software failures occur due to the
presence of one or more software faults being activated through a certain set of input
stimuli[Pai]. Any fault can potentially cause a failure of a software package, but not
all faults will cause a failure, as is shown graphically in Figure 4-1. Adams[Ada84],
as well as Fenton, Pfleeger, and Glass[FPG94] and Wagner[Wag04], indicate that,
on average, one third of all software faults manifest themselves as a failure only once
every 5000 years of execution, and only two percent of all faults lead to a MTTF of less
than 50 years. Downtime is not evenly distributed either, as it is suggested that about
90 percent of the downtime comes from at most 10 percent of the faults. From this, it
Figure 4-1: The Relationship between Faults and Failures.
follows that finding and removing a large number of defects does not necessarily yield
the highest reliability. Instead, it is important to focus on the faults that have a short
MTTF associated with them. Malaiya et al. [MLB+94] indicate that rarely executed
modules, such as error handlers and exception handlers, while rarely executed, are
notoriously difficult to test, and are highly critical to the resultant reliability for the
system.
4.1.1 Classifying Software Faults
Gray[Gra86] classifies software faults into two different categories, Bohrbugs and
Heisenbugs. Bohrbugs represent permanent design faults within software. Provided
that proper testing occurs during product development, most Bohrbugs can be
detected and easily removed from the product. Heisenbugs represent a class of
temporary faults which are random and intermittent in their occurrence. Typical sources
of Heisenbugs include memory exhaustion, race conditions and other timing related
issues, and exception handling.
Vaidyanathan and Trivedi [VT01] have extended this initial classification to include
a third category of software faults, “aging-related faults”. These faults are similar to
Heisenbugs in that they are random in occurrence; however, they are typically brought
on by prolonged execution of a given software program.
Figure 4-2: Venn Diagram classifying software bugs.
Grottke and Trivedi[GT05] have further refined the Vaidyanathan and Trivedi
model to better reflect the nature of software bugs. In this classification, there are
two major classes of software faults, Bohrbugs and Mandelbugs. Bohrbugs are
faults that are easily isolated and that manifest themselves consistently under a
well-defined set of conditions. Mandelbugs, the complementary set of bugs to Bohrbugs,
are faults whose activation and/or error propagation are complex. Typically,
Mandelbugs are difficult to isolate, as the failures they cause are not systematically
reproducible. Mandelbugs are divided into two subcategories, Heisenbugs and
aging-related bugs. Heisenbugs are faults that cease to cause failures or that manifest
themselves differently when one attempts to probe or isolate them. Aging-related
bugs are faults that lead to the accumulation of errors either inside the running
application or in its system-internal environment, resulting in an increased failure
rate and/or degraded performance with increasing time. This classification scheme is
shown graphically through the Venn diagram in Figure 4-2.
4.2 What Causes a Fault to Become a Failure
4.2.1 Example Statically Detectable Faults
The key to understanding software reliability based upon static analysis is to
understand what causes a fault to manifest itself as a failure and to be able to predict
which faults will likely manifest themselves as a failure.
In terms of faults and their density, little has been published classifying faults
by their frequency of occurrence. One of the most thorough studies of this was published
by Hatton[Hat95]. The QAC Clinic from Japan's Toyo Software Company[QAC98]
details the top five errors in Japanese embedded systems programming; however,
this publication does not provide failure rate information.
1: int32_t foo(int32_t a)
2: {
3: int32_t b;
4: if (a > 0)
5: {
6: b = a;
7: }
8: return ((b) ? 1 : 0);
9: }
Figure 4-3: Source code exhibiting uninitialized variable.
Uninitialized variables pose a significant problem for embedded source code. In
the ISO C language[ISO90][ISO99], by default, variables are not automatically
initialized when defined. For an automatic variable which is allocated either
on the stack or within a processor register, the value which was previously in that
location will be the value of the variable. Figure 4-3 shows an example of a function
which has a potential uninitialized variable. If a > 0, b is initialized to the value of a.
However, if a ≤ 0, the value of b is indeterminate, and therefore the return value of
the function is also indeterminate. This fault is statically detectable, yet it occurs
once every 250 lines of source code in Japanese programs and once every 840 lines
of code in US programs[QAC98]. If this function is executed with a ≤ 0, the resulting
behavior is entirely unpredictable.
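The fault in Figure 4-3 can be removed by guaranteeing that b receives a defined value on every path. The fragment below is a minimal sketch of one such repair; initializing b at its declaration is only one of several reasonable fixes.

#include <stdint.h>

int32_t foo(int32_t a)
{
    int32_t b = 0;           /* b now has a defined value on every path */

    if (a > 0)
    {
        b = a;
    }
    return ((b) ? 1 : 0);    /* well defined even when a <= 0 */
}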
Figure 4-4 provides another example of source code which contains statically
detectable faults.
1: typedef unsigned short uint16_t;
2: void update_average(uint16_t current_value);
3:
4: #define NUMBER_OF_VALUES_TO_AVERAGE (11u)
5:
6: static uint16_t data_values[NUMBER_OF_VALUES_TO_AVERAGE];
7: static uint16_t average = 0u;
8:
9: void update_average(uint16_t current_value)
10: {
11: static uint16_t array_offset = 0u;
12: static uint16_t data_sums = 0u;
13:
14: array_offset = ((array_offset++) % NUMBER_OF_VALUES_TO_AVERAGE);
15: data_sums -= data_values[array_offset];
16: data_sums += current_value;
17: average = (data_sums / NUMBER_OF_VALUES_TO_AVERAGE);
18: data_values[array_offset] = current_value;
19: }
Figure 4-4: Source code exhibiting statically detectable faults.
--- Module: buffer_overflow.c
array_offset = ((array_offset++) % NUMBER_OF_VALUES_TO_AVERAGE);
_
"*** LINT: buffer_overflow.c(14) Warning 564: variable
’array_offset’ depends on
order of evaluation [MISRA Rule 46]"
Figure 4-5: PC-Lint output for the buffer_overflow.c source file.
The intent of the code is to calculate the running average of an array of variables.
Variables are stored in a circular buffer data_values of length
NUMBER_OF_VALUES_TO_AVERAGE, defined at compile time to be 11. The average
of the stored values is kept in the variable average, the sum of all data values is stored
in the variable data_sums, and the current offset into the circular buffer is stored in
array_offset. Each time the routine is called, a 16 bit value is passed in representing
the current value that is to be added to the average. The array offset is incremented,
the previous value is removed from the data_sums variable, the new value is added
to the array and to data_sums, and the updated average is stored in average.
However, there are several potential problems with this simple routine associated with
the array_offset variable. The intent of the source code is to increment the offset by
one and then perform a modulus operation on this offset to place it within the range
of 0 to 10. Based upon the behavior of the compiler, this may or may not be the case.
If array_offset = 10, the resulting value of array_offset can be either 0 or 11. The value
will be 0 if the postfix increment operator (++) is executed before the modulus operation
occurs. However, if the compiler chooses to implement the logic so that the postfix
increment occurs after the modulus operation, the array_offset variable will have a
value of 11.
If array_offset is set to 11, the execution of line 15 results in an out of bounds
access for the array. In the C language, reading from outside of the array will not, in
general, cause a processor exception. However, the value read is invalid, and if the
offset value being subtracted is larger than the current data_sums value, the unsigned
data_sums variable will wrap around to a very large value. Line 18 may result in the
average value being overwritten. Depending upon how the compiler organizes RAM,
the average variable may be the next variable in RAM following the data_values array.
If this is the case, writing to data_values[11] will result in the average value being
overwritten. The exact behavior will depend upon the compiler word alignment, array
padding, and other implementation behaviors. This behavior can vary from one
compiler to another, from one compiler version to another, or depend upon compiler
options passed on the command line, especially if compiler optimization is used.
From a software reliability standpoint, the probability of failure associated with
this construct is easy to verify through testing. So long as the code has been exercised
through this transition point and proper behavior has been obtained, proper behavior
will continue until the code is recompiled, a different compiler version is used, or the
compiler is changed. Code constructs like this, however, do make code portability
difficult.
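For reference, one way the routine of Figure 4-4 could be rewritten so that it no longer depends on the order of evaluation of the postfix increment is sketched below; this is an illustrative repair, not the production fix.

#include <stdint.h>

#define NUMBER_OF_VALUES_TO_AVERAGE (11u)

static uint16_t data_values[NUMBER_OF_VALUES_TO_AVERAGE];
static uint16_t average = 0u;

void update_average(uint16_t current_value)
{
    static uint16_t array_offset = 0u;
    static uint16_t data_sums = 0u;

    /* Increment and wrap in a single well-defined expression; array_offset
       now always stays within the range 0 to 10. */
    array_offset = (uint16_t)((array_offset + 1u) % NUMBER_OF_VALUES_TO_AVERAGE);

    data_sums = (uint16_t)(data_sums - data_values[array_offset]);
    data_sums = (uint16_t)(data_sums + current_value);
    average = (uint16_t)(data_sums / NUMBER_OF_VALUES_TO_AVERAGE);
    data_values[array_offset] = current_value;
}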
Figure 4-6 exhibits another statically detectable fault, namely the potential to
1: #define NUMBER_OF_CONNECTIONS (18u)
2:
3: void ListenerClass::cmdServer (ListenerClass *This)
4: {
5: int i;
6: ProcesorConnection nCon[NUMBER_OF_CONNECTIONS];
7: ProcesorConnection *slot = nCon;
8:
9: // Init the connections
10: for ( i = 0; i < NUMBER_OF_CONNECTIONS; i++ )
11: {
12: nCon[i].setOwner((void*)This);
13: }
14:
15: while ( _instance->cmdServe.accept( *slot ) == OK )
16: {
17: // Validate connection
18: if ( This->validateClient( slot ) == ERROR )
19: {
20: // Connection rejected
21: slot->close_connection();
22: continue;
23: }
24: // Find a new unused slot
25: slot = nCon;
26: while ((slot != (&nCon[NUMBER_OF_CONNECTIONS])) && (slot->isActive()))
27: {
28: slot++;
29: }
30: if ( slot == (&nCon[NUMBER_OF_CONNECTIONS]) )
31: {
32: log (10, "Command listener overloaded", 0,0,0,0,0,0);
33: }
34: }
35:
36: for ( i = 0; i < NUMBER_OF_CONNECTIONS; i++ )
37: {
38: nCon[i].close_connection ();
39: }
40: }
Figure 4-6: Source exhibiting loop overflow and out of bounds array access.
de-reference outside of an array. In this case, if all of the connection slots are busy,
an error message is printed noting this condition. The slot pointer will then be
pointing one element beyond the end of the array when the code returns to execute
line 15. When line 15 executes, a de-reference beyond the end of the array will
occur. Depending upon the behavior of the cmdServe.accept(*slot) function, which
is unknown based upon the code segment provided, this can result in data corruption
outside of the given array. In a worst case scenario, this behavior could result in
an infinite loop which never terminates. The fault present within this code can be
detected by static analysis tools.
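A defensive repair is straightforward to sketch. One option, shown below using the identifiers from Figure 4-6, is simply to stop accepting connections when no free slot exists, so that slot can never be dereferenced past the end of nCon; this is an illustration of the idea rather than the actual fix applied to the system.

// Replacement for the slot search in Figure 4-6 (sketch only): find a new
// unused slot, and leave the accept loop when the array is full.
slot = nCon;
while ((slot != (&nCon[NUMBER_OF_CONNECTIONS])) && (slot->isActive()))
{
    slot++;
}
if (slot == (&nCon[NUMBER_OF_CONNECTIONS]))
{
    log (10, "Command listener overloaded", 0,0,0,0,0,0);
    break;   // stop accepting rather than dereferencing beyond the array
}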
4.2.2 When Does a Fault Manifest Itself as a Failure
In order to use static analysis for reliability prediction, it is important to under-
stand what causes these faults to become failures. In reliability growth modeling, one
of the most important parameters is known as the fault exposure ratio (FER). The
parameter represents the average detectability of a fault within software, or in other
words, the relationship between faults and failures. Naixin and Malaiya[NM96] and
von Mayrhauser and Srimani[MvMS93] discuss both the calculation of this parame-
ter as well as its meaning to a software module. This parameter, however, is entirely
black-box based, and does not help relate faults to failures at a detailed level.
There are many reasons why a fault lies dormant and does not manifest itself as
a failure. The first, and most obvious, deals with code coverage. If a fault is never
executed, it cannot lead to failure. While this is intuitively obvious, determining
whether a fault can be executed can be quite complicated and require significant
analysis.
Figure 4-7 provides an example of this complexity. The intent of the code
is to calculate the distance from a current number to the next prime number. These
types of calculations are often used in random number generation, such as in the linear
congruential random number generator, which uses the equation

I_k = (a × I_{k−1} + c) mod m    (4.1)
with a representing the multiplier, c representing the increment, and m representing
the modulus.
Depending upon the exact algorithm used, the values selected for a and c may depend
upon the distance from m to the next prime number.
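As a concrete illustration, equation (4.1) can be implemented in a few lines of C; the sketch below uses the widely published Numerical Recipes constants (a = 1664525, c = 1013904223, m = 2^32), which are shown only as an example.

#include <stdint.h>

/* Linear congruential generator implementing I_k = (a * I_{k-1} + c) mod m.
   With m = 2^32, the modulus is obtained for free from uint32_t wrap-around. */
static uint32_t lcg_state = 1u;   /* I_0, the seed */

uint32_t lcg_next(void)
{
    lcg_state = (1664525u * lcg_state) + 1013904223u;
    return lcg_state;
}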
There is, however, a statically detectable problem with this implementation. The
code begins on line 6 by checking to make certain that the number is greater than 0.
If this is the case, the code steps through a set of if and else if clauses, looking
for the smallest prime number which is greater than the value passed in. Once this
has been found, the t_next_prime_number variable is set to this value, and on line 69
the calculation

t_return_value = t_next_prime_number − p_number    (4.2)

is executed. However, there is a problem. If

p_number ≥ 127,    (4.3)

then

t_next_prime_number = 0,    (4.4)

resulting in

t_return_value < 0.    (4.5)
Since t_return_value is a uint8_t, however, t_return_value will take on a very large
positive value. It is important to note that this statically detectable fault escaped
testing even though 100% statement coverage had been achieved, as is shown in Figure
4-8.
There are several different ways we can assess the probability of this fault
manifesting itself. If we base the probability on the size of the input space, there are 256
possible inputs to the function, ranging from 0 to 255. The failure will occur any
time

127 ≤ p_number < 256    (4.6)

resulting in

p_f = 129/256 = 0.50390625.    (4.7)
However, if the input domain is limited to the domain D such that

D = {x ∈ N : x ≤ 127}    (4.8)

then the failure probability is reduced to

p_f = 1/128 = 0.0078125.    (4.9)
If the input domain is further reduced to the domain D such that

D = {x ∈ N : x ≤ 100}    (4.10)

then

p_f = 0/101 = 0.    (4.11)
This complexity partly explains why faults lie dormant for such long periods and why
many faults only surface when a change is made to the software. If an initial program
using this software never passes a value greater than 100 to the routine, it will never
fail. But if a change is made and a value of 128 can now be passed in, the failure is
more likely to surface. A change in value to include up to 255 for the input virtually
guarantees that the failure will surface.
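This behavior is easy to demonstrate with a small harness around the function of Figure 4-7; the sketch below is illustrative only, but it shows how the unsigned subtraction wraps for an input of 127.

#include <stdio.h>
#include <stdint.h>

uint8_t calculate_distance_to_next_prime_number(uint8_t p_number);

int main(void)
{
    /* 113 falls into the "< 127" branch, so the result is the expected 127 - 113 = 14. */
    printf("f(113) = %u\n", (unsigned)calculate_distance_to_next_prime_number(113u));

    /* 127 falls through every branch, t_next_prime_number remains 0, and the
       subtraction wraps in the uint8_t return value: 0 - 127 becomes 129. */
    printf("f(127) = %u\n", (unsigned)calculate_distance_to_next_prime_number(127u));

    return 0;
}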
If we instead measure failure in terms of the number of paths through the program
which can cause failure, a control flow graph can be generated for the source code, as
is shown in Figure 4-9. From this graph, the static path count through the method
can be calculated. In this case, there are 33 distinct paths through the function; one
path will cause a failure, resulting in p_f = 1/33 ≈ 0.0303 if all paths are assumed to
execute equally.
Uninitialized variables, an example of which is given in Figure 4-10, represent
the second most prevalent problem in Japanese source code. In this case, there are
seven distinct paths through the source code, yielding a static path count of 7. Six
of these paths do not contain any statically detectable faults. However, the seventh
path fails to initialize a function pointer, resulting in the program jumping to an
unknown address and most likely crashing. Assuming that each of these paths has an
equal probability of executing, p_f = 1/7 ≈ 0.1428.
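Figure 4-10 is not reproduced here, but the shape of the fault it illustrates is easy to sketch. The function below is a hypothetical example of the same class of problem, an uninitialized function pointer, and is not the code from the figure.

#include <stdio.h>

typedef void (*handler_t)(void);

static void handle_start(void) { printf("start\n"); }
static void handle_stop(void)  { printf("stop\n");  }

void dispatch(int event)
{
    handler_t handler;            /* no default value assigned */

    if (event == 1)
    {
        handler = handle_start;
    }
    else if (event == 2)
    {
        handler = handle_stop;
    }
    /* For any other event, handler is never initialized. */

    handler();                    /* jumps to an indeterminate address for other events */
}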
The very presence of these problems may or may not immediately result in a
failure. Returning a larger than expected number from a mathematical function may
not immediately result in a software failure. A random jump to a pointer address in
C will most likely result in an immediate and noticeable failure; overwriting the stack
return address is likely to result in the same behavior. Thus, for a fault to manifest
itself as a failure, the code which contains the fault must first execute, and then the
result of the fault must be used in a manner that will result in a failure occurring.
1: #include <stdint.h>
2: /* This routine will calculate the distance from the current number to the next prime number. */
3: uint8_t calculate_distance_to_next_prime_number(uint8_t p_number) {
4: int8_t t_next_prime_number = 0;
5: uint8_t t_return_value = 0;
6: if (p_number > 0) {
7: if (p_number < 2)
8: {t_next_prime_number = 2; }
9: else if (p_number < 3)
10: { t_next_prime_number = 3; }
11: else if (p_number < 5)
12: { t_next_prime_number = 5; }
13: else if (p_number < 7)
14: { t_next_prime_number = 7; }
15: else if (p_number < 11)
16: { t_next_prime_number = 11; }
17: else if (p_number < 13)
18: { t_next_prime_number = 13; }
19: else if (p_number < 17)
20: { t_next_prime_number = 17; }
21: else if (p_number < 19)
22: { t_next_prime_number = 19; }
23: else if (p_number < 23)
24: { t_next_prime_number = 23; }
25: else if (p_number < 29)
26: { t_next_prime_number = 29; }
27: else if (p_number < 31)
28: { t_next_prime_number = 31; }
29: else if (p_number < 37)
30: { t_next_prime_number = 37; }
31: else if (p_number < 41)
32: { t_next_prime_number = 41; }
33: else if (p_number < 43)
34: { t_next_prime_number = 43; }
35: else if (p_number < 47)
36: { t_next_prime_number = 47; }
37: else if (p_number < 53)
38: { t_next_prime_number = 53; }
39: else if (p_number < 59)
40: { t_next_prime_number = 59; }
41: else if (p_number < 61)
42: { t_next_prime_number = 61; }
43: else if (p_number < 67)
44: { t_next_prime_number = 67; }
45: else if (p_number < 71)
46: { t_next_prime_number = 71; }
47: else if (p_number < 73)
48: { t_next_prime_number = 73; }
49: else if (p_number < 79)
50: { t_next_prime_number = 79; }
51: else if (p_number < 83)
52: { t_next_prime_number = 83; }
53: else if (p_number < 89)
54: { t_next_prime_number = 89; }
55: else if (p_number < 97)
56: { t_next_prime_number = 97; }
57: else if (p_number < 101)
58: { t_next_prime_number = 101; }
59: else if (p_number < 103)
60: { t_next_prime_number = 103; }
61: else if (p_number < 107)
62: { t_next_prime_number = 107; }
63: else if (p_number < 109)
64: { t_next_prime_number = 109; }
65: else if (p_number < 113)
66: { t_next_prime_number = 113; }
67: else if (p_number < 127)
68: { t_next_prime_number = 127; }
69: t_return_value = t_next_prime_number - p_number;
70: }
71: return t_return_value;
72: }
Figure 4-7: Source exhibiting statically detectable mathematical error.
File ‘prime_number_example1.c’
Lines executed:100.00% of 69
prime_number_example1.c:creating ‘prime_number_example1.c.gcov’
-: 1:#include <stdint.h>
-: 2:uint8_t calculate_distance_to_next_prime_number(uint8_t p_number);
function calculate_distance_to_next_prime_number called 1143 returned 100% blocks executed 100%
1143: 3:uint8_t calculate_distance_to_next_prime_number(uint8_t p_number) {
1143: 4: int8_t t_next_prime_number = 0;
1143: 5: uint8_t t_return_value;
1143: 6: if (p_number > 0) {
1142: 7: if (p_number < 2)
118: 8: {t_next_prime_number = 2; }
1024: 9: else if (p_number < 3)
119: 10: {t_next_prime_number = 3;}
905: 11: else if (p_number < 5)
220: 12: {t_next_prime_number = 5;}
685: 13: else if (p_number < 7)
152: 14: {t_next_prime_number = 7;}
533: 15: else if (p_number < 11)
134: 16: {t_next_prime_number = 11;}
399: 17: else if (p_number < 13)
55: 18: {t_next_prime_number = 13;}
344: 19: else if (p_number < 17)
72: 20: {t_next_prime_number = 17;}
272: 21: else if (p_number < 19)
23: 22: {t_next_prime_number = 19;}
249: 23: else if (p_number < 23)
29: 24: {t_next_prime_number = 23;}
220: 25: else if (p_number < 29)
55: 26: {t_next_prime_number = 29;}
165: 27: else if (p_number < 31)
17: 28: {t_next_prime_number = 31;}
148: 29: else if (p_number < 37)
29: 30: {t_next_prime_number = 37;}
119: 31: else if (p_number < 41)
13: 32: {t_next_prime_number = 41;}
106: 33: else if (p_number < 43)
6: 34: {t_next_prime_number = 43;}
100: 35: else if (p_number < 47)
14: 36: {t_next_prime_number = 47;}
86: 37: else if (p_number < 53)
16: 38: {t_next_prime_number = 53;}
70: 39: else if (p_number < 59)
13: 40: {t_next_prime_number = 59;}
57: 41: else if (p_number < 61)
8: 42: {t_next_prime_number = 61;}
49: 43: else if (p_number < 67)
6: 44: {t_next_prime_number = 67;}
43: 45: else if (p_number < 71)
5: 46: {t_next_prime_number = 71;}
38: 47: else if (p_number < 73)
6: 48: {t_next_prime_number = 73;}
32: 49: else if (p_number < 79)
5: 50: {t_next_prime_number = 79;}
27: 51: else if (p_number < 83)
4: 52: {t_next_prime_number = 83;}
23: 53: else if (p_number < 89)
8: 54: {t_next_prime_number = 89;}
15: 55: else if (p_number < 97)
5: 56: {t_next_prime_number = 97;}
10: 57: else if (p_number < 101)
1: 58: {t_next_prime_number = 101;}
9: 59: else if (p_number < 103)
2: 60: {t_next_prime_number = 103;}
7: 61: else if (p_number < 107)
2: 62: {t_next_prime_number = 107;}
5: 63: else if (p_number < 109)
2: 64: {t_next_prime_number = 109;}
3: 65: else if (p_number < 113)
1: 66: {t_next_prime_number = 113;}
2: 67: else if (p_number < 127)
2: 68: {t_next_prime_number = 127;}
1142: 69: t_return_value = t_next_prime_number - p_number;
-: 70: }
1143: 71: return t_return_value;
-: 72:}
Figure 4-8: GNU gcov output from testing prime number source code.
Figure 4-9: Control flow graph for calculate_distance_to_next_prime_number method.
1 #include "interface.h"
2
3 static uint16_t test_active_flags;
4 static uint16_t test_done_flags;
5
6 void do_walk(void) {
7 uint8_t announce_param;
8 function_ptr_type test_param;
9 if (TEST_BIT(test_active_flags, DIAG_TEST)) {
10 if (check_for_expired_timer(TIME_IN_SPK_TEST) == EXP) {
11 start_timer();
12 if (TEST_BIT(test_active_flags, RF_TEST)) {
13 announce_param = LF_MESSAGE;
14 test_param = LF_TEST;
15 SETBIT_CLRBIT(test_active_flags, LF_TEST, RF_TEST);
16 } else if (TEST_BIT(test_active_flags, LF_TEST)) {
17 announce_param = LR_MESSAGE;
18 test_param = LR_TEST;
19 SETBIT_CLRBIT(test_active_flags, LR_TEST, LF_TEST);
20 } else if (TEST_BIT(test_active_flags, LR_TEST)) {
21 announce_param = RR_MESSAGE;
22 test_param = RR_TEST;
23 SETBIT_CLRBIT(test_active_flags, RR_TEST, LR_TEST);
24 } else if ((TEST_BIT(test_active_flags, RR_TEST)) &&
25 (get_ap_state(AUK_STATUS) != UNUSED_AUK)) {
26 announce_param = SUBWOOFER_MESSAGE;
27 test_param = AUX1_TEST;
28 SETBIT_CLRBIT(test_active_flags, SUBWOOFER1_TEST, RR_TEST);
29 } else {
30 announce_param = EXIT_TEST_MESSAGE;
31 CLRBIT(test_active_flags, DIAG_TEST);
32 SETBIT(test_done_flags, DIAG_TEST);
33 }
34 make_announcements(announce_param);
35 *test_param();
36 }
37 }
38 }
Figure 4-10: Source exhibiting uninitialized variable.
Figure 4-11: Control flow graph for do_walk method.
4.3 Measuring Code Coverage
The first and most important factor in determining whether a fault will become
a failure is source code coverage. If a fault is never encountered during execution,
it cannot result in a failure. In many software systems, especially embedded
systems, the percentage of code which routinely executes is actually quite small, and
the majority of the execution time is spent covering the same lines over and over
again. Embedded systems are also typically designed around a few repetitive tasks that execute
periodically at similar rates. Thus, with limited testing covering the normal use cases
for the system, information about the “normal” execution paths can be obtained.
There are many different metrics and measurements associated with code coverage.
Kaner [Kan95] lists 101 different coverage metrics that are available. Statement
Coverage (also known as line coverage or segment coverage) measures whether each
executable statement is encountered. Block coverage is an extension of statement
coverage in which the unit of code is a sequence of non-branching statements.
Decision Coverage (also known as branch coverage, all-edges coverage, basis path
coverage, or decision-decision-path testing) reports whether boolean expressions
tested in control structures evaluated to both true and false. Condition Coverage
reports whether every possible combination of boolean sub-expressions occurred.
Condition/Decision Coverage is a hybrid measure composed of the union of condition
coverage and decision coverage. Path Coverage (also known as predicate coverage)
reports whether each of the possible paths in each function has been followed, a path
being a unique sequence of branches from the function entry to the exit. Data Flow
Coverage, a variation of path coverage, considers the sub-paths from variable
assignments to subsequent references of the variables. Function Coverage reports
whether each function or procedure was executed; it is useful during preliminary
testing to assure at least some coverage in all areas of the software. Call Coverage
(also known as call pair coverage) reports whether each function call has been made.
Loop Coverage measures whether each loop body is executed zero times, exactly once,
and more than once (consecutively). Race Coverage reports whether multiple threads
execute the same code at the same time and is used to detect failures to synchronize
access to resources. In many cases, coverage definitions overlap each other: decision
coverage includes statement coverage, Condition/Decision Coverage includes Decision
Coverage and Condition Coverage, Path Coverage includes Decision Coverage, and
Predicate Coverage includes Path Coverage and Multiple Condition Coverage [Cor05].
Marick [Mar99] cites some of the misuses of code coverage metrics. A certain
level of code coverage is often mandated by the software development process when
evaluating the effectiveness of the testing phase, and this mandated level varies widely.
Extreme Programming advocates endorse 100% method coverage in order to ensure
that all methods are invoked at least once, though exceptions are given for small
functions which are smaller than the test cases would be [Agu02][JBl]. Piwowarski,
Ohba, and Caruso [POC93] indicate that 70% statement coverage is necessary to
ensure sufficient test case coverage, that 50% statement coverage is insufficient to
exercise the module, and that going beyond 70%-80% is not cost effective. Hutchins
et al. [HFGO94] indicate that even 100% coverage is not necessarily a good indication
of testing adequacy: though more faults are discovered at 100% coverage than at 90%
or 95% coverage, faults can still remain undetected even when testing has reached
100% coverage.
There has been significant study of the relationship between code coverage and
the resulting reliability of the source code. Garg [Gar95] [Gar94] and Del Frate
[FGMP95] indicate that there is a strong correlation between code coverage obtained
during testing and software reliability, especially in larger programs. The exact extent
of this relationship, however, is unknown.
4.3.1 The Static Analysis Premise
The fundamental premise behind this model is that the resulting software relia-
bility can be related to the statically detectable faults present within the source code,
the number of paths which lead to the execution of the statically detectable faults,
and the rate of execution for each segment of the software.
To model reliability, the source code is first divided into statement blocks. A
statement block represents a contiguous block of source code instructions which is
uninterrupted by a conditional statement. By using this organization, the source
code is translated into a set of blocks connected by decisions. Statically detectable
faults can then be assigned into the appropriate block.
Figure 4-12 provides example source code for an embedded system timer routine
which verifies if a timer has or has not expired. Figure 4-13 shows a language transla-
tion of the code into Java. This source code can be decomposed into the block format
shown in Figure 4-14.
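A minimal sketch of this decomposition is given below (in Java; the class and field names are illustrative only and are not the data structures used by the tools described later). Each statement block records the statically detectable faults assigned to it, and decisions connect blocks into a graph.

import java.util.ArrayList;
import java.util.List;

// Illustrative structures only: a contiguous run of statements (a block),
// the faults assigned to it, and the decisions that connect blocks.
class StatementBlock {
    final String name;                                     // e.g. "A", "B", ... as in Figure 4-14
    final List<String> staticFaults = new ArrayList<>();   // statically detectable faults in this block
    final List<Decision> outgoing = new ArrayList<>();     // decisions leaving this block

    StatementBlock(String name) {
        this.name = name;
    }
}

class Decision {
    final StatementBlock target;   // block executed when this branch is taken
    final boolean taken;           // true branch or false branch of the condition

    Decision(StatementBlock target, boolean taken) {
        this.target = target;
        this.taken = taken;
    }
}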
1 #include <stdint.h>
2 typedef enum {FALSE, TRUE} boolean;
3
4 extern uint32_t get_current_time(void);
5
6 typedef struct {
7 uint32_t starting_time; /* Starting time for the system */
8 uint32_t timer_delay; /* Number of ms to delay */
9 boolean enabled; /* True if timer is enabled */
10 boolean periodic_timer; /* TRUE if the timer is periodic. */
11 } timer_ctrl_struct;
12
13 boolean has_time_expired(timer_ctrl_struct p_timer)
14 {
15 boolean t_return_value = FALSE;
16 uint32_t t_current_time;
17 if (p_timer.enabled == TRUE)
18 {
19 t_current_time = get_current_time();
20 if ((t_current_time > p_timer.starting_time) &&
21 ((t_current_time - p_timer.starting_time) > p_timer.timer_delay))
22 {
23 /* The timer has expired. */
24 t_return_value = TRUE;
25 }
26 else if ((t_current_time < p_timer.starting_time) &&
27 ((t_current_time + (0xFFFFFFFFu - p_timer.starting_time)) > p_timer.timer_delay))
28 {
29 /* The timer has expired and wrapped around. */
30 t_return_value = TRUE;
31 }
32 else
33 {
34 /* The timer has not yet expired. */
35 t_return_value = FALSE;
36 }
37 if (t_return_value == TRUE)
38 {
39 if (p_timer.periodic_timer == TRUE )
40 {
41 p_timer.starting_time = t_current_time;
42 }
43 else
44 {
45 p_timer.enabled = FALSE;
46 p_timer.starting_time = 0;
47 p_timer.periodic_timer = FALSE;
48 }
49 }
50 }
51 else
52 {
53 /* Timer is not enabled. */
54 }
55 return t_return_value;
56 }
Figure 4-12: Source code which determines if a timer has expired.
The reliability for each block is assessed using a Bayesian Belief Network, which is
described in detail in the next chapter. The Bayesian Belief network uses an analysis
of the fault locations, fault characteristics, historical data from past projects, fault
taxonomy data, and other parameters to determine if the given fault is either a valid
statically detected fault or a false positive fault.
1 public class sample_timer {
2 public class timer_ctrl_struct {
3 public int starting_time; /* Starting time for the system */
4 public int timer_delay; /* Number of ms to delay */
5 public boolean enabled; /* True if timer is enabled */
6 public boolean periodic_timer; /* TRUE if the timer is periodic. */
7 };
8
9 boolean has_time_expired(timer_ctrl_struct p_timer) {
10 boolean t_return_value = false;
11 int t_current_time;
12 if (p_timer.enabled == true) {
13 t_current_time = (int)(System.currentTimeMillis() % 0xFFFFFFFF);
14 if ((t_current_time > p_timer.starting_time)
15 && ((t_current_time - p_timer.starting_time) > p_timer.timer_delay)) {
16 /* The timer has expired. */
17 t_return_value = true;
18 } else if ((t_current_time < p_timer.starting_time)
19 && ((t_current_time + (0xFFFFFFFF - p_timer.starting_time)) > p_timer.timer_delay)) {
20 /* The timer has expired and wrapped around. */
21 t_return_value = true;
22 } else {
23 /* The timer has not yet expired. */
24 t_return_value = false;
25 }
26 if (t_return_value == true) {
27 if (p_timer.periodic_timer == true) {
28 p_timer.starting_time = t_current_time;
29 } else {
30 p_timer.enabled = false;
31 p_timer.starting_time = 0;
32 p_timer.periodic_timer = false;
33 }
34 }
35 } else {
36 /* Timer is not enabled. */
37 }
38 return t_return_value;
39 }
40 }
Figure 4-13: Translation of timer expiration routine from C to Java. Note that nothing has changed other than the implementation language. The algorithm is exactly the same.
Figure 4-14: Flowchart for check_timer routine.
Integration of Code Coverage Into the Model
Simply using faults to model reliability is insufficient, for the faults must be acti-
vated through execution before manifesting themselves as a failure.
Table 4.2: Discrete Paths through sample function

Path     Uniform Path  Uniform Conditional                Uniform Path with  Uniform Conditional
         Execution     Logic                              Value Tracking     with Value Tracking
A        .10           .5000                              1/6 = 0.166        0.5000
B→C      .10           .5 · .75 · .75 · .5 = .1406        1/6 = 0.166        .5 · .75 · .75 · 1.0 = 0.2812
B→D      .10           .5 · .75 · .25 · .5 = .0468        0/6 = 0.000        .5 · .75 · .25 · 0.0 = 0.0000
B→E      .10           .5 · .25 · .5 = .0625              0/6 = 0.000        .5 · .25 · 0.0 = 0.0000
B→C→F    .10           .5 · .75 · .75 · .5 · .5 = .0703   0/6 = 0.000        .5 · .75 · .75 · 0.0 · .5 = 0.0000
B→D→F    .10           .5 · .75 · .25 · .5 · .5 = .0234   1/6 = 0.166        .5 · .75 · .25 · 1.0 · .5 = 0.0468
B→E→F    .10           .5 · .25 · .5 · .5 = .0312         1/6 = 0.166        .5 · .25 · 1.0 · .5 = 0.0625
B→C→G    .10           .5 · .75 · .75 · .5 · .5 = .0703   0/6 = 0.000        .5 · .75 · .75 · 0.0 · .5 = 0.0000
B→D→G    .10           .5 · .75 · .25 · .5 · .5 = .0234   1/6 = 0.166        .5 · .75 · .25 · 1.0 · .5 = 0.0468
B→E→G    .10           .5 · .25 · .5 · .5 = .0312         1/6 = 0.166        .5 · .25 · 1.0 · .5 = 0.0625
The simplest method for establishing code coverage using the model given would
be to assume that all paths through the method are executed with equal probability.
For example, the function diagrammed in Figure 4-14 has ten possible paths through
the source code, as is shown in Table 4.2. Using this simplest method, each path would
have a probability p_p = 1/10 = 0.10 of executing. From this method, we can then
calculate a reliability for the given function. However, it is known empirically that
this assumption of uniform path coverage is incorrect. Many functions contain fault
tolerance logic which rarely executes. Other functions contain source code which,
given the calling parameters, is never executed.
The next natural refinement of this probabilistic assignment would be to look at
the discrete decisions which cause the execution of each path through the source code.
To use this methodology, we assume that each conditional statement used to make a
decision has an equal probability of being true or false. Thus, the statement

if (p_timer.enabled == TRUE)

has a probability of p_i = .50 of taking the if condition and a probability p_e = .50 of
taking the else condition. Using this same logic, the statement

if ((t_current_time > p_timer.starting_time) &&
    ((t_current_time - p_timer.starting_time) > p_timer.timer_delay))

has a probability of

p_i = p(C1 = TRUE) · p(C2 = TRUE) = 0.5 · 0.5 = 0.25    (4.12)

of taking the if condition and a probability of .75 of taking the else condition. We
refer to this measure as uniform conditional logic, and when applying it to Figure
4-14, we obtain the third column in Table 4.2.
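Under this assumption, the probability of a path is simply the product of the assumed probabilities of the decisions taken along it. A minimal sketch follows (in Java; the method and class names are illustrative, and the hard-coded values are the branch probabilities for the B→C path of Figure 4-14).

public class UniformConditionalLogic {
    // Probability of a path = product of the assumed probabilities of each
    // decision taken along it (0.5 per simple condition, 0.25/0.75 for the
    // two-term && conditions in the timer routine).
    static double pathProbability(double... branchProbabilities) {
        double p = 1.0;
        for (double branch : branchProbabilities) {
            p *= branch;
        }
        return p;
    }

    public static void main(String[] args) {
        // Path B -> C: timer enabled (.5), first compound condition false (.75),
        // second compound condition false (.75), t_return_value == TRUE false (.5).
        System.out.println(pathProbability(0.5, 0.75, 0.75, 0.5));  // 0.140625
    }
}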
While this method does generate a valid distribution for each path through the
source code, it fails to take into account the dependencies present within the function.
For example, blocks D and E of the source code execute the statement
“t_return_value = TRUE;”. Thus, if block D or E is ever visited within the function,
then we are guaranteed to visit block F or G. Block C sets “t_return_value = FALSE;”;
thus, if we ever visit block C, we are guaranteed not to visit block F or G. Making
these changes yields the fourth and fifth columns in Table 4.2.
One problem with value tracking is that it can be difficult to reliably track variable
dependencies if different variables are used but the values are assigned elsewhere in the
code, or if there are multiple aliases used to refer to a single variable.
This method is also problematic in that paths which encounter fewer decisions
have a higher probability of executing. It is known from studying developed source
code, however, that this assumption is not always true. In many instances, the first
logical checks within a function are for fault tolerance purposes (e.g., checking for
NULL pointers or invalid data parameters), and since these conditions rarely occur,
the paths resulting from this logic are rarely executed. Other methods that can be
used include fuzzy logic reasoning for path probabilities and other advanced
techniques which weight paths based upon their intent.
Table 4.3: Execution Coverage of various paths

Path                                                                      Coverage
A                                                                         0/2537 = 0
B→C ∪ B→D ∪ B→E ∪ B→C→F ∪ B→D→F ∪ B→E→F ∪ B→C→G ∪ B→D→G ∪ B→E→G           2537/2537 = 1.0000
B→C                                                                       2415/2537 = 0.9519
B→D→F ∪ B→E→F                                                             9/2537 = 0.0035
B→D→G ∪ B→E→G                                                             113/2537 = 0.0445
B→D→F ∪ B→D→G                                                             5/2537 = 0.0020
B→E→F ∪ B→E→G                                                             117/2537 = 0.0461
These theoretical methods also lack an important connection to the implemented
source code. None of the methods profiled thus far takes into account the user-provided
data, which has the greatest effect on which paths actually execute. Depending
upon the user's preferences and use cases, the actual path behavior may vary
greatly from the theoretical values. Figure 4-15 provides output from the GNU gcov
tool showing a simple usage of the routine diagrammed in Figure 4-14. From this
information, we can construct the experimental probability of each block being executed
and set up a system of equations from this information, as is shown in Equation 4.13.
100.00% of 17 source lines executed in file check_timer.c
100.00% of 10 branches executed in file check_timer.c
90.00% of 10 branches taken at least once in file check_timer.c
100.00% of 1 calls executed in file check_timer.c
Creating check_timer.c.gcov.
#include <stdint.h>
typedef enum {FALSE, TRUE} boolean;
extern uint32_t get_current_time(void);
typedef struct {
uint32_t starting_time; /* Starting time for the system */
uint32_t timer_delay; /* Number of ms to delay */
boolean enabled; /* True if timer is enabled */
boolean periodic_timer; /* TRUE if the timer is periodic. */
} timer_ctrl_struct;
boolean has_time_expired(timer_ctrl_struct *p_timer)
2537 {
2537 boolean t_return_value = FALSE;
2537 uint32_t t_current_time;
2537 if (p_timer->enabled == TRUE)
branch 0 taken = 0%
{
2537 t_current_time = get_current_time();
call 0 returns = 100%
2537 if ((t_current_time > p_timer->starting_time) &&
branch 0 taken = 6%
branch 1 taken = 95%
((t_current_time - p_timer->starting_time) > p_timer->timer_delay))
{
/* The timer has expired. */
117 t_return_value = TRUE;
branch 0 taken = 100%
}
2420 else if ((t_current_time < p_timer->starting_time) &&
branch 0 taken = 94%
branch 1 taken = 97%
((t_current_time + (0xFFFFFFFFu - p_timer->starting_time)) > p_timer->timer_delay))
{
/* The timer has expired and wrapped around. */
5 t_return_value = TRUE;
branch 0 taken = 100%
}
else
{
/* The timer has not yet expired. */
2415 t_return_value = FALSE;
}
2537 if (t_return_value == TRUE)
branch 0 taken = 95%
{
122 if (p_timer->periodic_timer == TRUE )
branch 0 taken = 7%
{
113 p_timer->starting_time = t_current_time;
branch 0 taken = 100%
}
else
{
9 p_timer->enabled = FALSE;
9 p_timer->starting_time = 0;
9 p_timer->periodic_timer = FALSE;
}
}
}
else
{
/* Timer is not enabled. */
}
2537 return t_return_value;
}
Figure 4-15: gcov output for functional testing of timer routine.
p_{B→D→F} + p_{B→E→F} = 9/2537
p_{B→D→G} + p_{B→E→G} = 113/2537
p_{B→D→F} + p_{B→D→G} = 5/2537
p_{B→E→F} + p_{B→E→G} = 117/2537    (4.13)
However, this set of equations does not have a unique solution. Thus, the information
captured by gcov alone is not entirely suitable for determining which paths are executed
during limited testing.
There are many tools that have been developed to aid in code coverage analysis,
both commercial and open source, besides the gcov program. A detailed discussion
of Java code coverage tools is available in [Agu02]. However, none of the existing
analysis tools supports branch coverage metrics, requiring the development of our
own tool to measure branch coverage of Java programs.
It is possible to obtain better information on the branch coverage for the function
if a subtle change is made to the source code. This conceptual change involves
placing a log point in each block of source code. For this primitive example, this
is accomplished through a simple printf statement which prints a letter (A through G)
corresponding to the block of code executing. Upon exit from the function, a
newline is printed, indicating that the given trace has completed. This modified code
is shown in Figure 4-16. In this figure, lines retain their initial numbering scheme
from the original code; code which has been added is denoted with a ** symbol.
** #include <stdio.h>
1 #include <stdint.h>
2 typedef enum {FALSE, TRUE} boolean;
3
4 extern uint32_t get_current_time(void);
5
6 typedef struct {
7 uint32_t starting_time; /* Starting time for the system */
8 uint32_t timer_delay; /* Number of ms to delay */
9 boolean enabled; /* True if timer is enabled */
10 boolean periodic_timer; /* TRUE if the timer is periodic. */
11 } timer_ctrl_struct;
12
13 boolean has_time_expired(timer_ctrl_struct p_timer)
14 {
15 boolean t_return_value = FALSE;
16 uint32_t t_current_time;
17 if (p_timer.enabled == TRUE)
18 {
** printf("B");
19 t_current_time = get_current_time();
20 if ((t_current_time > p_timer.starting_time) &&
21 ((t_current_time - p_timer.starting_time) > p_timer.timer_delay))
22 {
23 /* The timer has expired. */
** printf("E");
24 t_return_value = TRUE;
25 }
26 else if ((t_current_time < p_timer.starting_time) &&
27 ((t_current_time + (0xFFFFFFFFu - p_timer.starting_time)) > p_timer.timer_delay))
28 {
29 /* The timer has expired and wrapped around. */
** printf("D");
30 t_return_value = TRUE;
31 }
32 else
33 {
34 /* The timer has not yet expired. */
** printf("C");
35 t_return_value = FALSE;
36 }
37 if (t_return_value == TRUE)
38 {
39 if (p_timer.periodic_timer == TRUE )
40 {
** printf("G");
41 p_timer.starting_time = t_current_time;
42 }
43 else
44 {
** printf("F");
45 p_timer.enabled = FALSE;
46 p_timer.starting_time = 0;
47 p_timer.periodic_timer = FALSE;
48 }
49 }
50 }
51 else
52 {
** printf("A");
53 /* Timer is not enabled. */
54 }
** printf("\n");
55 return t_return_value;
56 }
Figure 4-16: Modified timer source code to output block trace.
BC
BC
. . .
BC
BEF
BC
. . .
BC
BEG
BC
. . .
BC
BDF
Figure 4-17: Rudimentary trace output file.
By compiling and executing this modified code, a trace file matching that shown
in Figure 4-17 can be captured, providing the behavioral trace for the program. By
postprocessing this file using an AWK or Perl script, it is possible to determine the
number of unique paths executed through the function as well as their occurrence counts,
as is shown in Table 4.4. Notice that the actual path counts observed during limited
testing are significantly different from the theoretical path probabilities. One drawback
to this method is that trace logging adds significant overhead to program execution,
which may affect reliability if used in a hard real-time system.
Table 4.4: Execution Coverage of various paths
(the last four columns are repeated from Table 4.2)

Path     Coverage  Percentage           Uniform Path  Uniform Conditional  Uniform Path with  Uniform Conditional
         Count     Execution            Execution     Logic                Value Tracking     with Value Tracking
A        0         0/2537 = .0000       .10           .5000000             0.166              .500000
B→C      2415      2415/2537 = .9519    .10           .1406250             0.166              .281250
B→D      0         0/2537 = .0000       .10           .0468750             0.000              .000000
B→E      0         0/2537 = .0000       .10           .0625000             0.000              .000000
B→E→F    7         7/2537 = .0028       .10           .0312500             0.166              .062500
B→E→G    110       110/2537 = .0433     .10           .0312500             0.166              .062500
B→D→G    3         3/2537 = .0012       .10           .0234375             0.166              .046875
B→D→F    2         2/2537 = .0008       .10           .0234375             0.166              .046875
B→C→F    0         0/2537 = .0000       .10           .0703125             0.166              .000000
B→C→G    0         0/2537 = .0000       .10           .0703125             0.166              .000000
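As an illustration of the postprocessing step described above, the sketch below performs the equivalent counting in Java rather than AWK or Perl (the file name, class name, and output format are illustrative only).

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

public class TraceCounter {
    public static void main(String[] args) throws IOException {
        // Each line of the trace file is one path through the function, e.g. "BC" or "BEF".
        Map<String, Integer> pathCounts = new TreeMap<>();
        for (String line : Files.readAllLines(Paths.get("trace.txt"))) {
            if (!line.isEmpty()) {
                pathCounts.merge(line, 1, Integer::sum);
            }
        }
        // Print each observed path with its occurrence count, as in Table 4.4.
        pathCounts.forEach((path, count) -> System.out.println(path + " " + count));
    }
}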
Through the use of the GNU debugger (GDB), it is possible to obtain
the same path coverage information without modifying the original source code. This
is accomplished through the use of breakpoints combined with output logging. To
accomplish this, a debug script is created which defines a breakpoint for each block.
When one of these breakpoints is reached, an appropriate output message is displayed,
and then the program continues execution. This generates the same output format
as the printf method applied previously, but requires no changes to the source code.
The script used for this is shown in Figure 4-18. One disadvantage of this method,
however, is that it is impossible to generate an explicit output when block A is
encountered, as block A does not contain any executable code upon which a
breakpoint can be placed.
file a.exe
break check_timer.c:19
commands
silent
printf "B"
continue
end
break check_timer.c:35
commands
silent
printf "C"
continue
end
break check_timer.c:24
commands
silent
printf "E"
continue
end
break check_timer.c:30
commands
silent
printf "D"
continue
end
break check_timer.c:45
commands
silent
printf "F"
continue
end
break check_timer.c:41
commands
silent
printf "G"
continue
end
break check_timer.c:55
commands
silent
printf "\n"
continue
end
set height 0
run
quit
Figure 4-18: gdb script for generating path coverage output trace.
In the event that the breakpoint method described above is inappropriate for
obtaining the execution trace, there are several other methods that can be used. In
certain applications, it is not feasible for the debugger to interrupt the program's
execution; delays introduced by a debugger might cause the program to change its
behavior drastically, or perhaps fail, even when the code itself is correct. In this
situation, GDB supports a feature referred to as tracepoints. Using GDB's trace
and collect commands, one can specify locations in the program, called tracepoints,
and arbitrary expressions to evaluate when those tracepoints are reached. The
tracepoint facility can only be used with remote targets. As a final (and extremely
difficult) method, a logic analyzer can be connected to the address bus, so long as the
microprocessor on the system has an accessible address bus. By setting the triggering
system appropriately, the logic analyzer can trigger on the desired instruction
addresses, and by storing this information in the analyzer's buffer, a coverage trace
can be created. This, however, is by far the most difficult method for obtaining path
coverage.
Chapter 5

Static Analysis Fault Detectability

(Portions of this chapter appeared in Schilling and Alam [SA07b].)
Chapter 3 of this dissertation provided an overview of existing static analysis tools,
their capabilities, and their availability. However, as the goal of our
research is to use static analysis to estimate the software reliability of existing software
packages, it is imperative that the real-world detection capabilities for existing static
analysis tools be investigated. As with all fields of software engineering, static analysis
tools are constantly evolving, incorporating new features and detection capabilities,
executing on different platforms, and resolving development bugs. Therefore, the
analysis of capabilities must both be current and relevant.
For all of the advantages of static analysis tools, there have been very few indepen-
dent comparison studies of Java static analysis tools. Rutar et al. [RAF04] compare
the results of using FindBugs, JLint, and PMD on Java source code. This study,
however, is somewhat flawed: first, it only looks at open source tools and does not
investigate the performance of commercial tools; second, while the study itself was
conducted on five mid-sized programs, there was no attempt to specifically analyze
the capabilities of each tool when it comes to detecting statically detectable faults.
Forristal[For05] compares 12 commercial and open source tools for effectiveness, but
the analysis is based only on security aspects and security scanners, not the broader
range of static analysis tools available. Furthermore, while the study included tools
which tested for Java faults, the majority of the study was aimed at C and C++
analysis tools, so it is unclear how applicable the results are to the Java programming
language.
Thus, a pivotal need for the success of our modeling is to experimentally determine
the capabilities of existing static analysis tools.
5.1 Experimental Setup
In order to analyze the effectiveness of Java static analysis tools, an experiment
was created which would allow an assessment of the effectiveness of multiple static
analysis tools when applied against a standard set of source code modules. The basic
steps for this experiment involved:
1. Determining the scope of faults that would be included within the validation
suite.
2. Creating test cases which represented manifestations of the faults.
3. Running the analysis tools on validation suite.
4. Combining the tool results using the SoSART tool developed for this purpose.
85
5. Analyzing the results.
Table 5.1: Static Analysis Fault Categories for Validation Suite

Aliasing errors
Array boundary errors
Deadlocks and other synchronization errors
Infinite loop conditions
Logic faults
Mathematical faults
Null de-referencing faults
Uninitialized variables
Altogether, the validation package consisted of approximately 1200 lines of code,
broken down into small segments, each demonstrating a single fault. Injected faults
were broken into eight different categories based upon mistakes which commonly
occur during software development. The categories are shown in Table 5.1.
Aliasing errors represented errors which occurred due to multiple references
aliasing a single variable instance. Array out of bounds errors were designed to check
that static analysis tools were capable of detecting instances in which an array
reference falls outside the valid boundaries of the given array. Several different
mechanisms for indexing outside of an array were tested, including off-by-one errors
when iterating through a loop and fixed errors in which a reference definitively
outside the range of the array occurs. One test case involved de-referencing a
zero-length array, and one test involved an out-of-range reference into a
two-dimensional array. Deadlocks and synchronization errors were tested using
standard examples of code which suffered from deadlocks, livelocks, and other
synchronization issues. An attempt was also made to locate infinite loops using
static analysis. Several examples of commonly injected infinite loop scenarios were
developed based upon PSP historical data. One example used a vacuous truth for the
while condition, whereas others tested the case in which a local variable is used to
calculate the index value yet the value never changes once inside the loop construct.
In the area of logic, test cases were developed for six subareas.
Case statement tests were designed to validate that case statements missing break
statements were detected, that impossible case statements (i.e., case statements whose
values could not be generated) were detected, and that dead code within case
statements was detected. Operator precedence tests exercised the ability of the static
analysis tools to detect code which exhibited problems with operator precedence and
might produce outcomes differing from those intended. Logic conditions which always
evaluate in the same manner were also tested, as well as incorrect string comparison
usage.
Mathematical analysis consisted of two major areas, namely division by zero
detection and numerical overflow and underflow. Division by zero was tested for both
integer and floating point numbers. Null variable dereferences were tested using a set
of logic which resulted in a null reference being dereferenced. Lastly, a set of test
conditions verified the ability of the static analysis tools to detect uninitialized
variables.
Code was developed in Eclipse and, except in cases where explicit faults were desired,
was free of all warnings at compilation time. In cases where the Eclipse tool or
the Java compiler issued a warning indicating an errant construct, this information
was logged for future comparison. In many cases, it was found that even though
Eclipse provided an indication that a statically detectable error was present,
the tool still compiled the class file.
Once the analysis suite had been developed, it was placed under configuration
management and archived in a CVS repository. The suite was re-reviewed in a PSP
style review, specifically looking for faults within the test suite as well as other im-
provements that could be made. While the initial analysis only included nine tools,
a tenth tool was obtained and included in the experiment.
Following the development of the validation files, an automated process for exe-
cuting the analysis tools was developed. This ensured that the analysis of this suite
(as well as subsequent file sets) could occur in an automated and uniform manner.
An Apache Ant build file was created which automatically invoked the static analy-
sis tools. The output from the analysis tools was then combined using the Software
Static Analysis Reliability Toolkit (SOSART) [SA06a]. The SOSART tool acted as a
static analysis metadata tool as well as providing a visual environment for
reviewing faults.
When running the static analysis tools, each tool was run with all warnings and rules
enabled. While this maximized the number of warnings generated and resulted in a
significant number of false positives and other nuisance warnings, it also maximized
the potential for each tool to detect the seeded faults.
5.2 Experimental Results
The experiment consisted of analyzing our validation suite using ten different Java
static analysis tools. Five of the tools used were open source or other readily available
static analysis tools. The other five tools included in this experiment represented
commercially available static analysis tools. Due to licensing and other contractual
issues with the commercial tools, the results have been obfuscated and all the tools
will simply be referred to as Tool 1 through Tool 10.
In analyzing the results, three fundamental pieces of data from each tool were
sought. The first goal was to determine if the tool itself issued a warning which
would lead a trained software engineer to detect the injected fault within the source
code. This was the first and most significant objective, for if the tools are not able to
detect real-world fault examples, then the tools will be of little benefit to practitioners.
However, beyond detecting injected faults, we were also interested in the other faults
that the tool found within our source code. These findings can be divided into two
categories: valid faults which may pose a reliability issue for the source code, and false
positive warnings. By definition, false positive warnings encompass both faults which
cannot lead to failure given the implementation, as well as warnings which detect a
problem that is unrelated to a potential failure. Using this definition, all stylistic
warnings are considered to be invalid, for a stylistic warning by its very nature cannot
lead to failure.
5.2.1 Basic Fault Detection Capabilities
The first objective of this experiment was to determine which of the tools actually
detected the injected faults. This accomplished by reviewing the tool outputs in
SOSART and designating those tools which successfully detected the injected static
89
Table 5.2: Summary of fault detection. A 1 indicates the tool detected the injected fault.Tool
Count Eclipse 1 2 3 4 5 6 7 8 9 10
Array Out of Bounds 1 1 0 0 0 0 0 0 0 0 0 1 0Array Out of Bounds 2 0 0 0 0 0 0 0 0 0 0 0 0Array Out of Bounds 3 2 0 1 0 0 0 0 0 0 0 1 0Array Out of Bounds 4 3 0 1 1 0 0 0 0 0 0 1 0Deadlock 1 2 0 1 0 0 0 0 1 0 0 0 0Deadlock 2 3 0 1 0 0 1 0 1 0 0 0 0Deadlock 3 1 0 1 0 0 0 0 0 0 0 0 0Infinite Loop 1 3 0 1 0 0 0 0 1 1 0 0 0Infinite Loop 2 1 0 1 0 0 0 0 0 0 0 0 0Infinite Loop 3 2 0 1 1 0 0 0 0 0 0 0 0Infinite Loop 4 2 0 1 0 0 1 0 0 0 0 0 0Infinite Loop 5 0 0 0 0 0 0 0 0 0 0 0 0Infinite Loop 6 1 0 0 0 0 0 0 0 0 0 1 0Infinite Loop 7 2 0 1 0 0 0 0 0 0 0 1 0logic 1 3 0 1 0 0 0 0 1 1 0 0 0logic 2 2 1 1 0 0 0 0 0 0 0 0 0logic 3 0 0 0 0 0 0 0 0 0 0 0 0logic 4 1 0 1 0 0 0 0 0 0 0 0 0logic 5 1 0 1 0 0 0 0 0 0 0 0 0logic 6 1 0 1 0 0 0 0 0 0 0 0 0logic 7 1 0 1 0 0 0 0 0 0 0 0 0logic 8 0 0 0 0 0 0 0 0 0 0 0 0logic 9 0 0 0 0 0 0 0 0 0 0 0 0logic 10 0 0 0 0 0 0 0 0 0 0 0 0logic 11 1 0 0 0 0 0 0 0 0 0 0 1logic 12 2 0 1 0 0 0 0 0 0 0 1 0logic 13 2 0 1 0 0 0 0 0 0 0 1 0logic 14 4 0 1 0 0 0 0 0 1 1 0 1logic 15 3 0 1 0 0 0 0 0 1 0 0 1Math 1 2 0 0 1 0 0 0 0 0 0 1 0Math 2 2 0 0 1 0 0 0 0 0 0 1 0Math 3 1 0 0 0 0 0 0 0 0 0 1 0Math 4 0 0 0 0 0 0 0 0 0 0 0 0Math 5 3 0 0 0 0 0 0 1 1 0 0 1Math 6 3 0 0 0 0 0 0 1 1 0 0 1Math 7 3 0 0 0 0 0 0 1 1 0 0 1Math 8 4 0 0 0 0 1 0 1 1 0 0 1Math 9 0 0 0 0 0 0 0 0 0 0 0 0Math 10 1 0 0 0 0 0 0 0 0 0 1 0Math 11 1 0 0 0 0 0 0 0 0 0 1 0Math 12 1 0 0 0 0 0 0 0 0 0 1 0Math 13 1 0 0 0 0 0 0 0 0 0 1 0Math 14 1 0 0 0 0 0 0 0 0 0 1 0Math 15 1 0 0 0 0 0 0 0 0 0 1 0Math 16 1 0 0 0 0 0 0 0 0 0 1 0Null Dereferences 1 3 0 1 1 0 1 0 0 0 0 0 0Null Dereferences 2 0 0 0 0 0 0 0 0 0 0 0 0Null Dereferences 3 1 0 0 0 0 1 0 0 0 0 0 0Uninitialized Variable 1 1 1 0 0 0 0 0 0 0 0 0 0Uninitialized Variable 2 2 1 0 0 0 0 0 1 0 0 0 0Total Detected 3 21 5 0 5 0 9 8 1 17 7Percent Detected 6% 42% 10% 0% 1% 0% 18% 16% 2% 34% 14%
faults. These results are shown in Table 5.2. In each column, a 1 is present if the
given tool detected the fault. A 0 is present if the tool did not provide a meaningful
warning which would indicate the presence of a fault within the source code.
Beyond determining which tools detected the faults, we were also interested in
knowing which faults were detected by multiple tools. Rutar et al. [RAF04] indicated
that they found little overlap between tools in their research. We wanted to see if
this held true for our results. The simplest way of accomplishing this was to count
the number of tools which detected a given fault. This result is shown in the count
column of Table 5.3. In summary, of the 50 statically detectable faults present within
the validation suite, 22 of them, or 44% of the injected faults, were detected by
two or more static analysis tools.

Table 5.3: Static Analysis Detection Rate by Tool Count

Number of faults detected by 0 tools             9    18.0%
Number of faults detected by 1 tool             19    38.0%
Number of faults detected by 2 tools            11    22.0%
Number of faults detected by 3 tools             9    18.0%
Number of faults detected by 4 or more tools     2     4.0%
Table 5.4: Correlation between warning tool detections.

           Tool 1    Tool 2    Tool 3   Tool 4    Tool 5   Tool 6    Tool 7    Tool 8    Tool 9    Tool 10
Eclipse   -0.0443   -0.0842    N/A     -0.0842    N/A      0.1008   -0.1102   -0.036    -0.1813   -0.1019
Tool 1     1         0.1215    N/A      0.1215    N/A      0.0232    0.0707    0.1678   -0.183    -0.1098
Tool 2               1         N/A      0.1111    N/A     -0.1561   -0.1454   -0.0476    0.1829   -0.1345
Tool 3                         N/A      N/A       N/A      N/A       N/A       N/A       N/A       N/A
Tool 4                                  1         N/A      0.1908    0.0363   -0.0476   -0.2392    0.0576
Tool 5                                            N/A      N/A       N/A       N/A       N/A       N/A
Tool 6                                                     1         0.6475   -0.0669   -0.3362    .4111
Tool 7                                                               1         0.3273   -0.3132    .7673
Tool 8                                                                         1        -0.1025    .3541
Tool 9                                                                                   1        -.2896
Tool 10                                                                                             1
In order to see if there is a relationship between different tools, the correlation
between tool results was calculated. Perfect correlation between tools, in which case
the tools detected exactly the same set of faults, would be represented by a value of
1. No correlation between detections would be captured as a value of 0. Perfect
negative correlation, in which every fault detected by the first tool is not detected by
the second tool and every fault not detected by the first tool is detected by the second
tool, would be represented as a value of -1. While the results of Table 5.4 do show
some correlation between tools, the only correlation of significance (and admittedly a
low significance) is between tools 6 and 7. This correlation indicates that tools 6 and 7
are capable of detecting similar types of faults.
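The values in Table 5.4 are correlations between binary detection vectors, with one entry per injected fault set to 1 when the tool detected it. A minimal sketch of that calculation follows (in Java; Pearson's correlation applied to 0/1 vectors, which is equivalent to the phi coefficient; the example vectors are hypothetical and are not the experimental data).

public class DetectionCorrelation {
    // Pearson correlation of two equally sized vectors; applied to 0/1
    // detection vectors this is the phi coefficient.
    static double correlation(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n;
        meanY /= n;
        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < n; i++) {
            cov  += (x[i] - meanX) * (y[i] - meanY);
            varX += (x[i] - meanX) * (x[i] - meanX);
            varY += (y[i] - meanY) * (y[i] - meanY);
        }
        // If a tool detected nothing, its variance is zero and the result is NaN,
        // corresponding to the N/A entries in Table 5.4.
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Hypothetical detection vectors for two tools over five injected faults.
        double[] toolA = {1, 0, 1, 1, 0};
        double[] toolB = {1, 0, 0, 1, 0};
        System.out.println(correlation(toolA, toolB));
    }
}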
Table 5.5: Static Analysis Tool False Positive and Stylistic Rule Detections

Test Case    Eclipse  Tool 1  Tool 2  Tool 3  Tool 4  Tool 5  Tool 6  Tool 7  Tool 8  Tool 9  Tool 10
Aliasing Error 0 0 2 13 0 5 20 0 41 0 0Array Out of Bounds 1 0 0 0 0 0 0 9 0 7 0 0Array Out of Bounds 2 0 0 0 2 0 0 7 0 9 0 0Array Out of Bounds 3 0 0 0 1 0 0 27 0 24 0 0Array Out of Bounds 4 0 0 0 1 0 0 7 0 5 0 0Deadlock 1 0 0 2 1 0 0 5 1 5 0 0Deadlock 2 0 0 0 2 0 0 11 0 11 0 0Deadlock 3 0 0 0 4 0 0 0 0 10 0 0Infinite Loop 1 0 0 0 1 0 1 11 0 13 0 0Infinite Loop 2 0 1 3 1 0 0 13 0 10 0 0Infinite Loop 3 0 1 2 2 0 0 15 0 11 0 0Infinite Loop 4 0 0 0 2 0 0 8 0 8 0 0Infinite Loop 5 0 0 0 0 0 0 7 0 7 0 0Infinite Loop 6 0 0 0 0 0 0 7 0 7 0 0Infinite Loop 7 0 0 0 2 0 0 11 0 9 0 0logic 1 0 1 0 4 0 0 29 0 31 0 0logic 2 0 0 0 4 0 0 19 0 14 0 0logic 3 0 0 0 0 0 0 9 0 9 0 0logic 4 0 0 0 4 0 0 19 0 15 0 0logic 5 0 0 0 1 0 0 4 0 5 0 0logic 6 0 0 0 5 0 0 4 0 5 0 0logic 7 0 0 0 5 0 0 3 0 4 0 0logic 8 0 1 0 4 0 0 8 0 7 0 0logic 9 0 0 0 5 0 0 8 0 7 0 0logic 10 0 0 0 4 0 0 10 0 7 0 0logic 11 0 0 0 4 0 0 6 0 5 0 0logic 12 0 0 0 3 0 0 8 0 8 0 0logic 13 0 0 0 3 0 0 10 0 11 0 0logic 14 0 0 0 4 0 0 9 0 6 0 0logic 15 0 0 0 4 0 0 3 0 4 0 0Math 1 0 0 0 0 0 0 7 0 5 0 0Math 2 0 0 0 0 0 0 6 0 6 0 0Math 3 0 0 0 0 0 0 8 0 5 0 0Math 4 0 0 0 0 0 0 6 0 6 1 0Math 5 0 0 0 2 0 0 9 0 8 0 0Math 6 0 0 0 5 0 0 9 0 9 0 0Math 7 0 0 0 2 0 0 10 0 9 0 0Math 8 0 0 0 1 0 0 10 0 11 0 0Math 9 0 0 0 1 0 0 5 0 3 0 0Math 10 0 0 0 1 0 0 2 0 1 0 0Math 11 0 0 0 1 0 0 11 0 8 1 0Math 12 0 0 0 2 0 0 7 0 5 0 0Math 13 0 0 0 1 0 0 12 0 9 0 0Math 14 0 0 0 2 0 0 8 0 6 0 0Math 15 0 0 0 0 0 0 13 0 8 0 0Math 16 0 0 0 0 0 0 7 0 6 0 0Null Dereferences 1 0 0 1 3 0 1 0 0 20 1 0Null Dereferences 2 0 0 1 3 0 0 13 0 11 0 1Null Dereferences 3 0 0 2 3 0 0 21 0 16 0 0uninitialized variables 1 0 0 0 0 0 0 8 0 8 0 0uninitialized variables 2 0 0 0 2 0 0 6 0 5 0 0uninitialized variables 3 0 0 0 2 0 0 4 0 4 0 0Total 0 4 13 117 0 7 489 1 484 3 1
5.2.2 The Impact of False Positives and Style Rules
As was stated previously, when executing the static analysis tools, each and every
rule was enabled for the static analysis tools. As would be expected, this method
generated a significant number of false positive warnings which needed to be filtered
before the valid warnings could be addressed. Our purpose for this analysis, however,
was to attempt to understand the relationship between false positives and the overall
detection of faults. Table 5.5 provides raw information relating to the false positives
92
and stylistic warning issued by each of the tools during our analysis.
Table 5.6: Correlation between false positive and stylistic rule detections

           Tool 1   Tool 2   Tool 3   Tool 4   Tool 5   Tool 6   Tool 7   Tool 8   Tool 9   Tool 10
Eclipse     N/A      N/A      N/A      N/A      N/A      N/A      N/A      N/A      N/A      N/A
Tool 1      1        0.43     0.07     N/A     -0.05     0.34    -0.04     0.23    -0.07     0.49
Tool 2               1        0.23     N/A      0.37     0.25     0.36     0.36     0.03    -0.05
Tool 3                        1        N/A      0.66     0.22    -0.08     0.55    -0.1      0.11
Tool 4                                 N/A      N/A      N/A      N/A      N/A      N/A      N/A
Tool 5                                          1        0.21    -0.03     0.69     0.07    -0.03
Tool 6                                                   1       -0.11     0.73    -0.16    -0.03
Tool 7                                                            1       -0.09    -0.03    -0.02
Tool 8                                                                     1        0.07    -0.05
Tool 9                                                                              1       -0.03
Tool 10                                                                                      1
From the raw data, two tools, Tools 6 and 8, had an extremely high rate of false
positive and stylistic warnings, with totals of 489 and 484 instances respectively.
Tool 3 was also somewhat elevated, with 117 false positive detections. This
relationship was then confirmed by performing a statistical correlation calculation
between the tools, as is shown in Table 5.6.
This correlation, coupled with anecdotal evidence collected during the analysis, led
to a further investigation of the warnings generated. In reviewing the data shown in
Table 5.7, a significant portion of the false positives were generated by two warnings,
Tool 8 Rule 1, with 399 instances, and Tool 6 Rule 4 with 375 instances. Together,
these two warnings constituted nearly two-thirds of the false positive warnings ob-
served. In reviewing the warning documentation, both of these warnings referred to
a stylistic violation of combining spaces and tabs together within source code, which
by the definition of our experiment was a false positive because it could not directly
result in a program failure.
5.3 Applicable Conclusions
This study indicates that there is a large variance in the detection capabilities
between tools. The ability of a given tool to detect an injected fault varied between
0% and 42%. This does not mean that the tools that did not detect our injected
faults were ineffective. Rather, it simply means that they were not appropriate tools
for detecting the faults injected in our experiment.
This experiment also showed a significant correlation between the false positive
warnings produced by different tools. However, it was also found that a significant
portion of the false positive detections came from two rules which essentially detected
the same condition. It is believed that through proper configuration of the tools and
filtering of the reported rules, it may be possible to significantly diminish the impact
of the false positive problem.
Based on our results, a correlation between tools and the faults which they detect
was observed: 44% of the injected faults were detected by two or more static analysis
tools. This seems to contradict the results of the Rutar experiment, in which no
significant correlation was found. The experiment reported here, however, used a
greater variety of tools and, more importantly, included commercially developed tools
which may be more capable than the open source tools used in their experiment.
Furthermore, the methods were slightly different, in that their method involved
starting with existing large projects and applying the tools to them, whereas our
experiment used a smaller validation suite to test fault discovery. We can conclude
from our experiment that it is possible to use multiple independent tools and
correlation to help reduce the impact of false positive detections when running static
analysis tools.
In common with the Rutar experiment, we conclude that it is still necessary to
use multiple static analysis tools in order to effectively detect all statically detectable
faults. Each tool tested appeared to detect a varying subset of injected faults, and
while there was an overlap between tools, we are not yet able to effectively characterize
the minimum set of tools necessary to ensure adequate fault detection.
As a last observation, even though every attempt was made to control the style of
the source code, including using Eclipse and its built-in style formatting tools, style
warnings detected by the tools were still in great abundance during the experiment.
These false positives, as has been noted by Hatton [Hat07], reduce the signal-to-noise
ratio of the analysis. Stylistic warnings by their very nature cannot directly lead to
program failure and therefore need to be carefully filtered when reviewing reliability
and failure probability.
Table 5.7: Percentage of warnings detected as valid based upon tool and warning.
(Each half of the table lists: Tool, Warning, # of valid instances, # of invalid instances, % valid.)
Tool Warning instances instances % valid Tool Warning instances instances % valid1 1 1 0 100 6 1 18 0 1001 2 1 0 100 6 2 1 0 1001 3 1 0 100 6 3 1 0 1001 4 1 0 100 6 4 0 375 01 5 1 0 100 6 5 0 2 01 6 1 0 100 6 6 1 0 1001 7 2 0 100 6 7 28 0 1001 8 0 3 0 6 6 1 0 1001 9 2 0 100 6 9 0 11 01 10 3 0 100 6 10 0 20 01 11 4 0 100 6 11 2 0 1001 12 8 0 100 6 12 4 3 57.141 13 3 0 100 6 13 3 20 13.041 14 1 0 100 6 14 2 3 401 15 2 0 100 6 15 1 0 1001 16 2 0 100 6 16 1 3 252 1 0 1 0 6 17 2 0 1002 2 2 0 100 6 18 1 0 1002 3 4 0 100 6 19 0 21 02 4 5 1 83.33 6 20 0 1 02 5 3 8 27.27 6 21 0 8 02 6 1 6 14.29 6 22 1 0 1003 1 0 44 0 6 23 0 7 03 2 0 1 0 6 24 0 4 03 3 0 4 0 6 25 0 3 03 4 1 1 50 6 26 4 0 1003 5 0 1 0 6 27 0 2 03 6 2 40 4.76 6 28 51 1 98.083 7 0 4 0 6 29 0 3 03 8 1 0 100 6 30 0 1 03 9 0 2 0 6 31 3 0 1003 10 1 0 100 6 32 0 3 03 11 5 0 100 6 33 1 0 1003 12 0 1 0 6 34 3 0 1003 13 0 2 0 7 1 2 0 1003 14 0 1 0 7 2 4 0 1003 15 3 18 14.29 7 3 2 0 1004 1 1 0 100 7 4 0 1 04 2 1 0 100 7 5 0 1 04 3 1 0 100 7 6 1 0 1004 4 1 0 100 7 7 2 0 1004 5 1 0 100 8 1 0 399 04 6 1 0 100 8 2 1 0 1004 7 1 0 100 8 3 1 0 1004 8 1 0 100 8 4 0 3 05 1 0 1 0 8 5 0 26 05 2 0 2 0 8 6 1 42 2.3310 1 1 0 100 8 7 0 1 010 2 1 0 100 8 8 1 0 10010 3 1 0 100 8 9 0 16 010 4 0 1 0 8 10 0 1 010 5 4 0 100 8 11 1 0 10010 6 1 0 100 10 9 1 0 10010 7 3 0 100 10 10 1 0 10010 8 4 0 100 10 11 2 0 100Summary: 84 142 37.16 149 981 13.18Overall 233 1123 17.18
Chapter 6
Bayesian Belief Network
Bayesian Belief Networks (BBNs) are powerful tools which have been found to be
useful for numerous applications when general behavioral trends are known but the
data being analyzed is uncertain or incomplete. BBNs, through their usage of causal
directed acyclic graphs, offer an intuitive visual representation for expert opinions yet
also provide a sound mathematical basis[Ana04].
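The underlying inference is ordinary Bayes' rule. As a purely illustrative example with hypothetical numbers (not drawn from the model or its data): suppose 30% of the warnings produced by a particular rule are historically valid, so P(valid) = 0.3, and a second independent tool flags the same location for 60% of valid warnings but only 10% of false positives. Then

P(valid | corroborated) = P(corroborated | valid) · P(valid) / P(corroborated)
                        = (0.6 · 0.3) / (0.6 · 0.3 + 0.1 · 0.7)
                        = 0.18 / 0.25
                        = 0.72.

A BBN chains many such updates across the nodes of a graph rather than applying a single rule in isolation.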
Bayesian Belief Networks have found wide acceptance within the medical field and
other areas. Haddaway et al. [Had99] provide an overview of existing BBN software
packages as well as an extensive analysis of projects which have successfully used
BBNs. Within the software engineering field, many different projects have used
Bayesian Belief Networks to solve common software engineering problems. Laskey
et al. [LAW+04] discuss the usage of Bayesian Belief Networks for the analysis of
computer security. The quality of software architectures has been assessed using a
BBN by van Gurp and Bosch [vGB99] and by Neil and Fenton [NF96]. Software
reliability has been assessed using Bayesian Belief Networks by Gran and
Helminen [GH01] and by Pai [Pai01][PD01].
6.1 General Reliability Model Overview
The fundamental premise behind this model is that the resulting software relia-
bility can be related to the number of statically detectable faults present within the
source code, the paths which lead to the execution of the statically detectable faults,
and the rate of execution of each path within the software package.
To model reliability, the source code is first divided into a set of methods or
functions. Each method or function is then divided further into a set of statement
blocks and decisions. A statement block represents a contiguous set of source code
instructions uninterrupted by a conditional statement. By using this organization,
the source code is translated into a set of blocks connected by decisions. Statically
detectable faults are then assigned to the appropriate block based upon their location
in the code.
Once the source code has been decomposed into blocks, the output from the
appropriate static analysis tools is linked into the decomposed source code. In order
to predict the reliability, the probability of execution for each branch must be
determined. This is accomplished by combining the theoretical program paths with
the actual program paths observed through execution trace capture during limited
testing. The testing consists of a set of black box tests or functional tests which are
observed at the white box level. For each method, a reliability for each block is
assigned based upon the output of a Bayesian Belief Network relating reliability to
the statically detectable faults, code coverage during limited testing, and the code
structure for the routine.
6.2 Developed Bayesian Belief Network
6.2.1 Overview
In order to accurately assess both the validity of a detected static analysis fault
and the likelihood of its manifestation, a Bayesian Belief Network has been developed.
This network, shown in Figure 6-1, incorporates historical data as well as program
execution traces to predict the probability that a given statically detectable fault
will cause a program failure.
The Bayesian Belief network can effectively be divided into three main segments.
The upper left half of the Bayesian Belief Network handles attributes related to the
validity and risk associated with a statically detected fault. The upper right half of
the Bayesian Belief Network assesses the probability that a given fault is exposed
through program execution. The bottom segment combines the results and provides
an overall estimate of the reliability for the given statically detectable fault.
Figure 6-1: Bayesian Belief Network relating statically detectable faults, code coverage
during limited testing, and the resulting net software reliability.

As is required in a Bayesian Belief Network, each continuous variable must be
converted into a discrete state value. Based upon the work of Neil and Fenton [NF96],
the majority of the variables are assigned the states of “Very High”, “High”, “Medium”,
“Low”, and “Very Low”. In certain cases, an optional state of “None” exists. This
state is generally not used in probabilistic calculations unless a variable is specifically
observed to have a value of none.
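A minimal sketch of such a discretization is shown below (in Java; the cut points are arbitrary placeholders, since the model deliberately leaves the mapping from percentages to state values to the domain).

public class StateDiscretizer {
    enum State { VERY_LOW, LOW, MEDIUM, HIGH, VERY_HIGH }

    // Map a continuous value in [0, 1] onto the five discrete BBN states.
    // The cut points below are placeholders; the model expects them to be
    // chosen from historical data for the domain in question.
    static State discretize(double value) {
        if (value < 0.2) return State.VERY_LOW;
        if (value < 0.4) return State.LOW;
        if (value < 0.6) return State.MEDIUM;
        if (value < 0.8) return State.HIGH;
        return State.VERY_HIGH;
    }

    public static void main(String[] args) {
        System.out.println(discretize(0.95));   // VERY_HIGH
    }
}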
6.2.2 Confirming Fault Validity and Determining Fault Risk
By definition, all static analysis tools have the capability of generating false pos-
itives. Some tools have a low false positive rate, while other tools have a high false
positive rate. Determining the validity of the fault is therefore the first required step
in assessing whether or not a statically detectable fault will cause a failure. Once the
validity of a given statically detectable fault has been assessed, the probability of the
fault manifesting itself as an immediate failure given program execution needs to be
assessed. Thus, the upper left segment of the reliability network concerns itself with
determining the likelihood that a given statically detectable fault is either valid or a
false positive and assessing the fault risk assuming it is a valid statically detectable
fault.

Table 6.1: Bayesian Belief Network State Definitions

Node Name                                  State Values
False Positive Rate                        Very High, High, Medium, Low, Very Low
Method Clustering                          Valid Cluster Present, Invalid Cluster Present, No Cluster Present, Unknown Cluster Present
File Clustering                            Valid Cluster Present, Invalid Cluster Present, No Cluster Present, Unknown Cluster Present
Independent Correlated                     Yes, No
Fault Validity                             Valid, False Positive
Immediate Failure Risk                     Very High, High, Medium, Low, Very Low, None
Maintenance Risk                           Yes, No
Fault Risk                                 Very High, High, Medium, Low, Very Low, None
Percentage Paths Through Block Executed    Very High, High, Medium, Low, Very Low
Test Confidence                            Very High, High, Medium, Low, Very Low
Fault Exposure Potential                   Very High, High, Medium, Low, Very Low
Distance from Nearest Path                 Adjacent, Near, Far, Very Far, None
Nearest Path Execution Percentage          Very High, High, Medium, Low, Very Low, None
Fault Execution Potential                  Very High, High, Medium, Low, Very Low, None
Net Fault Exposure                         Very High, High, Medium, Low, Very Low, None
Code Execution                             Block Executed, Method Executed, Block Reachable, Block Unreachable
Estimated Reliability                      Perfect, Very High, High, Medium, Low, Very Low
Tested Reliability                         Very High, High, Medium, Low, Very Low
Fault Failed                               Yes, No
Net Reliability                            Perfect, Very High, High, Medium, Low, Very Low
Calibrated Net Reliability                 Perfect, Very High, High, Medium, Low, Very Low
Output Color                               Red, Orange, Yellow, Green
Each detected fault type naturally has a raw false positive rate based upon the
algorithms and implementation. Certain static analysis faults are nearly always a
false positive. Therefore, the states of “Very High”, “High”, “Medium”, “Low”,
and “Very Low” have been selected to represent the false positive rate for the given
statically detectable fault. The model itself does not prescribe a specific translation
from percentages into state values, as this translation may change with the domain.
However, it is expected that this value will be collected from historical analysis of
previous projects.
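As one illustration of such a translation, a simple mapping from an observed historical false positive percentage to the five state values is sketched below. The threshold values used here are purely hypothetical assumptions chosen for illustration; the model itself leaves the cut points to be derived from the historical project data described above.

// Illustrative sketch only: the thresholds (80%, 60%, 40%, 20%) are hypothetical and
// would in practice be replaced with values derived from historical fault data.
public final class FalsePositiveRateMapper {

    public enum Rate { VERY_HIGH, HIGH, MEDIUM, LOW, VERY_LOW }

    /** Maps an observed false positive fraction (0.0 to 1.0) onto a discrete BBN state. */
    public static Rate toState(double falsePositiveFraction) {
        if (falsePositiveFraction >= 0.80) return Rate.VERY_HIGH;
        if (falsePositiveFraction >= 0.60) return Rate.HIGH;
        if (falsePositiveFraction >= 0.40) return Rate.MEDIUM;
        if (falsePositiveFraction >= 0.20) return Rate.LOW;
        return Rate.VERY_LOW;
    }
}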
The validity of a static analysis fault is also impacted by the clustering of faults.
Kremenek et al.[KAYE04] indicate that there is a strong correlation between code
locality and either valid or invalid faults. The rationale behind this clustering is that
programmers tend to make the same mistakes, and these mistakes will tend to be
localized at the method, class, file, or package level. However, clustering can also
occur if a tool enters into a run-away state and generates a significant number of
false positives. The clustering states can therefore be represented as “Valid Cluster
Present”, “Invalid Cluster Present” , “No Cluster Present”, and “Unknown Cluster
Present”. By default, when an analysis is first performed, all statically detectable
faults which are part of a cluster will be initialized to the state value “Unknown
Cluster Present”, indicating that the cluster has been shown to be neither valid nor
invalid. However, as the Software Engineer inspects the results and observes static
analysis faults to be valid or invalid, the cluster will shift to the appropriate states of
“Valid Cluster Present” or “Invalid Cluster Present” depending upon the results of
the inspection. This model recognizes two types of clustering, clustering at the file
level and clustering at the method level.
Another input node contains information on whether the static analysis fault has
been correlated with a fault reported by a second tool at the same location. The usage of multiple
static analysis tools allows an extra degree of confidence that the detected static
analysis fault is valid, for if two independent tools have detected a comparable fault
at the same location, then there is a better chance that both faults are valid. This
statement assumes that the algorithms used to detect the fault are truly independent.
Even though Rutar et al.[RAF04] and Wagner[WJKT05] did not find a significant
overlap in rule checking capabilities between tools, these experiments only used a
limited set of static analysis tools. Since these articles were published, several new
tools were introduced to the commercial marketplace. Our experimental results,
described in Chapter 5 and published in Schilling and Alam [SA07b], indicate that
there is at least some form of correlation between tools when the faults represent
the same taxonomical definition. The Independently Correlated state can have a
value of “Yes” or “No” depending on whether the statically detectable fault has been
independently correlated or not.
From these nodes, an overall summary node can be obtained, referred to as the
fault validity node. This node can contain the states “Valid” or “False Positive”,
depending on whether the fault is believed to be valid or a false positive instance.
When an analysis is first conducted, this node is estimated based upon the input
states and their observed values. However, as the software engineer begins to inspect
the static analysis warnings, the instances of this node will be observed to be either
“Valid” or “False Positive” as appropriate.
The immediate failure node represents whether the fault that has been detected is
likely to cause an immediate failure. A fault which, for example, indicates that a jump
to a null pointer may occur or that the return stack will be overwritten has a very
high probability of resulting in an immediate failure, and thus, this value will reflect
this case. However, a fault which is detected due to an operator precedence issue may
not be assigned as significant a value. Thus, for each statically detectable fault,
there is a potential that the given fault will result in a failure if the code is executed.
This node probability is directly defined by the characteristics of the fault detected,
and can be represented as “Very High”, “High”, “Medium”, “Low”, “Very Low”, and
“None”. In this case, the “None” state is reserved for static analysis warnings of a
stylistic nature, such as the usage of tabs instead of spaces to indent code. While
these can be considered to be valid warnings, these faults can not directly lead to
program failure.
While there are certain statically detectable faults which do not directly lead to
a failure, there are cases in which a statically detectable fault may represent a fault
which will manifest itself through maintenance. For example, it is deemed to be good
coding practice to enclose all if, else, while, do, and other constructs within opening
and closing brackets. While not doing this does not directly lead to a failure, it can
lead to failures as maintenance occurs on the code segment. Thus, the maintenance
risk state can be set to “Yes” or “No”, indicating whether or not the given fault is a
maintenance risk. The maintenance risk only applies to those faults which are marked
to be valid by the fault validity portion of the network.
These parameters all feed into the fault risk node. This node represents the
risk associated with a given fault and can take on the states “Very High”, “High”,
“Medium”, “Low”, “Very Low”, and “None”.
6.2.3 Assessing Fault Manifestation Likelihood
In order for a fault to result in a failure, the fault itself must be executed in
a manner which will stimulate the fault to fail. Thus, the right upper half of the
reliability Bayesian Belief Network deals with code coverage of the code block during
program execution.
Code Execution Classification
The first subnetwork assumes that the code block with the statically detectable
fault has executed during testing. If the code block has been executed, the probability
of a fault manifesting itself can be related to the number of discrete paths through the
block which have been executed versus the theoretical number of paths through the
block. A fault that is detected may only occur if certain conditions are present, and
these conditions may only be present if a certain execution path has been followed.
By increasing the number of paths executed, the likelihood of a fault manifesting
itself is diminished. However, as has been noted by Hutchins et al.[HFGO94], even
full path coverage is insufficient to guarantee that a fault will not manifest itself, for
the fault may be data dependent based upon a parameter passed into the method1.
There are four principal states that a code block with a static analysis warning
can be in, namely “Block Executed”, “Method Executed”, “Block Reachable”, or
“Block Unreachable”. If a given code block has been executed, this means that at
least one path of execution has traversed the given code block during the testing
period, resulting in the node having the value “Block Executed”. A second state for
this node occurs when the method that contains the statically detectable fault has
been executed but the specific code block has not been executed by at least one path
1Tracing the entire program state is the basis for Automatic Anomaly detection, as has been used in the DIDUCE tool[HL02] and the AMPLE[DLZ05] tools.
through the method, resulting in the state “Method Executed”. This would indicate
that the state values for the class or the method parameters passed in have not been
set properly to allow the execution of this path.
The last two states effectively deal with whether or not the code block containing
the statically detectable fault is reachable. For a Java method which has private scope
or a C method which has static visibility, this variable can only be “Block Reachable”
if there exists a direct call to the given method or function from within the scope of
the compilation unit. Otherwise, the method itself can not execute, and the value will
be “Block Unreachable”. For a Java method which has public or protected scope, or
a C method which has external linkage, it must be assumed that there is the potential
for the method to execute, and thus, by default, the node will have a value of “Block
Reachable”. “Block Unreachable” truly represents a rare state.
Fault Exposure Potential
The “Test Confidence” node serves to provide the capability to define the con-
fidence in the testing that has been used to obtain execution profiles. The testing
which is referred to reflects limited testing of the module for which the reliability is
being assessed. In the case of a new version of a software component delivered from a
vendor, this would, at best, reflect black box testing of the interface or functional testing
of the module. However, through the usage of execution trace capture, a white box
view of the component and the paths taken is obtained. This parameter allows the
evaluating engineer to adjust their confidence in the testing results based upon the
expected usage of the module in the field. As the engineer performing the reliability
analysis has more confidence that the results match what will be seen by a produc-
tion module, this value will be increased, reflecting less variance between the observed
coverage and the actual field coverage. Less confidence would indicate that more of
the unexecuted paths within the module would be expected to execute in the field.
The Fault Exposure potential relates the percentage of test paths covered and the
test confidence. When the test confidence is lowest and the percentage of executed
paths through the code block is lowest, this value will be highest. The value will be
lowest if the test confidence is very high and the percentage of executed paths is also
very high. In general, decreased test confidence will result in more variance in the
calculated network percentages as well.
The “Percent Paths Through Block Executed” node indicates what percentage of
the paths which pass through the code block containing the statically detectable fault
have been executed during testing. Based on an appropriate translation which scales
the number of paths through the code block relative to the percentages executed, this
node will have a value of “Very High”, “High”, “Medium”, “Low”, or “Very Low”.
Fault Execution Potential
The second subnetwork is based upon the premise that the code block containing
the statically detectable fault has not executed, but the containing method has exe-
cuted. In this case, the likelihood of this code executing can be related to the distance
to the nearest executed path as well as percentage of execution paths represented by
the nearest path.
The distance to the nearest path is measured in terms of the number of deci-
sions between the given code block containing a statically detectable fault and the
nearest executed path. Without additional knowledge, it is impossible to predict the
probability that a decision will result in a given outcome. Thus, the number of de-
cisions between the nearest executed path and the static analysis fault is effectively
governed by a binomial distribution. This node can be represented by the states “Adjacent”, in
which the nearest executed path is only one decision away from the static analysis
fault, “Near”, “Far”, and “Very Far”. “None” is a placeholder state which is used to
indicate that the method itself has never been executed during program testing.
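To make the distance reasoning concrete, consider one simple reading of the assumption above, offered purely for illustration: if each of the d intervening decisions is treated as an even two-way split, the chance that an execution reaching the nearest executed path continues on to the block is roughly

P(\text{block reached} \mid \text{nearest path reached}) \approx \left(\tfrac{1}{2}\right)^{d},
\qquad d = 1 \;\Rightarrow\; 0.5 \ (\text{Adjacent}), \qquad d = 4 \;\Rightarrow\; 0.0625

The dissertation's network encodes this influence through conditional probability tables rather than through this closed-form expression, so the figures above are only indicative of the trend.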
The percentage of paths through the code block node represents the percentage of
paths through the method which pass through the code block containing the statically
detectable fault. In the event that the method has not been executed, no assumption
about the probability of any given path executing can be made. All paths must
be assigned an equal probability of executing. Thus, this variable represents the
percentage of paths which lead into this program block. This node can have the
values “Very High”, “High”, “Medium”, “Low”, and “Very Low”.
The “Nearest Path Execution Percentage” node reflects the percentage of net exe-
cution paths which have gone through the nearest node. As this percentage increases,
there is an indication that more of the execution paths through the code are within
a few decisions of this code block. Since there are more execution paths nearby, the
likelihood of reaching this code block is increased each time a nearby path is executed,
for only a few decisions may be required to be different in order to reach this location.
The Nearest Path Execution Percentage states are “Very High”, “High”, “Medium”,
“Low”, “Very Low”, and “None”.
The Fault Execution Potential node represents the potential of a code block con-
taining a statically detectable fault to be executed. As the distance from the nearest
path is lowest and the nearest execution path percentage is highest and the test con-
fidence is lowest, this value will be highest. These values will decrease as the distance
from the nearest execution path increases.
Net Fault Exposure Node
The net fault exposure node is switched based upon whether the code block has
been executed or not. If the code has been executed, this value will mirror that of
the Fault Exposure Potential node. If the code has not been executed, then this
parameter will reflect that of the Fault Execution Potential. This node can have the
values of “Very High”, “High”, “Medium”, “Low”, “Very Low”, and “None”.
6.2.4 Determining Reliability
Reliability for the code block is determined by combining the Fault Risk and the
Net Fault Exposure nodes together to form the Estimated Reliability Node. As the
fault risk increases and the net fault exposure increases, the overall reliability for the
code block will decrease. The net reliability for the block can therefore be expressed as
“Perfect”, “Very High”, “High”, “Medium”, “Low”, and “Very Low”. By default, a
code block which has no statically detectable faults shall have an Estimated Reliability
of “Perfect”.
Net reliability
The Fault Failed node will reflect whether or not the given fault has led to failure
during testing. Values for this node can be “Yes” or “No”, with the default value
being “No” unless an observed fault occurs.
In the event that a statically detected fault has actually failed during the limited
testing period, the estimated values using the Bayesian Belief network are replaced
with actual reliability values from testing. The actual Tested Reliability node reflects
the observed reliability of the software as it is related to this specific fault. Values
which can occur include “Very High”, “High”, “Medium”, “Low”, and “Very Low”.
The net Reliability node serves as a switch between the Tested Reliability observed
if a failure occurs and the Estimated Reliability node. In the event that a failure has
occurred, the value here will represent that of the Tested Reliability Node. Otherwise,
the Estimated reliability node will be mirrored.
Calibrated Net reliability
The Calibrated Net reliability node allows the user to calibrate the output of the
basic network relative to the actual system being analyzed. In essence, this node
serves as a constant multiplier to either increase or decrease the reliability measures
in order that the appropriate final values are obtained. While this capability exists
within the model, all testing thus far has used this node simply as a pass-through node
in which no change is made to the output probabilities from the Net Reliability node.
6.3 Multiple Faults In a Code Block
Determining the overall reliability for a code block is straightforward if there is
only a single statically detectable fault within the given code block. In this case, the
reliability of the code block would simply be the value output by the “Calibrated
Net Reliability” node of the Bayesian Belief Network. However, if there are multiple
statically detectable faults present, further processing is necessary to determine the
reliability of the given code block.
In traditional reliability modeling with two faults, the probability of failure can
be expressed as
P(F) = P_f(F_1) + P_f(F_2) - P_f(F_1) \cdot P_f(F_2) \qquad (6.1)
where Pf(F1) represents the probability that the first fault will fail on any given
execution and Pf(F2) represents the probability that the second fault will fail. If
the failure events of the two faults are assumed to be mutually exclusive (the two
faults never fail on the same execution), the probability of failure for the system can
be reduced to

P(F) = P_f(F_1) + P_f(F_2) \qquad (6.2)
In this case,
P(F_1|F_2) = P(F_2|F_1) = 0 \qquad (6.3)
However, if a system has two fully dependent faults such that

P(F_1|F_2) = P(F_2|F_1) = 1 \qquad (6.4)
the probability of failure for the system can be reduced simply to
P(F) = P_f(F_1) = P_f(F_2). \qquad (6.5)
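As a small worked example of these relationships, with values chosen purely for illustration, suppose each of two faults fails on 1% of executions, so that P_f(F_1) = P_f(F_2) = 0.01. Then

\begin{aligned}
\text{independent (6.1):} &\quad P(F) = 0.01 + 0.01 - (0.01)(0.01) = 0.0199 \\
\text{mutually exclusive (6.2):} &\quad P(F) = 0.01 + 0.01 = 0.02 \\
\text{fully dependent (6.4, 6.5):} &\quad P(F) = 0.01
\end{aligned}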
This core concept of independence must be translated into the Bayesian belief net-
work provided and manifested in a meaningful manner. For organizational purposes,
statically detected faults are grouped and referenced using the taxonomy defined in
Appendix A. We have extended the Common Weakness Enumeration[MCJ05][MB06]
taxonomy. Thus, if two statically detectable faults are categorized into the same clas-
sification, it is assumed that they are multiple instances of the same core fault.
Figure 6-2: Simple Bayesian belief Network combining two statically detectable faults.
If the faults are not of the same type, then an estimation of the combinatorial
effect of their reliabilities must be obtained. In a traditional reliability model, this
is obtained by multiplying the reliabilities together for the two faults. However, this
model will use another Bayesian Belief Network system, as is shown in Figure 6-2.
Table 6.2: Network Combinatorial States
Node Name                        State Values
Previous Typical Reliability     Perfect, Very High, High, Medium, Low, Very Low
Previous Worst Case Reliability  Perfect, Very High, High, Medium, Low, Very Low
New Fault Reliability            Perfect, Very High, High, Medium, Low, Very Low
Next Typical Reliability         Perfect, Very High, High, Medium, Low, Very Low
Next Worst Case Reliability      Perfect, Very High, High, Medium, Low, Very Low
Table 6.3: Network Worst Case Combinatorial Probabilities
Previous Worst Case  New Fault     Next Worst Case Reliability
Reliability          Reliability   (Perfect / Very High / High / Average / Low / Very Low)
Perfect      Perfect      1 0 0 0 0 0
Perfect      Very High    0 1 0 0 0 0
Perfect      High         0 0 1 0 0 0
Perfect      Average      0 0 0 1 0 0
Perfect      Low          0 0 0 0 1 0
Perfect      Very Low     0 0 0 0 0 1
Very High    Perfect      0 1 0 0 0 0
Very High    Very High    0 1 0 0 0 0
Very High    High         0 0 1 0 0 0
Very High    Average      0 0 0 1 0 0
Very High    Low          0 0 0 0 1 0
Very High    Very Low     0 0 0 0 0 1
High         Perfect      0 0 1 0 0 0
High         Very High    0 0 1 0 0 0
High         High         0 0 1 0 0 0
High         Average      0 0 0 1 0 0
High         Low          0 0 0 0 1 0
High         Very Low     0 0 0 0 0 1
Average      Perfect      0 0 0 1 0 0
Average      Very High    0 0 0 1 0 0
Average      High         0 0 0 1 0 0
Average      Average      0 0 0 1 0 0
Average      Low          0 0 0 0 1 0
Average      Very Low     0 0 0 0 0 1
Low          Perfect      0 0 0 0 1 0
Low          Very High    0 0 0 0 1 0
Low          High         0 0 0 0 1 0
Low          Average      0 0 0 0 1 0
Low          Low          0 0 0 0 1 0
Low          Very Low     0 0 0 0 0 1
Very Low     Perfect      0 0 0 0 0 1
Very Low     Very High    0 0 0 0 0 1
Very Low     High         0 0 0 0 0 1
Very Low     Average      0 0 0 0 0 1
Very Low     Low          0 0 0 0 0 1
Very Low     Very Low     0 0 0 0 0 1
In this network, the nodes have the states shown in Table 6.2. In a basic system
with one fault present, the Previous Typical and Worst Case Reliability values will
be initialized to “Perfect” and the New Fault reliability value will be initialized to the
reliability value obtained by a single instance of the static analysis reliability network.
The Next Typical and Next Worst case Reliabilities will be calculated based upon
Table 6.4: Network Typical Combinatorial Probabilities
Previous Typical     New Fault     Next Typical Reliability
Reliability          Reliability   (Perfect / Very High / High / Average / Low / Very Low)
Perfect      Perfect      1 0 0 0 0 0
Perfect      Very High    0.5 0.5 0 0 0 0
Perfect      High         0 1 0 0 0 0
Perfect      Average      0 0.5 0.5 0 0 0
Perfect      Low          0 0 1 0 0 0
Perfect      Very Low     0 0 0.5 0.5 0 0
Very High    Perfect      0.5 0.5 0 0 0 0
Very High    Very High    0 1 0 0 0 0
Very High    High         0 0.5 0.5 0 0 0
Very High    Average      0 0 1 0 0 0
Very High    Low          0 0 0.5 0.5 0 0
Very High    Very Low     0 0 0 1 0 0
High         Perfect      0 1 0 0 0 0
High         Very High    0 0.5 0.5 0 0 0
High         High         0 0 1 0 0 0
High         Average      0 0 0.5 0.5 0 0
High         Low          0 0 0 1 0 0
High         Very Low     0 0 0 0.5 0.5 0
Average      Perfect      0 0.5 0.5 0 0 0
Average      Very High    0 0 1 0 0 0
Average      High         0 0 0.5 0.5 0 0
Average      Average      0 0 0 1 0 0
Average      Low          0 0 0 0.5 0.5 0
Average      Very Low     0 0 0 0 1 0
Low          Perfect      0 0 1 0 0 0
Low          Very High    0 0 0.5 0.5 0 0
Low          High         0 0 0 1 0 0
Low          Average      0 0 0 0.5 0.5 0
Low          Low          0 0 0 0 1 0
Low          Very Low     0 0 0 0 0.5 0.5
Very Low     Perfect      0 0 0.5 0.5 0 0
Very Low     Very High    0 0 0 1 0 0
Very Low     High         0 0 0 0.5 0.5 0
Very Low     Average      0 0 0 0 1 0
Very Low     Low          0 0 0 0 0.5 0.5
Very Low     Very Low     0 0 0 0 0 1
the conditional probabilities shown in Table 6.3 and Table 6.4.
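The conditional probability tables above follow a regular pattern: the worst case state is simply the lower (less reliable) of the two input states, while the typical state is centered on the midpoint of the two, splitting its probability evenly between the two neighboring states when the midpoint falls between them. A minimal sketch of that rule is shown below; the ordinal encoding of the states is our own illustration and is not part of the SOSART implementation.

// States ordered from best (0 = Perfect) to worst (5 = Very Low).
enum Rel { PERFECT, VERY_HIGH, HIGH, AVERAGE, LOW, VERY_LOW }

final class CombinatorialRule {
    /** Worst case (Table 6.3): the less reliable of the two inputs. */
    static Rel worstCase(Rel previous, Rel newFault) {
        return Rel.values()[Math.max(previous.ordinal(), newFault.ordinal())];
    }

    /**
     * Typical case (Table 6.4): centered on the midpoint of the two inputs.
     * When the midpoint falls halfway between two states, the probability is
     * split 0.5 / 0.5 between them, so this sketch returns a small distribution.
     */
    static java.util.Map<Rel, Double> typicalCase(Rel previous, Rel newFault) {
        int sum = previous.ordinal() + newFault.ordinal();
        java.util.Map<Rel, Double> result = new java.util.LinkedHashMap<>();
        if (sum % 2 == 0) {
            result.put(Rel.values()[sum / 2], 1.0);
        } else {
            result.put(Rel.values()[sum / 2], 0.5);
            result.put(Rel.values()[sum / 2 + 1], 0.5);
        }
        return result;
    }
}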
This core network is expanded as necessary to combine all statically detectable
faults together. For each additional fault present within the system, there will be
one additional instance of this network, with the Previous Typical and Worst case
reliability values cascading from the previous instance of the network, as is shown in
Figure 6-3.
6.4 Combining Code Blocks to Obtain Net Reliability
Once the reliability has been obtained for each code block within the method, it
is possible to obtain the overall reliability for each method. To do this, another set of
instances of the combinatorial network defined in Figure 6-2 is created. There will be
one combinatorial network created for each code block within the method. Eventually,
this will result in a complete network similar to that shown in Figure 6-4. This figure
shows the combination of 4 statically detectable faults present on two different code
blocks, which represents a very simple network. The majority of analyzed networks
typically contain upwards of 100 statically detectable faults present on 50 or more
code blocks, yielding the reliability of a single method.
Figure 6-3: Simple Bayesian belief Network combining four statically detectable faults.
Figure 6-4: Method combinatorial network showing the determination of the reliability for a network with two blocks and four statically detectable faults.
Chapter 7
Method Combinatorial Network1
7.1 Introduction
Markov Models have long been used in the study of systems reliability. As such,
they have also been applied to software reliability modeling. Publications by Musa
[MIO90], Lyu [Lyu95], Rook [Roo90], The Reliability Analysis Center [Cen96], Grot-
tke [Gro01], Xie [Xie91], Gokhale and Trivedi [GT97], and Trivedi [Tri02] all in-
clude extensive discussion on the usage of Markov Models for the calculation of Soft-
ware Reliability. One of the most commonly used models is that which has been
presented by Cheung[Che80]. In this model, a finite Markov chain with an absorbing
state is used to represent the execution of a software program. Each node of the
Markov model represents a group of executable statements having a single point of
entry and a single point of exit. The probability assigned to each edge represents the
probability that execution will follow the given path to the next node.
1Portions of this chapter have appeared in Schilling [Sch07].
Figure 7-1: A program flow graph.
Figure 7-1 represents a basic program flow graph. In this case, there is one node
within the flow graph which makes a decision (S1), and two possible execution paths
based upon that decision, (S2) and (S3) respectively. Reaching state (S4) indicates
that program execution has completed successfully. The transitions t1,2 and t1,3 repre-
sent the probability that program execution will follow the path S1 → S2 and S1 → S3
respectively.
By constructing a matrix P representing the transition probabilities for the pro-
gram, the average number of times that each state is visited can be obtained. Again
using the program flow exhibited in Figure 7-1, this matrix can be represented as
P = \begin{bmatrix} 0 & t_{1,2} & t_{1,3} & 0 \\ 0 & 0 & 0 & t_{2,4} \\ 0 & 0 & 0 & t_{3,4} \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad (7.1)
where
t_{1,2} + t_{1,3} = 1.0 \qquad (7.2)
t_{2,4} = 1.0 \qquad (7.3)
t_{3,4} = 1.0 \qquad (7.4)
The average number of times each statement set is executed can be calculated
from the fundamental matrix. A Markov chain with states S1, S2, . . . , Sn, where Sn
is an absorbing state and all other states are transient, can be partitioned into the
relationship
P = \begin{bmatrix} Q & C \\ O & 1 \end{bmatrix} \qquad (7.5)
where
Q is an n − 1 by n − 1 matrix representing the transitional state probabilities,
C is a column vector, and
O is a row vector of n − 1 zeros.
The kth step transition probability matrix can be expressed as
P^k = \begin{bmatrix} Q^k & C' \\ O & 1 \end{bmatrix} \qquad (7.6)
and will converge as k approaches infinity.
The fundamental matrix for the system can be defined as
M = (I - Q)^{-1} \qquad (7.7)
Returning to the initial problem, if one assigns the values t1,2 = .5 and t1,3 = .5
to the system, the matrix P has the values
P = \begin{bmatrix} 0 & 0.5 & 0.5 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad (7.8)
Q = \begin{bmatrix} 0 & 0.5 & 0.5 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \qquad (7.9)
M = (I - Q)^{-1} = \left( \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} - \begin{bmatrix} 0 & 0.5 & 0.5 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \right)^{-1} = \begin{bmatrix} 1 & 0.5 & 0.5 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (7.10)
From this information, it can be concluded that on average, for each execution of
the program, node S2 will be visited 0.5 times and node S3 will be visited 0.5 times.
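The fundamental matrix computation above can be sketched in a few lines of code. The implementation below is our own illustration, not code from any tool described in this dissertation; it approximates M = (I - Q)^{-1} through the series I + Q + Q^2 + ..., which converges because Q contains only transient states, and then reads the expected visit counts for the example flow graph from the first row of M.

// Approximates the fundamental matrix M = (I - Q)^{-1} via the series I + Q + Q^2 + ...
final class FundamentalMatrix {

    static double[][] compute(double[][] q, int terms) {
        int n = q.length;
        double[][] m = identity(n);
        double[][] power = identity(n);
        for (int k = 0; k < terms; k++) {
            power = multiply(power, q);   // power becomes Q^(k+1)
            m = add(m, power);            // accumulate the series
        }
        return m;
    }

    static double[][] identity(int n) {
        double[][] id = new double[n][n];
        for (int i = 0; i < n; i++) id[i][i] = 1.0;
        return id;
    }

    static double[][] add(double[][] a, double[][] b) {
        int n = a.length;
        double[][] c = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) c[i][j] = a[i][j] + b[i][j];
        return c;
    }

    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        double[][] c = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++) c[i][j] += a[i][k] * b[k][j];
        return c;
    }

    public static void main(String[] args) {
        // Q from equation (7.9): S1 branches to S2 or S3 with probability 0.5 each.
        double[][] q = { { 0, 0.5, 0.5 }, { 0, 0, 0 }, { 0, 0, 0 } };
        double[][] m = compute(q, 50);
        // Expected visits starting from S1: m[0][1] = 0.5 (S2) and m[0][2] = 0.5 (S3).
        System.out.printf("V(S2) = %.2f, V(S3) = %.2f%n", m[0][1], m[0][2]);
    }
}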
Figure 7-2: Program flow graph with internal loop.
If the control flow is modified slightly, as is shown in Figure 7-2, the impact of the looping
construct of state S2 can be considered. If t2,2 = 0.75, then
M = \begin{bmatrix} 1 & 2.0 & 0.5 \\ 0 & 4 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (7.11)
indicating that on average node S2 will be visited 2.0 times and node S3 will be visited
0.5 times. If t2,2 = 0.99, then
M = \begin{bmatrix} 1 & 50 & 0.5 \\ 0 & 100 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (7.12)
indicating that on average node S2 will be visited 50 times and node S3 will be visited
0.5 times.
This capability can then be used to calculate the reliability of the program using
the relationships
R = \prod_{j} R_j^{V_j} \qquad (7.13)

\ln R = \sum_{j} V_j \ln R_j \qquad (7.14)

R = \exp\left( \sum_{j} V_j \ln R_j \right). \qquad (7.15)
where
Rj represents the reliability of node j, and
Vj represents the average number of times node j is executed.
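Once the visit counts and node reliabilities are known, the relationship in equations (7.13) through (7.15) can be evaluated directly. The short sketch below is an illustration under the assumptions of the Cheung model rather than code from SOSART; the visit counts are those computed for Figure 7-1 above, and the node reliabilities of 0.999 are hypothetical values chosen for the example.

// Computes R = exp( sum_j V_j * ln R_j ) from equation (7.15). Illustration only.
final class CheungReliability {

    static double netReliability(double[] visits, double[] reliabilities) {
        double lnR = 0.0;
        for (int j = 0; j < visits.length; j++) {
            lnR += visits[j] * Math.log(reliabilities[j]);   // V_j * ln R_j
        }
        return Math.exp(lnR);
    }

    public static void main(String[] args) {
        // Hypothetical values for the flow graph of Figure 7-1: S2 and S3 are each
        // visited 0.5 times on average, each with an assumed reliability of 0.999.
        double[] visits        = { 0.5, 0.5 };
        double[] reliabilities = { 0.999, 0.999 };
        System.out.printf("Net reliability = %.6f%n", netReliability(visits, reliabilities));
        // Prints approximately 0.999000, i.e. exp(0.5 ln 0.999 + 0.5 ln 0.999).
    }
}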
7.2 Problems with Markov Models
Markov models present two distinct problems as the number of nodes increases. First,
solving a Markov model requires extensive mathematical computation to calculate the
fundamental matrix M = (I - Q)^{-1}; matrix inversion is approximately an O(n^3) operation,
resulting in extensive computation being necessary to compute the net reliability for the system.
Second, Markov Models also require accurate estimations for transition probabili-
ties and reliability values to be determined in order to construct the given model. In
many cases, it is not possible to accurately estimate these reliability values with an
appropriate degree of confidence in order for the Markov Model to be applied. What
is needed is a more general approach which, while providing reasonably accurate re-
sults, does not necessarily require the degree of precision necessary to use a Markov
model.
7.3 BBNs and the Cheung Model
Based on the rationale provided, our intent is to develop a system of Bayesian
Belief Networks which can be used to reliably predict the outcomes for a Markov
Model. The specific model to be represented using a BBN is the reliability
model proposed by Cheung[Che80]. The Cheung model relates the net reliability
of the system to three factors, namely the number of nodes within a program, the
reliability of each node within the program, and the frequency of execution for each
node.
The number of nodes within a program represents a structural parameter of the
software being analyzed. It is not uncommon for a real world software project to
be comprised of several thousand routines, each of which would be represented as a
single node within the model.
The reliability of each node represents the probability that when a given node
executes, execution will continue to completion without failure. Reliability can either
be measured experimentally or estimated using one of numerous techniques.
The execution frequency for each node represents, on average, how many times
the given node will be executed when the program is run. It can also represent the
number of times a method is invoked per unit of time. This value can either be
obtained experimentally or through static flow analysis of the software program. In
the first case, the program use case influences the results, which may result in a
more accurate reliability measurement. This is especially true if a common piece of
software is used in multiple environments, as the reliability may be vastly different
depending upon the execution environment. However, the second case provides a
better representation for software failure, in which a significant portion of failures
can be attributed to rarely executing exception handling routines.
7.4 Method Combinatorial BBN
The basic BBN relating the reliability of two Markov model nodes is shown in
Figure 7-3. The nodes Reliability A and Reliability B represent the reliability of
the two program segments. The nodes Coverage A and Coverage B represent the average
number of executions for the given node in one unit time period. The net reliability
is directly related to the reliability of each of the two nodes as well as the execution
rate for those nodes.
In order to use this BBN, it is necessary that the continuous values for reliability
and execution frequency be translated into discrete values which can then be further
processed. Since reliability values are often quite high, usually .9 or higher, it is often
Figure 7-3: Basic BBN for modeling a Markov Model.
more convenient to discuss reliability in terms of unreliability, which can be expressed
as
U = 1 − R (7.16)
where R represents the reliability of the system and U represents the resulting un-
reliability of the system. As most systems typically have multiple nines within the
reliability value, the U value will typically consist of several zeros after the decimal
point followed by the first significant digit. This being the case, let
U = −1 · log10(1 − R) (7.17)
With this translation, each increase in the value of U by 1 represents a decrease by a
factor of 10 of the unreliability of the system.
As a general statement, the reliability for any properly tested system will be at
least 0.99. Mission critical or safety critical avionics systems require failure rates of
Table 7.1: Bayesian Belief Network States Defined for Reliability
State Name   Abbreviation   R                        U
Perfect      P              0.99999 ≤ R              5 < U
Very High    VH             0.9999 ≤ R < 0.99999     4 ≤ U < 5
High         H              0.999 ≤ R < 0.9999       3 ≤ U < 4
Medium       M              0.99 ≤ R < 0.999         2 ≤ U < 3
Low          L              0.9 ≤ R < 0.99           1 ≤ U < 2
Very Low     VL             0.0 ≤ R < 0.9            U < 1
less than 10^-9 failures per hour of operation[Tha96]. Software, in general, by its very
essence is typically limited to a minimum failure rate of 10^-4 failures per hour of
operation. Specialized techniques, such as N-version programming, can be applied to
improve this figure, but even the best software typically has a minimum failure rate of
10^-5[Tha96] failures per hour of operation, or four orders of magnitude greater than
that which is required for mission critical systems deployment. Based on this concept,
the states shown in Table 7.1 have been defined for the Bayesian Belief Network.
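Under these definitions, the translation from a continuous reliability value to a discrete state is a direct application of equation (7.17) and Table 7.1. The sketch below is an illustration of that mapping rather than code from the SOSART implementation; for example, R = 0.9995 gives U of about 3.3 and therefore the state "High".

// Maps a continuous reliability value onto the discrete states of Table 7.1
// using U = -log10(1 - R) from equation (7.17). Illustrative sketch only.
final class ReliabilityState {

    enum State { PERFECT, VERY_HIGH, HIGH, MEDIUM, LOW, VERY_LOW }

    static State fromReliability(double r) {
        if (r >= 1.0) return State.PERFECT;      // avoid log10(0)
        double u = -1.0 * Math.log10(1.0 - r);
        if (u >= 5.0) return State.PERFECT;      // 0.99999 <= R
        if (u >= 4.0) return State.VERY_HIGH;    // 0.9999  <= R < 0.99999
        if (u >= 3.0) return State.HIGH;         // 0.999   <= R < 0.9999
        if (u >= 2.0) return State.MEDIUM;       // 0.99    <= R < 0.999
        if (u >= 1.0) return State.LOW;          // 0.9     <= R < 0.99
        return State.VERY_LOW;                   // R < 0.9
    }

    public static void main(String[] args) {
        System.out.println(fromReliability(0.9995));   // prints HIGH (U is about 3.3)
    }
}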
Table 7.2: Bayesian Belief Network States Defined for Execution Rate
State Name   Abbreviation   V                       log10(V)
Very High    VH             31.6 ≤ V                1.5 ≤ log10(V)
High         H              3.16 ≤ V < 31.6         0.5 ≤ log10(V) < 1.5
Medium       M              0.316 ≤ V < 3.16        −0.5 ≤ log10(V) < 0.5
Low          L              0.0316 ≤ V < 0.316      −1.5 ≤ log10(V) < −0.5
Very Low     VL             V < 0.0316              log10(V) < −1.5
Referring back to the relationship
\ln R = \sum_{j} V_j \ln R_j \qquad (7.18)
it can be observed that the execution rate for the node is just as significant in the
net reliability value as the reliability of the nodes. A factor of ten difference in a
given Vj value will impact the net reliability by one order of magnitude. Because the
execution rate can vary significantly, and is not bounded by an upper or lower bound,
the values for the execution rate are best expressed in terms of the log10 value of the
execution rate. This behavior results in the state definitions shown in Table 7.2.
7.5 Experimental Model Validation
In order to validate the results of this model, an experiment was set up which
would use the existing Markov model to generate test cases. These test cases would
then be fed into the Bayesian Belief Network and evaluated against the expected
results from the Markov Model.
Figure 7-4: A program flow graph.
To accomplish this, a MatLab script was created which evaluated a Markov model
simulation using the Cheung[Che80] model and the program flow which is shown in
Figure 7-4. For simplicity, R1 and R4 were fixed at 1, indicating that there was no
probability of failure for the entry and exit nodes. R2 and R3 were independently
varied between the values of 0 and .999999 with a median value of .999. The param-
eters t3,2, t2,3, t3,4, and t2,4 were also varied independently. Altogether, this resulted
in a total of 47730 test vectors being generated and the value ranges shown in Table
7.3.
Table 7.3: Markov Model Parameter Ranges
Parameter   −1 × log10 r2   −1 × log10 r3   v2         v3         Net Reliability (Rnet)
Average     3.045           3.045           79.71      79.72      0.9548
Median      3               3               0.9999     0.9999     0.9914
STD         1.748           1.748           194.9      194.9      0.07225
Min         0               0               0.000012   0.000012   0
Max         6               6               997.1      997.1      0.999999
To evaluate the accuracy of the Bayesian Belief Network, a Java application was
developed using the EBayes[Coz99] core. This application used the same input pa-
rameters that the MatLab script used. The outputs of the Bayesian Belief network
were then compared with the expected values from the Markov model, creating error
values. Comparisons with the Markov model were done in the U domain, as this
allowed an accurate assessment of error across all magnitudes. This resulted in the
derived results shown in Table 7.4.
While the raw error values are important and indicate that the average error is
less than .5, or one half of the resolution of the model, a more thorough analysis of
the error can be obtained by looking at the number of test instances and the error
Table 7.4: Differences between the Markov Model reliability values and the BBN Predicted Values
Average 0.4156
Median 0.3799
STD 0.3212
Min 0.000031
Max 3.099
for those instances. Table 7.5 shows the number of test instances in which the error
fell within the documented bounds. 96.70% of the test cases had an error of less than
1.0 relative to the value calculated by the Markov Model in the U domain.
Table 7.5: Test Error ranges and counts
Error (E)               E < 0.1   E < 0.25   E < 0.5   E < 0.75   E < 1.0
Count                   5032      13284      28508     43078      46156
Percent of test cases   10.54%    27.83%     59.73%    90.25%     96.70%
The error in the Bayesian Belief network is normally distributed over the data
range, as is shown in Table 7.6.
Table 7.6: Test error relative to normal distribution
Z value             0.5     1.0     2.0     3.0
Percentage          34.3%   72.1%   96.6%   98.7%
Normal Percentage   38.3%   68.2%   95.4%   99.7%
7.6 Extending the Bayesian Belief Network
While the network presented previously has been shown to be effective at per-
forming accurate reliability calculations, the network itself suffers from significant
limitations. Because the network can only compare two nodes at once, the program
being analyzed must either be limited to two nodes or two “non-perfect” nodes. This
limits the model itself to be a proof of concept model which can be used in academic
and theoretical settings.
Figure 7-5: Extended BBN for modeling a Markov Model.
However, by making a slight modification to the Bayesian Belief Network, it is
possible to extend the model so that it has broader application. This extension, shown
in Figure 7-5, incorporates a node which combines the net execution rate for the two
nodes. This value is scaled in the same manner as the execution rates for the two
input nodes.
By adding this additional node to the Belief network and assigning the appropri-
ate conditional probabilities, a virtually infinite number of execution nodes can be
assessed by structuring them in a manner similar to that shown in Figure 7-6. To
handle the case where the number of network nodes is not equal to a
power of two, it is necessary to add a phantom state to the BBN for execution rate
power of two, it is necessary to add a phantom state to the BBN for execution rate
which indicates that the given node never executes and the output of the network
should only be dependent upon the other node values. This results in the modified
states for the execution rate as is shown in Table 7.7.
Figure 7-6: Extended BBN allowing up to n nodes to be assessed for reliability.
Table 7.7: Extended Bayesian Belief Network States Defined for Execution Rate
State Name   Abbreviation   V                       log10(V)
Very High    VH             31.6 ≤ V                1.5 ≤ log10(V)
High         H              3.16 ≤ V < 31.6         0.5 ≤ log10(V) < 1.5
Medium       M              0.316 ≤ V < 3.16        −0.5 ≤ log10(V) < 0.5
Low          L              0.0316 ≤ V < 0.316      −1.5 ≤ log10(V) < −0.5
Very Low     VL             0 < V < 0.0316          log10(V) < −1.5
Never        N              V = 0                   N/A
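The structural idea behind Figure 7-6 can be sketched independently of the conditional probability tables themselves: program nodes are paired, each pair feeds one instance of the extended BBN, and the combined outputs are paired again until a single net value remains, with phantom nodes in the "Never" execution state used as padding when a level has an odd number of nodes. The sketch below, an illustration of ours rather than code from the SOSART implementation, shows only this pairing structure and leaves the two-node combination abstract.

import java.util.ArrayList;
import java.util.List;

// Illustrates the cascading structure of Figure 7-6: nodes are combined pairwise,
// padding with "Never executes" phantom nodes so every level pairs up evenly.
// The two-node combination itself is left abstract; in the dissertation it is the
// extended Bayesian Belief Network of Figure 7-5.
final class PairwiseCascade {

    interface Combiner<T> {
        T combine(T a, T b);        // e.g. evaluate the extended BBN for one pair of nodes
        T phantomNeverExecutes();   // padding node whose execution rate state is "Never"
    }

    static <T> T reduce(List<T> nodes, Combiner<T> combiner) {
        List<T> level = new ArrayList<>(nodes);
        while (level.size() > 1) {
            if (level.size() % 2 != 0) {
                level.add(combiner.phantomNeverExecutes());   // pad odd-sized levels
            }
            List<T> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                next.add(combiner.combine(level.get(i), level.get(i + 1)));
            }
            level = next;                                     // cascade one level upward
        }
        return level.get(0);
    }
}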
7.7 Extended Network Verification
In order to validate the results of this model, an experiment was set up which
would use the existing Markov model to generate test cases. These test cases would
then be fed into the Bayesian Belief Network and evaluated against the expected
results from the Markov Model.
Figure 7-7: Extended program flow graph.
To accomplish this, a MatLab script was created which evaluated a Markov model
simulation using the Cheung[Che80] model and the program flow which is shown in
Figure 7-7. For simplicity, R1 and R4 were fixed at 1, indicating that there was no
probability of failure for the entry and exit nodes. R2 and R3 were independently
varied between the values of 0 and .999999 with a median value of .999. The param-
eters t3,2, t2,3, t3,4, and t2,4 were also varied independently. Altogether, this resulted
in a total of 47730 test vectors being generated and the value ranges shown in Table
7.8.
Table 7.8: Markov Model Parameter Ranges
Parameter   −1 × log10 r2   −1 × log10 r3   v2         v3         Net Reliability (Rnet)
Average     3.045           3.045           79.71      79.72      0.9548
Median      3               3               0.9999     0.9999     0.9914
STD         1.748           1.748           194.9      194.9      0.07225
Min         0               0               0.000012   0.000012   0
Max         6               6               997.1      997.1      0.999999
To evaluate the accuracy of the Bayesian Belief Network, a Java application was
developed using the EBayes[Coz99] core. This application used the same input pa-
rameters that the MatLab script used. The outputs of the Bayesian Belief network
were then compared with the expected values from the Markov model, creating error
values. Comparisons with the Markov model were done in the U domain, as this
allowed an accurate assessment of error across all magnitudes. This resulted in the
derived results shown in Table 7.9.
Table 7.9: Differences between the Markov Model reliability values and the BBN Predicted Values
Average 0.4156
Median 0.3799
STD 0.3212
Min 0.000031
Max 3.099
While the raw error values are important and indicate that the average error is
less than .5, or one half of the resolution of the model, a more thorough analysis of
the error can be obtained by looking at the number of test instances and the error
for those instances. Table 7.10 shows the number of test instances in which the error
fell within the documented bounds. 96.70% of the test cases had an error of less than
1.0 relative to the value calculated by the Markov Model in the U domain.
Table 7.10: Test Error ranges and counts
Error (E)               E < 0.1   E < 0.25   E < 0.5   E < 0.75   E < 1.0
Count                   5032      13284      28508     43078      46156
Percent of test cases   10.54%    27.83%     59.73%    90.25%     96.70%
The error in the Bayesian Belief network is normally distributed over the data
range, as is shown in Table 7.11.
Table 7.11: Test error relative to normal distribution
Z value             0.5     1.0     2.0     3.0
Percentage          34.3%   72.1%   96.6%   98.7%
Normal Percentage   38.3%   68.2%   95.4%   99.7%
7.8 Summary
This chapter has demonstrated that Bayesian Belief Networks can be used as
a substitute for complete Markov Models when one is assessing the reliability of a
software package. The network presented is capable of estimating the output of a
Markov Model for software reliability within one order of magnitude when using five
Bayesian Belief nodes per pair of program nodes.
It is certainly possible to improve the accuracy by increasing the number of states
used when converting the continuous reliability and execution probability variables
into discrete states. However, for systems in which the exact reliability parameters
are not known, the resolution provided by this network is sufficient to provide an
estimate of the net reliability. One area for future research certainly is to analyze the
effect of increasing the number of states relative to the increased precision that would
be obtained by this method.
Chapter 8
The Software Static Analysis
Reliability Tool1
In order to use the proposed software reliability model, a Software Static Analysis
Reliability Tool (SOSART) has been developed. The SOSART tool combines static
analysis results, coverage metrics, and source code into a readily understandable
interface and uses this information for reliability calculation. However, before discussing the details
of the SOSART tool, it is important to understand the capabilities and limitations
of currently existing reliability tools.
8.1 Existing Software Reliability Analysis Tools
In the study of software reliability, many tools have been
developed to allow assessment and analysis. This section is intended to provide a
1Portions of this chapter have appeared in Schilling and Alam[SA06a].
brief overview of the existing tools and their capabilities in order that they can be
compared with the SOSART analysis tool developed as part of this research.
The Statistical Modeling and Estimation of Reliability Functions for Software
(SMERFS) [EKN98] [Wal01] estimates reliability of software systems using a black-
box approach. It provides a range of reliability models. The original tool used a tex-
tual based interface and operated predominantly in the UNIX environment. However,
the latest version of SMERFS, SMERFS3[Sto05], operates in a Windows environment
and includes a Graphical User Interface. SMERFS3 also supports extended function-
ality in the form of additional models for both hardware and software reliability. One
important feature missing from SMERFS is the capability to automatically collect
data. This can make it difficult to use with larger projects.
The Computer-Aided Software Reliability Estimation [SU99] (CASRE) is quite
similar to SMERFS in that it is also a black-box tool for software reliability estimation.
CASRE supports many of the same models supported by SMERFS. CASRE operates
in a Windows environment and does have a GUI. One significant feature included
in CASRE is the ability to calculate the linear combination of multiple models, allowing the
construction of more complex models.
The Automated Test Analysis for C (ATAC)[HLL94] evaluates the efficacy of
software tests using code coverage metrics. This represents the first tool discussed
that is a white-box tool. The command line tool uses a specialized compiler (atacCC)
which instruments compiled binaries to collect run-time trace information. The run-
time trace file records block coverage, decision coverage, c-use, p-use, etc. The tool
assesses test completeness and visually displays lines of code not exercised by tests.
ATAC only functions on C code.
Software Reliability and Estimation Prediction Tool (SREPT)[RGT00] allows the
assessment of software reliability across multiple lifecycle stages. Early reliability pre-
dictions are achieved through static complexity metric modeling and later estimates
include testing failure data. SREPT can estimate reliability as soon as the software's
architecture has been developed, and the tool can also be used to estimate release
times based on project data and other criteria.
The Reliability of Basic and Ultra-reliable Software Tool (ROBUST)[Den99][LM95]
supports five different software reliability growth (SRG) models. Two of the four
models can be used with static metrics for estimation during the early stages of de-
velopment, while one model includes test coverage metrics. ROBUST operates on
data sets of failure times, intervals, or coverage. Data may be displayed in text form
or in an appropriate graphical form.
The “Good-Enough” Reliability Tool (GERT)[DZN+04] provides a means of calcu-
lating software reliability estimates and of quantifying the uncertainty in the estimate.
The tool combines static source code metrics with dynamic test coverage information.
The estimate and the confidence interval is built using the Software Testing and Re-
liability Early Warning (STREW)[Nag05] in-process metric suite. GERT provides
color-coded feedback on the thoroughness of the testing effort relative to prior suc-
cessful projects. GERT is available as an open source plug-in under the Common
Public License (CPL) for the open source Eclipse development environment. GERT
has been extended to Version 2 [SDWV05], providing additional metrics as well as
better data matching.
Thus far, each of the tools discussed has lacked the ability to interact with static
analysis tools. The AWARE tool[SWX05] [HW06], developed by North Carolina State
University, however, does interact with static analysis tools. Developed as a plug-in
for the Eclipse development environment, AWARE interfaces with the Findbugs static
analysis tool. However, whereas the key intent of the tools discussed previously is to
directly aid in software reliability assessment, the AWARE tool is intended to help
software engineers in prioritizing statically detectable faults based upon the likelihood
of them being either valid or a false positive. AWARE is also limited in that it only
supports the Findbugs static analysis tool, and the user interface consists of a basic
textual listing of faults displayed in an Eclipse environment.
8.2 SOSART Concept
To effectively use the model previously developed for all but the smallest of pro-
grams requires the development of an appropriate analysis tool. This tool will be re-
sponsible for integrating source code analysis, test watchpoint generation, and static
analysis importation.
The first responsibility for the SoSART tool is to act as a bug finding meta tool
which automatically combines and correlates statically detectable faults from different
static analysis tools. It should be noted that though, while many examples given
previously in this dissertation were examples of statically detectable C faults, the
general software reliability model is not intended to be language dependent. As such,
the first application for the tool involves an analysis of a Java application. Thus, the
tool itself must be constructed in a manner to allow multiple programming languages
to be imported if the appropriate parsers are constructed.
Beyond being a meta tool, however, SoSART is also an execution trace analysis
tool. The Schilling and Alam model requires detailed coverage information for each
method in order to assess the reliability of a given software package. The SoSART
tool thus includes a customized execution trace recording system which captures and
analyzes execution traces to generate branch execution metrics. While similar in
nature to the ATAC tool discussed previously in Section 8.1, the SOSART trace tool
does not require code instrumentation during the compile phase. Instead, for Java
programs, it interfaces with the Java Platform Debugger Architecture (JPDA) and
the Java Debug Interface (JDI). This allows any Java program to be analyzed without
recompilation or modification to the source code.
SoSART also includes a complete language parser and analyzer which has been
implemented using the ANTLR[Par] toolkit. The parser breaks the source code into
the fundamental structural elements for the model, namely classes, methods, state-
ment blocks, and decisions. A class represents the highest level of organization and
consists of one or more methods. A statement block represents a continuous set of
source code instructions uninterrupted by a conditional statement. One or more state-
ment blocks, coupled with the appropriate conditional decisions, makes up a method.
When determining the path coverage, the SoSART tool uses the parsed information
to determine which execution traces match a given path through a given method.
As a side effect of parsing, SoSART is also capable of generating limited structural
metrics.
The SoSART user interface consists of two portions, a command line tool and
a graphic user interface. The command line toolkit allows users to execute Java
programs on top of the SOSART system while it collects coverage information for
analysis usage.
SoSART also includes a graphical user interface built using the JGraph toolkit[Ben06].
This Graphical User Interface allows the generation of pseudo-UML Activity diagrams
for each method, as well as displaying the static faults, branch coverage, and struc-
tural metrics for each method. Statically detectable faults are identified through a
four color scheme, with green characterizing a statement block with no known stati-
cally detectable faults, yellow indicating a slight risk of failure within that statement
block due to a statically detectable fault, orange indicating an increased risk over yel-
low, and red indicating a serious potential for failure within the code segment. Color
coding uses gradients to indicate both the most significant risk identified as well as
the typical risk. The overall reliability is calculated by combining the observed ex-
ecution paths with the potential execution paths and the statically detectable fault
locations.
General software requirements for the SOSART tool are provided in Appendix B.
8.3 SOSART Implementation Metrics
Development of the SOSART tool, per the development process requirements,
used a process derived from the PSP process for most areas of development. This was
applied for all areas of the tool which were not considered to be “research intensive”;
the excluded areas were the development of the specific Bayesian Belief Networks, the
development of the ANTLR parser (which was viewed as a learning process), and other
similar areas.
Altogether, the effort expended in the development of the SOSART tool is provided
in Table 8.1. Effort has been recorded to the nearest quarter hour.
Table 8.1: SOSART Development Metrics
Category                              Effort (Hours)   Percentage Total
Planning                              92.25            9.23%
High Level Design                     262              26.24%
High Level Design Review              37.25            3.74%
Detailed Design                       154.5            15.47%
Detailed Design Review                39.25            3.93%
Implementation(1)                     224.75           22.5%
Code Review                           29.5             2.96%
Testing                               113.5            11.37%
Post Mortem and Project Tracking      26.75            2.68%
Debugging                             18.5             1.85%
Total                                 998.25           100%
(1) This data item combines both the implementation and compile phases of a PSP project.
Table 8.2 provides performance metrics regarding the implementation of the SOSART
tool in terms of the actual design and implementation complexity. With the exception
of several parameters that were contained within autogenerated code, all implemen-
tation metrics are within standard accepted ranges. Using the value of 30359 LOC
for the tool and the relationship
\mathrm{productivity} = \frac{\mathrm{LOC}}{\mathrm{time}} \qquad (8.1)
the productivity was 30.41 lines of code per hour. This number is extremely high
relative to the typical 10-12 LOC per hour expected for commercial grade production
code, but this can be explained by the nature and composition of the project. First,
Table 8.2: SOSART Overview Metrics
Metric Description                          Core SOSART   Path Trace   Total
Number of Packages                          20            1            21
Total Lines of Code                         29182         1177         30359
Non-ANTLR Lines of Code                     14424         1177         15601
Number of Classes                           167           13           180
Number of Static Methods                    151           9            160
Number of Attributes                        571           51           622
Number of Overridden Methods                105           13           118
Number of Static Attributes                 48            12           60
Number of Methods                           1825          152          1977
Number of Defined Interfaces                25            2            27
Average McCabe Cyclomatic Complexity        3.55          2.13         N/A
Maximum Cyclomatic Complexity               171*/31       17           31
Average Nesting Depth                       1.74          1.53         N/A
Maximum Nesting Depth                       13*/9         5            9
Depth of Inheritance (Average)              2.641         1.30         N/A
Depth of Inheritance (Maximum)              7             3            7
* Complexity contained within an ANTLR autogenerated code module.
since the tool was an experimentally developed tool, the amount of time spent devel-
oping test plans and executing test plans was significantly less than would be expected
for a production grade project. Overall, this would result in a decrease in produc-
tivity if a professional grade development process were being followed. Second, the
usage of the ANTLR tool automatically generated a significant portion of the source
code. The ANTLR package itself contains 14758 lines of code. While a significant
amount of development was required to create the language description for ANTLR,
if these lines of code are removed from consideration, the productivity drops to 15.62
LOC per hour, or much closer to industry accepted values. Third and finally, while
reviews of code were conducted on the material as it was constructed, no significant
peer reviews occurred. Properly reviewing a program of this size, assuming a review
rate of 100 LOC per hour, would require an additional 144 hours of effort, bringing
the net total effort to 1142 hours, and the effective productivity to 12.6 LOC / hour.
Bug tracking for the SOSART tool was handled using the SourceForge bug track-
ing system. Any bug which was discovered after testing and the completion of integra-
tion into the development tip was tracked using the bug tracking database. Overall,
18 post-release defects were uncovered and tracked in this manner, resulting in a
post-release defect rate of approximately 0.6 defects per KLOC. This number is extremely low, and
it is suspected that further significant defects will be uncovered as the tool is further
used in the field by different researchers.
8.4 External Software Packages Used within SOSART
In order to streamline development of the SOSART analysis tool, as well as ensure
an appropriate level of quality in the final delivered tool, three external components
were used within the development of the SOSART tool, namely the ANTLR parser
generator, the JGraph graphing routines, and the EBAYES Bayesian belief engine.
Another Tool for Language Recognition (ANTLR)[Par] is a language tool which
provides the required framework for constructing recognizers, compilers, and transla-
tors. The tool uses a grammar definition which contains Java, C#, Python, or C++
actions. ANTLR was chosen because of its availability under the BSD license as well
as a readily available grammar definition for the Java Programming language which
can be readily expanded upon. The ANTLR software was used to generate the parser
for the Java input code, appropriately separating Java files into classes and methods
as well as partitioning methods into code blocks and decisions.
JGraph[Ben06] is an open source graph visualization library developed entirely in
the Java language. It is fully Swing compatible in both its visual interface as well as
its design paradigm, and can run on any JVM 1.4 or later. JGraph is used principally
in the user interface area of the SOSART tool, allowing the visualization of method
flows as well as managing the layout of the UML activity diagrams.
EBayes[Coz99] is an optimized Java engine for the calculation of Bayesian Net-
works. Its goal was to develop an engine small enough and efficient enough to perform
Bayesian Network analysis on small embedded systems microprocessors. The engine
is derived from the JavaBayes[Coz01] system which was used for the conceptual gen-
eration of the Bayesian belief networks used in this research and within the SOSART
tool. The EBayes engine was used to calculate the Bayesian belief network values
within the SOSART tool.
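
To make the role of the belief engine concrete, the following minimal sketch (which does not use the EBayes API, and whose probabilities are purely hypothetical) shows the kind of discrete Bayesian update such an engine performs when combining a prior risk estimate for a code block with an observed static analysis fault.

// Minimal illustration (not the EBayes API) of a discrete Bayesian update:
// given a prior over a "block reliability" variable and a conditional
// probability of observing a static analysis fault in each state, compute
// the posterior marginal.  All numbers are hypothetical.
public class TinyBayesExample {
    public static void main(String[] args) {
        String[] states = {"High", "Medium", "Low"};
        double[] prior       = {0.70, 0.20, 0.10};   // P(Reliability)
        double[] pFaultGiven = {0.05, 0.30, 0.80};   // P(fault observed | Reliability)

        // Bayes' rule: posterior(i) is proportional to prior(i) * P(evidence | i)
        double[] posterior = new double[states.length];
        double norm = 0.0;
        for (int i = 0; i < states.length; i++) {
            posterior[i] = prior[i] * pFaultGiven[i];
            norm += posterior[i];
        }
        for (int i = 0; i < states.length; i++) {
            posterior[i] /= norm;
            System.out.printf("P(Reliability = %s | fault) = %.3f%n", states[i], posterior[i]);
        }
    }
}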
8.5 SOSART Usage
8.5.1 Commencing Analysis
The general operation for the SOSART tool begins with the user obtaining a pre-
viously existing module to analyze. The module should compile successfully without
any significant compiler warnings. Assuming the GUI version of the tool is being used, the user starts SOSART and, upon successful start, is presented with a standard graphical menu from which the appropriate selections can be made.
Normal operation begins with the user selecting the Java files that are to be imported
and analyzed using the analysis menu, as is shown in Figure 8-1. Any number of files
can be selected for importation using the standard file dialog box so long as the
files reside within the same directory path. In addition to allowing the importation
of multiple files at one time, it is also possible to import multiple sets of files by
importing each set individually. The tool itself will protect against a single file being
imported multiple times into the project based upon the Java file name.
Figure 8-1: Analysis menu used to import Java source code files.
As the source files are imported into the tool, several things occur. First and foremost, each class is parsed and separated into its constituent methods. Summary data about each class is generated and used to populate the class summary panel, as is shown in Figure 8-2.
In addition to the summary panel, the key user interface for each imported class consists of a tabbed display, with one tab showing the source code, as is shown in Figure 8-3. Importation also generates a basic UML activity diagram or control flow diagram for each method, as is shown in Figure 8-4.

Figure 8-2: Summary panel for imported class.
Figure 8-3: Source Code Panel for program.
Figure 8-4: Basic Activity Diagram for program.
8.5.2 Obtaining Program Execution Profiles
In order to obtain the program execution profile for the given source code module,
it is necessary to obtain from the structure the line numbers which represent the start
of each code segment. The SOSART tool provides such capability. The watchpoints
are generated as a textual file by the SOSART GUI, and then these are fed to the
command line profiler while actually executes the program modules and obtains the
execution traces.
Figure 8-5: Basic Tracepoint Panel for program.
For the source code module loaded in Figure 8-3, there are many locations which
represent the start of a code block. Examples include Line 199, Line 216, Line 227,
etc. In order to obtain a listing of these locations, the tracepoint window is opened
from the analysis window. The Generate Tracepoints button will, based upon the
parsed source code, generate a listing of tracepoint locations, as is shown in Figure
8-5.
Once the tracepoint locations have been defined, the Java Path Tracer portion of
the SOSART tool can be invoked from the command line. This tool uses a set of
command line parameters to indicate the classpath for the program which is to be
executed, the trace output file, the path output file, the tracepoint file which defines
the locations that are to be observed for program execution, as well as other requisite
parameters. These parameters are shown in detail in Figure 8-6, which represents the
manual page displayed when executing the tool from the command line if an improper
set of parameters is provided.
wws@WWS-Ubunto:~/GRC_Tempest/TempestSrc$ java -jar JavaPathTracer.jar
<class> missing Usage: java Trace <options>
<class> <args> <options> are:
-classpath <path> This parameter sets up the classpath for the JVM running as a JDB process.
-traceoutputfile <filename> Outputs a complete execution trace to a given file. Warning: Files may be large.
-pathoutputfile <filename> This is the output file that is to be used for placing the xml output of the paths traversed.
-tracepointfile <filename> This option will set up the given tracepoint file, indicating which tracepoints are to be logged.
-showtracepointsetup This option will turn on the display of the tracepoints as they are set.
-periodicpathoutput <rate> This option will enable periodic output of the paths traversed. The rate is given in seconds.
-stampedperiodicfilename This option will enable periodic output of the paths traversed. The rate is given in seconds.
-maxloopdepth This option sets the maximum number of times a loop will be recorded when tracing. The default is 0
which results in infinite tracing through loops.
-help Print this help message
<class> is the program to trace <args> are the arguments to <class>
Figure 8-6: Java Tracer command line usage.
By supplying the appropriate tracepoint parameters on the command line, the tool
can be invoked to analyze a given program set. In the case of the example program,
this results in the command line log as is shown in Figure 8-7.
wws@WWS-Ubunto:~/GRC_Tempest/TempestSrc$ java -jar JavaPathTracer.jar -pathoutputfile demo_tracer.xml -tracepointfile demo_trace.txt
-showtracepointsetup -periodicpathoutput 60 -maxloopdepth 1 -stampedperiodicfilename gov.nasa.grc.ewt.server.Tempest 9000 noauth
nolog nodebug nopersist
Deferring breakpoint gov.nasa.grc.ewt.server.HTTPString:1007.
It will be set after the class is loaded.
Deferring breakpoint gov.nasa.grc.ewt.server.HTTPString:1011.
It will be set after the class is loaded.
Deferring breakpoint gov.nasa.grc.ewt.server.HTTPString:1015.
...
Deferring breakpoint gov.nasa.grc.ewt.server.HTTPString:996.
It will be set after the class is loaded.
main
Starting Tempest Java $Revision: 1.3.2.1.4.2 $ at Tuesday, 08 May 2007 11:24:35 EDT
thread 1 about to accept on port 9000
thread 4 about to accept on port 9000
thread 3 about to accept on port 9000
thread 6 about to accept on port 9000
thread 5 about to accept on port 9000Set deferred breakpoint gov.nasa.grc.ewt.server.HTTPString:996
Set deferred breakpoint gov.nasa.grc.ewt.server.HTTPString:993
Set deferred breakpoint gov.nasa.grc.ewt.server.HTTPString:989
...
Set deferred breakpoint gov.nasa.grc.ewt.server.HTTPString:1007
thread 1 accepted a connection from 192.168.0.152 192.168.0.152
thread 7 accepted a connection from 192.168.0.152 192.168.0.152
thread 8 accepted a connection from 192.168.0.151 192.168.0.151
thread 8 about to accept on port 9000
thread 1 accepted a connection from 192.168.0.151 192.168.0.151
thread 7 accepted a connection from 192.168.0.103 192.168.0.103
wws@WWS-Ubunto:~/GRC_Tempest/TempestSrc$
Figure 8-7: Java Tracer execution example.
In this particular instance, the most important output is the gathered XML trace information showing how many times each of the methods was executed and which paths through the method were taken. A short example of this is shown in Figure 8-8. This figure represents the execution traces obtained during approximately one minute (60046 ms, to be exact) of program execution. This textual trace representation can then be imported into the SOSART tool, as is shown in Figure 8-9. In addition to the number of executions for each path being shown numerically and by color, the relative execution frequency is shown graphically through thicker and thinner execution path traces. Those paths which are taken more often have a thicker execution trace line, and those which are executed less often have a thinner one.
<Program name="Test Program" >
<StartTime value="1178637874426"/>
<CurrentTime value="1178637934472"/>
<ClockTimeDuration value="60046"/>
<SourceFile filename="HTTPString.java" >
<Method filename="HTTPString.java" methodName="<init>" methodType="normal" >
<ExecutionPath method="<init>" path="<init>" pathcount="11" /></Method>
<Method filename="HTTPString.java" methodName="_getResponseMessage" methodType="normal" >
<ExecutionPath method="_getResponseMessage" path="_getResponseMessage 693 705 887" pathcount="11" />
<ExecutionPath method="_getResponseMessage" path="_getResponseMessage 693 790 887" pathcount="1" /></Method>
<Method filename="HTTPString.java" methodName="checkAuthorization" methodType="normal" >
<ExecutionPath method="checkAuthorization" path="checkAuthorization 569 611 634 637" pathcount="11" /></Method>
<Method filename="HTTPString.java" methodName="findMatch" methodType="normal" >
<ExecutionPath method="findMatch" path="findMatch 1109 1124" pathcount="1" /></Method>
<Method filename="HTTPString.java" methodName="getLocalFile" methodType="normal" >
<ExecutionPath method="getLocalFile" path="getLocalFile 1066 1072 1078" pathcount="10" /></Method>
<Method filename="HTTPString.java" methodName="process" methodType="normal" >
<ExecutionPath method="process" path="process
199 227 235 253 256 259 275 312 333 460 462 476
491 499 515 517 527 529 546 552 555 558 562" pathcount="3" />
<ExecutionPath method="process" path="process
199 227 235 253 256 259 275 312 333 460 462 491
499 515 517 527 533 546 552 555 558 562" pathcount="4" />
<ExecutionPath method="process" path="process
199 227 235 253 256 259 275 312 333 460 462 491
499 562" pathcount="1" />
<ExecutionPath method="process" path="process
199 227 235 253 256 259 275 312 333 470 488 499
515 517 527 533 546 552 555 558 562" pathcount="2" />
</Method>
<Method filename="HTTPString.java" methodName="processString" methodType="normal" >
<ExecutionPath method="processString" path="processString 1029 1037 1042 1045 1052 1057" pathcount="1" />
<ExecutionPath method="processString" path="processString 1029 1037 1042 1045 1057" pathcount="2" />
<ExecutionPath method="processString" path="processString 1029 1037 1042 1045 546 552 555 558 562"
pathcount="1" />
</Method>
<Method filename="HTTPString.java" methodName="replaceString" methodType="normal" >
<ExecutionPath method="replaceString" path="replaceString 1085 1096 1099 1057" pathcount="1" />
<ExecutionPath method="replaceString" path="replaceString 1085 1113 1124" pathcount="2" />
<ExecutionPath method="replaceString" path="replaceString 1085 1119 1124" pathcount="2" />
</Method>
<Method filename="HTTPString.java" methodName="sendContentHeaders" methodType="normal" >
<ExecutionPath method="sendContentHeaders" path="sendContentHeaders 942 949 952 960 964 970 973"
pathcount="11" /></Method>
<Method filename="HTTPString.java" methodName="sendExtraHeader" methodType="normal" >
<ExecutionPath method="sendExtraHeader" path="sendExtraHeader 982 989 996" pathcount="10" /></Method>
<Method filename="HTTPString.java" methodName="sendFirstHeaders" methodType="normal" >
<ExecutionPath method="sendFirstHeaders" path="sendFirstHeaders 893 896 901 909 916 923 932"
pathcount="11" /></Method>
<Method filename="HTTPString.java" methodName="sendResponse" methodType="normal" >
<ExecutionPath method="sendResponse" path="sendResponse 1007 1015" pathcount="11" /></Method>
<Method filename="HTTPString.java" methodName="setAddress" methodType="normal" >
<ExecutionPath method="setAddress" path="setAddress 187" pathcount="11" /></Method>
<Method filename="HTTPString.java" methodName="verifyClient" methodType="normal" >
<ExecutionPath method="verifyClient" path="verifyClient 644 675" pathcount="11" /></Method>
</SourceFile>
</Program>
Figure 8-8: XML file showing program execution for HTTPString.java class.
Figure 8-9: Execution trace within the SOSART tool. Note that each path through the code which has been executed is designated by a different color.
8.5.3 Importing Static Analysis Warnings
In order for SOSART to perform its intended purpose and act as a static analysis metadata tool, the tool must support the capability to import and analyze statically detectable faults. Because of the wide variety of static analysis tools on the market, and the many different manners in which they can be executed, the SOSART tool does not automatically invoke the static analysis tools. Instead, it is expected that the static analysis tools will be executed independently of the SOSART tool, through the code compilation process or by another external tool, prior to analyzing the source code with SOSART.
Table 8.3: A listing of SOSART supported static analysis tools.

Software Tool                        Domain      Responsible Party
ESC-Java                             Academic    Software Engineering with Applied Formal Methods Group, Department of Computer Science, University College Dublin
FindBugs                             Academic    University of Maryland
Fortify Source Code Analysis (SCA)   Commercial  Fortify Software
JiveLint                             Commercial  Sureshot Software
Klocwork K7                          Commercial  Klocwork
Lint4j                               Academic    jutils.com
JLint*                               Academic    Konstantin Knizhnik, Cyrille Artho
PMD                                  Academic    Available from SourceForge with BSD License
QAJ                                  Commercial  Programming Research Limited

*Modified to generate an XML output file.
Once the analysis tools have been run, importation of the faults begins by importing the source code that is to be analyzed into the SOSART tool, as has been described previously. Once this has been completed, the statically detectable faults detected by each of the executed static analysis tools are imported into the program. The static analysis tools supported by SOSART are given in Table 8.3.
As each statically detectable warning is imported, it is compared with the warn-
ings which have been previously imported into the SOSART tool and assigned a
taxonomical definition. If a warning has not been previously assigned a taxonomical
Figure 8-10: Taxonomy assignment panel.
Figure 8-11: A second taxonomy definition panel.
definition, a dialog box prompts the user to assign the fault to an appropriate definition. An example of this dialog box is shown in Figure 8-10. In this particular instance, the first instance of a PMD warning against a method having multiple return statements has been imported. Using the SOSART taxonomy, this fault can be classified as a General Logic Problem, as there is no specific categorization defined for this type of fault. In general, this fault exhibits a very low potential for immediate failure, though there is a maintenance risk associated with it, as methods which have multiple returns can be more difficult to maintain over time than methods which contain only a single return statement.
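
The hypothetical method below is representative of the code this rule flags. It behaves correctly, which is why the immediate failure risk is low, but the additional exit points complicate later maintenance.

// Hypothetical example of the kind of code PMD flags with its
// "multiple return statements" rule.  The logic is correct, so the
// immediate failure risk is low; the extra exit points simply make
// the method harder to modify safely later.
public class MultipleReturns {
    static String classifyStatus(int code) {
        if (code >= 500) {
            return "server error";
        }
        if (code >= 400) {
            return "client error";
        }
        if (code >= 300) {
            return "redirect";
        }
        return "ok";
    }

    public static void main(String[] args) {
        System.out.println(classifyStatus(404));   // prints "client error"
    }
}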
In the case of the fault shown in Figure 8-11, this represents a PMD fault where the programmer may have confused the Java equality operator (==) with the assignment operator (=). At a high level, this represents a Medium risk of failure; upon further review, certain cases may be found to carry a more significant risk. This fault is assigned to the taxonomical definition “Comparison and Assignment confused”, as this best represents the nature of the fault. A representative instance is sketched below.
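
The following hypothetical fragment shows the pattern in question. In Java the construct compiles only when the variable is boolean; when it does compile, the condition silently assigns rather than compares.

// Hypothetical fragment illustrating the "comparison and assignment
// confused" fault class: the programmer probably intended a comparison
// (==) but wrote an assignment (=).
public class AssignmentVsEquality {
    public static void main(String[] args) {
        boolean authorized = false;

        // Intended: if (authorized == true).  The single '=' assigns true
        // to the flag, so the condition always succeeds and the guarded
        // branch runs even for unauthorized callers.
        if (authorized = true) {
            System.out.println("access granted");   // always printed
        }
    }
}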
Once all faults have been imported and assigned to the appropriate taxonomical
definitions, the Activity Diagrams / Control flow diagrams for each method are up-
dated to include the statically detectable faults. This results in a display similar to
that shown in Figure 8-12. This display shows the respective anticipated reliability
for each block as a color code. Red indicates the most significant risk of failure,
while green indicates that the code segment is relatively safe from the risk of failure.
Orange and yellow reflect intermediate risks in between green and red, with yellow
being slightly more risky than green and orange being slightly less risky than red. In addition to marking a fault as valid or invalid, the fault verification panel described below can also be used to modify the immediate failure risk, the maintenance risk, whether the fault failed during testing, and the fault-related reliability values which were set by the taxonomy when the fault was imported.
Each code segment block and listing of static analysis warnings can have a two
color gradient present. One color of the gradient represents the typical risk associated
with the given code block. The second color represents the maximum risk associated
with that code block. For example, a code block which contains both green and
orange colors indicates that the code block typically has very little risk associated
with it, but there is the potential that under certain circumstances, a significantly
high amount of risk exists. These colors are driven by the Bayesian Belief Network
described previously in Chapter 6.
By clicking on the static analysis listing, the software engineer can open a panel
which can then be used to mark a fault as either “Valid”, “Invalid”, or “Unverified”,
as is shown in Figure 8-13. Faults which are deemed to be invalid will be displayed with a grey background the next time the panel is opened. Faults which are valid
but have no risk will have a green background, and faults which are valid but have
a higher risk associated with them will be displayed with an appropriate background
color of either green, yellow, orange, or red, depending upon their inherent risk.
In addition to invoking this display panel from the method activity diagram /
program data flow diagrams, this same display panel can be invoked from the report
menu option. However, when invoking this display from the report menu, there are a few differences. The report menu option has two potential selection values, namely the class level faults option and the all file faults option. Whereas clicking on the code
segment on the activity diagram only shows the faults which are related to the given
code block, selecting the item from the menu displays either those faults which are
detectable at the class level (and thus, are not assigned to a given method) or all
statically detectable faults within the file. In either case, the behavior of the panel is the same; only the set of faults displayed differs.
The report menu also allows the user to view report data about the project and
the distribution of statically detectable faults. For example, the report shown in
Figure 8-14 provides a complete report of the statically detectable faults which are
present within the HTTPString.java file of this project. The first three columns deal
with warning counts. The first column shows the number of faults of each type which
have been detected in the overall file. The second column indicates the number of
faults which have been deemed to be valid upon inspection of the fault, and the third
column indicates the number of faults which have been deemed to be invalid based
upon project inspection. The percent valid column indicates the percentage of faults of each type which have been determined to be valid upon inspection. The number of statements column counts the executable statements found within this source code module. The last two columns calculate the density of the warnings detected and of those found valid, relative to the number of statements.
This report can be generated for three different sets of data. The first set, de-
scribed previously, generates this report for the currently opened file. This same
report can also be generated at the project scope which encompasses all files that
are being analyzed. Finally, this report can be generated based upon historical data,
which encompasses all previously analyzed projects.
8.5.4 Historical Database
In order to allow the appropriate data retention and historical profiling so that the
SOSART tool improves over time, SOSART includes a built-in historical database
system. When used properly, the database system allows the user to store past
information regarding the validity of previously detected faults over multiple projects.
Furthermore, to allow project families to be created, the database system can be
segmented to allow different database sets to be used based on the project. Even
though the system is referred to as a database, the current SOSART implementation
does not actually require the usage of a separately installed database, such as MySQL.
The analyze menu contains many of the parameters necessary to use the historical
database capabilities to analyze a given project, as is shown in Figure 8-15. From this
menu, the user can load and save the historical database to a given file. This allows
the user to control which historical database is used for assigning validity values to the
statically detectable faults. This capability also allows separate historical databases to
be kept, preventing violations of Non-Disclosure agreements when analyzing software
projects.
As a user is analyzing a module with SOSART, the static analysis warnings
that are manipulated are not automatically transferred into the historical database.
Instead, the user must explicitly force the analyzed warnings to transfer into the
database. This is done for two reasons. First, it prevents the database from being contaminated by erroneous entries made while learning the tool. Second, and most importantly, it allows the user to delay the transfer into the database until a project has been fully completed. Large projects may require more than one execution of the SOSART tool in order to fully complete the analysis, and it
is best for the transfer of warnings into the database to be delayed until the project
is completed. This is a commonly used paradigm for metrics collection tools within
software engineering.
The analyze menu also has the capability to allow the user to clear the database
of all previously analyzed faults. In general, it is expected that this capability would
rarely be used, but it is supported to allow the database to be reset if new program
families are analyzed or there is some other desire to reset the historical data back to
its initial values.
The Program Configuration Panel, shown in Figure 8-16, also allows configuration
related to the historical database to be performed. The SOSART program, based
upon its default configuration, will automatically load a given database upon program
start. This is a basic feature of the program. However, it is possible to change which
database is loaded based upon the user's preferences. It is also possible to configure
whether or not the database is automatically saved upon program exit. Under most
circumstances, it is desired for all changes to the historical database to be saved when
the program is exited. However, there may be certain circumstances where this is not
the appropriate action to take based upon the analysis being performed.
It is also possible to configure the tool to automatically transfer all analyzed
faults into the historical database upon program exit. While this does prevent user errors in transferring the data to the historical database, it can also lead to corruption of the database if a person is analyzing a
project that is not intended to be kept for future reference.
The Randomize Database on Load feature allows the user to randomize the pri-
mary key used within the database when a given database is loaded. By design, the
key used to uniquely identify a statically detectable fault includes the file name, the
line number, the static analysis tool which detected the fault, and the fault which
was detected. If the database is randomized, an additional random value is appended
into the key definition. This allows for multiple instances of the same warning in
subsequent revisions of a given source code module to be considered unique. Ran-
domization also, to some extent, obfuscates the data, which may be important based
upon conditions of an NDA.
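
As a minimal sketch (hypothetical code, not SOSART's actual implementation), a key of the kind described above could be constructed as follows, with the random component appended only when randomization is enabled.

import java.util.UUID;

// Hypothetical sketch of a fault key built from the fields the text
// describes: file name, line number, reporting tool, and fault identifier,
// with an optional random suffix appended when "Randomize Database on
// Load" is enabled.  The rule name used below is purely illustrative.
public final class FaultKey {
    static String build(String file, int line, String tool, String fault, boolean randomize) {
        String key = file + ":" + line + ":" + tool + ":" + fault;
        return randomize ? key + ":" + UUID.randomUUID() : key;
    }

    public static void main(String[] args) {
        System.out.println(build("HTTPString.java", 199, "PMD", "MultipleReturns", false));
        System.out.println(build("HTTPString.java", 199, "PMD", "MultipleReturns", true));
    }
}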
8.5.5 Exporting Graphics
In order to allow the graphics which are generated by SOSART to be imported
into external reports and other documents, SOSART supports the exportation of
program diagrams into graphics files. In order to avoid issues with royalties and
patent infringement, the Portable Network Graphic Format (PNG) was selected as
the only exportation format natively included with SOSART. The PNG format is a
raster image format intended to serve as a replacement for the GIF format. As such,
it is an Open Extensible Image Format with Lossless Compression. Another reason
for selecting the PNG format is that it was readily supported by the JGraph utilities
used for Graph manipulation.
Graphics exportation is only available if the currently selected tab on the display
is a Method activity diagram. The summary panels can not be exported, and the
code listing can not be exported in this manner. The resolution of the exportation
is effected by the image zoom as well. A larger zoom factor will result in a large
image, while a smaller zoom factor will result in a smaller image with less resolution.
Complex graphics may result in very large file sizes when exported.
8.5.6 Printing
In order to support the creation of hardcopy printouts of the analysis conducted
with SOSART, a standard print capability has been integrated with the tool. Printing is accessed from the file menu and opens a standard series of GUI print dialogs. Options
to be selected by the user include scaling features, which allow the graphs to be
printed to a normal size, to fit a given page size, to a specific scale size, or to fit a
certain number of pages.
Other print configuration parameters include the capability to print either the
current graph or all loaded graphs. If the current graph is selected, only the currently
viewed method graph, summary panel, or source code panel will be printed. If all
loaded graphs is selected, then the summary panel, source code panel, and all graphs
associated with the current method will be printed.
8.5.7 Project Saving
In order to facilitate the analysis of larger Java projects which cannot be readily
analyzed in one sitting, as well as to protect the person doing the analysis from
random machine failure and retain results for future consultation, the SOSART tool
offers several mechanisms that can be used to load and save projects.
As with most analysis tools, SOSART offers the user the capability to create a new
project, save the current project, or load a previously saved project. When creating
a new project, the currently open project will first be closed before a new project is
created. The new project will not have any imported Java files, static warnings, or
execution traces loaded.
Saving a project will store all details related to the project, including loaded
source code modules, method activity diagrams, imported static analysis faults, and
execution traces. All data files are stored in XML using the JavaBeans XML persistence mechanism.
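
The JavaBeans XML persistence mechanism is provided by the java.beans.XMLEncoder and XMLDecoder classes. The following minimal sketch shows the mechanism in isolation; the ProjectSettings bean is a hypothetical stand-in for SOSART's actual project objects.

import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Minimal sketch of JavaBeans XML persistence (java.beans.XMLEncoder /
// XMLDecoder).  "ProjectSettings" is a hypothetical bean, not one of
// SOSART's real classes.
public class XmlPersistenceDemo {
    public static class ProjectSettings {            // must be a public bean
        private String projectName = "";
        public String getProjectName() { return projectName; }
        public void setProjectName(String name) { this.projectName = name; }
    }

    public static void main(String[] args) throws IOException {
        ProjectSettings settings = new ProjectSettings();
        settings.setProjectName("Tempest analysis");

        // Serialize the bean's properties to an XML file.
        try (XMLEncoder out = new XMLEncoder(
                new BufferedOutputStream(new FileOutputStream("project.xml")))) {
            out.writeObject(settings);
        }

        // Reload the bean from the XML file.
        try (XMLDecoder in = new XMLDecoder(
                new BufferedInputStream(new FileInputStream("project.xml")))) {
            ProjectSettings loaded = (ProjectSettings) in.readObject();
            System.out.println(loaded.getProjectName());
        }
    }
}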
Because of the large size of XML files created when saving an entire project, a
secondary method has been created to store projects. With this method, only the
static analysis warnings and their modified validity values and risks are stored for
future re-importation. This mechanism offers an extremely large savings in terms of
file size and also results in a tremendous performance improvement when loading a
large project. Using this mechanism, a user works on a project, assessing the risks associated with it. When the time comes to save the project, only the static
analysis warnings are saved. To again work on the project, the user must re-import
the Java files and program execution traces before reloading the statically detectable
faults which were previously saved.
8.5.8 GUI Capabilities
The SOSART GUI interface supports all common graphical behaviors related to
image size, including cascading of images, tiling of images, and zooming.
Zooming of images is principally used when viewing method activity diagrams.
There are three principal mechanisms for performing zoom operations. The “Zoom
In” and “Zoom Out” features zoom the graphic in or out by a factor of two, depending
on which menu item is selected. The zoom dialog box allows the user to zoom to one
of eight pre-configured zoom values, namely 200%, 175%, 150%, 125%, 100%, 75%,
50%, or 25%, as well as offering a drag bar which can set the zoom value to any
intermediate zoom factor.
Because of the possibility that there may be multiple Java files imported into one
analysis project, the SOSART tool includes the capability to tile horizontally and
vertically, as well as cascade the opened files. These behaviors follow standard GUI
practices.
8.5.9 Reliability Report
The key functionality required by the SOSART tool is the ability to estimate the
reliability of a given program in a manner that can be used by a typical practicing
engineer. This is accomplished via the Reliability Report Panel, an example of which
is shown in Figure 8-17.
The reliability report panel provides the user with the appropriate details relative
to the given reliability of the loaded modules and execution traces. The display itself
is also color coded, with green indicating very good reliability values, and yellow,
orange, and red indicating lesser reliability values. In the particular example provided
in Figure 8-17, the net anticipated reliability is 0.999691.
In reviewing the report further, however, it can be seen that this high reliability is achieved in part because the modules themselves very rarely execute. To facilitate these detailed reviews, the
reliability report can be exported to a text file for external review and processing.
Figure 8-18 represents a portion of the complete textual report detailing the reliability
for the report shown in Figure 8-17.
Figure 8-12: Imported Static Analysis Warnings.
Figure 8-13: Static Analysis Verification panel.
Figure 8-14: Static Analysis fault report.
Figure 8-15: Analyze menu of SOSART tool.
Figure 8-16: SOSART Program Configuration Panel.
Figure 8-17: SOSART Reliability Report Panel.
Method: setAddress
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Method: setBuffer
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Method: process
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Method: checkAuthorization
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Method: verifyClient
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Method: _getResponseMessage
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Method: sendFirstHeaders
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Method: sendContentHeaders
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Method: sendExtraHeader
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Method: sendResponse
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Method: processString
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Method: getLocalFile
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Method: replaceString
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
##########################################################################################
Method Reliability Values
setAddress Posterior marginal for CoverageA: Very Often 0.000 Often 0.000 Normal 1.000
Rarely 0.000 Very Rarely 0.000 Never 0.000
setAddress Posterior marginal for ReliabilityA: Perfect 1.000 Very High 0.000 High 0.000
Medium 0.000 Low 0.000 Very Low 0.000
setBuffer Posterior marginal for CoverageB: Very Often 0.000 Often 0.000 Normal 0.000
Rarely 0.000 Very Rarely 1.000 Never 0.000
setBuffer Posterior marginal for ReliabilityB: Perfect 1.000 Very High 0.000 High 0.000
Medium 0.000 Low 0.000 Very Low 0.000
process Posterior marginal for CoverageA: Very Often 0.000 Often 0.000 Normal 1.000
Rarely 0.000 Very Rarely 0.000 Never 0.000
process Posterior marginal for ReliabilityA: Perfect 1.000 Very High 0.000 High 0.000
Medium 0.000 Low 0.000 Very Low 0.000
checkAuthorization Posterior marginal for CoverageB: Very Often 0.000 Often 0.000 Normal 1.000
Rarely 0.000 Very Rarely 0.000 Never 0.000
checkAuthorization Posterior marginal for ReliabilityB: Perfect 1.000 Very High 0.000 High 0.000
Medium 0.000 Low 0.000 Very Low 0.000
verifyClient Posterior marginal for CoverageA: Very Often 0.000 Often 0.000 Normal 1.000
Rarely 0.000 Very Rarely 0.000 Never 0.000
verifyClient Posterior marginal for ReliabilityA: Perfect 1.000 Very High 0.000 High 0.000
Medium 0.000 Low 0.000 Very Low 0.000
_getResponseMessage Posterior marginal for CoverageB: Very Often 0.000 Often 0.000 Normal 1.000
Rarely 0.000 Very Rarely 0.000 Never 0.000
_getResponseMessage Posterior marginal for ReliabilityB: Perfect 1.000 Very High 0.000 High 0.000
Medium 0.000 Low 0.000 Very Low 0.000
sendFirstHeaders Posterior marginal for CoverageA: Very Often 0.000 Often 0.000 Normal 1.000
Rarely 0.000 Very Rarely 0.000 Never 0.000
sendFirstHeaders Posterior marginal for ReliabilityA: Perfect 1.000 Very High 0.000 High 0.000
Medium 0.000 Low 0.000 Very Low 0.000
sendContentHeaders Posterior marginal for CoverageB: Very Often 0.000 Often 0.000 Normal 1.000
Rarely 0.000 Very Rarely 0.000 Never 0.000
sendContentHeaders Posterior marginal for ReliabilityB: Perfect 1.000 Very High 0.000 High 0.000
Medium 0.000 Low 0.000 Very Low 0.000
.
.
.
##########################################################################################
##########################################################################################
Final Results...
setAddress:setBuffer:process:checkAuthorization:verifyClient:_getResponseMessage:sendFirstHeaders:sendContentHeaders
Posterior marginal for CoverageA: Very Often 0.000 Often 0.059 Normal 0.941 Rarely 0.000 Very Rarely 0.000 Never 0.000
sendExtraHeader:sendResponse:processString:getLocalFile:replaceString:verifyClient:_getResponseMessage:sendFirstHeaders:...
Posterior marginal for CoverageB: Very Often 0.000 Often 0.030 Normal 0.970 Rarely 0.000 Very Rarely 0.000 Never 0.000
setAddress:setBuffer:process:checkAuthorization:verifyClient:_getResponseMessage:sendFirstHeaders:sendContentHeaders
Posterior marginal for ReliabilityA: Perfect 0.047 very High 0.421 High 0.457 Medium 0.073 Low 0.002 Very Low 0.000
setAddress:setBuffer:process:checkAuthorization:verifyClient:_getResponseMessage:sendFirstHeaders:sendContentHeaders:...
Posterior marginal for NetCoverage: Very Often 0.000 Often 0.096 Normal 0.904 Rarely 0.000 Very Rarely 0.000 Never 0.000
setAddress:setBuffer:process:checkAuthorization:verifyClient:_getResponseMessage:sendFirstHeaders:sendContentHeaders:...
Posterior marginal for NetReliability: Perfect 0.015 very High 0.230 High 0.529 Medium 0.202 Low 0.022 Very Low 0.001
Net Anticipated Reliability: 0.999691
##########################################################################################
Figure 8-18: Textual Export of Reliability report for analyzed source.
Chapter 9
Model Validation using Open
Source Software
9.1 Introduction
In order to provide the first set of experimental validations for the SoSART model, a series of experiments was conducted in which the reliability of an open source software package was calculated using the SoSART model and then compared with a reliability analysis using the STREW [Nag05] metrics. This series of experiments covered three open source programs: the RealEstate academic game developed by North Carolina State University and used for software engineering education, the open source JSUnit program available from SourceForge, and the Jester program, also available from SourceForge.
Programs were selected with several criteria in mind. First, due to the complexities of using the SoSART analysis tool, smaller projects needed to be analyzed, as the tool suffered from technical difficulties when larger applications were analyzed. While the intent of the model was to be applicable up to approximately 20 KLOC, significant tool problems developed as programs larger than 5 KLOC were analyzed. Second, in order to apply the STREW model, a set of JUnit test scripts was required, since the STREW metrics correlate expected software reliability with implementation and testing metrics. Third, and finally, access to a set of pseudo-requirements or other program documentation was required in order to estimate the number of program requirements and, from that, the resulting software reliability.
9.2 STREW Metrics and GERT
The STREW metrics suite is a set of development metrics defined by Nagappan
et al. [Nag05] [NWVO04] [NWV03] which have been shown to be effective at esti-
mating the software reliability from metric relationships. The suite combines a number of measurable parameters, including:
1. Number of Test Cases
2. The number of Source Lines of Code (SLOC)
3. The Number of Test Lines of Code (TLOC)
4. The Number of Requirements
5. The Number of Test Assertions
6. The Number of Source Classes
7. The Number of Conditionals
8. The Number of Test Classes
Based on these metrics, the reliability of software can be estimated using the
equation
Reliability = C0 + C1 · R1 + C2 · R2 − C3 · R3 + C4 · R4 (9.1)
where
R1 = Number of Test Cases / SLOC
R2 = Number of Test Cases / Number of Requirements
R3 = Test Lines of Code / SLOC
R4 = Number of Assertions / SLOC
and
C0, C1, C2, C3, C4 represent calibration constants.
The confidence interval, CI is calculated using the relationship
CI = Zα/2 · √( R(1 − R) / n )        (9.2)
where
R represents the calculated reliability
Zα/2 represents the upper α/2 quantile of the standard normal distribution for the desired confidence interval, and
n is the number of test cases provided by the project[DZN+04].
The STREW metrics are supported by the GERT [DZN+04] toolkit.
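
As a worked sketch of Equations 9.1 and 9.2, the following fragment computes the STREW reliability estimate and its confidence interval from raw metric counts. The metric values and the calibration constants C0 through C4 are hypothetical placeholders; in practice the constants are obtained by regression over historical project data, and Zα/2 = 1.96 corresponds to a 95% confidence level.

// Sketch of the STREW reliability estimate (Eq. 9.1) and its confidence
// interval (Eq. 9.2).  All metric counts and the calibration constants
// below are hypothetical placeholders.
public class StrewEstimate {
    public static void main(String[] args) {
        // Example metric counts (hypothetical project)
        double testCases = 220, sloc = 1250, tloc = 900,
               requirements = 30, assertions = 600;

        double r1 = testCases / sloc;
        double r2 = testCases / requirements;
        double r3 = tloc / sloc;
        double r4 = assertions / sloc;

        // Hypothetical calibration constants C0..C4
        double c0 = 0.90, c1 = 0.20, c2 = 0.001, c3 = 0.05, c4 = 0.10;
        double reliability = c0 + c1 * r1 + c2 * r2 - c3 * r3 + c4 * r4;

        // 95% confidence interval: Z(alpha/2) = 1.96, n = number of test cases
        double z = 1.96;
        double ci = z * Math.sqrt(reliability * (1.0 - reliability) / testCases);

        System.out.printf("R = %.4f, CI = +/- %.4f%n", reliability, ci);
    }
}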
9.3 Real Estate Program Analysis
The Real Estate program is an example program developed as a part of the open
seminar project at NCSU. The program, developed by a series of graduate students, documents the entire software development process for a simple game built using an agile development approach. A complete suite of unit tests was constructed using JUnit, and
these test cases are readily available with the source code. The Real Estate program
was developed in the Java language and has the overview metrics provided in Tables 9.1 and 9.2.
Table 9.1: Real Estate Overview Metrics
Metric Description                            Value
Number of Packages 2
Total Lines of Code (LOC) 1250
Number of Static Methods (NSM) 14
Number of Classes 40
Number of Attributes (NOF) 80
Number of Overridden Methods (NORM) 6
Number of Static Attributes (NSF) 25
Number of Methods (NOM) 236
Number of Defined Interfaces 4
Table 9.2: RealEstate Class Metrics
Name NSM NOF NORM NSF NOM LOC DIT*
Card.java 0 0 0 2 3 1 4
CardCell.java 0 1 0 0 3 1 4
Cell.java 0 3 1 0 9 1 4
Die.java 0 0 0 0 1 2 4
FreeParkingCell.java 0 0 0 0 2 4 3
GameBoard.java 0 4 0 0 13 4 3
GameBoardFull.java 0 0 0 0 1 7 3
GameMaster.java 1 8 0 2 41 11 2
GoCell.java 0 0 1 0 3 13 2
GoToJailCell.java 0 0 0 0 2 14 2
JailCard.java 0 1 0 0 4 18 2
JailCell.java 0 0 0 1 2 18 2
MoneyCard.java 0 3 0 0 4 23 1
MovePlayerCard.java 0 2 0 0 4 25 1
Player.java 0 9 1 0 34 28 1
PropertyCell.java 0 5 1 0 11 38 6
RailRoadCell.java 2 0 1 3 3 46 5
TradeDeal.java 0 3 0 0 7 75 1
UtilityCell.java 1 0 1 2 3 169 1
BuyHouseDialog.java 0 3 0 1 7 1 4
CCCellInfoFormatter.java 0 0 1 0 1 1 4
CellInfoFormatterTest.java 0 0 0 0 2 1 4
ChanceCellInfoFormatter.java 0 0 0 1 1 1 4
FreeParkingCellInfoFormatter.java 0 0 0 1 1 4 3
GameBoardUtil.java 5 0 0 0 0 10 3
GoCellInfoFormatter.java 0 0 0 1 1 14 2
GotoJailCellInfoFormatter.java 0 0 0 1 1 14 2
GUICell.java 0 3 0 1 7 16 2
GUIRespondDialog.java 0 2 0 1 3 16 2
GUITradeDialog.java 0 6 0 1 4 17 2
InfoFormatter.java 2 0 0 1 0 17 2
InfoPanel.java 0 0 0 1 1 18 2
JailCellInfoFormatter.java 0 0 0 1 1 19 2
Main.java 2 0 0 0 0 19 2
MainWindow.java 0 6 0 1 32 21 2
PlayerPanel.java 0 13 0 1 18 34 1
PropertyCellInfoFormatter.java 0 0 0 0 1 39 6
RRCellInfoFormatter.java 0 0 0 0 1 50 5
TestDiceRollDialog.java 0 4 0 1 2 59 1
UtilCellInfoFormatter.java 0 0 0 0 1 92 1
UtilDiceRoll.java 1 4 0 1 3 114 1
*Depth of Inheritance Tree
While the RealEstate program represents a slightly different domain than the
intended domain for this software model, it does provide a readily available, convenient package which can be used as a proof-of-concept application for the tool and model.
In order to provide a baseline reliability estimate for the RealEstate program, the
GERT analysis tool [DZN+04] [Nag05] was invoked from within the Eclipse environ-
ment. While this tool was intended to directly calculate the reliability of the software
given the input parameters, compatibility issues between the tool and available Eclipse
platforms and Java Run Time Environments limited application of the tool to data
collection, and the reliability was calculated externally. From the STREW metrics,
the reliability for the RealEstate program was calculated to range between 0.9185
and 1.000, with an expected reliability of 0.9712, as is shown in Table 9.3.
Table 9.3: Real Estate STREW Metrics Reliability Parameters
Parameter                        Value
Estimated Reliability 0.9712
Confidence Interval 0.0526
Maximum Reliability 1
Minimum Reliability 0.9185
After the assessment using the GERT tool was completed, the source code was
statically analyzed using seven independent static analysis tools which were supported
by the SOSART tool.1 This resulted in the detection of 889 statically detectable faults, 214 of which were deemed to be valid faults upon review with the SOSART tool,
as is shown in Table 9.4.
Based on the imported static analysis faults, an estimated reliability value was
calculated using the model and the assumption that all methods will be executed at a uniform normal rate. This resulted in the first reliability estimate of
0.8945.
Once this reliability was obtained, the program was executed and execution traces
were captured with the SOSART tool, providing accurate data on the branch coverage
under the tested use case. This data was then imported into the SOSART tool and the
reliability was re-calculated (assuming a medium confidence in the testing performed)
and found to be 0.9753. This reliability jump can be attributed to the fact that the methods which were the most unreliable within the system were rarely (if ever) called, and the methods which were most often traversed are the most reliable within the
1While previous research included 10 static analysis tools, licensing issues only allowed 8 tools to be used for this portion of experimentation.
Table 9.4: RealEstate Static Analysis Findings
Filename  Total  SA 1  SA 2  SA 3  SA 4  SA 5  SA 6  SA 7  SA 8
(each entry gives Valid / Invalid counts)
BuyHouseDialog.java  12 / 19  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  12 / 19  0 / 0  0 / 0
Card.java  4 / 1  0 / 0  0 / 0  0 / 0  2 / 0  0 / 0  2 / 1  0 / 0  0 / 0
CardCell.java  3 / 3  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  3 / 3  0 / 0  0 / 0
Cell.java  5 / 4  0 / 0  0 / 1  0 / 0  0 / 0  0 / 0  5 / 3  0 / 0  0 / 0
CellInfoFormatter.java  1 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  1 / 0  0 / 0  0 / 0
ChanceCellInfoFormatter.java  0 / 2  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 2  0 / 0  0 / 0
Die.java  2 / 2  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  2 / 2  0 / 0  0 / 0
FreeParkingCell.java  1 / 1  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  1 / 1  0 / 0  0 / 0
FreeParkingCellInfoFormatter.java  0 / 2  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 2  0 / 0  0 / 0
GameBoard.java  8 / 44  0 / 0  4 / 28  0 / 0  0 / 0  0 / 0  4 / 16  0 / 0  0 / 0
GameBoardFull.java  0 / 41  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 41  0 / 0  0 / 0
GameBoardUtil.java  2 / 18  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  2 / 18  0 / 0  0 / 0
GameMaster.java  13 / 127  1 / 0  3 / 82  2 / 1  0 / 0  1 / 0  6 / 44  0 / 0  0 / 0
GoCell.java  2 / 2  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  2 / 2  0 / 0  0 / 0
GoCellInfoFormatter.java  0 / 2  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 2  0 / 0  0 / 0
GotoJail.java  1 / 4  0 / 0  0 / 2  0 / 0  0 / 0  0 / 0  1 / 2  0 / 0  0 / 0
GotoJailCellInfoFormatter.java  0 / 2  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 2  0 / 0  0 / 0
GUICell.java  8 / 12  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  8 / 12  0 / 0  0 / 0
GUIRespondDialog.java  8 / 6  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  8 / 6  0 / 0  0 / 0
GUITradeDialog.java  15 / 22  0 / 0  0 / 0  1 / 0  0 / 0  0 / 0  14 / 22  0 / 0  0 / 0
InfoFormatter.java  4 / 9  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  4 / 9  0 / 0  0 / 0
InfoPanel.java  1 / 3  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  1 / 3  0 / 0  0 / 0
JailCard.java  2 / 6  0 / 0  0 / 2  0 / 0  0 / 0  0 / 0  2 / 4  0 / 0  0 / 0
JailCell.java  3 / 2  0 / 0  0 / 0  0 / 0  1 / 0  0 / 0  2 / 2  0 / 0  0 / 0
JailCellFormatter.java  0 / 2  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 2  0 / 0  0 / 0
Main.java  1 / 32  0 / 0  0 / 0  0 / 0  0 / 0  0 / 6  1 / 18  0 / 8  0 / 0
mainWindow.java  17 / 37  0 / 0  0 / 0  1 / 0  0 / 0  1 / 0  13 / 37  2 / 0  0 / 0
MoneyCard.java  1 / 7  0 / 0  0 / 2  0 / 0  0 / 0  0 / 0  1 / 5  0 / 0  0 / 0
MovePlayerCard.java  4 / 17  0 / 0  0 / 10  0 / 0  0 / 0  0 / 0  4 / 7  0 / 0  0 / 0
Player.java  23 / 111  1 / 0  9 / 62  0 / 0  0 / 0  0 / 0  13 / 49  0 / 0  0 / 0
PlayerPanel.java  22 / 19  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  22 / 19  0 / 0  0 / 0
PropertyCell.java  5 / 16  0 / 0  1 / 4  0 / 0  0 / 0  0 / 0  4 / 12  0 / 0  0 / 0
PropertyCellInfoFormatter.java  0 / 13  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 13  0 / 0  0 / 0
RailRoadCell.java  7 / 7  0 / 0  0 / 3  0 / 0  1 / 0  0 / 0  6 / 4  0 / 0  0 / 0
RealEstateGUI.java  28 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  28 / 0  0 / 0  0 / 0
RespondDialog.java  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0
RRCellInfoFormatter.java  0 / 11  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 11  0 / 0  0 / 0
TestDiceRollDialog.java  1 / 27  0 / 0  0 / 2  0 / 0  0 / 1  0 / 0  1 / 24  0 / 0  0 / 0
TradeDeal.java  0 / 11  0 / 0  0 / 3  0 / 0  0 / 0  0 / 0  0 / 8  0 / 0  0 / 0
TradeDialog.java  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0
UtilCellInfoFormatter.java  0 / 11  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 11  0 / 0  0 / 0
UtilDiceRoll.java  3 / 11  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  3 / 11  0 / 0  0 / 0
UtilityCell.java  7 / 9  0 / 0  1 / 4  0 / 0  0 / 0  0 / 0  6 / 5  0 / 0  0 / 0
Total  214 / 675  2 / 0  18 / 205  4 / 1  4 / 1  2 / 6  182 / 454  2 / 8  0 / 0
system.
These results, which are within the confidence interval calculated from the STREW metrics, provide a preliminary proof of concept for the validity of using
statically detectable faults and the defined Bayesian Belief Network for assessing the
reliability of software.
9.4 JSUnit Program Analysis
JSUnit is an open-source unit testing framework which allows the testing of client-side JavaScript programs. Development began in 2001, and currently there are more than 275 subscribed members and over 10,000 downloads. The tool is developed in Java and is available from SourceForge. Code metrics for JSUnit are provided in Table
9.5 and Table 9.6.
Table 9.5: JSUnit Overview Metrics
Metric Description                            Value
Number of Packages 2
Total Lines of Code 582
Number of Static Methods 19
Number of Classes 19
Number of Attributes 36
Number of Overridden Methods 3
Number of Static Attributes 32
Number of Methods 139
Number of Defined Interfaces 0
Table 9.6: JSUnit Class Metrics
Name NSM NOF NORM NSF NOM LOC DIT*
AllTests.java 2 0 0 0 0 23 1
ArgumentsConfiguration.java 0 5 0 0 7 19 2
ArgumentsConfigurationTest.java 0 0 0 0 3 21 3
Configuration.java 1 0 0 7 13 61 1
ConfigurationException.java 0 2 0 0 3 5 3
ConfigurationTest.java 0 0 0 0 4 9 3
DistributedTest.java 0 1 0 1 3 36 3
DistributedTestTest.java 0 1 0 0 5 19 3
DummyHttpRequest.java 0 1 0 0 51 54 1
EndToEndTestSuite.java 1 0 0 0 0 4 3
EnvironmentVariablesConfiguration.java 0 0 0 0 6 6 2
EnvironmentVariablesConfigurationTest.java 0 1 0 0 4 20 3
JsUnitServer.java 1 7 2 0 24 86 2
JsUnitServlet.java 1 0 0 1 0 1 3
PropertiesConfigurationTest.java 0 0 0 0 4 23 3
PropertiesFileConfiguration.java 0 2 1 1 8 15 2
ResultAcceptorServlet.java 0 0 1 0 1 8 4
ResultAcceptorTest.java 0 2 0 0 10 50 3
ResultDisplayerServlet.java 0 0 1 0 1 15 4
StandaloneTest.java 0 3 0 1 8 48 3
StandaloneTestTest.java 0 0 0 0 3 7 4
Suite.java 1 0 0 0 0 9 3
TestCaseResult.java 2 4 0 3 13 28 1
TestCaseResultBuilder.java 0 0 0 0 3 16 1
TestCaseResultTest.java 0 0 0 0 6 40 3
TestCaseResultWriter.java 0 1 0 6 5 33 1
TestRunnerServlet.java 0 0 1 0 3 13 4
TestSuiteResult.java 4 9 0 0 27 79 1
TestSuiteResultBuilder.java 0 1 0 0 6 37 1
TestSuiteResultTest.java 0 2 0 0 8 33 3
TestSuiteResultWriter.java 0 1 0 12 8 42 1
Utility.java 10 0 0 1 0 34 1
*Depth of Inheritance Tree
Following the procedure applied previously, the reliability of the JSUnit program
was assessed using the STREW metrics, resulting in the data shown in Table 9.7.
Reliability was estimated to range between 0.6478 and 0.9596, with a typical reliability
value of 0.8037.
Table 9.7: JSUnit STREW Metrics Reliability Parameters
Parameter                        Value
Estimated Reliability 0.8037
Confidence Interval 0.1559
Maximum Reliability 0.9596
Minimum Reliability 0.6478
Using the same 8 static analysis tools, the source code was analyzed for statically
detectable faults, as is shown in Table 9.8. A total of 480 statically detectable faults
were discovered, 220 of which were deemed to be valid and of varying risks.
Table 9.8: JSUnit Static Analysis Findings
Filename  Total  SA 1  SA 2  SA 3  SA 4  SA 5  SA 6  SA 7  SA 8
(each entry gives Valid / Invalid counts)
TestRunnerServlet.java  1 / 1  1 / 0  0 / 0  0 / 0  0 / 0  0 / 1  0 / 0  0 / 0  0 / 0
ResultDisplayServlet.java  2 / 2  0 / 0  0 / 0  2 / 0  0 / 0  0 / 2  0 / 0  0 / 0  0 / 0
ResultAcceptorServer.java  0 / 1  0 / 0  0 / 0  0 / 0  0 / 0  0 / 1  0 / 0  0 / 0  0 / 0
JSUnitServlet.java  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0
Utility.java  21 / 16  0 / 0  4 / 0  3 / 0  0 / 0  4 / 0  8 / 16  0 / 0  2 / 0
ArgumentsConfiguration.java  18 / 14  0 / 0  0 / 0  5 / 0  0 / 0  0 / 5  13 / 9  0 / 0  0 / 0
Configuration.java  33 / 17  0 / 0  0 / 0  6 / 0  0 / 0  0 / 5  17 / 10  0 / 2  10 / 0
DistributedTest.java  10 / 28  0 / 0  0 / 0  2 / 0  0 / 0  2 / 5  2 / 23  0 / 0  4 / 0
EnvironmentVariablesConfiguration.java  0 / 1  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 1  0 / 0  0 / 0
TestSuiteResultBuilder.java  12 / 18  0 / 0  0 / 0  4 / 0  0 / 0  1 / 3  5 / 15  0 / 0  2 / 0
PropertiesFileConfiguration.java  6 / 3  0 / 0  0 / 0  0 / 0  0 / 0  1 / 0  3 / 3  0 / 0  2 / 0
TestCaseResultsBuilder.java  5 / 6  0 / 0  0 / 0  2 / 0  0 / 0  0 / 0  3 / 6  0 / 0  0 / 0
StandaloneTest.java  22 / 21  0 / 0  0 / 0  3 / 0  0 / 0  1 / 1  16 / 20  0 / 0  2 / 0
TestCaseResultWriter.java  13 / 19  0 / 0  0 / 0  2 / 0  0 / 0  0 / 6  11 / 13  0 / 0  0 / 0
TestSuiteResult.java  31 / 30  5 / 1  0 / 0  9 / 0  0 / 0  0 / 0  17 / 29  0 / 0  0 / 0
ConfigurationExample.java  1 / 1  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  1 / 1  0 / 0  0 / 0
TestCaseResult.java  13 / 6  0 / 0  0 / 0  2 / 0  0 / 0  0 / 0  11 / 6  0 / 0  0 / 0
TestSuiteResultsWriter.java  6 / 30  0 / 0  0 / 0  1 / 0  0 / 0  0 / 0  5 / 30  0 / 0  0 / 0
JsUnitServer.java  26 / 46  2 / 0  0 / 0  3 / 0  1 / 0  2 / 2  16 / 44  0 / 0  2 / 0
Total  220 / 260  8 / 1  4 / 0  44 / 0  1 / 0  11 / 31  128 / 226  0 / 2  24 / 0
These results were then imported into the SOSART analysis tool which calculated
an anticipated reliability of 0.9082 if each and every method were executed at a uniform normal rate. Adding execution traces to the model reduced the reliability
value to 0.8102. These values are slightly higher than would be expected by the
STREW metrics calculations, but are within the range of acceptable values.
9.5 Jester Program Analysis
Java Jester is an open source tool available from SourceForge which is intended to aid Extreme Programming development by finding code which is inadequately covered during testing. Jester uses a technique referred to as mutation testing to automatically inject errors into the source and determine whether those errors are detected by the developed test cases. Jester is developed in Java and can test Java code. Code
metrics for Jester are provided in Table 9.9 and Table 9.10.
Table 9.9: Jester Overview Metrics
Metric Description                            Value
Number of Packages 1
Total Lines of Code 835
Number of Static Methods 16
Number of Classes 27
Number of Attributes 63
Number of Overridden Methods 8
Number of Static Attributes 15
Number of Methods 135
Number of Defined Interfaces 15
Table 9.10: Jester Class Metrics
Name NSM NOF NORM NSF NOM LOC DIT*
ConfigurationException.java 0 0 0 0 2 2 4
FileBasedClassIterator.java 0 3 0 0 4 35 1
FileBasedClassSourceCodeChanger.java 0 8 0 0 8 45 1
IgnoreList.java 0 1 1 1 3 26 1
IgnoreListDocument.java 0 3 1 1 10 54 1
IgnorePair.java 0 2 3 0 7 8 1
IgnoreRegion.java 0 2 1 0 4 5 1
JesterArgumentException.java 0 0 0 0 1 1 3
MainArguments.java 1 4 0 2 7 46 1
RealClassTestTester.java 0 2 0 0 3 24 1
RealCompiler.java 1 1 0 0 2 11 1
RealConfiguration.java 0 2 0 1 13 31 1
RealLogger.java 0 1 0 0 2 10 1
RealMutationsList.java 0 2 0 1 5 40 1
RealProgressReporter.java 0 2 0 0 5 8 1
RealProgressReporterUI.java 1 2 0 0 3 40 6
RealReport.java 0 10 1 0 20 76 1
RealTestRunner.java 1 2 0 0 2 16 1
RealXMLReportWriter.java 0 1 0 0 2 12 1
ReportItem.java 0 5 1 0 6 54 1
SimpleCodeMangler.java 0 2 0 0 6 18 1
SimpleIntCodeMangler.java 1 0 0 0 4 29 2
SourceChangeException.java 0 0 0 0 2 2 3
TestRunnerImpl.java 3 1 0 6 10 101 1
TestTester.java 3 3 0 3 2 61 1
TwoStringSwappingCodeMangler.java 0 4 0 0 2 18 2
Util.java 5 0 0 0 0 62 1
*Depth of Inheritance Tree
The same procedure as was used for the RealEstate and JSUnit programs was
applied to the Jester program to estimate system reliability, resulting in the data
shown in Table 9.11. Reliability was estimated to range between 0.8046 and 0.9775, with a typical reliability value of 0.8911.
Table 9.11: Jester STREW Metrics Reliability Parameters
Parameter                        Value
Estimated Reliability 0.8911
Confidence Interval 0.0865
Maximum Reliability 0.9775
Minimum Reliability 0.8046
Using the same 8 static analysis tools, the source code was analyzed for statically
detectable faults, as is shown in Table 9.12. A total of 652 statically detectable faults
were discovered, 144 of which were deemed to be valid and of varying risks.
These results were then imported into the SOSART analysis tool which calculated
an anticipated reliability of 0.9024 if each and every method were executed at a uniform normal rate. Execution traces for the program were obtained by running the acceptance test suite included within the source code module, as this test set was deemed to be representative of the desired use case for the program. Adding execution traces to the model increased the reliability value slightly to 0.9067. These values
are slightly higher than would be expected by the STREW metrics calculations, but
are within the range of acceptable values.
Table 9.12: Jester Static Analysis Findings
Filename  Total  SA 1  SA 2  SA 3  SA 4  SA 5  SA 6  SA 7  SA 8
(each entry gives Valid / Invalid counts)
ClassIterator.java  0 / 1  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 1  0 / 0  0 / 0
ClassSourceCodeChanger.java  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0
ClassTestTester.java  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0
CodeMangler.java  0 / 1  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 1  0 / 0  0 / 0
Compiler.java  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0
Configuration.java  0 / 9  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 9  0 / 0  0 / 0
ConfigurationException.java  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0
FileBasedClassIterator.java  2 / 12  0 / 0  2 / 3  0 / 0  0 / 0  0 / 0  0 / 9  0 / 0  0 / 0
FileBasedClassSourceCodeChanger.java  6 / 29  0 / 0  0 / 13  0 / 1  0 / 0  0 / 5  6 / 10  0 / 0  0 / 0
FileExistenceChecker.java  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0
FileVisitor.java  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0
IgnoreList.java  0 / 17  0 / 0  0 / 3  0 / 0  0 / 0  0 / 0  0 / 14  0 / 0  0 / 0
IgnoreListDocument.java  5 / 52  0 / 0  2 / 25  0 / 0  1 / 0  0 / 0  2 / 27  0 / 0  0 / 0
IgnorePair.java  0 / 17  0 / 0  0 / 11  0 / 0  0 / 0  0 / 0  0 / 6  0 / 0  0 / 0
IgnoreRegion.java  0 / 9  0 / 0  0 / 3  0 / 0  0 / 0  0 / 0  0 / 6  0 / 0  0 / 0
JesterArgumentException.java  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0
Logger.java  1 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  1 / 0  0 / 0  0 / 0
MainArguments.java  7 / 35  0 / 0  1 / 15  2 / 1  0 / 1  1 / 0  3 / 18  0 / 0  0 / 0
MutationMarker.java  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0
MutationsList.java  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0
ProgressReporter.java  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0
RealClassTester.java  4 / 7  0 / 0  4 / 2  0 / 0  0 / 0  0 / 0  0 / 5  0 / 0  0 / 0
RealCompiler.java  4 / 6  0 / 0  4 / 1  0 / 0  0 / 0  0 / 1  0 / 4  0 / 0  0 / 0
RealConfiguration.java  6 / 23  1 / 0  1 / 5  0 / 0  1 / 0  0 / 0  3 / 18  0 / 0  0 / 0
RealLogger.java  3 / 7  0 / 0  0 / 0  0 / 0  0 / 0  3 / 1  0 / 6  0 / 0  0 / 0
RealMutationsList.java  7 / 34  0 / 0  1 / 3  0 / 0  0 / 0  0 / 2  6 / 29  0 / 0  0 / 0
RealProgressReporter.java  0 / 10  0 / 0  0 / 9  0 / 0  0 / 0  0 / 0  0 / 1  0 / 0  0 / 0
RealProgressReporterUI.java  15 / 14  0 / 0  0 / 6  0 / 0  0 / 0  0 / 1  15 / 7  0 / 0  0 / 0
RealReport.java  11 / 35  0 / 1  0 / 11  0 / 0  0 / 0  1 / 0  10 / 23  0 / 0  0 / 0
RealTestRunner.java  0 / 7  0 / 0  0 / 0  0 / 0  0 / 0  0 / 2  0 / 5  0 / 0  0 / 0
RealXMLReportWriter.java  3 / 15  0 / 0  2 / 4  0 / 0  0 / 0  0 / 0  1 / 11  0 / 0  0 / 0
Report.java  3 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  3 / 0  0 / 0  0 / 0
SimpleCodeMangler.java  1 / 8  0 / 0  0 / 3  0 / 0  0 / 0  0 / 0  1 / 5  0 / 0  0 / 0
SimpleIntCodeMangler.java  2 / 11  0 / 0  0 / 2  0 / 0  0 / 0  0 / 0  2 / 9  0 / 0  0 / 0
SourceChangeException.java  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0
TestRunner.java  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0
TestRunnerImpl.java  33 / 42  1 / 0  0 / 0  3 / 1  5 / 1  4 / 8  20 / 32  0 / 0  0 / 0
TestTester.java  10 / 54  1 / 2  2 / 2  0 / 0  1 / 0  0 / 10  6 / 40  0 / 0  0 / 0
TwoStringSwappingCodeManager.java  1 / 14  0 / 0  1 / 0  0 / 0  0 / 0  0 / 0  0 / 14  0 / 0  0 / 0
Util.java  20 / 39  0 / 0  4 / 0  0 / 0  0 / 0  9 / 5  7 / 34  0 / 0  0 / 0
XMLReportWriter.java  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0  0 / 0
Total  144 / 508  3 / 3  24 / 121  5 / 3  8 / 2  18 / 35  86 / 344  0 / 0  0 / 0
9.6 Effort Analysis
The previous sections of this chapter have discussed the accuracy of the
reliability model. Without a cost effective mechanism for applying the model,
however, its practical value is difficult to establish. This section therefore
analyzes the effort required to apply the model relative to other means of
ensuring the reliability of reused source modules.
One of the oldest and thus far most effective mechanisms for ensuring the reliabil-
ity of reused components is a complete formal review of the implementation of the
program. For code reviews to achieve their maximum effectiveness, the review rate for
the peer review meeting should be approximately 100 lines of code per hour[Gla79].
Furthermore, effective peer reviews require 3 to 4 meeting attendees. Thus, the ef-
fort required to completely review a source code package can be estimated using the
equation
E_{CR} = \frac{LOC}{RR} \cdot N_R \qquad (9.3)
where

E_{CR} represents the total effort necessary to review the source code package,

LOC represents the count of the lines of code within the package,

RR represents the review rate for the source code package in LOC per unit of time, and

N_R represents the number of reviewers of the source code package.
For the three projects analyzed in this chapter, the effort can be estimated to be
between 1047.6 and 2250 minutes, as is shown in Table 9.13.
Table 9.13: Software Complete Review Effort Estimates

                                          Real Estate    JSUnit    Jester
Lines of Code                                    1250      1033       582
Estimated Effort to Analyze (Minutes)             750     619.8     349.2
Persons Required                                    3         3         3
Net Effort (Man-Minutes)                         2250    1859.4    1047.6
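As a worked instance of Equation 9.3, consider the Real Estate package from Table 9.13, reviewed at the 100 LOC per hour rate cited above by three reviewers; the same arithmetic yields the JSUnit and Jester figures.

E_{CR} = \frac{1250\ \mathrm{LOC}}{100\ \mathrm{LOC/hour}} \cdot 3 = 12.5\ \mathrm{hours} \cdot 3 = 37.5\ \text{man-hours} = 2250\ \text{man-minutes}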
The effort required to undertake the reliability analysis using the SOSART method
was measured during development to allow comparison with the estimated effort for
a complete source code review, resulting in the data shown in Table 9.14. The data
is broken into two portions: the effort required for the static analysis tools to analyze
the source code modules, and the effort required to analyze the reliability of the
program. The second portion includes the effort required to review the faults detected
by the static analysis tools using the SOSART tool as well as the time required to execute
limited module testing. These results clearly indicate that this method is cost effective
relative to a complete code review.
Table 9.14: Software Reliability Modeling Actual Effort

                                            Real Estate    JSUnit    Jester
Static Analysis Execution Time (Minutes)           8.33      5.93      8.27
SOSART Review (Minutes)                          546.03    319.65     218.8
Net Effort (Minutes)                             554.36    325.58    227.07
Effective Review Rate (LOC / Hour)                  135       190       154
Difference (Minutes)                               1695      1533       820
Savings                                            226%      247%      234%
Chapter 10
Model Validation using Tempest
The validation for the software reliability model presented previously relied upon
comparing the reliability calculated using the SOSART tool with the reliability cal-
culated through the STREW metrics method. While these experiments provided
a preliminary proof of concept for the model, a more extensive experiment using a
more appropriate software package was necessary. In this experiment, the
given software package would be operated in an experimental fixture and the reliabil-
ity of the software would be measured directly. These reliability values would then
be compared with the values obtained through the usage of the SOSART tool.
This chapter describes the experiment which was used to validate the reliability
model. The first section describes the Tempest Web Server software which was used
to validate the software reliability model. The second section describes the setup
which was used to evaluate the Tempest software from a reliability standpoint. The
third section of this chapter discusses the results of measuring the reliability of the
Tempest software in an experimental environment. The fourth section of this chapter
discusses the process used to experimentally estimate the reliability of the Tempest
software using the software reliability model and the SoSART tool. The fifth and
final section of this chapter discusses the economic costs which are incurred by
using this methodology.
10.1 Tempest Software
The Tempest web server, developed by members of the NASA Glenn Research
Center Flight Software Engineering Branch, was analyzed using the SOSART
tool. It is an embedded real-time HTTP web server, accepting requests from standard
browsers running on remote clients and returning HTML files. It is capable of serving
Java applets, CORBA content, virtual reality (VRML) content, audio files, video files,
and other material. NASA uses Tempest for the remote control and monitoring of real,
physical systems via inter/intra-nets. The initial version of Tempest was developed for
VxWorks using the C programming language and occupied approximately 34 kB of ROM.
Subsequently, the code has been ported to the Java language and can execute on any
machine capable of running the Sun Java Virtual Machine[YP99].
By its nature, Tempest represents platform technology which is intended to be
integrated into other products, as is shown in Figure 10-1. Tempest has been used for
various space applications and communications applications, distance learning[YB98],
Virtual Interactive Classrooms, and other Real Time Control applications[Dan97].
Future intended uses for Tempest include enabling near real-time communications
between earth-bound scientists and their physical/biological experiments on-board
Figure 10-1: Flow diagram showing the relationship between Tempest, controlled experiments, and the laptop web browsers[YP99].
space vehicles, mitigating astronaut risks resulting from cardiovascular alterations, and
developing new teaching aids that enable students and teachers to
perform experiments. Since being developed, Tempest has received several awards,
notably the Team 2000 FLC Award for Excellence in Technology Transfer, the 1999
Research and Development 100 Award, and the 1998 NASA Software of the Year
Award.
Tempest is implemented using the Java language. While the Java language does
support object oriented implementation and design, the Tempest web server is con-
structed in a more structural manner, as is shown in Table 10.1. Standard class based
metrics are provided in Table 10.2.
As an embedded web server, Tempest allows the user a significant number of
Table 10.1: Tempest Overview Metrics

Metric Description                          Value
Number of Packages 2
Total Lines of Code 2200
Number of Static Methods 6
Number of Classes 14
Number of Attributes 42
Number of Overridden Methods 2
Number of Static Attributes 48
Number of Methods 51
Number of Defined Interfaces 0
Table 10.2: Tempest Class Metrics

Name                      Static    Attributes  Overridden  Weighted Methods  Static      Number of  Lines of  Depth of
                          Methods               Methods     per Class         Attributes  Methods    Code      Inheritance Tree
ContentTag.java           0         1           0           41                0           2          211       1
DummyLogger.java          1         0           0           1                 0           0          0         1
GetString.java            0         0           0           2                 0           1          16        2
HeadString.java           0         0           0           1                 0           1          5         2
HTTPFile.java             0         12          0           33                0           6          145       1
HTTPString.java           1         21          0           155               32          12         797       1
Logger.java               0         3           1           14                0           3          74        2
MessageBuffer.java        0         1           0           5                 0           2          21        1
NotFoundException.java    0         1           1           3                 0           3          7         3
ObjectTag.java            0         1           0           50                0           4          300       1
PostString.java           0         0           0           2                 0           1          12        2
RuntimeFlags.java         0         0           0           8                 4           8          16        1
SomeClass.java            0         1           0           6                 0           4          27        1
Tempest.java              2         1           0           97                12          4          552       1
TimeRFC1123.java          2         0           0           2                 0           0          17        1
configuration parameters which can be passed to the software on the command line
when an execution instance is started. Command line options are defined in Table
10.3.
10.2 Evaluating Tempest
To evaluate the reliability of the Tempest software and the accuracy of the pro-
posed software reliability model, an experiment was constructed using the Tempest
software and the Java Web Tester software package. In essence, one machine was
configured as a Tempest web server and was given a test web site to serve. A second
machine was configured to use the Java Web Tester software package. This tool al-
Table 10.3: Tempest Configuration Parameters

port number: This option controls which port Tempest listens on for client requests. The standard HTTP port is 80, the default. UNIX users note that you must have root privileges to open a server on port 80 (or any server socket below port 1024). If you open Tempest on, say, port 9900, the browser URL will be appended with :9900 (e.g. http://www.somesite.com:9900/index.html).

auth or noauth: These options control the client and user checks. These checks are performed against the contents of the CLIENTS.SYS file and the USERS.SYS file. With "noauth", anyone can access the server from any machine. With "auth" the user will be challenged for a valid ID and password (which appear in the USERS.SYS file) and the user must be accessing Tempest from a client computer allowed in the CLIENTS.SYS file.

log or nolog: These options control log file creation. The log file consists of a series of messages showing who accessed the server, when, and from where. Be sure that you have a Tempest/LOG directory even if there are no files in it. An example message is: (Tempest/LOG/TempestLog.0) User adam access from 24.55.242.190 at Saturday, 30 March 2002 05:46:35 EST. Note that the log file is not written until Tempest is stopped.

debug or nodebug: These options control the display of debugging messages from Tempest in the window (e.g. DOS window) used to start Tempest.

persist or nopersist: These options control the type of connection established between Tempest and the client. Normally, server-client connections are stateless, non-persistent connections after Tempest delivers the information requested by the client. In persistent connections, Tempest maintains the connection with a client after the client's request is satisfied. Persistent connections are selected to reduce client-server transaction times when the transactions are many. The disadvantage of persistent connections is that ports are consumed and not available to other clients. The timeout for a persistent connection is 6 seconds. The use of a persistent connection also depends on browser support for this feature.
lows the user to configure a set of web sites which are to be periodically monitored
for their reliability.
10.2.1 Java Web Tester
The Java Web Tester software was previously developed for research into the re-
liability of web servers, as is detailed in Schilling and Alam[SA07a]. This tool was
developed in the Java Programming language and allows the user to verify connec-
tivity with a set of existing websites.
The tool consists of three tools bundled into a single jar file. The first tool, a
GUI based tool, is used to configure the website tester and can be used for short
duration tests. The GUI allows the user to configure the remote site which is to be
used as a test site, the port used to connect to the remote site, and the test rate. The test
rate determines how often the remote site is polled for a connection and subsequently
downloaded. Test rates can range between 1 and 3600 seconds. The tool also allows
the remote server to be pinged before an attempt is made to open the HTTP connection.
Figure 10-2: Web tester GUI Tool
The web testing tool also allows the user to compare the downloaded file with a previously
downloaded file. The intent of this is to detect downloads in which the connection
is successful yet the actual material downloaded is corrupted. When comparing files,
an entry can be flagged either when the downloaded file is identical to the previously
downloaded file or when it differs from that file, depending upon how the
tool is configured.
The web site testing tool is not limited in the number of test sites that can
be tested. In testing the tool, up to 100 sites were tested simultaneously without
performance degradation. When enabled to run, each test operates as its own Java
Thread, thus preventing the behavior of one remote site from affecting other sites
being monitored.
The second portion of the web tester tool is a command line tool which allows
previously developed test configurations to be executed from a command shell.
This allows web testing to run in the background, either via a UNIX script or a
cron job, without requiring a visible graphical user interface. During extended
duration tests, this is the preferred method to use.

The third portion of the tool, also a command line tool, post-processes the results
from the other segments of the web testing tool and creates summary data reports
on website connectivity.
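For illustration only, the following minimal Java sketch shows the general pattern described above: each monitored site is polled from its own thread at a fixed rate, the page body is downloaded, and successes and failures are logged. The class name, URL, and timing values here are hypothetical and are not taken from the actual Java Web Tester source.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical illustration of a periodic connectivity test in the spirit of
// the Java Web Tester: one thread per monitored site, polling at a fixed rate.
public class SiteMonitor implements Runnable {
    private final String siteUrl;      // site to poll
    private final long periodSeconds;  // test rate, e.g. 1 to 3600 seconds

    public SiteMonitor(String siteUrl, long periodSeconds) {
        this.siteUrl = siteUrl;
        this.periodSeconds = periodSeconds;
    }

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                HttpURLConnection conn =
                        (HttpURLConnection) new URL(siteUrl).openConnection();
                conn.setConnectTimeout(10000);
                int code = conn.getResponseCode();
                try (InputStream in = conn.getInputStream()) {
                    while (in.read() != -1) { /* download and discard the body */ }
                }
                System.out.println(siteUrl + " OK, HTTP " + code);
            } catch (Exception e) {
                System.out.println(siteUrl + " FAILED: " + e.getMessage());
            }
            try {
                Thread.sleep(periodSeconds * 1000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop monitoring this site
            }
        }
    }

    public static void main(String[] args) {
        // One thread per monitored site, as in the tool described above.
        new Thread(new SiteMonitor("http://192.168.1.10:9000/index.html", 60)).start();
    }
}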
10.2.2 Initial Experimental Setup
The experiment began with setting up the Tempest software to serve a set of
test web pages in the University of Toledo OCARNet lab. For the purposes of this
experiment, two machines were isolated from the rest of the lab (and the rest of the
University of Toledo domain) through a standard commercial firewall and hub setup.
Two Linux workstations were set up in the OCARNet lab using the topology shown
in Figure 10-3. One machine executed the Tempest web server software,
serving a sample web site. The second machine automatically polled the web
server once a minute and downloaded a series of web pages from the sample web
site. The successes and failures were logged and stored.
In the first instance of testing, this setup executed continuously without software
failure for 2 calendar months. However, this setup was flawed in several minor ways.

Figure 10-3: OCARNet Lab topology.

First, the Tempest software was only operated using one set of configuration
parameters. Second, and more importantly, the testing used very low bandwidth
utilization and did not stress the software from a performance perspective.
As part of the general web reliability study conducted by Schilling and Alam[SA07a],
a second test of the reliability of the software was conducted using a similar setup.
In this study, the web server operated using two different sets of configuration pa-
rameters. However, the results related to the Tempest web server ended up being
flawed in that the computer used was inappropriately configured for the experiment
and suffered from significant performance problems which unfortunately influenced
the testing results1.
A third reliability study, using the OCARNet equipment and a new Windows XP
machine, was conducted. In this case, four different instances of the Tempest web
server executed simultaneously on one machine. Each of the four test instances ran
with a different set of user configurable parameters, representing four different use cases
for the Tempest web server. This test executed for one week before being abandoned
due to performance problems with the machine and due to required anti-virus software
and Windows Update features which could not be circumvented.
10.2.3 Final Experiment Setup
The final experimental setup used two Linux machines running in an independent
environment away from the OCARNet Lab. The experimental setup began by creat-
ing the network topology shown in Figure 10-4. In this topology, two Linux machines
were separated from the rest of the network by a router, effectively isolating them
from all traffic except for the web traffic between machines. One of the machines
served a set of web pages over the network. The second machine executed the Java
Web tester software.
Table 10.4: Tempest Test Instance Configurations
Test Instance   TCP/IP Port   Command Line Options
1               9000          noauth nolog nodebug nopersist
2               9001          noauth nolog nodebug persist
3               9002          noauth nolog debug nopersist
4               9003          noauth log nodebug nopersist
1 While the Schilling and Alam[SA07a] article does include a Tempest Web server within its results, this is a separate machine. All data from the flawed experiment was removed before the analysis of results was presented in that paper.
Figure 10-4: Network topology for test setup.
The machine executing the Tempest web server actually executed four different
instances of the web server in four different Unix processes. Each instance ran a dif-
ferent use case, manifested through the usage of different command line parameters,
as is shown in Table 10.4. While running multiple instances of Tempest simulta-
neously had been impossible under the Windows XP operating environment due to
resource constraints, the combination of a Dual Core microprocessor and the usage
of the Ubuntu Linux Operating System allowed for all four instances of Tempest to
run simultaneously without resource conflict or other performance problems.
Source Code Modifications
The experiment began by adding one class to the source code package, namely the Dum-
myLogger class, which is provided in Figure 10-5. This class provides a single
package gov.nasa.grc.ewt.server;

public class DummyLogger {
    public static void LogAccess()
    {
        // This routine does nothing, as we do not log anything.
    }
}
Figure 10-5: DummyLogger.java class.
static method which can be called by any class within the Tempest project. This is
necessary due to a limitation of the JDI interface used by the SoSART tool. Under
certain circumstances, there will be method paths that have code blocks either
optimized out of the final Java byte code or conditionals which do not contain ex-
ecutable code. In order for path tracing to function in a reliable fashion, each
code segment must contain at least one executable line on which a watchpoint
can be placed. To facilitate this, each code block, as parsed by
the SoSART tool, was appended with a call to the DummyLogger.LogAccess static
method. While this technically modifies the source code, the insertion was deemed not
to significantly change the behavior of the system, yet it did allow more accurate anal-
ysis with the SoSART tool and the JDI interface upon which it relies. An example
of changed code is given in Figure 10-6.
Once the code modification was completed and appropriately archived into the
local configuration management system, a clean build of the source code from the
archive occurred. In this operation, all existing class files and generated modules
were removed and rebuilt by the javac compiler. This ensured that any and all
remnants of the previous structure were removed.
package gov.nasa.grc.ewt.server;

class NotFoundException extends Exception {
    String message;

    public NotFoundException() {
        super();
**      DummyLogger.LogAccess();
    }

    public NotFoundException(String s) {
        super(s);
**      DummyLogger.LogAccess();
        message = new String(s);
    }

    public String toString() {
**      DummyLogger.LogAccess();
        return message;
    }
}
Figure 10-6: Modified NotFoundException.java class, showing lines added to call the DummyLogger routine.
The code was then imported into the SoSART analysis tool. This importation was
used to generate a set of tracepoints which would be used to log the program execution
profile under each of the four use cases that would be tested. The tracepoints were
saved to a text file.
10.2.4 Measured Reliability
The net goal for this series of experiments was to experimentally estimate the
reliability of the Tempest Web Server under four different use cases. In order to
do this, four instances of the Tempest Web Server were configured to
serve the same material. Each instance ran a different configuration. Over a 25 hour
period, the machines were tested for operation, and the number of failures and the
mean time between failures were recorded for each test case. This resulted in the data
provided in Table 10.5. Because the first three use cases did not fail during the
initial testing period, the test was subsequently extended to 168 hours. However,
the result remained substantially unchanged after 168 hours, as the first three use
cases still did not fail under test conditions.
Table 10.5: Tempest Field Measured Reliabilities
Configuration   Operational Uptime (hours)   Number of Failures   MTBF (hours)
1               25.0                         0                    *
2               25.0                         0                    *
3               25.0                         0                    *
4               14.3                         3                    4.77
* Can not be calculated, as no failures occurred.
Assuming an exponential probability density function, for the fourth use case, the
failure rate λ can be estimated through the equation
MTBF = \frac{1}{\lambda} \qquad (10.1)
which results in a λ of 0.2096. This can be translated into a reliability value of 0.8109
at one hour of program execution.
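Concretely, using the measured MTBF of 4.77 hours for the fourth configuration and the exponential reliability function R(t) = e^{-\lambda t}:

\lambda = \frac{1}{MTBF} = \frac{1}{4.77\ \mathrm{hours}} \approx 0.2096\ \mathrm{hour}^{-1}, \qquad R(1\ \mathrm{hour}) = e^{-0.2096} \approx 0.8109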
For the other examples, we must estimate the reliability based upon the fact that
there was no failure in the system. Using the relationship described in Hamlet and
Voas[HV93], the reliability for the first three use cases can be estimated to be 0.9840
at a 90% confidence level.
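For reference, a standard form of this zero-failure estimation is given below; the specific number of failure-free executions used to arrive at the 0.9840 figure is not restated here.

C = 1 - R^{N} \quad \Longrightarrow \quad R \geq (1 - C)^{1/N}

where N is the number of failure-free test executions and C is the desired confidence level.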
Using this reliability as a calculation basis, and assuming an exponential proba-
bility density function, the MTBF for the software can be estimated to be 62.5 hours,
resulting in a 33% probability of a failure occurring by the 25 hour experimental cutoff.
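The quoted failure probability follows directly from the exponential model and the 62.5 hour MTBF:

P(\text{failure by } 25\ \mathrm{hours}) = 1 - e^{-25/62.5} = 1 - e^{-0.4} \approx 0.33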
10.2.5 Analyzing the Source Code for Statically Detectable Faults
The initial intent when analyzing the Tempest source code was to start by enabling
all static analysis rules on all tools. Thus, every potential rule violation would be
output, and every warning could be pulled into the SoSART tool. This would
allow SoSART to have a complete picture of the occurrence rate for each rule, and
all tools could be correlated in the optimal fashion.
To allow for deterministic repeatability, the analysis process was automated using
an Apache Ant build script which automatically invoked each of the 10 static analysis
tools.
Table 10.6: Tempest Rule Violation Count with All Rules Enabled

Tool                      1     2     3     4     5      6     7      8     9    10
ContentTag.java           0     1    33     2    25    334    14    243    14    13
DummyLogger.java          0     1     2     1     0     12     0      5     0     1
GetString.java            7     3     3     2     3     52     1     25     0     1
HTTPFile.java             0     9    32     4    28    263    33    189    14    15
HTTPString.java           4    31    79     9   134   1464    85   1074    32    38
HeadString.java           5     1     3     2     0     30     1     13     0     1
Logger.java               0    10     8     2    13    130     6     97     4     3
MessageBuffer.java        0     5    12     0     0     63     3     38     0     1
NotFoundException.java    1     3     2     1     0     27     1     16     0     1
ObjectTag.java            3     1    25     4    79    514    23    355    22    19
PostString.java           8     4     3     2     0     37     2     20     6     2
RuntimeFlags.java         4     1     9     4     0     65     4     41     0     0
Tempest.java             40    10    60     7   145    873    34    656    44    21
TimeRFC1123.java          0     1    11     1     0     48     1     25     2     1
SomeClass.java            0     4     9     3     2     72     4     45     0     4
Total                    72    85   291    44   429   3984   212   2842   138   121
Using this approach, however, had one significant drawback. Because all of the
tools had all rules enabled, a significant number of violations were
flagged, as is shown in Table 10.6. The 8218 warnings flagged by the analysis tools
unfortunately overwhelmed the internal SoSART database engine, and this complete
Table 10.7: Tempest Rule Violation Densities with All Rules Enabled

File                      LOC      1      2      3      4      5      6      7      8      9     10
ContentTag.java           300   0.000  0.003  0.110  0.006  0.083  1.11   0.046  0.810  0.046  0.043
DummyLogger.java            1   0.000  1.000  2.000  1.000  0.000  12.0   0.000  5.00   0.000  1.000
GetString.java             16   0.437  0.187  0.187  0.125  0.187  3.25   0.062  1.56   0.000  0.062
HTTPFile.java             145   0.000  0.062  0.220  0.027  0.193  1.81   0.227  1.30   0.096  0.103
HTTPString.java           797   0.005  0.038  0.099  0.011  0.168  1.83   0.106  1.34   0.040  0.047
HeadString.java             5   1.000  0.200  0.600  0.400  0.000  6.00   0.200  2.60   0.000  0.200
Logger.java                74   0.000  0.135  0.108  0.027  0.175  1.75   0.081  1.31   0.054  0.040
MessageBuffer.java         21   0.000  0.238  0.571  0.000  0.000  3.00   0.142  1.80   0.000  0.047
NotFoundException.java      7   0.142  0.428  0.285  0.142  0.000  3.85   0.142  2.28   0.000  0.142
ObjectTag.java            300   0.010  0.003  0.083  0.013  0.263  1.71   0.076  1.18   0.073  0.063
PostString.java            12   0.666  0.333  0.250  0.166  0.000  3.08   0.166  1.66   0.500  0.166
RuntimeFlags.java          16   0.250  0.062  0.562  0.250  0.000  4.06   0.250  2.56   0.000  0.000
Tempest.java              552   0.072  0.018  0.108  0.012  0.262  1.58   0.061  1.18   0.079  0.038
TimeRFC1123.java           17   0.000  0.058  0.647  0.058  0.000  2.82   0.058  1.47   0.117  0.058
SomeClass.java             27   0.000  0.148  0.333  0.111  0.074  2.66   0.148  1.66   0.000  0.148
Total                    1989   0.036  0.042  0.146  0.022  0.215  2.00   0.106  1.42   0.069  0.060
analysis could not occur. On average, 4.1317 warnings were issued for each line of
code present within the software, as is shown in Table 10.7.
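This density follows directly from the totals of Tables 10.6 and 10.7:

\frac{8218\ \text{warnings}}{1989\ \text{lines of code}} \approx 4.13\ \text{warnings per line of code}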
Table 10.8: Static Analysis Rule Configuration Metrics

Tool                    Overall      1      2      3      4    5      6      7      8     9     10
Total Rules Detected       1094     85     21    207    276    *    176     90     39    19    181
Rules Disabled              620      2      0    156    252    *     73     55     14     0     68
Rules Enabled               484     83     21     51     24    *    103     45     25    19    113
Percent Disabled          56.7%   2.4%   0.0%  75.4%  91.3%    *  41.5%  61.1%  35.9%  0.0%  37.6%

* For this tool, it was impossible to obtain the metrics, as the documentation did not provide a complete warning listing.
To avoid this problem, it was necessary to configure each tool independently in or-
der to filter out those warnings which would not be capable of causing a direct system
failure. The exercise was conducted using the methodology described in Schilling and
Alam[SA06c]. All in all, once all ten of the tools were properly configured, 56.7% of
the rules had been disabled as being either stylistic in nature or otherwise represent-
ing faults which would not result in a system failure based upon the characteristics
of the detected fault. As is shown in Table 10.8, the percentage of rules disabled was
different for each tool, ranging from 0.0% to 91.3%.
Once each of the tools had been properly configured and inappropriate warnings
had been removed from analysis, the static analysis tools were re-executed using the
newly created configuration profiles, resulting in a total of 1867 warnings being issued
by the tools, as is shown in Table 10.9.
Table 10.9: Tempest Rule Violation Count with Configured Rulesets

Tool                      1     2     3     4     5      6     7     8     9    10
ContentTag.java           0     1     4     0    25     85     2     2    14     4
DummyLogger.java          0     1     0     0     0      1     0     0     0     0
GetString.java            2     3     0     1     3     23     0     0     0     0
HTTPFile.java             0     9     6     0    28     63    13     1    14     4
HTTPString.java           3    31    12     0   134    374     0     7    32     5
HeadString.java           1     1     0     1     0     12     0     0     0     0
Logger.java               0    10     0     1    13     26     1     1     4     3
MessageBuffer.java        0     5     0     0     0      8     1     1     0     1
NotFoundException.java    1     3     0     0     0      4     0     0     0     0
ObjectTag.java            3     1    10     1    79    166     0     5    22     2
PostString.java           3     4     0     1     0     14     0     0     6     0
RuntimeFlags.java         4     1     0     4     0     13     4     0     0     0
Tempest.java             39    10     9     5   145    199     5     6    44     8
TimeRFC1123.java          0     1     1     0     0     12     1     1     2     1
SomeClass.java            0     4     2     0     2     20     0     1     0     1
Total                    56    85    44    14   429   1020    27    25   138    29
10.2.6 SOSART Reliability Assessment
Once the statically detectable faults had been detected by the static analysis tools,
these outputs were analyzed using the SOSART tool and assessed for their validity.
Of the 1967 warnings detected, 456, or 23.1%, were deemed to be valid faults which
had the potential of causing some form of systemic operational degradation which
might result in a software failure.
Table 10.10: Tempest Estimated Reliabilities using SOSART

Option Set                            SOSART Estimated Reliability
1  noauth nolog nodebug nopersist     0.9898
2  noauth nolog nodebug persist       0.9898
3  noauth nolog debug nopersist       0.9898
4  noauth log nodebug nopersist       0.9757
Using the SOSART tool, and assigning all execution paths the likelihood of “Nor-
mal” for their execution rate, the base reliability was estimated
to be 0.8427. Using the program execution profiles generated by limited evaluations of
the usage options provided in Table 10.4, a set of estimated reliabilities was obtained,
as is shown in Table 10.10. It is important to note that the first three use cases result
in exactly the same reliability estimation. This is because the exe-
cution profiles upon which the estimates are based are virtually identical. Execution
profiles for use cases 1 and 2 differ by only 9 execution points out of a total of 234
execution points, and do not include any different method invocations. Furthermore,
each of these differences can be attributed to a single logical change within a method,
in that in the first profile one branch is taken for a given decision while in the second
profile a different branch of execution occurs. When making the same comparison
between the first execution profile and the third execution profile, there is a net total
of 41 branch locations which differ out of a total of 301. However, the exact
same methods are invoked as in the first two profiles.
In the fourth case, which has a lower reliability score, there are 69 execution point
locations which differ between the first and the fourth execution profiles. More
importantly, the fourth execution profile includes six method invocations in
classes which are not used at all in the first two execution profiles. Thus, it can
clearly be justified that the first three execution profiles, given the granularity of
measurement for this experiment, will have identical reliability values, while the fourth
use case will have a different reliability estimation due to the difference in execution
profiles.
Once all four reliabilities had been calculated, a comparison between the two mech-
anisms could be made. For the first three use cases, the reliability estimated by
SOSART and the reliability calculated from field testing were quite similar,
with values of 0.9898 and 0.9840 respectively, representing a 0.5% difference. For the
fourth use case, SOSART estimated a reliability of 0.9757 while the actual
measured reliability was 0.8109, a difference of 16.89%.
Chapter 11
Conclusions and Future Directions
11.1 Conclusions
The problem of software reliability is vast and ever growing. As more and more
complex electronic devices rely further upon software for fundamental functionality,
the impact of software failure becomes greater. Market forces, however, have made
it more difficult to measure software reliability through traditional means. The reuse
of previously developed components, the emergence of open source software, and the
purchase of externally developed software have made delivering a reliable final product more
difficult.
The first chapter of this dissertation emphasized the need to investigate software
reliability. This need is urgent as the cost of failure to the American economy is
estimated to be $59.5 billion annually and is certain to grow, as the quantity of
embedded software in everyday items doubles every 18 months. Software reliability
problems have surpassed hardware problems as the principal source of system failure by at
least a factor of ten.
To provide justification for this study, it was important to analyze past failures. Post-
mortem analysis of failure has been a common mechanism in other engineering fields,
yet it has generally been lacking in the area of software engineering. Because of this
failure to obtain a historical perspective, there are many failure modes which
repeatedly recur in different products. To this end, numerous case studies of failure
were presented to introduce the subject. Part of this presentation included whether
software static analysis tools would have been capable of detecting the fault and thus
preventing the failure. It was found that in a significant number of cases the fault
that ultimately led to the failure of the system was statically detectable.
This analysis led to the conceptual proposal to establish a relationship between
software reliability and statically detectable faults. While static analysis cannot
guarantee completeness in its analysis, it has been shown to be quite effective at detecting
faults.
It is with this background that an approach to modeling software reliability was
presented which targets the estimation of reliability for existing software. Traditional
software reliability models require significant data collection during development and
testing, including the operational time between failures, the severity of the failures,
code coverage during testing, and other metrics. In the case of COTS software pur-
chases or open source code, this development data is often not readily available,
leaving a software engineer little information to make an informed decision regarding
the reliability of a software module. This reliability model does not suffer from this
limitation as it only requires black box testing and static analysis of the source code
to estimate reliability. The reliability is calculated through a Bayesian Belief Network
incorporating the path coverage obtained during limited testing, the structure of the
source code, and results from multiple static analysis tools combined using a meta
tool.
Next, it was necessary to establish that static analysis tools can effectively find
faults within a Java program. This was established through the development of a
validation suite which proved that static analysis tools can be effective at finding
faults seeded within a validation suite. Overall, ten different analysis tools were used
to find 50 seeded faults, and 82% of the faults were detectable by one or more tools.
More importantly, 44% of the faults were detected by two or more static analysis
tools, indicating that multiple tools may aid in the reduction of false positives from
static analysis executions.
The static analysis tool effectiveness experiment also emphasized the importance
of proper tool configuration. In this case, the number of valid warnings was dwarfed
by the number of false positives detected which were incapable of causing a software
failure. However, it was also found that these false positives were often limited to a
small number of rules which could easily be disabled.
In order for this reliability model to be applied to software, a reliability toolkit was
necessary. Therefore, a Software Static Analysis Reliability Toolkit (SOSART) was
constructed to allow the user to apply the reliability model to non-trivial projects.
An overview of the requirements for this tool as well as the tool usage was provided.
Proof of concept validation for the reliability model was presented in two exper-
iments. In the first experiment, the results of the SOSART reliability model were
compared with the results from the STREW metrics reliability model. In all cases,
the SOSART estimates were determined to be within the confidence interval for the
accuracy of the STREW metrics model, and typically less than 2% away from the es-
timate of the STREW metrics. In the second experiment, the results of applying the
SOSART method to an existing software project were presented. In this experiment,
an existing program was assessed for its operational reliability on the basis of four dif-
ferent sets of configuration parameters. Three of the parameter sets were found to have
identical reliabilities while the fourth was found to have a lower reliability value. This
was both predicted by the SOSART tool and validated by the experimental results.
While the SOSART results exhibit slightly larger error than would be desired, this
can be explained by the highly non-linear nature of software execution.
These experiments provided the required proof of concept for the validity of this
reliability model. From a conceptual standpoint, it is possible to provide a high
level estimate of software reliability from static analysis tool execution coupled with
execution profiling. Furthermore, through an effort assessment, the method is shown
to be cost effective in that the effort required to apply this method is less than what
would be expected for a complete code review of source code.
11.2 Future Directions
This dissertation has put forth a proof of concept experiment that static analysis
can be used to estimate the reliability of a given software package. However, while
this work has been successful as a proof of concept, there are many areas which need
to be further investigated.
First, in terms of the Bayesian Belief Network, we realize that our network is
highly simplified and may be improved in accuracy through additional parameters
and additional resolution of the given parameters.
While we considered clustering effects at the method and file level, it is known
that clustering occurs at the package and project levels as well. What we do not
know is the relationship between clustering at the various levels. For example, is
clustering at the method level more indicative of a valid cluster versus clustering at
the package level? How does clustering change as a module undergoes revision by
multiple development teams?
The impact of independent validation also needs further assessment. While our
results indicated that multiple tools often did detect the same faults, we also saw
multiple false positives being detected. Is the assumption that was made re-
garding the independence of algorithms truly valid? While we would like to believe
that independently developed commercial tools should be independent, the research of
Knight and Leveson [KL86] [KL90] indicates that n-version programming may not result
in as much independence as would be anticipated, for while the versions are developed
independently, the algorithms are not as diverse as would be expected. This
concern needs to be addressed for the algorithms and implementations
used in static analysis tools.
Our Bayesian Belief Network also simplifies the concept of maintenance risk. It
is known that certain code constructs, such as missing braces for if constructs, often
result in implementation failures as a module is maintained. What we do not know
is, qualitatively, the risk that this poses over time. Our model simply indicates that a
coded construct either is or is not a maintenance risk. Yet this parameter more
than likely has some degree of variability associated with it and should be modeled
in the same manner as the Immediate Failure Risk. Clearly, additional research using
lessons learned databases and difference analysis of existing program modules across
revisions, as field defects are fixed, may be capable of addressing this issue.
Similar areas for research exist on the code coverage side of the Bayesian Belief
Network. We know that at a high level the assumptions used to generate this model
are valid, but in many cases, the exact relationship has not been definitively shown
through empirical study.
The SOSART tool clearly needs additional performance tuning and development.
Due to technical limitations, it was incapable of analyzing programs much larger than
approximately 2 KLOC. While that was acceptable for a proof of concept application
in which smaller programs were used, it is imperative that the tool be capable of
efficiently and reliably analyzing larger programs as well. It may be advisable for
the GUI used for fault analysis to be separated from the mathematical model used to
calculate reliability, thus saving memory and allowing for distributed analysis.
The model also needs significant analysis in terms of its granularity. While in
this experiment the software programs typically exhibited what would be considered
relatively low reliabilities, the tool itself suffered from numerical granularity problems
in calculation which seemingly preclude its use with higher reliability modules.
This may be an effect of the network itself, or it may be an effect of the definitions
placed upon the model by the application.
Further experimental validation is also necessary in order to determine the rela-
tionships present in the model. While each of the software packages assessed in this
dissertation has used the same network coefficients, it is highly probable that the
relationships between nodes are not necessarily the same for all developed software.
The work of Nagappan et al.[NBZ06] indicates that there is no single set of predictive
metrics which can be used to estimate field failure rates. We believe that these con-
clusions also apply to our model, and that there may be multiple relationships which
are specific to the project domain or project family. While our model is targeted at
Embedded Systems, the bulk of the validation occurred with non-embedded appli-
cation programs. It is expected that with proper assessment in different domains as
well as the appropriate calibration using the built in calibration parameters for the
network, more accurate results can be obtained.
Another area of research is to look at the impact of statically detectable faults on
subsequent software releases. As has been discussed in Schilling and Alam[SA06c], it
is often a risk management issue which decides whether known statically detectable
faults are removed between releases of a given project. It may be possible to relate the
change in software reliability between releases with the change in statically detectable
faults given the appropriate analysis.
Lastly, for static analysis (or any other software engineering method) to be com-
mercially acceptable, it must be cost effective. While this research looked at cost in
terms of time, this is not the complete picture. There are direct monetary costs asso-
ciated with static analysis unrelated to effort, including licensing fees, tool configu-
ration, training exercises, and others. In order for this method to be viable in the
commercial software engineering environment, it must be applicable to the needs
of practicing software engineers. This has been one of the goals of this research. The
proof of concept model presented here appears promising, although it has only been
evaluated in the academic arena.
Bibliography
[AB01] C. Artho and A. Biere. Applying Static Analysis to Large-Scale, Multi-
threaded Java Programs. In Proceedings of the 13th Australian Software
Engineering Conference, pages 68–75, Canberra, Australia, 2001. IEEE
Computer Society Press.
[Ada84] Edward N. Adams. Optimizing preventive service of software products.
IBM J. Research and Development, 28(1):2–14, January 1984.
[Agu02] Joy M. Agustin. JBlanket: Support for Extreme Coverage in Java Unit
Testing. Technical Report 02-08, University of Hawaii at Manoa, 2002.
[AH04] Cyrille Artho and Klaus Havelund. Applying JLint to space exploration
software. In VMCAI, pages 297–308, 2004.
[All02] Eric Allan. Bug Patterns in Java. Apress, September 2002. ISBN: 1-
59059-061-9.
[Ana04] Charles River Analytics. About Bayesian Belief Networks. Charles River
Analytics, Inc., 625 Mount Auburn Street Cambridge, MA 02138, 2004.
[And96] Tom Anderson. Ariane 501. E-mail on safety critical mailing list., July
1996.
[Arn00] Douglas N. Arnold. The Patriot Missile Failure. Website, August 2000.
[Art01] C. Artho. Finding Faults in Multi-Threaded Programs. Master’s thesis,
Federal Institute of Technology, 2001.
[AT01] Paul Anderson and Tim Teitelbaum. Software inspection using
CodeSurfer . In WISE01: Proceedings of the First Workshop on In-
spection in Software Engineering, July 2001.
[BDG+04] Guillaume Brat, Doron Drusinsky, Dimitra Giannakopoulou, Allen Gold-
berg, Klaus Havelund, Mike Lowry, Corina Pasareanu, Arnaud Venet,
Willem Visser, and Rich Washington. Experimental evaluation of verifi-
cation and validation tools on Martian Rover software. Form. Methods
Syst. Des., 25(2-3):167–198, 2004.
[Ben06] David Benson. JGraph and JGraph Layout Pro User Manual, December
2006.
[BK03] Guillaume Brat and Roger Klemm. Static analysis of the Mars explo-
ration rover flight software. In Proceedings of the First International
Space Mission Challenges for Information Technology, pages 321–326,
2003.
[BL06] Steve Barriault and Marc Lalo. Tutorial: How to statically ensure soft-
ware reliability. Embedded Systems Design, 19(5), 2006.
[Blo01] Joshua Bloch. Effective Java programming Language Guide. Sun Mi-
crosystems, Inc., Mountain View, CA, USA, 2001.
[Bol02] Phillip J. Boland. Challenges in software reliability and testing. In Third
International Conference on Mathematical Methods in Reliability Method-
ology and Practice, 2002.
[BPS00] William R. Bush, Jonathan D. Pincus, and David J. Sielaff. A static
analyzer for finding dynamic programming errors. Software Practice and
Experience, 30(7):775–802, 2000.
[BR02] Thomas Ball and Sriram K. Rajamani. The SLAM project: debugging
system software via static analysis. In POPL, pages 1–3, 2002.
[Bro04] Matthew Broersma. Microsoft server crash nearly causes 800-plane pile
up. Techworld, 2004.
[BV03] Guillaume Brat and Arnaud Venet. Static program analysis using Ab-
stract Interpretation. Unpublished tutorial, Proceedings ASE’03: 18th
IEEE International Conference on Automated Software Engineering, Oc-
tober 6-10 2003.
[Car92] Ralph V. Carlone. GAO report: Patriot missile defense - software problem
led to system failure at Dhahran, Saudi Arabia. Technical report, General
Accounting Office, February 1992. GAO/IMTEC-92-26.
[Cen96] Reliability Analysis Center. Introduction to Software Reliability: A state
of the Art Review. Reliability Analysis Center (RAC), 1996.
[Che80] Roger C. Cheung. A user-oriented software reliability model. IEEE
Transactions on Software Engineering, 6(2):118–125, March 1980.
[CM04] Brian Chess and Gary McGraw. Static analysis for security. IEEE Secu-
rity & Privacy, 2(6):76–79, November-December 2004.
[Cor05] Steve Cornett. Code coverage analysis. Website, December 2005.
http://www.bullseye.com/coverage.
[Coz99] Fabio Gagliardi Cozman. Embedded Bayesian Networks.
http://www.cs.cmu.edu/˜javabayes/EBayes/Doc/, 1999.
[Coz01] Fabio G. Cozman. The JavaBayes System. ISBA Bulletin, 7(4):16–21,
2001.
[Cre05] Jack W. Crenshaw. Time to re-evaluate Windows CE. Embedded Systems
Programming, 18(2):9–14, 2005.
[Dan97] Carl Daniele. Embedded web technology: Internet technology applied to
real-time system control. Research & Technology 1997, 1997.
[Dar88] Ian F. Darwin. Checking C Programs with Lint. O’Reilly and Associates,
Inc,, 103 Morris Street, Suite A Sebastopol, CA 95472, October 1988.
[Den99] Jason Denton. Accurate Software Reliability Estimation. Master’s thesis,
Colorado State University, 1999.
[Dew90] Philip Elmer Dewitt. Ghost in the machine. Time, pages 58–59, January
29 1990.
[DLZ05] Valentin Dallmeier, Christian Lindig, and Andreas Zeller. Evaluating a
lightweight defect localization tool. In Workshop on the Evaluation of
Software Defect Detection Tools, June 12 2005.
[DZN+04] Martin Davidsson, Jiang Zheng, Nachiappan Nagappan, Laurie Williams,
and Mladen Vouk. GERT: An empirical reliability estimation and testing
feedback tool. In 15th International Symposium on Software Reliability
Engineering (ISSRE’04), pages 269–280, 2004.
[EGHT94] David Evans, John Guttag, James Horning, and Yang Meng Tan. LCLint:
A tool for using specifications to check code. In Proceedings of the ACM
SIGSOFT ’94 Symposium on the Foundations of Software Engineering,
pages 87–96, 1994.
[EKN98] William Everett, Samuel Keene, and Allen Nikora. Applying software
reliability engineering in the 1990s. IEEE Transactions on Reliability,
47(3):372–378, September 1998.
[EL03] Davie Evans and David Larochelle. Splint Manual. Secure Programming
Group University of Virginia Department of Computer Science, June 5
2003.
[ELC04] The Economic Impacts of the August 2003 Blackout. Technical report,
Electricity Consumers Resource Council, 2004.
[Eng05] Dawson R. Engler. Static analysis versus model checking for bug finding.
In CONCUR, page 1, 2005.
[FCJ04] Thomas Flowers, Curtis A. Carver, and James Jackson. Empowering stu-
dents and building confidence in novice programmers through Gauntlet.
In 34th ASEE IEEE Frontiers in Education Conference, 2004.
[FGMP95] Fabio Del Frate, Praerit Garg, Aditya P. Mathur, and Alberto Pasquini.
On the correlation between code coverage and software reliability. In
Proceedings of the Sixth International Symposium on Software Reliability
Engineering, pages 124–132, 1995.
[FL01] Cormac Flanagan and K. Rustan M. Leino. Houdini, an Annotation
Assistant for ESC/Java. In Proceedings of the International Symposium
of Formal Methods Europe on Formal Methods for Increasing Software
Productivity, pages 500–517, London, UK, 2001. Springer-Verlag.
[For05] Jeff Forristal. Source-code assessment tools kill bugs dead. Secure En-
terprise, December 2005.
[FPG94] Norman Fenton, Shari Lawrence Pfleeger, and Robert L. Glass. Science
and substance: A challenge to software engineers. IEEE Softw., 11(4):86–
95, 1994.
[Gan00] Jack Ganssle. Crash and burn: Disasters and what we can learn from
them. Embedded Systems Programming, November 2000.
[Gan01] Jack Ganssle. The best ideas for developing better firmware faster. Tech-
nical Presentation, 2001.
[Gan02a] Jack Ganssle. Born to fail. Embedded Systems Programming, 2002.
[Gan02b] Jack Ganssle. Codifying good software design. Embedded.com, August
2002.
[Gan04] Jack Ganssle. When disaster strikes. Embedded Systems Programming,
November 11 2004.
[Gar94] Praerit Garg. Investigating coverage-reliability relationship and sensitiv-
ity of reliability to errors in the operational profile. In CASCON ’94:
Proceedings of the 1994 conference of the Centre for Advanced Studies on
Collaborative research, page 19. IBM Press, 1994.
[Gar95] Praerit Garg. On code coverage and software reliability. Master’s thesis,
Purdue University, Department of Computer Sciences, May 1995.
[Gep04] Linda Geppert. Lost radio contact leaves pilots on their own. IEEE
Spectrum, November 2004.
[Ger04] Andy German. Software static code analysis lessons learned. Crosstalk,
16(11):13–17, 2004.
[GH01] Bjorn Axel Gran and Atte Helminen. A Bayesian Belief Network for
Reliability Assessment. In SAFECOMP ’01: Proceedings of the 20th
International Conference on Computer Safety, Reliability and Security,
pages 35–45, London, UK, 2001. Springer-Verlag.
[Gie98] Dirk Giesen. Philosophy and practical implementation of static analyzer
tools. Technical report, QA Systems Technologies, 1998.
[GJC+03] Vinod Ganapathy, Somesh Jha, David Chandler, David Melski, and
David Vitek. Buffer overrun detection using linear programming and
static analysis. In Proceedings of the 10th ACM conference on Computer
and Communications Security, pages 345–354, New York, NY, USA,
2003. ACM Press.
[GJSB00] James Gosling, Bill Joy, Guy L. Steele, and Gilad Bracha. The Java
Language Specification. Java series. Second edition, 2000.
[Gla79] Robert L. Glass. Software Reliability Guidebook. Prentice-Hall, Engle-
wood Cliffs, NJ, 1979.
[Gla99a] Robert L. Glass. Inspections - some surprise findings. Communications
ACM, 42(4):17–19, 1999.
[Gla99b] Robert L. Glass. The realities of software technology payoffs. Communi-
cations ACM, 42(2):74–79, 1999.
[Gle96] James Gleick. A bug and a crash. New York Times Magazine, December
1996.
[God05] Patrice Godefroit. The soundness of bugs is what matters. In Proceed-
ings of BUGS’2005 (PLDI’2005 Workshop on the Evaluation of Software
Defect Detection Tools), Chicago, IL, June 2005.
[Gra86] Jim Gray. Why do computers stop and what can be done about it? Proc.
5th Symp. on Reliability in Distributed Software and Database Systems,
pages 3–12, 1986.
[Gra00] Codesurfer technology overview: Dependence graphs and program slic-
ing. Technical report, GrammaTech, 2000.
[Gri04a] Chris Grindstaff. Findbugs, part 1: Improve the quality of your code
why and how to use findbugs. IBM DeveloperWorks, May 2004.
[Gri04b] Chris Grindstaff. Findbugs, part 2: Writing custom detectors how to
write custom detectors to find application-specific problems. IBM Devel-
operWorks, May 2004.
[Gro01] Michael Grottke. Software Reliability Model Study. Technical Report
IST-1999-55017, PETS, January 2001.
[GT97] Swapna Gokhale and Kishor Trivedi. Structure-based software reliability
prediction. In Proc. of Advanced Computing (ADCOMP), Chennai, India,
1997.
[GT05] Michael Grottke and Kishor S. Trivedi. A classification of software faults.
In Supplementary Proceedings 16th IEEE International Symposium on
Software Reliability Engineering, Chicago, Illinois, 8-11 November 2005.
[Hac04] Mark Hachman. NASA: DOS glitch nearly killed mars rover. Extreme-
Tech, August 2004.
[Had99] P. Haddaway. An overview of some recent developments in Bayesian
problem solving techniques, 1999.
[Hal99] Todd Halvorson. Air Force Titan 4 rocket program suffers another failure.
Florida Today, May 8 1999.
[Hal03] Christopher D. Hall. When spacecraft wont point. In 2003 AAS/AIAA
Astrodynamics Specialists Conference, Big Sky, Montana, August 2003.
[Har99] K. J. Harrison. Static code analysis on the C-130J Hercules safety-critical
software. Technical report, Aerosystems International, UK, 1999.
[Hat95] Les Hatton. Safer C: Developing for High-Integrity and Safety-Critical
Systems. McGraw-Hill, January 1995.
[Hat99a] Les Hatton. Ariane 5: A smashing success. Software Testing and Quality
Engineering, 1(2), 1999.
[Hat99b] Les Hatton. Software faults and failures: Avoiding the avoidable and
living with the rest. Draft text from “Safer Testing” Course, December
1999.
[Hat07] Les Hatton. Language subsetting in an industrial context: a compari-
son of MISRA C 1998 and MISRA C 2004. Information and Software
Technology, 49(1):475–482, May 2007.
[HD03] Elise Hewett and Paul DiPalma. A survey of static and dynamic analyzer
tools. In Proceedings 1st CSci 780 Symposium on Software Engineering,
The College of William and Mary, December 15-16 2003.
[HFGO94] Monica Hutchins, Herb Foster, Tarak Goradia, and Thomas Ostrand.
Experiments of the effectiveness of dataflow and controlflow based test
adequacy criteria. In ICSE ’94: Proceedings of the 16th International
Conference on Software Engineering, pages 191–200, Los Alamitos, CA,
USA, 1994. IEEE Computer Society Press.
[HJv00] Marieke Huisman, Bart Jacobs, and Joachim van den Berg. A case study
in class library verification: Java’s Vector Class. Technical Report CSI-
R0007, 2000.
[HL02] Sudheendra Hangal and Monica S. Lam. Tracking down software bugs
using automatic anomaly detection. In Proceedings of the 24th Interna-
tional Conference on Software Engineering, May 2002.
[HLL94] Joseph R. Horgan, Saul London, and Michael R. Lyu. Achieving software
quality with testing coverage measures. IEEE Computer, 27(9):60–69,
September 1994.
[Hof99] Eric J. Hoffman. The NEAR rendezvous burn anomaly of december 1998.
Technical report, Johns Hopkins University, 1999.
[Hol99] C. Michael Holloway. From bridges to rockets: Lessons for software sys-
tems. In Proceedings of the 17th International System Safety Conference,
pages 598–607, August 1999.
[Hol04] Ralf Holly. Lint metrics and ALOA. C/C++ Users Journal, pages 18–22,
June 2004.
[Hot01] Chris Hote. Run-time error detection through semantic analysis: A
breakthrough solution to todays software testing inadequacies in auto-
motive. Technical report, Polyspace Technologies, September 2001.
[HP00] Klaus Havelund and Thomas Pressburger. Model checking JAVA pro-
grams using JAVA PathFinder. STTT, 2(4):366–381, 2000.
[HV93] Dick Hamlet and Jeff Voas. Faults on its sleeve: amplifying software
reliability testing. In ISSTA ’93: Proceedings of the 1993 ACM SIGSOFT
international symposium on Software testing and analysis, pages 89–98,
New York, NY, USA, 1993. ACM Press.
[HW06] Sarah Heckman and Laurie Williams. Automated adaptive ranking and
filtering of static analysis alerts. In 17th International Symposium on
Software Reliability Engineering, Raleigh, NC, November 2006.
[ISO90] International Standard ISO/IEC9899 - Programming Languages - C, De-
cember 1990. ISO/IEC9899-1990.
[ISO99] International Standard ISO/IEC9899 - Programming Languages - C, De-
cember 1999. ISO/IEC9899-1999.
[ISO03] ISO/IEC 14882 Programming languages C++ (Langages de programma-
tion C++). Technical report, International Standard ISO/IEC, Ameri-
can National Standards Institute, 25 West 43rd Street, New York, New
York 10036, October 15 2003.
[JBl] Jblanket. Online at. http://csdl.ics.hawaii.edu/Tools/JBlanket/.
[Jel04] Rick Jelliffe. Mini-review of Java bug finders. The O’Reilly Network,
March 15 2004.
[Jes04] Anick Jesdanun. GE energy acknowledges blackout bug. The Associated
Press, February 2004.
[JM97] Jean-Marc Jezequel and Bertrand Meyer. Design by contract: The lessons
of Ariane. Computer, 30(1):129–130, 1997.
[Joh78] S.C. Johnson. Lint, a C Program Checker. Unix Programmer’s Man-
ual 65, AT&T Bell Laboratories, 1978.
[KA] Konstantin Knizhnik and Cyrille Artho. JLint manual.
[Kan95] Cem Kaner. Software negligence and testing coverage. Technical report,
Florida Tech, 1995.
[KAYE04] Ted Kremenek, Ken Ashcraft, Junfeng Yang, and Dawson Engler. Corre-
lation exploitation in error ranking. In SIGSOFT ’04/FSE-12: Proceed-
ings of the 12th ACM SIGSOFT / Twelfth International Symposium on
Foundations of Software Engineering, pages 83–93, New York, NY, USA,
2004.
[KL86] J. C. Knight and N. G. Leveson. An experimental evaluation of the
assumption of independence in multiversion programming. IEEE Trans-
actions on Software Engineering, 12(1):96–109, 1986.
[KL90] John C. Knight and Nancy G. Leveson. A reply to the criticisms of the
Knight & Leveson experiment. SIGSOFT Softw. Eng. Notes, 15(1):24–
35, 1990.
[Koc04] Christopher Koch. Bursting the CMM hype. Software Quality, March 1
2004.
[Lad96] Peter B. Ladkin. Excerpt from the Case Study of The Space Shuttle
Primary Control System. Excerpted 12 August 1996, Communications
of the ACM 27(9), September 1984, p886, August 1996.
[LAW+04] Kathryn Laskey, Ghazi Alghamdi, Xun Wang, Daniel Barbara, Tom
Shackelford, Ed Wright, and Julie Fitzgerald. Detecting threatening be-
havior using bayesian networks. In Proceedings of the Conference on
Behavioral Representation in Modeling and Simulation, 2004.
[LB05] Marc Lalo and Steve Barriault. Maximizing software reliability and de-
veloper’s productivity in automotive: Run-time errors, MISRA, and se-
mantic analysis. Technical report, Polyspace Technologies, 2005.
[LE01] David Larochelle and David Evans. Statically detecting likely buffer over-
flow vulnerabilities. In USENIX Security Symposium, pages 177–190,
Washington, D. C., August 13-17 2001.
[Lee94] S. C. Lee. How Clementine really failed and what NEAR can learn. Johns Hopkins University Applied Physics Laboratory Memorandum, May 26 1994.
[Lev94] Nancy G. Leveson. High-pressure steam engines and computer software.
IEEE Computer, pages 65–73, October 1994.
[LG99] Craig Larman and Rhett Guthrie. Java 2 Performance and Idiom Guide.
1999.
[Lio96] J. L. Lions. Ariane 5 flight 501 failure report by the inquiry board.
Technical report, CNES, 1996.
[LL05] V. Benjamin Livshits and Monica S. Lam. Finding security vulnerabili-
ties in Java applications with static analysis. In 14th USENIX Security
Symposium, 2005.
[LLQ+05] Shan Lu, Zhenmin Li, Feng Qin, Lin Tan, Pin Zhou, and Yuanyuan Zhou.
Bugbench: Benchmarks for evaluating bug detection tools. In Proceedings
of the Workshop on the Evaluation of Software Defect Detection Tools,
June 2005.
[LM95] Naixin Li and Y. K. Malaiya. ROBUST: a next generation software re-
liability engineering tool. In Proceedings of the Sixth International Sym-
posium on Software Reliability Engineering, pages 375–380, Toulouse,
France, October 24–27 1995.
[LT93] Nancy Leveson and Clark S. Turner. An investigation of the Therac-25
accidents. IEEE Computer, 26(7):18–41, 1993.
[Lyu95] Michael R. Lyu, editor. Handbook of Software Reliability Engineering.
Number ISBN 0-07-039400-8. McGraw-Hill publishing, 1995.
[Maj03] Dayle G. Majors. An investigation of the call integrity of the Linux
System. In Fast Abstract ISSRE, 2003.
[Mar99] Brian Marick. How to misuse code coverage. Technical report, Reliable
Software Technologies, 1999.
[MB06] Robert A. Martin and Sean Barnum. A status update: The common
weaknesses enumeration. In NIST Static Analysis Summit, Gaithersburg,
MD, June 29 2006.
[McA] McAfee. W32/nachi.worm.
[MCJ05] Robert A. Martin, Steven M. Christey, and Joe Jarzombek. The case for
common flaw enumeration. Technical report, MITRE Corporation, 2005.
[ME03] M. Musuvathi and D. Engler. Some lessons from using static analysis
and software model checking for bug finding, 2003.
[Mef05] Barmak Meftah. Benchmarking bug detection tools. In Workshop on the
Evaluation of Software Defect Detection Tools, June 2005.
[Mey92] Scott (Scott Douglas) Meyers. Effective C++: 50 specific ways to improve
your programs and designs. Addison-Wesley professional computing se-
ries. Addison Wesley Professional, 75 Arlington Street, Suite 300 Boston,
MA 02116, 1992.
[Mey01] Sleighton Meyer. Harris Corporation completes acceptance test of FAA's voice switching and control system (VSCS) upgrade. Corporate Press Release, November 2001.
[MIO90] John D. Musa, Anthony Iannino, and Kazuhira Okumoto. Software Re-
liability: Measurement, Prediction, Application. McGraw-Hill, Inc., New
York, NY, USA, Professional edition, 1990.
[MIS98] MISRA-C guidelines for the use of the C language in critical systems. The Motor Industry Software Reliability Association, 1998.
[MIS04] MISRA-C:2004 guidelines for the use of the C language in critical systems. The Motor Industry Software Reliability Association, October 2004.
[MKBD00] Eric Monk, J. Paul Keller, Keith Bohnenberger, and Michael C. Daconta.
Java Pitfalls: Time-Saving Solutions and Workarounds to Improve Pro-
grams (Paperback). John Wiley & Sons, 2000.
[MLB+94] Y.K. Malaiya, N. Li, J. Bieman, R. Karcich, and B. Skibbe. The relation-
ship between test coverage and reliability. In Proc. Int. Symp. Software
Reliability Engineering, pages 186–195, November 1994.
[MMZC06] Kevin Mattos, Christine Moreira, Mark Zingarelli, and Denis Coffey. The
effect of rapid commercial off-the-shelf (COTS) software insertion on the
software reliability of large-scale undersea combat systems. In 17th Inter-
national Symposium on Software Reliability Engineering, Raleigh, North
Carolina, November 2006.
[MvMS93] Y.K. Malaiya, A. von Mayrhauser, and P.K. Srimani. An examination
of fault exposure ratio. IEEE Transactions on Software Engineering,
19(11):1087–1094, November 1993.
[Mye76] Glenford J. Myers. Software Reliability: Principles and Practices. John
Wiley & Sons, 1976.
[Nag05] Nachiappan Nagappan. A software testing and reliability early warning
(STREW) metric suite. PhD thesis, 2005. Chair-Laurie A. Williams.
[NB05] Nachiappan Nagappan and Thomas Ball. Static analysis tools as early
indicators of pre-release defect density. In International Conference on
Software Engineering, (ICSE 2005)., 2005.
[NBZ06] Nachiappan Nagappan, Thomas Ball, and Andreas Zeller. Mining metrics
to predict component failures. In International Conference on Software
Engineering, Shanghai, China, May 2006.
[Neu99] Peter G. Neumann. The risks digest. Online Digest of Computing Failures
and Risks, September 15 1999.
[NF96] Martin Neil and Norman Fenton. Predicting software quality using
Bayesian Belief Networks. In Proc 21st Annual Software Eng Workshop,
pages 217–230, NASA Goddard Space Flight Centre, December 1996.
[NIS06] NIST. Source Code Analysis Tool Functional Specification. Technical re-
port, National Institute of Standards and Technology Information Tech-
nology Laboratory Software Diagnostics and Conformance Testing Divi-
sion, September 15 2006.
[NM96] Li Naixin and Y.K. Malaiya. Fault exposure ratio estimation and ap-
plications. In Seventh International Symposium on Software Reliability
Engineering (ISSRE ’96) p. 372, 1996.
[Nta97] Simeon Ntafos. The cost of software failures. In Proceedings of IASTED
Software Engineering Conference, pages 53–57, November 1997.
[NWV03] Nachiappan Nagappan, Laurie Williams, and Mladen Vouk. Towards a
metric suite for early software reliability assessment. In FastAbstract in
Supplementary Proceedings, International Symposium on Software Reli-
ability Engineering, 2003.
[NWV+04] Nachiappan Nagappan, Laurie Williams, Mladen Vouk, John Hudepohl,
and Will Snipes. A preliminary investigation of automated software in-
spection. In IEEE International Symposium on Software Reliability En-
gineering, pages 429–439., 2004.
[NWVO04] Nachiappan Nagappan, Laurie Williams, Mladen Vouk, and Jason Os-
borne. Initial results of using in-process testing metrics to estimate soft-
ware reliability. Technical report, North Carolina State University, 2004.
[OS00] Emilie O’Connell and Hossein Saiedian. Can you trust software capability
evaluations? Computer, 33(2):28–35, 2000.
[OWB04] Thomas J. Ostrand, Elaine J. Weyuker, and Robert M. Bell. Using static
analysis to determine where to focus dynamic testing effort. In Second In-
ternational Workshop on Dynamic Analysis, Edinburgh, Scotland, 2004.
[Pai] Ganesh J. Pai. A survey of software reliability models. A Project Report
CS 651: Dependable Computing.
[Pai01] Ganesh J Pai. Combining bayesian belief networks with fault trees to
enhance software reliability analysis. In Proceedings of the IEEE Inter-
national Symposium on Software Reliability Engineering, November 2001.
[Par] Terence Parr. Antlr parser generator. http://www.antlr.org.
[Pat02] David A. Patterson. A simple way to estimate the cost of downtime. In
LISA ’02: Proceedings of the 16th USENIX conference on System admin-
istration, pages 185–188, Berkeley, CA, USA, 2002. USENIX Association.
[Pav99] J. G. Pavlovich. Formal report of investigation of the 30th April 1999 Ti-
tan IV B/Centaur TC-14/Milstar-3 (B-32) Space Launch Mishap. Tech-
nical report, U.S. Air Force, 1999.
[PD01] Ganesh J. Pai and Joanne Bechta Dugan. Enhancing Software Relia-
bility Estimation Using Bayesian Networks and Fault Trees. In ISSRE
FastAbstracts 2001. Chillarege, 2001.
[Pet94] Henry Petroski. Design Paradigms: Case Histories of Error and Judge-
ment in Engineering. Cambridge University Press, 1994.
[Pil03] Daniel Pilaud. Finding run time errors without testing in embedded
systems. Minatec, 2003. Keynote Address by Daniel Pilaud, Chairman
PolySpace Technologies, http://www.polyspace.com.
[POC93] Paul Piwowarski, Mitsuru Ohba, and Joe Caruso. Coverage measurement
experience during function test. In ICSE ’93: Proceedings of the 15th
international conference on Software Engineering, pages 287–301, Los
Alamitos, CA, USA, 1993. IEEE Computer Society Press.
[Pol] PolySpace for C++. Product Brochure.
[Pou03] Kevin Poulsen. Nachi worm infected Diebold ATMs. The Register,
November 25th 2003.
[Pou04a] Kevin Poulsen. Software bug contributed to blackout. SecurityFocus,
February 2004.
[Pou04b] Kevin Poulsen. Tracking the blackout bug. The Register, April 2004.
[Pro] The Programming Research Group. High Integrity C++ Coding Standard Manual, 2.2 edition.
[QAC98] QAC Clinic. Available Online, 1998. Available from
http://www.toyo.co.jp/ss/customersv/doc/qac clinic1.pdf.
[RAF04] Nick Rutar, Christian B. Almazan, and Jeffrey S. Foster. A comparison
of bug finding tools for Java. In Proceedings of the 15th IEEE Symposium
on Software Reliability Engineering, Saint-Malo, France, November 2004.
[Rai05] Abhishek Rai. On the role of static analysis in operating system checking
and runtime verification. Technical report, Stony Brook University, May
2005. Technical Report FSL-05-01.
[Ree97] Glenn E Reeves. What really happened on Mars. E-mail discussion of
the failure of the Mars Pathfinder spacecraft, December 1997.
[RGT00] S. Ramani, S. Gokhale, and K. S. Trivedi. Software reliability estimation
and prediction tool. Performance Evaluation, 39:37–60, 2000.
[Ric00] Debra J. Richardson. Static analysis. ICS 224: Software Testing and
Analysis Class Notes, Spring 2000.
[Roo90] Paul Rook, editor. Software Reliability Handbook. Centre for Software
Reliability, City University, London, U.K., 1990.
[SA05a] Walter Schilling and Mansoor Alam. A methodology for estimating soft-
ware reliability using limited testing. In Supplemental Proceedings ISSRE
2005: The 16th International Symposium on Software Reliability Engi-
neering, Chicago, IL, November 2005. IEEE Computer Society and IEEE
Reliability Society.
[SA05b] Walter Schilling and Mansoor Alam. Work In Progress - Measuring the ROI Time for Static Analysis. In 2005 Frontiers in Education, Indianapolis, IN, October 2005. IEEE Computer Society / ASEE.
[SA06a] Walter Schilling and Mansoor Alam. The software static analysis
reliability toolkit. In Supplemental Proceedings ISSRE 2006: The 17th
International Symposium on Software Reliability Engineering, Raleigh,
NC, November 2006. IEEE Computer Society and IEEE Reliability So-
ciety.
[SA06b] Walter Schilling and Mansoor Alam. Estimating software reliability with
static analysis technique. In Proceedings of the 15th International Confer-
ence on Software Engineering and Data Engineering (SEDE-2006), Los
Angeles, California, 2006. International Society for Computers and their
Applications (ISCA).
[SA06c] Walter Schilling and Mansoor Alam. Integrate static analysis into a
software development process. Embedded Systems Design, 19(11):57–66,
November 2006.
[SA06d] Walter Schilling and Mansoor Alam. Modeling the reliability of existing
software using static analysis. In Proceedings of the 2006 IEEE Inter-
national Electro/Information Technology Conference, East Lansing, MI,
2006. IEEE Region IV.
[SA07a] Walter Schilling and Mansoor Alam. Measuring the reliability of exist-
ing web servers. In Proceedings of the 2007 IEEE International Elec-
tro/Information Technology Conference, Chicago, IL, 2007. IEEE Region
IV.
[SA07b] Walter W. Schilling and Mansoor Alam. Evaluating the Effectiveness of
Java Static Analysis Tools. In Proceedings of the International Conference
on Embedded Systems and Applications, Las Vegas, NV, June 2007.
[Sch04a] Walter Schilling. Issues affecting the readiness of the Java language for usage in safety critical real time systems. Submitted to fulfill partial requirements for Special Topics: Java EECS8980-001, Dr. Gerald R. Heuring, Instructor, May 2004.
[Sch04b] Katherine V. Schinasi. Stronger Management Practices Are Needed to Improve DOD's Software-Intensive Weapon Acquisitions. Technical report, Government Accounting Office, 2004.
[Sch05] Walter Schilling. Embedded systems software reliability. In NASA / Ohio
Space Grant Consortium 2004-2005 Annual Student Research Symposium
Proceedings XIII, Cleveland, Ohio, 2005. Ohio Space Grant Consortium.
[Sch07] Walter Schilling. Relating software reliability to execution rates using
bayesian belief networks. In NASA / Ohio Space Grant Consortium 2006-
2007 Annual Student Research Symposium Proceedings XV, Cleveland,
Ohio, 2007. Ohio Space Grant Consortium.
[SDWV05] Michele Strom, Martin Davidson, Laurie Williams, and Mladen Vouk. The "Good Enough" Reliability Tool (GERT) - Version 2. In Supplementary Proceedings of the 16th IEEE International Symposium on Software Reliability Engineering (ISSRE 2005), Chicago, Illinois, 8-11 November 2005.
[Sha06] Lui Sha. The complexity challenge in modern avionics software. In Na-
tional Workshop on Aviation Software Systems: Design for Certifiably
Dependable Systems, 2006.
[Sho84] Martin L. Shooman. Software Reliability: a historical perspective. IEEE
Transactions on Reliability, R-33(1), April 1984.
[Sho96] Martin L. Shooman. Avionics software problem occurrence rates. In The
Seventh International Symposium on Software Reliability Engineering,
1996.
[Sip97] Michael Sipser. Introduction to the Theory of Computation. PWS Pub-
lishing Company, 20 Park Plaza, Boston, MA 02116-4324, 1997.
[SK95] Hossein Saiedian and Richard Kuzara. SEI Capability Maturity Model’s
impact on contractors. Computer, 28(1):16–26, 1995.
[Sla98a] Gregory Slabodkin. Control-system designers say newer version could
have prevented LAN crash. GCN, December 14 1998.
[Sla98b] Gregory Slabodkin. Software glitches leave navy smart ship dead in the
water. GCN, July 13 1998.
[Sop01] Joe Sopko. CTC195, 197, 203, No Sound. National Electronic Service
Dealers Association of Ohio Newsletter, page 10, June 2001.
[Ste02] Henry Stewart. Meeting FDA requirements for validation of medical device software, September 2002. Briefing Advertisement.
[Sto05] Walt Stoneburner. Software reliability overview. SMERFS Website, 2005.
[SU99] Curt Smith and Craig Uber. Experience report on early software reli-
ability prediction and estimation. In 10th International Symposium on
Software Reliability Engineering, November 1999.
[SWA+00] Donald Savage, Helen Worth, Diane E. Ainsworth, George Diller, and
Keith Takahashi. The Near Earth Asteroid Rendezvous: A Guide to the
Mission, the Spacecraft, and the People. NASA, 2000.
[SWX05] Sarah E. Smith, Laurie Williams, and Jun Xu. Expediting Program-
mer AWAREness of Anomalous Code. In Supplemental Proceedings 16th
International Symposium on Software Reliability Engineering, Chicago,
Illinois, 2005.
[Sys02] QA Systems. Overview large Java project code quality analysis. Technical
report, QA Systems, 2002.
[Tas02] Gregory Tassey. The economic impacts of inadequate infrastructure for
software testing. Technical Report RTI 7007.011, National Institute of
Standards and Technology, May 2002.
[Tha96] Henrik Thane. Safe and reliable computer control systems: Concepts and methods. Technical report, Mechatronics Laboratory, Department of Machine Design, Royal Institute of Technology (KTH), S-100 44 Stockholm, Sweden, 1996.
[Tri02] Kishor S. Trivedi. Probability and Statistics with Reliability, Queuing and
Computer Science Applications. John Wiley and Sons, Inc., 2002.
[VB04] Arnaud Venet and Guillaume Brat. Precise and efficient static array
bound checking for large embedded C programs. In PLDI ’04: Proceed-
ings of the ACM SIGPLAN 2004 conference on Programming language
design and implementation, pages 231–242. ACM Press, 2004.
[VBKM00] J. Viega, J. T. Bloch, Y. Kohno, and G. McGraw. ITS4: A static vul-
nerability scanner for C and C++ code. In ACSAC ’00: Proceedings of
the 16th Annual Computer Security Applications Conference, page 257,
Washington, DC, USA, 2000. IEEE Computer Society.
[vGB99] Jilles van Gurp and Jan Bosch. Using Bayesian Belief Networks in As-
sessing Software Designs. In ICT Architectures ’99, November 1999.
[VT01] Kalyanaraman Vaidyanathan and Kishor S. Trivedi. Extended classi-
fication of software faults based on aging. In The 12th International
Symposium on Software Reliability Engineering, 2001.
[Wag04] Stefan Wagner. Efficiency analysis of defect-detection techniques. Technical Report TUMI-0413, Institut für Informatik, Technische Universität München, 2004.
[Wal01] Dolores R. Wallace. Practical software reliability modeling. In Proceed-
ings 26th Annual NASA Goddard Software Engineering Workshop, pages
147–155, November 27-29 2001.
[Wal04] Matthew L. Wald. Maintenance lapse blamed for air traffic control problem. New York Times, September 2004.
[Whe04] David A. Wheeler. Flawfinder, May 30 2004. Manual Page.
[wir98] Sunk by Windows NT. wired.com, July 1998.
[WJKT05] Stefan Wagner, Jan Jürjens, Claudia Koller, and Peter Trischberger. Comparing bug finding tools with reviews and tests. In Proceedings of Testing of Communicating Systems: 17th IFIP TC6/WG 6.1 International Conference, TestCom 2005, Montreal, Canada, May - June 2005. Springer-Verlag GmbH.
[Woe99] Jack J. Woehr. A conversation with Glenn Reeves: Really remote debug-
ging for real-time systems. Dr. Dobb’s Journal, November 1999.
[Xie91] M. Xie. Software Reliability Modeling. World Scientific Publishing Com-
pany, Singapore, 1991.
[XNHA05] Yichen Xie, Mayur Naik, Brian Hackett, and Alex Aiken. Soundness and
its role in bug detection systems. In Proceedings of Workshop on the
Evaluation of Software Defect Detection Tools (BUGS’05), June 2005.
[XP04] Shu Xiao and Christopher H. Pham. Performing high efficiency source
code static analysis with intelligent extensions. In APSEC, pages 346–
355, 2004.
[YB98] David York and Maria Babula. Virtual interactive classroom: A new
technology for distance learning developed. Research & Technology 1998,
1998.
[YP99] David York and Joseph Ponyik. New Web Server - the Java version of
Tempest - Produced. Research and Technology, 1999.
[ZLL04] Misha Zitser, Richard Lippmann, and Tim Leek. Testing static analysis
tools using exploitable buffer overflows from open source code. SIGSOFT
Software Engineering Notes, 29(6):97–106, 2004.
[ZWN+06] Jiang Zheng, Laurie Williams, Nachiappan Nagappan, Will Snipes,
John P. Hudepohl, and Mladen A. Vouk. On the value of static analysis
for fault detection in software. IEEE Transactions on Software Engineer-
ing, 32(4):240–253, 2006.
Appendix A
Fault Taxonomy
Table A.1: SoSART Static Analysis Fault Taxonomy
Num. Categorization Name Description
100 Input Validation General Input Validation problem
101 Input Validation Path Manipulation Allowing user input to control paths used by the application may enable an attacker to access otherwise protected files.
102 Input Validation Cross Site Scripting (Basic XSS) 'Basic' XSS involves a complete lack of cleansing of any special characters, including the most fundamental XSS elements such as < and >.
103 Input Validation Resource Injection Allowing user input to control resource identifiers might enable an attacker to access or modify otherwise protected system resources.
104 Input Validation OS Command Injection Command injection problems are a subset of injection problems, in which the process is tricked into calling external processes of the attacker's choice through the injection of control-plane data into the data plane. Also called "shell injection".
105 Input Validation SQL Injection SQL injection attacks are another instantiation of injection attack, in which SQL commands are injected into data-plane input in order to effect the execution of predefined SQL commands.
200 Range Errors General Range Error Problem
201 Range Errors Stack overflow A stack overflow condition is a buffer overflow condition, where the buffer being overwritten is allocated on the stack (i.e., is a local variable or, rarely, a parameter to a function).
202 Range Errors Heap overflow A heap overflow condition is a buffer overflow, where the buffer that can be overwritten is allocated in the heap portion of memory, generally meaning that the buffer was allocated using a routine such as the POSIX malloc() call.
203 Range Errors Format string vulnerability Format string problems occur when a user has the ability to control or write completely the format string used to format data in the printf style family of C/C++ functions.
204 Range Errors Improper Null Termination The product does not properly terminate a string or array with a null character or equivalent terminator. Null termination errors frequently occur in two different ways. An off-by-one error could cause a null to be written out of bounds, leading to an overflow. Or, a program could use a strncpy() function call incorrectly, which prevents a null terminator from being added at all. Other scenarios are possible.
205 Range Errors Array Length problem Array length [%2d,%3d] is or may be less than zero.
206 Range Errors Index out of Bounds Index [%2d,%3d] is or may be out of array bounds.
300 API Abuse General API Abuse Problem
301 API Abuse Heap Inspection Using realloc() to resize buffers that store sensitive information can leave the sensitive information exposed to attack because it is not removed from memory.
302 API Abuse Often Misused: String Management Functions that manipulate strings encourage buffer overflows.
400 Security Features General Security Feature Problem
401 Security Features Hard-Coded Password Storing a password in plain text may result in a system compromise.
500 Time and State General Time and State Problem
501 Time and State Time-of-check Time-of-use race condition Time-of-check, time-of-use race conditions occur when, between the time in which a given resource (or its reference) is checked and the time that resource is used, a change occurs in the resource to invalidate the results of the check.
502 Time and State Unchecked Error Condition Ignoring exceptions and other error conditions may allow an attacker to induce unexpected behavior unnoticed.
600 Code Quality General Code Quality problem
601 Code Quality Memory leak Most memory leaks result in general software reliability problems, but if an attacker can intentionally trigger a memory leak, the attacker might be able to launch a denial of service attack (by crashing the program) or take advantage of other unexpected program behavior resulting from a low memory condition.
602 Code Quality Unrestricted Critical Resource Lock A critical resource can be locked or controlled by an attacker, indefinitely, in a way that prevents access to that resource by others, e.g. by obtaining an exclusive lock or mutex, or modifying the permissions of a shared resource. Inconsistent locking discipline can lead to deadlock.
603 Code Quality Double Free Calling free() twice on the same value can lead to a buffer overflow.
604 Code Quality Use After Free Use after free errors sometimes have no effect and other times cause a program to crash.
605 Code Quality Uninitialized variable Most uninitialized variable issues result in general software reliability problems, but if attackers can intentionally trigger the use of an uninitialized variable, they might be able to launch a denial of service attack by crashing the program.
606 Code Quality Unintentional pointer scaling In C and C++, one may often accidentally refer to the wrong memory due to the semantics of when math operations are implicitly scaled.
607 Code Quality Improper pointer subtraction The subtraction of one pointer from another in order to determine size is dependent on the assumption that both pointers exist in the same memory chunk.
608 Code Quality Null Dereference Using the NULL value of a dereferenced pointer as though it were a valid memory address.
700 Encapsulation General Encapsulation problem
701 Encapsulation Private Array-Typed Field Returned From A Public Method The contents of a private array may be altered unexpectedly through a reference returned from a public method.
702 Encapsulation Public Data Assigned to Private Array-Typed Field Assigning public data to a private array is equivalent to giving public access to the array.
703 Encapsulation Overflow of static internal buffer A non-final static field can be viewed and edited in dangerous ways.
704 Encapsulation Leftover Debug Code Debug code can create unintended entry points in an application. Output on System.out or System.err. Some programmers debug code with a debugger, some use printouts on System.out and System.err. Some printouts may by mistake not be removed when the debug session is over. This rule flags output using System.out.XX(), System.err.XX(), and Exception.printStackTrace().
1000 Operator Precedence General Operator Priority Problem
1001 Operator Precedence Logical Operator Precedence Problem May be wrong assumption about logical operators precedence.
1002 Operator Precedence Shift Operator Precedence Problem May be wrong assumption about shift operator priority.
1003 Operator Precedence Bit Operator Precedence Problem May be wrong assumption about bit operation priority.
1100 Object Oriented Problems General Object Oriented Problem
1101 Object Oriented Problems Incomplete Override This error condition indicates that there is an overridden method but other associated methods have not been overridden. Override both Object.equals() and Object.hashCode(). Some containers depend on both hash code and equals when storing objects [3]. Method %2m is not overridden by method with the same name of derived class %3c.
1102 Object Oriented Problems Component shadowing uncovered Component A in class B shadows one in base class C.
1103 Object Oriented Problems Run method not overridden The run method should be overridden when extending the Thread class [7]. This rule does not warn when the class is abstract.
1104 Object Oriented Problems Class Comparison Problem Rule 1034 - Compare classes using getClass() in non-final equals method. When comparing objects for equality, use the .getClass() method to make sure that the objects are of exactly the same type. Checking types with the instanceof operator only breaks the required equivalence relation [3] when a subclass has redefined the equals method.
1105 Object Oriented Problems Finalizer Behavior Problem A finalizer implementing only the default behavior is unnecessary. The general contract of finalize is that it is invoked if and when the virtual machine has determined that there is no longer any means by which this object can be accessed by any thread that has not yet died. Also, the finalize method is never invoked more than once by a Java virtual machine for any given object. A finalizer should always call the superclass' finalizer.
1106 Object Oriented Problems Bad Inheritance Path A class has been derived in an inappropriate manner.
1200 Logic ProblemsGeneral Logic Prob-lem
1201 Logic ProblemsMissing Body inSwitch Statement
Suspicious SWITCH without body
1202 Logic Problems Missing Break State-ment
Possible miss of BREAK before CASE/DEFAULT. Nobreak statement found. The flow of control fall into thecase of default statement below. Is this the intention oris there a break statement missing? If this was deliber-ate then place a comment immediately before the nextcase of default statement. Add a comment containingthe substring: ”fall through” or ”falls through”.
1203 Logic ProblemsLogic always executedthe same path
If condition always follows the same execution path dueto logic. Comparison always produces the same result.
1204 Logic Problems Suspicious ElseBranch Association
May be wrong assumption about ELSE branch associa-tion
1205 Logic Problems Suspicious If BranchAssociation
May be wrong assumption about IF body. if statementwith empty body. If statements directly followed by asemicolon is usually an error.
1206 Logic Problems Missing If Statement ELSE without IF
1207 Logic ProblemsUnreachable State-ment
The defined statement can not be reached.
1208 Logic ProblemsSuspicious Loop BodyAssociation
May be wrong assumption about loop bod. while state-ment with empty body. while statements directly fol-lowed by a semi-colon is usually an error. for statementwith empty body. for statements directly followed by asemi-colon is usually an error.
1209 Logic Problems Suspicious Case state-ment
Suspicious CASE/DEFAULT
1210 Logic Problems Missing While No WHILE for DO
1211 Logic Problems Improper string comparison Compare strings as object references. String comparison using == or != operator. Avoid using the == or != operator to compare strings. Use String.equals() or String.compareTo() when testing strings for equality. [1] (An illustrative sketch follows this table.)
1213 Logic Problems Missing Default Block
No default case in switch block. There was no ”default:”case in the switch block. Is this the intention? It isprobably better to include the default case with an ac-companying comment. If the flow of control should neverreach the default case, use an assertion for early error de-tection:
1214 Logic Problems True False boolean lit-eral test
Equality operation on ’true’ boolean literal. Avoidingequality operations on the ’true’ boolean literal savessome byte code instructions and in most cases it improvesreadability. Inequality operation on ’false’ boolean lit-eral. This rule flags inequality operations on the ’false’literal which is also an identity operation
1215 Logic Problems Dead Code DetectedThis rule flags dead code. Only ’if (false)’, ’while (false)’and ’for(..;false;..)’ are detected.
1216 Logic Problems Comparison Problem Compared expressions can be equal only when both ofthem are 0.
1217 Logic ProblemsCase value can not beproduced
Switch case constant %2d cant be produced by switchexpression.
1218 Logic Problems Zero operand Zero operand for %2s operation.
1219 Logic Problems Result always zero Result of operation %2s is always 0.
1220 Logic Problems Loop variable defini-tions
Something is wrong with the handling of a variablewithin a loop.
1300 Comments General CommentsProblem
1301 Comments Unclosed Comment Unclosed comments.
1302 Comments Nested Comment Nested comments.
1303 Comments Missing javadoc tagEach compilation unit should have an @author javadoctag.
1304 Comments Unknown Javadoc tagA javadoc tag not part of the standard tags [6] was found.Is this the purpose or a spelling mistake?
1400 Exception handlingGeneral ExceptionHandling Problem
1401 Exception handlingGeneral ExceptionHandling Problem
1402 Exception handling Suspicious Catch Suspicious CATCH/FINALLY
1403 Exception handlingCatch / Throw toogeneral
Catched exception too general. Exceptions should behandled as specific as possible. Exception, Runtime Ex-ception, Error and Throwable are too general [8]. Run-time Exception is inappropriate to catch except at thetop level of your program. Runtime Exceptions usuallyrepresents programming errors and indicate that some-thing in your program is broken. Catching Exceptionor Throwable is therefore also inappropriate since thatcatch clause will also catch Runtime Exception. Pre-fer throwing subclasses instead of the general exceptionclasses. Exception, Runtime Exception, Throwable andError are considered too general
1404 Exception handling Empty Block Empty catch block. Exceptions should be handled in thecatch block.
1406 Exception handling Return in FinallyBlock
Avoid using the return statement in a finally block.
1500 Syntax Error General Syntax Error Uncategorized syntax errors.
1501 Syntax Error Missing colon No ':' after CASE. No ';' after FOR initialization part.
1502 Syntax Error Assignment Problem May be ’=’ used instead of ’==’
1503 Syntax ErrorEscape SequenceProblem
May be incorrect escape sequence
1504 Syntax ErrorInteger constant prob-lem
May be ’l’ is used instead of ’1’ at the end of integerconstant
1600 ImportGeneral Import Prob-lem
1601 Import Unused importUnused Import Avoid importing types that are neverused.
1602 ImportExplicit import ofjava.lang classes.
Explicit import of java.lang classes. All types in thejava.lang package are implicitly imported. There’s noneed to import them explicitly.
1603 Import Wildcard Import
Wildcard import. Demand import declarations are notallowed in the compilation unit. Enumerating the im-ported types explicitly makes it very clear to the readerfrom what package a type come from when used in thecompilation unit.
1604 Import Duplicate ImportDuplicate import. The imported type has already beenimported.
1605 ImportImporting classes fromcurrent package.
Classes from the same package are imported by default.There’s no need to import them explicitly.
1700 PackagingGeneral Packagingproblem
1701 Packaging No Package foundNo package declaration found. Try to structure yourclasses in packages.
1800 Style General Style Problem
1801 StyleSpace Tab Indentationproblem
Mixed indentation. Both spaces and tabs have been usedto indent the source code. If the file is indented with oneeditor it might look unindented in another if the tab sizeis not exactly the same in both editors. Prefer spacesonly.
1802 Style Wrong modifier order
Wrong order of modifiers. The order of modifiers shouldaccording to [1] be: Class modifiers: public protected pri-vate abstract static final strictfp Field modifiers: publicprotected private static final transient volatile Methodmodifiers: public protected private abstract static finalsynchronized native strictfp
1803 Style80 Character LineLength Exceeded
Avoid lines longer than 80 characters, since they’re nothandled well by many terminals and tools [5]. Printersusually have a 80 character line limit. The printed copywill be hard to read when lines are wrapped.
1804 Style Assert is reserved key-word in JDK 1.4
Prefer using another name to improve portability.
1805 Style Declare ParametersFinal
Assigning a value to the formal parameters can confusemany programmers because the formal parameter may beassociated with the actual value passed to the method.Confusion might also arise if the method is long and thevalue of the parameter is changed. If you do not intendto assign any value to the formal parameters you canstate this explicitly by declaring them final.
1806 Style Suspicious MethodName
The method has almost the same name as Ob-ject.hashCode, Object.equals or Object.finalize. Maybethe intention is to override one of these?
1900 Synchronization General Synchronization problem
1901 Synchronization Broken Double-checked locking idiom Double-Checked Locking is widely cited and used as an efficient method for implementing lazy initialization in a multithreaded environment. Unfortunately, this idiom does not work in the presence of either optimizing compilers or shared memory multiprocessors [2].
1902 Synchronization Potential DeadlockLoop %2d: invocation of synchronized method %3m cancause deadlock
1903 Synchronization Synchronized methodoverwritten
Synchronized method %2m is overridden by non-synchronized method of derived class
1904 Synchronization Unsynchronized callsMethod %2m can be called from different threads and isnot synchronized
1905 SynchronizationVariable volatilityproblem
Field %2u of class %3c can be accessed from differentthreads and is not volatile
1906 Synchronization Potential Race Condi-tion
Value of lock %2u is changed outside synchronization orconstructor. Value of lock %2u is changed while (poten-tially) owning it
2000 Variables General VariableProblem
2001 Variables Unused VariableAn unused local variable may indicate a flaw in the pro-gram.
2002 Variables Comparison TypeProblem
Comparison of short with char.
2003 Variables Typecast problem Maybe type cast is not correctly applied.
2004 VariablesTruncation results indata loss
Data can be lost as a result of truncation to %2s.
2100 ConstructorGeneral ConstructorProblem
2101 ConstructorSuperclass constructornot called
A class which extends a class does not call the superclassconstructor.
Appendix B
SOSART Requirements
B.1 Functional Requirements
Requirement Rationale
F.1
The SOSART analysis tool shall be capable of loading any Java 1.4.2 compliant source code module and creating UML based activity diagrams / control flow diagrams for each method within the source code module.
The first project to be analyzed requires Java support.
F.2
Based upon a loaded source code module, SOSART shall be capable of generating watchpoints which can be used to collect execution profiles. Watchpoints shall be located at the entry to each and every code block, as well as at all return statements.
Execution traces require indications of the location for the start of each code block, and this must be obtained by structurally analyzing the source code module. (An illustrative instrumentation sketch follows this requirements table.)
F.3
The SOSART tool shall be capable of visualizing execution traces which have been captured by the tool. Visualization shall occur on the activity diagram / flow diagram representation of the program.
Aids in understanding the execution flow and its relationship to the various code blocks within the method.
F.3.1
The number of times a given path has been executed shall be displayed on the activity diagram when an execution path is added to the display.
Usability and understanding of the executed paths.
F.4
The SOSART tool shall have the capability to store, external to the program, a historical database which represents all warnings which have been analyzed during program execution.
Allows historical trending to be used to improve the accuracy of the model.
F.4.1
Transfer to the historical database shall be at the operator's command.
Prevents erroneous analysis results from being transferred into the historical database.
F.4.2
The SOSART tool shall allow the operator to clear the historical database as is necessary.
This allows projects of entirely different scope to be analyzed without data contamination by projects from different domains.
F.4.3
The SOSART tool shall allow the operator to store and retrieve different historical databases as is necessary.
This allows projects from different domains to be analyzed independently.
F.5
SOSART shall be capable of calculating Cyclomatic Complexity and Static Path count on a per method basis.
These are basic metrics which should be readily available when analyzing COTS and other developed software.
F.6
SOSART shall be capable of importing static analysis warnings and displaying the warnings on the generated activity diagrams / control flow diagrams.
Visualization of static analysis warnings relative to the execution profile obtained during program execution.
F.6.1
SOSART shall be capable of interfacing at minimum with the following Java static analysis tools: 1. JLint (see note 1), 2. ESC/Java, 3. FindBugs, 4. Fortify SCA, 5. PMD, 6. QAJ, 7. Lint4J, 8. JiveLint, 9. Klocwork K7.
These are commonly available static analysis tools for Java which have been shown to be reliable and thus serve as a starting set for this analysis.
F.7
The SOSART GUI tool shall support commonly existing document interface behavior, including but not limited to image zooming, tiling, and printing of generated graphics.
These are standard GUI behaviors expected in a completed analysis tool.
F.7.1
SOSART shall automatically relay graphics displays when necessary, but shall also have a button to force the graphics to be relayed using the built-in algorithm.
Allows the user to force a relay of the display if a program failure occurs which prevents the proper display of the activity diagram graph.
F.8
The SOSART tool shall provide the capability to export generated graphics into a standard file format for graphics.
This allows generated graphics to be imported into reports and other documents as is necessary.
F.9
SOSART shall allow the user to save projects and generated metrics for future usage.
Ease of use for long term projects.
F.10
The SOSART tool shall allow the user to save analyzed static analysis warnings separate from the project. Warnings saved as such shall be recoverable with all attributes set to the values modified by the user during analysis.
This will allow larger projects to be analyzed which may not be storable in their complete format due to limitations of XML persistence within Java.
F.11
SOSART shall use a configuration file which shall store configuration parameters for the tool across multiple projects.
Allows the user to store common parameters and commands within a configuration file so that they do not need to be set when invoking the tool from the command line.
F.11.1
Configuration data shall be stored in the XML format.
XML is a standard markup language.
F.12
SOSART shall categorize imported static analysis warnings based upon a defined taxonomy.
Required for the proper characterization of warnings.
F.12.1
SOSART shall support the CWE and SAMATE taxonomies as well as a custom developed taxonomy for SOSART.
CWE and SAMATE are two existing taxonomies for static analysis warnings. However, these taxonomies target security as opposed to the larger domain of static analysis tools.
F.12.2
All static analysis warnings shall be categorized into the appropriate taxonomy upon importation if the categorization has not already occurred.
This allows the most efficient categorization of faults to the taxonomy definitions, as they are categorized by the user when a new instance is detected.
F.12.3
Fault taxonomy assignments shall be viewable within the SOSART tool by the operator.
This allows the user to view existing taxonomy assignments as is necessary.
F.13
The SOSART tool shall calculate the estimated software reliability using the reliability model developed by Schilling and Alam [SA06d] [SA06b] [SA05a].
This is one of the fundamental purposes for the tool.
F.13.1
The reliability shall be shown in a textual format which can be saved to a file.
Allows storage of reliability and importation into external reports.
F.14
The SOSART tool shall be capable of generating reports based upon the statically detectable faults imported into the tool.
Basic core functionality required for the tool to be a metadata tool.
F.14.1
The SOSART tool shall provide historical reports, project level reports, and file level reports.
These represent the three major classifications of faults supported by SOSART, as a fault is either historical in nature (in that it is from a previous project), part of a project (which consists of multiple files), or part of a file.
F.15
The SOSART tool shall allow for faults which are not directly attributable to a given method to be stored at the class level.
Certain static analysis faults may be located in such a manner that they are not related to a given method but are directly connected with the class declaration. While these faults do not directly play into this reliability model, they should be kept for metrics purposes.
F.16
The SOSART tool shall allow all file faults to be viewed in a listing separate from the individual method displays.
Under certain circumstances, it may be beneficial to visualize faults as a list instead of on the activity diagrams.
F.17
SOSART shall be capable of exporting data into XML format for importation into external programs.
XML is a standard language for interface between data systems.
F.18
The SOSART toolset shall contain a trace generator capable of logging branch execution traces in a manner which can be imported into the SOSART tool.
This is necessary in order to log execution traces for model usage.
1. Currently, the JLint tool does not include support for XML output. In order to interface properly with the SOSART tool, JLint will need to be improved to include XML output.
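As referenced in requirement F.2, the following sketch shows what a watchpoint-instrumented method might look like. This is a hypothetical illustration only: the Tracer class, its watchpoint() method, and the numbering scheme are invented for the example and are not the actual SOSART trace generator interface.

// Hypothetical illustration of requirement F.2: the entry to every code block
// and every return statement emits a watchpoint record that can later be
// imported as an execution profile.
public class TracedExample {

    static int max(int a, int b) {
        Tracer.watchpoint(1);        // entry to the method body block
        if (a > b) {
            Tracer.watchpoint(2);    // entry to the "then" block
            Tracer.watchpoint(3);    // return statement reached
            return a;
        }
        Tracer.watchpoint(4);        // return statement reached
        return b;
    }

    public static void main(String[] args) {
        System.out.println(max(3, 7));
    }
}

// Minimal stand-in for a trace logger; a real generator would persist the
// execution counts in a form the analysis tool can read back in.
class Tracer {
    static void watchpoint(int id) {
        System.out.println("WP " + id);
    }
}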
B.2 Development Process Requirements
Requirement Rationale
D.1
The SOSART tool, when practical, shall be developed using the Personal Software Process.
The PSP represents a well documented approach to quality software development for small, individual projects. This is essential to ensure a reliably delivered analysis tool. However, being that this is a research tool being developed using other researchers' code as modules, it may not be feasible to follow a strict PSP process during development.
D.1.1
Development data shall be collected using the Software Process Dashboard.
Reliably following the PSP is simplified by the usage of tools which automatically collect the requisite data. By doing this, individual mistakes can be reduced.
D.2
Design documentation for the SOSART tool shall be created using UML design format.
UML represents a standard approach for software design which is readily understood by practitioners.
D.3
Source code developed for the SOSART tool shall be verified for coding standards compliance using the Checkstyle tool.
Enforcement of coding standards during development has been shown to reduce the number of defects in a final delivered product. Checkstyle helps to prevent common Java programming mistakes as well as ensuring consistent style.
D.3.1
The SOSART tool shall compile without any compiler warnings in the hand-coded segments.
This ensures that any compiler detected problems have been removed. While it is desirable to remove compiler warnings from the automatically generated code, this may not be feasible given the limitations of automatic code generation.
D.4
SOSART source code, design documentation, and other materials shall be kept under version management at all times through development.
Appropriate software engineering practice.
D.4.1
SOSART shall use the CVS version management system for all configuration management practices.
CVS is readily available as an open source project and is well supported and extremely extensible.
D.5
SOSART shall be released through the SourceForge site.
SourceForge is a readily available distribution site which is commonly used for Open Source programs.
B.3 Implementation Requirements
Requirement Rationale
I.1
The SOSART tool shall be implemented using the Java programming language.
Java allows for easy development of a GUI, is portable, and is an appropriate language for research based tool development (see note 1).
I.2
Source code parsing shall be accomplished through the use of the ANTLR parser.
The ANTLR parser is a readily available tool which can be easily distributed. It is also well documented and has an extensive record of successful usage.
I.3
The SOSART tool shall not use any constructs which would limit the portability of the tool to a given environment.
It is important to develop a tool which is portable across multiple development platforms.
I.4
Java 1.4.2 shall be used for tool development.
Many higher end UNIX systems do not support Java versions newer than 1.4.2, and thus would be unable to run the tool if newer Java constructs are used.
1. This is in contrast to embedded systems development, in which Java is generally not an appropriate language for reasons explained in Schilling [Sch04a].