1 8 5 6
Center for Reliability Engineering
Integrating Software into PRA
B. Li, M. Li, A. Sinha, Y. Wei, C. Smidts
Presented by
Bin Li
Center for Reliability Engineering
University of Maryland, College Park
July 20, 2004
Research Objectives
• The objective of our research is to extend the current PRA (Probabilistic Risk Assessment) methodology to integrate software into the risk assessment process.
• Such an extension requires modeling the software, the computer platform on which it resides, and the interactions it has with other systems.
Framework
Framework for Integrating Software into PRA

PRA Question 1: Initiators
  Analysis step: Initiating Event Analysis
  Techniques: Checklists, PHA, FMEA, HAZOP, Master Logic Diagram
  Questions that this research should answer:
    1. What are the software-related failures in the system?
    2. How should the software-related failures be classified?
    3. At which level should we consider the software-related failures?
  Methodology: Software-related failure modes

PRA Question 2: Consequences
  Analysis step: Accident-Sequence Construction
  Techniques: Event Tree Analysis, Event Sequence Diagram, Petri Nets, Markov Chains
  Questions that this research should answer:
    1. Which methods should one use if software participates in the unfolding of the accident?
  Methodology: Modeling Approaches and Techniques

PRA Question 3: Probabilities
  Analysis step: Accident-Sequence Quantification
  Techniques: Fault Tree Analysis, Probabilistic Methods, Statistical Methods, Common Cause Analysis, Human Reliability Analysis
  Questions that this research should answer:
    1. Which quantification models or methods can be used to quantify software-related failures?
    2. Which kinds of data will be needed for these quantification models?
    3. If these data are not available, how can such cases be handled?
  Methodology: Quantification Models and Data

PRA Question 4: Uncertainty
  Analysis step: Uncertainty Analysis
  Techniques: Classical Statistics, Bayesian Statistics, Sensitivity Analysis
  Questions that this research should answer:
    1. What uncertainty analysis is necessary for this research?
    2. Which kinds of uncertainty analysis should be done?
  Methodology: Uncertainty Analysis Methodology
Software related failure mode taxonomy
[Figure: Input (Human, Software, Hardware) -> Software -> Output (Human, Software, Hardware). The diagram also shows the Computer and the Environment.]
Software related failure mode taxonomy
Software related failures
  Internal failures
    – Function failures
    – Attribute failures
    – Function set failures
  Interaction failures
    – Input failures
    – Output failures
    – Support failures
    – Multiple interaction failures
    – Environmental factors
Validation of the Failure Mode Taxonomy
• Validation Criteria:
  – Completeness
  – Consistency
  – Repeatability
  – Applicability
• Validation Process:
  – Phase 1: Training
  – Phase 2: JSC Classification
  – Phase 3: UMD Verification of the taxonomy and Consolidation with JSC
Completeness and Applicability
Failure Modes Added By JSC (failure mode: definition)
• Value-Initialization: Value at initialization is incorrect. Value input or output at initialization is incorrect. A function receives a bad value from a file at time = 0.
• Value-Logic: An incorrect value is used due to a problem with logic.
• Value-Additional Logic: Additional logic added to handle a value.
• Value-Display: Display value added, deleted, or modified. Incorrect value displayed in data field.
• Value-Hardware: Value changes due to hardware changes or modifications. Changing the NIC changes the MAC address and corrupts the license key.
• Inadequate Requirements: Requirements were incorrect. The developer followed the requirements to the letter and they were determined to be incorrect.
• Propagation of Failure: Failure upstream of the module causes an unstable state or failure of modules downstream.
• User error: User knowingly or unknowingly uses software incorrectly or outside of the intended design boundaries.
Repeatability and Consistency
The conflicts in two rounds
Rows: first-round classification. Columns: second-round classification.
Column failure modes, by category:
  Functional: Attribute, Function
  I/O: Amount, Range, Time, Type, Value, Rate
  Support: CPU, Peripheral, Resource
  Multiple Interaction: Communication
Row entries (nonempty cells of each row, left to right):
  Functional
    Attribute: 26 1 3 12
    Function: 64 14 1
  I/O
    Amount: 2 2 1 1
    Range: 4 2 7 1
    Time: 3 1 4 1
    Type: 1 5 1
    Value: 25 42 1 61 1
    Rate: 1 2
  Support
    CPU: 1 0
    Peripheral: 1 0
    Resource: 1 1 4
  Multiple Interaction
    Communication: 3
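Repeatability
Rows: first round. Columns: second round.

                  Failure mode 1   Failure mode 2   ...   Failure mode n   Total
Failure mode 1    P11              P12              ...   P1n              P1+
Failure mode 2    P21              P22              ...   P2n              P2+
...               ...              ...              ...   ...              ...
Failure mode n    Pn1              Pn2              ...   Pnn              Pn+
Total             P+1              P+2              ...   P+n

$P_{i+} = \sum_{j=1}^{n} P_{ij}$
$P_{+j} = \sum_{i=1}^{n} P_{ij}$
$P_o = \sum_{i=1}^{n} P_{ii}$
$P_e = \sum_{i=1}^{n} P_{i+} P_{+i}$
$R = \frac{P_o - P_e}{1 - P_e}$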
Repeatability (R) is measured by the repeatability coefficient (Cohen’s Kappa). Kappa values less than 0.45 indicate inadequate repeatability, values above 0.62 indicate good repeatability, and values above 0.78 indicate excellent repeatability.
R = 0.46
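As an illustration of how the repeatability coefficient on this slide can be computed, the following is a minimal sketch of Cohen's Kappa for a square table of classification counts (rows: first round, columns: second round). The function name `cohens_kappa` and the 2x2 example counts are hypothetical; they are not the study's data.

```python
def cohens_kappa(counts):
    """Cohen's Kappa for a square table of counts (rows: round 1, columns: round 2)."""
    n = len(counts)
    total = sum(sum(row) for row in counts)
    # Convert counts to proportions P_ij.
    p = [[c / total for c in row] for row in counts]
    # Row marginals P_i+ and column marginals P_+j.
    row_marg = [sum(p[i]) for i in range(n)]
    col_marg = [sum(p[i][j] for i in range(n)) for j in range(n)]
    # Observed agreement P_o and chance agreement P_e.
    p_o = sum(p[i][i] for i in range(n))
    p_e = sum(row_marg[i] * col_marg[i] for i in range(n))
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical example with two failure modes (not data from the study).
example_counts = [[20, 5],
                  [10, 15]]
kappa = cohens_kappa(example_counts)
```

In this toy table the observed agreement is 0.7 and the chance agreement is 0.5, so Kappa is 0.4.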
Results of the Validation of the Taxonomy
• The UMD and the JSC teams reached the following consensus:
– The taxonomy is complete and can be applied to aerospace systems of various natures;
– The taxonomy includes failure modes applicable to autonomous real time systems and mission critical systems;
– The taxonomy considers all the failure modes in software;
– There is sufficient data available for the validation and enough flexibility to use alternative data.
• Repeatability and Consistency are adequate.
Test-Based Approach - Procedure
• Identify events/components controlled by software in the MLD
• Identify events/components controlled by software in accident scenarios
• Specify the functions involved
• Modeling of the Software Component in ESDs/ETs and Fault Trees
• Quantification
Identify Software Controlled Events/Components in the MLD
[Figure: Master Logic Diagram with top event "Loss of Occupants" (AND gate). Nodes shown: "Internal Accident Disaster" (OR gate) with Fire, Gas, Chemical Explosion, and Poison Materials (anthrax); "Exit Fails" (AND gate) with Emergency Exit Fails and Normal Exit Path Fails; further OR gates with Gate Fails, Software Fails, and Temperature. The software failures are marked "Software Fails Initiating Events".]
Identify Software Controlled Events/Components in Accident Scenarios
[Figure: Event sequence diagram for the initiating event "Fire". Events shown: Fire Protection, Emergency Exit, Guard there, Guard action (T1 < Tcritical), the user inserts the card, the user inserts the PIN, PACS (1) B3, PACS (2) B1, PACS (3) B2, PACS (4) B2, LED1, Gate, Delay, and Delay of opening of gate (T2 < Tcritical, T3 < Tcritical). Each event has Yes/No branches. End states: Safe and Loss of Occupants. Sequences 1 through 15 are labeled.]
Specify the Functions Involved
• Identify software behavior from ESD/ET
  – Identify stimuli and results
• Identify software component from requirements specifications
  – Identify inputs and outputs
• Match stimuli/inputs and results/outputs
Modeling Software Component in ESDs/ETs and FTs
[Figure: ESD template for a software component. Events: Input, Delay of Execution, SW Execution. Decision nodes: "Does the support platform function normally?"; "Is the support platform fully nonfunctional?"; "Does this support failure lead to a safe condition?"; "Behavior specified in requirements is consistent, unique and the actual behavior is adequate per requirement"; "Does the required SW output match the input required by the next component?"; "Does the required output lead to a safe condition?"; "Does the erroneous behavior lead to a safe condition?". The Input, Delay of Execution, and SW Execution chain is repeated for the case "Support platform behaves in a degraded mode". End states: "Continue on safe branch" and "Unsafe State".]
Quantification
• Testing is used to obtain the probability that the software leads to an unsafe state.
• The process is as follows:
  – Define the test cases. These test cases cover both the normal input and the abnormal input. The testing strategy includes the identification of the normal input space and the abnormal input space. Test cases are randomly sampled from these spaces.
  – Build a Finite State Machine model of the software component to represent its behavior (the oracle). The operational profile derived from the input tree is also embedded into this FSM model.
  – Automate the testing using the test scripts generated from the FSM model.
  – Define and identify the software component’s safe and unsafe conditions within the context of each ESD sequence.
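The quantification idea on this slide can be sketched as a simple sampling loop: draw test inputs at random from the normal and abnormal input spaces, run the software, and count how often the outcome is unsafe. Everything below is a hypothetical toy, not the study's tooling: the function names, the 90/10 split between normal and abnormal inputs, and the stand-in component are assumptions made only for illustration.

```python
import random


def estimate_unsafe_probability(software, is_unsafe, sample_input, n_tests, seed=0):
    """Fraction of randomly sampled test cases whose outcome is unsafe."""
    rng = random.Random(seed)
    unsafe = 0
    for _ in range(n_tests):
        x = sample_input(rng)
        if is_unsafe(x, software(x)):
            unsafe += 1
    return unsafe / n_tests


def sample_input(rng):
    """Assumed operational profile: 90% normal inputs, 10% abnormal inputs."""
    if rng.random() < 0.9:
        return rng.randint(0, 9999)  # normal input space (e.g. a 4-digit code)
    return -1                        # abnormal input space (single bad value)


def toy_software(code):
    """Stand-in component: mishandles the abnormal input."""
    return "open" if code >= 0 else "stuck"


def is_unsafe(code, output):
    """Assumed safety condition for one sequence: the gate must open."""
    return output != "open"


p_unsafe = estimate_unsafe_probability(toy_software, is_unsafe, sample_input, 10_000)
```

With this toy profile the estimate should land close to 0.1, the assumed share of abnormal inputs. In the approach described on the slide, the expected behavior would come from the FSM oracle rather than from a hand-written check.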
Scalability
• The test-based approach can be used for large-scale systems because large finite state machines have been built and large systems can be tested by WinRunner.
• Scalability describes the relationship between the effort needed to use this method for large systems and the effort needed for the smaller systems which are part of the investigation.
• Contributors to the effort are:
  – The modeling effort (time to build the finite state machine)
  – The test case generation time (time to generate the test cases in TestMaster)
  – The test execution time (time to execute test cases in WinRunner)
Modeling Time
COCOMO II is used to calculate the time to construct the finite state machine model.
$PM = A \times (Size)^{E} \times 27\% \times 25\%$, with $A = 2.94$, $E_{min} = 0.91$, $E_{max} = 1.226$
Size1 = 70 FP: PMmax = 1 and PMmin = 0.63
Size2 = 700 FP: PMmax = 16.5 and PMmin = 5.4
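The modeling-effort formula on this slide can be evaluated with a few lines of code. This is a rough sketch under one loud assumption: the slide does not say how function points are converted to the COCOMO II size input, so the conversion of 53 source lines per function point used here is our guess, not a figure from the study. The names `modeling_effort_pm` and `SLOC_PER_FP` are likewise hypothetical.

```python
A = 2.94                        # COCOMO II constant from the slide
E_MIN, E_MAX = 0.91, 1.226      # exponent bounds from the slide
EFFORT_FRACTION = 0.27 * 0.25   # the 27% and 25% factors from the slide
SLOC_PER_FP = 53                # ASSUMPTION: not stated on the slide


def modeling_effort_pm(function_points, exponent, sloc_per_fp=SLOC_PER_FP):
    """Person-months to build the FSM model, PM = A * Size^E * 27% * 25%."""
    size_ksloc = function_points * sloc_per_fp / 1000.0  # size in thousands of lines
    return A * size_ksloc ** exponent * EFFORT_FRACTION


pm_min_small = modeling_effort_pm(70, E_MIN)
pm_max_small = modeling_effort_pm(70, E_MAX)
pm_min_large = modeling_effort_pm(700, E_MIN)
pm_max_large = modeling_effort_pm(700, E_MAX)
```

Under the assumed conversion, the four results fall close to the slide's values (0.63 and 1 for 70 FP, 5.4 and 16.5 for 700 FP), but they will not match exactly.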
Test Generation Time
• Test generation time in full coverage is a function of the size of the model.
• Empirical relations of the following form can be found:
  $time = A \times size + C$
  where $size = \text{number of states} + \text{number of models} + \text{number of transitions}$
• Empirical study shows:
  $time = 0.0018 \times (size) + 0.055$
Calculation of FSM Model Size
• Size of the model is a function of Function Points and the Operational Profile:
  $size = f(FP, OP)$
• Procedure for calculating the size
  – Determine the basic size from Function Point calculations for the system:
    $size = 0.0434 (FP)^2 + 3.6861 (FP)$
  – Determine the reliability requirement for the testing process.
  – Calculate the number of iterations required for the target reliability:
    $n = \frac{\log(\text{Threshold probability of failure})}{\log(\text{probability of the path with least frequency})}$
  – Calculate the size of the largest iterating sub-model.
  – Calculate the modified size:
    $size_{actual} = size + n \times size_{submodel}$
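The size procedure on this slide can be sketched as three small functions. Two readings are assumptions because the formulas are damaged in the slide text: the number of iterations is taken as the ratio of two logarithms (threshold failure probability over the probability of the least frequent path), and the actual size is taken as the basic size plus n times the sub-model size. Function names and the example inputs are hypothetical.

```python
import math


def basic_size(function_points):
    """Basic FSM model size from function points: 0.0434*FP^2 + 3.6861*FP."""
    return 0.0434 * function_points ** 2 + 3.6861 * function_points


def iterations_needed(threshold_failure_prob, least_frequent_path_prob):
    """Assumed reading: n = log(threshold) / log(least frequent path probability)."""
    return math.log(threshold_failure_prob) / math.log(least_frequent_path_prob)


def actual_size(function_points, threshold_failure_prob,
                least_frequent_path_prob, submodel_size):
    """Assumed reading: size_actual = size + n * size_submodel."""
    n = iterations_needed(threshold_failure_prob, least_frequent_path_prob)
    return basic_size(function_points) + n * submodel_size


# Hypothetical inputs: 70 FP, threshold 1e-4, rarest path probability 0.1,
# largest iterating sub-model of size 10.
size_70 = basic_size(70)
n_iter = iterations_needed(1e-4, 0.1)
size_total = actual_size(70, 1e-4, 0.1, 10)
```

For these inputs the basic size is about 470.7 and about four iterations are needed, giving a modified size of roughly 510.7.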
Test Execution Time
• Test execution time ($t_{exec}$) is a linear function of the number of inputs/outputs ($n_{i/o}$), the number of checkpoints ($m$), and the waiting time for responses ($T_s$):
  $t_{exec} = \alpha \times (n_{i/o}) + \beta \times m \times T_s$
• Empirical study shows:
  $t_{exec} = 0.0000703 \times (n_{i/o}) + 0.0016 \times m \times T_s$
Summary of Scalability Study
The results of the scalability study show that:
• Modeling time can be calculated by using COCOMO II;
• Test generation time in full coverage is a function of the size of the model;
• Test execution time is a linear function of the number of inputs/outputs, the number of checkpoints, and the waiting time for responses.
Ongoing and Future Research
• Continue the application to a large-scale system
  – The application we chose is CM1 from the Metrics Data Program
• Finalize the scalability study
• Continue the support failure modes study
• Continue the output failure modes study
• Conduct the fault propagation study