1
Software Fault Tolerance (SWFT): SWIFI in OSs
Dependable Embedded Systems & SW Group www.deeds.informatik.tu-darmstadt.de
Prof. Neeraj Suri
Constantin Sârbu
Dept. of Computer Science, TU Darmstadt, Germany
2
Fault Removal: Software Testing
So far: Verification & Validation; testing techniques: static vs. dynamic, black-box vs. white-box
Last time: testing of dependable systems; modeling; fault injection (FI / SWIFI); some existing tools for fault injection
Today: testing (SWIFI) of operating systems
WHERE: error propagation in OSs [Johansson'05]
WHAT: error selection for testing [Johansson'07]
WHEN: injection trigger selection [Johansson'07]
Next (last before mid-term exam!): profiling the OS extensions (state change @ runtime)
3
Reminder: SWIFI
General SW
Manipulate bits in memory locations, registers, buses, etc.
• Emulation of HW faults
Change text segment of processes
• Emulation of SW faults (bugs, defects)
  Dynamic: e.g., op-code switch during operation
  Static: change source code and recompile (a.k.a. mutation)
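As an illustration of the first technique (bit manipulation in a memory location), a minimal sketch in Python; the `bytearray` is a stand-in for real process or kernel memory that an actual injector would patch:

```python
def flip_bit(memory: bytearray, byte_offset: int, bit: int) -> None:
    """Emulate a transient HW fault by flipping one bit in a memory image."""
    memory[byte_offset] ^= 1 << bit

# Example: corrupt bit 3 of byte 1 in a tiny 4-byte "memory region".
mem = bytearray([0x00, 0x10, 0xFF, 0x7A])
flip_bit(mem, 1, 3)
assert mem[1] == 0x18  # 0x10 with bit 3 set
assert mem[0] == 0x00  # other bytes untouched
```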
What is different in OSs?
The OS acts as a mediator between HW and user SW (applications)
Kernel mode – low accessibility
A failure of the OS often means failure of the whole system
Source code often not available
Add-on kernel extensions written by parties other than the OS producer -> lack of experience
Etc.
4
OS Robustness Testing Efforts at DEEDS
Our research topics presented today:
Error propagation profiling
• How errors propagate through the OS to the user space
• “Error Propagation Profiling of Operating Systems” (DSN’05)
Error selection
• How an OS reacts to various types of injected errors
• “On the Selection of Error Model(s) for OS Robustness Evaluation” (DSN’07)
Error trigger
• How to choose the injection instant?
• “On the Impact of Injection Triggers for OS Robustness Evaluation” (ISSRE’07)
Slides are the ones presented at each conference! http://www.deeds.informatik.tu-darmstadt.de/aja/
5
Error Propagation Profiling of Operating Systems
Andréas Johansson & Neeraj Suri
Department of Computer Science, Technische Universität Darmstadt, Germany
Presented at DSN 2005
6
Paper Objectives
Investigate experimental error propagation profiling of OS interfaces/services
Quantitative, with metrics!
Dynamism & operational profiles
Black box, with no internal access
Motivation
[Figure: the Operating System mediates between HW/drivers, libraries, and applications]
7
[Figure: components A-F (error sources / propagation hot spots) ordered along an "increasingly bad" severity scale]
Profiling
8
Profiling
Experimental technique to ascertain “vulnerabilities”
Identify (potential) sources, error propagation & hot spots, etc.
Estimate their “effects” on applications
Component enhancement with “wrappers”
• if (X > 100 && Y < 30) then Exception();
• Location of wrappers
Aspects: metrics for error propagation profiles; experimental analysis
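The wrapper idea can be sketched as follows (Python stand-in; `component` and the guard mirroring the slide's `if (X > 100 && Y < 30)` condition are illustrative, not from a real system):

```python
def wrapped(component_call):
    """Wrapper that screens arguments before forwarding to the component."""
    def guard(x, y):
        # Precondition check, mirroring the slide's example condition.
        if x > 100 and y < 30:
            raise ValueError("precondition violated: x > 100 and y < 30")
        return component_call(x, y)
    return guard

@wrapped
def component(x, y):
    # Hypothetical component behavior.
    return x + y

assert component(50, 50) == 100   # normal call passes through
try:
    component(101, 29)            # out-of-range input is caught by the wrapper
    caught = False
except ValueError:
    caught = True
assert caught
```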
9
System Model
Applications
Operating System
Drivers
?
10
Device Driver
Model the interfaces (defined in C) Export (functions provided by the driver) Import (functions used by the driver)
[Figure: Driver X exports functions ds_{x.1} … ds_{x.m} to the OS and imports os_{x.1} … os_{x.n} from it; the hardware sits below the driver]
11
Error Model
Data level errors in OS-Driver interface Wrong values Based on the C-type
• Boundary• Special values• Offsets
Transient; injected at first occurrence
12
Metrics
Three metrics for profiling:
1. Propagation: how errors flow through the OS
2. Exposure: which OS services are affected
3. Diffusion: which drivers are the sources
Impact analysis
– Metrics– Case study (WinCE)– Results
13
Service Error Permeability
1. Service Error Permeability: measures one driver’s influence on one OS service; used to study service-driver relations.
For driver D_x, OS service s_i, imported OS function os_{x.z}, and exported driver function ds_{x.y}:
P^{OS}_{x.i,z} = Pr(error in s_i | error in os_{x.z})
P^{DS}_{x.i,y} = Pr(error in s_i | error in ds_{x.y})
14
OS Service Error Exposure
2. OS Service Error Exposure: an application uses certain services. How are these services influenced by driver errors? Used to compare services.
E_{s_i} = Σ_{∀ D_x} [ Σ_{∀ os_{x.j}} P^{OS}_{x.i,j} + Σ_{∀ ds_{x.j}} P^{DS}_{x.i,j} ]
15
Driver Error Diffusion
3. Driver Error Diffusion: which driver affects the system the most? Used to compare drivers.
D_{D_x} = Σ_{∀ s_i} [ Σ_{∀ os_{x.j}} P^{OS}_{x.i,j} + Σ_{∀ ds_{x.j}} P^{DS}_{x.i,j} ]
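A toy computation of the exposure and diffusion metrics from permeability estimates, sketched in Python; the driver/service names and probability values are hypothetical, and each inner list holds per-interface-function permeabilities:

```python
# Hypothetical permeability estimates:
# P[driver][service][j] = Pr(error in service s_i | error injected in
# interface function j of driver D_x), pooling imported and exported functions.
P = {
    "serial":   {"s1": [0.1, 0.0], "s2": [0.0, 0.2]},
    "ethernet": {"s1": [0.3, 0.1], "s2": [0.0, 0.0]},
}

def exposure(service):
    """OS Service Error Exposure: sum permeabilities over all drivers and
    interface functions for one service."""
    return sum(sum(perm[service]) for perm in P.values())

def diffusion(driver):
    """Driver Error Diffusion: sum permeabilities over all services and
    interface functions for one driver."""
    return sum(sum(ps) for ps in P[driver].values())

assert abs(exposure("s1") - 0.5) < 1e-9    # 0.1 + 0.0 + 0.3 + 0.1
assert abs(diffusion("serial") - 0.3) < 1e-9
```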
16
Impact Analysis
Impact ascertained via failure mode analysis
Failure classes:
Class NF: no visible effect
Class 1: error, no violation
Class 2: error, violation
Class 3: OS crash/hang
17
Case Study: Windows CE
Targeted drivers Serial Ethernet
FI at interface Data level errors
Effects on OS services 4 Test applications
[Figure: test setup – a test app runs on the OS above the drivers; an Interceptor wraps the target driver; a Manager on the host controls the experiments]
18
Error Model
Error C-Type #cases
Integers
int 7
unsigned int 5
long 7
unsigned long 5
short 7
unsigned short 5
LARGE_INTEGER 7
Void * void 3
Char’s
char 7
unsigned char 5
wchar_t 5
Boolean bool 1
Enums multiple #ident’s
Structs multiple 1
Case # New value
1 previous – 1
2 previous +1
3 1
4 0
5 -1
6 INT_MIN
7 INT_MAX
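The case table above can be generated mechanically; a minimal Python sketch for the signed 32-bit `int` type (the function name is illustrative):

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1  # 32-bit signed int bounds

def dt_cases_int(previous: int):
    """Data-type injection cases for a signed 32-bit int, as in the table:
    offsets from the previous value, special values, and type boundaries."""
    return [previous - 1, previous + 1, 1, 0, -1, INT_MIN, INT_MAX]

cases = dt_cases_int(100)
assert cases == [99, 101, 1, 0, -1, -2147483648, 2147483647]
```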
LONG RegQueryValueEx([in] HKEY hKey,
[in] LPCWSTR lpValueName,
[in] LPDWORD lpReserved,
[out] LPDWORD lpType,
[out] LPBYTE lpData,
[in/out] LPDWORD lpcbData);
19
Service Error Permeability
Ethernet driver 42 imported svcs 12 exported svcs
Most failures Class 1; 3 crashes (Class 3)
20
OS Service Error Exposure
Serial driver 50 imported svcs 10 exported svcs
Clustering of failures
21
Driver Error Diffusion Higher diffusion for Ethernet Most Class NF Failures at boot-up
                Ethernet    Serial
#Experiments    414         411
#Injections     228         187
#Class NF       330 (80%)   377 (92%)
#Class 1        80 (19%)    25 (7%)
#Class 2        1           9
#Class 3        3           0
DC^1            0.616       0.460
DC^2            0.002       0.022
DC^3            0.007       0
(DC^k: driver error diffusion computed over Class k failures only)
22
On the Selection of Error Model(s) for OS Robustness Evaluation
Andréas Johansson, Neeraj Suri, TU Darmstadt, Germany
Brendan Murphy, Microsoft Research, Cambridge, UK
Presented at DSN 2007
23
Objectives: “What to Inject?”
FI’s effectiveness depends on the chosen error model being (a) representative of actual errors, and (b) effective at triggering “vulnerabilities”.
Comparative evaluation of “effectiveness” of different error models: Fewest injections? Most failures? Best “coverage”?
Propose a composite error model for enhancing FI effectiveness
24
Error Models Focus
Target errors arising in device drivers Main source of OS failures [1, 2] Developed by HW vendors Continually evolving
Considered error models Data-type Bit-flips Fuzzing
[1] Ganapathi et al., LISA’06
[2] Chou et al., SOSP’01
25
System Model
Applications
Operating System
Drivers
OS-App services
OS-Driver services
26
Injection Methodology
[Figure: the Interceptor sits between the Operating System and the Device Driver]
Intercepts function calls between OS and driver
Driver binary modified to use the Interceptor
OS reconfigured to use the Interceptor
Implemented for Windows CE .Net
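A minimal Python sketch of the interceptor idea; the driver class, function name, and corruption function are hypothetical stand-ins for the binary-level mechanism used on Windows CE:

```python
class Interceptor:
    """Toy stand-in for the interceptor: forwards calls from the 'OS' to the
    'driver', corrupting one parameter of a targeted function on its first
    invocation (first-occurrence trigger)."""

    def __init__(self, driver, target, param_index, corrupt):
        self.driver = driver
        self.target = target
        self.param_index = param_index
        self.corrupt = corrupt
        self.fired = False

    def call(self, name, *args):
        args = list(args)
        if name == self.target and not self.fired:
            args[self.param_index] = self.corrupt(args[self.param_index])
            self.fired = True
        return getattr(self.driver, name)(*args)


class FakeDriver:
    """Hypothetical driver exporting a single 'read' function."""
    def read(self, length):
        return b"x" * max(length, 0)


ic = Interceptor(FakeDriver(), "read", 0, lambda v: v ^ (1 << 15))
assert len(ic.call("read", 4)) == 4 | (1 << 15)  # corrupted on first call
assert len(ic.call("read", 4)) == 4              # later calls pass through
```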
27
Chosen Drivers & Error Models
Error Models: Data-type (DT) Bit-flips (BF) Fuzzing (FZ)
Driver          Description     #Injection cases
                                DT     BF     FZ
cerfio_serial Serial port 397 2362 1410
91C111 Ethernet 255 1722 1050
atadisk CompactFlash 294 1658 1035
28
Error Models – Data-Type (DT) Errors
int foo(int a, int b) {…}
int ret = foo(0x45a209f1, 0x00000000);
29
Error Models – Data-Type (DT) Errors
Case New Value
1 Previous – 1
2 Previous +1
3 1
4 0
5 -1
6 INT_MIN
7 INT_MAX
0x80000000
int foo(int a, int b) {…}
int ret = foo(0x45a209f1, 0x00000000);
30
Error Models – Data-Type (DT) Errors
Varied #cases depending on the data type Requires tracking of the types for correct injection Complex implementation but scales well
int foo(int a, int b) {…}
int ret = foo(0x80000000, 0x00000000);
31
Error Models – Data-Type (DT) Errors
Data type C-Type #Cases
Integers
int 7
unsigned int 5
long 7
unsigned long 5
short 7
unsigned short 5
LARGE_INTEGER 7
Misc.
* void 3
HKEY 6
struct {…} multiple
Strings 4
Characters
char 7
unsigned char 5
wchar_t 5
Boolean bool 1
Enums multiple cases
32
Error Models – Bit-Flip (BF) Errors
int foo(int a, int b) {…}
int ret = foo(0x45a209f1, 0x00000000);
33
Error Models – Bit-Flip (BF) Errors
int foo(int a, int b) {…}
int ret = foo(0x45a209f1, 0x00000000);
01000101101000100000100111110001  (0x45a209f1 in binary)
34
Error Models – Bit-Flip (BF) Errors
int foo(int a, int b) {…}
int ret = foo(0x45a209f1, 0x00000000);
01000101101000101000100111110001  (bit 15 flipped)
01000101101000100000100111110001  (original 0x45a209f1)
35
Error Models – Bit-Flip (BF) Errors
int foo(int a, int b) {…}
int ret = foo(0x45a289f1, 0x00000000);
Typically 32 cases per parameter Easy to implement
01000101101000101000100111110001  (bit 15 flipped -> 0x45a289f1)
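The BF case generator can be sketched in Python; 0x45a209f1 is the slides' example parameter value:

```python
def bit_flip_cases(value: int, width: int = 32):
    """All single-bit-flip corruptions of a parameter value (BF model):
    one case per bit position, typically 32 per parameter."""
    return [value ^ (1 << bit) for bit in range(width)]

cases = bit_flip_cases(0x45A209F1)
assert len(cases) == 32
assert 0x45A289F1 in cases  # bit 15 flipped, as in the example above
```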
36
Error Models – Fuzzing (FZ) Errors
int foo(int a, int b) {…}
int ret = foo(0x45a209f1, 0x00000000);
37
Error Models – Fuzzing (FZ) Errors
int foo(int a, int b) {…}
int ret = foo(0x45a209f1, 0x00000000);
0x17af34c2
38
Error Models – Fuzzing (FZ) Errors
int foo(int a, int b) {…}
int ret = foo(0x17af34c2, 0x00000000);
Selective #cases Simple implementation
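The FZ case generator can be sketched in Python; seeding the RNG keeps experiment runs repeatable (the case count here is illustrative):

```python
import random

def fuzz_cases(n: int, seed: int = 0, width: int = 32):
    """FZ model: n random parameter values drawn over the full value range.
    A fixed seed makes the injection campaign reproducible."""
    rng = random.Random(seed)
    return [rng.getrandbits(width) for _ in range(n)]

cases = fuzz_cases(10)
assert len(cases) == 10
assert all(0 <= c < 2**32 for c in cases)
```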
39
Comparison
Compare Error Models on:
Number of failures Effectiveness Experimentation Time Identifying services
Error propagation
40
Failure Classes & Driver Diffusion
Failure Class   Description
No Failure      No observable effect
Class 1         Error propagated, but still satisfied the OS service specification
Class 2         Error propagated and violated the service specification
Class 3         The OS hung or crashed
41
Failure Classes & Driver Diffusion
Driver Diffusion [3]: a measure of a driver’s ability to spread errors:
D_{D_x} = Σ_{∀ s_i} Σ_{∀ ds_{x.y}} P^{DS}_{x.i,y}
[3] Johansson, Suri, DSN’05
42
Number of Failures (Class 3)
[Bar chart: #Class 3 failures (0-80) per error model (DT, BF, FZ) for cerfio_serial, 91C111, and atadisk]
43
Failure Classes & Driver Diffusion
Drivers DT BF FZ
cerfio_serial 1.50 1.05 1.56
91C111 0.73 0.98 0.69
atadisk 0.63 1.86 0.29
Driver Diffusion (Class 3)
[Stacked bar chart: share of No failure / Class 1 / Class 2 / Class 3 outcomes (0-100%) for DT, BF, FZ on each driver]
44
Experimentation Time
Driver          Error Model     Exec. time (h min)
cerfio_serial
DT 5 15
BF 38 14
FZ 20 44
91C111
DT 1 56
BF 17 20
FZ 7 48
atadisk
DT 2 56
BF 20 51
FZ 11 55
45
Identifying Services (Class 3)
Which OS services can cause Class 3 failures?
Which error model identifies most services (coverage)?
Is some model consistently better/worse?
Can we combine models?
Service DT BF FZ
1 X
2 X X
3 X
4 X X
5 X
6 X X
7 X X
8 X X
9 X X X
10 X X X
11 X X X
12 X
13 X
14 X X X
15 X
16 X X X
17 X
18 X
46
Identifying Services (Class 3 + 2)
Which OS services can cause Class 3 failures?
Which error model identifies most services (coverage)?
Is some model consistently better/worse?
Can we combine models?
Service DT BF FZ
1 O X O
2 X X O
3 X O
4 X X
5 X
6 X X
7 X X O
8 X X
9 X X X
10 X X X
11 X X X
12 O X
13 X
14 X X X
15 X
16 X X X
17 X
18 X
47
Bit-Flips: Sensitivity to Bit Position?
[Chart: #services identified (0-10) vs. flipped bit position (0 = LSB … 30, toward MSB)]
48
Bit-Flips: Bit Position Profile
[Chart: cumulative #services identified (0-18) vs. bit position (0-30)]
49
Fuzzing – Number of injections?
[Chart: driver diffusion (0.2-2.0) vs. number of fuzzing injections per parameter (1-15) for cerfio_serial, 91C111, and atadisk]
50
Composite Error Model
Let’s take the best of bit-flips and fuzzing Bit-flips: bit 0-9 and 31 Fuzzing: 10 cases
~50% fewer injections Identifies the same service set
[Bar chart: #injections per driver, all BF & FZ vs. the composite model]
51
Composite Error Model – Results
[Stacked bar chart: failure class distribution (0-100%) for DT, BF, FZ, and CM on each driver]
52
Summary
Comparison across three well-established error models + the composite model (CM): Data-type, Bit-flips, Fuzzing
DT: implementation requires tracking data types; requires few experiments
BF: found the most Class 3 failures; requires many experiments
FZ: finds additional services
CM: profiling gives combined BF & FZ with high coverage
Outlook: When to do the injection? More drivers, OS’s, models?
58
On the Impact of Injection Triggers for OS Robustness Evaluation
Andréas Johansson, Neeraj Suri
Department of Computer Science, Technische Universität Darmstadt, Germany
DEEDS: Dependable Embedded Systems & SW Group www.deeds.informatik.tu-darmstadt.de
Brendan Murphy
Microsoft Research, Cambridge, UK
Presented at ISSRE 2007
59
Operating System Robustness
The operating system is a key operational element, used in virtually all environments -> robustness matters!
Drivers are a major source of failures [1][2]
[1] Ganapathi et al., LISA’06
[2] Chou et al., SOSP’01
60
Operating System Robustness
External faults Robustness Drivers Interfaces
Experimental Fault injection Run-time
Interface OS-Driver No source code
Goal Identify services with robustness
issues Identify drivers spreading errors
Applications
Drivers
OS
61
Operating System Robustness
The issues behind FI-based OS robustness evaluation: Where to inject? [3] What to inject? [4] When to inject? [today]
Outline: problem definition; call strings and call blocks; system and error model; experimental setup and method; results
[3] Johansson et al., DSN’05
[4] Johansson et al., DSN’07
62
Fault Injection
Target: the OS-driver interface; each call is a potential injection
Problem: too many calls
First-occurrence; sample (uniform?)
Service invocations
63
Fault Injection
Observation: calls are not made randomly; repeating sequences of calls
Idea: select calls based on “operations”; identify subsequences, select services
64
Call Strings & Call Blocks
Call string: list of tokens (invocations) to a specific driver
Call block: subsequence of a call string; may be repeating; corresponds to a higher-level “operation”; used as trigger for injection
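Call-block extraction can be viewed as finding immediately repeating subsequences in the call string. This toy Python version (not the paper's algorithm) collapses tandem repeats in a token string:

```python
def call_blocks(call_string):
    """Collapse immediately repeating subsequences ('call blocks') in a call
    string, returning (block, repeat_count) pairs. Greedy: at each position,
    pick the repetition that covers the most tokens."""
    tokens = list(call_string)
    out, i = [], 0
    while i < len(tokens):
        best_len, best_rep = 1, 1
        for blk in range(1, (len(tokens) - i) // 2 + 1):
            block = tokens[i:i + blk]
            rep = 1
            while tokens[i + rep * blk:i + (rep + 1) * blk] == block:
                rep += 1
            if rep > 1 and rep * blk > best_len * best_rep:
                best_len, best_rep = blk, rep
        out.append(("".join(tokens[i:i + best_len]), best_rep))
        i += best_len * best_rep
    return out

# Toy call string: init token, a repeating working-phase block, clean-up token.
assert call_blocks("DABABAB7") == [("D", 1), ("AB", 3), ("7", 1)]
```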
65
System and Error Model
Error model: bit-flips; shown to be effective, simple to implement
Injection: function parameter values
66
Experimental Process
Execute workload; record call string
Extract call blocks; select service targets (1 per call block)
Define triggers based on tracking call blocks
Perform injections
67
Injection Setup
Target OS: Windows CE .Net
Target HW: XScale 255
68
Failure Classes
Failure Class   Description
No Failure      No observable effect
Class 1         Error propagated, but still satisfied the OS service specification
Class 2         Error propagated and violated the service specification
Class 3         The OS hung or crashed
69
Selected DriversSelected Drivers
Serial port driverSerial port driver Ethernet card driverEthernet card driver
Workload/driver phases:Workload/driver phases:
70
Serial Driver Call String and Call Blocks
Call string: D02775(747){23}732775(747){23}23
Phases: Init | Working | Clean up
71
Ethernet Driver Call String and Call Blocks
72
Driver Profiles
Driver invocation patterns differ
Impact on call-block injection efficiency
[Figures: invocation profiles of the Serial and Ethernet drivers]
74
Serial Driver Results
75
Serial Driver Service Identification
FO δ α β1 γ1 ω1 β2 γ2 ω2
CreateThread x x x
DisableThreadLibraryCalls
x x
EventModify x x
FreeLibrary x x
HalTranslateBusAddress x
InitializeCriticalSection x
InterlockedDecrement x
LoadLibrary x x
LocalAlloc x x
memcpy x x x
memset x x x
SetProcPermissions x x x
TransBusAddrToStatic x
76
Ethernet Driver Results
Trigger        Serial               Ethernet
               #Injections   #C3    #Injections   #C3
First Occ.     2436          8      1820          12
Call Blocks    8408          13     2356          12
77
Summary
Where, What & When? A new timing model for interface fault injection
Faults in device drivers; based on call strings & call blocks
Results: significant difference; more services identified; driver dependent (driver profiling); more injections (2436 vs. 8408); focus on init/clean-up?
78
Discussion & OutlookDiscussion & Outlook
Call block identificationCall block identification Scalability? New data structures (suffix trees)
Call block selectionCall block selection Working phase vs. initial/clean up
Determinism & concurrencyDeterminism & concurrency Workload selectionWorkload selection
Error modelsError models