Regression Analysis
Unscheduled Maintenance Issue:
36 flight squadrons
Each experiences unscheduled maintenance actions (UMAs)
Each UMA costs $1000 to repair, on average.
You’ve got the Data… Now What?
Sq Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
101 36 53 51 61 63 54 50 65 62 51 68 45
104 60 42 56 63 39 65 63 67 66 52 59 60
108 53 61 59 87 61 46 52 85 84 75 78 68
Unscheduled Maintenance Actions(UMAs)
What do you want to know?
How many UMAs will there be next month?
What is the average number of UMAs?
Sample Mean
\[ \bar{x} = \frac{\sum x_i}{n} = 60 \]
Sample Standard Deviation
\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = 12.05 \]
UMA Sample Statistics
Statistic                 UMAs
Mean                      60
Standard Error of Mean    2.01
Median                    60.5
Mode                      61
Standard Deviation        12.05
Minimum                   36
Maximum                   87
Count                     36
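The sample statistics can be checked directly from the UMA table. The sketch below assumes the 36 observations are the printed values for squadrons 101, 104 and 108 over twelve months; the variable names are illustrative only.

```python
import math
import statistics

# UMA counts copied from the table (3 squadrons x 12 months = 36 observations).
umas = [
    36, 53, 51, 61, 63, 54, 50, 65, 62, 51, 68, 45,  # Sq 101
    60, 42, 56, 63, 39, 65, 63, 67, 66, 52, 59, 60,  # Sq 104
    53, 61, 59, 87, 61, 46, 52, 85, 84, 75, 78, 68,  # Sq 108
]

n = len(umas)
mean = statistics.mean(umas)       # sample mean
stdev = statistics.stdev(umas)     # sample standard deviation (n - 1 divisor)
std_error = stdev / math.sqrt(n)   # standard error of the mean

print(f"n = {n}, mean = {mean}, s = {stdev:.2f}, standard error = {std_error:.2f}")
```

The rounded results should agree with the Mean, Standard Deviation and Standard Error of Mean rows above.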
UMAs Next Month
95% Confidence Interval
\[ x = 60 \pm 2(12) \]
\[ 36 \le x \le 84 \]
Average UMAs
95% Confidence Interval
\[ \mu = 60 \pm 2\left(\frac{12}{\sqrt{36}}\right) \]
\[ 56 \le \mu \le 64 \]
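The two intervals differ only in the spread they use: the standard deviation for a single month, the standard error for the average. A minimal sketch, assuming the rounded values 60, 12 and 36 and the "2 standard units is roughly 95%" rule used on these slides:

```python
import math

mean, stdev, n = 60, 12, 36   # rounded sample statistics from the slides
z = 2                         # roughly 95% coverage under a normal curve

# Interval for one month's UMA count: mean +/- z * s
next_month = (mean - z * stdev, mean + z * stdev)

# Interval for the average UMA count: mean +/- z * s / sqrt(n)
std_error = stdev / math.sqrt(n)
average = (mean - z * std_error, mean + z * std_error)

print(next_month)  # (36, 84)
print(average)     # (56.0, 64.0)
```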
Model: Cost of UMAs for one squadron
If the cost per UMA = $1000, the
Expected cost for one squadron = $60,000
Model: Total Cost of UMAs
Expected Cost for all squadrons
= 60 * $1000 * 36 = $2,160,000
How confident are we about this estimate?
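[Figure: standard normal curve from -3 to 3 standard units. On each side of the mean the area is .3413 within 1 unit, .1359 between 1 and 2 units, and .0215 between 2 and 3 units; about 95% of the area lies within 2 units of the mean.]
mean = 60
standard error = 12/√36 = 2
[Figure: the same curve rescaled to UMAs, with ~56, ~58, 60, ~62, ~64 marked from -2 to 2 standard units (1 standard unit = 2); about 95% of the area lies between ~56 and ~64.]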
95% Confidence Interval on our estimate of UMAs and costs
60 ± 2(2) = [56, 64]
low cost: 56 * $1000 * 36 = $2,016,000
high cost: 64 * $1000 * 36 = $2,304,000
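The cost model is a single multiplication, so the low, expected and high totals can be reproduced in a few lines. The function name below is a hypothetical helper, not part of the original material.

```python
COST_PER_UMA = 1000   # dollars per UMA, from the slides
SQUADRONS = 36        # number of squadrons, from the slides

def total_cost(umas_per_squadron):
    """Total cost across all squadrons for a given UMA count per squadron."""
    return umas_per_squadron * COST_PER_UMA * SQUADRONS

# Lower bound, point estimate and upper bound of the 95% interval on average UMAs.
low, expected, high = (total_cost(u) for u in (56, 60, 64))
print(low, expected, high)  # 2016000 2160000 2304000
```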
What do you want to know?
How many UMAs will there be next month?
What is the average number of UMAs?
Is there a relationship between UMAs and some other variable that may be used to predict UMAs?
What is that relationship?
Relationships
What might be related to UMAs?
Pilot experience?
Flight hours?
Sorties flown?
Mean time to failure (for specific parts)?
Number of landings / takeoffs?
Regression:
To estimate the expected or mean value of UMAs for next month:
look for a linear relationship between UMAs and a “predictive” variable
If a linear relationship exists, use regression analysis
Regression analysis:
describes and evaluates the relationship between one variable (the dependent or explained variable) and one or more other variables (the independent or explanatory variables).
What is a good estimating variable for UMAs?
quantifiable
predictable
logical relationship with dependent variable
must be a linear relationship: Y = a + bX
Sorties
Sq Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
101 100 120 114 132 146 124 110 138 140 114 157 106
104 130 106 124 140 100 146 142 141 148 118 128 130
108 122 134 126 190 136 110 120 196 184 154 172 157
Pilot Experience
Sq Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
101 6.06 2.81 3.37 3.87 4.22 6.67 2.61 1.96 2.96 2.45 3.29 3.73
104 4.61 2.45 4.65 5.71 7.23 3.01 2.53 1.54 4.49 1.73 4.81 5.17
108 1.11 5.75 4.9 3.59 6.88 1.17 2.59 5.87 7.28 7.79 5.87 2.47
Sample Statistics
Statistic                 Sorties   Exp
Mean                      135       4.06
Standard Error of Mean    3.99      0.31
Median                    131       3.80
Mode                      100       #N/A
Standard Deviation        23.92     1.84
Minimum                   100       1.11
Maximum                   196       7.79
Count                     36        36
Describing the Relationship
Is there a relationship?
Do the two variables (UMAs and sorties or experience) move together?
Do they move in the same direction or in opposite directions?
How strong is the relationship?
How closely do they move together?
[Figures: six example scatter plots of Y against X, titled "Positive Relationship", "Strong Positive Relationship", "Negative Relationship", "Strong Negative Relationship", "No Relationship" and "Relationship?"]
Correlation Coefficient
Statistical measure of how closely two variables are moving together in a coordinated fashion
Measures strength and direction
Value ranges from -1.0 to +1.0
+1.0 indicates “perfect” positive linear relation
-1.0 indicates “perfect” negative linear relation
0 indicates no linear relation between the two variables
Correlation Coefficient
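\[ r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left[n\sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n\sum y_i^2 - \left(\sum y_i\right)^2\right]}} \]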
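The computational formula for r translates almost line for line into code. The sketch below applies it to the sorties and UMA tables as printed; the function name is illustrative. The result should come out close to the r reported for sorties and UMAs, though an exact match is not guaranteed if the printed tables were rounded or transcribed.

```python
import math

def correlation(xs, ys):
    """Pearson correlation via the computational (sums-based) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Values copied from the tables: squadrons 101, 104, 108, January to December.
sorties = [
    100, 120, 114, 132, 146, 124, 110, 138, 140, 114, 157, 106,
    130, 106, 124, 140, 100, 146, 142, 141, 148, 118, 128, 130,
    122, 134, 126, 190, 136, 110, 120, 196, 184, 154, 172, 157,
]
umas = [
    36, 53, 51, 61, 63, 54, 50, 65, 62, 51, 68, 45,
    60, 42, 56, 63, 39, 65, 63, 67, 66, 52, 59, 60,
    53, 61, 59, 87, 61, 46, 52, 85, 84, 75, 78, 68,
]

r = correlation(sorties, umas)
print(f"r(sorties, UMAs) = {r:.3f}")
```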
Sorties vs. UMAs
[Figure: scatter plot of UMAs (vertical axis, 0 to 90) against sorties (horizontal axis, 0 to 200)]
r = .9788
Experience vs. UMAs
[Figure: scatter plot of UMAs (vertical axis, 0 to 90) against pilot experience (horizontal axis, 0.00 to 10.00)]
r = .1896
Correlation Matrix
Correlation   UMAs        Sorties    Exp
UMAs          1
Sorties       0.9787613   1
Exp           0.1895905   0.198641   1
A Word of Caution...
Correlation does NOT imply causation
It simply measures the coordinated movement of two variables
Variation in two variables may be due to a third common variable
The observed relationship may be due to chance alone
What is the Relationship?
In order to use the correlation information to help describe the relationship between two variables we need a model
The simplest one is a linear model:
Y = a + bX
Fitting a Line to the Data
[Figure: scatter plot of Y (0 to 10) against X (0 to 14)]
One Possibility
[Figure: the same scatter plot with one candidate line drawn through it]
Sum of errors = 0
Another Possibility
[Figure: the same scatter plot with a different candidate line]
Sum of errors = 0
Which is Better?
Both have sum of errors = 0
Compare sum of absolute errors:
Y     Y1    Error   Abs err     Y2    Error   Abs err
8     6     2       2           2     6       6
1     5     -4      4           5     -4      4
6     4     2       2           8     -2      2
4     5.5   -1.5    1.5         3.5   0.5     0.5
6     4.5   1.5     1.5         6.5   -0.5    0.5
Sum         0       11                0       13
Fitting a Line to the Data
[Figure: scatter plot of Y (0 to 10) against X (0 to 12)]
One Possibility
[Figure: the same scatter plot with one candidate line drawn through it]
Sum of absolute errors = 6
Another Possibility
[Figure: the same scatter plot with a different candidate line]
Sum of absolute errors = 6
Which is Better?
The sums of absolute errors are equal
Compare sum of errors squared:
Y     Y1    Abs err   Sq err     Y2    Abs err   Sq err
4     4     0         0          5.6   1.6       2.56
7     3     4         16         3.8   3.2       10.24
2     2     0         0          2     0         0
5     3.5   1.5       2.25       4.7   0.3       0.09
2     2.5   0.5       0.25       2.9   0.9       0.81
Sum         6         18.5             6         13.7
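The comparison of the two candidate lines can be reproduced by scoring each against the observed Y values. This is a small sketch using the five points from the table; the function and variable names are illustrative.

```python
def error_summary(actual, fitted):
    """Sum of absolute errors and sum of squared errors for one candidate line."""
    errors = [y - f for y, f in zip(actual, fitted)]
    return {
        "abs": sum(abs(e) for e in errors),
        "sq": sum(e * e for e in errors),
    }

y = [4, 7, 2, 5, 2]               # observed values
line_1 = [4, 3, 2, 3.5, 2.5]      # fitted values Y1
line_2 = [5.6, 3.8, 2, 4.7, 2.9]  # fitted values Y2

s1 = error_summary(y, line_1)
s2 = error_summary(y, line_2)
print(round(s1["abs"], 2), round(s1["sq"], 2))  # 6.0 18.5
print(round(s2["abs"], 2), round(s2["sq"], 2))  # 6.0 13.7
```

Both lines tie on absolute error, but the second has the smaller sum of squares because it avoids the single large miss of 4.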
The Correct Relationship: Y = a + bX + U
a + bX is the systematic part; U is the random part
[Figure: scatter of Y (50 to 100) against X (100 to 130) around the line Y = a + bX]
Least-Squares Method
Penalizes large absolute errors
Y-intercept:
\[ a = \bar{Y} - b\bar{X} \]
Slope:
\[ b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} \]
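The slope and intercept formulas can be applied directly to the sorties and UMA tables. The sketch below uses the values as printed; the function name is illustrative. The fitted coefficients should be close to a slope of about 0.49 and an intercept of about -6.5, though an exact match with any reported output is not guaranteed if the printed tables were rounded or transcribed.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (sum_xy - n * mean_x * mean_y) / (sum_x2 - n * mean_x ** 2)  # slope
    a = mean_y - b * mean_x                                          # intercept
    return a, b

# Values copied from the tables: squadrons 101, 104, 108, January to December.
sorties = [
    100, 120, 114, 132, 146, 124, 110, 138, 140, 114, 157, 106,
    130, 106, 124, 140, 100, 146, 142, 141, 148, 118, 128, 130,
    122, 134, 126, 190, 136, 110, 120, 196, 184, 154, 172, 157,
]
umas = [
    36, 53, 51, 61, 63, 54, 50, 65, 62, 51, 68, 45,
    60, 42, 56, 63, 39, 65, 63, 67, 66, 52, 59, 60,
    53, 61, 59, 87, 61, 46, 52, 85, 84, 75, 78, 68,
]

a, b = least_squares(sorties, umas)
print(f"UMAs = {a:.2f} + {b:.3f} * sorties")
```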
Assumptions
Linear relationship: Y = a + bX + U
Errors are random and normally distributed with mean = 0 and variance = σ²
Supported by Central Limit Theorem
Least Squares Regression for Sorties and UMAs
[Figure: scatter plot of UMAs (vertical axis, 0 to 100) against sorties (horizontal axis, 0 to 200) with the least-squares line]
Regression Calculations
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.978761339
R Square             0.957973758
Adjusted R Square    0.956737692
Standard Error       2.505836188
Observations         36

ANOVA
             df   SS           MS            F             Significance F
Regression   1    4866.50669   4866.50669    775.0183246   5.51636E-25
Residual     34   213.49331    6.279215001
Total        35   5080

            Coefficients   Standard Error   t Stat         P-value       Lower 95%     Upper 95%
Intercept   -6.542935597   2.426476306      -2.696476195   0.01082052    -11.4741255   -1.611745688
Sorties     0.492910634    0.017705663      27.83915093    5.51636E-25   0.456928421   0.528892848
Sorties vs. UMAs
[Figure: scatter plot of UMAs (vertical axis, 0 to 100) against sorties (horizontal axis, 0 to 200) with the fitted regression line]
Ŷ = -6.54 + 0.49X
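Once fitted, the line turns a planned number of sorties into an expected UMA count and cost. A minimal sketch using the coefficients from the summary output; the figure of 150 sorties is a hypothetical input chosen to sit inside the observed range of 100 to 196.

```python
INTERCEPT = -6.542935597   # from the summary output
SLOPE = 0.492910634        # from the summary output
COST_PER_UMA = 1000        # dollars per UMA, from the slides

def predict_umas(sorties):
    """Expected UMAs for one squadron-month at the given number of sorties."""
    return INTERCEPT + SLOPE * sorties

planned_sorties = 150      # hypothetical planning figure
predicted = predict_umas(planned_sorties)
print(f"{predicted:.1f} UMAs, about ${predicted * COST_PER_UMA:,.0f}")
```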
Regression Calculations: Confidence in the predictions
Confidence Interval for Estimate
[Figure: UMAs (30 to 100) against sorties (90 to 200) with the regression line and a band around it; panel (b): 95% confidence interval for the model]
Ŷ = (a + bX) ± t_{α/2} (se)
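The interval formula adds and subtracts a fixed multiple of the regression standard error around the fitted value. The sketch below assumes a critical value of about 2.032 (two-sided 95%, Student's t with 34 degrees of freedom), which is consistent with the Lower 95% and Upper 95% columns of the summary output; 150 sorties is a hypothetical input. It implements the simplified constant-width band shown here; a full prediction interval would widen as X moves away from its mean.

```python
INTERCEPT = -6.542935597      # from the summary output
SLOPE = 0.492910634           # from the summary output
STANDARD_ERROR = 2.505836188  # regression standard error, from the summary output
T_CRITICAL = 2.0322           # assumed: t value for 95%, 34 degrees of freedom

def estimate_interval(sorties):
    """Approximate 95% interval: fitted value +/- t * standard error."""
    centre = INTERCEPT + SLOPE * sorties
    half_width = T_CRITICAL * STANDARD_ERROR
    return centre - half_width, centre + half_width

low, high = estimate_interval(150)   # hypothetical planning figure
print(f"{low:.1f} to {high:.1f} UMAs")
```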
Testing Model Parameters
How well does the model explain the variation in the dependent variable?
Does the independent variable really seem to matter?
Is the intercept constant statistically significant?
Variation
[Figure: UMAs (30 to 100) against sorties (90 to 200) with the regression line, marking an observed value Y, the fitted value Ŷ and the mean Ȳ]
Coefficient of Determination
\[ R^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} \]
Values between 0 and 1
R² = 1 when all data on line (r = ±1)
R² = 0 when no correlation (r = 0)
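The ratio can be checked against the ANOVA sums of squares in the summary output. A short sketch, using only the reported numbers:

```python
ss_regression = 4866.50669   # explained variation (ANOVA, Regression SS)
ss_residual = 213.49331      # unexplained variation (ANOVA, Residual SS)
ss_total = ss_regression + ss_residual

r_squared = ss_regression / ss_total
multiple_r = 0.978761339     # Multiple R from the summary output

print(f"R^2 = {r_squared:.4f}, r^2 = {multiple_r ** 2:.4f}")
```

With a single explanatory variable, R² is the square of the correlation coefficient, so about 96% of the variation in UMAs is explained by sorties.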
Regression Calculations: How well does the model explain the variation?
Does the Independent Variable Matter?
Y = a + bX
If sorties do not help predict UMAs we expect b = 0
If b is not 0, is it statistically significant?
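Significance is judged by dividing a coefficient by its standard error and comparing the result with a critical t value. The sketch below uses the coefficient rows of the summary output and assumes a critical value of about 2.032 (two-sided 95%, 34 degrees of freedom); the function name is illustrative.

```python
T_CRITICAL = 2.0322  # assumed: t value for 95%, 34 degrees of freedom

def t_test(coefficient, std_error):
    """Return (t statistic, 95% confidence interval, significant at 5%?)."""
    t_stat = coefficient / std_error
    half_width = T_CRITICAL * std_error
    interval = (coefficient - half_width, coefficient + half_width)
    return t_stat, interval, abs(t_stat) > T_CRITICAL

slope_t, slope_ci, slope_sig = t_test(0.492910634, 0.017705663)
icpt_t, icpt_ci, icpt_sig = t_test(-6.542935597, 2.426476306)

print(f"slope: t = {slope_t:.2f}, significant = {slope_sig}")
print(f"intercept: t = {icpt_t:.2f}, significant = {icpt_sig}")
```

A coefficient is significant at the 5% level exactly when its 95% interval excludes zero, which holds for both rows here.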
Regression Calculations: Does the Independent Variable Matter?
[Figure, panel (a): 95% confidence interval for the slope, with the mean of X and the mean of Y marked on the axes]
Confidence Interval for Slope
[Figure: UMAs (30 to 100) against sorties (90 to 200)]
Is the Intercept Statistically Significant?
Confidence Interval for Y-intercept
[Figure: UMAs (30 to 100) against sorties (90 to 210)]
Basic Steps of Regression Analysis
Formulate the model
Plot scatter diagram for visual inspection
Compute correlation coefficient
Fit the regression line
Test the model
Factors affecting estimation accuracy
Sample size (larger is better)
Range of X values (wider is better)
Standard deviation of U (smaller is better)
Uses and Limitations of Regression Analysis
Identifying relationships
  Not necessarily cause
  May be due to chance only
Forecasting future outcomes
  Only valid over the range of the data
  Past may not be good predictor of future
Common pitfalls in regression
Failure to draw scatter diagrams
Omitting important variables from the model
The “two point” phenomenon
Unfounded claims of model sophistication
Insufficient attention to interval estimates and predictions
Predicting too far outside of known range
Lines can be deceiving...
[Figure: "X Variable 1 Line Fit Plot", Y (0 to 14) against X Variable 1 (0 to 20)]
R² = .6662
Nonlinear Relationship
[Figure: Y (0 to 14) against X (0 to 20) with a fitted quadratic curve]
y = -0.1267x² + 2.7808x - 5.9957, R² = 1
Best fit?
[Figure: "X Variable 1 Line Fit Plot", Y (0 to 14) against X Variable 1 (0 to 20)]
Misleading data
[Figure: "X Variable 1 Line Fit Plot", Y (0 to 14) against X Variable 1 (0 to 20)]
Summary
Regression Analysis is a useful tool
  Helps quantify relationships
But be careful
  Does not imply cause and effect
  Don’t go outside range of data
  Check linearity assumptions
  Use common sense!
[Figure: cost (vertical axis, 0 to 50) against output (horizontal axis, 0 to 20)]
r = 0.0
Non-linear relationship between output and cost
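A correlation of zero does not mean the variables are unrelated, only that there is no linear trend. The sketch below uses made-up data (not the values behind the chart): cost follows a symmetric U-shaped curve in output, so cost is fully determined by output and yet r is exactly zero.

```python
import math

def correlation(xs, ys):
    """Pearson correlation via the computational (sums-based) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data: a U-shaped cost curve, symmetric about output = 10.
output = list(range(0, 21))
cost = [(x - 10) ** 2 for x in output]

r = correlation(output, cost)
print(r)  # 0.0
```

This is why the scatter diagram should be drawn before relying on r.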