Chapter 23
Correlation and Simple Linear Regression
Introduction
• In processes there is often a direct relationship
between two variables. If a process input variable is
strongly correlated with a key process output variable
(KPOV), that input variable could then be considered a
key process input variable (KPIV).
• The equation 𝑌 = ƒ(𝑥) can express this relationship for
continuous variables, where 𝑌 is the dependent
variable and 𝑥 is the independent variable.
• Parameters of this equation can be determined using
regression techniques.
Introduction
• After a relationship is established, the appropriate
course of action depends upon the particulars of the
situation.
• If the overall process is not capable of consistently
meeting the needs of the customer, it may be appropriate
to implement tighter specifications or to initiate control
charts for this key process input variable (KPIV).
• If the variability of a key process input variable simply
reflects the normal variability of raw material, an
alternative course of action might be more appropriate.
In this case it could be beneficial to conduct a DOE with
the objective of determining other factor settings that
would improve the robustness of the process output.
23.1 S4/IEE Application Examples:
Regression
• An S4/IEE project was created to improve the 30,000-foot-
level metric, days sales outstanding (DSO). One input that
surfaced from a cause-and-effect diagram was the size of
the invoice. A scatter plot and regression analysis of DSO
versus size of invoice were created.
• An S4/IEE project was created to improve the 30,000-foot-
level metric, the diameter of a manufactured part. One
input that surfaced from a cause-and-effect diagram was
the temperature of the manufacturing process. A scatter
plot of part diameter versus process temperature was
created.
23.2 Scatter Plot (Dispersion Graph)
• A scatter plot or dispersion graph pictorially describes the
relationship between two variables.
• Care must be exercised when interpreting dispersion graphs.
A plot that shows a relationship does not prove a true cause-
and-effect relationship; that is, correlation does not prove
causation. Happenstance data can create the appearance of a
relationship. For example, the phase of the moon could
appear to affect a process that has a monthly cycle.
• When constructing a dispersion graph, first clearly define the
variables that are to be evaluated. Next collect at least 30
data pairs. Plot data pairs using the horizontal axis for
probable cause and using the vertical axis for probable effect.
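As a rough illustration, the following minimal Python sketch plots paired data with the probable cause on the horizontal axis and the probable effect on the vertical axis; the data and variable names are hypothetical (echoing the part-diameter example in Section 23.1), and a real study should use at least 30 pairs.

```python
import matplotlib.pyplot as plt

# Hypothetical paired process data (a real study should collect >= 30 pairs)
temperature = [150, 152, 155, 158, 160, 163, 165, 168, 170, 172]       # probable cause
diameter = [9.8, 9.9, 10.0, 10.2, 10.2, 10.4, 10.5, 10.6, 10.8, 10.9]  # probable effect

plt.scatter(temperature, diameter)
plt.xlabel("Process temperature (probable cause)")  # horizontal axis: input
plt.ylabel("Part diameter (probable effect)")       # vertical axis: output
plt.title("Scatter plot (dispersion graph)")
plt.show()
```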
23.3 Correlation
• A statistic that can describe the strength of a linear
relationship between two variables is the sample
correlation coefficient (𝑟). A correlation coefficient can take
values between -1 and 1.
• A -1 indicates perfect negative correlation, while a 1
indicates perfect positive correlation. A zero indicates no
correlation.
• The equation for the sample correlation coefficient ($r$) of
two variables is

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
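A direct translation of this formula into Python might look like the following minimal NumPy sketch (the function name is illustrative, not from the text):

```python
import numpy as np

def sample_correlation(x, y):
    """Sample correlation coefficient r, computed from the definition above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()                    # deviations from the mean of x
    dy = y - y.mean()                    # deviations from the mean of y
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
```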
23.3 Correlation
• The hypothesis test for the correlation coefficient (𝜌) to
equal zero is
𝐻0: 𝜌 = 0
𝐻𝑎: 𝜌 ≠ 0
• If the 𝑥 and 𝑦 relationships are jointly normally distributed,
the test statistic for this hypothesis is
$$t_0 = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
• The null hypothesis is rejected if $|t_0| > t_{\alpha/2,\,n-2}$,
where $t_{\alpha/2,\,n-2}$ is a single-sided $t$-table value at $\alpha/2$.
23.3 Correlation
• Coefficient of determination (𝑅2) is simply the square of the
correlation coefficient (𝑟).
• Values for 𝑅2 describe the percentage of variability
accounted for by the model. For example, 𝑅2 = 0.8
indicates that 80% of the variability in the data is
accounted for by the model.
23.4 Example 23.1: Correlation
• The times for 25 soft drink deliveries (𝑦) monitored as a
function of delivery volume (𝑥) are shown in Table 23.1
(Montgomery and Peck 1982).
• The scatter diagram of these data indicates that there
probably is a strong correlation between the two variables.
Table 23.1: Delivery time versus delivery volume

Number of Cases (x)    Delivery Time (y)
 7                     16.68
 3                     11.50
 3                     12.03
 4                     14.88
 6                     13.75
 7                     18.11
 2                      8.00
 7                     17.83
30                     79.24
 5                     21.50
16                     40.33
10                     21.00
 4                     13.50
 6                     19.75
 9                     24.00
10                     29.00
 6                     15.35
 7                     19.00
 3                      9.50
17                     35.10
10                     17.90
26                     52.32
 9                     18.75
 8                     19.83
 4                     10.75

23.4 Example 23.1: Correlation
• The sample correlation coefficient between delivery time and
delivery volume is determined through use of a computer
program or equation to be:
$$r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}} = \frac{2473.42}{\sqrt{1136.57 \times 5784.54}} = 0.96$$
• Testing the null hypothesis that the correlation coefficient
equals zero yields
$$t_0 = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.96\sqrt{25-2}}{\sqrt{1-0.96^2}} = 17.56$$

(The unrounded value of $r$ is carried through the computation; substituting the rounded value 0.96 would give approximately 16.4.)
• Using a single-sided $t$-table (Table D) at $\alpha/2 = 0.025$, we can reject $H_0$
since $t_0 = 17.56 > t_{0.025,23} = 2.069$. Equivalently, we could use a two-sided
$t$-table at $\alpha = 0.05$.
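For reference, a short SciPy sketch that reproduces these values from the Table 23.1 data as transcribed above:

```python
import numpy as np
from scipy import stats

cases = np.array([7, 3, 3, 4, 6, 7, 2, 7, 30, 5, 16, 10, 4,
                  6, 9, 10, 6, 7, 3, 17, 10, 26, 9, 8, 4])
delivery_time = np.array([16.68, 11.50, 12.03, 14.88, 13.75, 18.11, 8.00, 17.83,
                          79.24, 21.50, 40.33, 21.00, 13.50, 19.75, 24.00, 29.00,
                          15.35, 19.00, 9.50, 35.10, 17.90, 52.32, 18.75, 19.83, 10.75])

r, p_value = stats.pearsonr(cases, delivery_time)  # r ~ 0.96, p ~ 0: reject H0: rho = 0
n = len(cases)
t0 = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)      # ~ 17.6, matching the text
print(r, t0, p_value)
```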
23.5 Simple Linear Regression
• Correlation only measures association, while regression
methods develop quantitative relationships between
variables that are useful for prediction.
• For this relationship the independent variable is variable 𝑥,
while the dependent variable is 𝑦. The simple linear
regression model takes the form:
𝑌 = 𝛽0 + 𝛽1𝑥 + 𝜀
• where 𝛽0 is the intercept, 𝛽1 is the slope, and 𝜀 is the error
term. Data points do not typically fall exactly on the
regression line; the error term accounts for these
differences, which arise from sources such as measurement
error, material variation, and personnel.
23.5 Simple Linear Regression
• When the magnitude of the coefficient of determination
(𝑅2) is large, the error term is relatively small and the
model has a good fit.
• When a linear regression model contains only one
independent (regressor or predictor) variable, it is called
simple linear regression. When a regression model
contains more than one independent variable, it is called a
multiple linear regression model.
23.5 Simple Linear Regression
• Least squares minimizes the sum of squares of the
residuals. The fitted simple linear regression model:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

where the regression coefficients are

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} = \frac{\sum_{i=1}^{n} y_i (x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
• The difference between the observed value and the
corresponding fitted value is a residual.
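A minimal NumPy sketch of these estimators (the function name is illustrative):

```python
import numpy as np

def fit_simple_linear(x, y):
    """Least-squares estimates (beta0_hat, beta1_hat) from the formulas above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xbar, ybar = x.mean(), y.mean()
    Sxy = np.sum(y * (x - xbar))       # numerator: sum of y_i (x_i - xbar)
    Sxx = np.sum((x - xbar) ** 2)      # denominator: sum of (x_i - xbar)^2
    b1 = Sxy / Sxx
    b0 = ybar - b1 * xbar
    return b0, b1

# The Table 23.1 delivery data gives roughly b0 = 3.32 and b1 = 2.18,
# in line with the Minitab output shown in Example 23.2.
```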
23.5 Simple Linear Regression
• The $i$th residual is $e_i = y_i - \hat{y}_i$.
• Residuals are important in investigating the adequacy
of the fitted model and in detecting departures from the
underlying assumptions.
• Statistical regression programs can calculate the model and
plot the least-squares estimates. Programs can also
generate a table of coefficients and conduct an analysis of
variance. Significance tests of the regression coefficients
involve either the $t$ distribution for the table of coefficients or
the $F$ distribution for the analysis of variance.
23.5 Simple Linear Regression
• For the analysis of variance table, total variation is broken
down into the pieces described by the sum of squares (SS):
$$SS_{total} = SS_{regression} + SS_{error}$$

where

$$SS_{total} = \sum(y_i - \bar{y})^2, \qquad SS_{regression} = \sum(\hat{y}_i - \bar{y})^2, \qquad SS_{error} = \sum(y_i - \hat{y}_i)^2$$
23.5 Simple Linear Regression
• Each sum of squares has an associated number of degrees
of freedom:

Sum of Squares       Degrees of Freedom
$SS_{total}$         $n - 1$
$SS_{regression}$    $1$
$SS_{error}$         $n - 2$

• When divided by the appropriate number of degrees of
freedom, the sums of squares give good estimates of the
source of variability. This variability is analogous to a
variance calculation and is called a mean square.
23.5 Simple Linear Regression
• If there is no difference in treatment means, the two
estimates are presumed to be similar. If there is a
difference, we suspect that the regressor causes the
observed difference. Calculating the F-test statistic tests
the null hypothesis that there is no difference because of
the regressor:
$$F_0 = \frac{MS_{regression}}{MS_{error}}$$
• Using an F table, we should reject the null hypothesis and
conclude that the regressor causes a difference, at the
significance level of 𝛼, if
𝐹0 > 𝐹𝛼,1,𝑛−2
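A sketch of the full decomposition and F-test in Python (the function name is mine; the p-value uses SciPy's F survival function):

```python
import numpy as np
from scipy import stats

def simple_regression_anova(x, y):
    """ANOVA decomposition and F-test for a simple linear regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    b1 = np.sum(y * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    y_hat = b0 + b1 * x
    ss_total = np.sum((y - y.mean()) ** 2)    # n - 1 degrees of freedom
    ss_reg = np.sum((y_hat - y.mean()) ** 2)  # 1 degree of freedom
    ss_error = np.sum((y - y_hat) ** 2)       # n - 2 degrees of freedom
    ms_reg = ss_reg / 1
    ms_error = ss_error / (n - 2)
    f0 = ms_reg / ms_error
    p = stats.f.sf(f0, 1, n - 2)              # reject H0 if f0 > F(alpha, 1, n-2)
    return ss_reg, ss_error, ss_total, f0, p
```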
23.5 Simple Linear Regression

ANOVA Table for Simple Regression

Source of Variation   Sum of Squares      Degrees of Freedom   Mean Square         $F_0$
Regression            $SS_{regression}$   $1$                  $MS_{regression}$   $MS_{regression}/MS_{error}$
Error                 $SS_{error}$        $n - 2$              $MS_{error}$
Total                 $SS_{total}$        $n - 1$
23.5 Simple Linear Regression
• The coefficient of determination is a ratio of the explained
variation to total variation, which equates to:
$$R^2 = 1 - \frac{SS_{error}}{SS_{total}} = \frac{SS_{regression}}{SS_{total}} = \frac{\sum(\hat{y}_i - \bar{y})^2}{\sum(y_i - \bar{y})^2}$$
• The multiplication of this coefficient by 100 yields the
percentage variation explained by the least-squares
method. A higher percentage indicates a better least-
squares predictor.
23.5 Simple Linear Regression
• If a variable is added to a model equation, $R^2$ will increase
even if the variable has no real value. A compensation for
this is an adjusted value, $R^2(\text{adj})$, which is an
approximately unbiased estimate of the population $R^2$:

$$R^2(\text{adj}) = 1 - \frac{SS_{error}/(n - p)}{SS_{total}/(n - 1)}$$

• where $p$ is the number of terms in the regression equation
and $n$ is the number of observations.
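As a quick check, a one-function Python sketch of this adjustment; the numeric comment uses the delivery-data ANOVA values that appear in the Minitab output of Example 23.2:

```python
def r_squared_adjusted(ss_error, ss_total, n, p):
    """R^2(adj) = 1 - [SS_error/(n - p)] / [SS_total/(n - 1)]."""
    return 1 - (ss_error / (n - p)) / (ss_total / (n - 1))

# Delivery data (n = 25, p = 2 terms):
# 1 - (402.1/23) / (5784.5/24) ~ 0.927, matching R-Sq(adj) = 92.7%
```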
23.5 Simple Linear Regression
• The correlation coefficient of the population ($\rho$) and its
sample estimate ($r$) are intimately connected with a bivariate
population known as the bivariate normal distribution. This
distribution is created from the joint frequency distributions
of the modeled variables. The frequencies have an elliptical
concentration.
Supplement: Inferences Using the
Least-Squares Coefficients
• When two variables have a linear relationship,
the scatterplot tends to be clustered around a line
known as the least squares line.
• We think of the slope and intercept of the least-
squares line as estimates of the slope and
intercept of the true regression line.
Vocabulary
• The linear model is yi = β0 + β1xi+ εi
• The dependent variable is yi
• The independent variable is xi
• The regression coefficients are β0 and β1
• The error is εi
• The line y = β0 + β1x is the true regression line.
• The quantities $\hat{\beta}_0$ and $\hat{\beta}_1$ are called the least-squares
coefficients and can be computed from the data.
Assumptions for Errors in
Linear Models
In the simplest situation, the following assumptions
are satisfied:
1. The errors $\varepsilon_1, \ldots, \varepsilon_n$ are random and independent.
In particular, the magnitude of any error $\varepsilon_i$ does
not influence the value of the next error $\varepsilon_{i+1}$.
2. The errors $\varepsilon_1, \ldots, \varepsilon_n$ all have mean 0.
3. The errors $\varepsilon_1, \ldots, \varepsilon_n$ all have the same variance,
which we denote by $\sigma^2$.
4. The errors $\varepsilon_1, \ldots, \varepsilon_n$ are normally distributed.
Distribution
In the linear model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, under
assumptions 1 through 4, the observations $y_1, \ldots, y_n$
are independent random variables that follow the
normal distribution. The mean and variance of $y_i$
are given by

$$\mu_{y_i} = \beta_0 + \beta_1 x_i, \qquad \sigma^2_{y_i} = \sigma^2$$

The slope $\beta_1$ represents the change in the mean of $y$
associated with an increase of one unit in the value
of $x$.
More Distributions
Under assumptions 1 through 4:
• The quantities $\hat{\beta}_0$ and $\hat{\beta}_1$ are normally distributed random variables.
• The means of $\hat{\beta}_0$ and $\hat{\beta}_1$ are the true values $\beta_0$ and $\beta_1$, respectively.
• The standard deviations of $\hat{\beta}_0$ and $\hat{\beta}_1$ are estimated with

$$s_{\hat{\beta}_0} = s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}} \qquad \text{and} \qquad s_{\hat{\beta}_1} = \frac{s}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$

where

$$s = \sqrt{\frac{(1 - r^2)\sum_{i=1}^{n}(y_i - \bar{y})^2}{n - 2}}$$

is an estimate of the error standard deviation $\sigma$.
Example
For the Hooke's law data, compute $s$, $s_{\hat{\beta}_1}$, and $s_{\hat{\beta}_0}$.
(Worked solution shown as a figure; not reproduced here.)
Notes
1. Since there is a measure of variation of $x$ in the
denominator of both of the uncertainties we just
defined, the more spread out the $x$'s are, the smaller
the uncertainties in $\hat{\beta}_0$ and $\hat{\beta}_1$.
2. Use caution: if the range of $x$ values extends
beyond the range where the linear model holds,
the results will not be valid.
3. The quantities $(\hat{\beta}_0 - \beta_0)/s_{\hat{\beta}_0}$ and $(\hat{\beta}_1 - \beta_1)/s_{\hat{\beta}_1}$
have Student's $t$ distribution with $n - 2$ degrees of
freedom.
Confidence Intervals
• Level $100(1 - \alpha)\%$ confidence intervals for $\beta_0$ and $\beta_1$
are given by $\hat{\beta}_0 \pm t_{n-2,\alpha/2}\, s_{\hat{\beta}_0}$ and $\hat{\beta}_1 \pm t_{n-2,\alpha/2}\, s_{\hat{\beta}_1}$
• A level $100(1 - \alpha)\%$ confidence interval for $\beta_0 + \beta_1 x$
is given by $\hat{\beta}_0 + \hat{\beta}_1 x \pm t_{n-2,\alpha/2}\, s_{\hat{y}}$, where

$$s_{\hat{y}} = s\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
Prediction Intervals
• A level $100(1 - \alpha)\%$ prediction interval for a future
observation of $y$ at a given $x$ is given by
$\hat{\beta}_0 + \hat{\beta}_1 x \pm t_{n-2,\alpha/2}\, s_{pred}$, where

$$s_{pred} = s\sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
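A combined Python sketch of both interval types (function and variable names are illustrative):

```python
import numpy as np
from scipy import stats

def intervals_at(x0, x, y, alpha=0.05):
    """CI for the mean response and PI for a new observation at x0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    b1 = np.sum(y * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))   # estimate of error std. deviation
    Sxx = np.sum((x - x.mean()) ** 2)
    t = stats.t.ppf(1 - alpha / 2, n - 2)
    y0 = b0 + b1 * x0
    s_yhat = s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / Sxx)       # mean response
    s_pred = s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx)   # new observation
    ci = (y0 - t * s_yhat, y0 + t * s_yhat)
    pi = (y0 - t * s_pred, y0 + t * s_pred)
    return ci, pi
```

Note that $s_{pred}$ always exceeds $s_{\hat{y}}$, so a prediction interval for an individual observation is wider than the confidence interval for the mean response at the same $x$.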
Inferences on the Population
Correlation
• When we have a random sample from a population
of ordered pairs, the correlation coefficient, r, is often
called the sample correlation.
• We have the true population correlation, ρ.
• If the population of ordered pairs has a certain
distribution known as a bivariate normal
distribution, then the sample correlation can be
used to construct CI’s and perform hypothesis tests
on the population correlation.
23.6 Analysis of Residuals
• For our analysis, modeling errors are assumed to be
normally and independently distributed with mean zero and
a constant but unknown variance. An abbreviation for this
assumption is 𝑁𝐼𝐷(0, 𝜎2).
• An important method for testing the $NID(0, \sigma^2)$ assumption
of an experiment is residual analysis (a residual is the
difference between the observed value and the
corresponding fitted value). Residual analyses play an
important role in investigating the adequacy of the fitted
model and in detecting departures from the model.
23.6 Analysis of Residuals
• Residual analysis techniques include the following (see the
sketch after this list):
– Checking the normality assumption through a normal
probability plot and/or histogram of the residuals.
– Checking for correlation between residuals by plotting
residuals in time sequence.
– Checking for correctness of the model by plotting residuals
versus fitted values.
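A minimal Matplotlib/SciPy sketch of these three diagnostic plots (the function name is mine):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def residual_plots(x, y):
    """Three standard residual diagnostics for a fitted simple linear regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b1 = np.sum(y * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    fitted = b0 + b1 * x
    resid = y - fitted

    fig, ax = plt.subplots(1, 3, figsize=(12, 4))
    stats.probplot(resid, dist="norm", plot=ax[0])  # normality check
    ax[0].set_title("Normal probability plot")
    ax[1].plot(resid, marker="o")                   # time/observation order
    ax[1].set_title("Residuals vs. order")
    ax[2].scatter(fitted, resid)                    # model adequacy
    ax[2].axhline(0, linestyle="--")
    ax[2].set_title("Residuals vs. fitted")
    plt.show()
```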
23.7 Analysis of
Residuals: Normality Assessment
• If the 𝑁𝐼𝐷(0, 𝜎2) assumption is valid, a histogram plot of the
residuals should look like a sample from a normal
distribution. Expect considerable departures from a
normality appearance when the sample size is small. A
normal probability plot of the residuals can similarly be
conducted. If the underlying error distribution is normal, the
plot will resemble a straight line.
• Commonly a residual plot will show one point that is much
larger or smaller than the others. Such a residual is typically
called an outlier. One or more outliers can distort the
analysis.
23.7 Analysis of
Residuals: Normality Assessment
• To perform a rough check for outliers, substitute residual
error values into:
$$d_{ij} = \frac{e_{ij}}{\sqrt{MS_E}}$$
• and examine the standardized residual values. About 68%
of the standardized residuals should fall within a $d_{ij}$
value of 1, about 95% within a $d_{ij}$ value of 2, and
almost all (99.7%) within a $d_{ij}$ value of 3.
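A rough-check sketch in Python (the function name and cutoff argument are mine):

```python
import numpy as np

def flag_outliers(residuals, ms_error, cutoff=3.0):
    """Rough outlier check: standardized residuals d = e / sqrt(MS_error)."""
    d = np.asarray(residuals, dtype=float) / np.sqrt(ms_error)
    # Roughly 68% of |d| should fall within 1, 95% within 2, 99.7% within 3
    return np.where(np.abs(d) > cutoff)[0]  # indices of suspect observations
```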
23.8 Analysis of
Residuals: Time Sequence
• A plot of residuals in time order of data collection helps
detect correlation between residuals. A tendency toward runs
of positive or negative residuals indicates positive correlation.
• This implies a violation of the independence assumption. An
individuals chart of residuals in chronological order by
observation number can verify the independence of errors.
Positive autocorrelation occurs when residuals do not
change signs as frequently as would be expected, while
negative autocorrelation is indicated when the residuals
change signs very frequently. This problem is best avoided
when the data are initially collected.
23.9 Analysis of Residuals: Fitted
Values
• Outliers appear as points that are either much higher
or lower than normal residual values. These points should
be investigated. Perhaps someone recorded a number
wrong. Perhaps an evaluation of this sample provides
additional knowledge that leads to a major process
improvement breakthrough.
• Nonconstant variance is indicated when the difference
between the lowest and highest residual values either
increases or decreases for an increase in the fitted values.
A measurement instrument whose error is proportional to
the measured value could cause this.
23.10 Example 23.2: Simple Linear Regression

Regression Analysis: Delivery Time (y) versus Number of Cases (x)
The regression equation is
Delivery Time (y) = 3.32 + 2.18 Number of Cases (x)
Predictor Coef SE Coef T P
Constant 3.321 1.371 2.42 0.024
Number of Cases (x) 2.1762 0.1240 17.55 0.000
S = 4.18140 R-Sq = 93.0% R-Sq(adj) = 92.7%
Analysis of Variance
Source DF SS MS F P
Regression 1 5382.4 5382.4 307.85 0.000
Residual Error 23 402.1 17.5
Total 24 5784.5
23.10 Example 23.2: Simple Linear
Regression
Unusual Observations
Number of Delivery
Obs Cases (x) Time (y) Fit SE Fit Residual St Resid
9 30.0 79.240 68.606 2.764 10.634 3.39RX
22 26.0 52.320 59.901 2.296 -7.581 -2.17RX
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.
Minitab: Stat → Regression → Regression
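The same analysis can be reproduced outside Minitab; for example, a short statsmodels sketch (assuming the Table 23.1 data) that yields essentially the same coefficient table and ANOVA results:

```python
import numpy as np
import statsmodels.api as sm

cases = np.array([7, 3, 3, 4, 6, 7, 2, 7, 30, 5, 16, 10, 4,
                  6, 9, 10, 6, 7, 3, 17, 10, 26, 9, 8, 4])
delivery_time = np.array([16.68, 11.50, 12.03, 14.88, 13.75, 18.11, 8.00, 17.83,
                          79.24, 21.50, 40.33, 21.00, 13.50, 19.75, 24.00, 29.00,
                          15.35, 19.00, 9.50, 35.10, 17.90, 52.32, 18.75, 19.83, 10.75])

model = sm.OLS(delivery_time, sm.add_constant(cases)).fit()
print(model.summary())  # coefficients ~ 3.32 and ~ 2.18, R-squared ~ 0.93
```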
23.9 Analysis of Residuals: Fitted Values
(Fitted line plot; Minitab: Stat → Regression → Fitted Line Plot)
• The two unusual observations are a large extrapolation from
the majority of the data used in fitting this model. In addition,
the plots indicate that these values do not fit the general
model very well. The residuals versus fitted plot indicates
that there could also be an increase in the variability of
delivery time with an increase in the number of cases.
23.10 Example 23.2: Simple Linear Regression
(Four-in-one residual plots; Minitab: Stat → Regression → Regression → Graphs → Four in one)
23.11 S4/IEE Assessments
• The regression model describes the region for which it
models and may not be an accurate representation for
extrapolated values.
• It is difficult to detect a cause-and-effect relationship if
measurement error is large.
• A true cause-and-effect relationship does not necessarily
exist when two variables are correlated.
• A process may have a third variable that affects the
process such that the two variables vary simultaneously.
• Least-squares predictions are based on historical data,
which may not represent future relationships.
23.11 S4/IEE Assessments
• An important independent variable for improving a process
may be disregarded from further consideration because a
study did not show correlation between this variable and
the response that needed improvement. However, this
variable might be shown to be important within a DOE
if the variable were operated outside its normal operating
range.