Chapter 23
Correlation and Simple Linear Regression
Introduction
• In processes there is often a direct relationship
between two variables. If a process input variable is
strongly correlated with a key process output variable
(KPOV), that input variable could then be considered a
key process input variable (KPIV).
• The equation 𝑌 = ƒ(𝑥) can express this relationship for
continuous variables, where 𝑌 is the dependent
variable and 𝑥 is the independent variable.
• Parameters of this equation can be determined using
regression techniques.
Introduction
• After a relationship is established, the appropriate
course of action depends upon the particulars of the
situation.
• If the overall process is not capable of consistently
meeting the needs of the customer, it may be appropriate
to implement tighter specifications or to initiate control
charts for this key process input variable (KPIV).
• If the variability of a key process input variable simply
reflects the normal variability of raw material, an
alternative course of action might be more appropriate.
In this case it could be beneficial to conduct a DOE with
the objective of determining other factor settings that
would improve the robustness of the process output.
23.1 S4/IEE Application Examples:
Regression
• An S4/IEE project was created to improve the 30,000-foot-
level metric, days sales outstanding (DSO). One input that
surfaced from a cause-and-effect diagram was the size of
the invoice. A scatter plot and regression analysis of DSO
versus size of invoice were created.
• An S4/IEE project was created to improve the 30,000-foot-
level metric, the diameter of a manufactured part. One
input that surfaced from a cause-and-effect diagram was
the temperature of the manufacturing process. A scatter
plot of part diameter versus process temperature was
created.
23.2 Scatter Plot (Dispersion Graph)
• A scatter plot or dispersion graph pictorially describes the
relationship between two variables.
• Care must be exercised when interpreting dispersion graphs.
A plot that shows a relationship does not prove a true cause-
and-effect relationship; that is, correlation does not prove
causation. Happenstance data can create the appearance of a
relationship. For example, the phase of the moon could
appear to affect a process that has a monthly cycle.
• When constructing a dispersion graph, first clearly define the
variables that are to be evaluated. Next collect at least 30
data pairs. Plot data pairs using the horizontal axis for
probable cause and using the vertical axis for probable effect.
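As a rough illustration, the following minimal Python sketch plots paired data with the probable cause on the horizontal axis and the probable effect on the vertical axis; the data and variable names are hypothetical (echoing the part-diameter example in Section 23.1), and a real study should use at least 30 pairs.

```python
import matplotlib.pyplot as plt

# Hypothetical paired process data (a real study should collect >= 30 pairs)
temperature = [150, 152, 155, 158, 160, 163, 165, 168, 170, 172]       # probable cause
diameter = [9.8, 9.9, 10.0, 10.2, 10.2, 10.4, 10.5, 10.6, 10.8, 10.9]  # probable effect

plt.scatter(temperature, diameter)
plt.xlabel("Process temperature (probable cause)")  # horizontal axis: input
plt.ylabel("Part diameter (probable effect)")       # vertical axis: output
plt.title("Scatter plot (dispersion graph)")
plt.show()
```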
23.3 Correlation
• A statistic that can describe the strength of a linear
relationship between two variables is the sample
correlation coefficient (𝑟). A correlation coefficient can take
values between -1 and 1.
• A -1 indicates perfect negative correlation, while a 1
indicates perfect positive correlation. A zero indicates no
correlation.
• The equation for the sample correlation coefficient ($r$) of
two variables is

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
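A direct translation of this formula into Python might look like the following minimal NumPy sketch (the function name is illustrative, not from the text):

```python
import numpy as np

def sample_correlation(x, y):
    """Sample correlation coefficient r, computed from the definition above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()                    # deviations from the mean of x
    dy = y - y.mean()                    # deviations from the mean of y
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
```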
23.3 Correlation
• The hypothesis test for the correlation coefficient (𝜌) to
equal zero is
𝐻0: 𝜌 = 0
𝐻𝑎: 𝜌 ≠ 0
• If the 𝑥 and 𝑦 relationships are jointly normally distributed,
the test statistic for this hypothesis is
$$t_0 = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
• The null hypothesis is rejected if $|t_0| > t_{\alpha/2,\,n-2}$,
where $t_{\alpha/2,\,n-2}$ is a single-sided $t$-table value at $\alpha/2$.
23.3 Correlation
• Coefficient of determination (𝑅2) is simply the square of the
correlation coefficient (𝑟).
• Values for 𝑅2 describe the percentage of variability
accounted for by the model. For example, 𝑅2 = 0.8
indicates that 80% of the variability in the data is
accounted for by the model.
23.4 Example 23.1: Correlation
• The times for 25 soft drink deliveries (𝑦) monitored as a
function of delivery volume (𝑥) are shown in Table 23.1
(Montgomery and Peck 1982).
• The scatter diagram of these data indicates that there
probably is a strong correlation between the two variables.
Table 23.1: Delivery time versus delivery volume

Number of Cases (x)    Delivery Time (y)
 7                     16.68
 3                     11.50
 3                     12.03
 4                     14.88
 6                     13.75
 7                     18.11
 2                      8.00
 7                     17.83
30                     79.24
 5                     21.50
16                     40.33
10                     21.00
 4                     13.50
 6                     19.75
 9                     24.00
10                     29.00
 6                     15.35
 7                     19.00
 3                      9.50
17                     35.10
10                     17.90
26                     52.32
 9                     18.75
 8                     19.83
 4                     10.75

23.4 Example 23.1: Correlation
• The sample correlation coefficient between delivery time and
delivery volume is determined through use of a computer
program or equation to be:
$$r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}} = \frac{2473.42}{\sqrt{1136.57 \times 5784.54}} = 0.96$$
• Testing the null hypothesis that the correlation coefficient
equals zero yields
$$t_0 = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.96\sqrt{25-2}}{\sqrt{1-0.96^2}} = 17.56$$

(The unrounded value of $r$ is carried through the computation; substituting the rounded value 0.96 would give approximately 16.4.)
• Using a single-sided $t$-table (Table D) at $\alpha/2 = 0.025$, we can reject $H_0$
since $t_0 = 17.56 > t_{0.025,23} = 2.069$. Equivalently, we could use a two-sided
$t$-table at $\alpha = 0.05$.
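For reference, a short SciPy sketch that reproduces these values from the Table 23.1 data as transcribed above:

```python
import numpy as np
from scipy import stats

cases = np.array([7, 3, 3, 4, 6, 7, 2, 7, 30, 5, 16, 10, 4,
                  6, 9, 10, 6, 7, 3, 17, 10, 26, 9, 8, 4])
delivery_time = np.array([16.68, 11.50, 12.03, 14.88, 13.75, 18.11, 8.00, 17.83,
                          79.24, 21.50, 40.33, 21.00, 13.50, 19.75, 24.00, 29.00,
                          15.35, 19.00, 9.50, 35.10, 17.90, 52.32, 18.75, 19.83, 10.75])

r, p_value = stats.pearsonr(cases, delivery_time)  # r ~ 0.96, p ~ 0: reject H0: rho = 0
n = len(cases)
t0 = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)      # ~ 17.6, matching the text
print(r, t0, p_value)
```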
23.5 Simple Linear Regression
• Correlation only measures association, while regression
methods develop quantitative relationships between
variables that are useful for prediction.
• For this relationship the independent variable is variable 𝑥,
while the dependent variable is 𝑦. The simple linear
regression model takes the form:
𝑌 = 𝛽0 + 𝛽1𝑥 + 𝜀
• where 𝛽0 is the intercept, 𝛽1 is the slope, and 𝜀 is the error
term. Data points do not typically fall exactly on the
regression line; the error term accounts for these
differences, which arise from sources such as measurement
error, material variation, and personnel.
23.5 Simple Linear Regression
• When the magnitude of the coefficient of determination
(𝑅2) is large, the error term is relatively small and the
model has a good fit.
• When a linear regression model contains only one
independent (regressor or predictor) variable, it is called
simple linear regression. When a regression model
contains more than one independent variable, it is called a
multiple linear regression model.
23.5 Simple Linear Regression
• Least squares minimizes the sum of squares of the
residuals. The fitted simple linear regression model:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

where the regression coefficients are

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} = \frac{\sum_{i=1}^{n} y_i (x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
• The difference between the observed value and the
corresponding fitted value is a residual.
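A minimal NumPy sketch of these estimators (the function name is illustrative):

```python
import numpy as np

def fit_simple_linear(x, y):
    """Least-squares estimates (beta0_hat, beta1_hat) from the formulas above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xbar, ybar = x.mean(), y.mean()
    Sxy = np.sum(y * (x - xbar))       # numerator: sum of y_i (x_i - xbar)
    Sxx = np.sum((x - xbar) ** 2)      # denominator: sum of (x_i - xbar)^2
    b1 = Sxy / Sxx
    b0 = ybar - b1 * xbar
    return b0, b1

# The Table 23.1 delivery data gives roughly b0 = 3.32 and b1 = 2.18,
# in line with the Minitab output shown in Example 23.2.
```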
23.5 Simple Linear Regression
• The $i$th residual is $e_i = y_i - \hat{y}_i$.
• Residuals are important in investigating the adequacy
of the fitted model and in detecting departures from the
underlying assumptions.
• Statistical regression programs can calculate the model and
plot the least-squares estimates. Programs can also
generate a table of coefficients and conduct an analysis of
variance. Significance tests of the regression coefficients
involve either the $t$ distribution for the table of coefficients or
the $F$ distribution for the analysis of variance.
23.5 Simple Linear Regression
• For the analysis of variance table, total variation is broken
down into the pieces described by the sum of squares (SS):
$$SS_{total} = SS_{regression} + SS_{error}$$

where

$$SS_{total} = \sum(y_i - \bar{y})^2, \qquad SS_{regression} = \sum(\hat{y}_i - \bar{y})^2, \qquad SS_{error} = \sum(y_i - \hat{y}_i)^2$$
23.5 Simple Linear Regression
• Each sum of squares has an associated number of degrees
of freedom:

Sum of Squares       Degrees of Freedom
$SS_{total}$         $n - 1$
$SS_{regression}$    $1$
$SS_{error}$         $n - 2$

• When divided by the appropriate number of degrees of
freedom, the sums of squares give good estimates of the
source of variability. This variability is analogous to a
variance calculation and is called a mean square.
23.5 Simple Linear Regression
• If there is no difference in treatment means, the two
estimates are presumed to be similar. If there is a
difference, we suspect that the regressor causes the
observed difference. Calculating the F-test statistic tests
the null hypothesis that there is no difference because of
the regressor:
$$F_0 = \frac{MS_{regression}}{MS_{error}}$$
• Using an F table, we should reject the null hypothesis and
conclude that the regressor causes a difference, at the
significance level of 𝛼, if
𝐹0 > 𝐹𝛼,1,𝑛−2
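A sketch of the full decomposition and F-test in Python (the function name is mine; the p-value uses SciPy's F survival function):

```python
import numpy as np
from scipy import stats

def simple_regression_anova(x, y):
    """ANOVA decomposition and F-test for a simple linear regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    b1 = np.sum(y * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    y_hat = b0 + b1 * x
    ss_total = np.sum((y - y.mean()) ** 2)    # n - 1 degrees of freedom
    ss_reg = np.sum((y_hat - y.mean()) ** 2)  # 1 degree of freedom
    ss_error = np.sum((y - y_hat) ** 2)       # n - 2 degrees of freedom
    ms_reg = ss_reg / 1
    ms_error = ss_error / (n - 2)
    f0 = ms_reg / ms_error
    p = stats.f.sf(f0, 1, n - 2)              # reject H0 if f0 > F(alpha, 1, n-2)
    return ss_reg, ss_error, ss_total, f0, p
```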
23.5 Simple Linear Regression

ANOVA Table for Simple Regression

Source of Variation   Sum of Squares      Degrees of Freedom   Mean Square         $F_0$
Regression            $SS_{regression}$   $1$                  $MS_{regression}$   $MS_{regression}/MS_{error}$
Error                 $SS_{error}$        $n - 2$              $MS_{error}$
Total                 $SS_{total}$        $n - 1$
23.5 Simple Linear Regression
• The coefficient of determination is a ratio of the explained
variation to total variation, which equates to:
$$R^2 = 1 - \frac{SS_{error}}{SS_{total}} = \frac{SS_{regression}}{SS_{total}} = \frac{\sum(\hat{y}_i - \bar{y})^2}{\sum(y_i - \bar{y})^2}$$
• The multiplication of this coefficient by 100 yields the
percentage variation explained by the least-squares
method. A higher percentage indicates a better least-
squares predictor.
23.5 Simple Linear Regression
• If a variable is added to a model equation, $R^2$ will increase
even if the variable has no real value. A compensation for
this is an adjusted value, $R^2(\text{adj})$, which is an
approximately unbiased estimate of the population $R^2$:

$$R^2(\text{adj}) = 1 - \frac{SS_{error}/(n - p)}{SS_{total}/(n - 1)}$$

• where $p$ is the number of terms in the regression equation
and $n$ is the number of observations.
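As a quick check, a one-function Python sketch of this adjustment; the numeric comment uses the delivery-data ANOVA values that appear in the Minitab output of Example 23.2:

```python
def r_squared_adjusted(ss_error, ss_total, n, p):
    """R^2(adj) = 1 - [SS_error/(n - p)] / [SS_total/(n - 1)]."""
    return 1 - (ss_error / (n - p)) / (ss_total / (n - 1))

# Delivery data (n = 25, p = 2 terms):
# 1 - (402.1/23) / (5784.5/24) ~ 0.927, matching R-Sq(adj) = 92.7%
```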
23.5 Simple Linear Regression
• The correlation coefficient of the population ($\rho$) and its
sample estimate ($r$) are intimately connected with a bivariate
population known as the bivariate normal distribution. This
distribution is created from the joint frequency distributions
of the modeled variables. The frequencies have an elliptical
concentration.
Supplement: Inferences Using the
Least-Squares Coefficients
• When two variables have a linear relationship,
the scatterplot tends to be clustered around a line
known as the least squares line.
• We think of the slope and intercept of the least-
squares line as estimates of the slope and
intercept of the true regression line.
Vocabulary
• The linear model is yi = β0 + β1xi+ εi
• The dependent variable is yi
• The independent variable is xi
• The regression coefficients are β0 and β1
• The error is εi
• The line y = β0 + β1x is the true regression line.
• The quantities $\hat{\beta}_0$ and $\hat{\beta}_1$ are called the least-squares
coefficients and can be computed from the data.
Assumptions for Errors in
Linear Models
In the simplest situation, the following assumptions
are satisfied:
1. The errors $\varepsilon_1, \ldots, \varepsilon_n$ are random and independent.
In particular, the magnitude of any error $\varepsilon_i$ does
not influence the value of the next error $\varepsilon_{i+1}$.
2. The errors $\varepsilon_1, \ldots, \varepsilon_n$ all have mean 0.
3. The errors $\varepsilon_1, \ldots, \varepsilon_n$ all have the same variance,
which we denote by $\sigma^2$.
4. The errors $\varepsilon_1, \ldots, \varepsilon_n$ are normally distributed.
Distribution
In the linear model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, under
assumptions 1 through 4, the observations $y_1, \ldots, y_n$
are independent random variables that follow the
normal distribution. The mean and variance of $y_i$
are given by

$$\mu_{y_i} = \beta_0 + \beta_1 x_i, \qquad \sigma^2_{y_i} = \sigma^2$$

The slope $\beta_1$ represents the change in the mean of $y$
associated with an increase of one unit in the value
of $x$.
More Distributions
Under assumptions 1 through 4:
• The quantities $\hat{\beta}_0$ and $\hat{\beta}_1$ are normally distributed random variables.
• The means of $\hat{\beta}_0$ and $\hat{\beta}_1$ are the true values $\beta_0$ and $\beta_1$, respectively.
• The standard deviations of $\hat{\beta}_0$ and $\hat{\beta}_1$ are estimated with

$$s_{\hat{\beta}_0} = s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}} \qquad \text{and} \qquad s_{\hat{\beta}_1} = \frac{s}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$

where

$$s = \sqrt{\frac{(1 - r^2)\sum_{i=1}^{n}(y_i - \bar{y})^2}{n - 2}}$$

is an estimate of the error standard deviation $\sigma$.
Example
For the Hooke's law data, compute $s$, $s_{\hat{\beta}_1}$, and $s_{\hat{\beta}_0}$.
(Worked solution shown as a figure; not reproduced here.)
Notes
1. Since there is a measure of variation of $x$ in the
denominator of both of the uncertainties we just
defined, the more spread out the $x$'s are, the smaller
the uncertainties in $\hat{\beta}_0$ and $\hat{\beta}_1$.
2. Use caution: if the range of $x$ values extends
beyond the range where the linear model holds,
the results will not be valid.
3. The quantities $(\hat{\beta}_0 - \beta_0)/s_{\hat{\beta}_0}$ and $(\hat{\beta}_1 - \beta_1)/s_{\hat{\beta}_1}$
have Student's $t$ distribution with $n - 2$ degrees of
freedom.
Confidence Intervals
• Level $100(1 - \alpha)\%$ confidence intervals for $\beta_0$ and $\beta_1$
are given by $\hat{\beta}_0 \pm t_{n-2,\alpha/2}\, s_{\hat{\beta}_0}$ and $\hat{\beta}_1 \pm t_{n-2,\alpha/2}\, s_{\hat{\beta}_1}$
• A level $100(1 - \alpha)\%$ confidence interval for $\beta_0 + \beta_1 x$
is given by $\hat{\beta}_0 + \hat{\beta}_1 x \pm t_{n-2,\alpha/2}\, s_{\hat{y}}$, where

$$s_{\hat{y}} = s\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
Prediction Intervals
• A level $100(1 - \alpha)\%$ prediction interval for a future
observation of $y$ at a given $x$ is given by
$\hat{\beta}_0 + \hat{\beta}_1 x \pm t_{n-2,\alpha/2}\, s_{pred}$, where

$$s_{pred} = s\sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
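A combined Python sketch of both interval types (function and variable names are illustrative):

```python
import numpy as np
from scipy import stats

def intervals_at(x0, x, y, alpha=0.05):
    """CI for the mean response and PI for a new observation at x0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    b1 = np.sum(y * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))   # estimate of error std. deviation
    Sxx = np.sum((x - x.mean()) ** 2)
    t = stats.t.ppf(1 - alpha / 2, n - 2)
    y0 = b0 + b1 * x0
    s_yhat = s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / Sxx)       # mean response
    s_pred = s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx)   # new observation
    ci = (y0 - t * s_yhat, y0 + t * s_yhat)
    pi = (y0 - t * s_pred, y0 + t * s_pred)
    return ci, pi
```

Note that $s_{pred}$ always exceeds $s_{\hat{y}}$, so a prediction interval for an individual observation is wider than the confidence interval for the mean response at the same $x$.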
Inferences on the Population
Correlation
• When we have a random sample from a population
of ordered pairs, the correlation coefficient, r, is often
called the sample correlation.
• We have the true population correlation, ρ.
• If the population of ordered pairs has a certain
distribution known as a bivariate normal
distribution, then the sample correlation can be
used to construct CI’s and perform hypothesis tests
on the population correlation.
23.6 Analysis of Residuals
• For our analysis, modeling errors are assumed to be
normally and independently distributed with mean zero and
a constant but unknown variance. An abbreviation for this
assumption is 𝑁𝐼𝐷(0, 𝜎2).
• An important method for testing the $NID(0, \sigma^2)$ assumption
of an experiment is residual analysis (a residual is the
difference between the observed value and the
corresponding fitted value). Residual analyses play an
important role in investigating the adequacy of the fitted
model and in detecting departures from the model.
23.6 Analysis of Residuals
• Residual analysis techniques include the following (see the
sketch after this list):
– Checking the normality assumption through a normal
probability plot and/or histogram of the residuals.
– Checking for correlation between residuals by plotting
residuals in time sequence.
– Checking for correctness of the model by plotting residuals
versus fitted values.
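A minimal Matplotlib/SciPy sketch of these three diagnostic plots (the function name is mine):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def residual_plots(x, y):
    """Three standard residual diagnostics for a fitted simple linear regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b1 = np.sum(y * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    fitted = b0 + b1 * x
    resid = y - fitted

    fig, ax = plt.subplots(1, 3, figsize=(12, 4))
    stats.probplot(resid, dist="norm", plot=ax[0])  # normality check
    ax[0].set_title("Normal probability plot")
    ax[1].plot(resid, marker="o")                   # time/observation order
    ax[1].set_title("Residuals vs. order")
    ax[2].scatter(fitted, resid)                    # model adequacy
    ax[2].axhline(0, linestyle="--")
    ax[2].set_title("Residuals vs. fitted")
    plt.show()
```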
23.7 Analysis of
Residuals: Normality Assessment
• If the 𝑁𝐼𝐷(0, 𝜎2) assumption is valid, a histogram plot of the
residuals should look like a sample from a normal
distribution. Expect considerable departures from a
normality appearance when the sample size is small. A
normal probability plot of the residuals can similarly be
conducted. If the underlying error distribution is normal, the
plot will resemble a straight line.
• Commonly a residual plot will show one point that is much
larger or smaller than the others. Such a residual is typically
called an outlier. One or more outliers can distort the
analysis.
23.7 Analysis of
Residuals: Normality Assessment
• To perform a rough check for outliers, substitute residual
error values into:
$$d_{ij} = \frac{e_{ij}}{\sqrt{MS_E}}$$
• and examine the standardized residual values. About 68%
of the standardized residuals should fall within a $d_{ij}$
value of 1, about 95% within a $d_{ij}$ value of 2, and
almost all (99.7%) within a $d_{ij}$ value of 3.
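A rough-check sketch in Python (the function name and cutoff argument are mine):

```python
import numpy as np

def flag_outliers(residuals, ms_error, cutoff=3.0):
    """Rough outlier check: standardized residuals d = e / sqrt(MS_error)."""
    d = np.asarray(residuals, dtype=float) / np.sqrt(ms_error)
    # Roughly 68% of |d| should fall within 1, 95% within 2, 99.7% within 3
    return np.where(np.abs(d) > cutoff)[0]  # indices of suspect observations
```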
23.8 Analysis of
Residuals: Time Sequence
• A plot of residuals in time order of data collection helps
detect correlation between residuals. A tendency toward runs
of positive or negative residuals indicates positive correlation.
• This implies a violation of the independence assumption. An
individuals chart of residuals in chronological order by
observation number can verify the independence of errors.
Positive autocorrelation occurs when residuals do not
change signs as frequently as would be expected, while
negative autocorrelation is indicated when the residuals
change signs very frequently. This problem is best avoided
when the data are initially collected.
23.9 Analysis of Residuals: Fitted
Values
• Outliers appear as points that are either much higher
or lower than normal residual values. These points should
be investigated. Perhaps someone recorded a number
wrong. Perhaps an evaluation of this sample provides
additional knowledge that leads to a major process
improvement breakthrough.
• Nonconstant variance is indicated when the difference
between the lowest and highest residual values either
increases or decreases for an increase in the fitted values.
A measurement instrument whose error is proportional to
the measured value could cause this.
23.10 Example 23.2: Simple Linear Regression

Regression Analysis: Delivery Time (y) versus Number of Cases (x)
The regression equation is
Delivery Time (y) = 3.32 + 2.18 Number of Cases (x)
Predictor Coef SE Coef T P
Constant 3.321 1.371 2.42 0.024
Number of Cases (x) 2.1762 0.1240 17.55 0.000
S = 4.18140 R-Sq = 93.0% R-Sq(adj) = 92.7%
Analysis of Variance
Source DF SS MS F P
Regression 1 5382.4 5382.4 307.85 0.000
Residual Error 23 402.1 17.5
Total 24 5784.5
23.10 Example 23.2: Simple Linear
Regression
Unusual Observations
Number of Delivery
Obs Cases (x) Time (y) Fit SE Fit Residual St Resid
9 30.0 79.240 68.606 2.764 10.634 3.39RX
22 26.0 52.320 59.901 2.296 -7.581 -2.17RX
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.
Minitab: Stat → Regression → Regression
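The same analysis can be reproduced outside Minitab; for example, a short statsmodels sketch (assuming the Table 23.1 data) that yields essentially the same coefficient table and ANOVA results:

```python
import numpy as np
import statsmodels.api as sm

cases = np.array([7, 3, 3, 4, 6, 7, 2, 7, 30, 5, 16, 10, 4,
                  6, 9, 10, 6, 7, 3, 17, 10, 26, 9, 8, 4])
delivery_time = np.array([16.68, 11.50, 12.03, 14.88, 13.75, 18.11, 8.00, 17.83,
                          79.24, 21.50, 40.33, 21.00, 13.50, 19.75, 24.00, 29.00,
                          15.35, 19.00, 9.50, 35.10, 17.90, 52.32, 18.75, 19.83, 10.75])

model = sm.OLS(delivery_time, sm.add_constant(cases)).fit()
print(model.summary())  # coefficients ~ 3.32 and ~ 2.18, R-squared ~ 0.93
```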
23.9 Analysis of Residuals: Fitted Values
(Fitted line plot; Minitab: Stat → Regression → Fitted Line Plot)
• The two unusual observations are a large extrapolation from
the majority of the data used in fitting this model. In addition,
the plots indicate that these values do not fit the general
model very well. The residuals versus fitted plot indicates
that there could also be an increase in the variability of
delivery time with an increase in the number of cases.
23.10 Example 23.2: Simple Linear Regression
(Four-in-one residual plots; Minitab: Stat → Regression → Regression → Graphs → Four in one)
23.11 S4/IEE Assessments
• The regression model describes the region for which it
models and may not be an accurate representation for
extrapolated values.
• It is difficult to detect a cause-and-effect relationship if
measurement error is large.
• A true cause-and-effect relationship does not necessarily
exist when two variables are correlated.
• A process may have a third variable that affects the
process such that the two variables vary simultaneously.
• Least-squares predictions are based on historical data,
which may not represent future relationships.
23.11 S4/IEE Assessments
• An important independent variable for improving a process
may be disregarded from further consideration because a
study did not show correlation between this variable and
the response that needed improvement. However, this
variable might be shown to be important within a DOE
if the variable were operated outside its normal operating
range.