Linear Regression
Siana Halim
References:
Draper, N. R. and Smith, H., Applied Regression Analysis, 3rd Edition, John Wiley & Sons, Inc., 1998.
Montgomery, D. C. and Peck, E. A., Introduction to Linear Regression Analysis, 2nd Edition, 1992.
Outline
Introduction
Fitting a Straight Line by Least Squares
Measures of Model Adequacy
The Need for Statistical Analysis

In research laboratories, experiments are performed daily. These are usually small, carefully planned studies that result in data sets of modest size. The objective is often a quick yet accurate analysis, enabling the experimenter to move to "better" experimental conditions, which will produce a product with desirable characteristics.

A Ph.D. researcher may travel into an African jungle for a one-year period of intensive data-gathering on plants or animals. She will return with the raw material for her thesis and will put much effort into analyzing the data she has, searching for the messages that they contain. It will not be easy to obtain more data once her trip is completed, so she must carefully analyze every aspect of what data she has.
Regression analysis is a technique that can be used in any of these situations. In any system in which variable quantities change, it is of interest to examine the effects that some variables exert (or appear to exert) on others. We use the following names:

Predictor variables = input variables = inputs = X-variables = regressors = independent variables
Response variable = output variable = output = Y-variable = dependent variable

Response variable = Model function + Random error
Straight-Line Relationship between Two Variables

In much experimental work we wish to investigate how changes in one variable affect another variable. Suppose, for example, we record the heights and weights of a group of individuals. For any given height there is a range of observed weights, and vice versa. This variation will be partially due to measurement errors but primarily due to variation between individuals; thus no unique relationship between actual height and weight can be expected.

When we are concerned with the dependence of a random variable Y on a quantity X that is variable but not random, an equation that relates Y to X is usually called a regression equation.
Fitting a Straight Line by Least Squares

The data consist of 25 observations of Y, the pounds of steam used per month, and X, the average atmospheric temperature in degrees Fahrenheit. [Figure: scatter plot of Y against X.]

The linear first-order model is

$Y = \beta_0 + \beta_1 X + \varepsilon$    (1)

where $\beta_0$ and $\beta_1$ are the parameters and $\varepsilon$ is the random error.
Meaning of Linear Model

When we say that a model is linear or nonlinear, we are referring to linearity or nonlinearity in the parameters. The value of the highest power of a predictor variable in the model is called the order of the model. For example,

$Y = \beta_0 + \beta_1 X + \beta_{11} X^2 + \varepsilon$

is a second-order (in X), linear (in the $\beta$'s) regression model.
Least Squares Estimation

The model estimate is

$\hat{Y} = b_0 + b_1 X$    (2)

Suppose we have n sets of observations $(X_1, Y_1), \ldots, (X_n, Y_n)$; then we can write (1) as

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, \ldots, n$    (3)

so that the sum of squares of deviations from the true line (the sum of squares function) is

$S = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left( Y_i - \beta_0 - \beta_1 X_i \right)^2$    (4)
We can determine $b_0$ and $b_1$ by differentiating Eq. (4) first with respect to $\beta_0$ and then with respect to $\beta_1$, and setting the results equal to zero:

$\dfrac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^{n} \left( Y_i - \beta_0 - \beta_1 X_i \right)$    (5)

$\dfrac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^{n} X_i \left( Y_i - \beta_0 - \beta_1 X_i \right)$    (6)

Setting these equal to zero, where we substitute $(b_0, b_1)$ for $(\beta_0, \beta_1)$, gives

$\sum_{i=1}^{n} \left( Y_i - b_0 - b_1 X_i \right) = 0, \qquad \sum_{i=1}^{n} X_i \left( Y_i - b_0 - b_1 X_i \right) = 0$    (7)

so that the estimates $b_0$ and $b_1$ are solutions of the two equations

$b_0 n + b_1 \sum_{i=1}^{n} X_i = \sum_{i=1}^{n} Y_i, \qquad b_0 \sum_{i=1}^{n} X_i + b_1 \sum_{i=1}^{n} X_i^2 = \sum_{i=1}^{n} X_i Y_i$    (8)

These equations are called the normal equations ("normal" in the sense of orthogonal).
Least Square EstimationLeast Square Estimation
The solution of (8) XbYb
13
( )( )[ ]
( )∑ ∑
∑ ∑∑−=
−=
=22
11
10
nXX
nYXYXb
XbYbn
iiiii
12
11XY 0798.06230.13ˆ −=
( )( )( )
( )∑ −
−∑ −=
∑ ∑−
2XX
YYXX
nXX
i
ii
ii
Y
10
9
8
( )( )( ) ∑ −=∑ −=
∑ −=−∑ −=222 XnXXXS
YXnYXYYXXS
iiXX
iiiiXY
807060504030
7
6
( )∑ ∑ −=−= 222 YnYYYS iiYYX
807060504030
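As an illustration, these formulas translate directly into code. The following is a minimal Python sketch; the function name fit_line is ours, and x and y stand in for data such as the 25 steam observations, which are not reproduced here.

```python
import numpy as np

def fit_line(x, y):
    """Least squares estimates (b0, b1) for the line Y-hat = b0 + b1*X."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    s_xy = np.sum((x - x.mean()) * (y - y.mean()))  # S_XY
    s_xx = np.sum((x - x.mean()) ** 2)              # S_XX
    b1 = s_xy / s_xx                                # slope
    b0 = y.mean() - b1 * x.mean()                   # intercept
    return b0, b1
```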
The Analysis of Variance

We now tackle the question of how much of the variation in the data has been explained by the regression line. Consider the following identity:

$Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$

The residual $e_i = Y_i - \hat{Y}_i = (Y_i - \bar{Y}) - (\hat{Y}_i - \bar{Y})$ is the difference between two quantities: (1) the deviation of the observed $Y_i$ from the overall mean $\bar{Y}$, and (2) the deviation of the fitted $\hat{Y}_i$ from the overall mean. If we square both sides of this identity and sum from $i = 1, \ldots, n$, we obtain (the cross-product term sums to zero):

$\sum \left( Y_i - \bar{Y} \right)^2 = \sum \left( \hat{Y}_i - \bar{Y} \right)^2 + \sum \left( Y_i - \hat{Y}_i \right)^2$

[Figure: the fitted line $\hat{Y} = b_0 + b_1 X$, showing the decomposition of $Y_i - \bar{Y}$ into $\hat{Y}_i - \bar{Y}$ and $e_i = Y_i - \hat{Y}_i$ at a point $X_i$.]
Sum of Squares

(Sum of squares about the mean) = (Sum of squares due to regression) + (Sum of squares about regression)

ANOVA Table

Source of Variation            df       Sum of Squares (SS)                        Mean Square (MS)
Due to regression              1        $SS_R = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$    $MS_R$
About regression (residual)    n - 2    $SS_E = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$        $MS_E = s^2 = SS_E/(n-2)$
Total, corrected for mean      n - 1    $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$
R² Statistic

$R^2 = \dfrac{SS \text{ due to regression given } b_0}{SS \text{ total, corrected for the mean } \bar{Y}} = \dfrac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2}$

R² measures the "proportion of total variation about the mean $\bar{Y}$ explained by the regression." In fact, R is the correlation between $Y$ and $\hat{Y}$ and is usually called the multiple correlation coefficient; R² is then "the square of the multiple correlation coefficient."

The adjusted R² is

$R^2_{adj} = 1 - (1 - R^2)\,\dfrac{n-1}{n-p-1}$

where n is the number of samples and p the number of regressors in the linear model.
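A sketch of this decomposition in code, under the same assumptions as the earlier fit_line sketch (np.polyfit is used here for the fit; p = 1 regressor for the straight-line model):

```python
import numpy as np

def anova_r2(x, y, p=1):
    """SS decomposition, R^2 and adjusted R^2 for a straight-line fit."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(y)
    b1, b0 = np.polyfit(x, y, 1)                 # same estimates as fit_line above
    y_hat = b0 + b1 * x
    ss_total = np.sum((y - y.mean()) ** 2)       # total, corrected for the mean
    ss_reg = np.sum((y_hat - y.mean()) ** 2)     # due to regression
    ss_res = np.sum((y - y_hat) ** 2)            # about regression (residual)
    r2 = ss_reg / ss_total
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return ss_reg, ss_res, r2, r2_adj
```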
Inferences

The basic assumptions in the model $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, $i = 1, \ldots, n$, are:

A1. $\varepsilon_i$ is a random variable with mean zero and variance $\sigma^2$ (unknown); that is, $E(\varepsilon_i) = 0$, $V(\varepsilon_i) = \sigma^2$.

A2. $\varepsilon_i$ and $\varepsilon_j$ are uncorrelated, $i \neq j$, so that $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = 0$. Thus $E(Y_i) = \beta_0 + \beta_1 X_i$, $V(Y_i) = \sigma^2$, and $Y_i$ and $Y_j$, $i \neq j$, are uncorrelated.

A3. $\varepsilon_i$ is a normally distributed random variable, with mean zero and variance $\sigma^2$; i.e., $\varepsilon_i \sim N(0, \sigma^2)$. Under A3, $\varepsilon_i$ and $\varepsilon_j$ are not only uncorrelated but necessarily independent.
Variance and standard deviation of $b_1$:

$V(b_1) = \dfrac{\sigma^2}{\sum (X_i - \bar{X})^2} = \dfrac{\sigma^2}{S_{XX}}, \qquad \mathrm{sd}(b_1) = \dfrac{\sigma}{\left[ \sum (X_i - \bar{X})^2 \right]^{1/2}} = \dfrac{\sigma}{S_{XX}^{1/2}}$

$\mathrm{est.\ sd}(b_1) = \dfrac{s}{\left[ \sum (X_i - \bar{X})^2 \right]^{1/2}} = \dfrac{s}{S_{XX}^{1/2}}$
Confidence interval for $\beta_1$: if we assume that the variations of the observations about the line are normal, 100(1 − α)% confidence limits for $\beta_1$ are given by

$b_1 \pm t\!\left( n-2,\ 1 - \tfrac{1}{2}\alpha \right) \dfrac{s}{\left[ \sum (X_i - \bar{X})^2 \right]^{1/2}}$

where $t(n-2,\ 1-\tfrac{1}{2}\alpha)$ is the $100(1-\tfrac{1}{2}\alpha)$ percentage point of a t-distribution with (n − 2) degrees of freedom.

Test for $H_0\!: \beta_1 = \beta_{10}$ vs $H_1\!: \beta_1 \neq \beta_{10}$: calculate

$t = \dfrac{b_1 - \beta_{10}}{\mathrm{est.\ sd}(b_1)}$

and compare $|t|$ with $t(n-2,\ 1-\tfrac{1}{2}\alpha)$ from the t-table. If the observed $|t|$ value is smaller than the critical value, we cannot reject the hypothesis.
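A sketch of the interval and test for the slope, with the same assumptions as before (scipy supplies the t-distribution percentage point; the function name is ours):

```python
import numpy as np
from scipy import stats

def slope_inference(x, y, beta10=0.0, alpha=0.05):
    """t statistic for H0: beta1 = beta10, and 100(1-alpha)% limits for beta1."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(y)
    b1, b0 = np.polyfit(x, y, 1)
    s2 = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)      # residual mean square s^2
    se_b1 = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))    # est. sd(b1)
    t_stat = (b1 - beta10) / se_b1
    t_crit = stats.t.ppf(1 - alpha / 2, n - 2)           # t(n-2, 1 - alpha/2)
    return t_stat, (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
```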
Standard deviation of $b_0$:

$\mathrm{sd}(b_0) = \sigma \left\{ \dfrac{\sum X_i^2}{n \sum (X_i - \bar{X})^2} \right\}^{1/2}, \qquad \mathrm{est.\ sd}(b_0) = s \left\{ \dfrac{\sum X_i^2}{n \sum (X_i - \bar{X})^2} \right\}^{1/2}$

The 100(1 − α)% confidence interval for $\beta_0$ is

$b_0 \pm t\!\left( n-2,\ 1 - \tfrac{1}{2}\alpha \right) \left\{ \dfrac{\sum X_i^2}{n \sum (X_i - \bar{X})^2} \right\}^{1/2} s$

A t-test for $H_0\!: \beta_0 = \beta_{00}$ vs $H_1\!: \beta_0 \neq \beta_{00}$ will reject $H_0$ if $\beta_{00}$ falls outside the confidence interval, and vice versa; equivalently, compare

$t_0 = \dfrac{b_0 - \beta_{00}}{s \left\{ \sum X_i^2 \big/ \left[ n \sum (X_i - \bar{X})^2 \right] \right\}^{1/2}}$

with $t(n-2,\ 1-\tfrac{1}{2}\alpha)$.
F-Test for Significance of Regression

The ratio

$F_0 = \dfrac{MS_R}{MS_E}$

follows an F-distribution with (1, n − 2) degrees of freedom, provided that $\beta_1 = 0$. This fact can be used as a test of $H_0\!: \beta_1 = 0$ versus $H_1\!: \beta_1 \neq 0$: we compare the ratio $F_0$ with the 100(1 − α)% point of the tabulated F(1, n − 2) distribution, and reject $H_0$ if $F_0$ exceeds it.
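The F-test in code, under the same assumptions as the earlier sketches:

```python
import numpy as np
from scipy import stats

def f_test(x, y, alpha=0.05):
    """F0 = MS_R / MS_E; reject H0: beta1 = 0 if F0 exceeds F(1, n-2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(y)
    b1, b0 = np.polyfit(x, y, 1)
    y_hat = b0 + b1 * x
    ms_reg = np.sum((y_hat - y.mean()) ** 2) / 1    # 1 degree of freedom
    ms_res = np.sum((y - y_hat) ** 2) / (n - 2)     # n - 2 degrees of freedom
    f0 = ms_reg / ms_res
    return f0, f0 > stats.f.ppf(1 - alpha, 1, n - 2)
```

For simple linear regression this F statistic equals the square of the t statistic for the slope, so the two tests of $\beta_1 = 0$ agree.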
[Figure: scatter plots of y against x illustrating situations where the hypothesis $H_0\!: \beta_1 = 0$ is not rejected, and situations where it is rejected.]
Regression Analysis: Y vs X

The regression equation is
Y = 13.6 - 0.0798 X

Predictor     Coef       SE Coef    T        P
Constant      13.6230    0.5815     23.43    0.000
X             -0.07983   0.01052    -7.59    0.000

S = 0.890125   R-Sq = 71.4%   R-Sq(adj) = 70.2%

Analysis of Variance
Source          DF    SS       MS       F       P
Regression       1    45.592   45.592   57.54   0.000
Residual Error  23    18.223    0.792
Total           24    63.816
Measures of Adequacy

The major assumptions that we have made so far in our study of regression analysis are as follows:
1. The relationship between y and x is linear, or at least is well approximated by a straight line.
2. The error term ε has zero mean.
3. The error term ε has constant variance σ².
4. The errors are uncorrelated.
5. The errors are normally distributed.
Residual Analysis

We have defined the residuals as

$e_i = y_i - \hat{y}_i, \quad i = 1, \ldots, n$

where $y_i$ is an observation and $\hat{y}_i$ the corresponding fitted value. A residual may be viewed as the deviation between the data and the fit; it is a measure of the variability not explained by the regression model. It is also convenient to think of the residuals as the realized or observed values of the errors.

The residuals have several important properties. They have mean zero, and their approximate average variance is

$\dfrac{\sum_{i=1}^{n} (e_i - \bar{e})^2}{n-2} = \dfrac{\sum_{i=1}^{n} e_i^2}{n-2} = \dfrac{SS_E}{n-2} = MS_E$
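A short sketch of these quantities (the function name is ours, with the same fitting assumptions as before):

```python
import numpy as np

def residuals_and_mse(x, y):
    """Residuals e_i = y_i - y_hat_i and their approximate average variance MS_E."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b1, b0 = np.polyfit(x, y, 1)
    e = y - (b0 + b1 * x)
    return e, np.sum(e ** 2) / (len(e) - 2)   # SS_E / (n - 2) = MS_E
```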
Normal Probability Plot

Although small departures from normality do not affect the model greatly, gross non-normality is potentially more serious, as the t- and F-statistics, and the confidence and prediction intervals, depend on the normality assumption. Furthermore, if the errors come from a distribution with thicker or heavier tails than the normal, the least squares fit may be sensitive to a small subset of the data. To check normality, we use the QQ (quantile-quantile) plot of the residuals.

[Figure: QQ-plot shapes — ideal, heavy-tailed, light-tailed, right-skewed, and left-skewed residuals.]
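One way to draw the QQ plot is with scipy's probplot (a sketch; residuals is any 1-D array of residuals, e.g., from the earlier residuals_and_mse sketch):

```python
import matplotlib.pyplot as plt
from scipy import stats

def qq_plot(residuals):
    """Normal QQ plot; roughly linear points support the normality assumption."""
    stats.probplot(residuals, dist="norm", plot=plt)  # ordered residuals vs normal quantiles
    plt.title("Normal QQ plot of residuals")
    plt.show()
```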
Plot of Residuals against $\hat{y}_i$

A plot of the residuals $e_i$ versus the corresponding fitted values $\hat{y}_i$ (and versus $x_i$) is useful for detecting several common types of model inadequacy.

[Figure: residual patterns — satisfactory, funnel, double bow, nonlinear.]
Other Residual Plots

If the time sequence in which the data were collected is known, it may be instructive to plot the residuals against time order. The time sequence plot of residuals may indicate that the errors at one time period are correlated with those at other time periods (autocorrelation).

[Figure: residual-versus-time plots showing positive and negative autocorrelation.]
Testing Homogeneity of Pure Error (Optional)

1. Bartlett's Test. Let $s_1^2, s_2^2, \ldots, s_m^2$ be the estimates of $\sigma^2$ from the m groups of repeats, with $\nu_1, \nu_2, \ldots, \nu_m$ degrees of freedom, respectively, where $\nu_j = n_j - 1$ and m is the number of groups with repeat runs:

$s_j^2 = \dfrac{\sum_{u=1}^{n_j} \left( Y_{ju} - \bar{Y}_j \right)^2}{n_j - 1}$

Let $\nu = \nu_1 + \nu_2 + \cdots + \nu_m$,

$s_e^2 = \dfrac{\sum_{i=1}^{m} \nu_i s_i^2}{\nu}, \qquad C = 1 + \dfrac{1}{3(m-1)} \left( \sum_{i=1}^{m} \nu_i^{-1} - \nu^{-1} \right)$

The test statistic is then

$B = \dfrac{1}{C} \left\{ \nu \ln s_e^2 - \sum_{j=1}^{m} \nu_j \ln s_j^2 \right\}$

When the variances of the groups are all the same, B is distributed as $\chi^2_{m-1}$. A significant B value could indicate inhomogeneous variances; it could also indicate non-normality.
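A direct transcription of the statistic (a sketch; groups is assumed to be a list of arrays, one per group of repeat runs). scipy.stats.bartlett offers an equivalent packaged test:

```python
import numpy as np

def bartlett_b(groups):
    """Bartlett's B for m groups of repeats; compare to chi^2 with m-1 df."""
    m = len(groups)
    nu = np.array([len(g) - 1 for g in groups], float)   # nu_j = n_j - 1
    s2 = np.array([np.var(g, ddof=1) for g in groups])   # s_j^2
    nu_tot = nu.sum()                                    # nu = sum of nu_j
    se2 = np.sum(nu * s2) / nu_tot                       # pooled s_e^2
    c = 1 + (np.sum(1.0 / nu) - 1.0 / nu_tot) / (3 * (m - 1))
    return (nu_tot * np.log(se2) - np.sum(nu * np.log(s2))) / c
```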
2. Levene's Test using Means. Consider, in the jth group of repeats, the absolute deviations of the Y's from the means of their repeat group:

$z_{ju} = \left| Y_{ju} - \bar{Y}_j \right|, \quad u = 1, 2, \ldots, n_j$

Treat this as a one-way classification and compare the "between groups" mean square with the "within groups" mean square via an F-test. With

$\bar{z}_j = \sum_{u=1}^{n_j} z_{ju} \big/ n_j, \qquad \bar{z} = \sum_{j=1}^{m} \sum_{u=1}^{n_j} z_{ju} \big/ n, \qquad n = \sum_{j=1}^{m} n_j$

the appropriate F-statistic is then

$F = \dfrac{\sum_{j=1}^{m} n_j \left( \bar{z}_j - \bar{z} \right)^2 \big/ (m-1)}{\sum_{j=1}^{m} \sum_{u=1}^{n_j} \left( z_{ju} - \bar{z}_j \right)^2 \big/ \sum_{j=1}^{m} (n_j - 1)}$

The F-value is referred to $F_{m-1,\ \sum_{j=1}^{m}(n_j - 1)}$, using only the upper tail.
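The corresponding sketch for Levene's statistic, under the same assumptions on groups (scipy.stats.levene with center="mean" is the packaged equivalent):

```python
import numpy as np

def levene_f(groups):
    """Levene's F using group means; refer to F with (m-1, sum(n_j - 1)) df."""
    m = len(groups)
    z = [np.abs(np.asarray(g, float) - np.mean(g)) for g in groups]  # z_ju
    n_j = np.array([len(g) for g in groups])
    zbar_j = np.array([zj.mean() for zj in z])            # group means of z
    zbar = np.concatenate(z).mean()                       # grand mean of z
    between = np.sum(n_j * (zbar_j - zbar) ** 2) / (m - 1)
    within = sum(np.sum((zj - zj.mean()) ** 2) for zj in z) / np.sum(n_j - 1)
    return between / within
```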
Durbin-Watson Test

The Durbin-Watson test checks for a sequential dependence in which each error (and so each residual) is correlated with those before and after it in the sequence:

$d = \dfrac{\sum_{u=2}^{n} \left( e_u - e_{u-1} \right)^2}{\sum_{u=1}^{n} e_u^2}$

It can be shown that:
1. 0 ≤ d ≤ 4 always.
2. If successive residuals are positively serially correlated, that is, positively correlated in their sequence, d will be near 0.
3. If successive residuals are negatively correlated, d will be near 4, so that 4 − d will be near 0.
4. The distribution of d is symmetric about 2.
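The statistic is one line of numpy (a sketch; the residuals must be supplied in time order):

```python
import numpy as np

def durbin_watson(residuals):
    """d = sum over u>=2 of (e_u - e_{u-1})^2, divided by sum of e_u^2."""
    e = np.asarray(residuals, float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
```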
The test is conducted as follows. Compare d (or 4 − d, whichever is closer to zero) with $d_L$ and $d_U$ in the following table.
• If d < $d_L$, conclude that positive serial correlation is a possibility.
• If d > $d_U$, conclude that no serial correlation is indicated.
• If 4 − d < $d_L$, conclude that negative serial correlation is a possibility.
• If 4 − d > $d_U$, conclude that no serial correlation is indicated.
• If the d (or 4 − d) value lies between $d_L$ and $d_U$, the test is inconclusive.
An indication of positive or negative serial correlation would be cause for the model to be reexamined.
Significance level:      1%             2.5%            5%
  n                   dL     dU      dL     dU      dL     dU
  15                 0.81   1.07    0.95   1.23    1.08   1.36
  20                 0.95   1.15    1.08   1.28    1.20   1.41
  25                 1.05   1.21    1.18   1.34    1.29   1.45
  30                 1.13   1.26    1.25   1.38    1.35   1.49
  40                 1.25   1.34    1.35   1.45    1.44   1.54
  50                 1.32   1.40    1.42   1.50    1.50   1.59
  70                 1.43   1.49    1.51   1.57    1.58   1.64
 100                 1.52   1.56    1.59   1.63    1.65   1.69
 150                 1.61   1.64     -      -      1.72   1.75
 200                 1.66   1.68     -      -      1.76   1.78

Interpolate linearly for intermediate n-values.
Detection and Treatment of Outliers

An outlier is an extreme observation. Residuals that are considerably larger in absolute value than the others, say three or four standard deviations from the mean, are potential outliers. Outliers are data points that are not typical of the rest of the data.
Outliers should be carefully investigated to see if a reason for their unusual behavior can be found. Sometimes outliers are "bad" values, occurring as a result of unusual but explainable events. Examples include faulty measurement or analysis, incorrect recording of data, and failure of a measuring instrument. If this is the case, then the outlier should be corrected (if possible) or deleted from the data set.

Sometimes we find that the outlier is an unusual but perfectly plausible observation. Deleting such points to "improve the fit equation" can be dangerous, as it can give the user a false sense of precision in estimation or prediction.
Cook's Distance

Cook (1977) proposed that the influence of the ith data point be measured by the squared scaled distance

$D_i = \dfrac{\left( \hat{Y} - \hat{Y}(i) \right)' \left( \hat{Y} - \hat{Y}(i) \right)}{p s^2} = \dfrac{\left( b - b(i) \right)' X'X \left( b - b(i) \right)}{p s^2}$

where $\hat{Y} = Xb$ and $\hat{Y}(i) = Xb(i)$; here $b(i)$ is the least squares estimate, and $\hat{Y}(i)$ the vector of predicted values, obtained when the ith data point is deleted. p is the number of parameters in the model, and $s^2$ is the residual mean square.
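A leave-one-out sketch of Cook's distance for the straight-line model (refitting without each point, which is fine for small n; the function name is ours):

```python
import numpy as np

def cooks_distance(x, y):
    """Cook's D_i for each observation of a straight-line fit, by deletion."""
    X = np.column_stack([np.ones(len(y)), np.asarray(x, float)])  # design matrix with intercept
    y = np.asarray(y, float)
    n, p = X.shape                                    # p = number of parameters (here 2)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    s2 = np.sum((y - X @ b) ** 2) / (n - p)           # residual mean square
    d = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]  # fit without point i
        diff = X @ (b - b_i)                          # Y-hat minus Y-hat(i)
        d[i] = diff @ diff / (p * s2)
    return d
```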
Residual Plots

[Figure: residual plots for Y, four panels — normal probability plot of the residuals, residuals versus the fitted values, histogram of the residuals, and residuals versus the order of the data.]
-2