
WELCOME TO:

STAT.1010:

Riippuvuusanalyysi

Statistical Analysis of Contingency and

Regression

Bernd Pape

University of Vaasa

Department of Mathematics and Statistics

TERVETULOA!

www.uwasa.fi/∼bepa/Riippu.html


Literature:

• Amir D. Aczel: Complete Business Statistics

• Milton/Arnold: Introduction to Probability and Statistics

• Moore/McCabe: Introduction to the Practice of Statistics

• Conrad Carlberg: Statistical Analysis: Microsoft Excel

Old lecture notes in Finnish by Pentti Suomela, with SPSS as software, may be downloaded from the course homepage.

There you will also find a collection of statistical formulas and tables, which may and should be brought to the exam!

Course Homepage:

www.uwasa.fi/∼bepa/Riippu.html


1. Introduction

1.1. Confidence Intervals and Hypothesis Tests

Confidence Intervals

A point estimate is a single value calculated from the observation values in your sample in order to estimate some parameter of the underlying population. For example, the sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $n$ is the number of observations $x_i$ in the sample, is a point estimate of the underlying population mean $\mu$.

A problem with point estimates is that they almost surely differ from the true parameter: whenever we take a new sample with different observations, we will most probably get a different point estimate. This leaves us with many point estimates for many different samples, while there is only a single true parameter in the population, which cannot simultaneously be identical to all those point estimates from the different samples.

By adding and subtracting margins of error to your point estimate, you convert it into an interval estimate. This increases the chance that the true parameter is covered, at the price of a less precise estimate of its value.

If the sampling distribution of your estimator is known, then the margins of error can be determined such that the resulting interval covers the true parameter value with a precisely determined probability, $1-\alpha$ say. We have then found a confidence interval at confidence level $1-\alpha$.

The sampling distribution of an estimator is a smoothed histogram of its value in many samples, scaled such that its integral between two numbers yields the probability that the estimate comes out with a value somewhere between those numbers.

Example

We learned in STAT1030 that the standardized sample mean in a sample of $n$ observations

$$T_n = \frac{\bar{X} - \mu}{S/\sqrt{n}} \quad\text{with sample variance}\quad S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$$

is Student-t distributed with $n-1$ degrees of freedom.

Let $t_{\alpha/2}(n-1)$ denote the value of $T_n$ for which

$$P\big(T_n > t_{\alpha/2}(n-1)\big) = \frac{\alpha}{2},$$

such that by symmetry of the Student t-distribution

$$P\big(|T_n| > t_{\alpha/2}(n-1)\big) = P\left(|\bar{X} - \mu| > t_{\alpha/2}(n-1)\,\frac{S}{\sqrt{n}}\right) = \alpha,$$

and the $1-\alpha$ confidence interval for $\mu$ becomes

$$CI_{1-\alpha} = \left(\bar{X} - t_{\alpha/2}(n-1)\,\frac{S}{\sqrt{n}},\; \bar{X} + t_{\alpha/2}(n-1)\,\frac{S}{\sqrt{n}}\right).$$

$t_{\alpha/2}(n-1)$ is determined such that the area (= integral) under the density curve of the Student t-distribution with $n-1$ degrees of freedom between this value and $+\infty$ is exactly $\alpha/2$. These values are tabulated and available from Excel by typing =T.INV.2T(α; n−1) in any cell (or TINV(α; n−1) in versions before Excel 2010).
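The interval formula above can be sketched in a few lines of Python. This is only an illustration: the sample values are hypothetical, and the critical value $t_{0.025}(9) = 2.262$ is assumed to have been read from a standard t table rather than computed.

```python
import math
import statistics

def t_confidence_interval(sample, t_crit):
    """Return (lower, upper) = mean -/+ t_crit * S / sqrt(n)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    s = statistics.stdev(sample)      # n-1 denominator, like S above
    margin = t_crit * s / math.sqrt(n)
    return mean - margin, mean + margin

# Hypothetical sample of n = 10 observations (illustration only).
sample = [4.8, 5.1, 5.0, 4.7, 5.3, 5.2, 4.9, 5.0, 5.4, 4.6]
T_CRIT_95_DF9 = 2.262   # t_{0.025}(9), assumed to be taken from a t table

lower, upper = t_confidence_interval(sample, T_CRIT_95_DF9)
print(f"95% CI for the mean: ({lower:.3f}, {upper:.3f})")
```

Passing the critical value in as an argument keeps the sketch free of any statistics library; with Excel at hand, the same number would come from T.INV.2T(0.05; 9).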


Hypothesis Tests

Whenever we calculate something based upon sample observations only, it is called a statistic. An estimator is a statistic used for the special purpose of estimating an underlying population parameter, such as $\bar{x}$ for $\mu$.

Now suppose that rather than using a statistic in order to estimate some unknown parameter, you already have an opinion about what the value of that parameter should be, and you want to cross-check whether your opinion can reasonably be maintained in the light of the sample statistics you got.

For example, theory claims that something should be one on average, but in your sample you find that $\bar{x} = 2$. Does this mean that the theory is wrong, or is this just because you didn't see the full population? You can make informed decisions about this if you know the sampling distribution of your statistic under the assumption that the null hypothesis (e.g. $\mu = 1$) is true. This is called hypothesis testing.

The approach is to use the sampling distribution under the null hypothesis in order to calculate the probability of getting a sample statistic at least as extreme as the value you've got even though the null hypothesis holds true. This is called the p-value of the test.

If the p-value is large, it means that a sample statistic at least as extreme as yours is likely under the presumed parameter value, and you accept the null hypothesis that the population parameter is what you claimed it to be.

If the p-value is small, it means that a sample statistic at least as extreme as yours is unlikely under the presumed parameter value, and you reject the null hypothesis in favour of the alternative hypothesis that the population parameter is something else.

How exactly the p-value is determined depends upon whether you use a one-sided or a two-sided test.

In the case of testing whether the arithmetic mean has the specific value $\mu_0$ under the null hypothesis ($H_0: \mu = \mu_0$), the alternative hypothesis $H_1$ in a two-sided test is

$$H_1: \mu \neq \mu_0,$$

whereas there are two options for a one-sided test:

$$H_1: \mu < \mu_0 \quad\text{OR}\quad H_1: \mu > \mu_0.$$

For example, arithmetic sample means larger than the hypothesized population mean, $\bar{x} > \mu_0$, are evidence against $H_0$ in two-sided tests and in one-sided tests of the form $H_1: \mu > \mu_0$, but not in one-sided tests of the form $H_1: \mu < \mu_0$.

Similarly, arithmetic sample means smaller than the hypothesized population mean, $\bar{x} < \mu_0$, are evidence against $H_0$ in two-sided tests and in one-sided tests of the form $H_1: \mu < \mu_0$, but not in one-sided tests of the form $H_1: \mu > \mu_0$.

Hence we calculate the p-value of the test as

$$
\begin{aligned}
p &= P(\bar{X} \le \bar{x} \mid \mu=\mu_0) && \text{for } H_1: \mu < \mu_0,\\
p &= P(\bar{X} \ge \bar{x} \mid \mu=\mu_0) && \text{for } H_1: \mu > \mu_0,\\
p &= P(|\bar{X}-\mu_0| \ge |\bar{x}-\mu_0| \mid \mu=\mu_0) && \text{for } H_1: \mu \neq \mu_0,
\end{aligned}
$$

where $\bar{X}$ denotes the random variable containing the arithmetic mean, whose value differs from sample to sample, $\bar{x}$ denotes the particular value of the arithmetic mean we got for our sample at hand, and $\mid \mu = \mu_0$ stands for the condition that the population mean indeed has the value $\mu_0$, as stated in the null hypothesis.

The null hypothesis is rejected if the p-value falls below some prespecified level $\alpha$, the so-called significance level of the test; otherwise $H_0$ is accepted. $\alpha$ denotes the probability of committing a Type I error, which means falsely rejecting a true null hypothesis. We want this probability to remain small, hence we often choose $\alpha = 5\%$, if not even smaller.

Example

Using again that the standardized sample mean

$$T_n = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

in a sample of $n$ observations is $t(n-1)$-distributed, we get e.g. for the one-sided test against $H_1: \mu > \mu_0$,

$$p = P(\bar{X} \ge \bar{x} \mid \mu=\mu_0) = P(T_n \ge t),$$

where

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \quad\text{and}\quad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

are obtained in our sample. $p$ may hence be obtained by integrating the density of the Student t-distribution with $n-1$ degrees of freedom between $t$ and $+\infty$ and is available from Excel using the command T.DIST.RT(t; n−1). Similarly,

p = T.DIST.RT(−t; n−1) for $H_1: \mu < \mu_0$,

p = T.DIST.2T(|t|; n−1) for $H_1: \mu \neq \mu_0$.

Note that p-values of two-sided tests are twice the p-values of one-sided tests, because, by symmetry of the t-distribution:

$$P(|T| \ge |t|) = P(T \le -|t|) + P(T \ge |t|) = 2P(T \ge |t|).$$
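The statement that $p$ is the integral of the t density from $t$ to $+\infty$ can be illustrated without Excel. The following is a minimal sketch using only the Python standard library; the function names and the choice of Simpson's rule with 2000 steps are assumptions made for illustration, not how Excel computes T.DIST.RT.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_right_tail(t, df, steps=2000):
    """P(T >= t), via Simpson integration of the density between 0 and |t|."""
    a = abs(t)
    h = a / steps                       # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central = total * h / 3             # P(0 <= T <= |t|)
    upper = 0.5 - central               # P(T >= |t|), by symmetry
    return upper if t >= 0 else 1 - upper

def p_value(t, df, alternative):
    if alternative == "greater":        # H1: mu > mu0
        return t_right_tail(t, df)
    if alternative == "less":           # H1: mu < mu0
        return t_right_tail(-t, df)
    return 2 * t_right_tail(abs(t), df)  # H1: mu != mu0

print(round(p_value(3.100776, 8, "two-sided"), 6))
```

The three branches of `p_value` mirror T.DIST.RT(t; n−1), T.DIST.RT(−t; n−1) and T.DIST.2T(|t|; n−1). For very large degrees of freedom `math.gamma` overflows, so this sketch is meant for the small-sample cases discussed here.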


Example (continued)

If software is not available, we can still use the critical values $t_\alpha$ from statistical tables in order to decide whether we may reject $H_0$ at significance level $\alpha$ or not. In order to do that, express the condition $p < \alpha$ required for rejection of $H_0$ in terms of threshold values $t_\alpha$, recalling that $P(T > t_\alpha) = \alpha$.

Hence the condition $p < \alpha$ for rejecting $H_0: \mu = \mu_0$ against $H_1: \mu > \mu_0$ reads

$$\underbrace{P(T > t)}_{p} < \underbrace{P(T > t_\alpha)}_{\alpha},$$

which is equivalent to the condition $t > t_\alpha$, because the tail probability $P(T > t)$ decreases monotonically as $t$ increases.

Similarly, the condition for rejecting $H_0$ against $H_1: \mu < \mu_0$ is $t < -t_\alpha$, since this is equivalent to $P(T < t) < P(T < -t_\alpha)$.

The condition for rejecting $H_0: \mu = \mu_0$ against $H_1: \mu \neq \mu_0$ is $|t| > t_{\alpha/2}$, because it means the same as the conditions $t > t_{\alpha/2}$ for $t > 0$, or $t < -t_{\alpha/2}$ for $t < 0$; that is, $P(|T| > |t|) < P(|T| > t_{\alpha/2})$.

1.2. Scales of Measurement

Recall that the applicability of different statistical methods depends upon the measurement scale of the variable in question.

Variables on nominal scale cannot be used in calculations and don't reveal any order either. They can only be used for sorting statistical units into groups.

Example: Gender, profession, colours, ...

Variables on ordinal scale cannot be directly used in calculations either, but they reveal an implicit order. The ordering implies that it is meaningful to define ranks, on the basis of which it is possible to calculate quantiles such as the median.

Example: agree/partially agree/disagree, bad/average/good, ...

Variables on interval scale can directly be used in calculations involving only sums and differences between observation values, such as the arithmetic mean $\bar{x}$. However, the arbitrariness of the point of origin in these variables precludes meaningful calculation of statistics involving ratios of observation values.

Example: clock time, temperature, ...

Variables on ratio scale share the properties of variables on interval scale but additionally allow for meaningful calculation of statistics based upon ratios of observation values, such as the coefficient of variation or the harmonic mean, because the point of origin is uniquely defined as absence of the quantity measured.

Example: Money, Weight, Time intervals, ...

Note. Variables on interval scale can always be transformed into ratio scale by taking differences of the original variable.

Example: Clock time → Time intervals.

1.3. Interdependence of Statistical Variables

1.3.1. Both Variables on Nominal Scale

The analysis starts off from a so-called contingency table (ristiintaulukko), which displays the absolute counts (havaittut frekvenssit) $f_{ij}$ of statistical units belonging both to class $i$ of variable $X$ and to class $j$ of variable $Y$.

Dividing these counts by the total number of observations $n$ yields the so-called relative frequencies (suhteelliset frekvenssit) $p_{ij} = f_{ij}/n$, which, applying the relative frequency approach, may be interpreted as a proxy for the probability that a randomly chosen statistical unit belongs to class $i$ of $X$ and to class $j$ of $Y$ simultaneously. Therefore we also call the observed frequencies (both absolute and relative) the joint distribution (yhteysjakauma) of $X$ and $Y$.

The probabilities that a statistical unit belongs to a certain class of $X$, regardless of its classification with respect to $Y$, are given by $p_{i\bullet} = \sum_j p_{ij}$, which together make up the marginal distribution (reunajakauma) of $X$. Similarly, the collection of probabilities $p_{\bullet j} = \sum_i p_{ij}$ that statistical units belong to class $j$ of $Y$, regardless of their classification according to $X$, is called the marginal distribution of $Y$.

Now we know from probability calculus that two events are independent when their joint probability equals the product of their marginal probabilities, that is, $p_{ij} = p_{i\bullet}\,p_{\bullet j}$, which leads us to assume independence of $X$ and $Y$ when the observed frequencies $f_{ij}$ equal the so-called expected frequencies (odotettut frekvenssit) $e_{ij}$, where

$$e_{ij} = n\,p_{ij} = n \cdot p_{i\bullet} \cdot p_{\bullet j} = n \cdot \frac{f_{i\bullet}}{n} \cdot \frac{f_{\bullet j}}{n} = \frac{f_{i\bullet}\,f_{\bullet j}}{n}$$

with $f_{i\bullet} = \sum_j f_{ij}$ and $f_{\bullet j} = \sum_i f_{ij}$.

The test statistic used in order to assess whether $f_{ij} \approx e_{ij}$ is Pearson's $\chi^2$ statistic:

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{s}\frac{(f_{ij} - e_{ij})^2}{e_{ij}},$$

where we have assumed that the row variable (rivimuuttuja) $X$ has $r$ classes and the column variable (sarakemuuttuja) $Y$ has $s$ classes, such that the overall dimension of the contingency table is $(r \times s)$.

The null and alternative hypotheses of the $\chi^2$ independence test (riippumattomuustesti) are:

$H_0$: $X$ and $Y$ are statistically independent

$H_1$: $X$ and $Y$ are statistically dependent

If the null hypothesis holds true, then $\chi^2$ is approximately $\chi^2$-distributed with

$df = (r-1)(s-1)$ degrees of freedom.

Large values of $\chi^2$ lead to a rejection of the null hypothesis, which means that we believe there is dependence also out of sample.

When using statistical tables, one first decides on a significance level (merkitsevyystaso) $\alpha$, which denotes the risk one is ready to take of falsely rejecting a null hypothesis which in fact holds true, and compares the obtained value of the $\chi^2$ statistic with the corresponding critical value (kriittinen arvo) $\chi^2_\alpha(df)$ from the table.

Statistical software programmes report a p-value, as described in the introduction, which must be compared with the significance level. We accept $H_0$ if $p \ge \alpha$ and reject $H_0$ if $p < \alpha$.

In Excel: p = CHIDIST(χ2; df).

Usually $\alpha = 0.05$, such that:

$H_0$ is accepted if $\chi^2 \le \chi^2_\alpha(df)$ or $p \ge 0.05$,

$H_0$ is rejected if $\chi^2 > \chi^2_\alpha(df)$ or $p < 0.05$.

Note that statistical significance of $\chi^2$ alone, i.e. rejection of $H_0$, does not yet make any statement about the strength of dependence between the variables.

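Table. Tail fractiles $\chi^2_\alpha(df)$ of the $\chi^2$-distribution: $P(\chi^2 > \chi^2_\alpha(df)) = \alpha$. Rows: degrees of freedom df; columns: tail probability α.

df\α 0.995 0.990 0.975 0.950 0.900 0.100 0.050 0.025 0.010 0.001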

1 0.000039 0.000157 0.000982 0.003932 0.0158 2.706 3.841 5.024 6.635 10.827

2 0.0100 0.0201 0.0506 0.103 0.211 4.605 5.991 7.378 9.210 13.815

3 0.0717 0.115 0.216 0.352 0.584 6.251 7.815 9.348 11.345 16.266

4 0.207 0.297 0.484 0.711 1.064 7.779 9.488 11.143 13.277 18.466

5 0.412 0.554 0.831 1.145 1.610 9.236 11.070 12.832 15.086 20.515

6 0.676 0.872 1.237 1.635 2.204 10.645 12.592 14.449 16.812 22.457

7 0.989 1.239 1.690 2.167 2.833 12.017 14.067 16.013 18.475 24.321

8 1.344 1.647 2.180 2.733 3.490 13.362 15.507 17.535 20.090 26.124

9 1.735 2.088 2.700 3.325 4.168 14.684 16.919 19.023 21.666 27.877

10 2.156 2.558 3.247 3.940 4.865 15.987 18.307 20.483 23.209 29.588

11 2.603 3.053 3.816 4.575 5.578 17.275 19.675 21.920 24.725 31.264

12 3.074 3.571 4.404 5.226 6.304 18.549 21.026 23.337 26.217 32.909

13 3.565 4.107 5.009 5.892 7.041 19.812 22.362 24.736 27.688 34.527

14 4.075 4.660 5.629 6.571 7.790 21.064 23.685 26.119 29.141 36.124

15 4.601 5.229 6.262 7.261 8.547 22.307 24.996 27.488 30.578 37.698

16 5.142 5.812 6.908 7.962 9.312 23.542 26.296 28.845 32.000 39.252

17 5.697 6.408 7.564 8.672 10.085 24.769 27.587 30.191 33.409 40.791

18 6.265 7.015 8.231 9.390 10.865 25.989 28.869 31.526 34.805 42.312

19 6.844 7.633 8.907 10.117 11.651 27.204 30.144 32.852 36.191 43.819

20 7.434 8.260 9.591 10.851 12.443 28.412 31.410 34.170 37.566 45.314

21 8.034 8.897 10.283 11.591 13.240 29.615 32.671 35.479 38.932 46.796

22 8.643 9.542 10.982 12.338 14.041 30.813 33.924 36.781 40.289 48.268

23 9.260 10.196 11.689 13.091 14.848 32.007 35.172 38.076 41.638 49.728

24 9.886 10.856 12.401 13.848 15.659 33.196 36.415 39.364 42.980 51.179

25 10.520 11.524 13.120 14.611 16.473 34.382 37.652 40.646 44.314 52.619

26 11.160 12.198 13.844 15.379 17.292 35.563 38.885 41.923 45.642 54.051

27 11.808 12.878 14.573 16.151 18.114 36.741 40.113 43.195 46.963 55.475

28 12.461 13.565 15.308 16.928 18.939 37.916 41.337 44.461 48.278 56.892

29 13.121 14.256 16.047 17.708 19.768 39.087 42.557 45.722 49.588 58.301

30 13.787 14.953 16.791 18.493 20.599 40.256 43.773 46.979 50.892 59.702

40 20.707 22.164 24.433 26.509 29.051 51.805 55.758 59.342 63.691 73.403

50 27.991 29.707 32.357 34.764 37.689 63.167 67.505 71.420 76.154 86.660

60 35.534 37.485 40.482 43.188 46.459 74.397 79.082 83.298 88.379 99.608

70 43.275 45.442 48.758 51.739 55.329 85.527 90.531 95.023 100.425 112.317

80 51.172 53.540 57.153 60.391 64.278 96.578 101.879 106.629 112.329 124.839

90 59.196 61.754 65.647 69.126 73.291 107.565 113.145 118.136 124.116 137.208

100 67.328 70.065 74.222 77.929 82.358 118.498 124.342 129.561 135.807 149.449

Example: The 5% critical value for a two-way table with 2 levels per variable (df = 1) is 3.84, and the p-value for a $\chi^2$ statistic of 5.024 at df = 1 is 0.025.

Recall that in the special case of two-way tables, that is $r = s = 2$, the calculation of the $\chi^2$ statistic simplifies to

$$\chi^2 = \frac{n\,(f_{11}f_{22} - f_{12}f_{21})^2}{f_{1\bullet}\,f_{2\bullet}\,f_{\bullet 1}\,f_{\bullet 2}} \sim \chi^2(1) \text{ under } H_0,$$

where the table is laid out as

f11   f12   f1•
f21   f22   f2•
f•1   f•2   n

The $\chi^2$ statistic is only approximately $\chi^2(1)$-distributed (the $\chi^2$-distribution comes about by approximating the binomially distributed marginal sums with the normal distribution). The approximation becomes more precise by applying the following continuity correction (jatkuvuuskorjaus):

$$\chi^2 = \frac{n\,(|f_{11}f_{22} - f_{12}f_{21}| - n/2)^2}{f_{1\bullet}\,f_{2\bullet}\,f_{\bullet 1}\,f_{\bullet 2}}.$$

Note that the shortcuts described on this page are valid only for two-way tables and may not be applied to contingency tables of any other size than 2×2.
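As a rough illustration of the two shortcut formulas, the sketch below evaluates them for a hypothetical 2×2 table. The counts are invented for the example, and the `max(..., 0)` guard is a common safeguard that is not part of the formula above. For one degree of freedom the tail probability can be written with the complementary error function, which avoids any statistics library.

```python
import math

def chi2_2x2(f11, f12, f21, f22, continuity=False):
    """Shortcut chi-square statistic for a 2x2 table, optionally corrected."""
    n = f11 + f12 + f21 + f22
    diff = abs(f11 * f22 - f12 * f21)
    if continuity:
        # Safeguard (an addition, not in the formula): never overshoot zero.
        diff = max(diff - n / 2, 0.0)
    rows = (f11 + f12) * (f21 + f22)
    cols = (f11 + f21) * (f12 + f22)
    return n * diff ** 2 / (rows * cols)

def chi2_p_value_df1(x):
    """P(chi2(1) > x); with one degree of freedom this is erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(x / 2))

# Hypothetical counts, for illustration only.
plain = chi2_2x2(10, 20, 30, 40)
corrected = chi2_2x2(10, 20, 30, 40, continuity=True)
print(f"chi2 = {plain:.4f} (p = {chi2_p_value_df1(plain):.3f})")
print(f"corrected chi2 = {corrected:.4f} (p = {chi2_p_value_df1(corrected):.3f})")
```

The corrected statistic is never larger than the uncorrected one, so the correction makes the test more conservative.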


A small p-value tells us that the assumption of independence probably does not hold, that is, the row and the column variable are probably dependent. However, the p-value says nothing about how strong this dependence actually is.

The most useful measure of dependence for categorical data is Cramér's $V$, defined as

$$V = \sqrt{\frac{\chi^2}{\chi^2_{\max}}} = \sqrt{\frac{\chi^2}{n(k-1)}},$$

where $k$ is the smaller of the number of rows $r$ and columns $s$. It ranges from 0 (complete independence) to 1 (perfect dependence). As a rule of thumb, there is no substantial dependence if $V < 0.1$.

Example:

Satisfaction with the company's management:

f_ij           Vocational training
               Yes      No       Sum
Satisfied      87       112      199
Don't know     34       30       64
Unsatisfied    22       96       118
Sum            143      238      381

Expected frequencies under independence:

e_ij           Vocational training
               Yes      No       Sum
Satisfied      74.7     124.3    199
Don't know     24.0     40.0     64
Unsatisfied    44.3     73.7     118
Sum            143      238      381

$df = (3-1)(2-1) = 2$, $\chi^2_{0.05}(2) = 5.99$,

$$\chi^2 = \sum_{i=1}^{3}\sum_{j=1}^{2}\frac{(f_{ij} - e_{ij})^2}{e_{ij}} = 27.841 > \chi^2_{0.05}(2)$$

⇒ There is dependence, $V = \sqrt{\dfrac{27.841}{381(2-1)}} = 0.27$.
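The numbers in this example can be re-derived with a short script. The sketch below uses only the observed counts from the table; the variable names are illustrative, and the p-value relies on the fact that for df = 2 the $\chi^2$ tail probability has the closed form $e^{-x/2}$.

```python
import math

observed = [[87, 112], [34, 30], [22, 96]]   # counts f_ij from the table above

row_sums = [sum(row) for row in observed]          # f_i.
col_sums = [sum(col) for col in zip(*observed)]    # f_.j
n = sum(row_sums)

# Expected frequencies under independence: e_ij = f_i. * f_.j / n
expected = [[r * c / n for c in col_sums] for r in row_sums]

chi2 = sum((f - e) ** 2 / e
           for f_row, e_row in zip(observed, expected)
           for f, e in zip(f_row, e_row))

df = (len(row_sums) - 1) * (len(col_sums) - 1)
k = min(len(row_sums), len(col_sums))
cramers_v = math.sqrt(chi2 / (n * (k - 1)))

# For df = 2 only: P(chi2(2) > x) = exp(-x / 2).
p_value = math.exp(-chi2 / 2)

print(f"chi2 = {chi2:.3f}, df = {df}, V = {cramers_v:.2f}, p = {p_value:.1e}")
```

The closed-form p-value applies to df = 2 only; for other table sizes one would fall back on the table of critical values or on CHIDIST in Excel.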


1.3.2. Both Variables on Ordinal Scale

$\chi^2$ may still be applied, but it doesn't take the order of the ranked classification into account. The concordance (samansuuntaisuus) of the ranked classification is measured by Spearman's rank correlation coefficient $r_S$, also known as Spearman's $\rho$, and Kendall's $\tau$.

In order to calculate these measures, one first determines the ranks (sijaluvut) of $X$ and $Y$:

$$x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}, \qquad y_{(1)} \le y_{(2)} \le \dots \le y_{(n)}.$$

If there are no ties, then Spearman's $\rho$ and Kendall's $\tau$ are determined as

$$r_S = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)} \quad\text{and}\quad \tau = 1 - \frac{4Q}{n(n-1)},$$

where $d_i$ is the difference between ranks and $Q$ is the number of discordant pairs (parittaisten sijanvaihdosten lukumaara), that is, pairs where an increase in $X$ corresponds to a decrease in $Y$.

Example: (Snedecor & Cochran)
Ranking of seven rats' conditions by two observers:

Rat       Ranking by          Difference
Number    Obs. 1    Obs. 2    d_i     d_i^2
1         4         4         0       0
2         1         2         -1      1
3         6         5         1       1
4         5         6         -1      1
5         3         1         2       4
6         2         3         -1      1
7         7         7         0       0
                              Σd_i = 0   Σd_i^2 = 8

$$r_S = 1 - \frac{6\sum d_i^2}{n(n^2-1)} = 1 - \frac{6 \cdot 8}{7(49-1)} = 0.857.$$

In order to compute Kendall's $\tau$, rearrange the two rankings so that one of them is in increasing order:

Rat No.   2  6  5  1  4  3  7
Obs. 1    1  2  3  4  5  6  7
Obs. 2    2  3  1  4  6  5  7

Taking each rank given by observer 2 in turn, count the smaller ranks to the right of it and add these counts. For rank 2 the count is 1, since only rat 5 has a smaller rank. The six counts are 1, 1, 0, 0, 1, 0 (no need to count the extreme right rank), such that

$$Q = 3 \quad\text{and}\quad \tau = 1 - \frac{4Q}{n(n-1)} = 1 - \frac{12}{42} = \frac{5}{7} \approx 0.714.$$
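The two calculations above can be checked with a few lines of Python. This is a sketch for the tie-free case only; instead of rearranging the rankings, it counts discordant pairs directly by comparing every pair of rats, which yields the same $Q$.

```python
from itertools import combinations

obs1 = [4, 1, 6, 5, 3, 2, 7]   # ranks given by observer 1 to rats 1..7
obs2 = [4, 2, 5, 6, 1, 3, 7]   # ranks given by observer 2
n = len(obs1)

# Spearman: based on squared rank differences (valid only without ties).
sum_d2 = sum((a - b) ** 2 for a, b in zip(obs1, obs2))
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Kendall: Q counts pairs of rats that the two observers order differently.
pairs = list(zip(obs1, obs2))
Q = sum(1 for (a1, b1), (a2, b2) in combinations(pairs, 2)
        if (a1 - a2) * (b1 - b2) < 0)
tau = 1 - 4 * Q / (n * (n - 1))

print(f"sum d^2 = {sum_d2}, r_S = {r_s:.3f}, Q = {Q}, tau = {tau:.3f}")
```

The pairwise count costs n(n−1)/2 comparisons, which is negligible for rankings of this size.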


Recall that Spearman's rank correlation coefficient is indeed Pearson's linear correlation coefficient applied to ranks, which simplifies to the form given above only in the special case that there are no ties (that is, there are no multiple observations for the same rank).

Both coefficients obey $-1 \le r_S, \tau \le 1$, where

$r_S = \tau = 1$ ⇔ ranks in same order,

$r_S = \tau = -1$ ⇔ ranks in opposite order,

$r_S = \tau = 0$ ⇔ independent ranks.

We usually test independence of the ranks:

$$H_0: \rho_S = 0 \text{ or } \tau = 0.$$

The Real Statistics toolpack offers the exact p-values for this test.

If software is not available and you have a sufficiently large sample size $n$, you can still use the fact that the test statistic for $H_0: \rho_S = 0$ is in large samples approximately

$$z = r_S\sqrt{n} \sim N(0,1) \text{ under } H_0.$$

E.g. $|z| > 1.96$ is an indication that $r_S$ is significant at $\alpha = 5\%$. The same approximation works for the linear correlation coefficient $r$, but for $\tau$: $z \approx \frac{3}{2}\tau\sqrt{n}$.

1.3.3. Both Variables on Interval Scale

Linear association (lineaarinen riippuvuus) between two variables may be assessed using Pearson's linear correlation coefficient $r = r_{xy}$ if both variables are at least on interval scale.

Recall:

• Pearson's linear correlation coefficient is symmetric in the sense that it makes no difference which variable you call $x$ and which you call $y$ in calculating the correlation.

• $r_{xy}$ does not change when we change the units of $x$, $y$, or both.

• $r_{xy}$ measures only the strength of linear relationships. It does not describe curved relationships, no matter how strong they are.

• $r_{xy}$ is always a number between −1 and 1, with the sign of $r$ indicating the sign of the linear relationship.

• Pearson's linear correlation coefficient is more sensitive to outliers than Spearman's rank correlation coefficient and Kendall's $\tau$.

Correlation and Regression

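Recall that if $y$ is the sum of a linear function of $x$ and some error term $e$ with zero mean, that is,

$$y = \hat{y} + e, \quad\text{where}\quad \hat{y} = b_0 + b_1 x, \quad \bar{e} = 0,$$

then we may determine the coefficients of the so-called regression line (regressiosuora) by means of the method of least squares (OLS) (pns-menetalma) as

$$b_1 = r_{xy} \cdot \frac{s_y}{s_x} \quad\text{and}\quad b_0 = \bar{y} - b_1\bar{x},$$

where $s_x$ and $s_y$ denote the standard deviations of $x$ and $y$, respectively.

Recall:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s_x^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right),$$

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad s_y^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}\right),$$

$$r_{xy} = \frac{\sum_{i=1}^{n} x_i y_i - \dfrac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sqrt{\left[\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]\left[\sum_{i=1}^{n} y_i^2 - \dfrac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}\right]}}.$$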

Coefficients of Determination

Pearson's linear correlation coefficient $r_{xy}$ is related to the coefficient of determination $R^2$ (selityskerroin/-aste) of such a regression by

$$R^2 := r_{xy}^2,$$

where $R^2$ measures the fit (yhteensopivuus) of the regression line as:

$$R^2 = \frac{\text{variance of predicted values } \hat{y}}{\text{variance of observed values } y} = \frac{s_{\hat{y}}^2}{s_y^2}.$$

A better measure of fit when comparing regressions with varying numbers of regressors is the so-called adjusted $R^2$ (tarkistettu selitysaste), given by

$$\bar{R}^2 = 1 - \frac{n-1}{n-2}\,(1 - r_{xy}^2)$$

in the case of only one regressor. It approaches the ordinary $R^2$ for large $n$. Note that unlike $R^2$, $\bar{R}^2$ may become negative for correlations close to zero:

$$r_{xy} = 0 \;\Rightarrow\; \bar{R}^2 = -\frac{1}{n-2}.$$

Linear Regression in Excel

You can perform linear regression either with Excel's own data analysis tool or with the Real Statistics data analysis tool by Charles Zaiontz, available at www.real-statistics.com. Excel's own tool offers additional plots and the Real Statistics tool offers additional analysis, both of which will be discussed later in this course.

The regression output always contains:

• the coefficients of determination $R^2$ and $\bar{R}^2$;

• the standard error of the estimate $s_e$, which is an estimator for the unknown standard deviation of the error term out of sample;

• an Analysis of Variance table, which contains an F-test of $H_0: R^2 = 0$;

• a Parameter Estimates table containing the regression parameters and t-tests of the hypotheses that the respective parameter is 0.

The ANOVA table for simple linear regression

Analysis of variance (ANOVA) summarizes information about sources of variation in the data based on the framework

DATA = FIT + RESIDUAL.

The idea is that we may split up the deviation of the observed values $y_i$ from their arithmetic mean $\bar{y}$ into a sum of the deviation of the regression fit $\hat{y}_i$ from $\bar{y}$ and the deviation of $y_i$ from $\hat{y}_i$:

$$(y_i - \bar{y}) = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i).$$

If we square each of the three deviations above and then sum over all $n$ observations, it is an algebraic fact that the sums of squares add:

$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2,$$

which we rewrite as

$$SST = SSR + SSE,$$

where

$$SST = \sum (y_i - \bar{y})^2, \quad SSR = \sum (\hat{y}_i - \bar{y})^2, \quad SSE = \sum (y_i - \hat{y}_i)^2.$$

In the abbreviations SST, SSR, and SSE, SS stands for sum of squares, and the T, R, and E stand for total, regression, and error.

Because $s_y^2 = SST/(n-1)$ and $s_{\hat{y}}^2 = SSR/(n-1)$,

$$R^2 = \frac{s_{\hat{y}}^2}{s_y^2} = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}.$$

Each sum of squares comes with associated degrees of freedom, telling how many quantities used in its calculation can vary freely without changing any estimators of population parameters used in the same calculation.

$DFT = n - 1$
($n$ $y$-values minus one for calculating $\bar{y} = \frac{1}{n}\sum y_i$)

$DFR = 1$
(There are $n$ different $\hat{y}_i$, but they are all produced by varying the single variable $x$.)

$DFE = n - 2$
($n$ $y$-values minus 2 for calculating $b_0$ and $b_1$.)

Just like SST is the sum of SSR and SSE, the total degrees of freedom is the sum of the degrees of freedom for the regression model and for the error:

$$DFT = DFR + DFE.$$

The ratio of the sum of squares to the degrees of freedom is called the mean square:

$$\text{mean square} = \frac{\text{sum of squares}}{\text{degrees of freedom}}.$$

We already know $MST = \sum (y_i - \bar{y})^2/(n-1)$, which is just the sample variance $s_y^2$.

$$MSE = \frac{\sum (y_i - \hat{y}_i)^2}{n-2}$$

is called the mean square error. Finally,

$$MSR = \frac{\sum (\hat{y}_i - \bar{y})^2}{1} = SSR.$$

These can be used to assess whether $\beta_1 \neq 0$ out of sample, as is shown on the next slide.

ANOVA F -test for simple linear regression

Recall that while the method of least squares makes no assumptions about the data generating process behind the observations $x_i$ and $y_i$ and may thus always be applied, hypothesis tests about the coefficients of the regression line $y = \beta_0 + \beta_1 x$ require the error terms $e_i = y_i - (\beta_0 + \beta_1 x_i)$ to be independent and normally distributed with mean 0 and common standard deviation $\sigma$. Under this assumption:

$$F = \frac{MSR}{MSE} \sim F(1, n-2) \text{ under } H_0: \beta_1 = 0.$$

When $\beta_1 \neq 0$, MSR tends to be large relative to MSE. So large values of $F$ are evidence against $H_0$ in favour of the two-sided alternative $\beta_1 \neq 0$. For simple linear regression, this test is equivalent to the two-sided t-test for a significant slope coefficient to be discussed on the next slides.
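The bookkeeping of sums of squares, degrees of freedom and mean squares can be traced with a short calculation. The sketch below takes SSR, SSE and n from the fuel-efficiency regression output reproduced in these notes and rebuilds the remaining ANOVA quantities; the variable names are illustrative.

```python
n = 60               # observations in the fuel-efficiency example
ssr = 493.99177      # "Model" sum of squares from the regression output
sse = 57.94073       # "Error" sum of squares from the regression output

sst = ssr + sse                      # SST = SSR + SSE
dfr, dfe, dft = 1, n - 2, n - 1      # DFT = DFR + DFE

msr = ssr / dfr                      # equals SSR, since DFR = 1
mse = sse / dfe
f_stat = msr / mse                   # compare with F(1, n-2) under H0

r2 = ssr / sst                       # equivalently 1 - SSE/SST
adj_r2 = 1 - (dft / dfe) * (1 - r2)
s_e = mse ** 0.5                     # standard error of the estimate

print(f"SST = {sst:.5f}, MSE = {mse:.5f}, F = {f_stat:.2f}")
print(f"R^2 = {r2:.4f}, adjusted R^2 = {adj_r2:.4f}, s_e = {s_e:.5f}")
```

The p-value of the F statistic is not computed here; it would come from an F table or from the regression output itself.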


Student t-tests for Regression Parameters

Recall that the regression output

$$b_1 = r_{xy} \cdot \frac{s_y}{s_x}, \qquad b_0 = \bar{y} - b_1\bar{x}, \qquad s_e^2 = \frac{\sum e_i^2}{n-2}$$

are only estimates of the true regression parameters $\beta_1$, $\beta_0$ and $\sigma^2$, and these estimates vary from sample to sample. That is, we may regard them as sample-specific outcomes of random variables with associated expected values and variances.

Under certain conditions to be discussed soon, the expected values of these estimators are

$$E(b_1) = \beta_1, \quad E(b_0) = \beta_0, \quad\text{and}\quad E(s_e^2) = \sigma^2,$$

which is why we chose them in the first place. The standard deviations of the estimators for the regression coefficients turn out to be

$$\sigma_{b_1} = \frac{\sigma}{\sqrt{SSX}} \quad\text{and}\quad \sigma_{b_0} = \sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{SSX}},$$

where $SSX := \sum_{i=1}^{n}(x_i - \bar{x})^2$.

Replacing $\sigma$ with the standard error of the estimate $s_e$ yields the standard errors for the estimated regression coefficients:

$$SE_{b_1} = \frac{s_e}{\sqrt{SSX}} \quad\text{and}\quad SE_{b_0} = s_e\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{SSX}},$$

which belong to the standard regression output. These may in turn be used in order to generate confidence intervals and tests for the regression slope and intercept as follows:

To test $H_0: \beta_j = 0$ for $j = 1$ (slope) or $j = 0$ (intercept), compute the test statistic

$$T_j = \frac{b_j}{SE_{b_j}}.$$

Reject $H_0$ against

• $H_1: \beta_j \neq 0$ (two-sided) if $|T_j| \ge t_{\alpha/2}(n-2)$,

• $H_1: \beta_j > 0$ (one-sided) if $T_j \ge t_\alpha(n-2)$,

• $H_1: \beta_j < 0$ (one-sided) if $T_j \le -t_\alpha(n-2)$.

A level $(1-\alpha)$ confidence interval for $\beta_0$ is

$$b_0 \pm t_{\alpha/2}(n-2) \cdot SE_{b_0}.$$

A level $(1-\alpha)$ confidence interval for $\beta_1$ is

$$b_1 \pm t_{\alpha/2}(n-2) \cdot SE_{b_1}.$$

Fuel Efficiency as a Function of Speed (continued)

Number of Observations Read   60
Number of Observations Used   60

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1        493.99177     493.99177    494.50   <.0001
Error             58         57.94073       0.99898
Corrected Total   59        551.93250

Root MSE          0.99949    R-Square   0.8950
Dependent Mean   17.72500    Adj R-Sq   0.8932
Coeff Var         5.63887

Parameter Estimates

Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   95% Confidence Limits
Intercept   Intercept    1             -7.79632          1.15491     -6.75     <.0001   -10.10813   -5.48451
logmph      logmph       1              7.87424          0.35410     22.24     <.0001     7.16543    8.58305

In the preceding example, the t-statistics came about by dividing the coefficient estimates $b_0 = -7.796$ and $b_1 = 7.874$ by their respective standard errors $SE_{b_0} = 1.155$ and $SE_{b_1} = 0.354$.

The 95% confidence intervals for $\beta_0$ and $\beta_1$ require the $\alpha/2 = 2.5\%$ critical value of the t-distribution with $n - 2 = 58$ degrees of freedom (the same as for the residuals), which may be obtained from a table or by calling T.INV.2T(0.05,58) in Excel as

$$t_{\alpha/2}(n-2) = t_{0.025}(58) \approx 2.002.$$

The 95% confidence intervals for $\beta_1$ and $\beta_0$ are therefore:

$$b_1 \pm t_{0.025}(58) \cdot SE_{b_1} = 7.874 \pm 2.002 \cdot 0.354 = (7.165,\; 8.583),$$

$$b_0 \pm t_{0.025}(58) \cdot SE_{b_0} = -7.796 \pm 2.002 \cdot 1.155 = (-10.108,\; -5.485).$$

The fact that zero is not included in either of these confidence intervals implies that we can reject $H_0: \beta_1 = 0$ and $H_0: \beta_0 = 0$ at the 5% level in two-sided tests, and also in one-sided tests whose alternative points in the direction of the estimate.
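The t values and confidence limits of the output can be re-derived from the coefficient estimates and their standard errors. The following is a small sketch under the assumption that the critical value 2.002 quoted above is used; the helper name `t_and_interval` is hypothetical.

```python
T_CRIT = 2.002   # t_{0.025}(58), the critical value quoted in the text

def t_and_interval(b, se, t_crit=T_CRIT):
    """Return the t statistic b/SE and the interval b -/+ t_crit * SE."""
    return b / se, (b - t_crit * se, b + t_crit * se)

# (coefficient, standard error) pairs from the Parameter Estimates table.
results = {
    "Intercept": t_and_interval(-7.79632, 1.15491),
    "logmph": t_and_interval(7.87424, 0.35410),
}

for name, (t_value, (lower, upper)) in results.items():
    zero_inside = lower <= 0 <= upper
    print(f"{name}: t = {t_value:.2f}, 95% CI = ({lower:.3f}, {upper:.3f}), "
          f"contains 0: {zero_inside}")
```

Because the critical value is rounded to three decimals, the limits may differ from the software output in the last printed digit.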


Confidence intervals for the mean response and for individual observations

For any specific value of $x$, say $x^*$, the mean of the response $y$ in this subpopulation is

$$\mu_y = \beta_0 + \beta_1 x^*,$$

which we estimate from the sample as

$$\hat{\mu}_y = b_0 + b_1 x^*.$$

Alternatively, we may interpret this expression as a prediction for an individual observation, $\hat{y} = b_0 + b_1 x^*$ for $x = x^*$. The prediction interval for an individual observation, however, is wider than the confidence interval for the mean, due to the additional variation of individual responses about the mean response.

A level $(1-\alpha)$ confidence interval for $\mu_y$ is

$$\hat{\mu}_y \pm t_{\alpha/2}(n-2) \cdot SE_{\hat{\mu}}, \qquad SE_{\hat{\mu}} = s_e\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{SSX}}.$$

A level $(1-\alpha)$ prediction interval for $y$ is

$$\hat{y} \pm t_{\alpha/2}(n-2) \cdot SE_{\hat{y}}, \qquad SE_{\hat{y}} = s_e\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{SSX}}.$$
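To see how the two standard errors differ in practice, the sketch below evaluates both intervals for the car-price sample (age in years, price in $100) that appears in the regression example of these notes. The critical value $t_{0.025}(9) = 2.262$ is assumed to come from a t table, and the function name `intervals` is illustrative.

```python
import math

x = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]                 # age in years
y = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]    # price in $100
T_CRIT = 2.262                                        # t_{0.025}(n-2) with n-2 = 9

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
ssx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ssx
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_e = math.sqrt(sse / (n - 2))                        # standard error of the estimate

def intervals(x_star):
    """Return the fit, the CI for the mean response and the prediction interval."""
    fit = b0 + b1 * x_star
    core = 1 / n + (x_star - x_bar) ** 2 / ssx
    se_mean = s_e * math.sqrt(core)          # SE of the estimated mean response
    se_pred = s_e * math.sqrt(1 + core)      # SE for an individual observation
    ci = (fit - T_CRIT * se_mean, fit + T_CRIT * se_mean)
    pi = (fit - T_CRIT * se_pred, fit + T_CRIT * se_pred)
    return fit, ci, pi

fit, ci, pi = intervals(5)
print(f"fit = {fit:.2f}, CI = ({ci[0]:.1f}, {ci[1]:.1f}), PI = ({pi[0]:.1f}, {pi[1]:.1f})")
```

Both intervals are centred on the same fitted value; only the extra "1 +" under the square root widens the prediction interval.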


Fuel Efficiency as a Function of Speed (continued)

[Figure: 95% confidence limits for the mean response.]

[Figure: 95% confidence limits for individual predictions.]

1.4. Prerequisites in statistical inference

Statistical tests and confidence intervals are derived on the basis of some central assumptions. We usually assume that our observations are random samples from some prespecified distribution, most commonly the normal distribution or one of its derivatives. This, in turn, requires our data to have certain characteristics before a statistical method can be meaningfully applied.

A general precondition is that the statistical units/observations are:

• independent of each other,

• equally reliable,

• and that the sample size is sufficiently large.

Beyond these general prerequisites, there are preconditions that apply to the specific statistical method to be used.

1. Contingency tables

Pearson's $\chi^2$ used in independence and homogeneity tests is approximately $\chi^2$-distributed if there are sufficiently many observations, that is:

• all expected frequencies are greater than 1,

• no more than 20% of the expected counts are smaller than 5.

If any of those conditions is not met, there are two options:

It's best to use Fisher's exact test, which is available as an option from the Chi-Square Test of the Independence tool. It doesn't use the $\chi^2$-approximation at all and also works in small samples, where the assumptions for the $\chi^2$-test are not satisfied. It always delivers the precise p-value (so it's better than the $\chi^2$-test), but the Real Statistics Excel add-in calculates it only for tables with no more than 9 cells.

If you have a 2×2 table and no software is available, you should use the continuity correction discussed earlier.

2. Correlations

The tests for independence of ranks, \(H_0: \rho_S = 0\) or \(\tau = 0\), are exact and also work in small samples. The same holds for testing the linear correlation coefficient

\[ H_0: \rho = 0 \quad (x, y \text{ are linearly independent}) \]

with the t-test

\[ T = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t(n-2) \quad \text{under } H_0 \]

when both x and y are normally distributed. Otherwise the test is only approximate and requires a sufficiently large sample size. For small r and large n we get the approximate z-test:

\[ Z = r\sqrt{n} \sim N(0,1) \quad \text{under } H_0. \]

The Real Statistics data analysis tool also allows you to test \(H_0: \rho = \rho_0 \neq 0\) (known as Fisher’s test) in large samples, but again the result is only approximate. The more \(\rho_0\) deviates from 0, the larger the sample size has to be.
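As a rough illustration of a large-sample test of \(H_0: \rho = \rho_0\), the sketch below uses Fisher’s z-transformation, under which atanh(r) is approximately normal with mean atanh(ρ) and variance 1/(n − 3). Whether Real Statistics computes the test in exactly this way is an assumption; the function name and the numbers are hypothetical.

```python
from math import atanh, sqrt
from statistics import NormalDist

def fisher_z_test(r, n, rho0):
    """Approximate two-sided test of H0: rho = rho0 via Fisher's z-transformation."""
    z = (atanh(r) - atanh(rho0)) * sqrt(n - 3)
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided

# Hypothetical sample: r = 0.60 observed in n = 50 pairs, tested against rho0 = 0.30.
z_stat, p_value = fisher_z_test(r=0.60, n=50, rho0=0.30)
print(f"z = {z_stat:.3f}, approximate two-sided p = {p_value:.4f}")
```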


Example. Consider again the reputation and familiarity of 10 different brands:

reputation   7  2  2   9  0  1  9  0  1  0
familiarity  4  7  3  13  2  7  9  4  1  1

Pearson’s linear correlation coefficient for this sample is r = 0.7388. We test \(H_0: \rho = 0\) against the alternative \(H_1: \rho \neq 0\) by calculating the test statistic

\[ t = \frac{0.7388 \cdot \sqrt{10-2}}{\sqrt{1-0.7388^2}} \approx 3.10. \]

From a statistical table or by calling T.INV.2T in Excel we obtain the critical values

\[ t_{0.02/2}(8) = 2.896 \quad \text{and} \quad t_{0.01/2}(8) = 3.355, \]

so the two-sided p-value is somewhere between 1% and 2%. The exact p-value is

T.DIST.2T(3.100776; 8) = 0.014649.

Applying the large sample approximation yields

\[ Z = 0.7388\sqrt{10} = 2.336 \;\Rightarrow\; p \approx 2\%, \]

since \(P(Z \le 2.336)\) = NORMSDIST(2.336) ≈ 0.99, such that \(P(|Z| > 2.336) \approx 2(1-0.99) = 0.02\).
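The numbers in this example can be reproduced with a short Python sketch that uses only the standard library. The helper name pearson_r is hypothetical; the exact t-based p-value is left to Excel’s T.DIST.2T as above, and only the large-sample normal approximation is computed here.

```python
from math import sqrt
from statistics import NormalDist, mean

reputation = [7, 2, 2, 9, 0, 1, 9, 0, 1, 0]
familiarity = [4, 7, 3, 13, 2, 7, 9, 4, 1, 1]

def pearson_r(x, y):
    """Pearson's linear correlation coefficient from sums of squares and products."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

n = len(reputation)
r = pearson_r(reputation, familiarity)
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)      # t-statistic with n - 2 degrees of freedom
z = r * sqrt(n)                             # large-sample z-statistic
p_z = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided p-value from the normal approximation

print(f"r = {r:.4f}, t = {t:.4f}, Z = {z:.3f}, approximate p = {p_z:.4f}")
```

With only 10 observations the normal approximation gives a somewhat larger p-value than the exact t-test, which is why the t-test is preferred here.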


3. Regression Analysis

Example: (N. Weiss: Introductory Statistics) Consider the following sample of the prices (y in $100) of cars as a function of their age (x in years):

x   5    4   6   5   5   5   6   6    2   7   7
y  85  103  70  82  89  98  66  95  169  70  48

A regression of price upon age yields:

\[ \hat{y} = 195.47 - 20.26x. \]

Note that the predictions of a regression line are not completely accurate: for example, it predicts the price of 5-year-old cars as

$19547 − $2026 · 5 = $9417,

whereas the true prices of the 5-year-old cars in the sample vary from car to car between $8200 and $9800. The distribution of a response variable Y for a specific value of the predictor variable X is called the conditional distribution (ehdollinen jakauma) of Y given the value X = x, with conditional mean E(Y | X = x) and conditional variance V(Y | X = x).


The assumptions for regression inferences are:

1. Normal populations:
For each value of the predictor variable X, the conditional distribution of the response variable Y is a normal distribution.

2. Population regression line:
There are constants \(\beta_0\) and \(\beta_1\) such that, for each value x of the predictor variable X, the conditional mean of the response variable is:

\[ y = E(Y \mid X = x) = \beta_0 + \beta_1 x. \]

3. Equal standard deviations (variances):
The conditional standard deviations of the response variable Y are the same for all values of the predictor variable X:

\[ V(Y \mid X = x) = \sigma^2 = \text{const.} \]

The condition of equal standard deviations is called homoscedasticity.

4. Independent observations:
The observations of the response variable are independent of one another.


When assumptions 1–4 for regression inferences hold, then the random errors

\[ \varepsilon_i = Y_i - (\beta_0 + \beta_1 X_i) \]

are independent and normally distributed with mean zero and variance \(\sigma^2\). The statistical model for simple linear regression may therefore equivalently be stated as

\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad \varepsilon_i \ \text{iid} \ N(0, \sigma^2), \]

where ’iid’ stands for independent and identically distributed. Note that this model has three parameters: \(\beta_0\), \(\beta_1\) and \(\sigma\).

It turns out that the least squares estimators

\[ b_1 = r_{xy} \cdot \frac{s_y}{s_x} \quad \text{and} \quad b_0 = \bar{y} - b_1 \bar{x} \]

are unbiased estimators of \(\beta_1\) and \(\beta_0\) and are themselves normally distributed. An unbiased estimator of the unknown variance \(\sigma^2\) of the error term is given by the mean square error

\[ \text{MSE} = \frac{\sum e_i^2}{n-2}, \quad \text{where } e_i = y_i - (b_0 + b_1 x_i). \]

We define the standard error of the estimate as

\[ s_e = \sqrt{\text{MSE}} = \sqrt{\frac{\text{SSE}}{n-2}} = \sqrt{\frac{\sum e_i^2}{n-2}}. \]
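As a check of these formulas, the following Python sketch recomputes the regression line for the car price example (age in years, price in $100) from \(b_1 = r_{xy} s_y / s_x\) and \(b_0 = \bar{y} - b_1 \bar{x}\), and then derives SSE, MSE and the standard error of the estimate. It uses only the standard library; the variable names are chosen for illustration.

```python
from math import sqrt
from statistics import mean, stdev

age = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
price = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]

n = len(age)
x_bar, y_bar = mean(age), mean(price)
s_x, s_y = stdev(age), stdev(price)          # sample standard deviations (divisor n - 1)

# Sample correlation coefficient r_xy.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, price)) / (n - 1)
r_xy = s_xy / (s_x * s_y)

# Least squares estimators of slope and intercept.
b1 = r_xy * s_y / s_x
b0 = y_bar - b1 * x_bar

# Residuals, mean square error and standard error of the estimate.
residuals = [y - (b0 + b1 * x) for x, y in zip(age, price)]
sse = sum(e ** 2 for e in residuals)
mse = sse / (n - 2)
s_e = sqrt(mse)

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, MSE = {mse:.2f}, s_e = {s_e:.2f}")
```

The estimates agree with the fitted line 195.47 − 20.26x quoted in the example.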


Regression Diagnostics: Residual Analysis

1. Normality of Residuals

Normality of residuals may be checked either graphically, by considering the shape parameters of the distribution of residuals, or by performing statistical tests.

Graphs

A first visual check of the normality assumption is to look at the histogram of residuals. If the histogram is not bell-shaped, the residuals are not normally distributed.

A bell-shaped distribution does not, however, guarantee that the distribution of residuals is normal; for example, the t-distribution is also bell-shaped.

Excel’s data analysis tool can produce histograms, but it is not very good at finding meaningful bin sizes and is also clumsy to use.


Normal probability plots are plots with the ranked observations \(x_{(i)}\) on the horizontal axis and the z-values \(z_{q_i} = \Phi^{-1}(q_i)\) from the normal distribution corresponding to the respective quantile \(q_i\) (observed cumulative probability) of \(x_{(i)}\) on the vertical axis. In this form the normal probability plot is also called a quantile-quantile (Q-Q) plot.

Alternatively one may plot the expected normal cumulative probabilities \(\Phi(x_{(i)})\) on the vertical axis against the observed cumulative probabilities \(q_i\) on the horizontal axis in so-called probability-probability (P-P) plots.

In either case, if the observations are normally distributed, then the points of the normal probability plot should lie approximately on a straight line. Deviations from this line allow for detection of outliers and qualitative identification of skewness and kurtosis.

Q-Q plots are more generally used than P-P plots, because they stress deviations in the tails, where hypothesis tests are usually done (P-P plots stress deviations in the center).
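The coordinates of a Q-Q plot can be computed by hand, as in the following Python sketch. The plotting positions \(q_i = (i - 0.5)/n\) are an assumption (several conventions exist, and the one used by Real Statistics may differ), and the sample values are hypothetical residuals.

```python
from statistics import NormalDist

def qq_points(data):
    """Return pairs (x_(i), z_qi) for a normal Q-Q plot."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()  # standard normal distribution, mean 0 and sd 1
    # Assumed plotting positions q_i = (i - 0.5) / n for the i-th smallest value.
    return [(x, std_normal.inv_cdf((i - 0.5) / n))
            for i, x in enumerate(ordered, start=1)]

sample = [-1.2, 0.3, 0.8, -0.4, 2.1, -0.1, 0.5]  # hypothetical residuals
for x, z in qq_points(sample):
    print(f"{x:6.2f}  {z:7.3f}")
```

Plotting the second column against the first gives the Q-Q plot; roughly normal data produce points close to a straight line.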


Note. Excel’s regression tool has an option to produce a normal probability plot. This is not the relevant plot of the regression residuals though, but a much less useful normality plot for the unconditional y-values.

To get the relevant normality plot for the residuals, you must first produce those residuals from either Excel’s or the Real Statistics data analysis tool and then apply Descriptive Statistics and Normality/QQ Plot within the Real Statistics data analysis toolbox to the residuals.

Alternatively you may obtain a P-P plot from Excel by running a second regression with arbitrary x-values and the previously obtained residuals as y-values, asking for a normal probability plot within the regression window and ignoring all other output.


Shape Parameters

Skewness

Recall that a non-symmetric unimodal distribution is skewed to the right if the observations concentrate upon the lower values or classes (\(Md < \bar{x}\)), such that it has a long tail to the right, and skewed to the left if the observations concentrate upon the higher values or classes (\(Md > \bar{x}\)), such that the distribution has a long tail to the left. This asymmetry is indicated by the (coefficient of) skewness:

\[ g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{s^3}. \]

In general, the distribution is skewed to the left (right) if \(g_1\) is smaller (larger) than zero. Symmetric distributions have \(g_1 = 0\). That is, \(g_1 \neq 0\) (in particular when \(|g_1| > 2\sqrt{6/n}\), n = sample size) is evidence that X is not normally distributed.

Skewness renders PP- and QQ-plots curved rather than linear.
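The skewness coefficient and its rule of thumb may be computed as in the sketch below. Here s is taken to be the sample standard deviation with divisor n − 1, which is an assumption about the notation above; the function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def skewness(data):
    """g1 = (1/n) * sum((x - mean)^3) / s^3, with s the sample standard deviation."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)  # assumption: divisor n - 1
    return sum((x - x_bar) ** 3 for x in data) / n / s ** 3

def skewness_warning(data):
    """True if |g1| exceeds the rule-of-thumb bound 2 * sqrt(6 / n)."""
    return abs(skewness(data)) > 2 * sqrt(6 / len(data))

right_skewed = [1] * 20 + [30]  # hypothetical data with one large value
print(skewness(right_skewed), skewness_warning(right_skewed))
```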


Kurtosis

The (coefficient of) kurtosis, defined as

\[ g_2 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{s^4} - 3, \]

is a measure of peakedness (at least for unimodal distributions). That is, unimodal distributions with low kurtosis (\(g_2 < 0\)), called platykurtic, are rather evenly spread across all possible values or classes, and unimodal distributions with high kurtosis (\(g_2 > 0\)), called leptokurtic, have a sharp peak at their mode.

Distributions with \(g_2 \approx 0\) are called mesokurtic.

The kurtosis of the normal distribution is exactly zero. Therefore, the sign of \(g_2\) tells for unimodal distributions whether they are more (\(g_2 > 0\)) or less (\(g_2 < 0\)) sharply peaked than the normal distribution. A clear warning sign against normality is when \(|g_2| > 4\sqrt{6/n}\).

Kurtosis renders PP- and QQ-plots S-shaped.
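A matching sketch for the kurtosis coefficient follows. As before, s is assumed to be the sample standard deviation with divisor n − 1, and the function names and data sets are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def kurtosis(data):
    """g2 = (1/n) * sum((x - mean)^4) / s^4 - 3, with s the sample standard deviation."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)  # assumption: divisor n - 1
    return sum((x - x_bar) ** 4 for x in data) / n / s ** 4 - 3

def kurtosis_warning(data):
    """True if |g2| exceeds the rule-of-thumb bound 4 * sqrt(6 / n)."""
    return abs(kurtosis(data)) > 4 * sqrt(6 / len(data))

evenly_spread = list(range(1, 11))      # hypothetical platykurtic data
peaked = [0] * 18 + [10, -10]           # hypothetical leptokurtic data
print(kurtosis(evenly_spread), kurtosis_warning(evenly_spread))
print(kurtosis(peaked), kurtosis_warning(peaked))
```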


Normality Tests

The most popular test for normality is the Shapiro-Wilk test, available from the Descriptive Statistics and Normality tool. The null hypothesis is that the data are normally distributed and the alternative hypothesis is that they are not. So small p-values (e.g. p < 0.05) are evidence that the data are not normally distributed.


2. Linear Regression Relationship

Deviations from straight-line relationships are immediately evident from the scatterplot of the predictor and the response variables. Such deviations are also visible as systematic patterns instead of random scatter in so-called residual plots, where the residuals \(e_i\) are plotted either against the values of the predictor variable \(x_i\) or against the predicted response values \(\hat{y}_i\). These are available from Excel’s regression tool; usually, however, you will have to rescale the axes before getting a readable result.

3. Constant residual variance

This may also be checked from residual plots: any systematic pattern in the scatter of the residuals around zero (for example a spread that widens or narrows along the horizontal axis) contradicts the assumption of constant residual variance.


[Figure: Fuel Efficiency as a Function of Speed (Moore/McCabe). Scatterplot of mpg against mph with fitted line (R Sq Linear = 0.854), and plot of the unstandardized residuals against mph.]

[Figure: Miles per gallon versus logarithm of miles per hour. Scatterplot of mpg against logmph with fitted line (R Sq Linear = 0.895), and residual plot of the residuals against logmph.]

4. Independent Observations: Residual Autocorrelation and the Durbin-Watson Test

We say that the regression residuals are autocorrelated if the correlation of any residual with any of its preceding residuals is nonzero. Residual autocorrelation is the most serious violation of the assumptions of the statistical model for linear regression.

Common reasons for residual autocorrelation:

• Two time series are regressed upon each other.

• The dependence of Y upon X is nonlinear.

• Additional regressors are missing.

• There are trends or seasonal variation in the data.

• Missing data has been replaced by estimates.


The first-order autocorrelation (1. kertaluvun autokorrelaatio) \(\rho_1\), that is, the autocorrelation of any residual \(\varepsilon_i\) with its preceding value \(\varepsilon_{i-1}\), is assessed by the Durbin-Watson test statistic:

\[ d = \frac{\sum_{i=2}^{n}(e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}, \]

which estimates \(2(1-\rho_1)\), that is:

\(d \approx 2\) ⇒ residuals uncorrelated (OK),
\(d < 2\) ⇒ \(\varepsilon_i\) positively autocorrelated,
\(d > 2\) ⇒ \(\varepsilon_i\) negatively autocorrelated.
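The statistic is easy to compute directly from a list of residuals, as the following Python sketch shows. The residual series are hypothetical, and the lag-1 estimate \(r_1 = \sum e_i e_{i-1} / \sum e_i^2\) is one simple choice of estimator, used here only to illustrate the relation to \(2(1-\rho_1)\).

```python
def durbin_watson(residuals):
    """d = sum of squared successive differences / sum of squared residuals."""
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2
                    for i in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

def lag1_autocorrelation(residuals):
    """Simple estimate r1 = sum(e_i * e_(i-1)) / sum(e_i^2) (an assumed estimator)."""
    numerator = sum(residuals[i] * residuals[i - 1]
                    for i in range(1, len(residuals)))
    return numerator / sum(e ** 2 for e in residuals)

trending = [-3, -2, -1, 0, 1, 2, 3]       # hypothetical: neighbours are alike
alternating = [1, -1, 1, -1, 1, -1, 1]    # hypothetical: neighbours have opposite sign

for name, e in [("trending", trending), ("alternating", alternating)]:
    print(name, durbin_watson(e), 2 * (1 - lag1_autocorrelation(e)))
```

For short series the two printed numbers differ noticeably, because d equals \(2(1 - r_1)\) minus the end correction \((e_1^2 + e_n^2)/\sum e_i^2\); the correction becomes negligible as n grows.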

The critical values \(d_\alpha\) depend upon the data and are therefore not known exactly. But there are tabulated upper limits \(d_U\) and lower limits \(d_L\), which depend only upon the number of regressors (in our case 1) and the number of data points, such that

\[ d_L < d_\alpha < d_U. \]


To perform a Durbin-Watson test:

1. Choose a significance level (e.g. α = 0.05).

2. Calculate d (available from the Durbin-Watson test option within Real Statistics regression).

3. Look up:
\(d_L(\alpha/2)\) and \(d_U(\alpha/2)\) for a two-sided test,
\(d_L(\alpha)\) and \(d_U(\alpha)\) for a one-sided test.

4. (i) Two-sided: \(H_0: \rho_1 = 0\) vs. \(H_1: \rho_1 \neq 0\)
\(d \le d_L\) or \(d \ge 4 - d_L\) ⇒ reject \(H_0\).
\(d_U \le d \le 4 - d_U\) ⇒ accept \(H_0\).
otherwise ⇒ inconclusive.

(ii) One-sided: \(H_0: \rho_1 = 0\) vs. \(H_1: \rho_1 > 0\)
\(d \le d_L\) ⇒ reject \(H_0\).
\(d \ge d_U\) ⇒ accept \(H_0\).
otherwise ⇒ inconclusive.

(iii) One-sided: \(H_0: \rho_1 = 0\) vs. \(H_1: \rho_1 < 0\)
\(d \ge 4 - d_L\) ⇒ reject \(H_0\).
\(d \le 4 - d_U\) ⇒ accept \(H_0\).
otherwise ⇒ inconclusive.
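The decision rules of step 4 can be written down as a small Python function. This is a sketch only: the function name and the labels for the alternatives are hypothetical, and the bounds in the usage example are made-up numbers rather than values from a Durbin-Watson table. For the two-sided case the caller is assumed to pass the bounds looked up at α/2.

```python
def durbin_watson_decision(d, d_lower, d_upper, alternative="two-sided"):
    """Apply the Durbin-Watson decision rules for the given alternative hypothesis."""
    if alternative == "two-sided":      # H1: rho_1 != 0, bounds looked up at alpha/2
        if d <= d_lower or d >= 4 - d_lower:
            return "reject H0"
        if d_upper <= d <= 4 - d_upper:
            return "accept H0"
        return "inconclusive"
    if alternative == "positive":       # H1: rho_1 > 0
        if d <= d_lower:
            return "reject H0"
        if d >= d_upper:
            return "accept H0"
        return "inconclusive"
    if alternative == "negative":       # H1: rho_1 < 0
        if d >= 4 - d_lower:
            return "reject H0"
        if d <= 4 - d_upper:
            return "accept H0"
        return "inconclusive"
    raise ValueError("alternative must be 'two-sided', 'positive' or 'negative'")

# Made-up bounds for illustration only; real values come from a Durbin-Watson table.
d_lower, d_upper = 1.10, 1.37
for d in (0.9, 1.2, 2.0, 3.0):
    print(d, durbin_watson_decision(d, d_lower, d_upper, "two-sided"))
```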


The Durbin-Watson test is an option within the Real Statistics Linear Regression tool.

Note that Real Statistics runs this as a one-sided test against \(H_1: \rho_1 < 0\). For a two-sided test against \(H_1: \rho_1 \neq 0\), replace your α with α/2 in the corresponding field and also check 4 minus the critical values from the output. For a one-sided test against \(H_1: \rho_1 > 0\), use your original α and check only 4 minus the critical values from the output.

