WELCOME TO:
STAT.1010:
Riippuvuusanalyysi
Statistical Analysis of Contingency and
Regression
Bernd Pape
University of Vaasa
Department of Mathematics and Statistics
TERVETULOA!
www.uwasa.fi/∼bepa/Riippu.html
Literature:
• Amir D. Aczel: Complete Business Statistics
• Milton/Arnold: Introduction to Probability and Statistics
• Moore/McCabe: Introduction to the Practice of Statistics
• Conrad Carlberg: Statistical Analysis: Microsoft Excel
Old lecture notes in Finnish by Pentti Suomela, with SPSS as software, may be downloaded from the course homepage.
There you will also find a collection of statistical formulas and tables, which may and should be brought to the exam!
Course Homepage:
www.uwasa.fi/∼bepa/Riippu.html
1. Introduction
1.1. Confidence Intervals and Hypothesis Tests
Confidence Intervals
A point estimate is a single value calculated from the observation values in your sample in order to estimate some parameter of the underlying population. For example, the sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $n$ is the number of observations $x_i$ in the sample, is a point estimate of the underlying population mean $\mu$.
A problem with point estimates is that we can be almost sure that they do not equal the true parameter. Whenever we take a new sample with different observations, we will most probably get a different point estimate. This leaves us with many point estimates for many different samples, while there is only a single true parameter in the population, which cannot simultaneously be identical to all those point estimates.
By adding and subtracting margins of error to your point estimate, you convert it into an interval estimate. This increases the chance of covering the true parameter, at the price of a less precise estimate of its value.
If the sampling distribution of your estimator is known, then the margins of error can be determined such that the resulting interval covers the true parameter value with a precisely determined probability, say $1-\alpha$. We have then found a confidence interval at confidence level $1-\alpha$.
The sampling distribution of an estimator is a smoothed histogram of its values in many samples, scaled such that its integral between two numbers yields the probability that the estimate takes a value between those numbers.
Example
We learned in STAT1030 that the standardized sample mean in a sample of $n$ observations,
\[ T_n = \frac{\bar{X} - \mu}{S/\sqrt{n}} \quad \text{with sample variance} \quad S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2, \]
is Student t-distributed with $n-1$ degrees of freedom.
Let $t_{\alpha/2}(n-1)$ denote the value of $T_n$ for which
\[ P\left(T_n > t_{\alpha/2}(n-1)\right) = \frac{\alpha}{2}, \]
such that, by symmetry of the Student t-distribution,
\[ P\left(|T_n| > t_{\alpha/2}(n-1)\right) = P\left(|\bar{X} - \mu| > t_{\alpha/2}(n-1)\,\frac{S}{\sqrt{n}}\right) = \alpha, \]
and the $1-\alpha$ confidence interval for $\mu$ becomes
\[ CI_{1-\alpha} = \left(\bar{X} - t_{\alpha/2}(n-1)\,\frac{S}{\sqrt{n}},\; \bar{X} + t_{\alpha/2}(n-1)\,\frac{S}{\sqrt{n}}\right). \]
$t_{\alpha/2}(n-1)$ is determined such that the area (= integral) under the density curve of the Student t-distribution with $n-1$ degrees of freedom between this value and $+\infty$ is exactly $\alpha/2$. These values are tabulated and available from Excel by typing =T.INV.2T(α; n−1) in any cell (or =TINV(α; n−1) in Excel 2007 and earlier).
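The confidence interval formula can be tried out with a few lines of Python. This is only a minimal sketch: the sample values are made up for illustration, the function name `mean_confidence_interval` is hypothetical, and the critical value $t_{0.025}(9) = 2.262$ is taken from a t-table rather than computed.

```python
import math
import statistics

def mean_confidence_interval(xs, t_crit):
    """Return (lower, upper) of the CI for the mean, given t_{alpha/2}(n-1)."""
    n = len(xs)
    x_bar = statistics.fmean(xs)
    s = statistics.stdev(xs)              # sample standard deviation (divisor n-1)
    margin = t_crit * s / math.sqrt(n)    # margin of error
    return x_bar - margin, x_bar + margin

# Hypothetical sample of n = 10 observations (made up for illustration).
sample = [4.1, 5.3, 4.8, 5.9, 5.1, 4.4, 5.6, 4.9, 5.2, 4.7]
T_CRIT_95_DF9 = 2.262                     # tabulated t_{0.025}(9)

lower, upper = mean_confidence_interval(sample, T_CRIT_95_DF9)
print(f"95% CI for the mean: ({lower:.3f}, {upper:.3f})")
```

Passing the critical value in as an argument mirrors the use of a printed table; with software at hand it would come from a quantile function instead.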
Hypothesis Tests
Whenever we calculate something based upon sample observations only, it is called a statistic. An estimator is a statistic used for the special purpose of estimating an underlying population parameter, such as $\bar{x}$ for $\mu$.
Now suppose that, rather than using a statistic in order to estimate some unknown parameter, you already have an opinion about what the value of that parameter should be, and you want to cross-check whether your opinion can be reasonably maintained in the light of the sample statistics you got.
For example, theory claims that something should be one on average, but in your sample you find that $\bar{x} = 2$. Does this mean that the theory is wrong, or is this just because you didn't see the full population? You can make informed decisions about this if you know the sampling distribution of your statistic under the assumption that the null hypothesis (e.g. $\mu = 1$) is true. This is called hypothesis testing.
The approach is to use the sampling distribution under the null hypothesis in order to calculate the probability of getting a sample statistic at least as extreme as the value you have got, even though the null hypothesis holds true. This is called the p-value of the test.
If the p-value is large, it means that a sample statistic at least as extreme as yours is quite probable under the presumed parameter value, and you accept the null hypothesis that the population parameter is what you claimed it to be.
If the p-value is small, it means that a sample statistic at least as extreme as yours is improbable under the presumed parameter value, and you reject the null hypothesis in favour of the alternative hypothesis that the population parameter is something else.
How exactly the p-value is determined depends upon whether you use a one-sided or a two-sided test.
In the case of testing whether the arithmetic mean has the specific value $\mu_0$ under the null hypothesis ($H_0: \mu = \mu_0$), the alternative hypothesis $H_1$ in a two-sided test is
\[ H_1: \mu \ne \mu_0, \]
whereas there are two options for a one-sided test:
\[ H_1: \mu < \mu_0 \quad \text{OR} \quad H_1: \mu > \mu_0. \]
For example, arithmetic sample means larger than the hypothesized population mean, $\bar{x} > \mu_0$, are evidence against $H_0$ in two-sided tests and in one-sided tests of the form $H_1: \mu > \mu_0$, but not in one-sided tests of the form $H_1: \mu < \mu_0$.
Similarly, arithmetic sample means smaller than the hypothesized population mean, $\bar{x} < \mu_0$, are evidence against $H_0$ in two-sided tests and in one-sided tests of the form $H_1: \mu < \mu_0$, but not in one-sided tests of the form $H_1: \mu > \mu_0$.
Hence we calculate the p-value of the test as
\[ p = P(\bar{X} \le \bar{x} \mid \mu=\mu_0) \quad \text{for } H_1: \mu < \mu_0, \]
\[ p = P(\bar{X} \ge \bar{x} \mid \mu=\mu_0) \quad \text{for } H_1: \mu > \mu_0, \]
\[ p = P(|\bar{X}-\mu| \ge |\bar{x}-\mu| \mid \mu=\mu_0) \quad \text{for } H_1: \mu \ne \mu_0, \]
where $\bar{X}$ denotes the random variable containing the arithmetic mean, whose value differs from sample to sample, $\bar{x}$ denotes the particular value of the arithmetic mean we got for our sample at hand, and $\mid \mu = \mu_0$ stands for the condition that the population mean has indeed the value $\mu_0$, as stated in the null hypothesis.
The null hypothesis is rejected if the p-value falls below some prespecified level $\alpha$, the so-called significance level of the test; otherwise $H_0$ is accepted. $\alpha$ denotes the probability of committing a Type I error, which means falsely rejecting a true null hypothesis. We want this probability to remain small, hence we often choose $\alpha = 5\%$, if not even smaller.
Example
Using again that the standardized sample mean
\[ T_n = \frac{\bar{X} - \mu}{S/\sqrt{n}} \]
in a sample of $n$ observations is $t(n-1)$-distributed, we get e.g. for the one-sided test against $H_1: \mu > \mu_0$,
\[ p = P(\bar{X} \ge \bar{x} \mid \mu=\mu_0) = P(T_n \ge t), \quad \text{where} \quad t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \quad \text{and} \quad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 \]
are obtained from our sample. $p$ may hence be obtained by integrating the density of the Student t-distribution with $n-1$ degrees of freedom between $t$ and $+\infty$, and is available from Excel using the command T.DIST.RT(t; n−1). Similarly,
p = T.DIST.RT(−t; n−1) for $H_1: \mu < \mu_0$,
p = T.DIST.2T(|t|; n−1) for $H_1: \mu \ne \mu_0$.
Note that the p-value of a two-sided test is twice the p-value of the one-sided test in the direction of the observed deviation, because, by symmetry of the t-distribution:
\[ P(|T| \ge |t|) = P(T \le -|t|) + P(T \ge |t|) = 2P(T \ge |t|). \]
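The statement that p is an integral of the t-density can be checked numerically. The sketch below integrates the density with Simpson's rule instead of calling Excel; the function names and the summary figures in the usage lines (a sample mean of 2, s = 1.5, n = 9, tested against µ0 = 1) are hypothetical and only serve as an illustration.

```python
import math

def t_pdf(u, df):
    """Density of the Student t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + u * u / df) ** (-(df + 1) / 2)

def t_right_tail(t, df, steps=2000):
    """P(T >= t): 0.5 minus the Simpson integral of the density over [0, |t|]."""
    a = abs(t)
    h = a / steps                         # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    upper = 0.5 - total * h / 3
    return upper if t >= 0 else 1.0 - upper

def t_test_p_values(x_bar, s, n, mu0):
    """t statistic and the three p-values for H0: mu = mu0."""
    t = (x_bar - mu0) / (s / math.sqrt(n))
    df = n - 1
    return {
        "t": t,
        "H1: mu > mu0": t_right_tail(t, df),
        "H1: mu < mu0": t_right_tail(-t, df),
        "H1: mu != mu0": 2 * t_right_tail(abs(t), df),
    }

# Hypothetical summary statistics, for illustration only.
for name, value in t_test_p_values(x_bar=2.0, s=1.5, n=9, mu0=1.0).items():
    print(f"{name}: {value:.4f}")
```

Integrating over the finite interval from 0 to |t| and subtracting from 0.5 avoids having to truncate the infinite upper limit.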
Example (continued)
If software is not available, we can still use the critical values $t_\alpha$ from statistical tables in order to decide whether we may reject $H_0$ at significance level $\alpha$ or not. In order to do that, express the condition $p < \alpha$ required for rejection of $H_0$ in terms of threshold values $t_\alpha$, recalling that $P(T > t_\alpha) = \alpha$.
Hence the condition $p < \alpha$ for rejecting $H_0: \mu = \mu_0$ against $H_1: \mu > \mu_0$ reads
\[ \underbrace{P(T > t)}_{p} < \underbrace{P(T > t_\alpha)}_{\alpha}, \]
which is equivalent to the condition $t > t_\alpha$, because the tail probability $P(T > t)$ decreases monotonically in $t$.
Similarly, the condition for rejecting $H_0$ against $H_1: \mu < \mu_0$ is $t < -t_\alpha$, since $P(T < t) < P(T < -t_\alpha)$.
The condition for rejecting $H_0: \mu = \mu_0$ against $H_1: \mu \ne \mu_0$ is $|t| > t_{\alpha/2}$, because it means the same as the conditions $t > t_{\alpha/2}$ for $t > 0$, or $t < -t_{\alpha/2}$ for $t < 0$, since $P(|T| > |t|) < P(|T| > t_{\alpha/2})$.
1.2. Scales of Measurement
Recall that the applicability of different statistical methods depends upon the measurement scale of the variable in question.
Variables on nominal scale cannot be used in calculations and don't reveal any order either. They can only be used for sorting statistical units into groups.
Example: Gender, profession, colours, ...
Variables on ordinal scale cannot be directly used in calculations either, but they reveal an implicit order. The ordering implies that it is meaningful to define ranks, on the basis of which it is possible to calculate quantiles such as the median.
Example: agree/partially agree/disagree, bad/average/good, ...
Variables on interval scale can directly be used in calculations involving only sums and differences between observation values, such as the arithmetic mean $\bar{x}$. However, the arbitrariness of the point of origin in these variables precludes meaningful calculation of statistics involving ratios of observation values.
Example: clock time, temperature, ...
Variables on ratio scale share the properties of variables on interval scale, but additionally allow for meaningful calculation of statistics based upon ratios of observation values, such as the coefficient of variation or the harmonic mean, because the point of origin is uniquely defined as absence of the quantity measured.
Example: Money, Weight, Time intervals, ...
Note. Variables on interval scale can always be transformed into ratio scale by taking differences of the original variable.
Example: Clock time → Time intervals.
1.3. Interdependence of Statistical Variables
1.3.1. Both Variables on Nominal Scale
The analysis starts off from a so-called contingency table (ristiintaulukko), which displays the absolute counts (havaitut frekvenssit) $f_{ij}$ of statistical units belonging both to class $i$ of variable $X$ and to class $j$ of variable $Y$.
Dividing these counts by the total number of observations $n$ yields the so-called relative frequencies (suhteelliset frekvenssit) $p_{ij} = f_{ij}/n$, which, applying the relative frequency approach, may be interpreted as a proxy for the probability that a randomly chosen statistical unit belongs to class $i$ of $X$ and to class $j$ of $Y$ simultaneously. Therefore we also call the observed frequencies (both absolute and relative) the joint distribution (yhteysjakauma) of $X$ and $Y$.
The probabilities of a statistical unit to belong to a certain class of $X$, regardless of its classification with respect to $Y$, are given by $p_{i\bullet} = \sum_j p_{ij}$, which together make up the marginal distribution (reunajakauma) of $X$. Similarly, the collection of probabilities $p_{\bullet j} = \sum_i p_{ij}$ for statistical units to belong to class $j$ of $Y$, regardless of their classification according to $X$, is called the marginal distribution of $Y$.
Now we know from probability calculus that two events are independent when their joint probability equals the product of their marginal probabilities, that is, $p_{ij} = p_{i\bullet}\,p_{\bullet j}$. This leads us to assume independence of $X$ and $Y$ when the observed frequencies $f_{ij}$ equal the so-called expected frequencies (odotetut frekvenssit) $e_{ij}$, where
\[ e_{ij} = n p_{ij} = n \cdot p_{i\bullet} \cdot p_{\bullet j} = n \cdot \frac{f_{i\bullet}}{n} \cdot \frac{f_{\bullet j}}{n} = \frac{f_{i\bullet} f_{\bullet j}}{n} \]
with $f_{i\bullet} = \sum_j f_{ij}$ and $f_{\bullet j} = \sum_i f_{ij}$.
The test statistic used in order to assess whether $f_{ij} \approx e_{ij}$ is Pearson's $\chi^2$ statistic:
\[ \chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{s} \frac{(f_{ij} - e_{ij})^2}{e_{ij}}, \]
where we have assumed that the row variable (rivimuuttuja) $X$ has $r$ classes, and the column variable (sarakemuuttuja) $Y$ has $s$ classes, such that the overall dimension of the contingency table is $(r \times s)$.
The null and alternative hypotheses of the $\chi^2$ independence test (riippumattomuustesti) are:
H0 : X and Y are statistically independent
H1 : X and Y are statistically dependent
If the null hypothesis holds true, then $\chi^2$ is approximately $\chi^2$-distributed with
df = (r − 1)(s − 1) degrees of freedom.
Large values of $\chi^2$ lead to a rejection of the null hypothesis, which means that we believe there is dependence also out of sample.
When using statistical tables, one first decides on a significance level (merkitsevyystaso) $\alpha$, which denotes the risk one is ready to take of falsely rejecting a null hypothesis which in fact holds true, and compares the obtained value of the $\chi^2$ statistic with the corresponding critical value (kriittinen arvo) $\chi^2_\alpha(df)$ from the table.
Statistical software programmes report a p-value, as described in the introduction, which must be compared with the significance level. We accept $H_0$ if $p \ge \alpha$ and reject $H_0$ if $p < \alpha$.
In Excel: p = CHIDIST(χ2; df).
Usually $\alpha = 0.05$, such that:
$H_0$ is accepted if $\chi^2 \le \chi^2_\alpha(df)$ or $p \ge 0.05$,
$H_0$ is rejected if $\chi^2 > \chi^2_\alpha(df)$ or $p < 0.05$.
Note that statistical significance of $\chi^2$ alone, i.e. rejection of $H_0$, does not yet make any statement about the strength of dependence between the variables.
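Table. Tail fractiles $\chi^2_\alpha(df)$ of the $\chi^2$-distribution: $P(\chi^2 > \chi^2_\alpha(df)) = \alpha$. Rows: degrees of freedom df; columns: tail probability α.
df\α 0.995 0.990 0.975 0.950 0.900 0.100 0.050 0.025 0.010 0.001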
1 0.000039 0.000157 0.000982 0.003932 0.0158 2.706 3.841 5.024 6.635 10.827
2 0.0100 0.0201 0.0506 0.103 0.211 4.605 5.991 7.378 9.210 13.815
3 0.0717 0.115 0.216 0.352 0.584 6.251 7.815 9.348 11.345 16.266
4 0.207 0.297 0.484 0.711 1.064 7.779 9.488 11.143 13.277 18.466
5 0.412 0.554 0.831 1.145 1.610 9.236 11.070 12.832 15.086 20.515
6 0.676 0.872 1.237 1.635 2.204 10.645 12.592 14.449 16.812 22.457
7 0.989 1.239 1.690 2.167 2.833 12.017 14.067 16.013 18.475 24.321
8 1.344 1.647 2.180 2.733 3.490 13.362 15.507 17.535 20.090 26.124
9 1.735 2.088 2.700 3.325 4.168 14.684 16.919 19.023 21.666 27.877
10 2.156 2.558 3.247 3.940 4.865 15.987 18.307 20.483 23.209 29.588
11 2.603 3.053 3.816 4.575 5.578 17.275 19.675 21.920 24.725 31.264
12 3.074 3.571 4.404 5.226 6.304 18.549 21.026 23.337 26.217 32.909
13 3.565 4.107 5.009 5.892 7.041 19.812 22.362 24.736 27.688 34.527
14 4.075 4.660 5.629 6.571 7.790 21.064 23.685 26.119 29.141 36.124
15 4.601 5.229 6.262 7.261 8.547 22.307 24.996 27.488 30.578 37.698
16 5.142 5.812 6.908 7.962 9.312 23.542 26.296 28.845 32.000 39.252
17 5.697 6.408 7.564 8.672 10.085 24.769 27.587 30.191 33.409 40.791
18 6.265 7.015 8.231 9.390 10.865 25.989 28.869 31.526 34.805 42.312
19 6.844 7.633 8.907 10.117 11.651 27.204 30.144 32.852 36.191 43.819
20 7.434 8.260 9.591 10.851 12.443 28.412 31.410 34.170 37.566 45.314
21 8.034 8.897 10.283 11.591 13.240 29.615 32.671 35.479 38.932 46.796
22 8.643 9.542 10.982 12.338 14.041 30.813 33.924 36.781 40.289 48.268
23 9.260 10.196 11.689 13.091 14.848 32.007 35.172 38.076 41.638 49.728
24 9.886 10.856 12.401 13.848 15.659 33.196 36.415 39.364 42.980 51.179
25 10.520 11.524 13.120 14.611 16.473 34.382 37.652 40.646 44.314 52.619
26 11.160 12.198 13.844 15.379 17.292 35.563 38.885 41.923 45.642 54.051
27 11.808 12.878 14.573 16.151 18.114 36.741 40.113 43.195 46.963 55.475
28 12.461 13.565 15.308 16.928 18.939 37.916 41.337 44.461 48.278 56.892
29 13.121 14.256 16.047 17.708 19.768 39.087 42.557 45.722 49.588 58.301
30 13.787 14.953 16.791 18.493 20.599 40.256 43.773 46.979 50.892 59.702
40 20.707 22.164 24.433 26.509 29.051 51.805 55.758 59.342 63.691 73.403
50 27.991 29.707 32.357 34.764 37.689 63.167 67.505 71.420 76.154 86.660
60 35.534 37.485 40.482 43.188 46.459 74.397 79.082 83.298 88.379 99.608
70 43.275 45.442 48.758 51.739 55.329 85.527 90.531 95.023 100.425 112.317
80 51.172 53.540 57.153 60.391 64.278 96.578 101.879 106.629 112.329 124.839
90 59.196 61.754 65.647 69.126 73.291 107.565 113.145 118.136 124.116 137.208
100 67.328 70.065 74.222 77.929 82.358 118.498 124.342 129.561 135.807 149.449
Example: The 5% critical value for a two-way table with 2 levels per variable (df = 1) is 3.84, and the p-value for a $\chi^2$ statistic of 5.024 with df = 1 is 0.025.
Recall that in the special case of two-way tables, that is $r = s = 2$, the calculation of the $\chi^2$ statistic simplifies to
\[ \chi^2 = \frac{n(f_{11}f_{22} - f_{12}f_{21})^2}{f_{1\bullet}\,f_{2\bullet}\,f_{\bullet 1}\,f_{\bullet 2}} \sim \chi^2(1) \text{ under } H_0, \]
where the counts are arranged as follows:
f11   f12   f1•
f21   f22   f2•
f•1   f•2   n
The $\chi^2$ statistic is only approximately $\chi^2(1)$-distributed (the $\chi^2$-distribution comes about by approximating the binomially distributed marginal sums with the normal distribution). The approximation becomes more precise by applying the following continuity correction (jatkuvuuskorjaus):
\[ \chi^2 = \frac{n\left(|f_{11}f_{22} - f_{12}f_{21}| - n/2\right)^2}{f_{1\bullet}\,f_{2\bullet}\,f_{\bullet 1}\,f_{\bullet 2}}. \]
Note that the shortcuts described on this page are valid only for two-way tables and may not be applied to contingency tables of any other size than 2 × 2.
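As a quick check of the 2 × 2 shortcut, the following sketch compares it with the general sum over cells and evaluates the p-value for one degree of freedom. The counts in `table` are hypothetical and the function names are made up for this illustration. The p-value uses the identity $P(\chi^2(1) > x) = P(|Z| > \sqrt{x})$, which is an exact property of the $\chi^2(1)$ distribution rather than something taken from the slides.

```python
import math

def chi2_general(table):
    """Pearson's chi-square via expected frequencies e_ij = f_i. * f_.j / n."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    return sum(
        (f - r * c / n) ** 2 / (r * c / n)
        for row, r in zip(table, row_sums)
        for f, c in zip(row, col_sums)
    )

def chi2_2x2(table, continuity_correction=False):
    """Shortcut formula, valid for 2 x 2 tables only."""
    (f11, f12), (f21, f22) = table
    n = f11 + f12 + f21 + f22
    diff = abs(f11 * f22 - f12 * f21)
    if continuity_correction:
        diff -= n / 2
    denom = (f11 + f12) * (f21 + f22) * (f11 + f21) * (f12 + f22)
    return n * diff ** 2 / denom

def p_value_df1(chi2):
    """P(chi2(1) > chi2) = P(|Z| > sqrt(chi2)) = erfc(sqrt(chi2 / 2))."""
    return math.erfc(math.sqrt(chi2 / 2))

table = [[10, 20], [30, 40]]              # hypothetical counts
print(chi2_general(table), chi2_2x2(table))
print(chi2_2x2(table, continuity_correction=True))
print(p_value_df1(3.841), p_value_df1(5.024))
```

The last line evaluates the tail probability at two entries of the df = 1 row of the table, which should come out close to the column headings 0.05 and 0.025.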
A small p-value tells us that the assumption of independence probably does not hold, that is, the row and the column variable are probably dependent. However, the p-value says nothing about how strong this dependence actually is.
The most useful measure of dependence for categorical data is Cramér's V, defined as
\[ V = \sqrt{\frac{\chi^2}{\chi^2_{\max}}} = \sqrt{\frac{\chi^2}{n(k-1)}}, \]
where $k$ is the smaller of the number of rows $r$ and columns $s$. It ranges from 0 (complete independence) to 1 (perfect dependence). As a rule of thumb, there is no substantial dependence if $V < 0.1$.
Example:
Satisfaction with the company's management:
Observed frequencies f_ij:
               Vocational training
               Yes      No       Sum
Satisfied       87     112       199
Don't know      34      30        64
Unsatisfied     22      96       118
Sum            143     238       381
Expected frequencies e_ij under independence:
               Vocational training
               Yes      No       Sum
Satisfied      74.7   124.3      199
Don't know     24.0    40.0       64
Unsatisfied    44.3    73.7      118
Sum           143     238        381
df = (3−1)(2−1) = 2, $\chi^2_{0.05}(2) = 5.99$,
\[ \chi^2 = \sum_{i=1}^{3}\sum_{j=1}^{2} \frac{(f_{ij} - e_{ij})^2}{e_{ij}} = 27.841 > \chi^2_{0.05}(2) \]
⇒ There is dependence, $V = \sqrt{\frac{27.841}{381(2-1)}} = 0.27$.
1.3.2. Both Variables on Ordinal Scale
$\chi^2$ may still be applied, but it doesn't take the order of the ranked classification into account. The concordance (samansuuntaisuus) of the ranked classification is measured by Spearman's rank correlation coefficient $r_S$, also known as Spearman's $\rho$, and Kendall's $\tau$.
In order to calculate these measures, one first determines the ranks (sijaluvut) of the $X$ and $Y$ observations:
\[ x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}, \qquad y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}. \]
If there are no ties, then Spearman's $\rho$ and Kendall's $\tau$ are determined as
\[ r_S = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)} \qquad \text{and} \qquad \tau = 1 - \frac{4Q}{n(n-1)}, \]
where $d_i$ is the difference between ranks and $Q$ is the number of discordant pairs (parittaisten sijanvaihdosten lukumäärä), that is, pairs where an increase in $X$ corresponds to a decrease in $Y$.
Example: (Snedecor & Cochran)
Ranking of seven rats' conditions by two observers:
Rat Number   Obs. 1   Obs. 2   Difference d_i   d_i^2
1            4        4         0               0
2            1        2        -1               1
3            6        5         1               1
4            5        6        -1               1
5            3        1         2               4
6            2        3        -1               1
7            7        7         0               0
Sum                             0               8
\[ r_S = 1 - \frac{6\sum d_i^2}{n(n^2-1)} = 1 - \frac{6 \cdot 8}{7(49-1)} = 0.857. \]
In order to compute Kendall's $\tau$, rearrange the two rankings so that one of them is in increasing order:
Rat No.   2   6   5   1   4   3   7
Obs. 1    1   2   3   4   5   6   7
Obs. 2    2   3   1   4   6   5   7
Taking each rank given by observer 2 in turn, count the smaller ranks to the right of it and add these counts. For rank 2 the count is 1, since only rat 5 has a smaller rank. The six counts are 1, 1, 0, 0, 1, 0 (no need to count the extreme right rank), such that
\[ Q = 3 \qquad \text{and} \qquad \tau = 1 - \frac{4Q}{n(n-1)} = 1 - \frac{12}{42} = \frac{5}{7} \approx 0.714. \]
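The counting procedure for the rat example is easy to mechanize. The sketch below assumes, as in the formulas above, that there are no ties; the function names are hypothetical.

```python
def spearman_no_ties(rank_x, rank_y):
    """Spearman's r_S from two rankings without ties."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall_no_ties(rank_x, rank_y):
    """Kendall's tau and the number Q of discordant pairs (no ties)."""
    n = len(rank_x)
    # Put the first ranking in increasing order and carry the second along.
    ys = [y for _, y in sorted(zip(rank_x, rank_y))]
    # For each rank, count the smaller ranks to the right of it.
    q = sum(1 for i in range(n) for j in range(i + 1, n) if ys[j] < ys[i])
    return 1 - 4 * q / (n * (n - 1)), q

obs1 = [4, 1, 6, 5, 3, 2, 7]   # rankings of rats 1..7 by observer 1
obs2 = [4, 2, 5, 6, 1, 3, 7]   # rankings of rats 1..7 by observer 2

r_s = spearman_no_ties(obs1, obs2)
tau, q = kendall_no_ties(obs1, obs2)
print(f"r_S = {r_s:.3f}, Q = {q}, tau = {tau:.3f}")
```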
Recall that Spearman's rank correlation coefficient is in fact Pearson's linear correlation coefficient applied to ranks, which simplifies to the form given above only in the special case that there are no ties (that is, no rank is shared by several observations).
Both coefficients obey $-1 \le r_S, \tau \le 1$, where
$r_S = \tau = 1$ ⇔ ranks in same order,
$r_S = \tau = -1$ ⇔ ranks in opposite order,
$r_S = \tau = 0$ ⇔ independent ranks.
We usually test independence of the ranks:
\[ H_0: \rho_S = 0 \text{ or } \tau = 0. \]
The Real Statistics toolpack offers the exact p-values for this test.
If software is not available and you have a sufficiently large sample size $n$, you can still use the fact that the test statistic for testing $H_0: \rho_S = 0$ is in large samples approximately
\[ z = r_S\sqrt{n} \sim N(0,1) \text{ under } H_0. \]
E.g. $|z| > 1.96$ is an indication that $r_S$ is significant at $\alpha = 5\%$. The same approximation works for the linear correlation coefficient $r$, but for $\tau$: $z \approx \frac{3}{2}\tau\sqrt{n}$.
1.3.3. Both Variables on Interval Scale
Linear association (lineaarinen riippuvuus) between two variables may be assessed using Pearson's linear correlation coefficient $r = r_{xy}$ if both variables are at least on interval scale.
Recall:
• Pearson's linear correlation coefficient is symmetric in the sense that it makes no difference which variable you call $x$ and which you call $y$ in calculating the correlation.
• $r_{xy}$ does not change when we change the units of $x$, $y$, or both.
• $r_{xy}$ measures only the strength of linear relationships. It does not describe curved relationships, no matter how strong they are.
• $r_{xy}$ is always a number between −1 and 1, with the sign of $r$ indicating the sign of the linear relationship.
• Pearson's linear correlation coefficient is more sensitive to outliers than Spearman's rank correlation coefficient and Kendall's $\tau$.
Correlation and Regression
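Recall that if $y$ is the sum of a linear function of $x$ and some error term $e$ with zero mean, that is,
\[ y = \hat{y} + e, \quad \text{where} \quad \hat{y} = b_0 + b_1 x, \quad \bar{e} = 0, \]
then we may determine the coefficients of the so-called regression line (regressiosuora) by means of the method of least squares (OLS) (pns-menetelmä) as
\[ b_1 = r_{xy} \cdot \frac{s_y}{s_x} \qquad \text{and} \qquad b_0 = \bar{y} - b_1\bar{x}, \]
where $s_x$ and $s_y$ denote the standard deviations of $x$ and $y$, respectively.
Recall:
\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s_x^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right), \]
\[ \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad s_y^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}\right), \]
\[ r_{xy} = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sqrt{\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]\left[\sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}\right]}}. \]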
Coefficients of Determination
Pearson's linear correlation coefficient $r_{xy}$ is related to the coefficient of determination $R^2$ (selityskerroin/-aste) of such a regression by
\[ R^2 := r_{xy}^2, \]
where $R^2$ measures the fit (yhteensopivuus) of the regression line as:
\[ R^2 = \frac{\text{variance of predicted values } \hat{y}}{\text{variance of observed values } y} = \frac{s_{\hat{y}}^2}{s_y^2}. \]
A better measure of fit when comparing regressions with varying numbers of regressors is the so-called adjusted $R^2$ (tarkistettu selitysaste), given by
\[ \bar{R}^2 = 1 - \frac{n-1}{n-2}\left(1 - r_{xy}^2\right) \]
in the case of only one regressor. It approaches the ordinary $R^2$ for large $n$. Note that unlike $R^2$, $\bar{R}^2$ may become negative for correlations close to zero:
\[ r_{xy} = 0 \;\Rightarrow\; \bar{R}^2 = -\frac{1}{n-2}. \]
Linear Regression in Excel
You can perform linear regression either with Excel's own data analysis tool or with the Real Statistics data analysis tool by Charles Zaiontz, available at www.real-statistics.com. Excel's own tool offers additional plots and the Real Statistics tool offers additional analysis, both of which will be discussed later in this course.
The regression output always contains:
• the coefficients of determination $R^2$ and $\bar{R}^2$;
• the standard error of the estimate $s_e$, which is an estimator for the unknown standard deviation of the error term out of sample;
• an Analysis of Variance table, which provides an F-test of $H_0: R^2 = 0$;
• a Parameter Estimates table containing the regression parameters and t-tests of the hypotheses that the respective parameter is 0.
The ANOVA table for simple linear regression
Analysis of variance (ANOVA) summarizes information about sources of variation in the data based on the framework
DATA = FIT + RESIDUAL.
The idea is that we may split up the deviation of the observed values $y_i$ from their arithmetic mean $\bar{y}$ into a sum of the deviation of the regression fit $\hat{y}_i$ from $\bar{y}$ and the deviation of $y_i$ from $\hat{y}_i$:
\[ (y_i - \bar{y}) = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i). \]
If we square each of the three deviations above and then sum over all $n$ observations, it is an algebraic fact that the sums of squares add:
\[ \sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2, \]
which we rewrite as
\[ SST = SSR + SSE, \]
where
\[ SST = \sum (y_i - \bar{y})^2, \quad SSR = \sum (\hat{y}_i - \bar{y})^2, \quad SSE = \sum (y_i - \hat{y}_i)^2. \]
In the abbreviations SST, SSR, and SSE, SS stands for sum of squares, and the T, R, and E stand for total, regression, and error.
Because $s_y^2 = SST/(n-1)$ and $s_{\hat{y}}^2 = SSR/(n-1)$,
\[ R^2 = \frac{s_{\hat{y}}^2}{s_y^2} = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}. \]
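The decomposition SST = SSR + SSE and the resulting expressions for $R^2$ can be verified numerically. The data below are hypothetical (six made-up points), and the script is only a sketch of the calculation for a simple least squares line.

```python
import math
import statistics

# Hypothetical data, made up for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

n = len(x)
x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = sxy / sxx                     # least squares slope
b0 = y_bar - b1 * x_bar            # least squares intercept
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                  # total
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)              # regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))     # error

r = sxy / math.sqrt(sxx * sst)     # Pearson's r (SST equals Syy)
r2 = ssr / sst
adj_r2 = 1 - (n - 1) / (n - 2) * (1 - r2)

print(f"SST = {sst:.4f}, SSR + SSE = {ssr + sse:.4f}")
print(f"R^2 = {r2:.4f}, r^2 = {r * r:.4f}, adjusted R^2 = {adj_r2:.4f}")
```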
Each sum of squares comes with associated degrees of freedom, telling how many quantities used in its calculation can vary freely without changing any estimators of population parameters used in the same calculation.
DFT = n − 1
(n y-values minus one for calculating $\bar{y} = \frac{1}{n}\sum y_i$)
DFR = 1
(There are n different $\hat{y}_i$, but they are all produced by varying the single variable x.)
DFE = n − 2
(n y-values minus 2 for calculating $b_0$ and $b_1$.)
Just like SST is the sum of SSR and SSE, the total degrees of freedom is the sum of the degrees of freedom for the regression model and for the error:
\[ DFT = DFR + DFE. \]
The ratio of the sum of squares to the degrees of freedom is called the mean square:
\[ \text{mean square} = \frac{\text{sum of squares}}{\text{degrees of freedom}}. \]
We know already $MST = \sum (y_i - \bar{y})^2/(n-1)$, which is just the sample variance $s_y^2$.
\[ MSE = \frac{\sum (y_i - \hat{y}_i)^2}{n-2} \]
is called the mean square error. Finally,
\[ MSR = \frac{\sum (\hat{y}_i - \bar{y})^2}{1} = SSR. \]
These can be used to assess whether $\beta_1 \ne 0$ out of sample, as is shown on the next slide.
ANOVA F -test for simple linear regression
Recall that the method of least squares makes no assumptions about the data generating process behind the observations $x_i$ and $y_i$ and may thus always be applied. Hypothesis tests about the coefficients of the regression line $y = \beta_0 + \beta_1 x$, however, require the error terms $\varepsilon_i = y_i - (\beta_0 + \beta_1 x_i)$ to be independent and normally distributed with mean 0 and common standard deviation $\sigma$. Under this assumption:
\[ F = \frac{MSR}{MSE} \sim F(1, n-2) \text{ under } H_0: \beta_1 = 0. \]
When $\beta_1 \ne 0$, MSR tends to be large relative to MSE. So large values of $F$ are evidence against $H_0$ in favour of the two-sided alternative $\beta_1 \ne 0$. For simple linear regression, this test is equivalent to the two-sided t-test for a significant slope coefficient, to be discussed on the next slides.
Student t-tests for Regression Parameters
Recall that the regression output
\[ b_1 = r_{xy} \cdot \frac{s_y}{s_x}, \qquad b_0 = \bar{y} - b_1\bar{x}, \qquad s_e^2 = \frac{\sum e_i^2}{n-2} \]
consists only of estimates of the true regression parameters $\beta_1$, $\beta_0$ and $\sigma^2$, and these estimates vary from sample to sample. That is, we may regard them as sample-specific outcomes of random variables with associated expected values and variances.
Under certain conditions to be discussed soon, the expected values of these estimators are
\[ E(b_1) = \beta_1, \quad E(b_0) = \beta_0, \quad \text{and} \quad E(s_e^2) = \sigma^2, \]
which is why we chose them in the first place. The standard deviations of the estimators for the regression coefficients turn out to be
\[ \sigma_{b_1} = \frac{\sigma}{\sqrt{SSX}} \qquad \text{and} \qquad \sigma_{b_0} = \sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{SSX}}, \]
where $SSX := \sum_{i=1}^{n}(x_i - \bar{x})^2$.
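Replacing $\sigma$ with the standard error of the estimate $s_e$ yields the standard errors for the estimated regression coefficients:
\[ SE_{b_1} = \frac{s_e}{\sqrt{SSX}} \qquad \text{and} \qquad SE_{b_0} = s_e\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{SSX}}, \]
which belong to the standard regression output. These may in turn be used in order to generate confidence intervals and tests for the regression slope and intercept as follows:
To test $H_0: \beta_j = 0$, where $j = 1$ (slope) or $j = 0$ (intercept), compute the test statistic
\[ T_j = \frac{b_j}{SE_{b_j}}. \]
Reject $H_0$ against
• $H_1: \beta_j \ne 0$ (two-sided) if $|T_j| \ge t_{\alpha/2}(n-2)$,
• $H_1: \beta_j > 0$ (one-sided) if $T_j > t_\alpha(n-2)$, or $H_1: \beta_j < 0$ (one-sided) if $T_j < -t_\alpha(n-2)$.
A level $(1-\alpha)$ confidence interval for $\beta_0$ is
\[ b_0 \pm t_{\alpha/2}(n-2) \cdot SE_{b_0}. \]
A level $(1-\alpha)$ confidence interval for $\beta_1$ is
\[ b_1 \pm t_{\alpha/2}(n-2) \cdot SE_{b_1}. \]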
Fuel Efficiency as a Function of Speed (continued)
Number of Observations Read   60
Number of Observations Used   60

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1        493.99177     493.99177    494.50   <.0001
Error             58         57.94073       0.99898
Corrected Total   59        551.93250

Root MSE          0.99949    R-Square   0.8950
Dependent Mean   17.72500    Adj R-Sq   0.8932
Coeff Var         5.63887

Parameter Estimates
Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   95% Confidence Limits
Intercept   Intercept    1             -7.79632          1.15491     -6.75     <.0001   -10.10813   -5.48451
logmph      logmph       1              7.87424          0.35410     22.24     <.0001     7.16543    8.58305
In the preceding example, the t-statistics came about by dividing the coefficient estimates $b_0 = -7.796$ and $b_1 = 7.874$ by their respective standard errors $SE_{b_0} = 1.155$ and $SE_{b_1} = 0.354$.
The 95% confidence intervals for $\beta_0$ and $\beta_1$ require the $\alpha/2 = 2.5\%$ critical value of the t-distribution with $n - 2 = 58$ degrees of freedom (the same as for the residuals), which may be obtained from a table or by calling T.INV.2T(0.05,58) in Excel as
\[ t_{\alpha/2}(n-2) = t_{0.025}(58) \approx 2.002. \]
The 95% confidence intervals for $\beta_0$ and $\beta_1$ are therefore:
\[ b_1 \pm t_{0.025}(58) \cdot SE_{b_1} = 7.874 \pm 2.002 \cdot 0.354 = (7.165,\; 8.583), \]
\[ b_0 \pm t_{0.025}(58) \cdot SE_{b_0} = -7.796 \pm 2.002 \cdot 1.155 = (-10.108,\; -5.485). \]
The fact that zero is not included in either of these confidence intervals implies that we can reject $H_0: \beta_1 = 0$ and $H_0: \beta_0 = 0$ in two-sided tests at the 5% level, and likewise in one-sided tests against alternatives in the direction of the estimates.
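Most figures in the regression output follow from a handful of its entries. The sketch below recomputes them from the sums of squares, the coefficient estimates and the standard errors printed in the output; the critical value 2.002 is the rounded table value, so the confidence limits agree with the printed ones only to about three decimals.

```python
import math

# Figures copied from the regression output.
n = 60
ssr, sse = 493.99177, 57.94073          # Model and Error sums of squares
b0, se_b0 = -7.79632, 1.15491           # intercept and its standard error
b1, se_b1 = 7.87424, 0.35410            # slope and its standard error
t_crit = 2.002                          # t_{0.025}(58), rounded table value

sst = ssr + sse                         # Corrected Total
msr = ssr / 1                           # DFR = 1
mse = sse / (n - 2)                     # DFE = n - 2
f_stat = msr / mse
root_mse = math.sqrt(mse)               # standard error of the estimate s_e
r2 = ssr / sst
adj_r2 = 1 - (n - 1) / (n - 2) * (1 - r2)

t_b0, t_b1 = b0 / se_b0, b1 / se_b1
ci_b0 = (b0 - t_crit * se_b0, b0 + t_crit * se_b0)
ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

print(f"F = {f_stat:.2f}, t_b1^2 = {t_b1 ** 2:.2f}")
print(f"R^2 = {r2:.4f}, adjusted R^2 = {adj_r2:.4f}, Root MSE = {root_mse:.5f}")
print(f"t_b0 = {t_b0:.2f}, t_b1 = {t_b1:.2f}")
print(f"CI for beta0: ({ci_b0[0]:.3f}, {ci_b0[1]:.3f})")
print(f"CI for beta1: ({ci_b1[0]:.3f}, {ci_b1[1]:.3f})")
```

Comparing `f_stat` with the square of `t_b1` illustrates the equivalence of the ANOVA F-test and the two-sided t-test for the slope in simple linear regression.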
Confidence intervals for the mean response and for individual observations
For any specific value of $x$, say $x^*$, the mean of the response $y$ in this subpopulation is
\[ \mu_y = \beta_0 + \beta_1 x^*, \]
which we estimate from the sample as
\[ \hat{\mu}_y = b_0 + b_1 x^*. \]
Alternatively, we may interpret this expression as a prediction for an individual observation, $\hat{y} = b_0 + b_1 x^*$ for $x = x^*$. The prediction interval for an individual observation, however, is wider than the confidence interval for the mean, due to the additional variation of individual responses about the mean response.
A level $(1-\alpha)$ confidence interval for $\mu_y$ is
\[ \hat{\mu}_y \pm t_{\alpha/2}(n-2) \cdot SE_{\hat{\mu}}, \qquad SE_{\hat{\mu}} = s_e\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{SSX}}. \]
A level $(1-\alpha)$ prediction interval for $y$ is
\[ \hat{y} \pm t_{\alpha/2}(n-2) \cdot SE_{\hat{y}}, \qquad SE_{\hat{y}} = s_e\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{SSX}}. \]
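The two interval formulas differ only by the extra 1 under the square root, which the following sketch makes visible. The data are hypothetical, the function name `intervals_at` is made up, and the critical value $t_{0.025}(4) = 2.776$ for $n = 6$ observations is taken from a t-table.

```python
import math
import statistics

def intervals_at(x, y, x_star, t_crit):
    """Fitted value at x_star with its confidence and prediction interval."""
    n = len(x)
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ssx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))                  # standard error of the estimate
    fit = b0 + b1 * x_star
    se_mean = s_e * math.sqrt(1 / n + (x_star - x_bar) ** 2 / ssx)
    se_pred = s_e * math.sqrt(1 + 1 / n + (x_star - x_bar) ** 2 / ssx)
    conf = (fit - t_crit * se_mean, fit + t_crit * se_mean)
    pred = (fit - t_crit * se_pred, fit + t_crit * se_pred)
    return fit, conf, pred

# Hypothetical data, made up for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

fit, conf, pred = intervals_at(x, y, x_star=3.5, t_crit=2.776)
print(f"fit = {fit:.3f}")
print(f"95% CI for the mean response: ({conf[0]:.3f}, {conf[1]:.3f})")
print(f"95% prediction interval:      ({pred[0]:.3f}, {pred[1]:.3f})")
```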
Fuel Efficiency as a Function of Speed (continued)
[Figure: 95% confidence limits for the mean response]
[Figure: 95% confidence limits for individual predictions]
1.4. Prerequisites in statistical inference
Statistical tests and confidence intervals are derived on the basis of some central assumptions. We usually assume that our observations are random samples from some prespecified distribution, most commonly the normal distribution or one of its derivatives. This, in turn, requires our data to have certain characteristics before a statistical method can be meaningfully applied.
General preconditions are that:
• the statistical units/observations are independent of each other,
• they are equally reliable,
• and the sample size is sufficiently large.
Beyond these general prerequisites, there are preconditions that apply to the specific statistical method to be used.
1. Contingency tables
Pearson's $\chi^2$ used in independence and homogeneity tests is approximately $\chi^2$-distributed if there are sufficiently many observations, that is:
• all expected frequencies are greater than 1,
• no more than 20% of the expected counts are smaller than 5.
If any of those conditions is not met, there are two options:
It is best to use Fisher's exact test, which is available as an option from the Chi-Square Test of the Independence tool. It doesn't use the $\chi^2$-approximation at all and also works in small samples, where the assumptions for the $\chi^2$-test are not satisfied. It always delivers the precise p-value (so it's better than the $\chi^2$-test), but the Real Statistics Excel add-in calculates it only for tables with no more than 9 cells.
Alternatively, if you have a 2×2 table and no software is available, you should use the continuity correction discussed earlier.
2. Correlations
The tests for independence of ranks, $H_0: \rho_S = 0$ or $\tau = 0$, are exact and also work in small samples. The same holds for testing the linear correlation coefficient
\[ H_0: \rho = 0 \quad (x, y \text{ are linearly independent}) \]
with the t-test
\[ T = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t(n-2) \text{ under } H_0 \]
when both $x$ and $y$ are normally distributed. Otherwise the test is only approximate and requires a sufficiently large sample size. For small $r$ and large $n$ we get the approximate z-test:
\[ Z = r\sqrt{n} \sim N(0,1) \text{ under } H_0. \]
The Real Statistics data analysis tool also allows you to test $H_0: \rho = \rho_0 \ne 0$ (known as Fisher's test) in large samples, but again the result is only approximate. The more $\rho_0$ deviates from 0, the larger the sample size has to be.
Example. Consider again the reputation andfamiliarity of 10 different brands:
reputation 7 2 2 9 0 1 9 0 1 0familiarity 4 7 3 13 2 7 9 4 1 1
Pearson’s linear correlation coefficient for thissample is r = 0.7388. We test H0 : ρ = 0against the alternative H1 : ρ 6= 0 by calcu-lating the test statistics
t =0.7388 ·
√10−2√
1− 0.73882≈ 3.10.
From a statistical table or by calling T.INV.2T in Excel we obtain the critical values
t0.02/2(8) = 2.896 and t0.01/2(8) = 3.355,
so the two-sided p-value is somewhere between 1% and 2%. The exact p-value is
T.DIST.2T(3.100776; 8) = 0.014649.
Applying the large sample approximation yields
$$Z = 0.7388\sqrt{10} = 2.336 \;\Rightarrow\; p \approx 2\%,$$
since P(Z ≤ 2.336) = NORMSDIST(2.336) ≈ 0.99, such that P(|Z| > 2.336) ≈ 2(1 − 0.99) = 0.02.
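The numbers of this example can be reproduced with a short Python sketch that uses only the standard library. The variable names are our own, and because the standard library has no t-distribution, the exact t-based p-value is not computed; instead the statistic is compared with the critical values quoted above.

```python
from math import sqrt
from statistics import NormalDist, mean

reputation = [7, 2, 2, 9, 0, 1, 9, 0, 1, 0]
familiarity = [4, 7, 3, 13, 2, 7, 9, 4, 1, 1]
n = len(reputation)

# Pearson's r from the sums of squares and cross products
mx, my = mean(reputation), mean(familiarity)
sxy = sum((x - mx) * (y - my) for x, y in zip(reputation, familiarity))
sxx = sum((x - mx) ** 2 for x in reputation)
syy = sum((y - my) ** 2 for y in familiarity)
r = sxy / sqrt(sxx * syy)

# t statistic with n - 2 degrees of freedom
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)

# Large-sample z approximation and its two-sided p-value
z = r * sqrt(n)
p_z = 2 * (1 - NormalDist().cdf(abs(z)))

# Critical values for 8 degrees of freedom as quoted in the text
t_crit_2pct, t_crit_1pct = 2.896, 3.355

print(f"r = {r:.4f}, t = {t:.3f}, z = {z:.3f}, approximate p = {p_z:.4f}")
print("t lies between the 2% and 1% critical values:",
      t_crit_2pct < t < t_crit_1pct)
```

With only 10 observations the z approximation is rough, which is why its p-value differs noticeably from the exact one.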
3. Regression Analysis
Example: (N. Weiss: Introductory Statistics) Consider the following sample of the prices (y in $100) of cars as a function of their age (x in years):
x    5    4    6    5    5    5    6    6    2    7    7
y   85  103   70   82   89   98   66   95  169   70   48
A regression of price upon age yields:
y = 195.47− 20.26x.
Note that the predictions of a regression line are not completely accurate: for example, it predicts the price of 5-year-old cars as
$19547 − $2026 · 5 = $9417,
whereas the true prices vary from car to car between $8200 and $9800. The distribution of a response variable Y for a specific value of the predictor variable X is called the conditional distribution (ehdollinen jakauma) of Y given the value X = x with conditional mean E(Y|X = x) and conditional variance V(Y|X = x).
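As a check, the regression line and the prediction for 5-year-old cars can be recomputed from the data above. The following is a minimal sketch with the usual least-squares formulas; the variable names are our own.

```python
from statistics import mean

age = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]                 # x in years
price = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]  # y in $100

mx, my = mean(age), mean(price)
sxy = sum((x - mx) * (y - my) for x, y in zip(age, price))
sxx = sum((x - mx) ** 2 for x in age)

b1 = sxy / sxx      # slope
b0 = my - b1 * mx   # intercept

# Prediction for 5-year-old cars versus the prices actually observed at x = 5
predicted = b0 + b1 * 5
observed = [y for x, y in zip(age, price) if x == 5]

print(f"fitted line: y = {b0:.2f} {b1:+.2f} x")
print(f"predicted price at x = 5: {predicted:.2f} (in $100)")
print(f"observed prices at x = 5: {min(observed)} to {max(observed)} (in $100)")
```

The spread of the observed prices at x = 5 around the single predicted value is exactly what the conditional distribution of Y given X = 5 describes.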
The assumptions for regression inferences are:
1. Normal populations:
For each value of the predictor variable X, the conditional distribution of the response variable Y is a normal distribution.
2. Population regression line:
There are constants β0 and β1 such that, for each value x of the predictor variable X, the conditional mean of the response variable is:
y = E(Y |X=x) = β0 + β1x.
3. Equal standard deviations (variances):
The conditional standard deviations of the response variable Y are the same for all values of the predictor variable X:
V (Y |X=x) = σ2 = const.
The condition of equal standard deviations is called homoscedasticity.
4. Independent observations:
The observations of the response variable are independent of one another.
When assumptions 1–4 for regression inferences hold, then the random errors
εi = Yi − (β0 + β1Xi)
are independent and normally distributed with mean zero and variance σ2. The statistical model for simple linear regression may therefore equivalently be stated as
Yi = β0 + β1Xi + εi, εi iid N(0, σ2),
where ’iid’ stands for independent and identically distributed. Note that this model has three parameters: β0, β1 and σ.
It turns out that the least squares estimators
$$b_1 = r_{xy} \cdot \frac{s_y}{s_x} \quad \text{and} \quad b_0 = \bar{y} - b_1 \bar{x}$$
are unbiased estimators of β1 and β0, which are themselves normally distributed. An unbiased estimator of the unknown variance σ2 of the error term is given by the mean square error
$$\mathrm{MSE} = \frac{\sum e_i^2}{n-2}, \quad \text{where } e_i = y_i - (b_0 + b_1 x_i).$$
We define the standard error of the estimate as
$$s_e = \sqrt{\mathrm{MSE}} = \sqrt{\frac{\mathrm{SSE}}{n-2}} = \sqrt{\frac{\sum e_i^2}{n-2}}.$$
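These formulas can be illustrated with the car price data from the example above. The sketch below computes the slope via the correlation coefficient and the sample standard deviations, and then the residuals, MSE and standard error of the estimate; variable names are our own.

```python
from math import sqrt
from statistics import mean, stdev

age = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]                 # x in years
price = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]  # y in $100
n = len(age)

mx, my = mean(age), mean(price)
sx, sy = stdev(age), stdev(price)   # sample standard deviations (divisor n - 1)
r = sum((x - mx) * (y - my) for x, y in zip(age, price)) / ((n - 1) * sx * sy)

b1 = r * sy / sx     # slope from the correlation coefficient
b0 = my - b1 * mx    # intercept

# Residuals and the unbiased estimate of the error variance
residuals = [y - (b0 + b1 * x) for x, y in zip(age, price)]
sse = sum(e ** 2 for e in residuals)
mse = sse / (n - 2)
se = sqrt(mse)

print(f"b1 = {b1:.2f}, b0 = {b0:.2f}")
print(f"SSE = {sse:.1f}, MSE = {mse:.2f}, standard error of the estimate = {se:.2f}")
```

The divisor n − 2 reflects that two parameters, β0 and β1, have been estimated from the data before the residuals are formed.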
Regression Diagnostics: Residual Analysis
1. Normality of Residuals
Normality of residuals may be checked either graphically, by considering the shape parameters of the distribution of residuals, or by performing statistical tests.
Graphs
A first visual check of the normality assumption is taking a look at the histogram of residuals. If the histogram is not bell-shaped, the residuals are not normally distributed.
A bell-shaped distribution does not, however, guarantee that the distribution of residuals is normal; for example, the t-distribution is also bell-shaped.
Excel’s data analysis tool can produce histograms, but it is not very good at finding meaningful bin sizes and is also clumsy to use.
Normal probability plots are plots with the ranked observations x(i) on the horizontal axis and the z-values zqi = Φ⁻¹(qi) from the normal distribution corresponding to the respective quantile qi (observed cumulative probability) of x(i) on the vertical axis. In this form the normal probability plot is also called a quantile-quantile (Q-Q) plot.
Alternatively one may plot the expected normal cumulative probabilities Φ(x(i)) on the vertical axis against the observed cumulative probabilities qi on the horizontal axis in so-called probability-probability (P-P) plots.
In either case, if the observations are normally distributed, then the normal probability plot should be a straight line. Deviations from this line allow for detection of outliers and qualitative identification of skewness and kurtosis.
Q-Q plots are more generally used than P-P plots, because they stress deviations in the tails, where hypothesis tests are usually done (P-P plots stress deviations in the center).
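The coordinates of both plot types can be computed with Python's standard library as sketched below. Two choices here are assumptions of this sketch and not taken from the notes: the observed cumulative probability of the i-th ranked observation is set to qi = (i − 0.5)/n, and for the P-P plot the observations are standardized with their sample mean and standard deviation before Φ is applied. The sample values are hypothetical residuals.

```python
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """(x_(i), z_i) pairs for a normal Q-Q plot.
    Assumed plotting positions: q_i = (i - 0.5) / n."""
    n = len(values)
    std_normal = NormalDist()
    return [(x, std_normal.inv_cdf((i - 0.5) / n))
            for i, x in enumerate(sorted(values), start=1)]

def pp_points(values):
    """(q_i, Phi of the standardized x_(i)) pairs for a normal P-P plot."""
    n = len(values)
    m, s = mean(values), stdev(values)
    std_normal = NormalDist()
    return [((i - 0.5) / n, std_normal.cdf((x - m) / s))
            for i, x in enumerate(sorted(values), start=1)]

sample = [-1.8, -0.9, -0.4, -0.1, 0.2, 0.5, 1.1, 2.3]  # hypothetical residuals
for x, z in qq_points(sample):
    print(f"x = {x:5.1f}   z = {z:6.3f}")
```

Plotting the pairs with any charting tool and looking for departures from a straight line gives the visual check described above.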
Note. Excel’s regression tool has an option to produce a normal probability plot. This is not the relevant plot of the regression residuals though, but a much less useful normality plot for the unconditional y-values.
To get the relevant normality plot for the residuals, you must first produce those residuals from either Excel’s or the Real Statistics data analysis tool and then apply Descriptive Statistics and Normality / QQ Plot within the Real Statistics data analysis toolbox to the residuals.
Alternatively you may obtain a P-P plot from Excel by running a second regression with arbitrary x-values and the previously obtained residuals as y-values, asking for a normal probability plot within the regression window and ignoring all other output.
Shape Parameters
Skewness
Recall that non-symmetric unimodal distributions are skewed to the right if the observations concentrate upon the lower values or classes (Md < x̄), such that the distribution has a long tail to the right, and skewed to the left if the observations concentrate upon the higher values or classes (Md > x̄), such that the distribution has a long tail to the left. This asymmetry is indicated by the (coefficient of) skewness:
$$g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{s^3}.$$
In general, the distribution is skewed to the left (right) if g1 is smaller (larger) than zero. Unimodal distributions with g1 = 0 are symmetric. That is, g1 ≠ 0 (in particular when |g1| > 2√(6/n), n = sample size) is evidence that X is not normally distributed.
Skewness renders P-P and Q-Q plots curved rather than linear.
Kurtosis
The (coefficient of) kurtosis, defined as
$$g_2 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{s^4} - 3,$$
is a measure of peakedness (at least for unimodal distributions). That is, unimodal distributions with low kurtosis (g2 < 0), called platykurtic, are rather evenly spread across all possible values or classes, and unimodal distributions with high kurtosis (g2 > 0), called leptokurtic, have a sharp peak at their mode. Distributions with g2 ≈ 0 are called mesokurtic.
The kurtosis of the normal distribution is exactly zero. Therefore, the sign of g2 tells for unimodal distributions whether they are more (g2 > 0) or less (g2 < 0) sharply peaked than the normal distribution. A clear warning sign against normality is when |g2| > 4√(6/n).
Kurtosis renders PP- and QQ-plots S-shaped.
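Both shape parameters and their warning thresholds can be computed as in the following sketch. It assumes that s denotes the sample standard deviation with divisor n − 1; the function names and the sample are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def shape_parameters(values):
    """Return (g1, g2): skewness and excess kurtosis as defined above.
    Assumption: s is the sample standard deviation (divisor n - 1)."""
    n = len(values)
    m = mean(values)
    s = stdev(values)
    g1 = sum((x - m) ** 3 for x in values) / n / s ** 3
    g2 = sum((x - m) ** 4 for x in values) / n / s ** 4 - 3
    return g1, g2

def normality_warnings(values):
    """Flag |g1| > 2*sqrt(6/n) and |g2| > 4*sqrt(6/n)."""
    n = len(values)
    g1, g2 = shape_parameters(values)
    limit = sqrt(6 / n)
    return {"skewness": abs(g1) > 2 * limit, "kurtosis": abs(g2) > 4 * limit}

sample = [1, 1, 2, 2, 2, 3, 3, 4, 5, 9, 15]  # hypothetical, long right tail
g1, g2 = shape_parameters(sample)
print(f"g1 = {g1:.3f}, g2 = {g2:.3f}")
print(normality_warnings(sample))
```

Statistical software often applies small-sample corrections to these coefficients, so values reported by Excel or Real Statistics may differ slightly from this sketch.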
Normality Tests
The most popular test for normality is the Shapiro-Wilk test, available from the Descriptive Statistics and Normality tool. The null hypothesis is that the data are normally distributed and the alternative hypothesis is that they are not. So small p-values (e.g. p < 0.05) are evidence that the data are not normally distributed.
2. Linear Regression Relationship
Deviations from straight-line relationships are immediately evident from the scatterplot of the predictor and the response variables. Such deviations are also visible as systematic patterns instead of random scatter in so-called residual plots, where the residuals ei are plotted either against the values of the predictor variable xi or against the predicted response values ŷi. These are available from Excel’s regression tool; however, you will usually have to rescale the axes before getting a readable result.
3. Constant residual variance
This may also be checked from residual plots: Any systematic pattern in the scatter of the residuals around zero contradicts the assumption of constant residual variance.
Fuel Efficiency as a Function of Speed (Moore/McCabe)
[Figure: scatterplot of mpg against mph with fitted regression line (R² = 0.854), and the plot of the unstandardized residuals against mph.]
Miles per gallon versus logarithm of miles per hour
[Figure: scatterplot of mpg against logmph with fitted regression line (R² = 0.895), and the residual plot of the residuals against logmph.]
4. Independent Observations: Residual Autocorrelation and the Durbin-Watson Test
We say that the regression residuals are autocorrelated if the correlation of any residual with any of its preceding residuals is nonzero. Residual autocorrelation is the most serious violation of the assumptions of the statistical model for linear regression.
Common reasons for residual autocorrelation:
• Two time series are regressed upon each other.
• The dependence of Y upon X is nonlinear.
• Additional regressors are missing.
• There are trends or seasonal variation in the data.
• Missing data has been replaced by estimates.
The first-order autocorrelation (1. kertaluvun autokorrelaatio) ρ1, that is the autocorrelation of any residual εi with its preceding value εi−1, is assessed by the Durbin-Watson test statistic:
$$d = \frac{\sum_{i=2}^{n}(e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2},$$
which estimates 2(1 − ρ1), that is:
d ≈ 2 ⇒ residuals uncorrelated (OK),
d < 2 ⇒ εi positively autocorrelated,
d > 2 ⇒ εi negatively autocorrelated.
The critical values dα depend upon the data and are therefore not known exactly. But there are tabulated upper limits dU and lower limits dL, which depend only upon the number of regressors (in our case 1) and the number of data points, such that
dL < dα < dU.
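The Durbin-Watson statistic is easy to compute by hand from a list of residuals in their observation order. The sketch below uses hypothetical residual sequences and our own function names; it also inverts the relation d ≈ 2(1 − ρ1) to obtain a rough estimate of ρ1.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2
                    for i in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

def implied_autocorrelation(d):
    """Since d estimates 2(1 - rho1), rho1 is roughly 1 - d/2."""
    return 1 - d / 2

# Hypothetical residuals in time order
drifting = [1.0, 0.8, 0.5, 0.1, -0.3, -0.6, -0.9, -0.6]     # slow drift
alternating = [1.0, -0.9, 0.8, -1.1, 0.9, -1.0, 1.2, -0.8]  # sign flips

for name, e in [("drifting", drifting), ("alternating", alternating)]:
    d = durbin_watson(e)
    print(f"{name}: d = {d:.2f}, implied rho1 = {implied_autocorrelation(d):.2f}")
```

Slowly drifting residuals give d well below 2 (positive autocorrelation), while residuals that keep changing sign give d well above 2 (negative autocorrelation).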
To perform a Durbin-Watson test:
1. Choose a significance level (e.g. α=0.05).
2. Calculate d (available from the Durbin-Watson test option within Real Statistics regression).
3. Look up:
dL(α/2) and dU(α/2) for a two-sided test,
dL(α) and dU(α) for a one-sided test.
4. (i) Two-sided: H0: ρ1 = 0 vs. H1: ρ1 ≠ 0
d ≤ dL or d ≥ 4−dL ⇒ reject H0.
dU ≤ d ≤ 4−dU ⇒ accept H0.
otherwise ⇒ inconclusive.
(ii) One-sided: H0: ρ1 = 0 vs. H1: ρ1 > 0
d ≤ dL ⇒ reject H0.
d ≥ dU ⇒ accept H0.
otherwise ⇒ inconclusive.
(iii) One-sided: H0: ρ1 = 0 vs. H1: ρ1 < 0
d ≥ 4−dL ⇒ reject H0.
d ≤ 4−dU ⇒ accept H0.
otherwise ⇒ inconclusive.
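The decision rules of step 4 can be written down as a small function. This is only a sketch: the function name and the labels for the alternatives are our own, and the bounds used in the usage example are illustrative numbers that must be replaced by the tabulated dL and dU for your sample size and significance level (looked up at α/2 for the two-sided test).

```python
def durbin_watson_decision(d, d_lower, d_upper, alternative="two-sided"):
    """Apply the rules of step 4 to the statistic d and the tabulated bounds.

    alternative: "two-sided" (rho1 != 0), "positive" (rho1 > 0)
    or "negative" (rho1 < 0).
    """
    if alternative == "two-sided":
        if d <= d_lower or d >= 4 - d_lower:
            return "reject H0"
        if d_upper <= d <= 4 - d_upper:
            return "accept H0"
        return "inconclusive"
    if alternative == "positive":
        if d <= d_lower:
            return "reject H0"
        if d >= d_upper:
            return "accept H0"
        return "inconclusive"
    if alternative == "negative":
        if d >= 4 - d_lower:
            return "reject H0"
        if d <= 4 - d_upper:
            return "accept H0"
        return "inconclusive"
    raise ValueError("alternative must be 'two-sided', 'positive' or 'negative'")

# Illustrative bounds only; look up the real ones in a Durbin-Watson table
d_lower, d_upper = 1.08, 1.36
for d in (0.9, 1.2, 2.0, 3.1):
    print(d, durbin_watson_decision(d, d_lower, d_upper, "two-sided"))
```

Returning "inconclusive" as a separate outcome mirrors the fact that only bounds on the critical value dα are tabulated, not dα itself.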
The Durbin-Watson test is an option within the Real Statistics Linear Regression tool.
Note that Real Statistics runs this as a one-sided test against H1: ρ1 > 0. For a two-sided test against H1: ρ1 ≠ 0, replace your α with α/2 in the corresponding field and also check 4 minus the critical values from the output. For a one-sided test against H1: ρ1 < 0, use your original α and check only 4 minus the critical values from the output.