Effect at a population level Decision making procedure Testing for a group effect Linear effect
First steps in data analysis with R
David Causeur, Agrocampus Ouest
IRMAR CNRS UMR 6625
http://math.agrocampus-ouest.fr/infoglueDeliverLive/membres/david.causeur
Learning objectives
By the end of this course, students will be able to:
• Implement statistical methods for common data analysis issues;
• Choose appropriate procedures based on statistical arguments;
• Assess the performance of a statistical decision rule;
• Apply these key insights in class activities using statistical software.
Readings
D. Causeur and C.-F. Sheu (2017). Significance of a relationship. Online unpublished textbook.
Freely downloadable here: http://http://math.agrocampus-ouest.fr/infoglueDeliverLive/membres/david.causeur/teaching
Assignments
• In-class short exams - 50%
• Final Project - 50%
Online resources
Why R?
• R covers a huge range of functionality;
• R is free;
• knowing R is explicitly demanded in many job offers.
Outline
1 Effect at a population level
2 Decision making procedure
3 Testing for a group effect
  • Exploring for a group effect
  • One-way analysis-of-variance model
  • Least-squares estimation of effect parameters
  • F-test
  • The special case of a two-level factor: t-test
  • Detailing a significant group effect
  • Testing a group effect using paired data
4 Linear effect
  • Linearity of an effect
  • Linear regression modelling
  • Least-squares fitting
  • F-test
  • Comparing regression lines
Effect at a population level
One of the most common questions addressed by statistics:
'Is there an effect of this on that?'
• Does an increase of a drug dose modify the blood pressure of a patient?
• Does the nitrogen content in soil have an impact on crop yield?
• Does the gender of a consumer affect their propensity to buy a product?
Illustrative example along the lectures:
• The lean meat percentage (LMP) in a pig carcass measures its commercial value.
• In slaughterhouses, it is predicted using biometric measurements (tissue depths).
• To what extent is the LMP predictable from tissue depths?
• Does the genetic type of a pig affect its LMP or its fat depth?
How to handle this:
By considering that it involves two kinds of variables:
• the response variable Y,
• the explanatory variables X,
... the variations of Y being possibly related to the values of X.
Definition
'X has an effect on Y'
can be formulated mathematically as
'the distribution of Y restricted to items having the same value x of X, the so-called conditional distribution of Y given X = x, actually depends on that value x'.
Definition (in the pig data context)
'The genetic type has an effect on the fat depth'
can be formulated mathematically as
'the within-genetic type distributions of the fat depths are not the same'.
Data for decision making
Data: joint observations $(x_i, y_i)_{i=1,\ldots,n}$ of X and Y, with $n \geq 2$.

Definition
The set of n items for which we have observations $(x_i, y_i)_{i=1,\ldots,n}$ is named the sample, and n the sample size.
R script
> # stringsAsFactors=TRUE stores SEX and GENET as factors (no longer the default since R 4.0)
> dta = read.table("pig.txt",header=TRUE,stringsAsFactors=TRUE)

dta has 60 rows, one for each pig carcass in the sample, and 6 columns:
• LMP for the lean meat percentage,
• VFAT for the fat depth measured in the area of lumbar vertebra (mm),
• BFAT for the back fat depth (mm),
• BMUSCLE for the back muscle depth (mm),
• SEX for the sex of the animal and
• GENET for its genotype, regarding a gene of interest.
Provided the sample is representative of a wider population, conclusions and/or decisions are supposed to be valid at the population level.

Definition
The statistical methodology aiming at making a decision for a population, based on a sample of this population, is named inferential statistics.
Inferential versus exploratory data analysis
Our focus: using sample data, making a decision about the existence or not of an effect at the population level.

Exploratory data analysis will be used first to describe the effect of interest:
• using graphical representations
• or summary statistics.

Exploratory data analysis is a complement to inferential statistics, insightful for building hypotheses.
Data summary
Summary statistics: summarize the distribution of random observations

R script
> summary(dta) # Provides a columnwise summary of the data table
      LMP             VFAT            BFAT           BMUSCLE      SEX     GENET
 Min.   :48.62   Min.   : 9.16   Min.   : 7.005   Min.   :48.10   F:32   P0  :15
 1st Qu.:58.07   1st Qu.:14.09   1st Qu.:11.050   1st Qu.:57.80   M:28   P25 :23
 Median :59.84   Median :15.95   Median :13.210   Median :61.27          P50 :21
 Mean   :59.47   Mean   :16.40   Mean   :13.424   Mean   :61.64          NA's: 1
 3rd Qu.:61.18   3rd Qu.:18.43   3rd Qu.:15.328   3rd Qu.:66.05
 Max.   :66.30   Max.   :23.83   Max.   :22.195   Max.   :72.56
Most importantly, the distribution of a random variable is first related to its nature,
• either numeric: measured on a continuous or discrete scale (LMP, tissue depths, ...)
• or categorical: defining subgroups in the population (sex, genetic type, ...)
... sometimes ambiguous: e.g. number of children.
Mean, median and quantiles
The mean and median indicate the position of a series $x_1, \ldots, x_n$ on the real axis.

Definition
The mean of $(x_1, \ldots, x_n)$ is defined as follows:
\[ \bar{x} = \frac{x_1 + \ldots + x_n}{n}. \]
The median is defined less explicitly by two properties:
\[ \frac{1}{n}\,\mathrm{card}\{i = 1, \ldots, n : x_i \leq \mathrm{median}(x)\} \geq 0.5, \]
\[ \frac{1}{n}\,\mathrm{card}\{i = 1, \ldots, n : x_i \geq \mathrm{median}(x)\} \geq 0.5. \]
R script
> x = 1:4 # x is the series 1,2,3,4
> mean(x) # mean value of the series
[1] 2.5
> median(x) # median value
[1] 2.5
> x[4] = 40 # The 4th value is now 40
> mean(x)
[1] 11.5
> median(x)
[1] 2.5
The median can be viewed as one of the three quartiles: it is the 2nd quartile $q_{0.5}(x)$, the 1st quartile being $q_{0.25}(x)$ and the 3rd $q_{0.75}(x)$.

Definition
For all $0 \leq \alpha \leq 1$, the $100\alpha\%$ quantile of $(x_1, \ldots, x_n)$ is usually denoted $q_\alpha(x)$ and defined as follows:
\[ \frac{1}{n}\,\mathrm{card}\{i = 1, \ldots, n : x_i \leq q_\alpha(x)\} \geq \alpha, \]
\[ \frac{1}{n}\,\mathrm{card}\{i = 1, \ldots, n : x_i \geq q_\alpha(x)\} \geq 1 - \alpha. \]
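The two inequalities defining a quantile can be checked numerically. The sketch below is only an illustration: the series x and the helper is_quantile() are hypothetical (not part of the course material), and quantile() is called with type = 1, one of several conventions available in R, which always returns an observed value.

```r
# Hypothetical helper: does q satisfy the two defining inequalities
# of the 100*alpha% quantile of the series x?
is_quantile <- function(x, q, alpha) {
  mean(x <= q) >= alpha && mean(x >= q) >= 1 - alpha
}

x <- c(3, 1, 4, 1, 5, 9, 2, 6)   # hypothetical series, n = 8

# type = 1: inverse of the empirical distribution function
q25 <- quantile(x, probs = 0.25, type = 1, names = FALSE)
q50 <- quantile(x, probs = 0.50, type = 1, names = FALSE)

is_quantile(x, q25, 0.25)       # 1st quartile satisfies the definition
is_quantile(x, q50, 0.50)       # so does the median
is_quantile(x, max(x), 0.25)    # the maximum is not a 25% quantile
```

Note that the definition does not always pin down a single value, which is why software offers several quantile conventions.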
Decision making procedure
Significance of an effect
Definition
The effect of X on Y is said to be significant if there is evidence, deduced from the data analysis, that this effect actually exists at the population level.
Decision errors
Decision making has to deal with two kinds of errors:

Definition
• Type-I error: declaring an effect as significant whereas it does not exist at the population level.
• Type-II error: declaring an effect as non-significant whereas it does exist at the population level.

Those two error types are antagonistic:
• Liberal decision making: declaring the effect as significant even on weak evidence; large risk of a type-I error and low risk of a type-II error.
• Conservative decision making: declaring the effect as significant only if the evidence is very strong; low risk of a type-I error and large risk of a type-II error.
The conservative decision making process
The statistical decision making process favours a low probability of a Type-I error.

Two asymmetric hypothesized states of the effect:
\[ \begin{cases} H_0: \text{the effect does not exist at the population level} \\ H_1: \text{the effect actually exists at the population level} \end{cases} \]

H0 is called the null hypothesis: it will only be rejected if there is clear evidence that H0 is not consistent with the observations.

R.A. Fisher, 'The Design of Experiments' (1935)
The null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation.
Definition
The test of H0 consists in deciding whether or not to reject H0 based on data.
Outline for a decision making process
The 3 components of a statistical decision making process:
• The test statistic T, relevantly chosen to measure the effect size: the larger the effect size in the data, the larger the value of T.
• The null distribution of T: the distribution of T under H0. Suppose we know that $P_{H_0}(T \geq 2) \leq 0.05$; then observing T = 3 encourages rejecting H0.
• The p-value of the test: the probability, calculated under the null hypothesis, that the test statistic exceeds the observed value.

If the p-value is lower than a preset type-I error level α (usually α = 0.05), then the effect is significant.
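The role of the preset level α can be illustrated by simulation. The sketch below is an illustration under stated assumptions, not part of the course material: data are generated under H0 (two groups drawn from the same normal distribution), the two-group t-test from base R is used as an example of a test, and the group size, number of replicates and seed are arbitrary choices.

```r
set.seed(1)      # arbitrary seed, for reproducibility
alpha <- 0.05    # preset type-I error level
n_rep <- 2000    # number of simulated samples (arbitrary)
n_grp <- 15      # hypothetical size of each of the two groups

# Under H0, both groups come from the same normal distribution
p_values <- replicate(n_rep, {
  y <- rnorm(2 * n_grp)
  g <- factor(rep(c("A", "B"), each = n_grp))
  t.test(y ~ g, var.equal = TRUE)$p.value
})

# Proportion of samples for which H0 is wrongly rejected
type1_rate <- mean(p_values < alpha)
type1_rate
```

The proportion of wrong rejections should be close to α: rejecting when the p-value is below α keeps the probability of a type-I error at the chosen level.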
Inspired by a famous TV program: suppose you take a flight to an unknown destination, blindfolded.
• Your guess (H0): the destination is Brittany, west of France;
• On arrival, you evaluate the outside temperature (test statistic) at 40°;
• Is T = 40° consistent with your guess?
• To answer this question, you know that the probability that the temperature exceeds 40° in Brittany is very low (null distribution);
• Your conclusion: the null hypothesis is rejected.
Testing for a group effect
Exploring group effect
Definition (reminder)
'X has an effect on Y'
can be formulated mathematically as
'the conditional distribution of Y given X = x actually depends on that value x'.
Exploring a group effect using summary statistics
R script
> library(RcmdrMisc) # provides numSummary() and Hist()
> with(dta,numSummary(BFAT,groups=GENET))
        mean       sd    IQR     0%    25%    50%     75%   100% data:n
P0  15.16467 2.593618 3.6725 10.600 13.260 15.440 16.9325 20.170     15
P25 12.95435 3.628696 3.9225  7.005 10.830 12.035 14.7525 22.195     23
P50 12.87071 2.420862 3.7850  9.420 10.855 12.835 14.6400 17.140     21
Definition
The standard deviation is defined as follows:
\[ s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}. \]
The larger $s_x$, the larger the variations of $x_i$ around $\bar{x}$.

The variance $s_x^2$ is almost the mean of the squared variations $x_i - \bar{x}$.

Why divide by n − 1 and not n? The sum of the variations $x_i - \bar{x}$ being 0, only (n − 1) of them are linearly independent.

General principle: the number k of linear dependencies between variations is accounted for by dividing the sum of squared variations by its degrees of freedom n − k and not n.
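As a quick check of the formula, the standard deviation can be computed step by step and compared with R's built-in sd(). The series x below is hypothetical and chosen only so that the arithmetic is easy to follow.

```r
x <- c(2, 4, 4, 5, 7, 8)      # hypothetical series
n <- length(x)

dev <- x - mean(x)            # variations around the mean
sum(dev)                      # the variations sum to zero

# Sum of squared variations divided by its degrees of freedom n - 1
s_manual <- sqrt(sum(dev^2) / (n - 1))
s_manual
sd(x)                         # built-in function, same value
```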
It is deduced that:
• the mean backfat depth is slightly larger for P0 than for P25 and P50;
• the backfat depth values are more dispersed for P25 than for P0 and P50.
Exploring a group effect using plots
R script
> with(dta,
+   plot(GENET,BFAT,col="darkgray",cex.lab=1.25,pch=16,
+     main="Distribution of the backfat depth (mm) across genetic types"))
[Figure: boxplots of the backfat depth (mm) for the genetic types P0, P25 and P50. Title: Distribution of the backfat depth (mm) across genetic types]
Definition
A boxplot along the y-axis summarizes the dispersion of a series of numeric values by the following graphical elements:
• the box, whose lower value is $q_{0.25}$ and upper value $q_{0.75}$. The plain segment within the box locates the median;
• the lower whisker, which extends to the smallest value that is no more than 1.5 × IQR below the box (i.e. below $q_{0.25}$);
• the upper whisker, which extends to the largest value that is no more than 1.5 × IQR above the box (i.e. above $q_{0.75}$);
• isolated dots for each value beyond the limits of the whiskers.
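The elements of a boxplot can be obtained without drawing it, using boxplot.stats() from base R. The series below is hypothetical, with one deliberately extreme value. As an aside, boxplot.stats() uses hinges, which are close to but not always identical to the quartiles $q_{0.25}$ and $q_{0.75}$.

```r
# Hypothetical series with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

b <- boxplot.stats(x, coef = 1.5)   # coef = 1.5 is the usual whisker rule

# Lower whisker, lower hinge, median, upper hinge, upper whisker
b$stats

# Values beyond the whiskers, drawn as isolated dots
b$out
```

Here the extreme value lies more than 1.5 times the box height above the box, so the upper whisker stops at the largest remaining value.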
Exploring using plots
R script
> # defines a partition of the BFAT values
> vbreaks = seq(from=7,to=23,by=2)
> # Hist() comes from the RcmdrMisc package
> with(dta,
+   Hist(x=BFAT,groups=GENET,scale="percent",
+     breaks=vbreaks,col="darkgray",cex.lab=1.25,pch=16,
+     xlab="Backfat depth (mm)",
+     main="Distribution of backfat depth across genetic types"))
[Figure: three histograms (panels GENET = P0, GENET = P25, GENET = P50); x-axis: Backfat depth (mm), y-axis: percent. Title: Distribution of backfat depth across genetic types]
Definition
Let us split the interval covering all the $(x_1, \ldots, x_n)$ into a k-partition: $B_1 = [a_0; a_1[$, $B_2 = [a_1; a_2[$, ..., $B_k = [a_{k-1}; a_k[$, where k is a pre-chosen number of bins.

A histogram is a bar plot, the support of the ith bar being the bin $B_i$ and its area being proportional to the number $n_i$ (or the proportion $p_i$) of values falling within $B_i$.

When the widths of the bins are all equal, the heights of the bars are just proportional to $n_i$ (or $p_i$).
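The bin counts behind a histogram can be inspected with hist(..., plot = FALSE) from base R. The values and the breaks below are hypothetical; right = FALSE requests left-closed bins $[a_{i-1}; a_i[$ as in the definition (in R, the last bin also includes its upper bound by default).

```r
# Hypothetical values and a 3-bin partition with equal widths
x <- c(7.5, 8, 9.2, 10, 11.5, 12, 12.9, 14, 15.5, 18)
breaks <- seq(from = 7, to = 19, by = 4)   # bins [7;11[, [11;15[, [15;19]

h <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)

h$counts                  # numbers n_i of values in each bin
h$counts / length(x)      # proportions p_i

# With scale "density", bar areas (height x width) sum to 1
sum(h$density * diff(h$breaks))
```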
Distribution of a numeric variable
Definition (reminder)
'X has an effect on Y'
can be formulated mathematically as
'the conditional distribution of Y given X = x actually depends on that value x'.

Conditional distributions of Y can have different:
• means;
• variances;
• density shapes.
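The normality framework
Usual restrictions:
• The general shape of within-group density functions is the same;
• The 'reference' distribution in group i is $\mathcal{N}(\mu_i; \sigma_i)$:
\[ P(Y \leq y \mid i\text{th group}) = \int_{-\infty}^{y} f_{\mu_i,\sigma_i}(t)\,dt, \]
where $f_{\mu,\sigma}$ is the density function of $\mathcal{N}(\mu; \sigma)$:
\[ f_{\mu,\sigma}(t) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2\sigma^2}(t - \mu)^2\right). \]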
Adding the homoscedasticity restriction (the same standard deviation σ in all groups), the 'reference' distribution in group i becomes $\mathcal{N}(\mu_i; \sigma)$:
\[ P(Y \leq y \mid i\text{th group}) = \int_{-\infty}^{y} f_{\mu_i,\sigma}(t)\,dt. \quad \text{[homoscedasticity]} \]

Definition (simplified)
There is an effect of X on Y if, for at least two groups $i \neq i'$, $\mu_i \neq \mu_{i'}$.
Statistical model for a group effect
If $Y_{ij}$ stands for the response value for the jth sampling item ($j = 1, \ldots, n_i$) of the ith group ($i = 1, \ldots, I$):
\[ Y_{ij} = \mu_i + \varepsilon_{ij}, \]
where $\varepsilon_{ij} \sim \mathcal{N}(0; \sigma)$ is the residual error.

The above decomposition of $Y_{ij}$ exhibits two additive parts:
• the non-random part $\mu_i$ concentrates the variations of the response only due to the factor;
  ⇒ the $\mu_i$ are I unknown parameters.
• the random part $\varepsilon_{ij}$ captures the within-group variations of the response.
  ⇒ σ, the residual standard deviation, is another unknown parameter.
Judging for the existence of an effect
Test for an effect of X on Y:
\[ \begin{cases} H_0: \mu_1 = \ldots = \mu_I = \mu \ \text{(no effect of X at the population level)} \\ H_1: \text{for at least one couple } (i, i'), \text{ with } i \neq i', \ \mu_i \neq \mu_{i'}. \end{cases} \]
Choice between two models:
• the null model, for which X has no effect on Y [submodel]:
\[ Y_{ij} = \mu + \varepsilon_{ij}, \]
• and the nonnull model:
\[ Y_{ij} = \mu_i + \varepsilon_{ij}. \]
The one-way ANOVA model

Definition
The one-way analysis of variance (ANOVA) model for the effect of X on Y is usually formulated as follows:
\[ Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \]
where $\alpha_1, \ldots, \alpha_I$ are the effect parameters.
Parameterizations $(\mu_1, \ldots, \mu_I)$ and $(\mu, \alpha_1, \ldots, \alpha_I)$ are not equivalent: indeed, the system of equations $\mu + \alpha_i = \mu_i$, for all $i = 1, \ldots, I$, has infinitely many solutions in $(\mu, \alpha_1, \ldots, \alpha_I)$.

The most common ways of fixing this:
• Consider that $\alpha_1 + \ldots + \alpha_I = 0$. Since for all i, $\mu + \alpha_i = \mu_i$, then $\mu = \sum_{i=1}^{I} \mu_i / I$.
• Consider that $\alpha_1 = 0$. Then $\mu = \mu_1$ and, for all $i = 1, \ldots, I$, $\alpha_i = \mu_i - \mu_1$.
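Both constraints are available in R through the contrasts used by lm(). The sketch below relies on simulated data (the group labels, sizes, means and seed are hypothetical, not the pig data) and compares the fitted coefficients with the group means under each constraint.

```r
set.seed(2)   # arbitrary seed
# Hypothetical groups and group means
g  <- factor(rep(c("P0", "P25", "P50"), times = c(5, 7, 6)))
mu <- c(15, 13, 13)
y  <- rnorm(length(g), mean = mu[as.integer(g)], sd = 3)

group_means <- as.numeric(tapply(y, g, mean))

# Constraint alpha_1 = 0 (R's default "treatment" contrasts):
# intercept = mean of group 1, other coefficients = differences to group 1
fit_ref <- lm(y ~ g)
coef(fit_ref)

# Constraint alpha_1 + ... + alpha_I = 0 ("sum" contrasts):
# intercept = unweighted mean of the group means
fit_sum <- lm(y ~ g, contrasts = list(g = "contr.sum"))
coef(fit_sum)
```

The two fits give the same fitted values; only the meaning of the reported coefficients changes.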
Basic idea: Testing for the significance of an effect amounts to comparing the goodness-of-fit of the null and nonnull models.
Fitting a one-way ANOVA model

Definition
Fitting a one-way ANOVA model amounts to estimating its parameters, namely assigning values to the parameters so that the model is as close as possible to the data.
Close? ... minimization of the least-squares criterion:
\[ SS(\mu_1, \ldots, \mu_I) = \sum_{j=1}^{n_1} (Y_{1j} - \mu_1)^2 + \ldots + \sum_{j=1}^{n_I} (Y_{Ij} - \mu_I)^2, \]
by separately minimizing the summands $\sum_{j=1}^{n_i} (Y_{ij} - \mu_i)^2$:
\[ \hat{\mu}_i = \frac{Y_{i1} + \ldots + Y_{in_i}}{n_i} = \bar{Y}_{i\bullet}. \]
$\hat{\mu}_i$ is said to be the least-squares estimator of $\mu_i$.
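The fact that group means minimize the least-squares criterion can be checked numerically. The sketch below uses simulated data (hypothetical groups A, B, C, arbitrary means and seed) and a hypothetical helper ss() implementing the criterion.

```r
set.seed(3)   # arbitrary seed
# Hypothetical groups of unequal sizes
g <- factor(rep(c("A", "B", "C"), times = c(4, 6, 5)))
y <- rnorm(length(g), mean = c(10, 12, 15)[as.integer(g)], sd = 1)

# Least-squares criterion SS(mu_1, ..., mu_I)
ss <- function(mu) sum((y - mu[as.integer(g)])^2)

mu_hat <- as.numeric(tapply(y, g, mean))   # group means

ss(mu_hat)                   # criterion at the group means
ss(mu_hat + c(0.1, 0, 0))    # moving away from them increases it

# lm() without intercept returns the same estimates
fit <- lm(y ~ -1 + g)
coef(fit)
```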
Definition
An estimator $\hat{\theta}$ of θ is a function of the data, designed to ensure that $\hat{\theta}$ is close to the true value θ of the parameter.
R script
> # na.action tells what to do with missing data (exclusion)
> bfat.lm1 = lm(BFAT ~ -1+GENET,data=dta,
+   na.action=na.exclude)
> coef(bfat.lm1) # Extracts estimated coefficients
 GENETP0 GENETP25 GENETP50
15.16467 12.95435 12.87071
R script
> bfat.lm2 = lm(BFAT ~ GENET,data=dta,
+   na.action=na.exclude)
> coef(bfat.lm2)
(Intercept)    GENETP25    GENETP50
  15.164667   -2.210319   -2.293952
Estimation accuracy

Definition
The accuracy of an estimator $\hat{\theta}$ of θ is usually measured by:
• its bias, namely the expected value of the estimation error $\hat{\theta} - \theta$: $b_{\hat{\theta}} = E(\hat{\theta} - \theta)$;
• its Root Mean Squared Error (RMSE), the square root of the mean squared estimation error: $RMSE_{\hat{\theta}} = \sqrt{MSE_{\hat{\theta}}}$, where $MSE_{\hat{\theta}} = E\left((\hat{\theta} - \theta)^2\right)$.

If $b_{\hat{\theta}} = 0$,
• $\hat{\theta}$ is said to be unbiased;
• $RMSE_{\hat{\theta}}$ coincides with the standard deviation $\sigma_{\hat{\theta}}$ of $\hat{\theta}$.
Accuracy of least-squares estimation
The estimation error $\hat{\mu}_i - \mu_i$ has the following expression:
\[ \hat{\mu}_i - \mu_i = \frac{Y_{i1} + \ldots + Y_{in_i}}{n_i} - \mu_i = \frac{(Y_{i1} - \mu_i) + \ldots + (Y_{in_i} - \mu_i)}{n_i} = \frac{\varepsilon_{i1} + \ldots + \varepsilon_{in_i}}{n_i}. \]
Therefore,
• $\hat{\mu}_i - \mu_i$ is normally distributed;
• $\hat{\mu}_i$ is unbiased;
• $MSE_{\hat{\mu}_i}$ is given by:
\[ MSE_{\hat{\mu}_i} = \frac{\mathrm{Var}(\varepsilon_{i1}) + \ldots + \mathrm{Var}(\varepsilon_{in_i})}{n_i^2} = \frac{\sigma^2 + \ldots + \sigma^2}{n_i^2} = \frac{\sigma^2}{n_i}. \]
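The result $MSE_{\hat{\mu}_i} = \sigma^2 / n_i$ can be illustrated by a small Monte Carlo experiment. All numerical settings below (true mean, standard deviation, group size, number of replicates, seed) are hypothetical choices made for the illustration.

```r
set.seed(4)     # arbitrary seed
mu_i  <- 15     # hypothetical true group mean
sigma <- 2      # hypothetical residual standard deviation
n_i   <- 10     # hypothetical group size
n_rep <- 5000   # number of simulated groups

# Least-squares estimate (group mean) for each simulated group
mu_hat <- replicate(n_rep, mean(rnorm(n_i, mean = mu_i, sd = sigma)))

bias_mc <- mean(mu_hat - mu_i)        # should be close to 0
mse_mc  <- mean((mu_hat - mu_i)^2)    # should be close to sigma^2 / n_i

c(bias = bias_mc, mse = mse_mc, theory = sigma^2 / n_i)
```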
Least-squares fit of the ANOVA model
Decomposition of $Y_{ij}$ as a sum $\hat{Y}_{ij} + \hat{\varepsilon}_{ij}$ of:
• fitted values $\hat{Y}_{ij} = \hat{\mu}_i = \hat{\mu} + \hat{\alpha}_i$, whose variations are only due to the explanatory variable,
• and, by difference, residuals $\hat{\varepsilon}_{ij} = Y_{ij} - \hat{Y}_{ij} = Y_{ij} - \bar{Y}_{i\bullet}$, whose variations are within-group differences between the response values and their group means.
R script
> bfat.fit = fitted(bfat.lm2) # Extracts fitted values
> numSummary(bfat.fit,groups=dta$GENET,
+   statistics=c("mean","sd"))
        mean           sd data:n
P0  15.16467 1.162899e-15     15
P25 12.95435 0.000000e+00     23
P50 12.87071 5.086712e-15     21
R script
> bfat.res = residuals(bfat.lm2) # Extracts residuals
> numSummary(bfat.res,groups=dta$GENET,
+   statistics=c("mean","sd"))
             mean       sd data:n
P0   1.110223e-15 2.593618     15
P25  1.927052e-17 3.628696     23
P50 -2.193599e-16 2.420862     21
Estimation of the residual standard deviation
Since $\sigma^2$ is the variance of the residual error terms, it is estimated by:
\[ \hat{\sigma}^2 = \frac{\sum_{j=1}^{n_1} (Y_{1j} - \bar{Y}_{1\bullet})^2 + \ldots + \sum_{j=1}^{n_I} (Y_{Ij} - \bar{Y}_{I\bullet})^2}{n - I} = \frac{RSS}{n - I}. \]
Note: RSS is not divided by n but by n − I.

n − I is the number of residual degrees of freedom, which can be viewed as the number of linearly independent residuals.
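The estimate $\hat{\sigma}$ can be recomputed from the residuals and compared with the value stored in the model summary. The data below are simulated (hypothetical groups, means and seed), so the numbers differ from the pig example.

```r
set.seed(5)   # arbitrary seed
# Hypothetical one-way layout with I = 3 groups
g <- factor(rep(c("A", "B", "C"), times = c(8, 10, 9)))
y <- rnorm(length(g), mean = c(10, 12, 12)[as.integer(g)], sd = 3)

fit <- lm(y ~ g)
n <- length(y)
I <- nlevels(g)

rss <- sum(residuals(fit)^2)          # residual sum of squares
sigma_hat <- sqrt(rss / (n - I))      # divided by n - I, not n

sigma_hat
summary(fit)$sigma                    # same value, as reported by R
df.residual(fit)                      # residual degrees of freedom n - I
```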
R script
> # Summary extracts useful statistics from the fitted model
> # including the residual standard deviation
> summary(bfat.lm2)$sigma
[1] 2.99127
ANOVA equation
Testing for an effect: comparison between the null and nonnull models

Goodness-of-fit is measured by the residual sum of squares:
\[ RSS = \sum_{i=1}^{I} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{i\bullet})^2, \quad \text{[nonnull model]} \]
\[ RSS_0 = \sum_{i=1}^{I} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{\bullet\bullet})^2, \quad \text{[null model]} \]
where $\bar{Y}_{\bullet\bullet} = \frac{n_1}{n} \bar{Y}_{1\bullet} + \frac{n_2}{n} \bar{Y}_{2\bullet} + \ldots + \frac{n_I}{n} \bar{Y}_{I\bullet}$.
ANOVA equation:
\[ RSS_0 = \sum_{i=1}^{I} n_i (\bar{Y}_{i\bullet} - \bar{Y}_{\bullet\bullet})^2 + RSS. \]
Model assessment using the R²
The effect is assessed by the following ratio:
\[ R^2 = \frac{RSS_0 - RSS}{RSS_0} = \frac{\sum_{i=1}^{I} n_i (\bar{Y}_{i\bullet} - \bar{Y}_{\bullet\bullet})^2}{RSS_0}. \]
Indeed,
• $0 \leq R^2 \leq 1$,
• $R^2 = 0$ corresponds to 'no effect',
• $R^2 = 1$ corresponds to 'perfect effect'.
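The ANOVA equation and the two expressions of R² can be verified numerically. The sketch below uses simulated data (hypothetical groups, means and seed) rather than the pig data.

```r
set.seed(6)   # arbitrary seed
# Hypothetical one-way layout
g <- factor(rep(c("A", "B", "C"), times = c(8, 10, 9)))
y <- rnorm(length(g), mean = c(10, 12, 12)[as.integer(g)], sd = 3)

group_means <- as.numeric(tapply(y, g, mean))
n_i <- as.numeric(table(g))

rss     <- sum((y - group_means[as.integer(g)])^2)   # nonnull model
rss0    <- sum((y - mean(y))^2)                      # null model
between <- sum(n_i * (group_means - mean(y))^2)      # between-group part

all.equal(rss0, between + rss)    # ANOVA equation

r2 <- (rss0 - rss) / rss0
r2
summary(lm(y ~ g))$r.squared      # same value, as reported by R
```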
R script
> # Extracts the R2
> summary(bfat.lm2)$r.squared
[1] 0.1016868
Significance of an effect using the F-test
Is R² = 0.10 too weak to consider that genetic-type differences exist at the population level?

Test statistic (the F-test):
\[ F = \frac{(RSS_0 - RSS)/(I - 1)}{RSS/(n - I)}. \]
Degrees of freedom:
• RSS: n − I d.f.;
• RSS0 − RSS: I − 1 d.f.

Indeed, the between-group variations $n_i(\bar{Y}_{i\bullet} - \bar{Y}_{\bullet\bullet})$ sum to zero: only I − 1 of them are linearly independent.
R script
> # Extracts the F-test statistic
> summary(bfat.lm2)$fstatistic
    value     numdf     dendf
 3.169528  2.000000 56.000000
p-value and decision rule
Whether the effect is significant now relies on judging whether F = 3.170 is abnormally large with regard to its null distribution.

Here, the null distribution is the Fisher distribution $F_{I-1,n-I}$ with I − 1 and n − I degrees of freedom.
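The whole procedure (test statistic, null distribution, p-value) can be reproduced by hand and compared with the output of anova(). The data below are simulated (hypothetical groups, means and seed); deviance() applied to a fitted lm object returns its residual sum of squares.

```r
set.seed(7)   # arbitrary seed
# Hypothetical one-way layout
g <- factor(rep(c("A", "B", "C"), times = c(8, 10, 9)))
y <- rnorm(length(g), mean = c(10, 12, 12)[as.integer(g)], sd = 3)
n <- length(y)
I <- nlevels(g)

fit  <- lm(y ~ g)    # nonnull model
fit0 <- lm(y ~ 1)    # null model

rss  <- deviance(fit)
rss0 <- deviance(fit0)

# F-test statistic and its p-value under the Fisher null distribution
f_stat  <- ((rss0 - rss) / (I - 1)) / (rss / (n - I))
p_value <- pf(f_stat, df1 = I - 1, df2 = n - I, lower.tail = FALSE)

c(F = f_stat, p = p_value)
anova(fit)           # same F value and p-value
```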
R script
> # Defines a sequence of x values
> x = seq(from=0,to=8,by=0.01)
> # Calculates the Fisher density function for each x
> y = df(x,df1=2,df2=56)
> # Plots the density function
> plot(x,y,type="l",xlab="F-test statistics",ylab="Density",
+   main="Density function of the Fisher distribution with 2 and 56 d.f.",lwd=2)
[Figure: density function of the Fisher distribution with 2 and 56 d.f.; x-axis: Fisher test statistics (0 to 8), y-axis: Density (0 to 1)]
R script
> # p-value of the test
> pf(3.170,df1=2,df2=56,lower.tail=FALSE)
[1] 0.04963573
R script
> # Displays the complete ANOVA table
> anova(bfat.lm2)
Analysis of Variance Table
Response: BFAT
          Df Sum Sq Mean Sq F value  Pr(>F)
GENET      2  56.72 28.3600  3.1695 0.04966
Residuals 56 501.07  8.9477
p-value and decision rule
The first row of the ANOVA table is for the effect of the genetic type and the second row for the residual error:
• Df: degrees of freedom, respectively I − 1 and n − I;
• Sum Sq: sums of squares, respectively RSS0 − RSS and RSS;
• Mean Sq: mean squares, respectively (RSS0 − RSS)/(I − 1) and RSS/(n − I);
• F value: F-statistic, the ratio of the mean squares;
• Pr(>F): p-value of the F-test.
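The columns of the table can be rebuilt by hand from the sums of squares and degrees of freedom. The sketch below starts from the rounded values printed in the ANOVA table of the genetic-type effect, so the results match the table only up to rounding; the variable names are illustrative.

```r
# Values read from the ANOVA table (rounded as printed)
ss_effect <- 56.72    # RSS0 - RSS
ss_resid  <- 501.07   # RSS
df_effect <- 2        # I - 1
df_resid  <- 56       # n - I

ms_effect <- ss_effect / df_effect            # Mean Sq, first row
ms_resid  <- ss_resid / df_resid              # Mean Sq, second row
f_value   <- ms_effect / ms_resid             # F value
p_value   <- pf(f_value, df_effect, df_resid, lower.tail = FALSE)  # Pr(>F)
print(round(c(ms_effect = ms_effect, ms_resid = ms_resid,
              f_value = f_value, p_value = p_value), 4))
```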
Two-group comparison
Test for the effect of a 2-level factor:
$$H_0: \delta = \mu_1 - \mu_2 = 0 \quad \text{against} \quad H_1: \delta = \mu_1 - \mu_2 \neq 0.$$
R script
> bfatsex.lm = lm(BFAT ~ SEX,data=dta)
> anova(bfatsex.lm)
Analysis of Variance Table
Response: BFAT
          Df Sum Sq Mean Sq F value   Pr(>F)
SEX        1   74.0  74.001  8.6237 0.004752
Residuals 58  497.7   8.581
Two-group comparison t-test
Let us first focus on the F-test statistic:
$$F = \frac{n_1(\bar{Y}_{1\bullet} - \bar{Y}_{\bullet\bullet})^2 + n_2(\bar{Y}_{2\bullet} - \bar{Y}_{\bullet\bullet})^2}{\hat{\sigma}^2},$$
with
$$n_1(\bar{Y}_{1\bullet} - \bar{Y}_{\bullet\bullet})^2 = n_1\left(\bar{Y}_{1\bullet} - \frac{n_1\bar{Y}_{1\bullet} + n_2\bar{Y}_{2\bullet}}{n_1 + n_2}\right)^2 = n_1 n_2^2\left(\frac{\bar{Y}_{2\bullet} - \bar{Y}_{1\bullet}}{n_1 + n_2}\right)^2,$$
$$n_2(\bar{Y}_{2\bullet} - \bar{Y}_{\bullet\bullet})^2 = n_2 n_1^2\left(\frac{\bar{Y}_{2\bullet} - \bar{Y}_{1\bullet}}{n_1 + n_2}\right)^2.$$
Therefore,
$$F = \left(\frac{\bar{Y}_{2\bullet} - \bar{Y}_{1\bullet}}{\hat{\sigma}}\right)^2 \frac{n_1 n_2 (n_1 + n_2)}{(n_1 + n_2)^2} = \Bigg(\underbrace{\frac{\bar{Y}_{2\bullet} - \bar{Y}_{1\bullet}}{\hat{\sigma}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}}_{T}\Bigg)^2.$$
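The identity between the F-statistic and the squared t-statistic can be checked numerically. The sketch below uses a small simulated two-group sample (hypothetical data, not the pig dataset) with unequal group sizes:

```r
set.seed(1)
# Hypothetical two-group sample, for illustration only
dat <- data.frame(
  group = factor(rep(c("A", "B"), times = c(12, 9))),
  y = c(rnorm(12, mean = 10, sd = 2), rnorm(9, mean = 12, sd = 2))
)
# F-statistic of the one-way ANOVA with a 2-level factor
f_value <- anova(lm(y ~ group, data = dat))[1, "F value"]
# Equal-variance two-sample t-statistic
t_value <- unname(t.test(y ~ group, data = dat, var.equal = TRUE)$statistic)
print(c(F = f_value, T_squared = t_value^2))
```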
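Two-group comparison t-test
Let us first focus on the F-test statistic:
$$F = T^2, \quad \text{where } T = \frac{\bar{Y}_{2\bullet} - \bar{Y}_{1\bullet}}{\hat{\sigma}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{\hat{\delta}}{\hat{\sigma}_{\hat{\delta}}}.$$
Indeed,
$$\sigma^2_{\hat{\delta}} = \mathrm{Var}(\bar{Y}_{1\bullet} - \bar{Y}_{2\bullet}) = \mathrm{Var}(\bar{Y}_{1\bullet}) + \mathrm{Var}(\bar{Y}_{2\bullet}) = \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right).$$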
Two-group comparison t-test
Definition
Let $\hat{\theta}$ be an estimator of θ and $\hat{\sigma}_{\hat{\theta}}$ an estimator of the standard deviation of $\hat{\theta}$.
For the test of H0: θ = θ0, $T_{\theta_0} = (\hat{\theta} - \theta_0)/\hat{\sigma}_{\hat{\theta}}$ is called a t-test statistic.
Two-group comparison t-test
R script
> # var.equal=TRUE in t.test states that
> # within-group standard deviations are assumed to be equal
> # mu=0 states that the mean difference under the null is zero
> # mu=0 is the default option
> t.test(BFAT ~ SEX,var.equal=TRUE,data=dta,mu=0)$statistic
        t
-2.936608
Two-group comparison t-test
In the present situation, the null distribution of T is the Student distribution with n − 2 degrees of freedom, denoted $T_{n-2}$.
R script
> # pt is the Student cumulative distribution function
> 2*pt(2.936608,df=58,lower.tail=FALSE)
[1] 0.004751769
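Since F = T², the two-sided p-value of the t-test coincides with the p-value of the F-test with 1 and 58 degrees of freedom. A short check based on the t-statistic printed above (variable names are illustrative):

```r
t_obs <- -2.936608
# Two-sided p-value from the Student distribution with 58 d.f.
p_t <- 2 * pt(abs(t_obs), df = 58, lower.tail = FALSE)
# p-value from the Fisher distribution with 1 and 58 d.f., applied to T^2
p_f <- pf(t_obs^2, df1 = 1, df2 = 58, lower.tail = FALSE)
print(c(T_squared = t_obs^2, p_t = p_t, p_f = p_f))
```

The squared statistic matches the F value 8.6237 of the ANOVA table for the SEX effect, up to rounding.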
Two-group comparison t-test
R script
> # The same value is also provided as an output of t.test
> t.test(BFAT ~ SEX,var.equal=TRUE,data=dta)$p.value
[1] 0.004751764
One-sided t-tests
Test of H0: δ = 0 against H1: δ < 0.
The rejection rule is one-sided: H0 is rejected if T is suspiciously small under the null.
Consistently, the p-value is the probability that a $T_{n-2}$ variable is lower than the observed value of T:
R script
> pt(-2.936608,df=58,lower.tail=TRUE)
[1] 0.002375884
One-sided t-tests
R script
> # alternative="less" is used for the present one-sided test
> t.test(BFAT ~ SEX,var.equal=TRUE,data=dta,
+ alternative="less")$p.value
[1] 0.002375882
Confidence intervals for parameters
Back to the two-sided test, with |T| = 2.9366:
The difference between the mean backfat depths of males and females is significant at type-I error level α = 0.05.
What would be the largest value $t^\star$ of |T| for which the null hypothesis would not be rejected at type-I error level α?
If |T| = $t^\star$, then the p-value equals α, the threshold at or above which the null hypothesis is not rejected.
Therefore, $t^\star = t^{(n-2)}_{1-\alpha/2}$ is the 100(1 − α/2)%-quantile of the $T_{n-2}$ distribution.
Confidence intervals for parameters
R script
> qt(0.975,df=58)
[1] 2.001717
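As a quick check of the role of this quantile, the two-sided p-value computed at |T| equal to the quantile should be exactly α, while the observed |T| = 2.9366 gives a smaller p-value. A minimal sketch with illustrative variable names:

```r
alpha <- 0.05
# Largest |T| for which H0 is not rejected at level alpha (58 d.f.)
t_star <- qt(1 - alpha / 2, df = 58)
# Two-sided p-value when |T| equals that threshold
p_at_t_star <- 2 * pt(t_star, df = 58, lower.tail = FALSE)
# Two-sided p-value for the observed statistic
p_observed <- 2 * pt(2.9366, df = 58, lower.tail = FALSE)
print(c(t_star = t_star, p_at_t_star = p_at_t_star, p_observed = p_observed))
```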
Confidence intervals for parameters
The confidence interval $CI_{1-\alpha}(\delta)$, with confidence level 1 − α, can be viewed as:
$$CI_{1-\alpha}(\delta) = \left\{\delta_0 : H_0: \delta = \delta_0 \text{ is not rejected at level } \alpha\right\}$$
$$= \left\{\delta_0 : -t^{(n-2)}_{1-\alpha/2} \leq \frac{\hat{\delta} - \delta_0}{\hat{\sigma}_{\hat{\delta}}} \leq t^{(n-2)}_{1-\alpha/2}\right\}$$
$$= \left[\hat{\delta} - t^{(n-2)}_{1-\alpha/2}\,\hat{\sigma}_{\hat{\delta}}\,;\ \hat{\delta} + t^{(n-2)}_{1-\alpha/2}\,\hat{\sigma}_{\hat{\delta}}\right].$$
Definition
The set of values θ0 such that the null hypothesis H0: θ = θ0 is not rejected by a pre-chosen test at type-I error level α is a confidence interval for θ, with confidence level 1 − α.
Confidence intervals for parameters
R script
> # var.equal=TRUE in t.test states that
> # within-group standard deviations are assumed to be equal
> t.test(BFAT ~ SEX,var.equal=TRUE,data=dta)
        Two Sample t-test
data:  BFAT by SEX
t = -2.9366, df = 58, p-value = 0.004752
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.7434566 -0.7086862
sample estimates:
mean in group F mean in group M
       12.38500        14.61107
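The interval reported by `t.test` can be recovered from the other printed quantities. In the sketch below, the standard error of the estimated difference is deduced from the identity T = estimate / standard error, using the rounded values shown in the slides, so the match holds only up to rounding; variable names are illustrative.

```r
mean_f <- 12.38500           # mean in group F
mean_m <- 14.61107           # mean in group M
t_obs  <- -2.936608          # t-statistic reported by t.test

delta_hat <- mean_f - mean_m             # estimated mean difference
se_delta  <- delta_hat / t_obs           # standard error, since T = delta_hat / se
t_star    <- qt(0.975, df = 58)          # 97.5% quantile of the Student distribution
ci <- delta_hat + c(-1, 1) * t_star * se_delta
print(round(ci, 4))
```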
Confidence intervals for parameters
For the mean parameters µ1 and µ2:
$$CI_{1-\alpha}(\mu_i) = \left[\hat{\mu}_i - t^{(n-2)}_{1-\alpha/2}\frac{\hat{\sigma}}{\sqrt{n_i}}\,;\ \hat{\mu}_i + t^{(n-2)}_{1-\alpha/2}\frac{\hat{\sigma}}{\sqrt{n_i}}\right].$$
R script
> # Confidence intervals for mean backfat depths by sex
> bfatsex.lm1 = lm(BFAT ~ -1+SEX,data=dta)
> # level = 0.95 (default) sets the confidence level at 0.95
> cbind(coef(bfatsex.lm1),confint(bfatsex.lm1,level=0.95))
                 2.5 %   97.5 %
SEXF 12.38500 11.34843 13.42157
SEXM 14.61107 13.50293 15.71921
Power of the t-test
To what extent can the t-test detect a targeted mean difference at the population level?
In the pig data context: if the mean difference at the population level between males and females is 1, can we be sure that the test will declare the effect as significant? More precisely, what is the probability that the t-test declares the effect as significant?
Definition
Let us consider the test of the null hypothesis H0: θ = θ0 against the alternative hypothesis H1: θ = θ0 + τ, with τ ≠ 0, at the type-I error level α.
The power of the test is the probability that the test rejects the null under H1:
$$\mathrm{Power}(\tau) = P_{\theta = \theta_0 + \tau}\left(|T| \geq t^{(n-2)}_{1-\alpha/2}\right).$$
Power of the t-test
Distribution of the t-test statistic under H1: θ = θ0 + τ: noncentral Student distribution $T_{n-2}(\lambda)$ with noncentrality parameter
$$\lambda = \frac{\tau}{\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}.$$
Power of the t-test
R script
> x = seq(from=-5,to=5,by=0.1) # Sequence of regularly spaced values in [-5,5]
> dstud = dt(x,df=58) # Density function of the Student distribution (58 d.f.)
> plot(x,dstud,main="Density function of Student distribution (58 d.f.)",
+ ylab="Density",lwd=2,col="darkgray",cex.lab=1.25,cex.axis=1.25,type="l",
+ ylim=c(0,0.5)) # Plots the density curve
> dstud.nc = dt(x,df=58,ncp=1)
> # Density of the noncentral Student distribution (58 d.f., lambda=1)
> lines(x,dstud.nc,lwd=2,col="blue") # Adds the noncentral density curve
> legend("topleft",col=c("darkgray","blue"),lwd=2,bty="n",
+ legend=c("Student (58 d.f)","Noncentral Student (58 d.f., ncp=1)"))
Power of the t-test
[Figure: density functions of the Student distribution (58 d.f., dark gray) and of the noncentral Student distribution (58 d.f., ncp=1, blue); x-axis: x (−5 to 5); y-axis: Density (0 to 0.5).]
Power of the t-test
R script
> qttest = qt(0.975,df=58) # 97.5% quantile of the null distribution
> qttest
[1] 2.001717
> sigma = summary(bfatsex.lm)$sigma # Residual standard deviation
> lambda = 1/(sigma*sqrt((1/32)+(1/28))) # Noncentrality parameter
> pt(-qttest,df=58,ncp=lambda)+
+ pt(qttest,df=58,ncp=lambda,lower.tail=FALSE) # Power of the t-test
[1] 0.2543652
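The power can also be approximated by simulation: draw many samples under H1 and count how often the test rejects. The sketch below is a Monte Carlo illustration under stated assumptions: normal data, group sizes 32 and 28, a true mean difference of 1, and a within-group standard deviation set to 2.929351 (the rounded value of `sigma` reported elsewhere in these slides). The number of replications and the seed are arbitrary choices.

```r
set.seed(123)
n1 <- 32; n2 <- 28          # group sizes
sigma <- 2.929351           # within-group standard deviation (rounded)
tau <- 1                    # mean difference under H1
t_star <- qt(0.975, df = n1 + n2 - 2)

# Proportion of simulated samples for which |T| exceeds the threshold
n_sim <- 5000
reject <- replicate(n_sim, {
  y1 <- rnorm(n1, mean = 0, sd = sigma)
  y2 <- rnorm(n2, mean = tau, sd = sigma)
  abs(t.test(y1, y2, var.equal = TRUE)$statistic) >= t_star
})
power_mc <- mean(reject)

# Exact power from the noncentral Student distribution
lambda <- tau / (sigma * sqrt(1 / n1 + 1 / n2))
power_exact <- pt(-t_star, df = 58, ncp = lambda) +
  pt(t_star, df = 58, ncp = lambda, lower.tail = FALSE)
print(c(power_mc = power_mc, power_exact = power_exact))
```

The simulated proportion should be close to the exact power, within Monte Carlo error.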
Power of the t-test
R script
> power.t.test(delta=1,n=30,sd=sigma,sig.level=0.05)
> # delta is the mean difference to detect at population level
> # n = 30 is the group size
> # sd is the within-group standard deviation
> # sig.level is the type-I error level
     Two-sample t test power calculation
              n = 30
          delta = 1
             sd = 2.929351
      sig.level = 0.05
          power = 0.2547309
    alternative = two.sided
NOTE: n is number in *each* group
What makes a t-test powerful?
The larger |λ|, the more powerful the t-test.
Consequently, the power of a t-test depends on:
• |τ|/σ: the larger, the more powerful the test.
R script
> power.t.test(delta=5,n=30,sd=sigma,sig.level=0.05)$power
[1] 0.9999972
• n1 and n2: the larger, the more powerful the test.
R script
> power.t.test(delta=1,power=0.90,sd=sigma,sig.level=0.05)$n
[1] 181.2963
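To see how the power grows with the group size, `power.t.test` can be evaluated over a few values of n. This is a sketch with an arbitrary grid of group sizes; the standard deviation is the rounded value 2.929351 shown in the `power.t.test` output above.

```r
sigma <- 2.929351                       # within-group standard deviation (rounded)
group_sizes <- c(30, 60, 120, 182)      # arbitrary grid of per-group sample sizes
powers <- sapply(group_sizes, function(n)
  power.t.test(n = n, delta = 1, sd = sigma, sig.level = 0.05)$power)
print(data.frame(n = group_sizes, power = round(powers, 3)))
```

With 182 items per group, the smallest integer above 181.2963, the power should reach at least 0.90.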
Post-hoc tests
Declaring the effect of a factor as significant means that some group mean responses are different.
Which groups?
For example, for a significant effect of 'Genetic type':
• P0 ≠ P25 ≠ P50?
• or P0 ≠ {P25 = P50}?
• or ...
Post-hoc tests for I groups: I(I − 1)/2 simultaneous tests of the null hypotheses $H_0^{(ii')}: \alpha_i = \alpha_{i'}$, for 1 ≤ i < i′ ≤ I.
Post-hoc tests
Is it safe to test many hypotheses simultaneously with α = 0.05?
If I is large, the number I(I − 1)/2 of pairwise comparisons can become very large,
... which increases the probability of one or more erroneous rejections of $H_0^{(ii')}$:
$$1 - P_{\text{all } H_0^{(ii')}}\left(H_0^{(12)} \text{ not rejected}, \ldots, H_0^{(I-1,I)} \text{ not rejected}\right)$$
$$= 1 - P_{H_0^{(12)}}\left(H_0^{(12)} \text{ not rejected}\right) \cdots P_{H_0^{(I-1,I)}}\left(H_0^{(I-1,I)} \text{ not rejected}\right) \quad \text{[under an independence assumption among tests]}$$
$$\leq 1 - (1 - \alpha)^{I(I-1)/2}.$$
Post-hoc tests
$$\text{Family-Wise Error Rate (FWER)} \leq 1 - (1 - \alpha)^{I(I-1)/2}.$$
Post-hoc tests
R script
> 1-(1-0.05)^3
[1] 0.142625
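The same bound can be evaluated for larger numbers of groups to see how quickly it grows. The group counts below are arbitrary illustrative choices:

```r
alpha <- 0.05
n_groups <- c(3, 5, 10)                       # number I of groups
n_tests <- n_groups * (n_groups - 1) / 2      # number of pairwise comparisons
fwer_bound <- 1 - (1 - alpha)^n_tests         # bound under independence
print(data.frame(I = n_groups, tests = n_tests, bound = round(fwer_bound, 4)))
```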
Post-hoc tests
Sidak's correction of α: $\alpha^\ast = 1 - (1 - \alpha)^{2/(I(I-1))}$ guarantees that FWER ≤ α:
R script
> alpha = 1-(1-0.05)^(1/3)
> alpha
[1] 0.01695243
> 1-(1-alpha)^3
[1] 0.05
Simultaneous tests
R script
> bfat.lm = lm(BFAT ~ GENET,data=dta)
> summary(bfat.lm)$coefficients
             Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 15.164667  0.7723426 19.634636 8.809707e-27
GENETP25    -2.210319  0.9927454 -2.226471 3.002270e-02
GENETP50    -2.293952  1.0112339 -2.268469 2.717540e-02
> tmp = dta # Temporary dataset similar to dta
> tmp$GENET = relevel(dta$GENET,"P25")
> # tmp$GENET = dta$GENET except that the reference level is P25
> tmp.lm = lm(BFAT ~ GENET,data=tmp)
> summary(tmp.lm)$coefficients # P25 vs P50
               Estimate Std. Error     t value     Pr(>|t|)
(Intercept) 12.95434783  0.6237230 20.76939415 5.504528e-28
GENETP0      2.21031884  0.9927454  2.22647094 3.002270e-02
GENETP50    -0.08363354  0.9028351 -0.09263435 9.265247e-01
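The three pairwise p-values printed above can be compared with the Sidak-corrected level rather than with 0.05. This sketch copies the p-values from the two coefficient tables; the labels and variable names are illustrative.

```r
# Pairwise p-values read from the coefficient tables
p_values <- c("P0 vs P25"  = 3.002270e-02,
              "P0 vs P50"  = 2.717540e-02,
              "P25 vs P50" = 9.265247e-01)
alpha <- 0.05
m <- length(p_values)                    # number of simultaneous tests
alpha_star <- 1 - (1 - alpha)^(1 / m)    # Sidak-corrected level
significant <- p_values < alpha_star
print(alpha_star)
print(significant)
```

With these values, the two comparisons involving P0 fall below 0.05 but not below the corrected level.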
Paired data
'Toy' case study: 2 food products rated by 3 observers on a 'liking' scale from 1 ('dislike') to 10 ('like').
           Observers
Products   J1  J2  J3
A           2   3   6
B           4   6   8
Definition
Let $Y_{ij}$ denote the response value measured for the j-th sampling item, 1 ≤ j ≤ J, in the i-th group, 1 ≤ i ≤ I.
If the j-th sampling item is the same in all groups, then the data are said to be paired.
Paired data
R script
> toydta = data.frame(Observer=rep(c("J1","J2","J3"),rep(2,3)),
+ Product=rep(c("A","B"),3),Rating=c(2,4,3,6,6,8))
> toydta
  Observer Product Rating
1       J1       A      2
2       J1       B      4
3       J2       A      3
4       J2       B      6
5       J3       A      6
6       J3       B      8
Paired data
F-test for the 'product' effect on 'rating'
R script
> toydta.lm1 = lm(Rating ~ Product,data=toydta)
> anova(toydta.lm1)
Analysis of Variance Table

Response: Rating
          Df  Sum Sq Mean Sq F value Pr(>F)
Product    1  8.1667  8.1667    1.96 0.2341
Residuals  4 16.6667  4.1667
Paired data
Can we seriously declare that there is no statistical evidence of a 'product' effect?
           Observers
Products   J1  J2  J3
A           2   3   6
B           4   6   8
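Looking at the ratings observer by observer makes the question concrete. The sketch below (illustrative variable names) computes the difference B − A within each observer:

```r
# Ratings of the toy case study: rows are products, columns are observers
ratings <- rbind(A = c(J1 = 2, J2 = 3, J3 = 6),
                 B = c(J1 = 4, J2 = 6, J3 = 8))
# Within-observer differences between the two products
diffs <- ratings["B", ] - ratings["A", ]
print(diffs)
print(mean(diffs))
```

Every observer rates B above A, although the observers differ a lot in their overall rating level.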
A 2-way analysis of variance model for paired data
Definition
Let $Y_{ij}$ denote the response value measured for the j-th sampling item, 1 ≤ j ≤ J, in the i-th group, 1 ≤ i ≤ I:
$$Y_{ij} = \mu + \alpha_i + \beta_j + e_{ij},$$
where
• $\alpha_i$, i = 1, ..., I, are the group effect parameters,
• $\beta_j$, j = 1, ..., J, are the 'individual' effect parameters.
The residual error $e_{ij} \sim \mathcal{N}(0; \sigma)$.
Remark: the 1-way ANOVA model is a submodel of the 2-way ANOVA model: $\varepsilon_{ij} = \beta_j + e_{ij}$.
Fitting of the two-way analysis of variance model
Minimization of the least-squares criterion:
$$\sum_{i=1}^{I}\sum_{j=1}^{J}(Y_{ij} - \hat{\mu} - \hat{\alpha}_i - \hat{\beta}_j)^2 = \min_{\mu,\alpha_i,\beta_j}\sum_{i=1}^{I}\sum_{j=1}^{J}(Y_{ij} - \mu - \alpha_i - \beta_j)^2.$$
Fitting of the two-way analysis of variance model
Equating to zero the partial derivatives of the least-squares criterion with respect to µ, $\alpha_i$, i = 2, ..., I, and $\beta_j$, j = 2, ..., J:
$$-2\sum_{i=1}^{I}\sum_{j=1}^{J}(Y_{ij} - \hat{\mu} - \hat{\alpha}_i - \hat{\beta}_j) = 0,$$
$$-2\sum_{j=1}^{J}(Y_{ij} - \hat{\mu} - \hat{\alpha}_i - \hat{\beta}_j) = 0, \quad \text{for } i = 2, \ldots, I,$$
$$-2\sum_{i=1}^{I}(Y_{ij} - \hat{\mu} - \hat{\alpha}_i - \hat{\beta}_j) = 0, \quad \text{for } j = 2, \ldots, J.$$
Fitting of the two-way analysis of variance model
Developing the sums in each equation and dividing by the number of summands:
$$\bar{Y}_{\bullet\bullet} - \hat{\mu} - \sum_{i=1}^{I}\hat{\alpha}_i/I - \sum_{j=1}^{J}\hat{\beta}_j/J = 0,$$
$$\bar{Y}_{i\bullet} - \hat{\mu} - \hat{\alpha}_i - \sum_{j=1}^{J}\hat{\beta}_j/J = 0, \quad \text{for } i = 2, \ldots, I,$$
$$\bar{Y}_{\bullet j} - \hat{\mu} - \sum_{i=1}^{I}\hat{\alpha}_i/I - \hat{\beta}_j = 0, \quad \text{for } j = 2, \ldots, J,$$
where $\bar{Y}_{\bullet j} = \sum_{i=1}^{I} Y_{ij}/I$.
Fitting of the two-way analysis of variance model
It is deduced that $\hat{\mu} = \bar{Y}_{\bullet\bullet} - \sum_{i=1}^{I}\hat{\alpha}_i/I - \sum_{j=1}^{J}\hat{\beta}_j/J$.
Plugging $\hat{\mu}$ into the remaining equations:
$$\hat{\alpha}_i - \sum_{i=1}^{I}\hat{\alpha}_i/I = \bar{Y}_{i\bullet} - \bar{Y}_{\bullet\bullet}, \quad \text{for } i = 2, \ldots, I,$$
$$\hat{\beta}_j - \sum_{j=1}^{J}\hat{\beta}_j/J = \bar{Y}_{\bullet j} - \bar{Y}_{\bullet\bullet}, \quad \text{for } j = 2, \ldots, J.$$
Fitting of the two-way analysis of variance model
Summing the (I − 1) first (respectively (J − 1) last) equations above gives:
• $\sum_{i=1}^{I}\hat{\alpha}_i/I = -(\bar{Y}_{1\bullet} - \bar{Y}_{\bullet\bullet})$
• $\sum_{j=1}^{J}\hat{\beta}_j/J = -(\bar{Y}_{\bullet 1} - \bar{Y}_{\bullet\bullet})$.
Therefore,
• $\hat{\alpha}_i = \bar{Y}_{i\bullet} - \bar{Y}_{1\bullet}$, for i = 1, ..., I.
• $\hat{\beta}_j = \bar{Y}_{\bullet j} - \bar{Y}_{\bullet 1}$, for j = 1, ..., J.
Fitting of the two-way analysis of variance model
R script
> toydta.lm2 = lm(Rating ~ Product+Observer,data=toydta)
> # Extract the least-squares estimation of coefficients
> coef(toydta.lm2)
(Intercept)    ProductB  ObserverJ2  ObserverJ3
   1.833333    2.333333    1.500000    4.000000
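The coefficients returned by `lm` can be compared with the closed-form expressions: each effect is a difference between a group mean and the mean of the reference group, and the intercept follows from the expression of the estimated µ, which simplifies to the mean of product A plus the mean of observer J1 minus the grand mean. The sketch below rebuilds the toy data so that it runs on its own; the variable names are illustrative.

```r
toydta <- data.frame(Observer = rep(c("J1", "J2", "J3"), each = 2),
                     Product  = rep(c("A", "B"), 3),
                     Rating   = c(2, 4, 3, 6, 6, 8))
# Group means as plain named vectors
prod_means <- sapply(split(toydta$Rating, toydta$Product), mean)
obs_means  <- sapply(split(toydta$Rating, toydta$Observer), mean)
grand_mean <- mean(toydta$Rating)

# Closed-form least-squares estimates (reference levels: A and J1)
mu_hat     <- prod_means[["A"]] + obs_means[["J1"]] - grand_mean
alpha_hat  <- prod_means - prod_means[["A"]]
beta_hat   <- obs_means - obs_means[["J1"]]
closed_form <- c(mu_hat, alpha_hat[["B"]], beta_hat[["J2"]], beta_hat[["J3"]])

fit <- coef(lm(Rating ~ Product + Observer, data = toydta))
print(rbind(closed_form = closed_form, lm = unname(fit)))
```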
Fitting of the two-way analysis of variance model
Residuals: $\hat{e}_{ij} = Y_{ij} - \hat{\mu} - \hat{\alpha}_i - \hat{\beta}_j = Y_{ij} - \bar{Y}_{i\bullet} - \bar{Y}_{\bullet j} + \bar{Y}_{\bullet\bullet}$.
Estimation of the residual variance σ²:
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J}(Y_{ij} - \bar{Y}_{i\bullet} - \bar{Y}_{\bullet j} + \bar{Y}_{\bullet\bullet})^2}{(I - 1)(J - 1)}.$$
R script
> summary(toydta.lm2)$sigma
[1] 0.4082483
> summary(toydta.lm1)$sigma
[1] 2.041241
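Two-way analysis of variance F-test
Let us start from the one-way ANOVA equation:
$$\sum_{i=1}^{I}\sum_{j=1}^{J}(Y_{ij} - \bar{Y}_{\bullet\bullet})^2 = \sum_{i=1}^{I} J(\bar{Y}_{i\bullet} - \bar{Y}_{\bullet\bullet})^2 + \sum_{i=1}^{I}\sum_{j=1}^{J}(Y_{ij} - \bar{Y}_{i\bullet})^2$$
$$= \sum_{i=1}^{I} J(\bar{Y}_{i\bullet} - \bar{Y}_{\bullet\bullet})^2 + \sum_{j=1}^{J} I(\bar{Y}_{\bullet j} - \bar{Y}_{\bullet\bullet})^2 + \sum_{i=1}^{I}\sum_{j=1}^{J}(Y_{ij} - \bar{Y}_{i\bullet} - \bar{Y}_{\bullet j} + \bar{Y}_{\bullet\bullet})^2.$$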
Two-way analysis of variance F-test
R script
> toydta.lm2 = lm(Rating ~ Product+Observer,data=toydta)
> anova(toydta.lm2)
Analysis of Variance Table

Response: Rating
          Df  Sum Sq Mean Sq F value Pr(>F)
Product    1  8.1667  8.1667      49 0.0198
Observer   2 16.3333  8.1667      49 0.0200
Residuals  2  0.3333  0.1667
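The decomposition of the total sum of squares into product, observer and residual parts can be reproduced directly from the 2 × 3 table of ratings. The sketch below (illustrative variable names) also recovers the residual standard deviation and the F-statistic of the product effect:

```r
# Ratings: rows are products A and B, columns are observers J1, J2, J3
y <- rbind(A = c(2, 3, 6), B = c(4, 6, 8))
n_prod <- nrow(y); n_obs <- ncol(y)
grand <- mean(y)
row_m <- rowMeans(y); col_m <- colMeans(y)

ss_total    <- sum((y - grand)^2)
ss_product  <- n_obs * sum((row_m - grand)^2)
ss_observer <- n_prod * sum((col_m - grand)^2)
# Residuals: Y_ij - rowmean_i - colmean_j + grand mean
resid <- y - outer(row_m, col_m, "+") + grand
ss_resid <- sum(resid^2)

df_resid  <- (n_prod - 1) * (n_obs - 1)
sigma_hat <- sqrt(ss_resid / df_resid)
f_product <- (ss_product / (n_prod - 1)) / (ss_resid / df_resid)
print(round(c(ss_product = ss_product, ss_observer = ss_observer,
              ss_resid = ss_resid, sigma_hat = sigma_hat,
              f_product = f_product), 4))
```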
Two-way analysis of variance F-test
Equivalent paired t-test for a two-group comparison:
R script
> # Paired t-test
> t.test(Rating ~ Product,data=toydta,paired=TRUE)
        Paired t-test
data:  Rating by Product
t = -7, df = 2, p-value = 0.0198
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.7675509 -0.8991158
sample estimates:
mean of the differences
              -2.333333
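The paired t-statistic is a one-sample t-statistic computed on the within-observer differences, and its square equals the F-statistic of the product effect in the two-way ANOVA table. The sketch below uses the two-vector interface of `t.test` as a cross-check, which avoids relying on the formula interface with `paired=TRUE` (recent R versions may reject that combination); variable names are illustrative.

```r
rating_a <- c(2, 3, 6)    # product A, observers J1, J2, J3
rating_b <- c(4, 6, 8)    # product B, same observers
d <- rating_a - rating_b  # within-observer differences

# One-sample t-statistic on the differences
t_paired <- mean(d) / (sd(d) / sqrt(length(d)))
p_paired <- 2 * pt(abs(t_paired), df = length(d) - 1, lower.tail = FALSE)

# Cross-check with t.test on the two paired vectors
check <- t.test(rating_a, rating_b, paired = TRUE)
print(c(t = t_paired, t_squared = t_paired^2, p_value = p_paired))
```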
Outline
1 Effect at a population level
2 Decision making procedure
3 Testing for a group effect
   Exploring for a group effect
   One-way analysis-of-variance model
   Least-squares estimation of effect parameters
   F-test
   The special case of a two-level factor: t-test
   Detailing a significant group effect
   Testing a group effect using paired data
4 Linear effect
   Linearity of an effect
   Linear regression modelling
   Least-squares fitting
   F-test
   Comparing regression lines
Prediction of the Lean Meat Percentage
R script
> # Scatterplot of LMP against backfat depth
> with(dta,plot(BFAT,LMP,bty="l",xlab="Backfat depth (mm)",
+ ylab="LMP",cex.lab=1.25,pch=16,
+ main="Effect of backfat depth on Lean Meat Percentage"))
Prediction of the Lean Meat Percentage
[Figure: scatterplot "Effect of backfat depth on Lean Meat Percentage"; x-axis: Backfat depth (mm), 10 to 20; y-axis: LMP, 50 to 65.]
Prediction of the Lean Meat Percentage
R script
> # Assigns each item to a class of the partition 7<9<11<...<23
> cutbfat = cut(dta$BFAT,breaks=seq(from=7,to=23,by=2))
> xcenters = seq(from=8,to=22,by=2) # Centers of the classes
> # Means of LMP in each class of the partition
> ymeans = tapply(dta$LMP,cutbfat,mean)
> # Empirical effect curve
> lines(xcenters,ymeans,lwd=2,type="b",pch=16,
+ col="blue",cex=2)
Prediction of the Lean Meat Percentage
[Figure: scatterplot "Effect of backfat depth on Lean Meat Percentage" with the empirical effect curve (class means of LMP, in blue) superimposed; x-axis: Backfat depth (mm), 10 to 20; y-axis: LMP, 50 to 65.]
Correlation
The statistical evidence of a linear relationship should depend
• neither on the position
• nor on the dispersion
of the marginal distributions of X and Y.
... so it can be assessed on the scaled series $\tilde{x}$ and $\tilde{y}$.
Definition
Let $(x_1, \ldots, x_n)$ be a series of numeric values, with empirical mean $\bar{x}$ and standard deviation $s_x$. The scaled values $\tilde{x}_i$ are obtained by subtracting the mean and dividing by the standard deviation:
$$\tilde{x}_i = \frac{x_i - \bar{x}}{s_x}.$$
Correlation
R script
> bfat.scaled = scale(dta$BFAT)
> lmp.scaled = scale(dta$LMP)
> mean(bfat.scaled);mean(lmp.scaled)
[1] 2.366597e-16
[1] 5.855451e-16
> sd(bfat.scaled);sd(lmp.scaled)
[1] 1
[1] 1
Correlation
Scaled values can be used to identify extreme values:
R script
> # Identification of 'extreme' backfat values
> extrem.bfat = which(abs(bfat.scaled)>1.96)
> # Display of the 'outliers'
> data.frame(WHICH=extrem.bfat,
+ BFAT=bfat.scaled[extrem.bfat],LMP=lmp.scaled[extrem.bfat])
  WHICH      BFAT       LMP
1     6  2.817719 -1.089538
2    43 -2.062037  1.455189
3    48  2.167192 -2.437327
Correlation
Definition
The correlation coefficient $r_{xy}$ is the mean cross-product of the scaled values:
$$r_{xy} = \frac{\sum_{i=1}^{n}\tilde{x}_i\tilde{y}_i}{n - 1}.$$
Equivalently,
$$r_{xy} = \frac{1}{n - 1}\,\frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{s_x s_y} = \frac{s_{xy}}{s_x s_y},$$
where $s_{xy}$ is the covariance of the series:
$$s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}.$$
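The equivalence between the two expressions can be checked on any pair of numeric series. The values below are made up for illustration (they are not taken from the pig dataset):

```r
# Hypothetical small series, for illustration only
x <- c(8.2, 10.5, 12.1, 14.8, 17.3, 19.6)
y <- c(63.1, 61.4, 59.8, 57.2, 55.9, 52.3)
n <- length(x)

# Mean cross-product of the scaled values
x_scaled <- (x - mean(x)) / sd(x)
y_scaled <- (y - mean(y)) / sd(y)
r_cross <- sum(x_scaled * y_scaled) / (n - 1)

# Covariance divided by the product of standard deviations
r_cov <- cov(x, y) / (sd(x) * sd(y))
print(c(r_cross = r_cross, r_cov = r_cov, cor = cor(x, y)))
```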
Correlation
$r_{xy}$ can be interpreted as follows:
• $r_{xy} \approx 1$ can be a good indicator of a clear and increasing linear relationship between X and Y.
• $r_{xy} \approx -1$ can be a good indicator of a clear and decreasing linear relationship between X and Y.
• $r_{xy} \approx 0$ can be a good indicator of an absence of a linear relationship between X and Y.
$r_{xy}$ should complement the visual impression deduced from a scatterplot.
Correlation
R script
> # Calculation of the correlation coefficient between
> # backfat depth and LMP
> cor(dta$BFAT,dta$LMP)
[1] -0.7770074
Linear regression
Relationship between the LMP (Y) and the backfat depth (X): the way E(Y | X = x) depends on x can be assumed to be well described by a linear function of x.
Definition
It is assumed that, given X = x, Y is normally distributed, with the same standard deviation for all x.
There is a linear effect of X on Y if the conditional mean of Y given X = x is a linear function of x:
$$E(Y \mid X = x) = \beta_0 + \beta_1 x,$$
where $\beta_0$ is the intercept parameter and $\beta_1$ the slope parameter. The above model is named the simple linear regression model.
Linear regression
Why regression?
The statistical meaning of regression is inherited from
Francis Galton (1886). "Regression towards mediocrity in hereditary stature". The Journal of the Anthropological Institute of Great Britain and Ireland, Vol. 15, 246–263,
who aimed at understanding the heritability of the phenotype height in humans.
Linear regression
[Figure: scatterplot of Galton (1886)'s data; x-axis: Height of mid-parent (inches), 60 to 75; y-axis: Height of child (inches), 60 to 75.]
Linear regression
Regression models are models for the conditional expectation of Y, given a profile of p explanatory variables $x = (x_1, \ldots, x_p)'$:
$$E(Y \mid X = x) = f(x),$$
where f gives its specific shape to the regression model.
The simple linear regression model is a particular case:
• x is restricted to only one variable (hence simple);
• f(x) has the specific form of a straight line;
• the conditional distribution of Y given X = x is normal;
• the conditional variance of Y given X = x is constant.
Alternatively,
$$Y = \beta_0 + \beta_1 x + \varepsilon,$$
where $\varepsilon = Y - E(Y \mid X = x) = Y - \beta_0 - \beta_1 x$ is called the residual term.
Given $X = x$, $\varepsilon$ is normally distributed with
• $E(\varepsilon \mid X = x) = 0$;
• $\mathrm{Var}(\varepsilon \mid X = x) = \sigma^2$.
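To make the model concrete, the sketch below simulates data from it and checks that a least-squares fit recovers the parameters. The values of `beta0`, `beta1`, `sigma` and the range of `x` are hypothetical, chosen only for illustration; they are not taken from the course dataset.

```r
# Hypothetical parameter values, chosen only for illustration
set.seed(1)
n     <- 200
beta0 <- 70      # intercept
beta1 <- -0.8    # slope
sigma <- 2       # residual standard deviation

x   <- runif(n, min = 8, max = 22)       # values of the explanatory variable
eps <- rnorm(n, mean = 0, sd = sigma)    # residual term, N(0; sigma), independent of x
y   <- beta0 + beta1 * x + eps           # response generated by the model

# The fitted coefficients should be close to (beta0, beta1)
sim.lm <- lm(y ~ x)
coef(sim.lm)
```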

Least-squares estimation
Data: $(x_i, y_i)_{i=1,\ldots,n}$.
Minimization of the least-squares criterion $SS(\beta)$:
$$SS(\beta) = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 x_i)^2.$$
The least-squares estimators $\hat\beta_0$ and $\hat\beta_1$ are the minimizers of $SS(\beta)$.
For a sampling item with $X = x$, the fitted response value is $\hat{Y} = \hat\beta_0 + \hat\beta_1 x$.
Analogously, the fitted regression line is the line with equation $x \mapsto \hat\beta_0 + \hat\beta_1 x$: the line closest to the data.

Closed-form expression for least-squares estimators
Estimating equations:
$$\begin{cases} \dfrac{\partial SS}{\partial \beta_0}(\hat\beta) = -2 \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0 \\ \dfrac{\partial SS}{\partial \beta_1}(\hat\beta) = -2 \sum_{i=1}^{n} x_i (Y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0 \end{cases}$$
Dividing the first estimating equation by $n$:
$$\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{x}.$$
In other words, the "mean" individual, with coordinates $(\bar{x}, \bar{y})$, lies on the fitted regression line.
Plugging this expression of $\hat\beta_0$ into the second estimating equation:
$$\sum_{i=1}^{n} x_i (Y_i - \bar{Y} - \hat\beta_1 [x_i - \bar{x}]) = 0,$$
or equivalently, dividing by $n - 1$,
$$s_x^2 \hat\beta_1 = s_{xy}, \qquad \hat\beta_1 = \frac{s_{xy}}{s_x^2},$$
where $s_x^2$ is the sample variance of $x$ and $s_{xy}$ the sample covariance of $x$ and $Y$.
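The closed-form expressions can be checked numerically. The sketch below uses R's built-in `cars` data (columns `speed` and `dist`) as a stand-in for the course dataset, which is an assumption made only so that the example runs on its own.

```r
# Built-in 'cars' data used as a stand-in for the course dataset
x <- cars$speed
y <- cars$dist

b1 <- cov(x, y) / var(x)        # slope: s_xy / s_x^2
b0 <- mean(y) - b1 * mean(x)    # intercept: ybar - b1 * xbar

# Same estimates as lm(), up to numerical rounding
cars.lm <- lm(dist ~ speed, data = cars)
rbind(closed.form = c(b0, b1), lm = coef(cars.lm))
```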
R script
> # Least-squares fit of the regression model
> lmp.lm = lm(LMP ~ BFAT,data=dta)
> # Extract estimated coefficients
> beta = coef(lmp.lm)
> beta
(Intercept)        BFAT
 71.0607926  -0.8631092
> # Adds the regression line to the scatterplot
> abline(beta,lwd=2)
> # Adds a legend to the plot
> legend("bottomleft",lwd=2,bty="n",col=c("blue","black"),
+ legend=c("Empirical effect curve","Least-squares linear fit"))
[Figure: Effect of backfat depth on Lean Meat Percentage. Scatterplot of LMP against backfat depth (mm), with the empirical effect curve and the least-squares linear fit.]

Residual standard deviation
The residuals $\hat\varepsilon_i = Y_i - \hat\beta_0 - \hat\beta_1 x_i$ are linearly dependent, since they satisfy the two estimating equations:
$$\begin{cases} \hat\varepsilon_1 + \ldots + \hat\varepsilon_n = 0 \\ x_1 \hat\varepsilon_1 + \ldots + x_n \hat\varepsilon_n = 0 \end{cases}$$
Degrees-of-freedom corrected sample variance $\hat\sigma^2$:
$$\hat\sigma^2 = \frac{\sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 x_i)^2}{n - 2}.$$
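As a quick numerical check of the two constraints and of the $n - 2$ correction, here is a minimal sketch on the built-in `cars` data (an illustrative stand-in, not the course dataset):

```r
cars.lm <- lm(dist ~ speed, data = cars)   # built-in data, illustration only
res <- residuals(cars.lm)
n   <- nrow(cars)

# The two linear constraints on the residuals (zero up to rounding error)
sum(res)
sum(cars$speed * res)

# Residual variance with n - 2 degrees of freedom
sigma2.hat <- sum(res^2) / (n - 2)
c(manual = sqrt(sigma2.hat), lm = summary(cars.lm)$sigma)
```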

Confidence intervals for the regression parameters
$\hat\beta_0$ and $\hat\beta_1$ are linear combinations of the $Y_i$:
$$\hat\beta_1 = \frac{1}{n-1} \sum_{i=1}^{n} \frac{x_i - \bar{x}}{s_x^2} Y_i.$$
As a linear combination of the normally and independently distributed observations $Y_i$, $\hat\beta_1$ is itself normally distributed, with:
$$\begin{aligned} E(\hat\beta_1 \mid X = x) &= \frac{1}{n-1} \sum_{i=1}^{n} \frac{x_i - \bar{x}}{s_x^2} E(Y_i \mid X_i = x_i) \\ &= \beta_0 \left[ \frac{1}{n-1} \sum_{i=1}^{n} \frac{x_i - \bar{x}}{s_x^2} \right] + \beta_1 \left[ \frac{1}{n-1} \sum_{i=1}^{n} \frac{(x_i - \bar{x}) x_i}{s_x^2} \right] \\ &= \beta_1. \end{aligned}$$
The variance of $\hat\beta_1$ follows from the independence of the $Y_i$:
$$\begin{aligned} \mathrm{Var}(\hat\beta_1 \mid X = x) &= \frac{1}{(n-1)^2} \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{s_x^4} \mathrm{Var}(Y_i \mid X_i = x_i) \\ &= \frac{\sigma^2}{(n-1)^2} \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{s_x^4} \\ &= \frac{\sigma^2}{n-1} \frac{1}{s_x^2}. \end{aligned}$$
Therefore, given $X = x$,
$$\hat\beta_1 - \beta_1 \sim \mathcal{N}\left(0; \frac{\sigma}{\sqrt{n-1}} \frac{1}{s_x}\right).$$
The estimation accuracy is favored by
• a small $\sigma$, or equivalently a good fit of the regression model to the data,
• a large sample size $n$,
• a large dispersion of the values of $x$.
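To sum up: $\hat\beta_0$ and $\hat\beta_1$ are normally distributed with means $\beta_0$ and $\beta_1$ respectively and standard deviations $\sigma_{\hat\beta_0}$ and $\sigma_{\hat\beta_1}$ respectively:
$$\sigma^2_{\hat\beta_0} = \frac{\sigma^2}{n-1} \left[ \frac{n-1}{n} + \frac{\bar{x}^2}{s_x^2} \right], \qquad \sigma^2_{\hat\beta_1} = \frac{\sigma^2}{n-1} \frac{1}{s_x^2}.$$
Therefore, the following confidence intervals, with confidence level $1 - \alpha$, are deduced for $\beta_0$ and $\beta_1$: for $j = 0$ or $j = 1$,
$$CI_{1-\alpha}(\beta_j) = \left[ \hat\beta_j - t^{(n-2)}_{1-\alpha/2} \hat\sigma_{\hat\beta_j} ; \hat\beta_j + t^{(n-2)}_{1-\alpha/2} \hat\sigma_{\hat\beta_j} \right],$$
where $\hat\sigma_{\hat\beta_j}$ is obtained by plugging the estimator $\hat\sigma^2$ of $\sigma^2$ into $\sigma_{\hat\beta_j}$, and $t^{(n-2)}_{1-\alpha/2}$ is the quantile of order $1-\alpha/2$ of the Student distribution with $n-2$ degrees of freedom.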
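The intervals can be rebuilt by hand and compared with `confint()`. The sketch below uses the built-in `cars` data as an illustrative stand-in, and takes the standard deviations as $\hat\sigma \sqrt{1/n + \bar{x}^2 / ((n-1) s_x^2)}$ for the intercept and $\hat\sigma / (\sqrt{n-1}\, s_x)$ for the slope.

```r
cars.lm <- lm(dist ~ speed, data = cars)   # built-in data, illustration only
x <- cars$speed
n <- length(x)
alpha <- 0.05

sigma.hat <- summary(cars.lm)$sigma
# Plug-in standard deviations of the two estimators
se.b0 <- sigma.hat / sqrt(n - 1) * sqrt((n - 1) / n + mean(x)^2 / var(x))
se.b1 <- sigma.hat / sqrt(n - 1) / sd(x)

tq <- qt(1 - alpha / 2, df = n - 2)         # Student quantile
ci <- cbind(lower = coef(cars.lm) - tq * c(se.b0, se.b1),
            upper = coef(cars.lm) + tq * c(se.b0, se.b1))
ci
confint(cars.lm, level = 1 - alpha)         # should agree with ci
```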
R script
> # Confidence intervals for the regression coefficients
> # level = 0.95 (default) sets the confidence level at 0.95
> cbind(coef(lmp.lm),confint(lmp.lm,level=0.95))
                            2.5 %     97.5 %
(Intercept) 71.0607926 68.529254 73.5923316
BFAT        -0.8631092 -1.046898 -0.6793204

Confidence band for the regression line
The confidence band of the regression line at level $1-\alpha$, denoted $CB_{1-\alpha}(\beta)$, is the following family of confidence intervals:
$$CB_{1-\alpha}(\beta) = \left\{ CI_{1-\alpha}(\beta_0 + \beta_1 x); \text{ for all } x \right\},$$
where $CI_{1-\alpha}(\beta_0 + \beta_1 x)$ is a confidence interval at level $1-\alpha$ for $\beta_0 + \beta_1 x$.
Let us start with the estimation of $\beta_0 + \beta_1 x$:
$$\begin{aligned} \hat{Y} = \hat\beta_0 + \hat\beta_1 x &= \bar{Y} + \hat\beta_1 (x - \bar{x}) = \bar{Y} + \frac{x - \bar{x}}{s_x^2} s_{xy} \\ &= \frac{1}{n} \sum_{i=1}^{n} \left[ 1 + \frac{n}{n-1} \frac{(x - \bar{x})(x_i - \bar{x})}{s_x^2} \right] Y_i \\ &= \sum_{i=1}^{n} h_i(x) Y_i, \quad \text{where } h_i(x) = \frac{1}{n} + \frac{x - \bar{x}}{n-1} \frac{x_i - \bar{x}}{s_x^2}. \end{aligned}$$
Note:
• $\sum_{i=1}^{n} h_i(x) = 1$;
• $\sum_{i=1}^{n} h_i(x) x_i = x$.
As a linear combination of the $Y_i$, $\hat{Y}$ is normally distributed, with:
$$\begin{aligned} E(\hat{Y} \mid X = x) &= \sum_{i=1}^{n} h_i(x) E(Y_i \mid X_i = x_i) \\ &= \beta_0 \left[ \sum_{i=1}^{n} h_i(x) \right] + \beta_1 \left[ \sum_{i=1}^{n} h_i(x) x_i \right] \\ &= \beta_0 + \beta_1 x. \end{aligned}$$
The variance of $\hat{Y}$ follows from the independence of the $Y_i$:
$$\begin{aligned} \mathrm{Var}(\hat{Y} \mid X = x) &= \sum_{i=1}^{n} h_i^2(x) \mathrm{Var}(Y_i \mid X_i = x_i) \\ &= \sigma^2 \sum_{i=1}^{n} h_i^2(x) = \frac{\sigma^2}{n} \left[ 1 + \frac{n}{n-1} \left( \frac{x - \bar{x}}{s_x} \right)^2 \right]. \end{aligned}$$
It is deduced from the above results that
$$CI_{1-\alpha}(\beta_0 + \beta_1 x) = \left[ \hat{Y} - t^{(n-2)}_{1-\alpha/2} \frac{\hat\sigma}{\sqrt{n}} \sqrt{1 + \frac{n}{n-1} \left( \frac{x - \bar{x}}{s_x} \right)^2} ; \hat{Y} + t^{(n-2)}_{1-\alpha/2} \frac{\hat\sigma}{\sqrt{n}} \sqrt{1 + \frac{n}{n-1} \left( \frac{x - \bar{x}}{s_x} \right)^2} \right].$$
Whatever the model adequacy, the estimation accuracy can be improved arbitrarily, provided the sample size can be increased without limit.
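The interval for $\beta_0 + \beta_1 x$ can be computed directly from this formula and compared with `predict(..., interval = "confidence")`. This is a minimal sketch on the built-in `cars` data (a stand-in for the course dataset); the grid `x0` is arbitrary.

```r
cars.lm <- lm(dist ~ speed, data = cars)   # built-in data, illustration only
x <- cars$speed
n <- length(x)
x0 <- seq(min(x), max(x), length.out = 5)  # a few arbitrary x values

fit <- coef(cars.lm)[1] + coef(cars.lm)[2] * x0
# Half-width of the interval for beta0 + beta1 * x0
half <- qt(0.975, df = n - 2) * summary(cars.lm)$sigma / sqrt(n) *
  sqrt(1 + n / (n - 1) * ((x0 - mean(x)) / sd(x))^2)
manual <- cbind(fit = fit, lwr = fit - half, upr = fit + half)

pred <- predict(cars.lm, newdata = data.frame(speed = x0),
                interval = "confidence", level = 0.95)
max(abs(manual - pred))   # difference due to rounding only
```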
R script
> # Defines a high resolution sequence of x values
> x = seq(from=6,to=23,length.out=1000)
> # Calculates predictions for the sequence of x values
> # with 95%-confidence intervals for the regression line
> pred = predict(lmp.lm,newdata=data.frame(BFAT=x),interval="confidence")
> # pred has 3 columns: fitted values (fit), lower (lwr) and upper (upr) limits
> # Plots the upper limit of the confidence band
> plot(x,pred[,"upr"],type="l",xlab="Backfat depth",ylab="LMP",ylim=c(48,68))
> lines(x,pred[,"lwr"]) # Adds the lower limit of the confidence band
> # Shades the confidence band
> polygon(c(x,rev(x)),c(pred[,"upr"],rev(pred[,"lwr"])),col="gray95")
> lines(x,pred[,"fit"],lwd=2) # Adds the regression line
[Figure: Confidence band for the regression line. LMP against backfat depth.]

Confidence band for prediction
Whatever the estimation accuracy, the prediction accuracy also has to account for the model adequacy.
Let $Y^\star$ be the unobserved response value of an item for which $X^\star = x^\star$. The corresponding prediction is $\hat{Y}^\star = \hat\beta_0 + \hat\beta_1 x^\star$, with
$$\begin{aligned} \mathrm{Var}(Y^\star - \hat{Y}^\star \mid X = x, X^\star = x^\star) &= \mathrm{Var}(Y^\star \mid X^\star = x^\star) + \mathrm{Var}(\hat{Y}^\star \mid X = x) \\ &= \sigma^2 + \frac{\sigma^2}{n} \left[ 1 + \frac{n}{n-1} \left( \frac{x^\star - \bar{x}}{s_x} \right)^2 \right]. \end{aligned}$$
Therefore, the lowest variance of the prediction error is reached for $x^\star = \bar{x}$ and equals $\sigma^2 + \sigma^2/n$.
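The variance of the prediction error translates into prediction limits in the same way as for the regression line. The sketch below, again on the built-in `cars` data as an assumed stand-in, rebuilds them by hand; the values in `x.star` are arbitrary, with the sample mean included to show where the variance is lowest.

```r
cars.lm <- lm(dist ~ speed, data = cars)   # built-in data, illustration only
x <- cars$speed
n <- length(x)
x.star <- c(10, mean(x), 20)               # arbitrary new x values

sigma.hat <- summary(cars.lm)$sigma
# Estimated variance of the prediction error: sigma^2 + Var(fitted value)
var.err <- sigma.hat^2 +
  sigma.hat^2 / n * (1 + n / (n - 1) * ((x.star - mean(x)) / sd(x))^2)

fit  <- predict(cars.lm, newdata = data.frame(speed = x.star))
half <- qt(0.975, df = n - 2) * sqrt(var.err)
manual <- cbind(fit = fit, lwr = fit - half, upr = fit + half)

pred <- predict(cars.lm, newdata = data.frame(speed = x.star),
                interval = "prediction", level = 0.95)
max(abs(manual - pred))   # difference due to rounding only
```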
R script
> # Calculates predictions for the sequence of x values
> # with 95%-confidence intervals for the predictions
> pred = predict(lmp.lm,newdata=data.frame(BFAT=x),interval="prediction")
> # Plots the limits of the confidence band
> lines(x,pred[,"upr"])
> lines(x,pred[,"lwr"])
> # Shades the confidence band
> color = adjustcolor("darkgray",alpha.f=0.3) # Creates a transparent gray
> polygon(c(x,rev(x)),c(pred[,"upr"],rev(pred[,"lwr"])),col=color)
[Figure: Confidence band for prediction. LMP against backfat depth.]

Measuring how strong is a linear effect
Yet another model comparison issue: to what extent does
• $\mathcal{M}: Y = \beta_0 + \beta_1 x + \varepsilon$, with $RSS = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$,
fit the data better than
• $\mathcal{M}_0: Y = \beta_0 + \varepsilon$, with $RSS_0 = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$?
ANOVA equation:
$$RSS_0 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + RSS.$$
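The ANOVA equation can be verified numerically. A minimal sketch, assuming the built-in `cars` data as a stand-in for the course dataset (the name `ESS` for the explained part is introduced here for illustration only):

```r
cars.lm <- lm(dist ~ speed, data = cars)   # built-in data, illustration only
y   <- cars$dist
fit <- fitted(cars.lm)

RSS0 <- sum((y - mean(y))^2)        # residual sum of squares of M0
RSS  <- sum((y - fit)^2)            # residual sum of squares of M
ESS  <- sum((fit - mean(y))^2)      # part of RSS0 explained by the linear effect

c(RSS0 = RSS0, ESS.plus.RSS = ESS + RSS)
```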
R script
> fit = fitted(lmp.lm) # Extracts fitted values
> plot(dta$LMP,fit,pch=16,xlim=c(48,67),ylim=c(48,67),
+ xlab="Observed LMP values",ylab="Fitted LMP values")
> abline(a=0,b=1,lwd=2,col="gray") # Adds the line y=x to the plot
[Figure: Fitted LMP values against observed LMP values, with the line y = x.]
The $R^2$ coefficient compares $RSS$ and $RSS_0$:
$$R^2 = \frac{RSS_0 - RSS}{RSS_0}.$$
• $0 \le R^2 \le 1$;
• $R^2 = 0$: absence of an effect of $x$;
• $R^2 = 1$: "perfect" effect of $x$.
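The definition can be applied directly to the two residual sums of squares. This short sketch uses the built-in `cars` data as an assumed stand-in for the course dataset:

```r
cars.lm <- lm(dist ~ speed, data = cars)            # built-in data, illustration only
RSS0 <- sum((cars$dist - mean(cars$dist))^2)        # model M0: intercept only
RSS  <- sum(residuals(cars.lm)^2)                   # model M: linear effect of speed

R2 <- (RSS0 - RSS) / RSS0
c(manual = R2, lm = summary(cars.lm)$r.squared)     # both values should agree
```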
R script
> # Extracts the R2
> summary(lmp.lm)$r.squared
[1] 0.6037405
Equivalently, the $R^2$ coefficient is the squared correlation between the observed and the fitted response values:
$$R^2 = r^2_{y,\hat{y}}.$$
R script
> cor(dta$LMP,fit)^2
[1] 0.6037405

Testing for a linear effect
The $R^2$ can be associated with a p-value for the following test:
$$\begin{cases} H_0: \mathcal{M} \text{ does not fit the data better than } \mathcal{M}_0 \\ H_1: \mathcal{M} \text{ fits the data better than } \mathcal{M}_0 \end{cases}$$
F-test for the significance of the relationship between $Y$ and $x$:
$$F = \frac{RSS_0 - RSS}{RSS/(n-2)}.$$
One degree of freedom for $RSS_0 - RSS = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$, since $\hat{Y}_i - \bar{Y} = \hat\beta_1 (x_i - \bar{x})$ is proportional to $x_i - \bar{x}$.
R script
> # Extracts the F-test statistic
> summary(lmp.lm)$fstatistic
   value    numdf    dendf
88.36873  1.00000 58.00000
Whether the effect is significant now depends on judging whether $F = 88.369$ is large with respect to its distribution under the null hypothesis.
This null distribution is a Fisher distribution $F_{1,n-2}$.
R script
> # Displays the complete ANOVA table
> anova(lmp.lm)
Analysis of Variance Table

Response: LMP
          Df Sum Sq Mean Sq F value    Pr(>F)
BFAT       1 425.90  425.90  88.369 2.915e-13
Residuals 58 279.53    4.82
ANOVA table:
• Df: degrees of freedom, respectively 1 and $n-2$;
• Sum Sq: sums of squares, respectively $RSS_0 - RSS$ and $RSS$;
• Mean Sq: mean squares, respectively $(RSS_0 - RSS)/1$ and $RSS/(n-2)$;
• F value: F-statistic, the ratio of the mean squares;
• Pr(>F): p-value, the probability that the F-statistic exceeds F value under the null hypothesis.
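Each entry of the ANOVA table can be recomputed from the two residual sums of squares. The sketch below does so on the built-in `cars` data (an assumed stand-in for the course dataset) and compares the result with `anova()`.

```r
cars.lm <- lm(dist ~ speed, data = cars)   # built-in data, illustration only
n    <- nrow(cars)
RSS0 <- sum((cars$dist - mean(cars$dist))^2)
RSS  <- sum(residuals(cars.lm)^2)

F.stat <- ((RSS0 - RSS) / 1) / (RSS / (n - 2))     # ratio of the mean squares
# p-value: upper tail probability of the Fisher distribution F(1, n - 2)
p.val  <- pf(F.stat, df1 = 1, df2 = n - 2, lower.tail = FALSE)

tab <- anova(cars.lm)
c(F.manual = F.stat, F.anova = tab[1, "F value"],
  p.manual = p.val,  p.anova = tab[1, "Pr(>F)"])
```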

Group-wise linear effects
Could the relationship between the LMP and the backfat depth be specific to the genetic type?
In a regression modeling framework, this issue introduces two explanatory variables:
• the backfat depth, a numeric covariate;
• the genetic type, a factor.
When the effect of the numeric covariate differs according to the level of the grouping variable, there is an interaction effect between the two explanatory variables.

Introducing a group effect in a regression model
Let $Y_{ij}$ stand for the response value of the $j$th sampling item, $j = 1, \ldots, n_i$, in the $i$th group, $i = 1, \ldots, I$, and $x_{ij}$ the corresponding value of the explanatory variable.
Regression model within the 1st group ("reference"):
$$Y_{1j} = \mu + \beta x_{1j} + \varepsilon_{1j}, \quad \varepsilon_{1j} \sim \mathcal{N}(0; \sigma).$$
Regression model within the $i$th group, with $i \neq 1$:
$$Y_{ij} = \mu + \alpha_i + (\beta + \gamma_i) x_{ij} + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim \mathcal{N}(0; \sigma),$$
where:
• $\alpha_2, \ldots, \alpha_I$ are the group effect parameters;
• $\gamma_2, \ldots, \gamma_I$ are the interaction effect parameters.
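One way to read this parameterization: the line of group $i$ has intercept $\mu + \alpha_i$ and slope $\beta + \gamma_i$, which is also the least-squares line fitted within that group alone. The sketch below illustrates this on the built-in `iris` data, where `Species` is assumed to play the role of the grouping factor; the dataset and variable choice are for illustration only.

```r
# Built-in 'iris' data: Species plays the role of the grouping factor
full.lm <- lm(Sepal.Length ~ Petal.Length * Species, data = iris)
b <- coef(full.lm)

# Line of the 2nd group (versicolor) rebuilt from mu, alpha_2, beta, gamma_2
int2   <- b["(Intercept)"]  + b["Speciesversicolor"]
slope2 <- b["Petal.Length"] + b["Petal.Length:Speciesversicolor"]

# Same line as a regression fitted within that group only
sub.lm <- lm(Sepal.Length ~ Petal.Length, data = iris,
             subset = Species == "versicolor")
rbind(interaction.model = c(int2, slope2), within.group = coef(sub.lm))
```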
Note that:
• the one-way analysis of variance model for the mean comparison of $Y$ across groups is a submodel, obtained with $\beta = 0$ and $\gamma_2 = \ldots = \gamma_I = 0$;
• the linear regression model for the study of the effect of $x$ on $Y$ is also a submodel, obtained with $\alpha_2 = \ldots = \alpha_I = 0$ and $\gamma_2 = \ldots = \gamma_I = 0$.
R script
> lmp.lm = lm(LMP ~ BFAT*GENET,data=dta)
> coef(lmp.lm)
  (Intercept)          BFAT      GENETP25      GENETP50
   80.2215328    -1.4575042    -9.7600733   -13.7187769
BFAT:GENETP25 BFAT:GENETP50
    0.6268832     0.9550714
Correspondence between the R output and the parameters:
Parameter    Name in R        Estimate
$\mu$        (Intercept)       80.2215328
$\beta$      BFAT              -1.4575042
$\alpha_2$   GENETP25          -9.7600733
$\alpha_3$   GENETP50         -13.7187769
$\gamma_2$   BFAT:GENETP25      0.6268832
$\gamma_3$   BFAT:GENETP50      0.9550714
Is the effect of the backfat depth on the LMP really
• the most obvious in genetic sub-population P0,
• less clear in sub-population P25,
• and even less clear in sub-population P50?
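The group-specific lines follow from the estimates by simple addition. The sketch below copies the estimates reported in the R output for the course dataset; the vector names (`mu`, `beta`, `alpha2`, ...) are labels introduced here for readability.

```r
# Estimates copied from the coef(lmp.lm) output of the course dataset
est <- c(mu = 80.2215328, beta = -1.4575042,
         alpha2 = -9.7600733, alpha3 = -13.7187769,
         gamma2 = 0.6268832, gamma3 = 0.9550714)

# Reference group P0 has no alpha or gamma term by construction
intercepts <- est["mu"]   + c(P0 = 0, P25 = est[["alpha2"]], P50 = est[["alpha3"]])
slopes     <- est["beta"] + c(P0 = 0, P25 = est[["gamma2"]], P50 = est[["gamma3"]])
rbind(intercept = intercepts, slope = slopes)
```

The slopes shrink in absolute value from P0 to P25 to P50, which is the pattern the question above refers to.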

Testing for group effect in a regression model
The corresponding model comparison issue: to what extent does
• $\mathcal{M}: Y_{ij} = \mu + \alpha_i + (\beta + \gamma_i) x_{ij} + \varepsilon_{ij}$
fit the data better than
• $\mathcal{M}_0: Y_{ij} = \mu + \alpha_i + \beta x_{ij} + \varepsilon_{ij}$?
R script
> lmp.lm0 = lm(LMP ~ BFAT+GENET,data=dta)
> coef(lmp.lm0)
(Intercept)        BFAT    GENETP25    GENETP50
71.33759632 -0.87167292 -0.34433658 -0.08245607
> sum(residuals(lmp.lm)^2) # RSS for the full model
[1] 229.4831
> sum(residuals(lmp.lm0)^2) # RSS for the submodel
[1] 278.2727
For the following test:
$$\begin{cases} H_0: \mathcal{M} \text{ does not fit the data better than } \mathcal{M}_0 \\ H_1: \mathcal{M} \text{ fits the data better than } \mathcal{M}_0 \end{cases}$$
the appropriate F-test statistic is given by:
$$F = \frac{(RSS_0 - RSS)/(I-1)}{RSS/(n-2I)}.$$
General rules for the degrees of freedom:
• $RSS_0 - RSS$: difference between the numbers of parameters of the two models;
• $RSS$: difference between the sample size $n$ and the number of parameters of the full model.
Under the null hypothesis, $F \sim F_{I-1,n-2I}$.
R script
> anova(lmp.lm0,lmp.lm)
Analysis of Variance Table

Model 1: LMP ~ BFAT + GENET
Model 2: LMP ~ BFAT * GENET
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)
1     55 278.27
2     53 229.48  2     48.79 5.6341 0.006045
Columnwise description of the ANOVA table:
• Res.Df: residual degrees of freedom;
• RSS: residual sum of squares;
• Df: degrees of freedom of $RSS_0 - RSS$;
• Sum of Sq: fitting gain $RSS_0 - RSS$;
• F: F-test statistic for the comparison of Model 1 and Model 2;
• Pr(>F): p-value of the test.
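The F statistic and p-value of this table can be recovered from the two residual sums of squares and the residual degrees of freedom reported by R. A minimal sketch, with the numbers copied from the outputs above:

```r
# Values reported in the R outputs for the course dataset
RSS0 <- 278.2727   # submodel   LMP ~ BFAT + GENET, 55 residual df
RSS  <- 229.4831   # full model LMP ~ BFAT * GENET, 53 residual df

df.num <- 55 - 53                      # I - 1 = 2 additional parameters
df.den <- 53                           # n - 2I
F.stat <- ((RSS0 - RSS) / df.num) / (RSS / df.den)
# p-value: upper tail probability of the Fisher distribution F(2, 53)
p.val  <- pf(F.stat, df1 = df.num, df2 = df.den, lower.tail = FALSE)
c(F = F.stat, p.value = p.val)
```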

Tests for pairwise comparisons
R script
> # Sidak correction for a control of the FWER at level 0.05
> alpha = 1-(1-0.05)^(1/3)
> alpha
[1] 0.01695243
R script
> lmp.lm = lm(LMP ~ BFAT*GENET,data=dta)
> summary(lmp.lm)$coefficients
                 Estimate Std. Error   t value     Pr(>|t|)
(Intercept)    80.2215328  3.2957112 24.341190 1.935662e-30
BFAT           -1.4575042  0.2144210 -6.797394 9.555535e-09
GENETP25       -9.7600733  3.6821573 -2.650640 1.056916e-02
GENETP50      -13.7187769  4.1457585 -3.309111 1.687184e-03
BFAT:GENETP25   0.6268832  0.2468264  2.539774 1.406063e-02
BFAT:GENETP50   0.9550714  0.2879532  3.316759 1.649469e-03
R script
> tmp = dta # Temporary dataset similar to dta
> tmp$GENET = relevel(dta$GENET,"P25")
> # tmp$GENET = dta$GENET except that the reference level is P25
> tmp.lm = lm(LMP ~ GENET*BFAT,data=tmp)
> summary(tmp.lm)$coefficients # P25 vs P50
                Estimate Std. Error   t value     Pr(>|t|)
(Intercept)   70.4614595  1.6421236 42.908743 7.681692e-43
GENETP0        9.7600733  3.6821573  2.650640 1.056916e-02
GENETP50      -3.9587036  3.0036929 -1.317945 1.931901e-01
BFAT          -0.8306210  0.1222575 -6.794030 9.675360e-09
GENETP0:BFAT  -0.6268832  0.2468264 -2.539774 1.406063e-02
GENETP50:BFAT  0.3281883  0.2277884  1.440759 1.555347e-01
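One possible way to use these outputs is to compare the p-value of each interaction term with the Sidak-corrected level. The sketch below copies the three p-values from the coefficient tables; the labels given to the comparisons are introduced here for readability.

```r
alpha <- 1 - (1 - 0.05)^(1/3)          # Sidak-corrected level for 3 comparisons

# p-values of the interaction terms, copied from the two coefficient tables
p.slopes <- c("P0 vs P25"  = 1.406063e-02,
              "P0 vs P50"  = 1.649469e-03,
              "P25 vs P50" = 1.555347e-01)

# TRUE: the two slopes are declared different with the FWER controlled at 0.05
p.slopes < alpha
```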