Date post: | 01-Jan-2016 |
Category: |
Documents |
Upload: | talon-holloway |
View: | 19 times |
Download: | 0 times |
Linear Modelling I
Richard MottWellcome Trust Centre for Human
Genetics
Synopsis
• Linear Regression• Correlation• Analysis of Variance• Principle of Least Squares
Correlation
Correlation and linear regression
• Is there a relationship?• How do we summarise it?• Can we predict new obs?• What about outliers?
Correlation Coefficient r• -1 < r < 1
• r=0 no relationship
• r=0.6
• r=1 perfect positive linear
• r=-1 perfect negative linear
Examples of Correlation(taken from Wikipedia)
Calculation of r• Data
Correlation in R
> cor(bioch$Biochem.Tot.Cholesterol,bioch$Biochem.HDL,use="complete")[1] 0.2577617
> cor.test(bioch$Biochem.Tot.Cholesterol,bioch$Biochem.HDL,use="complete")
Pearson's product-moment correlation
data: bioch$Biochem.Tot.Cholesterol and bioch$Biochem.HDL t = 11.1473, df = 1746, p-value < 2.2e-16alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.2134566 0.3010088 sample estimates: cor 0.2577617
> pt(11.1473,df=1746,lower.tail=FALSE) # T distribution on 1746 degrees of freedom[1] 3.154319e-28
Linear Regression
Fit a straight line to data
• a intercept• b slope• ei error
– Normally distributed– E(ei) = 0
– Var(ei) = s2
Example: simulated data
R code> # simulate 30 data points> x <- rnorm(30) > e <- rnorm(30)> x <- 1:30> e <- rnorm(30,0,5)> y <- 1 + 3*x + e
> # fit the linear model> f <- lm(y ~ x)
> # plot the data and the predicted line> plot(x,y)> abline(reg=f)
> print(f)
Call:lm(formula = y ~ x)
Coefficients:(Intercept) x -0.08634 3.04747
Least Squares
• Estimate a, b by least squares
• Minimise sum of squared residuals between y and the prediction a+bx
• Minimise
Why least squares?
• LS gives simple formulae for the estimates for a, b
• If the errors are Normally distributed then the LS estimates are “optimal”In large samples the estimates converge to the true valuesNo other estimates have smaller expected errorsLS = maximum likelihood
• Even if errors are not Normal, LS estimates are often useful
Analysis of Variance (ANOVA)LS estimates have an important property: they partition the sum of squares (SS) into fitted and error components
• total SS = fitting SS + residual SS• only the LS estimates do this
Component Degrees of freedom
Mean Square(ratio of SS to df)
F-ratio (ratio of FMS/RMS)
Fitting SS 1
Residual SS n-2
Total SS n-1
€
(yi − y )2 = ( ˆ b (xi − x ))2 + (yi − ˆ b xi − ˆ a )2
i
∑i
∑i
∑
€
( ˆ b (xi − x ))2
i
∑
€
(yi − ˆ b xi − ˆ a )2
i
∑
€
(yi − y )2
i
∑ €
( ˆ b (xi − x ))2
i
∑
€
(y i − ˆ b x i − ˆ a )2 /(n − 2)i
∑
€
(n − 2) ( ˆ b (xi − x ))2
i
∑
(yi − ˆ b xi − ˆ a )2
i
∑
ANOVA in R
Component SS Degrees of freedom
Mean Square F-ratio
Fitting SS 20872.7 1 20872.7 965Residual SS 605.6 28 21.6Total SS 21478.3 29
> anova(f)Analysis of Variance Table
Response: y Df Sum Sq Mean Sq F value Pr(>F) x 1 20872.7 20872.7 965 < 2.2e-16 ***Residuals 28 605.6 21.6
> pf(965,1,28,lower.tail=FALSE)[1] 3.042279e-23
Hypothesis testing• no relationship between y and x
• Assume errors ei are independent and normally distributed N(0,s2)
• If H0 is true then the expected values of the sums of squares in the ANOVA are
• degrees freedom
• Expectation
• F ratio = (fitting MS)/(residual MS) ~ 1 under H0
• F >> 1 implies we reject H0
• F is distributed as F(1,n-2)€
(yi − y )2 =i
∑ ( ˆ b (xi − x ))2 +i
∑ (yi − ˆ b xi − ˆ a )2
i
∑n −1 1 n − 2
(n − 2)σ 2 σ 2 (n − 2)σ 2
Degrees of Freedom• Suppose are iid N(0,1)
• Then ie n independent variables
• What about ?
• These values are constrained to sum to 0:
• Therefore the sum is distributed as if it comprised one fewer observation, hence it has n-1 df (for example, its expectation is n-1)
• In particular, if p parameters are estimated from a data set, then the residuals
have p constraints on them, so they behave like n-p independent variables
The F distribution• If e1….en are independent and identically distributed (iid)
random variables with distribution N(0,s2), then:• e1
2/s2 … en2/s2 are each iid chi-squared random variables with
chi-squared distribution on 1 degree of freedom c12
• The sum Sn = Si ei2/s2 is distributed as chi-squared cn
2
• If Tm is a similar sum distributed as chi-squared cm2, but
independent of Sn, then (Sn/n)/(Tm/m) is distributed as an F random variable F(n,m)
• Special cases:– F(1,m) is the same as the square of a T-distribution on m df– for large m, F(n,m) tends to cn
2
ANOVA – HDL example> ff <- lm(bioch$Biochem.HDL ~ bioch$Biochem.Tot.Cholesterol)> ff
Call:lm(formula = bioch$Biochem.HDL ~
bioch$Biochem.Tot.Cholesterol)
Coefficients: (Intercept) bioch$Biochem.Tot.Cholesterol 0.2308 0.4456
> anova(ff)Analysis of Variance Table
Response: bioch$Biochem.HDL Df Sum Sq Mean Sq F value
Pr(>F) bioch$Biochem.Tot.Cholesterol 1 149.660 149.660 1044 Residuals 1849 265.057 0.143
> pf(1044,1,28,lower.tail=FALSE)[1] 1.040709e-23
HDL = 0.2308 + 0.4456*Cholesterol
correlation and ANOVA
• r2 = FSS/TSS = fraction of variance explained by the model
• r2 = F/(F+n-2)– correlation and ANOVA are equivalent– Test of r=0 is equivalent to test of b=0 – T statistic in R cor.test is the square root of the ANOVA F statistic
– r does not tell anything about magnitudes of estimates of a, b– r is dimensionless
Effect of sample size
Total Cholesterol vs HDL dataExample R session to sample subsets of data and compute correlations
seqq <- seq(10,300,5)corr <- matrix(0,nrow=length(seqq),ncol=2)colnames(corr) <- c( "sample size", "P-value")n <- 1for(i in seqq) {
res <- rep(0,100)for(j in 1:100) {
s <- sample(idx,i)data <- bioch[s,]co <- cor.test(data$Biochem.Tot.Cholesterol,
data$Biochem.HDL,na="pair")res[j] <- co$p.value
}m <- exp(mean(log(res)))cat(i, m, "\n")corr[n,] <- c(i, m)n <- n+1
}
Calculating the right sample size n
• The R library “pwr” contains functions to compute the sample size for many problems, including correlation pwr.r.test() and ANOVA pwr.anova.test()
Problems with non-linearityAll plots have r=0.8 (taken from Wikipedia)
Non-Parametric CorrelationSpearman Rank Correlation Coefficient
• Replace observations by their ranks• eg x= ( 5, 1, 4, 7 ) -> rank(x) = (3,1,2,4)• Compute sum of squared differences between
ranks
• in R:– cor( x, y, method=“spearman”)– cor.test(x,y,method=“spearman”)
Spearman Correlation> cor.test(xx,y, method=“pearson”)
Pearson's product-moment correlation
data: xx and y t = 0.0221, df = 28, p-value = 0.9825alternative hypothesis: true correlation is not
equal to 0 95 percent confidence interval: -0.3566213 0.3639062 sample estimates: cor 0.004185729
> cor.test(xx,y,method="spearman")
Spearman's rank correlation rho
data: xx and y S = 2473.775, p-value = 0.01267alternative hypothesis: true rho is not equal to 0 sample estimates: rho 0.4496607
Multiple Correlation
• The R cor function can be used to compute pairwise correlations between many variables at once, producing a correlation matrix.
• This is useful for example, when comparing expression of genes across subjects.
• Gene coexpression networks are often based on the correlation matrix.
• in Rmat <- cor(df, na=“pair”)
– computes the correlation between every pair of columns in df, removing missing values in a pairwise manner
– Output is a square matrix correlation coefficients
One-Way ANOVA
• Model y as a function of a categorical variable taking p values– eg subjects are classified into p families– want to estimate effect due to each family and
test if these are different– want to estimate the fraction of variance
explained by differences between families – ( an estimate of heritability)
One-Way ANOVA
LS estimators
average over group i
One-Way ANOVA
• Variance is partitioned in to fitting and residual SS
total SS
n-1
fitting SSbetween groups
p-1
residual SSwith groups
n-p degrees of freedom
One-Way ANOVA
Component SS Degrees of freedom
Mean Square(ratio of SS to df)
F-ratio (ratio of FMS/RMS)
Fitting SS p-1
Residual SS n-p
Total SS n-1
Under Ho: no differences between groupsF ~ F(p-1,n-p)
One-Way ANOVA in Rfam <- lm( bioch$Biochem.HDL ~ bioch$Family )> anova(fam)Analysis of Variance Table
Response: bioch$Biochem.HDL Df Sum Sq Mean Sq F value Pr(>F) bioch$Family 173 6.3870 0.0369 3.4375 < 2.2e-16 ***Residuals 1727 18.5478 0.0107 ---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 >
Component SS Degrees of freedom
Mean Square(ratio of SS to df)
F-ratio (ratio of FMS/RMS)
Fitting SS 6.3870 173 0.0369 3.4375
Residual SS 18.5478 1727 0.0107
Total SS 24.9348 1900
Non-ParametricOne-Way ANOVA
• Kruskall-Wallis Test• Useful if data are highly non-Normal– Replace data by ranks– Compute average rank within each group– Compare averages– kruskal.test( formula, data )
Factors in R
• Grouping variables in R are called factors• When a data frame is read with read.table()– a column is treated as numeric if all non-missing entries
are numbers– a column is boolean if all non-missing entries are T or F
(or TRUE or FALSE)– a column is treated as a factor otherwise– the levels of the factor is the set of distinct values– A column can be forced to be treated as a factor using the
function as.factor(), or as a numeric vector using as.numeric()
Linear Modelling in R
• The R function lm() fits linear models• It has two principal arguments (and some
optional ones)• f <- lm( formula, data )– formula is an R formula– data is the name of the data frame containing the
data (can be omitted, if the variables in the formula include the data frame)
formulae in R
• Biochem.HDL ~ Biochem$Tot.Cholesterol
– linear regression of HDL on Cholesterol – 1 df
• Biochem.HDL ~ Family
– one-way analysis of variance of HDL on Family– 173 df
• The degrees of freedom are the number of independent parameters to be estimated
ANOVA in R• f <- lm(Biochem.HDL ~ Tot.Cholesterol, data=biochem)• [OR f <- lm(biochem$Biochem.HDL ~ biochem$Tot.Cholesterol)]
• a <- anova(f)
• f <- lm(Biochem.HDL ~ Family, data=biochem)• a <- anova(f)