EECS E6690: Statistical Learning for Biological and Information Systems
Lecture 1: Introduction
Prof. Predrag R. Jelenkovic
Time: Tuesday 4:10-6:40pm
1127 Seeley W. Mudd Building
Dept. of Electrical Engineering
Columbia University, NY 10027, USA
Office: 812 Schapiro Research Bldg.
Phone: (212) 854-8174
Email: [email protected]
URL: http://www.ee.columbia.edu/~predrag
E6690 Statistical Learning: Brief Description
- Deluge of Data in Biology and Information Systems: Ongoing advancements in information systems, as well as the emerging revolution in microbiology and neuroscience, are creating a deluge of data, whose mining, inference and prediction will have an enormous economic, social, scientific and medical/therapeutic impact.
- Biology: For example, in biology, microarray technology is creating vast amounts of gene expression data, whose understanding could lead to better diagnostics and a potential cure for cancer.
- Information Systems: Similarly, in information systems, companies like Google, Amazon, Facebook, etc., are facing various problems on massive data sets, e.g., ranking and community detection.
This course will cover a variety of fundamental statistical (machine) learning techniques that are suitable for the emerging problems in these application areas:
- Basics of Statistics and Optimization
- Introduction to Statistical/Machine Learning Techniques
  - Supervised versus unsupervised learning
  - Inference and prediction
  - Linear versus nonlinear models
  - Training, testing and validation
  - Regularization
  - And many more
- Specifics of Biological and Information Systems Data
  - High dimensionality and need for regularization
  - Large sparse graphs
  - Community detection
  - Ranking
  - Association rules (Market basket analysis)
E6690 Statistical Learning: Course Logistics
Prerequisites: Calculus. Some knowledge of probability/statistics and optimization is strongly encouraged, but not required. Familiarity with a programming language, say Matlab, is highly desirable.
Textbooks: The following two books will serve as the supporting references for the course. The books are available online:
ESL: Hastie, T., Tibshirani, R. and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd Edition. Springer, 2009. https://web.stanford.edu/~hastie/Papers/ESLII.pdf
ISL: James, G., Witten, D., Hastie, T. and Tibshirani, R. An Introduction to Statistical Learning. Springer, 2014. http://faculty.marshall.usc.edu/gareth-james/ISL/
In addition, lecture notes as well as occasionally other books and research papers will be used.
Homework: Biweekly homework will be assigned (about 4 assignments).
Programming: The course uses the R language. Pointers to its free download and resources, as well as basic examples of programming in R, will be covered in class.
Grading: Homework (20%) + Midterm (35%) + Final Project (45%).
Midterm: In class, closed book; 2-page cheat sheet allowed; 2 1/2 hours
- Mixture of problem solving and descriptive answers
Final Project: Done in groups of 2-3 students
- First, select a paper (or papers) from a data repository, e.g.:
  - GEO (Gene Expression Omnibus) Data Repository: https://www.ncbi.nlm.nih.gov/geo/
  - UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.php
- General Project Outline
  1. Introduction: e.g., describe the application area, problems considered, etc.
  2. Data set(s) and paper(s): e.g., describe data in detail, what was done in the paper(s), common stat/machine learning tools, etc.
  3. Reproduce the results from the paper(s)
  4. Try different techniques learned in class, or propose new ones
  5. Discussion and conclusion: e.g., compare different techniques, pros and cons, future work, etc.
Statistical Learning: What Does It Involve?
In general, supervised Statistical (Machine) Learning problems can typically be posed as
Y = f(X) + ε
where ε is the noise.
Problem: Estimate f from training data {(x_i, y_i)}, and then use it as a general solution.
Two main setups:
- Noiseless case (Y = f(X)): more common in machine learning
- Noisy case (Y = f(X) + ε): more prevalent in statistics
Areas involved:
- Approximation theory: for picking a class of functions
- Optimization: for fitting the training data
- Computing: fitting and testing
- Probability and Statistics: testing, error estimation
Machine Learning Versus Classical Programming
Interesting Question: What is the difference between classical programming and statistical/machine learning?
Y = f(X)
- Classical Programming: f is an algorithm designed by a person
- Statistical Learning: f is discovered through examples by training
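A minimal sketch of this distinction in R: the same f is first written by hand and then recovered from examples. The temperature-conversion task, the function name `to_fahrenheit`, and the sample inputs are illustrative assumptions, not part of the lecture.

```r
# Classical programming: a person writes f explicitly.
to_fahrenheit <- function(celsius) 1.8 * celsius + 32

# Statistical learning: f is recovered from (x, y) examples by fitting a model.
celsius <- c(-10, 0, 15, 25, 40)          # illustrative inputs
fahrenheit <- to_fahrenheit(celsius)      # noiseless examples, Y = f(X)
learned <- lm(fahrenheit ~ celsius)       # linear model fitted to the examples
learned_coef <- unname(coef(learned))     # (intercept, slope)
print(learned_coef)
```

In this noiseless case the fitted intercept and slope coincide with the hand-written constants up to floating-point error.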
General Course Objectives
- Focus/motivation: emerging applications in:
  - Biology and Medicine
  - Information Technology, e.g., problems at Google, Facebook, Twitter, Amazon, etc.
- Learn fundamental concepts and techniques in statistical (machine) learning that are
  - Suitable for these application areas
  - Useful and applicable in general
- Develop the necessary knowledge as we go (e.g., Statistics, Optimization, Approximation Theory, etc.)
- Learn R
- Gain hands-on experience with a real, practical problem through a final project
Overall objective: Become an expert in Statistical/Machine Learning
Programming in R: Computing Platform
- Language and environment for statistical computing and graphics
- Free software
- Download
  - R from http://cran.r-project.org/
  - RStudio, an Integrated Development Environment for R, from http://www.rstudio.com/products/rstudio/download/
- Resources
  - R for beginners
  - Quick-R
  - Cookbook for R
  - R for Data Science
  - Try R
Brief Statistics Review
Crash Course in Undergraduate Statistics
Example
The following numbers are particle (contamination) counts for a sample of 10 semiconductor silicon wafers:
50 48 44 56 61 52 53 55 67 51
Over a long run the process average for wafer particle counts has been 50 counts per wafer, and on the basis of the sample, we want to test whether a change has occurred.
- Are the data consistent with a given hypothesis?
- Idea: Data → scalar with a known distribution → likelihood
- The “transformation” is not unique
Estimates
- A statistic is a property of sample data taken from a population
- A point estimate of some unknown parameter is a statistic that provides a best guess at the parameter value
- A point estimate \hat{\theta} is unbiased if E[\hat{\theta}] = \theta
- X_1, X_2, ..., X_n: i.i.d. with mean \mu and variance \sigma^2
- Examples
  - Sample mean:
    \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
  - Sample variance:
    S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2
- Variability: Var(\bar{X}) = \sigma^2 / n \approx SE(\bar{X})^2,
  where SE is the standard error, SE(\bar{X})^2 = S^2 / n
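As a small worked example, the estimates above can be computed in R for the wafer particle counts given earlier. The variable names are illustrative; `mean` and `var` are base R functions, and the sample variance is also written out from its definition for comparison.

```r
# Wafer particle counts from the earlier example
x <- c(50, 48, 44, 56, 61, 52, 53, 55, 67, 51)
n <- length(x)

x_bar <- mean(x)                        # sample mean
s2 <- sum((x - x_bar)^2) / (n - 1)      # sample variance from the definition (same as var(x))
se <- sqrt(s2 / n)                      # standard error of the sample mean

cat("mean =", x_bar, " S^2 =", s2, " SE =", se, "\n")
```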
Variability of estimates: Known variance
- If X_1, ..., X_n are i.i.d. normal, then
  - \bar{X} is normal:
    \frac{\bar{X} - \mu}{\sqrt{\sigma^2 / n}} \sim N(0, 1)
  - S^2 has a known distribution:
    \frac{n-1}{\sigma^2} S^2 \sim \chi^2_{n-1},
    where \chi^2_{n-1} (chi-square) is a random variable whose distribution is equal to that of the sum of (n-1) squares of independent standard normal random variables
  - \bar{X} and S^2 are independent (prove)
- If X_1, ..., X_n are i.i.d. but not normal, then by the CLT, as n grows:
  \frac{\bar{X} - \mu}{\sqrt{\sigma^2 / n}} \Rightarrow N(0, 1)
Variability of estimates: Unknown variance
- If X_1, ..., X_n are i.i.d. normal, then
  - t-statistic:
    \frac{\bar{X} - \mu}{\sqrt{S^2 / n}} \sim \frac{N(0, 1)}{\sqrt{\chi^2_{n-1} / (n-1)}} \sim t_{n-1},
    where t_{n-1} is Student's t-distribution with (n-1) degrees of freedom
- Representation of t_n: Let Z \sim N(0, 1) and V \sim \chi^2_n be independent. Then
  \frac{Z}{\sqrt{V / n}} \sim t_n
t-distribution
- Zero mean
- Variance (n > 2): n/(n-2)
[Figure: PDFs of t distributions. x-axis: x value (-4 to 4); y-axis: density (0.0 to 0.4); curves for n = 3, 5, 8, 30 degrees of freedom and the normal distribution]
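The representation Z / sqrt(V / n) and the variance n / (n - 2) can be checked by simulation. The following is a sketch only; the seed, the choice n = 10, and the number of draws are arbitrary assumptions made for illustration.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
n <- 10                          # degrees of freedom (illustrative choice)
N <- 100000                      # number of simulated draws

Z <- rnorm(N)                    # Z ~ N(0, 1)
V <- rchisq(N, df = n)           # V ~ chi-square with n degrees of freedom, drawn independently of Z
T_sim <- Z / sqrt(V / n)         # should follow the t distribution with n degrees of freedom

sim_var <- var(T_sim)                        # compare with n / (n - 2)
sim_q <- unname(quantile(T_sim, 0.975))      # compare with qt(0.975, n)
cat("variance:", sim_var, "vs", n / (n - 2), "\n")
cat("97.5% quantile:", sim_q, "vs", qt(0.975, n), "\n")
```

The simulated values agree with the theoretical ones only up to Monte Carlo error, which shrinks as N grows.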
t-test
- Null hypothesis H_0: µ = µ_0
- Under H_0, the t-statistic satisfies
  t = \frac{\bar{X} - \mu_0}{\sqrt{S^2 / n}} \sim t_{n-1},
  and the corresponding p-value is the probability of observing |t_{n-1}| that is ≥ |t|, i.e., p = P[|t_{n-1}| ≥ |t|].
- Large values of |t| are unlikely under H_0
- Typically:
  - pick a significance level, say α = 0.05
  - reject H_0 if p < α, say p < 0.05
  - accept (fail to reject) H_0 if p ≥ α, say p ≥ 0.05
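Applied to the wafer example from the statistics review (H_0: µ = 50), the test can be sketched in R as follows. The statistic and p-value are computed from the formulas above and compared with base R's `t.test`; variable names are illustrative.

```r
x <- c(50, 48, 44, 56, 61, 52, 53, 55, 67, 51)   # wafer particle counts
mu0 <- 50                                        # long-run process average under H0
n <- length(x)

t_stat <- (mean(x) - mu0) / sqrt(var(x) / n)     # t-statistic under H0
p_val <- 2 * pt(-abs(t_stat), df = n - 1)        # two-sided p-value, P[|t_{n-1}| >= |t|]

builtin <- t.test(x, mu = mu0)                   # the same test via base R
cat("t =", t_stat, " p =", p_val, "\n")
```

For these data the p-value exceeds 0.05, so at α = 0.05 the hypothesis that the process average is still 50 is not rejected.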
Intro to Statistical Learning
Supervised vs. unsupervised learning
- Supervised learning: there is an input-output relationship
  Y = f(X) + ε
  - X ∈ R^p: vector of p predictor measurements
  - Y ∈ R: outcome measurement
  - ε: noise
  - Two problems:
    - Regression: Y is quantitative
    - Classification: Y is categorical
  - Training data (observations): (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)
  - Objectives:
    - Statistics: Prediction, inference
    - Machine learning: Solve a problem via training
- Unsupervised learning: No outcome variable Y
  - Objective can be vague: just exploring data
  - Learn interesting phenomena in data, e.g.:
    - Clustering, community detection, data association, low-dimensional representation
Learning
- Let Y ∈ R be the output variable, and X ∈ R^p the input vector X = (X_1, X_2, ..., X_p). Then
  Y = f(X) + ε
- Want to estimate what f is
- ε is unavoidable noise that is independent of X and has zero mean
- How to estimate f from the data? How to evaluate the estimate?
- Given an estimate \hat{f} of f, predict unavailable values of Y for known values of X: \hat{Y} = \hat{f}(X)
- Reducible and irreducible errors:
  - \hat{f} is not exactly f, but f can potentially be learnt given enough data
  - even if f is known, there is error: ε = Y − f(X)
Two approaches to estimate f
- Parametric
  - Assume a specific form of f
  - Example: the linear model
    f(X) = β_0 + β_1 X_1 + β_2 X_2 + ... + β_p X_p
  - Use training data to choose the values of parameters β_0, β_1, ..., β_p
  - Pro: easier to estimate parameters than an arbitrary function
  - Con: the choice of f might be (very) wrong
- Non-parametric
  - Make the parametric form more flexible
  - This makes \hat{f} more complex and potentially following the noise too closely, thereby overfitting
  - Get \hat{f} as close as possible to the data points, subject to not being too non-smooth
  - Pro: more likely to get f right, especially if f is “strange”
  - Con: more data is needed to obtain a good estimate for f
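To make the contrast concrete, here is a hedged sketch that fits both kinds of model to synthetic data. The true f (a sine), the noise level, and the seed are assumptions chosen for illustration; `smooth.spline` from base R stands in for a generic non-parametric method and is not necessarily the one used in the course.

```r
set.seed(2)
x <- seq(0, 2 * pi, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.3)        # synthetic data: the true f is sin, chosen for illustration

fit_lin <- lm(y ~ x)                      # parametric: assumes f is linear
fit_spl <- smooth.spline(x, y)            # non-parametric: smoothing spline with a smoothness penalty

mse_lin <- mean(residuals(fit_lin)^2)             # training error of the linear model
mse_spl <- mean((y - predict(fit_spl, x)$y)^2)    # training error of the spline
cat("training MSE, linear:", mse_lin, " spline:", mse_spl, "\n")
```

Here the linear form is the wrong choice for f, so the flexible fit tracks the data much more closely; a lower training error alone does not show that it predicts better on new data.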
Example
[Figure: two scatter plots of y versus x (x from 5 to 20, y from 0.0 to 1.0), titled "training" and "testing"]
- More complicated models are not always better, e.g., because of overfitting
- Amount of available data
- Interpretability
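The training/testing comparison can be sketched with polynomial fits of increasing degree. Everything specific here (the function `f`, sample sizes, noise level, degrees, and seed) is a hypothetical setup for illustration, not the data behind the figure.

```r
set.seed(3)
f <- function(x) sin(x / 3)                          # hypothetical "true" f
x_train <- runif(30, 0, 20); y_train <- f(x_train) + rnorm(30, sd = 0.3)
x_test  <- runif(30, 0, 20); y_test  <- f(x_test)  + rnorm(30, sd = 0.3)
train <- data.frame(x = x_train, y = y_train)

degrees <- c(1, 3, 12)                               # increasingly complex models
mse <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)            # polynomial regression of degree d
  c(train = mean(residuals(fit)^2),
    test  = mean((y_test - predict(fit, newdata = data.frame(x = x_test)))^2))
})
colnames(mse) <- paste0("degree_", degrees)
print(round(mse, 3))
```

Training error can only go down as the degree grows, because the models are nested; test error carries no such guarantee, which is the point of evaluating on held-out data.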
Linear Regression
Idea
- Simple approach to supervised learning
- Assumes linear dependence of quantitative Y on X_1, X_2, ..., X_p
- True regression functions are never linear!
- Extremely useful both conceptually and practically
Data set
- Will use Advertising.csv to illustrate concepts
- 200 observations:
"","TV","Radio","Newspaper","Sales"
"1",230.1,37.8,69.2,22.1
"2",44.5,39.3,45.1,10.4
"3",17.2,45.9,69.3,9.3
.
.
.
"198",177,9.3,6.4,12.8
"199",283.6,42,66.2,25.5
"200",232.1,8.6,8.7,13.4
Advertising data set
[Figure: scatterplot matrix of the Advertising data set, showing pairwise plots of TV, Radio, Newspaper, and Sales]
Single predictor: TV vs. Sales
> adv<-read.csv("advertising.csv",header=TRUE,sep=",")
> plot(adv$TV,adv$Sales,xlab="TV",ylab="Sales",col="red")
[Figure: scatter plot of Sales (y-axis, 5 to 25) versus TV (x-axis, 0 to 300)]
- Linear model
  Y = β_0 + β_1 X + ε,
  where
  - β_0 and β_1: unknown constants/parameters/coefficients (intercept and slope)
  - ε: error term
Single predictor: Model selection
- Estimate β_0 and β_1 based on data
- Given estimates \hat{\beta}_0 and \hat{\beta}_1, predict future sales using
  \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x
- \hat{y}: prediction of Y given X = x
- Residuals: y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)
- Select \hat{\beta}_0 and \hat{\beta}_1 to “minimize” residuals
- How to minimize a vector?
Need to Define Distance: Vector norms
- Example: l_p norm
  \|z\|_p = \left( \sum_{i=1}^{n} |z_i|^p \right)^{1/p}
- Example: 3 data points, {(0, 1), (1, 0), (2, 1)}
  The result depends on the choice of the norm (!)
  (the fitted line is parallel to the x-axis due to symmetry)
[Figure: plot of the three data points; x-axis from 0.0 to 2.0, y-axis from 0.0 to 1.0]
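For the three-point example, the dependence on the norm can be checked numerically. The sketch below restricts attention to horizontal lines y = h, as the symmetry argument suggests, and minimizes the l_1, l_2 and l_infinity norms of the residual vector with base R's `optimize`; the search interval and the variable names are illustrative assumptions.

```r
y <- c(1, 0, 1)                                   # responses at x = 0, 1, 2

# Norm of the residual vector for a horizontal line at height h
loss <- list(
  l1   = function(h) sum(abs(y - h)),
  l2   = function(h) sqrt(sum((y - h)^2)),
  linf = function(h) max(abs(y - h))
)

# One-dimensional numerical minimization of each norm over h
best_h <- sapply(loss, function(g) optimize(g, interval = c(-1, 2))$minimum)
print(round(best_h, 3))
```

The three minimizers correspond to the median (1), the mean (2/3) and the midrange (1/2) of the responses, so each norm yields a different fitted line.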
One dimensional l2 regression: Least squares
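- min \|y - \hat{y}\|_2
- Residual Sum of Squares (RSS):
  RSS \equiv RSS(\hat{\beta}_0, \hat{\beta}_1) = \|y - \hat{y}\|_2^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
- Least squares approach: \min_{\hat{\beta}_0, \hat{\beta}_1} RSS
- Solution:
  \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
  \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},
  where \bar{x} = n^{-1} \sum_{i=1}^{n} x_i and \bar{y} = n^{-1} \sum_{i=1}^{n} y_i are the sample means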
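The closed-form solution can be checked against `lm` on synthetic data. The data-generating coefficients, sample size, and seed below are assumptions made for illustration; they are not the Advertising data.

```r
set.seed(4)
x <- runif(50, 0, 300)                       # synthetic predictor (illustrative range)
y <- 7 + 0.05 * x + rnorm(50, sd = 3)        # synthetic response with assumed coefficients

# Least-squares estimates from the closed-form expressions
b1 <- sum((y - mean(y)) * (x - mean(x))) / sum((x - mean(x))^2)   # slope
b0 <- mean(y) - b1 * mean(x)                                      # intercept

fit <- lm(y ~ x)                             # the same fit via base R
print(rbind(formula = c(b0, b1), lm = unname(coef(fit))))
```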
Example
> lm1<-lm(adv$Sales~adv$TV)
> summary(lm1)
Sales = 7.032594 + 0.047537× TV
> plot(adv$TV,adv$Sales,xlab="TV",ylab="Sales",col="red",pch=20)
> abline(lm(adv$Sales~adv$TV),col="blue",lwd=2)
> Sales_Predict<-predict(lm1)
> segments(adv$TV, adv$Sales, adv$TV, Sales_Predict)
[Figure: scatter plot of Sales (5 to 25) versus TV (0 to 300) with the fitted least-squares line and vertical segments from each point to the line]
Example: l2 vs. l1
- One point in the data set is modified
[Figure: two scatter plots of Sales (0 to 50) versus TV (0 to 300), titled "Original data set" and "Modified data set"]
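A sketch of the same comparison on synthetic data: one response is replaced by an outlier, and the change in the fitted slope is measured for the l_2 (least squares) and l_1 (least absolute deviations) fits. The helper `lad_fit` is hypothetical, written for this note; it searches all lines through two data points, relying on the fact that some l_1-optimal line passes through at least two of them. The data, the outlier value, and the seed are assumptions.

```r
# l1 (least absolute deviations) fit by exhaustive search over lines through two data points
lad_fit <- function(x, y) {
  pairs <- combn(length(x), 2)
  best <- c(NA_real_, NA_real_)
  best_obj <- Inf
  for (k in seq_len(ncol(pairs))) {
    i <- pairs[1, k]; j <- pairs[2, k]
    if (x[i] == x[j]) next                       # vertical line, skip
    b1 <- (y[j] - y[i]) / (x[j] - x[i])
    b0 <- y[i] - b1 * x[i]
    obj <- sum(abs(y - b0 - b1 * x))             # l1 norm of the residuals
    if (obj < best_obj) { best_obj <- obj; best <- c(b0, b1) }
  }
  best                                           # (intercept, slope)
}

set.seed(5)
x <- 1:40
y <- 2 + 0.5 * x + rnorm(40, sd = 0.5)           # synthetic data
y_mod <- y; y_mod[40] <- 60                      # one point modified into an outlier

l2_change <- unname(abs(coef(lm(y_mod ~ x))[2] - coef(lm(y ~ x))[2]))
l1_change <- abs(lad_fit(x, y_mod)[2] - lad_fit(x, y)[2])
cat("slope change, l2:", l2_change, " l1:", l1_change, "\n")
```

The exhaustive search costs on the order of n^3 operations, which is fine for a small illustration but not for large data sets.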
Coefficient estimates
- Suppose the true model is
  Sales = β_0 + β_1 × TV + ε
- How good are the estimates \hat{\beta}_0 and \hat{\beta}_1?
[Figure: two scatter plots of Sales (0 to 30) versus TV (0 to 300)]
Observations i = 1, ..., 100: Sales = 7.241734 + 0.049069 × TV
Observations i = 101, ..., 200: Sales = 6.803818 + 0.046135 × TV
Properties of β0 and β1
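- Repeated sampling
- \hat{\beta}_0 and \hat{\beta}_1 vary
- Means: E[\hat{\beta}_0] = \beta_0 and E[\hat{\beta}_1] = \beta_1
- Variances:
  Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
  Var(\hat{\beta}_0) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right),
  where \sigma^2 = Var(\varepsilon)
- An estimate of \sigma^2:
  RSE^2 = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n-2} RSS,
  where RSE is the Residual Standard Error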
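Repeated sampling can be imitated by simulation: keep the x values fixed, redraw the noise many times, and look at the resulting slope estimates. The true coefficients, noise level, number of repetitions, and seed are assumptions chosen for this sketch.

```r
set.seed(6)
x <- 1:20; beta0 <- 1; beta1 <- 0.5; sigma <- 2     # assumed true model, for illustration
sxx <- sum((x - mean(x))^2)

b1_hat <- replicate(5000, {
  y <- beta0 + beta1 * x + rnorm(length(x), sd = sigma)   # fresh noise, same x
  sum((y - mean(y)) * (x - mean(x))) / sxx                # least-squares slope
})

cat("mean of slope estimates:", mean(b1_hat), "(true value", beta1, ")\n")
cat("variance of slope estimates:", var(b1_hat), "(formula", sigma^2 / sxx, ")\n")
```

The empirical mean and variance match the formulas only approximately, with the gap shrinking as the number of repetitions grows.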
Confidence intervals
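- Normality assumption: ε ∼ N(0, σ^2)
- t-statistic:
  \frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} \sim t_{n-2},
  where
  SE(\hat{\beta}_1)^2 = \frac{1}{n-2} \cdot \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
- (1 − γ) confidence interval:
  [\hat{\beta}_1 - SE(\hat{\beta}_1) \cdot t_{\gamma/2, n-2}, \hat{\beta}_1 + SE(\hat{\beta}_1) \cdot t_{\gamma/2, n-2}]
  is such that
  P[\beta_1 \in [\hat{\beta}_1 - SE(\hat{\beta}_1) \cdot t_{\gamma/2, n-2}, \hat{\beta}_1 + SE(\hat{\beta}_1) \cdot t_{\gamma/2, n-2}]] = 1 - \gamma,
  where t_{\gamma/2, n-2} is the (1 − γ/2)-th quantile of the t_{n-2} distribution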
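The interval can be computed directly from these formulas and compared with base R's `confint`. The synthetic data, assumed coefficients, and seed below are illustrative only.

```r
set.seed(7)
x <- runif(60, 0, 300)
y <- 7 + 0.05 * x + rnorm(60, sd = 3)             # synthetic data with assumed coefficients
n <- length(x)
fit <- lm(y ~ x)

b1 <- unname(coef(fit)[2])                                             # slope estimate
se_b1 <- sqrt(sum(residuals(fit)^2) / (n - 2) / sum((x - mean(x))^2))  # SE of the slope
gamma <- 0.05
t_crit <- qt(1 - gamma / 2, df = n - 2)                                # (1 - gamma/2) quantile of t_{n-2}

ci_manual <- c(b1 - se_b1 * t_crit, b1 + se_b1 * t_crit)
print(ci_manual)
print(confint(fit, "x", level = 1 - gamma))       # should agree with ci_manual
```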
Hypothesis testing
- Typical testing (null vs. alternative hypothesis):
  H_0: there is no relationship between X and Y
  versus the alternative
  H_A: there is some relationship between X and Y
- Formally:
  H_0: β_1 = 0 vs. H_A: β_1 ≠ 0
- To test H_0 (β_1 = 0), compute a t-statistic:
  t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)},
  which, under H_0, is distributed according to a t-distribution with (n − 2) degrees of freedom
- Compute the p-value: the probability, under H_0, of observing a value whose magnitude is equal to |t| or larger
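A minimal sketch of this test on synthetic data, with the t-statistic and two-sided p-value computed by hand and compared with the coefficient table from `summary`. The data-generating values and the seed are assumptions for illustration.

```r
set.seed(8)
x <- runif(60, 0, 300)
y <- 7 + 0.01 * x + rnorm(60, sd = 3)             # synthetic data with a weak assumed slope
n <- length(x)
fit <- lm(y ~ x)

se_b1 <- sqrt(sum(residuals(fit)^2) / (n - 2) / sum((x - mean(x))^2))
t_stat <- unname(coef(fit)[2]) / se_b1            # t = (estimate - 0) / SE
p_val <- 2 * pt(-abs(t_stat), df = n - 2)         # two-sided p-value from t_{n-2}

tab <- summary(fit)$coefficients                  # the same quantities reported by R
print(c(t = t_stat, p = p_val))
print(tab["x", c("t value", "Pr(>|t|)")])
```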
Example
> summary(lm1)
Call:
lm(formula = adv$Sales ~ adv$TV)

Residuals:
    Min      1Q  Median      3Q     Max
-8.3860 -1.9545 -0.1913  2.0671  7.2124

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.032594   0.457843   15.36   <2e-16 ***
adv$TV      0.047537   0.002691   17.67   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.259 on 198 degrees of freedom
Multiple R-squared:  0.6119,  Adjusted R-squared:  0.6099
F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16
> qt(0.975,198)
[1] 1.972017
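The quantile above is what is needed for a 95% confidence interval for the TV coefficient. As a sketch, using only the estimate and standard error printed in the summary output (variable names are illustrative):

```r
b1 <- 0.047537          # estimate for adv$TV from the summary output
se_b1 <- 0.002691       # its standard error from the summary output
t_crit <- qt(0.975, 198)                          # 97.5% quantile of t with 198 degrees of freedom

ci <- c(lower = b1 - t_crit * se_b1, upper = b1 + t_crit * se_b1)   # 95% confidence interval
t_val <- b1 / se_b1                               # t value for H0: beta_1 = 0

print(round(ci, 5))
print(round(t_val, 2))
```

The interval is roughly [0.0422, 0.0528] and excludes zero, in line with the very small p-value reported for the TV coefficient.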
Reading:
ISL: Read in detail Chapter 2 and Section 3.1. Also, looking through the entire Chapters 1-3 is recommended.