
EECS E6690: Statistical Learning for Biological and Information … · 2018-09-07 · General...


EECS E6690: Statistical Learning for Biological and Information Systems

Lecture 1: Introduction

Prof. Predrag R. Jelenković
Time: Tuesday 4:10-6:40pm
1127 Seeley W. Mudd Building

Dept. of Electrical Engineering
Columbia University, NY 10027, USA
Office: 812 Schapiro Research Bldg.
Phone: (212) 854-8174
Email: [email protected]
URL: http://www.ee.columbia.edu/~predrag


E6690 Statistical Learning: Brief Description

- Deluge of Data in Biology and Information Systems: Ongoing advancements in information systems as well as the emerging revolution in microbiology and neuroscience are creating a deluge of data, whose mining, inference and prediction will have an enormous economic, social, scientific and medical/therapeutic impact.

- Biology: For example, in biology, microarray technology is creating vast amounts of gene expression data, whose understanding could lead to better diagnostics and potential cure of cancer.

- Information Systems: Similarly, in information systems, companies like Google, Amazon, Facebook, etc., are facing various problems on massive data sets, e.g., ranking and community detection.


E6690 Statistical Learning: Brief Description

This course will cover a variety of fundamental statistical (machine) learning techniques that are suitable for the emerging problems in these application areas:

- Basics of Statistics and Optimization

- Introduction to Statistical/Machine Learning Techniques
  - Supervised versus unsupervised learning
  - Inference and prediction
  - Linear versus nonlinear models
  - Training, testing and validation
  - Regularization
  - And many more

- Specifics of Biological and Information Systems Data
  - High dimensionality and need for regularization
  - Large sparse graphs
  - Community detection
  - Ranking
  - Association rules (market basket analysis)


E6690 Statistical Learning: Course Logistics

Prerequisites: Calculus. Some knowledge of probability/statistics and optimization is strongly encouraged, but not required. Familiarity with a programming language, say Matlab, is highly desirable.

Textbooks: The following two books are the supporting references for the course. Both are available online:

ESL: Hastie, T., Tibshirani, R. and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd Edition. Springer, 2009. https://web.stanford.edu/~hastie/Papers/ESLII.pdf

ISL: James, G., Witten, D., Hastie, T. and Tibshirani, R. An Introduction to Statistical Learning. Springer, 2014. http://www-bcf.usc.edu/~gareth/ISL/

In addition, lecture notes and research papers will be used.

Homework: Biweekly homework will be assigned (about 4 assignments).

Programming: The course uses the R language. Pointers to its free download, as well as basic examples of programming in R, will be covered in class.

Grading: Homework (20%) + Midterm (35%) + Final Project (45%).


E6690 Statistical Learning: Course Logistics

Midterm: In class, closed book; 2-page cheat sheet allowed; 2 1/2 hours
- Mixture of problem solving and descriptive answers

Final Project: Done in groups of 2-3 students
- First, select a paper(s) from a data repository, e.g.:
  - GEO (Gene Expression Omnibus) Data Repository: https://www.ncbi.nlm.nih.gov/geo/
  - UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.html
- General Project Outline
  1. Introduction: e.g., describe the application area, problems considered, etc.
  2. Data set(s) and paper(s): e.g., describe data in detail, what was done in the paper(s), common stat/machine learning tools, etc.
  3. Reproduce the results from the paper(s)
  4. Try different techniques learned in class, or propose new ones
  5. Discussion and conclusion: e.g., compare different techniques, pros and cons, future work, etc.


Statistical Learning: What Does It Involve?

In general, (supervised) Statistical (Machine) Learning problems can typically be posed as

Y = f(X)

Problem: Estimate f from training data $\{(x_i, y_i)\}$, and then use it in general, i.e., on inputs beyond the training set.

Areas involved:

- Approximation theory - for picking a class of functions

- Optimization - for fitting the training data

- Computing - fitting and testing

- Probability and Statistics - testing, error estimation

Interesting Question: What is the difference between classical programming and statistical/machine learning?

- Classical Programming: f is an algorithm designed by a person

- Statistical Learning: f is discovered through examples by training


General Course Objectives

- Focus/motivation - emerging applications in:
  - Biology and Medicine
  - Information Technology, e.g., problems at Google, Facebook, Twitter, Amazon, etc.

- Learn fundamental concepts and techniques in statistical (machine) learning that are
  - Suitable for these application areas
  - Useful and applicable in general

- Develop the necessary knowledge as we go (e.g., Statistics, Optimization, Approximation Theory)

- Learn R

- Gain hands-on experience with a real, practical problem through a final project

Overall: Become an expert(!) in Statistical/Machine Learning


Programming in R: Computing Platform

- Language and environment for statistical computing and graphics

- Free software

- Download
  - R from http://cran.r-project.org/
  - RStudio, an Integrated Development Environment for R

- Resources
  - R for beginners
  - Quick-R
  - Cookbook for R
  - R for Data Science
  - Try R


Brief Statistics Review



The following numbers are particle (contamination) counts for a sample of 10 semiconductor silicon wafers:

50 48 44 56 61 52 53 55 67 51

Over a long run the process average for wafer particle counts has been 50 counts per wafer, and on the basis of the sample, we want to test whether a change has occurred.

- Are the data consistent with a given hypothesis?

- Idea: Data → scalar with a known distribution → likelihood

- Not a unique "transformation"



- A statistic is a property of sample data taken from a population

- A point estimate of some unknown parameter is a statistic that provides a best guess at the parameter value

- A point estimate $\hat\theta$ is unbiased if $E\hat\theta = \theta$

- $X_1, X_2, \dots, X_n$ - i.i.d. with mean $\mu$ and variance $\sigma^2$

- Examples
  - Sample mean
    \[ \bar X = \frac{1}{n} \sum_{i=1}^{n} X_i \]
  - Sample variance
    \[ S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar X)^2 \]

- Variability: $\mathrm{Var}(\bar X) = SE(\bar X)^2 = \sigma^2/n$, where SE is the standard error; in practice $SE(\bar X)^2 \approx S^2/n$
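As a small illustration of these definitions, the sketch below computes the sample mean, sample variance and estimated standard error for the wafer counts quoted earlier. The variable names are chosen here for illustration only.

```r
# Wafer particle counts quoted earlier in these slides
x <- c(50, 48, 44, 56, 61, 52, 53, 55, 67, 51)
n <- length(x)

x_bar <- sum(x) / n                      # sample mean
s2    <- sum((x - x_bar)^2) / (n - 1)    # sample variance (divides by n - 1)
se    <- sqrt(s2 / n)                    # estimated standard error of the mean

# Base R's mean() and var() use the same definitions
c(mean = x_bar, variance = s2, se = se)
```

The sample mean is 53.7, above the long-run average of 50; whether that gap is large relative to the standard error is the question the t-test addresses.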


Variability of estimates: Known variance

- If $X_1, \dots, X_n$ are i.i.d. normal, then
  - $\bar X$ is normal:
    \[ \frac{\bar X - \mu}{\sqrt{\sigma^2/n}} \sim N(0, 1) \]
  - $S^2$ has a known distribution:
    \[ \frac{n-1}{\sigma^2} S^2 \sim \chi^2_{n-1}, \]
    where $\chi^2_{n-1}$ (chi-square) is the distribution of the sum of $(n-1)$ squares of independent standard normal random variables
  - $\bar X$ and $S^2$ are independent

- ... if not, then CLT:
  \[ \frac{\bar X - \mu}{\sqrt{\sigma^2/n}} \Rightarrow N(0, 1) \]
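These distributional facts can be checked by simulation. The sketch below is one way to do it, with the sample size, mean, standard deviation and seed all chosen arbitrarily for illustration (none of them come from the slides).

```r
set.seed(1)                      # arbitrary seed, for reproducibility
n <- 10; mu <- 50; sigma <- 5    # illustrative values, not from the slides
reps <- 20000

sims <- replicate(reps, {
  x <- rnorm(n, mean = mu, sd = sigma)
  c(z = (mean(x) - mu) / sqrt(sigma^2 / n),   # standardized sample mean
    q = (n - 1) * var(x) / sigma^2)           # scaled sample variance
})

z <- sims["z", ]
q <- sims["q", ]

# A chi-square variable with k degrees of freedom has mean k and variance 2k,
# so q should have mean close to n - 1 and variance close to 2 * (n - 1)
round(c(mean_z = mean(z), var_z = var(z), mean_q = mean(q), var_q = var(q)), 2)

# Independence of the sample mean and S^2 implies a correlation near zero
cor(z, q)
```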


Variability of estimates: Unknown variance

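- If $X_1, \dots, X_n$ are i.i.d. normal, then
  - t-statistic:
    \[ \frac{\bar X - \mu}{\sqrt{S^2/n}} \sim \frac{N(0, 1)}{\sqrt{\chi^2_{n-1}/(n-1)}} \sim t_{n-1}, \]
    where $t_{n-1}$ is Student's t-distribution with $(n-1)$ degrees of freedom

- $t_n$: for independent $Z \sim N(0, 1)$ and $V \sim \chi^2_n$,
  \[ \frac{Z}{\sqrt{V/n}} \sim t_n \]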


t-distribution

- Zero mean
- Variance (n > 2): n/(n − 2)

[Figure: PDFs of t distributions for several degrees of freedom; horizontal axis: x value, from −4 to 4]
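A plot of this kind can be drawn with base R. The following is a sketch only: the particular degrees of freedom and colors are chosen here for illustration and are not necessarily those of the original figure. It also checks the variance formula numerically for n = 5.

```r
x  <- seq(-4, 4, length.out = 401)
df <- c(1, 2, 5, 30)             # illustrative choices of degrees of freedom

# Standard normal density as the reference curve
plot(x, dnorm(x), type = "l", lwd = 2, xlab = "x value", ylab = "density",
     main = "PDFs of t distributions")
for (k in seq_along(df)) lines(x, dt(x, df = df[k]), col = k + 1)
legend("topright", legend = c("normal", paste("t, df =", df)),
       col = c(1, seq_along(df) + 1), lty = 1, lwd = c(2, rep(1, length(df))))

# Numerical check of the variance formula n / (n - 2) for n = 5
t_var <- integrate(function(u) u^2 * dt(u, df = 5), -Inf, Inf)$value
c(numerical = t_var, formula = 5 / 3)
```

As the degrees of freedom grow, the t densities approach the standard normal curve, while small values give visibly heavier tails.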




- Null hypothesis $H_0: \mu = \mu_0$

- Under $H_0$, the t-statistic
  \[ t = \frac{\bar X - \mu_0}{\sqrt{S^2/n}} \sim t_{n-1} \]
  and the corresponding p-value is the probability of observing $|t_{n-1}|$ that is $\ge |t|$, i.e., $p = P[|t_{n-1}| \ge |t|]$.

- Large values of $|t|$ are unlikely under $H_0$

- Typically:
  - reject if p < 0.01
  - accept if p > 0.1
  - not sure if 0.01 ≤ p ≤ 0.1

  (Or, simply: if p < 0.05 → reject, if p ≥ 0.05 → accept)
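Applied to the wafer counts from the earlier slide, with $\mu_0 = 50$, the test can be sketched as follows. The manual computation follows the formulas above; the call to `t.test` is only a cross-check using R's built-in one-sample test.

```r
x   <- c(50, 48, 44, 56, 61, 52, 53, 55, 67, 51)  # wafer counts from the slides
mu0 <- 50                                          # long-run process average (H0)
n   <- length(x)

t_stat <- (mean(x) - mu0) / sqrt(var(x) / n)       # t-statistic under H0
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)         # P[|t_{n-1}| >= |t|]

# Cross-check against R's built-in one-sample t-test
tt <- t.test(x, mu = mu0)
c(t_manual = t_stat, t_builtin = unname(tt$statistic),
  p_manual = p_val, p_builtin = tt$p.value)
```

On these data the statistic is about 1.78 on 9 degrees of freedom and the two-sided p-value comes out slightly above 0.1, so by the rules of thumb above the sample does not give convincing evidence that the process average has changed.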


Intro to Statistical Learning


Supervised vs. unsupervised learning

- Supervised learning: there is an input-output relationship

  Y = f(X)

  - X - vector of p predictor measurements
  - Y - outcome measurements
  - Two problems:
    - Regression: Y is quantitative
    - Classification: Y is categorical
  - Training data (observations): $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$
  - Objectives:
    - Prediction
    - Inference

- Unsupervised learning: no outcome variable Y
  - Objective can be vague - just exploring data
  - Learn interesting phenomena in data, e.g.:
    - Clustering, community detection, data association, low-dimensional representation



- Let Y be the output variable, and X the input variables $X_1, X_2, \dots, X_p$. Then

  Y = f(X) + ε

- Want to estimate what f is

- ε is unavoidable noise that is independent of X and has zero mean

- How to estimate f from the data? How to evaluate the estimate?

- Given an estimate $\hat f$ for f, predict unavailable values of Y for known values of X: $\hat Y = \hat f(X)$

- Reducible and irreducible errors:
  - $\hat f$ is not exactly f, but f can potentially be learnt given enough data
  - even if f is known, there is error: ε = Y − f(X)
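The distinction between the two error types can be seen in a small simulation. Everything specific in this sketch is an assumption made for illustration: the linear "true" f, the noise level, the sample sizes and the seed.

```r
set.seed(2)
f <- function(x) 2 + 3 * x        # hypothetical "true" f, chosen for illustration
sigma <- 1                        # standard deviation of the noise

# Small training set used to estimate f
n_train <- 30
x_train <- runif(n_train)
train   <- data.frame(x = x_train,
                      y = f(x_train) + rnorm(n_train, sd = sigma))
f_hat   <- lm(y ~ x, data = train)

# Large independent test set
n_test <- 100000
x_test <- runif(n_test)
y_test <- f(x_test) + rnorm(n_test, sd = sigma)

# Predicting with the true f leaves only the irreducible error, close to sigma^2
mse_true_f <- mean((y_test - f(x_test))^2)
# Predicting with the estimate adds the reducible error on top
mse_f_hat  <- mean((y_test - predict(f_hat, data.frame(x = x_test)))^2)
c(irreducible = mse_true_f, with_estimate = mse_f_hat)
```

Increasing `n_train` would be expected to shrink the gap between the two numbers, while the first one stays near $\sigma^2$ no matter how much data is available.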


Two approaches to estimate f

- Parametric
  - Assume a specific form of f
  - Example: the linear model
    \[ f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p \]
  - Use training data to choose the values of parameters $\beta_0, \beta_1, \dots, \beta_p$
  - Pro: easier to estimate parameters than an arbitrary function
  - Con: the choice of f might be (very) wrong

- Non-parametric
  - Make the parametric form more flexible
  - This makes $\hat f$ more complex and potentially following the noise too closely, thereby overfitting
  - Get $\hat f$ as close as possible to the data points, subject to not being too non-smooth
  - Pro: more likely to get f right, especially if f is "strange"
  - Con: more data is needed to obtain a good estimate for f
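To make the contrast concrete, the sketch below fits a linear model and a local-regression smoother to data from a deliberately nonlinear function. The function, noise level, sample size and `span` value are all assumptions for illustration; `loess` is used here as one readily available non-parametric method, not as the method prescribed by the course.

```r
set.seed(3)
n <- 200
x <- sort(runif(n, 0, 10))
f <- function(x) sin(x) + 0.3 * x        # hypothetical nonlinear "true" f
y <- f(x) + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)

fit_lin   <- lm(y ~ x, data = dat)                    # parametric: linear model
fit_loess <- loess(y ~ x, data = dat, span = 0.3)     # non-parametric: local regression

# Error against the noiseless f on a grid well inside the observed range
grid <- data.frame(x = seq(0.5, 9.5, length.out = 500))
err_lin   <- mean((predict(fit_lin,   grid) - f(grid$x))^2)
err_loess <- mean((predict(fit_loess, grid) - f(grid$x))^2)
c(linear = err_lin, loess = err_loess)
```

Here the linear form is the wrong choice, so the flexible fit does better. With a truly linear f, or with far fewer observations, the comparison could easily go the other way.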


- More complicated models are not always better - e.g., overfitting

- Amount of available data

- Interpretability


Linear Regression



- Simple approach to supervised learning

- Assumes linear dependence of quantitative Y on $X_1, X_2, \dots, X_p$

- True regression functions are never linear!

- Extremely useful both conceptually and practically


Data set

- Will use Advertising.csv to illustrate concepts
- 200 observations












Advertising data set


[Figure: pairwise scatter plots of the variables in the Advertising data set, including Sales]


Single predictor: TV vs. Sales

> adv <- read.csv("advertising.csv", header=TRUE, sep=",")
> plot(adv$TV, adv$Sales, xlab="TV", ylab="Sales", col="red")

[Figure: scatter plot of Sales versus TV; TV axis from 0 to 300]

- Linear model
  \[ Y = \beta_0 + \beta_1 X + \varepsilon, \]
  where
  - $\beta_0$ and $\beta_1$: unknown constants/parameters/coefficients (intercept and slope)
  - $\varepsilon$: error term


Single predictor: Model selection

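- Estimate $\beta_0$ and $\beta_1$ based on data

- Given estimates $\hat\beta_0$ and $\hat\beta_1$, predict future sales using
  \[ \hat y = \hat\beta_0 + \hat\beta_1 x \]

- $\hat y$: prediction of Y given X = x

- Residuals: $y_i - \hat y_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i)$

- Select $\hat\beta_0$ and $\hat\beta_1$ to "minimize" residuals

- How to minimize a vector?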


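Need to Define Distance: Vector norms

- Example: $l_p$ norm
  \[ \|z\|_p = \left( \sum_{i=1}^{n} |z_i|^p \right)^{1/p} \]

- Example: 3 data points - {(0, 1), (1, 0), (2, 1)}
  The result depends on the choice of the norm (!)
  (the fitted line is parallel to the x-axis due to symmetry)

[Figure: fits to the three data points; x axis from 0.0 to 2.0]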
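For the three-point example, restricting attention to horizontal lines as the slide suggests, the best height under each norm can be worked out directly. The sketch below compares the $l_1$, $l_2$ and $l_\infty$ choices; the helper names `lp_norm` and `cost` are introduced here for illustration.

```r
y <- c(1, 0, 1)   # y-values of the points (0,1), (1,0), (2,1)

# l_p norm of a vector z; p = Inf gives the maximum norm
lp_norm <- function(z, p) {
  if (is.infinite(p)) max(abs(z)) else sum(abs(z)^p)^(1 / p)
}

# Norm of the residual vector for the horizontal line at height `level`
cost <- function(level, p) lp_norm(y - level, p)

# Known minimizers for a constant fit
fit_l1   <- median(y)        # l1: median
fit_l2   <- mean(y)          # l2: mean
fit_linf <- mean(range(y))   # l-infinity: midrange

# Numerical confirmation on a grid of candidate heights
grid <- seq(0, 1, by = 0.001)
grid_min <- sapply(c(1, 2, Inf), function(p) {
  grid[which.min(sapply(grid, cost, p = p))]
})
rbind(closed_form = c(fit_l1, fit_l2, fit_linf), grid_search = grid_min)
```

The three norms give three different horizontal lines, at heights 1, 2/3 and 1/2, which is the sense in which the result depends on the choice of norm.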








l2 regression: Least squares

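- $\min \|y - \hat y\|_2$

- Residual Sum of Squares (RSS):
  \[ \mathrm{RSS} \equiv \mathrm{RSS}(\hat\beta_0, \hat\beta_1) = \|y - \hat y\|_2^2 = \sum_{i=1}^{n} (y_i - \hat y_i)^2 \]

- Least squares approach: $\min_{\hat\beta_0, \hat\beta_1} \mathrm{RSS}$

- Solution:
  \[ \hat\beta_1 = \frac{\sum_{i=1}^{n} (y_i - \bar y)(x_i - \bar x)}{\sum_{i=1}^{n} (x_i - \bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x, \]
  where $\bar x = n^{-1} \sum_{i=1}^{n} x_i$ and $\bar y = n^{-1} \sum_{i=1}^{n} y_i$ are the sample means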
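The closed-form solution can be checked against `lm` on any data set. Since the Advertising file is not bundled with these notes, the sketch below uses simulated data; the coefficients, noise level and seed are assumptions chosen only to resemble the TV/Sales scale.

```r
set.seed(4)
n <- 50
x <- runif(n, 0, 300)                       # hypothetical predictor values
y <- 7 + 0.05 * x + rnorm(n, sd = 3)        # hypothetical true line plus noise

x_bar <- mean(x)
y_bar <- mean(y)
b1  <- sum((y - y_bar) * (x - x_bar)) / sum((x - x_bar)^2)   # slope estimate
b0  <- y_bar - b1 * x_bar                                     # intercept estimate
rss <- sum((y - (b0 + b1 * x))^2)                             # residual sum of squares

# lm() solves the same least-squares problem
fit <- lm(y ~ x)
rbind(closed_form = c(b0, b1), lm = unname(coef(fit)))
```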


Example

> lm1 <- lm(adv$Sales ~ adv$TV)
> summary(lm1)

Sales = 7.032594 + 0.047537 × TV

> plot(adv$TV, adv$Sales, xlab="TV", ylab="Sales", col="red", pch=20)
> abline(lm1, col="blue", lwd=2)
> Sales_Predict <- predict(lm1)
> segments(adv$TV, adv$Sales, adv$TV, Sales_Predict)

[Figure: Sales versus TV with the least-squares line and residual segments; TV axis from 0 to 300]








Example: l2 vs. l1

- One point in the data set modified

[Figure: two panels, "Original data set" and "Modified data set"; horizontal axis from 0 to 300 in each]





Coefficient estimates

- Suppose the true model is

  Sales = β0 + β1 × TV + ε

- How good are the estimates $\hat\beta_0$ and $\hat\beta_1$?

[Figure: two scatter-plot panels; horizontal axis from 0 to 300 in each]

i = 1, ..., 100: Sales = 7.241734 + 0.049069 × TV

i = 101, ..., 200: Sales = 6.803818 + 0.046135 × TV


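Properties of $\hat\beta_0$ and $\hat\beta_1$

- Repeated sampling

- $\hat\beta_0$ and $\hat\beta_1$ vary

- Means: $E\hat\beta_0 = \beta_0$ and $E\hat\beta_1 = \beta_1$

- Variances:
  \[ \mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar x)^2}, \qquad \mathrm{Var}(\hat\beta_0) = \sigma^2 \left( \frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^{n} (x_i - \bar x)^2} \right), \]
  where $\sigma^2 = \mathrm{Var}(\varepsilon)$

- An estimate of $\sigma^2$:
  \[ \mathrm{RSE}^2 = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat y_i)^2 = \frac{1}{n-2} \mathrm{RSS}, \]
  where RSE is the Residual Standard Error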
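Repeated sampling is easy to mimic by simulation. In the sketch below the true coefficients, noise level, design points and seed are all assumptions for illustration; the predictor values are kept fixed and only the noise is redrawn, so the empirical means and variances of the estimates can be compared with the formulas above.

```r
set.seed(5)
beta0 <- 7; beta1 <- 0.05; sigma <- 3      # hypothetical true model
n <- 100
x <- runif(n, 0, 300)                      # design kept fixed across repetitions
sxx <- sum((x - mean(x))^2)

reps <- 5000
est <- replicate(reps, {
  y <- beta0 + beta1 * x + rnorm(n, sd = sigma)
  fit <- lm(y ~ x)
  c(coef(fit), rse2 = sum(resid(fit)^2) / (n - 2))
})

# Rows of est: intercept estimate, slope estimate, RSE^2
empirical <- c(mean_b0 = mean(est[1, ]), mean_b1 = mean(est[2, ]),
               var_b0 = var(est[1, ]), var_b1 = var(est[2, ]),
               mean_rse2 = mean(est[3, ]))
theory <- c(beta0, beta1,
            sigma^2 * (1 / n + mean(x)^2 / sxx), sigma^2 / sxx,
            sigma^2)
rbind(empirical, theory)
```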


Confidence intervals

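- Normality assumption: $\varepsilon \sim N(0, \sigma^2)$

- t-statistic:
  \[ \frac{\hat\beta_1 - \beta_1}{SE(\hat\beta_1)} \sim t_{n-2}, \]
  where
  \[ SE(\hat\beta_1)^2 = \frac{1}{n-2} \cdot \frac{\sum_{i=1}^{n} (y_i - \hat y_i)^2}{\sum_{i=1}^{n} (x_i - \bar x)^2} \]

- $(1 - \gamma)$ confidence interval:
  \[ \left[ \hat\beta_1 - SE(\hat\beta_1) \cdot t_{\gamma/2, n-2},\; \hat\beta_1 + SE(\hat\beta_1) \cdot t_{\gamma/2, n-2} \right] \]
  is such that
  \[ P\left[ \beta_1 \in \left[ \hat\beta_1 - SE(\hat\beta_1) \cdot t_{\gamma/2, n-2},\; \hat\beta_1 + SE(\hat\beta_1) \cdot t_{\gamma/2, n-2} \right] \right] = 1 - \gamma, \]
  where $t_{\gamma/2, n-2}$ is the $(1 - \gamma/2)$-th quantile of the $t_{n-2}$ distribution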
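The interval can be computed by hand and compared with R's `confint`. The data below are simulated under assumed values (coefficients, noise level, seed), since the formulas apply to any simple linear fit.

```r
set.seed(6)
n <- 40
x <- runif(n, 0, 300)                          # hypothetical data
y <- 7 + 0.05 * x + rnorm(n, sd = 3)
fit <- lm(y ~ x)

gamma <- 0.05                                   # gives a 95% interval
b1    <- unname(coef(fit)["x"])
se_b1 <- sqrt(sum(resid(fit)^2) / (n - 2) / sum((x - mean(x))^2))
t_q   <- qt(1 - gamma / 2, df = n - 2)          # (1 - gamma/2) quantile of t_{n-2}

ci_manual <- c(b1 - se_b1 * t_q, b1 + se_b1 * t_q)
rbind(manual = ci_manual, confint = confint(fit, "x", level = 1 - gamma)[1, ])
```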



Hypothesis testing

- Typical testing (null vs. alternative hypothesis):

  $H_0$: there is no relationship between X and Y
  versus
  $H_1$: there is some relationship between X and Y

- Formally:
  \[ H_0: \beta_1 = 0 \quad \text{vs.} \quad H_1: \beta_1 \neq 0 \]

- To test $H_0$ ($\beta_1 = 0$), compute a t-statistic:
  \[ t = \frac{\hat\beta_1 - 0}{SE(\hat\beta_1)}, \]
  which, under $H_0$, is distributed according to a t-distribution with $(n-2)$ degrees of freedom

- Compute the p-value - the probability, assuming $\beta_1 = 0$, of observing any value equal to $|t|$ or larger in absolute value
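The test statistic and p-value reported by `summary` for the slope can be reproduced from these definitions. As before, the data are simulated under assumed values for illustration.

```r
set.seed(7)
n <- 40
x <- runif(n, 0, 300)                          # hypothetical data
y <- 7 + 0.05 * x + rnorm(n, sd = 3)
fit <- lm(y ~ x)

b1     <- unname(coef(fit)["x"])
se_b1  <- sqrt(sum(resid(fit)^2) / (n - 2) / sum((x - mean(x))^2))
t_stat <- (b1 - 0) / se_b1                     # t-statistic for H0: beta1 = 0
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)     # two-sided p-value

# Row of the coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
slope_row <- summary(fit)$coefficients["x", ]
slope_row
c(t_manual = t_stat, p_manual = p_val)
```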

> summary(lm1)

Call:
lm(formula = adv$Sales ~ adv$TV)

Residuals:
    Min      1Q  Median      3Q     Max
-8.3860 -1.9545 -0.1913  2.0671  7.2124

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.032594   0.457843   15.36   <2e-16 ***
adv$TV      0.047537   0.002691   17.67   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.259 on 198 degrees of freedom
Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16

> qt(0.975,198)
[1] 1.972017
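The quantile from `qt(0.975, 198)` is what is needed for a 95% confidence interval for the TV coefficient. As a worked example, the arithmetic can be done with the numbers printed in the summary; the variable names are illustrative.

```r
# Numbers copied from the summary(lm1) output above
b1    <- 0.047537      # estimated TV coefficient
se_b1 <- 0.002691      # its standard error
t_q   <- 1.972017      # qt(0.975, 198)

t_stat <- b1 / se_b1   # about 17.67, matching the "t value" column
ci     <- c(lower = b1 - t_q * se_b1, upper = b1 + t_q * se_b1)
round(ci, 4)           # roughly 0.0422 to 0.0528
```

The interval is far from zero, which agrees with the very small p-value reported for the TV coefficient.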



ISL: Read Chapter 2 and Section 3.1 in detail. Looking through all of Chapters 1-3 is also recommended.
