
Lecture 5: Clustering, Linear Regression

Reading: Chapter 10, Sections 3.1-2

STATS 202: Data mining and analysis

Sergio Bacallado, September 19, 2018

1 / 23


Announcements

- Starting next week, Julia Fukuyama will be having her office hours after business hours for SCPD students (through Skype). The times will be updated on the website as usual.

- Probability problems.

- Homework 2 will go out this afternoon. It’s a long one.

2 / 23


Hierarchical clustering

Most algorithms for hierarchical clustering are agglomerative.

[Figure: dendrogram of observations 1-9, with heights ranging from 0.0 to 3.0, alongside the same observations plotted in the (X1, X2) plane.]

The output of the algorithm is a dendrogram. We must be careful about how we interpret the dendrogram.

3 / 23


Notion of distance between clusters

At each step, we link the 2 clusters that are “closest” to each other.

Hierarchical clustering algorithms are classified according to the notion of distance between clusters.

Complete linkage: The distance between 2 clusters is the maximum distance between any pair of samples, one in each cluster.

4 / 23


Notion of distance between clusters

At each step, we link the 2 clusters that are “closest” to each other.

Hierarchical clustering algorithms are classified according to the notion of distance between clusters.

Single linkage: The distance between 2 clusters is the minimum distance between any pair of samples, one in each cluster.

4 / 23


Notion of distance between clusters

At each step, we link the 2 clusters that are “closest” to each other.

Hierarchical clustering algorithms are classified according to the notion of distance between clusters.

Average linkage: The distance between 2 clusters is the average of all pairwise distances.

4 / 23
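As an aside (not from the slides), the three linkage rules can be compared directly in R with hclust(); the toy data below are simulated, so only the shape of the comparison matters.

```r
set.seed(1)
# Simulated toy data: 9 observations in two dimensions
x <- matrix(rnorm(18), ncol = 2)
d <- dist(x)   # Euclidean distances between all pairs of observations

# Same distances, three notions of inter-cluster distance
hc_complete <- hclust(d, method = "complete")
hc_single   <- hclust(d, method = "single")
hc_average  <- hclust(d, method = "average")

par(mfrow = c(1, 3))
plot(hc_average,  main = "Average Linkage",  xlab = "", sub = "")
plot(hc_complete, main = "Complete Linkage", xlab = "", sub = "")
plot(hc_single,   main = "Single Linkage",   xlab = "", sub = "")
```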


Example

[Figure 10.12: dendrograms for the same data set under average, complete, and single linkage.]

5 / 23


Clustering is riddled with questions and choices

- Is clustering appropriate? i.e., could a sample belong to more than one cluster?
  - Mixture models, soft clustering, topic models.

- How many clusters are appropriate?
  - Choose subjectively, depending on the inference sought.
  - There are formal methods based on gap statistics, mixture models, etc.

- Are the clusters robust?
  - Run the clustering on different random subsets of the data. Is the structure preserved?
  - Try different clustering algorithms. Are the conclusions consistent?

- Most important: temper your conclusions.

6 / 23


Clustering is riddled with questions and choices

- Should we scale the variables before doing the clustering? (See the sketch below.)
  - Variables with larger variance have a larger effect on the Euclidean distance between two samples.

                   Area (acres)   Price (US$)   Number of houses
      Property 1        10          450,000            4
      Property 2         5          300,000            1

- Does Euclidean distance capture dissimilarity between samples?

7 / 23
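A minimal sketch of the scaling question, using made-up property rows like the ones in the table above; scale() standardizes each column to mean 0 and standard deviation 1, so no single variable dominates the Euclidean distance.

```r
# Hypothetical properties: area (acres), price (US$), number of houses
props <- rbind(property1 = c(10, 450000, 4),
               property2 = c( 5, 300000, 1))

dist(props)         # dominated by the price column, which has the largest variance
dist(scale(props))  # after standardizing each column, all three variables contribute
```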


Correlation distance

Example: Suppose that we want to cluster customers at a store for market segmentation.

- Samples are customers.

- Each variable corresponds to a specific product and measures the number of items bought by the customer during a year.

[Figure: number of items purchased plotted against Variable Index for Observations 1, 2, and 3.]

8 / 23


Correlation distance

- Euclidean distance would cluster all customers who purchase few things (orange and purple).

- Perhaps we want to cluster customers who purchase similar things (orange and teal).

- Then, the correlation distance may be a more appropriate measure of dissimilarity between samples (see the sketch below).

[Figure: the same purchase profiles as on the previous slide, plotted against Variable Index for Observations 1, 2, and 3.]

9 / 23
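A hedged sketch of this distinction in R; the purchase counts are invented, and since cor() works column-wise, the customers-by-products matrix is transposed before taking correlations between customers.

```r
set.seed(2)
base  <- rpois(20, lambda = 2)     # a purchase profile over 20 products
cust1 <- base                      # buys little
cust2 <- rpois(20, lambda = 2)     # also buys little, but different products
cust3 <- base * 10                 # buys the same products as cust1, in bulk
purchases <- rbind(cust1, cust2, cust3)

dist(purchases)                    # Euclidean: cust1 and cust2 look closest
as.dist(1 - cor(t(purchases)))     # correlation distance: cust1 and cust3 look closest
```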


Simple linear regression

[Figure 3.1: Sales plotted against the TV advertising budget.]

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma) \ \text{i.i.d.}$$

The estimates $\hat\beta_0$ and $\hat\beta_1$ are chosen to minimize the residual sum of squares (RSS):

$$\mathrm{RSS} = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2.$$

10 / 23
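As a quick sketch, lm() computes exactly these least squares estimates; the data below are a simulated stand-in for the TV/Sales example in Figure 3.1, not the actual Advertising data.

```r
set.seed(3)
tv    <- runif(200, 0, 300)
sales <- 7 + 0.05 * tv + rnorm(200, sd = 3)   # made-up "true" relationship

fit <- lm(sales ~ tv)
coef(fit)               # least squares estimates of beta0 and beta1
sum(residuals(fit)^2)   # the minimized RSS
```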


Estimates $\hat\beta_0$ and $\hat\beta_1$

A little calculus shows that the minimizers of the RSS are:

$$\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x.$$

11 / 23
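These closed-form estimates can be checked against lm() directly; this reuses the simulated tv/sales data from the previous sketch.

```r
x <- tv; y <- sales

beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat, beta1_hat)   # matches coef(fit) from the previous sketch
```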


Assessing the accuracy of $\hat\beta_0$ and $\hat\beta_1$

[Figure 3.3: simulated (X, Y) data illustrating the variability of the least squares fit.]

The standard errors for the parameters are:

$$SE(\hat\beta_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2} \right],$$

$$SE(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}.$$

The 95% confidence intervals:

$$\hat\beta_0 \pm 2 \cdot SE(\hat\beta_0), \qquad \hat\beta_1 \pm 2 \cdot SE(\hat\beta_1).$$

12 / 23
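In R these quantities come out of summary() and confint(); the last lines reproduce the slide's formula for SE(β̂1) on the simulated fit from before, plugging in the residual standard error as the estimate of σ.

```r
summary(fit)$coefficients[, "Std. Error"]   # SE of beta0_hat and beta1_hat
confint(fit, level = 0.95)                  # exact t-based 95% intervals

sigma_hat <- sigma(fit)                               # plug-in estimate of sigma
se_b1     <- sigma_hat / sqrt(sum((tv - mean(tv))^2)) # formula from this slide
coef(fit)["tv"] + c(-2, 2) * se_b1                    # the rough "+/- 2 SE" interval
```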


Hypothesis test

$H_0$: There is no relationship between X and Y.

$H_a$: There is some relationship between X and Y.

Equivalently, in terms of the slope:

$H_0$: $\beta_1 = 0$.

$H_a$: $\beta_1 \neq 0$.

Test statistic:

$$t = \frac{\hat\beta_1 - 0}{SE(\hat\beta_1)}.$$

Under the null hypothesis, this has a t-distribution with $n - 2$ degrees of freedom.

13 / 23
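summary() reports this t-statistic and its two-sided p-value for every coefficient; the hand computation below, on the simulated fit from earlier, just makes the formula explicit.

```r
est <- coef(summary(fit))["tv", "Estimate"]
se  <- coef(summary(fit))["tv", "Std. Error"]

t_stat <- (est - 0) / se
p_val  <- 2 * pt(abs(t_stat), df = length(tv) - 2, lower.tail = FALSE)
c(t_stat, p_val)   # matches the tv row of summary(fit)
```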


Interpreting the hypothesis test

- If we reject the null hypothesis, can we assume there is a linear relationship?
  - No. A quadratic relationship may be a better fit, for example.

- If we don’t reject the null hypothesis, can we assume there is no relationship between X and Y?
  - No. This test is only powerful against certain monotone alternatives. There could be more complex non-linear relationships.

14 / 23


Multiple linear regression

[Figure 3.4: observations of Y plotted against X1 and X2 with the least squares regression plane.]

$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon, \qquad \varepsilon \sim N(0, \sigma) \ \text{i.i.d.}$$

or, in matrix notation:

$$\mathbb{E}[\mathbf{y}] = \mathbf{X}\beta,$$

where $\mathbf{y} = (y_1, \dots, y_n)^T$, $\beta = (\beta_0, \dots, \beta_p)^T$, and $\mathbf{X}$ is our usual data matrix with an extra column of ones on the left to account for the intercept.

15 / 23


Multiple linear regression answers several questions

- Is at least one of the variables $X_i$ useful for predicting the outcome Y?

- Which subset of the predictors is most important?

- How good is a linear model for these data?

- Given a set of predictor values, what is a likely value for Y, and how accurate is this prediction?

16 / 23


The estimates $\hat\beta$

Our goal again is to minimize the RSS:

$$\mathrm{RSS} = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_{i,1} - \dots - \hat\beta_p x_{i,p})^2.$$

One can show that this is minimized by the vector $\hat\beta$:

$$\hat\beta = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}.$$

17 / 23
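A sketch verifying the matrix formula on simulated data with p = 2 predictors; cbind(1, ...) supplies the column of ones for the intercept, and solve() with crossprod() evaluates (X^T X)^{-1} X^T y.

```r
set.seed(4)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)   # made-up coefficients

X <- cbind(1, x1, x2)                              # design matrix with intercept column
beta_hat <- solve(crossprod(X), crossprod(X, y))   # (X'X)^{-1} X'y

cbind(beta_hat, coef(lm(y ~ x1 + x2)))             # the two solutions agree
```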


Which variables are important?

Consider the hypothesis:

$H_0$: The last q predictors have no relation with Y.

Let $\mathrm{RSS}_0$ be the residual sum of squares for the model which excludes these variables. The F-statistic is defined by:

$$F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS})/q}{\mathrm{RSS}/(n - p - 1)}.$$

Under the null hypothesis, this has an F-distribution.

Example: If $q = p$, we test whether any of the variables is important, and

$$\mathrm{RSS}_0 = \sum_{i=1}^n (y_i - \bar y)^2.$$

18 / 23


Which variables are important?

Consider the hypothesis:

$H_0$: $\beta_{p-q+1} = \beta_{p-q+2} = \dots = \beta_p = 0.$

Let $\mathrm{RSS}_0$ be the residual sum of squares for the model which excludes these variables. The F-statistic is defined by:

$$F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS})/q}{\mathrm{RSS}/(n - p - 1)}.$$

Under the null hypothesis, this has an F-distribution.

Example: If $q = p$, we test whether any of the variables is important, and

$$\mathrm{RSS}_0 = \sum_{i=1}^n (y_i - \bar y)^2.$$

18 / 23
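A hedged sketch of this F-test via anova(), comparing the full fit with a nested fit that drops the last q = 1 predictor; it reuses the simulated x1, x2, y from the previous sketch. The q = p case corresponds to the overall F-statistic printed by summary().

```r
fit_full    <- lm(y ~ x1 + x2)   # full model with p = 2 predictors
fit_reduced <- lm(y ~ x1)        # excludes the last q = 1 predictor

# anova() computes F = ((RSS0 - RSS)/q) / (RSS/(n - p - 1)) and its p-value
anova(fit_reduced, fit_full)
```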


Which variables are important?

A multiple linear regression in R has the following output:

19 / 23
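The R output shown in the lecture is not reproduced in this transcript; the call below generates the analogous table for the simulated fit_full from the previous sketch (the numbers are, of course, not those from the lecture).

```r
summary(fit_full)
# Prints one row per coefficient with Estimate, Std. Error, t value and Pr(>|t|),
# followed by the residual standard error, R-squared, and the overall F-statistic.
```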


Which variables are important?

The t-statistic associated with the $i$th predictor is the square root of the F-statistic for the null hypothesis which sets only $\beta_i = 0$.

A low p-value indicates that the predictor is important.

Warning: If there are many predictors, even under the null hypothesis, some of the t-tests will have low p-values.

20 / 23
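A small simulation, purely illustrative, of this warning: with 50 predictors that are pure noise, a few of the 50 t-tests still fall below 0.05 just by chance.

```r
set.seed(5)
n <- 100
noise_X <- matrix(rnorm(n * 50), n, 50)   # 50 predictors, all unrelated to the response
y_noise <- rnorm(n)

fit_noise <- lm(y_noise ~ noise_X)
pvals <- coef(summary(fit_noise))[-1, "Pr(>|t|)"]   # drop the intercept row
sum(pvals < 0.05)                                   # typically a handful of false "hits"
```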


How many variables are important?

When we select a subset of the predictors, we have $2^p$ choices.

A way to simplify the choice is to define a range of models with an increasing number of variables, then select the best.

- Forward selection: Starting from a null model, include variables one at a time, minimizing the RSS at each step (see the sketch after this slide).

- Backward selection: Starting from the full model, eliminate variables one at a time, choosing the one with the largest p-value at each step.

- Mixed selection: Starting from a null model, include variables one at a time, minimizing the RSS at each step. If the p-value for some variable goes beyond a threshold, eliminate that variable.

Choosing one model in the range produced is a form of tuning.

21 / 23
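A rough forward-selection sketch, for illustration only: at each step, add whichever remaining predictor most reduces the RSS, producing a nested sequence of models to choose among.

```r
forward_select <- function(y, X, max_vars = ncol(X)) {
  chosen    <- character(0)
  remaining <- colnames(X)
  path      <- list()
  for (k in seq_len(max_vars)) {
    # RSS of the current model augmented with each candidate variable
    rss <- sapply(remaining, function(v) {
      f <- lm(y ~ ., data = data.frame(X[, c(chosen, v), drop = FALSE]))
      sum(residuals(f)^2)
    })
    best      <- names(which.min(rss))
    chosen    <- c(chosen, best)
    remaining <- setdiff(remaining, best)
    path[[k]] <- chosen
  }
  path   # one model per size; pick among them by cross-validation, AIC, etc.
}

# Made-up example with 5 candidate predictors, only V3 truly relevant
set.seed(6)
Xc <- matrix(rnorm(100 * 5), 100, 5, dimnames = list(NULL, paste0("V", 1:5)))
yc <- 2 * Xc[, "V3"] + rnorm(100)
forward_select(yc, Xc, max_vars = 2)   # V3 should be selected first
```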


How good is the fit?

To assess the fit, we focus on the residuals.

- The RSS always decreases as we add more variables.

- The residual standard error (RSE) corrects for this (see the sketch below):

  $$\mathrm{RSE} = \sqrt{\frac{1}{n - p - 1}\,\mathrm{RSS}}.$$

- Visualizing the residuals can reveal phenomena that are not accounted for by the model, e.g. synergies or interactions:

[Figure: Sales plotted as a function of TV and Radio advertising budgets.]

22 / 23
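For completeness, the RSE formula on this slide can be evaluated by hand and compared with what R reports; this reuses fit_full from the earlier sketch, where n = 100 and p = 2.

```r
rss <- sum(residuals(fit_full)^2)
n   <- length(residuals(fit_full))
p   <- 2

sqrt(rss / (n - p - 1))   # equals sigma(fit_full), the residual standard error in summary()
```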


How good are the predictions?

The function predict in R outputs predictions from a linear model:

Confidence intervals reflect the uncertainty on β.

Prediction intervals reflect uncertainty on β and the irreducible error ε as well.

23 / 23
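A hedged sketch of both interval types with predict(), reusing the simulated fit_full from above; the new predictor values are made up.

```r
new_point <- data.frame(x1 = 0.5, x2 = -1)

predict(fit_full, newdata = new_point, interval = "confidence")  # uncertainty in beta_hat only
predict(fit_full, newdata = new_point, interval = "prediction")  # also adds the irreducible error
```

The prediction interval is always wider, since it also accounts for the noise ε around the regression surface.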

