
Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeoff

Reading: Chapter 2

STATS 202: Data mining and analysis

Sergio Bacallado
September 24, 2014


Supervised vs. unsupervised learning

In unsupervised learning we start with a data matrix: the rows are samples or units, and the columns are variables or factors.

The variables may be quantitative, e.g. weight, height, number of children, ..., or qualitative, e.g. college major, profession, gender, ...

Supervised vs. unsupervised learning

In unsupervised learning we start with a data matrix. Our goal is to:

- Find meaningful relationships between the variables or units: correlation analysis.
- Find low-dimensional representations of the data which make it easy to visualize the variables and units: PCA, ICA, Isomap, locally linear embeddings, etc.
- Find meaningful groupings of the data: clustering.

Unsupervised learning is also known in Statistics as exploratory data analysis; each of these goals is sketched in code below.
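To make the three goals concrete, here is a minimal sketch (not from the slides): the data matrix X is made up, and scikit-learn's PCA and KMeans stand in for the dimension-reduction and clustering methods named above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical data matrix: 100 samples (rows) by 5 variables (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Goal 1: relationships between variables, via the correlation matrix.
corr = np.corrcoef(X, rowvar=False)        # 5 x 5

# Goal 2: a low-dimensional representation, via the first 2 principal components.
Z = PCA(n_components=2).fit_transform(X)   # 100 x 2, easy to plot

# Goal 3: meaningful groupings, via k-means clustering.
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

print(corr.shape, Z.shape, np.bincount(labels))
```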

Supervised vs. unsupervised learning

In supervised learning, there are input variables and output variables: some columns of the data matrix are input variables, and one column is the output variable.

- If the output is quantitative, we say this is a regression problem.
- If the output is qualitative, we say this is a classification problem.

Supervised vs. unsupervised learning

If X is the vector of inputs for a particular sample, the output variable is modeled by

\[ Y = f(X) + \underbrace{\varepsilon}_{\text{random error}} \]

Our goal is to learn the function f, using a set of training samples.
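The model is easy to make concrete by simulation. A minimal sketch, assuming a made-up "true" function f and Gaussian noise (both hypothetical; in real problems f is unknown):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # A hypothetical true regression function; unknown in practice.
    return np.sin(2 * x) + 0.5 * x

n = 200
x = rng.uniform(0, 5, size=n)          # inputs X
eps = rng.normal(scale=0.3, size=n)    # random error term
y = f(x) + eps                         # outputs Y = f(X) + error
```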

Supervised vs. unsupervised learning

\[ Y = f(X) + \underbrace{\varepsilon}_{\text{random error}} \]

Motivations:

- Prediction: useful when the input variables are readily available, but the output variable is not.

  Example: predict stock prices next month using data from last year.

- Inference: a model for f can help us understand the structure of the data: which variables influence the output, and which don't? What is the relationship between each variable and the output, e.g. linear or non-linear?

  Example: what is the influence of genetic variations on the incidence of heart disease?

Parametric and nonparametric methods

There are two kinds of supervised learning methods, both sketched in code below:

- Parametric methods: we assume that f takes a specific form, for example a linear form

  \[ f(X) = X_1\beta_1 + \cdots + X_p\beta_p \]

  with parameters β₁, ..., βₚ. Using the training data, we try to fit the parameters.

- Non-parametric methods: we don't make any assumptions on the form of f, but we restrict how "wiggly" or "rough" the function can be.
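A minimal sketch of the contrast, reusing the simulated data above: ordinary least squares as the parametric method, and k-nearest-neighbors regression as a simple non-parametric method. The slides do not prescribe these particular methods; KNN here stands in for any fit whose roughness is controlled by a tuning knob.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=200)
y = np.sin(2 * x) + 0.5 * x + rng.normal(scale=0.3, size=200)
X = x.reshape(-1, 1)   # scikit-learn expects a 2-D design matrix

# Parametric: assume f(X) = beta_0 + beta_1 * X and fit the two parameters.
lin = LinearRegression().fit(X, y)

# Non-parametric: no assumed form for f; smaller n_neighbors allows a wigglier fit.
knn = KNeighborsRegressor(n_neighbors=10).fit(X, y)

x_grid = np.linspace(0, 5, 100).reshape(-1, 1)
print(lin.predict(x_grid)[:3])
print(knn.predict(x_grid)[:3])
```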

Parametric vs. nonparametric prediction

[Figures 2.4 and 2.5: Income as a function of Years of Education and Seniority, fit by a parametric (linear) surface and by a more flexible non-parametric surface.]

Parametric methods have a ceiling on fit quality imposed by the assumed form. Non-parametric methods keep improving as we add more data to fit.

Parametric methods are often simpler to interpret.

Prediction error

Training data: (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ). Predicted function: f̂.

Our goal in supervised learning is to minimize the prediction error. For regression models, this is typically the Mean Squared Error:

\[ \mathrm{MSE}(\hat f) = E\big(y_0 - \hat f(x_0)\big)^2. \]

Unfortunately, this quantity cannot be computed, because we don't know the joint distribution of (X, Y). We can compute a sample average using the training data; this is known as the training MSE:

\[ \mathrm{MSE}_{\mathrm{training}}(\hat f) = \frac{1}{n}\sum_{i=1}^{n} \big(y_i - \hat f(x_i)\big)^2. \]
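The training MSE is just a sample average of squared residuals. A minimal sketch (the function name and toy numbers are ours):

```python
import numpy as np

def mse(y, y_hat):
    # (1/n) * sum over i of (y_i - f_hat(x_i))^2
    return float(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))

y_train     = np.array([3.0, 1.5, 2.2, 0.7])
predictions = np.array([2.8, 1.9, 2.0, 0.9])
print(mse(y_train, predictions))   # training MSE: about 0.07
```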

Prediction error

The main challenge of statistical learning is that a low training MSE does not imply a low MSE.

If we have test data {(x′ᵢ, y′ᵢ); i = 1, ..., m} which were not used to fit the model, a better measure of quality for f̂ is the test MSE:

\[ \mathrm{MSE}_{\mathrm{test}}(\hat f) = \frac{1}{m}\sum_{i=1}^{m} \big(y_i' - \hat f(x_i')\big)^2. \]
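A minimal sketch of this point under our simulated setup: a maximally flexible fit (1-nearest-neighbor, our choice, not the slides') drives the training MSE to essentially zero while the test MSE stays much larger.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=300)
y = np.sin(2 * x) + 0.5 * x + rng.normal(scale=0.3, size=300)
X = x.reshape(-1, 1)

# Hold out a test set that plays no role in fitting.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

# An extremely flexible fit: 1-NN interpolates its training data.
knn = KNeighborsRegressor(n_neighbors=1).fit(X_tr, y_tr)

mse_train = np.mean((y_tr - knn.predict(X_tr)) ** 2)   # essentially zero
mse_test  = np.mean((y_te - knn.predict(X_te)) ** 2)   # noticeably larger
print(mse_train, mse_test)
```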

Figure 2.9.

[Left panel: Y against X. The circles are simulated data from the black curve; in this artificial example, we know what f is. Three estimates f̂ are shown: 1. linear regression; 2. splines (very smooth); 3. splines (quite rough). Right panel: Mean Squared Error against Flexibility; red line: test MSE, gray line: training MSE.]

Figure 2.10

[The same display as Figure 2.9, but the function f is now almost linear.]

Figure 2.11

[The same display; when the noise ε has small variance, the third (rough) method does well.]

The bias-variance decomposition

Let x₀ be a fixed test point, y₀ = f(x₀) + ε₀, and let f̂ be estimated from n training samples (x₁, y₁), ..., (xₙ, yₙ).

Let E denote the expectation over y₀ and the training outputs (y₁, ..., yₙ). Then, the Mean Squared Error at x₀ can be decomposed:

\[ \mathrm{MSE}(x_0) = E\big(y_0 - \hat f(x_0)\big)^2 = \mathrm{Var}\big(\hat f(x_0)\big) + \big[\mathrm{Bias}\big(\hat f(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon_0). \]

- Var(ε₀) is the irreducible error.
- Var(f̂(x₀)) = E[f̂(x₀) − E f̂(x₀)]² is the variance of the estimate of Y: it measures how much the estimate of f at x₀ changes when we sample new training data.
- [Bias(f̂(x₀))]² = [E(f̂(x₀)) − f(x₀)]² is the squared bias of the estimate of Y: it measures the deviation of the average prediction E f̂(x₀) from the truth f(x₀).
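In a simulation, where f is known, each term of the decomposition can be estimated empirically: refit f̂ on many fresh training sets and track its prediction at x₀. A minimal sketch, reusing our toy f and a KNN fit (both our own choices):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * x) + 0.5 * x   # known truth in this simulation only
x0, sigma, n = 2.5, 0.3, 100

# Estimate f on many independent training sets; record f_hat(x0) each time.
preds = []
for _ in range(500):
    x = rng.uniform(0, 5, size=n)
    y = f(x) + rng.normal(scale=sigma, size=n)
    fit = KNeighborsRegressor(n_neighbors=5).fit(x.reshape(-1, 1), y)
    preds.append(fit.predict([[x0]])[0])
preds = np.array(preds)

variance = preds.var()                    # Var(f_hat(x0))
bias_sq = (preds.mean() - f(x0)) ** 2     # [Bias(f_hat(x0))]^2
print(variance, bias_sq, sigma ** 2)      # third term: irreducible error Var(eps_0)
```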


Implications of the bias-variance decomposition

\[ \mathrm{MSE}(x_0) = E\big(y_0 - \hat f(x_0)\big)^2 = \mathrm{Var}\big(\hat f(x_0)\big) + \big[\mathrm{Bias}\big(\hat f(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon_0). \]

- The MSE is always positive.
- Each term on the right-hand side is non-negative.
- Therefore, typically, when we decrease the bias beyond some point, we increase the variance, and vice versa.

More flexibility ⇐⇒ Higher variance ⇐⇒ Lower bias.
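The tradeoff can be traced numerically by sweeping a flexibility knob. A minimal sketch, using polynomial degree as the knob (our stand-in for the spline flexibility in the figures): the training MSE falls as the degree grows, while the test MSE is typically U-shaped.

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(2 * x) + 0.5 * x
x_tr = rng.uniform(0, 5, size=50); y_tr = f(x_tr) + rng.normal(scale=0.5, size=50)
x_te = rng.uniform(0, 5, size=50); y_te = f(x_te) + rng.normal(scale=0.5, size=50)

for degree in (1, 3, 5, 10, 15):
    coefs = np.polyfit(x_tr, y_tr, degree)   # fit a polynomial of this degree
    mse_tr = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    print(f"degree {degree:2d}: train {mse_tr:.3f}, test {mse_te:.3f}")
```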

Figure 2.12

[Three panels plotting MSE, squared bias (Bias), and variance (Var) against Flexibility, for: squiggly f with high noise; linear f with high noise; squiggly f with low noise.]

Classification problems

In a classification setting, the output takes values in a discrete set.

For example, if we are predicting the brand of a car based on a number of variables, the function f takes values in the set {Ford, Toyota, Mercedes-Benz, ...}.

The model
\[ Y = f(X) + \varepsilon \]
becomes insufficient, as Y is not necessarily real-valued.

We will use slightly different notation:

P(X, Y): joint distribution of (X, Y),
P(Y | X): conditional distribution of Y given X,
ŷᵢ: prediction for xᵢ.

Loss function for classification

There are many ways to measure the error of a classification prediction. One of the most common is the 0-1 loss:

\[ E\big(\mathbf{1}(\hat y_0 \neq y_0)\big) \]

Like the MSE, this quantity can be estimated from training and test data by taking a sample average:

\[ \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}(\hat y_i \neq y_i) \]
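The sample version of the 0-1 loss is just the misclassification rate. A minimal sketch (the function name and labels are ours):

```python
import numpy as np

def zero_one_loss(y, y_hat):
    # (1/n) * sum over i of the indicator 1(y_hat_i != y_i)
    return float(np.mean(np.asarray(y) != np.asarray(y_hat)))

y_true = np.array(["Ford", "Toyota", "Ford", "Mercedes-Benz"])
y_pred = np.array(["Ford", "Ford",   "Ford", "Mercedes-Benz"])
print(zero_one_loss(y_true, y_pred))   # 0.25: one of four predictions is wrong
```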

Bayes classifier

[Figure 2.13: simulated data from two classes, plotted against X1 and X2.]

In practice, we never know the joint probability P. However, we can assume that it exists.

The Bayes classifier assigns:

\[ \hat y_i = \arg\max_j P(Y = j \mid X = x_i) \]

It can be shown that this is the best classifier under the 0-1 loss.

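On simulated data the joint distribution P is known, so the Bayes classifier can be written down exactly. A minimal sketch with two Gaussian classes and equal priors (our own toy distribution, not the one behind Figure 2.13); with equal priors, argmax_j P(Y = j | X = x) reduces to picking the class with the larger density at x.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy joint distribution: Y is 0 or 1 with equal probability, and
# X | Y = j is Gaussian with mean mu_j and identity covariance.
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.5, 1.5])
p0 = multivariate_normal(mean=mu0, cov=np.eye(2))
p1 = multivariate_normal(mean=mu1, cov=np.eye(2))

def bayes_classifier(x):
    # argmax_j P(Y = j | X = x); with equal priors, compare the densities.
    return int(p1.pdf(x) > p0.pdf(x))

rng = np.random.default_rng(5)
labels = rng.integers(0, 2, size=1000)
X = np.where(labels[:, None] == 1, mu1, mu0) + rng.normal(size=(1000, 2))

preds = np.array([bayes_classifier(x) for x in X])
print(np.mean(preds != labels))   # estimate of the Bayes error rate under 0-1 loss
```

Because the two class-conditional densities are symmetric Gaussians with equal priors, this rule amounts to picking the nearer mean, and the printed error estimates the best achievable 0-1 loss for this toy distribution.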