
Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeoff

Reading: Chapter 2

STATS 202: Data mining and analysis

Sergio Bacallado
September 24, 2014


Supervised vs. unsupervised learning

In unsupervised learning we start with a data matrix: the rows are samples or units, and the columns are variables or factors.

The variables may be quantitative, e.g. weight, height, number of children, ..., or qualitative, e.g. college major, profession, gender, ...

Supervised vs. unsupervised learning

In unsupervised learning we start with a data matrix. Our goal is to:

- Find meaningful relationships between the variables or units: correlation analysis.
- Find low-dimensional representations of the data which make it easy to visualize the variables and units: PCA, ICA, Isomap, locally linear embeddings, etc.
- Find meaningful groupings of the data: clustering.

Unsupervised learning is also known in Statistics as exploratory data analysis; each of these goals is sketched in code below.
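To make the three goals concrete, here is a minimal sketch (not from the slides): the data matrix X is made up, and scikit-learn's PCA and KMeans stand in for the dimension-reduction and clustering methods named above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical data matrix: 100 samples (rows) by 5 variables (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Goal 1: relationships between variables, via the correlation matrix.
corr = np.corrcoef(X, rowvar=False)        # 5 x 5

# Goal 2: a low-dimensional representation, via the first 2 principal components.
Z = PCA(n_components=2).fit_transform(X)   # 100 x 2, easy to plot

# Goal 3: meaningful groupings, via k-means clustering.
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

print(corr.shape, Z.shape, np.bincount(labels))
```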

Supervised vs. unsupervised learning

In supervised learning, there are input variables and output variables: some columns of the data matrix are input variables, and one column is the output variable.

- If the output is quantitative, we say this is a regression problem.
- If the output is qualitative, we say this is a classification problem.

Supervised vs. unsupervised learning

If X is the vector of inputs for a particular sample, the output variable is modeled by

\[ Y = f(X) + \underbrace{\varepsilon}_{\text{random error}} \]

Our goal is to learn the function f, using a set of training samples.
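The model is easy to make concrete by simulation. A minimal sketch, assuming a made-up "true" function f and Gaussian noise (both hypothetical; in real problems f is unknown):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # A hypothetical true regression function; unknown in practice.
    return np.sin(2 * x) + 0.5 * x

n = 200
x = rng.uniform(0, 5, size=n)          # inputs X
eps = rng.normal(scale=0.3, size=n)    # random error term
y = f(x) + eps                         # outputs Y = f(X) + error
```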

Supervised vs. unsupervised learning

\[ Y = f(X) + \underbrace{\varepsilon}_{\text{random error}} \]

Motivations:

- Prediction: useful when the input variables are readily available, but the output variable is not.

  Example: predict stock prices next month using data from last year.

- Inference: a model for f can help us understand the structure of the data: which variables influence the output, and which don't? What is the relationship between each variable and the output, e.g. linear or non-linear?

  Example: what is the influence of genetic variations on the incidence of heart disease?

Parametric and nonparametric methods

There are two kinds of supervised learning methods, both sketched in code below:

- Parametric methods: we assume that f takes a specific form, for example a linear form

  \[ f(X) = X_1\beta_1 + \cdots + X_p\beta_p \]

  with parameters β₁, ..., βₚ. Using the training data, we try to fit the parameters.

- Non-parametric methods: we don't make any assumptions on the form of f, but we restrict how "wiggly" or "rough" the function can be.
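A minimal sketch of the contrast, reusing the simulated data above: ordinary least squares as the parametric method, and k-nearest-neighbors regression as a simple non-parametric method. The slides do not prescribe these particular methods; KNN here stands in for any fit whose roughness is controlled by a tuning knob.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=200)
y = np.sin(2 * x) + 0.5 * x + rng.normal(scale=0.3, size=200)
X = x.reshape(-1, 1)   # scikit-learn expects a 2-D design matrix

# Parametric: assume f(X) = beta_0 + beta_1 * X and fit the two parameters.
lin = LinearRegression().fit(X, y)

# Non-parametric: no assumed form for f; smaller n_neighbors allows a wigglier fit.
knn = KNeighborsRegressor(n_neighbors=10).fit(X, y)

x_grid = np.linspace(0, 5, 100).reshape(-1, 1)
print(lin.predict(x_grid)[:3])
print(knn.predict(x_grid)[:3])
```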

Parametric vs. nonparametric prediction

[Figures 2.4 and 2.5: Income as a function of Years of Education and Seniority, fit by a parametric (linear) surface and by a more flexible non-parametric surface.]

Parametric methods have a ceiling on fit quality imposed by the assumed form. Non-parametric methods keep improving as we add more data to fit.

Parametric methods are often simpler to interpret.

Prediction error

Training data: (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ). Predicted function: f̂.

Our goal in supervised learning is to minimize the prediction error. For regression models, this is typically the Mean Squared Error:

\[ \mathrm{MSE}(\hat f) = E\big(y_0 - \hat f(x_0)\big)^2. \]

Unfortunately, this quantity cannot be computed, because we don't know the joint distribution of (X, Y). We can compute a sample average using the training data; this is known as the training MSE:

\[ \mathrm{MSE}_{\mathrm{training}}(\hat f) = \frac{1}{n}\sum_{i=1}^{n} \big(y_i - \hat f(x_i)\big)^2. \]
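The training MSE is just a sample average of squared residuals. A minimal sketch (the function name and toy numbers are ours):

```python
import numpy as np

def mse(y, y_hat):
    # (1/n) * sum over i of (y_i - f_hat(x_i))^2
    return float(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))

y_train     = np.array([3.0, 1.5, 2.2, 0.7])
predictions = np.array([2.8, 1.9, 2.0, 0.9])
print(mse(y_train, predictions))   # training MSE: about 0.07
```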

Prediction error

The main challenge of statistical learning is that a low training MSE does not imply a low MSE.

If we have test data {(x′ᵢ, y′ᵢ); i = 1, ..., m} which were not used to fit the model, a better measure of quality for f̂ is the test MSE:

\[ \mathrm{MSE}_{\mathrm{test}}(\hat f) = \frac{1}{m}\sum_{i=1}^{m} \big(y_i' - \hat f(x_i')\big)^2. \]
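A minimal sketch of this point under our simulated setup: a maximally flexible fit (1-nearest-neighbor, our choice, not the slides') drives the training MSE to essentially zero while the test MSE stays much larger.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=300)
y = np.sin(2 * x) + 0.5 * x + rng.normal(scale=0.3, size=300)
X = x.reshape(-1, 1)

# Hold out a test set that plays no role in fitting.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

# An extremely flexible fit: 1-NN interpolates its training data.
knn = KNeighborsRegressor(n_neighbors=1).fit(X_tr, y_tr)

mse_train = np.mean((y_tr - knn.predict(X_tr)) ** 2)   # essentially zero
mse_test  = np.mean((y_te - knn.predict(X_te)) ** 2)   # noticeably larger
print(mse_train, mse_test)
```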

Figure 2.9.

[Left panel: Y against X. The circles are simulated data from the black curve; in this artificial example, we know what f is. Three estimates f̂ are shown: 1. linear regression; 2. splines (very smooth); 3. splines (quite rough). Right panel: Mean Squared Error against Flexibility; red line: test MSE, gray line: training MSE.]

Figure 2.10

[The same display as Figure 2.9, but the function f is now almost linear.]

Figure 2.11

[The same display; when the noise ε has small variance, the third (rough) method does well.]

The bias-variance decomposition

Let x₀ be a fixed test point, y₀ = f(x₀) + ε₀, and let f̂ be estimated from n training samples (x₁, y₁), ..., (xₙ, yₙ).

Let E denote the expectation over y₀ and the training outputs (y₁, ..., yₙ). Then, the Mean Squared Error at x₀ can be decomposed:

\[ \mathrm{MSE}(x_0) = E\big(y_0 - \hat f(x_0)\big)^2 = \mathrm{Var}\big(\hat f(x_0)\big) + \big[\mathrm{Bias}\big(\hat f(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon_0). \]

- Var(ε₀) is the irreducible error.
- Var(f̂(x₀)) = E[f̂(x₀) − E f̂(x₀)]² is the variance of the estimate of Y: it measures how much the estimate of f at x₀ changes when we sample new training data.
- [Bias(f̂(x₀))]² = [E(f̂(x₀)) − f(x₀)]² is the squared bias of the estimate of Y: it measures the deviation of the average prediction E f̂(x₀) from the truth f(x₀).
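In a simulation, where f is known, each term of the decomposition can be estimated empirically: refit f̂ on many fresh training sets and track its prediction at x₀. A minimal sketch, reusing our toy f and a KNN fit (both our own choices):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * x) + 0.5 * x   # known truth in this simulation only
x0, sigma, n = 2.5, 0.3, 100

# Estimate f on many independent training sets; record f_hat(x0) each time.
preds = []
for _ in range(500):
    x = rng.uniform(0, 5, size=n)
    y = f(x) + rng.normal(scale=sigma, size=n)
    fit = KNeighborsRegressor(n_neighbors=5).fit(x.reshape(-1, 1), y)
    preds.append(fit.predict([[x0]])[0])
preds = np.array(preds)

variance = preds.var()                    # Var(f_hat(x0))
bias_sq = (preds.mean() - f(x0)) ** 2     # [Bias(f_hat(x0))]^2
print(variance, bias_sq, sigma ** 2)      # third term: irreducible error Var(eps_0)
```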


Implications of the bias-variance decomposition

\[ \mathrm{MSE}(x_0) = E\big(y_0 - \hat f(x_0)\big)^2 = \mathrm{Var}\big(\hat f(x_0)\big) + \big[\mathrm{Bias}\big(\hat f(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon_0). \]

- The MSE is always positive.
- Each term on the right-hand side is non-negative.
- Therefore, typically, when we decrease the bias beyond some point, we increase the variance, and vice versa.

More flexibility ⇐⇒ Higher variance ⇐⇒ Lower bias.
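The tradeoff can be traced numerically by sweeping a flexibility knob. A minimal sketch, using polynomial degree as the knob (our stand-in for the spline flexibility in the figures): the training MSE falls as the degree grows, while the test MSE is typically U-shaped.

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(2 * x) + 0.5 * x
x_tr = rng.uniform(0, 5, size=50); y_tr = f(x_tr) + rng.normal(scale=0.5, size=50)
x_te = rng.uniform(0, 5, size=50); y_te = f(x_te) + rng.normal(scale=0.5, size=50)

for degree in (1, 3, 5, 10, 15):
    coefs = np.polyfit(x_tr, y_tr, degree)   # fit a polynomial of this degree
    mse_tr = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    print(f"degree {degree:2d}: train {mse_tr:.3f}, test {mse_te:.3f}")
```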

Figure 2.12

[Three panels plotting MSE, squared bias (Bias), and variance (Var) against Flexibility, for: squiggly f with high noise; linear f with high noise; squiggly f with low noise.]

Classification problems

In a classification setting, the output takes values in a discrete set.

For example, if we are predicting the brand of a car based on a number of variables, the function f takes values in the set {Ford, Toyota, Mercedes-Benz, ...}.

The model
\[ Y = f(X) + \varepsilon \]
becomes insufficient, as Y is not necessarily real-valued.

We will use slightly different notation:

P(X, Y): joint distribution of (X, Y),
P(Y | X): conditional distribution of Y given X,
ŷᵢ: prediction for xᵢ.

Loss function for classification

There are many ways to measure the error of a classification prediction. One of the most common is the 0-1 loss:

\[ E\big(\mathbf{1}(\hat y_0 \neq y_0)\big) \]

Like the MSE, this quantity can be estimated from training and test data by taking a sample average:

\[ \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}(\hat y_i \neq y_i) \]
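The sample version of the 0-1 loss is just the misclassification rate. A minimal sketch (the function name and labels are ours):

```python
import numpy as np

def zero_one_loss(y, y_hat):
    # (1/n) * sum over i of the indicator 1(y_hat_i != y_i)
    return float(np.mean(np.asarray(y) != np.asarray(y_hat)))

y_true = np.array(["Ford", "Toyota", "Ford", "Mercedes-Benz"])
y_pred = np.array(["Ford", "Ford",   "Ford", "Mercedes-Benz"])
print(zero_one_loss(y_true, y_pred))   # 0.25: one of four predictions is wrong
```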

Bayes classifier

[Figure 2.13: simulated data from two classes, plotted against X1 and X2.]

In practice, we never know the joint probability P. However, we can assume that it exists.

The Bayes classifier assigns:

\[ \hat y_i = \arg\max_j P(Y = j \mid X = x_i) \]

It can be shown that this is the best classifier under the 0-1 loss.

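On simulated data the joint distribution P is known, so the Bayes classifier can be written down exactly. A minimal sketch with two Gaussian classes and equal priors (our own toy distribution, not the one behind Figure 2.13); with equal priors, argmax_j P(Y = j | X = x) reduces to picking the class with the larger density at x.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy joint distribution: Y is 0 or 1 with equal probability, and
# X | Y = j is Gaussian with mean mu_j and identity covariance.
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.5, 1.5])
p0 = multivariate_normal(mean=mu0, cov=np.eye(2))
p1 = multivariate_normal(mean=mu1, cov=np.eye(2))

def bayes_classifier(x):
    # argmax_j P(Y = j | X = x); with equal priors, compare the densities.
    return int(p1.pdf(x) > p0.pdf(x))

rng = np.random.default_rng(5)
labels = rng.integers(0, 2, size=1000)
X = np.where(labels[:, None] == 1, mu1, mu0) + rng.normal(size=(1000, 2))

preds = np.array([bayes_classifier(x) for x in X])
print(np.mean(preds != labels))   # estimate of the Bayes error rate under 0-1 loss
```

Because the two class-conditional densities are symmetric Gaussians with equal priors, this rule amounts to picking the nearer mean, and the printed error estimates the best achievable 0-1 loss for this toy distribution.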