Page 1: CSC 411: Lecture 02: Linear Regression (bonner/courses/2019f/csc411/lectures/02_regression.pdf)

CSC 411: Lecture 02: Linear Regression

Richard Zemel, Raquel Urtasun and Sanja Fidler

University of Toronto

(Most plots in this lecture are from Bishop’s book)


Pages 2-6: Problems for Today

What should I watch this Friday?

Goal: Predict movie rating automatically!

Goal: How many followers will I get?

Goal: Predict the price of the house

Pages 7-15: Regression

What do all these problems have in common?

- Continuous outputs, we'll call these t
  (e.g., a rating: a real number between 0-10, # of followers, house price)

Predicting continuous outputs is called regression

What do I need in order to predict these outputs?

- Features (inputs), we'll call these x (or a vector x)
- Training examples, many x^{(i)} for which t^{(i)} is known (e.g., many movies for which we know the rating)
- A model, a function that represents the relationship between x and t
- A loss or a cost or an objective function, which tells us how well our model approximates the training examples
- Optimization, a way of finding the parameters of our model that minimizes the loss function

Page 16: Today: Linear Regression

Linear regression
- continuous outputs
- simple model (linear)

Introduce key concepts:
- loss functions
- generalization
- optimization
- model complexity
- regularization

Pages 17-20: Simple 1-D regression

Circles are data points (i.e., training examples) that are given to us

The data points are uniform in x, but may be displaced in y:

    t(x) = f(x) + \epsilon

with \epsilon some noise

In green is the "true" curve that we don't know

Goal: We want to fit a curve to these points
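
To make this setup concrete, here is a minimal NumPy sketch that generates data of this form; the choice of true curve f(x) = sin(2*pi*x), the number of points, and the noise level are illustrative assumptions, not taken from the slides.

import numpy as np

def f(x):
    # a stand-in for the unknown "true" curve (the slides only call it f)
    return np.sin(2 * np.pi * x)

rng = np.random.default_rng(0)
N = 10
x = np.linspace(0.0, 1.0, N)            # inputs spaced uniformly in x
eps = 0.2 * rng.standard_normal(N)      # the noise term epsilon
t = f(x) + eps                          # observed targets: t(x) = f(x) + epsilon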

Pages 21-24: Simple 1-D regression

Key Questions:
- How do we parametrize the model?
- What loss (objective) function should we use to judge the fit?
- How do we optimize fit to unseen test data (generalization)?

Pages 25-28: Example: Boston Housing data

Estimate median house price in a neighborhood based on neighborhood statistics

Look at first possible attribute (feature): per capita crime rate

Use this to predict house prices in other neighborhoods

Is this a good input (attribute) to predict house prices?

Pages 29-35: Represent the Data

Data is described as pairs D = {(x^{(1)}, t^{(1)}), ..., (x^{(N)}, t^{(N)})}
- x \in R is the input feature (per capita crime rate)
- t \in R is the target output (median house price)
- (i) simply indicates the training examples (we have N in this case)

Here t is continuous, so this is a regression problem

Model outputs y, an estimate of t:

    y(x) = w_0 + w_1 x

What type of model did we choose?

Divide the dataset into training and testing examples
- Use the training examples to construct a hypothesis, or function approximator, that maps x to predicted y
- Evaluate the hypothesis on the test set
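
As a rough sketch of this recipe (reusing the x and t arrays from the earlier snippet; the 80/20 split and the predict helper are illustrative choices of mine, not from the slides):

rng = np.random.default_rng(1)
idx = rng.permutation(len(x))
n_train = int(0.8 * len(x))
x_train, t_train = x[idx[:n_train]], t[idx[:n_train]]   # used to construct the hypothesis
x_test, t_test = x[idx[n_train:]], t[idx[n_train:]]     # held out for evaluation

def predict(x, w0, w1):
    # the chosen model: y(x) = w0 + w1 * x
    return w0 + w1 * x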

Pages 36-41: Noise

A simple model typically does not exactly fit the data
- lack of fit can be considered noise

Sources of noise:
- Imprecision in data attributes (input noise, e.g., noise in per-capita crime)
- Errors in data targets (mis-labeling, e.g., noise in house prices)
- Additional attributes, not taken into account by the data attributes, affect the target values (latent variables). In the example, what else could affect house prices?
- Model may be too simple to account for data targets

Pages 42-51: Least-Squares Regression

Define a model: y(x) = function(x, w)
- Linear: y(x) = w_0 + w_1 x

Standard loss/cost/objective function measures the squared error between y and the true value t:

    \ell(w) = \sum_{n=1}^{N} [t^{(n)} - y(x^{(n)})]^2

    Linear model: \ell(w) = \sum_{n=1}^{N} [t^{(n)} - (w_0 + w_1 x^{(n)})]^2

For a particular hypothesis (y(x) defined by a choice of w, drawn in red), what does the loss represent geometrically?
- The loss for the red hypothesis is the sum of the squared vertical errors (squared lengths of green vertical lines)

How do we obtain weights w = (w_0, w_1)? Find w that minimizes the loss \ell(w)

For the linear model, what kind of a function is \ell(w)?
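
A small sketch of this loss for the linear model, assuming NumPy and the x_train/t_train arrays from the earlier snippet; the function name and the trial weights are mine, for illustration only.

def squared_error_loss(w0, w1, x, t):
    # l(w) = sum_n [t^(n) - (w0 + w1 * x^(n))]^2
    y = w0 + w1 * x
    return np.sum((t - y) ** 2)

# evaluate an arbitrary hypothesis, e.g. w0 = 0.0, w1 = 1.0
loss = squared_error_loss(0.0, 1.0, x_train, t_train)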

Page 52: For the linear model, \ell(w) is a quadratic (bowl-shaped) function of the weights, e.g. \ell(w) = w_0^2 + w_1^2

Pages 53-56: [Figures only: "Mixture of three Gaussians", "MOG contours", "Contour Maps", "Contour Map" (illustrations of reading a function of two variables from its contour plot)]

Pages 57-63: Optimizing the Objective

One straightforward method: gradient descent
- initialize w (e.g., randomly)
- repeatedly update w based on the gradient:

    w \leftarrow w - \lambda \frac{\partial \ell}{\partial w}

\lambda is the learning rate

For a single training case, this gives the LMS update rule (Least Mean Squares):

    w \leftarrow w + 2 \lambda (t^{(n)} - y(x^{(n)})) x^{(n)}

where (t^{(n)} - y(x^{(n)})) is the error

Note: As the error approaches zero, so does the update (w stops changing)
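
A minimal sketch of this single-example (LMS) update for the linear model, assuming NumPy; here w is stored as the pair [w0, w1], and the learning rate value is an arbitrary illustrative choice.

def lms_update(w, x_n, t_n, lam=0.01):
    # w <- w + 2*lambda*(t^(n) - y(x^(n))) * [1, x^(n)]
    # (the leading 1 plays the role of the bias input x_0 = 1, so w[0] = w0 is updated too)
    error = t_n - (w[0] + w[1] * x_n)
    return w + 2 * lam * error * np.array([1.0, x_n])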

Pages 64-67: Optimizing Across Training Set

Two ways to generalize this for all examples in the training set:

1. Batch updates: sum or average updates across every example n, then change the parameter values

    w \leftarrow w + 2 \lambda \sum_{n=1}^{N} (t^{(n)} - y(x^{(n)})) x^{(n)}

2. Stochastic/online updates: update the parameters for each training case in turn, according to its own gradients

    Algorithm 1: Stochastic gradient descent
    1: Randomly shuffle examples in the training set
    2: for i = 1 to N do
    3:   Update: w \leftarrow w + 2 \lambda (t^{(i)} - y(x^{(i)})) x^{(i)}   (update for a linear model)
    4: end for

- Underlying assumption: the sample is independent and identically distributed (i.i.d.)
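
Sketches of both variants for the linear model, under the same assumptions as above (NumPy, w stored as [w0, w1], illustrative learning rate); batch_update moves w once using the summed gradient, while sgd_epoch makes one pass of Algorithm 1.

def batch_update(w, x, t, lam=0.01):
    # sum the per-example updates over all N training cases, then change w once
    y = w[0] + w[1] * x
    return w + 2 * lam * np.array([np.sum(t - y), np.sum((t - y) * x)])

def sgd_epoch(w, x, t, lam=0.01, seed=0):
    # stochastic/online updates: shuffle, then update w for each example in turn
    rng = np.random.default_rng(seed)
    for i in rng.permutation(len(x)):
        y_i = w[0] + w[1] * x[i]
        w = w + 2 * lam * (t[i] - y_i) * np.array([1.0, x[i]])
    return w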

Pages 68-71: Analytical Solution?

For some objectives we can also find the optimal solution analytically

This is the case for linear least-squares regression

How? Compute the derivatives of the objective w.r.t. w and equate them with 0

Define:

    t = [t^{(1)}, t^{(2)}, ..., t^{(N)}]^T

    X = [ 1  x^{(1)}
          1  x^{(2)}
          ...
          1  x^{(N)} ]

Then:

    w = (X^T X)^{-1} X^T t

(work it out!)
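
A minimal NumPy sketch of this closed form, reusing the x_train/t_train arrays from before; it solves the linear system (X^T X) w = X^T t with np.linalg.solve rather than forming the inverse explicitly, which is a standard numerical practice rather than something stated on the slides.

X = np.column_stack([np.ones_like(x_train), x_train])    # rows are [1, x^(n)]
w = np.linalg.solve(X.T @ X, X.T @ t_train)               # w = (X^T X)^{-1} X^T t
w0, w1 = w                                                 # fitted bias and slope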

Pages 72-73: Multi-dimensional Inputs

One method of extending the model is to consider other input dimensions:

    y(x) = w_0 + w_1 x_1 + w_2 x_2

In the Boston housing example, we can look at the number of rooms

Pages 74-78: Linear Regression with Multi-dimensional Inputs

Imagine now we want to predict the median house price from these multi-dimensional observations

Each house is a data point n, with observations indexed by j:

    x^{(n)} = (x_1^{(n)}, ..., x_j^{(n)}, ..., x_d^{(n)})

We can incorporate the bias w_0 into w by using x_0 = 1, then

    y(x) = w_0 + \sum_{j=1}^{d} w_j x_j = w^T x

We can then solve for w = (w_0, w_1, ..., w_d). How?

We can use gradient descent to solve for each coefficient, or compute w analytically (how does the solution change?)
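
The same normal-equation recipe carries over once the design matrix has a leading column of ones (x_0 = 1). A sketch with made-up placeholder features rather than the actual Boston housing attributes; the numbers are arbitrary.

rng = np.random.default_rng(2)
N, d = 100, 2
features = rng.standard_normal((N, d))                     # placeholder per-house attributes
targets = 3.0 + features @ np.array([1.5, -0.5]) + 0.1 * rng.standard_normal(N)

X = np.column_stack([np.ones(N), features])                # x_0 = 1 absorbs the bias w_0
w = np.linalg.solve(X.T @ X, X.T @ targets)                # w = (X^T X)^{-1} X^T t, unchanged
y = X @ w                                                  # predictions y(x) = w^T x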

Pages 79-83: More Powerful Models? Fitting a Polynomial

What if our linear model is not good? How can we create a more complicated model?

We can create a more complicated model by defining input variables that are combinations of components of x

Example: an M-th order polynomial function of a one-dimensional feature x:

    y(x, w) = w_0 + \sum_{j=1}^{M} w_j x^j

where x^j is the j-th power of x

We can use the same approach to optimize for the weights w

How do we do that?
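
One way to do it: build the powers x^0, x^1, ..., x^M as columns of a design matrix and reuse the least-squares machinery unchanged. A sketch assuming the x_train/t_train arrays from earlier; the helper name and the order M = 3 are illustrative.

def poly_design_matrix(x, M):
    # columns are x^0 (all ones), x^1, ..., x^M
    return np.column_stack([x ** j for j in range(M + 1)])

M = 3
X = poly_design_matrix(x_train, M)
w = np.linalg.solve(X.T @ X, X.T @ t_train)   # fit the M+1 polynomial weights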

Page 84: Which Fit is Best?

[Figure from Bishop: polynomial fits of several different orders M to the same data points]

Pages 85-90: Generalization

Generalization = the model's ability to predict held-out data

What is happening?

Our model with M = 9 overfits the data (it also models the noise)

Not a problem if we have lots of training examples

Let's look at the estimated weights for various M in the case of fewer examples

The weights are becoming huge to compensate for the noise

One way of dealing with this is to encourage the weights to be small (this way no input dimension will have too much influence on prediction). This is called regularization
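
A rough sketch of how one might see this numerically, assuming the noisy 1-D data and the poly_design_matrix helper from the earlier snippets (np.linalg.lstsq is used because the high-order system can be rank-deficient with so few points); typically the training error shrinks as M grows while the test error and the largest weight magnitudes blow up.

def fit_and_errors(M, x_tr, t_tr, x_te, t_te):
    X_tr = poly_design_matrix(x_tr, M)
    w = np.linalg.lstsq(X_tr, t_tr, rcond=None)[0]          # least-squares fit, robust to rank deficiency
    train_err = np.mean((t_tr - X_tr @ w) ** 2)
    test_err = np.mean((t_te - poly_design_matrix(x_te, M) @ w) ** 2)
    return w, train_err, test_err

for M in (1, 3, 9):
    w, tr, te = fit_and_errors(M, x_train, t_train, x_test, t_test)
    print(M, tr, te, np.abs(w).max())                        # watch the largest |weight| as M grows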

Page 91: CSC 411: Lecture 02: Linear Regressionbonner/courses/2019f/csc411/lectures/02_regression.pdfPredicting continuous outputs is called regression What do I need in order to predict these

Regularized Least Squares

Increasing the input features this way can complicate the model considerably

Goal: select the appropriate model complexity automatically

Standard approach: regularization

\[
\tilde{\ell}(w) = \sum_{n=1}^{N} \left[ t^{(n)} - \left( w_0 + w_1 x^{(n)} \right) \right]^2 + \alpha\, w^T w
\]

Intuition: Since we are minimizing the loss, the second term will encourage smaller values in w

When we use the penalty on the squared weights we have ridge regression in statistics

Leads to a modified update rule for gradient descent:

\[
w \leftarrow w + 2\lambda \left[ \sum_{n=1}^{N} \left( t^{(n)} - y(x^{(n)}) \right) x^{(n)} - \alpha w \right]
\]
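A minimal sketch of this update as a single batch gradient step, assuming a design matrix X (N × D), targets t, learning rate lam (λ), and penalty alpha (α); none of these variable names come from the slides.

```python
import numpy as np

def ridge_gd_step(w, X, t, lam=0.01, alpha=0.1):
    """One step of w <- w + 2*lam*[ sum_n (t^(n) - y^(n)) x^(n) - alpha*w ]."""
    y = X @ w                                    # predictions y(x^(n)) for all n
    return w + 2 * lam * (X.T @ (t - y) - alpha * w)
```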

Also has an analytical solution: \( w = (X^T X + \alpha I)^{-1} X^T t \) (verify!)
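For completeness, a short sketch of the analytical solution above, using a linear solve rather than an explicit matrix inverse (the function and variable names are assumptions, not from the slides):

```python
import numpy as np

def ridge_closed_form(X, t, alpha=0.1):
    """Return w = (X^T X + alpha*I)^(-1) X^T t without forming the inverse."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(D), X.T @ t)
```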

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 19 / 22

Page 98: CSC 411: Lecture 02: Linear Regression

Regularized least squares

Better generalization

Choose α carefully
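One common way to choose α carefully (a sketch, not the lecture's prescription) is to sweep a grid of candidate values and keep the one with the lowest error on held-out validation data:

```python
import numpy as np

def pick_alpha(X_tr, t_tr, X_val, t_val, alphas=(0.0, 1e-3, 1e-2, 1e-1, 1.0)):
    """Return the alpha whose ridge solution has the lowest validation MSE."""
    best_alpha, best_err = None, np.inf
    D = X_tr.shape[1]
    for alpha in alphas:
        w = np.linalg.solve(X_tr.T @ X_tr + alpha * np.eye(D), X_tr.T @ t_tr)
        err = np.mean((X_val @ w - t_val) ** 2)   # validation MSE
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha
```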

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 20 / 22

Page 99: CSC 411: Lecture 02: Linear Regression

1-D regression illustrates key concepts

Data fits – is linear model best (model selection)?

I Simple models may not capture all the important variations (signal) in the data: underfit

I More complex models may overfit the training data (fit not only the signal but also the noise in the data), especially if not enough data to constrain the model

One method of assessing fit: test generalization = model’s ability to predict the held out data

Optimization is essential: stochastic and batch iterative approaches; analytic when available
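To make the underfit/overfit point concrete, a small self-contained sketch (assuming the same noisy sin(2πx) setup as above) compares training and held-out error as the polynomial order grows:

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_sine(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

x_tr, t_tr = noisy_sine(10)      # small training set
x_te, t_te = noisy_sine(200)     # held-out data to measure generalization

for M in (1, 3, 9):
    X_tr = np.vander(x_tr, M + 1, increasing=True)
    X_te = np.vander(x_te, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X_tr, t_tr, rcond=None)
    train_err = np.mean((X_tr @ w - t_tr) ** 2)
    test_err = np.mean((X_te @ w - t_te) ** 2)
    print(f"M = {M}: train {train_err:.3f}, test {test_err:.3f}")
# Low M underfits (both errors high); M = 9 drives training error down
# while held-out error typically rises.
```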

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 21 / 22

Page 104: CSC 411: Lecture 02: Linear Regression

So...

Which movie will you watch?

Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 22 / 22

