
Pattern Recognition 2020: Introduction

Ad Feelders

Universiteit Utrecht

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 1 / 53

About the Course

Lecturers: Zerrin Yumak and Ad Feelders

Teaching Assistants: Ali Katsheh and Jiayuan Hu

Course info: http://www.cs.uu.nl/docs/vakken/mpr/

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 2 / 53

About the Course

Part I (Ad Feelders): Introduction to statistical machine learning.

General principles of data analysis: overfitting, bias-variance trade-off, model selection, regularization, the curse of dimensionality.

Linear statistical models for regression and classification.

Clustering and unsupervised learning.

Support vector machines.

Required literature: Bishop, Pattern Recognition and Machine Learning (the equation numbers in these slides refer to this book).

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 3 / 53

About the Course

Part II (Zerrin Yumak): Neural networks and deep learning.

Feed-forward neural networks.

Convolutional neural networks.

Recurrent neural networks.

Recommended reading:

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 4 / 53

About the Course

Practical assignment: analysis of handwritten digit data in R or Python (teams of 2 students).

Deep learning project: subject of own choice (teams of 5 students).

Online lectures in MS Teams (Wednesday and Friday).

Online support for practical assignment and deep learning project in MS Teams (Friday after the lecture, starting next week).

Grading:

Practical assignment (20%)

Deep learning project (40%)

Written exam (40%)

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 5 / 53

What is statistical pattern recognition?

The field of pattern recognition/machine learning is concerned with the automatic discovery of regularities in data through the use of computer algorithms and with the use of these regularities to take actions such as classifying the data into different categories.

(Bishop, page 1)

[Figure: examples of hand-written digits, each a 28 × 28 pixel image.]

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 6 / 53

Machine Learning Approach

Use training data

D = \{(x_1, t_1), \ldots, (x_N, t_N)\}

of N labeled examples, and fit a model to the training data.

This model can subsequently be used to predict the class (digit) for new input vectors x.

The ability to categorize correctly new examples is called generalization.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 7 / 53

Types of Learning Problems

Supervised Learning

Numeric target: regression.
Discrete unordered target: classification.
Discrete ordered target: ordinal classification; ranking.
. . .

Unsupervised Learning

Clustering.
Density estimation.
. . .

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 8 / 53

Example: Polynomial Curve Fitting

[Figure: training data points (x, t) scattered around the curve sin(2πx), generated according to the model below.]

t = sin(2πx) + ε, with ε ∼ N(µ = 0, σ = 0.3).

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 9 / 53

Polynomial Curve Fitting

Fit a model:

y(x, w) = w_0 + w_1 x + w_2 x^2 + \ldots + w_M x^M = \sum_{j=0}^{M} w_j x^j    (1.1)

Linear function of the coefficients w = (w_0, w_1, \ldots, w_M).

The coefficients (or weights) w are estimated (or learned) from the data.

PS: equation numbers refer to the book of Bishop.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 10 / 53

Error Function

We choose those values for w that minimize the sum of squared errors

E(w) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, w) - t_n\}^2    (1.2)

Why square the difference between predicted and true value?
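As a concrete (non-slide) illustration of this procedure, here is a minimal NumPy sketch: it generates a small synthetic training set like the one on the curve-fitting slide, fits the polynomial (1.1) by ordinary least squares, and evaluates the error (1.2). The sample size, noise level, seed and order M = 3 are illustrative choices, not taken from the slides.

```python
# Minimal sketch (not from the slides): fit the polynomial (1.1) by minimizing
# the sum-of-squares error (1.2) with ordinary least squares.
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic training data: t = sin(2*pi*x) + Gaussian noise.
N = 10
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=N)

M = 3                                          # polynomial order (illustrative)
Phi = np.vander(x, M + 1, increasing=True)     # columns: x^0, x^1, ..., x^M
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)    # least-squares estimate of w

E = 0.5 * np.sum((Phi @ w - t) ** 2)           # error function (1.2)
print("coefficients w:", np.round(w, 3))
print("training error E(w):", round(E, 4))
```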

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 11 / 53

Error Function

[Figure: for each training point (x_n, t_n), the error function measures the vertical displacement between the prediction y(x_n, w) and the target t_n.]

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 12 / 53

Curves Fitted with Least Squares (in red)

[Figure: polynomials of order M = 0, 1, 3 and 9 fitted by least squares (red), shown with the training data and the true (green) curve sin(2πx).]

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 13 / 53

Magnitude of Coefficients

        M = 0     M = 1     M = 3       M = 9
w*_0    0.19      0.82      0.31        0.35
w*_1              -1.27     7.99        232.37
w*_2                        -25.43      -5321.83
w*_3                        17.37       48568.31
w*_4                                    -231639.30
w*_5                                    640042.26
w*_6                                    -1061800.52
w*_7                                    1042400.18
w*_8                                    -557682.99
w*_9                                    125201.43

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 14 / 53

Training and Test Error

[Figure: root-mean-square error on the training set and on a separate test set, plotted against the polynomial order M = 0, . . . , 9.]

E_RMS = \sqrt{2 E(w^*)/N}    (1.3)

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 15 / 53

Overfitting and Sample Size

[Figure: the M = 9 polynomial (red) fitted by least squares to N = 15 and to N = 100 data points, with the true (green) curve sin(2πx).]

The red curve (M = 9) is much smoother for N = 100 than for N = 15. It is also closer to the true (green) curve.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 16 / 53

Regularization

Adjusted error function

E(w) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, w) - t_n\}^2 + \frac{\lambda}{2} \|w\|^2    (1.4)

with \|w\|^2 = w^T w = w_0^2 + w_1^2 + \ldots + w_M^2.

Shrink coefficients towards zero.

Ridge regression

Neural networks: weight decay
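A minimal sketch (not part of the slides) of this regularized least-squares fit, i.e. the ridge regression mentioned above: minimizing (1.4) for a fixed λ leads to the modified normal equations (Φ^T Φ + λI) w = Φ^T t, which are solved below for two values of λ. The data and the λ values are illustrative.

```python
# Minimal sketch (not from the slides): ridge regression / regularized least
# squares, minimizing the adjusted error function (1.4) for a given lambda.
import numpy as np

def fit_ridge(x, t, M, lam):
    """Return w minimizing 1/2*sum_n (y(x_n,w) - t_n)^2 + lam/2*||w||^2."""
    Phi = np.vander(x, M + 1, increasing=True)    # columns x^0, ..., x^M
    A = Phi.T @ Phi + lam * np.eye(M + 1)         # Phi^T Phi + lambda * I
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 10)                     # illustrative data
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 10)

for lam in (np.exp(-18), 1.0):                    # ln(lambda) = -18 and 0
    w = fit_ridge(x, t, M=9, lam=lam)
    print(f"ln(lambda) = {np.log(lam):6.1f}   largest |w_j| = {np.abs(w).max():.2f}")
```

Larger λ shrinks the coefficients towards zero, as in the table on the next slide.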

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 17 / 53

Magnitude of Coefficients (M = 9)

        lnλ = −∞        lnλ = −18     lnλ = 0
w*_0    0.35            0.35          0.13
w*_1    232.37          4.74          -0.05
w*_2    -5321.83        -0.77         -0.06
w*_3    48568.31        -31.97        -0.05
w*_4    -231639.30      -3.89         -0.03
w*_5    640042.26       55.28         -0.02
w*_6    -1061800.52     41.32         -0.01
w*_7    1042400.18      -45.95        -0.00
w*_8    -557682.99      -91.53        0.00
w*_9    125201.43       72.68         0.01

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 18 / 53

Fitted Curves for M = 9, λ ≈ 10⁻⁸ and λ = 1

[Figure: the M = 9 polynomial fitted with the regularized error function, for lnλ = −18 (left) and lnλ = 0 (right), together with the data and the true curve.]

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 19 / 53

RMSE versus lnλ for M = 9

[Figure: training and test root-mean-square error plotted against lnλ (roughly −35 to −20), for the M = 9 polynomial.]

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 20 / 53

Probability distribution and likelihood function

Binomial distribution with parameters N and π:

p(t) = \binom{N}{t} π^t (1 − π)^{N−t}

Binomial distribution with N = 10 and π = 0.7:

p(t) = \binom{10}{t} 0.7^t 0.3^{10−t}

Probability of observing t = 8:

\binom{10}{8} 0.7^8 0.3^2 ≈ 0.234

Likelihood function if we observe 7 heads in 10 trials:

L(π | t = 7) = \binom{10}{7} π^7 (1 − π)^3
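A small sketch (not part of the slides) that evaluates these quantities numerically, using Python's math.comb for the binomial coefficient.

```python
# Minimal sketch (not from the slides): the binomial pmf and the likelihood
# of pi after observing 7 heads in 10 trials.
from math import comb

def binom_pmf(t, N, pi):
    """p(t) = C(N, t) * pi^t * (1 - pi)^(N - t)."""
    return comb(N, t) * pi**t * (1 - pi)**(N - t)

# Probability of observing t = 8 when N = 10 and pi = 0.7.
print(round(binom_pmf(8, 10, 0.7), 3))

# Likelihood L(pi | t = 7): the same formula read as a function of pi,
# evaluated on the grid of pi values used in the tables that follow.
for pi in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(pi, round(binom_pmf(7, 10, pi), 3))
```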

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 21 / 53

Probability and Likelihood

 t \ π    0.1     0.3     0.5     0.7     0.9
  0      .349    .028    .001
  1      .387    .121    .010
  2      .194    .234    .044    .002
  3      .057    .267    .117    .009
  4      .011    .200    .205    .036
  5      .002    .103    .246    .103    .002
  6              .036    .205    .200    .011
  7              .009    .117    .267    .057
  8              .002    .044    .234    .194
  9                      .010    .121    .387
 10                      .001    .028    .349
sum      1       1       1       1       1

The column for π = 0.7 is the probability distribution of t for π = 0.7 and N = 10 (each column sums to 1).
Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 22 / 53

Probability and Likelihood

 t \ π    0.1     0.3     0.5     0.7     0.9
  0      .349    .028    .001
  1      .387    .121    .010
  2      .194    .234    .044    .002
  3      .057    .267    .117    .009
  4      .011    .200    .205    .036
  5      .002    .103    .246    .103    .002
  6              .036    .205    .200    .011
  7              .009    .117    .267    .057
  8              .002    .044    .234    .194
  9                      .010    .121    .387
 10                      .001    .028    .349
sum      1       1       1       1       1

The row for t = 7, read as a function of π, is the likelihood function for observing t = 7 in 10 trials.
Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 23 / 53

Likelihood function

Let

t = (t_1, \ldots, t_N)

be N independent observations, all from the same probability distribution

p(t | θ),

where θ is the parameter vector of p (e.g. θ = (µ, σ) for the normal distribution); then

L(θ | t) ∝ \prod_{n=1}^{N} p(t_n | θ)

is the likelihood function for t.

Maximum likelihood estimation: find that particular value θ_ML which maximizes L, i.e. that θ_ML such that the observed t are more likely to have come from p(t | θ_ML) than from p(t | θ) for any other value of θ.
Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 24 / 53

Maximum Likelihood Estimation

Take the derivatives of L with respect to the components of θ and equate them to zero (normal equations)

\frac{\partial L}{\partial θ_j} = 0

Solve for the θ_j (and check the second-order condition). Maximizing the loglikelihood function is often easier:

ℓ(θ | t) = ln\{L(θ | t)\} = ln\left\{\prod_{n=1}^{N} p(t_n | θ)\right\} = \sum_{n=1}^{N} ln p(t_n | θ)

since ln(ab) = ln a + ln b.

This is allowed because the ln function is strictly increasing on (0, ∞).
Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 25 / 53

Likelihood function

Likelihood function for 7 heads out of 10 coin flips:

[Figure: the likelihood function L(π) plotted against π ∈ [0, 1]; it peaks at π = 0.7.]

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 26 / 53

Example: coin flipping

Random variable t with t = 1 if heads comes up, and t = 0 if tails comes up. π = P(t = 1). Probability distribution for one coin flip:

p(t) = π^t (1 − π)^{1−t}

Sequence of N coin flips:

p(t) = p(t_1, t_2, \ldots, t_N) = \prod_{n=1}^{N} π^{t_n} (1 − π)^{1−t_n}

which defines the likelihood when viewed as a function of π. The loglikelihood function consequently becomes

ℓ(π | t) = \sum_{n=1}^{N} \{t_n ln(π) + (1 − t_n) ln(1 − π)\}

since ln(a^b) = b ln a.
Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 27 / 53

Example: coin flipping (continued)

In a sequence of 10 coin flips with heads coming up seven times, we obtain

ℓ(π) = ln(π^7 (1 − π)^3) = 7 ln π + 3 ln(1 − π)

To determine the maximum we take the derivative with respect to π, equate it to zero, and solve for π:

\frac{dℓ}{dπ} = \frac{7}{π} − \frac{3}{1 − π} = 0

which yields the maximum likelihood estimate π_ML = 0.7.

Note: \frac{d ln x}{dx} = \frac{1}{x}
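As a quick (non-slide) check, the sketch below evaluates this loglikelihood on a grid of π values and confirms that it is maximized at π = 0.7, which here coincides with the sample proportion of heads.

```python
# Minimal sketch (not from the slides): the coin-flip loglikelihood and a
# numerical check that it is maximized at pi = 0.7.
import numpy as np

t = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])   # 7 heads in 10 flips (order irrelevant)

def loglik(pi, t):
    """l(pi | t) = sum_n t_n ln(pi) + (1 - t_n) ln(1 - pi)."""
    return np.sum(t * np.log(pi) + (1 - t) * np.log(1 - pi))

grid = np.linspace(0.01, 0.99, 99)
values = np.array([loglik(pi, t) for pi in grid])
print("pi_ML on grid       :", grid[values.argmax()])   # approx. 0.7
print("sample proportion   :", t.mean())                # 0.7
```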

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 28 / 53

Model Selection

Cross-Validation

[Figure: S-fold cross-validation with S = 4; in each of runs 1 to 4 a different fold is held out for validation and the remaining folds are used for fitting.]

Score = Quality of Fit − Complexity Penalty

For example: AIC = ln p(D | w_ML) − M,

where ln p(D | w_ML) is the maximized loglikelihood and M is the number of parameters of the fitted model.
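A minimal sketch (not part of the slides) of the cross-validation idea: the polynomial order M is scored by the average held-out error over S = 4 folds on an illustrative synthetic data set.

```python
# Minimal sketch (not from the slides): choosing the polynomial order M by
# S-fold cross-validation on a synthetic data set.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 40)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 40)

def cv_error(x, t, M, S=4):
    """Mean held-out squared error over S folds for polynomial order M."""
    folds = np.array_split(rng.permutation(len(x)), S)
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)
        Phi_tr = np.vander(x[train], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi_tr, t[train], rcond=None)
        Phi_te = np.vander(x[held_out], M + 1, increasing=True)
        errors.append(np.mean((Phi_te @ w - t[held_out]) ** 2))
    return np.mean(errors)

for M in (0, 1, 3, 9):
    print(f"M = {M}: cross-validation error = {cv_error(x, t, M):.3f}")
```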

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 29 / 53

The Curse of Dimensionality

[Figure: scatterplot of training points from several classes in a two-dimensional input space; a new point, marked ×, has to be classified.]

Predict class of ×.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 30 / 53

The Curse of Dimensionality

[Figure: the same data, now with the input space partitioned into rectangular cells; × can be assigned the majority class of the cell it falls into.]

Predict class of ×.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 31 / 53

The Curse of Dimensionality

[Figure: a regular grid of cells over the input space for D = 1 (variable x1), D = 2 (x1, x2) and D = 3 (x1, x2, x3).]

Number of rectangles grows exponentially with D. If D is large, most rectangles will be empty (contain no data).
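A tiny (non-slide) illustration of the counting argument: with b cells per input variable, a grid over D variables has b^D cells, so for a fixed sample size almost all cells are empty once D grows.

```python
# Minimal sketch (not from the slides): the number of grid cells grows as
# bins**D, so most cells contain no data when D is large.
bins, n_points = 4, 1000
for D in (1, 2, 3, 10):
    n_cells = bins ** D
    occupied_at_most = min(n_points, n_cells)   # each point fills at most one cell
    print(f"D = {D:2d}: {n_cells:>8d} cells, at most {occupied_at_most} of them occupied")
```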

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 32 / 53

Decision Theory

Suppose we have to make a decision in a situation involving uncertainty. Two steps:

1. Inference: learn p(x, t) from data. This problem is the main subject of this course.

2. Decision: given this estimate of p(x, t), determine the optimal decision. Relatively straightforward.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 33 / 53

Decision Theory: Example

Predict whether patient has cancer from X-ray image.

Let t = 1 denote that cancer is present. Then knowledge of

p(t = 1 | x) = \frac{p(x | t = 1)\, p(t = 1)}{p(x)}

would allow us to make optimal predictions of t from x (given an appropriate loss function).

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 34 / 53

Loss Functions for Classification

Suppose we know p(x, t).
Task: given a value for x, predict the class label t.
L_kj: loss of predicting class j when the true class is k.
K: number of classes.

To minimize expected loss, predict the class j that minimizes

\sum_{k=1}^{K} L_{kj}\, p(t = k | x),    (1.81)

where j ∈ {1, . . . , K}.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 35 / 53

Minimizing the Misclassification Rate

To minimize the probability of misclassification, we take (0/1 loss)

L_{kj} = \begin{cases} 0 & \text{if } j = k \\ 1 & \text{otherwise} \end{cases}

The minimum of

\sum_{k=1}^{K} L_{kj}\, p(t = k | x) = \sum_{k ≠ j} p(t = k | x) = 1 − p(t = j | x)

is now achieved if we assign x to the class j for which p(t = j | x) is maximum.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 36 / 53

Minimizing Expected Loss

Example loss matrix for prediction of cancer:

         j = 0   j = 1
k = 0      0       1
k = 1     10       0

Suppose p(t = 0) = 0.8 and p(t = 1) = 0.2.

The expected loss of predicting “no cancer present” is:

L_00 × p(t = 0) + L_10 × p(t = 1) = 0 × 0.8 + 10 × 0.2 = 2

The expected loss of predicting “cancer present” is:

L_01 × p(t = 0) + L_11 × p(t = 1) = 1 × 0.8 + 0 × 0.2 = 0.8

Even though the probability of cancer is “only” 0.2, loss is minimized if we predict (act as if) cancer is present.
Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 37 / 53
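A small sketch (not part of the slides) of the same computation: the expected loss (1.81) is evaluated for both possible predictions and the minimizing class is chosen.

```python
# Minimal sketch (not from the slides): pick the prediction j that minimizes
# the expected loss (1.81) for the cancer example above.
import numpy as np

L = np.array([[0.0, 1.0],    # row k = 0 (no cancer): losses for predicting j = 0, 1
              [10.0, 0.0]])  # row k = 1 (cancer):    losses for predicting j = 0, 1
p = np.array([0.8, 0.2])     # p(t = 0 | x), p(t = 1 | x)

expected_loss = p @ L        # entry j is sum_k L_kj * p(t = k | x)
print(expected_loss)                             # expected losses: 2.0 and 0.8
print("predict class", expected_loss.argmin())   # 1: act as if cancer is present
```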

Properties of Expectation and Variance

Some useful properties (a quick numerical sanity check follows below):

1. E[c] = c for constant c.

2. E[cx] = c E[x].

3. E[x ± y] = E[x] ± E[y].

4. var[c] = 0 for constant c.

5. var[cx] = c² var[x].

6. var[x ± y] = var[x] + var[y] if x and y are independent.
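The Monte Carlo sketch below (not from the slides) checks properties 5 and 6 numerically for an illustrative pair of independent variables.

```python
# Minimal sketch (not from the slides): numerical check of properties 5 and 6
# for independent x and y.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.0, 2.0, size=1_000_000)    # var[x] = 4
y = rng.uniform(0.0, 1.0, size=1_000_000)   # var[y] = 1/12, independent of x
c = 3.0

print(np.var(c * x), c**2 * np.var(x))          # property 5: var[cx] = c^2 var[x]
print(np.var(x + y), np.var(x) + np.var(y))     # property 6: approximately equal
```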

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 38 / 53

Loss function for regression

Let t_0 be a random draw from p(t | x_0), and suppose we predict the value of t_0 by some number y_0 = y(x_0). The expected squared prediction error is

E[(y_0 − t_0)^2] = E[y_0^2 − 2 y_0 t_0 + t_0^2] = y_0^2 − 2 y_0 E[t_0] + E[t_0^2],

where the expectation is taken with respect to p(t | x_0) and y_0 is treated as a constant.

To minimize this expression we solve

\frac{d(y_0^2 − 2 y_0 E[t_0] + E[t_0^2])}{d y_0} = 2 y_0 − 2 E[t_0] = 0,

which gives y_0 = E[t_0]. Conclusion: predict the expected value (mean)!
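As an aside (not on the slides), the sketch below checks this numerically: over a grid of candidate predictions y_0, the Monte Carlo estimate of E[(y_0 − t_0)^2] is smallest near the mean of the simulated draws.

```python
# Minimal sketch (not from the slides): the mean minimizes the expected
# squared prediction error, checked by Monte Carlo.
import numpy as np

rng = np.random.default_rng(0)
t0 = rng.normal(2.0, 1.0, size=100_000)          # draws from some p(t | x0)

candidates = np.linspace(0.0, 4.0, 81)           # candidate predictions y0
risks = [np.mean((y0 - t0) ** 2) for y0 in candidates]
print("best y0 on grid:", candidates[np.argmin(risks)])   # close to E[t0] = 2.0
print("sample mean    :", t0.mean())
```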

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 39 / 53

Minimizing expected squared prediction error

Since this reasoning applies to any value of x we might pick, we have that

y(x) = E_t[t | x]    (1.89)

minimizes the expected squared prediction error.

The function E_t[t | x] is called the (population) regression function.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 40 / 53

Population Regression Function

[Figure: the regression function y(x) = E_t[t | x]; at a particular input x_0 the conditional distribution p(t | x_0) is shown, with y(x_0) its mean.]

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 41 / 53

Question

We have derived the result that

y(x) = E_t[t | x]    (1.89)

minimizes the expected squared prediction error.

How could we use this result to construct a prediction rule y(x) from a finite data sample

D = \{(x_1, t_1), \ldots, (x_N, t_N)\}?

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 42 / 53

Simple approach to regression?

[Figure: training data (x, t) with the prediction y(x_0) at the input value x_0 marked.]

Predict the mean of the target values of all training observations with x = x_0.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 43 / 53

Simple approach to regression?

[Figure: training data (x, t) with the prediction y(x_0) at x_0 marked; the prediction now uses only the observations with x-values near x_0.]

Predict the mean of the target values of training observations with x-value closest to x_0.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 44 / 53

Nearest-neighbor functions

Consider a regression problem with input variable x and output variable t:

for each input value x, we define a neighborhood N_k(x) containing the indices n of the k points (x_n, t_n) from the training data that are the closest to x;

from the neighborhood function N_k(x), we construct the function

y_k(x) = \frac{1}{k} \sum_{n ∈ N_k(x)} t_n

The function yk(x) is called the k-nearest neighbor function.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 45 / 53

An example learning problem

In a clinical study of risk factors for cardiovascular disease,

the independent variable x is a patient’s waist circumference;

the dependent variable t is a patient’s deep abdominal adipose tissue.

The researchers want to predict the amount of deep abdominal adipose tissue from a simple measurement of waist circumference.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 46 / 53

Scatterplot of the data

For learning the relationship between x and t, measurements (x_n, t_n) on 109 men between 18 and 42 years of age are available:

[Scatterplot: deep abdominal AT (Y) against waist circumference (X) for the 109 men.]

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 47 / 53

An example

We consider eight (consecutive) points (x_n, t_n) from the clinical study of risk factors for cardiovascular disease:

1. (68.85, 55.78)    5. (73.10, 38.21)
2. (71.85, 21.68)    6. (73.20, 32.22)
3. (71.90, 28.32)    7. (73.80, 43.35)
4. (72.60, 25.89)    8. (74.15, 33.41)

[Scatterplot: these eight points, deep abdominal AT (Y) against waist circumference (X).]

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 48 / 53

The example (continued)

With k = 2, the neighborhood of x = 73.00 equals

N2(x = 73.00) = {5, 6}

and we find

y_2(x = 73.00) = (38.21 + 32.22)/2 = 35.215

With k = 5, we find y5(x = 73.00) = 33.598.
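A minimal sketch (not part of the slides) of this computation; it reproduces the two values above from the eight points listed on the previous slide.

```python
# Minimal sketch (not from the slides): the k-nearest-neighbor function y_k(x)
# applied to the eight points above.
import numpy as np

x = np.array([68.85, 71.85, 71.90, 72.60, 73.10, 73.20, 73.80, 74.15])
t = np.array([55.78, 21.68, 28.32, 25.89, 38.21, 32.22, 43.35, 33.41])

def knn_predict(x0, x, t, k):
    """Average the targets of the k training points closest to x0."""
    neighbours = np.argsort(np.abs(x - x0))[:k]   # indices of the k nearest points
    return t[neighbours].mean()

print(knn_predict(73.00, x, t, k=2))   # 35.215
print(knn_predict(73.00, x, t, k=5))   # 33.598
```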

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 49 / 53

The example continued

With k = 2 and Euclidean distance, the following k-nearest neighbor function is constructed from the training data:

[Figure: kNN regression function with k = 2; Adipose Tissue plotted against Waist Circumference.]

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 50 / 53

The example continued

With k = 20 and Euclidean distance, the following k-nearest neighbor function is constructed from the training data:

[Figure: kNN regression function with k = 20; Adipose Tissue plotted against Waist Circumference.]

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 51 / 53

kNN: going to the extremes

[Figure: kNN regression functions with k = 1 (left) and k = 109 (right); Adipose Tissue plotted against Waist Circumference.]

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 52 / 53

The idea of k-nearest neighbor

We recall that, for a regression problem, the best prediction for the output variable t at the input value x is the mean E[t | x]:

the nearest-neighbor function approximates the mean by averaging over the training data;

the nearest-neighbor function relaxes conditioning at a specific input value to the neighborhood of that value.

The nearest-neighbor function thus implements the idea of selecting the means for prediction directly.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 53 / 53