
Pattern Recognition 2020: Introduction

Ad Feelders

Universiteit Utrecht

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 1 / 53

About the Course

Lecturers: Zerrin Yumak and Ad Feelders

Teaching Assistants: Ali Katsheh and Jiayuan Hu

Course info: http://www.cs.uu.nl/docs/vakken/mpr/

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 2 / 53

About the Course

Part I (Ad Feelders): Introduction to statistical machine learning.

General principles of data analysis: overfitting, bias-variance trade-off, model selection, regularization, the curse of dimensionality.

Linear statistical models for regression and classification.

Clustering and unsupervised learning.

Support vector machines.

Required literature: Bishop, Pattern Recognition and Machine Learning (the equation numbers in these slides refer to this book).

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 3 / 53

About the Course

Part II (Zerrin Yumak): Neural networks and deep learning.

Feed-forward neural networks.

Convolutional neural networks.

Recurrent neural networks.

Recommended reading:

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 4 / 53

About the Course

Practical assignment: analysis of handwritten digit data in R or Python (teams of 2 students).

Deep learning project: subject of own choice (teams of 5 students).

Online lectures in MS Teams (Wednesday and Friday).

Online support for practical assignment and deep learning project in MS Teams (Friday after the lecture, starting next week).

Grading:

Practical assignment (20%)

Deep learning project (40%)

Written exam (40%)

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 5 / 53

What is statistical pattern recognition?

The field of pattern recognition/machine learning is concerned with the automatic discovery of regularities in data through the use of computer algorithms and with the use of these regularities to take actions such as classifying the data into different categories.

(Bishop, page 1)

[Figure: examples of hand-written digits, each a 28 × 28 pixel image.]

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 6 / 53

Machine Learning Approach

Use training data

D = \{(x_1, t_1), \ldots, (x_N, t_N)\}

of N labeled examples, and fit a model to the training data.

This model can subsequently be used to predict the class (digit) for new input vectors x.

The ability to categorize correctly new examples is called generalization.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 7 / 53

Types of Learning Problems

Supervised Learning

Numeric target: regression.
Discrete unordered target: classification.
Discrete ordered target: ordinal classification; ranking.
. . .

Unsupervised Learning

Clustering.
Density estimation.
. . .

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 8 / 53

Example: Polynomial Curve Fitting

[Figure: training data points (x, t) scattered around the curve sin(2πx), generated according to the model below.]

t = sin(2πx) + ε, with ε ∼ N(µ = 0, σ = 0.3).

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 9 / 53

Polynomial Curve Fitting

Fit a model:

y(x, w) = w_0 + w_1 x + w_2 x^2 + \ldots + w_M x^M = \sum_{j=0}^{M} w_j x^j    (1.1)

Linear function of the coefficients w = (w_0, w_1, \ldots, w_M).

The coefficients (or weights) w are estimated (or learned) from the data.

PS: equation numbers refer to the book of Bishop.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 10 / 53

Error Function

We choose those values for w that minimize the sum of squared errors

E(w) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, w) - t_n\}^2    (1.2)

Why square the difference between predicted and true value?
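As a concrete (non-slide) illustration of this procedure, here is a minimal NumPy sketch: it generates a small synthetic training set like the one on the curve-fitting slide, fits the polynomial (1.1) by ordinary least squares, and evaluates the error (1.2). The sample size, noise level, seed and order M = 3 are illustrative choices, not taken from the slides.

```python
# Minimal sketch (not from the slides): fit the polynomial (1.1) by minimizing
# the sum-of-squares error (1.2) with ordinary least squares.
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic training data: t = sin(2*pi*x) + Gaussian noise.
N = 10
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=N)

M = 3                                          # polynomial order (illustrative)
Phi = np.vander(x, M + 1, increasing=True)     # columns: x^0, x^1, ..., x^M
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)    # least-squares estimate of w

E = 0.5 * np.sum((Phi @ w - t) ** 2)           # error function (1.2)
print("coefficients w:", np.round(w, 3))
print("training error E(w):", round(E, 4))
```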

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 11 / 53

Error Function

[Figure: for each training point (x_n, t_n), the error function measures the vertical displacement between the prediction y(x_n, w) and the target t_n.]

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 12 / 53

Curves Fitted with Least Squares (in red)

[Figure: polynomials of order M = 0, 1, 3 and 9 fitted by least squares (red), shown with the training data and the true (green) curve sin(2πx).]

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 13 / 53

Magnitude of Coefficients

        M = 0     M = 1     M = 3       M = 9
w*_0    0.19      0.82      0.31        0.35
w*_1              -1.27     7.99        232.37
w*_2                        -25.43      -5321.83
w*_3                        17.37       48568.31
w*_4                                    -231639.30
w*_5                                    640042.26
w*_6                                    -1061800.52
w*_7                                    1042400.18
w*_8                                    -557682.99
w*_9                                    125201.43

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 14 / 53

Training and Test Error

[Figure: root-mean-square error on the training set and on a separate test set, plotted against the polynomial order M = 0, . . . , 9.]

E_RMS = \sqrt{2 E(w^*)/N}    (1.3)

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 15 / 53

Overfitting and Sample Size

[Figure: the M = 9 polynomial (red) fitted by least squares to N = 15 and to N = 100 data points, with the true (green) curve sin(2πx).]

The red curve (M = 9) is much smoother for N = 100 than for N = 15. It is also closer to the true (green) curve.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 16 / 53

Regularization

Adjusted error function

E(w) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, w) - t_n\}^2 + \frac{\lambda}{2} \|w\|^2    (1.4)

with \|w\|^2 = w^T w = w_0^2 + w_1^2 + \ldots + w_M^2.

Shrink coefficients towards zero.

Ridge regression

Neural networks: weight decay
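A minimal sketch (not part of the slides) of this regularized least-squares fit, i.e. the ridge regression mentioned above: minimizing (1.4) for a fixed λ leads to the modified normal equations (Φ^T Φ + λI) w = Φ^T t, which are solved below for two values of λ. The data and the λ values are illustrative.

```python
# Minimal sketch (not from the slides): ridge regression / regularized least
# squares, minimizing the adjusted error function (1.4) for a given lambda.
import numpy as np

def fit_ridge(x, t, M, lam):
    """Return w minimizing 1/2*sum_n (y(x_n,w) - t_n)^2 + lam/2*||w||^2."""
    Phi = np.vander(x, M + 1, increasing=True)    # columns x^0, ..., x^M
    A = Phi.T @ Phi + lam * np.eye(M + 1)         # Phi^T Phi + lambda * I
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 10)                     # illustrative data
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 10)

for lam in (np.exp(-18), 1.0):                    # ln(lambda) = -18 and 0
    w = fit_ridge(x, t, M=9, lam=lam)
    print(f"ln(lambda) = {np.log(lam):6.1f}   largest |w_j| = {np.abs(w).max():.2f}")
```

Larger λ shrinks the coefficients towards zero, as in the table on the next slide.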

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 17 / 53

Magnitude of Coefficients (M = 9)

        lnλ = −∞        lnλ = −18     lnλ = 0
w*_0    0.35            0.35          0.13
w*_1    232.37          4.74          -0.05
w*_2    -5321.83        -0.77         -0.06
w*_3    48568.31        -31.97        -0.05
w*_4    -231639.30      -3.89         -0.03
w*_5    640042.26       55.28         -0.02
w*_6    -1061800.52     41.32         -0.01
w*_7    1042400.18      -45.95        -0.00
w*_8    -557682.99      -91.53        0.00
w*_9    125201.43       72.68         0.01

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 18 / 53

Fitted Curves for M = 9, λ ≈ 10⁻⁸ and λ = 1

[Figure: the M = 9 polynomial fitted with the regularized error function, for lnλ = −18 (left) and lnλ = 0 (right), together with the data and the true curve.]

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 19 / 53

RMSE versus lnλ for M = 9

[Figure: training and test root-mean-square error plotted against lnλ (roughly −35 to −20), for the M = 9 polynomial.]

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 20 / 53

Probability distribution and likelihood function

Binomial distribution with parameters N and π:

p(t) = \binom{N}{t} π^t (1 − π)^{N−t}

Binomial distribution with N = 10 and π = 0.7:

p(t) = \binom{10}{t} 0.7^t 0.3^{10−t}

Probability of observing t = 8:

\binom{10}{8} 0.7^8 0.3^2 ≈ 0.234

Likelihood function if we observe 7 heads in 10 trials:

L(π | t = 7) = \binom{10}{7} π^7 (1 − π)^3
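A small sketch (not part of the slides) that evaluates these quantities numerically, using Python's math.comb for the binomial coefficient.

```python
# Minimal sketch (not from the slides): the binomial pmf and the likelihood
# of pi after observing 7 heads in 10 trials.
from math import comb

def binom_pmf(t, N, pi):
    """p(t) = C(N, t) * pi^t * (1 - pi)^(N - t)."""
    return comb(N, t) * pi**t * (1 - pi)**(N - t)

# Probability of observing t = 8 when N = 10 and pi = 0.7.
print(round(binom_pmf(8, 10, 0.7), 3))

# Likelihood L(pi | t = 7): the same formula read as a function of pi,
# evaluated on the grid of pi values used in the tables that follow.
for pi in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(pi, round(binom_pmf(7, 10, pi), 3))
```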

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 21 / 53

Probability and Likelihood

 t \ π    0.1     0.3     0.5     0.7     0.9
  0      .349    .028    .001
  1      .387    .121    .010
  2      .194    .234    .044    .002
  3      .057    .267    .117    .009
  4      .011    .200    .205    .036
  5      .002    .103    .246    .103    .002
  6              .036    .205    .200    .011
  7              .009    .117    .267    .057
  8              .002    .044    .234    .194
  9                      .010    .121    .387
 10                      .001    .028    .349
sum      1       1       1       1       1

The column for π = 0.7 is the probability distribution of t for π = 0.7 and N = 10 (each column sums to 1).
Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 22 / 53

Probability and Likelihood

 t \ π    0.1     0.3     0.5     0.7     0.9
  0      .349    .028    .001
  1      .387    .121    .010
  2      .194    .234    .044    .002
  3      .057    .267    .117    .009
  4      .011    .200    .205    .036
  5      .002    .103    .246    .103    .002
  6              .036    .205    .200    .011
  7              .009    .117    .267    .057
  8              .002    .044    .234    .194
  9                      .010    .121    .387
 10                      .001    .028    .349
sum      1       1       1       1       1

The row for t = 7, read as a function of π, is the likelihood function for observing t = 7 in 10 trials.
Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 23 / 53

Likelihood function

Let

t = (t_1, \ldots, t_N)

be N independent observations, all from the same probability distribution

p(t | θ),

where θ is the parameter vector of p (e.g. θ = (µ, σ) for the normal distribution); then

L(θ | t) ∝ \prod_{n=1}^{N} p(t_n | θ)

is the likelihood function for t.

Maximum likelihood estimation: find that particular value θ_ML which maximizes L, i.e. that θ_ML such that the observed t are more likely to have come from p(t | θ_ML) than from p(t | θ) for any other value of θ.
Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 24 / 53

Maximum Likelihood Estimation

Take the derivatives of L with respect to the components of θ and equate them to zero (normal equations)

\frac{\partial L}{\partial θ_j} = 0

Solve for the θ_j (and check the second-order condition). Maximizing the loglikelihood function is often easier:

ℓ(θ | t) = ln\{L(θ | t)\} = ln\left\{\prod_{n=1}^{N} p(t_n | θ)\right\} = \sum_{n=1}^{N} ln p(t_n | θ)

since ln(ab) = ln a + ln b.

This is allowed because the ln function is strictly increasing on (0, ∞).
Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 25 / 53

Likelihood function

Likelihood function for 7 heads out of 10 coin flips:

[Figure: the likelihood function L(π) plotted against π ∈ [0, 1]; it peaks at π = 0.7.]

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 26 / 53

Example: coin flipping

Random variable t with t = 1 if heads comes up, and t = 0 if tails comes up. π = P(t = 1). Probability distribution for one coin flip:

p(t) = π^t (1 − π)^{1−t}

Sequence of N coin flips:

p(t) = p(t_1, t_2, \ldots, t_N) = \prod_{n=1}^{N} π^{t_n} (1 − π)^{1−t_n}

which defines the likelihood when viewed as a function of π. The loglikelihood function consequently becomes

ℓ(π | t) = \sum_{n=1}^{N} \{t_n ln(π) + (1 − t_n) ln(1 − π)\}

since ln(a^b) = b ln a.
Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 27 / 53

Example: coin flipping (continued)

In a sequence of 10 coin flips with heads coming up seven times, we obtain

ℓ(π) = ln(π^7 (1 − π)^3) = 7 ln π + 3 ln(1 − π)

To determine the maximum we take the derivative with respect to π, equate it to zero, and solve for π:

\frac{dℓ}{dπ} = \frac{7}{π} − \frac{3}{1 − π} = 0

which yields the maximum likelihood estimate π_ML = 0.7.

Note: \frac{d ln x}{dx} = \frac{1}{x}
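As a quick (non-slide) check, the sketch below evaluates this loglikelihood on a grid of π values and confirms that it is maximized at π = 0.7, which here coincides with the sample proportion of heads.

```python
# Minimal sketch (not from the slides): the coin-flip loglikelihood and a
# numerical check that it is maximized at pi = 0.7.
import numpy as np

t = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])   # 7 heads in 10 flips (order irrelevant)

def loglik(pi, t):
    """l(pi | t) = sum_n t_n ln(pi) + (1 - t_n) ln(1 - pi)."""
    return np.sum(t * np.log(pi) + (1 - t) * np.log(1 - pi))

grid = np.linspace(0.01, 0.99, 99)
values = np.array([loglik(pi, t) for pi in grid])
print("pi_ML on grid       :", grid[values.argmax()])   # approx. 0.7
print("sample proportion   :", t.mean())                # 0.7
```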

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 28 / 53

Model Selection

Cross-Validation

[Figure: S-fold cross-validation with S = 4; in each of runs 1 to 4 a different fold is held out for validation and the remaining folds are used for fitting.]

Score = Quality of Fit − Complexity Penalty

For example: AIC = ln p(D | w_ML) − M,

where ln p(D | w_ML) is the maximized loglikelihood and M is the number of parameters of the fitted model.
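A minimal sketch (not part of the slides) of the cross-validation idea: the polynomial order M is scored by the average held-out error over S = 4 folds on an illustrative synthetic data set.

```python
# Minimal sketch (not from the slides): choosing the polynomial order M by
# S-fold cross-validation on a synthetic data set.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 40)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 40)

def cv_error(x, t, M, S=4):
    """Mean held-out squared error over S folds for polynomial order M."""
    folds = np.array_split(rng.permutation(len(x)), S)
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)
        Phi_tr = np.vander(x[train], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi_tr, t[train], rcond=None)
        Phi_te = np.vander(x[held_out], M + 1, increasing=True)
        errors.append(np.mean((Phi_te @ w - t[held_out]) ** 2))
    return np.mean(errors)

for M in (0, 1, 3, 9):
    print(f"M = {M}: cross-validation error = {cv_error(x, t, M):.3f}")
```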

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 29 / 53

The Curse of Dimensionality

[Figure: scatterplot of training points from several classes in a two-dimensional input space; a new point, marked ×, has to be classified.]

Predict class of ×.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 30 / 53

The Curse of Dimensionality

[Figure: the same data, now with the input space partitioned into rectangular cells; × can be assigned the majority class of the cell it falls into.]

Predict class of ×.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 31 / 53

The Curse of Dimensionality

[Figure: a regular grid of cells over the input space for D = 1 (variable x1), D = 2 (x1, x2) and D = 3 (x1, x2, x3).]

Number of rectangles grows exponentially with D. If D is large, most rectangles will be empty (contain no data).
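A tiny (non-slide) illustration of the counting argument: with b cells per input variable, a grid over D variables has b^D cells, so for a fixed sample size almost all cells are empty once D grows.

```python
# Minimal sketch (not from the slides): the number of grid cells grows as
# bins**D, so most cells contain no data when D is large.
bins, n_points = 4, 1000
for D in (1, 2, 3, 10):
    n_cells = bins ** D
    occupied_at_most = min(n_points, n_cells)   # each point fills at most one cell
    print(f"D = {D:2d}: {n_cells:>8d} cells, at most {occupied_at_most} of them occupied")
```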

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 32 / 53

Decision Theory

Suppose we have to make a decision in a situation involving uncertainty. Two steps:

1. Inference: learn p(x, t) from data. This problem is the main subject of this course.

2. Decision: given this estimate of p(x, t), determine the optimal decision. Relatively straightforward.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 33 / 53

Decision Theory: Example

Predict whether patient has cancer from X-ray image.

Let t = 1 denote that cancer is present. Then knowledge of

p(t = 1 | x) = \frac{p(x | t = 1)\, p(t = 1)}{p(x)}

would allow us to make optimal predictions of t from x (given an appropriate loss function).

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 34 / 53

Loss Functions for Classification

Suppose we know p(x, t).
Task: given a value for x, predict the class label t.
L_kj: loss of predicting class j when the true class is k.
K: number of classes.

To minimize expected loss, predict the class j that minimizes

\sum_{k=1}^{K} L_{kj}\, p(t = k | x),    (1.81)

where j ∈ {1, . . . , K}.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 35 / 53

Minimizing the Misclassification Rate

To minimize the probability of misclassification, we take (0/1 loss)

L_{kj} = \begin{cases} 0 & \text{if } j = k \\ 1 & \text{otherwise} \end{cases}

The minimum of

\sum_{k=1}^{K} L_{kj}\, p(t = k | x) = \sum_{k ≠ j} p(t = k | x) = 1 − p(t = j | x)

is now achieved if we assign x to the class j for which p(t = j | x) is maximum.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 36 / 53

Minimizing Expected Loss

Example loss matrix for prediction of cancer:

         j = 0   j = 1
k = 0      0       1
k = 1     10       0

Suppose p(t = 0) = 0.8 and p(t = 1) = 0.2.

The expected loss of predicting “no cancer present” is:

L_00 × p(t = 0) + L_10 × p(t = 1) = 0 × 0.8 + 10 × 0.2 = 2

The expected loss of predicting “cancer present” is:

L_01 × p(t = 0) + L_11 × p(t = 1) = 1 × 0.8 + 0 × 0.2 = 0.8

Even though the probability of cancer is “only” 0.2, loss is minimized if we predict (act as if) cancer is present.
Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 37 / 53
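A small sketch (not part of the slides) of the same computation: the expected loss (1.81) is evaluated for both possible predictions and the minimizing class is chosen.

```python
# Minimal sketch (not from the slides): pick the prediction j that minimizes
# the expected loss (1.81) for the cancer example above.
import numpy as np

L = np.array([[0.0, 1.0],    # row k = 0 (no cancer): losses for predicting j = 0, 1
              [10.0, 0.0]])  # row k = 1 (cancer):    losses for predicting j = 0, 1
p = np.array([0.8, 0.2])     # p(t = 0 | x), p(t = 1 | x)

expected_loss = p @ L        # entry j is sum_k L_kj * p(t = k | x)
print(expected_loss)                             # expected losses: 2.0 and 0.8
print("predict class", expected_loss.argmin())   # 1: act as if cancer is present
```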

Properties of Expectation and Variance

Some useful properties (a quick numerical sanity check follows below):

1. E[c] = c for constant c.

2. E[cx] = c E[x].

3. E[x ± y] = E[x] ± E[y].

4. var[c] = 0 for constant c.

5. var[cx] = c² var[x].

6. var[x ± y] = var[x] + var[y] if x and y are independent.
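The Monte Carlo sketch below (not from the slides) checks properties 5 and 6 numerically for an illustrative pair of independent variables.

```python
# Minimal sketch (not from the slides): numerical check of properties 5 and 6
# for independent x and y.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.0, 2.0, size=1_000_000)    # var[x] = 4
y = rng.uniform(0.0, 1.0, size=1_000_000)   # var[y] = 1/12, independent of x
c = 3.0

print(np.var(c * x), c**2 * np.var(x))          # property 5: var[cx] = c^2 var[x]
print(np.var(x + y), np.var(x) + np.var(y))     # property 6: approximately equal
```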

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 38 / 53

Loss function for regression

Let t_0 be a random draw from p(t | x_0), and suppose we predict the value of t_0 by some number y_0 = y(x_0). The expected squared prediction error is

E[(y_0 − t_0)^2] = E[y_0^2 − 2 y_0 t_0 + t_0^2] = y_0^2 − 2 y_0 E[t_0] + E[t_0^2],

where the expectation is taken with respect to p(t | x_0) and y_0 is treated as a constant.

To minimize this expression we solve

\frac{d(y_0^2 − 2 y_0 E[t_0] + E[t_0^2])}{d y_0} = 2 y_0 − 2 E[t_0] = 0,

which gives y_0 = E[t_0]. Conclusion: predict the expected value (mean)!
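As an aside (not on the slides), the sketch below checks this numerically: over a grid of candidate predictions y_0, the Monte Carlo estimate of E[(y_0 − t_0)^2] is smallest near the mean of the simulated draws.

```python
# Minimal sketch (not from the slides): the mean minimizes the expected
# squared prediction error, checked by Monte Carlo.
import numpy as np

rng = np.random.default_rng(0)
t0 = rng.normal(2.0, 1.0, size=100_000)          # draws from some p(t | x0)

candidates = np.linspace(0.0, 4.0, 81)           # candidate predictions y0
risks = [np.mean((y0 - t0) ** 2) for y0 in candidates]
print("best y0 on grid:", candidates[np.argmin(risks)])   # close to E[t0] = 2.0
print("sample mean    :", t0.mean())
```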

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 39 / 53

Minimizing expected squared prediction error

Since this reasoning applies to any value of x we might pick, we have that

y(x) = E_t[t | x]    (1.89)

minimizes the expected squared prediction error.

The function E_t[t | x] is called the (population) regression function.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 40 / 53

Population Regression Function

[Figure: the regression function y(x) = E_t[t | x]; at a particular input x_0 the conditional distribution p(t | x_0) is shown, with y(x_0) its mean.]

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 41 / 53

Question

We have derived the result that

y(x) = E_t[t | x]    (1.89)

minimizes the expected squared prediction error.

How could we use this result to construct a prediction rule y(x) from a finite data sample

D = \{(x_1, t_1), \ldots, (x_N, t_N)\}?

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 42 / 53

Simple approach to regression?

[Figure: training data (x, t) with the prediction y(x_0) at the input value x_0 marked.]

Predict the mean of the target values of all training observations with x = x_0.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 43 / 53

Simple approach to regression?

[Figure: training data (x, t) with the prediction y(x_0) at x_0 marked; the prediction now uses only the observations with x-values near x_0.]

Predict the mean of the target values of training observations with x-value closest to x_0.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 44 / 53

Nearest-neighbor functions

Consider a regression problem with input variable x and output variable t:

for each input value x, we define a neighborhood N_k(x) containing the indices n of the k points (x_n, t_n) from the training data that are the closest to x;

from the neighborhood function N_k(x), we construct the function

y_k(x) = \frac{1}{k} \sum_{n ∈ N_k(x)} t_n

The function yk(x) is called the k-nearest neighbor function.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 45 / 53

An example learning problem

In a clinical study of risk factors for cardiovascular disease,

the independent variable x is a patient’s waist circumference;

the dependent variable t is a patient’s deep abdominal adipose tissue.

The researchers want to predict the amount of deep abdominal adipose tissue from a simple measurement of waist circumference.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 46 / 53

Scatterplot of the data

For learning the relationship between x and t, measurements (x_n, t_n) on 109 men between 18 and 42 years of age are available:

[Scatterplot: deep abdominal AT (Y) against waist circumference (X) for the 109 men.]

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 47 / 53

An example

We consider eight (consecutive) points (x_n, t_n) from the clinical study of risk factors for cardiovascular disease:

1. (68.85, 55.78)    5. (73.10, 38.21)
2. (71.85, 21.68)    6. (73.20, 32.22)
3. (71.90, 28.32)    7. (73.80, 43.35)
4. (72.60, 25.89)    8. (74.15, 33.41)

[Scatterplot: these eight points, deep abdominal AT (Y) against waist circumference (X).]

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 48 / 53

The example (continued)

With k = 2, the neighborhood of x = 73.00 equals

N2(x = 73.00) = {5, 6}

and we find

y_2(x = 73.00) = (38.21 + 32.22)/2 = 35.215

With k = 5, we find y5(x = 73.00) = 33.598.
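A minimal sketch (not part of the slides) of this computation; it reproduces the two values above from the eight points listed on the previous slide.

```python
# Minimal sketch (not from the slides): the k-nearest-neighbor function y_k(x)
# applied to the eight points above.
import numpy as np

x = np.array([68.85, 71.85, 71.90, 72.60, 73.10, 73.20, 73.80, 74.15])
t = np.array([55.78, 21.68, 28.32, 25.89, 38.21, 32.22, 43.35, 33.41])

def knn_predict(x0, x, t, k):
    """Average the targets of the k training points closest to x0."""
    neighbours = np.argsort(np.abs(x - x0))[:k]   # indices of the k nearest points
    return t[neighbours].mean()

print(knn_predict(73.00, x, t, k=2))   # 35.215
print(knn_predict(73.00, x, t, k=5))   # 33.598
```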

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 49 / 53

The example continued

With k = 2 and Euclidean distance, the following k-nearest neighbor function is constructed from the training data:

[Figure: kNN regression function with k = 2; Adipose Tissue plotted against Waist Circumference.]

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 50 / 53

The example continued

With k = 20 and Euclidean distance, the following k-nearest neighbor function is constructed from the training data:

[Figure: kNN regression function with k = 20; Adipose Tissue plotted against Waist Circumference.]

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 51 / 53

kNN: going to the extremes

[Figure: kNN regression functions with k = 1 (left) and k = 109 (right); Adipose Tissue plotted against Waist Circumference.]

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 52 / 53

The idea of k-nearest neighbor

We recall that, for a regression problem, the best prediction for the output variable t at the input value x is the mean E[t | x]:

the nearest-neighbor function approximates the mean by averaging over the training data;

the nearest-neighbor function relaxes conditioning at a specific input value to the neighborhood of that value.

The nearest-neighbor function thus implements the idea of selecting the means for prediction directly.

Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 53 / 53