Linear Models for Classification
Sung-Yub Kim
Dept of IE, Seoul National University
February 18, 2017
1 Introduction
2 Discriminant Functions
3 Probabilistic Generative Models
4 Probabilistic Discriminative Models
5 The Laplace Approximation
6 Bayesian Logistic Regression
Introduction
Bishop, C. M. Pattern Recognition and Machine Learning. Information Science and Statistics, Springer, 2006.
Murphy, K. P. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning, MIT Press, 2012.
Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.
Goal: take an input vector x and assign it to one of K discrete classes C_k, where k = 1, ..., K.
The input space is divided into decision regions whose boundaries are called decision boundaries or decision surfaces.
Linearly Separable → Separating Hyperplane Theorem
1-of-K Coding Scheme
t = (0, 0, 1, 0, 0)^T   (1)

Each t_k can be interpreted as the probability that the class is C_k.

Generalized Linear Model
y(x) = f(w^T x + w_0)   (2)

The nonlinear function f(·), needed to produce a probabilistic output, is called the activation function, and its inverse is called the link function. Since the decision surfaces y(x) = c are linear functions of x even when f is nonlinear, these models are called Generalized Linear Models (GLMs).
Linear Models for Classification
Discriminant Functions
Linear Discriminant Functions
Linear Discriminant Function
y(x) = w^T x + w_0 = w̃^T x̃   (3)

w is called the weight vector and w_0 the bias (−w_0 is sometimes called the threshold).
Decision Criteria
Assign x to C_1 if y(x) ≥ 0, and to C_2 if y(x) < 0.   (4)
Target Coding Scheme
There are one-versus-the-rest and one-versus-one coding schemes, but both suffer from ambiguous regions. Therefore, we use a single K-class discriminant comprising K linear functions of the form
y_k(x) = w_k^T x + w_{k0}   (5)

and then assign x to C_i if y_i(x) > y_j(x) for all j ≠ i.   (6)

The decision boundary between classes C_i and C_j is then given by

(w_i − w_j)^T x + (w_{i0} − w_{j0}) = 0   (7)
Least Square
Model

y(x) = W̃^T x̃   (8)

where the k-th column of W̃ is w̃_k = (w_{k0}, w_k^T)^T and x̃ = (1, x^T)^T.
SSE
E_D(W̃) = (1/2) ‖X̃W̃ − T‖_F^2 = (1/2) tr{(X̃W̃ − T)^T (X̃W̃ − T)}   (9)

where the n-th row of X̃ is x̃_n^T.
Closed-form Solution
W̃ = (X̃^T X̃)^{-1} X̃^T T = X̃^† T   (10)

Therefore, the discriminant function is

y(x) = T^T (X̃^†)^T x̃   (11)
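As a concrete illustration (not from the slides), here is a minimal NumPy sketch of the least-squares classifier above; the toy data and all variable names are my own.

```python
import numpy as np

# Toy data: N samples, D features, K classes (synthetic, for illustration only)
rng = np.random.default_rng(0)
N, D, K = 90, 2, 3
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(N // K, D))
               for m in ([0, 0], [2, 2], [0, 3])])
labels = np.repeat(np.arange(K), N // K)

T = np.eye(K)[labels]                      # 1-of-K target matrix, shape (N, K)
X_tilde = np.hstack([np.ones((N, 1)), X])  # prepend bias feature: x_tilde = (1, x^T)^T

# Closed-form solution (10): W_tilde = pinv(X_tilde) @ T
W_tilde = np.linalg.pinv(X_tilde) @ T

# Discriminant (11): y(x) = W_tilde^T x_tilde; predict the class with the largest output
scores = X_tilde @ W_tilde
pred = scores.argmax(axis=1)
print("training accuracy:", (pred == labels).mean())
```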
Limitations

1 Output values cannot have a probabilistic interpretation.
2 LS solutions lack robustness to outliers.
3 The SSE function penalizes predictions that are 'too correct'.
Origin of Limitations
These arise because least squares corresponds to maximum likelihood under the assumption of a Gaussian conditional distribution, whereas binary target vectors have a distribution that is far from Gaussian.
Fisher’s Linear Discriminant Analysis
Motivation: Dimensionality Reduction
Simple Model: choose w ∈ {w : ‖w‖ = 1} to maximize

m_2 − m_1 = w^T (m_2 − m_1)   (12)

where m_k = (1/N_k) Σ_{n∈C_k} x_n for k = 1, 2.
Revised Model: choose w to give a large separation between the projected class means while also giving a small variance within each class. Therefore, we maximize
J(w) = (m_2 − m_1)^2 / (s_1^2 + s_2^2) = (w^T S_B w) / (w^T S_W w)   (13)

where s_k^2 = Σ_{n∈C_k} (y_n − m_k)^2 is the within-class variance of the transformed class C_k, S_B = (m_2 − m_1)(m_2 − m_1)^T is the between-class covariance matrix, and S_W = Σ_{n∈C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n∈C_2} (x_n − m_2)(x_n − m_2)^T is the total within-class covariance matrix.
Closed-form Solution
By a simple calculation,

w ∝ S_W^{-1} (m_2 − m_1)   (14)

Relation to LS
By setting the target values to N/N_1 for class C_1 and −N/N_2 for class C_2, one can show that Fisher's LDA is equivalent to the least-squares solution.
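A minimal NumPy sketch of the two-class Fisher direction (14); the helper name fisher_direction and the toy blobs are my own, not from the slides.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's discriminant direction w proportional to S_W^{-1}(m2 - m1).

    X1, X2: arrays of shape (N1, D) and (N2, D) holding the samples of C1 and C2.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter S_W = sum over both classes of (x_n - m_k)(x_n - m_k)^T
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)      # solve instead of forming an explicit inverse
    return w / np.linalg.norm(w)

# Example: project two Gaussian blobs onto the Fisher direction
rng = np.random.default_rng(1)
X1 = rng.normal([0, 0], 1.0, size=(50, 2))
X2 = rng.normal([3, 1], 1.0, size=(50, 2))
w = fisher_direction(X1, X2)
print("Fisher direction w =", w)
print("projected class means:", (X1 @ w).mean(), (X2 @ w).mean())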
Multiple Classes Case
One multi-class generalization of Fisher's criterion is to maximize

J(W) = tr{(W S_W W^T)^{-1} (W S_B W^T)}   (15)

where S_W = Σ_{k=1}^{K} S_k, S_k = Σ_{n∈C_k} (x_n − m_k)(x_n − m_k)^T, and m_k = (1/N_k) Σ_{n∈C_k} x_n.
The total covariance matrix S_T = Σ_{n=1}^{N} (x_n − m)(x_n − m)^T can be decomposed as

S_T = S_W + S_B   (16)

where S_B = Σ_{k=1}^{K} N_k (m_k − m)(m_k − m)^T.
The Perceptron algorithm
Motivation: what if we first apply a fixed nonlinear feature transformation φ(x) and then build a linear classifier on top of it?
Model
y(x) = f(w^T φ(x))   (17)

where

f(a) = +1 if a ≥ 0, and −1 if a < 0.   (18)
Error Function

E_P(w) = −Σ_{n∈M} w^T φ_n t_n   (19)

where M is the set of misclassified patterns.
SGD
Applying stochastic gradient descent, we get

w^{(τ+1)} = w^{(τ)} − η∇E_P(w) = w^{(τ)} + η φ_n t_n   (20)
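A minimal sketch of the perceptron update (20) in NumPy; function names, learning rate, and toy data are my own assumptions.

```python
import numpy as np

def perceptron_train(Phi, t, eta=1.0, max_epochs=100):
    """Perceptron learning on basis vectors Phi (N, M) with targets t in {-1, +1}.

    Applies the SGD update (20): w <- w + eta * phi_n * t_n for misclassified points.
    """
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(max_epochs):
        mistakes = 0
        for n in range(N):
            if t[n] * (w @ Phi[n]) <= 0:       # pattern n is misclassified
                w = w + eta * Phi[n] * t[n]
                mistakes += 1
        if mistakes == 0:                      # converged (data linearly separable)
            break
    return w

# Toy usage: phi(x) = (1, x) so the bias is learned as w[0]
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 1, size=(40, 2)), rng.normal(2, 1, size=(40, 2))])
t = np.hstack([-np.ones(40), np.ones(40)])
Phi = np.hstack([np.ones((80, 1)), X])
w = perceptron_train(Phi, t)
print("training accuracy:", (np.sign(Phi @ w) == t).mean())
```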
Perceptron Convergence Theorem
If the training data set is linearly separable, then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps.
Limitations
1 The theorem says nothing about the convergence rate, and in practice convergence can be very slow.
2 The perceptron is still based on linear combinations of fixed basis functions. This limitation is addressed in chapters 5 and 6.
Probabilistic Generative Models
Introduction
Generative Approach
Model the class-conditional densities p(x|C_k) and the class priors p(C_k), then use Bayes' theorem to obtain the posterior p(C_k|x).
Posterior Probability and Activation Function
For two classes, the posterior probability can be written as

p(C_1|x) = p(x|C_1)p(C_1) / (p(x|C_1)p(C_1) + p(x|C_2)p(C_2)) = exp(a) / (exp(a) + 1) = σ(a)   (21)

where a = ln [p(x|C_1)p(C_1) / (p(x|C_2)p(C_2))] and σ is the logistic sigmoid function. The quantity a is called the log odds.

More generally, for K classes the posterior probability can be written as

p(C_k|x) = p(x|C_k)p(C_k) / Σ_j p(x|C_j)p(C_j) = exp(a_k) / Σ_j exp(a_j)   (22)

where a_j = ln(p(x|C_j)p(C_j)). This normalized exponential is called the softmax function; it acts as a smoothed version of argmax.
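For reference, a small NumPy sketch of the sigmoid (21) and softmax (22) activations; the max-shift inside softmax is a standard numerical trick I have added, not something from the slides.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    """Softmax over the last axis, with the usual max-shift for numerical stability."""
    a = a - a.max(axis=-1, keepdims=True)   # subtracting a constant leaves (22) unchanged
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical log joint probabilities a_k = ln p(x|C_k)p(C_k)
a = np.array([2.0, 0.5, -1.0])
print(softmax(a))       # posterior p(C_k|x), sums to 1
print(sigmoid(1.2))
```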
Continuous inputs
Model
If we assume the class-conditional densities are Gaussian with a shared covariance,

p(x|C_k) = (1 / (2π)^{D/2}) (1 / |Σ|^{1/2}) exp{−(1/2)(x − µ_k)^T Σ^{-1} (x − µ_k)}   (23)

for k = 1, 2, then we get

p(C_1|x) = σ(w^T x + w_0)   (24)

where w = Σ^{-1}(µ_1 − µ_2) and w_0 = −(1/2)µ_1^T Σ^{-1} µ_1 + (1/2)µ_2^T Σ^{-1} µ_2 + ln(p(C_1)/p(C_2)).

Similarly, for multiple classes the model is

a_k(x) = w_k^T x + w_{k0}   (25)

where w_k = Σ^{-1} µ_k and w_{k0} = −(1/2)µ_k^T Σ^{-1} µ_k + ln p(C_k).
MLE
Gaussian Class-Conditional Densities
The likelihood function is

p(t, X|π, µ_1, µ_2, Σ) = Π_{n=1}^{N} [π N(x_n|µ_1, Σ)]^{t_n} [(1 − π) N(x_n|µ_2, Σ)]^{1−t_n}   (26)

where π is the prior probability of class C_1, µ_k is the mean of class k, t_n is 1 if the n-th data point belongs to class C_1 and 0 otherwise, and the two classes share a covariance matrix Σ.

Closed-form Solution
The maximum likelihood estimates can be obtained exactly:

π = N_1 / (N_1 + N_2)   (27)

µ_1 = (1/N_1) Σ_{n=1}^{N} t_n x_n,   µ_2 = (1/N_2) Σ_{n=1}^{N} (1 − t_n) x_n   (28)

Σ = S = (N_1/N) S_1 + (N_2/N) S_2   (29)

where S_1 = (1/N_1) Σ_{n∈C_1} (x_n − µ_1)(x_n − µ_1)^T and S_2 = (1/N_2) Σ_{n∈C_2} (x_n − µ_2)(x_n − µ_2)^T.
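A minimal NumPy sketch that fits the MLE estimates (27)-(29) and evaluates the resulting posterior (24); the function names and toy data are my own assumptions.

```python
import numpy as np

def fit_gaussian_generative(X, t):
    """MLE for the shared-covariance Gaussian generative model, equations (27)-(29).

    X: (N, D) inputs; t: (N,) binary targets with t_n = 1 for class C1, 0 for C2.
    """
    X1, X2 = X[t == 1], X[t == 0]
    N1, N2 = len(X1), len(X2)
    pi = N1 / (N1 + N2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1) / N1
    S2 = (X2 - mu2).T @ (X2 - mu2) / N2
    S = (N1 * S1 + N2 * S2) / (N1 + N2)
    return pi, mu1, mu2, S

def posterior_c1(X, pi, mu1, mu2, S):
    """p(C1|x) = sigma(w^T x + w0) with w, w0 as in (24)."""
    Sinv = np.linalg.inv(S)
    w = Sinv @ (mu1 - mu2)
    w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(pi / (1 - pi))
    return 1.0 / (1.0 + np.exp(-(X @ w + w0)))

# Toy usage with two Gaussian blobs
rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 1, size=(60, 2)), rng.normal([2, 2], 1, size=(60, 2))])
t = np.hstack([np.ones(60), np.zeros(60)])
pi, mu1, mu2, S = fit_gaussian_generative(X, t)
print(posterior_c1(X[:3], pi, mu1, mu2, S))
```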
Discrete features
Naive Bayes
In the general case we would need to model all possible combinations of discrete feature values. But under the naive Bayes assumption, in which the feature values are treated as independent conditioned on the class C_k, the class-conditional likelihood becomes very easy to compute:

p(x|C_k) = Π_{i=1}^{D} µ_{ki}^{x_i} (1 − µ_{ki})^{1−x_i}   (30)
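A sketch of the Bernoulli naive Bayes likelihood (30) in NumPy; the add-alpha (Laplace) smoothing, which keeps every µ_{ki} strictly between 0 and 1, is my addition rather than part of the slides.

```python
import numpy as np

def fit_bernoulli_nb(X, y, K, alpha=1.0):
    """Estimate mu[k, i] = p(x_i = 1 | C_k) for binary features, with add-alpha smoothing."""
    N, D = X.shape
    mu, prior = np.zeros((K, D)), np.zeros(K)
    for k in range(K):
        Xk = X[y == k]
        mu[k] = (Xk.sum(axis=0) + alpha) / (len(Xk) + 2 * alpha)
        prior[k] = len(Xk) / N
    return mu, prior

def log_joint(x, mu, prior):
    """ln p(x|C_k) + ln p(C_k), using the naive Bayes likelihood (30)."""
    return (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + np.log(prior)

# Toy usage: 2 classes, 4 binary features
rng = np.random.default_rng(4)
X = np.vstack([(rng.random((50, 4)) < 0.8).astype(int),   # class 0: features mostly on
               (rng.random((50, 4)) < 0.2).astype(int)])  # class 1: features mostly off
y = np.repeat([0, 1], 50)
mu, prior = fit_bernoulli_nb(X, y, K=2)
print(np.argmax(log_joint(X[0], mu, prior)))   # predicted class for the first sample
```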
Exponential family
Likelihood of Exponential family
p(x|η) = h(x) g(η) exp{η^T u(x)}   (31)

If we restrict attention to the case u(x) = x and introduce a scale parameter s, we can represent the density as

p(x|η_k, s) = (1/s) h((1/s)x) g(η_k) exp{(1/s) η_k^T x}   (32)

Closed-form Solution for the Exponential Family
From the above, in binary classification we get

a(x) = (1/s)(η_1 − η_2)^T x + ln(g(η_1)/g(η_2)) + ln(p(C_1)/p(C_2))   (33)

and in multi-class classification

a_k(x) = (1/s) η_k^T x + ln g(η_k) + ln p(C_k)   (34)
Probabilistic Discriminative Models
Introduction
Discriminative Approach
Use the functional form of the GLM explicitly and determine its parameters directly by maximum likelihood.
Advantages
1 Fewer adaptive parameters.
2 No assumption about the form of the class-conditional densities is required.
Fixed Basis Functions
In the discriminative approach we model the posterior probabilities directly and then apply standard decision theory. Fixed basis functions have significant limitations, which are later addressed by allowing the basis functions to adapt to the data.
Logistic Regression
Model
p(C_1|φ) = y(φ) = σ(w^T φ)   (35)

With this model we only need to find M adaptive parameters, which is considerably simpler than the Gaussian generative model, whose shared covariance matrix alone requires M(M+1)/2 parameters.

Likelihood
We can write the likelihood as

p(t|w) = Π_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1−t_n}   (36)

and taking the negative logarithm gives the cross-entropy error function

E(w) = E_t[−ln y] ≃ −Σ_{n=1}^{N} {t_n ln y_n + (1 − t_n) ln(1 − y_n)}   (37)
Gradient of the Cross Entropy
Taking the gradient of the error function, we get

∇_w E(w) = Σ_{n=1}^{N} (y_n − t_n) φ_n   (38)

so the contribution of each data point has the form

(error) × (basis function vector)   (39)

Stochastic Gradient Descent
This gives a sequential algorithm in which the weight vector is updated using the per-sample gradient

∇_w E_n(w) = (y_n − t_n) φ_n   (40)
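A minimal NumPy sketch of logistic regression trained with the per-sample gradient (40); the learning rate, epoch count, and toy data are my own choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logreg_sgd(Phi, t, eta=0.1, epochs=200):
    """Stochastic gradient descent for logistic regression.

    Uses the per-sample gradient (40): grad E_n(w) = (y_n - t_n) * phi_n.
    """
    N, M = Phi.shape
    w = np.zeros(M)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for n in rng.permutation(N):
            y_n = sigmoid(w @ Phi[n])
            w -= eta * (y_n - t[n]) * Phi[n]
    return w

# Toy usage: phi(x) = (1, x)
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(1, 1, size=(50, 2))])
t = np.hstack([np.zeros(50), np.ones(50)])
Phi = np.hstack([np.ones((100, 1)), X])
w = logreg_sgd(Phi, t)
print("accuracy:", ((sigmoid(Phi @ w) > 0.5) == t).mean())
```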
IRLS
Iterative Reweighted Least Squares
For logistic regression there is no longer a closed-form solution, due to the nonlinearity of the logistic sigmoid. Fortunately, the cross-entropy error function is convex, so an iterative method can find the global optimum.

Newton-Raphson Method
The Newton-Raphson method is an iterative scheme defined by

w^{(τ+1)} = w^{(τ)} − H^{-1} ∇_w E(w)   (41)

The gradient and Hessian of our error function are

∇_w E(w) = Σ_{n=1}^{N} (y_n − t_n) φ_n = Φ^T (y − t)   (42)

∇²_w E(w) = Σ_{n=1}^{N} y_n (1 − y_n) φ_n φ_n^T = Φ^T R Φ   (43)

where R = diag(y ⊙ (1 − y)). Because the weighting matrix R depends on w, it must be recomputed at every iteration; each Newton step then amounts to a weighted least-squares problem, hence the name.
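A sketch of the Newton/IRLS updates (41)-(43) in NumPy; the small ridge term added to the Hessian is my own numerical safeguard, not part of the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, n_iter=10):
    """Newton-Raphson / IRLS updates for logistic regression.

    R is recomputed from the current w at every iteration because it depends on y.
    """
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t)                     # (42)
        R = np.diag(y * (1 - y))                   # (43): R = diag(y ⊙ (1 - y))
        H = Phi.T @ R @ Phi + 1e-8 * np.eye(M)     # small ridge for numerical safety (my addition)
        w = w - np.linalg.solve(H, grad)           # w <- w - H^{-1} grad
    return w

# Toy usage
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(1, 1, size=(50, 2))])
t = np.hstack([np.zeros(50), np.ones(50)])
Phi = np.hstack([np.ones((100, 1)), X])
print("IRLS weights:", irls(Phi, t))
```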
Multiclass Logistic Regression
Likelihood
Similarly to the binary case, the likelihood of the model is

p(T|w_1, ..., w_K) = Π_{n=1}^{N} Π_{k=1}^{K} y_{nk}^{t_{nk}}   (44)

Taking the negative logarithm gives the cross-entropy error function

E(w_1, ..., w_K) = E_T[−ln p(y)] ≃ −Σ_{n=1}^{N} Σ_{k=1}^{K} t_{nk} ln y_{nk}   (45)

By an argument similar to the binary case,

∇_{w_j} E(w_1, ..., w_K) = Σ_{n=1}^{N} (y_{nj} − t_{nj}) φ_n   (46)

∇_{w_k} ∇_{w_j} E(w_1, ..., w_K) = Σ_{n=1}^{N} y_{nk} (I_{kj} − y_{nj}) φ_n φ_n^T   (47)
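A small sketch of multiclass logistic regression using the gradient (46) with plain batch gradient descent; a Newton scheme using the block Hessian (47) would also work, but this keeps the example short. All names and the toy data are my own.

```python
import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def multiclass_logreg_gd(Phi, T, eta=0.1, n_iter=500):
    """Batch gradient descent on the multiclass cross entropy (45),
    using the gradient (46): grad_{w_j} E = sum_n (y_nj - t_nj) phi_n."""
    N, M = Phi.shape
    K = T.shape[1]
    W = np.zeros((M, K))                 # column j holds w_j
    for _ in range(n_iter):
        Y = softmax(Phi @ W)             # y_nk, shape (N, K)
        W -= eta * Phi.T @ (Y - T) / N   # all K gradients at once
    return W

# Toy usage: 3 classes in 2D, phi(x) = (1, x)
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(m, 0.6, size=(40, 2)) for m in ([0, 0], [3, 0], [0, 3])])
labels = np.repeat(np.arange(3), 40)
Phi = np.hstack([np.ones((120, 1)), X])
T = np.eye(3)[labels]
W = multiclass_logreg_gd(Phi, T)
print("accuracy:", (softmax(Phi @ W).argmax(axis=1) == labels).mean())
```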
Probit Regression
Inverse Probit Function
The inverse probit function is defined as

Φ(a) = ∫_{−∞}^{a} N(θ|0, 1) dθ   (48)

and the GLM based on this activation function is known as probit regression.

Limitations
The logistic sigmoid decays asymptotically like exp(−x) for x → ∞, whereas the tails of the inverse probit function decay like exp(−x²); the probit model is therefore more sensitive to outliers.
Canonical Link Functions
Canonical Link Function
If we assume a conditional distribution for the target variable from the exponential family, there is a corresponding choice of activation function known as the canonical link function.

Likelihood Function
First we assume an exponential-family conditional distribution for the target with scale parameter s:

p(t|η, s) = (1/s) h(t/s) g(η) exp{ηt/s}   (49)

Taking the first moment of this distribution gives

y ≡ E[t|η] = −s (d/dη) ln g(η)   (50)

We denote this relation by η = ψ(y). In the GLM y = f(a) we call f(·) the activation function and its inverse f^{-1}(·) the link function. The log-likelihood over the data set is

ln p(t|η, s) = Σ_{n=1}^{N} {ln g(η_n) + η_n t_n / s} + const.   (51)
Gradient of the Log-Likelihood
From the previous page,

∇_w ln p(t|η, s) = Σ_{n=1}^{N} {(d/dη_n) ln g(η_n) + t_n/s} (dη_n/dy_n)(dy_n/da_n) ∇_w a_n
                = Σ_{n=1}^{N} (1/s) {t_n − y_n} ψ′(y_n) f′(a_n) φ_n   (52)

If we choose the canonical link f^{-1}(y) = ψ(y), then f′(a_n) ψ′(y_n) = 1 and the gradient of the error function E(w) = −ln p(t|η, s) takes the simple form

∇_w E(w) = (1/s) Σ_{n=1}^{N} {y_n − t_n} φ_n
The Laplace Approximation
Process of Laplace Approximation
Motivation
Find a Gaussian approximation to a probability density defined over a set of continuous variables.

How?

1 Find a mode z_0 of f(z) and evaluate A = −∇∇ ln f(z) at z = z_0.
2 Using this information, approximate

f(z) ≃ f(z_0) exp{−(1/2)(z − z_0)^T A (z − z_0)}   (53)

3 Normalizing the distribution gives

q(z) = (|A|^{1/2} / (2π)^{M/2}) exp{−(1/2)(z − z_0)^T A (z − z_0)} = N(z|z_0, A^{-1})   (54)

Limitation
Since the approximation is Gaussian and centered on a single mode, it can fail badly for multimodal or strongly non-Gaussian densities.
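A tiny one-dimensional sketch of steps 1-3 in NumPy, using an unnormalized gamma-shaped density whose mode and second derivative are available in closed form; the example density and all names are my own, and the estimate of the normalizer uses the expression for Z that appears on the next slide.

```python
import numpy as np
from math import gamma

# Unnormalized gamma-like density f(z) = z^(a-1) * exp(-b z) on z > 0
a, b = 5.0, 2.0
f = lambda z: z ** (a - 1) * np.exp(-b * z)

# Step 1: mode z0 (known in closed form here) and A = -(d^2/dz^2) ln f(z) at z0
z0 = (a - 1) / b
A = (a - 1) / z0 ** 2                      # second derivative of (a-1) ln z - b z, negated

# Steps 2-3: Gaussian approximation q(z) = N(z | z0, A^{-1}), equation (54)
q = lambda z: np.sqrt(A / (2 * np.pi)) * np.exp(-0.5 * A * (z - z0) ** 2)

# Laplace estimate of the normalizer Z = f(z0) (2π)^{1/2} / A^{1/2}, vs. the exact value
Z_laplace = f(z0) * np.sqrt(2 * np.pi / A)
Z_exact = gamma(a) / b ** a
print(Z_laplace, Z_exact)
```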
Model Comparison and BIC
Model Evidence
The Laplace approximation gives the normalization constant

Z ≃ f(z_0) (2π)^{M/2} / |A|^{1/2}   (55)

We can apply this to the model evidence

p(D) = ∫ p(D|θ) p(θ) dθ   (56)

Setting f(θ) = p(D|θ)p(θ) and Z = p(D), we get

ln p(D) ≃ ln p(D|θ_MAP) + ln p(θ_MAP) + (M/2) ln(2π) − (1/2) ln|A|   (57)

If we assume the Gaussian prior is broad and the Hessian has full rank, this reduces to the BIC approximation

ln p(D) ≃ ln p(D|θ_MAP) − (1/2) M ln N   (58)

A more accurate estimate of the model evidence is developed in chapter 5.
Bayesian Logistic Regression
Laplace Approximation
How?
1 First, choose a Gaussian prior

p(w) = N(w|m_0, S_0)   (59)

2 The log posterior is then

ln p(w|t) = −(1/2)(w − m_0)^T S_0^{-1} (w − m_0) + Σ_{n=1}^{N} {t_n ln y_n + (1 − t_n) ln(1 − y_n)} + const.   (60)

3 Approximate the posterior using the Laplace approximation

q(w) = N(w|w_MAP, S_N)   (61)

where

S_N^{-1} = S_0^{-1} + Σ_{n=1}^{N} y_n (1 − y_n) φ_n φ_n^T   (62)
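A NumPy sketch of this procedure: Newton steps on the log posterior (60) to find w_MAP, followed by S_N from (62). The function name, iteration count, and toy data are my own assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_posterior(Phi, t, m0, S0, n_iter=20):
    """Laplace approximation q(w) = N(w | w_MAP, S_N) for Bayesian logistic regression."""
    S0_inv = np.linalg.inv(S0)
    w = m0.copy()
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        grad = S0_inv @ (w - m0) + Phi.T @ (y - t)           # gradient of -log posterior
        H = S0_inv + Phi.T @ (Phi * (y * (1 - y))[:, None])  # S_N^{-1} at the current w
        w = w - np.linalg.solve(H, grad)                     # Newton step toward w_MAP
    y = sigmoid(Phi @ w)
    SN_inv = S0_inv + Phi.T @ (Phi * (y * (1 - y))[:, None]) # (62)
    return w, np.linalg.inv(SN_inv)

# Toy usage: phi(x) = (1, x), broad isotropic Gaussian prior
rng = np.random.default_rng(8)
X = np.vstack([rng.normal(-1, 1, size=(40, 2)), rng.normal(1, 1, size=(40, 2))])
t = np.hstack([np.zeros(40), np.ones(40)])
Phi = np.hstack([np.ones((80, 1)), X])
m0, S0 = np.zeros(3), 10.0 * np.eye(3)
w_map, S_N = laplace_posterior(Phi, t, m0, S0)
print("w_MAP =", w_map)
```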
Predictive Distribution
Predictive Distribution via the Laplace Approximation
Using the Laplace approximation, we get

p(C_1|φ, t) = ∫ p(C_1|φ, w) p(w|t) dw ≃ ∫ σ(w^T φ) q(w) dw   (63)

Denoting a = w^T φ, we can write

σ(w^T φ) = ∫ δ(a − w^T φ) σ(a) da   (64)

Therefore

p(C_1|φ, t) ≃ ∫ σ(a) ∫ δ(a − w^T φ) q(w) dw da = ∫ σ(a) p(a) da   (65)

Since p(a) is the marginal of the Gaussian q(w) under a linear map, p(a) is also Gaussian. Its mean and variance are

µ_a = E[a] = ∫ a p(a) da = ∫ w^T φ q(w) dw = w_MAP^T φ   (66)

σ_a² = ∫ {a² − E[a]²} p(a) da = ∫ {(w^T φ)² − (w_MAP^T φ)²} q(w) dw = φ^T S_N φ   (67)
Approximate Convolution
This predictive distribution is the convolution of a Gaussian with a logistic sigmoid and cannot be evaluated analytically. We therefore exploit the similarity between the sigmoid and the inverse probit function, whose convolution with a Gaussian can be evaluated analytically:

∫ Φ(λa) N(a|µ, σ²) da = Φ(µ / (λ^{-2} + σ²)^{1/2})   (68)

Therefore

∫ σ(a) N(a|µ, σ²) da ≃ σ(κ(σ²) µ)   (69)

where κ(σ²) = (1 + πσ²/8)^{-1/2}. The approximate predictive distribution thus takes the form

p(C_1|φ, t) ≃ σ(κ(σ_a²) µ_a)   (70)
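To close the loop, a short NumPy sketch of the approximate predictive probability (66)-(70); the posterior moments w_map and S_N below are hypothetical placeholders that would normally come from the Laplace approximation of the previous section.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive_prob(phi, w_map, S_N):
    """Approximate p(C1 | phi, t) = sigma(kappa(sigma_a^2) * mu_a), equations (66)-(70)."""
    mu_a = w_map @ phi                           # (66)
    sigma2_a = phi @ S_N @ phi                   # (67)
    kappa = 1.0 / np.sqrt(1.0 + np.pi * sigma2_a / 8.0)
    return sigmoid(kappa * mu_a)

# Hypothetical posterior moments and a test point with phi(x) = (1, x)
w_map = np.array([0.3, 1.5, -0.7])
S_N = np.diag([0.5, 0.2, 0.2])
phi = np.array([1.0, 0.8, -0.4])
print("p(C1|phi, t) ≈", predictive_prob(phi, w_map, S_N))
print("MAP plug-in  :", sigmoid(w_map @ phi))   # ignores posterior uncertainty, so less moderated
```

Note how the Bayesian prediction is pulled toward 0.5 relative to the MAP plug-in whenever the posterior variance φ^T S_N φ is large.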