Linear Models for Classification
Sung-Yub Kim
Dept of IE, Seoul National University
February 18, 2017
1 Introduction
2 Discriminant Functions
3 Probabilistic Generative Models
4 Probabilistic Discriminative Models
5 The Laplace Approximation
6 Bayesian Logistic Regression
Introduction
Bishop, C. M. Pattern Recognition and Machine Learning. Information Science and Statistics, Springer, 2006.
Murphy, K. P. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning, MIT Press, 2012.
Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.
Goal: take an input vector x and assign it to one of K discrete classes C_k, where k = 1, ..., K.
The input space is divided into decision regions whose boundaries are called decision boundaries or decision surfaces.
Linearly Separable → Separating Hyperplane Theorem
1-of-K Coding Scheme
t = (0, 0, 1, 0, 0)^T   (1)

Each t_k can be interpreted as the probability that the class is C_k.

Generalized Linear Model
y(x) = f(w^T x + w_0)   (2)

The nonlinear function f(·), needed to produce a probabilistic output, is called the activation function, and its inverse is called the link function. Since the decision surfaces y(x) = c are linear functions of x even when f is nonlinear, these models are called Generalized Linear Models (GLMs).
Linear Models for Classification
Discriminant Functions
Linear Discriminant Functions
Linear Discriminant Function
y(x) = w^T x + w_0 = w̃^T x̃   (3)

w is called the weight vector and w_0 the bias (−w_0 is sometimes called the threshold).
Decision Criteria
Assign x to C_1 if y(x) ≥ 0, and to C_2 if y(x) < 0.   (4)
Target Coding Scheme
There are one-versus-the-rest and one-versus-one coding schemes, but both suffer from ambiguous regions. Therefore, we use a single K-class discriminant comprising K linear functions of the form
y_k(x) = w_k^T x + w_{k0}   (5)

and then assign x to C_i if y_i(x) > y_j(x) for all j ≠ i.   (6)

The decision boundary between classes C_i and C_j is then given by

(w_i − w_j)^T x + (w_{i0} − w_{j0}) = 0   (7)
Least Square
Model

y(x) = W̃^T x̃   (8)

where the k-th column of W̃ is w̃_k = (w_{k0}, w_k^T)^T and x̃ = (1, x^T)^T.
SSE
E_D(W̃) = (1/2) ‖X̃W̃ − T‖_F^2 = (1/2) tr{(X̃W̃ − T)^T (X̃W̃ − T)}   (9)

where the n-th row of X̃ is x̃_n^T.
Closed-form Solution
W̃ = (X̃^T X̃)^{-1} X̃^T T = X̃^† T   (10)

Therefore, the discriminant function is

y(x) = T^T (X̃^†)^T x̃   (11)
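As a concrete illustration (not from the slides), here is a minimal NumPy sketch of the least-squares classifier above; the toy data and all variable names are my own.

```python
import numpy as np

# Toy data: N samples, D features, K classes (synthetic, for illustration only)
rng = np.random.default_rng(0)
N, D, K = 90, 2, 3
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(N // K, D))
               for m in ([0, 0], [2, 2], [0, 3])])
labels = np.repeat(np.arange(K), N // K)

T = np.eye(K)[labels]                      # 1-of-K target matrix, shape (N, K)
X_tilde = np.hstack([np.ones((N, 1)), X])  # prepend bias feature: x_tilde = (1, x^T)^T

# Closed-form solution (10): W_tilde = pinv(X_tilde) @ T
W_tilde = np.linalg.pinv(X_tilde) @ T

# Discriminant (11): y(x) = W_tilde^T x_tilde; predict the class with the largest output
scores = X_tilde @ W_tilde
pred = scores.argmax(axis=1)
print("training accuracy:", (pred == labels).mean())
```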
Limitations

1 Output values cannot have a probabilistic interpretation.
2 LS solutions lack robustness to outliers.
3 The SSE function penalizes predictions that are 'too correct'.
Origin of Limitations
These arise because least squares corresponds to maximum likelihood under the assumption of a Gaussian conditional distribution, whereas binary target vectors have a distribution that is far from Gaussian.
Fisher’s Linear Discriminant Analysis
Motivation: Dimensionality Reduction
Simple Model: choose w ∈ {w : ‖w‖ = 1} to maximize

m_2 − m_1 = w^T (m_2 − m_1)   (12)

where m_k = (1/N_k) Σ_{n∈C_k} x_n for k = 1, 2.
Revised Model: choose w to give a large separation between the projected class means while also giving a small variance within each class. Therefore, we maximize
J(w) = (m_2 − m_1)^2 / (s_1^2 + s_2^2) = (w^T S_B w) / (w^T S_W w)   (13)

where s_k^2 = Σ_{n∈C_k} (y_n − m_k)^2 is the within-class variance of the transformed class C_k, S_B = (m_2 − m_1)(m_2 − m_1)^T is the between-class covariance matrix, and S_W = Σ_{n∈C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n∈C_2} (x_n − m_2)(x_n − m_2)^T is the total within-class covariance matrix.
Closed-form Solution
By a simple calculation,

w ∝ S_W^{-1} (m_2 − m_1)   (14)

Relation to LS
By setting the target values to N/N_1 for class C_1 and −N/N_2 for class C_2, one can show that Fisher's LDA is equivalent to the least-squares solution.
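A minimal NumPy sketch of the two-class Fisher direction (14); the helper name fisher_direction and the toy blobs are my own, not from the slides.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's discriminant direction w proportional to S_W^{-1}(m2 - m1).

    X1, X2: arrays of shape (N1, D) and (N2, D) holding the samples of C1 and C2.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter S_W = sum over both classes of (x_n - m_k)(x_n - m_k)^T
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)      # solve instead of forming an explicit inverse
    return w / np.linalg.norm(w)

# Example: project two Gaussian blobs onto the Fisher direction
rng = np.random.default_rng(1)
X1 = rng.normal([0, 0], 1.0, size=(50, 2))
X2 = rng.normal([3, 1], 1.0, size=(50, 2))
w = fisher_direction(X1, X2)
print("Fisher direction w =", w)
print("projected class means:", (X1 @ w).mean(), (X2 @ w).mean())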
Multiple Classes Case
One multi-class generalization of Fisher's criterion is to maximize

J(W) = tr{(W S_W W^T)^{-1} (W S_B W^T)}   (15)

where S_W = Σ_{k=1}^{K} S_k, S_k = Σ_{n∈C_k} (x_n − m_k)(x_n − m_k)^T, and m_k = (1/N_k) Σ_{n∈C_k} x_n.
The total covariance matrix S_T = Σ_{n=1}^{N} (x_n − m)(x_n − m)^T can be decomposed as

S_T = S_W + S_B   (16)

where S_B = Σ_{k=1}^{K} N_k (m_k − m)(m_k − m)^T.
The Perceptron algorithm
Motivation: what if we first apply a fixed nonlinear feature transformation φ(x) and then build a linear classifier on top of it?
Model
y(x) = f(w^T φ(x))   (17)

where

f(a) = +1 if a ≥ 0, and −1 if a < 0.   (18)
Error Function

E_P(w) = −Σ_{n∈M} w^T φ_n t_n   (19)

where M is the set of misclassified patterns.
SGD
Applying stochastic gradient descent, we get

w^{(τ+1)} = w^{(τ)} − η∇E_P(w) = w^{(τ)} + η φ_n t_n   (20)
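A minimal sketch of the perceptron update (20) in NumPy; function names, learning rate, and toy data are my own assumptions.

```python
import numpy as np

def perceptron_train(Phi, t, eta=1.0, max_epochs=100):
    """Perceptron learning on basis vectors Phi (N, M) with targets t in {-1, +1}.

    Applies the SGD update (20): w <- w + eta * phi_n * t_n for misclassified points.
    """
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(max_epochs):
        mistakes = 0
        for n in range(N):
            if t[n] * (w @ Phi[n]) <= 0:       # pattern n is misclassified
                w = w + eta * Phi[n] * t[n]
                mistakes += 1
        if mistakes == 0:                      # converged (data linearly separable)
            break
    return w

# Toy usage: phi(x) = (1, x) so the bias is learned as w[0]
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 1, size=(40, 2)), rng.normal(2, 1, size=(40, 2))])
t = np.hstack([-np.ones(40), np.ones(40)])
Phi = np.hstack([np.ones((80, 1)), X])
w = perceptron_train(Phi, t)
print("training accuracy:", (np.sign(Phi @ w) == t).mean())
```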
Perceptron Convergence Theorem
If the training data set is linearly separable, then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps.
Limitations
1 The theorem says nothing about the convergence rate, and in practice convergence can be very slow.
2 The perceptron is still based on linear combinations of fixed basis functions. This limitation is addressed in chapters 5 and 6.
Probabilistic Generative Models
Introduction
Generative Approach
Model the class-conditional densities p(x|C_k) and the class priors p(C_k), then use Bayes' theorem to obtain the posterior p(C_k|x).
Posterior Probability and Activation Function
For two classes, the posterior probability can be written as

p(C_1|x) = p(x|C_1)p(C_1) / (p(x|C_1)p(C_1) + p(x|C_2)p(C_2)) = exp(a) / (exp(a) + 1) = σ(a)   (21)

where a = ln [p(x|C_1)p(C_1) / (p(x|C_2)p(C_2))] and σ is the logistic sigmoid function. The quantity a is called the log odds.

More generally, for K classes the posterior probability can be written as

p(C_k|x) = p(x|C_k)p(C_k) / Σ_j p(x|C_j)p(C_j) = exp(a_k) / Σ_j exp(a_j)   (22)

where a_j = ln(p(x|C_j)p(C_j)). This normalized exponential is called the softmax function; it acts as a smoothed version of argmax.
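For reference, a small NumPy sketch of the sigmoid (21) and softmax (22) activations; the max-shift inside softmax is a standard numerical trick I have added, not something from the slides.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    """Softmax over the last axis, with the usual max-shift for numerical stability."""
    a = a - a.max(axis=-1, keepdims=True)   # subtracting a constant leaves (22) unchanged
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical log joint probabilities a_k = ln p(x|C_k)p(C_k)
a = np.array([2.0, 0.5, -1.0])
print(softmax(a))       # posterior p(C_k|x), sums to 1
print(sigmoid(1.2))
```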
Continuous inputs
Model
If we assume the class-conditional densities are Gaussian with a shared covariance,

p(x|C_k) = (1 / (2π)^{D/2}) (1 / |Σ|^{1/2}) exp{−(1/2)(x − µ_k)^T Σ^{-1} (x − µ_k)}   (23)

for k = 1, 2, then we get

p(C_1|x) = σ(w^T x + w_0)   (24)

where w = Σ^{-1}(µ_1 − µ_2) and w_0 = −(1/2)µ_1^T Σ^{-1} µ_1 + (1/2)µ_2^T Σ^{-1} µ_2 + ln(p(C_1)/p(C_2)).

Similarly, for multiple classes the model is

a_k(x) = w_k^T x + w_{k0}   (25)

where w_k = Σ^{-1} µ_k and w_{k0} = −(1/2)µ_k^T Σ^{-1} µ_k + ln p(C_k).
MLE
Gaussian Class-Conditional Densities
The likelihood function is

p(t, X|π, µ_1, µ_2, Σ) = Π_{n=1}^{N} [π N(x_n|µ_1, Σ)]^{t_n} [(1 − π) N(x_n|µ_2, Σ)]^{1−t_n}   (26)

where π is the prior probability of class C_1, µ_k is the mean of class k, t_n is 1 if the n-th data point belongs to class C_1 and 0 otherwise, and the two classes share a covariance matrix Σ.

Closed-form Solution
The maximum likelihood estimates can be obtained exactly:

π = N_1 / (N_1 + N_2)   (27)

µ_1 = (1/N_1) Σ_{n=1}^{N} t_n x_n,   µ_2 = (1/N_2) Σ_{n=1}^{N} (1 − t_n) x_n   (28)

Σ = S = (N_1/N) S_1 + (N_2/N) S_2   (29)

where S_1 = (1/N_1) Σ_{n∈C_1} (x_n − µ_1)(x_n − µ_1)^T and S_2 = (1/N_2) Σ_{n∈C_2} (x_n − µ_2)(x_n − µ_2)^T.
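A minimal NumPy sketch that fits the MLE estimates (27)-(29) and evaluates the resulting posterior (24); the function names and toy data are my own assumptions.

```python
import numpy as np

def fit_gaussian_generative(X, t):
    """MLE for the shared-covariance Gaussian generative model, equations (27)-(29).

    X: (N, D) inputs; t: (N,) binary targets with t_n = 1 for class C1, 0 for C2.
    """
    X1, X2 = X[t == 1], X[t == 0]
    N1, N2 = len(X1), len(X2)
    pi = N1 / (N1 + N2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1) / N1
    S2 = (X2 - mu2).T @ (X2 - mu2) / N2
    S = (N1 * S1 + N2 * S2) / (N1 + N2)
    return pi, mu1, mu2, S

def posterior_c1(X, pi, mu1, mu2, S):
    """p(C1|x) = sigma(w^T x + w0) with w, w0 as in (24)."""
    Sinv = np.linalg.inv(S)
    w = Sinv @ (mu1 - mu2)
    w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(pi / (1 - pi))
    return 1.0 / (1.0 + np.exp(-(X @ w + w0)))

# Toy usage with two Gaussian blobs
rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 1, size=(60, 2)), rng.normal([2, 2], 1, size=(60, 2))])
t = np.hstack([np.ones(60), np.zeros(60)])
pi, mu1, mu2, S = fit_gaussian_generative(X, t)
print(posterior_c1(X[:3], pi, mu1, mu2, S))
```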
Discrete features
Naive Bayes
In the general case we would need to model all possible combinations of discrete feature values. But under the naive Bayes assumption, in which the feature values are treated as independent conditioned on the class C_k, the class-conditional likelihood becomes very easy to compute:

p(x|C_k) = Π_{i=1}^{D} µ_{ki}^{x_i} (1 − µ_{ki})^{1−x_i}   (30)
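A sketch of the Bernoulli naive Bayes likelihood (30) in NumPy; the add-alpha (Laplace) smoothing, which keeps every µ_{ki} strictly between 0 and 1, is my addition rather than part of the slides.

```python
import numpy as np

def fit_bernoulli_nb(X, y, K, alpha=1.0):
    """Estimate mu[k, i] = p(x_i = 1 | C_k) for binary features, with add-alpha smoothing."""
    N, D = X.shape
    mu, prior = np.zeros((K, D)), np.zeros(K)
    for k in range(K):
        Xk = X[y == k]
        mu[k] = (Xk.sum(axis=0) + alpha) / (len(Xk) + 2 * alpha)
        prior[k] = len(Xk) / N
    return mu, prior

def log_joint(x, mu, prior):
    """ln p(x|C_k) + ln p(C_k), using the naive Bayes likelihood (30)."""
    return (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + np.log(prior)

# Toy usage: 2 classes, 4 binary features
rng = np.random.default_rng(4)
X = np.vstack([(rng.random((50, 4)) < 0.8).astype(int),   # class 0: features mostly on
               (rng.random((50, 4)) < 0.2).astype(int)])  # class 1: features mostly off
y = np.repeat([0, 1], 50)
mu, prior = fit_bernoulli_nb(X, y, K=2)
print(np.argmax(log_joint(X[0], mu, prior)))   # predicted class for the first sample
```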
Exponential family
Likelihood of Exponential family
p(x|η) = h(x) g(η) exp{η^T u(x)}   (31)

If we restrict attention to the case u(x) = x and introduce a scale parameter s, we can represent the density as

p(x|η_k, s) = (1/s) h((1/s)x) g(η_k) exp{(1/s) η_k^T x}   (32)

Closed-form Solution for the Exponential Family
From the above, in binary classification we get

a(x) = (1/s)(η_1 − η_2)^T x + ln(g(η_1)/g(η_2)) + ln(p(C_1)/p(C_2))   (33)

and in multi-class classification

a_k(x) = (1/s) η_k^T x + ln g(η_k) + ln p(C_k)   (34)
Probabilistic Discriminative Models
Introduction
Discriminative Approach
Use the functional form of the GLM explicitly and determine its parameters directly by maximum likelihood.
Advantages
1 Fewer adaptive parameters.
2 No assumption about the form of the class-conditional densities is required.
Fixed Basis Functions
In the discriminative approach we model the posterior probabilities directly and then apply standard decision theory. Fixed basis functions have significant limitations, which are later addressed by allowing the basis functions to adapt to the data.
Logistic Regression
Model
p(C_1|φ) = y(φ) = σ(w^T φ)   (35)

With this model we only need to find M adaptive parameters, which is considerably simpler than the Gaussian generative model, whose shared covariance matrix alone requires M(M+1)/2 parameters.

Likelihood
We can write the likelihood as

p(t|w) = Π_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1−t_n}   (36)

and taking the negative logarithm gives the cross-entropy error function

E(w) = E_t[−ln y] ≃ −Σ_{n=1}^{N} {t_n ln y_n + (1 − t_n) ln(1 − y_n)}   (37)
Gradient of the Cross Entropy
Taking the gradient of the error function, we get

∇_w E(w) = Σ_{n=1}^{N} (y_n − t_n) φ_n   (38)

so the contribution of each data point has the form

(error) × (basis function vector)   (39)

Stochastic Gradient Descent
This gives a sequential algorithm in which the weight vector is updated using the per-sample gradient

∇_w E_n(w) = (y_n − t_n) φ_n   (40)
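A minimal NumPy sketch of logistic regression trained with the per-sample gradient (40); the learning rate, epoch count, and toy data are my own choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logreg_sgd(Phi, t, eta=0.1, epochs=200):
    """Stochastic gradient descent for logistic regression.

    Uses the per-sample gradient (40): grad E_n(w) = (y_n - t_n) * phi_n.
    """
    N, M = Phi.shape
    w = np.zeros(M)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for n in rng.permutation(N):
            y_n = sigmoid(w @ Phi[n])
            w -= eta * (y_n - t[n]) * Phi[n]
    return w

# Toy usage: phi(x) = (1, x)
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(1, 1, size=(50, 2))])
t = np.hstack([np.zeros(50), np.ones(50)])
Phi = np.hstack([np.ones((100, 1)), X])
w = logreg_sgd(Phi, t)
print("accuracy:", ((sigmoid(Phi @ w) > 0.5) == t).mean())
```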
IRLS
Iterative Reweighted Least Squares
For logistic regression there is no longer a closed-form solution, due to the nonlinearity of the logistic sigmoid. Fortunately, the cross-entropy error function is convex, so an iterative method can find the global optimum.

Newton-Raphson Method
The Newton-Raphson method is an iterative scheme defined by

w^{(τ+1)} = w^{(τ)} − H^{-1} ∇_w E(w)   (41)

The gradient and Hessian of our error function are

∇_w E(w) = Σ_{n=1}^{N} (y_n − t_n) φ_n = Φ^T (y − t)   (42)

∇²_w E(w) = Σ_{n=1}^{N} y_n (1 − y_n) φ_n φ_n^T = Φ^T R Φ   (43)

where R = diag(y ⊙ (1 − y)). Because the weighting matrix R depends on w, it must be recomputed at every iteration; each Newton step then amounts to a weighted least-squares problem, hence the name.
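A sketch of the Newton/IRLS updates (41)-(43) in NumPy; the small ridge term added to the Hessian is my own numerical safeguard, not part of the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, n_iter=10):
    """Newton-Raphson / IRLS updates for logistic regression.

    R is recomputed from the current w at every iteration because it depends on y.
    """
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t)                     # (42)
        R = np.diag(y * (1 - y))                   # (43): R = diag(y ⊙ (1 - y))
        H = Phi.T @ R @ Phi + 1e-8 * np.eye(M)     # small ridge for numerical safety (my addition)
        w = w - np.linalg.solve(H, grad)           # w <- w - H^{-1} grad
    return w

# Toy usage
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(1, 1, size=(50, 2))])
t = np.hstack([np.zeros(50), np.ones(50)])
Phi = np.hstack([np.ones((100, 1)), X])
print("IRLS weights:", irls(Phi, t))
```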
Multiclass Logistic Regression
Likelihood
Similarly to the binary case, the likelihood of the model is

p(T|w_1, ..., w_K) = Π_{n=1}^{N} Π_{k=1}^{K} y_{nk}^{t_{nk}}   (44)

Taking the negative logarithm gives the cross-entropy error function

E(w_1, ..., w_K) = E_T[−ln p(y)] ≃ −Σ_{n=1}^{N} Σ_{k=1}^{K} t_{nk} ln y_{nk}   (45)

By an argument similar to the binary case,

∇_{w_j} E(w_1, ..., w_K) = Σ_{n=1}^{N} (y_{nj} − t_{nj}) φ_n   (46)

∇_{w_k} ∇_{w_j} E(w_1, ..., w_K) = Σ_{n=1}^{N} y_{nk} (I_{kj} − y_{nj}) φ_n φ_n^T   (47)
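A small sketch of multiclass logistic regression using the gradient (46) with plain batch gradient descent; a Newton scheme using the block Hessian (47) would also work, but this keeps the example short. All names and the toy data are my own.

```python
import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def multiclass_logreg_gd(Phi, T, eta=0.1, n_iter=500):
    """Batch gradient descent on the multiclass cross entropy (45),
    using the gradient (46): grad_{w_j} E = sum_n (y_nj - t_nj) phi_n."""
    N, M = Phi.shape
    K = T.shape[1]
    W = np.zeros((M, K))                 # column j holds w_j
    for _ in range(n_iter):
        Y = softmax(Phi @ W)             # y_nk, shape (N, K)
        W -= eta * Phi.T @ (Y - T) / N   # all K gradients at once
    return W

# Toy usage: 3 classes in 2D, phi(x) = (1, x)
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(m, 0.6, size=(40, 2)) for m in ([0, 0], [3, 0], [0, 3])])
labels = np.repeat(np.arange(3), 40)
Phi = np.hstack([np.ones((120, 1)), X])
T = np.eye(3)[labels]
W = multiclass_logreg_gd(Phi, T)
print("accuracy:", (softmax(Phi @ W).argmax(axis=1) == labels).mean())
```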
Probit Regression
Inverse Probit Function
The inverse probit function is defined as

Φ(a) = ∫_{−∞}^{a} N(θ|0, 1) dθ   (48)

and the GLM based on this activation function is known as probit regression.

Limitations
The logistic sigmoid decays asymptotically like exp(−x) for x → ∞, whereas the tails of the inverse probit function decay like exp(−x²); the probit model is therefore more sensitive to outliers.
Canonical Link Functions
Canonical Link Function
If we assume a conditional distribution for the target variable from the exponential family, there is a corresponding choice of activation function known as the canonical link function.

Likelihood Function
First we assume an exponential-family conditional distribution for the target with scale parameter s:

p(t|η, s) = (1/s) h(t/s) g(η) exp{ηt/s}   (49)

Taking the first moment of this distribution gives

y ≡ E[t|η] = −s (d/dη) ln g(η)   (50)

We denote this relation by η = ψ(y). In the GLM y = f(a) we call f(·) the activation function and its inverse f^{-1}(·) the link function. The log-likelihood over the data set is

ln p(t|η, s) = Σ_{n=1}^{N} {ln g(η_n) + η_n t_n / s} + const.   (51)
Gradient of the Log-Likelihood
From the previous page,

∇_w ln p(t|η, s) = Σ_{n=1}^{N} {(d/dη_n) ln g(η_n) + t_n/s} (dη_n/dy_n)(dy_n/da_n) ∇_w a_n
                = Σ_{n=1}^{N} (1/s) {t_n − y_n} ψ′(y_n) f′(a_n) φ_n   (52)

If we choose the canonical link f^{-1}(y) = ψ(y), then f′(a_n) ψ′(y_n) = 1 and the gradient of the error function E(w) = −ln p(t|η, s) takes the simple form

∇_w E(w) = (1/s) Σ_{n=1}^{N} {y_n − t_n} φ_n
The Laplace Approximation
Process of Laplace Approximation
Motivation
Find a Gaussian approximation to a probability density defined over a set of continuous variables.

How?

1 Find a mode z_0 of f(z) and evaluate A = −∇∇ ln f(z) at z = z_0.
2 Using this information, approximate

f(z) ≃ f(z_0) exp{−(1/2)(z − z_0)^T A (z − z_0)}   (53)

3 Normalizing the distribution gives

q(z) = (|A|^{1/2} / (2π)^{M/2}) exp{−(1/2)(z − z_0)^T A (z − z_0)} = N(z|z_0, A^{-1})   (54)

Limitation
Since the approximation is Gaussian and centered on a single mode, it can fail badly for multimodal or strongly non-Gaussian densities.
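A tiny one-dimensional sketch of steps 1-3 in NumPy, using an unnormalized gamma-shaped density whose mode and second derivative are available in closed form; the example density and all names are my own, and the estimate of the normalizer uses the expression for Z that appears on the next slide.

```python
import numpy as np
from math import gamma

# Unnormalized gamma-like density f(z) = z^(a-1) * exp(-b z) on z > 0
a, b = 5.0, 2.0
f = lambda z: z ** (a - 1) * np.exp(-b * z)

# Step 1: mode z0 (known in closed form here) and A = -(d^2/dz^2) ln f(z) at z0
z0 = (a - 1) / b
A = (a - 1) / z0 ** 2                      # second derivative of (a-1) ln z - b z, negated

# Steps 2-3: Gaussian approximation q(z) = N(z | z0, A^{-1}), equation (54)
q = lambda z: np.sqrt(A / (2 * np.pi)) * np.exp(-0.5 * A * (z - z0) ** 2)

# Laplace estimate of the normalizer Z = f(z0) (2π)^{1/2} / A^{1/2}, vs. the exact value
Z_laplace = f(z0) * np.sqrt(2 * np.pi / A)
Z_exact = gamma(a) / b ** a
print(Z_laplace, Z_exact)
```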
Model Comparison and BIC
Model Evidence
The Laplace approximation gives the normalization constant

Z ≃ f(z_0) (2π)^{M/2} / |A|^{1/2}   (55)

We can apply this to the model evidence

p(D) = ∫ p(D|θ) p(θ) dθ   (56)

Setting f(θ) = p(D|θ)p(θ) and Z = p(D), we get

ln p(D) ≃ ln p(D|θ_MAP) + ln p(θ_MAP) + (M/2) ln(2π) − (1/2) ln|A|   (57)

If we assume the Gaussian prior is broad and the Hessian has full rank, this reduces to the BIC approximation

ln p(D) ≃ ln p(D|θ_MAP) − (1/2) M ln N   (58)

A more accurate estimate of the model evidence is developed in chapter 5.
Bayesian Logistic Regression
Laplace Approximation
How?
1 First, choose a Gaussian prior

p(w) = N(w|m_0, S_0)   (59)

2 The log posterior is then

ln p(w|t) = −(1/2)(w − m_0)^T S_0^{-1} (w − m_0) + Σ_{n=1}^{N} {t_n ln y_n + (1 − t_n) ln(1 − y_n)} + const.   (60)

3 Approximate the posterior using the Laplace approximation

q(w) = N(w|w_MAP, S_N)   (61)

where

S_N^{-1} = S_0^{-1} + Σ_{n=1}^{N} y_n (1 − y_n) φ_n φ_n^T   (62)
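A NumPy sketch of this procedure: Newton steps on the log posterior (60) to find w_MAP, followed by S_N from (62). The function name, iteration count, and toy data are my own assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_posterior(Phi, t, m0, S0, n_iter=20):
    """Laplace approximation q(w) = N(w | w_MAP, S_N) for Bayesian logistic regression."""
    S0_inv = np.linalg.inv(S0)
    w = m0.copy()
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        grad = S0_inv @ (w - m0) + Phi.T @ (y - t)           # gradient of -log posterior
        H = S0_inv + Phi.T @ (Phi * (y * (1 - y))[:, None])  # S_N^{-1} at the current w
        w = w - np.linalg.solve(H, grad)                     # Newton step toward w_MAP
    y = sigmoid(Phi @ w)
    SN_inv = S0_inv + Phi.T @ (Phi * (y * (1 - y))[:, None]) # (62)
    return w, np.linalg.inv(SN_inv)

# Toy usage: phi(x) = (1, x), broad isotropic Gaussian prior
rng = np.random.default_rng(8)
X = np.vstack([rng.normal(-1, 1, size=(40, 2)), rng.normal(1, 1, size=(40, 2))])
t = np.hstack([np.zeros(40), np.ones(40)])
Phi = np.hstack([np.ones((80, 1)), X])
m0, S0 = np.zeros(3), 10.0 * np.eye(3)
w_map, S_N = laplace_posterior(Phi, t, m0, S0)
print("w_MAP =", w_map)
```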
Predictive Distribution
Predictive Distribution via the Laplace Approximation
Using the Laplace approximation, we get

p(C_1|φ, t) = ∫ p(C_1|φ, w) p(w|t) dw ≃ ∫ σ(w^T φ) q(w) dw   (63)

Denoting a = w^T φ, we can write

σ(w^T φ) = ∫ δ(a − w^T φ) σ(a) da   (64)

Therefore

p(C_1|φ, t) ≃ ∫ σ(a) ∫ δ(a − w^T φ) q(w) dw da = ∫ σ(a) p(a) da   (65)

Since p(a) is the marginal of the Gaussian q(w) under a linear map, p(a) is also Gaussian. Its mean and variance are

µ_a = E[a] = ∫ a p(a) da = ∫ w^T φ q(w) dw = w_MAP^T φ   (66)

σ_a² = ∫ {a² − E[a]²} p(a) da = ∫ {(w^T φ)² − (w_MAP^T φ)²} q(w) dw = φ^T S_N φ   (67)
Approximate Convolution
This predictive distribution is the convolution of a Gaussian with a logistic sigmoid and cannot be evaluated analytically. We therefore exploit the similarity between the sigmoid and the inverse probit function, whose convolution with a Gaussian can be evaluated analytically:

∫ Φ(λa) N(a|µ, σ²) da = Φ(µ / (λ^{-2} + σ²)^{1/2})   (68)

Therefore

∫ σ(a) N(a|µ, σ²) da ≃ σ(κ(σ²) µ)   (69)

where κ(σ²) = (1 + πσ²/8)^{-1/2}. The approximate predictive distribution thus takes the form

p(C_1|φ, t) ≃ σ(κ(σ_a²) µ_a)   (70)
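To close the loop, a short NumPy sketch of the approximate predictive probability (66)-(70); the posterior moments w_map and S_N below are hypothetical placeholders that would normally come from the Laplace approximation of the previous section.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive_prob(phi, w_map, S_N):
    """Approximate p(C1 | phi, t) = sigma(kappa(sigma_a^2) * mu_a), equations (66)-(70)."""
    mu_a = w_map @ phi                           # (66)
    sigma2_a = phi @ S_N @ phi                   # (67)
    kappa = 1.0 / np.sqrt(1.0 + np.pi * sigma2_a / 8.0)
    return sigmoid(kappa * mu_a)

# Hypothetical posterior moments and a test point with phi(x) = (1, x)
w_map = np.array([0.3, 1.5, -0.7])
S_N = np.diag([0.5, 0.2, 0.2])
phi = np.array([1.0, 0.8, -0.4])
print("p(C1|phi, t) ≈", predictive_prob(phi, w_map, S_N))
print("MAP plug-in  :", sigmoid(w_map @ phi))   # ignores posterior uncertainty, so less moderated
```

Note how the Bayesian prediction is pulled toward 0.5 relative to the MAP plug-in whenever the posterior variance φ^T S_N φ is large.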