Maximum Entropy Discrimination
Tommi Jaakkola (MIT) · Marina Meila (CMU) · Tony Jebara (MIT)
Classification
· inputs x, class y = +1, −1
· data D = { (x_1, y_1), …, (x_T, y_T) }
· learn f_opt(x), a discriminant function, from F = { f }, a family of discriminants
· classify: y = sign f_opt(x)
Model averaging
· many f with near-optimal performance
· instead of choosing f_opt, average over all f in F
· Q(f) = weight of f
· y(x) = sign ∫_F Q(f) f(x) df = sign <f(x)>_Q
· to specify: F = { f }, a family of discriminant functions
· to learn: Q(f), a distribution over F
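As a concrete illustration (not from the talk), here is a minimal sketch of this averaged decision rule for a finite family of discriminants, assuming the weights Q are non-negative and sum to one:

```python
import numpy as np

def averaged_classify(x, discriminants, Q):
    """Classify x by the sign of the Q-weighted average discriminant,
    y(x) = sign <f(x)>_Q, for a finite family F = {f_1, ..., f_M}.

    discriminants: list of callables f(x) -> float
    Q: array of non-negative weights over the family, summing to one
    """
    avg = sum(q * f(x) for q, f in zip(Q, discriminants))  # <f(x)>_Q
    return np.sign(avg)
```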
Goal of this work
· Define a discriminative criterion for averaging over models
Advantages
· can incorporate a prior
· can use generative models
· computationally feasible
· generalizes to other discrimination tasks
Maximum Entropy Discrimination
given data set D = { (x_1, y_1), …, (x_T, y_T) }, find
    Q_ME = argmax_Q H(Q)
    s.t. y_t <f(x_t)>_Q ≥ γ for all t = 1, …, T   (C)
    and some γ > 0
solution: Q_ME correctly classifies D
· among all admissible Q, Q_ME has maximum entropy
· maximum entropy ⇒ least specific about f
· convex problem: Q_ME is unique
· solution: Q_ME(f) ∝ exp{ Σ_{t=1}^T λ_t y_t f(x_t) }
· λ_t ≥ 0 are Lagrange multipliers
· finding Q_ME: start with λ = 0 and follow the gradient of the unsatisfied constraints
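For intuition, a sketch of what this exponential-family solution looks like when F is a finite set of candidate discriminants (the continuous case replaces the sum over models with an integral):

```python
import numpy as np

def q_me(lmbda, F_vals, y):
    """Q_ME(f_m) ∝ exp{ Σ_t λ_t y_t f_m(x_t) } over a finite family.

    lmbda: (T,) Lagrange multipliers, λ_t >= 0
    F_vals: (M, T) array with F_vals[m, t] = f_m(x_t)
    y: (T,) labels in {+1, -1}
    """
    scores = F_vals @ (lmbda * y)   # Σ_t λ_t y_t f_m(x_t), one per model
    scores -= scores.max()          # subtract max for numerical stability
    w = np.exp(scores)
    return w / w.sum()              # normalize to a distribution over F
```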
Solution: Q_ME as a projection
[Figure: Q_ME is the projection of the uniform distribution Q_0 (λ = 0) onto the set of admissible Q, reached at λ = λ_ME.]
Finding the solution
· needed: λ_t, t = 1, …, T
· obtained by solving the dual problem
    max J(λ) = max [ −log Z_+ − log Z_− − γ Σ_t λ_t ]
    s.t. λ_t ≥ 0 for t = 1, …, T
Algorithm
· start with λ_t = 0 (uniform distribution)
· iterative ascent on J(λ) until convergence
· derivative: ∂J/∂λ_t = y_t <log [P_+(x_t)/P_−(x_t)] + b>_Q(P) − γ
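A sketch of the ascent loop for the finite-family case above, reusing q_me from the earlier sketch (the step size lr is hypothetical; the Z_+/Z_− factorization of the talk is specific to generative models and not reproduced here). Each λ_t is pushed up while its constraint y_t <f(x_t)>_Q ≥ γ is unsatisfied, and clipped at zero:

```python
import numpy as np

def fit_multipliers(F_vals, y, gamma=0.1, lr=0.05, iters=1000):
    """Projected gradient ascent on the dual: raise λ_t while the
    margin constraint is violated, keep λ_t >= 0."""
    T = F_vals.shape[1]
    lmbda = np.zeros(T)                    # λ = 0: Q is uniform
    for _ in range(iters):
        Q = q_me(lmbda, F_vals, y)         # current solution Q_ME(λ)
        margins = y * (Q @ F_vals)         # y_t <f(x_t)>_Q for each t
        lmbda = np.maximum(lmbda + lr * (gamma - margins), 0.0)
    return lmbda
```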
Q_ME as a sparse solution
· classification rule: y(x) = sign <f(x)>_Q_ME
· γ is the classification margin
· λ_t > 0 only for y_t <f(x_t)>_Q = γ, i.e. x_t exactly on the margin (a support vector!)
Q_ME as regularization
· uniform distribution Q_0 ⇔ λ = 0
· "smoothness" of Q = H(Q)
· Q_ME is the smoothest admissible distribution
[Figure: Q(f) as a function of f, comparing a point solution at f_opt with the smooth Q_ME and the uniform Q_0.]
Goal of this work
· Define a discriminative criterion for averaging over models
Extensions
· incorporate prior
· relationship to support vectors
· use generative models
· generalize to other discrimination tasks
Priors
· prior Q_0(f)
· Minimum Relative Entropy Discrimination
    Q_MRE = argmin_Q KL(Q || Q_0)
    s.t. y_t <f(x_t)>_Q ≥ γ for all t = 1, …, T   (C)
· prior on γ ⇒ learn Q_MRE(f, γ) ⇒ soft margin
[Figure: Q_MRE is the KL(Q || Q_0) projection of the prior Q_0 onto the set of admissible Q.]
Soft margins
· average also over the margin γ
· define Q_0(f, γ) = Q_0(f) Q_0(γ)
· constraints: <y_t f(x_t) − γ>_Q(f,γ) ≥ 0
· learn Q_MRE(f, γ) = Q_MRE(f) Q_MRE(γ)
· Q_0(γ) = c exp[ c(γ − 1) ] for γ ≤ 1
[Figure: the resulting margin potential as a function of γ.]
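To see where this margin prior leads (a short check, assuming λ_t < c): integrating it against the exponential factor e^{−λ_t γ} from the solution gives exactly the per-example penalty that reappears in the SVM theorem on the next slide.

```latex
\int_{-\infty}^{1} e^{-\lambda_t \gamma}\, c\, e^{c(\gamma - 1)}\, d\gamma
  \;=\; \frac{c\, e^{-\lambda_t}}{c - \lambda_t}
  \;=\; \frac{e^{-\lambda_t}}{1 - \lambda_t / c},
\qquad\text{so}\qquad
-\log Z_{\gamma}(\lambda_t) \;=\; \lambda_t + \log\!\left(1 - \frac{\lambda_t}{c}\right).
```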
Examples: support vector machines
· Theorem
  For f(x) = θ·x + b, Q_0(θ) = Normal(0, I), and Q_0(b) a non-informative prior, the Lagrange multipliers are obtained by maximizing J(λ) subject to 0 ≤ λ_t ≤ c and Σ_t λ_t y_t = 0, where
      J(λ) = Σ_t [ λ_t + log(1 − λ_t/c) ] − 1/2 Σ_{t,s} λ_t λ_s y_t y_s x_t·x_s
· separable D ⇒ SVM recovered exactly
· inseparable D ⇒ SVM recovered with a different misclassification penalty
· adaptive kernel SVM…
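A sketch of evaluating this dual objective with a linear kernel (function and argument names are hypothetical); maximizing it under the stated constraints yields the multipliers:

```python
import numpy as np

def svm_dual_objective(lmbda, X, y, c):
    """J(λ) = Σ_t [λ_t + log(1 - λ_t/c)] - 1/2 Σ_{t,s} λ_t λ_s y_t y_s x_t·x_s.

    lmbda: (T,) multipliers with 0 <= λ_t < c
    X: (T, d) inputs;  y: (T,) labels in {+1, -1};  c: penalty parameter
    """
    K = X @ X.T                                     # linear kernel x_t · x_s
    penalty = np.sum(lmbda + np.log1p(-lmbda / c))  # per-example potential
    v = lmbda * y
    return penalty - 0.5 * v @ K @ v
```

Note how the log(1 − λ_t/c) term replaces the hard box constraint of the standard SVM dual: it keeps each λ_t strictly below c with an infinite barrier, which is the source of the "different misclassification penalty" above.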
SVM extensions
· example: Leptograpsus crabs (5 inputs, T_train = 80, T_test = 120)
· f(x) = log [P_+(x)/P_−(x)] + b with P_+(x) = Normal(x; m_+, V_+), P_−(x) = Normal(x; m_−, V_−)
· a quadratic classifier; Q(V_+, V_−) = distribution over kernel widths
[Figure: decision boundaries of the MRE Gaussian, the linear SVM, and the maximum-likelihood Gaussian.]
Using generative models
· generative models P_+(x), P_−(x) for y = +1, −1
· f(x) = log [P_+(x)/P_−(x)] + b
· learn Q_MRE(P_+, P_−, b, γ)
· if Q_0(P_+, P_−, b, γ) = Q_0(P_+) Q_0(P_−) Q_0(b) Q_0(γ), then
  Q_MRE(P_+, P_−, b, γ) = Q_MRE(P_+) Q_MRE(P_−) Q_MRE(b) Q_MRE(γ)
  (factored prior ⇒ factored posterior)
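A minimal sketch of this log-ratio discriminant for the Gaussian case used in the crabs example (scipy assumed; names are hypothetical):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_ratio_discriminant(x, m_pos, V_pos, m_neg, V_neg, b=0.0):
    """f(x) = log P_+(x) - log P_-(x) + b with Gaussian class models."""
    return (multivariate_normal.logpdf(x, mean=m_pos, cov=V_pos)
            - multivariate_normal.logpdf(x, mean=m_neg, cov=V_neg)
            + b)
```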
Examples: other distributions
· Multinomial (1 discrete variable)
· Graphical model (fixed structure, no hidden variables)
· Tree graphical model (Q over structures and parameters)
Tree graphical models
· P(x | E, θ) = P_0(x) Π_{uv∈E} P_uv(x_u, x_v | θ_uv)
· prior Q_0(P) = Q_0(E) Q_0(θ | E)
· Q_0(E) ∝ Π_{uv∈E} β_uv
· Q_0(θ | E) = conjugate prior
· posterior: Q_MRE(E) ∝ W_0 Π_{uv∈E} W_uv
· the conjugate prior Q_0(P) can be integrated analytically over both E and θ
[Figure: example tree structures E.]
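The analytic integration over structures rests on the matrix-tree theorem: a sum of Π_{uv∈E} W_uv over all spanning trees is a single determinant. A sketch of that computation (not the talk's full posterior):

```python
import numpy as np

def sum_over_spanning_trees(W):
    """Σ_E Π_{uv in E} W_uv over all spanning trees E of the complete
    graph, via the matrix-tree theorem.

    W: symmetric (n, n) array of non-negative edge weights, zero diagonal
    """
    L = np.diag(W.sum(axis=1)) - W   # weighted graph Laplacian
    return np.linalg.det(L[1:, 1:])  # any principal cofactor of L
```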
Trees: experiments
· splice junction classification task
· 25 inputs, 400 training examples
· compared with maximum-likelihood trees
[Figure: test error — ML, err = 14%; MaxEnt, err = 12.3%.]
Trees: experiments (cont'd)
[Figure: the learned weights of the tree edges.]
Discrimination tasks
· Classification
· Classification with partially labeled data
· Anomaly detection
[Figure: example data for the three tasks — labeled points (+, −) and unlabeled points (x).]
Partially labeled data
· Problem: given F, a family of discriminants, and data set D = { (x_1, y_1), …, (x_T, y_T), x_{T+1}, …, x_N }, find
    Q(f, γ, y) = argmin_Q KL(Q || Q_0)
    s.t. <y_t f(x_t) − γ>_Q ≥ 0 for all t = 1, …, T   (C)
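For an unlabeled point, the label itself is averaged under Q; a one-line sketch of the resulting expected margin, writing q for Q(y_t = +1):

```python
def expected_margin(f_x, q_pos):
    """<y f(x)>_Q(y) = q·f(x) + (1 - q)·(-f(x)) = (2q - 1)·f(x)."""
    return (2.0 * q_pos - 1.0) * f_x
```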
Partially labeled data: experiment
· splice junction classification
· 25 inputs, T_total = 1000
[Figure: learning curves for complete data, 10% labeled + 90% unlabeled, and 10% labeled only.]
Anomaly detection
· Problem: given P = { P }, a family of generative models, and data set D = { x_1, …, x_T }, find Q(P) such that
    Q(P, γ) = argmin_Q KL(Q || Q_0)
    s.t. <log P(x_t) − γ>_Q ≥ 0 for all t = 1, …, T   (C)
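A sketch of the resulting detection rule: once Q(P) is learned, a new point is flagged when its averaged log-likelihood falls below the margin (finite family of models; the log_pdfs interface is hypothetical):

```python
def is_anomaly(x, log_pdfs, Q, gamma):
    """Flag x if <log P(x)>_Q < γ under the learned model average.

    log_pdfs: list of callables log P_i(x) -> float
    Q: weights of the learned distribution over models, summing to one
    """
    avg_ll = sum(q * lp(x) for q, lp in zip(Q, log_pdfs))  # <log P(x)>_Q
    return avg_ll < gamma
```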
Anomaly detection: experiments
[Figures: learned models and detected anomalies, MaxEnt vs. maximum likelihood.]
Conclusions
· new framework for classification
· based on regularization in the space of distributions
· enables the use of generative models
· enables the use of priors
· generalizes to other discrimination tasks