Maximum Entropy Discrimination
Tommi Jaakkola (MIT) · Marina Meila (CMU) · Tony Jebara (MIT)
Classification
· inputs x, class y = +1, −1
· data D = { (x_1, y_1), …, (x_T, y_T) }
· learn f_opt(x), a discriminant function, from F = { f }, a family of discriminants
· classify: y = sign f_opt(x)
Model averaging
· many f with near-optimal performance
· instead of choosing f_opt, average over all f in F
· Q(f) = weight of f
· y(x) = sign ∫_F Q(f) f(x) df = sign <f(x)>_Q
· to specify: F = { f }, a family of discriminant functions
· to learn: Q(f), a distribution over F
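As a concrete illustration (not from the talk), here is a minimal sketch of this averaged decision rule for a finite family of discriminants, assuming the weights Q are non-negative and sum to one:

```python
import numpy as np

def averaged_classify(x, discriminants, Q):
    """Classify x by the sign of the Q-weighted average discriminant,
    y(x) = sign <f(x)>_Q, for a finite family F = {f_1, ..., f_M}.

    discriminants: list of callables f(x) -> float
    Q: array of non-negative weights over the family, summing to one
    """
    avg = sum(q * f(x) for q, f in zip(Q, discriminants))  # <f(x)>_Q
    return np.sign(avg)
```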
Goal of this work
· Define a discriminative criterion for averaging over models
Advantages
· can incorporate a prior
· can use generative models
· computationally feasible
· generalizes to other discrimination tasks
Maximum Entropy Discrimination
given data set D = { (x_1, y_1), …, (x_T, y_T) }, find
    Q_ME = argmax_Q H(Q)
    s.t. y_t <f(x_t)>_Q ≥ γ for all t = 1, …, T   (C)
    and some γ > 0
solution: Q_ME correctly classifies D
· among all admissible Q, Q_ME has maximum entropy
· maximum entropy ⇒ least specific about f
· convex problem: Q_ME is unique
· solution: Q_ME(f) ∝ exp{ Σ_{t=1}^T λ_t y_t f(x_t) }
· λ_t ≥ 0 are Lagrange multipliers
· finding Q_ME: start with λ = 0 and follow the gradient of the unsatisfied constraints
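For intuition, a sketch of what this exponential-family solution looks like when F is a finite set of candidate discriminants (the continuous case replaces the sum over models with an integral):

```python
import numpy as np

def q_me(lmbda, F_vals, y):
    """Q_ME(f_m) ∝ exp{ Σ_t λ_t y_t f_m(x_t) } over a finite family.

    lmbda: (T,) Lagrange multipliers, λ_t >= 0
    F_vals: (M, T) array with F_vals[m, t] = f_m(x_t)
    y: (T,) labels in {+1, -1}
    """
    scores = F_vals @ (lmbda * y)   # Σ_t λ_t y_t f_m(x_t), one per model
    scores -= scores.max()          # subtract max for numerical stability
    w = np.exp(scores)
    return w / w.sum()              # normalize to a distribution over F
```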
Solution: Q_ME as a projection
[Figure: Q_ME is the projection of the uniform distribution Q_0 (λ = 0) onto the set of admissible Q, reached at λ = λ_ME.]
Finding the solution
· needed: λ_t, t = 1, …, T
· obtained by solving the dual problem
    max J(λ) = max [ −log Z_+ − log Z_− − γ Σ_t λ_t ]
    s.t. λ_t ≥ 0 for t = 1, …, T
Algorithm
· start with λ_t = 0 (uniform distribution)
· iterative ascent on J(λ) until convergence
· derivative: ∂J/∂λ_t = y_t <log [P_+(x_t)/P_−(x_t)] + b>_Q(P) − γ
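A sketch of the ascent loop for the finite-family case above, reusing q_me from the earlier sketch (the step size lr is hypothetical; the Z_+/Z_− factorization of the talk is specific to generative models and not reproduced here). Each λ_t is pushed up while its constraint y_t <f(x_t)>_Q ≥ γ is unsatisfied, and clipped at zero:

```python
import numpy as np

def fit_multipliers(F_vals, y, gamma=0.1, lr=0.05, iters=1000):
    """Projected gradient ascent on the dual: raise λ_t while the
    margin constraint is violated, keep λ_t >= 0."""
    T = F_vals.shape[1]
    lmbda = np.zeros(T)                    # λ = 0: Q is uniform
    for _ in range(iters):
        Q = q_me(lmbda, F_vals, y)         # current solution Q_ME(λ)
        margins = y * (Q @ F_vals)         # y_t <f(x_t)>_Q for each t
        lmbda = np.maximum(lmbda + lr * (gamma - margins), 0.0)
    return lmbda
```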
Q_ME as a sparse solution
· classification rule: y(x) = sign <f(x)>_Q_ME
· γ is the classification margin
· λ_t > 0 only for y_t <f(x_t)>_Q = γ, i.e. x_t exactly on the margin (a support vector!)
Q_ME as regularization
· uniform distribution Q_0 ⇔ λ = 0
· "smoothness" of Q = H(Q)
· Q_ME is the smoothest admissible distribution
[Figure: Q(f) as a function of f, comparing a point solution at f_opt with the smooth Q_ME and the uniform Q_0.]
Goal of this work
· Define a discriminative criterion for averaging over models
Extensions
· incorporate prior
· relationship to support vectors
· use generative models
· generalize to other discrimination tasks
Priors
· prior Q_0(f)
· Minimum Relative Entropy Discrimination
    Q_MRE = argmin_Q KL(Q || Q_0)
    s.t. y_t <f(x_t)>_Q ≥ γ for all t = 1, …, T   (C)
· prior on γ ⇒ learn Q_MRE(f, γ) ⇒ soft margin
[Figure: Q_MRE is the KL(Q || Q_0) projection of the prior Q_0 onto the set of admissible Q.]
Soft margins
· average also over the margin γ
· define Q_0(f, γ) = Q_0(f) Q_0(γ)
· constraints: <y_t f(x_t) − γ>_Q(f,γ) ≥ 0
· learn Q_MRE(f, γ) = Q_MRE(f) Q_MRE(γ)
· Q_0(γ) = c exp[ c(γ − 1) ] for γ ≤ 1
[Figure: the resulting margin potential as a function of γ.]
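To see where this margin prior leads (a short check, assuming λ_t < c): integrating it against the exponential factor e^{−λ_t γ} from the solution gives exactly the per-example penalty that reappears in the SVM theorem on the next slide.

```latex
\int_{-\infty}^{1} e^{-\lambda_t \gamma}\, c\, e^{c(\gamma - 1)}\, d\gamma
  \;=\; \frac{c\, e^{-\lambda_t}}{c - \lambda_t}
  \;=\; \frac{e^{-\lambda_t}}{1 - \lambda_t / c},
\qquad\text{so}\qquad
-\log Z_{\gamma}(\lambda_t) \;=\; \lambda_t + \log\!\left(1 - \frac{\lambda_t}{c}\right).
```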
Examples: support vector machines
· Theorem
  For f(x) = θ·x + b, Q_0(θ) = Normal(0, I), and Q_0(b) a non-informative prior, the Lagrange multipliers are obtained by maximizing J(λ) subject to 0 ≤ λ_t ≤ c and Σ_t λ_t y_t = 0, where
      J(λ) = Σ_t [ λ_t + log(1 − λ_t/c) ] − 1/2 Σ_{t,s} λ_t λ_s y_t y_s x_t·x_s
· separable D ⇒ SVM recovered exactly
· inseparable D ⇒ SVM recovered with a different misclassification penalty
· adaptive kernel SVM…
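A sketch of evaluating this dual objective with a linear kernel (function and argument names are hypothetical); maximizing it under the stated constraints yields the multipliers:

```python
import numpy as np

def svm_dual_objective(lmbda, X, y, c):
    """J(λ) = Σ_t [λ_t + log(1 - λ_t/c)] - 1/2 Σ_{t,s} λ_t λ_s y_t y_s x_t·x_s.

    lmbda: (T,) multipliers with 0 <= λ_t < c
    X: (T, d) inputs;  y: (T,) labels in {+1, -1};  c: penalty parameter
    """
    K = X @ X.T                                     # linear kernel x_t · x_s
    penalty = np.sum(lmbda + np.log1p(-lmbda / c))  # per-example potential
    v = lmbda * y
    return penalty - 0.5 * v @ K @ v
```

Note how the log(1 − λ_t/c) term replaces the hard box constraint of the standard SVM dual: it keeps each λ_t strictly below c with an infinite barrier, which is the source of the "different misclassification penalty" above.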
SVM extensions
· example: Leptograpsus crabs (5 inputs, T_train = 80, T_test = 120)
· f(x) = log [P_+(x)/P_−(x)] + b with P_+(x) = Normal(x; m_+, V_+), P_−(x) = Normal(x; m_−, V_−)
· a quadratic classifier; Q(V_+, V_−) = distribution over kernel widths
[Figure: decision boundaries of the MRE Gaussian, the linear SVM, and the maximum-likelihood Gaussian.]
Using generative models
· generative models P_+(x), P_−(x) for y = +1, −1
· f(x) = log [P_+(x)/P_−(x)] + b
· learn Q_MRE(P_+, P_−, b, γ)
· if Q_0(P_+, P_−, b, γ) = Q_0(P_+) Q_0(P_−) Q_0(b) Q_0(γ), then
  Q_MRE(P_+, P_−, b, γ) = Q_MRE(P_+) Q_MRE(P_−) Q_MRE(b) Q_MRE(γ)
  (factored prior ⇒ factored posterior)
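A minimal sketch of this log-ratio discriminant for the Gaussian case used in the crabs example (scipy assumed; names are hypothetical):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_ratio_discriminant(x, m_pos, V_pos, m_neg, V_neg, b=0.0):
    """f(x) = log P_+(x) - log P_-(x) + b with Gaussian class models."""
    return (multivariate_normal.logpdf(x, mean=m_pos, cov=V_pos)
            - multivariate_normal.logpdf(x, mean=m_neg, cov=V_neg)
            + b)
```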
Examples: other distributions
· Multinomial (1 discrete variable)
· Graphical model (fixed structure, no hidden variables)
· Tree graphical model (Q over structures and parameters)
Tree graphical models
· P(x | E, θ) = P_0(x) Π_{uv∈E} P_uv(x_u, x_v | θ_uv)
· prior Q_0(P) = Q_0(E) Q_0(θ | E)
· Q_0(E) ∝ Π_{uv∈E} β_uv
· Q_0(θ | E) = conjugate prior
· posterior: Q_MRE(E) ∝ W_0 Π_{uv∈E} W_uv
· the conjugate prior Q_0(P) can be integrated analytically over both E and θ
[Figure: example tree structures E.]
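The analytic integration over structures rests on the matrix-tree theorem: a sum of Π_{uv∈E} W_uv over all spanning trees is a single determinant. A sketch of that computation (not the talk's full posterior):

```python
import numpy as np

def sum_over_spanning_trees(W):
    """Σ_E Π_{uv in E} W_uv over all spanning trees E of the complete
    graph, via the matrix-tree theorem.

    W: symmetric (n, n) array of non-negative edge weights, zero diagonal
    """
    L = np.diag(W.sum(axis=1)) - W   # weighted graph Laplacian
    return np.linalg.det(L[1:, 1:])  # any principal cofactor of L
```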
Trees: experiments
· splice junction classification task
· 25 inputs, 400 training examples
· compared with maximum-likelihood trees
[Figure: test error — ML, err = 14%; MaxEnt, err = 12.3%.]
Trees: experiments (cont'd)
[Figure: the learned weights of the tree edges.]
Discrimination tasks
· Classification
· Classification with partially labeled data
· Anomaly detection
[Figure: example data for the three tasks — labeled points (+, −) and unlabeled points (x).]
Partially labeled data
· Problem: given F, a family of discriminants, and data set D = { (x_1, y_1), …, (x_T, y_T), x_{T+1}, …, x_N }, find
    Q(f, γ, y) = argmin_Q KL(Q || Q_0)
    s.t. <y_t f(x_t) − γ>_Q ≥ 0 for all t = 1, …, T   (C)
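For an unlabeled point, the label itself is averaged under Q; a one-line sketch of the resulting expected margin, writing q for Q(y_t = +1):

```python
def expected_margin(f_x, q_pos):
    """<y f(x)>_Q(y) = q·f(x) + (1 - q)·(-f(x)) = (2q - 1)·f(x)."""
    return (2.0 * q_pos - 1.0) * f_x
```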
Partially labeled data: experiment
· splice junction classification
· 25 inputs, T_total = 1000
[Figure: learning curves for complete data, 10% labeled + 90% unlabeled, and 10% labeled only.]
Anomaly detection
· Problem: given P = { P }, a family of generative models, and data set D = { x_1, …, x_T }, find Q(P) such that
    Q(P, γ) = argmin_Q KL(Q || Q_0)
    s.t. <log P(x_t) − γ>_Q ≥ 0 for all t = 1, …, T   (C)
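A sketch of the resulting detection rule: once Q(P) is learned, a new point is flagged when its averaged log-likelihood falls below the margin (finite family of models; the log_pdfs interface is hypothetical):

```python
def is_anomaly(x, log_pdfs, Q, gamma):
    """Flag x if <log P(x)>_Q < γ under the learned model average.

    log_pdfs: list of callables log P_i(x) -> float
    Q: weights of the learned distribution over models, summing to one
    """
    avg_ll = sum(q * lp(x) for q, lp in zip(Q, log_pdfs))  # <log P(x)>_Q
    return avg_ll < gamma
```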
Anomaly detection: experiments
[Figures: learned models and detected anomalies, MaxEnt vs. maximum likelihood.]
Conclusions
· new framework for classification
· based on regularization in the space of distributions
· enables the use of generative models
· enables the use of priors
· generalizes to other discrimination tasks