Gradient-based estimation for maximum entropy
Micha Elsner
January 23, 2013
The task
Last week, we derived the maximum-entropy classifier:
P(T = noun | F) = exp(Θ_noun · F) / ∑_t′ exp(Θ_t′ · F)

I said that:
- The weights Θ are trained to optimize P(T | F)
- This solves the feature correlation problem
- The mathematical form of the equation makes this training convenient
High-level points
- Optimizing P(T | F) is still maximum likelihood
  - But we can't do it by count-and-divide
  - The θ values depend on one another: this is how we correct for correlation!
- Instead we'll use an iterative hill-climbing method
  - This is how we optimized α and β in the LM
- Hill-climbing can be slow...
  - It can take many steps to reach the answer
  - We can try to take fewer steps
  - Or speed each step up
  - Both are research topics
- You should use an off-the-shelf optimization package
  - But you may need to do some work first
Climbing hills
Until the ground is level:
- Figure out which way is uphill
- Go in that direction for a while
- Check the slope of the ground again

Mathematical tasks:
- Formalize slope, uphill
  - Derivatives (calculus)
- Formalize "go for a while"
  - Step size, line search, etc.
Basic calculus: tangent lines
[Figure: a function plotted over x ∈ [−4, 4], with its tangent lines shown in a second panel]
Derivative of f (x)
Written df/dx or f′(x)

The slope of f(x) at every point x:
- Positive when f(x) is increasing
- Negative when f(x) is decreasing
- 0 when f(x) is neither increasing nor decreasing
  - Maxima, minima, and saddle points
Maximizing a function
At the maximum of a function, the derivative is 0
- If there are no minima or saddle points...
- ...the derivative has a single zero
- ...at the maximum
Hill-climbing, now with more math
Begin with an arbitrary guess x*...
Until df/dx(x*) = 0 (the ground is level):
- Figure out which way is uphill: check the sign of df/dx(x*)
- Go in that direction for a while: pick a new x*
- Check the slope of the ground again: recompute df/dx(x*)
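The loop above can be sketched in a few lines of Python; the function name `hill_climb`, the fixed step size, and the tolerance are my own illustrative choices, not anything from the slides:

```python
def hill_climb(df_dx, x, step=0.1, tol=1e-6, max_iters=10000):
    """1-D gradient ascent: move x uphill until df/dx(x) is (nearly) 0."""
    for _ in range(max_iters):
        slope = df_dx(x)          # check the slope of the ground
        if abs(slope) < tol:      # the ground is (nearly) level: stop
            break
        x += step * slope         # positive slope: move right; negative: move left
    return x
```

For example, maximizing f(x) = −(x − 3)² with derivative −2(x − 3) converges to x ≈ 3.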
Let’s compute the derivative
- I'm divided on whether to tackle the multiclass formula:

  P(t | F) = exp(Θ_t · F) / ∑_t′ exp(Θ_t′ · F)

- Or the binary formula, which is simpler:

  P(t | F) = 1 / (1 + exp(−Θ_t · F))

- I'm following Klein and doing the general one
  - The derivative is more interpretable
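A minimal NumPy sketch of the multiclass formula (the max-subtraction for numerical stability is my addition, not something covered here):

```python
import numpy as np

def class_probs(thetas, f):
    """P(t | F) = exp(Theta_t . F) / sum_t' exp(Theta_t' . F) for every class t.

    thetas: (num_classes, num_features) weight matrix; f: feature vector."""
    scores = thetas @ f              # Theta_t . F for each class t
    scores = scores - scores.max()   # stabilize: exp of large scores overflows
    exps = np.exp(scores)
    return exps / exps.sum()
```

With exactly two classes this reduces to the binary formula, with (Θ_t − Θ_t′) playing the role of Θ_t.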
Notation
Θ_t · F

Last time I was a bit vague about Θ and F
- Partly because there are many ways to do it

This time:
- There is a vector Θ_t for each class t
- Its entries are θ_{t,1}, θ_{t,2}, ...
- The F vector has entries f_1, f_2, ...
- These correspond to feature-value combinations

Θ_noun = (θ_{noun, word-is-fish}, θ_{noun, word-is-buy}, ...)
Θ_adj = (θ_{adj, word-is-fish}, θ_{adj, word-is-buy}, ...)
F = (f_{word-is-fish}, f_{word-is-buy}, ...)
Write out the likelihood

You've seen this part:

P(D; Θ) = ∏_{t_i,F_i ∈ D} P(t_i | F_i; Θ)

log P(D; Θ) = ∑_{t_i,F_i ∈ D} log P(t_i | F_i; Θ)

log P(D; Θ) = ∑_{t_i,F_i ∈ D} log [ exp(Θ_{t_i} · F_i) / ∑_t′ exp(Θ_t′ · F_i) ]

Recalling that log (a/b) = log a − log b:

log P(D; Θ) = ∑_{t_i,F_i ∈ D} ( log exp(Θ_{t_i} · F_i) − log ∑_t′ exp(Θ_t′ · F_i) )

...and that log exp x = x, and splitting the sum...

log P(D; Θ) = ∑_{t_i,F_i ∈ D} Θ_{t_i} · F_i − ∑_{t_i,F_i ∈ D} log ∑_t′ exp(Θ_t′ · F_i)
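The final line translates directly into code. This sketch (the names are mine) computes log P(D; Θ), using the standard log-sum-exp trick for the denominator:

```python
import numpy as np

def log_likelihood(thetas, data):
    """log P(D; Theta) = sum_i [ Theta_{t_i} . F_i - log sum_t' exp(Theta_t' . F_i) ].

    data: list of (tag_index, feature_vector) pairs."""
    total = 0.0
    for t_i, f_i in data:
        scores = thetas @ f_i                         # Theta_t . F_i for all tags t
        m = scores.max()                              # log-sum-exp stabilization
        log_z = m + np.log(np.exp(scores - m).sum())
        total += scores[t_i] - log_z
    return total
```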
Taking the derivative
Partial derivative with respect to a particular weight
- Let's focus on a particular θ_{t,j} (from Θ_t, feature j)
- Derivative of log P(D) treating θ_{t,j} as a variable
  - And all other θ as constants

∂ log P(D) / ∂θ_{t,j}

Gradient ∇_Θ
The vector of partial derivatives for all θ_{t,j} in Θ
- A list of numbers, same dimensions as Θ
- Indicates the slope along each axis (direction)

This is why we call these estimation techniques gradient methods
First step
- Linearity: the derivative of a sum is the sum of the derivatives

∂/∂θ_{t,j} log P(D; Θ) = ∂/∂θ_{t,j} ∑_{t_i,F_i ∈ D} Θ_{t_i} · F_i − ∂/∂θ_{t,j} ∑_{t_i,F_i ∈ D} log ∑_t′ exp(Θ_t′ · F_i)
Derivative of the first part

- This bit comes from the numerator
- Push the derivative through the sum
  - By applying the linearity rule again

∂/∂θ_{t,j} ∑_{t_i,F_i ∈ D} Θ_{t_i} · F_i = ∑_{t_i,F_i ∈ D} ∂/∂θ_{t,j} (Θ_{t_i} · F_i)

θ_{t,j} is only involved here if t_i = t
- That is, if this example has true tag t

Recalling that the dot product Θ · F = θ_1 f_1 + θ_2 f_2 + ...
- Treating all the θ except θ_{t,j} as constants
- The derivative of a constant is 0 (no slope)

∂/∂θ_{t,j} ∑_{t_i,F_i ∈ D} Θ_{t_i} · F_i = ∑_{t_i,F_i ∈ D : t_i = t} ∂/∂θ_{t,j} θ_{t,j} f_{i,j}
Numerator continued
∂/∂θ_{t,j} ∑_{t_i,F_i ∈ D} Θ_{t_i} · F_i = ∑_{t_i,F_i ∈ D : t_i = t} ∂/∂θ_{t,j} θ_{t,j} f_{i,j}

= ∑_{t_i,F_i ∈ D : t_i = t} f_{i,j}

= #(t, f_j)

The number of times tag t occurs with feature f_j
Derivative of the second part
∂/∂θ_{t,j} log P(D; Θ) = ∂/∂θ_{t,j} ∑_{t_i,F_i ∈ D} Θ_{t_i} · F_i − ∂/∂θ_{t,j} ∑_{t_i,F_i ∈ D} log ∑_t′ exp(Θ_t′ · F_i)

- Again, pushing the derivative through the sum:

∑_{t_i,F_i ∈ D} ∂/∂θ_{t,j} log ∑_t′ exp(Θ_t′ · F_i)

- The derivative of log f(x) is (1 / f(x)) · d/dx f(x):

∑_{t_i,F_i ∈ D} [ 1 / ∑_t′ exp(Θ_t′ · F_i) ] · ∂/∂θ_{t,j} ∑_t′ exp(Θ_t′ · F_i)
Continued...

- In the rightmost sum, θ_{t,j} appears when t′ = t (for every example)
- The derivative of exp f(x) is exp f(x) · d/dx f(x)
- And we know the derivative of Θ · F

∑_{t_i,F_i ∈ D} [ 1 / ∑_t′ exp(Θ_t′ · F_i) ] · ∂/∂θ_{t,j} ∑_t′ exp(Θ_t′ · F_i)

= ∑_{t_i,F_i ∈ D} [ exp(Θ_t · F_i) / ∑_t′ exp(Θ_t′ · F_i) ] · ∂/∂θ_{t,j} (Θ_t · F_i)

= ∑_{t_i,F_i ∈ D} [ exp(Θ_t · F_i) / ∑_t′ exp(Θ_t′ · F_i) ] · f_{i,j}

= ∑_{t_i,F_i ∈ D} P(t | F_i) f_{i,j}

The expected number of times tag t occurs with feature f_j
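Observed counts minus expected counts gives a direct recipe for computing the gradient. A sketch (again with made-up names), accumulating both terms in one pass over the data:

```python
import numpy as np

def gradient(thetas, data):
    """d log P(D) / d theta_{t,j} = observed count of (t, f_j) minus expected count."""
    grad = np.zeros_like(thetas)
    for t_i, f_i in data:
        scores = thetas @ f_i
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()               # P(t | F_i) for every tag t
        grad[t_i] += f_i                   # observed: features firing with the true tag
        grad -= np.outer(probs, f_i)       # expected: P(t | F_i) * f_{i,j}
    return grad
```

At the maximum the two terms cancel for every (t, j): expected counts match observed counts.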
What we’ve learned
The derivative of the maximum-entropy model:
- A difference of two quantities
  - The observed counts of feature, tag
  - The expected counts of feature, tag
- Setting the derivative to 0 means making what we expect match what we see

Differentiating sums is easy
- Most terms get thrown away
Formal advantages of the maximum entropy model
- The use of dot products means a lot of terms in the derivative go to 0
- The use of exponentials means that the derivative is expressed in terms of P(t | F)
  - Which is useful since we needed to code it already
- Not using probabilities means we have no constraints
  - θ can be any number
- It's also possible to prove convexity
  - Only a single zero point in the derivative: a unique maximum
  - We won't prove this, but Klein does (slide 67)
Optimization based on the gradient
Until the ground is level:
- Figure out which way is uphill
- Go in that direction for a while
- Check the slope of the ground again

Start with guess Θ*
Until gradient ∇_Θ is (nearly) 0:
- Compute gradient ∇_Θ
- Add a multiple of the gradient: Θ* ← Θ* + η∇_Θ
- Recompute gradient
Basic algorithm

Step size 0.05, optimizing −x², d/dx = −2x
- Well-behaved but slow

[Figure: gradient ascent on −x² taking many small steps toward the maximum at x = 0]
Basic algorithm

Step size 0.8, optimizing −x², d/dx = −2x
- We overshoot the maximum but converge relatively fast

[Figure: gradient ascent on −x² bouncing from side to side of the maximum while closing in on it]
Basic algorithm (3)

Step size 1.1, optimizing −x², d/dx = −2x
- Steps that are too large cause rapid divergence

[Figure: gradient ascent on −x² with each step overshooting farther; the iterates diverge]
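The three runs above can be reproduced with the one-line update x ← x + η · (−2x); the helper below is my own illustration, not code from the slides:

```python
def run(eta, x=4.0, steps=20):
    """Fixed-step gradient ascent on f(x) = -x^2 (df/dx = -2x), maximum at x = 0."""
    for _ in range(steps):
        x += eta * (-2.0 * x)   # each step multiplies x by (1 - 2*eta)
    return x

# eta = 0.05: factor 0.9 per step, slow steady convergence
# eta = 0.8:  factor -0.6 per step, overshoots but converges
# eta = 1.1:  factor -1.2 per step, diverges
```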
Taking fewer steps

- Instead of trying to guess a step size, do a line search
- Try to approximate the curvature of the function (the second derivative)
- Keep track of the overall trajectory of search points
- Other stuff...

LBFGS
Limited-memory Broyden-Fletcher-Goldfarb-Shanno method
- Relies on approximating the second derivative
- Uses a line search
- Very popular for maximum entropy
- I don't know more about how it works

Variant: OWL-QN (orthant-wise limited-memory quasi-Newton method)
- Popular with some kinds of smoothing
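In practice "use an off-the-shelf package" means handing the negative log-likelihood and its gradient to a library routine. As one possible sketch, SciPy ships an L-BFGS implementation; here it is on a toy quadratic (optimizers conventionally minimize, so we negate the function we want to maximize):

```python
import numpy as np
from scipy.optimize import minimize

# Maximize f(x, y) = -(x - 1)^2 - (y + 2)^2 by minimizing its negation.
def neg_f(v):
    return (v[0] - 1.0) ** 2 + (v[1] + 2.0) ** 2

def neg_grad(v):
    return np.array([2.0 * (v[0] - 1.0), 2.0 * (v[1] + 2.0)])

result = minimize(neg_f, np.zeros(2), jac=neg_grad, method="L-BFGS-B")
```

For maximum entropy, `neg_f` would be −log P(D; Θ) and `neg_grad` the expected-minus-observed counts.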
A picture of an Owl Queen
credit: seller poordogfarm on etsy
Making the steps faster

The gradient involved the term:

∑_{t_i,F_i ∈ D} P(t | F_i) f_{i,j}

This sum is slow to compute (a loop over all training items)
- Approximate it with a subset of the training items
- Law of large numbers: larger subsets give more precise results
- Algorithm: when calculating ∇_Θ
  - Pick d random items from D
  - Calculate ∇_Θ on that subset

Stochastic gradient algorithms
- So called because we choose the d items at random
- d = 1 is a variant of the perceptron algorithm
  - ...though the original version is not probabilistic
- Other variants do similar things
  - e.g. passive-aggressive updating
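A sketch of one stochastic update, reusing the observed-minus-expected gradient from earlier in the lecture (the function and parameter names are my own):

```python
import random
import numpy as np

def sgd_step(thetas, data, d, eta):
    """Estimate the gradient on d randomly chosen examples, then take one step."""
    grad = np.zeros_like(thetas)
    for t_i, f_i in random.sample(data, d):   # random subset of D
        scores = thetas @ f_i
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                  # P(t | F_i) for every tag t
        grad[t_i] += f_i                      # observed counts
        grad -= np.outer(probs, f_i)          # expected counts
    return thetas + eta * grad
```

With d equal to the size of D this is an ordinary full-batch gradient step; smaller d trades precision for speed.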