Gradient-based estimation for maximum entropy
Micha Elsner
January 23, 2013
The task
Last week, we derived the maximum-entropy classifier:
P(T = noun | F) = exp(Θ_noun · F) / ∑_t′ exp(Θ_t′ · F)

I said that:
- The weights Θ are trained to optimize P(T | F)
- This solves the feature correlation problem
- The mathematical form of the equation makes this training convenient
High-level points
- Optimizing P(T | F) is still maximum likelihood
  - But we can't do it by count-and-divide
  - The θ values depend on one another: this is how we correct for correlation!
- Instead we'll use an iterative hill-climbing method
  - This is how we optimized α and β in the LM
- Hill-climbing can be slow...
  - It can take many steps to reach the answer
  - We can try to take fewer steps
  - Or speed each step up
  - Both are research topics
- You should use an off-the-shelf optimization package
  - But you may need to do some work first
Climbing hills
Until the ground is level:
- Figure out which way is uphill
- Go in that direction for a while
- Check the slope of the ground again

Mathematical tasks:
- Formalize slope, uphill
  - Derivatives (calculus)
- Formalize "go for a while"
  - Step size, line search, etc.
Basic calculus: tangent lines
[Figure: a function plotted over x ∈ [−4, 4], with its tangent lines shown in a second panel]
Derivative of f (x)
Written df/dx or f′(x)

The slope of f(x) at every point x:
- Positive when f(x) is increasing
- Negative when f(x) is decreasing
- 0 when f(x) is neither increasing nor decreasing
  - Maxima, minima, and saddle points
Maximizing a function
At the maximum of a function, the derivative is 0
- If there are no minima or saddle points...
- ...the derivative has a single zero
- ...at the maximum
Hill-climbing, now with more math
Begin with an arbitrary guess x*...
Until df/dx(x*) = 0 (the ground is level):
- Figure out which way is uphill: check the sign of df/dx(x*)
- Go in that direction for a while: pick a new x*
- Check the slope of the ground again: recompute df/dx(x*)
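The loop above can be sketched in a few lines of Python; the function name `hill_climb`, the fixed step size, and the tolerance are my own illustrative choices, not anything from the slides:

```python
def hill_climb(df_dx, x, step=0.1, tol=1e-6, max_iters=10000):
    """1-D gradient ascent: move x uphill until df/dx(x) is (nearly) 0."""
    for _ in range(max_iters):
        slope = df_dx(x)          # check the slope of the ground
        if abs(slope) < tol:      # the ground is (nearly) level: stop
            break
        x += step * slope         # positive slope: move right; negative: move left
    return x
```

For example, maximizing f(x) = −(x − 3)² with derivative −2(x − 3) converges to x ≈ 3.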
Let’s compute the derivative
- I'm divided on whether to tackle the multiclass formula:

  P(t | F) = exp(Θ_t · F) / ∑_t′ exp(Θ_t′ · F)

- Or the binary formula, which is simpler:

  P(t | F) = 1 / (1 + exp(−Θ_t · F))

- I'm following Klein and doing the general one
  - The derivative is more interpretable
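A minimal NumPy sketch of the multiclass formula (the max-subtraction for numerical stability is my addition, not something covered here):

```python
import numpy as np

def class_probs(thetas, f):
    """P(t | F) = exp(Theta_t . F) / sum_t' exp(Theta_t' . F) for every class t.

    thetas: (num_classes, num_features) weight matrix; f: feature vector."""
    scores = thetas @ f              # Theta_t . F for each class t
    scores = scores - scores.max()   # stabilize: exp of large scores overflows
    exps = np.exp(scores)
    return exps / exps.sum()
```

With exactly two classes this reduces to the binary formula, with (Θ_t − Θ_t′) playing the role of Θ_t.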
Notation
Θ_t · F

Last time I was a bit vague about Θ and F
- Partly because there are many ways to do it

This time:
- There is a vector Θ_t for each class t
- Its entries are θ_{t,1}, θ_{t,2}, ...
- The F vector has entries f_1, f_2, ...
- These correspond to feature-value combinations

Θ_noun = (θ_{noun, word-is-fish}, θ_{noun, word-is-buy}, ...)
Θ_adj = (θ_{adj, word-is-fish}, θ_{adj, word-is-buy}, ...)
F = (f_{word-is-fish}, f_{word-is-buy}, ...)
Write out the likelihood

You've seen this part:

P(D; Θ) = ∏_{t_i,F_i ∈ D} P(t_i | F_i; Θ)

log P(D; Θ) = ∑_{t_i,F_i ∈ D} log P(t_i | F_i; Θ)

log P(D; Θ) = ∑_{t_i,F_i ∈ D} log [ exp(Θ_{t_i} · F_i) / ∑_t′ exp(Θ_t′ · F_i) ]

Recalling that log (a/b) = log a − log b:

log P(D; Θ) = ∑_{t_i,F_i ∈ D} ( log exp(Θ_{t_i} · F_i) − log ∑_t′ exp(Θ_t′ · F_i) )

...and that log exp x = x, and splitting the sum...

log P(D; Θ) = ∑_{t_i,F_i ∈ D} Θ_{t_i} · F_i − ∑_{t_i,F_i ∈ D} log ∑_t′ exp(Θ_t′ · F_i)
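The final line translates directly into code. This sketch (the names are mine) computes log P(D; Θ), using the standard log-sum-exp trick for the denominator:

```python
import numpy as np

def log_likelihood(thetas, data):
    """log P(D; Theta) = sum_i [ Theta_{t_i} . F_i - log sum_t' exp(Theta_t' . F_i) ].

    data: list of (tag_index, feature_vector) pairs."""
    total = 0.0
    for t_i, f_i in data:
        scores = thetas @ f_i                         # Theta_t . F_i for all tags t
        m = scores.max()                              # log-sum-exp stabilization
        log_z = m + np.log(np.exp(scores - m).sum())
        total += scores[t_i] - log_z
    return total
```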
Taking the derivative
Partial derivative with respect to a particular weight
- Let's focus on a particular θ_{t,j} (from Θ_t, feature j)
- Derivative of log P(D) treating θ_{t,j} as a variable
  - And all other θ as constants

∂ log P(D) / ∂θ_{t,j}

Gradient ∇_Θ
The vector of partial derivatives for all θ_{t,j} in Θ
- A list of numbers, same dimensions as Θ
- Indicates the slope along each axis (direction)

This is why we call these estimation techniques gradient methods
First step
- Linearity: the derivative of a sum is the sum of the derivatives

∂/∂θ_{t,j} log P(D; Θ) = ∂/∂θ_{t,j} ∑_{t_i,F_i ∈ D} Θ_{t_i} · F_i − ∂/∂θ_{t,j} ∑_{t_i,F_i ∈ D} log ∑_t′ exp(Θ_t′ · F_i)
Derivative of the first part

- This bit comes from the numerator
- Push the derivative through the sum
  - By applying the linearity rule again

∂/∂θ_{t,j} ∑_{t_i,F_i ∈ D} Θ_{t_i} · F_i = ∑_{t_i,F_i ∈ D} ∂/∂θ_{t,j} (Θ_{t_i} · F_i)

θ_{t,j} is only involved here if t_i = t
- That is, if this example has true tag t

Recalling that the dot product Θ · F = θ_1 f_1 + θ_2 f_2 + ...
- Treating all the θ except θ_{t,j} as constants
- The derivative of a constant is 0 (no slope)

∂/∂θ_{t,j} ∑_{t_i,F_i ∈ D} Θ_{t_i} · F_i = ∑_{t_i,F_i ∈ D : t_i = t} ∂/∂θ_{t,j} θ_{t,j} f_{i,j}
Numerator continued
∂/∂θ_{t,j} ∑_{t_i,F_i ∈ D} Θ_{t_i} · F_i = ∑_{t_i,F_i ∈ D : t_i = t} ∂/∂θ_{t,j} θ_{t,j} f_{i,j}

= ∑_{t_i,F_i ∈ D : t_i = t} f_{i,j}

= #(t, f_j)

The number of times tag t occurs with feature f_j
Derivative of the second part
∂/∂θ_{t,j} log P(D; Θ) = ∂/∂θ_{t,j} ∑_{t_i,F_i ∈ D} Θ_{t_i} · F_i − ∂/∂θ_{t,j} ∑_{t_i,F_i ∈ D} log ∑_t′ exp(Θ_t′ · F_i)

- Again, pushing the derivative through the sum:

∑_{t_i,F_i ∈ D} ∂/∂θ_{t,j} log ∑_t′ exp(Θ_t′ · F_i)

- The derivative of log f(x) is (1 / f(x)) · d/dx f(x):

∑_{t_i,F_i ∈ D} [ 1 / ∑_t′ exp(Θ_t′ · F_i) ] · ∂/∂θ_{t,j} ∑_t′ exp(Θ_t′ · F_i)
Continued...

- In the rightmost sum, θ_{t,j} appears when t′ = t (for every example)
- The derivative of exp f(x) is exp f(x) · d/dx f(x)
- And we know the derivative of Θ · F

∑_{t_i,F_i ∈ D} [ 1 / ∑_t′ exp(Θ_t′ · F_i) ] · ∂/∂θ_{t,j} ∑_t′ exp(Θ_t′ · F_i)

= ∑_{t_i,F_i ∈ D} [ exp(Θ_t · F_i) / ∑_t′ exp(Θ_t′ · F_i) ] · ∂/∂θ_{t,j} (Θ_t · F_i)

= ∑_{t_i,F_i ∈ D} [ exp(Θ_t · F_i) / ∑_t′ exp(Θ_t′ · F_i) ] · f_{i,j}

= ∑_{t_i,F_i ∈ D} P(t | F_i) f_{i,j}

The expected number of times tag t occurs with feature f_j
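Observed counts minus expected counts gives a direct recipe for computing the gradient. A sketch (again with made-up names), accumulating both terms in one pass over the data:

```python
import numpy as np

def gradient(thetas, data):
    """d log P(D) / d theta_{t,j} = observed count of (t, f_j) minus expected count."""
    grad = np.zeros_like(thetas)
    for t_i, f_i in data:
        scores = thetas @ f_i
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()               # P(t | F_i) for every tag t
        grad[t_i] += f_i                   # observed: features firing with the true tag
        grad -= np.outer(probs, f_i)       # expected: P(t | F_i) * f_{i,j}
    return grad
```

At the maximum the two terms cancel for every (t, j): expected counts match observed counts.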
What we’ve learned
The derivative of the maximum-entropy model:
- A difference of two quantities
  - The observed counts of feature, tag
  - The expected counts of feature, tag
- Setting the derivative to 0 means making what we expect match what we see

Differentiating sums is easy
- Most terms get thrown away
Formal advantages of the maximum entropy model
- The use of dot products means a lot of terms in the derivative go to 0
- The use of exponentials means that the derivative is expressed in terms of P(t | F)
  - Which is useful since we needed to code it already
- Not using probabilities means we have no constraints
  - θ can be any number
- It's also possible to prove convexity
  - Only a single zero point in the derivative: a unique maximum
  - We won't prove this, but Klein does (slide 67)
Optimization based on the gradient
Until the ground is level:
- Figure out which way is uphill
- Go in that direction for a while
- Check the slope of the ground again

Start with guess Θ*
Until gradient ∇_Θ is (nearly) 0:
- Compute gradient ∇_Θ
- Add a multiple of the gradient: Θ* ← Θ* + η∇_Θ
- Recompute gradient
Basic algorithm

Step size 0.05, optimizing −x², d/dx = −2x
- Well-behaved but slow

[Figure: gradient ascent on −x² taking many small steps toward the maximum at x = 0]
Basic algorithm

Step size 0.8, optimizing −x², d/dx = −2x
- We overshoot the maximum but converge relatively fast

[Figure: gradient ascent on −x² bouncing from side to side of the maximum while closing in on it]
Basic algorithm (3)

Step size 1.1, optimizing −x², d/dx = −2x
- Steps that are too large cause rapid divergence

[Figure: gradient ascent on −x² with each step overshooting farther; the iterates diverge]
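The three runs above can be reproduced with the one-line update x ← x + η · (−2x); the helper below is my own illustration, not code from the slides:

```python
def run(eta, x=4.0, steps=20):
    """Fixed-step gradient ascent on f(x) = -x^2 (df/dx = -2x), maximum at x = 0."""
    for _ in range(steps):
        x += eta * (-2.0 * x)   # each step multiplies x by (1 - 2*eta)
    return x

# eta = 0.05: factor 0.9 per step, slow steady convergence
# eta = 0.8:  factor -0.6 per step, overshoots but converges
# eta = 1.1:  factor -1.2 per step, diverges
```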
Taking fewer steps

- Instead of trying to guess a step size, do a line search
- Try to approximate the curvature of the function (the second derivative)
- Keep track of the overall trajectory of search points
- Other stuff...

LBFGS
Limited-memory Broyden-Fletcher-Goldfarb-Shanno method
- Relies on approximating the second derivative
- Uses a line search
- Very popular for maximum entropy
- I don't know more about how it works

Variant: OWL-QN (orthant-wise limited-memory quasi-Newton method)
- Popular with some kinds of smoothing
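In practice "use an off-the-shelf package" means handing the negative log-likelihood and its gradient to a library routine. As one possible sketch, SciPy ships an L-BFGS implementation; here it is on a toy quadratic (optimizers conventionally minimize, so we negate the function we want to maximize):

```python
import numpy as np
from scipy.optimize import minimize

# Maximize f(x, y) = -(x - 1)^2 - (y + 2)^2 by minimizing its negation.
def neg_f(v):
    return (v[0] - 1.0) ** 2 + (v[1] + 2.0) ** 2

def neg_grad(v):
    return np.array([2.0 * (v[0] - 1.0), 2.0 * (v[1] + 2.0)])

result = minimize(neg_f, np.zeros(2), jac=neg_grad, method="L-BFGS-B")
```

For maximum entropy, `neg_f` would be −log P(D; Θ) and `neg_grad` the expected-minus-observed counts.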
A picture of an Owl Queen
credit: seller poordogfarm on etsy
Making the steps faster

The gradient involved the term:

∑_{t_i,F_i ∈ D} P(t | F_i) f_{i,j}

This sum is slow to compute (a loop over all training items)
- Approximate it with a subset of the training items
- Law of large numbers: larger subsets give more precise results
- Algorithm: when calculating ∇_Θ
  - Pick d random items from D
  - Calculate ∇_Θ on that subset

Stochastic gradient algorithms
- So called because we choose the d items at random
- d = 1 is a variant of the perceptron algorithm
  - ...though the original version is not probabilistic
- Other variants do similar things
  - e.g. passive-aggressive updating
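A sketch of one stochastic update, reusing the observed-minus-expected gradient from earlier in the lecture (the function and parameter names are my own):

```python
import random
import numpy as np

def sgd_step(thetas, data, d, eta):
    """Estimate the gradient on d randomly chosen examples, then take one step."""
    grad = np.zeros_like(thetas)
    for t_i, f_i in random.sample(data, d):   # random subset of D
        scores = thetas @ f_i
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                  # P(t | F_i) for every tag t
        grad[t_i] += f_i                      # observed counts
        grad -= np.outer(probs, f_i)          # expected counts
    return thetas + eta * grad
```

With d equal to the size of D this is an ordinary full-batch gradient step; smaller d trades precision for speed.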