Page 1:

+Maxent Models and Discriminative Estimation

Generative vs. Discriminative models

Page 2:

+ Introduction
- So far we’ve looked at “generative models”
  - Language models, Naive Bayes
- But there is now much use of conditional or discriminative probabilistic models in NLP, Speech, IR (and ML generally)
- Because:
  - They give high accuracy performance
  - They make it easy to incorporate lots of linguistically important features
  - They allow automatic building of language-independent, retargetable NLP modules

From Jurafsky & Martin Coursera & 3rd Edition

Page 3:

+ Naïve Bayes Classifier: Joint (generative) models

    c_MAP = argmax_{c∈C} P(c | d)                     MAP is “maximum a posteriori” = most likely class
          = argmax_{c∈C} P(d | c) P(c) / P(d)         Bayes Rule
          = argmax_{c∈C} P(d | c) P(c)                Dropping the denominator

Where “d” is the words of the document, and word likelihoods are estimated with add-1 smoothing:

    P̂(w | c) = (count(w,c) + 1) / (count(c) + |V|)
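A minimal sketch of the decision rule and add-1 smoothed estimate above, on a two-document toy corpus. The function names (train_nb, classify_nb) and the toy data are my own illustration, not from the slides.

```python
from collections import Counter, defaultdict
from math import log

def train_nb(docs):
    """docs: list of (word_list, class_label). Returns class counts, word counts, vocab."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)          # word_counts[c][w] = count(w, c)
    vocab = set()
    for words, c in docs:
        class_counts[c] += 1
        word_counts[c].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def classify_nb(words, class_counts, word_counts, vocab):
    """Return argmax_c log P(c) + Σ_w log P̂(w|c), with add-1 smoothing."""
    n_docs = sum(class_counts.values())
    best_c, best_score = None, float("-inf")
    for c in class_counts:
        total_c = sum(word_counts[c].values())
        score = log(class_counts[c] / n_docs)
        for w in words:
            score += log((word_counts[c][w] + 1) / (total_c + len(vocab)))
        if score > best_score:
            best_c, best_score = c, score
    return best_c

docs = [(["Monaco", "Monaco"], "Europe"), (["Hong", "Kong"], "Asia")]
print(classify_nb(["Hong", "Kong", "Monaco"], *train_nb(docs)))
```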

Page 4:

Joint vs. Conditional Models
- We have some data {(d, c)} of paired observations d and hidden classes c.
- Joint (generative) models place probabilities over both observed data and the hidden stuff (generate the observed data from hidden stuff): P(c,d)
  - All the classic StatNLP models: n-gram models, Naive Bayes classifiers, hidden Markov models, probabilistic context-free grammars, IBM machine translation alignment models

From Jurafsky & Martin Coursera & 3rd Edition

Page 5:

Joint vs. Conditional Models
- Discriminative (conditional) models take the data as given, and put a probability over hidden structure given the data: P(c|d)
  - Logistic regression, conditional loglinear or maximum entropy models, conditional random fields
  - Also, SVMs, (averaged) perceptron, etc. are discriminative classifiers (but not directly probabilistic)

From Jurafsky & Martin Coursera & 3rd Edition

Page 6:

+ Bayes Net/Graphical Models
- Bayes net diagrams draw circles for random variables, and lines for direct dependencies
- Some variables are observed; some are hidden
- Each node is a little classifier (conditional probability table) based on incoming arcs

[Diagrams: Naive Bayes (generative): arcs from class c to observations d1, d2, d3. Logistic Regression (discriminative): arcs from d1, d2, d3 to class c.]

From Jurafsky & Martin Coursera & 3rd Edition

Page 7:

+ Conditional vs. Joint Likelihood
- A joint model gives probabilities P(d,c) and tries to maximize this joint likelihood.
  - It turns out to be trivial to choose weights: just relative frequencies.
- A conditional model gives probabilities P(c|d). It takes the data as given and models only the conditional probability of the class.
  - We seek to maximize conditional likelihood.
  - Harder to do (as we’ll see…)
  - More closely related to classification error.

From Jurafsky & Martin Coursera & 3rd Edition

Page 8:

Conditional models work well: Word Sense Disambiguation
- Even with exactly the same features, changing from joint to conditional estimation increases performance
- That is, we use the same smoothing and the same word-class features; we just change the numbers (parameters)

Training Set:
  Objective     Accuracy
  Joint Like.   86.8
  Cond. Like.   98.5

Test Set:
  Objective     Accuracy
  Joint Like.   73.6
  Cond. Like.   76.1

(Klein and Manning 2002, using Senseval-1 Data)

From Jurafsky & Martin Coursera & 3rd Edition

Page 9:

+ Generative vs. Discriminative Classifiers – Intuition
- Generative classifier, e.g., Naïve Bayes:
  - Assume some functional form for P(X|Y), P(Y)
  - Estimate parameters of P(X|Y), P(Y) directly from training data
  - Use Bayes rule to calculate P(Y|X=x)
  - This is a ‘generative’ model
    - Indirect computation of P(Y|X) through Bayes rule
    - But, can generate a sample of the data: P(X) = Σ_y P(y) P(X|y)
- Discriminative classifier, e.g., Logistic Regression:
  - Assume some functional form for P(Y|X)
  - Estimate parameters of P(Y|X) directly from training data
  - This is the ‘discriminative’ model
    - Directly learn P(Y|X)

Page 10:

+Discriminative Model Features

Making features from text for discriminative NLP models Christopher Manning

Page 11:

+ Features
- In these slides and most maxent work: features f are elementary pieces of evidence that link aspects of what we observe d with a category c that we want to predict
- A feature is a function with a bounded real value: f: C × D → ℝ

From Jurafsky & Martin Coursera & 3rd Edition

Page 12:

+ Example features
- f1(c, d) ≡ [c = LOCATION ∧ w-1 = “in” ∧ isCapitalized(w)]
- f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
- f3(c, d) ≡ [c = DRUG ∧ ends(w, “c”)]
- Models will assign to each feature a weight:
  - A positive weight votes that this configuration is likely correct
  - A negative weight votes that this configuration is likely incorrect

LOCATION in Québec
PERSON saw Sue
DRUG taking Zantac
LOCATION in Arcadia
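The three example features above can be read directly as indicator functions. A minimal sketch in Python; the datum representation (previous word, current word) and the helper names (is_capitalized, has_accented_latin_char) are my own illustration, not from the slides.

```python
import unicodedata

def is_capitalized(w):
    return w[:1].isupper()

def has_accented_latin_char(w):
    # a char counts as "accented" here if it decomposes into a base letter plus combining marks
    return any(len(unicodedata.normalize("NFD", ch)) > 1 for ch in w)

# a datum d is just (previous word, current word) for this illustration
def f1(c, d): return 1 if c == "LOCATION" and d[0] == "in" and is_capitalized(d[1]) else 0
def f2(c, d): return 1 if c == "LOCATION" and has_accented_latin_char(d[1]) else 0
def f3(c, d): return 1 if c == "DRUG" and d[1].endswith("c") else 0

d = ("in", "Québec")
print([f(c, d) for c in ("LOCATION", "DRUG", "PERSON") for f in (f1, f2, f3)])
# LOCATION fires f1 and f2, DRUG fires f3, PERSON fires nothing: [1, 1, 0, 0, 0, 1, 0, 0, 0]
```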

From Jurafsky & Martin Coursera & 3rd Edition

Page 13:

+ Feature Expectations
- We will crucially make use of two expectations: actual or predicted counts of a feature firing
  - Empirical count (expectation) of a feature:

        E_empirical(fi) = Σ_{(c,d) ∈ observed(C,D)} fi(c,d)

  - Model expectation of a feature:

        E(fi) = Σ_{(c,d) ∈ (C,D)} P(c,d) fi(c,d)
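A minimal sketch of the two counts, assuming a small observed dataset and a model exposed as conditional probabilities P(c|d), which is how the later “predicted count” slides use the model expectation. All names (data, classes, cond_prob) are illustrative, not from the slides.

```python
def empirical_expectation(f, data):
    # E_empirical(f) = Σ over observed (c, d) of f(c, d)
    return sum(f(c, d) for c, d in data)

def model_expectation(f, data, classes, cond_prob):
    # predicted count: Σ over observed d, and over every class c', of P(c'|d) f(c', d)
    return sum(cond_prob(c2, d) * f(c2, d) for _, d in data for c2 in classes)

# tiny demo with a uniform model and one indicator feature
data = [("LOCATION", "in Québec"), ("DRUG", "taking Zantac")]
classes = ["LOCATION", "DRUG", "PERSON"]
f = lambda c, d: 1.0 if c == "LOCATION" and d.startswith("in ") else 0.0
uniform = lambda c, d: 1.0 / len(classes)
print(empirical_expectation(f, data))                 # 1.0
print(model_expectation(f, data, classes, uniform))   # 0.333...
```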

From Jurafsky & Martin Coursera & 3rd Edition

Page 14:

+ Features
- In NLP uses, usually a feature specifies (1) an indicator function – a yes/no boolean matching function – of properties of the input and (2) a particular class
- fi(c, d) ≡ [Φ(d) ∧ c = cj]   [Value is 0 or 1]
- They pick out a data subset and suggest a label for it.
- We will say that Φ(d) is a feature of the data d when, for each cj, the conjunction Φ(d) ∧ c = cj is a feature of the data-class pair (c, d)

From Jurafsky & Martin Coursera & 3rd Edition

Page 15:

+ Features

- In NLP uses, usually a feature specifies
  1. an indicator function – a yes/no boolean matching function – of properties of the input and
  2. a particular class

      fi(c, d) ≡ [Φ(d) ∧ c = cj]   [Value is 0 or 1]

- Each feature picks out a data subset and suggests a label for it

From Jurafsky & Martin Coursera & 3rd Edition

Page 16:

+ Feature-Based Models
- The decision about a data point is based only on the features active at that point.

Text Categorization
  Data: BUSINESS: Stocks hit a yearly low …
  Features: {…, stocks, hit, a, yearly, low, …}
  Label: BUSINESS

Word-Sense Disambiguation
  Data: … to restructure bank:MONEY debt.
  Features: {…, w-1=restructure, w+1=debt, L=12, …}
  Label: MONEY

POS Tagging
  Data: DT JJ NN … The previous fall …
  Features: {w=fall, t-1=JJ, w-1=previous}
  Label: NN

From Jurafsky & Martin Coursera & 3rd Edition

Page 17:

+ Example: Text Categorization (Zhang and Oles 2001)
- Features are presence of each word in a document and the document class (they do feature selection to use reliable indicator words)
- Tests on classic Reuters data set (and others):
  - Naïve Bayes: 77.0% F1
  - Linear regression: 86.0%
  - Logistic regression: 86.4%
  - Support vector machine: 86.5%
- Paper emphasizes the importance of regularization (smoothing) for successful use of discriminative methods (not used in much early NLP/IR work)

From Jurafsky & Martin Coursera & 3rd Edition

Page 18:

+ Other Maxent Classifier Examples
- You can use a maxent classifier whenever you want to assign data points to one of a number of classes:
  - Sentence boundary detection (Mikheev 2000)
    - Is a period end of sentence or abbreviation?
  - Sentiment analysis (Pang and Lee 2002)
    - Word unigrams, bigrams, POS counts, …
  - PP attachment (Ratnaparkhi 1998)
    - Attach to verb or noun? Features of head noun, preposition, etc.
  - Parsing decisions in general (Ratnaparkhi 1997; Johnson et al. 1999, etc.)

From Jurafsky & Martin Coursera & 3rd Edition

Page 19:

+Feature-based Linear Classifiers

How to put features into a classifier

Page 20:

+ Feature-Based Linear Classifiers
- Linear classifiers at classification time:
  - Linear function from feature sets {fi} to classes {c}.
  - Assign a weight λi to each feature fi.
  - We consider each class for an observed datum d
  - For a pair (c,d), features vote with their weights: vote(c) = Σi λi fi(c,d)
  - Choose the class c which maximizes Σi λi fi(c,d)

LOCATION in Québec
DRUG in Québec
PERSON in Québec

From Jurafsky & Martin Coursera & 3rd Edition

Page 21:

+ Example features
- For a pair (c,d), features vote with their weights:
  - 1.8   f1(c, d) ≡ [c = LOCATION ∧ w-1 = “in” ∧ isCapitalized(w)]
  - -0.6  f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
  - 0.3   f3(c, d) ≡ [c = DRUG ∧ ends(w, “c”)]
- vote(c) = Σi λi fi(c,d)

      vote(PERSON in Québec)   = 0.0
      vote(LOCATION in Québec) = 1.8 + -0.6 = 1.2
      vote(DRUG in Québec)     = 0.3

- Choose the class c which maximizes Σi λi fi(c,d): LOCATION

From Jurafsky & Martin Coursera & 3rd Edition
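A small sketch of the voting computation above: weights λi times indicator features fi(c,d), summed per class, then argmax. The feature representation and the crude accent test are my own illustration, not from the slides.

```python
WEIGHTS = {"f1": 1.8, "f2": -0.6, "f3": 0.3}

def active_features(c, d):
    prev_w, w = d
    active = []
    if c == "LOCATION" and prev_w == "in" and w[:1].isupper():
        active.append("f1")
    if c == "LOCATION" and w != w.encode("ascii", "ignore").decode():  # crude accented-char test
        active.append("f2")
    if c == "DRUG" and w.endswith("c"):
        active.append("f3")
    return active

def vote(c, d):
    return sum(WEIGHTS[f] for f in active_features(c, d))

d = ("in", "Québec")
for c in ("PERSON", "LOCATION", "DRUG"):
    print(c, vote(c, d))           # PERSON 0, LOCATION ≈ 1.2, DRUG 0.3
print(max(("PERSON", "LOCATION", "DRUG"), key=lambda c: vote(c, d)))   # LOCATION
```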

Page 22:

Feature-Based Linear Classifiers

There are many ways to choose weights for features:
- Perceptron: find a currently misclassified example, and nudge weights in the direction of its correct classification
- Margin-based methods (Support Vector Machines)

From Jurafsky & Martin Coursera & 3rd Edition

Page 23:

+ Feature-Based Linear Classifiers
- Exponential (log-linear, maxent, logistic, Gibbs) models:
  - Make a probabilistic model from the linear combination Σi λi fi(c,d)

        P(LOCATION | in Québec) = e^1.8 e^–0.6 / (e^1.8 e^–0.6 + e^0.3 + e^0) = 0.586
        P(DRUG | in Québec)     = e^0.3 / (e^1.8 e^–0.6 + e^0.3 + e^0) = 0.238
        P(PERSON | in Québec)   = e^0 / (e^1.8 e^–0.6 + e^0.3 + e^0) = 0.176

  - The weights are the parameters of the probability model, combined via a “soft max” function:

        P(c | d, λ) = exp Σi λi fi(c,d)  /  Σc′ exp Σi λi fi(c′,d)

    The exp makes the votes positive; the sum over c′ in the denominator normalizes the votes into a probability distribution.
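A minimal sketch reproducing the three probabilities above; the class votes are hard-coded from the earlier “in Québec” example.

```python
from math import exp

# votes Σi λi fi(c, d) for d = "in Québec", taken from the earlier example
votes = {"LOCATION": 1.8 - 0.6, "DRUG": 0.3, "PERSON": 0.0}

Z = sum(exp(v) for v in votes.values())               # normalizer Σc' exp(vote(c'))
probs = {c: exp(v) / Z for c, v in votes.items()}     # "soft max" over classes

for c, p in probs.items():
    print(f"P({c} | in Québec) = {p:.3f}")
# P(LOCATION | in Québec) = 0.586, P(DRUG | ...) = 0.238, P(PERSON | ...) = 0.176
```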

From Jurafsky & Martin Coursera & 3rd Edition

Page 24:

Feature-Based Linear Classifiers
- Exponential (log-linear, maxent, logistic, Gibbs) models:
  - Given this model form, we will choose parameters {λi} that maximize the conditional likelihood of the data according to this model.
  - We construct not only classifications, but probability distributions over classifications.
    - There are other (good!) ways of discriminating classes – SVMs, boosting, even perceptrons – but these methods are not as trivial to interpret as distributions over classes.

From Jurafsky & Martin Coursera & 3rd Edition

Page 25:

+ Aside: logistic regression
- Maxent models in NLP are essentially the same as multiclass logistic regression models in statistics (or machine learning)
  - If you haven’t seen these before, don’t worry, this presentation is self-contained!
  - If you have seen these before you might think about:
    - The parameterization is slightly different in a way that is advantageous for NLP-style models with tons of sparse features (but statistically inelegant)
    - The key role of feature functions in NLP and in this presentation
    - The features are more general, with f also being a function of the class – when might this be useful?

From Jurafsky & Martin Coursera & 3rd Edition

Page 26:

+ Quiz Question
- Assuming exactly the same setup (3-class decision: LOCATION, PERSON, or DRUG; 3 features as before, maxent), what are:
  - P(PERSON | by Goéric) =
  - P(LOCATION | by Goéric) =
  - P(DRUG | by Goéric) =

- 1.8   f1(c, d) ≡ [c = LOCATION ∧ w-1 = “in” ∧ isCapitalized(w)]
- -0.6  f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
- 0.3   f3(c, d) ≡ [c = DRUG ∧ ends(w, “c”)]

      P(c | d, λ) = exp Σi λi fi(c,d)  /  Σc′ exp Σi λi fi(c′,d)

PERSON by Goéric
LOCATION by Goéric
DRUG by Goéric

From Jurafsky & Martin Coursera & 3rd Edition

Page 27:

+Building a Maxent Model

The nuts and bolts

Page 28:

+ Building a Maxent Model
- We define features (indicator functions) over data points
  - Features represent sets of data points which are distinctive enough to deserve model parameters.
  - Words, but also “word contains number”, “word ends with ing”, etc.
- We will simply encode each Φ feature as a unique String
  - A datum will give rise to a set of Strings: the active Φ features, e.g. “w=computer”, “num=decimal”, “end=ing”
  - Each feature fi(c, d) ≡ [Φ(d) ∧ c = cj] gets a real-number weight
- We concentrate on Φ features, but the math uses i indices of fi

From Jurafsky & Martin Coursera & 3rd Edition
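A minimal sketch of this string-encoded scheme: a datum yields a set of active Φ strings, and each (Φ string, class) pair gets its own weight. The extraction rules, classes, and weight values are illustrative, not from the slides.

```python
from math import exp

def active_phi(word):
    phis = {f"w={word.lower()}"}
    if any(ch.isdigit() for ch in word):
        phis.add("contains-number")
    if word.endswith("ing"):
        phis.add("end=ing")
    return phis

weights = {("end=ing", "VERB"): 1.5, ("w=computer", "NOUN"): 2.0}   # λ for each fi

def score(c, word):
    # Σi λi fi(c, d): only features whose (Φ, class) pair has a weight contribute
    return sum(weights.get((phi, c), 0.0) for phi in active_phi(word))

def predict(word, classes=("NOUN", "VERB")):
    z = sum(exp(score(c, word)) for c in classes)
    return {c: exp(score(c, word)) / z for c in classes}

print(predict("running"))    # "end=ing" pushes toward VERB
print(predict("computer"))   # "w=computer" pushes toward NOUN
```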

Page 29:

+ Building a Maxent Model
- Features are often added during model development to target errors
  - Often, the easiest things to think of are features that mark bad combinations
- Then, for any given feature weights, we want to be able to calculate:
  - Data conditional likelihood
  - Derivative of the likelihood wrt each feature weight
    - Uses expectations of each feature according to the model
- We can then find the optimum feature weights (discussed later).

From Jurafsky & Martin Coursera & 3rd Edition

Page 30:

+Naive Bayes vs. Maxent models

Generative vs. Discriminative models: The problem of overcounting evidence Christopher Manning

Page 31:

+ Text classification: Asia or Europe

[NB model diagram: Class (Europe or Asia) → X1 = M]

Training Data (documents): “Monaco Monaco”, “Monaco Monaco Hong Kong”, “Hong Kong Monaco”, “Monaco Hong Kong”, “Hong Kong”, “Monaco Monaco”

NB FACTORS:
- P(A) = P(E) = ½
- P(M|A) = ¼
- P(M|E) = ¾

NB Model PREDICTIONS:
- P(A,M) = ½ × ¼ = ⅛
- P(E,M) = ½ × ¾ = ⅜
- P(A|M) = ¼
- P(E|M) = ¾

Page 32:

+ Text classification: Asia or Europe

[NB model diagram: Class (Europe or Asia) → X1 = H, X2 = K]

Training Data (documents): “Monaco Monaco”, “Monaco Monaco Hong Kong”, “Hong Kong Monaco”, “Monaco Hong Kong”, “Hong Kong”, “Monaco Monaco”

NB FACTORS:
- P(A) = P(E) = ½
- P(H|A) = P(K|A) = ⅜
- P(H|E) = P(K|E) = ⅛

NB Model PREDICTIONS:
- P(A,H,K) = ½ × ⅜ × ⅜
- P(E,H,K) = ½ × ⅛ × ⅛
- P(A|H,K) = 9/10
- P(E|H,K) = 1/10

From Jurafsky & Martin Coursera & 3rd Edition

Page 33:

+ Text classification: Asia or Europe

[NB model diagram: Class (Europe or Asia) → H, K, M]

Training Data (documents): “Monaco Monaco”, “Monaco Monaco Hong Kong”, “Hong Kong Monaco”, “Monaco Hong Kong”, “Hong Kong”, “Monaco Monaco”

NB FACTORS:
- P(A) = P(E) = ½
- P(M|A) = ¼
- P(M|E) = ¾
- P(H|A) = P(K|A) = ⅜
- P(H|E) = P(K|E) = ⅛

NB Model PREDICTIONS:
- P(A,H,K,M) = ½ × ⅜ × ⅜ × ¼
- P(E,H,K,M) = ½ × ⅛ × ⅛ × ¾
- P(A|H,K,M) = ¾
- P(E|H,K,M) = ¼

From Jurafsky & Martin Coursera & 3rd Edition
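A quick sketch that reproduces the numbers above with exact fractions and makes the overcounting visible: adding “Monaco” only moves the Asia posterior from 9/10 to 3/4, because “Hong” and “Kong” are multiplied in as two separate pieces of evidence. The dictionary layout is my own illustration.

```python
from fractions import Fraction as F
from math import prod

# NB factors from the slides: classes A(sia), E(urope); words M(onaco), H(ong), K(ong)
prior = {"A": F(1, 2), "E": F(1, 2)}
cond = {"A": {"M": F(1, 4), "H": F(3, 8), "K": F(3, 8)},
        "E": {"M": F(3, 4), "H": F(1, 8), "K": F(1, 8)}}

def nb_posterior(words):
    # Naive Bayes multiplies every word in independently, so "Hong" and "Kong"
    # count as two pieces of evidence for Asia even though they name one place.
    joint = {c: prior[c] * prod(cond[c][w] for w in words) for c in prior}
    z = sum(joint.values())
    return {c: joint[c] / z for c in joint}

print(nb_posterior(["H", "K"]))        # A: 9/10, E: 1/10
print(nb_posterior(["H", "K", "M"]))   # A: 3/4,  E: 1/4  (Monaco barely moves it)
```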

Page 34:

+ Naive Bayes vs. Maxent Models
- Naive Bayes models multi-count correlated evidence
  - Each feature is multiplied in, even when you have multiple features telling you the same thing
- Maximum Entropy models (pretty much) solve this problem
  - As we will see, this is done by weighting features so that model expectations match the observed (empirical) expectations

From Jurafsky & Martin Coursera & 3rd Edition

Page 35:

+Maxent Models and Discriminative Estimation

Maximizing the likelihood

Page 36:

+ Exponential Model Likelihood
- Maximum (Conditional) Likelihood Models:
  - Given a model form, choose values of parameters to maximize the (conditional) likelihood of the data.

        log P(C | D, λ) = Σ_{(c,d)∈(C,D)} log P(c | d, λ)
                        = Σ_{(c,d)∈(C,D)} log [ exp Σi λi fi(c,d)  /  Σc′ exp Σi λi fi(c′,d) ]

From Jurafsky & Martin Coursera & 3rd Edition

Page 37:

+ The Likelihood Value
- The (log) conditional likelihood of iid data (C,D) according to the maxent model is a function of the data and the parameters λ:

        log P(C | D, λ) = log Π_{(c,d)∈(C,D)} P(c | d, λ) = Σ_{(c,d)∈(C,D)} log P(c | d, λ)

- If there aren’t many values of c, it’s easy to calculate:

        log P(C | D, λ) = Σ_{(c,d)∈(C,D)} log [ exp Σi λi fi(c,d)  /  Σc′ exp Σi λi fi(c′,d) ]

From Jurafsky & Martin Coursera & 3rd Edition

Page 38:

The Likelihood Value
- We can separate this into two components:

        log P(C | D, λ) = Σ_{(c,d)∈(C,D)} log exp Σi λi fi(c,d)  −  Σ_{(c,d)∈(C,D)} log Σc′ exp Σi λi fi(c′,d)
        log P(C | D, λ) = N(λ) − M(λ)

- The derivative is the difference between the derivatives of each component

From Jurafsky & Martin Coursera & 3rd Edition

Page 39:

+ The Derivative I: Numerator

    ∂N(λ)/∂λi = ∂/∂λi  Σ_{(c,d)∈(C,D)} Σi λi fi(c,d)
              = Σ_{(c,d)∈(C,D)}  ∂(Σi λi fi(c,d))/∂λi
              = Σ_{(c,d)∈(C,D)}  fi(c,d)

Derivative of the numerator is the empirical count(fi, C)

From Jurafsky & Martin Coursera & 3rd Edition

Page 40:

+ The Derivative II: Denominator

    ∂M(λ)/∂λi = ∂/∂λi  Σ_{(c,d)∈(C,D)} log Σc′ exp Σi λi fi(c′,d)

              = Σ_{(c,d)∈(C,D)}  [ 1 / Σc″ exp Σi λi fi(c″,d) ]  ·  Σc′ ∂/∂λi exp Σi λi fi(c′,d)

              = Σ_{(c,d)∈(C,D)}  Σc′ [ exp Σi λi fi(c′,d) / Σc″ exp Σi λi fi(c″,d) ]  ·  ∂(Σi λi fi(c′,d))/∂λi

              = Σ_{(c,d)∈(C,D)}  Σc′ P(c′ | d, λ) fi(c′,d)

              = predicted count(fi, λ)

From Jurafsky & Martin Coursera & 3rd Edition

Page 41:

+ The Derivative III

    ∂ log P(C | D, λ) / ∂λi = actual count(fi, C) − predicted count(fi, λ)

- The optimum parameters are the ones for which each feature’s predicted expectation equals its empirical expectation. The optimum distribution is:
  - Always unique (but parameters may not be unique)
  - Always exists (if feature counts are from actual data).
- These models are also called maximum entropy models because we find the model having maximum entropy and satisfying the constraints:

        E_p(fj) = E_p̃(fj), ∀j

From Jurafsky & Martin Coursera & 3rd Edition
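A small sketch of this likelihood/gradient computation for a toy maxent model: the gradient for each weight is the actual feature count minus the model-predicted count. The data, class set, and feature templates are illustrative, not from the slides.

```python
import math

CLASSES = ["LOCATION", "DRUG", "PERSON"]

def feats(c, d):
    """Sparse feature vector f(c, d) as {feature index: value}, indicator features."""
    out = {}
    if c == "LOCATION" and d["prev"] == "in":
        out[0] = 1.0
    if c == "DRUG" and d["word"].endswith("c"):
        out[1] = 1.0
    return out

def prob(c, d, lam):
    scores = {c2: sum(lam[i] * v for i, v in feats(c2, d).items()) for c2 in CLASSES}
    z = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[c]) / z

def log_likelihood_and_gradient(data, lam):
    ll = 0.0
    grad = [0.0] * len(lam)
    for c, d in data:
        ll += math.log(prob(c, d, lam))
        for i, v in feats(c, d).items():        # plus actual counts
            grad[i] += v
        for c2 in CLASSES:                      # minus predicted counts
            p = prob(c2, d, lam)
            for i, v in feats(c2, d).items():
                grad[i] -= p * v
    return ll, grad

data = [("LOCATION", {"prev": "in", "word": "Québec"}),
        ("DRUG", {"prev": "taking", "word": "Zantac"})]
print(log_likelihood_and_gradient(data, [0.0, 0.0]))
```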

Page 42:

+ Finding the optimal parameters
- We want to choose parameters λ1, λ2, λ3, … that maximize the conditional log-likelihood of the training data

        CLogLik(C,D) = Σ_{i=1}^{n} log P(ci | di)

- To be able to do that, we’ve worked out how to calculate the function value and its partial derivatives (its gradient)

From Jurafsky & Martin Coursera & 3rd Edition

Page 43:

+ A likelihood surface

From Jurafsky & Martin Coursera & 3rd Edition

Page 44:

+ Finding the optimal parameters
- Use your favorite numerical optimization package…
  - Commonly (and in our code), you minimize the negative of CLogLik (see the sketch after this list)

  1. Gradient descent (GD); Stochastic gradient descent (SGD)
  2. Iterative proportional fitting methods: Generalized Iterative Scaling (GIS) and Improved Iterative Scaling (IIS)
  3. Conjugate gradient (CG), perhaps with preconditioning
  4. Quasi-Newton methods – limited-memory variable metric (LMVM) methods, in particular, L-BFGS
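A minimal sketch (not the slides’ code) of the standard recipe: hand the negative conditional log-likelihood and its gradient to an off-the-shelf L-BFGS optimizer, here SciPy’s. The tiny dataset, feature tensor layout, and the L2 penalty are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# F[n, c, :] is the feature vector f(c, d_n); gold[n] is the observed class of d_n
F = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
gold = np.array([0, 1])

def neg_cloglik_and_grad(lam, l2=0.1):
    scores = F @ lam                                     # Σi λi fi(c, d), shape (n, classes)
    log_z = np.logaddexp.reduce(scores, axis=1)          # log Σc' exp(score)
    ll = np.sum(scores[np.arange(len(gold)), gold] - log_z) - l2 * lam @ lam
    probs = np.exp(scores - log_z[:, None])              # P(c | d, λ)
    actual = F[np.arange(len(gold)), gold].sum(axis=0)   # empirical feature counts
    predicted = np.einsum("nc,ncf->f", probs, F)         # model-expected feature counts
    grad = actual - predicted - 2 * l2 * lam
    return -ll, -grad

result = minimize(neg_cloglik_and_grad, x0=np.zeros(2), jac=True, method="L-BFGS-B")
print(result.x)   # fitted weights λ
```

The L2 term plays the role of the regularization (smoothing) mentioned earlier for the Zhang and Oles results; without it, separable toy data would push the weights toward infinity.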

From Jurafsky & Martin Coursera & 3rd Edition

Page 45:

+Maxent Models and Discriminative Estimation

Maximizing the likelihood

