Page 1

Optimizing Local Probability Models for Statistical Parsing

Kristina Toutanova, Mark Mitchell, Christopher Manning

Computer Science Department, Stanford University

Page 2

Highlights

Choosing a local probability model P(expansion(n)|history(n)) for statistical parsing – a comparison of commonly used models

A new player – memory-based models and their relation to interpolated models

Joint likelihood, conditional likelihood and classification accuracy for models of this form

Page 3

Motivation

Many problems in natural language processing are disambiguation problems, for example word senses:

jaguar – a big cat, a car, the name of a Java package

line – a phone line, a queue, a line in mathematics, an airline, etc.

and part-of-speech tags (noun, verb, proper noun, etc.)

[Figure: the sentence "Joy makes progress every day ." with several candidate part-of-speech tags per word, including NN, NNP, VB, VBZ, NNS, and DT]

Page 4

Parsing as Classification

"I would like to meet with you again on Monday"

Input: a sentence. Classify it into one of its possible parses.

Page 5

Motivation – Classification Problems

There are two major differences from typical machine learning domains:

The number of classes can be very large or even infinite; the set of available classes varies from input to input (and depends on a grammar).

Data is usually very sparse and the number of possible features is large (e.g. words).

Page 6

Solutions

The possible parse trees are broken down into small pieces defining features; the features are now functions of the input and the class, not of the input only.

Discriminative or generative models are built using these features. We concentrate on generative models here; when a huge number of analyses are possible, they are the only practical ones.

Page 7

History-Based Generative Parsing Models

[Figure: parse tree for the example sentence "Tuesday Marks bought Brooks", with nodes TOP, S, NP, NP-C, VP and NNP ("Tuesday")]

The generative models learn a distribution P(S,T) on <sentence, parse tree> pairs:

select a single most likely parse based on:

$$P(S,T) = \prod_{n \in \text{nodes}(T)} P(\text{expansion}(n) \mid \text{history}(n))$$

$$T_{\text{best}} = \arg\max_{T:\ \text{yield}(T) = S} P(S,T)$$
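To make the decomposition concrete, here is a minimal Python sketch of scoring and selecting parses under such a model; `tree.all_nodes()`, the node attributes `expansion`/`history`, and `local_log_prob` are hypothetical interfaces standing in for a real parser's decomposition, not an actual API.

```python
def tree_log_prob(tree, local_log_prob):
    """Log P(S,T): the sum of local log-probabilities over all tree nodes.

    `local_log_prob(expansion, history)` returns log P(expansion | history);
    both it and `tree.all_nodes()` are illustrative assumptions.
    """
    return sum(local_log_prob(n.expansion, n.history) for n in tree.all_nodes())


def best_parse(candidate_trees, local_log_prob):
    """T_best = argmax over candidate trees (all with yield S) of P(S,T)."""
    return max(candidate_trees, key=lambda t: tree_log_prob(t, local_log_prob))
```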

Page 8

Factors in the Performance of Generative History-Based Models

The chosen decomposition of parse tree generation, including the representation of parse tree nodes and the independence assumptions

The model family chosen for representing local probability distributions: decision trees, Naïve Bayes, log-linear models

The optimization method for fitting major and smoothing parameters: maximum likelihood, maximum conditional likelihood, minimum error rate, etc.

Page 9

Previous Studies and This Work

The influence of the previous three factors has not been isolated in previous work: authors presented specific choices for all components and the importance of each was unclear.

We assume that the generative history-based model and the set of features (the representation of parse tree nodes) are fixed, and we carefully study the other two factors.

Page 10

Deleted Interpolation

Notation: $X = (x_1, x_2, \dots, x_n)$ is the conditioning context; for a subset $S = \{i_1, \dots, i_k\} \subseteq \{1, 2, \dots, n\}$, let $X_S = (x_{i_1}, \dots, x_{i_k})$.

Estimating the probability P(y|X) by interpolating relative frequency estimates for lower-order distributions

$$\tilde{P}(y \mid X) = \sum_{S_i \subseteq \{1, \dots, n\}} \lambda_{S_i}(X)\, \hat{P}(y \mid X_{S_i})$$

Most commonly used: linear feature subsets order

$$\tilde{P}(y \mid x_1, \dots, x_n) = \lambda_1(X)\, \hat{P}(y \mid x_1, \dots, x_n) + \lambda_2(X)\, \hat{P}(y \mid x_1, \dots, x_{n-1}) + \dots + \lambda_{n+1}(X)\, \hat{P}(y)$$

Jelinek-Mercer with fixed weight, Witten-Bell with varying d, decision trees with path interpolation, memory-based learning
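A minimal Python sketch of the linear-order case, assuming precomputed relative-frequency tables and externally fitted weights (both hypothetical interfaces):

```python
def interpolated_prob(y, context, rel_freq, lambdas):
    """Linear deleted interpolation over context prefixes.

    `rel_freq(y, prefix)` returns the relative-frequency estimate
    P-hat(y | prefix); `lambdas[i]` is the weight of the prefix that
    drops the last i features (the weights must sum to 1). Both
    interfaces are illustrative assumptions.
    """
    n = len(context)
    total = 0.0
    for i in range(n + 1):
        prefix = tuple(context[: n - i])  # () is the empty context: P-hat(y)
        total += lambdas[i] * rel_freq(y, prefix)
    return total
```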

Page 11

Memory-Based Learning as Deleted Interpolation

In k-NN, the probability of a class given features is estimated as:

If the distance function depends only on the positions of the matching features*, it is a case of deleted interpolation.

$$\tilde{P}(y \mid X) = \frac{\sum_{X' \in N_K(X)} w(\Delta(X, X'))\, \delta(y(X'), y)}{\sum_{X' \in N_K(X)} w(\Delta(X, X'))}$$

where $N_K(X)$ is the set of $K$ nearest neighbors of $X$, $\Delta$ is the distance, $w$ is a distance weighting function, and $\delta(y(X'), y)$ is 1 when neighbor $X'$ has class $y$ and 0 otherwise.
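A small Python sketch of this estimate; the `distance` and `weight` arguments stand in for $\Delta$ and $w$ and are assumptions for illustration:

```python
def knn_class_prob(x, y, train, k, distance, weight):
    """P-tilde(y | x) from the k nearest neighbors.

    `train` is a list of (features, label) pairs; `distance` and
    `weight` play the roles of Delta and w from the slide.
    """
    neighbors = sorted(train, key=lambda ex: distance(x, ex[0]))[:k]
    den = sum(weight(distance(x, fx)) for fx, fy in neighbors)
    num = sum(weight(distance(x, fx)) for fx, fy in neighbors if fy == y)
    return num / den if den > 0 else 0.0
```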

Page 12

Memory-Based Learning as Deleted Interpolation

Example: estimating P(eye-color = blue | hair-color = blond). We have N = 12 samples of people; the distance is d = 0 (hair color matches) or d = 1 (no match), with weights w(0) = w0 and w(1) = w1, and K = 12 (all samples are neighbors).

Deleted Interpolation where the interpolation weights depend on the counts and weights of nearest neighbors at all accepted distances

$$\tilde{P}(\text{blue} \mid \text{blond}) = \frac{w_0\, c(\text{blue}, \text{blond}) + w_1\, c(\text{blue}, \overline{\text{blond}})}{w_0\, c(\cdot, \text{blond}) + w_1\, c(\cdot, \overline{\text{blond}})}$$

Substituting $c(\text{blue}, \overline{\text{blond}}) = c(\text{blue}, \cdot) - c(\text{blue}, \text{blond})$ and $c(\cdot, \overline{\text{blond}}) = c(\cdot, \cdot) - c(\cdot, \text{blond})$ and regrouping gives a deleted interpolation:

$$\tilde{P}(\text{blue} \mid \text{blond}) = \lambda\, \hat{P}(\text{blue} \mid \text{blond}) + (1 - \lambda)\, \hat{P}(\text{blue}), \qquad \lambda = \frac{(w_0 - w_1)\, c(\cdot, \text{blond})}{(w_0 - w_1)\, c(\cdot, \text{blond}) + w_1\, c(\cdot, \cdot)}$$
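A tiny numeric check of this identity in Python, with made-up counts for the 12 samples (all values are illustrative, not data from the talk):

```python
# Illustrative counts: a = c(blue, blond), m = c(., blond),
# b = c(blue, .), N = c(., .); w0, w1 are the match/mismatch weights.
a, m, b, N = 3, 5, 6, 12
w0, w1 = 1.0, 0.25

# Direct MBL estimate: weighted counts over all K = N neighbors.
mbl = (w0 * a + w1 * (b - a)) / (w0 * m + w1 * (N - m))

# The same value written as deleted interpolation with weight lambda.
lam = (w0 - w1) * m / ((w0 - w1) * m + w1 * N)
interp = lam * (a / m) + (1 - lam) * (b / N)

assert abs(mbl - interp) < 1e-12
print(mbl, interp)  # both 0.5555...
```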

Page 13

The Task and Features Used

No | Name | Example
1 | Node label | HCOMP
2 | Parent node label | HCOMP
3 | Node direction | left
4 | Parent node direction | none
5 | Grandparent node label | IMPER
6 | Great-grandparent node label | TOP
7 | Left sister node label | none
8 | Category of node | verb

Corpus statistics: 5,312 sentences, average sentence length 7.0, average structural ambiguity 8.3; random parse selection gives 25.81% accuracy. Maximum ambiguity 507, minimum 2.

[Figure: derivation tree for "let us see", with nodes TOP, IMPER, HCOMP, HCOMP and lexical types LET_V1 (let), US (us), SEE_V3 (see)]

Page 14

Experiments

Linear feature subsets order: Jelinek-Mercer with fixed weight, Witten-Bell with varying d, linear memory-based learning

Arbitrary feature subsets order: decision trees, memory-based learning, log-linear models

Experiments on the connection among likelihoods and accuracy

Page 15

Experiments – Linear Sequence

The features {1,2,…,8} are ordered by gain ratio: {1,8,2,3,5,4,7,6}.

Jelinek-Mercer Fixed Weight:

$$\tilde{P}_{\text{JMFW}}(y \mid x_1, \dots, x_i) = \lambda\, \hat{P}(y \mid x_1, \dots, x_i) + (1 - \lambda)\, \tilde{P}_{\text{JMFW}}(y \mid x_1, \dots, x_{i-1})$$

Witten-Bell Varying d

$$\lambda(x_1, \dots, x_i) = \frac{c(x_1, \dots, x_i)}{c(x_1, \dots, x_i) + d \cdot |\{y : c(y, x_1, \dots, x_i) > 0\}|}$$

$$\tilde{P}_{\text{WBd}}(y \mid x_1, \dots, x_i) = \lambda(x_1, \dots, x_i)\, \hat{P}(y \mid x_1, \dots, x_i) + (1 - \lambda(x_1, \dots, x_i))\, \tilde{P}_{\text{WBd}}(y \mid x_1, \dots, x_{i-1})$$
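A sketch of the Witten-Bell recursion in Python; the count-table layout and the `base_prob` argument are assumptions made for illustration:

```python
def witten_bell_prob(y, context, counts, d, base_prob):
    """P-tilde_WBd(y | x_1..x_i), computed by recursion over context prefixes.

    `counts[prefix]` maps each outcome y to c(y, prefix); `base_prob(y)`
    is the distribution used once the context is empty. Both are
    illustrative assumptions about the data layout.
    """
    if not context:
        return base_prob(y)
    outcome_counts = counts.get(tuple(context), {})
    total = sum(outcome_counts.values())   # c(x_1, .., x_i)
    diversity = len(outcome_counts)        # |{y : c(y, x_1, .., x_i) > 0}|
    lam = total / (total + d * diversity) if total > 0 else 0.0
    rel_freq = outcome_counts.get(y, 0) / total if total > 0 else 0.0
    return lam * rel_freq + (1.0 - lam) * witten_bell_prob(
        y, context[:-1], counts, d, base_prob)
```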

Page 16

Experiments – Linear Sequence

Jelinek-Mercer Fixed Weight
[Plot: accuracy (73%–81%) vs. interpolation weight (0.0–0.9); heavy smoothing is needed for best results]

Witten-Bell Varying d
[Plot: accuracy (73%–81%) vs. d (0–30)]

Page 17

MBL Linear Subsets Sequence

We restrict MBL to be an instance of the same linear-subsets-sequence deleted interpolation, as follows:

Weighting functions INV3 and INV4 performed best:

$$\mathrm{sim}(X, X') = \max\{\, i : x_1 = x'_1, \dots, x_i = x'_i \,\} \qquad \Delta(X, X') = n - \mathrm{sim}(X, X')$$

$$w_{\mathrm{INV3}}(\delta) = \left(\frac{1}{1+\delta}\right)^{3}, \qquad w_{\mathrm{INV4}}(\delta) = \left(\frac{1}{1+\delta}\right)^{4}$$

Linear k-NN
[Plot: accuracy (73%–81%) vs. K (0–15,000), for Weight Inverse 3 and Weight Inverse 4]

LKNN3 is best at K = 3,000 with 79.94% accuracy; LKNN4 is best at K = 15,000 with 80.18%.

LKNN4 is the best of all the linear models.
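A minimal Python sketch of the prefix similarity, the distance (assuming the $\Delta = n - \mathrm{sim}$ reconstruction above), and the INV weighting functions:

```python
def prefix_similarity(x, x_prime):
    """sim(X, X'): length of the longest matching feature prefix."""
    sim = 0
    for a, b in zip(x, x_prime):
        if a != b:
            break
        sim += 1
    return sim


def prefix_distance(x, x_prime):
    """Delta(X, X') = n - sim(X, X') (assumed form of the distance)."""
    return len(x) - prefix_similarity(x, x_prime)


def inv_weight(delta, power=4):
    """INV3 / INV4 weighting: w(delta) = (1 / (1 + delta)) ** power."""
    return (1.0 / (1.0 + delta)) ** power
```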

Page 18

Experiments

Linear subsets feature order: Jelinek-Mercer with fixed weight, Witten-Bell with varying d, linear memory-based learning

Arbitrary subsets feature order: decision trees, memory-based learning, log-linear models

Experiments on the connection among likelihoods and accuracy

Page 19

Model Implementations – Decision Trees

(DecTreeWBd) n-ary decision trees: if we choose a feature f to split on, all of its values form subtrees.

Splitting criterion: gain ratio.

The final probability estimates at the leaves are Witten-Bell d interpolations of the estimates on the path to the root.

[Figure: a decision tree that splits on feature 1 (= HCOMP) and then on feature 2 (= NOPTCOMP)]

These are instances of deleted interpolation models! For example, the estimate at the leaf reached via HCOMP and NOPTCOMP is:

$$\tilde{P}_{\text{WBd}}(\text{expansion} \mid x_1 = \text{hcomp},\ x_2 = \text{noptcomp})$$
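A compact Python sketch of a leaf estimate computed by Witten-Bell d interpolation along the path back to the root; the `path_counts` representation is an illustrative assumption:

```python
def dectree_wbd_prob(y, path_counts, d, base_prob):
    """Leaf estimate as Witten-Bell d interpolation along the root path.

    `path_counts` is a list of outcome-count dicts, one per node from the
    root down to the leaf (an assumed representation). Starting from the
    base distribution, each deeper node's relative frequency is mixed in
    with weight lambda = c / (c + d * diversity), so the leaf dominates.
    """
    prob = base_prob(y)
    for outcome_counts in path_counts:  # root first, leaf last
        total = sum(outcome_counts.values())
        diversity = len(outcome_counts)
        lam = total / (total + d * diversity) if total > 0 else 0.0
        rel_freq = outcome_counts.get(y, 0) / total if total > 0 else 0.0
        prob = lam * rel_freq + (1.0 - lam) * prob
    return prob
```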

Page 20

Model Implementations – Log-linear Models

Binary features are formed by instantiating templates.

Three models with different allowable features:

LogLinSingle – single attributes only

LogLinPairs – pairs of attributes, but only pairs involving the most important feature (node label)

LogLinBackoff – linear feature subsets, comparable to the previous models

Gaussian smoothing was used; the models were trained by conjugate gradient (Stanford Classify Package).
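For concreteness, here is a generic log-linear (softmax) model with a Gaussian penalty in Python; this is a sketch of the model family only, not the Stanford Classify Package implementation:

```python
import numpy as np

def log_linear_probs(weights, feature_matrix):
    """P(y | x) proportional to exp(w . f(x, y)); each row of
    feature_matrix holds f(x, y) for one candidate class y."""
    scores = feature_matrix @ weights
    scores -= scores.max()            # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def penalized_neg_log_lik(weights, data, sigma2):
    """Negative conditional log-likelihood plus a Gaussian (L2) penalty.

    `data` is a list of (feature_matrix, gold_index) pairs; minimizing
    this with a gradient method such as conjugate gradient trains the model.
    """
    nll = -sum(np.log(log_linear_probs(weights, fm)[gold]) for fm, gold in data)
    return nll + (weights @ weights) / (2.0 * sigma2)
```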

Page 21

Model Implementations – Memory-Based Learning

Weighting functions INV3 and INV4

KNN4 is better than DecTreeWBd and the log-linear models

KNN4 has 5.8% error reduction from WBd (significant at the 0.01 level)

$$\Delta(X, X') = n - k, \quad \text{where } \{i_1, \dots, i_k\} \text{ is the set of matching feature indices of } X \text{ and } X'$$

Model | Accuracy
KNN4 | 80.79%
DecTreeWBd | 79.66%
LogLinSingle | 78.65%
LogLinPairs | 78.91%
LogLinBackoff | 77.52%

Page 22

Accuracy Curves for MBL and Decision Trees

Decision Tree Accuracy
[Plot: accuracy (74%–81%) vs. d (0–50)]

k-NN Accuracy
[Plot: accuracy (74%–81%) vs. K (0–15,000), for Inverse Weight 3 and Inverse Weight 4]

Page 23

Experiments

Linear subsets feature order: Jelinek-Mercer with fixed weight, Witten-Bell with varying d, linear memory-based learning

Arbitrary subsets feature order: decision trees, memory-based learning, log-linear models

Experiments on the connection among likelihoods and accuracy

Page 24

Joint Likelihood, Conditional Likelihood, and Classification Accuracy

Our aim is to maximize parsing accuracy, but:

Smoothing parameters are usually fit on held-out data to maximize joint likelihood; sometimes conditional likelihood is optimized instead.

We look at the relationship among the maxima of these three scoring functions as the amount of smoothing varies, finding that (see the sketch below):

Much heavier smoothing is needed to maximize accuracy than to maximize joint likelihood.

Conditional likelihood also keeps increasing with heavier smoothing, long after the joint likelihood has reached its maximum.
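A sketch of fitting a smoothing parameter directly for held-out accuracy, as these findings suggest; `build_model`, the grid values, and the usage names are hypothetical:

```python
def fit_smoothing_for_accuracy(grid, build_model, heldout):
    """Pick the smoothing value whose model classifies held-out data best.

    `build_model(s)` returns a predict(x) -> y function trained with
    smoothing s; `heldout` is a list of (x, gold_y) pairs. Both are
    illustrative interfaces, not the talk's actual setup.
    """
    def accuracy(predict):
        return sum(predict(x) == y for x, y in heldout) / len(heldout)

    return max(grid, key=lambda s: accuracy(build_model(s)))

# Hypothetical usage:
# best_d = fit_smoothing_for_accuracy([1, 5, 10, 20, 30],
#                                     train_wbd_model, dev_set)
```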

Page 25

Test Set Performance versus Amount of Smoothing - I

JMFW Data Log-likelihoods and Accuracy
[Plot: joint log-likelihood and conditional log-likelihood (−7.0 to 0.0, left axis) and accuracy (73%–81%, right axis) vs. interpolation weight (0.0–0.9)]

Page 26

Test Set Performance versus Amount of Smoothing

k-NN INV4 Data Log-likelihoods and Accuracy
[Plot: joint log-likelihood and conditional log-likelihood (−3.5 to 0.0, left axis) and accuracy (74%–81%, right axis) vs. K (0–15,000)]

Page 27

Test Set Performance versus Amount of Smoothing – PP Attachment

Jelinek-Mercer Fixed Weight
[Plot: joint log-likelihood and conditional log-likelihood (−1 to 0, left axis) and accuracy (74%–85%, right axis) vs. interpolation weight (0–1)]

Witten-Bell Varying d
[Plot: joint log-likelihood and conditional log-likelihood (−1 to 0, left axis) and accuracy (74%–85%, right axis) vs. d (0–30)]

Page 28

Summary

The problem of effectively estimating local probability distributions for compound decision models used for classification is under-explored

We showed that the chosen local distribution model matters

We showed the relationship between MBL and deleted interpolation models

MBL with large numbers of neighbors and appropriate weighting outperformed more expensive and popular algorithms – Decision Trees and Log-linear Models

Fitting a small number of smoothing parameters to maximize classification accuracy is promising for improving performance

Page 29

Future Work

Compare MBL to other state-of-the-art smoothing methods

Better ways of fitting MBL weight functions

Theoretical investigation of bias-variance tradeoffs for compound decision systems with strong independence assumptions

