Statistical Parsing: an Overview
based on a draft of chapter 14 of the 2nd edition of “Speech and Language Processing” by Jurafsky and Martin
Jennifer Foster
NCLT Seminar Series 2007-2008
23rd January 2007
Outline
Statistical Parsing: What and Why?
PCFGs
  Probabilistic CKY
  Obtaining Probabilistic Grammars
Limitations of PCFG Parsing
Generative, History-Based Lexicalised Parsers
  Collins Parser
  Charniak Parser
  Unlexicalised Parsing
Discriminative Parsing
  Discriminative Reranking
  Discriminative Dynamic Programming
Dependency Parsing
Conclusion
Statistical Parsing: What and Why?
What is natural language parsing?
The process of analysing the structure of natural language sentences.
What is statistical natural language parsing?
Parsing in which the most likely analysis is selected amongst all possible analyses for an input sentence.
Why are most natural language parsers statistical?
Because syntactic ambiguity makes non-probabilistic or symbolic parsing intractable!
Benefits of Statistical Parsing
Jurafsky and Martin identify the following two benefits of probabilistic parsing:
1. Syntactic Disambiguation (main motivation)
2. Language Modelling
Probabilistic Context-Free Grammar Defined
A PCFG 〈N, Σ, R, S〉 is defined as follows:
1. N is the set of non-terminal symbols
2. Σ is the set of terminal symbols (disjoint from N)
3. R is a set of rules of the form A → β [p], where A ∈ N, β ∈ (Σ ∪ N)∗, and p is a number between 0 and 1
4. S ∈ N is the start symbol
A PCFG is a CFG in which each rule is associated with a probability.
More about PCFGs
What does the p associated with each rule express?
It expresses the probability that the LHS non-terminal will be expanded as the RHS sequence:
• p = P(A → β | A)
• ∑_β P(A → β | A) = 1
• That is, the sum of the probabilities associated with all of the rules expanding the non-terminal A is 1 (checked concretely in the sketch below).
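To make the normalisation constraint concrete, here is a minimal sketch of a PCFG as a Python dictionary, with a check that each non-terminal's rule probabilities sum to 1. The grammar itself is a toy invented for illustration, not from the chapter.

# Toy PCFG: each non-terminal maps to a list of
# (right-hand side, probability) pairs.
PCFG = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("DT", "NN"), 0.6), (("PRP",), 0.4)],
    "VP": [(("VBD", "NP"), 0.7), (("VBD",), 0.3)],
}

def check_normalised(grammar, tol=1e-9):
    """Check that, for every non-terminal A,
    the sum over beta of P(A -> beta | A) equals 1."""
    for lhs, rules in grammar.items():
        total = sum(p for _, p in rules)
        assert abs(total - 1.0) < tol, f"{lhs} rules sum to {total}, not 1"

check_normalised(PCFG)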
PCFGs and Disambiguation
• A PCFG assigns a probability to every parse tree or derivation associated with a sentence.
• This probability is the product of the probabilities of the rules applied in building the parse tree:
  P(T, S) = ∏_{i=1}^{n} P(A_i → β_i), where n is the number of rules in T.
• P(T, S) = P(T) P(S|T) = P(S) P(T|S), by definition.
• But P(S|T) = 1, because all the words in S are in T.
• So P(T, S) = P(T).
• A parse disambiguation algorithm picks out the most probable parse tree for a sentence (sketched below):
  T(S) = argmax P(T|S) s.t. S = yield(T)
• P(T|S) = P(T, S) / P(S)
• P(S) is constant over all the trees for S, so:
  T(S) = argmax P(T, S) s.t. S = yield(T)
  T(S) = argmax P(T) s.t. S = yield(T)
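As a hedged sketch of this computation, using the toy PCFG above and encoding trees as nested tuples of the form (label, child, ...): rule_prob and the tree encoding are conventions of this sketch, not the chapter's notation.

import math

def rule_prob(grammar, lhs, rhs):
    # Look up P(lhs -> rhs | lhs) in the dictionary grammar sketched above.
    return dict(grammar[lhs])[rhs]

def tree_log_prob(grammar, tree):
    """log P(T): sum of log rule probabilities over the internal nodes.
    Log space avoids numerical underflow on long derivations."""
    label, children = tree[0], tree[1:]
    if all(isinstance(c, str) for c in children):
        return 0.0          # preterminal; lexical rule probs omitted here
    rhs = tuple(c[0] for c in children)
    return math.log(rule_prob(grammar, label, rhs)) + sum(
        tree_log_prob(grammar, c) for c in children)

def best_tree(grammar, candidate_trees):
    """T(S) = argmax over the candidate parses of P(T)."""
    return max(candidate_trees, key=lambda t: tree_log_prob(grammar, t))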
PCFGs and Language Modelling
As well as assigning probabilities to parse trees, a PCFG assigns a probability to every sentence generated by the grammar. This is useful for language modelling.
• The probability of a sentence is the sum of the probabilities of each parse tree associated with the sentence:
  P(S) = ∑_{T s.t. yield(T)=S} P(T, S)
  P(S) = ∑_{T s.t. yield(T)=S} P(T)
When is it useful to know the probability of a sentence?
When ranking the output of speech recognition, machine translation and error correction systems.
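A corresponding sketch of P(S), reusing tree_log_prob from above. candidate_trees is assumed to come from an exhaustive chart parser; in practice the inside algorithm computes this sum in polynomial time without enumerating the trees.

import math

def sentence_prob(grammar, candidate_trees):
    # P(S) = sum over all trees T with yield(T) = S of P(T).
    return sum(math.exp(tree_log_prob(grammar, t)) for t in candidate_trees)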
Probabilistic CKY
Probabilistic versions of most parsing algorithms exist.
Many probabilistic parsers use a probabilistic version of the CKY bottom-up chart parsing algorithm (see Chap. 13).
Given a sentence s of length n and a grammar with V non-terminal symbols:
• Normal CKY: a 2-d (n + 1) × (n + 1) array, where the value in cell (i, j) is a list of the non-terminals spanning positions i through j in s
• Probabilistic CKY: a 3-d (n + 1) × (n + 1) × V array, where the value in cell (i, j, K) is the probability of the non-terminal K spanning positions i through j in s
As with regular CKY, probabilistic CKY assumes that the grammar is in Chomsky normal form (see the sketch below). Note that the rule probabilities also need to be adjusted when transforming a non-CNF PCFG into a CNF PCFG.
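A compact sketch of the probabilistic CKY recursion for a CNF grammar. The index structures assumed here (binary[(B, C)] = [(A, p), ...] for binary rules, lexical[word] = [(A, p), ...] for lexical rules) are choices of this sketch, not the book's pseudocode.

from collections import defaultdict

def prob_cky(words, lexical, binary):
    n = len(words)
    # table[i][j][A] = probability of the best A spanning positions i..j
    table = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    back = [[dict() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                    # fill length-1 spans
        for A, p in lexical.get(w, []):
            table[i][i + 1][A] = p
    for length in range(2, n + 1):                   # then longer spans
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):                # every split point
                for B, pb in table[i][k].items():
                    for C, pc in table[k][j].items():
                        for A, p in binary.get((B, C), []):
                            cand = p * pb * pc       # rule prob * children
                            if cand > table[i][j][A]:
                                table[i][j][A] = cand
                                back[i][j][A] = (k, B, C)
    return table, back   # best parse probability: table[0][n]["S"]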
Obtaining Probabilistic Grammars
In a PCFG every rule is associated with a probability. But where do these rule probabilities come from?
There are two ways to obtain rule probabilities for a PCFG:
1. Use a treebank (sketched below):
   P(A → β | A) = count(A → β) / count(A)
2. Use a corpus of sentences and the Inside-Outside algorithm (Baker, 1979):
   2.1 Take a CFG and set all rules to have equal probability
   2.2 Parse the corpus with the CFG
   2.3 Adjust the probabilities
   2.4 Repeat steps 2.2 and 2.3 until the probabilities converge
The Inside-Outside algorithm is a type of Expectation Maximisation algorithm. It can also be used to induce a grammar, but only with limited success.
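The treebank route is simple enough to sketch directly: count rule and left-hand-side occurrences over a collection of trees (using the nested-tuple tree encoding assumed earlier) and divide.

from collections import Counter

def count_rules(tree, rule_counts, lhs_counts):
    label, children = tree[0], tree[1:]
    if all(isinstance(c, str) for c in children):
        rhs = children                       # lexical rule A -> w
    else:
        rhs = tuple(c[0] for c in children)
        for c in children:
            count_rules(c, rule_counts, lhs_counts)
    rule_counts[(label, rhs)] += 1
    lhs_counts[label] += 1

def estimate_pcfg(treebank):
    """P(A -> beta | A) = count(A -> beta) / count(A)."""
    rule_counts, lhs_counts = Counter(), Counter()
    for tree in treebank:
        count_rules(tree, rule_counts, lhs_counts)
    return {(A, rhs): c / lhs_counts[A] for (A, rhs), c in rule_counts.items()}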
Limitations of PCFG Parsing
Two well-known drawbacks of PCFG parsing are:
1. the independence of the rules in a PCFG
2. their failure to fully exploit lexical knowledge in resolving ambiguities
PCFG Rule Independence
In PCFG parsing, the application of a rule in a derivation is an independent event.
This means that previous rule applications in the same derivation have no influence over the current rule application.
• This rule independence can be a disadvantage.
• Consider, for example, that an English subject NP is extremely likely to be realised as a pronoun, whereas an English object NP is more likely to be realised as a non-pronominal noun phrase.
• This cannot be reflected in a PCFG, which does not distinguish object NPs from subject NPs (illustrated below).
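A small numerical illustration of the point. The counts are invented, loosely echoing the subject/object pronoun asymmetry the chapter cites: a single shared P(NP → PRP) averages away a difference that the grammar would need two distinct NP symbols to express.

from collections import Counter

# Invented counts for illustration only.
subj = Counter({("NP", "PRP"): 91, ("NP", "DT NN"): 9})    # subject NPs
obj  = Counter({("NP", "PRP"): 34, ("NP", "DT NN"): 66})   # object NPs

merged = subj + obj
print(subj[("NP", "PRP")] / sum(subj.values()))      # 0.91 in subject position
print(obj[("NP", "PRP")] / sum(obj.values()))        # 0.34 in object position
print(merged[("NP", "PRP")] / sum(merged.values()))  # one blended PCFG estimate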
Using Lexical Knowledge to Resolve Ambiguity
• Consider, for example, the sentences
  Workers dumped sacks into a bin
  Fishermen caught tons of herring
• The problem of where to attach a PP is a common form of syntactic ambiguity.
• A main motivation for probabilistic parsing is to perform accurate disambiguation.
• People resolve this kind of ambiguity by looking at the actual nouns, verbs and prepositions involved (see the sketch below).
• Faced with a choice between two attachment sites, VP and NP, for a constituent PP, a PCFG can only compare the rule probabilities of
  1. VP → VP PP versus
  2. NP → NP PP
  These rules contain no lexical information.
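As a hedged sketch of how lexical information could help: compare verb-preposition against noun-preposition association counts. The counts and the decision rule here are invented for illustration; real lexicalised models instead condition rule probabilities on head words, as described later.

# Invented co-occurrence counts for illustration.
attach_counts = {
    ("dumped", "into"): 25, ("sacks", "into"): 2,
    ("caught", "of"): 1,    ("tons", "of"): 40,
}

def prefer_vp_attachment(verb, noun, prep):
    """True if the (verb, prep) association beats the (noun, prep) one."""
    return attach_counts.get((verb, prep), 0) > attach_counts.get((noun, prep), 0)

print(prefer_vp_attachment("dumped", "sacks", "into"))   # True: attach to VP
print(prefer_vp_attachment("caught", "tons", "of"))      # False: attach to NP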
Generative, History-Based, Lexicalised Parsers
Generative, history-based, lexicalised parsers overcome the limits of PCFGs by employing:
1. lexicalisation
2. more complicated probabilistic models
Two examples are:
1. the Collins parser
2. the Charniak parser
Lexicalised PCFGs
A PCFG can be lexicalised by associating a word and a part-of-speech tag with every non-terminal in the grammar.
It is head-lexicalised if the word is the head of the constituent described by the non-terminal.
Lexicalised PCFGs
A non-lexicalised parse tree for the sentence Last week IBM bought Lotus:

(S (NP (JJ Last) (NN week))
   (NP (NNP IBM))
   (VP (VBD bought)
       (NP (NNP Lotus))))
Lexicalised PCFGs
A lexicalised parse tree for the sentence Last week IBM bought Lotus:

(S(bought,VBD)
   (NP(week,NN) (JJ(Last,JJ) Last) (NN(week,NN) week))
   (NP(IBM,NNP) (NNP(IBM,NNP) IBM))
   (VP(bought,VBD) (VBD(bought,VBD) bought)
      (NP(Lotus,NNP) (NNP(Lotus,NNP) Lotus))))
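This annotation can be produced mechanically. Below is a minimal Python sketch, assuming a toy set of head-finding rules; real parsers use head-percolation tables in the style of Magerman and Collins.

```python
# A minimal sketch of head-lexicalisation with toy head-finding rules.
HEAD_RULES = {"S": ["VP"], "VP": ["VBD", "VB"], "NP": ["NN", "NNP"]}

def lexicalise(tree):
    """tree is (tag, word) at pre-terminals and (label, [children]) above.
    Returns (annotated_tree, (head_word, head_tag))."""
    label, rest = tree
    if isinstance(rest, str):                     # pre-terminal: its own head
        return (f"{label}({rest},{label})", rest), (rest, label)
    results = [lexicalise(child) for child in rest]
    kids = [r[0] for r in results]
    heads = [r[1] for r in results]
    wanted = HEAD_RULES.get(label, [])
    # take the head of the first child whose label is a licensed head;
    # fall back to the leftmost child otherwise
    hw, ht = next((h for (l, _), h in zip(rest, heads) if l in wanted),
                  heads[0])
    return (f"{label}({hw},{ht})", kids), (hw, ht)

tree = ("S", [("NP", [("JJ", "Last"), ("NN", "week")]),
              ("NP", [("NNP", "IBM")]),
              ("VP", [("VBD", "bought"), ("NP", [("NNP", "Lotus")])])])
print(lexicalise(tree)[0][0])   # S(bought,VBD)
```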
Lexicalised PCFGs
Lexicalisation drastically increases the size of the grammar, and this leads to a sparse data problem.
I How do we estimate the probability of the rule S(bought,VBD) → NP(week,NN) NP(IBM,NNP) VP(bought,VBD)?
I Maximum-likelihood estimation won't work (see the sketch after this list):
$\frac{count(S(bought,VBD) \rightarrow NP(week,NN)\ NP(IBM,NNP)\ VP(bought,VBD))}{count(S(bought,VBD))}$
because the full lexicalised rule will occur too rarely, if at all, in the training treebank.
I The sparse data problem can be tackled by breaking the probability estimation for the right-hand side of the rule into three parts:
1. predict the probability of the head constituent, in this case VP(bought,VBD)
2. predict the probability of the constituents on the left of the head, in this case NP(IBM,NNP) and NP(week,NN)
3. predict the probability of the constituents on the right of the head, in this case none
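The failure mode of naive MLE is easy to see in code. A minimal sketch with hypothetical counts:

```python
# A minimal sketch with hypothetical counts showing why naive MLE fails
# for lexicalised rules: the full rule almost never recurs in a treebank.
from collections import Counter

rule_count = Counter()   # complete lexicalised rules
lhs_count = Counter()    # lexicalised left-hand sides

def observe(lhs, rhs):
    rule_count[(lhs, rhs)] += 1
    lhs_count[lhs] += 1

def mle(lhs, rhs):
    return rule_count[(lhs, rhs)] / lhs_count[lhs] if lhs_count[lhs] else 0.0

rule = ("S(bought,VBD)",
        ("NP(week,NN)", "NP(IBM,NNP)", "VP(bought,VBD)"))
observe(*rule)
# After a single observation the estimate is a useless 1.0, and every
# unseen lexicalised rule -- i.e. almost all of them -- gets 0.0:
print(mle(*rule))  # 1.0
```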
The Collins Parser
Model 1
Given a rule of the form $LHS \rightarrow L_n L_{n-1} \ldots L_1\ H\ R_1 \ldots R_{n-1} R_n$:
1. Generate the head of the phrase, $H(hw, ht)$, with probability $P_H(H(hw,ht) \mid LHS, hw, ht)$
2. Generate the modifiers to the left of the head with total probability
$\prod_{i=1}^{n+1} P_L(L_i(lw_i, lt_i) \mid LHS, H, hw, ht)$
such that $L_{n+1}(lw_{n+1}, lt_{n+1}) = STOP$; we stop generating once we have generated a STOP token.
3. Generate the modifiers to the right of the head with total probability
$\prod_{i=1}^{n+1} P_R(R_i(rw_i, rt_i) \mid LHS, H, hw, ht)$
such that $R_{n+1}(rw_{n+1}, rt_{n+1}) = STOP$; we stop generating once we have generated a STOP token.
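A minimal Python sketch of this decomposition, assuming the probability tables P_H, P_L and P_R are supplied as dictionaries (in the real parser these are smoothed estimates, as described later):

```python
# A minimal sketch of the Collins Model 1 decomposition. P_H, P_L and P_R
# are assumed to be dictionaries of (smoothed) conditional probabilities.
STOP = ("STOP", None, None)

def rule_probability(lhs, head, lefts, rights, P_H, P_L, P_R):
    """lhs = (label, hw, ht); head = (H, hw, ht); lefts/rights are the
    modifier constituents, each a (label, word, tag) triple."""
    label, hw, ht = lhs
    p = P_H[(head, label, hw, ht)]                 # 1. generate the head
    for mod in lefts + [STOP]:                     # 2. left modifiers, then STOP
        p *= P_L[(mod, label, head[0], hw, ht)]
    for mod in rights + [STOP]:                    # 3. right modifiers, then STOP
        p *= P_R[(mod, label, head[0], hw, ht)]
    return p
```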
The Collins Parser
Model 1 Example
I We want to calculate $P(S(bought,VBD) \rightarrow NP(week,NN)\ NP(IBM,NNP)\ VP(bought,VBD))$
I Work out the following probabilities and multiply the results (together with the STOP probabilities on each side):
1. $P_H(VP(bought,VBD) \mid S, bought, VBD)$
2. $P_L(NP(IBM,NNP) \mid S, VP, bought, VBD)$
3. $P_L(NP(week,NN) \mid S, VP, bought, VBD)$
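Continuing the sketch above with purely illustrative numbers (every probability value below is made up):

```python
# Hypothetical probability tables for the worked example.
P_H = {(("VP", "bought", "VBD"), "S", "bought", "VBD"): 0.30}
P_L = {(("NP", "IBM", "NNP"), "S", "VP", "bought", "VBD"): 0.20,
       (("NP", "week", "NN"), "S", "VP", "bought", "VBD"): 0.05,
       (STOP, "S", "VP", "bought", "VBD"): 0.60}
P_R = {(STOP, "S", "VP", "bought", "VBD"): 0.90}

p = rule_probability(("S", "bought", "VBD"), ("VP", "bought", "VBD"),
                     [("NP", "IBM", "NNP"), ("NP", "week", "NN")], [],
                     P_H, P_L, P_R)
print(p)   # 0.30 * 0.20 * 0.05 * 0.60 * 0.90 = 0.00162
```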
The Collins Parser
I In Model 1, a distance function is also included in the conditioning information for the left and right modifiers.
I This measures the number of words between the current modifier and the head.
I It has the effect of preferring right-branching structures and dispreferring dependencies which cross a verb.
I Other linguistically motivated refinements:
1. a distinction between recursive and base NPs
2. special features for coordination and punctuation
I Model 2 incorporates verb subcategorisation information.
I Model 3 incorporates long-distance dependency information.
The Collins Parser
Smoothing
I Smoothing is carried out using a linear interpolation of three models (sketched below):
$\lambda_1 e_1 + (1 - \lambda_1)(\lambda_2 e_2 + (1 - \lambda_2) e_3)$
I $e_1$ is the MLE of the fully lexicalised model, $e_2$ the MLE of the model which omits the head word from the conditioning information, and $e_3$ the MLE of the model which omits both the head word and the head tag
I The weights are set using Witten-Bell discounting
Accuracy
I Achieves Parseval f-scores of approximately 88% on WSJ section 23
I Models 2 and 3 are slightly more accurate than Model 1
Parsing Algorithm
I A version of probabilistic CKY
I The complexity of lexicalised CFG chart parsing is $O(n^4)$ or $O(n^5)$
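A minimal sketch of the interpolation; the lambda values here are made up (Collins derives them from Witten-Bell statistics rather than fixing them by hand):

```python
# A minimal sketch of the three-way linear interpolation.
def interpolate(e1, e2, e3, lam1, lam2):
    """e1: fully lexicalised MLE; e2: MLE omitting the head word;
    e3: MLE omitting both the head word and the head tag."""
    return lam1 * e1 + (1 - lam1) * (lam2 * e2 + (1 - lam2) * e3)

# An event unseen in the fully lexicalised model still gets probability
# mass from the coarser back-off estimates:
print(interpolate(e1=0.0, e2=0.15, e3=0.08, lam1=0.4, lam2=0.7))  # 0.0774
```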
The Charniak Parser
Main differences between the Charniak parser and the Collins parser:
1. It breaks down the probability estimation in a similar way to the Collins parser but uses more conditioning information:
1.1 When estimating the probability of a rule A → β, it includes the parent node of A in the conditioning information. This is an important feature.
1.2 It also includes the grandparent and sister nodes of A.
1.3 It uses a third-order Markov grammar: a modifier constituent in β is conditioned on the head constituent (as in the Collins parser) and on the previous two left or right modifiers.
2. It uses the suffix of an unknown word to guess its part-of-speech (sketched below).
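A minimal sketch of the unknown-word idea; the suffix table here is a toy assumption, since Charniak's parser estimates suffix statistics from its training data rather than hard-coding them:

```python
# A minimal sketch of suffix-based POS guessing for unknown words.
SUFFIX_TAGS = [("tion", "NN"), ("ing", "VBG"), ("ly", "RB"),
               ("ed", "VBD"), ("s", "NNS")]

def guess_tag(word, default="NN"):
    for suffix, tag in SUFFIX_TAGS:
        if word.lower().endswith(suffix):
            return tag
    return default

print(guess_tag("flurbling"))  # VBG -- never seen, but the suffix helps
print(guess_tag("zorp"))       # NN  -- no informative suffix, use the default
```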
Unlexicalised Parsing
I History-based lexicalised parsers achieve f-scores in the range 87-89% on WSJ section 23
I Klein and Manning (2003) show that it is possible to achieve an f-score of approximately 86% without lexicalisation:
I A node in a parse tree is annotated with its parent node (parent annotation, sketched below)
I This means that a subject noun phrase, whose parent is S, is annotated as NP^S, while a direct object noun phrase, whose parent is VP, is annotated as NP^VP
I Other transformations can also be carried out, e.g. splitting pre-terminal categories into more fine-grained ones
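A minimal sketch of parent annotation, reusing the tree encoding from the head-lexicalisation sketch earlier:

```python
# A minimal sketch of parent annotation: (tag, word) at pre-terminals,
# (label, [children]) above.
def parent_annotate(tree, parent=None):
    label, rest = tree
    if isinstance(rest, str):                  # leave pre-terminals unchanged
        return (label, rest)
    new_label = f"{label}^{parent}" if parent else label
    return (new_label, [parent_annotate(child, label) for child in rest])

tree = ("S", [("NP", [("NNP", "IBM")]),
              ("VP", [("VBD", "bought"), ("NP", [("NNP", "Lotus")])])])
print(parent_annotate(tree))
# ('S', [('NP^S', ...), ('VP^S', [('VBD', 'bought'), ('NP^VP', ...)])])
```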
Discriminative versus Generative Statistical Parsing
Generative Statistical Parsing
The probabilistic model is based on the generative derivation of a sentence. The model gives us the probability of a sentence by summing the probabilities of its derivations.
Discriminative Statistical Parsing
More flexible probability models which can incorporate information from a variety of sources:
1. Global facts about tree structure
2. Structure of previous sentences
3. Text genre
4. Facts about the speaker (e.g. gender)
Two Types of Discriminative Parsing
1. Discriminative Reranking
2. Discriminative Dynamic Programming
Discriminative Reranking
I The generative parser outputs an n-best list of parse trees
I Features are extracted from the parse trees:
  I parse probabilities
  I the CFG rules used
  I structural properties of the trees (e.g. degree of parallelism)
I Log-linear models are often used (see the sketch below)
I The Charniak and Johnson (2005) reranker boosts the f-score by 2 percentage points to 91.3%
I Disadvantage: the final result depends on the quality of the n-best list
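A minimal sketch of log-linear reranking, assuming a trained weight vector and a feature-extraction function (both hypothetical here); note that the generative parse probability is itself just one feature among many:

```python
# A minimal sketch of n-best reranking with a log-linear model.
def rerank(nbest, extract_features, weights):
    """nbest: list of (tree, log_prob) pairs from the generative parser.
    extract_features maps a tree to a {feature_name: value} dict;
    weights is the trained {feature_name: weight} dict."""
    def score(tree, log_prob):
        feats = extract_features(tree)
        feats["generative_log_prob"] = log_prob   # parser score is a feature
        return sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return max(nbest, key=lambda pair: score(*pair))[0]
```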
Discriminative Dynamic Programming
I All parses are stored (compactly) in the chart
I The best parse is then selected using a discriminative probabilistic model
I Example: CCG parser (Clark and Curran, 2004)
Dependency Parsing
Dependency parsers return a different type of structural analysis from phrase structure parsers: they return dependency graphs, in which
I the nodes are the words in the sentence
I the arcs are the dependency relationships between the words
Phrase structure is not explicitly represented.
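Because every word has exactly one head, a dependency graph can be stored as a flat array. A minimal sketch for IBM bought Lotus:

```python
# A minimal sketch: one head index and one label per word (0 = artificial root).
words  = ["IBM", "bought", "Lotus"]
heads  = [2, 0, 2]               # IBM <- bought, bought <- root, Lotus <- bought
labels = ["subj", "root", "obj"]

for i, (word, head, label) in enumerate(zip(words, heads, labels), start=1):
    head_word = "ROOT" if head == 0 else words[head - 1]
    print(f"{word} --{label}--> {head_word}")
```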
Dependency Parsing
We distinguish labelled and unlabelled dependency graphs.
I In a labelled dependency graph, the nature of each dependency relationship is specified.
I Typical dependency relationships: subj, obj, nmod, vmod
We distinguish projective and non-projective dependency graphs.
I Non-projectivity is identifiable with crossing dependencies (see the sketch below).
I Non-projective graphs are used for representing certain types of long-distance dependencies, e.g. What did economic news have little effect on?
I Most dependency-based linguistic theories allow non-projectivity.
I Most dependency-based parsing systems assume projectivity.
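Crossing arcs are easy to test for. A minimal sketch over the head-array encoding introduced above:

```python
# A minimal sketch: a dependency graph is non-projective exactly when two
# of its arcs cross, which reduces to a simple interval test.
def is_projective(heads):
    """heads[i] is the head (1-based) of word i+1; 0 marks the root."""
    arcs = [tuple(sorted((i + 1, h))) for i, h in enumerate(heads)]
    return not any(a < c < b < d
                   for a, b in arcs for c, d in arcs)

print(is_projective([2, 0, 2]))     # True: "IBM bought Lotus" from above
print(is_projective([3, 0, 2, 2]))  # False: a toy graph with crossing arcs
```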
Probabilistic Dependency Parsing
Broadly speaking, there are two types of probabilistic dependency parsers:
1. transition-based systems
2. graph-based systems
Transition-based systems
I The probability model predicts the next action of the parser
I Deterministic shift-reduce parsing (see the sketch below)
I Example: the Malt parser (Nivre et al. 2006)
Graph-based systems
I The probability model selects the entire graph for a sentence from the set of all possible graphs
I Example: the MST parser (McDonald et al. 2006)
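A minimal sketch of the transition-based idea, in arc-standard style (one of several transition systems used in practice); the function predict stands in for the trained classifier and is assumed to return a valid action at every step:

```python
# A minimal sketch of arc-standard shift-reduce dependency parsing.
def parse(words, predict):
    stack = [0]                                   # 0 is the artificial root
    buffer = list(range(1, len(words) + 1))
    arcs = []                                     # (head, dependent) pairs
    while buffer or len(stack) > 1:
        action = predict(stack, buffer)           # "SHIFT", "LEFT" or "RIGHT"
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT":                    # top is head of second-top
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        else:                                     # second-top is head of top
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
    return arcs
```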
Conclusion
"But while there is room for improvement I believe that it is now or will soon be time to stop working on improving labeled precision/recall per se... At some point we should move on." (Charniak, 1997)
What now?
1. Domain adaptation
2. Task-based evaluation
3. Semantic parsing
Charniak, E.: Statistical techniques for natural language parsing. AI Magazine 18(4) (1997)
Charniak, E.: A maximum-entropy-inspired parser. In: Proceedings of the Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-00), pp. 132–139, Seattle, Washington (2000)
Collins, M.: Head-driven statistical models for natural language parsing. Computational Linguistics 29(4), 589–637 (2003)
Klein, D., Manning, C.: Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting of the ACL (2003)
McDonald, R., Lerman, K., Pereira, F.: Multilingual dependency analysis with a two-stage discriminative parser. In: Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL) (2006)
Nivre, J., Hall, J., Nilsson, J., Eryigit, G., Marinov, S.: Labeled pseudo-projective dependency parsing with support vector machines. In: Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL) (2006)