Statistical Parsing: an Overview

based on a draft chapter 14 of the 2nd edition of “Speech and Language Processing” by Jurafsky and Martin

Jennifer Foster

NCLT Seminar Series 2007-2008

23rd January 2007


Outline

Statistical Parsing: What and Why?

PCFGs
  Probabilistic CKY
  Obtaining Probabilistic Grammars

Limitations of PCFG Parsing

Generative, History-Based Lexicalised Parsers
  Collins Parser
  Charniak Parser
  Unlexicalised Parsing

Discriminative Parsing
  Discriminative Reranking
  Discriminative Dynamic Programming

Dependency Parsing

Conclusion

Statistical Parsing: What and Why?

What is natural language parsing?
The process of analysing the structure of natural language sentences.

What is statistical natural language parsing?
Parsing in which the most likely analysis is selected amongst all possible analyses for an input sentence.

Why are most natural language parsers statistical?
Because syntactic ambiguity makes non-probabilistic or symbolic parsing intractable!

Benefits of Statistical Parsing

Jurafsky and Martin identify the following two benefits of probabilistic parsing:

1. Syntactic Disambiguation (main motivation)

2. Language Modelling

Probabilistic Context-Free Grammar Defined

A PCFG 〈N, Σ, R, S〉 is defined as follows:

1. N is the set of non-terminal symbols

2. Σ is the set of terminal symbols (disjoint from N)

3. R is a set of rules of the form A → β [p], where A ∈ N, β ∈ (Σ ∪ N)*, and p is a number between 0 and 1

4. S ∈ N is the start symbol

A PCFG is simply a CFG in which each rule is associated with a probability.
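
To make the definition concrete, here is a minimal sketch (not from the slides) of how such a grammar might be represented in Python; the toy grammar and all probabilities are invented for illustration.

```python
# A toy PCFG <N, Sigma, R, S>: each LHS non-terminal maps to its
# possible expansions (RHS tuples) with their probabilities.
rules = {
    "S":   [(("NP", "VP"), 1.0)],
    "NP":  [(("PRP",), 0.4), (("DT", "NN"), 0.6)],
    "VP":  [(("VB", "NP"), 0.7), (("VB",), 0.3)],
    "PRP": [(("she",), 1.0)],
    "DT":  [(("the",), 1.0)],
    "NN":  [(("cake",), 1.0)],
    "VB":  [(("ate",), 1.0)],
}
nonterminals = set(rules)                          # N
terminals = {sym for expansions in rules.values()  # Sigma: RHS symbols
             for rhs, _ in expansions              # that never occur
             for sym in rhs if sym not in rules}   # as an LHS
start = "S"                                        # the start symbol
print(sorted(terminals))   # ['ate', 'cake', 'she', 'the']
```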

More about PCFGs

What does the p associated with each rule express?
It expresses the probability that the LHS non-terminal will be expanded as the RHS sequence:

- p = P(A → β | A)
- ∑β P(A → β | A) = 1
- That is, the sum of the probabilities associated with all of the rules expanding the non-terminal A is 1.
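
A small sketch of checking this well-formedness condition on a toy rule table (numbers invented):

```python
# Toy rule table: LHS -> list of (RHS, probability); invented numbers.
rules = {
    "NP": [(("PRP",), 0.4), (("DT", "NN"), 0.6)],
    "VP": [(("VB", "NP"), 0.7), (("VB",), 0.3)],
}

def check_normalised(rules, tol=1e-9):
    # For every non-terminal A, the sum over beta of P(A -> beta | A)
    # must be 1.
    for lhs, expansions in rules.items():
        total = sum(p for _, p in expansions)
        assert abs(total - 1.0) < tol, f"rules for {lhs} sum to {total}"

check_normalised(rules)   # passes silently: the PCFG is well formed
```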

PCFGs and Disambiguation

- A PCFG assigns a probability to every parse tree or derivation associated with a sentence.
- This probability is the product of the probabilities of the rules applied in building the parse tree:
  P(T, S) = ∏ᵢ₌₁ⁿ P(Aᵢ → βᵢ), where n is the number of rules used in T.
- P(T, S) = P(T)P(S|T) = P(S)P(T|S) by definition.
- But P(S|T) = 1, because all the words in S are in T.
- So, P(T, S) = P(T).
- A parse disambiguation algorithm picks out the most probable parse tree for a sentence:
  T̂(S) = argmax P(T|S) s.t. S = yield(T)
- P(T|S) = P(T, S)/P(S), and P(S) is constant over all the trees for S, so:
  T̂(S) = argmax P(T, S) s.t. S = yield(T)
        = argmax P(T) s.t. S = yield(T)
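
As an illustration (not from the slides), a minimal Python sketch of both ideas: P(T) as a product over the rules used in T, and disambiguation as an argmax over candidate trees. Trees are nested tuples and all probabilities are invented.

```python
# Trees are nested tuples (label, child1, ...); leaves are words.
# Rule probabilities live in a dict (LHS, RHS) -> p; all invented.
rule_prob = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("PRP",)): 0.4,
    ("VP", ("VB",)): 0.3,
    ("PRP", ("she",)): 1.0,
    ("VB", ("slept",)): 1.0,
}

def tree_prob(tree):
    """P(T): the product of P(A -> beta) over every rule used in T."""
    if isinstance(tree, str):           # a word; no rule applied here
        return 1.0
    lhs, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob[(lhs, rhs)]
    for child in children:
        p *= tree_prob(child)
    return p

def best_parse(candidate_trees):
    # Since P(S|T) = 1 and P(S) is constant over the candidates,
    # argmax P(T|S) reduces to argmax P(T).
    return max(candidate_trees, key=tree_prob)

t = ("S", ("NP", ("PRP", "she")), ("VP", ("VB", "slept")))
print(tree_prob(t))   # 1.0 * 0.4 * 0.3 * 1.0 * 1.0 ≈ 0.12
```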

PCFGs and Language Modelling

As well as assigning probabilities to parse trees, a PCFG assigns a probability to every sentence generated by the grammar. This is useful for language modelling.

- The probability of a sentence is the sum of the probabilities of each parse tree associated with the sentence:
  P(S) = ∑ P(T, S) = ∑ P(T), summing over all T s.t. yield(T) = S.

When is it useful to know the probability of a sentence?
When ranking the output of speech recognition, machine translation and error correction systems.
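
A hedged sketch of this sum: in practice the candidate trees and their probabilities would come from a chart parser; here they are hand-built and the numbers are invented.

```python
def yield_of(tree):
    """Left-to-right sequence of words at the leaves of a tree."""
    if isinstance(tree, str):
        return (tree,)
    return tuple(w for child in tree[1:] for w in yield_of(child))

def sentence_prob(sentence, trees_with_probs):
    # P(S) = sum of P(T) over every tree T whose yield is S.
    return sum(p for tree, p in trees_with_probs
               if yield_of(tree) == tuple(sentence))

# Two invented parses of a PP-attachment-ambiguous sentence:
parses = [
    (("S", ("NP", "I"), ("VP", ("VB", "saw"),
        ("NP", ("NP", "her"), ("PP", "with", "binoculars")))), 3e-7),
    (("S", ("NP", "I"), ("VP", ("VB", "saw"), ("NP", "her"),
        ("PP", "with", "binoculars"))), 5e-7),
]
print(sentence_prob("I saw her with binoculars".split(), parses))  # ≈ 8e-07
```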

Probabilistic CKY

Probabilistic versions of most parsing algorithms exist. Many probabilistic parsers use a probabilistic version of the CKY bottom-up chart parsing algorithm (see Chap. 13).

Given a sentence s of length n and a CFG grammar with V non-terminal symbols:

- Normal CKY: a 2-d (n + 1) × (n + 1) array, where the value in cell (i, j) is a list of non-terminals spanning positions i through j in s.
- Probabilistic CKY: a 3-d (n + 1) × (n + 1) × V array, where the value in cell (i, j, K) is the probability of the non-terminal K spanning positions i through j in s.

As with regular CKY, probabilistic CKY assumes that the grammar is in Chomsky normal form. Note that probabilities will also need to be adjusted when transforming a non-CNF PCFG into a CNF PCFG.
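
Below is a minimal sketch of probabilistic (Viterbi) CKY for a CNF PCFG, with the chart kept as a dictionary keyed by (i, j, K) to mirror the (n + 1) × (n + 1) × V array described above. The toy grammar, sentence and probabilities are invented.

```python
from collections import defaultdict

lexical = {                      # unary lexical rules: K -> w [p]
    ("DT", "the"): 1.0, ("NN", "dog"): 0.6, ("NN", "cat"): 0.4,
    ("VB", "saw"): 1.0,
}
binary = {                       # CNF binary rules: K -> L R [p]
    ("S", "NP", "VP"): 1.0,
    ("NP", "DT", "NN"): 1.0,
    ("VP", "VB", "NP"): 1.0,
}

def cky(words):
    n = len(words)
    table = defaultdict(float)   # (i, j, K) -> best probability of K over i..j
    back = {}                    # (i, j, K) -> how that best entry was built
    for i, w in enumerate(words):            # width-1 spans: lexical rules
        for (K, word), p in lexical.items():
            if word == w and p > table[(i, i + 1, K)]:
                table[(i, i + 1, K)] = p
                back[(i, i + 1, K)] = w
    for width in range(2, n + 1):            # wider spans, bottom-up
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):        # split point
                for (K, L, R), p in binary.items():
                    q = p * table[(i, k, L)] * table[(k, j, R)]
                    if q > table[(i, j, K)]:
                        table[(i, j, K)] = q
                        back[(i, j, K)] = (k, L, R)
    return table, back

table, _ = cky("the dog saw the cat".split())
print(table[(0, 5, "S")])   # best S parse: 1.0 * 0.6 * 0.4 = 0.24
```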

Obtaining Probabilistic Grammars

In a PCFG every rule is associated with a probability. But where do these rule probabilities come from? There are two ways to obtain rule probabilities for a PCFG:

1. Use a treebank:
   P(A → β | A) = count(A → β) / count(A)

2. Use a corpus of sentences + the Inside-Outside algorithm (Baker, 1979):
   2.1 Take a CFG and set all rules to have equal probability
   2.2 Parse the corpus with the CFG
   2.3 Adjust the probabilities
   2.4 Repeat steps 2.2 and 2.3 until the probabilities converge

The Inside-Outside algorithm is a type of Expectation Maximisation algorithm. It can also be used to induce a grammar, but only with limited success.
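
A minimal sketch of method 1, reading the maximum-likelihood rule probabilities off a tiny, invented treebank of nested-tuple trees:

```python
from collections import Counter

# Tiny invented treebank; trees are nested tuples, leaves are words.
treebank = [
    ("S", ("NP", ("PRP", "she")), ("VP", ("VB", "slept"))),
    ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VP", ("VB", "slept"))),
]

rule_count, lhs_count = Counter(), Counter()

def count_rules(tree):
    if isinstance(tree, str):                  # a word: no rule here
        return
    lhs, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rule_count[(lhs, rhs)] += 1
    lhs_count[lhs] += 1
    for child in children:
        count_rules(child)

for t in treebank:
    count_rules(t)

# P(A -> beta | A) = count(A -> beta) / count(A)
probs = {rule: c / lhs_count[rule[0]] for rule, c in rule_count.items()}
print(probs[("NP", ("PRP",))])   # 1 / 2 = 0.5
```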

Limitations of PCFG Parsing

Two well-known drawbacks of PCFG parsing are:

1. the independence of the rules in a PCFG

2. their failure to fully exploit lexical knowledge in resolving ambiguities

PCFG Rule Independence

In PCFG parsing, the application of a rule in a derivation is an independent event. This means that previous rule applications in the same derivation have no influence over the current rule application.

- This rule independence can be a disadvantage.
- Consider, for example, that an English subject NP is extremely likely to be realised as a pronoun, while an English object NP is more likely to be realised as a non-pronominal noun phrase.
- This cannot be reflected in a PCFG, which does not distinguish object NPs from subject NPs.
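
One common remedy, used in the unlexicalised parsing work discussed later, is to split categories so that a subject NP and an object NP become distinct symbols, e.g. by parent annotation (Johnson, 1998). A minimal sketch on nested-tuple trees (the example tree is invented):

```python
# Relabel each non-terminal with its parent so that a subject NP
# (NP^S) and an object NP (NP^VP) get separate rule distributions
# when the PCFG is read off the transformed treebank.
def parent_annotate(tree, parent=None):
    if isinstance(tree, str):            # words are left untouched
        return tree
    label = tree[0] if parent is None else f"{tree[0]}^{parent}"
    return (label,) + tuple(parent_annotate(c, tree[0]) for c in tree[1:])

t = ("S", ("NP", ("PRP", "she")),
          ("VP", ("VB", "saw"), ("NP", ("PRP", "her"))))
print(parent_annotate(t))
# ('S', ('NP^S', ('PRP^NP', 'she')),
#       ('VP^S', ('VB^VP', 'saw'), ('NP^VP', ('PRP^NP', 'her'))))
```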

Using Lexical Knowledge to Resolve Ambiguity

- Consider, for example, the sentences:
  Workers dumped sacks into a bin
  Fishermen caught tons of herring

- The problem of where to attach a PP is a common form of syntactic ambiguity.

- A main motivation for probabilistic parsing is to perform accurate disambiguation.

- People resolve this kind of ambiguity by looking at the actual nouns, verbs and prepositions involved.

- Faced with a choice between two attachment sites, VP and NP, for a constituent PP, a PCFG can only compare the rule probabilities of

  1. VP → VP PP versus
  2. NP → NP PP

  These rules contain no lexical information.
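
As a rough illustration (not any particular parser's actual method), lexical attachment preferences can be read off head-word co-occurrence counts; the counts below are invented for the two example sentences.

```python
from collections import Counter

# Invented (verb, preposition) and (noun, preposition) co-occurrence counts.
verb_prep = Counter({("dumped", "into"): 25, ("caught", "of"): 1})
noun_prep = Counter({("sacks", "into"): 2, ("tons", "of"): 40})

def attach(verb, noun, prep):
    """Attach the PP to whichever candidate head co-occurs more often
    with the preposition."""
    return "VP" if verb_prep[(verb, prep)] > noun_prep[(noun, prep)] else "NP"

print(attach("dumped", "sacks", "into"))   # VP: 'into a bin' modifies dumped
print(attach("caught", "tons", "of"))      # NP: 'of herring' modifies tons
```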

Generative, History-Based, Lexicalised Parsers

Generative, history-based, lexicalised parsers overcome the limits of PCFGs by employing:

1. lexicalisation

2. more complicated probabilistic models

Two examples are:

1. the Collins parser

2. the Charniak parser

Lexicalised PCFGs

A PCFG can be lexicalised by associating a word and part-of-speech tag with every non-terminal in the grammar.

It is head-lexicalised if the word is the head of the constituent described by the non-terminal.

Lexicalised PCFGs

A non-lexicalised parse tree for the sentence Last week IBM bought Lotus:

(S (NP (JJ Last) (NN week))
   (NP (NNP IBM))
   (VP (VBD bought)
       (NP (NNP Lotus))))

Lexicalised PCFGs

A lexicalised parse tree for the sentence Last week IBM bought Lotus:

(S(bought,VBD)
   (NP(week,NN) (JJ(Last,JJ) Last) (NN(week,NN) week))
   (NP(IBM,NNP) (NNP(IBM,NNP) IBM))
   (VP(bought,VBD) (VBD(bought,VBD) bought)
      (NP(Lotus,NNP) (NNP(Lotus,NNP) Lotus))))
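
A rough sketch of how such head annotations can be computed, by percolating (word, tag) pairs up the tree with head-finding rules. The rules below are drastically simplified stand-ins for the much richer Magerman/Collins head tables.

```python
# Simplified head rules: for each phrase label, the child categories to
# try, in order, when choosing the head child.  Illustrative only.
HEAD_RULES = {"S": ["VP", "NP"], "VP": ["VBD", "VP"], "NP": ["NN", "NNP", "NP"]}

def lexicalise(tree):
    """tree is (label, *children), with preterminals written as (tag, word).
    Returns (head_word, head_tag, head-annotated tree)."""
    label = tree[0]
    if isinstance(tree[1], str):                      # preterminal (tag, word)
        word = tree[1]
        return word, label, ("%s(%s,%s)" % (label, word, label), word)
    done = [lexicalise(child) for child in tree[1:]]
    hw, ht = done[0][0], done[0][1]                   # fallback: leftmost child
    for cat in HEAD_RULES.get(label, []):
        hits = [d for d, c in zip(done, tree[1:]) if c[0] == cat]
        if hits:
            hw, ht = hits[0][0], hits[0][1]
            break
    return hw, ht, ("%s(%s,%s)" % (label, hw, ht),) + tuple(d[2] for d in done)

tree = ("S", ("NP", ("JJ", "Last"), ("NN", "week")),
             ("NP", ("NNP", "IBM")),
             ("VP", ("VBD", "bought"), ("NP", ("NNP", "Lotus"))))
print(lexicalise(tree)[2][0])                         # S(bought,VBD)
```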

Lexicalised PCFGs

Lexicalisation drastically increases the size of the grammar. This leads to the sparse data problem.

- How do we estimate the probability of the rule
  S(bought,VBD) → NP(week,NN) NP(IBM,NNP) VP(bought,VBD)?

- Maximum-likelihood estimation won't work:

  count(S(bought,VBD) → NP(week,NN) NP(IBM,NNP) VP(bought,VBD)) / count(S(bought,VBD))

  A rule this specific will almost never occur in training data, so the numerator is nearly always zero.

- The sparse data problem can be tackled by breaking the probability estimation for the right-hand side of the rule into three parts (see the sketch after this list):

  1. predict the probability of the head constituent, in this case VP(bought,VBD)
  2. predict the probability of the constituents on the left of the head, in this case NP(IBM,NNP) and NP(week,NN)
  3. predict the probability of the constituents on the right of the head, in this case none
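
A minimal sketch of this head/left/right factorisation, with the conditional probabilities stubbed out as a toy lookup table (all values invented) so that only the structure of the computation is visible. STOP symbols mark where modifier generation ends, as in the Collins model described next.

```python
# Invented conditional probabilities, keyed by (direction, modifier, context).
P = {
    ("head",  "VP(bought,VBD)", "S,bought,VBD"):    0.25,
    ("left",  "NP(IBM,NNP)",    "S,VP,bought,VBD"): 0.05,
    ("left",  "NP(week,NN)",    "S,VP,bought,VBD"): 0.03,
    ("left",  "STOP",           "S,VP,bought,VBD"): 0.40,
    ("right", "STOP",           "S,VP,bought,VBD"): 0.90,
}

def rule_prob(head, lefts, rights, head_ctx, mod_ctx):
    """Probability of a lexicalised rule as the product of one head term
    and one term per modifier, closing each side with STOP."""
    p = P[("head", head, head_ctx)]
    for mod in lefts + ["STOP"]:
        p *= P[("left", mod, mod_ctx)]
    for mod in rights + ["STOP"]:
        p *= P[("right", mod, mod_ctx)]
    return p

print(rule_prob("VP(bought,VBD)", ["NP(IBM,NNP)", "NP(week,NN)"], [],
                "S,bought,VBD", "S,VP,bought,VBD"))   # ~0.000135
```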

The Collins Parser

Model 1

Given a rule of the form:
LHS → L_n L_{n-1} ... L_1 H R_1 ... R_{n-1} R_n

1. Generate the head of the phrase H(hw, ht) with probability
   P_h(H(hw, ht) | LHS, hw, ht)

2. Generate modifiers to the left of the head with total probability

   ∏_{i=1}^{n+1} P_L(L_i(lw_i, lt_i) | LHS, H, hw, ht)

   such that L_{n+1}(lw_{n+1}, lt_{n+1}) = STOP, and we stop generating once we've generated a STOP token.

3. Generate modifiers to the right of the head with total probability

   ∏_{i=1}^{n+1} P_R(R_i(rw_i, rt_i) | LHS, H, hw, ht)

   such that R_{n+1}(rw_{n+1}, rt_{n+1}) = STOP, and we stop generating once we've generated a STOP token.

The Collins Parser

Model 1 Example

- We want to calculate
  P(S(bought,VBD) → NP(week,NN) NP(IBM,NNP) VP(bought,VBD))

- Work out the following probabilities and multiply the results (see the sketch below):

  1. P(VP(bought,VBD) | S, bought, VBD)
  2. P(NP(IBM,NNP) | S, VP, bought, VBD)
  3. P(NP(week,NN) | S, VP, bought, VBD)
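
A sketch of this calculation with invented probability values (a real parser would look these up in its smoothed estimates); the STOP terms of Model 1 are omitted here, as in the worked example above.

```python
# Invented values for the three conditional probabilities listed above.
p_head  = 0.30   # P(VP(bought,VBD) | S, bought, VBD)
p_left1 = 0.06   # P(NP(IBM,NNP)    | S, VP, bought, VBD)
p_left2 = 0.03   # P(NP(week,NN)    | S, VP, bought, VBD)

rule_prob = p_head * p_left1 * p_left2
print(rule_prob)   # ~0.00054
```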

The Collins Parser

- In Model 1, a distance function is also included in the conditioning information for the left and right modifiers.

- This is used to measure the number of words between the current modifier and the head.

- It has the effect of preferring right-branching structures and dispreferring dependencies which cross a verb.

- Other linguistically motivated refinements:

  1. Distinction between recursive and base NPs
  2. Special features for coordination and punctuation

- Model 2 incorporates verb subcategorisation information.

- Model 3 incorporates long-distance dependency information.

The Collins Parser

Smoothing

- Smoothing is carried out using a linear interpolation of three models:

  λ1 e1 + (1 − λ1)(λ2 e2 + (1 − λ2) e3)

- e1 is the MLE of the fully lexicalised model, e2 is the MLE of the model which omits the head word from the conditioning information, and e3 is the MLE of the model which omits both the head word and the head tag.

- The weights are set using Witten-Bell discounting (see the sketch below).
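
A minimal sketch of this three-way interpolation; the estimates and weights below are invented, and in the real parser λ1 and λ2 come from Witten-Bell discounting rather than being fixed constants.

```python
def interpolate(e1, e2, e3, lam1, lam2):
    """Back off from the fully lexicalised estimate e1 towards the less
    conditioned estimates e2 and e3."""
    return lam1 * e1 + (1 - lam1) * (lam2 * e2 + (1 - lam2) * e3)

# e1 unseen (zero count), so the estimate falls back on e2 and e3.
print(interpolate(e1=0.0, e2=0.12, e3=0.05, lam1=0.2, lam2=0.6))   # ~0.0736
```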

Accuracy

- Achieves Parseval f-scores of approximately 88% on WSJ23

- Models 2 and 3 are slightly more accurate than Model 1

Parsing Algorithm

- A version of probabilistic CKY

- The complexity of lexicalised CFG chart parsing is O(n^4) or O(n^5)

The Charniak Parser

Main differences between the Charniak parser and the Collins parser:

1. It breaks down the probability estimation in a similar way to the Collins parser but uses more conditioning information:

   1.1 When estimating the probability of a rule A → β, it includes the parent node of A in the conditioning information. This is an important feature.
   1.2 It also includes the grandparent and sister nodes of A.
   1.3 It uses a third-order Markov grammar: a modifier constituent in β is conditioned on the head constituent (à la Collins) and the previous two left or right modifiers.

2. It uses the suffix of an unknown word to guess its part-of-speech.
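
A toy version of suffix-based tag guessing for unknown words; the suffix table is invented, and Charniak's actual model is richer than a first-match lookup.

```python
# Map word suffixes to likely Penn Treebank tags; first match wins.
SUFFIX_TAGS = [("tion", "NN"), ("ing", "VBG"), ("ed", "VBD"),
               ("ly", "RB"), ("s", "NNS")]

def guess_tag(word, default="NN"):
    """Guess a POS tag for an out-of-vocabulary word from its suffix."""
    for suffix, tag in SUFFIX_TAGS:
        if word.lower().endswith(suffix):
            return tag
    return default

print(guess_tag("consolidation"))   # NN
print(guess_tag("refactoring"))     # VBG
```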

Unlexicalised Parsing

- History-based lexicalised parsers achieve f-scores in the range 87-89% on WSJ23.

- Klein and Manning (2003) show that it is possible to achieve an f-score of approximately 86% without lexicalisation:

  - A node in a parse tree is annotated with its parent node: parent annotation (sketched below).
  - This means that a subject noun phrase is annotated with the category S, giving NP^S, and a direct object noun phrase is annotated with the category VP, giving NP^VP.
  - Other transformations can be carried out, e.g. splitting pre-terminal categories into more fine-grained categories.
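
A short sketch of the parent-annotation transform, using a bracketed tuple encoding of trees:

```python
def parent_annotate(tree, parent=None):
    """tree is (label, *children), with preterminals written as (tag, word).
    Suffix every non-terminal (except the root) with its parent's label."""
    label = tree[0]
    if isinstance(tree[1], str):           # leave preterminals unchanged
        return tree
    new_label = "%s^%s" % (label, parent) if parent else label
    return (new_label,) + tuple(parent_annotate(c, label) for c in tree[1:])

tree = ("S", ("NP", ("PRP", "They")),
             ("VP", ("VBD", "bought"), ("NP", ("NNP", "Lotus"))))
print(parent_annotate(tree))
# ('S', ('NP^S', ('PRP', 'They')),
#       ('VP^S', ('VBD', 'bought'), ('NP^VP', ('NNP', 'Lotus'))))
```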

Discriminative versus Generative Statistical Parsing

Generative Statistical Parsing
The probabilistic model is based on the generative derivation of a sentence. The model gives us the probability of a sentence by summing the probabilities of its derivations.

Discriminative Statistical Parsing
More flexible probability models which can incorporate information from a variety of sources:

1. Global facts about tree structure

2. Structure of previous sentences

3. Text genre

4. Facts about the speaker (e.g. gender)

Two Types of Discriminative Parsing

1. Discriminative Reranking

2. Discriminative Dynamic Programming

Discriminative Reranking

- A generative parser outputs an n-best list of parse trees.

- Features are extracted from the parse trees:
  - parse probabilities
  - the CFG rules used
  - structural properties of trees (e.g. degree of parallelism)

- Log-linear models are often used (see the sketch below).

- The Charniak and Johnson (2005) reranker boosts the f-score by 2 percentage points, to 91.3%.

- Disadvantage: the final result will depend on the quality of the n-best list.
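
A bare-bones sketch of log-linear reranking over an n-best list; the features and weights are invented stand-ins for the very large feature sets a real reranker trains on treebank data.

```python
def score(features, weights):
    """Linear score of one parse; exponentiating and normalising these
    scores over the n-best list gives a log-linear model."""
    return sum(weights.get(name, 0.0) * value
               for name, value in features.items())

# Invented feature weights and two candidate parses from an n-best list.
weights = {"log_gen_prob": 1.0, "right_branching": 0.3, "parallelism": 0.5}
nbest = [
    {"log_gen_prob": -42.1, "right_branching": 3, "parallelism": 0},
    {"log_gen_prob": -42.8, "right_branching": 5, "parallelism": 1},
]
best = max(nbest, key=lambda f: score(f, weights))
print(best)   # the second parse wins despite a lower generative probability
```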

Discriminative Dynamic Programming

- All parses are stored (compactly) in the chart
- The best parse is then selected using a discriminative probabilistic model (see the sketch after this list)
- Example: CCG parser (Clark and Curran, 2004)
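A minimal sketch of the idea (hypothetical chart data structures, not Clark and Curran's implementation): if the model's features decompose over the individual steps (hyperedges) used to build chart items, the best parse can be extracted from the packed chart by a Viterbi-style dynamic program.

    # Sketch: Viterbi extraction of the best parse from a packed chart.

    def linear(weights, feats):
        """Linear score w . f over a sparse feature dict."""
        return sum(weights.get(k, 0.0) * v for k, v in feats.items())

    class Item:
        """A chart item. `edges` lists the ways the item can be built:
        each entry is (local_features, children). Leaves have no edges."""
        def __init__(self, label, edges=()):
            self.label = label
            self.edges = list(edges)

    def best_parse(item, weights, memo=None):
        """Bottom-up dynamic program: because every feature is local to
        one hyperedge, the best derivation of an item depends only on
        the best derivations of its children."""
        memo = {} if memo is None else memo
        if id(item) in memo:
            return memo[id(item)]
        if not item.edges:                       # lexical leaf
            best = (0.0, item.label)
        else:
            best = None
            for feats, children in item.edges:
                subs = [best_parse(c, weights, memo) for c in children]
                total = linear(weights, feats) + sum(s for s, _ in subs)
                tree = (item.label, [t for _, t in subs])
                if best is None or total > best[0]:
                    best = (total, tree)
        memo[id(item)] = best
        return best

The crucial restriction is that features must be local to hyperedges; genuinely global features break the dynamic program and push you back towards n-best reranking.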

Dependency Parsing

Dependency parsers return a different type of structural analysis from phrase-structure parsers. They return dependency graphs:

- the nodes are the words in the sentence
- the arcs are the dependency relationships between the words

Phrase structure is not explicitly represented.
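As a minimal illustration (a hypothetical encoding, not any particular parser's output format), such a graph can be stored simply as a head index for every word:

    # Dependency graph for "Economic news had little effect":
    # words are indexed from 1; 0 stands for the artificial root;
    # heads[i] is the index of word i's head.

    sentence = ["Economic", "news", "had", "little", "effect"]
    heads = {1: 2, 2: 3, 3: 0, 4: 5, 5: 3}

    for dep in sorted(heads):
        head = heads[dep]
        head_word = "ROOT" if head == 0 else sentence[head - 1]
        print(f"{sentence[dep - 1]} -> {head_word}")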

Dependency Parsing

We distinguish labelled and unlabelled dependency graphs.

- In a labelled dependency graph, the nature of the dependency relationship is specified.
- Typical dependency relationships: subj, obj, nmod, vmod

We distinguish projective and non-projective dependency graphs.

- Non-projectivity is identifiable with crossing dependencies (see the sketch after this list).
- Non-projective arcs are used for representing certain types of long-distance dependencies, e.g. What did economic news have little effect on?
- Most dependency-based linguistic theories allow non-projectivity
- Most dependency-based parsing systems assume projectivity
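A small sketch of the crossing-arcs test mentioned above (the standard characterisation; the heads-dictionary encoding follows the earlier example and is an assumption of this sketch, not a fixed format):

    # A tree is projective iff no two arcs cross when drawn above the
    # sentence. heads maps dependent index -> head index (0 = root).

    def is_projective(heads):
        arcs = [tuple(sorted((dep, head))) for dep, head in heads.items()]
        for l1, r1 in arcs:
            for l2, r2 in arcs:
                # Two arcs cross iff one starts strictly inside the
                # other and ends strictly outside it.
                if l1 < l2 < r1 < r2:
                    return False
        return True

    print(is_projective({1: 2, 2: 0, 3: 2}))        # True: nested arcs
    print(is_projective({1: 3, 2: 4, 3: 0, 4: 3}))  # False: 1-3 crosses 2-4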

Probabilistic Dependency Parsing

Broadly speaking, there are two types of probabilistic dependency parsers:

1. Transition-based systems
2. Graph-based systems

Transition-based systems

- Probability model predicts the next action of the parser
- Deterministic shift-reduce parsing
- Example: MaltParser (Nivre et al. 2006); see the sketch after this list

Graph-based systems

- Probability model predicts the entire graph for a sentence from the set of all possible graphs
- Example: MST Parser (McDonald et al. 2006)
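As a minimal illustration of the transition-based idea (a hand-driven arc-standard sketch; MaltParser itself chooses each action with a trained classifier over features of the current stack and buffer):

    # Arc-standard shift-reduce parsing, with the action sequence
    # supplied by hand instead of predicted by a learned model.

    def parse(words, actions):
        """Apply SHIFT / LEFT-ARC / RIGHT-ARC transitions and return
        the resulting (head, dependent) arcs over 1-based indices."""
        stack = []
        buffer = list(range(1, len(words) + 1))
        arcs = []
        for act in actions:
            if act == "SHIFT":        # move the next word onto the stack
                stack.append(buffer.pop(0))
            elif act == "LEFT-ARC":   # top of stack heads the word below it
                dep = stack.pop(-2)
                arcs.append((stack[-1], dep))
            elif act == "RIGHT-ARC":  # word below the top heads the top
                dep = stack.pop()
                arcs.append((stack[-1], dep))
        return arcs

    words = ["news", "had", "effect"]
    actions = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC"]
    print(parse(words, actions))      # [(2, 1), (2, 3)]: "had" heads both

A graph-based parser such as MST Parser instead scores every possible head-dependent arc and searches for the highest-scoring spanning tree over the whole sentence.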

Conclusion

"But while there is room for improvement I believe that it is now or will soon be time to stop working on improving labeled precision/recall per se... At some point we should move on." (Charniak, 1997)

What now?

1. Domain adaptation

2. Task-based evaluation

3. Semantic parsing

References

Charniak, E.: Statistical techniques for natural language parsing. AI Magazine 18(4) (1997)

Charniak, E.: A maximum-entropy-inspired parser. In: Proceedings of the Annual Meeting of the North American Association for Computational Linguistics (NAACL-00), pp. 132–139. Seattle, Washington (2000)

Collins, M.: Head-driven statistical models for natural language parsing. Computational Linguistics 29(4), 589–637 (2003)

Klein, D., Manning, C.: Accurate unlexicalised parsing. In: Proceedings of the 41st Annual Meeting of the ACL (2003)

McDonald, R., Lerman, K., Pereira, F.: Multilingual dependency analysis with a two-stage discriminative parser. In: Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL) (2006)

Nivre, J., Hall, J., Nilsson, J., Eryigit, G., Marinov, S.: Labeled pseudo-projective dependency parsing with support vector machines. In: Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL) (2006)

