Statistical Parsing: an Overview
based on a draft of chapter 14 of the 2nd edition of “Speech and Language Processing” by Jurafsky and Martin
Jennifer Foster
NCLT Seminar Series 2007-2008
23rd January 2007
Outline
Statistical Parsing: What and Why?
PCFGs
  Probabilistic CKY
  Obtaining Probabilistic Grammars
Limitations of PCFG Parsing
Generative, History-Based Lexicalised Parsers
  Collins Parser
  Charniak Parser
  Unlexicalised Parsing
Discriminative Parsing
  Discriminative Reranking
  Discriminative Dynamic Programming
Dependency Parsing
Conclusion
Statistical Parsing: What and Why?
What is natural language parsing?
The process of analysing the structure of natural language sentences.
What is statistical natural language parsing?
Parsing in which the most likely analysis is selected amongst all possible analyses for an input sentence.
Why are most natural language parsers statistical?
Because syntactic ambiguity makes non-probabilistic or symbolic parsing intractable!
Benefits of Statistical Parsing
Jurafsky and Martin identify the following two benefits of probabilistic parsing:
1. Syntactic Disambiguation (main motivation)
2. Language Modelling
Probabilistic Context-Free Grammar Defined
A PCFG 〈N, Σ, R, S〉 is defined as follows:
1. N is the set of non-terminal symbols
2. Σ is the set of terminal symbols (disjoint from N)
3. R is a set of rules of the form A → β [p], where A ∈ N, β ∈ (Σ ∪ N)∗, and p is a number between 0 and 1
4. S ∈ N is the start symbol
A PCFG is a CFG in which each rule is associated with a probability.
More about PCFGs
What does the p associated with each rule express?
It expresses the probability that the LHS non-terminal will be expanded as the RHS sequence:
• p = P(A → β | A)
• ∑_β P(A → β | A) = 1
• That is, the sum of the probabilities associated with all of the rules expanding the non-terminal A is 1 (checked concretely in the sketch below).
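To make the normalisation constraint concrete, here is a minimal sketch of a PCFG as a Python dictionary, with a check that each non-terminal's rule probabilities sum to 1. The grammar itself is a toy invented for illustration, not from the chapter.

# Toy PCFG: each non-terminal maps to a list of
# (right-hand side, probability) pairs.
PCFG = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("DT", "NN"), 0.6), (("PRP",), 0.4)],
    "VP": [(("VBD", "NP"), 0.7), (("VBD",), 0.3)],
}

def check_normalised(grammar, tol=1e-9):
    """Check that, for every non-terminal A,
    the sum over beta of P(A -> beta | A) equals 1."""
    for lhs, rules in grammar.items():
        total = sum(p for _, p in rules)
        assert abs(total - 1.0) < tol, f"{lhs} rules sum to {total}, not 1"

check_normalised(PCFG)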
PCFGs and Disambiguation
• A PCFG assigns a probability to every parse tree or derivation associated with a sentence.
• This probability is the product of the probabilities of the rules applied in building the parse tree:
  P(T, S) = ∏_{i=1}^{n} P(A_i → β_i), where n is the number of rules in T.
• P(T, S) = P(T) P(S|T) = P(S) P(T|S), by definition.
• But P(S|T) = 1, because all the words in S are in T.
• So P(T, S) = P(T).
• A parse disambiguation algorithm picks out the most probable parse tree for a sentence (sketched below):
  T(S) = argmax P(T|S) s.t. S = yield(T)
• P(T|S) = P(T, S) / P(S)
• P(S) is constant over all the trees for S, so:
  T(S) = argmax P(T, S) s.t. S = yield(T)
  T(S) = argmax P(T) s.t. S = yield(T)
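As a hedged sketch of this computation, using the toy PCFG above and encoding trees as nested tuples of the form (label, child, ...): rule_prob and the tree encoding are conventions of this sketch, not the chapter's notation.

import math

def rule_prob(grammar, lhs, rhs):
    # Look up P(lhs -> rhs | lhs) in the dictionary grammar sketched above.
    return dict(grammar[lhs])[rhs]

def tree_log_prob(grammar, tree):
    """log P(T): sum of log rule probabilities over the internal nodes.
    Log space avoids numerical underflow on long derivations."""
    label, children = tree[0], tree[1:]
    if all(isinstance(c, str) for c in children):
        return 0.0          # preterminal; lexical rule probs omitted here
    rhs = tuple(c[0] for c in children)
    return math.log(rule_prob(grammar, label, rhs)) + sum(
        tree_log_prob(grammar, c) for c in children)

def best_tree(grammar, candidate_trees):
    """T(S) = argmax over the candidate parses of P(T)."""
    return max(candidate_trees, key=lambda t: tree_log_prob(grammar, t))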
PCFGs and Language Modelling
As well as assigning probabilities to parse trees, a PCFG assigns a probability to every sentence generated by the grammar. This is useful for language modelling.
• The probability of a sentence is the sum of the probabilities of each parse tree associated with the sentence:
  P(S) = ∑_{T s.t. yield(T)=S} P(T, S)
  P(S) = ∑_{T s.t. yield(T)=S} P(T)
When is it useful to know the probability of a sentence?
When ranking the output of speech recognition, machine translation and error correction systems.
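A corresponding sketch of P(S), reusing tree_log_prob from above. candidate_trees is assumed to come from an exhaustive chart parser; in practice the inside algorithm computes this sum in polynomial time without enumerating the trees.

import math

def sentence_prob(grammar, candidate_trees):
    # P(S) = sum over all trees T with yield(T) = S of P(T).
    return sum(math.exp(tree_log_prob(grammar, t)) for t in candidate_trees)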
Probabilistic CKY
Probabilistic versions of most parsing algorithms exist.
Many probabilistic parsers use a probabilistic version of the CKY bottom-up chart parsing algorithm (see Chap. 13).
Given a sentence s of length n and a grammar with V non-terminal symbols:
• Normal CKY: a 2-d (n + 1) × (n + 1) array, where the value in cell (i, j) is a list of the non-terminals spanning positions i through j in s
• Probabilistic CKY: a 3-d (n + 1) × (n + 1) × V array, where the value in cell (i, j, K) is the probability of the non-terminal K spanning positions i through j in s
As with regular CKY, probabilistic CKY assumes that the grammar is in Chomsky normal form (see the sketch below). Note that the rule probabilities also need to be adjusted when transforming a non-CNF PCFG into a CNF PCFG.
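A compact sketch of the probabilistic CKY recursion for a CNF grammar. The index structures assumed here (binary[(B, C)] = [(A, p), ...] for binary rules, lexical[word] = [(A, p), ...] for lexical rules) are choices of this sketch, not the book's pseudocode.

from collections import defaultdict

def prob_cky(words, lexical, binary):
    n = len(words)
    # table[i][j][A] = probability of the best A spanning positions i..j
    table = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    back = [[dict() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                    # fill length-1 spans
        for A, p in lexical.get(w, []):
            table[i][i + 1][A] = p
    for length in range(2, n + 1):                   # then longer spans
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):                # every split point
                for B, pb in table[i][k].items():
                    for C, pc in table[k][j].items():
                        for A, p in binary.get((B, C), []):
                            cand = p * pb * pc       # rule prob * children
                            if cand > table[i][j][A]:
                                table[i][j][A] = cand
                                back[i][j][A] = (k, B, C)
    return table, back   # best parse probability: table[0][n]["S"]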
Obtaining Probabilistic Grammars
In a PCFG every rule is associated with a probability. But where do these rule probabilities come from?
There are two ways to obtain rule probabilities for a PCFG:
1. Use a treebank (sketched below):
   P(A → β | A) = count(A → β) / count(A)
2. Use a corpus of sentences and the Inside-Outside algorithm (Baker, 1979):
   2.1 Take a CFG and set all rules to have equal probability
   2.2 Parse the corpus with the CFG
   2.3 Adjust the probabilities
   2.4 Repeat steps 2.2 and 2.3 until the probabilities converge
The Inside-Outside algorithm is a type of Expectation Maximisation algorithm. It can also be used to induce a grammar, but only with limited success.
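The treebank route is simple enough to sketch directly: count rule and left-hand-side occurrences over a collection of trees (using the nested-tuple tree encoding assumed earlier) and divide.

from collections import Counter

def count_rules(tree, rule_counts, lhs_counts):
    label, children = tree[0], tree[1:]
    if all(isinstance(c, str) for c in children):
        rhs = children                       # lexical rule A -> w
    else:
        rhs = tuple(c[0] for c in children)
        for c in children:
            count_rules(c, rule_counts, lhs_counts)
    rule_counts[(label, rhs)] += 1
    lhs_counts[label] += 1

def estimate_pcfg(treebank):
    """P(A -> beta | A) = count(A -> beta) / count(A)."""
    rule_counts, lhs_counts = Counter(), Counter()
    for tree in treebank:
        count_rules(tree, rule_counts, lhs_counts)
    return {(A, rhs): c / lhs_counts[A] for (A, rhs), c in rule_counts.items()}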
Limitations of PCFG Parsing
Two well-known drawbacks of PCFG parsing are:
1. the independence of the rules in a PCFG
2. their failure to fully exploit lexical knowledge in resolving ambiguities
PCFG Rule Independence
In PCFG parsing, the application of a rule in a derivation is an independent event.
This means that previous rule applications in the same derivation have no influence over the current rule application.
• This rule independence can be a disadvantage.
• Consider, for example, that an English subject NP is extremely likely to be realised as a pronoun, whereas an English object NP is more likely to be realised as a non-pronominal noun phrase.
• This cannot be reflected in a PCFG, which does not distinguish object NPs from subject NPs (illustrated below).
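A small numerical illustration of the point. The counts are invented, loosely echoing the subject/object pronoun asymmetry the chapter cites: a single shared P(NP → PRP) averages away a difference that the grammar would need two distinct NP symbols to express.

from collections import Counter

# Invented counts for illustration only.
subj = Counter({("NP", "PRP"): 91, ("NP", "DT NN"): 9})    # subject NPs
obj  = Counter({("NP", "PRP"): 34, ("NP", "DT NN"): 66})   # object NPs

merged = subj + obj
print(subj[("NP", "PRP")] / sum(subj.values()))      # 0.91 in subject position
print(obj[("NP", "PRP")] / sum(obj.values()))        # 0.34 in object position
print(merged[("NP", "PRP")] / sum(merged.values()))  # one blended PCFG estimate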
Using Lexical Knowledge to Resolve Ambiguity
• Consider, for example, the sentences
  Workers dumped sacks into a bin
  Fishermen caught tons of herring
• The problem of where to attach a PP is a common form of syntactic ambiguity.
• A main motivation for probabilistic parsing is to perform accurate disambiguation.
• People resolve this kind of ambiguity by looking at the actual nouns, verbs and prepositions involved (see the sketch below).
• Faced with a choice between two attachment sites, VP and NP, for a constituent PP, a PCFG can only compare the rule probabilities of
  1. VP → VP PP versus
  2. NP → NP PP
  These rules contain no lexical information.
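As a hedged sketch of how lexical information could help: compare verb-preposition against noun-preposition association counts. The counts and the decision rule here are invented for illustration; real lexicalised models instead condition rule probabilities on head words, as described later.

# Invented co-occurrence counts for illustration.
attach_counts = {
    ("dumped", "into"): 25, ("sacks", "into"): 2,
    ("caught", "of"): 1,    ("tons", "of"): 40,
}

def prefer_vp_attachment(verb, noun, prep):
    """True if the (verb, prep) association beats the (noun, prep) one."""
    return attach_counts.get((verb, prep), 0) > attach_counts.get((noun, prep), 0)

print(prefer_vp_attachment("dumped", "sacks", "into"))   # True: attach to VP
print(prefer_vp_attachment("caught", "tons", "of"))      # False: attach to NP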
Generative, History-Based, Lexicalised Parsers
Generative, history-based, lexicalised parsers overcome the limits of PCFGs by employing:
1. lexicalisation
2. more complicated probabilistic models
Two examples are:
1. the Collins parser
2. the Charniak parser
Lexicalised PCFGs
A PCFG can be lexicalised by associating a word and a part-of-speech tag with every non-terminal in the grammar.
It is head-lexicalised if the word is the head of the constituent described by the non-terminal.
Lexicalised PCFGs
A non-lexicalised parse tree for the sentence Last week IBM bought Lotus:

(S (NP (JJ Last) (NN week))
   (NP (NNP IBM))
   (VP (VBD bought)
       (NP (NNP Lotus))))
Lexicalised PCFGs
A lexicalised parse tree for the sentence Last week IBM bought Lotus:

(S(bought,VBD)
   (NP(week,NN) (JJ(Last,JJ) Last) (NN(week,NN) week))
   (NP(IBM,NNP) (NNP(IBM,NNP) IBM))
   (VP(bought,VBD) (VBD(bought,VBD) bought)
      (NP(Lotus,NNP) (NNP(Lotus,NNP) Lotus))))
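This annotation can be produced mechanically. Below is a minimal Python sketch, assuming a toy set of head-finding rules; real parsers use head-percolation tables in the style of Magerman and Collins.

```python
# A minimal sketch of head-lexicalisation with toy head-finding rules.
HEAD_RULES = {"S": ["VP"], "VP": ["VBD", "VB"], "NP": ["NN", "NNP"]}

def lexicalise(tree):
    """tree is (tag, word) at pre-terminals and (label, [children]) above.
    Returns (annotated_tree, (head_word, head_tag))."""
    label, rest = tree
    if isinstance(rest, str):                     # pre-terminal: its own head
        return (f"{label}({rest},{label})", rest), (rest, label)
    results = [lexicalise(child) for child in rest]
    kids = [r[0] for r in results]
    heads = [r[1] for r in results]
    wanted = HEAD_RULES.get(label, [])
    # take the head of the first child whose label is a licensed head;
    # fall back to the leftmost child otherwise
    hw, ht = next((h for (l, _), h in zip(rest, heads) if l in wanted),
                  heads[0])
    return (f"{label}({hw},{ht})", kids), (hw, ht)

tree = ("S", [("NP", [("JJ", "Last"), ("NN", "week")]),
              ("NP", [("NNP", "IBM")]),
              ("VP", [("VBD", "bought"), ("NP", [("NNP", "Lotus")])])])
print(lexicalise(tree)[0][0])   # S(bought,VBD)
```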
Lexicalised PCFGs
Lexicalisation drastically increases the size of the grammar, and this leads to a sparse data problem.
I How do we estimate the probability of the rule S(bought,VBD) → NP(week,NN) NP(IBM,NNP) VP(bought,VBD)?
I Maximum-likelihood estimation won't work (see the sketch after this list):
$\frac{count(S(bought,VBD) \rightarrow NP(week,NN)\ NP(IBM,NNP)\ VP(bought,VBD))}{count(S(bought,VBD))}$
because the full lexicalised rule will occur too rarely, if at all, in the training treebank.
I The sparse data problem can be tackled by breaking the probability estimation for the right-hand side of the rule into three parts:
1. predict the probability of the head constituent, in this case VP(bought,VBD)
2. predict the probability of the constituents on the left of the head, in this case NP(IBM,NNP) and NP(week,NN)
3. predict the probability of the constituents on the right of the head, in this case none
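The failure mode of naive MLE is easy to see in code. A minimal sketch with hypothetical counts:

```python
# A minimal sketch with hypothetical counts showing why naive MLE fails
# for lexicalised rules: the full rule almost never recurs in a treebank.
from collections import Counter

rule_count = Counter()   # complete lexicalised rules
lhs_count = Counter()    # lexicalised left-hand sides

def observe(lhs, rhs):
    rule_count[(lhs, rhs)] += 1
    lhs_count[lhs] += 1

def mle(lhs, rhs):
    return rule_count[(lhs, rhs)] / lhs_count[lhs] if lhs_count[lhs] else 0.0

rule = ("S(bought,VBD)",
        ("NP(week,NN)", "NP(IBM,NNP)", "VP(bought,VBD)"))
observe(*rule)
# After a single observation the estimate is a useless 1.0, and every
# unseen lexicalised rule -- i.e. almost all of them -- gets 0.0:
print(mle(*rule))  # 1.0
```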
The Collins Parser
Model 1
Given a rule of the form $LHS \rightarrow L_n L_{n-1} \ldots L_1\ H\ R_1 \ldots R_{n-1} R_n$:
1. Generate the head of the phrase, $H(hw, ht)$, with probability $P_H(H(hw,ht) \mid LHS, hw, ht)$
2. Generate the modifiers to the left of the head with total probability
$\prod_{i=1}^{n+1} P_L(L_i(lw_i, lt_i) \mid LHS, H, hw, ht)$
such that $L_{n+1}(lw_{n+1}, lt_{n+1}) = STOP$; we stop generating once we have generated a STOP token.
3. Generate the modifiers to the right of the head with total probability
$\prod_{i=1}^{n+1} P_R(R_i(rw_i, rt_i) \mid LHS, H, hw, ht)$
such that $R_{n+1}(rw_{n+1}, rt_{n+1}) = STOP$; we stop generating once we have generated a STOP token.
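A minimal Python sketch of this decomposition, assuming the probability tables P_H, P_L and P_R are supplied as dictionaries (in the real parser these are smoothed estimates, as described later):

```python
# A minimal sketch of the Collins Model 1 decomposition. P_H, P_L and P_R
# are assumed to be dictionaries of (smoothed) conditional probabilities.
STOP = ("STOP", None, None)

def rule_probability(lhs, head, lefts, rights, P_H, P_L, P_R):
    """lhs = (label, hw, ht); head = (H, hw, ht); lefts/rights are the
    modifier constituents, each a (label, word, tag) triple."""
    label, hw, ht = lhs
    p = P_H[(head, label, hw, ht)]                 # 1. generate the head
    for mod in lefts + [STOP]:                     # 2. left modifiers, then STOP
        p *= P_L[(mod, label, head[0], hw, ht)]
    for mod in rights + [STOP]:                    # 3. right modifiers, then STOP
        p *= P_R[(mod, label, head[0], hw, ht)]
    return p
```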
The Collins Parser
Model 1 Example
I We want to calculate $P(S(bought,VBD) \rightarrow NP(week,NN)\ NP(IBM,NNP)\ VP(bought,VBD))$
I Work out the following probabilities and multiply the results (together with the STOP probabilities on each side):
1. $P_H(VP(bought,VBD) \mid S, bought, VBD)$
2. $P_L(NP(IBM,NNP) \mid S, VP, bought, VBD)$
3. $P_L(NP(week,NN) \mid S, VP, bought, VBD)$
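Continuing the sketch above with purely illustrative numbers (every probability value below is made up):

```python
# Hypothetical probability tables for the worked example.
P_H = {(("VP", "bought", "VBD"), "S", "bought", "VBD"): 0.30}
P_L = {(("NP", "IBM", "NNP"), "S", "VP", "bought", "VBD"): 0.20,
       (("NP", "week", "NN"), "S", "VP", "bought", "VBD"): 0.05,
       (STOP, "S", "VP", "bought", "VBD"): 0.60}
P_R = {(STOP, "S", "VP", "bought", "VBD"): 0.90}

p = rule_probability(("S", "bought", "VBD"), ("VP", "bought", "VBD"),
                     [("NP", "IBM", "NNP"), ("NP", "week", "NN")], [],
                     P_H, P_L, P_R)
print(p)   # 0.30 * 0.20 * 0.05 * 0.60 * 0.90 = 0.00162
```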
The Collins Parser
I In Model 1, a distance function is also included in the conditioning information for the left and right modifiers.
I This measures the number of words between the current modifier and the head.
I It has the effect of preferring right-branching structures and dispreferring dependencies which cross a verb.
I Other linguistically motivated refinements:
1. a distinction between recursive and base NPs
2. special features for coordination and punctuation
I Model 2 incorporates verb subcategorisation information.
I Model 3 incorporates long-distance dependency information.
The Collins Parser
Smoothing
I Smoothing is carried out using a linear interpolation of three models (sketched below):
$\lambda_1 e_1 + (1 - \lambda_1)(\lambda_2 e_2 + (1 - \lambda_2) e_3)$
I $e_1$ is the MLE of the fully lexicalised model, $e_2$ the MLE of the model which omits the head word from the conditioning information, and $e_3$ the MLE of the model which omits both the head word and the head tag
I The weights are set using Witten-Bell discounting
Accuracy
I Achieves Parseval f-scores of approximately 88% on WSJ section 23
I Models 2 and 3 are slightly more accurate than Model 1
Parsing Algorithm
I A version of probabilistic CKY
I The complexity of lexicalised CFG chart parsing is $O(n^4)$ or $O(n^5)$
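A minimal sketch of the interpolation; the lambda values here are made up (Collins derives them from Witten-Bell statistics rather than fixing them by hand):

```python
# A minimal sketch of the three-way linear interpolation.
def interpolate(e1, e2, e3, lam1, lam2):
    """e1: fully lexicalised MLE; e2: MLE omitting the head word;
    e3: MLE omitting both the head word and the head tag."""
    return lam1 * e1 + (1 - lam1) * (lam2 * e2 + (1 - lam2) * e3)

# An event unseen in the fully lexicalised model still gets probability
# mass from the coarser back-off estimates:
print(interpolate(e1=0.0, e2=0.15, e3=0.08, lam1=0.4, lam2=0.7))  # 0.0774
```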
The Charniak Parser
Main differences between the Charniak parser and the Collins parser:
1. It breaks down the probability estimation in a similar way to the Collins parser but uses more conditioning information:
1.1 When estimating the probability of a rule A → β, it includes the parent node of A in the conditioning information. This is an important feature.
1.2 It also includes the grandparent and sister nodes of A.
1.3 It uses a third-order Markov grammar: a modifier constituent in β is conditioned on the head constituent (as in the Collins parser) and on the previous two left or right modifiers.
2. It uses the suffix of an unknown word to guess its part-of-speech (sketched below).
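A minimal sketch of the unknown-word idea; the suffix table here is a toy assumption, since Charniak's parser estimates suffix statistics from its training data rather than hard-coding them:

```python
# A minimal sketch of suffix-based POS guessing for unknown words.
SUFFIX_TAGS = [("tion", "NN"), ("ing", "VBG"), ("ly", "RB"),
               ("ed", "VBD"), ("s", "NNS")]

def guess_tag(word, default="NN"):
    for suffix, tag in SUFFIX_TAGS:
        if word.lower().endswith(suffix):
            return tag
    return default

print(guess_tag("flurbling"))  # VBG -- never seen, but the suffix helps
print(guess_tag("zorp"))       # NN  -- no informative suffix, use the default
```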
Unlexicalised Parsing
I History-based lexicalised parsers achieve f-scores in the range 87-89% on WSJ section 23
I Klein and Manning (2003) show that it is possible to achieve an f-score of approximately 86% without lexicalisation:
I A node in a parse tree is annotated with its parent node (parent annotation, sketched below)
I This means that a subject noun phrase, whose parent is S, is annotated as NP^S, while a direct object noun phrase, whose parent is VP, is annotated as NP^VP
I Other transformations can also be carried out, e.g. splitting pre-terminal categories into more fine-grained ones
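A minimal sketch of parent annotation, reusing the tree encoding from the head-lexicalisation sketch earlier:

```python
# A minimal sketch of parent annotation: (tag, word) at pre-terminals,
# (label, [children]) above.
def parent_annotate(tree, parent=None):
    label, rest = tree
    if isinstance(rest, str):                  # leave pre-terminals unchanged
        return (label, rest)
    new_label = f"{label}^{parent}" if parent else label
    return (new_label, [parent_annotate(child, label) for child in rest])

tree = ("S", [("NP", [("NNP", "IBM")]),
              ("VP", [("VBD", "bought"), ("NP", [("NNP", "Lotus")])])])
print(parent_annotate(tree))
# ('S', [('NP^S', ...), ('VP^S', [('VBD', 'bought'), ('NP^VP', ...)])])
```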
Discriminative versus Generative Statistical Parsing
Generative Statistical Parsing
The probabilistic model is based on the generative derivation of a sentence. The model gives us the probability of a sentence by summing the probabilities of its derivations.
Discriminative Statistical Parsing
More flexible probability models which can incorporate information from a variety of sources:
1. Global facts about tree structure
2. Structure of previous sentences
3. Text genre
4. Facts about the speaker (e.g. gender)
Two Types of Discriminative Parsing
1. Discriminative Reranking
2. Discriminative Dynamic Programming
Discriminative Reranking
I The generative parser outputs an n-best list of parse trees
I Features are extracted from the parse trees:
  I parse probabilities
  I the CFG rules used
  I structural properties of the trees (e.g. degree of parallelism)
I Log-linear models are often used (see the sketch below)
I The Charniak and Johnson (2005) reranker boosts the f-score by 2 percentage points to 91.3%
I Disadvantage: the final result depends on the quality of the n-best list
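A minimal sketch of log-linear reranking, assuming a trained weight vector and a feature-extraction function (both hypothetical here); note that the generative parse probability is itself just one feature among many:

```python
# A minimal sketch of n-best reranking with a log-linear model.
def rerank(nbest, extract_features, weights):
    """nbest: list of (tree, log_prob) pairs from the generative parser.
    extract_features maps a tree to a {feature_name: value} dict;
    weights is the trained {feature_name: weight} dict."""
    def score(tree, log_prob):
        feats = extract_features(tree)
        feats["generative_log_prob"] = log_prob   # parser score is a feature
        return sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return max(nbest, key=lambda pair: score(*pair))[0]
```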
Discriminative Dynamic Programming
I All parses are stored (compactly) in the chart
I The best parse is then selected using a discriminative probabilistic model
I Example: CCG parser (Clark and Curran, 2004)
Dependency Parsing
Dependency parsers return a different type of structural analysis from phrase structure parsers: they return dependency graphs, in which
I the nodes are the words in the sentence
I the arcs are the dependency relationships between the words
Phrase structure is not explicitly represented.
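Because every word has exactly one head, a dependency graph can be stored as a flat array. A minimal sketch for IBM bought Lotus:

```python
# A minimal sketch: one head index and one label per word (0 = artificial root).
words  = ["IBM", "bought", "Lotus"]
heads  = [2, 0, 2]               # IBM <- bought, bought <- root, Lotus <- bought
labels = ["subj", "root", "obj"]

for i, (word, head, label) in enumerate(zip(words, heads, labels), start=1):
    head_word = "ROOT" if head == 0 else words[head - 1]
    print(f"{word} --{label}--> {head_word}")
```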
Dependency Parsing
We distinguish labelled and unlabelled dependency graphs.
I In a labelled dependency graph, the nature of each dependency relationship is specified.
I Typical dependency relationships: subj, obj, nmod, vmod
We distinguish projective and non-projective dependency graphs.
I Non-projectivity is identifiable with crossing dependencies (see the sketch below).
I Non-projective graphs are used for representing certain types of long-distance dependencies, e.g. What did economic news have little effect on?
I Most dependency-based linguistic theories allow non-projectivity.
I Most dependency-based parsing systems assume projectivity.
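Crossing arcs are easy to test for. A minimal sketch over the head-array encoding introduced above:

```python
# A minimal sketch: a dependency graph is non-projective exactly when two
# of its arcs cross, which reduces to a simple interval test.
def is_projective(heads):
    """heads[i] is the head (1-based) of word i+1; 0 marks the root."""
    arcs = [tuple(sorted((i + 1, h))) for i, h in enumerate(heads)]
    return not any(a < c < b < d
                   for a, b in arcs for c, d in arcs)

print(is_projective([2, 0, 2]))     # True: "IBM bought Lotus" from above
print(is_projective([3, 0, 2, 2]))  # False: a toy graph with crossing arcs
```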
Probabilistic Dependency Parsing
Broadly speaking, there are two types of probabilistic dependency parsers:
1. transition-based systems
2. graph-based systems
Transition-based systems
I The probability model predicts the next action of the parser
I Deterministic shift-reduce parsing (see the sketch below)
I Example: the Malt parser (Nivre et al. 2006)
Graph-based systems
I The probability model selects the entire graph for a sentence from the set of all possible graphs
I Example: the MST parser (McDonald et al. 2006)
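A minimal sketch of the transition-based idea, in arc-standard style (one of several transition systems used in practice); the function predict stands in for the trained classifier and is assumed to return a valid action at every step:

```python
# A minimal sketch of arc-standard shift-reduce dependency parsing.
def parse(words, predict):
    stack = [0]                                   # 0 is the artificial root
    buffer = list(range(1, len(words) + 1))
    arcs = []                                     # (head, dependent) pairs
    while buffer or len(stack) > 1:
        action = predict(stack, buffer)           # "SHIFT", "LEFT" or "RIGHT"
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT":                    # top is head of second-top
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        else:                                     # second-top is head of top
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
    return arcs
```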
Conclusion
"But while there is room for improvement I believe that it is now or will soon be time to stop working on improving labeled precision/recall per se... At some point we should move on." (Charniak, 1997)
What now?
1. Domain adaptation
2. Task-based evaluation
3. Semantic parsing
Charniak, E.: Statistical techniques for natural language parsing. AI Magazine 18(4) (1997)
Charniak, E.: A maximum-entropy-inspired parser. In: Proceedings of the Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-00), pp. 132–139, Seattle, Washington (2000)
Collins, M.: Head-driven statistical models for natural language parsing. Computational Linguistics 29(4), 589–637 (2003)
Klein, D., Manning, C.: Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting of the ACL (2003)
McDonald, R., Lerman, K., Pereira, F.: Multilingual dependency analysis with a two-stage discriminative parser. In: Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL) (2006)
Nivre, J., Hall, J., Nilsson, J., Eryigit, G., Marinov, S.: Labeled pseudo-projective dependency parsing with support vector machines. In: Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL) (2006)