CS626-460: Language Technology for the Web/Natural Language Processing
Pushpak Bhattacharyya
CSE Dept., IIT Bombay
Constituent Parsing and Algorithms (with major contributions from Dr. Rajat Mohanty)
Syntax
Syntax is the study of the combination of words into phrases, clauses and sentences.
Syntax describes how sentences and their constituents are structured.
Grammar
A finite set of rules
• that generates all and only the sentences of a language;
• that assigns an appropriate structural description to each one.
Grammatical Analysis Techniques
Two main devices:
• Breaking up a String: Sequential, Hierarchical, Transformational
• Labeling the Constituents: Morphological, Categorial, Functional
Hierarchical Breaking up and Categorial Labeling
Poor John ran away.
[S [NP [A Poor] [N John]] [VP [V ran] [Adv away]]]
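The hierarchical breaking up with categorial labels can be sketched as a small data structure. The nested-tuple encoding and the helper names `leaves` and `bracketed` below are assumptions made for illustration, not something the slides prescribe.

```python
# A constituent tree as nested tuples: (label, child, child, ...).
# Leaves are plain strings (the words). This encoding is an assumption.
tree = ("S",
        ("NP", ("A", "Poor"), ("N", "John")),
        ("VP", ("V", "ran"), ("Adv", "away")))


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def bracketed(node):
    """Render the tree in labeled-bracket notation."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "[" + label + " " + " ".join(bracketed(c) for c in children) + "]"


print(" ".join(leaves(tree)))
print(bracketed(tree))
```

Reading the leaves left to right gives back the sentence, while the brackets record which words group into which constituent.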
Hierarchical Breaking up and Functional Labeling
Immediate Constituent (IC) Analysis
Construction types in terms of the function of the constituents:
• Predication (subject + predicate)
• Modification (modifier + head)
• Complementation (verbal + complement)
• Subordination (subordinator + dependent unit)
• Coordination (independent unit + coordinator)
An Example
In the morning, the sky looked much brighter.
[Tree diagram: IC analysis of the sentence with functional labels. Labels recoverable from the slide: S; Modifier, Head; Subordinator, DU (dependent unit); Subject, Predicate; Verbal, Complement.]
Noun Phrases
John: [NP [N John]]
the student: [NP [Det the] [N student]]
the intelligent student: [NP [Det the] [AdjP intelligent] [N student]]
Phrases
Noun Phrase
his first five PhD students: [NP [Det his] [Ord first] [Quant five] [N PhD] [N students]]
Noun Phrase
the five best students of my class: [NP [Det the] [Quant five] [AP best] [N students] [PP of my class]]
Verb Phrases
can sing: [VP [Aux can] [V sing]]
can hit the ball: [VP [Aux can] [V hit] [NP the ball]]
Verb Phrase
can give a flower to Mary: [VP [Aux can] [V give] [NP a flower] [PP to Mary]]
Verb Phrase
may make John the chairman: [VP [Aux may] [V make] [NP John] [NP the chairman]]
Verb Phrase
may find the book very interesting: [VP [Aux may] [V find] [NP the book] [AP very interesting]]
Prepositional Phrases
in the classroom: [PP [P in] [NP the classroom]]
near the river: [PP [P near] [NP the river]]
Adjective Phrases
intelligent: [AP [A intelligent]]
very honest: [AP [Degree very] [A honest]]
fond of sweets: [AP [A fond] [PP of sweets]]
Adjective Phrase
• very worried that she might have done badly in the assignment
[AP [Degree very] [A worried] [S’ that she might have done badly in the assignment]]
A segment of English Grammar
S’ → (C) S
S → {NP/S’} VP
VP → (AP+) V (AP+) ({NP/S’}) (AP+) (PP+) (AP+)
NP → (D) (AP+) N (PP+)
PP → P NP
AP → (AP) A
PSG Parse Tree
John wrote those words in the Book of Proverbs.
[S [NP [PropN John]]
   [VP [V wrote]
       [NP those words]
       [PP [P in]
           [NP [NP the book] [PP of proverbs]]]]]
Penn Treebank
(S (NP-SBJ (NP John))
   (VP wrote
       (NP those words)
       (PP-LOC in
               (NP (NP-TTL (NP the Book)
                           (PP of (NP Proverbs)))))))
John wrote those words in the Book of Proverbs.
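Bracketed annotations of this kind are easy to read back into a nested structure. The following is a minimal sketch, assuming labels and words are separated by whitespace and parentheses; the function name `parse_brackets` and the shortened example string are illustrative, not part of any Treebank tool.

```python
def parse_brackets(text):
    """Parse a Penn-Treebank-style bracketed string into nested lists.

    Each constituent becomes [label, child, child, ...]; words stay strings.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        node = []
        while True:
            if pos >= len(tokens):
                raise ValueError("unbalanced brackets")
            token = tokens[pos]
            if token == ")":
                pos += 1
                return node
            if token == "(":
                node.append(read())   # nested constituent
            else:
                node.append(token)    # label or word
                pos += 1

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


example = "(S (NP-SBJ (NP John)) (VP wrote (NP those words)))"
print(parse_brackets(example))
```

A string with a missing closing parenthesis raises `ValueError`, which makes the sketch usable as a quick balance check on hand-typed annotations.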
PSG Parse Tree
Official trading in the shares will start in Paris on Nov 6.
[S [NP [AP [A official]]
       [N trading]
       [PP [P in] [NP the shares]]]
   [VP [Aux will]
       [V start]
       [PP in Paris]
       [PP on Nov 6]]]
Penn POS Tags
[ Official/JJ trading/NN ] in/IN [ the/DT shares/NNS ] will/MD start/VB in/IN [ Paris/NNP ] on/IN [ Nov./NNP 6/CD ]
Official trading in the shares will start in Paris on Nov 6.
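The tagged line can be split mechanically into (word, tag) pairs. The sketch below assumes that every token has the form word/TAG and that the square brackets group chunks (they appear to mark base noun phrases); the function name `read_tagged` is hypothetical.

```python
def read_tagged(text):
    """Split a bracket-chunked 'word/TAG' string into pairs and chunks."""
    pairs, chunks, current = [], [], None
    for token in text.split():
        if token == "[":
            current = []              # a bracketed chunk opens
        elif token == "]":
            chunks.append(current)    # the chunk closes
            current = None
        else:
            # rpartition splits on the last '/', so a word containing '/' survives
            word, _, tag = token.rpartition("/")
            pairs.append((word, tag))
            if current is not None:
                current.append(word)
    return pairs, chunks


line = ("[ Official/JJ trading/NN ] in/IN [ the/DT shares/NNS ] will/MD "
        "start/VB in/IN [ Paris/NNP ] on/IN [ Nov./NNP 6/CD ]")
pairs, chunks = read_tagged(line)
print(pairs[:3])
print(chunks)
```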
Penn Treebank
( (S (NP-SBJ (NP Official trading)
             (PP in (NP the shares)))
     (VP will
         (VP start
             (PP-LOC in (NP Paris))
             (PP-TMP on (NP (NP Nov 6)))))))
Official trading in the shares will start in Paris on Nov 6.
Penn POS Tag Set
Adjective: JJ
Adverb: RB
Cardinal Number: CD
Determiner: DT
Preposition: IN
Coordinating Conjunction: CC
Subordinating Conjunction: IN
Singular Noun: NN
Plural Noun: NNS
Personal Pronoun: PRP
Proper Noun (singular): NNP
Verb base form: VB
Modal verb: MD
Verb (3sg Pres): VBZ
Wh-determiner: WDT
Wh-pronoun: WP
Basic Parsing Strategy
A Fragment of English Grammar
S → NP VP
VP → V NP
NP → NNP | ART N
NNP → Ram
V → ate | saw
ART → a | an | the
N → rice | apple | movie
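Because this fragment has no recursive rules, the set of sentences it generates is finite and can be enumerated. The sketch below encodes the fragment as a Python dictionary; the names `GRAMMAR` and `expand` are illustrative assumptions.

```python
from itertools import product

# The grammar fragment: each symbol maps to its alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"]],
    "NP": [["NNP"], ["ART", "N"]],
    "NNP": [["Ram"]],
    "V": [["ate"], ["saw"]],
    "ART": [["a"], ["an"], ["the"]],
    "N": [["rice"], ["apple"], ["movie"]],
}


def expand(symbol):
    """Yield every word sequence derivable from symbol.

    Terminates only because the fragment has no recursive rules.
    """
    if symbol not in GRAMMAR:        # a word: nothing left to rewrite
        yield [symbol]
        return
    for rhs in GRAMMAR[symbol]:
        # combine every expansion of each right-hand-side symbol
        for parts in product(*(list(expand(s)) for s in rhs)):
            yield [word for part in parts for word in part]


sentences = [" ".join(words) for words in expand("S")]
print(len(sentences))
print("Ram ate the rice" in sentences)
```

The fragment also generates strings such as "Ram ate a apple", since nothing in the rules enforces the choice between "a" and "an".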
Derivation
S => NP VP (rewrite S)
=> NNP VP (rewrite NP)
=> Ram VP (rewrite NNP)
=> Ram V NP (rewrite VP)
=> Ram ate NP (rewrite V)
=> Ram ate ART N (rewrite NP)
=> Ram ate the N (rewrite ART)
=> Ram ate the rice (rewrite N)
Multiple choice points
• S is a special symbol called the start symbol.
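The derivation above always rewrites the leftmost non-terminal, so it can be replayed mechanically once the rule chosen at each step is given. This is a minimal sketch; `NONTERMINALS`, `leftmost_derivation` and the `choices` list are names introduced here for illustration.

```python
NONTERMINALS = {"S", "NP", "VP", "NNP", "V", "ART", "N"}


def leftmost_derivation(start, choices):
    """Apply (lhs, rhs) rule choices to the leftmost non-terminal, in order."""
    form = [start]
    steps = [" ".join(form)]
    for lhs, rhs in choices:
        # index of the leftmost symbol that can still be rewritten
        i = next(k for k, sym in enumerate(form) if sym in NONTERMINALS)
        if form[i] != lhs:
            raise ValueError(f"leftmost non-terminal is {form[i]}, not {lhs}")
        form[i:i + 1] = rhs
        steps.append(" ".join(form))
    return steps


# The rule chosen at each choice point of the derivation of "Ram ate the rice".
choices = [
    ("S", ["NP", "VP"]),
    ("NP", ["NNP"]),
    ("NNP", ["Ram"]),
    ("VP", ["V", "NP"]),
    ("V", ["ate"]),
    ("NP", ["ART", "N"]),
    ("ART", ["the"]),
    ("N", ["rice"]),
]
steps = leftmost_derivation("S", choices)
print("\n=> ".join(steps))
```

Picking a different alternative at any choice point (for example `("NP", ["ART", "N"])` first) yields a different sentence of the fragment.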
Two Strategies: Top-Down & Bottom-Up
Top-down: Start with S and generate the sentence.
Bottom-up: Start with the words in the sentence and use the rewrite rules backwards to reduce the sequence of symbols to produce S.
The derivation above showed the top-down strategy.
Bottom-Up Derivation
Ram ate the rice
=> NNP ate the rice (rewrite Ram)
=> NNP V the rice (rewrite ate)
=> NNP V ART rice (rewrite the)
=> NNP V ART N (rewrite rice)
=> NP V ART N (rewrite NNP)
=> NP V NP (rewrite ART N)
=> NP VP (rewrite V NP)
=> S (rewrite NP VP)
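The bottom-up derivation can be reproduced by using the rewrite rules backwards: first replace each word by its category, then repeatedly reduce the leftmost span that matches a right-hand side. The sketch below is greedy and does no backtracking, which is enough for this fragment but is an assumption that would fail on ambiguous grammars; `LEXICON`, `RULES` and `bottom_up` are illustrative names.

```python
LEXICON = {"Ram": "NNP", "ate": "V", "saw": "V",
           "a": "ART", "an": "ART", "the": "ART",
           "rice": "N", "apple": "N", "movie": "N"}
RULES = [("S", ["NP", "VP"]), ("VP", ["V", "NP"]),
         ("NP", ["NNP"]), ("NP", ["ART", "N"])]


def bottom_up(words):
    """Return the list of sentential forms from the words up to S, or None."""
    form = list(words)
    steps = [" ".join(form)]
    for i, word in enumerate(words):          # rewrite each word to its category
        form[i] = LEXICON[word]
        steps.append(" ".join(form))
    while form != ["S"]:
        for i in range(len(form)):
            match = next(((lhs, rhs) for lhs, rhs in RULES
                          if form[i:i + len(rhs)] == rhs), None)
            if match:
                lhs, rhs = match
                form[i:i + len(rhs)] = [lhs]  # reduce the matched span
                steps.append(" ".join(form))
                break
        else:
            return None                       # no rule applies: stuck
    return steps


steps = bottom_up("Ram ate the rice".split())
print("\n=> ".join(steps))
```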
Parsing Algorithm
A procedure that “searches” through the grammatical rules to find a combination that generates a tree which stands for the structure of the sentence.
Top-Down Parsing (using A*)
DFS on the AND-OR graph
Data structures:
• Open List (OL): Nodes to be expanded
• Closed List (CL): Expanded nodes
• Input List (IL): Words of the sentence to be parsed
• Moving Head (MH): Walks over the IL
Trace of Top-Down Parsing
IL: Ram ate the rice
MH marks the portion of the input consumed so far (first shown at T3).
(* indicates ‘useless’ expansion)

Step                      OL              CL
T0 (Initial Condition)    S               (empty)
T1                        NP VP           S
T2                        NNP ART N VP    S NP
T3                        ART N VP        S NP NNP
T4                        N VP            S NP NNP ART*
T5                        VP              S NP NNP ART* N*
T6                        V NP            S NP NNP ART* N* VP
T7                        NP              S NP NNP ART* N* VP V
T8                        NNP ART N       S NP NNP ART* N* VP V NP
T9                        ART N           S NP NNP ART* N* VP V NP NNP*
T10                       N               S NP NNP ART* N* VP V NP NNP* ART
T11                       (empty)         S NP NNP ART* N* VP V NP NNP* ART N

Successful Termination: OL empty AND MH at the end of IL.
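A depth-first top-down recognizer with the same termination test can be sketched in a few lines. Unlike the trace, which places the alternatives for NP side by side on the OL and marks the unused ones as useless, this sketch keeps each alternative as a separate search state and backtracks; that design choice, and the names `GRAMMAR` and `top_down_parse`, are assumptions made for illustration.

```python
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"]],
    "NP": [["NNP"], ["ART", "N"]],
    "NNP": [["Ram"]],
    "V": [["ate"], ["saw"]],
    "ART": [["a"], ["an"], ["the"]],
    "N": [["rice"], ["apple"], ["movie"]],
}


def top_down_parse(words, grammar=GRAMMAR, start="S"):
    """Depth-first top-down recognition; True if words derive from start.

    Each search state is (open list, moving head). The grammar must not be
    left-recursive, otherwise the search would not terminate.
    """
    stack = [([start], 0)]
    while stack:
        open_list, head = stack.pop()
        if not open_list:
            if head == len(words):    # OL empty AND MH at the end of IL
                return True
            continue                  # input left over: backtrack
        first, rest = open_list[0], open_list[1:]
        if first in grammar:
            # push alternatives in reverse so the first rule is tried first
            for rhs in reversed(grammar[first]):
                stack.append((rhs + rest, head))
        elif head < len(words) and words[head] == first:
            stack.append((rest, head + 1))   # word matched: advance MH
    return False


print(top_down_parse("Ram ate the rice".split()))
print(top_down_parse("Ram ate rice".split()))
```

The second call fails because the fragment has no rule rewriting NP as a bare N.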
Bottom-Up Parsing
Basic idea:
• Refer to words from the lexicon.
• Obtain all POSs for each word.
• Keep combining until S is obtained.
(to be continued)
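One way to make the basic idea concrete is a chart that stores, for every span of the input, all categories that can cover it. The following is a CYK-style sketch and not necessarily the algorithm the course goes on to present; it assumes every rule has one or two symbols on its right-hand side, and `LEXICON`, `UNARY`, `BINARY` and `recognize` are illustrative names.

```python
LEXICON = {"Ram": {"NNP"}, "ate": {"V"}, "saw": {"V"},
           "a": {"ART"}, "an": {"ART"}, "the": {"ART"},
           "rice": {"N"}, "apple": {"N"}, "movie": {"N"}}
UNARY = {"NNP": {"NP"}}                      # NP -> NNP, read backwards
BINARY = {("NP", "VP"): {"S"},               # S  -> NP VP
          ("V", "NP"): {"VP"},               # VP -> V NP
          ("ART", "N"): {"NP"}}              # NP -> ART N


def close(categories):
    """Add every category reachable through unary rules."""
    result, agenda = set(categories), list(categories)
    while agenda:
        cat = agenda.pop()
        for parent in UNARY.get(cat, ()):
            if parent not in result:
                result.add(parent)
                agenda.append(parent)
    return result


def recognize(words):
    """Return True if the words can be combined bottom-up into S."""
    n = len(words)
    chart = {}
    for i, word in enumerate(words):         # all POSs for each word
        chart[i, i + 1] = close(LEXICON.get(word, set()))
    for width in range(2, n + 1):            # keep combining wider spans
        for i in range(n - width + 1):
            j = i + width
            cats = set()
            for k in range(i + 1, j):
                for left in chart[i, k]:
                    for right in chart[k, j]:
                        cats |= BINARY.get((left, right), set())
            chart[i, j] = close(cats)
    return n > 0 and "S" in chart[0, n]


print(recognize("Ram ate the rice".split()))
```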