CSA2050 Introduction to Computational Linguistics Parsing I.

Post on 14-Dec-2015


Apr 2008 -- MR CSA2050 - Parsing I 2

Why Is Syntax Important?

The presidential candidate who was extremely popular smiled broadly.

How many presidential candidates are implied?

1 or >1?

Why Is Syntax Important?

The presidential candidate, who was extremely popular, smiled broadly.

How many presidential candidates are implied?

1 or >1?

Why Is Syntax Important?

The presidential candidate, who was extremely popular, smiled broadly.

The presidential candidate who was extremely popular smiled broadly.

…because the syntactic structure has an important bearing on the meaning

PP Attachment

The policemen saw a burglar with a gun.

The policemen saw a burglar with a telescope.

A PP can modify V or N.

In the first case, it modifies N (the burglar has the gun).

In the second, it modifies V (the seeing is done with the telescope).

PP modifies V

The policemen saw the burglar with a telescope

[S [NP [D The] [N policemen]] [VP [V saw] [NP [D the] [N burglar]] [PP [P with] [NP [D a] [N telescope]]]]]

PP modifies N

The policemen saw a burglar with a gun

[S [NP [D The] [N policemen]] [VP [V saw] [NP [D a] [N burglar] [PP [P with] [NP [D a] [N gun]]]]]]

Issue

In general, how can we determine whether a prepositional phrase modifies the preceding noun or verb?

Knowledge-based approach must encode, for example, that
burglars often have guns
people can see things with a telescope
plus a lot of other things

Statistical approach

PP Attachment – Statistical Approach

The Prepositional Phrase Attachment Corpus, included with NLTK as ppattach, makes it possible for us to study this question systematically.

It is derived from the IBM-Lancaster Treebank of Computer Manuals and the Penn Treebank, and distils only the essential information about PP attachment.

Corpus Example Sentence

Original:

Four of the five surviving workers have asbestos-related diseases, including three with recently diagnosed cancer.

including three with recently diagnosed cancer

versus

including three by adding two and one

Distilled Information in Corpus

Original:

Four of the five surviving workers have asbestos-related diseases, including three with recently diagnosed cancer.

ppattach corpus:

16 including three with cancer N

i/d: 16
head verb: including
head of obj: three
prep: with
head of PP's NP: cancer
N or V: N

Further examples

47830 allow visits between families N
47830 allow visits on peninsula V
42457 acquired interest in firm N
42457 acquired interest in 1986 V

Etc.
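
Records in this distilled form are easy to process. The following is a minimal sketch in plain Python, not NLTK's own corpus reader: the `Attachment` tuple, its field names and the `parse_record` helper are hypothetical names introduced here. It parses the records quoted on these slides and counts, per preposition, how often the PP attaches to N or to V.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Attachment(NamedTuple):
    """One distilled record; the field names are illustrative assumptions."""
    sent_id: str
    verb: str
    noun1: str
    prep: str
    noun2: str
    attachment: str  # "N" or "V"


def parse_record(line: str) -> Attachment:
    """Split a whitespace-separated record into its six fields."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    return Attachment(*fields)


# Records copied from the slides.
RECORDS = [
    "16 including three with cancer N",
    "47830 allow visits between families N",
    "47830 allow visits on peninsula V",
    "42457 acquired interest in firm N",
    "42457 acquired interest in 1986 V",
]

entries = [parse_record(record) for record in RECORDS]

# Count attachment decisions per preposition.
by_prep = defaultdict(Counter)
for entry in entries:
    by_prep[entry.prep][entry.attachment] += 1

for prep, counts in sorted(by_prep.items()):
    print(prep, dict(counts))
```

With only five records the counts say nothing statistically; the point is the shape of the data that a statistical approach would count over.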

Minimal Pair Extraction

NLTK contains primitives that allow us to extract minimal pairs, where we hold NP1, PREP and NP2 constant and get different attachments depending on the verb, e.g.

received (NP offer) (PP from group) V
rejected (NP offer (PP from group)) N

receive x from y
reject x
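
The grouping idea can be sketched without NLTK. In the sketch below the tuple layout `(verb, noun1, prep, noun2, attachment)` and the function name `minimal_pairs` are assumptions made for illustration. The data are the two records from this slide plus two from the earlier examples.

```python
from collections import defaultdict

# (verb, noun1, prep, noun2, attachment); hand-copied from the slides.
ENTRIES = [
    ("received", "offer", "from", "group", "V"),
    ("rejected", "offer", "from", "group", "N"),
    ("allow", "visits", "between", "families", "N"),
    ("allow", "visits", "on", "peninsula", "V"),
]


def minimal_pairs(entries):
    """Map each (noun1, prep, noun2) context that occurs with BOTH
    attachments to the verbs seen with each attachment."""
    table = defaultdict(lambda: defaultdict(set))
    for verb, noun1, prep, noun2, attachment in entries:
        table[(noun1, prep, noun2)][attachment].add(verb)
    return {
        key: {label: sorted(verbs) for label, verbs in by_label.items()}
        for key, by_label in table.items()
        if {"N", "V"} <= set(by_label)
    }


pairs = minimal_pairs(ENTRIES)
for key, by_label in pairs.items():
    print(" ".join(key), "| N:", by_label["N"], "| V:", by_label["V"])
```

Only "offer from group" survives the filter, because it is the only context in this tiny sample that appears with both an N and a V attachment.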

Why Syntactic Structure?

Helps to make explicit how a sentence says who did what to whom:

The fierce dog bit the man

Key idea is to identify noun phrases around the verb:

<noun group> <verb> <noun group>

We can do this in terms of sequences of POS tags, e.g. D JJ* N

But there are limitations to this approach:

The child with a fierce dog bit the man

Here the child is doing the biting, but D JJ* N still immediately precedes “bit”, so “fierce dog” is wrongly taken to be the thing doing the biting.
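
The limitation can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the POS tags are assigned by hand (P for preposition, V for verb), and `subject_guess` is a hypothetical helper that returns the D JJ* N group directly before a verb.

```python
import re

# A noun group D JJ* N that is immediately followed by a verb tag.
NOUN_GROUP_BEFORE_VERB = re.compile(r"\bD(?: JJ)* N\b(?= V\b)")


def subject_guess(tagged):
    """Return the words of the first D JJ* N group directly before a verb,
    or None if there is no such group."""
    words = [word for word, _ in tagged]
    tags = " ".join(tag for _, tag in tagged)
    match = NOUN_GROUP_BEFORE_VERB.search(tags)
    if match is None:
        return None
    # Convert character offsets in the tag string back to token indices.
    start = tags[:match.start()].count(" ")
    length = match.group().count(" ") + 1
    return " ".join(words[start:start + length])


DOG = [("The", "D"), ("fierce", "JJ"), ("dog", "N"), ("bit", "V"),
       ("the", "D"), ("man", "N")]
CHILD = [("The", "D"), ("child", "N"), ("with", "P"), ("a", "D"),
         ("fierce", "JJ"), ("dog", "N"), ("bit", "V"), ("the", "D"),
         ("man", "N")]

print(subject_guess(DOG))    # the intended subject
print(subject_guess(CHILD))  # picks the noun group inside the PP instead
```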

Constituency

We could repair this with a more complex regular expression such as

DT JJ* NN (IN DT JJ* NN)*

But this is defeated by

The seagull that attacked the child with the fierce dog bit the man

Basic problem is that we need a richer notion of constituency – how the words fit together to form a noun phrase.
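
A short sketch shows the repaired pattern succeeding on the earlier sentence and failing on the seagull sentence. The Penn-style tags (WDT for "that", VBD for past-tense verbs) are assigned by hand, and `noun_phrase_before_verb` is a hypothetical helper name.

```python
import re

# DT JJ* NN (IN DT JJ* NN)* immediately followed by a past-tense verb.
EXTENDED = re.compile(r"\bDT(?: JJ)* NN(?: IN DT(?: JJ)* NN)*\b(?= VBD\b)")


def noun_phrase_before_verb(sentence):
    """Return the words matched by EXTENDED in a word/TAG string, or None."""
    pairs = [token.split("/") for token in sentence.split()]
    words = [word for word, _ in pairs]
    tags = " ".join(tag for _, tag in pairs)
    match = EXTENDED.search(tags)
    if match is None:
        return None
    # Convert character offsets in the tag string back to token indices.
    start = tags[:match.start()].count(" ")
    length = match.group().count(" ") + 1
    return " ".join(words[start:start + length])


CHILD = ("The/DT child/NN with/IN a/DT fierce/JJ dog/NN bit/VBD "
         "the/DT man/NN")
SEAGULL = ("The/DT seagull/NN that/WDT attacked/VBD the/DT child/NN "
           "with/IN the/DT fierce/JJ dog/NN bit/VBD the/DT man/NN")

print(noun_phrase_before_verb(CHILD))    # whole subject, as intended
print(noun_phrase_before_verb(SEAGULL))  # stops short of the real subject
```

In the second sentence the pattern finds "the child with the fierce dog", which sits inside the relative clause; the real subject of "bit" is the whole phrase headed by "seagull".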

Recursion – Central Embedding

The dog barked.
The dog the cat scratched barked.
The dog the cat the horse liked scratched barked.
The dog the cat the horse the man rode liked scratched barked.
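
The sentences above follow a nested pattern: n noun phrases followed by n verbs in the reverse order, each verb pairing with the noun phrase at the same depth. A minimal recursive sketch (the function names `clause` and `centre_embed` are hypothetical) generates them from two word lists taken from the slide.

```python
NOUNS = ["dog", "cat", "horse", "man"]
VERBS = ["barked", "scratched", "liked", "rode"]


def clause(i, depth):
    """Noun phrase i, then (recursively) the clause embedded inside it,
    then the verb that belongs to noun phrase i."""
    inner = clause(i + 1, depth) + " " if i < depth else ""
    return f"the {NOUNS[i]} {inner}{VERBS[i]}"


def centre_embed(depth):
    """Return the sentence with `depth` levels of central embedding."""
    if not 0 <= depth < len(NOUNS):
        raise ValueError("depth out of range for the word lists")
    sentence = clause(0, depth)
    return sentence[0].upper() + sentence[1:]


for depth in range(4):
    print(centre_embed(depth))
```

Because each verb must be matched with a noun phrase at the same nesting level, the pattern has the shape of a^n b^n, which no regular expression can describe for unbounded n.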

Chomsky Hierarchy

CFG Review

A CFG is a 4-tuple (N, Σ, P, S), where:

N is a set of non-terminal symbols (the category labels);
Σ is a set of terminal symbols (e.g., lexical items);
P is a set of productions of the form A → α, where
A is a non-terminal, and
α is a string of symbols from (N ∪ Σ)* (i.e., a string of terminals and/or non-terminals);
S is the start symbol.

A derivation of a string from a non-terminal A in N is the result (or trace) of successively applying individual productions in P, starting from A.
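
The 4-tuple maps directly onto a small data structure. The sketch below is one possible encoding, not a standard API: the class name `CFG`, its field names and the `check` method are assumptions, and the requirement that N and Σ be disjoint is the usual convention rather than something stated on the slide. The example grammar is the NP fragment used in the derivation slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CFG:
    nonterminals: frozenset  # N
    terminals: frozenset     # Σ
    productions: tuple       # P: pairs (A, alpha), alpha a tuple of symbols
    start: str               # S

    def check(self):
        """Raise ValueError unless the 4-tuple is a well-formed CFG."""
        if self.nonterminals & self.terminals:
            raise ValueError("N and Σ must be disjoint")
        if self.start not in self.nonterminals:
            raise ValueError("start symbol must be in N")
        symbols = self.nonterminals | self.terminals
        for lhs, rhs in self.productions:
            if lhs not in self.nonterminals:
                raise ValueError(f"LHS {lhs!r} is not a non-terminal")
            for symbol in rhs:
                if symbol not in symbols:
                    raise ValueError(f"unknown symbol {symbol!r} in RHS")


NP_FRAGMENT = CFG(
    nonterminals=frozenset({"NP", "PP", "Det", "N", "P"}),
    terminals=frozenset({"the", "a", "dog", "telescope", "with"}),
    productions=(
        ("NP", ("Det", "N", "PP")),
        ("NP", ("Det", "N")),
        ("PP", ("P", "NP")),
        ("Det", ("the",)),
        ("Det", ("a",)),
        ("N", ("dog",)),
        ("N", ("telescope",)),
        ("P", ("with",)),
    ),
    start="NP",
)

NP_FRAGMENT.check()
print(len(NP_FRAGMENT.productions), "productions, start symbol", NP_FRAGMENT.start)
```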

Different Derivations for the Same Sentence

Derivation 1

NP
⇒ Det N PP
⇒ the N PP
⇒ the dog PP
⇒ the dog P NP
⇒ the dog with NP
⇒ the dog with Det N
⇒ the dog with a N
⇒ the dog with a telescope

Derivation 2

NP
⇒ Det N PP
⇒ Det N P NP
⇒ Det N with NP
⇒ The N with NP
⇒ The N with a N
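
A derivation can be replayed mechanically. In the sketch below, `derive` is a hypothetical helper that applies each production to the leftmost occurrence of its left-hand side and records every sentential form. The first step list reproduces Derivation 1; the second follows the order of Derivation 2 and then completes it, and that completion is an assumption, since the slide stops before the final string.

```python
def derive(start, steps):
    """Apply each (lhs, rhs) production to the leftmost occurrence of lhs
    and return the list of sentential forms."""
    form = [start]
    forms = [" ".join(form)]
    for lhs, rhs in steps:
        i = form.index(lhs)  # raises ValueError if lhs is not present
        form[i:i + 1] = rhs
        forms.append(" ".join(form))
    return forms


DERIVATION_1 = [
    ("NP", ["Det", "N", "PP"]), ("Det", ["the"]), ("N", ["dog"]),
    ("PP", ["P", "NP"]), ("P", ["with"]), ("NP", ["Det", "N"]),
    ("Det", ["a"]), ("N", ["telescope"]),
]

# Same productions, applied in a different order.
DERIVATION_2 = [
    ("NP", ["Det", "N", "PP"]), ("PP", ["P", "NP"]), ("P", ["with"]),
    ("Det", ["the"]), ("NP", ["Det", "N"]), ("Det", ["a"]),
    ("N", ["dog"]), ("N", ["telescope"]),
]

forms_1 = derive("NP", DERIVATION_1)
forms_2 = derive("NP", DERIVATION_2)
print("\n".join(forms_1))
print()
print("\n".join(forms_2))
```

Both runs use the same productions and end in the same string; only the order of application differs, which is why different derivations can correspond to a single syntactic structure.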

What Does Context Free Mean?

LHS of rule is just one symbol.

Can have: NP -> Det N

Cannot have: X NP Y -> X Det N Y

Grammar Symbols

Symbols of the grammar fall into three categories:

1. Non Terminal Symbols

2. Terminal Symbols

3. Parts of Speech

We will sometimes not distinguish between 2 and 3

Technical Aspects of CFGs

Rules are of the form LHS -> RHS
LHS comprises exactly one NT symbol
RHS is any combination of NT and T symbols

Finite-state (type 3) grammars have tighter restrictions
LHS comprises exactly one NT symbol
RHS is a combination of T symbols with at most one NT

Right-linear grammar: the NT must come at the extreme right of the RHS
Left-linear grammar: the NT must come at the extreme left of the RHS
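
These restrictions can be checked rule by rule. The sketch below uses the standard definitions (right-linear: the single non-terminal is the last symbol of the RHS; left-linear: it is the first). The function name `classify` and its return labels are hypothetical.

```python
def classify(lhs, rhs, nonterminals):
    """Classify one rule; lhs and rhs are lists of symbols."""
    if len(lhs) != 1 or lhs[0] not in nonterminals:
        return "not context-free"
    nt_positions = [i for i, sym in enumerate(rhs) if sym in nonterminals]
    if len(nt_positions) <= 1:
        # Zero NTs, or a single NT at one end of the RHS.
        right = not nt_positions or nt_positions[0] == len(rhs) - 1
        left = not nt_positions or nt_positions[0] == 0
        if right and left:
            return "right- and left-linear"
        if right:
            return "right-linear"
        if left:
            return "left-linear"
    return "context-free only"


NT = {"S", "NP", "VP", "Det", "N", "A", "B"}

print(classify(["NP"], ["Det", "N"], NT))
print(classify(["X", "NP", "Y"], ["X", "Det", "N", "Y"], NT))
print(classify(["A"], ["a", "B"], NT))
print(classify(["A"], ["B", "a"], NT))
print(classify(["A"], ["a"], NT))
```

Note that a grammar counts as type 3 only if all of its rules are right-linear or all are left-linear; mixing the two kinds of rule can generate non-regular languages.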

A Simple Grammar + Lexicon

grammar:

S -> NP VP
NP -> N
VP -> V NP

lexicon:

V -> kicks
N -> John
N -> Bill

[S [NP [N John]] [VP [V kicks] [NP [N Bill]]]]

Grammar versus Parser

A grammar/lexicon defines a relation between sentences generated by the grammar and their respective syntactic structures.

The grammar does not tell us how to actually go about discovering the structure of a sentence.

A parsing algorithm is an effective procedure for carrying out that discovery.

A parser implements a parsing algorithm, for example recursive descent parsing.
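
To make the grammar/parser distinction concrete, here is a minimal recursive descent sketch for the simple grammar and lexicon with the rules S -> NP VP, NP -> N, VP -> V NP. It is an illustration under stated assumptions, not the course's or NLTK's implementation: the dictionary encoding and the names `parse`, `expand`, `parse_sentence` and `bracket` are all introduced here.

```python
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["N"]],
    "VP": [["V", "NP"]],
}
LEXICON = {"kicks": "V", "John": "N", "Bill": "N"}


def parse(symbol, words, pos):
    """Yield (tree, next_pos) for every way `symbol` can cover
    words[pos:next_pos], trying productions top-down."""
    if symbol in GRAMMAR:
        for rhs in GRAMMAR[symbol]:
            yield from expand(symbol, rhs, words, pos, [])
    elif pos < len(words) and LEXICON.get(words[pos]) == symbol:
        yield (symbol, words[pos]), pos + 1


def expand(symbol, rhs, words, pos, children):
    """Match the remaining right-hand-side symbols from left to right."""
    if not rhs:
        tree = (symbol, *children)
        yield tree, pos
        return
    for child, next_pos in parse(rhs[0], words, pos):
        yield from expand(symbol, rhs[1:], words, next_pos, children + [child])


def parse_sentence(sentence):
    """Return all S trees that cover the whole sentence."""
    words = sentence.split()
    return [tree for tree, end in parse("S", words, 0) if end == len(words)]


def bracket(tree):
    """Render a tree as a labelled bracketing."""
    label, *rest = tree
    if len(rest) == 1 and isinstance(rest[0], str):
        return f"[{label} {rest[0]}]"
    return f"[{label} " + " ".join(bracket(child) for child in rest) + "]"


for tree in parse_sentence("John kicks Bill"):
    print(bracket(tree))
```

The grammar is pure data; all the control flow lives in `parse` and `expand`. One known caveat of this strategy is that it does not terminate on left-recursive rules such as NP -> NP PP, which this small grammar avoids.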