
Post on 11-Jan-2016


THEORIES AND TECHNIQUES OF RECOGNITION

Computational linguistics in Python: syntactic analysis (parsing)

FROM CHUNKING TO FULL SYNTACTIC ANALYSIS

PROBLEM: AMBIGUITY

While hunting in Africa, I shot an elephant in my pajamas. How an elephant got into my pajamas I'll never know.


CHARACTERIZING THE SYNTAX OF A LANGUAGE: CONTEXT-FREE GRAMMARS


• Capture constituency and ordering
– Ordering: what are the rules that govern the ordering of words and bigger units in the language?
– Constituency: how words group into units and how the various kinds of units behave

Constituency

• E.g., noun phrases (NPs):
– Three parties from Brooklyn
– A high-class spot such as Mindy’s
– The Broadway coppers
– They
– Harry the Horse
– The reason he comes into the Hot Box
• How do we know these form a constituent?

Constituency (II)

• They can all appear before a verb:
– Three parties from Brooklyn arrive…
– A high-class spot such as Mindy’s attracts…
– The Broadway coppers love…
– They sit
• But individual words can’t always appear before verbs:
– *from arrive…
– *as attracts…
– *the is
– *spot is…
• Must be able to state generalizations like:
– Noun phrases occur before verbs

Constituency (III)

• Preposing and postposing:
– On September 17th, I’d like to fly from Atlanta to Denver
– I’d like to fly on September 17th from Atlanta to Denver
– I’d like to fly from Atlanta to Denver on September 17th
• But not:
– *On September, I’d like to fly 17th from Atlanta to Denver
– *On I’d like to fly September 17th from Atlanta to Denver

Indicating constituents: brackets, trees

• Brackets: [S [NP [Pro I]] [VP [V prefer] [NP [Det a] [Nom [N morning] [N flight]]]]]
• [Figure: the same structure drawn as a parse tree for "I prefer a morning flight", with S at the root and NP and VP as its children]
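The bracket notation and the tree are two views of the same object. A minimal sketch, assuming the bracketed string as written above and using hypothetical helper names (`parse_brackets`, `leaves`), shows how the brackets can be read into a nested structure whose leaves are the words of the sentence:

```python
def parse_brackets(text):
    """Read '[S [NP ...] ...]' into nested (label, children) tuples."""
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "["
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(node())      # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # skip the closing ']'
        return (label, children)

    return node()


def leaves(tree):
    """Return the words at the leaves of the tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    _label, children = tree
    return [word for child in children for word in leaves(child)]


BRACKETED = "[S [NP [Pro I]] [VP [V prefer] [NP [Det a] [Nom [N morning] [N flight]]]]]"
tree = parse_brackets(BRACKETED)
print(tree[0])                 # label of the root constituent
print(" ".join(leaves(tree)))  # the sentence read off the leaves
```

The nested tuples are only one possible encoding; NLTK has its own tree class, which is not used here.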


Beyond regular languages: Context-Free Grammars

S -> NP VP
NP -> Det Nominal
Nominal -> Noun
VP -> V
Det -> the
Det -> a
Noun -> flight
V -> left
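To see what this small grammar generates, the rules can be stored as a plain dictionary and expanded exhaustively. This is only a sketch: the dictionary encoding and the function name `expand` are assumptions made for illustration, and the exhaustive expansion terminates only because this grammar has no recursive rules.

```python
from itertools import product

# The toy grammar above: each non-terminal maps to its alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "Nominal"]],
    "Nominal": [["Noun"]],
    "VP": [["V"]],
    "Det": [["the"], ["a"]],
    "Noun": [["flight"]],
    "V": [["left"]],
}


def expand(symbol):
    """Yield every list of words derivable from symbol (non-recursive grammars only)."""
    if symbol not in GRAMMAR:  # a terminal stands for itself
        yield [symbol]
        return
    for rhs in GRAMMAR[symbol]:
        # Combine every expansion of each right-hand-side symbol, in order.
        for parts in product(*(list(expand(s)) for s in rhs)):
            yield [word for part in parts for word in part]


sentences = [" ".join(words) for words in expand("S")]
print(sentences)  # ['the flight left', 'a flight left']
```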

CFGs: set of rules

• S -> NP VP
– This says that there are units called S, NP, and VP in this language
– That an S consists of an NP followed immediately by a VP
– Doesn’t say that that’s the only kind of S
– Nor does it say that this is the only place that NPs and VPs occur

Generativity

• As with FSAs, you can view these rules as either analysis or synthesis machines
– Generate strings in the language
– Reject strings not in the language
– Impose structures (trees) on strings in the language
• How can we define grammatical vs. ungrammatical sentences?

Derivations

• A derivation is a sequence of rules applied to a string that accounts for that string
– Covers all the elements in the string
– Covers only the elements in the string

Derivations as Trees

• [Figure: parse tree for "I prefer a morning flight", with S at the root, NP and VP as its children, and the words as leaves]

CFGs more formally

• A context-free grammar has 4 parameters (“is a 4-tuple”):
1) A set of non-terminal symbols (“variables”) N
2) A set of terminal symbols Σ (disjoint from N)
3) A set of productions P, each of the form A -> α, where A is a non-terminal and α is a string of symbols from the infinite set of strings (Σ ∪ N)*
4) A designated start symbol S
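The 4-tuple maps directly onto a small data structure. The sketch below is an illustration under assumptions: the class name `ContextFreeGrammar`, its field names, and the check `is_well_formed` are hypothetical and are not part of NLTK. The check simply restates the conditions listed above.

```python
from typing import NamedTuple


class ContextFreeGrammar(NamedTuple):
    nonterminals: frozenset  # N
    terminals: frozenset     # Sigma, must be disjoint from N
    productions: tuple       # P, as (lhs, rhs) pairs; rhs is a tuple of symbols
    start: str               # S


def is_well_formed(g):
    """Check the conditions of the 4-tuple definition."""
    symbols = g.nonterminals | g.terminals
    return (
        g.nonterminals.isdisjoint(g.terminals)
        and g.start in g.nonterminals
        and all(
            lhs in g.nonterminals and all(s in symbols for s in rhs)
            for lhs, rhs in g.productions
        )
    )


toy = ContextFreeGrammar(
    nonterminals=frozenset({"S", "NP", "VP", "Nominal", "Det", "Noun", "V"}),
    terminals=frozenset({"the", "a", "flight", "left"}),
    productions=(
        ("S", ("NP", "VP")),
        ("NP", ("Det", "Nominal")),
        ("Nominal", ("Noun",)),
        ("VP", ("V",)),
        ("Det", ("the",)),
        ("Det", ("a",)),
        ("Noun", ("flight",)),
        ("V", ("left",)),
    ),
    start="S",
)
print(is_well_formed(toy))
```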

Defining a CF language via derivation

• A string A derives a string B if A can be rewritten as B via some series of rule applications
• More formally:
– If A -> β is a production of P
– and α and γ are any strings in the set (Σ ∪ N)*
– then we say that αAγ directly derives αβγ, or αAγ ⇒ αβγ
– Derivation is a generalization of direct derivation
– Let α1, α2, …, αm be strings in (Σ ∪ N)*, m >= 1, such that α1 ⇒ α2, α2 ⇒ α3, …, αm-1 ⇒ αm
– We say that α1 derives αm, or α1 ⇒* αm
– We then formally define the language LG generated by grammar G as the set of strings composed of terminal symbols derived from S:
LG = {w | w is in Σ* and S ⇒* w}
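Direct derivation can be made concrete by rewriting one non-terminal at a time. The following sketch assumes a simplification that is not in the slides: each non-terminal is given a single chosen right-hand side (the hypothetical `CHOICE` table), and the leftmost non-terminal is always the one rewritten. Each printed step is one application of αAγ ⇒ αβγ.

```python
# One chosen production per non-terminal (an assumption made to keep the sketch short).
CHOICE = {
    "S": ["NP", "VP"],
    "NP": ["Det", "Nominal"],
    "Nominal": ["Noun"],
    "VP": ["V"],
    "Det": ["the"],
    "Noun": ["flight"],
    "V": ["left"],
}


def leftmost_step(form, choice):
    """Rewrite the leftmost non-terminal in form, or return None if there is none."""
    for i, symbol in enumerate(form):
        if symbol in choice:
            # alpha = form[:i], A = symbol, gamma = form[i + 1:]
            return form[:i] + choice[symbol] + form[i + 1:]
    return None


def derive(start, choice):
    """Return the sequence of sentential forms from start to a terminal string."""
    forms = [[start]]
    while True:
        nxt = leftmost_step(forms[-1], choice)
        if nxt is None:
            return forms
        forms.append(nxt)


steps = derive("S", CHOICE)
print(" => ".join(" ".join(form) for form in steps))
```

The last form contains only terminal symbols, so it is a string of the language generated by the grammar.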


What ‘context-free’ means

Derivations and languages

• The language LG GENERATED by a CFG G is the set of strings of TERMINAL symbols that can be derived from the start symbol S using the production rules in G
– LG = {w | w is in Σ* and S derives w}

• The strings in LG are called GRAMMATICAL

• The strings not in LG are called UNGRAMMATICAL


Grammar development

• One of the most basic skills in NLE is the ability to write a CFG for some fragment of a language (e.g., date expressions)

• We’ll briefly cover some of the issues to be addressed when writing small CFG grammars

CFG in PYTHON

• NLTK, 8.3

SYNTACTIC ANALYSIS (PARSING)

• TOP-DOWN search: the parse tree has to be rooted in the start symbol S
– EXPECTATION-DRIVEN parsing
– Example: RECURSIVE DESCENT
• BOTTOM-UP search: the parse tree must be an analysis of the input
– DATA-DRIVEN parsing
– Example: SHIFT-REDUCE

TOP-DOWN PARSING WITH NLTK

• Recursive descent parsing (NLTK, 8.3)
– nltk.RecursiveDescentParser(grammar)
– nltk.app.rdparser()
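The idea behind recursive descent can be sketched without NLTK. The code below is not NLTK's implementation; it is a minimal standard-library illustration of top-down, expectation-driven search with backtracking. The grammar is a hypothetical extension of the toy grammar, and the function names are assumptions. Like any naive recursive descent parser, it would loop forever on a left-recursive rule such as NP -> NP PP, so the example grammar avoids those.

```python
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "Nominal"]],
    "Nominal": [["Noun", "Nominal"], ["Noun"]],
    "VP": [["V", "NP"], ["V"]],
    "Det": [["the"], ["a"]],
    "Noun": [["morning"], ["flight"]],
    "V": [["left"]],
}


def parse_symbol(symbol, words, start):
    """Yield (tree, end) for every way symbol can cover words[start:end]."""
    if symbol not in GRAMMAR:  # terminal: must match the next input word
        if start < len(words) and words[start] == symbol:
            yield symbol, start + 1
        return
    for rhs in GRAMMAR[symbol]:  # try each expansion in turn (backtracking)
        for children, end in parse_sequence(rhs, words, start):
            yield (symbol, children), end


def parse_sequence(symbols, words, start):
    """Yield (trees, end) for every way the symbol sequence can be matched."""
    if not symbols:
        yield [], start
        return
    for first, mid in parse_symbol(symbols[0], words, start):
        for rest, end in parse_sequence(symbols[1:], words, mid):
            yield [first] + rest, end


def parse(sentence):
    """Return all trees rooted in S that cover the whole sentence."""
    words = sentence.split()
    return [tree for tree, end in parse_symbol("S", words, 0) if end == len(words)]


print(len(parse("the morning flight left")))  # grammatical: at least one tree
print(len(parse("flight the left")))          # ungrammatical: no tree
```

A string is grammatical with respect to this grammar exactly when `parse` returns at least one tree.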

BOTTOM-UP PARSING WITH NLTK

• Shift-reduce (NLTK, 8.3, p. 305)
– nltk.app.srparser()
– nltk.ShiftReduceParser(grammar)
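The bottom-up, data-driven strategy can be sketched in the same spirit. This is a simplified standard-library illustration, not NLTK's code: the rule list encodes the toy grammar, the function name `shift_reduce` is an assumption, and the control strategy is greedy (reduce whenever the top of the stack matches a right-hand side, otherwise shift). A greedy parser without backtracking can fail to find a parse that exists for some grammars.

```python
RULES = [
    ("S", ("NP", "VP")),
    ("NP", ("Det", "Nominal")),
    ("Nominal", ("Noun",)),
    ("VP", ("V",)),
    ("Det", ("the",)),
    ("Det", ("a",)),
    ("Noun", ("flight",)),
    ("V", ("left",)),
]


def shift_reduce(words, rules, start="S"):
    """Greedy shift-reduce recognizer; returns (accepted, trace of operations)."""
    stack, buffer, trace = [], list(words), []
    while True:
        for lhs, rhs in rules:
            if tuple(stack[-len(rhs):]) == rhs:
                stack[-len(rhs):] = [lhs]  # REDUCE: replace the handle by its LHS
                trace.append(("reduce", lhs))
                break
        else:
            if not buffer:
                break                      # nothing to reduce, nothing to shift
            stack.append(buffer.pop(0))    # SHIFT: move the next word onto the stack
            trace.append(("shift", stack[-1]))
    return stack == [start], trace


accepted, trace = shift_reduce("the flight left".split(), RULES)
print(accepted)
for operation in trace:
    print(operation)
```

The input is accepted when the buffer is empty and the stack holds only the start symbol.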

MORE ADVANCED PARSING MODELS

• Left corner (NLTK)
• Chart (NLTK)

DEPENDENCIES AND DEPENDENCY GRAMMAR (NLTK, 8.5)

THE PROBLEM OF AMBIGUITY

• Ambiguity
– Church and Patil (1982): the number of attachment ambiguities grows like the Catalan numbers
– C(2) = 2, C(3) = 5, C(4) = 14, C(5) = 42, C(6) = 132, C(7) = 429, C(8) = 1430
• Avoiding reparsing
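The Catalan numbers have the closed form C(n) = (2n choose n) / (n + 1), which makes the growth easy to check. A short sketch, using the standard library's `math.comb` (Python 3.8 or later) and a hypothetical helper name `catalan`:

```python
from math import comb


def catalan(n):
    """n-th Catalan number: binomial(2n, n) / (n + 1), always an integer."""
    return comb(2 * n, n) // (n + 1)


print([catalan(n) for n in range(2, 9)])  # [2, 5, 14, 42, 132, 429, 1430]
```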

COMMON STRUCTURAL AMBIGUITIES

• COORDINATION ambiguity
– OLD (MEN AND WOMEN) vs (OLD MEN) AND WOMEN
• ATTACHMENT ambiguity:
– Gerundive VP attachment ambiguity: I saw the Eiffel Tower flying to Paris
– PP attachment ambiguity: I shot an elephant in my pajamas
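PP attachment ambiguity can be made measurable by counting parse trees. The sketch below assumes a small hypothetical grammar in which a PP may attach either to the NP (NP -> NP PP) or to the VP (VP -> VP PP), with only binary and lexical rules. It counts trees span by span and caches each result, so no span is analysed twice; this is the basic idea behind chart parsing, though it is not NLTK's chart parser.

```python
from functools import lru_cache

# Hypothetical grammar: binary rules plus lexical rules only.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("Det", "N"), ("NP", "PP"), ("I",)],
    "VP": [("V", "NP"), ("VP", "PP")],
    "PP": [("P", "NP")],
    "Det": [("an",), ("my",)],
    "N": [("elephant",), ("pajamas",)],
    "V": [("shot",)],
    "P": [("in",)],
}


def count_parses(sentence, start="S"):
    """Count the distinct trees rooted in start that cover the whole sentence."""
    words = tuple(sentence.split())

    @lru_cache(maxsize=None)  # each (symbol, i, j) is computed only once
    def count(symbol, i, j):
        total = 0
        for rhs in GRAMMAR.get(symbol, ()):
            if len(rhs) == 1:  # lexical rule: must match a single word
                total += int(j - i == 1 and words[i] == rhs[0])
            else:              # binary rule: try every split point
                left, right = rhs
                for k in range(i + 1, j):
                    total += count(left, i, k) * count(right, k, j)
        return total

    return count(start, 0, len(words))


print(count_parses("I shot an elephant in my pajamas"))  # 2 readings
```

The two readings are the one where the PP modifies "an elephant" and the one where it modifies the shooting event. Adding a second PP at the end of the sentence raises the count to 5, in line with the Catalan growth noted above.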

PP ATTACHMENT AMBIGUITY

AMBIGUITY: SOLUTIONS

• Use a PROBABILISTIC GRAMMAR (not covered in this module)

• Use semantics

WRITING A GRAMMAR

• NLTK, 8.6