LINGUISTICA GENERALE E COMPUTAZIONALE DISAMBIGUAZIONE DELLE PARTI DEL DISCORSO.

Post on 17-Dec-2015

232 views 2 download

Tags:

transcript

LINGUISTICA GENERALE E COMPUTAZIONALE

DISAMBIGUAZIONE DELLE PARTI DEL DISCORSO

2

POS tagging: the problem

• People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN

• Problem: assign a tag to race• Requires: tagged corpus

3

Ambiguity in POS tagging

The ATman NN VBstill NN VB RBsaw NN VBDher PPO PP$

4

How hard is POS tagging?

Number of tags 1 2 3 4 5 6 7

Number of words types

35340 3760 264 61 12 2 1

In the Brown corpus,- 11.5% of word types ambiguous- 40% of word TOKENS

5

Frequency + Context

• Both the Brill tagger and HMM-based taggers achieve good results by combining– FREQUENCY

• I poured FLOUR/NN into the bowl.• Peter should FLOUR/VB the baking tray

– Information about CONTEXT • I saw the new/JJ PLAY/NN in the theater.• The boy will/MD PLAY/VBP in the garden.

6

The importance of context

• Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN

• People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN

TAGGED CORPORA

8

Choosing a tagset

• The choice of tagset greatly affects the difficulty of the problem

• Need to strike a balance between– Getting better information about context (best:

introduce more distinctions)– Make it possible for classifiers to do their job

(need to minimize distinctions)

9

Some of the best-known Tagsets

• Brown corpus: 87 tags• Penn Treebank: 45 tags• Lancaster UCREL C5 (used to tag the BNC): 61

tags• Lancaster C7: 145 tags

10

Important Penn Treebank tags

11

Verb inflection tags

12

The entire Penn Treebank tagset

13

UCREL C5

14

Tagsets per l’italiano

Si-TAL (Pisa, Venezia, IRST, ....)

PAROLE

TEXTPRO (dopo)

15

Il tagset di SI-TAL

16

POS tags in the Brown corpus

Television/NN has/HVZ yet/RB to/TO work/VB out/RP a/AT living/RBG arrangement/NN with/IN jazz/NN ,/, which/VDT comes/VBZ to/IN the/AT medium/NN more/QL as/CS an/AT uneasy/JJ guest/NN than/CS as/CS a/AT relaxed/VBN member/NN of/IN the/AT family/NN ./.

17

SGML-based POS in the BNC

<div1 complete=y org=seq> <head> <s n=00040> <w NN2>TROUSERS <w VVB>SUIT </head> <caption> <s n=00041> <w EX0>There <w VBZ>is <w PNI>nothing <w AJ0>masculine <w PRP>about <w DT0>these <w AJ0>new <w NN1>trouser <w NN2-VVZ>suits <w PRP>in <w NN1>summer<w POS>'s <w AJ0>soft <w NN2>pastels<c PUN>. <s n=00042> <w NP0>Smart <w CJC>and <w AJ0>acceptable <w PRP>for <w NN1>city <w NN1-VVB>wear <w CJC>but <w AJ0>soft <w AV0>enough <w PRP>for <w AJ0>relaxed <w NN2>days </caption>

18

Quick test

DoCoMo and Sony are to develop a chip that would let people pay for goods through their mobiles.

POS TAGGED CORPORA IN NLTK

>>> tagged_token = nltk.tag.str2tuple('fly/NN') >>> tagged_token('fly', 'NN') >>> tagged_token[0] 'fly' >>> tagged_token[1] 'NN'

>>> nltk.corpus.brown.tagged_words() [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]

Exploring tagged corpora

• Ch.5, p. 184-189

OTHER POS-TAGGED CORPORA

• NLTK:• WAC Corpora:

– English: UKWAC– Italian: ITWAC

POS TAGGING

23

Markov Model POS tagging

• Again, the problem is to find an `explanation’ with the highest probability:

• As in the lecture on text classification, this can be ‘turned around’ using Bayes’ Rule:

)..|..(argmax 11Tt

nn wwttPi

)..(

)..()..|..(argmax

1

111

n

nnn

wwP

ttPttwwP

24

Combining frequency and contextual information

• As in the case of spelling, this equation can be simplified:

• As we will see, once further simplifications are applied, this equation will encode both FREQUENCY and CONTEXT INFORMATION

prior

1

likelihood

11 )..()..|..(argmax nnn ttPttwwP

25

Three further assumptions

• MARKOV assumption: a tag only depends on a FIXED NUMBER of previous tags (here, assume bigrams)– Simplify second factor

• INDEPENDENCE assumption: words are independent from each other.

• A word’s identity only depends on its own tag– Simplify first factor

26

The final equations

FREQUENCYCONTEXT

27

Estimating the probabilities

Can be done using Maximum Likelihood Estimation as usual, for BOTH probabilities:

28

An example of tagging with Markov Models :

• Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN

• People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN

• Problem: assign a tag to race given the subsequences– to/TO race/???– the/DT race/???

• Solution: we choose the tag that has the greater of these probabilities:– P(VB|TO) P(race|VB)– P(NN|TO)P(race|NN)

29

Tagging with MMs (2)• Actual estimates from the Switchboard corpus:• LEXICAL FREQUENCIES:

– P(race|NN) = .00041– P(race|VB) = .00003

• CONTEXT:– P(NN|TO) = .021– P(VB|TO) = .34

• The probabilities:– P(VB|TO) P(race|VB) = .00001– P(NN|TO)P(race|NN) = .000007

30

A graphical interpretation of the POS tagging equations

31

Hidden Markov Models

32

An example

33

Computing the most likely sequence of tags

• In general, the problem of computing the most likely sequence t1 .. tn could have exponential complexity

• It can however be solved in polynomial time using an example of DYNAMIC PROGRAMMING: the VITERBI ALGORITHM (Viterbi, 1967)

• (Also called TRELLIS ALGORITHMs)

POS TAGGING IN NLTK

DEFAULT POS TAGGER: nltk.pos_tag

>>> text = nltk.word_tokenize("And now for something completely different")>>> nltk.pos_tag(text) [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

TEXTPRO

• The most widely used NLP tool for Italian• http://textpro.fbk.eu/• Demo

THE TEXTPRO TAGSET

READINGS

• Bird et al, chapter 5, chapter 6.1