
04/18/23 CPSC503 Winter 2009 1

CPSC 503 Computational Linguistics

Lecture 3

Giuseppe Carenini

NLP research at UBC

TOPICS
• Generation and Summarization of Evaluative Text (e.g., customer reviews)
• Summarization of conversations (emails, blogs, meetings)
• Subjectivity Detection, Domain Adaptation, Rhetorical Parsing

PEOPLE: G. Carenini & R. Ng (Profs), G. Murray (Postdoc) + Students

SUPPORT: NSERC, Google, BObjects (now SAP)

COLLABORATIONS: MSResearch

04/18/23 CPSC503 Winter 2009 2

http://people.cs.ubc.ca/~rjoty/Webpage/

04/18/23 CPSC503 Winter 2009 3

Linguistic Knowledge Formalisms and associated Algorithms

[Figure: formalisms mapped to levels of linguistic knowledge]

Formalisms:
• State Machines (no prob.): Finite State Automata (and Regular Expressions), Finite State Transducers
• Rule systems (and prob. version), e.g., (Prob.) Context-Free Grammars
• Logical formalisms (First-Order Logics)
• AI planners

Levels: (English) Morphology, Syntax, Semantics, Pragmatics, Discourse and Dialogue

04/18/23 CPSC503 Winter 2009 4

Computational tasks in Morphology

• Recognition: recognize whether a string is an English/… word (FSA)

• Parsing/Generation: word <-> stem, class, lexical features
  e.g., bought <-> buy +V +PAST-PART
                   buy +V +PAST
  ….

• Stemming: word -> stem
  ….

04/18/23 CPSC503 Winter 2009 5

Today Sept 16

• Finite State Transducers (FSTs) and Morphological Parsing

• Stemming (Porter Stemmer)

04/18/23 CPSC503 Winter 2009 6

FST definition (Recap.)

• Q: a finite set of states
• I, O: input and output alphabets (which may include ε)
• Σ: a finite alphabet of complex symbols i:o, i ∈ I and o ∈ O
• Q0: the start state
• F: a set of accept/final states (F ⊆ Q)
• A transition relation δ that maps Q x Σ to 2^Q

E.g., |Q| = 3; I = {a, b, c, ε}; O = {a, b}; |Σ| = ?; 0 <= |δ| <= ?
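One way to approach the closing exercise is to enumerate the pairs directly. The sketch below assumes that Σ contains every pair i:o in I x O and that δ is counted as a set of (state, symbol, next state) triples. Both are assumptions made for illustration, and the variable names are hypothetical.

```python
from itertools import product

EPS = "ε"
Q = {"q0", "q1", "q2"}       # |Q| = 3
I = {"a", "b", "c", EPS}     # input alphabet
O = {"a", "b"}               # output alphabet

# Assumption: Σ holds every complex symbol i:o with i in I and o in O.
SIGMA = {f"{i}:{o}" for i, o in product(I, O)}

# δ maps Q x Σ to subsets of Q, so viewed as a set of triples
# (q, i:o, q') it has at most |Q| * |Σ| * |Q| elements.
max_delta = len(Q) * len(SIGMA) * len(Q)

print(len(SIGMA))   # 8
print(max_delta)    # 72
```

If Σ is taken to be a proper subset of I x O, both numbers shrink accordingly.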

04/18/23 CPSC503 Winter 2009 7

FST can be used as…

• Translators: input one string from I, output another from O (or vice versa)
• Recognizers: input a string from I x O
• Generators: output a string from I x O

Terminology warning!
E.g., if I = {a, b}; O = {a, b, ε}; ……

04/18/23 CPSC503 Winter 2009 8

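FST: inflectional morphology of plural

[Figure: FST with arcs for some regular nouns and some irregular nouns, e.g. o:i]

Notes:
• X -> X:X (a single symbol X abbreviates the pair X:X)
• Arc labels read lexical:surface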

04/18/23 CPSC503 Winter 2009 9

Examples

lexical: c a t +N +PL
surface:

lexical:
surface: m i c e
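To make the two tapes concrete, here is a minimal sketch of a transducer that maps a lexical tape to a surface tape. It assumes one hand-built path per noun (a real FST would share arcs across noun classes), represents ε output as the empty string, and has no ε-input arcs. The helper names `add_path` and `transduce` are hypothetical.

```python
EPS = ""  # the empty string stands for ε on the output side

def add_path(arcs, finals, pairs, next_state):
    """Chain input:output pairs from state 0 through fresh states."""
    state = 0
    for inp, out in pairs:
        arcs.append((state, inp, out, next_state))
        state = next_state
        next_state += 1
    finals.add(state)          # the last state of the path is accepting
    return next_state

def transduce(arcs, finals, symbols, start=0):
    """Return every surface string the FST produces for the lexical symbols."""
    results = set()
    stack = [(start, 0, "")]
    while stack:
        state, pos, out = stack.pop()
        if pos == len(symbols) and state in finals:
            results.add(out)
        for src, inp, outp, dst in arcs:
            if src == state and pos < len(symbols) and symbols[pos] == inp:
                stack.append((dst, pos + 1, out + outp))
    return results

arcs, finals = [], set()
nxt = add_path(arcs, finals,
               [("c", "c"), ("a", "a"), ("t", "t"), ("+N", EPS), ("+PL", "s")], 1)
add_path(arcs, finals,
         [("m", "m"), ("o", "i"), ("u", EPS), ("s", "c"), ("e", "e"),
          ("+N", EPS), ("+PL", EPS)], nxt)

print(transduce(arcs, finals, ["c", "a", "t", "+N", "+PL"]))            # {'cats'}
print(transduce(arcs, finals, ["m", "o", "u", "s", "e", "+N", "+PL"]))  # {'mice'}
```

Returning a set rather than a single string keeps the door open for ambiguous inputs, where several paths accept the same lexical string.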

04/18/23 CPSC503 Winter 2009 10

Computational Morphology: Problems/Challenges

1. Ambiguity: one word can correspond to multiple structures (more critical in morphologically richer languages)

2. Spelling changes: may occur when two morphemes are combined, e.g. butterfly + -s -> butterflies

04/18/23 CPSC503 Winter 2009 11

Ambiguity: more complex example

• What’s the right parse for Unionizable?
– Union-ize-able
– Un-ion-ize-able

• Each would represent a valid path through an FST for derivational morphology.

• Both Adj……

04/18/23 CPSC503 Winter 2009 12

Deal with Morphological Ambiguity

• Find all the possible outputs (all paths) and return them all (without choosing)

Then Part-of-speech tagging to choose…… look at the neighboring words

04/18/23 CPSC503 Winter 2009 13

(2) Spelling Changes

When morphemes are combined inflectionally the spelling at the boundaries may change.

Examples

• E-insertion: when –s is added to a word, -e is inserted if the word ends in –s, -z, -sh, -ch, -x (e.g., kiss, miss, waltz, bush, watch, rich, box)

• Y-replacement: when –s or -ed is added to a word ending with –y, -y changes to –ie or –i respectively (e.g., butterfly, try)
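The two rules can be written down directly as string operations. This is a minimal sketch that follows the rules exactly as worded on the slide; the function name `add_suffix` is hypothetical. Because the slide's Y-replacement rule does not look at the letter before -y, this version would also rewrite words such as "boy", which a fuller rule would exclude.

```python
import re

def add_suffix(stem, suffix):
    """Attach -s or -ed to a stem, applying the two spelling rules."""
    # E-insertion: insert -e before -s after s, z, sh, ch, x.
    if suffix == "s" and re.search(r"(s|z|sh|ch|x)$", stem):
        return stem + "es"
    # Y-replacement: -y becomes -ie before -s and -i before -ed.
    if stem.endswith("y") and suffix in ("s", "ed"):
        return stem[:-1] + ("ies" if suffix == "s" else "ied")
    return stem + suffix

print(add_suffix("watch", "s"))      # watches
print(add_suffix("butterfly", "s"))  # butterflies
print(add_suffix("try", "ed"))       # tried
```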

04/18/23 CPSC503 Winter 2009 14

Solution: Multi-Tape Machines

• Add intermediate tape
• Use the output of one tape machine as the input to the next
• Add intermediate symbols
– ^ morpheme boundary
– # word boundary

04/18/23 CPSC503 Winter 2009 15

Multi-Level Tape Machines

• FST-1 translates between the lexical and the intermediate level

• FST-2 handles the spelling changes (due to one rule) to the surface tape

[Figure: lexical tape, FST-1, intermediate tape, FST-2, surface tape]

04/18/23 CPSC503 Winter 2009 16

FST-1 for inflectional morphology of plural (Lexical <-> Intermediate)

[Figure: FST with arcs for some regular nouns and some irregular nouns, e.g. o:i; paths end with #]

+PL:^s# abbreviates the arc sequence +PL:^ ε:s ε:#

04/18/23 CPSC503 Winter 2009 17

Example

lexical: f o x +N +PL
intermediate:

lexical: m o u s e +N +PL
intermediate:

04/18/23 CPSC503 Winter 2009 18

FST-2 for E-insertion (Intermediate <-> Surface)

E-insertion: when –s is added to a word, -e is inserted if word ends in –s, -z, -sh, -ch, -x

…as in fox^s# <-> foxes

[Figure: FST-2 for E-insertion; arcs include #:ε]
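The cascade of the two machines can be imitated with two small functions, one per level. This is only a sketch: plain Python functions stand in for FST-1 and FST-2, the one-entry irregular lexicon is hypothetical, and only the E-insertion rule is applied at the second stage.

```python
import re

IRREGULAR = {"mouse": "mice"}   # hypothetical mini-lexicon of irregular plurals

def fst1(lexical):
    """Lexical -> intermediate, e.g. 'fox +N +PL' -> 'fox^s#'."""
    stem, *features = lexical.split()
    if "+PL" in features:
        if stem in IRREGULAR:
            return IRREGULAR[stem] + "#"
        return stem + "^s#"
    return stem + "#"

def fst2(intermediate):
    """Intermediate -> surface: apply E-insertion, then delete ^ and #."""
    with_e = re.sub(r"(s|z|sh|ch|x)\^s#", r"\1es#", intermediate)
    return with_e.replace("^", "").replace("#", "")

print(fst1("fox +N +PL"))          # fox^s#
print(fst2(fst1("fox +N +PL")))    # foxes
print(fst2(fst1("cat +N +PL")))    # cats
print(fst2(fst1("mouse +N +PL")))  # mice
```

Feeding the output of `fst1` into `fst2` mirrors using the intermediate tape as the input of the next machine.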

04/18/23 CPSC503 Winter 2009 19

Examples

intermediate: f o x ^ s #
surface:

intermediate: b o x ^ i n g #
surface:

04/18/23 CPSC503 Winter 2009 20

Where are we?


04/18/23 CPSC503 Winter 2009 21

Final Scheme: Part 1

04/18/23 CPSC503 Winter 2009 22

Final Scheme: Part 2

04/18/23 CPSC503 Winter 2009 23

Intersection(FST1, FST2) = FST3

• States of FST1 and FST2: Q1 and Q2
• States of intersection: (Q1 x Q2)
• Transitions of FST1 and FST2: δ1, δ2
• Transitions of intersection: δ3

For all i, j, n, m, a, b: δ3((q1i, q2j), a:b) = (q1n, q2m) iff
– δ1(q1i, a:b) = q1n AND
– δ2(q2j, a:b) = q2m

[Figure: arc a:b from q1i to q1n in FST1 and arc a:b from q2j to q2m in FST2 give arc a:b from (q1i, q2j) to (q1n, q2m) in FST3]
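The definition of δ3 translates almost line for line into a product construction. The sketch below assumes deterministic machines whose transitions are stored as a dictionary from (state, "a:b") to the next state; that representation, the function name `intersect`, and the state names are illustrative choices, not part of the slides.

```python
def intersect(delta1, delta2):
    """Keep a pair-state arc only when both machines have the same a:b arc."""
    delta3 = {}
    for (q1, sym1), n1 in delta1.items():
        for (q2, sym2), n2 in delta2.items():
            if sym1 == sym2:
                delta3[((q1, q2), sym1)] = (n1, n2)
    return delta3

delta1 = {("p0", "a:b"): "p1", ("p1", "c:c"): "p0"}
delta2 = {("r0", "a:b"): "r1", ("r1", "a:a"): "r0"}

print(intersect(delta1, delta2))
# {(('p0', 'r0'), 'a:b'): ('p1', 'r1')}
```

Only the a:b arc survives, because it is the only label that both machines share.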

04/18/23 CPSC503 Winter 2009 24

Composition(FST1, FST2) = FST3

• States of FST1 and FST2: Q1 and Q2
• States of composition: Q1 x Q2
• Transitions of FST1 and FST2: δ1, δ2
• Transitions of composition: δ3

For all i, j, n, m, a, b: δ3((q1i, q2j), a:b) = (q1n, q2m) iff
– There exists c such that
– δ1(q1i, a:c) = q1n AND
– δ2(q2j, c:b) = q2m

[Figure: arc a:c from q1i to q1n in FST1 and arc c:b from q2j to q2m in FST2 give arc a:b from (q1i, q2j) to (q1n, q2m) in FST3]
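Composition differs from intersection only in the matching condition: the output symbol of the first arc must equal the input symbol of the second. A minimal sketch, assuming arcs are stored as a dictionary from (state, (input, output)) to the next state and ignoring ε handling; targets are collected in a set because different intermediate symbols c can lead to the same a:b label. The name `compose` and the states are hypothetical.

```python
def compose(delta1, delta2):
    """Build a:b arcs from a:c arcs of FST1 and c:b arcs of FST2."""
    delta3 = {}
    for (q1, (a, c1)), n1 in delta1.items():
        for (q2, (c2, b)), n2 in delta2.items():
            if c1 == c2:   # the shared intermediate symbol c
                delta3.setdefault(((q1, q2), (a, b)), set()).add((n1, n2))
    return delta3

delta1 = {("p0", ("a", "c")): "p1"}
delta2 = {("r0", ("c", "b")): "r1", ("r0", ("d", "b")): "r1"}

print(compose(delta1, delta2))
# {(('p0', 'r0'), ('a', 'b')): {('p1', 'r1')}}
```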

04/18/23 CPSC503 Winter 2009 25

FSTs in Practice

• Install an FST package…… (pointers)
• Describe your “formal language” (e.g., lexicon, morphotactic and rules) in a RegExp-like notation (pointer)
• Your specification is compiled in a single FST

Ref: “Finite State Morphology” (Beesley and Karttunen, 2003, CSLI Publications)

Complexity/Coverage:
• FSTs for the morphology of a natural language may have 10^5 – 10^7 states and arcs
• Spanish (1996): 46 x 10^3 stems; 3.4 x 10^6 word forms
• Arabic (2002?): 131 x 10^3 stems; 7.7 x 10^6 word forms

04/18/23 CPSC503 Winter 2009 26

Other important applications of FST in NLP

From segmenting words into morphemes to…

• Tokenization:

– finding word boundaries in text (?!) …maxmatch

– finding sentence boundaries: punctuation… but "." is ambiguous (look at example in Fig. 3.22)

• Shallow syntactic parsing: e.g., find only noun phrases

• Phonological Rules…… (Chpt. 11)
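For the word-boundary task, maxmatch can be sketched as a greedy longest-match scan: at each position take the longest dictionary word, and fall back to a single character when nothing matches. The dictionary and the function name `max_match` below are made up for illustration.

```python
def max_match(text, dictionary):
    """Greedy left-to-right segmentation using the longest dictionary match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to one character.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

words = {"the", "table", "down", "there", "theta", "bled", "own"}
print(max_match("thetabledownthere", words))
# ['theta', 'bled', 'own', 'there']
```

The output shows why the slide marks this approach with "(?!)": the greedy choice of "theta" blocks the intended reading "the table down there".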

04/18/23 CPSC503 Winter 2009 27

Computational tasks in Morphology

• Recognition: recognize whether a string is an English word (FSA)

• Parsing/Generation: word <-> stem, class, lexical features
  e.g., bought <-> buy +V +PAST-PART
                   buy +V +PAST
  ….

• Stemming: word -> stem
  ….

04/18/23 CPSC503 Winter 2009 28

Stemmer

• E.g. the Porter algorithm, which is based on a series of sets of simple cascaded rewrite rules:

• (condition) S1 -> S2
– ATIONAL -> ATE (relational -> relate)
– (*v*) ING -> ε if stem contains vowel (motoring -> motor)

• Cascade of rules applied to: computerization
– ization -> -ize: computerize
– ize -> ε: computer

• Errors occur:
– organization -> organ, university -> universe

Code freely available in most languages: Python, Java,…
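The rewrite rules listed on this slide can be sketched as small functions applied one after another. This is not the Porter algorithm itself: it covers only the four rules shown here and leaves out the conditions on stem length that the full algorithm attaches to most rules. All function names are hypothetical.

```python
import re

VOWEL = re.compile(r"[aeiou]")

def step_ational(word):
    """ATIONAL -> ATE"""
    return word[:-7] + "ate" if word.endswith("ational") else word

def step_ing(word):
    """(*v*) ING -> ε, only if the remaining stem contains a vowel."""
    stem = word[:-3]
    if word.endswith("ing") and VOWEL.search(stem):
        return stem
    return word

def step_ization(word):
    """ization -> ize"""
    return word[:-7] + "ize" if word.endswith("ization") else word

def step_ize(word):
    """ize -> ε"""
    return word[:-3] if word.endswith("ize") else word

def cascade(word, steps=(step_ization, step_ize)):
    """Apply each rewrite step to the output of the previous one."""
    for step in steps:
        word = step(word)
    return word

print(step_ational("relational"))   # relate
print(step_ing("motoring"))         # motor
print(cascade("computerization"))   # computer
print(cascade("organization"))      # organ
```

The last call reproduces one of the errors noted on the slide: the same two rules that give "computer" also reduce "organization" to "organ".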

04/18/23 CPSC503 Winter 2009 29

Stemming mainly used in Information Retrieval

1. Run a stemmer on the documents to be indexed

2. Run a stemmer on users' queries

3. Compute similarity between queries and documents (based on stems they contain)

Seems to work especially well with smaller documents
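The three steps can be sketched end to end as follows. Two things here are assumptions rather than anything the slide specifies: `toy_stem` is a hypothetical stand-in for a real stemmer, and similarity is measured as Jaccard overlap between the two stem sets.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    for suffix in ("ization", "izing", "ize", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def stems(text):
    """Steps 1 and 2: reduce a document or a query to its set of stems."""
    return {toy_stem(w) for w in text.lower().split()}

def similarity(query, document):
    """Step 3: Jaccard overlap between query stems and document stems."""
    q, d = stems(query), stems(document)
    return len(q & d) / len(q | d) if q | d else 0.0

docs = {"d1": "computerizing the office", "d2": "organ music concerts"}
query = "computerization"
ranked = sorted(docs, key=lambda name: similarity(query, docs[name]), reverse=True)
print(ranked)   # ['d1', 'd2']
```

Without stemming, the query "computerization" would share no word with either document.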

04/18/23 CPSC503 Winter 2009 30

Porter as an FST

• The original exposition of the Porter stemmer did not describe it as a transducer but…
– Each stage is a separate transducer
– The stages can be composed to get one big transducer

04/18/23 CPSC503 Winter 2009 31

Next Time

• Read handout
– Probability
– Stats
– Information theory

• Next Lecture:
– finish Chpt 3, 3.10-11
– Start Probabilistic Models for NLP (Chpt. 4, 4.1 – 4.2 and 5.9!)

