
Lecture 3, 7/27/2005 Natural Language Processing 1

CS60057 Speech & Natural Language Processing

Autumn 2005

Lecture 4

28 July 2005


Morphology

Study of the rules that govern the combination of morphemes.

Inflection: same word, different syntactic information. Run/runs/running, book/books

Derivation: new word, different meaning; often a different part of speech, but not always. Possible/possibly/impossible, happy/happiness

Compounding: new word, each part is a word. Blackboard, bookshelf; lAlakamala, banabAsa


Morphology Level: The Mapping

Formally: A⁺ → 2^(L × C1 × C2 × ... × Cn)

A is the alphabet of phonemes (A⁺ denotes any non-empty sequence of phonemes)

L is the set of possible lemmas, each uniquely identified

Ci are morphological categories, such as:
grammatical number, gender, case
person, tense, negation, degree of comparison, voice, aspect, ...
tone, politeness, ...
part of speech (not quite a morphological category, but ...)

A, L and Ci are obviously language-dependent
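The mapping can be illustrated with a toy Python dictionary from surface strings to sets of (lemma, categories) tuples; the entries and feature labels below are invented for illustration, and the set-valued result reflects that one surface form can be ambiguous.

```python
# Toy illustration of the mapping A+ -> 2^(L x C1 x ... x Cn):
# a surface string maps to a SET of (lemma, POS, features) analyses,
# since one form can be ambiguous.  Entries are invented examples.
ANALYSES = {
    "runs":  {("run", "V", "3sg-pres"), ("run", "N", "pl")},
    "books": {("book", "N", "pl"), ("book", "V", "3sg-pres")},
}

def analyze(surface):
    """Return the set of (lemma, part of speech, features) analyses."""
    return ANALYSES.get(surface, set())
```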


Bengali/Hindi Inflectional Morphology

Certain languages encode more grammatical information in morphology than in syntax.

Some of the inflectional suffixes that nouns can take:
number (singular/plural), gender
possessive markers
case markers (the different karakas)

Inflectional suffixes that verbs can take:
Hindi: tense, aspect, modality, person, gender, number
Bengali: tense, aspect, modality, person

Order among inflectional suffixes (morphotactics): Chhelederke baigulokei


Bengali/Hindi Derivational Morphology

Derivational morphology is very rich.


English Inflectional Morphology

Nouns have simple inflectional morphology:
plural -- cat / cats
possessive -- John / John’s

Verbs have slightly more complex, but still relatively simple, inflectional morphology:
past form -- walk / walked
past participle form -- walk / walked
gerund -- walk / walking
third person singular -- walk / walks

Verbs can be categorized as:
main verbs
modal verbs -- can, will, should
primary verbs -- be, have, do

Regular and irregular verbs: walk / walked -- go / went


Regulars and Irregulars

Some words misbehave (refuse to follow the rules):

Mouse/mice, goose/geese, ox/oxen Go/went, fly/flew

The terms regular and irregular will be used to refer to words that follow the rules and those that don’t.


Regular and Irregular Verbs

Regulars…
walk, walks, walking, walked, walked

Irregulars
eat, eats, eating, ate, eaten
catch, catches, catching, caught, caught
cut, cuts, cutting, cut, cut


Derivational Morphology

Quasi-systematicity
Irregular meaning change
Changes of word class

Some English derivational affixes:
-ation : transport / transportation
-er : kill / killer
-ness : fuzzy / fuzziness
-al : computation / computational
-able : break / breakable
-less : help / helpless
un- : do / undo
re- : try / retry

renationalizationability


Derivational Examples

Verb/Adj to Noun

-ation computerize computerization

-ee appoint appointee

-er kill killer

-ness fuzzy fuzziness


Derivational Examples

Noun/Verb to Adj

-al Computation Computational

-able Embrace Embraceable

-less Clue Clueless


Compute

Many paths are possible… Start with compute

computer -> computerize -> computerization
computation -> computational
computer -> computerize -> computerizable
compute -> computee


Parts of A Morphological Processor

For a morphological processor, we need at least the following:

Lexicon: the list of stems and affixes together with basic information about them, such as their main categories (noun, verb, adjective, …) and their sub-categories (regular noun, irregular noun, …).

Morphotactics: the model of morpheme ordering that explains which classes of morphemes can follow which other classes of morphemes inside a word.

Orthographic rules (spelling rules): used to model the changes that occur in a word, normally when two morphemes combine.


Lexicon

A lexicon is a repository for words (stems). They are grouped according to their main categories:
noun, verb, adjective, adverb, …

They may also be divided into sub-categories:
regular nouns, irregular-singular nouns, irregular-plural nouns, …

The simplest way to create a morphological parser would be to put all possible words (together with their inflections) into the lexicon. We do not do this because their number is huge (for Turkish it is, in theory, infinite).
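A minimal sketch of such a lexicon in Python, with stems grouped by main category and sub-category; the field names and entries are illustrative, not from any real system.

```python
# Minimal lexicon sketch: stems annotated with a main category and a
# sub-category (field names and entries are invented for illustration).
LEXICON = {
    "fox":   {"cat": "noun", "subcat": "regular"},
    "goose": {"cat": "noun", "subcat": "irregular-singular"},
    "geese": {"cat": "noun", "subcat": "irregular-plural"},
    "walk":  {"cat": "verb", "subcat": "regular"},
    "go":    {"cat": "verb", "subcat": "irregular"},
}

def lookup(stem):
    """Return the lexicon entry for a stem, or None if it is unknown."""
    return LEXICON.get(stem)
```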


Morphotactics

Which morphemes can follow which morphemes.

Lexicon:

regular-noun    irreg-pl-noun    irreg-sg-noun    plural
fox             geese            goose            -s
cat             sheep            sheep
dog             mice             mouse

Simple English Nominal Inflection (Morphotactic Rules)

[The flattened diagram shows a three-state FSA over states 0, 1, 2: reg-noun takes state 0 to state 1, plural (-s) takes state 1 to state 2, and irreg-sg-noun and irreg-pl-noun take state 0 directly to state 2.]
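The three-state machine can be sketched over word classes rather than letters; the class labels, and the assumption that both states 1 and 2 accept (so bare singulars are recognized), are my reading of the arcs listed above.

```python
# Sketch of the three-state nominal-inflection FSA, driven by word
# classes rather than letters.  Class labels and the accepting-state
# choice are assumptions based on the arcs described in the text.
TRANSITIONS = {
    (0, "reg-noun"): 1,
    (0, "irreg-sg-noun"): 2,
    (0, "irreg-pl-noun"): 2,
    (1, "plural-s"): 2,
}
ACCEPTING = {1, 2}

def accepts(classes):
    """Run the FSA over a sequence of word-class labels."""
    state = 0
    for c in classes:
        state = TRANSITIONS.get((state, c))
        if state is None:
            return False
    return state in ACCEPTING
```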


Combine Lexicon and Morphotactics

[The flattened diagram shows the combined letter-level FSA: each stem is spelled out arc by arc (f-o-x, c-a-t, d-o-g, s-h-e-e-p, g-o-o-s-e / g-e-e-s-e, m-o-u-s-e / m-i-c-e), with the plural -s arc following the regular stems.]

This only says yes or no. It does not give a lexical representation, and it accepts a wrong word (foxs).
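A rough word-level simulation of this acceptor, assuming regular stems are followed directly by a bare -s with no spelling rules, shows exactly why the ill-formed form foxs is accepted while foxes is not.

```python
# Word-level simulation of the combined acceptor: stems plus an
# optional bare "-s" on regular nouns, with NO spelling rules --
# which is why the ill-formed "foxs" is accepted.
REG_STEMS = {"fox", "cat", "dog"}
IRREG_SG = {"goose", "sheep", "mouse"}
IRREG_PL = {"geese", "sheep", "mice"}

def recognize(word):
    """Yes/no recognition only: no lexical representation is produced."""
    if word in REG_STEMS or word in IRREG_SG or word in IRREG_PL:
        return True
    # plural arc: a regular stem followed directly by "s"
    return word.endswith("s") and word[:-1] in REG_STEMS
```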


FSAs and the Lexicon

This will actually require a new kind of FSA: the Finite State Transducer (FST)

We will give a quick overview

First we’ll capture the morphotactics: the rules governing the ordering of affixes in a language.

Then we’ll add in the actual words


Simple Rules


Adding the Words


Derivational Rules


Parsing/Generation vs. Recognition

Recognition is usually not quite what we need. Usually, if we find some string in the language, we need to find the structure in it (parsing). Or we have some structure and we want to produce a surface form (production/generation).

Example: from “cats” to “cat +N +PL”


Why care about morphology?

‘Stemming’ in information retrieval: we might want to search for “aardvark” and find pages with both “aardvark” and “aardvarks”

Morphology in machine translation: we need to know that the Spanish words quiero and quieres are both related to querer ‘want’

Morphology in spell checking: we need to know that misclam and antiundoggingly are not words despite being made up of word parts


Can’t just list all words

Turkish word Uygarlastiramadiklarimizdanmissinizcasina `(behaving) as if you are among those whom we could not civilize’

Uygar `civilized’ + las `become’ + tir `cause’ + ama `not able’ + dik `past’ + lar ‘plural’+ imiz ‘p1pl’ + dan ‘abl’ + mis ‘past’ + siniz ‘2pl’ + casina ‘as if’


Finite State Transducers

The simple story Add another tape Add extra symbols to the transitions

On one tape we read “cats”, on the other we write “cat +N +PL”


Transitions

c:c means read a c on one tape and write a c on the other
+N:ε means read a +N symbol on one tape and write nothing on the other
+PL:s means read +PL and write an s

c:c a:a t:t +N:ε +PL:s
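The transition sequence above can be simulated with a tiny hard-coded transducer; representing ε as the empty string is my implementation choice for this sketch.

```python
# The transition sequence c:c a:a t:t +N:eps +PL:s as a hard-coded
# transducer: each pair is (lexical symbol, surface symbol), with the
# empty string "" standing in for epsilon.
PAIRS = [("c", "c"), ("a", "a"), ("t", "t"), ("+N", ""), ("+PL", "s")]

def generate(lexical_symbols):
    """Map the lexical-tape sequence to a surface string."""
    out = []
    for sym, (lex, surf) in zip(lexical_symbols, PAIRS):
        if sym != lex:
            raise ValueError("no transition for %r" % sym)
        out.append(surf)
    return "".join(out)
```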


Lexical to Intermediate Level


FST Properties

FSTs are closed under union, inversion, and composition.

union: the union of two regular relations is also a regular relation.

inversion: the inversion of an FST simply switches the input and output labels. This means that the same FST can be used for both directions of a morphological processor.

composition: if T1 is an FST from I1 to O1 and T2 is an FST from O1 to O2, then the composition T1 ∘ T2 maps from I1 to O2.

We use these properties of FSTs in the creation of the FST for a morphological processor.
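Inversion is easy to see if a (finite) transducer is represented as a set of input/output string pairs: inverting just swaps each pair, so the same relation serves both generation and parsing. The pairs below are illustrative.

```python
# Inversion sketch: a finite transducer represented extensionally as a
# set of (input, output) string pairs.  Inverting swaps each pair, so
# one relation drives both generation and parsing.
T = {("cat+N+PL", "cats"), ("goose+N+PL", "geese")}

def invert(relation):
    return {(o, i) for (i, o) in relation}

GEN = dict(T)            # lexical -> surface (generation)
PARSE = dict(invert(T))  # surface -> lexical (parsing)
```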


An FST for Simple English Nominals

[The flattened diagram shows the nominal-inflection FST: arcs for reg-noun, irreg-sg-noun and irreg-pl-noun are each followed by +N:ε, and then by +SG:# or +PL:^s# for regular nouns, +SG:# for irregular singulars, and +PL:# for irregular plurals.]


FST for stems

An FST for stems, which maps roots to their root class:

reg-noun    irreg-pl-noun       irreg-sg-noun
fox         g o:e o:e s e       goose
cat         sheep               sheep
dog         m o:i u:ε s:c e     mouse

Here fox stands for f:f o:o x:x.

When these two transducers are composed, we have an FST which maps lexical forms to intermediate forms of words for simple English noun inflections. The next thing to handle is to design the FSTs for the orthographic rules, and then combine all these transducers.


Multi-Level Multi-Tape Machines

A frequently used FST idiom, called a cascade, is to have the output of one FST read in as the input to a subsequent machine.

So, to handle spelling we use three tapes: lexical, intermediate and surface.

We need one transducer to work between the lexical and intermediate levels, and a second (a bunch of FSTs) to work between the intermediate and surface levels to patch up the spelling.

lexical:       dog +N +PL
intermediate:  dog^s#
surface:       dogs


Lexical to Intermediate FST


Orthographic Rules

We need FSTs to map the intermediate level to the surface level. For each spelling rule we have an FST, and these FSTs run in parallel.

Some English spelling rules:
consonant doubling -- a single-letter consonant is doubled before -ing/-ed -- beg/begging
E deletion -- silent e is dropped before -ing and -ed -- make/making
E insertion -- e is added after s, z, x, ch, sh before s -- watch/watches
Y replacement -- y changes to ie before -s, and to i before -ed -- try/tries
K insertion -- for verbs ending in vowel + c we add k -- panic/panicked

We represent these rules using two-level morphology rules:
a => b / c __ d   (rewrite a as b when it occurs between c and d)


FST for E-Insertion Rule

E-insertion rule: ε => e / {x,s,z}^ __ s#
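The E-insertion rule can be approximated as a string rewrite on the intermediate tape, with ^ marking a morpheme boundary and # the word end as above; implementing it with a regular-expression substitution (and then deleting the remaining boundary symbols) is a simplification of the true two-level treatment.

```python
import re

# E-insertion as a rewrite on the intermediate tape: insert e between
# {x, s, z} + morpheme boundary "^" and a following s before "#".
# A regex substitution is a simplification of the two-level rule.
def e_insertion(intermediate):
    surface = re.sub(r"([xsz])\^(s)#", r"\1e\2", intermediate)
    # remaining boundary and end symbols are simply deleted
    return surface.replace("^", "").replace("#", "")
```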


Generating or Parsing with FST Lexicon and Rules


Accepting Foxes


Intersection

We can intersect all the rule FSTs to create a single FST. The intersection algorithm just takes the Cartesian product of states: for each state qi of the first machine and qj of the second machine, we create a new state qij. For an input symbol a, if the first machine would transition to state qn and the second machine would transition to qm, the new machine transitions to qnm.


Composition

A cascade can turn out to be somewhat of a pain:
it is hard to manage all the tapes
it fails to take advantage of the restricting power of the machines

So it is better to compile the cascade into a single large machine.

Create a new state (x,y) for every pair of states x ∈ Q1 and y ∈ Q2. The transition function of the composition is defined as follows:

δ((x,y), i:o) = (v,z) if there exists c such that δ1(x, i:c) = v and δ2(y, c:o) = z
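The definition above translates almost directly into code if each transducer's transition function is a dictionary keyed by (state, input, output): the composed machine steps on i:o exactly when some intermediate symbol c links the two. The two one-state example transducers are invented.

```python
# Composition following the definition in the text: (x, y) steps on
# i:o iff there exists c with delta1(x, i:c) = v and delta2(y, c:o) = z.
def compose(delta1, delta2):
    comp = {}
    for (x, i, c), v in delta1.items():
        for (y, c2, o), z in delta2.items():
            if c == c2:
                comp[((x, y), i, o)] = (v, z)
    return comp

# Invented example: T1 rewrites a -> b, T2 rewrites b -> c,
# so T1 o T2 rewrites a -> c.
t1 = {(0, "a", "b"): 0}
t2 = {(0, "b", "c"): 0}
```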


Intersect Rule FSTs

lexical tape
    | LEXICON-FST
intermediate tape
    | FST1 ... FSTn
surface tape

=> intersect the parallel rule FSTs into one: FSTR = FST1 ^ ... ^ FSTn


Compose Lexicon and Rule FSTs

lexical tape
    | LEXICON-FST
intermediate tape
    | FSTR = FST1 ^ ... ^ FSTn
surface tape

=> compose the two levels into a single machine, LEXICON-FST o FSTR, mapping the lexical tape directly to the surface level


Porter Stemming

Some applications (e.g. some information retrieval applications) do not need a whole morphological processor; they only need the stem of the word.

A stemming algorithm (such as the Porter stemming algorithm) is a lexicon-free FST: just a set of cascaded rewrite rules.

Stemming algorithms are efficient, but they may introduce errors because they do not use a lexicon.
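A lexicon-free, Porter-style stemmer can be sketched as a cascade of suffix rewrite rules; the rules below are a tiny hand-picked subset for illustration, not the real Porter rule set, and they produce exactly the kind of errors (hopping -> hopp) the text warns about.

```python
import re

# Lexicon-free cascaded-rewrite stemmer in the Porter style.
# These rules are a tiny invented subset, not the real Porter rules.
RULES = [
    (r"sses$", "ss"),   # caresses -> caress
    (r"ies$",  "i"),    # ponies   -> poni
    (r"ing$",  ""),     # hopping  -> hopp (errors like this are expected)
    (r"ed$",   ""),     # walked   -> walk
    (r"s$",    ""),     # cats     -> cat
]

def stem(word):
    """Apply the first matching suffix rule; no lexicon is consulted."""
    for pattern, repl in RULES:
        new = re.sub(pattern, repl, word)
        if new != word:
            return new
    return word
```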