Introduction to Computational Linguistics

Date post: 21-Jan-2016
Upload: muniya
Page 1: Introduction to Computational Linguistics

May 2007 CLINT-CS Finite State 1

Introduction to Computational Linguistics

Words and Finite State Machinery

Page 2: Introduction to Computational Linguistics

Acknowledgement

Material derived from/copied from
– Jurafsky and Martin, Speech and Language Processing, Prentice Hall 2000
– Richard Sproat, Lecture notes

Page 3: Introduction to Computational Linguistics

Finite State Methods

• Word-Oriented Application Areas
– Tokenization
– Sentence breaking
– Spelling correction
– Morphology (analysis/generation)
– Phonological disambiguation (Speech Recognition)
– Morphological disambiguation (“Tagging”)
– Pattern matching (“Named Entity Recognition”)
– Shallow Parsing

Page 4: Introduction to Computational Linguistics

Outline

Words

Regular Languages

Regular Expressions

Finite State Automata

Page 5: Introduction to Computational Linguistics

What is a Word?

Some Distinctions

• Written

• Spoken

• Word Type

• Word Token

Page 6: Introduction to Computational Linguistics

Information Associated with Words

• Spelling
– orthographic
– phonological

• Syntax
– POS
– Valency

• Semantics
– Meaning
– Relationship to other words

Page 7: Introduction to Computational Linguistics

Properties of Words

• Sequence
– characters, e.g. pollution
– phonemes

• Delimitation
– whitespace
– other?

• Structure
– simple ("atomic") words
– complex ("molecular") words

Page 8: Introduction to Computational Linguistics

Complex Words

• Complex words have subparts, e.g. "enlargement": en + large + ment

• Some subparts are valid words: large

• Others are prefixes and suffixes: en, ment

• N.B. The complex word can be built in different ways:
– (en + large) + ment
– en + (large + ment)

Page 9: Introduction to Computational Linguistics

Morphological Processes

• affixation
– prefix
– suffix
– circumfix: għandi - mgħandix
– infix: phenidine phenetidine

• other morphological processes
– redoubling (mexa; mexxa)
– vowel change (swim; swam)

Page 10: Introduction to Computational Linguistics

Affixation uses Concatenation

prefixes: dis, re, un, en
+
roots: large, charge, infect, code, decide
+
suffixes: ed, ing, ee, er, ly
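A minimal sketch of affixation as plain concatenation, using the prefix, root and suffix lists from this slide. The function name `concatenate` is an assumption made for illustration. The sketch deliberately overgenerates: it applies no spelling rules and no restrictions on which affixes combine with which roots.

```python
from itertools import product

# Affix and root inventories as listed on the slide.
PREFIXES = ["dis", "re", "un", "en"]
ROOTS = ["large", "charge", "infect", "code", "decide"]
SUFFIXES = ["ed", "ing", "ee", "er", "ly"]


def concatenate(prefixes, roots, suffixes):
    """Return every prefix + root + suffix string (plain concatenation)."""
    return ["".join(parts) for parts in product(prefixes, roots, suffixes)]


candidates = concatenate(PREFIXES, ROOTS, SUFFIXES)
print(len(candidates))   # 4 * 5 * 5 = 100 candidate strings
print(candidates[:3])    # e.g. "dislargeed": not a real word, and misspelt
```

Forms such as "enlargeed" show why concatenation alone is not enough: a real morphological analyser also needs spelling rules and constraints on affix combinations.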

Page 11: Introduction to Computational Linguistics

The Language of Words

• What kind of formal language is the language of words?

• One which can be constructed out of
– A characteristic set of basic symbols (alphabet)
– A characteristic set of combining operations
   • Union (disjunction)
   • Concatenation
   • Iteration

• Regular Language; Regular Sets

Page 12: Introduction to Computational Linguistics

Characterising Classes of Set

[Diagram: CLASS OF SETS or LANGUAGES, NOTATION, MACHINE]

Page 13: Introduction to Computational Linguistics

Outline

Words

Regular Languages

Regular Expressions

Finite State Automata

Page 14: Introduction to Computational Linguistics

Regular Languages

• A regular language is a language over a finite alphabet that can be built up from the empty language and the single-symbol languages by finitely many applications of the following operations:
– Set union
– Concatenation
– Kleene star (iteration, zero or more times)
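A minimal sketch of the three operations, with languages modelled as Python sets of strings. The function names are assumptions made for illustration. Since the Kleene star of a non-empty language is infinite, `star` takes a hypothetical `max_repeats` bound and returns only a finite approximation.

```python
def union(a, b):
    """Set union of two languages."""
    return a | b


def concat(a, b):
    """Every string of a followed by every string of b."""
    return {x + y for x in a for y in b}


def star(a, max_repeats):
    """Bounded approximation of Kleene star: 0..max_repeats copies."""
    result = {""}          # zero copies: the empty string
    layer = {""}
    for _ in range(max_repeats):
        layer = concat(layer, a)
        result |= layer
    return result


A = {"a"}
B = {"b"}
print(sorted(union(A, B)))           # ['a', 'b']
print(sorted(concat(A, B)))          # ['ab']
print(sorted(star(A, 3), key=len))   # ['', 'a', 'aa', 'aaa']
```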

Page 15: Introduction to Computational Linguistics

Some things that are regular languages

• Zero or more a’s followed by zero or more b’s

• The set of words in an English dictionary

• Dates

• URLs

• English?
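As a rough illustration, two of these languages can be tested with Python's `re` module. The date format used here (DD-Mon-YYYY) is an assumption chosen for the example; it checks only the shape of the string, not whether the date exists.

```python
import re

# Zero or more a's followed by zero or more b's.
A_THEN_B = re.compile(r"a*b*")

# Hypothetical date format for illustration: DD-Mon-YYYY, e.g. 21-Jan-2016.
DATE = re.compile(
    r"\d{2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{4}"
)


def in_language(pattern, s):
    """True if the whole string s matches the pattern."""
    return pattern.fullmatch(s) is not None


print(in_language(A_THEN_B, "aaabb"))     # True
print(in_language(A_THEN_B, "aba"))       # False
print(in_language(A_THEN_B, ""))          # True
print(in_language(DATE, "21-Jan-2016"))   # True
```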

Page 16: Introduction to Computational Linguistics

Some things that are not regular languages

• Zero or more a’s followed by exactly the same number of b’s

• The set of all English palindromes (e.g. Madam I'm Adam)

• The set that includes all noun phrases of the form
– the cat slept
– the cat the dog bit slept
– the cat the dog the man fed bit slept
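A small sketch contrasting the first example with its regular neighbour a*b*. The function `is_anbn` is a hypothetical helper: it compares counts of unbounded size, which is exactly what a machine with a fixed, finite number of states cannot do.

```python
import re


def is_anbn(s):
    """True if s is n a's followed by exactly n b's (n >= 0)."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n


# The regular language a*b* accepts a superset of these strings.
LOOSE = re.compile(r"a*b*")

for s in ["", "ab", "aabb", "aab"]:
    print(repr(s), is_anbn(s), LOOSE.fullmatch(s) is not None)
```

The string "aab" is accepted by a*b* but rejected by the counting check.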

Page 17: Introduction to Computational Linguistics

Some special regular languages

• The universal language (Σ*)

• The empty language (Ø)

Note: the empty language is not the same as the empty string
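The distinction can be made concrete with sets of strings. This is a sketch; the names are chosen for illustration only.

```python
EMPTY_LANGUAGE = set()          # contains no strings at all
EMPTY_STRING_LANGUAGE = {""}    # contains exactly one string: the empty string


def concat(a, b):
    """Every string of a followed by every string of b."""
    return {x + y for x in a for y in b}


L = {"cat", "dog"}
print(concat(L, EMPTY_STRING_LANGUAGE) == L)       # True: "" is the identity
print(concat(L, EMPTY_LANGUAGE) == set())          # True: nothing to append
print(len(EMPTY_LANGUAGE), len(EMPTY_STRING_LANGUAGE))   # 0 1
```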

Page 18: Introduction to Computational Linguistics

Some closure properties of regular languages

• Intersection

• Complementation

• Difference

• Reversal

• Power
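Closure under intersection can be sketched with the standard product construction, which runs two deterministic automata in lockstep. The dictionary layout and the two example automata below are assumptions made for illustration, not taken from the slides.

```python
def accepts(dfa, s):
    """Run a (possibly partial) DFA over s."""
    state = dfa["start"]
    for ch in s:
        state = dfa["delta"].get((state, ch))
        if state is None:
            return False
    return state in dfa["finals"]


def intersect(d1, d2):
    """Product construction: states are pairs (state of d1, state of d2)."""
    start = (d1["start"], d2["start"])
    delta, finals = {}, set()
    todo, seen = [start], {start}
    while todo:
        p, q = todo.pop()
        if p in d1["finals"] and q in d2["finals"]:
            finals.add((p, q))
        for (src, ch), dst1 in d1["delta"].items():
            if src != p or (q, ch) not in d2["delta"]:
                continue
            nxt = (dst1, d2["delta"][(q, ch)])
            delta[((p, q), ch)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return {"start": start, "delta": delta, "finals": finals}


# Strings over {a, b} with an even number of a's.
EVEN_A = {"start": 0, "finals": {0},
          "delta": {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}}
# Strings over {a, b} that end in b.
ENDS_B = {"start": 0, "finals": {1},
          "delta": {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}}

BOTH = intersect(EVEN_A, ENDS_B)
print(accepts(BOTH, "aab"))   # True: two a's, ends in b
print(accepts(BOTH, "ab"))    # False: odd number of a's
print(accepts(BOTH, "aa"))    # False: does not end in b
```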

Page 19: Introduction to Computational Linguistics

Characterising Classes of Set

[Diagram: CLASS OF SETS or LANGUAGES, NOTATION, MACHINE]

Page 20: Introduction to Computational Linguistics

Outline

Words

Regular Languages

Regular Expressions

Finite Automata

Page 21: Introduction to Computational Linguistics

Regular Expressions

• Notation for describing regular sets

• Used extensively in the Unix operating system (grep, sed, etc.) and also in some Microsoft products (Word)

• Xerox Finite State tools use a somewhat different notation with a similar function.

Page 22: Introduction to Computational Linguistics

Regular Expressions

a        a simple symbol
A B      concatenation
A | B    alternation operator
A & B    intersection operator
A*       Kleene star

Page 23: Introduction to Computational Linguistics

Caveats

• Perl and other languages (see J&M, Chapter 2) have lots of stuff in their “regular expression” syntax. Strictly speaking, not all of these correspond to regular expressions in the formal sense since they don’t describe regular languages.

• For example, arbitrary substring copying is not expressible as a regular language, though one can do this in Perl (or Python …)

/(…+)\1/
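A short Python sketch of substring copying with a backreference. The pattern `(.+)\1` is an assumed illustrative pattern (one or more characters immediately followed by a copy of themselves); it is not a transcription of the pattern on the slide.

```python
import re

# Assumed illustrative pattern: a substring of one or more characters,
# immediately followed by a copy of itself. The backreference \1 is what
# takes this beyond the regular languages.
COPY = re.compile(r"(.+)\1")

for word in ["bonbon", "murmur", "cat"]:
    m = COPY.search(word)
    print(word, m.group(1) if m else None)
```

For "bonbon" the captured group is "bon"; "cat" contains no repeated substring and yields no match.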

Page 24: Introduction to Computational Linguistics

Characterising Classes of Set

[Diagram: CLASS OF SETS or LANGUAGES, NOTATION, MACHINE]

Page 25: Introduction to Computational Linguistics

Outline

Words

Regular Languages

Regular Expressions

Finite Automata

Page 26: Introduction to Computational Linguistics

Finite Automaton

• A finite automaton is a quintuple (Q, Σ, q0, F, δ) where:

• Q is a finite set of states

• Σ is an alphabet of symbols

• q0 ∈ Q is a start state

• F ⊆ Q is the set of final states

• δ is a transition relation δ(q, σ, q') between a state q ∈ Q, a symbol σ ∈ Σ and a state q' ∈ Q
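A minimal sketch of the quintuple as a Python data structure, with δ stored as a set of (q, σ, q') triples. The class name, field names and the `accepts` method are assumptions made for illustration; the example arcs mirror the small "ha!" automaton given in Prolog later in this deck. ε transitions are not handled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FiniteAutomaton:
    states: frozenset      # Q
    alphabet: frozenset    # Sigma
    start: int             # q0, a member of Q
    finals: frozenset      # F, a subset of Q
    delta: frozenset       # transition relation: set of (q, symbol, q') triples

    def accepts(self, text):
        """Track the set of states reachable after each symbol."""
        current = {self.start}
        for symbol in text:
            current = {q2 for (q1, s, q2) in self.delta
                       if q1 in current and s == symbol}
        return bool(current & self.finals)


LAUGH = FiniteAutomaton(
    states=frozenset({1, 2, 3, 4}),
    alphabet=frozenset("ha!"),
    start=1,
    finals=frozenset({4}),
    delta=frozenset({(1, "h", 2), (2, "a", 3), (3, "!", 4), (3, "h", 2)}),
)

print(LAUGH.accepts("ha!"))     # True
print(LAUGH.accepts("haha!"))   # True
print(LAUGH.accepts("ha"))      # False
```

Because δ is a relation rather than a function, `accepts` keeps a set of current states instead of a single state.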

Page 27: Introduction to Computational Linguistics

Representation of FSA’s: State Diagram

Page 28: Introduction to Computational Linguistics

State Table

Page 29: Introduction to Computational Linguistics

Prolog

initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(3,2,h).

[State diagram: 1 (initial) -h-> 2 -a-> 3 -!-> 4 (final), with an h arc from 3 back to 2]

Page 30: Introduction to Computational Linguistics

Mr. S.K.

Page 31: Introduction to Computational Linguistics

Kleene’s theorem

• The languages accepted by NFAs are exactly the languages described by regular expressions.

• Kleene’s Theorem, part 1: To each regular expression there corresponds an NFA.

• Kleene’s Theorem, part 2: To each NFA there corresponds a regular expression.

Page 32: Introduction to Computational Linguistics

Converting a Regular Expression to an NFA

• The NFA representing the empty string is:

1 -ε-> 2

• The NFA representing a single character is:

1 -a-> 2

Page 33: Introduction to Computational Linguistics

Converting a Regular Expression to an NFA

• The union operator is represented by a choice of paths from a node, e.g. a|b

[Diagram: two arcs from state 1 to state 2, one labelled a and one labelled b]

Page 34: Introduction to Computational Linguistics

Converting a Regular Expression to an NFA

• Concatenation simply involves connecting one NFA to the other, so that ab is represented by

1 -a-> 2 -b-> 3

Page 35: Introduction to Computational Linguistics

Converting a Regular Expression to an NFA

• The Kleene star must allow for zero or more occurrences. So a* is represented by

[Diagram: states 1, 2 and 3 joined by one arc labelled a and three ε arcs]
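The conversion steps for single symbols, union, concatenation and Kleene star can be sketched as follows. This follows the textbook Thompson-style construction, which adds fresh start and final states joined by ε arcs; it therefore differs in detail from the slide diagrams, which draw fewer states. All function names, and the representation of an NFA as a (start, final, arcs) triple, are assumptions made for illustration.

```python
from itertools import count

EPS = None            # label used for epsilon arcs
_ids = count(1)       # source of fresh state numbers


def symbol(a):
    """NFA for a single character: start -a-> final."""
    s, f = next(_ids), next(_ids)
    return s, f, [(s, a, f)]


def union(m, n):
    """Choice of paths: new start branches to both, both join a new final."""
    (s1, f1, a1), (s2, f2, a2) = m, n
    s, f = next(_ids), next(_ids)
    arcs = a1 + a2 + [(s, EPS, s1), (s, EPS, s2), (f1, EPS, f), (f2, EPS, f)]
    return s, f, arcs


def concat(m, n):
    """Connect the final state of m to the start state of n."""
    (s1, f1, a1), (s2, f2, a2) = m, n
    return s1, f2, a1 + a2 + [(f1, EPS, s2)]


def star(m):
    """Zero or more occurrences: allow skipping m and looping back into it."""
    s1, f1, a1 = m
    s, f = next(_ids), next(_ids)
    arcs = a1 + [(s, EPS, s1), (f1, EPS, s1), (s, EPS, f), (f1, EPS, f)]
    return s, f, arcs


def closure(states, arcs):
    """All states reachable from `states` through epsilon arcs only."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for (p, label, r) in arcs:
            if p == q and label is EPS and r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def accepts(nfa, text):
    start, final, arcs = nfa
    current = closure({start}, arcs)
    for ch in text:
        step = {r for (p, label, r) in arcs if p in current and label == ch}
        current = closure(step, arcs)
    return final in current


# NFA for the regular expression (a|b)* a
NFA = concat(star(union(symbol("a"), symbol("b"))), symbol("a"))
print(accepts(NFA, "a"))      # True
print(accepts(NFA, "abba"))   # True
print(accepts(NFA, "ab"))     # False
print(accepts(NFA, ""))       # False
```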

Page 36: Introduction to Computational Linguistics

Deterministic versus non-deterministic finite automata

• The definition of finite-state automata given above was for non-deterministic finite automata (NFA):

• δ is a relation, meaning that from any state and given any symbol, one can in principle transition to any number of states.

• In deterministic finite automata (DFA), every state/symbol pair maps to a unique state

• In other words, δ is a function

Page 37: Introduction to Computational Linguistics

A deterministic automaton

Page 39: Introduction to Computational Linguistics

NFAs vs DFAs

• NFAs are typically smaller and simpler than their equivalent DFAs

• Why do we care about DFA’s?

• EFFICIENCY!

Page 40: Introduction to Computational Linguistics

Equivalence of NFA’s and DFA’s

Page 41: Introduction to Computational Linguistics

Subset Construction for Determinisation

• Any two states that are connected by an ε transition may as well be the same, since we can move from one to the other without consuming any character.

• Thus states which are connected by an ε transition will be represented by the same state in the DFA.

• If there are multiple transitions based on the same symbol, then we can regard a transition as moving from a state to a set of states (i.e. the union of all those states reachable by a transition on the current symbol).

• Thus these states will be combined into a single DFA state.

• More details: http://www.cs.may.ie/~jpower/Courses/parsing/node9.html
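The points above can be sketched as a small subset construction. Each DFA state is a frozenset of NFA states; ε-connected states are merged via an ε-closure. The function names and the example NFA (strings over {a, b} ending in "ab") are assumptions made for illustration.

```python
EPS = None   # label used for epsilon arcs


def eps_closure(states, arcs):
    """All NFA states reachable from `states` without consuming a symbol."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for (p, label, r) in arcs:
            if p == q and label is EPS and r not in seen:
                seen.add(r)
                stack.append(r)
    return frozenset(seen)


def determinise(start, finals, arcs):
    """Subset construction: returns (start, finals, delta) of the DFA."""
    alphabet = {label for (_, label, _) in arcs if label is not EPS}
    d_start = eps_closure({start}, arcs)
    d_delta, d_finals = {}, set()
    todo, seen = [d_start], {d_start}
    while todo:
        current = todo.pop()
        if current & finals:
            d_finals.add(current)
        for symbol in alphabet:
            # Union of all states reachable on this symbol.
            targets = {r for (p, label, r) in arcs
                       if p in current and label == symbol}
            if not targets:
                continue
            nxt = eps_closure(targets, arcs)
            d_delta[(current, symbol)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return d_start, d_finals, d_delta


# NFA for strings over {a, b} ending in "ab": state 0 has two a-arcs.
ARCS = [(0, "a", 0), (0, "b", 0), (0, "a", 1), (1, "b", 2)]
d_start, d_finals, d_delta = determinise(0, {2}, ARCS)


def accepts(text):
    state = d_start
    for ch in text:
        state = d_delta.get((state, ch))
        if state is None:
            return False
    return state in d_finals


print(len({d_start} | set(d_delta.values())))   # 3 DFA states
print(accepts("aab"), accepts("aba"))           # True False
```

In the resulting DFA every state/symbol pair leads to at most one state, so recognition needs only one table lookup per input symbol.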

Page 42: Introduction to Computational Linguistics

Xerox Tools

Finite State Machinery

Page 43: Introduction to Computational Linguistics

The Xerox Approach

• Lauri Karttunen, Martin Kay, Ronald Kaplan, Kimmo Koskenniemi.

• Meta-languages for describing regular languages and regular relations.

• Compiler for mapping meta-language "programs" into efficient FS machinery

• Several tools and applications

Page 45: Introduction to Computational Linguistics

xfst

• xfst is a general tool for creating and manipulating finite state networks, both simple automata and transducers.

• xfst and other Xerox tools employ a special "xfst notation" (more powerful than that used in Unix, Perl, C# etc.)

Page 46: Introduction to Computational Linguistics

Simple Regular Expressions

• Atomic Expressions
– Simple Symbols
– Multicharacter Symbols

• Complex Expressions
– Union
– Intersection
– Concatenation

Page 47: Introduction to Computational Linguistics

xfst Notation Examples

A|B     Union
A&B     Intersection
A B     Concatenation
A*      Closure (Kleene Star)
(A)     Optional Element
?       Any symbol
\b      Any symbol other than b
~A      Complement (= [?* - A ])
0       Empty string language
$A      [ ?* A ?* ]

Page 48: Introduction to Computational Linguistics

Concatenation over Reg. Expression and Language

Regular Expression

E1 = [a|b]
E2 = [c|d]
E1 E2 = [a|b] [c|d]

Language

L1 = {"a", "b"}
L2 = {"c", "d"}
L1 L2 = {"ac", "ad", "bc", "bd"}

Page 49: Introduction to Computational Linguistics

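Concatenation over FS Automata

[Diagram: an automaton with arcs a, b + an automaton with arcs c, d, and the concatenated automaton with arcs a, b followed by arcs c, d]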

Page 50: Introduction to Computational Linguistics

Simple Commands

• In addition to the notation there are also commands, e.g.
– define: give a name to an RE
– print: print information
– read: read information
– various stack operations
– file interaction
– various command line options

Page 51: Introduction to Computational Linguistics

define command

• define name regexp

xfst[0]: define foo [d o g] | [c a t];

xfst[0]: define R1 [a | b | c | d];

xfst[0]: define R2 [d | e | f | g];

xfst[0]: define R3 [f | g | h | i | j];

Page 52: Introduction to Computational Linguistics

print command

• print words name - see the words in the language called name

• print net name - see detailed information about the network name.

xfst[0]: print words foo;

xfst[0]: define baz R1 & R2;

xfst[0]: print net baz;

Page 53: Introduction to Computational Linguistics

Stack Example

xfst[0]: clear stack;
xfst[0]: read regex [e d | i n g | s | []];
xfst[1]: read regex [t a l k | k i c k];
xfst[2]: print stack
xfst[2]: print net
xfst[2]: print words
xfst[2]: concatenate net
xfst[1]: print words
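A rough Python simulation of this session, with each network modelled as a finite set of strings. This is a sketch only: it assumes that "concatenate net" replaces the networks on the stack with their concatenation, taking the top of the stack first, so that the stems pushed last come before the suffixes pushed first. The helper `concat` is hypothetical and is not part of xfst.

```python
def concat(a, b):
    """Every string of a followed by every string of b."""
    return {x + y for x in a for y in b}


stack = []                              # clear stack
stack.append({"ed", "ing", "s", ""})    # read regex [e d | i n g | s | []]
stack.append({"talk", "kick"})          # read regex [t a l k | k i c k]

# Assumed behaviour of "concatenate net": pop every network and concatenate
# them in order, starting from the top of the stack.
result = stack.pop()
while stack:
    result = concat(result, stack.pop())
stack.append(result)

print(len(stack))          # 1 network left, matching the xfst[1] prompt
print(sorted(stack[-1]))   # the eight stem + suffix combinations
```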

Page 54: Introduction to Computational Linguistics

lexc?

[Diagram: Source File -> lexc -> Compiled Network]

lexc is a high-level programming language and compiler that is well suited for defining natural-language lexicons.

The output is a compiled finite-state network in the same format as that used by other Xerox tools (xfst, twolc).

Page 55: Introduction to Computational Linguistics

lexc source file

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! ex0-lex.txt
LEXICON Root
dine #;
dines #;
dined #;
line #;
lines #;
lined #;
END
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Page 56: Introduction to Computational Linguistics

Lexc Sublexicons

! ex1-lex.txt
LEXICON Root
Noun;
Verb;
LEXICON Noun
line NounSuffix;
LEXICON Verb
dine VerbSuffix;
line VerbSuffix;
LEXICON NounSuffix
s #;
#;
LEXICON VerbSuffix
s #;
d #;
#;

Page 57: Introduction to Computational Linguistics

lexc

• The resulting lexicon contains the same six words

• The form lines actually gets constructed twice, once as a verb, once as a noun.

• After minimization, only one of them remains.

• The compiler first processes each sublexicon separately, keeping track of continuation pointers, and then joins the structures to a single network which is determinized and minimized.
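A minimal sketch of how continuation classes spell out words, using the sublexicons from ex1-lex.txt. It simply enumerates paths through a Python dictionary; it is not the lexc compilation algorithm, and the names `LEXICONS` and `paths` are assumptions made for illustration.

```python
# Each sublexicon maps to a list of (form, continuation class) entries.
# "#" marks the end of a word; "" is an entry with no form of its own.
LEXICONS = {
    "Root": [("", "Noun"), ("", "Verb")],
    "Noun": [("line", "NounSuffix")],
    "Verb": [("dine", "VerbSuffix"), ("line", "VerbSuffix")],
    "NounSuffix": [("s", "#"), ("", "#")],
    "VerbSuffix": [("s", "#"), ("d", "#"), ("", "#")],
}


def paths(lexicon="Root", prefix=""):
    """Yield every string spelled out by following continuation classes."""
    for form, continuation in LEXICONS[lexicon]:
        if continuation == "#":
            yield prefix + form
        else:
            yield from paths(continuation, prefix + form)


all_paths = list(paths())
words = sorted(set(all_paths))
print(len(all_paths))   # 8 paths through the sublexicons
print(words)            # 6 distinct words
```

Forms reachable as both noun and verb appear on more than one path; taking the set of strings plays the role that determinization and minimization play in the compiled network.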

Page 58: Introduction to Computational Linguistics

Resulting FSA

[State diagram: d or l, then i, n, e, optionally followed by s or d]

Page 59: Introduction to Computational Linguistics

Running lexc

lexc> compile-source ex1-lex.txt
Opening 'ex1-lex.txt'...
Root...2, Noun...1, Verb...2, NounSuffix...2, VerbSuffix...3
Building lexicon...Minimizing...Done!
SOURCE: 6 states, 7 arcs, 6 words
lexc>

Page 60: Introduction to Computational Linguistics

Conclusion

[Diagram: CLASS OF SETS or LANGUAGES, NOTATION, MACHINE; tools: xfst, lexc]

