May 2007 CLINT-CS Finite State 1
Introduction to Computational Linguistics
Words and
Finite State Machinery
Acknowledgement
Material derived from/copied from
– Jurafsky and Martin, Speech and Language Processing, Prentice Hall 2000
– Richard Sproat, Lecture notes
Finite State Methods
• Word-Oriented Application Areas
– Tokenization
– Sentence breaking
– Spelling correction
– Morphology (analysis/generation)
– Phonological disambiguation (Speech Recognition)
– Morphological disambiguation (“Tagging”)
– Pattern matching (“Named Entity Recognition”)
– Shallow Parsing
Outline
Words
Regular Languages
Regular Expressions
Finite State Automata
What is a Word?
Some Distinctions
• Written
• Spoken
• Word Type
• Word Token
Information Associated with Words
• Spelling
– orthographic
– phonological
• Syntax
– POS
– Valency
• Semantics
– Meaning
– Relationship to other words
Properties of Words
• Sequence
– characters (e.g. pollution)
– phonemes
• Delimitation
– whitespace
– other?
• Structure
– simple ("atomic") words
– complex ("molecular") words
Complex Words
• Complex words have subparts, e.g. "enlargement": en + large + ment
• Some subparts are valid words: large
• Others are prefixes and suffixes: en, ment
• N.B. The complex word can be built in different ways:
– (en + large) + ment
– en + (large + ment)
Morphological Processes
• affixation
– prefix
– suffix
– circumfix: għandi - mgħandix
– infix: phenidine phenetidine
• other morphological processes
– redoubling (mexa; mexxa)
– vowel change (swim; swam)
Affixation uses Concatenation
prefixes + roots + suffixes
• prefixes: dis, re, un, en
• roots: large, charge, infect, code, decide
• suffixes: ed, ing, ee, er, ly
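As a rough illustration of affixation by concatenation, the sketch below glues every prefix to every root and every suffix. The affix lists are the ones on the slide; the function name `concatenate_all` is ours. Plain concatenation produces many strings that are not English words, so this shows only the combining operation, not a usable morphology.

```python
from itertools import product

# Affix inventories taken from the slide.
PREFIXES = ["dis", "re", "un", "en"]
ROOTS = ["large", "charge", "infect", "code", "decide"]
SUFFIXES = ["ed", "ing", "ee", "er", "ly"]

def concatenate_all(prefixes, roots, suffixes):
    """Return every prefix + root + suffix string (hypothetical helper)."""
    return ["".join(parts) for parts in product(prefixes, roots, suffixes)]

candidates = concatenate_all(PREFIXES, ROOTS, SUFFIXES)
print(len(candidates))               # 4 * 5 * 5 = 100 strings
print("disinfected" in candidates)   # dis + infect + ed
```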
The Language of Words
• What kind of formal language is the language of words?
• One which can be constructed out of
– A characteristic set of basic symbols (alphabet)
– A characteristic set of combining operations: union (disjunction), concatenation, iteration
• Regular Language; Regular Sets
Characterising Classes of Set
[Diagram relating three notions: CLASS OF SETS or LANGUAGES, NOTATION, MACHINE]
Outline
Words
Regular Languages
Regular Expressions
Finite State Automata
Regular Languages
• A regular language is a language over a finite alphabet that can be built from finite sets of strings using the following operations:
– Set union
– Concatenation
– Reflexive-transitive closure under concatenation (Kleene star)
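The three operations can be tried out on small finite languages represented as Python sets of strings. This is only a sketch: the function names are ours, and because the Kleene star of a non-empty language is infinite, `star` is cut off after `max_repeats` repetitions.

```python
def union(l1, l2):
    """Set union of two languages."""
    return l1 | l2

def concat(l1, l2):
    """Every string of l1 followed by every string of l2."""
    return {x + y for x in l1 for y in l2}

def star(lang, max_repeats):
    """Bounded approximation of Kleene star: 0..max_repeats repetitions."""
    result = {""}      # zero repetitions yields the empty string
    layer = {""}
    for _ in range(max_repeats):
        layer = concat(layer, lang)
        result |= layer
    return result

a, b = {"a"}, {"b"}
# Zero to three a's followed by zero to three b's.
a_star_b_star = concat(star(a, 3), star(b, 3))
print(sorted(a_star_b_star, key=lambda s: (len(s), s)))
```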
Some things that are regular languages
• Zero or more a’s followed by zero or more b’s
• The set of words in an English dictionary
• Dates
• URLs
• English?
Some things that are not regular languages
• Zero or more a’s followed by exactly the same number of b’s
• The set of all English palindromes (e.g. Madam I'm Adam)
• The set that includes all sentences of the form
– the cat slept
– the cat the dog bit slept
– the cat the dog the man fed bit slept
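The difference between the regular language a*b* and the non-regular language of equal numbers of a's and b's can be sketched in Python. The first test is a plain regular expression; the second has to compare two counts, which a machine with a fixed number of states cannot do for arbitrarily long strings. Function names are ours.

```python
import re

A_STAR_B_STAR = re.compile(r"a*b*")

def in_a_star_b_star(s):
    """Regular: zero or more a's followed by zero or more b's."""
    return A_STAR_B_STAR.fullmatch(s) is not None

def in_an_bn(s):
    """Not regular: n a's followed by exactly n b's (needs counting)."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

print(in_a_star_b_star("aaab"), in_an_bn("aaab"))   # True False
print(in_a_star_b_star("aabb"), in_an_bn("aabb"))   # True True
```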
Some special regular languages
• The universal language (Σ*)
• The empty language (Ø)
Note: the empty language is not the same as the empty string
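The note can be made concrete with Python sets standing in for languages (a sketch, with names of our choosing): the empty language contains no strings at all, while the language containing only the empty string has exactly one member. They also behave differently under concatenation.

```python
EMPTY_LANGUAGE = set()          # Ø: no strings at all
EMPTY_STRING_LANGUAGE = {""}    # one string, the empty string

def concat(l1, l2):
    """Every string of l1 followed by every string of l2."""
    return {x + y for x in l1 for y in l2}

words = {"talk", "kick"}
print(concat(EMPTY_LANGUAGE, words))                    # set()
print(concat(EMPTY_STRING_LANGUAGE, words) == words)    # True
```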
Some closure properties of regular languages
• Intersection
• Complementation
• Difference
• Reversal
• Power
Characterising Classes of Set
[Diagram relating three notions: CLASS OF SETS or LANGUAGES, NOTATION, MACHINE]
Outline
Words
Regular Languages
Regular Expressions
Finite Automata
Regular Expressions
• Notation for describing regular sets
• Used extensively in the Unix operating system (grep, sed, etc.) and also in some Microsoft products (Word)
• Xerox Finite State tools use a somewhat different notation, but similar function.
Regular Expressions
a a simple symbol
A B concatenation
A | B alternation operator
A & B intersection operator
A* Kleene star
Caveats
• Perl and other languages (see J&M, Chapter 2) have lots of stuff in their “regular expression” syntax. Strictly speaking, not all of these correspond to regular expressions in the formal sense since they don’t describe regular languages.
• For example, arbitrary substring copying is not expressible as a regular language, though one can do this in Perl (or Python …)
/(…+)\1/
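A small Python sketch of substring copying with a backreference. The pattern `(.+)\1` is our own choice of a representative example: it matches any non-empty string immediately followed by a copy of itself, which no finite automaton can check for unbounded lengths.

```python
import re

# \1 must repeat exactly what group 1 matched.
COPY = re.compile(r"(.+)\1")

def is_copy(s):
    """True if s consists of some non-empty string written twice."""
    return COPY.fullmatch(s) is not None

print(is_copy("abcabc"))   # True
print(is_copy("abcab"))    # False
```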
Characterising Classes of Set
[Diagram relating three notions: CLASS OF SETS or LANGUAGES, NOTATION, MACHINE]
Outline
Words
Regular Languages
Regular Expressions
Finite Automata
Finite Automaton
• A finite automaton is a quintuple (Q, Σ, q0, F, δ) where:
• Q is a finite set of states
• Σ is an alphabet of symbols
• q0 ∈ Q is a start state
• F ⊆ Q is the set of final states
• δ is a transition relation δ(q, σ, q') between a state q ∈ Q, a symbol σ ∈ Σ and a state q' ∈ Q
Representation of FSAs: State Diagram
State Table
Prolog
initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(3,2,h).
[State diagram: 1 (initial) --h--> 2 --a--> 3 --!--> 4 (final), plus an arc 3 --h--> 2]
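The same automaton can be run in a few lines of Python. This is a sketch rather than a translation of any Prolog recogniser: the arcs are stored in a dictionary keyed on (state, symbol), which works here because no state has two arcs with the same label. The name `accepts` is ours.

```python
INITIAL = 1
FINALS = {4}
# (source state, symbol) -> target state, copied from the arc/3 facts.
ARCS = {(1, "h"): 2, (2, "a"): 3, (3, "!"): 4, (3, "h"): 2}

def accepts(s):
    """Follow one arc per input symbol; accept if we stop in a final state."""
    state = INITIAL
    for symbol in s:
        key = (state, symbol)
        if key not in ARCS:
            return False        # no arc for this symbol: reject
        state = ARCS[key]
    return state in FINALS

print(accepts("ha!"))      # True
print(accepts("hahaha!"))  # True
print(accepts("ha"))       # False
```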
Mr. S.K.
Kleene’s theorem
• The languages accepted by NFAs are exactly the languages described by regular expressions.
• Kleene’s Theorem, part 1: To each regular expression there corresponds an NFA.
• Kleene’s Theorem, part 2: To each NFA there corresponds a regular expression.
Converting a Regular Expression to an NFA
• The NFA representing the empty string is: 1 --ε--> 2
• The NFA representing a single character is: 1 --a--> 2
• The union operator is represented by a choice of paths from a node, e.g. a|b: 1 --a--> 2 and 1 --b--> 2
• Concatenation simply involves connecting one NFA to the other, so that ab is represented by: 1 --a--> 2 --b--> 3
• The Kleene star must allow for zero or more occurrences. So a* is represented by: [State diagram: an a-arc between two states, with ε-arcs that allow the a-arc to be skipped or repeated]
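The conversion can be sketched in Python. Regular expressions are given as nested tuples (our own hypothetical encoding), and every operator adds ε-arcs between fresh states in the style of Thompson's construction, so the union fragment is slightly larger than the two-state picture on the slide. `None` stands for ε.

```python
from itertools import count

EPS = None  # label used for ε-arcs

def build(node, arcs, fresh):
    """Append arcs for `node` to `arcs`; return (start, end) of the fragment."""
    kind = node[0]
    if kind in ("eps", "sym"):
        start, end = next(fresh), next(fresh)
        arcs.append((start, EPS if kind == "eps" else node[1], end))
        return start, end
    if kind == "concat":
        s1, e1 = build(node[1], arcs, fresh)
        s2, e2 = build(node[2], arcs, fresh)
        arcs.append((e1, EPS, s2))          # connect one NFA to the other
        return s1, e2
    if kind == "union":
        start, end = next(fresh), next(fresh)
        for child in node[1:]:              # a choice of paths
            s, e = build(child, arcs, fresh)
            arcs.extend([(start, EPS, s), (e, EPS, end)])
        return start, end
    if kind == "star":
        start, end = next(fresh), next(fresh)
        s, e = build(node[1], arcs, fresh)
        # enter, repeat, leave, or skip the inner fragment entirely
        arcs.extend([(start, EPS, s), (e, EPS, s), (e, EPS, end), (start, EPS, end)])
        return start, end
    raise ValueError(f"unknown node kind: {kind}")

def closure(states, arcs):
    """All states reachable from `states` using ε-arcs only."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for src, label, dst in arcs:
            if src == q and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def accepts(regex, text):
    """Build the NFA for `regex` and simulate it on `text`."""
    arcs = []
    start, end = build(regex, arcs, count(1))
    current = closure({start}, arcs)
    for ch in text:
        moved = {dst for src, label, dst in arcs if src in current and label == ch}
        current = closure(moved, arcs)
    return end in current

A_STAR_B = ("concat", ("star", ("sym", "a")), ("sym", "b"))   # a*b
print(accepts(A_STAR_B, "aab"))   # True
print(accepts(A_STAR_B, "ba"))    # False
```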
Deterministic versus non-deterministic finite automata
• The definition of finite-state automata given above was for non-deterministic finite automata (NFA):
• δ is a relation, meaning that from any state and given any symbol, one can in principle transition to any number of states.
• In deterministic finite automata (DFA), every state/symbol pair maps to a unique state
• In other words, δ is a function
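One way to see the distinction is to test whether a list of arcs behaves like a function. The sketch below (names ours, `None` standing for ε) only checks that no state/symbol pair has two different targets and that there are no ε-arcs; it does not check that every pair has a target, which some definitions of a DFA also require.

```python
def is_deterministic(arcs):
    """True if arcs (src, symbol, dst) give at most one target per (src, symbol)."""
    targets = {}
    for src, symbol, dst in arcs:
        if symbol is None:                      # an ε-arc is a free choice
            return False
        if targets.setdefault((src, symbol), dst) != dst:
            return False                        # two targets for one pair
    return True

laugh = [(1, "h", 2), (2, "a", 3), (3, "!", 4), (3, "h", 2)]
ends_in_ab = [(1, "a", 1), (1, "b", 1), (1, "a", 2), (2, "b", 3)]
print(is_deterministic(laugh))        # True
print(is_deterministic(ends_in_ab))   # False: state 1 has two a-arcs
```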
A deterministic automaton
NFAs vs DFAs
• NFAs are typically smaller and simpler than their equivalent DFAs
• Why do we care about DFAs?
• EFFICIENCY!
Equivalence of NFA’s and DFA’s
Subset Construction for Determinisation
• If a state has an ε transition to another state, the automaton can be in either of them at that point, since it can move from the first to the second without consuming any character.
• Thus a state and all the states reachable from it by ε transitions (its ε-closure) are represented by a single state in the DFA.
• If there are multiple transitions on the same symbol, then we can regard a transition as moving from a state to a set of states (i.e. the union of all those states reachable by a transition on the current symbol).
• Thus these states will be combined into a single DFA state.
• more details http://www.cs.may.ie/~jpower/Courses/parsing/node9.html
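The construction described above can be sketched in Python as follows. Arcs are (source, label, target) triples with `None` standing for ε; each DFA state is a frozenset of NFA states. All names are ours, and the example NFA (strings over a, b that end in ab) is chosen only for illustration.

```python
from collections import deque

EPS = None

def eps_closure(states, arcs):
    """States reachable from `states` without consuming a character."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for src, label, dst in arcs:
            if src == q and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return frozenset(seen)

def determinize(start, finals, arcs):
    """Return (start, finals, transition dict) of an equivalent DFA."""
    alphabet = sorted({label for _, label, _ in arcs if label is not EPS})
    d_start = eps_closure({start}, arcs)
    d_arcs, d_finals = {}, set()
    queue, seen = deque([d_start]), {d_start}
    while queue:
        current = queue.popleft()
        if current & finals:                 # contains an NFA final state
            d_finals.add(current)
        for symbol in alphabet:
            moved = {dst for src, label, dst in arcs
                     if src in current and label == symbol}
            if not moved:
                continue
            target = eps_closure(moved, arcs)  # combined into one DFA state
            d_arcs[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return d_start, d_finals, d_arcs

def dfa_accepts(dfa, text):
    state, finals, arcs = dfa
    for ch in text:
        state = arcs.get((state, ch))
        if state is None:
            return False
    return state in finals

# NFA for strings ending in "ab": state 1 has two a-arcs.
nfa_arcs = [(1, "a", 1), (1, "b", 1), (1, "a", 2), (2, "b", 3)]
dfa = determinize(1, {3}, nfa_arcs)
print(dfa_accepts(dfa, "bbab"))   # True
print(dfa_accepts(dfa, "aba"))    # False
```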
Xerox Tools
Finite State Machinery
The Xerox Approach
• Lauri Karttunen, Martin Kay, Ronald Kaplan, Kimmo Koskenniemi.
• Meta-languages for describing regular languages and regular relations.
• Compiler for mapping meta-language "programs" into efficient FS machinery
• Several tools and applications
Xerox Tools
• xfst: Xerox Finite-State Tool
• lexc: Finite-State Lexicon Compiler
• twolc: Two-Level Rule Compiler
xfst
• xfst is a general tool for creating and manipulating finite state networks, both simple automata and transducers.
• xfst and other Xerox tools employ a special "xfst notation" (more powerful than that used in Unix, Perl, C# etc.)
Simple Regular Expressions
• Atomic Expressions
– Simple Symbols
– Multicharacter Symbols
• Complex Expressions
– Union
– Intersection
– Concatenation
xfst Notation Examples
A|B    Union
A&B    Intersection
A B    Concatenation
A*     Closure (Kleene Star)
(A)    Optional Element
?      Any symbol
\b     Any symbol other than b
~A     Complement (= [?* - A ])
0      Empty string language
$A     [ ?* A ?* ]
Concatenation over Reg. Expression and Language
Regular Expression
E1 = [a|b]
E2 = [c|d]
E1 E2 = [a|b] [c|d]
Language
L1 = {"a", "b"}
L2 = {"c", "d"}
L1 L2 = {"ac", "ad", "bc", "bd"}
Concatenation over FS Automata
[Figure: the automaton for a|b concatenated (+) with the automaton for c|d, giving an automaton for [a|b] [c|d]]
Simple Commands
• In addition to the notation there are also commands, e.g.
– define: give a name to an RE
– print: print information
– read: read information
– various stack operations
– file interaction
– various command line options
define command
• define name regexp
xfst[0]: define foo [d o g] | [c a t];
xfst[0]: define R1 [a | b | c | d];
xfst[0]: define R2 [d | e | f | g];
xfst[0]: define R3 [f | g | h | i | j];
print command
• print words name - see the words in the language called name
• print net name - see detailed information about the network name.
xfst[0]: print words foo;
xfst[0]: define baz R1 & R2;
xfst[0]: print net baz;
Stack Example
xfst[0]: clear stack;
xfst[0]: read regex [e d | i n g | s | []];
xfst[1]: read regex [t a l k | k i c k];
xfst[2]: print stack
xfst[2]: print net
xfst[2]: print words
xfst[2]: concatenate net
xfst[1]: print words
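The effect of the session can be imitated in Python with a list used as a stack of languages. This is a sketch of the idea, not of xfst itself: the function names are ours, and it assumes that `concatenate net` pops every network and concatenates them from the top of the stack downwards, which is what puts the verb stems before the suffixes.

```python
stack = []   # top of the stack is the end of the list

def read_regex(language):
    """Push a language (a set of strings) onto the stack."""
    stack.append(set(language))

def concatenate_net():
    """Pop all languages, concatenate them top-first, push the result."""
    result = {""}
    while stack:
        top = stack.pop()
        result = {x + y for x in result for y in top}
    stack.append(result)

read_regex({"ed", "ing", "s", ""})   # [e d | i n g | s | []]
read_regex({"talk", "kick"})         # [t a l k | k i c k]
concatenate_net()
print(len(stack))                    # 1
print(sorted(stack[-1]))
```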
lexc?
Source File → lexc → Compiled Network
lexc is a high level programming language and compiler that is well suited for defining NL lexicons.
The output is a compiled form of FS network in a format identical to other Xerox tools (xfst, twolc).
lexc source file
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! ex0-lex.txt
LEXICON Root
dine #;
dines #;
dined #;
line #;
lines #;
lined #;
END
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Lexc Sublexicons
! ex1-lex.txt
LEXICON Root
Noun;
Verb;
LEXICON Noun
line NounSuffix;
LEXICON Verb
dine VerbSuffix;
line VerbSuffix;
LEXICON NounSuffix
s #;
#;
LEXICON VerbSuffix
s #;
d #;
#;
lexc
• The resulting lexicon contains the same six words
• The form lines actually gets constructed twice, once as a verb, once as a noun.
• After minimization, only one of them remains.
• The compiler first processes each sublexicon separately, keeping track of continuation pointers, and then joins the structures to a single network which is determinized and minimized.
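To see where the duplicate comes from, the sublexicons of ex1-lex.txt can be written as a Python dictionary and expanded by following continuation classes. This is only a sketch of the path enumeration, with names of our choosing; it does not build or minimize a network. Collapsing the paths into a set plays the role that minimization plays in the compiled network.

```python
# Each entry is (form, continuation class); "#" marks the end of a word.
LEXICONS = {
    "Root": [("", "Noun"), ("", "Verb")],
    "Noun": [("line", "NounSuffix")],
    "Verb": [("dine", "VerbSuffix"), ("line", "VerbSuffix")],
    "NounSuffix": [("s", "#"), ("", "#")],
    "VerbSuffix": [("s", "#"), ("d", "#"), ("", "#")],
}

def expand(lexicon="Root", prefix=""):
    """Yield the string spelled out on every path from `lexicon` to '#'."""
    for form, continuation in LEXICONS[lexicon]:
        if continuation == "#":
            yield prefix + form
        else:
            yield from expand(continuation, prefix + form)

paths = list(expand())
words = sorted(set(paths))
print(paths.count("lines"))   # 2: once as a noun, once as a verb
print(words)                  # six distinct words
```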
Resulting FSA
[Figure: the minimized network, with arcs d or l, then i, n, e, optionally followed by s or d]
Running lexc
lexc> compile-source ex1-lex.txt
Opening 'ex1-lex.txt'...
Root...2, Noun...1, Verb...2, NounSuffix...2, VerbSuffix...3
Building lexicon...Minimizing...Done!
SOURCE: 6 states, 7 arcs, 6 words
lexc>
Conclusion
[Diagram relating three notions: CLASS OF SETS or LANGUAGES, NOTATION, MACHINE; labelled with xfst and lexc]