Finite-State Methods in Natural Language Processing
Lauri Karttunen
LSA 2005 Summer Institute
July 20, 2005
Course Outline
July 18: Intro to computational morphology; XFST
  Readings:
    Lauri Karttunen, “Finite-State Constraints”, The Last Phonological Rule, J. Goldsmith (ed.), pages 173-194, University of Chicago Press, 1993.
    Karttunen and Beesley, “25 Years of Finite-State Morphology”
    Chapter 1: “Gentle Introduction” (B&K)
July 20: Regular expressions; more on XFST
  Readings:
    Chapter 2: “Systematic Introduction”
    Chapter 3: “The XFST interface”
July 25: Concatenative morphotactics; constraining non-local dependencies
  Readings:
    Chapter 4: “The LEXC Language”
    Chapter 5: “Flag Diacritics”
July 27: Non-concatenative morphotactics (reduplication, interdigitation)
  Readings:
    Chapter 8: “Non-Concatenative Morphotactics”
August 1: Realizational morphology
  Readings:
    Gregory T. Stump, Inflectional Morphology: A Theory of Paradigm Structure, Cambridge U. Press, 2001. (An excerpt)
    Lauri Karttunen, “Computing with Realizational Morphology”, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), 205-216, Springer Verlag, 2003.
August 3: Optimality theory
  Readings:
    Paul Kiparsky, “Finnish Noun Inflection”, Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp. 109-161, CSLI Publications, 2003.
    Nine Elenbaas and René Kager, “Ternary rhythm and the lapse constraint”, Phonology 16, 273-329.
Scripting xfst
xfst -l myscript
    Start XFST, execute myscript, and wait for more commands from the command line.
xfst -f myscript
    Execute myscript and exit.
xfst -e "echo Welcome" -e "regex a b c;" -e "save foo" -stop
    Execute the commands in the given order. The commands must be on the same line. The -stop at the end is required to make xfst quit.
Numeral Script
# This script constructs the language of English
# numerals from "one" to "ninety-nine".
# This is a comment.
# From "one" through "nine":
define OneToNine [{one} | {two} | {three} | {four} | {five} |
                  {six} | {seven} | {eight} | {nine}];
# It is convenient to define a set of prefixes that
# can be followed either by "teen" or by "ty".
define TeenTyStem [{thir} | {fif} | {six} | {seven} | {eigh} | {nine}];
Numeral Script (Continued)
# From "ten" to "nineteen"
define Teens [{ten} | {eleven} | {twelve} |
              [TeenTyStem | {four}] {teen}];
# Let's define stems that can be followed by "ty".
define TyStem [TeenTyStem | {twen} | {for}];
# TyStem is followed either by "ty" or by "ty-"
# and a number from OneToNine.
define Tens [TyStem [{ty} | {ty-} OneToNine]];
define OneToNinetyNine [ OneToNine | Teens | Tens ];
push OneToNinetyNine
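As a cross-check on the script above, the same language can be enumerated in a few lines of Python. This is only a sketch: the variable names mirror the xfst definitions but are otherwise hypothetical, and plain sets of strings stand in for the finite-state network.

```python
# Plain-Python restatement of the numeral script (a sketch, not xfst).
one_to_nine = ["one", "two", "three", "four", "five",
               "six", "seven", "eight", "nine"]

# Prefixes that can be followed either by "teen" or by "ty".
teen_ty_stem = ["thir", "fif", "six", "seven", "eigh", "nine"]

# "ten" to "nineteen": three irregular forms plus stem + "teen".
teens = ["ten", "eleven", "twelve"] + [s + "teen" for s in teen_ty_stem + ["four"]]

# Stems that can be followed by "ty".
ty_stem = teen_ty_stem + ["twen", "for"]

# Each ty-stem is followed by "ty" alone or by "ty-" and a number from one_to_nine.
tens = [s + "ty" for s in ty_stem] + \
       [s + "ty-" + n for s in ty_stem for n in one_to_nine]

one_to_ninety_nine = set(one_to_nine) | set(teens) | set(tens)
print(len(one_to_ninety_nine))
```

The set contains one string per numeral, which makes it easy to confirm that forms such as "fourteen" and "forty" are included while "fourty" is not.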
Number to Numeral
Generation: 105 --> hundred five, hundred and five, one hundred and five
Analysis: hundred five --> 105
NumberToNumeral script
# This script constructs a transducer that relates the
# English numerals "one", "two", ..., "ninety-nine"
# to the corresponding numbers "1", "2", ..., "99".
define OneToNine [1:{one} | 2:{two} | 3:{three} | 4:{four} | 5:{five} |
                  6:{six} | 7:{seven} | 8:{eight} | 9:{nine}];
define TeenTyStem [3:{thir} | 5:{fif} | 6:{six} | 7:{seven} | 8:{eigh} | 9:{nine}];
define Teens [1:0 [{0}:{ten} | 1:{eleven} | 2:{twelve} |
                   [TeenTyStem | 4:{four}] 0:{teen}]];
NumberToNumeral (Continued)
define TyStem [2:{twen} | TeenTyStem | 4:{for}];
# TyStem is followed either by "ty" paired with a zero
# or by "ty-" mapped to an epsilon and followed by a
# number. Note that {0} means zero and not epsilon.
define Tens [TyStem [{0}:{ty} | 0:{ty-} OneToNine]];
define OneToNinetyNine [ OneToNine | Teens | Tens ];
push OneToNinetyNine
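The relation built by this script can be approximated with a pair of Python dictionaries, one per direction. This is a rough sketch under the assumption that a finite table is an acceptable stand-in for the transducer; the names follow the xfst definitions but are not part of xfst.

```python
# Digit-to-word pairs, mirroring the x:{word} pairs in the script (a sketch).
one_to_nine = {"1": "one", "2": "two", "3": "three", "4": "four", "5": "five",
               "6": "six", "7": "seven", "8": "eight", "9": "nine"}
teen_ty_stem = {"3": "thir", "5": "fif", "6": "six",
                "7": "seven", "8": "eigh", "9": "nine"}

# Teens: a leading "1" on the number side has no counterpart on the numeral side.
teens = {"10": "ten", "11": "eleven", "12": "twelve"}
for digit, stem in {**teen_ty_stem, "4": "four"}.items():
    teens["1" + digit] = stem + "teen"

# Tens: "ty" pairs with a literal zero; "ty-" pairs with nothing and is
# followed by a number from one_to_nine.
ty_stem = {"2": "twen", **teen_ty_stem, "4": "for"}
tens = {}
for digit, stem in ty_stem.items():
    tens[digit + "0"] = stem + "ty"
    for unit, word in one_to_nine.items():
        tens[digit + unit] = stem + "ty-" + word

number_to_numeral = {**one_to_nine, **teens, **tens}                 # generation
numeral_to_number = {v: k for k, v in number_to_numeral.items()}     # analysis

print(number_to_numeral["42"], numeral_to_number["nineteen"])
```

Inverting the dictionary plays the role of applying the transducer in the opposite direction.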
Xerox RE Operators
$          containment
=>         restriction
-> @->     replacement
These operators make it easier to describe complex languages and relations without extending the formal power of finite-state systems.
Containment
$a
(Network diagram with arcs labeled a and ?.)
Equivalent expression: [?* a ?*]
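The equivalence between $a and [?* a ?*] can be illustrated with Python's re module, where "." plays the role of "?". This is only an analogy under that assumption; contains_a is a hypothetical helper name.

```python
import re

def contains_a(s: str) -> bool:
    """True if s belongs to the language [?* a ?*], i.e. contains an 'a'."""
    return re.fullmatch(r".*a.*", s, flags=re.DOTALL) is not None

print(contains_a("banana"), contains_a("xyz"))
```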
Restriction
a => b _ c
“Any a must be preceded by b and followed by c.”
(Network diagram with arcs labeled a, b, c, and ?.)
Equivalent expression: ~[~[?* b] a ?*] & ~[?* a ~[c ?*]]
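The restriction a => b _ c describes a language, not a rewriting, so it can be sketched as a membership test. The function below is a hypothetical plain-Python check, not xfst: it rejects a string as soon as it finds an "a" that is not preceded by "b" or not followed by "c", which corresponds to the two negated conjuncts of the equivalent expression.

```python
def satisfies_restriction(s: str) -> bool:
    """True if every 'a' in s is immediately preceded by 'b' and followed by 'c'."""
    for i, ch in enumerate(s):
        if ch != "a":
            continue
        preceded_by_b = i > 0 and s[i - 1] == "b"
        followed_by_c = i + 1 < len(s) and s[i + 1] == "c"
        if not (preceded_by_b and followed_by_c):
            return False
    return True

print(satisfies_restriction("xbacx"), satisfies_restriction("bax"))
```

Strings without any "a" satisfy the restriction vacuously.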
Replacement
a b -> b a
“Replace ‘ab’ by ‘ba’.”
(Network diagram with arcs labeled a, b, ?, a:b, and b:a.)
Equivalent expression: [[~$[a b] [[a b] .x. [b a]]]* ~$[a b]]
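The shape of the equivalent expression, stretches that contain no "ab" alternating with "ab" mapped to "ba", can be mimicked in Python by splitting on "ab" and rejoining with "ba". This is a sketch of the idea only, and replace_ab is a hypothetical name.

```python
def replace_ab(s: str) -> str:
    # s.split("ab") yields the stretches that contain no "ab" (the ~$[a b] parts);
    # rejoining with "ba" maps each "ab" to "ba" (the [a b] .x. [b a] parts).
    return "ba".join(s.split("ab"))

print(replace_ab("aab"))
```

Note that the output may itself contain "ab": the replacement is a relation between input and output, so it is not re-applied to its own result.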
Marking
a|e|i|o|u -> %[ ... %]
(Network diagram with arcs labeled a, e, i, o, u, ?, [, ], 0:[, and 0:].)
p o t a t o --> p[o]t[a]t[o]
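For comparison, the same marking can be written with Python's re.sub, where \g<0> stands for the matched text in the way "..." does in the xfst rule. This is an analogy, not the xfst implementation; mark_vowels is a hypothetical name.

```python
import re

def mark_vowels(s: str) -> str:
    # Wrap every vowel in square brackets, leaving everything else unchanged.
    return re.sub(r"[aeiou]", r"[\g<0>]", s)

print(mark_vowels("potato"))
```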
Multiple Results
a b | b | b a | a b a -> x
or, equivalently,
(a) b (a) -> x
applied to “aba”: four factorizations of the input string.
  a [b] a   -->  a x a
  a [b a]   -->  a x
  [a b] a   -->  x a
  [a b a]   -->  x
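The four results can be reproduced by enumerating factorizations directly. The sketch below assumes the reading of -> given by the equivalent expression for replacement: unreplaced stretches may not contain any instance of the pattern. The function name and signature are hypothetical, and patterns are assumed to be non-empty strings.

```python
def all_replacements(s, patterns, replacement):
    """Enumerate every output of an unconditional, undirected replacement."""
    results = set()

    def contains_pattern(segment):
        return any(p in segment for p in patterns)

    def go(i, output):
        # Choose an unreplaced gap s[i:j], then either stop at the end of
        # the string or replace a pattern instance that starts at j.
        for j in range(i, len(s) + 1):
            gap = s[i:j]
            if contains_pattern(gap):
                break  # longer gaps would contain the pattern too
            if j == len(s):
                results.add(output + gap)
            for p in patterns:
                if s.startswith(p, j):
                    go(j + len(p), output + gap + replacement)

    go(0, "")
    return results

print(sorted(all_replacements("aba", ["ab", "b", "ba", "aba"], "x")))
```

Each element of the returned set corresponds to one factorization of the input.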
Directed Replace Operators
Directed replace operators guarantee a unique result by constraining the factorization of the input string by
  direction of the match (rightward or leftward)
  length (longest or shortest)
@-> Left-to-right, Longest-match Replacement
(a) b (a) @-> x
applied to “aba”: of the four factorizations, only the left-to-right, longest match remains.
  [a b a]   -->  x
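Python's re.sub scans left to right, and for this particular pattern its greedy quantifiers also pick the longest match, so it can serve as a rough stand-in for @->. That coincidence is an assumption that holds for a?ba? but not for regular expressions in general, since backtracking engines are leftmost but not always longest.

```python
import re

def longest_match_replace(s: str) -> str:
    # (a) b (a) written as a Python regex; greedy a? prefers the longer match.
    return re.sub(r"a?ba?", "x", s)

print(longest_match_replace("aba"))
```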
Conditional Replacement
The relation that replaces A by B between L and R, leaving everything else unchanged.
  A -> B    Replacement
  L _ R     Context
Sources of complexity:
  Replacements and contexts may overlap.
  Alternative ways of interpreting “between left and right”:
    A -> B || L _ R    both contexts on the input
    A -> B // L _ R    left context on the output
    A -> B \\ L _ R    right context on the output
Vowel shortening after a long vowel
V %: -> V || V %: C* _      Left context on the input side
  Slovak: v o l + a: v + a: m e: --> v o l + a: v + a m e   ‘we call often’
V %: -> V // V %: C* _      Left context on the output side
  Gidabal: g u n u: m + ba: + d a: ng + b e: + --> g u n u: m + ba + d a: ng + b e +   ‘is certainly right on the stump’
Shortening script
define V [ a | e | i | o | u | a ];
define C [ b | c | d | f | g | h | j | k | l | m | n | p | q | r | s | t | v | x | y | z ];
define SlovakShortening %: -> 0 || V %: C* V _ ;
define GidabalShortening %: -> 0 // V %: C* V _ ;
push SlovakShortening
down vola:va:me:
vola:vame
push GidabalShortening
down gunu:mba:da:ngbe:
gunu:mbada:ngbe
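The difference between || and // in the two shortening rules can be simulated with a left-to-right scan that checks the left context either against the input or against the output built so far. The sketch below is a hypothetical Python emulation of these two rules only; it assumes that every non-vowel, non-colon character counts as a consonant.

```python
import re

# Left context "V %: C* V" immediately before the length mark.
LEFT_CONTEXT = re.compile(r"[aeiou]:[^aeiou:]*[aeiou]$")

def shorten(word: str, context_on_output: bool) -> str:
    """Delete ':' after a vowel when the preceding syllable has a long vowel.

    context_on_output=False emulates '||' (Slovak): context read from the input.
    context_on_output=True  emulates '//' (Gidabal): context read from the output.
    """
    output = ""
    for i, ch in enumerate(word):
        if ch == ":":
            left = output if context_on_output else word[:i]
            if LEFT_CONTEXT.search(left):
                continue  # drop the length mark
        output += ch
    return output

print(shorten("vola:va:me:", context_on_output=False))
print(shorten("gunu:mba:da:ngbe:", context_on_output=True))
```

With the context on the input side every long vowel after the first is shortened; with the context on the output side a shortened vowel no longer triggers shortening of the next one, which gives the alternating Gidabal pattern.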
Palatalization and Vowel Raising
Palatalization: tim --> cim
Vowel Raising: memi --> mimi
Interaction: temi --> cimi, tememi --> cimimi
Vowel Raising & Palatalization
define C [ b | c | d | f | g | h | j | k | l | m | n | p | q | r | s | t | v | x | y | z ];
define Raising e -> i \\ _ C* i ;
define Palatalization t -> c || _ i;
regex Raising .o. Palatalization;
down memi
mimi
down tim
cim
down temi
cimi
down tememi
cimimi
t e m e m i
t i m i m i
c i m i m i
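The three-line derivation above can be traced with a small Python sketch. It assumes that a right context on the output side (\\) can be emulated by scanning from right to left, so that a raised vowel can trigger raising of the vowel before it; the function names are hypothetical.

```python
import re

CONSONANTS = "bcdfghjklmnpqrstvxyz"

def raise_vowels(word: str) -> str:
    """e -> i when followed, in the output, by consonants and then i."""
    chars = list(word)
    for i in range(len(chars) - 1, -1, -1):  # right to left
        rest = "".join(chars[i + 1:])
        if chars[i] == "e" and re.match(f"[{CONSONANTS}]*i", rest):
            chars[i] = "i"
    return "".join(chars)

def palatalize(word: str) -> str:
    """t -> c before i."""
    return re.sub(r"t(?=i)", "c", word)

def down(word: str) -> str:
    # Composition: Raising applies first, Palatalization sees its output.
    return palatalize(raise_vowels(word))

for w in ["memi", "tim", "temi", "tememi"]:
    print(w, "-->", down(w))
```

Because Palatalization applies to the output of Raising, a t followed by a raised e is palatalized as well.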
Making a lexical transducer
Lexicon (regular expression, describes morphotactics) --> Compiler --> Lexicon FST
Rules (regular expressions, describe alternations) --> Compiler --> Rule FSTs
Lexicon FST and Rule FSTs --> composition --> Lexical Transducer (a single FST)
Finnish Gradation Script
define Stems [ {tukka} | {kakku} | {pappi} | {tippa} | {katto} | {juttu} |
               {tikka} | {huppu} | {rotta} | {nahka} | {lika} | {maku} |
               {rako} | {tuke} | {halko} | {jalka} | {virka} | {lanka} |
               {linko} | {puku} | {suku} | {tiuku} | {raaka} | {ripa} |
               {sopu} | {tapa} | {kampa} | {rumpu} | {sampe} | {sota} |
               {pata} | {kita} | {rinta} | {kanto} | {ranta} | {ilta} |
               {kulta} | {parta} | {kerta} ];
define Case [ "+Part":a | "+Gen":n ];
define Finnish [Stems Case];
Auxiliary definitions
define V [a | e | i | o | u | y | ä | ö];
define C [b | c | d | f | g | h | j | k | l | m | n | p | q | r | s | t | v | w | x | z];
define Coda [ C [C | .#.] ];
define ClosedSyll [V Coda] ;
Weak form of k
define WeakK k -> ' || V a _ a Coda, V u _ u Coda
        .o.  k -> j || r _ e Coda
        .o.  k -> v || u _ u Coda
        .o.  k -> g || n _ V Coda
        .o.  k -> 0 || \[s|h] _ V Coda ;    # kiskon 'rail', nahkan 'skin'
Weak form of p
define WeakP p -> m || m _ V Coda
        .o.  p -> v || \[s|p] _ V Coda      # piispan 'bishop'
        .o.  p -> 0 || p _ V Coda ;
Weak form of t
define WeakT t -> n || n _ V Coda
        .o.  t -> l || l _ V Coda
        .o.  t -> r || r _ V Coda
        .o.  t -> d || \[s|t] _ V Coda      # koston 'revenge'
        .o.  t -> 0 || t _ V Coda ;
Putting it all together
define Gradation WeakK .o. WeakP .o. WeakT;
regex Finnish .o. Gradation;
print lower-words
echo *** Size of Finnish .o. Gradation
print size
echo *** Size of Finnish
push Finnish
print size
echo *** Size of Gradation
push Gradation
print size
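To see what the cascade does to individual forms, the rules can be approximated with ordered regular-expression substitutions in Python. This is a sketch under several assumptions, not the xfst semantics: lookbehind and lookahead stand in for || contexts, "$" stands in for .#., and a word-initial consonant is assumed never to satisfy a left context such as \[s|h] because there is no symbol before it. The input is the surface concatenation of stem and case ending, for example "katto" plus "n".

```python
import re

V = "aeiouyäö"
C = "bcdfghjklmnpqrstvwxz"
CODA = f"[{C}](?:[{C}]|$)"        # C followed by C or by the word boundary
V_CODA = f"(?=[{V}]{CODA})"       # right context "V Coda"

WEAK_K = [
    (f"(?<=[{V}]a)k(?=a{CODA})|(?<=[{V}]u)k(?=u{CODA})", "'"),
    (f"(?<=r)k(?=e{CODA})", "j"),
    (f"(?<=u)k(?=u{CODA})", "v"),
    (f"(?<=n)k{V_CODA}", "g"),
    (f"(?<=[^sh])k{V_CODA}", ""),
]
WEAK_P = [
    (f"(?<=m)p{V_CODA}", "m"),
    (f"(?<=[^sp])p{V_CODA}", "v"),
    (f"(?<=p)p{V_CODA}", ""),
]
WEAK_T = [
    (f"(?<=n)t{V_CODA}", "n"),
    (f"(?<=l)t{V_CODA}", "l"),
    (f"(?<=r)t{V_CODA}", "r"),
    (f"(?<=[^st])t{V_CODA}", "d"),
    (f"(?<=t)t{V_CODA}", ""),
]

def gradate(word: str) -> str:
    """Apply WeakK, then WeakP, then WeakT, each rule in the listed order."""
    for pattern, replacement in WEAK_K + WEAK_P + WEAK_T:
        word = re.sub(pattern, replacement, word)
    return word

for form in ["katton", "kattoa", "rintan", "pukun", "kampan"]:
    print(form, "-->", gradate(form))
```

The order of the rules inside each cascade matters: in the t cascade, for instance, t -> d is tried before t -> 0, so the single t left after a geminate is shortened is not weakened further.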
Syllabification
define C [ b | c | d | f ...
define V [ a | e | i | o | u ];
[C* V+ C*] @-> ... "-" || _ [C V]
“Insert a hyphen after the longest instance of the C* V+ C* pattern in front of a C V pattern.”
s t r u k t u r a l i s m i --> s t r u k - t u - r a - l i s - m i
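The syllabification rule has a close analogue in Python's re module, sketched below. It assumes that every character other than a, e, i, o, u counts as a consonant (the C definition above is elided, so the consonant set here is a guess) and that greedy matching yields the longest C* V+ C* match for inputs like this one; syllabify is a hypothetical name.

```python
import re

# C* V+ C* followed by C V; the lookahead plays the role of the "_ [C V]" context.
SYLLABLE = re.compile(r"[^aeiou]*[aeiou]+[^aeiou]*(?=[^aeiou][aeiou])")

def syllabify(word: str) -> str:
    # \g<0> is the matched syllable, like "..." in the xfst rule.
    return SYLLABLE.sub(r"\g<0>-", word)

print(syllabify("strukturalismi"))
```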