October 2004 CSA3050 NL Algorithms 1
CSA3050: Natural Language Algorithms
Words, Strings and Regular Expressions
Finite State Automata
This lecture
• Outline
  – Words
  – The language of words
  – FSAs in Prolog
• Acknowledgement
  – Jurafsky and Martin, Speech and Language Processing, Prentice Hall 2000
  – Blackburn and Striegnitz, NLP Techniques in Prolog: http://www.coli.uni-sb.de/~kris/nlp-with-prolog/html/
What is a Word?
• A series of speech sounds that symbolizes meaning without being divisible into smaller units
• Any segment of written or printed discourse ordinarily appearing between spaces or between a space and a punctuation mark
• A set of linguistic forms produced by combining a single base with various inflectional elements without change in the part-of-speech elements
• A number of bytes processed as a unit.
Information Associated with Words
• Spelling
  – orthographic
  – phonological
• Syntax
  – POS
  – Valency
• Semantics
  – Meaning
  – Relationship to other words
Properties of Words
• Sequence
  – characters, e.g. "pollution"
  – phonemes
• Delimitation
  – whitespace
  – other?
• Structure
  – simple ("atomic") words
  – complex ("molecular") words
Complex Words
• enlargement
  – en + large + ment
  – (en + large) + ment
  – en + (large + ment)
• affixation
  – prefix
  – suffix
  – infix
Sets Underlie the Formation of Complex Words
prefixes + roots + suffixes
• prefixes: dis, re, un, en
• roots: large, charge, infect, code, decide
• suffixes: ed, ing, ee, er, ly
Structure of Complex Words
• Complex words are made by concatenating elements chosen from
  – a set of prefixes
  – a set of roots
  – a set of suffixes
• The set of valid words for a given human language (e.g. English, Maltese) can be regarded as a formal language.
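The concatenation idea can be sketched in Prolog as follows. This is only an illustration: the predicate names prefix/1, root/1, suffix/1 and complex_word/1 are assumptions introduced here, and the facts cover just a handful of affixes and roots. Note that the sketch overgenerates: it also builds strings such as "relargeing" that are not English words, so plain concatenation alone does not define the set of valid words.

```prolog
% Hypothetical facts: a small sample of prefixes, roots and suffixes.
prefix(dis).
prefix(re).
prefix(en).

root(large).
root(charge).
root(infect).

suffix(ed).
suffix(ing).
suffix(ment).

% complex_word(?Word): Word is the concatenation of one prefix,
% one root and one suffix, in that order.
complex_word(Word) :-
    prefix(P),
    root(R),
    suffix(S),
    atom_concat(P, R, PR),
    atom_concat(PR, S, Word).
```

Calling complex_word(enlargement) succeeds, and complex_word(W) enumerates all 27 combinations on backtracking.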
The Language of Words
• What kind of formal language is the language of words?
• One which can be constructed out of
  – A characteristic set of basic symbols (alphabet)
  – A characteristic set of combining operations
    • Union (disjunction)
    • Concatenation
    • Closure (iteration)
• Regular Language; Regular Sets
Characterising Classes of Set
[Diagram with three labels: CLASS OF SETS or LANGUAGES, NOTATION, MACHINE]
Regular Expressions
• Notation for describing regular sets
• Used extensively in the Unix operating system (grep, sed, etc.) and also in some Microsoft products (Word)
• Xerox Finite State tools use a somewhat different notation with similar function.
Regular Expressions
a        a simple symbol
A B      concatenation
A | B    alternation operator
A & B    intersection operator
A*       Kleene star
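To make the operators concrete, here is a minimal Prolog sketch of a matcher for this notation. The term encoding is an assumption made for this example only: sym(X) for a simple symbol, seq(A,B) for concatenation, alt(A,B) for alternation, and(A,B) for intersection and star(A) for the Kleene star. Strings are lists of symbols, and the string argument is assumed to be fully bound when match/2 is called.

```prolog
:- use_module(library(lists)).

% match(+RE, +String): String belongs to the set denoted by RE.
match(sym(X), [X]).
match(seq(A, B), S) :-
    append(S1, S2, S),          % try every split of S
    match(A, S1),
    match(B, S2).
match(alt(A, _), S) :- match(A, S).
match(alt(_, B), S) :- match(B, S).
match(and(A, B), S) :-          % intersection: both must match
    match(A, S),
    match(B, S).
match(star(_), []).
match(star(A), S) :-
    append([X|S1], S2, S),      % first piece is non-empty, so S2 is shorter
    match(A, [X|S1]),
    match(star(A), S2).
```

For instance, the query match(seq(star(seq(sym(h), sym(a))), sym(!)), [h,a,h,a,!]) succeeds. Requiring a non-empty first piece in the star clause is a design choice that guarantees termination.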
Finite Automaton
• A finite automaton comprises
  – A finite set of states Q
  – An alphabet of symbols I
  – A start state q0 ∈ Q
  – A set of final states F ⊆ Q
  – A transition function δ(q,i) which maps a state q ∈ Q and a symbol i ∈ I to a new state q' ∈ Q
Encoding FSAs in Prolog
• Three predicates
  – initial/1: initial(s), s is an initial state
  – final/1: final(f), f is a final state
  – arc/3: arc(s,t,c), there is an arc from s to t labelled c
Example 1: FSA
initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(3,2,h).
[Diagram: state 1 (initial) -h-> 2 -a-> 3 -!-> 4 (final), with an arc 3 -h-> 2]
Example 2: FSA with jump arc
initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(3,1,#).
[Diagram: state 1 (initial) -h-> 2 -a-> 3 -!-> 4 (final), with a jump arc 3 -#-> 1]
Example 3: NDA
initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(2,1,a).
[Diagram: state 1 (initial) -h-> 2 -a-> 3 -!-> 4 (final), with a second arc labelled a from 2 back to 1]
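What makes Example 3 non-deterministic is that state 2 has two arcs with the same label. A small sketch can make this visible. The helper predicates next_states/3 and deterministic/0 are names assumed for this illustration; they are not part of the three-predicate encoding.

```prolog
% Example 3 again, so that the sketch is self-contained.
initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(2,1,a).

% next_states(+State, +Symbol, -States): all states reachable from
% State by one arc labelled Symbol, in the order of the arc/3 facts.
next_states(State, Symbol, States) :-
    findall(Next, arc(State, Next, Symbol), States).

% deterministic: no state has two arcs with the same label
% leading to different states.
deterministic :-
    \+ ( arc(S, T1, C), arc(S, T2, C), T1 \== T2 ).
```

Here next_states(2, a, States) gives States = [3, 1], and the goal deterministic fails for this automaton.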
A Recogniser
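% recognize1(+Node, ?String): String is accepted starting from Node.
recognize1(Node, []) :-
    final(Node).
recognize1(Node1, String) :-
    arc(Node1, Node2, Label),
    traverse1(Label, String, NewString),
    recognize1(Node2, NewString).

% traverse1(+Label, ?String, ?Rest): consume Label from the front of String.
traverse1(Label, [Label|Symbols], Symbols).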
Trace
Call: (7) test1([h, a, !])
Call: (8) initial(_L181)
Exit: (8) initial(1)
Call: (8) recognize1(1, [h, a, !])
Call: (9) arc(1, _L199, _L200)
Exit: (9) arc(1, 2, h)
Call: (9) traverse1(h, [h, a, !], _L201)
Exit: (9) traverse1(h, [h, a, !], [a, !])
Call: (9) recognize1(2, [a, !])
Call: (10) recognize1(3, [!])
Call: (11) recognize1(4, [])
Call: (12) final(4)
Exit: (12) final(4)
Exit: (11) recognize1(4, [])
Exit: (10) recognize1(3, [!])
Exit: (9) recognize1(2, [a, !])
Exit: (8) recognize1(1, [h, a, !])
Exit: (7) test1([h, a, !])
Generation
• test1(X)
• X = [h, a, !] ;
• X = [h, a, h, a, !] ;
• X = [h, a, h, a, h, a, !] ;
• X = [h, a, h, a, h, a, h, a, !] ;
• etc.
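The following self-contained sketch puts the Example 1 automaton and the recogniser together so that both recognition and generation can be tried. The definition of test1/1 is an assumption inferred from the trace (it looks up the initial state and calls recognize1/2), and short_strings/2 is a helper invented here to bound the otherwise infinite enumeration.

```prolog
% Example 1 automaton.
initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(3,2,h).

% The recogniser.
recognize1(Node, []) :-
    final(Node).
recognize1(Node1, String) :-
    arc(Node1, Node2, Label),
    traverse1(Label, String, NewString),
    recognize1(Node2, NewString).

traverse1(Label, [Label|Symbols], Symbols).

% Assumed driver, inferred from the trace.
test1(String) :-
    initial(Node),
    recognize1(Node, String).

% short_strings(+Max, -Strings): every accepted string of length 1..Max.
% Fixing the length first keeps the search finite.
short_strings(Max, Strings) :-
    findall(X, ( between(1, Max, N), length(X, N), test1(X) ), Strings).
```

With this program, test1([h,a,!]) succeeds, test1([h,a]) fails, and short_strings(7, Xs) collects the three shortest strings in the order listed above.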
3 Related Frameworks
[Diagram with three boxes: REGULAR LANGS/SETS, REGULAR EXPRESSIONS, FINITE STATE NETWORKS; links labelled "describe" and "recognise"]
Regular Operations
• Operations
  – Concatenation
  – Union
  – Closure
• Over What
  – Language
  – Expressions
  – FS Automata
Concatenation over Reg. Expression and Language
Regular Expression
E1 = [a|b]
E2 = [c|d]
E1 E2 = [a|b] [c|d]
Language
L1 = {"a", "b"}
L2 = {"c", "d"}
L1 L2 = {"ac", "ad", "bc", "bd"}
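Concatenation over finite languages can be sketched in Prolog by pairing every string of the first language with every string of the second. The predicate name concat_lang/3 and the representation of a language as a list of atoms are assumptions made for this example.

```prolog
:- use_module(library(lists)).

% concat_lang(+L1, +L2, -L): L contains X followed by Y
% for every X in L1 and every Y in L2.
concat_lang(L1, L2, L) :-
    findall(W,
            ( member(X, L1),
              member(Y, L2),
              atom_concat(X, Y, W) ),
            L).
```

The query concat_lang([a,b], [c,d], L) gives L = [ac, ad, bc, bd], matching L1 L2 above.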
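Concatenation over FS Automata
[Diagram: an automaton with arcs labelled a and b, and an automaton with arcs labelled c and d, combined by the concatenation operator ⌣ into a single automaton with arcs a, b followed by c, d]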
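One common way to concatenate two automata is to link every final state of the first to the initial state of the second with a jump arc. The sketch below assumes a term representation fsa(Initial, Finals, Arcs) that is not used elsewhere in these slides, assumes the two automata have disjoint state names, and uses '#' as the jump label. The names concat_fsa/3, accepts/2, walk/4 and step/3 are all hypothetical.

```prolog
:- use_module(library(lists)).

% concat_fsa(+M1, +M2, -M): M accepts a string of M1 followed by a string of M2.
concat_fsa(fsa(I1, F1, A1), fsa(I2, F2, A2), fsa(I1, F2, Arcs)) :-
    findall(arc(F, I2, '#'), member(F, F1), Jumps),
    append(A1, Jumps, A1J),
    append(A1J, A2, Arcs).

% accepts(+FSA, +String): String is accepted by FSA.
accepts(fsa(I, Finals, Arcs), String) :-
    walk(I, String, Finals, Arcs).

walk(State, [], Finals, _) :-
    memberchk(State, Finals).
walk(State, String, Finals, Arcs) :-
    member(arc(State, Next, Label), Arcs),
    step(Label, String, Rest),
    walk(Next, Rest, Finals, Arcs).

% A jump arc consumes nothing; any other arc consumes its label.
step('#', String, String).
step(Label, [Label|Rest], Rest) :-
    Label \== '#'.
```

For example, with M1 = fsa(1, [2], [arc(1,2,a), arc(1,2,b)]) and M2 = fsa(3, [4], [arc(3,4,c), arc(3,4,d)]), the result of concat_fsa(M1, M2, M) accepts [a,c] and [b,d] but rejects [a] and [c]. The walk may not terminate if the jump arcs form a cycle, which this sketch does not guard against.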
Issues
• Handling jump arcs.
• Handling non-determinism
• Computing operations over networks.
• Maintaining multiple states in DB
• Representation.
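As an illustration of the first issue, the recogniser can be extended so that an arc labelled # is followed without consuming any input. This is a sketch under assumptions: the names recognize2/2, traverse2/3 and test2/1 are chosen here, and # is taken to be reserved as the jump label, so it cannot also occur as an input symbol. The automaton is the one from Example 2.

```prolog
% Example 2: FSA with a jump arc from 3 back to 1.
initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(3,1,#).

recognize2(Node, []) :-
    final(Node).
recognize2(Node1, String) :-
    arc(Node1, Node2, Label),
    traverse2(Label, String, NewString),
    recognize2(Node2, NewString).

% A jump arc leaves the string unchanged;
% any other arc consumes its label from the front of the string.
traverse2('#', String, String).
traverse2(Label, [Label|Symbols], Symbols) :-
    Label \== '#'.

% Assumed driver: start from the initial state.
test2(String) :-
    initial(Node),
    recognize2(Node, String).
```

With these definitions test2([h,a,h,a,!]) succeeds and test2([h,a,h,!]) fails. A cycle consisting only of jump arcs would still cause looping, which this sketch does not handle.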