October 2007 Natural Language Processing 1
CSA3050: Natural Language Algorithms
Words and Finite State Machinery
Acknowledgement
Material derived from/copied from
– Jurafsky and Martin, Speech and Language Processing, Prentice Hall 2000
– Richard Sproat, Lecture notes
Outline
Words
Regular Languages
Regular Expressions
Finite State Automata
What is a Word?
• A series of speech sounds that symbolizes meaning without being divisible into smaller units
• Any segment of written or printed discourse ordinarily appearing between spaces or between a space and a punctuation mark
• A set of linguistic forms produced by combining a single base with various inflectional elements without change in the part-of-speech elements
• The smallest meaningful element of language. When written it stands alone with a space on either side of it.
Information Associated with Words
• Spelling
  – orthographic
  – phonological
• Syntax
  – POS
  – Valency
• Semantics
  – Meaning
  – Relationship to other words
Properties of Words
• Sequence
  – characters: pollution
  – phonemes
• Delimitation
  – whitespace
  – other?
• Structure
  – simple ("atomic") words
  – complex ("molecular") words
Complex Words
• Complex words have subparts, e.g. "enlargement": en + large + ment
• Some subparts are valid words: large
• Others are prefixes and suffixes: en, ment
• N.B. The complex word can be built in different ways:
  – (en + large) + ment
  – en + (large + ment)
Morphological Processes
• affixation
  – prefix
  – suffix
  – circumfix: għandi - mgħandix
  – infix: phenidine - phenetidine
• other morphological processes
  – redoubling (mexa; mexxa)
  – vowel change (swim; swam)
Complex Words Formed by Concatenation
prefixes + roots + suffixes
• prefixes: dis, re, un, en
• roots: large, charge, infect, code, decide
• suffixes: ed, ing, ee, er, ly
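The prefix, root and suffix lists above can be read as a concatenation pattern. The following is a minimal Python sketch, assuming (the slide does not say this) that the prefix and the suffix are each optional and that at most one of each is attached. The names PREFIXES, ROOTS, SUFFIXES, build_pattern and is_candidate are illustrative only.

```python
import re

# Affix and root lists taken from the slide.
PREFIXES = ["dis", "re", "un", "en"]
ROOTS = ["large", "charge", "infect", "code", "decide"]
SUFFIXES = ["ed", "ing", "ee", "er", "ly"]


def build_pattern(prefixes, roots, suffixes):
    """Build a regex for: optional prefix, one root, optional suffix."""
    def alt(items):
        return "|".join(re.escape(item) for item in items)
    return re.compile(
        f"(?:{alt(prefixes)})?(?:{alt(roots)})(?:{alt(suffixes)})?"
    )


WORD = build_pattern(PREFIXES, ROOTS, SUFFIXES)


def is_candidate(word):
    """True if the word is prefix? + root + suffix? by plain concatenation."""
    return WORD.fullmatch(word) is not None


print(is_candidate("recharge"))     # True
print(is_candidate("largely"))      # True
print(is_candidate("discodeee"))    # True: pure concatenation over-generates
print(is_candidate("enlargement"))  # False: "ment" is not in the suffix list
```

The third call shows why concatenation alone is too permissive: the pattern accepts strings that are not English words, so a realistic model also needs constraints on which affixes combine with which roots.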
The Language of Words
• What kind of formal language is the language of words?
• One which can be constructed out of
  – A characteristic set of basic symbols (alphabet)
  – A characteristic set of combining operations
    • Union (disjunction)
    • Concatenation
    • Iteration
• Regular Language; Regular Sets
Outline
Words
Regular Languages
Regular Expressions
Finite State Automata
Regular Languages
• A regular language is a language over a finite alphabet that can be constructed from finite sets of strings using one or more of the following operations:
  – Set union
  – Concatenation
  – Reflexive-transitive closure (Kleene star)
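The three operations can be tried out on small finite languages represented as Python sets of strings. This is only a sketch: the function names are illustrative, and because the true Kleene star of a non-empty language is infinite, star() here is deliberately truncated at a maximum number of repetitions.

```python
from itertools import product


def union(a, b):
    """Set union of two languages."""
    return a | b


def concat(a, b):
    """Every string of a followed by every string of b."""
    return {x + y for x, y in product(a, b)}


def star(a, max_reps):
    """Kleene star truncated at max_reps repetitions (the real one is infinite)."""
    result = {""}
    layer = {""}
    for _ in range(max_reps):
        layer = concat(layer, a)
        result |= layer
    return result


A = {"a"}
B = {"b"}
print(sorted(union(A, B)))   # ['a', 'b']
print(sorted(concat(A, B)))  # ['ab']
print(sorted(star(A, 3)))    # ['', 'a', 'aa', 'aaa']

# A finite slice of "zero or more a's followed by zero or more b's".
print(sorted(concat(star(A, 2), star(B, 2))))
```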
Some things that are regular languages
• Zero or more a’s followed by zero or more b’s
• The set of words in an English dictionary
• Dates
• URLs
• English?
Some things that are not regular languages
• Zero or more a’s followed by exactly the same number of b’s
• The set of all English palindromes (e.g. Madam I'm Adam)
• The set that includes all noun phrases of the form
  – the cat slept
  – the cat the dog bit slept
  – the cat the dog the man fed bit slept
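The contrast between the first regular example (zero or more a's followed by zero or more b's) and the first non-regular one (the same number of a's and b's) can be illustrated in Python. This is a sketch with illustrative function names: the first language is checked with a regular expression, while the second needs a comparison of counts, which a finite-state device cannot do for unbounded n.

```python
import re


def in_a_star_b_star(s):
    """Regular: zero or more a's followed by zero or more b's."""
    return re.fullmatch(r"a*b*", s) is not None


def in_a_n_b_n(s):
    """Not regular: the a's and b's must match in number, so we have to count."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n


for s in ["", "aabb", "aab", "ba"]:
    print(repr(s), in_a_star_b_star(s), in_a_n_b_n(s))
# ''     True  True
# 'aabb' True  True
# 'aab'  True  False
# 'ba'   False False
```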
Some special regular languages
• The universal language (Σ*)
• The empty language (Ø)
Note: the empty language is not the same as the empty string
Some closure properties of regular languages
• Intersection
• Complementation
• Difference
• Reversal
• Power
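Three of these closure properties can be demonstrated directly on automata. The sketch below assumes complete deterministic automata over a shared alphabet, stored as (states, alphabet, delta, start, finals) tuples. The representation, the function names and the two example machines are illustrative assumptions, not part of the slides.

```python
from itertools import product

# A complete DFA: delta maps every (state, symbol) pair to exactly one state.


def accepts(dfa, s):
    states, alphabet, delta, start, finals = dfa
    q = start
    for ch in s:
        if ch not in alphabet:
            return False  # string is not over the alphabet at all
        q = delta[(q, ch)]
    return q in finals


def complement(dfa):
    """Swap final and non-final states."""
    states, alphabet, delta, start, finals = dfa
    return (states, alphabet, delta, start, states - finals)


def intersection(d1, d2):
    """Product construction: run both machines in lockstep."""
    s1, alpha, t1, q1, f1 = d1
    s2, _, t2, q2, f2 = d2
    states = set(product(s1, s2))
    delta = {((p, q), a): (t1[(p, a)], t2[(q, a)])
             for (p, q) in states for a in alpha}
    finals = {(p, q) for (p, q) in states if p in f1 and q in f2}
    return (states, alpha, delta, (q1, q2), finals)


def difference(d1, d2):
    """L1 minus L2 is L1 intersected with the complement of L2."""
    return intersection(d1, complement(d2))


SIGMA = {"a", "b"}
# Hypothetical example: an even number of a's.
EVEN_A = ({0, 1}, SIGMA,
          {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}, 0, {0})
# Hypothetical example: strings that end in b.
ENDS_B = ({"n", "y"}, SIGMA,
          {("n", "a"): "n", ("n", "b"): "y",
           ("y", "a"): "n", ("y", "b"): "y"}, "n", {"y"})

print(accepts(intersection(EVEN_A, ENDS_B), "aab"))  # True
print(accepts(complement(EVEN_A), "a"))              # True
print(accepts(difference(EVEN_A, ENDS_B), "aab"))    # False
```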
Characterising Classes of Sets
[Diagram: CLASS OF SETS or LANGUAGES, NOTATION, MACHINE]
Outline
Words
Regular Languages
Regular Expressions
Finite Automata
Regular Expressions
• Notation for describing regular sets
• Used extensively in the Unix operating system (grep, sed, etc.) and also in some Microsoft products (Word)
• Xerox Finite State tools use a somewhat different notation with a similar function.
Regular Expressions
a        a simple symbol
A B      concatenation
A | B    alternation operator
A & B    intersection operator
A*       Kleene star
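For comparison, the same operators can be tried in Python's re module, whose notation differs slightly from the one above. In particular, re has no intersection operator. The last pattern emulates intersection with a lookahead, which is one possible workaround rather than an equivalent of A & B in the notation of the slide. The helper name matches is illustrative.

```python
import re


def matches(pattern, s):
    """True if the whole string s is described by the pattern."""
    return re.fullmatch(pattern, s) is not None


print(matches(r"a", "a"))      # simple symbol: True
print(matches(r"ab", "ab"))    # concatenation: True
print(matches(r"a|b", "b"))    # alternation: True
print(matches(r"a*", "aaaa"))  # Kleene star: True

# Emulated intersection of a*b* with "even-length strings over {a, b}":
# the lookahead checks the first language, the body checks the second.
BOTH = r"(?=(?:a*b*)\Z)(?:[ab][ab])*"
print(matches(BOTH, "aabb"))   # True: in both languages
print(matches(BOTH, "aab"))    # False: odd length
print(matches(BOTH, "abab"))   # False: not of the form a*b*
```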
Outline
Words
Regular Languages
Regular Expressions
Finite Automata
Finite Automaton
• A finite automaton is a quintuple (Q, Σ, q0, F, δ) where:
• Q is a finite set of states
• Σ is an alphabet of symbols
• q0 ∈ Q is a start state
• F ⊆ Q is the set of final states
• δ is a transition relation δ(q, σ, q') between a state q ∈ Q, a symbol σ ∈ Σ and a state q' ∈ Q
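The quintuple maps naturally onto a small data structure. The following sketch stores δ as a set of (q, σ, q') triples, so it is a relation rather than a function, and accepts a string if some path ends in a final state. The class name FA, the field names and the example machine are illustrative assumptions, and ε-transitions are not handled here.

```python
from typing import NamedTuple


class FA(NamedTuple):
    states: frozenset    # Q
    alphabet: frozenset  # Σ
    start: str           # q0
    finals: frozenset    # F
    delta: frozenset     # transition relation: (q, symbol, q') triples


def accepts(fa, s):
    """Track every state reachable after each input symbol."""
    current = {fa.start}
    for ch in s:
        current = {q2 for (q1, sym, q2) in fa.delta
                   if q1 in current and sym == ch}
    return bool(current & fa.finals)


# Hypothetical example: strings over {a, b} that end in "ab".
ENDS_AB = FA(
    states=frozenset({"q0", "q1", "q2"}),
    alphabet=frozenset({"a", "b"}),
    start="q0",
    finals=frozenset({"q2"}),
    delta=frozenset({
        ("q0", "a", "q0"), ("q0", "b", "q0"),
        ("q0", "a", "q1"), ("q1", "b", "q2"),
    }),
)

print(accepts(ENDS_AB, "bab"))  # True
print(accepts(ENDS_AB, "aba"))  # False
```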
Representation of FSAs: State Diagram
State Table
Mr. Kleene
Kleene’s theorem
• The languages accepted by NFAs are exactly the languages described by regular expressions.
• Kleene’s Theorem, part 1: To each regular expression there corresponds an NFA.
• Kleene’s Theorem, part 2: To each NFA there corresponds a regular expression.
http://www.cs.may.ie/~jpower/Courses/parsing/node6.html
Converting a Regular Expression to an NFA
• The NFA representing the empty string is: 1 --ε--> 2
• The NFA representing a single character is: 1 --a--> 2
Regular Expression to NFA
Diagram from Leonidas Fegaras, Univ. Texas
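Since the diagram itself is not reproduced here, the following Python sketch shows one common textbook way to build an NFA from a regular expression (a Thompson-style construction). It may differ in detail from the diagram. Each fragment has one start and one end state, and None stands for ε. All function names are illustrative.

```python
from itertools import count

_ids = count()


def _new():
    """Return a fresh state number."""
    return next(_ids)


# A fragment is (start, end, transitions), where transitions is a list of
# (source, label, target) triples and label None means an ε-transition.

def symbol(a):
    s, e = _new(), _new()
    return (s, e, [(s, a, e)])


def concat(f, g):
    """Link the end of f to the start of g with ε."""
    return (f[0], g[1], f[2] + g[2] + [(f[1], None, g[0])])


def union(f, g):
    """New start branches to f and g; both ends join a new end."""
    s, e = _new(), _new()
    links = [(s, None, f[0]), (s, None, g[0]),
             (f[1], None, e), (g[1], None, e)]
    return (s, e, f[2] + g[2] + links)


def star(f):
    """Allow skipping f entirely or looping back through it."""
    s, e = _new(), _new()
    links = [(s, None, f[0]), (f[1], None, e),
             (s, None, e), (f[1], None, f[0])]
    return (s, e, f[2] + links)


def closure(states, trans):
    """All states reachable from `states` using only ε-transitions."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for (p, label, r) in trans:
            if p == q and label is None and r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def accepts(frag, s):
    start, end, trans = frag
    current = closure({start}, trans)
    for ch in s:
        moved = {r for (p, label, r) in trans
                 if p in current and label == ch}
        current = closure(moved, trans)
    return end in current


# NFA for (a|b)*abb
nfa = concat(star(union(symbol("a"), symbol("b"))),
             concat(symbol("a"), concat(symbol("b"), symbol("b"))))
print(accepts(nfa, "babb"))  # True
print(accepts(nfa, "abab"))  # False
```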
Deterministic Finite Automata
• In deterministic finite automata (DFA), every state/symbol pair maps to a unique state
• In other words, δ is a function
• Why do we care about DFAs?
• EFFICIENCY!!
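The efficiency argument is easy to see in code: because δ is a function, recognition needs exactly one table lookup per input symbol and no backtracking or state sets. The sketch below stores δ as a Python dict. As a simplifying assumption, a missing entry is treated as a move to an implicit dead state. The function name and the example table are illustrative.

```python
def dfa_accepts(delta, start, finals, s):
    """Run a DFA over s with one dictionary lookup per symbol."""
    q = start
    for ch in s:
        q = delta.get((q, ch))
        if q is None:  # no entry: implicit dead state, reject
            return False
    return q in finals


# Hypothetical DFA for zero or more a's followed by zero or more b's.
DELTA = {("A", "a"): "A", ("A", "b"): "B", ("B", "b"): "B"}

print(dfa_accepts(DELTA, "A", {"A", "B"}, "aabbb"))  # True
print(dfa_accepts(DELTA, "A", {"A", "B"}, "aba"))    # False
```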
Equivalence of NFAs and DFAs
Subset Construction for Determinisation
• States which are connected by an ε transition will be represented by the same state in the DFA.
• If there are multiple transitions based on the same symbol, then we can regard a transition as moving from a state to a set of states (i.e. the union of all those states reachable by a transition on the current symbol).
• Thus these states will be combined into a single DFA state.
• More details: http://www.cs.may.ie/~jpower/Courses/parsing/node9.html
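The points above can be sketched in Python. Each DFA state is a frozenset of NFA states: eps_closure merges states connected by ε, and the inner loop takes the union of all states reachable on one symbol. The data layout (a dict from (state, symbol) to a set of states, with None for ε) and all names are illustrative assumptions. An empty target set is simply omitted, which corresponds to an implicit dead state.

```python
from collections import deque


def eps_closure(states, trans):
    """All NFA states reachable from `states` via ε (label None)."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in trans.get((q, None), ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return frozenset(seen)


def determinise(trans, start, finals, alphabet):
    """Subset construction; returns (d_trans, d_start, d_finals)."""
    d_start = eps_closure({start}, trans)
    d_trans, d_finals = {}, set()
    queue, seen = deque([d_start]), {d_start}
    while queue:
        subset = queue.popleft()
        if subset & finals:
            d_finals.add(subset)
        for a in alphabet:
            moved = set()
            for q in subset:
                moved |= trans.get((q, a), set())
            target = eps_closure(moved, trans)
            if not target:
                continue  # no transition: implicit dead state
            d_trans[(subset, a)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return d_trans, d_start, d_finals


# Hypothetical NFA for strings ending in "ab", with one ε-transition 0 -> 1.
NFA = {(0, None): {1}, (1, "a"): {1, 2}, (1, "b"): {1}, (2, "b"): {3}}
d_trans, d_start, d_finals = determinise(NFA, 0, {3}, ["a", "b"])

print(sorted(d_start))                          # [0, 1]
print(sorted(d_trans[(frozenset({1, 2}), "b")]))  # [1, 3]
print([sorted(s) for s in d_finals])            # [[1, 3]]
```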
Subset construction for determinization
[Worked example shown as diagrams over six slides; figures not recoverable]