October 2007 Natural Language Processing 1

CSA3050: Natural Language Algorithms

Words and Finite State Machinery


Acknowledgement

Material derived from/copied from
  – Jurafsky and Martin, Speech and Language Processing, Prentice Hall 2000
  – Richard Sproat, Lecture notes


Outline

Words

Regular Languages

Regular Expressions

Finite State Automata


What is a Word?

• A series of speech sounds that symbolizes meaning without being divisible into smaller units

• Any segment of written or printed discourse ordinarily appearing between spaces or between a space and a punctuation mark

• A set of linguistic forms produced by combining a single base with various inflectional elements without change in the part-of-speech elements

• The smallest meaningful element of language. When written it stands alone with a space on either side of it.
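The second definition (a segment of text between spaces, or between a space and a punctuation mark) can be approximated with a regular expression. The following is only a minimal sketch: the helper name `orthographic_words` is hypothetical, and treating apostrophes as part of a word is an assumption, not something the definitions above settle.

```python
import re

def orthographic_words(text):
    """Return maximal runs of word characters and apostrophes.

    Everything else (spaces, punctuation) is treated as a delimiter.
    This is an illustrative approximation, not a full tokenizer.
    """
    return re.findall(r"[\w']+", text)

print(orthographic_words("Madam, I'm Adam."))
```

Cases such as hyphenated forms or abbreviations with full stops show why a purely orthographic definition is only a starting point.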


Information Associated with Words

• Spelling
  – orthographic
  – phonological

• Syntax
  – POS
  – Valency

• Semantics
  – Meaning
  – Relationship to other words


Properties of Words

• Sequence
  – characters: pollution
  – phonemes

• Delimitation
  – whitespace
  – other?

• Structure
  – simple ("atomic") words
  – complex ("molecular") words


Complex Words

• Complex words have subparts, e.g. "enlargement": en + large + ment

• Some subparts are valid words: large

• Others are prefixes and suffixes: en, ment

• N.B. The complex word can be built in different ways:
  – (en + large) + ment
  – en + (large + ment)


Morphological Processes

• affixation
  – prefix
  – suffix
  – circumfix: għandi - mgħandix
  – infix: phenidine - phenetidine

• other morphological processes
  – redoubling (mexa; mexxa)
  – vowel change (swim; swam)


Complex Words Formed by Concatenation

prefixes + roots + suffixes

prefixes: dis, re, un, en
roots: large, charge, infect, code, decide
suffixes: ed, ing, ee, er, ly
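Concatenating one item from each column can be sketched in a few lines. This is an illustration only: the split of the affix columns into individual items and the use of an empty string for "no affix" are assumptions, and the function name `candidate_words` is hypothetical.

```python
from itertools import product

# "" stands for "no prefix" / "no suffix" (an assumption of this sketch)
PREFIXES = ["", "dis", "re", "un", "en"]
ROOTS = ["large", "charge", "infect", "code", "decide"]
SUFFIXES = ["", "ed", "ing", "ee", "er", "ly"]

def candidate_words():
    """Every prefix + root + suffix string, with no spelling rules applied."""
    return {p + r + s for p, r, s in product(PREFIXES, ROOTS, SUFFIXES)}

words = candidate_words()
print("uninfected" in words)  # a real word
print("codeer" in words)      # not a word: plain concatenation over-generates
```

Plain concatenation produces both real words and non-words, and it ignores spelling changes at morpheme boundaries (code + er gives "codeer", not "coder").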


The Language of Words

• What kind of formal language is the language of words?

• One which can be constructed out of
  – A characteristic set of basic symbols (alphabet)
  – A characteristic set of combining operations
    • Union (disjunction)
    • Concatenation
    • Iteration

• Regular Language; Regular Sets


Outline

Words

Regular Languages

Regular Expressions

Finite State Automata


Regular Languages

• A regular language is a language over a finite alphabet that can be constructed from finite sets of strings using the following operations:
  – Set union
  – Concatenation
  – Closure under concatenation (Kleene star)
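The three operations can be tried out on small finite languages represented as Python sets of strings. This is a rough sketch: the function names are hypothetical, and because the Kleene star of a non-empty language is infinite, `star` is cut off at an assumed `max_reps` repetitions.

```python
def union(a, b):
    """Set union of two languages."""
    return a | b

def concat(a, b):
    """Every string of a followed by every string of b."""
    return {x + y for x in a for y in b}

def star(a, max_reps):
    """Bounded approximation of Kleene star: 0..max_reps pieces from a."""
    result = {""}   # zero repetitions gives the empty string
    layer = {""}
    for _ in range(max_reps):
        layer = concat(layer, a)
        result |= layer
    return result

print(union({"a"}, {"b"}))
print(concat({"en"}, {"large"}))
print(sorted(star({"ab"}, 2)))
```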


Some things that are regular languages

• Zero or more a’s followed by zero or more b’s

• The set of words in an English dictionary

• Dates

• URLs

• English?
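The first and third items can be written directly as patterns for Python's `re` module. A small sketch follows; the DD-MM-YYYY date format is an assumption chosen for illustration, and the pattern checks digit ranges only, not calendar validity.

```python
import re

# Zero or more a's followed by zero or more b's
A_STAR_B_STAR = re.compile(r"a*b*")

# Hypothetical date format DD-MM-YYYY (an assumption of this sketch)
DATE = re.compile(r"(0[1-9]|[12][0-9]|3[01])-(0[1-9]|1[0-2])-[0-9]{4}")

def matches(pattern, s):
    """True if the whole string s belongs to the language of pattern."""
    return pattern.fullmatch(s) is not None

print(matches(A_STAR_B_STAR, "aabbb"))
print(matches(A_STAR_B_STAR, "aba"))
print(matches(DATE, "18-01-2016"))
```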


Some things that are not regular languages

• Zero or more a’s followed by exactly the same number of b’s

• The set of all English palindromes (e.g. Madam I'm Adam)

• The set that includes all noun phrases of the form
  – the cat slept
  – the cat the dog bit slept
  – the cat the dog the man fed bit slept
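The first example shows what goes wrong: matching the number of a's against the number of b's needs a counter that can grow without bound, while a finite automaton has only a fixed number of states. The sketch below (the function name `is_anbn` is hypothetical) recognises the language with exactly such a counter.

```python
def is_anbn(s):
    """Accept zero or more a's followed by exactly as many b's."""
    count = 0   # unbounded counter: this is the part a finite automaton lacks
    i = 0
    while i < len(s) and s[i] == "a":
        count += 1
        i += 1
    while i < len(s) and s[i] == "b":
        count -= 1
        i += 1
    # All input consumed and the a's and b's balance exactly
    return i == len(s) and count == 0

print(is_anbn("aabb"))
print(is_anbn("aab"))
```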


Some special regular languages

• The universal language (Σ*)

• The empty language (Ø)

Note: the empty language is not the same as the empty string


Some closure properties of regular languages

• Intersection

• Complementation

• Difference

• Reversal

• Power
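Two of these closure properties have short constructions on deterministic automata: complementation swaps final and non-final states, and intersection runs two automata in parallel (the product construction). The sketch below assumes a DFA is a tuple `(states, alphabet, delta, start, finals)` with a total transition table; that representation and the two example machines are assumptions made for illustration.

```python
from itertools import product

def accepts(dfa, s):
    """Run a total DFA on s (every symbol of s must be in the alphabet)."""
    states, alphabet, delta, start, finals = dfa
    q = start
    for ch in s:
        q = delta[(q, ch)]
    return q in finals

def complement(dfa):
    """Same machine with final and non-final states swapped."""
    states, alphabet, delta, start, finals = dfa
    return (states, alphabet, delta, start, states - finals)

def intersection(d1, d2):
    """Product construction: track a pair of states, one per machine."""
    s1, alphabet, t1, q1, f1 = d1
    s2, _, t2, q2, f2 = d2
    states = set(product(s1, s2))
    delta = {((p, q), ch): (t1[(p, ch)], t2[(q, ch)])
             for (p, q) in states for ch in alphabet}
    finals = {(p, q) for (p, q) in states if p in f1 and q in f2}
    return (states, alphabet, delta, (q1, q2), finals)

# Example machines over {a, b} (illustrative only)
EVEN_A = ({0, 1}, {"a", "b"},
          {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}, 0, {0})
ENDS_B = ({0, 1}, {"a", "b"},
          {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}, 0, {1})

both = intersection(EVEN_A, ENDS_B)
print(accepts(both, "aab"))
print(accepts(complement(EVEN_A), "aab"))
```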


Characterising Classes of Set

CLASS OF SETS or LANGUAGES | NOTATION | MACHINE


Outline

Words

Regular Languages

Regular Expressions

Finite Automata


Regular Expressions

• Notation for describing regular sets

• Used extensively in the Unix operating system (grep, sed, etc.) and also in some Microsoft products (Word)

• Xerox Finite State tools use a somewhat different notation, but similar function.


Regular Expressions

a        a simple symbol
A B      concatenation
A | B    alternation operator
A & B    intersection operator
A*       Kleene star
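Most of these operators carry over to Python's `re` module, with concatenation written by juxtaposition without the space. Python's `re` has no intersection operator, so the sketch below emulates `A & B` by requiring a full match against both patterns; the helper names are hypothetical.

```python
import re

def full(pattern, s):
    """True if the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

def in_intersection(pattern_a, pattern_b, s):
    """Emulate A & B: s must be in both languages."""
    return full(pattern_a, s) and full(pattern_b, s)

print(full(r"a", "a"))        # a simple symbol
print(full(r"ab", "ab"))      # concatenation
print(full(r"a|b", "b"))      # alternation
print(full(r"a*", "aaa"))     # Kleene star
print(in_intersection(r"a*b*", r"(ab)*", "ab"))  # intersection, emulated
```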


Characterising Classes of Set

CLASS OF SETS or LANGUAGES | NOTATION | MACHINE


Outline

Words

Regular Languages

Regular Expressions

Finite Automata


Finite Automaton

• A finite automaton is a quintuple (Q, Σ, q0, F, δ) where:

• Q is a finite set of states

• Σ is an alphabet of symbols

• q0 ∈ Q is a start state

• F ⊆ Q is the set of final states

• δ is a transition relation δ(q, σ, q') between a state q ∈ Q, a symbol σ ∈ Σ and a state q' ∈ Q
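The quintuple maps naturally onto a small data structure. The sketch below stores δ as a set of triples, so it is a relation rather than a function, and simulates the automaton by tracking the set of states it could be in. The class name, field names and the example machine are assumptions for illustration; ε transitions are not handled here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FiniteAutomaton:
    states: frozenset     # Q
    alphabet: frozenset   # Σ
    start: str            # q0
    finals: frozenset     # F
    delta: frozenset      # set of triples (q, σ, q')

    def accepts(self, string):
        """True if some path from the start state spells string and ends in F."""
        current = {self.start}
        for symbol in string:
            current = {q2 for (q1, s, q2) in self.delta
                       if q1 in current and s == symbol}
        return bool(current & self.finals)

# Illustrative machine for zero or more a's followed by zero or more b's
A_STAR_B_STAR = FiniteAutomaton(
    states=frozenset({"q0", "q1"}),
    alphabet=frozenset({"a", "b"}),
    start="q0",
    finals=frozenset({"q0", "q1"}),
    delta=frozenset({("q0", "a", "q0"), ("q0", "b", "q1"), ("q1", "b", "q1")}),
)

print(A_STAR_B_STAR.accepts("aabb"))
print(A_STAR_B_STAR.accepts("aba"))
```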


Representation of FSAs: State Diagram


State Table


Mr. Kleene


Kleene’s theorem

• The languages accepted by NFAs are exactly the languages described by regular expressions.

• Kleene’s Theorem, part 1: To each regular expression there corresponds an NFA.

• Kleene’s Theorem, part 2: To each NFA there corresponds a regular expression.

http://www.cs.may.ie/~jpower/Courses/parsing/node6.html


Converting a Regular Expression to an NFA

• The NFA representing the empty string is:

  1 --ε--> 2

• The NFA representing a single character is:

  1 --a--> 2
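The two base cases can be combined into larger automata. One common way to do this is a Thompson-style construction, sketched below on the assumption that an NFA is a triple `(start, final, transitions)` and that the label `None` stands for ε. All function names are hypothetical, and the sketch is not necessarily the construction drawn in the course diagrams.

```python
from itertools import count

_fresh = count(1)  # supplies unused state numbers

def symbol(a):
    """Base case: start --a--> final. Pass a=None for the empty string (ε)."""
    s, f = next(_fresh), next(_fresh)
    return s, f, [(s, a, f)]

def concat(n1, n2):
    (s1, f1, t1), (s2, f2, t2) = n1, n2
    return s1, f2, t1 + t2 + [(f1, None, s2)]

def union(n1, n2):
    (s1, f1, t1), (s2, f2, t2) = n1, n2
    s, f = next(_fresh), next(_fresh)
    return s, f, t1 + t2 + [(s, None, s1), (s, None, s2),
                            (f1, None, f), (f2, None, f)]

def star(n):
    s1, f1, t1 = n
    s, f = next(_fresh), next(_fresh)
    return s, f, t1 + [(s, None, s1), (s, None, f),
                       (f1, None, s1), (f1, None, f)]

def eps_closure(states, trans):
    """States reachable from `states` using ε transitions only."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for (p, label, r) in trans:
            if p == q and label is None and r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def accepts(nfa, string):
    start, final, trans = nfa
    current = eps_closure({start}, trans)
    for ch in string:
        moved = {r for (p, label, r) in trans if p in current and label == ch}
        current = eps_closure(moved, trans)
    return final in current

# (a|b)*c
nfa = concat(star(union(symbol("a"), symbol("b"))), symbol("c"))
print(accepts(nfa, "abac"))
print(accepts(nfa, "ab"))
```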


Regular Expression to NFA

Diagram from Leonidas Fegaras, Univ. Texas


Deterministic Finite Automata

• In deterministic finite automata (DFA), every state/symbol pair maps to a unique state

• In other words, δ is a function

• Why do we care about DFAs?

• EFFICIENCY!!
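The efficiency argument is visible in code: because δ is a function, recognition needs only one table lookup per input symbol and a single current state, with no set of alternatives to track. A minimal sketch, assuming the table is a Python dict keyed by (state, symbol) and that a missing entry means rejection; the names and the example table are illustrative.

```python
def dfa_accepts(delta, start, finals, string):
    """Run a DFA given as a dict: (state, symbol) -> next state."""
    state = start
    for symbol in string:
        state = delta.get((state, symbol))
        if state is None:      # no entry: treated as an implicit dead state
            return False
    return state in finals

# Illustrative table for zero or more a's followed by zero or more b's
DELTA = {(0, "a"): 0, (0, "b"): 1, (1, "b"): 1}

print(dfa_accepts(DELTA, 0, {0, 1}, "aabb"))
print(dfa_accepts(DELTA, 0, {0, 1}, "aba"))
```

The running time is linear in the length of the input, independent of the number of states.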


Equivalence of NFA’s and DFA’s


Subset Construction for Determinisation

• States which are connected by an ε transition will be represented by the same state in the DFA.

• If there are multiple transitions based on the same symbol, then we can regard a transition as moving from a state to a set of states (i.e. the union of all those states reachable by a transition on the current symbol).

• Thus these states will be combined into a single DFA state.

• More details: http://www.cs.may.ie/~jpower/Courses/parsing/node9.html
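The idea in the bullets above can be sketched as follows. The representation is an assumption: the NFA's transitions are a dict from (state, symbol) to a set of states, with `None` as the symbol for ε, and each DFA state is a frozenset of NFA states. The function names and the example machine are hypothetical.

```python
def eps_closure(states, delta):
    """All NFA states reachable from `states` by ε (None) transitions."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, None), ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return frozenset(seen)

def determinise(delta, start, finals, alphabet):
    """Subset construction; returns (dfa_delta, dfa_start, dfa_finals)."""
    d_start = eps_closure({start}, delta)
    d_delta, d_finals = {}, set()
    todo, seen = [d_start], {d_start}
    while todo:
        subset = todo.pop()
        if subset & finals:          # contains at least one NFA final state
            d_finals.add(subset)
        for a in alphabet:
            moved = set()
            for q in subset:         # union of everything reachable on a
                moved |= delta.get((q, a), set())
            target = eps_closure(moved, delta)
            if not target:
                continue             # no move: the DFA is left partial
            d_delta[(subset, a)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return d_delta, d_start, d_finals

# Illustrative NFA for strings over {a, b} that end in "ab"
NFA_DELTA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
d_delta, d_start, d_finals = determinise(NFA_DELTA, 0, {2}, {"a", "b"})
for (subset, a), target in sorted(d_delta.items(), key=str):
    print(sorted(subset), a, sorted(target))
```

Only subsets reachable from the start state are built, so the resulting DFA is often far smaller than the 2^|Q| worst case.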

Subset construction for determinization

[Six slides of diagrams not reproduced]

