04/20/23 CPSC503 Winter 2008 1
CPSC 503 Computational Linguistics
Lecture 6
Giuseppe Carenini
Knowledge-Formalisms Map

• State Machines (and prob. versions)
  (Finite State Automata, Finite State Transducers, Markov Models) — Morphology, Syntax
• Rule systems (and prob. versions)
  (e.g., (Prob.) Context-Free Grammars) — Syntax
• Logical formalisms (First-Order Logics) — Semantics
• AI planners — Pragmatics, Discourse and Dialogue

Markov Models:
• Markov Chains -> n-grams
• Hidden Markov Models (HMM)
• Max-Entropy Markov Models (MEMM)
Today 24/9
• n-grams evaluation
• Markov Chains
• Hidden Markov Models:
  – definition
  – the three key problems (only one in detail)
• Part-of-speech tagging
  – What it is
  – Why we need it
  – How to do it
Model Evaluation: Goal
You may want to compare:
• 2-grams with 3-grams
• two different smoothing techniques (given the same n-grams)
on a given corpus…
Model Evaluation: Key Ideas
A: split the corpus into a training set and a testing set
B: train the models on the training set (counting, frequencies, smoothing) → models Q1 and Q2
C: apply the models to the test set w_1^N and compare the results
Entropy
• Def1. Measure of uncertainty
• Def2. Measure of the information that we need to resolve an uncertain situation
  – Let p(x) = P(X = x), where x ∈ X.
  – H(p) = H(X) = -Σ_{x∈X} p(x) log₂ p(x)
  – It is normally measured in bits.
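Def. 2 can be sketched in a few lines of Python; the distributions here are hypothetical dicts mapping outcomes to probabilities:

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log2 p(x), in bits; terms with p(x)=0 contribute 0."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

# A fair coin is maximally uncertain for two outcomes -> 1 bit
fair = {"H": 0.5, "T": 0.5}
# A biased coin is less uncertain -> fewer bits needed
biased = {"H": 0.9, "T": 0.1}
print(entropy(fair))    # 1.0
print(entropy(biased))  # less than 1 bit
```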
Model Evaluation
P(w_1, .., w_n) — actual distribution
Q(w_1, .., w_n) — our approximation
How different?
Relative Entropy (KL divergence):
D(p||q) = Σ_{x∈X} p(x) log( p(x) / q(x) )
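The divergence formula, sketched directly (the two toy distributions are hypothetical; Q must give positive probability wherever P does):

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}   # actual distribution
q = {"a": 0.9, "b": 0.1}   # our approximation
print(kl_divergence(p, p))  # 0.0 for identical distributions
print(kl_divergence(p, q))  # > 0, and note D(p||q) != D(q||p) in general
```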
Entropy of P(w_1, .., w_n):
H(P) = H(w_1^n) = -Σ_{w_1^n ∈ L} P(w_1^n) log₂ P(w_1^n)

Entropy rate: (1/n) H(w_1^n)

Entropy of the language L:
H(L) = lim_{n→∞} -(1/n) Σ_{w_1^n ∈ L} P(w_1^n) log₂ P(w_1^n)

Assumptions: ergodic and stationary (Shannon-McMillan-Breiman):
H(L) = lim_{n→∞} -(1/n) log₂ P(w_1^n)

Entropy can be computed by taking the average log probability of a looooong sample.
Cross-Entropy
Between a probability distribution P and another distribution Q (a model for P):

H(P, Q) = H(P) + D(P||Q) = -Σ_x P(x) log₂ Q(x)

H(P, Q) ≥ H(P)

Between two models Q1 and Q2, the more accurate is the one with the lower cross-entropy.

Applied to language:
H(P, Q) = lim_{n→∞} -(1/n) Σ_{w_1^n ∈ L} P(w_1^n) log₂ Q(w_1^n)
        = lim_{n→∞} -(1/n) log₂ Q(w_1^n)
Model Evaluation: In practice
A: split the corpus into a training set and a testing set
B: train the models on the training set (counting, frequencies, smoothing) → models Q1 and Q2
C: apply the models to the test set w_1^N and compare the cross-perplexities:

H(P, Q) ≈ -(1/n) log₂ Q(w_1^n)

compare 2^{H(P, Q1)} with 2^{H(P, Q2)}
k-fold cross validation and t-test
• Randomly divide the corpus into k subsets of equal size
• Use each subset for testing (all the others for training) — in practice you do k times what we saw in the previous slide
• Now for each model you have k perplexities
• Compare the models' average perplexities with a t-test
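The random split can be sketched as below (the sentence list and k are hypothetical; the resulting k per-model perplexities would then go into a paired t-test, e.g. scipy.stats.ttest_rel):

```python
import random

def k_fold_splits(corpus, k, seed=0):
    """Randomly divide the corpus into k subsets of (roughly) equal size;
    yield (training, testing) pairs, using each subset once for testing."""
    items = list(corpus)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        testing = folds[i]
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        yield training, testing

sentences = [f"sent{i}" for i in range(10)]
for training, testing in k_fold_splits(sentences, k=5):
    print(len(training), len(testing))  # 8 2, five times
```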
Example of a Markov Chain

(figure: transition diagram over the letter states t, e, h, a, p, i, with Start probabilities .6 and .4 and transition probabilities such as .3, .4, .6 and 1 on the arcs)
Markov-Chain
Formal description (Manning/Schütze, 2000: 318):

Probability of initial states:
π_i = P(X_1 = s_i), with Σ_{i=1}^{N} π_i = 1
(in the example: π_t = .6, π_a = .4)

Stochastic transition matrix A:
a_{ij} = P(X_{t+1} = s_j | X_t = s_i), with a_{ij} ≥ 0 and Σ_{j=1}^{N} a_{ij} = 1 for all i

(figure: the 6×6 matrix A for the example, rows X_t and columns X_{t+1}, over the states t, i, p, a, h, e)
Markov Assumptions
• Let X = (X_1, .., X_t) be a sequence of random variables taking values in some finite set S = {s_1, …, s_n}, the state space. The Markov properties are:
• (a) Limited Horizon:
  For all t, P(X_{t+1} | X_1, .., X_t) = P(X_{t+1} | X_t)
• (b) Time Invariant:
  For all t, P(X_{t+1} | X_t) = P(X_2 | X_1), i.e., the dependency does not change over time.
Markov-Chain
Probability of a sequence of states X_1 … X_T (Manning/Schütze, 2000: 320):

P(X_1, …, X_T) = P(X_1) P(X_2|X_1) P(X_3|X_1,X_2) … P(X_T|X_1,…,X_{T-1})
             = P(X_1) P(X_2|X_1) P(X_3|X_2) … P(X_T|X_{T-1})
             = π_{X_1} Π_{t=1}^{T-1} a_{X_t X_{t+1}}

Example:
P(t, i, p) = P(X_1=t) P(X_2=i | X_1=t) P(X_3=p | X_2=i) = 0.6 × 0.3 × 0.6 = 0.108

Similar to …….?
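The example can be checked in a few lines; pi and a hold only the probabilities this example needs, taken from the chain above:

```python
# Toy chain from the example: P(t,i,p) = pi_t * a_{t,i} * a_{i,p}
pi = {"t": 0.6, "a": 0.4}               # initial (Start) probabilities
a = {("t", "i"): 0.3, ("i", "p"): 0.6}  # only the transitions used here

def sequence_prob(states, pi, a):
    """pi_{X1} * prod_t a_{X_t X_{t+1}}: the chain-rule product above."""
    p = pi.get(states[0], 0.0)
    for prev, nxt in zip(states, states[1:]):
        p *= a.get((prev, nxt), 0.0)  # missing transition -> probability 0
    return p

print(sequence_prob(["t", "i", "p"], pi, a))  # 0.6 * 0.3 * 0.6 = 0.108
```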
HMMs (and MEMMs) intro
They are probabilistic sequence classifiers / sequence labelers: they assign a class/label to each unit in a sequence.
We have already seen a non-probabilistic version...
Used extensively in NLP:
• Part-of-speech tagging
• Partial parsing
• Named entity recognition
• Information extraction
Hidden Markov Model (State Emission)

(figure: state-emission HMM with states s1–s4 emitting symbols a, b, i; Start probabilities .6 and .4; transition and emission probabilities on the arcs)
Hidden Markov Model
Formal specification as a five-tuple (S, K, Π, A, B):

Set of states: S = {s_1, …, s_N}
Output alphabet: K = {k_1, …, k_M} = {1, …, M}
Initial state probabilities: Π = {π_i}, i ∈ S
State transition probabilities: A = {a_{ij}}, i, j ∈ S, with Σ_{j=1}^{N} a_{ij} = 1
Symbol emission probabilities: B = {b_i(o_t)}, i ∈ S, o_t ∈ K, with Σ_{t=1}^{M} b_i(o_t) = 1
Three fundamental questions for HMMs

Decoding: finding the probability of an observation
Given a model μ = (A, B, Π), compute P(O|μ)
• brute force or Forward/Backward Algorithm
(Manning/Schütze, 2000: 325)

Finding the best state sequence
X̂ = argmax_X P(X | O, μ)
• Viterbi Algorithm

Training: find the model parameters that best explain the observations
argmax_μ P(O_training | μ)
Computing the probability of an observation sequence O = o_1 … o_T

P(O|μ) = Σ_X P(O, X|μ) = Σ_X P(O|X, μ) P(X|μ)

where X ranges over all sequences of T states, and

P(O|X, μ) = Π_{t=1}^{T} b_{X_t}(o_t)
P(X|μ) = π_{X_1} Π_{t=1}^{T-1} a_{X_t X_{t+1}}

so

P(O|μ) = Σ_X π_{X_1} [Π_{t=1}^{T-1} a_{X_t X_{t+1}}] [Π_{t=1}^{T} b_{X_t}(o_t)]

e.g., P(b, i | sample HMM)
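A brute-force sketch of this sum over all N^T state sequences; the two-state HMM below is invented for illustration (it is not the slide's four-state example):

```python
from itertools import product

# Hypothetical 2-state HMM; all numbers invented for illustration
states = ["s1", "s2"]
pi = {"s1": 0.6, "s2": 0.4}
a = {("s1", "s1"): 0.7, ("s1", "s2"): 0.3,
     ("s2", "s1"): 0.4, ("s2", "s2"): 0.6}
b = {("s1", "x"): 0.5, ("s1", "y"): 0.5,
     ("s2", "x"): 0.1, ("s2", "y"): 0.9}

def brute_force_likelihood(obs):
    """P(O|mu) = sum over all state sequences X of P(X|mu) * P(O|X,mu)."""
    total = 0.0
    for seq in product(states, repeat=len(obs)):  # N^T sequences
        p = pi[seq[0]] * b[(seq[0], obs[0])]
        for t in range(1, len(obs)):
            p *= a[(seq[t-1], seq[t])] * b[(seq[t], obs[t])]
        total += p
    return total

print(brute_force_likelihood(["x", "y"]))
```

Summing over all observation sequences of a fixed length gives 1, a useful sanity check on the model.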
Decoding Example (Manning/Schütze, 2000: 327)

P(b, i | sample HMM); T = 2, N = # of states = 4

Sum over all state sequences X_1 … X_T, e.g.:
s1, s1 = 0 ?
s1, s2 = 1 * .1 * .6 * .3
s1, s4 = 1 * .5 * .6 * .7
s2, s4 = 0 ?
……….

Complexity: the sum has N^T terms, each a product of O(T) factors
The forward procedure

α_t(i) = P(o_1 o_2 … o_t, X_t = i | μ)

1. Initialization:
α_1(i) = π_i b_i(o_1), 1 ≤ i ≤ N

2. Induction:
α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_{ij}] b_j(o_{t+1}), 1 ≤ j ≤ N

3. Total:
P(O|μ) = Σ_{i=1}^{N} α_T(i)

Complexity: O(N² T)
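The three steps can be sketched with dicts for π, A and B; the two-state HMM below is invented for illustration:

```python
def forward_likelihood(obs, states, pi, a, b):
    """Forward procedure: alpha_t(i) = P(o_1..o_t, X_t=i | mu); O(N^2 * T)."""
    # 1. Initialization
    alpha = {i: pi[i] * b[(i, obs[0])] for i in states}
    # 2. Induction
    for t in range(1, len(obs)):
        alpha = {j: sum(alpha[i] * a[(i, j)] for i in states) * b[(j, obs[t])]
                 for j in states}
    # 3. Total
    return sum(alpha.values())

# Hypothetical 2-state HMM (invented numbers)
states = ["s1", "s2"]
pi = {"s1": 0.6, "s2": 0.4}
a = {("s1", "s1"): 0.7, ("s1", "s2"): 0.3,
     ("s2", "s1"): 0.4, ("s2", "s2"): 0.6}
b = {("s1", "x"): 0.5, ("s1", "y"): 0.5,
     ("s2", "x"): 0.1, ("s2", "y"): 0.9}
print(forward_likelihood(["x", "y"], states, pi, a, b))
```

The result agrees with summing over all state sequences explicitly, but costs O(N²T) instead of O(N^T).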
Three fundamental questions for HMMs

Decoding: finding the probability of an observation
Given a model μ = (A, B, Π), compute P(O|μ)
• brute force or Forward Algorithm

Finding the best state sequence
X̂ = argmax_X P(X | O, μ)
• Viterbi Algorithm

Training: find the model parameters that best explain the observations
argmax_μ P(O_training | μ)

If interested in the details of the next two questions, read Sections 6.4 – 6.5.
Parts of Speech Tagging
• What is it?
• Why do we need it?
• Word classes (tags)
  – Distribution
  – Tagsets
• How to do it
  – Rule-based
  – Stochastic
  – Transformation-based
Parts of Speech Tagging: What

Input:
• Brainpower, not physical plant, is now a firm's chief asset.

Output:
• Brainpower_NN ,_, not_RB physical_JJ plant_NN ,_, is_VBZ now_RB a_DT firm_NN 's_POS chief_JJ asset_NN ._.

Tag meanings:
• NNP (Proper N sing), RB (Adv), JJ (Adj), NN (N sing. or mass), VBZ (V 3sg pres), DT (Determiner), POS (Possessive ending), . (sentence-final punct)
Parts of Speech Tagging: Why?

Part-of-speech (word class, morph. class, syntactic category) gives a significant amount of info about the word and its neighbors. Useful in the following NLP tasks:
• As a basis for (partial) parsing
• Information retrieval
• Word-sense disambiguation
• Speech synthesis
• Improving language models (spelling/speech)
Parts of Speech
• Eight basic categories:
  – Noun, verb, pronoun, preposition, adjective, adverb, article, conjunction
• These categories are based on:
  – morphological properties (affixes they take)
  – distributional properties (what other words can occur nearby)
  – e.g., green: It is so… , both…, The… is
• Not semantics!
Parts of Speech
• Two kinds of category:
  – Closed class (generally function words): prepositions, articles, conjunctions, pronouns, determiners, aux, numerals — very short, frequent and important
  – Open class: nouns (proper/common; mass/count), verbs, adjectives, adverbs — objects, actions, events, properties
• If you run across an unknown word….??
PoS Distribution
• Parts of speech follow the usual behavior in language:
  – ~35k words have 1 PoS
  – ~4k words have 2 PoS (unfortunately very frequent words)
  – ~4k words have >2 PoS
• …but luckily the different tags associated with a word are not equally likely
Sets of Parts of Speech: Tagsets
• Most commonly used:
  – 45-tag Penn Treebank
  – 61-tag C5
  – 146-tag C7
• The choice of tagset is based on the application (do you care about distinguishing between "to" as a prep and "to" as an infinitive marker?)
• Accurate tagging can be done even with large tagsets
PoS Tagging

Input text:
• Brainpower, not physical plant, is now a firm's chief asset. …………

Tagger
(dictionary: word_i -> set of tags from the Tagset)

Output:
• Brainpower_NN ,_, not_RB physical_JJ plant_NN ,_, is_VBZ now_RB a_DT firm_NN 's_POS chief_JJ asset_NN ._. ……….
Tagger Types
• Rule-based ~ '95
• Stochastic
  – HMM tagger ~ >= '92
  – Transformation-based tagger (Brill) ~ >= '95
  – Maximum Entropy Models ~ >= '97
Rule-Based (ENGTWOL '95)
1. A lexicon transducer returns for each word all possible morphological parses
2. A set of ~1,000 constraints is applied to rule out inappropriate PoS

Step 1: sample I/O
"Pavlov had shown that salivation…."
Pavlov  N SG PROPER
had     HAVE V PAST SVO
        HAVE PCP2 SVO
shown   SHOW PCP2 SVOO
……
that    ADV
        PRON DEM SG
        CS
……..

Sample Constraint
Example: Adverbial "that" rule
Given input: "that"
If
  (+1 A/ADV/QUANT)
  (+2 SENT-LIM)
  (NOT -1 SVOC/A)
Then eliminate non-ADV tags
Else eliminate ADV
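A hedged sketch of how such a constraint might be applied; the tag strings and the function name are simplifications invented here, not ENGTWOL's actual representation:

```python
def apply_adverbial_that_rule(tags, next1, next2, prev):
    """Simplified version of the adverbial-'that' constraint above.
    tags: candidate tags for 'that'; next1/next2/prev: context tags.
    If the next word is A/ADV/QUANT, the one after ends the sentence,
    and the previous word is not SVOC/A, keep only ADV; else drop ADV."""
    if next1 in {"A", "ADV", "QUANT"} and next2 == "SENT-LIM" and prev != "SVOC/A":
        return {t for t in tags if t == "ADV"}
    return {t for t in tags if t != "ADV"}

# "it isn't that odd."  -> intensifier 'that' keeps only ADV
print(apply_adverbial_that_rule({"ADV", "PRON", "CS"}, "A", "SENT-LIM", "VBZ"))
# "that salivation..."  -> ADV eliminated, other readings survive
print(apply_adverbial_that_rule({"ADV", "PRON", "CS"}, "N", "SENT-LIM", "VBZ"))
```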
HMM Stochastic Tagging
• Tags correspond to HMM states
• Words correspond to the HMM alphabet symbols

Tagging: given a sequence of words (observations), find the most likely sequence of tags (states). But this is…..!
We need: state transition and symbol emission probabilities

P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
P(w_i | t_i) = C(t_i, w_i) / C(t_i)

1) From a hand-tagged corpus
2) No tagged corpus: parameter estimation (Baum-Welch)
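Case 1 is simple counting; a sketch over a hypothetical tiny hand-tagged corpus (the sentences and tags are invented):

```python
from collections import Counter

# Hypothetical hand-tagged corpus: lists of (word, tag) pairs
corpus = [[("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
          [("the", "DT"), ("dog", "NN"), ("sleeps", "VBZ")]]

tag_bigrams, tag_counts, word_tags = Counter(), Counter(), Counter()
for sent in corpus:
    prev = "<s>"                 # pseudo-tag for sentence start
    tag_counts[prev] += 1
    for word, tag in sent:
        tag_bigrams[(prev, tag)] += 1
        tag_counts[tag] += 1
        word_tags[(word, tag)] += 1
        prev = tag

def p_tag(t, prev):   # P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
    return tag_bigrams[(prev, t)] / tag_counts[prev]

def p_word(w, t):     # P(w_i | t_i) = C(t_i, w_i) / C(t_i)
    return word_tags[(w, t)] / tag_counts[t]

print(p_tag("NN", "DT"))    # 2/2 = 1.0
print(p_word("cat", "NN"))  # 1/2 = 0.5
```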
Evaluating Taggers
• Accuracy: percent correct (most current taggers 96–97%) *test on unseen data!*
• Human ceiling: agreement rate of humans on classification (96–97%)
• Unigram baseline: assign each token to the class it occurred in most frequently in the training set (race -> NN). (91%)
• What is causing the errors? Build a confusion matrix…
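The unigram baseline itself is only a few lines; the training counts below are invented for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical training observations of (word, tag)
train = [("race", "NN"), ("race", "NN"), ("race", "VB"),
         ("the", "DT"), ("the", "DT")]

counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

def unigram_baseline(word):
    """Assign each token the tag it occurred with most frequently in training."""
    return counts[word].most_common(1)[0][0]

print(unigram_baseline("race"))  # NN (2 of its 3 training occurrences)
```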
Knowledge-Formalisms Map (next three lectures)
Next Time
• Read Chapter 12 (Syntax & Context-Free Grammars)