04/20/23 CPSC503 Winter 2008 1
CPSC 503 Computational Linguistics
Lecture 6
Giuseppe Carenini
Knowledge-Formalisms Map

• State Machines (and prob. versions)
  (Finite State Automata, Finite State Transducers, Markov Models) — Morphology, Syntax
• Rule systems (and prob. versions)
  (e.g., (Prob.) Context-Free Grammars) — Syntax
• Logical formalisms (First-Order Logics) — Semantics
• AI planners — Pragmatics, Discourse and Dialogue

Markov Models:
• Markov Chains -> n-grams
• Hidden Markov Models (HMM)
• Max-Entropy Markov Models (MEMM)
Today 24/9
• n-grams evaluation
• Markov Chains
• Hidden Markov Models:
  – definition
  – the three key problems (only one in detail)
• Part-of-speech tagging
  – What it is
  – Why we need it
  – How to do it
Model Evaluation: Goal
You may want to compare:
• 2-grams with 3-grams
• two different smoothing techniques (given the same n-grams)
on a given corpus…
Model Evaluation: Key Ideas
A: split the corpus into a training set and a testing set
B: train the models on the training set (counting, frequencies, smoothing) → models Q1 and Q2
C: apply the models to the test set w_1^N and compare the results
Entropy
• Def1. Measure of uncertainty
• Def2. Measure of the information that we need to resolve an uncertain situation
  – Let p(x) = P(X = x), where x ∈ X.
  – H(p) = H(X) = -Σ_{x∈X} p(x) log₂ p(x)
  – It is normally measured in bits.
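Def. 2 can be sketched in a few lines of Python; the distributions here are hypothetical dicts mapping outcomes to probabilities:

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log2 p(x), in bits; terms with p(x)=0 contribute 0."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

# A fair coin is maximally uncertain for two outcomes -> 1 bit
fair = {"H": 0.5, "T": 0.5}
# A biased coin is less uncertain -> fewer bits needed
biased = {"H": 0.9, "T": 0.1}
print(entropy(fair))    # 1.0
print(entropy(biased))  # less than 1 bit
```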
Model Evaluation
P(w_1, .., w_n) — actual distribution
Q(w_1, .., w_n) — our approximation
How different?
Relative Entropy (KL divergence):
D(p||q) = Σ_{x∈X} p(x) log( p(x) / q(x) )
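The divergence formula, sketched directly (the two toy distributions are hypothetical; Q must give positive probability wherever P does):

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}   # actual distribution
q = {"a": 0.9, "b": 0.1}   # our approximation
print(kl_divergence(p, p))  # 0.0 for identical distributions
print(kl_divergence(p, q))  # > 0, and note D(p||q) != D(q||p) in general
```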
Entropy of P(w_1, .., w_n):
H(P) = H(w_1^n) = -Σ_{w_1^n ∈ L} P(w_1^n) log₂ P(w_1^n)

Entropy rate: (1/n) H(w_1^n)

Entropy of the language L:
H(L) = lim_{n→∞} -(1/n) Σ_{w_1^n ∈ L} P(w_1^n) log₂ P(w_1^n)

Assumptions: ergodic and stationary (Shannon-McMillan-Breiman):
H(L) = lim_{n→∞} -(1/n) log₂ P(w_1^n)

Entropy can be computed by taking the average log probability of a looooong sample.
Cross-Entropy
Between a probability distribution P and another distribution Q (a model for P):

H(P, Q) = H(P) + D(P||Q) = -Σ_x P(x) log₂ Q(x)

H(P, Q) ≥ H(P)

Between two models Q1 and Q2, the more accurate is the one with the lower cross-entropy.

Applied to language:
H(P, Q) = lim_{n→∞} -(1/n) Σ_{w_1^n ∈ L} P(w_1^n) log₂ Q(w_1^n)
        = lim_{n→∞} -(1/n) log₂ Q(w_1^n)
Model Evaluation: In practice
A: split the corpus into a training set and a testing set
B: train the models on the training set (counting, frequencies, smoothing) → models Q1 and Q2
C: apply the models to the test set w_1^N and compare the cross-perplexities:

H(P, Q) ≈ -(1/n) log₂ Q(w_1^n)

compare 2^{H(P, Q1)} with 2^{H(P, Q2)}
k-fold cross validation and t-test
• Randomly divide the corpus into k subsets of equal size
• Use each subset for testing (all the others for training) — in practice you do k times what we saw in the previous slide
• Now for each model you have k perplexities
• Compare the models' average perplexities with a t-test
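The random split can be sketched as below (the sentence list and k are hypothetical; the resulting k per-model perplexities would then go into a paired t-test, e.g. scipy.stats.ttest_rel):

```python
import random

def k_fold_splits(corpus, k, seed=0):
    """Randomly divide the corpus into k subsets of (roughly) equal size;
    yield (training, testing) pairs, using each subset once for testing."""
    items = list(corpus)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        testing = folds[i]
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        yield training, testing

sentences = [f"sent{i}" for i in range(10)]
for training, testing in k_fold_splits(sentences, k=5):
    print(len(training), len(testing))  # 8 2, five times
```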
Example of a Markov Chain

(figure: transition diagram over the letter states t, e, h, a, p, i, with Start probabilities .6 and .4 and transition probabilities such as .3, .4, .6 and 1 on the arcs)
Markov-Chain
Formal description (Manning/Schütze, 2000: 318):

Probability of initial states:
π_i = P(X_1 = s_i), with Σ_{i=1}^{N} π_i = 1
(in the example: π_t = .6, π_a = .4)

Stochastic transition matrix A:
a_{ij} = P(X_{t+1} = s_j | X_t = s_i), with a_{ij} ≥ 0 and Σ_{j=1}^{N} a_{ij} = 1 for all i

(figure: the 6×6 matrix A for the example, rows X_t and columns X_{t+1}, over the states t, i, p, a, h, e)
Markov Assumptions
• Let X = (X_1, .., X_t) be a sequence of random variables taking values in some finite set S = {s_1, …, s_n}, the state space. The Markov properties are:
• (a) Limited Horizon:
  For all t, P(X_{t+1} | X_1, .., X_t) = P(X_{t+1} | X_t)
• (b) Time Invariant:
  For all t, P(X_{t+1} | X_t) = P(X_2 | X_1), i.e., the dependency does not change over time.
Markov-Chain
Probability of a sequence of states X_1 … X_T (Manning/Schütze, 2000: 320):

P(X_1, …, X_T) = P(X_1) P(X_2|X_1) P(X_3|X_1,X_2) … P(X_T|X_1,…,X_{T-1})
             = P(X_1) P(X_2|X_1) P(X_3|X_2) … P(X_T|X_{T-1})
             = π_{X_1} Π_{t=1}^{T-1} a_{X_t X_{t+1}}

Example:
P(t, i, p) = P(X_1=t) P(X_2=i | X_1=t) P(X_3=p | X_2=i) = 0.6 × 0.3 × 0.6 = 0.108

Similar to …….?
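The example can be checked in a few lines; pi and a hold only the probabilities this example needs, taken from the chain above:

```python
# Toy chain from the example: P(t,i,p) = pi_t * a_{t,i} * a_{i,p}
pi = {"t": 0.6, "a": 0.4}               # initial (Start) probabilities
a = {("t", "i"): 0.3, ("i", "p"): 0.6}  # only the transitions used here

def sequence_prob(states, pi, a):
    """pi_{X1} * prod_t a_{X_t X_{t+1}}: the chain-rule product above."""
    p = pi.get(states[0], 0.0)
    for prev, nxt in zip(states, states[1:]):
        p *= a.get((prev, nxt), 0.0)  # missing transition -> probability 0
    return p

print(sequence_prob(["t", "i", "p"], pi, a))  # 0.6 * 0.3 * 0.6 = 0.108
```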
HMMs (and MEMMs) intro
They are probabilistic sequence classifiers / sequence labelers: they assign a class/label to each unit in a sequence.
We have already seen a non-probabilistic version...
Used extensively in NLP:
• Part-of-speech tagging
• Partial parsing
• Named entity recognition
• Information extraction
Hidden Markov Model (State Emission)

(figure: state-emission HMM with states s1–s4 emitting symbols a, b, i; Start probabilities .6 and .4; transition and emission probabilities on the arcs)
Hidden Markov Model
Formal specification as a five-tuple (S, K, Π, A, B):

Set of states: S = {s_1, …, s_N}
Output alphabet: K = {k_1, …, k_M} = {1, …, M}
Initial state probabilities: Π = {π_i}, i ∈ S
State transition probabilities: A = {a_{ij}}, i, j ∈ S, with Σ_{j=1}^{N} a_{ij} = 1
Symbol emission probabilities: B = {b_i(o_t)}, i ∈ S, o_t ∈ K, with Σ_{t=1}^{M} b_i(o_t) = 1
Three fundamental questions for HMMs

Decoding: finding the probability of an observation
Given a model μ = (A, B, Π), compute P(O|μ)
• brute force or Forward/Backward Algorithm
(Manning/Schütze, 2000: 325)

Finding the best state sequence
X̂ = argmax_X P(X | O, μ)
• Viterbi Algorithm

Training: find the model parameters that best explain the observations
argmax_μ P(O_training | μ)
Computing the probability of an observation sequence O = o_1 … o_T

P(O|μ) = Σ_X P(O, X|μ) = Σ_X P(O|X, μ) P(X|μ)

where X ranges over all sequences of T states, and

P(O|X, μ) = Π_{t=1}^{T} b_{X_t}(o_t)
P(X|μ) = π_{X_1} Π_{t=1}^{T-1} a_{X_t X_{t+1}}

so

P(O|μ) = Σ_X π_{X_1} [Π_{t=1}^{T-1} a_{X_t X_{t+1}}] [Π_{t=1}^{T} b_{X_t}(o_t)]

e.g., P(b, i | sample HMM)
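A brute-force sketch of this sum over all N^T state sequences; the two-state HMM below is invented for illustration (it is not the slide's four-state example):

```python
from itertools import product

# Hypothetical 2-state HMM; all numbers invented for illustration
states = ["s1", "s2"]
pi = {"s1": 0.6, "s2": 0.4}
a = {("s1", "s1"): 0.7, ("s1", "s2"): 0.3,
     ("s2", "s1"): 0.4, ("s2", "s2"): 0.6}
b = {("s1", "x"): 0.5, ("s1", "y"): 0.5,
     ("s2", "x"): 0.1, ("s2", "y"): 0.9}

def brute_force_likelihood(obs):
    """P(O|mu) = sum over all state sequences X of P(X|mu) * P(O|X,mu)."""
    total = 0.0
    for seq in product(states, repeat=len(obs)):  # N^T sequences
        p = pi[seq[0]] * b[(seq[0], obs[0])]
        for t in range(1, len(obs)):
            p *= a[(seq[t-1], seq[t])] * b[(seq[t], obs[t])]
        total += p
    return total

print(brute_force_likelihood(["x", "y"]))
```

Summing over all observation sequences of a fixed length gives 1, a useful sanity check on the model.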
Decoding Example (Manning/Schütze, 2000: 327)

P(b, i | sample HMM); T = 2, N = # of states = 4

Sum over all state sequences X_1 … X_T, e.g.:
s1, s1 = 0 ?
s1, s2 = 1 * .1 * .6 * .3
s1, s4 = 1 * .5 * .6 * .7
s2, s4 = 0 ?
……….

Complexity: the sum has N^T terms, each a product of O(T) factors
The forward procedure

α_t(i) = P(o_1 o_2 … o_t, X_t = i | μ)

1. Initialization:
α_1(i) = π_i b_i(o_1), 1 ≤ i ≤ N

2. Induction:
α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_{ij}] b_j(o_{t+1}), 1 ≤ j ≤ N

3. Total:
P(O|μ) = Σ_{i=1}^{N} α_T(i)

Complexity: O(N² T)
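The three steps can be sketched with dicts for π, A and B; the two-state HMM below is invented for illustration:

```python
def forward_likelihood(obs, states, pi, a, b):
    """Forward procedure: alpha_t(i) = P(o_1..o_t, X_t=i | mu); O(N^2 * T)."""
    # 1. Initialization
    alpha = {i: pi[i] * b[(i, obs[0])] for i in states}
    # 2. Induction
    for t in range(1, len(obs)):
        alpha = {j: sum(alpha[i] * a[(i, j)] for i in states) * b[(j, obs[t])]
                 for j in states}
    # 3. Total
    return sum(alpha.values())

# Hypothetical 2-state HMM (invented numbers)
states = ["s1", "s2"]
pi = {"s1": 0.6, "s2": 0.4}
a = {("s1", "s1"): 0.7, ("s1", "s2"): 0.3,
     ("s2", "s1"): 0.4, ("s2", "s2"): 0.6}
b = {("s1", "x"): 0.5, ("s1", "y"): 0.5,
     ("s2", "x"): 0.1, ("s2", "y"): 0.9}
print(forward_likelihood(["x", "y"], states, pi, a, b))
```

The result agrees with summing over all state sequences explicitly, but costs O(N²T) instead of O(N^T).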
Three fundamental questions for HMMs

Decoding: finding the probability of an observation
Given a model μ = (A, B, Π), compute P(O|μ)
• brute force or Forward Algorithm

Finding the best state sequence
X̂ = argmax_X P(X | O, μ)
• Viterbi Algorithm

Training: find the model parameters that best explain the observations
argmax_μ P(O_training | μ)

If interested in the details of the next two questions, read Sections 6.4 – 6.5.
Parts of Speech Tagging
• What is it?
• Why do we need it?
• Word classes (tags)
  – Distribution
  – Tagsets
• How to do it
  – Rule-based
  – Stochastic
  – Transformation-based
Parts of Speech Tagging: What

Input:
• Brainpower, not physical plant, is now a firm's chief asset.

Output:
• Brainpower_NN ,_, not_RB physical_JJ plant_NN ,_, is_VBZ now_RB a_DT firm_NN 's_POS chief_JJ asset_NN ._.

Tag meanings:
• NNP (Proper N sing), RB (Adv), JJ (Adj), NN (N sing. or mass), VBZ (V 3sg pres), DT (Determiner), POS (Possessive ending), . (sentence-final punct)
Parts of Speech Tagging: Why?

Part-of-speech (word class, morph. class, syntactic category) gives a significant amount of info about the word and its neighbors. Useful in the following NLP tasks:
• As a basis for (partial) parsing
• Information retrieval
• Word-sense disambiguation
• Speech synthesis
• Improving language models (spelling/speech)
Parts of Speech
• Eight basic categories:
  – Noun, verb, pronoun, preposition, adjective, adverb, article, conjunction
• These categories are based on:
  – morphological properties (affixes they take)
  – distributional properties (what other words can occur nearby)
  – e.g., green: It is so… , both…, The… is
• Not semantics!
Parts of Speech
• Two kinds of category:
  – Closed class (generally function words): prepositions, articles, conjunctions, pronouns, determiners, aux, numerals — very short, frequent and important
  – Open class: nouns (proper/common; mass/count), verbs, adjectives, adverbs — objects, actions, events, properties
• If you run across an unknown word….??
PoS Distribution
• Parts of speech follow the usual behavior in language:
  – ~35k words have 1 PoS
  – ~4k words have 2 PoS (unfortunately very frequent words)
  – ~4k words have >2 PoS
• …but luckily the different tags associated with a word are not equally likely
Sets of Parts of Speech: Tagsets
• Most commonly used:
  – 45-tag Penn Treebank
  – 61-tag C5
  – 146-tag C7
• The choice of tagset is based on the application (do you care about distinguishing between "to" as a prep and "to" as an infinitive marker?)
• Accurate tagging can be done even with large tagsets
PoS Tagging

Input text:
• Brainpower, not physical plant, is now a firm's chief asset. …………

Tagger
(dictionary: word_i -> set of tags from the Tagset)

Output:
• Brainpower_NN ,_, not_RB physical_JJ plant_NN ,_, is_VBZ now_RB a_DT firm_NN 's_POS chief_JJ asset_NN ._. ……….
Tagger Types
• Rule-based ~ '95
• Stochastic
  – HMM tagger ~ >= '92
  – Transformation-based tagger (Brill) ~ >= '95
  – Maximum Entropy Models ~ >= '97
Rule-Based (ENGTWOL '95)
1. A lexicon transducer returns for each word all possible morphological parses
2. A set of ~1,000 constraints is applied to rule out inappropriate PoS

Step 1: sample I/O
"Pavlov had shown that salivation…."
Pavlov  N SG PROPER
had     HAVE V PAST SVO
        HAVE PCP2 SVO
shown   SHOW PCP2 SVOO
……
that    ADV
        PRON DEM SG
        CS
……..

Sample Constraint
Example: Adverbial "that" rule
Given input: "that"
If
  (+1 A/ADV/QUANT)
  (+2 SENT-LIM)
  (NOT -1 SVOC/A)
Then eliminate non-ADV tags
Else eliminate ADV
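A hedged sketch of how such a constraint might be applied; the tag strings and the function name are simplifications invented here, not ENGTWOL's actual representation:

```python
def apply_adverbial_that_rule(tags, next1, next2, prev):
    """Simplified version of the adverbial-'that' constraint above.
    tags: candidate tags for 'that'; next1/next2/prev: context tags.
    If the next word is A/ADV/QUANT, the one after ends the sentence,
    and the previous word is not SVOC/A, keep only ADV; else drop ADV."""
    if next1 in {"A", "ADV", "QUANT"} and next2 == "SENT-LIM" and prev != "SVOC/A":
        return {t for t in tags if t == "ADV"}
    return {t for t in tags if t != "ADV"}

# "it isn't that odd."  -> intensifier 'that' keeps only ADV
print(apply_adverbial_that_rule({"ADV", "PRON", "CS"}, "A", "SENT-LIM", "VBZ"))
# "that salivation..."  -> ADV eliminated, other readings survive
print(apply_adverbial_that_rule({"ADV", "PRON", "CS"}, "N", "SENT-LIM", "VBZ"))
```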
HMM Stochastic Tagging
• Tags correspond to HMM states
• Words correspond to the HMM alphabet symbols

Tagging: given a sequence of words (observations), find the most likely sequence of tags (states). But this is…..!
We need: state transition and symbol emission probabilities

P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
P(w_i | t_i) = C(t_i, w_i) / C(t_i)

1) From a hand-tagged corpus
2) No tagged corpus: parameter estimation (Baum-Welch)
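Case 1 is simple counting; a sketch over a hypothetical tiny hand-tagged corpus (the sentences and tags are invented):

```python
from collections import Counter

# Hypothetical hand-tagged corpus: lists of (word, tag) pairs
corpus = [[("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
          [("the", "DT"), ("dog", "NN"), ("sleeps", "VBZ")]]

tag_bigrams, tag_counts, word_tags = Counter(), Counter(), Counter()
for sent in corpus:
    prev = "<s>"                 # pseudo-tag for sentence start
    tag_counts[prev] += 1
    for word, tag in sent:
        tag_bigrams[(prev, tag)] += 1
        tag_counts[tag] += 1
        word_tags[(word, tag)] += 1
        prev = tag

def p_tag(t, prev):   # P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
    return tag_bigrams[(prev, t)] / tag_counts[prev]

def p_word(w, t):     # P(w_i | t_i) = C(t_i, w_i) / C(t_i)
    return word_tags[(w, t)] / tag_counts[t]

print(p_tag("NN", "DT"))    # 2/2 = 1.0
print(p_word("cat", "NN"))  # 1/2 = 0.5
```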
Evaluating Taggers
• Accuracy: percent correct (most current taggers 96–97%) *test on unseen data!*
• Human ceiling: agreement rate of humans on classification (96–97%)
• Unigram baseline: assign each token to the class it occurred in most frequently in the training set (race -> NN). (91%)
• What is causing the errors? Build a confusion matrix…
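The unigram baseline itself is only a few lines; the training counts below are invented for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical training observations of (word, tag)
train = [("race", "NN"), ("race", "NN"), ("race", "VB"),
         ("the", "DT"), ("the", "DT")]

counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

def unigram_baseline(word):
    """Assign each token the tag it occurred with most frequently in training."""
    return counts[word].most_common(1)[0][0]

print(unigram_baseline("race"))  # NN (2 of its 3 training occurrences)
```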
Knowledge-Formalisms Map (next three lectures)
Next Time
• Read Chapter 12 (Syntax & Context-Free Grammars)