Part of Speech Tagging
& Hidden Markov Models (Part 1)
Mitch Marcus
CIS 421/521
NLP Task I – Determining Part of Speech Tags
• Given a text, assign each token its correct part of speech
(POS) tag, given its context and a list of possible POS tags
for each word type
Word POS listing in the Brown Corpus:
heat: noun, verb
oil: noun
in: prep, noun, adv
a: det, noun, noun-proper
large: adj, noun, adv
pot: noun
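Such a listing is just a map from word types to their candidate tags; the tagger's job is to pick one tag per token in context. A minimal Python sketch of the listing above, showing how quickly the candidate tag sequences multiply:

```python
# Possible POS tags per word type, from the Brown corpus listing above.
POSSIBLE_TAGS = {
    "heat":  ["noun", "verb"],
    "oil":   ["noun"],
    "in":    ["prep", "noun", "adv"],
    "a":     ["det", "noun", "noun-proper"],
    "large": ["adj", "noun", "adv"],
    "pot":   ["noun"],
}

# Candidate tag sequences for the sentence: 2 * 1 * 3 * 3 * 3 * 1 = 54
n_sequences = 1
for word in "heat oil in a large pot".split():
    n_sequences *= len(POSSIBLE_TAGS[word])
print(n_sequences)  # 54
```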
What is POS tagging good for?
• Speech synthesis:
  • How to pronounce "lead"?
  • INsult vs. inSULT
  • OBject vs. obJECT
  • OVERflow vs. overFLOW
  • DIScount vs. disCOUNT
  • CONtent vs. conTENT
• Machine Translation:
  • translations of nouns and verbs are different
• Stemming for search:
  • Knowing a word is a V tells you it gets past tense, participles, etc.
  • Can search for "walk" and get "walked", "walking", …
Equivalent Problem in Bioinformatics
• From a sequence of amino acids (primary structure): ATCPLELLLD
• Infer the secondary structure (features of the 3D structure, like helices, sheets, etc.): HHHBBBBBC..
Figure from: http://www.particlesciences.com/news/technical-briefs/2009/protein-structure.html
Penn Treebank Tagset I
Tag Description Example
CC coordinating conjunction and
CD cardinal number 1, third
DT determiner the
EX existential there there is
FW foreign word d'oeuvre
IN preposition/subordinating conjunction in, of, like
JJ adjective green
JJR adjective, comparative greener
JJS adjective, superlative greenest
LS list marker 1)
MD modal could, will
NN noun, singular or mass table
NNS noun plural tables (supports)
NNP proper noun, singular John
NNPS proper noun, plural Vikings
Penn Treebank Tagset II
Tag Description Example
PDT predeterminer both the boys
POS possessive ending friend 's
PRP personal pronoun I, me, him, he, it
PRP$ possessive pronoun my, his
RB adverb however, usually, here, good
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to to go, to him
UH interjection uhhuhhuhh
Penn Treebank Tagset III
Tag Description Example
VB verb, base form take (support)
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, sing. present, non-3d take
VBZ verb, 3rd person sing. present takes (supports)
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when
NLP Task I – Determining Part of Speech Tags
• The Old Solution: depth-first search.
  • If each of n word tokens has k tags on average, try the k^n combinations until one works.
• Machine Learning Solutions: automatically learn the Part of Speech (POS) assignment.
  • The best techniques achieve 97+% accuracy per word on new materials, given a POS-tagged training corpus of 10^6 tokens (with 3% error) over a set of ~40 POS tags (the tags on the last three slides)
Simple Statistical Approaches: Idea 1
Simple Statistical Approaches: Idea 2
For a string of words

$W = w_1 w_2 w_3 \ldots w_n$

find the string of POS tags

$T = t_1 t_2 t_3 \ldots t_n$

which maximizes $P(T \mid W)$

• i.e., the most likely POS tag $t_i$ for each word $w_i$, given its surrounding context
The Sparse Data Problem …
A Simple, Impossible Approach to Compute P(T|W):
Count up instances of the string "heat oil in a large pot" in the training corpus, and pick the most common tag assignment to the string.

One more time: A BOTEC (Back-Of-The-Envelope Calculation) Estimate of What Works
What parameters can we estimate with a million words of hand-tagged training data?
• Assume a uniform distribution over 5,000 words and 40 part-of-speech tags.
We can get reasonable estimates of:
• Tag bigrams (only 40 × 40 = 1,600 possible pairs, vs. 10^6 training tokens)
• Word × tag pairs
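As a sketch of how those counts would be collected (the corpus format, a list of sentences each given as (word, tag) pairs, and all names here are illustrative assumptions):

```python
from collections import Counter

def collect_counts(tagged_sentences):
    """Gather tag unigram, tag bigram, and (word, tag) counts
    from sentences given as lists of (word, tag) pairs."""
    tag_unigrams, tag_bigrams, word_tag = Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        prev = "<s>"              # sentence-start pseudo-tag
        tag_unigrams[prev] += 1   # so P(t | <s>) normalizes correctly
        for word, tag in sentence:
            tag_unigrams[tag] += 1
            tag_bigrams[(prev, tag)] += 1
            word_tag[(word, tag)] += 1
            prev = tag
    return tag_unigrams, tag_bigrams, word_tag
```

With ~40 tags there are at most about 1,600 distinct tag bigrams, so a million training tokens yields hundreds of observations per bigram on average; word × tag pairs are sparser but still mostly observable for common words.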
Bayes Rule plus Markov Assumptions
yields a practical POS tagger!
I. By Bayes Rule:

$P(T \mid W) = \dfrac{P(W \mid T)\, P(T)}{P(W)}$

II. So we want to find:

$\arg\max_T P(T \mid W) = \arg\max_T P(W \mid T)\, P(T)$

III. To compute P(W|T):
  • use the chain rule + a Markov assumption
  • Estimation requires word × tag and tag counts
IV. To compute P(T):
  • use the chain rule + a slightly different Markov assumption
  • Estimation requires tag unigram and bigram counts
IV. To compute P(T): just like computing P(W) last lecture

I. By the chain rule,

$P(T) = P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_1 t_2) \cdots P(t_n \mid t_1 \ldots t_{n-1})$

II. Applying the 1st-order Markov Assumption:

$P(T) \approx P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_2) \cdots P(t_n \mid t_{n-1})$

Estimated using tag bigrams/tag unigrams!
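As a minimal sketch, the bigram approximation turns P(T) into a single pass over the tag sequence. Here p_tag_given_prev is an assumed dict of estimated P(t_i | t_{i-1}), with "<s>" as the start symbol, so P(t_1) is folded into the table as P(t_1 | <s>):

```python
def prob_T(tags, p_tag_given_prev):
    """P(T) under the first-order Markov assumption."""
    p, prev = 1.0, "<s>"
    for t in tags:
        p *= p_tag_given_prev.get((prev, t), 0.0)
        prev = t
    return p
```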
III. To compute P(W|T):

I. Assume that the words $w_i$ are conditionally independent given the tag sequence $T = t_1 t_2 \ldots t_n$:

$P(W \mid T) = \prod_{i=1}^{n} P(w_i \mid T)$

II. Applying a zeroth-order Markov Assumption:

$P(w_i \mid T) \approx P(w_i \mid t_i)$

by which

$P(W \mid T) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$

So, for a given string $W = w_1 w_2 \ldots w_n$, the tagger needs to find the string of tags $T$ which maximizes

$P(W \mid T)\, P(T) \approx \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$
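Combining the two approximations, the quantity to maximize factors into one local term per word. A sketch under the same assumed tables (p_word_given_tag is an illustrative dict of estimated P(w_i | t_i)):

```python
def score(words, tags, p_tag_given_prev, p_word_given_tag):
    """P(W | T) * P(T) under the zeroth-/first-order Markov assumptions."""
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= p_word_given_tag.get((w, t), 0.0) * p_tag_given_prev.get((prev, t), 0.0)
        prev = t
    return p
```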
Hidden Markov Models
This model is an instance of a Hidden Markov Model. Viewed graphically:

[Figure: a four-state HMM with POS states Det, Adj, Noun, and Verb, transition probabilities on the arcs, and emission tables P(w|Det): a .4, the .4; P(w|Adj): good .02, low .04; P(w|Noun): price .001, deal .0001]
Viewed as a generator, an HMM:

[Figure: the same Det/Adj/Noun/Verb HMM drawn as a generator, with the same emission tables P(w|Det), P(w|Adj), and P(w|Noun)]
Summary: Recognition using an HMM

I. By Bayes Rule:

$P(T \mid W) = \dfrac{P(T)\, P(W \mid T)}{P(W)}$

II. We select the tag sequence T that maximizes P(T|W):

$\arg\max_T P(T \mid W) = \arg\max_{T = t_1 \ldots t_n} P(T)\, P(W \mid T) = \arg\max_{T = t_1 \ldots t_n} \prod_{i=1}^{n} a(t_{i-1}, t_i)\, b(t_i, w_i)$

where $a(t_{i-1}, t_i)$ is the tag-transition probability (with $t_0$ the start symbol) and $b(t_i, w_i)$ is the probability that tag $t_i$ emits word $w_i$.
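To make the argmax concrete, here is a deliberately naïve sketch that enumerates every possible tag sequence and scores it with the product above (score_fn is any scorer like the score function sketched earlier). It is exponential in sentence length, which is exactly the problem Viterbi decoding fixes on the next slides:

```python
from itertools import product

def brute_force_tag(words, tagset, score_fn):
    """argmax over all |tagset|**len(words) tag sequences -- toy use only."""
    best_tags, best_p = None, -1.0
    for tags in product(tagset, repeat=len(words)):
        p = score_fn(words, tags)
        if p > best_p:
            best_tags, best_p = list(tags), p
    return best_tags, best_p
```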
Training and Performance
• To estimate the parameters of this model, given an annotated training corpus, use the MLE (relative frequency) estimates:

$P(t_i \mid t_{i-1}) \approx \dfrac{\mathrm{Count}(t_{i-1}, t_i)}{\mathrm{Count}(t_{i-1})} \qquad P(w_i \mid t_i) \approx \dfrac{\mathrm{Count}(w_i, t_i)}{\mathrm{Count}(t_i)}$

• Because many of these counts are small, smoothing is necessary for best results (one possible scheme is sketched after this slide)…
• Such taggers typically achieve about 95–96% correct tagging, for the standard 40-tag POS set.
• A few tricks for unknown words increase accuracy to 97%.

POS from just tag bigrams and word × tag pairs??
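The slides do not say which smoothing method was used; as one common choice, an add-λ (Lidstone) estimate over the counts gathered earlier might look like this sketch:

```python
def smoothed_transition(prev, tag, tag_bigrams, tag_unigrams, tagset, lam=0.1):
    """Add-lambda (Lidstone) smoothed estimate of P(tag | prev).

    tag_bigrams / tag_unigrams are Counters as in collect_counts above;
    lam is an illustrative smoothing constant, tuned in practice."""
    return (tag_bigrams[(prev, tag)] + lam) / (tag_unigrams[prev] + lam * len(tagset))
```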
A Practical Compromise
• Rich models often require vast amounts of data
• Well-estimated bad models often outperform badly estimated truer models
(Mutt & Jeff 1942)
Practical Tagging using HMMs
• Finding this maximum can be done by an exponential search through all possible tag strings T.
• However, there is a linear-time solution (in sentence length) using dynamic programming, called Viterbi decoding.
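A compact sketch of Viterbi decoding for the bigram tagger, reusing the illustrative probability tables assumed in the earlier sketches; log-probabilities are used to avoid numeric underflow on long sentences:

```python
import math

def viterbi(words, tagset, p_tag_given_prev, p_word_given_tag):
    """argmax_T P(W|T)P(T) in O(len(words) * |tagset|**2) time."""
    def logp(x):
        return math.log(x) if x > 0 else float("-inf")

    # delta[t]: best log-score of any tag sequence for the prefix ending in t.
    delta = {t: logp(p_tag_given_prev.get(("<s>", t), 0.0))
                + logp(p_word_given_tag.get((words[0], t), 0.0))
             for t in tagset}
    backpointers = []
    for w in words[1:]:
        new_delta, pointers = {}, {}
        for t in tagset:
            best_prev = max(
                tagset,
                key=lambda p: delta[p] + logp(p_tag_given_prev.get((p, t), 0.0)))
            pointers[t] = best_prev
            new_delta[t] = (delta[best_prev]
                            + logp(p_tag_given_prev.get((best_prev, t), 0.0))
                            + logp(p_word_given_tag.get((w, t), 0.0)))
        backpointers.append(pointers)
        delta = new_delta

    # Trace back from the best final tag.
    tags = [max(tagset, key=lambda t: delta[t])]
    for pointers in reversed(backpointers):
        tags.append(pointers[tags[-1]])
    return list(reversed(tags))
```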
The three basic HMM problems
Parameters of an HMM
• States: a set of states $S = s_1, \ldots, s_N$
• Transition probabilities: $A = a_{1,1}, a_{1,2}, \ldots, a_{N,N}$. Each $a_{i,j}$ represents the probability of transitioning from state $s_i$ to $s_j$.
• Emission probabilities: a set $B$ of functions of the form $b_i(o_t)$, the probability of observation $o_t$ being emitted by state $s_i$
• Initial state distribution: $\pi_i$ is the probability that $s_i$ is a start state

(This and later slides follow the classic formulation by Ferguson, as published by Rabiner and Juang and adapted by Manning and Schütze. Note the change in notation!!)
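In code these parameters are just three tables. A sketch in the dict-of-dicts layout used in later examples; the emission values are the ones recoverable from the earlier generator diagram, while the transition and initial values are purely illustrative placeholders:

```python
# States S
STATES = ["Det", "Adj", "Noun", "Verb"]

# Transition probabilities a_ij -- illustrative values (each row sums to 1).
A = {
    "Det":  {"Adj": 0.3, "Noun": 0.7},
    "Adj":  {"Adj": 0.1, "Noun": 0.9},
    "Noun": {"Noun": 0.2, "Verb": 0.8},
    "Verb": {"Det": 0.6, "Adj": 0.1, "Noun": 0.3},
}

# Emission probabilities b_i(o_t), from the earlier diagram (only a few
# words are shown there, so these rows do not sum to 1).
B = {
    "Det":  {"a": 0.4, "the": 0.4},
    "Adj":  {"good": 0.02, "low": 0.04},
    "Noun": {"price": 0.001, "deal": 0.0001},
}

# Initial state distribution pi_i -- illustrative.
PI = {"Det": 0.8, "Adj": 0.1, "Noun": 0.1}
```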
The Three Basic HMM Problems
• Problem 1 (Evaluation): Given the observation sequence $O = o_1, \ldots, o_T$ and an HMM model $\lambda = (A, B, \pi)$, how do we compute the probability of O given the model?
• Problem 2 (Decoding): Given the observation sequence O and an HMM model $\lambda$, how do we find the state sequence that best explains the observations?
• Problem 3 (Learning): How do we adjust the model parameters $\lambda = (A, B, \pi)$ to maximize $P(O \mid \lambda)$?
Problem 1: Probability of an Observation Sequence
• Q: What is $P(O \mid \lambda)$?
• A: the sum of the probabilities of all possible state sequences in the HMM.
• Naïve computation is very expensive. Given T observations and N states, there are $N^T$ possible state sequences.
  • (for T = 10 and N = 10, 10 billion different paths!!)
• Solution: linear-time dynamic programming!
The Crucial Data Structure: The Trellis
Forward Probabilities:
• For a given HMM $\lambda$, for some time t, what is the probability that the partial observation $o_1 \ldots o_t$ has been generated and that the state at time t is $s_i$?

$\alpha_t(i) = P(o_1 \ldots o_t,\, q_t = s_i \mid \lambda)$

• The forward algorithm computes $\alpha_t(i)$ for all $1 \le i \le N$, $1 \le t \le T$ in time $O(N^2 T)$ using the trellis
Forward Algorithm: Induction step
$\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t)$

where $\alpha_t(i) = P(o_1 \ldots o_t,\, q_t = s_i \mid \lambda)$
Forward Algorithm
• Initialization (the probability that $o_1$ has been generated and that the state is $s_i$ at time t = 1):

$\alpha_1(i) = \pi_i\, b_i(o_1), \qquad 1 \le i \le N$

• Induction:

$\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t), \qquad 2 \le t \le T,\ 1 \le j \le N$

• Termination:

$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$
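The three steps translate directly into code. A sketch using the dict-of-dicts parameter layout from the earlier parameters sketch (missing table entries are treated as probability 0):

```python
def forward(observations, states, A, B, PI):
    """P(O | lambda) by the forward algorithm, in O(N**2 * T) time."""
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = {s: PI.get(s, 0.0) * B.get(s, {}).get(observations[0], 0.0)
             for s in states}
    # Induction: alpha_t(j) = [sum_i alpha_{t-1}(i) * a_ij] * b_j(o_t)
    for o in observations[1:]:
        alpha = {j: sum(alpha[i] * A.get(i, {}).get(j, 0.0) for i in states)
                    * B.get(j, {}).get(o, 0.0)
                 for j in states}
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha.values())

# Usage with the illustrative tables sketched earlier:
# forward(["the", "low", "price"], STATES, A, B, PI)
```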
Forward Algorithm Complexity
• The naïve approach requires exponential time to evaluate all $N^T$ state sequences
• The forward algorithm, using dynamic programming, takes $O(N^2 T)$ computations