Hidden Markov Model (HMM) Tagging
Using an HMM to do POS tagging
HMM tagging is a special case of Bayesian inference
Goal: maximize P(word|tag) × P(tag|previous n tags)
- P(word|tag): the word/lexical likelihood; the probability that, given this tag, we see this word (NOT the probability that this word has this tag); modeled through a language model (the word-tag matrix)
- P(tag|previous n tags): the tag sequence likelihood; the probability that this tag follows these previous tags; modeled through a language model (the tag-tag matrix)
Hidden Markov Model (HMM) Taggers
- Lexical information
- Syntagmatic information
POS tagging as a sequence classification task
- We are given a sentence (an observation or sequence of observations): Secretariat is expected to race tomorrow. That is, a sequence of n words w1...wn.
- What is the best sequence of tags which corresponds to this sequence of observations?
- Probabilistic/Bayesian view: consider all possible sequences of tags; out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1...wn.
Getting to HMM
Let T = t1, t2, ..., tn
Let W = w1, w2, ..., wn
Goal: out of all sequences of tags t1...tn, get the most probable sequence of POS tags T underlying the observed sequence of words w1, w2, ..., wn
The hat (^) means our estimate of the best, i.e., the most probable, tag sequence
argmax_x f(x) means "the x such that f(x) is maximized"; here it picks out our estimate of the best tag sequence
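In symbols (a standard reconstruction; the slide's own formula did not survive extraction), the goal is:

```latex
\hat{T} = \operatorname*{argmax}_{T} P(T \mid W)
```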
Bayes Rule
We can drop the denominator: it does not change across tag sequences, since we are looking for the best tag sequence for the same observation, i.e., the same fixed sequence of words
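Written out (again a standard reconstruction of the formula the slide presumably showed), applying Bayes' rule and dropping the constant denominator P(W) gives:

```latex
P(T \mid W) = \frac{P(W \mid T)\,P(T)}{P(W)}
\quad\Longrightarrow\quad
\hat{T} = \operatorname*{argmax}_{T} P(W \mid T)\,P(T)
```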
Likelihood and prior: P(W|T) is the likelihood, P(T) is the prior
Likelihood and prior: further simplifications
1. The probability of a word appearing depends only on its own POS tag, i.e., it is independent of the other words around it
2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag
3. The most probable tag sequence is then estimated by the bigram tagger
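In symbols, assuming the standard formulation these bullets describe, simplifications 1 and 2 are:

```latex
P(w_1 \dots w_n \mid t_1 \dots t_n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)
\qquad
P(t_1 \dots t_n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})
```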
Likelihood and prior: further simplifications
2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag.
- Bigrams are groups of two written letters, two syllables, or two words; they are a special case of N-grams.
- Bigrams are used as the basis for simple statistical analysis of text.
- The bigram assumption is related to the first-order Markov assumption.
Likelihood and prior: further simplifications
3. The most probable tag sequence estimated by the bigram tagger (using the bigram assumption)
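In symbols, the resulting bigram-tagger estimate (standard form, reconstructed here since the slide's formula was an image) is:

```latex
\hat{T} = \operatorname*{argmax}_{t_1 \dots t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```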
Probability estimates
Tag transition probabilities p(ti|ti-1)
- Determiners likely to precede adjectives and nouns:
  That/DT flight/NN
  The/DT yellow/JJ hat/NN
- So we expect P(NN|DT) and P(JJ|DT) to be high
Estimating probabilities
Tag transition probabilities p(ti|ti-1)
- Compute P(NN|DT) by counting in a labeled corpus: # of times DT is followed by NN
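The corresponding count-based (maximum likelihood) estimate the bullet alludes to is:

```latex
P(\mathrm{NN} \mid \mathrm{DT}) = \frac{C(\mathrm{DT},\,\mathrm{NN})}{C(\mathrm{DT})}
```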
Two kinds of probabilities
Word likelihood probabilities p(wi|ti)
- P(is|VBZ) = probability of VBZ (3sg pres verb) being "is"
- Compute P(is|VBZ) by counting in a labeled corpus: if we were expecting a third person singular verb, how likely is it that this verb would be "is"?
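As with the transition probabilities, the standard estimate is a ratio of counts in the labeled corpus:

```latex
P(\mathit{is} \mid \mathrm{VBZ}) = \frac{C(\mathrm{VBZ},\,\mathit{is})}{C(\mathrm{VBZ})}
```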
An Example: the verb "race"
Two possible tags:
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
How do we pick the right tag?
Disambiguating race
Disambiguating "race"
P(NN|TO) = .00047    P(VB|TO) = .83
The tag transition probabilities P(NN|TO) and P(VB|TO) answer the question: how likely are we to expect a verb/noun given the previous tag TO?
P(race|NN) = .00057    P(race|VB) = .00012
Lexical likelihoods from the Brown corpus for "race" given the POS tag NN or VB.
Disambiguating "race"
P(NR|VB) = .0027    P(NR|NN) = .0012
Tag sequence probabilities for the likelihood of an adverb (NR) occurring given the previous tag, verb or noun.
P(VB|TO) P(NR|VB) P(race|VB) = .00000027
P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
Multiplying the lexical likelihoods with the tag sequence probabilities: the verb wins
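Spelling out the arithmetic with the probabilities quoted above:

```latex
\begin{aligned}
P(\mathrm{VB}\mid\mathrm{TO})\,P(\mathrm{NR}\mid\mathrm{VB})\,P(\mathit{race}\mid\mathrm{VB})
  &= 0.83 \times 0.0027 \times 0.00012 \approx 2.7 \times 10^{-7} \\
P(\mathrm{NN}\mid\mathrm{TO})\,P(\mathrm{NR}\mid\mathrm{NN})\,P(\mathit{race}\mid\mathrm{NN})
  &= 0.00047 \times 0.0012 \times 0.00057 \approx 3.2 \times 10^{-10}
\end{aligned}
```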
Hidden Markov Models
- What we've described with these two kinds of probabilities is a Hidden Markov Model (HMM)
- Let's spend a bit of time tying this into the model
- In order to define an HMM, we will first introduce the Markov chain, or observable Markov model
Definitions
- A weighted finite-state automaton adds probabilities to the arcs; the probabilities on the arcs leaving any state must sum to one
- A Markov chain is a special case of a weighted finite-state automaton in which the input sequence uniquely determines which states the automaton will go through
- Markov chains can't represent inherently ambiguous problems; they are useful for assigning probabilities to unambiguous sequences
Hidden Markov Models: formal definition
- States: Q = q1, q2, ..., qN
- Observations: O = o1, o2, ..., oN; each observation is a symbol drawn from a vocabulary V = {v1, v2, ..., v|V|}
- Transition probabilities (prior): transition probability matrix A = {aij}
- Observation likelihoods (likelihood): output probability matrix B = {bi(ot)}, a set of observation likelihoods, each expressing the probability of an observation ot being generated from a state i (emission probabilities)
- Special initial probability vector π: each πi expresses the probability that the HMM will start in state i, i.e., P(qi|START)
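To make the notation concrete, here is a minimal sketch of the three components for a two-tag toy HMM; all names and numbers below are invented for illustration, not taken from any corpus:

```python
# Illustrative HMM components for a two-tag toy example (numbers invented).
states = ["NN", "VB"]            # Q: hidden states (tags)
vocab = ["race", "flight"]       # V: observation symbols (words)

# A: transition probability matrix, A[i][j] ~ P(state j | state i)
A = {"NN": {"NN": 0.3, "VB": 0.7},
     "VB": {"NN": 0.6, "VB": 0.4}}

# B: observation likelihoods, B[i][o] ~ P(observation o | state i)
B = {"NN": {"race": 0.6, "flight": 0.4},
     "VB": {"race": 0.9, "flight": 0.1}}

# pi: initial probabilities, pi[i] ~ P(state i | START)
pi = {"NN": 0.8, "VB": 0.2}

# Probability of the state sequence NN -> VB emitting "flight race":
p = pi["NN"] * B["NN"]["flight"] * A["NN"]["VB"] * B["VB"]["race"]
print(p)  # 0.8 * 0.4 * 0.7 * 0.9 = 0.2016
```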
AssumptionsMarkov assumption: the probability of a particular state depends only on the previous state
Output-independence assumption: the probability of an output observation depends only on the state that produced that observation
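In formulas (the standard statement of these two assumptions):

```latex
P(q_i \mid q_1 \dots q_{i-1}) \approx P(q_i \mid q_{i-1})
\qquad
P(o_i \mid q_1 \dots q_i,\, o_1 \dots o_{i-1}) \approx P(o_i \mid q_i)
```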
HMM Taggers
Two kinds of probabilities:
- A: transition probabilities (PRIOR)
- B: observation likelihoods (LIKELIHOOD)
HMM taggers choose the tag sequence which maximizes the product of word likelihood and tag sequence probability
Weighted FSM corresponding to hidden states of HMM
Observation likelihoods for the POS HMM
Transition matrix for the POS HMM
The output matrix for the POS HMM
HMM Taggers
- The probabilities are trained on hand-labeled training corpora (the training set)
- Different N-gram levels can be combined
- Taggers are evaluated by comparing their output on a test set to human labels for that test set (the gold standard)
The Viterbi Algorithm
- What is the best tag sequence for "John likes to fish in the sea"?
- Efficiently computes the most likely state sequence given a particular output sequence
- Based on dynamic programming
A smaller example
- What is the best sequence of states for the input string bbba?
- Computing all possible paths and finding the one with the maximum probability is exponential in the input length (with T states and n input symbols, on the order of T^n paths); Viterbi needs only O(T^2 n) work
[Figure: a small weighted automaton over the symbols a and b, with states start, q, r, and end, and probabilities on the arcs]
Viterbi for POS tagging
Let:
- n = number of words in the sentence to tag (number of input tokens)
- T = number of tags in the tag set (number of states)
- vit = path probability matrix (Viterbi); vit[i,j] = probability of being at state (tag) j at word i
- path = matrix used to recover the nodes of the best path (the best tag sequence); path[i+1,j] = the state (tag) of the incoming arc that led to this most probable state j at word i+1
// Initialization
vit[1,PERIOD] := 1.0    // pretend that there is a period before our sentence (start tag = PERIOD)
vit[1,t] := 0.0 for t ≠ PERIOD
Viterbi for POS tagging (cont.)
// Induction (build the path probability matrix)
for i := 1 to n step 1 do                // for all words in the sentence
  for all tags tj do                     // for all possible tags
    // store the max prob of the path
    vit[i+1,tj] := max_{1≤k≤T} ( vit[i,tk] × P(wi+1|tj) × P(tj|tk) )
    // store the actual state
    path[i+1,tj] := argmax_{1≤k≤T} ( vit[i,tk] × P(wi+1|tj) × P(tj|tk) )
  end
end
// Termination and path readout
bestState_{n+1} := argmax_{1≤j≤T} vit[n+1,j]
for j := n to 1 step -1 do               // for all the words in the sentence
  bestState_j := path[j+1, bestState_{j+1}]
end
P(bestState_1, ..., bestState_n) := max_{1≤j≤T} vit[n+1,j]
In the induction step: P(wi+1|tj) is the emission probability, P(tj|tk) is the state transition probability, and vit[i,tk] is the probability of the best path leading to state tk at word i.
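The sketch below is one possible Python rendering of the pseudocode above, with a tiny invented tag set and invented probabilities (they are not estimated from any corpus); it is meant only to show the shape of the computation.

```python
# A runnable sketch of the Viterbi pseudocode above. Tags, transition and
# emission probabilities below are invented for illustration; real values
# would be estimated from a hand-labeled corpus.

def viterbi(words, tags, trans, emit, start_tag="PERIOD"):
    """Return (best tag sequence, its probability) for `words`.

    trans[t_prev][t] ~ P(t | t_prev)   (tag transition probability)
    emit[t][w]       ~ P(w | t)        (word likelihood / emission probability)
    """
    n = len(words)
    # vit[i][t]: probability of the best path that ends in tag t after word i
    vit = [{t: 0.0 for t in tags} for _ in range(n + 1)]
    back = [{t: None for t in tags} for _ in range(n + 1)]

    # Initialization: pretend a period precedes the sentence (start tag = PERIOD).
    vit[0] = {t: (1.0 if t == start_tag else 0.0) for t in tags}

    # Induction: build the path probability matrix column by column.
    for i, w in enumerate(words, start=1):
        for t in tags:
            best_prev, best_p = None, 0.0
            for t_prev in tags:
                p = vit[i - 1][t_prev] * emit[t].get(w, 0.0) * trans[t_prev].get(t, 0.0)
                if p > best_p:
                    best_prev, best_p = t_prev, p
            vit[i][t] = best_p      # max probability of reaching tag t at word i
            back[i][t] = best_prev  # which previous tag achieved that max

    # Termination and path readout: follow the backpointers from the best final tag.
    last = max(tags, key=lambda t: vit[n][t])
    path = [last]
    for i in range(n, 1, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path)), vit[n][last]


# Toy run (all probabilities invented):
tags = ["PERIOD", "TO", "VB", "NN"]
trans = {"PERIOD": {"NN": 0.4, "VB": 0.2, "TO": 0.1},
         "TO":     {"VB": 0.83, "NN": 0.00047},
         "VB":     {"NN": 0.3, "TO": 0.3},
         "NN":     {"VB": 0.1, "TO": 0.2, "PERIOD": 0.3}}
emit = {"PERIOD": {".": 1.0},
        "TO":     {"to": 0.99},
        "VB":     {"race": 0.00012},
        "NN":     {"race": 0.00057}}
print(viterbi(["to", "race"], tags, trans, emit))  # -> (['TO', 'VB'], ...)
```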
Possible improvements
- In bigram POS tagging, we condition a tag only on the preceding tag
- Why not use more context (e.g., a trigram model)? It is more precise:
  "is clearly marked" --> verb, past participle
  "he clearly marked" --> verb, past tense
- Combine trigram, bigram, and unigram models
- Condition on words too
- But with an n-gram approach, this is too costly (too many parameters to model)
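One standard way to "combine trigram, bigram, unigram models" (not spelled out on the slide) is linear interpolation of the corpus estimates, for example:

```latex
P(t_i \mid t_{i-2}, t_{i-1}) \approx
  \lambda_3\,\hat{P}(t_i \mid t_{i-2}, t_{i-1})
  + \lambda_2\,\hat{P}(t_i \mid t_{i-1})
  + \lambda_1\,\hat{P}(t_i),
  \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
```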
Further issues with Markov model tagging
- Unknown words are a problem, since we don't have the required probabilities. Possible solutions:
  - Assign word probabilities based on the corpus-wide distribution of POS tags
  - Use morphological cues (capitalization, suffix) to assign a more calculated guess
- Using higher-order Markov models:
  - A trigram model captures more context
  - However, data sparseness is much more of a problem