
# Hidden Markov Model (HMM) Tagging

Date post: 10-Feb-2016
Author: faxon

Transcript
• Hidden Markov Model (HMM) Tagging
Using an HMM to do POS tagging

HMM is a special case of Bayesian inference

Natural Language Processing

• Goal: maximize P(word|tag) x P(tag|previous n tags)

P(word|tag): word/lexical likelihood. The probability that, given this tag, we see this word (NOT the probability that this word has this tag). Modeled through the language model (word-tag matrix).

P(tag|previous n tags): tag sequence likelihood. The probability that this tag follows these previous tags. Modeled through the language model (tag-tag matrix).

Hidden Markov Model (HMM) taggers thus combine two kinds of information: lexical information and syntagmatic information.

• POS tagging as a sequence classification task
We are given a sentence (an observation, or sequence of observations): Secretariat is expected to race tomorrow. That is, a sequence of n words w1…wn.
What is the best sequence of tags which corresponds to this sequence of observations?
Probabilistic/Bayesian view: consider all possible sequences of tags; out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn.

• Getting to HMM
Let T = t1,t2,…,tn be a sequence of tags and W = w1,w2,…,wn the observed sequence of words.

Goal: out of all sequences of tags t1…tn, find the most probable sequence of POS tags T underlying the observed sequence of words w1,w2,…,wn: T̂ = argmax_T P(T|W)

The hat (^) means "our estimate of the best", i.e. the most probable, tag sequence.

argmax_x f(x) means "the x such that f(x) is maximized"; it selects the tag sequence that maximizes our estimate of the probability.

• Bayes Rule
By Bayes' rule, P(T|W) = P(W|T) P(T) / P(W). We can drop the denominator P(W): it does not change across tag sequences, since we are looking for the best tag sequence for the same observation, i.e. the same fixed sequence of words.
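Written out (a reconstruction from the surrounding text; the original slide showed this as an image), the derivation is:

```latex
\hat{T} = \operatorname*{argmax}_{T} P(T \mid W)
        = \operatorname*{argmax}_{T} \frac{P(W \mid T)\, P(T)}{P(W)}
        = \operatorname*{argmax}_{T} P(W \mid T)\, P(T)
```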

• Bayes Rule

• Likelihood and prior

• Likelihood and prior: further simplifications
1. The probability of a word appearing depends only on its own POS tag, i.e. it is independent of the other words around it.
2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag.
3. The most probable tag sequence is then estimated by the bigram tagger.
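Written out (a reconstruction consistent with the two assumptions just listed; the original slide showed the formulas as an image), the simplifications factor the likelihood and the prior as:

```latex
P(W \mid T) \;\approx\; \prod_{i=1}^{n} P(w_i \mid t_i)
\qquad
P(T) \;\approx\; \prod_{i=1}^{n} P(t_i \mid t_{i-1})

\hat{T} \;=\; \operatorname*{argmax}_{T} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```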

• Likelihood and prior: further simplifications
2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag.
Bigrams are groups of two written letters, two syllables, or two words; they are a special case of N-grams.
Bigrams are used as the basis for simple statistical analysis of text.
The bigram assumption is related to the first-order Markov assumption.

• Likelihood and prior: further simplifications
3. Under the bigram assumption, the bigram tagger estimates the most probable tag sequence.

• Probability estimates
Tag transition probabilities p(ti|ti-1)
Determiners are likely to precede adjectives and nouns:
That/DT flight/NN
The/DT yellow/JJ hat/NN
So we expect P(NN|DT) and P(JJ|DT) to be high.

• Estimating probabilities
Tag transition probabilities p(ti|ti-1): compute P(NN|DT) by counting in a labeled corpus:
P(NN|DT) = (# of times DT is followed by NN) / (# of times DT occurs)

• Two kinds of probabilities
Word likelihood probabilities p(wi|ti): P(is|VBZ) = the probability of the tag VBZ (3sg pres verb) emitting the word is.

Compute P(is|VBZ) by counting in a labeled corpus:
P(is|VBZ) = (# of times is occurs tagged VBZ) / (# of times VBZ occurs)
Intuitively: if we were expecting a third person singular verb, how likely is it that this verb would be is?
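Both kinds of counts can be gathered in one pass over a hand-tagged corpus. The sketch below uses a tiny invented corpus purely for illustration:

```python
from collections import Counter

# A tiny hand-tagged corpus (invented, for illustration only):
# each sentence is a list of (word, tag) pairs.
corpus = [
    [("that", "DT"), ("flight", "NN")],
    [("the", "DT"), ("yellow", "JJ"), ("hat", "NN")],
    [("the", "DT"), ("flight", "NN")],
]

tag_count = Counter()       # C(t): how often each tag occurs (incl. start tag)
bigram_count = Counter()    # C(t_prev, t): how often t follows t_prev
emission_count = Counter()  # C(t, w): how often tag t emits word w

for sentence in corpus:
    prev = "<s>"            # pseudo-tag marking the sentence start
    tag_count[prev] += 1
    for word, tag in sentence:
        tag_count[tag] += 1
        bigram_count[(prev, tag)] += 1
        emission_count[(tag, word)] += 1
        prev = tag

def p_transition(t, t_prev):
    """Maximum-likelihood estimate of P(t | t_prev) = C(t_prev, t) / C(t_prev)."""
    return bigram_count[(t_prev, t)] / tag_count[t_prev]

def p_emission(w, t):
    """Maximum-likelihood estimate of P(w | t) = C(t, w) / C(t)."""
    return emission_count[(t, w)] / tag_count[t]

print(p_transition("NN", "DT"))   # DT followed by NN in 2 of 3 DT occurrences
print(p_emission("flight", "NN")) # NN emits "flight" in 2 of 3 NN tokens
```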

• An example: the verb race
Two possible tags:

Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR

People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN

How do we pick the right tag?

• Disambiguating race

• Disambiguating race
P(NN|TO) = .00047    P(VB|TO) = .83
The tag transition probabilities P(NN|TO) and P(VB|TO) answer the question: how likely are we to expect a verb/noun given the previous tag TO?

P(race|NN) = .00057    P(race|VB) = .00012
Lexical likelihoods from the Brown corpus for race, given POS tag NN or VB.

• Disambiguating race
P(NR|VB) = .0027    P(NR|NN) = .0012
Tag sequence probabilities for the likelihood of an adverb (NR) occurring given the previous tag, verb or noun.

P(VB|TO) × P(NR|VB) × P(race|VB) = .00000027
P(NN|TO) × P(NR|NN) × P(race|NN) = .00000000032
Multiplying the lexical likelihoods by the tag sequence probabilities, the verb wins.
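The arithmetic can be checked directly; the sketch below just multiplies the probabilities quoted above:

```python
# Probabilities quoted above (Brown corpus estimates from the slides):
p_vb_to, p_nn_to = 0.83, 0.00047          # P(VB|TO), P(NN|TO)
p_nr_vb, p_nr_nn = 0.0027, 0.0012         # P(NR|VB), P(NR|NN)
p_race_vb, p_race_nn = 0.00012, 0.00057   # P(race|VB), P(race|NN)

score_vb = p_vb_to * p_nr_vb * p_race_vb  # verb reading
score_nn = p_nn_to * p_nr_nn * p_race_nn  # noun reading

print(f"VB: {score_vb:.2e}")  # VB: 2.69e-07
print(f"NN: {score_nn:.2e}")  # NN: 3.21e-10
assert score_vb > score_nn    # the verb reading wins
```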

• Hidden Markov Models
What we've described with these two kinds of probabilities is a Hidden Markov Model (HMM). Let's spend a bit of time tying this into the model. In order to define an HMM, we will first introduce the Markov chain, or observable Markov model.

• Definitions
A weighted finite-state automaton adds probabilities to the arcs; the probabilities on the arcs leaving any state must sum to one.
A Markov chain is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through.
Markov chains can't represent inherently ambiguous problems; they are useful for assigning probabilities to unambiguous sequences.

• Hidden Markov Models: formal definition
States Q = q1, q2, …, qN; observations O = o1, o2, …, oN; each observation is a symbol from a vocabulary V = {v1, v2, …, vV}.

Transition probabilities (prior): a transition probability matrix A = {aij}.

Observation likelihoods (likelihood): an output probability matrix B = {bi(ot)}, a set of observation likelihoods, each expressing the probability of an observation ot being generated from a state i (the emission probabilities).

A special initial probability vector π: each πi expresses the probability P(qi|START) that the HMM will start in state i.
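The components above can be collected into a small container; the sketch below uses an invented two-state toy model purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class HMM:
    states: list  # Q = q1 .. qN (here: the POS tags)
    vocab: list   # V = v1 .. vV (the observable words)
    A: dict       # A[(qi, qj)] = P(qj | qi), transition probabilities
    B: dict       # B[(qi, v)]  = P(v | qi), observation (emission) likelihoods
    pi: dict      # pi[qi]      = P(qi | START), initial probabilities

# Invented toy model: a determiner is always followed by a noun.
hmm = HMM(
    states=["DT", "NN"],
    vocab=["the", "flight"],
    A={("DT", "NN"): 1.0, ("DT", "DT"): 0.0, ("NN", "NN"): 0.0, ("NN", "DT"): 0.0},
    B={("DT", "the"): 1.0, ("NN", "flight"): 1.0},
    pi={"DT": 1.0, "NN": 0.0},
)

# The initial vector and each row of A must sum to one.
assert sum(hmm.pi.values()) == 1.0
assert sum(p for (qi, _), p in hmm.A.items() if qi == "DT") == 1.0
```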

• AssumptionsMarkov assumption: the probability of a particular state depends only on the previous state

Output-independence assumption: the probability of an output observation depends only on the state that produced that observation

• HMM Taggers
Two kinds of probabilities:
A: transition probabilities (PRIOR)
B: observation likelihoods (LIKELIHOOD)
HMM taggers choose the tag sequence which maximizes the product of word likelihood and tag sequence probability.

• Weighted FSM corresponding to hidden states of HMM

• Observation likelihoods for the POS HMM

• Transition matrix for the POS HMM

• The output matrix for the POS HMM

• HMM Taggers
The probabilities are trained on hand-labeled training corpora (the training set).

Different N-gram levels can be combined. Taggers are evaluated by comparing their output on a test set to human labels for that test set (the Gold Standard).
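The gold-standard comparison boils down to per-token accuracy; a minimal sketch (the example tag sequences are invented):

```python
def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag matches the gold-standard tag."""
    assert len(predicted) == len(gold)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

gold = ["NNP", "VBZ", "VBN", "TO", "VB", "NR"]
pred = ["NNP", "VBZ", "VBN", "TO", "NN", "NR"]  # tagger mislabeled "race" as NN
print(accuracy(pred, gold))  # 5 of 6 tokens correct
```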

• The Viterbi Algorithm
What is the best tag sequence for "John likes to fish in the sea"?

efficiently computes the most likely state sequence given a particular output sequence

based on dynamic programming

• A smaller example
What is the best sequence of states for the input string bbba? Computing all possible paths and finding the one with the maximum probability is exponential.
[Figure: a small weighted automaton with start and end states and arcs labeled with probabilities for emitting a and b; the exact layout is not recoverable from this transcript.]

• Viterbi for POS tagging
Let:
n = number of words in the sentence to tag (number of input tokens)
T = number of tags in the tag set (number of states)
vit = path probability matrix (Viterbi); vit[i,j] = probability of being at state (tag) j at word i
state = matrix to recover the nodes of the best path (best tag sequence); state[i+1,j] = the state (tag) of the incoming arc that led to this most probable state j at word i+1

// Initialization
vit[1,PERIOD] := 1.0   // pretend that there is a period before
                       // our sentence (start tag = PERIOD)
vit[1,t] := 0.0 for t ≠ PERIOD

• Viterbi for POS tagging (cont.)
// Induction (build the path probability matrix)
for i := 1 to n step 1 do        // for all words in the sentence
    for all tags tj do           // for all possible tags
        // store the max prob of the path
        vit[i+1,tj] := max{1≤k≤T} ( vit[i,tk] × P(wi+1|tj) × P(tj|tk) )
        // store the actual state
        path[i+1,tj] := argmax{1≤k≤T} ( vit[i,tk] × P(wi+1|tj) × P(tj|tk) )
    end
end

// Termination and path read-out
bestState[n+1] := argmax{1≤j≤T} vit[n+1,j]
for j := n to 1 step -1 do       // for all the words in the sentence
    bestState[j] := path[j+1, bestState[j+1]]
end

P(bestState[1], …, bestState[n]) := max{1≤j≤T} vit[n+1,j]

In the induction step, P(wi+1|tj) is the emission probability, P(tj|tk) the state transition probability, and vit[i,tk] the probability of the best path leading to state tk at word i.
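As a runnable sketch of the same recursion (the function and the toy model below are my own; the probabilities reuse the race numbers from earlier):

```python
def viterbi(words, tags, p_trans, p_emit, start="<s>"):
    """p_trans[(t_prev, t)] = P(t|t_prev); p_emit[(t, w)] = P(w|t)."""
    # vit[i][t]: probability of the best tag path ending in tag t at word i
    vit = [{t: p_trans.get((start, t), 0) * p_emit.get((t, words[0]), 0)
            for t in tags}]
    back = [{}]  # back[i][t]: best previous tag leading into t at word i
    for i in range(1, len(words)):
        vit.append({})
        back.append({})
        for t in tags:
            best_prev = max(tags,
                            key=lambda tp: vit[i-1][tp] * p_trans.get((tp, t), 0))
            vit[i][t] = (vit[i-1][best_prev] * p_trans.get((best_prev, t), 0)
                         * p_emit.get((t, words[i]), 0))
            back[i][t] = best_prev
    # Termination and path read-out
    last = max(tags, key=lambda t: vit[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Toy model resolving "to race" as TO VB, using the slide's numbers:
tags = ["TO", "VB", "NN"]
p_trans = {("<s>", "TO"): 1.0, ("TO", "VB"): 0.83, ("TO", "NN"): 0.00047}
p_emit = {("TO", "to"): 1.0, ("VB", "race"): 0.00012, ("NN", "race"): 0.00057}
print(viterbi(["to", "race"], tags, p_trans, p_emit))  # ['TO', 'VB']
```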

• Possible improvements
In bigram POS tagging, we condition a tag only on the preceding tag.

Why not:
use more context (e.g. a trigram model)? It is more precise:
    is clearly marked --> verb, past participle
    he clearly marked --> verb, past tense
combine trigram, bigram, and unigram models
condition on words too

But with an n-gram approach, this is too costly (too many parameters to model).

• Further issues with Markov Model tagging
Unknown words are a problem, since we don't have the required probabilities. Possible solutions:
Assign the word probabilities based on the corpus-wide distribution of POS tags.
Use morphological cues (capitalization, suffix) to assign a more calculated guess.

Using higher-order Markov models: a trigram model captures more context; however, data sparseness is much more of a problem.
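One possible morphological back-off for unknown words could look like this (a hypothetical heuristic sketch with invented suffix rules, not a method specified in the slides):

```python
def guess_tag(word, sentence_initial=False):
    """Guess a POS tag for an unknown word from capitalization and suffix
    cues (hypothetical heuristic, illustration only)."""
    if word[0].isupper() and not sentence_initial:
        return "NNP"   # mid-sentence capitalization -> proper noun
    if word.endswith("ing"):
        return "VBG"   # gerund / present participle
    if word.endswith("ed"):
        return "VBD"   # past tense (or past participle)
    if word.endswith("ly"):
        return "RB"    # adverb
    if word.endswith("s"):
        return "NNS"   # plural noun
    return "NN"        # default: common noun

print(guess_tag("Secretariat"))  # NNP
print(guess_tag("galloping"))    # VBG
```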

November 2, 2008
