CS60057 Speech &Natural Language Processing

transcript

Lecture 1, 7/21/2005 Natural Language Processing 1

CS60057Speech &Natural Language

Processing

Autumn 2007

Lecture 10

16 August 2007

Hidden Markov Model (HMM) Tagging

Using an HMM to do POS tagging

HMM is a special case of Bayesian inference

It is also related to the “noisy channel” model in ASR (Automatic Speech Recognition)

Goal: maximize P(word|tag) x P(tag|previous n tags)

P(word|tag) word/lexical likelihood probability that given this tag, we have this word NOT probability that this word has this tag modeled through language model (word-tag matrix)

P(tag|previous n tags) tag sequence likelihood probability that this tag follows these previous tags modeled through language model (tag-tag matrix)

Hidden Markov Model (HMM) Taggers

Lexical information Syntagmatic information

POS tagging as a sequence classification task

We are given a sentence (an “observation” or “sequence of observations”) Secretariat is expected to race tomorrow sequence of n words w1…wn.

What is the best sequence of tags which corresponds to this sequence of observations?

Probabilistic/Bayesian view: Consider all possible sequences of tags Out of this universe of sequences, choose the tag sequence

which is most probable given the observation sequence of n words w1…wn.

Getting to HMM

Let T = t1,t2,…,tn

Let W = w1,w2,…,wn

Goal: Out of all sequences of tags t1…tn, get the the most probable sequence of POS tags T underlying the observed sequence of words w1,w2,…,wn

Hat ^ means “our estimate of the best = the most probable tag sequence” Argmaxx f(x) means “the x such that f(x) is maximized”

it maximazes our estimate of the best tag sequence

Getting to HMM

This equation is guaranteed to give us the best tag sequence

But how do we make it operational? How do we compute this value? Intuition of Bayesian classification:

Use Bayes rule to transform it into a set of other probabilities that are easier to compute

Thomas Bayes: British mathematician (1702-1761)

Bayes Rule

Breaks down any conditional probability P(x|y) into three other probabilities

P(x|y): The conditional probability of an event x assuming that y has occurred

Bayes Rule

We can drop the denominator: it does not change for each tag sequence; we are looking for the best tag sequence for the same observation, for the same fixed set of words

Bayes Rule

Likelihood and prior

Likelihood and prior Further Simplifications

1. the probability of a word appearing depends only on its own POS tag, i.e, independent of other words around it

2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag

3. The most probable tag sequence estimated by the bigram tagger

1. the probability of a word appearing depends only on its own POS tag, i.e, independent of other words around it

thekoalaputthekeysonthetable

WORDSTAGS

NVPDET

2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag

Bigrams are groups of two written letters, two syllables, or two words; they are a special case of N-gram.

Bigrams are used as the basis for simple statistical analysis of text

The bigram assumption is related to the first-order Markov assumption

3. The most probable tag sequence estimated by the bigram tagger

biagram assumption

---------------------------------------------------------------------------------------------------------------

Two kinds of probabilities (1)

Tag transition probabilities p(ti|ti-1) Determiners likely to precede adjs and nouns

That/DT flight/NN The/DT yellow/JJ hat/NN So we expect P(NN|DT) and P(JJ|DT) to be high But P(DT|JJ) to be:?

Tag transition probabilities p(ti|ti-1) Compute P(NN|DT) by counting in a labeled

corpus:

# of times DT is followed by NN

Word likelihood probabilities p(wi|ti) P(is|VBZ) = probability of VBZ (3sg Pres verb) being “is”

Compute P(is|VBZ) by counting in a labeled corpus:

If we were expecting a third person singular verb, how likely is it that

this verb would be is?

An Example: the verb “race”

Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR

People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN

How do we pick the right tag?

Disambiguating “race”

P(NN|TO) = .00047P(VB|TO) = .83The tag transition probabilities P(NN|TO) and P(VB|TO) answer the question: ‘How likely are we to expect verb/noun given the previous tag TO?’

P(race|NN) = .00057P(race|VB) = .00012Lexical likelihoods from the Brown corpus for ‘race’ given a POS tag NN or VB.

P(NR|VB) = .0027P(NR|NN) = .0012tag sequence probability for the likelihood of an adverb occurring given the previous tag verb or noun

Hidden Markov Models

What we’ve described with these two kinds of probabilities is a Hidden Markov Model (HMM)

Let’s just spend a bit of time tying this into the model In order to define HMM, we will first introduce the Markov

Chain, or observable Markov Model.

Definitions

A weighted finite-state automaton adds probabilities to the arcs The sum of the probabilities leaving any arc must sum

to one A Markov chain is a special case of a WFST in which the

input sequence uniquely determines which states the automaton will go through

Markov chains can’t represent inherently ambiguous problems Useful for assigning probabilities to unambiguous

sequences

Markov chain = “First-order observed Markov Model” a set of states

Q = q1, q2…qN; the state at time t is qt a set of transition probabilities:

a set of probabilities A = a01a02…an1…ann. Each aij represents the probability of transitioning from state i to state j The set of these is the transition probability matrix A

Distinguished start and end states

Special initial probability vector

i the probability that the MM will start in state i, each i expresses the probability p(qi|START)

aij P(qt j | qt 1 i) 1i, j N

aij 1; 1i Nj1

Markov chain = “First-order observed Markov Model”

Markov Chain for weather: Example 1 three types of weather: sunny, rainy, foggy we want to find the following conditional probabilities:

P(qn|qn-1, qn-2, …, q1)

- I.e., the probability of the unknown weather on day n, depending on the (known) weather of the preceding days

- We could infer this probability from the relative frequency (the statistics) of past observations of weather sequences

Problem: the larger n is, the more observations we must collect.

Suppose that n=6, then we have to collect statistics for 3(6-1) =

243 past histories

Markov chain = “First-order observed Markov Model” Therefore, we make a simplifying assumption, called the (first-order) Markov

assumption

for a sequence of observations q1, … qn,

current state only depends on previous state

the joint probability of certain past and current observations

Markov chain = “First-order observable Markov Model”

Markov chain = “First-order observed Markov Model”

Given that today the weather is sunny, what's the probability that tomorrow is sunny and the day after is rainy?

Using the Markov assumption and the probabilities in table 1, this translates into:

Markov chain for weather

What is the probability of 4 consecutive rainy days? Sequence is rainy-rainy-rainy-rainy I.e., state sequence is 3-3-3-3 P(3,3,3,3) =

1a11a11a11a11 = 0.2 x (0.6)3 = 0.0432

Hidden Markov Model

For Markov chains, the output symbols are the same as the states. See sunny weather: we’re in state sunny

But in part-of-speech tagging (and other things) The output symbols are words But the hidden states are part-of-speech tags

So we need an extension! A Hidden Markov Model is an extension of a Markov

chain in which the output symbols are not the same as the states.

This means we don’t know which state we are in.

Markov chain for words

Observed events: words

Hidden events: tags

States Q = q1, q2…qN; Observations O = o1, o2…oN;

Each observation is a symbol from a vocabulary V = {v1,v2,…vV}

Transition probabilities (prior)

Transition probability matrix A = {aij}

Observation likelihoods (likelihood)

Output probability matrix B={bi(ot)}a set of observation likelihoods, each expressing the probability of an

observation ot being generated from a state i, emission probabilities

Special initial probability vector

i the probability that the HMM will start in state i, each i expresses the probability

p(qi|START)

Hidden Markov Models

Assumptions

Markov assumption: the probability of a particular state depends only on the previous state

Output-independence assumption: the probability of an output observation depends only on the state that produced that observation

P(qi | q1...qi 1) P(qi | qi 1)

HMM for Ice Cream

You are a climatologist in the year 2799 Studying global warming You can’t find any records of the weather in Boston, MA

for summer of 2007 But you find Jason Eisner’s diary Which lists how many ice-creams Jason ate every date

that summer Our job: figure out how hot it was

Noam task

Given Ice Cream Observation Sequence: 1,2,3,2,2,2,3…

(cp. with output symbols) Produce:

Weather Sequence: C,C,H,C,C,C,H …

(cp. with hidden states, causing states)

HMM for ice cream

Different types of HMM structure

Bakis = left-to-right Ergodic = fully-connected

HMM Taggers

Two kinds of probabilities A transition probabilities (PRIOR) B observation likelihoods (LIKELIHOOD)

HMM Taggers choose the tag sequence which maximizes the product of word likelihood and tag sequence probability

Weighted FSM corresponding to hidden states of HMM, showing A probs

B observation likelihoods for POS HMM

The A matrix for the POS HMM

The B matrix for the POS HMM

HMM Taggers

The probabilities are trained on hand-labeled training corpora (training set)

Combine different N-gram levels Evaluated by comparing their output from a test set to

human labels for that test set (Gold Standard)

The Viterbi Algorithm best tag sequence for "John likes to fish in the sea"? efficiently computes the most likely state sequence given a

particular output sequence based on dynamic programming

A smaller example0.6

q rstart end

What is the best sequence of states for the input string “bbba”?

Computing all possible paths and finding the one with the max probability is exponential

0.4 0.80.2

0.3 0.5

A smaller example (con’t)

For each state, store the most likely sequence that could lead to it (and its probability) Path probability matrix:

An array of states versus time (tags versus words) That stores the prob. of being at each state at each time in terms of the prob. for being

in each state at the preceding time.Best sequence Input sequence / time

ε --> b b --> b bb --> b bbb --> a

leading

coming

from qε --> q 0.6

(1.0x0.6)

q --> q 0.108

(0.6x0.3x0.6)

qq --> q 0.01944 (0.108x0.3x0.6)

qrq --> q 0.018144

(0.1008x0.3x0.4)

coming

from rr --> q 0

(0x0.5x0.6)

qr --> q 0.1008

(0.336x0.5x 0.6)

qrr --> q 0.02688 (0.1344x0.5x0.4)

leading

coming

from qε --> r 0

(0x0.8)

q --> r 0.336

(0.6x0.7x0.8)

qq --> r 0.0648 (0.108x0.7x0.8)

qrq --> r 0.014112

(0.1008x0.7x0.2)

coming

from rr --> r 0 (0x0.5x0.8)

qr --> r 0.1344 (0.336x0.5x0.8)

qrr --> r 0.01344

(0.1344x0.5x0.2)

Viterbi intuition: we are looking for the best ‘path’

promised to back the bill

S1 S2 S4S3 S5

promised to back the bill

Slide from Dekang Lin

The Viterbi Algorithm

Intuition

The value in each cell is computed by taking the MAX over all paths that lead to this cell.

An extension of a path from state i at time t-1 is computed by multiplying: Previous path probability from previous cell viterbi[t-

1,i] Transition probability aij from previous state I to

current state j Observation likelihood bj(ot) that current state j

matches observation symbol t

Viterbi example

Smoothing of probabilities

Data sparseness is a problem when estimating probabilities based on corpus data. The “add one” smoothing technique –

wCwP n

C- absolute frequencyN: no of training instancesB: no of different types

Linear interpolation methods can compensate for data sparseness with higher order models. A common method is interpolating trigrams, bigrams and unigrams:

iiiiiiii ttPttPtPttP

)|()|()(| 2,133122111,1

The lambda values are automatically determined using a variant of the Expectation Maximization algorithm.

in bigram POS tagging, we condition a tag only on the preceding tag

why not... use more context (ex. use trigram model)

more precise: “is clearly marked” --> verb, past participle “he clearly marked” --> verb, past tense

combine trigram, bigram, unigram models condition on words too

but with an n-gram approach, this is too costly (too many parameters to model)

Possible improvements

Further issues with Markov Model tagging

Unknown words are a problem since we don’t have the required probabilities. Possible solutions: Assign the word probabilities based on corpus-wide distribution

of POS Use morphological cues (capitalization, suffix) to assign a more

calculated guess. Using higher order Markov models:

Using a trigram model captures more context However, data sparseness is much more of a problem.

Efficient statistical POS tagger developed by Thorsten Brants, ANLP-2000 Underlying model:

Trigram modelling – The probability of a POS only depends on its two preceding POS The probability of a word appearing at a particular position given that its

POS occurs at that position is independent of everything else.

iTTiiiii

ttttPtwPtttP

T 1121 )|()|(),|(maxarg

Training

Maximum likelihood estimates:

),()|(:

),,(),|(:

),()|( : Bigrams

: Unigrams

321213

twctwPLexical

tttctttPTrigrams

ttcttP

)c(t)(tP

Smoothing : context-independent variant of linear interpolation.

),|()|()(),|( 213

1213 tttPttPtPtttP

Smoothing algorithm

Set λi=0

For each trigram t1 t2 t3 with f(t1,t2,t3 )>0 Depending on the max of the following three values:

Case (f(t1,t2,t3 )-1)/ f(t1,t2) : incr λ3 by f(t1,t2,t3 )

Case (f(t2,t3 )-1)/ f(t2) : incr λ2 by f(t1,t2,t3 )

Case (f(t3 )-1)/ N-1 : incr λ1 by f(t1,t2,t3 )

Normalize λi

Evaluation of POS taggers

compared with gold-standard of human performance metric:

accuracy = % of tags that are identical to gold standard most taggers ~96-97% accuracy must compare accuracy to:

ceiling (best possible results) how do human annotators score compared to each other? (96-

97%) so systems are not bad at all!

baseline (worst possible results) what if we take the most-likely tag (unigram model) regardless of

previous tags ? (90-91%) so anything less is really bad

CS60057 Speech &Natural Language Processing

Documents