Page 1: Part of Speech Tagging & Hidden Markov Models (Part 1)

Part of Speech Tagging

& Hidden Markov Models (Part 1)

Mitch Marcus

CIS 421/521

Page 2: Part of Speech Tagging & Hidden Markov Models (Part 1)

CIS 421/521 - Intro to AI 2

NLP Task I – Determining Part of Speech Tags

• Given a text, assign each token its correct part of speech

(POS) tag, given its context and a list of possible POS tags

for each word type

Word POS listing in Brown Corpus

heat noun verb

oil noun

in prep noun adv

a det noun noun-proper

large adj noun adv

pot noun

Page 3: Part of Speech Tagging & Hidden Markov Models (Part 1)

CIS 421/521 - Intro to AI 3

What is POS tagging good for?

• Speech synthesis:

• How to pronounce “lead”?

• INsult vs. inSULT

• OBject vs. obJECT

• OVERflow vs. overFLOW

• DIScount vs. disCOUNT

• CONtent vs. conTENT

• Machine Translation

• translations of nouns and verbs are different

• Stemming for search

• Knowing a word is a V tells you it gets past tense, participles, etc.

• Can search for “walk”, and get “walked”, “walking”, …

Page 4: Part of Speech Tagging & Hidden Markov Models (Part 1)

Equivalent Problem in Bioinformatics

• From a sequence of amino

acids (primary structure):

ATCPLELLLD

• Infer secondary structure

(features of the 3D

structure, like helices,

sheets, etc.):

HHHBBBBBC..

CIS 421/521 - Intro to AI 4

Figure from: http://www.particlesciences.com/news/technical-briefs/2009/protein-structure.html

Page 5: Part of Speech Tagging & Hidden Markov Models (Part 1)

CIS 421/521 - Intro to AI 5

Penn Treebank Tagset I

Tag Description Example

CC coordinating conjunction and

CD cardinal number 1, third

DT determiner the

EX existential there there is

FW foreign word d'oeuvre

IN preposition/subordinating conjunction in, of, like

JJ adjective green

JJR adjective, comparative greener

JJS adjective, superlative greenest

LS list marker 1)

MD modal could, will

NN noun, singular or mass table

NNS noun plural tables (supports)

NNP proper noun, singular John

NNPS proper noun, plural Vikings

Page 6: Part of Speech Tagging & Hidden Markov Models (Part 1)

CIS 421/521 - Intro to AI 6

Tag Description Example

PDT predeterminer both the boys

POS possessive ending friend 's

PRP personal pronoun I, me, him, he, it

PRP$ possessive pronoun my, his

RB adverb however, usually, here, good

RBR adverb, comparative better

RBS adverb, superlative best

RP particle give up

TO to to go, to him

UH interjection uhhuhhuhh

Penn Treebank Tagset II

Page 7: Part of Speech Tagging & Hidden Markov Models (Part 1)

CIS 421/521 - Intro to AI 7

Tag Description Example

VB verb, base form take (support)

VBD verb, past tense took

VBG verb, gerund/present participle taking

VBN verb, past participle taken

VBP verb, sing. present, non-3rd take

VBZ verb, 3rd person sing. present takes (supports)

WDT wh-determiner which

WP wh-pronoun who, what

WP$ possessive wh-pronoun whose

WRB wh-adverb where, when

Penn Treebank Tagset III

Page 8: Part of Speech Tagging & Hidden Markov Models (Part 1)

CIS 421/521 - Intro to AI 8

NLP Task I – Determining Part of Speech Tags

• The Old Solution: Depth-first search.

• If each of n word tokens has k tags on average,

try the k^n combinations until one works.

• Machine Learning Solutions: Automatically learn

Part of Speech (POS) assignment.

• The best techniques achieve 97+% accuracy per word on

new materials, given a POS-tagged training corpus of 10^6

tokens with 3% error, on a set of ~40 POS tags (the tags on the

last three slides)

Page 9: Part of Speech Tagging & Hidden Markov Models (Part 1)

CIS 421/521 - Intro to AI 9

Simple Statistical Approaches: Idea 1

Page 10: Part of Speech Tagging & Hidden Markov Models (Part 1)

CIS 421/521 - Intro to AI 10

Simple Statistical Approaches: Idea 2

For a string of words

W = w1w2w3…wn

find the string of POS tags

T = t1 t2 t3 …tn

which maximizes P(T|W)

• i.e., the most likely POS tag t_i for each word w_i, given its surrounding context

Page 11: Part of Speech Tagging & Hidden Markov Models (Part 1)

CIS 421/521 - Intro to AI 11

The Sparse Data Problem …

A Simple, Impossible Approach to Compute P(T|W):

Count up instances of the string "heat oil in a large

pot" in the training corpus, and pick the most

common tag assignment to the string.

Page 12: Part of Speech Tagging & Hidden Markov Models (Part 1)

One more time: A BOTEC (back-of-the-envelope) Estimate of What Works

CIS 421/521 - Intro to AI 12

What parameters can we estimate with a million words of hand tagged training data?

• Assume a uniform distribution of 5000 words and 40 part of speech tags.

We can get reasonable estimates of

• Tag bigrams

• Word x tag pairs
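
As a concrete illustration, here is a minimal sketch of collecting those two kinds of counts from a hand-tagged corpus; the tiny corpus and variable names are made up for illustration only:

```python
from collections import Counter

# Toy hand-tagged corpus: a list of tagged sentences, each a list of (word, tag) pairs.
# (Hypothetical data; a real training corpus would have ~10^6 tokens.)
corpus = [
    [("heat", "VB"), ("oil", "NN"), ("in", "IN"), ("a", "DT"),
     ("large", "JJ"), ("pot", "NN")],
    [("the", "DT"), ("oil", "NN"), ("is", "VBZ"), ("hot", "JJ")],
]

tag_unigrams = Counter()   # c(t)
tag_bigrams = Counter()    # c(t_{i-1}, t_i)
word_tag = Counter()       # c(w, t)

for sent in corpus:
    prev = "<s>"                          # sentence-start pseudo-tag
    for word, tag in sent:
        tag_unigrams[tag] += 1
        tag_bigrams[(prev, tag)] += 1
        word_tag[(word.lower(), tag)] += 1
        prev = tag

print(tag_bigrams[("DT", "JJ")])          # how often JJ follows DT -> 1
print(word_tag[("oil", "NN")])            # how often "oil" is tagged NN -> 2
```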

Page 13: Part of Speech Tagging & Hidden Markov Models (Part 1)

Bayes Rule plus Markov Assumptions

yields a practical POS tagger!

I. By Bayes Rule:

   P(T|W) = P(W|T) * P(T) / P(W)

II. So we want to find:

   argmax_T P(T|W) = argmax_T P(W|T) * P(T)

III. To compute P(W|T):

• use the chain rule + a Markov assumption

• Estimation requires word x tag and tag counts

IV. To compute P(T):

• use the chain rule + a slightly different Markov assumption

• Estimation requires tag unigram and bigram counts

CIS 421/521 - Intro to AI 13

Page 14: Part of Speech Tagging & Hidden Markov Models (Part 1)

IV. To compute P(T):

Just like computing P(W) last lecture

I. By the chain rule,

   P(T) = P(t_1) * P(t_2|t_1) * P(t_3|t_1 t_2) * ... * P(t_n|t_1 ... t_{n-1})

II. Applying the 1st-order Markov Assumption:

   P(T) ≈ P(t_1) * P(t_2|t_1) * P(t_3|t_2) * ... * P(t_n|t_{n-1})

Estimated using tag bigrams/tag unigrams!

CIS 421/521 - Intro to AI 14
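
A minimal sketch of this bigram computation of P(T), assuming a dictionary of tag-bigram probabilities (the numbers are made-up placeholders):

```python
# P(t_i | t_{i-1}) stored as a dict; "<s>" marks the start of the sentence.
# (Made-up numbers, purely for illustration.)
tag_bigram_p = {
    ("<s>", "VB"): 0.1,
    ("VB", "NN"): 0.3,
    ("NN", "IN"): 0.2,
}

def tag_sequence_prob(tags, bigram_p):
    """P(T) under the first-order Markov assumption."""
    p = 1.0
    prev = "<s>"
    for t in tags:
        p *= bigram_p.get((prev, t), 0.0)
        prev = t
    return p

print(tag_sequence_prob(["VB", "NN", "IN"], tag_bigram_p))  # 0.1 * 0.3 * 0.2
```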

Page 15: Part of Speech Tagging & Hidden Markov Models (Part 1)

CIS 421/521 - Intro to AI 15

III. To compute P(W|T):

I. Assume that the words w_i are conditionally independent

given the tag sequence T = t_1 t_2 … t_n:

   P(W|T) = ∏_{i=1..n} P(w_i | T)

II. Applying a zeroth-order Markov Assumption:

   P(w_i | T) ≈ P(w_i | t_i)

by which

   P(W|T) ≈ ∏_{i=1..n} P(w_i | t_i)

So, for a given string W = w_1 w_2 w_3 … w_n, the tagger needs to find

the string of tags T which maximizes P(T) * P(W|T).
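
Putting the two factors together, here is a minimal sketch of the quantity the tagger scores for one candidate tag sequence, P(T) * P(W|T), again with made-up probability tables:

```python
# Made-up probability tables, purely for illustration.
tag_bigram_p = {("<s>", "VB"): 0.1, ("VB", "NN"): 0.3, ("NN", "IN"): 0.2}
emission_p   = {("heat", "VB"): 0.002, ("oil", "NN"): 0.001, ("in", "IN"): 0.4}

def score(words, tags):
    """P(T) * P(W|T): tag bigrams times per-word emission probabilities."""
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= tag_bigram_p.get((prev, t), 0.0)   # P(t_i | t_{i-1})
        p *= emission_p.get((w, t), 0.0)        # P(w_i | t_i)
        prev = t
    return p

print(score(["heat", "oil", "in"], ["VB", "NN", "IN"]))  # 0.1*0.002 * 0.3*0.001 * 0.2*0.4
```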

Page 16: Part of Speech Tagging & Hidden Markov Models (Part 1)

CIS 421/521 - Intro to AI 16

Hidden Markov Models

This model is an instance of a Hidden Markov

Model. Viewed graphically:

[Figure: a four-state HMM over the tags Det, Adj, Noun, Verb. Arcs between the states carry transition probabilities, and each state has an emission table, e.g. P(w|Det): a .4, the .4; P(w|Adj): good .02, low .04; P(w|Noun): price .001, deal .0001]

Page 17: Part of Speech Tagging & Hidden Markov Models (Part 1)

CIS 421/521 - Intro to AI 17

Viewed as a generator, an HMM:

[Figure: the same Det/Adj/Noun/Verb HMM redrawn as a generator: each state emits a word according to its P(w|state) table, then follows a transition arc to the next state]
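
A minimal sketch of that generative view in Python; the emission entries come from the figure, but the transition numbers and the stopping convention are made-up placeholders:

```python
import random

# Emission tables from the figure, padded with an "<other>" mass so each row sums to 1.
emit = {
    "Det":  {"a": 0.4, "the": 0.4, "<other>": 0.2},
    "Adj":  {"good": 0.02, "low": 0.04, "<other>": 0.94},
    "Noun": {"price": 0.001, "deal": 0.0001, "<other>": 0.9989},
}
# Transition probabilities are made-up placeholders for illustration.
trans = {
    "Det":  {"Adj": 0.3, "Noun": 0.7},
    "Adj":  {"Adj": 0.1, "Noun": 0.9},
    "Noun": {},                      # toy convention: stop after emitting a noun
}

def generate(start="Det"):
    """Random walk: emit a word from the current state, then follow a transition arc."""
    state, words = start, []
    while state is not None:
        word_dist = emit[state]
        words.append(random.choices(list(word_dist), weights=word_dist.values())[0])
        nxt = trans.get(state, {})
        state = random.choices(list(nxt), weights=nxt.values())[0] if nxt else None
    return words

print(generate())   # e.g. ['the', 'good', 'price']
```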

Page 18: Part of Speech Tagging & Hidden Markov Models (Part 1)

Summary: Recognition using an HMM

I. By Bayes Rule:

   P(T|W) = P(T) * P(W|T) / P(W)

II. We select the Tag sequence T that maximizes P(T|W):

   argmax_T P(T|W)

   = argmax_T P(T) * P(W|T)

   = argmax over t_1 … t_n of ∏_{i=1..n} P(t_i | t_{i-1}) * P(w_i | t_i)

   = argmax over t_1 … t_n of ∏_{i=1..n} a(t_{i-1}, t_i) * b(t_i, w_i)

CIS 421/521 - Intro to AI 18
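
A minimal brute-force sketch of this argmax, enumerating every candidate tag sequence (exponential in the sentence length, but fine for a toy example; all tables are made-up placeholders):

```python
from itertools import product

TAGS = ["DT", "JJ", "NN", "VB"]
# Made-up a(t_{i-1}, t_i) and b(t_i, w_i) tables; "<s>" is the start symbol.
a = {("<s>", "VB"): 0.2, ("<s>", "DT"): 0.5, ("DT", "NN"): 0.6, ("DT", "JJ"): 0.3,
     ("JJ", "NN"): 0.7, ("VB", "NN"): 0.4, ("NN", "NN"): 0.1}
b = {("DT", "the"): 0.4, ("JJ", "low"): 0.04, ("NN", "price"): 0.001, ("VB", "heat"): 0.002}

def brute_force_tag(words):
    """argmax over all tag sequences of prod_i a(t_{i-1}, t_i) * b(t_i, w_i)."""
    best, best_p = None, 0.0
    for tags in product(TAGS, repeat=len(words)):     # |TAGS|^n candidates
        p, prev = 1.0, "<s>"
        for w, t in zip(words, tags):
            p *= a.get((prev, t), 0.0) * b.get((t, w), 0.0)
            prev = t
        if p > best_p:
            best, best_p = tags, p
    return best, best_p

print(brute_force_tag(["the", "low", "price"]))   # expect ('DT', 'JJ', 'NN')
```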

Page 19: Part of Speech Tagging & Hidden Markov Models (Part 1)

CIS 421/521 - Intro to AI 19

Training and Performance

• To estimate the parameters of this model, given an annotated training corpus, use the MLE (maximum-likelihood estimate): P(t_i | t_{i-1}) ≈ Count(t_{i-1}, t_i) / Count(t_{i-1}) and P(w_i | t_i) ≈ Count(w_i, t_i) / Count(t_i)

• Because many of these counts are small, smoothing is necessary for best results (see the add-one sketch after these bullets)…

• Such taggers typically achieve about 95-96% correct tagging, for the standard 40-tag POS set.

• A few tricks for unknown words increase accuracy to 97%.
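
As one concrete (and deliberately simple) possibility, here is an add-one (Laplace) smoothing sketch for the word-given-tag estimate; real taggers use better smoothing schemes, and the counts below are hypothetical:

```python
def smoothed_word_given_tag(word, tag, word_tag_counts, tag_counts, vocab_size):
    """Add-one (Laplace) estimate of P(word | tag); a stand-in for better smoothing."""
    return (word_tag_counts.get((word, tag), 0) + 1) / (tag_counts.get(tag, 0) + vocab_size)

# Hypothetical counts from a tagged corpus.
word_tag_counts = {("oil", "NN"): 30, ("heat", "VB"): 4}
tag_counts = {"NN": 200_000, "VB": 50_000}

print(smoothed_word_given_tag("oil", "NN", word_tag_counts, tag_counts, vocab_size=5000))
print(smoothed_word_given_tag("blorf", "NN", word_tag_counts, tag_counts, vocab_size=5000))  # unseen word
```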

Page 20: Part of Speech Tagging & Hidden Markov Models (Part 1)

POS from bigram and word-tag pairs??

CIS 421/521 - Intro to AI 20

A Practical compromise

• Rich models often require vast amounts of data

• Well-estimated bad models often outperform badly estimated truer models

(Mutt & Jeff 1942)

Page 21: Part of Speech Tagging & Hidden Markov Models (Part 1)

CIS 421/521 - Intro to AI 21

Practical Tagging using HMMs

• Finding this maximum can be done by an exponential-time

search through all possible tag strings T.

• However, there is a solution that is linear in the sentence length,

using dynamic programming: Viterbi decoding.
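
A minimal sketch of Viterbi decoding for this tagging model, using dictionary-style a(·,·) and b(·,·) tables like the earlier brute-force sketch (names and numbers are illustrative only):

```python
def viterbi(words, tags, a, b, start="<s>"):
    """Dynamic-programming argmax over tag sequences; O(n * |tags|^2) instead of |tags|^n."""
    # delta[t] = best probability of any tag sequence ending in t; back[i][t] = its predecessor.
    delta = {t: a.get((start, t), 0.0) * b.get((t, words[0]), 0.0) for t in tags}
    back = []
    for w in words[1:]:
        new_delta, pointers = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: delta[p] * a.get((p, t), 0.0))
            new_delta[t] = delta[best_prev] * a.get((best_prev, t), 0.0) * b.get((t, w), 0.0)
            pointers[t] = best_prev
        delta, back = new_delta, back + [pointers]
    # Trace back from the best final tag.
    last = max(tags, key=lambda t: delta[t])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path)), delta[last]

# Illustrative tables (same shape as the brute-force sketch earlier).
a = {("<s>", "DT"): 0.5, ("DT", "JJ"): 0.3, ("DT", "NN"): 0.6, ("JJ", "NN"): 0.7}
b = {("DT", "the"): 0.4, ("JJ", "low"): 0.04, ("NN", "price"): 0.001}
print(viterbi(["the", "low", "price"], ["DT", "JJ", "NN"], a, b))  # (['DT', 'JJ', 'NN'], ...)
```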

Page 22: Part of Speech Tagging & Hidden Markov Models (Part 1)

The three basic HMM problems

Page 23: Part of Speech Tagging & Hidden Markov Models (Part 1)

CIS 421/521 - Intro to AI 23

Parameters of an HMM

• States: A set of states S=s1, … sn

• Transition probabilities: A= a1,1, a1,2, …, an,n

Each ai,j represents the probability of

transitioning from state si to sj.

• Emission probabilities: a set B of functions of

the form bi(ot) which is the probability of

observation ot being emitted by si

• Initial state distribution: π_i is the probability that

s_i is a start state

(This and later slides follow classic formulation by Ferguson, as

published by Rabiner and Juang, as adapted by Manning and Schutze.

Note the change in notation!!)
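
A minimal sketch of these parameters as NumPy arrays in this notation (the states, vocabulary, and numbers are arbitrary examples, not values from the slides):

```python
import numpy as np

states = ["Det", "Adj", "Noun"]                        # S = s_1 ... s_N
vocab  = ["a", "the", "good", "low", "price", "deal"]  # possible observations

# A[i, j] = probability of transitioning from state s_i to s_j (rows sum to 1).
A = np.array([[0.0, 0.3, 0.7],
              [0.0, 0.1, 0.9],
              [0.2, 0.1, 0.7]])

# B[i, k] = b_i(o_k), probability that state s_i emits observation o_k (rows sum to 1).
B = np.array([[0.40, 0.40, 0.05, 0.05, 0.05, 0.05],
              [0.05, 0.05, 0.40, 0.40, 0.05, 0.05],
              [0.10, 0.10, 0.10, 0.10, 0.30, 0.30]])

# pi[i] = probability that s_i is a start state.
pi = np.array([0.8, 0.1, 0.1])

assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1) and np.isclose(pi.sum(), 1)
```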

Page 24: Part of Speech Tagging & Hidden Markov Models (Part 1)

CIS 421/521 - Intro to AI 24

The Three Basic HMM Problems

• Problem 1 (Evaluation): Given the observation

sequence O = o_1, …, o_T and an HMM model

λ = (A, B, π), how do we compute the

probability of O given the model?

• Problem 2 (Decoding): Given the observation

sequence O and an HMM model λ = (A, B, π), how do we

find the state sequence that best explains the

observations?

• Problem 3 (Learning): How do we adjust the

model parameters λ = (A, B, π) to maximize

P(O | λ)?

Page 25: Part of Speech Tagging & Hidden Markov Models (Part 1)

CIS 421/521 - Intro to AI 25

Problem 1: Probability of an Observation Sequence

• Q: What is P(O | λ)?

• A: the sum of the probabilities of all possible

state sequences in the HMM.

• Naïve computation is very expensive. Given T

observations and N states, there are N^T possible

state sequences.

• (for T=10 and N=10, 10 billion different paths!!)

• Solution: linear-time dynamic programming!

Page 26: Part of Speech Tagging & Hidden Markov Models (Part 1)

CIS 421/521 - Intro to AI 26

The Crucial Data Structure: The Trellis

Page 27: Part of Speech Tagging & Hidden Markov Models (Part 1)

CIS 421/521 - Intro to AI 27

Forward Probabilities:

• For a given HMM λ and some time t, what is the probability

that the partial observation o_1 … o_t has been generated and that

the state at time t is s_i?

   α_t(i) = P(o_1 … o_t, q_t = s_i | λ)

• The Forward algorithm computes α_t(i) for 1 ≤ i ≤ N and 1 ≤ t ≤ T

in time O(N²T) using the trellis

Page 28: Part of Speech Tagging & Hidden Markov Models (Part 1)

CIS 421/521 - Intro to AI 28

Forward Algorithm: Induction step

   α_t(j) = [ Σ_{i=1..N} α_{t-1}(i) * a_ij ] * b_j(o_t)

   where α_t(i) = P(o_1 … o_t, q_t = s_i | λ)

Page 29: Part of Speech Tagging & Hidden Markov Models (Part 1)

CIS 421/521 - Intro to AI 29

Forward Algorithm

• Initialization (the probability that o_1 has been

generated and that the state is s_i at time t = 1):

   α_1(i) = π_i * b_i(o_1),   1 ≤ i ≤ N

• Induction:

   α_t(j) = [ Σ_{i=1..N} α_{t-1}(i) * a_ij ] * b_j(o_t),   2 ≤ t ≤ T, 1 ≤ j ≤ N

• Termination:

   P(O | λ) = Σ_{i=1..N} α_T(i)
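
A minimal NumPy sketch of these three steps, using the same (A, B, π) array layout as the parameter sketch earlier; observations are column indices into B, and the tiny example numbers are arbitrary:

```python
import numpy as np

def forward(obs, A, B, pi):
    """Forward algorithm: returns P(O | lambda), filling the N x T trellis of alphas."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((N, T))
    alpha[:, 0] = pi * B[:, obs[0]]                 # initialization: alpha_1(i) = pi_i * b_i(o_1)
    for t in range(1, T):                           # induction over the trellis columns
        alpha[:, t] = (alpha[:, t - 1] @ A) * B[:, obs[t]]
    return alpha[:, -1].sum()                       # termination: sum_i alpha_T(i)

# Tiny 2-state, 2-symbol example (arbitrary numbers).
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
print(forward([0, 1, 0], A, B, pi))
```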

Page 30: Part of Speech Tagging & Hidden Markov Models (Part 1)

CIS 421/521 - Intro to AI 30

Forward Algorithm Complexity

• Naïve approach requires exponential time to

evaluate all N^T state sequences

• Forward algorithm using dynamic programming

takes O(N²T) computations

