CS460/626 : Natural Language Processing/Speech, NLP and the Web
(Lecture 8 - POS tagset)
Pushpak Bhattacharyya
CSE Dept., IIT Bombay
17th Jan, 2012
[Figure: NLP Trinity. Three axes: Problem (Morph Analysis, Part of Speech Tagging, Parsing, Semantics), Language (Hindi, Marathi, English, French), Algorithm (CRF, HMM, MEMM).]
HMM: Three Problems
• Problem 1: Likelihood of a sequence
  • Forward Procedure
  • Backward Procedure
• Problem 2: Best state sequence
  • Viterbi Algorithm
• Problem 3: Re-estimation
  • Baum-Welch (Forward-Backward Algorithm)
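Problem 2 (best state sequence) can be illustrated with a minimal Viterbi sketch. Everything in it is an assumption made for illustration: the function name `viterbi`, the three-tag set, and all probability values are hypothetical toy numbers, not figures from the lecture. The start distribution `start_p` plays the role of the transition out of the sentence-start marker, and the end marker is left out for brevity.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return (best_probability, best_tag_sequence) for `words`."""
    # best[i][t] = probability of the best tag path that ends in tag t at word i
    best = [{t: start_p.get(t, 0.0) * emit_p[t].get(words[0], 0.0) for t in tags}]
    back = [{}]  # back[i][t] = predecessor tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # Choose the predecessor tag that maximises the path probability.
            prev, score = max(
                ((p, best[i - 1][p] * trans_p[p].get(t, 0.0)) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score * emit_p[t].get(words[i], 0.0)
            back[i][t] = prev
    # Trace back from the most probable final tag.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return best[-1][last], path


# Hypothetical toy model (all numbers invented for illustration).
tags = ["N", "V", "JJ"]
start_p = {"N": 0.6, "V": 0.2, "JJ": 0.2}
trans_p = {
    "N": {"N": 0.2, "V": 0.6, "JJ": 0.2},
    "V": {"N": 0.3, "V": 0.1, "JJ": 0.6},
    "JJ": {"N": 0.7, "V": 0.1, "JJ": 0.2},
}
emit_p = {
    "N": {"people": 0.5, "jump": 0.3, "high": 0.2},
    "V": {"people": 0.2, "jump": 0.8},
    "JJ": {"high": 1.0},
}

best_prob, best_path = viterbi(["people", "jump", "high"], tags, start_p, trans_p, emit_p)
print(best_path, best_prob)
```

The table `best` is filled one word at a time, so the work grows linearly with sentence length instead of with the number of possible tag sequences.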
POS tagging
Tagged Corpora
• ^_^ “_“ The_DT guys_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, ”_” said_VBD Mr._NNP Benton_NNP ._. $_$
For Hindi
• Rama achhaa gaata hai. (hai is VAUX: auxiliary verb); Ram sings well
• Rama achhaa ladakaa hai. (hai is VCOP: copula verb); Ram is a good boy
Example of difficulty in POS tagging
Tags
  Content Word: Noun, Adjective, Verb
  Function Word: Pronoun, Preposition, Conjunction, Interjection
  Noun Tags
    Proper Noun: NNP (for NER)
    Common Noun: NN, NNS
  Verb Tags: VBP, VBD, VBG, VBN
Difficulty in POS Tagging
• Consider the following sentences:
राम अच्छा गाता है_VAUX (auxiliary verb)
Ram good sing is : Ram sings well
GNPTAM (Gender, Number, Person, Tense, Aspect, Modality) for ‘गाता’ only : Male, Singular, ??, ??, ??, -
GNPTAM for ‘गाता है’ : Male, Singular, 2nd or 3rd, Present, Default, Declarative
राम अच्छा लड़का है_VCOP (copular verb)
Ram good boy is : Ram is a good boy
In general, VAUX, VM (main verb) and VCOP cannot be separated easily.
Difficulty in POS Tagging
To POS-tag है based on rules, one simple rule could be:
है
  Preceded by verb: VAUX
  Preceded by nominal: VCOP (facilitates co-reference, सामानाधिकरण)
This is a ‘High Precision, Low Recall’ rule, i.e. when it says Yes it is indeed Yes, but a No may not actually be a No.
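The rule above can be sketched as a small function that looks only at the tag of the word immediately before है. The function name `tag_hai` and the two tag sets are assumptions made for this illustration (in particular, which tags count as "nominal" is a guess, here nouns and adjectives); they are not definitions from the lecture.

```python
# Hypothetical coarse tag sets, chosen only for this sketch.
VERB_TAGS = {"VM"}          # main verb
NOMINAL_TAGS = {"N", "JJ"}  # noun, adjective


def tag_hai(prev_tag):
    """Tag 'hai' from the tag of the immediately preceding word.

    Returns "VAUX", "VCOP", or None when the rule has nothing to say.
    """
    if prev_tag in VERB_TAGS:
        return "VAUX"
    if prev_tag in NOMINAL_TAGS:
        return "VCOP"
    return None


# Rama(N) achhaa(JJ) gaata(VM) hai -> VAUX
print(tag_hai("VM"))
# Rama(N) achhaa(JJ) ladakaa(N) hai -> VCOP
print(tag_hai("N"))
```

Because the function inspects a single preceding tag, any sentence in which the main verb is not adjacent to है falls outside its VAUX branch, which is where the low recall comes from.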
Exceptions to the previous rule
• False Negative for VAUX
• Particle Injection (Particles: भी (bhi), तो (to), ही (hi), नहीं (nahi))
राम गाता तो अच्छा है, पर ...
• Consider the following sentences:
राम अच्छा है_VCOP
राम तो गाता अच्छा है_VAUX
The POS tags of है vary here despite the preceding word being an adjective in both sentences.
Evaluation of POS Tag Accuracy
• Precision, Recall and F-Score
Given: G (what our system returns)
Ideal: I (actual tags)
Agreement: G ∩ I
False Positive: in G but not in I
False Negative: in I but not in G
• Precision P = |G ∩ I| / |G|
• Recall R = |G ∩ I| / |I|
• F-Score = 2PR/(P+R)
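A minimal sketch of these three measures, treating G and I as sets of (token position, tag) pairs. The function name `precision_recall_f` and the example data are hypothetical; the system output deliberately leaves one token untagged so that precision and recall differ.

```python
def precision_recall_f(system, gold):
    """Precision, recall and F-score for two sets of (position, tag) pairs."""
    agreement = len(system & gold)  # |G ∩ I|
    precision = agreement / len(system) if system else 0.0
    recall = agreement / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical example: four tokens in the gold standard, three tagged by the system.
gold = {(0, "N"), (1, "V"), (2, "JJ"), (3, "N")}
system = {(0, "N"), (1, "N"), (2, "JJ")}

p, r, f = precision_recall_f(system, gold)
print(p, r, f)
```

Here two of the three returned tags agree with the gold standard, so P = 2/3, R = 2/4 and F = 4/7. If the system assigns exactly one tag to every token, |G| = |I| and the three values coincide.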
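POS tag computation (1/2)
Best tag sequence:
$T^* = \arg\max_T P(T \mid W) = \arg\max_T P(T)\,P(W \mid T)$ (by Bayes' Theorem; $P(W)$ is the same for every $T$ and is dropped)
$P(T) = P(t_0 = \text{\textasciicircum},\ t_1, t_2, \ldots, t_{n+1} = \text{.})$
$= P(t_0)\,P(t_1 \mid t_0)\,P(t_2 \mid t_1 t_0)\,P(t_3 \mid t_2 t_1 t_0) \cdots P(t_n \mid t_{n-1} t_{n-2} \ldots t_0)\,P(t_{n+1} \mid t_n t_{n-1} \ldots t_0)$
$= P(t_0)\,P(t_1 \mid t_0)\,P(t_2 \mid t_1) \cdots P(t_n \mid t_{n-1})\,P(t_{n+1} \mid t_n)$
$= \prod_{i=0}^{n+1} P(t_i \mid t_{i-1})$ (Bigram Assumption, with $P(t_0 \mid t_{-1})$ read as $P(t_0)$)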
POS tag computation (2/2)
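$P(W \mid T) = P(w_0 \mid t_0 \ldots t_{n+1})\,P(w_1 \mid w_0, t_0 \ldots t_{n+1})\,P(w_2 \mid w_1 w_0, t_0 \ldots t_{n+1}) \cdots P(w_n \mid w_0 \ldots w_{n-1}, t_0 \ldots t_{n+1})\,P(w_{n+1} \mid w_0 \ldots w_n, t_0 \ldots t_{n+1})$
Assumption: A word is determined completely by its tag. This is inspired by speech recognition.
$= P(w_0 \mid t_0)\,P(w_1 \mid t_1) \cdots P(w_{n+1} \mid t_{n+1})$
$= \prod_{i=0}^{n+1} P(w_i \mid t_i)$
$= \prod_{i=1}^{n+1} P(w_i \mid t_i)$ (Lexical Probability Assumption; the $i = 0$ factor is $P(\text{\textasciicircum} \mid \text{\textasciicircum}) = 1$)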
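Substituting the bigram and lexical probability assumptions into the Bayes decomposition gives the single quantity the tagger maximises. This is only a restatement of the two results, with the same convention that $P(t_0 \mid t_{-1})$ is read as $P(t_0)$:

```latex
T^* = \arg\max_{T} \; \prod_{i=0}^{n+1} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i)
```

Each factor pairs one tag-to-tag transition with one tag-to-word emission, which is the structure of an HMM.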
Example
“People jump high”.
People : Noun/Verb
jump : Noun/Verb
high : Noun/Adjective
We can start with probabilities.
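Trellis diagram (one column per word; each tag in a column is connected to every tag in the next column):
  ^  |  People  |  jump   |  high   |  $
  ^  |  N, VM   |  N, VM  |  N, JJ  |  $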
8 POS tag sequences are possible (2 × 2 × 2), given these valid tags for each word taken from the dictionary.
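The eight candidate sequences can be listed explicitly by taking one tag per word from the dictionary entries given for this example. The variable names are illustrative; the tags are the ones shown in the trellis.

```python
from itertools import product

# Valid tags per word, as given for the example sentence.
candidates = {
    "People": ["N", "VM"],
    "jump": ["N", "VM"],
    "high": ["N", "JJ"],
}
sentence = ["People", "jump", "high"]

# One tag per word, wrapped in the start (^) and end ($) markers.
sequences = [
    ("^",) + combo + ("$",)
    for combo in product(*(candidates[word] for word in sentence))
]

for seq in sequences:
    print(" ".join(seq))
print(len(sequences))
```

The count is the product of the number of candidate tags per word, so it grows exponentially with sentence length.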
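Bigram Assumption
Best tag sequence:
$T^* = \arg\max_T P(T \mid W) = \arg\max_T P(T)\,P(W \mid T)$ (by Bayes' Theorem)
$P(T) = P(t_0 = \text{\textasciicircum},\ t_1, t_2, \ldots, t_{n+1} = \text{.})$
$= P(t_0)\,P(t_1 \mid t_0)\,P(t_2 \mid t_1 t_0)\,P(t_3 \mid t_2 t_1 t_0) \cdots P(t_n \mid t_{n-1} t_{n-2} \ldots t_0)\,P(t_{n+1} \mid t_n t_{n-1} \ldots t_0)$
$= P(t_0)\,P(t_1 \mid t_0)\,P(t_2 \mid t_1) \cdots P(t_n \mid t_{n-1})\,P(t_{n+1} \mid t_n)$
$= \prod_{i=0}^{n+1} P(t_i \mid t_{i-1})$ (Bigram Assumption)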
Lexical Probability Assumption
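$P(W \mid T) = P(w_0 \mid t_0 \ldots t_{n+1})\,P(w_1 \mid w_0, t_0 \ldots t_{n+1})\,P(w_2 \mid w_1 w_0, t_0 \ldots t_{n+1}) \cdots P(w_n \mid w_0 \ldots w_{n-1}, t_0 \ldots t_{n+1})\,P(w_{n+1} \mid w_0 \ldots w_n, t_0 \ldots t_{n+1})$
Assumption: A word is determined completely by its tag. This is inspired by speech recognition.
$= P(w_0 \mid t_0)\,P(w_1 \mid t_1) \cdots P(w_{n+1} \mid t_{n+1})$
$= \prod_{i=0}^{n+1} P(w_i \mid t_i)$
$= \prod_{i=1}^{n+1} P(w_i \mid t_i)$ (Lexical Probability Assumption)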
Calculation from actual data
• Corpus
  • ^ Ram got many NLP books. ^ He found them all very interesting.
• POS tagged
  • ^ N V A N N . ^ N V N A R A .
(Each sentence begins with the start marker ^.)
Recording numbers (row: previous tag; column: next tag)
      ^    N    V    A    R    .
^     0    2    0    0    0    0
N     0    1    2    1    0    1
V     0    1    0    1    0    0
A     0    1    0    0    1    1
R     0    0    0    1    0    0
.     1    0    0    0    0    0
Probabilities (row: previous tag; column: next tag)
      ^    N    V    A    R    .
^     0    1    0    0    0    0
N     0    1/5  2/5  1/5  0    1/5
V     0    1/2  0    1/2  0    0
A     0    1/3  0    0    1/3  1/3
R     0    0    0    1    0    0
.     1    0    0    0    0    0
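The counts and probabilities in these two tables can be reproduced with a short sketch. It assumes the tag sequence carries a start marker ^ before each of the two sentences, which is what the recorded counts imply (^ followed by N twice, and . followed by ^ once). The function names are illustrative.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Tag sequence of the two-sentence corpus, with ^ before each sentence (assumption).
tags = "^ N V A N N . ^ N V N A R A .".split()


def bigram_counts(tag_seq):
    """counts[prev][nxt] = number of times tag `nxt` directly follows tag `prev`."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tag_seq, tag_seq[1:]):
        counts[prev][nxt] += 1
    return counts


def transition_probs(counts):
    """Divide each row of counts by its row total to get P(next tag | previous tag)."""
    probs = {}
    for prev, row in counts.items():
        total = sum(row.values())
        probs[prev] = {nxt: Fraction(c, total) for nxt, c in row.items()}
    return probs


counts = bigram_counts(tags)
probs = transition_probs(counts)

for prev in ["^", "N", "V", "A", "R", "."]:
    print(prev, {nxt: str(p) for nxt, p in probs[prev].items()})
```

Using `Fraction` keeps the values exact, so they can be compared directly with the fractions in the table; transitions that never occur are simply absent from a row and correspond to the zero entries.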
Penn tagset (1/2)
Penn tagset (2/2)
Indian Language Tagset: Noun
Indian Language Tagset: Pronoun
Indian Language Tagset: Quantifier
Indian Language Tagset: Demonstrative
No.   Category        Label   Annotation   Examples
3     Demonstrative   DM      DM           Vaha, jo, yaha
3.1   Deictic         DMD     DM__DMD      Vaha, yaha
3.2   Relative        DMR     DM__DMR      jo, jis
3.3   Wh-word         DMQ     DM__DMQ      kis, kaun
      Indefinite      DMI     DM__DMI      KoI, kis
Indian Language Tagset: Verb, Adjective, Adverb
Indian Language Tagset: Postposition, conjunction
Indian Language Tagset: Particle
Indian Language Tagset: Residuals