What is POS tagging
Raw text:
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .

    ↓ POS Tagger

Tagged text:
Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ ,_, will_MD join_VB the_DT board_NN as_IN a_DT nonexecutive_JJ director_NN Nov._NNP 29_CD ._.

Tag set (excerpt): NNP: proper noun, CD: numeral, JJ: adjective
Why POS tagging?
• POS tagging is a prerequisite for further NLP analysis
  – Syntax parsing
    • POS tags are the basic units for parsing
  – Information extraction
    • Indication of names, relations
  – Machine translation
    • The meaning of a particular word depends on its POS tag
  – Sentiment analysis
    • Adjectives are the major opinion holders
      – Good vs. Bad, Excellent vs. Terrible
Challenges in POS tagging
• Words often have more than one POS tag
  – The back door (adjective)
  – On my back (noun)
  – Promised to back the bill (verb)
• Simple solution with dictionary look-up does not work in practice
  – One needs to determine the POS tag for a particular instance of a word from its context
Define a tagset
• We have to agree on a standard inventory of word classes
  – Taggers are trained on labeled corpora
  – The tagset needs to capture semantically or syntactically important distinctions that can easily be made by trained human annotators
Word classes
• Open classes
  – Nouns, verbs, adjectives, adverbs
• Closed classes
  – Auxiliaries and modal verbs
  – Prepositions, conjunctions
  – Pronouns, determiners
  – Particles, numerals
Public tagsets in NLP
• Brown corpus – Francis and Kucera, 1961
  – 500 samples, distributed across 15 genres in rough proportion to the amount published in 1961 in each of those genres
  – 87 tags
• Penn Treebank – Marcus et al., 1993
  – Hand-annotated corpus of Wall Street Journal text, 1M words
  – 45 tags, a simplified version of the Brown tag set
  – The standard for English now
  – Most statistical POS taggers are trained on this tagset
How much ambiguity is there?
• Statistics of word-tag pairs in the Brown Corpus and the Penn Treebank
  [Table: fraction of words with more than one possible tag – 11% and 18%]
Is POS tagging a solved problem?
• Baseline
  – Tag every word with its most frequent tag
  – Tag unknown words as nouns
  – Accuracy
    • Word level: 90%
    • Sentence level: with an average English sentence length of 14.3 words, $0.9^{14.3} \approx 22\%$
• Accuracy of a state-of-the-art POS tagger
  – Word level: 97%
  – Sentence level: $0.97^{14.3} \approx 65\%$
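As a quick sanity check, the numbers above can be reproduced in a few lines of Python; this is a minimal sketch of the exponentiation and of the most-frequent-tag baseline, using a made-up toy corpus rather than real Treebank counts.

```python
# A minimal sketch of the arithmetic above and of the most-frequent-tag baseline.
# The tiny tagged corpus is a made-up illustration, not real Treebank data.
from collections import Counter, defaultdict

# Sentence-level accuracy from word-level accuracy (assuming independent errors):
avg_len = 14.3
print(0.90 ** avg_len)  # ≈ 0.22 for the baseline
print(0.97 ** avg_len)  # ≈ 0.65 for a state-of-the-art tagger

# Baseline: tag every word with its most frequent tag; unknown words become NN.
training = [("the", "DT"), ("back", "NN"), ("back", "VB"), ("back", "NN"), ("bill", "NN")]
counts = defaultdict(Counter)
for word, tag in training:
    counts[word][tag] += 1
most_frequent = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def baseline_tag(words):
    return [most_frequent.get(w, "NN") for w in words]

print(baseline_tag(["the", "back", "door"]))  # ['DT', 'NN', 'NN']
```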
Building a POS tagger
• Rule-based solution
  1. Take a dictionary that lists all possible tags for each word
  2. Assign to every word all its possible tags
  3. Apply rules that eliminate impossible/unlikely tag sequences, leaving only one tag per word

  Example (see the sketch below):
    she       PRP
    promised  VBN, VBD
    to        TO
    back      VB, JJ, RB, NN
    the       DT
    bill      NN, VB

  R1: A pronoun should be followed by a past-tense verb
  R2: A verb cannot follow a determiner
Rules can be learned via inductive learning.
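A minimal sketch of the three steps above, assuming a hypothetical tag dictionary and only the two rules R1 and R2; a real rule-based tagger would need many more rules before exactly one tag per word remains.

```python
# A minimal sketch: dictionary lookup, assigning all possible tags, then pruning
# tag sequences with hand-written rules R1 and R2. The dictionary and rules are
# illustrative assumptions, not a complete rule set.
from itertools import product

tag_dict = {                      # step 1: possible tags per word
    "she": ["PRP"],
    "promised": ["VBN", "VBD"],
    "to": ["TO"],
    "back": ["VB", "JJ", "RB", "NN"],
    "the": ["DT"],
    "bill": ["NN", "VB"],
}

def allowed(prev, curr):
    if prev == "PRP" and curr != "VBD":         # R1: pronoun followed by past-tense verb
        return False
    if prev == "DT" and curr.startswith("VB"):  # R2: verb cannot follow determiner
        return False
    return True

def rule_based_tag(words):
    candidates = [tag_dict[w] for w in words]          # step 2: all possible tags
    sequences = [seq for seq in product(*candidates)   # step 3: prune tag sequences
                 if all(allowed(p, c) for p, c in zip(seq, seq[1:]))]
    return sequences

print(rule_based_tag(["she", "promised", "to", "back", "the", "bill"]))
```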
Building a POS tagger
• Statistical POS tagging
  – What is the most likely sequence of tags $t$ for the given sequence of words $w$?
    • Tags:  $t = (t_1, t_2, t_3, t_4, t_5, t_6)$
    • Words: $w = (w_1, w_2, w_3, w_4, w_5, w_6)$
    • Find $t^* = \arg\max_t p(t \mid w)$
POS tagging with generative models
• Bayes rule
  – $t^* = \arg\max_t p(t \mid w) = \arg\max_t \frac{p(w \mid t)\, p(t)}{p(w)} = \arg\max_t p(w \mid t)\, p(t)$
  – $p(w \mid t)\, p(t)$: the joint distribution of tags and words
  – Generative model
    • A stochastic process that first generates the tags, and then generates the words based on these tags
Hidden Markov models
• Two assumptions for POS tagging (sketched below)
  1. The current tag depends only on the previous $k$ tags
     • $p(t) = \prod_i p(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_{i-k})$
     • When $k = 1$, this is the so-called first-order HMM
  2. Each word in the sequence depends only on its corresponding tag
     • $p(w \mid t) = \prod_i p(w_i \mid t_i)$
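Under these two assumptions the joint probability factorizes into initial, transition, and emission terms; the minimal sketch below spells this out, using hypothetical probability tables for illustration only.

```python
# A minimal sketch of the first-order HMM factorization:
# p(w, t) = p(t_1) p(w_1|t_1) * prod_i p(t_i|t_{i-1}) p(w_i|t_i)
# The probability tables below are hypothetical numbers, not estimated from data.

def hmm_joint_prob(words, tags, p_init, p_trans, p_emit):
    prob = p_init[tags[0]] * p_emit[tags[0]].get(words[0], 0.0)
    for i in range(1, len(words)):
        prob *= p_trans[tags[i - 1]][tags[i]] * p_emit[tags[i]].get(words[i], 0.0)
    return prob

# Tiny hypothetical model over two tags.
p_init = {"DT": 0.6, "NN": 0.4}
p_trans = {"DT": {"DT": 0.1, "NN": 0.9}, "NN": {"DT": 0.4, "NN": 0.6}}
p_emit = {"DT": {"the": 0.7}, "NN": {"dog": 0.2, "the": 0.01}}

print(hmm_joint_prob(["the", "dog"], ["DT", "NN"], p_init, p_trans, p_emit))
```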
Graphical representation of HMMs
• Light circles: latent random variables (the tags, ranging over all the tags in the tagset)
• Dark circles: observed random variables (the words, ranging over all the words in the vocabulary)
• Arrows: probabilistic dependencies
  – $p(t_i \mid t_{i-1})$: transition probability
  – $p(w_i \mid t_i)$: emission probability
Finding the most probable tag sequence
• Complexity analysis
  – Each word can have up to $T$ tags
  – For a sentence with $N$ words, there are up to $T^N$ possible tag sequences
  – Key: exploit the special structure of HMMs!
    $t^* = \arg\max_t p(t \mid w) = \arg\max_t \prod_i p(w_i \mid t_i)\, p(t_i \mid t_{i-1})$
Trellis: a special structure for HMMs
[Trellis diagram: columns are the words $w_1 \ldots w_5$, rows are the candidate tags $t_1 \ldots t_7$; a path through the trellis is a tag sequence, e.g., word $w_1$ takes tag $t_4$.]
The two candidate sequences $t' = (t_4, t_1, t_3, t_5, t_7)$ and $t'' = (t_4, t_1, t_3, t_5, t_2)$ share their first four steps: computation can be reused!
Viterbi algorithm
• Store the best tag sequence for $w_1 \ldots w_i$ that ends in tag $t_j$ in trellis[j][i]
  – trellis[j][i] $= \max p(w_1 \ldots w_i, t_1, \ldots, t_i = t_j)$
• Recursively compute trellis[j][i] from the entries in the previous column trellis[·][i-1]
  – trellis[j][i] $= p(w_i \mid t_j) \cdot \max_k \big(\text{trellis}[k][i-1] \cdot p(t_j \mid t_k)\big)$
    • trellis[k][i-1]: the best tag sequence for the first $i-1$ words ending in tag $t_k$
    • $p(w_i \mid t_j)$: generating the current observation
    • $p(t_j \mid t_k)$: transition from the previous best ending tag
Viterbi algorithm
[Trellis diagram: the recursion $\text{trellis}[j][i] = p(w_i \mid t_j) \cdot \max_k \text{trellis}[k][i-1] \cdot p(t_j \mid t_k)$ is evaluated column by column, left to right over the words.]
• Order of computation: one column (word position) at a time
• Dynamic programming: $O(T^2 N)$!
Decoding $\arg\max_t p(t \mid w)$
• Take the highest scoring entry in the last column of the trellis
• Keep a backpointer in each trellis cell to keep track of the most probable sequence (see the sketch below)
  – $\text{trellis}[j][i] = p(w_i \mid t_j) \cdot \max_k \text{trellis}[k][i-1] \cdot p(t_j \mid t_k)$; the backpointer stores the maximizing previous tag
[Trellis diagram: the best sequence is read off by following backpointers from the highest-scoring cell in the last column.]
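A minimal sketch of the recursion and backpointer decoding described above, assuming the probability tables (initial, transition, emission) are given, e.g., estimated as on the following slides.

```python
# A minimal sketch of the Viterbi algorithm with backpointers; runs in O(T^2 N).
# p_init, p_trans, p_emit are assumed to be given as nested dictionaries.

def viterbi(words, tags, p_init, p_trans, p_emit):
    # trellis[i][t]: best score of any tag sequence for words[0..i] ending in tag t
    trellis = [{t: p_init[t] * p_emit[t].get(words[0], 0.0) for t in tags}]
    backptr = [{}]
    for i in range(1, len(words)):
        trellis.append({})
        backptr.append({})
        for t in tags:
            # best previous tag k maximizing trellis[i-1][k] * p(t|k)
            best_k = max(tags, key=lambda k: trellis[i - 1][k] * p_trans[k][t])
            trellis[i][t] = (p_emit[t].get(words[i], 0.0)
                             * trellis[i - 1][best_k] * p_trans[best_k][t])
            backptr[i][t] = best_k
    # decode: start from the best entry in the last column, then follow backpointers
    best = max(tags, key=lambda t: trellis[-1][t])
    path = [best]
    for i in range(len(words) - 1, 0, -1):
        path.append(backptr[i][path[-1]])
    return list(reversed(path))
```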
Train an HMM tagger
• Parameters in an HMM tagger
  – Transition probabilities $p(t_j \mid t_i)$: a $T \times T$ matrix
  – Emission probabilities $p(w \mid t)$: a $V \times T$ matrix
  – Initial state probabilities $p(t_1)$: a $T \times 1$ vector (for the first tag in a sentence)
Train an HMM tagger
• Maximum likelihood estimator
  – Given a labeled corpus, e.g., the Penn Treebank
  – Count how often each tag pair $(t_i, t_j)$ and each word-tag pair $(w_j, t_j)$ occurs
    • $p(t_j \mid t_i) = \dfrac{c(t_i, t_j)}{c(t_i)}$
    • $p(w_j \mid t_j) = \dfrac{c(w_j, t_j)}{c(t_j)}$
Proper smoothing is necessary!
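A minimal sketch of these counts and ratios, with add-one (Laplace) smoothing as one simple choice of "proper smoothing"; the toy tagged corpus here is illustrative, not the Penn Treebank.

```python
# A minimal sketch of maximum likelihood estimation for an HMM tagger,
# with add-one smoothing. The toy corpus is made up for illustration.
from collections import Counter

tagged_sents = [[("the", "DT"), ("bill", "NN")],
                [("promised", "VBD"), ("the", "DT"), ("bill", "NN")]]

tag_count, trans_count, emit_count = Counter(), Counter(), Counter()
for sent in tagged_sents:
    for i, (word, tag) in enumerate(sent):
        tag_count[tag] += 1
        emit_count[(word, tag)] += 1
        if i > 0:
            trans_count[(sent[i - 1][1], tag)] += 1

tags = list(tag_count)
vocab = {w for sent in tagged_sents for w, _ in sent}

def p_trans(t_next, t_prev):  # p(t_j | t_i) = (c(t_i, t_j) + 1) / (c(t_i) + T)
    return (trans_count[(t_prev, t_next)] + 1) / (tag_count[t_prev] + len(tags))

def p_emit(word, tag):        # p(w | t) = (c(w, t) + 1) / (c(t) + V)
    return (emit_count[(word, tag)] + 1) / (tag_count[tag] + len(vocab))

print(p_trans("NN", "DT"), p_emit("bill", "NN"))
```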
Public POS taggers
• Brill's tagger
  – http://www.cs.jhu.edu/~brill/
• TnT tagger
  – http://www.coli.uni-saarland.de/~thorsten/tnt/
• Stanford tagger
  – http://nlp.stanford.edu/software/tagger.shtml
• SVMTool
  – http://www.lsi.upc.es/~nlp/SVMTool/
• GENIA tagger
  – http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
• A more complete list at
  – http://www-nlp.stanford.edu/links/statnlp.html#Taggers
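As an aside, off-the-shelf tagging is also available directly from Python; the sketch below uses NLTK's bundled tagger (an assumption about available tooling, not one of the taggers listed above), which outputs Penn Treebank tags.

```python
# A minimal sketch using NLTK's bundled tagger (assumes NLTK is installed and
# its models can be downloaded); output uses the Penn Treebank tagset.
import nltk

nltk.download("punkt", quiet=True)                        # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)   # tagger model

tokens = nltk.word_tokenize("Pierre Vinken will join the board as a nonexecutive director.")
print(nltk.pos_tag(tokens))  # e.g., [('Pierre', 'NNP'), ('Vinken', 'NNP'), ('will', 'MD'), ...]
```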
Let's take a look at other NLP tasks
• Noun phrase (NP) chunking
  – Task: identify all non-recursive NP chunks
The BIO encoding
• Define three new tags
  – B-NP: beginning of a noun phrase chunk
  – I-NP: inside of a noun phrase chunk
  – O: outside of a noun phrase chunk
• POS tagging with a restricted tagset? (see the sketch below)
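A minimal sketch of the encoding: converting NP chunk spans into per-token B-NP/I-NP/O labels. The example sentence and chunk spans are made up for illustration.

```python
# A minimal sketch of BIO encoding for NP chunking.

def chunks_to_bio(tokens, chunk_spans):
    """chunk_spans: list of (start, end) token index ranges, end exclusive."""
    labels = ["O"] * len(tokens)
    for start, end in chunk_spans:
        labels[start] = "B-NP"
        for i in range(start + 1, end):
            labels[i] = "I-NP"
    return list(zip(tokens, labels))

tokens = ["The", "board", "approved", "a", "nonexecutive", "director"]
print(chunks_to_bio(tokens, [(0, 2), (3, 6)]))
# [('The', 'B-NP'), ('board', 'I-NP'), ('approved', 'O'),
#  ('a', 'B-NP'), ('nonexecutive', 'I-NP'), ('director', 'I-NP')]
```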
Another NLP task
• Shallow parsing
  – Task: identify all non-recursive NP, verb ("VP") and preposition ("PP") chunks
BIO Encoding for Shallow Parsing
• Define several new tags
  – B-NP, B-VP, B-PP: beginning of an "NP", "VP", "PP" chunk
  – I-NP, I-VP, I-PP: inside of an "NP", "VP", "PP" chunk
  – O: outside of any chunk
• POS tagging with a restricted tagset?
Yet another NLP task
• Named Entity Recognition
  – Task: identify all mentions of named entities (people, organizations, locations, dates)
BIO Encoding for NER
• Define many new tags
  – B-PERS, B-DATE, …: beginning of a mention of a person/date…
  – I-PERS, I-DATE, …: inside of a mention of a person/date…
  – O: outside of any mention of a named entity
• POS tagging with a restricted tagset?
Sequence labeling
• Many NLP tasks are sequence labeling tasks
  – Input: a sequence of tokens/words
  – Output: a sequence of corresponding labels
    • E.g., POS tags, BIO encoding for NER
  – Solution: find the most probable label sequence $t$ for the given word sequence $w$
    • $t^* = \arg\max_t p(t \mid w)$
Comparing to the traditional classification problem
Sequence labeling
• $t^* = \arg\max_t p(t \mid w)$
  – $t$ is a vector/matrix of labels
• Dependency between both $(t, w)$ and $(t_i, t_j)$
• Structured output
• Difficult to solve the inference problem

Traditional classification
• $y^* = \arg\max_y p(y \mid x)$
  – $y$ is a single label
• Dependency only within $(y, x)$
• Independent outputs
• Easy to solve the inference problem
[Diagrams: in traditional classification, each label $y_i$ depends only on its own input $x_i$; in sequence labeling, the tags $t_i, t_j$ are linked to each other as well as to their words $w_i, w_j$.]
Two modeling perspectives
• Generative models
  – Model the joint probability of labels and words
  – $t^* = \arg\max_t p(t \mid w) = \arg\max_t p(w \mid t)\, p(t)$
• Discriminative models
  – Directly model the conditional probability of labels given the words
  – $t^* = \arg\max_t p(t \mid w) = \arg\max_t f(t, w)$, for some scoring function $f(t, w)$
Generative vs. discriminative models
• Binary classification as an example
[Figure: the generative model's view vs. the discriminative model's view of the same binary classification data]
Generative vs. discriminative models

Generative
• Specifies the joint distribution
  – Full probabilistic specification of all the random variables
• Dependence assumptions have to be specified for both $p(w \mid t)$ and $p(t)$
• Flexible; can also be used in unsupervised learning

Discriminative
• Specifies the conditional distribution
  – Only explains the target variable
• Arbitrary features can be incorporated for modeling $p(t \mid w)$
• Needs labeled data; only suitable for (semi-)supervised learning
Maximum entropy Markov models
• MEMMs are discriminative models of the labels $t$ given the observed input sequence $w$
  – $p(t \mid w) = \prod_i p(t_i \mid w_i, t_{i-1})$
Design features
• Emission-like features
  – Binary feature functions
    • f_first-letter-capitalized-NNP(China) = 1
    • f_first-letter-capitalized-VB(know) = 0
  – Integer (or real-valued) feature functions
    • f_number-of-vowels-NNP(China) = 2
• Transition-like features
  – Binary feature functions
    • f_first-letter-capitalized-VB-NNP(China) = 1
• Not necessarily independent features! (see the sketch below)
[Figure: a transition from tag VB (emitting "know") to tag NNP (emitting "China")]
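A minimal sketch of such feature functions; these are just the illustrative features named above, not a complete feature set.

```python
# A minimal sketch of emission-like and transition-like feature functions.

def f_first_letter_capitalized_NNP(curr_tag, prev_tag, word):
    # emission-like binary feature
    return 1 if word[0].isupper() and curr_tag == "NNP" else 0

def f_number_of_vowels_NNP(curr_tag, prev_tag, word):
    # emission-like integer feature
    return sum(c in "aeiou" for c in word.lower()) if curr_tag == "NNP" else 0

def f_first_letter_capitalized_VB_NNP(curr_tag, prev_tag, word):
    # transition-like binary feature: previous tag VB, current tag NNP
    return 1 if word[0].isupper() and prev_tag == "VB" and curr_tag == "NNP" else 0

print(f_first_letter_capitalized_NNP("NNP", "VB", "China"))    # 1
print(f_number_of_vowels_NNP("NNP", "VB", "China"))            # 2
print(f_first_letter_capitalized_VB_NNP("NNP", "VB", "China")) # 1
```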
Parameterization of $p(t_i \mid w_i, t_{i-1})$
• Associate a real-valued weight $\lambda_k$ with each specific type of feature function
  – e.g., $\lambda_k$ for f_first-letter-capitalized-NNP(w)
• Define a scoring function $f(t_i, t_{i-1}, w_i) = \sum_k \lambda_k f_k(t_i, t_{i-1}, w_i)$
• Naturally, $p(t_i \mid w_i, t_{i-1}) \propto \exp f(t_i, t_{i-1}, w_i)$
  – Recall the basic definition of probability
    • $p(x) > 0$
    • $\sum_x p(x) = 1$
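A minimal sketch of this parameterization: the local distribution is a softmax over the scoring function, which guarantees positivity and normalization over the tagset. Feature functions and weights are assumed to be given (e.g., the ones sketched above).

```python
# A minimal sketch of the local MEMM distribution p(t_i | w_i, t_{i-1})
# as a softmax over the linear scoring function f(t_i, t_{i-1}, w_i).
import math

def score(curr_tag, prev_tag, word, feature_funcs, weights):
    # f(t_i, t_{i-1}, w_i) = sum_k lambda_k * f_k(t_i, t_{i-1}, w_i)
    return sum(lam * f(curr_tag, prev_tag, word) for f, lam in zip(feature_funcs, weights))

def p_local(curr_tag, prev_tag, word, tagset, feature_funcs, weights):
    scores = {t: score(t, prev_tag, word, feature_funcs, weights) for t in tagset}
    z = sum(math.exp(s) for s in scores.values())   # local normalizer
    return math.exp(scores[curr_tag]) / z           # > 0 and sums to 1 over the tagset
```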
Parameterization of MEMMs
• It is a log-linear model
  – $p(t \mid w) = \prod_i p(t_i \mid w_i, t_{i-1}) = \prod_i \dfrac{\exp f(t_i, t_{i-1}, w_i)}{\sum_{t'} \exp f(t', t_{i-1}, w_i)}$
  – $\log p(t \mid w) = \sum_i f(t_i, t_{i-1}, w_i) - C(w)$, where $C(w)$ is a constant related only to $w$
• The Viterbi algorithm can be used to decode the most probable label sequence based solely on $\sum_i f(t_i, t_{i-1}, w_i)$
Parameter estimation
• The maximum likelihood estimator can be used in a similar way as in HMMs
  – $\lambda^* = \arg\max_\lambda \sum_{(t,w)} \log p(t \mid w) = \arg\max_\lambda \sum_{(t,w)} \Big[ \sum_i f(t_i, t_{i-1}, w_i) - C(w) \Big]$
  – Decompose the training data into such $(t_i, t_{i-1}, w_i)$ units
Why maximum entropy?
• We will explain this in detail when discussing logistic regression models
A little bit more about MEMMs
• Emission features can go across multiple observations
  – $f(t_i, t_{i-1}, w_i) \Rightarrow \sum_k \lambda_k f_k(t_i, t_{i-1}, w)$, i.e., feature functions may look at the whole observation sequence $w$
  – Especially useful for shallow parsing and NER tasks
Conditional random field
• A more advanced model for sequence labeling
  – Models global dependency
  – $p(t \mid w) \propto \prod_i \exp\Big( \sum_k \lambda_k f_k(t_i, w) + \sum_l \lambda_l f_l(t_i, t_{i-1}, w) \Big)$
    • Node features $f(t_i, w)$
    • Edge features $f(t_i, t_{i-1}, w)$ (see the sketch below)
[Figure: a linear-chain graph over tags $t_1, t_2, t_3, t_4$ and words $w_1, w_2, w_3, w_4$, with node features connecting each tag to the observations and edge features connecting adjacent tags.]
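A minimal sketch of how the unnormalized score of one tag sequence is assembled from node and edge feature scores; the score matrices are assumed to be precomputed, and the global normalizer (a sum over all $T^N$ sequences, computed with a forward-style dynamic program) is omitted.

```python
# A minimal sketch: the unnormalized log-score of one tag sequence in a
# linear-chain CRF, assuming precomputed score matrices
#   node[i][j] = sum_k lambda_k * f_k(t_i = tag_j, w)                    (node features)
#   edge[a][b] = sum_l lambda_l * f_l(t_i = tag_b, t_{i-1} = tag_a, w)   (edge features)
import numpy as np

def crf_sequence_score(node, edge, tag_ids):
    score = node[0, tag_ids[0]]
    for i in range(1, len(tag_ids)):
        score += node[i, tag_ids[i]] + edge[tag_ids[i - 1], tag_ids[i]]
    return score   # p(t|w) ∝ exp(score); the normalizer sums exp over all sequences

node = np.array([[1.0, 0.2], [0.3, 1.5], [0.9, 0.1]])   # 3 words, 2 tags
edge = np.array([[0.5, 1.0], [0.2, 0.4]])
print(crf_sequence_score(node, edge, [0, 1, 0]))
```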
What you should know
• Definition of the POS tagging problem
  – Properties & challenges
• Public tag sets
• Generative models for POS tagging
  – HMMs
• The general sequence labeling problem
• Discriminative models for sequence labeling
  – MEMMs