Conditional Random Fields
University of Pittsburgh, Computer Science Department
Zhipeng Luo
Oct. 2014
Outline
• 1. Introduction
• 2. CRF Modeling
• 3. Inference using CRF
• 4. Training CRF
• 5. Applications of CRF
Part 1: Introduction
Problem Description
• Given X (observations), find Y (predictions)
• For example,
  X = {temperature, moisture, pressure, ...}
  Y = {Sunny, Rainy, Stormy, ...}
Part 2: CRF Modeling
• Related Models
• Discriminative vs. Generative
• Chain CRF
• General CRF
Related Models
• Markov Random Fields
• Bayesian Network
• Factor Graph
• Sequence Models
Undirected Graph Model: MRF
• On an undirected graph, the joint distribution of the variables factorizes over the cliques $C$ of the graph:
  $p(\mathbf{y}) = \frac{1}{Z} \prod_{C} \psi_C(\mathbf{y}_C), \qquad Z = \sum_{\mathbf{y}} \prod_{C} \psi_C(\mathbf{y}_C)$
• Potential functions: $\psi_C(\mathbf{y}_C) \ge 0$, defined on the cliques
• Partition function: the normalizer $Z$
• Energy functions: $\psi_C(\mathbf{y}_C) = \exp\{-E_C(\mathbf{y}_C)\}$
• Markov property (next slide)
• A generative model
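To make the factorization concrete, here is a minimal sketch (hypothetical pairwise potentials, not from the slides) that evaluates $p(\mathbf{y})$ and $Z$ by brute force on a three-node chain:

```python
import itertools
import numpy as np

# A toy MRF on a 3-node chain y1 - y2 - y3 with binary states.
# The pairwise potentials (hypothetical values) favor agreeing neighbors.
psi_12 = np.array([[2.0, 0.5], [0.5, 2.0]])   # psi(y1, y2)
psi_23 = np.array([[2.0, 0.5], [0.5, 2.0]])   # psi(y2, y3)

def unnormalized(y):
    y1, y2, y3 = y
    return psi_12[y1, y2] * psi_23[y2, y3]

# Partition function Z: sum the product of potentials over all 2^3 states.
Z = sum(unnormalized(y) for y in itertools.product([0, 1], repeat=3))

def p(y):
    return unnormalized(y) / Z

print(p((0, 0, 0)), p((0, 1, 0)))   # agreeing chains are more probable
```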
Independence in MRF
• Markov property: a variable $y_s$ is conditionally independent of all other variables given its neighbors $N(s)$:
  $p(y_s \mid \mathbf{y}_{V \setminus s}) = p(y_s \mid \mathbf{y}_{N(s)})$
Directed Graph: Bayesian Network
• Local conditional distributions:
  $p(\mathbf{y}) = \prod_{s} p(y_s \mid y_{\pi(s)})$
  where $\pi(s)$ indexes the parents of node $s$
• Naïve Bayes: once the class label is known, all the features are independent
• Again, a generative model
Factor Graph
• An explicit way to represent the factors in graphs
• Undirected graph
• Directed graph
Sequence Prediction
• Named entity recognition (NER) and part-of-speech (POS) tagging problems
• Set of observations: $\mathbf{x} = (x_1, \dots, x_T)$
• Set of underlying states: $\mathbf{y} = (y_1, \dots, y_T)$
• HMM is generative:
  $p(\mathbf{x}, \mathbf{y}) = \prod_{t} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$
  with transition probability $p(y_t \mid y_{t-1})$ and observation probability $p(x_t \mid y_t)$
• Basic independence assumptions:
  • Each observation depends only on its corresponding state;
  • The current state depends only on its previous state.
• Strong assumptions!
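For concreteness, a minimal sketch of the generative factorization above, with hypothetical HMM parameters:

```python
import numpy as np

# Hypothetical HMM parameters over 2 states and 2 observation symbols.
pi = np.array([0.6, 0.4])                 # p(y_1)
A = np.array([[0.7, 0.3], [0.4, 0.6]])    # A[i, j] = p(y_t = j | y_{t-1} = i)
B = np.array([[0.9, 0.1], [0.2, 0.8]])    # B[i, o] = p(x_t = o | y_t = i)

def hmm_joint(x, y):
    # p(x, y) = p(y_1) p(x_1 | y_1) * prod_t p(y_t | y_{t-1}) p(x_t | y_t)
    p = pi[y[0]] * B[y[0], x[0]]
    for t in range(1, len(x)):
        p *= A[y[t - 1], y[t]] * B[y[t], x[t]]
    return p

print(hmm_joint(x=(0, 1, 1), y=(0, 0, 1)))
```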
Discriminative vs. Generative
• Generative: describes how a label vector y can probabilistically "generate" a feature vector x.
• Discriminative: describes how to take a feature vector x and assign it a label y.
• Naïve Bayes: models $p(y, \mathbf{x})$
• MaxEnt classifier: models $p(y \mid \mathbf{x})$
Discriminative vs. Generative
• Limitations of generative models:
  • Modeling the joint distribution can lead to difficulties: features may have complex dependencies
  • Models often make strong independence assumptions
• Discriminative models:
  • No independence assumptions are made for X
  • We care about conditional independences among Y, and how Y can depend on X
  • Best suited to rich, overlapping features
Chain CRFs
• A CRF is simply a conditional distribution $p(\mathbf{y} \mid \mathbf{x})$ with an associated graphical structure
• The variables $\mathbf{X}$ are constants with respect to $\mathbf{Y}$: they are observed, never modeled
• Convert HMM to chain CRF:
Convert HMM to Chain CRF:
• Step 1: rewrite the HMM joint distribution
  $p(\mathbf{y}, \mathbf{x}) = \prod_{t} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$
• As:
  $p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z} \prod_{t} \exp\Big\{ \sum_{i,j \in S} \theta_{ij}\, \mathbf{1}\{y_t = i\}\, \mathbf{1}\{y_{t-1} = j\} + \sum_{i \in S} \sum_{o \in O} \mu_{oi}\, \mathbf{1}\{y_t = i\}\, \mathbf{1}\{x_t = o\} \Big\}$
• Where: $\theta_{ij} = \log p(y' = i \mid y = j)$, $\mu_{oi} = \log p(x = o \mid y = i)$, and $Z = 1$
Convert HMM to Chain CRF:
• Step 2: introduce feature functions, one per transition $(i, j)$ and one per state-observation pair $(i, o)$:
  $f_{ij}(y, y', x) = \mathbf{1}\{y = i\}\, \mathbf{1}\{y' = j\}, \qquad f_{io}(y, y', x) = \mathbf{1}\{y = i\}\, \mathbf{1}\{x = o\}$
• Then, with one weight $\lambda_k$ per feature $f_k$:
  $p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z} \prod_{t} \exp\Big\{ \sum_{k} \lambda_k f_k(y_t, y_{t-1}, x_t) \Big\}$
Convert HMM to Chain CRF:
• Step 3: the rewritten joint is now a product of local factors, one per position:
  $\Psi_t(y_t, y_{t-1}, x_t) = \exp\Big\{ \sum_k \lambda_k f_k(y_t, y_{t-1}, x_t) \Big\}, \qquad p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z} \prod_t \Psi_t(y_t, y_{t-1}, x_t)$
Convert HMM to Chain CRF:
• Step 4: the conditional distribution follows by normalizing over label sequences:
  $p(\mathbf{y} \mid \mathbf{x}) = \frac{p(\mathbf{y}, \mathbf{x})}{\sum_{\mathbf{y}'} p(\mathbf{y}', \mathbf{x})} = \frac{\prod_t \exp\{\sum_k \lambda_k f_k(y_t, y_{t-1}, x_t)\}}{\sum_{\mathbf{y}'} \prod_t \exp\{\sum_k \lambda_k f_k(y'_t, y'_{t-1}, x_t)\}}$
• This is exactly a linear-chain CRF: the HMM defines one, with a particular choice of features and weights
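The conversion can be sanity-checked numerically. In this minimal sketch (reusing the hypothetical HMM parameters from the earlier sketch), setting the CRF weights to the HMM's log-probabilities makes the normalized CRF score reproduce the HMM posterior exactly:

```python
import itertools
import numpy as np

pi = np.array([0.6, 0.4])                 # hypothetical HMM parameters
A = np.array([[0.7, 0.3], [0.4, 0.6]])    # p(y_t | y_{t-1})
B = np.array([[0.9, 0.1], [0.2, 0.8]])    # p(x_t | y_t)

def hmm_joint(x, y):
    p = pi[y[0]] * B[y[0], x[0]]
    for t in range(1, len(x)):
        p *= A[y[t - 1], y[t]] * B[y[t], x[t]]
    return p

def crf_score(y, x):
    # exp of a linear score whose weights are the HMM log-probabilities
    s = np.log(pi[y[0]]) + np.log(B[y[0], x[0]])
    for t in range(1, len(x)):
        s += np.log(A[y[t - 1], y[t]]) + np.log(B[y[t], x[t]])
    return np.exp(s)

x = (0, 1, 1)
ys = list(itertools.product([0, 1], repeat=len(x)))
Z = sum(crf_score(y, x) for y in ys)      # CRF normalizer Z(x)
px = sum(hmm_joint(x, y) for y in ys)     # HMM evidence p(x)
for y in ys:                              # conditional distributions match
    assert abs(crf_score(y, x) / Z - hmm_joint(x, y) / px) < 1e-12
print("HMM posterior == chain-CRF conditional")
```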
Chain CRFs
• We can change it so that each state depends on more observations:
• Or on inputs at previous steps:
• Or on all inputs (see the feature sketch below):
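As a minimal sketch, these variations amount to feature functions that read different parts of $\mathbf{x}$ (all features here are hypothetical); inference is unchanged because the features still couple only adjacent labels:

```python
# Features for a chain CRF over binary labels/inputs (hypothetical examples).
# Each has the signature f(y_prev, y_t, x, t); only the label arguments
# matter for inference, so x can be read as widely as we like.

def f_current(y_prev, y_t, x, t):
    return float(y_t == 1 and x[t] == 1)                 # current input only

def f_previous(y_prev, y_t, x, t):
    return float(y_t == 1 and t > 0 and x[t - 1] == 1)   # input one step back

def f_all(y_prev, y_t, x, t):
    return float(y_t == 1 and sum(x) > len(x) / 2)       # global property of x
```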
General CRF
• If $G = (V, E)$ is a graph, and $\mathbf{Y} = (Y_v)_{v \in V}$;
• $Y_w$ and $Y_v$ are neighbors if $(w, v) \in E$;
• $(\mathbf{X}, \mathbf{Y})$ is a CRF, if, when conditioned on $\mathbf{X}$, the variables $Y_v$ obey the Markov property with respect to the graph:
  $p(Y_v \mid \mathbf{X}, Y_w, w \neq v) = p(Y_v \mid \mathbf{X}, Y_w, w \sim v)$
• Example: the linear-chain CRF of the previous slides
General CRFs: Visualization
• According to the definition of a CRF, the random variables $\mathbf{Y}$ still obey the Markov property with respect to the graph.
[Figure: the CRF consists of an MRF over the variables Y together with the fixed, observable variables X, which are not part of the MRF]
General CRFs: Visualization
• Divide the MRF into cliques. The parameters inside each clique template are tied: the potential functions $\Psi_c(\mathbf{y}_c, \mathbf{x})$ for the template
• The cliques contain only unobservables ($\mathbf{y}$); $\mathbf{x}$ is an argument to $\Psi_c$
• The probability $p_M(\mathbf{y} \mid \mathbf{x})$ is a joint distribution over the unobservables $\mathbf{Y}$:
  $p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} e^{Q(\mathbf{y}, \mathbf{x})}, \qquad Z(\mathbf{x}) = \sum_{\mathbf{y}} e^{Q(\mathbf{y}, \mathbf{x})}, \qquad Q(\mathbf{y}, \mathbf{x}) = \sum_{c \in C} \Psi_c(\mathbf{y}_c, \mathbf{x})$
[Figure: the cliques include only the unobservables Y; the observables X are not included in the cliques]
General CRFs: Visualization
• $\Psi_c$ is typically decomposed into a weighted sum of feature functions $f_i$, producing:
  $p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Big\{ \sum_{c \in C} \sum_{i \in F} \lambda_i f_i(\mathbf{y}_c, \mathbf{x}) \Big\}, \qquad \Psi_c(\mathbf{y}_c, \mathbf{x}) = \sum_{i \in F} \lambda_i f_i(\mathbf{y}_c, \mathbf{x})$
• Back to the chain CRF: the cliques can be identified as pairs of adjacent $Y$s, so $Q(\mathbf{y}, \mathbf{x}) = \sum_t \Psi_t(y_{t-1}, y_t, \mathbf{x})$
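A minimal sketch of this parameterization (hypothetical cliques, features, and weights), normalizing by brute force on a three-node chain:

```python
import itertools
import numpy as np

CLIQUES = [(0, 1), (1, 2)]                # adjacent pairs of Y indices
WEIGHTS = np.array([1.5, 0.8])            # lambda_i, one per feature

def features(y_c, x, c):                  # f_i(y_c, x) for clique c = (s, t)
    s, t = c
    return np.array([
        float(y_c[0] == y_c[1]),          # the clique's labels agree
        float(y_c[1] == x[t]),            # right label matches its input
    ])

def Q(y, x):                              # Q(y, x) = sum_c Psi_c(y_c, x)
    return sum(WEIGHTS @ features((y[s], y[t]), x, (s, t))
               for s, t in CLIQUES)

def p(y, x):                              # p(y | x) = exp(Q) / Z(x)
    Z = sum(np.exp(Q(yy, x))
            for yy in itertools.product([0, 1], repeat=len(y)))
    return np.exp(Q(y, x)) / Z

print(p((1, 1, 1), x=(1, 1, 1)))          # high: agreement and input matches
```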
Part 3: Inference using CRF
• General CRF
• Chain CRF
General CRF
• Given the observations and the parameters, we want to find the best state sequence:
  $\mathbf{y}^* = \arg\max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x})$
• For the general CRF:
  $\mathbf{y}^* = \arg\max_{\mathbf{y}} Q(\mathbf{y}, \mathbf{x}) = \arg\max_{\mathbf{y}} \sum_{c \in C} \Psi_c(\mathbf{y}_c, \mathbf{x})$
  (the partition function $Z(\mathbf{x})$ does not depend on $\mathbf{y}$, so it drops out of the maximization)
• But exact inference in general CRFs is intractable...
• Approximate methods: MCMC, belief propagation
Inference in HMM
• Dynamic Programming:
• Forward
• Backward
• Viterbi
[Figure: trellis diagram with $K$ states per position over the observation sequence $x_1, x_2, x_3, \ldots$]
Chain CRFs
• Inference in the chain CRF can be done using dynamic programming
• Define, for each position $t$, a matrix of size $|\mathcal{Y}| \times |\mathcal{Y}|$, where $\mathcal{Y}$ is the finite label alphabet:
  $M_t(y', y \mid \mathbf{x}) = \exp\Big\{ \sum_k \lambda_k f_k(y', y, \mathbf{x}, t) \Big\}$
Chain CRFs
• By defining the following forward and backward parameters,
  $\alpha_t(y \mid \mathbf{x}) = \sum_{y'} \alpha_{t-1}(y' \mid \mathbf{x})\, M_t(y', y \mid \mathbf{x}), \qquad \beta_t(y \mid \mathbf{x}) = \sum_{y'} M_{t+1}(y, y' \mid \mathbf{x})\, \beta_{t+1}(y' \mid \mathbf{x})$
  the partition function is $Z(\mathbf{x}) = \sum_{y} \alpha_T(y \mid \mathbf{x})$
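A minimal sketch of these recursions, assuming the matrices $M_t$ have already been built from features and weights (a real implementation would work in log space to avoid underflow):

```python
import numpy as np

def forward_backward(psi0, M):
    # psi0[y]: unnormalized score of the first label (hypothetical name);
    # M[t][y_prev, y]: transition matrices M_t for t = 1..T-1.
    T, K = len(M) + 1, len(psi0)
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = psi0
    for t in range(1, T):
        alpha[t] = alpha[t - 1] @ M[t - 1]   # sum over the previous label
    for t in range(T - 2, -1, -1):
        beta[t] = M[t] @ beta[t + 1]         # sum over the next label
    Z = alpha[-1].sum()                      # partition function Z(x)
    return alpha * beta / Z, Z               # marginals p(y_t = y | x), Z

# Usage with hypothetical scores: 3 positions, 2 labels.
psi0 = np.array([1.0, 2.0])
M = [np.array([[2.0, 0.5], [0.5, 2.0]])] * 2
marginals, Z = forward_backward(psi0, M)
print(marginals.sum(axis=1))                 # each row sums to 1
```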
Chain CRFs
• The inference of the linear-chain CRF is very similar to that of the HMM
• We can write the marginal distribution:
  $p(y_t = y \mid \mathbf{x}) = \frac{\alpha_t(y \mid \mathbf{x})\, \beta_t(y \mid \mathbf{x})}{Z(\mathbf{x})}$
• Solve the chain CRF using dynamic programming (similar to Viterbi)!
  • 1. First compute $\alpha$ for all $t$ (forward), then compute $\beta$ for all $t$ (backward).
  • 2. Return the marginal distributions computed.
  • 3. Run Viterbi to find the optimal sequence (a decoding sketch follows this list).
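A minimal sketch of Viterbi decoding in the same representation as the forward-backward sketch above, replacing the sums with maximizations and keeping backpointers:

```python
import numpy as np

def viterbi(psi0, M):
    # psi0, M as in the forward-backward sketch (hypothetical names).
    T, K = len(M) + 1, len(psi0)
    delta = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    delta[0] = np.log(psi0)
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(M[t - 1])
        back[t] = scores.argmax(axis=0)   # best previous label per label
        delta[t] = scores.max(axis=0)
    y = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):         # follow the backpointers
        y.append(int(back[t][y[-1]]))
    return y[::-1]

psi0 = np.array([1.0, 2.0])
M = [np.array([[2.0, 0.5], [0.5, 2.0]])] * 2
print(viterbi(psi0, M))                   # [1, 1, 1] for these scores
```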
Part 4: Training CRF
• General CRF
• Introduction to approximate algorithms
Parameter learning
• Given the training data $\mathcal{D} = \{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})\}_{i=1}^{N}$, we wish to learn the parameters $\lambda$ of the model
• Chain- or tree-structured CRFs can be trained by maximum likelihood
• The objective function for the chain CRF is convex (see Lafferty et al., 2001)
• General CRFs are intractable, hence approximate solutions are necessary
Parameter learning
• Conditional log-likelihood for a general CRF:
  $\mathcal{L}(\lambda) = \sum_{i=1}^{N} \log p(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}; \lambda) = \sum_{i=1}^{N} \Big[ Q(\mathbf{y}^{(i)}, \mathbf{x}^{(i)}) - \log Z(\mathbf{x}^{(i)}) \Big]$
• It is almost never possible to determine the maximizing parameter values analytically: setting the gradient to zero and solving for $\lambda$ does not yield a closed-form solution.
Parameter learning
• This can be done using gradient ascent (equivalently, gradient descent on the negative log-likelihood):
  $\max_{\lambda} \mathcal{L}(\lambda; \mathbf{y} \mid \mathbf{x}) = \max_{\lambda} \sum_{i=1}^{N} \log p(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}; \lambda)$
  $\lambda^{(t+1)} = \lambda^{(t)} + \alpha\, \nabla \mathcal{L}(\lambda^{(t)}; \mathbf{y} \mid \mathbf{x})$
• Until we reach convergence:
  $\big| \mathcal{L}(\lambda^{(t+1)}; \mathbf{y} \mid \mathbf{x}) - \mathcal{L}(\lambda^{(t)}; \mathbf{y} \mid \mathbf{x}) \big| < \epsilon$
Parameter learning
• The gradient has the classic form "observed minus expected feature counts":
  $\frac{\partial \mathcal{L}}{\partial \lambda_k} = \sum_{i=1}^{N} f_k(\mathbf{y}^{(i)}, \mathbf{x}^{(i)}) - \sum_{i=1}^{N} \sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}^{(i)}; \lambda)\, f_k(\mathbf{y}, \mathbf{x}^{(i)})$
• Evaluating the expectation requires inference for every training example, so training inherits the cost of inference
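Putting the pieces together, a minimal sketch of gradient-ascent training (hypothetical features; expectations computed by brute force, so only usable on tiny examples):

```python
import itertools
import numpy as np

LABELS = (0, 1)

def feats(y, x):
    # Hypothetical global feature vector: transition-agreement count and
    # label-matches-observation count over the whole sequence.
    v = np.zeros(2)
    for t in range(len(x)):
        if t > 0 and y[t] == y[t - 1]:
            v[0] += 1.0
        if y[t] == x[t]:
            v[1] += 1.0
    return v

def grad_and_ll(data, lam):
    # Gradient = observed - expected feature counts (formula above).
    g, ll = np.zeros_like(lam), 0.0
    for x, y in data:
        ys = list(itertools.product(LABELS, repeat=len(x)))
        logs = np.array([lam @ feats(yy, x) for yy in ys])
        logZ = np.logaddexp.reduce(logs)
        probs = np.exp(logs - logZ)
        expected = sum(p_ * feats(yy, x) for p_, yy in zip(probs, ys))
        g += feats(y, x) - expected
        ll += lam @ feats(y, x) - logZ
    return g, ll

data = [((0, 0, 1), (0, 0, 1)), ((1, 1, 0), (1, 1, 0))]
lam = np.zeros(2)
for _ in range(100):                      # plain gradient ascent
    g, ll = grad_and_ll(data, lam)
    lam += 0.1 * g
print(lam, ll)                            # weights grow to fit the data
```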
Training (and Inference): General Case
• Approximate solutions, to get faster inference
• Treat inference as a shortest-path problem in a network consisting of paths (with costs)
  • Max-flow/min-cut (Ford-Fulkerson, 1956)
• Pseudo-likelihood approximation (a sketch follows this list):
  • Convert the CRF into separate patches; each consists of a hidden node and the true values of its neighbors; run maximum likelihood on the separate patches
  • Efficient, but may over-estimate inter-dependencies
• Belief propagation:
  • a variational inference algorithm
  • a direct generalization of the exact inference algorithms for linear-chain CRFs
• Sampling-based methods (MCMC)
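A minimal sketch of the pseudo-likelihood idea for a chain with binary labels (hypothetical tied weights): each term normalizes over a single label given the true neighbors, so $Z(\mathbf{x})$ is never needed:

```python
import numpy as np

W_PAIR, W_OBS = 1.5, 0.8                  # hypothetical tied weights

def local_score(y_prev, y_t, y_next, x, t):
    # Sum of all factors that touch position t, with neighbors held fixed.
    s = W_OBS * float(y_t == x[t])
    if y_prev is not None:
        s += W_PAIR * float(y_t == y_prev)
    if y_next is not None:
        s += W_PAIR * float(y_t == y_next)
    return s

def pseudo_log_likelihood(y, x):
    # sum_t log p(y_t | true neighbors, x): each term needs a
    # normalization over one label only, never the full Z(x).
    total = 0.0
    for t in range(len(y)):
        y_prev = y[t - 1] if t > 0 else None
        y_next = y[t + 1] if t + 1 < len(y) else None
        logits = np.array([local_score(y_prev, k, y_next, x, t)
                           for k in (0, 1)])
        total += logits[y[t]] - np.logaddexp.reduce(logits)
    return total

print(pseudo_log_likelihood(y=(0, 0, 1), x=(0, 0, 1)))
```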
Part 5: Applications of CRF
Part-of-Speech Tagging
• POS (part-of-speech) tagging: the identification of words as nouns, verbs, adjectives, adverbs, etc.
Part-of-Speech Tagging
• Each word is to be labeled with one of 45 syntactic tags
• 50%-50% train-test split
• HMMs, MEMMs, and CRFs compared on Penn Treebank POS tagging
• oov = out-of-vocabulary (not observed in the training set)
Part-of-Speech Tagging
• But...
• Add a small set of orthographic features: whether a spelling begins with a number or an upper-case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies (a feature-extraction sketch follows)
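A minimal sketch of these orthographic features (the exact feature set used in the experiments may differ in detail):

```python
# Orthographic features as described above, returned as a feature dict.
SUFFIXES = ("ing", "ogy", "ed", "s", "ly", "ion", "tion", "ity", "ies")

def orthographic_features(word):
    feats = {
        "starts_with_digit": word[0].isdigit(),
        "starts_with_upper": word[0].isupper(),
        "contains_hyphen": "-" in word,
    }
    for suf in SUFFIXES:
        feats["suffix_" + suf] = word.endswith(suf)
    return feats

print(orthographic_features("Pre-training"))
```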
Is HMM (Generative) or CRF (Discriminative) Better?
• If your application gives you good structural information that can be modeled directly by dependent distributions, and learned tractably, go the generative way!
• Example: higher-order emissions from individual states
Other Applications
• Applications in computational biology:
  • DNA and protein sequence alignment
  • Sequence homolog searching in databases
  • Protein secondary structure prediction
  • RNA secondary structure analysis
• Applications in computational linguistics & computer science:
  • Text and speech processing, including topic segmentation and part-of-speech (POS) tagging
  • Information extraction
  • Syntactic disambiguation
References
• J. Lafferty, A. McCallum, and F. Pereira. "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data." In Proc. ICML, 2001.
• C. Sutton and A. McCallum. "An Introduction to Conditional Random Fields for Relational Learning." MIT Press, 2006.
• D. Khashabi. "Conditional Random Fields and Beyond." UIUC CS 546, 2013.
• Z. Liu. "Probabilistic Models of Time Series and Sequences." 2014.
• C. M. Bishop. "Pattern Recognition and Machine Learning." Vol. 1. New York: Springer, 2006.
• Peter. "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data." 2012.
Thank you!