Conditional Random Fields
University of Pittsburgh, Computer Science Department
Zhipeng Luo
Oct. 2014
Outline
• 1. Introduction
• 2. CRF Modeling
• 3. Inference using CRF
• 4. Training CRF
• 5. Applications of CRF
Part 1: Introduction
Problem Description
• Given X (observations), find Y (predictions)
• For example,
  X = {temperature, moisture, pressure, ...}
  Y = {Sunny, Rainy, Stormy, ...}
Part 2: CRF Modeling
• Related Models
• Discriminative vs. Generative
• Chain CRF
• General CRF
Related Models
• Markov Random Fields
• Bayesian Network
• Factor Graph
• Sequence Models
Undirected Graph Model: MRF
• On an undirected graph, the joint distribution of the variables factorizes over the cliques $C$ of the graph:
  $p(\mathbf{y}) = \frac{1}{Z} \prod_{C} \psi_C(\mathbf{y}_C), \qquad Z = \sum_{\mathbf{y}} \prod_{C} \psi_C(\mathbf{y}_C)$
• Potential functions: $\psi_C(\mathbf{y}_C) \ge 0$, defined on the cliques
• Partition function: the normalizer $Z$
• Energy functions: $\psi_C(\mathbf{y}_C) = \exp\{-E_C(\mathbf{y}_C)\}$
• Markov property (next slide)
• A generative model
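To make the factorization concrete, here is a minimal sketch (hypothetical pairwise potentials, not from the slides) that evaluates $p(\mathbf{y})$ and $Z$ by brute force on a three-node chain:

```python
import itertools
import numpy as np

# A toy MRF on a 3-node chain y1 - y2 - y3 with binary states.
# The pairwise potentials (hypothetical values) favor agreeing neighbors.
psi_12 = np.array([[2.0, 0.5], [0.5, 2.0]])   # psi(y1, y2)
psi_23 = np.array([[2.0, 0.5], [0.5, 2.0]])   # psi(y2, y3)

def unnormalized(y):
    y1, y2, y3 = y
    return psi_12[y1, y2] * psi_23[y2, y3]

# Partition function Z: sum the product of potentials over all 2^3 states.
Z = sum(unnormalized(y) for y in itertools.product([0, 1], repeat=3))

def p(y):
    return unnormalized(y) / Z

print(p((0, 0, 0)), p((0, 1, 0)))   # agreeing chains are more probable
```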
Independence in MRF
• Markov property: a variable $y_s$ is conditionally independent of all other variables given its neighbors $N(s)$:
  $p(y_s \mid \mathbf{y}_{V \setminus s}) = p(y_s \mid \mathbf{y}_{N(s)})$
Directed Graph: Bayesian Network
• Local conditional distributions:
  $p(\mathbf{y}) = \prod_{s} p(y_s \mid y_{\pi(s)})$
  where $\pi(s)$ indexes the parents of node $s$
• Naïve Bayes: once the class label is known, all the features are independent
• Again, a generative model
Factor Graph
• An explicit way to represent the factors in graphs
• Undirected graph
• Directed graph
Sequence Prediction
• Named entity recognition (NER) and part-of-speech (POS) tagging problems
• Set of observations: $\mathbf{x} = (x_1, \dots, x_T)$
• Set of underlying states: $\mathbf{y} = (y_1, \dots, y_T)$
• HMM is generative:
  $p(\mathbf{x}, \mathbf{y}) = \prod_{t} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$
  with transition probability $p(y_t \mid y_{t-1})$ and observation probability $p(x_t \mid y_t)$
• Basic independence assumptions:
  • Each observation depends only on its corresponding state;
  • The current state depends only on its previous state.
• Strong assumptions!
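For concreteness, a minimal sketch of the generative factorization above, with hypothetical HMM parameters:

```python
import numpy as np

# Hypothetical HMM parameters over 2 states and 2 observation symbols.
pi = np.array([0.6, 0.4])                 # p(y_1)
A = np.array([[0.7, 0.3], [0.4, 0.6]])    # A[i, j] = p(y_t = j | y_{t-1} = i)
B = np.array([[0.9, 0.1], [0.2, 0.8]])    # B[i, o] = p(x_t = o | y_t = i)

def hmm_joint(x, y):
    # p(x, y) = p(y_1) p(x_1 | y_1) * prod_t p(y_t | y_{t-1}) p(x_t | y_t)
    p = pi[y[0]] * B[y[0], x[0]]
    for t in range(1, len(x)):
        p *= A[y[t - 1], y[t]] * B[y[t], x[t]]
    return p

print(hmm_joint(x=(0, 1, 1), y=(0, 0, 1)))
```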
Discriminative vs. Generative
• Generative: describes how a label vector y can probabilistically "generate" a feature vector x.
• Discriminative: describes how to take a feature vector x and assign it a label y.
• Naïve Bayes: models $p(y, \mathbf{x})$
• MaxEnt classifier: models $p(y \mid \mathbf{x})$
Discriminative vs. Generative
• Limitations of generative models:
  • Modeling the joint distribution can lead to difficulties: features may have complex dependencies
  • Models often make strong independence assumptions
• Discriminative models:
  • No independence assumptions are made for X
  • We care about conditional independences among Y, and how Y can depend on X
  • Best suited to rich, overlapping features
Chain CRFs
• A CRF is simply a conditional distribution $p(\mathbf{y} \mid \mathbf{x})$ with an associated graphical structure
• The variables $\mathbf{X}$ are constants with respect to $\mathbf{Y}$: they are observed, never modeled
• Convert HMM to chain CRF:
Convert HMM to Chain CRF:
• Step 1: rewrite the HMM joint distribution
  $p(\mathbf{y}, \mathbf{x}) = \prod_{t} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$
• As:
  $p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z} \prod_{t} \exp\Big\{ \sum_{i,j \in S} \theta_{ij}\, \mathbf{1}\{y_t = i\}\, \mathbf{1}\{y_{t-1} = j\} + \sum_{i \in S} \sum_{o \in O} \mu_{oi}\, \mathbf{1}\{y_t = i\}\, \mathbf{1}\{x_t = o\} \Big\}$
• Where: $\theta_{ij} = \log p(y' = i \mid y = j)$, $\mu_{oi} = \log p(x = o \mid y = i)$, and $Z = 1$
Convert HMM to Chain CRF:
• Step 2: introduce feature functions, one per transition $(i, j)$ and one per state-observation pair $(i, o)$:
  $f_{ij}(y, y', x) = \mathbf{1}\{y = i\}\, \mathbf{1}\{y' = j\}, \qquad f_{io}(y, y', x) = \mathbf{1}\{y = i\}\, \mathbf{1}\{x = o\}$
• Then, with one weight $\lambda_k$ per feature $f_k$:
  $p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z} \prod_{t} \exp\Big\{ \sum_{k} \lambda_k f_k(y_t, y_{t-1}, x_t) \Big\}$
Convert HMM to Chain CRF:
• Step 3: the rewritten joint is now a product of local factors, one per position:
  $\Psi_t(y_t, y_{t-1}, x_t) = \exp\Big\{ \sum_k \lambda_k f_k(y_t, y_{t-1}, x_t) \Big\}, \qquad p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z} \prod_t \Psi_t(y_t, y_{t-1}, x_t)$
Convert HMM to Chain CRF:
• Step 4: the conditional distribution follows by normalizing over label sequences:
  $p(\mathbf{y} \mid \mathbf{x}) = \frac{p(\mathbf{y}, \mathbf{x})}{\sum_{\mathbf{y}'} p(\mathbf{y}', \mathbf{x})} = \frac{\prod_t \exp\{\sum_k \lambda_k f_k(y_t, y_{t-1}, x_t)\}}{\sum_{\mathbf{y}'} \prod_t \exp\{\sum_k \lambda_k f_k(y'_t, y'_{t-1}, x_t)\}}$
• This is exactly a linear-chain CRF: the HMM defines one, with a particular choice of features and weights
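The conversion can be sanity-checked numerically. In this minimal sketch (reusing the hypothetical HMM parameters from the earlier sketch), setting the CRF weights to the HMM's log-probabilities makes the normalized CRF score reproduce the HMM posterior exactly:

```python
import itertools
import numpy as np

pi = np.array([0.6, 0.4])                 # hypothetical HMM parameters
A = np.array([[0.7, 0.3], [0.4, 0.6]])    # p(y_t | y_{t-1})
B = np.array([[0.9, 0.1], [0.2, 0.8]])    # p(x_t | y_t)

def hmm_joint(x, y):
    p = pi[y[0]] * B[y[0], x[0]]
    for t in range(1, len(x)):
        p *= A[y[t - 1], y[t]] * B[y[t], x[t]]
    return p

def crf_score(y, x):
    # exp of a linear score whose weights are the HMM log-probabilities
    s = np.log(pi[y[0]]) + np.log(B[y[0], x[0]])
    for t in range(1, len(x)):
        s += np.log(A[y[t - 1], y[t]]) + np.log(B[y[t], x[t]])
    return np.exp(s)

x = (0, 1, 1)
ys = list(itertools.product([0, 1], repeat=len(x)))
Z = sum(crf_score(y, x) for y in ys)      # CRF normalizer Z(x)
px = sum(hmm_joint(x, y) for y in ys)     # HMM evidence p(x)
for y in ys:                              # conditional distributions match
    assert abs(crf_score(y, x) / Z - hmm_joint(x, y) / px) < 1e-12
print("HMM posterior == chain-CRF conditional")
```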
Chain CRFs
• We can change it so that each state depends on more observations:
• Or on inputs at previous steps:
• Or on all inputs (see the feature sketch below):
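As a minimal sketch, these variations amount to feature functions that read different parts of $\mathbf{x}$ (all features here are hypothetical); inference is unchanged because the features still couple only adjacent labels:

```python
# Features for a chain CRF over binary labels/inputs (hypothetical examples).
# Each has the signature f(y_prev, y_t, x, t); only the label arguments
# matter for inference, so x can be read as widely as we like.

def f_current(y_prev, y_t, x, t):
    return float(y_t == 1 and x[t] == 1)                 # current input only

def f_previous(y_prev, y_t, x, t):
    return float(y_t == 1 and t > 0 and x[t - 1] == 1)   # input one step back

def f_all(y_prev, y_t, x, t):
    return float(y_t == 1 and sum(x) > len(x) / 2)       # global property of x
```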
General CRF
• If $G = (V, E)$ is a graph, and $\mathbf{Y} = (Y_v)_{v \in V}$;
• $Y_w$ and $Y_v$ are neighbors if $(w, v) \in E$;
• $(\mathbf{X}, \mathbf{Y})$ is a CRF, if, when conditioned on $\mathbf{X}$, the variables $Y_v$ obey the Markov property with respect to the graph:
  $p(Y_v \mid \mathbf{X}, Y_w, w \neq v) = p(Y_v \mid \mathbf{X}, Y_w, w \sim v)$
• Example: the linear-chain CRF of the previous slides
General CRFs: Visualization
• According to the definition of a CRF, the random variables $\mathbf{Y}$ still obey the Markov property with respect to the graph.
[Figure: the CRF consists of an MRF over the variables Y together with the fixed, observable variables X, which are not part of the MRF]
General CRFs: Visualization
• Divide the MRF into cliques. The parameters inside each clique template are tied: the potential functions $\Psi_c(\mathbf{y}_c, \mathbf{x})$ for the template
• The cliques contain only unobservables ($\mathbf{y}$); $\mathbf{x}$ is an argument to $\Psi_c$
• The probability $p_M(\mathbf{y} \mid \mathbf{x})$ is a joint distribution over the unobservables $\mathbf{Y}$:
  $p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} e^{Q(\mathbf{y}, \mathbf{x})}, \qquad Z(\mathbf{x}) = \sum_{\mathbf{y}} e^{Q(\mathbf{y}, \mathbf{x})}, \qquad Q(\mathbf{y}, \mathbf{x}) = \sum_{c \in C} \Psi_c(\mathbf{y}_c, \mathbf{x})$
[Figure: the cliques include only the unobservables Y; the observables X are not included in the cliques]
General CRFs: Visualization
• $\Psi_c$ is typically decomposed into a weighted sum of feature functions $f_i$, producing:
  $p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Big\{ \sum_{c \in C} \sum_{i \in F} \lambda_i f_i(\mathbf{y}_c, \mathbf{x}) \Big\}, \qquad \Psi_c(\mathbf{y}_c, \mathbf{x}) = \sum_{i \in F} \lambda_i f_i(\mathbf{y}_c, \mathbf{x})$
• Back to the chain CRF: the cliques can be identified as pairs of adjacent $Y$s, so $Q(\mathbf{y}, \mathbf{x}) = \sum_t \Psi_t(y_{t-1}, y_t, \mathbf{x})$
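A minimal sketch of this parameterization (hypothetical cliques, features, and weights), normalizing by brute force on a three-node chain:

```python
import itertools
import numpy as np

CLIQUES = [(0, 1), (1, 2)]                # adjacent pairs of Y indices
WEIGHTS = np.array([1.5, 0.8])            # lambda_i, one per feature

def features(y_c, x, c):                  # f_i(y_c, x) for clique c = (s, t)
    s, t = c
    return np.array([
        float(y_c[0] == y_c[1]),          # the clique's labels agree
        float(y_c[1] == x[t]),            # right label matches its input
    ])

def Q(y, x):                              # Q(y, x) = sum_c Psi_c(y_c, x)
    return sum(WEIGHTS @ features((y[s], y[t]), x, (s, t))
               for s, t in CLIQUES)

def p(y, x):                              # p(y | x) = exp(Q) / Z(x)
    Z = sum(np.exp(Q(yy, x))
            for yy in itertools.product([0, 1], repeat=len(y)))
    return np.exp(Q(y, x)) / Z

print(p((1, 1, 1), x=(1, 1, 1)))          # high: agreement and input matches
```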
Part 3: Inference using CRF
• General CRF
• Chain CRF
General CRF
• Given the observations and the parameters, we want to find the best state sequence:
  $\mathbf{y}^* = \arg\max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x})$
• For the general CRF:
  $\mathbf{y}^* = \arg\max_{\mathbf{y}} Q(\mathbf{y}, \mathbf{x}) = \arg\max_{\mathbf{y}} \sum_{c \in C} \Psi_c(\mathbf{y}_c, \mathbf{x})$
  (the partition function $Z(\mathbf{x})$ does not depend on $\mathbf{y}$, so it drops out of the maximization)
• But exact inference in general CRFs is intractable...
• Approximate methods: MCMC, belief propagation
Inference in HMM
• Dynamic Programming:
• Forward
• Backward
• Viterbi
[Figure: trellis diagram with $K$ states per position over the observation sequence $x_1, x_2, x_3, \ldots$]
Chain CRFs
• Inference in the chain CRF can be done using dynamic programming
• Define, for each position $t$, a matrix of size $|\mathcal{Y}| \times |\mathcal{Y}|$, where $\mathcal{Y}$ is the finite label alphabet:
  $M_t(y', y \mid \mathbf{x}) = \exp\Big\{ \sum_k \lambda_k f_k(y', y, \mathbf{x}, t) \Big\}$
Chain CRFs
• By defining the following forward and backward parameters,
  $\alpha_t(y \mid \mathbf{x}) = \sum_{y'} \alpha_{t-1}(y' \mid \mathbf{x})\, M_t(y', y \mid \mathbf{x}), \qquad \beta_t(y \mid \mathbf{x}) = \sum_{y'} M_{t+1}(y, y' \mid \mathbf{x})\, \beta_{t+1}(y' \mid \mathbf{x})$
  the partition function is $Z(\mathbf{x}) = \sum_{y} \alpha_T(y \mid \mathbf{x})$
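A minimal sketch of these recursions, assuming the matrices $M_t$ have already been built from features and weights (a real implementation would work in log space to avoid underflow):

```python
import numpy as np

def forward_backward(psi0, M):
    # psi0[y]: unnormalized score of the first label (hypothetical name);
    # M[t][y_prev, y]: transition matrices M_t for t = 1..T-1.
    T, K = len(M) + 1, len(psi0)
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = psi0
    for t in range(1, T):
        alpha[t] = alpha[t - 1] @ M[t - 1]   # sum over the previous label
    for t in range(T - 2, -1, -1):
        beta[t] = M[t] @ beta[t + 1]         # sum over the next label
    Z = alpha[-1].sum()                      # partition function Z(x)
    return alpha * beta / Z, Z               # marginals p(y_t = y | x), Z

# Usage with hypothetical scores: 3 positions, 2 labels.
psi0 = np.array([1.0, 2.0])
M = [np.array([[2.0, 0.5], [0.5, 2.0]])] * 2
marginals, Z = forward_backward(psi0, M)
print(marginals.sum(axis=1))                 # each row sums to 1
```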
Chain CRFs
• The inference of the linear-chain CRF is very similar to that of the HMM
• We can write the marginal distribution:
  $p(y_t = y \mid \mathbf{x}) = \frac{\alpha_t(y \mid \mathbf{x})\, \beta_t(y \mid \mathbf{x})}{Z(\mathbf{x})}$
• Solve the chain CRF using dynamic programming (similar to Viterbi)!
  • 1. First compute $\alpha$ for all $t$ (forward), then compute $\beta$ for all $t$ (backward).
  • 2. Return the marginal distributions computed.
  • 3. Run Viterbi to find the optimal sequence (a decoding sketch follows this list).
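A minimal sketch of Viterbi decoding in the same representation as the forward-backward sketch above, replacing the sums with maximizations and keeping backpointers:

```python
import numpy as np

def viterbi(psi0, M):
    # psi0, M as in the forward-backward sketch (hypothetical names).
    T, K = len(M) + 1, len(psi0)
    delta = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    delta[0] = np.log(psi0)
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(M[t - 1])
        back[t] = scores.argmax(axis=0)   # best previous label per label
        delta[t] = scores.max(axis=0)
    y = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):         # follow the backpointers
        y.append(int(back[t][y[-1]]))
    return y[::-1]

psi0 = np.array([1.0, 2.0])
M = [np.array([[2.0, 0.5], [0.5, 2.0]])] * 2
print(viterbi(psi0, M))                   # [1, 1, 1] for these scores
```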
Part 4: Training CRF
• General CRF
• Introduction to approximate algorithms
Parameter learning
• Given the training data $\mathcal{D} = \{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})\}_{i=1}^{N}$, we wish to learn the parameters $\lambda$ of the model
• Chain- or tree-structured CRFs can be trained by maximum likelihood
• The objective function for the chain CRF is convex (see Lafferty et al., 2001)
• General CRFs are intractable, hence approximate solutions are necessary
Parameter learning
• Conditional log-likelihood for a general CRF:
  $\mathcal{L}(\lambda) = \sum_{i=1}^{N} \log p(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}; \lambda) = \sum_{i=1}^{N} \Big[ Q(\mathbf{y}^{(i)}, \mathbf{x}^{(i)}) - \log Z(\mathbf{x}^{(i)}) \Big]$
• It is almost never possible to determine the maximizing parameter values analytically: setting the gradient to zero and solving for $\lambda$ does not yield a closed-form solution.
Parameter learning
• This can be done using gradient ascent (equivalently, gradient descent on the negative log-likelihood):
  $\max_{\lambda} \mathcal{L}(\lambda; \mathbf{y} \mid \mathbf{x}) = \max_{\lambda} \sum_{i=1}^{N} \log p(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}; \lambda)$
  $\lambda^{(t+1)} = \lambda^{(t)} + \alpha\, \nabla \mathcal{L}(\lambda^{(t)}; \mathbf{y} \mid \mathbf{x})$
• Until we reach convergence:
  $\big| \mathcal{L}(\lambda^{(t+1)}; \mathbf{y} \mid \mathbf{x}) - \mathcal{L}(\lambda^{(t)}; \mathbf{y} \mid \mathbf{x}) \big| < \epsilon$
Parameter learning
• The gradient has the classic form "observed minus expected feature counts":
  $\frac{\partial \mathcal{L}}{\partial \lambda_k} = \sum_{i=1}^{N} f_k(\mathbf{y}^{(i)}, \mathbf{x}^{(i)}) - \sum_{i=1}^{N} \sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}^{(i)}; \lambda)\, f_k(\mathbf{y}, \mathbf{x}^{(i)})$
• Evaluating the expectation requires inference for every training example, so training inherits the cost of inference
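Putting the pieces together, a minimal sketch of gradient-ascent training (hypothetical features; expectations computed by brute force, so only usable on tiny examples):

```python
import itertools
import numpy as np

LABELS = (0, 1)

def feats(y, x):
    # Hypothetical global feature vector: transition-agreement count and
    # label-matches-observation count over the whole sequence.
    v = np.zeros(2)
    for t in range(len(x)):
        if t > 0 and y[t] == y[t - 1]:
            v[0] += 1.0
        if y[t] == x[t]:
            v[1] += 1.0
    return v

def grad_and_ll(data, lam):
    # Gradient = observed - expected feature counts (formula above).
    g, ll = np.zeros_like(lam), 0.0
    for x, y in data:
        ys = list(itertools.product(LABELS, repeat=len(x)))
        logs = np.array([lam @ feats(yy, x) for yy in ys])
        logZ = np.logaddexp.reduce(logs)
        probs = np.exp(logs - logZ)
        expected = sum(p_ * feats(yy, x) for p_, yy in zip(probs, ys))
        g += feats(y, x) - expected
        ll += lam @ feats(y, x) - logZ
    return g, ll

data = [((0, 0, 1), (0, 0, 1)), ((1, 1, 0), (1, 1, 0))]
lam = np.zeros(2)
for _ in range(100):                      # plain gradient ascent
    g, ll = grad_and_ll(data, lam)
    lam += 0.1 * g
print(lam, ll)                            # weights grow to fit the data
```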
Training (and Inference): General Case
• Approximate solutions, to get faster inference
• Treat inference as a shortest-path problem in a network consisting of paths (with costs)
  • Max-flow/min-cut (Ford-Fulkerson, 1956)
• Pseudo-likelihood approximation (a sketch follows this list):
  • Convert the CRF into separate patches; each consists of a hidden node and the true values of its neighbors; run maximum likelihood on the separate patches
  • Efficient, but may over-estimate inter-dependencies
• Belief propagation:
  • a variational inference algorithm
  • a direct generalization of the exact inference algorithms for linear-chain CRFs
• Sampling-based methods (MCMC)
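A minimal sketch of the pseudo-likelihood idea for a chain with binary labels (hypothetical tied weights): each term normalizes over a single label given the true neighbors, so $Z(\mathbf{x})$ is never needed:

```python
import numpy as np

W_PAIR, W_OBS = 1.5, 0.8                  # hypothetical tied weights

def local_score(y_prev, y_t, y_next, x, t):
    # Sum of all factors that touch position t, with neighbors held fixed.
    s = W_OBS * float(y_t == x[t])
    if y_prev is not None:
        s += W_PAIR * float(y_t == y_prev)
    if y_next is not None:
        s += W_PAIR * float(y_t == y_next)
    return s

def pseudo_log_likelihood(y, x):
    # sum_t log p(y_t | true neighbors, x): each term needs a
    # normalization over one label only, never the full Z(x).
    total = 0.0
    for t in range(len(y)):
        y_prev = y[t - 1] if t > 0 else None
        y_next = y[t + 1] if t + 1 < len(y) else None
        logits = np.array([local_score(y_prev, k, y_next, x, t)
                           for k in (0, 1)])
        total += logits[y[t]] - np.logaddexp.reduce(logits)
    return total

print(pseudo_log_likelihood(y=(0, 0, 1), x=(0, 0, 1)))
```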
Part 5: Applications of CRF
Part-of-Speech Tagging
• POS (part-of-speech) tagging: the identification of words as nouns, verbs, adjectives, adverbs, etc.
Part-of-Speech Tagging
• Each word is to be labeled with one of 45 syntactic tags
• 50%-50% train-test split
• HMMs, MEMMs, and CRFs compared on Penn Treebank POS tagging
• oov = out-of-vocabulary (not observed in the training set)
Part-of-Speech Tagging
• But...
• Add a small set of orthographic features: whether a spelling begins with a number or an upper-case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies (a feature-extraction sketch follows)
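A minimal sketch of these orthographic features (the exact feature set used in the experiments may differ in detail):

```python
# Orthographic features as described above, returned as a feature dict.
SUFFIXES = ("ing", "ogy", "ed", "s", "ly", "ion", "tion", "ity", "ies")

def orthographic_features(word):
    feats = {
        "starts_with_digit": word[0].isdigit(),
        "starts_with_upper": word[0].isupper(),
        "contains_hyphen": "-" in word,
    }
    for suf in SUFFIXES:
        feats["suffix_" + suf] = word.endswith(suf)
    return feats

print(orthographic_features("Pre-training"))
```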
Is HMM (Generative) or CRF (Discriminative) Better?
• If your application gives you good structural information that can be modeled directly by dependent distributions, and learned tractably, go the generative way!
• Example: higher-order emissions from individual states
Other Applications
• Applications in computational biology:
  • DNA and protein sequence alignment
  • Sequence homolog searching in databases
  • Protein secondary structure prediction
  • RNA secondary structure analysis
• Applications in computational linguistics & computer science:
  • Text and speech processing, including topic segmentation and part-of-speech (POS) tagging
  • Information extraction
  • Syntactic disambiguation
References
• J. Lafferty, A. McCallum, and F. Pereira. "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data." In Proc. ICML, 2001.
• C. Sutton and A. McCallum. "An Introduction to Conditional Random Fields for Relational Learning." MIT Press, 2006.
• D. Khashabi. "Conditional Random Fields and Beyond." UIUC CS 546, 2013.
• Z. Liu. "Probabilistic Models of Time Series and Sequences." 2014.
• C. M. Bishop. "Pattern Recognition and Machine Learning." Vol. 1. New York: Springer, 2006.
• Peter. "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data." 2012.
Thank you!