March 2009, EACL: Constrained Conditional Models for Natural Language Processing. Ming-Wei Chang, Lev Ratinov, Dan Roth. Department of Computer Science, University of Illinois at Urbana-Champaign.
Transcript

Page 1

March 2009

EACL

Constrained Conditional Models for

Natural Language Processing

Ming-Wei Chang, Lev Ratinov, Dan Roth
Department of Computer Science
University of Illinois at Urbana-Champaign


Page 2

Nice to Meet You


Informally: Everything that has to do with constraints (and learning models)

Formally: We typically make decisions based on models such as:

With CCMs we make decisions based on models such as:

We do not define the learning method but we’ll discuss it and make suggestions

CCMs make predictions in the presence/guided by constraints

Page 3

Constrained Conditional Models (CCMs)

The standard decision rule:  y* = argmax_y  w^T f(y, x)

The CCM decision rule:  y* = argmax_y  w^T f(y, x)  −  Σ_{c ∈ C} ρ_c d(y, 1_C(x))

(w^T f(y, x) is the learned model's score; ρ_c is the penalty for violating constraint c, and d(y, 1_C(x)) measures how far y is from satisfying it.)
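To make the decision rule concrete, here is a minimal sketch that scores every candidate assignment by w·f(y, x) minus a weighted constraint-violation distance and returns the argmax. The feature function, the constraint, and the weights below are made up for illustration; they are not from the tutorial.

from itertools import product

# Toy CCM decision rule: argmax_y  w . f(y, x)  -  rho * d(y, x)

def features(y, x):
    # One illustrative feature per position: does the label "match" the observation?
    return [1.0 if yi == xi else 0.0 for yi, xi in zip(y, x)]

def violation_distance(y, x):
    # Soft constraint: label 'B' should appear at most once; d = number of extra B's.
    return max(0, y.count('B') - 1)

def ccm_decode(x, w, rho, labels=('A', 'B')):
    best_y, best_score = None, float('-inf')
    for y in product(labels, repeat=len(x)):          # exhaustive search over assignments
        score = sum(wi * fi for wi, fi in zip(w, features(y, x)))
        score -= rho * violation_distance(y, x)       # penalize constraint violations
        if score > best_score:
            best_y, best_score = y, score
    return best_y, best_score

if __name__ == "__main__":
    x = ['B', 'B', 'A']                # observations
    w = [1.0, 1.0, 1.0]                # weights of the "local" model
    print(ccm_decode(x, w, rho=2.0))   # the penalty steers the output away from two B's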

Page 4

Constraints Driven Learning and Decision Making

Why constraints? The goal: building good NLP systems easily. We have prior knowledge at hand.

How can we use it? We suggest that knowledge can often be injected directly: we can use it to guide learning, to improve decision making, and to simplify the models we need to learn.

How useful are constraints? Useful for supervised learning; useful for semi-supervised learning; sometimes more efficient than labeling data directly.


Page 5

Inference


Page 6

Comprehension

1. Christopher Robin was born in England. 2. Winnie the Pooh is a title of a book. 3. Christopher Robin’s dad was a magician. 4. Christopher Robin must be at least 65 now.

A process that maintains and updates a collection of propositions about the state of affairs.

(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.

This is an Inference Problem


Page 7

This Tutorial: Constrained Conditional Models

Part 1: Introduction to CCMs
  Examples: NE + Relations; Information Extraction – correcting models with CCMs
  First summary: why CCMs are important
  Problem setting
  Features and constraints; some hints about training issues
Part 2: Introduction to Integer Linear Programming
  What is ILP; use of ILP in NLP
Part 3: Detailed examples of using CCMs
  Semantic Role Labeling in detail
  Coreference Resolution
  Sentence Compression

BREAK


Page 8

This Tutorial: Constrained Conditional Models (2nd part)

Part 4: More on Inference – other inference algorithms
  Search (SRL); Dynamic Programming (Transliteration); Cutting Planes
  Using hard and soft constraints
Part 5: Training issues when working with CCMs
  Formalism (again)
  Choices of training paradigms – tradeoffs
  Examples in supervised learning
  Examples in semi-supervised learning
Part 6: Conclusion
  Building CCMs: features and constraints; objective functions; different learners
  Mixed models vs. joint models; where is the knowledge coming from?

THE END


Page 9

Learning and Inference

Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome. E.g., structured output problems – multiple dependent output variables.

(Learned) models/classifiers for different sub-problems. In some cases, not all local models can be learned simultaneously; key examples in NLP are Textual Entailment and QA. In these cases, constraints may appear only at evaluation time.

Incorporate the models' information, along with prior knowledge/constraints, in making coherent decisions: decisions that respect the local models as well as domain- and context-specific knowledge/constraints.


Page 10

This Tutorial: Constrained Conditional Models

Part 1: Introduction to CCMs
  Examples: NE + Relations; Information Extraction – correcting models with CCMs
  First summary: why CCMs are important
  Problem setting
  Features and constraints; some hints about training issues
Part 2: Introduction to Integer Linear Programming
  What is ILP; use of ILP in NLP
Part 3: Detailed examples of using CCMs
  Semantic Role Labeling in detail
  Coreference Resolution
  Sentence Compression

BREAK


Page 11

This Tutorial: Constrained Conditional Models (2nd part)

Part 4: More on Inference – other inference algorithms
  Search (SRL); Dynamic Programming (Transliteration); Cutting Planes
  Using hard and soft constraints
Part 5: Training issues when working with CCMs
  Formalism (again)
  Choices of training paradigms – tradeoffs
  Examples in supervised learning
  Examples in semi-supervised learning
Part 6: Conclusion
  Building CCMs: features and constraints; objective functions; different learners
  Mixed models vs. joint models; where is the knowledge coming from?

THE END


Page 12

Inference with General Constraint Structure [Roth&Yih’04]

Recognizing Entities and Relations

Dole ’s wife, Elizabeth , is a native of N.C.

[Figure: entities E1 (Dole), E2 (Elizabeth), E3 (N.C.) and relations R12, R23, each with its local classifier's score distribution – entities over {other, per, loc} (e.g., 0.05 / 0.85 / 0.10) and relations over {irrelevant, spouse_of, born_in} (e.g., 0.10 / 0.05 / 0.85).]

Improvement over no inference: 2-5%

Some questions: How do we guide the global inference? Why not learn jointly?

Models could be learned separately; constraints may come up only at decision time.

Note: this is a non-sequential model.


Page 13

Tasks of Interest: Structured Output

For each instance, assign values to a set of variables

Output variables depend on each other

Common tasks in natural language processing:

Parsing; semantic parsing; summarization; transliteration; co-reference resolution, …

Information extraction: entities, relations, …

Many pure machine learning approaches exist: Hidden Markov Models (HMMs), CRFs, structured Perceptrons and SVMs, …

However, …


Page 14

Information Extraction via Hidden Markov Models

Lars Ole Andersen . Program analysis and specialization for the C Programming language. PhD thesis. DIKU , University of Copenhagen, May 1994 .

Prediction result of a trained HMM:

[Figure: the HMM scatters the fields AUTHOR, TITLE, EDITOR, BOOKTITLE, TECH-REPORT, INSTITUTION, and DATE incorrectly across "Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 ."]

Unsatisfactory results!

Motivation II


Page 15

Strategies for Improving the Results

(Pure) machine learning approaches: a higher-order HMM/CRF? Increasing the window size? Adding a lot of new features? All of these increase the model complexity and require a lot of labeled examples.

What if we only have a few labeled examples?

Any other options? Humans can immediately tell bad outputs: the output does not make sense.

Can we keep the learned model simple and still make expressive decisions?


Page 16

Information extraction without Prior Knowledge

Prediction result of a trained HMM:

[Figure: the same incorrect segmentation – the fields AUTHOR, TITLE, EDITOR, BOOKTITLE, TECH-REPORT, INSTITUTION, and DATE are scattered across the citation.]

Violates lots of natural constraints!

Lars Ole Andersen . Program analysis and specialization for the C Programming language. PhD thesis. DIKU , University of Copenhagen, May 1994 .


Page 17

Examples of Constraints

Each field must be a consecutive list of words and can appear at most once in a citation.

State transitions must occur on punctuation marks.

The citation can only start with AUTHOR or EDITOR.

The words pp. and pages correspond to PAGE. Four-digit numbers starting with 20 or 19 are a DATE. Quotations can appear only in a TITLE. … Easy to express pieces of "knowledge".

Non-propositional; may use quantifiers.


Page 18

Adding constraints, we get correct results! Without changing the model

[AUTHOR] Lars Ole Andersen .
[TITLE] Program analysis and specialization for the C Programming language .
[TECH-REPORT] PhD thesis .
[INSTITUTION] DIKU , University of Copenhagen ,
[DATE] May 1994 .

Information Extraction with Constraints


Page 19

Random Variables Y:

Conditional distributions P (learned by models/classifiers). Constraints C – any Boolean function defined over partial assignments (possibly with weights W).

Goal: find the "best" assignment – the assignment that achieves the highest global performance. This is an integer programming problem.

Problem Setting

[Figure: output variables y1 … y8 over the observations, with constraints such as C(y1, y4) and C(y2, y3, y6, y7, y8) linking subsets of them.]

Y* = argmax_Y P(Y) subject to constraints C (+ weights W_C)


Page 20

Formal Model

y* = argmax_y  w^T f(x, y)  −  Σ_k ρ_k d(y, 1_{C_k}(x))     subject to constraints

Weight vector for the "local" models: w – a collection of classifiers; log-linear models (HMM, CRF); or a combination.
(Soft) constraints component: ρ_k is the penalty for violating constraint C_k, and d(y, 1_{C_k}(x)) measures how far y is from a "legal" assignment.

How to solve?
This is an Integer Linear Program. Solving using ILP packages gives an exact solution; search techniques are also possible.

How to train?
How to decompose the global objective function? Should we incorporate constraints in the learning process?


Page 21

Features Versus Constraints

φ_i : X × Y → R;  C_i : X × Y → {0, 1};  d : X × Y → R. In principle, constraints and features can encode the same properties; in practice, they are very different.

Features: local, short-distance properties, to allow tractable inference. Propositional (grounded), e.g., true if "the" followed by a noun occurs in the sentence.

Constraints: global properties. Quantified, first-order logic expressions, e.g., true if all the y_i's in the sequence y are assigned different values.

Indeed, used differently


Page 22

Encoding Prior Knowledge

Consider encoding the knowledge that: Entities of type A and B cannot occur simultaneously in a sentence

The "feature" way: results in a higher-order HMM or CRF; may require designing a model tailored to the knowledge/constraints; a large number of new features might require more labeled data; wastes parameters to learn indirectly knowledge we already have. Needs more training data.

The constraints way: keeps the model simple and adds expressive constraints directly; a small set of constraints; allows for decision-time incorporation of constraints. A form of supervision.


Page 23

Constrained Conditional Models – 1st Summary

Everything that has to do with Constraints and Learning models

In both examples, we started with learned models – either for components of the problem (classifiers for relations and entities) or for the whole problem (citations). We then included constraints on the output

as a way to "correct" the output of the model. In both cases this allows us to learn simpler models than we would otherwise. As presented, global constraints did not take part in training; they were used only at the output.

We will later call this training paradigm L+I


Page 24

This Tutorial: Constrained Conditional Models

Part 1: Introduction to CCMs
  Examples: NE + Relations; Information Extraction – correcting models with CCMs
  First summary: why CCMs are important
  Problem setting
  Features and constraints; some hints about training issues
Part 2: Introduction to Integer Linear Programming
  What is ILP; use of ILP in NLP
Part 3: Detailed examples of using CCMs
  Semantic Role Labeling in detail
  Coreference Resolution
  Sentence Compression

BREAK


Page 25

This Tutorial: Constrained Conditional Models (2nd part)

Part 4: More on Inference – other inference algorithms
  Search (SRL); Dynamic Programming (Transliteration); Cutting Planes
  Using hard and soft constraints
Part 5: Training issues when working with CCMs
  Formalism (again)
  Choices of training paradigms – tradeoffs
  Examples in supervised learning
  Examples in semi-supervised learning
Part 6: Conclusion
  Building CCMs: features and constraints; objective functions; different learners
  Mixed models vs. joint models; where is the knowledge coming from?

THE END


Page 26

Formal Model

y* = argmax_y  w^T f(x, y)  −  Σ_k ρ_k d(y, 1_{C_k}(x))

Weight vector for the "local" models: w – a collection of classifiers; log-linear models (HMM, CRF); or a combination.
(Soft) constraints component: ρ_k is the penalty for violating constraint C_k, and d(y, 1_{C_k}(x)) measures how far y is from a "legal" assignment.

How to solve?
This is an Integer Linear Program. Solving using ILP packages gives an exact solution; search techniques are also possible.

How to train?
How to decompose the global objective function? Should we incorporate constraints in the learning process?


Inference with constraints.

We start with adding constraints to existing models.

It’s a good place to start, because conceptually all you do is to add constraints to what you were doing before and the performance improves.

Page 27


Constraints and Integer Linear Programming (ILP)

ILP is powerful (NP-complete). ILP is popular – inference for many models, such as Viterbi for CRFs, has already been implemented.

Powerful off-the-shelf solvers exist. All we need is to write down the objective function and the constraints; there is no need to write inference code.

Page 28


Linear Programming

Key contributors: Leonid Kantorovich, George B. Dantzig, John von Neumann.

Optimization technique with linear objective function and linear constraints.

Note that the word “Integer” is absent.

Page 29


Example (Thanks James Clarke)

Telfa Co. produces tables and chairs. Each table makes $8 profit, each chair makes $5 profit.

We want to maximize the profit.

Page 30


Telfa Co. produces tables and chairs. Each table makes $8 profit, each chair makes $5 profit. A table requires 1 hour of labor and 9 sq. feet of wood. A chair requires 1 hour of labor and 5 sq. feet of wood. We have only 6 hours of labor and 45 sq. feet of wood.

We want to maximize the profit: maximize 8·(tables) + 5·(chairs) subject to tables + chairs ≤ 6 (labor), 9·tables + 5·chairs ≤ 45 (wood), tables, chairs ≥ 0.

Example (Thanks James Clarke)

Page 31


Solving LP problems.

Page 32


Solving LP problems.

Page 33


Solving LP problems.

Page 34


Solving LP problems.

Page 35


Solving LP problems

Page 36


Solving LP Problems.

Page 37


Integer Linear Programming – integer solutions.

Page 38


Integer Linear Programming.

In NLP, we are dealing with discrete outputs, therefore we’re almost always interested in integer solutions.

ILP is NP-complete, but often efficient for large NLP problems.

In some cases, the solutions to the LP are integral (e.g., when the constraint matrix is totally unimodular). Next, we show an example of using ILP with constraints; there the matrix is not totally unimodular, but the LP still gives integral solutions.

Page 39


Page 40

Back to Example: Recognizing Entities and Relations

Dole ’s wife, Elizabeth , is a native of N.C.

[Figure: the entity/relation graph again – E1 (Dole), E2 (Elizabeth), E3 (N.C.), relations R12, R23 – with the local classifiers' score tables over {other, per, loc} and {irrelevant, spouse_of, born_in}.]


Page 41

Back to Example: Recognizing Entities and Relations

Dole ’s wife, Elizabeth , is a native of N.C.

[Figure: the same entity/relation graph and score tables.]

NLP with ILP – key issues: 1) write down the objective function; 2) write down the constraints as linear inequalities.


Page 42

Back to Example: Recognizing Entities and Relations

Dole ’s wife, Elizabeth , is a native of N.C.

[Figure: the same entity/relation graph and score tables.]


Page 43

Back to Example: cost function

[Figure: the same score tables, now used as costs.]

Decision variables – one Boolean variable for each possible assignment to the entities E1 (Dole), E2 (Elizabeth), E3 (N.C.) and to the relations R12, R21, R23, R32, R13, R31:

x_{E1 = per}, x_{E1 = loc}, …, x_{R12 = spouse_of}, x_{R12 = born_in}, … ∈ {0, 1}


Page 44

Back to Example: cost function

[Figure: the same score tables, providing the cost of each assignment to E1 (Dole), E2 (Elizabeth), E3 (N.C.) and R12, R21, R23, R32, R13, R31.]

Cost function:

c_{E1 = per} · x_{E1 = per} + c_{E1 = loc} · x_{E1 = loc} + … + c_{R12 = spouse_of} · x_{R12 = spouse_of} + …


Adding Constraints

Each entity is either a person, an organization, or a location: x_{E1 = per} + x_{E1 = loc} + x_{E1 = org} + … = 1

(R12 = spouse_of) ⇒ (E1 = person) ∧ (E2 = person):

x_{R12 = spouse_of} ≤ x_{E1 = per}

x_{R12 = spouse_of} ≤ x_{E2 = per}

We need more consistency constraints. Any Boolean constraint can be written as a set of linear inequalities, and an efficient algorithm for doing so exists [Rizzolo & Roth '07].

Page 45
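For concreteness, here is a minimal sketch of this slide's constraints written down with an off-the-shelf ILP package. The open-source PuLP wrapper is used here only as one possible interface (it is not the solver named in the tutorial), and the scores are illustrative stand-ins for the classifiers' outputs.

# A sketch of the entities-and-relations CCM as an ILP (PuLP; made-up scores).
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary, value

ent_score = {("E1", "per"): 0.85, ("E1", "loc"): 0.10, ("E1", "other"): 0.05,
             ("E2", "per"): 0.60, ("E2", "loc"): 0.30, ("E2", "other"): 0.10}
rel_score = {("R12", "spouse_of"): 0.45, ("R12", "born_in"): 0.50,
             ("R12", "irrelevant"): 0.05}

prob = LpProblem("entities_and_relations", LpMaximize)
x = {key: LpVariable("x_%s_%s" % key, cat=LpBinary)
     for key in list(ent_score) + list(rel_score)}

# Objective: total score of the chosen labels.
prob += lpSum(score * x[key] for key, score in {**ent_score, **rel_score}.items())

# Each entity and each relation takes exactly one label.
for node in ("E1", "E2", "R12"):
    prob += lpSum(x[key] for key in x if key[0] == node) == 1

# (R12 = spouse_of) => (E1 = per) and (E2 = per), written as linear inequalities.
prob += x[("R12", "spouse_of")] <= x[("E1", "per")]
prob += x[("R12", "spouse_of")] <= x[("E2", "per")]
# (R12 = born_in) => (E1 = per) and (E2 = loc).
prob += x[("R12", "born_in")] <= x[("E1", "per")]
prob += x[("R12", "born_in")] <= x[("E2", "loc")]

prob.solve()
print([key for key, var in x.items() if value(var) > 0.5])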


CCM for NE-Relations

We showed a CCM formulation for the NE-relations problem.

In this case, the cost of each variable was learned independently, using trained classifiers.

Other expressive problems can be formulated as Integer Linear Programs – for example, HMM/CRF inference and the Viterbi algorithm.

Page 46

c_{E1 = per} · x_{E1 = per} + c_{E1 = loc} · x_{E1 = loc} + … + c_{R12 = spouse_of} · x_{R12 = spouse_of} + …


TSP as an Integer Linear Program

Page 47

Dantzig et al. were the first to suggest a practical solution to the problem using ILP.


Page 48

Reduction from Traveling Salesman to ILP

[Figure: a directed graph with a binary variable x_ij and a cost c_ij for each edge (i, j).]

Every node has ONE outgoing edge.


Page 49

Reduction from Traveling Salesman to ILP

Every node has ONE incoming edge.

[Figure: the same graph.]


Page 50

Reduction from Traveling Salesman to ILP

The solutions are binary (integer).

[Figure: the same graph.]


Page 51

Reduction from Traveling Salesman to ILP

No proper subgraph contains a cycle (no subtours).

[Figure: the same graph.]
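A toy version of this reduction, on a hypothetical 4-node graph with made-up costs, again using PuLP; subtour elimination is written with explicit subset constraints, which is only practical for very small graphs.

# TSP as an ILP: one-in/one-out degree constraints plus subtour elimination.
from itertools import combinations
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, value

nodes = [0, 1, 2, 3]
cost = {(i, j): abs(i - j) + 1 for i in nodes for j in nodes if i != j}  # made-up costs

prob = LpProblem("tsp", LpMinimize)
x = {(i, j): LpVariable(f"x_{i}_{j}", cat=LpBinary) for (i, j) in cost}

prob += lpSum(cost[e] * x[e] for e in cost)                  # minimize tour cost
for i in nodes:
    prob += lpSum(x[(i, j)] for j in nodes if j != i) == 1   # one outgoing edge
    prob += lpSum(x[(j, i)] for j in nodes if j != i) == 1   # one incoming edge
# No proper subset of nodes may contain its own cycle (subtour elimination).
for size in range(2, len(nodes)):
    for subset in combinations(nodes, size):
        prob += lpSum(x[(i, j)] for i in subset for j in subset if i != j) <= size - 1

prob.solve()
print([e for e in x if value(x[e]) > 0.5])   # the edges of the optimal tour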


Viterbi as an Integer Linear Program

Page 52

[Figure: an HMM – hidden states y0, y1, y2, …, yN and observations x0, x1, x2, …, xN.]


Viterbi with ILP

Page 53

[Figure: the Viterbi lattice – a source node S connects to nodes y0 = 1, y0 = 2, …, y0 = M; the edge S → (y0 = 1) has weight −log{P(x0 | y0 = 1) P(y0 = 1)}.]


Viterbi with ILP

Page 54

[Figure: likewise, the edge S → (y0 = 2) has weight −log{P(x0 | y0 = 2) P(y0 = 2)}.]


Viterbi with ILP

Page 55

[Figure: and the edge S → (y0 = M) has weight −log{P(x0 | y0 = M) P(y0 = M)}.]


Viterbi with ILP

Page 56

[Figure: the next layer adds nodes y1 = 1, …, y1 = M; the edge (y0 = 1) → (y1 = 1) has weight −log{P(x1 | y1 = 1) P(y1 = 1 | y0 = 1)}.]


Viterbi with ILP

Page 57

[Figure: similarly, the edge (y0 = 1) → (y1 = 2) has weight −log{P(x1 | y1 = 2) P(y1 = 2 | y0 = 1)}.]


Viterbi with ILP

Page 58

[Figure: the full lattice – S connects to the y0 layer, each layer y_t connects to the next layer y_{t+1}, and the last layer yN connects to a sink node T.]


Viterbi with ILP

Page 59

[Figure: the same S-to-T lattice.]

Viterbi = shortest path (S → T), computed with dynamic programming. We saw TSP as an ILP; now we show shortest path as an ILP.
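The lattice construction in the figures can be sketched directly in code. The dynamic program below computes the shortest S-to-T path over edge weights −log{P(x_t | y_t) P(y_t | y_{t−1})}, which is exactly Viterbi; the HMM parameters are a made-up toy model.

import math

# Toy HMM (made-up parameters).
states = ["A", "B"]
start = {"A": 0.6, "B": 0.4}
trans = {("A", "A"): 0.7, ("A", "B"): 0.3, ("B", "A"): 0.4, ("B", "B"): 0.6}
emit = {("A", "x"): 0.8, ("A", "y"): 0.2, ("B", "x"): 0.3, ("B", "y"): 0.7}

def viterbi_shortest_path(obs):
    # Edge S -> (0, k) has weight -log{ P(x0 | y0=k) P(y0=k) };
    # edge (t-1, i) -> (t, j) has weight -log{ P(xt | yt=j) P(yt=j | y_{t-1}=i) }.
    dist = {k: -math.log(start[k] * emit[(k, obs[0])]) for k in states}
    back = [{}]
    for t in range(1, len(obs)):
        new_dist, back_t = {}, {}
        for j in states:
            cands = {i: dist[i] - math.log(trans[(i, j)] * emit[(j, obs[t])]) for i in states}
            best_i = min(cands, key=cands.get)
            new_dist[j], back_t[j] = cands[best_i], best_i
        dist, back = new_dist, back + [back_t]
    # Recover the shortest path (= the most probable state sequence).
    last = min(dist, key=dist.get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), dist[last]

print(viterbi_shortest_path(["x", "y", "y"]))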


Shortest path with ILP.

For each edge (u,v), xuv =1 if (u,v) is on the shortest path, 0 otherwise.

cuv = the weight of the edge.

Page 60


Shortest path with ILP.

For each edge (u,v), xuv =1 if (u,v) is on the shortest path, 0 otherwise.

cuv = the weight of the edge.

Page 61


Shortest path with ILP.

For each edge (u,v), xuv =1 if (u,v) is on the shortest path, 0 otherwise.

cuv = the weight of the edge.

Page 62

All nodes except S and T have an equal number of incoming and outgoing edges.


Shortest path with ILP.

For each edge (u,v), xuv =1 if (u,v) is on the shortest path, 0 otherwise.

cuv = the weight of the edge.

Page 63

All nodes except S and T have an equal number of incoming and outgoing edges.

The source node has one more outgoing edge than incoming edges.


Shortest path with ILP.

For each edge (u,v), xuv =1 if (u,v) is on the shortest path, 0 otherwise.

cuv = the weight of the edge.

Page 64

All nodes except S and T have an equal number of incoming and outgoing edges.

The source node has one more outgoing edge than incoming edges.

The target node has one more incoming edge than outgoing edges.


Shortest path with ILP.

For each edge (u,v), xuv =1 if (u,v) is on the shortest path, 0 otherwise.

cuv = the weight of the edge.

Page 65

Interestingly, in this formulation the solution to the plain LP problem is already integral. Thus we have a reasonably fast alternative to Viterbi via (integer) linear programming.
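A small sketch of this observation, using scipy.optimize.linprog on a hypothetical 5-edge graph: the flow-conservation equalities from the previous slides, bounds 0 ≤ x_uv ≤ 1, and the LP optimum comes out 0/1 without any explicit integrality constraint.

import numpy as np
from scipy.optimize import linprog

# Hypothetical graph: S = 0, T = 3, directed edges with weights.
edges = [(0, 1, 1.0), (0, 2, 4.0), (1, 2, 1.0), (1, 3, 5.0), (2, 3, 1.0)]
n_nodes, S, T = 4, 0, 3

c = np.array([w for (_, _, w) in edges])
A_eq = np.zeros((n_nodes, len(edges)))
for col, (u, v, _) in enumerate(edges):
    A_eq[u, col] += 1.0   # edge leaves u
    A_eq[v, col] -= 1.0   # edge enters v
b_eq = np.zeros(n_nodes)
b_eq[S], b_eq[T] = 1.0, -1.0   # one net outgoing edge at S, one net incoming at T

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * len(edges), method="highs")
print(res.x)     # already a 0/1 vector in this formulation
print(res.fun)   # cost of the shortest S -> T path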

Page 66

Who cares that Viterbi is an ILP?

Assume you want to learn an HMM/CRF model (e.g., extracting fields from citations, IE).

But you also want to add expressive constraints: no field can appear twice in a citation; the fields <journal> and <tech-report> are mutually exclusive; the field <author> must appear at least once.

Do: learn the HMM/CRF; convert its inference problem to an (integer) LP; modify the LP's constraint matrix to add the constraints.

A concrete example for adding constraints over CRF will be shown
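In the meantime, here is an illustrative sketch (not the concrete example referred to above) of the recipe: Viterbi for a made-up toy HMM written as an ILP with PuLP, plus one extra non-Markovian constraint ("DATE appears at most once") added as a single linear inequality.

# Viterbi as an ILP over node and edge indicator variables, plus one extra constraint.
import math
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary, value

states = ["AUTHOR", "TITLE", "DATE"]
start = {"AUTHOR": 0.6, "TITLE": 0.3, "DATE": 0.1}
trans = {(i, j): 1.0 / 3 for i in states for j in states}          # uniform, toy
emit = {(s, o): 0.5 if (s == "DATE") == o.isdigit() else 0.25       # toy emissions
        for s in states for o in ["lars", "1994", "analysis"]}
obs = ["lars", "1994", "analysis"]
T = len(obs)

prob = LpProblem("constrained_viterbi", LpMaximize)
y = {(t, s): LpVariable(f"y_{t}_{s}", cat=LpBinary) for t in range(T) for s in states}
z = {(t, i, j): LpVariable(f"z_{t}_{i}_{j}", cat=LpBinary)
     for t in range(1, T) for i in states for j in states}

# Objective: log P(y, x), written over the indicator variables.
prob += (lpSum(math.log(start[s] * emit[(s, obs[0])]) * y[(0, s)] for s in states)
         + lpSum(math.log(trans[(i, j)] * emit[(j, obs[t])]) * z[(t, i, j)]
                 for (t, i, j) in z))

for t in range(T):                                   # exactly one state per position
    prob += lpSum(y[(t, s)] for s in states) == 1
for t in range(1, T):                                # edges consistent with node labels
    for i in states:
        prob += lpSum(z[(t, i, j)] for j in states) == y[(t - 1, i)]
    for j in states:
        prob += lpSum(z[(t, i, j)] for i in states) == y[(t, j)]

# The extra, non-Markovian constraint: DATE can appear at most once.
prob += lpSum(y[(t, "DATE")] for t in range(T)) <= 1

prob.solve()
print([s for t in range(T) for s in states if value(y[(t, s)]) > 0.5])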


Page 67

This Tutorial: Constrained Conditional Models

Part 1: Introduction to CCMs
  Examples: NE + Relations; Information Extraction – correcting models with CCMs
  First summary: why CCMs are important
  Problem setting
  Features and constraints; some hints about training issues
Part 2: Introduction to Integer Linear Programming
  What is ILP; use of ILP in NLP
Part 3: Detailed examples of using CCMs
  Semantic Role Labeling in detail
  Coreference Resolution
  Sentence Compression

BREAK


Page 68

This Tutorial: Constrained Conditional Models (2nd part)

Part 4: More on Inference – other inference algorithms
  Search (SRL); Dynamic Programming (Transliteration); Cutting Planes
  Using hard and soft constraints
Part 5: Training issues when working with CCMs
  Formalism (again)
  Choices of training paradigms – tradeoffs
  Examples in supervised learning
  Examples in semi-supervised learning
Part 6: Conclusion
  Building CCMs: features and constraints; objective functions; different learners
  Mixed models vs. joint models; where is the knowledge coming from?

THE END


CCM Examples

Many works in NLP make use of constrained conditional models, implicitly or explicitly.

Next we describe three examples in detail.

Example 2 (Semantic Role Labeling): the use of inference with constraints to improve semantic parsing.

Example 3 (Co-reference): combining classifiers through the objective function outperforms a pipeline.

Example 4 (Sentence Compression): a simple language model with constraints outperforms complex models.

Page 69

Page 70

Example 2: Semantic Role Labeling

Demo: http://L2R.cs.uiuc.edu/~cogcomp

Approach: 1) reveals several relations; 2) produces a very good semantic parser, F1 ~90%; 3) easy and fast: ~7 sentences/sec (using Xpress-MP).

Top-ranked system in the CoNLL'05 shared task; the key difference is the inference.

Who did what to whom, when, where, why,…

Page 71

Simple sentence:

I left my pearls to my daughter in my will .

[I]A0 left [my pearls]A1 [to my daughter]A2 [in my will]AM-LOC .

A0: leaver; A1: things left; A2: benefactor; AM-LOC: location.

I left my pearls to my daughter in my will .


Page 72

SRL Dataset

PropBank [Palmer et al. '05]

Core arguments: A0-A5 and AA, with different semantics for each verb, specified in the PropBank frame files.

13 types of adjuncts, labeled AM-arg, where arg specifies the adjunct type.

In this problem, all the information is given, but conceptually, we could train different components on different resources.

Page 73

Algorithmic Approach

Identify argument candidates: pruning [Xue & Palmer, EMNLP'04]; argument identifier – binary classification (SNoW).

Classify argument candidates: argument classifier – multi-class classification (SNoW).

Inference: use the estimated probability distribution given by the argument classifier; use structural and linguistic constraints; infer the optimal global output.

[Figure: the pipeline illustrated on "I left my nice pearls to her" – bracketed candidate arguments are identified (a well-understood step), and inference is then carried out over the candidate arguments.]


Page 74

Learning

Both the argument identifier and the argument classifier are trained using SNoW: a sparse network of linear functions; a multiclass classifier that produces a probability distribution over output values.

Features (some examples): voice, phrase type, head word, parse-tree path from the predicate, chunk sequence, syntactic frame, …; conjunctions of features.


Page 75

Inference

The output of the argument classifier often violates some constraints, especially when the sentence is long.

Finding the best legitimate output is formalized as an optimization problem and solved via Integer Linear Programming (ILP).

Input: the scores from the argument classifier; structural and linguistic constraints.

ILP allows incorporating expressive (non-sequential) constraints on the variables (the argument types).


Page 76

Semantic Role Labeling (SRL)

I left my pearls to my daughter in my will .

[Figure: candidate arguments for the sentence, each with the argument classifier's score distribution over the possible labels.]


Page 77

Semantic Role Labeling (SRL)

I left my pearls to my daughter in my will .

[Figure: the same candidate arguments and score distributions.]


Page 78

Semantic Role Labeling (SRL)

I left my pearls to my daughter in my will .

[Figure: the same candidate arguments and score distributions.]

One inference problem for each verb predicate.


Page 79

Integer Linear Programming Inference

For each argument a_i, set up a Boolean variable a_{i,t} indicating whether a_i is classified as t.

The goal is to maximize  Σ_i Σ_t score(a_i = t) · a_{i,t}

subject to the (linear) constraints.

If score(a_i = t) = P(a_i = t), the objective is to find the assignment that maximizes the expected number of correct arguments and satisfies the constraints.


Page 80

Constraints

No duplicate argument classes:

  Σ_{a ∈ POTARG} x_{a = A0} ≤ 1

R-ARG (if there is an R-ARG phrase, there is an ARG phrase):

  ∀ a2 ∈ POTARG:  Σ_{a ∈ POTARG} x_{a = A0} ≥ x_{a2 = R-A0}

C-ARG (if there is a C-ARG phrase, there is an ARG phrase before it):

  ∀ a2 ∈ POTARG:  Σ_{a ∈ POTARG, a before a2} x_{a = A0} ≥ x_{a2 = C-A0}

Many other possible constraints: unique labels; no overlapping or embedding; relations between the number of arguments; order constraints; if the verb is of type A, no argument of type B.

Any Boolean rule can be encoded as a set of linear inequalities.

These are universally quantified rules; LBJ allows a developer to encode constraints in FOL, and these are compiled into linear inequalities automatically.

Joint inference can also be used to combine different SRL systems.
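An illustrative sketch (PuLP, made-up candidates and scores) of this inference problem: one indicator per (candidate, label) pair, exactly one label per candidate, no duplicate argument classes, and the R-A0 implication written as linear inequalities. The candidates and labels are hypothetical, not from the tutorial's data.

from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary, value

cands = ["a1", "a2", "a3"]                       # hypothetical argument candidates
labels = ["A0", "A1", "R-A0", "null"]
score = {("a1", "A0"): 0.6, ("a1", "A1"): 0.2, ("a1", "R-A0"): 0.1, ("a1", "null"): 0.1,
         ("a2", "A0"): 0.5, ("a2", "A1"): 0.4, ("a2", "R-A0"): 0.0, ("a2", "null"): 0.1,
         ("a3", "A0"): 0.1, ("a3", "A1"): 0.2, ("a3", "R-A0"): 0.5, ("a3", "null"): 0.2}

prob = LpProblem("srl_inference", LpMaximize)
x = {k: LpVariable("x_%s_%s" % k, cat=LpBinary) for k in score}
prob += lpSum(score[k] * x[k] for k in score)

for a in cands:                                   # each candidate gets exactly one label
    prob += lpSum(x[(a, t)] for t in labels) == 1
prob += lpSum(x[(a, "A0")] for a in cands) <= 1   # no duplicate A0
prob += lpSum(x[(a, "A1")] for a in cands) <= 1   # no duplicate A1
# If some candidate is labeled R-A0, some other candidate must be labeled A0.
for a2 in cands:
    prob += lpSum(x[(a, "A0")] for a in cands if a != a2) >= x[(a2, "R-A0")]

prob.solve()
print({a: t for (a, t) in x if value(x[(a, t)]) > 0.5})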


Page 81

Summary: Semantic Role Labeling

Demo: http://L2R.cs.uiuc.edu/~cogcomp

Who did what to whom, when, where, why,…


Example 3: co-ref (thanks Pascal Denis)

Page 82


Example 3: co-ref (thanks Pascal Denis)

Page 83

Traditional approach: train a classifier which predicts the probability that two named entities co-refer:

P_coref(Clinton, NPR) = 0.2 (order sensitive!)
P_coref(Lewinsky, her) = 0.6

Decision process: e.g., if P_coref(s1, s2) > 0.5, label the two entities as co-referent,

or, clustering


Example 3: co-ref (thanks Pascal Denis)

Page 84

Traditional approach: train a classifier which predicts the probability that two named entities co-refer:

P_coref(Clinton, NPR) = 0.2
P_coref(Lewinsky, her) = 0.6

Decision process: e.g., if P_coref(s1, s2) > 0.5, label the two entities as co-referent,

or, clustering

Evaluation: cluster all the entities based on the co-ref classifier's predictions; the evaluation is based on cluster overlap.


Example 3: co-ref (thanks Pascal Denis)

Page 85

Two types of entities: "base entities" and "anaphors" (pointers).

[Figure: mentions such as NPR, with anaphors pointing back to base entities.]


Example 3: co-ref (thanks Pascal Denis)

Page 86

Error analysis: 1) "base entities" that "point" to anaphors; 2) anaphors that don't "point" to anything.


Proposed solution.

Page 87

Pipelined approach: identify anaphors; forbid links of certain types.

[Figure: the mention graph, with the forbidden links removed.]


Proposed solution.

Page 88

Doesn't work, despite 80% accuracy of the anaphoricity classifier.

The reason: error propagation in the pipeline.


Joint Solution with CCMs

Page 89


Joint solution using CCMs

Page 90


Joint solution using CCMs

Page 91

This performs considerably better than the pipelined approach, and actually improves (though marginally) over the baseline without the anaphoricity classifier.


Joint solution using CCMs

Page 92

Note: if we have reason to trust one of the classifiers significantly more than the other, we can scale their contributions with tuned parameters α and β.
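A small sketch (made-up scores) of such a joint objective: anaphoricity and coreference-link variables share one ILP, their scores are scaled by α and β, and consistency constraints tie the two together. The exact consistency constraints used in the cited work may differ; these are illustrative.

from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary, value

mentions = ["Clinton", "NPR", "her"]
link_score = {("NPR", "Clinton"): -0.3, ("her", "Clinton"): 0.6, ("her", "NPR"): -0.2}
anaph_score = {"Clinton": -0.5, "NPR": -0.4, "her": 0.7}   # > 0 means "likely anaphoric"
alpha, beta = 1.0, 1.0                                     # relative trust in each model

prob = LpProblem("joint_coref", LpMaximize)
link = {p: LpVariable("link_%s_%s" % p, cat=LpBinary) for p in link_score}
anaph = {m: LpVariable("anaph_%s" % m, cat=LpBinary) for m in mentions}

prob += (alpha * lpSum(link_score[p] * link[p] for p in link_score)
         + beta * lpSum(anaph_score[m] * anaph[m] for m in mentions))

for (m, ante) in link_score:
    prob += link[(m, ante)] <= anaph[m]          # only anaphors may link backwards
for m in mentions:
    incoming = [link[p] for p in link_score if p[0] == m]
    if incoming:                                  # an anaphor must link to something
        prob += lpSum(incoming) >= anaph[m]

prob.solve()
print([p for p in link if value(link[p]) > 0.5],
      [m for m in anaph if value(anaph[m]) > 0.5])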


Page 93

Example 4 - Sentence Compression (thanks James Clarke).


Page 94

Example

Example: word positions 0-8 over "Big fish eat small fish in a small pond"; one possible compression keeps positions 0, 1, 5, 6, 8, giving "Big fish in a pond".

[Figure: the binary decision variables that select which words are kept.]


Page 95

Language model-based compression


Page 96

Example 4: sentence compression (thanks James Clarke)

This formulation requires some additional constraints. For "Big fish eat small fish in a small pond", no selection of the decision variables can make certain trigrams appear consecutively in the output.

We skip these constraints here.
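Since the full trigram ILP is skipped, here is a brute-force sketch that conveys the same idea on the toy sentence: every candidate compression of at least five words is scored with a made-up bigram language model, and candidates violating a (made-up) modifier constraint are discarded. It is not the formulation used in the cited work, only an illustration of "language model score plus hard constraints".

from itertools import combinations

words = ["Big", "fish", "eat", "small", "fish", "in", "a", "small", "pond"]
head_of = {0: 1, 3: 4, 7: 8}      # modifier index -> head index (made-up dependencies)
good = {("Big", "fish"), ("fish", "in"), ("in", "a"), ("a", "pond")}
bigram_logp = lambda w1, w2: -1.0 if (w1, w2) in good else -4.0   # toy bigram LM
min_len = 5                        # a crude stand-in for a sentential length constraint

def lm_score(kept):
    seq = [words[i] for i in kept]
    return sum(bigram_logp(a, b) for a, b in zip(seq, seq[1:]))

best = None
for size in range(min_len, len(words) + 1):
    for kept in combinations(range(len(words)), size):
        if any(m in kept and h not in kept for m, h in head_of.items()):
            continue               # violates the modifier constraint
        cand = (lm_score(kept), kept)
        best = max(best, cand) if best else cand
print(best, [words[i] for i in best[1]])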


Page 97

Trigram model in action


Page 98

Modifier Constraints


Page 99

Example


Page 100

Example


Page 101

Sentential Constraints


Page 102

Example


Page 103

Example


Page 104

More constraints


Other CCM Examples: Opinion Recognition

Y. Choi, E. Breck, and C. Cardie. Joint Extraction of Entities and Relations for Opinion Recognition. EMNLP-2006.

Semantic parsing variation: agent = entity, relation = opinion.

Constraints: an agent can have at most two opinions; an opinion should be linked to only one agent; the usual non-overlap constraints.

Page 105


Other CCM Examples: Temporal Ordering

N. Chambers and D. Jurafsky. Jointly Combining Implicit Constraints Improves Temporal Ordering. EMNLP-2008.

Page 106


Other CCM Examples: Temporal Ordering

N. Chambers and D. Jurafsky. Jointly Combining Implicit Constraints Improves Temporal Ordering. EMNLP-2008.

Page 107

Three types of edges: 1) annotation relations (before/after); 2) transitive closure constraints; 3) time normalization constraints.


Related Work: Language generation.

Regina Barzilay and Mirella Lapata. Aggregation via Set Partitioning for Natural Language Generation. HLT-NAACL-2006.

Constraints: transitivity – if (e_i, e_j) were aggregated and (e_j, e_k) were too, then (e_i, e_k) get aggregated; a maximum number of facts aggregated; a maximum sentence length.

Page 108


MT & Alignment

Ulrich Germann, Mike Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. Fast decoding and optimal decoding for machine translation. ACL 2001.

John DeNero and Dan Klein. The Complexity of Phrase Alignment Problems. ACL-HLT-2008.

Page 109


Summary of Examples

We have shown several different NLP solutions that make use of CCMs.

Examples vary in the way models are learned.

In all cases, constraints can be expressed in a high level language, and then transformed into linear inequalities.

Learning Based Java (LBJ) [Rizzolo & Roth '07] describes an automatic way to compile a high-level description of constraints into linear inequalities.

Page 110


Page 111

Solvers

All applications presented so far used ILP for inference.

People used different solvers: Xpress-MP, GLPK, lpsolve, R, Mosek, CPLEX.

Next we discuss other inference approaches to CCMs


112

This Tutorial: Constrained Conditional Models (2nd part)

Part 4: More on Inference – other inference algorithms
  Search (SRL); Dynamic Programming (Transliteration); Cutting Planes
  Using hard and soft constraints
Part 5: Training issues when working with CCMs
  Formalism (again)
  Choices of training paradigms – tradeoffs
  Examples in supervised learning
  Examples in semi-supervised learning
Part 6: Conclusion
  Building CCMs: features and constraints; objective functions; different learners
  Mixed models vs. joint models; where is the knowledge coming from?

THE END


113

Where Are We ?

We hope we have already convinced you that using constraints is a good idea for addressing NLP problems, and that constrained conditional models provide a good platform.

We were talking about using expressive constraints to improve existing models: Learning + Inference. The problem: inference.

A powerful inference tool: Integer Linear Programming – SRL, co-ref, summarization, entities-and-relations, … Easy to inject domain knowledge.


114

Constrained Conditional Model: Inference

y* = argmax_y  w^T f(x, y)  −  Σ_k ρ_k d(y, 1_{C_k}(x))

Weight vector for the "local" models: w – a collection of classifiers; log-linear models (HMM, CRF); or a combination.
(Soft) constraints component: ρ_k is the constraint-violation penalty, and d(y, 1_{C_k}(x)) measures how far y is from a "legal" assignment.

How to solve?
This is an Integer Linear Program. Solving using ILP packages gives an exact solution; search techniques are also possible.

How to train?
How to decompose the global objective function? Should we incorporate constraints in the learning process?


115

Advantages of ILP Solvers: Review

ILP is expressive: we can solve many inference problems, and converting inference problems into ILP is easy.

ILP is easy to use: many packages are available – open source (lp_solve, GLPK, …) and commercial (Xpress-MP, CPLEX). No need to write optimization code!

Why should we consider other inference options?


116

ILP: Speed Can Be an Issue

Inference problems in NLP: sometimes large problems are actually easy for ILP (e.g., entities and relations); many of them are not "difficult".

Sometimes ILP isn't fast enough, though, and one needs to resort to approximate solutions.

The problem: general solvers vs. specific solvers. ILP is a very, very general solver, but sometimes the structure of the problem allows for simpler inference algorithms. Next we give examples of both cases.


117

Example 1: Search based Inference for SRL

The objective function

Constraints: unique labels; no overlapping or embedding; if the verb is of type A, no argument of type B; …

Intuition: check constraints’ violations on partial assignments

Maximize summation of the scores subject to linguistic constraints

max  Σ_{i,j} c_{ij} x_{ij}

x_{ij} is an indicator variable that assigns the j-th class to the i-th token; c_{ij} is the classification confidence.


118

Inference using Beam Search

For each step, discard partial assignments that violate constraints!

[Figure: shapes are arguments, colors are labels; beam size = 2; constraint: only one Red.]

Rank them according to classification confidence!
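A minimal beam-search decoder in this spirit, with made-up scores and the slide's toy "only one Red" constraint: partial assignments that violate the constraint are discarded, and only the top-k survivors by accumulated confidence are kept.

def beam_search(scores, labels, ok, beam_size=2):
    beam = [((), 0.0)]                                  # (partial assignment, score)
    for position_scores in scores:                      # one score dict per token
        candidates = [(partial + (lab,), total + position_scores[lab])
                      for partial, total in beam for lab in labels]
        candidates = [c for c in candidates if ok(c[0])]     # constraint check
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beam[0] if beam else None                    # may fail if constraints are tight

labels = ["Red", "Blue"]
scores = [{"Red": 0.9, "Blue": 0.1}, {"Red": 0.8, "Blue": 0.2}, {"Red": 0.3, "Blue": 0.7}]
only_one_red = lambda partial: partial.count("Red") <= 1     # the slide's toy constraint
print(beam_search(scores, labels, only_one_red))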


119

Heuristic Inference

Problems of heuristic inference: 1) possibly a sub-optimal solution; 2) it may not find a feasible solution (drop some constraints and solve again).

Using search on SRL gives comparable results to using ILP, but is much faster.


120

How to get a score for the pair? Previous approaches: extract features for each source and target entity pair.

The CCM approach: introduce an internal structure (characters) and constrain the character mappings to "make sense".

Example 2: Exploiting Structure in Inference: Transliteration


121

Transliteration Discovery with CCM

The problem now is inference: how to find the best mapping that satisfies the constraints?

A weight is assigned to each edge.

Include it or not? A binary decision.

Score = sum of the mappings' weights.

Assume the weights are given. More on this later.

Natural constraints: pronunciation constraints, one-to-one, non-crossing, …

Score = sum of the mappings' weights, s.t. the mapping satisfies the constraints.


Finding The Best Character Mappings

An Integer Linear Programming Problem

Is this the best inference algorithm?

maximize  Σ_{i∈S, j∈T} c_{ij} x_{ij}                     (maximize the mapping score)
subject to:
    x_{ij} ∈ {0, 1}                      for all i, j
    x_{ij} = 0 if (i, j) is not an allowed pair          (pronunciation constraint)
    Σ_j x_{ij} ≤ 1                       for all i        (one-to-one constraint)
    x_{ij} + x_{km} ≤ 1                  for all i < k, m < j   (non-crossing constraint)


Finding The Best Character Mappings

A Dynamic Programming Algorithm

Exact and fast!

Maximize the mapping score

Restricted mapping constraints

One-to-one constraint

Non-crossing constraint

Take Home Message:

Although ILP can solve most problems, the fastest inference algorithm depends on the constraints and can be simpler

We can decompose the inference problem into two parts
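A minimal sketch of such a dynamic program (Python). Because the one-to-one and non-crossing constraints make the surviving mapping monotone, the best mapping score can be computed with an edit-distance-style recurrence over prefixes; the weight matrix below is illustrative, with restricted (pronunciation-forbidden) pairs given a large negative weight:

    def best_mapping_score(w):
        """w[i][j]: weight of mapping source char i to target char j.
        One-to-one and non-crossing constraints are enforced by the prefix recurrence."""
        nS, nT = len(w), len(w[0]) if w else 0
        # dp[i][j] = best score using the first i source and first j target characters
        dp = [[0.0] * (nT + 1) for _ in range(nS + 1)]
        for i in range(1, nS + 1):
            for j in range(1, nT + 1):
                dp[i][j] = max(dp[i - 1][j],                        # leave source char i unmapped
                               dp[i][j - 1],                        # leave target char j unmapped
                               dp[i - 1][j - 1] + w[i - 1][j - 1])  # map i to j
        return dp[nS][nT]

    w = [[0.9, -1.0], [-1.0, 0.8]]
    print(best_mapping_score(w))   # 1.7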


Other Inference Options

Constraint relaxation strategies: try Linear Programming first [Roth and Yih, ICML 2005]. Cutting plane algorithms do not use all constraints at first; e.g., dependency parsing with an exponential number of constraints [Riedel and Clarke, EMNLP 2006].

Other search algorithms: A-star, hill climbing, …

Gibbs sampling inference [Finkel et al., ACL 2005]: Named Entity Recognition, enforcing long-distance constraints. Can be considered as Learning + Inference, with one type of constraints only.


Inference Methods – Summary

Why ILP? It is a powerful way to formalize the problems; however, it is not necessarily the best algorithmic solution.

Heuristic inference algorithms are useful sometimes! Beam search; other approaches: annealing, …

Sometimes a specific inference algorithm can be designed, according to your constraints.


This Tutorial: Constrained Conditional Models (2nd part)

Part 4: More on Inference – other inference algorithms: Search (SRL); Dynamic Programming (Transliteration); Cutting Planes; using hard and soft constraints.

Part 5: Training issues when working with CCMs: formalism (again); choices of training paradigms and their tradeoffs; examples in supervised learning; examples in semi-supervised learning.

Part 6: Conclusion: building CCMs; features and constraints; objective functions; different learners; mixed models vs. joint models; where the knowledge is coming from.

THE END


Constrained Conditional Model: Training

How to solve?

This is an Integer Linear Program

Solving using ILP packages gives an exact solution.

Search techniques are also possible

(Soft) constraints component

Weight Vector for “local” models

Constraint violation penalty

How far y is from a “legal” assignment

A collection of Classifiers; Log-linear models (HMM, CRF) or a combination

How to train?

How to decompose the global objective function?

Should we incorporate constraints in the learning process?


Where are we?

Algorithmic approach: incorporating general constraints. We showed that CCMs allow for formalizing many problems, and showed several algorithmic ways to incorporate global constraints in the decision.

Training: coupling vs. decoupling training and inference. Incorporating global constraints is important, but should it be done only at evaluation time or also at training time? How to decompose the objective function and train in parts? Issues related to:

Modularity, efficiency and performance, availability of training data

Problem specific considerations


Training in the presence of Constraints

General training paradigm. First term: learning from data (could be further decomposed). Second term: guiding the model by constraints. One can choose whether the constraints' weights are trained, when and how, or taken into account only at evaluation time.

Decompose the model (SRL case); decompose the model from the constraints.


Comparing Training Methods

Option 1: Learning + Inference (with constraints): ignore constraints during training.

Option 2: Inference (with constraints) Based Training: consider constraints during training.

In both cases: Decision Making with Constraints

Question: Isn’t Option 2 always better?

Not so simple… Next, the “Local model story”


Training Methods

[Figure: inputs x1…x7 (X) connected to outputs y1…y5 (Y) through local classifiers f1(x)…f5(x)]

Learning + Inference (L+I): learn the models independently. Inference Based Training (IBT): learn all models together!

Intuition: learning with constraints may make learning more difficult.


Training with Constraints. Example: Perceptron-based Global Learning

[Figure: inputs x1…x7 (X) feed local classifiers f1(x)…f5(x) that produce the output variables (Y)]

Y  (true global labeling):         -1  1  1 -1 -1
Y' (local predictions):            -1  1  1  1  1
Y' (after applying constraints):   -1  1  1  1 -1

Which one is better? When and Why?


Claims [Punyakanok et al., IJCAI 2005]

The theory applies to the case of local models (no Y in the features).

When the local models are "easy" to learn, L+I outperforms IBT. In many applications the components are identifiable and easy to learn (e.g., argument, open-close, PER). Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and then it needs a larger number of training examples.

Other training paradigms are possible. Pipeline-like sequential models [Roth, Small, Titov: AI&Stat'09]: identify a preferred ordering among components and learn the k-th model jointly with the previously learned models.

L+I: computationally cheaper and modular. IBT is better in the limit, and in other extreme cases.


Bound Prediction

Local ≤ ε_opt + ( (d log m + log 1/δ) / m )^(1/2)
Global ≤ 0 + ( (c d log m + c^2 d + log 1/δ) / m )^(1/2)

[Plot: the bounds and simulated data for ε_opt = 0, 0.1, 0.2; ε_opt is an indication of the hardness of the problem]

L+I vs. IBT: the more identifiable the individual problems are, the better the overall performance is with L+I.
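As a rough numeric illustration of how these bounds trade off (a minimal Python sketch; the values of d, c, m, δ and ε_opt below are arbitrary illustrative choices, not taken from the slides):

    from math import log, sqrt

    def local_bound(eps_opt, d, m, delta):
        # Local <= eps_opt + ((d log m + log 1/delta) / m)^(1/2)
        return eps_opt + sqrt((d * log(m) + log(1 / delta)) / m)

    def global_bound(c, d, m, delta):
        # Global <= 0 + ((c d log m + c^2 d + log 1/delta) / m)^(1/2)
        return sqrt((c * d * log(m) + c ** 2 * d + log(1 / delta)) / m)

    d, c, m, delta = 10, 5, 10000, 0.05
    for eps_opt in (0.0, 0.1, 0.2):
        print(eps_opt, round(local_bound(eps_opt, d, m, delta), 3),
              round(global_bound(c, d, m, delta), 3))
    # With these numbers the Local (L+I) bound is tighter for eps_opt = 0 and 0.1,
    # and the Global (IBT) bound becomes tighter once eps_opt reaches 0.2.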


Relative Merits: SRL

[Plot: performance of L+I vs. IBT as a function of the difficulty of the learning problem (# features), from easy to hard. L+I is better; when the problem is artificially made harder, the tradeoff becomes clearer.]

In some cases problems are hard due to a lack of training data; this motivates semi-supervised learning.


L+I & IBT: General View – Structured Perceptron

For each iteration
    For each (X, YGOLD) in the training data
        YPRED = argmax_y λ · F(X, y)
        If YPRED != YGOLD
            λ = λ + F(X, YGOLD) - F(X, YPRED)
        endif
    endfor

The difference between L+I and IBT is where the constraints enter: in IBT the argmax that produces YPRED is computed subject to the constraints; in L+I it is not.

The theory applies when F(x, y) = F(x).
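A minimal sketch of this training loop (Python; the feature map feat, the constraint check satisfies, and the candidate output space are illustrative placeholders, not part of the original slides). The only difference between L+I and IBT is whether the argmax that produces YPRED is restricted to constraint-satisfying outputs during training:

    import numpy as np

    def argmax_y(x, lam, candidates, feat, satisfies, constrained):
        # Exhaustive argmax over a small candidate set; a real system would use
        # beam search, DP, or ILP here.
        best, best_score = None, float("-inf")
        for y in candidates:
            if constrained and not satisfies(x, y):   # IBT: prune illegal outputs during training
                continue
            score = float(lam @ feat(x, y))
            if score > best_score:
                best, best_score = y, score
        return best

    def structured_perceptron(data, candidates, feat, satisfies, dim,
                              constrained, epochs=10):
        lam = np.zeros(dim)
        for _ in range(epochs):
            for x, y_gold in data:
                y_pred = argmax_y(x, lam, candidates, feat, satisfies, constrained)
                if y_pred != y_gold:
                    lam = lam + feat(x, y_gold) - feat(x, y_pred)
        return lam

    # L+I: constrained=False during training; constraints are applied only at decision time.
    # IBT: constrained=True, so every update already reflects the constraints.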


Comparing Training Methods (Cont.)

Local models (trained independently) vs. structured models: in many cases, structured models might be better due to expressivity. But what if we use constraints?

Local models + constraints vs. structured models + constraints: hard to tell. Constraints are expressive; for tractability reasons, structured models have less expressivity than the use of constraints. Local models can be better, because they are easier to learn.

Decompose the model (SRL case); decompose the model from the constraints.


Example: Semantic Role Labeling Revisited

[Figure: a lattice from node s to node t with candidate labels A, B, C at each of five positions]

Sequential Models: Conditional Random Field; Global perceptron. Training: sentence based. Testing: find the shortest path, with constraints.

Local Models: Logistic Regression; Local Avg. Perceptron. Training: token based. Testing: find the best assignment locally, with constraints.


Which Model is Better? Semantic Role Labeling

Experiments on SRL [Roth and Yih, ICML 2005]. Story: inject constraints into conditional random field models.

    Model            CRF      CRF-D    CRF-IBT   Avg. P
    Baseline         66.46    69.14    58.15     -
    + Constraints    71.94    73.91    69.82     74.49
    Training Time    48       38       145       0.8

(CRF and CRF-D are sequential models trained with L+I; CRF-IBT is trained with IBT; Avg. P is a local model trained with L+I.)

Without constraints: sequential models are better than local models!
With constraints: local models are now better than sequential models!


Summary: Training Methods

There are many choices for training a CCM: Learning + Inference (training without constraints) and Inference Based Training (training with constraints).

Based on this, what kind of models do you want to use?

Advantages of L+I: requires fewer training examples; more efficient and, most of the time, better performance; modularity, so it is easier to incorporate already learned models.


Constrained Conditional Model: Soft Constraints

(3) How to solve?

This is an Integer Linear Program

Solving using ILP packages gives an exact solution.

Search techniques are also possible

(Soft) constraints component

Constraint violation penalty

How far y is from a “legal” assignment

Subject to constraints

(4) How to train?

How to decompose the global objective function?

Should we incorporate constraints in the learning process?

(1) Why use soft constraints?

(2) How to model “degree of violations”


(1) Why Are Soft Constraints Important?

Some constraints may be violated by the data.

Even when the gold data violates no constraints, the model may prefer illegal solutions. We want a way to rank solutions based on the level of constraint violation; this is especially important when beam search is used.

Working with soft constraints, we need to define the degree of violation and to assign penalties to constraints.


Example: Information extraction

Prediction result of a trained HMM on the citation:

    Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .

The HMM segments it into [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE] with badly misplaced boundaries: the prediction violates lots of natural constraints!


Examples of Constraints

Each field must be a consecutive list of words and can appear at most once in a citation.

State transitions must occur on punctuation marks.

The citation can only start with AUTHOR or EDITOR.

The words pp., pages correspond to PAGE. Four-digit numbers of the form 19xx or 20xx are DATE. Quotations can appear only in TITLE. …


(2) Modeling Constraints’ Degree of Violations

Constraint C (state transitions must occur on punctuation marks): for all i, yi ≠ yi+1 ⇒ xi+1 is a punctuation mark.

Φc(yN) = 1 if assigning yN to xN violates constraint C with respect to the assignment (x1,..,xN-1; y1,…,yN-1), and 0 otherwise.

Count how many times the constraints are violated: d(y, 1C(X)) = ∑ Φc(yi)

Lars   Ole   Andersen   .
AUTH   AUTH  EDITOR     EDITOR
Φc(y1)=0  Φc(y2)=0  Φc(y3)=1  Φc(y4)=0        ∑ Φc(yi) = 1

Lars   Ole   Andersen   .
AUTH   BOOK  EDITOR     EDITOR
Φc(y1)=0  Φc(y2)=1  Φc(y3)=1  Φc(y4)=0        ∑ Φc(yi) = 2
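A minimal sketch of this count (Python), for the "state transitions must occur on punctuation marks" constraint; the punctuation set is an illustrative assumption:

    def transition_violations(tokens, labels, punctuation=(".", ",", ";", ":")):
        # d(y, 1_C(x)): count the positions where the label changes (y_i != y_{i+1})
        # but the token the transition lands on is not a punctuation mark.
        return sum(1 for i in range(1, len(labels))
                   if labels[i] != labels[i - 1] and tokens[i] not in punctuation)

    tokens = ["Lars", "Ole", "Andersen", "."]
    print(transition_violations(tokens, ["AUTH", "AUTH", "EDITOR", "EDITOR"]))  # 1
    print(transition_violations(tokens, ["AUTH", "BOOK", "EDITOR", "EDITOR"]))  # 2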


Example: Degree of Violation?

It might be the case that all "good" assignments violate some constraints; then we lose the ability to judge which assignment is better.

A better option: choose the one with fewer violations!

Lars   Ole   Andersen   .
AUTH   BOOK  EDITOR     EDITOR
Φc(y1)=0  Φc(y2)=1  Φc(y3)=1  Φc(y4)=0

Lars   Ole   Andersen   .
AUTH   AUTH  EDITOR     EDITOR
Φc(y1)=0  Φc(y2)=0  Φc(y3)=1  Φc(y4)=0

The assignment AUTH AUTH EDITOR EDITOR is better according to d(y, 1C(X))!


Hard Constraints vs. Weighted Constraints

Hard constraints: when the constraints are close to perfect.

Weighted constraints: when the labeled data might not follow the constraints.


(3) Constrained Conditional Model with Soft Constraints

How to solve?

This is an Integer Linear Program

Solving using ILP packages gives an exact solution.

Search techniques are also possible

(Soft) constraints component

Constraint violation penalty

How far y is from a “legal” assignment

How to train?

How to decompose the global objective function?

Should we incorporate constraints in the learning process?


Inference with Beam Search (& Soft Constraints)

[Figure: the tokens of the citation "L. Shari et. al. 1998. Building semantic Concordances. In C. Fellbaum, ed. WordNet: and its applications." laid out for beam search, with the first token labeled Au]

ILP is another option! Run beam search inference as before; rather than eliminating illegal assignments, re-rank them.


Assume that We Have Made a Mistake…


[Figure: the partial assignment Au Au Au Au Date Title Title Journal over the tokens "L. Shari et. al. 1998. Building semantic Concordances. In", with the remaining tokens "C. Fellbaum, ed. WordNet: and its applications." still unlabeled]

Assumption: this is the best assignment so far according to the objective function. Start from here.


Top Choice from Learned Models Is Not Good: ∑Φc1(yi) =1 ; ∑Φc2(yi) =1+1


[Figure: the learned models' top choice labels the tokens as Au Au Au Au Date Title Title Journal Book, with "C. Fellbaum," labeled Au]

Each field must be a consecutive list of words and can appear at most once in a citation.
State transitions must occur on punctuation marks.
Two violations.


The Second Choice from Learned Models Is Not Good


[Figure: the second choice labels the tokens as Au Au Au Au Date Title Title Journal Book, with "C. Fellbaum," labeled Editor]

State transitions must occur on punctuation marks.
Two violations. ∑ Φc2(yi) = 1+1


Use Constraints to Find the Best Assignment

[Figure: the constrained best assignment labels the tokens as Au Au Au Au Date Title Title Journal Editor, with "C. Fellbaum," labeled Editor]


State transitions must occur on punctuation marks.
One violation. ∑ Φc2(yi) = 1


Soft Constraints Help Us Find a Better Assignment

[Figure: the completed assignment over the full citation, extending Au Au Au Au Date Title Title Journal Editor with the remaining tokens labeled Editor and Book]


We can do this because we use soft constraints. If we used hard constraints, the score of every assignment with this prefix would be negative infinity.


(4) Constrained Conditional Model with Soft Constraints

How to solve?

This is an Integer Linear Program

Solving using ILP packages gives an exact solution.

Search techniques are also possible

(Soft) constraints component

Constraint violation penalty

How far y is from a “legal” assignment

How to train?

How to decompose the global objective function?

Should we incorporate constraints in the learning process?


Compare Training Methods (with Soft Constraints)

Need to figure out the penalty as well…

Option 1: Learning + Inference (with constraints): learn the weights and penalties separately.

Option 2: Inference (with constraints) Based Training: learn the weights and penalties together.

The tradeoff between L+I and IBT is similar to what we have seen earlier.


Inference Based Training With Soft Constraints

Example: Perceptron. Update the penalties as well!

For each iteration
    For each (X, YGOLD) in the training data
        YPRED = argmax_y λ · F(X, y) - Σ_i ρi d(y, 1Ci(X))
        If YPRED != YGOLD
            λ = λ + F(X, YGOLD) - F(X, YPRED)
            ρi = ρi + d(YGOLD, 1Ci(X)) - d(YPRED, 1Ci(X)),   i = 1, …
        endif
    endfor
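A minimal sketch of a single such update (Python; feat and violation_counts are placeholders for F(X, y) and the per-constraint violation counts d(y, 1C(X))):

    def ibt_soft_update(lam, rho, x, y_gold, y_pred, feat, violation_counts):
        # Perceptron-style update that also adjusts the constraint penalties,
        # following the update rule on this slide.
        if y_pred != y_gold:
            lam = lam + feat(x, y_gold) - feat(x, y_pred)
            d_gold = violation_counts(x, y_gold)   # one count per constraint C_i
            d_pred = violation_counts(x, y_pred)
            rho = [r + dg - dp for r, dg, dp in zip(rho, d_gold, d_pred)]
        return lam, rho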


L+I vs. IBT for Soft Constraints

Test on citation recognition. L+I: HMM + weighted constraints. IBT: perceptron + weighted constraints. Same feature set.

With constraints, the factored model is better; the difference is more significant with a small number of examples.

Without constraints: with few labeled examples, HMM > perceptron; with many labeled examples, perceptron > HMM.

This agrees with earlier results in the supervised setting [ICML'05, IJCAI'05].


Summary – Soft Constraints

Using soft constraints is sometimes important: some constraints might be violated in the gold data, and we want a notion of degree of violation.

Degree of violation: one approximation is to count how many times the constraints were violated.

How to solve? Beam search, ILP, …

How to learn? L+I vs. IBT.



Constrained Conditional Model: Injecting Knowledge

How to solve?

This is an Integer Linear Program

Solving using ILP packages gives an exact solution.

Search techniques are also possible

(Soft) constraints component

Weight Vector for “local” models

Constraint violation penalty

How far y is from a “legal” assignment

A collection of Classifiers; Log-linear models (HMM, CRF) or a combination

How to train?

How to decompose the global objective function?

Should we incorporate constraints in the learning process?

Inject Prior Knowledge via Constraints

A.K.A.

Semi/unsupervised Learning with Constrained Conditional Model


Constraints As a Way To Encode Prior Knowledge

Consider encoding the knowledge that entities of type A and B cannot occur simultaneously in a sentence.

The "feature" way: an effective way to inject knowledge, but it requires larger models and hence more training data.

The constraints way: keeps the model simple; adds expressive constraints directly; needs only a small set of constraints; allows for decision-time incorporation of constraints.

We can use constraints as a way to replace training data.


Constraint Driven Semi/Un Supervised Learning

CODL: use constraints to generate better training samples in semi/unsupervised learning.

[Figure: the CODL loop. A model, initialized from seed examples or other resources, labels unlabeled data (Prediction + Constraints); learning from the newly labeled data provides Better Feedback and yields a Better Model, which labels the data again. Plain self-training runs the same loop with Prediction and Feedback but without constraints.]

In traditional semi/unsupervised learning, models can drift away from the correct model.


Semi-Supervised Learning with Soft Constraints (L+I)

Learning the model weights: for example, an HMM.

Constraint penalties: hard constraints get an infinite penalty; for weighted constraints,
    ρi = -log P{constraint Ci is violated in the training data}

Only 10 constraints. The weights and the penalties are learned separately.
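A minimal sketch of estimating one such penalty from labeled data (Python; the smoothing floor is an added assumption so that a never-violated constraint gets a large but finite penalty rather than infinity):

    from math import log

    def constraint_penalty(labeled_examples, is_violated, floor=1e-6):
        # rho_i = -log P{constraint C_i is violated in the training data}
        violated = sum(1 for x, y in labeled_examples if is_violated(x, y))
        p = max(violated / len(labeled_examples), floor)
        return -log(p)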


Semi-supervised Learning with Constraints

λ = learn(T)
For N iterations do
    T = {}
    For each x in the unlabeled dataset
        {y1, …, yK} = InferenceWithConstraints(x, C, λ)
        T = T ∪ {(x, yi)}, i = 1…K
    λ = γ λ + (1 - γ) learn(T)

learn(T): the supervised learning algorithm, parameterized by λ.
InferenceWithConstraints: inference-based augmentation of the training set (feedback), i.e. inference with constraints.
The last line learns from the new training data and weighs the supervised and unsupervised models (γ).

[Chang, Ratinov, Roth, ACL’07]
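A minimal sketch of this loop (Python). Here learn and inference_with_constraints are placeholders for the supervised learner and the constrained top-K inference, and gamma plays the role of the weighting parameter between the supervised and the newly learned model:

    def codl(labeled, unlabeled, constraints, learn, inference_with_constraints,
             iterations=5, k=5, gamma=0.9):
        lam = learn(labeled)                       # good starting point from the seed examples
        for _ in range(iterations):
            t = []
            for x in unlabeled:
                # Inference-based augmentation of the training set (feedback):
                # take the top-K constraint-satisfying assignments for x.
                for y in inference_with_constraints(x, constraints, lam, k):
                    t.append((x, y))
            # Weigh the supervised and the unsupervised model
            # (assumes the parameters support this arithmetic, e.g. a numpy vector).
            lam = gamma * lam + (1 - gamma) * learn(t)
        return lam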


Objective function:

Value of Constraints in Semi-Supervised Learning

[Plot: performance vs. number of available labeled examples (up to 300). Learning with 10 constraints: constraints are used to bootstrap a semi-supervised learner; a poor model + constraints is used to annotate unlabeled data, which in turn is used to keep training the model. Learning without constraints needs 300 examples to reach comparable performance.]


Unsupervised Constraint Driven Learning

Semi-supervised constraint driven learning: in [Chang et al., ACL 2007], a small labeled dataset is used. Reason: we need a good starting point!

Unsupervised constraint driven learning: sometimes good resources exist. We can use the resource to initialize the model, so we do not need labeled instances at all!

Application: transliteration discovery.


Why Add Constraints?

Before talking about transliteration discovery, let's think about why we add constraints again. Reason: we want to capture the dependencies between different outputs.

A new question: what happens if you are trying to solve a single-output classification problem? Can you still inject prior knowledge?


Why Add Constraints?

[Figure: a structured output problem: inputs x1…x7 (X) connected to several output variables y1…y5 (Y), with dependencies between the different outputs]


Why Add Constraints?

[Figure: a single output problem: inputs x1…x7 (X) connected to only one output variable y1 (Y)]


Adding Constraints Through Hidden Variables

[Figure: a single output problem with hidden variables: inputs x1…x7 (X) connected to hidden variables f1…f5, which in turn determine the single output y1 (Y)]


Hidden variables: character mappings.

Character mappings are not our final goal, but they provide an intermediate representation that satisfies natural constraints.

Example: Transliteration Discovery

The score for each pair depends on the character mappings.


Transliteration Discovery with Hidden Variables

For each source NE, find the best target candidate. For each NE pair, the score is equal to the score of the best hidden set of character mappings (edges). Add constraints when predicting the hidden variables.

g(H, Ns, Nt) = w · F(H, Ns, Nt) - Σ_{c∈C} ρc d(H, 1c)          (the second term is the constraint violation penalty)

score(Ns, Nt) = max_H g(H, Ns, Nt) / |Nt|

A CCM formulation: use constraints to bias the prediction! Alternative view: finding the best feature representation.


Natural Constraints

A Dynamic Programming Algorithm

Exact and fast!

Maximize the mapping score

Pronunciation constraints

One-to-one constraint

Non-crossing constraint

Take Home Message:

Although ILP can solve most problems, the fastest inference algorithm depends on the constraints and can be simpler

We can decompose the problem into two parts


Algorithm: High Level View

Initialize the model from a resource (a Romanization table). Until convergence:
    use the model + constraints to get assignments for both the hidden variables (F) and the labels (Y);
    update the model with the newly labeled F and Y.
Get feedback from both the hidden variables and the labels.

D = { ⟨(Ns, Nt), F*(Ns, Nt | W)⟩ }        (the hidden variables and labels predicted with the current model W and the constraints)
W = train(D)


Results - Russian

[Bar chart (higher is better): Unsupervised Constraint Driven Learning (no labeled data, no temporal information) compared with a baseline using no constraints + 20 labeled examples. No NER is available for Russian.]


Results - Hebrew

[Bar chart (higher is better): Unsupervised Constraint Driven Learning (no labeled data, no temporal information) compared with a baseline using no constraints + 250 labeled examples.]


Summary

Adding constraints is an effective way to inject knowledge: constraints can correct our predictions on unlabeled data.

Injecting constraints is sometimes more effective than annotating data: 300 labeled examples vs. 20 labeled examples + constraints.

For single-output problems, use hidden variables to capture the constraints.

Other approaches are possible [Daumé, EMNLP 2008]: only get feedback from the examples where the constraints are satisfied.



This Tutorial: Constrained Conditional Models (2nd part)

Part 4: More on Inference – other inference algorithms: Search (SRL); Dynamic Programming (Transliteration); Cutting Planes; using hard and soft constraints.

Part 5: Training issues when working with CCMs: formalism (again); choices of training paradigms and their tradeoffs; examples in supervised learning; examples in semi-supervised learning.

Part 6: Conclusion: building CCMs; features and constraints; objective functions; different learners; mixed models vs. joint models; where the knowledge is coming from.

THE END


Conclusion

Constrained Conditional Models combine learning conditional models with using declarative, expressive constraints, within a constrained optimization framework.

Our goal was to: introduce a clean way of incorporating constraints to bias and improve the decisions of learned models; provide a clean way to use (declarative) prior knowledge to guide semi-supervised learning; and give examples of the diverse usage CCMs have already found in NLP, with significant success on several NLP and IE tasks (often with ILP).


Technical Conclusions

We presented and discussed inference algorithms: how to use constraints to make global decisions. The formulation is an Integer Linear Programming formulation, but algorithmic solutions can employ a variety of algorithms.

We presented and discussed learning models to be used along with constraints. The training protocol matters; we distinguished two extreme cases, training with and without constraints. Issues include performance, as well as the modularity of the solution and the ability to use previously learned models.

We did not attend to the question of "how to find constraints", in order to emphasize the idea that background knowledge is important, exists, and should be used. But we talked about learning weights for constraints, and it is clearly possible to learn constraints.


Summary: Constrained Conditional Models

y* = argmax_y Σ_i wi φi(x; y) - Σ_i ρi dCi(x, y)

Linear objective functions; typically φ(x, y) will be local functions, or φ(x, y) = φ(x). Expressive constraints over the output variables: soft, weighted constraints, specified declaratively as FOL formulae.

[Figure: a Conditional Markov Random Field over output variables y1…y8 combined with a Constraints Network over the same variables]

Clearly, there is a joint probability distribution that represents this mixed model. We would like to: learn a simple model or several simple models, and make decisions with respect to a complex model.

Key difference from MLNs, which provide a concise definition of a model, but of the whole joint one.


Designing CCMs

Decide what variables are of interest and learn model(s); think about constraints among the variables of interest; design your objective function:

y* = argmax_y Σ_i wi φi(x; y) - Σ_i ρi dCi(x, y)

Linear objective functions; typically φ(x, y) will be local functions, or φ(x, y) = φ(x). Expressive constraints over the output variables: soft, weighted constraints, specified declaratively as FOL formulae.

LBJ (Learning Based Java): http://L2R.cs.uiuc.edu/~cogcomp
A modeling language for Constrained Conditional Models. It supports programming along with building learned models, high-level specification of constraints, and inference with constraints.


Questions? Thank you!

