
Page 1

March 2009

EACL

Constrained Conditional Models for

Natural Language Processing

Ming-Wei Chang, Lev Ratinov, Dan Roth
Department of Computer Science
University of Illinois at Urbana-Champaign

Page 2

Nice to Meet You

Informally: Everything that has to do with constraints (and learning models)

Formally: We typically make decisions based on models such as:

With CCMs we make decisions based on models such as:

We do not define the learning method but we’ll discuss it and make suggestions

CCMs make predictions in the presence/guided by constraints

Page 3

Constrained Conditional Models (CCMs)

y* = argmax_y  w^T f(x, y)

y* = argmax_y  w^T f(x, y) − Σ_{c∈C} ρ_c d(y, 1_C(x))
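To make the decision rule concrete, here is a minimal sketch in Python (my own illustration, not code from the tutorial); the feature map f, the candidate set, and the list of constraints are assumed to be supplied by the caller.

import numpy as np

# Score a candidate output y under a CCM: model score minus weighted constraint violations.
def ccm_score(w, f, x, y, constraints):
    model_score = np.dot(w, f(x, y))
    # Each constraint is a pair (rho, d) where d(x, y) returns its degree of violation.
    penalty = sum(rho * d(x, y) for rho, d in constraints)
    return model_score - penalty

# Brute-force argmax over an explicit candidate list; in practice this is done with ILP or search.
def ccm_predict(w, f, x, candidates, constraints):
    return max(candidates, key=lambda y: ccm_score(w, f, x, y, constraints))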

Constraints Driven Learning and Decision Making

Why Constraints? The goal: building good NLP systems easily. We have prior knowledge at hand.

How can we use it? We suggest that often knowledge can be injected directly. We can use it to guide learning, to improve decision making, and to simplify the models we need to learn.

How useful are constraints? Useful for supervised learning. Useful for semi-supervised learning. Sometimes more efficient than labeling data directly.

Page 5

Inference

Page 6

Comprehension

1. Christopher Robin was born in England. 2. Winnie the Pooh is a title of a book. 3. Christopher Robin’s dad was a magician. 4. Christopher Robin must be at least 65 now.

A process that maintains and updates a collection of propositions about the state of affairs.

(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.

This is an Inference Problem

Page 7

This Tutorial: Constrained Conditional Models

Part 1: Introduction to CCMs Examples:

NE + Relations Information extraction – correcting models with CCMs

First summary: why are CCMs important Problem Setting

Features and Constraints; Some hints about training issues

Part 2: Introduction to Integer Linear Programming What is ILP; use of ILP in NLP

Part 3: Detailed examples of using CCMs Semantic Role Labeling in Details Coreference Resolution Sentence Compression

BREAK

Page 8

This Tutorial: Constrained Conditional Models (2nd part)

Part 4: More on Inference – Other inference algorithms

Search (SRL); Dynamic Programming (Transliteration); Cutting Planes

Using hard and soft constraints Part 5: Training issues when working with CCMs

Formalism (again) Choices of training paradigms -- Tradeoffs

Examples in Supervised learning Examples in Semi-Supervised learning

Part 6: Conclusion Building CCMs Features and Constraints; Objective functions; Different

Learners. Mixed models vs. Joint models; where is Knowledge coming from

THE END

Page 9

Learning and Inference

Global decisions in which several local decisions play a role but there are mutual dependencies on their outcome. E.g. Structured Output Problems – multiple dependent output

variables

(Learned) models/classifiers for different sub-problems In some cases, not all local models can be learned simultaneously Key examples in NLP are Textual Entailment and QA In these cases, constraints may appear only at evaluation time

Incorporate the models’ information, along with prior knowledge/constraints, in making coherent decisions: decisions that respect the local models as well as domain- and context-specific knowledge/constraints.

Page 10

This Tutorial: Constrained Conditional Models

Part 1: Introduction to CCMs Examples:

NE + Relations Information extraction – correcting models with CCMS

First summary: why are CCM important Problem Setting

Features and Constraints; Some hints about training issues

Part 2: Introduction to Integer Linear Programming What is ILP; use of ILP in NLP Semantic Role Labeling in Details

Part 3: Detailed examples of using CCMs Coreference Resolution Sentence Compression

BREAK

Page 11

This Tutorial: Constrained Conditional Models (2nd part)

Part 4: More on Inference – Other inference algorithms

Search (SRL); Dynamic Programming (Transliteration); Cutting Planes

Using hard and soft constraints Part 5: Training issues when working with CCMS

Formalism (again) Choices of training paradigms -- Tradeoffs

Examples in Supervised learning Examples in Semi-Supervised learning

Part 6: Conclusion Building CCMs Features and Constraints; Objective functions; Different

Learners. Mixed models vs. Joint models; where is Knowledge coming from

THE END

Page 12

Inference with General Constraint Structure [Roth&Yih’04]

Recognizing Entities and Relations

Dole ’s wife, Elizabeth , is a native of N.C.

[Figure: entity variables E1–E3 and relation variables R12, R23 over the sentence, each with a table of classifier scores, e.g., per 0.85 / loc 0.10 / other 0.05 for one entity and born_in 0.85 / spouse_of 0.05 / irrelevant 0.10 for one relation]

Improvement over no inference: 2-5%

Some Questions: How to guide the global inference? Why not learn Jointly?

Models could be learned separately; constraints may come up only at decision time.

Note: Non Sequential Model

Page 13

Task of Interests: Structured Output

For each instance, assign values to a set of variables

Output variables depend on each other

Common tasks in Natural language processing

Parsing; Semantic Parsing; Summarization; Transliteration; Co-reference resolution,…

Information extraction Entities, Relations,…

Many pure machine learning approaches exist Hidden Markov Models (HMMs); CRFs Structured Perceptrons and SVMs…

However, …

Page 14

Information Extraction via Hidden Markov Models

Lars Ole Andersen . Program analysis and specialization for the C Programming language. PhD thesis. DIKU , University of Copenhagen, May 1994 .

Prediction result of a trained HMM Lars Ole Andersen . Program analysis and

specialization for the C Programming language

. PhD thesis .DIKU , University of Copenhagen , May1994 .

[AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION]

[DATE]

Unsatisfactory results !

Motivation II

Page 15

Strategies for Improving the Results

(Pure) Machine Learning Approaches Higher Order HMM/CRF? Increasing the window size? Adding a lot of new features

Requires a lot of labeled examples

What if we only have a few labeled examples?

Any other options? Humans can immediately tell bad outputs The output does not make sense

Increasing the model complexity

Can we keep the learned model simple and still make expressive decisions?

Page 16

Information extraction without Prior Knowledge

Prediction result of a trained HMMLars Ole Andersen . Program analysis and

specialization for the C Programming language

. PhD thesis .DIKU , University of Copenhagen , May1994 .

[AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION]

[DATE]

Violates lots of natural constraints!

Lars Ole Andersen . Program analysis and specialization for the C Programming language. PhD thesis. DIKU , University of Copenhagen, May 1994 .

Page 17

Examples of Constraints

Each field must be a consecutive list of words and can appear at most once in a citation.

State transitions must occur on punctuation marks.

The citation can only start with AUTHOR or EDITOR.

The words pp., pages correspond to PAGE. Four digits starting with 20xx and 19xx are DATE. Quotations can appear only in TITLE ……. Easy to express pieces of “knowledge”

Non Propositional; May use Quantifiers

Page 18

Adding constraints, we get correct results! Without changing the model

[AUTHOR] Lars Ole Andersen . [TITLE] Program analysis and specialization

for the C Programming language .

[TECH-REPORT] PhD thesis .[INSTITUTION] DIKU , University of Copenhagen , [DATE] May, 1994 .

Information Extraction with Constraints

Page 19

Random Variables Y:

Conditional Distributions P (learned by models/classifiers) Constraints C– any Boolean function defined over partial assignments (possibly: + weights W )

Goal: Find the “best” assignment The assignment that achieves the highest global

performance. This is an Integer Programming Problem

Problem Setting

[Figure: output variables y1–y8 over the observations, with constraints C(y1, y4) and C(y2, y3, y6, y7, y8)]

Y* = argmax_Y P(Y) subject to constraints C (+ weights W_C)

Page 20

Formal Model

How to solve?

This is an Integer Linear Program

Solving using ILP packages gives an exact solution.

Search techniques are also possible

(Soft) constraints component

Weight Vector for “local” models

Penalty for violating the constraint.

How far y is from a “legal” assignment

Subject to constraints

A collection of Classifiers; Log-linear models (HMM, CRF) or a combination

How to train?

How to decompose the global objective function?

Should we incorporate constraints in the learning process?

Page 21

Features Versus Constraints

φi : X × Y → R;  Ci : X × Y → {0,1};  d: X × Y → R. In principle, constraints and features can encode the same properties; in practice, they are very different.

Features: Local, short distance properties – to allow tractable inference. Propositional (grounded): e.g., true if “the” followed by a Noun occurs in the sentence.

Constraints: Global properties. Quantified, first order logic expressions. E.g., true if all yi’s in the sequence y are assigned different values.

Indeed, used differently

Page 22

Encoding Prior Knowledge

Consider encoding the knowledge that: Entities of type A and B cannot occur simultaneously in a sentence

The “Feature” Way Results in higher order HMM, CRF May require designing a model tailored to

knowledge/constraints Large number of new features: might require more labeled

data Wastes parameters to learn indirectly knowledge we have.

The Constraints Way Keeps the model simple; add expressive constraints

directly A small set of constraints Allows for decision time incorporation of constraints

A form of supervision

Need more training data

Page 23

Constrained Conditional Models – 1st Summary

Everything that has to do with Constraints and Learning models

In both examples, we started with learning models Either for components of the problem

Classifiers for Relations and Entities Or the whole problem

Citations We then included constraints on the output

As a way to “correct” the output of the model In both cases this allows us to

Learn simpler models than we would otherwise As presented, global constraints did not take part in

training Global constraints were used only at the output.

We will later call this training paradigm L+I

Page 24

This Tutorial: Constrained Conditional Models

Part 1: Introduction to CCMs Examples:

NE + Relations Information extraction – correcting models with CCMS

First summary: why are CCM important Problem Setting

Features and Constraints; Some hints about training issues

Part 2: Introduction to Integer Linear Programming What is ILP; use of ILP in NLP Semantic Role Labeling in Details

Part 3: Detailed examples of using CCMs Coreference Resolution Sentence Compression

BREAK

Page 25

This Tutorial: Constrained Conditional Models (2nd part)

Part 4: More on Inference – Other inference algorithms

Search (SRL); Dynamic Programming (Transliteration); Cutting Planes

Using hard and soft constraints Part 5: Training issues when working with CCMS

Formalism (again) Choices of training paradigms -- Tradeoffs

Examples in Supervised learning Examples in Semi-Supervised learning

Part 6: Conclusion Building CCMs Features and Constraints; Objective functions; Different

Learners. Mixed models vs. Joint models; where is Knowledge coming from

THE END

Page 26

Formal Model

How to solve?

This is an Integer Linear Program

Solving using ILP packages gives an exact solution.

Search techniques are also possible

(Soft) constraints component

Weight Vector for “local” models

Penalty for violating the constraint.

How far y is from a “legal” assignment

A collection of Classifiers; Log-linear models (HMM, CRF) or a combination

How to train?

How to decompose the global objective function?

Should we incorporate constraints in the learning process?

Inference with constraints.

We start with adding constraints to existing models.

It’s a good place to start, because conceptually all you do is to add constraints to what you were doing before and the performance improves.

Page 27

Constraints and Integer Linear Programming (ILP)

ILP is powerful (NP-complete). ILP is popular – inference for many models, such as Viterbi for CRFs, has already been implemented as ILP.

Powerful off-the-shelf solvers exist. All we need is to write down the objective function and the constraints; there is no need to write optimization code.

Page 28

Linear Programming

Key contributors: Leonid Kantorovich, George B. Dantzig, John von Neumann.

Optimization technique with linear objective function and linear constraints.

Note that the word “Integer” is absent.

Page 29

Example (Thanks James Clarke)

Telfa Co. produces tables and chairs Each table makes 8$ profit, each chair makes 5$ profit.

We want to maximize the profit.

Page 30

Telfa Co. produces tables and chairs Each table makes 8$ profit, each chair makes 5$ profit. A table requires 1 hour of labor and 9 sq. feet of wood. A chair requires 1 hour of labor and 5 sq. feet of wood. We have only 6 hours of work and 45 sq. feet of wood.

We want to maximize the profit.

Example (Thanks James Clarke)
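As a side illustration (not part of the original tutorial), the LP relaxation of this example can be solved in a few lines; the choice of scipy here is my own assumption.

from scipy.optimize import linprog

# maximize 8*tables + 5*chairs  s.t.  labor: t + c <= 6,  wood: 9t + 5c <= 45,  t, c >= 0
res = linprog(c=[-8, -5],                  # linprog minimizes, so negate the profits
              A_ub=[[1, 1], [9, 5]],       # labor-hours and wood constraints
              b_ub=[6, 45],
              bounds=[(0, None), (0, None)])
print(res.x, -res.fun)                     # LP optimum is fractional: (3.75, 2.25), profit 41.25

The fractional optimum is exactly why the following slides move to integer solutions.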

Page 31

Solving LP problems.

Page 32

Solving LP problems.

Page 33

Solving LP problems.

Page 34

Solving LP problems.

Page 35

Solving LP problems

Page 36

Solving LP Problems.

Page 37

Integer Linear Programming- integer solutions.

Page 38

Integer Linear Programming.

In NLP, we are dealing with discrete outputs, therefore we’re almost always interested in integer solutions.

ILP is NP-complete, but often efficient for large NLP problems.

In some cases, the solutions to the LP relaxation are integral (e.g., when the constraint matrix is totally unimodular). Next, we show an example of using ILP with constraints; the matrix is not totally unimodular, but the LP gives integral solutions.

Page 39

Page 40

Back to Example: Recognizing Entities and Relations

Dole ’s wife, Elizabeth , is a native of N.C.

[Figure: entity variables E1–E3 and relation variables R12, R23 with their classifier score tables, as before]

Page 41

Back to Example: Recognizing Entities and Relations

Dole ’s wife, Elizabeth , is a native of N.C.

[Figure: entity variables E1–E3 and relation variables R12, R23 with their classifier score tables, as before]

NLP with ILP – key issues: 1) Write down the objective function. 2) Write down the constraints as linear inequalities.

Page 42

Back to Example: Recognizing Entities and Relations

Dole ’s wife, Elizabeth , is a native of N.C.

[Figure: entity variables E1–E3 and relation variables R12, R23 with their classifier score tables, as before]

Page 43

Back to Example: cost function

[Figure: the classifier score tables for the entity and relation candidates, as before]

x{E1 = per}, x{E1 = loc}, …, x{R12 = spouse_of}, x{R12 = born_in}, …, x{R12 = } , … ∈ {0,1}

Relation variables: R12, R21, R23, R32, R13, R31. Entity variables: E1 = Dole, E2 = Elizabeth, E3 = N.C.

Page 44

Back to Example: cost function

[Figure: the classifier score tables for the entity and relation candidates, as before]

Cost function:

c{E1 = per}· x{E1 = per} + c{E1 = loc}· x{E1 = loc} + … + c{R12 = spouse_of}· x{R12 = spouse_of} + … + c{R12 = }· x{R12 = } + …

Relation variables: R12, R21, R23, R32, R13, R31. Entity variables: E1 = Dole, E2 = Elizabeth, E3 = N.C.

Adding Constraints

Each entity is either a person, organization or location:
x{E1 = per} + x{E1 = loc} + x{E1 = org} + x{E1 = } = 1

(R12 = spouse_of) ⇒ (E1 = person) ∧ (E2 = person):
x{R12 = spouse_of} ≤ x{E1 = per}
x{R12 = spouse_of} ≤ x{E2 = per}

We need more consistency constraints. Any Boolean constraint can be written as a set of linear inequalities, and an efficient algorithm exists [Rizzolo & Roth ’07].
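Below is a sketch of this ILP using PuLP, an assumed open-source solver interface (the tutorial itself relied on solvers such as Xpress-MP). The scores are read loosely off the figure and are only illustrative, as are the two consistency constraints.

from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary

ent_labels = ["per", "loc", "other"]
rel_labels = ["spouse_of", "born_in", "irrelevant"]
ent_scores = {"E1": {"per": 0.85, "loc": 0.10, "other": 0.05},
              "E2": {"per": 0.60, "loc": 0.30, "other": 0.10},
              "E3": {"per": 0.50, "loc": 0.45, "other": 0.05}}
rel_scores = {("E1", "E2"): {"spouse_of": 0.45, "born_in": 0.50, "irrelevant": 0.05},
              ("E1", "E3"): {"spouse_of": 0.05, "born_in": 0.85, "irrelevant": 0.10}}

prob = LpProblem("entities_and_relations", LpMaximize)
x = {(e, l): LpVariable(f"x_{e}_{l}", cat=LpBinary) for e in ent_scores for l in ent_labels}
r = {(p, l): LpVariable(f"r_{p[0]}_{p[1]}_{l}", cat=LpBinary) for p in rel_scores for l in rel_labels}

# Objective: sum of classifier score times indicator over all entity and relation assignments.
prob += (lpSum(ent_scores[e][l] * x[e, l] for e in ent_scores for l in ent_labels)
         + lpSum(rel_scores[p][l] * r[p, l] for p in rel_scores for l in rel_labels))

# Each entity and each relation takes exactly one label.
for e in ent_scores:
    prob += lpSum(x[e, l] for l in ent_labels) == 1
for p in rel_scores:
    prob += lpSum(r[p, l] for l in rel_labels) == 1

# Consistency: spouse_of needs two persons; born_in needs a person and a location.
for (ei, ej) in rel_scores:
    prob += r[(ei, ej), "spouse_of"] <= x[ei, "per"]
    prob += r[(ei, ej), "spouse_of"] <= x[ej, "per"]
    prob += r[(ei, ej), "born_in"] <= x[ei, "per"]
    prob += r[(ei, ej), "born_in"] <= x[ej, "loc"]

prob.solve()
print({k: v.value() for k, v in x.items() if v.value() == 1})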

Page 45

CCM for NE-Relations

We showed a CCM formulation of the NE-relation problem.

In this case, the cost of each variable was learned independently, using trained classifiers.

Other expressive problems can be formulated as Integer Linear Programs. For example, HMM/CRF inference, Viterbi

algorithms

Page 46


TSP as an Integer Linear Program

Page 47

Dantzig et al. were the first to suggest a practical solution to the problem using ILP.

Page 48

Reduction from Traveling Salesman to ILP

[Figure: a directed graph with nodes 1, 2, 3, edge indicator variables x_ij, and edge costs c_ij]

Every node has ONE outgoing edge.

Page 49

Reduction from Traveling Salesman to ILP

Every node has ONE incoming edge.

[Figure: the same graph]

Page 50

Reduction from Traveling Salesman to ILP

The solutions are binary (integer).

[Figure: the same graph]

Page 51

Reduction from Traveling Salesman to ILP

No subgraph contains a cycle.

[Figure: the same graph]

Viterbi as an Integer Linear Program

Page 52

[Figure: an HMM with hidden states y0 … yN and observations x0 … xN]

Viterbi with ILP

Page 53

[Figure: building the Viterbi lattice; the start node S connects to each state y01, y02, …, y0M; the edge to y0 = 1 has weight -Log{P(x0|y0=1) P(y0=1)}]

Viterbi with ILP

Page 54

[Figure: as above; the edge from S to y0 = 2 has weight -Log{P(x0|y0=2) P(y0=2)}]

Viterbi with ILP

Page 55

[Figure: as above; the edge from S to y0 = M has weight -Log{P(x0|y0=M) P(y0=M)}]

Viterbi with ILP

Page 56

[Figure: edges between consecutive layers of the lattice, e.g., an edge from y0 = 1 to y1 = 1 weighted -Log{P(x2|y1=1) P(y1=1|y0=1)}]

Viterbi with ILP

Page 57

[Figure: as above, with another inter-layer edge weighted -Log{P(x2|y1=2) P(y1=1|y0=1)}]

Viterbi with ILP

Page 58

[Figure: the complete lattice from start node S through state layers y0 … yN (states 1 … M in each layer) to target node T]

Viterbi with ILP

Page 59

[Figure: the complete S-to-T lattice, as above]

Viterbi = shortest path (S → T), computed with dynamic programming. We saw the TSP as an ILP. Now we show: shortest path as an ILP.

Shortest path with ILP.

For each edge (u,v), xuv =1 if (u,v) is on the shortest path, 0 otherwise.

cuv = the weight of the edge.

Page 60

Shortest path with ILP.

For each edge (u,v), xuv =1 if (u,v) is on the shortest path, 0 otherwise.

cuv = the weight of the edge.

Page 61

Shortest path with ILP.

For each edge (u,v), xuv =1 if (u,v) is on the shortest path, 0 otherwise.

cuv = the weight of the edge.

Page 62

All nodes except the S and the T have eq. number of ingoing and outgoing edges

Shortest path with ILP.

For each edge (u,v), xuv =1 if (u,v) is on the shortest path, 0 otherwise.

cuv = the weight of the edge.

Page 63

All nodes except the S and the T have eq. number of ingoing and outgoing edges

The source node has one outgoing edge more than ingoing edges.

Shortest path with ILP.

For each edge (u,v), xuv =1 if (u,v) is on the shortest path, 0 otherwise.

cuv = the weight of the edge.

Page 64

All nodes except the S and the T have eq. number of ingoing and outgoing edges

The source node has one outgoing edge more than ingoing edges.

The target node has one ingoing edge more than outgoing edges.

Shortest path with ILP.

For each edge (u,v), xuv =1 if (u,v) is on the shortest path, 0 otherwise.

cuv = the weight of the edge.
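Putting the three conditions above together, the shortest-path ILP can be written as follows (a standard formulation, in my notation):

\min \sum_{(u,v) \in E} c_{uv} \, x_{uv}
\text{s.t. } x_{uv} \in \{0,1\} \quad \forall (u,v) \in E
\sum_{u} x_{uv} = \sum_{w} x_{vw} \quad \forall v \neq S, T \quad (flow conservation)
\sum_{w} x_{Sw} - \sum_{u} x_{uS} = 1 \quad (source)
\sum_{u} x_{uT} - \sum_{w} x_{Tw} = 1 \quad (target)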

Page 65

Interestingly, in this formulation, the solution to the general LP problem will have integer solutions. Thus, we have a reasonably fast alternative to Viterbi with ILP.

Who cares that Viterbi is an ILP?

Assume you want to learn an HMM/CRF model (e.g., Extracting fields from citations (IE))

But, you also want to add expressive constraints: No field can appear twice in the data. The fields <journal> and <tech-report> are mutually

exclusive. The field <author> must appear at least once.

Do: Learn HMM/CRF Convert to an LP Modify the LP canonical constraint matrix

A concrete example for adding constraints over CRF will be shown

Page 67

This Tutorial: Constrained Conditional Models

Part 1: Introduction to CCMs Examples:

NE + Relations Information extraction – correcting models with CCMS

First summary: why are CCM important Problem Setting

Features and Constraints; Some hints about training issues

Part 2: Introduction to Integer Linear Programming What is ILP; use of ILP in NLP

Part 3: Detailed examples of using CCMs Semantic Role Labeling in Details Coreference Resolution Sentence Compression

BREAK

Page 68

This Tutorial: Constrained Conditional Models (2nd part)

Part 4: More on Inference – Other inference algorithms

Search (SRL); Dynamic Programming (Transliteration); Cutting Planes

Using hard and soft constraints Part 5: Training issues when working with CCMS

Formalism (again) Choices of training paradigms -- Tradeoffs

Examples in Supervised learning Examples in Semi-Supervised learning

Part 6: Conclusion Building CCMs Features and Constraints; Objective functions; Different

Learners. Mixed models vs. Joint models; where is Knowledge coming from

THE END

CCM Examples

Many works in NLP make use of constrained conditional models, implicitly or explicitly.

Next we describe three examples in detail.

Example 2: Semantic Role Labeling. The use of inference with constraints to improve

semantic parsing Example 3 (Co-ref):

combining classifiers through objective function outperforms pipeline.

Example 4 (Sentence Compression): Simple language model with constraints

outperforms complex models.

Page 69

Example 2: Semantic Role Labeling

Demo:http://L2R.cs.uiuc.edu/~cogcomp

Approach: 1) reveals several relations; 2) produces a very good semantic parser (F1 ~ 90%); 3) easy and fast: ~7 sentences/sec (using Xpress-MP).

Top ranked system in the CoNLL’05 shared task. The key difference is the inference.

Who did what to whom, when, where, why,…

Simple sentence:

I left my pearls to my daughter in my will .

[I]A0 left [my pearls]A1 [to my daughter]A2 [in my will]AM-LOC .

A0: Leaver; A1: Things left; A2: Benefactor; AM-LOC: Location

I left my pearls to my daughter in my will .

Page 72

SRL Dataset

PropBank [Palmer et. al. 05]

Core arguments: A0-A5 and AA different semantics for each verb specified in the PropBank Frame files

13 types of adjuncts labeled as AM-arg where arg specifies the adjunct type

In this problem, all the information is given, but conceptually, we could train different components on different resources.

Algorithmic Approach

Identify argument candidates Pruning [Xue&Palmer, EMNLP’04] Argument Identifier

Binary classification (SNoW) Classify argument candidates

Argument Classifier Multi-class classification (SNoW)

Inference Use the estimated probability distribution

given by the argument classifier Use structural and linguistic constraints Infer the optimal global output

I left my nice pearls to her

[Figure: candidate argument boundaries bracketed over the sentence; identifying candidate arguments is well understood, and inference is performed over the (old and new) vocabulary of candidate arguments]

Page 74

Learning

Both argument identifier and argument classifier are trained using SNoW. Sparse network of linear functions Multiclass classifier that produces a probability

distribution over output values.

Features (some examples) Voice, phrase type, head word, parse tree path

from predicate, chunk sequence, syntactic frame, …

Conjunction of features

Page 75

Inference

The output of the argument classifier often violates some constraints, especially when the sentence is long.

Finding the best legitimate output is formalized as an optimization problem and solved via Integer Linear Programming (ILP).

Input: The scores from the argument classifier Structural and linguistic constraints

ILP allows incorporating expressive (non-sequential) constraints on the variables (the arguments types).

Page 76

Semantic Role Labeling (SRL)

I left my pearls to my daughter in my will .

[Figure: candidate arguments for the sentence, each with a distribution of classifier scores over the argument labels (e.g., 0.5, 0.15, 0.15, 0.1, 0.1)]

Page 77

Semantic Role Labeling (SRL)

I left my pearls to my daughter in my will .

[Figure: the same candidate-argument score tables]

Page 78

Semantic Role Labeling (SRL)

I left my pearls to my daughter in my will .

[Figure: the same candidate-argument score tables]

One inference problem for each verb predicate.

Page 79

Integer Linear Programming Inference

For each argument a_i and label t, set up a Boolean variable a_{i,t} indicating whether a_i is classified as t.

The goal is to maximize Σ_{i,t} score(a_i = t) · a_{i,t}

subject to the (linear) constraints.

If score(a_i = t) = P(a_i = t), the objective is to find the assignment that maximizes the expected number of arguments that are correct and satisfies the constraints.

Page 80

No duplicate argument classes:
Σ_{a ∈ POTARG} x{a = A0} ≤ 1

R-ARG (if there is an R-ARG phrase, there is an ARG phrase):
∀ a2 ∈ POTARG:  Σ_{a ∈ POTARG} x{a = A0} ≥ x{a2 = R-A0}

C-ARG (if there is a C-ARG phrase, there is an ARG phrase before it):
∀ a2 ∈ POTARG:  Σ_{a ∈ POTARG, a before a2} x{a = A0} ≥ x{a2 = C-A0}

Many other possible constraints: unique labels; no overlapping or embedding; relations between the number of arguments; order constraints; if the verb is of type A, no argument of type B.

Any Boolean rule can be encoded as a set of linear inequalities.

Constraints

Joint inference can be used also to combine different SRL Systems.

Universally quantified rules: LBJ allows a developer to encode constraints in FOL; these are compiled into linear inequalities automatically.
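As a worked illustration (my own, not from the slides), here is how a few common Boolean rules over indicator variables x_A, x_B, x_C ∈ {0,1} become linear inequalities:

A ⇒ B:                        x_A ≤ x_B        (equivalently, (1 − x_A) + x_B ≥ 1)
A ⇒ B ∧ C:                    x_A ≤ x_B  and  x_A ≤ x_C
A ∨ B:                        x_A + x_B ≥ 1
exactly one of A_1, …, A_k:   Σ_i x_{A_i} = 1
at most one of A_1, …, A_k:   Σ_i x_{A_i} ≤ 1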

Page 81

Summary: Semantic Role Labeling

Demo:http://L2R.cs.uiuc.edu/~cogcomp

Who did what to whom, when, where, why,…

Example 3: co-ref (thanks Pascal Denis)

Page 82

Example 1: co-ref (thanks Pascal Denis)

Page 83

Traditional approach: train a classifier which predicts the probability that two named entities co-refer:

Pco-ref(Clinton, NPR) = 0.2 (order sensitive!)Pco-ref(Lewinsky, her) = 0.6

Decision Process: E.g., If Pco-ref(s1, s2) > 0.5, label two entities as co-referent

or, clustering

Example 1: co-ref (thanks Pascal Denis)

Page 84

Traditional approach: train a classifier which predicts the probability that two named entities co-refer:

Pco-ref(Clinton, NPR) = 0.2Pco-ref(Lewinsky, her) = 0.6

Decision Process: E.g., If Pco-ref(s1, s2) > 0.5, label two entities as co-referent

or, clustering

Evaluation:Cluster all the entities based on co-ref classifier predictions. The evaluation is based on cluster overlap.

Example 1: co-ref (thanks Pascal Denis)

Page 85

Two types of entities: “Base entities”“Anaphors” (pointers)

NPR

Example 1: co-ref (thanks Pascal Denis)

Page 86

Error analysis: 1)“Base entities” that “point” to anaphors.2)Anaphors that don’t “point” to anything.

Proposed solution.

Page 87

Pipelined approach: Identify anaphors.Forbid links of certain types.

NPR

Proposed solution.

Page 88

Doesn’t work, despite 80% accuracy of the anaphoricity classifier.

The reason: error propagation in the pipeline.

NPR

Joint Solution with CCMs

Page 89

Joint solution using CCMs

Page 90

Joint solution using CCMs

Page 91

This performs considerably better than the pipelined approach, and actually improves (though marginally) over the baseline without an anaphoricity classifier.

Joint solution using CCMs

Page 92

Note: if we have reason to trust one of the classifiers significantly more than the other, we can scale the two components with tuned parameters α and β.

Page 93

Example 4 - Sentence Compression (thanks James Clarke).

Page 94

Example

Example:

Original sentence (word indices 0–8): Big fish eat small fish in a small pond

Compression: Big fish in a pond

[Figure: binary decision variables indicating which words are kept]

Page 95

Language model-based compression

Page 96

Example 2 : summarization (thanks James Clarke)

This formulation requires some additional constraints.

Big fish eat small fish in a small pond

No selection of decision variables can make these trigrams appear consecutively in output.

We skip these constraints here.

Page 97

Trigram model in action

Page 98

Modifier Constraints

Page 99

Example

Page 100

Example

Page 101

Sentential Constraints

Page 102

Example

Page 103

Example

Page 104

More constraints

Other CCM Examples: Opinion Recognition

Y. Choi, E. Breck, and C. Cardie. Joint Extraction of Entities and Relations for Opinion Recognition EMNLP-2006

Semantic parsing variation: Agent=entity Relation=opinion

Constraints: an agent can have at most two opinions; an opinion should be linked to only one agent; the usual non-overlap constraints.

Page 105

Other CCM Examples: Temporal Ordering

N. Chambers and D. Jurafsky. Jointly Combining Implicit Constraints Improves Temporal Ordering. EMNLP-2008.

Page 106

Other CCM Examples: Temporal Ordering

N. Chambers and D. Jurafsky. Jointly Combining Implicit Constraints Improves Temporal Ordering. EMNLP-2008.

Page 107

Three types of edges:1)Annotation relations before/after2)Transitive closure constraints3)Time normalization constraints

Related Work: Language generation.

Regina Barzilay and Mirella Lapata. Aggregation via Set Partitioning for Natural Language Generation. HLT-NAACL-2006.

Constraints: Transitivity: if (ei, ej) were aggregated and (ej, ek) were too, then (ei, ek) gets aggregated. Maximum number of facts aggregated; maximum sentence length.

Page 108

MT & Alignment

Ulrich Germann, Mike Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. Fast decoding and optimal decoding for machine translation. ACL 2001.

John DeNero and Dan Klein. The Complexity of Phrase Alignment Problems. ACL-HLT-2008.

Page 109

Summary of Examples

We have shown several different NLP solutions that make use of CCMs.

Examples vary in the way models are learned.

In all cases, constraints can be expressed in a high level language, and then transformed into linear inequalities.

Learning Based Java (LBJ) [Rizzolo & Roth ’07] describes an automatic way to compile high-level descriptions of constraints into linear inequalities.

Page 110

Page 111

Solvers

All applications presented so far used ILP for inference.

People used different solvers: Xpress-MP, GLPK, lpsolve, R, Mosek, CPLEX.

Next we discuss other inference approaches to CCMs

112

This Tutorial: Constrained Conditional Models (2nd part)

Part 4: More on Inference – Other inference algorithms

Search (SRL); Dynamic Programming (Transliteration); Cutting Planes

Using hard and soft constraints Part 5: Training issues when working with CCMS

Formalism (again) Choices of training paradigms -- Tradeoffs

Examples in Supervised learning Examples in Semi-Supervised learning

Part 6: Conclusion Building CCMs Features and Constraints; Objective functions; Different

Learners. Mixed models vs. Joint models; where is Knowledge coming from

THE END

113

Where Are We ?

We hope we have already convinced you that Using constraints is a good idea for addressing NLP problems Constrained conditional models provide a good platform

We were talking about using expressive constraints To improve existing models Learning + Inference The problem: inference

A powerful inference tool: Integer Linear Programming SRL, co-ref, summarization, entity-and-relation… Easy to inject domain knowledge

114

Constrained Conditional Model : Inference

How to solve?

This is an Integer Linear Program

Solving using ILP packages gives an exact solution.

Search techniques are also possible

(Soft) constraints component

Weight Vector for “local” models

Constraint violation penalty

How far y is from a “legal” assignment

A collection of Classifiers; Log-linear models (HMM, CRF) or a combination

How to train?

How to decompose the global objective function?

Should we incorporate constraints in the learning process?

115

Advantages of ILP Solvers: Review

ILP is Expressive: We can solve many inference problems Converting inference problems into ILP is easy

ILP is Easy to Use: Many available packages (Open Source Packages): LPSolve, GLPK, … (Commercial Packages): XPressMP, Cplex No need to write optimization code!

Why should we consider other inference options?

116

ILP: Speed Can Be an Issue

Inference problems in NLP Sometimes large problems are actually easy for ILP

E.g. Entities-Relations Many of them are not “difficult”

When ILP isn’t fast enough, one needs to resort to approximate solutions.

The Problem: General Solvers vs. Specific Solvers ILP is a very, very general solver But, sometimes the structure of the problem allows for

simpler inference algorithms. Next we give examples for both cases.

117

Example 1: Search based Inference for SRL

The objective function

Constraints Unique labels No overlapping or embedding If verb is of type A, no argument of type B …

Intuition: check constraints’ violations on partial assignments

Maximize the summation of the scores subject to linguistic constraints:

max Σ_{i,j} c_{ij} · x_{ij}

where x_{ij} is an indicator variable that assigns the j-th class to the i-th token, and c_{ij} is the classification confidence.

118

Inference using Beam Search

For each step, discard partial assignments that violate constraints!

Shape: argument

Color: label

Beam size = 2. Constraint: only one Red.

Rank them according to classification confidence!
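A minimal sketch of such constrained beam search (illustrative only; the scores are per-position classifier confidences, and each constraint is a predicate on a partial assignment):

def beam_search(scores, constraints, beam_size=2):
    """scores: list of dicts, scores[i][label] = confidence of assigning `label` at position i."""
    beam = [((), 0.0)]                                    # (partial assignment, accumulated score)
    for position_scores in scores:
        expanded = [(prefix + (label,), total + s)
                    for prefix, total in beam
                    for label, s in position_scores.items()]
        # Discard partial assignments that violate any constraint.
        feasible = [cand for cand in expanded if all(ok(cand[0]) for ok in constraints)]
        # If everything was pruned, drop the constraints for this step (cf. "heuristic inference" below).
        pool = feasible or expanded
        # Rank by classification confidence and keep the top candidates.
        beam = sorted(pool, key=lambda c: c[1], reverse=True)[:beam_size]
    return beam[0]

# The slide's example constraint: at most one "Red" label in the sequence.
at_most_one_red = lambda prefix: prefix.count("Red") <= 1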

119

Heuristic Inference

Problems of heuristic inference Problem 1: Possibly, sub-optimal solution Problem 2: May not find a feasible solution

Drop some constraints, solve it again

Using search on SRL gives comparable results to using ILP, but is much faster.

120

How to get a score for the pair? Previous approaches:

Extract features for each source and target entity pair

The CCM approach: Introduce an internal structure (characters) Constrain character mappings to “make

sense”.

Example 2: Exploiting Structure in Inference: Transliteration

121

Transliteration Discovery with CCM

The problem now: inference How to find the best mapping that satisfies the

constraints?

A weight is assigned to each edge.

Include it or not? A binary decision.

Score = sum of the mappings’ weight

Assume the weights are given (more on this later).

Natural constraints: pronunciation constraints, one-to-one, non-crossing, …

Score = sum of the mappings’ weights, s.t. the mapping satisfies the constraints.

122

Finding The Best Character Mappings

An Integer Linear Programming Problem

Is this the best inference algorithm?

maximize Σ_{i∈S, j∈T} c_{ij} · x_{ij}        (maximize the mapping score)

subject to:
x_{ij} ∈ {0, 1}                                              (each mapping is a binary decision)
x_{ij} = 0 for character pairs (i, j) that are not allowed   (pronunciation constraint)
Σ_j x_{ij} ≤ 1 for every source character i                  (one-to-one constraint)
x_{ij} + x_{km} ≤ 1 whenever the pairs (i, j) and (k, m) cross   (non-crossing constraint)

123

Finding The Best Character Mappings

A Dynamic Programming Algorithm

Exact and fast!

Maximize the mapping score

Restricted mapping constraints

One-to-one constraint

Non-crossing constraint

Take Home Message:

Although ILP can solve most problems, the fastest inference algorithm depends on the constraints and can be simpler

We can decompose the inference problem into two parts
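A sketch of that dynamic program (my own rendering; w[i][j] is the given weight for mapping source character i to target character j, with disallowed pairs set to a very negative value to enforce the restricted-mapping constraint):

def best_mapping_score(w):
    """Best one-to-one, non-crossing character mapping (alignment-style DP)."""
    n, m = len(w), len(w[0]) if w else 0
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j],                         # leave source char i unmapped
                             best[i][j - 1],                         # leave target char j unmapped
                             best[i - 1][j - 1] + w[i - 1][j - 1])   # map i to j
    return best[n][m]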

124

Other Inference Options

Constraint Relaxation Strategies Try Linear Programming

[Roth and Yih, ICML 2005] Cutting plane algorithms do not use all constraints at

first Dependency Parsing: Exponential number of constraints [Riedel and Clarke, EMNLP 2006]

Other search algorithms A-star, Hill Climbing… Gibbs Sampling Inference [Finkel et. al, ACL 2005]

Named Entity Recognition: enforce long distance constraints

Can be considered as : Learning + Inference One type of constraints only

125

Inference Methods – Summary

Why ILP? A powerful way to formalize the problems However, not necessarily the best algorithmic

solution

Heuristic inference algorithms are useful sometimes! Beam search Other approaches: annealing …

Sometimes, a specific inference algorithm can be designed According to your constraints

126

This Tutorial: Constrained Conditional Models (2nd part)

Part 4: More on Inference – Other inference algorithms

Search (SRL); Dynamic Programming (Transliteration); Cutting Planes

Using hard and soft constraints Part 5: Training issues when working with CCMS

Formalism (again) Choices of training paradigms -- Tradeoffs

Examples in Supervised learning Examples in Semi-Supervised learning

Part 6: Conclusion Building CCMs Features and Constraints; Objective functions; Different

Learners. Mixed models vs. Joint models; where is Knowledge coming from

THE END

127

Constrained Conditional Model: Training

How to solve?

This is an Integer Linear Program

Solving using ILP packages gives an exact solution.

Search techniques are also possible

(Soft) constraints component

Weight Vector for “local” models

Constraint violation penalty

How far y is from a “legal” assignment

A collection of Classifiers; Log-linear models (HMM, CRF) or a combination

How to train?

How to decompose the global objective function?

Should we incorporate constraints in the learning process?

128

Where are we?

Algorithmic Approach: Incorporating general constraints Showed that CCMs allow for formalizing many problems Showed several algorithmic ways to incorporate global

constraints in the decision.

Training: Coupling vs. Decoupling Training and Inference. Incorporating global constraints is important but Should it be done only at evaluation time or also at

training time? How to decompose the objective function and train in

parts? Issues related to:

Modularity, efficiency and performance, availability of training data

Problem specific considerations

129

Training in the presence of Constraints

General Training Paradigm: First Term: Learning from data (could be further

decomposed) Second Term: Guiding the model by constraints Can choose if constraints’ weights trained, when

and how, or taken into account only in evaluation.

Decompose Model (SRL case) Decompose Model from constraints

130

Comparing Training Methods

Option 1: Learning + Inference (with Constraints) Ignore constraints during training

Option 2: Inference (with Constraints) Based Training Consider constraints during training

In both cases: Decision Making with Constraints

Question: Isn’t Option 2 always better?

Not so simple… Next, the “Local model story”

131


Training Methods

[Figure: inputs x1–x7 feeding local models f1(x)–f5(x) that predict outputs y1–y5]

Learning + Inference (L+I): learn the models independently. Inference Based Training (IBT): learn all the models together!

Intuition: learning with constraints may make learning more difficult.

132

Y’ = (-1, 1, 1, 1, 1): local predictions

Training with Constraints. Example: Perceptron-based Global Learning

[Figure: inputs x1–x7, local models f1(x)–f5(x), and the output variables Y]

Y = (-1, 1, 1, -1, -1): true global labeling

Y’ = (-1, 1, 1, 1, -1): after applying the constraints

Which one is better? When and Why?

133

Claims [Punyakanok et al., IJCAI 2005]

The theory applies to the case of local models (no Y in the features).

When the local models are “easy” to learn, L+I outperforms IBT. In many applications, the components are identifiable and easy to learn (e.g., argument, open-close, PER). Only when the local problems become difficult to solve in isolation does IBT outperform L+I, but it needs a larger number of training examples.

Other training paradigms are possible. Pipeline-like sequential models [Roth, Small, Titov: AI&Stat’09]: identify a preferred ordering among components; learn the k-th model jointly with the previously learned models.

L+I: cheaper computationally; modular. IBT is better in the limit and in other extreme cases.

134

Bound prediction:

Local ≤ ε_opt + ( (d log m + log 1/δ) / m )^{1/2}

Global ≤ 0 + ( (c·d log m + c²·d + log 1/δ) / m )^{1/2}

[Figure: bounds on simulated data for ε_opt = 0, 0.1, 0.2]

L+I vs. IBT: the more identifiable individual problems are, the better overall performance is with L+I

(ε_opt is an indication of the hardness of the problem.)

135

Relative Merits: SRL

[Figure: relative performance on SRL as a function of the difficulty of the learning problem (# features), from easy to hard; L+I is better in the easier regime]

When the problem is artificially made harder, the tradeoff is clearer.

In some cases problems are hard due to lack of training data.

Semi-supervised learning

136

L+I & IBT: General View – Structured Perceptron

For each iteration
  For each (X, YGOLD) in the training data
    YPRED = argmax_Y λ · F(X, Y)   (whether constraints are applied in this argmax during training is the difference between L+I and IBT)
    If YPRED != YGOLD
      λ = λ + F(X, YGOLD) − F(X, YPRED)
    endif
  endfor

The theory applies when F(x, y) = F(x).
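A compact sketch of this training loop (illustrative; the feature map F and the `inference` routine are assumed to be supplied by the caller):

import numpy as np

def train_structured_perceptron(data, F, inference, dim, epochs=10):
    """IBT when `inference` applies the constraints during training;
    L+I when it ignores them and constraints are used only at prediction time."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_gold in data:
            y_pred = inference(x, w)               # argmax_y w . F(x, y), with or without constraints
            if y_pred != y_gold:
                w += F(x, y_gold) - F(x, y_pred)   # standard perceptron update
    return w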

137

Comparing Training Methods (Cont.)

Local Models (train independently) vs. Structured Models In many cases, structured models might be better due to

expressivity But, what if we use constraints? Local Models+Constraints vs. Structured Models

+Constraints Hard to tell: Constraints are expressive For tractability reasons, structured models have less expressivity

than the use of constraints. Local can be better, because local models are easier to learn

Decompose Model (SRL case) Decompose Model from constraints

138

Example: Semantic Role Labeling Revisited

[Figure: a lattice from s to t with states A, B, C at each position]

Sequential Models: Conditional Random Field; global perceptron. Training: sentence based. Testing: find the shortest path, with constraints.

Local Models: Logistic Regression; local Avg. Perceptron. Training: token based. Testing: find the best assignment locally, with constraints.

139

Model:           CRF     CRF-D   CRF-IBT   Avg. P
Baseline:        66.46   69.14   58.15
+ Constraints:   71.94   73.91   69.82     74.49
Training Time:   48      38      145       0.8

Which Model is Better? Semantic Role Labeling

Experiments on SRL [Roth and Yih, ICML 2005]. Story: inject constraints into conditional random field models.

Sequential Models (L+I, IBT) vs. Local Models (L+I).

Without constraints: Sequential Models are better than Local Models!

With constraints: Local Models are now better than Sequential Models!

140

Summary: Training Methods

Many choices for training a CCM Learning + Inference (Training without constraints) Inference based Learning (Training with constraints)

Based on this, what kind of models do you want to use?

Advantages of L+I Require fewer training examples More efficient; most of the time, better performance Modularity; easier to incorporate already learned models.

141

Constrained Conditional Model: Soft Constraints

(3) How to solve?

This is an Integer Linear Program

Solving using ILP packages gives an exact solution.

Search techniques are also possible

(Soft) constraints component

Constraint violation penalty

How far y is from a “legal” assignment

Subject to constraints

(4) How to train?

How to decompose the global objective function?

Should we incorporate constraints in the learning process?

(1) Why use soft constraint?

(2) How to model “degree of violations”

142

(1) Why Are Soft Constraints Important?

Some constraints may be violated by the data.

Even when the gold data violates no constraints, the model may prefer illegal solutions. We want a way to rank solutions based on the

level of constraints’ violation. Important when beam search is used

Working with soft constraints Need to define the degree of violation Need to assign penalties for constraints

143

Example: Information extraction

Prediction result of a trained HMMLars Ole Andersen . Program analysis and

specialization for the C Programming language

. PhD thesis .DIKU , University of Copenhagen , May1994 .

[AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION]

[DATE]

Violates lots of natural constraints!

Lars Ole Andersen . Program analysis and specialization for the C Programming language. PhD thesis. DIKU , University of Copenhagen, May 1994 .

144

Examples of Constraints

Each field must be a consecutive list of words and can appear at most once in a citation.

State transitions must occur on punctuation marks.

The citation can only start with AUTHOR or EDITOR.

The words pp., pages correspond to PAGE. Four digits starting with 20xx and 19xx are DATE. Quotations can appear only in TITLE …….

145

(2) Modeling Constraints’ Degree of Violation

Example constraint: state transitions must occur on punctuation marks, i.e., y_{i-1} ≠ y_i ⇒ x_i is a punctuation mark.

Lars Ole Andersen .
AUTH AUTH EDITOR EDITOR
Φc(y1)=0  Φc(y2)=0  Φc(y3)=1  Φc(y4)=0      ∑Φc(yi) = 1

Lars Ole Andersen .
AUTH BOOK EDITOR EDITOR
Φc(y1)=0  Φc(y2)=1  Φc(y3)=1  Φc(y4)=0      ∑Φc(yi) = 2

Φc(y_N) = 1 if assigning y_N to x_N violates the constraint C with respect to the assignment (x_1, …, x_{N-1}; y_1, …, y_{N-1}), and 0 otherwise.

Count how many times the constraint is violated: d(y, 1_C(X)) = ∑_i Φc(y_i)
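Concretely, this counting approximation can be computed as in the small sketch below (my own code; the punctuation set is an assumption):

PUNCT = {".", ",", ";", ":"}

def degree_of_violation(x, y):
    """d(y, 1_C(x)): number of label transitions that do not occur on a punctuation mark."""
    return sum(1 for i in range(1, len(y))
               if y[i] != y[i - 1] and x[i] not in PUNCT)

x = ["Lars", "Ole", "Andersen", "."]
print(degree_of_violation(x, ["AUTH", "AUTH", "EDITOR", "EDITOR"]))  # 1
print(degree_of_violation(x, ["AUTH", "BOOK", "EDITOR", "EDITOR"]))  # 2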

146

Example: Degree of Violation?

It might be the case that all “good” assignments violate some constraints; we then lose the ability to judge which assignment is better. Better option: choose the one with fewer violations!

Lars Ole Andersen .

AUTH BOOK EDITOR EDITOR

Φc(y1)=0 Φc(y2)=1 Φc(y3)=1 Φc(y4)=0

Lars Ole Andersen .

AUTH AUTH EDITOR EDITOR

Φc(y1)=0 Φc(y2)=0 Φc(y3)=1 Φc(y4)=0 The first one is better because of d(y,1c(X))!

147

Hard Constraints vs. Weighted Constraints

Hard constraints: when the constraints are close to perfect. Weighted constraints: when the labeled data might not follow the constraints.

148

(3) Constrained Conditional Model with Soft Constraints

How to solve?

This is an Integer Linear Program

Solving using ILP packages gives an exact solution.

Search techniques are also possible

(Soft) constraints component

Constraint violation penalty

How far y is from a “legal” assignment

How to train?

How to decompose the global objective function?

Should we incorporate constraints in the learning process?

Inference with Beam Search (& Soft Constraints)

[Figure: the citation “L. Shari et. al. 1998. Building semantic Concordances. In C. Fellbaum, ed. WordNet: and its applications.” shown token by token]

ILP is another option! Run beam search inference as before. Rather than eliminating illegal assignments, re-rank them.

149

Assume that We Have Made a Mistake…

150

[Figure: the beam’s current assignment over the citation tokens, with labels Au, Date, Title, Journal, …]

Assumption: this is the best assignment according to the objective function. Start from here.

Top Choice from Learned Models Is Not Good: ∑Φc1(yi) =1 ; ∑Φc2(yi) =1+1

151

[Figure: another candidate assignment over the citation tokens, with labels Au, Date, Title, Journal, Book, Au, …]

Each field must be a consecutive list of words and can appear at most once in a citation.

State transitions must occur on punctuation marks.

Two violations.

The Second Choice from Learned Models Is Not Good

152

[Figure: a further candidate assignment over the citation tokens, with labels Au, Date, Title, Journal, Book, Editor, …]

State transitions must occur on punctuation marks.

Two violations.

∑Φc2(yi) =1+1

Use Constraints to Find the Best Assignment

[Figure: the assignment selected with the help of the constraints, with labels Au, Date, Title, Journal, Editor, …]

153

State transitions must occur on punctuation marks.

One violation.

∑Φc2(yi) =1

Soft Constraints Help Us Find a Better Assignment

[Figure: a better assignment found using the soft constraints, with labels Au, Date, Title, Editor, Book, …]

154

We can do this because we use soft constraints. If we used hard constraints, the score of every assignment given this prefix would be negative infinity.

155

(4) Constrained Conditional Model with Soft Constraints

How to solve?

This is an Integer Linear Program

Solving using ILP packages gives an exact solution.

Search techniques are also possible

(Soft) constraints component

Constraint violation penalty

How far y is from a “legal” assignment

How to train?

How to decompose the global objective function?

Should we incorporate constraints in the learning process?

156

Compare Training Methods (with Soft Constraints)

Need to figure out the penalty as well…

Option 1: Learning + Inference (with Constraints) Learn the weights and penalties separately

Option 2: Inference (with Constraints) Based Training Learn the weights and penalties together

The tradeoff between L+I and IBT is similar to what we have seen earlier.

157

Inference Based Training With Soft Constraints

YPRED = argmax_Y λ · F(X, Y) − Σ_I ρ_I d(Y, 1_{C_I}(X))

For each iteration
  For each (X, YGOLD) in the training data
    If YPRED != YGOLD
      λ = λ + F(X, YGOLD) − F(X, YPRED)
      ρ_I = ρ_I + d(YGOLD, 1_{C_I}(X)) − d(YPRED, 1_{C_I}(X)),  I = 1, …
    endif
  endfor

Example: Perceptron. Update the penalties as well!

158

L+I vs. IBT for Soft Constraints

Test on citation recognition. L+I: HMM + weighted constraints. IBT: Perceptron + weighted constraints. Same feature set.

With constraints: the factored model (L+I) is better; the difference is more significant with a small number of examples.

Without constraints: with few labeled examples, HMM > perceptron; with many labeled examples, perceptron > HMM.

This agrees with earlier results in the supervised setting [ICML’05, IJCAI’05].

Summary – Soft Constraints

Using soft constraints is important sometimes Some constraints might be violated in the gold data We want to have the notion of degree of violation

Degree of violation One approximation: count how many time the

constraints was violated

How to solve? Beam search, ILP, …

How to learn? L+I v.s. IBT

159

160

Constrained Conditional Model: Injecting Knowledge

How to solve?

This is an Integer Linear Program

Solving using ILP packages gives an exact solution.

Search techniques are also possible

(Soft) constraints component

Weight Vector for “local” models

Constraint violation penalty

How far y is from a “legal” assignment

A collection of Classifiers; Log-linear models (HMM, CRF) or a combination

How to train?

How to decompose the global objective function?

Should we incorporate constraints in the learning process?

Inject Prior Knowledge via Constraints

A.K.A.

Semi/unsupervised Learning with Constrained Conditional Model

161

Constraints As a Way To Encode Prior Knowledge

Consider encoding the knowledge that: Entities of type A and B cannot occur simultaneously in a

sentence The “Feature” Way

Requires larger models

The Constraints Way Keeps the model simple; add expressive

constraints directly A small set of constraints Allows for decision time incorporation of

constraints

An effective way to inject knowledge

Need more training data

We can use constraints as a way to replace training data

162162

Constraint Driven Semi/Un Supervised Learning

CODL: use constraints to generate better training samples in semi/unsupervised learning.

[Figure: the constraint-driven learning loop: a model, initialized from seed examples or a resource, labels unlabeled data (prediction + constraints); the better feedback from the labeled data is used to learn a better model]

In traditional semi/unsupervised learning, models can drift away from the correct model.

163

Semi-Supervised Learning with Soft Constraints (L+I)

Learning the model weights: for example, an HMM.

Constraint penalties: hard constraints get an infinite penalty; weighted constraints get

ρ_i = −log P{constraint C_i is violated in the training data}

Only 10 constraints. Learn the weights and the penalties separately.

164

Semi-supervised Learning with Constraints   [Chang, Ratinov, Roth, ACL’07]

λ = learn(T)
For N iterations do
  T = {}
  For each x in the unlabeled dataset
    {y1, …, yK} = InferenceWithConstraints(x, C, λ)
    T = T ∪ {(x, yi)}, i = 1…K
  λ = γ·λ + (1 − γ)·learn(T)

Inference-based augmentation of the training set (feedback via inference with constraints). Learn from the new training data, weighing the supervised and the unsupervised model (γ); learn(·) is a supervised learning algorithm parameterized by λ.
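A sketch of this loop in Python (illustrative; `learn` returns a weight vector and `inference_with_constraints` returns the K best constrained outputs, both assumed to be provided by the caller):

def codl(seed_data, unlabeled, constraints, learn, inference_with_constraints,
         iterations=10, K=5, gamma=0.9):
    model = learn(seed_data)                      # model is assumed to be a weight vector
    for _ in range(iterations):
        new_data = []
        for x in unlabeled:
            top_k = inference_with_constraints(x, constraints, model)
            new_data.extend((x, y) for y in top_k[:K])
        # Weigh the supervised model against the model learned from the self-labeled data.
        model = gamma * model + (1 - gamma) * learn(new_data)
    return model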

165

Objective function:

Value of Constraints in Semi-Supervised Learning

[Figure: performance as a function of the number of available labeled examples, up to 300]

Learning with 10 constraints: constraints are used to bootstrap a semi-supervised learner; a poor model + constraints is used to annotate unlabeled data, which in turn is used to keep training the model.

Learning w/o constraints: 300 examples.

166

Unsupervised Constraint Driven Learning

Semi-supervised Constraint Driven Learning In [Chang et.al. ACL 2007], they use a small labeled

dataset Reason: We need a good starting point!

Unsupervised Constraint Driven Learning Sometimes, good resources exist

We can use the resource to initialize the model We do not need labeled instances at all!

Application: transliteration discovery

167

Why Adding Constraints?

Before talking about transliteration discovery, let’s think again about why we add constraints. Reason: we want to capture the dependencies between different outputs.

A new question: what happens if you are trying to solve a single-output classification problem? Can you still inject prior knowledge?

168

[Figure: inputs x1–x7 and mutually dependent outputs y1–y5]

Structured output problem: dependencies between different outputs.

Why Add Constraints?

169

[Figure: inputs x1–x7 and a single output y1]

Single output problem: only one output.

Why Add Constraints?

170

Adding Constraints Through Hidden Variables

[Figure: inputs x1–x7, hidden variables f1–f5, and a single output y1]

Single output problem with hidden variables.

171

Hidden variables: character mappings. The character mapping is not our final goal, but it provides an intermediate representation that satisfies natural constraints.

Example: Transliteration Discovery

The score for each pair depends on the character mappings.

172

Transliteration Discovery with Hidden Variables

For each source NE, find the best target candidate. For each NE pair, the score is equal to the score of the best hidden set of character mappings (edges). Add constraints when predicting the hidden variables.

g(N_s, N_t, H) = F(N_s, N_t, H) · w − Σ_{c∈C} ρ_c · d(H, 1_c)

score(N_s, N_t) = (1 / |N_t|) max_H g(N_s, N_t, H)

A CCM formulation: use constraints to bias the prediction! Alternative view: finding the best feature representation.

173

Natural Constraints

A Dynamic Programming Algorithm

Exact and fast!

Maximize the mapping score

Pronunciation constraints

One-to-one constraint

Non-crossing constraint

Take Home Message:

Although ILP can solve most problems, the fastest inference algorithm depends on the constraints and can be simpler

We can decompose the problem into two parts

174

Algorithm: High Level View

W = initialized from a resource (a Romanization table)
Until convergence:
  Use the model W + constraints to get assignments for both the hidden variables (F) and the labels (Y):
    D = D ∪ {(F*, (N_s, N_t)) : the best constrained assignment under the current W}
  Update the model with the newly labeled F and Y:
    W = train(D)

Get feedback from both the hidden variables and the labels.

175

Results - Russian

[Figure: accuracy (higher is better) comparing (a) no constraints + 20 labeled examples with (b) unsupervised constraint-driven learning using no labeled data and no temporal information; no NER was used on Russian]

176

Results - Hebrew

[Figure: accuracy comparing (a) no constraints + 250 labeled examples with (b) unsupervised constraint-driven learning using no labeled data and no temporal information]

Summary

Adding constraints is an effective way to inject knowledge; constraints can correct our predictions on unlabeled data.

Injecting constraints is sometimes more effective than annotating data: 300 labeled examples vs. 20 labeled examples + constraints.

In the case of single-output problems, use hidden variables to capture the constraints.

Other approaches are possible [Daume, EMNLP 2008]: only get feedback from the examples where the constraints are satisfied.

177

Page 178

This Tutorial: Constrained Conditional Models (2nd part)

Part 4: More on Inference – Other inference algorithms

Search (SRL); Dynamic Programming (Transliteration); Cutting Planes

Using hard and soft constraints Part 5: Training issues when working with CCMS

Formalism (again) Choices of training paradigms -- Tradeoffs

Examples in Supervised learning Examples in Semi-Supervised learning

Part 6: Conclusion Building CCMs Features and Constraints; Objective functions; Different

Learners. Mixed models vs. Joint models; where is Knowledge coming from

THE END

179

Conclusion

Constrained Conditional Models combine Learning conditional models with using declarative

expressive constraints Within a constrained optimization framework

Our goal was to: Introduce a clean way of incorporating constraints to

bias and improve decisions of learned models A clean way to use (declarative) prior knowledge to

guide semi-supervised learning Provide examples for the diverse usage CCMs have

already found in NLP Significant success on several NLP and IE tasks (often,

with ILP)

180

Technical Conclusions Presented and discussed inference algorithms

How to use constraints to make global decisions The formulation is an Integer Linear Programming

formulation, but algorithmic solutions can employ a variety of algorithms

Present and discussed learning models to be used along with constraints Training protocol matters We distinguished two extreme cases – training with/without

constraints. Issues include performance, as well as modularity of the

solution and the ability to use previously learned models. We did not attend to the question of “how to find constraints”

In order to emphasize the idea that we think background knowledge is important, it exists, and to focus on using it.

But, we talked about learning weights for constraints, and it’s clearly possible to learn constraints.

181

y* = argmax_y Σ_i w_i φ_i(x, y) − Σ_i ρ_i d_{C_i}(x, y)

Linear objective functions. Typically φ(x, y) will be local functions, or φ(x, y) = φ(x).

Summary: Constrained Conditional Models

[Figure: the first term corresponds to a Conditional Markov Random Field over y1–y8; the second term, Σ_i ρ_i d_{C_i}(x, y), corresponds to a Constraints Network over the same variables]

Expressive constraints over output variables: soft, weighted constraints, specified declaratively as FOL formulae.

Clearly, there is a joint probability distribution that represents this mixed model. We would like to: learn a simple model or several simple models; make decisions with respect to a complex model.

Key difference from MLNs, which provide a concise definition of a model, but of the whole joint one.

182

Decide what variables are of interest – learn model (s) Think about constraints among the variables of interest Design your objective function

LBJ (Learning Based Java): http://L2R.cs.uiuc.edu/~cogcomp

A modeling language for Constrained Conditional Models. Supports programming along with building learned models, high-level specification of constraints, and inference with constraints.

Designing CCMs

y* = argmax_y Σ_i w_i φ_i(x, y) − Σ_i ρ_i d_{C_i}(x, y)

Linear objective functions. Typically φ(x, y) will be local functions, or φ(x, y) = φ(x).

[Figure: the conditional Markov random field and the constraints network, as in the summary slide]

Expressive constraints over output variables: soft, weighted constraints, specified declaratively as FOL formulae.

183

Questions? Thank you!