
Page 1

February 2008

University of Edinburgh

With thanks to:

Collaborators: Ming-Wei Chang, Vasin Punyakanok, Lev Ratinov,

Nick Rizzolo, Mark Sammons, Scott Yih, Dav Zimak

Funding: ARDA, under the AQUAINT program; NSF: ITR IIS-0085836, ITR IIS-0428472, ITR IIS-0085980, SoD-HCER-0613885; a DOI grant under the Reflex program; DASH Optimization (Xpress-MP)

Constrained Conditional Models for Global Learning and Inference

Dan Roth, Department of Computer Science, University of Illinois at Urbana-Champaign


Page 2

Nice to Meet You


Page 3

Learning and Inference

Global decisions in which several local decisions play a role but there are mutual dependencies on their outcome.

(Learned) models/classifiers for different sub-problems

Incorporate models’ information, along with constraints, in making coherent decisions – decisions that respect the local models as well as domain & context specific constraints.

Global inference for the best assignment to all variables of interest.


Page 4

Inference


Page 5

Comprehension

1. Christopher Robin was born in England.
2. Winnie the Pooh is a title of a book.
3. Christopher Robin’s dad was a magician.
4. Christopher Robin must be at least 65 now.

A process that maintains and updates a collection of propositions about the state of affairs.

(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.


Page 6

Illinois’ bored of education board

…Nissan Car and truck plant is …  /  …divide life into plant and animal kingdom…

(This Art) (can N) (will MD) (rust V)   V, N, N

The dog bit the kid. He was taken to a veterinarian / a hospital.

What we Know: Stand Alone Ambiguity Resolution

Learn a function f: X → Y that maps observations in a domain to one of several categories…


Page 7

Theoretically: generalization bounds. How many examples does one need to see in order to guarantee good behavior on previously unobserved examples?

Algorithmically: good learning algorithms for linear representations. They can deal with very high dimensionality (10^6 features) and are very efficient in terms of computation and number of examples; on-line.

Key issues remaining:

Learning protocols: how to minimize interaction (supervision); how to map domain/task information to supervision; semi-supervised learning; active learning; ranking; adaptation.

What are the features? No good theoretical understanding here.

How to decompose problems and learn tractable models. Modeling/programming systems that have multiple classifiers.

Classification is Well Understood
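As a concrete (and entirely hypothetical) illustration of this stand-alone classification setting — not the SNoW system used in the talk's experiments — here is a minimal mistake-driven on-line learner over sparse, high-dimensional features:

# Minimal sketch of on-line learning with sparse features (illustrative only;
# the talk's experiments use SNoW / Winnow / Perceptron, not this exact code).

def predict(w, features):
    """Score a sparse feature set {name: value} with weight dict w."""
    return sum(w.get(f, 0.0) * v for f, v in features.items())

def perceptron_train(data, epochs=5):
    """data: list of (features, label) with label in {-1, +1}."""
    w = {}
    for _ in range(epochs):
        for features, y in data:
            if y * predict(w, features) <= 0:        # mistake-driven update
                for f, v in features.items():
                    w[f] = w.get(f, 0.0) + y * v
    return w

# Toy usage: word-sense style disambiguation with two invented sparse examples.
data = [({"left-word=car": 1.0, "right-word=plant": 1.0}, +1),
        ({"left-word=animal": 1.0, "right-word=kingdom": 1.0}, -1)]
w = perceptron_train(data)
print(predict(w, {"right-word=plant": 1.0}) > 0)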


Page 8

Comprehension

1. Christopher Robin was born in England.
2. Winnie the Pooh is a title of a book.
3. Christopher Robin’s dad was a magician.
4. Christopher Robin must be at least 65 now.

A process that maintains and updates a collection of propositions about the state of affairs.

(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.

This is an Inference Problem


Page 9

This Talk

Global Inference over Local Models/Classifiers + Expressive Constraints

Constrained Conditional Models
Generality of the framework

Training Paradigms

Global training, Decomposition and Local training
Semi-Supervised Learning

Examples: Semantic Parsing; Information Extraction; Pipeline processes


Page 10

Sequential Constraints Structure

Three models for sequential inference with classifiers [Punyakanok & Roth, NIPS’01]:

HMM; HMM with Classifiers. Sufficient for easy problems.

Conditional Models (PMM). Allows direct modeling of states as a function of the input. Classifiers may vary: SNoW (Winnow, Perceptron), MEMM (MaxEnt), SVM-based.

Constraint Satisfaction Models (CSCL: more general constraints). The inference problem is modeled as weighted 2-SAT; with sequential constraints it is shown to have an efficient solution.

Recent work: viewed as multi-class classification, with emphasis on global training [Collins’02, CRFs, SVMs]; efficiency and performance issues.

[Figure: chain-structured graphical models with states s1…s6 and observations o1…o6.]

By far, the most popular in applications. Allows for Dynamic Programming based inference.

What if the structure of the problem/constraints is not sequential?
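To make the dynamic-programming option above concrete, here is a minimal Viterbi sketch in Python over per-position classifier scores and a transition table; the labels, scores, and the "no O→I transition" restriction are invented for illustration and are not taken from the talk:

# Minimal Viterbi decoder over classifier scores (illustrative sketch).
# states: list of labels; emit[i][s]: score of state s at position i;
# trans[(s_prev, s)]: transition score (the sequential "constraints").

def viterbi(states, emit, trans):
    n = len(emit)
    best = [{s: emit[0][s] for s in states}]
    back = [{}]
    for i in range(1, n):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[i - 1][p] + trans[(p, s)])
            best[i][s] = best[i - 1][prev] + trans[(prev, s)] + emit[i][s]
            back[i][s] = prev
    last = max(states, key=lambda s: best[n - 1][s])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

states = ["B", "I", "O"]
emit = [{"B": 1.0, "I": 0.2, "O": 0.5},
        {"B": 0.1, "I": 0.9, "O": 0.3},
        {"B": 0.2, "I": 0.4, "O": 0.8}]
trans = {(p, s): (-5.0 if (p == "O" and s == "I") else 0.0)   # forbid O -> I
         for p in states for s in states}
print(viterbi(states, emit, trans))   # -> ['B', 'I', 'O']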


Page 11

Pipeline

Pipelining is a crude approximation; interactions occur across levels, and downstream decisions often interact with previous decisions.

Leads to propagation of errors. Occasionally, later-stage problems are easier, but upstream mistakes will not be corrected.

[Figure: a pipeline from Raw Data through POS Tagging, Phrases, Semantic Entities, and Relations, with further stages such as Parsing, WSD, and Semantic Role Labeling.]

Most problems are not single classification problems.

Looking for: global inference over the outcomes of different local predictors as a way to break away from this paradigm [between pipeline & fully global]; a flexible way to incorporate linguistic and structural constraints.


Page 12

Inference with General Constraint Structure [Roth&Yih’04]

Dole ’s wife, Elizabeth , is a native of N.C.

E1, E2, E3 mark the entities; R12 and R23 mark the relations between them.

[Figure: classifier score tables for the example. Each entity has a distribution over {other, per, loc} (the three tables read other 0.05 / per 0.85 / loc 0.10; other 0.10 / per 0.60 / loc 0.30; other 0.05 / per 0.50 / loc 0.45) and each relation a distribution over {irrelevant, spouse_of, born_in} (irrelevant 0.05 / spouse_of 0.45 / born_in 0.50; irrelevant 0.10 / spouse_of 0.05 / born_in 0.85). Joint inference with constraints picks the globally consistent assignment.]

Improvement over no inference: 2-5%


Page 13

Random variables Y.

Conditional distributions P (learned by models/classifiers). Constraints C: any Boolean function defined on partial assignments (possibly with weights W).

Goal: find the “best” assignment — the one that achieves the highest global performance. This is an Integer Programming problem.

Problem Setting

[Figure: a network of output variables y1…y8 over the observations, with constraints such as C(y1, y4) and C(y2, y3, y6, y7, y8).]

Y* = argmax_Y P(Y) subject to constraints C (+ weights W_C). Other, more general ways to incorporate soft constraints exist [ACL’07].


Page 14

y* = argmax_y Σ_i w_i φ_i(x, y)

Typically linear or log-linear. Typically φ(x, y) will be local functions, or φ(x, y) = φ(x).

Constrained Conditional Models

[Figure: two networks over y1…y8 — a Conditional Markov Random Field and a Constraints Network.]

− Σ_i ρ_i C_i(x, y)   (penalty term for the constraints)

Optimize for general constraints. Constraints may have weights; may be soft; specified declaratively as FOL formulae.

Clearly, there is a joint probability distribution that represents this mixed model. We would like to make decisions with respect to the mixed model, but not necessarily learn this complex model.


Page 15

A General Inference Setting

Linear objective function: essentially all complex models studied today can be viewed as optimizing a linear objective function: HMMs/CRFs [Roth’99; Collins’02; Lafferty et al. ’02].

Linear objective functions can also be derived from a probabilistic perspective: Markov Random Field [standard; Kleinberg & Tardos]; the Metric Labeling optimization problem [Chekuri et al. ’01]; Linear Programming problems; Inference as Constrained Optimization [Yih & Roth, CoNLL’04]…

The probabilistic perspective supports finding the most likely assignment — not necessarily what we want.

Our Integer Linear Programming (ILP) formulation: allows the incorporation of more general cost functions; general (non-sequential) constraint structure; better exploitation (computationally) of hard constraints; can find the optimal solution if desired.


Page 16

Formal Model

How to solve?

This is an Integer Linear Program. Solving it with ILP packages gives an exact solution; search techniques are also possible.

The pieces of the objective: a weight vector for the “local” models — a collection of classifiers, log-linear models (HMM, CRF), or a combination; a (soft) constraints component — a penalty for violating each constraint, measuring how far y is from a “legal” assignment; all subject to constraints.
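Putting these annotated pieces together, the objective can be written in the standard CCM form (a reconstruction; the slide's own equation is not preserved in this transcript):

y^* = \arg\max_{y} \; \sum_i w_i \, \phi_i(x, y) \;-\; \sum_k \rho_k \, d_{C_k}(x, y)

where w is the weight vector of the "local" models, \rho_k is the penalty for violating constraint C_k, and d_{C_k}(x, y) measures how far y is from a "legal" assignment with respect to C_k (hard constraints correspond to \rho_k = \infty).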

How to train?

How to decompose global objective function?

Should we incorporate constraints in the learning process?


Page 17

Example: Semantic Role Labeling

I left my pearls to my daughter in my will .

[I]A0 left [my pearls]A1 [to my daughter]A2 [in my will]AM-LOC .

A0: Leaver; A1: Things left; A2: Benefactor; AM-LOC: Location.

Special case (structured output problem): here, all the data is available at one time; in general, classifiers might be learned from different sources, at different times, in different contexts.

Implications on training paradigms

Overlapping arguments

If A2 is present, A1 must also be present.

Who did what to whom, when, where, why,…


Page 18

PropBank [Palmer et al. ’05] provides a large human-annotated corpus of semantic verb-argument relations. It adds a layer of generic semantic labels to Penn Tree Bank II. (Almost) all the labels are on the constituents of the parse trees.

Core arguments: A0–A5 and AA, with different semantics for each verb, specified in the PropBank Frame files.

13 types of adjuncts, labeled AM-arg, where arg specifies the adjunct type.

Semantic Role Labeling (2/2)


Page 19

Algorithmic Approach

Identify argument candidates: pruning [Xue & Palmer, EMNLP’04]; Argument Identifier — binary classification (SNoW).

Classify argument candidates: Argument Classifier — multi-class classification (SNoW).

Inference: use the estimated probability distribution given by the argument classifier; use structural and linguistic constraints; infer the optimal global output.

I left my nice pearls to her

[Figure: the candidate arguments shown as nested bracketings over the sentence, with callouts “Identify Vocabulary” (EASY) and “Inference over (old and new) Vocabulary” for the candidate arguments.]


Page 20

Inference

I left my nice pearls to her

The output of the argument classifier often violates some constraints, especially when the sentence is long.

Finding the best legitimate output is formalized as an optimization problem and solved via Integer Linear Programming. [Punyakanok et. al 04, Roth & Yih 04;05]

Input: The probability estimation (by the argument classifier)

Structural and linguistic constraints

Allows incorporating expressive (non-sequential) constraints on the variables (the arguments types).


Page 21

Integer Linear Programming Inference

For each argument a_i, set up a Boolean variable a_{i,t} indicating whether a_i is classified as t.

Goal: maximize Σ_i Σ_t score(a_i = t) · a_{i,t}, subject to the (linear) constraints.

If score(a_i = t) = P(a_i = t), the objective is to find the assignment that maximizes the expected number of arguments that are correct and satisfies the constraints.

The Constrained Conditional Model is completely decomposed during training


Page 22

No duplicate argument classes:
    Σ_{a ∈ POTARG} x{a = A0} ≤ 1

R-ARG:
    ∀ a2 ∈ POTARG:  Σ_{a ∈ POTARG} x{a = A0} ≥ x{a2 = R-A0}

C-ARG:
    ∀ a2 ∈ POTARG:  Σ_{a ∈ POTARG, a before a2} x{a = A0} ≥ x{a2 = C-A0}

Many other possible constraints: unique labels; no overlapping or embedding; relations between the number of arguments; if the verb is of type A, no argument of type B.

Any Boolean rule can be encoded as a linear constraint.

If there is an R-ARG phrase, there is an ARG Phrase

If there is an C-ARG phrase, there is an ARG before it

Constraints

Joint inference can be used also to combine different SRL Systems.

Universally quantified rules. In LBJ we allow a programmer to encode their constraints in FOL; these are compiled into linear inequalities automatically.
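A minimal sketch of this kind of ILP in Python, using the open-source PuLP library rather than the Xpress-MP package mentioned in the talk (and with invented argument spans, labels, and scores). It encodes "one label per argument", "no duplicate A0", and "an R-A0 requires an A0" as linear constraints:

# Sketch: ILP inference for SRL-style argument labeling with PuLP (illustrative).
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum

args = ["a1", "a2", "a3"]                       # candidate argument spans (made up)
labels = ["A0", "A1", "R-A0", "null"]
score = {                                        # classifier scores (made up)
    "a1": {"A0": 0.7, "A1": 0.1, "R-A0": 0.1, "null": 0.1},
    "a2": {"A0": 0.5, "A1": 0.3, "R-A0": 0.1, "null": 0.1},
    "a3": {"A0": 0.1, "A1": 0.2, "R-A0": 0.6, "null": 0.1},
}

prob = LpProblem("srl_inference", LpMaximize)
x = {(a, t): LpVariable(f"x_{a}_{t}", cat=LpBinary) for a in args for t in labels}

# Objective: sum of score(a = t) * x[a, t]
prob += lpSum(score[a][t] * x[a, t] for a in args for t in labels)

# Each argument gets exactly one label.
for a in args:
    prob += lpSum(x[a, t] for t in labels) == 1

# No duplicate argument classes: at most one A0.
prob += lpSum(x[a, "A0"] for a in args) <= 1

# R-ARG: if some argument is labeled R-A0, some argument must be labeled A0.
for a2 in args:
    prob += lpSum(x[a, "A0"] for a in args) >= x[a2, "R-A0"]

prob.solve()
print({a: next(t for t in labels if x[a, t].value() == 1) for a in args})

On these made-up scores the solver assigns A0 to a1, A1 to a2 (the duplicate-A0 constraint blocks a second A0), and R-A0 to a3, which the R-ARG constraint permits because an A0 exists.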


Page 23

Extracting Relations via Semantic Analysis

Screen shot from a CCG demo: http://L2R.cs.uiuc.edu/~cogcomp

Semantic parsing reveals several relations in the sentence along with their arguments.

Top ranked system in CoNLL’05 shared task

Key difference is the Inference

This approach produces a very good semantic parser. F1~90%

Easy and fast: ~7 Sent/Sec (using Xpress-MP)


Page 24

ILP as a Unified Algorithmic Scheme

Consider a common model for sequential inference: HMM/CRF. Inference in this model is done via the Viterbi algorithm.

Viterbi is a special case of Linear Programming based inference: Viterbi is a shortest-path problem, which is an LP with a canonical constraint matrix that is totally unimodular. Therefore, you get an integral solution for free.

One can now incorporate non-sequential/expressive/declarative constraints by modifying this canonical matrix — modifying the decision-time objective function.

The extension reduces to a polynomial scheme under some conditions (e.g., when the constraints are sequential, when the solution space does not change, etc.).

It does not necessarily increase complexity and is very efficient in practice [Roth & Yih, ICML’05].

[Figure: the Viterbi trellis drawn as a shortest-path graph from a source node s to a sink node t, with a column of states {A, B, C} for each output variable y1…y5 above the observations x1…x5.]

Learn a rather simple model; make decisions with a more expressive model

So far, shown the use of only (deterministic) constraints. Can be used with statistical constraints.

This is a CCM that is trained globally (ML, Discriminatively)
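The shortest-path claim can be spelled out as a small LP (a standard reconstruction, not copied from the slide): introduce a variable z_e for every edge e of the trellis above, let c_e combine the emission and transition scores on that edge, and solve

\max_z \; \sum_e c_e \, z_e \quad \text{s.t.} \quad \sum_{e \in \mathrm{out}(s)} z_e = 1, \qquad \sum_{e \in \mathrm{in}(v)} z_e = \sum_{e \in \mathrm{out}(v)} z_e \;\; \forall v \neq s, t, \qquad 0 \le z_e \le 1.

Because the flow-conservation matrix is totally unimodular, the LP optimum is integral and the selected edges trace exactly the Viterbi path; adding non-sequential constraints over the z_e variables turns this LP into the ILP used above.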


Page 25

This Talk

Global Inference over Local Models/Classifiers + Expressive Constraints

Constrained Conditional Models
Generality of the framework

Training Paradigms

Global training, Decomposition and Local training
Semi-Supervised Learning

Examples: Semantic Parsing; Information Extraction; Pipeline processes


Page 26

Given: Q: Who acquired Overture?
Determine: A: Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc last year.

Textual Entailment

Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year

Yahoo acquired Overture

Is it true that…?(Textual Entailment)

Overture is a search company Google is a search company

……….

Google owns Overture

Phrasal verb paraphrasing [Connor & Roth ’07]; Entity matching [Li et al., AAAI’04, NAACL’04]; Semantic Role Labeling; Inference for Entailment [AAAI’05; TE’07]


Page 27

Training Paradigms that Support Global Inference

Incorporating general constraints (algorithmic approach): allow both statistical and expressive declarative constraints; allow non-sequential constraints (generally difficult).

Coupling vs. decoupling training and inference. Incorporating global constraints is important, but should it be done only at evaluation time or also at training time? How to decompose the objective function and train in parts?

Issues related to: modularity, efficiency and performance, availability of training data.

Problem-specific considerations: may not be relevant in some problems.


Page 28

Input: o1 o2 o3 o4 o5 o6 o7 o8 o9 o10

Classifier 1:
Classifier 2:
Infer:

[Figure: one classifier predicts open brackets and the other close brackets over o1…o10; the Infer row shows the phrase structure obtained by combining the two.]

Phrase Identification Problem

Use the classifiers’ outcomes to identify phrases. The final outcome is determined by optimizing the classifiers’ outcomes and the constraints.

Did this classifier make a mistake? How to train it?


Page 29

Training in the presence of Constraints

General training paradigm: the first term — learning from data (could be further decomposed); the second term — guiding the model by constraints. One can choose whether the constraints’ weights are trained (and when and how), or whether the constraints are taken into account only at evaluation.
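Reading the two terms off the CCM objective (a reconstruction; the slide's formula is not in this transcript):

\mathrm{score}(x, y) \;=\; \underbrace{\textstyle\sum_i w_i \, \phi_i(x, y)}_{\text{first term: learned from data}} \;-\; \underbrace{\textstyle\sum_k \rho_k \, d_{C_k}(x, y)}_{\text{second term: guidance by constraints}}

One can fit w on labeled data alone and apply the constraint term only at evaluation time (L+I, next slides), run constrained inference inside the training loop (IBT), or additionally estimate the \rho_k.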


Page 30

L+I: Learning plus Inference

IBT: Inference-based Training

Training w/o Constraints. Testing: Inference with Constraints.

[Figure: local models f1(x)…f5(x), each looking at a subset of the inputs x1…x7 (the X layer) and predicting some of the output variables y1…y5 (the Y layer). Cartoon only: each model can be more complex and may have a view on a set of output variables.]

Learning the components together!


Page 31

Perceptron-based Global Learning

[Figure: the local models f1(x)…f5(x) over inputs x1…x7 produce local predictions Y’ = (-1, 1, 1, 1, 1); applying the constraints gives Y’ = (-1, 1, 1, 1, -1); the true global labeling is Y = (-1, 1, 1, -1, -1).]

Which one is better? When and Why?
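A minimal Python sketch (with invented data and a toy constraint, "at most two +1 labels") of the inference-based training loop pictured here: predict with the local models, apply the constraints, and make perceptron updates only where the constrained global prediction disagrees with the true labeling:

# Sketch of IBT: perceptron updates driven by constrained global predictions.
from itertools import product

def constrained_inference(w, xs, max_pos=2):
    """Brute-force argmax over label vectors in {-1,+1}^n that satisfy the
    toy constraint 'at most max_pos labels are +1' (a stand-in constraint)."""
    best, best_score = None, float("-inf")
    for ys in product([-1, 1], repeat=len(xs)):
        if sum(1 for y in ys if y == 1) > max_pos:
            continue                                   # violates the constraint
        s = sum(y * sum(wi * xi for wi, xi in zip(w, x)) for x, y in zip(xs, ys))
        if s > best_score:
            best, best_score = ys, s
    return best

def ibt_train(examples, dim, epochs=10):
    """examples: list of (xs, ys) where xs is a list of feature vectors and
    ys is the true global labeling (a tuple of -1/+1 values)."""
    w = [0.0] * dim
    for _ in range(epochs):
        for xs, ys in examples:
            pred = constrained_inference(w, xs)
            for x, y_true, y_pred in zip(xs, ys, pred):
                if y_true != y_pred:                   # update only on mistakes
                    for j, xj in enumerate(x):
                        w[j] += (y_true - y_pred) * xj / 2.0
    return w

# Toy usage: one example with 5 items, 2 features each (invented data).
examples = [([[1, 0], [1, 1], [0, 1], [1, 0], [0, 1]], (-1, 1, 1, -1, -1))]
print(ibt_train(examples, dim=2))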


Page 32

Claims

When the local models are “easy” to learn, L+I outperforms IBT. In many applications the components are identifiable and easy to learn (e.g., argument, open-close, PER).

Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it needs a larger number of training examples.

When data is scarce and the problems are not easy, constraints can be used, along with a “weak” model, to label unlabeled data and improve the model.

Often, you don’t want the data to affect your view of the constraints.

L+I: cheaper computationally; modular. IBT is better in the limit, and in other extreme cases.

Combinations: L+I, and then IBT are possible


Page 33

Bound Prediction

Local ≤ ε_opt + ( (d log m + log 1/δ) / m )^{1/2}
Global ≤ 0 + ( (c·d log m + c²·d + log 1/δ) / m )^{1/2}

[Figure: the bounds and simulated-data results for ε_opt = 0, 0.1, 0.2.]

L+I vs. IBT: the more identifiable individual problems are, the better overall performance is with L+I

Indication of the hardness of the problem.


Page 34

Relative Merits: SRL

[Figure: SRL performance of L+I vs. IBT as the difficulty of the learning problem (# features) varies from easy to hard. L+I is better; when the problem is artificially made harder, the tradeoff is clearer.]

In some cases problems are hard due to lack of training data.

Semi-supervised learning


Page 35

Prediction result of a trained HMM

Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .

Information Extraction with Background Knowledge (Constraints)

[Figure: the HMM’s predicted segmentation of the citation into the fields [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE], with the field boundaries in the wrong places.]

Violates lots of constraints!


Page 36

Examples of Constraints

Each field must be a consecutive list of words, and can appear at most once in a citation.

State transitions must occur on punctuation marks.

The citation can only start with AUTHOR or EDITOR.

The words pp. and pages correspond to PAGE.

Four-digit numbers starting with 20 or 19 are DATE.

Quotations can appear only in TITLE.

…
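To show how such background knowledge can be operationalized, here is a small Python sketch (all names and data invented) that counts how many of a few of the constraints above a candidate field labeling violates; such a count is the kind of penalty a constrained decoder or CODL (next slides) can subtract from the model score:

# Sketch: count violations of a few citation-field constraints (illustrative).

def count_violations(tokens, labels):
    """tokens: list of words; labels: one field label per token."""
    v = 0
    # The citation can only start with AUTHOR or EDITOR.
    if labels and labels[0] not in ("AUTHOR", "EDITOR"):
        v += 1
    # Each field must be a consecutive run of words: it may not restart later.
    seen_closed = set()
    for prev, cur in zip(labels, labels[1:]):
        if cur != prev:
            seen_closed.add(prev)
            if cur in seen_closed:
                v += 1
    # The words "pp." and "pages" correspond to PAGE.
    for tok, lab in zip(tokens, labels):
        if tok in ("pp.", "pages") and lab != "PAGE":
            v += 1
    return v

tokens = ["Lars", "Ole", "Andersen", ".", "PhD", "thesis", "."]
labels = ["AUTHOR", "AUTHOR", "AUTHOR", "AUTHOR",
          "TECH-REPORT", "TECH-REPORT", "TECH-REPORT"]
print(count_violations(tokens, labels))   # 0 for this labeling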


Page 37

Information Extraction with Constraints

Adding constraints, we get correct results!

[AUTHOR] Lars Ole Andersen . [TITLE] Program analysis and specialization for the C Programming language . [TECH-REPORT] PhD thesis . [INSTITUTION] DIKU , University of Copenhagen , [DATE] May, 1994 .

If incorporated into semi-supervised training, better results mean Better Feedback!


Page 38

Semi-Supervised Learning with Constraints

[Figure: the model, the unlabeled data, and the constraints; constraints enter both at decision time and during training.]

In traditional semi-supervised learning the model can drift away from the correct one.

Constraints can be used: at decision time, to bias the objective function towards favoring constraint satisfaction; at training time, to improve the labeling of unlabeled data (and thus improve the model).


Page 39

λ = learn(T)

For N iterations do:
    T = ∅
    For each x in the unlabeled dataset:
        y ← Inference(x, C, λ)
        T = T ∪ {(x, y)}

Supervised learning algorithm parameterized by λ.

Inference-based augmentation of the training set (feedback): Inference(x, C, λ) is inference with the constraints.

Constraint-Driven Learning (CODL) [Chang, Ratinov, Roth, ACL’07]


Page 40

Constraint - Driven Learning (CODL) [Chang, Ratinov, Roth, ACL’07]

λ = learn(T)

For N iterations do:
    T = ∅
    For each x in the unlabeled dataset:
        {y1, …, yK} ← Top-K-Inference(x, C, λ)
        T = T ∪ {(x, yi)}, i = 1…K
    λ = γ·λ + (1 − γ)·learn(T)

Learn from the new training data; weight the supervised and the unsupervised model.

Inference-based augmentation of the training set (feedback): inference with the constraints.

Supervised learning algorithm parameterized by λ.
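A compact Python sketch of this loop (illustrative only: the learn and top_k_inference functions below are toy stubs over toy data, not the citation-field models of the paper, and the single "constraint" they enforce is invented):

# Sketch of the CODL loop: constrained inference labels the unlabeled data,
# and the model is interpolated with the newly trained one (illustrative).

def codl(labeled, unlabeled, learn, top_k_inference, n_iters=5, gamma=0.9, k=2):
    model = learn(labeled)                        # initialize from the labeled seed
    for _ in range(n_iters):
        new_data = []
        for x in unlabeled:
            for y in top_k_inference(x, model, k):    # inference WITH constraints
                new_data.append((x, y))
        retrained = learn(new_data)
        # Weight the supervised and the newly trained model: gamma*old + (1-gamma)*new.
        model = {f: gamma * model.get(f, 0.0) + (1 - gamma) * retrained.get(f, 0.0)
                 for f in set(model) | set(retrained)}
    return model

# Toy stubs so the sketch runs end to end: a "model" is a dict of per-label scores.
def toy_learn(data):
    counts = {}
    for _, y in data:
        counts[y] = counts.get(y, 0.0) + 1.0
    total = sum(counts.values()) or 1.0
    return {y: c / total for y, c in counts.items()}

def toy_top_k_inference(x, model, k):
    # Toy "constraint": label "B" is never allowed for inputs starting with "a".
    allowed = [y for y in model if not (x.startswith("a") and y == "B")]
    return sorted(allowed, key=lambda y: -model.get(y, 0.0))[:k]

labeled = [("alpha", "A"), ("beta", "B")]
unlabeled = ["apple", "banana", "avocado"]
print(codl(labeled, unlabeled, toy_learn, toy_top_k_inference))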


Page 41

Token-based accuracy (inference with constraints)

[Figure: accuracy (60–100) vs. labeled-set size (5, 10, 15, 20, 25, 300 examples), comparing supervised learning, Weighted EM, and CODL.]


Page 42

Objective function:

Semi-Supervised Learning with Constraints

[Figure: results as a function of the number of available labeled examples.]

Learning with constraints: constraints are used to bootstrap a semi-supervised learner; a poor model plus constraints is used to annotate unlabeled data, which in turn is used to keep training the model.

Learning w/o constraints: 300 examples.

A Constrained Conditional Model in which we do not want to let training affect the constraints’ part of the objective function.


Page 43

Constrained Conditional Models: a general paradigm for learning and inference in the context of natural language understanding tasks

A general Constraint Optimization approach for integration of learned models with additional (declarative or statistical) expressivity.

A paradigm for making Machine Learning practical – allow domain/task specific constraints.

How to train? Learn simple local models; make use of them globally (via global inference) [Punyakanok et al., IJCAI’05].

Ability to use domain knowledge & constraints to drive supervision [Klementiev & Roth, ACL’06; Chang, Ratinov, Roth, ACL’07].

Conclusions

LBJ (Learning Based Java): http://L2R.cs.uiuc.edu/~cogcomp

A modeling language for Constrained Conditional Models. Supports programming along with building learned models, high-level specification of constraints, and inference with constraints.


Page 44

Questions?

Thank you

