Page 1
Global Inference in Learning for
Natural Language Processing
Page 2
Comprehension
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in
England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
1. Who is Christopher Robin?
2. When was Winnie the Pooh written?
3. What did Mr. Robin do when Chris was three years old?
4. Where did young Chris live?
5. Why did Chris write two books of his own?
Page 3
Given: Q: Who acquired Overture? Determine: A: Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc last year.
Textual Entailment
Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year
Yahoo acquired Overture
Is it true that…? (Textual Entailment)
Overture is a search company
Google is a search company
……….
Google owns Overture
Phrasal verb paraphrasing [Connor & Roth '07]
Entity matching [Li et al., AAAI'04, NAACL'04]
Semantic Role Labeling
Inference for Entailment [AAAI'05; TE'07]
Page 4
Illinois’ bored of education board
...Nissan Car and truck plant is ……divide life into plant and animal kingdom
(This Art) (can N) (will MD) (rust V) V,N,N
The dog bit the kid. He was taken to a veterinarian a hospital
What we Know: Stand Alone Ambiguity Resolution
Learn a function f: X → Y that maps observations in a domain to one of several categories.
Page 5
Theoretically: generalization bounds
How many examples does one need to see in order to guarantee good behavior on previously unobserved examples?
Algorithmically: good learning algorithms for linear representations.
Can deal with very high dimensionality (10^6 features)
Very efficient in terms of computation and # of examples. On-line.
Key issues remaining:
Learning protocols: how to minimize interaction (supervision); how to map domain/task information to supervision; semi-supervised learning; active learning; ranking; adaptation.
What are the features? No good theoretical understanding here.
How to decompose problems and learn tractable models.
Modeling/programming systems that have multiple classifiers.
Classification is Well Understood
Page 6
Comprehension
1. Christopher Robin was born in England.
2. Winnie the Pooh is a title of a book.
3. Christopher Robin's dad was a magician.
4. Christopher Robin must be at least 65 now.
A process that maintains and updates a collection of propositions about the state of affairs.
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
This is an Inference Problem
Page 7
Global decisions in which several local decisions play a role but there are mutual dependencies on their outcome.
Learned classifiers for different sub-problems
Incorporate classifiers’ information, along with constraints, in making coherent decisions – decisions that respect the local classifiers as well as domain & context specific constraints.
Global inference for the best assignment to all variables of interest.
Learning and Inference
Page 8
Special case I: Structured Output
Classifiers
1. Recognizing "The beginning of NP"
2. Recognizing "The end of NP"
(or: word-based classifiers: BIO representation)
Also for other kinds of phrases…
Some Constraints
1. Phrases do not overlap
2. Order of phrases (e.g., Prob(NP VP))
3. Length of phrases
4. Non-sequential and declarative: if PP then NP in the sentence
Inference: Use classifiers to infer a coherent set of phrases
He reckons the current account deficit will narrow to only # 1.8 billion in September
[NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only # 1.8 billion ] [PP in ] [NP September ]
Page 9
Sequential Constraints Structure
Three models for sequential inference with classifiers
[Punyakanok & Roth NIPS’01]
HMM; HMM with Classifiers: sufficient for easy problems
Conditional Models (PMM): allows direct modeling of states as a function of the input; classifiers may vary: SNoW (Winnow; Perceptron), MEMM (MaxEnt), SVM based
Constraint Satisfaction Models: the inference problem is modeled as weighted 2-SAT; with sequential constraints, shown to have an efficient solution
Recent work: viewed as multi-class classification; emphasis on global training [Collins'02, CRFs, M3Ns]
[Diagrams: two sequential models over states s1…s6 with corresponding observations o1…o6]
Allows for Dynamic Programming based Inference
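To make the dynamic-programming inference concrete, here is a minimal Viterbi-style sketch over per-position classifier scores plus pairwise transition scores. It is a generic illustration, not the HMM/PMM/CSP models above; the scores and the "O cannot be followed by I" constraint are invented for the example.

```python
import numpy as np

def viterbi(scores, transition):
    """Highest-scoring label sequence under per-position scores plus
    pairwise transition scores (a generic dynamic-programming decoder).

    scores: (n_positions, n_labels) per-position classifier scores.
    transition: (n_labels, n_labels); transition[i, j] scores label i -> label j.
    """
    n, k = scores.shape
    best = scores[0].copy()              # best score of a prefix ending in each label
    back = np.zeros((n, k), dtype=int)   # backpointers
    for t in range(1, n):
        cand = best[:, None] + transition + scores[t][None, :]
        back[t] = cand.argmax(axis=0)
        best = cand.max(axis=0)
    seq = [int(best.argmax())]
    for t in range(n - 1, 0, -1):
        seq.append(int(back[t, seq[-1]]))
    return seq[::-1]

# Example: 4 positions, 3 labels (say B, I, O); a sequential constraint
# "O cannot be followed by I" expressed as a very negative transition score.
rng = np.random.default_rng(0)
scores = rng.standard_normal((4, 3))
transition = np.zeros((3, 3))
transition[2, 1] = -1e6                  # O (index 2) -> I (index 1) forbidden
print(viterbi(scores, transition))
```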
What if the structure of the problem/constraints is not sequential?
Page 10
Why Special Cases?
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
1. Who is Christopher Robin?
2. When was Winnie the Pooh written?
3. What did Mr. Robin do when Chris was three years old?
4. Where did young Chris live?
5. Why did Chris write two books of his own?
Classifiers might be learned from different sources, at different times, in different contexts.
In general, cannot assume that all the data is available at one time.
Global Training is not an option
Page 11
Pipeline
Pipelining is a crude approximation; interactions occur across levels, and downstream decisions often interact with previous decisions.
Leads to propagation of errors.
Occasionally, later-stage problems are easier, but upstream mistakes will not be corrected.
[Pipeline diagram: Raw Data → POS Tagging → Phrases → Semantic Entities → Relations; related tasks: Parsing, WSD, Semantic Role Labeling]
Most problems are not single classification problems.
Looking for:
Global inference over the outcomes of different local predictors as a way to break away from this paradigm [between pipeline & fully global]
A flexible way to incorporate linguistic and structural constraints.
Page 12
Special Case II: General Constraints Structure
J.V. Oswald was murdered at JFK after his assassin, K. F. Johns…
Identify:
[J.V. Oswald]person was murdered at [JFK]location after his assassin, [K. F. Johns]person…
Kill (X, Y)
Identify named entities.
Identify relations between entities.
Exploit mutual dependencies between named entities and relations to yield a coherent global detection. [Roth & Yih, COLING'02; CoNLL'04]
Some knowledge (classifiers) may be known in advance.
Some constraints may be available only at decision time.
Page 13
Inference with General Constraint Structure [Roth&Yih’04,07]
Dole 's wife, Elizabeth , is a native of N.C.
Entity variables E1, E2, E3; relation variables R12, R23.
[Figure: each variable carries a local classifier score distribution, e.g. entity distributions such as {per 0.85, loc 0.10, other 0.05}, {per 0.60, loc 0.30, other 0.10}, {per 0.50, loc 0.45, other 0.05}, and relation distributions such as {born_in 0.85, spouse_of 0.05, irrelevant 0.10} and {born_in 0.50, spouse_of 0.45, irrelevant 0.05}; global inference selects a coherent joint assignment of entity and relation labels.]
Improvement over no inference: 2-5%
Page 14
Global Inference over Local Models/Classifiers + Expressive Constraints
Constrained Conditional Models: generality of the framework
Training Paradigms: global training, decomposition, and local training
Examples: Semantic Parsing, Information Extraction, Pipeline processes
Page 15
Issues
Incorporating general constraints (Algorithmic Approach)
Allow both statistical and expressive declarative constraints
Allow non-sequential constraints (generally difficult)
The value of using more constraints
Coupling vs. Decoupling Training and Inference
Incorporating global constraints is important, but should it be done only at evaluation time or also at training time?
Issues related to modularity, efficiency and performance
Page 16
Random Variables Y
Conditional distributions P (learned by models/classifiers)
Constraints C: any Boolean function defined on partial assignments (possibly with weights W)
Goal: find the "best" assignment, the assignment that achieves the highest global performance.
This is an Integer Programming Problem.
Problem Setting
[Figure: observed inputs (observations) and output variables y1…y8 with constraints such as C(y1, y4) and C(y2, y3, y6, y7, y8)]
Y* = argmax_Y P(Y), subject to constraints C (+ weights W_C)
Other, more general ways to incorporate soft constraints here [ACL’07]
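To make the setting concrete, here is a minimal brute-force sketch (all numbers and the single constraint are invented for illustration; real problems use ILP rather than enumeration): enumerate assignments to Y, keep those satisfying the Boolean constraint, and return the one with the highest product of local probabilities.

```python
from itertools import product

# Toy setting: three output variables, each with a locally estimated
# distribution over labels (numbers invented for illustration).
P = [
    {"per": 0.85, "loc": 0.10, "other": 0.05},
    {"per": 0.50, "loc": 0.45, "other": 0.05},
    {"per": 0.60, "loc": 0.30, "other": 0.10},
]

def constraint(assignment):
    """Example Boolean constraint: at most one variable may be labeled 'loc'."""
    return sum(label == "loc" for label in assignment) <= 1

def best_assignment(P, constraint):
    """Exhaustively find argmax_Y prod_i P_i(y_i) subject to the constraint."""
    labels = list(P[0])
    best, best_score = None, float("-inf")
    for assignment in product(labels, repeat=len(P)):
        if not constraint(assignment):
            continue
        score = 1.0
        for dist, label in zip(P, assignment):
            score *= dist[label]
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score

print(best_assignment(P, constraint))
```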
Page 17
y* = argmax_y Σ_i w_i φ_i(x; y) − Σ_i ρ_i C_i(x, y)
Typically linear or log-linear.
Typically φ(x, y) will be local functions, or φ(x, y) = φ(x).
Constrained Conditional Models
[Figure: the first term corresponds to a Conditional Markov Random Field over output variables y1…y8; the second term, Σ_i ρ_i C_i(x, y), corresponds to a Constraints Network over the same variables.]
Optimize for general constraints.
Constraints may have weights; may be soft.
Specified declaratively as FOL formulae.
Clearly, there is a joint probability distribution that represents this mixed model.
We would like to make decisions with respect to the mixed model, but not necessarily learn this complex model.
Page 18
A General Inference Setting
Linear objective function:
Essentially all complex models studied today can be viewed as optimizing a linear objective function: HMMs/CRFs [Roth'99; Collins'02; Lafferty et al. '02]
Linear objective functions can be derived from a probabilistic perspective:
The probabilistic perspective supports finding the most likely assignment, which is not necessarily what we want.
Integer linear programming (ILP) formulation:
Allows the incorporation of more general cost functions
General (non-sequential) constraint structure
Better exploitation (computationally) of hard constraints
Can find the optimal solution if desired
Page 19
Formal Model
The objective (as on the previous slides) combines:
A weight vector for "local" models: a collection of classifiers, log-linear models (HMM, CRF), or a combination.
A (soft) constraints component: a penalty for violating each constraint, measuring how far y is from a "legal" assignment.
Subject to constraints.
How to solve?
This is an Integer Linear Program.
Solving with ILP packages gives an exact solution; search techniques are also possible.
How to train?
How to decompose the global objective function?
Should we incorporate constraints in the learning process?
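As an illustration of the "solve using ILP packages" step, here is a small sketch using the open-source PuLP modeling library (an arbitrary choice; the systems described here used commercial solvers such as Xpress-MP). The variables, scores and the single declarative constraint are invented for the example.

```python
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary, LpStatus

# Hypothetical local scores: score[(i, label)] for three variables and three labels.
labels = ["per", "loc", "other"]
score = {
    (0, "per"): 0.85, (0, "loc"): 0.10, (0, "other"): 0.05,
    (1, "per"): 0.50, (1, "loc"): 0.45, (1, "other"): 0.05,
    (2, "per"): 0.60, (2, "loc"): 0.30, (2, "other"): 0.10,
}

prob = LpProblem("constrained_inference", LpMaximize)

# Indicator variable x[i, t] == 1 iff variable i is assigned label t.
x = {(i, t): LpVariable(f"x_{i}_{t}", cat=LpBinary) for i in range(3) for t in labels}

# Objective: sum of scores of the chosen labels.
prob += lpSum(score[i, t] * x[i, t] for i in range(3) for t in labels)

# Each variable takes exactly one label.
for i in range(3):
    prob += lpSum(x[i, t] for t in labels) == 1

# Example declarative constraint: at most one variable may be labeled "loc".
prob += lpSum(x[i, "loc"] for i in range(3)) <= 1

prob.solve()
print(LpStatus[prob.status], {i: t for (i, t), v in x.items() if v.value() == 1})
```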
Page 21
Example: Semantic Role Labeling
I left my pearls to my daughter in my will .
[I]A0 left [my pearls]A1 [to my daughter]A2 [in my will]AM-LOC .
A0: Leaver; A1: Things left; A2: Benefactor; AM-LOC: Location
I left my pearls to my daughter in my will .
Special Case (structured output problem): here, all the data is available at one time; in general, classifiers might be learned from different sources, at different times, in different contexts.
Implications on training paradigms
Overlapping arguments
If A2 is present, A1 must also be present.
Who did what to whom, when, where, why,…
Score scaling issues may need to be addressed.
Page 22
Semantic Role Labeling (1/2)
For each verb in a sentence:
1. Identify all constituents that fill a semantic role
2. Determine their roles
• Core Arguments, e.g., Agent, Patient or Instrument
• Their adjuncts, e.g., Locative, Temporal or Manner
I left my pearls to my daughter-in-law in my will.
A0 : leaver
A1 : thing left
A2 : benefactor
AM-LOC
The pearls which I left to my daughter-in-law are fake.
A0 : leaver
A1 : thing left
A2 : benefactor
R-A1
The pearls, I said, were left to my daughter-in-law.
A0 : sayer
A1 : utterance
C-A1 : utterance
Page 23
Semantic Role Labeling (2/2)
PropBank [Palmer et al. '05] provides a large human-annotated corpus of semantic verb-argument relations.
It adds a layer of generic semantic labels to Penn Treebank II.
(Almost) all the labels are on the constituents of the parse trees.
Core arguments: A0-A5 and AA; different semantics for each verb, specified in the PropBank Frame files.
13 types of adjuncts, labeled AM-arg where arg specifies the adjunct type.
Page 24
Algorithmic Approach
Identify argument candidates
Pruning [Xue & Palmer, EMNLP'04]
Argument Identifier: binary classification
Classify argument candidates
Argument Classifier: multi-class classification
Inference
Use the estimated probability distribution given by the argument classifier
Use structural and linguistic constraints
Infer the optimal global output
[Figure: the sentence "I left my nice pearls to her" shown at each stage with bracketed candidate arguments; callouts: identify vocabulary (candidate arguments; EASY), then inference over the (old and new) vocabulary]
Page 25
Argument Identification & Classification
Both the argument identifier and the argument classifier are trained phrase-based classifiers.
Features (some examples): voice, phrase type, head word, path, chunk, chunk pattern, etc. [some make use of a full syntactic parse]
Learning Algorithm: SNoW
Sparse network of linear functions; weights learned by a regularized Winnow multiplicative update rule
Probability conversion is done via softmax: p_i = exp{act_i} / Σ_j exp{act_j}
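A minimal sketch of that softmax conversion in generic Python (not the SNoW implementation):

```python
import math

def softmax(activations):
    """Convert raw classifier activations into a probability distribution."""
    # Subtract the max for numerical stability; this does not change the result.
    m = max(activations)
    exps = [math.exp(a - m) for a in activations]
    z = sum(exps)
    return [e / z for e in exps]

# Example: activations for labels A0, A1, A2 of one candidate argument.
print(softmax([2.0, 0.5, -1.0]))
```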
[Figure: candidate argument boundaries over "I left my nice pearls to her"]
Page 26
Inference
I left my nice pearls to her
The output of the argument classifier often violates some constraints, especially when the sentence is long.
Finding the best legitimate output is formalized as an optimization problem and solved via Integer Linear Programming. [Punyakanok et al. '04; Roth & Yih '04, '05]
Input: The probability estimation (by the argument classifier)
Structural and linguistic constraints
Allows incorporating expressive (non-sequential) constraints on the variables (the argument types).
Page 27
For each argument a_i:
Set up a Boolean variable a_{i,t} indicating whether a_i is classified as t
Goal is to maximize Σ_{i,t} score(a_i = t) · a_{i,t}
subject to the (linear) constraints
If score(a_i = t) = P(a_i = t), the objective is to find the assignment that maximizes the expected number of arguments that are correct and satisfies the constraints.
Integer Linear Programming Inference
The Constrained Conditional Model is completely decomposed during training.
Note, though, that these parts can still be inconsistent in some more complex way.
Page 29
Inference
Maximize expected number correct: T* = argmax_T Σ_i P(a_i = t_i)
Subject to some constraints: structural and linguistic (e.g., R-A1 → A1)
Solved with Integer Linear Programming
I left my nice pearls to her
Page 30
No duplicate argument classes:
Σ_{a ∈ POTARG} x_{a = A0} ≤ 1
R-ARG:
∀ a2 ∈ POTARG, Σ_{a ∈ POTARG} x_{a = A0} ≥ x_{a2 = R-A0}
C-ARG:
∀ a2 ∈ POTARG, Σ_{a ∈ POTARG, a before a2} x_{a = A0} ≥ x_{a2 = C-A0}
Many other possible constraints:
Unique labels
No overlapping or embedding
Relations between number of arguments
If verb is of type A, no argument of type B
If there is an R-ARG phrase, there is an ARG phrase
If there is a C-ARG phrase, there is an ARG phrase before it
Constraints
Joint inference can be used also to combine different SRL Systems.
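A sketch of how this inference could be set up with an off-the-shelf ILP package (PuLP here; the actual system used a commercial solver). The candidate arguments, the label set, and the scores are invented; the two constraints encoded are "no duplicate core argument classes" and "R-A1 implies some A1".

```python
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary

# Hypothetical candidates and classifier scores P(a_i = t); "null" means
# the candidate is not an argument, and R-A1 is a reference to A1.
labels = ["A0", "A1", "A2", "R-A1", "null"]
scores = [
    {"A0": 0.7, "A1": 0.1, "A2": 0.1, "R-A1": 0.05, "null": 0.05},
    {"A0": 0.1, "A1": 0.3, "A2": 0.2, "R-A1": 0.35, "null": 0.05},
    {"A0": 0.1, "A1": 0.4, "A2": 0.3, "R-A1": 0.1, "null": 0.1},
]
n = len(scores)

prob = LpProblem("srl_inference", LpMaximize)
x = {(i, t): LpVariable(f"x_{i}_{t}", cat=LpBinary) for i in range(n) for t in labels}

# Objective: expected number of correctly labeled arguments.
prob += lpSum(scores[i][t] * x[i, t] for i in range(n) for t in labels)

# Each candidate gets exactly one label.
for i in range(n):
    prob += lpSum(x[i, t] for t in labels) == 1

# No duplicate argument classes: each core label is used at most once.
for t in ["A0", "A1", "A2"]:
    prob += lpSum(x[i, t] for i in range(n)) <= 1

# R-A1 implies A1: if any candidate is labeled R-A1, some candidate is labeled A1.
for i in range(n):
    prob += lpSum(x[j, "A1"] for j in range(n)) >= x[i, "R-A1"]

prob.solve()
print({i: t for (i, t), v in x.items() if v.value() == 1})
```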
Page 31
Results
We built two SRL systems based on two full parsers:
Collins's (we talked on Wednesday about this parser)
Charniak's ("A Maximum-Entropy-Inspired Parser", 2000)
Results (PropBank; on the Penn Treebank corpus; CoNLL evaluation (sec. 23)):
AISTATS'09 (new theoretical results), IJCAI'05, CL'08 (analysis; ablation study), CoNLL'05 (more parsers)
Easy and fast: 7-8 sentences/second (using Xpress-MP)
A lot of room for improvement (additional constraints)
Demo available: http://L2R.cs.uiuc.edu/~cogcomp
                   Prec    Rec    F1
Collins' Parser    77.09   72.0   74.46
Charniak's Parser  78.10   76.15  77.11
Combined           82.28   76.78  79.44
Top-ranked system in the CoNLL'05 shared task; the key difference is the inference.
Page 32
A General Inference Setting
An Integer Linear Programming (ILP) formulation:
General: works on non-sequential constraint structure
Expressive: can represent many types of declarative constraints
Optimal: finds the optimal solution
Fast: commercial packages are able to quickly solve very large problems (hundreds of variables and constraints; sparsity is important)
Page 33
Issues
Incorporating general constraints (Algorithmic Approach)
Allow both statistical and expressive declarative constraints
Allow non-sequential constraints (generally difficult)
The value of using more constraints
Coupling vs. Decoupling Training and Inference
Incorporating global constraints is important, but should it be done only at evaluation time or also at training time?
Issues related to modularity, efficiency and performance
Page 34
Some Properties of ILP Inference
Allows expressive constraints: any Boolean rule can be represented by a set of linear (in)equalities.
Combining acquired (statistical) constraints with declarative constraints: start with the shortest-path matrix and constraints; add new constraints to the basic integer linear program.
Solved using off-the-shelf packages, for example Xpress-MP or CPLEX.
If the additional constraints don't change the solution, LP is enough; otherwise, the computational time depends on sparsity; fast in practice.
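As a worked example of the first point (my own illustration, not taken from the slides), a Boolean rule over 0/1 indicator variables can be written as linear inequalities:

```latex
% Rule: (y_1 \wedge y_2) \Rightarrow y_3, with y_i \in \{0, 1\}
y_1 + y_2 - y_3 \le 1
% Violated only when y_1 = y_2 = 1 and y_3 = 0.
% Rule: (y_1 \vee y_2) \Rightarrow y_3
y_1 \le y_3, \qquad y_2 \le y_3
```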
Page 35
Experiments: Semantic Role Labeling
For each verb in a sentence:
1. Identify all constituents that fill a semantic role
2. Determine their roles, such as Agent, Patient or Instrument
No two arguments share the same label
Use IO representation
A0: leaver; V; A1: thing left; A2: benefactor
X: I left my pearls to my daughter-in-law in my will .
Y: I-A0 O I-A1 I-A1 I-A2 I-A2 I-A2 O O O O
Page 36
Constraints
1. No duplicate argument labels (no dup): e.g., no two discontinuous segments are both A0
2. Specific token sequences share the same label (cand): derive argument candidates from the parse tree
3. At least one argument in a sentence (argument): not all of the tokens are labeled O
4. Given the verb position (verb pos): the label of the verb should be O
5. Disallow some arguments (disallow): derived from the frame files in the PropBank corpus
Page 37
Results: Contribution of Expressive Constraints
Learning with statistical constraints only; additional constraints added at evaluation time (efficiency)
Page 38
Issues
Incorporating general constraints (Algorithmic Approach)
Allow both statistical and expressive declarative constraints
Allow non-sequential constraints (generally difficult)
The value of using more constraints
Coupling vs. Decoupling Training and Inference
Incorporating global constraints is important, but should it be done only at evaluation time or also at training time?
Issues related to modularity, efficiency and performance
May not be relevant in some problems.
Page 39
Phrase Identification Problem
Use classifiers' outcomes to identify phrases; the final outcome is determined by optimizing the classifiers' outcomes and the constraints.
[Figure: input o1 o2 … o10; open-bracket and close-bracket predictions from Classifier 1 and Classifier 2; inferred output s1 s2 … s10 with the chosen phrase brackets]
Did this classifier make a mistake? How to train it?
Page 40
Training in the Presence of Constraints
General training paradigm:
First term: learning from data (could be further decomposed)
Second term: guiding the model by constraints
One can choose whether the constraints' weights are trained, and when and how, or whether they are taken into account only at evaluation time.
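Written out with the same notation as the earlier CCM objective (the sign convention for the penalty term is an assumption, matching the "penalty for violating the constraint" reading above):

```latex
\mathrm{score}(x, y) \;=\;
\underbrace{\sum_i w_i\,\phi_i(x, y)}_{\text{first term: learned from data}}
\;-\;
\underbrace{\sum_k \rho_k\, C_k(x, y)}_{\text{second term: constraints}}
```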
Page 41
Training w/o Constraints; Testing: Inference with Constraints
IBT: Inference-Based Training
L+I: Learning plus Inference
[Figure: input variables x1…x7 (X), output variables y1…y5 (Y), connected through local classifiers f1(x)…f5(x)]
Learning the components together!
Which one is better? When and Why?
Page 42
Learning Local and Global Classifiers
Learning + Inference (L+I): no inference is used during learning.
For each example (x, y) ∈ D, the learning algorithm attempts to ensure that each component of y produces the correct output.
Global constraints are enforced only at evaluation time.
Inference-Based Training (IBT): train for the correct global output.
Feedback from the inference process determines which classifiers to provide feedback to; together, the classifiers and the inference yield the desired result.
At each step a subset of the classifiers is updated according to inference feedback.
We study the tradeoff in an online setting (perceptron).
L+I: cheaper computationally; modular. But intuitively, IBT should be better in the limit.
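A minimal sketch of the two regimes in an online perceptron-style setting. The feature map, the projection implementing constrained inference, and the toy data are placeholders; this only shows where constrained inference enters the update, not the exact algorithms of the papers cited later.

```python
import numpy as np

def local_predict(w, phi_x):
    """Independent local predictions in {-1, +1} for each output component."""
    return np.where(phi_x @ w > 0, 1.0, -1.0)

def constrained_inference(w, phi_x, project):
    """Global prediction: local scores followed by a projection onto the
    set of assignments that satisfy the constraints."""
    return project(phi_x @ w)

def perceptron_step(w, phi_x, y, mode, project, lr=1.0):
    """One online update. phi_x: (n_components, n_features); y in {-1, +1}^n."""
    if mode == "L+I":
        # Learning plus Inference: train each component independently;
        # constraints are applied only at evaluation time.
        y_hat = local_predict(w, phi_x)
    else:  # "IBT"
        # Inference-Based Training: run constrained inference and use its
        # output to decide which components receive feedback.
        y_hat = constrained_inference(w, phi_x, project)
    for i in range(len(y)):
        if y_hat[i] != y[i]:
            w = w + lr * y[i] * phi_x[i]   # update only the mistaken components
    return w

# Toy usage: 3 output components, 4 features, and a made-up constraint
# "at most two components may be +1", enforced by demoting the weakest positive.
def project(scores):
    y = np.where(scores > 0, 1.0, -1.0)
    if (y > 0).sum() > 2:
        pos = np.where(y > 0)[0]
        y[pos[np.argmin(scores[pos])]] = -1.0
    return y

rng = np.random.default_rng(0)
w = np.zeros(4)
phi_x = rng.standard_normal((3, 4))
y = np.array([1.0, -1.0, 1.0])
w = perceptron_step(w, phi_x, y, "IBT", project)
```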
Page 43
Perceptron-based Global Learning
[Figure: input variables x1…x7 (X), local classifiers f1(x)…f5(x), output variables (Y)]
True global labeling Y: -1 1 1 -1 -1
Local predictions Y': -1 1 1 1 1
After applying constraints, Y': -1 1 1 1 -1
Page 44
Claims
When the local models are "easy" to learn, L+I outperforms IBT.
In many applications, the components are identifiable and easy to learn (e.g., argument, open-close, PER).
Only when the local problems become difficult to solve in isolation does IBT outperform L+I, but it needs a larger number of training examples.
L+I: cheaper computationally; modular. IBT is better in the limit.
Page 45
Simulated data. L+I vs. IBT: the more identifiable the individual problems are, the better the overall performance is with L+I.
Generalization bounds can be derived which suggest a similar behavior [AISTATS'09, IJCAI'05].
Page 46
L+I vs. IBT: SRL Experiment (Accuracy and Training Efficiency)
CRF (global): learning with all constraints discriminatively. VP: no edge features.

            CRF     CRF (global)   VP
Best F1     73.91   69.82          74.49
Time (hrs)  38      145            0.8
For more discussion, see (Punyakanok, Roth, Yih, Zimak, Learning and Inference over Constrained Output, IJCAI-05) and (Roth, Small, Titov, Sequential Learning of Classifiers for Structured Prediction Problems; AISTATS-09)
When more expressive and informative constraints are available, simple L+I strategy may be better.
Page 47
Information Extraction with Background Knowledge (Constraints)
Citation: Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .
Fields: [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE]
Prediction result of a trained HMM: the predicted segmentation splits the citation incorrectly across these fields.
Violates lots of constraints!
Page 48
Examples of Constraints
Each field must be a consecutive list of words and can appear at most once in a citation.
State transitions must occur on punctuation marks.
The citation can only start with AUTHOR or EDITOR.
The words pp., pages correspond to PAGE.
Four digits starting with 20xx or 19xx are DATE.
Quotations can appear only in TITLE.
…….
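A minimal sketch of how a few of these constraints might be checked against a candidate labeling (the tokenization, label names and example are invented; this is not the paper's implementation):

```python
import re

def fields_are_consecutive(labels):
    """Each field appears as one consecutive block and at most once."""
    seen = set()
    for i, lab in enumerate(labels):
        if lab in seen and labels[i - 1] != lab:
            return False          # the field re-appears after a gap
        seen.add(lab)
    return True

def starts_with_author_or_editor(labels):
    return labels[0] in ("AUTHOR", "EDITOR")

def year_tokens_are_dates(tokens, labels):
    """Four-digit tokens starting with 19 or 20 must be labeled DATE."""
    return all(lab == "DATE"
               for tok, lab in zip(tokens, labels)
               if re.fullmatch(r"(19|20)\d\d", tok))

tokens = ["Lars", "Ole", "Andersen", ".", "PhD", "thesis", ".", "1994", "."]
labels = ["AUTHOR", "AUTHOR", "AUTHOR", "AUTHOR", "TECH-REPORT",
          "TECH-REPORT", "TECH-REPORT", "DATE", "DATE"]
print(fields_are_consecutive(labels),
      starts_with_author_or_editor(labels),
      year_tokens_are_dates(tokens, labels))
```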
Page 49
Information Extraction with Constraints Adding constraints, we get correct results!
[AUTHOR] Lars Ole Andersen .
[TITLE] Program analysis and specialization for the C Programming language .
[TECH-REPORT] PhD thesis .
[INSTITUTION] DIKU , University of Copenhagen ,
[DATE] May, 1994 .
If incorporated into semi-supervised training, better results mean Better Feedback!
Page 50
Semi-Supervised Learning with Constraints
[Diagram: Model, Un-labeled Data, Constraints, Decision-Time Constraints]
In traditional semi-supervised learning the model can drift away from the correct one.
Constraints can be used:
At decision time, to bias the objective function towards favoring constraint satisfaction.
At training time, to improve the labeling of unlabeled data (and thus improve the model).
Page 51
θ = learn(T)
For N iterations do:
    T = ∅
    For each x in the unlabeled dataset:
        y ← Inference(x, C, θ)
        T = T ∪ {(x, y)}
Supervised learning algorithm parameterized by θ.
Inference-based augmentation of the training set (feedback): inference with constraints, Inference(x, C, θ).
Constraint - Driven Learning (CODL) [Chang, Ratinov, Roth, ACL’07]
Page 52
Constraint - Driven Learning (CODL) [Chang, Ratinov, Roth, ACL’07]
θ = learn(T)
For N iterations do:
    T = ∅
    For each x in the unlabeled dataset:
        {y1,…,yK} ← Top-K-Inference(x, C, θ)
        T = T ∪ {(x, yi)}, i = 1…K
    θ = γθ + (1−γ)·learn(T)
Learn from the new training data; weight the supervised and unsupervised models.
Inference-based augmentation of the training set (feedback): inference with constraints.
Supervised learning algorithm parameterized by θ.
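A schematic Python rendering of the loop above; learn and top_k_inference are placeholders for the supervised learner and the constrained top-K inference, and the per-parameter weighting mirrors the θ = γθ + (1−γ)·learn(T) update.

```python
import numpy as np

def codl(labeled, unlabeled, constraints, learn, top_k_inference,
         n_iterations=5, k=3, gamma=0.9):
    """Constraint-Driven Learning (CODL), schematically.

    learn(dataset) -> parameter vector (np.ndarray)
    top_k_inference(x, constraints, theta, k) -> list of K label assignments for x
    """
    theta = learn(labeled)                        # initial supervised model
    for _ in range(n_iterations):
        new_examples = []
        for x in unlabeled:
            # Constrained inference provides (noisy) labels for unlabeled data.
            for y in top_k_inference(x, constraints, theta, k):
                new_examples.append((x, y))
        # Re-learn from the newly labeled data and interpolate with the
        # previous model: theta = gamma * theta + (1 - gamma) * learn(T).
        theta = gamma * theta + (1 - gamma) * learn(new_examples)
    return theta
```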
Page 53
[Figure: token-based accuracy (inference with constraints), 60-100%, as a function of labeled set size, comparing supervised learning, Weighted EM, and CODL]
Page 54
Semi-Supervised Learning with Constraints
Objective function: a Constrained Conditional Model in which we do not want to let training affect the constraints' part of the objective function.
[Figure: performance as a function of the # of available labeled examples]
Learning with constraints: constraints are used to bootstrap a semi-supervised learner; a poor model + constraints is used to annotate unlabeled data, which in turn is used to keep training the model.
Learning w/o constraints: 300 examples.
Page 55
Constrained Conditional Models: a general paradigm for learning and inference in the context of natural language understanding tasks.
A general constraint optimization approach for integrating learned models with additional (declarative or statistical) expressivity.
A paradigm for making machine learning practical: allow domain/task-specific constraints.
A lot more work is needed on "how to decompose" (for the sake of training) and "how to put things together."
How to train? Learn simple local models; make use of them globally (via global inference) [Punyakanok et al., IJCAI'05].
Ability to use domain knowledge & constraints to drive supervision [Klementiev & Roth, ACL'06; Chang, Ratinov, Roth, ACL'07].
Conclusions
LBJ (Learning Based Java): http://L2R.cs.uiuc.edu/~cogcomp
A modeling language for Constrained Conditional Models. Supports programming along with building learned models, high-level specification of constraints, and inference with constraints.
- The last critical review is due April 27 (!)
- Send me messages about your role in the team for the second assignment
- There will be no 3rd programming assignment; instead there will be a take-home exam related to the project
- The class on Friday will be given by Dan Roth