Structured Prediction: A Large Margin Approach
Ben Taskar, University of Pennsylvania
Acknowledgments
Drago Anguelov, Vassil Chatalbashev, Carlos Guestrin, Michael Jordan, Dan Klein, Daphne Koller, Simon Lacoste-Julien, Paul Vernaza
Structured Prediction
Prediction of complex outputs
Structured outputs: multivariate, correlated, constrained
Novel, general way to solve many learning problems
Handwriting Recognition
Sequential structure
[Figure: input x is an image of a handwritten word; output y is the letter sequence "brace"]
Object Segmentation
Spatial structure
[Figure: input x is a 3D point cloud; output y is a segmentation into object classes]
Natural Language Parsing
Recursive structure
[Figure: input x is the sentence "The screen was a sea of red"; output y is its parse tree]
Bilingual Word Alignment
Combinatorial structure
[Figure: input x is the sentence pair "What is the anticipated cost of collecting fees under the new proposal?" / "En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?"; output y is the word alignment between them]
Protein Structure and Disulfide Bridges
Protein: 1IMT
[Figure: sequence AVITGACERDLQCGKGTCCAVSLWIKSVRVCTPVGTSGEDCHPASHKIPFSGQRMHHTCPCAPNLACVQTSPKKFKCLSK with disulfide bridges drawn between cysteine pairs]
Local Prediction
Classify using local information
Ignores correlations & constraints!
[Figure: per-letter OCR predictions and per-point segmentation labels (building/tree/shrub/ground) made independently]
Structured Prediction
Use local information
Exploit correlations
[Figure: jointly predicted letter sequence and segmentation labels (building/tree/shrub/ground)]
Outline
Structured prediction models: sequences (CRFs), trees (CFGs), associative Markov networks (special MRFs), matchings
Structured large margin estimation: margins and structure, min-max formulation, linear programming inference, certificate formulation
Structured Models
Mild assumption: the scoring function is a linear combination of features,
$s(x, y) = \mathbf{w}^\top \mathbf{f}(x, y)$
Prediction: $y^* = \arg\max_{y \in \mathcal{Y}(x)} \mathbf{w}^\top \mathbf{f}(x, y)$, where $\mathcal{Y}(x)$ is the space of feasible outputs
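A minimal Python sketch of this setup on a toy sequence task; the label alphabet, `joint_features`, and the brute-force decoder are illustrative stand-ins, not the talk's models:

```python
import numpy as np
from itertools import product

LABELS = [0, 1]  # toy binary label alphabet (stand-in for a-z)

def joint_features(x, y):
    """Toy joint feature map f(x, y): per-position emission features
    plus transition indicator counts (purely illustrative)."""
    n, d = x.shape
    K = len(LABELS)
    f = np.zeros(K * d + K * K)
    for j, a in enumerate(y):
        f[a * d:(a + 1) * d] += x[j]        # emission block for label a
    for a, b in zip(y, y[1:]):
        f[K * d + a * K + b] += 1           # transition count (a -> b)
    return f

def predict(w, x):
    """y* = argmax_{y in Y(x)} w . f(x, y); brute force over all label
    sequences just to make the definition concrete -- real models use
    Viterbi, LPs, or combinatorial algorithms instead."""
    return max(product(LABELS, repeat=len(x)),
               key=lambda y: w @ joint_features(x, list(y)))
```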
Chain Markov Net (aka CRF*)
[Figure: chain of label variables y_1, ..., y_5 (each ranging over a-z) linked to the observed input x]
Score decomposes over the chain: $\mathbf{w}^\top\mathbf{f}(x, y) = \sum_j \mathbf{w}^\top\mathbf{f}(x, y_j, y_{j+1})$
*Lafferty et al. 01
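For chains, the argmax is computed exactly by dynamic programming. A generic Viterbi sketch, assuming the per-node and per-edge scores (w.f evaluated on each part) are passed in as arrays:

```python
import numpy as np

def viterbi(node_scores, edge_scores):
    """MAP assignment for a chain-structured model.
    node_scores: (n, K) array of per-position label scores;
    edge_scores: (K, K) array of transition scores."""
    n, K = node_scores.shape
    score = node_scores[0].copy()
    back = np.zeros((n, K), dtype=int)
    for j in range(1, n):
        cand = score[:, None] + edge_scores + node_scores[j]  # (K, K)
        back[j] = cand.argmax(axis=0)   # best previous label for each label
        score = cand.max(axis=0)
    y = [int(score.argmax())]
    for j in range(n - 1, 0, -1):       # backtrack
        y.append(int(back[j][y[-1]]))
    return y[::-1]
```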
Associative Markov Nets
Point features: spin-images, point height
Edge features: length of edge, edge orientation
"Associative" restriction: edge potentials reward agreement between neighboring labels $y_j, y_k$
[Figure: network over point labels y_j, y_k with node potentials indexed by j and edge potentials indexed by jk]
CFG Parsing
Features count rule productions, e.g. #(NP → DT NN), #(PP → IN NP), ..., and lexical rules, e.g. #(NN → 'sea')
Bilingual Word Alignment
Features: position, orthography, association
[Figure: alignment matrix between English word positions j ("What is the anticipated cost of collecting fees under the new proposal?") and French word positions k ("En vertu de les nouvelles propositions, quel est le coût prévu de perception de les droits?")]
Disulfide Bonds: Non-bipartite Matching
[Figure: sequence RSCCPCYWGGCPWGQNCYPEGCSGPKV with cysteines numbered 1-6; the disulfide bonds form a non-bipartite matching over the cysteines]
Fariselli & Casadio '01, Baldi et al. '04
Scoring Function
Score for each candidate bond (cysteine pair) based on amino acid identities and physical/chemical properties
[Figure: candidate matchings over cysteines 1-6 of RSCCPCYWGGCPWGQNCYPEGCSGPKV]
Structured Models
Mild assumptions: the scoring function is a linear combination of features and decomposes into a sum of part scores,
$\mathbf{w}^\top\mathbf{f}(x, y) = \sum_p \mathbf{w}^\top\mathbf{f}_p(x, y_p)$
Prediction: $y^* = \arg\max_{y \in \mathcal{Y}(x)} \mathbf{w}^\top \mathbf{f}(x, y)$ over the space of feasible outputs $\mathcal{Y}(x)$
Supervised Structured Prediction
Data: $\{(x^i, y^i)\}_{i=1}^m$
Learning: estimate $\mathbf{w}$
  Likelihood (can be intractable)
  Margin
  Local (ignores structure)
Prediction: $y^* = \arg\max_{y \in \mathcal{Y}(x)} \mathbf{w}^\top\mathbf{f}(x, y)$
  Example: weighted matching
  Generally: combinatorial optimization
Local Estimation
Treat edges as independent decisions
Estimate w locally, use globally: e.g., naïve Bayes, SVM, logistic regression; cf. [Matusov+al 03] for matchings
Simple and cheap, but not well-calibrated for the matching model; ignores correlations & constraints
Conditional Likelihood Estimation
Model: $P_\mathbf{w}(y \mid x) = \dfrac{\exp\{\mathbf{w}^\top\mathbf{f}(x, y)\}}{\sum_{y' \in \mathcal{Y}(x)} \exp\{\mathbf{w}^\top\mathbf{f}(x, y')\}}$
Estimate w jointly by maximizing conditional likelihood
Denominator is #P-complete [Valiant 79, Jerrum & Sinclair 93]: tractable model, intractable learning
Need a tractable learning method → margin-based estimation
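For chains the denominator is in fact tractable via the forward recursion; the point is that this breaks for general networks and for matchings. A sketch, assuming node/edge score arrays as inputs:

```python
import numpy as np
from scipy.special import logsumexp

def log_partition_chain(node_scores, edge_scores):
    """log Z(x) for a chain model via the forward recursion.
    Tractable here; for general networks the sum over outputs
    is #P-complete."""
    n, K = node_scores.shape
    alpha = node_scores[0].copy()
    for j in range(1, n):
        alpha = logsumexp(alpha[:, None] + edge_scores, axis=0) + node_scores[j]
    return logsumexp(alpha)
```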
Outline
Structured prediction models: sequences (CRFs), trees (CFGs), associative Markov networks (special MRFs), matchings
Structured large margin estimation: margins and structure, min-max formulation, linear programming inference, certificate formulation
OCR Example
We want: $\arg\max_y \mathbf{w}^\top\mathbf{f}(x, y) = $ "brace"
Equivalently: $\mathbf{w}^\top\mathbf{f}(x, \text{"brace"}) > \mathbf{w}^\top\mathbf{f}(x, y')$ for every alternative y' — "brace" vs "aaaaa", "brace" vs "aaaab", ..., "brace" vs "zzzzz" — a lot!
Parsing Example
We want: the correct parse of 'It was red' to be the argmax
Equivalently: the correct parse must outscore every alternative tree — a lot!
[Figure: correct parse of 'It was red' vs. alternative trees built from symbols S, A, B, C, D, E, F, G, H]
Alignment Example
We want: the correct alignment of 'What is the' / 'Quel est le' to be the argmax
Equivalently: the correct alignment must outscore every alternative alignment — a lot!
[Figure: correct alignment vs. alternative alignments of word positions 1-3]
Structured Loss
Loss $\ell(y, y')$ counts the number of wrong parts in y' relative to y
[Figure: OCR: vs. "brace", the outputs "bcare", "brore", "broce", "brace" have losses 2, 2, 1, 0; alignment: alternative alignments of 'What is the' / 'Quel est le' have losses 0, 1, 2, 2; parsing: alternative trees for 'It was red' have losses 0, 1, 2, 3]
Large margin estimation
Given training examples $(x^i, y^i)$, we want: $\mathbf{w}^\top\mathbf{f}(x^i, y^i) > \mathbf{w}^\top\mathbf{f}(x^i, y) \quad \forall y \neq y^i$
Maximize margin $\gamma$: $\mathbf{w}^\top\mathbf{f}(x^i, y^i) \geq \mathbf{w}^\top\mathbf{f}(x^i, y) + \gamma$
Mistake-weighted margin: scale by $\ell(y^i, y)$, the # of mistakes in y:
$\mathbf{w}^\top\mathbf{f}(x^i, y^i) \geq \mathbf{w}^\top\mathbf{f}(x^i, y) + \gamma\,\ell(y^i, y)$
*Collins 02, Altun et al. 03, Taskar 03
Large margin estimation
Eliminate $\gamma$ by fixing the scale of $\mathbf{w}$ and minimizing $\frac{1}{2}\|\mathbf{w}\|^2$
Add slacks $\xi_i$ for the inseparable case (hinge loss):
$\min_{\mathbf{w}, \xi} \ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad \mathbf{w}^\top\mathbf{f}(x^i, y^i) \geq \mathbf{w}^\top\mathbf{f}(x^i, y) + \ell(y^i, y) - \xi_i \ \ \forall y$
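Each example's constraint block collapses into one hinge term computed by a single loss-augmented inference call, so subgradient training is straightforward. A sketch with hypothetical callbacks `feat`, `loss`, and `loss_aug_argmax` (not the talk's solver):

```python
import numpy as np

def structured_hinge(w, x, y, feat, loss, loss_aug_argmax):
    """Hinge upper bound: max_y' [w.f(x,y') + loss(y,y')] - w.f(x,y).
    `loss_aug_argmax(w, x, y)` performs loss-augmented inference."""
    y_hat = loss_aug_argmax(w, x, y)
    return (w @ feat(x, y_hat) + loss(y, y_hat)) - w @ feat(x, y)

def hinge_subgradient(w, x, y, feat, loss, loss_aug_argmax):
    """Subgradient of the hinge w.r.t. w: f(x, y_hat) - f(x, y)."""
    y_hat = loss_aug_argmax(w, x, y)
    return feat(x, y_hat) - feat(x, y)
```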
Large margin estimation
Brute force enumeration, or:
Min-max formulation — 'plug-in' linear program for inference
Min-max formulation
Key step: collapse the exponentially many constraints per example into one,
$\mathbf{w}^\top\mathbf{f}(x^i, y^i) \geq \max_{y \in \mathcal{Y}(x^i)} \left[\mathbf{w}^\top\mathbf{f}(x^i, y) + \ell(y^i, y)\right] - \xi_i$
Structured loss (Hamming): $\ell(y^i, y) = \sum_j \mathbf{1}[y_j \neq y^i_j]$ decomposes like the score
The inner max is the same discrete optimization as inference; LP inference turns it into continuous optimization
Alternatives: Perceptron
Simple iterative method
Unstable for structured output: fewer instances, big updates
May not converge if non-separable; noisy
Voted / averaged perceptron [Freund & Schapire 99, Collins 02]: regularize / reduce variance by aggregating over iterations
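A sketch of the averaged structured perceptron in this notation; `feat` and the decoder `argmax` are task-specific callbacks (e.g. Viterbi for chains), and outputs are assumed directly comparable:

```python
import numpy as np

def averaged_perceptron(examples, feat, argmax, n_feats, epochs=10):
    """Averaged structured perceptron [Collins 02] sketch.
    `argmax(w, x)` is the task-specific decoder."""
    w = np.zeros(n_feats)
    w_sum = np.zeros(n_feats)
    t = 0
    for _ in range(epochs):
        for x, y in examples:
            y_hat = argmax(w, x)
            if y_hat != y:                       # mistake-driven update
                w += feat(x, y) - feat(x, y_hat)
            w_sum += w                           # aggregate for averaging
            t += 1
    return w_sum / t                             # averaged weights
```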
Alternatives: Constraint Generation [Collins 02; Altun et al. 03]
Add the most violated constraint each round
Handles several more general loss functions
Need to re-solve the QP many times
Theorem: only a polynomial # of constraints needed to achieve ε-error [Tsochantaridis et al. 04]
Worst case # of constraints larger than the factored formulation
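A minimal cutting-plane loop in the spirit of [Tsochantaridis et al. 04]; the QP is re-solved with cvxpy purely for illustration, and `feat`, `loss`, `loss_aug_argmax` are hypothetical callbacks:

```python
import numpy as np
import cvxpy as cp

def cutting_plane(examples, feat, loss, loss_aug_argmax, n_feats,
                  C=1.0, tol=1e-4, max_rounds=50):
    """examples: list of (x, y). Each round adds the most violated
    margin constraint per example, then re-solves the QP."""
    w = cp.Variable(n_feats)
    xi = cp.Variable(len(examples), nonneg=True)
    cons = []
    w_val = np.zeros(n_feats)
    xi_val = np.zeros(len(examples))
    for _ in range(max_rounds):
        new = 0
        for i, (x, y) in enumerate(examples):
            y_hat = loss_aug_argmax(w_val, x, y)   # most violated output
            df = feat(x, y) - feat(x, y_hat)       # margin direction
            if loss(y, y_hat) - w_val @ df > xi_val[i] + tol:
                cons.append(w @ df >= loss(y, y_hat) - xi[i])
                new += 1
        if new == 0:
            break                                  # all constraints satisfied
        cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)),
                   cons).solve()
        w_val, xi_val = w.value, xi.value
    return w_val
```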
Outline
Structured prediction models: sequences (CRFs), trees (CFGs), associative Markov networks (special MRFs), matchings
Structured large margin estimation: margins and structure, min-max formulation, linear programming inference, certificate formulation
Matching Inference LP
$\max_z \sum_{jk} s_{jk}\, z_{jk}$ s.t. degree constraints $\sum_k z_{jk} \leq 1$, $\sum_j z_{jk} \leq 1$, and $0 \leq z \leq 1$
Has integral solutions z (constraint matrix A is totally unimodular) [Nemhauser+Wolsey 88]
Need a Hamming-like loss that decomposes over the z variables
[Figure: alignment matrix between English positions j ("What is the anticipated cost of collecting fees under the new proposal?") and French positions k ("En vertu de les nouvelles propositions, quel est le coût prévu de perception de le droits?")]
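A small check of the integrality claim, with random scores standing in for learned edge potentials: the relaxed LP returns a vertex that is already 0/1, and a combinatorial solver agrees.

```python
import numpy as np
from scipy.optimize import linprog, linear_sum_assignment

def matching_lp(S):
    """Relaxed matching LP: max sum_jk S[j,k] z_jk s.t. degree <= 1,
    0 <= z <= 1. Total unimodularity makes vertex solutions integral."""
    n, m = S.shape
    A = np.zeros((n + m, n * m))     # one row per degree constraint
    for j in range(n):
        A[j, j * m:(j + 1) * m] = 1  # row-degree of word j
    for k in range(m):
        A[n + k, k::m] = 1           # column-degree of word k
    res = linprog(c=-S.ravel(), A_ub=A, b_ub=np.ones(n + m),
                  bounds=(0, 1), method="highs")
    return res.x.reshape(n, m)

S = np.random.rand(4, 4)
z = matching_lp(S)                       # comes back 0/1 at a vertex
rows, cols = linear_sum_assignment(-S)   # combinatorial solver agrees
```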
y ↔ z Map for Markov Nets
Encode an assignment y as indicator variables: node indicators $z_j(a) = \mathbf{1}[y_j = a]$ for each label $a \in \{a, \dots, z\}$, and edge indicators $z_{jk}(a, b) = \mathbf{1}[y_j = a,\, y_k = b]$
[Figure: indicator vectors for each position of the chain and indicator matrices for each edge]
Markov Net Inference LP
$\max_z \sum_{j,a} s_j(a)\, z_j(a) + \sum_{jk,ab} s_{jk}(a, b)\, z_{jk}(a, b)$
s.t. normalization: $\sum_a z_j(a) = 1$; agreement: $\sum_b z_{jk}(a, b) = z_j(a)$
Has integral solutions z for chains and (hyper)trees; can be fractional for untriangulated networks [Chekuri+al 01, Wainwright+al 02]
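A toy instance of this LP, assuming a two-node model so the normalization and agreement constraints are easy to see spelled out:

```python
import numpy as np
from scipy.optimize import linprog

def map_lp_two_nodes(theta1, theta2, theta12):
    """LP relaxation of MAP for a 2-node, K-label model: maximize
    node and edge scores subject to normalization and agreement
    (integral here, since a single edge is a tree)."""
    K = len(theta1)
    nvar = 2 * K + K * K                 # z1, z2, z12 (row-major)
    c = -np.concatenate([theta1, theta2, theta12.ravel()])
    A, b = [], []
    for j in range(2):                   # normalization: sum_a z_j(a) = 1
        row = np.zeros(nvar); row[j * K:(j + 1) * K] = 1
        A.append(row); b.append(1.0)
    for a in range(K):                   # agreement: sum_b z12(a,b) = z1(a)
        row = np.zeros(nvar); row[a] = -1
        row[2 * K + a * K:2 * K + (a + 1) * K] = 1
        A.append(row); b.append(0.0)
    for bb in range(K):                  # agreement: sum_a z12(a,bb) = z2(bb)
        row = np.zeros(nvar); row[K + bb] = -1
        row[2 * K + bb::K] = 1
        A.append(row); b.append(0.0)
    res = linprog(c, A_eq=np.array(A), b_eq=np.array(b),
                  bounds=(0, 1), method="highs")
    z = res.x
    return z[:K], z[K:2 * K], z[2 * K:].reshape(K, K)

z1, z2, z12 = map_lp_two_nodes(np.array([0., 1.]), np.array([1., 0.]),
                               np.array([[2., 0.], [0., 2.]]))
```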
Associative MN Inference LP
Same LP with the "associative" restriction on edge potentials
For K=2 labels, solutions are always integral (optimal); for K>2, within a factor of 2 of optimal (results extend to larger cliques)
[Greig+al 89, Boykov+al 99, Kolmogorov & Zabih 02, Taskar+al 04]
CFG Chart
CNF tree = set of two types of parts: constituents (A, s, e) and CF rules (A → B C, s, m, e)
CFG Inference LP
Constraints labeled inside, outside, and root play the role of normalization and agreement over chart parts
Has integral solutions z
LP Duality
Linear programming duality: variables ↔ constraints, constraints ↔ variables
Optimal values are the same (when both feasible regions are bounded)
Min-max Formulation
Apply LP duality to the inner inference LP: the inner max becomes a min, yielding a single joint minimization (a concise QP)
Min-max formulation summary
Formulation produces a concise QP for: low-treewidth Markov networks, associative MNs (K=2), context-free grammars, bipartite matchings
Approximate for untriangulated MNs, AMNs with K>2
*Taskar et al. 04
Unfactored Primal/Dual
By QP duality: exponentially many constraints/variables
Factored Primal/Dual
By QP duality: the dual inherits structure from the problem-specific inference LP
Dual variables correspond to a decomposition of the variables of the flat case
The Connection
[Figure: the dual variables form a distribution over alternative outputs ("bcare", "brore", "broce" vs. the truth "brace", with losses 2, 2, 1, 0); their per-position letter marginals are exactly the factored dual variables]
Duals and Kernels
Kernel trick works in the factored dual: local functions (log-potentials) can use kernels
3D Mapping
Sensors: laser range finder, GPS, IMU
Data provided by: Michael Montemerlo & Sebastian Thrun
Labels: ground, building, tree, shrub
Training: 30 thousand points; testing: 3 million points
Segmentation results (hand-labeled 180K test points)

Model   Accuracy
SVM     68%
V-SVM   73%
M3N     93%
Fly-through
LAGRbot: Real-time Navigation
LAGRbot: Paul Vernaza & Dan Lee
Range of stereo vision limited to approximately 15 m or less
LAGRbot: Real-time Navigation

Model       Error
Local       17%
Structured  8%

160x120 images: real-time prediction/learning (~100 ms)
Current work with Paul Vernaza, Dan Lee
Hypertext Classification
WebKB dataset: four CS department websites, 1300 pages / 3500 links
Classify each page: faculty, course, student, project, other
Train on three universities, test on the fourth
53% error reduction over SVMs; 38% error reduction over RMNs (loopy belief propagation) [*Taskar et al. 02]; M^3N trained with relaxed LP
[Chart: test error (0-20%) for SVMs, RMNs, M^3Ns; lower is better]
Word Alignment Results
Data: Hansards (Canadian Parliament); features induced on 1 million unsupervised sentences; trained on 100 sentences (10,000 edges); tested on 350 sentences (35,000 edges) [Taskar+al 05]

Model                          *Error
GIZA/IBM4 [Och & Ney 03]       6.5
+Local learning+matching       5.4
+Our approach                  4.9
+Our approach+QAP              4.5

*Error: weighted combination of precision/recall [Lacoste-Julien+Taskar+al 06]
Modeling First Order Effects
Features: monotonicity, local inversion, local fertility
QAP is NP-complete; sentences (30 words, 1k variables) solve in a few seconds (Mosek)
Learning: use LP relaxation; testing: using LP, 83.5% of sentences and 99.85% of edges are integral
Outline
Structured prediction models: sequences (CRFs), trees (CFGs), associative Markov networks (special MRFs), matchings
Structured large margin estimation: margins and structure, min-max formulation, linear programming inference, certificate formulation
Certificate formulation
Non-bipartite matchings: O(n³) combinatorial algorithm, but no polynomial-size LP known
Spanning trees: no polynomial-size LP known, but a simple certificate of optimality
Intuition: verifying optimality is easier than optimizing
Use a compact optimality condition of $y^i$ w.r.t. $\mathbf{w}$
[Figure: matching over nodes 1-6 with edge variables indexed by pairs ij, kl]
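The O(n³) combinatorial side is readily available off the shelf; a sketch using networkx's implementation of Edmonds' blossom algorithm, with toy scores standing in for learned bond potentials:

```python
import networkx as nx

# Maximum-weight matching on a general (non-bipartite) graph over
# six nodes; the edge scores below are illustrative only.
G = nx.Graph()
scores = {(1, 2): 0.9, (1, 4): 0.2, (2, 3): 0.3,
          (3, 6): 0.8, (4, 5): 0.7, (5, 6): 0.1}
for (u, v), s in scores.items():
    G.add_edge(u, v, weight=s)

matching = nx.max_weight_matching(G, maxcardinality=True)
print(sorted(tuple(sorted(e)) for e in matching))
```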
Certificate for non-bipartite matching [Edmonds '65]
Alternating cycle: every other edge is in the matching
Augmenting alternating cycle: score of edges not in the matching greater than edges in the matching
Negate the score of edges not in the matching: augmenting alternating cycle = negative-length alternating cycle
Matching is optimal ⇔ no negative alternating cycles
[Figure: alternating cycle over nodes 1-6]
Certificate for non-bipartite matching
Pick any node r as root; let $d_j$ = length of the shortest alternating path from r to j
Triangle inequality: $d_k \leq d_j + \ell_{jk}$ for each edge (j, k)
Theorem: no negative-length cycle ⇔ such a distance function d exists
Can be expressed as linear constraints: O(n) distance variables, O(n²) constraints
[Figure: alternating paths from the root in the matching over nodes 1-6]
Certificate formulation
Produces a compact QP for: spanning trees, non-bipartite matchings, any problem with a compact optimality condition
*Taskar et al. '05
Disulfide Bonding Prediction
Data: Swiss-Prot 39; 450 sequences (4-10 cysteines)
Features: windows around each C-C pair, physical/chemical properties [Taskar+al 05]

Model                                 *Acc
Local learning+matching               41%
Recursive Neural Net [Baldi+al '04]   52%
Our approach (certificate)            55%

*Accuracy: % of proteins with all bonds correct
Formulation summary
Brute force enumeration
Min-max formulation: 'plug-in' convex program for inference
Certificate formulation: directly guarantee optimality of $y^i$
Scalable Algorithms
Convex quadratic program: # variables and constraints linear in # parameters, edges
Can solve using off-the-shelf software (Matlab, CPLEX, Mosek, etc.) with superlinear convergence
Problem: even linear size is too large; second-order methods run out of memory (quadratic)
Need scalable, memory-efficient methods (space/time tradeoff): structured SMO [Taskar+al 04], structured exponentiated gradient [Bartlett+al 04, Collins+al 07]
These don't work for matchings, min-cuts
Saddle-point Problem
Min-max estimation is a saddle-point problem: $\min_{\mathbf{w}} \max_{z} \mathcal{L}(\mathbf{w}, z)$
Extragradient Method [Korpelevich 76]
Prediction: $\bar{\mathbf{w}} = \pi_{\mathcal{W}}[\mathbf{w} - \eta \nabla_{\mathbf{w}}\mathcal{L}(\mathbf{w}, z)]$, $\bar{z} = \pi_{\mathcal{Z}}[z + \eta \nabla_{z}\mathcal{L}(\mathbf{w}, z)]$
Correction: $\mathbf{w} \leftarrow \pi_{\mathcal{W}}[\mathbf{w} - \eta \nabla_{\mathbf{w}}\mathcal{L}(\bar{\mathbf{w}}, \bar{z})]$, $z \leftarrow \pi_{\mathcal{Z}}[z + \eta \nabla_{z}\mathcal{L}(\bar{\mathbf{w}}, \bar{z})]$
$\pi$ = Euclidean projection, $\eta$ = step size
Theorem: extragradient converges linearly
Key computation is the Euclidean projection: usually easy for $\mathcal{W}$, harder for $\mathcal{Z}$
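A generic sketch of the two-step update, with the gradients and projections passed in as callbacks and a fixed step size purely for illustration:

```python
import numpy as np

def extragradient(grad_w, grad_z, proj_w, proj_z, w, z,
                  eta=0.1, iters=200):
    """Extragradient method [Korpelevich 76] for saddle points
    min_w max_z L(w, z); proj_w/proj_z are Euclidean projections."""
    for _ in range(iters):
        # Prediction step: look ahead along the current gradients.
        w_bar = proj_w(w - eta * grad_w(w, z))
        z_bar = proj_z(z + eta * grad_z(w, z))
        # Correction step: update using gradients at the look-ahead point.
        w = proj_w(w - eta * grad_w(w_bar, z_bar))
        z = proj_z(z + eta * grad_z(w_bar, z_bar))
    return w, z
```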
Projection for Bipartite Matchings: Min-Cost Flow
Min-cost quadratic flow computes the projection
O(N^1.5) complexity for fixed precision (N = # edges)
Reduction to flow for min-cuts also possible [Taskar+al 06]
[Figure: flow network from source s through English words j ("What is the anticipated cost") and French words k ("quel est le coût prévu") to sink t; all capacities = 1; the flow cost encodes the projection]
Structured Extragradient
Extragradient method [Korpelevich 76, Nesterov 03]: linear convergence; key computation is a projection — a min-cost quadratic flow for matchings & cuts
Extensions (using Bregman divergence): dynamic programming for decomposable models
"Online-envy": want memory proportional to # parameters, independent of # examples
Solves problems with a million edges [Taskar+al 06]
Other approaches
Online methods: online updates with respect to most violated constraints [Crammer+al 05, 06]
Regression-based methods: regression from input to a transformed output space [Cortes+al 07]
Learning to search: learn a classifier to guide local search for the structured solution [Daume+al 05]
Many others
Generalization Bounds
[Cartoon caption: "If the past is any indication of the future, he'll have a cruller."]
Generalization Bounds: Several Pointers
Perceptron bound [Collins 01]: assumes separability with margin; bound on 0-1 loss
Covering-number bound [Taskar+al 03]: bound on Hamming loss; logarithmic dependence on # variables in each y
Regret bounds [Crammer+al 06]: online-style guarantees for more general losses
PAC-Bayes bound [McAllester 07]: tighter analysis, consistency
Bounds for learning with approximate inference [Kulesza & Pereira, today]
Open Questions for Large-Margin Estimation
Statistical consistency: hinge loss is not consistent for non-binary output [see Tewari & Bartlett 05, McAllester 07]
Semi-supervised: Laplacian regularization [Altun+McAllester 05], co-regularization [Brefeld+al 05]
Latent variables: machine translation [Liang+al 06], CCG parsing to logical form [Zettlemoyer+Collins 07]
Learning with approximate inference / LP relaxations: does constant-factor approximate inference guarantee anything a priori about learning?
No [see Kulesza & Pereira, tonight]: a simple 3-node counterexample is separable with exact inference but not separable with approximate inference
Question: what other (stronger?) approximate-inference guarantees translate into learning guarantees?
References
Edited collection: G. Bakir+al 07, Predicting Structured Data, MIT Press
Code: SVMstruct by Thorsten Joachims
Slides, more papers at: http://www.cis.upenn.edu/~taskar
Thanks!
Segmentation Model → Min-Cut
Score combines local evidence and spatial smoothness over binary labels {0, 1}
Computing the argmax is hard in general, but if edge potentials are attractive → min-cut algorithm
Multiway cut for the multiclass case → use LP relaxation
[Greig+al 89, Boykov+al 99, Kolmogorov & Zabih 02, Taskar+al 04]
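A sketch of the binary case via an s-t min-cut, in the spirit of the [Greig+al 89] construction; the cost dictionaries and penalty names below are illustrative, not the talk's exact model:

```python
import networkx as nx

def mincut_segment(cost0, cost1, pairwise):
    """Binary MAP by s-t min-cut: minimize
    sum_j cost_{y_j}(j) + sum_{jk} lam_jk * [y_j != y_k],
    with attractive (nonnegative) pairwise penalties lam_jk."""
    G = nx.DiGraph()
    for j in cost0:
        G.add_edge('s', j, capacity=cost1[j])  # cut iff y_j = 1
        G.add_edge(j, 't', capacity=cost0[j])  # cut iff y_j = 0
    for (j, k), lam in pairwise.items():       # cut iff y_j != y_k
        G.add_edge(j, k, capacity=lam)
        G.add_edge(k, j, capacity=lam)
    cut_value, (source_side, sink_side) = nx.minimum_cut(G, 's', 't')
    return {j: (1 if j in sink_side else 0) for j in cost0}

# Toy usage: two neighboring pixels with a smoothness penalty.
labels = mincut_segment(cost0={'p1': 0.2, 'p2': 0.9},
                        cost1={'p1': 0.8, 'p2': 0.1},
                        pairwise={('p1', 'p2'): 0.5})
```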