GRAPH-BASED POSTERIOR REGULARIZATION FOR
SEMI-SUPERVISED STRUCTURED PREDICTION
Luheng He Jennifer Gillenwater† Ben Taskar
†University of Pennsylvania University of Washington
OVERVIEW
Structured Prediction (CRF)
Graph Propagation
A Joint Objective
CONDITIONAL RANDOM FIELD
X = This is recognized as a face
Y = y1 y2 y3 y4 y5 y6 (for example: DET VERB VERB ADP DET NOUN)
Local factor: $f(y_t, y_{t-1}, x)$
Conditional Distribution
$p_\theta(y \mid x) = \frac{1}{Z_\theta(x)} \exp\left[\sum_{t=1}^{T} \theta^\top f(y_t, y_{t-1}, x)\right]$
CRF objective
$\mathrm{NLik}(p_\theta) = -\sum_{i=1}^{\ell} \log p_\theta(y^i \mid x^i)$
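A minimal sketch of how the conditional distribution and the CRF objective can be evaluated, assuming the local scores $\theta^\top f(y_t, y_{t-1}, x)$ have already been computed into a table `scores[t][prev][cur]`. The table layout, the dummy start tag, and the toy numbers are illustrative assumptions, not part of the talk. The forward recursion is checked against brute-force enumeration on a tiny chain.

```python
import itertools
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def path_score(scores, tags):
    """Unnormalized log-score: sum over t of scores[t][y_{t-1}][y_t]."""
    total, prev = 0.0, 0  # tag index 0 doubles as a dummy start tag at t = 0
    for t, cur in enumerate(tags):
        total += scores[t][prev][cur]
        prev = cur
    return total

def log_partition(scores, num_tags):
    """log Z(x) by the forward recursion, O(T * K^2) instead of O(K^T)."""
    alpha = [scores[0][0][k] for k in range(num_tags)]
    for t in range(1, len(scores)):
        alpha = [
            logsumexp([alpha[j] + scores[t][j][k] for j in range(num_tags)])
            for k in range(num_tags)
        ]
    return logsumexp(alpha)

def neg_log_likelihood(examples, num_tags):
    """NLik = -sum_i log p(y^i | x^i) over (scores, gold_tags) pairs."""
    return -sum(
        path_score(scores, gold) - log_partition(scores, num_tags)
        for scores, gold in examples
    )

# Toy instance with made-up scores (assumption): 3 positions, 2 tags.
K, T = 2, 3
scores = [[[0.5 * t + 0.3 * j - 0.7 * k * (j + 1) for k in range(K)]
           for j in range(K)] for t in range(T)]
brute_force_log_z = logsumexp(
    [path_score(scores, y) for y in itertools.product(range(K), repeat=T)]
)
forward_log_z = log_partition(scores, K)
nlik = neg_log_likelihood([(scores, (0, 1, 0))], K)
```

The two values of log Z agree up to floating-point error, which is the point of the chain factorization: the normalizer never requires enumerating all tag sequences.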
GRAPH PROPAGATION
Zhu et al. (ICML 2013): Graph-based Semi-supervised Learning
The painting is considered as a work of genius . (considered: VERB)
This is recognized as a face . (recognized: ?)
similar context → similar tagging (recognized: VERB)
GRAPH LAPLACIAN REGULARIZER
... a run along ... NOUN (edge weight 0.4)
... a run for ... NOUN (edge weight 0.8)
... luck run out ... VERB (edge weight 0.8)
... ninth run for ... ?
$\mathrm{Prob}(\mathrm{tag} \mid \text{ninth run for}) = \arg\min_m \; 0.4\,\|m - \mathrm{Prob}(\mathrm{tag} \mid \text{a run along})\|_2^2 + 0.8\,\|m - \mathrm{Prob}(\mathrm{tag} \mid \text{a run for})\|_2^2 + 0.8\,\|m - \mathrm{Prob}(\mathrm{tag} \mid \text{luck run out})\|_2^2$
Prob(NOUN | ninth run for) = 0.6
Prob(VERB | ninth run for) = 0.4
Prob(ADV | ninth run for) = 0
Prob(DET | ninth run for) = 0
…
$\min_m \mathrm{Lap}(m) = \sum_{a \in \mathrm{Unlab}} \sum_{b \in \mathrm{Neighbors}(a)} \sum_{k \in \mathrm{Tags}} w_{ab}\,(m_{a,k} - m_{b,k})^2$
$m_{a,k}$: The proportion of time trigram a has tag k
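The "ninth run for" example can be worked through numerically. The sketch below assumes the three neighbors carry fixed point-mass tag distributions (NOUN, NOUN, VERB), in which case the minimizer of the weighted squared distance is simply the weighted average of the neighbor distributions. The function names are hypothetical.

```python
def propagate(neighbors):
    """Minimizer of sum_b w_b * ||m - m_b||^2: the weighted mean of the m_b."""
    total_weight = sum(w for w, _ in neighbors)
    tags = sorted(set().union(*(dist.keys() for _, dist in neighbors)))
    return {
        k: sum(w * dist.get(k, 0.0) for w, dist in neighbors) / total_weight
        for k in tags
    }

def weighted_objective(m, neighbors):
    """Value of sum_b w_b * ||m - m_b||^2 for a candidate distribution m."""
    tags = set(m).union(*(dist.keys() for _, dist in neighbors))
    return sum(
        w * sum((m.get(k, 0.0) - dist.get(k, 0.0)) ** 2 for k in tags)
        for w, dist in neighbors
    )

# Edge weights and neighbor tags taken from the slide.
neighbors = [
    (0.4, {"NOUN": 1.0}),  # ... a run along ...
    (0.8, {"NOUN": 1.0}),  # ... a run for ...
    (0.8, {"VERB": 1.0}),  # ... luck run out ...
]
m = propagate(neighbors)  # NOUN is about 0.6 and VERB about 0.4, as on the slide
best = weighted_objective(m, neighbors)
other = weighted_objective({"NOUN": 0.5, "VERB": 0.5}, neighbors)
```

Here NOUN receives (0.4 + 0.8) / 2.0 and VERB receives 0.8 / 2.0, and any other candidate such as the uniform split gives a larger objective value.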
COMBINING THE TWO
CRF estimation. Data: labeled. Model: p(tags | sentence; θ)
Graph propagation. Data: labeled + unlabeled. Model: m(tag | trigram)
PRIOR WORK
Subramanya et al. (EMNLP 2010): graph-propagation + CRF estimation
Our work: retains efficiency while optimizing an extensible, joint objective.
HOW TO COMBINE?
introduce auxiliary variables q
1. Normalized
2. Decomposed into local factors
$p_\theta(y \mid x^i) = \frac{1}{Z_\theta(x^i)} \exp\left[\sum_{t=1}^{T} \theta^\top f(y_t, y_{t-1}, x^i)\right]$
$q^i_y = \frac{1}{Z_q(x^i)} \exp\left[\sum_{t=1}^{T} r_{i,t}(y_t, y_{t-1})\right]$
JOINT OBJECTIVE
$\mathcal{J}(q, p_\theta) = \mathrm{NLik}(p_\theta) + \mathrm{Lap}(q) + \mathrm{KL}(q \,\|\, p_\theta)$
$\mathrm{NLik}(p_\theta) = -\sum_{i=1}^{\ell} \log p_\theta(y^i \mid x^i)$
$\mathrm{Lap}(q) = \sum_{a \in \mathrm{Unlab}} \sum_{b \in \mathrm{Neighbors}(a)} \sum_{k \in \mathrm{Tags}} w_{ab}\,(m_{a,k}(q) - m_{b,k}(q))^2$
$\mathrm{KL}(q \,\|\, p_\theta) = \sum_{i=1}^{n} \sum_{y} q^i_y \log \frac{q^i_y}{p_\theta(y \mid x^i)}$
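To make the KL term concrete, here is a small sketch that evaluates $\sum_y q_y \log(q_y / p_y)$ for a single sentence by enumerating every tag sequence. Enumeration is only feasible for toy sizes; the score tables, the dummy start tag, and the function names are assumptions made for illustration.

```python
import itertools
import math

def chain_distribution(scores, num_tags):
    """Distribution proportional to exp(sum_t scores[t][y_{t-1}][y_t])."""
    weights = {}
    for y in itertools.product(range(num_tags), repeat=len(scores)):
        prev, total = 0, 0.0  # tag index 0 doubles as a dummy start tag
        for t, cur in enumerate(y):
            total += scores[t][prev][cur]
            prev = cur
        weights[y] = math.exp(total)
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}

def kl_divergence(q, p):
    """KL(q || p) = sum_y q_y * log(q_y / p_y), skipping zero-mass terms."""
    return sum(qy * math.log(qy / p[y]) for y, qy in q.items() if qy > 0.0)

# Toy tables (assumption): q prefers tag 1, p is uniform over sequences.
K, T = 2, 3
r_scores = [[[0.4 * k - 0.2 * j * k for k in range(K)] for j in range(K)]
            for _ in range(T)]
theta_scores = [[[0.0 for _ in range(K)] for _ in range(K)] for _ in range(T)]

q = chain_distribution(r_scores, K)
p = chain_distribution(theta_scores, K)
kl_qp = kl_divergence(q, p)
kl_qq = kl_divergence(q, q)
```

The divergence is zero only when q and p agree, so this term pulls the auxiliary distribution and the CRF toward each other.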
HOW TO OPTIMIZE?
$\min_{q,\theta} \mathcal{J}(q, p_\theta)$
update p: $\theta$ is unconstrained
$\theta' = \theta - \eta \frac{\partial \mathcal{J}(q, p_\theta)}{\partial \theta}$
update q: must satisfy $\sum_y q^i_y = 1$
no compact form: (# tags)^(i's length) values
projection is hard
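A quick sketch of why an explicit table for q is impractical. It compares the number of entries in a full table over tag sequences with the number of pairwise local-factor entries of a chain. The tag count of 12 matches the universal POS tagset; the sentence length of 20 is an arbitrary illustrative choice.

```python
def full_table_size(num_tags, length):
    """One value per complete tag sequence: (# tags) ** length."""
    return num_tags ** length

def local_factor_size(num_tags, length):
    """One value per adjacent tag pair per position: (# tags)^2 * length."""
    return num_tags ** 2 * length

num_tags, length = 12, 20  # length is an assumed example value
explicit = full_table_size(num_tags, length)
factored = local_factor_size(num_tags, length)
ratio = explicit // factored
```

The explicit table grows exponentially with sentence length while the factored count grows linearly, which is why a representation in terms of local factors is needed.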
UPDATE Q
q can be represented by local factors r
$q^i_y = \frac{1}{Z_q(x^i)} \exp\left[\sum_{t=1}^{T} r_{i,t}(y_t, y_{t-1})\right]$
doing an additive gradient update
${q^i_y}' = q^i_y - \eta \frac{\partial \mathcal{J}(q, p_\theta)}{\partial q^i_y}$
q' cannot be written as product of local factors!
EXPONENTIATED GRADIENT
Collins et al. (JMLR 2008): Exponentiated gradient for CRFs
multiplicative gradient update:
${q^i_y}' = \frac{1}{Z_{q'}(x^i)}\, q^i_y \exp\left[-\eta \frac{\partial \mathcal{J}(q, p_\theta)}{\partial q^i_y}\right]$
$= \frac{1}{Z_{q'}(x^i)} \exp\left[\sum_{t=1}^{T} r_{i,t}(y_t, y_{t-1})\right] \exp\left[-\eta \frac{\partial \mathcal{J}(q, p_\theta)}{\partial q^i_y}\right]$
decompose into local factors
$= \frac{1}{Z_{q'}(x^i)} \exp\left[\sum_{t=1}^{T} r'_{i,t}(y_t, y_{t-1})\right]$
only updating (# tags)^2 × (i's length) variables!
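The equivalence between the multiplicative update on q and an additive update on the local factors can be checked numerically. The sketch below assumes the gradient for a sequence y is a sum of local terms `g[t][prev][cur]` plus a constant (the constant is absorbed by the normalizer). The tables and names are hypothetical, and enumeration is used only to verify the claim on a toy chain.

```python
import itertools
import math

def path_sum(table, tags):
    """Sum over t of table[t][y_{t-1}][y_t], with tag 0 as a dummy start tag."""
    total, prev = 0.0, 0
    for t, cur in enumerate(tags):
        total += table[t][prev][cur]
        prev = cur
    return total

def chain_distribution(r, num_tags):
    """q_y proportional to exp(sum_t r[t][y_{t-1}][y_t]), by enumeration."""
    seqs = itertools.product(range(num_tags), repeat=len(r))
    weights = {y: math.exp(path_sum(r, y)) for y in seqs}
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}

def eg_update_global(q, grad, eta):
    """Multiplicative update: q'_y proportional to q_y * exp(-eta * grad_y)."""
    unnorm = {y: qy * math.exp(-eta * grad[y]) for y, qy in q.items()}
    z = sum(unnorm.values())
    return {y: u / z for y, u in unnorm.items()}

def eg_update_local(r, g, eta):
    """Same update on the factors: r'_t = r_t - eta * g_t."""
    return [[[r_val - eta * g_val for r_val, g_val in zip(r_row, g_row)]
             for r_row, g_row in zip(r_t, g_t)] for r_t, g_t in zip(r, g)]

K, T, eta = 2, 3, 0.5
r = [[[0.2 * t - 0.5 * j * k + 0.1 * k for k in range(K)] for j in range(K)]
     for t in range(T)]
g = [[[0.3 * k - 0.1 * t * j for k in range(K)] for j in range(K)]
     for t in range(T)]

q = chain_distribution(r, K)
grad = {y: path_sum(g, y) + 1.0 for y in q}  # local terms plus a constant
q_global = eg_update_global(q, grad, eta)
q_local = chain_distribution(eg_update_local(r, g, eta), K)
max_gap = max(abs(q_global[y] - q_local[y]) for y in q)
```

Both routes give the same distribution, but the local route only touches the factor tables, so the update stays normalized and factored without ever enumerating sequences in practice.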
SUMMARY
$\mathcal{J}(q, p_\theta) = \mathrm{NLik}(p_\theta) + \mathrm{Lap}(q) + \mathrm{KL}(q \,\|\, p_\theta)$
E-step: ${q^i_y}' = \frac{1}{Z_{q'}(x^i)}\, q^i_y \exp\left[-\eta \frac{\partial \mathcal{J}(q, p_\theta)}{\partial q^i_y}\right]$
(update each local factor $r_{i,t}$ in practice), where $q^i_y = \frac{1}{Z_q(x^i)} \exp\left[\sum_{t=1}^{T} r_{i,t}(y_t, y_{t-1})\right]$
M-step: $\theta' = \theta - \eta \frac{\partial \mathcal{J}(q, p_\theta)}{\partial \theta}$
Theorem: Converges to a local optimum of $\mathcal{J}(q, p_\theta)$
EXPERIMENT SETTING
10 Languages (CoNLL-X and CoNLL-2007)
100 Randomly sampled labeled sentences
Averaged over 10 sampling runs
Universal POS Tags (Petrov et al. 2011)
Second Order CRF Model $f(y_t, y_{t-1}, y_{t-2}, x)$
[Figure: POS Tagging Error (y-axis, ticks from 8 to 24) by Language (x-axis: EN DE ES PT DA SL SV EL IT NL Avg), comparing GP, GP → CRF, CRF, and J.]
28% average relative error reduction
CONCLUSION
$\mathcal{J}(q, p_\theta) = \mathrm{NLik}(p_\theta) + \mathrm{Lap}(q) + \mathrm{KL}(q \,\|\, p_\theta)$
$\mathrm{Lap}(q)$ can be replaced by a Posterior Constraint $R(q)$: any convex, differentiable regularizer
Code: https://code.google.com/p/pr-graph/