Learning with Sparse Latent Structure
Vlad Niculae
Instituto de Telecomunicações
Work with: André Martins, Claire Cardie, Mathieu Blondel
github.com/deep-spin/lp-sparsemap @vnfrombucharest https://vene.ro
Rich Underlying Structure
title
author
date
body
segmentation: sentences, words, and so on
entities
relationships, e.g., dependency
Most of this structure is hidden.
Rich Underlying Structure
A widely occurring pattern!
speech (Andre-Obrecht, 1988)
objects (Long et al., 2015)
transition graphs (Kipf, Pol, et al., 2020)
But we'll focus on NLP.
Structured Prediction
For "dog on wheels", many structured analyses are possible:
POS tag sequences: VERB PREP NOUN, NOUN PREP NOUN, NOUN DET NOUN, · · ·
dependency trees over "⋆ dog on wheels", · · ·
alignments between "dog on wheels" and "hond op wielen", · · ·
Traditional Pipeline Approach
input → pretrained parser → output (positive / neutral / negative)
Deep Learning & Hidden Representations
input → dense vector → output (positive / neutral / negative)
Latent Structure Models
input → latent structures (trees, alignments, · · ·) → output (positive / neutral / negative)
*record scratch*
*freeze frame*
How to select an item from a set?
How to select an item from a set?
candidates c1, c2, · · ·, cN
scores θ = f(x; w), e.g. θ = [2, 4, −1, 1, −3]
selection p, e.g. p = [0, 1, 0, 0, 0]
input x → θ → p → output y = g(p, x; w)
∂y/∂w = ?   or, essentially, ∂p/∂θ = ?
Argmax
θ → p: put all probability on the highest-scoring candidate among c1, · · ·, cN
∂p/∂θ = ?
Argmax
∂p/∂θ = 0 almost everywhere
(plot: p1 as a function of θ1 is a step, jumping from 0 to 1 at θ1 = θ2)
Argmax vs. Softmax
softmax: pj = exp(θj)/Z
∂p/∂θ = diag(p) − pp⊤
(plot: under softmax, p1 varies smoothly with θ1 instead of jumping at θ1 = θ2)
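A quick numerical check of the softmax Jacobian above (a NumPy sketch, not part of the original slides):

import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())        # shift for numerical stability
    return e / e.sum()

theta = np.array([2.0, 4.0, -1.0, 1.0, -3.0])
p = softmax(theta)
J = np.diag(p) - np.outer(p, p)            # closed-form Jacobian ∂p/∂θ

# finite-difference check of the first column
eps = 1e-6
col0 = (softmax(theta + eps * np.eye(5)[0]) - p) / eps
assert np.allclose(J[:, 0], col0, atol=1e-4)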
A Softmax Origin Story
△ = {p ∈ R^N : p ≥ 0, 1⊤p = 1}   (the probability simplex)
N = 2: a segment with vertices p = [1, 0] and p = [0, 1], midpoint p = [1/2, 1/2]
N = 3: a triangle with vertices including p = [0, 1, 0] and p = [0, 0, 1], center p = [1/3, 1/3, 1/3]
A Softmax Origin Story
max_j θj = max_{p∈△} p⊤θ   (Fundamental Thm. of Linear Programming; Dantzig et al., 1955)
N = 2: θ = [.2, 1.4] → p⋆ = [0, 1]
N = 3: θ = [.7, .1, 1.5] → p⋆ = [0, 0, 1]
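A small sketch (assuming SciPy; not from the slides) confirming that this linear program over the simplex is solved at a vertex, i.e. at the one-hot indicator of the argmax:

import numpy as np
from scipy.optimize import linprog

theta = np.array([0.7, 0.1, 1.5])
# maximize p⊤θ over the simplex  <=>  minimize −θ⊤p  s.t.  1⊤p = 1, p ≥ 0
res = linprog(c=-theta, A_eq=np.ones((1, 3)), b_eq=[1.0], bounds=[(0, None)] * 3)
print(res.x)   # ≈ [0, 0, 1]: the vertex picking out argmax_j θj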
Smoothed Max Operators
π_Ω(θ) = argmax_{p∈△} p⊤θ − Ω(p)
argmax: Ω(p) = 0 (no smoothing)
softmax: Ω(p) = ∑_j pj log pj
sparsemax: Ω(p) = 1/2 ∥p∥²₂   (Martins and Astudillo, 2016)
α-entmax: Ω(p) = 1/(α(α−1)) ∑_j pj^α
(Tsallis, 1988; a generalized entropy, Grünwald and Dawid, 2004)
(Blondel, Martins, and Niculae 2019a; Peters, Niculae, and Martins 2019; Correia, Niculae, and Martins 2019)
(plot: p1 as a function of θ1 for each operator; example outputs [0, 0, 1], [.3, .2, .5], [.3, 0, .7])
(Niculae and Blondel, 2017)
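As a minimal sketch (not the authors' implementation), sparsemax can be computed with a sort-based projection onto the simplex; on the same scores, softmax stays dense while sparsemax produces exact zeros:

import numpy as np

def sparsemax(theta):
    z = np.sort(theta)[::-1]                   # scores in decreasing order
    cssv = np.cumsum(z) - 1.0
    k = np.arange(1, len(theta) + 1)
    support = z - cssv / k > 0
    tau = cssv[support][-1] / k[support][-1]   # threshold τ
    return np.maximum(theta - tau, 0.0)        # p⋆ = [θ − τ1]+

theta = np.array([1.6, 0.8, -0.2, 1.4])
print(np.round(np.exp(theta) / np.exp(theta).sum(), 2))   # softmax: all entries > 0
print(sparsemax(theta))                                    # [0.6, 0.0, 0.0, 0.4]: exact zeros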
softmax
sparsemax
fusedmax ?!
Smoothed Max Operators
fusedmax: Ω(p) = 1/2 ∥p∥²₂ + ∑_j |pj − pj−1|
(Niculae and Blondel, 2017)
Structured Prediction, finally
Structured Prediction is essentially a (very high-dimensional) argmax
The candidates c1, c2, · · ·, cN are now entire structures: input x → θ → p → output y
There are exponentially many structures (θ cannot fit in memory!)
Factorization Into Parts: θ = A⊤η

Example: dependency parses of "⋆ dog on wheels". Rows of A index parts (arcs), columns index structures:

A =  ⋆→dog         1 0 0 ···
     on→dog        0 1 1 ···
     wheels→dog    0 0 0 ···
     ⋆→on          0 1 1 ···
     dog→on        1 0 0 ···
     wheels→on     0 0 0 ···
     ⋆→wheels      0 0 0 ···
     dog→wheels    0 1 0 ···
     on→wheels     1 0 1 ···

η = [.1, .2, −.1, .3, .8, .1, −.3, .2, −.1]   (one score per arc)

A TREE over "dog on wheels" is encoded by its indicator column, e.g. a_y = [010 100 001].
The same factorization covers alignments, e.g. matching "dog on wheels" with "hond op wielen":

A =  dog—hond       1 0 0 ···
     dog—op         0 1 1 ···
     dog—wielen     0 0 0 ···
     on—hond        0 0 0 ···
     on—op          1 0 0 ···
     on—wielen      0 1 1 ···
     wheels—hond    0 1 0 ···
     wheels—op      0 0 0 ···
     wheels—wielen  1 0 1 ···

η = [.1, .2, −.1, .3, .8, .1, −.3, .2, −.1]
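A toy sketch (not from the slides) of the factorization θ = A⊤η on the dependency example, showing that a structure's score is the sum of its arcs' scores:

import numpy as np

eta = np.array([.1, .2, -.1, .3, .8, .1, -.3, .2, -.1])   # one score per arc
A = np.array([[1, 0, 0],    # ⋆→dog
              [0, 1, 1],    # on→dog
              [0, 0, 0],    # wheels→dog
              [0, 1, 1],    # ⋆→on
              [1, 0, 0],    # dog→on
              [0, 0, 0],    # wheels→on
              [0, 0, 0],    # ⋆→wheels
              [0, 1, 0],    # dog→wheels
              [1, 0, 1]])   # on→wheels

theta = A.T @ eta           # one score per structure (never materialized in full in practice)
print(theta)                # [0.8, 0.7, 0.4]
a_y = A[:, 2]               # the tree {on→dog, ⋆→on, on→wheels}, i.e. a_y = [010 100 001]
print(a_y @ eta)            # 0.4, same as theta[2]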
argmax      argmax_{p∈△} p⊤θ
softmax     argmax_{p∈△} p⊤θ + H(p)
sparsemax   argmax_{p∈△} p⊤θ − 1/2 ∥p∥²

MAP         argmax_{μ∈M} μ⊤η
            e.g. dependency parsing → Chu-Liu/Edmonds; matching → Kuhn-Munkres
marginals   argmax_{μ∈M} μ⊤η + H̃(μ)
            e.g. sequence labeling → forward-backward (Rabiner, 1989); as attention: (Kim et al., 2017)
            e.g. dependency parsing → the Matrix-Tree theorem (Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Satta, 2007); as attention: (Liu and Lapata, 2018)
            e.g. matchings → #P-complete! (Taskar, 2004; Valiant, 1979)
SparseMAP   argmax_{μ∈M} μ⊤η − 1/2 ∥μ∥²   (Niculae, Martins, Blondel, and Cardie, 2018)

M := conv{a_h : h ∈ H} = {Ap : p ∈ △} = {E_{H∼p}[a_H] : p ∈ △}   (the marginal polytope)
Algorithms for SparseMAP
μ⋆ = argmax_{μ∈M} μ⊤η − 1/2 ∥μ∥²   (quadratic objective, linear constraints; alas, exponentially many!)

Forward pass: Conditional Gradient (Frank and Wolfe, 1956; Lacoste-Julien and Jaggi, 2015)
• select a new corner of M:  a_{y⋆} = argmax_{μ∈M} μ⊤(η − μ^(t−1)), i.e. a MAP call with adjusted scores η̃
• update the (sparse) coefficients of p
• Update rules: vanilla, away-step, pairwise
• Quadratic objective: Active Set (Nocedal and Wright, 1999, Ch. 16.4 & 16.5; Wolfe, 1976; Martins, Figueiredo, et al., 2015) achieves finite & linear convergence!

Backward pass
• ∂μ/∂η is sparse
• computing (∂μ/∂η)⊤ dy takes O(dim(μ) nnz(p⋆))

Completely modular: just add MAP.
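A minimal sketch of this idea (vanilla conditional gradient with exact line search, given only a MAP oracle; the paper's solver is the more refined active-set variant):

import numpy as np

def sparsemap_cg(eta, map_oracle, n_iter=100):
    """Maximize μ⊤η − 1/2 ∥μ∥² over M = conv{a_h}, using MAP as the only primitive."""
    mu = map_oracle(eta)                        # start from a corner of M
    for _ in range(n_iter):
        s = map_oracle(eta - mu)                # new corner: MAP with adjusted scores η̃ = η − μ
        d = s - mu
        gamma = np.clip((eta - mu) @ d / max(d @ d, 1e-12), 0.0, 1.0)
        mu = mu + gamma * d                     # convex update keeps μ inside M
    return mu

# Toy oracle: exhaustive MAP over the three structures (columns of A) from the earlier slide.
A = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0], [0, 1, 1], [1, 0, 0],
              [0, 0, 0], [0, 0, 0], [0, 1, 0], [1, 0, 1]], dtype=float)
map_oracle = lambda scores: A[:, np.argmax(A.T @ scores)]
eta = np.array([.1, .2, -.1, .3, .8, .1, -.3, .2, -.1])
print(np.round(sparsemap_cg(eta, map_oracle), 3))   # μ⋆: a sparse combination of a few structures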
SparseMAP Applications
• Sparse alignment attention (more later) (Niculae, Martins, Blondel, and Cardie, 2018)
• Latent TreeLSTM (Niculae, Martins, and Cardie, 2018)
• As loss: supervised dependency parsing (Niculae, Martins, Blondel, and Cardie, 2018; Blondel, Martins, and Niculae, 2019b)
Latent Dependency Trees
ListOps (Nangia and Bowman, 2018): ( max 2 9 ( min 4 7 ) 0 )
Arity tagging with a latent GCN (Corro and Titov, 2019; Kipf and Welling, 2017)
Target arity tags: 4 - - 2 - - - - -  (the max node has arity 4, the min node arity 2)
(plot: ListOps validation F1 vs. training epoch, comparing Gold tree, Latent tree, and Left-to-right)
What if MAP is not available?
Multiple, Overlapping Factors
Maximization in factor graphs: NP-hard, even when each factor is tractable.
Example: the arcs of "⋆ dog on wheels" under a global TREE factor plus a BUDGET factor per head (⋆→, dog→, on→, wheels→).
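For intuition, the MAP oracle of a single BUDGET factor (at most B variables on) is just a thresholded top-B selection; a small illustrative sketch, not taken from the library:

import numpy as np

def budget_map(scores, budget):
    """MAP for a BUDGET factor: turn on at most `budget` variables, the highest-scoring positive ones."""
    y = np.zeros_like(scores)
    for i in np.argsort(-scores)[:budget]:
        if scores[i] > 0:                 # only profitable variables are switched on
            y[i] = 1.0
    return y

print(budget_map(np.array([0.9, -0.2, 0.4, 0.7, 0.1]), budget=2))   # [1. 0. 0. 1. 0.]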
Optimization as Consensus-Seeking
Each factor keeps a local copy of its variables: μa and μb over the blocks μ[1:3], μ[4:6], μ[7:9].
Agreement on the overlap: μa,[4:6] = μb,[4:6] = μ[4:6]

max_{μ,μf} ∑_{f∈F} η_f⊤ μ_f − 1/2 ∥μ∥²   s.t.  Cf μ = μf,  μf ∈ Mf  for f ∈ F

LP relaxation (Wainwright and Jordan, 2008) over the local polytope L := {μ : Cf μ ∈ Mf, f ∈ F} ⊇ M
(Niculae and Martins, 2020)
Algorithms for LP-SparseMAP (Niculae and Martins, 2020)

Forward pass
argmax_{Cfμ=μf} ∑_{f∈F} η_f⊤ μ_f − 1/2 ∥μ∥²  =  argmax_{Cfμ=μf} ∑_{f∈F} (η_f⊤ μ_f − 1/2 ∥Df μf∥²)
• Separable objective, agreement constraints → ADMM in consensus form
• SparseMAP subproblem for each f

Backward pass
• Sparse fixed-point iteration
• Combines the SparseMAP Jacobians of each factor
Differentiable Sparse Structured Prediction (Niculae and Martins, 2020)

Factor graphs as a hidden-layer DSL!
If |F| = 1, recovers SparseMAP.
Modular library. Built-in specialized factors:
• OR, XOR, AND
• OR-with-output
• Budget, Knapsack
• Pairwise
New factors only require MAP.

fg = FactorGraph()
var = [fg.variable() for i != j]  # handwave: one variable per candidate arc (i, j)
fg.add(Tree(var))
for i in range(n):
    fg.add(Budget(var[i, :], budget=5))
μ = fg.lp_sparsemap(η)
Library skeleton (as on the slides):

class Factor:
    def map(self, ηf):  # abstract, private: the factor's MAP oracle
        raise NotImplementedError

    def sparsemap(self, ηf):
        ...  # active set algorithm, uses self.map

    def backward(self, dμf):
        ...  # analytic Jacobian-vector product, uses the active set result

class Budget(Factor):
    def sparsemap(self, ηf):
        ...  # specialized

    def backward(self, dμf):
        ...  # specialized

class Tree(Factor):
    def map(self, η):
        ...  # Chu-Liu/Edmonds algorithm
(plot: ListOps validation F1 vs. epoch, comparing Gold tree, Latent tree, Latent w/ Budget(5), and Left-to-right)
Structured Attention for Alignments
NLI premise: A gentleman overlooking a neighborhood situation.
hypothesis: A police officer watches a situation closely.
input (P, H) → attention over word pairs → output: entails / contradicts / neutral
(Baseline model: decomposable attention (Parikh et al., 2016). Proposed model: global structured alignment.)
Structured Alignment Models
matching: SparseMAP w/ Kuhn-Munkres (Kuhn, 1955)
LP-matching: LP-SparseMAP w/ XOR factors (equivalent; different solver)
LP-sequence: additional score for contiguous alignments (i, j) − (i + 1, j ± 1)
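The MAP oracle for the matching factor is the Kuhn-Munkres assignment, readily available in SciPy; a small sketch (assuming SciPy, not the talk's code):

import numpy as np
from scipy.optimize import linear_sum_assignment

# scores[i, j]: affinity between premise word i and hypothesis word j
scores = np.array([[0.9, 0.1, 0.2],
                   [0.3, 0.8, 0.1],
                   [0.2, 0.4, 0.7]])
rows, cols = linear_sum_assignment(scores, maximize=True)   # Kuhn-Munkres (Hungarian) algorithm
print(list(zip(rows, cols)))   # [(0, 0), (1, 1), (2, 2)]: the highest-scoring matching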
MultiNLI (Williams et al., 2017)
(bar chart: accuracy of softmax, matching, LP-matching, and LP-sequence attention, in the 66%–72% range)
(figures: sparse alignment attention between "a gentleman overlooking a neighborhood situation ." and "a police officer watches a situation closely .")
Conclusions
Differentiable & sparse structured inference
Generic, extensible, efficient algorithms
Interpretable structured attention

Future work
Structure beyond NLP
Weak & semi-supervision
Generative latent structure models

[email protected]  github.com/deep-spin/lp-sparsemap  https://vene.ro  @vnfrombucharest
Extra slides
Acknowledgements
This work was supported by the European Research Council (ERC StG DeepSPIN 758969) and by the Fundação para a Ciência e Tecnologia through contract UID/EEA/50008/2013.
Some icons by Dave Gandy and Freepik via flaticon.com.
Sparsemax
sparsemax(θ) = argmax_{p∈△} p⊤θ − 1/2 ∥p∥²₂ = argmin_{p∈△} ∥p − θ∥²₂

Computation: p⋆ = [θ − τ1]+ ;  θi > θj ⇒ pi ≥ pj ;  O(d) via partial sort
(Held et al., 1974; Brucker, 1984; Condat, 2016)

Backward pass: J_sparsemax = diag(s) − (1/|S|) ss⊤, where S = {j : p⋆j > 0}, sj = ⟦j ∈ S⟧
(Martins and Astudillo, 2016)
argmin differentiation (Gould et al., 2016; Amos and Kolter, 2017)
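A sketch of this backward pass as a Jacobian-vector product (using the formula above; the values shown assume the example θ = [1.6, 0.8, −0.2, 1.4] from earlier):

import numpy as np

def sparsemax_jvp(p_star, dy):
    """Apply J = diag(s) − (1/|S|) ss⊤ to dy without materializing the d×d Jacobian."""
    s = (p_star > 0).astype(float)          # support indicator
    return s * dy - s * (s @ dy) / s.sum()

p_star = np.array([0.6, 0.0, 0.0, 0.4])     # sparsemax([1.6, 0.8, −0.2, 1.4])
dy = np.array([1.0, -2.0, 0.5, 0.0])
print(sparsemax_jvp(p_star, dy))            # [0.5, 0, 0, −0.5]: gradient flows only through the support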
Fusedmax
fusedmax(θ) = argmax_{p∈△} p⊤θ − 1/2 ∥p∥²₂ − ∑_{2≤j≤d} |pj − pj−1|
            = argmin_{p∈△} ∥p − θ∥²₂ + ∑_{2≤j≤d} |pj − pj−1|
prox_fused(θ) = argmin_{p∈R^d} ∥p − θ∥²₂ + ∑_{2≤j≤d} |pj − pj−1|
Proposition: fusedmax(θ) = sparsemax(prox_fused(θ))   (Niculae and Blondel, 2017)
"Fused Lasso", a.k.a. 1-d Total Variation (Tibshirani et al., 2005)
Danskin's Theorem (Danskin, 1966; Prop. B.25 in Bertsekas, 1999)
Let φ : R^d × Z → R, Z ⊂ R^d compact.
∂ max_{z∈Z} φ(x, z) = conv{∇x φ(x, z⋆) | z⋆ ∈ argmax_{z∈Z} φ(x, z)}.

Example: maximum of a vector
∂ max_{j∈[d]} θj = ∂ max_{p∈△} p⊤θ = ∂ max_{p∈△} φ(p, θ) = conv{∇θ φ(p⋆, θ)} = conv{p⋆}
(plots: for θ = [t, 0], max_j θj as a function of t, and the set {g1 | g ∈ ∂ max_j θj})
Dynamically inferring the computation graph
So far: a structured hidden layer E_H[a_H].
The network must handle "soft" combinations of structures. Fine for attention, but can be limiting.
Dependency TreeLSTM
The bears eat the pretty ones
(Tai et al., 2015)
Latent Dependency TreeLSTM
input x → p(y|x) = ∑_{h∈H} p(y | h, x) p(h | x) → output y
with a TreeLSTM over each candidate tree h ∈ H for "The bears eat the pretty ones"
(Niculae, Martins, and Cardie, 2018)
Structured Latent Variable Models
p(y | x) = ∑_{h∈H} p_φ(y | h, x) p_π(h | x)
p_φ(y | h, x): e.g., a TreeLSTM defined by h
p_π(h | x): parsing model, using some score_π(h; x)
The sum over all possible trees (and likewise ∑_{h∈H} ∂p(y | x)/∂π) is exponentially large!

How to define p_π?
idea 1: p_π(h | x) = 1 if h = h⋆ else 0   (argmax)
idea 2: p_π(h | x) ∝ exp(score_π(h; x))   (softmax)
idea 3: SparseMAP
SparseMAP
With SparseMAP, p_π is sparse: e.g., the posterior over trees is .7 · (tree 1) + .3 · (tree 2) + 0 · (all others) + ...,
so p(y | x) = .7 p_φ(y | tree 1) + .3 p_φ(y | tree 2): the exponential sum collapses to a few terms.
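A toy sketch of that collapse (hypothetical numbers; p_phi stands in for the downstream TreeLSTM classifier):

posterior = {"tree_1": 0.7, "tree_2": 0.3}        # SparseMAP: every other tree gets exactly 0

def p_phi(y, h):
    # stand-in downstream classifier; in the model this is a TreeLSTM run on tree h
    table = {"tree_1": [0.6, 0.3, 0.1], "tree_2": [0.2, 0.5, 0.3]}
    return table[h][y]

p_y = sum(w * p_phi(y=0, h=h) for h, w in posterior.items())
print(p_y)   # 0.7*0.6 + 0.3*0.2 = 0.48: only the support is ever visited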
(bar charts; exact values are in the figures)
Sentiment classification (SST), accuracy (binary; axis 80%–85%): LTR, Flat, CoreNLP, Latent
Natural Language Inference (SNLI), accuracy (3-class; axis 80.6%–82%): LTR, Flat, CoreNLP, Latent
Reverse dictionary lookup (definitions and concepts), accuracy@10 (axis 30%–38%): LTR, Flat, Latent

Baselines, illustrated on "⋆ The bears eat the pretty ones":
Left-to-right: regular LSTM;  Flat: bag-of-words-like;  CoreNLP: off-line parser

Sentence pair classification (P, H):
p(y | P, H) = ∑_{hP∈H(P)} ∑_{hH∈H(H)} p_φ(y | hP, hH) p_π(hP | P) p_π(hH | H)

Reverse dictionary lookup: given a word description, predict the word embedding (Hill et al., 2016);
instead of p(y | x), we model E_{pπ}[g(x)] = ∑_{h∈H} g(x; h) p_π(h | x)
Syntax vs. Composition Order
p = 22.6%: ⋆ lovely and poignant .
CoreNLP parse, p = 21.4%: ⋆ lovely and poignant .
· · ·
p = 15.33%: ⋆ a deep and meaningful film .
p = 15.27%: ⋆ a deep and meaningful film .
· · ·
CoreNLP parse, p = 0%: ⋆ a deep and meaningful film .
(figures show the corresponding dependency trees; p is the posterior probability under the latent model)
Structured Output Prediction
SparseMAP loss:  L_A(η, μ̄) = max_{μ∈M} (η⊤μ − 1/2 ∥μ∥²) − η⊤μ̄ + 1/2 ∥μ̄∥²
cost-SparseMAP:  L^ρ_A(η, μ̄) = max_{μ∈M} (η⊤μ − 1/2 ∥μ∥² + ρ(μ, μ̄)) − η⊤μ̄ + 1/2 ∥μ̄∥²
Instance of a structured Fenchel-Young loss, like the CRF and structured SVM losses (Blondel, Martins, and Niculae, 2019b)
References I
Amos, Brandon and J. Zico Kolter (2017). "OptNet: Differentiable optimization as a layer in neural networks". In: Proc. of ICML.
Andre-Obrecht, Regine (1988). "A new statistical approach for the automatic segmentation of continuous speech signals". In: IEEE Transactions on Acoustics, Speech, and Signal Processing 36.1, pp. 29–40.
Bertsekas, Dimitri P (1999). Nonlinear Programming. Athena Scientific Belmont.
Blondel, Mathieu, André FT Martins, and Vlad Niculae (2019a). "Learning classifiers with Fenchel-Young losses: Generalized entropies, margins, and algorithms". In: Proc. of AISTATS.
— (2019b). "Learning with Fenchel-Young Losses". In: preprint arXiv:1901.02324.
Brucker, Peter (1984). "An O(n) algorithm for quadratic knapsack problems". In: Operations Research Letters 3.3, pp. 163–166.
Condat, Laurent (2016). "Fast projection onto the simplex and the ℓ1 ball". In: Mathematical Programming 158.1-2, pp. 575–585.
Correia, Gonçalo M., Vlad Niculae, and André FT Martins (2019). "Adaptively Sparse Transformers". In: Proc. of EMNLP.
Corro, Caio and Ivan Titov (2019). "Learning latent trees with stochastic perturbations and differentiable dynamic programming". In: Proc. of ACL.
Danskin, John M (1966). "The theory of max-min, with applications". In: SIAM Journal on Applied Mathematics 14.4, pp. 641–664.

References II
Dantzig, George B, Alex Orden, and Philip Wolfe (1955). "The generalized simplex method for minimizing a linear form under linear inequality restraints". In: Pacific Journal of Mathematics 5.2, pp. 183–195.
Frank, Marguerite and Philip Wolfe (1956). "An algorithm for quadratic programming". In: Nav. Res. Log. 3.1-2, pp. 95–110.
Gould, Stephen et al. (2016). "On differentiating parameterized argmin and argmax problems with application to bi-level optimization". In: preprint arXiv:1607.05447.
Grünwald, Peter D and A Philip Dawid (2004). "Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory". In: Annals of Statistics, pp. 1367–1433.
Held, Michael, Philip Wolfe, and Harlan P Crowder (1974). "Validation of subgradient optimization". In: Mathematical Programming 6.1, pp. 62–88.
Hill, Felix et al. (2016). "Learning to understand phrases by embedding the dictionary". In: TACL 4.1, pp. 17–30.
Kim, Yoon et al. (2017). "Structured attention networks". In: Proc. of ICLR.
Kipf, Thomas, Elise van der Pol, and Max Welling (2020). "Contrastive Learning of Structured World Models". In: Proc. of ICLR.
Kipf, Thomas and Max Welling (2017). "Semi-supervised classification with graph convolutional networks". In: Proc. of ICLR.

References III
Koo, Terry et al. (2007). "Structured prediction models via the matrix-tree theorem". In: Proc. of EMNLP.
Kuhn, Harold W (1955). "The Hungarian method for the assignment problem". In: Nav. Res. Log. 2.1-2, pp. 83–97.
Lacoste-Julien, Simon and Martin Jaggi (2015). "On the global linear convergence of Frank-Wolfe optimization variants". In: Proc. of NeurIPS.
Liu, Yang and Mirella Lapata (2018). "Learning structured text representations". In: TACL 6, pp. 63–75.
Long, Jonathan, Evan Shelhamer, and Trevor Darrell (2015). "Fully convolutional networks for semantic segmentation". In: Proc. of CVPR.
Martins, André FT and Ramón Fernandez Astudillo (2016). "From softmax to sparsemax: A sparse model of attention and multi-label classification". In: Proc. of ICML.
Martins, André FT, Mário AT Figueiredo, et al. (2015). "AD3: Alternating directions dual decomposition for MAP inference in graphical models". In: JMLR 16.1, pp. 495–545.
McDonald, Ryan T and Giorgio Satta (2007). "On the complexity of non-projective data-driven dependency parsing". In: Proc. of ICPT.
Nangia, Nikita and Samuel Bowman (2018). "ListOps: A diagnostic dataset for latent tree learning". In: Proc. of NAACL SRW.

References IV
Niculae, Vlad and Mathieu Blondel (2017). "A regularized framework for sparse and structured neural attention". In: Proc. of NeurIPS.
Niculae, Vlad and André FT Martins (2020). "LP-SparseMAP: Differentiable relaxed optimization for sparse structured prediction". In: preprint arXiv:2001.04437.
Niculae, Vlad, André FT Martins, Mathieu Blondel, et al. (2018). "SparseMAP: Differentiable sparse structured inference". In: Proc. of ICML.
Niculae, Vlad, André FT Martins, and Claire Cardie (2018). "Towards dynamic computation graphs via sparse latent structure". In: Proc. of EMNLP.
Nocedal, Jorge and Stephen Wright (1999). Numerical Optimization. Springer New York.
Parikh, Ankur et al. (2016). "A decomposable attention model for natural language inference". In: Proc. of EMNLP.
Peters, Ben, Vlad Niculae, and André FT Martins (2019). "Sparse sequence-to-sequence models". In: Proc. of ACL.
Rabiner, Lawrence R. (1989). "A tutorial on Hidden Markov Models and selected applications in speech recognition". In: P. IEEE 77.2, pp. 257–286.
Smith, David A and Noah A Smith (2007). "Probabilistic models of nonprojective dependency trees". In: Proc. of EMNLP.

References V
Tai, Kai Sheng, Richard Socher, and Christopher D Manning (2015). "Improved semantic representations from tree-structured Long Short-Term Memory networks". In: Proc. of ACL-IJCNLP.
Taskar, Ben (2004). "Learning structured prediction models: A large margin approach". PhD thesis. Stanford University.
Tibshirani, Robert et al. (2005). "Sparsity and smoothness via the fused lasso". In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.1, pp. 91–108.
Tsallis, Constantino (1988). "Possible generalization of Boltzmann-Gibbs statistics". In: Journal of Statistical Physics 52, pp. 479–487.
Valiant, Leslie G (1979). "The complexity of computing the permanent". In: Theor. Comput. Sci. 8.2, pp. 189–201.
Wainwright, Martin J and Michael I Jordan (2008). Graphical models, exponential families, and variational inference. Vol. 1. 1–2. Now Publishers, Inc., pp. 1–305.
Williams, Adina, Nikita Nangia, and Samuel R Bowman (2017). "A broad-coverage challenge corpus for sentence understanding through inference". In: preprint arXiv:1704.05426.
Wolfe, Philip (1976). "Finding the nearest point in a polytope". In: Mathematical Programming 11.1, pp. 128–149.