Learning with Sparse Latent Structure
Vlad Niculae
Instituto de Telecomunicações
Work with: André Martins, Claire Cardie, Mathieu Blondel
github.com/deep-spin/lp-sparsemap @vnfrombucharest https://vene.ro
Rich Underlying Structure
title
author
date
body
segmentation: sentences, words, and so on
entities
relationships, e.g., dependency
Most of this structure is hidden.
Rich Underlying Structure
A widely occurring pattern!
speech (Andre-Obrecht, 1988)
objects (Long et al., 2015)
transition graphs (Kipf, Pol, et al., 2020)
But we'll focus on NLP.
Structured Prediction
For "dog on wheels", many structured analyses are possible:
POS tag sequences: VERB PREP NOUN, NOUN PREP NOUN, NOUN DET NOUN, · · ·
dependency trees over "⋆ dog on wheels", · · ·
alignments between "dog on wheels" and "hond op wielen", · · ·
Traditional Pipeline Approach
input → pretrained parser → output (positive / neutral / negative)
Deep Learning & Hidden Representations
input → dense vector → output (positive / neutral / negative)
Latent Structure Models
input → latent structures (trees, alignments, · · ·) → output (positive / neutral / negative)
*record scratch*
*freeze frame*
How to select an item from a set?
How to select an item from a set?
candidates c1, c2, · · ·, cN
scores θ = f(x; w), e.g. θ = [2, 4, −1, 1, −3]
selection p, e.g. p = [0, 1, 0, 0, 0]
input x → θ → p → output y = g(p, x; w)
∂y/∂w = ?   or, essentially, ∂p/∂θ = ?
Argmax
θ → p: put all probability on the highest-scoring candidate among c1, · · ·, cN
∂p/∂θ = ?
Argmax
∂p/∂θ = 0 almost everywhere
(plot: p1 as a function of θ1 is a step, jumping from 0 to 1 at θ1 = θ2)
Argmax vs. Softmax
softmax: pj = exp(θj)/Z
∂p/∂θ = diag(p) − pp⊤
(plot: under softmax, p1 varies smoothly with θ1 instead of jumping at θ1 = θ2)
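A quick numerical check of the softmax Jacobian above (a NumPy sketch, not part of the original slides):

import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())        # shift for numerical stability
    return e / e.sum()

theta = np.array([2.0, 4.0, -1.0, 1.0, -3.0])
p = softmax(theta)
J = np.diag(p) - np.outer(p, p)            # closed-form Jacobian ∂p/∂θ

# finite-difference check of the first column
eps = 1e-6
col0 = (softmax(theta + eps * np.eye(5)[0]) - p) / eps
assert np.allclose(J[:, 0], col0, atol=1e-4)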
A Softmax Origin Story
△ = {p ∈ R^N : p ≥ 0, 1⊤p = 1}   (the probability simplex)
N = 2: a segment with vertices p = [1, 0] and p = [0, 1], midpoint p = [1/2, 1/2]
N = 3: a triangle with vertices including p = [0, 1, 0] and p = [0, 0, 1], center p = [1/3, 1/3, 1/3]
A Softmax Origin Story
max_j θj = max_{p∈△} p⊤θ   (Fundamental Thm. of Linear Programming; Dantzig et al., 1955)
N = 2: θ = [.2, 1.4] → p⋆ = [0, 1]
N = 3: θ = [.7, .1, 1.5] → p⋆ = [0, 0, 1]
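A small sketch (assuming SciPy; not from the slides) confirming that this linear program over the simplex is solved at a vertex, i.e. at the one-hot indicator of the argmax:

import numpy as np
from scipy.optimize import linprog

theta = np.array([0.7, 0.1, 1.5])
# maximize p⊤θ over the simplex  <=>  minimize −θ⊤p  s.t.  1⊤p = 1, p ≥ 0
res = linprog(c=-theta, A_eq=np.ones((1, 3)), b_eq=[1.0], bounds=[(0, None)] * 3)
print(res.x)   # ≈ [0, 0, 1]: the vertex picking out argmax_j θj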
Smoothed Max Operators
π_Ω(θ) = argmax_{p∈△} p⊤θ − Ω(p)
argmax: Ω(p) = 0 (no smoothing)
softmax: Ω(p) = ∑_j pj log pj
sparsemax: Ω(p) = 1/2 ∥p∥²₂   (Martins and Astudillo, 2016)
α-entmax: Ω(p) = 1/(α(α−1)) ∑_j pj^α
(Tsallis, 1988; a generalized entropy, Grünwald and Dawid, 2004)
(Blondel, Martins, and Niculae 2019a; Peters, Niculae, and Martins 2019; Correia, Niculae, and Martins 2019)
(plot: p1 as a function of θ1 for each operator; example outputs [0, 0, 1], [.3, .2, .5], [.3, 0, .7])
(Niculae and Blondel, 2017)
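As a minimal sketch (not the authors' implementation), sparsemax can be computed with a sort-based projection onto the simplex; on the same scores, softmax stays dense while sparsemax produces exact zeros:

import numpy as np

def sparsemax(theta):
    z = np.sort(theta)[::-1]                   # scores in decreasing order
    cssv = np.cumsum(z) - 1.0
    k = np.arange(1, len(theta) + 1)
    support = z - cssv / k > 0
    tau = cssv[support][-1] / k[support][-1]   # threshold τ
    return np.maximum(theta - tau, 0.0)        # p⋆ = [θ − τ1]+

theta = np.array([1.6, 0.8, -0.2, 1.4])
print(np.round(np.exp(theta) / np.exp(theta).sum(), 2))   # softmax: all entries > 0
print(sparsemax(theta))                                    # [0.6, 0.0, 0.0, 0.4]: exact zeros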
softmax
sparsemax
fusedmax ?!
Smoothed Max Operators
fusedmax: Ω(p) = 1/2 ∥p∥²₂ + ∑_j |pj − pj−1|
(Niculae and Blondel, 2017)
Structured Prediction, finally
Structured Prediction is essentially a (very high-dimensional) argmax
The candidates c1, c2, · · ·, cN are now entire structures: input x → θ → p → output y
There are exponentially many structures (θ cannot fit in memory!)
Factorization Into Parts: θ = A⊤η

Example: dependency parses of "⋆ dog on wheels". Rows of A index parts (arcs), columns index structures:

A =  ⋆→dog         1 0 0 ···
     on→dog        0 1 1 ···
     wheels→dog    0 0 0 ···
     ⋆→on          0 1 1 ···
     dog→on        1 0 0 ···
     wheels→on     0 0 0 ···
     ⋆→wheels      0 0 0 ···
     dog→wheels    0 1 0 ···
     on→wheels     1 0 1 ···

η = [.1, .2, −.1, .3, .8, .1, −.3, .2, −.1]   (one score per arc)

A TREE over "dog on wheels" is encoded by its indicator column, e.g. a_y = [010 100 001].
The same factorization covers alignments, e.g. matching "dog on wheels" with "hond op wielen":

A =  dog—hond       1 0 0 ···
     dog—op         0 1 1 ···
     dog—wielen     0 0 0 ···
     on—hond        0 0 0 ···
     on—op          1 0 0 ···
     on—wielen      0 1 1 ···
     wheels—hond    0 1 0 ···
     wheels—op      0 0 0 ···
     wheels—wielen  1 0 1 ···

η = [.1, .2, −.1, .3, .8, .1, −.3, .2, −.1]
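A toy sketch (not from the slides) of the factorization θ = A⊤η on the dependency example, showing that a structure's score is the sum of its arcs' scores:

import numpy as np

eta = np.array([.1, .2, -.1, .3, .8, .1, -.3, .2, -.1])   # one score per arc
A = np.array([[1, 0, 0],    # ⋆→dog
              [0, 1, 1],    # on→dog
              [0, 0, 0],    # wheels→dog
              [0, 1, 1],    # ⋆→on
              [1, 0, 0],    # dog→on
              [0, 0, 0],    # wheels→on
              [0, 0, 0],    # ⋆→wheels
              [0, 1, 0],    # dog→wheels
              [1, 0, 1]])   # on→wheels

theta = A.T @ eta           # one score per structure (never materialized in full in practice)
print(theta)                # [0.8, 0.7, 0.4]
a_y = A[:, 2]               # the tree {on→dog, ⋆→on, on→wheels}, i.e. a_y = [010 100 001]
print(a_y @ eta)            # 0.4, same as theta[2]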
argmax      argmax_{p∈△} p⊤θ
softmax     argmax_{p∈△} p⊤θ + H(p)
sparsemax   argmax_{p∈△} p⊤θ − 1/2 ∥p∥²

MAP         argmax_{μ∈M} μ⊤η
            e.g. dependency parsing → Chu-Liu/Edmonds; matching → Kuhn-Munkres
marginals   argmax_{μ∈M} μ⊤η + H̃(μ)
            e.g. sequence labeling → forward-backward (Rabiner, 1989); as attention: (Kim et al., 2017)
            e.g. dependency parsing → the Matrix-Tree theorem (Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Satta, 2007); as attention: (Liu and Lapata, 2018)
            e.g. matchings → #P-complete! (Taskar, 2004; Valiant, 1979)
SparseMAP   argmax_{μ∈M} μ⊤η − 1/2 ∥μ∥²   (Niculae, Martins, Blondel, and Cardie, 2018)

M := conv{a_h : h ∈ H} = {Ap : p ∈ △} = {E_{H∼p}[a_H] : p ∈ △}   (the marginal polytope)
Algorithms for SparseMAP
μ⋆ = argmax_{μ∈M} μ⊤η − 1/2 ∥μ∥²   (quadratic objective, linear constraints; alas, exponentially many!)

Forward pass: Conditional Gradient (Frank and Wolfe, 1956; Lacoste-Julien and Jaggi, 2015)
• select a new corner of M:  a_{y⋆} = argmax_{μ∈M} μ⊤(η − μ^(t−1)), i.e. a MAP call with adjusted scores η̃
• update the (sparse) coefficients of p
• Update rules: vanilla, away-step, pairwise
• Quadratic objective: Active Set (Nocedal and Wright, 1999, Ch. 16.4 & 16.5; Wolfe, 1976; Martins, Figueiredo, et al., 2015) achieves finite & linear convergence!

Backward pass
• ∂μ/∂η is sparse
• computing (∂μ/∂η)⊤ dy takes O(dim(μ) nnz(p⋆))

Completely modular: just add MAP.
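A minimal sketch of this idea (vanilla conditional gradient with exact line search, given only a MAP oracle; the paper's solver is the more refined active-set variant):

import numpy as np

def sparsemap_cg(eta, map_oracle, n_iter=100):
    """Maximize μ⊤η − 1/2 ∥μ∥² over M = conv{a_h}, using MAP as the only primitive."""
    mu = map_oracle(eta)                        # start from a corner of M
    for _ in range(n_iter):
        s = map_oracle(eta - mu)                # new corner: MAP with adjusted scores η̃ = η − μ
        d = s - mu
        gamma = np.clip((eta - mu) @ d / max(d @ d, 1e-12), 0.0, 1.0)
        mu = mu + gamma * d                     # convex update keeps μ inside M
    return mu

# Toy oracle: exhaustive MAP over the three structures (columns of A) from the earlier slide.
A = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0], [0, 1, 1], [1, 0, 0],
              [0, 0, 0], [0, 0, 0], [0, 1, 0], [1, 0, 1]], dtype=float)
map_oracle = lambda scores: A[:, np.argmax(A.T @ scores)]
eta = np.array([.1, .2, -.1, .3, .8, .1, -.3, .2, -.1])
print(np.round(sparsemap_cg(eta, map_oracle), 3))   # μ⋆: a sparse combination of a few structures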
SparseMAP Applications
• Sparse alignment attention (more later) (Niculae, Martins, Blondel, and Cardie, 2018)
• Latent TreeLSTM (Niculae, Martins, and Cardie, 2018)
• As loss: supervised dependency parsing (Niculae, Martins, Blondel, and Cardie, 2018; Blondel, Martins, and Niculae, 2019b)
Latent Dependency Trees
ListOps (Nangia and Bowman, 2018): ( max 2 9 ( min 4 7 ) 0 )
Arity tagging with a latent GCN (Corro and Titov, 2019; Kipf and Welling, 2017)
Target arity tags: 4 - - 2 - - - - -  (the max node has arity 4, the min node arity 2)
(plot: ListOps validation F1 vs. training epoch, comparing Gold tree, Latent tree, and Left-to-right)
What if MAP is not available?
Multiple, Overlapping Factors
Maximization in factor graphs: NP-hard, even when each factor is tractable.
Example: the arcs of "⋆ dog on wheels" under a global TREE factor plus a BUDGET factor per head (⋆→, dog→, on→, wheels→).
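For intuition, the MAP oracle of a single BUDGET factor (at most B variables on) is just a thresholded top-B selection; a small illustrative sketch, not taken from the library:

import numpy as np

def budget_map(scores, budget):
    """MAP for a BUDGET factor: turn on at most `budget` variables, the highest-scoring positive ones."""
    y = np.zeros_like(scores)
    for i in np.argsort(-scores)[:budget]:
        if scores[i] > 0:                 # only profitable variables are switched on
            y[i] = 1.0
    return y

print(budget_map(np.array([0.9, -0.2, 0.4, 0.7, 0.1]), budget=2))   # [1. 0. 0. 1. 0.]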
Optimization as Consensus-Seeking
Each factor keeps a local copy of its variables: μa and μb over the blocks μ[1:3], μ[4:6], μ[7:9].
Agreement on the overlap: μa,[4:6] = μb,[4:6] = μ[4:6]

max_{μ,μf} ∑_{f∈F} η_f⊤ μ_f − 1/2 ∥μ∥²   s.t.  Cf μ = μf,  μf ∈ Mf  for f ∈ F

LP relaxation (Wainwright and Jordan, 2008) over the local polytope L := {μ : Cf μ ∈ Mf, f ∈ F} ⊇ M
(Niculae and Martins, 2020)
Algorithms for LP-SparseMAP (Niculae and Martins, 2020)

Forward pass
argmax_{Cfμ=μf} ∑_{f∈F} η_f⊤ μ_f − 1/2 ∥μ∥²  =  argmax_{Cfμ=μf} ∑_{f∈F} (η_f⊤ μ_f − 1/2 ∥Df μf∥²)
• Separable objective, agreement constraints → ADMM in consensus form
• SparseMAP subproblem for each f

Backward pass
• Sparse fixed-point iteration
• Combines the SparseMAP Jacobians of each factor
Differentiable Sparse Structured Prediction (Niculae and Martins, 2020)

Factor graphs as a hidden-layer DSL!
If |F| = 1, recovers SparseMAP.
Modular library. Built-in specialized factors:
• OR, XOR, AND
• OR-with-output
• Budget, Knapsack
• Pairwise
New factors only require MAP.

fg = FactorGraph()
var = [fg.variable() for i != j]  # handwave: one variable per candidate arc (i, j)
fg.add(Tree(var))
for i in range(n):
    fg.add(Budget(var[i, :], budget=5))
μ = fg.lp_sparsemap(η)
Library skeleton (as on the slides):

class Factor:
    def map(self, ηf):  # abstract, private: the factor's MAP oracle
        raise NotImplementedError

    def sparsemap(self, ηf):
        ...  # active set algorithm, uses self.map

    def backward(self, dμf):
        ...  # analytic Jacobian-vector product, uses the active set result

class Budget(Factor):
    def sparsemap(self, ηf):
        ...  # specialized

    def backward(self, dμf):
        ...  # specialized

class Tree(Factor):
    def map(self, η):
        ...  # Chu-Liu/Edmonds algorithm
(plot: ListOps validation F1 vs. epoch, comparing Gold tree, Latent tree, Latent w/ Budget(5), and Left-to-right)
Structured Attention for Alignments
NLI premise: A gentleman overlooking a neighborhood situation.
hypothesis: A police officer watches a situation closely.
input (P, H) → attention over word pairs → output: entails / contradicts / neutral
(Baseline model: decomposable attention (Parikh et al., 2016). Proposed model: global structured alignment.)
Structured Alignment Models
matching: SparseMAP w/ Kuhn-Munkres (Kuhn, 1955)
LP-matching: LP-SparseMAP w/ XOR factors (equivalent; different solver)
LP-sequence: additional score for contiguous alignments (i, j) − (i + 1, j ± 1)
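The MAP oracle for the matching factor is the Kuhn-Munkres assignment, readily available in SciPy; a small sketch (assuming SciPy, not the talk's code):

import numpy as np
from scipy.optimize import linear_sum_assignment

# scores[i, j]: affinity between premise word i and hypothesis word j
scores = np.array([[0.9, 0.1, 0.2],
                   [0.3, 0.8, 0.1],
                   [0.2, 0.4, 0.7]])
rows, cols = linear_sum_assignment(scores, maximize=True)   # Kuhn-Munkres (Hungarian) algorithm
print(list(zip(rows, cols)))   # [(0, 0), (1, 1), (2, 2)]: the highest-scoring matching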
MultiNLI (Williams et al., 2017)
(bar chart: accuracy of softmax, matching, LP-matching, and LP-sequence attention, in the 66%–72% range)
(figures: sparse alignment attention between "a gentleman overlooking a neighborhood situation ." and "a police officer watches a situation closely .")
Conclusions
Differentiable & sparse structured inference
Generic, extensible, efficient algorithms
Interpretable structured attention

Future work
Structure beyond NLP
Weak & semi-supervision
Generative latent structure models

[email protected]  github.com/deep-spin/lp-sparsemap  https://vene.ro  @vnfrombucharest
Extra slides
Acknowledgements
This work was supported by the European Research Council (ERC StG DeepSPIN 758969) and by the Fundação para a Ciência e Tecnologia through contract UID/EEA/50008/2013.
Some icons by Dave Gandy and Freepik via flaticon.com.
Sparsemax
sparsemax(θ) = argmax_{p∈△} p⊤θ − 1/2 ∥p∥²₂ = argmin_{p∈△} ∥p − θ∥²₂

Computation: p⋆ = [θ − τ1]+ ;  θi > θj ⇒ pi ≥ pj ;  O(d) via partial sort
(Held et al., 1974; Brucker, 1984; Condat, 2016)

Backward pass: J_sparsemax = diag(s) − (1/|S|) ss⊤, where S = {j : p⋆j > 0}, sj = ⟦j ∈ S⟧
(Martins and Astudillo, 2016)
argmin differentiation (Gould et al., 2016; Amos and Kolter, 2017)
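A sketch of this backward pass as a Jacobian-vector product (using the formula above; the values shown assume the example θ = [1.6, 0.8, −0.2, 1.4] from earlier):

import numpy as np

def sparsemax_jvp(p_star, dy):
    """Apply J = diag(s) − (1/|S|) ss⊤ to dy without materializing the d×d Jacobian."""
    s = (p_star > 0).astype(float)          # support indicator
    return s * dy - s * (s @ dy) / s.sum()

p_star = np.array([0.6, 0.0, 0.0, 0.4])     # sparsemax([1.6, 0.8, −0.2, 1.4])
dy = np.array([1.0, -2.0, 0.5, 0.0])
print(sparsemax_jvp(p_star, dy))            # [0.5, 0, 0, −0.5]: gradient flows only through the support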
Fusedmax
fusedmax(θ) = argmax_{p∈△} p⊤θ − 1/2 ∥p∥²₂ − ∑_{2≤j≤d} |pj − pj−1|
            = argmin_{p∈△} ∥p − θ∥²₂ + ∑_{2≤j≤d} |pj − pj−1|
prox_fused(θ) = argmin_{p∈R^d} ∥p − θ∥²₂ + ∑_{2≤j≤d} |pj − pj−1|
Proposition: fusedmax(θ) = sparsemax(prox_fused(θ))   (Niculae and Blondel, 2017)
"Fused Lasso", a.k.a. 1-d Total Variation (Tibshirani et al., 2005)
Danskin's Theorem (Danskin, 1966; Prop. B.25 in Bertsekas, 1999)
Let φ : R^d × Z → R, Z ⊂ R^d compact.
∂ max_{z∈Z} φ(x, z) = conv{∇x φ(x, z⋆) | z⋆ ∈ argmax_{z∈Z} φ(x, z)}.

Example: maximum of a vector
∂ max_{j∈[d]} θj = ∂ max_{p∈△} p⊤θ = ∂ max_{p∈△} φ(p, θ) = conv{∇θ φ(p⋆, θ)} = conv{p⋆}
(plots: for θ = [t, 0], max_j θj as a function of t, and the set {g1 | g ∈ ∂ max_j θj})
Dynamically inferring the computation graph
So far: a structured hidden layer E_H[a_H].
The network must handle "soft" combinations of structures. Fine for attention, but can be limiting.
Dependency TreeLSTM
The bears eat the pretty ones
(Tai et al., 2015)
Latent Dependency TreeLSTM
input x → p(y|x) = ∑_{h∈H} p(y | h, x) p(h | x) → output y
with a TreeLSTM over each candidate tree h ∈ H for "The bears eat the pretty ones"
(Niculae, Martins, and Cardie, 2018)
Structured Latent Variable Models
p(y | x) = ∑_{h∈H} p_φ(y | h, x) p_π(h | x)
p_φ(y | h, x): e.g., a TreeLSTM defined by h
p_π(h | x): parsing model, using some score_π(h; x)
The sum over all possible trees (and likewise ∑_{h∈H} ∂p(y | x)/∂π) is exponentially large!

How to define p_π?
idea 1: p_π(h | x) = 1 if h = h⋆ else 0   (argmax)
idea 2: p_π(h | x) ∝ exp(score_π(h; x))   (softmax)
idea 3: SparseMAP
SparseMAP
With SparseMAP, p_π is sparse: e.g., the posterior over trees is .7 · (tree 1) + .3 · (tree 2) + 0 · (all others) + ...,
so p(y | x) = .7 p_φ(y | tree 1) + .3 p_φ(y | tree 2): the exponential sum collapses to a few terms.
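A toy sketch of that collapse (hypothetical numbers; p_phi stands in for the downstream TreeLSTM classifier):

posterior = {"tree_1": 0.7, "tree_2": 0.3}        # SparseMAP: every other tree gets exactly 0

def p_phi(y, h):
    # stand-in downstream classifier; in the model this is a TreeLSTM run on tree h
    table = {"tree_1": [0.6, 0.3, 0.1], "tree_2": [0.2, 0.5, 0.3]}
    return table[h][y]

p_y = sum(w * p_phi(y=0, h=h) for h, w in posterior.items())
print(p_y)   # 0.7*0.6 + 0.3*0.2 = 0.48: only the support is ever visited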
(bar charts; exact values are in the figures)
Sentiment classification (SST), accuracy (binary; axis 80%–85%): LTR, Flat, CoreNLP, Latent
Natural Language Inference (SNLI), accuracy (3-class; axis 80.6%–82%): LTR, Flat, CoreNLP, Latent
Reverse dictionary lookup (definitions and concepts), accuracy@10 (axis 30%–38%): LTR, Flat, Latent

Baselines, illustrated on "⋆ The bears eat the pretty ones":
Left-to-right: regular LSTM;  Flat: bag-of-words-like;  CoreNLP: off-line parser

Sentence pair classification (P, H):
p(y | P, H) = ∑_{hP∈H(P)} ∑_{hH∈H(H)} p_φ(y | hP, hH) p_π(hP | P) p_π(hH | H)

Reverse dictionary lookup: given a word description, predict the word embedding (Hill et al., 2016);
instead of p(y | x), we model E_{pπ}[g(x)] = ∑_{h∈H} g(x; h) p_π(h | x)
Syntax vs. Composition Order
p = 22.6%: ⋆ lovely and poignant .
CoreNLP parse, p = 21.4%: ⋆ lovely and poignant .
· · ·
p = 15.33%: ⋆ a deep and meaningful film .
p = 15.27%: ⋆ a deep and meaningful film .
· · ·
CoreNLP parse, p = 0%: ⋆ a deep and meaningful film .
(figures show the corresponding dependency trees; p is the posterior probability under the latent model)
Structured Output Prediction
SparseMAP loss:  L_A(η, μ̄) = max_{μ∈M} (η⊤μ − 1/2 ∥μ∥²) − η⊤μ̄ + 1/2 ∥μ̄∥²
cost-SparseMAP:  L^ρ_A(η, μ̄) = max_{μ∈M} (η⊤μ − 1/2 ∥μ∥² + ρ(μ, μ̄)) − η⊤μ̄ + 1/2 ∥μ̄∥²
Instance of a structured Fenchel-Young loss, like the CRF and structured SVM losses (Blondel, Martins, and Niculae, 2019b)
References I
Amos, Brandon and J. Zico Kolter (2017). "OptNet: Differentiable optimization as a layer in neural networks". In: Proc. of ICML.
Andre-Obrecht, Regine (1988). "A new statistical approach for the automatic segmentation of continuous speech signals". In: IEEE Transactions on Acoustics, Speech, and Signal Processing 36.1, pp. 29–40.
Bertsekas, Dimitri P (1999). Nonlinear Programming. Athena Scientific Belmont.
Blondel, Mathieu, André FT Martins, and Vlad Niculae (2019a). "Learning classifiers with Fenchel-Young losses: Generalized entropies, margins, and algorithms". In: Proc. of AISTATS.
— (2019b). "Learning with Fenchel-Young Losses". In: preprint arXiv:1901.02324.
Brucker, Peter (1984). "An O(n) algorithm for quadratic knapsack problems". In: Operations Research Letters 3.3, pp. 163–166.
Condat, Laurent (2016). "Fast projection onto the simplex and the ℓ1 ball". In: Mathematical Programming 158.1-2, pp. 575–585.
Correia, Gonçalo M., Vlad Niculae, and André FT Martins (2019). "Adaptively Sparse Transformers". In: Proc. of EMNLP.
Corro, Caio and Ivan Titov (2019). "Learning latent trees with stochastic perturbations and differentiable dynamic programming". In: Proc. of ACL.
Danskin, John M (1966). "The theory of max-min, with applications". In: SIAM Journal on Applied Mathematics 14.4, pp. 641–664.

References II
Dantzig, George B, Alex Orden, and Philip Wolfe (1955). "The generalized simplex method for minimizing a linear form under linear inequality restraints". In: Pacific Journal of Mathematics 5.2, pp. 183–195.
Frank, Marguerite and Philip Wolfe (1956). "An algorithm for quadratic programming". In: Nav. Res. Log. 3.1-2, pp. 95–110.
Gould, Stephen et al. (2016). "On differentiating parameterized argmin and argmax problems with application to bi-level optimization". In: preprint arXiv:1607.05447.
Grünwald, Peter D and A Philip Dawid (2004). "Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory". In: Annals of Statistics, pp. 1367–1433.
Held, Michael, Philip Wolfe, and Harlan P Crowder (1974). "Validation of subgradient optimization". In: Mathematical Programming 6.1, pp. 62–88.
Hill, Felix et al. (2016). "Learning to understand phrases by embedding the dictionary". In: TACL 4.1, pp. 17–30.
Kim, Yoon et al. (2017). "Structured attention networks". In: Proc. of ICLR.
Kipf, Thomas, Elise van der Pol, and Max Welling (2020). "Contrastive Learning of Structured World Models". In: Proc. of ICLR.
Kipf, Thomas and Max Welling (2017). "Semi-supervised classification with graph convolutional networks". In: Proc. of ICLR.

References III
Koo, Terry et al. (2007). "Structured prediction models via the matrix-tree theorem". In: Proc. of EMNLP.
Kuhn, Harold W (1955). "The Hungarian method for the assignment problem". In: Nav. Res. Log. 2.1-2, pp. 83–97.
Lacoste-Julien, Simon and Martin Jaggi (2015). "On the global linear convergence of Frank-Wolfe optimization variants". In: Proc. of NeurIPS.
Liu, Yang and Mirella Lapata (2018). "Learning structured text representations". In: TACL 6, pp. 63–75.
Long, Jonathan, Evan Shelhamer, and Trevor Darrell (2015). "Fully convolutional networks for semantic segmentation". In: Proc. of CVPR.
Martins, André FT and Ramón Fernandez Astudillo (2016). "From softmax to sparsemax: A sparse model of attention and multi-label classification". In: Proc. of ICML.
Martins, André FT, Mário AT Figueiredo, et al. (2015). "AD3: Alternating directions dual decomposition for MAP inference in graphical models". In: JMLR 16.1, pp. 495–545.
McDonald, Ryan T and Giorgio Satta (2007). "On the complexity of non-projective data-driven dependency parsing". In: Proc. of ICPT.
Nangia, Nikita and Samuel Bowman (2018). "ListOps: A diagnostic dataset for latent tree learning". In: Proc. of NAACL SRW.

References IV
Niculae, Vlad and Mathieu Blondel (2017). "A regularized framework for sparse and structured neural attention". In: Proc. of NeurIPS.
Niculae, Vlad and André FT Martins (2020). "LP-SparseMAP: Differentiable relaxed optimization for sparse structured prediction". In: preprint arXiv:2001.04437.
Niculae, Vlad, André FT Martins, Mathieu Blondel, et al. (2018). "SparseMAP: Differentiable sparse structured inference". In: Proc. of ICML.
Niculae, Vlad, André FT Martins, and Claire Cardie (2018). "Towards dynamic computation graphs via sparse latent structure". In: Proc. of EMNLP.
Nocedal, Jorge and Stephen Wright (1999). Numerical Optimization. Springer New York.
Parikh, Ankur et al. (2016). "A decomposable attention model for natural language inference". In: Proc. of EMNLP.
Peters, Ben, Vlad Niculae, and André FT Martins (2019). "Sparse sequence-to-sequence models". In: Proc. of ACL.
Rabiner, Lawrence R. (1989). "A tutorial on Hidden Markov Models and selected applications in speech recognition". In: P. IEEE 77.2, pp. 257–286.
Smith, David A and Noah A Smith (2007). "Probabilistic models of nonprojective dependency trees". In: Proc. of EMNLP.

References V
Tai, Kai Sheng, Richard Socher, and Christopher D Manning (2015). "Improved semantic representations from tree-structured Long Short-Term Memory networks". In: Proc. of ACL-IJCNLP.
Taskar, Ben (2004). "Learning structured prediction models: A large margin approach". PhD thesis. Stanford University.
Tibshirani, Robert et al. (2005). "Sparsity and smoothness via the fused lasso". In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.1, pp. 91–108.
Tsallis, Constantino (1988). "Possible generalization of Boltzmann-Gibbs statistics". In: Journal of Statistical Physics 52, pp. 479–487.
Valiant, Leslie G (1979). "The complexity of computing the permanent". In: Theor. Comput. Sci. 8.2, pp. 189–201.
Wainwright, Martin J and Michael I Jordan (2008). Graphical models, exponential families, and variational inference. Vol. 1. 1–2. Now Publishers, Inc., pp. 1–305.
Williams, Adina, Nikita Nangia, and Samuel R Bowman (2017). "A broad-coverage challenge corpus for sentence understanding through inference". In: preprint arXiv:1704.05426.
Wolfe, Philip (1976). "Finding the nearest point in a polytope". In: Mathematical Programming 11.1, pp. 128–149.