Page 1: Compositional Distributional Models of Meaning

Compositional Distributional Models of Meaning

Dimitri Kartsaklis

Department of Computer Science, University of Oxford

Hilary Term, 2015


Page 2: Compositional Distributional Models of Meaning

In a nutshell

Compositional distributional models of meaning (CDMs) aim to unify two orthogonal semantic paradigms:

The type-logical compositional approach of formal semantics
The quantitative perspective of vector space models of meaning

The goal is to represent sentences as points in some high-dimensional metric space

Useful in many NLP tasks: sentence similarity, paraphrase detection, sentiment analysis, machine translation, etc.

We review three generic classes of CDMs: vector mixtures, tensor-based models and deep-learning models


Page 3: Compositional Distributional Models of Meaning

Outline

1 Introduction

2 Vector mixture models

3 Tensor-based models

4 Deep learning models

5 Afterword


Page 4: Compositional Distributional Models of Meaning

Compositional semantics

The principle of compositionality

The meaning of a complex expression is determined by the meanings of its parts and the rules used for combining them.

A lexicon:

(1) a. every ⊢ Dt : λP.λQ.∀x[P(x) → Q(x)]
    b. man ⊢ N : λy.man(y)
    c. walks ⊢ V_IN : λz.walk(z)

A parse tree, so syntax guides the semantic composition:

[S [NP [Dt Every] [N man]] [V_IN walks]]

NP → Dt N : [[Dt]]([[N]])
S → NP V_IN : [[NP]]([[V_IN]])


Page 5: Compositional Distributional Models of Meaning

Syntax-to-semantics correspondence

Logical forms of compounds are computed via β-reduction:

[S : ∀x[man(x) → walk(x)]
  [NP : λQ.∀x[man(x) → Q(x)]
    [Dt Every : λP.λQ.∀x[P(x) → Q(x)]]
    [N man : λy.man(y)]]
  [V_IN walks : λz.walk(z)]]

The semantic value of a sentence can be true or false.

An approach known as Montague Grammar (1970)
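To make the composition above concrete, here is a purely illustrative sketch (not part of the original slides) that mimics Montague-style β-reduction with Python lambdas; the toy domain and the extensions of man and walks are invented for the example.

```python
# Toy Montague-style composition: meanings are Python functions and
# composition is function application (beta-reduction).
domain = ["alice", "bob", "rex"]            # hypothetical universe of discourse
man    = lambda x: x in {"alice", "bob"}    # [[man]]   : e -> t
walks  = lambda x: x in {"alice", "rex"}    # [[walks]] : e -> t

# [[every]] = lambda P. lambda Q. forall x [P(x) -> Q(x)]
every = lambda P: (lambda Q: all((not P(x)) or Q(x) for x in domain))

np_meaning = every(man)          # [[NP]] = [[Dt]]([[N]])
s_meaning  = np_meaning(walks)   # [[S]]  = [[NP]]([[V_IN]])
print(s_meaning)                 # False: bob is a man in this toy model but does not walk
```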


Page 6: Compositional Distributional Models of Meaning

Distributional models of meaning

Distributional hypothesis

The meaning of a word is determined by its context [Harris, 1958].



Page 8: Compositional Distributional Models of Meaning

Words as vectors

A word is a vector of co-occurrence statistics with every other word in the vocabulary:

[Figure: the co-occurrence vector of "cat" over the contexts milk, cute, dog, bank, money (e.g. 12, 8, 5, 0, 1), shown alongside vectors for other words such as dog, account, money and pet.]

Semantic relatedness is usually based on cosine similarity:

sim(v, u) = cos(v, u) = (v · u) / (‖v‖ ‖u‖)    (1)
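A small sketch of equation (1) in Python with NumPy; the two count vectors are toy values over the contexts shown above.

```python
import numpy as np

cat = np.array([12.0, 8.0, 5.0, 0.0, 1.0])   # counts over (milk, cute, dog, bank, money)
dog = np.array([10.0, 7.0, 4.0, 0.0, 2.0])   # hypothetical counts for comparison

def cosine(v, u):
    """Equation (1): cos(v, u) = (v . u) / (||v|| ||u||)."""
    return float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u)))

print(cosine(cat, dog))   # close to 1.0 for distributionally similar words
```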


Page 9: Compositional Distributional Models of Meaning

A real vector space

[Figure: a 2D projection of real word vectors; the points form clusters of animal words (cat, dog, kitten, lion, eagle, ...), finance words (money, bank, stock, broker, ...), computing words (laptop, cpu, motherboard, ...) and sports words (football, team, league, ...).]


Page 10: Compositional Distributional Models of Meaning

The necessity for a unified model

Montague grammar is compositional but not quantitative

Distributional models of meaning are quantitative but not compositional

Note that the distributional hypothesis does not apply to phrases or sentences. There is not enough data:

Even with an infinitely large corpus, what would the context of a sentence be?


Page 11: Compositional Distributional Models of Meaning

The role of compositionality

Compositional distributional models

We can synthetically produce a sentence vector by composing the vectors of the words in that sentence.

s = f(w1, w2, . . . , wn)    (2)

Three generic classes of CDMs:

Vector mixture models [Mitchell and Lapata, 2008]

Tensor-based models [Coecke et al., 2010, Baroni and Zamparelli, 2010]

Deep learning models [Socher et al., 2012, Kalchbrenner et al., 2014]


Page 12: Compositional Distributional Models of Meaning

Outline

1 Introduction

2 Vector mixture models

3 Tensor-based models

4 Deep learning models

5 Afterword


Page 13: Compositional Distributional Models of Meaning

Element-wise vector composition

We can combine two vectors by working element-wise [Mitchell and Lapata, 2008]:

w1w2 = α·w1 + β·w2 = Σ_i (α·c_i^w1 + β·c_i^w2) n_i    (3)

w1w2 = w1 ⊙ w2 = Σ_i c_i^w1 · c_i^w2 n_i    (4)

An element-wise “mixture” of the input elements:

[Figure: schematic contrast of vector mixture composition (element-wise blending of two vectors) with tensor-based composition (a tensor applied to its argument).]
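A minimal sketch of equations (3) and (4) with NumPy; the vectors and mixing weights are toy values, not trained ones.

```python
import numpy as np

w1 = np.array([0.2, 0.5, 0.1])
w2 = np.array([0.4, 0.0, 0.3])
alpha, beta = 0.6, 0.4                    # hypothetical mixing weights

additive       = alpha * w1 + beta * w2   # eq. (3): weighted vector addition
multiplicative = w1 * w2                  # eq. (4): element-wise (Hadamard) product

print(additive, multiplicative)
```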


Page 14: Compositional Distributional Models of Meaning

Vector mixtures: Summary

Distinguishing feature:

All words contribute equally to the final result.

PROS:

Trivial to implement

As computationally light-weight as it gets

Surprisingly effective in practice

CONS:

A bag-of-words approach

Does not distinguish between the type-logical identities of the words

By definition, sentence space = word space


Page 15: Compositional Distributional Models of Meaning

Outline

1 Introduction

2 Vector mixture models

3 Tensor-based models

4 Deep learning models

5 Afterword


Page 16: Compositional Distributional Models of Meaning

Relational words as functions

In a vector mixture model, an adjective is of the same order as the noun it modifies, and both contribute equally to the result.

One step further: Relational words are multi-linear maps (tensors of various orders) that can be applied to one or more arguments (vectors).

[Figure: schematic contrast of vector mixture composition (element-wise blending) with tensor-based composition (function application).]

Formalized in the context of compact closed categories by Coecke, Sadrzadeh and Clark.


Page 17: Compositional Distributional Models of Meaning

A categorical framework for composition

Categorical compositional distributional semantics

Coecke, Sadrzadeh and Clark (2010): Syntax and vector space semantics can be structurally homomorphic.

Take CF, the free compact closed category over a pregroup grammar, as the structure accommodating the syntax

Take FVect, the category of finite-dimensional vector spaces and linear maps, as the semantic counterpart of CF.

CF and FVect share a compact closed structure, so a strongly monoidal functor Q can be defined such that:

Q : CF → FVect    (5)

The meaning of a sentence w1w2 . . . wn with type reduction α is given as:

Q(α)(w1 ⊗ w2 ⊗ . . . ⊗ wn)    (6)


Page 18: Compositional Distributional Models of Meaning

Categorical composition: Example

[Parse tree: [S [NP [Adj happy] [N kids]] [VP [V play] [N games]]]]

Pregroup types: happy : n·n^l,  kids : n,  play : n^r·s·n^l,  games : n

Grammar rules follow the inequalities:

p^l · p ≤ 1 ≤ p · p^l  and  p · p^r ≤ 1 ≤ p^r · p    (7)

Derivation becomes:

(n · n^l) · n · (n^r · s · n^l) · n = n · (n^l · n) · (n^r · s · n^l) · n ≤ n · 1 · (n^r · s · n^l) · n
= n · (n^r · s · n^l) · n = (n · n^r) · s · (n^l · n) ≤ 1 · s · 1 ≤ s
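The reduction above can be mechanised; the following is a toy sketch (an assumption of this write-up, not part of the slides) that cancels adjacent pairs x^l·x and x·x^r until no rule applies.

```python
# Types are (base, adjoint) pairs: adjoint -1 for a left adjoint, +1 for a right
# adjoint, 0 for a plain type.  Cancellation uses x^l x <= 1 and x x^r <= 1.
def reduce_types(types):
    changed = True
    while changed:
        changed = False
        for i in range(len(types) - 1):
            (b1, a1), (b2, a2) = types[i], types[i + 1]
            if b1 == b2 and ((a1 == -1 and a2 == 0) or (a1 == 0 and a2 == 1)):
                del types[i:i + 2]
                changed = True
                break
    return types

# happy kids play games:  (n n^l) n (n^r s n^l) n
sentence = [("n", 0), ("n", -1), ("n", 0), ("n", 1), ("s", 0), ("n", -1), ("n", 0)]
print(reduce_types(sentence))   # [('s', 0)]: the derivation reduces to s
```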


Page 19: Compositional Distributional Models of Meaning

Categorical composition: From syntax to semantics

Each atomic grammar type is mapped to a vector space:

Q(n) = N    Q(s) = S

Due to strong monoidality, complex types are mapped to tensor products of vector spaces:

Q(n · n^l) = Q(n) ⊗ Q(n^l) = N ⊗ N

Q(n^r · s · n^l) = Q(n^r) ⊗ Q(s) ⊗ Q(n^l) = N ⊗ S ⊗ N

Finally, each morphism in CF is mapped to a linear map in FVect.


Page 20: Compositional Distributional Models of Meaning

A multi-linear model

The grammatical type of a word defines the vector space in which the word lives.

A noun lives in a basic vector space N.
Adjectives are linear maps N → N, i.e. elements of N ⊗ N.
Intransitive verbs are maps N → S, i.e. elements of N ⊗ S.
Transitive verbs are bi-linear maps N ⊗ N → S, i.e. elements of N ⊗ S ⊗ N.
And so on.

The composition operation is tensor contraction, based on inner product.

[Diagram: "happy kids play games" with happy in N ⊗ N, kids in N, play in N ⊗ S ⊗ N and games in N; the wires indicate the tensor contractions.]
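A small NumPy sketch of this picture; dimensionalities and tensor entries are random placeholders rather than trained representations.

```python
import numpy as np

N, S = 4, 3                               # toy noun-space and sentence-space dimensions
rng = np.random.default_rng(0)
kids, games = rng.random(N), rng.random(N)
happy = rng.random((N, N))                # adjective: a linear map N -> N
play  = rng.random((N, S, N))             # transitive verb: an element of N (x) S (x) N

happy_kids = happy @ kids                                      # adjective applied to its noun
sentence   = np.einsum("i,isj,j->s", happy_kids, play, games)  # contract subject and object
print(sentence.shape)                     # (3,): a vector in the sentence space S
```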


Page 21: Compositional Distributional Models of Meaning

Frobenius algebras in language

Kartsaklis, Sadrzadeh, Pulman and Coecke (2013): Use the co-multiplication of a Frobenius algebra to model verb tensors.

[Diagram: Frobenius composition of subject s, verb v and object o, with the verb wire copied by the co-multiplication.]

A combination of element-wise and categorical composition:

svo = (s × v) ⊙ o

Useful e.g. in intonation: Who does John like? John likes Mary

(john × likes) ⊙ mary

Sadrzadeh, Clark and Coecke (2013): Relative pronouns are modelled in terms of Frobenius copying and deleting.
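A minimal sketch of the Frobenius-style composition svo = (s × v) ⊙ o, assuming the verb has been reduced to an N × N matrix; all values are toy placeholders.

```python
import numpy as np

N = 4
rng = np.random.default_rng(1)
s, o = rng.random(N), rng.random(N)   # subject and object vectors
v = rng.random((N, N))                # verb as a matrix over N (x) N

svo = (s @ v) * o                     # matrix application, then element-wise multiplication
print(svo.shape)                      # (4,): the sentence vector stays in the noun space
```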


Page 22: Compositional Distributional Models of Meaning

Composition and lexical ambiguity

Kartsaklis and Sadrzadeh (2013): Explicitly handling lexical ambiguity improves the performance of tensor-based models

How can we model ambiguity in categorical composition?

Words as mixed states

Piedeleu, Kartsaklis, Coecke and Sadrzadeh (2015): Ambiguous words are represented as mixed states in CPM(FHilb).

ρ(w_t) = Σ_i p_i |w_t^i⟩⟨w_t^i|    (8)

Von Neumann entropy shows how ambiguity evolves from words to compounds

Disambiguation = purification: Entropy of ‘vessel’ is 0.25, but entropy of ‘vessel that sails’ is 0.01.
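A sketch of equation (8) and of the von Neumann entropy used to track ambiguity; the sense vectors and probabilities are invented for illustration.

```python
import numpy as np

sense_ship      = np.array([1.0, 0.0, 0.0])   # e.g. 'vessel' as ship
sense_container = np.array([0.0, 1.0, 0.0])   # e.g. 'vessel' as container
p = [0.5, 0.5]                                # hypothetical sense probabilities

# Eq. (8): the word as a mixed state (density matrix) over its sense vectors
rho = sum(pi * np.outer(v, v) for pi, v in zip(p, [sense_ship, sense_container]))

def von_neumann_entropy(rho):
    eigvals = np.linalg.eigvalsh(rho)
    eigvals = eigvals[eigvals > 1e-12]        # drop zero eigenvalues
    return float(-np.sum(eigvals * np.log2(eigvals)))

print(von_neumann_entropy(rho))               # 1.0 bit for an evenly mixed two-sense word
```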


Page 23: Compositional Distributional Models of Meaning

Tensor-based models: Summary

Distinguishing feature

Relational words are functions acting on arguments.

PROS:

Highly justified from a linguistic perspective

More powerful and robust than vector mixtures

A glass box approach that allows theoretical reasoning

CONS:

Every logical and functional word must be assigned an appropriate tensor representation (e.g. how do you represent quantifiers?)

Space complexity problems for functions of higher arity (e.g. a ditransitive verb is a tensor of order 4)

Limited to linearity


Page 24: Compositional Distributional Models of Meaning

Outline

1 Introduction

2 Vector mixture models

3 Tensor-based models

4 Deep learning models

5 Afterword


Page 25: Compositional Distributional Models of Meaning

A simple neural net

A feed-forward neural network with one hidden layer:

h1 = f(w11·x1 + w21·x2 + w31·x3 + w41·x4 + w51·x5 + b1)
h2 = f(w12·x1 + w22·x2 + w32·x3 + w42·x4 + w52·x5 + b2)
h3 = f(w13·x1 + w23·x2 + w33·x3 + w43·x4 + w53·x5 + b3)

or h = f(W^(1)·x + b^(1))

Similarly:

y = f(W^(2)·h + b^(2))

Note that W^(1) ∈ R^(3×5) and W^(2) ∈ R^(2×3)

f is a non-linear function such as tanh or sigmoid (take f = Id and you have a tensor-based model)

Weights are optimized against some objective function

A universal approximator.
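A minimal NumPy sketch of the forward pass described above (a 5-3-2 network with tanh as f); weights are random placeholders and there is no training loop.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 5)), rng.standard_normal(3)   # W(1) in R^{3x5}
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal(2)   # W(2) in R^{2x3}

x = rng.standard_normal(5)          # input vector
h = np.tanh(W1 @ x + b1)            # hidden layer: h = f(W(1) x + b(1))
y = np.tanh(W2 @ h + b2)            # output layer: y = f(W(2) h + b(2))
print(h.shape, y.shape)             # (3,) (2,)
```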


Page 26: Compositional Distributional Models of Meaning

Recursive neural networks

Pollack (1990), Socher et al. (2012): Recursive neural nets for composition in natural language.

f is an element-wise non-linear function such as tanh or sigmoid.

A supervised approach.
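A toy sketch of recursive composition over a fixed binary parse, p = f(W[c1; c2] + b); the tree, dimensions and weights are illustrative assumptions.

```python
import numpy as np

d = 4
rng = np.random.default_rng(1)
W, b = rng.standard_normal((d, 2 * d)), rng.standard_normal(d)

def compose(c1, c2):
    """Parent vector p = f(W [c1; c2] + b) with f = tanh."""
    return np.tanh(W @ np.concatenate([c1, c2]) + b)

happy, kids, play, games = (rng.standard_normal(d) for _ in range(4))
sentence = compose(compose(happy, kids), compose(play, games))   # ((happy kids) (play games))
print(sentence.shape)   # (4,): same dimensionality as the word vectors
```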


Page 27: Compositional Distributional Models of Meaning

Unsupervised learning with NNs

How can we train a NN in an unsupervised manner?

Create an auto-encoder: Train the network to reproduce its input via an expansion layer.

Use the output of the hidden layer as a compressed version of the input [Socher et al., 2011]
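A sketch of a single autoencoding step in this spirit: encode two children into a parent, decode back through an expansion layer, and use the reconstruction error as the unsupervised signal. Weights and inputs are toy values.

```python
import numpy as np

d = 4
rng = np.random.default_rng(2)
W_enc, b_enc = rng.standard_normal((d, 2 * d)), rng.standard_normal(d)
W_dec, b_dec = rng.standard_normal((2 * d, d)), rng.standard_normal(2 * d)

c1, c2 = rng.standard_normal(d), rng.standard_normal(d)
parent = np.tanh(W_enc @ np.concatenate([c1, c2]) + b_enc)    # compressed representation
recon  = np.tanh(W_dec @ parent + b_dec)                      # expansion layer
loss   = np.sum((recon - np.concatenate([c1, c2])) ** 2)      # reconstruction error to minimise
print(parent.shape, loss)
```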


Page 28: Compositional Distributional Models of Meaning

Convolutional NNs

Originated in pattern recognition [Fukushima, 1980]

Small filters are applied at every position of the input vector:

Capable of extracting fine-grained local features independently of their exact position in the input

Features become increasingly global as more layers are stacked

Each convolutional layer is usually followed by a pooling layer

Top layer is fully connected, usually a soft-max classifier

Application to language: Collobert and Weston (2008)
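A minimal sketch of one convolutional filter sliding over a sentence matrix, followed by max pooling; filter values and word vectors are random placeholders.

```python
import numpy as np

d, n, width = 4, 6, 3                      # embedding dim, sentence length, filter width
rng = np.random.default_rng(3)
sentence = rng.standard_normal((d, n))     # one column per word
filt = rng.standard_normal((d, width))     # a single convolutional filter

feature_map = np.array([np.sum(filt * sentence[:, i:i + width])
                        for i in range(n - width + 1)])
pooled = feature_map.max()                 # max pooling over positions
print(feature_map.shape, pooled)           # (4,) positions, one pooled feature
```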


Page 29: Compositional Distributional Models of Meaning

DCNNs for modelling sentences

Kalchbrenner, Grefenstette and Blunsom (2014): A deep architecture using dynamic k-max pooling

Syntactic structure is induced automatically:

(Figures reused with permission)


Page 30: Compositional Distributional Models of Meaning

Beyond sentence level

Denil, Demiraj, Kalchbrenner, Blunsom, and De Freitas (2014): An additional convolutional layer can provide document vectors:

(Figure reused with permission)


Page 31: Compositional Distributional Models of Meaning

Deep learning models: Summary

Distinguishing feature

Drastic transformation of the sentence space.

PROS:

Non-linearity and layered approach allow the simulation of a very wide range of functions

Word vectors are parameters of the model, optimized during training

State-of-the-art results in a number of NLP tasks

CONS:

Requires expensive training

Difficult to discover the right configuration

A black-box approach: not easy to correlate inner workings with output


Page 32: Compositional Distributional Models of Meaning

Outline

1 Introduction

2 Vector mixture models

3 Tensor-based models

4 Deep learning models

5 Afterword


Page 33: Compositional Distributional Models of Meaning

A hierarchy of CDMs

Putting everything together:

Note that “power” is only theoretical; actual performance depends on task, underlying assumptions, and configuration


Page 34: Compositional Distributional Models of Meaning

Open issues-Future work

No convincing solution for logical connectives, negation, quantifiers and so on.

Functional words, such as prepositions and relative pronouns, are also a problem.

Sentence space is usually identified with word space. This is convenient, but obviously a simplification

Solutions depend on the specific CDM class: there is not much one can do in a vector mixture setting, for example

Important: How can we make NNs more linguistically aware? [Hermann and Blunsom, 2013]


Page 35: Compositional Distributional Models of Meaning

Conclusion

CDMs provide quantitative semantic representations for sentences (or even documents)

Element-wise operations on word vectors constitute an easy and reasonably effective way to get sentence vectors

Categorical compositional distributional models allow reasoning on a theoretical level: a glass-box approach

Deep learning models are extremely powerful and effective; still a black-box approach, and it is not easy to explain why a specific configuration works and another does not.

Convolutional networks seem to constitute a very promising solution for sentential/discourse semantics


Page 36: Compositional Distributional Models of Meaning

References I

Baroni, M. and Zamparelli, R. (2010).

Nouns are Vectors, Adjectives are Matrices. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Coecke, B., Sadrzadeh, M., and Clark, S. (2010).

Mathematical Foundations for a Compositional Distributional Model of Meaning. Lambek Festschrift. Linguistic Analysis, 36:345–384.

Collobert, R. and Weston, J. (2008).

A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM.

Denil, M., Demiraj, A., Kalchbrenner, N., Blunsom, P., and de Freitas, N. (2014).

Modelling, visualising and summarising documents with a single convolutional neural network. arXiv preprint arXiv:1406.3830.

Kalchbrenner, N., Grefenstette, E., and Blunsom, P. (2014).

A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 655–665, Baltimore, Maryland. Association for Computational Linguistics.

Kartsaklis, D. and Sadrzadeh, M. (2013).

Prior disambiguation of word tensors for constructing sentence vectors. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1590–1601, Seattle, Washington, USA. Association for Computational Linguistics.

Kartsaklis, D., Sadrzadeh, M., Pulman, S., and Coecke, B. (2014).

Reasoning about meaning in natural language with compact closed categories and Frobenius algebras. arXiv preprint arXiv:1401.5980.


Page 37: Compositional Distributional Models of Meaning

References II

Lambek, J. (2008).

From Word to Sentence. Polimetrica, Milan.

Mitchell, J. and Lapata, M. (2008).

Vector-based Models of Semantic Composition. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, pages 236–244.

Montague, R. (1970a).

English as a formal language. In Linguaggi nella Societa e nella Tecnica, pages 189–224. Edizioni di Comunita, Milan.

Montague, R. (1970b).

Universal grammar. Theoria, 36:373–398.

Piedeleu, R., Kartsaklis, D., Coecke, B., and Sadrzadeh, M. (2015).

Open System Categorical Quantum Semantics in Natural Language Processing. arXiv preprint arXiv:1502.00831.

Pollack, J. B. (1990).

Recursive distributed representations. Artificial Intelligence, 46(1):77–105.

Sadrzadeh, M., Clark, S., and Coecke, B. (2013).

The Frobenius anatomy of word meanings I: subject and object relative pronouns. Journal of Logic and Computation, Advance Access.


Page 38: Compositional Distributional Models of Meaning

References III

Socher, R., Huang, E., Pennington, J., Ng, A., and Manning, C. (2011).

Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. Advances in Neural Information Processing Systems, 24.

Socher, R., Huval, B., Manning, C., and Ng, A. (2012).

Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing.


