Page 1: Compositional Distributional Models of Meaning

Compositional Distributional Models of Meaning

Dimitri Kartsaklis

Department of Computer Science, University of Oxford

Hilary Term, 2015


Page 2: Compositional Distributional Models of Meaning

In a nutshell

Compositional distributional models of meaning (CDMs) aim to unify two orthogonal semantic paradigms:

The type-logical compositional approach of formal semantics
The quantitative perspective of vector space models of meaning

The goal is to represent sentences as points in some high-dimensional metric space

Useful in many NLP tasks: sentence similarity, paraphrase detection, sentiment analysis, machine translation, etc.

We review three generic classes of CDMs: vector mixtures, tensor-based models and deep-learning models


Page 3: Compositional Distributional Models of Meaning

Outline

1 Introduction

2 Vector mixture models

3 Tensor-based models

4 Deep learning models

5 Afterword


Page 4: Compositional Distributional Models of Meaning

Compositional semantics

The principle of compositionality

The meaning of a complex expression is determined by the meanings of its parts and the rules used for combining them.

A lexicon:

(1) a. every ⊢ Dt : λP.λQ.∀x[P(x) → Q(x)]
    b. man ⊢ N : λy.man(y)
    c. walks ⊢ V_IN : λz.walk(z)

A parse tree, so syntax guides the semantic composition:

[S [NP [Dt Every] [N man]] [V_IN walks]]

NP → Dt N : [[Dt]]([[N]])
S → NP V_IN : [[NP]]([[V_IN]])


Page 5: Compositional Distributional Models of Meaning

Syntax-to-semantics correspondence

Logical forms of compounds are computed via β-reduction:

[S : ∀x[man(x) → walk(x)]
  [NP : λQ.∀x[man(x) → Q(x)]
    [Dt Every : λP.λQ.∀x[P(x) → Q(x)]]
    [N man : λy.man(y)]]
  [V_IN walks : λz.walk(z)]]

The semantic value of a sentence can be true or false.

An approach known as Montague Grammar (1970)
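To make the composition above concrete, here is a purely illustrative sketch (not part of the original slides) that mimics Montague-style β-reduction with Python lambdas; the toy domain and the extensions of man and walks are invented for the example.

```python
# Toy Montague-style composition: meanings are Python functions and
# composition is function application (beta-reduction).
domain = ["alice", "bob", "rex"]            # hypothetical universe of discourse
man    = lambda x: x in {"alice", "bob"}    # [[man]]   : e -> t
walks  = lambda x: x in {"alice", "rex"}    # [[walks]] : e -> t

# [[every]] = lambda P. lambda Q. forall x [P(x) -> Q(x)]
every = lambda P: (lambda Q: all((not P(x)) or Q(x) for x in domain))

np_meaning = every(man)          # [[NP]] = [[Dt]]([[N]])
s_meaning  = np_meaning(walks)   # [[S]]  = [[NP]]([[V_IN]])
print(s_meaning)                 # False: bob is a man in this toy model but does not walk
```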


Page 6: Compositional Distributional Models of Meaning

Distributional models of meaning

Distributional hypothesis

The meaning of a word is determined by its context [Harris, 1958].



Page 8: Compositional Distributional Models of Meaning

Words as vectors

A word is a vector of co-occurrence statistics with every other word in the vocabulary:

[Figure: the co-occurrence vector of "cat" over the contexts milk, cute, dog, bank, money (e.g. 12, 8, 5, 0, 1), shown alongside vectors for other words such as dog, account, money and pet.]

Semantic relatedness is usually based on cosine similarity:

sim(v, u) = cos(v, u) = (v · u) / (‖v‖ ‖u‖)    (1)
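A small sketch of equation (1) in Python with NumPy; the two count vectors are toy values over the contexts shown above.

```python
import numpy as np

cat = np.array([12.0, 8.0, 5.0, 0.0, 1.0])   # counts over (milk, cute, dog, bank, money)
dog = np.array([10.0, 7.0, 4.0, 0.0, 2.0])   # hypothetical counts for comparison

def cosine(v, u):
    """Equation (1): cos(v, u) = (v . u) / (||v|| ||u||)."""
    return float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u)))

print(cosine(cat, dog))   # close to 1.0 for distributionally similar words
```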


Page 9: Compositional Distributional Models of Meaning

A real vector space

[Figure: a 2D projection of real word vectors; the points form clusters of animal words (cat, dog, kitten, lion, eagle, ...), finance words (money, bank, stock, broker, ...), computing words (laptop, cpu, motherboard, ...) and sports words (football, team, league, ...).]


Page 10: Compositional Distributional Models of Meaning

The necessity for a unified model

Montague grammar is compositional but not quantitative

Distributional models of meaning are quantitative but not compositional

Note that the distributional hypothesis does not apply to phrases or sentences. There is not enough data:

Even with an infinitely large corpus, what would the context of a sentence be?


Page 11: Compositional Distributional Models of Meaning

The role of compositionality

Compositional distributional models

We can synthetically produce a sentence vector by composing the vectors of the words in that sentence.

s = f(w1, w2, . . . , wn)    (2)

Three generic classes of CDMs:

Vector mixture models [Mitchell and Lapata, 2008]

Tensor-based models [Coecke et al., 2010, Baroni and Zamparelli, 2010]

Deep learning models [Socher et al., 2012, Kalchbrenner et al., 2014]


Page 12: Compositional Distributional Models of Meaning

Outline

1 Introduction

2 Vector mixture models

3 Tensor-based models

4 Deep learning models

5 Afterword


Page 13: Compositional Distributional Models of Meaning

Element-wise vector composition

We can combine two vectors by working element-wise [Mitchell and Lapata, 2008]:

w1w2 = α·w1 + β·w2 = Σ_i (α·c_i^w1 + β·c_i^w2) n_i    (3)

w1w2 = w1 ⊙ w2 = Σ_i c_i^w1 · c_i^w2 n_i    (4)

An element-wise “mixture” of the input elements:

[Figure: schematic contrast of vector mixture composition (element-wise blending of two vectors) with tensor-based composition (a tensor applied to its argument).]
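A minimal sketch of equations (3) and (4) with NumPy; the vectors and mixing weights are toy values, not trained ones.

```python
import numpy as np

w1 = np.array([0.2, 0.5, 0.1])
w2 = np.array([0.4, 0.0, 0.3])
alpha, beta = 0.6, 0.4                    # hypothetical mixing weights

additive       = alpha * w1 + beta * w2   # eq. (3): weighted vector addition
multiplicative = w1 * w2                  # eq. (4): element-wise (Hadamard) product

print(additive, multiplicative)
```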


Page 14: Compositional Distributional Models of Meaning

Vector mixtures: Summary

Distinguishing feature:

All words contribute equally to the final result.

PROS:

Trivial to implement

As computationally light-weight as it gets

Surprisingly effective in practice

CONS:

A bag-of-words approach

Does not distinguish between the type-logical identities of the words

By definition, sentence space = word space


Page 15: Compositional Distributional Models of Meaning

Outline

1 Introduction

2 Vector mixture models

3 Tensor-based models

4 Deep learning models

5 Afterword


Page 16: Compositional Distributional Models of Meaning

Relational words as functions

In a vector mixture model, an adjective is of the same order as the noun it modifies, and both contribute equally to the result.

One step further: Relational words are multi-linear maps (tensors of various orders) that can be applied to one or more arguments (vectors).

[Figure: schematic contrast of vector mixture composition (element-wise blending) with tensor-based composition (function application).]

Formalized in the context of compact closed categories by Coecke, Sadrzadeh and Clark.


Page 17: Compositional Distributional Models of Meaning

A categorical framework for composition

Categorical compositional distributional semantics

Coecke, Sadrzadeh and Clark (2010): Syntax and vector space semantics can be structurally homomorphic.

Take CF, the free compact closed category over a pregroup grammar, as the structure accommodating the syntax

Take FVect, the category of finite-dimensional vector spaces and linear maps, as the semantic counterpart of CF.

CF and FVect share a compact closed structure, so a strongly monoidal functor Q can be defined such that:

Q : CF → FVect    (5)

The meaning of a sentence w1w2 . . . wn with type reduction α is given as:

Q(α)(w1 ⊗ w2 ⊗ . . . ⊗ wn)    (6)


Page 18: Compositional Distributional Models of Meaning

Categorical composition: Example

[Parse tree: [S [NP [Adj happy] [N kids]] [VP [V play] [N games]]]]

Pregroup types: happy : n·n^l,  kids : n,  play : n^r·s·n^l,  games : n

Grammar rules follow the inequalities:

p^l · p ≤ 1 ≤ p · p^l  and  p · p^r ≤ 1 ≤ p^r · p    (7)

Derivation becomes:

(n · n^l) · n · (n^r · s · n^l) · n = n · (n^l · n) · (n^r · s · n^l) · n ≤ n · 1 · (n^r · s · n^l) · n
= n · (n^r · s · n^l) · n = (n · n^r) · s · (n^l · n) ≤ 1 · s · 1 ≤ s
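The reduction above can be mechanised; the following is a toy sketch (an assumption of this write-up, not part of the slides) that cancels adjacent pairs x^l·x and x·x^r until no rule applies.

```python
# Types are (base, adjoint) pairs: adjoint -1 for a left adjoint, +1 for a right
# adjoint, 0 for a plain type.  Cancellation uses x^l x <= 1 and x x^r <= 1.
def reduce_types(types):
    changed = True
    while changed:
        changed = False
        for i in range(len(types) - 1):
            (b1, a1), (b2, a2) = types[i], types[i + 1]
            if b1 == b2 and ((a1 == -1 and a2 == 0) or (a1 == 0 and a2 == 1)):
                del types[i:i + 2]
                changed = True
                break
    return types

# happy kids play games:  (n n^l) n (n^r s n^l) n
sentence = [("n", 0), ("n", -1), ("n", 0), ("n", 1), ("s", 0), ("n", -1), ("n", 0)]
print(reduce_types(sentence))   # [('s', 0)]: the derivation reduces to s
```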


Page 19: Compositional Distributional Models of Meaning

Categorical composition: From syntax to semantics

Each atomic grammar type is mapped to a vector space:

Q(n) = N    Q(s) = S

Due to strong monoidality, complex types are mapped to tensor products of vector spaces:

Q(n · n^l) = Q(n) ⊗ Q(n^l) = N ⊗ N

Q(n^r · s · n^l) = Q(n^r) ⊗ Q(s) ⊗ Q(n^l) = N ⊗ S ⊗ N

Finally, each morphism in CF is mapped to a linear map in FVect.


Page 20: Compositional Distributional Models of Meaning

A multi-linear model

The grammatical type of a word defines the vector space in which the word lives.

A noun lives in a basic vector space N.
Adjectives are linear maps N → N, i.e. elements of N ⊗ N.
Intransitive verbs are maps N → S, i.e. elements of N ⊗ S.
Transitive verbs are bi-linear maps N ⊗ N → S, i.e. elements of N ⊗ S ⊗ N.
And so on.

The composition operation is tensor contraction, based on inner product.

[Diagram: "happy kids play games" with happy in N ⊗ N, kids in N, play in N ⊗ S ⊗ N and games in N; the wires indicate the tensor contractions.]
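A small NumPy sketch of this picture; dimensionalities and tensor entries are random placeholders rather than trained representations.

```python
import numpy as np

N, S = 4, 3                               # toy noun-space and sentence-space dimensions
rng = np.random.default_rng(0)
kids, games = rng.random(N), rng.random(N)
happy = rng.random((N, N))                # adjective: a linear map N -> N
play  = rng.random((N, S, N))             # transitive verb: an element of N (x) S (x) N

happy_kids = happy @ kids                                      # adjective applied to its noun
sentence   = np.einsum("i,isj,j->s", happy_kids, play, games)  # contract subject and object
print(sentence.shape)                     # (3,): a vector in the sentence space S
```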


Page 21: Compositional Distributional Models of Meaning

Frobenius algebras in language

Kartsaklis, Sadrzadeh, Pulman and Coecke (2013): Use the co-multiplication of a Frobenius algebra to model verb tensors.

[Diagram: Frobenius composition of subject s, verb v and object o, with the verb wire copied by the co-multiplication.]

A combination of element-wise and categorical composition:

svo = (s × v) ⊙ o

Useful e.g. in intonation: Who does John like? John likes Mary

(john × likes) ⊙ mary

Sadrzadeh, Clark and Coecke (2013): Relative pronouns are modelled in terms of Frobenius copying and deleting.
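A minimal sketch of the Frobenius-style composition svo = (s × v) ⊙ o, assuming the verb has been reduced to an N × N matrix; all values are toy placeholders.

```python
import numpy as np

N = 4
rng = np.random.default_rng(1)
s, o = rng.random(N), rng.random(N)   # subject and object vectors
v = rng.random((N, N))                # verb as a matrix over N (x) N

svo = (s @ v) * o                     # matrix application, then element-wise multiplication
print(svo.shape)                      # (4,): the sentence vector stays in the noun space
```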


Page 22: Compositional Distributional Models of Meaning

Composition and lexical ambiguity

Kartsaklis and Sadrzadeh (2013): Explicitly handling lexical ambiguity improves the performance of tensor-based models

How can we model ambiguity in categorical composition?

Words as mixed states

Piedeleu, Kartsaklis, Coecke and Sadrzadeh (2015): Ambiguous words are represented as mixed states in CPM(FHilb).

ρ(w_t) = Σ_i p_i |w_t^i⟩⟨w_t^i|    (8)

Von Neumann entropy shows how ambiguity evolves from words to compounds

Disambiguation = purification: Entropy of ‘vessel’ is 0.25, but entropy of ‘vessel that sails’ is 0.01.
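A sketch of equation (8) and of the von Neumann entropy used to track ambiguity; the sense vectors and probabilities are invented for illustration.

```python
import numpy as np

sense_ship      = np.array([1.0, 0.0, 0.0])   # e.g. 'vessel' as ship
sense_container = np.array([0.0, 1.0, 0.0])   # e.g. 'vessel' as container
p = [0.5, 0.5]                                # hypothetical sense probabilities

# Eq. (8): the word as a mixed state (density matrix) over its sense vectors
rho = sum(pi * np.outer(v, v) for pi, v in zip(p, [sense_ship, sense_container]))

def von_neumann_entropy(rho):
    eigvals = np.linalg.eigvalsh(rho)
    eigvals = eigvals[eigvals > 1e-12]        # drop zero eigenvalues
    return float(-np.sum(eigvals * np.log2(eigvals)))

print(von_neumann_entropy(rho))               # 1.0 bit for an evenly mixed two-sense word
```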


Page 23: Compositional Distributional Models of Meaning

Tensor-based models: Summary

Distinguishing feature

Relational words are functions acting on arguments.

PROS:

Highly justified from a linguistic perspective

More powerful and robust than vector mixtures

A glass box approach that allows theoretical reasoning

CONS:

Every logical and functional word must be assigned an appropriate tensor representation (e.g. how do you represent quantifiers?)

Space complexity problems for functions of higher arity (e.g. a ditransitive verb is a tensor of order 4)

Limited to linearity


Page 24: Compositional Distributional Models of Meaning

Outline

1 Introduction

2 Vector mixture models

3 Tensor-based models

4 Deep learning models

5 Afterword


Page 25: Compositional Distributional Models of Meaning

A simple neural net

A feed-forward neural network with one hidden layer:

h1 = f(w11·x1 + w21·x2 + w31·x3 + w41·x4 + w51·x5 + b1)
h2 = f(w12·x1 + w22·x2 + w32·x3 + w42·x4 + w52·x5 + b2)
h3 = f(w13·x1 + w23·x2 + w33·x3 + w43·x4 + w53·x5 + b3)

or h = f(W^(1)·x + b^(1))

Similarly:

y = f(W^(2)·h + b^(2))

Note that W^(1) ∈ R^(3×5) and W^(2) ∈ R^(2×3)

f is a non-linear function such as tanh or sigmoid (take f = Id and you have a tensor-based model)

Weights are optimized against some objective function

A universal approximator.
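A minimal NumPy sketch of the forward pass described above (a 5-3-2 network with tanh as f); weights are random placeholders and there is no training loop.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 5)), rng.standard_normal(3)   # W(1) in R^{3x5}
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal(2)   # W(2) in R^{2x3}

x = rng.standard_normal(5)          # input vector
h = np.tanh(W1 @ x + b1)            # hidden layer: h = f(W(1) x + b(1))
y = np.tanh(W2 @ h + b2)            # output layer: y = f(W(2) h + b(2))
print(h.shape, y.shape)             # (3,) (2,)
```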


Page 26: Compositional Distributional Models of Meaning

Recursive neural networks

Pollack (1990), Socher et al. (2012): Recursive neural nets for composition in natural language.

f is an element-wise non-linear function such as tanh or sigmoid.

A supervised approach.
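A toy sketch of recursive composition over a fixed binary parse, p = f(W[c1; c2] + b); the tree, dimensions and weights are illustrative assumptions.

```python
import numpy as np

d = 4
rng = np.random.default_rng(1)
W, b = rng.standard_normal((d, 2 * d)), rng.standard_normal(d)

def compose(c1, c2):
    """Parent vector p = f(W [c1; c2] + b) with f = tanh."""
    return np.tanh(W @ np.concatenate([c1, c2]) + b)

happy, kids, play, games = (rng.standard_normal(d) for _ in range(4))
sentence = compose(compose(happy, kids), compose(play, games))   # ((happy kids) (play games))
print(sentence.shape)   # (4,): same dimensionality as the word vectors
```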


Page 27: Compositional Distributional Models of Meaning

Unsupervised learning with NNs

How can we train a NN in an unsupervised manner?

Create an auto-encoder: Train the network to reproduce its input via an expansion layer.

Use the output of the hidden layer as a compressed version of the input [Socher et al., 2011]
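A sketch of a single autoencoding step in this spirit: encode two children into a parent, decode back through an expansion layer, and use the reconstruction error as the unsupervised signal. Weights and inputs are toy values.

```python
import numpy as np

d = 4
rng = np.random.default_rng(2)
W_enc, b_enc = rng.standard_normal((d, 2 * d)), rng.standard_normal(d)
W_dec, b_dec = rng.standard_normal((2 * d, d)), rng.standard_normal(2 * d)

c1, c2 = rng.standard_normal(d), rng.standard_normal(d)
parent = np.tanh(W_enc @ np.concatenate([c1, c2]) + b_enc)    # compressed representation
recon  = np.tanh(W_dec @ parent + b_dec)                      # expansion layer
loss   = np.sum((recon - np.concatenate([c1, c2])) ** 2)      # reconstruction error to minimise
print(parent.shape, loss)
```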


Page 28: Compositional Distributional Models of Meaning

Convolutional NNs

Originated in pattern recognition [Fukushima, 1980]

Small filters are applied at every position of the input vector:

Capable of extracting fine-grained local features independently of their exact position in the input

Features become increasingly global as more layers are stacked

Each convolutional layer is usually followed by a pooling layer

Top layer is fully connected, usually a soft-max classifier

Application to language: Collobert and Weston (2008)
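A minimal sketch of one convolutional filter sliding over a sentence matrix, followed by max pooling; filter values and word vectors are random placeholders.

```python
import numpy as np

d, n, width = 4, 6, 3                      # embedding dim, sentence length, filter width
rng = np.random.default_rng(3)
sentence = rng.standard_normal((d, n))     # one column per word
filt = rng.standard_normal((d, width))     # a single convolutional filter

feature_map = np.array([np.sum(filt * sentence[:, i:i + width])
                        for i in range(n - width + 1)])
pooled = feature_map.max()                 # max pooling over positions
print(feature_map.shape, pooled)           # (4,) positions, one pooled feature
```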


Page 29: Compositional Distributional Models of Meaning

DCNNs for modelling sentences

Kalchbrenner, Grefenstette and Blunsom (2014): A deep architecture using dynamic k-max pooling

Syntactic structure is induced automatically:

(Figures reused with permission)


Page 30: Compositional Distributional Models of Meaning

Beyond sentence level

Denil, Demiraj, Kalchbrenner, Blunsom, and De Freitas (2014): An additional convolutional layer can provide document vectors:

(Figure reused with permission)


Page 31: Compositional Distributional Models of Meaning

Deep learning models: Summary

Distinguishing feature

Drastic transformation of the sentence space.

PROS:

Non-linearity and layered approach allow the simulation of a very wide range of functions

Word vectors are parameters of the model, optimized during training

State-of-the-art results in a number of NLP tasks

CONS:

Requires expensive training

Difficult to discover the right configuration

A black-box approach: not easy to correlate inner workings with output


Page 32: Compositional Distributional Models of Meaning

Outline

1 Introduction

2 Vector mixture models

3 Tensor-based models

4 Deep learning models

5 Afterword


Page 33: Compositional Distributional Models of Meaning

A hierarchy of CDMs

Putting everything together:

Note that “power” is only theoretical; actual performance depends on task, underlying assumptions, and configuration


Page 34: Compositional Distributional Models of Meaning

Open issues-Future work

No convincing solution for logical connectives, negation, quantifiers and so on.

Functional words, such as prepositions and relative pronouns, are also a problem.

Sentence space is usually identified with word space. This is convenient, but obviously a simplification

Solutions depend on the specific CDM class: there is not much one can do in a vector mixture setting, for example

Important: How can we make NNs more linguistically aware? [Hermann and Blunsom, 2013]


Page 35: Compositional Distributional Models of Meaning

Conclusion

CDMs provide quantitative semantic representations for sentences (or even documents)

Element-wise operations on word vectors constitute an easy and reasonably effective way to get sentence vectors

Categorical compositional distributional models allow reasoning on a theoretical level: a glass-box approach

Deep learning models are extremely powerful and effective; still a black-box approach, and it is not easy to explain why a specific configuration works and another does not.

Convolutional networks seem to constitute a very promising solution for sentential/discourse semantics


Page 36: Compositional Distributional Models of Meaning

References I

Baroni, M. and Zamparelli, R. (2010).

Nouns are Vectors, Adjectives are Matrices. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Coecke, B., Sadrzadeh, M., and Clark, S. (2010).

Mathematical Foundations for a Compositional Distributional Model of Meaning. Lambek Festschrift. Linguistic Analysis, 36:345–384.

Collobert, R. and Weston, J. (2008).

A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM.

Denil, M., Demiraj, A., Kalchbrenner, N., Blunsom, P., and de Freitas, N. (2014).

Modelling, visualising and summarising documents with a single convolutional neural network. arXiv preprint arXiv:1406.3830.

Kalchbrenner, N., Grefenstette, E., and Blunsom, P. (2014).

A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 655–665, Baltimore, Maryland. Association for Computational Linguistics.

Kartsaklis, D. and Sadrzadeh, M. (2013).

Prior disambiguation of word tensors for constructing sentence vectors. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1590–1601, Seattle, Washington, USA. Association for Computational Linguistics.

Kartsaklis, D., Sadrzadeh, M., Pulman, S., and Coecke, B. (2014).

Reasoning about meaning in natural language with compact closed categories and Frobenius algebras. arXiv preprint arXiv:1401.5980.


Page 37: Compositional Distributional Models of Meaning

References II

Lambek, J. (2008).

From Word to Sentence. Polimetrica, Milan.

Mitchell, J. and Lapata, M. (2008).

Vector-based Models of Semantic Composition. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, pages 236–244.

Montague, R. (1970a).

English as a formal language. In Linguaggi nella Societa e nella Tecnica, pages 189–224. Edizioni di Comunita, Milan.

Montague, R. (1970b).

Universal grammar. Theoria, 36:373–398.

Piedeleu, R., Kartsaklis, D., Coecke, B., and Sadrzadeh, M. (2015).

Open System Categorical Quantum Semantics in Natural Language Processing. arXiv preprint arXiv:1502.00831.

Pollack, J. B. (1990).

Recursive distributed representations. Artificial Intelligence, 46(1):77–105.

Sadrzadeh, M., Clark, S., and Coecke, B. (2013).

The Frobenius anatomy of word meanings I: subject and object relative pronouns. Journal of Logic and Computation, Advance Access.


Page 38: Compositional Distributional Models of Meaning

References III

Socher, R., Huang, E., Pennington, J., Ng, A., and Manning, C. (2011).

Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. Advances in Neural Information Processing Systems, 24.

Socher, R., Huval, B., Manning, C., and Ng, A. (2012).

Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing.


