
Deep Learning: A Statistical Perspective

Myunghee Cho Paik
Guest lectures by Gisoo Kim, Yongchan Kwon, Young-geun Kim, Wonyoung Kim and Youngwon Choi

Seoul National University

March-June, 2018

Introduction


Natural Language Processing

Natural Language Processing (NLP) includes:

Sentiment analysis
Machine translation
Text generation
...

How do we train a model on language?

How can we convert language into numbers?


Word Embedding

How do we map words into $\mathbb{R}^d$?

One-hot encoding

Each vector has nothing to do with the other vectors: $\forall u \neq v$, $\|u - v\| = \sqrt{2}$, $u^\top v = 0$

However...

A word is characterized by the company it keeps: “Ice” is closer to “Solid” than to “Gas”.
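As a quick numerical illustration (a minimal sketch, not from the lecture; the vocabulary size is arbitrary), one-hot vectors are pairwise orthogonal and equidistant, so they encode no notion of similarity:

```python
import numpy as np

V = 5                         # toy vocabulary size
E = np.eye(V)                 # rows are one-hot word vectors

u, v = E[0], E[1]             # any two distinct words
print(u @ v)                  # 0.0 -> orthogonal
print(np.linalg.norm(u - v))  # 1.414... = sqrt(2), the same for every pair
```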


Main Questions in Word Embedding

Vocabulary set: $V = \{\text{a}, \text{the}, \text{deep}, \text{statistics}, \ldots\}$

Size-$N$ corpus: $C = (v^{(1)}, v^{(2)}, \ldots, v^{(N)})$, $v^{(1)}, \ldots, v^{(N)} \in V$

Given the corpus data, how can we measure similarity between words, sim(deep, statistics)?

How can we define $f$ and learn $w_{\text{deep}}, w_{\text{statistics}}$ such that $\text{sim}(\text{deep}, \text{statistics}) = f(w_{\text{deep}}, w_{\text{statistics}})$?


Some Famous Word Embedding Techniques

Latent Semantic Analysis (LSA) (Deerwester et al. 1990)

Hyperspace Analogue to Language (HAL) (Lund and Burgess, 1996)

Word2Vec (Mikolov et al. 2013a)

GloVe (Pennington et al. 2014)


LSA (Deerwester et al. 1990)

Term-document matrix: $X_{t \times d}$

$\text{sim}(a, b) \propto$ co-occurrence within documents.

Singular value decomposition: $X_{t \times d} = T S D^\top$

With the $k$ largest singular values:

$\hat{X}_{t \times d} = T_{t \times k}\, S_{k \times k}\, (D_{d \times k})^\top$

$T_{t \times k}$: $k$-dimensional term vectors; $D_{d \times k}$: $k$-dimensional document vectors

Figure: from (Deerwester et al. 1990)
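A minimal NumPy sketch of the rank-$k$ reconstruction described above; the term-document counts here are made-up toy numbers:

```python
import numpy as np

# toy term-document count matrix X (t terms x d documents)
X = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.],
              [0., 1., 2.]])

k = 2
T, s, Dt = np.linalg.svd(X, full_matrices=False)   # X = T diag(s) D^T
T_k, S_k, D_k = T[:, :k], np.diag(s[:k]), Dt[:k, :].T

X_hat = T_k @ S_k @ D_k.T       # rank-k approximation of X
term_vectors = T_k @ S_k        # k-dim term representations (one common choice)
doc_vectors  = D_k @ S_k        # k-dim document representations
```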


HAL (Lund and Burgess, 1996)

Term-context matrix: $X_{V \times V}$

How many times does the column word appear near (in front of) the row word?

$\text{sim}(a, b) \propto$ co-occurrence in nearby context.

Concatenate each word's row and column to make a $2V$-dimensional vector.

Dimension reduction with $k$ principal components.

Trained on 160M terms, with $V = 70{,}000$.

Figure: from (Lund and Burgess, 1996)


Sim(·, ·) ∝ Co-occurrence?

Co-occurrence with “and” or “the” does not mean semantic similarity.

Does a word simply appear frequently, or does it carry significant similarity?

Solutions: transform the counts or define a new measure of similarity.

Entropy/correlation-based normalization (Rohde et al., 2006)

Positive pointwise mutual information (PPMI): $\max\{0,\ \log \frac{p(\text{context}\mid\text{term})}{p(\text{context})}\}$ (Bullinaria and Levy, 2007)

Square-root-type transformation (Lebret and Collobert, 2014)

Train $p(\text{context}\mid\text{term})$ within every local window (Word2Vec)
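A small sketch of the PPMI transform on a term-context count matrix, assuming the $\max\{0, \log p(\text{context}\mid\text{term})/p(\text{context})\}$ form above (the counts are toy values):

```python
import numpy as np

def ppmi(counts):
    """counts[i, j] = co-occurrence count of term i with context word j."""
    total = counts.sum()
    p_tc = counts / total                        # joint p(term, context)
    p_t = p_tc.sum(axis=1, keepdims=True)        # p(term)
    p_c = p_tc.sum(axis=0, keepdims=True)        # p(context)
    with np.errstate(divide="ignore"):
        pmi = np.log(p_tc / (p_t * p_c))         # = log p(context|term)/p(context)
    return np.maximum(pmi, 0.0)                  # clip negatives (and -inf) to 0

X = np.array([[10., 0., 2.],
              [ 3., 5., 0.],
              [ 0., 1., 8.]])
print(ppmi(X))
```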

Word2Vec


Model Setup (Mikolov et al., 2013)

Vocabulary set: $V = \{e_1, e_2, \ldots, e_V\} \subset \{0,1\}^V$

Size-$N$ corpus: $C = (v^{(1)}, v^{(2)}, \ldots, v^{(N)})$, $v^{(1)}, \ldots, v^{(N)} \in V$

Embedded word vectors:

$W_{d \times V} = \begin{bmatrix} w_1 & w_2 & \cdots & w_V \end{bmatrix}, \qquad W'_{d \times V} = \begin{bmatrix} w'_1 & w'_2 & \cdots & w'_V \end{bmatrix}$


Model Setup (Mikolov et al., 2013)

Thus, the model becomes:

$P(v^{(\text{output})} \mid v^{(\text{input})}) = \dfrac{\exp(w'_{\text{output}} \cdot w_{\text{input}})}{\sum_{j=1}^{V} \exp(w'_j \cdot w_{\text{input}})}$

$W$ and $W'$ are called the input and output representations.

Note that $W \neq W'$. If $W = W'$, $P(\cdot \mid \cdot)$ would be maximized when the context word equals the input word, which is a rare event.

If the output (context) word appears in the window, $w'_{\text{output}} \cdot w_{\text{input}}$ increases.
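A sketch of this softmax model with random toy parameters; the dimensions, seed, and word index are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 4, 10
W  = rng.uniform(-0.5, 0.5, size=(d, V))   # input representations w_1..w_V (columns)
Wp = rng.uniform(-0.5, 0.5, size=(d, V))   # output representations w'_1..w'_V

def p_output_given_input(i, Wp, W):
    """P(e_j | e_i) for all j: softmax over the scores u_ij = w'_j . w_i."""
    u = Wp.T @ W[:, i]          # scores for every candidate output word
    u -= u.max()                # numerical stability
    e = np.exp(u)
    return e / e.sum()

probs = p_output_given_input(2, Wp, W)
print(probs.shape, probs.sum())             # (10,) 1.0
```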


Training the Model (Rong, 2014)

Initialize $W$ → read a (context, input) pair → update $W'$ → update $W$ → read another (context, input) pair → $\cdots$

Initialization: $W_{ij} \sim U[-0.5, 0.5]$, $\forall i, j$

Suppose $v^{(\text{output})} = e_o$ appeared in the context of $v^{(\text{input})} = e_i$.

Update $W'$ by minimizing the negative log-likelihood:

$L \equiv -\log P(e_o \mid e_i) = \log\Big(\sum_{j=1}^{V} \exp(u_{ij})\Big) - u_{io}, \quad \text{where } u_{ij} = w'_j \cdot w_i,\ j = 1, \ldots, V$


Training the Model (Rong, 2014)

Taking derivatives:

$\dfrac{\partial L}{\partial u_{ik}} = \dfrac{\exp(u_{ik})}{\sum_{j=1}^{V}\exp(u_{ij})} - \delta_{(k=o)}, \qquad \dfrac{\partial u_{ik}}{\partial w'_k} = w_i$

$\dfrac{\partial L}{\partial w'_k} = [P(e_k \mid e_i) - \delta_{(k=o)}]\, w_i, \quad \forall k = 1, \ldots, V$

With gradient descent, the updating equation:

$w'_{k(\text{new})} = w'_{k(\text{old})} - \alpha\,[P(e_k \mid e_i) - \delta_{(k=o)}]\, w_i, \quad \forall k = 1, \ldots, V$

If $k = o$, then $[P(e_k \mid e_i) - \delta_{(k=o)}] < 0$: the probability of the observed output is underestimated, so the update adds a component in the $w_i$ direction to $w'_k$.

In summary, the update increases $u_{io}$ and decreases $u_{ik}$ for all $k \neq o$.
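A sketch of this $W'$ update for a single (input, output) pair; the learning rate and the dense full-vocabulary softmax are simplifications for illustration:

```python
import numpy as np

def update_output_vectors(W, Wp, i, o, alpha=0.05):
    """One gradient-descent step on W' for input word i and observed output word o."""
    u = Wp.T @ W[:, i]                       # u_ij = w'_j . w_i
    p = np.exp(u - u.max()); p /= p.sum()    # P(e_j | e_i)
    err = p.copy()
    err[o] -= 1.0                            # P(e_j | e_i) - delta(j=o)
    Wp_new = Wp - alpha * np.outer(W[:, i], err)   # w'_j <- w'_j - alpha * err_j * w_i
    return Wp_new
```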


Training the Model (Rong, 2014)

Given W ′, update W .

Reminder: $v^{(\text{output})} = e_o$ appeared in the context of $v^{(\text{input})} = e_i$.

Taking derivatives w.r.t. $w_i$:

$\dfrac{\partial L}{\partial u_{ik}} = \dfrac{\exp(u_{ik})}{\sum_{j=1}^{V}\exp(u_{ij})} - \delta_{(k=o)}, \qquad \dfrac{\partial u_{ik}}{\partial w_i} = w'_k$

$\dfrac{\partial L}{\partial w_i} = \sum_{j=1}^{V} \dfrac{\partial L}{\partial u_{ij}} \dfrac{\partial u_{ij}}{\partial w_i} = \sum_{j=1}^{V} [P(e_j \mid e_i) - \delta_{(j=o)}]\, w'_j$

Define $EH = \sum_{j=1}^{V} [P(e_j \mid e_i) - \delta_{(j=o)}]\, w'_j$: the sum of the output vectors, weighted by their prediction errors.

With gradient descent, the updating equation: $w_{i(\text{new})} = w_{i(\text{old})} - \alpha\, EH$


CBOW and Skip-gram (Mikolov et al., 2013)

Figure: CBOW Model
Figure: Skip-gram Model


Training CBOW (Rong, 2014)

Input: $v^{(t-c)}, \cdots, v^{(t-1)}, v^{(t+1)}, \cdots, v^{(t+c)}$. Output: $v^{(t)}$

Suppose $v^{(t-c)} = e_{t(1)}, \ldots, v^{(t+c)} = e_{t(2c)}$, and $v^{(t)} = e_o$.

Define:

$h_t \equiv \dfrac{1}{2c} \sum_{j=\pm 1, \ldots, \pm c} W v^{(t+j)} = \dfrac{1}{2c} \sum_{k=1}^{2c} w_{t(k)}$

Then the model becomes:

$P(e_o \mid e_{t(1)}, \ldots, e_{t(2c)}) = \dfrac{\exp(w'_o \cdot h_t)}{\sum_{j=1}^{V} \exp(w'_j \cdot h_t)}$

The loss is defined by the negative log-likelihood:

$L \equiv \log \sum_{j=1}^{V} \exp(w'_j \cdot h_t) - w'_o \cdot h_t, \quad \text{where } u_{tj} = w'_j \cdot h_t,\ \forall j = 1, \ldots, V$


Training CBOW (Rong, 2014)

With a similar calculation, the updating equation for $W'$ becomes:

$w'_{k(\text{new})} = w'_{k(\text{old})} - \alpha\,[P(e_k \mid v^{(t-c)}, \cdots, v^{(t+c)}) - \delta_{(k=o)}]\, h_t, \quad \forall k = 1, \ldots, V$

For $W$, note that $u_{tj} = w'_j \cdot h_t = \dfrac{1}{2c} \sum_{k=1}^{2c} w'_j \cdot w_{t(k)}$

For backpropagation:

$\dfrac{\partial L}{\partial w_{t(k)}} = \sum_{j=1}^{V} \dfrac{\partial L}{\partial u_{tj}} \dfrac{\partial u_{tj}}{\partial w_{t(k)}} = \dfrac{1}{2c} \sum_{j=1}^{V} [P(e_j \mid v^{(t-c)}, \cdots, v^{(t+c)}) - \delta_{(j=o)}]\, w'_j = \dfrac{1}{2c}\, EH$

Thus the updating equation becomes:

$w_{t(k)(\text{new})} = w_{t(k)(\text{old})} - \alpha\, \dfrac{1}{2c}\, EH, \quad \forall k = 1, \ldots, 2c$
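Putting the CBOW pieces together, a sketch of one training step under the same toy setup (context words given as column indices of $W$; the in-place updates and learning rate are illustrative choices):

```python
import numpy as np

def cbow_step(W, Wp, context_idx, target_idx, alpha=0.05):
    """One CBOW update: average the context vectors, softmax, then update W' and W."""
    h = W[:, context_idx].mean(axis=1)          # h_t = (1/2c) sum of context word vectors
    u = Wp.T @ h                                # u_tj = w'_j . h_t
    p = np.exp(u - u.max()); p /= p.sum()       # predicted distribution over V words
    err = p.copy(); err[target_idx] -= 1.0      # prediction error for each output word

    EH = Wp @ err                               # sum_j err_j * w'_j
    Wp -= alpha * np.outer(h, err)              # update output vectors
    W[:, context_idx] -= alpha * EH[:, None] / len(context_idx)   # spread EH/(2c) to each context word
    return W, Wp
```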


Training Skip-gram (Rong, 2014)

Input: $v^{(t)}$. Output: $v^{(t-c)}, \cdots, v^{(t-1)}, v^{(t+1)}, \cdots, v^{(t+c)}$.

Suppose $v^{(t)} = e_i$, and $v^{(t-c)} = e_{t(1)}, \ldots, v^{(t+c)} = e_{t(2c)}$. Then the model becomes:

$P(v^{(t-c)}, \cdots, v^{(t-1)}, v^{(t+1)}, \cdots, v^{(t+c)} \mid v^{(t)}) = \prod_{k=1}^{2c} P(e_{t(k)} \mid e_i) = \prod_{k=1}^{2c} \dfrac{\exp(w'_{t(k)} \cdot w_i)}{\sum_{j=1}^{V} \exp(w'_j \cdot w_i)}$

The loss becomes:

$L \equiv \sum_{k=1}^{2c} L_k = \sum_{k=1}^{2c} \Big[\log \sum_{j=1}^{V} \exp(u^{(k)}_{ij}) - u^{(k)}_{i\,t(k)}\Big], \quad \text{where } u^{(k)}_{ij} = w'_j \cdot w_i \text{ is the score for the } k\text{-th loss only}$


Training Skip-gram (Rong, 2014)

For $W'$, $j = 1, \ldots, V$:

$\dfrac{\partial L}{\partial w'_j} = \sum_{k=1}^{2c} \dfrac{\partial L_k}{\partial u^{(k)}_{ij}} \dfrac{\partial u^{(k)}_{ij}}{\partial w'_j} = \sum_{k=1}^{2c} [P(e_j \mid e_i) - \delta_{(j=t(k))}]\, w_i$

Thus, the updating equation for $W'$ becomes:

$w'_{j(\text{new})} = w'_{j(\text{old})} - \alpha \sum_{k=1}^{2c} [P(e_j \mid e_i) - \delta_{(j=t(k))}]\, w_i, \quad \forall j = 1, \ldots, V$


Training Skip-gram (Rong, 2014)

For $W$:

$\dfrac{\partial L}{\partial w_i} = \sum_{k=1}^{2c} \sum_{j=1}^{V} \dfrac{\partial L_k}{\partial u^{(k)}_{ij}} \dfrac{\partial u^{(k)}_{ij}}{\partial w_i} = \sum_{k=1}^{2c} \sum_{j=1}^{V} [P(e_j \mid e_i) - \delta_{(j=t(k))}]\, w'_j \equiv \sum_{k=1}^{2c} EH^{(k)}$

Thus, the updating equation for $W$ becomes:

$w_{i(\text{new})} = w_{i(\text{old})} - \alpha \sum_{k=1}^{2c} EH^{(k)}$
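A corresponding sketch of one skip-gram step that accumulates the $2c$ per-position errors before updating $W'$ and $w_i$ (toy indices and learning rate):

```python
import numpy as np

def skipgram_step(W, Wp, center_idx, context_idx, alpha=0.05):
    """One skip-gram update for a center word and its 2c observed context words."""
    w_i = W[:, center_idx]
    u = Wp.T @ w_i                          # scores u_ij = w'_j . w_i, shared by all positions
    p = np.exp(u - u.max()); p /= p.sum()

    # accumulated error over the 2c positions: sum_k [P(e_j|e_i) - delta(j = t(k))]
    err = len(context_idx) * p
    for t in context_idx:
        err[t] -= 1.0

    EH = Wp @ err                           # sum_k EH^(k), computed with the old W'
    Wp -= alpha * np.outer(w_i, err)        # w'_j <- w'_j - alpha * err_j * w_i
    W[:, center_idx] = w_i - alpha * EH     # w_i <- w_i - alpha * sum_k EH^(k)
    return W, Wp
```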


Computational Problem

For each (input, output) pair in the corpus $C = (v^{(1)}, v^{(2)}, \ldots, v^{(N)})$, the model must calculate:

$P(e_o \mid e_i) = \dfrac{\exp(w'_o \cdot w_i)}{\sum_{j=1}^{V} \exp(w'_j \cdot w_i)}$

Each epoch requires almost $N \times V$ inner products of $d$-dimensional vectors (skip-gram: $N \times V \times 2c$).

The cost is proportional to $V \approx 10^6$.

Mikolov et al. (2013) suggest two alternative formulations: hierarchical softmax and negative sampling.


Hierarchical softmax (Mikolov et al., 2013)

An efficient way of computing the softmax

Build a Huffman binary tree using word frequencies

Instead of $w'_j$, the model uses $w'_{n(e_j, l)}$

$n(e_j, l)$: the $l$-th node on the path from the root to the word $e_j$

Figure: Binary tree for HS

Let $h_i$ be the hidden vector. Then the probability model becomes:

$P(e_o \mid e_i) = \prod_{l=1}^{L(e_o)-1} \sigma\big(\,[\![\,n(e_o, l+1) \text{ is the left child of } n(e_o, l)\,]\!] \cdot w'_{n(e_o,l)} \cdot h_i\big)$

where $[\![\cdot]\!]$ is $+1$ if the statement is true and $-1$ otherwise, and $L(e_o)$ is the length of the path to $e_o$.
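A sketch of evaluating this path probability; the tree itself is assumed to be given as, for each word, the inner-node indices on its root-to-leaf path and ±1 signs for left/right turns (these data structures are illustrative, not part of the original formulation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_probability(h, Wp_inner, path_nodes, path_signs):
    """
    P(e_o | e_i) under hierarchical softmax.
    path_nodes: indices of the inner nodes n(e_o, 1..L-1) on the root-to-leaf path
    path_signs: +1 if the next node on the path is the left child, -1 otherwise
    """
    prob = 1.0
    for node, sign in zip(path_nodes, path_signs):
        prob *= sigmoid(sign * (Wp_inner[:, node] @ h))
    return prob

# toy example: a vocabulary of 4 words needs 3 inner nodes
rng = np.random.default_rng(0)
d = 5
Wp_inner = rng.normal(size=(d, 3))      # one output vector per inner node
h = rng.normal(size=d)                  # hidden (input word) vector
print(hs_probability(h, Wp_inner, path_nodes=[0, 1], path_signs=[+1, -1]))
```

By construction these leaf probabilities sum to 1 over the vocabulary, so only $L(e_o)-1$ inner products are needed per word instead of $V$.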


Training Hierarchical Softmax (Rong, 2014)

Let $L = -\log P(e_o \mid e_i)$, and write $w'_{n(e_o,l)} = w'_l$. Then:

$\dfrac{\partial L}{\partial (w'_l \cdot h_i)} = \{\sigma([\![\cdots]\!]\, w'_l \cdot h_i) - 1\}\,[\![\cdots]\!] = \begin{cases} \sigma(w'_l \cdot h_i) - 1 & [\![\cdots]\!] = 1 \\ \sigma(w'_l \cdot h_i) & [\![\cdots]\!] = -1 \end{cases}$

$\sigma(w'_l \cdot h_i)$ is the probability that $w'_{l+1}$ is the left child node of $w'_l$. Thus,

$\dfrac{\partial L}{\partial (w'_l \cdot h_i)} = P[w'_{l+1} \text{ is the left child node of } w'_l] - \delta_{[\cdots]}$

Thus the updating equation becomes, for $l = 1, \ldots, L(e_o) - 1$:

$w'_{l(\text{new})} = w'_{l(\text{old})} - \alpha\,\big(P[w'_{l+1} \text{ is the left child node of } w'_l] - \delta_{[\cdots]}\big)\, h_i$

For the skip-gram model, repeat this procedure for the $2c$ outputs.

The updating equation for $W$ becomes:

$w_{i(\text{new})} = w_{i(\text{old})} - \alpha\, EH\, \dfrac{\partial h_i}{\partial w_i}, \quad \text{where } EH = \sum_{l=1}^{L(e_o)-1} \big(P[\cdots] - \delta_{[\cdots]}\big)\, w'_l$


Negative Sampling (Mikolov et al., 2013)

Generate $e_{n(1)}, \ldots, e_{n(k)}$ from the noise distribution $P_n$

The goal is to discriminate $(h_i, e_o)$ from $(h_i, e_{n(1)}), \ldots, (h_i, e_{n(k)})$

For the skip-gram model, repeat this procedure for each of the $2c$ outputs.

$k = 5$–$20$ is useful; for large datasets, $k$ can be as small as 2–5.

The noise distribution $P_n(e_n) \propto \big[\#(e_n)/N\big]^{3/4}$ significantly outperformed other choices.

Figure: 5-Negative Sampling


Objective in Negative Sampling (Goldberg and Levy, 2014)

Suppose $(h_i, e_o)$ and $(h_i, e_{n(1)}), \ldots, (h_i, e_{n(k)})$, with $o \neq n(j)$ for all $j = 1, \ldots, k$, are given.

Let $[D = 1 \mid h_i, e_j]$ be the event that the pair $(h_i, e_j)$ came from the original corpus.

The model assumes $P(D = 1 \mid h_i, e_j) = \sigma(w'_j \cdot h_i)$. Thus the likelihood becomes:

$\sigma(w'_o \cdot h_i) \times \prod_{j=1}^{k} \big[1 - \sigma(w'_{n(j)} \cdot h_i)\big]$

Taking the log leads to the objective in Mikolov et al. (2013):

$\log \sigma(w'_o \cdot h_i) + \sum_{j=1}^{k} \log \sigma(-w'_{n(j)} \cdot h_i), \quad e_{n(j)} \sim P_n$

Note that training $h_i$ given $w'_o, w'_{n(1)}, \ldots, w'_{n(k)}$ is a logistic regression.
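A sketch of this negative-sampling objective for one positive pair and $k$ sampled negatives, with a unigram$^{3/4}$ noise sampler; the sizes and random parameters are placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(h, Wp, o, neg_idx):
    """-log sigma(w'_o . h) - sum_j log sigma(-w'_n(j) . h)"""
    pos = np.log(sigmoid(Wp[:, o] @ h))
    neg = np.log(sigmoid(-Wp[:, neg_idx].T @ h)).sum()
    return -(pos + neg)

def sample_negatives(counts, k, rng):
    """Draw k word indices from P_n(w) proportional to count(w)^(3/4)."""
    p = counts ** 0.75
    p /= p.sum()
    return rng.choice(len(counts), size=k, replace=True, p=p)

rng = np.random.default_rng(0)
d, V, k = 8, 50, 5
Wp = rng.normal(scale=0.1, size=(d, V))
h = rng.normal(scale=0.1, size=d)
counts = rng.integers(1, 100, size=V).astype(float)
negs = sample_negatives(counts, k, rng)
print(neg_sampling_loss(h, Wp, o=3, neg_idx=negs))
```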


Training Negative Sampling (Rong, 2014)

Define loss as:

$L = -\log \sigma(w'_o \cdot h_i) - \sum_{j=1}^{k} \log \sigma(-w'_{n(j)} \cdot h_i)$

Let $W_{\text{neg}} = \{w'_{n(1)}, \ldots, w'_{n(k)}\}$. Then the derivative:

$\dfrac{\partial L}{\partial (w'_j \cdot h_i)} = \begin{cases} \sigma(w'_j \cdot h_i) - 1 & w'_j = w'_o \\ \sigma(w'_j \cdot h_i) & w'_j \in W_{\text{neg}} \end{cases} = P(D = 1 \mid h_i, e_j) - \delta_{(j=o)}$

Thus the updating equation for $W'$: for $j = o, n(1), \ldots, n(k)$,

$w'_{j(\text{new})} = w'_{j(\text{old})} - \alpha\,[P(D = 1 \mid h_i, e_j) - \delta_{(j=o)}]\, h_i$

Let $\dfrac{\partial L}{\partial h_i} = \sum_{j \in \{o,\, n(1), \ldots, n(k)\}} \big(P(D = 1 \mid h_i, e_j) - \delta_{(j=o)}\big)\, w'_j \equiv EH$. Then the updating equation for $W$:

$w_{i(\text{new})} = w_{i(\text{old})} - \alpha\, EH\, \dfrac{\partial h_i}{\partial w_i}$


Two Pre-processing Techniques (Mikolov et al., 2013)

Frequent words (such as “a”, “the”, “in”) provide less information value than rare words.

Let $V = \{v_1, \ldots, v_V\}$ be the vocabulary set. Discard each word $v_i$ with probability:

$P(v_i) = 1 - \sqrt{\dfrac{t}{\#(v_i)/N}}$

where $t = 10^{-5}$ is a suitable threshold value.

Phrases like “New York Times” and “Toronto Maple Leafs” can be considered as one word.

In order to find those phrases, define a score:

$\text{score}(v_i, v_j) = \dfrac{\#(v_i v_j) - \delta}{\#(v_i)\,\#(v_j)}$

Over 2–4 passes of the training set, calculate the score with decreasing $\delta$. Above some threshold value, treat $v_i v_j$ as a single word.
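A sketch of both pre-processing steps, using the discard probability and the phrase score above (the threshold $t$ is the quoted value; the corpus, random seed, and $\delta$ are placeholders):

```python
import numpy as np
from collections import Counter

def subsample(tokens, t=1e-5, rng=None):
    """Randomly discard token v_i with probability 1 - sqrt(t / (#(v_i)/N))."""
    rng = rng or np.random.default_rng(0)
    counts = Counter(tokens)
    N = len(tokens)
    kept = []
    for w in tokens:
        freq = counts[w] / N
        p_discard = max(0.0, 1.0 - np.sqrt(t / freq))
        if rng.random() >= p_discard:
            kept.append(w)
    return kept

def phrase_score(bigram_count, count_a, count_b, delta=5.0):
    """score(v_i, v_j) = (#(v_i v_j) - delta) / (#(v_i) * #(v_j))."""
    return (bigram_count - delta) / (count_a * count_b)
```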

GloVe


Motivation (Pennington et al., 2014)

Let $V = \{v_1, \ldots, v_V\}$ be the vocabulary set.

Throughout the corpus $C$, define some statistics:

$X_{ij}$: number of times word $v_j$ appears in the context of word $v_i$

$X_i \equiv \sum_k X_{ik}$: number of times any word appears in the context of $v_i$

$P_{ij} = X_{ij}/X_i$: probability that $v_j$ appears in the context of $v_i$

How can we measure similarity between words, $\text{sim}(v_i, v_j)$?


Motivation (Pennington et al., 2014)

Co-occurrence probabilities for “ice” and “steam” with selected context words from a corpus ($N$ = 6 billion)

If $v_k$ is related to $v_i$ rather than $v_j$, then $P_{ik}/P_{jk}$ will be larger than 1.

If $v_k$ is related (or not related) to both $v_i$ and $v_j$, then $P_{ik}/P_{jk}$ will be close to 1.

The ratio $P_{ik}/P_{jk}$ is useful for finding out whether $v_k$ is close to $v_i$ (or $v_j$).

Figure: from (Pennington et al., 2014)


Model Setup (Pennington et al., 2014)

With this motivation, the model becomes:

$\dfrac{P_{ik}}{P_{jk}} = F(w_i, w_j, w'_k), \quad w_i, w_j, w'_k \in \mathbb{R}^d$

Using two sets of parameters $W, W'$ can help reduce overfitting and noise and generally improves results (Ciresan et al., 2012).

In a vector space, knowing $w_1, \ldots, w_V$ is the same as knowing $w_1 - w_i, \ldots, w_V - w_i$. Thus $F$ can be restricted to:

$\dfrac{P_{ik}}{P_{jk}} = F(w_i - w_j, w'_k)$

In order to match the dimensions and preserve the linear structure, use dot products:

$\dfrac{P_{ik}}{P_{jk}} = F\big[(w_i - w_j) \cdot w'_k\big]$


Model Setup (Pennington et al., 2014)

For any $i, j, k, l = 1, \ldots, V$,

$F\big[(w_i - w_j) \cdot w'_k\big]\; F\big[(w_j - w_l) \cdot w'_k\big] = \dfrac{P_{ik}}{P_{lk}} = F\big[(w_i - w_l) \cdot w'_k\big]$

It is natural to define $F$ satisfying $F(x)F(y) = F(x + y)$. This implies $F = \exp(\cdot)$.

Moreover:

$F\big[(w_i - w_j) \cdot w'_k\big] = \dfrac{\exp(w_i \cdot w'_k)}{\exp(w_j \cdot w'_k)} = \dfrac{P_{ik}}{P_{jk}}$

Thus, $w_i \cdot w'_k = \log P_{ik} = \log X_{ik} - \log X_i$

Since the roles of a word and a context are exchangeable, $w_i \cdot w'_k = w_k \cdot w'_i$.


Model Setup (Pennington et al., 2014)

Consider $\log X_i$ as a bias of the input representation, $b_i$, and add another bias $b'_k$.

Finally, the model becomes:

$w_i \cdot w'_k + b_i + b'_k = \log X_{ik}$

Now, define a weighted cost function:

$L = \sum_{i,j=1}^{V} f(X_{ij})\,(w_i \cdot w'_j + b_i + b'_j - \log X_{ij})^2$

The weight must satisfy:

$f(0) = 0$: to handle the case $X_{ij} = 0$

$f$ must be non-decreasing: frequent co-occurrences must be emphasized

$f$ should be relatively small for large values: the case of “in”, “the”, “and”


Training GloVe (Pennington et al., 2014)

$f$ is suggested as:

$f(x) = \begin{cases} (x/x_{\max})^{\alpha} & x < x_{\max} \\ 1 & x \geq x_{\max} \end{cases}$

$x_{\max}$ is reported to have a weak impact on performance (fixed at $x_{\max} = 100$).

$\alpha = 3/4$ gives a modest improvement over $\alpha = 1$.

Training uses AdaGrad (Duchi et al., 2011), stochastically sampling the non-zero elements of $X$.

The model generates $W$ and $W'$; the final word vectors are $W + W'$.
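A sketch of the GloVe weighting function and the weighted least-squares objective defined above; a real implementation would run AdaGrad over the non-zero entries rather than evaluate the full loss, and the co-occurrence matrix here is a toy stand-in:

```python
import numpy as np

def f_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: (x/x_max)^alpha below x_max, 1 above."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, Wp, b, bp, X):
    """sum_{ij} f(X_ij) (w_i . w'_j + b_i + b'_j - log X_ij)^2 over non-zero X_ij."""
    i, j = np.nonzero(X)
    pred = np.einsum("di,dj->ij", W, Wp)[i, j] + b[i] + bp[j]
    return np.sum(f_weight(X[i, j]) * (pred - np.log(X[i, j])) ** 2)

rng = np.random.default_rng(0)
d, V = 10, 6
W  = rng.normal(scale=0.1, size=(d, V))
Wp = rng.normal(scale=0.1, size=(d, V))
b, bp = np.zeros(V), np.zeros(V)
X = rng.integers(0, 20, size=(V, V)).astype(float)   # toy co-occurrence counts
print(glove_loss(W, Wp, b, bp, X))
```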

Toy Implementation


Data and Model Descriptions

Movie review data from NLTK corpus.

Consists of plot summaries and critiques.

Corpus size $N$ = 1.5 million, vocabulary size $V$ = 39,768.

Embedding dimension $d$ = 100, window size $c$ = 5.

Negative sample size: $k$ = 5.

GloVe trained for 10 epochs.

Time elapsed for training (Intel Core i7 CPU @ 3.60GHz):

Model CBOW+HS CBOW+NEG SG+HS SG+NEG GloVe

Time 9.14s 4.53s 12.4s 12.3s 44.2s


Results

Similarity between two vectors


Similarity between two vectors (most frequent words)


Top 5 most similar words to “villian”


Linear relationship: (“actor”+”she”-”actress”=?)


Linear relationship: (“king”+”she”-”he”=?)
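Queries like these are typically answered by a nearest-neighbor search under cosine similarity. A minimal sketch, assuming an embedding matrix $W$ (words as columns), a word-to-index dictionary, and an index-to-word list from one of the trained models above:

```python
import numpy as np

def analogy(W, word2idx, idx2word, a, b, c, topn=5):
    """Return the words closest (cosine) to vec(a) + vec(b) - vec(c), e.g. king + she - he."""
    q = W[:, word2idx[a]] + W[:, word2idx[b]] - W[:, word2idx[c]]
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)       # unit-length columns
    sims = Wn.T @ (q / np.linalg.norm(q))                   # cosine similarity to the query
    for w in (a, b, c):                                     # exclude the query words themselves
        sims[word2idx[w]] = -np.inf
    best = np.argsort(-sims)[:topn]
    return [(idx2word[i], float(sims[i])) for i in best]
```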

Performances


Intrinsic Performances (Pennington et al., 2014)

Word analogies task: 19,544 questions

Semantic: “Athens” is to “Greece” as “Berlin” is to ( ? )

Syntactic: “dance” is to “dancing” as “fly” is to ( ? )

Corpus: Gigaword5 + Wikipedia2014

Percentage of correct answers:

Model d N Sem. Syn. Tot.

CBOW 300 6B 63.6 67.4 65.7

SG 300 6B 73.0 66.0 69.1

GloVe 300 6B 77.4 67.0 71.7

Table: From (Pennington et al., 2014)


Extrinsic Performances (Pennington et al., 2014)

Named entity recognition (NER) with a Conditional Random Field (CRF) model

Input: Jim bought 300 shares of Acme Corp. in 2006

Output: [Jim](Person) bought 300 shares of [Acme Corp.](Organization) in 2006

4 Entities: person, location, organization, miscellaneous.


Extrinsic Performances (Pennington et al., 2014)

Trained with CoNLL-03 training set and 50-dimensional word vectors.

F1 score on validation set and 3 kinds of test sets:

Model Validation CoNLL-Test ACE MUC7

Discrete 91.0 85.4 77.4 73.4

CBOW 93.1 88.2 82.2 81.1

SG None None None None

GloVe 93.2 88.3 82.9 82.2

Table: From (Pennington et al., 2014)

Word Embedding + RNN


How to Add Embedded Vectors to RNN

Recall the RNN model:

Input: $x_t$
Hidden unit: $h_t = \tanh(b + U_h h_{t-1} + U_i x_t)$
Output unit: $o_t = c + U_o h_t$
Predicted probability: $p_t = \mathrm{softmax}(o_t)$
Unknown parameters: $(U_i, U_o, U_h, b, c)$


How to Add Embedded Vectors to RNN

With word embeddings:

Input: $w_{i(t)} = W x_t$
Hidden unit: $h_t = \tanh(b + U_h h_{t-1} + U_i w_{i(t)})$
Output unit: $o_t = c + U_o h_t$
Predicted probability: $p_t = \mathrm{softmax}(o_t)$
Unknown parameters: $(W, U_i, U_o, U_h, b, c)$

$W$ is not just an input; it is the initial weight of the word vectors and is trained further.

This fine-tunes the word vectors for the specific task.

Another derivative is added: for $k = 1, \ldots, V$,

$\dfrac{\partial L}{\partial w_k} = \sum_{t:\, i(t)=k} \dfrac{\partial L}{\partial o_t} \dfrac{\partial o_t}{\partial h_t} \dfrac{\partial h_t}{\partial w_k}$

Can be generalized to LSTM and GRU.
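A NumPy sketch of this forward pass, with $W$ initialized from pre-trained word vectors and then kept as a trainable parameter; all shapes, the random initialization, and the stand-in for the pre-trained matrix are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, H = 1000, 100, 64                     # vocab size, embedding dim, hidden size

W_pretrained = rng.normal(scale=0.1, size=(d, V))   # e.g. GloVe vectors would go here
W = W_pretrained.copy()                     # trainable copy, fine-tuned during training

U_i = rng.normal(scale=0.1, size=(H, d))
U_h = rng.normal(scale=0.1, size=(H, H))
U_o = rng.normal(scale=0.1, size=(V, H))
b, c = np.zeros(H), np.zeros(V)

def forward(token_ids):
    """Run the RNN over a sequence of word indices; return per-step softmax outputs."""
    h = np.zeros(H)
    probs = []
    for t in token_ids:
        x = W[:, t]                          # w_{i(t)} = W x_t (embedding lookup)
        h = np.tanh(b + U_h @ h + U_i @ x)
        o = c + U_o @ h
        p = np.exp(o - o.max()); p /= p.sum()
        probs.append(p)
    return probs

out = forward([3, 17, 256])
print(len(out), out[0].shape)                # 3 (1000,)
```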


Word-rnn (Eidnes, 2015)

Goal: Generating clickbait headlines

Trained on 2M clickbait headlines scraped from BuzzFeed, Gawker, Jezebel, Huffington Post and Upworthy

RNN model using GloVe word vectors ($N$ = 6B, $d$ = 200) as initial weights.

3-layer LSTM model with T = 1200.


Word-rnn (Eidnes, 2015)

First 8 completions of “Barack Obama Says”:

Barack Obama Says It’s Wrong To Talk About Iraq
Barack Obama Says He’s Like ‘A Single Mother’ And ‘Over The Top’
Barack Obama Says He Did 48 Things Over
Barack Obama Says About Ohio Law
Barack Obama Says He Is Wrong
Barack Obama Says He Will Get The American Idol
Barack Obama Says Himself Are “Doing Well Around The World”
Barack Obama Says As He Leaves Politics With His Wife

More examples are on the website listed in the references.

Most of the generated sentences are grammatically correct and make sense.


Word-rnn (Eidnes, 2015)

The model seems to understand the gender and political context.
“Mary J. Williams On Coming Out As A Woman”
“Romney Camp: ‘I Think You Are A Bad President’”

Updating W for only 2-layers works best.

Figure: From (Eidnes, 2015)

Conclusion


Summary

Embedding discrete words into $\mathbb{R}^d$ yields interesting results.

Similar words have vectors with a high cosine similarity.

Linear relationships: “king” + “she” − “he” = ?

Embedded vectors can be used as inputs or initial weights of a deep neural network.

References


Key References

Goldberg, Y., & Levy, O. (2014). word2vec Explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119).

Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).


Rong, X. (2014). word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.

Eidnes, L. (2015). Auto-Generating Clickbait With Recurrent Neural Networks. [online] Lars Eidnes' blog. Available at: https://larseidnes.com/2015/10/13/auto-generating-clickbait-with-recurrent-neural-networks/ [Accessed 8 May 2018].
