
Computational Linguistics Week 5

Neural Networks and Neural Language Models

By Mark Chang

Outline

•  Machine Learning
•  Neural Networks
•  Training Neural Networks
•  Vector Space of Semantics
•  Neural Language Models (word2vec)

Machine Learning

[Diagram: training data flows into the machine learning model, which produces an output; the output is compared with the answer, and the error is fed back to update the model. After training, the model is applied to testing data to produce output.]

Machine Learning

•  Training data: \( X, Y \), with examples \( x^{(i)}, y^{(i)} \)
•  Model: \( h \), with parameters \( w \)
•  Output: \( h(X) \)
•  Answer: \( Y \)
•  Cost function: \( E(h(X), Y) \), fed back to adjust \( w \)

Logistic Regression

Training data:

X             Y
-0.47241379   0
-0.35344828   0
-0.30148276   0
 0.33448276   1
 0.35344828   1
 0.37241379   1
 0.39137931   1
 0.41034483   1
 0.44931034   1
 0.49827586   1
 0.51724138   1
 ...          ...

Model

Sigmoid function:

\[ h(x) = \frac{1}{1 + e^{-(w_0 + w_1 x)}} \]

\( w_0 + w_1 x < 0 \Rightarrow h(x) \approx 0 \)
\( w_0 + w_1 x > 0 \Rightarrow h(x) \approx 1 \)
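A minimal Python sketch of this model (the weight values here are illustrative, not taken from the slides):

```python
import math

def h(x, w0, w1):
    """Sigmoid model: squashes w0 + w1*x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))

# With illustrative weights w0 = -8, w1 = 25, the decision boundary sits
# at x = 8/25 = 0.32: inputs below it give h(x) near 0, above near 1.
print(h(-0.47, -8.0, 25.0))  # close to 0
print(h(0.52, -8.0, 25.0))   # ~0.99
```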

Cost Function

•  Cross Entropy

\[ E(h(X), Y) = -\frac{1}{m} \sum_{i=1}^{m} \Big( y^{(i)} \log\big(h(x^{(i)})\big) + \big(1 - y^{(i)}\big) \log\big(1 - h(x^{(i)})\big) \Big) \]

If \( y^{(i)} = 1 \): \( E(h(x^{(i)}), y^{(i)}) = -\log\big(h(x^{(i)})\big) \)

  \( h(x^{(i)}) \approx 0 \Rightarrow E(h(x^{(i)}), y^{(i)}) \to \infty \)
  \( h(x^{(i)}) \approx 1 \Rightarrow E(h(x^{(i)}), y^{(i)}) \approx 0 \)

If \( y^{(i)} = 0 \): \( E(h(x^{(i)}), y^{(i)}) = -\log\big(1 - h(x^{(i)})\big) \)

  \( h(x^{(i)}) \approx 0 \Rightarrow E(h(x^{(i)}), y^{(i)}) \approx 0 \)
  \( h(x^{(i)}) \approx 1 \Rightarrow E(h(x^{(i)}), y^{(i)}) \to \infty \)

Cost Function

•  Cross Entropy

\[ E(h(X), Y) = -\frac{1}{m} \sum_{i=1}^{m} \Big( y^{(i)} \log\big(h(x^{(i)})\big) + \big(1 - y^{(i)}\big) \log\big(1 - h(x^{(i)})\big) \Big) \]

\( h(x^{(i)}) \approx 0 \) and \( y^{(i)} = 0 \Rightarrow E(h(X), Y) \approx 0 \)
\( h(x^{(i)}) \approx 1 \) and \( y^{(i)} = 1 \Rightarrow E(h(X), Y) \approx 0 \)
\( h(x^{(i)}) \approx 0 \) and \( y^{(i)} = 1 \Rightarrow E(h(X), Y) \to \infty \)
\( h(x^{(i)}) \approx 1 \) and \( y^{(i)} = 0 \Rightarrow E(h(X), Y) \to \infty \)
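The same cost in a few lines of Python (assuming the prediction stays strictly between 0 and 1 so both logarithms are defined):

```python
import math

def cross_entropy(h_out, y):
    """Per-example cost: -(y*log(h) + (1-y)*log(1-h))."""
    return -(y * math.log(h_out) + (1 - y) * math.log(1 - h_out))

print(cross_entropy(0.99, 1))  # ~0.01: confident and correct
print(cross_entropy(0.01, 1))  # ~4.6: confident and wrong, diverging as h_out -> 0
```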

[Figure: error surface over the parameters \( w_0 \) and \( w_1 \).]

Feedback

•  Gradient Descent:

\[ w_0 \leftarrow w_0 - \eta \, \frac{\partial E(h(X), Y)}{\partial w_0} \qquad w_1 \leftarrow w_1 - \eta \, \frac{\partial E(h(X), Y)}{\partial w_1} \]

Each step moves the weights in the direction of the negative gradient \( \big( -\frac{\partial E(h(X), Y)}{\partial w_0}, -\frac{\partial E(h(X), Y)}{\partial w_1} \big) \).
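A toy sketch of the whole training loop, using the standard closed-form gradient of cross-entropy through a sigmoid; the learning rate and iteration count are arbitrary choices:

```python
import math

def h(x, w0, w1):
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))

def gradient_step(xs, ys, w0, w1, eta=0.5):
    """One gradient-descent step. For cross-entropy through a sigmoid,
    dE/dw0 = mean(h(x) - y) and dE/dw1 = mean((h(x) - y) * x)."""
    m = len(xs)
    grad_w0 = sum(h(x, w0, w1) - y for x, y in zip(xs, ys)) / m
    grad_w1 = sum((h(x, w0, w1) - y) * x for x, y in zip(xs, ys)) / m
    return w0 - eta * grad_w0, w1 - eta * grad_w1

# A few points from the training data above.
xs = [-0.47241379, -0.35344828, -0.30148276, 0.33448276, 0.51724138]
ys = [0, 0, 0, 1, 1]
w0, w1 = 0.0, 0.0
for _ in range(2000):
    w0, w1 = gradient_step(xs, ys, w0, w1)
print([round(h(x, w0, w1), 2) for x in xs])  # predictions approach the labels
```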


Neural Networks

Neurons & Action Potential

http://humanphisiology.wikispaces.com/file/view/neuron.png/216460814/neuron.png

http://upload.wikimedia.org/wikipedia/commons/thumb/4/4a/Action_potential.svg/1037px-Action_potential.svg.png

Synapse

http://www.quia.com/files/quia/users/lmcgee/Systems/endocrine-nervous/synapse.gif

Artificial Neurons

[Diagram: inputs x1 and x2 with weights w1 and w2, a bias input b with weight wb, a neuron n, and output y.]

\[ n_{in} = w_1 x_1 + w_2 x_2 + w_b \qquad n_{out} = \frac{1}{1 + e^{-n_{in}}} \]

\[ y = \frac{1}{1 + e^{-(w_1 x_1 + w_2 x_2 + w_b)}} \]

In the (x1, x2) plane, the line \( w_1 x_1 + w_2 x_2 + w_b = 0 \) is the decision boundary, where \( n_{out} = 0.5 \):

\( w_1 x_1 + w_2 x_2 + w_b > 0 \Rightarrow n_{out} \approx 1 \)
\( w_1 x_1 + w_2 x_2 + w_b < 0 \Rightarrow n_{out} \approx 0 \)

Binary Classification: AND Gate

x1  x2  y
0   0   0
0   1   0
1   0   0
1   1   1

[Diagram: neuron n with inputs x1, x2 (weights 20, 20) and bias b (weight -30); only the point (1,1) falls on the positive side of the boundary.]

\[ y = \frac{1}{1 + e^{-(20 x_1 + 20 x_2 - 30)}} \]

Decision boundary: \( 20 x_1 + 20 x_2 - 30 = 0 \)
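The gate can be checked directly; a short Python sketch using exactly the weights from the slide:

```python
import math

def neuron(x1, x2, w1, w2, wb):
    """Single sigmoid neuron: y = 1 / (1 + e^-(w1*x1 + w2*x2 + wb))."""
    return 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + wb)))

# Weights from the slide: w1 = w2 = 20, bias weight -30.
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(neuron(x1, x2, 20, 20, -30)))
# Prints the AND truth table: 0, 0, 0, 1. Changing the bias weight to
# -10 gives the OR gate on the next slide.
```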

Binary Classification: OR Gate

x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   1

[Diagram: the same neuron with weights 20, 20 and bias weight -10; every point except (0,0) falls on the positive side.]

\[ y = \frac{1}{1 + e^{-(20 x_1 + 20 x_2 - 10)}} \]

Decision boundary: \( 20 x_1 + 20 x_2 - 10 = 0 \)

XOR Gate?

x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0

[Plot: (0,0) and (1,1) belong to class 0, while (0,1) and (1,0) belong to class 1.]

No single line separates the two classes, so one neuron cannot compute XOR.

Binary Classification: XOR Gate

[Diagram: hidden neuron n1 with weights 20, 20 and bias weight -30 (an AND gate); hidden neuron n2 with weights 20, 20 and bias weight -10 (an OR gate); output neuron with weight -20 on n1, weight 20 on n2, and bias weight -10.]

\[ y = \frac{1}{1 + e^{-(-20 n_1 + 20 n_2 - 10)}} \]

x1  x2  n1  n2  y
0   0   0   0   0
0   1   0   1   1
1   0   0   1   1
1   1   1   1   0
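The two-layer network can be verified the same way, again with the slide's weights:

```python
import math

def neuron(x1, x2, w1, w2, wb):
    return 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + wb)))

def xor(x1, x2):
    """Two-layer network from the slide: an AND neuron and an OR neuron
    feed an output neuron that fires for OR-but-not-AND."""
    n1 = neuron(x1, x2, 20, 20, -30)   # AND
    n2 = neuron(x1, x2, 20, 20, -10)   # OR
    return neuron(n1, n2, -20, 20, -10)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(xor(x1, x2)))
# Prints the XOR truth table: 0, 1, 1, 0.
```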

Neural Networks

[Diagram: input layer (x, y, and bias b), hidden layer (n11, n12, and bias b), output layer (n21, n22) with targets z1, z2. Weights W11,x, W11,y, W11,b, W12,x, W12,y, W12,b feed the hidden layer; weights W21,11, W21,12, W21,b, W22,11, W22,12, W22,b feed the output layer.]

Visual  Pathway

http://www.nature.com/neuro/journal/v8/n8/images/nn0805-975-F1.jpg  

Training Neural Networks

[Diagram: training data → neural networks → output, compared with the answer.]

Initialization → Forward Propagation → Error Function → Backward Propagation

Initialization

•  Randomly sample each weight W from the interval -N to N.

[Diagram: the same 2-2-2 network, with every weight drawn at random.]

Forward Propagation

Each neuron computes \( n_{in} \) as the weighted sum of its inputs plus its bias weight, and passes it through the sigmoid: \( n_{out} = \frac{1}{1 + e^{-n_{in}}} \). This is applied layer by layer, from input to output.

Error Function

\[ J = -\big( z_1 \log(n_{21(out)}) + (1 - z_1) \log(1 - n_{21(out)}) \big) - \big( z_2 \log(n_{22(out)}) + (1 - z_2) \log(1 - n_{22(out)}) \big) \]

\( n_{out} \approx 0 \) and \( z = 0 \Rightarrow J \approx 0 \)
\( n_{out} \approx 1 \) and \( z = 1 \Rightarrow J \approx 0 \)
\( n_{out} \approx 0 \) and \( z = 1 \Rightarrow J \to \infty \)
\( n_{out} \approx 1 \) and \( z = 0 \Rightarrow J \to \infty \)

[Figure: error surface over the weights \( w_0 \) and \( w_1 \).]

Gradient Descent

Every weight is updated along its negative gradient:

\[ w \leftarrow w - \eta \, \frac{\partial J}{\partial w} \quad \text{for each } w \in \{ w_{11,x},\, w_{11,y},\, w_{11,b},\, w_{12,x},\, w_{12,y},\, w_{12,b},\, w_{21,11},\, w_{21,12},\, w_{21,b},\, w_{22,11},\, w_{22,12},\, w_{22,b} \} \]

On the error surface, each step moves in the direction \( \big( -\frac{\partial J}{\partial w_0}, -\frac{\partial J}{\partial w_1} \big) \).

Backward Propagation

http://cpmarkchang.logdown.com/posts/277349-neural-network-backward-propagation  
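A compact numpy sketch of the training loop described above, for the same 2-2-2 network; the learning rate, iteration count, and the toy task (learning XOR and AND jointly) are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# 2-2-2 network: inputs (x, y), hidden n11/n12, outputs n21/n22 with
# targets (z1, z2). The last column of each matrix is the bias weight.
W1 = rng.uniform(-1.0, 1.0, size=(2, 3))  # hidden-layer weights
W2 = rng.uniform(-1.0, 1.0, size=(2, 3))  # output-layer weights

def train_step(x_in, z, eta=0.5):
    global W1, W2
    # Forward propagation.
    x = np.append(x_in, 1.0)             # append the bias input
    hidden = sigmoid(W1 @ x)             # n11, n12
    h = np.append(hidden, 1.0)
    out = sigmoid(W2 @ h)                # n21, n22
    # Backward propagation: for cross-entropy through a sigmoid the
    # output delta is (out - z); the hidden delta backpropagates it
    # through W2 and the sigmoid derivative n_out * (1 - n_out).
    delta2 = out - z
    delta1 = (W2[:, :2].T @ delta2) * hidden * (1.0 - hidden)
    # Gradient-descent update for every weight.
    W2 -= eta * np.outer(delta2, h)
    W1 -= eta * np.outer(delta1, x)

# Toy task: z1 = XOR(x, y), z2 = AND(x, y). With only two hidden units
# this can stall in a local minimum; rerun with another seed if so.
data = [((0, 0), (0, 0)), ((0, 1), (1, 0)), ((1, 0), (1, 0)), ((1, 1), (0, 1))]
for _ in range(5000):
    for x_in, z in data:
        train_step(np.array(x_in, dtype=float), np.array(z, dtype=float))

for x_in, _ in data:
    x = np.append(np.array(x_in, dtype=float), 1.0)
    out = sigmoid(W2 @ np.append(sigmoid(W1 @ x), 1.0))
    print(x_in, np.round(out, 2))
```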

Vector Space of Semantics

Distributional Semantics

•  The meaning of a word can be inferred from its context.

The dog run. A cat run. A dog sleep. The cat sleep. A dog bark. The cat meows.

Because dog and cat appear in similar contexts, the meanings of dog and cat are similar.

Semantic Vectors

The dog run. A cat run. A dog sleep. The cat sleep. A dog bark. The cat meows.

       the   a   run   sleep   bark   meow
dog     1    2    2      2      1      0
cat     2    1    2      2      0      1
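The counting can be done mechanically. A sketch over the toy corpus ("meows" normalized to "meow"; counts computed from the corpus as written may differ slightly from the table above):

```python
from collections import Counter

sentences = [
    "the dog run", "a cat run", "a dog sleep",
    "the cat sleep", "a dog bark", "the cat meow",
]

targets = ["dog", "cat"]
vectors = {t: Counter() for t in targets}
for sentence in sentences:
    words = sentence.split()
    for t in targets:
        if t in words:
            # Count every other word in the sentence as context.
            vectors[t].update(w for w in words if w != t)

print(vectors["dog"])  # context counts for dog
print(vectors["cat"])  # context counts for cat
```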

Semantic Vectors

dog (1, 2, ..., xn)
cat (2, 1, ..., xn)
car (0, 0, ..., xn)

Cosine Similarity

•  The cosine similarity between A and B is \( \frac{A \cdot B}{|A||B|} \).

dog \( (a_1, a_2, ..., a_n) \)
cat \( (b_1, b_2, ..., b_n) \)

Cosine similarity between dog and cat:

\[ \frac{a_1 b_1 + a_2 b_2 + \cdots + a_n b_n}{\sqrt{a_1^2 + a_2^2 + \cdots + a_n^2}\,\sqrt{b_1^2 + b_2^2 + \cdots + b_n^2}} \]
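The same formula in Python, applied to the count vectors from the table above:

```python
import math

def cosine_similarity(a, b):
    """A.B / (|A||B|): the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Count vectors (the, a, run, sleep, bark, meow) from the earlier table.
dog = [1, 2, 2, 2, 1, 0]
cat = [2, 1, 2, 2, 0, 1]
print(cosine_similarity(dog, cat))  # ~0.86: dog and cat are similar
```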

Operations on Vectors

Woman + King - Man = Queen

[Diagram: the offset King - Man, added to Woman, lands at Queen.]
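The analogy is plain vector arithmetic; a sketch with made-up 2-D vectors (illustrative values, not trained embeddings):

```python
import math

# Made-up 2-D embeddings, chosen so the analogy works out.
man, woman = [1.0, 1.0], [1.0, 2.0]
king, queen = [3.0, 1.1], [3.0, 2.1]

target = [w + k - m for w, k, m in zip(woman, king, man)]
print(target)  # [3.0, 2.1] (up to rounding): the queen vector
print(all(math.isclose(t, q) for t, q in zip(target, queen)))  # True
```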

Neural Language Models (word2vec)

Dimension is too LARGE

dog → (x1 = the, x2 = a, ..., xn)

The dimension of a semantic vector is equal to the size of the vocabulary.

Compressed Vectors

dog → one-hot encoding (1, 0, 0, 0) → neural network → compressed vector (1.2, 0.7, 0.5)

One-Hot Encoding

With the vocabulary (dog, cat, run, fly), dog is encoded as (1, 0, 0, 0).

Initialize Weights

[Diagram: one-hot input layer (dog, cat, run, fly) → compressed layer → output layer (dog, cat, run, fly).]

\[ W = \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \\ w_{41} & w_{42} & w_{43} \end{bmatrix} \qquad V = \begin{bmatrix} v_{11} & v_{12} & v_{13} \\ v_{21} & v_{22} & v_{23} \\ v_{31} & v_{32} & v_{33} \\ v_{41} & v_{42} & v_{43} \end{bmatrix} \]

Compressed Vectors

dog: the high-dimensional one-hot vector maps to the low-dimensional compressed vector (v11, v12, v13), the dog row of V.

Compressed Vectors

dog → (v11, v12, v13)
cat → (v21, v22, v23)
run → (v31, v32, v33)
fly → (v41, v42, v43)

Context Word

The input word dog (one-hot) selects its compressed vector \( V_1 = (v_{11}, v_{12}, v_{13}) \); the context word run is scored with the weights \( W_3 = (w_{31}, w_{32}, w_{33}) \):

\[ V_1 \cdot W_3 = v_{11} w_{31} + v_{12} w_{32} + v_{13} w_{33} \]

Training pushes the score of this context pair toward 1:

\[ \frac{1}{1 + e^{-V_1 \cdot W_3}} \approx 1 \]

Context Word

Likewise for cat with the context word run:

\[ V_2 \cdot W_3 = v_{21} w_{31} + v_{22} w_{32} + v_{23} w_{33} \qquad \frac{1}{1 + e^{-V_2 \cdot W_3}} \approx 1 \]

Non-context Word

For dog with the non-context word fly, training pushes the score toward 0:

\[ V_1 \cdot W_4 = v_{11} w_{41} + v_{12} w_{42} + v_{13} w_{43} \qquad \frac{1}{1 + e^{-V_1 \cdot W_4}} \approx 0 \]

Non-context Word

Likewise for cat with fly:

\[ V_2 \cdot W_4 = v_{21} w_{41} + v_{22} w_{42} + v_{23} w_{43} \qquad \frac{1}{1 + e^{-V_2 \cdot W_4}} \approx 0 \]
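Putting the last four slides together: a toy sketch of the update rule, pushing sigmoid(V_i · W_j) toward 1 for context pairs and toward 0 for non-context pairs (learning rate and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["dog", "cat", "run", "fly"]
idx = {w: i for i, w in enumerate(vocab)}

# Rows of V are the compressed word vectors; rows of W score each word
# as a context word. Three dimensions, as in the slides.
V = rng.uniform(-0.5, 0.5, size=(len(vocab), 3))
W = rng.uniform(-0.5, 0.5, size=(len(vocab), 3))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_pair(word, context, label, eta=0.1):
    """Push sigmoid(V_word . W_context) toward the label: 1 for a real
    context pair, 0 for a non-context pair. The gradient of
    cross-entropy through the sigmoid is (prediction - label)."""
    i, j = idx[word], idx[context]
    g = sigmoid(V[i] @ W[j]) - label
    v_old = V[i].copy()            # update both with the old values
    V[i] -= eta * g * W[j]
    W[j] -= eta * g * v_old

# Context pairs (dog-run, cat-run) and non-context pairs (dog-fly, cat-fly).
for _ in range(500):
    train_pair("dog", "run", 1)
    train_pair("cat", "run", 1)
    train_pair("dog", "fly", 0)
    train_pair("cat", "fly", 0)

print(sigmoid(V[idx["dog"]] @ W[idx["run"]]))  # close to 1
print(sigmoid(V[idx["dog"]] @ W[idx["fly"]]))  # close to 0
# dog and cat share contexts, so their rows of V should end up similar:
v_dog, v_cat = V[idx["dog"]], V[idx["cat"]]
print(v_dog @ v_cat / (np.linalg.norm(v_dog) * np.linalg.norm(v_cat)))
```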

Result

After training, each word's row of V is its compressed vector:

dog → (v11, v12, v13)
cat → (v21, v22, v23)
run → (v31, v32, v33)
fly → (v41, v42, v43)

Further Reading

•  Logistic Regression 3D
   –  http://cpmarkchang.logdown.com/posts/189069-logisti-regression-model
•  Overfitting and Regularization
   –  http://cpmarkchang.logdown.com/posts/193261-machine-learning-overfitting-and-regularization
•  Model Selection
   –  http://cpmarkchang.logdown.com/posts/193914-machine-learning-model-selection
•  Neural Network Back Propagation
   –  http://cpmarkchang.logdown.com/posts/277349-neural-network-backward-propagation
•  Neural Probabilistic Language Model
   –  http://cpmarkchang.logdown.com/posts/255785-neural-network-neural-probabilistic-language-model
   –  http://cpmarkchang.logdown.com/posts/276263--hierarchical-probabilistic-neural-networks-neural-network-language-model
•  Word2vec
   –  http://arxiv.org/pdf/1301.3781.pdf
   –  http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
   –  http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf