
Computational Linguistics Week 5

Neural Networks and Neural Language Models

By Mark Chang

Outline

•  Machine Learning
•  Neural Networks
•  Training Neural Networks
•  Vector Space of Semantics
•  Neural Language Models (word2vec)

Machine Learning

[Diagram: training data flows into the machine learning model, which produces an output; the output is compared with the answer, and the error is fed back to update the model. After training, the model is applied to testing data to produce output.]

Machine Learning

•  Training data: \( X, Y \), with examples \( x^{(i)}, y^{(i)} \)
•  Model: \( h \), with parameters \( w \)
•  Output: \( h(X) \)
•  Answer: \( Y \)
•  Cost function: \( E(h(X), Y) \), fed back to adjust \( w \)

Logistic Regression

Training data:

X             Y
-0.47241379   0
-0.35344828   0
-0.30148276   0
 0.33448276   1
 0.35344828   1
 0.37241379   1
 0.39137931   1
 0.41034483   1
 0.44931034   1
 0.49827586   1
 0.51724138   1
 ...          ...

Model

Sigmoid function:

\[ h(x) = \frac{1}{1 + e^{-(w_0 + w_1 x)}} \]

\( w_0 + w_1 x < 0 \Rightarrow h(x) \approx 0 \)
\( w_0 + w_1 x > 0 \Rightarrow h(x) \approx 1 \)
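A minimal Python sketch of this model (the weight values here are illustrative, not taken from the slides):

```python
import math

def h(x, w0, w1):
    """Sigmoid model: squashes w0 + w1*x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))

# With illustrative weights w0 = -8, w1 = 25, the decision boundary sits
# at x = 8/25 = 0.32: inputs below it give h(x) near 0, above near 1.
print(h(-0.47, -8.0, 25.0))  # close to 0
print(h(0.52, -8.0, 25.0))   # ~0.99
```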

Cost Function

•  Cross Entropy

\[ E(h(X), Y) = -\frac{1}{m} \sum_{i=1}^{m} \Big( y^{(i)} \log\big(h(x^{(i)})\big) + \big(1 - y^{(i)}\big) \log\big(1 - h(x^{(i)})\big) \Big) \]

If \( y^{(i)} = 1 \): \( E(h(x^{(i)}), y^{(i)}) = -\log\big(h(x^{(i)})\big) \)

  \( h(x^{(i)}) \approx 0 \Rightarrow E(h(x^{(i)}), y^{(i)}) \to \infty \)
  \( h(x^{(i)}) \approx 1 \Rightarrow E(h(x^{(i)}), y^{(i)}) \approx 0 \)

If \( y^{(i)} = 0 \): \( E(h(x^{(i)}), y^{(i)}) = -\log\big(1 - h(x^{(i)})\big) \)

  \( h(x^{(i)}) \approx 0 \Rightarrow E(h(x^{(i)}), y^{(i)}) \approx 0 \)
  \( h(x^{(i)}) \approx 1 \Rightarrow E(h(x^{(i)}), y^{(i)}) \to \infty \)

Cost Function

•  Cross Entropy

\[ E(h(X), Y) = -\frac{1}{m} \sum_{i=1}^{m} \Big( y^{(i)} \log\big(h(x^{(i)})\big) + \big(1 - y^{(i)}\big) \log\big(1 - h(x^{(i)})\big) \Big) \]

\( h(x^{(i)}) \approx 0 \) and \( y^{(i)} = 0 \Rightarrow E(h(X), Y) \approx 0 \)
\( h(x^{(i)}) \approx 1 \) and \( y^{(i)} = 1 \Rightarrow E(h(X), Y) \approx 0 \)
\( h(x^{(i)}) \approx 0 \) and \( y^{(i)} = 1 \Rightarrow E(h(X), Y) \to \infty \)
\( h(x^{(i)}) \approx 1 \) and \( y^{(i)} = 0 \Rightarrow E(h(X), Y) \to \infty \)
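The same cost in a few lines of Python (assuming the prediction stays strictly between 0 and 1 so both logarithms are defined):

```python
import math

def cross_entropy(h_out, y):
    """Per-example cost: -(y*log(h) + (1-y)*log(1-h))."""
    return -(y * math.log(h_out) + (1 - y) * math.log(1 - h_out))

print(cross_entropy(0.99, 1))  # ~0.01: confident and correct
print(cross_entropy(0.01, 1))  # ~4.6: confident and wrong, diverging as h_out -> 0
```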

[Figure: error surface over the parameters \( w_0 \) and \( w_1 \).]

Feedback

•  Gradient Descent:

\[ w_0 \leftarrow w_0 - \eta \, \frac{\partial E(h(X), Y)}{\partial w_0} \qquad w_1 \leftarrow w_1 - \eta \, \frac{\partial E(h(X), Y)}{\partial w_1} \]

Each step moves the weights in the direction of the negative gradient \( \big( -\frac{\partial E(h(X), Y)}{\partial w_0}, -\frac{\partial E(h(X), Y)}{\partial w_1} \big) \).
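A toy sketch of the whole training loop, using the standard closed-form gradient of cross-entropy through a sigmoid; the learning rate and iteration count are arbitrary choices:

```python
import math

def h(x, w0, w1):
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))

def gradient_step(xs, ys, w0, w1, eta=0.5):
    """One gradient-descent step. For cross-entropy through a sigmoid,
    dE/dw0 = mean(h(x) - y) and dE/dw1 = mean((h(x) - y) * x)."""
    m = len(xs)
    grad_w0 = sum(h(x, w0, w1) - y for x, y in zip(xs, ys)) / m
    grad_w1 = sum((h(x, w0, w1) - y) * x for x, y in zip(xs, ys)) / m
    return w0 - eta * grad_w0, w1 - eta * grad_w1

# A few points from the training data above.
xs = [-0.47241379, -0.35344828, -0.30148276, 0.33448276, 0.51724138]
ys = [0, 0, 0, 1, 1]
w0, w1 = 0.0, 0.0
for _ in range(2000):
    w0, w1 = gradient_step(xs, ys, w0, w1)
print([round(h(x, w0, w1), 2) for x in xs])  # predictions approach the labels
```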


Neural Networks

Neurons & Action Potential

http://humanphisiology.wikispaces.com/file/view/neuron.png/216460814/neuron.png

http://upload.wikimedia.org/wikipedia/commons/thumb/4/4a/Action_potential.svg/1037px-Action_potential.svg.png

Synapse

http://www.quia.com/files/quia/users/lmcgee/Systems/endocrine-nervous/synapse.gif

Artificial Neurons

[Diagram: inputs x1 and x2 with weights w1 and w2, a bias input b with weight wb, a neuron n, and output y.]

\[ n_{in} = w_1 x_1 + w_2 x_2 + w_b \qquad n_{out} = \frac{1}{1 + e^{-n_{in}}} \]

\[ y = \frac{1}{1 + e^{-(w_1 x_1 + w_2 x_2 + w_b)}} \]

In the (x1, x2) plane, the line \( w_1 x_1 + w_2 x_2 + w_b = 0 \) is the decision boundary, where \( n_{out} = 0.5 \):

\( w_1 x_1 + w_2 x_2 + w_b > 0 \Rightarrow n_{out} \approx 1 \)
\( w_1 x_1 + w_2 x_2 + w_b < 0 \Rightarrow n_{out} \approx 0 \)

Binary Classification: AND Gate

x1  x2  y
0   0   0
0   1   0
1   0   0
1   1   1

[Diagram: neuron n with inputs x1, x2 (weights 20, 20) and bias b (weight -30); only the point (1,1) falls on the positive side of the boundary.]

\[ y = \frac{1}{1 + e^{-(20 x_1 + 20 x_2 - 30)}} \]

Decision boundary: \( 20 x_1 + 20 x_2 - 30 = 0 \)
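The gate can be checked directly; a short Python sketch using exactly the weights from the slide:

```python
import math

def neuron(x1, x2, w1, w2, wb):
    """Single sigmoid neuron: y = 1 / (1 + e^-(w1*x1 + w2*x2 + wb))."""
    return 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + wb)))

# Weights from the slide: w1 = w2 = 20, bias weight -30.
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(neuron(x1, x2, 20, 20, -30)))
# Prints the AND truth table: 0, 0, 0, 1. Changing the bias weight to
# -10 gives the OR gate on the next slide.
```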

Binary Classification: OR Gate

x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   1

[Diagram: the same neuron with weights 20, 20 and bias weight -10; every point except (0,0) falls on the positive side.]

\[ y = \frac{1}{1 + e^{-(20 x_1 + 20 x_2 - 10)}} \]

Decision boundary: \( 20 x_1 + 20 x_2 - 10 = 0 \)

XOR Gate?

x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0

[Plot: (0,0) and (1,1) belong to class 0, while (0,1) and (1,0) belong to class 1.]

No single line separates the two classes, so one neuron cannot compute XOR.

Binary Classification: XOR Gate

[Diagram: hidden neuron n1 with weights 20, 20 and bias weight -30 (an AND gate); hidden neuron n2 with weights 20, 20 and bias weight -10 (an OR gate); output neuron with weight -20 on n1, weight 20 on n2, and bias weight -10.]

\[ y = \frac{1}{1 + e^{-(-20 n_1 + 20 n_2 - 10)}} \]

x1  x2  n1  n2  y
0   0   0   0   0
0   1   0   1   1
1   0   0   1   1
1   1   1   1   0
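The two-layer network can be verified the same way, again with the slide's weights:

```python
import math

def neuron(x1, x2, w1, w2, wb):
    return 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + wb)))

def xor(x1, x2):
    """Two-layer network from the slide: an AND neuron and an OR neuron
    feed an output neuron that fires for OR-but-not-AND."""
    n1 = neuron(x1, x2, 20, 20, -30)   # AND
    n2 = neuron(x1, x2, 20, 20, -10)   # OR
    return neuron(n1, n2, -20, 20, -10)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(xor(x1, x2)))
# Prints the XOR truth table: 0, 1, 1, 0.
```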

Neural Networks

[Diagram: input layer (x, y, and bias b), hidden layer (n11, n12, and bias b), output layer (n21, n22) with targets z1, z2. Weights W11,x, W11,y, W11,b, W12,x, W12,y, W12,b feed the hidden layer; weights W21,11, W21,12, W21,b, W22,11, W22,12, W22,b feed the output layer.]

Visual  Pathway

http://www.nature.com/neuro/journal/v8/n8/images/nn0805-975-F1.jpg  

Training Neural Networks

[Diagram: training data → neural networks → output, compared with the answer.]

Initialization → Forward Propagation → Error Function → Backward Propagation

Initialization

•  Randomly sample each weight W from the interval -N to N.

[Diagram: the same 2-2-2 network, with every weight drawn at random.]

Forward Propagation

Each neuron computes \( n_{in} \) as the weighted sum of its inputs plus its bias weight, and passes it through the sigmoid: \( n_{out} = \frac{1}{1 + e^{-n_{in}}} \). This is applied layer by layer, from input to output.

Error Function

\[ J = -\big( z_1 \log(n_{21(out)}) + (1 - z_1) \log(1 - n_{21(out)}) \big) - \big( z_2 \log(n_{22(out)}) + (1 - z_2) \log(1 - n_{22(out)}) \big) \]

\( n_{out} \approx 0 \) and \( z = 0 \Rightarrow J \approx 0 \)
\( n_{out} \approx 1 \) and \( z = 1 \Rightarrow J \approx 0 \)
\( n_{out} \approx 0 \) and \( z = 1 \Rightarrow J \to \infty \)
\( n_{out} \approx 1 \) and \( z = 0 \Rightarrow J \to \infty \)

[Figure: error surface over the weights \( w_0 \) and \( w_1 \).]

Gradient Descent

Every weight is updated along its negative gradient:

\[ w \leftarrow w - \eta \, \frac{\partial J}{\partial w} \quad \text{for each } w \in \{ w_{11,x},\, w_{11,y},\, w_{11,b},\, w_{12,x},\, w_{12,y},\, w_{12,b},\, w_{21,11},\, w_{21,12},\, w_{21,b},\, w_{22,11},\, w_{22,12},\, w_{22,b} \} \]

On the error surface, each step moves in the direction \( \big( -\frac{\partial J}{\partial w_0}, -\frac{\partial J}{\partial w_1} \big) \).

Backward Propagation

http://cpmarkchang.logdown.com/posts/277349-neural-network-backward-propagation  
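A compact numpy sketch of the training loop described above, for the same 2-2-2 network; the learning rate, iteration count, and the toy task (learning XOR and AND jointly) are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# 2-2-2 network: inputs (x, y), hidden n11/n12, outputs n21/n22 with
# targets (z1, z2). The last column of each matrix is the bias weight.
W1 = rng.uniform(-1.0, 1.0, size=(2, 3))  # hidden-layer weights
W2 = rng.uniform(-1.0, 1.0, size=(2, 3))  # output-layer weights

def train_step(x_in, z, eta=0.5):
    global W1, W2
    # Forward propagation.
    x = np.append(x_in, 1.0)             # append the bias input
    hidden = sigmoid(W1 @ x)             # n11, n12
    h = np.append(hidden, 1.0)
    out = sigmoid(W2 @ h)                # n21, n22
    # Backward propagation: for cross-entropy through a sigmoid the
    # output delta is (out - z); the hidden delta backpropagates it
    # through W2 and the sigmoid derivative n_out * (1 - n_out).
    delta2 = out - z
    delta1 = (W2[:, :2].T @ delta2) * hidden * (1.0 - hidden)
    # Gradient-descent update for every weight.
    W2 -= eta * np.outer(delta2, h)
    W1 -= eta * np.outer(delta1, x)

# Toy task: z1 = XOR(x, y), z2 = AND(x, y). With only two hidden units
# this can stall in a local minimum; rerun with another seed if so.
data = [((0, 0), (0, 0)), ((0, 1), (1, 0)), ((1, 0), (1, 0)), ((1, 1), (0, 1))]
for _ in range(5000):
    for x_in, z in data:
        train_step(np.array(x_in, dtype=float), np.array(z, dtype=float))

for x_in, _ in data:
    x = np.append(np.array(x_in, dtype=float), 1.0)
    out = sigmoid(W2 @ np.append(sigmoid(W1 @ x), 1.0))
    print(x_in, np.round(out, 2))
```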

Vector Space of Semantics

Distributional Semantics

•  The meaning of a word can be inferred from its context.

The dog run. A cat run. A dog sleep. The cat sleep. A dog bark. The cat meows.

Because dog and cat appear in similar contexts, the meanings of dog and cat are similar.

Semantic Vectors

The dog run. A cat run. A dog sleep. The cat sleep. A dog bark. The cat meows.

       the   a   run   sleep   bark   meow
dog     1    2    2      2      1      0
cat     2    1    2      2      0      1
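The counting can be done mechanically. A sketch over the toy corpus ("meows" normalized to "meow"; counts computed from the corpus as written may differ slightly from the table above):

```python
from collections import Counter

sentences = [
    "the dog run", "a cat run", "a dog sleep",
    "the cat sleep", "a dog bark", "the cat meow",
]

targets = ["dog", "cat"]
vectors = {t: Counter() for t in targets}
for sentence in sentences:
    words = sentence.split()
    for t in targets:
        if t in words:
            # Count every other word in the sentence as context.
            vectors[t].update(w for w in words if w != t)

print(vectors["dog"])  # context counts for dog
print(vectors["cat"])  # context counts for cat
```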

Semantic Vectors

dog (1, 2, ..., xn)
cat (2, 1, ..., xn)
car (0, 0, ..., xn)

Cosine Similarity

•  The cosine similarity between A and B is \( \frac{A \cdot B}{|A||B|} \).

dog \( (a_1, a_2, ..., a_n) \)
cat \( (b_1, b_2, ..., b_n) \)

Cosine similarity between dog and cat:

\[ \frac{a_1 b_1 + a_2 b_2 + \cdots + a_n b_n}{\sqrt{a_1^2 + a_2^2 + \cdots + a_n^2}\,\sqrt{b_1^2 + b_2^2 + \cdots + b_n^2}} \]
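The same formula in Python, applied to the count vectors from the table above:

```python
import math

def cosine_similarity(a, b):
    """A.B / (|A||B|): the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Count vectors (the, a, run, sleep, bark, meow) from the earlier table.
dog = [1, 2, 2, 2, 1, 0]
cat = [2, 1, 2, 2, 0, 1]
print(cosine_similarity(dog, cat))  # ~0.86: dog and cat are similar
```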

Operations on Vectors

Woman + King - Man = Queen

[Diagram: the offset King - Man, added to Woman, lands at Queen.]
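The analogy is plain vector arithmetic; a sketch with made-up 2-D vectors (illustrative values, not trained embeddings):

```python
import math

# Made-up 2-D embeddings, chosen so the analogy works out.
man, woman = [1.0, 1.0], [1.0, 2.0]
king, queen = [3.0, 1.1], [3.0, 2.1]

target = [w + k - m for w, k, m in zip(woman, king, man)]
print(target)  # [3.0, 2.1] (up to rounding): the queen vector
print(all(math.isclose(t, q) for t, q in zip(target, queen)))  # True
```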

Neural Language Models (word2vec)

Dimension is too LARGE

dog → (x1 = the, x2 = a, ..., xn)

The dimension of a semantic vector is equal to the size of the vocabulary.

Compressed Vectors

dog → one-hot encoding (1, 0, 0, 0) → neural network → compressed vector (1.2, 0.7, 0.5)

One-Hot Encoding

With the vocabulary (dog, cat, run, fly), dog is encoded as (1, 0, 0, 0).

Initialize Weights

[Diagram: one-hot input layer (dog, cat, run, fly) → compressed layer → output layer (dog, cat, run, fly).]

\[ W = \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \\ w_{41} & w_{42} & w_{43} \end{bmatrix} \qquad V = \begin{bmatrix} v_{11} & v_{12} & v_{13} \\ v_{21} & v_{22} & v_{23} \\ v_{31} & v_{32} & v_{33} \\ v_{41} & v_{42} & v_{43} \end{bmatrix} \]

Compressed Vectors

dog: the high-dimensional one-hot vector maps to the low-dimensional compressed vector (v11, v12, v13), the dog row of V.

Compressed Vectors

dog → (v11, v12, v13)
cat → (v21, v22, v23)
run → (v31, v32, v33)
fly → (v41, v42, v43)

Context Word

The input word dog (one-hot) selects its compressed vector \( V_1 = (v_{11}, v_{12}, v_{13}) \); the context word run is scored with the weights \( W_3 = (w_{31}, w_{32}, w_{33}) \):

\[ V_1 \cdot W_3 = v_{11} w_{31} + v_{12} w_{32} + v_{13} w_{33} \]

Training pushes the score of this context pair toward 1:

\[ \frac{1}{1 + e^{-V_1 \cdot W_3}} \approx 1 \]

Context Word

Likewise for cat with the context word run:

\[ V_2 \cdot W_3 = v_{21} w_{31} + v_{22} w_{32} + v_{23} w_{33} \qquad \frac{1}{1 + e^{-V_2 \cdot W_3}} \approx 1 \]

Non-context Word

For dog with the non-context word fly, training pushes the score toward 0:

\[ V_1 \cdot W_4 = v_{11} w_{41} + v_{12} w_{42} + v_{13} w_{43} \qquad \frac{1}{1 + e^{-V_1 \cdot W_4}} \approx 0 \]

Non-context Word

Likewise for cat with fly:

\[ V_2 \cdot W_4 = v_{21} w_{41} + v_{22} w_{42} + v_{23} w_{43} \qquad \frac{1}{1 + e^{-V_2 \cdot W_4}} \approx 0 \]
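Putting the last four slides together: a toy sketch of the update rule, pushing sigmoid(V_i · W_j) toward 1 for context pairs and toward 0 for non-context pairs (learning rate and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["dog", "cat", "run", "fly"]
idx = {w: i for i, w in enumerate(vocab)}

# Rows of V are the compressed word vectors; rows of W score each word
# as a context word. Three dimensions, as in the slides.
V = rng.uniform(-0.5, 0.5, size=(len(vocab), 3))
W = rng.uniform(-0.5, 0.5, size=(len(vocab), 3))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_pair(word, context, label, eta=0.1):
    """Push sigmoid(V_word . W_context) toward the label: 1 for a real
    context pair, 0 for a non-context pair. The gradient of
    cross-entropy through the sigmoid is (prediction - label)."""
    i, j = idx[word], idx[context]
    g = sigmoid(V[i] @ W[j]) - label
    v_old = V[i].copy()            # update both with the old values
    V[i] -= eta * g * W[j]
    W[j] -= eta * g * v_old

# Context pairs (dog-run, cat-run) and non-context pairs (dog-fly, cat-fly).
for _ in range(500):
    train_pair("dog", "run", 1)
    train_pair("cat", "run", 1)
    train_pair("dog", "fly", 0)
    train_pair("cat", "fly", 0)

print(sigmoid(V[idx["dog"]] @ W[idx["run"]]))  # close to 1
print(sigmoid(V[idx["dog"]] @ W[idx["fly"]]))  # close to 0
# dog and cat share contexts, so their rows of V should end up similar:
v_dog, v_cat = V[idx["dog"]], V[idx["cat"]]
print(v_dog @ v_cat / (np.linalg.norm(v_dog) * np.linalg.norm(v_cat)))
```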

Result

After training, each word's row of V is its compressed vector:

dog → (v11, v12, v13)
cat → (v21, v22, v23)
run → (v31, v32, v33)
fly → (v41, v42, v43)

Further Reading

•  Logistic Regression 3D
   –  http://cpmarkchang.logdown.com/posts/189069-logisti-regression-model
•  Overfitting and Regularization
   –  http://cpmarkchang.logdown.com/posts/193261-machine-learning-overfitting-and-regularization
•  Model Selection
   –  http://cpmarkchang.logdown.com/posts/193914-machine-learning-model-selection
•  Neural Network Back Propagation
   –  http://cpmarkchang.logdown.com/posts/277349-neural-network-backward-propagation
•  Neural Probabilistic Language Model
   –  http://cpmarkchang.logdown.com/posts/255785-neural-network-neural-probabilistic-language-model
   –  http://cpmarkchang.logdown.com/posts/276263--hierarchical-probabilistic-neural-networks-neural-network-language-model
•  Word2vec
   –  http://arxiv.org/pdf/1301.3781.pdf
   –  http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
   –  http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf