Deep Learning

Advanced Machine Learning for NLP
Jordan Boyd-Graber
MATHEMATICAL DESCRIPTION

Learn the features and the function

$a^{(2)}_1 = f\big(W^{(1)}_{11} x_1 + W^{(1)}_{12} x_2 + W^{(1)}_{13} x_3 + b^{(1)}_1\big)$

$a^{(2)}_2 = f\big(W^{(1)}_{21} x_1 + W^{(1)}_{22} x_2 + W^{(1)}_{23} x_3 + b^{(1)}_2\big)$

$a^{(2)}_3 = f\big(W^{(1)}_{31} x_1 + W^{(1)}_{32} x_2 + W^{(1)}_{33} x_3 + b^{(1)}_3\big)$

$h_{W,b}(x) = a^{(3)}_1 = f\big(W^{(2)}_{11} a^{(2)}_1 + W^{(2)}_{12} a^{(2)}_2 + W^{(2)}_{13} a^{(2)}_3 + b^{(2)}_1\big)$
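To make this concrete, here is a minimal NumPy sketch of the forward pass above for a 3-3-1 network. Treating $f$ as the sigmoid matches the later slides; the weight scale, the random seed, and every variable name are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def f(z):
    # Sigmoid activation; the later slides refer to "the input to the sigmoid".
    return 1.0 / (1.0 + np.exp(-z))

# A 3-3-1 network matching the slide's indices.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.01, size=(3, 3))   # W^(1)_{ij}: hidden unit i, input j
b1 = np.zeros(3)                           # b^(1)_i
W2 = rng.normal(scale=0.01, size=(1, 3))   # W^(2)_{1i}
b2 = np.zeros(1)                           # b^(2)_1

x = np.array([0.5, -1.0, 2.0])

a2 = f(W1 @ x + b1)    # a^(2)_i = f(sum_j W^(1)_{ij} x_j + b^(1)_i)
h = f(W2 @ a2 + b2)    # h_{W,b}(x) = a^(3)_1
print(h)
```

Stacking the three hidden-unit rows into the single matrix product `W1 @ x` computes exactly the sum over $j$ in each of the equations above.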

Objective Function

• For every example $(x, y)$ of our supervised training set, we want the label $y$ to match the prediction $h_{W,b}(x)$:

$J(W,b;x,y) \equiv \frac{1}{2} \lVert h_{W,b}(x) - y \rVert^2$ (1)

• We want this value, summed over all of the examples, to be as small as possible.

• We also want the weights not to be too large:

$\frac{\lambda}{2} \sum_{l=1}^{n_l-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(W^{(l)}_{ji}\big)^2$ (2)

The three sums run over all layers ($l$), all sources ($i$), and all destinations ($j$).
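As a sketch, equations (1) and (2) translate directly into a few lines of NumPy; the function names and toy values here are hypothetical, not from the slides.

```python
import numpy as np

def per_example_loss(h, y):
    # Equation (1): J(W,b; x,y) = (1/2) ||h_{W,b}(x) - y||^2
    return 0.5 * np.sum((h - y) ** 2)

def weight_penalty(weights, lam):
    # Equation (2): (lambda/2) * sum over layers l, sources i, and
    # destinations j of (W^(l)_{ji})^2. Biases are not penalized.
    return 0.5 * lam * sum(np.sum(W ** 2) for W in weights)

# Toy values: one prediction, one label, two weight matrices.
h, y = np.array([0.7]), np.array([1.0])
W1, W2 = np.ones((3, 3)), np.ones((1, 3))
print(per_example_loss(h, y), weight_penalty([W1, W2], lam=1e-3))
```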

Objective Function

Putting it all together:

$J(W,b) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \lVert h_{W,b}(x^{(i)}) - y^{(i)} \rVert^2 + \frac{\lambda}{2} \sum_{l=1}^{n_l-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(W^{(l)}_{ji}\big)^2$ (3)

• Our goal is to minimize $J(W,b)$ as a function of $W$ and $b$.

• Initialize $W$ and $b$ to small random values near zero.

• Adjust the parameters to optimize $J$.
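A minimal sketch of equation (3), assuming sigmoid activations and the same hypothetical 3-3-1 shapes as before; the names and data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W, b, x):
    # Feed x through each layer: a^(l+1) = f(W^(l) a^(l) + b^(l)).
    a = x
    for Wl, bl in zip(W, b):
        a = sigmoid(Wl @ a + bl)
    return a

def J(W, b, X, Y, lam):
    # Equation (3): mean per-example squared error plus the weight penalty.
    m = len(X)
    data = sum(0.5 * np.sum((forward(W, b, x) - y) ** 2)
               for x, y in zip(X, Y)) / m
    reg = 0.5 * lam * sum(np.sum(Wl ** 2) for Wl in W)
    return data + reg

# Small random values near zero, as the slide prescribes.
rng = np.random.default_rng(0)
W = [rng.normal(scale=0.01, size=(3, 3)), rng.normal(scale=0.01, size=(1, 3))]
b = [np.zeros(3), np.zeros(1)]
print(J(W, b, X=[np.array([0.5, -1.0, 2.0])], Y=[np.array([1.0])], lam=1e-3))
```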

Deep Learning from Data

Outline

Gradient Descent

Goal: Optimize $J$ with respect to the variables $W$ and $b$.

[Figure: the objective plotted against a parameter. Gradient descent moves downhill from "start" to "stop" at a local minimum, leaving the "undiscovered country" beyond it unexplored.]
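To make the picture concrete, here is a one-dimensional sketch of gradient descent; the objective, step size, and stopping rule are assumptions chosen to mirror the figure, not anything specified in the lecture.

```python
def grad_descent(df, theta, alpha=0.01, tol=1e-10, max_iter=100_000):
    # Follow the negative gradient until the parameter stops moving.
    for _ in range(max_iter):
        step = alpha * df(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# The objective f(t) = t^4 - 2 t^2 + 0.3 t has a shallow minimum near
# t = 0.96 and a deeper one near t = -1.03. Starting at t = 2, descent
# stops in the nearby basin: the deeper minimum stays "undiscovered country".
df = lambda t: 4 * t**3 - 4 * t + 0.3     # derivative of the objective
print(grad_descent(df, theta=2.0))        # ~0.96, not the global minimum
```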

Backpropagation

• For convenience, write the input to the sigmoid as

$z^{(l)}_i = \sum_{j=1}^{n} W^{(l-1)}_{ij} x_j + b^{(l-1)}_i$ (4)

• The gradient is a function of a node's error $\delta^{(l)}_i$.

• For output nodes, the error is obvious:

$\delta^{(n_l)}_i = \frac{\partial}{\partial z^{(n_l)}_i} \frac{1}{2} \lVert y - h_{W,b}(x) \rVert^2 = -\big(y_i - a^{(n_l)}_i\big) \, f'\big(z^{(n_l)}_i\big)$ (5)

• Other nodes must "backpropagate" downstream error based on connection strength (the chain rule):

$\delta^{(l)}_i = \Big( \sum_{j=1}^{s_{l+1}} W^{(l+1)}_{ji} \, \delta^{(l+1)}_j \Big) f'\big(z^{(l)}_i\big)$ (6)
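A sketch of equations (5) and (6), assuming sigmoid activations (so $f'(z) = f(z)(1 - f(z))$) and the hypothetical 3-3-1 shapes from earlier; deltas are computed output-first and then pushed backward.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_deltas(W, zs, activations, y):
    # Equation (5): output error delta^(nl) = -(y - a^(nl)) * f'(z^(nl)).
    deltas = [-(y - activations[-1]) * sigmoid_prime(zs[-1])]
    # Equation (6): delta^(l) = (W^(l+1)^T delta^(l+1)) * f'(z^(l)),
    # walking backward through the hidden layers.
    for l in range(len(W) - 1, 0, -1):
        deltas.insert(0, (W[l].T @ deltas[0]) * sigmoid_prime(zs[l - 1]))
    return deltas

# Example: forward pass for a 3-3-1 net, then the deltas.
rng = np.random.default_rng(0)
W = [rng.normal(scale=0.01, size=(3, 3)), rng.normal(scale=0.01, size=(1, 3))]
b = [np.zeros(3), np.zeros(1)]
x, y = np.array([0.5, -1.0, 2.0]), np.array([1.0])
zs, activations = [], [x]
for Wl, bl in zip(W, b):
    zs.append(Wl @ activations[-1] + bl)
    activations.append(sigmoid(zs[-1]))
print(backprop_deltas(W, zs, activations, y))
```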

Partial Derivatives

• For the weights, the partial derivatives are

$\frac{\partial}{\partial W^{(l)}_{ij}} J(W,b;x,y) = a^{(l)}_j \, \delta^{(l+1)}_i$ (7)

• For the bias terms, the partial derivatives are

$\frac{\partial}{\partial b^{(l)}_i} J(W,b;x,y) = \delta^{(l+1)}_i$ (8)

• But this is just for a single example . . .
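Given the deltas, equation (7) is an outer product and equation (8) a copy; this sketch consumes activations and deltas in the form produced by the previous sketch, and the names are again hypothetical.

```python
import numpy as np

def gradients(activations, deltas):
    # Equation (7): dJ/dW^(l)_{ij} = a^(l)_j * delta^(l+1)_i, i.e. the outer
    # product of the next layer's error with this layer's activation.
    grad_W = [np.outer(d, a) for a, d in zip(activations[:-1], deltas)]
    # Equation (8): dJ/db^(l)_i = delta^(l+1)_i.
    grad_b = [d.copy() for d in deltas]
    return grad_W, grad_b

# Shapes for the 3-3-1 example: grad_W[0] is 3x3, grad_W[1] is 1x3.
acts = [np.ones(3), np.ones(3), np.ones(1)]
dels = [np.full(3, 0.1), np.full(1, 0.2)]
gW, gb = gradients(acts, dels)
print([g.shape for g in gW], [g.shape for g in gb])
```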

Full Gradient Descent Algorithm

1. Initialize $U^{(l)}$ and $V^{(l)}$ as zero.

2. For each example $i = 1 \dots m$:
   1. Use backpropagation to compute $\nabla_W J$ and $\nabla_b J$.
   2. Update the weight shifts: $U^{(l)} = U^{(l)} + \nabla_{W^{(l)}} J(W,b;x,y)$.
   3. Update the bias shifts: $V^{(l)} = V^{(l)} + \nabla_{b^{(l)}} J(W,b;x,y)$.

3. Update the parameters:

$W^{(l)} = W^{(l)} - \alpha \Big[ \frac{1}{m} U^{(l)} \Big]$ (9)

$b^{(l)} = b^{(l)} - \alpha \Big[ \frac{1}{m} V^{(l)} \Big]$ (10)

4. Repeat until the weights stop changing.
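The four steps combine into one batch loop. This self-contained sketch again assumes sigmoid activations; the step size, tolerance, and data are illustrative, and, like equations (9) and (10) as written on the slide, the update omits any weight-decay term.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(W, b, X, Y, alpha=0.5, tol=1e-6, max_epochs=10_000):
    for _ in range(max_epochs):
        # Step 1: initialize the accumulated shifts U^(l), V^(l) to zero.
        U = [np.zeros_like(Wl) for Wl in W]
        V = [np.zeros_like(bl) for bl in b]
        # Step 2: accumulate per-example gradients over i = 1..m
        # via backpropagation.
        for x, y in zip(X, Y):
            zs, acts = [], [x]
            for Wl, bl in zip(W, b):
                zs.append(Wl @ acts[-1] + bl)
                acts.append(sigmoid(zs[-1]))
            fp = [sigmoid(z) * (1 - sigmoid(z)) for z in zs]   # f'(z^(l))
            deltas = [-(y - acts[-1]) * fp[-1]]                # eq. (5)
            for l in range(len(W) - 1, 0, -1):                 # eq. (6)
                deltas.insert(0, (W[l].T @ deltas[0]) * fp[l - 1])
            for l in range(len(W)):
                U[l] += np.outer(deltas[l], acts[l])           # eq. (7)
                V[l] += deltas[l]                              # eq. (8)
        # Step 3: update the parameters, eqs. (9) and (10).
        m, shift = len(X), 0.0
        for l in range(len(W)):
            dW, db = alpha * U[l] / m, alpha * V[l] / m
            W[l] -= dW
            b[l] -= db
            shift += np.abs(dW).sum() + np.abs(db).sum()
        # Step 4: repeat until the weights stop changing.
        if shift < tol:
            break
    return W, b

# Toy usage: small random initial weights near zero, two training examples.
rng = np.random.default_rng(0)
W = [rng.normal(scale=0.01, size=(3, 3)), rng.normal(scale=0.01, size=(1, 3))]
b = [np.zeros(3), np.zeros(1)]
X = [np.array([0.0, 0.0, 1.0]), np.array([1.0, 1.0, 0.0])]
Y = [np.array([0.0]), np.array([1.0])]
W, b = train(W, b, X, Y)
```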

