
Lecture 17: Neural Networks and Deep Learning

Jack Lanchantin
Dr. Yanjun Qi

UVA CS 6316 / CS 4501-004: Machine Learning
Fall 2016

Neurons
1-Layer Neural Network
Multi-layer Neural Network
Loss Functions
Backpropagation
Nonlinearity Functions
NNs in Practice

[Recurring diagram: a multi-layer network mapping input x = (x1, x2, x3) through weights W1, W2, w3 to output ŷ]

Logistic Regression

P(Y=1|x) = e^(wᵀx + b) / (1 + e^(wᵀx + b))

Sigmoid function (aka logistic, logit, “S”, soft-step):

sigmoid(z) = e^z / (1 + e^z)

Expanded Logistic Regression

[Diagram: input x = (x1, x2, x3) is multiplied by weights (w1, w2, w3), combined with bias b1 in a summing function, and passed through the sigmoid function]

z = wᵀx + b        (wᵀ: 1×p, x: p×1, z: 1×1; here p = 3)
ŷ = sigmoid(z) = P(Y=1|x,w)

“Neuron”

[Same diagram: the weighted sum plus bias followed by the sigmoid is a single neuron]

Neurons

(Figure from http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf)

Neuron

[Diagram: input x = (x1, x2, x3), weights (w1, w2, w3), summing function producing z, sigmoid producing ŷ]

z = wᵀx        (wᵀ: 1×p, x: p×1, z: 1×1)
ŷ = sigmoid(z) = e^z / (1 + e^z)

From here on, we leave out the bias for simplicity.
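To make the computation concrete, here is a minimal NumPy sketch of this single neuron (the numeric values are illustrative, not from the slides):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))   # equivalent to e^z / (1 + e^z)

    x = np.array([0.5, -1.0, 2.0])    # input, shape (p,) with p = 3
    w = np.array([0.1,  0.4, -0.2])   # weights, shape (p,)

    z = w @ x                  # scalar dot product  (wᵀx)
    y_hat = sigmoid(z)         # P(Y=1 | x, w)
    print(z, y_hat)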

“Block View” of a Neuron

[Block diagram: input x → dot product with w (a parameterized block) → z → sigmoid → output ŷ]

z = wᵀx        (wᵀ: 1×p, x: p×1, z: 1×1)
ŷ = sigmoid(z) = e^z / (1 + e^z)

Neuron Representation

[Diagram view and block view of the same neuron: x → (*w) → sigmoid → ŷ]

The linear transformation and the nonlinearity together are typically considered a single neuron.

1-Layer Neural Network


1-Layer Neural Network (with 4 neurons)

[Diagram: input x = (x1, x2, x3) fully connected to d = 4 summing units, each followed by a sigmoid, producing outputs ŷ1 … ŷ4; the weights form a matrix W and the pre-activations a vector z (linear transformation followed by sigmoid = 1 layer)]

z = Wᵀx        (Wᵀ: d×p, x: p×1, z: d×1; here p = 3, d = 4)
ŷ = sigmoid(z) = e^z / (1 + e^z)        (element-wise on vector z; ŷ: d×1)

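A minimal NumPy sketch of this 1-layer network, with p = 3 inputs and d = 4 neurons (weight values are illustrative random draws):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    p, d = 3, 4
    x = rng.normal(size=(p, 1))       # input,   p x 1
    W = rng.normal(size=(p, d))       # weights, p x d

    z = W.T @ x                       # d x 1
    y_hat = sigmoid(z)                # element-wise sigmoid, d x 1
    print(y_hat.shape)                # (4, 1)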

“Block View” of a Neural Network

[Block diagram: input x → dot product with W → z → sigmoid → output ŷ]

z = Wᵀx        (Wᵀ: d×p, x: p×1, z: d×1)
ŷ = sigmoid(z) = e^z / (1 + e^z)        (element-wise; ŷ: d×1)

W is now a matrix, and z is now a vector.

Multi-layer Neural Network


Multi-Layer Neural Network (Multi-Layer Perceptron (MLP) Network)

[Diagram: 2-layer NN — input x = (x1, x2, x3), a hidden layer with weights W1, and an output layer with weights w2; the weight subscript represents the layer number]

Multi-Layer Neural Network (MLP)

[Diagram: 3-layer NN — input x, a 1st hidden layer with weights W1, a 2nd hidden layer with weights W2, and an output layer with weights w3]

Multi-Layer Neural Network (MLP)

[Diagram: x → (W1) → h1 → (W2) → h2 → (w3) → ŷ]

z1 = W1ᵀ x      h1 = sigmoid(z1)      (hidden layer 1 output)
z2 = W2ᵀ h1     h2 = sigmoid(z2)
z3 = w3ᵀ h2     ŷ = sigmoid(z3)
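A minimal NumPy sketch of this forward pass; the hidden-layer sizes are assumptions for illustration:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    p, d1, d2 = 3, 4, 4                  # input size and hidden-layer sizes (assumed)
    x  = rng.normal(size=(p, 1))
    W1 = rng.normal(size=(p, d1))
    W2 = rng.normal(size=(d1, d2))
    w3 = rng.normal(size=(d2, 1))

    z1 = W1.T @ x;  h1 = sigmoid(z1)     # 1st hidden layer
    z2 = W2.T @ h1; h2 = sigmoid(z2)     # 2nd hidden layer
    z3 = w3.T @ h2; y_hat = sigmoid(z3)  # output layer (scalar)
    print(y_hat.item())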

Multi-Class Output MLP

[Diagram: x → (W1) → h1 → (W2) → h2 → (W3) → ŷ, where the output ŷ is now a vector with one entry per class]

z1 = W1ᵀ x      h1 = sigmoid(z1)
z2 = W2ᵀ h1     h2 = sigmoid(z2)
z3 = W3ᵀ h2     ŷ = sigmoid(z3)

“Block View” of MLP

[Block diagram: x → (*W1) → z1 → sigmoid → h1 → (*W2) → z2 → sigmoid → h2 → (*W3) → z3 → sigmoid → y; the three stages are the 1st hidden layer, the 2nd hidden layer, and the output layer]

“Deep” Neural Networks (i.e., > 1 hidden layer)

Researchers have successfully used 1000 layers to train an object classifier.

Loss Functions

[Diagram: the network output ŷ = P(y=1|X,W) is fed into a loss function E(ŷ)]

Binary Classification Loss

[Diagram: network x → ŷ]

E = loss = -log P(Y = y | X = x) = -y log(ŷ) - (1 - y) log(1 - ŷ)

where y is the true output; this example is for a single sample x.
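A minimal sketch of this loss for a single sample (the clipping constant is an implementation detail, not from the slides):

    import numpy as np

    def binary_loss(y, y_hat, eps=1e-12):
        # E = -y log(ŷ) - (1 - y) log(1 - ŷ);  y is the true label in {0, 1}
        y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
        return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

    print(binary_loss(1, 0.9))   # small loss: confident and correct
    print(binary_loss(1, 0.1))   # large loss: confident and wrong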

Regression Loss

[Diagram: network x → ŷ]

E = loss = ½ (y - ŷ)²

where y is the true output.

Multi-Class Classification Loss

[Diagram: network with K = 3 output units ŷ1, ŷ2, ŷ3]

“Softmax” function: a normalizing function which converts each class output to a probability,

ŷi = P(yi = 1 | x)

E = loss = -Σ_{j=1…K} yj ln ŷj

where the true labels yj are “0” for all classes except the true class.
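A minimal sketch of the softmax and the resulting loss for K = 3 (the raw class scores are illustrative):

    import numpy as np

    def softmax(z):
        z = z - np.max(z)              # for numerical stability
        e = np.exp(z)
        return e / e.sum()             # each class output becomes a probability; sums to 1

    z = np.array([2.0, 1.0, -1.0])     # raw class scores (illustrative)
    y = np.array([1, 0, 0])            # one-hot true label: "0" for all except true class
    y_hat = softmax(z)                 # ŷ_i = P(y_i = 1 | x)
    E = -np.sum(y * np.log(y_hat))     # E = -Σ_j y_j ln ŷ_j
    print(y_hat, E)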

Backpropagation


Training Neural Networks

[Diagram: network x → ŷ with weights W1, W2, w3]

How do we learn the optimal weights WL for our task?
● Gradient descent:

  WL(t+1) = WL(t) - η ∂E/∂WL(t)

But how do we get gradients of lower layers?
● Backpropagation!
  ○ Repeated application of the chain rule of calculus
  ○ Locally minimizes the objective
  ○ Requires all “blocks” of the network to be differentiable

LeCun et al., Efficient BackProp, 1998

Backpropagation Intro

[A worked example from http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf, developed step by step over several slides: compute a simple function forward, then apply the chain rule backward to obtain the gradient with respect to each input]

Tells us: by increasing x by a scale of 1, we decrease f by a scale of 4.
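The worked example itself is not reproduced in this transcript. As a hedged reconstruction consistent with that conclusion, take f(x, y, z) = (x + y)·z with x = -2, y = 5, z = -4 (these specific values are assumptions), so that ∂f/∂x = z = -4:

    x, y, z = -2.0, 5.0, -4.0

    # forward pass
    q = x + y          # q = 3
    f = q * z          # f = -12

    # backward pass (chain rule)
    df_dz = q          # ∂f/∂z = q = 3
    df_dq = z          # ∂f/∂q = z = -4
    df_dx = df_dq * 1  # ∂f/∂x = ∂f/∂q · ∂q/∂x = -4
    df_dy = df_dq * 1  # ∂f/∂y = -4

    print(df_dx)       # -4: increasing x by 1 decreases f by about 4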

Backpropagation (binary classification example)

Example on a 1-hidden-layer NN for binary classification.

[Block diagram: x → (*W1) → z1 → sigmoid → h1 → (*w2) → z2 → sigmoid → ŷ = P(y=1|X,W)]

[The step-by-step derivation on the following slides did not survive extraction. The slides proceed as follows: write down the loss E; note that gradient descent needs ∂E/∂w2 and ∂E/∂W1 (“Need to find these!”); the direct expressions are unknown (“= ??”), so exploit the chain rule, working backward from the loss through z2, h1, and z1; quantities already computed for the output layer are reused when computing the gradients of the lower layer.]
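As a concrete stand-in for those steps, the following NumPy sketch carries out one forward and backward pass for this 1-hidden-layer binary classifier (shapes follow the slides; the data, label, and learning rate are illustrative assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    p, d = 3, 4
    x  = rng.normal(size=(p, 1))
    y  = 1.0                              # true label (illustrative)
    W1 = rng.normal(size=(p, d))
    w2 = rng.normal(size=(d, 1))

    # forward
    z1 = W1.T @ x;  h1 = sigmoid(z1)      # d x 1
    z2 = w2.T @ h1; y_hat = sigmoid(z2)   # 1 x 1
    E  = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

    # backward (chain rule, reusing already-computed quantities)
    dE_dz2 = y_hat - y                    # ∂E/∂z2 for sigmoid + cross-entropy
    dE_dw2 = h1 @ dE_dz2                  # d x 1
    dE_dh1 = w2 @ dE_dz2                  # d x 1
    dE_dz1 = dE_dh1 * h1 * (1 - h1)       # element-wise sigmoid derivative
    dE_dW1 = x @ dE_dz1.T                 # p x d

    # gradient descent update (learning rate is an assumption)
    lr = 0.1
    W1 -= lr * dE_dW1
    w2 -= lr * dE_dw2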

“Local-ness” of Backpropagation

[Diagram: a block f with inputs x, y; in the forward pass it computes activations, and in the backward pass it multiplies the gradient flowing in from above by its “local gradients” to produce gradients for its inputs]

(Figure from http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf)

Example: Sigmoid Block

sigmoid(x) = σ(x) = 1 / (1 + e^(-x))

dσ(x)/dx = (1 - σ(x)) σ(x)
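A minimal sketch of a sigmoid block with a forward pass that caches its activation and a backward pass that applies this local gradient:

    import numpy as np

    class SigmoidBlock:
        def forward(self, x):
            self.out = 1.0 / (1.0 + np.exp(-x))   # cache the activation
            return self.out

        def backward(self, grad_out):
            # local gradient (1 - σ)σ times the gradient flowing in from above
            return grad_out * (1.0 - self.out) * self.out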

Deep Learning = Concatenation of Differentiable Parameterized Layers (linear & nonlinearity functions)

[Block diagram: x → (*W1) → z1 → sigmoid → h1 → (*W2) → z2 → sigmoid → h2 → (*W3) → z3 → sigmoid → y; 1st hidden layer, 2nd hidden layer, output layer]

Want to find optimal weights W to minimize some loss function E!

Backprop Whiteboard Demo

[Diagram: inputs x1, x2 (plus a bias input of 1), two hidden sigmoid units h1, h2, and a linear output ŷ, with weights w1…w6 and biases b1, b2, b3]

f1:  z1 = x1·w1 + x2·w3 + b1
     z2 = x1·w2 + x2·w4 + b2

f2:  h1 = exp(z1) / (1 + exp(z1))
     h2 = exp(z2) / (1 + exp(z2))

f3:  ŷ = h1·w5 + h2·w6 + b3

f4:  E = (y - ŷ)²

Gradient descent update:  w(t+1) = w(t) - η ∂E/∂w(t)

∂E/∂w = ??
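A NumPy sketch of the whiteboard demo: the forward pass f1…f4 followed by the chain rule to obtain ∂E/∂w for every weight (input values, true output, and learning rate are illustrative assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x1, x2, y = 1.0, -2.0, 1.0                       # inputs and true output (assumed)
    w1, w2, w3, w4, w5, w6 = 0.1, -0.3, 0.5, 0.2, 0.4, -0.1
    b1, b2, b3 = 0.0, 0.0, 0.0

    # forward (f1..f4)
    z1 = x1*w1 + x2*w3 + b1
    z2 = x1*w2 + x2*w4 + b2
    h1, h2 = sigmoid(z1), sigmoid(z2)
    y_hat = h1*w5 + h2*w6 + b3
    E = (y - y_hat)**2

    # backward (chain rule)
    dE_dyhat = -2.0 * (y - y_hat)
    dE_dw5, dE_dw6, dE_db3 = dE_dyhat*h1, dE_dyhat*h2, dE_dyhat
    dE_dz1 = dE_dyhat * w5 * h1 * (1 - h1)
    dE_dz2 = dE_dyhat * w6 * h2 * (1 - h2)
    dE_dw1, dE_dw3, dE_db1 = dE_dz1*x1, dE_dz1*x2, dE_dz1
    dE_dw2, dE_dw4, dE_db2 = dE_dz2*x1, dE_dz2*x2, dE_dz2

    # one gradient descent step, e.g. for w1 (learning rate is an assumption)
    eta = 0.5
    w1 = w1 - eta * dE_dw1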

Nonlinearity Functions

Nonlinearity Functions (i.e., transfer or activation functions)

[Diagram: the neuron again — input x multiplied by weights w, summed with the bias (summing function), then passed through a nonlinearity such as the sigmoid function]

Nonlinearity Functions (aka transfer or activation functions)

[Table comparing activation functions — name, plot, equation, and derivative w.r.t. x — from https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions. One of the listed functions is annotated as “usually works best in practice”.]
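The table itself is not reproduced here; as a minimal sketch, three common activation functions and their derivatives with respect to x:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def d_sigmoid(x):
        s = sigmoid(x)
        return s * (1.0 - s)

    def tanh(x):
        return np.tanh(x)

    def d_tanh(x):
        return 1.0 - np.tanh(x) ** 2

    def relu(x):
        return np.maximum(0.0, x)

    def d_relu(x):
        return (x > 0).astype(float)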

NNs in Practice

Neural Net Pipeline

[Diagram: network x → ŷ with weights W1, W2, w3 and loss E]

1. Initialize weights
2. For each batch of input x samples S:
   a. Run the network “forward” on S to compute outputs and loss
   b. Run the network “backward” using outputs and loss to compute gradients
   c. Update weights using SGD (or a similar method)
3. Repeat step 2 until loss convergence
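A pseudocode-style sketch of this pipeline; forward(), backward(), converged(), and the batch iterator are placeholders, not a particular library's API:

    def train(model, batches, lr=0.01, max_epochs=100):
        model.initialize_weights()                       # step 1
        for epoch in range(max_epochs):                  # step 3: repeat until convergence
            for S in batches:                            # step 2: each batch of samples
                outputs, loss = model.forward(S)         # 2a: forward pass
                grads = model.backward(outputs, loss)    # 2b: backward pass (backprop)
                for W, dW in zip(model.weights, grads):  # 2c: SGD update
                    W -= lr * dW
            if model.converged():
                break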

Non-Convexity of Neural Nets

In very high dimensions, there exist many local minima which are about the same.

Pascanu et al., On the saddle point problem for non-convex optimization, 2014

Building Deep Neural Nets

[Diagram: composing differentiable blocks f(x) → y; figure from http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf]

“GoogLeNet” for Object Classification

Block Example Implementation

(Example code from http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf; not reproduced in this transcript)
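A sketch in the same spirit (the class and method names are assumptions): a “block” caches its inputs during forward() and returns gradients with respect to those inputs in backward():

    class MultiplyBlock:
        def forward(self, x, w):
            self.x, self.w = x, w      # cache activations for the backward pass
            return x * w

        def backward(self, dz):
            dx = dz * self.w           # local gradient of x*w w.r.t. x is w
            dw = dz * self.x           # local gradient w.r.t. w is x
            return dx, dw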

Advantage of Neural Nets

As long as the model is fully differentiable, we can train it to automatically learn features for us.

Advanced Deep Learning Models:

Convolutional Neural Networks

& Recurrent Neural Networks

Most slides from http://cs231n.stanford.edu/

Convolutional Neural Networks (aka CNNs and ConvNets)

Challenges in Visual Recognition

[Illustrative figures not included in this transcript]

Problems with “Fully Connected Networks” on Images

1000×1000 image, 1M hidden units → 10^12 parameters

Spatial correlation is local! → Connect units locally (each neuron corresponds to a specific pixel)

How do we deal with multiple dimensions in the input? Length, height, channel (R, G, B)

Convolutional Layer

[Diagram slides: the convolutional layer, the convolution operation, and the neuron view of a convolutional layer; figures not included in this transcript]
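A minimal sketch of a single-channel 2D convolution with no padding and stride 1 (written in the cross-correlation form that most deep-learning libraries use):

    import numpy as np

    def conv2d(image, kernel):
        H, W = image.shape
        kH, kW = kernel.shape
        out = np.zeros((H - kH + 1, W - kW + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # each output value is a dot product between the kernel
                # and a local patch of the image (local connectivity)
                out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
        return out

    out = conv2d(np.random.rand(5, 5), np.random.rand(3, 3))
    print(out.shape)   # (3, 3)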

Convolutional Neural Networks

[Diagram slides showing full convolutional network architectures; figures not included. See also http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/]

Pooling Layer

Pooling Layer (Max Pooling example)
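A minimal sketch of 2×2 max pooling with stride 2 on a single-channel feature map (assumes the spatial dimensions are divisible by 2):

    import numpy as np

    def max_pool_2x2(fmap):
        H, W = fmap.shape
        out = np.zeros((H // 2, W // 2))
        for i in range(0, H, 2):
            for j in range(0, W, 2):
                out[i // 2, j // 2] = np.max(fmap[i:i+2, j:j+2])  # keep the largest value in each window
        return out

    print(max_pool_2x2(np.arange(16).reshape(4, 4)))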

History of ConvNets

1998: Gradient-Based Learning Applied to Document Recognition [LeCun, Bottou, Bengio, Haffner]
2012: ImageNet Classification with Deep Convolutional Neural Networks [Krizhevsky, Sutskever, Hinton]


Recurrent Neural Networks

Standard “Feed-Forward” Neural Network

Recurrent Neural Networks (RNNs)

RNNs can handle… [the examples on this slide did not survive extraction]

[Diagram: a traditional “feed-forward” neural network (input → hidden → output) compared with a recurrent neural network, whose hidden state feeds back into itself and which predicts an output vector at each timestep]

Recurrent Neural Networks

[Diagram: the hidden state ht is computed from the previous hidden state ht-1 and the current input]

“Vanilla” Recurrent Neural Network
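A minimal sketch of one step of a vanilla RNN; the weight names (W_xh, W_hh) and the tanh nonlinearity are the standard formulation, assumed here rather than taken from the slides:

    import numpy as np

    def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
        # new hidden state from the previous hidden state and the current input
        return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

    hidden, inp = 4, 3
    W_xh = np.random.randn(hidden, inp)
    W_hh = np.random.randn(hidden, hidden)
    b_h  = np.zeros(hidden)

    h = np.zeros(hidden)
    for x_t in np.random.randn(5, inp):   # a sequence of 5 input vectors
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)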

Character Level Language Model with an RNN

[A sequence of slides stepping through the example; figures not included in this transcript]
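The example itself is not reproduced here; a minimal sketch of the idea, with untrained random weights and assumed names (one-hot encode the current character, update the hidden state, predict a distribution over the next character):

    import numpy as np

    text = "hello"
    vocab = sorted(set(text))
    char_to_ix = {c: i for i, c in enumerate(vocab)}
    V, H = len(vocab), 8

    W_xh = np.random.randn(H, V) * 0.01
    W_hh = np.random.randn(H, H) * 0.01
    W_hy = np.random.randn(V, H) * 0.01
    h = np.zeros(H)

    for c in text[:-1]:
        x = np.zeros(V); x[char_to_ix[c]] = 1            # one-hot input character
        h = np.tanh(W_xh @ x + W_hh @ h)                 # update hidden state
        scores = W_hy @ h
        probs = np.exp(scores) / np.exp(scores).sum()    # softmax over the next character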

Example: Generating Shakespeare with RNNs


Example: Generating C Code with RNNs

Long Short Term Memory Networks (LSTMs)

Recurrent networks suffer from the “vanishing gradient problem”:
● They aren’t able to model long-term dependencies in sequences.

LSTMs use “gating units” to learn when to remember.
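A minimal sketch of one LSTM step; the gate names and equations follow the standard LSTM formulation (assumed here, not taken from the slides), with the gates deciding what to write to, erase from, and read out of the cell state:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        zcat = np.concatenate([x_t, h_prev])
        i = sigmoid(W['i'] @ zcat + b['i'])      # input gate
        f = sigmoid(W['f'] @ zcat + b['f'])      # forget gate: "when to remember"
        o = sigmoid(W['o'] @ zcat + b['o'])      # output gate
        g = np.tanh(W['g'] @ zcat + b['g'])      # candidate cell update
        c = f * c_prev + i * g                   # new cell state
        h = o * np.tanh(c)                       # new hidden state
        return h, c

    H, D = 4, 3
    W = {k: np.random.randn(H, H + D) for k in 'ifog'}
    b = {k: np.zeros(H) for k in 'ifog'}
    h, c = lstm_step(np.random.randn(D), np.zeros(H), np.zeros(H), W, b)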


RNNs and CNNs Together