Post on 13-Dec-2015
Outline
- Perceptrons
- Learning Hidden Layer Representations
- Speeding Up Training
- Bias, Overfitting and Early Stopping (Example: Face Recognition)
Human Learning
Number of neurons: ~10^10
Connections per neuron: ~10^4 to 10^5
Neuron switching time: ~0.001 second
Scene recognition time: ~0.1 second
So recognition must finish in at most ~100 sequential neural steps, which doesn't seem like much.
Perceptron
[Figure: a perceptron unit. Inputs x_1, ..., x_n with weights w_1, ..., w_n, plus a bias weight w_0 on the fixed input x_0 = 1; the unit emits output o.]

net = Σ_{i=0}^{n} w_i x_i

o = 1 if net > 0, 0 otherwise
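The unit above can be sketched in a few lines of Python (the AND weights in the usage example are illustrative, not from the slide):

```python
# A perceptron as defined above: net = sum_{i=0}^{n} w_i * x_i with x_0 = 1,
# output 1 if net > 0, else 0.
def perceptron(weights, inputs):
    """weights[0] is the bias weight w_0 (its input x_0 is fixed at 1)."""
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if net > 0 else 0

# Example: with w_0 = -1.5 and w_1 = w_2 = 1 the unit computes Boolean AND.
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, perceptron([-1.5, 1, 1], [x1, x2]))
```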
Boolean XOR
input x1 | input x2 | output
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0
[Figure: a two-layer network computing XOR. A hidden unit h1 computes AND of x1 and x2 (weights 1, 1, threshold 1.5); the output unit o combines x1 and x2 (weights 1, 1, threshold 0.5, i.e. OR) with an inhibitory connection from h1, giving "OR but not AND" = XOR.]
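XOR is not linearly separable, so no single perceptron computes it, but a two-layer network of threshold units can. A minimal sketch of one standard construction (the inhibitory weight -2 on the hidden AND unit is an assumption, one common choice rather than necessarily the slide's):

```python
# Two-layer threshold network for XOR: a hidden AND unit inhibits an OR output.
def step(net):
    return 1 if net > 0 else 0

def xor(x1, x2):
    h1 = step(x1 + x2 - 1.5)             # AND unit: weights 1, 1, threshold 1.5
    return step(x1 + x2 - 2 * h1 - 0.5)  # OR of inputs, inhibited by the AND

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor(x1, x2))
```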
Perceptron Training Rule
w_i ← w_i + Δw_i (new weight = old weight + increment)

Δw_i = η (t − o) x_i

where η is the step size, t the target output, o the perceptron output, and x_i the input.
Converges, if…
… training data linearly separable
… step size sufficiently small
… no “hidden” units
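A minimal sketch of the rule, trained on the (linearly separable) OR function; the learning rate and epoch count are illustrative choices:

```python
# Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i, with x_0 = 1.
def train_perceptron(examples, eta=0.1, epochs=20):
    w = [0.0, 0.0, 0.0]                   # w0 (bias), w1, w2
    for _ in range(epochs):
        for (x1, x2), t in examples:
            o = 1 if w[0] + w[1] * x1 + w[2] * x2 > 0 else 0
            for i, x in enumerate([1, x1, x2]):   # x0 = 1 is the bias input
                w[i] += eta * (t - o) * x
    return w

or_examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_perceptron(or_examples)
```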
Sigmoid Squashing Function
[Figure: the same unit as before, with the threshold output replaced by a smooth sigmoid]

net = Σ_{i=0}^{n} w_i x_i

o = σ(net) = 1 / (1 + e^(−net))
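The sigmoid's derivative has the convenient form σ'(x) = σ(x)(1 − σ(x)), which is what makes the gradient updates on the following slides so compact. A quick numerical sanity check of that identity (the point x = 0.7 is arbitrary):

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

# Check sigma'(x) = sigma(x) * (1 - sigma(x)) by central finite differences.
x, eps = 0.7, 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoid(x) * (1 - sigmoid(x))
print(abs(numeric - analytic))  # should be tiny
```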
Gradient Descent
Gradient: ∇E[w] = [∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n]

Training rule: Δw = −η ∇E[w], i.e. Δw_i = −η ∂E/∂w_i

where E[w] = ½ Σ_{d∈D} (t_d − o_d)²
Gradient Descent (single layer)
∂E/∂w_i = ∂/∂w_i ½ Σ_d (t_d − o_d)²
        = ½ Σ_d 2 (t_d − o_d) ∂/∂w_i (t_d − o_d)
        = Σ_d (t_d − o_d) (−∂o_d/∂w_i)

With a sigmoid unit, o_d = σ(net_d) and net_d = Σ_i w_i x_i,d, so ∂o_d/∂w_i = o_d (1 − o_d) x_i,d, giving

∂E/∂w_i = −Σ_d (t_d − o_d) o_d (1 − o_d) x_i,d
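The final line of the derivation can be verified against finite differences; a small sketch (the unit, data, and weights here are made up for illustration):

```python
import math, random

# Check dE/dw_i = -sum_d (t_d - o_d) o_d (1 - o_d) x_{i,d} for a sigmoid unit.
def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def error(w, data):                 # E = 1/2 sum_d (t_d - o_d)^2
    return 0.5 * sum(
        (t - sig(sum(wi * xi for wi, xi in zip(w, x)))) ** 2 for x, t in data)

random.seed(0)
data = [([1.0, random.uniform(-1, 1)], float(random.randint(0, 1)))
        for _ in range(5)]
w = [0.3, -0.2]

# Analytic gradient from the derivation above
grad = [0.0, 0.0]
for x, t in data:
    o = sig(sum(wi * xi for wi, xi in zip(w, x)))
    for i in range(2):
        grad[i] -= (t - o) * o * (1 - o) * x[i]

# Central finite differences should agree closely
eps = 1e-6
diffs = []
for i in range(2):
    wp = list(w); wp[i] += eps
    wm = list(w); wm[i] -= eps
    diffs.append(abs((error(wp, data) - error(wm, data)) / (2 * eps) - grad[i]))
print(max(diffs))
```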
Batch Learning
Initialize each w_i to a small random value.
Repeat until termination:
  Δw_i ← 0
  For each training example d:
    o_d ← σ(Σ_i w_i x_i,d)
    Δw_i ← Δw_i + η (t_d − o_d) o_d (1 − o_d) x_i,d
  w_i ← w_i + Δw_i
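A minimal sketch of the batch procedure for a single sigmoid unit, trained on OR (learning rate and epoch count are illustrative assumptions):

```python
import math, random

# Batch rule: accumulate Delta w_i over all examples, then apply one summed
# update per pass through the training set.
def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def batch_epoch(w, data, eta):
    dw = [0.0] * len(w)
    for x, t in data:                       # each x includes x_0 = 1 (bias)
        o = sig(sum(wi * xi for wi, xi in zip(w, x)))
        for i in range(len(w)):
            dw[i] += eta * (t - o) * o * (1 - o) * x[i]
    return [wi + dwi for wi, dwi in zip(w, dw)]

random.seed(1)
data = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]  # OR
w = [random.uniform(-0.1, 0.1) for _ in range(3)]
for _ in range(2000):
    w = batch_epoch(w, data, eta=0.5)
```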
Incremental (Online) Learning
Initialize each w_i to a small random value.
Repeat until termination:
  For each training example d:
    o_d ← σ(Σ_i w_i x_i,d)
    Δw_i ← η (t_d − o_d) o_d (1 − o_d) x_i,d
    w_i ← w_i + Δw_i
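The incremental variant differs only in applying the update immediately after each example instead of summing over the whole set; a self-contained sketch on the same OR task (hyperparameters again illustrative):

```python
import math, random

# Online (incremental) learning: per-example weight updates.
def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(2)
data = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]  # OR
w = [random.uniform(-0.1, 0.1) for _ in range(3)]
eta = 0.5
for _ in range(2000):
    for x, t in data:
        o = sig(sum(wi * xi for wi, xi in zip(w, x)))
        for i in range(len(w)):
            w[i] += eta * (t - o) * o * (1 - o) * x[i]
```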
Backpropagation Algorithm
Initialize all weights to small random numbers.
For each training example do:
  Propagate the input forward:
  - For each hidden unit h: o_h = σ(Σ_i w_ih x_i)
  - For each output unit k: o_k = σ(Σ_h w_hk o_h)
  Propagate the errors backward:
  - For each output unit k: δ_k = o_k (1 − o_k) (t_k − o_k)
  - For each hidden unit h: δ_h = o_h (1 − o_h) Σ_k w_hk δ_k
  Update each network weight w_ij:
  - w_ij ← w_ij + Δw_ij, with Δw_ij = η δ_j x_ij
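A sketch of one stochastic backpropagation step for a single hidden layer (network sizes, weights, and the training example are made up; squared error and sigmoid units throughout):

```python
import math, random

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_step(W_ih, W_ho, x, t, eta):
    # Forward pass: hidden then output activations
    o_h = [sig(sum(w * xi for w, xi in zip(row, x))) for row in W_ih]
    o_k = [sig(sum(w * oh for w, oh in zip(row, o_h))) for row in W_ho]
    # Backward pass: output deltas, then hidden deltas
    d_k = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o_k, t)]
    d_h = [oh * (1 - oh) * sum(W_ho[k][h] * d_k[k] for k in range(len(d_k)))
           for h, oh in enumerate(o_h)]
    # Weight updates: Delta w_ij = eta * delta_j * x_ij
    for k in range(len(W_ho)):
        for h in range(len(o_h)):
            W_ho[k][h] += eta * d_k[k] * o_h[h]
    for h in range(len(W_ih)):
        for i in range(len(x)):
            W_ih[h][i] += eta * d_h[h] * x[i]

random.seed(3)
x, t = [1.0, 0.5, -0.3], [1.0]
W_ih = [[random.uniform(-0.5, 0.5) for _ in x] for _ in range(2)]
W_ho = [[random.uniform(-0.5, 0.5) for _ in range(2)]]

def err():
    o_h = [sig(sum(w * xi for w, xi in zip(row, x))) for row in W_ih]
    o_k = [sig(sum(w * oh for w, oh in zip(row, o_h))) for row in W_ho]
    return 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, o_k))

before = err()
backprop_step(W_ih, W_ho, x, t, eta=0.1)
after = err()
print(before, "->", after)  # one gradient step should reduce the error
```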
Can This Be Learned?
Input Output
10000000 10000000
01000000 01000000
00100000 00100000
00010000 00010000
00001000 00001000
00000100 00000100
00000010 00000010
00000001 00000001
Learned Hidden Layer Representation
Input Hidden values Output
10000000 .89 .04 .08 10000000
01000000 .01 .11 .88 01000000
00100000 .01 .97 .27 00100000
00010000 .99 .97 .71 00010000
00001000 .03 .05 .02 00001000
00000100 .22 .99 .99 00000100
00000010 .80 .01 .98 00000010
00000001 .60 .94 .01 00000001
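The 8-3-8 identity task above can be trained with plain backpropagation; a compact batch-mode sketch (learning rate, epoch count, and the omission of bias units are assumptions for brevity, not the original experiment's settings):

```python
import numpy as np

# Train an 8-3-8 autoencoder with backprop (sigmoid units, squared error).
rng = np.random.default_rng(0)
X = np.eye(8)                      # the eight one-hot inputs are also the targets
W1 = rng.uniform(-0.5, 0.5, (8, 3))
W2 = rng.uniform(-0.5, 0.5, (3, 8))
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

def total_error():
    return 0.5 * np.sum((X - sig(sig(X @ W1) @ W2)) ** 2)

e0 = total_error()
eta = 0.5
for _ in range(5000):
    H = sig(X @ W1)                # hidden activations (8 examples x 3 units)
    O = sig(H @ W2)                # network outputs
    d_out = O * (1 - O) * (X - O)  # output-unit deltas
    d_hid = H * (1 - H) * (d_out @ W2.T)
    W2 += eta * H.T @ d_out
    W1 += eta * X.T @ d_hid
e1 = total_error()
print(e0, "->", e1)
```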
Speeding It Up: Momentum

[Figure: error E plotted against weight w_ij, showing the old w_ij and the new w_ij after one update]

Gradient descent: Δw_ij = −η ∂E/∂w_ij

GD with momentum: Δw_ij(n) = −η ∂E/∂w_ij + α Δw_ij(n−1)
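A minimal sketch of the momentum update Δw(n) = −η ∂E/∂w + α Δw(n−1); the η and α values are illustrative (α is typically around 0.9):

```python
# One momentum update: the new step blends the current gradient with the
# previous step, so step length builds up along a persistent gradient direction.
def momentum_step(grad, prev_dw, eta=0.1, alpha=0.9):
    return -eta * grad + alpha * prev_dw

# With a constant gradient of 1.0 the step grows geometrically
# toward the limit -eta / (1 - alpha) = -1.0:
dw = 0.0
for _ in range(3):
    dw = momentum_step(1.0, dw)
print(dw)  # steps: -0.1, -0.19, -0.271
```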
ANNs for Face Recognition

[Figure: typical input images; head poses: left, straight, right, up]

Head pose (1-of-4): 90% accuracy
Face recognition (1-of-20): 90% accuracy