IV. Neural Network Learning

Transcript
Page 1: IV. Neural Network Learning

IV. Neural Network Learning

Page 2: IV. Neural Network Learning

A. Neural Network Learning

Page 3: IV. Neural Network Learning

Supervised Learning

• Produce desired outputs for training inputs

• Generalize reasonably & appropriately to other inputs

• Good example: pattern recognition

• Feedforward multilayer networks

Page 4: IV. Neural Network Learning

Feedforward Network

[Figure: feedforward network with an input layer, several hidden layers, and an output layer]

Page 5: IV. Neural Network Learning

Typical Artificial Neuron

[Figure: neuron receiving inputs through connection weights, with a threshold and an output]

Page 6: IV. Neural Network Learning

Typical Artificial Neuron

[Figure: a linear combination of the inputs forms the net input (local field), which passes through the activation function to give the output]

Page 7: IV. Neural Network Learning

Equations

Net input:
$h_i = \left( \sum_{j=1}^{n} w_{ij} s_j \right) - \theta$
$\mathbf{h} = \mathbf{W}\mathbf{s} - \boldsymbol{\theta}$

Neuron output:
$s'_i = \sigma(h_i)$
$\mathbf{s}' = \sigma(\mathbf{h})$
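
A minimal NumPy rendering of these two equations may help; the function name, the example numbers, and the use of tanh as the activation are illustrative choices only.

```python
import numpy as np

def layer_forward(W, s, theta, sigma=np.tanh):
    """Net input h = W s - theta, output s' = sigma(h).

    W: (n_out, n_in) weights, s: (n_in,) inputs, theta: (n_out,) thresholds.
    sigma is any activation function; tanh is just a placeholder choice.
    """
    h = W @ s - theta          # net input (local field)
    return sigma(h)            # neuron outputs

# Tiny usage example with made-up numbers
W = np.array([[0.5, -0.2], [0.1, 0.4]])
s = np.array([1.0, 0.0])
theta = np.array([0.1, 0.3])
print(layer_forward(W, s, theta))
```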

Page 8: IV. Neural Network Learning

Single-Layer Perceptron

[Figure: a single layer of threshold units connecting the inputs directly to the outputs]

Page 9: IV. Neural Network Learning

Variables

[Figure: inputs x1, ..., xj, ..., xn with weights w1, ..., wj, ..., wn feeding the net input h and output y]

Page 10: IV. Neural Network Learning

Single-Layer Perceptron Equations

Binary threshold activation function:
$\sigma(h) = \Theta(h) = \begin{cases} 1, & \text{if } h > 0 \\ 0, & \text{if } h \le 0 \end{cases}$

Hence
$y = \begin{cases} 1, & \text{if } \sum_j w_j x_j > \theta \\ 0, & \text{otherwise} \end{cases}
 = \begin{cases} 1, & \text{if } \mathbf{w} \cdot \mathbf{x} > \theta \\ 0, & \text{if } \mathbf{w} \cdot \mathbf{x} \le \theta \end{cases}$

Page 11: IV. Neural Network Learning

2D Weight Vector

[Figure: weight vector w = (w1, w2) and input vector x; v is the length of the projection of x onto the direction of w; the threshold line divides the plane into + and – regions]

$\mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \, \|\mathbf{x}\| \cos\phi$

$\cos\phi = \dfrac{v}{\|\mathbf{x}\|}$, so $\mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \, v$

$\mathbf{w} \cdot \mathbf{x} > \theta \;\Leftrightarrow\; \|\mathbf{w}\| \, v > \theta \;\Leftrightarrow\; v > \theta / \|\mathbf{w}\|$

Page 12: IV. Neural Network Learning

N-Dimensional Weight Vector

[Figure: the weight vector w is the normal vector of the separating hyperplane; the + region lies on the side w points toward]

Page 13: IV. Neural Network Learning

Goal of Perceptron Learning

• Suppose we have training patterns $x^1, x^2, \ldots, x^P$ with corresponding desired outputs $y^1, y^2, \ldots, y^P$
• where $x^p \in \{0,1\}^n$ and $y^p \in \{0,1\}$
• We want to find $\mathbf{w}$, $\theta$ such that $y^p = \Theta(\mathbf{w} \cdot x^p - \theta)$ for $p = 1, \ldots, P$

Page 14: IV. Neural Network Learning

Treating Threshold as Weight

[Figure: inputs x1, ..., xj, ..., xn with weights w1, ..., wj, ..., wn feeding the net input h and output y]

$h = \left( \sum_{j=1}^{n} w_j x_j \right) - \theta = -\theta + \sum_{j=1}^{n} w_j x_j$

Page 15: IV. Neural Network Learning

Treating Threshold as Weight

[Figure: the same neuron with an extra input x0 = –1 whose weight is w0 = θ]

$h = \left( \sum_{j=1}^{n} w_j x_j \right) - \theta = -\theta + \sum_{j=1}^{n} w_j x_j$

Let $x_0 = -1$ and $w_0 = \theta$. Then

$h = w_0 x_0 + \sum_{j=1}^{n} w_j x_j = \sum_{j=0}^{n} w_j x_j = \tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}$

Page 16: IV. Neural Network Learning

Augmented Vectors

$\tilde{\mathbf{w}} = \begin{pmatrix} \theta \\ w_1 \\ \vdots \\ w_n \end{pmatrix}, \qquad
 \tilde{\mathbf{x}}^p = \begin{pmatrix} -1 \\ x_1^p \\ \vdots \\ x_n^p \end{pmatrix}$

We want $y^p = \Theta\!\left( \tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p \right)$ for $p = 1, \ldots, P$

Page 17: IV. Neural Network Learning

Reformulation as Positive Examples

We have positive ($y^p = 1$) and negative ($y^p = 0$) examples.

Want $\tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p > 0$ for positive, $\tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p \le 0$ for negative.

Let $\mathbf{z}^p = \tilde{\mathbf{x}}^p$ for positive, $\mathbf{z}^p = -\tilde{\mathbf{x}}^p$ for negative.

Want $\tilde{\mathbf{w}} \cdot \mathbf{z}^p \ge 0$ for $p = 1, \ldots, P$.

This is a hyperplane through the origin with all $\mathbf{z}^p$ on one side.

Page 18: IV. Neural Network Learning

Adjustment of Weight Vector

[Figure: example vectors z1, ..., z11 surrounding the weight vector that is to be adjusted]

Page 19: IV. Neural Network Learning

Outline of Perceptron Learning Algorithm

1. initialize weight vector randomly
2. until all patterns classified correctly, do:
   a) for p = 1, …, P do:
      1) if zp classified correctly, do nothing
      2) else adjust weight vector to be closer to correct classification (a Python sketch follows below)
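
A minimal NumPy sketch of the outlined algorithm, using the augmented vectors and the z-vector reformulation from the preceding slides; the learning rate, epoch limit, and the AND example are illustrative choices.

```python
import numpy as np

def perceptron_learn(X, y, eta=0.1, max_epochs=1000, rng=np.random.default_rng(0)):
    """Perceptron learning on binary patterns, following the slides:
    augment each x with x0 = -1, flip negative examples into z-vectors,
    then nudge the weight vector by eta*z whenever w.z < 0."""
    P, n = X.shape
    X_aug = np.hstack([-np.ones((P, 1)), X])        # x0 = -1 column
    Z = np.where(y[:, None] == 1, X_aug, -X_aug)    # z^p = +x~ or -x~
    w = rng.normal(size=n + 1)                      # 1. random init (w0 = theta)

    for _ in range(max_epochs):                     # 2. until all correct
        errors = 0
        for z in Z:                                 # a) for each pattern
            if w @ z < 0:                           # 2) misclassified:
                w += eta * z                        #    move w toward z
                errors += 1
        if errors == 0:                             # all classified correctly
            break
    return w

# Usage: learn the AND function (linearly separable)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w = perceptron_learn(X, y)
print((np.hstack([-np.ones((4, 1)), X]) @ w > 0).astype(int))  # expected: [0 0 0 1]
```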

Page 20: IV. Neural Network Learning

Weight Adjustment

[Figure: a misclassified vector $\mathbf{z}^p$ pulls the weight vector toward it: $\tilde{\mathbf{w}}' = \tilde{\mathbf{w}} + \eta\mathbf{z}^p$, and a further step gives $\tilde{\mathbf{w}}'' = \tilde{\mathbf{w}}' + \eta\mathbf{z}^p$]

Page 21: IV. Neural Network Learning

Improvement in Performance

If $\tilde{\mathbf{w}} \cdot \mathbf{z}^p < 0$:

$\tilde{\mathbf{w}}' \cdot \mathbf{z}^p
 = \left( \tilde{\mathbf{w}} + \eta\mathbf{z}^p \right) \cdot \mathbf{z}^p
 = \tilde{\mathbf{w}} \cdot \mathbf{z}^p + \eta\,\mathbf{z}^p \cdot \mathbf{z}^p
 = \tilde{\mathbf{w}} \cdot \mathbf{z}^p + \eta \|\mathbf{z}^p\|^2
 > \tilde{\mathbf{w}} \cdot \mathbf{z}^p$

Page 22: IV. Neural Network Learning

Perceptron Learning Theorem

• If there is a set of weights that will solve the problem,
• then the perceptron learning algorithm (PLA) will eventually find it
• (for a sufficiently small learning rate)
• Note: this only applies if the positive & negative examples are linearly separable

Page 23: IV. Neural Network Learning

NetLogo Simulation of Perceptron Learning

Run Perceptron-Geometry.nlogo

Page 24: IV. Neural Network Learning

Classification Power of Multilayer Perceptrons

• Perceptrons can function as logic gates (see the sketch below)
• Therefore an MLP can form intersections, unions, and differences of linearly separable regions
• Classes can be arbitrary hyperpolyhedra
• Minsky & Papert's criticism of perceptrons
• At the time, no one had succeeded in developing an MLP learning algorithm
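
A small sketch of perceptrons acting as logic gates, and of a two-layer combination computing XOR (which a single threshold unit cannot); the particular weights and thresholds are just one workable choice, not taken from the slides.

```python
import numpy as np

# Threshold unit: y = 1 if w.x > theta else 0 (the slides' Theta function)
unit = lambda w, theta, x: int(np.dot(w, x) > theta)

# Hand-picked weights/thresholds (one of many possible choices) for basic gates
AND = lambda x: unit([1, 1], 1.5, x)
OR  = lambda x: unit([1, 1], 0.5, x)
NOT = lambda x: unit([-1], -0.5, x)

# XOR is not linearly separable, but a two-layer combination handles it:
XOR = lambda x: AND([OR(x), NOT([AND(x)])])

print([AND(x) for x in [(0,0), (0,1), (1,0), (1,1)]])   # [0, 0, 0, 1]
print([XOR(x) for x in [(0,0), (0,1), (1,0), (1,1)]])   # [0, 1, 1, 0]
```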

Page 25: IV. Neural Network Learning

Hyperpolyhedral Classes

Page 26: IV. Neural Network Learning

Credit Assignment Problem

[Figure: feedforward network with input layer, hidden layers, and output layer, whose output is compared against the desired output]

How do we adjust the weights of the hidden layers?

Page 27: IV. Neural Network Learning

NetLogo Demonstration of Back-Propagation Learning

Run Artificial Neural Net.nlogo

Page 28: IV. Neural Network Learning

Adaptive System

[Figure: system S with control parameters P1, …, Pk, …, Pm (collectively C), an evaluation function F (fitness, figure of merit), and a control algorithm that adjusts the parameters]

Page 29: IV. Neural Network Learning

Gradient

$\dfrac{\partial F}{\partial P_k}$ measures how $F$ is altered by variation of $P_k$

$\nabla F = \begin{pmatrix} \dfrac{\partial F}{\partial P_1} \\ \vdots \\ \dfrac{\partial F}{\partial P_k} \\ \vdots \\ \dfrac{\partial F}{\partial P_m} \end{pmatrix}$

$\nabla F$ points in the direction of maximum local increase in $F$

Page 30: IV. Neural Network Learning

Gradient Ascent on Fitness Surface

[Figure: fitness surface F with a gradient-ascent trajectory climbing toward a maximum (+)]

Page 31: IV. Neural Network Learning

Gradient Ascent by Discrete Steps

[Figure: the same fitness surface climbed in discrete steps along the local gradient]

Page 32: IV. Neural Network Learning

Gradient Ascent is Local, But Not Shortest

[Figure: a gradient-ascent path that reaches the maximum (+) but is not the shortest route to it]

Page 33: IV. Neural Network Learning

Gradient Ascent Process

$\dot{\mathbf{P}} = \eta \nabla F(\mathbf{P})$

Change in fitness:

$\dot{F} = \dfrac{dF}{dt} = \sum_{k=1}^{m} \dfrac{\partial F}{\partial P_k} \dfrac{dP_k}{dt} = \sum_{k=1}^{m} (\nabla F)_k \dot{P}_k = \nabla F \cdot \dot{\mathbf{P}}$

$\dot{F} = \nabla F \cdot \eta \nabla F = \eta \|\nabla F\|^2 \ge 0$

Therefore gradient ascent increases fitness (until it reaches a zero gradient).
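
A discrete-step sketch of this process in NumPy; the quadratic fitness function, its gradient, and the step size are illustrative only.

```python
import numpy as np

def grad_ascent(grad_F, P0, eta=0.1, steps=100):
    """Discrete-step version of P_dot = eta * grad F(P)."""
    P = np.asarray(P0, dtype=float)
    for _ in range(steps):
        P = P + eta * grad_F(P)   # each step moves uphill, so F never decreases
    return P

# Illustrative fitness F(P) = -(P1 - 1)^2 - (P2 + 2)^2, maximized at (1, -2)
grad_F = lambda P: np.array([-2 * (P[0] - 1), -2 * (P[1] + 2)])
print(grad_ascent(grad_F, [0.0, 0.0]))   # approaches [1, -2]
```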

Page 34: IV. Neural Network Learning

General Ascent in Fitness

Note that any adaptive process $\mathbf{P}(t)$ will increase fitness provided:

$0 < \dot{F} = \nabla F \cdot \dot{\mathbf{P}} = \|\nabla F\| \, \|\dot{\mathbf{P}}\| \cos\varphi$

where $\varphi$ is the angle between $\nabla F$ and $\dot{\mathbf{P}}$.

Hence we need $\cos\varphi > 0$, i.e. $\varphi < 90°$.

Page 35: IV. Neural Network Learning

General Ascent on Fitness Surface

[Figure: any trajectory whose steps stay within 90° of the gradient still climbs toward the maximum (+)]

Page 36: IV. Neural Network Learning

Fitness as Minimum Error

Suppose for $Q$ different inputs we have target outputs $\mathbf{t}^1, \ldots, \mathbf{t}^Q$.

Suppose for parameters $\mathbf{P}$ the corresponding actual outputs are $\mathbf{y}^1, \ldots, \mathbf{y}^Q$.

Suppose $D(\mathbf{t}, \mathbf{y}) \in [0, \infty)$ measures the difference between target & actual outputs.

Let $E^q = D(\mathbf{t}^q, \mathbf{y}^q)$ be the error on the $q$th sample.

Let $F(\mathbf{P}) = -\sum_{q=1}^{Q} E^q(\mathbf{P}) = -\sum_{q=1}^{Q} D\!\left[ \mathbf{t}^q, \mathbf{y}^q(\mathbf{P}) \right]$

Page 37: IV. Neural Network Learning

Gradient of Fitness

$\nabla F = \nabla \left( -\sum_q E^q \right) = -\sum_q \nabla E^q$

$\dfrac{\partial E^q}{\partial P_k}
 = \dfrac{\partial}{\partial P_k} D(\mathbf{t}^q, \mathbf{y}^q)
 = \sum_j \dfrac{\partial D(\mathbf{t}^q, \mathbf{y}^q)}{\partial y_j^q} \dfrac{\partial y_j^q}{\partial P_k}
 = \dfrac{d D(\mathbf{t}^q, \mathbf{y}^q)}{d \mathbf{y}^q} \cdot \dfrac{\partial \mathbf{y}^q}{\partial P_k}
 = \nabla_{\mathbf{y}^q} D(\mathbf{t}^q, \mathbf{y}^q) \cdot \dfrac{\partial \mathbf{y}^q}{\partial P_k}$

Page 38: IV. Neural Network Learning

Jacobian Matrix

Define the Jacobian matrix
$\mathbf{J}^q = \begin{pmatrix}
 \dfrac{\partial y_1^q}{\partial P_1} & \cdots & \dfrac{\partial y_1^q}{\partial P_m} \\
 \vdots & \ddots & \vdots \\
 \dfrac{\partial y_n^q}{\partial P_1} & \cdots & \dfrac{\partial y_n^q}{\partial P_m}
\end{pmatrix}$

Note $\mathbf{J}^q \in \mathbb{R}^{n \times m}$ and $\nabla D(\mathbf{t}^q, \mathbf{y}^q) \in \mathbb{R}^{n \times 1}$.

Since $\left( \nabla E^q \right)_k = \dfrac{\partial E^q}{\partial P_k} = \sum_j \dfrac{\partial y_j^q}{\partial P_k} \dfrac{\partial D(\mathbf{t}^q, \mathbf{y}^q)}{\partial y_j^q}$,

$\therefore \nabla E^q = \left( \mathbf{J}^q \right)^{\mathsf{T}} \nabla D(\mathbf{t}^q, \mathbf{y}^q)$

Page 39: IV. Neural Network Learning

Derivative of Squared Euclidean Distance

Suppose $D(\mathbf{t}, \mathbf{y}) = \|\mathbf{t} - \mathbf{y}\|^2 = \sum_i (t_i - y_i)^2$

$\dfrac{\partial D(\mathbf{t}, \mathbf{y})}{\partial y_j}
 = \dfrac{\partial}{\partial y_j} \sum_i (t_i - y_i)^2
 = \sum_i \dfrac{\partial (t_i - y_i)^2}{\partial y_j}
 = \dfrac{d (t_j - y_j)^2}{d y_j}
 = -2 (t_j - y_j)$

$\therefore \dfrac{d D(\mathbf{t}, \mathbf{y})}{d \mathbf{y}} = 2 (\mathbf{y} - \mathbf{t})$

Page 40: IV. Neural Network Learning

Gradient of Error on qth Input

$\dfrac{\partial E^q}{\partial P_k}
 = \dfrac{d D(\mathbf{t}^q, \mathbf{y}^q)}{d \mathbf{y}^q} \cdot \dfrac{\partial \mathbf{y}^q}{\partial P_k}
 = 2 \left( \mathbf{y}^q - \mathbf{t}^q \right) \cdot \dfrac{\partial \mathbf{y}^q}{\partial P_k}
 = 2 \sum_j \left( y_j^q - t_j^q \right) \dfrac{\partial y_j^q}{\partial P_k}$

$\nabla E^q = 2 \left( \mathbf{J}^q \right)^{\mathsf{T}} \left( \mathbf{y}^q - \mathbf{t}^q \right)$
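
A small finite-difference check of this identity on a made-up linear system; the toy model, its dimensions, and all names here are illustrative and not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy system: n = 3 outputs depend linearly on m = 5 parameters.
A = rng.normal(size=(3, 5))
t = rng.normal(size=3)                      # target for this one sample

y = lambda P: A @ P                         # actual outputs y(P)
E = lambda P: np.sum((t - y(P)) ** 2)       # squared error on the sample

P = rng.normal(size=5)
eps = 1e-6
I = np.eye(5)

# Finite-difference Jacobian J[i, k] = dy_i / dP_k
J = np.stack([(y(P + eps * I[k]) - y(P - eps * I[k])) / (2 * eps)
              for k in range(5)], axis=1)

# Finite-difference gradient of E versus the closed form 2 J^T (y - t)
gradE = np.array([(E(P + eps * I[k]) - E(P - eps * I[k])) / (2 * eps)
                  for k in range(5)])
print(np.allclose(gradE, 2 * J.T @ (y(P) - t), atol=1e-4))   # True
```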

Page 41: IV. Neural Network Learning

Recap

To know how to decrease the differences between actual & desired outputs, we need to know the elements of the Jacobian, $\partial y_j^q / \partial P_k$, which say how the $j$th output varies with the $k$th parameter (given the $q$th input).

The Jacobian depends on the specific form of the system, in this case a feedforward neural network.

$\dot{\mathbf{P}} = \eta \sum_q \left( \mathbf{J}^q \right)^{\mathsf{T}} \left( \mathbf{t}^q - \mathbf{y}^q \right)$

Page 42: IV. Neural Network Learning

Multilayer Notation

[Figure: layers of neuron outputs s1, s2, …, s^{L–1}, s^L connected by weight matrices W1, W2, …, W^{L–2}, W^{L–1}; the input is x^q and the output y^q]

Page 43: IV. Neural Network Learning

Notation

• L layers of neurons labeled 1, …, L
• N_l neurons in layer l
• s^l = vector of outputs from neurons in layer l
• input layer s^1 = x^q (the input pattern)
• output layer s^L = y^q (the actual output)
• W^l = weights between layers l and l+1
• Problem: find how the outputs y_i^q vary with the weights W_jk^l (l = 1, …, L–1)

Page 44: IV. Neural Network Learning

Typical Neuron

[Figure: neuron i in layer l receives s_1^{l–1}, …, s_j^{l–1}, …, s_N^{l–1} through weights W_i1^{l–1}, …, W_ij^{l–1}, …, W_iN^{l–1}, forming net input h_i^l and output s_i^l]

Page 45: IV. Neural Network Learning

Error Back-Propagation

We will compute $\dfrac{\partial E^q}{\partial W_{ij}^l}$ starting with the last layer ($l = L-1$) and working back to earlier layers ($l = L-2, \ldots, 1$).

Page 46: IV. Neural Network Learning

Delta Values

It is convenient to break the derivatives up by the chain rule:

$\dfrac{\partial E^q}{\partial W_{ij}^{l-1}} = \dfrac{\partial E^q}{\partial h_i^l} \dfrac{\partial h_i^l}{\partial W_{ij}^{l-1}}$

Let $\delta_i^l = \dfrac{\partial E^q}{\partial h_i^l}$

So $\dfrac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l \dfrac{\partial h_i^l}{\partial W_{ij}^{l-1}}$

Page 47: IV. Neural Network Learning

Output-Layer Neuron

[Figure: output neuron i receives s_1^{L–1}, …, s_j^{L–1}, …, s_N^{L–1} through weights W_i1^{L–1}, …, W_ij^{L–1}, …, W_iN^{L–1}; its net input is h_i^L and its output s_i^L = y_i^q, which is compared with the target t_i^q to give the error E^q]

Page 48: IV. Neural Network Learning

Output-Layer Derivatives (1)

$\delta_i^L = \dfrac{\partial E^q}{\partial h_i^L}
 = \dfrac{\partial}{\partial h_i^L} \sum_k \left( s_k^L - t_k^q \right)^2
 = \dfrac{d \left( s_i^L - t_i^q \right)^2}{d h_i^L}
 = 2 \left( s_i^L - t_i^q \right) \dfrac{d s_i^L}{d h_i^L}
 = 2 \left( s_i^L - t_i^q \right) \sigma'\!\left( h_i^L \right)$

Page 49: IV. Neural Network Learning

Output-Layer Derivatives (2)

$\dfrac{\partial h_i^L}{\partial W_{ij}^{L-1}}
 = \dfrac{\partial}{\partial W_{ij}^{L-1}} \sum_k W_{ik}^{L-1} s_k^{L-1}
 = s_j^{L-1}$

$\therefore \dfrac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L s_j^{L-1}$,
where $\delta_i^L = 2 \left( s_i^L - t_i^q \right) \sigma'\!\left( h_i^L \right)$

Page 50: IV. Neural Network Learning

Hidden-Layer Neuron

[Figure: hidden neuron i in layer l receives s_1^{l–1}, …, s_j^{l–1}, …, s_N^{l–1} through weights W_i1^{l–1}, …, W_ij^{l–1}, …, W_iN^{l–1}, producing net input h_i^l and output s_i^l, which feeds neurons s_1^{l+1}, …, s_k^{l+1}, …, s_N^{l+1} through weights W_1i^l, …, W_ki^l, …, W_Ni^l, all of which contribute to the error E^q]

Page 51: IV. Neural Network Learning

Hidden-Layer Derivatives (1)

Recall $\dfrac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l \dfrac{\partial h_i^l}{\partial W_{ij}^{l-1}}$

$\delta_i^l = \dfrac{\partial E^q}{\partial h_i^l}
 = \sum_k \dfrac{\partial E^q}{\partial h_k^{l+1}} \dfrac{\partial h_k^{l+1}}{\partial h_i^l}
 = \sum_k \delta_k^{l+1} \dfrac{\partial h_k^{l+1}}{\partial h_i^l}$

$\dfrac{\partial h_k^{l+1}}{\partial h_i^l}
 = \dfrac{\partial \sum_m W_{km}^l s_m^l}{\partial h_i^l}
 = \dfrac{\partial W_{ki}^l s_i^l}{\partial h_i^l}
 = W_{ki}^l \dfrac{d \sigma\!\left( h_i^l \right)}{d h_i^l}
 = W_{ki}^l \, \sigma'\!\left( h_i^l \right)$

$\therefore \delta_i^l = \sum_k \delta_k^{l+1} W_{ki}^l \, \sigma'\!\left( h_i^l \right)
 = \sigma'\!\left( h_i^l \right) \sum_k \delta_k^{l+1} W_{ki}^l$

Page 52: IV. Neural Network Learning

Hidden-Layer Derivatives (2)

$\dfrac{\partial h_i^l}{\partial W_{ij}^{l-1}}
 = \dfrac{\partial}{\partial W_{ij}^{l-1}} \sum_k W_{ik}^{l-1} s_k^{l-1}
 = \dfrac{d W_{ij}^{l-1} s_j^{l-1}}{d W_{ij}^{l-1}}
 = s_j^{l-1}$

$\therefore \dfrac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l s_j^{l-1}$,
where $\delta_i^l = \sigma'\!\left( h_i^l \right) \sum_k \delta_k^{l+1} W_{ki}^l$

Page 53: IV. Neural Network Learning

Derivative of Sigmoid

Suppose $s = \sigma(h) = \dfrac{1}{1 + \exp(-\alpha h)}$ (logistic sigmoid)

$D_h s = D_h \left[ 1 + \exp(-\alpha h) \right]^{-1}
 = -\left[ 1 + \exp(-\alpha h) \right]^{-2} D_h \left( 1 + e^{-\alpha h} \right)
 = -\left( 1 + e^{-\alpha h} \right)^{-2} \left( -\alpha e^{-\alpha h} \right)
 = \dfrac{\alpha e^{-\alpha h}}{\left( 1 + e^{-\alpha h} \right)^2}$

$= \alpha \dfrac{1}{1 + e^{-\alpha h}} \dfrac{e^{-\alpha h}}{1 + e^{-\alpha h}}
 = \alpha s \left( \dfrac{1 + e^{-\alpha h}}{1 + e^{-\alpha h}} - \dfrac{1}{1 + e^{-\alpha h}} \right)
 = \alpha s (1 - s)$
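
A quick numeric check of this result; the value of α and the sample points are arbitrary.

```python
import numpy as np

alpha = 2.0
sigma = lambda h: 1.0 / (1.0 + np.exp(-alpha * h))

h = np.linspace(-3, 3, 7)
s = sigma(h)

# Compare the closed form alpha*s*(1-s) with a central finite difference
eps = 1e-6
numeric = (sigma(h + eps) - sigma(h - eps)) / (2 * eps)
print(np.allclose(alpha * s * (1 - s), numeric))   # True
```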

Page 54: IV. Neural Network Learning

Summary of Back-Propagation Algorithm

Output layer:
$\delta_i^L = 2 \alpha s_i^L \left( 1 - s_i^L \right) \left( s_i^L - t_i^q \right)$,
$\quad \dfrac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L s_j^{L-1}$

Hidden layers:
$\delta_i^l = \alpha s_i^l \left( 1 - s_i^l \right) \sum_k \delta_k^{l+1} W_{ki}^l$,
$\quad \dfrac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l s_j^{l-1}$
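
A minimal NumPy sketch of the summarized algorithm, applied online to XOR. The network size, learning rate, number of epochs, and the trick of appending a constant input in place of explicit thresholds are illustrative choices, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 1.0
sigma = lambda h: 1.0 / (1.0 + np.exp(-alpha * h))

def forward(Ws, x):
    """Return the list of layer outputs s^1..s^L (with s^1 = x)."""
    s = [x]
    for W in Ws:                       # h^{l+1} = W^l s^l, s^{l+1} = sigma(h^{l+1})
        s.append(sigma(W @ s[-1]))
    return s

def backprop_update(Ws, x, t, eta=0.5):
    """One online update for sample (x, t), following the slides' delta rules."""
    s = forward(Ws, x)
    # Output layer: delta^L = 2*alpha*s^L*(1-s^L)*(s^L - t)
    delta = 2 * alpha * s[-1] * (1 - s[-1]) * (s[-1] - t)
    for l in reversed(range(len(Ws))):        # Ws[l] connects s[l] to s[l+1]
        grad = np.outer(delta, s[l])          # dE/dW^l_{ij} = delta_i * s_j
        if l > 0:                             # back-propagate delta one layer down
            delta = alpha * s[l] * (1 - s[l]) * (Ws[l].T @ delta)
        Ws[l] -= eta * grad                   # descend the error (ascend fitness)
    return np.sum((s[-1] - t) ** 2)

# Usage: learn XOR with one hidden layer of 4 units; a constant input of 1.0
# is appended to each pattern to stand in for the thresholds.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
T = np.array([[0.0], [1.0], [1.0], [0.0]])
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]

for epoch in range(5000):
    for x, t in zip(X, T):
        backprop_update(Ws, x, t)

# Typically approaches [0, 1, 1, 0]; exact values depend on the random init
print(np.round([forward(Ws, x)[-1][0] for x in X], 2))
```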

Page 55: IV. Neural Network Learning

Output-Layer Computation

[Figure: output-neuron computation; the error signal $t_i^q - s_i^L$ is multiplied by $2\alpha s_i^L (1 - s_i^L)$ to give $\delta_i^L$, which is combined with $s_j^{L-1}$ and the learning rate $\eta$ to give the weight change]

$\delta_i^L = 2 \alpha s_i^L \left( 1 - s_i^L \right) \left( t_i^q - s_i^L \right)$

$\Delta W_{ij}^{L-1} = \eta \, \delta_i^L s_j^{L-1}$

Page 56: IV. Neural Network Learning

Hidden-Layer Computation

[Figure: hidden-neuron computation; the back-propagated deltas $\delta_1^{l+1}, \ldots, \delta_k^{l+1}, \ldots, \delta_N^{l+1}$ are weighted by $W_{ki}^l$, summed, and multiplied by $\alpha s_i^l (1 - s_i^l)$ to give $\delta_i^l$, which is combined with $s_j^{l-1}$ and the learning rate $\eta$ to give the weight change]

$\delta_i^l = \alpha s_i^l \left( 1 - s_i^l \right) \sum_k \delta_k^{l+1} W_{ki}^l$

$\Delta W_{ij}^{l-1} = \eta \, \delta_i^l s_j^{l-1}$

Page 57: IV. Neural Network Learning

Training Procedures

• Batch Learning
  – on each epoch (pass through all the training pairs), the weight changes for all patterns are accumulated
  – the weight matrices are updated at the end of the epoch
  – accurate computation of the gradient
• Online Learning
  – weights are updated after back-propagation of each training pair
  – the order is usually randomized for each epoch
  – approximation of the gradient
• In practice it doesn't make much difference (see the sketch below)
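
A sketch contrasting the two procedures; a single linear neuron with squared error stands in for the network to keep it short, and the data, learning rate, and epoch counts are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative model: one linear neuron, squared error per pattern.
grad = lambda w, x, t: 2 * (w @ x - t) * x        # dE_p/dw for one pattern

X = rng.normal(size=(20, 3))                      # 20 training patterns
T = X @ np.array([1.0, -2.0, 0.5])                # targets from a known weight vector
eta = 0.05

# Batch learning: accumulate the weight changes over the whole epoch, then update.
w_batch = np.zeros(3)
for epoch in range(200):
    total = sum(grad(w_batch, x, t) for x, t in zip(X, T))
    w_batch -= eta * total / len(X)               # exact gradient of the mean error

# Online learning: update after every pattern, in a freshly shuffled order.
w_online = np.zeros(3)
for epoch in range(200):
    for p in rng.permutation(len(X)):
        w_online -= eta * grad(w_online, X[p], T[p])   # noisy gradient estimate

print(np.round(w_batch, 2), np.round(w_online, 2))     # both near [1, -2, 0.5]
```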

Page 58: IV. Neural Network Learning

Summation of Error Surfaces

[Figure: the total error surface E is the sum of the per-pattern error surfaces E1 and E2]

Page 59: IV. Neural Network Learning

Gradient Computation in Batch Learning

[Figure: in batch learning the gradients of E1 and E2 are accumulated before the weights move, so the step follows the gradient of the total error E]

Page 60: IV. Neural Network Learning

Gradient Computation in Online Learning

[Figure: in online learning the weights move after each pattern, following the gradients of E1 and E2 in turn, which only approximates the gradient of E]

Page 61: IV. Neural Network Learning

Testing Generalization

[Figure: from the problem domain, the available data are split into training data and test data]

Page 62: IV. Neural Network Learning

Problem of Rote Learning

[Figure: error vs. epoch; the error on the training data keeps decreasing while the error on the test data eventually rises again; stop training where the test error is at its minimum]
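
A skeleton of the "stop training here" idea. The functions train_one_epoch and validation_error are hypothetical placeholders (in a real run they would wrap the back-propagation code sketched earlier), and the patience counter is a common heuristic for detecting the rising test error, not something stated on the slide.

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=1000, patience=10):
    best_err, best_epoch, waited = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch()                     # one pass over the training data
        err = validation_error()              # error on held-out data
        if err < best_err:
            best_err, best_epoch, waited = err, epoch, 0   # new best: keep going
        else:
            waited += 1
            if waited >= patience:            # held-out error rising: stop
                break
    return best_epoch, best_err

# Toy usage: a fake held-out error curve that falls and then rises again
curve = iter([5.0, 3.0, 2.0, 1.5, 1.4, 1.45, 1.6, 1.8, 2.0, 2.3, 2.6, 3.0])
print(train_with_early_stopping(lambda: None, lambda: next(curve), patience=3))
# -> (4, 1.4): stop at the epoch with the lowest held-out error
```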

Page 63: IV. Neural Network Learning

Improving Generalization

[Figure: from the problem domain, the available data are split into training data, validation data, and test data]

Page 64: IV. Neural Network Learning

A Few Random Tips

• Too few neurons and the ANN may not be able to decrease the error enough
• Too many neurons can lead to rote learning
• Preprocess data to:
  – standardize
  – eliminate irrelevant information
  – capture invariances
  – keep relevant information
• If stuck in a local minimum, restart with different random weights

Page 65: IV. Neural Network Learning

Run Example BP Learning

Page 66: IV. Neural Network Learning

Beyond Back-Propagation

• Adaptive Learning Rate
• Adaptive Architecture
  – Add/delete hidden neurons
  – Add/delete hidden layers
• Radial Basis Function Networks
• Recurrent BP
• Etc., etc., etc. …

Page 67: IV. Neural Network Learning

What is the Power of Artificial Neural Networks?

• With respect to Turing machines?
• As function approximators?

Page 68: IV. Neural Network Learning

Can ANNs Exceed the “Turing Limit”?

• There are many results, which depend sensitively on assumptions; for example:
  – Finite NNs with real-valued weights have super-Turing power (Siegelmann & Sontag '94)
  – Recurrent nets with Gaussian noise have sub-Turing power (Maass & Sontag '99)
  – Finite recurrent nets with real weights can recognize all languages, and thus are super-Turing (Siegelmann '99)
  – Stochastic nets with rational weights have super-Turing power (but only P/POLY, BPP/log*) (Siegelmann '99)
• But computing classes of functions is not a very relevant way to evaluate the capabilities of neural computation

Page 69: IV. Neural Network Learning

A Universal Approximation Theorem

Suppose $f$ is a continuous function on $[0,1]^n$.

Suppose $\sigma$ is a nonconstant, bounded, monotone increasing real function on $\mathbb{R}$.

For any $\varepsilon > 0$, there is an $m$ such that $\exists \mathbf{a} \in \mathbb{R}^m, \mathbf{b} \in \mathbb{R}^m, \mathbf{W} \in \mathbb{R}^{m \times n}$ such that if

$F(x_1, \ldots, x_n) = \sum_{i=1}^{m} a_i \, \sigma\!\left( \sum_{j=1}^{n} W_{ij} x_j + b_i \right)$

[i.e., $F(\mathbf{x}) = \mathbf{a} \cdot \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})$]

then $\left| F(\mathbf{x}) - f(\mathbf{x}) \right| < \varepsilon$ for all $\mathbf{x} \in [0,1]^n$.

(see, e.g., Haykin, Neural Networks, 2nd ed., pp. 208–209)
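
A small NumPy sketch of the form $F(\mathbf{x}) = \mathbf{a} \cdot \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})$ in one dimension. The theorem only asserts that suitable parameters exist; here, purely for illustration, the hidden weights and biases are fixed at random and only the output weights a are fit by least squares. That shortcut, the target function, and all the numbers are my own choices, not part of the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))        # nonconstant, bounded, increasing

f = lambda x: np.sin(2 * np.pi * x)               # continuous target on [0, 1]
x = np.linspace(0, 1, 200).reshape(-1, 1)         # n = 1 here

m = 50                                            # number of hidden units
W = rng.normal(scale=10.0, size=(m, 1))           # hidden weights
b = rng.uniform(-10.0, 10.0, size=m)              # hidden biases

H = sigma(x @ W.T + b)                            # hidden activations, shape (200, m)
a, *_ = np.linalg.lstsq(H, f(x).ravel(), rcond=None)   # output weights

F = H @ a                                         # F(x) = a . sigma(W x + b)
print(np.max(np.abs(F - f(x).ravel())))           # small sup error on the grid
```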

Page 70: IV. Neural Network Learning

One Hidden Layer is Sufficient

• Conclusion: One hidden layer is sufficient to approximate any continuous function arbitrarily closely

[Figure: network with one hidden layer of m sigmoid units (weights W, biases b, plus a constant input 1) feeding a single linear output unit with weights a1, a2, …, am]

Page 71: IV. Neural Network Learning

The Golden Rule of Neural Nets

Neural Networks are the second-best way to do everything!
