
Chapter 4

Artificial Neural Networks

Questions:

• What are ANNs?

• How to learn an ANN? (algorithm)

• The representational power of ANNs (advantages and disadvantages)

What are ANNs? ------ Background

Consider humans

• Neuron switching time: ~0.001 second

• Number of neurons: ~10^10

• Connections per neuron: ~10^4 to 10^5

• Scene recognition time: ~0.1 second

→ much highly parallel computation (only about 100 sequential neuron firings fit into 0.1 second)

• Property of neuron: thresholded unit

One motivation for ANN systems is to capture this kind of highly parallel computation based on distributed representation.

• Classification

• Voice recognition

• others

What are ANNs? ----- Problems related to ANNs


Properties of artificial neural nets (ANNs)

• Many neuron-like threshold switching units

• Many weighted interconnections among units

• Highly parallel, distributed processing

• Emphasis on tuning weights automatically

4.1 Perceptrons

o(x1, ..., xn) = 1  if  ω0 + ω1·x1 + ... + ωn·xn > 0,  and -1 otherwise.

To simplify notation, set x0 = 1. Then

o(x) = sgn(ω·x) = sgn( Σ_{i=0..n} ωi·xi )

where ω = (ω0, ω1, ..., ωn) and x = (x0, x1, ..., xn).

Learning a perceptron involves choosing values for the weights ω0, ..., ωn. Therefore, the space H of candidate hypotheses considered in perceptron learning is the set of all possible real-valued weight vectors:

H = { ω | ω ∈ R^(n+1) }

We can view the perceptron as representing a hyperplane decision surface in the n-dimensional space of instances.

Two ways to train a perceptron:

Perceptron Training Rule and Gradient Descent

(Figure: a perceptron unit with inputs x1, ..., xn, weights ω1, ..., ωn, and output o(x) = sgn( Σ_{i=0..n} ωi·xi ).)

(1). Perceptron Training Rule

ωi ← ωi + Δωi,  where  Δωi = η(t - o)xi

• t is the target value

• o is the perceptron output

• η is a small constant called the learning rate

• Initialize each ωi to a random value in the given interval

• Update the value of each ωi according to the training examples (a minimal sketch follows below)
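A minimal Python sketch of the perceptron training rule, assuming the loop structure above; the data set, epoch count, and learning rate below are illustrative assumptions, not part of the original slides.

import random

def sgn(z):
    return 1 if z > 0 else -1

def train_perceptron(examples, eta=0.1, epochs=100):
    """examples: list of (x, t) pairs, where each x already includes x0 = 1."""
    n = len(examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]   # random initial weights
    for _ in range(epochs):
        for x, t in examples:
            o = sgn(sum(wi * xi for wi, xi in zip(w, x)))  # perceptron output
            for i in range(n):
                w[i] += eta * (t - o) * x[i]               # Δωi = η(t - o)xi
    return w

# Example use: learning AND over inputs in {-1, +1}, with x0 = 1 prepended
data = [([1, -1, -1], -1), ([1, -1, 1], -1), ([1, 1, -1], -1), ([1, 1, 1], 1)]
w = train_perceptron(data)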

• A single perceptron can be used to represent many boolean functions, such as AND, OR, NAND, and NOR, but it fails to represent XOR.

• E.g.: g(x1, x2) = AND(x1, x2)

o(x1, x2) = sgn(-0.8 + 0.5x1 + 0.5x2)

Representation Power of Perceptrons

x1   x2   -0.8 + 0.5x1 + 0.5x2   o
-1   -1   -1.8                   -1
-1    1   -0.8                   -1
 1   -1   -0.8                   -1
 1    1    0.2                    1

Representation Power of Perceptrons

(a) The perceptron training rule can be proven to converge
  • if the training data are linearly separable
  • and if η is sufficiently small.
(b) But some functions are not representable, e.g., training data that are not linearly separable.
(c) Every boolean function can be represented by some network of perceptrons only two levels deep.

(2). Gradient Descent

Key idea: searching the hypothesis space to find the weights that best fit the training examples.

Best fit: minimize the squared error

E(ω) = (1/2) Σ_{d∈D} (td - od)²

where D is the set of training examples, td is the target output for example d, and od is the output of the linear unit for example d.

Gradient:

∇E(ω) = ( ∂E/∂ω0, ∂E/∂ω1, ..., ∂E/∂ωn )

Training rule:

Δω = -η∇E(ω),  or component-wise,  Δωi = -η ∂E/∂ωi

Gradient Descent

∂E/∂ωi = ∂/∂ωi (1/2) Σ_{d∈D} (td - od)²
       = Σ_{d∈D} (td - od) · ∂/∂ωi (td - ω·xd)
       = Σ_{d∈D} (td - od)(-xid)

Therefore  Δωi = η Σ_{d∈D} (td - od) xid,  and  ωi ← ωi + Δωi


Gradient Descent Algorithm

• Initialize each ωi to some small random value

• Until the termination condition is met, Do

  – Initialize each Δωi to zero.

  – For each <x, t> in the training examples, Do

    • Input the instance x to the unit and compute the output o

    • For each linear unit weight ωi, Do

        Δωi ← Δωi + η(t - o)xi

  – For each linear unit weight ωi, Do

        ωi ← ωi + Δωi
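A minimal Python sketch of this batch gradient-descent loop for a linear unit; the fixed epoch count used as a termination condition and the learning rate are illustrative assumptions.

def gradient_descent(examples, eta=0.05, epochs=200):
    """examples: list of (x, t) pairs, each x including x0 = 1; the unit is linear, o = ω·x."""
    n = len(examples[0][0])
    w = [0.0] * n                                          # initialize each ωi
    for _ in range(epochs):                                # termination condition: fixed epochs
        delta = [0.0] * n                                  # initialize each Δωi to zero
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))       # linear unit output
            for i in range(n):
                delta[i] += eta * (t - o) * x[i]           # Δωi ← Δωi + η(t - o)xi
        for i in range(n):
            w[i] += delta[i]                               # ωi ← ωi + Δωi
    return w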

When to use gradient descent

• The hypothesis space contains continuously parameterized hypotheses

• The error can be differentiated with respect to the hypothesis parameters

Advantages vs. Disadvantages

Advantages

• Guaranteed to converge to a hypothesis with locally minimum error, given a sufficiently small learning rate η;

• works even when the training data contain noise;

• works even when the training data are not linearly separable;

• for a single linear unit, the squared-error surface has a single global minimum, so gradient descent converges to it.

Disadvantages

• Convergence can sometimes be very slow;

• there is no guarantee of converging to the global minimum when there are multiple local minima.

Incremental (Stochastic) Gradient Descent

Standard (batch) Gradient Descent:

Do until satisfied

  1. Compute the gradient ∇E_D(ω)
  2. ω ← ω - η∇E_D(ω)

Stochastic (incremental) Gradient Descent:

Do until satisfied

  For each training example d in D

    1. Compute the gradient ∇E_d(ω)
    2. ω ← ω - η∇E_d(ω)

where  E_D(ω) = (1/2) Σ_{d∈D} (td - od)²  and  E_d(ω) = (1/2)(td - od)²
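For contrast with the batch loop sketched earlier, a minimal sketch of the stochastic (per-example) update for the same linear unit; the learning rate and epoch count are again illustrative assumptions.

def stochastic_gradient_descent(examples, eta=0.05, epochs=200):
    """Update the weights after every training example instead of once per pass over D."""
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, t in examples:                              # one example d at a time
            o = sum(wi * xi for wi, xi in zip(w, x))       # linear unit output od
            for i in range(n):
                w[i] += eta * (t - o) * x[i]               # ωi ← ωi + η(td - od)xid
    return w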

Standard Gradient Descent vs. Stochastic Gradient Descent

• Stochastic Gradient Descent can approximate Standard Gradient Descent arbitrarily closely if η is made small enough;

• Stochastic mode can converge faster;

• Stochastic Gradient descent can sometimes avoid falling into local minima.

(3). Perceptron Training Rule vs. Gradient Descent

Perceptron training rule

• uses the thresholded perceptron output: o(x) = sgn(ω·x)
• requires that the training examples be linearly separable
• converges to a hypothesis that perfectly classifies the training data

Gradient descent

• uses the unthresholded linear output: o(x) = ω·x
• applies regardless of whether the training data are linearly separable
• converges asymptotically toward the minimum-error hypothesis

4.2 Multilayer Networks

(Figure: a single perceptron vs. a multilayer network.)

Perceptrons can only express linear decision surfaces; we need networks that can express a rich variety of nonlinear decision surfaces.

Sigmoid unit – a differentiable threshold unit

Sigmoid function:  σ(x) = 1 / (1 + e^(-kx))   (here k = 1)

Property:  dσ(x)/dx = σ(x)(1 - σ(x))

Output:  o = σ(net) = 1 / (1 + e^(-net)),  where  net = ω·x
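A small Python sketch of the sigmoid unit and the derivative property above; plain Python only, no external libraries assumed.

import math

def sigmoid(x, k=1.0):
    """σ(x) = 1 / (1 + e^(-kx)); the slides use k = 1."""
    return 1.0 / (1.0 + math.exp(-k * x))

def sigmoid_derivative(x):
    """dσ/dx = σ(x)(1 - σ(x)), the property that makes gradient computation convenient."""
    s = sigmoid(x)
    return s * (1.0 - s)

def sigmoid_unit_output(w, x):
    """o = σ(net), where net = ω·x."""
    net = sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(net)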

Why do we use the sigmoid instead of a linear unit or sgn(x)? Like sgn, it is a threshold-like nonlinearity, but unlike sgn it is differentiable, so gradient descent can be applied; a purely linear unit would leave even a multilayer network able to express only linear functions.

The main idea of the backpropagation algorithm:

• compute the input and output of each unit in a forward pass;

• modify the weights of unit pairs in a backward pass, with respect to the errors.

The Backpropagation Algorithm

Error definition:

Batch mode:       E_D(ω) = (1/2) Σ_{d∈D} Σ_{k∈outputs} (tkd - okd)²

Individual mode:  E_d(ω) = (1/2) Σ_{k∈outputs} (tkd - okd)²

Notation:

• x_ji = the i-th input to unit j
• ω_ji = the weight associated with the i-th input to unit j
• net_j = Σ_i ω_ji·x_ji  (the weighted sum of inputs for unit j)
• o_j = the output computed by unit j
• t_j = the target output for unit j
• outputs = the set of units in the final layer
• Ds(j) = the set of units whose immediate inputs include the output of unit j

(Figure: unit j, whose i-th input x_ji is the output o_i of the preceding unit i, weighted by ω_ji.)

Training Rule for Output Unit Weights

∂E_d/∂ω_ji = (∂E_d/∂net_j)·x_ji,  and

∂E_d/∂net_j = (∂E_d/∂o_j)(∂o_j/∂net_j)

∂E_d/∂o_j = ∂/∂o_j [ (1/2) Σ_j (t_j - o_j)² ] = -(t_j - o_j)

∂o_j/∂net_j = ∂σ(net_j)/∂net_j = o_j(1 - o_j)

Therefore  ∂E_d/∂net_j = -(t_j - o_j)·o_j(1 - o_j)

and  Δω_ji = -η ∂E_d/∂ω_ji = η (t_j - o_j) o_j (1 - o_j) x_ji
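For instance (with made-up numbers, not from the slides): if an output unit has o_j = 0.8, target t_j = 1.0, input x_ji = 0.5, and η = 0.1, then Δω_ji = η (t_j - o_j) o_j (1 - o_j) x_ji = 0.1 × 0.2 × 0.8 × 0.2 × 0.5 = 0.0016.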

Training Rule for Hidden Unit Weights

∂E_d/∂net_j = Σ_{k∈Ds(j)} (∂E_d/∂net_k)(∂net_k/∂net_j)

            = Σ_{k∈Ds(j)} (-δ_k)(∂net_k/∂o_j)(∂o_j/∂net_j)

            = Σ_{k∈Ds(j)} (-δ_k)·ω_kj·o_j(1 - o_j)

Denote the error term of unit j by  δ_j = -∂E_d/∂net_j.  Then we have

δ_j = o_j(1 - o_j) Σ_{k∈Ds(j)} δ_k·ω_kj    and    Δω_ji = η δ_j x_ji

Backpropagation Algorithm

• Initialize all weights to small random numbers

• Until termination condition is met Do

For each training example Do

//Propagate the input forward

1. Input the training example to the network and compute the network outputs

//Propagate the errors backward

2. For each output unit k, compute its error term δk:

       δk ← ok(1 - ok)(tk - ok)

3. For each hidden unit h, compute its error term δh:

       δh ← oh(1 - oh) Σ_{k∈outputs} ωkh·δk

4. Update each network weight ωji:

       ωji ← ωji + Δωji,  where  Δωji = η δj xji
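A compact Python sketch of this stochastic backpropagation loop for one hidden layer of sigmoid units; the layer sizes, omission of bias terms, learning rate, and epoch count are illustrative assumptions, not from the slides.

import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop(examples, n_in, n_hidden, n_out, eta=0.3, epochs=1000):
    """examples: list of (x, t) with len(x) == n_in and len(t) == n_out."""
    # Initialize all weights to small random numbers
    w_hid = [[random.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_hidden)]
    w_out = [[random.uniform(-0.05, 0.05) for _ in range(n_hidden)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in examples:
            # 1. Propagate the input forward and compute the outputs
            o_hid = [sigmoid(sum(w * xi for w, xi in zip(w_hid[h], x))) for h in range(n_hidden)]
            o_out = [sigmoid(sum(w * oh for w, oh in zip(w_out[k], o_hid))) for k in range(n_out)]
            # 2. Error terms for output units: δk = ok(1 - ok)(tk - ok)
            d_out = [o_out[k] * (1 - o_out[k]) * (t[k] - o_out[k]) for k in range(n_out)]
            # 3. Error terms for hidden units: δh = oh(1 - oh) Σk ωkh δk
            d_hid = [o_hid[h] * (1 - o_hid[h]) * sum(w_out[k][h] * d_out[k] for k in range(n_out))
                     for h in range(n_hidden)]
            # 4. Update each weight: ωji ← ωji + η δj xji
            for k in range(n_out):
                for h in range(n_hidden):
                    w_out[k][h] += eta * d_out[k] * o_hid[h]
            for h in range(n_hidden):
                for i in range(n_in):
                    w_hid[h][i] += eta * d_hid[h] * x[i]
    return w_hid, w_out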

Hidden layer Representations


Convergence and local minima

• Backpropagation converges to some local minimum of the error, and not necessarily to the global minimum.

• Use stochastic gradient descent rather than standard gradient descent.

• Initialization influences convergence: train multiple networks with different initial random weights over the same data, then select the best one.

• Training can take thousands of iterations → slow.

• Initialize the weights near zero, so the initial network is nearly linear; increasingly nonlinear functions become possible as training progresses.

• Add a momentum term to speed convergence:

  Δωji(n) = η δj xji + α Δωji(n - 1),  with 0 ≤ α < 1
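A sketch of how the momentum term changes a single weight update; the value of α and the function name are illustrative assumptions.

def momentum_step(w, prev_delta, delta_j, x_ji, eta=0.3, alpha=0.9):
    """One weight update with momentum: Δωji(n) = η δj xji + α Δωji(n - 1).
    Returns the new weight and the new Δωji to remember for the next iteration."""
    delta = eta * delta_j * x_ji + alpha * prev_delta
    return w + delta, delta

# Usage: keep one prev_delta per weight, e.g.
w, prev = 0.1, 0.0
w, prev = momentum_step(w, prev, delta_j=0.05, x_ji=1.0)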

Expressive Capabilities of ANNs

• Every boolean function can be represented by a network with a single hidden layer

• Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer

• Any function can be approximated to arbitrary accuracy by a network with two hidden layers

• A network with more hidden layers may achieve higher precision; however, the possibility of converging to a local minimum increases as well.

When to Consider Neural Networks

• Input is high-dimensional, discrete or real-valued

• Output is discrete or real valued

• Output is a vector of values

• Possibly noisy data

• Form of target function is unknown

• Human readability of result is unimportant

Overfitting in ANNs

Strategies applied to avoid overfitting

• Poor strategy: continue training until the training error falls below some threshold (the training error keeps decreasing even while the error over the validation set rises).

• A good indicator: the number of iterations that produces the lowest error over the validation set.

• Once the current weights reach a significantly higher error over the validation set than the stored best-so-far weights, terminate (a sketch follows below).
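A minimal sketch of this validation-based early-stopping check; the callables train_step and validation_error, the tolerance, and the iteration cap are placeholders/assumptions.

def train_with_early_stopping(train_step, validation_error, max_iters=10000, tolerance=1.05):
    """train_step(): performs one training iteration and returns the current weights;
    validation_error(w): error of weights w over the validation set.
    Stops once the validation error significantly exceeds the best stored error."""
    best_w, best_err = None, float("inf")
    for _ in range(max_iters):
        w = train_step()
        err = validation_error(w)
        if err < best_err:
            best_w, best_err = w, err          # store the best weights seen so far
        elif err > tolerance * best_err:
            break                              # significantly worse than stored weights: terminate
    return best_w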

Alternative Error Functions

Recurrent Networks


Thank you!