Artificial Neural Networks

Page 1: Artificial Neural Networks

Artificial Neural Networks
[Read Ch. 4]
[Recommended exercises 4.1, 4.2, 4.5, 4.9, 4.11]
- Threshold units
- Gradient descent
- Multilayer networks
- Backpropagation
- Hidden layer representations
- Example: Face Recognition
- Advanced topics

Lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997.

Page 2: Artificial Neural Networks

Connectionist Models

Consider humans:
- Neuron switching time ~ 0.001 second
- Number of neurons ~ 10^10
- Connections per neuron ~ 10^4 to 10^5
- Scene recognition time ~ 0.1 second
- 100 inference steps doesn't seem like enough → much parallel computation

Properties of artificial neural nets (ANNs):
- Many neuron-like threshold switching units
- Many weighted interconnections among units
- Highly parallel, distributed process
- Emphasis on tuning weights automatically

Page 3: Artificial Neural Networks

When to Consider Neural Networks
- Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
- Output is discrete or real valued
- Output is a vector of values
- Possibly noisy data
- Form of target function is unknown
- Human readability of result is unimportant

Examples:
- Speech phoneme recognition [Waibel]
- Image classification [Kanade, Baluja, Rowley]
- Financial prediction

Page 4: Artificial Neural Networks

ALVINN drives 70 mph on highways

[Figure: the ALVINN network, in which a 30x32 sensor input retina feeds 4 hidden units, which feed 30 output units spanning steering directions from Sharp Left through Straight Ahead to Sharp Right.]

Page 5: Artificial Neural Networks

Perceptron

[Figure: a perceptron unit. Inputs x_1, ..., x_n with weights w_1, ..., w_n, plus a fixed input x_0 = 1 with weight w_0, feed the sum \sum_{i=0}^{n} w_i x_i, followed by a threshold that outputs 1 if the sum is greater than 0 and -1 otherwise.]

o(x_1, \ldots, x_n) = 1 if w_0 + w_1 x_1 + \cdots + w_n x_n > 0, and -1 otherwise.

Sometimes we'll use simpler vector notation:

o(\vec{x}) = 1 if \vec{w} \cdot \vec{x} > 0, and -1 otherwise.
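A minimal sketch of this threshold unit in Python; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def perceptron_output(w, x):
    """Threshold unit: returns 1 if w . x > 0, else -1.

    w includes the bias weight w0; x is augmented with x0 = 1.
    """
    return 1 if np.dot(w, x) > 0 else -1

# Illustrative weights and one augmented input vector [x0 = 1, x1, x2]
w = np.array([-0.5, 1.0, 1.0])
x = np.array([1.0, 0.2, 0.7])
print(perceptron_output(w, x))  # -0.5 + 0.2 + 0.7 = 0.4 > 0, so prints 1
```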

Page 6: Artificial Neural Networks

Decision Surface of a Perceptron

[Figure: (a) + and - training points in the (x_1, x_2) plane separated by a perceptron's decision line; (b) an arrangement of + and - points that no single line can separate.]

Represents some useful functions
- What weights represent g(x_1, x_2) = AND(x_1, x_2)? (one choice is sketched below)

But some functions not representable
- e.g., not linearly separable
- Therefore, we'll want networks of these...
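For the AND question above, one standard weight setting (assuming inputs in {0, 1}) is w_0 = -0.8, w_1 = w_2 = 0.5; a quick check in Python:

```python
# One weight setting that implements AND for inputs in {0, 1}
# (w0 is the bias weight; this particular choice is illustrative).
w0, w1, w2 = -0.8, 0.5, 0.5

def and_perceptron(x1, x2):
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, and_perceptron(x1, x2))
# Only (1, 1) yields +1; the other three inputs yield -1.
```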

Page 7: Artificial Neural Networks

Perceptron training rule

w_i ← w_i + \Delta w_i

where \Delta w_i = \eta (t - o) x_i

Where:
- t = c(\vec{x}) is target value
- o is perceptron output
- \eta is a small constant (e.g., 0.1) called the learning rate
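A minimal sketch of one application of this rule to a single example, assuming NumPy arrays; the function and variable names are illustrative:

```python
import numpy as np

def perceptron_train_step(w, x, t, eta=0.1):
    """Apply w_i <- w_i + eta * (t - o) * x_i for one example (x, t)."""
    o = 1 if np.dot(w, x) > 0 else -1   # current perceptron output
    return w + eta * (t - o) * x        # updated weight vector

# Example: nudge zero weights toward classifying x as +1.
w = np.zeros(3)                          # [w0, w1, w2], bias included
x = np.array([1.0, 2.0, -1.0])           # x0 = 1 plus two features
w = perceptron_train_step(w, x, t=1)
print(w)                                 # -> [0.2, 0.4, -0.2]
```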

Page 8: Artificial Neural Networks

Perceptron training rule

Can prove it will converge
- if training data is linearly separable
- and \eta is sufficiently small

Page 9: Artificial Neural Networks

Gradient Descent

To understand, consider a simpler linear unit, where

o = w_0 + w_1 x_1 + \cdots + w_n x_n

Let's learn w_i's that minimize the squared error

E[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

where D is the set of training examples.

Page 10: Artificial Neural Networks

Gradient Descent

[Figure: the error surface E[\vec{w}] plotted over two weights (w_0, w_1).]

Gradient:

\nabla E[\vec{w}] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \cdots, \frac{\partial E}{\partial w_n} \right]

Training rule:

\Delta \vec{w} = -\eta \nabla E[\vec{w}]

i.e., \Delta w_i = -\eta \frac{\partial E}{\partial w_i}

Page 11: Artificial Neural Networks

Gradient Descent

\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_d (t_d - o_d)^2
= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2
= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)
= \sum_d (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d)

\frac{\partial E}{\partial w_i} = \sum_d (t_d - o_d)(-x_{i,d})

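A quick numerical sanity check of this derivative in Python; the data and variable names are made up for illustration. The analytic gradient \sum_d (t_d - o_d)(-x_{i,d}) should match a finite-difference estimate of E:

```python
import numpy as np

X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])  # each row: [x0 = 1, x1]
t = np.array([1.0, -1.0, 0.5])
w = np.array([0.3, -0.2])

def E(w):
    o = X @ w                                   # linear unit outputs
    return 0.5 * np.sum((t - o) ** 2)

# Analytic gradient: dE/dw_i = sum_d (t_d - o_d) * (-x_{i,d})
o = X @ w
grad = -(t - o) @ X

# Finite-difference estimate of dE/dw_1
eps = 1e-6
w_plus = w.copy()
w_plus[1] += eps
approx = (E(w_plus) - E(w)) / eps
print(grad[1], approx)   # the two numbers should agree closely
```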

Page 12: Artificial Neural Networks

Gradient Descent

GRADIENT-DESCENT(training_examples, \eta)

Each training example is a pair of the form <\vec{x}, t>, where \vec{x} is the vector of input values and t is the target output value. \eta is the learning rate (e.g., 0.05).

- Initialize each w_i to some small random value
- Until the termination condition is met, Do
  - Initialize each \Delta w_i to zero
  - For each <\vec{x}, t> in training_examples, Do
    - Input the instance \vec{x} to the unit and compute the output o
    - For each linear unit weight w_i, Do
      \Delta w_i ← \Delta w_i + \eta (t - o) x_i
  - For each linear unit weight w_i, Do
    w_i ← w_i + \Delta w_i
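A compact Python rendering of this batch procedure, assuming NumPy arrays and a fixed epoch count as the termination condition (both are illustrative choices, as are the names):

```python
import numpy as np

def gradient_descent(X, t, eta=0.05, epochs=100, seed=0):
    """Batch gradient descent for a single linear unit.

    X: (m, n) array of input vectors (include a column of 1s for w0).
    t: (m,) array of target output values.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.05, size=X.shape[1])   # small random initial weights
    for _ in range(epochs):                       # termination: fixed epoch count
        delta_w = np.zeros_like(w)
        for x_d, t_d in zip(X, t):
            o = x_d @ w                           # linear unit output
            delta_w += eta * (t_d - o) * x_d      # Delta w_i += eta (t - o) x_i
        w += delta_w                              # w_i += Delta w_i
    return w
```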

Page 13: Artificial Neural Networks

Summary

Perceptron training rule guaranteed to succeed if
- Training examples are linearly separable
- Sufficiently small learning rate \eta

Linear unit training rule uses gradient descent
- Guaranteed to converge to hypothesis with minimum squared error
- Given sufficiently small learning rate \eta
- Even when training data contains noise
- Even when training data not separable by H

Page 14: Artificial Neural Networks

Incremental (Stochastic) Gradient Descent

Batch mode Gradient Descent:
Do until satisfied
1. Compute the gradient \nabla E_D[\vec{w}]
2. \vec{w} ← \vec{w} - \eta \nabla E_D[\vec{w}]

Incremental mode Gradient Descent:
Do until satisfied
- For each training example d in D
  1. Compute the gradient \nabla E_d[\vec{w}]
  2. \vec{w} ← \vec{w} - \eta \nabla E_d[\vec{w}]

E_D[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

E_d[\vec{w}] \equiv \frac{1}{2} (t_d - o_d)^2

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if \eta is made small enough.
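A minimal sketch of the incremental (per-example) variant, contrasted with the batch routine above; the function name and stopping rule are illustrative:

```python
import numpy as np

def incremental_gradient_descent(X, t, eta=0.05, epochs=100, seed=0):
    """Update the weights after every example instead of once per pass."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.05, size=X.shape[1])
    for _ in range(epochs):
        for x_d, t_d in zip(X, t):
            o = x_d @ w                     # output for this single example
            w = w + eta * (t_d - o) * x_d   # step along -grad E_d
    return w
```

With a small enough eta, the per-example trajectory tracks the batch version closely, as the slide notes.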

Page 15: Artificial Neural Networks

Multilayer Networks of Sigmoid Units

[Figure: a multilayer network with inputs F1 and F2 and output classes "head", "hid", "who'd", and "hood", together with the decision regions it produces in the F1-F2 plane.]

Page 16: Artificial Neural Networks

Sigmoid Unit

[Figure: a sigmoid unit. Inputs x_1, ..., x_n with weights w_1, ..., w_n, plus a fixed input x_0 = 1 with weight w_0, feed net = \sum_{i=0}^{n} w_i x_i, and the output is o = \sigma(net) = \frac{1}{1 + e^{-net}}.]

\sigma(x) is the sigmoid function:

\sigma(x) = \frac{1}{1 + e^{-x}}

Nice property: \frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))

We can derive gradient descent rules to train
- One sigmoid unit
- Multilayer networks of sigmoid units → Backpropagation
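A short sketch of the sigmoid and its derivative property, with a numerical check (the names and test point are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # d(sigma)/dx = sigma(x) * (1 - sigma(x))

# Verify the property against a central finite difference at x = 0.7
x, eps = 0.7, 1e-6
print(sigmoid_deriv(x))
print((sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps))  # nearly identical
```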

Page 17: Artificial Neural Networks

Error Gradient for a Sigmoid Unit

\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2
= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2
= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)
= \sum_d (t_d - o_d) \left( -\frac{\partial o_d}{\partial w_i} \right)
= -\sum_d (t_d - o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i}

But we know:

\frac{\partial o_d}{\partial net_d} = \frac{\partial \sigma(net_d)}{\partial net_d} = o_d (1 - o_d)

\frac{\partial net_d}{\partial w_i} = \frac{\partial (\vec{w} \cdot \vec{x}_d)}{\partial w_i} = x_{i,d}

So:

\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d)\, o_d (1 - o_d)\, x_{i,d}

Page 18: Artificial Neural Networks

Backpropagation Algorithm

Initialize all weights to small random numbers.
Until satisfied, Do
- For each training example, Do
  1. Input the training example to the network and compute the network outputs
  2. For each output unit k
     \delta_k ← o_k (1 - o_k)(t_k - o_k)
  3. For each hidden unit h
     \delta_h ← o_h (1 - o_h) \sum_{k \in outputs} w_{h,k} \delta_k
  4. Update each network weight w_{i,j}
     w_{i,j} ← w_{i,j} + \Delta w_{i,j}, where \Delta w_{i,j} = \eta \delta_j x_{i,j}
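A minimal sketch of these four steps for a single hidden layer and one training example, using NumPy; the array shapes and names (W_hidden, W_output, etc.) are illustrative, and bias weights are omitted for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W_hidden, W_output, x, t, eta=0.1):
    """One stochastic update for a two-layer sigmoid network (no bias terms)."""
    # 1. Forward pass: compute hidden and output activations
    o_hidden = sigmoid(W_hidden @ x)
    o_out = sigmoid(W_output @ o_hidden)

    # 2. delta_k = o_k (1 - o_k)(t_k - o_k) for each output unit
    delta_out = o_out * (1 - o_out) * (t - o_out)

    # 3. delta_h = o_h (1 - o_h) * sum_k w_{h,k} delta_k for each hidden unit
    delta_hidden = o_hidden * (1 - o_hidden) * (W_output.T @ delta_out)

    # 4. w_{i,j} <- w_{i,j} + eta * delta_j * x_{i,j}
    W_output += eta * np.outer(delta_out, o_hidden)
    W_hidden += eta * np.outer(delta_hidden, x)
    return W_hidden, W_output
```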

Page 19: Artificial Neural Networks

More on Backpropagation
- Gradient descent over entire network weight vector
- Easily generalized to arbitrary directed graphs
- Will find a local, not necessarily global, error minimum
  - In practice, often works well (can run multiple times)
- Often include weight momentum \alpha (see the sketch below):
  \Delta w_{i,j}(n) = \eta \delta_j x_{i,j} + \alpha \Delta w_{i,j}(n - 1)
- Minimizes error over training examples
  - Will it generalize well to subsequent examples?
- Training can take thousands of iterations → slow!
- Using network after training is very fast
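A small sketch of the momentum term applied to one weight matrix, extending the backprop_step idea above; the alpha value and the variable names are illustrative:

```python
import numpy as np

def momentum_update(W, grad_term, prev_delta, eta=0.1, alpha=0.9):
    """delta_w(n) = eta * grad_term + alpha * delta_w(n-1).

    grad_term is the delta_j * x_{i,j} matrix for this layer
    (e.g., np.outer(delta_out, o_hidden) from the backprop sketch).
    """
    delta = eta * grad_term + alpha * prev_delta
    return W + delta, delta   # keep delta to reuse as prev_delta next step

# Usage: carry prev_delta (initially np.zeros_like(W)) across training steps.
```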

Page 20: Artificial Neural Networks

Learning Hidden Layer Representations

[Figure: a network with eight input units and eight output units connected through three hidden units.]

A target function:

Input    → Output
10000000 → 10000000
01000000 → 01000000
00100000 → 00100000
00010000 → 00010000
00001000 → 00001000
00000100 → 00000100
00000010 → 00000010
00000001 → 00000001

Can this be learned?

Page 21: Artificial Neural Networks

Learning Hidden Layer Representations

A network: [Figure: the same eight-input, three-hidden-unit, eight-output network.]

Learned hidden layer representation:

Input    →  Hidden values  →  Output
10000000 →  .89 .04 .08    →  10000000
01000000 →  .01 .11 .88    →  01000000
00100000 →  .01 .97 .27    →  00100000
00010000 →  .99 .97 .71    →  00010000
00001000 →  .03 .05 .02    →  00001000
00000100 →  .22 .99 .99    →  00000100
00000010 →  .80 .01 .98    →  00000010
00000001 →  .60 .94 .01    →  00000001
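A sketch of how this 8-3-8 identity mapping could be trained with the backprop_step and sigmoid routines from the earlier sketch. The epoch count and learning rate are illustrative; because this simplified network omits the bias (threshold) weights used in the textbook network, the codes it finds will be rougher than the table above and may need more epochs to settle:

```python
import numpy as np

# Uses sigmoid() and backprop_step() from the backpropagation sketch above.
rng = np.random.default_rng(0)
X = np.eye(8)                                   # the eight one-hot inputs
T = np.eye(8)                                   # identity targets: output == input
W_hidden = rng.normal(scale=0.1, size=(3, 8))   # 8 inputs -> 3 hidden units
W_output = rng.normal(scale=0.1, size=(8, 3))   # 3 hidden units -> 8 outputs

for _ in range(5000):                           # many passes over the 8 examples
    for x, t in zip(X, T):
        W_hidden, W_output = backprop_step(W_hidden, W_output, x, t, eta=0.3)

# Inspect the learned 3-value hidden code for each of the eight inputs
print(np.round(sigmoid(W_hidden @ X.T).T, 2))
```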

Page 22: Artificial Neural Networks

Training

[Figure: sum of squared errors for each output unit, plotted over roughly 2500 training iterations.]

Page 23: Artificial Neural Networks

Training

[Figure: the three hidden unit encoding values for input 01000000, plotted over roughly 2500 training iterations.]

Page 24: Artificial Neural Networks

Training

[Figure: the weights from the inputs to one hidden unit, plotted over roughly 2500 training iterations.]

Page 25: Artificial Neural Networks

Convergence of Backpropagation

Gradient descent to some local minimum
- Perhaps not global minimum...
- Add momentum
- Stochastic gradient descent
- Train multiple nets with different initial weights

Nature of convergence
- Initialize weights near zero
- Therefore, initial networks near-linear
- Increasingly non-linear functions possible as training progresses

Page 26: Artificial Neural Networks

Expressive Capabilities of ANNs

Boolean functions:
- Every boolean function can be represented by a network with a single hidden layer
- but might require exponential (in number of inputs) hidden units

Continuous functions:
- Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
- Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]

Page 27: Artificial Neural Networks

Overfitting in ANNs

[Figure: error versus number of weight updates (example 1), showing training set error and validation set error.]

[Figure: error versus number of weight updates (example 2), showing training set error and validation set error.]

Page 28: Artificial Neural Networks

Neural Nets for Face Recognition

[Figure: a network with 30x32 image inputs and outputs for head pose (left, strt, rght, up), shown with typical input images.]

90% accurate learning head pose, and recognizing 1-of-20 faces


Page 29: Artificial Neural Networks

Learned Hidden Unit Weights

[Figure: the same 30x32-input network with outputs left, strt, rght, up, alongside visualizations of the learned weights and typical input images.]

http://www.cs.cmu.edu/~tom/faces.html

Page 30: Artificial Neural Networks

Alternative Error Functions

Penalize large weights:

E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2

Train on target slopes as well as values:

E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} \left[ (t_{kd} - o_{kd})^2 + \mu \sum_{j \in inputs} \left( \frac{\partial t_{kd}}{\partial x_{jd}} - \frac{\partial o_{kd}}{\partial x_{jd}} \right)^2 \right]

Tie together weights:
- e.g., in phoneme recognition network
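A small sketch of the weight-penalty idea: the gradient of the penalty term adds a shrinkage step ("weight decay") to each update. The gamma value, eta value, and names are illustrative:

```python
import numpy as np

def penalized_error(T, O, weights, gamma=1e-4):
    """0.5 * sum of squared output errors plus gamma * sum of squared weights."""
    sq_err = 0.5 * np.sum((T - O) ** 2)
    penalty = gamma * sum(np.sum(W ** 2) for W in weights)
    return sq_err + penalty

def decayed_update(W, delta_w, gamma=1e-4, eta=0.1):
    # d/dW of gamma * sum(W^2) is 2 * gamma * W, so gradient descent on the
    # penalized error shrinks every weight a little on each step.
    return W + delta_w - eta * 2 * gamma * W
```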

Page 31: Artificial Neural Networks

Recurrent Networks

[Figure: (a) a feedforward network computing y(t + 1) from input x(t); (b) a recurrent network computing y(t + 1) from x(t) and context units c(t); (c) the recurrent network unfolded in time, with copies of the units at times t, t - 1, and t - 2.]

