

Neural Networks

Introduction to Artificial Intelligence

COS302

Michael L. Littman

Fall 2001

Administration

11/28 Neural Networks
Ch. 19 [19.3, 19.4]

12/03 Latent Semantic Indexing

12/05 Belief Networks
Ch. 15 [15.1, 15.2]

12/10 Belief Network Inference
Ch. 19 [19.6]

Proposal

11/28 Neural Networks
Ch. 19 [19.3, 19.4]

12/03 Backpropagation in NNs

12/05 Latent Semantic Indexing

12/10 Segmentation

Regression: Data

x1 = 2    y1 = 1
x2 = 6    y2 = 2.2
x3 = 4    y3 = 2
x4 = 3    y4 = 1.9
x5 = 4    y5 = 3.1

Given x, want to predict y.

Regression: Picture

[Scatter plot of the five (x, y) data points; x axis 0 to 8, y axis 0 to 3.5.]

Linear Regression

Linear regression assumes that the expected value of the output given an input E(y|x) is linear.

Simplest case:

out(x) = w x

for some unknown weight w.

Estimate w given the data.

1-Parameter Linear Reg.

Assume that the data is formed by

yi = w xi + noise

where…
• the noise signals are indep.
• noise normally distributed: mean 0 and unknown variance σ²
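As a concrete picture of this generative model, here is a minimal Python sketch that samples data from yi = w xi + noise; the true weight and noise level are made-up values, and the inputs are taken from the Regression: Data slide.

import random

# Assumed example values (not from the slides): true weight and noise std. deviation
w_true = 0.5
sigma = 0.3

xs = [2, 6, 4, 3, 4]                                     # inputs from the Regression: Data slide
ys = [w_true * x + random.gauss(0, sigma) for x in xs]   # yi = w xi + Normal(0, sigma^2) noise
print(list(zip(xs, ys)))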

Distribution for ys

[Figure: normal density for y centered at wx, with tick marks at wx ± σ and wx ± 2σ.]

Pr(y|w, x) normally distributed with mean wx and variance σ²

Data to Model

Fix xs. What w makes ys most likely? Also known as…

argmax_w Pr(y1…yn | x1…xn, w)

= argmax_w Π_i Pr(yi | xi, w)

= argmax_w Π_i exp(-1/2 ((yi - w xi)/σ)²)

= argmin_w Σ_i (yi - w xi)²

(Take logs: the product becomes a sum, and maximizing -1/2 Σ_i ((yi - w xi)/σ)² is the same as minimizing Σ_i (yi - w xi)².)

Minimize sum-of-squared residuals.

Residuals

[The data plot again, with vertical segments marking the residuals yi - w xi; x axis 0 to 8, y axis 0 to 3.5.]

How Minimize?

E = Σ_i (yi - w xi)²

= Σ_i yi² - (2 Σ_i xi yi) w + (Σ_i xi²) w²

Minimize quadratic function of w.

E minimized with

w* = (Σ_i xi yi) / (Σ_i xi²)

so ML model is Out(x) = w* x.
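As a quick check of the closed form on the data from the Regression: Data slide, a minimal Python sketch:

# Data from the Regression: Data slide
xs = [2, 6, 4, 3, 4]
ys = [1, 2.2, 2, 1.9, 3.1]

# w* = (sum_i xi yi) / (sum_i xi^2)
w_star = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def out(x):
    return w_star * x        # ML model: Out(x) = w* x

print(w_star)                # about 0.51 for this data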

Multivariate Regression

What if inputs are vectors?

n data points, D components

X = matrix whose rows are the inputs x1, …, xn (n × D)

Y = column of outputs y1, …, yn (n × 1)

Closed Form Solution

Multivariate linear regression assumes a vector w s.t.

Out(x) = wᵀx

= w[1] x[1] + … + w[D] x[D]

ML solution: w = (XᵀX)⁻¹ (XᵀY)

XᵀX is D×D, k,j elt is sum_i xij xik

XᵀY is D×1, k elt is sum_i xik yi
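A minimal NumPy sketch of this closed form; X and Y are small made-up arrays just to show the shapes.

import numpy as np

# Made-up example: n = 4 data points, D = 2 components each
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0],
              [4.0, 2.5]])           # n x D
Y = np.array([1.1, 1.4, 2.2, 3.3])   # n outputs

# ML solution: w = (X^T X)^(-1) (X^T Y); solving the linear system avoids forming the inverse
w = np.linalg.solve(X.T @ X, X.T @ Y)

print(w)          # D weights
print(X @ w)      # Out(x) = w^T x for each row of X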

Got Constants?

[Scatter plot of data that calls for a nonzero intercept; x axis 0 to 8, y axis 0 to 10.]

Fitting with an Offset

We might expect a linear function that doesn’t go through the origin.

Simple obvious hack so we don’t have to start from scratch…
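The hack is made explicit later, in step 2 of the “Batch” Algorithm slide: append a constant 1 to every input so that its weight acts as the offset. A minimal NumPy sketch (the outputs here are made up):

import numpy as np

X = np.array([[2.0], [6.0], [4.0], [3.0], [4.0]])   # inputs from the Regression: Data slide, n x D
Y = np.array([3.0, 5.2, 4.0, 3.9, 5.1])             # made-up outputs that miss the origin

# Append a column of 1s; the weight learned for that column is the offset
X1 = np.hstack([X, np.ones((X.shape[0], 1))])       # n x (D+1)

w = np.linalg.solve(X1.T @ X1, X1.T @ Y)
print(w)                                            # [slope, offset]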

Gradient Descent

Scalar function f(w): Want a local minimum.

Start with some value for w.

Gradient descent rule:

w ← w - η ∂/∂w f(w)

η: “learning rate” (small pos. num.)

Justify!
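A minimal Python sketch of the rule on a made-up scalar function, f(w) = (w - 3)², whose minimum is at w = 3:

def f(w):
    return (w - 3.0) ** 2        # example function (an assumption, not from the slides)

def dfdw(w):
    return 2.0 * (w - 3.0)       # its derivative

eta = 0.1                        # learning rate: small positive number
w = 0.0                          # start with some value for w

for _ in range(100):
    w = w - eta * dfdw(w)        # gradient descent rule

print(w)                         # close to the minimum at w = 3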

Partial Derivatives

E = sum_k (wᵀxk - yk)² = f(w)

wj ← wj - η ∂/∂wj f(w)

How would a small increase in weight wj change the error?

Small positive? Large positive?
Small negative? Large negative?

Neural Net Connection

Set of weights w.

Find weights to minimize sum-of-squared residuals. Why?

When would we want to use gradient descent?

Linear Perceptron

Earliest, simplest NN.

[Diagram: inputs x1, x2, x3, …, xD plus a constant input 1, with weights w1, w2, w3, …, wD and w0, feeding a sum unit whose output is y.]

Learning Rule

Multivariate linear function, trained by gradient descent.

Derive the update rule…

out(x) = wᵀx

E = sum_k (wᵀxk - yk)² = f(w)

wj ← wj - η ∂/∂wj f(w)

“Batch” Algorithm

1. Randomly initialize w1…wD
2. Append 1s to inputs to allow function to miss the origin
3. For i = 1 to n, δi = yi - wᵀxi
4. For j = 1 to D, wj = wj + η sum_i δi xij
5. If sum_i δi² is small, stop, else go to 3.

Why squared?
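Steps 1-5 above, as a minimal NumPy sketch (the learning rate, stopping threshold, and iteration cap are made-up choices):

import numpy as np

def batch_linear_train(X, Y, eta=0.01, tol=1e-6, max_iters=10000):
    n, D = X.shape
    X1 = np.hstack([X, np.ones((n, 1))])     # step 2: append 1s so the fit can miss the origin
    w = np.random.randn(D + 1) * 0.01        # step 1: randomly initialize the weights
    for _ in range(max_iters):
        delta = Y - X1 @ w                   # step 3: delta_i = yi - w^T xi
        w = w + eta * (X1.T @ delta)         # step 4: wj = wj + eta * sum_i delta_i xij
        if np.sum(delta ** 2) < tol:         # step 5: stop when sum_i delta_i^2 is small
            break
    return w

# Example run on the 1-D data from the Regression: Data slide
X = np.array([[2.0], [6.0], [4.0], [3.0], [4.0]])
Y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])
print(batch_linear_train(X, Y))              # [slope, offset]

For noisy data the residual never reaches zero, so in practice the loop ends at the iteration cap; the weights it returns are still the least-squares fit, to which the updates converge for a small enough η.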

Classification

Let’s say all outputs are 0 or 1.

How can we interpret the output of the perceptron as zero or one?

Classification

[Plot of the 0/1 outputs against x; x axis 0 to 10, y axis -0.4 to 1.]

Change Output Function

Solution:

Instead of out(x) = wᵀx

we’ll use

out(x) = g(wᵀx)

g(x): ℝ → (0,1), squashing function

Sigmoid

E = sum_k (g(wᵀxk) - yk)² = f(w)

where g(h) = 1/(1 + e^(-h))

[Plot of the squashed output g, which stays between 0 and 1; x axis 0 to 10.]
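A minimal Python sketch of g, plus a finite-difference check of the identity g’(h) = g(h)(1 - g(h)) used in the Gradient Descent in Perceptrons slide (the test point is arbitrary):

import math

def g(h):
    return 1.0 / (1.0 + math.exp(-h))    # sigmoid: squashes any real h into (0, 1)

h = 0.7                                        # arbitrary test point
numeric = (g(h + 1e-6) - g(h - 1e-6)) / 2e-6   # finite-difference estimate of g'(h)
analytic = g(h) * (1.0 - g(h))                 # the closed form
print(numeric, analytic)                       # the two agree to several decimal places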

Classification Percept.

[Diagram: inputs x1, x2, x3, …, xD plus a constant input 1, with weights w1, w2, w3, …, wD and w0, feeding a sum unit that computes neti, followed by the squashing function g to produce y.]

Classifying Regions

[Figure: the (x1, x2) plane with training points labeled 1 and 0, illustrating the regions the classifier assigns to each class.]

Gradient Descent in Perceptrons

Notice g’(h) = g(h)(1 - g(h)).

Let neti = sum_k wk xik, δi = yi - g(neti)

out(xi) = g(neti)

E = sum_i (yi - g(neti))²

∂E/∂wj = sum_i 2(yi - g(neti)) (-∂/∂wj g(neti))

= -2 sum_i (yi - g(neti)) g’(neti) ∂/∂wj neti

= -2 sum_i δi g(neti) (1 - g(neti)) xij
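Turning this gradient into code: a minimal NumPy sketch of sigmoid-perceptron training by gradient descent (the data set, learning rate, and iteration count are made-up; the inputs already include the constant-1 feature):

import numpy as np

def g(h):
    return 1.0 / (1.0 + np.exp(-h))                  # sigmoid

def gradient_step(w, X, Y, eta=0.1):
    net = X @ w                                      # neti = sum_k wk xik
    out = g(net)                                     # out(xi) = g(neti)
    delta = Y - out                                  # deltai = yi - g(neti)
    grad = -2.0 * X.T @ (delta * out * (1.0 - out))  # dE/dwj from the slide
    return w - eta * grad                            # move against the gradient

# Made-up example: two inputs plus a constant-1 feature; target is the AND of the inputs
X = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 1.0]])
Y = np.array([0.0, 0.0, 0.0, 1.0])

w = np.zeros(3)
for _ in range(10000):
    w = gradient_step(w, X, Y)
print(g(X @ w))     # the (1,1) input ends up above 0.5, the rest below

This is the same update as the delta rule on the next slide, with the constant 2 folded into the learning rate.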

Delta Rule for Perceptrons

wj = wj + η sum_i δi out(xi) (1 - out(xi)) xij

Invented and popularized by Rosenblatt (1962)

Guaranteed convergence
Stable behavior for overconstrained and underconstrained problems

What to Learn

Linear regression as ML

Gradient descent to find ML

Perceptron training rule (regression version and classification version)

Sigmoids for classification problems

Homework 9 (due 12/5)

1. Write a program that decides if a pair of words are synonyms using WordNet. I’ll send you the list, you send me the answers.

2. Draw a decision tree that represents (a) f1+f2+…+fn (or), (b) f1 f2 … fn (and), (c) parity (odd number of features “on”).

3. Show that g’(h) = g(h)(1-g(h)).