Page 1: Reminder: Linear Classifiers

§ Inputs are feature values
§ Each feature has a weight
§ Sum is the activation

§ If the activation is:
§ Positive, output +1
§ Negative, output -1

[Figure: perceptron diagram; inputs $f_1, f_2, f_3$ with weights $w_1, w_2, w_3$ feed a sum $\Sigma$, followed by a $>0?$ test]
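To make the decision rule concrete, here is a minimal sketch in Python; the feature values and weights are invented for illustration and are not from the slides.

```python
import numpy as np

def linear_classify(f, w):
    """Return +1 if the activation w . f is positive, -1 if it is negative."""
    activation = np.dot(w, f)
    return 1 if activation > 0 else -1

# Hypothetical feature values f1, f2, f3 and weights w1, w2, w3
f = np.array([1.0, 3.0, 2.0])
w = np.array([0.5, -1.0, 2.0])
print(linear_classify(f, w))  # activation = 0.5 - 3.0 + 4.0 = 1.5 > 0, so +1
```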

Page 2: How to get probabilistic decisions?

§ Activation:
§ If very positive → want probability going to 1
§ If very negative → want probability going to 0

§ Sigmoid function:

$$z = w \cdot f(x)$$

$$\phi(z) = \frac{1}{1 + e^{-z}}$$
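A small sketch of how the sigmoid squashes activations into (0, 1); the sample activations below are arbitrary.

```python
import numpy as np

def sigmoid(z):
    """phi(z) = 1 / (1 + e^(-z))"""
    return 1.0 / (1.0 + np.exp(-z))

# Very positive activations give probabilities near 1, very negative ones near 0
for z in [-6.0, -1.0, 0.0, 1.0, 6.0]:
    print(f"z = {z:+.1f}  ->  phi(z) = {sigmoid(z):.4f}")
```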

Page 3: Best w?

§ Maximum likelihood estimation:

$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

with:

$$P(y^{(i)} = +1 \mid x^{(i)}; w) = \frac{1}{1 + e^{-w \cdot f(x^{(i)})}}$$

$$P(y^{(i)} = -1 \mid x^{(i)}; w) = 1 - \frac{1}{1 + e^{-w \cdot f(x^{(i)})}}$$

= Logistic Regression
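A sketch of the quantity being maximized, for labels y in {+1, -1}; note that P(y | x; w) can be written as sigmoid(y * w·f(x)) for both label values, which the code below uses. The feature matrix and labels are invented for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, F, y):
    """Sum over examples of log P(y_i | x_i; w), with
    P(y=+1 | x; w) = sigmoid(w . f(x)) and P(y=-1 | x; w) = 1 - sigmoid(w . f(x))."""
    margins = y * (F @ w)          # y_i * (w . f(x_i))
    return np.sum(np.log(sigmoid(margins)))

# Invented data: 4 examples, 2 features each, labels in {+1, -1}
F = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [0.0, -2.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
print(log_likelihood(np.array([0.3, 0.8]), F, y))
```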

Page 4: Multiclass Logistic Regression

§ Multi-class linear classification

§ A weight vector for each class: $w_y$

§ Score (activation) of a class $y$: $w_y \cdot f(x)$

§ Prediction w/ highest score wins: $\arg\max_y w_y \cdot f(x)$

§ How to make the scores into probabilities?

$$z_1, z_2, z_3 \;\rightarrow\; \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}},\; \frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}},\; \frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}}$$

original activations → softmax activations
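The softmax transformation above translates directly into code; the max-subtraction below is just a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(z):
    """Map activations z_1..z_k to e^(z_i) / sum_j e^(z_j)."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())        # subtracting the max avoids overflow
    return e / e.sum()

print(softmax([2.0, 1.0, 0.1]))    # original activations -> softmax activations
```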

Page 5: Best w?

§ Maximum likelihood estimation:

$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

with:

$$P(y^{(i)} \mid x^{(i)}; w) = \frac{e^{w_{y^{(i)}} \cdot f(x^{(i)})}}{\sum_y e^{w_y \cdot f(x^{(i)})}}$$

= Multi-Class Logistic Regression

Page 6: This Lecture

§ Optimization

§ i.e., how do we solve:

$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

Page 7: Hill Climbing

§ Recall from CSPs lecture: simple, general idea
§ Start wherever
§ Repeat: move to the best neighboring state
§ If no neighbors better than current, quit

§ What's particularly tricky when hill-climbing for multiclass logistic regression?
• Optimization over a continuous space
• Infinitely many neighbors!
• How to do this efficiently?

Page 8: 1-D Optimization

§ Could evaluate $g(w_0 + h)$ and $g(w_0 - h)$
§ Then step in best direction

§ Or, evaluate derivative:
§ Tells which direction to step into

[Figure: 1-D curve $g(w)$ with the points $w_0$, $g(w_0)$, $g(w_0 + h)$, and $g(w_0 - h)$ marked]

$$\frac{\partial g(w_0)}{\partial w} = \lim_{h \to 0} \frac{g(w_0 + h) - g(w_0 - h)}{2h}$$
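The limit above suggests a numerical approximation: evaluate g at w0 + h and w0 - h for a small finite h. The function g below is just a stand-in for illustration.

```python
def central_difference(g, w0, h=1e-5):
    """Approximate dg/dw at w0 by (g(w0 + h) - g(w0 - h)) / (2h)."""
    return (g(w0 + h) - g(w0 - h)) / (2.0 * h)

# Stand-in objective: g(w) = -(w - 3)^2, whose true derivative at w0 is -2*(w0 - 3)
g = lambda w: -(w - 3.0) ** 2
print(central_difference(g, 1.0))  # about 4.0, so we should step in the +w direction
```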

Page 9: 2-D Optimization

[Figure: 2-D optimization; source: offconvex.org]

Page 10: Gradient Ascent

§ Perform update in uphill direction for each coordinate
§ The steeper the slope (i.e. the higher the derivative) the bigger the step for that coordinate

§ E.g., consider: $g(w_1, w_2)$

§ Updates:
$$w_1 \leftarrow w_1 + \alpha \, \frac{\partial g}{\partial w_1}(w_1, w_2)$$
$$w_2 \leftarrow w_2 + \alpha \, \frac{\partial g}{\partial w_2}(w_1, w_2)$$

§ Updates in vector notation:
$$w \leftarrow w + \alpha \, \nabla_w g(w)$$

with:
$$\nabla_w g(w) = \begin{bmatrix} \frac{\partial g}{\partial w_1}(w) \\[4pt] \frac{\partial g}{\partial w_2}(w) \end{bmatrix} = \text{gradient}$$
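A toy run of these updates on an invented concave objective; the function, starting point, and step size alpha are arbitrary choices for illustration.

```python
import numpy as np

# Toy objective g(w1, w2) = -(w1 - 1)^2 - (w2 + 2)^2, maximized at (1, -2)
def grad_g(w):
    w1, w2 = w
    return np.array([-2.0 * (w1 - 1.0), -2.0 * (w2 + 2.0)])

w = np.array([0.0, 0.0])
alpha = 0.1
for _ in range(100):
    w = w + alpha * grad_g(w)      # w <- w + alpha * grad g(w)
print(w)                           # approaches [1, -2]
```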

Page 11: Gradient Ascent

§ Idea:
§ Start somewhere
§ Repeat: Take a step in the gradient direction

Figure source: Mathworks

Page 12: What is the Steepest Direction?

§ First-Order Taylor Expansion:
$$g(w + \Delta) \approx g(w) + \frac{\partial g}{\partial w_1}\Delta_1 + \frac{\partial g}{\partial w_2}\Delta_2$$

§ Steepest Descent Direction:
$$\max_{\Delta:\,\Delta_1^2+\Delta_2^2 \le \varepsilon} g(w + \Delta) \;\approx\; \max_{\Delta:\,\Delta_1^2+\Delta_2^2 \le \varepsilon} g(w) + \frac{\partial g}{\partial w_1}\Delta_1 + \frac{\partial g}{\partial w_2}\Delta_2$$

§ Recall: $\max_{\Delta:\,\|\Delta\| \le \varepsilon} \Delta^{\top} a$ → $\Delta = \varepsilon \frac{a}{\|a\|}$

§ Hence, solution:
$$\Delta = \varepsilon \frac{\nabla g}{\|\nabla g\|} \qquad \text{with } \nabla g = \begin{bmatrix} \frac{\partial g}{\partial w_1} \\[4pt] \frac{\partial g}{\partial w_2} \end{bmatrix}$$

Gradient direction = steepest direction!

Page 13: Gradient in n dimensions

$$\nabla g = \begin{bmatrix} \frac{\partial g}{\partial w_1} \\[4pt] \frac{\partial g}{\partial w_2} \\[4pt] \vdots \\[4pt] \frac{\partial g}{\partial w_n} \end{bmatrix}$$

Page 14: Optimization Procedure: Gradient Ascent

§ init $w$
§ for iter = 1, 2, …
$$w \leftarrow w + \alpha \, \nabla g(w)$$

§ $\alpha$: learning rate, a tweaking parameter that needs to be chosen carefully
§ How? Try multiple choices
§ Crude rule of thumb: update changes about 0.1 – 1%

Page 15: Batch Gradient Ascent on the Log Likelihood Objective

$$\max_w \; ll(w) = \max_w \underbrace{\sum_i \log P(y^{(i)} \mid x^{(i)}; w)}_{g(w)}$$

§ init $w$
§ for iter = 1, 2, …
$$w \leftarrow w + \alpha \sum_i \nabla \log P(y^{(i)} \mid x^{(i)}; w)$$
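A sketch of this procedure for the binary logistic regression model from earlier. For labels y in {+1, -1}, the per-example gradient works out to (1 - sigmoid(y * w·f)) * y * f; the training data, learning rate, and iteration count are invented.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Invented training data: 4 examples, 2 features, labels in {+1, -1}
F = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [0.0, -2.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

w = np.zeros(2)
alpha = 0.5
for it in range(200):
    margins = y * (F @ w)
    # gradient of sum_i log P(y_i | x_i; w) = sum_i (1 - sigmoid(y_i * w.f_i)) * y_i * f_i
    grad = ((1.0 - sigmoid(margins)) * y) @ F
    w = w + alpha * grad           # w <- w + alpha * sum_i grad log P
print(w)
```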

Page 16: Stochastic Gradient Ascent on the Log Likelihood Objective

$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

§ init $w$
§ for iter = 1, 2, …
§ pick random $j$
$$w \leftarrow w + \alpha \, \nabla \log P(y^{(j)} \mid x^{(j)}; w)$$

Observation: once gradient on one training example has been computed, might as well incorporate it before computing the next one.

Page 17: Mini-Batch Gradient Ascent on the Log Likelihood Objective

$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

§ init $w$
§ for iter = 1, 2, …
§ pick random subset of training examples $J$
$$w \leftarrow w + \alpha \sum_{j \in J} \nabla \log P(y^{(j)} \mid x^{(j)}; w)$$

Observation: gradient over a small set of training examples (= mini-batch) can be computed in parallel, might as well do that instead of a single one.
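Relative to the batch version sketched above, the only change is drawing a random subset J each iteration; batch size 1 recovers the stochastic version. The data and batch size are again invented.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

F = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [0.0, -2.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

rng = np.random.default_rng(0)
w = np.zeros(2)
alpha = 0.5
for it in range(500):
    J = rng.choice(len(y), size=2, replace=False)   # pick random subset of training examples
    margins = y[J] * (F[J] @ w)
    grad = ((1.0 - sigmoid(margins)) * y[J]) @ F[J]
    w = w + alpha * grad                            # w <- w + alpha * sum_{j in J} grad log P
print(w)
```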

Page 18: How about computing all the derivatives?

§ We'll talk about that once we have covered neural networks, which are a generalization of logistic regression.

Page 19: Neural Networks

Page 20: Multi-class Logistic Regression

§ = special case of neural network

[Figure: features $f_1(x), f_2(x), f_3(x), \ldots, f_K(x)$ feed activations $z_1, z_2, z_3$, followed by a softmax layer]

$$P(y_1 \mid x; w) = \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}} \qquad P(y_2 \mid x; w) = \frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}} \qquad P(y_3 \mid x; w) = \frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}}$$

Page 21: Deep Neural Network = Also learn the features!

[Figure: the same network as on the previous page: features $f_1(x), \ldots, f_K(x)$, activations $z_1, z_2, z_3$, and a softmax layer producing $P(y_1 \mid x; w)$, $P(y_2 \mid x; w)$, $P(y_3 \mid x; w)$ as above]

Page 22: Deep Neural Network = Also learn the features!

[Figure: deep network with inputs $x_1, x_2, x_3, \ldots, x_L$, hidden layers $z^{(1)}_1 \ldots z^{(1)}_{K(1)}$, $z^{(2)}_1 \ldots z^{(2)}_{K(2)}$, ..., $z^{(n-1)}_1 \ldots z^{(n-1)}_{K(n-1)}$, last-layer features $f_1(x), \ldots, f_K(x)$, output activations $z^{(OUT)}_1, z^{(OUT)}_2, z^{(OUT)}_3$, and a softmax layer producing $P(y_1 \mid x; w)$, $P(y_2 \mid x; w)$, $P(y_3 \mid x; w)$]

$$z^{(k)}_i = g\Big(\sum_j W^{(k-1,k)}_{i,j} \, z^{(k-1)}_j\Big) \qquad g = \text{nonlinear activation function}$$
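A minimal forward pass implementing the layer equation above; the layer sizes and random weights are made up, bias terms are omitted (as in the slide's formula), and ReLU is used as one possible choice of the nonlinearity g.

```python
import numpy as np

def g(z):
    """Nonlinear activation; ReLU here, purely as an example choice."""
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
layer_sizes = [5, 4, 4, 3]          # input x, two hidden layers, output activations
weights = [rng.normal(size=(m, n))  # W^(k-1,k) has shape (next layer, previous layer)
           for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]

z = rng.normal(size=layer_sizes[0]) # the input x plays the role of z^(0)
for k, W in enumerate(weights):
    z = W @ z                       # sum_j W_{i,j} z^(k-1)_j for every unit i
    if k < len(weights) - 1:
        z = g(z)                    # hidden layers get the nonlinearity
print(softmax(z))                   # class probabilities from the output activations
```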

Page 23: Deep Neural Network = Also learn the features!

[Figure: the same deep network, now labeling the last hidden layer $z^{(n)}_1 \ldots z^{(n)}_{K(n)}$ before the output activations $z^{(OUT)}_1, z^{(OUT)}_2, z^{(OUT)}_3$ and the softmax layer producing $P(y_1 \mid x; w)$, $P(y_2 \mid x; w)$, $P(y_3 \mid x; w)$]

$$z^{(k)}_i = g\Big(\sum_j W^{(k-1,k)}_{i,j} \, z^{(k-1)}_j\Big) \qquad g = \text{nonlinear activation function}$$

Page 24: Common Activation Functions

[Figure: common activation functions; source: MIT 6.S191, introtodeeplearning.com]
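The referenced figure is not reproduced here; as a stand-in, and assuming the usual choices shown on such slides, here are sigmoid, tanh, and ReLU side by side.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3.0, 3.0, 7)
print("sigmoid:", sigmoid(z))        # squashes to (0, 1)
print("tanh:   ", np.tanh(z))        # squashes to (-1, 1)
print("relu:   ", np.maximum(0, z))  # zero for negative inputs, identity otherwise
```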

Page 25: Deep Neural Network: Also Learn the Features!

§ Training the deep neural network is just like logistic regression:

just $w$ tends to be a much, much larger vector :)

→ just run gradient ascent + stop when log likelihood of hold-out data starts to decrease

$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

Page 26: Neural Networks Properties

§ Theorem (Universal Function Approximators). A two-layer neural network with a sufficient number of neurons can approximate any continuous function to any desired accuracy.

§ Practical considerations
§ Can be seen as learning the features
§ Large number of neurons
§ Danger for overfitting
§ (hence early stopping!)

Page 27: Universal Function Approximation Theorem*

§ In words: Given any continuous function f(x), if a 2-layer neural network has enough hidden units, then there is a choice of weights that allows it to closely approximate f(x).

Cybenko (1989) "Approximations by superpositions of sigmoidal functions"
Hornik (1991) "Approximation Capabilities of Multilayer Feedforward Networks"
Leshno and Schocken (1991) "Multilayer Feedforward Networks with Non-Polynomial Activation Functions Can Approximate Any Function"

Page 28: Universal Function Approximation Theorem*

Cybenko (1989) "Approximations by superpositions of sigmoidal functions"
Hornik (1991) "Approximation Capabilities of Multilayer Feedforward Networks"
Leshno and Schocken (1991) "Multilayer Feedforward Networks with Non-Polynomial Activation Functions Can Approximate Any Function"

Page 29: How about computing all the derivatives?

§ Derivatives tables:

[source: http://hyperphysics.phy-astr.gsu.edu/hbase/Math/derfunc.html]

Page 30: How about computing all the derivatives?

§ But neural net $f$ is never one of those?
§ No problem: CHAIN RULE:

If $f(x) = g(h(x))$

Then $f'(x) = g'(h(x)) \, h'(x)$

→ Derivatives can be computed by following well-defined procedures
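A quick numeric check of the chain rule on one concrete composition; the choice of g and h is arbitrary.

```python
import numpy as np

# f(x) = g(h(x)) with g(u) = sin(u) and h(x) = x^2, so f'(x) = cos(x^2) * 2x
h = lambda x: x ** 2
g = lambda u: np.sin(u)
f = lambda x: g(h(x))

x0 = 1.3
analytic = np.cos(x0 ** 2) * 2 * x0             # g'(h(x0)) * h'(x0)
numeric = (f(x0 + 1e-6) - f(x0 - 1e-6)) / 2e-6  # central-difference check
print(analytic, numeric)                        # the two agree closely
```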

Page 31: Automatic Differentiation

§ Automatic differentiation software
§ e.g. Theano, TensorFlow, PyTorch, Chainer
§ Only need to program the function g(x, y, w)
§ Can automatically compute all derivatives w.r.t. all entries in w
§ This is typically done by caching info during the forward computation pass of f, and then doing a backward pass = "backpropagation"

§ Autodiff / Backpropagation can often be done at computational cost comparable to the forward pass

§ Need to know this exists
§ How this is done?
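A tiny illustration with PyTorch (one of the packages named above): only the forward computation is written out, and backward() fills in the gradient. The specific function, weights, and feature values are invented.

```python
import torch

# Forward pass: log P(y | x; w) for binary logistic regression, i.e. log sigmoid(y * w.f(x))
w = torch.tensor([0.3, 0.8], requires_grad=True)
f = torch.tensor([1.0, 2.0])
y = 1.0
log_p = torch.nn.functional.logsigmoid(y * (w @ f))

# Backward pass ("backpropagation") computes d log_p / d w automatically
log_p.backward()
print(w.grad)
```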

Page 32: Summary of Key Ideas

§ Optimize probability of label given input:
$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

§ Continuous optimization
§ Gradient ascent:
§ Compute steepest uphill direction = gradient (= just vector of partial derivatives)
§ Take step in the gradient direction
§ Repeat (until held-out data accuracy starts to drop = "early stopping")

§ Deep neural nets
§ Last layer = still logistic regression
§ Now also many more layers before this last layer
§ = computing the features
§ → the features are learned rather than hand-designed

§ Universal function approximation theorem
§ If neural net is large enough
§ Then neural net can represent any continuous mapping from input to output with arbitrary accuracy
§ But remember: need to avoid overfitting / memorizing the training data → early stopping!

§ Automatic differentiation gives the derivatives efficiently (how? = outside of scope of 188)


