
Machine Learning (CSE 446): Neural Networks

Noah Smith

© 2017

University of Washington

[email protected]

November 6, 2017


Admin

No Wednesday office hours for Noah; no lecture Friday.


Classifiers We’ve Covered So Far

Classifier             Decision boundary?       Difficult part of learning?
decision trees         piecewise-axis-aligned   greedy split decisions
K-nearest neighbors    possibly very complex    indexing training data
perceptron             linear                   iterative optimization method required
logistic regression    linear                   iterative optimization method required
naïve Bayes            linear (see A4)          none


The next methods we’ll cover permit nonlinear decision boundaries.


Inspiration from Neurons

Image from Wikimedia Commons.

Input signals come in through dendrites, output signal passes out through the axon.


Neuron-Inspired Classifiers

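[Figure: a neuron-inspired unit. Each input x[1], x[2], x[3], ..., x[d] is multiplied by its weight w[1], w[2], w[3], ..., w[d]; the products are summed together with the bias b, and the unit outputs ŷ: fire, or not? Labels: input, weight parameters, bias parameter, “activation”, output.]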


[Figure: the same unit, from input to output ŷ, drawn as a single function f.]
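The unit in the figures above can be sketched in a few lines. This is a minimal illustration, not the course's reference code: the name `unit_output`, the example weights, and the firing rule (output +1 when the weighted sum plus bias is positive, else -1) are assumptions made here for concreteness.

```python
def unit_output(x, w, b):
    """Return +1 if the unit fires on input x, else -1."""
    # Weighted sum of the inputs plus the bias parameter.
    a = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    # Assumed firing rule: fire exactly when the sum is positive.
    return 1 if a > 0 else -1


# Hypothetical parameters for a unit with d = 3 inputs.
w = [1.0, -2.0, 0.5]
b = -0.25

fires = unit_output([1.0, 0.0, 1.0], w, b)   # sum is 1.0 + 0.5 - 0.25 = 1.25
silent = unit_output([0.0, 1.0, 0.0], w, b)  # sum is -2.0 - 0.25 = -2.25
```

With this firing rule the unit is a linear classifier: the inputs on which it fires are those on one side of the hyperplane w · x + b = 0.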


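[Figure: the unit applied to training example n. The input x_n is multiplied by the weights w and summed with the bias b to give the classifier output ŷ = f(x_n); the loss L_n compares ŷ with the correct output y_n. Labels: input, weights, “activation”, classifier output “f”, correct output, loss.]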


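[Figure: plot of tanh(z) for z from -10 to 10; the vertical axis runs from -1.0 to 1.0.]

Hyperbolic tangent function, $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$.

Generalization: apply elementwise to a vector, so that $\tanh : \mathbb{R}^k \to (-1, 1)^k$.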
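The definition and its elementwise generalization can be checked directly. The sketch below assumes plain Python lists for vectors; the names `tanh` and `tanh_vec` are illustrative, not from the slides.

```python
import math


def tanh(z):
    # Direct transcription of the definition: (e^z - e^-z) / (e^z + e^-z).
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))


def tanh_vec(zs):
    # Elementwise application: a vector in R^k maps to a vector in (-1, 1)^k.
    return [tanh(z) for z in zs]


values = tanh_vec([-10.0, 0.0, 10.0])
```

In practice `math.tanh` is the safer choice, since the literal formula overflows in `math.exp` for very large |z|.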



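[Figure: a network with two “hidden units”. The input x_n feeds two units, tanh(w1 · x_n + b1) and tanh(w2 · x_n + b2); their outputs are multiplied by v1 and v2 and summed to give the classifier output ŷ = f(x_n); the loss L_n compares ŷ with the correct output y_n. Labels: input, weights, “activation”, “hidden units”, classifier output “f”, correct output, loss.]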


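[Figure: the same network in vector form. The “hidden units” are tanh(W x_n + b); their outputs are combined with v to give the classifier output ŷ = f(x_n); the loss L_n compares ŷ with the correct output y_n. Labels: input, weights, “activation”, “hidden units”, classifier output “f”, correct output, loss.]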
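The vector form is only a compact way of writing the unit-by-unit picture. The sketch below makes that concrete under stated assumptions: the helper names (`dot`, `hidden_units`, `score`) and all parameter values are hypothetical, chosen only so the two computations can be compared.

```python
import math


def dot(u, x):
    return sum(u_j * x_j for u_j, x_j in zip(u, x))


def hidden_units(W, b, x):
    # One tanh unit per row w_h of W: tanh(w_h . x + b_h).
    return [math.tanh(dot(w_h, x) + b_h) for w_h, b_h in zip(W, b)]


def score(W, b, v, x):
    # v . tanh(Wx + b): the weighted combination of the hidden units.
    return dot(v, hidden_units(W, b, x))


# Hypothetical parameters: H = 2 hidden units, d = 3 inputs.
W = [[0.2, -0.4, 0.1], [0.7, 0.0, -0.3]]
b = [0.1, -0.2]
v = [1.5, -2.0]
x = [1.0, 2.0, 3.0]

vector_form = score(W, b, v, x)
unit_by_unit = (v[0] * math.tanh(dot(W[0], x) + b[0])
                + v[1] * math.tanh(dot(W[1], x) + b[1]))
```

Both expressions compute the same number; stacking the rows w1 and w2 into W and the biases into b changes the notation, not the function.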

Two-Layer Neural Network

$$f(x) = \operatorname{sign}\left(\sum_{h=1}^{H} v_h \cdot \tanh(w_h \cdot x + b_h)\right) = \operatorname{sign}\left(v \cdot \tanh(Wx + b)\right)$$

- Two-layer networks allow decision boundaries that are nonlinear.

- It’s fairly easy to show that “XOR” can be simulated (recall conjunction features from the “practical issues” lecture on 10/18).

- Theoretical result: any continuous function on a bounded region in $\mathbb{R}^d$ can be approximated arbitrarily well, with a finite number of hidden units.

- The number of hidden units affects how complicated your decision boundary can be and how easily you will overfit.
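One way to see the XOR claim is to pick weights by hand. The sketch below is an illustration under several assumptions that are not in the slides: inputs are encoded as -1/+1, XOR "true" is the output +1, and H = 3 hidden units are used, the third with zero input weights so that it acts as a constant offset (the formula has no output bias). The weights are hand-chosen, not learned.

```python
import math


def predict(W, b, v, x):
    # f(x) = sign(v . tanh(Wx + b)), with sign taken as +1 for a positive score.
    hidden = [math.tanh(sum(w_hj * x_j for w_hj, x_j in zip(w_h, x)) + b_h)
              for w_h, b_h in zip(W, b)]
    s = sum(v_h * h for v_h, h in zip(v, hidden))
    return 1 if s > 0 else -1


# Hand-picked (hypothetical) parameters.
W = [[1.0, 1.0],   # unit 1: tanh(x1 + x2 + 1), an OR-like feature
     [1.0, 1.0],   # unit 2: tanh(x1 + x2 - 1), an AND-like feature
     [0.0, 0.0]]   # unit 3: ignores the input, constant tanh(1)
b = [1.0, -1.0, 1.0]
v = [1.0, -1.0, -1.0]

xor_table = {(x1, x2): predict(W, b, v, [x1, x2])
             for x1 in (-1, 1) for x2 in (-1, 1)}
```

The score is positive only when exactly one input is +1, which no single linear unit can achieve; the OR-like and AND-like hidden units play the role of the conjunction features.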




Learning with a Two-Layer Network

Parameters: $W \in \mathbb{R}^{H \times d}$, $b \in \mathbb{R}^{H}$, and $v \in \mathbb{R}^{H}$

- If we choose a differentiable loss, then the whole function will be differentiable with respect to all parameters.

- Because of the squashing function, which is not convex, the overall learning problem is not convex.

- What does (stochastic) (sub)gradient descent do with non-convex functions?

- To calculate gradients, we need to use the chain rule from calculus.

- Special name for (S)GD with chain rule invocations: backpropagation.
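The chain-rule computation can be sketched for a single training example. The slides leave the loss unspecified, so this sketch assumes a squared loss on the score s = v · tanh(Wx + b), namely L = ½(s − y)²; the function names and parameter values are hypothetical. A finite-difference check on one weight confirms the analytic gradient.

```python
import math


def forward(W, b, v, x):
    h = [math.tanh(sum(w * xj for w, xj in zip(row, x)) + bh)
         for row, bh in zip(W, b)]
    s = sum(vh * hh for vh, hh in zip(v, h))
    return h, s


def loss(W, b, v, x, y):
    # Assumed loss: squared error on the score (any differentiable loss works).
    _, s = forward(W, b, v, x)
    return 0.5 * (s - y) ** 2


def gradients(W, b, v, x, y):
    h, s = forward(W, b, v, x)
    d_s = s - y                                    # dL/ds
    grad_v = [d_s * hh for hh in h]                # dL/dv_h = dL/ds * h_h
    # Chain rule through tanh: d tanh(a)/da = 1 - tanh(a)^2.
    d_a = [d_s * vh * (1.0 - hh * hh) for vh, hh in zip(v, h)]
    grad_b = list(d_a)                             # dL/db_h = dL/da_h
    grad_W = [[da * xj for xj in x] for da in d_a] # dL/dW_hj = dL/da_h * x_j
    return grad_W, grad_b, grad_v


# Hypothetical parameters: H = 2 hidden units, d = 2 inputs.
W = [[0.5, -0.3], [0.1, 0.8]]
b = [0.0, -0.2]
v = [1.0, -1.5]
x, y = [1.0, 2.0], 1.0

grad_W, grad_b, grad_v = gradients(W, b, v, x, y)

# Central finite-difference estimate of dL/dW[0][1] for comparison.
eps = 1e-6
W_plus = [row[:] for row in W]
W_plus[0][1] += eps
W_minus = [row[:] for row in W]
W_minus[0][1] -= eps
numeric = (loss(W_plus, b, v, x, y) - loss(W_minus, b, v, x, y)) / (2 * eps)
```

An SGD step would then subtract a small multiple of each gradient from the corresponding parameter.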

- What does (stochastic) (sub)gradient descent do with non-convex functions? It finds a local minimum.
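A one-dimensional toy problem shows what "a local minimum" means in practice. The function g(w) = w⁴ − 3w² + w, the step size, and the starting points below are assumptions picked for illustration; they are not from the slides and stand in for a real network's loss surface.

```python
def g(w):
    # A non-convex function with two separate minima.
    return w ** 4 - 3 * w ** 2 + w


def g_prime(w):
    return 4 * w ** 3 - 6 * w + 1


def gradient_descent(w, step_size=0.01, steps=2000):
    # Plain (non-stochastic) gradient descent from the given starting point.
    for _ in range(steps):
        w -= step_size * g_prime(w)
    return w


w_right = gradient_descent(2.0)    # start to the right of both minima
w_left = gradient_descent(-2.0)    # start to the left of both minima
```

The two runs settle in different minima, and the one reached from the right has the higher value of g. Which minimum gradient descent finds depends on where it starts, which is why initialization matters for neural networks.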



