Today’s Topics
• Midterm class mean: 83.5
• HW3 Due Thursday and HW4 Out Thursday
• Turn in Your BN Nannon Player (in Separate, ‘Dummy’ Assignment) by a Week from Thursday
• Weight Space (for ANNs)
• Gradient Descent and Local Minima
• Stochastic Gradient Descent
• Backpropagation
• The Need to Train the Biases and a Simple Algebraic Trick
• Perceptron Training Rule and a Worked Example
• Case Analysis of Delta Rule
• Neural ‘Word Vectors’
Back to Prob Reasoning for Two Slides: Base-Rate Fallacy
(https://en.wikipedia.org/wiki/Base_rate_fallacy)
Assume Disease A is rare (one in 1 million, say – so picture not to scale)
Assume the population is 10B = 10^10, so 10^4 people have it
Assume testForA is 99.99% accurate
You test positive. What is the prob you have Disease A?
Someone (not in cs540) might naively think prob = 0.9999

[Figure: among the people for whom testForA = true, 9999 actually have Disease A, while 10^6 (0.01% of everyone else) do NOT have Disease A]

Prob(A | testForA) ≈ 9999 / (9999 + 10^6) ≈ 0.01

This same issue arises in ML when we have many more neg than pos ex’s: false positives overwhelm the true positives
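A quick numeric check of the slide’s numbers (a minimal sketch; the population, prevalence, and accuracy figures are the ones above):

```python
# Base-rate fallacy: P(A | positive test) via simple counting.
population = 10**10          # 10B people (from the slide)
prevalence = 1e-6            # Disease A: one in 1 million
accuracy   = 0.9999          # testForA is 99.99% accurate

have_a    = population * prevalence                  # 10^4 people
true_pos  = have_a * accuracy                        # ~9999 correct positives
false_pos = (population - have_a) * (1 - accuracy)   # ~10^6 healthy positives

print(true_pos / (true_pos + false_pos))             # ~0.0099, ie about 0.01
```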
A Major Weakness of BNs (I also copied this and the prev slide to an earlier lecture, for future cs540’s)
• If there are many ‘hidden’ random vars (N binary vars, say), then the marginalization formula leads to many calls to a BN (2^N in our example; for N = 20, 2^N = 1,048,576 – see the sketch below)
• Using uniform-random sampling to estimate the result is too inaccurate, since most of the probability might be concentrated in only a few ‘complete world states’
• Hence, much research (beyond cs540’s scope) on scaling up inference in BNs and other graphical models, eg via more sophisticated sampling (eg, MCMC)
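To make the 2^N concrete, here is a minimal sketch of brute-force marginalization; `bn_joint_prob`, standing in for whatever routine scores one complete world state in the BN, is a hypothetical placeholder:

```python
from itertools import product

def marginalize(bn_joint_prob, evidence, n_hidden):
    """Sum out n_hidden binary random vars: 2**n_hidden calls to the BN."""
    total = 0.0
    for hidden in product([False, True], repeat=n_hidden):  # all 2^N settings
        total += bn_joint_prob(evidence, hidden)            # one BN call each
    return total

# For n_hidden = 20 the loop body runs 2**20 = 1,048,576 times.
```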
WARNING! Some Calculus Ahead
No Calculus Experience? For HWs and the Final …
• Derivatives generalize the idea of SLOPE
• You only need to know how to calc the SLOPE of a line:

  d(mx + b) / dx = m   // ‘mx + b’ is the algebraic form of a line
                       // ‘m’ is the slope
                       // ‘b’ is the y intercept (value of y when x = 0)

Remember: two (distinct) points define a line
Weight Space
• Given a neural-network layout, the weights and biases are free parameters that define a space
• Each point in this Weight Space specifies a network; weight space is a continuous space we search
• Associated with each point is an error rate, E, over the training data
• Backprop performs gradient descent in weight space
Gradient Descent in Weight Space

[Figure: the total error on the training set plotted as a surface over weights W1 and W2; the gradient ∂E/∂W at the current wgt settings points along the surface, and one step moves to new wgt settings with lower ERROR]
Backprop Seeks LOCAL Minima (in a continuous space)

[Figure: error on train set plotted over weight space, a curve with several local minima]

Note: a local min might overfit the training data, so ‘early stopping’ is often used (more later)
Local Min are Good Enough for Us!
• ANNs, including Deep Networks, make accurate predictions even though we likely are only finding local min
• The world could have been like this:

[Figure: error on train set plotted over weight space, ie a surface where most min are poor and it is hard to find a good min]

• Note: ensembles of ANNs work well (often find different local minima)
The Gradient-Descent Rule

The ‘gradient’ ∇E(w) is an N+1 dimensional vector (ie, the ‘slope’ in weight space):

  ∇E(w) = [ ∂E/∂w0, ∂E/∂w1, ∂E/∂w2, … , ∂E/∂wN ]

Since we want to reduce errors, we want to go ‘down hill’. We’ll take a finite step in weight space:

  Δw = -η ∇E(w)    or    Δwi = -η ∂E/∂wi

Δ (‘delta’) = the change to w; η scales the size of the step
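A minimal numpy sketch of this rule; the quadratic error function here is a toy stand-in for the real training-set error:

```python
import numpy as np

def gradient_descent_step(w, grad_E, eta=0.1):
    """Take one finite step downhill in weight space: delta_w = -eta * grad E(w)."""
    return w - eta * grad_E(w)

# Toy example: E(w) = ||w||^2, so grad E(w) = 2w and the minimum is at w = 0.
w = np.array([1.0, -2.0, 0.5])
for _ in range(100):
    w = gradient_descent_step(w, lambda v: 2 * v)
print(w)  # very close to [0, 0, 0]
```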
‘On Line’ vs. ‘Batch’ Backprop
• Technically, we should look at the error gradient for the entire training set before taking a step in weight space (‘batch’ backprop)
• However, in practice we take a step after each example (‘on-line’ backprop)
  – Much faster convergence (learn after each example)
  – Called ‘stochastic’ gradient descent
  – Stochastic gradient descent is very popular at Google, etc, due to easy parallelism
‘On Line’ vs. ‘Batch’ BP (continued)

BATCH – add the Δw vectors for every training example, then ‘move’ in weight space

ON-LINE – ‘move’ after each example (aka, stochastic gradient descent)

[Figure: on the (wi, E) curve, BATCH adds Δw_ex1 + Δw_ex2 + Δw_ex3 into one vector and then moves, while ON-LINE moves after each of Δw_ex1, Δw_ex2, Δw_ex3 in turn]

* Final locations in weight space need not be the same for BATCH and ON-LINE
* Note Δw_i,BATCH ≠ Δw_i,ON-LINE, for i > 1 (see the sketch below)
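A sketch of the two regimes, assuming some per-example gradient routine `grad_on_example` (a hypothetical name):

```python
def batch_epoch(w, examples, grad_on_example, eta):
    """BATCH: add the gradient contributions from every example, move once."""
    total_grad = sum(grad_on_example(w, ex) for ex in examples)
    return w - eta * total_grad

def online_epoch(w, examples, grad_on_example, eta):
    """ON-LINE (stochastic): move after EACH example; later gradients are
    evaluated at already-updated weights, so the final w can differ."""
    for ex in examples:
        w = w - eta * grad_on_example(w, ex)
    return w
```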
BP Calculations (note: text uses ‘loss’ instead of ‘error’)

Assume one layer of hidden units (std. non-deep topology), with units indexed k → j → i from inputs to outputs

1. Error ≡ ½ Σ ( Teacheri – Outputi )²
2.       = ½ Σ ( Teacheri – F( Σ [ Wi,j × Outputj ] ) )²
3.       = ½ Σ ( Teacheri – F( Σ [ Wi,j × F( Σ Wj,k × Outputk ) ] ) )²

Determine ∂Error/∂Wi,j (use equation 2) and ∂Error/∂Wj,k (use equation 3)

Recall Δwx,y = -η (∂E / ∂wx,y)

See Sec 18.7.4 and Fig 18.24 in textbook for the results (I won’t ask you to derive on final)
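A minimal sketch of what the resulting updates look like for one on-line step, assuming sigmoid units throughout and omitting biases for brevity (see the textbook for the actual derivation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, teacher, W_jk, W_ij, eta=0.1):
    """One on-line step for a k -> j -> i net (sigmoid units, squared error).
    W_jk: hidden-by-input weights; W_ij: output-by-hidden weights."""
    out_j = sigmoid(W_jk @ x)            # hidden-layer outputs
    out_i = sigmoid(W_ij @ out_j)        # network outputs

    # dError/dW_ij from equation 2, using F' = out (1 - out):
    delta_i = -(teacher - out_i) * out_i * (1 - out_i)
    # dError/dW_jk from equation 3 (chain rule through the hidden layer):
    delta_j = (W_ij.T @ delta_i) * out_j * (1 - out_j)

    W_ij -= eta * np.outer(delta_i, out_j)   # delta_w = -eta dE/dw
    W_jk -= eta * np.outer(delta_j, x)
    return W_jk, W_ij
```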
Differentiating the Logistic Function (‘soft’ step-function)

  outi = 1 / ( 1 + e^-( Σ wj,i × outj – Θi ) )

  F’(wgt’ed in) = outi ( 1 – outi )

[Figure: F(wgt’ed in) rises from 0 to 1, crossing 1/2 where Σ Wj × outj = Θ; the derivative F’(wgt’ed in) peaks there at 1/4 and flattens toward 0 in both tails]

Notice that even if totally wrong, there is no (or very little) change in weights, since F’ ≈ 0 in the saturated tails

Note: Differentiating ReLUs is easy! (use F’ = 0 when input = bias)
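A quick numeric check of F and F’ (a sketch; F’ is written directly in the out(1 – out) form from this slide):

```python
import numpy as np

def logistic(z):              # F(wgt'ed in)
    return 1.0 / (1.0 + np.exp(-z))

def logistic_deriv(z):        # F'(wgt'ed in) = out (1 - out)
    out = logistic(z)
    return out * (1.0 - out)

print(logistic_deriv(0.0))    # 0.25: the peak, where wgt'ed input = threshold
print(logistic_deriv(10.0))   # ~4.5e-05: saturated, so almost no wgt change
                              # even when the output is totally wrong
```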
Gradient Descent for the Perceptron (for the simple case of linear output units)

  Error ≡ ½ ( T – o )²   // T = teacher’s answer (a constant wrt the weights), o = network’s output

  ∂E/∂Wa = (T – o) × ∂(T – o)/∂Wa = -(T – o) × ∂o/∂Wa

Continuation of Derivation

Stick in the formula for the output:

  ∂E/∂Wa = -(T – o) × ∂( Σ wk xk )/∂Wa = -(T – o) xa

Recall ΔWa ≡ -η ∂E/∂Wa

So ΔWa = η (T – o) xa   // The Perceptron Rule

Also known as the delta rule and other names (with some variation in the calc)

We’ll use this rule for both LINEAR and STEP-FUNCTION activation
Node Biases

Recall: a node’s output is a weighted function of its inputs and a ‘bias’ term

[Figure: a node with an Input, a bias input fixed at 1, and an Output]

These biases also need to be learned!
Training Biases ( Θ’s )

A node’s output (assume ‘step function’ for simplicity):
  1 if W1 X1 + W2 X2 + … + Wn Xn ≥ Θ
  0 otherwise

Rewriting:
  W1 X1 + W2 X2 + … + Wn Xn – Θ ≥ 0
  W1 X1 + W2 X2 + … + Wn Xn + Θ × (-1) ≥ 0   // Θ is now a weight and -1 its ‘activation’
Training Biases (cont.)

Hence, add another unit whose activation is always -1; the bias is then just another weight!

Eg [Figure: the threshold Θ inside the node becomes the weight Θ on the constant -1 input]
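In code, the trick is just to append a constant -1 ‘activation’ to every input vector (a minimal sketch):

```python
import numpy as np

def augment(x):
    """Append the constant -1 input, so the bias is just another weight."""
    return np.append(x, -1.0)

def step_output(w, x):
    # w . [x, -1] >= 0 is exactly the old test  w . x >= theta,
    # where theta is the last entry of w
    return 1 if w @ augment(x) >= 0 else 0
```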
Perceptron Example (assume step function and use η = 0.1)

Train Set:
  X1   X2   Correct Output
   3   -2         1
   6    1         0
   5   -3         1

[Figure: perceptron with inputs X1 (wgt 1) and X2 (wgt -3), plus the constant -1 input (wgt 2)]

Perceptron Learning Rule: ΔWa = η (T – o) xa

First example: Out = StepFunction(3 × 1 + (-2) × (-3) – 1 × 2) = StepFunction(7) = 1
No wgt changes, since correct
Perceptron Example (continued; same train set, network, and rule)

Second example: Out = StepFunction(6 × 1 + 1 × (-3) – 1 × 2) = StepFunction(1) = 1 // Teacher says 0, so need to update weights
Perceptron Example (continued): applying ΔWa = η (T – o) xa = 0.1 × (0 – 1) × xa to the mistake on the second example:

  New X1 wgt:    1 – 0.1 × 6    = 0.4
  New X2 wgt:   -3 – 0.1 × 1    = -3.1
  New ‘-1’ wgt:  2 – 0.1 × (-1) = 2.1
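The three slides above can be replayed in a few lines (a sketch; the initial weights, train set, and η are the ones from the example):

```python
import numpy as np

eta = 0.1
w = np.array([1.0, -3.0, 2.0])               # wgts on X1, X2, and the -1 input
train = [([3, -2], 1), ([6, 1], 0), ([5, -3], 1)]

for x, teacher in train:
    x = np.append(np.array(x, float), -1.0)  # the bias trick
    out = 1 if w @ x >= 0 else 0
    w = w + eta * (teacher - out) * x        # Delta Wa = eta (T - o) xa
    print(out, w)
# Ex 1: out = 1, correct, w stays [1, -3, 2]
# Ex 2: out = 1 but T = 0, w becomes [0.4, -3.1, 2.1]
# Ex 3: out = 1, correct, w stays [0.4, -3.1, 2.1]
```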
Pushing the Weights and Bias in the Correct Direction when Wrong

[Figure: the step-function Output vs the Wgt’ed Sum, with the bias setting the threshold]

Assume TEACHER = 1 and ANN = 0, so some combo of:
 (a) wgts on some positively valued inputs too small
 (b) wgts on some negatively valued inputs too large
 (c) ‘bias’ too large

Opposite movement when TEACHER = 0 and ANN = 1
Case Analysis: Assume Teach = 1, out = 0, η = 1

ΔWk = η (T – o) xk, so here each wgt moves by exactly its own input

Four Cases: pos/neg input × pos/neg weight (plus two cases for the BIAS wgt)

  Input Vector:   1,   -1,    1,   -1,   and the BIAS input ‘-1’
  Weights:        2,   -4,   -3,    5,   and 6 or -6 // the BIAS wgt
  New Wgts:      2+1, -4-1, -3+1,  5-1,  6-1 or -6-1
               bigger smaller bigger smaller (BIAS: smaller, smaller)

  Old vs New Input × Wgt:  2 vs 3,  4 vs 5,  -3 vs -2,  -5 vs -4

So the weighted sum will be LARGER (-2 vs 2), and the BIAS will be SMALLER

Note: ‘bigger’ means closer to +infinity and ‘smaller’ means closer to -infinity
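A quick check of the four input cases, using the numbers above (a minimal sketch):

```python
import numpy as np

eta, teacher, out = 1, 1, 0
x = np.array([1, -1, 1, -1])          # the four pos/neg input cases
w = np.array([2, -4, -3, 5])
bias_wgt = 6                          # or -6; the BIAS input is always -1

new_w = w + eta * (teacher - out) * x              # each wgt moves by its input
new_bias = bias_wgt + eta * (teacher - out) * -1   # bias wgt moves by -1

print(w @ x, new_w @ x)    # -2 then 2: the wgt'ed sum got LARGER
print(bias_wgt, new_bias)  # 6 then 5: the BIAS got SMALLER
```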
Neural Word Vectors – Current Hot Topic
(see https://code.google.com/p/word2vec/ or http://deeplearning4j.org/word2vec.html)

Distributional Hypothesis: words can be characterized by the words that appear nearby in a large text corpus
(matrix algebra is also used for this task, eg singular-value decomposition, SVD)

Two Possible Designs (CBOW = Continuous Bag of Words)
[Figure: the two network designs, eg predict a word from its surrounding words, or the surrounding words from the word]

Initially assign each word a random k-long vector of #’s in [0,1] or [-1,1] (k is something like 100 or 300) – as opposed to the traditional ‘1-of-N’ encoding
Recall 1-of-N: aardvark = 1,0,0,…,0 // N is 50,000 or more!
               zzzz = 0,0,0,…,1 // And nothing ‘shared’ by related words

Compute ∂Error/∂Inputi to change the input vector(s) – ie, find good word vectors so it is easy to learn to predict the I/O pairs in the fig above
Neural Word Vectors (cont.)

Surprisingly, one can do ‘simple algebra’ with these word vectors!

  vectorFrance – vectorParis = vectorItaly – X

Subtract the vector for Paris from the vector for France, then subtract the vector for Italy. Negate, then find the closest word vectors in one’s word ‘library’

The web page suggests X = vectorRome, though I got vectorMilan (which is reasonable; vectorRome was 2nd)

Another example: king – man = queen – X
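A sketch of how such an analogy query can be answered, given some learned word ‘library’ (here a hypothetical dict `vectors` mapping words to k-long numpy arrays):

```python
import numpy as np

def closest_word(target, vectors):
    """Return the library word whose vector is most cosine-similar to target."""
    def cosine(a, b):
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(vectors, key=lambda word: cosine(vectors[word], target))

# Solving vectorFrance - vectorParis = vectorItaly - X for X:
# x_vec = vectors['Italy'] - vectors['France'] + vectors['Paris']
# print(closest_word(x_vec, vectors))   # hopefully 'Rome' (or 'Milan'!)
```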
Wrapup of Basics of ANN Training
• We differentiate (in the calculus sense) all the free parameters in an ANN with a fixed structure (‘topology’)
  – If all else is held constant (‘partial derivatives’), what is the impact of changing weighta? Ie, compute ∂Error/∂Wa
  – Simultaneously move each weight a small amount in the direction that reduces error
  – Process example-by-example, many times
• Seeks a local minimum, ie, where all derivatives = 0