Today’s Topics
• Midterm class mean: 83.5
• HW3 Due Thursday and HW4 Out Thursday
• Turn in Your BN Nannon Player (in Separate, ‘Dummy’ Assignment) by a Week from Thursday
• Weight Space (for ANNs)
• Gradient Descent and Local Minima
• Stochastic Gradient Descent
• Backpropagation
• The Need to Train the Biases and a Simple Algebraic Trick
• Perceptron Training Rule and a Worked Example
• Case Analysis of Delta Rule
• Neural ‘Word Vectors’
Back to Prob Reasoning for Two Slides: Base-Rate Fallacy
(https://en.wikipedia.org/wiki/Base_rate_fallacy)
Assume Disease A is rare (one in 1 million, say – so picture not to scale)
Assume the population is 10B = 10^10, so 10^4 people have it
Assume testForA is 99.99% accurate
You test positive. What is the prob you have Disease A?
Someone (not in cs540) might naively think prob = 0.9999

[Figure: among the people for whom testForA = true, 9999 actually have Disease A, while 10^6 (0.01% of everyone else) do NOT have Disease A]

Prob(A | testForA) ≈ 9999 / (9999 + 10^6) ≈ 0.01

This same issue arises in ML when we have many more neg than pos ex’s: false positives overwhelm the true positives
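A quick numeric check of the slide’s numbers (a minimal sketch; the population, prevalence, and accuracy figures are the ones above):

```python
# Base-rate fallacy: P(A | positive test) via simple counting.
population = 10**10          # 10B people (from the slide)
prevalence = 1e-6            # Disease A: one in 1 million
accuracy   = 0.9999          # testForA is 99.99% accurate

have_a    = population * prevalence                  # 10^4 people
true_pos  = have_a * accuracy                        # ~9999 correct positives
false_pos = (population - have_a) * (1 - accuracy)   # ~10^6 healthy positives

print(true_pos / (true_pos + false_pos))             # ~0.0099, ie about 0.01
```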
A Major Weakness of BNs (I also copied this and the prev slide to an earlier lecture, for future cs540’s)
• If there are many ‘hidden’ random vars (N binary vars, say), then the marginalization formula leads to many calls to a BN (2^N in our example; for N = 20, 2^N = 1,048,576 – see the sketch below)
• Using uniform-random sampling to estimate the result is too inaccurate, since most of the probability might be concentrated in only a few ‘complete world states’
• Hence, much research (beyond cs540’s scope) on scaling up inference in BNs and other graphical models, eg via more sophisticated sampling (eg, MCMC)
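To make the 2^N concrete, here is a minimal sketch of brute-force marginalization; `bn_joint_prob`, standing in for whatever routine scores one complete world state in the BN, is a hypothetical placeholder:

```python
from itertools import product

def marginalize(bn_joint_prob, evidence, n_hidden):
    """Sum out n_hidden binary random vars: 2**n_hidden calls to the BN."""
    total = 0.0
    for hidden in product([False, True], repeat=n_hidden):  # all 2^N settings
        total += bn_joint_prob(evidence, hidden)            # one BN call each
    return total

# For n_hidden = 20 the loop body runs 2**20 = 1,048,576 times.
```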
WARNING! Some Calculus Ahead
No Calculus Experience? For HWs and the Final …
• Derivatives generalize the idea of SLOPE
• You only need to know how to calc the SLOPE of a line:

  d(mx + b) / dx = m   // ‘mx + b’ is the algebraic form of a line
                       // ‘m’ is the slope
                       // ‘b’ is the y intercept (value of y when x = 0)

Remember: two (distinct) points define a line
Weight Space
• Given a neural-network layout, the weights and biases are free parameters that define a space
• Each point in this Weight Space specifies a network; weight space is a continuous space we search
• Associated with each point is an error rate, E, over the training data
• Backprop performs gradient descent in weight space
Gradient Descent in Weight Space

[Figure: the total error on the training set plotted as a surface over weights W1 and W2; the gradient ∂E/∂W at the current wgt settings points along the surface, and one step moves to new wgt settings with lower ERROR]
Backprop Seeks LOCAL Minima (in a continuous space)

[Figure: error on train set plotted over weight space, a curve with several local minima]

Note: a local min might overfit the training data, so ‘early stopping’ is often used (more later)
Local Min are Good Enough for Us!
• ANNs, including Deep Networks, make accurate predictions even though we likely are only finding local min
• The world could have been like this:

[Figure: error on train set plotted over weight space, ie a surface where most min are poor and it is hard to find a good min]

• Note: ensembles of ANNs work well (often find different local minima)
The Gradient-Descent Rule

The ‘gradient’ ∇E(w) is an N+1 dimensional vector (ie, the ‘slope’ in weight space):

  ∇E(w) = [ ∂E/∂w0, ∂E/∂w1, ∂E/∂w2, … , ∂E/∂wN ]

Since we want to reduce errors, we want to go ‘down hill’. We’ll take a finite step in weight space:

  Δw = -η ∇E(w)    or    Δwi = -η ∂E/∂wi

Δ (‘delta’) = the change to w; η scales the size of the step
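A minimal numpy sketch of this rule; the quadratic error function here is a toy stand-in for the real training-set error:

```python
import numpy as np

def gradient_descent_step(w, grad_E, eta=0.1):
    """Take one finite step downhill in weight space: delta_w = -eta * grad E(w)."""
    return w - eta * grad_E(w)

# Toy example: E(w) = ||w||^2, so grad E(w) = 2w and the minimum is at w = 0.
w = np.array([1.0, -2.0, 0.5])
for _ in range(100):
    w = gradient_descent_step(w, lambda v: 2 * v)
print(w)  # very close to [0, 0, 0]
```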
‘On Line’ vs. ‘Batch’ Backprop
• Technically, we should look at the error gradient for the entire training set before taking a step in weight space (‘batch’ backprop)
• However, in practice we take a step after each example (‘on-line’ backprop)
  – Much faster convergence (learn after each example)
  – Called ‘stochastic’ gradient descent
  – Stochastic gradient descent is very popular at Google, etc, due to easy parallelism
‘On Line’ vs. ‘Batch’ BP (continued)

BATCH – add the Δw vectors for every training example, then ‘move’ in weight space

ON-LINE – ‘move’ after each example (aka, stochastic gradient descent)

[Figure: on the (wi, E) curve, BATCH adds Δw_ex1 + Δw_ex2 + Δw_ex3 into one vector and then moves, while ON-LINE moves after each of Δw_ex1, Δw_ex2, Δw_ex3 in turn]

* Final locations in weight space need not be the same for BATCH and ON-LINE
* Note Δw_i,BATCH ≠ Δw_i,ON-LINE, for i > 1 (see the sketch below)
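A sketch of the two regimes, assuming some per-example gradient routine `grad_on_example` (a hypothetical name):

```python
def batch_epoch(w, examples, grad_on_example, eta):
    """BATCH: add the gradient contributions from every example, move once."""
    total_grad = sum(grad_on_example(w, ex) for ex in examples)
    return w - eta * total_grad

def online_epoch(w, examples, grad_on_example, eta):
    """ON-LINE (stochastic): move after EACH example; later gradients are
    evaluated at already-updated weights, so the final w can differ."""
    for ex in examples:
        w = w - eta * grad_on_example(w, ex)
    return w
```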
BP Calculations (note: text uses ‘loss’ instead of ‘error’)

Assume one layer of hidden units (std. non-deep topology), with units indexed k → j → i from inputs to outputs

1. Error ≡ ½ Σ ( Teacheri – Outputi )²
2.       = ½ Σ ( Teacheri – F( Σ [ Wi,j × Outputj ] ) )²
3.       = ½ Σ ( Teacheri – F( Σ [ Wi,j × F( Σ Wj,k × Outputk ) ] ) )²

Determine ∂Error/∂Wi,j (use equation 2) and ∂Error/∂Wj,k (use equation 3)

Recall Δwx,y = -η (∂E / ∂wx,y)

See Sec 18.7.4 and Fig 18.24 in textbook for the results (I won’t ask you to derive on final)
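A minimal sketch of what the resulting updates look like for one on-line step, assuming sigmoid units throughout and omitting biases for brevity (see the textbook for the actual derivation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, teacher, W_jk, W_ij, eta=0.1):
    """One on-line step for a k -> j -> i net (sigmoid units, squared error).
    W_jk: hidden-by-input weights; W_ij: output-by-hidden weights."""
    out_j = sigmoid(W_jk @ x)            # hidden-layer outputs
    out_i = sigmoid(W_ij @ out_j)        # network outputs

    # dError/dW_ij from equation 2, using F' = out (1 - out):
    delta_i = -(teacher - out_i) * out_i * (1 - out_i)
    # dError/dW_jk from equation 3 (chain rule through the hidden layer):
    delta_j = (W_ij.T @ delta_i) * out_j * (1 - out_j)

    W_ij -= eta * np.outer(delta_i, out_j)   # delta_w = -eta dE/dw
    W_jk -= eta * np.outer(delta_j, x)
    return W_jk, W_ij
```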
Differentiating the Logistic Function (‘soft’ step-function)

  outi = 1 / ( 1 + e^-( Σ wj,i × outj – Θi ) )

  F’(wgt’ed in) = outi ( 1 – outi )

[Figure: F(wgt’ed in) rises from 0 to 1, crossing 1/2 where Σ Wj × outj = Θ; the derivative F’(wgt’ed in) peaks there at 1/4 and flattens toward 0 in both tails]

Notice that even if totally wrong, there is no (or very little) change in weights, since F’ ≈ 0 in the saturated tails

Note: Differentiating ReLUs is easy! (use F’ = 0 when input = bias)
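A quick numeric check of F and F’ (a sketch; F’ is written directly in the out(1 – out) form from this slide):

```python
import numpy as np

def logistic(z):              # F(wgt'ed in)
    return 1.0 / (1.0 + np.exp(-z))

def logistic_deriv(z):        # F'(wgt'ed in) = out (1 - out)
    out = logistic(z)
    return out * (1.0 - out)

print(logistic_deriv(0.0))    # 0.25: the peak, where wgt'ed input = threshold
print(logistic_deriv(10.0))   # ~4.5e-05: saturated, so almost no wgt change
                              # even when the output is totally wrong
```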
Gradient Descent for the Perceptron (for the simple case of linear output units)

  Error ≡ ½ ( T – o )²   // T = teacher’s answer (a constant wrt the weights), o = network’s output

  ∂E/∂Wa = (T – o) × ∂(T – o)/∂Wa = -(T – o) × ∂o/∂Wa

Continuation of Derivation

Stick in the formula for the output:

  ∂E/∂Wa = -(T – o) × ∂( Σ wk xk )/∂Wa = -(T – o) xa

Recall ΔWa ≡ -η ∂E/∂Wa

So ΔWa = η (T – o) xa   // The Perceptron Rule

Also known as the delta rule and other names (with some variation in the calc)

We’ll use this rule for both LINEAR and STEP-FUNCTION activation
Node Biases

Recall: a node’s output is a weighted function of its inputs and a ‘bias’ term

[Figure: a node with an Input, a bias input fixed at 1, and an Output]

These biases also need to be learned!
Training Biases ( Θ’s )

A node’s output (assume ‘step function’ for simplicity):
  1 if W1 X1 + W2 X2 + … + Wn Xn ≥ Θ
  0 otherwise

Rewriting:
  W1 X1 + W2 X2 + … + Wn Xn – Θ ≥ 0
  W1 X1 + W2 X2 + … + Wn Xn + Θ × (-1) ≥ 0   // Θ is now a weight and -1 its ‘activation’
Training Biases (cont.)

Hence, add another unit whose activation is always -1; the bias is then just another weight!

Eg [Figure: the threshold Θ inside the node becomes the weight Θ on the constant -1 input]
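In code, the trick is just to append a constant -1 ‘activation’ to every input vector (a minimal sketch):

```python
import numpy as np

def augment(x):
    """Append the constant -1 input, so the bias is just another weight."""
    return np.append(x, -1.0)

def step_output(w, x):
    # w . [x, -1] >= 0 is exactly the old test  w . x >= theta,
    # where theta is the last entry of w
    return 1 if w @ augment(x) >= 0 else 0
```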
Perceptron Example (assume step function and use η = 0.1)

Train Set:
  X1   X2   Correct Output
   3   -2         1
   6    1         0
   5   -3         1

[Figure: perceptron with inputs X1 (wgt 1) and X2 (wgt -3), plus the constant -1 input (wgt 2)]

Perceptron Learning Rule: ΔWa = η (T – o) xa

First example: Out = StepFunction(3 × 1 + (-2) × (-3) – 1 × 2) = StepFunction(7) = 1
No wgt changes, since correct
Perceptron Example (continued; same train set, network, and rule)

Second example: Out = StepFunction(6 × 1 + 1 × (-3) – 1 × 2) = StepFunction(1) = 1 // Teacher says 0, so need to update weights
Perceptron Example (continued): applying ΔWa = η (T – o) xa = 0.1 × (0 – 1) × xa to the mistake on the second example:

  New X1 wgt:    1 – 0.1 × 6    = 0.4
  New X2 wgt:   -3 – 0.1 × 1    = -3.1
  New ‘-1’ wgt:  2 – 0.1 × (-1) = 2.1
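The three slides above can be replayed in a few lines (a sketch; the initial weights, train set, and η are the ones from the example):

```python
import numpy as np

eta = 0.1
w = np.array([1.0, -3.0, 2.0])               # wgts on X1, X2, and the -1 input
train = [([3, -2], 1), ([6, 1], 0), ([5, -3], 1)]

for x, teacher in train:
    x = np.append(np.array(x, float), -1.0)  # the bias trick
    out = 1 if w @ x >= 0 else 0
    w = w + eta * (teacher - out) * x        # Delta Wa = eta (T - o) xa
    print(out, w)
# Ex 1: out = 1, correct, w stays [1, -3, 2]
# Ex 2: out = 1 but T = 0, w becomes [0.4, -3.1, 2.1]
# Ex 3: out = 1, correct, w stays [0.4, -3.1, 2.1]
```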
Pushing the Weights and Bias in the Correct Direction when Wrong

[Figure: the step-function Output vs the Wgt’ed Sum, with the bias setting the threshold]

Assume TEACHER = 1 and ANN = 0, so some combo of:
 (a) wgts on some positively valued inputs too small
 (b) wgts on some negatively valued inputs too large
 (c) ‘bias’ too large

Opposite movement when TEACHER = 0 and ANN = 1
Case Analysis: Assume Teach = 1, out = 0, η = 1

ΔWk = η (T – o) xk, so here each wgt moves by exactly its own input

Four Cases: pos/neg input × pos/neg weight (plus two cases for the BIAS wgt)

  Input Vector:   1,   -1,    1,   -1,   and the BIAS input ‘-1’
  Weights:        2,   -4,   -3,    5,   and 6 or -6 // the BIAS wgt
  New Wgts:      2+1, -4-1, -3+1,  5-1,  6-1 or -6-1
               bigger smaller bigger smaller (BIAS: smaller, smaller)

  Old vs New Input × Wgt:  2 vs 3,  4 vs 5,  -3 vs -2,  -5 vs -4

So the weighted sum will be LARGER (-2 vs 2), and the BIAS will be SMALLER

Note: ‘bigger’ means closer to +infinity and ‘smaller’ means closer to -infinity
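A quick check of the four input cases, using the numbers above (a minimal sketch):

```python
import numpy as np

eta, teacher, out = 1, 1, 0
x = np.array([1, -1, 1, -1])          # the four pos/neg input cases
w = np.array([2, -4, -3, 5])
bias_wgt = 6                          # or -6; the BIAS input is always -1

new_w = w + eta * (teacher - out) * x              # each wgt moves by its input
new_bias = bias_wgt + eta * (teacher - out) * -1   # bias wgt moves by -1

print(w @ x, new_w @ x)    # -2 then 2: the wgt'ed sum got LARGER
print(bias_wgt, new_bias)  # 6 then 5: the BIAS got SMALLER
```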
Neural Word Vectors – Current Hot Topic
(see https://code.google.com/p/word2vec/ or http://deeplearning4j.org/word2vec.html)

Distributional Hypothesis: words can be characterized by the words that appear nearby in a large text corpus
(matrix algebra is also used for this task, eg singular-value decomposition, SVD)

Two Possible Designs (CBOW = Continuous Bag of Words)
[Figure: the two network designs, eg predict a word from its surrounding words, or the surrounding words from the word]

Initially assign each word a random k-long vector of #’s in [0,1] or [-1,1] (k is something like 100 or 300) – as opposed to the traditional ‘1-of-N’ encoding
Recall 1-of-N: aardvark = 1,0,0,…,0 // N is 50,000 or more!
               zzzz = 0,0,0,…,1 // And nothing ‘shared’ by related words

Compute ∂Error/∂Inputi to change the input vector(s) – ie, find good word vectors so it is easy to learn to predict the I/O pairs in the fig above
Neural Word Vectors (cont.)

Surprisingly, one can do ‘simple algebra’ with these word vectors!

  vectorFrance – vectorParis = vectorItaly – X

Subtract the vector for Paris from the vector for France, then subtract the vector for Italy. Negate, then find the closest word vectors in one’s word ‘library’

The web page suggests X = vectorRome, though I got vectorMilan (which is reasonable; vectorRome was 2nd)

Another example: king – man = queen – X
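A sketch of how such an analogy query can be answered, given some learned word ‘library’ (here a hypothetical dict `vectors` mapping words to k-long numpy arrays):

```python
import numpy as np

def closest_word(target, vectors):
    """Return the library word whose vector is most cosine-similar to target."""
    def cosine(a, b):
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(vectors, key=lambda word: cosine(vectors[word], target))

# Solving vectorFrance - vectorParis = vectorItaly - X for X:
# x_vec = vectors['Italy'] - vectors['France'] + vectors['Paris']
# print(closest_word(x_vec, vectors))   # hopefully 'Rome' (or 'Milan'!)
```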
Wrapup of Basics of ANN Training
• We differentiate (in the calculus sense) all the free parameters in an ANN with a fixed structure (‘topology’)
  – If all else is held constant (‘partial derivatives’), what is the impact of changing weighta? Ie, compute ∂Error/∂Wa
  – Simultaneously move each weight a small amount in the direction that reduces error
  – Process example-by-example, many times
• Seeks a local minimum, ie, where all derivatives = 0