Linear Discrimination
Reading: Chapter 2 of textbook
Framework
• Assume our data consists of instances x = (x1, x2, ..., xn)
• Assume data can be separated into two classes, positive and negative, by a linear decision surface.
• Learning: Assuming data is n-dimensional, learn (n−1)-dimensional hyperplane to classify the data into classes.
Linear Discriminant
[Figure: three slides showing points of two classes in the (Feature 1, Feature 2) plane, with candidate separating lines]
Example where line won't work?
[Figure: two classes in the (Feature 1, Feature 2) plane that no single line can separate]
Perceptrons
• Discriminant function:
  y(x) = f(wT x + w0) = f(w0 + w1x1 + ... + wn xn)
  w0 is called the "bias"; −w0 is called the "threshold".
• Classification:
  y(x) = sgn(w0 + w1x1 + w2x2 + ... + wn xn)
  where sgn(z) = −1 if z < 0; 0 if z = 0; +1 if z > 0
Perceptrons as simple neural networks
[Figure: a single unit with inputs +1, x1, x2, ..., xn, weights w0, w1, w2, ..., wn, and output o]
  y(x) = sgn(w0 + w1x1 + w2x2 + ... + wn xn)
  where sgn(z) = −1 if z < 0; 0 if z = 0; +1 if z > 0
Example
• What is the class y?
[Figure: perceptron with weights 0.4, −0.4, −0.1 and inputs +1, 1, −1]
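A quick way to check the class is to evaluate y(x) = sgn(w0 + w1x1 + ... + wn xn) directly. The sketch below assumes the figure's numbers map to bias w0 = 0.4 (on the +1 input) and weights w1 = −0.4, w2 = −0.1 applied to inputs x = (1, −1); that mapping is read off the figure and is an assumption.

```python
# Sketch: evaluating the perceptron y(x) = sgn(w0 + w1*x1 + ... + wn*xn).
# Assumed mapping from the figure: w0 = 0.4, w1 = -0.4, w2 = -0.1, x = (1, -1).

def sgn(z):
    """Sign function: -1 if z < 0, 0 if z == 0, +1 if z > 0."""
    return (z > 0) - (z < 0)

def perceptron_output(w, x):
    """w = (w0, w1, ..., wn); x = (x1, ..., xn). Returns sgn(w0 + sum wi*xi)."""
    activation = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return sgn(activation)

y = perceptron_output((0.4, -0.4, -0.1), (1, -1))
print(y)  # activation = 0.4 - 0.4 + 0.1 = 0.1 > 0, so y = +1
```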
Geometry of the perceptron
[Figure: separating line in the (Feature 1, Feature 2) plane]
Hyperplane. In 2d the decision boundary is
  w1x1 + w2x2 + w0 = 0
which we can rewrite as the line
  x2 = −(w1/w2) x1 − (w0/w2)
with slope −w1/w2 and intercept −w0/w2.
In-class exercise
Work with one neighbor on this:
(a) Find weights for a perceptron that separates "true" and "false" for x1 AND x2. Find the slope and intercept, and sketch the separation line defined by this discriminant.
(b) Do the same, but for x1 OR x2.
(c) What (if anything) might make one separation line better than another?
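One possible answer can be checked by brute force over the four Boolean inputs. The sketch below assumes inputs encoded as 0/1 and the sgn-based perceptron from the slides; the weight vectors are one choice among many that work, not the unique answer.

```python
# Sketch of one candidate solution (assumed 0/1 input encoding).
# Many other weight vectors also separate these functions.

def sgn(z):
    return (z > 0) - (z < 0)

def classify(w0, w1, w2, x1, x2):
    return sgn(w0 + w1 * x1 + w2 * x2)

# (a) x1 AND x2: w = (-1.5, 1, 1).
# Separating line x2 = -x1 + 1.5: slope -1, intercept 1.5.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, classify(-1.5, 1, 1, x1, x2))  # +1 only for (1, 1)

# (b) x1 OR x2: w = (-0.5, 1, 1).
# Separating line x2 = -x1 + 0.5: slope -1, intercept 0.5.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, classify(-0.5, 1, 1, x1, x2))  # +1 except for (0, 0)
```

For (c): lines differ in how far they sit from the training points, which affects how new points near the boundary are classified.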
• To simplify notation, assume a "dummy" coordinate (or attribute) x0 = 1. Then we can write:
  y(x) = sgn(wT x)
• We can generalize the perceptron to cases where we project data points xn into a "feature space" φ(xn):
  y(x) = sgn(wT φ(x)) = sgn(w0 φ0(x) + w1 φ1(x) + ... + wD φD(x))
Notation
• Let S = {(xk, tk): k = 1, 2, ..., m} be a training set.
  Note: xk is a vector of inputs; tk is in {+1, −1} for binary classification, and tk is a real value for regression.
• Output o:
  o = sgn( Σ_{j=0}^{n} wj xj ) = sgn(wT x)
• Error of a perceptron on a single training example (xk, tk):
  Ek = (1/2)(tk − ok)^2
Example
• S = {((0, 0), −1), ((0, 1), +1)}
• Let w = (w0, w1, w2) = (0.1, 0.1, −0.3)
[Figure: perceptron with bias weight 0.1 on the +1 input, weight 0.1 on x1, and weight −0.3 on x2]
• What is E1? What is E2?
  Ek = (1/2)(tk − ok)^2
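Working the example through in code: with the sgn output and Ek = (1/2)(tk − ok)^2, each example's error follows directly from the given weights.

```python
# Computing E1 and E2 for the slide's example:
# S = {((0,0), -1), ((0,1), +1)}, w = (w0, w1, w2) = (0.1, 0.1, -0.3).

def sgn(z):
    return (z > 0) - (z < 0)

def error(w, x, t):
    """Ek = (1/2)*(tk - ok)^2 with ok = sgn(w0 + w1*x1 + w2*x2)."""
    o = sgn(w[0] + w[1] * x[0] + w[2] * x[1])
    return 0.5 * (t - o) ** 2

w = (0.1, 0.1, -0.3)
print(error(w, (0, 0), -1))  # o1 = sgn(0.1) = +1, t1 = -1, so E1 = 0.5*(-2)^2 = 2.0
print(error(w, (0, 1), +1))  # o2 = sgn(0.1 - 0.3) = -1, t2 = +1, so E2 = 0.5*(2)^2 = 2.0
```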
How do we train a perceptron?
Gradient descent in weight space
[Figure: error surface E over weight space; from T. M. Mitchell, Machine Learning]
Perceptron learning algorithm
• Start with random weights w = (w1, w2, ... , wn).
• Do gradient descent in weight space, in order to minimize error E:
– Given error E, want to modify weights w so as to take a step in direction of steepest descent.
Gradient descent
• We want to find w so as to minimize the sum-squared error:
  E(w) = (1/2) Σ_{k=1}^{m} (tk − ok)^2
• To minimize, take the derivative of E(w) with respect to w.
• A vector derivative is called a "gradient": ∇E(w)
  ∇E(w) = ( ∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn )
• Here is how we change each weight wj:
  wj ← wj + Δwj, where Δwj = −η ∂E/∂wj
  and η is the learning rate.
• The error function
  E(w) = (1/2) Σ_{k=1}^{m} (tk − ok)^2
  has to be differentiable, so the output function o also has to be differentiable.
Activation functions
[Figure: step activation, output jumping from −1 to +1 at activation 0, vs. linear activation]
  o = sgn( Σ_j wj xj + w0 )    Not differentiable
  o = Σ_j wj xj + w0           Differentiable
  ∂E/∂wi = ∂/∂wi [ (1/2) Σ_k (tk − ok)^2 ]            (1)
         = (1/2) Σ_k ∂/∂wi (tk − ok)^2                (2)
         = (1/2) Σ_k 2(tk − ok) ∂/∂wi (tk − ok)       (3)
         = Σ_k (tk − ok) ∂/∂wi (tk − w · xk)          (4)
         = Σ_k (tk − ok)(−xik)                        (5)
where xik is the i-th component of xk.
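Equation (5) can be sanity-checked numerically: for the differentiable linear output o = w · x (as the derivation requires), the analytic gradient should match a finite-difference estimate. The toy data and weights below are arbitrary choices for illustration.

```python
# Numerical check of equation (5): dE/dwi = sum_k (tk - ok) * (-xk[i])
# for the linear output ok = w . xk. Data and weights are made-up toy values.

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def E(w, data):
    """E(w) = (1/2) * sum_k (tk - w.xk)^2"""
    return 0.5 * sum((t - dot(w, x)) ** 2 for x, t in data)

def analytic_grad(w, data, i):
    """Equation (5): sum_k (tk - ok) * (-xk[i])."""
    return sum((t - dot(w, x)) * (-x[i]) for x, t in data)

data = [((1.0, 0.0, 2.0), 1.0), ((1.0, 1.0, -1.0), -1.0)]  # xk includes dummy x0 = 1
w = (0.2, -0.1, 0.4)
eps = 1e-6
for i in range(3):
    w_plus = tuple(wj + (eps if j == i else 0.0) for j, wj in enumerate(w))
    numeric = (E(w_plus, data) - E(w, data)) / eps
    print(i, analytic_grad(w, data, i), numeric)  # the two columns agree closely
```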
So,
  Δwi = η Σ_k (tk − ok) xik    (6)
This is called the perceptron learning rule, with "true gradient descent".
• Problem with true gradient descent:
  The search process can land in a local optimum.
• Common approach to this: use stochastic gradient descent:
  – Instead of doing the weight update after all training examples have been processed, do the weight update after each training example has been processed (i.e., after each perceptron output has been calculated).
  – Stochastic gradient descent approximates true gradient descent increasingly well as η → 0.
Training a perceptron
1. Start with random weights, w = (w1, w2, ..., wn).
2. Select training example (xk, tk).
3. Run the perceptron with input xk and weights w to obtain output o.
4. Let η be the learning rate (a user-set parameter). Update each weight:
   wi ← wi + Δwi, where Δwi = η (tk − ok) xik
5. Go to 2.
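Steps 1 to 5 can be sketched as a short training loop with stochastic updates (one example at a time). The OR dataset, learning rate η = 0.1, epoch count, and random seed below are illustrative choices, not specified in the slides.

```python
import random

# Sketch of the training procedure above, with stochastic weight updates.
# Dataset (OR with 0/1 inputs), eta = 0.1, and 50 epochs are assumed values.

def sgn(z):
    return (z > 0) - (z < 0)

def train_perceptron(data, eta=0.1, epochs=50, seed=0):
    """data: list of (xk, tk), where each xk already includes the dummy x0 = 1."""
    rng = random.Random(seed)
    n = len(data[0][0])
    w = [rng.uniform(-0.5, 0.5) for _ in range(n)]        # 1. random weights
    for _ in range(epochs):
        for x, t in data:                                  # 2. select example
            o = sgn(sum(wi * xi for wi, xi in zip(w, x)))  # 3. run the perceptron
            for i in range(n):                             # 4. wi <- wi + eta*(t - o)*xi
                w[i] += eta * (t - o) * x[i]
    return w

# Each xk is (1, x1, x2), so w[0] plays the role of the bias w0.
data = [((1, 0, 0), -1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
w = train_perceptron(data)
for x, t in data:
    o = sgn(sum(wi * xi for wi, xi in zip(w, x)))
    print(x, t, o)  # after training, o matches t on every example
```

Since OR is linearly separable, the perceptron convergence theorem guarantees this loop stops making mistakes after finitely many updates.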