Single-Layer Perceptron Classifiers
Berlin Chen, 2002
Outline
• Foundations of trainable decision-making networks to be formulated
  – Mapping from input space to output space (classification space)
• Focus on the classification of linearly separable classes of patterns
  – Linear discriminant functions and the simple correction rule
  – Continuous error-function minimization
• Explanation and justification of the perceptron and delta training rules
Classification Model, Features, and Decision Regions
• A pattern is the quantitative description of an object, event, or phenomenon
  – Spatial patterns: weather maps, fingerprints, …
  – Temporal patterns: speech signals, …
• Pattern classification/recognition
  – Assign the input data (a physical object, event, or phenomenon) to one of the pre-specified classes (categories)
  – Discriminate the input data within the object population via the search for invariant attributes among members of the population
Classification Model, Features, and Decision Regions (cont.)
• The block diagram of the recognition and classification system
[Figure: block diagram — feature extraction performs dimensionality reduction; a neural network can serve both classification and feature extraction]
Classification Model, Features, and Decision Regions (cont.)
• More about feature extraction
  – Data compressed from the input patterns while preserving the salient information
  – E.g.
    • Speech vowel sounds analyzed by a 16-channel filterbank provide 16 spectral vectors, which can be further transformed into two dimensions
      – Tone height (high-low) and retraction (front-back)
    • Input patterns are thus projected onto, and reduced to, lower dimensions
Classification Model, Features, and Decision Regions (cont.)
• More about feature extraction
[Figure: patterns in the original (x, y) space projected onto reduced coordinates (x', y')]
Classification Model, Features, and Decision Regions (cont.)
• Two simple ways to generate the pattern vectors for cases of spatial and temporal objects to be classified
• A pattern classifier maps input patterns (vectors) in E^n space into numbers (E^1) that specify the class membership:

  i_0(\mathbf{x}) = i_j, \quad j = 1, 2, \ldots, R
Classification Model, Features, and Decision Regions (cont.)
• Classification described in geometric terms
  – Decision regions
  – Decision surfaces: generally, the decision surfaces for n-dimensional patterns may be (n-1)-dimensional hyper-surfaces

  i_0(\mathbf{x}) = i_j, \quad \text{for all } \mathbf{x} \in \mathcal{X}_j, \; j = 1, 2, \ldots, R

[Figure: the decision surfaces here are curved lines]
Discriminant Functions
• Determine the membership in a category by the classifier based on the comparison of R discriminant functions g_1(x), g_2(x), …, g_R(x)
  – x is within the region X_k if g_k(x) has the largest value:

  i_0(\mathbf{x}) = k \quad \text{if } g_k(\mathbf{x}) > g_j(\mathbf{x}) \text{ for } k, j = 1, 2, \ldots, R, \; j \neq k

[Figure: classifier block diagram — inputs x_1, x_2, …, x_n feed R discriminators g_1, g_2, …, g_R followed by a maximum selector; the training patterns are x_1, x_2, …, x_p, …, x_P with P >> n, and the classifier is assumed to have been designed already]
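The maximum-selector scheme above can be sketched in a few lines. The linear form g_i(x) = w_i·x + b_i is one common choice (assumed here); the particular weights are those of Example 3.1's Solution 1, used purely for illustration.

```python
def classify(x, weights, biases):
    """Return the 1-based index of the largest discriminant g_i(x) = w_i . x + b_i."""
    scores = [sum(wj * xj for wj, xj in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    return max(range(len(scores)), key=lambda i: scores[i]) + 1

# Discriminants from Example 3.1, Solution 1: g1 = -2*x1 + x2 + 3, g2 = 2*x1 - x2 - 1
weights = [[-2, 1], [2, -1]]
biases = [3, -1]
print(classify([0, 0], weights, biases))  # 1
print(classify([2, 0], weights, biases))  # 2
```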
Discriminant Functions (cont.)
• Example 3.1

  g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x}) = -2x_1 + x_2 + 2
  g(\mathbf{x}) > 0 : \text{class 1}, \qquad g(\mathbf{x}) < 0 : \text{class 2}

  Decision surface equation: \; -2x_1 + x_2 + 2 = 0

– The decision surface does not uniquely specify the discriminant functions
– A classifier that classifies patterns into two classes or categories is called a “dichotomizer” (from the roots for “two” and “cut”)
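A minimal sketch of this dichotomizer in code, with the discriminant of Example 3.1 hard-coded:

```python
def dichotomize(x1, x2):
    """Example 3.1's dichotomizer: the sign of g(x) alone decides the class."""
    g = -2 * x1 + x2 + 2
    return 1 if g > 0 else 2  # g > 0 -> class 1, g < 0 -> class 2

print(dichotomize(0, 0))  # 1 (g = 2 > 0)
print(dichotomize(2, 0))  # 2 (g = -2 < 0)
```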
Discriminant Functions (cont.)
Discriminant Functions (cont.)
[Figure: the two discriminant surfaces in (x, y, g) space intersect along the decision line in the plane g = 0; the construction uses the points (1, 0, 0), (0, -2, 0), and (0, -2, 1) with normal vectors [2, -1, 0], [-2, 1, 0], [2, -1, 1], and [0, 0, 1]]

Solution 1:
  (x - 0, \; y + 2, \; g_1 - 1) \cdot (2, -1, 1) = 0 \;\Rightarrow\; 2x - y - 2 + g_1 - 1 = 0 \;\Rightarrow\; g_1 = -2x + y + 3
  (x - 0, \; y + 2, \; g_2 - 1) \cdot (-2, 1, 1) = 0 \;\Rightarrow\; -2x + y + 2 + g_2 - 1 = 0 \;\Rightarrow\; g_2 = 2x - y - 1
  g = g_1 - g_2 = 0 \;\Rightarrow\; -4x + 2y + 4 = 0 \;\Rightarrow\; -2x + y + 2 = 0

Solution 2:
  (x - 0, \; y + 2, \; g_1 - 1) \cdot (2, -1, 2) = 0 \;\Rightarrow\; 2x - y - 2 + 2g_1 - 2 = 0 \;\Rightarrow\; g_1 = -x + \tfrac{1}{2} y + 2
  (x - 0, \; y + 2, \; g_2 - 1) \cdot (-2, 1, 2) = 0 \;\Rightarrow\; -2x + y + 2 + 2g_2 - 2 = 0 \;\Rightarrow\; g_2 = x - \tfrac{1}{2} y
  g = g_1 - g_2 = 0 \;\Rightarrow\; -2x + y + 2 = 0

An infinite number of discriminant functions will yield correct classification. In vector form, Solution 1 is

  g_1(\mathbf{x}) = \begin{bmatrix} -2 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + 3, \qquad g_2(\mathbf{x}) = \begin{bmatrix} 2 & -1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - 1
Discriminant Functions (cont.)
• Reducing a multi-class discriminator to a two-class one:

  g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})
  g(\mathbf{x}) > 0 : \text{class 1}, \qquad g(\mathbf{x}) < 0 : \text{class 2}

[Figure: multi-class classifier vs. two-class classifier — the two discriminators are merged by subtraction followed by sign examination]
Discriminant Functions (cont.)
The design of a discriminator for this case is not straightforward: the discriminant functions may turn out to be nonlinear functions of x_1 and x_2.
Bayes’ Decision Theory
• Decision making based on both the posterior knowledge obtained from specific observation data and prior knowledge of the categories
  – Prior class probabilities: P(\omega_i), \; \forall \text{ class } i
  – Class-conditional probabilities: P(x|\omega_i), \; \forall \text{ class } i

  k = \arg\max_i P(\omega_i | x) = \arg\max_i \frac{P(x|\omega_i) P(\omega_i)}{P(x)} = \arg\max_i \frac{P(x|\omega_i) P(\omega_i)}{\sum_{j=1}^{R} P(x|\omega_j) P(\omega_j)}

  Since the denominator does not depend on i,

  k = \arg\max_i P(\omega_i | x) = \arg\max_i P(x|\omega_i) P(\omega_i)
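The MAP rule above can be sketched directly: the evidence P(x) is common to every class, so comparing P(x|ω_i)P(ω_i) suffices. The one-dimensional Gaussian likelihoods and the equal priors below are illustrative assumptions, not from the notes.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

priors = [0.5, 0.5]                                   # P(w_1), P(w_2) (assumed)
likelihoods = [lambda x: gaussian_pdf(x, -1.0, 1.0),  # P(x|w_1) (assumed)
               lambda x: gaussian_pdf(x, +1.0, 1.0)]  # P(x|w_2) (assumed)

def bayes_decide(x):
    """argmax_i P(x|w_i) P(w_i), returned as a 1-based class label."""
    scores = [p(x) * prior for p, prior in zip(likelihoods, priors)]
    return scores.index(max(scores)) + 1

print(bayes_decide(-2.0))  # 1: closer to the class-1 mean
print(bayes_decide(0.5))   # 2: closer to the class-2 mean
```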
Bayes’ Decision Theory (cont.)
• Bayes’ decision rule is designed to minimize the overall risk involved in making a decision
  – The expected loss (conditional risk) when making decision \delta_i:

  R(\delta_i | x) = \sum_j l(\delta_i | \omega_j) P(\omega_j | x) = \sum_{j \neq i} P(\omega_j | x) = 1 - P(\omega_i | x),
  \quad \text{where } l(\delta_i | \omega_j) = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases}

• The overall risk (Bayes’ risk):

  R = \int_{-\infty}^{\infty} R(\delta(x) | x) \, p(x) \, dx, \quad \delta(x): \text{the decision selected for a sample } x

  – Minimize the overall risk (classification error) by computing the conditional risks R(\delta_i | x) and selecting the decision \delta_i for which the conditional risk is minimum, i.e., for which P(\omega_i | x) is maximum (minimum-error-rate decision rule)
Bayes’ Decision Theory (cont.)
• Two-class pattern classification

  Likelihood ratio: \quad l(x) = \frac{P(x|\omega_1)}{P(x|\omega_2)} \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; \frac{P(\omega_2)}{P(\omega_1)}

  Log-likelihood ratio: \quad \log l(x) = \log P(x|\omega_1) - \log P(x|\omega_2) \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; \log P(\omega_2) - \log P(\omega_1)

  Bayes’ classifier: \quad P(x|\omega_1) P(\omega_1) \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; P(x|\omega_2) P(\omega_2)

  g_1(x) = P(\omega_1|x) \cong P(x|\omega_1) P(\omega_1), \qquad g_2(x) = P(\omega_2|x) \cong P(x|\omega_2) P(\omega_2)

  Classification error:
  p(\text{error}) = P(x \in R_2, \omega_1) + P(x \in R_1, \omega_2)
  = P(x \in R_2 | \omega_1) P(\omega_1) + P(x \in R_1 | \omega_2) P(\omega_2)
  = \int_{R_2} P(x|\omega_1) P(\omega_1) \, dx + \int_{R_1} P(x|\omega_2) P(\omega_2) \, dx
Bayes’ Decision Theory (cont.)
• When the environment is multivariate Gaussian, the Bayes’ classifier reduces to a linear classifier
  – The same form taken by the perceptron
  – But the linear nature of the perceptron is not contingent on the assumption of Gaussianity

Assumptions:

  \text{Class } \omega_1: \; E[\mathbf{X}] = \boldsymbol{\mu}_1, \qquad \text{Class } \omega_2: \; E[\mathbf{X}] = \boldsymbol{\mu}_2
  E[(\mathbf{X} - \boldsymbol{\mu}_i)(\mathbf{X} - \boldsymbol{\mu}_i)^t] = \boldsymbol{\Sigma} \quad (\text{shared by both classes})

  P(\mathbf{x}|\omega_i) = \frac{1}{(2\pi)^{n/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\!\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \right]

  P(\omega_1) = P(\omega_2) = \frac{1}{2}
Bayes’ Decision Theory (cont.)
• When the environment is Gaussian, the Bayes’ classifier reduces to a linear classifier (cont.)

  \log l(\mathbf{x}) = \log P(\mathbf{x}|\omega_1) - \log P(\mathbf{x}|\omega_2)
  = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_1)^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) + \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_2)^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_2)
  = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^t \boldsymbol{\Sigma}^{-1} \mathbf{x} + \frac{1}{2} \left( \boldsymbol{\mu}_2^t \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_2 - \boldsymbol{\mu}_1^t \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 \right)
  = \mathbf{w}^t \mathbf{x} + b

  \therefore \; \log l(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + b \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; \log \frac{P(\omega_2)}{P(\omega_1)} = \log 1 = 0
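The reduction above can be checked numerically: with a shared covariance, the quadratic terms cancel and the log-likelihood ratio is exactly w^T x + b with w = Σ^{-1}(μ_1 - μ_2). The particular μ's, Σ, and test point below are arbitrary assumptions.

```python
import numpy as np

# illustrative (assumed) class means and shared covariance
mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
Sinv = np.linalg.inv(Sigma)

# the linear form derived on this slide
w = Sinv @ (mu1 - mu2)
b = 0.5 * (mu2 @ Sinv @ mu2 - mu1 @ Sinv @ mu1)

def log_likelihood_ratio(x):
    """log P(x|w1) - log P(x|w2), computed from the quadratic forms directly."""
    q1 = (x - mu1) @ Sinv @ (x - mu1)
    q2 = (x - mu2) @ Sinv @ (x - mu2)
    return -0.5 * q1 + 0.5 * q2

x = np.array([0.3, -0.7])
print(np.isclose(log_likelihood_ratio(x), w @ x + b))  # True
```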
Bayes’ Decision Theory (cont.)
• Multi-class pattern classification
Linear Machine and Minimum Distance Classification
• Find the linear-form discriminant function for two-class classification when the class prototypes are known
• Example 3.1: Select the decision hyperplane that contains the midpoint of the line segment connecting the center points of the two classes
Linear Machine and Minimum Distance Classification (cont.)
The dichotomizer’s discriminant function g(x):

  \left( \mathbf{x} - \frac{\mathbf{x}_1 + \mathbf{x}_2}{2} \right)^{\!t} (\mathbf{x}_1 - \mathbf{x}_2) = 0
  (\mathbf{x}_1 - \mathbf{x}_2)^t \mathbf{x} + \frac{1}{2} \left( \|\mathbf{x}_2\|^2 - \|\mathbf{x}_1\|^2 \right) = 0

Taken as \mathbf{w}^t \mathbf{y} = 0 with the augmented input pattern \mathbf{y} = \begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix}, where

  \mathbf{w} = \mathbf{x}_1 - \mathbf{x}_2, \qquad w_{n+1} = \frac{1}{2} \left( \|\mathbf{x}_2\|^2 - \|\mathbf{x}_1\|^2 \right)

It is a simple minimum-distance classifier.
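The augmented weight vector above is easy to build in code; the prototypes used here are the ones from part (c) of the worked example later in this section, so the resulting hyperplane can be read off directly.

```python
import numpy as np

def dichotomizer_weights(x1, x2):
    """Augmented weights [x1 - x2, (||x2||^2 - ||x1||^2)/2] of the bisecting hyperplane."""
    w_last = 0.5 * (x2 @ x2 - x1 @ x1)
    return np.append(x1 - x2, w_last)

x1, x2 = np.array([2.0, 5.0]), np.array([-1.0, -3.0])
w = dichotomizer_weights(x1, x2)
print(w)  # [3, 8, -9.5], i.e. the surface 3*x1 + 8*x2 - 9.5 = 0

# the midpoint of the segment joining the prototypes lies on the surface
mid = 0.5 * (x1 + x2)
print(np.isclose(w[:2] @ mid + w[2], 0.0))  # True
```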
Linear Machine and Minimum Distance Classification (cont.)
• The linear-form discriminant functions for multi-class classification
  – There are up to R(R-1)/2 decision hyperplanes for R pairwise separable classes
[Figure: two scatter plots of three pattern classes (x, o, Δ) with pairwise decision lines — some classes may not be contiguous]
Linear Machine and Minimum Distance Classification (cont.)
• Linear machine or minimum-distance classifier
  – Assume the class prototypes are known for all classes
• Euclidean distance between input pattern x and the center of class i, x_i:

  \| \mathbf{x} - \mathbf{x}_i \| = \sqrt{ (\mathbf{x} - \mathbf{x}_i)^t (\mathbf{x} - \mathbf{x}_i) }
  \| \mathbf{x} - \mathbf{x}_i \|^2 = \mathbf{x}^t \mathbf{x} - 2 \mathbf{x}_i^t \mathbf{x} + \mathbf{x}_i^t \mathbf{x}_i

• Since \mathbf{x}^t \mathbf{x} is the same for all classes, minimizing \| \mathbf{x} - \mathbf{x}_i \|^2 is equal to maximizing \mathbf{x}_i^t \mathbf{x} - \frac{1}{2} \mathbf{x}_i^t \mathbf{x}_i
  – Set the discriminant function for each class i to be:

  g_i(\mathbf{x}) = \mathbf{x}_i^t \mathbf{x} - \frac{1}{2} \mathbf{x}_i^t \mathbf{x}_i

  In augmented form, g_i(\mathbf{x}) = \mathbf{w}_i^t \mathbf{y}, where \mathbf{w}_i = \begin{bmatrix} \mathbf{x}_i \\ w_{i, n+1} \end{bmatrix}, \quad w_{i, n+1} = -\frac{1}{2} \mathbf{x}_i^t \mathbf{x}_i
Linear Machine and Minimum Distance Classification (cont.)
This approach is also called correlation classification.
A 1 is appended as the (n+1)-th component of the input pattern:

  g_i(\mathbf{x}) = \mathbf{x}_i^t \mathbf{x} - \frac{1}{2} \mathbf{x}_i^t \mathbf{x}_i \quad \Longrightarrow \quad g_i(\mathbf{x}) = \mathbf{w}_i^t \mathbf{y}
Linear Machine and Minimum Distance Classification (cont.)
• Example 3.2

  Prototypes: \; \mathbf{x}_1 = \begin{bmatrix} 10 \\ 2 \end{bmatrix}, \; \mathbf{x}_2 = \begin{bmatrix} 2 \\ -5 \end{bmatrix}, \; \mathbf{x}_3 = \begin{bmatrix} -5 \\ 5 \end{bmatrix}

  Weight vectors: \; \mathbf{w}_1 = \begin{bmatrix} 10 \\ 2 \\ -52 \end{bmatrix}, \; \mathbf{w}_2 = \begin{bmatrix} 2 \\ -5 \\ -14.5 \end{bmatrix}, \; \mathbf{w}_3 = \begin{bmatrix} -5 \\ 5 \\ -25 \end{bmatrix}

  Discriminant functions, from g_i(\mathbf{x}) = \mathbf{x}_i^t \mathbf{x} - \frac{1}{2} \mathbf{x}_i^t \mathbf{x}_i:

  g_1(\mathbf{x}) = 10 x_1 + 2 x_2 - 52
  g_2(\mathbf{x}) = 2 x_1 - 5 x_2 - 14.5
  g_3(\mathbf{x}) = -5 x_1 + 5 x_2 - 25

  Decision surfaces:

  S_{12}: \; 8 x_1 + 7 x_2 - 37.5 = 0
  S_{13}: \; 15 x_1 - 3 x_2 - 27 = 0
  S_{23}: \; 7 x_1 - 10 x_2 + 10.5 = 0
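Example 3.2 can be reproduced in a few lines: build g_i(x) = x_i^T x - x_i^T x_i / 2 from the three prototypes and classify by the maximum discriminant. Each prototype should land in its own class.

```python
import numpy as np

# the three class prototypes of Example 3.2
prototypes = [np.array([10.0, 2.0]),
              np.array([2.0, -5.0]),
              np.array([-5.0, 5.0])]

def g(i, x):
    """Minimum-distance discriminant g_i(x) = x_i^T x - x_i^T x_i / 2."""
    p = prototypes[i]
    return p @ x - 0.5 * (p @ p)

def classify(x):
    """1-based label of the class with the largest discriminant."""
    return max(range(len(prototypes)), key=lambda i: g(i, x)) + 1

print(classify(np.array([10.0, 2.0])))  # 1
print(classify(np.array([2.0, -5.0])))  # 2
print(classify(np.array([-5.0, 5.0])))  # 3
```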
Linear Machine and Minimum Distance Classification (cont.)
• If R linear discriminant functions exist for a set of patterns such that

  g_i(\mathbf{x}) > g_j(\mathbf{x}) \quad \text{for } \mathbf{x} \in \text{Class } i, \;\; i, j = 1, 2, \ldots, R, \;\; i \neq j

  – then the classes are linearly separable
Linear Machine and Minimum Distance Classification (cont.)
Linear Machine and Minimum Distance Classification (cont.)
(a) 2x_1 - x_2 + 2 = 0; the decision surface is a line
(b) 2x_1 - x_2 + 2 = 0; the decision surface is a plane
(c) \mathbf{x}_1 = [2, 5]^t, \; \mathbf{x}_2 = [-1, -3]^t \;\Rightarrow\; the decision surface for the minimum-distance classifier:

  (\mathbf{x}_1 - \mathbf{x}_2)^t \mathbf{x} + \frac{1}{2} \left( \|\mathbf{x}_2\|^2 - \|\mathbf{x}_1\|^2 \right) = 0
  3 x_1 + 8 x_2 - \frac{19}{2} = 0

(d) [Figure: three sketches in the (x_1, x_2) plane — the line through (-1, 0) and (0, 2) for the surface of (a) and (b), and the surface of (c) with intercepts at 19/6 and 19/16]
Linear Machine and Minimum Distance Classification (cont.)
• Examples 3.1 and 3.2 have shown that the coefficients (weights) of the linear discriminant functions can be determined if a priori information about the sets of patterns and their class membership is known
Linear Machine and Minimum Distance Classification (cont.)
• The example of linearly non-separable patterns
Linear Machine and Minimum Distance Classification (cont.)
[Figure: the four patterns (1, 1), (-1, -1), (-1, 1), (1, -1) are not linearly separable in the input plane. TLU #1 implements x_1 + x_2 + 1 = 0 and TLU #2 implements -x_1 - x_2 + 1 = 0; in the image space of their outputs (o_1, o_2), the single line o_1 + o_2 - 1 = 0 separates the two classes.]
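The two-TLU layered classifier in the figure above can be sketched directly: TLU #1 and TLU #2 map the linearly non-separable patterns into the (o_1, o_2) image space, where the single line o_1 + o_2 - 1 = 0 separates them.

```python
def sgn(v):
    """Hard-limiting TLU output (here sgn(0) is taken as +1)."""
    return 1 if v >= 0 else -1

def layered_classifier(x1, x2):
    o1 = sgn(x1 + x2 + 1)    # TLU #1: x1 + x2 + 1 = 0
    o2 = sgn(-x1 - x2 + 1)   # TLU #2: -x1 - x2 + 1 = 0
    return sgn(o1 + o2 - 1)  # output TLU in the (o1, o2) image space

# (1,1) and (-1,-1) fall on one side; (-1,1) and (1,-1) on the other
print([layered_classifier(*p) for p in [(1, 1), (-1, -1), (-1, 1), (1, -1)]])
# [-1, -1, 1, 1]
```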
Discrete Perceptron Training Algorithm - Geometrical Representations
• Examine the neural network classifiers that derive/train their weights based on the error-correction scheme

  g(\mathbf{y}) = \mathbf{w}^t \mathbf{y}, \quad \mathbf{y}: \text{augmented input pattern}
  \text{Class 1: } \mathbf{w}^t \mathbf{y} > 0, \qquad \text{Class 2: } \mathbf{w}^t \mathbf{y} < 0

[Figure: vector representations in the weight space]
Discrete Perceptron Training Algorithm - Geometrical Representations (cont.)
• Devise an analytic approach based on the geometrical representations
  – E.g., the decision surface for the training pattern y_1 in the weight space; the gradient points in the direction of steepest increase:

  \nabla_{\mathbf{w}} (\mathbf{w}^t \mathbf{y}_1) = \mathbf{y}_1

  If y_1 in Class 1 is misclassified: \; \mathbf{w}' = \mathbf{w} + c \mathbf{y}_1
  If y_1 in Class 2 is misclassified: \; \mathbf{w}' = \mathbf{w} - c \mathbf{y}_1

  c (> 0) is the correction increment (two times the learning constant introduced before); c controls the size of the adjustment
Discrete Perceptron Training Algorithm - Geometrical Representations (cont.)
Weight adjustments for three augmented training patterns y_1 ∈ C_1, y_2 ∈ C_1, y_3 ∈ C_2, shown in the weight space
- Weights in the shaded region are the solutions
- The three pattern lines are fixed during training
Discrete Perceptron Training Algorithm - Geometrical Representations (cont.)
• More about the correction increment c
  – If it is not merely a constant but related to the current training pattern: select c based on the distance between w_1 and the corrected weight vector w'. The distance from w_1 to the decision plane \mathbf{w}^t \mathbf{y} = 0 in the weight space is

  p = \frac{ \mathbf{w}_1^t \mathbf{y} }{ \| \mathbf{y} \| }

  Requiring the corrected weights \mathbf{w}' = \mathbf{w}_1 \mp c \mathbf{y} to lie on the decision plane:

  (\mathbf{w}_1 \mp c \mathbf{y})^t \mathbf{y} = 0 \;\Rightarrow\; c = \pm \frac{ \mathbf{w}_1^t \mathbf{y} }{ \mathbf{y}^t \mathbf{y} }, \quad \text{because } \mathbf{y}^t \mathbf{y} > 0

  \Rightarrow \; c = \frac{ | \mathbf{w}_1^t \mathbf{y} | }{ \| \mathbf{y} \|^2 }
Discrete Perceptron Training Algorithm - Geometrical Representations (cont.)
• For the fixed correction rule with c = constant, the correction of weights is always the same fixed portion of the current training vector
  – The weights can be initialized at any value

  \mathbf{w}' = \mathbf{w} \pm c \mathbf{y} \quad \text{or} \quad \mathbf{w}' = \mathbf{w} + \Delta \mathbf{w}, \;\; \Delta \mathbf{w} = \frac{c}{2} \left[ d - \operatorname{sgn}(\mathbf{w}^t \mathbf{y}) \right] \mathbf{y}

• For the dynamic correction rule with c dependent on the distance from the weight vector to the decision surface in the weight space
  – The initial weights should be different from 0

  \Rightarrow \; c = \frac{ | \mathbf{w}_1^t \mathbf{y} | }{ \| \mathbf{y} \|^2 }
Discrete Perceptron Training Algorithm - Geometrical Representations (cont.)
• Dynamic correction rule with c dependent on the distance from the weight vector to the decision plane:

  c = \lambda \frac{ \mathbf{w}_1^t \mathbf{y} }{ \mathbf{y}^t \mathbf{y} }
  \| c \mathbf{y} \| = \lambda \frac{ | \mathbf{w}_1^t \mathbf{y} | }{ \| \mathbf{y} \| }

  i.e., the length of the adjustment is \lambda times the distance from w_1 to the decision plane
Discrete Perceptron Training Algorithm - Geometrical Representations (cont.)
• Example 3.3: augmented training patterns

  \mathbf{y}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \; \mathbf{y}_2 = \begin{bmatrix} -0.5 \\ 1 \end{bmatrix}, \; \mathbf{y}_3 = \begin{bmatrix} 3 \\ 1 \end{bmatrix}, \; \mathbf{y}_4 = \begin{bmatrix} -2 \\ 1 \end{bmatrix}
  \mathbf{y}_1 \in C_1, \; \mathbf{y}_2 \in C_2, \; \mathbf{y}_3 \in C_1, \; \mathbf{y}_4 \in C_2

  \Delta \mathbf{w}^k = \frac{c}{2} \left[ d_j - \operatorname{sgn}(\mathbf{w}^{k\,t} \mathbf{y}_j) \right] \mathbf{y}_j

  What if \mathbf{w}^{k\,t} \mathbf{y}_j = 0? \;\rightarrow\; interpreted as a mistake and followed by a correction
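The discrete perceptron rule can be run on the four patterns of Example 3.3. The initial weight vector and the choice c = 1 here are assumptions for illustration; since the patterns are linearly separable, the sweeps terminate once a full pass makes no corrections.

```python
# augmented patterns of Example 3.3 with desired outputs d = +1 (C1) / -1 (C2)
patterns = [([1.0, 1.0], 1), ([-0.5, 1.0], -1), ([3.0, 1.0], 1), ([-2.0, 1.0], -1)]
c = 1.0
w = [-2.5, 1.75]  # arbitrary initial weight (an assumption for illustration)

for _ in range(50):  # sweep until a full pass makes no corrections
    errors = 0
    for y, d in patterns:
        net = w[0] * y[0] + w[1] * y[1]
        if net * d <= 0:  # misclassified; w^T y = 0 is also treated as a mistake
            errors += 1
            # for a genuine mistake, (c/2)(d - sgn(w^T y)) y reduces to c*d*y
            w = [wi + c * d * yi for wi, yi in zip(w, y)]
    if errors == 0:
        break

# every pattern now lies on the correct side of w^T y = 0
print(all((w[0] * y[0] + w[1] * y[1] > 0) == (d == 1) for y, d in patterns))  # True
```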
Continuous Perceptron Training Algorithm
• Replace the TLU (Threshold Logic Unit) with the sigmoid activation function for two reasons:
  – Gain finer control over the training procedure
  – Facilitate the differential characteristics to enable computation of the error gradient

  \hat{\mathbf{w}} = \mathbf{w} - \eta \nabla E(\mathbf{w}), \quad \eta: \text{learning constant}, \;\; \nabla E(\mathbf{w}): \text{error gradient}
Continuous Perceptron Training Algorithm (cont.)
• The new weights are obtained by moving in the direction of the negative gradient along the multidimensional error surface
Continuous Perceptron Training Algorithm (cont.)
• Define the error as the squared difference between the desired output and the actual output

  E = \frac{1}{2} (d - o)^2
  E = \frac{1}{2} \left[ d - f(\mathbf{w}^t \mathbf{y}) \right]^2 \quad \text{or} \quad E = \frac{1}{2} \left[ d - f(net) \right]^2

  \nabla E(\mathbf{w}) \triangleq \begin{bmatrix} \partial E / \partial w_1 \\ \vdots \\ \partial E / \partial w_{n+1} \end{bmatrix}
  = -(d - o) f'(net) \begin{bmatrix} \partial net / \partial w_1 \\ \vdots \\ \partial net / \partial w_{n+1} \end{bmatrix}
  = -(d - o) f'(net) \, \mathbf{y}
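The gradient just derived can be verified against a finite-difference approximation. The bipolar sigmoid, λ = 1, and the particular w, y, d below are illustrative assumptions.

```python
import math

lam = 1.0

def f(net):
    """Bipolar continuous activation: 2 / (1 + e^(-lam*net)) - 1."""
    return 2.0 / (1.0 + math.exp(-lam * net)) - 1.0

def E(w, y, d):
    """Squared error E = (d - f(w^T y))^2 / 2."""
    net = sum(wi * yi for wi, yi in zip(w, y))
    return 0.5 * (d - f(net)) ** 2

def grad_E(w, y, d):
    """Analytic gradient -(d - o) f'(net) y with f'(net) = lam*(1 - o^2)/2."""
    net = sum(wi * yi for wi, yi in zip(w, y))
    o = f(net)
    return [-(d - o) * 0.5 * lam * (1.0 - o * o) * yi for yi in y]

w, y, d, eps = [0.3, -0.2], [1.0, 1.0], 1.0, 1e-6
analytic = grad_E(w, y, d)
numeric = [(E([w[0] + eps, w[1]], y, d) - E([w[0] - eps, w[1]], y, d)) / (2 * eps),
           (E([w[0], w[1] + eps], y, d) - E([w[0], w[1] - eps], y, d)) / (2 * eps)]
print(all(abs(a - n) < 1e-6 for a, n in zip(analytic, numeric)))  # True
```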
Continuous Perceptron Training Algorithm (cont.)
• Bipolar continuous activation function:

  f(net) = \frac{2}{1 + \exp(-\lambda \cdot net)} - 1
  f'(net) = \frac{2 \lambda \exp(-\lambda \cdot net)}{\left[ 1 + \exp(-\lambda \cdot net) \right]^2} = \frac{\lambda}{2} (1 - o^2)
  \hat{\mathbf{w}} = \mathbf{w} + \frac{1}{2} \eta \lambda (d - o)(1 - o^2) \, \mathbf{y}

• Unipolar continuous activation function:

  f(net) = \frac{1}{1 + \exp(-\lambda \cdot net)}
  f'(net) = \frac{\lambda \exp(-\lambda \cdot net)}{\left[ 1 + \exp(-\lambda \cdot net) \right]^2} = \lambda f(net) \left[ 1 - f(net) \right] = \lambda \, o (1 - o)
  \hat{\mathbf{w}} = \mathbf{w} + \eta \lambda (d - o) \, o (1 - o) \, \mathbf{y}
Continuous Perceptron Training Algorithm (cont.)
• Example 3.3, now trained with the bipolar continuous activation function

  f(net) = \frac{2}{1 + \exp(-net)} - 1

  \mathbf{y}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \; \mathbf{y}_2 = \begin{bmatrix} -0.5 \\ 1 \end{bmatrix}, \; \mathbf{y}_3 = \begin{bmatrix} 3 \\ 1 \end{bmatrix}, \; \mathbf{y}_4 = \begin{bmatrix} -2 \\ 1 \end{bmatrix}
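The same four patterns can be trained with the delta rule and the bipolar sigmoid (λ = 1). The learning constant η, the initial weight, and the number of sweeps are illustrative assumptions.

```python
import math

patterns = [([1.0, 1.0], 1.0), ([-0.5, 1.0], -1.0), ([3.0, 1.0], 1.0), ([-2.0, 1.0], -1.0)]
lam, eta = 1.0, 0.5
w = [-2.5, 1.75]  # arbitrary initial weight (assumed)

def f(net):
    """Bipolar continuous activation."""
    return 2.0 / (1.0 + math.exp(-lam * net)) - 1.0

for _ in range(500):  # sweep count chosen generously; the data are separable
    for y, d in patterns:
        net = w[0] * y[0] + w[1] * y[1]
        o = f(net)
        # delta rule with bipolar sigmoid: w <- w + (1/2) eta lam (d - o)(1 - o^2) y
        delta = 0.5 * eta * lam * (d - o) * (1.0 - o * o)
        w = [wi + delta * yi for wi, yi in zip(w, y)]

# after training, the sign of w^T y matches the desired class for every pattern
print(all((w[0] * y[0] + w[1] * y[1] > 0) == (d > 0) for y, d in patterns))
```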
Continuous Perceptron Training Algorithm (cont.)
• Example 3.3 (cont.)
[Figure: total error surface, with training trajectories started from four arbitrary initial weights]
Continuous Perceptron Training Algorithm (cont.)
• Treat the last, fixed component of the input pattern vector as the neuron activation threshold
Continuous Perceptron Training Algorithm (cont.)
• R-category linear classifier using R discrete bipolar perceptrons
  – Goal: the i-th TLU responds with +1 to indicate class i, and all other TLUs respond with -1 (a “local representation”)

  \hat{\mathbf{w}}_i = \mathbf{w}_i + \frac{c}{2} (d_i - o_i) \, \mathbf{y}
  d_i = 1, \quad d_j = -1 \;\; \text{for } j = 1, 2, \ldots, R, \; j \neq i
Continuous PerceptronTraining Algorithm (cont.)
• Example 3.5
Continuous Perceptron Training Algorithm (cont.)
• R-category linear classifier using R continuous bipolar perceptrons

  \hat{\mathbf{w}}_i = \mathbf{w}_i + \frac{1}{2} \eta \lambda (d_i - o_i)(1 - o_i^2) \, \mathbf{y}, \quad i = 1, 2, \ldots, R
  d_i = 1, \quad d_j = -1 \;\; \text{for } j = 1, 2, \ldots, R, \; j \neq i
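One training step of the R-category continuous perceptron can be sketched as follows: every perceptron i has its own weight vector w_i and a local target d_i (+1 for the true class, -1 otherwise). The initial weights, pattern, η, and λ below are illustrative assumptions.

```python
import math

def f(net, lam=1.0):
    """Bipolar continuous activation."""
    return 2.0 / (1.0 + math.exp(-lam * net)) - 1.0

def train_step(W, y, true_class, eta=0.2, lam=1.0):
    """Update every perceptron's weights on one augmented pattern y (in place)."""
    for i, w in enumerate(W):
        d = 1.0 if i == true_class else -1.0  # local representation of the target
        o = f(sum(wi * yi for wi, yi in zip(w, y)), lam)
        delta = 0.5 * eta * lam * (d - o) * (1.0 - o * o)
        W[i] = [wi + delta * yi for wi, yi in zip(w, y)]
    return W

# illustrative (assumed) initial weights for R = 3 classes, 2 inputs + bias
W = [[0.1, -0.2, 0.0], [0.0, 0.3, -0.1], [-0.2, 0.1, 0.2]]
y = [1.0, -1.0, 1.0]
net_before = sum(wi * yi for wi, yi in zip(W[0], y))
W = train_step(W, y, true_class=0)
net_after = sum(wi * yi for wi, yi in zip(W[0], y))
print(net_after > net_before)  # True: the true class's response moves toward +1
```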
Continuous PerceptronTraining Algorithm (cont.)
• Error function dependent on the difference vector d-o
Bayes’ Classifier vs. Perceptron
• The perceptron operates on the premise that the patterns to be classified are linearly separable (otherwise the training algorithm will oscillate), while the Bayes’ classifier assumes that the (Gaussian) distributions of the two classes do overlap each other
• The perceptron is nonparametric, while the Bayes’ classifier is parametric (its derivation is contingent on the assumption of the underlying distributions)
• The perceptron is simple and adaptive, and needs little storage, while the Bayes’ classifier could be made adaptive, but at the expense of increased storage and more complex computations
Homework
• P3.5, P3.7, P3.9, P3.22