Introduction to Machine Learning
Figures
Ethem Alpaydın
© The MIT Press, 2004
Chapter 1:
Introduction
[Figure 1.1 plot: x-axis income, y-axis savings; regions Low-Risk and High-Risk separated at thresholds θ1 and θ2]

Figure 1.1: Example of a training dataset where each circle corresponds to one data instance with input values on the corresponding axes and its sign indicates the class. For simplicity, only two customer attributes, income and savings, are taken as input, and the two classes are low-risk (‘+’) and high-risk (‘−’). An example discriminant that separates the two types of examples is also shown. From: E. Alpaydın. 2004. Introduction to Machine Learning. © The MIT Press.
[Figure 1.2 plot: x-axis mileage, y-axis price]

Figure 1.2: A training dataset of used cars and the function fitted. For simplicity, mileage is taken as the only input attribute and a linear model is used.
Chapter 2:
Supervised Learning
[Figure 2.1 plot: x-axis x1: price, y-axis x2: engine power; one example point (x1^t, x2^t) is marked]

Figure 2.1: Training set for the class of a “family car.” Each data point corresponds to one example car and the coordinates of the point indicate the price and engine power of that car. ‘+’ denotes a positive example of the class (a family car), and ‘−’ denotes a negative example (not a family car); it is another type of car.
[Figure 2.2 plot: x-axis x1: price, y-axis x2: engine power; the class C is the rectangle p1 ≤ price ≤ p2, e1 ≤ engine power ≤ e2]

Figure 2.2: Example of a hypothesis class. The class of family car is a rectangle in the price–engine power space.
[Figure 2.3 plot: x-axis x1: price, y-axis x2: engine power; rectangles C and h, with one false negative and one false positive marked]

Figure 2.3: C is the actual class and h is our induced hypothesis. The point where C is 1 but h is 0 is a false negative, and the point where C is 0 but h is 1 is a false positive. Other points, namely true positives and true negatives, are correctly classified.
[Figure 2.4 plot: x-axis x1: price, y-axis x2: engine power; hypotheses S and G and the class C shown]

Figure 2.4: S is the most specific hypothesis and G is the most general hypothesis.
[Figure 2.5 plot: axes x1 and x2]

Figure 2.5: An axis-aligned rectangle can shatter four points. Only rectangles covering two points are shown.
[Figure 2.6 plot: axes x1 and x2; rectangles C and h]

Figure 2.6: The difference between h and C is the sum of four rectangular strips, one of which is shaded.
[Figure 2.7 plot: axes x1 and x2; hypotheses h1 and h2]

Figure 2.7: When there is noise, there is not a simple boundary between the positive and negative instances, and zero misclassification error may not be possible with a simple hypothesis. A rectangle is a simple hypothesis with four parameters defining the corners. An arbitrary closed form can be drawn by piecewise functions with a larger number of control points.
[Figure 2.8 plot: x-axis price, y-axis engine power; regions for family car, sports car, and luxury sedan, with ‘?’ reject regions]

Figure 2.8: There are three classes: family car, sports car, and luxury sedan. There are three hypotheses induced, each one covering the instances of one class and leaving outside the instances of the other two classes. ‘?’ are reject regions where no, or more than one, class is chosen.
[Figure 2.9 plot: x-axis mileage, y-axis price]

Figure 2.9: Linear, second-order, and sixth-order polynomials are fitted to the same set of points. The highest order gives a perfect fit but, given this much data, it is very unlikely that the real curve is so shaped. The second order seems better than the linear fit in capturing the trend in the training data.
Chapter 3:
Bayes Decision Theory
[Figure 3.1 plot: axes x1 and x2; decision regions for C1, C2, C3 and a reject region]

Figure 3.1: Example of decision regions and decision boundaries.
[Figure 3.2 diagram: Rain → Wet grass; P(R) = 0.4, P(W | R) = 0.9, P(W | ~R) = 0.2]

Figure 3.2: Bayesian network modeling that rain is the cause of wet grass.
[Figure 3.3 diagram: Sprinkler and Rain → Wet grass; P(S) = 0.2, P(R) = 0.4; P(W | R,S) = 0.95, P(W | R,~S) = 0.90, P(W | ~R,S) = 0.90, P(W | ~R,~S) = 0.10]

Figure 3.3: Rain and sprinkler are the two causes of wet grass.
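Given the conditional probability tables in figure 3.3, diagnostic inference runs Bayes' rule backward from the observed effect to a cause. A minimal sketch, using only the numbers shown in the figure:

```python
# Diagnostic inference in the sprinkler/rain network of Figure 3.3.
# CPT values are taken from the figure; S and R are independent causes of W.
P_S = 0.2
P_R = 0.4
P_W_given = {(True, True): 0.95, (True, False): 0.90,
             (False, True): 0.90, (False, False): 0.10}

def p(b, prob_true):
    # probability of a boolean value given P(value is True)
    return prob_true if b else 1.0 - prob_true

# Marginalize over the hidden causes to get P(W) and the joint P(R, W)
p_w = sum(P_W_given[(r, s)] * p(r, P_R) * p(s, P_S)
          for r in (True, False) for s in (True, False))
p_r_and_w = sum(P_W_given[(True, s)] * P_R * p(s, P_S) for s in (True, False))

p_r_given_w = p_r_and_w / p_w   # Bayes' rule: P(R | W)
print(round(p_w, 3), round(p_r_given_w, 3))  # 0.52 0.7
```

So observing wet grass raises the probability of rain from the prior 0.4 to 0.7.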
[Figure 3.4 diagram: Cloudy → Sprinkler and Rain → Wet grass; P(C) = 0.5; P(S | C) = 0.1, P(S | ~C) = 0.5; P(R | C) = 0.8, P(R | ~C) = 0.1; P(W | R,S) = 0.95, P(W | R,~S) = 0.90, P(W | ~R,S) = 0.90, P(W | ~R,~S) = 0.10]

Figure 3.4: If it is cloudy, it is likely that it will rain and we will not use the sprinkler.
[Figure 3.5 diagram: Cloudy → Sprinkler and Rain; Rain → Wet grass and Roof; P(C) = 0.5; P(S | C) = 0.1, P(S | ~C) = 0.5; P(R | C) = 0.8, P(R | ~C) = 0.1; P(F | R) = 0.1, P(F | ~R) = 0.7; P(W | R,S) = 0.95, P(W | R,~S) = 0.90, P(W | ~R,S) = 0.90, P(W | ~R,~S) = 0.10]

Figure 3.5: Rain not only makes the grass wet but also disturbs the cat who normally makes noise on the roof.
[Figure 3.6 diagram: C → x, with P(C) and p(x | C)]

Figure 3.6: Bayesian network for classification.
[Figure 3.7 diagram: C → x1, x2, …, xd, with prior P(C) and class-conditional densities p(x1 | C), p(x2 | C), …, p(xd | C)]

Figure 3.7: The naive Bayes’ classifier is a Bayesian network for classification assuming independent inputs.
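The independence assumption of figure 3.7 lets p(x | C) factor into a product of per-dimension densities. A minimal sketch with Gaussian class-conditionals (the Gaussian choice and the toy data are assumptions for illustration, not part of the figure):

```python
import math

# Minimal Gaussian naive Bayes: p(x | C) = prod_j p(x_j | C), each
# p(x_j | C) a univariate Gaussian estimated from the training set.
def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit(X, y):
    model = {}
    for c in set(y):
        Xc = [x for x, yc in zip(X, y) if yc == c]
        prior = len(Xc) / len(X)
        stats = []
        for j in range(len(X[0])):
            col = [x[j] for x in Xc]
            m = sum(col) / len(col)
            v = sum((a - m) ** 2 for a in col) / len(col) or 1e-9
            stats.append((m, v))
        model[c] = (prior, stats)
    return model

def predict(model, x):
    def score(c):  # log P(C) + sum_j log p(x_j | C)
        prior, stats = model[c]
        return math.log(prior) + sum(math.log(gaussian(xj, m, v))
                                     for xj, (m, v) in zip(x, stats))
    return max(model, key=score)

# Made-up two-class, two-dimensional toy data
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
y = [0, 0, 0, 1, 1, 1]
model = fit(X, y)
print(predict(model, [1.1, 2.1]), predict(model, [5.1, 6.1]))  # 0 1
```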
[Figure 3.8 diagram: x → choose class → utility U]

Figure 3.8: Influence diagram corresponding to classification. Depending on input x, a class is chosen that incurs a certain utility (risk).
Chapter 4:
Parametric Methods
[Figure 4.1 diagram: estimates di scattered around their mean E[d]; bias and variance relative to θ]

Figure 4.1: θ is the parameter to be estimated. di are several estimates (denoted by ‘×’) over different samples. Bias is the difference between the expected value of d and θ. Variance is how much the di are scattered around the expected value. We would like both to be small.
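Bias and variance can be estimated empirically exactly as the figure suggests: draw many samples, compute the estimate on each, and look at the mean and spread of the estimates. A sketch under assumed settings (θ, the sample size, and the shrinkage factor are all arbitrary choices for illustration):

```python
import random

# Empirical bias/variance for two estimators of a Gaussian mean θ:
# the sample average (unbiased) and a shrunk average (biased, lower variance).
random.seed(0)
theta, n, trials = 2.0, 10, 5000

def simulate(estimator):
    d = [estimator([random.gauss(theta, 1.0) for _ in range(n)])
         for _ in range(trials)]
    e_d = sum(d) / trials                      # empirical E[d]
    bias = e_d - theta
    var = sum((di - e_d) ** 2 for di in d) / trials
    return bias, var

mean_bias, mean_var = simulate(lambda s: sum(s) / len(s))
shrunk_bias, shrunk_var = simulate(lambda s: 0.5 * sum(s) / len(s))

print(abs(mean_bias) < 0.05)   # sample mean is (nearly) unbiased
print(shrunk_var < mean_var)   # shrinkage trades bias for variance
```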
[Figure 4.2 plots: top, likelihoods p(x | Ci) versus x; bottom, posteriors P(Ci | x) with equal priors]

Figure 4.2: Likelihood functions and the posteriors with equal priors for two classes when the input is one-dimensional. Variances are equal and the posteriors intersect at one point, which is the threshold of decision.
[Figure 4.3 plots: top, likelihoods p(x | Ci) versus x; bottom, posteriors P(Ci | x) with equal priors]

Figure 4.3: Likelihood functions and the posteriors with equal priors for two classes when the input is one-dimensional. Variances are unequal and the posteriors intersect at two points.
[Figure 4.4 plot: regression line E[R | x] = wx + w0, with the noise density p(r | x*) drawn around E[R | x*] at a point x*]

Figure 4.4: Regression assumes zero-mean Gaussian noise added to the model; here, the model is linear.
[Figure 4.5 plots: (a) function and data, (b) order 1, (c) order 3, (d) order 5]

Figure 4.5: (a) Function, f(x) = 2 sin(1.5x), and one noisy (N(0,1)) dataset sampled from the function. Five samples are taken, each containing twenty instances. (b), (c), (d) are five polynomial fits, gi(·), of order 1, 3, and 5. For each case, the dotted line is the average of the five fits, namely, g(·).
[Figure 4.6 plot: bias, variance, and error versus polynomial order 1 to 5]

Figure 4.6: In the same setting as that of figure 4.5, using one hundred models instead of five, bias, variance, and error for polynomials of order 1 to 5. Order 1 has the smallest variance. Order 5 has the smallest bias. As the order is increased, bias decreases but variance increases. Order 3 has the minimum error.
[Figure 4.7 plots: (a) data and fitted polynomials of order 1 to 8; (b) training and validation error versus polynomial order]

Figure 4.7: In the same setting as that of figure 4.5, training and validation sets (each containing 50 instances) are generated. (a) Training data and fitted polynomials of order from 1 to 8. (b) Training and validation errors as a function of the polynomial order. The “elbow” is at 3.
Chapter 5:
Multivariate Methods
[Figure 5.1 plot: bivariate normal density over x1 and x2]

Figure 5.1: Bivariate normal distribution.
[Figure 5.2 panels: bivariate normal isoprobability contours for Cov(x1,x2) = 0 with Var(x1) = Var(x2); Cov(x1,x2) = 0 with Var(x1) > Var(x2); Cov(x1,x2) > 0; and Cov(x1,x2) < 0]

[Figure 5.3 plots: class-conditional density p(x | C1) and posterior P(C1 | x) over x1 and x2]

Figure 5.3: Classes have different covariance matrices.
Figure 5.4: Covariances may be arbitrary but shared by both classes.
Figure 5.5: All classes have equal, diagonal covariance matrices, but variances are not equal.
Figure 5.6: All classes have equal, diagonal covariance matrices of equal variances on both dimensions.
Chapter 6:
Dimensionality Reduction
[Figure 6.1 plot: original axes x1, x2 and rotated principal axes z1, z2]

Figure 6.1: Principal components analysis centers the sample and then rotates the axes to line up with the directions of highest variance. If the variance on z2 is too small, it can be ignored and we have dimensionality reduction from two to one.
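The center-and-rotate picture of figure 6.1 can be sketched directly in two dimensions, where the covariance matrix is 2×2 and its eigenvalues have a closed form. The dataset below is a made-up elongated cloud, not data from the book:

```python
import math

# PCA sketch for 2-D data: center the sample, form the covariance
# matrix, and read off the variance along the principal axes z1, z2
# from its eigenvalues.
X = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1), (6.0, 5.9)]

n = len(X)
mx = sum(x for x, _ in X) / n
my = sum(y for _, y in X) / n
centered = [(x - mx, y - my) for x, y in X]

# 2x2 covariance matrix [[a, b], [b, c]]
a = sum(x * x for x, _ in centered) / n
b = sum(x * y for x, y in centered) / n
c = sum(y * y for _, y in centered) / n

# Eigenvalues of a symmetric 2x2 matrix in closed form
disc = math.sqrt(((a - c) / 2) ** 2 + b * b)
lam1 = (a + c) / 2 + disc   # variance along z1 (largest)
lam2 = (a + c) / 2 - disc   # variance along z2

print(lam1 > lam2, lam1 / (lam1 + lam2) > 0.99)  # True True
```

Since almost all of the variance lies on z1, dropping z2 loses very little, which is exactly the dimensionality reduction the caption describes.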
[Figure 6.2 plots: (a) scree graph, eigenvalues versus eigenvectors; (b) proportion of variance explained versus number of eigenvectors]

Figure 6.2: (a) Scree graph. (b) Proportion of variance explained is given for the Optdigits dataset from the UCI Repository. This is a handwritten digit dataset with ten classes and sixty-four dimensional inputs. The first twenty eigenvectors explain 90 percent of the variance.
[Figure 6.3 scatter plot: “Optdigits after PCA,” first eigenvector versus second eigenvector, points labeled by digit class 0–9]

Figure 6.3: Optdigits data plotted in the space of two principal components. Only the labels of one hundred data points are shown to minimize the ink-to-noise ratio.
[Figure 6.4 diagrams: in PCA, variables x1, …, xd generate new variables z1, …, zk through W; in FA, factors z1, …, zk generate the variables x1, …, xd through V]

Figure 6.4: Principal components analysis generates new variables that are linear combinations of the original input variables. In factor analysis, however, we posit that there are factors that, when linearly combined, generate the input variables.
[Figure 6.5 plot: factor space z1, z2 mapped to input space x1, x2]

Figure 6.5: Factors are independent unit normals that are stretched, rotated, and translated to make up the inputs.
[Figure 6.6 plot: MDS map placing the twelve cities in two dimensions]

Figure 6.6: Map of Europe drawn by MDS. Cities include Athens, Berlin, Dublin, Helsinki, Istanbul, Lisbon, London, Madrid, Moscow, Paris, Rome, and Zurich. Pairwise road travel distances between these cities are given as input, and MDS places them in two dimensions such that these distances are preserved as well as possible.
[Figure 6.7 plot: two classes with means m1, m2 and variances s1², s2², projected onto the direction w]

Figure 6.7: Two-dimensional, two-class data projected on w.
[Figure 6.8 scatter plot: “Optdigits after LDA,” data in the space of the first two LDA dimensions, points labeled by digit class 0–9]

Figure 6.8: Optdigits data plotted in the space of the first two dimensions found by LDA. Comparing this with figure 6.3, we see that LDA, as expected, leads to a better separation of classes than PCA. Even in this two-dimensional space (there are nine), we can discern separate clouds for different classes.
Chapter 7:
Clustering
[Figure 7.1 diagram: encoder finds the closest code word mi to x and sends the index i over the communication line; the decoder outputs x′ = mi]

Figure 7.1: Given x, the encoder sends the index of the closest code word and the decoder generates the code word with the received index as x′. The error is ‖x′ − x‖².
[Figure 7.2 plots: k-means on x1, x2 data — initial centers, and after 1, 2, and 3 iterations]

Figure 7.2: Evolution of k-means. Crosses indicate center positions. Data points are marked depending on the closest center.
Initialize mi, i = 1, . . . , k, for example, to k random xt
Repeat
    For all xt ∈ X
        bti ← 1 if ‖xt − mi‖ = minj ‖xt − mj‖, 0 otherwise
    For all mi, i = 1, . . . , k
        mi ← Σt bti xt / Σt bti
Until mi converge

Figure 7.3: k-means algorithm.
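The pseudocode of figure 7.3 can be rendered directly in Python. This sketch uses one-dimensional data so the distance is just an absolute value; the toy dataset is made up:

```python
import random

# k-means on 1-D data: assign each point to its closest center, then
# move each center to the mean of its assigned points, until convergence.
def kmeans(X, k, iters=100, seed=0):
    rng = random.Random(seed)
    m = rng.sample(X, k)                       # initialize to k random x^t
    for _ in range(iters):
        # assignment step: b_i^t = 1 for the closest center
        groups = [[] for _ in range(k)]
        for x in X:
            i = min(range(k), key=lambda j: abs(x - m[j]))
            groups[i].append(x)
        # update step: m_i <- mean of the points assigned to it
        new_m = [sum(g) / len(g) if g else m[i] for i, g in enumerate(groups)]
        if new_m == m:                         # until the m_i converge
            break
        m = new_m
    return sorted(m)

X = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]
centers = kmeans(X, 2)
print(centers)  # two centers, near 1.0 and 9.0
```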
[Figure 7.4 plot: “EM solution” over x1, x2]

Figure 7.4: Data points and the fitted Gaussians by EM, initialized by one k-means iteration of figure 7.2. Unlike in k-means, as can be seen, EM allows estimating the covariance matrices. The data points labeled by greater hi, the contours of the estimated Gaussian densities, and the separating curve of hi = 0.5 (dashed line) are shown.
[Figure 7.5 plots: points a–f in two dimensions, and the corresponding dendrogram with a cut level h]

Figure 7.5: A two-dimensional dataset and the dendrogram showing the result of single-link clustering. Note that leaves of the tree are ordered so that no branches cross. The tree is then intersected at a desired value of h to get the clusters.
Chapter 8:
Nonparametric Methods
[Figure 8.1 plots: histograms with h = 2, h = 1, and h = 0.5]

Figure 8.1: Histograms for various bin lengths. ‘×’ denote data points.
[Figure 8.2 plots: naive estimator with h = 2, h = 1, and h = 0.5]

Figure 8.2: Naive estimate for various bin lengths.
[Figure 8.3 plots: kernel estimator with h = 1, h = 0.5, and h = 0.25]

Figure 8.3: Kernel estimate for various bin lengths.
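The kernel estimator of figure 8.3 places a smooth bump on each data point: p(x) = (1/Nh) Σt K((x − xt)/h), with K the Gaussian kernel. A minimal sketch on a made-up sample:

```python
import math

# Gaussian kernel density estimate: average of kernels centered at the
# data points, with bandwidth h controlling the smoothness.
def kde(x, data, h):
    return sum(math.exp(-0.5 * ((x - xt) / h) ** 2) / math.sqrt(2 * math.pi)
               for xt in data) / (len(data) * h)

data = [1.0, 1.5, 2.0, 2.2, 5.0]   # toy sample
# density is higher near the cluster around 2 than near the lone point at 5
print(kde(2.0, data, h=0.5) > kde(5.0, data, h=0.5))  # True
```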
[Figure 8.4 plots: k-NN estimator with k = 5, k = 3, and k = 1]

Figure 8.4: k-nearest neighbor estimate for various k values.
[Figure 8.5 plot: axes x1 and x2; Voronoi tessellation with removable instances marked ‘*’]

Figure 8.5: Dotted lines are the Voronoi tessellation and the straight line is the class discriminant. In condensed nearest neighbor, those instances that do not participate in defining the discriminant (marked by ‘*’) can be removed without increasing the training error.
Z ← ∅
Repeat
    For all x ∈ X (in random order)
        Find x′ ∈ Z s.t. ‖x − x′‖ = minxj∈Z ‖x − xj‖
        If class(x) ≠ class(x′) add x to Z
Until Z does not change

Figure 8.6: Condensed nearest neighbor algorithm.
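A Python sketch of the algorithm in figure 8.6 on one-dimensional data. One small deviation from the pseudocode is assumed here: Z is seeded with the first instance so the closest-stored-instance lookup is never over an empty set; the toy data is made up:

```python
import random

# Condensed 1-NN: keep only instances misclassified by the current
# stored subset Z; repeat passes in random order until Z stabilizes.
def condense(X, y, seed=0):
    rng = random.Random(seed)
    idx = list(range(len(X)))
    Z = [0]                            # seed with the first instance
    changed = True
    while changed:
        changed = False
        rng.shuffle(idx)
        for i in idx:
            # find the stored instance closest to X[i]
            j = min(Z, key=lambda k: abs(X[i] - X[k]))
            if y[j] != y[i]:           # misclassified: store it
                Z.append(i)
                changed = True
    return Z

# Two well-separated classes: only boundary instances need storing
X = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1]
y = [0, 0, 0, 1, 1, 1]
Z = condense(X, y)
print(len(Z) < len(X))  # True: the stored subset is smaller
```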
[Figure 8.7 plots: regressogram smoother with h = 6, h = 3, and h = 1]

Figure 8.7: Regressograms for various bin lengths. ‘×’ denote data points.
[Figure 8.8 plots: running mean smoother with h = 6, h = 3, and h = 1]

Figure 8.8: Running mean smooth for various bin lengths.
[Figure 8.9 plots: kernel smoother with h = 1, h = 0.5, and h = 0.25]

Figure 8.9: Kernel smooth for various bin lengths.
[Figure 8.10 plots: running line smoother with h = 6, h = 3, and h = 1]

Figure 8.10: Running line smooth for various bin lengths.
[Figure 8.11 plots: regressogram with linear fits in bins, h = 6, h = 3, and h = 1]

Figure 8.11: Regressograms with linear fits in bins for various bin lengths.
Chapter 9:
Decision Trees
[Figure 9.1 plots: axes x1, x2 with split thresholds w10, w20, and the corresponding tree with decision nodes x1 > w10 and x2 > w20 leading to leaves labeled C1 and C2]

Figure 9.1: Example of a dataset and the corresponding decision tree. Oval nodes are the decision nodes and rectangles are leaf nodes. The univariate decision node splits along one axis, and successive splits are orthogonal to each other. After the first split, {x | x1 < w10} is pure and is not split further.
[Figure 9.2 plot: entropy = −p log2(p) − (1 − p) log2(1 − p) versus p]

Figure 9.2: Entropy function for a two-class problem.
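The curve of figure 9.2 is easy to compute directly; the only subtlety is the p log p term at the endpoints, whose limit is 0:

```python
import math

# Two-class entropy: 0 bits for a pure node, maximal (1 bit) at p = 0.5.
def entropy(p):
    if p in (0.0, 1.0):            # lim p*log(p) = 0 as p -> 0
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(entropy(0.5), entropy(0.0), round(entropy(0.9), 3))  # 1.0 0.0 0.469
```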
GenerateTree(X)
    If NodeEntropy(X) < θI /* eq. 9.3 */
        Create leaf labelled by majority class in X
        Return
    i ← SplitAttribute(X)
    For each branch of xi
        Find Xi falling in branch
        GenerateTree(Xi)

SplitAttribute(X)
    MinEnt ← MAX
    For all attributes i = 1, . . . , d
        If xi is discrete with n values
            Split X into X1, . . . , Xn by xi
            e ← SplitEntropy(X1, . . . , Xn) /* eq. 9.8 */
            If e < MinEnt MinEnt ← e; bestf ← i
    Return bestf
[Figure 9.4 plots: regression tree smooths with θr = 0.5, θr = 0.2, and θr = 0.05]

Figure 9.4: Regression tree smooths for various values of θr. The corresponding trees are given in figure 9.5.
[Figure 9.5 diagrams: three regression trees with split nodes such as x < 3.16, x < 1.36, x < 5.96, x < 6.91, and a fitted value at each leaf]

Figure 9.5: Regression trees implementing the smooths of figure 9.4 for various values of θr.
[Figure 9.6 diagram: decision tree on x1: age, x2: years in job, x3: gender, x4: job type; x1 > 38.5 and x2 > 2.5 lead to leaves 0.8 and 0.6, and x4 ∈ {'A', 'B', 'C'} leads to leaves 0.4, 0.3, 0.2]

Figure 9.6: Example of a (hypothetical) decision tree. Each path from the root to a leaf can be written down as a conjunctive rule, composed of conditions defined by the decision nodes on the path.
Ripper(Pos, Neg, k)
    RuleSet ← LearnRuleSet(Pos, Neg)
    For k times
        RuleSet ← OptimizeRuleSet(RuleSet, Pos, Neg)

LearnRuleSet(Pos, Neg)
    RuleSet ← ∅
    DL ← DescLen(RuleSet, Pos, Neg)
    Repeat
        Rule ← LearnRule(Pos, Neg)
        Add Rule to RuleSet
        DL′ ← DescLen(RuleSet, Pos, Neg)
        If DL′ > DL + 64
            PruneRuleSet(RuleSet, Pos, Neg)
            Return RuleSet
        If DL′ < DL
            DL ← DL′
    Until Pos = ∅
    PruneRuleSet(RuleSet, Pos, Neg)
    Return RuleSet

PruneRuleSet(RuleSet, Pos, Neg)
    For each Rule ∈ RuleSet in reverse order
        DL ← DescLen(RuleSet, Pos, Neg)
        DL′ ← DescLen(RuleSet − Rule, Pos, Neg)
        If DL′ < DL Delete Rule from RuleSet
    Return RuleSet

OptimizeRuleSet(RuleSet, Pos, Neg)
    For each Rule ∈ RuleSet
        DL0 ← DescLen(RuleSet, Pos, Neg)
        DL1 ← DescLen(RuleSet − Rule + ReplaceRule(RuleSet, Pos, Neg), Pos, Neg)
        DL2 ← DescLen(RuleSet − Rule + ReviseRule(RuleSet, Rule, Pos, Neg), Pos, Neg)
        If DL1 = min(DL0, DL1, DL2)
            Delete Rule from RuleSet and add ReplaceRule(RuleSet, Pos, Neg)
        Else If DL2 = min(DL0, DL1, DL2)
            Delete Rule from RuleSet and add ReviseRule(RuleSet, Rule, Pos, Neg)
    Return RuleSet

Figure 9.7: Ripper algorithm for learning rules.
[Figure 9.8 plot: axes x1, x2; oblique boundary w11 x1 + w12 x2 + w10 = 0 separating C1 and C2, and the corresponding one-node tree with leaves C1 and C2]

Figure 9.8: Example of a linear multivariate decision tree. The linear multivariate node can place an arbitrary hyperplane and is thus more general, whereas the univariate node is restricted to axis-aligned splits.
Chapter 10:
Linear Discrimination
[Figure 10.1 plot: axes x1, x2; line g(x) = w1 x1 + w2 x2 + w0 = 0 with g(x) > 0 on the C1 side and g(x) < 0 on the C2 side]

Figure 10.1: In the two-dimensional case, the linear discriminant is a line that separates the examples from two classes.
[Figure 10.2 plot: axes x1, x2; line g(x) = 0 with normal w; the origin is at distance |w0|/‖w‖ and a point x at distance |g(x)|/‖w‖ from the line]

Figure 10.2: The geometric interpretation of the linear discriminant.
[Figure 10.3 plot: axes x1, x2; hyperplanes H1, H2, H3 separating classes C1, C2, C3]

Figure 10.3: In linear classification, each hyperplane Hi separates the examples of Ci from the examples of all other classes. Thus for it to work, the classes should be linearly separable. Dotted lines are the induced boundaries of the linear classifier.
[Figure 10.4 plot: axes x1, x2; pairwise hyperplanes H12, H23, H31 for classes C1, C2, C3]

Figure 10.4: In pairwise linear separation, there is a separate hyperplane for each pair of classes. For an input to be assigned to C1, it should be on the positive side of H12 and H13 (which is the negative side of H31); we do not care for the value of H23. In this case, C1 is not linearly separable from the other classes but is pairwise linearly separable.
[Figure 10.5 plot: sigmoid curve rising from 0 to 1]

Figure 10.5: The logistic, or sigmoid, function.
For j = 0, . . . , d
    wj ← rand(−0.01, 0.01)
Repeat
    For j = 0, . . . , d
        ∆wj ← 0
    For t = 1, . . . , N
        o ← 0
        For j = 0, . . . , d
            o ← o + wj xtj
        y ← sigmoid(o)
        For j = 0, . . . , d
            ∆wj ← ∆wj + (rt − y) xtj
    For j = 0, . . . , d
        wj ← wj + η ∆wj
Until convergence

Figure 10.6: Logistic discrimination algorithm implementing gradient descent for the single output case with two classes. For w0, we assume that there is an extra input x0, which is always +1: xt0 ≡ +1, ∀t.
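The pseudocode of figure 10.6 translates almost line for line into Python. The toy dataset and the learning-rate/epoch settings below are assumptions for illustration; the clamp inside the sigmoid is a practical guard against overflow, not part of the pseudocode:

```python
import math

# Batch gradient descent for two-class logistic discrimination,
# with x0 = +1 prepended as the bias input.
def sigmoid(o):
    o = max(-60.0, min(60.0, o))     # avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-o))

def train(X, r, eta=0.5, epochs=1000):
    X = [[1.0] + x for x in X]       # x_0^t = +1 for all t
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        dw = [0.0] * len(w)
        for xt, rt in zip(X, r):
            y = sigmoid(sum(wj * xj for wj, xj in zip(w, xt)))
            for j, xj in enumerate(xt):
                dw[j] += (rt - y) * xj           # accumulate gradient
        w = [wj + eta * dwj for wj, dwj in zip(w, dw)]
    return w

# Linearly separable toy problem on one input
X = [[0.0], [1.0], [2.0], [3.0]]
r = [0, 0, 1, 1]
w = train(X, r)
post = lambda x: sigmoid(w[0] + w[1] * x)        # P(C1 | x)
print(post(0.0) < 0.5 < post(3.0))  # True
```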
[Figure 10.7 plot: x versus P(C1 | x), with sigmoid fits after 10, 100, and 1,000 iterations]

Figure 10.7: For a univariate two-class problem (shown with ‘◦’ and ‘×’), the evolution of the line wx + w0 and the sigmoid output after 10, 100, and 1,000 iterations over the sample.
For i = 1, . . . , K, For j = 0, . . . , d, wij ← rand(−0.01, 0.01)
Repeat
    For i = 1, . . . , K, For j = 0, . . . , d, ∆wij ← 0
    For t = 1, . . . , N
        For i = 1, . . . , K
            oi ← 0
            For j = 0, . . . , d
                oi ← oi + wij xtj
        For i = 1, . . . , K
            yi ← exp(oi) / Σk exp(ok)
        For i = 1, . . . , K
            For j = 0, . . . , d
                ∆wij ← ∆wij + (rti − yi) xtj
    For i = 1, . . . , K
        For j = 0, . . . , d
            wij ← wij + η ∆wij
Until convergence

Figure 10.8: Logistic discrimination algorithm implementing gradient descent for the case with K > 2 classes. For generality, we take xt0 ≡ 1, ∀t.
[Figure 10.9 plot: axes x1, x2; three-class solution with thin lines gi(x) = 0 and a thick decision boundary]

Figure 10.9: For a two-dimensional problem with three classes, the solution found by logistic discrimination. Thin lines are where gi(x) = 0, and the thick line is the boundary induced by the linear classifier choosing the maximum.
[Figure 10.10 plots: top, linear discriminants gi(x1, x2); bottom, posteriors P(Ci | x1, x2)]

Figure 10.10: For the same example as in figure 10.9, the linear discriminants (top), and the posterior probabilities after the softmax (bottom).
[Figure 10.11 plot: axes x1, x2; separating hyperplane with margins g(x) = +1 and g(x) = −1, each at distance 1/‖w‖]

Figure 10.11: On both sides of the optimal separating hyperplane, the instances are at least 1/‖w‖ away and the total margin is 2/‖w‖.
[Figure 10.12 plot: axes x1, x2; margins g(x) = +1 and g(x) = −1, with instances marked (1), (2), (3)]

Figure 10.12: In classifying an instance, there are three possible cases: In (1), ξ = 0; it is on the right side and sufficiently away. In (2), ξ = 1 + g(x) > 1; it is on the wrong side. In (3), ξ = 1 − g(x), 0 < ξ < 1; it is on the right side but is in the margin and not sufficiently away.
[Figure 10.13 plot: quadratic and ε-sensitive error curves]

Figure 10.13: Quadratic and ε-sensitive error functions. We see that the ε-sensitive error function is not affected by small errors and also is affected less by large errors, and thus is more robust to outliers.
Chapter 11:
Multilayer Perceptrons
[Figure 11.1 diagram: inputs x0 = +1, x1, …, xd connected through weights w0, w1, …, wd to the output y]

Figure 11.1: Simple perceptron. xj, j = 1, . . . , d are the input units. x0 is the bias unit that always has the value 1. y is the output unit. wj is the weight of the directed connection from input xj to the output.
[Figure 11.2 diagram: inputs x0 = +1, x1, …, xd connected through weight vectors w1, …, wK to outputs y1, …, yK]

Figure 11.2: K parallel perceptrons. xj, j = 0, . . . , d are the inputs and yi, i = 1, . . . , K are the outputs. wij is the weight of the connection from input xj to output yi. Each output is a weighted sum of the inputs. When used for a K-class classification problem, there is a postprocessing to choose the maximum, or softmax if we need the posterior probabilities.
For i = 1, . . . , K
    For j = 0, . . . , d
        wij ← rand(−0.01, 0.01)
Repeat
    For all (xt, rt) ∈ X in random order
        For i = 1, . . . , K
            oi ← 0
            For j = 0, . . . , d
                oi ← oi + wij xtj
        For i = 1, . . . , K
            yi ← exp(oi) / Σk exp(ok)
        For i = 1, . . . , K
            For j = 0, . . . , d
                wij ← wij + η (rti − yi) xtj
Until convergence

Figure 11.3: Perceptron training algorithm implementing stochastic online gradient descent for the case with K > 2 classes. This is the online version of the algorithm given in figure 10.8.
[Figure 11.4 diagram: perceptron with inputs x0 = +1, x1, x2 and weights −1.5, +1, +1; plot shows the line x1 + x2 − 1.5 = 0 separating (1,1) from (0,0), (0,1), (1,0)]

Figure 11.4: The perceptron that implements AND and its geometric interpretation.
[Figure 11.5 plot: axes x1, x2 with the four XOR points]

Figure 11.5: The XOR problem is not linearly separable. We cannot draw a line where the empty circles are on one side and the filled circles on the other side.
[Figure 11.6 diagram: inputs x0 = +1, xj, …, xd; hidden units z0 = +1, zh with first-layer weights whj; outputs yi with second-layer weights vih]

Figure 11.6: The structure of a multilayer perceptron.
[Figure 11.7 diagram: MLP with inputs x0 = +1, x1, x2, hidden units z0 = +1, z1, z2, and output y; weight values of ±1 and −0.5 shown on the connections; plot shows the hidden-unit half-planes combining to give the XOR regions]

Figure 11.7: The multilayer perceptron that solves the XOR problem. The hidden units and the output have the threshold activation function with threshold at 0.
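A threshold-unit MLP of this shape can be checked directly. The particular weights below are one standard XOR solution, not necessarily the exact ones drawn in the figure: z1 fires for x1 AND NOT x2, z2 for x2 AND NOT x1, and the output fires if either hidden unit does:

```python
# Two threshold hidden units plus a threshold output implement XOR.
def threshold(a):
    return 1 if a > 0 else 0

def xor(x1, x2):
    z1 = threshold(-0.5 + x1 - x2)      # x1 and not x2
    z2 = threshold(-0.5 - x1 + x2)      # x2 and not x1
    return threshold(-0.5 + z1 + z2)    # z1 or z2

print([xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```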
[Figure 11.8 plot: training points and MLP fits after 100, 200, and 300 epochs]

Figure 11.8: Sample training data shown as ‘+’, where xt ∼ U(−0.5, 0.5), and yt = f(xt) + N(0, 0.1). f(x) = sin(6x) is shown by a dashed line. The evolution of the fit of an MLP with two hidden units after 100, 200, and 300 epochs is drawn.
[Figure 11.9 plot: mean square error on training and validation sets versus training epochs]

Figure 11.9: The mean square error on training and validation sets as a function of training epochs.
[Figure 11.10 plots: three panels over the input range −0.5 to 0.5]

Figure 11.10: (a) The hyperplanes of the hidden unit weights on the first layer, (b) hidden unit outputs, and (c) hidden unit outputs multiplied by the weights on the second layer. Two sigmoid hidden units slightly displaced, one multiplied by a negative weight, when added, implement a bump.
Initialize all vih and whj to rand(−0.01, 0.01)
Repeat
    For all (xt, rt) ∈ X in random order
        For h = 1, . . . , H
            zh ← sigmoid(wTh xt)
        For i = 1, . . . , K
            yi ← vTi z
        For i = 1, . . . , K
            ∆vi ← η (rti − yti) z
        For h = 1, . . . , H
            ∆wh ← η (Σi (rti − yti) vih) zh (1 − zh) xt
        For i = 1, . . . , K
            vi ← vi + ∆vi
        For h = 1, . . . , H
            wh ← wh + ∆wh
Until convergence

Figure 11.11: Backpropagation algorithm for training a multilayer perceptron for regression with K outputs. This code can easily be adapted for two-class classification (by setting a single sigmoid output) and to K > 2 classification (by using softmax outputs).
[Figure 11.12 plot: mean square error on training and validation sets versus number of hidden units]

Figure 11.12: As complexity increases, training error is fixed but the validation error starts to increase and the network starts to overfit.
[Figure 11.13 plot: mean square error on training and validation sets versus training epochs]

Figure 11.13: As training continues, the validation error starts to increase and the network starts to overfit.
Figure 11.14: A structured MLP. Each unit is connected to a local group of units below it and checks for a particular feature—for example, edge, corner, and so forth—in vision. Only one hidden unit is shown for each region; typically there are many to check for different local features.
Figure 11.15: In weight sharing, different units have connections to different inputs but share the same weight value (denoted by line type). Only one set of units is shown; there should be multiple sets of units, each checking for different features.
A A
A
A
Figure 11.16: The identity of the object does not
change when it is translated, rotated, or scaled. Note
that this may not always be true, or may be true up
to a point: ‘b’ and ‘q’ are rotated versions of each
other. These are hints that can be incorporated into
the learning process to make learning easier. From:
E. Alpaydın. 2004. Introduction to Machine
Learning. c©The MIT Press.
106
Dynamic Node Creation Cascade Correlation
Figure 11.17: Two examples of constructive
algorithms: Dynamic node creation adds a unit to an
existing layer. Cascade correlation adds each unit as
a new hidden layer connected to all the previous layers.
Dashed lines denote the newly added
unit/connections. Bias units/weights are omitted for
clarity. From: E. Alpaydın. 2004. Introduction to
Machine Learning. c©The MIT Press.
107
[Plot: "Hidden Representation": optdigits instances labeled by digit class (0 to 9), plotted in the space of hidden unit 1 vs. hidden unit 2]
Figure 11.18: Optdigits data plotted in the space of
the two hidden units of an MLP trained for
classification. From: E. Alpaydın. 2004.
Introduction to Machine Learning. c©The MIT Press.
108
[Diagram: two autoassociator networks, linear (left) and nonlinear (right), with inputs x0 = +1, x1, ..., xd, hidden units z_H (encoder), and outputs y1, ..., yd (decoder)]
Figure 11.19: In the autoassociator, there are as
many outputs as there are inputs and the desired
outputs are the inputs. When the number of hidden
units is less than the number of inputs, the MLP is
trained to find the best coding of the inputs on the
hidden units, performing dimensionality reduction.
On the left, the first layer acts as an encoder and the
second layer acts as the decoder. On the right, if the
encoder and decoder are multilayer perceptrons with
sigmoid hidden units, the network performs nonlinear
dimensionality reduction. From: E. Alpaydın. 2004.
Introduction to Machine Learning. c©The MIT Press.
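The linear case on the left of figure 11.19 can be sketched directly; `train_autoassociator` and its parameters are illustrative names, and this is a minimal gradient-descent sketch, not the book's procedure.

```python
import numpy as np

def train_autoassociator(X, H=2, eta=0.02, epochs=10000, seed=0):
    """Linear autoassociator (a minimal sketch): an encoder W and a
    decoder V are trained by gradient descent to reconstruct the
    inputs from H < d hidden units, minimizing ||X - X W^T V^T||^2."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W = rng.normal(0, 0.1, (H, d))    # encoder weights
    V = rng.normal(0, 0.1, (d, H))    # decoder weights
    for _ in range(epochs):
        Z = X @ W.T                   # hidden codes
        Y = Z @ V.T                   # reconstructions
        E = X - Y                     # reconstruction errors
        V += eta * E.T @ Z / N        # gradient step on the decoder
        W += eta * (E @ V).T @ X / N  # gradient step on the encoder
    return W, V
```

With H smaller than d, the hidden codes Z span (up to rotation) the same subspace as the leading principal components; adding sigmoid hidden layers, as on the right of the figure, makes the reduction nonlinear.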
109
[Diagram: MLP with weights w_hj, v_ih fed, through delay units, the window x^t, x^{t-1}, ..., x^{t-T}]
Figure 11.20: A time delay neural network. Inputs in
a time window of length T are delayed in time until
we can feed all T inputs as the input vector to the
MLP. From: E. Alpaydın. 2004. Introduction to
Machine Learning. c©The MIT Press.
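The delay-line idea of figure 11.20 amounts to windowing the sequence so that an ordinary MLP can consume fixed-length input vectors; `time_windows` is an illustrative helper name.

```python
def time_windows(x, T):
    """Turn a sequence into overlapping windows of length T, so a
    plain MLP can be fed the delayed inputs x_{t-T+1}, ..., x_t
    (the time-delay idea of figure 11.20)."""
    return [x[t - T + 1:t + 1] for t in range(T - 1, len(x))]
```

Each window then becomes one training instance for the network.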
110
(a) (b) (c)
Figure 11.21: Examples of MLP with partial
recurrency. Recurrent connections are shown with
dashed lines: (a) self-connections in the hidden layer,
(b) self-connections in the output layer, and (c)
connections from the output to the hidden layer.
Combinations of these are also possible. From:
E. Alpaydın. 2004. Introduction to Machine
Learning. c©The MIT Press.
111
[Diagram: (a) recurrent network with input weights W, recurrent weights R, and output weights V; (b) the same network unfolded over inputs x0, ..., x3 and hidden states h0, ..., h3, producing output y]
Figure 11.22: Backpropagation through time: (a)
recurrent network and (b) its equivalent unfolded
network that behaves identically in four steps. From:
E. Alpaydın. 2004. Introduction to Machine
Learning. c©The MIT Press.
112
Chapter 12:
Local Models
113
Figure 12.1: Shaded circles are the centers and the
empty circle is the input instance. The online version
of k-means moves the closest center along the
direction of (x−mi) by a factor specified by η. From: E. Alpaydın. 2004. Introduction to Machine
Learning. c©The MIT Press.
114
Initialize m_i, i = 1, ..., k, for example, to k random x^t
Repeat
    For all x^t ∈ X in random order
        i ← arg min_j ‖x^t − m_j‖
        m_i ← m_i + η (x^t − m_i)
Until m_i converge
Figure 12.2: Online k-means algorithm. The batch
version is given in figure 7.3. From: E. Alpaydın.
2004. Introduction to Machine Learning. ©The MIT Press.
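A NumPy sketch of the online update in figure 12.2 (the function and parameter names are illustrative):

```python
import numpy as np

def online_kmeans(X, k, eta=0.1, epochs=10, seed=0):
    """Online k-means of figure 12.2: for each instance, move the
    closest center toward it by a factor eta."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), k, replace=False)].astype(float)  # init to k random x^t
    for _ in range(epochs):
        for t in rng.permutation(len(X)):
            i = int(np.argmin(np.linalg.norm(X[t] - m, axis=1)))  # winner
            m[i] += eta * (X[t] - m[i])                           # move toward x^t
    return m
```

In practice η is usually decayed over time so the centers settle down.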
115
Figure 12.3: The winner-take-all competitive neural
network, which is a network of k perceptrons with
recurrent connections at the output. Dashed lines
are recurrent connections, of which the ones that end
with an arrow are excitatory and the ones that end
with a circle are inhibitory. From: E. Alpaydın. 2004.
Introduction to Machine Learning. c©The MIT Press.
116
Figure 12.4: The distance from xa to the closest
center is less than the vigilance value ρ and the
center is updated as in online k-means. However, xb
is not close enough to any of the centers and a new
group should be created at that position. From:
E. Alpaydın. 2004. Introduction to Machine
Learning. c©The MIT Press.
117
Figure 12.5: In the SOM, not only the closest unit
but also its neighbors, in terms of indices, are moved
toward the input. Here, neighborhood is 1; mi and
its 1-nearest neighbors are updated. Note here that
mi+1 is far from mi, but as it is updated with mi,
and as mi will be updated when mi+1 is the winner,
they will become neighbors in the input space as well.
From: E. Alpaydın. 2004. Introduction to Machine
Learning. c©The MIT Press.
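One step of the SOM update in figure 12.5 can be sketched as follows; here the winner and its index neighbors within `radius` all move by the same factor η, as in the figure (real SOMs often scale the step down with index distance, a refinement omitted in this sketch).

```python
import numpy as np

def som_update(m, x, eta=0.1, radius=1):
    """One SOM step (a sketch of figure 12.5): find the winning unit
    and move it, together with its index neighbors within `radius`,
    toward the input x."""
    winner = int(np.argmin(np.linalg.norm(x - m, axis=1)))
    for j in range(max(0, winner - radius), min(len(m), winner + radius + 1)):
        m[j] += eta * (x - m[j])
    return m
```

Repeating this over many inputs is what pulls index neighbors together in the input space, as the caption describes.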
118
Figure 12.6: The one-dimensional form of the
bell-shaped function used in the radial basis function
network. This one has m = 0 and s = 1. It is like a
Gaussian but it is not a density; it does not integrate
to 1. It is nonzero between (m − 3s, m + 3s), but a more conservative interval is (m − 2s, m + 2s). From: E. Alpaydın. 2004. Introduction to Machine
Learning. c©The MIT Press.
119
[Diagram: local representation in the space of (p1, p2, p3): xa: (1.0, 0.0, 0.0), xb: (0.0, 0.0, 1.0), xc: (1.0, 1.0, 0.0); distributed representation in the space of (h1, h2): xa: (1.0, 1.0), xb: (0.0, 1.0), xc: (1.0, 0.0)]
Figure 12.7: The difference between local and
distributed representations. The values are hard
0/1 values. One can use soft values in (0, 1) and get
a more informative encoding. In the local
representation, this is done by the Gaussian RBF
that uses the distance to the center, mi, and in the
distributed representation, this is done by the sigmoid
that uses the distance to the hyperplane, wi. From:
E. Alpaydın. 2004. Introduction to Machine
Learning. c©The MIT Press.
120
Figure 12.8: The RBF network where ph are the
hidden units using the bell-shaped activation
function. mh, sh are the first-layer parameters, and
wi are the second-layer weights. From: E. Alpaydın.
2004. Introduction to Machine Learning. ©The MIT Press.
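A sketch of the RBF network's computation: with the centers m_h and spreads s_h fixed (for example, found by k-means), the output is linear in the hidden units p_h, so the second-layer weights can be fit by least squares. All function names here are illustrative.

```python
import numpy as np

def rbf_hidden(X, M, s):
    """Bell-shaped hidden units p_h = exp(-||x - m_h||^2 / (2 s_h^2)),
    with a bias unit p_0 = +1 prepended, as in figure 12.8."""
    D2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)  # squared distances
    P = np.exp(-D2 / (2 * s ** 2))
    return np.hstack([np.ones((len(X), 1)), P])

def fit_rbf_weights(X, R, M, s):
    """Least-squares fit of the second-layer weights w_i."""
    W, *_ = np.linalg.lstsq(rbf_hidden(X, M, s), R, rcond=None)
    return W

def rbf_predict(X, M, s, W):
    return rbf_hidden(X, M, s) @ W
```

Training the centers and spreads by gradient descent as well is also possible; the least-squares shortcut above only works because the second layer is linear.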
121
Figure 12.9: (-) Before and (- -) after normalization
for three Gaussians whose centers are denoted by ‘*’.
Note how the nonzero region of a unit depends also
on the positions of other units. If the spreads are
small, normalization implements a harder split; with
large spreads, units overlap more. From:
E. Alpaydın. 2004. Introduction to Machine
Learning. c©The MIT Press.
122
Figure 12.10: The mixture of experts can be seen as
an RBF network where the second-layer weights are
outputs of linear models. Only one linear model is
shown for clarity. From: E. Alpaydın. 2004.
Introduction to Machine Learning. c©The MIT Press.
123
Figure 12.11: The mixture of experts can be seen as
a model for combining multiple models. wh are the
models and the gating network is another model
determining the weight of each model, as given by gh.
Viewed in this way, neither the experts nor the gating
are restricted to be linear. From: E. Alpaydın. 2004.
Introduction to Machine Learning. c©The MIT Press.
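The combination rule of figures 12.10 and 12.11 can be sketched for linear experts with a softmax gating network; names and parameters here are illustrative, and this shows only the forward pass, not training.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mixture_of_experts(x, Ws, M):
    """Output of a mixture of linear experts (a sketch): expert h
    predicts w_h^T x, the gating network softmaxes m_h^T x into
    weights g_h, and the overall output is sum_h g_h (w_h^T x)."""
    xb = np.append(x, 1.0)            # add bias input
    expert_out = Ws @ xb              # (H,) expert predictions
    g = softmax(M @ xb)               # (H,) gating weights, sum to 1
    return float(g @ expert_out), g
```

Because the gating weights depend on x, each expert ends up responsible for its own region of the input space.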
124
Chapter 13:
Hidden Markov Models
125
Figure 13.1: Example of a Markov model with three
states, shown as a stochastic automaton. πi is the probability
that the system starts in state Si, and aij is the
probability that the system moves from state Si to
state Sj . From: E. Alpaydın. 2004. Introduction to
Machine Learning. c©The MIT Press.
126
Figure 13.2: An HMM unfolded in time as a lattice
(or trellis) showing all the possible trajectories. One
path, shown in thicker lines, is the actual (unknown)
state trajectory that generated the observation
sequence. From: E. Alpaydın. 2004. Introduction to
Machine Learning. c©The MIT Press.
127
[Diagram: (a) forward computation of α over states i at time t feeding state j at time t+1; (b) backward computation of β over states j at time t+1 feeding state i at time t]
Figure 13.3: Forward-backward procedure: (a)
computation of αt(j) and (b) computation of βt(i).
From: E. Alpaydın. 2004. Introduction to Machine
Learning. c©The MIT Press.
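The forward pass of figure 13.3(a) can be sketched in a few lines; the recursion is α_{t+1}(j) = [Σ_i α_t(i) a_ij] b_j(O_{t+1}), and the names below are illustrative.

```python
import numpy as np

def forward(pi, A, B, O):
    """Forward variables alpha_t(j) = P(O_1..O_t, q_t = S_j) for an
    HMM with initial probabilities pi, transitions A[i][j] = a_ij,
    and discrete emissions B[j][v] = b_j(v); O is the observation
    sequence. P(O) is alpha[-1].sum()."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                       # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]   # recursion over the trellis
    return alpha
```

The backward pass of panel (b) is the mirror image, recursing from t = T down to 1.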
128
Figure 13.4: Computation of arc probabilities, ξt(i, j).
From: E. Alpaydın. 2004. Introduction to Machine
Learning. c©The MIT Press.
129
Figure 13.5: Example of a left-to-right HMM. From:
E. Alpaydın. 2004. Introduction to Machine
Learning. c©The MIT Press.
130
Chapter 14:
Assessing and Comparing
Classification Algorithms
131
[Plot: ROC curve, hit rate |TP|/(|TP|+|FN|) vs. false alarm rate |FP|/(|FP|+|TN|)]
Figure 14.1: Typical ROC curve. Each classifier has a
parameter, for example, a threshold, which allows us
to move over this curve, and we decide on a point,
based on the relative importance of hits versus false
alarms, namely, true positives and false positives.
From: E. Alpaydın. 2004. Introduction to Machine
Learning. c©The MIT Press.
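Sweeping the threshold mentioned in the caption traces out the curve; a minimal sketch (illustrative name), assuming each instance has a score and a 0/1 label:

```python
def roc_points(scores, labels):
    """Sweep a decision threshold over the scores and collect one
    (false alarm rate, hit rate) pair per threshold, i.e. one point
    on the ROC curve of figure 14.1."""
    P = sum(labels)                 # number of positives
    N = len(labels) - P             # number of negatives
    pts = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        pts.append((fp / N, tp / P))
    return pts
```

Choosing an operating point on this curve is then the cost trade-off the caption describes.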
132
[Plot: unit normal density Z = N(0, 1), p(x) vs. x, with 2.5% in each tail]
Figure 14.2: 95 percent of the unit normal
distribution lies between −1.96 and 1.96. From: E. Alpaydın. 2004. Introduction to Machine
Learning. c©The MIT Press.
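The 1.96 of figure 14.2 is what turns an error estimate into a two-sided 95 percent interval; a one-line helper (illustrative name):

```python
def normal_interval(mean, stderr, z=1.96):
    """Two-sided interval mean ± z·stderr. With z = 1.96, 2.5 percent
    of the unit normal lies in each tail (figure 14.2); z = 1.64
    gives the one-sided 95 percent bound of figure 14.3."""
    return mean - z * stderr, mean + z * stderr
```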
133
[Plot: unit normal density Z = N(0, 1), p(x) vs. x, with 5% in the upper tail]
Figure 14.3: 95 percent of the unit normal
distribution lies below 1.64. From: E. Alpaydın.
2004. Introduction to Machine Learning. ©The MIT Press.
134
Chapter 15:
Combining Multiple Learners
135
Figure 15.1: In voting, the combiner function f(·) is a weighted sum. dj are the multiple learners, and wj
are the weights of their votes. y is the overall output.
In the case of multiple outputs, for example,
classification, the learners have multiple outputs dji
whose weighted sum gives yi. From: E. Alpaydın.
2004. Introduction to Machine Learning. ©The MIT Press.
136
Training:
    For all (x^t, r^t) ∈ X, initialize p_1^t = 1/N
    For all base-learners j = 1, ..., L
        Randomly draw X_j from X with probabilities p_j^t
        Train d_j using X_j
        For each (x^t, r^t), calculate y_j^t ← d_j(x^t)
        Calculate error rate: ε_j ← Σ_t p_j^t · 1(y_j^t ≠ r^t)
        If ε_j > 1/2, then L ← j − 1; stop
        β_j ← ε_j / (1 − ε_j)
        For each (x^t, r^t), decrease probabilities if correct:
            If y_j^t = r^t then p_{j+1}^t ← β_j p_j^t, else p_{j+1}^t ← p_j^t
        Normalize probabilities:
            Z_j ← Σ_t p_{j+1}^t ;  p_{j+1}^t ← p_{j+1}^t / Z_j
Testing:
    Given x, calculate d_j(x), j = 1, ..., L
    Calculate class outputs, i = 1, ..., K:
        y_i = Σ_{j=1}^{L} (log(1/β_j)) d_{ji}(x)
Figure 15.2: AdaBoost algorithm. From:
E. Alpaydın. 2004. Introduction to Machine
Learning. c©The MIT Press.
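The loop of figure 15.2 can be sketched as follows. `make_learner(Xj, rj)` is an assumed callback that fits a base learner on the resampled data and returns a prediction function; the epsilon-zero guard is an implementation detail added here, not part of the figure.

```python
import numpy as np

def adaboost_train(X, r, make_learner, L=10, seed=0):
    """AdaBoost training loop of figure 15.2 (a sketch)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    p = np.full(N, 1.0 / N)                    # p_1^t = 1/N
    learners, betas = [], []
    for _ in range(L):
        idx = rng.choice(N, N, p=p)            # draw X_j with probabilities p
        d = make_learner(X[idx], r[idx])
        y = d(X)
        eps = p[y != r].sum()                  # weighted error rate
        if eps > 0.5:
            break                              # stop, as in the figure
        beta = eps / (1 - eps) if eps > 0 else 1e-10
        p = np.where(y == r, beta * p, p)      # decrease if correct
        p /= p.sum()                           # normalize
        learners.append(d)
        betas.append(beta)
    return learners, betas

def adaboost_predict(learners, betas, X, classes=(0, 1)):
    """Weighted vote with weights log(1/beta_j)."""
    votes = {c: np.zeros(len(X)) for c in classes}
    for d, b in zip(learners, betas):
        y = d(X)
        for c in classes:
            votes[c] += np.log(1.0 / b) * (y == c)
    V = np.stack([votes[c] for c in classes])
    return np.array(classes)[V.argmax(axis=0)]
```

Any weak learner slightly better than chance can serve as the base learner; decision stumps are the usual choice.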
137
Figure 15.3: Mixture of experts is a voting method
where the votes, as given by the gating system, are a
function of the input. The combiner system f also
includes this gating system. From: E. Alpaydın.
2004. Introduction to Machine Learning. ©The MIT Press.
138
Figure 15.4: In stacked generalization, the combiner
is another learner and is not restricted to being a
linear combination as in voting. From: E. Alpaydın.
2004. Introduction to Machine Learning. ©The MIT Press.
139
Figure 15.5: Cascading is a multistage method where
there is a sequence of classifiers, and the next one is
used only when the preceding ones are not confident.
From: E. Alpaydın. 2004. Introduction to Machine
Learning. c©The MIT Press.
140
Chapter 16:
Reinforcement Learning
141
Figure 16.1: The agent interacts with an
environment. At any state of the environment, the
agent takes an action that changes the state and
returns a reward. From: E. Alpaydın. 2004.
Introduction to Machine Learning. c©The MIT Press.
142
Initialize V(s) to arbitrary values
Repeat
    For all s ∈ S
        For all a ∈ A
            Q(s, a) ← E[r|s, a] + γ Σ_{s′∈S} P(s′|s, a) V(s′)
        V(s) ← max_a Q(s, a)
Until V(s) converge
Figure 16.2: Value iteration algorithm for
model-based learning. From: E. Alpaydın. 2004.
Introduction to Machine Learning. c©The MIT Press.
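The iteration of figure 16.2 can be sketched for a tabular MDP; the representation chosen here (`P[a][s]` as a row of transition probabilities, `R[s][a]` as expected immediate reward) is an assumption of the sketch.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration of figure 16.2. P[a][s] is the vector of
    transition probabilities P(s'|s, a); R[s][a] is E[r|s, a].
    Returns the value function and the greedy policy."""
    nA, nS = len(P), len(P[0])
    V = np.zeros(nS)
    while True:
        Q = np.array([[R[s][a] + gamma * P[a][s] @ V for a in range(nA)]
                      for s in range(nS)])
        Vnew = Q.max(axis=1)                   # V(s) <- max_a Q(s, a)
        if np.max(np.abs(Vnew - V)) < tol:
            return Vnew, Q.argmax(axis=1)
        V = Vnew
```

Because the update is a contraction with factor γ, convergence to the unique V* is guaranteed.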
143
Initialize a policy π′ arbitrarily
Repeat
    π ← π′
    Compute the values using π by solving the linear equations
        V^π(s) = E[r|s, π(s)] + γ Σ_{s′∈S} P(s′|s, π(s)) V^π(s′)
    Improve the policy at each state
        π′(s) ← arg max_a (E[r|s, a] + γ Σ_{s′∈S} P(s′|s, a) V^π(s′))
Until π = π′
Figure 16.3: Policy iteration algorithm for
model-based learning. From: E. Alpaydın. 2004.
Introduction to Machine Learning. c©The MIT Press.
144
[Diagram: grid world with goal state G (reward 100); the transition marked * and the Q values 0, 81, 90 along the two paths A and B]
Figure 16.4: Example to show that Q values increase
but never decrease. This is a deterministic grid-world
where G is the goal state with reward 100, all other
immediate rewards are 0 and γ = 0.9. Let us consider
the Q value of the transition marked by the asterisk,
and let us consider only the two paths A and B. Say
that path A is seen before path B; then we have
γ max(0, 81) = 72.9. If afterward B is seen, a shorter
path is found and the Q value becomes
γ max(100, 81) = 90. If B is seen before A, the Q value
is γ max(100, 0) = 90. Then, when A is seen, it does
not change because γ max(100, 81) = 90. From:
E. Alpaydın. 2004. Introduction to Machine
Learning. c©The MIT Press.
145
Initialize all Q(s, a) arbitrarily
For all episodes
    Initialize s
    Repeat
        Choose a using the policy derived from Q, e.g., ε-greedy
        Take action a, observe r and s′
        Update Q(s, a):
            Q(s, a) ← Q(s, a) + η (r + γ max_{a′} Q(s′, a′) − Q(s, a))
        s ← s′
    Until s is a terminal state
Figure 16.5: Q learning, which is an off-policy
temporal difference algorithm. From: E. Alpaydın.
2004. Introduction to Machine Learning. ©The MIT Press.
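A tabular sketch of figure 16.5; `env_step(s, a)` is an assumed environment callback returning (reward, next state, done), and starting each episode in state 0 is a simplification of this sketch.

```python
import numpy as np

def q_learning(env_step, n_states, n_actions, episodes=500,
               eta=0.5, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q learning of figure 16.5 with an epsilon-greedy policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if rng.random() < eps:                 # explore
                a = int(rng.integers(n_actions))
            else:                                  # exploit
                a = int(Q[s].argmax())
            r, s2, done = env_step(s, a)
            target = r + gamma * Q[s2].max() * (not done)
            Q[s, a] += eta * (target - Q[s, a])    # off-policy update
            s = s2
    return Q
```

Note the max over a′ in the target: Q learning evaluates the greedy policy regardless of the action actually taken, which is what makes it off-policy; replacing the max with the next action actually chosen gives Sarsa (figure 16.6).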
146
Initialize all Q(s, a) arbitrarily
For all episodes
    Initialize s
    Choose a using the policy derived from Q, e.g., ε-greedy
    Repeat
        Take action a, observe r and s′
        Choose a′ using the policy derived from Q, e.g., ε-greedy
        Update Q(s, a):
            Q(s, a) ← Q(s, a) + η (r + γ Q(s′, a′) − Q(s, a))
        s ← s′, a ← a′
    Until s is a terminal state
Figure 16.6: Sarsa algorithm, which is an on-policy
version of Q learning. From: E. Alpaydın. 2004.
Introduction to Machine Learning. c©The MIT Press.
147
Figure 16.7: Example of an eligibility trace for a
value. Visits are marked by an asterisk. From:
E. Alpaydın. 2004. Introduction to Machine
Learning. c©The MIT Press.
148
Initialize all Q(s, a) arbitrarily, e(s, a) ← 0, ∀s, a
For all episodes
    Initialize s
    Choose a using the policy derived from Q, e.g., ε-greedy
    Repeat
        Take action a, observe r and s′
        Choose a′ using the policy derived from Q, e.g., ε-greedy
        δ ← r + γ Q(s′, a′) − Q(s, a)
        e(s, a) ← 1
        For all s, a:
            Q(s, a) ← Q(s, a) + η δ e(s, a)
            e(s, a) ← γ λ e(s, a)
        s ← s′, a ← a′
    Until s is a terminal state
Figure 16.8: Sarsa(λ) algorithm. From: E. Alpaydın.
2004. Introduction to Machine Learning. ©The MIT Press.
149
Figure 16.9: In the case of a partially observable
environment, the agent has a state estimator (SE)
that keeps an internal belief state b and the policy π
generates actions based on the belief states. From:
E. Alpaydın. 2004. Introduction to Machine
Learning. c©The MIT Press.
150
Figure 16.10: The grid world. The agent can move
in the four compass directions starting from S. The
goal state is G. From: E. Alpaydın. 2004.
Introduction to Machine Learning. c©The MIT Press.
151
Appendix: Probability
152
[Plot: unit normal density Z = N(0, 1), p(x) vs. x]
Figure A.1: Probability density function of Z, the unit normal distribution.
153