Bayesian Decision Theory
Machine Learning for Context Aware Computing
index of contents
introduction (Chapter 1 – Pattern Classification, Duda/Hart/Stork)
Bayesian decision theory (Chapter 2 – Pattern Classification)
> machine perception
- pattern recognition systems
- the design cycle
- learning and adaptation
- conclusion
- an example
machine perception
build a machine that can recognize patterns:
- speech recognition
- fingerprint identification
- DNA sequence identification
- OCR (optical character recognition)
- accurate pattern recognition would be immensely useful
- we gain a deeper understanding by solving such problems
- algorithm and hardware design are influenced by knowledge of how these problems are solved in nature
index of contents
introduction to ML
Bayesian decision theory
- machine perception
- pattern recognition systems
- the design cycle
- learning and adaptation
- conclusion
> an example
an example – fish packing plant (1/8)
aims
- “Sorting incoming fish on a conveyor according to species using optical sensing”
- pilot project: separate sea bass from salmon
problem analysis
- want to automate the process of sorting incoming fish
- pilot project: separate sea bass from salmon
- using optical sensing
- take sample pictures
- extract features
> length, width
> lightness
> number and shape of fins
- notice noise or variations in the images
> variation in lighting
> position of the fish on the conveyor
an example – fish packing plant (2/8)
preprocessing
- use a segmentation operation to isolate the fish from
  > one another
  > the background
feature extraction
- information from a single fish is sent to a feature extractor whose purpose is to reduce the data by measuring certain features
- the features are passed to a classifier
model
- differences between the populations suggest different models
- hypothesize a class of models
- choose the best corresponding model
an example – fish packing plant (3/8)
overview
[diagram: preprocessing → feature extraction → classification → “salmon” / “sea bass”]
training samples
- length is an obvious feature: try to classify by the length L
- obtain training samples by making length measurements
- suppose a sea bass is generally longer than a salmon
classification
- evaluates evidence
- makes final decision
an example – fish packing plant (4/8)
feature: length
- length alone is a poor feature
[figure: histograms of count versus length L for salmon and sea bass]
- select lightness as a possible feature instead
an example – fish packing plant (5/8)
feature: lightness
- careful elimination of variations in illumination
- the classes are much better separated
[figure: histograms of count versus lightness x for salmon and sea bass]
decision boundary and cost relationship
- move our decision boundary toward smaller values of lightness in order to minimize the cost (reduce the number of sea bass that are classified as salmon!)
- this is the task of decision theory
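to make the cost trade-off concrete, here is a minimal Python sketch that searches for the cost-minimizing lightness threshold on labeled samples; the sample distributions, class means and cost values below are hypothetical, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical lightness samples; means and spreads are made up for illustration
salmon   = rng.normal(4.0, 1.0, 500)   # assume salmon are darker on average
sea_bass = rng.normal(7.0, 1.0, 500)   # assume sea bass are lighter

COST_BASS_AS_SALMON = 5.0   # the costly mistake we want to suppress
COST_SALMON_AS_BASS = 1.0

def empirical_cost(t):
    # decision rule: classify as salmon if lightness < t, else as sea bass
    bass_as_salmon = np.sum(sea_bass < t)
    salmon_as_bass = np.sum(salmon >= t)
    return (COST_BASS_AS_SALMON * bass_as_salmon
            + COST_SALMON_AS_BASS * salmon_as_bass)

thresholds = np.linspace(0.0, 11.0, 1101)
best_t = min(thresholds, key=empirical_cost)
print(f"cost-minimizing lightness threshold: {best_t:.2f}")
```

because sea-bass-as-salmon errors are penalized five times more heavily here, the minimizing threshold lies below the symmetric crossover point, i.e. the boundary moves toward smaller lightness values, as stated above.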
an example – fish packing plant (6/8)
decision theory
- make decision rules so as to minimize the cost
- add width as a new feature for classification
- the problem is to partition the feature space into two regions
  → x^t = [x1, x2], where x1 = lightness and x2 = width
- add other features that are not correlated with the ones we already have
- take care not to reduce the performance by adding such “noisy features”
[scatter plot: width versus lightness for salmon and sea bass, with a linear decision boundary]
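as a sketch of how such a two-feature classifier might look, the following assigns a fish to the class with the nearer mean in the (lightness, width) plane; all feature values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical (lightness, width) samples for each class
salmon   = rng.normal([4.0, 10.0], 1.0, size=(200, 2))
sea_bass = rng.normal([7.0, 13.0], 1.0, size=(200, 2))

mu_salmon   = salmon.mean(axis=0)
mu_sea_bass = sea_bass.mean(axis=0)

def classify(x):
    # nearest-mean rule: its decision boundary is a straight line in the plane
    d_s = np.linalg.norm(x - mu_salmon)
    d_b = np.linalg.norm(x - mu_sea_bass)
    return "salmon" if d_s < d_b else "sea bass"

print(classify(np.array([5.0, 11.0])))
```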
an example – fish packing plant (7/8)
best decision boundary
- the best decision boundary should be the one that provides optimal performance, as in the following figure:
- satisfaction is premature because the central aim of designing a classifier is to correctly classify novel input
→ the issue of generalization
[scatter plot: width versus lightness with a complex decision boundary that separates all training samples]
an example – fish packing plant (8/8)
generalized decision boundary
[scatter plot: width versus lightness with a simpler, generalized decision boundary]
index of contents
introduction to ML
Bayesian decision theory
- machine perception
> pattern recognition systems
- the design cycle
- learning and adaptation
- conclusion
- an example
pattern recognition systems (1/3)
overview
[flow diagram: input → sensing → segmentation → feature extraction → classification → post-processing → decision; with feedback loops for adjustments for missing features, adjustments for context, and costs]
pattern recognition systems (2/3)
sensing
- use of a transducer (e.g. a camera)
- the performance of a pattern recognition system depends on:
  > the bandwidth
  > the resolution, sensitivity, and distortion of the transducer
segmentation and grouping
- patterns should:
  > be well separated
  > not overlap
pattern recognition systems (3/3)
feature extraction
- the boundary between feature extraction and classification is somewhat arbitrary
- features should be invariant with respect to translation, rotation and scale
classification
- use a feature vector provided by a feature extractor to assign the object to a category
post-processing
- exploit context (input-dependent information other than the target pattern itself) to improve performance
index of contents
introduction to ML
Bayesian decision theory
- machine perception
- pattern recognition systems
> the design cycle
- learning and adaptation
- conclusion
- an example
the design cycle (1/3)
overview
[flow diagram: start → collect data → choose features → choose model → train classifier → evaluate classifier → end; prior knowledge (e.g. invariances) feeds into the design choices]
the design cycle (2/3)
data collection
- how do we know when we have collected an adequately large and representative set of examples for training and testing the system?
feature choice
- depends on the characteristics of the problem domain: features should be simple to extract, invariant to irrelevant transformations, and insensitive to noise
model choice
- if we are unsatisfied with the performance of our fish classifier, we may want to jump to another class of model
- how to combine prior knowledge and empirical data?
the design cycle (3/3)
training
- use data to determine the classifier
- different procedures for training classifiers and choosing models
evaluation
- measure the error rate (or performance) and switch from one set of features to another
solving problems
- no universal method for solving pattern recognition problems has been found
index of contents
introduction to ML
Bayesian decision theory
- machine perception
- pattern recognition systems
- the design cycle
> learning and adaptation
- conclusion
- an example
learning and adaptation
supervised learning
- teacher provides a category label or cost for each pattern in the
training set
unsupervised learning
- the system forms clusters or “natural groupings” of the input
patterns
learning
- pattern recognition problems are too hard to simply guess the best classification decision; it must be learned
index of contents
introduction to ML
Bayesian decision theory
- machine perception
- pattern recognition systems
- the design cycle
- learning and adaptation
> conclusion
- an example
conclusion
overwhelmed
- one may feel overwhelmed by the number, complexity and magnitude of the sub-problems of pattern recognition
problems
- many of these sub-problems can indeed be solved
unresolved problems
- many fascinating unsolved problems still remain
- mathematical theories solving some problems have been
discovered
index of contents
introduction to ML
Bayesian decision theory
> introduction
- minimum-error-rate classification
- classifiers, discriminant functions and decision surfaces
- the normal density
- discriminant functions for the normal density
- Bayesian decision theory / continuous features
- Bayesian decision theory / discrete features
introduction (1/4)
assumptions
- sequence of types of fish appears to be random
- in decision-theoretic terminology: as each fish emerges, nature is in one or the other of two possible states
state of nature
- let w denote the state of nature
- w1 = sea bass and w2 = salmon
- the catch of salmon and sea bass is equiprobable:
  > P(w1) = P(w2) (uniform priors)
  > P(w1) + P(w2) = 1 (exclusivity and exhaustivity)
a priori probability
- P(w1) = prior probability that the next fish is a sea bass
introduction (2/4)
example
- classification problem of sea bass and salmon by lightness
- assume the a priori probabilities are not equal
  > i.e. assume P(sea bass) > P(salmon)
  > if you cannot see the fish, decide sea bass every time
- if you can observe the lightness of the fish:
Question: P(sea bass | lightness) = ? and P(salmon | lightness) = ?
[figures: class-conditional densities p(lightness | sea bass) and p(lightness | salmon), and posterior probabilities P(sea bass | lightness) and P(salmon | lightness)]
introduction (3/4)
Bayes’ rule

P(wj | x) = p(x | wj) P(wj) / p(x)

where, in the case of two categories,

p(x) = Σ_{j=1}^{2} p(x | wj) P(wj)
decision given the posterior probabilities
- x is an observation for which:
> P(w1 | x) > P(w2 | x) → true state of nature = w1
> P(w1 | x) < P(w2 | x) → true state of nature = w2
- whenever we observe a particular x, the probability of error is:
> P(error | x) = P(w1 | x) if we decide w2
> P(error | x) = P(w2 | x) if we decide w1
- this is the maximum a posteriori (MAP) classifier, or Bayes classifier
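a minimal sketch of the MAP decision, with hypothetical Gaussian class-conditional densities and priors (none of these numbers come from the slides):

```python
import numpy as np

priors = np.array([0.6, 0.4])   # hypothetical P(w1)=sea bass, P(w2)=salmon

def likelihoods(x):
    # hypothetical Gaussian class-conditional densities p(x | wj) over lightness
    means, sigmas = np.array([7.0, 4.0]), np.array([1.0, 1.0])
    return np.exp(-0.5 * ((x - means) / sigmas) ** 2) / (np.sqrt(2 * np.pi) * sigmas)

def posteriors(x):
    joint = likelihoods(x) * priors   # p(x | wj) P(wj)
    return joint / joint.sum()        # Bayes' rule: normalize by the evidence p(x)

post = posteriors(5.2)
print(f"P(w1 | x) = {post[0]:.3f}, P(w2 | x) = {post[1]:.3f}")
print("decide:", ("w1 (sea bass)", "w2 (salmon)")[int(np.argmax(post))])
```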
introduction (4/4)
minimizing the probability of error
- decide w1 if P(w1 | x) > P(w2 | x); otherwise decide w2
- then P(error | x) = min [P(w1 | x), P(w2 | x)] (Bayes decision)
maximum-likelihood classifier
- if P(w1) = P(w2), the rule reduces to a simpler decision rule:
- decide w1 if p(x | w1) > p(x | w2); otherwise decide w2
index of contents
introduction to ML
Bayesian decision theory
- introduction
- minimum-error-rate classification
- classifiers, discriminant functions and decision surfaces
- the normal density
- discriminant functions for the normal density
> Bayesian decision theory / continuous features
- Bayesian decision theory / discrete features
Bayesian decision theory / continuous features (1/4)
generalization of the preceding ideas
- allow the use of more than one feature
- allow more than two states of nature
- allow actions other than merely deciding the state of nature
- introduce a loss function, which is more general than the probability of error
feature space
- replace scalar x by the feature vector x
loss function
- states how costly each action is
- allows us to treat situations in which some kinds of classification mistakes are more costly than others
Bayesian decision theory / continuous features (2/4)
definitions
- let {w1, w2,…, wc} be the set of c states of nature (or “categories”)
- let {a1, a2,…, aa} be the set of possible actions
- let l(ai | wj) be the loss incurred for taking action ai when the state of nature is wj
risk := expected loss (cost)
- the expected loss of taking action ai is:

  R(ai | x) = Σ_{j=1}^{c} l(ai | wj) P(wj | x)
select the action ai for which R(ai | x) is minimum
- the resulting overall risk is then minimized; this minimum is called the Bayes risk: the best performance that can be achieved!
- posterior probability: P(wj | x) = p(x | wj) P(wj) / p(x)
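a minimal sketch of the minimum-risk decision: given posteriors P(wj | x) and a loss matrix (both hypothetical), compute R(ai | x) for every action and pick the smallest:

```python
import numpy as np

# hypothetical loss matrix: loss[i, j] = l(a_i | w_j)
loss = np.array([[0.0, 2.0],    # a1 = decide w1
                 [5.0, 0.0]])   # a2 = decide w2

def min_risk_action(posteriors):
    # conditional risk R(a_i | x) = sum_j l(a_i | w_j) P(w_j | x)
    risks = loss @ posteriors
    return int(np.argmin(risks)), risks

action, risks = min_risk_action(np.array([0.3, 0.7]))
print(f"R(a1 | x) = {risks[0]:.2f}, R(a2 | x) = {risks[1]:.2f} -> take a{action + 1}")
```

note that with these hypothetical losses the minimum-risk action is a1 even though w2 has the larger posterior: the loss function can override the MAP decision.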
Bayesian decision theory / continuous features (3/4)
two-category classification
- a1 : deciding w1
- a2 : deciding w2
- lij = l(ai | wj): the loss incurred for deciding wi when the true state of nature is wj
conditional risk
- R(a1 | x) = l11P(w1 | x) + l12P(w2 | x)
- R(a2 | x) = l21P(w1 | x) + l22P(w2 | x)
rule
- if R(a1 | x) < R(a2 | x), action a1 (“decide w1”) is taken
- this results in the equivalent rule: decide w1 if (l21 - l11) p(x | w1) P(w1) > (l12 - l22) p(x | w2) P(w2), and decide w2 otherwise
Bayesian decision theory / continuous features (4/4)
likelihood
- the preceding rule is equivalent to the following rule:
p(x | w1) / p(x | w2) > [(l12 - l22) / (l21 - l11)] · [P(w2) / P(w1)]

- if this inequality holds, take action a1 (decide w1); otherwise take action a2 (decide w2)
likelihood ratio
- the left-hand side is the likelihood ratio of the class-conditional probability density functions
- the decision boundary is determined by the threshold θ given by the right-hand side
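a small sketch of this likelihood-ratio test with hypothetical losses and priors:

```python
# hypothetical losses l_ij = l(a_i | w_j) and priors
l11, l12, l21, l22 = 0.0, 2.0, 5.0, 0.0
P1, P2 = 0.6, 0.4

theta = (l12 - l22) / (l21 - l11) * (P2 / P1)   # the threshold above

def decide(p_x_w1, p_x_w2):
    # decide w1 iff the likelihood ratio exceeds the threshold theta
    return "w1" if p_x_w1 / p_x_w2 > theta else "w2"

print(f"theta = {theta:.3f}")
print(decide(0.30, 0.10))   # ratio 3.0 > theta, so w1
```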
index of contents
introduction to ML
Bayesian decision theory
- introduction
> minimum-error-rate classification
- classifiers, discriminant functions and decision surfaces
- the normal density
- discriminant functions for the normal density
- Bayesian decision theory / continuous features
- Bayesian decision theory / discrete features
minimum-error-rate classification (1/2)
actions are decisions on classes
- if action ai is taken and the true state of nature is wj then: the decision is correct if i = j and in error if i ≠ j
decision rule
- seek a decision rule that minimizes the probability of error which is the error rate
introduction of the zero-one loss function
l(ai | wj) = 0 if i = j, and 1 if i ≠ j,   for i, j = 1, …, c
- therefore, the conditional risk is:
R(ai | x) = Σ_{j=1}^{c} l(ai | wj) P(wj | x) = Σ_{j≠i} P(wj | x) = 1 - P(wi | x)
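as a quick check, with the zero-one loss the minimum-risk decision reduces to picking the largest posterior; a tiny sketch with hypothetical posteriors:

```python
import numpy as np

posteriors = np.array([0.2, 0.5, 0.3])   # hypothetical P(w_j | x), c = 3
loss = 1.0 - np.eye(3)                   # zero-one loss: 0 if i == j, else 1

risks = loss @ posteriors                # equals 1 - P(w_i | x) for each i
assert np.allclose(risks, 1.0 - posteriors)
print("decide w%d" % (np.argmin(risks) + 1))   # same as the largest posterior
```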
minimum-error-rate classification (2/2)
minimizing the risk requires maximization of P(wi | x)
for minimum error rate
- decide wi if P(wi | x) > P(wj | x) for all j ≠ i
index of contents
introduction to ML
Bayesian decision theory
- introduction
- minimum-error-rate classification
> classifiers, discriminant functions and decision surfaces
- the normal density
- discriminant functions for the normal density
- Bayesian decision theory / continuous features
- Bayesian decision theory / discrete features
classifiers, discriminant functions and decision surfaces (1/4)
the multi-category case
- set of discriminant functions gi(x), i = 1, ..., c
functional structure of general statistical pattern classifier
- the classifier assigns a feature vector x to class wi if: gi(x) > gj(x) for all j ≠ i
classifiers, discriminant functions and decision surfaces (2/4)
Bayes classifier can be represented in this way
- the selection of a discriminant function is not unique
- gi(x) = P(wi | x): minimum error rate
- gi(x) = - R(ai | x): minimum conditional risk
- for the minimum-error classifier, one may equivalently choose:
  > gi(x) = p(x | wi) P(wi)
  > gi(x) = ln p(x | wi) + ln P(wi)
discriminant functions
- discriminant functions can take different forms, but the effect of the decision rules is the same: they define the decision boundaries
- decide x is in Ri if: gi(x) > gj(x) for all j ≠ i
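a minimal sketch of a multicategory discriminant-function classifier using gi(x) = ln p(x | wi) + ln P(wi), with hypothetical one-dimensional Gaussian classes:

```python
import numpy as np

# hypothetical one-dimensional Gaussian classes with a shared sigma
means  = np.array([2.0, 5.0, 8.0])
sigma  = 1.0
priors = np.array([0.5, 0.3, 0.2])

def g(x):
    # g_i(x) = ln p(x | w_i) + ln P(w_i) for each class i
    log_lik = -0.5 * ((x - means) / sigma) ** 2 - np.log(np.sqrt(2 * np.pi) * sigma)
    return log_lik + np.log(priors)

scores = g(4.1)
print("discriminants:", np.round(scores, 3), "-> class", int(np.argmax(scores)) + 1)
```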
classifiers, discriminant functions and decision surfaces (4/4)
[figure: the decision regions of a two-dimensional two-category classifier]
index of contents
introduction to ML
Bayesian decision theory
- introduction
- minimum-error-rate classification
- classifiers, discriminant functions and decision surfaces
> the normal density
- discriminant functions for the normal density
- Bayesian decision theory / continuous features
- Bayesian decision theory / discrete features
the normal density (1/2)
univariate normal density
p(x) = (1 / (√(2π) σ)) exp[ -1/2 ((x - μ)/σ)² ]

- μ = the mean (expected value)
- σ² = the expected squared deviation (variance)

multivariate normal density in d dimensions

p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp[ -1/2 (x - μ)^t Σ⁻¹ (x - μ) ]

- x = (x1, x2, …, xd)^t (t stands for the transpose)
- μ = (μ1, μ2, …, μd)^t is the mean vector
- Σ = d-by-d covariance matrix
- |Σ| and Σ⁻¹ are its determinant and inverse, respectively
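a minimal sketch that evaluates the multivariate density above with numpy; the mean vector and covariance matrix are hypothetical:

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    # multivariate normal density, written exactly as in the formula above
    d = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
print(mvn_density(np.array([0.5, -0.2]), mu, Sigma))
```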
the normal density (2/2)
[figures: the univariate and multivariate normal distributions]
- the covariance matrix determines the shape of the Gaussian curve
index of contents
introduction to ML
Bayesian decision theory
- introduction
- minimum-error-rate classification
- classifiers, discriminant functions and decision surfaces
- the normal density
> discriminant functions for the normal density
- Bayesian decision theory / continuous features
- Bayesian decision theory / discrete features
discriminant functions for the normal density (1/4)
- minimum error-rate classification can be achieved by the discriminant function gi(x) = ln p(x | wi) + ln P(wi)
- in the case of the multivariate normal density, the discriminant function is:
gi(x) = -1/2 (x - μi)^t Σi⁻¹ (x - μi) - (d/2) ln 2π - 1/2 ln |Σi| + ln P(wi)

case 1: Σi = σ²I (statistically independent features with equal variance σ²)

gi(x) = -(1/(2σ²)) [x^t x - 2 μi^t x + μi^t μi] + ln P(wi)
gi(x) = (1/σ²) μi^t x - (1/(2σ²)) μi^t μi + ln P(wi)   (dropping the class-independent x^t x term gives a linear discriminant)
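a minimal sketch of the resulting linear discriminant for case 1; the class means, σ and priors below are hypothetical:

```python
import numpy as np

def linear_discriminants(x, mus, sigma, priors):
    # g_i(x) = (1/sigma^2) mu_i^t x - (1/(2 sigma^2)) mu_i^t mu_i + ln P(w_i)
    mus = np.asarray(mus, dtype=float)
    return (mus @ x) / sigma**2 \
           - (mus * mus).sum(axis=1) / (2 * sigma**2) \
           + np.log(np.asarray(priors))

mus = [[4.0, 10.0], [7.0, 13.0]]          # hypothetical class means
scores = linear_discriminants(np.array([5.0, 11.0]), mus, 1.0, [0.5, 0.5])
print("assign to class", int(np.argmax(scores)) + 1)
```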
discriminant functions for the normal density (2/4)
case 1: Σi = σ²I (statistically independent features with equal variance σ²)
discriminant functions for the normal density (3/4)
case 2: Σi = Σ (the covariance matrices of all classes are identical but otherwise arbitrary)
discriminant functions for the normal density (4/4)
case 3: Σi arbitrary

gi(x) = -1/2 (x - μi)^t Σi⁻¹ (x - μi) - (d/2) ln 2π - 1/2 ln |Σi| + ln P(wi)
index of contents
introduction to ML
Bayesian decision theory
- introduction
- minimum-error-rate classification
- classifiers, discriminant functions and decision surfaces
- the normal density
- discriminant functions for the normal density
- Bayesian decision theory / continuous features
> Bayesian decision theory / discrete features
Bayesian decision theory / discrete features (1/2)
the components of x are binary or integer valued; x can take only one of m discrete values v1, v2, …, vm
case of independent binary features in a two-category problem
- x = [x1, x2, …, xd ]t where each xi is either 0 or 1, with probabilities
> pi = P(xi = 1 | w1)
> qi = P(xi = 1 | w2)
- probability densities are replaced by probabilities: the integral ∫ p(x | wi) dx becomes the sum Σx P(x | wi)
- fundamental Bayes decision rule remains the same
Bayesian decision theory / discrete features (2/2)
the discriminant function in this case is
g(x) = Σ_{i=1}^{d} wi xi + w0

where

wi = ln [ pi (1 - qi) / (qi (1 - pi)) ],   i = 1, …, d

w0 = Σ_{i=1}^{d} ln [ (1 - pi) / (1 - qi) ] + ln [ P(w1) / P(w2) ]

decide w1 if g(x) > 0 and w2 if g(x) ≤ 0
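a minimal sketch of this discriminant for independent binary features; the probabilities pi, qi and the priors are hypothetical:

```python
import numpy as np

# hypothetical feature probabilities and priors
p = np.array([0.8, 0.6, 0.7])   # p_i = P(x_i = 1 | w1)
q = np.array([0.3, 0.4, 0.2])   # q_i = P(x_i = 1 | w2)
P1, P2 = 0.5, 0.5

w  = np.log(p * (1 - q) / (q * (1 - p)))                 # weights w_i
w0 = np.log((1 - p) / (1 - q)).sum() + np.log(P1 / P2)   # bias w_0

def g(x):
    return w @ x + w0   # decide w1 if g(x) > 0, else w2

x = np.array([1, 0, 1])
print(f"g(x) = {g(x):.3f} ->", "w1" if g(x) > 0 else "w2")
```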