COSC 522 – Machine Learning
Lecture 13 – Fusion
Hairong Qi, Gonzalez Family Professor
Electrical Engineering and Computer Science
University of Tennessee, Knoxville
http://www.eecs.utk.edu/faculty/qi
Email: [email protected]
Course Website: http://web.eecs.utk.edu/~hqi/cosc522/
Transcript
  • COSC 522 – Machine Learning

    Lecture 13 – Fusion

    Hairong Qi, Gonzalez Family Professor
    Electrical Engineering and Computer Science
    University of Tennessee, Knoxville
    http://www.eecs.utk.edu/faculty/qi
    Email: [email protected]

    Course Website: http://web.eecs.utk.edu/~hqi/cosc522/


  • Recap – Decision Rules

    • Supervised learning
      – Bayesian based – Maximum Posterior Probability (MPP): for a given x, if P(ω1|x) > P(ω2|x), then x belongs to class 1, otherwise class 2 (a minimal sketch follows at the end of this slide)
      – Parametric learning
        – Case 1: minimum Euclidean distance (linear machine), Σi = σ²I
        – Case 2: minimum Mahalanobis distance (linear machine), Σi = Σ
        – Case 3: quadratic classifier, Σi arbitrary
        – Estimate Gaussian parameters using MLE
      – Nonparametric learning
        – k-Nearest Neighbor (kNN)
      – Neural network
        – Perceptron
        – BPNN
      – Kernel-based approaches
        – Support Vector Machine (SVM)
      – Decision tree
      – Least-squares based

    • Unsupervised learning
      – k-means
      – Winner-takes-all

    • Supporting pre-/post-processing techniques
      – Normalization
      – Dimensionality reduction (FLD, PCA)
      – Performance evaluation (metrics, confusion matrices, ROC, cross validation)
      – Fusion

    2

    P(ωj | x) = p(x | ωj) P(ωj) / p(x)
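As a quick illustration of the MPP rule above, here is a minimal Python sketch. It assumes 1-D Gaussian class-conditional densities with made-up priors, means, and standard deviations (none of these numbers come from the course), and it assumes SciPy is available.

```python
import numpy as np
from scipy.stats import norm

def mpp_decide(x, priors=(0.5, 0.5), means=(0.0, 2.0), stds=(1.0, 1.0)):
    """Maximum Posterior Probability rule for two classes with hypothetical
    1-D Gaussian class-conditional densities: pick the larger P(w_j | x)."""
    # p(x) cancels in the comparison, so un-normalized posteriors suffice
    posteriors = [norm.pdf(x, m, s) * p for p, m, s in zip(priors, means, stds)]
    return 1 if posteriors[0] > posteriors[1] else 2

print(mpp_decide(0.3))   # closer to the class-1 mean -> 1
print(mpp_decide(1.8))   # closer to the class-2 mean -> 2
```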

  • Syllabus

    3

    COSC 522 – Machine Learning (Fall 2020) Syllabus

    Lecture | Date  | Content                                       | Tests/Assignment                                | Due Date
    1       | 8/20  | Introduction                                  | HW0                                             | 8/25
    2       | 8/25  | Bayesian Decision Theory - MPP                | HW1                                             | 9/1
    3       | 8/27  | Discriminant Function - MD                    | Proj1 - Supervised Learning                     | 9/10
    4       | 9/1   | Parametric Learning - MLE                     |                                                 |
    5       | 9/3   | Non-parametric Learning - kNN                 |                                                 |
    6       | 9/8   | Unsupervised Learning - kmeans                | HW2                                             | 9/15
    7       | 9/10  | Dimensionality Reduction - FLD                | Proj2 - Unsupervised Learning and DR            | 9/24
    8       | 9/15  | Dimensionality Reduction - PCA                |                                                 |
    9       | 9/17  | Linear Regression                             |                                                 |
    10      | 9/22  | Performance Evaluation                        | HW3                                             | 9/29
    11      | 9/24  | Fusion                                        | Proj3 - Regression                              | 10/8
    12      | 9/29  | Midterm Exam                                  | Final Project - Milestone 1: Forming Team       | 9/29
    13      | 10/1  | Gradient Descent                              |                                                 |
    14      | 10/6  | Neural Network - Perceptron                   | HW4                                             | 10/13
    15      | 10/8  | Neural Network - BPNN                         | Proj4 - BPNN                                    | 10/22
    16      | 10/13 | Neural Network - Practices                    | Final Project - Milestone 2: Choosing Topic     | 10/13
    17      | 10/15 | Kernel Methods - SVM                          |                                                 |
    18      | 10/20 | Kernel Methods - SVM                          | HW5                                             | 10/27
    19      | 10/22 | Kernel Methods - SVM                          | Proj5 - SVM & DT                                | 11/5
    20      | 10/27 | Decision Tree                                 | Final Project - Milestone 3: Literature Survey  | 10/27
    21      | 10/29 | Random Forest                                 |                                                 |
    22      | 11/3  |                                               | HW6                                             | 11/10
    23      | 11/5  | From PCA to t-SNE                             |                                                 |
    24      | 11/10 | From Gaussian to Mixture and EM               |                                                 |
    25      | 11/12 | From Supervised/Unsupervised to RL            |                                                 |
    26      | 11/17 | From Classification/Regression to Generation  | Final Project - Milestone 4: Prototype          | 11/17
    27      | 11/19 | From NN to CNN                                |                                                 |
    28      | 11/24 | Final Exam                                    |                                                 |
            | 12/3  | Final Presentation (8:00-10:15)               | Final Project - Report                          | 12/4

  • Questions

    • Rationale with fusion?
    • Different flavors of fusion?
    • The fusion hierarchy
    • What is the cost function for Naïve Bayes?
    • What is the procedure for Naïve Bayes?
    • What is the limitation of Naïve Bayes?
    • What is the procedure of Behavior-Knowledge-Space (BKS)? How does it resolve issues with NB?
    • What is Boosting and how does it differ from committee-based fusion approaches?
    • What is AdaBoost?

    4

  • Motivation

    • Combining classifiers to achieve higher accuracy
      – Combination of multiple classifiers
      – Classifier fusion
      – Mixture of experts
      – Committees of neural networks
      – Consensus aggregation
      – …

    • Reference:
      – L. I. Kuncheva, J. C. Bezdek, R. P. W. Duin, "Decision templates for multiple classifier fusion: an experimental comparison," Pattern Recognition, 34: 299-314, 2001.
      – Y. S. Huang and C. Y. Suen, "A method of combining multiple experts for the recognition of unconstrained handwritten numerals," IEEE Trans. Pattern Anal. Mach. Intell., vol. 17, no. 1, pp. 90–94, Jan. 1995.

    5

    Three heads are better than one.

  • Popular Approaches

    The fusion hierarchy:
    – Data-based fusion (early fusion)
    – Feature-based fusion (middle fusion)
    – Decision-based fusion (late fusion)

    Approaches:
    – Committee-based
      – Majority voting
      – Bootstrap aggregation (Bagging) [Breiman, 1996]
    – Bayesian-based
      – Naïve Bayes combination (NB)
      – Behavior-knowledge space (BKS) [Huang and Suen, 1995]
    – Boosting
      – Adaptive boosting (AdaBoost) [Freund and Schapire, 1996]
    – Interval-based integration

    6

  • Application Example – Civilian Target Recognition

    7


    [Figure: "Compact Cluster Laydown" – scatter plot of target positions, both axes in feet (0–140 ft); the civilian targets include a Suzuki Vitara, Ford 350, Ford 250, and a Harley motorcycle.]

  • Consensus Patterns

    • Unanimity (100%)
    • Simple majority (50% + 1)
    • Plurality (most votes)

    8

  • Example of Majority Voting – Temporal Fusion

    Fuse all the 1-sec sub-interval local processing results corresponding to the same event (an event usually lasts about 10 sec) by majority voting:

    9

    φ = arg max_{c ∈ [1, C]} ωc

    where C is the number of possible local processing results and ωc is the number of occurrences of local output c among the sub-interval results.
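A minimal sketch of this majority-voting rule in Python (the label sequence below is hypothetical, and ties are broken arbitrarily here):

```python
from collections import Counter

def majority_vote(local_decisions):
    """Fuse per-sub-interval local decisions into one event-level label.

    local_decisions: list of class labels, one per 1-sec sub-interval.
    Returns the label with the most votes (ties broken arbitrarily).
    """
    counts = Counter(local_decisions)       # omega_c: occurrences of each local output c
    label, _ = counts.most_common(1)[0]     # arg max over c
    return label

# Example: ten 1-sec decisions for one ~10-sec event
print(majority_vote(["AAV", "AAV", "DW", "AAV", "HMV", "AAV", "AAV", "DW", "AAV", "AAV"]))  # -> "AAV"
```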

  • Questions

    • Rationale with fusion?
    • Different flavors of fusion?
    • The fusion hierarchy
    • What is the cost function for Naïve Bayes?
    • What is the procedure for Naïve Bayes?
    • What is the limitation of Naïve Bayes?
    • What is the procedure of Behavior-Knowledge-Space (BKS)? How does it resolve issues with NB?
    • What is Boosting and how does it differ from committee-based fusion approaches?
    • What is AdaBoost?

    10

  • NB (the independence assumption)

    11

    Confusion matrices of the two classifiers on training data (rows: true class, columns: assigned class). For example, the C1 entry in row DW, column HMV counts samples whose real class is DW but which the classifier calls HMV.

    C1     AAV   DW    HMV
    AAV    894   329   143
    DW      99   411   274
    HMV     98    42   713

    C2     AAV   DW    HMV
    AAV   1304   156    77
    DW     114   437    83
    HMV     13   107   450

    From each confusion matrix, build a lookup table Li (i = 1, 2 for the two classifiers) whose entry in row k, column s is the probability that the true class is k given that Ci assigns it to s:

    L1     AAV   DW    HMV          L2     AAV   DW    HMV
    AAV                             AAV
    DW                              DW
    HMV                             HMV

    The independence assumption then lets the fused decision be formed by probability multiplication across the classifiers.

  • NB – Derivation

    • Assume the classifiers are mutually independent
    • Bayes combination – also known as Naïve Bayes, simple Bayes, or idiot's Bayes
    • Assume
      – L classifiers, i = 1, …, L
      – c classes, k = 1, …, c
      – si: class label given by the ith classifier, i = 1, …, L, with s = {s1, …, sL}

    12

    P(ωk | s) = p(s | ωk) P(ωk) / p(s) = P(ωk) ∏_{i=1..L} p(si | ωk) / p(s)

    P(ωk) = Nk / N

    p(si | ωk) = cm^i_{k,si} / Nk

    ⇒  P(ωk | s) ∝ (1 / Nk^{L−1}) ∏_{i=1..L} cm^i_{k,si}

    where Nk is the number of training samples from class ωk, N is the total number of training samples, and cm^i_{k,si} is the entry of classifier i's confusion matrix counting the class-ωk samples that Ci assigns to class si.
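A minimal sketch of the Naïve Bayes combination rule above. It assumes each classifier's confusion matrix has been estimated on training data; the rule is implemented directly as P(ωk) ∏i p(si|ωk), with the prior taken from the first classifier's row sums (an assumption, since the two example matrices were clearly built on different sample counts).

```python
import numpy as np

def naive_bayes_combine(conf_mats, labels):
    """Fuse L classifier decisions with the Naive Bayes combination rule.

    conf_mats: list of L confusion matrices, conf_mats[i][k, s] = number of
               class-k training samples that classifier i assigned to class s.
    labels:    list of L predicted class indices (s_1, ..., s_L) for one test sample.
    Returns the class index k maximizing P(w_k) * prod_i p(s_i | w_k).
    """
    conf_mats = [np.asarray(cm, dtype=float) for cm in conf_mats]
    prior = conf_mats[0].sum(axis=1)
    prior /= prior.sum()                          # P(w_k) ~ N_k / N
    scores = prior.copy()
    for cm, s in zip(conf_mats, labels):
        scores *= cm[:, s] / cm.sum(axis=1)       # p(s_i | w_k) ~ cm_i[k, s_i] / N_k
    return int(np.argmax(scores))

# Example with the two confusion matrices from the slide (0 = AAV, 1 = DW, 2 = HMV):
C1 = [[894, 329, 143], [99, 411, 274], [98, 42, 713]]
C2 = [[1304, 156, 77], [114, 437, 83], [13, 107, 450]]
print(naive_bayes_combine([C1, C2], labels=[2, 1]))   # C1 says HMV, C2 says DW -> fused: 1 (DW)
```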

  • BKS

    • Majority voting won’t work
    • Behavior-Knowledge Space (BKS) algorithm (Huang & Suen)

    13

    Assumption:
    – 2 classifiers
    – 3 classes
    – 100 samples in the training set

    Then there are 9 possible classification combinations:

    (c1, c2)   samples from each class   fused result
    1, 1       10/3/3                    1
    1, 2       3/0/6                     3
    1, 3       5/4/5                     1, 3
    …          …                         …
    3, 3       0/0/6                     3
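A minimal sketch of the BKS lookup idea, assuming the per-cell class counts have been gathered from training data. The class `BKS`, its fallback rule for empty cells, and the toy counts are illustrative choices, not the authors' exact implementation; tie handling is simplified.

```python
from collections import defaultdict
import numpy as np

class BKS:
    """Behavior-Knowledge Space fusion: index a cell by the tuple of classifier
    outputs and label it with the most-represented true class in that cell."""

    def __init__(self, n_classes):
        self.n_classes = n_classes
        self.cells = defaultdict(lambda: np.zeros(n_classes, dtype=int))

    def fit(self, classifier_outputs, true_labels):
        # classifier_outputs: iterable of length-L label tuples; true_labels: true class per sample
        for outputs, t in zip(classifier_outputs, true_labels):
            self.cells[tuple(outputs)][t] += 1
        return self

    def predict(self, outputs):
        counts = self.cells.get(tuple(outputs))
        if counts is None or counts.sum() == 0:
            return outputs[0]           # empty cell: fall back to the first classifier
        return int(np.argmax(counts))   # best representative class of the cell

# Toy usage mirroring the slide: cell (c1=1, c2=1) saw 10/3/3 samples from classes 1/2/3
bks = BKS(n_classes=3)
bks.cells[(0, 0)] = np.array([10, 3, 3])   # classes indexed from 0 here
print(bks.predict([0, 0]))                  # -> 0 (class 1 on the slide)
```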

  • Questions

    • Rationale with fusion?
    • Different flavors of fusion?
    • The fusion hierarchy
    • What is the cost function for Naïve Bayes?
    • What is the procedure for Naïve Bayes?
    • What is the limitation of Naïve Bayes?
    • What is the procedure of Behavior-Knowledge-Space (BKS)? How does it resolve issues with NB?
    • What is Boosting and how does it differ from committee-based fusion approaches?
    • What is AdaBoost?

    14

  • Boosting

    • Base classifiers are trained in sequence!
    • Base classifiers as weak learners
    • Weighted majority voting to combine classifiers

    15

    (Textbook excerpt: Ch. 14 Combining Models, p. 658)

    Figure 14.1: Schematic illustration of the boosting framework. Each base classifier ym(x) is trained on a weighted form of the training set (blue arrows) in which the weights wn(m) depend on the performance of the previous base classifier ym−1(x) (green arrows). Once all base classifiers have been trained, they are combined to give the final classifier

        YM(x) = sign( Σ_m αm ym(x) )

    AdaBoost

    1. Initialize the data weighting coefficients {wn} by setting wn(1) = 1/N for n = 1, …, N.

    2. For m = 1, …, M:

       (a) Fit a classifier ym(x) to the training data by minimizing the weighted error function

               Jm = Σ_{n=1..N} wn(m) I(ym(xn) ≠ tn)                                 (14.15)

           where I(ym(xn) ≠ tn) is the indicator function and equals 1 when ym(xn) ≠ tn and 0 otherwise.

       (b) Evaluate the quantities

               εm = [ Σ_{n=1..N} wn(m) I(ym(xn) ≠ tn) ] / [ Σ_{n=1..N} wn(m) ]      (14.16)

           and then use these to evaluate

               αm = ln{ (1 − εm) / εm }                                             (14.17)

       (c) Update the data weighting coefficients

               wn(m+1) = wn(m) exp{ αm I(ym(xn) ≠ tn) }                             (14.18)

  • AdaBoost

    • Step 1: Initialize the data weighting coefficients {wn} by setting wn(1) = 1/N, where N is the number of samples
    • Step 2: For each classifier ym(x):
      – (a) Fit a classifier ym(x) to the training data by minimizing the weighted error function
      – (b) Evaluate the quantities
      – (c) Update the data weighting coefficients
    • Step 3: Make predictions using the final model

    16

    (Textbook excerpt: Sec. 14.3 Boosting, p. 659)

    3. Make predictions using the final model, which is given by

           YM(x) = sign( Σ_{m=1..M} αm ym(x) )                                      (14.19)

    We see that the first base classifier y1(x) is trained using weighting coefficients wn(1) that are all equal, which therefore corresponds to the usual procedure for training a single classifier. From (14.18), we see that in subsequent iterations the weighting coefficients wn(m) are increased for data points that are misclassified and decreased for data points that are correctly classified. Successive classifiers are therefore forced to place greater emphasis on points that have been misclassified by previous classifiers, and data points that continue to be misclassified by successive classifiers receive ever greater weight. The quantities εm represent weighted measures of the error rates of each of the base classifiers on the data set. We therefore see that the weighting coefficients αm defined by (14.17) give greater weight to the more accurate classifiers when computing the overall output given by (14.19).

    The AdaBoost algorithm is illustrated in Figure 14.2, using a subset of 30 data points taken from the toy classification data set shown in Figure A.7. Here each base learner consists of a threshold on one of the input variables. This simple classifier corresponds to a form of decision tree known as a 'decision stump', i.e., a decision tree with a single node (Section 14.4). Thus each base learner classifies an input according to whether one of the input features exceeds some threshold and therefore simply partitions the space into two regions separated by a linear decision surface that is parallel to one of the axes.

    14.3.1 Minimizing exponential error

    Boosting was originally motivated using statistical learning theory, leading to upper bounds on the generalization error. However, these bounds turn out to be too loose to have practical value, and the actual performance of boosting is much better than the bounds alone would suggest. Friedman et al. (2000) gave a different and very simple interpretation of boosting in terms of the sequential minimization of an exponential error function.

    Consider the exponential error function defined by

           E = Σ_{n=1..N} exp{ −tn fm(xn) }                                         (14.20)

    where fm(x) is a classifier defined in terms of a linear combination of base classifiers yl(x) of the form

           fm(x) = (1/2) Σ_{l=1..m} αl yl(x)                                        (14.21)

    and tn ∈ {−1, 1} are the training set target values. Our goal is to minimize E with respect to both the weighting coefficients αl and the parameters of the base classifiers yl(x).

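A minimal sketch of the AdaBoost loop in equations (14.15)–(14.19), using decision stumps from scikit-learn as the weak learners. The synthetic dataset, the value M = 20, and the helper names `adaboost`/`predict` are illustrative choices, and scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, t, M=20):
    """AdaBoost with decision stumps, following (14.15)-(14.18).
    X: (N, d) features; t: (N,) targets in {-1, +1}. Returns (stumps, alphas)."""
    N = len(t)
    w = np.full(N, 1.0 / N)                       # step 1: w_n^(1) = 1/N
    stumps, alphas = [], []
    for m in range(M):
        ym = DecisionTreeClassifier(max_depth=1)  # weak learner: a decision stump
        ym.fit(X, t, sample_weight=w)             # (a) minimize the weighted error J_m
        miss = ym.predict(X) != t                 # indicator I(y_m(x_n) != t_n)
        eps = np.sum(w * miss) / np.sum(w)        # (b) weighted error rate, eq. (14.16)
        eps = np.clip(eps, 1e-10, 1 - 1e-10)      # guard against a perfect/useless stump
        alpha = np.log((1 - eps) / eps)           #     classifier weight, eq. (14.17)
        w = w * np.exp(alpha * miss)              # (c) re-weight the data, eq. (14.18)
        stumps.append(ym)
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    """Final model Y_M(x) = sign(sum_m alpha_m y_m(x)), eq. (14.19)."""
    votes = sum(a * ym.predict(X) for ym, a in zip(stumps, alphas))
    return np.sign(votes)

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=0)
t = 2 * y - 1                                     # map labels {0, 1} -> {-1, +1}
stumps, alphas = adaboost(X, t, M=20)
print("training accuracy:", np.mean(predict(stumps, alphas, X) == t))
```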

  • 17

    (Textbook excerpt: Ch. 14 Combining Models, p. 660)

    Figure 14.2: Illustration of boosting in which the base learners consist of simple thresholds applied to one or other of the axes. The panels show the ensemble after m = 1, 2, 3, 6, 10, and 150 base learners have been trained, along with the decision boundary of the most recent base learner (dashed black line) and the combined decision boundary of the ensemble (solid green line). Each data point is depicted by a circle whose radius indicates the weight assigned to that data point when training the most recently added base learner. Thus, for instance, we see that points that are misclassified by the m = 1 base learner are given greater weight when training the m = 2 base learner.

    Instead of doing a global error function minimization, however, we shall suppose that the base classifiers y1(x), …, ym−1(x) are fixed, as are their coefficients α1, …, αm−1, and so we are minimizing only with respect to αm and ym(x). Separating off the contribution from base classifier ym(x), we can then write the error function in the form

           E = Σ_{n=1..N} exp{ −tn fm−1(xn) − (1/2) tn αm ym(xn) }
             = Σ_{n=1..N} wn(m) exp{ −(1/2) tn αm ym(xn) }                          (14.22)

    where the coefficients wn(m) = exp{−tn fm−1(xn)} can be viewed as constants because we are optimizing only αm and ym(x). If we denote by Tm the set of data points that are correctly classified by ym(x), and if we denote the remaining misclassified points by Mm, then we can in turn rewrite the error function…

  • Value-based vs. Interval-based Fusion

    • Interval-based fusion can provide fault tolerance
    • Interval integration – overlap function
      – Assuming each sensor in a cluster measures the same parameters, the integration algorithm constructs a simple function (the overlap function) from the outputs of the sensors in the cluster and can resolve it at different resolutions as required (a sketch follows below)

    18

    [Figure: overlap function O(x) built from weighted sensor intervals w1s1, …, w7s7; the vertical axis (0–3) counts how many weighted intervals cover each point. Crest: the highest, widest peak of the overlap function.]
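A minimal sketch of constructing such an overlap function from sensor intervals and locating its crest. The interval data, the equal-weight default, the sampling resolution, and the simplified crest rule are all assumptions for illustration.

```python
import numpy as np

def overlap_function(intervals, weights=None, resolution=0.01):
    """Build O(x): the (weighted) number of sensor intervals covering each point x.

    intervals: list of (lo, hi) confidence ranges, one per sensor.
    weights:   optional per-sensor weights (defaults to 1 for every sensor).
    Returns (xs, O) sampled on a grid with the given resolution.
    """
    if weights is None:
        weights = [1.0] * len(intervals)
    lo = min(a for a, _ in intervals)
    hi = max(b for _, b in intervals)
    xs = np.arange(lo, hi + resolution, resolution)
    O = np.zeros_like(xs)
    for (a, b), w in zip(intervals, weights):
        O += w * ((xs >= a) & (xs <= b))     # each interval adds its weight where it covers x
    return xs, O

def crest(xs, O):
    """Return the span of the points where O attains its maximum (a simplified crest)."""
    peak = O.max()
    at_peak = xs[O == peak]
    return at_peak.min(), at_peak.max()

# Toy usage: seven sensors reporting overlapping ranges
ivals = [(0.0, 2.0), (0.5, 2.5), (1.0, 3.0), (1.2, 2.2), (4.0, 5.0), (1.5, 2.1), (0.8, 1.9)]
xs, O = overlap_function(ivals)
print("crest region:", crest(xs, O))
```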

  • A Variant of kNN

    • Generation of local confidence ranges (for example, at each node i, run kNN for each k ∈ {5, …, 15}); a sketch follows after the table below
    • Apply the integration algorithm to the confidence ranges generated from each node to construct an overlap function

    19

               Class 1         Class 2        …    Class n
    k=5        3/5             2/5            …    0
    k=6        2/6             3/6            …    1/6
    …          …               …              …    …
    k=15       10/15           4/15           …    1/15
    range      {2/6, 10/15}    {4/15, 3/6}    …    {0, 1/6}

    Each entry is a confidence level (the fraction of the k nearest neighbors voting for that class); the final row is the confidence range for each class, i.e., the smallest and largest values in that column.
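A minimal sketch of generating these per-node confidence ranges with a kNN sweep. It assumes scikit-learn's NearestNeighbors is available; the data, the function name `knn_confidence_ranges`, and the query point are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_confidence_ranges(X_train, y_train, x_query, n_classes, ks=range(5, 16)):
    """For one query point, run kNN for each k in ks and record, per class, the
    smallest and largest fraction of neighbors voting for that class."""
    nn = NearestNeighbors(n_neighbors=max(ks)).fit(X_train)
    _, idx = nn.kneighbors([x_query])               # indices of the max(ks) nearest neighbors
    neighbor_labels = y_train[idx[0]]
    fractions = []
    for k in ks:
        counts = np.bincount(neighbor_labels[:k], minlength=n_classes)
        fractions.append(counts / k)                # confidence level for each class at this k
    fractions = np.array(fractions)                 # shape (len(ks), n_classes)
    return list(zip(fractions.min(axis=0), fractions.max(axis=0)))  # per-class {min, max}

# Toy usage: 2-D points from 3 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2)) + np.repeat([[0, 0], [3, 0], [0, 3]], 20, axis=0)
y = np.repeat([0, 1, 2], 20)
print(knn_confidence_ranges(X, y, x_query=[0.2, 0.1], n_classes=3))
```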

  • Example of Interval-based Fusion

    20

             stop 1          stop 2          stop 3          stop 4
             c      acc      c      acc      c      acc      c      acc
    class 1  1      0.2      0.5    0.125    0.75   0.125    1      0.125
    class 2  2.3    0.575    4.55   0.35     0.6    0.1      0.75   0.125
    class 3  0.7    0.175    0.5    0.25     3.3    0.55     3.45   0.575

  • Confusion Matrices of Classification on Military Targets

    21

    Acoustic (75.47%, 81.78%)
    Seismic (85.37%, 89.44%)
    Multi-modality fusion (84.34%)
    Multi-sensor fusion (96.44%)

    Confusion matrix (rows: true class, columns: assigned class; overall accuracy 70/83 ≈ 84.34%, matching the multi-modality fusion figure):

           AAV   DW   HMV
    AAV     29    2     1
    DW       0   18     8
    HMV      0    2    23

  • 22

    Confusion Matrices (figures): Acoustic, Seismic, Multi-modal

  • 23

  • Reference

    • For details regarding majority voting and Naïve Bayes, see

    24

    http://www.cs.rit.edu/~nan2563/combining_classifiers_notes.pdf

