COSC 522 – Machine Learning
Lecture 13 – Fusion
Hairong Qi, Gonzalez Family Professor
Electrical Engineering and Computer Science
University of Tennessee, Knoxville
http://www.eecs.utk.edu/faculty/qi
Email: [email protected]
Course Website: http://web.eecs.utk.edu/~hqi/cosc522/
Transcript
  • COSC 522 – Machine Learning

    Lecture 13 – Fusion

    Hairong Qi, Gonzalez Family Professor
    Electrical Engineering and Computer Science
    University of Tennessee, Knoxville
    http://www.eecs.utk.edu/faculty/qi
    Email: [email protected]

    Course Website: http://web.eecs.utk.edu/~hqi/cosc522/


  • Recap – Decision Rules

    • Supervised learning
      – Bayesian based – Maximum Posterior Probability (MPP): for a given x, if P(ω1|x) > P(ω2|x), then x belongs to class 1, otherwise class 2 (a minimal sketch follows at the end of this slide)
      – Parametric learning
        – Case 1: minimum Euclidean distance (linear machine), Σi = σ²I
        – Case 2: minimum Mahalanobis distance (linear machine), Σi = Σ
        – Case 3: quadratic classifier, Σi arbitrary
        – Estimate Gaussian parameters using MLE
      – Nonparametric learning
        – k-Nearest Neighbor (kNN)
      – Neural network
        – Perceptron
        – BPNN
      – Kernel-based approaches
        – Support Vector Machine (SVM)
      – Decision tree
      – Least-squares based

    • Unsupervised learning
      – k-means
      – Winner-takes-all

    • Supporting pre-/post-processing techniques
      – Normalization
      – Dimensionality reduction (FLD, PCA)
      – Performance evaluation (metrics, confusion matrices, ROC, cross validation)
      – Fusion

    2

    P(ωj | x) = p(x | ωj) P(ωj) / p(x)
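As a quick illustration of the MPP rule above, here is a minimal Python sketch. It assumes 1-D Gaussian class-conditional densities with made-up priors, means, and standard deviations (none of these numbers come from the course), and it assumes SciPy is available.

```python
import numpy as np
from scipy.stats import norm

def mpp_decide(x, priors=(0.5, 0.5), means=(0.0, 2.0), stds=(1.0, 1.0)):
    """Maximum Posterior Probability rule for two classes with hypothetical
    1-D Gaussian class-conditional densities: pick the larger P(w_j | x)."""
    # p(x) cancels in the comparison, so un-normalized posteriors suffice
    posteriors = [norm.pdf(x, m, s) * p for p, m, s in zip(priors, means, stds)]
    return 1 if posteriors[0] > posteriors[1] else 2

print(mpp_decide(0.3))   # closer to the class-1 mean -> 1
print(mpp_decide(1.8))   # closer to the class-2 mean -> 2
```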

  • Syllabus

    3

    COSC 522 – Machine Learning (Fall 2020) Syllabus

    Lecture | Date  | Content                                       | Tests/Assignment                                | Due Date
    1       | 8/20  | Introduction                                  | HW0                                             | 8/25
    2       | 8/25  | Bayesian Decision Theory - MPP                | HW1                                             | 9/1
    3       | 8/27  | Discriminant Function - MD                    | Proj1 - Supervised Learning                     | 9/10
    4       | 9/1   | Parametric Learning - MLE                     |                                                 |
    5       | 9/3   | Non-parametric Learning - kNN                 |                                                 |
    6       | 9/8   | Unsupervised Learning - kmeans                | HW2                                             | 9/15
    7       | 9/10  | Dimensionality Reduction - FLD                | Proj2 - Unsupervised Learning and DR            | 9/24
    8       | 9/15  | Dimensionality Reduction - PCA                |                                                 |
    9       | 9/17  | Linear Regression                             |                                                 |
    10      | 9/22  | Performance Evaluation                        | HW3                                             | 9/29
    11      | 9/24  | Fusion                                        | Proj3 - Regression                              | 10/8
    12      | 9/29  | Midterm Exam                                  | Final Project - Milestone 1: Forming Team       | 9/29
    13      | 10/1  | Gradient Descent                              |                                                 |
    14      | 10/6  | Neural Network - Perceptron                   | HW4                                             | 10/13
    15      | 10/8  | Neural Network - BPNN                         | Proj4 - BPNN                                    | 10/22
    16      | 10/13 | Neural Network - Practices                    | Final Project - Milestone 2: Choosing Topic     | 10/13
    17      | 10/15 | Kernel Methods - SVM                          |                                                 |
    18      | 10/20 | Kernel Methods - SVM                          | HW5                                             | 10/27
    19      | 10/22 | Kernel Methods - SVM                          | Proj5 - SVM & DT                                | 11/5
    20      | 10/27 | Decision Tree                                 | Final Project - Milestone 3: Literature Survey  | 10/27
    21      | 10/29 | Random Forest                                 |                                                 |
    22      | 11/3  |                                               | HW6                                             | 11/10
    23      | 11/5  | From PCA to t-SNE                             |                                                 |
    24      | 11/10 | From Gaussian to Mixture and EM               |                                                 |
    25      | 11/12 | From Supervised/Unsupervised to RL            |                                                 |
    26      | 11/17 | From Classification/Regression to Generation  | Final Project - Milestone 4: Prototype          | 11/17
    27      | 11/19 | From NN to CNN                                |                                                 |
    28      | 11/24 | Final Exam                                    |                                                 |
            | 12/3  | Final Presentation (8:00-10:15)               | Final Project - Report                          | 12/4

  • Questions

    • Rationale with fusion?
    • Different flavors of fusion?
    • The fusion hierarchy
    • What is the cost function for Naïve Bayes?
    • What is the procedure for Naïve Bayes?
    • What is the limitation of Naïve Bayes?
    • What is the procedure of Behavior-Knowledge-Space (BKS)? How does it resolve issues with NB?
    • What is Boosting and how does it differ from committee-based fusion approaches?
    • What is AdaBoost?

    4

  • Motivation

    • Combining classifiers to achieve higher accuracy
      – Combination of multiple classifiers
      – Classifier fusion
      – Mixture of experts
      – Committees of neural networks
      – Consensus aggregation
      – …

    • Reference:
      – L. I. Kuncheva, J. C. Bezdek, R. P. W. Duin, "Decision templates for multiple classifier fusion: an experimental comparison," Pattern Recognition, 34: 299-314, 2001.
      – Y. S. Huang and C. Y. Suen, "A method of combining multiple experts for the recognition of unconstrained handwritten numerals," IEEE Trans. Pattern Anal. Mach. Intell., vol. 17, no. 1, pp. 90–94, Jan. 1995.

    5

    Three heads are better than one.

  • Popular Approaches

    The fusion hierarchy:
    – Data-based fusion (early fusion)
    – Feature-based fusion (middle fusion)
    – Decision-based fusion (late fusion)

    Approaches:
    – Committee-based
      – Majority voting
      – Bootstrap aggregation (Bagging) [Breiman, 1996]
    – Bayesian-based
      – Naïve Bayes combination (NB)
      – Behavior-knowledge space (BKS) [Huang and Suen, 1995]
    – Boosting
      – Adaptive boosting (AdaBoost) [Freund and Schapire, 1996]
    – Interval-based integration

    6

  • Application Example – Civilian Target Recognition

    7


    [Figure: "Compact Cluster Laydown" – scatter plot of target positions, both axes in feet (0–140 ft); the civilian targets include a Suzuki Vitara, Ford 350, Ford 250, and a Harley motorcycle.]

  • Consensus Patterns

    • Unanimity (100%)
    • Simple majority (50% + 1)
    • Plurality (most votes)

    8

  • Example of Majority Voting – Temporal Fusion

    Fuse all the 1-sec sub-interval local processing results corresponding to the same event (an event usually lasts about 10 sec) by majority voting:

    9

    φ = arg max_{c ∈ [1, C]} ωc

    where C is the number of possible local processing results and ωc is the number of occurrences of local output c among the sub-interval results.
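A minimal sketch of this majority-voting rule in Python (the label sequence below is hypothetical, and ties are broken arbitrarily here):

```python
from collections import Counter

def majority_vote(local_decisions):
    """Fuse per-sub-interval local decisions into one event-level label.

    local_decisions: list of class labels, one per 1-sec sub-interval.
    Returns the label with the most votes (ties broken arbitrarily).
    """
    counts = Counter(local_decisions)       # omega_c: occurrences of each local output c
    label, _ = counts.most_common(1)[0]     # arg max over c
    return label

# Example: ten 1-sec decisions for one ~10-sec event
print(majority_vote(["AAV", "AAV", "DW", "AAV", "HMV", "AAV", "AAV", "DW", "AAV", "AAV"]))  # -> "AAV"
```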

  • Questions

    • Rationale with fusion?
    • Different flavors of fusion?
    • The fusion hierarchy
    • What is the cost function for Naïve Bayes?
    • What is the procedure for Naïve Bayes?
    • What is the limitation of Naïve Bayes?
    • What is the procedure of Behavior-Knowledge-Space (BKS)? How does it resolve issues with NB?
    • What is Boosting and how does it differ from committee-based fusion approaches?
    • What is AdaBoost?

    10

  • NB (the independence assumption)

    11

    Confusion matrices of the two classifiers on training data (rows: true class, columns: assigned class). For example, the C1 entry in row DW, column HMV counts samples whose real class is DW but which the classifier calls HMV.

    C1     AAV   DW    HMV
    AAV    894   329   143
    DW      99   411   274
    HMV     98    42   713

    C2     AAV   DW    HMV
    AAV   1304   156    77
    DW     114   437    83
    HMV     13   107   450

    From each confusion matrix, build a lookup table Li (i = 1, 2 for the two classifiers) whose entry in row k, column s is the probability that the true class is k given that Ci assigns it to s:

    L1     AAV   DW    HMV          L2     AAV   DW    HMV
    AAV                             AAV
    DW                              DW
    HMV                             HMV

    The independence assumption then lets the fused decision be formed by probability multiplication across the classifiers.

  • NB – Derivation

    • Assume the classifiers are mutually independent
    • Bayes combination – also known as Naïve Bayes, simple Bayes, or idiot's Bayes
    • Assume
      – L classifiers, i = 1, …, L
      – c classes, k = 1, …, c
      – si: class label given by the ith classifier, i = 1, …, L, with s = {s1, …, sL}

    12

    P(ωk | s) = p(s | ωk) P(ωk) / p(s) = P(ωk) ∏_{i=1..L} p(si | ωk) / p(s)

    P(ωk) = Nk / N

    p(si | ωk) = cm^i_{k,si} / Nk

    ⇒  P(ωk | s) ∝ (1 / Nk^{L−1}) ∏_{i=1..L} cm^i_{k,si}

    where Nk is the number of training samples from class ωk, N is the total number of training samples, and cm^i_{k,si} is the entry of classifier i's confusion matrix counting the class-ωk samples that Ci assigns to class si.
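A minimal sketch of the Naïve Bayes combination rule above. It assumes each classifier's confusion matrix has been estimated on training data; the rule is implemented directly as P(ωk) ∏i p(si|ωk), with the prior taken from the first classifier's row sums (an assumption, since the two example matrices were clearly built on different sample counts).

```python
import numpy as np

def naive_bayes_combine(conf_mats, labels):
    """Fuse L classifier decisions with the Naive Bayes combination rule.

    conf_mats: list of L confusion matrices, conf_mats[i][k, s] = number of
               class-k training samples that classifier i assigned to class s.
    labels:    list of L predicted class indices (s_1, ..., s_L) for one test sample.
    Returns the class index k maximizing P(w_k) * prod_i p(s_i | w_k).
    """
    conf_mats = [np.asarray(cm, dtype=float) for cm in conf_mats]
    prior = conf_mats[0].sum(axis=1)
    prior /= prior.sum()                          # P(w_k) ~ N_k / N
    scores = prior.copy()
    for cm, s in zip(conf_mats, labels):
        scores *= cm[:, s] / cm.sum(axis=1)       # p(s_i | w_k) ~ cm_i[k, s_i] / N_k
    return int(np.argmax(scores))

# Example with the two confusion matrices from the slide (0 = AAV, 1 = DW, 2 = HMV):
C1 = [[894, 329, 143], [99, 411, 274], [98, 42, 713]]
C2 = [[1304, 156, 77], [114, 437, 83], [13, 107, 450]]
print(naive_bayes_combine([C1, C2], labels=[2, 1]))   # C1 says HMV, C2 says DW -> fused: 1 (DW)
```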

  • BKS

    • Majority voting won’t work
    • Behavior-Knowledge Space (BKS) algorithm (Huang & Suen)

    13

    Assumption:
    – 2 classifiers
    – 3 classes
    – 100 samples in the training set

    Then there are 9 possible classification combinations:

    (c1, c2)   samples from each class   fused result
    1, 1       10/3/3                    1
    1, 2       3/0/6                     3
    1, 3       5/4/5                     1, 3
    …          …                         …
    3, 3       0/0/6                     3
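A minimal sketch of the BKS lookup idea, assuming the per-cell class counts have been gathered from training data. The class `BKS`, its fallback rule for empty cells, and the toy counts are illustrative choices, not the authors' exact implementation; tie handling is simplified.

```python
from collections import defaultdict
import numpy as np

class BKS:
    """Behavior-Knowledge Space fusion: index a cell by the tuple of classifier
    outputs and label it with the most-represented true class in that cell."""

    def __init__(self, n_classes):
        self.n_classes = n_classes
        self.cells = defaultdict(lambda: np.zeros(n_classes, dtype=int))

    def fit(self, classifier_outputs, true_labels):
        # classifier_outputs: iterable of length-L label tuples; true_labels: true class per sample
        for outputs, t in zip(classifier_outputs, true_labels):
            self.cells[tuple(outputs)][t] += 1
        return self

    def predict(self, outputs):
        counts = self.cells.get(tuple(outputs))
        if counts is None or counts.sum() == 0:
            return outputs[0]           # empty cell: fall back to the first classifier
        return int(np.argmax(counts))   # best representative class of the cell

# Toy usage mirroring the slide: cell (c1=1, c2=1) saw 10/3/3 samples from classes 1/2/3
bks = BKS(n_classes=3)
bks.cells[(0, 0)] = np.array([10, 3, 3])   # classes indexed from 0 here
print(bks.predict([0, 0]))                  # -> 0 (class 1 on the slide)
```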

  • Questions

    • Rationale with fusion?
    • Different flavors of fusion?
    • The fusion hierarchy
    • What is the cost function for Naïve Bayes?
    • What is the procedure for Naïve Bayes?
    • What is the limitation of Naïve Bayes?
    • What is the procedure of Behavior-Knowledge-Space (BKS)? How does it resolve issues with NB?
    • What is Boosting and how does it differ from committee-based fusion approaches?
    • What is AdaBoost?

    14

  • Boosting

    • Base classifiers are trained in sequence!
    • Base classifiers as weak learners
    • Weighted majority voting to combine classifiers

    15

    (Textbook excerpt: Ch. 14 Combining Models, p. 658)

    Figure 14.1: Schematic illustration of the boosting framework. Each base classifier ym(x) is trained on a weighted form of the training set (blue arrows) in which the weights wn(m) depend on the performance of the previous base classifier ym−1(x) (green arrows). Once all base classifiers have been trained, they are combined to give the final classifier

        YM(x) = sign( Σ_m αm ym(x) )

    AdaBoost

    1. Initialize the data weighting coefficients {wn} by setting wn(1) = 1/N for n = 1, …, N.

    2. For m = 1, …, M:

       (a) Fit a classifier ym(x) to the training data by minimizing the weighted error function

               Jm = Σ_{n=1..N} wn(m) I(ym(xn) ≠ tn)                                 (14.15)

           where I(ym(xn) ≠ tn) is the indicator function and equals 1 when ym(xn) ≠ tn and 0 otherwise.

       (b) Evaluate the quantities

               εm = [ Σ_{n=1..N} wn(m) I(ym(xn) ≠ tn) ] / [ Σ_{n=1..N} wn(m) ]      (14.16)

           and then use these to evaluate

               αm = ln{ (1 − εm) / εm }                                             (14.17)

       (c) Update the data weighting coefficients

               wn(m+1) = wn(m) exp{ αm I(ym(xn) ≠ tn) }                             (14.18)

  • AdaBoost

    • Step 1: Initialize the data weighting coefficients {wn} by setting wn(1) = 1/N, where N is the number of samples
    • Step 2: For each classifier ym(x):
      – (a) Fit a classifier ym(x) to the training data by minimizing the weighted error function
      – (b) Evaluate the quantities
      – (c) Update the data weighting coefficients
    • Step 3: Make predictions using the final model

    16

    (Textbook excerpt: Sec. 14.3 Boosting, p. 659)

    3. Make predictions using the final model, which is given by

           YM(x) = sign( Σ_{m=1..M} αm ym(x) )                                      (14.19)

    We see that the first base classifier y1(x) is trained using weighting coefficients wn(1) that are all equal, which therefore corresponds to the usual procedure for training a single classifier. From (14.18), we see that in subsequent iterations the weighting coefficients wn(m) are increased for data points that are misclassified and decreased for data points that are correctly classified. Successive classifiers are therefore forced to place greater emphasis on points that have been misclassified by previous classifiers, and data points that continue to be misclassified by successive classifiers receive ever greater weight. The quantities εm represent weighted measures of the error rates of each of the base classifiers on the data set. We therefore see that the weighting coefficients αm defined by (14.17) give greater weight to the more accurate classifiers when computing the overall output given by (14.19).

    The AdaBoost algorithm is illustrated in Figure 14.2, using a subset of 30 data points taken from the toy classification data set shown in Figure A.7. Here each base learner consists of a threshold on one of the input variables. This simple classifier corresponds to a form of decision tree known as a 'decision stump', i.e., a decision tree with a single node (Section 14.4). Thus each base learner classifies an input according to whether one of the input features exceeds some threshold and therefore simply partitions the space into two regions separated by a linear decision surface that is parallel to one of the axes.

    14.3.1 Minimizing exponential error

    Boosting was originally motivated using statistical learning theory, leading to upper bounds on the generalization error. However, these bounds turn out to be too loose to have practical value, and the actual performance of boosting is much better than the bounds alone would suggest. Friedman et al. (2000) gave a different and very simple interpretation of boosting in terms of the sequential minimization of an exponential error function.

    Consider the exponential error function defined by

           E = Σ_{n=1..N} exp{ −tn fm(xn) }                                         (14.20)

    where fm(x) is a classifier defined in terms of a linear combination of base classifiers yl(x) of the form

           fm(x) = (1/2) Σ_{l=1..m} αl yl(x)                                        (14.21)

    and tn ∈ {−1, 1} are the training set target values. Our goal is to minimize E with respect to both the weighting coefficients αl and the parameters of the base classifiers yl(x).

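A minimal sketch of the AdaBoost loop in equations (14.15)–(14.19), using decision stumps from scikit-learn as the weak learners. The synthetic dataset, the value M = 20, and the helper names `adaboost`/`predict` are illustrative choices, and scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, t, M=20):
    """AdaBoost with decision stumps, following (14.15)-(14.18).
    X: (N, d) features; t: (N,) targets in {-1, +1}. Returns (stumps, alphas)."""
    N = len(t)
    w = np.full(N, 1.0 / N)                       # step 1: w_n^(1) = 1/N
    stumps, alphas = [], []
    for m in range(M):
        ym = DecisionTreeClassifier(max_depth=1)  # weak learner: a decision stump
        ym.fit(X, t, sample_weight=w)             # (a) minimize the weighted error J_m
        miss = ym.predict(X) != t                 # indicator I(y_m(x_n) != t_n)
        eps = np.sum(w * miss) / np.sum(w)        # (b) weighted error rate, eq. (14.16)
        eps = np.clip(eps, 1e-10, 1 - 1e-10)      # guard against a perfect/useless stump
        alpha = np.log((1 - eps) / eps)           #     classifier weight, eq. (14.17)
        w = w * np.exp(alpha * miss)              # (c) re-weight the data, eq. (14.18)
        stumps.append(ym)
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    """Final model Y_M(x) = sign(sum_m alpha_m y_m(x)), eq. (14.19)."""
    votes = sum(a * ym.predict(X) for ym, a in zip(stumps, alphas))
    return np.sign(votes)

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=0)
t = 2 * y - 1                                     # map labels {0, 1} -> {-1, +1}
stumps, alphas = adaboost(X, t, M=20)
print("training accuracy:", np.mean(predict(stumps, alphas, X) == t))
```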

  • 17

    (Textbook excerpt: Ch. 14 Combining Models, p. 660)

    Figure 14.2: Illustration of boosting in which the base learners consist of simple thresholds applied to one or other of the axes. The panels show the ensemble after m = 1, 2, 3, 6, 10, and 150 base learners have been trained, along with the decision boundary of the most recent base learner (dashed black line) and the combined decision boundary of the ensemble (solid green line). Each data point is depicted by a circle whose radius indicates the weight assigned to that data point when training the most recently added base learner. Thus, for instance, we see that points that are misclassified by the m = 1 base learner are given greater weight when training the m = 2 base learner.

    Instead of doing a global error function minimization, however, we shall suppose that the base classifiers y1(x), …, ym−1(x) are fixed, as are their coefficients α1, …, αm−1, and so we are minimizing only with respect to αm and ym(x). Separating off the contribution from base classifier ym(x), we can then write the error function in the form

           E = Σ_{n=1..N} exp{ −tn fm−1(xn) − (1/2) tn αm ym(xn) }
             = Σ_{n=1..N} wn(m) exp{ −(1/2) tn αm ym(xn) }                          (14.22)

    where the coefficients wn(m) = exp{−tn fm−1(xn)} can be viewed as constants because we are optimizing only αm and ym(x). If we denote by Tm the set of data points that are correctly classified by ym(x), and if we denote the remaining misclassified points by Mm, then we can in turn rewrite the error function…

  • Value-based vs. Interval-based Fusion

    • Interval-based fusion can provide fault tolerance
    • Interval integration – overlap function
      – Assuming each sensor in a cluster measures the same parameters, the integration algorithm constructs a simple function (the overlap function) from the outputs of the sensors in the cluster and can resolve it at different resolutions as required (a sketch follows below)

    18

    [Figure: overlap function O(x) built from weighted sensor intervals w1s1, …, w7s7; the vertical axis (0–3) counts how many weighted intervals cover each point. Crest: the highest, widest peak of the overlap function.]
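A minimal sketch of constructing such an overlap function from sensor intervals and locating its crest. The interval data, the equal-weight default, the sampling resolution, and the simplified crest rule are all assumptions for illustration.

```python
import numpy as np

def overlap_function(intervals, weights=None, resolution=0.01):
    """Build O(x): the (weighted) number of sensor intervals covering each point x.

    intervals: list of (lo, hi) confidence ranges, one per sensor.
    weights:   optional per-sensor weights (defaults to 1 for every sensor).
    Returns (xs, O) sampled on a grid with the given resolution.
    """
    if weights is None:
        weights = [1.0] * len(intervals)
    lo = min(a for a, _ in intervals)
    hi = max(b for _, b in intervals)
    xs = np.arange(lo, hi + resolution, resolution)
    O = np.zeros_like(xs)
    for (a, b), w in zip(intervals, weights):
        O += w * ((xs >= a) & (xs <= b))     # each interval adds its weight where it covers x
    return xs, O

def crest(xs, O):
    """Return the span of the points where O attains its maximum (a simplified crest)."""
    peak = O.max()
    at_peak = xs[O == peak]
    return at_peak.min(), at_peak.max()

# Toy usage: seven sensors reporting overlapping ranges
ivals = [(0.0, 2.0), (0.5, 2.5), (1.0, 3.0), (1.2, 2.2), (4.0, 5.0), (1.5, 2.1), (0.8, 1.9)]
xs, O = overlap_function(ivals)
print("crest region:", crest(xs, O))
```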

  • A Variant of kNN

    • Generation of local confidence ranges (for example, at each node i, run kNN for each k ∈ {5, …, 15}); a sketch follows after the table below
    • Apply the integration algorithm to the confidence ranges generated from each node to construct an overlap function

    19

               Class 1         Class 2        …    Class n
    k=5        3/5             2/5            …    0
    k=6        2/6             3/6            …    1/6
    …          …               …              …    …
    k=15       10/15           4/15           …    1/15
    range      {2/6, 10/15}    {4/15, 3/6}    …    {0, 1/6}

    Each entry is a confidence level (the fraction of the k nearest neighbors voting for that class); the final row is the confidence range for each class, i.e., the smallest and largest values in that column.
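A minimal sketch of generating these per-node confidence ranges with a kNN sweep. It assumes scikit-learn's NearestNeighbors is available; the data, the function name `knn_confidence_ranges`, and the query point are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_confidence_ranges(X_train, y_train, x_query, n_classes, ks=range(5, 16)):
    """For one query point, run kNN for each k in ks and record, per class, the
    smallest and largest fraction of neighbors voting for that class."""
    nn = NearestNeighbors(n_neighbors=max(ks)).fit(X_train)
    _, idx = nn.kneighbors([x_query])               # indices of the max(ks) nearest neighbors
    neighbor_labels = y_train[idx[0]]
    fractions = []
    for k in ks:
        counts = np.bincount(neighbor_labels[:k], minlength=n_classes)
        fractions.append(counts / k)                # confidence level for each class at this k
    fractions = np.array(fractions)                 # shape (len(ks), n_classes)
    return list(zip(fractions.min(axis=0), fractions.max(axis=0)))  # per-class {min, max}

# Toy usage: 2-D points from 3 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2)) + np.repeat([[0, 0], [3, 0], [0, 3]], 20, axis=0)
y = np.repeat([0, 1, 2], 20)
print(knn_confidence_ranges(X, y, x_query=[0.2, 0.1], n_classes=3))
```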

  • Example of Interval-based Fusion

    20

             stop 1          stop 2          stop 3          stop 4
             c      acc      c      acc      c      acc      c      acc
    class 1  1      0.2      0.5    0.125    0.75   0.125    1      0.125
    class 2  2.3    0.575    4.55   0.35     0.6    0.1      0.75   0.125
    class 3  0.7    0.175    0.5    0.25     3.3    0.55     3.45   0.575

  • Confusion Matrices of Classification on Military Targets

    21

    Acoustic (75.47%, 81.78%)
    Seismic (85.37%, 89.44%)
    Multi-modality fusion (84.34%)
    Multi-sensor fusion (96.44%)

    Confusion matrix (rows: true class, columns: assigned class; overall accuracy 70/83 ≈ 84.34%, matching the multi-modality fusion figure):

           AAV   DW   HMV
    AAV     29    2     1
    DW       0   18     8
    HMV      0    2    23

  • 22

    Confusion Matrices (figures): Acoustic, Seismic, Multi-modal

  • 23

  • Reference

    • For details regarding majority voting and Naïve Bayes, see

    24

    http://www.cs.rit.edu/~nan2563/combining_classifiers_notes.pdf

