MULTIPLE CLASSIFIER STRATEGIES FOR DYNAMIC PHYSIOLOGICAL AND
BIOMECHANICAL SIGNALS
by
Mohammad Nikjoo Soukhtabandani
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering, University of Toronto
Copyright © 2012 by Mohammad Nikjoo Soukhtabandani
Abstract
Multiple classifier strategies for dynamic physiological and biomechanical signals
Mohammad Nikjoo Soukhtabandani
Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
2012
Access technologies often deal with the classification of several physiological and biome-
chanical signals. In most previous studies involving access technologies, a single classifier
has been trained. Despite reported success of these single classifiers, classification accu-
racies are often below clinically viable levels. One approach to improve upon the performance
of these classifiers is to utilize state-of-the-art multiple classifier systems (MCS).
Because MCS invoke more than one classifier, more information can be exploited from the
signals, potentially leading to higher classification performance than that achievable with
single classifiers. Moreover, by decreasing the feature space dimensionality of each classi-
fier, the speed of the system can be increased.
MCSs may combine classifiers on three levels: abstract, rank, or measurement level.
Among them, abstract-level MCSs have been the most widely applied in the literature given
the flexibility of the abstract level output, i.e., class labels may be derived from any type
of classifier and outputs from multiple classifiers, each designed within a different context,
can be easily combined.
In this thesis, we develop two new abstract-level MCSs based on "reputation" values of
individual classifiers: the static reputation-based algorithm (SRB) and the dynamic reputation-
based algorithm (DRB). In SRB, each individual classifier is applied to a “validation set”,
which is disjoint from training and test sets, to estimate its reputation value. Then, each
individual classifier is assigned a weight proportional to its reputation value. Finally, the
total decision of the classification system is computed using Bayes rule. We have applied this
method to the problem of dysphagia detection in adults with neurogenic swallowing diffi-
culties. The aim was to discriminate between safe and unsafe swallows. The weighted clas-
sification accuracy exceeded 85% and, because of its high sensitivity, the SRB approach was
deemed suitable for screening purposes. In the next step of this dissertation, I analyzed the
SRB algorithm mathematically and examined its asymptotic behavior. Specifically, I con-
trasted the SRB performance against that of majority voting, the benchmark abstract-level
MCS, in the presence of different types of noise.
In the second phase of this thesis, I exploited the idea of the Dirichlet reputation system
to develop a new MCS method, the dynamic reputation-based algorithm, which is suitable
for the classification of non-stationary signals. In this method, the reputation of each clas-
sifier is updated dynamically whenever a new sample is classified. At any point in time, a
classifier’s reputation reflects the classifier’s performance on both the validation and the test
sets. Therefore, the effect of random high-performance of weak classifiers is appropriately
moderated and likewise, the effect of a poorly performing individual classifier is mitigated
as its reputation value, and hence overall influence on the final decision is diminished. We
applied DRB to the challenging problem of discerning physiological responses from non-
verbal youth with severe disabilities. The promising experimental results encourage further
development of reputation-based multi-classifier systems in the domain of access technol-
ogy research.
Acknowledgements
First and foremost, I would like to thank God for showing me the correct path in life.
During my Ph.D. program, I have worked with a great number of people whose varied
contributions to the research and writing of this thesis deserve special mention. It
is a pleasure to convey my gratitude to them all in my humble acknowledgments.
In the first place, I would like to record my gratitude to my supervisor Dr. Tom Chau
for his supervision, advice, and guidance from the very early stages of this research as well
as for the extraordinary experiences throughout this work. Above all and when it was most
needed, he provided me with unflinching encouragement and support in various ways. He
was not only my supervisor, but also a great role model from whom I have learned a lot,
from research skills to personality. I am indebted to him more than he knows. Dr. Chau, I
am grateful in every possible way and hope to keep up our collaboration in the future.
I would also like to acknowledge my colleagues and friends at the PRISM lab and ECE
department of UofT who were always helpful and made my stay memorable.
My special thanks to the committee members, Dr. Tas Venetsanopoulos and Dr. Hans
Kunov, who gave me valuable comments during my work, read and reviewed this thesis, and
participated in the thesis defence examination. Also, thanks to Dr. Elvino Sousa and Dr.
Bernie Hudgins for reviewing the thesis and participating in the final examination.
Finally, I would like to express my deepest appreciation to my family and parents. Where
would I be without my family? My parents deserve special mention for their unwavering
support and prayers. Although my father is not among us and cannot see my graduation,
his presence is always with me and I’m sure he is very proud of my success. I am much
obliged for the unconditional love and support from my mother who helped me both finan-
cially and emotionally to study in the best schools of Iran and Canada. I am sure she has
been longing for my graduation as much as I have, and I am now happy to deliver her the
good news. Thank you very much and God bless you.
Contents
1 Background and Literature Review 1
1.1 Combining Multiple Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Majority Voting Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Application of MCS to Rehab Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.1 Dysphagia Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.2 Recognition of physiological arousal . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Thesis Rationale and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 Reputation-based neural network combinations 15
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.2 Majority Voting Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Reputation-Based Voting Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Discriminating between Healthy and Abnormal Swallows . . . . . . . . . . . . . . . . 23
2.3.1 Signal Acquisition and pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.3 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4.1 Classification accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4.2 Internal neural network representations . . . . . . . . . . . . . . . . . . . . . . . 32
2.4.3 Training error and convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.4 Local robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4.5 Global robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3 Automatic discrimination between safe and unsafe swallowing using a reputation-
based classifier 41
3.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.1 Automatic classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.1 Data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.2 Data segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.3 Pre-Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.4 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.5 Reputation-Based Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3.6 Classifier evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5.1 Dual versus single axis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5.2 Internal representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.5.3 Reputation-based classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.5.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4 Analyzing the performance of the static reputation-based algorithm 62
4.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3 Large n behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3.2 Modeling the Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3.3 Asymptotic Analysis of SRB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.4 Medium and small n behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4.1 Patterns of Success and Failure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4.2 Probability of error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4.3 Results for Uniform Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4.4 Results for Gaussian Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5 Time-Evolving Reputation-Based Classifier and its Application on Classification of
Physiological Responses of Children With Disabilities 76
5.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.2.1 Dynamic weighted voting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2.2 Dynamic reputation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.3 Mathematical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3.2 Properties of the Dirichlet Distribution . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4 Static Reputation-Based Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.5 Time-Evolving Reputation-Based Algorithm Using the Dirichlet Distribution . 84
5.5.1 An Updating Rule for Reputation Values . . . . . . . . . . . . . . . . . . . . . . . 88
5.6 Clinical Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.6.1 The Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.6.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.6.4 The Effect of the Constant C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.8 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.8.1 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.8.2 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6 Conclusion 102
6.1 Summary of research completed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.2 Summary of contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Bibliography 105
List of Tables
2.1 The average performance of the individual classifiers and their reputation-
based combination. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1 Example of 4 distributions of 10 test samples among different combinations
of classifier outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.1 An example to clarify the updating rule. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.2 The average performance of different classifiers. . . . . . . . . . . . . . . . . . . . . . . 95
5.3 The P values of t-test between the time-evolving classifier, static classifier,
and majority voting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
List of Figures
2.1 The schematic of back-propagation neural network with 3 inputs and 4 hid-
den layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2 The block diagram of the proposed algorithm . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3 The Hinton graph of the weight matrix for the time domain classifier . . . . . . . 33
2.4 The Hinton graph of the weight matrix for the frequency domain classifier
(Peak - peak value of FFT; BW - bandwidth) . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5 The Hinton graph of the weight matrix for the information theoretic classifier. 35
2.6 The training error of the different neural network classifiers versus the num-
ber of training epochs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.7 Sensitivity to varying magnitudes of perturbation of the input vector . . . . . . . 37
2.8 The output decision versus magnitude of input perturbation . . . . . . . . . . . . . 38
2.9 The accuracy of the proposed classifier with increasing magnitude of pertur-
bation in the test set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.10 The accuracy of the proposed classifier as the number of contaminated sam-
ples increase. The p-values arise from a comparison of the accuracy between
the uncontaminated sample and samples with varying levels of contamination. 40
3.1 Data collection set-up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 Example of safe swallowing signals from AP and SI axes . . . . . . . . . . . . . . . . . 48
3.3 Example of unsafe swallowing signals from AP and SI axes . . . . . . . . . . . . . . . 49
3.4 Classification performance for the single-axis (AP, SI) and dual-axis (AP + SI)
reputation-based classifiers. The height of each bar denotes the average of
the respective performance measure while the error bars denote standard de-
viation around the mean. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.5 Parallel axes plot depicting the internal representation of safe (solid line) and
unsafe (dashed line) swallows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.6 Comparison of the densities of classification accuracies by majority voting
(top) and reputation-based classification (bottom) for safe and unsafe swal-
low discrimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.1 Comparison between majority voting and SRB algorithm for different uni-
form noise distributions, for (a) three, and (b) fifteen classifiers. . . . . . . . . . . . 72
4.2 Comparison between majority voting and SRB algorithm for uniform noise
with b = 0.3 and different numbers of classifiers. . . . . . . . . . . . . . . . . . . . . . . 73
4.3 Comparison between majority voting and SRB algorithm for different Gaus-
sian noise powers, for (a) three, and (b) fifteen classifiers respectively. . . . . . . . 74
4.4 Comparison between majority voting and SRB algorithm for a fixed value of
standard deviation of Gaussian noise (σ = 0.3) and different number of clas-
sifiers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.1 An example of the voting scheme in the proposed algorithm. . . . . . . . . . . . . . 87
5.2 Sample physiological signals collected from one of the participants. In each
graph, the horizontal axis denotes time while the vertical axis is signal ampli-
tude in arbitrary units. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.3 The effect of the constant C on the time-evolving reputation-based algorithm. 98
5.4 The evolution of reputation values for the 5 classifiers for different values of C . 99
Chapter 1
Background and Literature Review
1.1 Combining Multiple Classifiers
Multiple Classifier Systems (MCS) have been widely used by machine learning researchers
for solving different pattern classification problems. There are several reasons for using
MCS. Some of these are listed below.
1. Empirical studies have demonstrated that MCS can yield higher classification perfor-
mance than that of the best individual classifier in many applications [89, 135]. For
some cases, there are analytical proofs of multi-classifier performance gain [65, 134].
2. In rehabilitation engineering applications, different representations of a pattern may
be available to the designer, suggesting that several classifiers could be developed
and trained. For instance, in designing access pathways, various physiological sig-
nals such as heart rate, cortical hemodynamics, and EEG signals could serve as dif-
ferent descriptions of the same pattern. If the designer insists on using one classifier
for solving this problem, he or she will be faced with the challenge of determining
appropriate feature normalizations.
3. Even if feature normalization is not required for some applications, combining multiple
classifiers may be preferred over the use of a single classifier. With multiple clas-
sifiers, we can reduce the dimensionality of the input feature vectors and therefore
reduce the complexity of the problem [89].
4. Sometimes more than one training set is available for training classifiers. These sets
could be generated at different times or in different environments. There is likely new
information in each training set. To exploit this new information, different classifiers
should be developed on each data set [50, 19].
5. Some classifiers such as neural networks require random initialization. For each ini-
tialization, the output will be slightly different, even for classifiers trained on the same
data. One way to obtain the best overall average performance is to combine different
versions of these classifiers [56].
6. For a given pattern recognition problem, typically more than one classification algo-
rithm is available, each yielding a different level of classification performance with dif-
ferent data. It is therefore unsurprising that the combination of admissible classifiers
for a given problem often leads to better classification results than that achievable by
any individual classifier alone [74].
Most of the available algorithms for combining classifiers can be categorized into three
groups based on their architecture: parallel, cascade, and hierarchical. In the parallel archi-
tecture [89, 62, 135], each classifier performs independently and sends its output to a judge.
The judge then decides about the class label of the input based on outputs of the individual
classifiers. In the cascade architecture [129, 16], individual classifiers are arranged in a lin-
ear sequence. In this architecture, the output of each classifier is the input of the next one.
The number of possible classes for a specific pattern reduces after this pattern passes each
classifier. To increase the efficiency of the system, computationally cheap classifiers are in-
voked first, followed by more expensive ones [56]. The hierarchical architecture [57, 102] is
similar to a decision tree classifier [24, 97]. The tree nodes in this format are associated with
complex classifiers and a large number of features. This architecture offers high efficiency
and flexibility in exploiting the discriminant power of different types of features. In this
thesis research, we focus primarily on the parallel architecture, given its natural fit to the
target problems described later.
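As a concrete sketch of the parallel architecture, the following toy example (hypothetical classifiers and judge, not an implementation from this thesis) shows independent classifiers each sending an abstract-level output to a judge:

```python
from collections import Counter

def parallel_combine(classifiers, x, judge):
    """Parallel architecture: every classifier decides independently,
    then a judge fuses the abstract-level outputs into one decision."""
    votes = [clf(x) for clf in classifiers]
    return judge(votes)

def majority_judge(votes):
    """A simple judge: pick the most frequent class label."""
    label, _ = Counter(votes).most_common(1)[0]
    return label

# Toy stand-ins for trained classifiers (thresholds chosen arbitrarily).
def clf_a(x):
    return 1 if x > 0.0 else 0

def clf_b(x):
    return 1 if x > -0.5 else 0

def clf_c(x):
    return 1 if x > 0.5 else 0

print(parallel_combine([clf_a, clf_b, clf_c], 0.2, majority_judge))  # -> 1
```

In a cascade architecture, by contrast, each classifier's output would become the next classifier's input rather than being sent to a judge.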
To better understand the different algorithms for combining classifiers we introduce the
following notation.
Assume the time series S is the pre-processed version of an acquired signal. Also let Θ =
{θ1, θ2, ..., θL} be a set of L ≥ 2 classifiers and Ω = {ω1, ω2, ..., ωc} be a set of c ≥ 2 class
labels, where ωj ≠ ωk, ∀j ≠ k. Without loss of generality, Ω ⊂ ℕ. The input of each classifier
is the feature vector x ∈ ℝ^ni, where ni is the dimension of the feature space for the i-th
classifier θi, whose output is a class label ωj, j = 1, ..., c. In other words, the i-th classifier,
i = 1, ..., L, is a functional mapping ℝ^ni → Ω, which for each input x gives an output
θi(x) ∈ Ω. Generally, the classifier function could be linear or non-linear. It is assumed that
for the i-th classifier, a total of di subjects are assigned for training.
Parallel architecture algorithms operate on the output of individual classifiers. Xu et al.
[135] categorized the possible outputs of different classifiers into three groups.
• Abstract: The classifier θi outputs only a unique class label, ωi ∈ Ω. At this level,
no information is available about the certainty of the guessed label. Moreover, no
alternative class label accompanies the guessed label.
• Rank: The output of the classifier θi is a subset of Ω. This subset of class labels is
rank-ordered in terms of the likelihood of the input having one of these labels. This
approach is useful when there are a large number of classes for the problem.
• Measurement: The classifier θi outputs a measurement value for each class, which
represents the degree to which the input belongs to that class.
Among these different types of outputs, the 'measurement' type conveys the richest
information about the labels and the decision. However, this type of output does not
have the generality of the abstract type. For the measurement output, all the classifiers should
be able to output a measurement value for each class label. Furthermore, the measurement
values for all the classifiers should be of the same scale. On the other hand, the abstract
output may arise from any kind of classifier. In fact, several classifiers, each designed within
a different context, could be combined. This generality makes the abstract output the most
broadly applicable type. In this research, our focus is on the abstract-level problem.
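To make the three output levels concrete, here is an illustrative sketch (hypothetical posterior values) showing how one classifier's internal scores could be reported at each level:

```python
def measurement_output(posteriors):
    """Measurement level: a degree-of-membership score for every class."""
    return posteriors

def rank_output(posteriors):
    """Rank level: class labels ordered from most to least likely."""
    return sorted(posteriors, key=posteriors.get, reverse=True)

def abstract_output(posteriors):
    """Abstract level: only the single most likely class label."""
    return max(posteriors, key=posteriors.get)

p = {"omega1": 0.6, "omega2": 0.3, "omega3": 0.1}
print(measurement_output(p))  # {'omega1': 0.6, 'omega2': 0.3, 'omega3': 0.1}
print(rank_output(p))         # ['omega1', 'omega2', 'omega3']
print(abstract_output(p))     # 'omega1'
```

Moving down the levels, information is discarded: the abstract output can always be derived from the rank or measurement output, but not vice versa.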
Most of the MCSs in the literature are of the measurement output type [54, 40, 47]. Some
interesting classifier combinations are proposed by Kittler et al. in [62]. Example rules for
combining classifiers include the product rule, sum rule, median rule, max rule, and min
rule. In all of these methods, individual classifiers use distinct features to estimate the pos-
terior probabilities given the input pattern. Kittler et al. [62] have shown that the sum rule has
better performance on average than the other methods listed above. The reason is that this
method is less sensitive to the errors of individual classifiers. Also Kittler et al. have shown
that the product rule is the most appropriate for error-free independent probabilities. An-
other example of a measurement level method is the adaptive weighting algorithm [123].
In this method, a judge evaluates the decisions of individual classifiers. This algorithm has
better performance than its non-adaptive counterpart. Also, by treating the outputs of in-
dividual classifiers as fuzzy membership values, Cho and Kim [22, 23] proposed different
combination methods based on fuzzy rules. Another approach for combining classifiers at
the measurement level is a mixture of local experts (MLE) [54]. In this adaptive method,
first, the input space is stochastically partitioned into several subspaces. Then, a set of
expert networks and a gating network are trained, with each expert specializing in one subspace.
Although this method works well for problems with disjoint regions, it suffers from bound-
ary effects between regions [116]. To overcome this problem, a hierarchical MLE has been
proposed in [57]. In this method the expectation-maximization (EM) [30] algorithm is used
to learn an online hierarchical mixture model with a softmax gating function. By using a
soft gating function, the algorithm is able to accommodate the boundaries of different
regions. Also, the EM algorithm can learn and detect the dense areas in the feature space and
appropriately train the experts. However, as the EM algorithm only learns the cluster cen-
troids and not intermediary points, it fails to properly classify many non-linearly separable
problems [116].
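The fixed combination rules of Kittler et al. [62] can be sketched in a few lines; this sketch assumes each classifier outputs a posterior vector over the c classes (the example numbers are hypothetical):

```python
import numpy as np

def combine(posteriors, rule="sum"):
    """Fuse per-classifier posterior vectors into a single class decision.

    posteriors: array-like of shape (L classifiers, c classes).
    """
    P = np.asarray(posteriors, dtype=float)
    fused = {
        "sum": P.sum(axis=0),        # sum rule
        "product": P.prod(axis=0),   # product rule
        "max": P.max(axis=0),        # max rule
        "min": P.min(axis=0),        # min rule
        "median": np.median(P, axis=0),  # median rule
    }[rule]
    return int(np.argmax(fused))

# Three classifiers, two classes; the third classifier disagrees strongly.
P = [[0.8, 0.2], [0.7, 0.3], [0.05, 0.95]]
print(combine(P, "sum"))      # 0: column sums are [1.55, 1.45]
print(combine(P, "product"))  # 1: column products are [0.028, 0.057]
```

The example illustrates why the sum rule is less sensitive to individual classifier errors: a single near-zero posterior flips the product rule's decision but not the sum rule's.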
Rank-level MCSs [51, 4] are less popular than measurement-level ones. Among rank-level
MCSs, the algorithm proposed in [51] is one of the most popular. In this method, each clas-
sifier first ranks the class labels for each input pattern. Then the Borda count method [8]
is utilized to find the fused rank table. For a given class, we count within each classifier’s
output, the number of classes ranked below the class in question. The Borda count for a
class is the sum of these counts across classifiers. Although this method is simple to imple-
ment and requires no training, it does not consider the differences in individual classifier
capabilities. All the classifiers in this method are treated equally even if there is some prior
information about the effectiveness of some classifiers compared to others. To solve this
problem, different methods have been proposed to modify the assignment of weights to the
rank scores produced by each classifier. For this matter, two methods are suggested in [49]
and [84], both of which assign weights based on the confidence of the combined decisions
measured by the Borda count method. Suggested measures include a statistic related to the
distribution of sum of ranks and the intervals between the computed Borda counts for a set
of classes [51].
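The Borda count fusion described above can be sketched as follows (toy rankings, not data from the cited studies):

```python
def borda_count(rankings):
    """Fuse rank-level outputs: for each class, add up, across classifiers,
    the number of classes ranked below it; the highest total wins."""
    scores = {}
    for ranking in rankings:
        for position, label in enumerate(ranking):
            # Number of classes ranked below `label` by this classifier.
            scores[label] = scores.get(label, 0) + len(ranking) - 1 - position
    return max(scores, key=scores.get)

rankings = [
    ["A", "B", "C"],  # classifier 1: A most likely
    ["B", "A", "C"],  # classifier 2: B most likely
    ["A", "C", "B"],  # classifier 3: A most likely
]
print(borda_count(rankings))  # 'A' (Borda scores: A=5, B=3, C=1)
```

As the text observes, every classifier's ranking carries equal weight here; the weighted variants in [49] and [84] would scale each classifier's contribution by a confidence term.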
Dempster-Shafer classifiers [30, 111] are another category of rank-level classifiers. These
classifiers are used to combine cumulative evidence or change priors in the presence of
new evidence. Dempster-Shafer theory of evidence is a generalization of Bayesian reasoning
that does not require probabilities for each parameter of interest. Despite its advantages,
it has been shown by Zadeh [137] that this combination rule may generate counter-intuitive
results. Other flaws and drawbacks of this method have been discussed in [130, 131].
The abstract level MCS [89, 135, 102] have some advantages over the MCS designed for
other types of classifier outputs. Abstract-level MCSs work with different types of classifiers,
each developed and trained in a different context. Majority voting [135] is one of the
best known abstract level MCS. This method is simple to implement and computationally
efficient. Because of these advantages, this algorithm has been widely used in solving many
pattern classification problems [89, 69, 79, 100]. Despite its simplicity, majority voting can
significantly improve upon the classification accuracies of individual classifiers [69]. In [64]
this algorithm is compared to many measurement level algorithms such as the sum and
product rules. In these comparisons, the performance of the majority voting algorithm is
almost the same as that of the best measurement level algorithm. Weighted majority voting
is another MCS at the abstract level. Several versions of weighted majority voting have been
proposed in the literature for different applications [135, 115, 43, 52]. Although these algo-
rithms may work properly for a specific application, they are not applicable to biomedical
signals which are mostly non-stationary. Moreover, in most of these methods, the weight of
each individual classifier is determined heuristically. Among different versions of weighted
majority voting, the algorithm proposed by Xu et al. [135] has received the widest acceptance
among researchers. The main reason is that this method uses, instead of a heuristic, a
mathematical approach based on the posterior probability of each class to assign the weight
of each individual classifier. However, as explained in [65], this method
is not able to deal with degenerate cases when one or more beliefs are zero. These cases are
likely to happen in any classification problem.
Another method for combining individual classifiers in the abstract level is boosting
[102]. The main purpose of boosting is to combine several weak classifiers and generate
a strong one. Although this algorithm is somewhat robust against overfitting, it is very sen-
sitive to noisy data, mislabels, and outliers [56]. Moreover, it requires many comparable
classifiers to perform properly. Different adaptive versions of the boosting algorithm have
been proposed to take full advantage of weak classifiers [39, 38, 29, 41]. Most of these algo-
rithms deploy a coordinate-wise gradient descent approach and try to minimize a convex
cost function [83]. However, it has been shown in [82] that these methods are highly
susceptible to random classification noise. Long and Servedio [82] showed that under random
classification noise, these boosting methods are unable to perform better than 50% accuracy.
Because there is no standard form of weighted majority algorithm in the literature, ma-
jority voting is the selected benchmark for abstract level classifiers. Therefore, we further
investigate this algorithm in detail.
1.1.1 Majority Voting Algorithm
In a multi-classifier system, the problem is to arrive at a global decision Θ*(x) = ωj given
a number of local decisions θi(x) ∈ Ω, where the following equalities generally do not hold
[135]:

θ1(x) = θ2(x) = ... = θL(x). (1.1)
In the literature, a classical approach for solving this problem is majority voting [53] [117].
To express this idea mathematically, we define an indicator function
Ii(x, ωj) = I(θi(x) − ωj) = { 1, when θi(x) = ωj,
                             { 0, otherwise.          (1.2)
Now, using (1.2), the majority voting rule can be expressed as follows:

Θ*(x) = { ωmax, if max_ωj IΩ(x, ωj) > L/2,
        { Q, otherwise,                              (1.3)

where ωmax = argmax_ωj IΩ(x, ωj), IΩ(x, ωj) = ∑_{i=1}^{L} Ii(x, ωj), j = 1, ..., c, and Q ∉ Ω
is the rejection state. In other words, given a feature vector,
each classifier votes for a specific class. The class with the majority of votes is selected as the
candidate class. If the candidate class earns more than half of the total votes, it is selected
as the final output of the system. Otherwise, the feature vector is rejected by the system.
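The rule above can be sketched in a few lines of Python (a minimal sketch; the function name is our own, and `None` stands in for the rejection state Q):

```python
import numpy as np

def majority_vote(votes, n_classes):
    """Combine local decisions by simple majority voting.

    votes     : list of class labels, one per classifier (the decisions theta_i(x)).
    n_classes : number of classes c.
    Returns the winning class label, or None (the rejection state Q)
    when no class earns more than half of the L votes.
    """
    L = len(votes)
    counts = np.bincount(votes, minlength=n_classes)  # vote tally for each class j
    winner = int(np.argmax(counts))
    if counts[winner] > L / 2:
        return winner
    return None  # rejection state Q

# Three classifiers, two of them agree on class 1: 2 > 3/2, so class 1 wins.
print(majority_vote([1, 1, 0], n_classes=2))  # -> 1
# Three-way split over three classes: the top count is 1, not > 3/2, so reject.
print(majority_vote([0, 1, 2], n_classes=3))  # -> None
```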
The majority voting algorithm is computationally inexpensive, simple to implement and
applicable to a wide array of classification problems [69] [56]. However, this method suffers
from a major drawback: the decision heuristic is strictly democratic, meaning that the votes
from different classifiers are always equally weighted, regardless of the past performance of
individual classifiers. Therefore, votes of weak classifiers, i.e., classifiers whose performance
only slightly exceeds that of the random classifier, can diminish the overall performance of
the system when they have the majority. To exemplify this issue, consider a classification
system with c = 2 classes, Ω = {ω1, ω2}, and L = 3 classifiers, Θ = {θ1, θ2, θ3}, where two
are weak classifiers with 51% average accuracy while the remaining one is a strong classifier
with 99% average accuracy. Now assume that for a specific feature vector both the weak
classifiers vote for ω1 but the strong classifier votes for ω2. Based on the majority voting
rule, ω1 is preferred over ω2, which is most likely an incorrect classification. It has been
shown in [123] [54] [56] that, in practice, there are various situations in which the majority
vote may be suboptimal.
To overcome this drawback of the majority voting algorithm, [135] introduced the no-
tion of weighted majority voting by incorporating classifier-specific beliefs which reflect
each classifier’s uncertainty about a given test case. However, this algorithm is not appli-
cable when some of the beliefs are equal to zero. This degenerate case can easily happen
whenever there is a large number of classes in the system. Moreover, this approach is ac-
curate only when the prior probabilities of all classes are equal, an assumption that can be
violated in many situations. In a similar vein, [115] proposed a weighted majority vote rule
using a genetic algorithm. In this method, the performance of the whole set of classifiers
is maximized rather than that of each classifier separately. Unfortunately, this algorithm
is computationally costly and provides only a slight improvement in classification perfor-
mance. Motivated by the idea of [135], we propose a novel algorithm for combining several
classifiers based on their individual reputations, numerical weights that reflect each classi-
fier’s past performance. The algorithm will be discussed in detail in Chapter 2.
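To make the notion concrete, a generic weighted vote can be sketched as follows (an illustration only, not the exact rule of [135]; the log-odds weighting and all names are our own choices):

```python
import numpy as np

def weighted_vote(votes, weights, n_classes):
    """Illustrative weighted majority vote: each classifier's vote is
    counted with a per-classifier weight instead of counting all votes equally."""
    scores = np.zeros(n_classes)
    for v, w in zip(votes, weights):
        scores[v] += w
    return int(np.argmax(scores))

# One classic choice of weights is the log-odds of each accuracy p_i.
acc = np.array([0.51, 0.51, 0.99])
w = np.log(acc / (1 - acc))
# Two weak classifiers vote 0, the strong one votes 1; the strong vote now wins,
# whereas unweighted majority voting would have picked class 0.
print(weighted_vote([0, 0, 1], w, n_classes=2))  # -> 1
```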
An MCS is especially useful if the individual classifiers are independent. It has been shown
in [135, 56] that this condition can be satisfied by using different training or feature sets.
If this is not possible, many resampling techniques may be used to satisfy the indepen-
dence requirement. Stacking [133], bagging [14], and boosting [102] are the most popular
resampling techniques. In stacking, the outputs of individual classifiers are used to train the
stacked classifier. The final decision is made based on the outputs of individual classifiers
in addition to that of the stacked classifier. In bagging, several versions of the initial dataset
are created using bootstrapping. The final decision is made using a fixed rule such as av-
eraging. Boosting is another method for generating different training sets. In this method,
the classifiers are trained hierarchically to learn to discriminate more complex regions of
feature space.
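The bagging step described above can be sketched as follows (a minimal sketch; the function name and toy data are our own, and a real system would train one classifier per replicate):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_sets(X, y, n_sets):
    """Create n_sets bootstrap replicates of (X, y) by sampling
    n rows with replacement, as in bagging."""
    n = len(y)
    for _ in range(n_sets):
        idx = rng.integers(0, n, size=n)   # sample indices with replacement
        yield X[idx], y[idx]

# Toy dataset: 6 samples, 2 features.
X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 1, 1, 1])

# Each replicate has the original size but a different composition, so
# classifiers trained on them are approximately independent of one another.
for Xb, yb in bootstrap_sets(X, y, n_sets=3):
    print(Xb.shape, np.bincount(yb, minlength=2))
```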
1.2 Application of MCS to Rehab Engineering
Researchers at Holland Bloorview Kids Rehabilitation Hospital are working to improve ac-
cess technologies for patients who are nonverbal and severely motorically impaired. The
main focus of much of this research is to recognize a specific parameter-of-interest related
to physical or physiological states of the individual. The parameter-of-interest, for instance,
could be the affective state of a person (e.g., happy or sad), communicative intent (e.g.,
preferred versus non-preferred) or the presence or absence of a chronic condition such as
dysphagia. A number of physiological and biomechanical signals are typically recorded
from different participants in response to several different stimuli. Features are typically
extracted from these signals and the parameter-of-interest is estimated using various clas-
sification algorithms. Although many novel ideas have been proposed for recognizing the
parameter-of-interest using single classifiers, the potential of MCS is largely unexplored.
Given the many different signal modalities exploited (e.g., biopotentials, mechanical vibra-
tions, thermography, optical signals), there is reason to believe that the application of MCS
may enhance classification. Specifically, we focus on two different problems: dysphagia
detection and classification of physiological arousal.
1.2.1 Dysphagia Detection
Dysphagia refers to any swallowing disorder [81] and may arise secondary to stroke, mul-
tiple sclerosis, and eosinophilic esophagitis, among other conditions [85]. If unmanaged,
dysphagia may lead to aspiration pneumonia, in which food and liquid enter the airway
and pass into the lungs [32]. The video-fluoroscopic swallowing study (VFSS) is the gold standard
method for dysphagia detection [119]. In this method, clinicians detect dysphagia using
a lateral X-ray video recorded during ingestion of a barium-coated bolus. The health of a
swallow is judged according to criteria such as the depth of airway invasion and the degree
of bolus clearance after the swallow. However, this technique requires expensive and spe-
cialized equipment, ionizing radiation and significant human resources, thereby precluding
its use in the daily monitoring of dysphagia [74].
Swallowing accelerometry has been proposed as a potential adjunct to VFSS. In this
method, the patient wears a dual-axis accelerometer infero-anterior to the thyroid notch.
Swallowing events are automatically extracted from the recorded acceleration signals. Pat-
tern classification methods are then deployed to discriminate between healthy and un-
healthy swallows.
Sejdic et al. [109] proposed a swallow event detection algorithm based on the sequential
fuzzy partitioning of the standard deviation of dual-axes acceleration signals. They achieved
over 90% detection accuracy. Lee et al. [75] proposed a pseudo-automatic neural network
algorithm for swallow event detection in which the standard deviation of the recorded signal
was compared to an empirically-derived threshold. However, in many cases, the segments
selected by this method differed from those derived manually by human experts. Consid-
ering the acceleration signal as a stochastic diffusion, Damouras et al. [27] introduced an
online volatility-based technique for detecting swallowing events from long records of un-
processed dual axis accelerations. They achieved performance comparable to previously
published off-line detection methods.
Using a radial basis classifier with statistical and energetic features, Lee et al. [71] de-
tected aspirations from single-axis cervical acceleration signals with approximately 80%
sensitivity and specificity in a pediatric population. In an adult stroke population, Lee et
al. [74] discriminated healthy and abnormal swallowing using a non-linear classification
paradigm. In [74], a genetic algorithm (GA) was first used to select the most discriminatory
feature combinations; then, both a regular neural network classifier and a probabilistic
neural network classifier were utilized for classification. Despite promising classifica-
tion results, the proposed method was computationally expensive. A genetic algorithm was
used to select up to 12 features from among 444 possible choices. The large number of fea-
tures necessitated a large training set, which is not always available.
In the first two phases of our research, we applied the proposed MCS to the dysphagia
detection problem.
1.2.2 Recognition of physiological arousal
In the automatic recognition of physiological arousal, researchers try to decode the affec-
tive state of a participant using physiological signals recorded from the participant’s body.
The detection of change in an individual’s physiological state is relevant to bedside moni-
toring systems [112], ambulant monitoring of medically fragile individuals [44], monitoring
of patients under high-stress conditions [110], and communication with individuals with
severe and multiple disabilities [10]. To detect physiological arousal of individuals, several
physiological signals are established as indicators including electrical brain activity [7], elec-
trodermal activity (EDA) [10, 13], skin temperature [61] and cardiorespiratory signals [96].
Most of the previous efforts in physiological arousal detection have examined differ-
ent physiological signals independently [6, 36, 125]. Further, the correlation between these
signals is often not considered in the literature [9]. In fact, most of the previous emotion
recognition algorithms neglect the multivariate nature of the problem and invoke a univariate
approximation [18, 59, 93]. Recently, [9] proposed a multivariate algorithm for emotion
recognition. Although this algorithm is able to detect changes in the physiological states
of individuals, it is not able to classify these states. The application of emotion recognition
using MCS is a novel approach which will be considered in our research.
In the third phase of our research, we applied the proposed MCS to the problem of phys-
iological arousal detection.
1.3 Thesis Rationale and Objectives
The classifier limitations introduced in the previous Section motivated the development
and design of a new algorithm for combining several classifiers in the context of rehabilita-
tion engineering. The specific rationales for this research are listed below.
• In the context of rehab engineering, several signals are collected from multiple phys-
ical and physiological sources. Each of these signals has a unique characteristic (e.g.
measurement modality) which distinguishes it from others. Therefore, for each of
these signals, a specific classifier should be trained and developed based on individ-
ual signal properties. This idea steered us towards a multiple classifier approach.
• Although several algorithms have been proposed in the literature for combining clas-
sifiers, none have specifically been designed for rehab engineering applications. In
this context, signals are usually non-stationary and non-Gaussian, which make it dif-
ficult for the designer to train and develop one specific classifier.
• By reviewing all the existing classifier combination algorithms in the literature, we
find none that is able to update its classifier combination rule after it has been trained.
In other words, after designing and training a classifier, the designer has no access
to the classifier and cannot invoke further training or retraining. When signals are
time-varying, samples of the test set may be completely different from samples of the
training set. For example, in the physiological arousal detection problem, the emotion
of a person, and hence the signals representing emotion may evolve over the course
of a measurement period. Therefore, there is a need for a classifier which is able to
update its performance based on recently collected data samples.
• In combining classifiers using the majority voting algorithm, the past performance of
classifiers is not considered. Therefore, the vote of a good classifier is equal to that
of a weak one. This limitation can decrease the overall performance of the system,
especially when there are several weak classifiers in the system.
Based on the above rationales, the main objectives of this dissertation were as follows:
1. Develop a new algorithm for combining several classifiers based on their performance
on a validation set. This algorithm is an extension of the popular majority voting al-
gorithm.
2. Apply the proposed algorithm to the signals gathered from a rehab engineering con-
text and compare its performance to that of various existing algorithms.
3. Propose a new dynamic classifier combination algorithm, which is able to update its
performance based on samples of the test set. This algorithm is the dynamic counter-
part of the reputation-based algorithm.
4. Apply the proposed dynamic algorithm to the scenario in which samples of the test set
are significantly different from those of the training set and compare its performance
to that of its static counterpart. The physiological arousal detection problem is a good
example of this scenario.
5. Characterize the asymptotic performance of the static reputation-based algorithm
under well-circumscribed conditions such as a specific type of classifiers, specific
data distribution, and specific dimensionality.
1.4 Thesis Organization
This dissertation contains six chapters. In Chapter 2, we introduce the static reputation-
based algorithm and compare it to other MCS methods using real dysphagia data. In Chap-
ter 3, we examine the application of the proposed algorithm on dysphagia detection and in-
vestigate different contributions of axial accelerometer signals using the proposed method.
The asymptotic behavior of the static reputation-based algorithm is analyzed in Chapter 4.
In Chapter 5, we propose the dynamic reputation-based algorithm and investigate its appli-
cation to physiological arousal detection. Finally, a summary of contributions is presented
in Chapter 6.
Each main Chapter of this dissertation is a verbatim excerpt of a journal article which
addresses one of the objectives of the thesis. As a result, the introduction section of each
chapter may contain redundant information. It is recommended that the reader skip the
redundant introductory sections.
Chapter 2
Reputation-based neural network
combinations
The contents of this chapter are reproduced from a published book chapter: Nikjoo, M.,
Kushki, A., Lee, J., Steele, C., Chau, T. (2011). Reputation-based neural network combi-
nations, in K. Suzuki (editor), Artificial Neural Networks - Methodological Advances and
Biomedical Applications, InTech Publishing, Chapter 8, pp. 151-170. ISBN 978-953-307-
243-2.
2.1 Introduction
The exercise of combining classifiers is primarily driven by the desire to enhance the per-
formance of a classification system. There may also be problem-specific rationale for inte-
grating several individual classifiers. For example, a designer may have access to different
types of features from the same study participant. For instance, in the human identification
problem, the participant’s voice, face, and handwriting provide different types of features.
In such instances, it may be sensible to train one classifier on each type of feature [56]. In
other situations, there may be multiple training sets, collected at different times or under
slightly different circumstances. Individual classifiers can be trained on each available data
set [135, 56]. Lastly, the demand for classification speed in online applications may neces-
sitate the use of multiple classifiers [56].
Optimal combination of multiple classifiers is a well-studied area. Traditionally, the goal
of these methods is to improve classification accuracy by employing multiple classifiers to
address the complexity and non-uniformity of class boundaries in the feature space. For
example, classifiers with different parameter choices and architectures may be combined
so that each classifier focuses on the subset of the feature space where it performs best.
Well-known examples of these methods include bagging [15] and boosting [5].
Given the universal approximation ability of neural networks such as multilayer percep-
trons and radial basis functions [48], there is theoretical appeal in combining several neural
network classifiers to enhance classification. Indeed, several methods have been developed
for this purpose, including, for example, optimal linear combinations [126], mixture of
experts [55], negative correlation learning [21], and evolving neural network ensembles [136].
In these methods, all base classifiers are generally trained on the same feature space (ei-
ther using the entire training set or subsets of the training set). While these methods have
proven effective in many applications, they are associated with numerical instabilities and
high computational complexity in some cases [5].
Another approach to classifier combination is training the base classifiers on different
feature spaces. This approach is advantageous in combating the undesirable effects associated
with high-dimensional feature spaces (the curse of dimensionality). Moreover, the
feature sets can be chosen to minimize the correlation between the individual base clas-
sifiers to further improve the overall accuracy and generalization power of classification
[126]. These methods are also highly desirable in situations where heterogeneous feature
combinations are used.
Combination of classifiers based on different feature spaces has generally been accomplished
through fixed classification rules. These rules may select one classifier output among all
available outputs (for example, using the minimum or maximum operator), or they may
provide a classification decision based on the collective outputs of all classifiers (for exam-
ple, using the mean, median, or voting operators) [67, 11]. Among the latter, the simplest and
most widely applied rule is the majority vote [53, 117]. Many authors have demonstrated
that classification performance improves beyond that of the single classifier scenario when
multiple classifier decisions are combined via a simple majority vote [135, 86]. Xu et al. [135] further
introduced the notion of weighted majority voting by incorporating classifier-specific be-
liefs which reflect each classifier’s uncertainty about a given test case. Unfortunately, this
method does not deal with the degenerate case when one or more beliefs are zero, a situation
likely to occur in multi-class classification problems. Moreover, this classifier relies
on the training data set to derive belief values for each classifier. This approach, therefore,
risks overfitting the classifier to the training set and a consequent degradation in general-
ization power.
In this chapter, we describe a method for combining several neural network classifiers in
a manner which is computationally inexpensive and does not demand additional training
data beyond that needed to train individual classifiers. Our reputation method will build on
the ideas of [135]. In the following, we present notation that is used throughout the chapter
and detail the majority voting algorithm using this notation. The following presentation is
adapted from [90].
2.1.1 Notation
Assume the time series, S, is the pre-processed version of an acquired signal. Also let Θ =
{θ1, θ2, ..., θL} be a set of L ≥ 2 classifiers and Ω = {ω1, ω2, ..., ωc} be a set of c ≥ 2 class labels,
where ωj ≠ ωk, ∀ j ≠ k. Without loss of generality, Ω ⊂ ℕ. The input of each classifier is the
feature vector x ∈ ℝ^{ni}, where ni is the dimension of the feature space for the i-th classifier θi,
whose output is a class label ωj, j = 1, ..., c. In other words, the i-th classifier, i = 1, ..., L, is a
functional mapping, ℝ^{ni} → Ω, which for each input x gives an output θi(x) ∈ Ω. Generally,
the classifier function could be linear or non-linear. It is assumed that for the i-th classifier,
a total of di subjects are assigned for training. The main goal of combining the
decisions of different classifiers is to increase the accuracy of class selection.
2.1.2 Majority Voting Algorithm
In a multi-classifier system, the problem is to arrive at a global decision Θ*(x) = ωj given a
number of local decisions θi(x) ∈ Ω, where the following equalities generally do not hold [135]:

θ1(x) = θ2(x) = ... = θL(x). (2.1)
In the literature, a classical approach for solving this problem is majority voting [53] [117].
To express this idea mathematically, we define an indicator function
Ii(x, ωj) = I(θi(x) − ωj) = { 1, when θi(x) = ωj,
                             { 0, otherwise.          (2.2)
Now, using (2.2), the majority voting rule can be expressed as follows:

Θ*(x) = { ωmax, if max_ωj IΩ(x, ωj) > L/2,
        { Q, otherwise,                              (2.3)

where ωmax = argmax_ωj IΩ(x, ωj), IΩ(x, ωj) = ∑_{i=1}^{L} Ii(x, ωj), j = 1, ..., c, and Q ∉ Ω
is the rejection state. In other words, given a feature vector,
each classifier votes for a specific class. The class with the majority of votes is selected as the
candidate class. If the candidate class earns more than half of the total votes, it is selected
as the final output of the system. Otherwise, the feature vector is rejected by the system.
The majority voting algorithm is computationally inexpensive, simple to implement and
applicable to a wide array of classification problems [69] [56]. Despite its simplicity, majority
voting can significantly improve upon the classification accuracies of individual classifiers
[69]. However, this method suffers from a major drawback: the decision heuristic is strictly
democratic, meaning that the votes from different classifiers are always equally weighted,
regardless of the past performance of individual classifiers. Therefore, votes of weak classi-
fiers, i.e., classifiers whose performance only slightly exceeds that of the random classifier,
can diminish the overall performance of the system when they have the majority. To exem-
plify this issue, consider a classification system with c = 2 classes, Ω = {ω1, ω2}, and L = 3
classifiers, Θ = {θ1, θ2, θ3}, where two are weak classifiers with 51% average accuracy while
the remaining one is a strong classifier with 99% average accuracy. Now assume that for a
specific feature vector both the weak classifiers vote for ω1 but the strong classifier votes
for ω2. Based on the majority voting rule, ω1 is preferred over ω2, which is most likely an
incorrect classification.
As discussed in [123] [54] [56], in practice, there are various situations in which the
majority vote may be suboptimal. Motivated by the above, we propose a novel algorithm for
combining several classifiers based on their individual reputations, numerical weights that
reflect each classifier’s past performance. The algorithm is detailed in the next section and
again is adapted from [90].
2.2 Reputation-Based Voting Algorithm
To mitigate the risk of the overall decision being unduly influenced by poorly performing
classifiers, we propose a novel fusion algorithm which extends the majority voting concept
to acknowledge the past performance of classifiers. To measure the past performance of the
i-th classifier, we define a measure called reputation, ri ∈ ℝ, 0 ≤ ri ≤ 1. For a classifier with
high performance, the reputation is close to 1 while a weak classifier would have a reputa-
tion value close to 0. For each feature vector, both the majority vote and the reputation of
each classifier contribute to the final global decision. The collection of reputation values for
the L classifiers constitutes the reputation set, R = {r1, r2, ..., rL}. Each classifier is mapped to a
real-valued reputation, ri, namely,
ri = r(θi) = α, i = 1, ..., L, (2.4)

where r : Θ → [0, 1], α ∈ ℝ and 0 ≤ α ≤ 1. To determine the reputation of each classifier,
we utilize a validation set in addition to the classical training and test sets. Specifically, the
performance of the trained classifiers on the validation data determines their reputation
values. Now, we have all the necessary tools to explain the proposed algorithm.
1. For a classification problem with c ≥ 2 classes, we design and develop L ≥ 2 individual
classifiers. The proposed algorithm is especially useful if the individual classifiers are
independent. This condition can be guaranteed by using different training sets or
using various resampling techniques such as bagging [14] and boosting [102]. Unlike
some of the previous work [70] [53], there is no restriction on the number of classifiers
L and this value can be either an odd or an even number. Also, it should be noted here
that, in general, the feature space dimension, n i , of each classifier could be different
and the number of training exemplars, d i , for each classifier could be unique.
2. After training the L classifiers individually, the respective performance of each is eval-
uated using the validation set and a reputation value is assigned to each classifier. The
validation sets are disjoint from the training sets. It is important to note that we
use two different validation sets here. The first is the traditional validation set, which
is used repeatedly until the training of a classifier is deemed satisfactory [33]. The second
validation set, however, is used to calculate the reputation values of the individual classifiers.
The accuracy of each classifier is estimated with the corresponding validation
set and normalized to [0, 1] to generate a reputation value. For instance, a classifier,
θi, with 90% accuracy has a reputation ri = 0.9.
3. Now, for each feature vector, x , in the test set, L decisions are made using L distinctive
classifiers:
Θ(x) = {θ1(x), θ2(x), ..., θL(x)}. (2.5)
4. To arrive at a final decision, we consider the votes of the classifiers with high repu-
tations rather than simply selecting the majority class. First, we sort the reputation
values of the classifiers in descending order,
R* = {r1*, r2*, ..., rL*}, (2.6)

such that r1* ≥ r2* ≥ ... ≥ rL*. Then, using this set, we rank the classifiers to obtain a
reputation-ordered set of classifiers,

Θ* = (θ1*, θ2*, ..., θL*). (2.7)
The first element of this set corresponds to the classifier with the highest reputation.
5. Next, we examine the votes of the first m elements of the reputation-ordered set of
classifiers, with
m = { L/2, if L is even,
    { (L + 1)/2, if L is odd.          (2.8)
If the top m classifiers vote for the same class,ωj , we accept the majority vote and take
ωj as the final decision of the system. However, if the votes of the first m classifiers
are not equal, we consider the classifier’s individual reputations (Step 2).
6. Let p(ωj) be the prior probability of class ωj. As before, Θ(x) = {θ1(x), θ2(x), ..., θL(x)}
represents the local decisions made by different classifiers about the input vector x.
The probability that the combined classifier decision is ωj given the input vector x
and the individual local classifier decisions is denoted as the posterior probability,
p(ωj | θ1(x), θ2(x), ..., θL(x)). (2.9)
Clearly, we should choose the class which maximizes this probability.
argmax_ωj p(ωj | θ1(x), θ2(x), ..., θL(x)), j = 1, ..., c. (2.10)
To estimate the posterior probability, we use Bayes' formula. For notational simplicity
we drop the argument x from the local decisions.
p(ωj | θ1, ..., θL) = p(θ1, ..., θL | ωj) p(ωj) / p(θ1, ..., θL), (2.11)
where p (θ1, ...,θL |ωj ) is the likelihood of ωj and p (θ1, ...,θL) is the evidence factor,
which is estimated using the law of total probability
p(θ1, ..., θL) = ∑_{j=1}^{c} p(θ1, ..., θL | ωj) p(ωj). (2.12)
By assuming that the classifiers are independent of each other, the likelihood can be
written as
p(θ1, ..., θL | ωj) = p(θ1 | ωj) ⋯ p(θL | ωj). (2.13)
Substituting (2.12) into the Bayes rule (2.11) and then replacing the likelihood term
with (2.13), we obtain,
p(ωj | θ1, ..., θL) = [∏_{i=1}^{L} p(θi | ωj)] p(ωj) / ∑_{t=1}^{c} [∏_{i=1}^{L} p(θi | ωt)] p(ωt). (2.14)
To calculate the local likelihood functions, p(θi | ωj), we use the reputation values
calculated in Step 2. When the correct class is ωj and classifier θi classifies x into the
class ωj, i.e., θi(x) = ωj, we can write
p(θi(x) = ωj | ωj) = ri. (2.15)
In other words, p(θi(x) = ωj | ωj) is the probability that the classifier θi correctly classifies
x into class ωj when x actually belongs to this class. This probability is exactly equal
to the reputation of the classifier. On the other hand, when the classifier categorizes x
incorrectly, i.e., θi(x) ≠ ωj given that the correct class is ωj, then

p(θi(x) ≠ ωj | ωj) = 1 − ri. (2.16)
When there is no known priority among classes, we can assume equal prior probabil-
ities. Hence,
p(ω1) = p(ω2) = ... = p(ωc) = 1/c. (2.17)
Finally, for each class, ωj, we compute the a posteriori probabilities as given by (2.14)
using (2.15), (2.16), and (2.17). The class with the highest posterior probability is se-
lected as the final decision of the system and the input subject x is categorized as
belonging to this class.
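Steps 3 through 6 above can be sketched as follows (a minimal Python sketch under the equal-priors assumption (2.17); the function and variable names are our own):

```python
import numpy as np

def reputation_combine(votes, reputations, n_classes):
    """Reputation-based fusion of L abstract-level decisions.

    votes       : array of L class labels theta_i(x).
    reputations : array of L values r_i in [0, 1] (validation accuracies).
    n_classes   : number of classes c, with equal priors p(w_j) = 1/c.
    """
    votes = np.asarray(votes)
    reps = np.asarray(reputations, dtype=float)
    L = len(votes)
    m = L // 2 if L % 2 == 0 else (L + 1) // 2      # Eq. (2.8)

    # Step 5: if the m most reputable classifiers agree, accept their vote.
    order = np.argsort(reps)[::-1]                  # descending reputation
    top_votes = votes[order[:m]]
    if np.all(top_votes == top_votes[0]):
        return int(top_votes[0])

    # Step 6: otherwise pick the class maximising the posterior (2.14), with
    # likelihoods r_i for a matching vote and 1 - r_i otherwise, per (2.15)-(2.16).
    posteriors = np.empty(n_classes)
    for j in range(n_classes):
        lik = np.where(votes == j, reps, 1.0 - reps)
        posteriors[j] = lik.prod() / n_classes      # equal priors 1/c, Eq. (2.17)
    posteriors /= posteriors.sum()                  # evidence factor, Eq. (2.12)
    return int(np.argmax(posteriors))

# Two weak classifiers (r = 0.51) vote for class 0; one strong (r = 0.99) votes
# for class 1.  Plain majority voting would pick class 0; reputation-based
# fusion sides with the strong classifier.
print(reputation_combine([0, 0, 1], [0.51, 0.51, 0.99], n_classes=2))  # -> 1
```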
The advantage of the reputation-based algorithm over the majority voting algorithm lies
in the fact that the former has a higher probability of correct consensus and a faster rate of
convergence to the peak probability of correct classification [90].
2.3 Discriminating between Healthy and Abnormal Swallows
We apply the proposed algorithm to the problem of swallow classification. Specifically, the
problem is to differentiate between safe and unsafe swallowing on the basis of dual-axis
accelerometry [106, 27]. The basic idea is to decompose a high-dimensional classification
problem into three lower-dimensional problems, each with a unique subset of features and a
dedicated classifier. The individual classifier decisions are then melded according to the
proposed reputation algorithm.
2.3.1 Signal Acquisition and pre-processing
In this chapter, we consider a randomly selected subset of 100 healthy swallows and 100
dysphagic swallows from the larger database described in [75]. Briefly, dual-axis swallowing
accelerometry data were collected at 10 kHz from 24 adult patients (mean age 64.8± 18.6
years, 2 males) with dysphagia and 17 non-dysphagic persons (mean age 46.9± 23.8 years,
8 males). Patients provided an average of 17.8 ± 8.8 swallows while non-dysphagic
participants completed 19 swallow sequences each. Both groups swallowed boluses of dif-
ferent consistencies. For more details of the instrumentation and swallowing protocol,
please see [75]. It has been shown in [73] that the majority of power in a swallowing vi-
bration lies below 100 Hz. Therefore, we downsampled all signals to 1 kHz. Then, individual
swallows were segmented according to previously identified swallow onsets and offsets [75].
Each segmented swallow was denoised using a 5-level discrete Daubechies-5 wavelet trans-
form. To remove low-frequency motion artifacts due to bolus intake and participant mo-
tion, each signal was subjected to a 4th-order highpass Butterworth filter with a cutoff frequency of 1 Hz.
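The pre-processing chain above can be sketched as follows (a simplified sketch using SciPy; the 5-level db5 wavelet-denoising stage between the two steps is indicated only as a comment, and the synthetic test signal is our own):

```python
import numpy as np
from scipy.signal import butter, filtfilt, decimate

def preprocess(signal, fs_in=10_000, fs_out=1_000):
    """Downsample 10 kHz -> 1 kHz, then apply a 4th-order highpass
    Butterworth filter at 1 Hz to remove low-frequency motion artifacts.
    The 5-level Daubechies-5 wavelet-denoising step would sit between these
    two stages (e.g. with PyWavelets) and is omitted here for brevity."""
    x = decimate(signal, fs_in // fs_out, ftype="fir")   # anti-aliased downsample
    b, a = butter(N=4, Wn=1.0, btype="highpass", fs=fs_out)
    return filtfilt(b, a, x)                             # zero-phase filtering

# One second of synthetic "acceleration": a slow drift plus a 50 Hz component.
fs = 10_000
t = np.arange(fs) / fs
raw = np.sin(2 * np.pi * 0.2 * t) + 0.1 * np.sin(2 * np.pi * 50 * t)
clean = preprocess(raw)
print(len(clean))  # 1000 samples after 10x downsampling
```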
2.3.2 Feature Extraction
Let S be a pre-processed acceleration time series, S = s1, s2, ..., sn. As in previous accelerom-
etry studies, signal features from three different domains were considered [75, 72]. The dif-
ferent genres of features are summarized below.
1. Time-Domain Features
• Mean: The sample mean of a distribution is an unbiased estimate of the location
of that distribution. The sample mean of the time series S can be calculated as

μs = (1/n) ∑_{i=1}^{n} si. (2.18)
• Variance: The variance of a distribution measures its spread around the mean
and reflects the signal's power. An unbiased estimate of the variance can be obtained as

σs² = (1/(n − 1)) ∑_{i=1}^{n} (si − μs)². (2.19)
• Skewness: The skewness of a distribution is a measure of the symmetry of that
distribution. This factor can be computed as follows:

\gamma_{1,s} = \frac{\frac{1}{n}\sum_{i=1}^{n}(s_i - \mu_s)^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(s_i - \mu_s)^2\right)^{1.5}}. \quad (2.20)
• Kurtosis: This feature reflects the peakedness of a distribution and can be found
as

\gamma_{2,s} = \frac{\frac{1}{n}\sum_{i=1}^{n}(s_i - \mu_s)^4}{\left(\frac{1}{n}\sum_{i=1}^{n}(s_i - \mu_s)^2\right)^{2}}. \quad (2.21)
2. Frequency-Domain Features
• The peak magnitude value of the Fast Fourier Transform (FFT) of the signal S is
used as a feature. All the FFT coefficients are normalized by the length of the
signal, n .
• The centroid frequency of the signal S [106] can be estimated as

\bar{f} = \frac{\int_0^{f_{max}} f\, |F_s(f)|^2\, df}{\int_0^{f_{max}} |F_s(f)|^2\, df}, \quad (2.22)

where F_s(f) is the Fourier transform of the signal S and f_{max} is the Nyquist
frequency (5 kHz in this study).
• The bandwidth of the spectrum can be computed using the following formula:

BW = \sqrt{\frac{\int_0^{f_{max}} (f - \bar{f})^2\, |F_s(f)|^2\, df}{\int_0^{f_{max}} |F_s(f)|^2\, df}}. \quad (2.23)
3. Information-Theory-Based Features
• Entropy Rate [94]: The authors of [94] introduced a method for measuring the entropy
rate of a signal, which quantifies the extent of regularity in that signal. They showed
that this rate is useful for signals with some relationship among consecutive signal
points. [74] used the entropy rate for the classification of healthy and abnormal
swallowing. Following their approach, we first normalized the signal S to zero mean
and unit variance. Then, we quantized the normalized signal into 10 equally spaced
levels, represented by the integers 0 to 9, ranging from the minimum to the maximum
value. The sequences of U consecutive points in the quantized signal, S = s_1, s_2, ..., s_n,
were then coded using the following equation:

a_i = s_{i+U-1} \cdot 10^{U-1} + ... + s_i \cdot 10^0, \quad (2.24)

with i = 1, 2, ..., n-U+1. The coded integers comprised the coding set
A_U = \{a_1, ..., a_{n-U+1}\}. Using the Shannon entropy formula, we estimated the entropy

E(U) = -\sum_{t=0}^{10^U - 1} p_{A_U}(t) \ln p_{A_U}(t), \quad (2.25)

where p_{A_U}(t) represents the probability of observing the value t in A_U, approximated
by the corresponding sample frequency. Then, the entropy rate was normalized using the
following equation:

NE(U) = \frac{E(U) - E(U-1) + E(1) \cdot \beta}{E(1)}, \quad (2.26)

where \beta is the percentage of the coded integers in A_U that occurred only once.
Finally, the regularity index \rho \in [0, 1] was obtained as

\rho = 1 - \min_U NE(U), \quad (2.27)

where a value of \rho close to 0 signifies maximum randomness while \rho close to 1
indicates maximum regularity.
• Memory [72]: To calculate the memory of the signal, its autocorrelation function
was computed from zero to the maximum time lag. Then, it was normalized
such that the autocorrelation at zero lag was unity. The memory was estimated
as the time duration from zero to the point where the autocorrelation decays to
1/e of its zero lag value.
• Lempel-Ziv (L-Z) complexity [77]: The L-Z complexity measures the predictability
of a signal. To compute the L-Z complexity of the signal S, the minimum and maximum
values of the signal points were first calculated, and the signal was then quantized
into 100 equally spaced levels between these values. The quantized signal,
B_1^n = b_1, b_2, ..., b_n, was then decomposed into T different blocks,
B_1^n = \psi_1 \psi_2 ... \psi_T. A block \psi was defined as

\Psi = B_j^\ell = b_j, b_{j+1}, ..., b_\ell, \quad 1 \le j \le \ell \le n. \quad (2.28)
The values of the blocks can be calculated as follows:

\psi_1 = b_1, \qquad \psi_{m+1} = B_{h_m+1}^{h_{m+1}}, \quad m \ge 1, \quad (2.29)

where h_m is the ending index of \psi_m, such that \psi_{m+1} is a unique sequence of
minimal length within the sequence B_1^{h_{m+1}-1}. Finally, the normalized L-Z
complexity was calculated as

LZ = \frac{T \log_{100} n}{n}. \quad (2.30)
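The features above can be sketched in plain NumPy as follows (an illustrative implementation, not the thesis code: the function names are ours, U is capped at 4 to keep the tuple-coding alphabet small, and the memory and Lempel-Ziv features are omitted for brevity):

```python
import numpy as np

def time_features(s):
    """Eqs. (2.18)-(2.21): sample mean, unbiased variance, skewness, kurtosis."""
    mu = s.mean()
    var = s.var(ddof=1)                      # unbiased: divides by n - 1
    m2 = ((s - mu) ** 2).mean()              # biased central moments, as used
    m3 = ((s - mu) ** 3).mean()              # in the skewness/kurtosis formulas
    m4 = ((s - mu) ** 4).mean()
    return mu, var, m3 / m2 ** 1.5, m4 / m2 ** 2

def freq_features(s, fs=1000.0):
    """Eqs. (2.22)-(2.23): normalized FFT peak, centroid frequency, bandwidth."""
    n = len(s)
    F = np.fft.rfft(s) / n                   # FFT coefficients normalized by n
    P = np.abs(F) ** 2                       # power spectrum |Fs(f)|^2
    f = np.fft.rfftfreq(n, d=1.0 / fs)
    centroid = (f * P).sum() / P.sum()
    bw = np.sqrt(((f - centroid) ** 2 * P).sum() / P.sum())
    return np.abs(F).max(), centroid, bw

def regularity_index(s, U_max=4):
    """Eqs. (2.24)-(2.27): entropy-rate-based regularity index rho."""
    s = (s - s.mean()) / s.std()             # zero mean, unit variance
    q = np.minimum((10 * (s - s.min()) / np.ptp(s)).astype(int), 9)  # 10 levels
    def entropy_and_beta(U):
        # a_i = q[i+U-1]*10^(U-1) + ... + q[i]*10^0 over all U-tuples
        codes = sum(q[k:len(q) - U + 1 + k] * 10 ** k for k in range(U))
        _, counts = np.unique(codes, return_counts=True)
        p = counts / counts.sum()
        beta = (counts == 1).sum() / counts.sum()  # fraction occurring once
        return -(p * np.log(p)).sum(), beta
    E1, _ = entropy_and_beta(1)
    NE = []
    for U in range(2, U_max + 1):
        EU, beta = entropy_and_beta(U)
        EU1, _ = entropy_and_beta(U - 1)
        NE.append((EU - EU1 + E1 * beta) / E1)   # Eq. (2.26)
    return 1.0 - min(NE)                          # Eq. (2.27)
```

On a pure 50 Hz tone, the centroid lands on 50 Hz and the regularity index is close to 1; on white noise it drops toward 0, matching the interpretation of ρ above.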
2.3.3 Classification
We trained 3 separate back-propagation neural network (NN) classifiers [33], one for each
genre of signal feature outlined above. Hence, the feature space dimensionalities for the
classifiers were 4 (NN with time features), 3 (NN with frequency features) and 3 (NN with
information-theoretic features). Each neural network classifier had 4 hidden units and
1 output, with as many inputs as the dimensionality of its feature space. Figure 2.1 shows
the schematic of one NN classifier used in our work. Although it is possible to invoke
different classifiers for each genre of signal feature, we utilized the same type of
classifier here to facilitate the evaluation of local decisions. The use of different
feature sets for each classifier promotes independence among the classifiers [135].
Figure 2.2 is a block diagram of our proposed algorithm. First, the three small neural
networks classify their inputs independently. Then, using the outputs of these classifiers
and their respective reputation values, the reputation-based algorithm determines the
final label of the input.
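To make the setup concrete, one such small network can be sketched in plain NumPy (a hypothetical `SmallNet` class standing in for a standard back-propagation implementation; the learning rate, initialization, and epoch count here are illustrative, not the thesis's settings):

```python
import numpy as np

class SmallNet:
    """Minimal one-hidden-layer back-propagation network with sigmoid units:
    a few inputs, 4 hidden units, and 1 output, trained by online gradient
    descent on the squared error."""
    def __init__(self, n_in, n_hidden=4, lr=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.5, (n_hidden, n_in + 1))  # +1 bias column
        self.W2 = rng.normal(0, 0.5, (1, n_hidden + 1))
        self.lr = lr

    @staticmethod
    def _sig(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(self, x):
        h = self._sig(self.W1 @ np.append(1.0, x))     # hidden activations
        y = self._sig(self.W2 @ np.append(1.0, h))[0]  # scalar output
        return h, y

    def fit(self, X, t, epochs=500):
        for _ in range(epochs):
            for x, target in zip(X, t):
                h, y = self.forward(x)
                d_out = (y - target) * y * (1 - y)      # output-layer delta
                d_hid = (self.W2[0, 1:] * d_out) * h * (1 - h)
                self.W2 -= self.lr * d_out * np.append(1.0, h)
                self.W1 -= self.lr * np.outer(d_hid, np.append(1.0, x))
        return self

    def predict(self, x):
        return int(self.forward(x)[1] > 0.5)
```

Trained on a small, linearly separable toy problem this converges in a few hundred epochs; the thesis's networks were trained on the 4-, 3-, and 3-dimensional feature vectors described above.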
Classifier accuracy was estimated via 10-fold cross-validation with a 90-10 split. However,
unlike classical cross-validation, we further segmented the 'training' set into an actual
training set and a validation set. In other words, in each fold, 160 swallows (80%) were used
for training, 20 (10%) for validation and 20 (10%) reserved for testing.

Figure 2.1: The schematic of a back-propagation neural network with 3 inputs and 4 hidden units

Figure 2.2: The block diagram of the proposed algorithm

Among the 20 swallows of the validation set, 10 were used as a traditional validation set and 10 were used for
computation of the reputation values. After training, classifier reputations were estimated
using this second validation set. Classifiers were then ranked according to their reputation
values. Without loss of generality, assume r1 ≥ r2 ≥ r3. If θ1 and θ2 cast the same vote about
a test swallow, their common decision was accepted as the final classification. However, if
they voted differently, the a posteriori probability of each class was computed using (5.8)
and the maximum a posteriori probability rule was applied to select the final classification.
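The decision rule just described can be sketched as follows (the fallback branch here pools reputation as a class score, a hypothetical stand-in for the a posteriori rule of Eq. (5.8), which is not restated in this chapter):

```python
import numpy as np

def reputation_combine(votes, reputations):
    """Combine abstract-level classifier outputs by reputation: if the two
    most reputable classifiers agree, accept their common vote; otherwise
    pick the class with the largest total reputation behind it (a stand-in
    for the maximum a posteriori rule of Eq. (5.8))."""
    order = np.argsort(reputations)[::-1]       # most reputable first
    if votes[order[0]] == votes[order[1]]:
        return votes[order[0]]
    score = {}
    for vote, rep in zip(votes, reputations):
        score[vote] = score.get(vote, 0.0) + rep
    return max(score, key=score.get)
```

For instance, with votes (1, 2, 1) and reputations (0.9, 0.8, 0.5), the top two classifiers disagree and the fallback sides with class 1, which carries 0.9 + 0.5 of reputation against 0.8.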
To better understand the difference between the multiple classifier system and a single,
all-encompassing classifier, we also trained a multilayer neural network via back-propagation
with all 10 features, i.e., using the collective inputs of all three smaller classifiers.
This all-encompassing classifier, henceforth referred to as the grand classifier, also had
4 hidden units. We also statistically compared the accuracies of the individual classifiers
against those of a majority-vote classifier combination and a reputation-based classifier
combination (Section 2.4.1).
To understand the knowledge representation of the individual small classifiers, we plot-
ted Hinton diagrams for the input-to-hidden unit weight matrices (Section 2.4.2). Subse-
quently, we qualitatively compared the training performance of the small and grand classi-
fiers to ascertain potential benefits of training a collection of small classifiers (Section 2.4.3).
Through a systematic perturbation analysis, we quantified the local robustness of the reputation-
based neural network combination (Section 2.4.4). In particular, we qualitatively examined
the change in the classifier output and accuracy as the magnitude of the input perturba-
tion increased. Finally, we estimated the breakdown point of the reputation-based neural
network combination using an increasing proportion of contaminants in the overall data set
(Section 2.4.5).
Table 2.1: The average performance of the individual classifiers and their reputation-based combination.
Classifier Average Performance (%)
Time domain NN 67.5±12.5
Frequency domain NN 69.5±10.3
Information theoretic NN 65.5±11.2
Grand classifier 70.0±8.5
Majority vote 74.5±8.9
Combined classifier decision 78.0±8.2
2.4 Results and Discussion
2.4.1 Classification accuracy
Table 2.1 tabulates the local and global classification results. On average, the frequency
domain classifier appears best among the individual NNs while the information-theoretic NN
fares worst. It is also clear from this table that combining the local decisions of the
classifiers using the reputation-based algorithm dramatically increases the overall
performance of the system. The accuracy of the grand classifier is statistically
indistinguishable from those of the small classifiers; however, the grand classifier is
more difficult and time-consuming to train. Hence, there appears to be no justification
for considering an all-encompassing classifier in this application. Collectively, these
results indicate that there is merit in combining neural network
classifiers in this problem domain. The accuracy of the majority vote neural network combi-
nation did not significantly differ from that of the individual (p > 0.11) and grand classifiers
(p = 0.16). On the other hand, the reputation-based combination led to further improve-
ment in accuracy over the time domain (p = 0.04) and information-theoretic (p = 0.05)
classifiers, but did not significantly surpass the grand (p = 0.09) and frequency domain net-
works (p = 0.09).
The reputation-based scheme yields accuracies better than those reported in [74] (74.7%).
Moreover, in [74], the entire database was required and the maximum feature space dimen-
sion was 12. Here, we only considered a fraction of the database and no classifier had a
feature space dimensionality greater than 4. Therefore, our system offers the advantages of
computational efficiency and less stringent demands on training data.
2.4.2 Internal neural network representations
Figures 2.3, 2.4 and 2.5 are the Hinton graphs for the input to hidden layer weight matrices
for the time, frequency, and information theoretic domain neural networks, respectively.
In these figures, the weight matrix of each classifier is represented using a grid of squares.
The area of each square represents the weight magnitude, while the shading reveals the
sign. Shaded squares signify negative weights. The first column denotes the hidden unit
biases while the subsequent columns are the weights on the connections projecting from
each input unit. For instance, the frequency domain neural network uses 3 input features
and 1 bias, resulting in 4 columns. Given that there are 4 hidden units, the weight matrix is
represented as a 4×4 grid.
In Figure 2.3, we see that the first neuron has a very large negative weight for kurtosis
and a sizable one for variance. This suggests that this neuron represents swallows with low
variance and platykurtic distributions. The second neuron seems to primarily represent
swallows with leptokurtic distributions, given its positive weight on the kurtosis input.
By the same token, the third neuron appears to internally denote swallows with large positive
means and leptokurtic distributions. Finally, the last neuron captures swallows primarily
with high variance. Overall, the strongest weights are found on the variance and kurtosis
features, suggesting that they play the most important role in distinguishing between healthy
and unhealthy swallows in our sample. Resonating with our findings here, [71] identified a
dispersion-type measure as discriminatory between healthy and unhealthy swallows. Similarly,
[75] determined that the kurtosis of swallow accelerometry signals tended to be axial-
Figure 2.3: The Hinton graph of the weight matrix for the time domain classifier
specific and thus potentially discriminatory between different types of swallows.
Moving on to Figure 2.4, we notice that neurons one and two seem to have captured
inverse dependencies between the spectral centroid and bandwidth features. While neuron one
embodies swallows with a lower spectral centroid but broad bandwidth, neuron two captures
high-frequency, narrow-band swallows. The peak FFT feature seems to be the least
important spectral information, which is consoling in some sense, as this suggests that
decisions are not based upon signal strength, which may vary greatly across swallows
regardless of swallowing health.
In the information theoretic neural network (Figure 2.5), we find that the memory fea-
ture seems to have a distributed representation across the 4 neurons, with three favoring
weak memory or rapidly decaying autocorrelations. Neuron one almost uniformly consid-
ers all three information theoretic features, specifically epitomizing swallows with low com-
plexity, low entropy rate and minimal memory. This characterization might reflect ’noisy’
Figure 2.4: The Hinton graph of the weight matrix for the frequency domain classifier (Peak - peak value of FFT; BW - bandwidth)
swallows. Interestingly, neuron three focuses on positive complexity and memory. We can
interpret this neuron as representing swallows which have strong predictability and hence
longer memory. In short, it appears that each individual neural network has internally rep-
resented some unique flavors of swallows. This apportioned representation across neural
networks suggests that there is indeed sound rationale to combine classifiers, in order to
comprehensively characterize the diversity of swallows.
2.4.3 Training error and convergence
Figure 2.6 pits the training performance of the small classifiers against that of the grand
classifier as the number of training epochs increases. After 12 epochs, the small classifiers have lower
normalized mean squared errors than the grand classifier. This is one of the main advan-
tages of using a multiple classifier system over a single all encompassing classifier; the rate
Bias L−Z Entropy Memory
1
2
3
4
Input
Neu
ron
Figure 2.5: The Hinton graph of the weight matrix for the information theoretic classifier.
of convergence during training is often faster with smaller classifiers, i.e., those with fewer
input features, and in many cases lower training error can be achieved.
2.4.4 Local robustness
To gauge one aspect of the local robustness of the proposed neural network combination,
we measured the sensitivity of the system to a local perturbation of the input. Recall that the
reputation algorithm yields a class label rather than a continuous number. Thus, to facilitate
sensitivity analysis, we calculated the reputation-weighted average of the outputs of the
small classifiers for a specific input. For semantic convenience, we will just call this the
reputation-weighted output. The unperturbed input sample was the mean vector of all the
features in the test set. Perturbed inputs were created by adding varying degrees of positive
and negative offsets to every feature of the mean vector. The sensitivity of the system to a
given perturbation was defined as the difference between the reputation-weighted output
Figure 2.6: The training error of the different neural network classifiers versus the number of training epochs
for the unperturbed input and that for the perturbed input. At each iteration, the amount
of perturbation was proportional to the range of the features in the test set. For instance, in
the first iteration, 5% of the range of each feature in the test set was added to the respective
feature. Figure 2.7 shows the relative sensitivity of the system versus the magnitude of input
perturbation. Between ±10%, the relative sensitivity is less than 10% of the output value,
suggesting that the reputation-based classifier, while not robust in the strict statistical
sense, can tolerate a modest level of perturbation at the inputs.
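The perturbation procedure can be sketched generically as follows (the names are ours, and the `predict_fns` are assumed to return a continuous output, as the reputation-weighted output requires):

```python
import numpy as np

def sensitivity_curve(predict_fns, reputations, x0, feat_range, steps):
    """For each perturbation step p (a fraction of the feature range), add
    p * feat_range to every feature of the unperturbed input x0 and report
    the change in the reputation-weighted output of the small classifiers."""
    reps = np.asarray(reputations, dtype=float)
    def rw_output(x):
        outs = np.array([f(x) for f in predict_fns])
        return (reps * outs).sum() / reps.sum()  # reputation-weighted output
    base = rw_output(x0)
    return [(p, rw_output(x0 + p * feat_range) - base) for p in steps]
```

With a single linear stand-in classifier, a +5% perturbation of every feature shifts the reputation-weighted output by exactly 0.05, which is the kind of difference plotted in Figure 2.7.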
To examine the effect of a local perturbation on the final decision of our algorithm, we
again added/subtracted noise to the mean input vector and computed the output label us-
ing the reputation-based algorithm. For the present problem, the output was binary and
without loss of generality, denoted arbitrarily as ’1’ or ’2’. The unperturbed input belonged
to class 1. As portrayed in Figure 2.8, the decision of the proposed algorithm is robust to
Figure 2.7: Sensitivity to varying magnitudes of perturbation of the input vector
negative perturbations up to 20% of the range of the features and positive perturbations up
to 10% of the range of the features. However, for a positive perturbation higher than 10%
of the range of the features, the reputation algorithm misclassifies the input. For practical
purposes, this means that the reputation-based neural network combination can tolerate
a simultaneous 10% perturbation in all its input features before making a wrong decision
in the binary classification case. In the specific domain of dual-axis accelerometry, head
movement induces high magnitude perturbations [107] which, according to our analysis
here, will likely cause classification errors.
We also investigated the effect of local perturbations on the accuracy of the proposed
algorithm. We perturbed all 20 samples in the test set. The amount of perturbation ranged
from 0 to 100% of the range of the features in the test set. For each perturbation value, we
calculated the accuracy of the proposed algorithm using the perturbed test set. Figure 2.9
illustrates the effect of varying levels of perturbation on the accuracy of the proposed al-
gorithm. Based on this figure, the accuracy of the proposed algorithm decreased with in-
Figure 2.8: The output decision versus magnitude of input perturbation
creasing magnitude of perturbation in the test set. The initial accuracy of the proposed
algorithm, for the unperturbed test set, was 78% and decreased to 50% for full-range (100%
of the range of the features) perturbations. It is interesting to note that the decay in accu-
racy is quite steep for the first 20%, indicating that accuracy will take a hit with any non-zero
amount of perturbation. Intuitively, this finding makes sense as the resemblance between
test and training data diminishes as the magnitude of perturbation increases.
2.4.5 Global robustness
The sensitivity curve only offers local information about the robustness of the classifier. To
measure the robustness of the system globally, we estimated the breakdown point for the
proposed algorithm. To this end, we substituted some of the feature vectors among the
200 initial samples with contaminated versions. Contaminated feature vectors were created
by sampling from a normal distribution with mean equal to that of the feature vector but
Figure 2.9: The accuracy of the proposed classifier with increasing magnitude of perturbation in the test set
with 3 times the standard deviation. The number of contaminants ranged from 20 to 100,
i.e., 10 to 50% of the original data set. Using 10-fold cross validation, we divided the sam-
ples into 3 sets: training, testing, and validation. Therefore, it was possible that contami-
nations appeared in any or all of the training, testing, and validation sets. Figure 2.10 plots
the accuracy of the proposed algorithm for different numbers of contaminated samples.
Error-bars in this figure depict the standard deviation of each accuracy obtained from the
cross-validation. To estimate the breakdown point of this system, we used the rank-sum
test to detect a significant difference between accuracies with and without varying levels
of contamination. At the 5% significance level, we identified the first significant departure
from the uncontaminated distribution of accuracy at 80 contaminated samples (p = 0.043).
Given that there were 200 samples, the breakdown point was thus identified as 80/200 = 0.4.
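A sketch of the contamination step and the rank-sum comparison (assuming SciPy; taking the mean and standard deviation per feature vector is our reading of the text, and the function names are ours):

```python
import numpy as np
from scipy.stats import ranksums

def contaminate(X, n_contam, seed=0):
    """Replace n_contam randomly chosen feature vectors of X with draws from
    a normal distribution whose mean matches the vector's but whose standard
    deviation is 3 times larger."""
    rng = np.random.default_rng(seed)
    Xc = X.copy()
    for i in rng.choice(len(X), size=n_contam, replace=False):
        Xc[i] = rng.normal(X[i].mean(), 3.0 * X[i].std(), size=X.shape[1])
    return Xc

def significantly_degraded(acc_clean, acc_contam, alpha=0.05):
    """Wilcoxon rank-sum test on the per-fold accuracies, at the 5% level."""
    return ranksums(acc_clean, acc_contam).pvalue < alpha
```

The breakdown point is then the smallest contamination fraction at which `significantly_degraded` first returns True across the cross-validation accuracies.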
Figure 2.10: The accuracy of the proposed classifier as the number of contaminated samples increases. The p-values arise from a comparison of the accuracy between the uncontaminated sample and samples with varying levels of contamination.
2.5 Conclusion
We have presented the formulation of a reputation-based neural network combination.
The method was demonstrated using a dysphagia dataset. We noted that generally the
reputation-based classifier either achieved higher accuracies than single classifiers or ex-
hibited comparable accuracies to the best single classifiers. Interpreting the weight ma-
trices of the neural networks, we observed that many different aspects of time, frequency
and information-theoretic characteristics of swallows were encoded. Finally, we empiri-
cally characterized the local and global robustness of the reputation-based classifier, show-
ing that there exists a certain tolerance (approximately 10% of the range of a feature value)
to input perturbations. However, large magnitude perturbations, such as those observed in
head movement, would likely lead to erroneous classification of the swallowing accelerom-
etry input feature vector.
Chapter 3
Automatic discrimination between safe
and unsafe swallowing using a
reputation-based classifier
The contents of this chapter are reproduced from a journal article accepted for publica-
tion: Nikjoo, M., Steele, C., Lee, J., Sejdic, E., Chau, T., "Discriminating between healthy
and abnormal swallowing by combining classifiers on the basis of reputation", Biomedical
Engineering Online, 2011.
3.1 Abstract
Swallowing accelerometry has been suggested as a potential non-invasive tool for day-to-
day assessment of swallowing function in neurogenic dysphagia. Various vibratory signal
features and complementary measurement modalities have been put forth in the literature
for the potential discrimination between safe and unsafe swallowing. To date, automatic
classification of swallowing accelerometry has exclusively involved a single-axis of vibra-
tion although a second axis is known to contain additional information about the nature of
the swallow. Furthermore, the only published attempt at automatic classification in adult
patients has been based on a small sample of swallowing vibrations. In this Chapter, a large
corpus of dual-axis accelerometric signals were collected from older adults referred to vide-
ofluoroscopic examination on the suspicion of dysphagia. We invoked a reputation-based
classifier combination to automatically categorize the dual-axis accelerometric signals into
safe and unsafe swallows, as labeled via videofluoroscopic review. With selected time, fre-
quency and information theoretic features, the reputation-based algorithm distinguished
between safe and unsafe swallowing with promising accuracy (80.48 ± 5.0%) and provided
interesting insight into the accelerometric differences between the two classes of swallows.
Given its computational efficiency, reputation-based classification of dual-axis accelerom-
etry may be a viable option for point-of-care swallow assessment where turnkey clinical
informatics are desired.
3.2 Introduction
Dysphagia refers to any swallowing disorder [81] and may arise secondary to stroke, mul-
tiple sclerosis, and eosinophilic esophagitis, among many other conditions [85]. If un-
managed, dysphagia may lead to aspiration pneumonia in which food and liquid enter
the airway and into lungs [32]. The video-fluoroscopic swallowing study (VFSS) is the gold
standard method for dysphagia detection [119]. This method entails a lateral X-ray video
recorded during ingestion of a barium-coated bolus. The health of a swallow is then judged
by clinical experts according to criteria such as the depth of airway invasion and the degree
of bolus clearance after the swallow. However, this technique requires expensive and spe-
cialized equipment, ionizing radiation and significant human resources, thereby preclud-
ing its use in the daily monitoring of dysphagia [74]. Swallowing accelerometry has been
proposed as a potential adjunct to VFSS. In this method, the patient wears a dual-axis ac-
celerometer infero-anterior to the thyroid notch. Swallowing events are automatically ex-
tracted from the recorded acceleration signals and pattern classification methods are then
deployed to discriminate between healthy and unhealthy swallows. It is important to dis-
tinguish between swallowing vibrations and swallowing sounds, based on current evidence
in the literature. Swallowing sounds have been largely attributed to pharyngeal reverbera-
tions arising from opening and closing of valves (oropharyngeal, laryngeal and esophageal
valves), action of various pumps (pharyngeal, esophageal, and respiratory pumps) and vi-
brations of the vocal tract [25]. In contrast, in swallowing accelerometry, vocalizations are
explicitly removed by preprocessing [104] and studies have implicated hyolaryngeal mo-
tion as the primary source of the acceleration signal [98, 139]. Fundamentally, both the
method of transduction and the primary physiological source of these signals are different.
Our focus here is swallowing vibrations and recent progress in swallowing accelerometry is
reviewed below.
3.2.1 Automatic classification
Das, Reddy & Narayanan [28] deployed a fuzzy logic-committee network to distinguish be-
tween swallows and ’artifacts’ using time and frequency domain features of single-axis ac-
celerometry signals. Although they achieved very high accuracies, their sample of swallows
and ’artifacts’ was very modest. Using a radial basis classifier with statistical and energetic
features, Lee et al. [71] detected aspirations from single-axis cervical acceleration signals
with approximately 80% sensitivity and specificity in a large pediatric cerebral palsy popula-
tion. Both of these studies only examined accelerations in the anterior-posterior anatomical
direction. However, recent research has shown that there is distinct information about swal-
lowing that is encoded in the superior-inferior vibration [73]. Further, hyolaryngeal motion
associated with swallowing is inherently two-dimensional and this motion was implicated
as the likely source of swallow vibrations [139].
In the first dual-axis classification study, Lee et al. [74] discriminated between no airway
invasion and airway invasion past the true vocal folds in 24 adult stroke patients using a
variety of classifiers (linear discriminant, neural network, probabilistic network and nearest
neighbor). A genetic algorithm (GA) selected the most discriminatory feature combinations.
With linear classifiers, an adjusted accuracy of 74.7% was achieved in feature spaces of up
to 12 dimensions.
In the aforementioned studies, various genres of features have demonstrated discrim-
inatory potential. These include statistical features such as dispersion ratio and normal-
ity [71], time-frequency features such as wavelet energies [73], information theoretic
features such as entropy rate [72], temporal features such as signal memory [45], and
spectral features such as the spectral centroid [106]. Further, there is evidence to suggest
that complementary measurement modalities, such as nasal air flow and submental
mechanomyography [75], may enhance segmentation and classification. Given the presence of multiple
feature genres and different measurement modalities, the swallow detection and classifica-
tion problem lends itself to a multi-classifier approach. For example, it may be sensible to
dedicate one classifier to each feature genre [56]. Moreover, data sets from different patient
groups might warrant different classifiers [56, 135]. Lastly, the demand for classification
speed may necessitate the use of multiple classifiers [56].
In this Chapter, we invoke a novel, computationally efficient reputation-based classi-
fier combination to automatically categorize dual-axis accelerometric signals from adult
patients into safe and unsafe swallows, as labeled via videofluoroscopic review. We con-
sider multiple feature genres from both anterior-posterior and superior-inferior axes and
examine a much larger data set than that of previous swallow accelerometry classification
studies.
Figure 3.1: Data collection set-up
3.3 Methods
3.3.1 Data collection
In this Chapter, we re-examine data from a subset of participants originally reported in [114].
Briefly, we recruited 30 patients (aged 65.47± 13.4 years, 15 male) with suspicion of neuro-
genic dysphagia who were referred to routine videofluoroscopic examination at one of two
local hospitals. Patients had dysphagia secondary to stroke, acquired brain injury, neurode-
generative disease, and spinal cord injury. Research ethics approval was obtained from both
participating hospitals.
The data collection set-up is shown in Figure 3.1. Sagittal plane videofluoroscopic im-
ages of the cervical region were recorded to computer at a nominal 30 frames per sec-
ond via an analog image acquisition card (PCI-1405, National Instruments). Each frame
was marked with a timestamp via a software frame counter. A dual-axis accelerometer
(ADXL322, Analog Devices) was taped to the participant's neck at the level of the cricoid
cartilage. The axes of the accelerometer were aligned to the anatomical anterior-posterior
(AP) and superior-inferior (SI) axes. Signals from both the AP and SI axes were passed
through separate pre-amplifiers each with an internal bandpass filter (Model P55, Grass
Technologies). The cutoff frequencies of the bandpass filter were set at 0.1 Hz and 3 kHz.
The amplifier gain was 10. The signals were then sampled at 10 kHz using a data acquisi-
tion card (USB NI-6210, National Instruments) and stored on a computer for subsequent
analyses. A trigger was sent from a custom LabView virtual instrument to the image acqui-
sition card to synchronize videofluoroscopic and accelerometric recordings. The above in-
strumentation settings replicate those of previous dual-axis swallowing accelerometry stud-
ies [139, 45, 92, 72, 105, 104, 106].
Each participant swallowed a minimum of two or a maximum of three 5 mL teaspoons of
thin liquid barium (40 %w/v suspension) while his/her head was in a neutral position. The
number of sips that the participant performed was determined by the attending clinician.
The recording of dual-axis accelerometry terminated after the participant finished his/her
swallows. However, the participant’s speech-language pathologist continued the videofluo-
roscopy protocol as per usual. In total, we obtained 224 individual swallowing samples from
the 30 participants, 164 of which were labeled as unsafe swallows (as defined below) and 60
as safe swallows.
3.3.2 Data segmentation
To segment the data for analysis, a speech-language pathologist reviewed the videofluo-
roscopy recordings. The beginning of a swallow was defined as the frame when the liquid
bolus passed the point where the shadow of the mandible intersects the tongue base. The
end of the swallow was identified as the frame when the hyoid bone returned to its rest po-
sition following bolus movement through the upper esophageal sphincter. The beginning
and end frames as defined above were marked within the video recording using a custom
C++ program. The cropped video file was then exported together with the associated
segments of dual-axes acceleration data. An unsafe swallow was defined as any swallow
without airway clearance. Typically, this would include penetration and aspiration. Residue
would be considered a situation of swallowing inefficiency that is not unsafe swallowing un-
less the residue was subsequently aspirated. Backflow is extremely rare in the oropharynx,
and would only be classified as unsafe should it lead to penetration-aspiration. This defi-
nition of unsafe swallowing is in keeping with the industry standard Penetration-Aspiration
Scale [99].
3.3.3 Pre-Processing
It has been shown in [73] that the majority of power in swallowing vibrations of healthy
adults lies below 100 Hz. Given that we were dealing with patient data, we estimated the
bandwidth of each of the 224 swallows using the bandwidth estimator defined in [108], ob-
taining average bandwidths of 175± 73 Hz and 226± 84 Hz for the AP and SI axes, respec-
tively. Moreover, spectral centroids were < 70 Hz in both axes, suggesting that there is no
appreciable signal energy beyond a few hundred Hz. Therefore, we downsampled all sig-
nals to 1 kHz. Vocalization was removed from each segmented swallow according to the
periodicity detector proposed in [104]. Whitening of the accelerometry signals to account
for instrumentation nonlinearities was achieved using inverse filtering [106]. Finally, the
signals were denoised using a Daubechies-8 (db8) wavelet transform with soft thresholding [105].
Both the decomposition level and the wavelet coefficient threshold were chosen empirically
to minimize noise while maximizing the information that remained in the signal. Figures
3.2 and 3.3 exemplify pre-processed safe and unsafe swallowing signals, respectively.
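As an illustrative aside, the soft-thresholding rule used in the wavelet denoising step can be sketched in a few lines of numpy (the db8 decomposition itself is omitted, and the threshold value below is arbitrary rather than the empirically tuned one used in the study):

```python
import numpy as np

def soft_threshold(coeffs, t):
    """Soft thresholding: shrink each coefficient toward zero by t,
    zeroing any coefficient whose magnitude is at most t."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - t, 0.0)

# small coefficients (|c| <= 0.1) are zeroed; larger ones are shrunk by 0.1
c = np.array([-0.5, -0.05, 0.0, 0.05, 0.5])
denoised = soft_threshold(c, 0.1)
```

In practice this rule would be applied to the detail coefficients of the wavelet decomposition before reconstructing the denoised signal.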
3.3.4 Feature Extraction
Let S be a pre-processed acceleration time series, $S = s_1, s_2, \ldots, s_n$. As in previous accelerometry
studies, signal features from multiple domains were considered [75, 72]. The different
genres of features are summarized below.
1. Time Domain Features
• The sample mean is an unbiased estimation of the location of a signal’s amplitude distribution and is given by
$$\mu_s = \frac{1}{n}\sum_{i=1}^{n} s_i. \qquad (3.1)$$

Figure 3.2: Example of safe swallowing signals from AP and SI axes (acceleration in g versus time in seconds)

• The variance of a distribution measures its spread around the mean and reflects the signal’s power. The unbiased estimation of variance can be obtained as
$$\sigma_s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(s_i - \mu_s\right)^2. \qquad (3.2)$$
• The median is a robust location estimate of the amplitude distribution. For the sorted set S, the median can be calculated as
$$MED(s) = \begin{cases} s_{v+1}, & \text{if } n = 2v+1, \\ \dfrac{s_v + s_{v+1}}{2}, & \text{if } n = 2v. \end{cases} \qquad (3.3)$$

Figure 3.3: Example of unsafe swallowing signals from AP and SI axes (acceleration in g versus time in seconds)
• Skewness is a measure of the symmetry of a distribution. This feature can be computed as follows:
$$\gamma_{1,s} = \frac{\frac{1}{n}\sum_{i=1}^{n}(s_i-\mu_s)^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(s_i-\mu_s)^2\right)^{1.5}}. \qquad (3.4)$$

• Kurtosis reflects the peakedness of a distribution and can be found as
$$\gamma_{2,s} = \frac{\frac{1}{n}\sum_{i=1}^{n}(s_i-\mu_s)^4}{\left(\frac{1}{n}\sum_{i=1}^{n}(s_i-\mu_s)^2\right)^{2}}. \qquad (3.5)$$
2. Frequency Domain Features

• The peak magnitude value of the Fast Fourier Transform (FFT) of the signal S was also used as a feature. All the FFT coefficients were normalized by the length of the signal, n.

• The centroid frequency of the signal S [106] was estimated as
$$\hat{f} = \frac{\int_0^{f_{max}} f\,|F_s(f)|^2\, df}{\int_0^{f_{max}} |F_s(f)|^2\, df}, \qquad (3.6)$$
where $F_s(f)$ is the Fourier transform of the signal S and $f_{max}$ is the Nyquist frequency (effectively 500 Hz after downsampling).

• The bandwidth of the spectrum was computed using the following formula:
$$BW = \sqrt{\frac{\int_0^{f_{max}} (f-\hat{f})^2\,|F_s(f)|^2\, df}{\int_0^{f_{max}} |F_s(f)|^2\, df}}. \qquad (3.7)$$
3. Information Theory-Based Features

• The entropy rate [94] of a signal quantifies the extent of regularity in that signal. The measure is useful for signals with some relationship among consecutive signal points. We first normalized the signal S to zero mean and unit variance. Then, we quantized the normalized signal into 10 equally spaced levels, represented by the integers 0 to 9, ranging from the minimum to the maximum value. Next, each sequence of U consecutive points in the quantized signal, $\tilde{S} = \tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_n$, was coded using the following equation:
$$a_i = \tilde{s}_{i+U-1}\cdot 10^{U-1} + \ldots + \tilde{s}_i\cdot 10^{0}, \qquad (3.8)$$
with $i = 1, 2, \ldots, n-U+1$. The coded integers comprised the coding set $A_U = \{a_1, \ldots, a_{n-U+1}\}$. Using the Shannon entropy formula, we estimated the entropy
$$E(U) = -\sum_{t=0}^{10^U - 1} p_{A_U}(t)\,\ln p_{A_U}(t), \qquad (3.9)$$
where $p_{A_U}(t)$ represents the probability of observing the value t in $A_U$, approximated by the corresponding sample frequency. Then, the entropy rate was normalized using the following equation:
$$NE(U) = \frac{E(U) - E(U-1) + E(1)\cdot\beta}{E(1)}, \qquad (3.10)$$
where β was the percentage of the coded integers in $A_U$ that occurred only once. Finally, the regularity index $\rho \in [0,1]$ was obtained as
$$\rho = 1 - \min_U NE(U), \qquad (3.11)$$
where a value of ρ close to 0 signifies maximum randomness while ρ close to 1 indicates maximum regularity.

• To calculate the memory of the signal [72], its autocorrelation function was computed from zero to the maximum time lag (equal to the length of the signal) and normalized such that the autocorrelation at zero lag was unity. The memory was estimated as the time required for the autocorrelation to decay to 1/e of its zero-lag value.

• Lempel-Ziv (L-Z) complexity [77] measures the predictability of a signal. To compute the L-Z complexity of signal S, the minimum and maximum values of the signal points were first calculated, and the signal was quantized into 100 equally spaced levels between these extremes. Then, the quantized signal, $B_1^n = b_1, b_2, \ldots, b_n$, was decomposed into T different blocks, $B_1^n = \psi_1\psi_2\cdots\psi_T$. A block ψ was defined as
$$\psi = B_j^{\ell} = b_j, b_{j+1}, \ldots, b_{\ell}, \quad 1 \le j \le \ell \le n. \qquad (3.12)$$
The blocks were generated as follows:
$$\psi_1 = b_1; \qquad \psi_{m+1} = B_{h_m+1}^{h_{m+1}}, \quad m \ge 1, \qquad (3.13)$$
where $h_m$ is the ending index of $\psi_m$, such that $\psi_{m+1}$ is a unique sequence of minimal length within the sequence $B_1^{h_{m+1}-1}$. Finally, the normalized L-Z complexity was calculated as
$$LZ = \frac{T\,\log_{100} n}{n}. \qquad (3.14)$$
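A minimal numpy sketch of representative features from each genre above may clarify the computations (function names are ours; the Lempel-Ziv parser follows Eqs. (3.12)-(3.14) with 100 quantization levels):

```python
import numpy as np

def time_features(s):
    """Time-domain features: mean (3.1), unbiased variance (3.2),
    median (3.3), skewness (3.4) and kurtosis (3.5)."""
    mu = s.mean()
    var = s.var(ddof=1)                         # unbiased estimate
    med = np.median(s)
    m2 = ((s - mu) ** 2).mean()
    skew = ((s - mu) ** 3).mean() / m2 ** 1.5
    kurt = ((s - mu) ** 4).mean() / m2 ** 2
    return mu, var, med, skew, kurt

def frequency_features(s, fs):
    """Peak FFT magnitude, spectral centroid (3.6) and bandwidth (3.7)."""
    F = np.fft.rfft(s) / len(s)                 # coefficients normalized by n
    f = np.fft.rfftfreq(len(s), d=1.0 / fs)
    P = np.abs(F) ** 2
    peak = np.abs(F).max()
    centroid = (f * P).sum() / P.sum()
    bw = np.sqrt((((f - centroid) ** 2) * P).sum() / P.sum())
    return peak, centroid, bw

def _seen_before(seq, block):
    """True if `block` occurs as a contiguous subsequence of `seq`."""
    k = len(block)
    return any(seq[i:i + k] == block for i in range(len(seq) - k + 1))

def lz_complexity(s, levels=100):
    """Normalized Lempel-Ziv complexity (3.14): quantize into `levels`
    bins, then parse into T blocks, each being the shortest sequence
    not previously seen (3.12)-(3.13)."""
    b = np.minimum((levels * (s - s.min()) / (np.ptp(s) + 1e-12)).astype(int),
                   levels - 1).tolist()
    T, j, n = 0, 0, len(b)
    while j < n:
        ell = 1
        while j + ell <= n and _seen_before(b[:j + ell - 1], b[j:j + ell]):
            ell += 1
        T += 1                                   # one new block parsed
        j += ell
    return T * np.log(n) / (np.log(levels) * n)  # T * log_100(n) / n
```

For example, a constant signal parses into very few blocks and therefore has near-zero L-Z complexity, whereas a broadband random signal parses into many.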
3.3.5 Reputation-Based Classification
Reputation typically refers to the quality or integrity of an individual component within a
system of interacting components. The notion of reputation has been widely used to ascer-
tain the health of nodes in wireless networks [26], identify malicious hosts in a distributed
system [121] and detect free-riders in peer-to-peer networks [124], among many other prac-
tical applications. Here, we apply the concept of reputation to judiciously combine deci-
sions of multiple classifiers for the purpose of differentiating between safe and unsafe swal-
lows. The general idea is to differentially weigh classifier decisions on the basis of their past
performance.
The past performance of the $i$th classifier is captured via its reputation, $r_i \in \Re$, $0 \le r_i \le 1$,
where 1 signifies a strong classifier (high accuracy) and 0 denotes a weak classifier. Briefly,
the classifier is formulated as follows. Let $\Theta = \{\theta_1, \theta_2, \ldots, \theta_L\}$ be a set of $L \ge 2$ classifiers and
$\Omega = \{\omega_1, \omega_2, \ldots, \omega_c\}$ be a set of $c \ge 2$ class labels, where $\omega_j \neq \omega_k$, $\forall j \neq k$. Without loss
of generality, $\Omega \subset \mathbb{N}$. The input of each classifier is a feature vector $x \in \mathbb{R}^{n_i}$, where $n_i$ is
the dimension of the feature space for the $i$th classifier $\theta_i$, whose output is a class label
$\omega_j$, $j = 1, \ldots, c$. Let $p(\omega_j)$ be the prior probability of class $\omega_j$.
1. For a classification problem with c ≥ 2 classes, we invoke L ≥ 2 individual classifiers.
2. After training the L classifiers individually, the respective accuracy of each is evaluated
using a validation set and expressed as a real number in [0,1]. This number is the
reputation of the classifier.
3. For each feature vector, x, in the test set, L decisions are obtained using the L distinct classifiers:
$$\Omega(x) = \{\theta_1(x), \theta_2(x), \ldots, \theta_L(x)\}. \qquad (3.15)$$

4. We sort the reputation values of the classifiers in descending order,
$$R^* = \{r_{1^*}, r_{2^*}, \ldots, r_{L^*}\}, \qquad (3.16)$$
such that $r_{1^*} \ge r_{2^*} \ge \ldots \ge r_{L^*}$. Then, using this set, we rank the classifiers to obtain a reputation-ordered set of classifiers,
$$\Theta^* = \{\theta_{1^*}, \theta_{2^*}, \ldots, \theta_{L^*}\}. \qquad (3.17)$$
The first element of this set corresponds to the classifier with the highest reputation.

5. Next, we examine the votes of the first m elements of the reputation-ordered set of classifiers, with
$$m = \begin{cases} \dfrac{L}{2}, & \text{if } L \text{ is even,} \\ \dfrac{L+1}{2}, & \text{if } L \text{ is odd.} \end{cases} \qquad (3.18)$$
If the top m classifiers vote for the same class,ωj , we accept the majority vote and take
ωj as the final decision of the system. However, if the votes of the first m classifiers
are not equal, we consider the classifiers’ individual reputations (Step 2).
6. The probability that the combined classifier decision is $\omega_j$ given the input vector x and the individual local classifier decisions is denoted as the posterior probability,
$$p(\omega_j \,|\, \theta_1(x), \theta_2(x), \ldots, \theta_L(x)), \qquad (3.19)$$
which, when the classifiers are independent, can be estimated using Bayes rule as
$$p(\omega_j \,|\, \theta_1, \ldots, \theta_L) = \frac{\prod_{i=1}^{L} p(\theta_i \,|\, \omega_j)\, p(\omega_j)}{\sum_{t=1}^{c} \prod_{i=1}^{L} p(\theta_i \,|\, \omega_t)\, p(\omega_t)}. \qquad (3.20)$$
For notational convenience, we have dropped the argument of θ above, but each $\theta_i$ is understood to be a function of x. The local likelihood functions, $p(\theta_i \,|\, \omega_j)$, are estimated by the reputation values calculated in Step 2. When the correct class is $\omega_j$ and classifier $\theta_i$ classifies x into the class $\omega_j$, i.e., $\theta_i(x) = \omega_j$, we can write
$$p(\theta_i = \omega_j \,|\, \omega_j) = r_i. \qquad (3.21)$$
In other words, $p(\theta_i = \omega_j \,|\, \omega_j)$ is the probability that the classifier $\theta_i$ correctly classifies x into class $\omega_j$ when x actually belongs to this class. This probability is exactly equal to the reputation of the classifier. On the other hand, when the classifier categorizes x incorrectly, i.e., $\theta_i(x) \neq \omega_j$ given that the correct class is $\omega_j$, then
$$p(\theta_i \neq \omega_j \,|\, \omega_j) = 1 - r_i. \qquad (3.22)$$
When there is no known priority among classes, we can assume equal prior probabilities. Hence,
$$p(\omega_1) = p(\omega_2) = \ldots = p(\omega_c) = \frac{1}{c}. \qquad (3.23)$$
Thus, for each class, $\omega_j$, we can estimate the a posteriori probabilities given by (3.20) using (3.21), (3.22), and (3.23). The class with the highest posterior probability is selected as the final decision of the system and the input x is categorized as belonging to this class.
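The decision procedure of Steps 1-6 can be sketched for abstract-level votes as follows. This is a sketch under the independence assumption of (3.20); `srb_decide` and its argument names are ours:

```python
import numpy as np

def srb_decide(votes, reputations, n_classes):
    """Static reputation-based (SRB) fusion of abstract-level votes.

    votes       : class label chosen by each of the L classifiers for one sample
    reputations : validation-set accuracy r_i of each classifier (Step 2)
    Returns the fused class label.
    """
    votes = np.asarray(votes)
    r = np.asarray(reputations, dtype=float)
    L = len(votes)
    m = L // 2 if L % 2 == 0 else (L + 1) // 2     # Eq. (3.18)

    # Steps 4-5: rank classifiers by reputation; accept a unanimous top-m vote
    order = np.argsort(-r)
    top_votes = votes[order[:m]]
    if np.all(top_votes == top_votes[0]):
        return int(top_votes[0])

    # Step 6: naive-Bayes combination (3.20) with likelihoods (3.21)-(3.22)
    # and equal priors (3.23), which cancel in the argmax
    posteriors = np.empty(n_classes)
    for j in range(n_classes):
        lik = np.where(votes == j, r, 1.0 - r)     # r_i if agreeing, else 1 - r_i
        posteriors[j] = lik.prod()
    return int(posteriors.argmax())
```

For instance, with votes (1, 1, 0, 0, 0) and reputations (0.9, 0.9, 0.5, 0.5, 0.5), the top three classifiers disagree and the Bayes step selects class 1.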
3.3.6 Classifier evaluation
We ranked the signal features introduced above using the Fisher ratio [78] for univariate
separability. In the time domain, mean and variance in the AP axis and skewness in the
SI axis were the top-ranked features. Similarly, in the frequency domain, the peak magni-
tude of the FFT and the spectral centroid in the AP direction and the bandwidth in the SI
direction were retained. Finally, in the information theoretic domain, entropy rate for the
SI signal and memory of the AP signal were the highest ranking features. Subsequently, we
only examined these feature subsets for classification. For comparison between single and
dual-axes classifiers, we also considered classifiers that employed feature subsets (as iden-
tified above) from a single axis.
Given the disproportion of safe and unsafe samples, we invoked a smooth bootstrapping
procedure [35] to balance the classes. All features were then standardized to zero mean and
unit variance. Three separate support vector machine (SVM) classifiers [33] were invoked,
one for each feature genre (time, frequency and information theoretic). Hence, the fea-
ture space dimensionalities for the classifiers were 3 (SVM with time features), 3 (SVM with
frequency features) and 2 (SVM with information-theoretic features). The use of different
feature sets for each classifier ensures that the classifiers will perform independently [135].
Classifier accuracy was estimated via a 10-fold cross validation with a 90-10 split. In
each fold, the whole training set was used to estimate the individual classifier reputations.
Classifiers were then ranked according to their reputation values. Without loss of generality,
assume $r_1 \ge r_2 \ge r_3$. If $\theta_1$ and $\theta_2$ cast the same vote about a test swallow, their common
decision was accepted as the final classification. However, if they voted differently, the a
posteriori probability of each class was computed using (3.20) and the maximum a posteriori
probability rule was applied to select the final classification.
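The class-balancing step can be sketched as a smooth bootstrap: minority-class samples are redrawn with replacement and jittered with Gaussian noise. This is a sketch under our own choice of noise bandwidth `h`, not the exact procedure of [35]:

```python
import numpy as np

def smooth_bootstrap(X, n_needed, h=0.1, rng=None):
    """Smooth bootstrap: resample rows of X with replacement, then jitter
    each resampled point with Gaussian noise scaled by the feature std."""
    rng = np.random.default_rng(rng)
    idx = rng.integers(0, len(X), size=n_needed)
    noise = rng.normal(0.0, h * X.std(axis=0, ddof=1), size=(n_needed, X.shape[1]))
    return X[idx] + noise

# balance a hypothetical 60-safe / 164-unsafe split in an 8-feature space
safe = np.random.default_rng(0).normal(size=(60, 8))
extra = smooth_bootstrap(safe, 164 - 60, rng=1)
balanced = np.vstack([safe, extra])
```

After balancing, each feature would be standardized to zero mean and unit variance before training the three SVMs.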
3.4 Results
The sensitivity, specificity and accuracy of the single-axis and dual-axis accelerometry clas-
sifiers are summarized in Figure 3.4. The dual-axis classifier had significantly higher accuracy
(80.48 ± 5.0%) than either single-axis classifier (p ≪ 0.05, two-sample t-test), specificity
(64 ± 8.8%) comparable to that of the SI classifier (p = 1.0), and sensitivity (97.1 ± 2%) on par
with that of the AP classifier (p = 1.0). In other words, the dual-axis classifier retained the
best sensitivity and specificity achievable with either single-axis classifier.
3.5 Discussion
3.5.1 Dual versus single axis
Of the two axes, the AP axis tended to carry more useful information than the SI direction
for discrimination between safe and unsafe swallowing. This observation is evidenced in
Figure 3.4, where AP accuracy is dramatically higher than SI levels, echoing the findings
of [73] who suggested that the AP axis is richer in information content (i.e., higher entropy)
relating to swallowing. Nonetheless, the SI axis does carry information distinct from that
of the AP orientation, as dual-axis classification exceeds any single-axis counterpart. Our
results thus support the inclusion of selected features from both the AP and SI axes for the
automatic discrimination between safe and unsafe swallowing. Indeed, when comparing
Figure 3.4: Classification performance for the single-axis (AP, SI) and dual-axis (AP + SI) reputation-based classifiers. The height of each bar denotes the average of the respective performance measure while the error bars denote standard deviation around the mean.
AP and SI signals, [73] reported minimal mutual information, and inter-axis dissimilarities
in the scalograms, pseudo-spectra and temporal evolution of low- and high-frequency con-
tent.
In a recent videofluoroscopic study, both AP and SI accelerations were attributed to the
planar motion of the hyoid and larynx during swallowing [139]. In that study, the displace-
ment of the hyoid bone and larynx along with their interaction explained over 70% of the
variance in the doubly integrated acceleration in both AP and SI axes at the level of the
cricoid cartilage. Juxtaposed with our findings above, this reported physiological source of
swallow accelerometry suggests that it is the difference in hyolaryngeal motion that is mani-
fested as discriminatory cues between safe and unsafe swallowing. Indeed, early single-axis
accelerometry research had implicated decreased laryngeal elevation as the reason for sup-
pressed AP accelerations in individuals with severe dysphagia [98].
Figure 3.5: Parallel axes plot depicting the internal representation of safe (solid line) and unsafe (dashed line) swallows. The eight axes correspond to AP mean, AP variance, SI skewness, AP spectral centroid, AP spectral peak, SI bandwidth, SI entropy rate, and AP memory (normalized feature values).
3.5.2 Internal representation
Figure 3.5 is a parallel axes plot depicting the internal representation of safe and unsafe
swallows acquired by the reputation-based classifier. Each feature has been normalized by
its standard deviation to facilitate visualization. On each axis, the median feature value is
shown. The median values of adjacent axes are joined by solid (safe swallow) or dashed (un-
safe swallow) lines. We immediately observe some distinct patterns which characterize each
type of swallow. In the AP axis, unsafe swallows tend to have lower acceleration amplitude,
higher variance, higher spectral centroid and shorter memory. The lower mean vibration
amplitude in unsafe swallowing resonates with previous reports of suppressed peak accel-
eration [98] in dysphagic patients and reduced peak anterior hyoid excursion [60] in older
adults, both suggesting compromised airway protection. The observation of a higher spec-
tral centroid in unsafe swallowing may reflect departures from the typical axial high-low
frequency coupling trends of normal swallowing as detailed in [73]. Likewise, the shorter
memory and hence faster decay of the autocorrelation may be indicative of compromised
overall coordination in unsafe swallowing.
It is also interesting to note that unsafe swallows tend to be negatively skewed while
safe swallows are evenly split between positive and negative skew. In other words, in un-
safe swallowing, the upward motion of the hyolaryngeal structure appears to have weaker
accelerations than during the downward motion. This is opposite of the tendency reported
in [73] for healthy swallowing and may reflect inadequate urgency to protect the airway.
3.5.3 Reputation-based classification
The merit of a reputation-based classifier for the present problem can be appreciated by
contrasting its performance against that of the classic method of combining classifiers, i.e.,
via the majority voting algorithm. To this end, Figure 3.6 summarizes the accuracies of both
approaches from a 10-fold cross-validation using the data of this study. The histograms
summarize the distribution of accuracies obtained from cross-validation. The correspond-
ing density estimate (solid line) was obtained using a semi-parametric maximum likelihood
estimator based on a finite mixture of Gaussian kernels. Clearly, the location of the density
of reputation-based accuracies appears to be further to the right of the location of the ma-
jority voting density. The large spread in both densities amplifies the risk of Type II error and
thus conventional testing (e.g., Wilcoxon rank-sum) fails to identify any differences. However,
upon more careful inspection using a two-sample Kolmogorov-Smirnov test of the
20% one-sided trimmed densities (i.e., omitting the 2 most extreme points in each density),
a statistically significant difference between the distributions (p = 0.0098) is confirmed.
The reputation-based classifier achieved higher adjusted accuracies (> 85%; average of
sensitivity and specificity) than those reported in [74] (no greater than 75%). Patients were
similarly aged and all had neurogenic dysphagia. However, some key differences between
the studies are worthy of mention. The present study had a slightly larger sample size, a
better balance between males and females ([74] almost exclusively had males), and most
Figure 3.6: Comparison of the densities of classification accuracies by majority voting (top) and reputation-based classification (bottom) for safe and unsafe swallow discrimination
importantly, a more significant representation of unsafe swallows (73% of total swallows
compared to only 13% in [74]). Arguably, vibration patterns of pathological swallows vary
more widely than those of safe swallows and hence a more comprehensive representation
of the former may be well-justified.
Generally, the reputation-based classification scheme mitigates the risk of the overall
classifier performance being unduly affected by a poorly performing component classifier
within a multi-classifier system. Additionally, as exemplified in this study, the dimensional-
ity of individual classifiers can be minimized, reducing the demand for voluminous training
data.
3.5.4 Limitations
The dual-axes classifier attained very high sensitivity but modest specificity. In part, this
bias towards higher sensitivity may be attributable to the preponderance of unsafe swallow
examples in the original data set, despite our efforts to balance the classes via bootstrap-
ping. In a practical system, it would mean that the classifier may overzealously flag a safe
swallow as unsafe. This class imbalance issue may be a limitation of studying patients re-
ferred to videofluoroscopy, the majority of whom likely have a greater propensity for prob-
lematic swallowing. Hence, to obtain a larger number of safe swallows, a significantly ex-
panded sample of patients may need to be recruited in the future.
The reputation classifier assumes independent features. This constrains the admissi-
ble features, but [73] has argued that many SI and AP features have low correlations. Fu-
ture work may invoke independent component analysis or principal component analysis to
generate additional novel independent features. The present classifier relies on static rep-
utation values. In clinical application, the classifier may be trained and tested at different
times with different patients. As a consequence, the feature distributions may change over
time. In such cases, dynamic reputation values may be more appropriate, and future work
may consider an online approach to dynamically update classifier reputations.
3.6 Conclusion
This study has demonstrated the potential for automatic discrimination between safe and
unsafe (without airway clearance) swallows on the basis of a selected subset of time, fre-
quency and information theoretic features derived from non-invasive, dual-axis accelero-
metric measurements at the level of the cricoid cartilage. Dual-axis classification was more
accurate than single-axis classification. The reputation-based classifier internally repre-
sented unsafe swallows as those with lower mean acceleration, lower range of acceleration,
higher spectral centroid, slower autocorrelation decay and weaker acceleration in the supe-
rior direction. Our results suggest that reputation-based classification of dual-axis swallow-
ing accelerometry from adult stroke patients deserves further consideration as a turn-key
clinical informatic in the management of swallowing disorders.
Chapter 4
Analyzing the performance of the static
reputation-based algorithm
The contents of this chapter are reproduced from a journal article currently under review:
Nikjoo, M. and Chau, T., "Analyzing the performance of the static reputation-based algo-
rithm", Pattern Recognition Letters.
4.1 Abstract
Weighted classifier ensembles have been widely applied in various domains. In this paper,
we characterize the large and finite ensemble behavior of one such system, a recently pro-
posed static reputation-based (SRB) multiple classifier algorithm. Specifically, we examine
the convergence of the probability of error in the presence of asymptotically large ensem-
bles. In the finite ensemble case, we determine the lower bound of classification accuracy
through an examination of patterns of success and failure. Finally, we provide empirical
simulations of probabilities of classification error under uniform and Gaussian models of
point estimation errors for finite ensembles. The systematic characterization provided in
this paper sheds light on the strengths and limitations of the reputation-based algorithm as
a multi-classifier ensemble.
4.2 Introduction
Multiple classifier systems (MCS) have been explored in many pattern recognition applica-
tions [2, 138, 31, 91], primarily as a technique to improve the performance of a classification
system. Other rationales for the deployment of MCS include the availability of different features
and training sets or the need to improve classification speed [56]. Classifiers can be
combined on three levels: abstract, rank, and measurement levels. Among the three, the
abstract level is the simplest, allowing several classifiers, each designed within a different
context, to be combined. As a consequence, abstract-level MCSs are preferred over other
types of classifier combinations [56].
Recently, a novel weighted classifier combination technique, the static reputation-based
algorithm (SRB) was proposed and demonstrated on the classification of dual-axes swallow-
ing signals [88]. The SRB algorithm differentially weighs classifier decisions on the basis of
their past performance. With the above data set, the SRB algorithm improved classification
accuracy beyond simple majority voting and various individual classifiers, while reducing
computational load [88]. Although similar in spirit to weighted majority voting [135], SRB is
not subject to the degenerate condition of vanishing weights.
Generally, there are few theoretical appraisals of weighted multiple classifier combinations.
Kuncheva [64] analyzed six different classifier combination techniques and developed
formulae for their classification error. Chen and Cheng [20] studied the asymptotic behavior
of three classifier combinations: average, median, and majority voting. They
showed that for large numbers of individual classifiers, median and majority voting produce
the same decision. [17] discuss the asymptotic behavior and convergence rate of three
related fusion strategies (average, median, and maximum) under uniform, Gaussian
and Cauchy noise.
In this paper, we model the SRB algorithm using a percentile greater than the median,
and analyze its asymptotic behavior for uniform and Gaussian noise, derive a formula for its
probability of errors and discuss its behavior relative to that of majority voting. Moreover,
using the idea of "patterns of success and failure" [66], we highlight performance differences
between SRB and majority voting. Finally, adapting the simulation method proposed by [1],
we study the behavior of the SRB algorithm for small and medium numbers of classifiers by
subjecting their outputs to uniform and Gaussian perturbation errors.
The remainder of this paper is organized as follows. In Section 2, we analyze the asymp-
totic behavior of SRB. In Section 3, we characterize the SRB algorithm first in terms of pat-
terns of success and failure and subsequently in terms of probability of error.
4.3 Large n behavior
4.3.1 Problem Statement
Consider a classification problem with two classes, c1 and c2. Let x be an input feature
vector to be classified using n independent classifiers. Let P = P(c1|x ) be the actual pos-
terior probability of the first class for the given input x . Without loss of generality, we as-
sume that the given input belongs to the first class and therefore P > 0.5 (the other case
can be discussed similarly). Assume that each classifier outputs an estimate of P , denoted
as ω1,ω2, ...,ωn . To improve the estimation of P , an MCS may be used. For instance, one
can use the minimum, maximum, average, and majority vote to fuse individual classifier
outputs to improve the estimation of P [1, 64]. Let
ω= f (ω1,ω2, ...,ωn ) (4.1)
represent the fused estimate of P(c1|x ). We assume that the individual estimates, ωi ’s, are
independent identically distributed (i.i.d.) random variables. Each estimate is the posterior
probability plus random noise:
$$\omega_i = P + \varepsilon_i. \qquad (4.2)$$
Therefore, the fused estimate, ω, is also a random variable. We denote by $f_\omega(y)$ and $F_\omega(z)$
the probability density function (pdf) and cumulative distribution function (cdf) of the
fused estimate, respectively. The correct fused decision for the given input occurs when
based algorithm, i.e., when n→∞ for two different types of noise: uniform and Gaussian.
4.3.2 Modeling the Algorithm
Before proposing a model for the SRB algorithm, we review a model of the majority voting
algorithm. First, we rearrange the classifiers’ estimates of P from least to greatest, i.e.,
$\omega_{(1)} \le \omega_{(2)} \le \ldots \le \omega_{(n)}$. The majority voting algorithm can be modeled as the median of the series, $\omega_{(m)}$, as follows:
$$maj(\omega) = \omega_{(m)} = \begin{cases} \omega_{(\frac{n+1}{2})}, & \text{if } n \text{ is odd;} \\ \omega_{(\frac{n}{2})}, & \text{if } n \text{ is even.} \end{cases} \qquad (4.3)$$
In other words, the median estimator defines the final decision of the system. If $\omega_{(m)} > 0.5$,
then the system correctly assigns the given input, x, to the first class.
Now let us consider the SRB algorithm. Unlike the majority voting method, the final
decision of the system is equal to the decision of one of the estimators, $\omega_{(m)}$ to $\omega_{(n)}$, according
to the Bayes rule. Let us clarify with an example. Assume the set $\omega = \{\omega_{(1)} = 0.2, \omega_{(2)} = 0.3, \omega_{(3)} = 0.4, \omega_{(4)} = 0.6, \omega_{(5)} = 0.7\}$ represents the posterior probability estimates for the first
class by five different classifiers. Also, let $r = \{r_{(1)} = 0.5, r_{(2)} = 0.5, r_{(3)} = 0.5, r_{(4)} = 0.9, r_{(5)} = 0.9\}$
be the reputation values of these classifiers. Majority voting would assign the given input to
the second class based on the median of the first set, $\omega_{(m)} = 0.4$. With the SRB algorithm,
the classifiers are first divided into two groups based on their decisions: $\{r_{(1)}, r_{(2)}, r_{(3)}\}$ and
$\{r_{(4)}, r_{(5)}\}$. Then, the Bayes rule for each class gives $P(c_1|\omega) = \frac{0.9^2 \cdot 0.5^3}{0.9^2 \cdot 0.5^3 + 0.1^2 \cdot 0.5^3} = 0.9878$
and $P(c_2|\omega) = \frac{0.1^2 \cdot 0.5^3}{0.9^2 \cdot 0.5^3 + 0.1^2 \cdot 0.5^3} = 0.0122$. Therefore, the final decision of the SRB is the first
class. Hence, unlike the median selection of majority voting, SRB selects a higher quantile,
$\omega_{(4)}$ in this case, as the final decision of the system. In general, the final decision of the SRB
coincides with a percentile greater than or equal to the median. It should be mentioned
here that we have inherently assumed that the reputation values are accurate metrics for
gauging the performance of the classifiers on the test set.
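The posterior arithmetic of this example can be checked directly (the equal priors cancel):

```python
# Two classifiers with reputation 0.9 vote for c1; three with reputation 0.5
# vote for c2. For each class, agreeing classifiers contribute r_i and
# disagreeing classifiers contribute 1 - r_i.
num_c1 = 0.9 ** 2 * 0.5 ** 3
num_c2 = 0.1 ** 2 * 0.5 ** 3
p_c1 = num_c1 / (num_c1 + num_c2)
p_c2 = num_c2 / (num_c1 + num_c2)
print(round(p_c1, 4), round(p_c2, 4))  # 0.9878 0.0122
```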
4.3.3 Asymptotic Analysis of SRB
The following Theorem is useful in analyzing the asymptotic behaviors of the SRB algo-
rithm.
Theorem 1. [37] Let $\Omega_1, \Omega_2, \ldots, \Omega_n$ be i.i.d. with distribution function $F(\omega)$, density $f(\omega)$,
finite mean μ, and finite variance $\sigma^2$. Also, let $0 < \alpha < 1$ and $\omega_\alpha$ be the $\alpha$th quantile of F, so
that $F(\omega_\alpha) = \alpha$. If $F(\omega)$ is continuous and $f(\omega_\alpha) > 0$, then:
$$\sqrt{n}\,(Y_n - \omega_\alpha) \to \mathcal{N}\!\left(0, \frac{\alpha(1-\alpha)}{f^2(\omega_\alpha)}\right), \qquad (4.4)$$
where $Y_n$ is the sample $\alpha$th quantile.

Proof. See [76].
Result 1. An interesting result of Theorem 1 is the asymptotic distribution of the sample
median. The median of the distribution F, $\omega_m$, satisfies $F(\omega_m) = \frac{1}{2}$. Therefore, according to
Theorem 1, when $n \to \infty$, the distribution of the sample median is $\mathcal{N}\!\left(\omega_m, \frac{1}{4n f^2(\omega_m)}\right)$.

Based on the above result, the function $maj(\omega)$ is asymptotically distributed according
to $\mathcal{N}\!\left(\omega_{(m)}, \frac{1}{4n f^2(\omega_{(m)})}\right)$ for large n. Moreover, SRB is asymptotically distributed according to
$\mathcal{N}\!\left(\omega_{(\alpha)}, \frac{\alpha(1-\alpha)}{n f^2(\omega_{(\alpha)})}\right)$.
Recall that we assume that the given input x belongs to the first class. Therefore, the
probability of error given x is
$$P_e = P(\omega \le 0.5) = F_\omega(0.5) = \int_0^{0.5} f_\omega(y)\, dy. \qquad (4.5)$$
For $n \to \infty$, the probability of error for the majority voting algorithm is
$$\Phi\!\left(\frac{0.5 - \omega_{(m)}}{\sigma(n)}\right) = \Phi\!\left(2\sqrt{n}\,(0.5 - \omega_{(m)}) \cdot f(\omega_{(m)})\right), \qquad (4.6)$$
where Φ is the cdf of $\mathcal{N}(0,1)$. Similarly, by Theorem 1, the SRB algorithm has the following
probability of error for large n:
$$\Phi\!\left(\frac{0.5 - \omega_{(\alpha)}}{\sigma(n)}\right) = \Phi\!\left(\frac{\sqrt{n}\,(0.5 - \omega_{(\alpha)}) \cdot f(\omega_{(\alpha)})}{\sqrt{\alpha(1-\alpha)}}\right), \qquad (4.7)$$
with $0.5 \le \alpha \le 1$.
Result 2. For the correct fused decision and under the assumption of uniform noise, the
probability of error for the SRB algorithm converges to zero faster than or as quickly as that
of the majority voting method.

Proof. Since $0.5 \le \alpha \le 1$, $\omega_{(m)} \le \omega_{(\alpha)}$ and $2 \le \frac{1}{\sqrt{\alpha(1-\alpha)}}$. For the true fused decision, $0.5 < \omega_{(m)} \le \omega_{(\alpha)}$. Moreover, for the uniform distribution, $f(\omega_{(m)}) = f(\omega_{(\alpha)}) > 0$, $\forall \alpha$. Therefore, the
arguments of Φ for both the majority voting and SRB algorithms are negative. Hence, when $n \to \infty$, the probability of error for both algorithms is $\Phi(-\infty) = 0$. Furthermore,
$$\frac{\sqrt{n}\,(0.5 - \omega_{(\alpha)}) \cdot f(\omega_{(\alpha)})}{\sqrt{\alpha(1-\alpha)}} \le 2\sqrt{n}\,(0.5 - \omega_{(m)}) \cdot f(\omega_{(m)}) < 0.$$
In other words, the probability of error for SRB converges to zero as quickly as, if not faster
than, the majority voting method.
Result 3. For the correct fused decision and under the assumption of Gaussian noise,
the probability of error for the SRB algorithm converges to zero faster than or as quickly as
the majority voting method.
Proof. For Gaussian noise, again $2 \le \frac{1}{\sqrt{\alpha(1-\alpha)}}$ and $\omega_{(m)} \le \omega_{(\alpha)}$. However, in this case,
$0 < f(\omega_{(\alpha)}) \le f(\omega_{(m)})$. Based on Equation (4.2), $\omega_i \sim \mathcal{N}(P, \sigma^2)$, where $\sigma^2$ is the power of
the Gaussian noise. Without loss of generality, we assume that σ = 1. To ascertain the rate of
convergence of the SRB algorithm, we compare the arguments of Φ in Equations (4.6) and
(4.7). In the Gaussian case,
$$\frac{(0.5 - \omega_{(\alpha)})}{\sqrt{\alpha(1-\alpha)}}\; e^{-\frac{(\omega_{(\alpha)} - P)^2}{2}} < 2\,(0.5 - P) < 0, \qquad (4.8)$$
proving Result 3. Note that $\omega_m = P$ and $f(\omega_m) = \frac{1}{\sqrt{2\pi}}$. Numerical simulations confirm that
the inequality (4.8) holds for $0.5 < \alpha < 1$.

Table 4.1: Example of 4 distributions of 10 test samples among different combinations of classifier outputs

            Classifier outputs                      Probability of correct classification
Pattern   111  101  011  001  110  100  010  000        Pmaj      PSRB
   1       4    0    0    1    1    2    1    1         0.5       0.7
   2       4    0    0    1    0    3    2    0         0.4       0.7
   3       2    0    3    0    1    4    0    0         0.6       1
   4       2    2    1    0    3    0    0    1         0.9       0.9
4.4 Medium and small n behavior
In the previous section, we characterized the behavior of the SRB algorithm in the large
classifier ensemble case, n → ∞. Here, we describe the small and medium n behavior of
the SRB algorithm. First, we contrast the "pattern of failure" and "pattern of success" of the
SRB algorithm against those of the majority voting algorithm for a specific example. Then,
we estimate the SRB’s probability of error via simulation. In the simulation part, instead of
using the approximate versions of error probability in Equations (4.6) and (4.7), we calculate
the exact probabilities of error using the binomial formula.
4.4.1 Patterns of Success and Failure
To quantify small and medium n behavior, we begin with an example based on the "pattern of success" and "pattern of failure" introduced in [66]. Assume an MCS with
three classifiers θ1,θ2,θ3. Let the probability of correct classification (reputation) for these
classifiers be 0.7, 0.6, and 0.5, respectively. Also assume there are 10 samples in the test set
of which the first classifier correctly labels exactly seven, the second classifier six, and the
third classifier five. Table 4.1 exemplifies some possible ways of distributing the 10 samples
into the eight possible combinations of outputs of the three classifiers. The column labels
are 3-bit strings, where each bit denotes a correct (1) or incorrect (0) classification for each
classifier. For example, 101 means that only the first and third classifier yield a correct classi-
fication. Each entry within a given column is the number of samples for which the specified
classifier output combination occurs. Each row in Table 4.1 is a different distribution of the
10 samples among the 8 unique combinations of classifier outputs. Following the terminology of [66], the distributions which represent best- or worst-case classification accuracy for a method are "patterns" of success or failure, respectively.
The accuracy of majority voting is given by 1/10 × the summation of the entries in columns 111, 101, 011, and 110, i.e., all columns with two or three correct votes. However, for the SRB algorithm, the output combination 100 should also be considered a correct decision of the system. By applying the SRB algorithm to the above classifiers, we notice that the vote of the first classifier, θ_1, is alone sufficient for a correct decision of the system.
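This accounting can be sketched in a few lines of Python (an illustration, not the thesis code; here the SRB decision is reduced to following θ_1, the top-reputation classifier, which reproduces the P_SRB entries of rows 1 and 2 of Table 4.1):

```python
# Illustrative sketch: accuracies for one row of Table 4.1. Column labels are
# 3-bit strings (1 = that classifier is correct). Majority voting is correct
# when at least two bits are 1; the simplified SRB rule used here just follows
# the top-reputation classifier (the first bit).
def accuracies(distribution):
    total = sum(distribution.values())
    p_maj = sum(n for bits, n in distribution.items() if bits.count("1") >= 2)
    p_srb = sum(n for bits, n in distribution.items() if bits[0] == "1")
    return p_maj / total, p_srb / total

row1 = {"111": 4, "101": 0, "011": 0, "001": 1,
        "110": 1, "100": 2, "010": 1, "000": 1}
accuracies(row1)  # -> (0.5, 0.7), matching row 1 of Table 4.1
```

Note that row 3 of the table shows the full SRB decision rule (Bayesian fusion of reputations) can exceed this simplified θ_1-only reading, so the sketch matches rows 1 and 2 only.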
Table 4.1 only depicts a subset of possible distributions. However, we have intentionally
included a pattern of success and a pattern of failure for both algorithms. The second row of Table 4.1 is a pattern of failure for both majority voting and SRB. For this distribution, the accuracy
of majority voting is 0.4 which is lower than the accuracy of the worst individual classifier.
However, for this pattern of failure, SRB is on par with the best individual classifier (0.7).
The third row of Table 4.1 depicts the pattern of success for the SRB algorithm. In the best
situation, the SRB can classify inputs with 100% accuracy. However, for this pattern, the
accuracy of majority voting is only 0.6 which is even lower than the performance of the best
individual classifier. The last row of the table is the pattern of success for majority voting. For this distribution, SRB performs comparably. Two major conclusions can be derived
from Table 4.1:
• The SRB algorithm always performs at least as well as majority voting. Even for the pattern of success of majority voting, the SRB algorithm performs comparably.

• The SRB algorithm always performs at least as well as the best individual classifier. In contrast, in some situations, the accuracy of majority voting can dip below that of the weakest individual classifier in the system.
4.4.2 Probability of error
In this section, we compare majority voting and SRB in terms of their probabilities of error.
Our simulation method is an extension of the method proposed in [1]. In [1], the accuracies of all the individual classifiers were considered equal. However, in our simulation method, we account for the fact that in reality some classifiers are stronger performers than others, and we generate classifiers with different levels of accuracy, drawn from the interval (1/2, 1). For
each point estimate P of the posterior probability of a given class, it is assumed that the ex-
perts’ estimates are the posterior probability plus zero-mean noise. Two different types of
noise were considered in this experiment: uniform and Gaussian noise. For uniform noise
defined on the interval [−b ,b ], we varied the value of b to modify the support of the distri-
bution. For Gaussian noise, the standard deviation σ was varied linearly to generate noise
with different powers. For both uniform and Gaussian cases, the simulation was replicated
at 10 different levels of noise. In [1], when the addition of noise values to P resulted in an es-
timate with a value higher than one or lower than zero, the value was clipped to one or zero,
respectively. However, as shown in [64], clipping the distribution may affect the calculation of the empirical error rate. Therefore, in our method, we interpret the individual estimates, ω_i's, as the amount of evidence for class i, rather than as a strict probability. Hence, we do not force ω_i to lie in the interval [0, 1]. We apply the threshold ω > 0.5 to the fused decision and calculate the probability of error, P_e, as P(ω ≤ 0.5). It has been shown in [64] that this strategy is more accurate than the clipping method.
4.4.3 Results for Uniform Noise
We assume that the classifiers estimate a fixed a posteriori value equal to P = 0.7. However, the classifiers have three different levels of accuracy: P1 = 0.7, P2 = 0.6, and P3 = 0.5. The additive noise is zero-mean and uniformly distributed. We change the
support of the distribution by changing the value of b from 0.1 to 1 linearly. At each value
for b , the classification error rate is calculated based on the estimates of the individual clas-
sifiers. Since the classifiers are independent, the probability of an incorrect decision can be
calculated using the binomial formula. First, we assume a system with n = 3 classifiers with accuracies equal to 0.5, 0.6, and 0.7. Then, at each step, two classifiers with accuracies of 0.5 and 0.7 are added to the system. For example, for n = 7 the accuracies of the individual classifiers are 0.5, 0.5, 0.5, 0.6, 0.7, 0.7, 0.7. Therefore, the average accuracy is always fixed and equal to 0.6. For the SRB algorithm, the reputation value of each expert is

r_i = 1 − P_{e_i},  if P_{e_i} ≤ 0.5;
r_i = 0.5,          if P_{e_i} > 0.5,   (4.9)

where P_{e_i} is the probability of error for that expert.
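For classifiers with unequal accuracies, the exact majority-voting error can be obtained by enumerating vote outcomes (the generalization of the binomial formula to unequal accuracies; a sketch under our naming):

```python
from itertools import product

def majority_error(accs):
    """Exact probability that no majority of independent classifiers with
    accuracies `accs` votes correctly (enumeration over all vote outcomes)."""
    n = len(accs)  # assumed odd, as in the experiments above
    err = 0.0
    for outcome in product([0, 1], repeat=n):  # 1 = correct vote
        prob = 1.0
        for correct, p in zip(outcome, accs):
            prob *= p if correct else 1 - p
        if sum(outcome) <= n // 2:  # fewer than half correct
            err += prob
    return err

majority_error([0.5, 0.6, 0.7])  # ≈ 0.35
```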
Figures 4.1-(a) and 4.1-(b) show the classification error for systems with 3 and 15 clas-
sifiers respectively. As these figures show, the classification error for the SRB algorithm is
lower than that of majority voting. By comparing Figure 4.1-(a) with Figure 4.1-(b), we see that the error rate decreases as the number of classifiers increases.
To illustrate the effect of the number of classifiers on the classification error rate, we
[Figure 4.1 shows probability of error (%) versus the b value.]
Figure 4.1: Comparison between majority voting and the SRB algorithm for different uniform noise distributions, for (a) three, and (b) fifteen classifiers.
find the fused estimate of the classifiers for a fixed b value (b = 0.3) and different numbers of classifiers. Figure 4.2 shows the result of this experiment. As expected, increasing the number of classifiers decreases the classification error rate. The SRB algorithm has a lower probability of error, and therefore higher overall accuracy, than majority voting for all classifier ensembles.
4.4.4 Results for Gaussian Noise
We assumed that the noise is zero-mean Gaussian with power σ2. We varied the power of
the noise by adjusting σ linearly from 0.1 to 1. At each level of noise, we calculated the
classification error rate for the SRB and majority voting methods.
Figures 4.3-(a) and 4.3-(b) depict the classification error rate for 3 and 15 classifiers, respectively. As these figures show, increasing the power of the noise increases the classification error of both the SRB and majority voting methods. Also, for both algorithms, the classification error decreases as the number of classifiers grows from three to
[Figure 4.2 shows probability of error (%) versus the number of classifiers.]
Figure 4.2: Comparison between majority voting and the SRB algorithm for uniform noise with b = 0.3 and different numbers of classifiers.
fifteen.
Figure 4.4 shows the classification error for a fixed standard deviation, σ = 0.3, and different numbers of classifiers. As expected, increasing the number of classifiers decreases the classification error of the system. For all classifier ensembles, the SRB algorithm outperforms majority voting.
4.5 Conclusion
In this paper, we determined that for asymptotically large classifier ensembles under uniform and Gaussian noise assumptions, the probability of error of the static reputation-based algorithm converges no slower than that of majority voting. For finite ensembles, our analysis revealed that the accuracy of the SRB algorithm is bounded below by the accuracy of the best
[Figure 4.3 shows probability of error (%) versus the standard deviation of the noise.]
Figure 4.3: Comparison between majority voting and the SRB algorithm for different Gaussian noise powers, for (a) three, and (b) fifteen classifiers, respectively.
individual classifier in the ensemble. Finally, for finite ensembles, the probability of error of
the SRB decreases as the number of classifiers grows, regardless of whether point estimation
errors are uniform or Gaussian, and remains at all times below that of majority voting.
[Figure 4.4 shows probability of error (%) versus the number of classifiers.]
Figure 4.4: Comparison between majority voting and the SRB algorithm for a fixed standard deviation of the Gaussian noise (σ = 0.3) and different numbers of classifiers.
Chapter 5
Time-Evolving Reputation-Based
Classifier and its Application on
Classification of Physiological Responses
of Children With Disabilities
The contents of this chapter are reproduced from a journal article currently under review:
Nikjoo, M., Kushki, A., Andrews, A., and Chau, T., "Time-Evolving Reputation-Based Classi-
fier and its Application on Classification of Physiological Responses of Children With Dis-
abilities", Information Fusion.
5.1 Abstract
This chapter proposes a novel dynamic algorithm for fusing classifiers in a multi-classifier
system (MCS). The algorithm uses the Dirichlet distribution to continuously compute rep-
utation values for each individual classifier over time. We show that the proposed algorithm
is particularly advantageous for non-stationary signals, where training and test sets may be collected at temporally distant instances and hence are statistically disparate. The
algorithm adjusts the contribution of each classifier to the final decision adaptively based
on the samples of the test set. We demonstrate the advantage of the proposed algorithm
over majority voting and a static reputation-based MCS in the classification of physiologi-
cal arousal states in individuals with severe disabilities.
5.2 Introduction
The classification of multiple physiological or biomechanical signals is a technical challenge
encountered in many bioengineering applications such as biometrics, brain-computer in-
terfaces (BCI) [113, 122, 101], dysphagia detection using accelerometry signals [91, 74] and
emotion detection using physiological signals [95, 112].
A traditional approach to these classification problems is to use a single classifier. In this
approach, all the features gathered from different signals are stacked in one vector and fed
to a single classifier. Despite its simplicity, the single classifier system is susceptible to the
curse of dimensionality [34]; when working with a high-dimensional feature vector, training
necessitates voluminous data and convergence can be extremely slow. Thus, a single clas-
sifier system may be intractable for many online biomedical applications with large feature
spaces [56].
A Multiple Classifier System (MCS) judiciously integrates decisions from multiple classi-
fiers and offers a promising alternative to a single classifier in the situations outlined above.
In particular, MCSs can enhance classification performance (e.g., accuracy) and accelerate
classification in online applications [134, 65]. MCSs also accommodate combinations of
different individual classifiers each optimized to unique feature subsets. For instance, in
body-machine control interfaces, physiological (e.g., blood volume pulse, cortical hemody-
namics), biomechanical (e.g., limb motion) and environmental noise (e.g., ambient acous-
tics) provide different types of features for classification. It may be sensible to train one
classifier on each type of feature [56]. Likewise, in clinical data collection, there may be
multiple training sets, arriving at different times or under slightly different circumstances
[19, 50]. In such cases, individual classifiers might be trained on each available data set.
MCSs may combine classifiers on three levels: abstract, rank, or measurement level [135].
At the abstract level, each individual classifier outputs a unique class label. These labels are
then fused using methods such as majority voting. At the rank level, the output of each
classifier is a subset of class labels ranked in terms of the classifier’s belief about the class to
which the input belongs. At the measurement level, each classifier outputs a real value for
each class, representing the probability with which the input belongs to a given class.
Abstract-level MCSs have been the most widely applied in the literature [79, 102, 88]
given the flexibility of the abstract level output, i.e., class labels may be derived from any
type of classifier and outputs from multiple classifiers, each designed within a different con-
text, can be easily combined.
Several methods have been proposed for combining class labels at the abstract level.
Among these, majority voting [117, 53] has received special attention. In this democratic
scheme, the class with the most votes from the individual classifiers is selected as the MCS
decision. Majority voting can increase the overall performance (e.g., accuracy) of a classifi-
cation system in a wide variety of circumstances [135, 86]. Weighted majority voting [135],
is an extension of simple majority voting, where the weight of each classifier is determined
according to its belief, i.e., its certainty about a given subset of the training set. However,
the algorithm fails when one or more beliefs are zero [65], a degenerate condition that may
occur in multi-class problems. More importantly, beliefs are computed exclusively using
samples of the training set and thus there is a risk of poor generalization when the data are
non-stationary. We previously introduced the static reputation algorithm [88], an MCS which estimates classifier weights or reputation values over a validation set, without encountering degenerate conditions. Using a combination of neural networks [88] or linear discriminants [91], the static reputation algorithm classified noisy biomechanical signals with significantly higher accuracies than those achievable by single classifier systems. Nonetheless,
fixed reputation values were assigned to the individual classifiers. Therefore, the static rep-
utation algorithm is also susceptible to diminished performance in the face of statistical
disparity between samples of the test and validation sets. This disparity is of particular
concern in the classification of non-stationary physiological signals. Specifically, if one of
the individual classifiers performs poorly with the test data, the importance of its vote re-
mains unaltered, thereby negatively influencing the overall decision. We propose a dynamic
weighted voting method as a solution to this disparity issue.
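For concreteness, abstract-level majority voting and its weighted variant can be sketched as follows (a minimal illustration; the function names are ours):

```python
from collections import Counter

def majority_vote(labels):
    """Simple majority voting over abstract-level class labels."""
    return Counter(labels).most_common(1)[0][0]

def weighted_majority_vote(labels, weights):
    """Each classifier's vote counts with its (e.g., belief-based) weight."""
    scores = Counter()
    for label, w in zip(labels, weights):
        scores[label] += w
    return scores.most_common(1)[0][0]

majority_vote([1, 0, 0])                            # -> 0
weighted_majority_vote([1, 0, 0], [0.9, 0.3, 0.4])  # -> 1 (0.9 > 0.3 + 0.4)
```

The second call illustrates how a strong classifier's weight can overturn the simple majority.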
5.2.1 Dynamic weighted voting
Dynamic weighted voting first surfaced in the business literature as a means of tracking
concept drift, i.e., the evolution of dynamic concepts such as customer preferences [103].
Kolter and Maloof [63] proposed a novel tracking algorithm based on the notion of dynamic
weighted majority voting, where several individual classifiers were trained and their weights
were adaptively updated based on their correctness of categorizing training exemplars. Sun
[118] also presented a dynamic weighting algorithm based on the local within-class nearest
neighbors concept. In this algorithm, individual classifiers receive different weights based
on their performance over a weighting set, a single hold-out set distinct from the training
set. Although the weights of individual classifiers change over time, the above methods
do not address the disparity problem since they only consider performance on a train-
ing or weighting set rather than a previously unseen test set. In contrast, Valdovinos and
Sanchez [127] proposed an algorithm to directly address the disparity problem. In their ap-
proach, each classifier calculated the distance between the unknown input vector and its
nearest neighbor in the training set. Each classifier received a weight inversely proportional
to this distance, i.e., the classifier with the shortest nearest neighbor distance received the
highest weight. Although this method adaptively assigned classifier weights, it only oper-
ated at the measurement level. Further, the calculated distance was the only consideration
in classifier weight assignment, exposing the multi-classifier system to the risk of heavily
weighing a poorly performing classifier. Other important factors, such as the past perfor-
mance of a classifier, were not considered. Finally, the choice of distance metric is often
problem-specific and hence a non-trivial challenge.
Recently, unsupervised adaptive classifiers have attracted the attention of several researchers, especially those working in the area of brain-computer interfaces (BCI). In [12], the authors proposed an adaptive linear discriminant analysis (ALDA) method for online clustering of EEG patterns using a Gaussian mixture model (GMM). In this method, the
initial data distribution is estimated using a labeled training period. The major problem with this method is that if the feature distribution shifts far away from that of the training session during long-term use, the initial parameter values are no longer valid [80]. Therefore, this
method is not able to properly deal with the disparity problem. Hasan et al. [46] proposed
another unsupervised classification method based on GMM. In this algorithm, a sequential
EM method is used to adapt the GMM-based classifier in a simulated online system. Because the data set used in [46] was modest in duration, it is not possible to accurately measure the robustness of this method against the disparity problem. Gan [42] has proposed a
self-adapting unsupervised learning system to update the mean and variance of Bayesian
classifiers. However, in this method, it is not possible to decrease the effect of old data
[128]. Awate et al. [3] proposed an adaptive unsupervised algorithm for MRI brain-tissue
classification using a non-parametric Markov model. In this algorithm, the underlying im-
age statistics are learned automatically and a classification strategy based on that model
is constructed. The computational complexity of this method is very high, which makes it
unattractive for real-time applications.
5.2.2 Dynamic reputation
To address the disparity problem, we propose a time-evolving reputation-based algorithm.
Building on the static reputation algorithm [88], each classifier adaptively constructs a reputation or weight for itself, initially according to the classifier's performance on a validation
set and subsequently according to its performance on the test set. In other words, the rep-
utation of each classifier is updated dynamically whenever a new sample is classified. At
any point in time, a classifier’s reputation reflects the classifier’s performance on both the
validation and the test sets. Therefore, the effect of random high-performance of weak clas-
sifiers is appropriately moderated and likewise, the effect of a poorly performing individual
classifier is mitigated as its reputation value, and hence overall influence on the final de-
cision is diminished. The proposed algorithm functions at the abstract level and does not
require any computationally intense individual classifiers. Because of the time-evolving up-
date scheme, we call this algorithm the time-evolving reputation-based algorithm.
In the following, we present the mathematical background behind the time-evolving al-
gorithm (Section 2), review the static reputation-based algorithm (Section 3) and propose
the time-evolving reputation-based algorithm (Section 4). We exemplify the algorithm us-
ing a multivariate non-stationary physiological data set and contrast performance against
that of majority voting and the static reputation algorithm, two abstract level multi-classifier
systems.
5.3 Mathematical Background
5.3.1 Notation
Assume the time series, S, is the pre-processed version of an acquired signal. Also let Θ = {θ_1, θ_2, ..., θ_L} be a set of L ≥ 2 classifiers and Ω = {ω_1, ω_2, ..., ω_c} be a set of c ≥ 2 class labels, where ω_j ≠ ω_k, ∀ j ≠ k. Without loss of generality, Ω ⊂ N. The input of each classifier is the feature vector x ∈ R^{n_i}, where n_i is the dimension of the feature space for the i-th classifier θ_i, whose output is a class label ω_j, j = 1, ..., c. In other words, the i-th classifier, i = 1, ..., L, is a functional mapping, R^{n_i} → Ω, which for each input x gives an output θ_i(x) ∈ Ω. Generally, the classifier function could be linear or non-linear. It is assumed that for the i-th classifier, a total number of d_i samples is assigned for training. Symbols in bold signify vectors.
5.3.2 Properties of the Dirichlet Distribution
For a better understanding of the proposed algorithm, we briefly introduce the Dirichlet
distribution.
The Dirichlet distribution is a continuous multivariate probability distribution. Assume an L-component random vector U = (U_1, ..., U_L), where each U_i > 0 and Σ_{i=1}^{L} U_i = 1. The parameters of the Dirichlet distribution are L positive real numbers, α = (α_1, ..., α_L), each corresponding to one of the random variables U_i. The parameter α_i is also known as the evidence for U_i. The Dirichlet density can be defined mathematically as [58, 132]

f(u_1, ..., u_L; α_1, ..., α_L) = (1/H(α)) ∏_{i=1}^{L} u_i^{α_i − 1},   (5.1)

where H(α) is a normalization factor expressed in terms of the Gamma function, Γ:

H(α) = ∏_{i=1}^{L} Γ(α_i) / Γ(Σ_{i=1}^{L} α_i).   (5.2)

The expected value of any of the L random variables, U_i, is defined as follows:

E(U_i | α) = α_i / Σ_{j=1}^{L} α_j.   (5.3)
5.4 Static Reputation-Based Algorithm
We summarize the static reputation-based algorithm [88], upon which the dynamic reputation algorithm is based. Let p(ω_j) be the prior probability of class ω_j. Suppose we have a multi-classifier system with L distinct classifiers.
1. Train the L ≥ 2 individual classifiers using the training set.
2. Evaluate the accuracy of the i-th classifier, i = 1, ..., L, on a validation set (disjoint from the training set). Normalize this accuracy to obtain the reputation, r_i ∈ [0, 1], for the i-th classifier.
3. For each feature vector, x, in the test set, obtain L decisions using the L distinct classifiers:

Ω(x) = {θ_1(x), θ_2(x), ..., θ_L(x)}.   (5.4)
4. Sort the reputation values of the classifiers in descending order,

R* = (r_1*, r_2*, ..., r_L*),   (5.5)

such that r_1* ≥ r_2* ≥ ... ≥ r_L*. Now, a reputation-ordered set of classifiers, Θ*, is obtained by ranking the classifiers,

Θ* = (θ_1*, θ_2*, ..., θ_L*),   (5.6)

i.e., θ_1* and θ_L* are the classifiers with the highest and lowest reputations, respectively.
5. Observe the vote of the first m classifiers in the reputation-ordered set, where

m = L/2,       if L is even,
m = (L+1)/2,   if L is odd.   (5.7)
If all of these m classifiers vote for the same class, this class is selected as the final
decision of the system. If not, the final decision of the system is obtained using the
following step.
6. Using Bayes' rule and assuming independent classifiers, estimate the posterior probability as

p(ω_j | θ_1, ..., θ_L) = p(ω_j) ∏_{i=1}^{L} p(θ_i | ω_j) / [Σ_{t=1}^{c} p(ω_t) ∏_{i=1}^{L} p(θ_i | ω_t)].   (5.8)

For simplicity, we have dropped the argument of θ above, but it is understood to be a function of x. The conditional probability p(θ_i | ω_j) in (5.8) is given by

p(θ_i | ω_j) = r_i,       if x ∈ ω_j and θ_i(x) = ω_j,
p(θ_i | ω_j) = 1 − r_i,   if x ∈ ω_j and θ_i(x) ≠ ω_j.   (5.9)

When there is no known priority among classes, we can assume equal prior probabilities. Hence,

p(ω_1) = p(ω_2) = ... = p(ω_c) = 1/c.   (5.10)
7. For each class, ω_j, estimate the a posteriori probability as given by (5.8), using (5.9) and (5.10). Select the class with the highest posterior probability as the final decision of the system.
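Steps 4 through 7 can be sketched as follows for a binary problem (an illustration under our naming; classifier outputs and reputation values are assumed to be given):

```python
# Illustrative sketch of the static reputation-based (SRB) fusion steps 4-7.
def srb_fuse(decisions, reputations, classes, priors=None):
    """decisions[i]: label output by classifier i; reputations[i] in (0.5, 1]."""
    L = len(decisions)
    priors = priors or {w: 1.0 / len(classes) for w in classes}
    # Steps 4-5: rank classifiers by reputation and check top-m agreement.
    order = sorted(range(L), key=lambda i: reputations[i], reverse=True)
    m = L // 2 if L % 2 == 0 else (L + 1) // 2
    top = [decisions[i] for i in order[:m]]
    if len(set(top)) == 1:
        return top[0]
    # Steps 6-7: Bayes fusion with p(theta_i|w) = r_i if agree, else 1 - r_i.
    post = {}
    for w in classes:
        p = priors[w]
        for d, r in zip(decisions, reputations):
            p *= r if d == w else 1 - r
        post[w] = p
    return max(post, key=post.get)

srb_fuse(decisions=[1, 0, 1], reputations=[0.9, 0.6, 0.7], classes=[0, 1])  # -> 1
```

Here the two highest-reputation classifiers (0.9 and 0.7) agree on class 1, so step 5 decides without invoking the Bayes step.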
5.5 Time-Evolving Reputation-Based Algorithm Using the Dirich-
let Distribution
The static reputation-based algorithm improves upon the performance of the majority vot-
ing algorithm whenever there is a high correlation between samples of the validation and
test sets. However, the static algorithm performs poorly when samples of the validation and
the test sets are statistically disparate. To address this issue, the time-evolving reputation-
based algorithm updates the reputation values based on the performance of the individual
classifiers on the test set. To do so, we use the Dirichlet reputation system [58], which has
been touted for its flexibility, simplicity and representational capacity [132]. The Dirichlet
reputation system captures a sequence of observations and updates the a priori reputation
value for each classifier. By adding the observations to the a priori information, the total
evidence for each outcome can be written as follows:
α_i = v_i + Cβ_i,   (5.11)

where C > 0 is a constant, β = (β_1, ..., β_L) is the vector of initial confidence values of the L classifiers, with the condition Σ_{i=1}^{L} β_i = 1 and β_i > 0, and v = (v_1, ..., v_L), where v_i ≥ 0, is the vector of a posteriori evidence for the L classifiers. When no other evidence is available, it is usually assumed that

β_1 = ... = β_L = 1/L.   (5.12)
In our method, because the initial reputation is known, βi is a real number related to the
initial reputation value of the i-th classifier. In the spirit of (5.3), we can write the expected weight of the i-th classifier as a function of the initial confidence in that classifier, β_i, and the accumulated evidence, v_i, namely,

E(U_i | v, β) = (v_i + Cβ_i) / (C + Σ_{j=1}^{L} v_j).   (5.13)
By construction, (5.13) is between 0 and 1 and its summation over i is unity. Equation
(5.13) allows us to update the expected value of the Dirichlet distribution using evidence
from posterior observations. With equal initial confidence values for all the outcomes, the
expected values of different outcomes are equal to each other. However, after recording
several observations and updating the values of vi in (5.13), it is possible that the posterior
expected values become different for different outcomes.
To illustrate the application of this equation, we provide an example. Assume there are three card players, A_1, A_2, and A_3 (L = 3), who play a specific game several times, and each time one of them wins. When no a priori information is available, we assume that the initial probabilities of winning for the three players are β_1 = β_2 = β_3 = 1/3. For this example, let C = 2 and the initial observations be v_1 = v_2 = v_3 = 0. Inserting these values into (5.13), we find that the initial probability of winning the game is, as expected, uniformly 1/3 for each of the players. Now assume the participants played the game 10 times and A_1 won 8 times while A_2 and A_3 each won once, i.e., v_1 = 8, v_2 = 1, and v_3 = 1. Inserting this new posterior evidence into (5.13), we find that E(P(A_1)) = 26/36, E(P(A_2)) = 5/36, and E(P(A_3)) = 5/36. In other words, by augmenting our initial knowledge with posterior evidence,
the probability of selecting the first player as the winner of the game has increased while the
probability of selecting the other players has decreased. By the same token, this property of
the Dirichlet reputation system can be exploited for the automatic updating of reputation
values of individual classifiers based on their a posteriori performance on the test set.
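The card-player example can be reproduced directly from (5.13) (a sketch; exact fractions are used to match the numbers above):

```python
# Sketch of the Dirichlet reputation update of Eq. (5.13), reproducing the
# card-player example: C = 2, uniform initial confidence 1/3, and posterior
# evidence v = (8, 1, 1).
from fractions import Fraction

def dirichlet_expectation(v, beta, C):
    """E(U_i | v, beta) = (v_i + C*beta_i) / (C + sum_j v_j), Eq. (5.13)."""
    denom = C + sum(v)
    return [(vi + C * bi) / denom for vi, bi in zip(v, beta)]

beta = [Fraction(1, 3)] * 3
dirichlet_expectation([8, 1, 1], beta, C=2)  # -> [13/18, 5/36, 5/36], i.e. 26/36, 5/36, 5/36
```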
Now, let us return to the initial problem of combining multiple classifiers. Recall that
our main goal is to develop a time-evolving reputation-based algorithm that can adapt to the statistical disparity between training and testing data in non-stationary physiological signals.
To automatically update our confidence in each classifier over time, we extend (5.13) to
account for the passage of time, namely,
β_i[k+1] = (v_i[k+1] + Cβ_i[k]) / (C + Σ_{j=1}^{L} v_j[k+1]),   (5.14)

where k is the time index, incremented whenever a new subject is classified, β_i[k] is the confidence in the i-th classifier at time k, and v_i[k] is the evidence in favor of classifier i at time k.
Let us examine a voting scheme by which the evidence values v_i[k] can be determined at each time step. After training, individual classifiers are evaluated on a validation set to es-
timate their initial reputation values. Then, each sample from the test set is classified by
Figure 5.1: An example of the voting scheme in the proposed algorithm.
each classifier into one of the possible classes. The classifiers scrutinize their outputs and
vote for each other if their decisions agree. In other words, classifier i votes for classifier j ,
and receives a vote from classifier j , if both classify the input sample into the same class.
This voting scheme is exemplified in Figure 5.1 where classifiers C1,C4,C6 and C7 mutually
endorse each other. Likewise, classifiers C2,C3,C5 cast votes for one another.
The classifier group with the highest number of votes, i.e., largest number of classifiers
in agreement, is declared the winning group. For each member of the winning group,
the accumulated votes are recorded as the corresponding value v_i[k] in (5.14). Hence, 0 ≤ v_i[k] < L. The classifiers outside of the winning group are assigned a null vote, i.e., v_i[k] = 0.
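The evidence values implied by this voting scheme can be sketched as follows (our reading: each member of the largest agreeing group receives one vote from every other member, so v_i[k] equals the group size minus one, consistent with 0 ≤ v_i[k] < L):

```python
from collections import Counter

def evidence_votes(decisions):
    """v_i[k] for each classifier: members of the largest agreeing group get
    one vote from every other member (group size - 1); all others get 0.
    Tie-breaking between equally large groups is our assumption."""
    counts = Counter(decisions)
    win_label, win_size = counts.most_common(1)[0]
    return [win_size - 1 if d == win_label else 0 for d in decisions]

# As in the Figure 5.1 example: C1, C4, C6, C7 agree (group of 4) and
# C2, C3, C5 agree (group of 3).
evidence_votes([1, 0, 0, 1, 0, 1, 1])  # -> [3, 0, 0, 3, 0, 3, 3]
```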
The new confidence of each classifier can then be calculated using (5.14). For the initial
confidence in each classifier, we use the reputation values calculated by the static reputation
algorithm, namely,

β_i[0] = r_i[0] / Σ_{j=1}^{L} r_j[0],   (5.15)
where r_i[0] is the initial reputation value of the i-th classifier computed using the validation set. This choice of initial confidence values ensures that our method performs comparably to the static reputation-based algorithm when there is a high correlation between samples of the validation and test sets. Based on [88], all the reputation values should satisfy the following condition:

0.5 < r_i ≤ 1.   (5.16)
Although equation (5.14) gives us an updating rule for the confidence values, we still need a method to estimate the time-evolving reputation values, which are required for the final decision of the multiple classifier system. Based on equation (5.15), the relationship between β_i and r_i is non-trivial and non-invertible. We next define a rule for updating the reputation values.
5.5.1 An Updating Rule for Reputation Values
The new reputation value of classifier i can be calculated as a function of the previous rep-
utation value,
r_i[k+1] = f(r_i[k]),    (5.17)

where f could, in general, be a non-linear function. After updating the reputation values of
all classifiers using (5.17), the final decision of the system about the input sample should be
made using (5.8) and Bayes theorem as exemplified in [88] and [91].
Let us more closely examine an updating function for (5.17). For simplicity, we assume
that there are two possible classes for each sample, i.e., Ω = {ω1, ω2}, and the number of clas-
sifiers L is odd. The basic principle underlying our update rule is as follows. If for a given
test sample, the confidence in a classifier undergoes a maximum positive change, then the
classifier’s reputation is maximized. Conversely, if the confidence in the classifier decreases
maximally, its reputation becomes 0.5. In all other instances, the reputation should change
in a manner proportional to the relative change in confidence. To formalize, denote the
maximum and minimum possible confidence in the i-th classifier as β_i^max and β_i^min,
respectively. Based on (5.15), β_i[0] is maximized when r_i[0] = 1 and r_j[0] = 0.5 ∀ j ≠ i. In
other words, the maximum confidence in the i-th classifier is

β_i^max = 1 / (0.5 × (L − 1) + 1) = 2 / (L + 1).    (5.18)

For example, for an MCS with L = 3 classifiers, β^max is 0.5, which is reached when the reputa-
tion of one classifier is 1 and the other two are 0.5.

Again, based on (5.15), β_i is minimized when r_i = 0.5 and r_j[0] = 1 ∀ j ≠ i. Hence, the
minimum confidence in the i-th classifier is

β_i^min = 0.5 / ((L − 1) + 0.5) = 1 / (2L − 1).    (5.19)

Based on (5.18) and (5.19), the range of β for each classifier is β ∈ [1/(2L − 1), 2/(L + 1)].
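These bounds are easy to confirm numerically. The sketch below (illustrative, not thesis code) evaluates (5.15) for classifier 1 over a grid of reputation values in [0.5, 1] with L = 3 and recovers the extremes 1/(2L − 1) = 0.2 and 2/(L + 1) = 0.5:

```python
import itertools

L = 3
grid = [(50 + t) / 100 for t in range(51)]   # reputations sampled over [0.5, 1]
# confidence of classifier 1 per (5.15), over all reputation assignments
betas = [r[0] / sum(r) for r in itertools.product(grid, repeat=L)]
print(min(betas), max(betas))  # 0.2 0.5, i.e. 1/(2L-1) and 2/(L+1)
```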
To update the reputation value of each classifier, we first need to find the change in the
corresponding confidence value. To do so, we use (5.14) and calculate the difference of
confidence values in one time interval as follows:
β_i[k+1] − β_i[k] = (v_i[k+1] + C β_i[k]) / (C + ∑_{i=1}^{L} v_i[k+1]) − β_i[k]    (5.20)
                 = (v_i[k+1] − β_i[k] ∑_{i=1}^{L} v_i[k+1]) / (C + ∑_{i=1}^{L} v_i[k+1]).
Theorem 1. Equation (5.20) is maximized with respect to v_i when v_i[k+1] is equal to
(L − 1)/2, in which case ∑_{i=1}^{L} v_i[k+1] = (L² − 1)/4.
Proof. See Appendix A.
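Theorem 1 can be spot-checked numerically. In the sketch below (our illustration), a winning-group member receiving v votes implies a vote total of v(v + 1), as derived in Appendix A, and (5.20) is evaluated in the C → 0 limit:

```python
L = 7                      # an odd number of classifiers
beta_prev = 0.2            # some fixed previous confidence beta_i[k]
C = 1e-9                   # approximate the C -> 0 limit

def delta_beta(v):
    total = v * (v + 1)    # sum of all votes when each winner receives v (Appendix A)
    return (v - beta_prev * total) / (C + total)   # equation (5.20)

candidates = range((L - 1) // 2, L)   # admissible v_i for a winning-group member
best = max(candidates, key=delta_beta)
print(best)                # 3, i.e. (L - 1)/2, as Theorem 1 states
```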
Combining the value of v_i[k+1] given by Theorem 1 with the minimum confidence β_i^min
from (5.19), the maximum change in confidence can be obtained as follows:

max_{v_i, β_i} (β_i[k+1] − β_i[k]) = ((L − 1)/2 − (L² − 1)/(4(2L − 1))) / (C + (L² − 1)/4)    (5.21)
                                   = 3(L − 1)² / ((2L − 1)(L² − 1 + 4C)).

We see that (5.21) approaches its maximum as C → 0:

β_{i−max} = max_C max_{v_i, β_i} (β_i[k+1] − β_i[k]) = 3(L − 1) / ((2L − 1)(L + 1)).    (5.22)
Theorem 2. Equation (5.20) is minimized with respect to v_i when v_i[k+1] = 0, in which
case ∑_{i=1}^{L} v_i[k+1] = (L − 1)(L − 2).

Proof. See Appendix B.

Based on Theorem 2, we have

β_{i−min} = min (β_i[k+1] − β_i[k]) = −2 / (L + 1).    (5.23)
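The two extremes (5.22) and (5.23) can likewise be confirmed by brute force. The sketch below (illustrative) enumerates the possible winning-group sizes for L = 5, evaluates (5.20) for a winning and a losing member with confidence ranging over [1/(2L − 1), 2/(L + 1)], and recovers both bounds as C → 0:

```python
L = 5
C = 1e-12                                      # approximate the C -> 0 limit
b_lo, b_hi = 1 / (2 * L - 1), 2 / (L + 1)      # confidence range from (5.18)-(5.19)
hs = []
for m in range((L + 1) // 2, L + 1):           # possible winning-group sizes
    total = m * (m - 1)                        # total votes cast among winners
    for v in (m - 1, 0):                       # a winning member vs. a losing member
        for t in range(101):
            beta = b_lo + (b_hi - b_lo) * t / 100
            hs.append((v - beta * total) / (C + total))   # equation (5.20)
print(max(hs))   # ~  3(L-1)/((2L-1)(L+1)) =  0.2222...
print(min(hs))   # ~ -2/(L+1)              = -0.3333...
```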
Now, by using (5.22) and (5.23), we define two new parameters for classifier i as
follows:

∆+ = (β_i[k+1] − β_i[k]) × (2L − 1)(L + 1) / (3(L − 1)),    (5.24)

and

∆− = (β_i[k+1] − β_i[k]) × (L + 1)/2.    (5.25)
When βi [k + 1] ≥ βi [k ], ∆+ is the relative change in confidence, βi , with respect to the
maximum possible positive change. Likewise, when βi [k + 1] < βi [k ], ∆− is the relative
change of βi with respect to the maximum negative change. Now we are ready to define a
reputation update rule. Considering (5.16) and one time index k , the reputation value may
undergo a maximum increase of 1− ri [k − 1] and a maximum decrease of ri [k − 1]− 0.5.
Therefore, for the i-th classifier, the reputation value can be updated as follows:

r_i[k+1] = r_i[k] + (1 − r_i[k]) ∆+,    if β_i[k+1] ≥ β_i[k],
r_i[k+1] = r_i[k] + (r_i[k] − 0.5) ∆−,  if β_i[k+1] < β_i[k].    (5.26)
We should note here that∆+ ∈ (0,1). Let us consider two limiting cases of the reputation
update rule. According to (5.14), when C →∞, βi [k + 1]→ βi [k ] and hence ∆+→ 0. Thus,
in this case ri [k + 1]→ ri [k ]. Upon inspection of (5.14), we realize that in this first limiting
case, the current votes of the other classifiers are completely ignored and the overall deci-
sion is modulated only by the original reputations determined from the validation set. On
the other hand, when C → 0, the value of∆+ approaches unity. In this limiting case, the past
performance of a classifier is completely neglected and the weighting of individual classi-
fiers rests solely on the current votes among classifiers. According to (5.26), for this case,
ri [k +1]→ 1.
We also note that∆− ∈ (−1, 0). When C →∞,∆− converges to 0, and ri [k +1] converges
to ri [k ]. Likewise, when C → 0,∆− approaches−1, and thus ri [k +1]→ 0.5. Therefore, in all
limiting cases, (5.26) satisfies the initial requirement that 0.5< ri ≤ 1, as defined in (5.16).
We further illustrate the proposed update rule with an example. Assume an MCS with
L = 3 classifiers, r1 = 0.7, r2 = 0.7, and r3 = 0.6. The initial confidence values for these
classifiers are β1[0] = 0.35, β2[0] = 0.35, and β3[0] = 0.3 respectively. Now, assume at time
k = 1, classifiers θ1 and θ2 vote similarly but classifier θ3 votes differently. In Table 5.1, we
list the updated reputation values for different values of C .
For C = 0.01, the votes of the classifiers are important factors in updating the reputation
values. In this case, a positive vote from one classifier increases the reputation values signif-
icantly. By looking at equation (5.14), we see that as C → 0, βi [k +1] changes independently
Table 5.1: An example to clarify the updating rule.
C r1[1] r2[1] r3[1]
0.01 0.849 0.849 0.540
2 0.775 0.775 0.570
10 0.725 0.725 0.595
100 0.703 0.703 0.598
1000 0.700 0.700 0.599
of βi [k ]. Therefore, for small values of C , the proposed algorithm converges to the majority
voting algorithm, where the final decision of the system is based only on the instantaneous
votes of the classifiers. On the other hand, Table 5.1 shows that for C = 1000, the reputa-
tion values essentially become constant and hence the votes of the classifiers are neglected.
Based on equation (5.14), when C →∞, βi [k + 1]→ βi [k ]. Therefore, the confidence val-
ues and by extension, reputation values become nearly constant. From this example, we
see that based on the choice of C , reputation values can have differential effects on the fi-
nal decision of the classifier system. The appropriate choice for C depends on the desired
’performance memory’ and should be tuned empirically for each application.
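Table 5.1 can be reproduced directly from equations (5.14), (5.15), and (5.24)–(5.26). The function below is our illustrative implementation (not thesis code); running it with the votes (1, 1, 0) of this example recovers, for instance, the C = 0.01 and C = 2 rows of the table:

```python
def update_reputations(r, votes, C):
    """One update step: confidences from (5.15), confidence update (5.14),
    then the reputation update rule (5.24)-(5.26)."""
    L = len(r)
    beta = [ri / sum(r) for ri in r]          # current confidences, per (5.15)
    total = sum(votes)
    r_new = []
    for ri, b, v in zip(r, beta, votes):
        b_new = (v + C * b) / (C + total)     # confidence update (5.14)
        if b_new >= b:
            d = (b_new - b) * (2 * L - 1) * (L + 1) / (3 * (L - 1))   # (5.24)
            r_new.append(ri + (1 - ri) * d)   # upper branch of (5.26)
        else:
            d = (b_new - b) * (L + 1) / 2     # (5.25)
            r_new.append(ri + (ri - 0.5) * d) # lower branch of (5.26)
    return r_new

# L = 3, r = (0.7, 0.7, 0.6); classifiers 1 and 2 agree, classifier 3 dissents.
for C in (0.01, 2):
    print(C, [round(x, 3) for x in update_reputations([0.7, 0.7, 0.6], [1, 1, 0], C)])
# 0.01 [0.849, 0.849, 0.54]
# 2 [0.775, 0.775, 0.57]
```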
The time-evolving reputation algorithm for the two-class problem is summarized below.
1. L classifiers are trained and evaluated on the validation set to obtain their initial rep-
utation values.
2. The confidence value of each classifier is calculated by using its corresponding repu-
tation value and equation (5.15).
3. For each input vector, x , each classifier outputs one class label, eitherω1 orω2.
4. The classifiers are categorized into two groups based on their outputs. The group with
the higher number of members is the winner.
5. Each member of the winner group, receives one vote from each of its co-members.
The members of the loser group do not receive any votes.
6. The reputation value of each classifier is updated using equation (5.26).
7. The confidence value of each classifier is updated using equation (5.14).
8. The final decision of the system about the input vector is made using equation (5.8).
9. The algorithm is repeated from Step 2.
5.6 Clinical Application
5.6.1 The Data
We demonstrate the proposed algorithm using physiological and biomechanical signals col-
lected from adolescents with severe disabilities for the purpose of automatically differenti-
ating between rest and activity states [68]. The eventual goal of such pattern classification
is to provide a speech-free means of expression for individuals who are unable to commu-
nicate through speech or gestures.
The dataset contains five signals, namely, blood volume pulse, electrodermal activity
(EDA), respiration, skin temperature, and limb acceleration. These signals were collected
from nine participants with severe disabilities (aged 18.11 ± 2.2 years; 7 female). Each
participant alternated 20 times between 20-second periods of rest and activity. During the
rest state, the participants sat quietly whereas in the activity state they observed pictures of
items they liked or disliked. The viewing of affective images is known to elicit physiological
responses [87, 120]. An example of the different signals from one participant is shown in
Figure 5.2. The objective of our classifier system was to automatically classify features of the
measured signals into rest or activity states. The features and classifier are discussed next.
(a) Hand acceleration (b) Skin temperature
(c) Respiration (d) Electrodermal Activity
(e) Blood Volume Pulse
Figure 5.2: Sample physiological signals collected from one of the participants. In each graph, the horizontal axis denotes time while the vertical axis is signal amplitude in arbitrary units.
5.6.2 Classification
We designed an MCS with 5 different LDA classifiers to address this classification problem.
In total, 40 samples, 20 samples for each class, were used in this experiment. The following
parameters were computed:
• Classifier 1: mean, standard deviation and slope of the heart rate (derived from blood
volume pulse), and the mean of the absolute value of the first difference of the heart rate
signal normalized by its mean and standard deviation.
• Classifier 2: mean, standard deviation and slope of the EDA signal, and the mean of
Table 5.2: The average performance of different classifiers.
Participant Single classifier Static Reputation Majority Vote Time-evolving Reputation
1 54.95±6.5 54.04±5.5 55.23±4.8 59.70±6.3
2 77.98±3.5 77.04±3.1 79.93±3.2 85.52±3.2
3 56.00±6.0 74.57±3.7 67.00±5.8 74.43±6.1
4 82.89±6.0 82.82±4.4 78.73±4.7 84.30±4.2
5 93.11±4.0 80.11±4.0 86.70±2.5 86.34±2.9
6 87.96±5.0 86.64±5.0 86.62±2.9 86.55±3.1
7 59.29±6.7 58.43±3.7 60.54±4.0 59.95±5.7
8 74.05±5.7 74.84±4.3 76.30±3.7 76.73±4.9
9 73.52±6.7 87.09±2.5 75.82±5.6 94.18±2.9
Overall Performance 73.31 75.06 74.10 78.63
the absolute first difference of the EDA signal normalized by its mean and standard
deviation.
• Classifier 3: mean, standard deviation and slope of the respiration signal, and the
mean absolute first difference of the respiration signal normalized by its mean and
standard deviation.
• Classifier 4: mean and slope of the temperature signal.
• Classifier 5: mean, standard deviation and slope of the limb acceleration signal.
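The per-signal parameters listed above can be extracted with a short NumPy routine. The sketch below is illustrative (the thesis code is not available), and it adopts one common reading of the last feature, namely the mean absolute first difference of the standardized signal:

```python
import numpy as np

def segment_features(x):
    """Mean, standard deviation, least-squares slope, and mean absolute
    first difference of the standardized signal for one segment."""
    x = np.asarray(x, dtype=float)
    slope = np.polyfit(np.arange(len(x)), x, 1)[0]   # linear-trend slope
    z = (x - x.mean()) / x.std()                     # normalize by mean and std
    mafd = np.abs(np.diff(z)).mean()
    return [x.mean(), x.std(), slope, mafd]

# e.g. classifier 3 would use all four features of a respiration segment,
# while classifier 4 would keep only the mean and slope of skin temperature.
```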
The actual input features to the various classifiers were the differences in the above param-
eters between successive rest and picture-watching intervals. Classifier accuracies were
estimated using 5-fold cross validation. However, unlike classical cross-validation, we fur-
ther segmented the ’traditional training’ set into training and validation sets. In each fold,
80% of ’traditional training’ samples were used for classifier training and 20% for calculat-
ing the initial reputation values. This process was repeated for 20 different iterations. To
gauge the performance of the proposed algorithm, we compared it with other fusion methods.

Table 5.3: The p-values of paired t-tests comparing the time-evolving classifier with the static classifier and with majority voting.

Participant   P value (time-evolving vs. majority vote)   P value (time-evolving vs. static)
1             0.0358                                      0.0016
2             p < 0.001                                   p < 0.001
3             0.0014                                      0.9214
4             p < 0.001                                   0.2245
5             0.3125                                      p < 0.001
6             0.2826                                      0.8965
7             0.5255                                      0.1245
8             0.5592                                      0.4248
9             p < 0.001                                   p < 0.001

Since the outputs of the classifiers in this experiment were only class labels, we com-
pared our method to abstract level fusion algorithms, namely, the majority voting and static
reputation-based algorithms (the latter being akin to a weighted majority voting algorithm).
The proposed algorithm was also compared against a single classifier with all 17 input fea-
tures. For the proposed algorithm, the best value of C was estimated empirically as C = 40.
Table 5.2 summarizes the results of the different classifiers for this experiment.
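The nested splitting scheme described above (5 outer folds, with each fold's 'traditional training' set split 80/20 into training and validation sets, repeated over 20 iterations) can be sketched with scikit-learn utilities. The code below is an illustrative skeleton on synthetic data; the random data, the seeds, and the use of validation accuracy as the initial reputation are our assumptions:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))               # 40 samples: 20 rest, 20 activity
y = np.repeat([0, 1], 20)

accs = []
for it in range(20):                       # 20 repetitions of the procedure
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=it)
    for fit_idx, test_idx in outer.split(X, y):
        # split the 'traditional training' portion 80/20 into training/validation
        tr, val = train_test_split(fit_idx, test_size=0.2,
                                   stratify=y[fit_idx], random_state=it)
        clf = LinearDiscriminantAnalysis().fit(X[tr], y[tr])
        r0 = clf.score(X[val], y[val])     # would seed the classifier's reputation
        accs.append(clf.score(X[test_idx], y[test_idx]))
print(len(accs))                           # 100 fold evaluations in total
```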
5.6.3 Results
As seen in Table 5.2, the time-evolving reputation-based classifier outperformed its static
counterpart on this nonstationary data set. Due to the disparity between the features of
the training set and those of the test set, the fixed reputations of the static system poorly
predicted the individual classifier performance on test data. In contrast, the time-evolving
reputation-based classifier mixes the initial reputation values with instantaneous classifier
performance on the test set. Consequently, the time-evolving algorithm is more robust to
nonstationary features. A pairwise t-test reveals that the time-evolving classifier is signifi-
cantly more accurate (p < 0.05) than the static version for participants 1, 2, 5 and 9 (Table
5.3). This finding suggests that the signal features for this subset of participants were par-
ticularly nonstationary. On the other hand, the majority voting classifier relies exclusively
on the instantaneous classifier performance and neglects their reputations. Based on Ta-
bles 5.2 and 5.3, the time-evolving reputation-based algorithm yields significantly higher
accuracies (p < 0.05) than the majority voting method for participants 1, 2, 3, 4, and 9. This
finding suggests that for these participants, the historical classifier performance as captured
via initial reputations, does bear some predictive value for future decisions. Finally, the
time-evolving reputation-based algorithm outperformed the single classifier for all but one
participant. Moreover, the single classifier was very slow compared to the time-evolving al-
gorithm and highly susceptible to the curse of dimensionality given its high dimensional
input feature space. It is interesting to note that all the classifiers generated low accuracies
for the first and seventh participants. During data collection, these two participants expe-
rienced excessive amounts of involuntary movement which likely contaminated both rest
and activity recordings uniformly, obfuscating their distinctive characteristics.
5.6.4 The Effect of the Constant C
With the above physiological data, we investigated the sensitivity of the proposed algorithm
to the constant C . We ran the proposed algorithm for different values of C , the accura-
cies of which are summarized in Figure 5.3. For small values of C , the accuracies of the
time-evolving reputation-based algorithm and the majority voting method are comparable.
According to equation (5.14), when C → 0 we have β_i[k+1] → v_i[k+1] / ∑_{i=1}^{L} v_i[k+1]. Therefore, the past
performance of classifiers are not considered in computing new confidence values, i.e., rep-
utations are updated strictly on the basis of instantaneous votes. This is exactly the behavior
of majority voting.
As evident in Figure 5.3, the time-evolving and static reputation-based classifiers per-
form similarly when C is large. Indeed, when C → ∞ in equation (5.14), we note that
βi [k + 1] → βi [k ]. In other words, the confidence and reputation values become nearly
constant and equal to their initial values.
[Plot: average accuracy (%) of the time-evolving reputation, static reputation, and majority vote algorithms versus the value of C, for C from 0.001 to 10000.]
Figure 5.3: The effect of the constant C on the time-evolving reputation-based algorithm.
Figure 5.4 depicts the reputation values of the different classifiers for different values
of the constant C, as different folds of a test set are shown to the classifier. In Figure 5.4(a), we
see that for large C, the reputation values are almost constant for all five classifiers. In
contrast, as seen in Figure 5.4(c), for small values of C, reputation values fluctuate wildly on
a sample-by-sample basis. From Figure 5.4(b), we note that reputation values change but
converge quickly for C = 40. These observations confirm our choice of this constant.
(a) C = 10000 (b) C = 40
(c) C = 0.001
Figure 5.4: The evolution of reputation values for the 5 classifiers for different values of C .
5.7 Conclusion
A time-evolving reputation-based classifier was introduced. This novel multiple classifier
fusion algorithm updates initial reputation values of the individual classifiers based on their
performances on the test set. We showed that with a nonstationary physiological data set,
the proposed algorithm can provide higher accuracies than the static reputation-based al-
gorithm, the majority voting method, and a single classifier.
5.8 Appendix
5.8.1 Proof of Theorem 1
Let Γ_i represent the set of classifiers that classify a specific sample of the test set as ω_i, and
let Γ_max be the Γ_i with the maximum number of members. Let D(Γ_i) represent the cardinality
of the set Γ_i. Recall that there are L classifiers, where L is odd. Assume two possible class
labels, ω1 and ω2, for each sample. Therefore, for the winning subset of classifiers, D(Γ_max) ∈
{(L + 1)/2, ..., L}. Since each classifier cannot vote for itself, each member of the winning
group receives v_i ∈ {(L − 1)/2, ..., L − 1} votes. Let v_i[k+1] = γ ≥ 0; then

∑_{i=1}^{L} v_i[k+1] = γ(γ + 1).    (5.27)

Now, define h(β_i) = β_i[k+1] − β_i[k]. Because C > 0, (5.20) reaches its maximum when
C → 0. To find the maximum of (5.20), we calculate ∂h/∂γ|_{C→0}, which is

∂h/∂γ|_{C→0} = ∂/∂γ [(γ − β_i[k] γ(γ + 1)) / (γ(γ + 1))] = −1/(γ + 1)².    (5.28)

Because (5.28) is a strictly negative function, h(β_i) is a monotonically decreasing function
of γ and is thus maximized when v_i[k+1] takes on its minimum value of (L − 1)/2. Plugging
this minimum into (5.27), we have ∑_{i=1}^{L} v_i[k+1] = (L − 1)/2 × (L + 1)/2 = (L² − 1)/4.
This proves Theorem 1.
5.8.2 Proof of Theorem 2
Two options are available for classifier i: either it belongs to Γ_max or it belongs to the set with
the fewest members, Γ_min. First, assume i ∈ Γ_max. Based on (5.28), h is a monotonically
decreasing function. Hence, its minimum is reached when v_i = L − 1 and ∑_{i=1}^{L} v_i[k+1] =
L(L − 1). For these values, as C → 0, the minimum of h is obtained as

min_{v_i} h(β_i)|_{i∈Γ_max} = ((L − 1) − β_i[k] L(L − 1)) / (L(L − 1)).    (5.29)

(5.29) reaches its minimum when β_i = β_i^max = 2/(L + 1), which was obtained in (5.18). For
this value, (5.29) becomes

min_{v_i, β_i} h(β_i)|_{i∈Γ_max} = −(L − 1) / (L(L + 1)).    (5.30)

Now, assume i ∈ Γ_min. In this case, D(Γ_min) ∈ {1, ..., (L − 1)/2}. For these cases, v_i[k+1] = 0
and ∑_{i=1}^{L} v_i[k+1] ∈ {(L − 1)(L + 1)/4, ..., (L − 1)(L − 2)}. Therefore, (5.20) becomes

h(β_i) = β_i[k+1] − β_i[k] = −β_i[k] ∑_{i=1}^{L} v_i[k+1] / (C + ∑_{i=1}^{L} v_i[k+1]).    (5.31)

By letting ∑_{i=1}^{L} v_i[k+1] = γ′, we have

∂h/∂γ′ = −C β_i[k] / (γ′ + C)²,    (5.32)

which is a strictly negative function. Therefore, h is a decreasing function of γ′ and reaches
its minimum when ∑_{i=1}^{L} v_i[k+1] = (L − 1)(L − 2). By using this value, β_i^max = 2/(L + 1),
and C → 0, the minimum of h becomes

min_{v_i, β_i} h(β_i)|_{i∈Γ_min} = −2/(L + 1).    (5.33)

By comparing (5.30) and (5.33), we see that (5.33) has the lower value; hence (5.20) is
minimized when i ∈ Γ_min, and this proves Theorem 2.
Chapter 6
Conclusion
6.1 Summary of research completed
To achieve the above objectives, the following phases of research were designed and imple-
mented.
1. In the first phase of this thesis, we proposed a new MCS based on the past perfor-
mances of the individual classifiers. We compared the performance of the proposed
algorithm with other MCS methods using real data and showed that the proposed
method outperformed its competitors.
2. In the second phase, we applied the proposed method to the problem of dysphagia
detection. We showed that the proposed algorithm can distinguish between healthy
and dysphagic swallows with higher accuracy than its competitors. Furthermore, in
this phase, we used the proposed algorithm to investigate the difference between
two sensitive channels of the accelerometer, aligned to two orthogonal anatomical
axes.
3. In the third phase of this research, we introduced a new dynamic MCS which updates
its performance based on samples of the test set. We investigated the application of
the proposed method to emotion detection using real data collected from children
with disabilities. We showed that for this application, the dynamic reputation-based
algorithm works better than all of its competitors.
6.2 Summary of contributions
This thesis makes several original contributions to multiple classifier systems and their ap-
plications in rehabilitation engineering. These contributions are summarized as follows:
1. We developed a new algorithm for combining several individual classifiers using their
reputation values. By gauging the performance of each individual classifier on the
validation set, a novel weighted majority voting method was introduced.
2. We classified dysphagic and healthy swallows with accuracy higher than currently
published results. By applying the proposed method to real data collected from sev-
eral healthy and dysphagic participants, we have shown that the static reputation-
based algorithm can classify swallowing accelerometry with high sensitivity and mod-
est specificity, making it a suitable screening tool.
3. We introduced a new dynamic MCS based on the Dirichlet distribution for classifica-
tion of non-stationary signals. The proposed algorithm is robust against high dispar-
ity between training and test sets.
4. We detected physiological arousal using the dynamic reputation-based algorithm. By
applying the proposed algorithm to the physiological signals from youth with severe
disabilities, we demonstrated the potential of speech-free detection of activity en-
gagement.
5. We analyzed the asymptotic behavior of the static reputation-based algorithm as well
as its behavior for small and medium-sized classifier ensembles. We also characterized
the performance of the algorithm in the presence of uniform and Gaussian noise and
quantified its probability of error.
The above contributions resulted in 1 book chapter, 3 journal papers, and 1 patent.
Some ideas for future work are as follows:
• In this thesis, we have characterized the asymptotic and small-ensemble behavior of
the static reputation-based algorithm only. Replicating this theoretical and empirical
exercise for the dynamic reputation-based algorithm is a worthwhile endeavor.
• In most of the previously proposed algorithms, an offline classification approach is
considered. However, in many rehabilitation engineering problems, such as the detection of
physiological arousal and hand pose detection, there is need for online classification.
In this case, designing a new method to combine several classifiers in an online man-
ner could be extremely helpful.
Bibliography
[1] F Alkoot and J Kittler. Experimental evaluation of expert fusion strategies. Pattern
Recognition Letters, 20, 1999.
[2] O Aran and L Akarun. A multi-class classification strategy for fisher scores: Ap-
plication to signer independent sign language recognition. Pattern Recognition,
43(5):1776–1788, 2009.
[3] S Awate, T Tasdizen, N Foster, and R Whitaker. Adaptive markov modeling for mutual-
information-based, unsupervised mri brain-tissue classification. IEEE Transactions on
Biomedical Engineering, 10, 2006.
[4] S Bagui and N Pal. A multistage generalization of the rank nearest neighbor classifi-
cation rule. Pattern Recognition Letter, 16(6):601–614, 1995.
[5] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms:
Bagging, boosting, and variants. Machine learning, 36(1):105–139, 1999.
[6] R Bauer. Physiologic measures of emotion. Journal of Clin Neurophysiol, 15, 1998.
[7] N Birbaumer. Breaking the silence: Brain-computer interfaces (bci) for communica-
tion and motor control. Psychophysiology, 43, 2006.
[8] D Black. The theory of committees and elections. 2nd ed. London: Cambridge Uni-
versity Press, 1963.
[9] S Blain, S Kingsnorth, and T Chau. A multivariate classification algorithm for detect-
ing change in physiological state. Submitted to:, 1999.
[10] S Blain, A Mihailidis, and T Chau. Assessing the potential of electrodermal activity as
an alternative access pathway. Med Eng Phys, 30, 2008.
[11] I. Bloch. Information combination operators for data fusion: A comparative review
with classification. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE
Transactions on, 26(1):52–67, 2002.
[12] J Blumberg, J Rickert, S Waldert, A Schulze-Bonhage, A Aertsen, and C Mehring. Adap-
tive classification for brain-computer interfaces. Proceedings - IEEE Engineering in
Medicine and Biology Society, 1, 2007.
[13] W Boucsein. Electrodermal activity. New York: Plenum Press, 1992.
[14] L Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[15] L. Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.
[16] L Bruzzone and R Cossu. A multiple-cascade-classifier system for a robust and par-
tially unsupervised updating of land-cover maps. IEEE Trans. Geoscience and Remote
Sensing, 2002.
[17] J Cabrera. On the impact of fusion strategies on classification errors for large ensem-
bles of classifiers. Pattern Recognition, 39, 2006.
[18] J Cacioppo and L Tassinary. Inferring psychological significance from physiological
signals. American Psychologist, 45, 1990.
[19] J Cao, M Ahmadi, and M Shridhar. Recognition of handwritten numerals with multi-
ple feature and multistage classifier. Pattern Recognition, 28(2):153–160, 1995.
[20] D Chen and X Cheng. An asymptotic analysis of some expert fusion methods. Pattern
Recognition Letters, 22, 2001.
[21] H. Chen and X. Yao. Regularized negative correlation learning for neural network
ensembles. Neural Networks, IEEE Transactions on, 20(12):1962–1979, 2009.
[22] S Cho and J Kim. Combining multiple neural networks by fuzzy integral for robust
classification. IEEE Transactions on Systems, Man, and Cybernetics, 25(2):380–384,
1995.
[23] S Cho and J Kim. Multiple network fusion using fuzzy logic. IEEE Trans. Neural Net-
works, 6(2):497–501, 1995.
[24] P Chou. Optimal partitioning for classification and regression trees. IEEE Trans. Pat-
tern Analysis and Machine Intelligence, 13(4):340–354, 1991.
[25] G Cichero and B Merdoch. The physiologic cause of swallowing sounds: answers from
heart sounds and vocal tract acoustics. Dysphagia, 13(1):39–52, 1998.
[26] T Ciszkowski, I Dunajewski, and Z Kotulski. Reputation as optimality measure in wire-
less sensor network-based monitoring systems. Probabilistic Engineering Mechanics,
26(1):67–75, 2011.
[27] S Damouras, E Sejdic, C Steele, and T Chau. An on-line swallow detection algorithm
based on the quadratic variation of dual-axis accelerometry. IEEE Transactions on
Signal Processing, 58(6):3352–3359, 2010.
[28] A Das, NP Reddy, and J Narayanan. Hybrid fuzzy logic committee neural networks
for recognition of swallow acceleration signals. Computer Methods and Programs in
Biomedicine, 64:87–99, 2001.
[29] A Demiriz, K Bennett, and J Shawe-Taylor. Linear programming boosting via column
generation. Kluwer Machine Learning, 46, 2002.
[30] A Dempster, N Laird, and D Rubin. Maximum likelihood from incomplete data via
the (em) algorithm. J. Royal Statistical Soc, 39, 1977.
[31] T Deselaers, G Heigold, and H Ney. Object classification by fusing svms and gaussian
mixtures. Pattern Recognition, 43(7), 2010.
[32] R Ding and J Logemann. Pneumonia in stroke patients: a retrospective study. Dys-
phagia, 15:51–57, 2010.
[33] R Duda, P Hart, and D Stork. Pattern Classification. Wiley-Interscience, 2nd edition,
2000.
[34] R Duda, P Hart, and D Stork. Pattern classification. Wiley-Interscience, 2001.
[35] B Efron and R Tibshirani. An Introduction to the Bootstrap. CRC Press, Boca Raton,
FL, 1994.
[36] E Elaad and G Shakhar. Effects of mental countermeasures on psychophysiological
detection in the guilty knowledge test. Int J Psychophysiol, 11, 1991.
[37] T Ferguson. Asymptotic joint distribution of sample mean and a sample quantile.
Technical Report: UCLA, 1999.
[38] Y Freund. An adaptive version of the boost by majority algorithm. Machine Learning,
3(43):293–318, 2001.
[39] Y Freund and R Schapire. A decision-theoretic generalization of on-line learning and
an application to boosting. EuroCOLT 95 - Proceedings of the Second European Con-
ference on Computational Learning Theory, 1995.
[40] Y Freund and R Schapire. Experiments with a new boosting algorithm. In Proc. 13th
Int’l Conf. Machine Learning, 1996.
[41] J Friedman, T Hastie, and R Tibshirani. Additive logistic regression: a statistical view
of boosting. Annals of Statistics, 28(2):337–407, 2000.
[42] J Gan. Self-adapting bci based on unsupervised learning. Proc. 3rd Int. Workshop
Brain-Comput. Interfaces, 2006.
[43] A Gangardiwala and R Polikar. Dynamically weighted majority voting for incremental
learning and comparison of three boosting based approaches. Proceedings of Inter-
national Joint Conference on Neural Networks, 2005.
[44] A Van Halteren, D Konstantas, R Bults, K Wac, N Dokovski, G Koprinkov, V Jones, and
I Widya. Mobihealth: Ambulant patient monitoring over next generation public wire-
less networks. Demiris G., ed. e-Health: Current Status and Future Trends, 2004.
[45] F Hanna, SM Molfenter, RE Cliffe, T Chau, and CM Steele. Anthropometric and de-
mographic correlates of dual-axis swallowing accelerometry signal characteristics: a
canonical correlation analysis. Dysphagia, 25(2):94–103, 2010.
[46] B Hasan and J Gan. Unsupervised adaptive gmm for bci. 4th International IEEE EMBS
Conference on Neural Engineering, 2009.
[47] S Hashem and B Schmeiser. Improving model accuracy using optimal linear combi-
nations of trained neural networks. IEEE Trans. Neural Networks, 6(3):792–794, 1991.
[48] S Haykin. Neural Networks. Macmillan College Publishing Company, New York, 1994.
[49] T Hettmansperger. Statistical inference based on rank. New York- Wiley, 1984.
[50] T Ho. Random decision forests. Third Int'l Conf. Document Analysis and Recognition,
Montreal, Canada, 1995.
[51] T Ho, J Hull, and S Srihari. Decision combination in multiple classifier systems. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(1):66–75, 1994.
[52] C Huang and J Wang. Multi-weighted majority voting algorithm on support vector
machine and its application. TENCON 2009-IEEE Region 10 Conference, 2009.
[53] J Hull, N Srihari, E Cohen, C Kuan, P Cullen, and P Palumbo. A blackboard-based
approach to handwritten zip code recognition. In Procedings of the US Postal Service
Advanced Technology Conference, pages 1018–1032, Washington, DC, 1988.
[54] R Jacob, M Jordan, S Nowlan, and G Hinton. Adaptive mixtures of local experts. Neural
Computation, 3(5):79–87, 1991.
[55] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, and G.E. Hinton. Adaptive mixtures of local
experts. Neural computation, 3(1):79–87, 1991.
[56] A Jain, R Duin, and J Mao. Statistical pattern recognition: A review. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 22(1):4–37, 2000.
[57] M Jordan and R Jacobs. Hierarchical mixtures of experts and the em algorithm. Neural
Computation, 6, 1994.
[58] A Josang and J Haller. Dirichlet reputation systems. Proc. 2nd Int. Conf. Availability,
Reliability Security, 2007.
[59] C Katsis, G Ganiatsas, and D Fotiadis. An integrated telemedicine platform for the
assessment of affective physiological states. Diagn Pathol, 1, 2006.
[60] Y Kim and GH McCullough. Maximal hyoid displacement in normal swallowing. Dys-
phagia, 23(3):274–279, 2008.
[61] A Kistler, C Mariauzouls, and K von Berlepsch. Fingertip temperature as an indicator
for sympathetic responses. Int. Journal of Psychophysiol, 29, 1998.
[62] J Kittler, M Hatef, R Duin, and J Matas. On combining classifiers. IEEE Trans. Pattern
Analysis and Machine Intelligence, 20(3):226–239, 1998.
[63] J Kolter and M Maloof. Dynamic weighted majority: A new ensemble method for
tracking concept drift. Proceedings of the Third IEEE International Conference on Data
Mining, USA, 2003.
[64] L Kuncheva. A theoretical study on six classifier fusion strategies. IEEE Trans. Pattern
Analysis and Machine Intelligence, 24(2):281–286, 2002.
[65] L Kuncheva. Combining Pattern Classifiers. Wiley-Interscience, 2004.
[66] L Kuncheva, C Whitaker, C Shipp, and R Duin. Limits on the majority vote accuracy
in classifier fusion. Pattern Analysis and Applications, 6, 2003.
[67] L.I. Kuncheva. A theoretical study on six classifier fusion strategies. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 24(2):281–286, 2002.
[68] A Kushki, A Andrews, G King, and T Chau. Signals of the peripheral nervous system:
Viable means for classification of activity engagement in individuals with severe
physical disabilities. ???, 2011.
[69] L Lam and C Suen. Application of majority voting to pattern recognition: An analysis
of its behavior and performance. IEEE Transactions on Systems, Man, and Cybernetics,
27(5):553–568, 1997.
[70] D Lee and S Srihari. Handwritten digit recognition: A comparison of algorithms. In
3rd Intl. Workshop Frontiers Handwriting Recognition, pages 153–162, 1993.
[71] J Lee, S Blain, M Casas, G Berall, D Kenny, and T Chau. A radial basis classifier for the
automatic detection of aspiration in children with dysphagia. Journal of Neuroengineering
and Rehabilitation, 3(14):1–17, 2006.
[72] J Lee, E Sejdic, CM Steele, and T Chau. Effects of liquid stimuli on dual-axis swallow-
ing accelerometry signals in a healthy population. Biomedical Engineering OnLine,
9(7):10 pp., 2010.
[73] J Lee, C Steele, and T Chau. Time and time-frequency characterization of dual-axis
swallowing accelerometry signals. Physiological Measurement, 29(9):1105–1120, 2008.
[74] J Lee, C Steele, and T Chau. Classification of healthy and abnormal swallows based on
accelerometry and nasal airflow signals. Artificial Intelligence in Medicine, xx(yy):zz,
Under review.
[75] J Lee, CM Steele, and T Chau. Swallow segmentation with artificial neural networks
and multi-sensor fusion. Medical Engineering & Physics, 31(9):1049–1055, 2009.
[76] E Lehmann. Elements of large sample theory. Springer, 1998.
[77] A Lempel and J Ziv. On the complexity of finite sequences. IEEE Transactions on
Information Theory, 22:75–81, 1976.
[78] T Lin, H Li, and K Tsai. Implementing the Fisher’s discriminant ratio in a k-means
clustering algorithm for feature selection and data set trimming. Journal of Chemical
Information and Computer Sciences, 44(1):76–87, 2004.
[79] X Lin, S Yacoub, J Burns, and S Simske. Performance analysis of pattern classifier
combination by plurality voting. Pattern Recognition Letters, 2003.
[80] G Liu, G Huang, J Meng, D Zhang, and X Zhu. Unsupervised adaptation based on
fuzzy c-means for brain-computer interface. 1st Int. Conf. on Information Science
and Engineering, 2009.
[81] J Logemann. Evaluation and treatment of swallowing disorders. Pro-Ed, Austin, TX,
1997.
[82] P Long and R Servedio. Random classification noise defeats all convex potential
boosters. Machine Learning, 78(3):287–304, 2010.
[83] L Mason, J Baxter, P Bartlett, and M Frean. Boosting algorithms as gradient descent.
Advances in Neural Information Processing Systems, 12, 2000.
[84] R Meddis. Statistics using ranks: A unified approach. Basil Blackwell, Philadelphia,
1984.
[85] A Miller. The neuroscientific principles of swallowing and dysphagia. Singular Pub-
lishing Group, San Diego, 1999.
[86] C Nadal, R Legault, and C Suen. Complementary algorithms for the recognition of to-
tally unconstrained handwritten numerals. 10th Int. Conf. Pattern Recognition, A:434–
449, 1990.
[87] B Nhan and T Chau. Classifying affective states using thermal infrared imaging of the
human face. IEEE Transactions on Biomedical Engineering, 57(4):979–987, 2010.
[88] M Nikjoo, A Kushki, J Lee, C Steele, and T Chau. Reputation-based neural network
combinations. In K. Suzuki (editor), Artificial Neural Networks - Methodological Ad-
vances and Biomedical Applications. InTech Publishing, 2011.
[89] M Nikjoo, A Kushki, CM Steele, and T Chau. Reputation-based neural network
combinations. In Neural Networks. InTech Publishers, Rijeka, Croatia, 2011.
[90] M Nikjoo, C Steele, and T Chau. Biomedical Engineering Online, 2010.
[91] M Nikjoo, C Steele, E Sejdic, and T Chau. Automatic discrimination between safe
and unsafe swallowing using a reputation-based classifier. Biomedical Engineering
Online, 2011.
[92] I Orovic, S Stankovic, T Chau, CM Steele, and E Sejdic. Time-frequency analysis and
Hermite projection method applied to swallowing accelerometry signals. EURASIP
Journal on Advances in Signal Processing, 2010(article ID 323125):7 pages, 2010.
[93] R Picard, E Vyzas, and J Healey. Toward machine emotional intelligence: analysis of
affective physiological state. IEEE Trans. Pattern Analysis and Machine Intelligence,
23, 2001.
[94] A Porta, S Guzzetti, N Montano, R Furlan, M Pagani, A Malliani, and S Cerutti. Entropy,
entropy rate, and pattern classification as tools to typify complexity in short heart
period variability series. IEEE Transactions on Biomedical Engineering, 48(11):1282–
1291, 2001.
[95] S Power, T Falk, and T Chau. Classification of prefrontal activity due to mental
arithmetic and music imagery using hidden Markov models and frequency domain
near-infrared spectroscopy. Journal of Neural Engineering, 7(2), 2010.
[96] K Prkachin, R Williams-Avery, C Zwaal, and D Mills. Cardiovascular changes during
induced emotion. Journal of Psychosomatic Research, 47, 1999.
[97] J Quinlan. C4.5: Programs for machine learning. Morgan Kaufmann, 1993.
[98] NP Reddy, A Katakam, V Gupta, R Unnikrishnan, J Narayanan, and EP Canilang.
Measurements of acceleration during videofluoroscopic evaluation of dysphagic
patients. Medical Engineering & Physics, 22(6):405–412, 2000.
[99] JC Rosenbek, J Robbins, EV Roecker, JL Coyle, and JL Woods. A penetration-aspiration
scale. Dysphagia, 11(2):93–98, 1996.
[100] D Ruta and B Gabrys. A theoretical analysis of the limits of majority voting errors for
multiple classifier systems. Technical Report, Department of Computing and Informa-
tion Systems, University of Paisley, 2000.
[101] R Sanchez-Reillo, C Sanchez-Avila, and A Gonzalez-Marcos. Biometric identification
through hand geometry measurements. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 22(10):1168–1171, 2000.
[102] R Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.
[103] J Schlimmer and R Granger. Beyond incremental processing: Tracking concept drift.
Proceedings of the Fifth National Conference on Artificial Intelligence, 1986.
[104] E Sejdic, TH Falk, CM Steele, and T Chau. Vocalization removal for improved auto-
matic segmentation of dual-axis swallowing accelerometry signals. Medical Engineer-
ing & Physics, 32(6):668–672, 2010.
[105] E Sejdic, CM Steele, and T Chau. A procedure for denoising dual-axis accelerometry
signals. Physiological Measurement, 31(1):N1–N9, 2010.
[106] E Sejdic, V Komisar, CM Steele, and T Chau. Baseline characteristics of dual-axis cer-
vical accelerometry signals. Annals of Biomedical Engineering, 38(3):1048–1059, 2010.
[107] E Sejdic, C Steele, and T Chau. The effects of head movement on dual-axis cervical
accelerometry signals. BMC Research Notes, ?(?):in press, 2010.
[108] E Sejdic, C Steele, and T Chau. The effects of head movement on dual-axis cervical
accelerometry signals. BMC Research Notes, 3(269):6pp, 2010.
[109] E Sejdic, CM Steele, and T Chau. Segmentation and time duration analysis of dual-
axis swallowing accelerometry signals in healthy subjects. IEEE Transactions on
Biomedical Engineering, 56(4):1090–1097, 2009.
[110] L Senhadji, G Carrault, H Gauvrit, E Wodey, P Pladys, and F Carre. Pediatric anesthesia
monitoring with the help of EEG and ECG. Acta Biotheoretica, 48, 2000.
[111] G Shafer. A mathematical theory of evidence. Princeton University Press, 1977.
[112] D Shah, M Mackay, S Lavery, S Watson, A Harvey, J Zempel, A Mathur, and T Inder.
Combining multiple classifiers with dynamic weighted voting. Pediatrics, 2008.
[113] P Shenoy, M Krauledat, B Blankertz, R Rao, and K Müller. Towards adaptive
classification for BCI. Journal of Neural Engineering, 3(1), 2006.
[114] CM Steele, E Sejdic, and T Chau. Noninvasive detection of thin-liquid aspiration us-
ing dual-axis swallowing accelerometry. American Journal of Physical Medicine and
Rehabilitation, Under Review(xx):yy, 2011.
[115] C Stefano, A Cioppa, and A Marcelli. An adaptive weighted majority vote rule for com-
bining multiple classifiers. In 16th International Conference on Pattern Recognition,
pages 192–195, Quebec, Canada, 2002. IEEE.
[116] H Stern. Improving on the mixture of experts algorithm. 2003.
[117] C Suen, C Nadal, T Mai, R Legault, and L Lam. Recognition of totally unconstrained
handwritten numerals based on the concept of multiple experts. In C. Suen, editor,
Proceedings of the International Workshop on Frontiers in Handwriting Recognition,
pages 131–143, 1990.
[118] S Sun. Local within-class accuracies for weighting individual outputs in multiple clas-
sifier systems. Pattern Recognition Letters, 2010.
[119] A Tabaee, P Johnson, C Gartner, K Kalwerisky, R Desloge, and M Stewart. Patient-
controlled comparison of flexible endoscopic evaluation of swallowing with sensory
testing (FEESST). The Laryngoscope, 116:821–825, 2006.
[120] K Tai and T Chau. Single-trial classification of NIRS signals during emotional
induction tasks: towards a corporeal machine interface. Journal of Neuroengineering
and Rehabilitation, 6(39):14pp, 2010.
[121] A Tajeddine, A Kayssi, A Chehab, and H Artail. Fuzzy reputation-based trust model.
Applied Soft Computing, 11(1):345–355, 2011.
[122] R Tibshirani, T Hastie, B Narasimhan, S Soltys, G Shi, A Koong, and Q Le. Sample
classification from protein mass spectrometry, by peak probability contrasts. Bioin-
formatics, 20(17), 2004.
[123] V Tresp and M Taniguchi. Combining estimators using non-constant weighting func-
tions. In G. Tesauro, D.S. Touretzky, and T.K. Leen, editors, Advances in Neural
Information Processing Systems, volume 7, pages 418–435. MIT Press, Cambridge, MA,
1995.
[124] YM Tseng and FG Chen. A free-rider aware reputation system for peer-to-peer file-
sharing networks. Expert Systems with Applications, 38(3):2432–2440, 2011.
[125] G Turpin, F Schaefer, and W Boucsein. Effects of stimulus intensity, risetime, and du-
ration on autonomic and behavioral responding: Implications for the differentiation
of orienting, startle, and defense responses. Psychophysiology, 36, 1999.
[126] N. Ueda. Optimal linear combination of neural networks for improving classification
performance. IEEE Transactions on Pattern Analysis and Machine Intelligence,
22(2):207–215, 2000.
[127] R Valdovinos and J Sanchez. Combining multiple classifiers with dynamic weighted
voting. Proceedings of the 4th International Conference on Hybrid Artificial Intelligence
Systems, 2009.
[128] C Vidaurre, M Kawanabe, P von Bunau, B Blankertz, and K Muller. Toward unsu-
pervised adaptation of LDA for brain-computer interfaces. IEEE Transactions on
Biomedical Engineering, 58(3):587–597, 2011.
[129] P Viola and M Jones. Fast and robust classification using asymmetric adaboost and a
detector cascade. Advances in Neural Information Processing Systems, 2002.
[130] P Walley. Statistical reasoning with imprecise probabilities. Chapman and Hall, Lon-
don, 1991.
[131] P Wang. A defect in Dempster-Shafer theory. Proceedings of the 10th Conference on
Uncertainty in Artificial Intelligence, 1994.
[132] X Wang, L Ding, and D Bi. Dirichlet reputation systems. IEEE Trans. Instrumentation
and Measurement, 59(1):171–179, 2010.
[133] D Wolpert. Stacked generalization. Neural Networks, 5, 1992.
[134] M Wozniak. Combining pattern recognition algorithms: chances and limits. 2nd
International Conference on Computer Engineering and Technology, Chengdu, Sichuan,
China, 2010.
[135] L Xu, A Krzyzak, and CY Suen. Methods of combining multiple classifiers and their
applications to handwriting recognition. IEEE Transactions on Systems, Man and
Cybernetics, 22(3):418–435, 1992.
[136] X. Yao and M.M. Islam. Evolving artificial neural network ensembles. IEEE Computa-
tional Intelligence Magazine, 3(1):31–42, 2008.
[137] L Zadeh. A simple view of the Dempster-Shafer theory of evidence and its implication
for the rule of combination. The AI Magazine, 7(2):85–90, 1986.
[138] D Zhang, B Liu, C Sun, and X Wang. Learning the classifier combination for image
classification. Journal of Computers, 6(8), 2011.
[139] DCB Zoratto, T Chau, and CM Steele. Hyolaryngeal excursion as the physiological
source of swallowing accelerometry signals. Physiological Measurement, 31(6):843–
855, 2010.