MULTIPLE CLASSIFIER STRATEGIES FOR DYNAMIC PHYSIOLOGICAL AND
BIOMECHANICAL SIGNALS
by
Mohammad Nikjoo Soukhtabandani
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering, University of Toronto
Copyright © 2012 by Mohammad Nikjoo Soukhtabandani
Abstract
Multiple classifier strategies for dynamic physiological and biomechanical signals
Mohammad Nikjoo Soukhtabandani
Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
2012
Access technologies often deal with the classification of several physiological and biome-
chanical signals. In most previous studies involving access technologies, a single classifier
has been trained. Despite reported success of these single classifiers, classification accu-
racies are often below clinically viable levels. One approach to improve upon the performance
of these classifiers is to utilize state-of-the-art multiple classifier systems (MCS).
Because MCS invoke more than one classifier, more information can be exploited from the
signals, potentially leading to higher classification performance than that achievable with
single classifiers. Moreover, by decreasing the feature space dimensionality of each classi-
fier, the speed of the system can be increased.
MCSs may combine classifiers on three levels: abstract, rank, or measurement level.
Among them, abstract-level MCSs have been the most widely applied in the literature given
the flexibility of the abstract level output, i.e., class labels may be derived from any type
of classifier and outputs from multiple classifiers, each designed within a different context,
can be easily combined.
In this thesis, we develop two new abstract-level MCSs based on "reputation" values of
individual classifiers: the static reputation-based algorithm (SRB) and the dynamic reputation-
based algorithm (DRB). In SRB, each individual classifier is applied to a “validation set”,
which is disjoint from training and test sets, to estimate its reputation value. Then, each
individual classifier is assigned a weight proportional to its reputation value. Finally, the
total decision of the classification system is computed using Bayes rule. We have applied this
method to the problem of dysphagia detection in adults with neurogenic swallowing diffi-
culties. The aim was to discriminate between safe and unsafe swallows. The weighted clas-
sification accuracy exceeded 85% and, because of its high sensitivity, the SRB approach was
deemed suitable for screening purposes. In the next step of this dissertation, I analyzed the
SRB algorithm mathematically and examined its asymptotic behavior. Specifically, I con-
trasted the SRB performance against that of majority voting, the benchmark abstract-level
MCS, in the presence of different types of noise.
In the second phase of this thesis, I exploited the idea of the Dirichlet reputation system
to develop a new MCS method, the dynamic reputation-based algorithm, which is suitable
for the classification of non-stationary signals. In this method, the reputation of each clas-
sifier is updated dynamically whenever a new sample is classified. At any point in time, a
classifier’s reputation reflects the classifier’s performance on both the validation and the test
sets. Therefore, the effect of random high-performance of weak classifiers is appropriately
moderated and likewise, the effect of a poorly performing individual classifier is mitigated
as its reputation value, and hence overall influence on the final decision is diminished. We
applied DRB to the challenging problem of discerning physiological responses from non-
verbal youth with severe disabilities. The promising experimental results encourage further
development of reputation-based multi-classifier systems in the domain of access technol-
ogy research.
Acknowledgements
First and foremost, I would like to thank God for showing me the correct path in life.
During my Ph.D. program, I have worked with a great number of people whose varied
contributions to the research and writing of this thesis deserve special mention. It
is a pleasure to convey my gratitude to them all in my humble acknowledgments.
In the first place, I would like to record my gratitude to my supervisor Dr. Tom Chau
for his supervision, advice, and guidance from the very early stages of this research as well
as for the extraordinary experiences throughout this work. Above all and when it was most
needed, he provided me with unflinching encouragement and support in various ways. He
was not only my supervisor, but also a great role model from whom I have learned a lot,
from research skills to personality. I am indebted to him more than he knows. Dr. Chau, I
am grateful in every possible way and hope to keep up our collaboration in the future.
I would also like to acknowledge my colleagues and friends at the PRISM lab and ECE
department of UofT who were always helpful and made my stay memorable.
My special thanks to the committee members, Dr. Tas Venetsanopoulos and Dr. Hans
Kunov, who gave me valuable comments during my work, read and reviewed this thesis, and
participated in the thesis defence examination. Also, thanks to Dr. Elvino Sousa and Dr.
Bernie Hudgins for reviewing the thesis and participating in the final examination.
Finally, I would like to express my deepest appreciation to my family and parents. Where
would I be without my family? My parents deserve special mention for their unwavering
support and prayers. Although my father is not among us and cannot see my graduation,
his presence is always with me and I’m sure he is very proud of my success. I am much
obliged for the unconditional love and support from my mother who helped me both finan-
cially and emotionally to study in the best schools of Iran and Canada. I am sure she has
been longing for my graduation as much as I have, and I am now happy to deliver her the
good news. Thank you very much and God bless you.
Contents
1 Background and Literature Review 1
1.1 Combining Multiple Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Majority Voting Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Application of MCS to Rehab Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.1 Dysphagia Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.2 Recognition of physiological arousal . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Thesis Rationale and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 Reputation-based neural network combinations 15
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.2 Majority Voting Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Reputation-Based Voting Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Discriminating between Healthy and Abnormal Swallows . . . . . . . . . . . . . . . . 23
2.3.1 Signal Acquisition and pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.3 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4.1 Classification accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4.2 Internal neural network representations . . . . . . . . . . . . . . . . . . . . . . . 32
2.4.3 Training error and convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.4 Local robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4.5 Global robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3 Automatic discrimination between safe and unsafe swallowing using a reputation-
based classifier 41
3.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.1 Automatic classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.1 Data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.2 Data segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.3 Pre-Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.4 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.5 Reputation-Based Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3.6 Classifier evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5.1 Dual versus single axis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5.2 Internal representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.5.3 Reputation-based classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.5.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4 Analyzing the performance of the static reputation-based algorithm 62
4.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3 Large n behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3.2 Modeling the Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3.3 Asymptotic Analysis of SRB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.4 Medium and small n behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4.1 Patterns of Success and Failure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4.2 Probability of error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4.3 Results for Uniform Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4.4 Results for Gaussian Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5 Time-Evolving Reputation-Based Classifier and its Application on Classification of
Physiological Responses of Children With Disabilities 76
5.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.2.1 Dynamic weighted voting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2.2 Dynamic reputation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.3 Mathematical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3.2 Properties of the Dirichlet Distribution . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4 Static Reputation-Based Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.5 Time-Evolving Reputation-Based Algorithm Using the Dirichlet Distribution . 84
5.5.1 An Updating Rule for Reputation Values . . . . . . . . . . . . . . . . . . . . . . . 88
5.6 Clinical Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.6.1 The Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.6.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.6.4 The Effect of the Constant C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.8 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.8.1 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.8.2 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6 Conclusion 102
6.1 Summary of research completed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.2 Summary of contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Bibliography 105
List of Tables
2.1 The average performance of the individual classifiers and their reputation-
based combination. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1 Example of 4 distributions of 10 test samples among different combinations
of classifier outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.1 An example to clarify the updating rule. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.2 The average performance of different classifiers. . . . . . . . . . . . . . . . . . . . . . . 95
5.3 The P values of t-test between the time-evolving classifier, static classifier,
and majority voting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
List of Figures
2.1 The schematic of back-propagation neural network with 3 inputs and 4 hid-
den layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2 The block diagram of the proposed algorithm . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3 The Hinton graph of the weight matrix for the time domain classifier . . . . . . . 33
2.4 The Hinton graph of the weight matrix for the frequency domain classifier
(Peak - peak value of FFT; BW - bandwidth) . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5 The Hinton graph of the weight matrix for the information theoretic classifier. 35
2.6 The training error of the different neural network classifiers versus the num-
ber of training epochs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.7 Sensitivity to varying magnitudes of perturbation of the input vector . . . . . . . 37
2.8 The output decision versus magnitude of input perturbation . . . . . . . . . . . . . 38
2.9 The accuracy of the proposed classifier with increasing magnitude of pertur-
bation in the test set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.10 The accuracy of the proposed classifier as the number of contaminated sam-
ples increase. The p-values arise from a comparison of the accuracy between
the uncontaminated sample and samples with varying levels of contamination. 40
3.1 Data collection set-up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 Example of safe swallowing signals from AP and SI axes . . . . . . . . . . . . . . . . . 48
3.3 Example of unsafe swallowing signals from AP and SI axes . . . . . . . . . . . . . . . 49
3.4 Classification performance for the single-axis (AP, SI) and dual-axis (AP + SI)
reputation-based classifiers. The height of each bar denotes the average of
the respective performance measure while the error bars denote standard de-
viation around the mean. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.5 Parallel axes plot depicting the internal representation of safe (solid line) and
unsafe (dashed line) swallows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.6 Comparison of the densities of classification accuracies by majority voting
(top) and reputation-based classification (bottom) for safe and unsafe swal-
low discrimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.1 Comparison between majority voting and SRB algorithm for different uni-
form noise distributions, for (a) three, and (b) fifteen classifiers. . . . . . . . . . . . 72
4.2 Comparison between majority voting and SRB algorithm for uniform noise
with b = 0.3 and different numbers of classifiers. . . . . . . . . . . . . . . . . . . . . . . 73
4.3 Comparison between majority voting and SRB algorithm for different Gaus-
sian noise powers, for (a) three, and (b) fifteen classifiers respectively. . . . . . . . 74
4.4 Comparison between majority voting and SRB algorithm for a fixed value of
standard deviation of Gaussian noise (σ = 0.3) and different number of clas-
sifiers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.1 An example of the voting scheme in the proposed algorithm. . . . . . . . . . . . . . 87
5.2 Sample physiological signals collected from one of the participants. In each
graph, the horizontal axis denotes time while the vertical axis is signal ampli-
tude in arbitrary units. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.3 The effect of the constant C on the time-evolving reputation-based algorithm. 98
5.4 The evolution of reputation values for the 5 classifiers for different values of C . 99
Chapter 1
Background and Literature Review
1.1 Combining Multiple Classifiers
Multiple Classifier Systems (MCS) have been widely used by machine learning researchers
for solving different pattern classification problems. There are several reasons for using
MCS. Some of these are listed below.
1. Empirical studies have demonstrated that MCS can yield higher classification perfor-
mance than that of the best individual classifier in many applications [89, 135]. For
some cases, there are analytical proofs of multi-classifier performance gain [65, 134].
2. In rehabilitation engineering applications, different representations of a pattern may
be available to the designer, suggesting that several classifiers could be developed
and trained. For instance, in designing access pathways, various physiological sig-
nals such as heart rate, cortical hemodynamics, and EEG signals could serve as dif-
ferent descriptions of the same pattern. If the designer insists on using one classifier
for solving this problem, he or she will be faced with the challenge of determining
appropriate feature normalizations.
3. Even if feature normalization is not required for some applications, combining multiple
classifiers may be preferred over the use of a single classifier. With multiple clas-
sifiers, we can reduce the dimensionality of the input feature vectors and therefore
reduce the complexity of the problem [89].
4. Sometimes more than one training set is available for training classifiers. These sets
could be generated at different times or in different environments. There is likely new
information in each training set. To exploit this new information, different classifiers
should be developed on each data set [50, 19].
5. Some classifiers such as neural networks require random initialization. For each ini-
tialization, the output will be slightly different, even for classifiers trained on the same
data. One way to obtain the best overall average performance is to combine different
versions of these classifiers [56].
6. For a given pattern recognition problem, typically more than one classification algo-
rithm is available, each yielding a different level of classification performance with dif-
ferent data. It is therefore unsurprising that the combination of admissible classifiers
for a given problem often leads to better classification results than that achievable by
any individual classifier alone [74].
Most of the available algorithms for combining classifiers can be categorized into three
groups based on their architecture: parallel, cascade, and hierarchical. In the parallel archi-
tecture [89, 62, 135], each classifier performs independently and sends its output to a judge.
The judge then decides about the class label of the input based on outputs of the individual
classifiers. In the cascade architecture [129, 16], individual classifiers are arranged in a lin-
ear sequence. In this architecture, the output of each classifier is the input of the next one.
The number of possible classes for a specific pattern reduces after this pattern passes each
classifier. To increase the efficiency of the system, computationally cheap classifiers are in-
voked first, followed by more expensive ones [56]. The hierarchical architecture [57, 102] is
similar to a decision tree classifier [24, 97]. The tree nodes in this format are associated with
complex classifiers and a large number of features. This architecture offers high efficiency
and flexibility in exploiting the discriminant power of different types of features. In this
thesis research, we focus primarily on the parallel architecture, given its natural fit to the
target problems described later.
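As a concrete sketch of the parallel architecture, the following toy example (hypothetical classifiers and judge, not an implementation from this thesis) shows independent classifiers each sending an abstract-level output to a judge:

```python
from collections import Counter

def parallel_combine(classifiers, x, judge):
    """Parallel architecture: every classifier decides independently,
    then a judge fuses the abstract-level outputs into one decision."""
    votes = [clf(x) for clf in classifiers]
    return judge(votes)

def majority_judge(votes):
    """A simple judge: pick the most frequent class label."""
    label, _ = Counter(votes).most_common(1)[0]
    return label

# Toy stand-ins for trained classifiers (thresholds chosen arbitrarily).
def clf_a(x):
    return 1 if x > 0.0 else 0

def clf_b(x):
    return 1 if x > -0.5 else 0

def clf_c(x):
    return 1 if x > 0.5 else 0

print(parallel_combine([clf_a, clf_b, clf_c], 0.2, majority_judge))  # -> 1
```

In a cascade architecture, by contrast, each classifier's output would become the next classifier's input rather than being sent to a judge.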
To better understand the different algorithms for combining classifiers we introduce the
following notation.
Assume the time series S is the pre-processed version of an acquired signal. Also let Θ =
{θ1, θ2, ..., θL} be a set of L ≥ 2 classifiers and Ω = {ω1, ω2, ..., ωc} be a set of c ≥ 2 class
labels, where ωj ≠ ωk, ∀j ≠ k. Without loss of generality, Ω ⊂ ℕ. The input of each classifier
is the feature vector x ∈ ℝ^ni, where ni is the dimension of the feature space for the i-th
classifier θi, whose output is a class label ωj, j = 1, ..., c. In other words, the i-th classifier,
i = 1, ..., L, is a functional mapping ℝ^ni → Ω, which for each input x gives an output
θi(x) ∈ Ω. Generally, the classifier function could be linear or non-linear. It is assumed that
for the i-th classifier, a total of di subjects are assigned for training.
Parallel architecture algorithms operate on the output of individual classifiers. Xu et al.
[135] categorized the possible outputs of different classifiers into three groups.
• Abstract: The classifier θi outputs only a unique class label, ωi ∈ Ω. At this level,
no information is available about the certainty of the guessed label. Moreover, no
alternative class label accompanies the guessed label.
• Rank: The output of the classifier θi is a subset of Ω. This subset of class labels is
rank-ordered in terms of the likelihood of the input having one of these labels. This
approach is useful when there are a large number of classes for the problem.
• Measurement: The classifier θi outputs a measurement value for each class, which
represents the degree to which the input belongs to that class.
Among these different types of outputs, the 'measurement' type conveys the richest
information about the labels and the decision. However, this type of output does not
have the generality of the abstract type. For the measurement output, all the classifiers should
be able to output a measurement value for each class label. Furthermore, the measurement
values for all the classifiers should be of the same scale. On the other hand, the abstract
output may arise from any kind of classifier. In fact, several classifiers, each designed within
a different context, could be combined. This generality makes the abstract output the most
broadly applicable type. In this research, our focus is on the abstract-level problem.
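To make the three output levels concrete, here is an illustrative sketch (hypothetical posterior values) showing how one classifier's internal scores could be reported at each level:

```python
def measurement_output(posteriors):
    """Measurement level: a degree-of-membership score for every class."""
    return posteriors

def rank_output(posteriors):
    """Rank level: class labels ordered from most to least likely."""
    return sorted(posteriors, key=posteriors.get, reverse=True)

def abstract_output(posteriors):
    """Abstract level: only the single most likely class label."""
    return max(posteriors, key=posteriors.get)

p = {"omega1": 0.6, "omega2": 0.3, "omega3": 0.1}
print(measurement_output(p))  # {'omega1': 0.6, 'omega2': 0.3, 'omega3': 0.1}
print(rank_output(p))         # ['omega1', 'omega2', 'omega3']
print(abstract_output(p))     # 'omega1'
```

Moving down the levels, information is discarded: the abstract output can always be derived from the rank or measurement output, but not vice versa.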
Most of the MCSs in the literature are of the measurement output type [54, 40, 47]. Some
interesting classifier combinations are proposed by Kittler et al. in [62]. Example rules for
combining classifiers include the product rule, sum rule, median rule, max rule, and min
rule. In all of these methods, individual classifiers use distinct features to estimate the pos-
terior probabilities given the input pattern. Kittler et al. [62] have shown that the sum rule has
better performance on average than the other methods listed above. The reason is that this
method is less sensitive to the errors of individual classifiers. Also Kittler et al. have shown
that the product rule is the most appropriate for error-free independent probabilities. An-
other example of a measurement level method is the adaptive weighting algorithm [123].
In this method, a judge evaluates the decisions of individual classifiers. This algorithm has
better performance than its non-adaptive counterpart. Also, by treating the outputs of in-
dividual classifiers as fuzzy membership values, Cho and Kim [22, 23] proposed different
combination methods based on fuzzy rules. Another approach for combining classifiers at
the measurement level is a mixture of local experts (MLE) [54]. In this adaptive method,
first, the input space is stochastically partitioned into several subspaces. Then, a set of
expert networks and a gating network are trained, with each expert specializing in one subspace.
Although this method works well for problems with disjoint regions, it suffers from bound-
ary effects between regions [116]. To overcome this problem, a hierarchical MLE has been
proposed in [57]. In this method the expectation-maximization (EM) [30] algorithm is used
to learn an online hierarchical mixture model with a softmax gating function. By using a
soft gating function, the algorithm is able to accommodate the boundaries of different
regions. Also, the EM algorithm can learn and detect the dense areas in the feature space and
appropriately train the experts. However, as the EM algorithm only learns the cluster cen-
troids and not intermediary points, it fails to properly classify many non-linearly separable
problems [116].
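The fixed combination rules of Kittler et al. [62] can be sketched in a few lines; this sketch assumes each classifier outputs a posterior vector over the c classes (the example numbers are hypothetical):

```python
import numpy as np

def combine(posteriors, rule="sum"):
    """Fuse per-classifier posterior vectors into a single class decision.

    posteriors: array-like of shape (L classifiers, c classes).
    """
    P = np.asarray(posteriors, dtype=float)
    fused = {
        "sum": P.sum(axis=0),        # sum rule
        "product": P.prod(axis=0),   # product rule
        "max": P.max(axis=0),        # max rule
        "min": P.min(axis=0),        # min rule
        "median": np.median(P, axis=0),  # median rule
    }[rule]
    return int(np.argmax(fused))

# Three classifiers, two classes; the third classifier disagrees strongly.
P = [[0.8, 0.2], [0.7, 0.3], [0.05, 0.95]]
print(combine(P, "sum"))      # 0: column sums are [1.55, 1.45]
print(combine(P, "product"))  # 1: column products are [0.028, 0.057]
```

The example illustrates why the sum rule is less sensitive to individual classifier errors: a single near-zero posterior flips the product rule's decision but not the sum rule's.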
Rank-level MCSs [51, 4] are less popular than measurement-level ones. Among rank-level
MCSs, the algorithm proposed in [51] is one of the most popular. In this method, each clas-
sifier first ranks the class labels for each input pattern. Then the Borda count method [8]
is utilized to find the fused rank table. For a given class, we count within each classifier’s
output, the number of classes ranked below the class in question. The Borda count for a
class is the sum of these counts across classifiers. Although this method is simple to imple-
ment and requires no training, it does not consider the differences in individual classifier
capabilities. All the classifiers in this method are treated equally even if there is some prior
information about the effectiveness of some classifiers compared to others. To solve this
problem, different methods have been proposed to modify the assignment of weights to the
rank scores produced by each classifier. For this matter, two methods are suggested in [49]
and [84], both of which assign weights based on the confidence of the combined decisions
measured by the Borda count method. Suggested measures include a statistic related to the
distribution of sum of ranks and the intervals between the computed Borda counts for a set
of classes [51].
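The Borda count fusion described above can be sketched as follows (toy rankings, not data from the cited studies):

```python
def borda_count(rankings):
    """Fuse rank-level outputs: for each class, add up, across classifiers,
    the number of classes ranked below it; the highest total wins."""
    scores = {}
    for ranking in rankings:
        for position, label in enumerate(ranking):
            # Number of classes ranked below `label` by this classifier.
            scores[label] = scores.get(label, 0) + len(ranking) - 1 - position
    return max(scores, key=scores.get)

rankings = [
    ["A", "B", "C"],  # classifier 1: A most likely
    ["B", "A", "C"],  # classifier 2: B most likely
    ["A", "C", "B"],  # classifier 3: A most likely
]
print(borda_count(rankings))  # 'A' (Borda scores: A=5, B=3, C=1)
```

As the text observes, every classifier's ranking carries equal weight here; the weighted variants in [49] and [84] would scale each classifier's contribution by a confidence term.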
Dempster-Shafer classifiers [30, 111] are another category of rank-level classifiers. These
classifiers are used to combine cumulative evidence or change priors in the presence of
new evidence. Dempster-Shafer theory of evidence is a generalization of Bayesian reasoning
that does not require probabilities for each parameter of interest. Despite its advantages,
it has been shown by Zadeh [137] that this combination rule may generate counter-intuitive
results. Other flaws and drawbacks of this method have been discussed in [130, 131].
The abstract level MCS [89, 135, 102] have some advantages over the MCS designed for
other types of classifier outputs. Abstract-level MCSs work with different types of classifiers,
each developed and trained in a different context. Majority voting [135] is one of the
best known abstract level MCS. This method is simple to implement and computationally
efficient. Because of these advantages, this algorithm has been widely used in solving many
pattern classification problems [89, 69, 79, 100]. Despite its simplicity, majority voting can
significantly improve upon the classification accuracies of individual classifiers [69]. In [64]
this algorithm is compared to many measurement level algorithms such as the sum and
product rules. In these comparisons, the performance of the majority voting algorithm is
almost the same as that of the best measurement level algorithm. Weighted majority voting
is another MCS at the abstract level. Several versions of weighted majority voting have been
proposed in the literature for different applications [135, 115, 43, 52]. Although these algo-
rithms may work properly for a specific application, they are not applicable to biomedical
signals which are mostly non-stationary. Moreover, in most of these methods, the weight of
each individual classifier is determined heuristically. Among different versions of weighted
majority voting, the algorithm proposed by Xu et al. [135] has received the widest acceptance
among researchers. The main reason is that this method uses, instead of a heuristic, a
mathematical approach based on the posterior probability of each class to assign the weight
of each individual classifier. However, as explained in [65], this method
is not able to deal with degenerate cases when one or more beliefs are zero. These cases are
likely to happen in any classification problem.
Another method for combining individual classifiers in the abstract level is boosting
[102]. The main purpose of boosting is to combine several weak classifiers and generate
a strong one. Although this algorithm is somewhat robust against overfitting, it is very sen-
sitive to noisy data, mislabels, and outliers [56]. Moreover, it requires many comparable
classifiers to perform properly. Different adaptive versions of the boosting algorithm have
been proposed to take full advantage of weak classifiers [39, 38, 29, 41]. Most of these algo-
rithms deploy a coordinate-wise gradient descent approach and try to minimize a convex
cost function [83]. However, it has been shown in [82] that these methods are highly
susceptible to random classification noise. Long and Servedio [82] showed that under random
classification noise, these boosting methods are unable to perform better than 50% accuracy.
Because there is no standard form of weighted majority algorithm in the literature, ma-
jority voting is the selected benchmark for abstract level classifiers. Therefore, we further
investigate this algorithm in detail.
1.1.1 Majority Voting Algorithm
In a multi-classifier system, the problem is to arrive at a global decision Θ*(x) = ωj given
a number of local decisions θi(x) ∈ Ω, where the following equalities generally do not hold
[135]:

θ1(x) = θ2(x) = ... = θL(x). (1.1)
In the literature, a classical approach for solving this problem is majority voting [53] [117].
To express this idea mathematically, we define an indicator function
Ii(x, ωj) = I(θi(x) − ωj) = { 1, when θi(x) = ωj,
                             { 0, otherwise.          (1.2)
Now, using (1.2), the majority voting rule can be expressed as follows:

Θ*(x) = { ωmax, if max_ωj IΩ(x, ωj) > L/2,
        { Q, otherwise,                              (1.3)

where ωmax = argmax_ωj IΩ(x, ωj), IΩ(x, ωj) = ∑_{i=1}^{L} Ii(x, ωj), j = 1, ..., c, and Q ∉ Ω
is the rejection state. In other words, given a feature vector,
each classifier votes for a specific class. The class with the majority of votes is selected as the
candidate class. If the candidate class earns more than half of the total votes, it is selected
as the final output of the system. Otherwise, the feature vector is rejected by the system.
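The rule above can be sketched in a few lines of Python (a minimal sketch; the function name is our own, and `None` stands in for the rejection state Q):

```python
import numpy as np

def majority_vote(votes, n_classes):
    """Combine local decisions by simple majority voting.

    votes     : list of class labels, one per classifier (the decisions theta_i(x)).
    n_classes : number of classes c.
    Returns the winning class label, or None (the rejection state Q)
    when no class earns more than half of the L votes.
    """
    L = len(votes)
    counts = np.bincount(votes, minlength=n_classes)  # vote tally for each class j
    winner = int(np.argmax(counts))
    if counts[winner] > L / 2:
        return winner
    return None  # rejection state Q

# Three classifiers, two of them agree on class 1: 2 > 3/2, so class 1 wins.
print(majority_vote([1, 1, 0], n_classes=2))  # -> 1
# Three-way split over three classes: the top count is 1, not > 3/2, so reject.
print(majority_vote([0, 1, 2], n_classes=3))  # -> None
```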
The majority voting algorithm is computationally inexpensive, simple to implement and
applicable to a wide array of classification problems [69] [56]. However, this method suffers
from a major drawback: the decision heuristic is strictly democratic, meaning that the votes
from different classifiers are always equally weighted, regardless of the past performance of
individual classifiers. Therefore, votes of weak classifiers, i.e., classifiers whose performance
only slightly exceeds that of the random classifier, can diminish the overall performance of
the system when they have the majority. To exemplify this issue, consider a classification
system with c = 2 classes, Ω = {ω1, ω2}, and L = 3 classifiers, Θ = {θ1, θ2, θ3}, where two
are weak classifiers with 51% average accuracy while the remaining one is a strong classifier
with 99% average accuracy. Now assume that for a specific feature vector both the weak
classifiers vote for ω1 but the strong classifier votes for ω2. Based on the majority voting
rule, ω1 is preferred over ω2, which is most likely an incorrect classification. It has been
shown in [123] [54] [56] that, in practice, there are various situations in which the majority
vote may be suboptimal.
To overcome this drawback of the majority voting algorithm, [135] introduced the no-
tion of weighted majority voting by incorporating classifier-specific beliefs which reflect
each classifier’s uncertainty about a given test case. However, this algorithm is not appli-
cable when some of the beliefs are equal to zero. This degenerate case can easily happen
whenever there is a large number of classes in the system. Moreover, this approach is ac-
curate only when the prior probabilities of all classes are equal, an assumption that can be
violated in many situations. In a similar vein, [115] proposed a weighted majority vote rule
using a genetic algorithm. In this method, the performance of the whole set of classifiers
is maximized rather than that of each classifier separately. Unfortunately, this algorithm
is computationally costly and provides only a slight improvement in classification perfor-
mance. Motivated by the idea of [135], we propose a novel algorithm for combining several
classifiers based on their individual reputations, numerical weights that reflect each classi-
fier’s past performance. The algorithm will be discussed in detail in Chapter 2.
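To make the notion concrete, a generic weighted vote can be sketched as follows (an illustration only, not the exact rule of [135]; the log-odds weighting and all names are our own choices):

```python
import numpy as np

def weighted_vote(votes, weights, n_classes):
    """Illustrative weighted majority vote: each classifier's vote is
    counted with a per-classifier weight instead of counting all votes equally."""
    scores = np.zeros(n_classes)
    for v, w in zip(votes, weights):
        scores[v] += w
    return int(np.argmax(scores))

# One classic choice of weights is the log-odds of each accuracy p_i.
acc = np.array([0.51, 0.51, 0.99])
w = np.log(acc / (1 - acc))
# Two weak classifiers vote 0, the strong one votes 1; the strong vote now wins,
# whereas unweighted majority voting would have picked class 0.
print(weighted_vote([0, 0, 1], w, n_classes=2))  # -> 1
```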
An MCS is especially useful if the individual classifiers are independent. It has been shown
in [135, 56] that this condition can be satisfied by using different training or feature sets.
If this is not possible, many resampling techniques may be used to satisfy the indepen-
dence requirement. Stacking [133], bagging [14], and boosting [102] are the most popular
resampling techniques. In stacking, the outputs of individual classifiers are used to train the
stacked classifier. The final decision is made based on the outputs of individual classifiers
in addition to that of the stacked classifier. In bagging, several versions of the initial dataset
are created using bootstrapping. The final decision is made using a fixed rule such as av-
eraging. Boosting is another method for generating different training sets. In this method,
the classifiers are trained hierarchically to learn to discriminate more complex regions of
feature space.
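The bagging step described above can be sketched as follows (a minimal sketch; the function name and toy data are our own, and a real system would train one classifier per replicate):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_sets(X, y, n_sets):
    """Create n_sets bootstrap replicates of (X, y) by sampling
    n rows with replacement, as in bagging."""
    n = len(y)
    for _ in range(n_sets):
        idx = rng.integers(0, n, size=n)   # sample indices with replacement
        yield X[idx], y[idx]

# Toy dataset: 6 samples, 2 features.
X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 1, 1, 1])

# Each replicate has the original size but a different composition, so
# classifiers trained on them are approximately independent of one another.
for Xb, yb in bootstrap_sets(X, y, n_sets=3):
    print(Xb.shape, np.bincount(yb, minlength=2))
```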
1.2 Application of MCS to Rehab Engineering
Researchers at Holland Bloorview Kids Rehabilitation Hospital are working to improve ac-
cess technologies for patients who are nonverbal and severely motorically impaired. The
main focus of much of this research is to recognize a specific parameter-of-interest related
to physical or physiological states of the individual. The parameter-of-interest, for instance,
could be the affective state of a person (e.g., happy or sad), communicative intent (e.g.,
preferred versus non-preferred) or the presence or absence of a chronic condition such as
dysphagia. A number of physiological and biomechanical signals are typically recorded
from different participants in response to several different stimuli. Features are typically
extracted from these signals and the parameter-of-interest is estimated using various clas-
sification algorithms. Although many novel ideas have been proposed for recognizing the
parameter-of-interest using single classifiers, the potential of MCS is largely unexplored.
Given the many different signal modalities exploited (e.g., biopotentials, mechanical vibra-
tions, thermography, optical signals), there is reason to believe that the application of MCS
may enhance classification. Specifically, we focus on two different problems: dysphagia
detection and classification of physiological arousal.
1.2.1 Dysphagia Detection
Dysphagia refers to any swallowing disorder [81] and may arise secondary to stroke, mul-
tiple sclerosis, and eosinophilic esophagitis, among other conditions [85]. If unmanaged,
dysphagia may lead to aspiration pneumonia, in which food and liquid enter the airway
and pass into the lungs [32]. The video-fluoroscopic swallowing study (VFSS) is the gold standard
method for dysphagia detection [119]. In this method, clinicians detect dysphagia using
a lateral X-ray video recorded during ingestion of a barium-coated bolus. The health of a
swallow is judged according to criteria such as the depth of airway invasion and the degree
of bolus clearance after the swallow. However, this technique requires expensive and spe-
cialized equipment, ionizing radiation and significant human resources, thereby precluding
its use in the daily monitoring of dysphagia [74].
Swallowing accelerometry has been proposed as a potential adjunct to VFSS. In this
method, the patient wears a dual-axis accelerometer infero-anterior to the thyroid notch.
Swallowing events are automatically extracted from the recorded acceleration signals. Pat-
tern classification methods are then deployed to discriminate between healthy and un-
healthy swallows.
Sejdic et al. [109] proposed a swallow event detection algorithm based on the sequential
fuzzy partitioning of the standard deviation of dual-axes acceleration signals. They achieved
over 90% detection accuracy. Lee et al. [75] proposed a pseudo-automatic neural network
algorithm for swallow event detection in which the standard deviation of the recorded signal
was compared to an empirically-derived threshold. However, in many cases, the segments
selected by this method differed from those derived manually by human experts. Consid-
ering the acceleration signal as a stochastic diffusion, Damouras et al. [27] introduced an
online volatility-based technique for detecting swallowing events from long records of un-
processed dual axis accelerations. They achieved performance comparable to previously
published off-line detection methods.
Using a radial basis classifier with statistical and energetic features, Lee et al. [71] de-
tected aspirations from single-axis cervical acceleration signals with approximately 80%
sensitivity and specificity in a pediatric population. In an adult stroke population, Lee et
al. [74] discriminated healthy and abnormal swallowing using a non-linear classification
paradigm. In [74], a genetic algorithm (GA) was first used to select the most discriminatory
feature combinations; then, both a regular neural network classifier and a probabilistic
neural network classifier were utilized for classification. Despite promising classifica-
tion results, the proposed method was computationally expensive. A genetic algorithm was
used to select up to 12 features from among 444 possible choices. The large number of fea-
tures necessitated a large training set, which is not always available.
In the first two phases of our research, we applied the proposed MCS to the dysphagia
detection problem.
1.2.2 Recognition of physiological arousal
In the automatic recognition of physiological arousal, researchers try to decode the affec-
tive state of a participant using physiological signals recorded from the participant’s body.
The detection of change in an individual’s physiological state is relevant to bedside moni-
toring systems [112], ambulant monitoring of medically fragile individuals [44], monitoring
of patients under high-stress conditions [110], and communication with individuals with
severe and multiple disabilities [10]. To detect physiological arousal of individuals, several
physiological signals are established as indicators including electrical brain activity [7], elec-
trodermal activity (EDA) [10, 13], skin temperature [61] and cardiorespiratory signals [96].
Most of the previous efforts in physiological arousal detection have examined differ-
ent physiological signals independently [6, 36, 125]. Further, the correlation between these
signals is often not considered in the literature [9]. In fact, most of the previous emotion
recognition algorithms neglect the multivariate nature of the problem and invoke a univariate
approximation [18, 59, 93]. Recently, [9] proposed a multivariate algorithm for emotion
recognition. Although this algorithm is able to detect changes in the physiological states
of individuals, it is not able to classify these states. The application of emotion recognition
using MCS is a novel approach which will be considered in our research.
In the third phase of our research, we applied the proposed MCS to the problem of phys-
iological arousal detection.
1.3 Thesis Rationale and Objectives
The classifier limitations introduced in the previous Section motivated the development
and design of a new algorithm for combining several classifiers in the context of rehabilita-
tion engineering. The specific rationales for this research are listed below.
• In the context of rehab engineering, several signals are collected from multiple phys-
ical and physiological sources. Each of these signals has a unique characteristic (e.g.
measurement modality) which distinguishes it from others. Therefore, for each of
these signals, a specific classifier should be trained and developed based on individ-
ual signal properties. This idea steered us towards a multiple classifier approach.
• Although several algorithms have been proposed in the literature for combining clas-
sifiers, none have specifically been designed for rehab engineering applications. In
this context, signals are usually non-stationary and non-Gaussian, which make it dif-
ficult for the designer to train and develop one specific classifier.
• By reviewing all the existing classifier combination algorithms in the literature, we
find none that is able to update its classifier combination rule after it has been trained.
In other words, after designing and training a classifier, the designer has no access
to the classifier and cannot invoke further training or retraining. When signals are
time-varying, samples of the test set may be completely different from samples of the
training set. For example, in the physiological arousal detection problem, the emotion
of a person, and hence the signals representing emotion may evolve over the course
of a measurement period. Therefore, there is a need for a classifier which is able to
update its performance based on recently collected data samples.
• In combining classifiers using the majority voting algorithm, the past performance of
classifiers is not considered. Therefore, the vote of a good classifier is equal to that
of a weak one. This limitation can decrease the overall performance of the system,
especially when there are several weak classifiers in the system.
Based on the above rationales, the main objectives of this dissertation were as follows:
1. Develop a new algorithm for combining several classifiers based on their performance
on a validation set. This algorithm is an extension of the popular majority voting al-
gorithm.
2. Apply the proposed algorithm to the signals gathered from a rehab engineering con-
text and compare its performance to that of various existing algorithms.
3. Propose a new dynamic classifier combination algorithm, which is able to update its
performance based on samples of the test set. This algorithm is the dynamic counter-
part of the reputation-based algorithm.
4. Apply the proposed dynamic algorithm to the scenario in which samples of the test set
are significantly different from those of the training set and compare its performance
to that of its static counterpart. The physiological arousal detection problem is a good
example of this scenario.
5. Characterize the asymptotic performance of the static reputation-based algorithm
under well-circumscribed conditions such as a specific type of classifiers, specific
data distribution, and specific dimensionality.
1.4 Thesis Organization
This dissertation contains six chapters. In Chapter 2, we introduce the static reputation-
based algorithm and compare it to other MCS methods using real dysphagia data. In Chap-
ter 3, we examine the application of the proposed algorithm on dysphagia detection and in-
vestigate different contributions of axial accelerometer signals using the proposed method.
The asymptotic behavior of the static reputation-based algorithm is analyzed in Chapter 4.
In Chapter 5, we propose the dynamic reputation-based algorithm and investigate its appli-
cation to physiological arousal detection. Finally, a summary of contributions is presented
in Chapter 6.
Each main Chapter of this dissertation is a verbatim excerpt of a journal article which
addresses one of the objectives of the thesis. As a result, the introduction section of each
chapter may contain redundant information. It is recommended that the reader skip the
redundant introductory sections.
Chapter 2
Reputation-based neural network
combinations
The contents of this chapter are reproduced from a published book chapter: Nikjoo, M.,
Kushki, A., Lee, J., Steele, C., Chau, T. (2011). Reputation-based neural network combi-
nations, in K. Suzuki (editor), Artificial Neural Networks - Methodological Advances and
Biomedical Applications, InTech Publishing, Chapter 8, pp. 151-170. ISBN 978-953-307-
243-2.
2.1 Introduction
The exercise of combining classifiers is primarily driven by the desire to enhance the per-
formance of a classification system. There may also be problem-specific rationale for inte-
grating several individual classifiers. For example, a designer may have access to different
types of features from the same study participant. For instance, in the human identification
problem, the participant’s voice, face, and handwriting provide different types of features.
In such instances, it may be sensible to train one classifier on each type of feature [56]. In
other situations, there may be multiple training sets, collected at different times or under
slightly different circumstances. Individual classifiers can be trained on each available data
set [135, 56]. Lastly, the demand for classification speed in online applications may neces-
sitate the use of multiple classifiers [56].
Optimal combination of multiple classifiers is a well-studied area. Traditionally, the goal
of these methods is to improve classification accuracy by employing multiple classifiers to
address the complexity and non-uniformity of class boundaries in the feature space. For
example, classifiers with different parameter choices and architectures may be combined
so that each classifier focuses on the subset of the feature space where it performs best.
Well-known examples of these methods include bagging [15] and boosting [5].
Given the universal approximation ability of neural networks such as multilayer percep-
trons and radial basis functions [48], there is theoretical appeal in combining several neural
network classifiers to enhance classification. Indeed, several methods have been developed
for this purpose, including, for example, optimal linear combinations [126], mixture of
experts [55], negative correlation learning [21], and evolving neural network ensembles [136].
In these methods, all base classifiers are generally trained on the same feature space (ei-
ther using the entire training set or subsets of the training set). While these methods have
proven effective in many applications, they are associated with numerical instabilities and
high computational complexity in some cases [5].
Another approach to classifier combination is training the base classifiers on different
feature spaces. This approach is advantageous in combating the undesirable effects associated
with high-dimensional feature spaces (the curse of dimensionality). Moreover, the
feature sets can be chosen to minimize the correlation between the individual base clas-
sifiers to further improve the overall accuracy and generalization power of classification
[126]. These methods are also highly desirable in situations where heterogeneous feature
combinations are used.
Combination of classifiers based on different feature spaces has generally been accomplished
through fixed classification rules. These rules may select one classifier output among all
available outputs (for example, using the minimum or maximum operator), or they may
provide a classification decision based on the collective outputs of all classifiers (for exam-
ple, using the mean, median, or voting operators) [67, 11]. Among the latter, the simplest and
most widely applied rule is the majority vote [53, 117]. Many authors have demonstrated
that classification performance improves beyond that of the single classifier scenario when
multiple classifier decisions are combined via a simple majority vote [135, 86]. Xu et al. [135] further
introduced the notion of weighted majority voting by incorporating classifier-specific be-
liefs which reflect each classifier’s uncertainty about a given test case. Unfortunately, this
method does not deal with the degenerate case when one or more beliefs are zero, a situation
likely to occur in multi-class classification problems. Moreover, this classifier relies
on the training data set to derive belief values for each classifier. This approach, therefore,
risks overfitting the classifier to the training set and a consequent degradation in general-
ization power.
In this chapter, we describe a method for combining several neural network classifiers in
a manner which is computationally inexpensive and does not demand additional training
data beyond that needed to train individual classifiers. Our reputation method will build on
the ideas of [135]. In the following, we present notation that is used throughout the chapter
and detail the majority voting algorithm using this notation. The following presentation is
adapted from [90].
2.1.1 Notation
Assume the time series, S, is the pre-processed version of an acquired signal. Also let Θ =
{θ1, θ2, ..., θL} be a set of L ≥ 2 classifiers and Ω = {ω1, ω2, ..., ωc} be a set of c ≥ 2 class labels,
where ωj ≠ ωk, ∀ j ≠ k. Without loss of generality, Ω ⊂ ℕ. The input of each classifier is the
feature vector x ∈ ℝ^{ni}, where ni is the dimension of the feature space for the i-th classifier θi,
whose output is a class label ωj, j = 1, ..., c. In other words, the i-th classifier, i = 1, ..., L, is a
functional mapping, ℝ^{ni} → Ω, which for each input x gives an output θi(x) ∈ Ω. Generally,
the classifier function could be linear or non-linear. It is assumed that for the i-th classifier,
a total of di subjects are assigned for training. The main goal of combining the
decisions of different classifiers is to increase the accuracy of class selection.
2.1.2 Majority Voting Algorithm
In a multi-classifier system, the problem is to arrive at a global decision Θ*(x) = ωj given a
number of local decisions θi(x) ∈ Ω, where the following equalities generally do not hold [135]:

θ1(x) = θ2(x) = ... = θL(x). (2.1)
In the literature, a classical approach for solving this problem is majority voting [53] [117].
To express this idea mathematically, we define an indicator function
Ii(x, ωj) = I(θi(x) − ωj) = { 1, when θi(x) = ωj,
                             { 0, otherwise.          (2.2)
Now, using (2.2), the majority voting rule can be expressed as follows:

Θ*(x) = { ωmax, if max_ωj IΩ(x, ωj) > L/2,
        { Q, otherwise,                              (2.3)

where ωmax = argmax_ωj IΩ(x, ωj), IΩ(x, ωj) = ∑_{i=1}^{L} Ii(x, ωj), j = 1, ..., c, and Q ∉ Ω
is the rejection state. In other words, given a feature vector,
each classifier votes for a specific class. The class with the majority of votes is selected as the
candidate class. If the candidate class earns more than half of the total votes, it is selected
as the final output of the system. Otherwise, the feature vector is rejected by the system.
The majority voting algorithm is computationally inexpensive, simple to implement and
applicable to a wide array of classification problems [69] [56]. Despite its simplicity, majority
voting can significantly improve upon the classification accuracies of individual classifiers
[69]. However, this method suffers from a major drawback: the decision heuristic is strictly
democratic, meaning that the votes from different classifiers are always equally weighted,
regardless of the past performance of individual classifiers. Therefore, votes of weak classi-
fiers, i.e., classifiers whose performance only slightly exceeds that of the random classifier,
can diminish the overall performance of the system when they have the majority. To exem-
plify this issue, consider a classification system with c = 2 classes, Ω = {ω1, ω2}, and L = 3
classifiers, Θ = {θ1, θ2, θ3}, where two are weak classifiers with 51% average accuracy while
the remaining one is a strong classifier with 99% average accuracy. Now assume that for a
specific feature vector both the weak classifiers vote for ω1 but the strong classifier votes
for ω2. Based on the majority voting rule, ω1 is preferred over ω2, which is most likely an
incorrect classification.
As discussed in [123] [54] [56], in practice, there are various situations in which the
majority vote may be suboptimal. Motivated by the above, we propose a novel algorithm for
combining several classifiers based on their individual reputations, numerical weights that
reflect each classifier’s past performance. The algorithm is detailed in the next section and
again is adapted from [90].
2.2 Reputation-Based Voting Algorithm
To mitigate the risk of the overall decision being unduly influenced by poorly performing
classifiers, we propose a novel fusion algorithm which extends the majority voting concept
to acknowledge the past performance of classifiers. To measure the past performance of the
i-th classifier, we define a measure called reputation, ri ∈ ℝ, 0 ≤ ri ≤ 1. For a classifier with
high performance, the reputation is close to 1 while a weak classifier would have a reputa-
tion value close to 0. For each feature vector, both the majority vote and the reputation of
each classifier contribute to the final global decision. The collection of reputation values for
the L classifiers constitutes the reputation set, R = {r1, r2, ..., rL}. Each classifier is mapped to a
real-valued reputation, ri, namely,
ri = r(θi) = α, i = 1, ..., L, (2.4)

where r : Θ → [0, 1], α ∈ ℝ and 0 ≤ α ≤ 1. To determine the reputation of each classifier,
we utilize a validation set in addition to the classical training and test sets. Specifically, the
performance of the trained classifiers on the validation data determines their reputation
values. Now, we have all the necessary tools to explain the proposed algorithm.
1. For a classification problem with c ≥ 2 classes, we design and develop L ≥ 2 individual
classifiers. The proposed algorithm is especially useful if the individual classifiers are
independent. This condition can be guaranteed by using different training sets or
using various resampling techniques such as bagging [14] and boosting [102]. Unlike
some of the previous work [70] [53], there is no restriction on the number of classifiers
L and this value can be either an odd or an even number. Also, it should be noted here
that, in general, the feature space dimension, n i , of each classifier could be different
and the number of training exemplars, d i , for each classifier could be unique.
2. After training the L classifiers individually, the respective performance of each is eval-
uated using the validation set and a reputation value is assigned to each classifier. The
validation sets are disjoint from the training sets. It is important to note that we
use two different validation sets here. The first is the traditional validation set, which
is used repeatedly until the training of a classifier is deemed satisfactory [33]. The second
validation set, however, is used to calculate the reputation values of the individual classifiers.
The accuracy of each classifier is estimated with the corresponding validation
set and normalized to [0, 1] to generate a reputation value. For instance, a classifier,
θi, with 90% accuracy has a reputation ri = 0.9.
3. Now, for each feature vector, x , in the test set, L decisions are made using L distinctive
classifiers:
Θ(x) = {θ1(x), θ2(x), ..., θL(x)}. (2.5)
4. To arrive at a final decision, we consider the votes of the classifiers with high repu-
tations rather than simply selecting the majority class. First, we sort the reputation
values of the classifiers in descending order,
R* = {r1*, r2*, ..., rL*}, (2.6)

such that r1* ≥ r2* ≥ ... ≥ rL*. Then, using this set, we rank the classifiers to obtain a
reputation-ordered set of classifiers,

Θ* = (θ1*, θ2*, ..., θL*). (2.7)
The first element of this set corresponds to the classifier with the highest reputation.
5. Next, we examine the votes of the first m elements of the reputation-ordered set of
classifiers, with
m = { L/2, if L is even,
    { (L + 1)/2, if L is odd.          (2.8)
If the top m classifiers vote for the same class,ωj , we accept the majority vote and take
ωj as the final decision of the system. However, if the votes of the first m classifiers
are not equal, we consider the classifier’s individual reputations (Step 2).
6. Let p(ωj) be the prior probability of class ωj. As before, Θ(x) = {θ1(x), θ2(x), ..., θL(x)}
represents the local decisions made by different classifiers about the input vector x.
The probability that the combined classifier decision is ωj given the input vector x
and the individual local classifier decisions is denoted as the posterior probability,
p(ωj | θ1(x), θ2(x), ..., θL(x)). (2.9)
Clearly, we should choose the class which maximizes this probability.
argmax_ωj p(ωj | θ1(x), θ2(x), ..., θL(x)), j = 1, ..., c. (2.10)
To estimate the posterior probability, we use Bayes' formula. For notational simplicity
we drop the argument x from the local decisions.
p(ωj | θ1, ..., θL) = p(θ1, ..., θL | ωj) p(ωj) / p(θ1, ..., θL), (2.11)
where p (θ1, ...,θL |ωj ) is the likelihood of ωj and p (θ1, ...,θL) is the evidence factor,
which is estimated using the law of total probability
p(θ1, ..., θL) = ∑_{j=1}^{c} p(θ1, ..., θL | ωj) p(ωj). (2.12)
By assuming that the classifiers are independent of each other, the likelihood can be
written as
p(θ1, ..., θL | ωj) = p(θ1 | ωj) ⋯ p(θL | ωj). (2.13)
Substituting (2.12) into the Bayes rule (2.11) and then replacing the likelihood term
with (2.13), we obtain,
p(ωj | θ1, ..., θL) = [∏_{i=1}^{L} p(θi | ωj)] p(ωj) / ∑_{t=1}^{c} [∏_{i=1}^{L} p(θi | ωt)] p(ωt). (2.14)
To calculate the local likelihood functions, p(θi | ωj), we use the reputation values
calculated in Step 2. When the correct class is ωj and classifier θi classifies x into the
class ωj, i.e., θi(x) = ωj, we can write
p(θi(x) = ωj | ωj) = ri. (2.15)
In other words, p(θi(x) = ωj | ωj) is the probability that the classifier θi correctly classifies
x into class ωj when x actually belongs to this class. This probability is exactly equal
to the reputation of the classifier. On the other hand, when the classifier categorizes x
incorrectly, i.e., θi(x) ≠ ωj given that the correct class is ωj, then

p(θi(x) ≠ ωj | ωj) = 1 − ri. (2.16)
When there is no known priority among classes, we can assume equal prior probabil-
ities. Hence,
p(ω1) = p(ω2) = ... = p(ωc) = 1/c. (2.17)
Finally, for each class, ωj, we compute the a posteriori probabilities as given by (2.14)
using (2.15), (2.16), and (2.17). The class with the highest posterior probability is se-
lected as the final decision of the system and the input subject x is categorized as
belonging to this class.
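Steps 3 through 6 above can be sketched as follows (a minimal Python sketch under the equal-priors assumption (2.17); the function and variable names are our own):

```python
import numpy as np

def reputation_combine(votes, reputations, n_classes):
    """Reputation-based fusion of L abstract-level decisions.

    votes       : array of L class labels theta_i(x).
    reputations : array of L values r_i in [0, 1] (validation accuracies).
    n_classes   : number of classes c, with equal priors p(w_j) = 1/c.
    """
    votes = np.asarray(votes)
    reps = np.asarray(reputations, dtype=float)
    L = len(votes)
    m = L // 2 if L % 2 == 0 else (L + 1) // 2      # Eq. (2.8)

    # Step 5: if the m most reputable classifiers agree, accept their vote.
    order = np.argsort(reps)[::-1]                  # descending reputation
    top_votes = votes[order[:m]]
    if np.all(top_votes == top_votes[0]):
        return int(top_votes[0])

    # Step 6: otherwise pick the class maximising the posterior (2.14), with
    # likelihoods r_i for a matching vote and 1 - r_i otherwise, per (2.15)-(2.16).
    posteriors = np.empty(n_classes)
    for j in range(n_classes):
        lik = np.where(votes == j, reps, 1.0 - reps)
        posteriors[j] = lik.prod() / n_classes      # equal priors 1/c, Eq. (2.17)
    posteriors /= posteriors.sum()                  # evidence factor, Eq. (2.12)
    return int(np.argmax(posteriors))

# Two weak classifiers (r = 0.51) vote for class 0; one strong (r = 0.99) votes
# for class 1.  Plain majority voting would pick class 0; reputation-based
# fusion sides with the strong classifier.
print(reputation_combine([0, 0, 1], [0.51, 0.51, 0.99], n_classes=2))  # -> 1
```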
The advantage of the reputation-based algorithm over the majority voting algorithm lies
in the fact that the former has a higher probability of correct consensus and a faster rate of
convergence to the peak probability of correct classification [90].
2.3 Discriminating between Healthy and Abnormal Swallows
We apply the proposed algorithm to the problem of swallow classification. Specifically, the
problem is to differentiate between safe and unsafe swallowing on the basis of dual-axis
accelerometry [106, 27]. The basic idea is to decompose a high-dimensional classification
problem into three lower-dimensional problems, each with a unique subset of features and a
dedicated classifier. The individual classifier decisions are then melded according to the
proposed reputation algorithm.
2.3.1 Signal Acquisition and pre-processing
In this chapter, we consider a randomly selected subset of 100 healthy swallows and 100
dysphagic swallows from the larger database described in [75]. Briefly, dual-axis swallowing
accelerometry data were collected at 10 kHz from 24 adult patients (mean age 64.8± 18.6
years, 2 males) with dysphagia and 17 non-dysphagic persons (mean age 46.9± 23.8 years,
8 males). Patients provided an average of 17.8 ± 8.8 swallows while non-dysphagic
participants completed 19 swallow sequences each. Both groups swallowed boluses of dif-
ferent consistencies. For more details of the instrumentation and swallowing protocol,
please see [75]. It has been shown in [73] that the majority of power in a swallowing vi-
bration lies below 100 Hz. Therefore, we downsampled all signals to 1 kHz. Then, individual
swallows were segmented according to previously identified swallow onsets and offsets [75].
Each segmented swallow was denoised using a 5-level discrete Daubechies-5 wavelet trans-
form. To remove low-frequency motion artifacts due to bolus intake and participant mo-
tion, each signal was subjected to a 4th-order highpass Butterworth filter with a cutoff frequency of 1 Hz.
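The pre-processing chain above can be sketched as follows (a simplified sketch using SciPy; the 5-level db5 wavelet-denoising stage between the two steps is indicated only as a comment, and the synthetic test signal is our own):

```python
import numpy as np
from scipy.signal import butter, filtfilt, decimate

def preprocess(signal, fs_in=10_000, fs_out=1_000):
    """Downsample 10 kHz -> 1 kHz, then apply a 4th-order highpass
    Butterworth filter at 1 Hz to remove low-frequency motion artifacts.
    The 5-level Daubechies-5 wavelet-denoising step would sit between these
    two stages (e.g. with PyWavelets) and is omitted here for brevity."""
    x = decimate(signal, fs_in // fs_out, ftype="fir")   # anti-aliased downsample
    b, a = butter(N=4, Wn=1.0, btype="highpass", fs=fs_out)
    return filtfilt(b, a, x)                             # zero-phase filtering

# One second of synthetic "acceleration": a slow drift plus a 50 Hz component.
fs = 10_000
t = np.arange(fs) / fs
raw = np.sin(2 * np.pi * 0.2 * t) + 0.1 * np.sin(2 * np.pi * 50 * t)
clean = preprocess(raw)
print(len(clean))  # 1000 samples after 10x downsampling
```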
2.3.2 Feature Extraction
Let S be a pre-processed acceleration time series, S = s1, s2, ..., sn. As in previous accelerom-
etry studies, signal features from three different domains were considered [75, 72]. The dif-
ferent genres of features are summarized below.
1. Time-Domain Features
• Mean: The sample mean of a distribution is an unbiased estimate of the location
of that distribution. The sample mean of the time series S can be calculated as

μs = (1/n) ∑_{i=1}^{n} si. (2.18)
• Variance: The variance of a distribution measures its spread around the mean
and reflects the signal's power. An unbiased estimate of the variance can be obtained as

σs² = (1/(n − 1)) ∑_{i=1}^{n} (si − μs)². (2.19)
• Skewness: The skewness of a distribution is a measure of the symmetry of that
distribution. This factor can be computed as follows:

\gamma_{1,s} = \frac{\frac{1}{n}\sum_{i=1}^{n}(s_i - \mu_s)^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(s_i - \mu_s)^2\right)^{1.5}}. \quad (2.20)
• Kurtosis: This feature reflects the peakedness of a distribution and can be found
as

\gamma_{2,s} = \frac{\frac{1}{n}\sum_{i=1}^{n}(s_i - \mu_s)^4}{\left(\frac{1}{n}\sum_{i=1}^{n}(s_i - \mu_s)^2\right)^{2}}. \quad (2.21)
2. Frequency-Domain Features
• The peak magnitude value of the Fast Fourier Transform (FFT) of the signal S is
used as a feature. All the FFT coefficients are normalized by the length of the
signal, n .
• The centroid frequency of the signal S [106] can be estimated as

\bar{f} = \frac{\int_0^{f_{max}} f\, |F_s(f)|^2\, df}{\int_0^{f_{max}} |F_s(f)|^2\, df}, \quad (2.22)

where F_s(f) is the Fourier transform of the signal S and f_{max} is the Nyquist
frequency (5 kHz in this study).
• The bandwidth of the spectrum can be computed using the following formula:

BW = \sqrt{\frac{\int_0^{f_{max}} (f - \bar{f})^2\, |F_s(f)|^2\, df}{\int_0^{f_{max}} |F_s(f)|^2\, df}}. \quad (2.23)
3. Information-Theory-Based Features
• Entropy Rate [94]: The authors of [94] introduced a method for measuring the entropy
rate of a signal, which quantifies the extent of regularity in that signal. They showed
that this rate is useful for signals with some relationship among consecutive signal
points. [74] used the entropy rate for the classification of healthy and abnormal
swallowing. Following their approach, we first normalized the signal S to zero mean
and unit variance. Then, we quantized the normalized signal into 10 equally spaced
levels, represented by the integers 0 to 9, ranging from the minimum to the maximum
value. The sequences of U consecutive points in the quantized signal, S = s_1, s_2, ..., s_n,
were then coded using the following equation:

a_i = s_{i+U-1} \cdot 10^{U-1} + ... + s_i \cdot 10^0, \quad (2.24)

with i = 1, 2, ..., n-U+1. The coded integers comprised the coding set
A_U = \{a_1, ..., a_{n-U+1}\}. Using the Shannon entropy formula, we estimated the entropy

E(U) = -\sum_{t=0}^{10^U - 1} p_{A_U}(t) \ln p_{A_U}(t), \quad (2.25)

where p_{A_U}(t) represents the probability of observing the value t in A_U, approximated
by the corresponding sample frequency. Then, the entropy rate was normalized using the
following equation:

NE(U) = \frac{E(U) - E(U-1) + E(1) \cdot \beta}{E(1)}, \quad (2.26)

where \beta is the percentage of the coded integers in A_U that occurred only once.
Finally, the regularity index \rho \in [0, 1] was obtained as

\rho = 1 - \min_U NE(U), \quad (2.27)

where a value of \rho close to 0 signifies maximum randomness while \rho close to 1
indicates maximum regularity.
• Memory [72]: To calculate the memory of the signal, its autocorrelation function
was computed from zero to the maximum time lag. Then, it was normalized
such that the autocorrelation at zero lag was unity. The memory was estimated
as the time duration from zero to the point where the autocorrelation decays to
1/e of its zero lag value.
• Lempel-Ziv (L-Z) complexity [77]: The L-Z complexity measures the predictability
of a signal. To compute the L-Z complexity of the signal S, the minimum and maximum
values of the signal points were first calculated, and the signal was then quantized
into 100 equally spaced levels between these values. The quantized signal,
B_1^n = b_1, b_2, ..., b_n, was then decomposed into T different blocks,
B_1^n = \psi_1 \psi_2 ... \psi_T. A block \psi was defined as

\Psi = B_j^\ell = b_j, b_{j+1}, ..., b_\ell, \quad 1 \le j \le \ell \le n. \quad (2.28)
The values of the blocks can be calculated as follows:

\psi_1 = b_1, \qquad \psi_{m+1} = B_{h_m+1}^{h_{m+1}}, \quad m \ge 1, \quad (2.29)

where h_m is the ending index of \psi_m, such that \psi_{m+1} is a unique sequence of
minimal length within the sequence B_1^{h_{m+1}-1}. Finally, the normalized L-Z
complexity was calculated as

LZ = \frac{T \log_{100} n}{n}. \quad (2.30)
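The features above can be sketched in plain NumPy as follows (an illustrative implementation, not the thesis code: the function names are ours, U is capped at 4 to keep the tuple-coding alphabet small, and the memory and Lempel-Ziv features are omitted for brevity):

```python
import numpy as np

def time_features(s):
    """Eqs. (2.18)-(2.21): sample mean, unbiased variance, skewness, kurtosis."""
    mu = s.mean()
    var = s.var(ddof=1)                      # unbiased: divides by n - 1
    m2 = ((s - mu) ** 2).mean()              # biased central moments, as used
    m3 = ((s - mu) ** 3).mean()              # in the skewness/kurtosis formulas
    m4 = ((s - mu) ** 4).mean()
    return mu, var, m3 / m2 ** 1.5, m4 / m2 ** 2

def freq_features(s, fs=1000.0):
    """Eqs. (2.22)-(2.23): normalized FFT peak, centroid frequency, bandwidth."""
    n = len(s)
    F = np.fft.rfft(s) / n                   # FFT coefficients normalized by n
    P = np.abs(F) ** 2                       # power spectrum |Fs(f)|^2
    f = np.fft.rfftfreq(n, d=1.0 / fs)
    centroid = (f * P).sum() / P.sum()
    bw = np.sqrt(((f - centroid) ** 2 * P).sum() / P.sum())
    return np.abs(F).max(), centroid, bw

def regularity_index(s, U_max=4):
    """Eqs. (2.24)-(2.27): entropy-rate-based regularity index rho."""
    s = (s - s.mean()) / s.std()             # zero mean, unit variance
    q = np.minimum((10 * (s - s.min()) / np.ptp(s)).astype(int), 9)  # 10 levels
    def entropy_and_beta(U):
        # a_i = q[i+U-1]*10^(U-1) + ... + q[i]*10^0 over all U-tuples
        codes = sum(q[k:len(q) - U + 1 + k] * 10 ** k for k in range(U))
        _, counts = np.unique(codes, return_counts=True)
        p = counts / counts.sum()
        beta = (counts == 1).sum() / counts.sum()  # fraction occurring once
        return -(p * np.log(p)).sum(), beta
    E1, _ = entropy_and_beta(1)
    NE = []
    for U in range(2, U_max + 1):
        EU, beta = entropy_and_beta(U)
        EU1, _ = entropy_and_beta(U - 1)
        NE.append((EU - EU1 + E1 * beta) / E1)   # Eq. (2.26)
    return 1.0 - min(NE)                          # Eq. (2.27)
```

On a pure 50 Hz tone, the centroid lands on 50 Hz and the regularity index is close to 1; on white noise it drops toward 0, matching the interpretation of ρ above.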
2.3.3 Classification
We trained 3 separate back-propagation neural network (NN) classifiers [33], one for each
genre of signal feature outlined above. Hence, the feature space dimensionalities for the
classifiers were 4 (NN with time features), 3 (NN with frequency features) and 3 (NN with
information-theoretic features). Each neural network classifier had 4 hidden units and
1 output, with as many inputs as the dimensionality of its feature space. Figure 2.1 shows
the schematic of one NN classifier used in our work. Although it is possible to invoke
different classifiers for each genre of signal feature, we utilized the same type of
classifier here to facilitate the evaluation of local decisions. The use of different
feature sets for each classifier promotes independence among the classifiers [135].
Figure 2.2 is a block diagram of our proposed algorithm. First, the three small neural
networks classify their inputs independently. Then, using the outputs of these classifiers
and their respective reputation values, the reputation-based algorithm determines the
final label of the input.
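To make the setup concrete, one such small network can be sketched in plain NumPy (a hypothetical `SmallNet` class standing in for a standard back-propagation implementation; the learning rate, initialization, and epoch count here are illustrative, not the thesis's settings):

```python
import numpy as np

class SmallNet:
    """Minimal one-hidden-layer back-propagation network with sigmoid units:
    a few inputs, 4 hidden units, and 1 output, trained by online gradient
    descent on the squared error."""
    def __init__(self, n_in, n_hidden=4, lr=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.5, (n_hidden, n_in + 1))  # +1 bias column
        self.W2 = rng.normal(0, 0.5, (1, n_hidden + 1))
        self.lr = lr

    @staticmethod
    def _sig(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(self, x):
        h = self._sig(self.W1 @ np.append(1.0, x))     # hidden activations
        y = self._sig(self.W2 @ np.append(1.0, h))[0]  # scalar output
        return h, y

    def fit(self, X, t, epochs=500):
        for _ in range(epochs):
            for x, target in zip(X, t):
                h, y = self.forward(x)
                d_out = (y - target) * y * (1 - y)      # output-layer delta
                d_hid = (self.W2[0, 1:] * d_out) * h * (1 - h)
                self.W2 -= self.lr * d_out * np.append(1.0, h)
                self.W1 -= self.lr * np.outer(d_hid, np.append(1.0, x))
        return self

    def predict(self, x):
        return int(self.forward(x)[1] > 0.5)
```

Trained on a small, linearly separable toy problem this converges in a few hundred epochs; the thesis's networks were trained on the 4-, 3-, and 3-dimensional feature vectors described above.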
Classifier accuracy was estimated via 10-fold cross-validation with a 90-10 split. However,
unlike classical cross-validation, we further segmented the 'training' set into an actual
training set and a validation set. In other words, in each fold, 160 swallows (80%) were used
for training, 20 (10%) for validation and 20 (10%) reserved for testing.

Figure 2.1: The schematic of a back-propagation neural network with 3 inputs and 4 hidden units

Figure 2.2: The block diagram of the proposed algorithm

Among the 20 swallows of the validation set, 10 were used as a traditional validation set and 10 were used for
computation of the reputation values. After training, classifier reputations were estimated
using this second validation set. Classifiers were then ranked according to their reputation
values. Without loss of generality, assume r1 ≥ r2 ≥ r3. If θ1 and θ2 cast the same vote about
a test swallow, their common decision was accepted as the final classification. However, if
they voted differently, the a posteriori probability of each class was computed using (5.8)
and the maximum a posteriori probability rule was applied to select the final classification.
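The decision rule just described can be sketched as follows (the fallback branch here pools reputation as a class score, a hypothetical stand-in for the a posteriori rule of Eq. (5.8), which is not restated in this chapter):

```python
import numpy as np

def reputation_combine(votes, reputations):
    """Combine abstract-level classifier outputs by reputation: if the two
    most reputable classifiers agree, accept their common vote; otherwise
    pick the class with the largest total reputation behind it (a stand-in
    for the maximum a posteriori rule of Eq. (5.8))."""
    order = np.argsort(reputations)[::-1]       # most reputable first
    if votes[order[0]] == votes[order[1]]:
        return votes[order[0]]
    score = {}
    for vote, rep in zip(votes, reputations):
        score[vote] = score.get(vote, 0.0) + rep
    return max(score, key=score.get)
```

For instance, with votes (1, 2, 1) and reputations (0.9, 0.8, 0.5), the top two classifiers disagree and the fallback sides with class 1, which carries 0.9 + 0.5 of reputation against 0.8.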
To better understand the difference between the multiple classifier system and a single,
all-encompassing classifier, we also trained a multilayer neural network via back-propagation
with all 10 features, i.e., using the collective inputs of all three smaller classifiers.
This all-encompassing classifier, henceforth referred to as the grand classifier, also had
4 hidden units. We also statistically compared the accuracies of the individual classifiers
against those of a majority-vote classifier combination and a reputation-based classifier
combination (Section 2.4.1).
To understand the knowledge representation of the individual small classifiers, we plot-
ted Hinton diagrams for the input-to-hidden unit weight matrices (Section 2.4.2). Subse-
quently, we qualitatively compared the training performance of the small and grand classi-
fiers to ascertain potential benefits of training a collection of small classifiers (Section 2.4.3).
Through a systematic perturbation analysis, we quantified the local robustness of the reputation-
based neural network combination (Section 2.4.4). In particular, we qualitatively examined
the change in the classifier output and accuracy as the magnitude of the input perturba-
tion increased. Finally, we estimated the breakdown point of the reputation-based neural
network combination using an increasing proportion of contaminants in the overall data set
(Section 2.4.5).
Table 2.1: The average performance of the individual classifiers and their reputation-based combination.
Classifier Average Performance (%)
Time domain NN 67.5±12.5
Frequency domain NN 69.5±10.3
Information theoretic NN 65.5±11.2
Grand classifier 70.0±8.5
Majority vote 74.5±8.9
Combined classifier decision 78.0±8.2
2.4 Results and Discussion
2.4.1 Classification accuracy
Table 2.1 tabulates the local and global classification results. On average, the frequency
domain classifier appears best among the individual NNs while the information-theoretic NN
fares worst. It is also clear from this table that combining the local decisions of the
classifiers using the reputation-based algorithm dramatically increases the overall
performance of the system. The accuracy of the grand classifier is statistically
indistinguishable from those of the small classifiers; however, the grand classifier is
more difficult and time-consuming to train. Hence, there appears to be no justification
for considering an all-encompassing classifier in this application. Collectively, these
results indicate that there is merit in combining neural network
classifiers in this problem domain. The accuracy of the majority vote neural network combi-
nation did not significantly differ from that of the individual (p > 0.11) and grand classifiers
(p = 0.16). On the other hand, the reputation-based combination led to further improve-
ment in accuracy over the time domain (p = 0.04) and information-theoretic (p = 0.05)
classifiers, but did not significantly surpass the grand (p = 0.09) and frequency domain net-
works (p = 0.09).
The reputation-based scheme yields accuracies better than those reported in [74] (74.7%).
Moreover, in [74], the entire database was required and the maximum feature space dimen-
sion was 12. Here, we only considered a fraction of the database and no classifier had a
feature space dimensionality greater than 4. Therefore, our system offers the advantages of
computational efficiency and less stringent demands on training data.
2.4.2 Internal neural network representations
Figures 2.3, 2.4 and 2.5 are the Hinton graphs for the input to hidden layer weight matrices
for the time, frequency, and information theoretic domain neural networks, respectively.
In these figures, the weight matrix of each classifier is represented using a grid of squares.
The area of each square represents the weight magnitude, while the shading reveals the
sign. Shaded squares signify negative weights. The first column denotes the hidden unit
biases while the subsequent columns are the weights on the connections projecting from
each input unit. For instance, the frequency domain neural network uses 3 input features
and 1 bias, resulting in 4 columns. Given that there are 4 hidden units, the weight matrix is
represented as a 4×4 grid.
In Figure 2.3, we see that the first neuron has a very large negative weight for kurtosis
and a sizable one for variance. This suggests that this neuron represents swallows with low
variance and platykurtic distributions. The second neuron seems to primarily represent
swallows with leptokurtic distributions, given its positive weight on the kurtosis input.
By the same token, the third neuron appears to internally denote swallows with large positive
means and leptokurtic distributions. Finally, the last neuron captures swallows primarily
with high variance. Overall, the strongest weights are found on the variance and kurtosis
features, suggesting that they play the most important role in distinguishing between healthy
and unhealthy swallows in our sample. Resonating with our findings here, [71] identified a
dispersion-type measure as discriminatory between healthy and unhealthy swallows. Similarly,
[75] determined that the kurtosis of swallow accelerometry signals tended to be axial-
Figure 2.3: The Hinton graph of the weight matrix for the time domain classifier
specific and thus potentially discriminatory between different types of swallows.
Moving on to Figure 2.4, we notice that neurons one and two seem to have captured
inverse dependencies between the spectral centroid and bandwidth features. While neuron one
embodies swallows with a lower spectral centroid but broad bandwidth, neuron two captures
high-frequency, narrow-band swallows. The peak FFT feature seems to be the least
important spectral information, which is consoling in some sense, as this suggests that
decisions are not based upon signal strength, which may vary greatly across swallows
regardless of swallowing health.
In the information theoretic neural network (Figure 2.5), we find that the memory fea-
ture seems to have a distributed representation across the 4 neurons, with three favoring
weak memory or rapidly decaying autocorrelations. Neuron one almost uniformly consid-
ers all three information theoretic features, specifically epitomizing swallows with low com-
plexity, low entropy rate and minimal memory. This characterization might reflect ’noisy’
Figure 2.4: The Hinton graph of the weight matrix for the frequency domain classifier (Peak - peak value of FFT; BW - bandwidth)
swallows. Interestingly, neuron three focuses on positive complexity and memory. We can
interpret this neuron as representing swallows which have strong predictability and hence
longer memory. In short, it appears that each individual neural network has internally rep-
resented some unique flavors of swallows. This apportioned representation across neural
networks suggests that there is indeed sound rationale to combine classifiers, in order to
comprehensively characterize the diversity of swallows.
2.4.3 Training error and convergence
Figure 2.6 pits the training performance of the small classifiers against that of the grand
classifier as the number of training epochs increases. After 12 epochs, the small classifiers have lower
normalized mean squared errors than the grand classifier. This is one of the main advan-
tages of using a multiple classifier system over a single all encompassing classifier; the rate
Bias L−Z Entropy Memory
1
2
3
4
Input
Neu
ron
Figure 2.5: The Hinton graph of the weight matrix for the information theoretic classifier.
of convergence during training is often faster with smaller classifiers, i.e., those with fewer
input features, and in many cases lower training error can be achieved.
2.4.4 Local robustness
To gauge one aspect of the local robustness of the proposed neural network combination,
we measured the sensitivity of the system to a local perturbation of the input. Recall that the
reputation algorithm yields a class label rather than a continuous number. Thus, to facilitate
sensitivity analysis, we calculated the reputation-weighted average of the outputs of the
small classifiers for a specific input. For semantic convenience, we will just call this the
reputation-weighted output. The unperturbed input sample was the mean vector of all the
features in the test set. Perturbed inputs were created by adding varying degrees of positive
and negative offsets to every feature of the mean vector. The sensitivity of the system to a
given perturbation was defined as the difference between the reputation-weighted output
Figure 2.6: The training error of the different neural network classifiers versus the number of training epochs
for the unperturbed input and that for the perturbed input. At each iteration, the amount
of perturbation was proportional to the range of the features in the test set. For instance, in
the first iteration, 5% of the range of each feature in the test set was added to the respective
feature. Figure 2.7 shows the relative sensitivity of the system versus the magnitude of input
perturbation. Between ±10%, the relative sensitivity is less than 10% of the output value,
suggesting that the reputation-based classifier, while not robust in the strict statistical
sense, can tolerate a modest level of perturbation at the inputs.
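The perturbation procedure can be sketched generically as follows (the names are ours, and the `predict_fns` are assumed to return a continuous output, as the reputation-weighted output requires):

```python
import numpy as np

def sensitivity_curve(predict_fns, reputations, x0, feat_range, steps):
    """For each perturbation step p (a fraction of the feature range), add
    p * feat_range to every feature of the unperturbed input x0 and report
    the change in the reputation-weighted output of the small classifiers."""
    reps = np.asarray(reputations, dtype=float)
    def rw_output(x):
        outs = np.array([f(x) for f in predict_fns])
        return (reps * outs).sum() / reps.sum()  # reputation-weighted output
    base = rw_output(x0)
    return [(p, rw_output(x0 + p * feat_range) - base) for p in steps]
```

With a single linear stand-in classifier, a +5% perturbation of every feature shifts the reputation-weighted output by exactly 0.05, which is the kind of difference plotted in Figure 2.7.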
To examine the effect of a local perturbation on the final decision of our algorithm, we
again added/subtracted noise to the mean input vector and computed the output label us-
ing the reputation-based algorithm. For the present problem, the output was binary and
without loss of generality, denoted arbitrarily as ’1’ or ’2’. The unperturbed input belonged
to class 1. As portrayed in Figure 2.8, the decision of the proposed algorithm is robust to
Figure 2.7: Sensitivity to varying magnitudes of perturbation of the input vector
negative perturbations up to 20% of the range of the features and positive perturbations up
to 10% of the range of the features. However, for a positive perturbation higher than 10%
of the range of the features, the reputation algorithm misclassifies the input. For practical
purposes, this means that the reputation-based neural network combination can tolerate
a simultaneous 10% perturbation in all its input features before making a wrong decision
in the binary classification case. In the specific domain of dual-axis accelerometry, head
movement induces high magnitude perturbations [107] which, according to our analysis
here, will likely cause classification errors.
We also investigated the effect of local perturbations on the accuracy of the proposed
algorithm. We perturbed all 20 samples in the test set. The amount of perturbation ranged
from 0 to 100% of the range of the features in the test set. For each perturbation value, we
calculated the accuracy of the proposed algorithm using the perturbed test set. Figure 2.9
illustrates the effect of varying levels of perturbation on the accuracy of the proposed al-
gorithm. Based on this figure, the accuracy of the proposed algorithm decreased with in-
Figure 2.8: The output decision versus magnitude of input perturbation
creasing magnitude of perturbation in the test set. The initial accuracy of the proposed
algorithm, for the unperturbed test set, was 78% and decreased to 50% for full-range (100%
of the range of the features) perturbations. It is interesting to note that the decay in accu-
racy is quite steep for the first 20%, indicating that accuracy will take a hit with any non-zero
amount of perturbation. Intuitively, this finding makes sense as the resemblance between
test and training data diminishes as the magnitude of perturbation increases.
2.4.5 Global robustness
The sensitivity curve only offers local information about the robustness of the classifier. To
measure the robustness of the system globally, we estimated the breakdown point for the
proposed algorithm. To this end, we substituted some of the feature vectors among the
200 initial samples with contaminated versions. Contaminated feature vectors were created
by sampling from a normal distribution with mean equal to that of the feature vector but
Figure 2.9: The accuracy of the proposed classifier with increasing magnitude of perturbation in the test set
with 3 times the standard deviation. The number of contaminants ranged from 20 to 100,
i.e., 10 to 50% of the original data set. Using 10-fold cross validation, we divided the sam-
ples into 3 sets: training, testing, and validation. Therefore, it was possible that contami-
nations appeared in any or all of the training, testing, and validation sets. Figure 2.10 plots
the accuracy of the proposed algorithm for different numbers of contaminated samples.
Error-bars in this figure depict the standard deviation of each accuracy obtained from the
cross-validation. To estimate the breakdown point of this system, we used the rank-sum
test to detect a significant difference between accuracies with and without varying levels
of contamination. At the 5% significance level, we identified the first significant departure
from the uncontaminated distribution of accuracy at 80 contaminated samples (p = 0.043).
Given that there were 200 samples, the breakdown point was thus identified as 80/200 = 0.4.
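A sketch of the contamination step and the rank-sum comparison (assuming SciPy; taking the mean and standard deviation per feature vector is our reading of the text, and the function names are ours):

```python
import numpy as np
from scipy.stats import ranksums

def contaminate(X, n_contam, seed=0):
    """Replace n_contam randomly chosen feature vectors of X with draws from
    a normal distribution whose mean matches the vector's but whose standard
    deviation is 3 times larger."""
    rng = np.random.default_rng(seed)
    Xc = X.copy()
    for i in rng.choice(len(X), size=n_contam, replace=False):
        Xc[i] = rng.normal(X[i].mean(), 3.0 * X[i].std(), size=X.shape[1])
    return Xc

def significantly_degraded(acc_clean, acc_contam, alpha=0.05):
    """Wilcoxon rank-sum test on the per-fold accuracies, at the 5% level."""
    return ranksums(acc_clean, acc_contam).pvalue < alpha
```

The breakdown point is then the smallest contamination fraction at which `significantly_degraded` first returns True across the cross-validation accuracies.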
Figure 2.10: The accuracy of the proposed classifier as the number of contaminated samples increases. The p-values arise from a comparison of the accuracy between the uncontaminated sample and samples with varying levels of contamination.
2.5 Conclusion
We have presented the formulation of a reputation-based neural network combination.
The method was demonstrated using a dysphagia dataset. We noted that generally the
reputation-based classifier either achieved higher accuracies than single classifiers or ex-
hibited comparable accuracies to the best single classifiers. Interpreting the weight ma-
trices of the neural networks, we observed that many different aspects of time, frequency
and information-theoretic characteristics of swallows were encoded. Finally, we empiri-
cally characterized the local and global robustness of the reputation-based classifier, show-
ing that there exists a certain tolerance (approximately 10% of the range of a feature value)
to input perturbations. However, large magnitude perturbations, such as those observed in
head movement, would likely lead to erroneous classification of the swallowing accelerom-
etry input feature vector.
Chapter 3
Automatic discrimination between safe
and unsafe swallowing using a
reputation-based classifier
The contents of this chapter are reproduced from a journal article accepted for publica-
tion: Nikjoo, M., Steele, C., Lee, J., Sejdic, E., Chau, T., "Discriminating between healthy
and abnormal swallowing by combining classifiers on the basis of reputation", Biomedical
Engineering Online, 2011.
3.1 Abstract
Swallowing accelerometry has been suggested as a potential non-invasive tool for day-to-
day assessment of swallowing function in neurogenic dysphagia. Various vibratory signal
features and complementary measurement modalities have been put forth in the literature
for the potential discrimination between safe and unsafe swallowing. To date, automatic
classification of swallowing accelerometry has exclusively involved a single-axis of vibra-
tion although a second axis is known to contain additional information about the nature of
the swallow. Furthermore, the only published attempt at automatic classification in adult
patients has been based on a small sample of swallowing vibrations. In this Chapter, a large
corpus of dual-axis accelerometric signals were collected from older adults referred to vide-
ofluoroscopic examination on the suspicion of dysphagia. We invoked a reputation-based
classifier combination to automatically categorize the dual-axis accelerometric signals into
safe and unsafe swallows, as labeled via videofluoroscopic review. With selected time, fre-
quency and information theoretic features, the reputation-based algorithm distinguished
between safe and unsafe swallowing with promising accuracy (80.48 ± 5.0%) and provided
interesting insight into the accelerometric differences between the two classes of swallows.
Given its computational efficiency, reputation-based classification of dual-axis accelerom-
etry may be a viable option for point-of-care swallow assessment where turnkey clinical
informatics are desired.
3.2 Introduction
Dysphagia refers to any swallowing disorder [81] and may arise secondary to stroke, mul-
tiple sclerosis, and eosinophilic esophagitis, among many other conditions [85]. If un-
managed, dysphagia may lead to aspiration pneumonia in which food and liquid enter
the airway and into lungs [32]. The video-fluoroscopic swallowing study (VFSS) is the gold
standard method for dysphagia detection [119]. This method entails a lateral X-ray video
recorded during ingestion of a barium-coated bolus. The health of a swallow is then judged
by clinical experts according to criteria such as the depth of airway invasion and the degree
of bolus clearance after the swallow. However, this technique requires expensive and spe-
cialized equipment, ionizing radiation and significant human resources, thereby preclud-
ing its use in the daily monitoring of dysphagia [74]. Swallowing accelerometry has been
proposed as a potential adjunct to VFSS. In this method, the patient wears a dual-axis ac-
celerometer infero-anterior to the thyroid notch. Swallowing events are automatically ex-
tracted from the recorded acceleration signals and pattern classification methods are then
deployed to discriminate between healthy and unhealthy swallows. It is important to dis-
tinguish between swallowing vibrations and swallowing sounds, based on current evidence
in the literature. Swallowing sounds have been largely attributed to pharyngeal reverbera-
tions arising from opening and closing of valves (oropharyngeal, laryngeal and esophageal
valves), action of various pumps (pharyngeal, esophageal, and respiratory pumps) and vi-
brations of the vocal tract [25]. In contrast, in swallowing accelerometry, vocalizations are
explicitly removed by preprocessing [104] and studies have implicated hyolaryngeal mo-
tion as the primary source of the acceleration signal [98, 139]. Fundamentally, both the
method of transduction and the primary physiological source of these signals are different.
Our focus here is swallowing vibrations and recent progress in swallowing accelerometry is
reviewed below.
3.2.1 Automatic classification
Das, Reddy & Narayanan [28] deployed a fuzzy logic-committee network to distinguish be-
tween swallows and ’artifacts’ using time and frequency domain features of single-axis ac-
celerometry signals. Although they achieved very high accuracies, their sample of swallows
and ’artifacts’ was very modest. Using a radial basis classifier with statistical and energetic
features, Lee et al. [71] detected aspirations from single-axis cervical acceleration signals
with approximately 80% sensitivity and specificity in a large pediatric cerebral palsy popula-
tion. Both of these studies only examined accelerations in the anterior-posterior anatomical
direction. However, recent research has shown that there is distinct information about swal-
lowing that is encoded in the superior-inferior vibration [73]. Further, hyolaryngeal motion
associated with swallowing is inherently two-dimensional and this motion was implicated
as the likely source of swallow vibrations [139].
In the first dual-axis classification study, Lee et al. [74] discriminated between no airway
invasion and airway invasion past the true vocal folds in 24 adult stroke patients using a
variety of classifiers (linear discriminant, neural network, probabilistic network and nearest
neighbor). A genetic algorithm (GA) selected the most discriminatory feature combinations.
With linear classifiers, an adjusted accuracy of 74.7% was achieved in feature spaces of up
to 12 dimensions.
In the aforementioned studies, various genres of features have demonstrated discrim-
inatory potential. These include statistical features such as dispersion ratio and normal-
ity [71], time-frequency features such as wavelet energies [73], information theoretic
features such as entropy rate [72], temporal features such as signal memory [45], and
spectral features such as the spectral centroid [106]. Further, there is evidence to suggest
that complementary measurement modalities, such as nasal air flow and submental
mechanomyography [75], may enhance segmentation and classification. Given the presence of multiple
feature genres and different measurement modalities, the swallow detection and classifica-
tion problem lends itself to a multi-classifier approach. For example, it may be sensible to
dedicate one classifier to each feature genre [56]. Moreover, data sets from different patient
groups might warrant different classifiers [56, 135]. Lastly, the demand for classification
speed may necessitate the use of multiple classifiers [56].
In this Chapter, we invoke a novel, computationally efficient reputation-based classi-
fier combination to automatically categorize dual-axis accelerometric signals from adult
patients into safe and unsafe swallows, as labeled via videofluoroscopic review. We con-
sider multiple feature genres from both anterior-posterior and superior-inferior axes and
examine a much larger data set than that of previous swallow accelerometry classification
studies.
Figure 3.1: Data collection set-up
3.3 Methods
3.3.1 Data collection
In this Chapter, we re-examine data from a subset of participants originally reported in [114].
Briefly, we recruited 30 patients (aged 65.47± 13.4 years, 15 male) with suspicion of neuro-
genic dysphagia who were referred to routine videofluoroscopic examination at one of two
local hospitals. Patients had dysphagia secondary to stroke, acquired brain injury, neurode-
generative disease, and spinal cord injury. Research ethics approval was obtained from both
participating hospitals.
The data collection set-up is shown in Figure 3.1. Sagittal plane videofluoroscopic im-
ages of the cervical region were recorded to computer at a nominal 30 frames per sec-
ond via an analog image acquisition card (PCI-1405, National Instruments). Each frame
was marked with a timestamp via a software frame counter. A dual-axis accelerometer
(ADXL322, Analog Devices) was taped to the participant's neck at the level of the cricoid
cartilage. The axes of the accelerometer were aligned to the anatomical anterior-posterior
(AP) and superior-inferior (SI) axes. Signals from both the AP and SI axes were passed
through separate pre-amplifiers each with an internal bandpass filter (Model P55, Grass
Technologies). The cutoff frequencies of the bandpass filter were set at 0.1 Hz and 3 kHz.
The amplifier gain was 10. The signals were then sampled at 10 kHz using a data acquisi-
tion card (USB NI-6210, National Instruments) and stored on a computer for subsequent
analyses. A trigger was sent from a custom LabView virtual instrument to the image acqui-
sition card to synchronize videofluoroscopic and accelerometric recordings. The above in-
strumentation settings replicate those of previous dual-axis swallowing accelerometry stud-
ies [139, 45, 92, 72, 105, 104, 106].
Each participant swallowed a minimum of two or a maximum of three 5 mL teaspoons of
thin liquid barium (40 %w/v suspension) while his/her head was in a neutral position. The
number of sips that the participant performed was determined by the attending clinician.
The recording of dual-axis accelerometry terminated after the participant finished his/her
swallows. However, the participant’s speech-language pathologist continued the videofluo-
roscopy protocol as per usual. In total, we obtained 224 individual swallowing samples from
the 30 participants, 164 of which were labeled as unsafe swallows (as defined below) and 60
as safe swallows.
3.3.2 Data segmentation
To segment the data for analysis, a speech-language pathologist reviewed the videofluo-
roscopy recordings. The beginning of a swallow was defined as the frame when the liquid
bolus passed the point where the shadow of the mandible intersects the tongue base. The
end of the swallow was identified as the frame when the hyoid bone returned to its rest po-
sition following bolus movement through the upper esophageal sphincter. The beginning
and end frames as defined above were marked within the video recording using a custom
C++ program. The cropped video file was then exported together with the associated
segments of dual-axes acceleration data. An unsafe swallow was defined as any swallow
without airway clearance. Typically, this would include penetration and aspiration. Residue
would be considered a situation of swallowing inefficiency that is not unsafe swallowing un-
less the residue was subsequently aspirated. Backflow is extremely rare in the oropharynx,
and would only be classified as unsafe should it lead to penetration-aspiration. This defi-
nition of unsafe swallowing is in keeping with the industry standard Penetration-Aspiration
Scale [99].
3.3.3 Pre-Processing
It has been shown in [73] that the majority of power in swallowing vibrations of healthy
adults lies below 100 Hz. Given that we were dealing with patient data, we estimated the
bandwidth of each of the 224 swallows using the bandwidth estimator defined in [108], ob-
taining average bandwidths of 175± 73 Hz and 226± 84 Hz for the AP and SI axes, respec-
tively. Moreover, spectral centroids were < 70 Hz in both axes, suggesting that there is no
appreciable signal energy beyond a few hundred Hz. Therefore, we downsampled all sig-
nals to 1 kHz. Vocalization was removed from each segmented swallow according to the
periodicity detector proposed in [104]. Whitening of the accelerometry signals to account
for instrumentation nonlinearities was achieved using inverse filtering [106]. Finally, the
signals were denoised using a Daubechies-8 (db8) wavelet transform with soft thresholding [105].
Both the decomposition level and the wavelet coefficient threshold were chosen empirically
to minimize noise while maximizing the information that remained in the signal. Figures
3.2 and 3.3 exemplify pre-processed safe and unsafe swallowing signals, respectively.
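As an illustrative aside, the soft-thresholding rule used in the wavelet denoising step can be sketched in a few lines of numpy (the db8 decomposition itself is omitted, and the threshold value below is arbitrary rather than the empirically tuned one used in the study):

```python
import numpy as np

def soft_threshold(coeffs, t):
    """Soft thresholding: shrink each coefficient toward zero by t,
    zeroing any coefficient whose magnitude is at most t."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - t, 0.0)

# small coefficients (|c| <= 0.1) are zeroed; larger ones are shrunk by 0.1
c = np.array([-0.5, -0.05, 0.0, 0.05, 0.5])
denoised = soft_threshold(c, 0.1)
```

In practice this rule would be applied to the detail coefficients of the wavelet decomposition before reconstructing the denoised signal.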
3.3.4 Feature Extraction
Let S be a pre-processed acceleration time series, $S = s_1, s_2, \ldots, s_n$. As in previous accelerometry
studies, signal features from multiple domains were considered [75, 72]. The different
genres of features are summarized below.
1. Time Domain Features
• The sample mean is an unbiased estimation of the location of a signal’s amplitude distribution and is given by
$$\mu_s = \frac{1}{n}\sum_{i=1}^{n} s_i. \qquad (3.1)$$

Figure 3.2: Example of safe swallowing signals from AP and SI axes (acceleration in g versus time in seconds)

• The variance of a distribution measures its spread around the mean and reflects the signal’s power. The unbiased estimation of variance can be obtained as
$$\sigma_s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(s_i - \mu_s\right)^2. \qquad (3.2)$$
• The median is a robust location estimate of the amplitude distribution. For the sorted set S, the median can be calculated as
$$MED(s) = \begin{cases} s_{v+1}, & \text{if } n = 2v+1, \\ \dfrac{s_v + s_{v+1}}{2}, & \text{if } n = 2v. \end{cases} \qquad (3.3)$$

Figure 3.3: Example of unsafe swallowing signals from AP and SI axes (acceleration in g versus time in seconds)
• Skewness is a measure of the symmetry of a distribution. This feature can be computed as follows:
$$\gamma_{1,s} = \frac{\frac{1}{n}\sum_{i=1}^{n}(s_i-\mu_s)^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(s_i-\mu_s)^2\right)^{1.5}}. \qquad (3.4)$$

• Kurtosis reflects the peakedness of a distribution and can be found as
$$\gamma_{2,s} = \frac{\frac{1}{n}\sum_{i=1}^{n}(s_i-\mu_s)^4}{\left(\frac{1}{n}\sum_{i=1}^{n}(s_i-\mu_s)^2\right)^{2}}. \qquad (3.5)$$
2. Frequency Domain Features

• The peak magnitude value of the Fast Fourier Transform (FFT) of the signal S was also used as a feature. All the FFT coefficients were normalized by the length of the signal, n.

• The centroid frequency of the signal S [106] was estimated as
$$\hat{f} = \frac{\int_0^{f_{max}} f\,|F_s(f)|^2\, df}{\int_0^{f_{max}} |F_s(f)|^2\, df}, \qquad (3.6)$$
where $F_s(f)$ is the Fourier transform of the signal S and $f_{max}$ is the Nyquist frequency (effectively 500 Hz after downsampling).

• The bandwidth of the spectrum was computed using the following formula:
$$BW = \sqrt{\frac{\int_0^{f_{max}} (f-\hat{f})^2\,|F_s(f)|^2\, df}{\int_0^{f_{max}} |F_s(f)|^2\, df}}. \qquad (3.7)$$
3. Information Theory-Based Features

• The entropy rate [94] of a signal quantifies the extent of regularity in that signal. The measure is useful for signals with some relationship among consecutive signal points. We first normalized the signal S to zero mean and unit variance. Then, we quantized the normalized signal into 10 equally spaced levels, represented by the integers 0 to 9, ranging from the minimum to the maximum value. Next, each sequence of U consecutive points in the quantized signal, $\tilde{S} = \tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_n$, was coded using the following equation:
$$a_i = \tilde{s}_{i+U-1}\cdot 10^{U-1} + \ldots + \tilde{s}_i\cdot 10^{0}, \qquad (3.8)$$
with $i = 1, 2, \ldots, n-U+1$. The coded integers comprised the coding set $A_U = \{a_1, \ldots, a_{n-U+1}\}$. Using the Shannon entropy formula, we estimated the entropy
$$E(U) = -\sum_{t=0}^{10^U - 1} p_{A_U}(t)\,\ln p_{A_U}(t), \qquad (3.9)$$
where $p_{A_U}(t)$ represents the probability of observing the value t in $A_U$, approximated by the corresponding sample frequency. Then, the entropy rate was normalized using the following equation:
$$NE(U) = \frac{E(U) - E(U-1) + E(1)\cdot\beta}{E(1)}, \qquad (3.10)$$
where β was the percentage of the coded integers in $A_U$ that occurred only once. Finally, the regularity index $\rho \in [0,1]$ was obtained as
$$\rho = 1 - \min_U NE(U), \qquad (3.11)$$
where a value of ρ close to 0 signifies maximum randomness while ρ close to 1 indicates maximum regularity.

• To calculate the memory of the signal [72], its autocorrelation function was computed from zero to the maximum time lag (equal to the length of the signal) and normalized such that the autocorrelation at zero lag was unity. The memory was estimated as the time required for the autocorrelation to decay to 1/e of its zero-lag value.

• Lempel-Ziv (L-Z) complexity [77] measures the predictability of a signal. To compute the L-Z complexity of signal S, the minimum and maximum values of the signal points were first calculated, and the signal was quantized into 100 equally spaced levels between these extremes. Then, the quantized signal, $B_1^n = b_1, b_2, \ldots, b_n$, was decomposed into T different blocks, $B_1^n = \psi_1\psi_2\cdots\psi_T$. A block ψ was defined as
$$\psi = B_j^{\ell} = b_j, b_{j+1}, \ldots, b_{\ell}, \quad 1 \le j \le \ell \le n. \qquad (3.12)$$
The blocks were generated as follows:
$$\psi_1 = b_1; \qquad \psi_{m+1} = B_{h_m+1}^{h_{m+1}}, \quad m \ge 1, \qquad (3.13)$$
where $h_m$ is the ending index of $\psi_m$, such that $\psi_{m+1}$ is a unique sequence of minimal length within the sequence $B_1^{h_{m+1}-1}$. Finally, the normalized L-Z complexity was calculated as
$$LZ = \frac{T\,\log_{100} n}{n}. \qquad (3.14)$$
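A minimal numpy sketch of representative features from each genre above may clarify the computations (function names are ours; the Lempel-Ziv parser follows Eqs. (3.12)-(3.14) with 100 quantization levels):

```python
import numpy as np

def time_features(s):
    """Time-domain features: mean (3.1), unbiased variance (3.2),
    median (3.3), skewness (3.4) and kurtosis (3.5)."""
    mu = s.mean()
    var = s.var(ddof=1)                         # unbiased estimate
    med = np.median(s)
    m2 = ((s - mu) ** 2).mean()
    skew = ((s - mu) ** 3).mean() / m2 ** 1.5
    kurt = ((s - mu) ** 4).mean() / m2 ** 2
    return mu, var, med, skew, kurt

def frequency_features(s, fs):
    """Peak FFT magnitude, spectral centroid (3.6) and bandwidth (3.7)."""
    F = np.fft.rfft(s) / len(s)                 # coefficients normalized by n
    f = np.fft.rfftfreq(len(s), d=1.0 / fs)
    P = np.abs(F) ** 2
    peak = np.abs(F).max()
    centroid = (f * P).sum() / P.sum()
    bw = np.sqrt((((f - centroid) ** 2) * P).sum() / P.sum())
    return peak, centroid, bw

def _seen_before(seq, block):
    """True if `block` occurs as a contiguous subsequence of `seq`."""
    k = len(block)
    return any(seq[i:i + k] == block for i in range(len(seq) - k + 1))

def lz_complexity(s, levels=100):
    """Normalized Lempel-Ziv complexity (3.14): quantize into `levels`
    bins, then parse into T blocks, each being the shortest sequence
    not previously seen (3.12)-(3.13)."""
    b = np.minimum((levels * (s - s.min()) / (np.ptp(s) + 1e-12)).astype(int),
                   levels - 1).tolist()
    T, j, n = 0, 0, len(b)
    while j < n:
        ell = 1
        while j + ell <= n and _seen_before(b[:j + ell - 1], b[j:j + ell]):
            ell += 1
        T += 1                                   # one new block parsed
        j += ell
    return T * np.log(n) / (np.log(levels) * n)  # T * log_100(n) / n
```

For example, a constant signal parses into very few blocks and therefore has near-zero L-Z complexity, whereas a broadband random signal parses into many.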
3.3.5 Reputation-Based Classification
Reputation typically refers to the quality or integrity of an individual component within a
system of interacting components. The notion of reputation has been widely used to ascer-
tain the health of nodes in wireless networks [26], identify malicious hosts in a distributed
system [121] and detect free-riders in peer-to-peer networks [124], among many other prac-
tical applications. Here, we apply the concept of reputation to judiciously combine deci-
sions of multiple classifiers for the purpose of differentiating between safe and unsafe swal-
lows. The general idea is to differentially weigh classifier decisions on the basis of their past
performance.
The past performance of the $i$th classifier is captured via its reputation, $r_i \in \Re$, $0 \le r_i \le 1$,
where 1 signifies a strong classifier (high accuracy) and 0 denotes a weak classifier. Briefly,
the classifier is formulated as follows. Let $\Theta = \{\theta_1, \theta_2, \ldots, \theta_L\}$ be a set of $L \ge 2$ classifiers and
$\Omega = \{\omega_1, \omega_2, \ldots, \omega_c\}$ be a set of $c \ge 2$ class labels, where $\omega_j \neq \omega_k$, $\forall j \neq k$. Without loss
of generality, $\Omega \subset \mathbb{N}$. The input of each classifier is a feature vector $x \in \mathbb{R}^{n_i}$, where $n_i$ is
the dimension of the feature space for the $i$th classifier $\theta_i$, whose output is a class label
$\omega_j$, $j = 1, \ldots, c$. Let $p(\omega_j)$ be the prior probability of class $\omega_j$.
1. For a classification problem with c ≥ 2 classes, we invoke L ≥ 2 individual classifiers.
2. After training the L classifiers individually, the respective accuracy of each is evaluated
using a validation set and expressed as a real number in [0,1]. This number is the
reputation of the classifier.
3. For each feature vector, x, in the test set, L decisions are obtained using the L distinct classifiers:
$$\Omega(x) = \{\theta_1(x), \theta_2(x), \ldots, \theta_L(x)\}. \qquad (3.15)$$

4. We sort the reputation values of the classifiers in descending order,
$$R^* = \{r_{1^*}, r_{2^*}, \ldots, r_{L^*}\}, \qquad (3.16)$$
such that $r_{1^*} \ge r_{2^*} \ge \ldots \ge r_{L^*}$. Then, using this set, we rank the classifiers to obtain a reputation-ordered set of classifiers,
$$\Theta^* = \{\theta_{1^*}, \theta_{2^*}, \ldots, \theta_{L^*}\}. \qquad (3.17)$$
The first element of this set corresponds to the classifier with the highest reputation.

5. Next, we examine the votes of the first m elements of the reputation-ordered set of classifiers, with
$$m = \begin{cases} \dfrac{L}{2}, & \text{if } L \text{ is even,} \\ \dfrac{L+1}{2}, & \text{if } L \text{ is odd.} \end{cases} \qquad (3.18)$$
If the top m classifiers vote for the same class,ωj , we accept the majority vote and take
ωj as the final decision of the system. However, if the votes of the first m classifiers
are not equal, we consider the classifiers’ individual reputations (Step 2).
6. The probability that the combined classifier decision is $\omega_j$ given the input vector x and the individual local classifier decisions is denoted as the posterior probability,
$$p(\omega_j \,|\, \theta_1(x), \theta_2(x), \ldots, \theta_L(x)), \qquad (3.19)$$
which, when the classifiers are independent, can be estimated using Bayes rule as
$$p(\omega_j \,|\, \theta_1, \ldots, \theta_L) = \frac{\prod_{i=1}^{L} p(\theta_i \,|\, \omega_j)\, p(\omega_j)}{\sum_{t=1}^{c} \prod_{i=1}^{L} p(\theta_i \,|\, \omega_t)\, p(\omega_t)}. \qquad (3.20)$$
For notational convenience, we have dropped the argument of θ above, but each $\theta_i$ is understood to be a function of x. The local likelihood functions, $p(\theta_i \,|\, \omega_j)$, are estimated by the reputation values calculated in Step 2. When the correct class is $\omega_j$ and classifier $\theta_i$ classifies x into the class $\omega_j$, i.e., $\theta_i(x) = \omega_j$, we can write
$$p(\theta_i = \omega_j \,|\, \omega_j) = r_i. \qquad (3.21)$$
In other words, $p(\theta_i = \omega_j \,|\, \omega_j)$ is the probability that the classifier $\theta_i$ correctly classifies x into class $\omega_j$ when x actually belongs to this class. This probability is exactly equal to the reputation of the classifier. On the other hand, when the classifier categorizes x incorrectly, i.e., $\theta_i(x) \neq \omega_j$ given that the correct class is $\omega_j$, then
$$p(\theta_i \neq \omega_j \,|\, \omega_j) = 1 - r_i. \qquad (3.22)$$
When there is no known priority among classes, we can assume equal prior probabilities. Hence,
$$p(\omega_1) = p(\omega_2) = \ldots = p(\omega_c) = \frac{1}{c}. \qquad (3.23)$$
Thus, for each class, $\omega_j$, we can estimate the a posteriori probabilities given by (3.20) using (3.21), (3.22), and (3.23). The class with the highest posterior probability is selected as the final decision of the system and the input x is categorized as belonging to this class.
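The decision procedure of Steps 1-6 can be sketched for abstract-level votes as follows. This is a sketch under the independence assumption of (3.20); `srb_decide` and its argument names are ours:

```python
import numpy as np

def srb_decide(votes, reputations, n_classes):
    """Static reputation-based (SRB) fusion of abstract-level votes.

    votes       : class label chosen by each of the L classifiers for one sample
    reputations : validation-set accuracy r_i of each classifier (Step 2)
    Returns the fused class label.
    """
    votes = np.asarray(votes)
    r = np.asarray(reputations, dtype=float)
    L = len(votes)
    m = L // 2 if L % 2 == 0 else (L + 1) // 2     # Eq. (3.18)

    # Steps 4-5: rank classifiers by reputation; accept a unanimous top-m vote
    order = np.argsort(-r)
    top_votes = votes[order[:m]]
    if np.all(top_votes == top_votes[0]):
        return int(top_votes[0])

    # Step 6: naive-Bayes combination (3.20) with likelihoods (3.21)-(3.22)
    # and equal priors (3.23), which cancel in the argmax
    posteriors = np.empty(n_classes)
    for j in range(n_classes):
        lik = np.where(votes == j, r, 1.0 - r)     # r_i if agreeing, else 1 - r_i
        posteriors[j] = lik.prod()
    return int(posteriors.argmax())
```

For instance, with votes (1, 1, 0, 0, 0) and reputations (0.9, 0.9, 0.5, 0.5, 0.5), the top three classifiers disagree and the Bayes step selects class 1.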
3.3.6 Classifier evaluation
We ranked the signal features introduced above using the Fisher ratio [78] for univariate
separability. In the time domain, mean and variance in the AP axis and skewness in the
SI axis were the top-ranked features. Similarly, in the frequency domain, the peak magni-
tude of the FFT and the spectral centroid in the AP direction and the bandwidth in the SI
direction were retained. Finally, in the information theoretic domain, entropy rate for the
SI signal and memory of the AP signal were the highest ranking features. Subsequently, we
only examined these feature subsets for classification. For comparison between single and
dual-axes classifiers, we also considered classifiers that employed feature subsets (as iden-
tified above) from a single axis.
Given the disproportion of safe and unsafe samples, we invoked a smooth bootstrapping
procedure [35] to balance the classes. All features were then standardized to zero mean and
unit variance. Three separate support vector machine (SVM) classifiers [33] were invoked,
one for each feature genre (time, frequency and information theoretic). Hence, the fea-
ture space dimensionalities for the classifiers were 3 (SVM with time features), 3 (SVM with
frequency features) and 2 (SVM with information-theoretic features). The use of different
feature sets for each classifier ensures that the classifiers will perform independently [135].
Classifier accuracy was estimated via a 10-fold cross validation with a 90-10 split. In
each fold, the whole training set was used to estimate the individual classifier reputations.
Classifiers were then ranked according to their reputation values. Without loss of generality,
assume $r_1 \ge r_2 \ge r_3$. If $\theta_1$ and $\theta_2$ cast the same vote about a test swallow, their common
decision was accepted as the final classification. However, if they voted differently, the a
posteriori probability of each class was computed using (3.20) and the maximum a posteriori
probability rule was applied to select the final classification.
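The class-balancing step can be sketched as a smooth bootstrap: minority-class samples are redrawn with replacement and jittered with Gaussian noise. This is a sketch under our own choice of noise bandwidth `h`, not the exact procedure of [35]:

```python
import numpy as np

def smooth_bootstrap(X, n_needed, h=0.1, rng=None):
    """Smooth bootstrap: resample rows of X with replacement, then jitter
    each resampled point with Gaussian noise scaled by the feature std."""
    rng = np.random.default_rng(rng)
    idx = rng.integers(0, len(X), size=n_needed)
    noise = rng.normal(0.0, h * X.std(axis=0, ddof=1), size=(n_needed, X.shape[1]))
    return X[idx] + noise

# balance a hypothetical 60-safe / 164-unsafe split in an 8-feature space
safe = np.random.default_rng(0).normal(size=(60, 8))
extra = smooth_bootstrap(safe, 164 - 60, rng=1)
balanced = np.vstack([safe, extra])
```

After balancing, each feature would be standardized to zero mean and unit variance before training the three SVMs.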
3.4 Results
The sensitivity, specificity and accuracy of the single-axis and dual-axis accelerometry clas-
sifiers are summarized in Figure 3.4. The dual-axis classifier had significantly higher accuracy
(80.48 ± 5.0%) than either single-axis classifier (p ≪ 0.05, two-sample t-test), specificity
(64 ± 8.8%) comparable to that of the SI classifier (p = 1.0), and sensitivity (97.1 ± 2%) on par
with that of the AP classifier (p = 1.0). In other words, the dual-axis classifier retained the
best sensitivity and specificity achievable with either single-axis classifier.
3.5 Discussion
3.5.1 Dual versus single axis
Of the two axes, the AP axis tended to carry more useful information than the SI direction
for discrimination between safe and unsafe swallowing. This observation is evidenced in
Figure 3.4, where AP accuracy is dramatically higher than SI levels, echoing the findings
of [73] who suggested that the AP axis is richer in information content (i.e., higher entropy)
relating to swallowing. Nonetheless, the SI axis does carry information distinct from that
of the AP orientation, as dual-axis classification exceeds any single-axis counterpart. Our
results thus support the inclusion of selected features from both the AP and SI axes for the
automatic discrimination between safe and unsafe swallowing. Indeed, when comparing
Figure 3.4: Classification performance for the single-axis (AP, SI) and dual-axis (AP + SI) reputation-based classifiers. The height of each bar denotes the average of the respective performance measure while the error bars denote standard deviation around the mean.
AP and SI signals, [73] reported minimal mutual information, and inter-axis dissimilarities
in the scalograms, pseudo-spectra and temporal evolution of low- and high-frequency con-
tent.
In a recent videofluoroscopic study, both AP and SI accelerations were attributed to the
planar motion of the hyoid and larynx during swallowing [139]. In that study, the displace-
ment of the hyoid bone and larynx along with their interaction explained over 70% of the
variance in the doubly integrated acceleration in both AP and SI axes at the level of the
cricoid cartilage. Juxtaposed with our findings above, this reported physiological source of
swallow accelerometry suggests that it is the difference in hyolaryngeal motion that is mani-
fested as discriminatory cues between safe and unsafe swallowing. Indeed, early single-axis
accelerometry research had implicated decreased laryngeal elevation as the reason for sup-
pressed AP accelerations in individuals with severe dysphagia [98].
Figure 3.5: Parallel axes plot depicting the internal representation of safe (solid line) and unsafe (dashed line) swallows. The eight axes correspond to AP mean, AP variance, SI skewness, AP spectral centroid, AP spectral peak, SI bandwidth, SI entropy rate, and AP memory (normalized feature values).
3.5.2 Internal representation
Figure 3.5 is a parallel axes plot depicting the internal representation of safe and unsafe
swallows acquired by the reputation-based classifier. Each feature has been normalized by
its standard deviation to facilitate visualization. On each axis, the median feature value is
shown. The median values of adjacent axes are joined by solid (safe swallow) or dashed (un-
safe swallow) lines. We immediately observe some distinct patterns which characterize each
type of swallow. In the AP axis, unsafe swallows tend to have lower acceleration amplitude,
higher variance, higher spectral centroid and shorter memory. The lower mean vibration
amplitude in unsafe swallowing resonates with previous reports of suppressed peak accel-
eration [98] in dysphagic patients and reduced peak anterior hyoid excursion [60] in older
adults, both suggesting compromised airway protection. The observation of a higher spec-
tral centroid in unsafe swallowing may reflect departures from the typical axial high-low
frequency coupling trends of normal swallowing as detailed in [73]. Likewise, the shorter
memory and hence faster decay of the autocorrelation may be indicative of compromised
overall coordination in unsafe swallowing.
It is also interesting to note that unsafe swallows tend to be negatively skewed while
safe swallows are evenly split between positive and negative skew. In other words, in un-
safe swallowing, the upward motion of the hyolaryngeal structure appears to have weaker
accelerations than during the downward motion. This is opposite of the tendency reported
in [73] for healthy swallowing and may reflect inadequate urgency to protect the airway.
3.5.3 Reputation-based classification
The merit of a reputation-based classifier for the present problem can be appreciated by
contrasting its performance against that of the classic method of combining classifiers, i.e.,
via the majority voting algorithm. To this end, Figure 3.6 summarizes the accuracies of both
approaches from a 10-fold cross-validation using the data of this study. The histograms
summarize the distribution of accuracies obtained from cross-validation. The correspond-
ing density estimate (solid line) was obtained using a semi-parametric maximum likelihood
estimator based on a finite mixture of Gaussian kernels. Clearly, the location of the density
of reputation-based accuracies appears to be further to the right of the location of the ma-
jority voting density. The large spread in both densities amplifies the risk of Type II error and
thus conventional testing (e.g., Wilcoxon rank-sum) fails to identify any differences. However,
upon more careful inspection using a two-sample Kolmogorov-Smirnov test of the
20% one-sided trimmed densities (i.e., omitting the 2 most extreme points in each density),
a statistically significant difference between the distributions (p = 0.0098) is confirmed.
The reputation-based classifier achieved higher adjusted accuracies (> 85%; average of
sensitivity and specificity) than those reported in [74] (no greater than 75%). Patients were
similarly aged and all had neurogenic dysphagia. However, some key differences between
the studies are worthy of mention. The present study had a slightly larger sample size, a
better balance between males and females ([74] almost exclusively had males), and most
Figure 3.6: Comparison of the densities of classification accuracies by majority voting (top) and reputation-based classification (bottom) for safe and unsafe swallow discrimination
importantly, a more significant representation of unsafe swallows (73% of total swallows
compared to only 13% in [74]). Arguably, vibration patterns of pathological swallows vary
more widely than those of safe swallows and hence a more comprehensive representation
of the former may be well-justified.
Generally, the reputation-based classification scheme mitigates the risk of the overall
classifier performance being unduly affected by a poorly performing component classifier
within a multi-classifier system. Additionally, as exemplified in this study, the dimensional-
ity of individual classifiers can be minimized, reducing the demand for voluminous training
data.
3.5.4 Limitations
The dual-axes classifier attained very high sensitivity but modest specificity. In part, this
bias towards higher sensitivity may be attributable to the preponderance of unsafe swallow
examples in the original data set, despite our efforts to balance the classes via bootstrap-
ping. In a practical system, it would mean that the classifier may overzealously flag a safe
swallow as unsafe. This class imbalance issue may be a limitation of studying patients re-
ferred to videofluoroscopy, the majority of whom likely have a greater propensity for prob-
lematic swallowing. Hence, to obtain a larger number of safe swallows, a significantly ex-
panded sample of patients may need to be recruited in the future.
The reputation classifier assumes independent features. This constrains the admissi-
ble features, but [73] has argued that many SI and AP features have low correlations. Fu-
ture work may invoke independent component analysis or principal component analysis to
generate additional novel independent features. The present classifier relies on static rep-
utation values. In clinical application, the classifier may be trained and tested at different
times with different patients. As a consequence, the feature distributions may change over
time. In such cases, dynamic reputation values may be more appropriate, and future work
may consider an online approach to dynamically update classifier reputations.
3.6 Conclusion
This study has demonstrated the potential for automatic discrimination between safe and
unsafe (without airway clearance) swallows on the basis of a selected subset of time, fre-
quency and information theoretic features derived from non-invasive, dual-axis accelero-
metric measurements at the level of the cricoid cartilage. Dual-axis classification was more
accurate than single-axis classification. The reputation-based classifier internally repre-
sented unsafe swallows as those with lower mean acceleration, lower range of acceleration,
higher spectral centroid, slower autocorrelation decay and weaker acceleration in the supe-
rior direction. Our results suggest that reputation-based classification of dual-axis swallow-
ing accelerometry from adult stroke patients deserves further consideration as a turn-key
clinical informatic in the management of swallowing disorders.
Chapter 4
Analyzing the performance of the static
reputation-based algorithm
The contents of this chapter are reproduced from a journal article currently under review:
Nikjoo, M. and Chau, T., "Analyzing the performance of the static reputation-based algo-
rithm", Pattern Recognition Letters.
4.1 Abstract
Weighted classifier ensembles have been widely applied in various domains. In this paper,
we characterize the large and finite ensemble behavior of one such system, a recently pro-
posed static reputation-based (SRB) multiple classifier algorithm. Specifically, we examine
the convergence of the probability of error in the presence of asymptotically large ensem-
bles. In the finite ensemble case, we determine the lower bound of classification accuracy
through an examination of patterns of success and failure. Finally, we provide empirical
simulations of probabilities of classification error under uniform and Gaussian models of
point estimation errors for finite ensembles. The systematic characterization provided in
this paper sheds light on the strengths and limitations of the reputation-based algorithm as
a multi-classifier ensemble.
4.2 Introduction
Multiple classifier systems (MCS) have been explored in many pattern recognition applica-
tions [2, 138, 31, 91], primarily as a technique to improve the performance of a classification
system. Other rationales for the deployment of MCS include the availability of different features
and training sets or the need to improve classification speed [56]. Classifiers can be
combined on three levels: abstract, rank, and measurement levels. Among the three, the
abstract level is the simplest, allowing several classifiers, each designed within a different
context, to be combined. As a consequence, abstract-level MCSs are preferred over other
types of classifier combinations [56].
Recently, a novel weighted classifier combination technique, the static reputation-based
algorithm (SRB) was proposed and demonstrated on the classification of dual-axes swallow-
ing signals [88]. The SRB algorithm differentially weighs classifier decisions on the basis of
their past performance. With the above data set, the SRB algorithm improved classification
accuracy beyond simple majority voting and various individual classifiers, while reducing
computational load [88]. Although similar in spirit to weighted majority voting [135], SRB is
not subject to the degenerate condition of vanishing weights.
Generally, there are few theoretical appraisals of weighted multiple classifier combinations.
Kuncheva [64] analyzed six different classifier combination techniques and developed
formulae for their classification error. Chen and Cheng [20] studied the asymptotic behavior
of three classifier combinations: average, median, and majority voting. They
showed that for large numbers of individual classifiers, median and majority voting produce
the same decision. [17] discuss the asymptotic behavior and convergence rate of three
related fusion strategies (average, median, and maximum) under uniform, Gaussian
and Cauchy noise.
In this paper, we model the SRB algorithm using a percentile greater than the median,
and analyze its asymptotic behavior for uniform and Gaussian noise, derive a formula for its
probability of errors and discuss its behavior relative to that of majority voting. Moreover,
using the idea of "patterns of success and failure" [66], we highlight performance differences
between SRB and majority voting. Finally, adapting the simulation method proposed by [1],
we study the behavior of the SRB algorithm for small and medium numbers of classifiers by
subjecting their outputs to uniform and Gaussian perturbation errors.
The remainder of this paper is organized as follows. In Section 2, we analyze the asymp-
totic behavior of SRB. In Section 3, we characterize the SRB algorithm first in terms of pat-
terns of success and failure and subsequently in terms of probability of error.
4.3 Large n behavior
4.3.1 Problem Statement
Consider a classification problem with two classes, c1 and c2. Let x be an input feature
vector to be classified using n independent classifiers. Let P = P(c1|x ) be the actual pos-
terior probability of the first class for the given input x . Without loss of generality, we as-
sume that the given input belongs to the first class and therefore P > 0.5 (the other case
can be discussed similarly). Assume that each classifier outputs an estimate of P , denoted
as ω1,ω2, ...,ωn . To improve the estimation of P , an MCS may be used. For instance, one
can use the minimum, maximum, average, and majority vote to fuse individual classifier
outputs to improve the estimation of P [1, 64]. Let
ω= f (ω1,ω2, ...,ωn ) (4.1)
represent the fused estimate of P(c1|x ). We assume that the individual estimates, ωi ’s, are
independent identically distributed (i.i.d.) random variables. Each estimate is the posterior
probability plus random noise:
$$\omega_i = P + \varepsilon_i. \qquad (4.2)$$
Therefore, the fused estimate, ω, is also a random variable. We denote by $f_\omega(y)$ and $F_\omega(z)$
the probability density function (pdf) and cumulative distribution function (cdf) of the
fused estimate, respectively. The correct fused decision for the given input occurs when
based algorithm, i.e., when n→∞ for two different types of noise: uniform and Gaussian.
4.3.2 Modeling the Algorithm
Before proposing a model for the SRB algorithm, we review a model of the majority voting
algorithm. First, we rearrange the classifiers’ estimates of P from least to greatest, i.e.,
$\omega_{(1)} \le \omega_{(2)} \le \ldots \le \omega_{(n)}$. The majority voting algorithm can be modeled as the median of the series, $\omega_{(m)}$, as follows:
$$maj(\omega) = \omega_{(m)} = \begin{cases} \omega_{(\frac{n+1}{2})}, & \text{if } n \text{ is odd;} \\ \omega_{(\frac{n}{2})}, & \text{if } n \text{ is even.} \end{cases} \qquad (4.3)$$
In other words, the median estimator defines the final decision of the system. If $\omega_{(m)} > 0.5$,
then the system correctly assigns the given input, x, to the first class.
Now let us consider the SRB algorithm. Unlike the majority voting method, the final
decision of the system is equal to the decision of one of the estimators, $\omega_{(m)}$ to $\omega_{(n)}$, according
to the Bayes rule. Let us clarify with an example. Assume the set $\omega = \{\omega_{(1)} = 0.2, \omega_{(2)} = 0.3, \omega_{(3)} = 0.4, \omega_{(4)} = 0.6, \omega_{(5)} = 0.7\}$ represents the posterior probability estimates for the first
class by five different classifiers. Also, let $r = \{r_{(1)} = 0.5, r_{(2)} = 0.5, r_{(3)} = 0.5, r_{(4)} = 0.9, r_{(5)} = 0.9\}$
be the reputation values of these classifiers. Majority voting would assign the given input to
the second class based on the median of the first set, $\omega_{(m)} = 0.4$. With the SRB algorithm,
the classifiers are first divided into two groups based on their decisions: $\{r_{(1)}, r_{(2)}, r_{(3)}\}$ and
$\{r_{(4)}, r_{(5)}\}$. Then, the Bayes rule for each class gives $P(c_1|\omega) = \frac{0.9^2 \cdot 0.5^3}{0.9^2 \cdot 0.5^3 + 0.1^2 \cdot 0.5^3} = 0.9878$
and $P(c_2|\omega) = \frac{0.1^2 \cdot 0.5^3}{0.9^2 \cdot 0.5^3 + 0.1^2 \cdot 0.5^3} = 0.0122$. Therefore, the final decision of the SRB is the first
class. Hence, unlike the median selection of majority voting, SRB selects a higher quantile,
$\omega_{(4)}$ in this case, as the final decision of the system. In general, the final decision of the SRB
coincides with a percentile greater than or equal to the median. It should be mentioned
here that we have inherently assumed that the reputation values are accurate metrics for
gauging the performance of the classifiers on the test set.
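The posterior arithmetic of this example can be checked directly (the equal priors cancel):

```python
# Two classifiers with reputation 0.9 vote for c1; three with reputation 0.5
# vote for c2. For each class, agreeing classifiers contribute r_i and
# disagreeing classifiers contribute 1 - r_i.
num_c1 = 0.9 ** 2 * 0.5 ** 3
num_c2 = 0.1 ** 2 * 0.5 ** 3
p_c1 = num_c1 / (num_c1 + num_c2)
p_c2 = num_c2 / (num_c1 + num_c2)
print(round(p_c1, 4), round(p_c2, 4))  # 0.9878 0.0122
```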
4.3.3 Asymptotic Analysis of SRB
The following Theorem is useful in analyzing the asymptotic behaviors of the SRB algo-
rithm.
Theorem 1. [37] Let $\Omega_1, \Omega_2, \ldots, \Omega_n$ be i.i.d. with distribution function $F(\omega)$, density $f(\omega)$,
finite mean μ, and finite variance $\sigma^2$. Also, let $0 < \alpha < 1$ and $\omega_\alpha$ be the $\alpha$th quantile of F, so
that $F(\omega_\alpha) = \alpha$. If $F(\omega)$ is continuous and $f(\omega_\alpha) > 0$, then:
$$\sqrt{n}\,(Y_n - \omega_\alpha) \to \mathcal{N}\!\left(0, \frac{\alpha(1-\alpha)}{f^2(\omega_\alpha)}\right), \qquad (4.4)$$
where $Y_n$ is the sample $\alpha$th quantile.

Proof. See [76].
Result 1. An interesting result of Theorem 1 is the asymptotic distribution of the sample
median. The median of the distribution F, $\omega_m$, satisfies $F(\omega_m) = \frac{1}{2}$. Therefore, according to
Theorem 1, when $n \to \infty$, the distribution of the sample median is $\mathcal{N}\!\left(\omega_m, \frac{1}{4n f^2(\omega_m)}\right)$.

Based on the above result, the function $maj(\omega)$ is asymptotically distributed according
to $\mathcal{N}\!\left(\omega_{(m)}, \frac{1}{4n f^2(\omega_{(m)})}\right)$ for large n. Moreover, SRB is asymptotically distributed according to
$\mathcal{N}\!\left(\omega_{(\alpha)}, \frac{\alpha(1-\alpha)}{n f^2(\omega_{(\alpha)})}\right)$.
Recall that we assume that the given input x belongs to the first class. Therefore, the
probability of error given x is
$$P_e = P(\omega \le 0.5) = F_\omega(0.5) = \int_0^{0.5} f_\omega(y)\, dy. \qquad (4.5)$$
For $n \to \infty$, the probability of error for the majority voting algorithm is
$$\Phi\!\left(\frac{0.5 - \omega_{(m)}}{\sigma(n)}\right) = \Phi\!\left(2\sqrt{n}\,(0.5 - \omega_{(m)}) \cdot f(\omega_{(m)})\right), \qquad (4.6)$$
where Φ is the cdf of $\mathcal{N}(0,1)$. Similarly, by Theorem 1, the SRB algorithm has the following
probability of error for large n:
$$\Phi\!\left(\frac{0.5 - \omega_{(\alpha)}}{\sigma(n)}\right) = \Phi\!\left(\frac{\sqrt{n}\,(0.5 - \omega_{(\alpha)}) \cdot f(\omega_{(\alpha)})}{\sqrt{\alpha(1-\alpha)}}\right), \qquad (4.7)$$
with $0.5 \le \alpha \le 1$.
Result 2. For the correct fused decision and under the assumption of uniform noise, the
probability of error for the SRB algorithm converges to zero faster than or as quickly as that
of the majority voting method.

Proof. Since $0.5 \le \alpha \le 1$, $\omega_{(m)} \le \omega_{(\alpha)}$ and $2 \le \frac{1}{\sqrt{\alpha(1-\alpha)}}$. For the true fused decision, $0.5 < \omega_{(m)} \le \omega_{(\alpha)}$. Moreover, for the uniform distribution, $f(\omega_{(m)}) = f(\omega_{(\alpha)}) > 0$, $\forall \alpha$. Therefore, the
arguments of Φ for both the majority voting and SRB algorithms are negative. Hence, when $n \to \infty$, the probability of error for both algorithms is $\Phi(-\infty) = 0$. Furthermore,
$$\frac{\sqrt{n}\,(0.5 - \omega_{(\alpha)}) \cdot f(\omega_{(\alpha)})}{\sqrt{\alpha(1-\alpha)}} \le 2\sqrt{n}\,(0.5 - \omega_{(m)}) \cdot f(\omega_{(m)}) < 0.$$
In other words, the probability of error for SRB converges to zero as quickly as, if not faster
than, the majority voting method.
Result 3. For the correct fused decision and under the assumption of Gaussian noise,
the probability of error for the SRB algorithm converges to zero faster than or as quickly as
the majority voting method.
Proof. For Gaussian noise, again $2 \le \frac{1}{\sqrt{\alpha(1-\alpha)}}$ and $\omega_{(m)} \le \omega_{(\alpha)}$. However, in this case,
$0 < f(\omega_{(\alpha)}) \le f(\omega_{(m)})$. Based on Equation (4.2), $\omega_i \sim \mathcal{N}(P, \sigma^2)$, where $\sigma^2$ is the power of
the Gaussian noise. Without loss of generality, we assume that σ = 1. To ascertain the rate of
convergence of the SRB algorithm, we compare the arguments of Φ in Equations (4.6) and
(4.7). In the Gaussian case,
$$\frac{(0.5 - \omega_{(\alpha)})}{\sqrt{\alpha(1-\alpha)}}\; e^{-\frac{(\omega_{(\alpha)} - P)^2}{2}} < 2\,(0.5 - P) < 0, \qquad (4.8)$$
proving Result 3. Note that $\omega_m = P$ and $f(\omega_m) = \frac{1}{\sqrt{2\pi}}$. Numerical simulations confirm that
the inequality (4.8) holds for $0.5 < \alpha < 1$.

Table 4.1: Example of 4 distributions of 10 test samples among different combinations of classifier outputs

            Classifier outputs                      Probability of correct classification
Pattern   111  101  011  001  110  100  010  000        Pmaj      PSRB
   1       4    0    0    1    1    2    1    1         0.5       0.7
   2       4    0    0    1    0    3    2    0         0.4       0.7
   3       2    0    3    0    1    4    0    0         0.6       1
   4       2    2    1    0    3    0    0    1         0.9       0.9
4.4 Medium and small n behavior
In the previous section, we characterized the behavior of the SRB algorithm in the large
classifier ensemble case, n → ∞. Here, we describe the small and medium n behavior of
the SRB algorithm. First, we contrast the "pattern of failure" and "pattern of success" of the
SRB algorithm against those of the majority voting algorithm for a specific example. Then,
we estimate the SRB’s probability of error via simulation. In the simulation part, instead of
using the approximate versions of error probability in Equations (4.6) and (4.7), we calculate
the exact probabilities of error using the binomial formula.
4.4.1 Patterns of Success and Failure
To quantify small and medium n behavior, we begin with an example based on the "pattern of success" and "pattern of failure" introduced in [66]. Assume an MCS with
three classifiers θ1,θ2,θ3. Let the probability of correct classification (reputation) for these
classifiers be 0.7, 0.6, and 0.5, respectively. Also assume there are 10 samples in the test set
of which the first classifier correctly labels exactly seven, the second classifier six, and the
third classifier five. Table 4.1 exemplifies some possible ways of distributing the 10 samples
into the eight possible combinations of outputs of the three classifiers. The column labels
are 3-bit strings, where each bit denotes a correct (1) or incorrect (0) classification for each
classifier. For example, 101 means that only the first and third classifier yield a correct classi-
fication. Each entry within a given column is the number of samples for which the specified
classifier output combination occurs. Each row in Table 4.1 is a different distribution of the
10 samples among the 8 unique combinations of classifier outputs. Following the terminology of [66], the distributions which represent best- or worst-case classification accuracy for a method are "patterns" of success or failure, respectively.
The accuracy of majority voting is given by 1/10 × the summation of the entries in columns 111, 101, 011, and 110, i.e., all columns with two or three correct votes. However, for the SRB algorithm, the output combination 100 should also be considered a correct decision of the system. By applying the SRB algorithm to the above classifiers, we notice that the vote of the first classifier, θ_1, is alone sufficient for a correct decision of the system.
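This accounting can be sketched in a few lines of Python (an illustration, not the thesis code; here the SRB decision is reduced to following θ_1, the top-reputation classifier, which reproduces the P_SRB entries of rows 1 and 2 of Table 4.1):

```python
# Illustrative sketch: accuracies for one row of Table 4.1. Column labels are
# 3-bit strings (1 = that classifier is correct). Majority voting is correct
# when at least two bits are 1; the simplified SRB rule used here just follows
# the top-reputation classifier (the first bit).
def accuracies(distribution):
    total = sum(distribution.values())
    p_maj = sum(n for bits, n in distribution.items() if bits.count("1") >= 2)
    p_srb = sum(n for bits, n in distribution.items() if bits[0] == "1")
    return p_maj / total, p_srb / total

row1 = {"111": 4, "101": 0, "011": 0, "001": 1,
        "110": 1, "100": 2, "010": 1, "000": 1}
accuracies(row1)  # -> (0.5, 0.7), matching row 1 of Table 4.1
```

Note that row 3 of the table shows the full SRB decision rule (Bayesian fusion of reputations) can exceed this simplified θ_1-only reading, so the sketch matches rows 1 and 2 only.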
Table 4.1 only depicts a subset of possible distributions. However, we have intentionally
included a pattern of success and a pattern of failure for both algorithms. The second row of Table 4.1 is a pattern of failure for both majority voting and SRB. For this distribution, the accuracy
of majority voting is 0.4 which is lower than the accuracy of the worst individual classifier.
However, for this pattern of failure, SRB is on par with the best individual classifier (0.7).
The third row of Table 4.1 depicts the pattern of success for the SRB algorithm. In the best
situation, the SRB can classify inputs with 100% accuracy. However, for this pattern, the
accuracy of majority voting is only 0.6 which is even lower than the performance of the best
individual classifier. The last row of the table is the pattern of success for majority voting. For this distribution, SRB performs comparably. Two major conclusions can be derived
from Table 4.1:
• The SRB algorithm always performs at least as well as majority voting. Even for the pattern of success of majority voting, the SRB algorithm performs comparably.

• The SRB algorithm always performs at least as well as the best individual classifier. In contrast, in some situations, the accuracy of majority voting can dip below that of the weakest individual classifier in the system.
4.4.2 Probability of error
In this section, we compare majority voting and SRB in terms of their probabilities of error.
Our simulation method is an extension of the method proposed in [1]. In [1], the accuracies of all the individual classifiers were considered equal. However, in our simulation method, we account for the fact that in reality some classifiers are stronger performers than others, and we generate classifiers with different levels of accuracy, drawn from the interval (1/2, 1). For
each point estimate P of the posterior probability of a given class, it is assumed that the ex-
perts’ estimates are the posterior probability plus zero-mean noise. Two different types of
noise were considered in this experiment: uniform and Gaussian noise. For uniform noise
defined on the interval [−b ,b ], we varied the value of b to modify the support of the distri-
bution. For Gaussian noise, the standard deviation σ was varied linearly to generate noise
with different powers. For both uniform and Gaussian cases, the simulation was replicated
at 10 different levels of noise. In [1], when the addition of noise values to P resulted in an es-
timate with a value higher than one or lower than zero, the value was clipped to one or zero,
respectively. However, as shown in [64], clipping the distribution may affect the calculation of the empirical error rate. Therefore, in our method, we interpret the individual estimates, ω_i's, as the amount of evidence for class i, rather than as a strict probability. Hence, we do not force ω_i to lie in the interval [0, 1]. We apply the threshold ω > 0.5 to the fused decision and calculate the probability of error, P_e, as P(ω ≤ 0.5). It has been shown in [64] that this strategy is more accurate than the clipping method.
4.4.3 Results for Uniform Noise
We assume that the classifiers estimate a fixed a posteriori value equal to P = 0.7. However, the classifiers have three different levels of accuracy: P1 = 0.7, P2 = 0.6, and P3 = 0.5. The additive noise is zero-mean and uniformly distributed. We change the
support of the distribution by changing the value of b from 0.1 to 1 linearly. At each value
for b , the classification error rate is calculated based on the estimates of the individual clas-
sifiers. Since the classifiers are independent, the probability of an incorrect decision can be
calculated using the binomial formula. First, we assume a system with n = 3 classifiers with accuracies equal to 0.5, 0.6, and 0.7. Then, at each step, two classifiers with accuracies of 0.5 and 0.7 are added to the system. For example, for n = 7 the accuracies of the individual classifiers are 0.5, 0.5, 0.5, 0.6, 0.7, 0.7, 0.7. Therefore, the average accuracy is always fixed and equal to 0.6. For the SRB algorithm, the reputation value of each expert is

r_i = 1 − P_{e_i},  if P_{e_i} ≤ 0.5;
r_i = 0.5,          if P_{e_i} > 0.5,   (4.9)

where P_{e_i} is the probability of error for that expert.
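For classifiers with unequal accuracies, the exact majority-voting error can be obtained by enumerating vote outcomes (the generalization of the binomial formula to unequal accuracies; a sketch under our naming):

```python
from itertools import product

def majority_error(accs):
    """Exact probability that no majority of independent classifiers with
    accuracies `accs` votes correctly (enumeration over all vote outcomes)."""
    n = len(accs)  # assumed odd, as in the experiments above
    err = 0.0
    for outcome in product([0, 1], repeat=n):  # 1 = correct vote
        prob = 1.0
        for correct, p in zip(outcome, accs):
            prob *= p if correct else 1 - p
        if sum(outcome) <= n // 2:  # fewer than half correct
            err += prob
    return err

majority_error([0.5, 0.6, 0.7])  # ≈ 0.35
```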
Figures 4.1-(a) and 4.1-(b) show the classification error for systems with 3 and 15 clas-
sifiers respectively. As these figures show, the classification error for the SRB algorithm is
lower than that of majority voting. By comparing Figure 4.1-(a) with Figure 4.1-(b), we see that the error rate decreases as the number of classifiers increases.
To illustrate the effect of the number of classifiers on the classification error rate, we
[Figure 4.1 shows probability of error (%) versus the b value.]
Figure 4.1: Comparison between majority voting and the SRB algorithm for different uniform noise distributions, for (a) three, and (b) fifteen classifiers.
find the fused estimate of the classifiers for a fixed b value (b = 0.3) and different numbers of classifiers. Figure 4.2 shows the result of this experiment. As expected, increasing the number of classifiers decreases the classification error rate. The SRB algorithm has a lower probability of error, and therefore higher overall accuracy, than majority voting for all classifier ensembles.
4.4.4 Results for Gaussian Noise
We assumed that the noise is zero-mean Gaussian with power σ2. We varied the power of
the noise by adjusting σ linearly from 0.1 to 1. At each level of noise, we calculated the
classification error rate for the SRB and majority voting methods.
Figures 4.3-(a) and 4.3-(b) depict the classification error rate for 3 and 15 classifiers, respectively. As these figures show, increasing the power of the noise increases the classification error of both the SRB and majority voting methods. Also, for both algorithms, the classification error decreases as the number of classifiers grows from three to
[Figure 4.2 shows probability of error (%) versus the number of classifiers.]
Figure 4.2: Comparison between majority voting and the SRB algorithm for uniform noise with b = 0.3 and different numbers of classifiers.
fifteen.
Figure 4.4 shows the classification error for a fixed standard deviation, σ = 0.3, and different numbers of classifiers. As expected, increasing the number of classifiers decreases the classification error of the system. For all classifier ensembles, the SRB algorithm outperforms majority voting.
4.5 Conclusion
In this paper, we determined that for asymptotically large classifier ensembles under uniform and Gaussian noise assumptions, the probability of error of the static reputation-based algorithm converges no slower than that of majority voting. For finite ensembles, our analysis revealed that the accuracy of the SRB algorithm is bounded below by the accuracy of the best
[Figure 4.3 shows probability of error (%) versus the standard deviation of the noise.]
Figure 4.3: Comparison between majority voting and the SRB algorithm for different Gaussian noise powers, for (a) three, and (b) fifteen classifiers, respectively.
individual classifier in the ensemble. Finally, for finite ensembles, the probability of error of
the SRB decreases as the number of classifiers grows, regardless of whether point estimation
errors are uniform or Gaussian, and remains at all times below that of majority voting.
[Figure 4.4 shows probability of error (%) versus the number of classifiers.]
Figure 4.4: Comparison between majority voting and the SRB algorithm for a fixed standard deviation of the Gaussian noise (σ = 0.3) and different numbers of classifiers.
Chapter 5
Time-Evolving Reputation-Based
Classifier and its Application on
Classification of Physiological Responses
of Children With Disabilities
The contents of this chapter are reproduced from a journal article currently under review:
Nikjoo, M., Kushki, A., Andrews, A., and Chau, T., "Time-Evolving Reputation-Based Classi-
fier and its Application on Classification of Physiological Responses of Children With Dis-
abilities", Information Fusion.
5.1 Abstract
This chapter proposes a novel dynamic algorithm for fusing classifiers in a multi-classifier
system (MCS). The algorithm uses the Dirichlet distribution to continuously compute rep-
utation values for each individual classifier over time. We show that the proposed algorithm
is particularly advantageous for non-stationary signals, where training and test sets may be collected at temporally distant instances and hence are statistically disparate. The
algorithm adjusts the contribution of each classifier to the final decision adaptively based
on the samples of the test set. We demonstrate the advantage of the proposed algorithm
over majority voting and a static reputation-based MCS in the classification of physiologi-
cal arousal states in individuals with severe disabilities.
5.2 Introduction
The classification of multiple physiological or biomechanical signals is a technical challenge
encountered in many bioengineering applications such as biometrics, brain-computer in-
terfaces (BCI) [113, 122, 101], dysphagia detection using accelerometry signals [91, 74] and
emotion detection using physiological signals [95, 112].
A traditional approach to these classification problems is to use a single classifier. In this
approach, all the features gathered from different signals are stacked in one vector and fed
to a single classifier. Despite its simplicity, the single classifier system is susceptible to the
curse of dimensionality [34]; when working with a high-dimensional feature vector, training
necessitates voluminous data and convergence can be extremely slow. Thus, a single clas-
sifier system may be intractable for many online biomedical applications with large feature
spaces [56].
A Multiple Classifier System (MCS) judiciously integrates decisions from multiple classi-
fiers and offers a promising alternative to a single classifier in the situations outlined above.
In particular, MCSs can enhance classification performance (e.g., accuracy) and accelerate
classification in online applications [134, 65]. MCSs also accommodate combinations of
different individual classifiers each optimized to unique feature subsets. For instance, in
body-machine control interfaces, physiological (e.g., blood volume pulse, cortical hemody-
namics), biomechanical (e.g., limb motion) and environmental noise (e.g., ambient acous-
tics) provide different types of features for classification. It may be sensible to train one
classifier on each type of feature [56]. Likewise, in clinical data collection, there may be
multiple training sets, arriving at different times or under slightly different circumstances
[19, 50]. In such cases, individual classifiers might be trained on each available data set.
MCSs may combine classifiers on three levels: abstract, rank, or measurement level [135].
At the abstract level, each individual classifier outputs a unique class label. These labels are
then fused using methods such as majority voting. At the rank level, the output of each
classifier is a subset of class labels ranked in terms of the classifier’s belief about the class to
which the input belongs. At the measurement level, each classifier outputs a real value for
each class, representing the probability with which the input belongs to a given class.
Abstract-level MCSs have been the most widely applied in the literature [79, 102, 88]
given the flexibility of the abstract level output, i.e., class labels may be derived from any
type of classifier and outputs from multiple classifiers, each designed within a different con-
text, can be easily combined.
Several methods have been proposed for combining class labels at the abstract level.
Among these, majority voting [117, 53] has received special attention. In this democratic
scheme, the class with the most votes from the individual classifiers is selected as the MCS
decision. Majority voting can increase the overall performance (e.g., accuracy) of a classifi-
cation system in a wide variety of circumstances [135, 86]. Weighted majority voting [135],
is an extension of simple majority voting, where the weight of each classifier is determined
according to its belief, i.e., its certainty about a given subset of the training set. However,
the algorithm fails when one or more beliefs are zero [65], a degenerate condition that may
occur in multi-class problems. More importantly, beliefs are computed exclusively using
samples of the training set and thus there is a risk of poor generalization when the data are
non-stationary. We previously introduced the static reputation algorithm [88], an MCS which estimates classifier weights or reputation values over a validation set, without encountering degenerate conditions. Using a combination of neural networks [88] or linear discriminants [91], the static reputation algorithm classified noisy biomechanical signals with significantly higher accuracies than those achievable by single classifier systems. Nonetheless,
fixed reputation values were assigned to the individual classifiers. Therefore, the static rep-
utation algorithm is also susceptible to diminished performance in the face of statistical
disparity between samples of the test and validation sets. This disparity is of particular
concern in the classification of non-stationary physiological signals. Specifically, if one of
the individual classifiers performs poorly with the test data, the importance of its vote re-
mains unaltered, thereby negatively influencing the overall decision. We propose a dynamic
weighted voting method as a solution to this disparity issue.
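For concreteness, abstract-level majority voting and its weighted variant can be sketched as follows (a minimal illustration; the function names are ours):

```python
from collections import Counter

def majority_vote(labels):
    """Simple majority voting over abstract-level class labels."""
    return Counter(labels).most_common(1)[0][0]

def weighted_majority_vote(labels, weights):
    """Each classifier's vote counts with its (e.g., belief-based) weight."""
    scores = Counter()
    for label, w in zip(labels, weights):
        scores[label] += w
    return scores.most_common(1)[0][0]

majority_vote([1, 0, 0])                            # -> 0
weighted_majority_vote([1, 0, 0], [0.9, 0.3, 0.4])  # -> 1 (0.9 > 0.3 + 0.4)
```

The second call illustrates how a strong classifier's weight can overturn the simple majority.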
5.2.1 Dynamic weighted voting
Dynamic weighted voting first surfaced in the business literature as a means of tracking
concept drift, i.e., the evolution of dynamic concepts such as customer preferences [103].
Kolter and Maloof [63] proposed a novel tracking algorithm based on the notion of dynamic
weighted majority voting, where several individual classifiers were trained and their weights
were adaptively updated based on their correctness of categorizing training exemplars. Sun
[118] also presented a dynamic weighting algorithm based on the local within-class nearest
neighbors concept. In this algorithm, individual classifiers receive different weights based
on their performance over a weighting set, a single hold-out set distinct from the training
set. Although the weights of individual classifiers change over time, the above methods
do not address the disparity problem since they only consider performance on a train-
ing or weighting set rather than a previously unseen test set. In contrast, Valdovinos and
Sanchez [127] proposed an algorithm to directly address the disparity problem. In their ap-
proach, each classifier calculated the distance between the unknown input vector and its
nearest neighbor in the training set. Each classifier received a weight inversely proportional
to this distance, i.e., the classifier with the shortest nearest neighbor distance received the
highest weight. Although this method adaptively assigned classifier weights, it only oper-
ated at the measurement level. Further, the calculated distance was the only consideration
in classifier weight assignment, exposing the multi-classifier system to the risk of heavily
weighing a poorly performing classifier. Other important factors, such as the past perfor-
mance of a classifier, were not considered. Finally, the choice of distance metric is often
problem-specific and hence a non-trivial challenge.
Recently, unsupervised adaptive classifiers have attracted the attention of several researchers, especially those working in the area of brain-computer interfaces (BCI). In [12], the authors proposed an adaptive linear discriminant analysis (ALDA) method for online clustering of EEG patterns using a Gaussian mixture model (GMM). In this method, the
initial data distribution is estimated using a labeled training period. The major problem with this method is that if the feature distribution shifts far away from that of the training session during long-term use, the initial parameter values are no longer valid [80]. Therefore, this
method is not able to properly deal with the disparity problem. Hasan et al. [46] proposed
another unsupervised classification method based on GMM. In this algorithm, a sequential
EM method is used to adapt the GMM-based classifier in a simulated online system. Because the data set used in [46] was modest in duration, it is not possible to accurately measure the robustness of this method against the disparity problem. Gan [42] has proposed a
self-adapting unsupervised learning system to update the mean and variance of Bayesian
classifiers. However, in this method, it is not possible to decrease the effect of old data
[128]. Awate et al. [3] proposed an adaptive unsupervised algorithm for MRI brain-tissue
classification using a non-parametric Markov model. In this algorithm, the underlying im-
age statistics are learned automatically and a classification strategy based on that model
is constructed. The computational complexity of this method is very high, which makes it
unattractive for real-time applications.
5.2.2 Dynamic reputation
To address the disparity problem, we propose a time-evolving reputation-based algorithm.
Building on the static reputation algorithm [88], each classifier adaptively constructs a reputation or weight for itself, initially according to the classifier's performance on a validation
set and subsequently according to its performance on the test set. In other words, the rep-
utation of each classifier is updated dynamically whenever a new sample is classified. At
any point in time, a classifier’s reputation reflects the classifier’s performance on both the
validation and the test sets. Therefore, the effect of random high-performance of weak clas-
sifiers is appropriately moderated and likewise, the effect of a poorly performing individual
classifier is mitigated as its reputation value, and hence overall influence on the final de-
cision is diminished. The proposed algorithm functions at the abstract level and does not
require any computationally intense individual classifiers. Because of the time-evolving up-
date scheme, we call this algorithm the time-evolving reputation-based algorithm.
In the following, we present the mathematical background behind the time-evolving al-
gorithm (Section 2), review the static reputation-based algorithm (Section 3) and propose
the time-evolving reputation-based algorithm (Section 4). We exemplify the algorithm us-
ing a multivariate non-stationary physiological data set and contrast performance against
that of majority voting and the static reputation algorithm, two abstract level multi-classifier
systems.
5.3 Mathematical Background
5.3.1 Notation
Assume the time series, S, is the pre-processed version of an acquired signal. Also let Θ = {θ_1, θ_2, ..., θ_L} be a set of L ≥ 2 classifiers and Ω = {ω_1, ω_2, ..., ω_c} be a set of c ≥ 2 class labels, where ω_j ≠ ω_k, ∀ j ≠ k. Without loss of generality, Ω ⊂ N. The input of each classifier is the feature vector x ∈ R^{n_i}, where n_i is the dimension of the feature space for the i-th classifier θ_i, whose output is a class label ω_j, j = 1, ..., c. In other words, the i-th classifier, i = 1, ..., L, is a functional mapping, R^{n_i} → Ω, which for each input x gives an output θ_i(x) ∈ Ω. Generally, the classifier function could be linear or non-linear. It is assumed that for the i-th classifier, a total number of d_i samples is assigned for training. Symbols in bold signify vectors.
5.3.2 Properties of the Dirichlet Distribution
For a better understanding of the proposed algorithm, we briefly introduce the Dirichlet
distribution.
The Dirichlet distribution is a continuous multivariate probability distribution. Assume an L-component random vector U = (U_1, ..., U_L), where each U_i > 0 and Σ_{i=1}^{L} U_i = 1. The parameters of the Dirichlet distribution are L positive real numbers, α = (α_1, ..., α_L), each corresponding to one of the random variables U_i. The parameter α_i is also known as the evidence for U_i. The Dirichlet density can be defined mathematically as [58, 132]

f(u_1, ..., u_L; α_1, ..., α_L) = (1/H(α)) ∏_{i=1}^{L} u_i^{α_i − 1},   (5.1)

where H(α) is a normalization factor expressed in terms of the Gamma function, Γ:

H(α) = ∏_{i=1}^{L} Γ(α_i) / Γ(Σ_{i=1}^{L} α_i).   (5.2)

The expected value of any of the L random variables, U_i, is defined as follows:

E(U_i | α) = α_i / Σ_{j=1}^{L} α_j.   (5.3)
5.4 Static Reputation-Based Algorithm
We summarize the static reputation-based algorithm [88], upon which the dynamic reputation algorithm is based. Let p(ω_j) be the prior probability of class ω_j. Suppose we have a multi-classifier system with L distinct classifiers.
1. Train the L ≥ 2 individual classifiers using the training set.
2. Evaluate the accuracy of the i-th classifier, i = 1, ..., L, on a validation set (disjoint from the training set). Normalize this accuracy to obtain the reputation, r_i ∈ [0, 1], for the i-th classifier.
3. For each feature vector, x, in the test set, obtain L decisions using the L distinct classifiers:

Ω(x) = {θ_1(x), θ_2(x), ..., θ_L(x)}.   (5.4)
4. Sort the reputation values of the classifiers in descending order,

R* = (r_1*, r_2*, ..., r_L*),   (5.5)

such that r_1* ≥ r_2* ≥ ... ≥ r_L*. Now, a reputation-ordered set of classifiers, Θ*, is obtained by ranking the classifiers,

Θ* = (θ_1*, θ_2*, ..., θ_L*),   (5.6)

i.e., θ_1* and θ_L* are the classifiers with the highest and lowest reputations, respectively.
5. Observe the vote of the first m classifiers in the reputation-ordered set, where

m = L/2,       if L is even,
m = (L+1)/2,   if L is odd.   (5.7)
If all of these m classifiers vote for the same class, this class is selected as the final
decision of the system. If not, the final decision of the system is obtained using the
following step.
6. Using Bayes' rule and assuming independent classifiers, estimate the posterior probability as

p(ω_j | θ_1, ..., θ_L) = p(ω_j) ∏_{i=1}^{L} p(θ_i | ω_j) / [Σ_{t=1}^{c} p(ω_t) ∏_{i=1}^{L} p(θ_i | ω_t)].   (5.8)

For simplicity, we have dropped the argument of θ above, but it is understood to be a function of x. The conditional probability p(θ_i | ω_j) in (5.8) is given by

p(θ_i | ω_j) = r_i,       if x ∈ ω_j and θ_i(x) = ω_j,
p(θ_i | ω_j) = 1 − r_i,   if x ∈ ω_j and θ_i(x) ≠ ω_j.   (5.9)

When there is no known priority among classes, we can assume equal prior probabilities. Hence,

p(ω_1) = p(ω_2) = ... = p(ω_c) = 1/c.   (5.10)
7. For each class, ω_j, estimate the a posteriori probability as given by (5.8), using (5.9) and (5.10). Select the class with the highest posterior probability as the final decision of the system.
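Steps 4 through 7 can be sketched as follows for a binary problem (an illustration under our naming; classifier outputs and reputation values are assumed to be given):

```python
# Illustrative sketch of the static reputation-based (SRB) fusion steps 4-7.
def srb_fuse(decisions, reputations, classes, priors=None):
    """decisions[i]: label output by classifier i; reputations[i] in (0.5, 1]."""
    L = len(decisions)
    priors = priors or {w: 1.0 / len(classes) for w in classes}
    # Steps 4-5: rank classifiers by reputation and check top-m agreement.
    order = sorted(range(L), key=lambda i: reputations[i], reverse=True)
    m = L // 2 if L % 2 == 0 else (L + 1) // 2
    top = [decisions[i] for i in order[:m]]
    if len(set(top)) == 1:
        return top[0]
    # Steps 6-7: Bayes fusion with p(theta_i|w) = r_i if agree, else 1 - r_i.
    post = {}
    for w in classes:
        p = priors[w]
        for d, r in zip(decisions, reputations):
            p *= r if d == w else 1 - r
        post[w] = p
    return max(post, key=post.get)

srb_fuse(decisions=[1, 0, 1], reputations=[0.9, 0.6, 0.7], classes=[0, 1])  # -> 1
```

Here the two highest-reputation classifiers (0.9 and 0.7) agree on class 1, so step 5 decides without invoking the Bayes step.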
5.5 Time-Evolving Reputation-Based Algorithm Using the Dirich-
let Distribution
The static reputation-based algorithm improves upon the performance of the majority vot-
ing algorithm whenever there is a high correlation between samples of the validation and
test sets. However, the static algorithm performs poorly when samples of the validation and
the test sets are statistically disparate. To address this issue, the time-evolving reputation-
based algorithm updates the reputation values based on the performance of the individual
classifiers on the test set. To do so, we use the Dirichlet reputation system [58], which has
been touted for its flexibility, simplicity and representational capacity [132]. The Dirichlet
reputation system captures a sequence of observations and updates the a priori reputation
value for each classifier. By adding the observations to the a priori information, the total
evidence for each outcome can be written as follows:
α_i = v_i + Cβ_i,   (5.11)

where C > 0 is a constant, β = (β_1, ..., β_L) is the vector of initial confidence values of the L classifiers, with the condition Σ_{i=1}^{L} β_i = 1 and β_i > 0, and v = (v_1, ..., v_L), where v_i ≥ 0, is the vector of a posteriori evidence for the L classifiers. When no other evidence is available, it is usually assumed that

β_1 = ... = β_L = 1/L.   (5.12)
In our method, because the initial reputation is known, βi is a real number related to the
initial reputation value of the i-th classifier. In the spirit of (5.3), we can write the expected weight of the i-th classifier as a function of the initial confidence in that classifier, β_i, and the accumulated evidence, v_i, namely,

E(U_i | v, β) = (v_i + Cβ_i) / (C + Σ_{j=1}^{L} v_j).   (5.13)
By construction, (5.13) is between 0 and 1 and its summation over i is unity. Equation
(5.13) allows us to update the expected value of the Dirichlet distribution using evidence
from posterior observations. With equal initial confidence values for all the outcomes, the
expected values of different outcomes are equal to each other. However, after recording
several observations and updating the values of vi in (5.13), it is possible that the posterior
expected values become different for different outcomes.
To illustrate the application of this equation, we provide an example. Assume there are three card players, A_1, A_2, and A_3 (L = 3), who play a specific game several times, and each time one of them wins. When no a priori information is available, we assume that the initial probabilities of winning for the three players are β_1 = β_2 = β_3 = 1/3. For this example, let C = 2 and the initial observations be v_1 = v_2 = v_3 = 0. Inserting these values into (5.13), we find that the initial probability of winning the game is, as expected, uniformly 1/3 for each of the players. Now assume the participants played the game 10 times and A_1 won 8 times while A_2 and A_3 each won once, i.e., v_1 = 8, v_2 = 1, and v_3 = 1. Inserting this new posterior evidence into (5.13), we find that E(P(A_1)) = 26/36, E(P(A_2)) = 5/36, and E(P(A_3)) = 5/36. In other words, by augmenting our initial knowledge with posterior evidence,
the probability of selecting the first player as the winner of the game has increased while the
probability of selecting the other players has decreased. By the same token, this property of
the Dirichlet reputation system can be exploited for the automatic updating of reputation
values of individual classifiers based on their a posteriori performance on the test set.
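The card-player example can be reproduced directly from (5.13) (a sketch; exact fractions are used to match the numbers above):

```python
# Sketch of the Dirichlet reputation update of Eq. (5.13), reproducing the
# card-player example: C = 2, uniform initial confidence 1/3, and posterior
# evidence v = (8, 1, 1).
from fractions import Fraction

def dirichlet_expectation(v, beta, C):
    """E(U_i | v, beta) = (v_i + C*beta_i) / (C + sum_j v_j), Eq. (5.13)."""
    denom = C + sum(v)
    return [(vi + C * bi) / denom for vi, bi in zip(v, beta)]

beta = [Fraction(1, 3)] * 3
dirichlet_expectation([8, 1, 1], beta, C=2)  # -> [13/18, 5/36, 5/36], i.e. 26/36, 5/36, 5/36
```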
Now, let us return to the initial problem of combining multiple classifiers. Recall that
our main goal is to develop a time-evolving reputation-based algorithm that can adapt to the statistical disparity between training and testing data in non-stationary physiological signals.
To automatically update our confidence in each classifier over time, we extend (5.13) to
account for the passage of time, namely,
β_i[k+1] = (v_i[k+1] + Cβ_i[k]) / (C + Σ_{j=1}^{L} v_j[k+1]),   (5.14)

where k is the time index, incremented whenever a new subject is classified, β_i[k] is the confidence in the i-th classifier at time k, and v_i[k] is the evidence in favor of classifier i at time k.
Let us examine a voting scheme by which the evidence values v_i[k] can be determined at each time step. After training, individual classifiers are evaluated on a validation set to es-
timate their initial reputation values. Then, each sample from the test set is classified by
Figure 5.1: An example of the voting scheme in the proposed algorithm.
each classifier into one of the possible classes. The classifiers scrutinize their outputs and
vote for each other if their decisions agree. In other words, classifier i votes for classifier j ,
and receives a vote from classifier j , if both classify the input sample into the same class.
This voting scheme is exemplified in Figure 5.1 where classifiers C1,C4,C6 and C7 mutually
endorse each other. Likewise, classifiers C2,C3,C5 cast votes for one another.
The classifier group with the highest number of votes, i.e., largest number of classifiers
in agreement, is declared the winning group. For each member of the winning group,
the accumulated votes are recorded as the corresponding value v_i[k] in (5.14). Hence, 0 ≤ v_i[k] < L. The classifiers outside of the winning group are assigned a null vote, i.e., v_i[k] = 0.
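The evidence values implied by this voting scheme can be sketched as follows (our reading: each member of the largest agreeing group receives one vote from every other member, so v_i[k] equals the group size minus one, consistent with 0 ≤ v_i[k] < L):

```python
from collections import Counter

def evidence_votes(decisions):
    """v_i[k] for each classifier: members of the largest agreeing group get
    one vote from every other member (group size - 1); all others get 0.
    Tie-breaking between equally large groups is our assumption."""
    counts = Counter(decisions)
    win_label, win_size = counts.most_common(1)[0]
    return [win_size - 1 if d == win_label else 0 for d in decisions]

# As in the Figure 5.1 example: C1, C4, C6, C7 agree (group of 4) and
# C2, C3, C5 agree (group of 3).
evidence_votes([1, 0, 0, 1, 0, 1, 1])  # -> [3, 0, 0, 3, 0, 3, 3]
```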
The new confidence of each classifier can then be calculated using (5.14). For the initial
confidence in each classifier, we use the reputation values calculated by the static reputation
algorithm, namely,

β_i[0] = r_i[0] / Σ_{j=1}^{L} r_j[0],   (5.15)
where r_i[0] is the initial reputation value of the i-th classifier computed using the validation set. This choice of initial confidence values ensures that our method performs comparably to the static reputation-based algorithm when there is a high correlation between samples of the validation and test sets. Based on [88], all the reputation values should satisfy the following condition:

0.5 < r_i ≤ 1.   (5.16)
Although equation (5.14) gives us an updating rule for the confidence values, we still need a method to estimate the time-evolving reputation values, which are required for the final decision of the multiple classifier system. Based on equation (5.15), the relationship between β_i and r_i is non-trivial and non-invertible. We next define a rule for updating the reputation values.
5.5.1 An Updating Rule for Reputation Values
The new reputation value of classifier i can be calculated as a function of the previous rep-
utation value,
r_i[k+1] = f(r_i[k]),    (5.17)

where f could, in general, be a non-linear function. After updating the reputation values of
all classifiers using (5.17), the final decision of the system about the input sample should be
made using (5.8) and Bayes theorem as exemplified in [88] and [91].
Let us more closely examine an updating function for (5.17). For simplicity, we assume
that there are two possible classes for each sample, i.e., Ω = {ω1, ω2}, and the number of clas-
sifiers L is odd. The basic principle underlying our update rule is as follows. If for a given
test sample, the confidence in a classifier undergoes a maximum positive change, then the
classifier’s reputation is maximized. Conversely, if the confidence in the classifier decreases
maximally, its reputation becomes 0.5. In all other instances, the reputation should change
in a manner proportional to the relative change in confidence. To formalize, denote the
maximum and minimum possible confidence in the i-th classifier as β_i^max and β_i^min,
respectively. Based on (5.15), β_i[0] is maximized when r_i[0] = 1 and r_j[0] = 0.5 ∀ j ≠ i. In
other words, the maximum confidence in the i-th classifier is

β_i^max = 1 / (0.5 × (L − 1) + 1) = 2 / (L + 1).    (5.18)

For example, for an MCS with L = 3 classifiers, β^max is 0.5, which is reached when the reputa-
tion of one classifier is 1 and the other two are 0.5.

Again, based on (5.15), β_i is minimized when r_i = 0.5 and r_j[0] = 1 ∀ j ≠ i. Hence, the
minimum confidence in the i-th classifier is

β_i^min = 0.5 / ((L − 1) + 0.5) = 1 / (2L − 1).    (5.19)

Based on (5.18) and (5.19), the range of β for each classifier is β ∈ [1/(2L − 1), 2/(L + 1)].
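These bounds are easy to confirm numerically. The sketch below (illustrative, not thesis code) evaluates (5.15) for classifier 1 over a grid of reputation values in [0.5, 1] with L = 3 and recovers the extremes 1/(2L − 1) = 0.2 and 2/(L + 1) = 0.5:

```python
import itertools

L = 3
grid = [(50 + t) / 100 for t in range(51)]   # reputations sampled over [0.5, 1]
# confidence of classifier 1 per (5.15), over all reputation assignments
betas = [r[0] / sum(r) for r in itertools.product(grid, repeat=L)]
print(min(betas), max(betas))  # 0.2 0.5, i.e. 1/(2L-1) and 2/(L+1)
```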
To update the reputation value of each classifier, we first need to find the change in the
corresponding confidence value. To do so, we use (5.14) and calculate the difference of
confidence values in one time interval as follows:
β_i[k+1] − β_i[k] = (v_i[k+1] + C β_i[k]) / (C + ∑_{i=1}^{L} v_i[k+1]) − β_i[k]    (5.20)
                 = (v_i[k+1] − β_i[k] ∑_{i=1}^{L} v_i[k+1]) / (C + ∑_{i=1}^{L} v_i[k+1]).
Theorem 1. Equation (5.20) is maximized with respect to v_i when v_i[k+1] is equal to
(L − 1)/2, in which case ∑_{i=1}^{L} v_i[k+1] = (L² − 1)/4.
Proof. See Appendix A.
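Theorem 1 can be spot-checked numerically. In the sketch below (our illustration), a winning-group member receiving v votes implies a vote total of v(v + 1), as derived in Appendix A, and (5.20) is evaluated in the C → 0 limit:

```python
L = 7                      # an odd number of classifiers
beta_prev = 0.2            # some fixed previous confidence beta_i[k]
C = 1e-9                   # approximate the C -> 0 limit

def delta_beta(v):
    total = v * (v + 1)    # sum of all votes when each winner receives v (Appendix A)
    return (v - beta_prev * total) / (C + total)   # equation (5.20)

candidates = range((L - 1) // 2, L)   # admissible v_i for a winning-group member
best = max(candidates, key=delta_beta)
print(best)                # 3, i.e. (L - 1)/2, as Theorem 1 states
```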
Combining the value of v_i[k+1] given by Theorem 1 with the minimum confidence β_i^min
from (5.19), the maximum change in confidence can be obtained as follows:

max_{v_i, β_i} (β_i[k+1] − β_i[k]) = ((L − 1)/2 − (L² − 1)/(4(2L − 1))) / (C + (L² − 1)/4)    (5.21)
                                   = 3(L − 1)² / ((2L − 1)(L² − 1 + 4C)).

We see that (5.21) approaches its maximum as C → 0:

β_{i−max} = max_C max_{v_i, β_i} (β_i[k+1] − β_i[k]) = 3(L − 1) / ((2L − 1)(L + 1)).    (5.22)
Theorem 2. Equation (5.20) is minimized with respect to v_i when v_i[k+1] = 0, in which
case ∑_{i=1}^{L} v_i[k+1] = (L − 1)(L − 2).

Proof. See Appendix B.

Based on Theorem 2, we have

β_{i−min} = min (β_i[k+1] − β_i[k]) = −2 / (L + 1).    (5.23)
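The two extremes (5.22) and (5.23) can likewise be confirmed by brute force. The sketch below (illustrative) enumerates the possible winning-group sizes for L = 5, evaluates (5.20) for a winning and a losing member with confidence ranging over [1/(2L − 1), 2/(L + 1)], and recovers both bounds as C → 0:

```python
L = 5
C = 1e-12                                      # approximate the C -> 0 limit
b_lo, b_hi = 1 / (2 * L - 1), 2 / (L + 1)      # confidence range from (5.18)-(5.19)
hs = []
for m in range((L + 1) // 2, L + 1):           # possible winning-group sizes
    total = m * (m - 1)                        # total votes cast among winners
    for v in (m - 1, 0):                       # a winning member vs. a losing member
        for t in range(101):
            beta = b_lo + (b_hi - b_lo) * t / 100
            hs.append((v - beta * total) / (C + total))   # equation (5.20)
print(max(hs))   # ~  3(L-1)/((2L-1)(L+1)) =  0.2222...
print(min(hs))   # ~ -2/(L+1)              = -0.3333...
```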
Now, by using (5.22) and (5.23), we define two new parameters for classifier i as
follows:

∆+ = (β_i[k+1] − β_i[k]) × (2L − 1)(L + 1) / (3(L − 1)),    (5.24)

and

∆− = (β_i[k+1] − β_i[k]) × (L + 1)/2.    (5.25)
When βi [k + 1] ≥ βi [k ], ∆+ is the relative change in confidence, βi , with respect to the
maximum possible positive change. Likewise, when βi [k + 1] < βi [k ], ∆− is the relative
change of βi with respect to the maximum negative change. Now we are ready to define a
reputation update rule. Considering (5.16) and one time index k , the reputation value may
undergo a maximum increase of 1− ri [k − 1] and a maximum decrease of ri [k − 1]− 0.5.
Therefore, for the i-th classifier, the reputation value can be updated as follows:

r_i[k+1] = r_i[k] + (1 − r_i[k]) ∆+,    if β_i[k+1] ≥ β_i[k],
r_i[k+1] = r_i[k] + (r_i[k] − 0.5) ∆−,  if β_i[k+1] < β_i[k].    (5.26)
We should note here that∆+ ∈ (0,1). Let us consider two limiting cases of the reputation
update rule. According to (5.14), when C →∞, βi [k + 1]→ βi [k ] and hence ∆+→ 0. Thus,
in this case ri [k + 1]→ ri [k ]. Upon inspection of (5.14), we realize that in this first limiting
case, the current votes of the other classifiers are completely ignored and the overall deci-
sion is modulated only by the original reputations determined from the validation set. On
the other hand, when C → 0, the value of∆+ approaches unity. In this limiting case, the past
performance of a classifier is completely neglected and the weighting of individual classi-
fiers rests solely on the current votes among classifiers. According to (5.26), for this case,
ri [k +1]→ 1.
We also note that∆− ∈ (−1, 0). When C →∞,∆− converges to 0, and ri [k +1] converges
to ri [k ]. Likewise, when C → 0,∆− approaches−1, and thus ri [k +1]→ 0.5. Therefore, in all
limiting cases, (5.26) satisfies the initial requirement that 0.5< ri ≤ 1, as defined in (5.16).
We further illustrate the proposed update rule with an example. Assume an MCS with
L = 3 classifiers, r1 = 0.7, r2 = 0.7, and r3 = 0.6. The initial confidence values for these
classifiers are β1[0] = 0.35, β2[0] = 0.35, and β3[0] = 0.3 respectively. Now, assume at time
k = 1, classifiers θ1 and θ2 vote similarly but classifier θ3 votes differently. In Table 5.1, we
list the updated reputation values for different values of C .
For C = 0.01, the votes of the classifiers are important factors in updating the reputation
values. In this case, a positive vote from one classifier increases the reputation values signif-
icantly. By looking at equation (5.14), we see that as C → 0, βi [k +1] changes independently
Table 5.1: An example to clarify the updating rule.
C r1[1] r2[1] r3[1]
0.01 0.849 0.849 0.540
2 0.775 0.775 0.570
10 0.725 0.725 0.595
100 0.703 0.703 0.598
1000 0.700 0.700 0.599
of βi [k ]. Therefore, for small values of C , the proposed algorithm converges to the majority
voting algorithm, where the final decision of the system is based only on the instantaneous
votes of the classifiers. On the other hand, Table 5.1 shows that for C = 1000, the reputa-
tion values essentially become constant and hence the votes of the classifiers are neglected.
Based on equation (5.14), when C →∞, βi [k + 1]→ βi [k ]. Therefore, the confidence val-
ues and by extension, reputation values become nearly constant. From this example, we
see that based on the choice of C , reputation values can have differential effects on the fi-
nal decision of the classifier system. The appropriate choice for C depends on the desired
’performance memory’ and should be tuned empirically for each application.
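Table 5.1 can be reproduced directly from equations (5.14), (5.15), and (5.24)–(5.26). The function below is our illustrative implementation (not thesis code); running it with the votes (1, 1, 0) of this example recovers, for instance, the C = 0.01 and C = 2 rows of the table:

```python
def update_reputations(r, votes, C):
    """One update step: confidences from (5.15), confidence update (5.14),
    then the reputation update rule (5.24)-(5.26)."""
    L = len(r)
    beta = [ri / sum(r) for ri in r]          # current confidences, per (5.15)
    total = sum(votes)
    r_new = []
    for ri, b, v in zip(r, beta, votes):
        b_new = (v + C * b) / (C + total)     # confidence update (5.14)
        if b_new >= b:
            d = (b_new - b) * (2 * L - 1) * (L + 1) / (3 * (L - 1))   # (5.24)
            r_new.append(ri + (1 - ri) * d)   # upper branch of (5.26)
        else:
            d = (b_new - b) * (L + 1) / 2     # (5.25)
            r_new.append(ri + (ri - 0.5) * d) # lower branch of (5.26)
    return r_new

# L = 3, r = (0.7, 0.7, 0.6); classifiers 1 and 2 agree, classifier 3 dissents.
for C in (0.01, 2):
    print(C, [round(x, 3) for x in update_reputations([0.7, 0.7, 0.6], [1, 1, 0], C)])
# 0.01 [0.849, 0.849, 0.54]
# 2 [0.775, 0.775, 0.57]
```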
The time-evolving reputation algorithm for the two-class problem is summarized below.
1. L classifiers are trained and evaluated on the validation set to obtain their initial rep-
utation values.
2. The confidence value of each classifier is calculated by using its corresponding repu-
tation value and equation (5.15).
3. For each input vector, x , each classifier outputs one class label, eitherω1 orω2.
4. The classifiers are categorized into two groups based on their outputs. The group with
the higher number of members is the winner.
5. Each member of the winner group, receives one vote from each of its co-members.
The members of the loser group do not receive any votes.
6. The reputation value of each classifier is updated using equation (5.26).
7. The confidence value of each classifier is updated using equation (5.14).
8. The final decision of the system about the input vector is made using equation (5.8).
9. The algorithm is repeated from Step 2.
5.6 Clinical Application
5.6.1 The Data
We demonstrate the proposed algorithm using physiological and biomechanical signals col-
lected from adolescents with severe disabilities for the purpose of automatically differenti-
ating between rest and activity states [68]. The eventual goal of such pattern classification
is to provide a speech-free means of expression for individuals who are unable to commu-
nicate through speech or gestures.
The dataset contains five signals, namely, blood volume pulse, electrodermal activity
(EDA), respiration, skin temperature, and limb acceleration. These signals were collected
from nine participants with severe disabilities (aged 18.11 ± 2.2 years; 7 female). Each
participant alternated 20 times between 20-second periods of rest and activity. During the
rest state, the participants sat quietly whereas in the activity state they observed pictures of
items they liked or disliked. The viewing of affective images is known to elicit physiological
responses [87, 120]. An example of the different signals from one participant is shown in
Figure 5.2. The objective of our classifier system was to automatically classify features of the
measured signals into rest or activity states. The features and classifier are discussed next.
(a) Hand acceleration (b) Skin temperature
(c) Respiration (d) Electrodermal Activity
(e) Blood Volume Pulse
Figure 5.2: Sample physiological signals collected from one of the participants. In each graph, the horizontal axis denotes time while the vertical axis is signal amplitude in arbitrary units.
5.6.2 Classification
We designed an MCS with 5 different LDA classifiers to address this classification problem.
In total, 40 samples, 20 samples for each class, were used in this experiment. The following
parameters were computed:
• Classifier 1: mean, standard deviation and slope of the heart rate (derived from blood
volume pulse), and the mean of the absolute value of the first difference of the heart rate
signal normalized by its mean and standard deviation.
• Classifier 2: mean, standard deviation and slope of the EDA signal, and the mean of
Table 5.2: The average performance of different classifiers.
Participant Single classifier Static Reputation Majority Vote Time-evolving Reputation
1 54.95±6.5 54.04±5.5 55.23±4.8 59.70±6.3
2 77.98±3.5 77.04±3.1 79.93±3.2 85.52±3.2
3 56.00±6.0 74.57±3.7 67.00±5.8 74.43±6.1
4 82.89±6.0 82.82±4.4 78.73±4.7 84.30±4.2
5 93.11±4.0 80.11±4.0 86.70±2.5 86.34±2.9
6 87.96±5.0 86.64±5.0 86.62±2.9 86.55±3.1
7 59.29±6.7 58.43±3.7 60.54±4.0 59.95±5.7
8 74.05±5.7 74.84±4.3 76.30±3.7 76.73±4.9
9 73.52±6.7 87.09±2.5 75.82±5.6 94.18±2.9
Overall Performance 73.31 75.06 74.10 78.63
the absolute first difference of the EDA signal normalized by its mean and standard
deviation.
• Classifier 3: mean, standard deviation and slope of the respiration signal, and the
mean absolute first difference of the respiration signal normalized by its mean and
standard deviation.
• Classifier 4: mean and slope of the temperature signal.
• Classifier 5: mean, standard deviation and slope of the limb acceleration signal.
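The per-signal parameters listed above can be extracted with a short NumPy routine. The sketch below is illustrative (the thesis code is not available), and it adopts one common reading of the last feature, namely the mean absolute first difference of the standardized signal:

```python
import numpy as np

def segment_features(x):
    """Mean, standard deviation, least-squares slope, and mean absolute
    first difference of the standardized signal for one segment."""
    x = np.asarray(x, dtype=float)
    slope = np.polyfit(np.arange(len(x)), x, 1)[0]   # linear-trend slope
    z = (x - x.mean()) / x.std()                     # normalize by mean and std
    mafd = np.abs(np.diff(z)).mean()
    return [x.mean(), x.std(), slope, mafd]

# e.g. classifier 3 would use all four features of a respiration segment,
# while classifier 4 would keep only the mean and slope of skin temperature.
```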
The actual input features to the various classifiers were the differences in the above param-
eters between successive rest and picture-watching intervals. Classifier accuracies were
estimated using 5-fold cross validation. However, unlike classical cross-validation, we fur-
ther segmented the ’traditional training’ set into training and validation sets. In each fold,
80% of ’traditional training’ samples were used for classifier training and 20% for calculat-
ing the initial reputation values. This process was repeated for 20 different iterations. To
gauge the performance of the proposed algorithm, we compared it with other fusion methods.

Table 5.3: The p-values of paired t-tests comparing the time-evolving classifier with the static classifier and with majority voting.

Participant   P value (time-evolving vs. majority vote)   P value (time-evolving vs. static)
1             0.0358                                      0.0016
2             p < 0.001                                   p < 0.001
3             0.0014                                      0.9214
4             p < 0.001                                   0.2245
5             0.3125                                      p < 0.001
6             0.2826                                      0.8965
7             0.5255                                      0.1245
8             0.5592                                      0.4248
9             p < 0.001                                   p < 0.001

Since the outputs of the classifiers in this experiment were only class labels, we com-
pared our method to abstract level fusion algorithms, namely, the majority voting and static
reputation-based algorithms (the latter being akin to a weighted majority voting algorithm).
The proposed algorithm was also compared against a single classifier with all 17 input fea-
tures. For the proposed algorithm, the best value of C was estimated empirically as C = 40.
Table 5.2 summarizes the results of the different classifiers for this experiment.
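The nested splitting scheme described above (5 outer folds, with each fold's 'traditional training' set split 80/20 into training and validation sets, repeated over 20 iterations) can be sketched with scikit-learn utilities. The code below is an illustrative skeleton on synthetic data; the random data, the seeds, and the use of validation accuracy as the initial reputation are our assumptions:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))               # 40 samples: 20 rest, 20 activity
y = np.repeat([0, 1], 20)

accs = []
for it in range(20):                       # 20 repetitions of the procedure
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=it)
    for fit_idx, test_idx in outer.split(X, y):
        # split the 'traditional training' portion 80/20 into training/validation
        tr, val = train_test_split(fit_idx, test_size=0.2,
                                   stratify=y[fit_idx], random_state=it)
        clf = LinearDiscriminantAnalysis().fit(X[tr], y[tr])
        r0 = clf.score(X[val], y[val])     # would seed the classifier's reputation
        accs.append(clf.score(X[test_idx], y[test_idx]))
print(len(accs))                           # 100 fold evaluations in total
```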
5.6.3 Results
As seen in Table 5.2, the time-evolving reputation-based classifier outperformed its static
counterpart on this nonstationary data set. Due to the disparity between the features of
the training set and those of the test set, the fixed reputations of the static system poorly
predicted the individual classifier performance on test data. In contrast, the time-evolving
reputation-based classifier mixes the initial reputation values with instantaneous classifier
performance on the test set. Consequently, the time-evolving algorithm is more robust to
nonstationary features. A pairwise t-test reveals that the time-evolving classifier is signifi-
cantly more accurate (p < 0.05) than the static version for participants 1, 2, 5 and 9 (Table
5.3). This finding suggests that the signal features for this subset of participants were par-
ticularly nonstationary. On the other hand, the majority voting classifier relies exclusively
on the instantaneous classifier performance and neglects their reputations. Based on Ta-
bles 5.2 and 5.3, the time-evolving reputation-based algorithm yields significantly higher
accuracies (p < 0.05) than the majority voting method for participants 1, 2, 3, 4, and 9. This
finding suggests that for these participants, the historical classifier performance as captured
via initial reputations, does bear some predictive value for future decisions. Finally, the
time-evolving reputation-based algorithm outperformed the single classifier for all but one
participant. Moreover, the single classifier was very slow compared to the time-evolving al-
gorithm and highly susceptible to the curse of dimensionality given its high dimensional
input feature space. It is interesting to note that all the classifiers generated low accuracies
for the first and seventh participants. During data collection, these two participants expe-
rienced excessive amounts of involuntary movement which likely contaminated both rest
and activity recordings uniformly, obfuscating their distinctive characteristics.
5.6.4 The Effect of the Constant C
With the above physiological data, we investigated the sensitivity of the proposed algorithm
to the constant C . We ran the proposed algorithm for different values of C , the accura-
cies of which are summarized in Figure 5.3. For small values of C , the accuracies of the
time-evolving reputation-based algorithm and the majority voting method are comparable.
According to equation (5.14), when C → 0 we have β_i[k+1] → v_i[k+1] / ∑_{i=1}^{L} v_i[k+1]. Therefore, the past
performance of classifiers are not considered in computing new confidence values, i.e., rep-
utations are updated strictly on the basis of instantaneous votes. This is exactly the behavior
of majority voting.
As evident in Figure 5.3, the time-evolving and static reputation-based classifiers per-
form similarly when C is large. Indeed, when C → ∞ in equation (5.14), we note that
βi [k + 1] → βi [k ]. In other words, the confidence and reputation values become nearly
constant and equal to their initial values.
[Plot: average accuracy (%) of the time-evolving reputation, static reputation, and majority vote algorithms versus the value of C, for C from 0.001 to 10000.]
Figure 5.3: The effect of the constant C on the time-evolving reputation-based algorithm.
Figure 5.4 depicts the reputation values of the different classifiers for different values
of the constant C, as different folds of a test set are shown to the classifier. In Figure 5.4(a), we
see that for large C, the reputation values are almost constant for all five classifiers. In
contrast, as seen in Figure 5.4(c), for small values of C, reputation values fluctuate wildly on
a sample-by-sample basis. From Figure 5.4(b), we note that reputation values change but
converge quickly for C = 40. These observations confirm our choice of this constant.
(a) C = 10000 (b) C = 40
(c) C = 0.001
Figure 5.4: The evolution of reputation values for the 5 classifiers for different values of C .
5.7 Conclusion
A time-evolving reputation-based classifier was introduced. This novel multiple classifier
fusion algorithm updates initial reputation values of the individual classifiers based on their
performances on the test set. We showed that with a nonstationary physiological data set,
the proposed algorithm can provide higher accuracies than the static reputation-based al-
gorithm, the majority voting method, and a single classifier.
5.8 Appendix
5.8.1 Proof of Theorem 1
Let Γ_i represent the set of classifiers that classify a specific sample of the test set as ω_i, and
let Γ_max be the Γ_i with the maximum number of members. Let D(Γ_i) represent the cardinality
of the set Γ_i. Recall that there are L classifiers, where L is odd. Assume two possible class
labels, ω1 and ω2, for each sample. Therefore, for the winning subset of classifiers, D(Γ_max) ∈
{(L + 1)/2, ..., L}. Since each classifier cannot vote for itself, each member of the winning
group receives v_i ∈ {(L − 1)/2, ..., L − 1} votes. Let v_i[k+1] = γ ≥ 0; then

∑_{i=1}^{L} v_i[k+1] = γ(γ + 1).    (5.27)

Now, define h(β_i) = β_i[k+1] − β_i[k]. Because C > 0, (5.20) reaches its maximum when
C → 0. To find the maximum of (5.20), we calculate ∂h/∂γ|_{C→0}, which is

∂h/∂γ|_{C→0} = ∂/∂γ [(γ − β_i[k] γ(γ + 1)) / (γ(γ + 1))] = −1/(γ + 1)².    (5.28)

Because (5.28) is a strictly negative function, h(β_i) is a monotonically decreasing function
of γ and is thus maximized when v_i[k+1] takes on its minimum value of (L − 1)/2. Plugging
this minimum into (5.27), we have ∑_{i=1}^{L} v_i[k+1] = (L − 1)/2 × (L + 1)/2 = (L² − 1)/4.
This proves Theorem 1.
5.8.2 Proof of Theorem 2
Two options are available for classifier i: either it belongs to Γ_max or it belongs to the set with
the fewest members, Γ_min. First, assume i ∈ Γ_max. Based on (5.28), h is a monotonically
decreasing function. Hence, its minimum is reached when v_i = L − 1 and ∑_{i=1}^{L} v_i[k+1] =
L(L − 1). For these values, as C → 0, the minimum of h is obtained as

min_{v_i} h(β_i)|_{i∈Γ_max} = ((L − 1) − β_i[k] L(L − 1)) / (L(L − 1)).    (5.29)

(5.29) reaches its minimum when β_i = β_i^max = 2/(L + 1), which was obtained in (5.18). For
this value, (5.29) becomes

min_{v_i, β_i} h(β_i)|_{i∈Γ_max} = −(L − 1) / (L(L + 1)).    (5.30)

Now, assume i ∈ Γ_min. In this case, D(Γ_min) ∈ {1, ..., (L − 1)/2}. For these cases, v_i[k+1] = 0
and ∑_{i=1}^{L} v_i[k+1] ∈ {(L − 1)(L + 1)/4, ..., (L − 1)(L − 2)}. Therefore, (5.20) becomes

h(β_i) = β_i[k+1] − β_i[k] = −β_i[k] ∑_{i=1}^{L} v_i[k+1] / (C + ∑_{i=1}^{L} v_i[k+1]).    (5.31)

By letting ∑_{i=1}^{L} v_i[k+1] = γ′, we have

∂h/∂γ′ = −C β_i[k] / (γ′ + C)²,    (5.32)

which is a strictly negative function. Therefore, h is a decreasing function of γ′ and reaches
its minimum when ∑_{i=1}^{L} v_i[k+1] = (L − 1)(L − 2). By using this value, β_i^max = 2/(L + 1),
and C → 0, the minimum of h becomes

min_{v_i, β_i} h(β_i)|_{i∈Γ_min} = −2/(L + 1).    (5.33)

By comparing (5.30) and (5.33), we see that (5.33) has the lower value; hence (5.20) is
minimized when i ∈ Γ_min, and this proves Theorem 2.
Chapter 6
Conclusion
6.1 Summary of research completed
To achieve the above objectives, the following phases of research were designed and imple-
mented.
1. In the first phase of this thesis, we proposed a new MCS based on the past perfor-
mances of the individual classifiers. We compared the performance of the proposed
algorithm with other MCS methods using real data and showed that the proposed
method outperformed its competitors.
2. In the second phase, we applied the proposed method to the problem of dysphagia
detection. We showed that the proposed algorithm can distinguish between healthy
and dysphagic swallows with higher accuracy than its competitors. Furthermore, in
this phase, we used the proposed algorithm to investigate the difference between
two sensitive channels of the accelerometer, aligned to two orthogonal anatomical
axes.
3. In the third phase of this research, we introduced a new dynamic MCS which updates
its performance based on samples of the test set. We investigated the application of
the proposed method to emotion detection using real data collected from children
with disabilities. We showed that for this application, the dynamic reputation-based
algorithm works better than all of its competitors.
6.2 Summary of contributions
This thesis makes several original contributions to multiple classifier systems and their ap-
plications in rehabilitation engineering. These contributions are summarized as follows:
1. We developed a new algorithm for combining several individual classifiers using their
reputation values. By gauging the performance of each individual classifier on the
validation set, a novel weighted majority voting method was introduced.
2. We classified dysphagic and healthy swallows with accuracy higher than currently
published results. By applying the proposed method to real data collected from sev-
eral healthy and dysphagic participants, we have shown that the static reputation-
based algorithm can classify swallowing accelerometry with high sensitivity and mod-
est specificity, making it a suitable screening tool.
3. We introduced a new dynamic MCS based on the Dirichlet distribution for classifica-
tion of non-stationary signals. The proposed algorithm is robust against high dispar-
ity between training and test sets.
4. We detected physiological arousal using the dynamic reputation-based algorithm. By
applying the proposed algorithm to the physiological signals from youth with severe
disabilities, we demonstrated the potential of speech-free detection of activity en-
gagement.
5. We analyzed the asymptotic behavior of the static reputation-based algorithm as well
as its behavior for small and medium-sized classifier ensembles. We also characterized
the performance of the algorithm in the presence of uniform and Gaussian noise and
quantified its probability of error.
The above contributions resulted in 1 book chapter, 3 journal papers, and 1 patent.
Some ideas for future work are as follows:
• In this thesis, we have characterized the asymptotic and small-ensemble behavior of
the static reputation-based algorithm only. Replicating this theoretical and empirical
exercise for the dynamic reputation-based algorithm is a worthwhile endeavor.
• In most of the previously proposed algorithms, an offline classification approach is
considered. However, in many rehabilitation engineering problems, such as the detection of
physiological arousal and hand pose detection, there is need for online classification.
In this case, designing a new method to combine several classifiers in an online man-
ner could be extremely helpful.
Bibliography
[1] F Alkoot and J Kittler. Experimental evaluation of expert fusion strategies. Pattern
Recognition Letters, 20, 1999.
[2] O Aran and L Akarun. A multi-class classification strategy for fisher scores: Ap-
plication to signer independent sign language recognition. Pattern Recognition,
43(5):1776–1788, 2009.
[3] S Awate, T Tasdizen, N Foster, and R Whitaker. Adaptive markov modeling for mutual-
information-based, unsupervised mri brain-tissue classification. IEEE Transactions on
Biomedical Engineering, 10, 2006.
[4] S Bagui and N Pal. A multistage generalization of the rank nearest neighbor classifi-
cation rule. Pattern Recognition Letter, 16(6):601–614, 1995.
[5] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms:
Bagging, boosting, and variants. Machine learning, 36(1):105–139, 1999.
[6] R Bauer. Physiologic measures of emotion. Journal of Clin Neurophysiol, 15, 1998.
[7] N Birbaumer. Breaking the silence: Brain-computer interfaces (bci) for communica-
tion and motor control. Psychophysiology, 43, 2006.
[8] D Black. The theory of committees and elections. 2nd ed. London: Cambridge Uni-
versity Press, 1963.
[9] S Blain, S Kingsnorth, and T Chau. A multivariate classification algorithm for detect-
ing change in physiological state. Submitted to:, 1999.
[10] S Blain, A Mihailidis, and T Chau. Assessing the potential of electrodermal activity as
an alternative access pathway. Med Eng Phys, 30, 2008.
[11] I. Bloch. Information combination operators for data fusion: A comparative review
with classification. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE
Transactions on, 26(1):52–67, 2002.
[12] J Blumberg, J Rickert, S Waldert, A Schulze-Bonhage, A Aertsen, and C Mehring. Adap-
tive classification for brain-computer interfaces. Proceedings - IEEE Engineering in
Medicine and Biology Society, 1, 2007.
[13] W Boucsein. Electrodermal activity. New York: Plenum Press, 1992.
[14] L Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[15] L. Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.
[16] L Bruzzone and R Cossu. A multiple-cascade-classifier system for a robust and par-
tially unsupervised updating of land-cover maps. IEEE Trans. Geoscience and Remote
Sensing, 2002.
[17] J Cabrera. On the impact of fusion strategies on classification errors for large ensem-
bles of classifiers. Pattern Recognition, 39, 2006.
[18] J Cacioppo and L Tassinary. Inferring psychological significance from physiological
signals. American Psychologist, 45, 1990.
[19] J Cao, M Ahmadi, and M Shridhar. Recognition of handwritten numerals with multi-
ple feature and multistage classifier. Pattern Recognition, 28(2):153–160, 1995.
[20] D Chen and X Cheng. An asymptotic analysis of some expert fusion methods. Pattern
Recognition Letters, 22, 2001.
[21] H. Chen and X. Yao. Regularized negative correlation learning for neural network
ensembles. Neural Networks, IEEE Transactions on, 20(12):1962–1979, 2009.
[22] S Cho and J Kim. Combining multiple neural networks by fuzzy integral for robust
classification. IEEE Transactions on Systems, Man, and Cybernetics, 25(2):380–384,
1995.
[23] S Cho and J Kim. Multiple network fusion using fuzzy logic. IEEE Trans. Neural Net-
works, 6(2):497–501, 1995.
[24] P Chou. Optimal partitioning for classification and regression trees. IEEE Trans. Pat-
tern Analysis and Machine Intelligence, 13(4):340–354, 1991.
[25] G Cichero and B Merdoch. The physiologic cause of swallowing sounds: answers from
heart sounds and vocal tract acoustics. Dysphagia, 13(1):39–52, 1998.
[26] T Ciszkowski, I Dunajewski, and Z Kotulski. Reputation as optimality measure in wire-
less sensor network-based monitoring systems. Probabilistic Engineering Mechanics,
26(1):67–75, 2011.
[27] S Damouras, E Sejdic, C Steele, and T Chau. An on-line swallow detection algorithm
based on the quadratic variation of dual-axis accelerometry. IEEE Transactions on
Signal Processing, 58(6):3352–3359, 2010.
[28] A Das, NP Reddy, and J Narayanan. Hybrid fuzzy logic committee neural networks
for recognition of swallow acceleration signals. Computer Methods and Programs in
Biomedicine, 64:87–99, 2001.
[29] A Demiriz, K Bennett, and J Shawe-Taylor. Linear programming boosting via column
generation. Kluwer Machine Learning, 46, 2002.
[30] A Dempster, N Laird, and D Rubin. Maximum likelihood from incomplete data via
the (em) algorithm. J. Royal Statistical Soc, 39, 1977.
[31] T Deselaers, G Heigold, and H Ney. Object classification by fusing svms and gaussian
mixtures. Pattern Recognition, 43(7), 2010.
[32] R Ding and J Logemann. Pneumonia in stroke patients: a retrospective study. Dys-
phagia, 15:51–57, 2010.
[33] R Duda, P Hart, and D Stork. Pattern Classification. Wiley-Interscience, 2nd edition,
2000.
[34] R Duda, P Hart, and D Stork. Pattern classification. Wiley-Interscience, 2001.
[35] B Efron and R Tibshirani. An Introduction to the Bootstrap. CRC Press, Boca Raton,
FL, 1994.
[36] E Elaad and G Shakhar. Effects of mental countermeasures on psychophysiological
detection in the guilty knowledge test. Int J Psychophysiol, 11, 1991.
[37] T Ferguson. Asymptotic joint distribution of sample mean and a sample quantile.
Technical Report: UCLA, 1999.
[38] Y Freund. An adaptive version of the boost by majority algorithm. Machine Learning,
3(43):293–318, 2001.
[39] Y Freund and R Schapire. A decision-theoretic generalization of on-line learning and
an application to boosting. EuroCOLT 95 - Proceedings of the Second European Con-
ference on Computational Learning Theory, 1995.
[40] Y Freund and R Schapire. Experiments with a new boosting algorithm. In Proc. 13th
Int’l Conf. Machine Learning, 1996.
[41] J Friedman, T Hastie, and R Tibshirani. Additive logistic regression: a statistical view
of boosting. Annals of Statistics, 28(2):337–407, 2000.
[42] J Gan. Self-adapting bci based on unsupervised learning. Proc. 3rd Int. Workshop
Brain-Comput. Interfaces, 2006.
[43] A Gangardiwala and R Polikar. Dynamically weighted majority voting for incremental
learning and comparison of three boosting based approaches. Proceedings of Inter-
national Joint Conference on Neural Networks, 2005.
[44] A Van Halteren, D Konstantas, R Bults, K Wac, N Dokovski, G Koprinkov, V Jones, and
I Widya. Mobihealth: Ambulant patient monitoring over next generation public wire-
less networks. Demiris G., ed. e-Health: Current Status and Future Trends, 2004.
[45] F Hanna, SM Molfenter, RE Cliffe, T Chau, and CM Steele. Anthropometric and de-
mographic correlates of dual-axis swallowing accelerometry signal characteristics: a
canonical correlation analysis. Dysphagia, 25(2):94–103, 2010.
[46] B Hasan and J Gan. Unsupervised adaptive gmm for bci. 4th International IEEE EMBS
Conference on Neural Engineering, 2009.
[47] S Hashem and B Schmeiser. Improving model accuracy using optimal linear combi-
nations of trained neural networks. IEEE Trans. Neural Networks, 6(3):792–794, 1991.
[48] S Haykin. Neural Networks. Macmillan College Publishing Company, New York, 1994.
[49] T Hettmansperger. Statistical inference based on rank. New York- Wiley, 1984.
[50] T Ho. Random decision forests. Third Int'l Conf. Document Analysis and Recognition,
Montreal, Canada, 1995.
[51] T Ho, J Hull, and S Srihari. Decision combination in multiple classifier systems. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(1):66–75, 1994.
[52] C Huang and J Wang. Multi-weighted majority voting algorithm on support vector
machine and its application. TENCON 2009-IEEE Region 10 Conference, 2009.
[53] J Hull, N Srihari, E Cohen, C Kuan, P Cullen, and P Palumbo. A blackboard-based
approach to handwritten zip code recognition. In Procedings of the US Postal Service
Advanced Technology Conference, pages 1018–1032, Washington, DC, 1988.
[54] R Jacob, M Jordan, S Nowlan, and G Hinton. Adaptive mixtures of local experts. Neural
Computation, 3(5):79–87, 1991.
[55] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, and G.E. Hinton. Adaptive mixtures of local
experts. Neural computation, 3(1):79–87, 1991.
[56] A Jain, R Duin, and J Mao. Statistical pattern recognition: A review. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 22(1):4–37, 2000.
[57] M Jordan and R Jacobs. Hierarchical mixtures of experts and the em algorithm. Neural
Computation, 6, 1994.
[58] A Josang and J Haller. Dirichlet reputation systems. Proc. 2nd Int. Conf. Availability,
Reliability Security, 2007.
[59] C Katsis, G Ganiatsas, and D Fotiadis. An integrated telemedicine platform for the
assessment of affective physiological states. Diagn Pathol, 1, 2006.
[60] Y Kim and GH McCullough. Maximal hyoid displacement in normal swallowing. Dys-
phagia, 23(3):274–279, 2008.
[61] A Kistler, C Mariauzouls, and K von Berlepsch. Fingertip temperature as an indicator
for sympathetic responses. Int. Journal of Psychophysiol, 29, 1998.
[62] J Kittler, M Hatef, R Duin, and J Matas. On combining classifiers. IEEE Trans. Pattern
Analysis and Machine Intelligence, 20(3):226–239, 1998.
[63] J Kolter and M Maloof. Dynamic weighted majority: A new ensemble method for
tracking concept drift. Proceedings of the Third IEEE International Conference on Data
Mining, USA, 2003.
[64] L Kuncheva. A theoretical study on six classifier fusion strategies. IEEE Trans. Pattern
Analysis and Machine Intelligence, 24(2):281–286, 2002.
[65] L Kuncheva. Combining Pattern Classifiers. Wiley-Interscience, 2004.
[66] L Kuncheva, C Whitaker, C Shipp, and R Duin. Limits on the majority vote accuracy
in classifier fusion. Pattern Analysis and Applications, 6, 2003.
[67] L.I. Kuncheva. A theoretical study on six classifier fusion strategies. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 24(2):281–286, 2002.
[68] A Kushki, A Andrews, G King, and T Chau. Signals of the peripheral nervous system:
Viable means for classification of activity engagement in individuals with severe
physical disabilities. ???, 2011.
[69] L Lam and C Suen. Application of majority voting to pattern recognition: An analysis
of its behavior and performance. IEEE Transactions on Systems, Man, and Cybernetics,
27(5):553–568, 1997.
[70] D Lee and S Srihari. Handwritten digit recognition: A comparison of algorithms. In
3rd Intl. Workshop Frontiers Handwriting Recognition, pages 153–162, 1993.
[71] J Lee, S Blain, M Casas, G Berall, D Kenny, and T Chau. A radial basis classifier for the
automatic detection of aspiration in children with dysphagia. Journal of Neuroengineering
and Rehabilitation, 3(14):1–17, 2006.
[72] J Lee, E Sejdic, CM Steele, and T Chau. Effects of liquid stimuli on dual-axis swallow-
ing accelerometry signals in a healthy population. Biomedical Engineering OnLine,
9(7):10 pp., 2010.
[73] J Lee, C Steele, and T Chau. Time and time-frequency characterization of dual-axis
swallowing accelerometry signals. Physiological Measurement, 29(9):1105–1120, 2008.
[74] J Lee, C Steele, and T Chau. Classification of healthy and abnormal swallows based on
accelerometry and nasal airflow signals. Artificial Intelligence in Medicine, xx(yy):zz,
Under review.
[75] J Lee, CM Steele, and T Chau. Swallow segmentation with artificial neural networks
and multi-sensor fusion. Medical Engineering & Physics, 31(9):1049–1055, 2009.
[76] E Lehmann. Elements of large sample theory. Springer, 1998.
[77] A Lempel and J Ziv. On the complexity of finite sequences. IEEE Transactions on
Information Theory, 22:75–81, 1976.
[78] T Lin, H Li, and K Tsai. Implementing the Fisher’s discriminant ratio in a k-means
clustering algorithm for feature selection and data set trimming. Journal of Chemical
Information and Computer Sciences, 44(1):76–87, 2004.
[79] X Lin, S Yacoub, J Burns, and S Simske. Performance analysis of pattern classifier
combination by plurality voting. Pattern Recognition Letters, 2003.
[80] G Liu, G Huang, J Meng, D Zhang, and X Zhu. Unsupervised adaptation based on
fuzzy c-means for brain-computer interface. 1st Int. Conf. on Information Science
and Engineering, 2009.
[81] J Logemann. Evaluation and treatment of swallowing disorders. Pro-Ed, Austin, TX,
1997.
[82] P Long and R Servedio. Random classification noise defeats all convex potential
boosters. Machine Learning, 78(3):287–304, 2010.
[83] L Mason, J Baxter, P Bartlett, and M Frean. Boosting algorithms as gradient descent.
Advances in Neural Information Processing Systems, 12, 2000.
[84] R Meddis. Statistics using ranks: A unified approach. Basil Blackwell, Philadelphia,
1984.
[85] A Miller. The neuroscientific principles of swallowing and dysphagia. Singular Pub-
lishing Group, San Diego, 1999.
[86] C Nadal, R Legault, and C Suen. Complementary algorithms for the recognition of to-
tally unconstrained handwritten numerals. 10th Int. Conf. Pattern Recognition, A:434–
449, 1990.
[87] B Nhan and T Chau. Classifying affective states using thermal infrared imaging of the
human face. IEEE Transactions on Biomedical Engineering, 57(4):979–987, 2010.
[88] M Nikjoo, A Kushki, J Lee, C Steele, and T Chau. Reputation-based neural network
combinations. In K. Suzuki (editor), Artificial Neural Networks - Methodological Ad-
vances and Biomedical Applications. InTech Publishing, 2011.
[89] M Nikjoo, A Kushki, CM Steele, and T Chau. Reputation-based neural network
combinations. In Neural Networks. InTech Publishers, Rijeka, Croatia, 2011.
[90] M Nikjoo, C Steele, and T Chau. Biomedical Engineering Online, 2010.
[91] M Nikjoo, C Steele, E Sejdic, and T Chau. Automatic discrimination between safe
and unsafe swallowing using a reputation-based classifier. Biomedical Engineering
Online, 2011.
[92] I Orovic, S Stankovic, T Chau, CM Steele, and E Sejdic. Time-frequency analysis and
Hermite projection method applied to swallowing accelerometry signals. EURASIP
Journal on Advances in Signal Processing, 2010(article ID 323125):7 pages, 2010.
[93] R Picard, E Vyzas, and J Healey. Toward machine emotional intelligence: analysis of
affective physiological state. IEEE Trans. Pattern Analysis and Machine Intelligence,
23, 2001.
[94] A Porta, S Guzzetti, N Montano, R Furlan, M Pagani, A Malliani, and S Cerutti. Entropy,
entropy rate, and pattern classification as tools to typify complexity in short heart
period variability series. IEEE Transactions on Biomedical Engineering, 48(11):1282–
1291, 2001.
[95] S Power, T Falk, and T Chau. Classification of prefrontal activity due to mental
arithmetic and music imagery using hidden Markov models and frequency domain
near-infrared spectroscopy. Journal of Neural Engineering, 7(2), 2010.
[96] K Prkachin, R Williams-Avery, C Zwaal, and D Mills. Cardiovascular changes during
induced emotion. Journal of Psychosomatic Research, 47, 1999.
[97] J Quinlan. C4.5: Programs for machine learning. Morgan Kaufmann, 1993.
[98] NP Reddy, A Katakam, V Gupta, R Unnikrishnan, J Narayanan, and EP Canilang.
Measurements of acceleration during videofluoroscopic evaluation of dysphagic
patients. Medical Engineering & Physics, 22(6):405–412, 2000.
[99] JC Rosenbek, J Robbins, EV Roecker, JL Coyle, and JL Woods. A penetration-aspiration
scale. Dysphagia, 11(2):93–98, 1996.
[100] D Ruta and B Gabrys. A theoretical analysis of the limits of majority voting errors for
multiple classifier systems. Technical Report, Department of Computing and Informa-
tion Systems, University of Paisley, 2000.
[101] R Sanchez-Reillo, C Sanchez-Avila, and A Gonzalez-Marcos. Biometric identification
through hand geometry measurements. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 22(10):1168–1171, 2000.
[102] R Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.
[103] J Schlimmer and R Granger. Beyond incremental processing: Tracking concept drift.
Proceedings of the Fifth National Conference on Artificial Intelligence, 1986.
[104] E Sejdic, TH Falk, CM Steele, and T Chau. Vocalization removal for improved auto-
matic segmentation of dual-axis swallowing accelerometry signals. Medical Engineer-
ing & Physics, 32(6):668–672, 2010.
[105] E Sejdic, CM Steele, and T Chau. A procedure for denoising dual-axis accelerometry
signals. Physiological Measurement, 31(1):N1–N9, 2010.
[106] E Sejdic, V Komisar, CM Steele, and T Chau. Baseline characteristics of dual-axis cer-
vical accelerometry signals. Annals of Biomedical Engineering, 38(3):1048–1059, 2010.
[107] E Sejdic, C Steele, and T Chau. The effects of head movement on dual-axis cervical
accelerometry signals. BMC Research Notes, ?(?):in press, 2010.
[108] E Sejdic, C Steele, and T Chau. The effects of head movement on dual-axis cervical
accelerometry signals. BMC Research Notes, 3(269):6pp, 2010.
[109] E Sejdic, CM Steele, and T Chau. Segmentation and time duration analysis of dual-
axis swallowing accelerometry signals in healthy subjects. IEEE Transactions on
Biomedical Engineering, 56(4):1090–1097, 2009.
[110] L Senhadji, G Carrault, H Gauvrit, E Wodey, P Pladys, and F Carre. Pediatric anesthesia
monitoring with the help of EEG and ECG. Acta Biotheoretica, 48, 2000.
[111] G Shafer. A mathematical theory of evidence. Princeton University Press, 1977.
[112] D Shah, M Mackay, S Lavery, S Watson, A Harvey, J Zempel, A Mathur, and T Inder.
Combining multiple classifiers with dynamic weighted voting. Pediatrics, 2008.
[113] P Shenoy, M Krauledat, B Blankertz, R Rao, and K Müller. Towards adaptive
classification for BCI. Journal of Neural Engineering, 3(1), 2006.
[114] CM Steele, E Sejdic, and T Chau. Noninvasive detection of thin-liquid aspiration us-
ing dual-axis swallowing accelerometry. American Journal of Physical Medicine and
Rehabilitation, Under Review(xx):yy, 2011.
[115] C Stefano, A Cioppa, and A Marcelli. An adaptive weighted majority vote rule for com-
bining multiple classifiers. In 16th International Conference on Pattern Recognition,
pages 192–195, Quebec, Canada, 2002. IEEE.
[116] H Stern. Improving on the mixture of experts algorithm. 2003.
[117] C Suen, C Nadal, T Mai, R Legault, and L Lam. Recognition of totally unconstrained
handwritten numerals based on the concept of multiple experts. In C. Suen, editor,
Proceedings of the International Workshop on Frontiers in Handwriting Recognition,
pages 131–143, 1990.
[118] S Sun. Local within-class accuracies for weighting individual outputs in multiple clas-
sifier systems. Pattern Recognition Letters, 2010.
[119] A Tabaee, P Johnson, C Gartner, K Kalwerisky, R Desloge, and M Stewart. Patient-
controlled comparison of flexible endoscopic evaluation of swallowing with sensory
testing (FEESST). The Laryngoscope, 116:821–825, 2006.
[120] K Tai and T Chau. Single-trial classification of NIRS signals during emotional
induction tasks: towards a corporeal machine interface. Journal of Neuroengineering
and Rehabilitation, 6(39):14pp, 2010.
[121] A Tajeddine, A Kayssi, A Chehab, and H Artail. Fuzzy reputation-based trust model.
Applied Soft Computing, 11(1):345–355, 2011.
[122] R Tibshirani, T Hastie, B Narasimhan, S Soltys, G Shi, A Koong, and Q Le. Sample
classification from protein mass spectrometry, by peak probability contrasts. Bioin-
formatics, 20(17), 2004.
[123] V Tresp and M Taniguchi. Combining estimators using non-constant weighting func-
tions. In G. Tesauro, D.S. Touretzky, and T.K. Leen, editors, Advances in Neural
Information Processing Systems, volume 7, pages 418–435. MIT Press, Cambridge, MA,
1995.
[124] YM Tseng and FG Chen. A free-rider aware reputation system for peer-to-peer file-
sharing networks. Expert Systems with Applications, 38(3):2432–2440, 2011.
[125] G Turpin, F Schaefer, and W Boucsein. Effects of stimulus intensity, risetime, and du-
ration on autonomic and behavioral responding: Implications for the differentiation
of orienting, startle, and defense responses. Psychophysiology, 36, 1999.
[126] N. Ueda. Optimal linear combination of neural networks for improving classification
performance. IEEE Transactions on Pattern Analysis and Machine Intelligence,
22(2):207–215, 2000.
[127] R Valdovinos and J Sanchez. Combining multiple classifiers with dynamic weighted
voting. Proceedings of the 4th International Conference on Hybrid Artificial Intelligence
Systems, 2009.
[128] C Vidaurre, M Kawanabe, P von Bunau, B Blankertz, and K Muller. Toward unsu-
pervised adaptation of LDA for brain-computer interfaces. IEEE Transactions on
Biomedical Engineering, 58(3):587–597, 2011.
[129] P Viola and M Jones. Fast and robust classification using asymmetric adaboost and a
detector cascade. Advances in Neural Information Processing Systems, 2002.
[130] P Walley. Statistical reasoning with imprecise probabilities. Chapman and Hall, Lon-
don, 1991.
[131] P Wang. A defect in Dempster-Shafer theory. Proceedings of the 10th Conference on
Uncertainty in Artificial Intelligence, 1994.
[132] X Wang, L Ding, and D Bi. Dirichlet reputation systems. IEEE Trans. Instrumentation
and Measurement, 59(1):171–179, 2010.
[133] D Wolpert. Stacked generalization. Neural Networks, 5, 1992.
[134] M Wozniak. Combining pattern recognition algorithms: chances and limits. 2nd
International Conference on Computer Engineering and Technology, Chengdu, Sichuan,
China, 2010.
[135] L Xu, A Krzyzak, and CY Suen. Methods of combining multiple classifiers and their
applications to handwriting recognition. IEEE Transactions on Systems, Man and
Cybernetics, 22(3):418–435, 1992.
[136] X. Yao and M.M. Islam. Evolving artificial neural network ensembles. IEEE Computa-
tional Intelligence Magazine, 3(1):31–42, 2008.
[137] L Zadeh. A simple view of the Dempster-Shafer theory of evidence and its implication
for the rule of combination. The AI Magazine, 7(2):85–90, 1986.
[138] D Zhang, B Liu, C Sun, and X Wang. Learning the classifier combination for image
classification. Journal of Computers, 6(8), 2011.
[139] DCB Zoratto, T Chau, and CM Steele. Hyolaryngeal excursion as the physiological
source of swallowing accelerometry signals. Physiological Measurement, 31(6):843–
855, 2010.