
Expert Systems with Applications 38 (2011) 7708–7717


A method for calculation of optimum data size and bin size of histogram features in fault diagnosis of mono-block centrifugal pump

V. Indira a,*, R. Vasanthakumari b, N.R. Sakthivel c, V. Sugumaran d

a Department of Mathematics, Sri Manakula Vinayagar Engineering College, Madagadipet, Puducherry, India
b Department of Mathematics, Kasthurba College for Women, Villianur, Puducherry, India
c Department of Mechanical Engineering, Amrita School of Engineering, Ettimadai, Coimbatore, India
d Department of Mechanical Engineering, SRM University, Kattankulathur, Kanchepuram Dt., India

Article info

Keywords: Centrifugal pump; Fault diagnosis; Histogram features; Machine learning; Minimum sample size; Number of bins; Power analysis; Vibration signals

0957-4174/$ - see front matter © 2011 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2010.12.140

* Corresponding author.
E-mail addresses: [email protected] (V. Indira), vasunthara1@gmail.com (R. Vasanthakumari), [email protected] (N.R. Sakthivel), [email protected] (V. Sugumaran).

Abstract

Mono-block centrifugal pumps play a key role in various applications, and any deviation in their function leads to monetary loss. It is therefore essential to avoid the economic loss caused by pump malfunction; the fault diagnosis and condition monitoring of pumps are issues that cannot be ignored. Over the past 25 years, much research has focused on vibration-based techniques, and the machine learning approach is one of the most widely used techniques applying vibration signals to fault diagnosis. The machine learning approach comprises a set of connected activities, namely data acquisition, feature extraction, feature selection, and feature classification; training and testing the classifier are the two important activities in feature classification. When histogram features are used to represent the vibration signals, no proper guideline has been proposed so far for choosing the number of bins and the number of samples required to train the classifier. This paper illustrates a systematic method to choose the number of bins and the minimum number of samples required to train the classifier with statistical stability so as to obtain the best classification accuracy. In this study, the power analysis method was employed to find the minimum number of samples required, and a decision tree algorithm, J48, was used to validate the results of the power analysis and to find the optimum number of bins.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Mono-block centrifugal pumps play an important role in industry and are key elements in food processing, waste water treatment plants, agriculture, the oil and gas industry, the paper and pulp industry, etc. The development of various faults in centrifugal pumps causes severe problems such as abnormal noise, leakage, and high vibration. Hence condition monitoring and fault diagnosis have become essential for centrifugal pump maintenance, and much research has been devoted to the identification of faults in centrifugal pumps. Over the past ten years, the vibration-based machine learning approach has drawn considerable attention in the field of fault diagnosis.

In this paper, only vibration signals of the good condition and five faulty conditions were considered for fault diagnosis of the centrifugal pump. The faults considered in the present study were cavitation, bearing fault, impeller fault, seal fault, and bearing and impeller fault together. The characterization of these signals was achieved by the machine learning approach, in which the two important activities are training and testing the classifier. To model the centrifugal pump fault diagnosis problem as a machine learning problem, a large number of vibration signals is required for each pump condition considered in the study. It is possible to acquire any number of vibration signals for the good pump condition; however, it is very difficult to acquire signals of faulty centrifugal pumps of the same type with one specific fault alone. Ideally, the signals of a centrifugal pump with a specific fault should be taken from pumps in which the fault occurred naturally during operation. The difficulties involved in doing so force the fault diagnosis engineer to compromise. Taking many vibration signals from one specimen having a typical intended fault is one level of compromise in practice, for example, taking the required number of vibration signals from a centrifugal pump having a seal fault alone. Another level of compromise is taking vibration signals from a centrifugal pump in which the required type of fault has been simulated (Sakthivel, Sugumaran, & Nair Binoy, 2010). To overcome these problems, one should know how many samples must be used for training to get good classification accuracy. As will be seen in Section 6, if it is known that good


Table 1
Mono-block centrifugal pump specification.

Speed: 2880 rpm    Pump size: 50 mm × 50 mm
Current: 11.5 A    Discharge: 392 litre per second
Head: 20 m         Power: 2 HP


classification accuracy can be obtained by training only three samples per class, the vibration signals of faulty centrifugal pumps need not be taken from pumps in which the fault was simulated. Indeed, the signals could be acquired from centrifugal pumps in which the fault has occurred naturally, and any results obtained from these signals would be more practical and realistic. Hence, knowledge of the optimum number of samples required for building a model or training a classifier is essential, and a study on the determination of minimum sample size is highly desirable. Histogram analysis of vibration signals yields different parameters, which could be used for classifying the different conditions of the pump. If histograms are plotted using the amplitude of the vibration signals, they look dissimilar for different classes. A drastic change in the height of the corresponding bins of different classes can be observed only when the amplitude range is divided into a certain number of bins; in other words, the separability of the classes in the histogram plot is greatest for a particular number of bins. Thus, it becomes necessary to find the right number of bins to obtain the best classification accuracy.

In machine learning, a model built with a large sample size is robust. During implementation, it becomes necessary to answer the question 'how large should the sample size be to build a robust classifier?' As it is difficult to obtain a large number of samples, the more appropriate question is 'what is the minimum number of samples required to build a classifier with good prediction accuracy?' To answer this question, many researchers have taken different approaches.

Many works on minimum sample size determination have been reported in the fields of bioinformatics and clinical studies; to name a few, microarray data (Hwang, Schmitt, Stephanopoulos, & Stephanopoulos, 2002), cDNA arrays (Schena, Shalon, Davis, & Brown, 1995), and transcription levels (Lockhart, 1996). Based on these works, data-driven hypotheses could be developed which in turn further vibration signal analysis research. Unfortunately, not much work has been reported on finding the optimum number of samples required for training classifiers using vibration signals. This might be because acquiring vibration signals is relatively easy compared to clinical data, so the study of minimum sample size may look insignificant. Also, no appropriate guideline has been proposed so far for choosing the minimum sample size of vibration signals for fault diagnosis using the machine learning approach. Hence, one has to resort to rules of thumb which lack mathematical reasoning, or blindly follow some previous work as the basis for fixing the sample size. This is the fundamental motivation for taking up this study. There are a number of ways available for the determination of sample size, viz. for tests of continuous variables (Day & Graham, 1991; Fleiss, 1981; Pearson & Hartley, 1970), for tests of proportions (Casagrande, Pike, & Smith, 1978; Feigl, 1978; Gordon & Watson, 1996; Haseman, 1978; Lakatos & Lan, 1992; Lemeshow, Hosmer, & Klar, 1988; Lubin & Gail, 1990; O'Neill, 1984; Roebruck & Kuhn, 1995; Thomas & Conlon, 1992; Whitehead, 1993), for time-to-event (survival) data (Hanley & McNeil, 1982; Schoenfeld & Richter, 1982), for receiver operating characteristic (ROC) analysis (Obuchowski, 1994; Obuchowski & McClish, 1997; Whittemore, 1981), for logistic and Poisson regression (Bull, 1993; Flack & Eudey, 1993; Hsieh, 1989; Lui & Cumberland, 1992; Signorini, 1991), for repeated measurements (Greenland, 1988; Lipsitz & Fitzmaurice, 1994), for precision (Beal, 1989; Buderer, 1996; Dupont, 1988; Samuels & Lu, 1992; Satten & Kupper, 1990; Streiner, 1994), for paired samples (Donner & Eliasziw, 1992; Lachenbruch, 1992; Lachin, 1992; Lu & Bean, 1995; Nam, 1992; Nam, 1997; Parker & Bregman, 1986; Royston, 1993), for measurement of agreement (Birkett & Day, 1994), and for power (Faul, Erdfelder, Lang, & Buchner, 2007). Studies have also been carried out on issues surrounding variance estimation, sample size re-estimation based on interim data (Browne, 1995; Gould, 1995; Shih & Zhao, 1997), studies with planned interim analyses (Geller & Pocock, 1987; Kim & DeMets, 1992; Lewis, 1993; O'Brien & Fleming, 1979; Pocock, 1977; Whitehead, 1992), and ethical issues (Lantos, 1993). However, certain issues must be addressed in the implementation of such techniques to achieve better statistical stability.

In the machine learning approach, the vibration signals are typically subjected to analyses such as hypothesis testing, classification (Sugumaran, Sabareesh, & Ramachandran, 2008), regression, and clustering, which rely on statistical parameters to draw conclusions (Alfayez, Mba, & Dyson, 2005; Guo-hua, Yong-zhong, Yu, & Huang, 2007; Kavuri & Venkatasubramanian, 1993b; Konga & Chen, 2004; Rengaswamy & Venkatasubramanian, 2000; Vaidyanathan & Venkatasubramanian, 1992; Wang & Hu, 2006; Wang & McFadden, 1993a, 1993b; Widodo & Yang, 2007). However, these parameters cannot be reliably estimated from only a small number of vibration signals. Since the statistical stability of the conclusions largely depends on the accuracy of the parameters used, a certain minimum number of vibration signals is required to ensure confidence in the sample distribution and the accuracy of the parameter values.

The objective of this paper is to determine the minimum number of samples required to separate the classes with statistical stability using F-test based statistical power analysis. The methodology is illustrated with a typical centrifugal pump fault diagnosis case study with six conditions (classes). The minimum sample size and the optimum number of bins are also determined using an entropy-based algorithm called 'J48'. The results of the power analysis are compared with those of the J48 algorithm, and sample size guidelines for centrifugal pump fault diagnosis are presented in the concluding section.

2. Experimental studies

A 2 HP motor was used to drive the pump. An accelerometer was used along with a data acquisition system to acquire the data; a piezoelectric accelerometer and its accessories form the core equipment for vibration measurement and recording. The vibration signals were acquired from the mono-block centrifugal pump working under the normal condition (good) and the five faulty conditions considered for the study, at a constant speed of 2880 rpm. The specification of the mono-block centrifugal pump is shown in Table 1. The sampling frequency used in the study was 24,000 Hz, and each signal (sample) has a length of 1024 data points. For each condition of the centrifugal pump, 250 samples were taken. A randomly selected signal for each condition is given in Fig. 1.

3. Histogram features

A difference in the range of amplitude for the different classes can be seen when the magnitude of the signals is measured in the time domain. One of the best ways to show this variation in vibration amplitude is a histogram plot. The histogram plot provides valuable information for classification, and this information serves as the features for fault diagnosis of the centrifugal pump.

Page 3: A Method for Calculation of Optimum Data Size and Bin Size of Histogram Features in Fault Diagnosis of Mono-block Centrifugal Pump

Fig. 1. Vibration signals of the pump for good and faulty conditions.

[Figure: classification accuracy (%), y-axis 0–120, plotted against the number of bins, x-axis 0–60.]

Fig. 2. Number of bins as a function of classification accuracy.


3.1. Selection of number of bins

There are two important things to consider in the selection of bins, namely the range of the bins and the width of the bins. The steps involved in choosing the bin range are as follows.

(i) Find the minimum amplitude of each signal in all six classes.
(ii) Calculate the minimum of the values obtained in step (i).
(iii) Find the maximum amplitude of each signal in all six classes.
(iv) Calculate the maximum of the values obtained in step (iii).

Thus, the bin range should run from the minimum of the minimum amplitudes (−0.72944) to the maximum of the maximum amplitudes (1.036422) over all six classes.
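Expressed in code, steps (i)–(iv) above might look like the following sketch. The signals here are synthetic stand-ins for the paper's vibration data; only the class count and signal length (six classes, 1024 points) follow the study.

```python
# Sketch of bin-range steps (i)-(iv): min of per-signal minima and
# max of per-signal maxima over all classes. Signals are synthetic.
import numpy as np

rng = np.random.default_rng(0)
# signals[c] holds several 1024-point signals for class c (synthetic)
signals = {c: [rng.normal(0, 0.1 * (c + 1), 1024) for _ in range(5)]
           for c in range(6)}

# (i)-(ii): minimum of the per-signal minima over all six classes
lo = min(s.min() for sigs in signals.values() for s in sigs)
# (iii)-(iv): maximum of the per-signal maxima over all six classes
hi = max(s.max() for sigs in signals.values() for s in sigs)

bin_range = (lo, hi)          # histogram range shared by every signal
n_bins = 37                   # the count chosen in Section 3.1
bin_width = (hi - lo) / n_bins
```

On the paper's data this range is (−0.72944, 1.036422), giving the bin width of 0.047726 quoted below for 37 bins.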

As discussed in Section 1, the number of bins must be chosen for a fault diagnosis problem. This number was obtained by carrying out a chain of experiments with a decision tree algorithm (J48) using different numbers of bins. At first, the bin range was divided into two equal parts, i.e., two bins were used; the two histogram features f1 and f2 were extracted, and the corresponding classification accuracy was obtained using the J48 algorithm. The method and procedure for doing so with the J48 algorithm are explained in Section 5. A set of similar experiments was carried out with the number of bins varying from 3 to 50, and the corresponding results are shown in Fig. 2. Upon careful observation of the experimental results and the corresponding graph (see Fig. 2), the best classification accuracy of 100% was obtained when the number of bins was 37. In the present study, the number of bins was therefore chosen as 37, with a bin width of 0.047726. It is also noted from the graph that the oscillation in classification accuracy reduces drastically and tends to stabilize when the number of bins exceeds 14; hence one could also choose any number of bins greater than 14. In this study, the best classification accuracy criterion was used.
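The bin sweep described above can be sketched as follows. scikit-learn's DecisionTreeClassifier with the entropy criterion stands in for Weka's J48 (both are C4.5-style trees), and the signals are synthetic, so the accuracies are illustrative rather than the paper's.

```python
# Sketch of the Section 3.1 bin sweep: for each bin count, extract
# histogram features and score a decision tree with 10-fold CV.
# DecisionTreeClassifier(criterion="entropy") is a stand-in for J48.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_per_class, n_classes, n_points = 30, 6, 1024
X_raw = np.vstack([rng.normal(0, 0.1 * (c + 1), (n_per_class, n_points))
                   for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_per_class)
lo, hi = X_raw.min(), X_raw.max()      # shared bin range (Section 3.1)

def histogram_features(X_raw, n_bins):
    """Bin counts f1..f_n_bins for each signal over the common range."""
    return np.array([np.histogram(x, bins=n_bins, range=(lo, hi))[0]
                     for x in X_raw])

accuracy = {}
for n_bins in range(2, 51):            # the paper sweeps 2..50 bins
    X = histogram_features(X_raw, n_bins)
    clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
    accuracy[n_bins] = cross_val_score(clf, X, y, cv=10).mean()

best_bins = max(accuracy, key=accuracy.get)
```

Plotting `accuracy` against the bin count reproduces the shape of Fig. 2: unstable for very few bins, stabilizing as the count grows.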

3.2. Feature extraction and feature selection

In this paper, the number of bins chosen was 37. A set of thirty-seven measures f1, f2, ..., f37, called histogram features, was extracted from the vibration signals. For further study, these histogram features were used instead of the vibration signals directly. The less contributing features were eliminated to reduce the dimension of the feature set, and only the relevant features were considered further. The process of selecting relevant features is known as feature selection, and it was carried out using the J48 algorithm; the theory, methodology, and implementation of the J48 algorithm are discussed in detail in Sugumaran, Muralidharan, and Ramachandran (2007). The best features in the present study were found to be f11, f14, f24, f25 and f29. These five features were used as the representative of the signals for further study. The selected set of five histogram features, along with their class labels, forms the data set for determining the minimum sample size using power analysis.
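A sketch of this extraction and selection step is given below. Ranking by a fitted tree's feature importances is a stand-in for the paper's J48-based selection, and the indices it returns on synthetic data are illustrative, not the paper's f11, f14, f24, f25, f29.

```python
# Sketch of Section 3.2: extract 37 histogram features per signal,
# then keep the five most important according to a decision tree.
# (Tree feature importances stand in for the J48-based selection.)
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
n_per_class, n_classes = 30, 6
X_raw = np.vstack([rng.normal(0, 0.1 * (c + 1), (n_per_class, 1024))
                   for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_per_class)
lo, hi = X_raw.min(), X_raw.max()

# f1..f37: bin counts over the common amplitude range
X = np.array([np.histogram(x, bins=37, range=(lo, hi))[0] for x in X_raw])

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
top5 = np.argsort(tree.feature_importances_)[::-1][:5]   # keep 5 features
X_selected = X[:, top5]
```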

4. Determination of minimum sample size

Determining the minimum sample size of vibration signals needed to build a classifier is an important issue, as discussed earlier, and is one of the first things to consider when attempting the classification of machine conditions through vibration signals. The method of performing power analysis is discussed in Section 4.1. The results of the power analysis were verified with a functional test using the J48 algorithm: with J48 as the classifier under 10-fold cross validation, the number of samples was decreased from 250 per class to three per class. The results are presented and discussed in Section 6.
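The functional test described above can be sketched as follows: train the tree with progressively fewer samples per class and watch the cross-validated accuracy. With fewer than 10 samples per class, 10-fold stratified cross validation is impossible, so the fold count is capped at the class size here; that cap is my assumption, not a detail given in the paper, and the data are synthetic.

```python
# Sketch of the functional test: shrink the per-class sample count
# from many down to three and record cross-validated tree accuracy.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n_classes, n_full = 6, 50            # 50 synthetic samples per class
X = np.vstack([rng.normal(c, 0.5, (n_full, 5)) for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_full)

results = {}
for m in (50, 25, 10, 5, 3):         # samples per class used
    idx = np.concatenate([np.where(y == c)[0][:m] for c in range(n_classes)])
    clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
    folds = min(10, m)               # cap folds by class size (assumption)
    results[m] = cross_val_score(clf, X[idx], y[idx], cv=folds).mean()
```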

4.1. Power analysis

Performing power analysis and sample size estimation is an important aspect of experimental design, because without these calculations the sample size may be too high or too low. If the sample size is too low, the experiment will lack the precision to provide reliable answers to the questions under investigation; if it is too large, time and resources will be wasted, often for minimal gain. Power analysis has been used in many applications (Cohen, 1988; Kraemer & Thiemann, 1987; Mace, 1974; Pillai & Mijares, 1959). It is based on two measures of statistical reliability in the hypothesis test, namely the confidence level (1 − α) and the power (1 − β). The test compares the null hypothesis (H0) against the alternative hypothesis (H1): the null hypothesis states that the means of the classes are the same, whereas the alternative hypothesis states that they are not. While the confidence level of a test is the probability of correctly accepting the null hypothesis, the power of a test is the probability of correctly accepting the alternative hypothesis (Hwang et al., 2002). Conversely, the false positive rate α (Type I error) is the probability of wrongly accepting the alternative hypothesis, while the false negative rate β (Type II error) is the probability of wrongly accepting the null hypothesis. The sample size in power analysis is estimated such that the confidence and the power (the statistical reliability measures) of the hypothesis test reach predefined values; typical analyses might require a confidence of 95% and a power of 95%.

The confidence level and the power are calculated from the distributions of the test statistic under the null and alternative hypotheses; defining these distributions depends on the statistical measure used in the hypothesis test. For a two-class problem, the method and procedure for computing the hypothesis test using the t-distribution is explained in Hwang et al. (2002). For a multi-class problem (more than two classes), instead of the t-statistic, an F-statistic derived from Pillai's V (Cohen, 1969; Olson, 1974) is used for the estimation of sample size. Pillai's V is the trace of the matrix defined by the ratio of the between-group variance (B) to the total variance (T); it is a statistical measure often used in multivariate analysis of variance (MANOVA) (Cohen, 1969). Pillai's V trace is given by

V = \mathrm{trace}(BT^{-1}) = \sum_{i=1}^{h} \frac{\lambda_i}{\lambda_i + 1},  (1)

where λ_i is the ith eigenvalue of W^{-1}B, W is the within-group variance, and h is the number of factors considered in the MANOVA, defined by h = c − 1 with c the number of classes. A high Pillai's V means a large separation between the samples of the classes, with the between-group variance being relatively large compared to the total variance. The hypothesis test can be designed as follows using the F statistic transformed from Pillai's V:

H_0: \mu_1 = \mu_2 = \cdots = \mu_c; \quad H_1: \text{there exist } i, j \text{ such that } \mu_i - \mu_j \neq 0,  (2)

H_0: F = \frac{(V/s)/(ph)}{(1 - V/s)/[s(N - c - p + s)]} \sim F(ph,\ s(N - c - p + s)),  (3)

H_1: F = \frac{(V/s)/(ph)}{(1 - V/s)/[s(N - c - p + s)]} \sim F[ph,\ s(N - c - p + s);\ \Delta = s\,\Delta_e\,N],  (4)

with \Delta_e = V_{crit}/(s - V_{crit}), where p and c are the number of variables and the number of classes, respectively, and s is defined by min(p, h). Using these distributions of H0 and H1, the confidence level and the power can be calculated for a given sample size and effect size. The method used for the two-class problem is used here to estimate the minimum sample size for statistical stability: the sample size is increased until the calculated power reaches the predefined threshold of 1 − β. There is, however, a limitation that the value of p cannot be larger than N − c + s = N − 1. This analysis might produce a misleading sample size estimate when the real data set is not consistent with the assumptions (normality and equal variance) underlying the statistic used in the power analysis. To check the effect of possible violations of these assumptions on the estimated sample size, the actual power and the mean differences between classes were compared to the predefined values. The actual values in both cases were sufficiently large that the impact of data which does not perfectly match the normality or equal variance assumptions need not be a concern.

5. J48 algorithm

Fault diagnosis can be viewed as a data mining problem in which one extracts information from the acquired data through a classification process. A predictive model for classification invokes the idea of branches and trees identified through a logical process; the classification is done through a decision tree whose leaves represent the different conditions of the pumps. The sequential branching process ending in the leaves is based on conditional probabilities associated with individual features. Any good classifier should have the following properties.

(1) It should have good predictive accuracy: the ability of the model to correctly predict the class label of new or previously unseen data.
(2) It should have good speed.
(3) The computational cost involved in generating and using the model should be as low as possible.
(4) It should be robust: robustness is the ability of the model to make correct predictions given noisy data or data with missing values.
(5) The level of understanding and insight provided by the classification model should be high.

It is reported that the C4.5 model introduced by J.R. Quinlan satisfies the above criteria, and hence it is used in the present study. The decision tree algorithm (C4.5) has two phases, building and pruning; the building phase is also called the 'growing phase'. Both are briefly discussed here.

5.1. Building phase

In the building phase, a training sample set with discrete-valued attributes is recursively partitioned until all the records in a partition have the same class. The tree has a single root node for the entire training set, and for every partition a new node is added to the decision tree. For a set of samples S in a partition, a test attribute X is selected for further partitioning the set into S1, S2, S3, ..., SL. New nodes for S1, S2, S3, ..., SL are created and added to the decision tree as children of the node for S; the node for S is labeled with the test X, and the partitions S1, S2, S3, ..., SL are recursively partitioned in turn. A partition in which all records have the same class label is not partitioned further, and its leaf is labeled with the corresponding class. The construction of the decision tree depends very much on how the test attribute X is selected. C4.5 uses an information entropy evaluation function as the selection criterion. The entropy evaluation function is arrived at through the following steps.

Step 1: Calculate Info(S) to identify the class distribution in the training set S:

\mathrm{Info}(S) = -\sum_{i=1}^{K} \frac{\mathrm{freq}(C_i, S)}{|S|} \log_2\!\left(\frac{\mathrm{freq}(C_i, S)}{|S|}\right),  (5)

where |S| is the number of cases in the training set, C_i is a class, i = 1, 2, ..., K, K is the number of classes, and freq(C_i, S) is the number of cases included in C_i.

Step 2: Calculate the expected information value Info_X(S) for test X to partition S:

\mathrm{Info}_X(S) = \sum_{i=1}^{L} \frac{|S_i|}{|S|}\, \mathrm{Info}(S_i),  (6)

where L is the number of outcomes of test X, S_i is the subset of S corresponding to the ith outcome, and |S_i| is the number of cases in subset S_i.

Step 3: Calculate the information gain after partitioning according to test X:

\mathrm{Gain}(X) = \mathrm{Info}(S) - \mathrm{Info}_X(S).  (7)

Step 4: Calculate the partition information value Split Info(X) acquired for S partitioned into L subsets:

\mathrm{Split\,Info}(X) = -\frac{1}{2} \sum_{i=1}^{L} \left[ \frac{|S_i|}{|S|} \log_2\frac{|S_i|}{|S|} + \left(1 - \frac{|S_i|}{|S|}\right) \log_2\left(1 - \frac{|S_i|}{|S|}\right) \right].  (8)

Step 5: Calculate the gain ratio of Gain(X) over Split Info(X):

\mathrm{Gain\,Ratio}(X) = \mathrm{Gain}(X) / \mathrm{Split\,Info}(X).  (9)

The Gain Ratio(X) compensates for the weak point of Gain(X), which represents the quantity of information provided by X in the training set. The attribute with the highest Gain Ratio(X) is therefore taken as the root of the decision tree.
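Steps 1–5 can be transcribed directly into code. The sketch below follows Eqs. (5)–(9) as printed above, assuming the usual C4.5 reading in which frequencies are class counts and the gain ratio is the quotient Gain(X)/Split Info(X); note that the split-information form of Eq. (8) is the paper's variant (with the 1/2 factor and complementary term), not Quinlan's plain −Σ p log₂ p.

```python
# Transcription of Eqs. (5)-(9): entropy, expected information,
# information gain, split information, and gain ratio.
import math
from collections import Counter

def _xlog2x(x):
    """x * log2(x), with the 0*log(0) = 0 convention."""
    return x * math.log2(x) if x > 0 else 0.0

def info(labels):                                    # Eq. (5)
    n = len(labels)
    return -sum(_xlog2x(cnt / n) for cnt in Counter(labels).values())

def info_x(partitions):                              # Eq. (6)
    n = sum(len(p) for p in partitions)
    return sum(len(p) / n * info(p) for p in partitions)

def gain(labels, partitions):                        # Eq. (7)
    return info(labels) - info_x(partitions)

def split_info(labels, partitions):                  # Eq. (8), paper's form
    n = len(labels)
    return -0.5 * sum(_xlog2x(len(p) / n) + _xlog2x(1 - len(p) / n)
                      for p in partitions)

def gain_ratio(labels, partitions):                  # Eq. (9)
    return gain(labels, partitions) / split_info(labels, partitions)

# Example: a perfect binary split of a two-class set
labels = ["good", "good", "fault", "fault"]
parts = [["good", "good"], ["fault", "fault"]]
# info(labels) = 1.0; the split is pure, so gain = 1.0
```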

5.2. Pruning phase

Usually a training set in the sample space leads to a decision tree that is too large to be an accurate model; this is due to over-training, or over-fitting. Such a fully grown decision tree needs to be pruned by removing the less reliable branches to obtain better classification performance over the whole instance space, even though the pruned tree may have a higher error over the training set. The C4.5 algorithm uses an error-based post-pruning strategy to deal with the over-training problem. For each classification node, C4.5 calculates a predicted error rate based on the total aggregate of misclassifications at that particular node. The error-based pruning technique essentially reduces to replacing large sub-trees in the classification structure with singleton nodes or simple branch collections, whenever these actions contribute to a drop in the overall error rate at the root node.

5.3. Application of decision tree for the problem under study

As is customary, the samples are divided into two parts: a training set and a testing set. The training set is used to train the classifier and the testing set is used to test its validity. Ten-fold cross-validation is employed to evaluate classification accuracy. The training process of C4.5 using samples with continuous-valued attributes is as follows.

(1) The tree starts as a single node representing the training samples.

(2) If the samples are all of the same class, then the node becomes a leaf and is labeled with the class.

(3) Otherwise, the algorithm discretises every attribute to select the optimal threshold and uses the entropy-based measure called information gain (discussed in Section 5.1) as the heuristic for selecting the attribute that will best separate the samples into individual classes.

(4) A branch is created for each best discrete interval of the test attribute, and the samples are partitioned accordingly.

(5) The algorithm uses the same process recursively to form a decision tree for the samples at each partition.

(6) The recursive partitioning stops only when one of the following conditions is true:
(a) All the samples for a given node belong to the same class, or
(b) There are no remaining attributes on which the samples may be further partitioned, or
(c) There are no samples for the branch test attribute. In this case, a leaf is created with the majority class in the samples.

(7) A pessimistic error pruning method (discussed in Section 5.2) is used to prune the grown tree to improve its robustness and accuracy.
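The seven steps above can be sketched as a toy recursive builder. This is not Weka's J48: it considers only a single best binary threshold per node, omits the pruning of step 7, and all names and the toy data are illustrative.

```python
import math
from collections import Counter

def entropy(ys):
    n = len(ys)
    return -sum((c / n) * math.log2(c / n) for c in Counter(ys).values())

def best_split(xs, ys):
    """Step 3: discretise each attribute by trying midpoints between sorted
    values; keep the (gain, attribute, threshold) with the highest gain."""
    base, best = entropy(ys), None
    for a in range(len(xs[0])):
        vals = sorted({x[a] for x in xs})
        for lo, hi in zip(vals, vals[1:]):
            t = (lo + hi) / 2
            left = [y for x, y in zip(xs, ys) if x[a] <= t]
            right = [y for x, y in zip(xs, ys) if x[a] > t]
            g = base - (len(left) / len(ys)) * entropy(left) \
                     - (len(right) / len(ys)) * entropy(right)
            if best is None or g > best[0]:
                best = (g, a, t)
    return best

def grow(xs, ys):
    """Steps 1-6: recursive partitioning; leaves hold a class label."""
    if len(set(ys)) == 1:                # step 6(a): pure node becomes a leaf
        return ys[0]
    split = best_split(xs, ys)
    if split is None or split[0] <= 0:   # step 6(b)/(c): nothing left to split on
        return Counter(ys).most_common(1)[0][0]
    _, a, t = split                      # step 4: branch on the best threshold
    L = [(x, y) for x, y in zip(xs, ys) if x[a] <= t]
    R = [(x, y) for x, y in zip(xs, ys) if x[a] > t]
    return (a, t, grow(*zip(*L)), grow(*zip(*R)))   # step 5: recurse

def predict(tree, x):
    while isinstance(tree, tuple):
        a, t, left, right = tree
        tree = left if x[a] <= t else right
    return tree
```

On a one-feature toy set with classes 'a' below 2.5 and 'b' above it, the builder finds that threshold and classifies accordingly.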

6. Results and discussion

In this paper, power analysis was applied to a multi-class problem. The null hypothesis (H0) was that the means of the classes are the same, whereas the alternative hypothesis (H1) was that the means of the classes are not the same. As a basis for the problem, a data set consisting of 250 samples from each class was considered. The data set was tested for normality, homogeneity and independence. The expected power level was set to 95% (equivalent to α = 0.05 and β = 0.05). The F-test was performed for calculation of sample size for a given effect size, α error probability, power, number of groups and number of repetitions. The test was an a priori sample size computation for multivariate analysis of variance (MANOVA) with repeated measures and within-between interactions. The central and non-central distributions with the critical F value of the result are shown in Fig. 3. From the data set, Pillai's V was computed and its value was found to be 1.486501. Pillai's V is an indicator of the statistical stability of the data set. If the value is low, the statistical stability is low and more data points are required to train the classifier, and vice versa. At first sight, one could say from Pillai's V that the statistical stability of the data set under study is reasonably good; any value above 0.5 gives reasonably good statistical stability and hence needs fewer data points. Upon calculation, the actual number of data points required for good statistical stability was found to be 18 for all six classes together. This reduces to 3 per class at the required power level of 0.95. To sum up, at least three data points per class should be taken for training the classifier to retain 95% of the power level in the signal. At this point, it is natural to ask how many data points would be required if a 90% power level is sufficient and the application can afford to bear a 10% α error probability.
In order to answer this series of questions with different type I error (α error probability) and power level (1 − β) values, a set of experiments was carried out to calculate the minimum sample size required. In one series of experiments, the sample size was computed for various α error probability values at a power level of 95%. This experiment was repeated for power levels down to 80% in steps of 5%. The important results of these experiments are tabulated in Table 2. The resulting curves are drawn in the same graph for easy comparison (Fig. 4). The topmost curve in Fig. 4 gives the required minimum sample size for various α error values at a power level of 95%. From the graph (refer Fig. 4), one can find the required sample


Fig. 3. F tests – MANOVA: Repeated measures, within-between interaction: sample size calculation with effect size f(V) = 0.7690296, α error probability = 0.05, power (1 − β error probability) = 0.95, number of groups = 6, repetitions = 5, with Pillai's V formula value = 1.486501.

Table 2. Power analysis test results with effect size f(V) = 0.7690296, number of groups = 6, repetitions = 5, Pillai's V formula value = 1.486501, numerator df = 20.

Output parameter                   α = 0.05    α = 0.10    α = 0.15    α = 0.20    α = 0.25
                                   1−β = 0.95  1−β = 0.90  1−β = 0.85  1−β = 0.80  1−β = 0.75
Noncentrality parameter λ          42.581270   35.484342   35.484392   35.484392   35.484392
Critical F                         1.793302    1.625806    1.478985    1.371388    1.285001
Denominator df                     48          36          36          36          36
Total sample size for 6 classes    18          15          15          15          15
Sample size per class              ≈3          ≈3          ≈3          ≈3          ≈3
Actual power                       0.957877    0.937708    0.963965    0.977564    0.985457

Fig. 4. Total sample size as a function of α error probability.

V. Indira et al. / Expert Systems with Applications 38 (2011) 7708–7717 7713

size for a given α error probability and power level. Suppose one could sacrifice the power level; then the variation of required sample size for various α error probabilities was computed and plotted (refer Fig. 5). The important results of these experiments are tabulated in Table 2. The values in Table 2 would be very helpful for those who do not have enough data points to train their classifiers: when enough data points are not available, the next best thing is to make some compromise in α error probability and the same amount of compromise in power level to get a smaller sample size. One should remember that the statistical stability is inversely proportional to the amount of sacrifice in α error probability and power level. However, in the present study, the sample size required is found to be just 3 per class (refer Table 2) for the different pairs of α error probability and power level.
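A simplified version of this sample-size search can be sketched as a one-way ANOVA power loop over the noncentral F distribution. The paper uses G*Power's MANOVA repeated-measures test, so this sketch is a stand-in and its numbers are not expected to reproduce Table 2 exactly; the function name and search strategy are ours.

```python
# Find the smallest total N whose fixed-effects one-way ANOVA F-test power
# reaches the target, for a given Cohen effect size f, alpha and group count.
from scipy.stats import f as f_dist, ncf

def min_sample_size(effect_f, alpha, target_power, k_groups):
    n = k_groups + 1                       # smallest N with df2 = N - k > 0
    while True:
        df1, df2 = k_groups - 1, n - k_groups
        lam = effect_f ** 2 * n            # noncentrality parameter
        f_crit = f_dist.ppf(1 - alpha, df1, df2)
        power = 1 - ncf.cdf(f_crit, df1, df2, lam)
        if power >= target_power:
            return n
        n += 1
```

As expected, a smaller effect size or a stricter α drives the required N up, mirroring the trends in Figs. 4-6.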

The sample size is not only a function of α error probability and power level, but is also greatly influenced by a parameter called 'effect size', which is a function of Pillai's V. For the data set under study, Pillai's V was found to be 1.486501 and the corresponding effect size was 0.7690296. To study the influence of the effect size parameter on sample size, a set of hypothetical experiments was conducted to compute total sample size as a function of effect size (0.1-0.5 with a step of 0.05) for α error probabilities of 5% to 20% with a step of 5% at a power level of 95%. The resulting plot is shown in Fig. 6. A similar study was carried out to compute total sample size as a function of effect size (0.1-0.5 with a step of 0.05) for power levels of 80% to 95% with a step of 5% at an α error probability of 5%. The resulting plot is shown in Fig. 7. As effect size increases, the total sample size decreases for a given α error probability. As α error probability increases, the required sample size decreases with increase in effect size (refer Fig. 6). As effect size increases, the total sample size decreases for a given power level. As power level increases, the required sample size also increases with increase in


Fig. 5. Total sample size as a function of power level for various α error probabilities.

Fig. 6. Total sample size as a function of effect size for various α error probabilities.


effect size (refer Fig. 7). As a matter of curiosity, one may be interested in the variation of effect size with respect to α error probability. For the total sample size of 18 (for all six classes together), with different power levels (80% to 95% in steps of 5%), the effect size was plotted as a function of α error probability, as shown in Fig. 8. For a given power level, as the α error probability increases, the effect size decreases. For a given α error probability, as the power level increases, the effect size also increases. It is to be noted that these effects are for a fixed sample size.
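The conversion from Pillai's V to effect size follows the G*Power convention f(V) = sqrt((V/s) / (1 − V/s)), where s is the smaller of (number of groups − 1) and the number of response variables. Taking s = 4 (a value inferred here from the reported numbers, not stated in the text) reproduces the quoted effect size:

```python
import math

def effect_size_from_pillai(V, s):
    """G*Power-style conversion from Pillai's trace V to effect size f(V):
    f(V) = sqrt((V/s) / (1 - V/s))."""
    return math.sqrt((V / s) / (1 - V / s))

f_v = effect_size_from_pillai(1.486501, 4)   # s = 4 inferred from the paper's values
```

With V = 1.486501 and s = 4 this gives f(V) ≈ 0.7690, matching the effect size used throughout Section 6.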

The sample size obtained using power analysis is to be validated through some established method: either another well-established statistical method could be chosen, or alternatively one can resort to a functional test that examines the effect of sample size on classification accuracy. A pool of classifiers is available to researchers, from which a suitable classifier has to be chosen. Often the selection is based on ease of training, general classification accuracy, computational complexity, etc. In the present study, the J48 decision tree classifier (a Java implementation of the C4.5 algorithm) has been chosen because it is easy to train. It also has the capability to perform feature selection simultaneously. As discussed in Section 2, the data set consists of 1500 samples; from each class 250 samples were drawn. All 250 samples per class were considered for training the J48 classifier with the 10-fold cross-validation method. The classifier's job here consists of two processes: feature selection and feature classification. Initially, a decision tree was generated, from which the top five features that contribute most to classification were selected. The rest of the features were consciously ignored for all further studies.

The next step was to find the effect of sample size on classification accuracy. To study this effect, the sample size was repeatedly reduced in decrements of 5 samples and the


Fig. 7. Total sample size as a function of effect size for various power levels.

Fig. 8. Effect size as a function of α error probability for various power levels.


corresponding classification accuracies were noted. The variation in classification accuracy with respect to sample size is shown in Fig. 9. From Fig. 9, one can see that the classification accuracy falls abruptly when the sample size is reduced below 3 per class. This means a classifier could be trained with three samples per class for the centrifugal pump fault diagnosis problem; however, the objective

Fig. 9. Classification accuracy as a function of sample size.

of the study is one step further: what is the minimum sample size that will have statistical stability? Upon careful observation, the oscillation of the classification accuracy tends to stabilize as the sample size increases. In classification, the root mean square error and the mean absolute error as functions of sample size are plotted in Figs. 10 and 11, respectively.
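The sample-size experiment can be imitated on synthetic data: train on n samples per class, test on held-out data, and watch the accuracy stabilize as n grows. A nearest-centroid classifier on well-separated Gaussian features stands in for J48 and the real histogram features; every constant below is invented for illustration.

```python
import random

random.seed(0)
K, DIM = 6, 5                     # six fault classes, five selected features

def sample(cls, n):
    """n synthetic feature vectors for class cls (means spaced well apart)."""
    return [[random.gauss(10 * cls + d, 1.0) for d in range(DIM)] for _ in range(n)]

def centroid(rows):
    return [sum(col) / len(col) for col in zip(*rows)]

def accuracy(n_train, n_test=50):
    """Train nearest-centroid on n_train samples per class; test on fresh data."""
    cents = [centroid(sample(c, n_train)) for c in range(K)]
    hits = total = 0
    for c in range(K):
        for x in sample(c, n_test):
            pred = min(range(K),
                       key=lambda k: sum((a - b) ** 2 for a, b in zip(x, cents[k])))
            hits += pred == c
            total += 1
    return hits / total

curve = {n: accuracy(n) for n in (1, 3, 10, 50)}   # accuracy vs. samples per class
```

Because the synthetic classes are strongly separated, accuracy is high even at small n and flattens quickly; on real vibration data the knee of this curve is what the sample-size study locates.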

Fig. 10. Root mean square error as a function of sample size.


Fig. 11. Mean absolute error as a function of sample size.


The maximum classification accuracy of the J48 algorithm for 250 samples per class was found to be 100%. As per the power analysis result in Table 2, if a 5% α error probability and a 95% power level can be accommodated, then for the given data set the minimum required sample size was found to be 3. This means that if 3 samples were used for training the classifier, the maximum α error probability likely to occur would be 5%. This has to be validated against the J48 classifier's classification results. From Fig. 11, it is evident that the mean absolute error is close to 0% (it did not exceed 1%) for cases whose sample size was greater than or equal to 3. The corresponding root mean square error is shown in Fig. 10.

In a classification problem, the mean absolute error of the classifier is a measure of type I error (α error probability). A type I error here is an error due to misclassification by the classifier; in general, a type I error is rejecting a null hypothesis when it is true. The α error probability is a measure of type I error in hypothesis testing, and hence the equivalence is obvious. From the above discussion, the results of power analysis hold, and the actual error did not exceed the upper bound (5%) found in the power analysis. A similar exercise of validating the results at other points also confirms the validity of the power analysis test. Thus, the sample size suggested by power analysis can be confidently used in the machine learning approach to fault diagnosis of the centrifugal pump.
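For a 0/1 misclassification indicator, the mean absolute error reduces to the misclassification rate, which is the quantity compared against the α bound from power analysis. A minimal check, with made-up fault labels:

```python
def mean_absolute_error(actual, predicted):
    """MAE over a 0/1 loss: identical to the misclassification rate."""
    errors = [0 if a == p else 1 for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

# Hypothetical labels for illustration only (not the paper's fault classes).
actual    = ["good", "cav", "good", "fault", "good", "good"]
predicted = ["good", "cav", "good", "good",  "good", "good"]

mae = mean_absolute_error(actual, predicted)   # one error in six samples
within_alpha_bound = mae <= 0.05               # compare against alpha = 0.05
```

Here one error in six samples gives an MAE of about 16.7%, well above the 5% bound, so such a classifier would fail the validation; the paper's J48 results stayed under 1%.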

7. Conclusion

The machine learning approach to fault diagnosis of a centrifugal pump with six types of faults was considered with the objective of determining the minimum number of samples required to separate faulty conditions with statistical stability. In the present study, histogram features were extracted from vibration signals. A method to choose the number of bins using the J48 algorithm and a method for determination of the minimum sample size using power analysis have been proposed. The results of power analysis were validated using the J48 classifier. From the results and discussion, it was found that the number of bins should be 37 and the sample size per class should be more than three in order to get good classification accuracy. The minimum sample size that gives statistical stability was found to be three per class with a power level of 95% and an α error probability of 5%. Also, for the other paired values of power level and α error probability, the required minimum sample size was found to be three per class (refer Table 2). Thus, for the centrifugal pump classification problem, the classifier can be trained with three samples per class in order to get good classification accuracy with statistical stability. The effects of α error probability and power level on sample size were also discussed in detail. These results and graphs can serve as a guideline for researchers working in the area of fault diagnosis

of centrifugal pumps to choose the number of bins and fix the sample size (machine learning approach).

Acknowledgements

The first author V. Indira and the third author N.R. Sakthivel acknowledge Karpagam University for providing them an opportunity to pursue a Ph.D. (part time) at Karpagam University, Coimbatore.
