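0888-3270/$ - see front matter © 2006 Elsevier Ltd. All rights reserved.
doi:10.1016/j.ymssp.2006.01.009

*Corresponding author. E-mail addresses: [email protected] (Q. Miao), [email protected] (V. Makis).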

Mechanical Systems and Signal Processing 21 (2007) 840–855

www.elsevier.com/locate/jnlabr/ymssp

Condition monitoring and classification of rotating machinery using wavelets and hidden Markov models

Qiang Miao a, Viliam Makis b,*

a School of Mechatronics Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China
b Department of Mechanical and Industrial Engineering, University of Toronto, 5 King’s College Road, Toronto, Ont., Canada M5S 3G8

Received 14 September 2005; received in revised form 14 January 2006; accepted 23 January 2006

Available online 20 March 2006

Abstract

Condition monitoring and classification of machinery state are of great practical significance in the manufacturing industry, because they provide updated information regarding machine status on-line, thus avoiding production loss and minimising the chances of catastrophic machine failure. In this paper, the condition classification is based on hidden Markov models (HMMs) processing information obtained from vibration signals. We present an on-line fault classification system with an adaptive model re-estimation algorithm. The machinery condition is identified by selecting the HMM which maximises the probability of a given observation sequence. The proper selection of the observation sequence is a key step in the development of an HMM-based classification system. In this paper, the classification system is validated using observation sequences based on the wavelet modulus maxima distribution obtained from real vibration signals, which has been shown to be effective in fault detection in previous research.

© 2006 Elsevier Ltd. All rights reserved.

Keywords: Condition monitoring; Rotating machinery; Wavelet modulus maxima distribution; Lipschitz exponent; Condition classification; Hidden Markov model (HMM)

1. Introduction

In recent years, the rapid development of industrial automation has motivated the need for more intelligent and reliable machining systems. To minimise the loss due to interruption of production and the high cost of machine failure, it is necessary to monitor machine condition on-line using an effective condition monitoring system that provides timely information for maintenance decision-making. Generally, condition monitoring involves the observation of machine condition using periodically sampled dynamic response measurements, such as vibration signals obtained from several transducers.

Vibration measurements obtained from the machine usually contain a lot of useful information, but also noise components which should be eliminated from the signal before the information is used for the classification of machine condition and maintenance planning. The success of a classification system depends very much on how effectively the extracted observation sequence represents a particular machine state or condition. During past decades, considerable research effort has focused on the development of various feature extraction and condition monitoring techniques. Feature extraction techniques can be classified into the following three main categories: statistical approaches, learning approaches and signal processing. Principal Component Analysis (PCA) is one of the most commonly used statistical feature extraction techniques [1]. It is based on identifying the axes along which the data show the highest variability. PCA transforms the original set of features into a smaller set of linear combinations that account for most of the variability in the original data set. In [2], PCA was applied to monitor an industrial batch polymerisation reactor on-line. The neural network is a learning approach that has been used, e.g. in [3], for feature selection and extraction of breast cancer features. Spectrum analysis is an example of a signal processing feature extraction technique which was used, e.g. in [4], for monitoring rotating machinery.

In machine condition monitoring, another crucial step is to establish a reliable and effective condition classification system based on the extracted features. Condition classification includes the identification of the operating status of the machine and the type of failure by interpreting the representative system variables. A failure is defined as an event in which a machine proceeds to an abnormal state. It is necessary to identify the type of failure at an early stage in order to select the appropriate maintenance action and prevent a more severe situation. In the ideal case where the machine dynamics and measurement process can be modelled completely and accurately, a variety of methods for fault classification can be derived based on system state estimation and statistical analysis of the residual error signals. In practice, however, especially for large complex systems, the system model may not be reliable, or may not even be available. Most classification systems deal with observations of events that display randomness. In this failure-identification problem, the classification may utilise spatial and temporal patterns. The temporal patterns usually involve ordered sequences of data over time. The spatial patterns consist of a unique pattern for each type of failure; variations in the pattern for the same failure type may occur under different operating conditions.

During past decades, research in pattern recognition has followed two primary directions: a knowledge-based approach and a statistical data-based approach. Knowledge-based approaches attempt to express human knowledge in terms of decision rules based on the features extracted from signals. Knowledge is represented in computer programs created by human experts in related areas. An example of a knowledge-based approach is the application of artificial neural networks (ANNs), and their combination with fuzzy logic models, to fault classification. ANNs have the advantages of superior learning, noise suppression and parallel computation abilities. However, successful implementation of this kind of system strongly depends on the proper selection of the network structure and on the amount of training data, which is not always available. In addition, ANNs can handle spatial variations, but cannot provide proper solutions for temporal variations. Therefore, it is reasonable to adopt a doubly stochastic approach for the classification of the patterns.

Compared with the previous approach, data-based statistical approaches have achieved considerable success. They are usually implemented by extracting information from the existing data and developing stochastic models of the signals. The hidden Markov model (HMM) enables modelling of both spatial and temporal phenomena. An HMM can be used to solve classification problems associated with time-series input data such as speech signals, and can provide an appropriate solution by means of its modelling and learning capability, even without exact knowledge of the problem. Over the past 10 years, HMMs have also been applied to classifying patterns in process trend analysis [5] and machine condition monitoring. Ertunc et al. [6] used HMMs to determine the wear status of drill bits in a drilling process. Atlas et al. [7] focused on the monitoring of milling processes with HMMs at three different time scales and illustrated how HMMs can give accurate wear prediction. Ocak et al. [8] presented the application of HMMs to bearing fault detection.

In this paper, we propose a modelling framework for the classification of machine conditions based on the wavelet modulus maxima distribution and HMMs. The wavelet modulus maxima distribution was proposed for singularity analysis of signals using wavelets in [9], and it has been shown to be fault-sensitive in previous research [10–12]. Therefore, it is reasonable to consider it as a feature for condition classification in this paper. Other researchers have also used wavelet modulus maxima for singularity analysis successfully in various fields, including bearing fault diagnosis [13], structural health monitoring [14], damage detection [15] and even financial time-series processing [16]. The selection of the HMM was motivated by its success in speech recognition and other areas. In machine condition monitoring, the actual condition (hidden state) cannot be observed directly, but it can be inferred from observations (such as vibration or acoustic measurements).

The paper is organised as follows. Section 2 briefly outlines the procedures for feature extraction. Section 3 introduces the elements of the HMM and presents modelling results, including the selection of the hidden states, the definition of the observation sequence and the adaptive model training and estimation algorithm. Section 4 further investigates and validates this system using several sets of gearbox vibration data provided by the Applied Research Laboratory (ARL) at Pennsylvania State University. Conclusions from this research are given in Section 5.

2. Feature extraction

2.1. Wavelets and Lipschitz exponent

In mathematics, the local regularity of a function can be measured with the Lipschitz exponent $\alpha$. Let $f(x)$ be a finite energy function, that is, $f(x) \in L^2(\mathbb{R})$. We say that $f(x)$ is Lipschitz $\alpha$ ($n < \alpha \le n+1$) at $x_0$ if there exist two constants $A$ and $h_0 > 0$ and a polynomial $P_n(x)$ of order $n$, such that

$$|f(x) - P_n(x - x_0)| \le A\,|x - x_0|^{\alpha} \quad \text{for } |x - x_0| < h_0. \tag{1}$$

The polynomial $P_n(x)$ is often associated with the Taylor expansion of $f(x)$ around $x_0$, but the definition is used even if such an expansion does not exist.

When applying a wavelet transform, we consider a wavelet $\psi(x)$ with $n+1$ vanishing moments, that is,

$$\int_{-\infty}^{+\infty} x^k\, \psi(x)\, dx = 0 \quad \text{for } 0 \le k < n+1. \tag{2}$$

This condition states that a wavelet with $n+1$ vanishing moments is orthogonal to polynomials of order up to $n$. Suppose that $\psi(x)$ has $n+1$ vanishing moments, is $(n+1)$ times continuously differentiable and has compact support. Let $Wf(s, x)$ denote the wavelet transform of $f$ at scale $s$ and position $x$. It was proved in [9] that if the function $f(x)$ is Lipschitz $\alpha$ at $x_0$, $n < \alpha < n+1$, then there exists a constant $A$ such that for all points $x$ in a neighbourhood of $x_0$ and any scale $s$,

$$|Wf(s, x)| \le A\left(s^{\alpha} + |x - x_0|^{\alpha}\right). \tag{3}$$

We will show in the next section that, under additional assumptions which are satisfied in our application, inequality (3) can be simplified to provide a practical way of estimating the Lipschitz exponent $\alpha$ in a localised interval.
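As a numerical illustration of the vanishing moment condition (2), the sketch below evaluates the first three moments of the Mexican hat wavelet. The choice of wavelet is an assumption made only for illustration; the text does not state which wavelet is used. This wavelet has two vanishing moments ($k = 0, 1$), so its moment for $k = 2$ is non-zero.

```python
import numpy as np

def mexican_hat(x):
    """Mexican hat wavelet (unnormalised); an assumed example wavelet."""
    return (1.0 - x**2) * np.exp(-x**2 / 2.0)

def moment(psi, k, half_width=12.0, num=200001):
    """Approximate the integral of x^k * psi(x) over the real line by a Riemann sum."""
    x = np.linspace(-half_width, half_width, num)
    dx = x[1] - x[0]
    return float(np.sum(x**k * psi(x)) * dx)

# Moments for k = 0, 1, 2: the first two vanish, the third does not.
moments = [moment(mexican_hat, k) for k in range(3)]
```

For this wavelet the second moment equals $-2\sqrt{2\pi}$ analytically, which the numerical value should match closely.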

2.2. Modulus maxima distribution

In machinery condition monitoring, we are concerned with capturing the occurrence of certain kinds of singularities related to machinery health condition. The wavelet transform of a time domain signal generates a two-dimensional time-scale representation. Many of the coefficients in this representation are very small in magnitude. A large-magnitude component will be present at a time point where a sudden change in the signal (such as a singularity caused by a fault) occurred. From inequality (3), it is possible to detect local singularities of a function by estimating the rate of decay of $\log_2|Wf(s, x)|$ across all scales $s$. However, this is not practical, since it requires measuring the decay of $\log_2|Wf(s, x)|$ in the two-dimensional scale-space $(s, x)$ in the neighbourhood of $x_0$. In addition, we are more interested in the occurrence of a singularity point than in the value of the Lipschitz exponent. A practical way of applying inequality (3) is to localise it through modulus maxima. Mallat and Hwang [9] proved that the local Lipschitz exponent of $f(x)$ at $x_0$ depends on the decay of $|Wf(s, x)|$ at fine scales in the neighbourhood of $x_0$. The decay can be measured by the local maxima.

A modulus maximum is defined as any point $(s_0, x_0)$ such that $|Wf(s_0, x)|$ is a local maximum at $x = x_0$. Singularities in the signal can be identified by the presence of modulus maxima. If there exist a scale $s_0 > 0$ and a constant $C$ such that, for $x \in (a, b)$ and $s < s_0$, all the modulus maxima of $Wf(s, x)$ belong to a cone defined by

$$|x - x_0| \le Cs, \tag{4}$$

then at each modulus maximum $(s, x)$ in the cone defined by Eq. (4), substituting (4) into (3) gives, with a redefined constant $A$,

$$|Wf(s, x)| \le A s^{\alpha},$$

which is equivalent to

$$\log_2|Wf(s, x)| \le \log_2 A + \alpha \log_2 s. \tag{5}$$
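Relation (5) suggests estimating $\alpha$ as the slope of $\log_2|Wf(s, x)|$ against $\log_2 s$ along a maxima line. The following is a minimal sketch under the assumption that the bound in (5) is attained, so that a least-squares line fit applies. The function name and the synthetic amplitudes are hypothetical.

```python
import numpy as np

def estimate_lipschitz(scales, maxima_amplitudes):
    """Fit log2|Wf| = log2(A) + alpha * log2(s) and return (alpha, A)."""
    log_s = np.log2(np.asarray(scales, dtype=float))
    log_w = np.log2(np.abs(np.asarray(maxima_amplitudes, dtype=float)))
    alpha, log_a = np.polyfit(log_s, log_w, 1)  # slope, intercept
    return float(alpha), float(2.0 ** log_a)

# Synthetic maxima amplitudes that follow A * s**alpha exactly (assumed values).
scales = 2.0 ** np.arange(1, 7)
amplitudes = 3.0 * scales ** 0.5
alpha_hat, a_hat = estimate_lipschitz(scales, amplitudes)
```

With real data the amplitudes only satisfy the inequality, so the fitted slope should be read as an estimate rather than an exact value.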

The modulus maxima representation is composed of many modulus maxima lines, which connect modulus maxima points across scales. The length of a line is not its true arc length but its span in the vertical (scale) direction. Fig. 1 shows the wavelet transform and the modulus maxima representation of a real vibration signal. Because of relation (5), we use $\log_2(s)$ on the Y-axis when plotting the wavelet transform or the modulus maxima distribution.

The signal here is the gear motion error [12] extracted from gearbox vibration data under normal condition. Fig. 1(c) is the corresponding modulus maxima representation of the signal. To fully characterise the modulus maxima representation, the following terms need to be defined:

1. Total number of modulus maxima lines: $M$.
2. Vector $L = (l_1, l_2, \ldots, l_i, \ldots, l_M)$, where $l_i$ is the length of the $i$th line, $1 \le i \le M$.
3. Location of modulus maxima points: $C = \{c_{ij}\}$, where $c_{ij}$ corresponds to the location of the $j$th point on line $i$, $1 \le i \le M$, $1 \le j \le l_i$.
4. Wavelet coefficients of modulus maxima points: $W = \{w_{ij}\}$, where $w_{ij}$ is the wavelet coefficient of the $j$th point on line $i$.
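The modulus maxima that these terms describe can be located scale by scale. The sketch below marks strict local maxima of $|Wf(s, x)|$ along $x$ in a coefficient matrix whose rows are scales and whose columns are sampling points. The array `coeffs` and the function name are hypothetical, and the linking of maxima into lines across scales is not shown.

```python
import numpy as np

def modulus_maxima(coeffs):
    """Return a boolean mask of strict local maxima of |W f(s, x)| along x,
    computed independently for every scale (row). Boundary columns are
    never marked."""
    mod = np.abs(np.asarray(coeffs, dtype=float))
    mask = np.zeros(mod.shape, dtype=bool)
    interior = mod[:, 1:-1]
    mask[:, 1:-1] = (interior > mod[:, :-2]) & (interior > mod[:, 2:])
    return mask

# Toy coefficient matrix: 2 scales, 5 sampling points (assumed values).
coeffs = np.array([[0.0, 1.0, 0.0, -2.0, 0.0],
                   [0.0, 0.5, 1.5, 0.5, 0.0]])
mask = modulus_maxima(coeffs)
```

Negative coefficients count through their modulus, which is why the value -2.0 in the first row is marked.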

[Figure 1: four panels. (a) Time domain signal: amplitude vs. sampling points. (b) Wavelet transform: log2(s) vs. sampling points. (c) Modulus maxima map: log2(s) vs. sampling points. (d) Modulus maxima distribution: log2(s) vs. line number.]

Fig. 1. Wavelet transform and modulus maxima representation of a vibration signal.


Since singularities caused by machine failures result in the emergence of modulus maxima, and the lines connecting these maxima points span up to a certain scale, we can use these lines to characterise the condition of a machine. By ordering the elements of $L$ from the largest to the smallest, the modulus maxima distribution is defined as a function $m(x)$ in which $x$ represents the line number and $m(x)$ represents the length of the corresponding modulus maxima line. The function $m(x)$ can be obtained with the Matlab function sort(·); note that sort arranges the elements of $L$ in ascending order by default, so the result must be reversed (or the 'descend' option used) to obtain the largest-to-smallest ordering. The modulus maxima distribution curve of the vibration signal in Fig. 1(a) is plotted in Fig. 1(d). In this feature extraction procedure, the wavelet coefficient information is discarded, but this does not affect the effectiveness of the proposed feature for condition classification.
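The construction of $m(x)$ from the vector of line lengths can be sketched in a few lines. NumPy's `sort`, like Matlab's, returns ascending order by default, so the result is reversed to follow the largest-to-smallest definition. The example lengths are assumed values.

```python
import numpy as np

def modulus_maxima_distribution(line_lengths):
    """Return m(x): the line lengths ordered from the largest to the smallest."""
    lengths = np.asarray(line_lengths, dtype=float)
    return np.sort(lengths)[::-1]  # ascending sort, then reversed

# Hypothetical line lengths l_1, ..., l_M (spans in log2(s)).
L = [3, 8, 1, 5, 8, 2]
m = modulus_maxima_distribution(L)
```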

3. Design of a condition classification system based on HMM

3.1. Description of failure identification problem

The problem of failure identification is defined as the classification of failure modes $\omega_j$, given sequential input patterns $X_t$ at time $t$. The input pattern $X_t$ is mathematically defined as an object described by a sequence of features at time $t$,

$$X_t = (x_{t1}, x_{t2}, \ldots, x_{ti}, \ldots, x_{td}). \tag{6}$$

The space of input patterns consists of the set of all possible patterns, $X_t \in \mathbb{R}^d$, where $\mathbb{R}^d$ is a $d$-dimensional real vector space.

A sequence of the $k$ observed data up to time $t$ is defined as $\Phi_{t-k}$:

$$\Phi_{t-k} = \{X_{t-k+1}, \ldots, X_{t-1}, X_t\}. \tag{7}$$

The set of possible failure modes $\omega_j$ forms the space of classes $\Omega$:

$$\Omega = \{\omega_1, \omega_2, \ldots, \omega_c\}, \tag{8}$$

where $c$ is the number of failure modes in the system. The classification task can be considered as the problem of finding a function $g$ which maps the space of input patterns $\Phi_{t-k}$ to the space of classes $\Omega$:

$$g : \Phi_{t-k} \to \Omega. \tag{9}$$

There are three basic approaches to pattern classification: the statistical approach, the syntactic approach and the neural approach. The statistical approach is a data-based method which uses a statistical basis for the classification. A set of characteristic measurements, denoted "features", is extracted from the input data and used to assign each feature vector to one of the $c$ classes. Features are assumed to reflect a state of the machine, and therefore the underlying model is a set of probabilities or probability density functions. The classifier attempts to integrate all the available information, such as measurements and a priori probabilities. Decision rules may be formulated in several ways; for example, by converting an a priori class probability into a measurement-conditioned probability, or by formulating a measure of expected classification error and choosing a decision rule that minimises this measure.

Both the syntactic approach and the neural approach are knowledge-based methods. Syntactic classifiers operate by evaluating a symbolic representation of the pattern in terms of its structural characteristics. Thus the result of syntactic pattern recognition is a structural description of the test pattern. The syntactic method is applied to problems for which a statistical approach is difficult or impractical. Sands and Garber [17] evaluated a syntactic pattern recognition system for radar signal identification, in which three different level-crossing-based pattern representation algorithms were considered. A neural network approach is appropriate for pattern-recognition implementations that involve large interconnected networks of relatively simple units. Neural networks are particularly suitable for pattern-association applications.

In machine condition monitoring, the system often exhibits sequentially changing behaviour. For example, the health status of a machine deteriorates before failure occurs. When a failure occurs, a decision can be made by selecting the type of failure $\omega$ with the highest a priori probability $P(\omega)$. However, this decision is probably unreasonable, because it ignores the observed data. It is more appropriate to determine the type of failure after observing the trend of the machine system variables, namely, to obtain the conditional probability $P(\omega|\Phi_{t-k})$. This conditional probability is called the a posteriori probability. Decision-making based on the a posteriori probability is more reliable, because it employs a priori knowledge together with the observed data. Classification of an unknown pattern $X_t$ corresponds to finding the optimal model $\omega$ that maximises $P(\omega|\Phi_{t-k})$ over the types of failure. Bayes' rule can be applied to calculate the a posteriori probability,

$$\max_{\omega} P(\omega|\Phi_{t-k}) = \max_{\omega} \frac{P(\Phi_{t-k}|\omega)\,P(\omega)}{P(\Phi_{t-k})}. \tag{10}$$
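The difference between a decision based on the a priori probability and one based on the a posteriori probability of Eq. (10) can be illustrated numerically. The sketch below uses three hypothetical classes with assumed priors and likelihoods; none of these numbers come from the experiments in this paper.

```python
import numpy as np

def posterior(likelihoods, priors):
    """Bayes' rule: P(w | Phi) = P(Phi | w) P(w) / P(Phi), where P(Phi) is
    the sum of the joint probabilities over all classes."""
    joint = np.asarray(likelihoods, dtype=float) * np.asarray(priors, dtype=float)
    return joint / joint.sum()

# Assumed values for three classes (index 0 is the most common a priori).
priors = np.array([0.90, 0.07, 0.03])
likelihoods = np.array([1e-6, 4e-5, 2e-5])  # P(Phi | w) for each class

post = posterior(likelihoods, priors)
decision_prior = int(np.argmax(priors))    # ignores the observed data
decision_posterior = int(np.argmax(post))  # uses prior and data together
```

Here the prior alone selects class 0, while the observed data shift the decision to class 1.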

In the failure identification problem, the HMM is used to estimate the conditional probability $P(\Phi_{t-k}|\omega)$. By using the HMM, the pattern variability in the parameter space and in time can be modelled effectively. The HMM utilises a Markov chain to model the changing statistical characteristics that exist in the actual observations of machine vibration signals; it is therefore a "doubly" stochastic process. HMM parameters are estimated using the Baum–Welch algorithm, which guarantees that the likelihood does not decrease from one iteration to the next.

3.2. Elements of HMM

The HMM is a Markov model whose states cannot be observed directly. Usually, it contains a finite number of states, where each state generates an observation at a certain time point. Each hidden state is characterised by two sets of probabilities: a transition probability distribution and an observation probability distribution. In addition, a third probability distribution has to be specified for an HMM: the distribution of the initial hidden state. In summary, the complete specification of an HMM includes the following elements:

• set of hidden states:
  $$S = \{S_1, S_2, \ldots, S_N\},$$
  where $N$ is the number of states in the model;
• state transition probability distribution:
  $$A = \{a_{ij}\},$$
  where $a_{ij} = P[q_{t+1} = S_j|q_t = S_i]$, for $1 \le i, j \le N$, and $q_t$ represents the hidden state at time $t$;
• set of observation symbols:
  $$V = \{v_1, v_2, \ldots, v_M\},$$
  where $M$ is the number of observation symbols per state;
• observation symbol probability distribution:
  $$B = \{b_j(k)\},$$
  where $b_j(k) = P[v_k \text{ at } t|q_t = S_j]$, for $1 \le j \le N$, $1 \le k \le M$;
• initial state probability distribution:
  $$\pi = \{\pi_i\},$$
  where $\pi_i = P[q_1 = S_i]$, for $1 \le i \le N$.

For convenience, an HMM can be represented by the compact notation

$$\lambda = (A, B, \pi), \tag{11}$$

to indicate the complete parameter set of the model.
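The parameter set $\lambda = (A, B, \pi)$ maps naturally onto three arrays. The following is a minimal sketch of such a container, with a check that every row is a probability distribution. The class name `DiscreteHMM` and the example values are hypothetical.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DiscreteHMM:
    """Parameter set lambda = (A, B, pi) of a discrete HMM, Eq. (11)."""
    A: np.ndarray   # (N, N) transition probabilities a_ij
    B: np.ndarray   # (N, M) observation symbol probabilities b_j(k)
    pi: np.ndarray  # (N,)   initial state probabilities pi_i

    def __post_init__(self):
        self.A = np.asarray(self.A, dtype=float)
        self.B = np.asarray(self.B, dtype=float)
        self.pi = np.asarray(self.pi, dtype=float)
        n = self.pi.shape[0]
        if self.A.shape != (n, n) or self.B.ndim != 2 or self.B.shape[0] != n:
            raise ValueError("inconsistent shapes for A, B and pi")
        for name, p in (("A", self.A), ("B", self.B), ("pi", self.pi)):
            # Each row of A and B, and pi itself, must sum to one.
            if (p < 0).any() or not np.allclose(p.sum(axis=-1), 1.0):
                raise ValueError(f"{name} must hold probability distributions")

    @property
    def n_states(self):
        return self.pi.shape[0]

    @property
    def n_symbols(self):
        return self.B.shape[1]

# Assumed example with N = 2 hidden states and M = 3 observation symbols.
model = DiscreteHMM(A=[[0.7, 0.3], [0.4, 0.6]],
                    B=[[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],
                    pi=[0.6, 0.4])
```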

3.3. HMM training

Training refers to fitting the set of parameters $\lambda = (A, B, \pi)$ to the characteristics of the input patterns to be modelled. An HMM is applied to a classification problem under the assumption that one can precisely determine the model parameters for given observations. However, this assumption is difficult to realise because of the complexity of the problem. It is possible to find local optima through maximum likelihood estimation. Consider a state sequence $Q = q_1, q_2, \ldots, q_T$. Given the input observation $O = O_1, O_2, \ldots, O_T$, this method maximises

$$P(O|\lambda) = \sum_{Q} \pi_{q_1} b_{q_1}(O_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(O_t). \tag{12}$$

Eq. (12) calculates the probability of the given observation symbols over all the paths from the initial to the final state. However, direct evaluation is not realistic because of the heavy computation over all possible paths: there are $N^T$ state sequences. Another procedure, called the forward–backward procedure, is available for this purpose. The forward variable $\alpha_t(i)$ is defined as

$$\alpha_t(i) = P(O_1, O_2, \ldots, O_t, q_t = S_i|\lambda). \tag{13}$$

This is the probability of the partial observation sequence up to time $t$ and of state $S_i$ at time $t$. Then $\alpha_t(i)$ is calculated inductively as follows:

1. Initialisation:
$$\alpha_1(i) = \pi_i\, b_i(O_1) \quad \text{for all states } i. \tag{14}$$

2. Induction: calculating $\alpha(\cdot)$ along the time axis, for $t = 2, \ldots, T$ and all states $j$:
$$\alpha_t(j) = \left[\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\right] b_j(O_t). \tag{15}$$

3. Termination: the final probability is given by
$$P(O|\lambda) = \sum_{i=1}^{N} \alpha_T(i). \tag{16}$$
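The forward recursion (14)–(16) can be checked against the direct path summation of Eq. (12) on a small model. The sketch below assumes observation symbols encoded as integer indices into the columns of $B$; the model values and function names are hypothetical.

```python
import itertools
import numpy as np

def forward(A, B, pi, obs):
    """Forward variables alpha_t(i), Eqs. (14)-(15); rows index time."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialisation
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # induction
    return alpha

def brute_force_likelihood(A, B, pi, obs):
    """Direct evaluation of Eq. (12) by summing over all N^T state paths."""
    N = len(pi)
    total = 0.0
    for path in itertools.product(range(N), repeat=len(obs)):
        p = pi[path[0]] * B[path[0], obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
        total += p
    return total

# Assumed toy model: 2 hidden states, 3 observation symbols.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
obs = [0, 2, 1, 2]

alpha = forward(A, B, pi, obs)
p_forward = alpha[-1].sum()                       # termination, Eq. (16)
p_brute = brute_force_likelihood(A, B, pi, obs)
```

The forward recursion needs on the order of $N^2 T$ operations, whereas the path summation grows as $N^T$. For long sequences the unscaled forward variables underflow, so practical implementations usually normalise them at each step.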

The backward variable $\beta_t(i)$, which is used with the forward variable $\alpha_t(i)$ to optimise the model parameters, can be defined as

$$\beta_t(i) = P(O_{t+1}, O_{t+2}, \ldots, O_T|q_t = S_i, \lambda). \tag{17}$$

This is the probability of the partial observation sequence from $t+1$ to the final observation at time $T$, given state $S_i$ at time $t$ and model $\lambda$. Again, $\beta_t(i)$ can be calculated inductively as follows:

1. Initialisation:
$$\beta_T(i) = 1 \quad \text{for all states } i. \tag{18}$$

2. Induction: calculating $\beta(\cdot)$ along the time axis, for $t = T-1, \ldots, 1$ and all states $j$:
$$\beta_t(j) = \sum_{i=1}^{N} a_{ji}\, b_i(O_{t+1})\, \beta_{t+1}(i). \tag{19}$$

Now, let us consider the model-training problem: given an observation sequence $O$, how do we find the optimum model parameters that maximise $P(O|\lambda)$? The iterative algorithm frequently used for estimation in HMM-based applications is the Expectation Maximisation (EM) algorithm. The EM algorithm was first introduced by Baum and Petrie in 1966 and it is also referred to in the literature as the Baum–Welch algorithm [18]. The probability of transition, $\xi_t(i, j)$, is defined as the probability of being in state $S_i$ at time $t$ and making a transition to state $S_j$ at time $t+1$, given the observation sequence and the particular model. It can be computed as

$$\xi_t(i, j) = P(q_t = S_i, q_{t+1} = S_j|O, \lambda) = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{P(O|\lambda)}. \tag{20}$$

Similarly, the probability of being in state $S_i$ at time $t$, $\gamma_t(i)$, given the observation sequence and the model, is

$$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j) = P(q_t = S_i|O, \lambda) = \frac{\alpha_t(i)\,\beta_t(i)}{P(O|\lambda)}. \tag{21}$$
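The quantities in Eqs. (17)–(21) can be computed together and cross-checked, since summing $\xi_t(i, j)$ over $j$ must reproduce $\gamma_t(i)$. The following is a minimal unscaled sketch on an assumed toy model, with observation symbols encoded as integer column indices of $B$; the function names are hypothetical.

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward variables alpha_t(i), Eqs. (14)-(15)."""
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs):
    """Backward variables beta_t(i), Eqs. (18)-(19)."""
    beta = np.ones((len(obs), A.shape[0]))
    for t in range(len(obs) - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def xi_gamma(A, B, pi, obs):
    """Return xi (Eq. 20), gamma (Eq. 21) and the likelihood P(O | lambda)."""
    alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
    likelihood = alpha[-1].sum()
    emit_beta = B[:, obs[1:]].T * beta[1:]   # b_j(O_{t+1}) * beta_{t+1}(j)
    # xi[t, i, j] = alpha_t(i) a_ij b_j(O_{t+1}) beta_{t+1}(j) / P(O | lambda)
    xi = alpha[:-1, :, None] * A[None, :, :] * emit_beta[:, None, :] / likelihood
    gamma = alpha * beta / likelihood
    return xi, gamma, likelihood

# Assumed toy model: 2 hidden states, 3 observation symbols.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
obs = [0, 2, 1, 2]

xi, gamma, likelihood = xi_gamma(A, B, pi, obs)
# The backward variables give the same likelihood as the forward termination.
likelihood_backward = (pi * B[:, obs[0]] * backward(A, B, obs)[0]).sum()
```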

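Using the above formulas, the parameters $\bar{a}_{ij}$, $\bar{b}_j(k)$ and $\bar{\pi}_i$ of the re-estimated model $\bar{\lambda}$ can be computed as follows:

$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{j} \xi_t(i, j)} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \tag{22}$$

$$\bar{b}_j(k) = \frac{\sum_{t=1,\, O_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}, \tag{23}$$

$$\bar{\pi}_i = \gamma_1(i). \tag{24}$$

The re-estimation formulas (22)–(24) can be derived by maximising Baum's auxiliary function [18]

$$Q(\lambda, \bar{\lambda}) = \sum_{Q} P(Q|O, \lambda) \log[P(O, Q|\bar{\lambda})]. \tag{25}$$

It has been proved by Baum et al. [18] that maximisation of $Q(\lambda, \bar{\lambda})$ leads to an increase of the likelihood, that is,

$$P(O|\bar{\lambda}) \ge P(O|\lambda), \tag{26}$$

where $\bar{\lambda}$ maximises $Q(\lambda, \bar{\lambda})$ over $\bar{\lambda}$.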
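The re-estimation formulas (22)–(24) and the monotonicity property (26) can be illustrated with a single-sequence sketch. It is unscaled, so it is only suitable for short sequences, and the toy model, the observation sequence and the function names are all assumptions made for illustration.

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """Unscaled forward and backward variables and P(O | lambda)."""
    T, N = len(obs), len(pi)
    alpha, beta = np.zeros((T, N)), np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta, alpha[-1].sum()

def baum_welch_step(A, B, pi, obs):
    """One re-estimation step, Eqs. (22)-(24). Also returns the likelihood
    of the model that was passed in."""
    obs = np.asarray(obs)
    alpha, beta, lik = forward_backward(A, B, pi, obs)
    gamma = alpha * beta / lik                                   # Eq. (21)
    emit_beta = B[:, obs[1:]].T * beta[1:]
    xi = alpha[:-1, :, None] * A[None] * emit_beta[:, None, :] / lik  # Eq. (20)
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]     # Eq. (22)
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[obs == k].sum(axis=0)                # numerator of Eq. (23)
    new_B /= gamma.sum(axis=0)[:, None]
    new_pi = gamma[0]                                            # Eq. (24)
    return new_A, new_B, new_pi, lik

# Assumed initial model and observation sequence.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
obs = [0, 2, 1, 2, 2, 0, 1, 0, 0, 2]

likelihoods = []
for _ in range(5):
    A, B, pi, lik = baum_welch_step(A, B, pi, obs)
    likelihoods.append(lik)  # likelihood before each update
```

The recorded likelihoods should form a non-decreasing sequence, in line with (26). The procedure converges to a local optimum that depends on the initial parameters.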

3.4. HMM classification

Classification means finding the best path (state sequence) in each trained model, and selecting the model that maximises the path probability for a given input observation. However, in real applications, especially in machine condition monitoring, it is more common to establish several HMMs corresponding to the different conditions under consideration. In that case, the hidden states of the model do not have physical meaning, and the decision regarding the current machine state is made by choosing the model that gives the maximum probability of the observation. That is, the machine condition $C$ can be selected by

$$C = \arg\max_{s} P(O|\lambda_s), \quad 1 \le s \le H. \tag{27}$$

Here, $H$ is the number of machine conditions considered in the classification system, which is equal to the number of HMMs in the system.
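The decision rule (27) can be sketched as follows, using the log-likelihood from a normalised forward recursion to avoid numerical underflow. The two models, their labels and the observation sequence are hypothetical and are not the models trained in this paper.

```python
import numpy as np

def log_likelihood(A, B, pi, obs):
    """log P(O | lambda) via the forward recursion, normalising the forward
    variables at every step and accumulating the log of the scale factors."""
    alpha = pi * B[:, obs[0]]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = (alpha @ A) * B[:, obs[t]]
        scale = alpha.sum()
        log_p += np.log(scale)
        alpha = alpha / scale
    return float(log_p)

def classify(models, obs):
    """Return the label of the model with the highest log-likelihood, Eq. (27)."""
    scores = {name: log_likelihood(*params, obs) for name, params in models.items()}
    return max(scores, key=scores.get), scores

# Assumed models (A, B, pi) with 2 hidden states and 3 observation symbols.
A = np.array([[0.9, 0.1], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
models = {
    "normal": (A, np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]), pi),
    "failure": (A, np.array([[0.1, 0.2, 0.7], [0.1, 0.3, 0.6]]), pi),
}
obs = [2, 2, 1, 2, 2]
label, scores = classify(models, obs)
```

Because the logarithm is monotonic, maximising the log-likelihood selects the same model as maximising $P(O|\lambda_s)$.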

3.5. Selection of observation

In machinery condition monitoring, a feature extracted from the measured signal is treated as the observation. The selection of the observation is significant because a poorly chosen observation carries no information about the machine condition and cannot identify the actual status of the machinery. In Section 2, we proposed the modulus maxima distribution, and investigation showed that it is a fault-sensitive feature of machinery condition. It is therefore appropriate to choose this measure as the observation sequence of the classification system. Fig. 2 shows two observations extracted from normal and failure conditions, respectively. The observation symbol $v$ in this observation sequence is the length (corresponding to $\log_2 s$) of a modulus maxima line in the modulus maxima map (see Fig. 1(c)).

In the experiment, each sampling data file generates one observation sequence $O = \{v_1, \ldots, v_i, \ldots, v_L\}$, where $v_i \in V$. It is not realistic to use the whole distribution curve $m(x)$ as the observation, since the curve is usually very long, say, around 300 points, which also means that the size of the observation sequence $L$ and the number of observation symbols $M$ are large: $L$ is around 300 and $M = 128$. The computational load of the classification system would then be very heavy. In addition, a previous study shows that some of these lines are generated by noise and can be discarded without degrading the classification performance. It is reasonable to select the most informative part of the distribution curve (see the area between the dotted lines in Fig. 2(a) and (b)). The analysis in [9] showed that the most informative part of the modulus maxima distribution corresponds to $\log_2(s)$ in a certain interval (see Fig. 2(b)), which is related to certain frequency components in the frequency domain. This is expected, because the occurrence of a machine failure results in a change in the frequency domain. Through experiments, we identified the most informative part, for which the number of observation symbols is $M = 56$ and the length of the observation sequence is 30.

[Figure 2: two panels, (a) and (b), plotting log2(s) against line number for the normal and failure conditions, one at full scale and one zoomed in.]

Fig. 2. An example of two observations.

3.6. Definition of hidden states and structure of classification system

Another important issue in HMM modelling is the selection of hidden states. In this research, we assume that the hidden states are governed by a homogeneous Markov chain of order $\ell = 1$, while the observations are independent. We consider several HMMs corresponding to the different machine conditions of interest. Then, to make a decision based on an observation sequence, we compute the log-likelihood for each model. If the $i$th model gives the maximum, we conclude that the machine is in the $i$th condition. Fig. 3 shows the general structure of a two-stage condition classification system.

In Fig. 3, stage 1 detects the existence of machine failure. It would be appropriate to set up three models corresponding to the three machine conditions (normal, warning and failure). The classification in stage 1 indicates what to do next: keep the regular inspection interval (normal state), shorten the inspection interval (warning state), or stop the machine (failure state). However, in this research, only two conditions are considered in stage 1, namely, normal and failure (see Fig. 4), because we did not have a sufficient amount of warning data for model training. Stage 2 determines the failure mode that occurred when the machine failed. Each model in this stage corresponds to one failure mode. In stage 2, three machine conditions are considered (see Fig. 4), namely, adjacent tooth failure, distributed tooth failure and normal condition.

[Figure 3: block diagram. Stage 1: the observation sequence is scored by HMMs for the normal, warning and failure conditions, giving P(O|normal), P(O|warning) and P(O|failure), and the maximum is selected. Stage 2: the observation sequence is scored by HMMs for failure modes 1, 2, ..., N, giving P(O|model 1), P(O|model 2), ..., P(O|model N), and the maximum is selected to give the machine failure mode.]

Fig. 3. The general structure of a two-stage classification system.

[Figure 4: block diagram. Stage 1: the observation sequence is scored by HMMs for the normal and failure conditions, giving P(O|normal) and P(O|failure), and the maximum is selected to give the machine health state. Stage 2: the observation sequence is scored by HMMs for adjacent tooth failure, distributed tooth failure and normal condition, and the maximum is selected to give the machine condition.]

Fig. 4. The structure of a two-stage classification system considered in this research.

In this research, both the state transition probability distribution $A$ and the observation probability distribution $B$ are discrete distributions. We therefore need to specify the number of hidden states in each model. The selection of hidden states will affect the performance of the model. The likelihood ratio test is a standard procedure that can be used to compare two models [19]. However, since some models are mixtures of different Markov models, this method cannot be used directly. A more general and appealing alternative testing procedure is to use the Bayesian Information Criterion (BIC). For a model $M$, BIC is defined as

$$\mathrm{BIC}(M) = -2\,LL(M) + p(M)\log(n), \tag{28}$$

where $LL(M)$ is the log-likelihood of the model, $p(M)$ is the total number of independent parameters in the model, and $n$ is the number of data points in the log-likelihood. The model having the lowest BIC value is chosen as the best model. Another frequently used criterion is the Akaike Information Criterion (AIC), which is defined as

$$\mathrm{AIC}(M) = -2\,LL(M) + 2p(M). \tag{29}$$

Although AIC has been shown to tend to overestimate the true order of a Markovian model, we still compute AIC for the purpose of comparison.
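Criteria (28) and (29) can be evaluated once the number of independent parameters is known. The sketch below assumes the usual count for a discrete HMM, where each row of $A$ and $B$ and the vector $\pi$ lose one degree of freedom because they sum to one, and it assumes the natural logarithm in Eq. (28). The log-likelihood values and the number of data points are hypothetical.

```python
import math

def hmm_free_parameters(n_states, n_symbols):
    """Independent parameters of a discrete HMM with N states and M symbols:
    N(N-1) for A, N(M-1) for B and N-1 for pi (assumed counting rule)."""
    return n_states * (n_states - 1) + n_states * (n_symbols - 1) + (n_states - 1)

def bic(log_lik, n_params, n_points):
    """Bayesian Information Criterion, Eq. (28), with the natural logarithm."""
    return -2.0 * log_lik + n_params * math.log(n_points)

def aic(log_lik, n_params):
    """Akaike Information Criterion, Eq. (29)."""
    return -2.0 * log_lik + 2.0 * n_params

# Hypothetical log-likelihoods for models with 2, 3 and 4 hidden states.
log_liks = {2: -410.0, 3: -395.0, 4: -392.0}
n_symbols, n_points = 56, 600

bic_scores = {n: bic(ll, hmm_free_parameters(n, n_symbols), n_points)
              for n, ll in log_liks.items()}
aic_scores = {n: aic(ll, hmm_free_parameters(n, n_symbols))
              for n, ll in log_liks.items()}
best_n = min(bic_scores, key=bic_scores.get)  # lowest BIC wins
```

In this made-up example the small gain in log-likelihood from extra states does not offset the parameter penalty, so the two-state model is selected.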

3.7. Adaptive HMM

In typical speech recognition systems, a speaker's accent may result in poor performance of the system. To address this problem, the concept of speaker adaptation has been introduced, where a small amount of data from the specific speaker is used to modify the speaker-independent system and improve its performance. Similarly, in machine condition monitoring, the health of a machine deteriorates before final failure occurs. In addition, although the training data and the test data for a classification system are required to come from the same kind of mechanical system, differences still exist due to assembly eccentricity and other machining factors. Both kinds of problems may change the HMM parameters and affect the classification result. Thus it is necessary to use adaptive HMM modelling to update the model so as to enhance the recognition performance over time.

Firstly, it is necessary to verify that the recognition result for the current observation is correct. If the confidence level is sufficient, the observation sequence should be used to update the corresponding model. In this paper, we use the likelihood difference, i.e. the difference between the highest likelihood score and the second highest likelihood score, as the decision criterion. The reason is that, for a correct recognition, the likelihood difference between the selected model and the other models should be significant. So, given a test sequence, we compare its likelihood difference with a predefined threshold, and update the model only if the difference exceeds the threshold. In practice, this threshold can be determined by performing experiments on a cross-validation data set [20].
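The likelihood-difference criterion can be expressed as a small decision function. This is a sketch under the assumption that the scores are log-likelihoods and that adaptation is triggered when the margin exceeds the threshold; the score values, the threshold and the function name are hypothetical.

```python
def should_adapt(scores, threshold):
    """Return (best_label, margin, adapt), where margin is the gap between
    the highest and the second highest likelihood scores."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, top), (_, second) = ranked[0], ranked[1]
    margin = top - second
    return best, margin, margin > threshold

# Assumed log-likelihood scores of one test sequence under two models.
best, margin, adapt = should_adapt({"normal": -52.3, "failure": -61.8}, threshold=5.0)

# A sequence with a small margin is classified but not used for adaptation.
_, small_margin, adapt_ambiguous = should_adapt(
    {"normal": -52.3, "failure": -53.0}, threshold=5.0)
```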

A standard maximum a posteriori (MAP) adaptation technique [21] is then employed to adapt the HMM. That is, given an existing HMM $\lambda_{\mathrm{old}}$ and observation $O$, we estimate a new HMM, $\lambda = (A, B, \pi)$, using $\lambda_{\mathrm{old}}$ as the initial parameters in the re-estimation procedure. The application of this method will be discussed in the validation section.

4. Evaluation of the proposed classification system

In this section, the two stages of the classification system are validated separately. The vibration data are provided by the Applied Research Laboratory at Pennsylvania State University [22] on three test runs of single reduction helical gearboxes. The test rig is in a brand-new state at the beginning of each experiment and runs until gear tooth failure. In addition, each experiment included two kinds of workload conditions. That is, it started at normal workload (Condition #1) and after some time the workload was doubled or tripled (Condition #2) to accelerate the experiment. Three sets of experimental data are utilised in stage 1 validation, namely TR#5, TR#10 and TR#12. In stage 2 validation, failure data from both TR#5 and TR#10 are chosen to model two different failure modes (adjacent tooth failure in TR#5 and distributed tooth failure in TR#10) and the data from normal condition are used to model the third machine condition (normal). Although we have several other test run data with different failure modes, they cannot be used in this experiment because the model and mechanical specifications are different.

4.1. Several issues in HMM modelling

In HMM modelling, several issues need to be clarified before model training and testing. For example, in this classification system, the hidden state does not have physical meaning, thus when choosing the optimal number of hidden states, one point that needs to be considered is the impact of hidden states on HMM performance. Further, due to the restriction of available test run data, we also need to find the optimal number of training data because in some data sets only a very limited number of normal condition data is available, while in other test runs the number of failure data is limited. This section investigates these issues and helps to find the optimal numbers for further modelling. The vibration data utilised in this section come from TR#10, which has 148 data files, with 37 of them coming from normal condition.

Firstly, we need to find the optimal number of hidden states. Since in HMM training the EM algorithm starts from a randomly selected initial point, and this point influences the estimation result, we try to eliminate this factor by repeating the estimation several times under the same model structure (e.g. number of hidden states). Table 1 shows the results of HMM modelling for normal condition:

In Table 1, LL represents the log-likelihood of a trained model given test data. A larger LL means a better model. When the number of hidden states increases from 2 to 7, the two model selection criteria, AIC and BIC, decrease at the beginning. When the number of hidden states = 4, they reach their minimum values. As discussed previously, these two criteria should be as small as possible, thus we prefer 4 as the optimal number of hidden states. A further increase in the number of hidden states increases both values.


Table 1
HMM modelling for normal condition in TR#10

| Hidden states | LL | AIC | BIC |
| --- | --- | --- | --- |
| 2 | -26.3277 | 114.82 | 159.58 |
| 3 | -22.2419 | 108.07 | 153.68 |
| 4 | -16.8245 | 95.51 | 140.08 |
| 5 | -15.1752 | 96.65 | 144.17 |
| 6 | -12.7183 | 110.05 | 172.42 |
| 7 | -9.5983 | 116.53 | 186.32 |

Table 2
HMM modelling for failure condition in TR#10

| Hidden states | LL | AIC | BIC |
| --- | --- | --- | --- |
| 2 | -60.5094 | 278.66 | 489.82 |
| 3 | -49.3828 | 244.76 | 439.9 |
| 4 | -39.3478 | 195.54 | 351.64 |
| 5 | -39.4667 | 251.63 | 482.38 |
| 6 | -33.9862 | 258.57 | 513.37 |
| 7 | -32.4253 | 332.88 | 699.78 |

Table 3
Investigation of training sequence length in TR#10

| Length of training sequence | Correct | Incorrect |
| --- | --- | --- |
| 5 | 129 | 19 |
| 6 | 132 | 16 |
| 7 | 134 | 14 |
| 8 | 135 | 13 |


Similarly, we find the optimal number of hidden states under the failure condition. The result is shown in Table 2. In this table, LL increases overall, whereas AIC and BIC decrease at the beginning and reach their lowest values when the number of hidden states = 4. So, again, the optimal number of hidden states = 4 in the failure condition.
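The selection rule applied to Tables 1 and 2 amounts to taking the number of hidden states with the smallest criterion value. A minimal sketch, with the AIC and BIC values transcribed from the two tables and a hypothetical helper name `best_order`:

```python
# (AIC, BIC) by number of hidden states, transcribed from Tables 1 and 2 (TR#10).
normal = {2: (114.82, 159.58), 3: (108.07, 153.68), 4: (95.51, 140.08),
          5: (96.65, 144.17), 6: (110.05, 172.42), 7: (116.53, 186.32)}
failure = {2: (278.66, 489.82), 3: (244.76, 439.90), 4: (195.54, 351.64),
           5: (251.63, 482.38), 6: (258.57, 513.37), 7: (332.88, 699.78)}

def best_order(table, column):
    """Number of hidden states minimising the chosen criterion (0 = AIC, 1 = BIC)."""
    return min(table, key=lambda k: table[k][column])

for name, table in (("normal", normal), ("failure", failure)):
    print(name, "AIC ->", best_order(table, 0), "BIC ->", best_order(table, 1))
```

Both criteria select 4 hidden states for both conditions, which matches the choice made in the text.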

To investigate the impact of the training sequence length on the performance of the model, we choose 5–8 data files as training data for each model. That is, the length of training sequence changes from 5 to 8. Table 3 shows the analysis results.

From Table 3, we can see that increasing the number of training files improves the performance. However, the computational time also increases. It is reasonable to choose the optimal length = 7.
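The trade-off can be made concrete by converting the counts in Table 3 into accuracies and looking at how many additional files are classified correctly for each extra training file. This is only a worked calculation on the published counts; the variable names are illustrative.

```python
# (correct, incorrect) by training sequence length, transcribed from Table 3.
table3 = {5: (129, 19), 6: (132, 16), 7: (134, 14), 8: (135, 13)}

def accuracy(correct, incorrect):
    return correct / (correct + incorrect)

acc = {length: accuracy(*counts) for length, counts in table3.items()}
# Extra correctly classified files gained by adding one more training file.
gain = {length: table3[length][0] - table3[length - 1][0] for length in (6, 7, 8)}

for length in table3:
    print(f"length {length}: {100 * acc[length]:.1f}%")
print(gain)
```

The gain shrinks from 3 to 2 to 1 file as the length grows from 6 to 8, so the benefit of an eighth training file is small relative to its extra computational cost.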

4.2. Stage 1 validation: fault detection

As mentioned previously, the function of stage 1 is fault detection. So two models need to be evaluated, corresponding to normal and failure conditions, respectively. In this section, 3 test runs are chosen to validate the system, namely, TR#5, TR#10 and TR#12. Tables 4–6 provide the test run time specifications of TR#5, TR#10 and TR#12.

Table 4
Time specification of workload change and failure in TR#10

| | Time period | Time stamp | File number | Workload |
| --- | --- | --- | --- | --- |
| Condition #1 | 11/17/97 16:20 (GMT)–11/21/97 16:20 (GMT) | 000–190 | 1–12 | 540 lbs (100% max) |
| Healthy state | 11/17/97 16:20 (GMT)–11/23/97 04:30 (GMT) | 000–265 | 1–37 | |

Table 5

Table 6

| | Time period | Time stamp | File number | Workload |
| --- | --- | --- | --- | --- |
| Condition #2 | 2/23/98 21:00 (GMT)–2/24/98 07:46 (GMT) | 192–222 | 125–155 | 1665 lbs (300% max) |

Tables 7–9 provide analysis results in TR#5, TR#10 and TR#12, respectively. In these experiments, the number of hidden states in each model is 4 and the length of training sequence is 7. The model chosen here is a discrete HMM, which means the transition probability distribution A and the observation symbol probability distribution B are discrete. The modulus maxima distribution is selected as the feature observation vector in the model. Since the noise creates singularities whose Lipschitz exponent is negative (see, e.g. [9]), we can discriminate between the modulus maxima lines generated by the white noise and the lines generated by the signal by looking at the evolution of their amplitudes across scales. From inequality (5), the noise lines cannot span to higher scales. Based on this observation, modulus maxima lines produced by noise are usually very short and can be discarded before classification. Each feature vector includes 30 elements, corresponding to 30 modulus maxima lines in a given scale region. The log-likelihood difference threshold here is 10, chosen on the basis of the analysis of the experiment. The performance of stage 1 classification is excellent, with the total accuracy in the three experiments being 91.6%, 91.9% and 93.5%, respectively.

Table 7
Validation of stage 1 classification system in TR#5

| Results of detection | Actual condition: Normal | Actual condition: Failure | Total accuracy |
| --- | --- | --- | --- |
| Normal | 66 | 2 | |
| Failure | 5 | 10 | 91.6% |
| Accuracy | 93.0% | 83.3% | |

Table 8
Validation of stage 1 classification system in TR#10

| Results of detection | Actual condition: Normal | Actual condition: Failure | Total accuracy |
| --- | --- | --- | --- |
| Normal | 30 | 5 | |
| Failure | 7 | 106 | 91.9% |

Table 9
Validation of stage 1 classification system in TR#12

| Results of detection | Actual condition: Normal | Actual condition: Failure | Total accuracy |
| --- | --- | --- | --- |
| Normal | 135 | 5 | |
| Failure | 5 | 10 | 93.5% |

4.3. Stage 2 validation: condition classification

As mentioned previously, the failure mode in TR#5 is adjacent gear tooth failure and in TR#10, it is distributed tooth failure. In addition, due to the limited data available, the third condition is defined as normal condition and data is randomly selected from TR#5 and TR#10. Thirty data files are chosen from both test runs for model testing. Corresponding to the three machine conditions, three HMMs have been set up and estimated using the training data. Table 10 demonstrates excellent classification performance in stage 2 of this system, with 141 of the 153 test files classified correctly.

Table 10
Results of condition classification in stage 2

| Machine conditions | Decision: Correct | Decision: Incorrect |
| --- | --- | --- |
| Adjacent tooth failure | 11 | 1 |
| Distributed tooth failure | 102 | 9 |
| Normal | 28 | 2 |
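The classification step itself, scoring an observation sequence under each condition model and picking the most likely one, can be sketched with the scaled forward algorithm for a discrete HMM. Everything numeric below is hypothetical: the toy models have 2 hidden states and 3 symbols rather than the 4 states and modulus-maxima features used in the experiments, and the function names are illustrative.

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(obs | pi, A, B) for a discrete HMM."""
    alpha = pi * B[:, obs[0]]                  # initialisation
    total = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = (alpha @ A) * B[:, obs[t]]  # induction step
        scale = alpha.sum()
        total += np.log(scale)                 # log-likelihood is the sum of log scales
        alpha = alpha / scale                  # rescale to avoid underflow
    return total

def classify(obs, models):
    """Pick the model with the highest log-likelihood for the sequence."""
    scores = {name: log_likelihood(obs, *lam) for name, lam in models.items()}
    return max(scores, key=scores.get), scores

# Toy parameters (hypothetical), one HMM per machine condition.
pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
models = {
    "normal": (pi, A, np.array([[0.80, 0.15, 0.05], [0.60, 0.30, 0.10]])),
    "adjacent tooth failure": (pi, A, np.array([[0.10, 0.70, 0.20], [0.20, 0.60, 0.20]])),
    "distributed tooth failure": (pi, A, np.array([[0.05, 0.15, 0.80], [0.10, 0.30, 0.60]])),
}
label, scores = classify([0, 0, 1, 0, 0], models)
print(label)
```

Working with log-likelihoods and per-step rescaling keeps the computation stable for long sequences, and the resulting scores can be fed directly into a likelihood-difference check.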

4.4. A comparison of two model parameter re-estimation methods

The HMM parameter set $\lambda$ includes the initial state distribution $\pi$, the state transition probability distribution $A = \{a_{ij}\}$ and the observation symbol probability distribution in state $j$, $B = \{b_j(k)\}$. That is, $\lambda = (\pi, A, B)$. In adaptive HMM modelling, the optimisation of model parameters to best describe (maximise the probability of) a given observation sequence is the crucial step in model training. For example, the condition of a machine deteriorates as it is running and finally goes to failure. The deterioration is a process of quantitative change and the model corresponding to the machine condition must be updated to follow the change. The classification performance is enhanced by the adaptive modelling technique.

However, after a long system running time, the available history data for re-training is so large that it requires considerable time and memory to update the model, which is usually unacceptable. In this section, we compare two model re-estimation methods, Methods 1 and 2, with data from normal condition and failure condition, respectively. Fig. 5 shows the scheme of these two methods.

Fig. 5. Two adaptive model-training methods. (Both schemes consist of initial training followed by testing and re-estimation over successive windows of the given window length; Method 1 uses $\lambda_{\mathrm{init}} = \lambda_{\mathrm{arbitrary}}$ and Method 2 uses $\lambda_{\mathrm{init}} = \lambda_{\mathrm{old}}$.)

In each experiment, two models are defined for normal and failure conditions, respectively, and 60 testing data files are selected from TR#5 and TR#10. In the experiment, after initial training, the system enters the testing and re-estimation stage. In the testing stage, a window is defined and the corresponding model is updated with the new testing data sequence equal to the window length. In Method 1, we have arbitrarily selected the initial set of model parameters, $\lambda_{\mathrm{init}} = \lambda_{\mathrm{arbitrary}}$, and updated the model parameters by considering the previous data set plus the most recent n (n = 5 is the window length) observations that have been classified as indicating the particular system condition. In Method 2, the re-estimation procedure is realised by utilising the previous model parameter set ($\lambda_{\mathrm{old}}$) as the initial point of the algorithm ($\lambda_{\mathrm{init}} = \lambda_{\mathrm{old}}$) and all the data as in Method 1. The analysis result is shown in Table 11. From the table, we can see that the performances of these two methods are very close but the total model parameter re-estimation time is significantly reduced using Method 2 (33.1 s versus 96.1 s).

Table 11
Comparison of two model re-estimation methods

| Method | Results of detection | Actual condition: Normal | Actual condition: Failure | Total model parameter re-estimation time |
| --- | --- | --- | --- | --- |
| Method 1 | Normal | 28 | 0 | |
| | Failure | 2 | 30 | 96.1 s |
| | Accuracy | 93.3% | 100.0% | |
| Method 2 | Normal | 29 | 0 | |
| | Failure | 1 | 30 | 33.1 s |
| | Accuracy | 96.7% | 100.0% | |

Since $\lambda_{\mathrm{old}}$ is close to $\lambda_{\mathrm{optimal}}$, the number of iterations needed to reach $\lambda_{\mathrm{optimal}}$ is greatly reduced with Method 2. This also indicates that a good selection of $\lambda_{\mathrm{init}}$ is important in the EM algorithm.
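The effect of the starting point can be explored with a small experiment. The sketch below is a simplification under stated assumptions: it uses plain Baum-Welch (maximum-likelihood) re-estimation on a single synthetic symbol sequence rather than the MAP procedure of [21], a 2-state, 3-symbol toy model, and hypothetical function names. It re-estimates the model on the previous data plus a window of n = 5 new observations, once from an arbitrary start (as in Method 1) and once from the previous parameters (as in Method 2), and reports how many likelihood evaluations each run needs.

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """Scaled forward-backward pass; returns alpha, beta, scales c and log-likelihood."""
    T, N = len(obs), len(pi)
    alpha, beta, c = np.zeros((T, N)), np.ones((T, N)), np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1]) / c[t + 1]
    return alpha, beta, c, np.log(c).sum()

def baum_welch(obs, pi, A, B, tol=1e-4, max_iter=500):
    """EM re-estimation of (pi, A, B) from one sequence, starting at the given point."""
    obs = np.asarray(obs)
    pi, A, B = pi.copy(), A.copy(), B.copy()
    history = []                              # log-likelihood before each update
    for _ in range(max_iter):
        alpha, beta, c, ll = forward_backward(obs, pi, A, B)
        history.append(ll)
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break                             # improvement below tolerance: stop
        gamma = alpha * beta                  # state posteriors, shape (T, N)
        xi = (alpha[:-1, :, None] * A[None, :, :]
              * (B[:, obs[1:]].T * beta[1:])[:, None, :] / c[1:, None, None])
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.stack([gamma[obs == k].sum(axis=0) for k in range(B.shape[1])], axis=1)
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B, history

def sample(T, pi, A, B, rng):
    """Draw a symbol sequence of length T from a discrete HMM."""
    s, out = rng.choice(len(pi), p=pi), []
    for _ in range(T):
        out.append(rng.choice(B.shape[1], p=B[s]))
        s = rng.choice(len(pi), p=A[s])
    return np.array(out)

def arbitrary_start(rng, n_states=2, n_symbols=3):
    """Random stochastic starting point (lambda_arbitrary)."""
    return (rng.dirichlet(np.ones(n_states)),
            rng.dirichlet(np.ones(n_states), size=n_states),
            rng.dirichlet(np.ones(n_symbols), size=n_states))

rng = np.random.default_rng(0)
data = sample(205, np.array([0.6, 0.4]),
              np.array([[0.9, 0.1], [0.2, 0.8]]),
              np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]), rng)
previous, window = data[:200], data[200:]     # window length n = 5
lam_old = baum_welch(previous, *arbitrary_start(rng))[:3]
all_data = np.concatenate([previous, window])

*lam1, hist_arbitrary = baum_welch(all_data, *arbitrary_start(rng))  # arbitrary start
*lam2, hist_warm = baum_welch(all_data, *lam_old)                    # warm start
print("arbitrary start:", len(hist_arbitrary), "likelihood evaluations")
print("warm start:     ", len(hist_warm), "likelihood evaluations")
```

Both runs see exactly the same data, so any difference in iteration count comes from the starting point alone. The counts depend on the random seed and the tolerance, so they should be read as an illustration of the mechanism rather than a reproduction of the timings in Table 11.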

5. Conclusions

In this paper, we have introduced a new feature extraction approach based on wavelet modulus maxima and proposed an HMM-based two-stage machine condition classification system. The modulus maxima distribution is utilised as the input observation sequence of the system. An adaptive algorithm is proposed and validated by three sets of real gearbox vibration data to classify two conditions: normal and failure. In addition, in condition classification (stage 2), three HMMs are set up to classify three different machine conditions, namely, adjacent tooth failure, distributed tooth failure and normal condition. The validation results show an excellent performance of the proposed classification system.

Acknowledgements

The authors wish to thank the referees for their constructive comments, the Natural Sciences and Engineering Research Council of Canada for financial support, and the Applied Research Laboratory at the Pennsylvania State University and the Department of the Navy, Office of the Chief of Naval Research (ONR) for providing the vibration data of a gearbox.


References

[1] M. Pechenizkiy, A. Tsymbal, S. Puuronen, PCA-based feature transformation for classification issues in medical diagnostics, in: Proceedings of the 17th IEEE Symposium on Computer-Based Medical Systems (CBMS'04), 2004.

[2] J.F. MacGregor, T. Kourti, Statistical process control of multivariate processes, Control Engineering Practice 3 (3) (1995) 403–414.

[3] P. Abdolmaleki, L.D. Buadu, H. Naderimansh, Feature extraction and classification of breast cancer on dynamic magnetic resonance imaging using artificial neural network, Cancer Letters 171 (2) (2001) 183–191.

[4] D. Kocur, R. Stanko, Order bispectrum: a new tool for reciprocated machine condition monitoring, Mechanical Systems and Signal Processing 14 (6) (2000) 871–890.

[5] K.C. Kwon, J.H. Kim, Accident identification in nuclear power plants using hidden Markov models, Engineering Applications of Artificial Intelligence 12 (1999) 491–501.

[6] H.M. Ertunc, K.A. Loparo, H. Ocak, Tool wear condition monitoring in drilling operations using hidden Markov models (HMM), International Journal of Machine Tools & Manufacture 41 (2001) 1363–1384.

[7] L. Atlas, M. Ostendorf, G.D. Bernard, Hidden Markov models for monitoring machining tool-wear, in: IEEE, ICASSP 2000, vol. 6, 2000, pp. 3887–3890.

[8] H. Ocak, K.A. Loparo, A new bearing fault detection and diagnosis scheme based on hidden Markov modeling of vibration signals, in: IEEE, ICASSP 2001, vol. 5, 2001, pp. 3141–3144.

[9] S. Mallat, W.L. Hwang, Singularity detection and processing with wavelets, IEEE Transactions on Information Theory 38 (2) (1992) 617–643.

[10] Q. Miao, V. Makis, An application of the modulus maxima distribution in machinery condition monitoring, Journal of Quality in Maintenance Engineering 11 (4) (2005) 375–387.

[11] Q. Miao, V. Makis, Condition classification of rotating machinery using wavelets and hidden Markov models, in: IIE Annual Research Conference Proceedings, Atlanta, GA, May 2005.

[12] Q. Miao, V. Makis, Extraction of machinery health index in CBM based on wavelet modulus maxima, in: FAIM 2004 (CD ROM), 2004.

[13] Q. Sun, Y. Tang, Singularity analysis using continuous wavelet transform for bearing fault diagnosis, Mechanical Systems and Signal Processing 16 (6) (2002) 1025–1041.

[14] A.N. Robertson, C.R. Farrar, H. Sohn, Singularity detection for structural health monitoring using Holder exponents, Mechanical Systems and Signal Processing 17 (6) (2003) 1163–1184.

[15] J.-C. Hong, Y.Y. Kim, H.C. Lee, Y.W. Lee, Damage detection using the Lipschitz exponent estimated by the wavelet transform: application to vibration modes of a beam, International Journal of Solids and Structures 39 (2002) 1803–1816.

[16] Z.R. Struzik, Wavelet methods in (financial) time-series processing, Physica A 296 (2001) 307–319.

[17] O.S. Sands, F.D. Garber, Pattern representation and syntactic classification of radar measurements of commercial aircraft, IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (2) (1990) 204–211.

[18] L.E. Baum, T. Petrie, G. Soules, N. Weiss, A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains, Annals of Mathematical Statistics 41 (1) (1970) 164–171.

[19] March software v. 2.10, User's Guide, pp. 43–55 (Chapter 4).

[20] X. Liu, T. Chen, Video-based face recognition using adaptive hidden Markov models, in: Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003.

[21] J.L. Gauvain, C.H. Lee, Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains, IEEE Transactions on Speech and Audio Processing 2 (2) (1994) 291–298.

[22] A.J. Miller, A new wavelet basis for the decomposition of gear motion error signals and its application to gearbox diagnostics, Master of Science Thesis, The Graduate School, The Pennsylvania State University, 1999.
