
BULETINUL INSTITUTULUI POLITEHNIC DIN IAŞI

Publicat de Universitatea Tehnică „Gheorghe Asachi” din Iaşi

Tomul LVIII (LXII), Fasc. 1, 2012

Secţia AUTOMATICĂ şi CALCULATOARE

DESIGN OF A WAVELET BASED DATA MINING TECHNIQUE

BY

ZOBNIN BORIS, SERGEI YENDIYAROV∗ and SERGEI PETRUSHENKO

Ural State Mining University, Russia

Department of Computer Science Yekaterinburg

Received: December 4, 2011 Accepted for publication: December 20, 2011

Abstract. In this article, we present a wavelet-based data mining technique (WBDMT), which can be used to extract useful features from a signal. The technique is based on the continuous wavelet transform (CWT). We first discuss a change-point detection algorithm based on the CWT, which is used to extract local features from a signal. We call these features «regimes». We then discuss a regime merging algorithm, which removes redundant regimes so that only useful information is extracted from the signal.

Key words: wavelet transform, data mining, change-point, signal processing.

2000 Mathematics Subject Classification: 94A12, 97R40, 68T10, 91C20.

1. Introduction

Technology now allows us to capture and store vast quantities of data. Finding patterns, trends, and anomalies in these datasets, and summarizing them with simple quantitative models, is one of the grand challenges of the information age: turning data into information and turning information into knowledge. There has been stunning progress in data mining and machine learning. Data mining has successfully provided solutions for finding information in data from bioinformatics, pharmaceuticals, banking, retail, sports and entertainment, etc. It has been one of the fastest growing fields in the computer industry. Many important problems in science and industry have been addressed by data mining methods, such as neural networks, fuzzy logic, decision trees, genetic algorithms, and statistical methods (Witten & Frank, 2005).

∗Corresponding author; e-mail: [email protected]

Many recently published works relate the wavelet transform to data mining. The paper (Prochazka & Kukal, 2008) is devoted to the use of the discrete wavelet transform both for signal preprocessing and for feature extraction from signal segments. Feature vectors belonging to separate signal segments are classified by a competitive neural network, used as one of the methods of cluster analysis and processing.

In the article (Marcin & Sorin, 2010), the authors present a system for texture-based probabilistic classification and localisation of 3D objects in 2D digital images. The objects are described by local feature vectors computed using the wavelet transform. In (Behera & Sareeta, 2011), a neural classifier scheme based on the discrete wavelet transform and the S-transform is used for time series data mining of power quality events that occur due to power signal disturbances.

The work (Villez et al., 2007) uses two approaches to data mining of time series. Both methods are based on the wavelet decomposition of data series and allow the localization of important characteristics of a time series in both the time and frequency domain. The first method is a common method based on the analysis of wavelet power spectra. The wavelet power analysis indicates the location of major features in frequency and time, but not their type or shape. The second method can be utilized in order to extract more detailed information by means of qualitative description of trends.

A clustering method that uses the wavelet transform is reported in (Zhang et al., 2006). Unsupervised feature extraction is carried out in order to improve the quality of time series clustering and to speed up the clustering process. The authors propose an unsupervised feature extraction algorithm for time series clustering using orthogonal wavelets. The features are defined as the approximation coefficients within a specific scale. The feature extraction algorithm selects the feature dimensionality by balancing two conflicting requirements, i.e., lower dimensionality and lower sum of squared errors between the features and the original time series. The main benefit of the proposed feature extraction algorithm is that the dimensionality of the features is chosen automatically.

Wavelets were used in (Sherwood & Derakhshani, 2009) for the classification of non-stationary electroencephalographic (EEG) signals for brain-computer interface (BCI) applications. Multiple wavelet families and decomposition methods are explored and compared as means of feature extraction for motor, affective and cognitive tasks. The support vector machine (SVM) classifier was selected based on its single-minimum convergence characteristics.

In the study (Phinyomark et al., 2011), the authors investigated the usefulness of extracting EMG (electromyography) features from a multiple-level wavelet decomposition of the EMG signal. Different levels of various mother wavelets were used to obtain the useful resolution components of the EMG signal. The optimal EMG resolution component (sub-signal) was selected, and the useful information signal was then reconstructed. Popular features, i.e. the mean absolute value and the root mean square, were extracted from the estimated EMG signal (the effective part of the EMG signal) in order to improve class separability.

The article (Yendiyarov & Petrushenko, 2011) describes an online change detection algorithm based on the continuous wavelet transform, the CUSUM algorithm and an AR model. It can be used to detect changes in signals in the presence of noise and abnormal bursts.

Finally, the article (Yendiyarov et al., 2011) presents a robust probabilistic online change detection algorithm based on the continuous wavelet transform. The authors state that, although they used the normal distribution for calculating probabilities, the algorithm can work well with other types of distributions because it has tuning parameters and uses the starting rule employed by control chart algorithms. The results for some test signals show that the algorithm works well and is able to detect all change points, even though the signals were sampled and influenced by noise.

In this article, we present a new data mining algorithm based on the CWT, which is used to find the change-points of a signal. We then apply our regime merging algorithm, which can be thought of as a special clustering algorithm (i.e. the clustering depends on the order of the input data) that organizes the extracted features and reduces their number so that they can be used efficiently later.

The main idea of our algorithm is to apply the continuous wavelet transform together with our change-point detection algorithm in order to extract information about the moments of change. The change-points divide the signal into parts that can be considered quasi-stationary. We will call these parts «regimes». Each regime can be characterized by its centre and the deviation around this centre, so every regime can be considered as a circle on a plane, described by its centre and radius. The number of such regimes can be excessive, so in order to remove the redundant regimes extracted by our algorithm, we developed a regime merging algorithm based on the mutual intersection area of intersecting regimes.

The following steps have to be taken in order to use the proposed technique:

1) modify the original signal;

2) find the continuous wavelet transform of the modified signal. After the transformation has been made, we take the absolute maxima and minima values of the transformation at each translation, which gives two signals. Finally, we aggregate these two signals into a new signal, which we will call the aggregated signal;

3) use the aggregated signal in our change-point detection algorithm in order to find the change-points of the original signal;

4) use the quasi-stationary parts of the signal to form the regime set $\mathbb{R}_t$ by estimating the expected values and the standard deviations of these parts;

5) apply the regime merging algorithm to reduce the number of redundant regimes.

In the following sections, we give a detailed description of each of these steps.

2. Original Signal Preprocessing

As mentioned before, we first have to find the continuous wavelet transform of the original signal. Before that, however, we have to preprocess the original signal in order to get the right results. Every wavelet function has the following property (Burrus & Ramesh, 1998):

$$\int_{-\infty}^{+\infty} \psi(t)\,dt = 0 \tag{1}$$

This property means that the mean value of a wavelet function is equal to zero. Because of it, the aggregated signal is seriously distorted at its beginning and at its end, where only a part of the wavelet function is multiplied with the original signal. Therefore, the aggregated signal has abnormal values at the beginning and at the end.

To deal with this problem, the original signal has to be extended before it is transformed, i.e. we have to add additional elements at the beginning and at the end of the signal. This can be done if we know the compact support of the wavelet function.

In our algorithm, we use the first order Gaussian wavelet function:

$$\psi(x) = -x\,e^{-x^{2}/2} \tag{2}$$

The first order Gaussian wavelet function can be treated as having compact support on the interval $x \in [-5; 5]$, since it is negligible outside of it. Therefore, for this particular wavelet function, the number of elements $N^{+}$ that should be added to the original signal can be calculated as follows:

$$N^{+} = L \cdot R \tag{3}$$

where: $R$ is the width of the right or the left part of the wavelet function (by the right and the left part we mean half of the compact support of the given wavelet function) and $L$ is the maximum value of the scale factor. The scale factor is related to the continuous wavelet transform and is used in formula (5). It should be set by the user depending on the signal's frequency characteristics.

Let $U(t)$ be the value of the original signal at the point $t$; then $U(1)$ is the value of the signal at the point $t = 1$ and $U(n)$ is the value of the original signal at the point $t = n$. Using this notation, the original signal is extended in the following way:

$$\tilde{U}(t) = U(0)^{N^{+}} + U(t) + U(N)^{N^{+}} \tag{4}$$

where: $N$ is the length of the signal, and the expression $U(0)^{N^{+}}$ means adding $N^{+}$ elements equal to $U(0)$ to the signal (likewise, $U(N)^{N^{+}}$ adds $N^{+}$ elements equal to $U(N)$ at the end). Using this modified signal, we can now apply the wavelet transform.
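The extension step can be sketched in a few lines of Python. This is a minimal illustration, assuming that the padding simply repeats the first and the last sample; the function and parameter names (`extend_signal`, `max_scale`, `half_support`) are hypothetical and stand for $L$ and $R$ of formula (3).

```python
from typing import List, Sequence


def extend_signal(u: Sequence[float], max_scale: int, half_support: int) -> List[float]:
    """Pad u on both sides with N+ = max_scale * half_support edge values."""
    if not u:
        raise ValueError("signal must not be empty")
    n_plus = max_scale * half_support      # formula (3): N+ = L * R
    left = [u[0]] * n_plus                 # N+ copies of the first sample
    right = [u[-1]] * n_plus               # N+ copies of the last sample
    return left + list(u) + right          # formula (4), "+" read as concatenation


# Assumed example values: L = 2 and R = 5 (half of the interval [-5; 5]).
extended = extend_signal([3.0, 4.0, 5.0], max_scale=2, half_support=5)
```

With these values, 10 samples are added on each side, so the wavelet at the largest scale never reaches past the padded signal when it is centred on an original sample.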

The general formula for the continuous wavelet transform is (Burrus & Ramesh, 1998):

$$W(a, b) = \frac{1}{\sqrt{a}} \sum_{t=0}^{N-1} \tilde{U}(t)\,\psi\!\left(\frac{t - b}{a}\right) \tag{5}$$

where: a is the scale factor and b − the translation factor. The scale and translation of wavelets determine how the mother wavelet dilates and translates along the time or space axis. A scale factor greater than one corresponds to a dilation of the mother wavelet along the horizontal axis and a positive translation corresponds to a translation to the right of the scaled wavelet along the horizontal axis.
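A direct evaluation of the transform for the first order Gaussian wavelet can be sketched as follows. The sketch assumes integer scales, a $1/\sqrt{a}$ normalisation, and a wavelet that is ignored outside $|t - b| \le 5a$; the function names are hypothetical, and a production implementation would normally use a convolution rather than nested loops.

```python
import math
from typing import List, Sequence


def gauss1(x: float) -> float:
    """First order Gaussian wavelet: psi(x) = -x * exp(-x^2 / 2)."""
    return -x * math.exp(-x * x / 2.0)


def cwt(signal: Sequence[float], scales: Sequence[int]) -> List[List[float]]:
    """Return rows of coefficients; result[i][b] is W(scales[i], b)."""
    n = len(signal)
    rows: List[List[float]] = []
    for a in scales:
        norm = 1.0 / math.sqrt(a)          # assumed 1/sqrt(a) normalisation
        row: List[float] = []
        for b in range(n):
            # Only samples inside the assumed support |t - b| <= 5a contribute.
            lo, hi = max(0, b - 5 * a), min(n, b + 5 * a + 1)
            row.append(norm * sum(signal[t] * gauss1((t - b) / a)
                                  for t in range(lo, hi)))
        rows.append(row)
    return rows
```

On a constant stretch of the signal the coefficients cancel to approximately zero, while a step in the mean produces a pronounced extremum at the step. Near the ends of an unpadded signal the window is truncated and the cancellation fails, which is the edge effect that the signal extension is meant to remove.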

Now when we know how to calculate the continuous wavelet transform of our modified signal, we can find the values of maxima and minima of our wavelet transform. We need these signals in order to find change-points of our original signal.

The formulae for calculating the absolute maxima and minima values of the wavelet transform are:

$$\hat{W}_{\min}(b) = \min_{1 \le a \le K} W(a, b), \qquad \hat{W}_{\max}(b) = \max_{1 \le a \le K} W(a, b) \tag{6}$$

where: $K$ is the maximum scale factor of the wavelet transform and $b$ varies from $0$ to $N - 1$, so as a result we get two signals, $\hat{W}_{\min}(b)$ and $\hat{W}_{\max}(b)$. As mentioned before, $K$ should be set by the user depending on the signal's frequency characteristics.

Then we reduce the signals $\hat{W}_{\min}(b)$ and $\hat{W}_{\max}(b)$ back to the original size. To do so, we remove $N^{+}$ elements from the beginning and from the end of these signals. Let us denote the reduced signals as $W_{\max}(b)$ and $W_{\min}(b)$.

After this manipulation has been made, we can calculate the aggregated signal, which will be used later. The aggregated signal is calculated as follows:

$$W_{\Sigma}(b) = W_{\max}(b) + W_{\min}(b), \qquad b = 0, 1, \ldots, N - 1 \tag{7}$$

where: $b \in [0, N - 1]$ and $N$ is the length of the original signal. We calculate $W_{\Sigma}(b)$ using this formula because we have to preserve information about the maxima as well as the minima of the wavelet transform. So, we simply add the corresponding elements of $W_{\min}(b)$ and $W_{\max}(b)$ together.
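The construction of the aggregated signal can be illustrated with a short sketch. It assumes that the transform is available as one row of coefficients per scale and that the extrema are the signed minimum and maximum over the scales; the function name `aggregate` and the argument `n_plus` (the number of padded samples on each side) are hypothetical.

```python
from typing import List, Sequence


def aggregate(w: Sequence[Sequence[float]], n_plus: int) -> List[float]:
    """Build the aggregated signal from coefficient rows w[scale][translation]."""
    n_ext = len(w[0])
    # Extrema over all scales at each translation b.
    w_min = [min(row[b] for row in w) for b in range(n_ext)]
    w_max = [max(row[b] for row in w) for b in range(n_ext)]
    # Drop the n_plus padded samples at both ends to restore the original length.
    w_min = w_min[n_plus:n_ext - n_plus]
    w_max = w_max[n_plus:n_ext - n_plus]
    # Element-wise sum of the two reduced signals.
    return [hi + lo for hi, lo in zip(w_max, w_min)]


coefficients = [[1.0, 2.0, 3.0, 4.0],
                [0.0, 5.0, -1.0, 2.0]]
aggregated = aggregate(coefficients, n_plus=1)   # [7.0, 2.0]
```

Keeping both extrema means that a change shows up in the aggregated signal whether it drives the coefficients strongly positive or strongly negative.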

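The formulae for calculating the expected value and the variance of the signals $W_{\min}(b)$ and $W_{\max}(b)$ over all scale factors are given by (shown for $W_{\max}(b)$; the formulae for $W_{\min}(b)$ are analogous):

$$E[W_{\max}(b)] = \frac{1}{N}\sum_{b=0}^{N-1} \max_{1 \le a \le K} W(a, b), \qquad D[W_{\max}(b)] = \frac{1}{N}\sum_{b=0}^{N-1} \big(W_{\max}(b) - E[W_{\max}(b)]\big)^{2} \tag{8}$$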

The variance of the aggregated signal can be found from the expression:

$$D[W_{\Sigma}(b)] = D[W_{\max}(b)] + D[W_{\min}(b)] \tag{9}$$
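For reference, the additive form of the variance is a special case of the general identity for the variance of a sum. The covariance term below is not discussed in the text and is shown only to make the underlying assumption explicit:

```latex
D[W_{\max}(b) + W_{\min}(b)]
  = D[W_{\max}(b)] + D[W_{\min}(b)]
  + 2\,\mathrm{Cov}\big(W_{\max}(b),\, W_{\min}(b)\big)
```

The additive expression is therefore exact when the two signals are uncorrelated and should otherwise be read as an approximation that serves as an initial value.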

Although the mean value of the wavelet coefficients is equal to zero, this is not true in our case (because we take absolute values of the wavelet coefficients). So, the mean value of the aggregated signal is not equal to zero. Therefore, the expected value of the aggregated signal can be calculated similarly to the variance:

$$E[W_{\Sigma}(b)] = E[W_{\max}(b)] + E[W_{\min}(b)] \tag{10}$$

The initial values of the variance and the expected value are needed by our change-point detection algorithm.

3. Change-Point Detection Algorithm

In the previous section, we described the necessary transformations of the original signal. As a result of these transformations, we obtained the aggregated signal $W_{\Sigma}$. In this section, we discuss our change-point detection algorithm, which works with the signal $W_{\Sigma}$. We construct this algorithm on the basis of statistical hypothesis testing, which is the most common basis for change detection algorithms (e.g. sequential Wald analysis, CUSUM, etc.) (Bassville & Nikiforov, 1993; Wald, 1947).

At the start of the procedure, there are two hypotheses, $H_0$ and $H_1$. The first one is called the null hypothesis and is, for the time being, accepted. The second one is called the alternative hypothesis; it is the hypothesis one tries to prove. Let $p_u^{\lim}$ denote the probability of the event $H_1$; conversely, $\alpha = 1 - p_u^{\lim}$ is the probability of the event $H_0$. In accordance with the multiplication theorem for independent events, we can define the joint probability as follows:

$$P_{\Sigma}(t) = P(W_{\Sigma}(1)) \cdot P(W_{\Sigma}(2)) \cdot \ldots \cdot P(W_{\Sigma}(k)) = \prod_{t=1}^{k} P(W_{\Sigma}(t)) \tag{11}$$

where: $t$ is a point in time and $k$ is the number of successive events.

We continue the inspection as long as $P_{\Sigma}(t) > p_u^{\lim}$ (by inspection we mean the iterative process of recalculating the probability $P_{\Sigma}(t)$ each time a new value of $W_{\Sigma}(t)$ is processed). To make the algorithm more robust with respect to the underlying distribution of the signal $W_{\Sigma}$, we start the inspection when the signal $W_{\Sigma}$ crosses a given level $H_U$ (this level can be chosen based on the statistical properties of the signal). The parameter $H_U$ can be a vector when more customization is needed:

$$H_U = (h_0, h_1, \ldots, h_k, \ldots, h_n)^{T} \tag{12}$$

So, we apply the technique used by control chart algorithms (Shewhart, 1939), and we start to calculate the joint probability when $W_{\Sigma}(t) \ge H_U(j)$. As stated before, this condition is similar to the conditions used by control charts, and it is used to avoid false alarms. Here $H_U(j)$ denotes the $j$-th element of the vector $H_U$:

$$H_U(j) = E[W_{\Sigma}(b)] + \Lambda(j) \cdot \sqrt{D[W_{\Sigma}(b)]} \tag{13}$$

where: $\Lambda = (\lambda_0, \lambda_1, \ldots, \lambda_n)$ is the vector of parameters. Then we calculate the probability that $W_{\Sigma}(t) \ge H_U$ (where $W_{\Sigma}(t)$ is the element at the point $t$ and $t$ denotes a point in time), which follows from probability theory:

$$P(W_{\Sigma}(k) \ge H_U(j)) = 1 - \int_{-\infty}^{H_U(j)} f(W)\,dW \tag{14}$$

where: $f(W)$ is the probability density function of the signal $W_{\Sigma}$.

In the case of the normal distribution, Eq. (14) can be rewritten in this form:

$$P\big(|W_{\Sigma}(k) - E[W_{\Sigma}(k)]| < \varepsilon\big) = \Phi\!\left(\varepsilon / \sqrt{D[W_{\Sigma}(b)]}\right) \tag{15}$$

Taking into account that we have the one-sided condition $W_{\Sigma}(t) \ge H_U$, we can finally write the expression in this form:

$$P\big(W_{\Sigma}(k) - E[W_{\Sigma}(k)] \ge H_U(j)\big) = \frac{1}{2} - \frac{1}{2}\,\Phi\!\left(H_U(j) / \sqrt{D[W_{\Sigma}(b)]}\right) \tag{16}$$

where: $\Phi(x)$ is the Laplace function, which is given by (Degroot & Schervish, 2011):

$$\Phi(x) = \frac{2}{\sqrt{2\pi}} \int_{0}^{x} e^{-z^{2}/2}\,dz \tag{17}$$

and $\lambda_j$ are the tuning parameters, which can be used to define specific behaviour of the algorithm; they depend on both the signal characteristics and the type of changes.
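For the normal case, the one-sided probability can be evaluated with the error function from the Python standard library, because the Laplace function in the form used here equals $\mathrm{erf}(x/\sqrt{2})$. The sketch below is only an illustration; the function names are hypothetical, and the second argument is assumed to be the variance of the aggregated signal.

```python
import math


def laplace(x: float) -> float:
    """Laplace function (2 / sqrt(2*pi)) * integral_0^x exp(-z^2 / 2) dz."""
    return math.erf(x / math.sqrt(2.0))


def tail_probability(h: float, variance: float) -> float:
    """P(W - E[W] >= h) for a normally distributed W with the given variance."""
    return 0.5 - 0.5 * laplace(h / math.sqrt(variance))


# A deviation of one standard deviation (h = 2 with variance 4)
# leaves roughly 15.9 % of the probability mass above the level.
p = tail_probability(2.0, 4.0)
```

Multiplying such tail probabilities over successive samples gives the joint probability: a few consecutive samples above a high level drive the product below a small tolerance very quickly.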

To make the algorithm more robust to random abnormal bursts, we set $P_{\Sigma}(t) = 1$ every time $W_{\Sigma}(t)$ becomes less than $H_U$:

$$W_{\Sigma}(t) < E[W_{\Sigma}(b)] + \Lambda(j) \cdot \sqrt{D[W_{\Sigma}(b)]} \tag{18}$$

After we have detected a change-point, we have to turn off the detector and restart it only when we have come down from the «hill». For this reason, we introduce a new parameter $H_D$, defined as follows:

$$H_D = E[W_{\Sigma}(b)] + \beta \cdot \sqrt{D[W_{\Sigma}(b)]} \tag{19}$$

where: $\beta$ is a tuning parameter which, along with $\lambda_j$, can be set depending on both the signal characteristics and the type of changes (e.g. for the normal distribution we can set $\Lambda = [1, 2, 3]$ and $\beta = 0$).

The parameter $H_D$ is used in the following way: after we have detected a change in the signal, we start the inspection again only when $W_{\Sigma}(t) \le H_D$.

Further on, we introduce the parameter $P_{\Sigma}^{D}(t)$. It has almost the same meaning as the joint probability discussed earlier. The only difference lies in the level that is used ($H_D$ instead of $H_U$): we start to calculate $P_{\Sigma}^{D}(t)$ as soon as $W_{\Sigma}(t) \le H_D$, and we accumulate the probabilities that the random variable $W_{\Sigma}(t)$ has crossed the boundary value $H_D$.

Analogously to the joint probability, we set $P_{\Sigma}^{D}(t) = 1$ (this is done to avoid false alarms) whenever $W_{\Sigma}(t)$ becomes greater than $H_D$:

$$W_{\Sigma}(t) > E[W_{\Sigma}(b)] + \beta \cdot \sqrt{D[W_{\Sigma}(b)]} \tag{20}$$

Let $p_d^{\lim}$ denote the tolerance probability for this case. Our experience shows that in practice it can be set so that $p_u^{\lim} \approx p_d^{\lim}$: we tested the algorithm on different types of distributions with different sets of parameters, and our experiments show that it performs well with $p_u^{\lim} \approx p_d^{\lim}$. For further reading, please refer to (Yendiyarov & Petrushenko, 2011; Yendiyarov et al., 2011). It is worth noting that $H_D$ can be used as a tuning parameter. For instance, in the case of close events (i.e. signal changes), it is recommended to increase the value of $H_D$, because when events are close to each other, a small value of $H_D$ will cause frequent false alarms.

Fig. 1 shows the results of our change-point detection algorithm for a test signal. In this article, the test signal is a real signal that comes from an iron ore sintering plant; more precisely, we use the FeO content of the green mix. This signal is not normally distributed. To see how the algorithm works on normally distributed signals, please refer to (Yendiyarov & Petrushenko, 2011; Yendiyarov et al., 2011). The parameters were selected based on the results of experiments.

[Figure 1: plot of the test signal; horizontal axis from 0 to 1100, vertical axis from 12 to 20.]

Fig. 1 – Application of our change-point detection algorithm with parameters $\Lambda = (1.5, 2, 2.5, 3)^{T}$, $\beta = 0$, $p_u^{\lim} = 0.005$ (blue – the original signal, red – linear approximations of quasi-stationary parts of the signal).
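The detector described in this section can be sketched as a small two-state loop. This is one possible reading rather than a reference implementation, and it rests on several assumptions: probabilities are computed under the normal distribution; at each step the highest crossed level of $\Lambda$ supplies the factor of the joint probability; a change-point is reported once the product falls to the tolerance or below; and the detector is re-armed once the analogous product for the lower level falls to its tolerance. All names are hypothetical.

```python
import math
from statistics import fmean, pstdev
from typing import List, Sequence


def normal_tail(z: float) -> float:
    """P(X >= mean + z * std) for a normally distributed X."""
    return 0.5 - 0.5 * math.erf(z / math.sqrt(2.0))


def detect_change_points(w: Sequence[float], lambdas: Sequence[float],
                         beta: float, p_u_lim: float, p_d_lim: float) -> List[int]:
    mean, std = fmean(w), pstdev(w)
    levels = sorted(lambdas)
    p_down_step = 1.0 - normal_tail(beta)     # P(W <= H_D) for one sample
    change_points: List[int] = []
    armed, p_up, p_down = True, 1.0, 1.0
    for t, value in enumerate(w):
        z = (value - mean) / std if std > 0 else 0.0
        if armed:
            crossed = [lam for lam in levels if z >= lam]
            if crossed:
                p_up *= normal_tail(crossed[-1])   # factor for the highest level reached
                if p_up <= p_u_lim:                # joint probability small enough
                    change_points.append(t)
                    armed, p_up, p_down = False, 1.0, 1.0
            else:
                p_up = 1.0                         # below the upper level: reset
        elif z <= beta:
            p_down *= p_down_step                  # still below the lower level
            if p_down <= p_d_lim:
                armed, p_down = True, 1.0          # descended from the hill: re-arm
        else:
            p_down = 1.0                           # back above the lower level: reset
    return change_points


bursts = [0.0] * 50 + [10.0] * 3 + [0.0] * 50 + [10.0] * 3 + [0.0] * 50
found = detect_change_points(bursts, [1.5, 2.0, 2.5, 3.0], beta=0.0,
                             p_u_lim=0.005, p_d_lim=0.005)
```

In this toy input each burst exceeds the highest level at once, so a single sample is enough to report a change; with $\beta = 0$ the detector then needs eight consecutive samples at or below the mean before it is re-armed, since $0.5^{8} < 0.005$.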

4. Forming Set of Regimes

After the change-point algorithm has been applied to the signal, we usually get a nonempty set $\mathbb{C}$ of change points:

$$\mathbb{C} = \{\upsilon_1, \upsilon_2, \ldots, \upsilon_k\} \tag{21}$$

where: $\upsilon_j$ is the index at which a change in the mean of the signal has occurred.

As stated before, each regime can be characterized by its centre and deviation, so in order to get this information, we use estimates of the expected value and the standard deviation of each quasi-stationary part of the signal. The extracted regimes form a set of regimes $\mathbb{R}$. Let $\mathbb{R}_t$ be the regime set at iteration $t$; a regime with index $j$ is then denoted $r_j \in \mathbb{R}_t$. (We perform iterations until the algorithm converges, i.e. until the set of regimes $\mathbb{R}_t$ no longer changes; further details are given later.) Using this notation, the regime set can be written:

$$\mathbb{R}_t = \{r_1, \ldots, r_n\} \tag{22}$$

Each regime $r_j$ can be characterized by the coordinates of its centre $c_j = (x_j, y_j)$ and the deviation $\sigma_j$ around this centre:

$$r_j = (c_j, \sigma_j) \tag{23}$$

Based on the set $\mathbb{C}$, we can construct a set of quasi-stationary parts $\mathbb{Z}$ (for instance, the first quasi-stationary part of the signal contains the elements with indices $[1, \ldots, \upsilon_1]$, the second part contains the elements with indices $[(\upsilon_1 + 1), \ldots, \upsilon_2]$, and so on; see Fig. 1):

$$\mathbb{Z} = \{ z_{1}^{\upsilon_1},\ z_{\upsilon_1 + 1}^{\upsilon_2},\ \ldots,\ z_{\upsilon_k + 1}^{N} \} \tag{24}$$

where: $N$ is the length of the original signal.

Let $p_j^{k} \in z_{\upsilon_{(k-1)}+1}^{\upsilon_k}$ be the element $j$ of a quasi-stationary part $z_{\upsilon_{(k-1)}+1}^{\upsilon_k} \in \mathbb{Z}$. Then we can write the following equations for the mean and the standard deviation of $z_{\upsilon_{(k-1)}+1}^{\upsilon_k}$:

$$E\big[z_{\upsilon_{(k-1)}+1}^{\upsilon_k}\big] = \frac{1}{N_K}\sum_{i=1}^{N_K} p_i^{k} = \mu_k, \qquad \sqrt{D\big[z_{\upsilon_{(k-1)}+1}^{\upsilon_k}\big]} = \sqrt{\frac{1}{N_K}\sum_{i=1}^{N_K} \big(p_i^{k} - \mu_k\big)^{2}} = \sigma_k \tag{25}$$

where: $N_K$ is the length of the quasi-stationary part $z_{\upsilon_{(k-1)}+1}^{\upsilon_k}$. So the standard deviation $\sigma_k$ can be considered as the regime deviation, while $\mu_k$ can be used as the coordinates of the centre of the regime $r_k$, i.e. $c_k = (\mu_k, \mu_k)$.
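Forming the regimes from the change-points amounts to splitting the signal and computing two statistics per part. In the sketch below the change-points are assumed to be given as the number of samples that precede each change, so that they can be used directly as slice boundaries; the names `Regime` and `form_regimes` are hypothetical.

```python
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple

Regime = Tuple[Tuple[float, float], float]    # ((x, y), sigma): centre and deviation


def form_regimes(u: Sequence[float], change_points: Sequence[int]) -> List[Regime]:
    """Turn each quasi-stationary part of u into a regime ((mu, mu), sigma)."""
    bounds = [0] + sorted(change_points) + [len(u)]
    regimes: List[Regime] = []
    for start, stop in zip(bounds, bounds[1:]):
        part = u[start:stop]
        if not part:
            continue                           # skip empty parts (repeated boundaries)
        mu = fmean(part)                       # mean of the part
        sigma = pstdev(part)                   # population standard deviation
        regimes.append(((mu, mu), sigma))      # centre c_k = (mu_k, mu_k)
    return regimes


regimes = form_regimes([1.0, 3.0, 10.0, 14.0], change_points=[2])
```

Each regime is then a circle on the plane whose centre lies on the diagonal and whose radius is the deviation of the corresponding part of the signal.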


5. Regime Merging Algorithm

In this section, we briefly describe our regime merging algorithm. Some details will be given in later sections.

On the basis of $\mathbb{R}_t$, we form a matrix $\Omega_t$ which contains the mutual areas of intersection of regimes $r_i$ and $r_j$; at iteration $t$ it has the following form:

$$\Omega_t = \begin{pmatrix} S_{11} & S_{12} & \ldots & S_{1n} \\ \ldots & \ldots & S_{ij} & \ldots \\ S_{n1} & S_{n2} & \ldots & S_{nn} \end{pmatrix} \tag{26}$$

where: $S_{ij}$ is the mutual area of intersection of regimes $r_i$ and $r_j$ (this area equals the intersection area of the regimes $r_i$ and $r_j$ divided by the sum of the areas of the regimes $r_i$ and $r_j$; we discuss how to calculate this quantity later, see formula (44)).

Based on the matrix $\Omega_t$, we can obtain a set of regimes $\mathbb{Q}_t$ that should be excluded from the regime set $\mathbb{R}_t$; it has the form $\mathbb{Q}_t = \{r_j, \ldots, r_m\}$.

Let us consider how to obtain the set of regimes $\mathbb{Q}_t$ that should be excluded.

First of all, we check whether some regime $r_j$ is a subset of another regime, i.e. $r_j \subset r_i,\ i \ne j$. To do so, we search for the elements that satisfy the following condition (with $\zeta = 100$, i.e. we exclude the elements whose mutual area of intersection is equal to 100, which marks a regime that is subsumed by some other regime):

$$\mathbb{N}_i = \{ S_{ij} \mid S_{ij} \ge \zeta,\ S_{ij} \in \Omega_t \} \tag{27}$$

Then, based on $\mathbb{N}_i$, we form the set $\mathbb{Q}_t \subseteq \mathbb{R}_{t-1}$:

$$\mathbb{Q}_t = \bigcup_{i=1}^{K} \mathbb{N}_i \tag{28}$$

Exclusion of regimes from the set $\mathbb{R}_t$ is performed according to:

$$\mathbb{R}_{t+1} = \mathbb{R}_t - \mathbb{Q}_t \tag{29}$$

In set-theoretic notation, expression (29) can be written as:

$$\mathbb{R}_{t+1} = \{ r_j \mid r_j \in \mathbb{R}_t \cap r_j \notin \mathbb{Q}_t \} \tag{30}$$

After these regimes have been excluded, we check whether some regimes $r_j$ and $r_i$ should be merged. To do so, we set a reliability coefficient $0 \le \varphi \le 100$ (this coefficient is set by the user; when $\varphi = 0$ the regime merging operation is always performed, whereas when $\varphi = 100$ it is never performed). The regimes that should be merged are then obtained using the condition:

$$\mathbb{N}_i = (S_{ij} \ge \varphi) \cap (\sigma_i > \sigma_j) \cap \Big(\arg\max_{i \in \mathbb{R}_t} S_{ij}\Big), \quad S_{ij} \in \Omega_t \tag{31}$$

Expression (31) can be interpreted as follows: we merge the regimes $r_j$ and $r_i$ that have the maximum mutual intersection area $S_{ij}$, provided that it is at least the reliability coefficient $\varphi$. If $\mathbb{N}_i$ is not empty, i.e. $\mathbb{N}_i \ne \varnothing$, then we consider two cases: in the first case, the centre of the joint regime lies between the centres of the regimes $r_i$ and $r_j$; in the second case, the centre of the joint regime lies to the left or to the right of the centres of the regimes. In the first case, the centre of the new regime $r^{*}$ can be written as:

$$c^{*} = (x^{*}, y^{*}) \tag{32}$$

How to calculate the values of $x^{*}, y^{*}$ in the first case is discussed in the next section. The deviation of the new regime $r^{*}$ is determined from:

$$\sigma^{*} = \sigma_k + \delta_{kcD} \tag{33}$$

where: $\delta_{kcD}$ is the quantity by which $\sigma_k$ should be increased in order to cover both regimes (see formula (56)). All the formulae needed to obtain $\delta_{kcD}$ are derived in the next section.

The index $k$ is obtained using the following expression:

$$k = \begin{cases} j, & \sigma_j > \sigma_i \\ i, & \sigma_i > \sigma_j \end{cases} \tag{34}$$

Finally, using formulae (28), (29) and (31), we can obtain the regime set $\mathbb{R}_{t+1}$ by excluding the two intersecting regimes $r_i$, $r_j$ and adding the new regime $r^{*}$.

In the second case, the centre of the new regime $r^{*}$ can be selected using this equation:

$$c^{*} = \begin{cases} c_j, & \sigma_j > \sigma_i \\ c_i, & \sigma_i > \sigma_j \end{cases} \tag{35}$$

The deviation of the regime $r^{*}$ can be calculated using expression (36), i.e. we iteratively increase the largest deviation until condition (27) is satisfied, which means that we keep increasing $\sigma^{*}$ while $r_k \not\subset r^{*}$:

$$\sigma^{*}_{t+1} = \sigma^{*}_{t} + \frac{S^{*}_{ij}}{\dot{Z}} \tag{36}$$

where: $\dot{Z}$ is an accuracy coefficient, $\dot{Z} \in [1, \infty)$, which should be set by the user (our tests show that, in general, it is enough to set $\dot{Z} = 100$).

Using the regime set $\mathbb{R}_t$, we form the set $\mathbb{R}_{t+1}$ by excluding the regime $r_q,\ q \in \{i, j\} \cap q \ne k,\ k \in \{i, j\}$, using expression (29).

After the set $\mathbb{R}_{t+1}$ has been formed according to formula (29), we can write the expression for the matrix of mutual intersections $\Omega_{t+1}$ at iteration $t + 1$:

$$\Omega_{t+1} = \Omega_t \ominus \mathbb{R}_{t+1} \tag{37}$$

where: $\ominus$ denotes an operation on the matrix $\Omega_t$ that removes the rows and columns related to the regimes that were excluded from the set $\mathbb{R}_t$.

We continue this process as long as $\mathbb{N}_i$ in expression (31) satisfies the following condition:

$$\mathbb{N}_i \ne \varnothing,\quad r_i \in \mathbb{R}_{t+j} \tag{38}$$

The previous expression means that we continue the iterations until there are no regimes that are subsets of other regimes and no regimes that should be merged. In the next section, we discuss how to construct the matrix $\Omega_t$.
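The exclusion step of the iteration can be sketched on a plain list-of-lists matrix. The sketch assumes that an entry $S_{ij} \ge \zeta$ marks the regime with index $i$ as lying inside the regime with index $j$, and that when two regimes mark each other (identical circles) the one with the lower index is kept; both conventions, like the function name, are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")
ZETA = 100.0    # sentinel that marks a subsumed regime


def exclude_subsumed(regimes: Sequence[T], omega: Sequence[Sequence[float]],
                     zeta: float = ZETA) -> Tuple[List[T], List[List[float]]]:
    """Drop subsumed regimes and the matching rows and columns of omega."""
    excluded = set()
    for i, row in enumerate(omega):
        for j, s in enumerate(row):
            if i == j or s < zeta:
                continue
            # Mutual marking: keep the lower index, exclude the higher one.
            if omega[j][i] >= zeta and i < j:
                continue
            excluded.add(i)
    keep = [i for i in range(len(regimes)) if i not in excluded]
    new_regimes = [regimes[i] for i in keep]                    # reduced regime set
    new_omega = [[omega[i][j] for j in keep] for i in keep]     # rows and columns removed
    return new_regimes, new_omega


omega = [[0.0, 100.0, 5.0],
         [0.0, 0.0, 7.0],
         [5.0, 7.0, 0.0]]
kept, reduced = exclude_subsumed(["a", "b", "c"], omega)
```

Here regime "a" is marked as lying inside regime "b", so it is removed together with its row and column, and the iteration continues on the remaining two regimes.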


6. Forming the Matrix $\Omega_t$ at Iteration $t$

We suppose that at this moment the set $\mathbb{R}_t$ is readily available. Then the elements $S_{ij}$ located on the main diagonal of the matrix $\Omega_t$ should be set to the following values:

$$S_{ij} = 0, \quad \forall S_{ij},\ i = j \qquad (39)$$

To check whether or not some regime $r_i$ is a subset of another regime $r_j$, i.e. $r_i \subset r_j$, $i \neq j$, we will use this condition:

$$J(r_j, r_i) = (\sigma_j \geq \sigma_i) \cap \left(\sigma_j - \sigma_i \geq \sqrt{(x_j - x_i)^2 + (y_j - y_i)^2}\right) \qquad (40)$$

If $J(r_j, r_i) = \mathrm{TRUE}$, i.e. $r_i \subset r_j$, $i \neq j$, then the following values should be set to the related elements of $\Omega_t$ (we set $S_{ij} = 100$ to indicate that $r_i \subset r_j$, $i \neq j$):

$$S_{ij} = 100, \quad \forall\, r_i \subset r_j,\ i \neq j \qquad (41)$$
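The containment test in (40) amounts to checking whether one circle lies entirely inside another. The following Python sketch illustrates it under the assumption that each regime is represented as a tuple `(x, y, sigma)` holding the centre coordinates and the deviation (radius); the tuple layout and the name `is_subset` are illustrative and do not come from the paper.

```python
import math


def is_subset(inner, outer):
    """Return True if the circle `inner` lies entirely inside the circle `outer`.

    Each regime is an assumed (x, y, sigma) tuple.
    """
    x_in, y_in, s_in = inner
    x_out, y_out, s_out = outer
    dist = math.hypot(x_out - x_in, y_out - y_in)
    # Both parts of condition (40): the outer deviation is not smaller, and the
    # difference of deviations covers the distance between the centres.
    return s_out >= s_in and s_out - s_in >= dist


small = (1.0, 1.0, 1.0)
large = (0.0, 0.0, 3.0)
print(is_subset(small, large))  # True
print(is_subset(large, small))  # False
```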

In the case of the intersection of two regimes $r_i$ and $r_j$, the condition should be stated in the following way:

$$\begin{aligned} I(r_j, r_i) = {} & \left(\sigma_j + \sigma_i \geq \sqrt{(x_j - x_i)^2 + (y_j - y_i)^2}\right) \\ & \cap \left(\sigma_j + \sqrt{(x_j - x_i)^2 + (y_j - y_i)^2} > \sigma_i\right) \\ & \cap\, \neg\left(\sigma_j - \sigma_i \geq \sqrt{(x_j - x_i)^2 + (y_j - y_i)^2}\right) \end{aligned} \qquad (42)$$
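Condition (42) can be read as three checks: the circles are close enough to touch, $r_j$ is not contained in $r_i$, and $r_i$ is not contained in $r_j$. The Python sketch below follows this reading, assuming that all three terms refer to the same pair of regimes and that a regime is an `(x, y, sigma)` tuple; both the representation and the name `intersects` are assumptions made for illustration.

```python
import math


def intersects(r_j, r_i):
    """Sketch of condition (42) for two regimes given as (x, y, sigma) tuples."""
    x_j, y_j, s_j = r_j
    x_i, y_i, s_i = r_i
    dist = math.hypot(x_j - x_i, y_j - y_i)
    close_enough = s_j + s_i >= dist          # the circles overlap or touch
    j_not_inside_i = s_j + dist > s_i         # r_j is not contained in r_i
    i_not_inside_j = not (s_j - s_i >= dist)  # r_i is not contained in r_j
    return close_enough and j_not_inside_i and i_not_inside_j


print(intersects((0.0, 0.0, 2.0), (3.0, 0.0, 2.0)))  # True: partial overlap
print(intersects((0.0, 0.0, 1.0), (5.0, 0.0, 1.0)))  # False: too far apart
print(intersects((0.0, 0.0, 3.0), (1.0, 0.0, 1.0)))  # False: one inside the other
```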

If $I(r_j, r_i) = \mathrm{FALSE}$, then the regimes $r_i$ and $r_j$ do not intersect, i.e. $r_j \cap r_i = \emptyset$, and the value of $S_{ij}$ is

$$S_{ij} = 0 \qquad (43)$$

If $I(r_j, r_i) = \mathrm{TRUE}$, then the regimes $r_i$ and $r_j$ intersect, i.e. $r_j \cap r_i \neq \emptyset$, and the values of the related elements of $\Omega_t$ can be calculated as the mutual area of intersection $S_{ij}$ of these regimes:


$$S_{ij} = \frac{S^{*}_{ij}}{S_i + S_j}, \qquad (44)$$

where $S^{*}_{ij}$ is the intersection area of the regimes $r_i$ and $r_j$, and $S_i$, $S_j$ are the areas of the regimes $r_i$ and $r_j$ respectively. We use this formula because it gives the relative area of intersection of two regimes. The areas $S_i$, $S_j$ can be obtained using this formula:

$$S_i = \pi \sigma_i^2 \qquad (45)$$

When merging two regimes $r_i$ and $r_j$, we need to find the coordinates of the centre of the new joint regime $r^{*} = r_i \cup r_j$. For this reason we should consider two cases. We test whether or not the centre of the joint regime lies between the two regimes $r_i$ and $r_j$. If this is the case, we check two conditions in order to find the coordinates of the joint regime, because we have to know the relative positions of the regimes $r_i$ and $r_j$. For instance, if $p_1 - p_2 = 0$, then the regime $r_i$ lies to the left of the regime $r_j$. On the other hand, if $p_{11} - p_{22} = 0$, then the regime $r_i$ lies to the left of the regime $r_j$, but the centre of the joint regime does not lie between the regimes:

$$\begin{aligned} (1)\quad & p_1 = (p_x, p_y)^{(1)}_{(i)} = f(x_i, y_i, D_{ic}), \\ & p_2 = (p_x, p_y)^{(2)}_{(j)} = f(x_j, y_j, -D_{jc}); \\ (2)\quad & p_3 = (p_x, p_y)^{(3)}_{(i)} = f(x_i, y_i, -D_{ic}), \\ & p_4 = (p_x, p_y)^{(4)}_{(j)} = f(x_j, y_j, D_{jc}) \end{aligned} \qquad (46)$$

where $p_x$, $p_y$ are the coordinates calculated by the function $f(x, y, z)$; $x_i$, $y_i$, $x_j$, $y_j$ are the coordinates of the centres of the given regimes $r_i$ and $r_j$; and $D_{ic}$, $D_{jc}$ are the distances between the centre of the joint regime and the regimes $r_i$ and $r_j$.

The coordinates of the centre of the joint regime $r^{*}$ are defined by the following formula. As mentioned above, here we check the relative positions of the regimes $r_i$ and $r_j$ as well as the coordinates of the centre of the joint regime. If this condition is satisfied, then the centre $c^{*}$ lies between the regimes $r_i$ and $r_j$:

$$c^{*} = \begin{cases} (p_x, p_y)^{(3)}_{(i)}, & \text{if } p_3 - p_4 = 0 \\ (p_x, p_y)^{(1)}_{(i)}, & \text{if } p_1 - p_2 = 0 \end{cases} \qquad (47)$$

If the conditions (47) are not satisfied, then we consider the next two cases:

$$\begin{aligned} (1)\quad & p_{11} = (p_x, p_y)^{(11)}_{(i)} = f(x_i, y_i, -D_{ic}), \\ & p_{22} = (p_x, p_y)^{(22)}_{(j)} = f(x_j, y_j, -D_{jc}); \\ (2)\quad & p_{33} = (p_x, p_y)^{(33)}_{(i)} = f(x_i, y_i, D_{ic}), \\ & p_{44} = (p_x, p_y)^{(44)}_{(j)} = f(x_j, y_j, D_{jc}) \end{aligned} \qquad (48)$$

Then the coordinates of the joint regime $r^{*}$ can be obtained from the following equation:

$$c^{*} = \begin{cases} (p_x, p_y)^{(11)}_{(i)}, & \text{if } p_{11} - p_{22} = 0 \\ (p_x, p_y)^{(33)}_{(i)}, & \text{if } p_{33} - p_{44} = 0 \end{cases} \qquad (49)$$

The function $f(x, y, z)$, which is used in formulae (46) and (48), has the following form. It calculates the coordinates of the joint regime $r^{*}$ from the coordinates of the regimes $r_i$ and $r_j$ and the distances $D_{ic}$, $D_{jc}$, and it was constructed on the basis of plane geometry:

$$f(x, y, z) = \left(x + \frac{z \sin\alpha}{m},\; y + m z \cos\alpha\right) \qquad (50)$$

where $m$ is the slope of the line that passes through the centres of the regimes $r_i$ and $r_j$, and $\alpha$ is the angle between this line and the X axis.

The values of $m$ and $\alpha$ can be easily obtained using these equations:

$$m = \frac{\Delta Y}{\Delta X} = \frac{y_j - y_i}{x_j - x_i}, \qquad \alpha = \operatorname{arctg}(m) \qquad (51)$$
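As an illustration of formulae (50) and (51) and of the check $p_1 - p_2 = 0$ from (47), consider the following Python sketch. Here the slope and angle are passed to `f` explicitly, whereas the paper writes $f(x, y, z)$ with $m$ and $\alpha$ implicit; the helper name `slope_and_angle`, the sample centres and the distances are assumptions chosen so that the numbers are easy to follow.

```python
import math


def slope_and_angle(c_i, c_j):
    """Slope m and angle alpha of the line through two centres, as in (51)."""
    (x_i, y_i), (x_j, y_j) = c_i, c_j
    m = (y_j - y_i) / (x_j - x_i)  # undefined for a vertical line (x_j == x_i)
    return m, math.atan(m)


def f(x, y, z, m, alpha):
    """Formula (50): shift the point (x, y) by the signed distance z along the line."""
    # z * sin(alpha) / m equals z * cos(alpha); the literal form fails for m == 0.
    return (x + z * math.sin(alpha) / m, y + m * z * math.cos(alpha))


# Hypothetical centres 14 units apart on a line with slope 4/3.
c_i, c_j = (0.0, 0.0), (8.4, 11.2)
d_ic, d_jc = 5.0, 9.0  # assumed distances to the centre of the joint regime

m, alpha = slope_and_angle(c_i, c_j)
p1 = f(c_i[0], c_i[1], d_ic, m, alpha)   # move from r_i towards r_j
p2 = f(c_j[0], c_j[1], -d_jc, m, alpha)  # move from r_j towards r_i
print(p1, p2)  # both points are approximately (3.0, 4.0)
```

Because the two points are computed in floating-point arithmetic, a practical implementation would presumably compare them with a small tolerance rather than test for exact equality.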


The value of $S^{*}_{ij}$ (the area of intersection of the two regimes $r_i$ and $r_j$) is calculated differently depending on whether the condition (47) or the condition (49) is satisfied. We therefore consider two cases: in the first, the centre of the joint regime lies between the centres of the regimes $r_i$, $r_j$ (47); in the second, the centre of the joint regime lies to the left or to the right of the centres of the regimes (49).

In the first case, the intersection area of the regimes $r_i$, $r_j$ is calculated as the sum of the areas of two half-ellipses. First of all, we calculate the values of the variables $H$, $D_{ic}$, $D_{jc}$ by solving the system of equations:

$$\begin{cases} \sigma_j^2 = H^2 + D_{jc}^2 \\ \sigma_i^2 = H^2 + D_{ic}^2 \\ D_{ij} = D_{ic} + D_{jc} \end{cases} \qquad (52)$$

Solving this system for $D_{ic}$, $D_{jc}$, we obtain these formulae:

$$D_{ic} = \frac{D_{ij}^2 - \sigma_j^2 + \sigma_i^2}{2 D_{ij}}, \qquad D_{jc} = \frac{D_{ij}^2 + \sigma_j^2 - \sigma_i^2}{2 D_{ij}} \qquad (53)$$

Using formulae (53), the expression for $H$ can be obtained as follows:

$$H = \sqrt{\sigma_j^2 - \left(\frac{D_{ij}^2 + \sigma_j^2 - \sigma_i^2}{2 D_{ij}}\right)^2}, \qquad (54)$$

where $D_{ij}$ is the distance between the centres of the regimes $r_i$ and $r_j$, which can be calculated as the Euclidean distance:

$$D_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} \qquad (55)$$
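Formulae (53) to (55) can be checked numerically. The Python sketch below computes $D_{ij}$, $D_{ic}$, $D_{jc}$ and $H$ for two circles; the function name `chord_geometry` and the sample values (deviations 13 and 15 with centres 14 apart, chosen so that the results are whole numbers) are illustrative assumptions.

```python
import math


def chord_geometry(c_i, sigma_i, c_j, sigma_j):
    """Return (D_ij, D_ic, D_jc, H) for two intersecting circles."""
    d_ij = math.hypot(c_i[0] - c_j[0], c_i[1] - c_j[1])       # (55)
    d_ic = (d_ij**2 - sigma_j**2 + sigma_i**2) / (2 * d_ij)   # (53)
    d_jc = (d_ij**2 + sigma_j**2 - sigma_i**2) / (2 * d_ij)   # (53)
    h = math.sqrt(sigma_j**2 - d_jc**2)                       # (54)
    return d_ij, d_ic, d_jc, h


d_ij, d_ic, d_jc, h = chord_geometry((0.0, 0.0), 13.0, (14.0, 0.0), 15.0)
print(d_ij, d_ic, d_jc, h)  # 14.0 5.0 9.0 12.0
```

In this example both right triangles of system (52) are satisfied: $13^2 = 12^2 + 5^2$ and $15^2 = 12^2 + 9^2$, with $5 + 9 = 14$.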

The quantities $\delta D_{jc}$ and $\delta D_{ic}$, which are needed to calculate the intersection area of the regimes $r_i$ and $r_j$, can be found from the equations:

$$\delta D_{jc} = \sigma_j - D_{jc}, \qquad \delta D_{ic} = \sigma_i - D_{ic} \qquad (56)$$

Using the above-mentioned quantities, the area of intersection can be calculated as the sum of two half-ellipses:

$$S^{*}_{ij} = S^{1}_{ij} + S^{2}_{ij} \qquad (57)$$

Finally, the expressions for $S^{1}_{ij}$, $S^{2}_{ij}$ can be written in this way:

$$S^{1}_{ij} = \frac{\pi\, \delta D_{jc}\, H}{2}, \qquad S^{2}_{ij} = \frac{\pi\, \delta D_{ic}\, H}{2} \qquad (58)$$
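The first-case computation (56) to (58), followed by the normalisation (44) and (45), can be sketched in Python as follows. The function names and the sample values ($\sigma_i = 13$, $\sigma_j = 15$, $D_{ic} = 5$, $D_{jc} = 9$, $H = 12$) are assumptions made for illustration. Note that the sum of two half-ellipses is the approximation used in this article and is not the exact area of the lens formed by two circles.

```python
import math


def intersection_area_between(sigma_i, sigma_j, d_ic, d_jc, h):
    """Intersection area as the sum of two half-ellipses, formulae (56)-(58)."""
    delta_jc = sigma_j - d_jc          # (56)
    delta_ic = sigma_i - d_ic          # (56)
    s1 = math.pi * delta_jc * h / 2    # (58)
    s2 = math.pi * delta_ic * h / 2    # (58)
    return s1 + s2                     # (57)


def relative_area(s_star, sigma_i, sigma_j):
    """Relative intersection area, formulae (44)-(45)."""
    s_i = math.pi * sigma_i**2
    s_j = math.pi * sigma_j**2
    return s_star / (s_i + s_j)


s_star = intersection_area_between(13.0, 15.0, 5.0, 9.0, 12.0)
print(s_star / math.pi)                     # approximately 84
print(relative_area(s_star, 13.0, 15.0))    # approximately 84 / 394
```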

In the second case (49), the area of intersection of the regimes $r_i$, $r_j$ is calculated as the difference between the area of the smallest regime (the regime with the smallest deviation) and the area of the segment. For this reason, we calculate the value of the central angle $\beta_C$ based on the law of cosines:

$$(2H)^2 = 2\sigma_{*}^2 - 2\sigma_{*}^2 \cos\beta_C, \qquad \cos\beta_C = 1 - \frac{2H^2}{\sigma_{*}^2} \qquad (59)$$

As a result, the expression for the central angle in degrees $\beta_C^{\circ}$ can be written as follows:

$$\beta_C = \arccos\left(1 - \frac{2H^2}{\sigma_{*}^2}\right), \qquad \beta_C^{\circ} = \frac{180\, \beta_C}{\pi}, \qquad (60)$$

where $\sigma_{*}$ is the smallest deviation, which is determined by this expression:

$$\sigma_{*} = \begin{cases} \sigma_i, & \sigma_j > \sigma_i \\ \sigma_j, & \sigma_i > \sigma_j \end{cases} \qquad (61)$$

The area of the sector can be obtained from this formula:

$$S_S = \frac{\pi \sigma_{*}^2 \beta_C^{\circ}}{360} \qquad (62)$$


The half-perimeter of the triangle and the area of the triangle are:

$$p_T = \frac{2\sigma_{*} + 2H}{2}, \qquad S_T = \sqrt{p_T (p_T - \sigma_{*})^2 (p_T - 2H)} \qquad (63)$$

The area of the segment is defined as the difference between the area of the sector and the area of the triangle:

$$S_{SEG} = S_S - S_T \qquad (64)$$

Finally, the area of intersection $S^{*}_{ij}$ can be calculated using this formula:

$$S^{*}_{ij} = S_i - S_{SEG} \qquad (65)$$
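The second-case computation (59) to (65) can be sketched in Python as follows. The sketch assumes that the area in (65) is that of the regime with the smallest deviation, as stated in the description of this case; the function names and the test values are illustrative assumptions.

```python
import math


def segment_area(sigma_min, h):
    """Area of the circular segment cut off by a chord of length 2*h."""
    beta = math.acos(1 - 2 * h**2 / sigma_min**2)        # (60), in radians
    beta_deg = 180 * beta / math.pi                      # (60), in degrees
    sector = math.pi * sigma_min**2 * beta_deg / 360     # (62)
    p_t = (2 * sigma_min + 2 * h) / 2                    # (63), half-perimeter
    # Heron's formula for the triangle with sides sigma_min, sigma_min and 2*h;
    # max() guards against a tiny negative value caused by rounding.
    triangle = math.sqrt(max(0.0, p_t * (p_t - sigma_min) ** 2 * (p_t - 2 * h)))
    return sector - triangle                             # (64)


def intersection_area_outside(sigma_i, sigma_j, h):
    """Intersection area in the second case, formulae (61) and (65)."""
    sigma_min = min(sigma_i, sigma_j)                    # (61)
    return math.pi * sigma_min**2 - segment_area(sigma_min, h)


# A chord equal to the diameter (h == sigma) cuts off half of the unit circle.
print(segment_area(1.0, 1.0))                  # approximately pi / 2
print(intersection_area_outside(1.0, 2.0, 1.0))  # approximately pi / 2
```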

Fig. 2 shows the results of our regime merging algorithm at different iterations for the test signal (the test signal was described earlier).

Fig. 2 – Application of our regime merging algorithm and its consecutive iterations with φ = 55%, Ż = 100 (a – the first step; b, c – intermediate steps; d – the final result).


7. Conclusions

If data is characterized as recorded facts, then information is the set of patterns, or expectations, that underlie the data. There is a huge amount of information locked up in databases – information that is potentially important but has not yet been discovered or articulated. Our mission is to bring it forth. Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data. Algorithms need to be robust enough to cope with imperfect data and to extract regularities that are inexact but useful.

In this article we described a wavelet based data mining technique. Our algorithm has certain advantages over other wavelet based data mining algorithms. First of all, we utilize many scales simultaneously: the user only has to choose the maximum scale factor, and the signal is then transformed using the CWT.

Secondly, we utilize this transformed signal in order to find the change-points of the original signal. The proposed algorithm is robust and can be used in the presence of some noise and abnormal random bursts. The main role of the wavelet transform is to make our algorithm more robust than the classical change-detection algorithms (for example, sequential Wald analysis, CUSUM, Bayes-type, filtered derivative, etc.). For this reason, we use the first-order Gaussian wavelet function. The robustness of our change-point detection algorithm is based on the fact that we use many scales of the signal simultaneously. Our algorithm can be easily customized to suit particular signals and different types of events (for example, it uses the following tuning parameters: $\Lambda$, $\beta_T$, $p^{u}_{\lim}$, $p^{d}_{\lim}$).

Thirdly, together with the change-point detection algorithm we apply our regime merging algorithm. We take the parts of the original signal (the change-points divide the signal into parts that can be considered quasi-stationary) and form a set of regimes. In this way we process the signal and obtain its structure. This structure consists of many features that can be utilized for diagnostics, forecasting, etc.

Fourthly, our algorithm performs a feature reduction operation in order to remove redundant regimes, which leads to a compact description of the original signal. Fifthly, our regime merging algorithm has the following tuning parameters: φ and Ż.

One might argue that having so many parameters is a drawback, but we do not think so. The reason is that the parameters allow us to customize the algorithm, so that it can be expected to work with different signals.

So, our proposed technique can extract useful information from a signal, which can then be used for a better understanding of the process under consideration. In our research, we have successfully used this technique to improve the forecasting accuracy of the sinter quality. To do so, we built a second-order autoregressive (AR) model, with coefficients found by the least squares method. For the terms of the AR model, we used the content of carbon in the green mix, the content of FeO in the green mix and the content of FeO in the sinter (shifted in time). Then we applied our algorithm to these signals. To improve the forecasting quality, we added further terms to the AR model: information about the regimes of the FeO signals (in the green mix as well as in the sinter) and of the carbon signal. As a result, we improved the forecasting accuracy of the content of FeO in the sinter by 14 percentage points (from 81 to 95 per cent).

Finally, taking into account what was stated earlier, it can be seen that our algorithm can be used in practical data mining for finding and describing structural patterns in data. These structural patterns can be used for prediction, explanation, and understanding of the data under consideration. We hope that this article will contribute to the creation of more competitive engineering systems and widen the range of possible applications of data mining methods.

REFERENCES

Basseville M., Nikiforov I.V., Detection of Abrupt Changes. Theory and Application. Prentice-Hall, New Jersey, 1993.

Behera K., Sareeta M., Discrete Wavelet Transform and s-Transform Based Time Series Data Mining Using Multilayer Perceptron Neural Network. International Journal of Engineering Science and Technology, 3, 11, 11−21 (2011).

Burrus C.S., Ramesh A.G., Introduction to Wavelets and Wavelet Transforms: A Primer. Prentice-Hall, New Jersey, 1998.

Degroot M.H., Schervish M.J., Probability and Statistics. Pearson Education, Carnegie-Mellon University, 2011.

Marcin G., Sorin S., Local Wavelet Features for Statistical Object Classification and Localisation. Multimedia, IEEE, 17, 1, 118−125 (2010).

Phinyomark A., Limsakul C., Phukpattaranont P., Application of Wavelet Analysis in EMG Feature Extraction for Pattern Classification. Measurement Science Review, 11, 2, 45−52 (2011).

Prochazka A., Kukal J., Wavelet Transform Use for Feature Extraction and EEG Signal Segments Classification. Communications, Control and Signal Processing, 719−722 (2008).

Sherwood J., Derakhshani R., On Classifiability of Wavelet Features for EEG-Based Brain-computer Interfaces. IJCNN, 2895−2902 (2009).

Shewhart W.A., Statistical Method from the Viewpoint of Quality Control. Courier Dover Publications, Washington, 1939.

Villez K., Pelletier G., Rose C., Comparison of Two Wavelet-Based Tools for Data Mining of Urban Water Networks Time Series. Water Science & Technology, 56, 6, 57−64 (2007).

Wald A., Sequential Analysis. John Wiley and Sons, New York, 1947.

Witten I.H., Frank E., Data Mining: Practical Machine Learning Tools and Techniques. Elsevier Inc., San Francisco, 2005.

Yendiyarov S.V., Petrushenko S.Yu., Robust Probabilistic Online Change Detection Algorithm Based on the Continuous Wavelet Transform. World Academy of Science, Engineering and Technology, 60, 1810−1814 (2011).

Yendiyarov S.V., Zobnin B.B., Petrushenko S.Yu., Online Change Detection Algorithm Based on the Continuous Wavelet Transform, the CUSUM Algorithm and an Autoregressive Model. Current Trends in Signal Processing, 1, 2−3, 7−18 (2011).

Zhang H., Bao T., Zhang Y., Lin M., Unsupervised Feature Extraction for Time Series Clustering Using Orthogonal Wavelet Transform. Informatica, 30, 305−319 (2006).

A DATA MINING TECHNIQUE BASED ON THE WAVELET TRANSFORM

(Summary)

This paper presents a data mining method based on the wavelet transform (Wavelet Based Data Mining Technique, WBDMT), which can be used to extract useful information about the features of signals.

First, an algorithm for change-point detection based on the continuous wavelet transform is presented. This algorithm is used to extract local features, called "regimes", from the analysed signal. Next, the authors present a regime merging algorithm, whose purpose is to eliminate redundant regimes and to extract only the useful information specific to the signal.

The aim is to develop a software application that automatically parses databases containing signals in order to search for patterns. These structural patterns can be used for prediction and for describing the evolution of the signals under consideration.

