Download - 1 Appendix A: An Overview on Time Series Data Mining Duong Tuan Anh Faculty of Computer Science & Engineering Ho Chi Minh City University of Technology.

1

Appendix A:An Overview on Time Series Data

Mining

Duong Tuan AnhFaculty of Computer Science & EngineeringHo Chi Minh City University of Technology

October 2009

Course: Decision Support Systems

2

An Overview on Time series Data Mining

Outline

1. Introduction

2. Similarity Search in Time Series Data

3. Feature-based Dimensionality Reduction

4. Discretization

5. Other Time Series Data Mining Tasks

6. Conclusions

3

IntroductionIntroduction

0 50 100 150 200 250 300 350 400 450 50023

24

25

26

27

28

29

25.1750 25.2250 25.2500 25.2500 25.2750 25.3250 25.3500 25.3500 25.4000 25.4000 25.3250 25.2250 25.2000 25.1750

.. .. 24.6250 24.6750 24.6750 24.6250 24.6250 24.6250 24.6750 24.7500

A time series is a collection of observations made sequentially in time.

Examples: Financial time series, scientific time series

4

Time series Similarity Search

Some examples:

- Identifying companies with similar patterns of growth.

- Determining products with similar selling patterns

- Discovering stocks with similar movement in stock prices.

- Finding out whether a musical score is similar to one of a set of copyrighted scores.

5

Time series Similarity Search Distance Measures

• Euclidean distance

• Dynamic Time Warping

• Other distance measures

6

Euclidean Distance MetricEuclidean Distance Metric

Given two time series Q = q1…qn and

C = c1…cn

their Euclidean distance is

defined as:

n

iii cqCQD

1

2,

Q

C

D(Q,C)

7

Fixed Time AxisSequences are aligned “one to one”.

“Warped” Time AxisNonlinear alignments are possible.

Dynamic Time Warping (Berndt et al.)Dynamic Time Warping (Berndt et al.)

Dynamic Time Warping is a technique that finds the optimal alignment between two time series if one time series may be “warped” non-linearly by stretching or shrinking it along its time axis.This warping between two time series can be used or to determine the similarity between the two time series.

8

Dynamic Time Warping

• DTW facilitates the discovery of flexible patterns from time series.

• DTW is used in speech regcognition to determine if two waveforms represent the same spoken phrase.

• Disadvantage: Dynamic programming High computational complexity

• Recently, there has been a lot of research work in improving DTW efficiency.

9

Feature-based Dimensionality Reduction

• Time series databases are often extremely large. Searching directly on these data will be very complex and inefficient.

• To overcome this problem, we should use some of transformation methods to reduce the magnitude of time series.

• These transformation methods are called dimensionality reduction techniques.

10

•Let denote a time series query as T = t1,…,tn, and the set of

time series which constitute the database as Y = {Y1,…YK}.

We assume each sequence in Y is n units long.

• Let N be the dimensionality of the transformed space we wish to index (1 ≤ N ≤ n). For convenience, we assume that N is a factor of n. • CRatio = N/n is the compression ratio.

11

0 20 40 60 80 100 120 140

C

An Example of a An Example of a Dimensionality Reduction Dimensionality Reduction

Technique ITechnique I 0.4995 0.5264 0.5523 0.5761 0.5973 0.6153 0.6301 0.6420 0.6515 0.6596 0.6672 0.6751 0.6843 0.6954 0.7086 0.7240 0.7412 0.7595 0.7780 0.7956 0.8115 0.8247 0.8345 0.8407 0.8431 0.8423 0.8387 …

RawData

The graphic shows a time series with 128 points.

The raw data used to produce the graphic is also reproduced as a column of numbers (just the first 30 or so points are shown).

n = 128

12

0 20 40 60 80 100 120 140

C

. . . . . . . . . . . . . .


Technique IITechnique II 1.5698 1.0485 0.7160 0.8406 0.3709 0.4670 0.2667 0.1928 0.1635 0.1602 0.0992 0.1282 0.1438 0.1416 0.1400 0.1412 0.1530 0.0795 0.1013 0.1150 0.1801 0.1082 0.0812 0.0347 0.0052 0.0017 0.0002 ...

FourierCoefficients

0.4995 0.5264 0.5523 0.5761 0.5973 0.6153 0.6301 0.6420 0.6515 0.6596 0.6672 0.6751 0.6843 0.6954 0.7086 0.7240 0.7412 0.7595 0.7780 0.7956 0.8115 0.8247 0.8345 0.8407 0.8431 0.8423 0.8387 …

RawData

We can decompose the data into 64 pure sine waves using the Discrete Fourier Transform (just the first few sine waves are shown).

The Fourier Coefficients are reproduced as a column of numbers (just the first 30 or so coefficients are shown).

13

0 20 40 60 80 100 120 140

C


Technique IIIITechnique IIII

SortedTruncated

FourierCoefficients

C’

1.5698 1.0485 0.7160 0.8406 0.3709 0.1670 0.4667 0.1928 0.1635 0.1302 0.0992 0.1282 0.2438 0.2316 0.1400 0.1412 0.1530 0.0795 0.1013 0.1150 0.1801 0.1082 0.0812 0.0347 0.0052 0.0017 0.0002 ...

FourierCoefficients

0.4995 0.5264 0.5523 0.5761 0.5973 0.6153 0.6301 0.6420 0.6515 0.6596 0.6672 0.6751 0.6843 0.6954 0.7086 0.7240 0.7412 0.7595 0.7780 0.7956 0.8115 0.8247 0.8345 0.8407 0.8431 0.8423 0.8387 …

RawData

1.5698 1.0485 0.7160 0.8406 0.2667 0.1928 0.1438 0.1416

14

0 20 40 60 80 100 120 0 20 40 60 80 100 120 0 20 40 60 80 100 120 0 20 40 60 80 100 1200 20 40 60 80 100 120 0 20 40 60 80 100 120

Keogh, C

hakrabarti, Pazzani &

Mehrotra K

AIS 2000

Yi &

Faloutsos VLD

B 2000

Keogh, C

hakrabarti, Pazzani &

Mehrotra SIG

MO

D 2001

Chan &

Fu. ICD

E 1999

Korn, Jagadish &

Faloutsos. SIGM

OD

1997

Agraw

al, Faloutsos, &. Sw

ami.

FOD

O 1993

Faloutsos, Ranganathan, &

Manolopoulos. SIG

MO

D 1994

Morinaka,

Yoshikaw

a, Am

agasa, &

Uem

ura, PAK

DD

2001

DFT DWT SVD APCA PAA PLA

15

Indexing Time SeriesIndexing Time Series

1. For each position in the time series, extract a window of length w, and create a feature (a point in k-dimensioned space) for it.

2. Note: The points for a time series will constitute a trail in feature space.

3. Partition the trail into suitable subtrails - Minimal Bounding Rectangles (MBR).4. Store the MBRs in a spatial index structure.

Database C

2

1

4

3

5

7

6

9

8

10

Query Q

16

Q

…the problem is that we have to look at every point to find the nearest neighbor..

17

R1

R2R5

R3

R7R9

R8

R6

R4

We can group clusters of datapoints with “boxes”, called Minimum Bounding Rectangles (MBR).

We can further recursively group MBRs into larger MBRs….

18

R10 R11 R12

R1 R2 R3 R4 R5 R6 R7 R8 R9

Data nodes containing points

R10 R11

R12

…these nested MBRs are organized as a tree (called a spatial access tree or a multidimensional tree). Examples include R-tree, R*-tree, M-tree, Hybrid-Tree etc.

19

Bounding Condition

• In order to ensure completeness (no false dismissals): dk(X’, Y’) ≤ d(X, Y)

General method of Dimensionality Reduction

1. Select a distance measure d.

2. Design a dimensionality reduction technique dk over the k-dimensional reduced space.

3. Produce a distance measure dk over the reduced space, and prove that it obeys the bounding condition.

20

Discretization

• Discretization of a time series is tranforming it into a symbolic string.

• The main benefit of this discretization is that there is an enormous wealth of existing algorithms and data structures that allow the efficient manipulations of symbolic representations.

• Lin and Keogh et al. (2003) proposed a method called Symbolic Aggregate Approximation (SAX), which allows the descretization of original time series into symbolic strings.

21

Symbolic Aggregate Approximation (SAX) [Lin et al. 2003]

baabccbc

The first symbolic representation of time series, that allow

Lower bounding of Euclidean distance Dimensionality Reduction Numerosity Reduction

22

How do we obtain SAX

0 20 40 60 80 100 120

C

C

0

-

-

0 20 40 60 80 100 120

bbb

a

cc

c

a

baabccbc

First convert the time series to PAA representation, then convert the PAA to symbols

It take linear time

23

Two parameter choices

0

-

-

0 20 40 60 80 100 120

bbb

a

cc

c

a

0 20 40 60 80 100 120

C

C

1 2 3 4 5 6 7

1

8

The word size, in this case 8

The alphabet size (cardinality), in this case 3

3

21

24

Extended SAX (ESAX) Lkhagva et al. ,2006

• As Extended SAX is based on SAX, we can get the mean values of the PAA of time series.

• After obtaining PAA, we have the equal sized segments and its mean values.

• Then we define max and min values in each segment.• Discritize the three values to get a triple of symbols

25

0 10 20 30 40 50 60

ACF FDF FCA ABF FFF DCA

26

iSAX [Shieh et al. 2008](Indexable SAX)

27

Time Series Data Mining tasks

• Similarity Search

• Classification

• Clustering

• Motif Discovery

• Novelty/Anomaly Detection

• Time series visualization

• Time series prediction

28

Phân lớp chuỗi thời gian

Class BClass BClass AClass A

Which class does

belong to?

•Phân loại dữ liệu chuỗi thời gian thành một số lớp với một số đặc trưng nhất định.

•Dùng những dữ liệu chuỗi thời gian đã thống kê trong quá khứ để phân loại dữ liệu trong tương lai

29

Phân lớp chuỗi thời gian (tt.)• Đã có một số công trình nghiên cứu liên quan đến bài

toán phân lớp dữ liệu chuỗi thời gian. • Các công trình này tập trung nghiên cứu các mục

tiêu: rút ngắn thời gian xử lý cho các giải thuật phân lớp, và nâng cao độ chính xác cho kết quả phân lớp.

• Các công trình này có thể kể như: – cải tiến phương pháp xoắn thời gian động để có một cận

dưới (lower bound) chặt hơn để từ đó phân lớp dữ liệu chuỗi thời gian chính xác hơn do Keogh & Ratanamahatana, 2004

– thu giảm số lượng thể hiện (numerosity reduction) nhằm giảm số lượng mẫu cần so sánh, nhóm Keogh (2006);

– giải thuật phân lớp với thời gian tùy định (anytime classification) do Keogh và các cộng sự đề ra năm 2006

30

Gom cụm chuỗi thời gian

Gom cụm (clustering) là phân hoạch dữ liệu thành các nhóm sao cho các thành phần trong cụm là tương tự nhau còn các thành phần khác cụm là rất khác nhau.

31

Gom cụm chuỗi thời gian (tt.)• Đã có một số công trình nghiên cứu liên quan đến bài

toán gom cụm dữ liệu chuỗi thời gian. • Các công trình này thường dựa trên một phương pháp

thu giảm số chiều nào đó. • Gom cụm dữ liệu chuỗi thời gian dựa trên dữ liệu đã

biến đổi wavelet rời rạc là – phương pháp được đề xuất bởi Zhang và các cộng sự năm

2006. – Keogh và các cộng sự, năm 2004, đã đưa ra giải thuật I-K-

means để thực hiện gom cụm dữ liệu chuỗi thời gian sử dụng phép biến đổi wavelet;

• Gom cụm dữ liệu chuỗi thời gian dựa trên dữ liệu đã biến đổi PAA đa mức phân giải: Keogh và các cộng sự đưa ra năm 2005

32

Nhận dạng chuỗi con bất thường nhấtChuỗi con bất thường (unusual, abnormal, discord, novel) là chuỗi con mà khác nhất so với các chuỗi con khác trong một chuỗi thời gian.

33

Nhận dạng chuỗi con bất thường nhất (tt.)• Phương pháp HOT SAX, bởi Keogh và các cộng sự năm 2005:

tìm kiếm chuỗi con bất thường nhất trong dữ liệu chuỗi thời gian bằng giải thuật BFDD (bruce force discord discovery) và một số heuristic cải tiến (HDD).

• Phương pháp WAT (wavelet and augmented trie) tìm K chuỗi bất đồng trên dữ liệu chuỗi thời gian sử dụng phép biến đổi wavelet do Y. Bu, T.W. Leung, A. Fu, và các cộng sự năm 2007.

• Giải thuật AWDD (adaptive window-based discord discovery) bởi M. C. Chuah và F. Fu năm 2007 dựa vào ý tưởng của BFDD nhưng cải tiến giải thuật để tìm ra kích thước cửa sổ trượt; (áp dụng cho dữ liệu chuỗi thời gian ECG).

• Giải thuật DADD (disk aware discord discovery) bởi Yankov và các cộng sự năm 2007 để tìm những chuỗi con bất thường nhất trên dữ liệu chuỗi thời gian có kích thước hàng terabyte.

34

Phát hiện mô típ

Mô típ (mẫu lặp) là chuỗi con thường lặp lại nhiều nhất trong chuỗi thời gian

35

Phát hiện mô típ (tt.)

• Giải thuật EMMA, Lin và các cộng sự, 2002.

• Giải thuật Random Projection, Chiu và các cộng sự, 2003.

• Giải thuật phát hiện mô típ có thể xác định chiều dài tối ưu của mô típ (Tanaka, 2007).

36

Hai bài toán TSDM khác

• Dự báo dữ liệu chuỗi thời gian

Neural Network

Neural Network + wavelet transform

• Khám phá luật kết hợp từ dữ liệu chuỗi thời gian

37

Conclusions

• An overview of recent advances in Time series data mining– Feature-based dimensionality reduction – Discretization

• Further research:– Empirical comparisons– Applications in time series data mining

(clustering, classification, finding motifs, anomaly detection, prediction,etc.)