Similar search with trillions of time series

Post on 29-Jun-2015

110 views 3 download

Tags:

transcript

Searching and MiningTrillions of Time Series Subsequencesunder Dynamic Time Warping

Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen,

Gustavo Batista, Brandon Westover, Qiang Zhu, Jesin Zakaria, Eamonn Keogh

Hoan Nguyen – Trung Minh Nguyen

2

Abstract

Optimizationsto search and mine

large databasesvery fast

3

Outline

Problem

Related work

Definitions

Method

Results

Conclusion

4

Problem

Similarity search is an important part of most time series data mining algorithm.

Dynamic Time Warping is the best measure to use but slow.

5

DefinitionsTime series

Time series T is an ordered list:

T = t1, t2, … ,tm

6

DefinitionsSubsequence

Subsequence Ti,k of time series T is a time series of length k start at position i:

T = t1, t2, … ,tm

7

DefinitionsDynamic Time Warping

8

Related workKnown optimizations

Squared distance

√❑

9

Related workKnown optimizations

Lower bounding

LB_KimFL LB_Keogh

10

Related workKnown optimizations

Early abandon

11

MethodEarly abandon Z-Normalization

Q

TT3

T2

T1

Z-N

orm

aliz

atio

n

Q’

T3’T2’

T1’

Long Time series

SubsequencesNormalized

Subsequences

QueryNormalized

Query

Normal approach

12

MethodEarly abandon Z-Normalization Novel approach

Early abandon with Z-normalization

1. Query is Z-normalized

2. Z-normalization of each subsequence will be calculated on the fly with the distance calculation.

3. If distance > best_so_far then early abandon both calculation

13

MethodRe-ordering Early Abandoning

Ordering is created based on the query.

14

MethodCascading Lower Bounds

Lower bounds are used in a cascade to prune candidates.

15

Results

Comparison between:

Naïve

- Z-normalization from start

- full ED(DTW) calculation

State-of-the-art (SOTA)

- Z-normalization from start

- early abandoning

- LB_Keogh bounding for DTW

UCRSuite

16

ResultsBaseline Tests on Random Walk

Million Billion Trillion0

5000

10000

15000

20000

25000

30000

UCR-ED

SOTA-ED

UCR-DTW

SOTA-DTWmin

ute

s

|𝑄|=128

17

ResultsBaseline Tests on Random Walk

Million Billion0

500

1000

1500

2000

2500

UCR-ED

SOTA-ED

UCR-DTW

SOTA-DTWseco

nd

s

|𝑄|=128

18

ResultsBaseline Tests on Random Walk

|𝑇|=2×106

19

ResultsEEG

Series10

100

200

300

400

500

600

3.4

494.3

UCR-ED

SOTA-ED

ho

urs

20

Conclusion

- The approach is very simple yet so effective.

- These optimizations can be applied to most measures but may not work for some, like: Hamming distance