Similarity search with trillions of time series


Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping

Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista, Brandon Westover, Qiang Zhu, Jesin Zakaria, Eamonn Keogh

Hoan Nguyen – Trung Minh Nguyen

Abstract

Optimizations to search and mine large databases very fast

Outline

Problem

Related work

Definitions

Method

Results

Conclusion

Problem

Similarity search is an important part of most time series data mining algorithms.

Dynamic Time Warping (DTW) is the best measure to use, but it is slow.

Definitions: Time series

Time series T is an ordered list:

T = t1, t2, … ,tm

Definitions: Subsequence

A subsequence T_{i,k} of a time series T is a time series of length k starting at position i:

T_{i,k} = t_i, t_{i+1}, … , t_{i+k-1}, where 1 ≤ i ≤ m-k+1
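As a small illustration of the definition, the sketch below extracts a subsequence from a Python list using the 1-based position i of the slides. The function name `subsequence` is a hypothetical helper, not part of any library.

```python
def subsequence(T, i, k):
    """Return the k values of T starting at 1-based position i."""
    m = len(T)
    # A valid start position leaves room for k values: 1 <= i <= m - k + 1.
    if k < 1 or not (1 <= i <= m - k + 1):
        raise ValueError("subsequence does not fit inside T")
    return T[i - 1 : i - 1 + k]


T = [3, 1, 4, 1, 5, 9, 2, 6]
print(subsequence(T, 3, 4))  # [4, 1, 5, 9]
```

A series of length m has m - k + 1 subsequences of length k, which is why searching a long series means evaluating a very large number of candidates.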

Definitions: Dynamic Time Warping
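DTW aligns two sequences by letting one point match several neighbouring points of the other, and sums the point-to-point costs along the cheapest alignment. The following is a minimal sketch, assuming equal-length sequences, a squared point cost, and a Sakoe-Chiba band of half-width r that limits how far the alignment may stray from the diagonal. The name `dtw_squared` and the parameter `r` are illustrative choices.

```python
import math


def dtw_squared(q, c, r):
    """Squared DTW distance between equal-length sequences q and c.

    r is the band half-width: point i of q may only be matched
    with points j of c where |i - j| <= r.
    """
    n = len(q)
    if len(c) != n:
        raise ValueError("this sketch assumes equal-length sequences")
    prev = [math.inf] * (n + 1)
    prev[0] = 0.0  # empty prefix matched with empty prefix
    for i in range(1, n + 1):
        curr = [math.inf] * (n + 1)
        for j in range(max(1, i - r), min(n, i + r) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            # Cheapest way to reach cell (i, j): from above, diagonal, or left.
            curr[j] = cost + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return prev[n]


print(dtw_squared([1, 2, 3, 4], [1, 1, 2, 3], r=1))  # 1.0
print(dtw_squared([1, 2, 3, 4], [1, 1, 2, 3], r=0))  # 3.0
```

With r = 0 the alignment is forced onto the diagonal, so the result equals the squared Euclidean distance. The nested loops are what make DTW slow compared with ED.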

Related work: Known optimizations

Squared distance

Omit the square root (√); the ranking of candidates is unchanged.

Lower bounding

LB_KimFL, LB_Keogh

Early abandon
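Two of the optimizations listed above, squared distance and early abandoning, can be sketched together for Euclidean distance. The function name `squared_ed_early_abandon` is hypothetical, and both inputs are assumed to be already normalized and of equal length.

```python
import math


def squared_ed_early_abandon(q, c, best_so_far=math.inf):
    """Squared Euclidean distance; returns inf once it exceeds best_so_far."""
    total = 0.0
    for qi, ci in zip(q, c):
        total += (qi - ci) ** 2  # no square root: ranking is unchanged
        if total > best_so_far:
            return math.inf  # this candidate cannot beat the current best
    return total


print(squared_ed_early_abandon([0, 0, 0, 0], [1, 2, 3, 4]))     # 30.0
print(squared_ed_early_abandon([0, 0, 0, 0], [1, 2, 3, 4], 4))  # inf
```

Because the running sum only grows, stopping as soon as it passes best_so_far never discards a candidate that could have won.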

Method: Early abandon Z-Normalization

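Normal approach

[Figure: subsequences (T2’, T3’, …) are extracted from the long time series and Z-normalized; the query is Z-normalized; the normalized subsequences are compared with the normalized query.]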

Method: Early abandon Z-Normalization (novel approach)

Early abandon with Z-normalization

1. Query is Z-normalized

2. Z-normalization of each subsequence will be calculated on the fly with the distance calculation.

3. If distance > best_so_far, then early abandon both calculations
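The three steps above can be sketched for Euclidean distance as follows. This is an illustration under assumptions, not the authors' code: the mean and standard deviation of each subsequence come from running sums, the names `znorm` and `nearest_subsequence` are hypothetical, and the query is assumed not to be constant.

```python
import math


def znorm(x):
    """Z-normalize a sequence (assumes it is not constant)."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum(v * v for v in x) / n - mean * mean)
    return [(v - mean) / std for v in x]


def nearest_subsequence(T, query):
    """Return (0-based start, squared ED) of the best-matching subsequence."""
    k = len(query)
    q = znorm(query)                      # step 1: normalize the query once
    best_so_far, best_pos = math.inf, -1
    s = sum(T[:k])                        # running sum of the window
    s2 = sum(v * v for v in T[:k])        # running sum of squares
    for start in range(len(T) - k + 1):
        if start > 0:                     # slide the window by one point
            out, new = T[start - 1], T[start + k - 1]
            s += new - out
            s2 += new * new - out * out
        mean = s / k
        var = s2 / k - mean * mean
        if var <= 1e-12:
            continue                      # constant window: skip it
        std = math.sqrt(var)
        dist = 0.0
        for j in range(k):
            # step 2: normalize each point only when the distance needs it
            d = (T[start + j] - mean) / std - q[j]
            dist += d * d
            if dist > best_so_far:
                break                     # step 3: abandon both calculations
        else:
            best_so_far, best_pos = dist, start
    return best_pos, best_so_far


T = [5, 1, 4, 2, 4, 6, 4, 0, 3, 7]
pos, dist = nearest_subsequence(T, [1, 2, 3, 2])
print(pos)  # 3, because T[3:7] = [2, 4, 6, 4] is the query scaled by 2
```

When a candidate is abandoned after a few points, the remaining points of that subsequence are never normalized, which is where the saving comes from.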

Method: Re-ordering Early Abandoning

Ordering is created based on the query.
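One way to build such an ordering, assumed here, is to visit the points of the Z-normalized query in decreasing order of absolute value, since the points farthest from the mean tend to contribute the most to the distance. The sketch below counts how many points are examined before abandoning; the function names and the sample values are illustrative.

```python
import math


def query_order(q_norm):
    """Indices of the normalized query, largest absolute value first."""
    return sorted(range(len(q_norm)), key=lambda i: abs(q_norm[i]), reverse=True)


def ordered_early_abandon_ed(q_norm, c_norm, order, best_so_far=math.inf):
    """Return (squared ED or inf if abandoned, number of points examined)."""
    total = 0.0
    for steps, i in enumerate(order, start=1):
        total += (q_norm[i] - c_norm[i]) ** 2
        if total > best_so_far:
            return math.inf, steps
    return total, len(order)


q = [0.1, -0.2, 2.0, -1.9]   # illustrative values
c = [0.0, 0.0, 0.0, 0.0]
print(ordered_early_abandon_ed(q, c, query_order(q), best_so_far=3.0))  # (inf, 1)
print(ordered_early_abandon_ed(q, c, range(len(q)), best_so_far=3.0))   # (inf, 3)
```

The ordering depends only on the query, so it is computed once and reused for every candidate.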

Method: Cascading Lower Bounds

Lower bounds are used in a cascade to prune candidates.
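A cascade of this kind can be sketched as: try the cheapest bound first, move to a tighter and more expensive bound only if the candidate survives, and run full DTW last. The sketch assumes candidates are already Z-normalized, have the same length as the query (at least 2), and are compared by squared DTW with band half-width r. LB_KimFL here uses only the first and last points, and LB_Keogh uses an envelope around the query. All function names are illustrative.

```python
import math


def lb_kim_fl(q, c):
    """O(1) bound from the first and last points (needs len >= 2)."""
    return (q[0] - c[0]) ** 2 + (q[-1] - c[-1]) ** 2


def envelope(q, r):
    """Lower and upper envelope of q within a window of half-width r."""
    n = len(q)
    lower = [min(q[max(0, i - r): i + r + 1]) for i in range(n)]
    upper = [max(q[max(0, i - r): i + r + 1]) for i in range(n)]
    return lower, upper


def lb_keogh(lower, upper, c, best_so_far=math.inf):
    """O(n) bound: squared distance from c to the query envelope."""
    total = 0.0
    for lo, up, v in zip(lower, upper, c):
        if v > up:
            total += (v - up) ** 2
        elif v < lo:
            total += (lo - v) ** 2
        if total > best_so_far:
            break  # a partial sum is still a valid lower bound
    return total


def dtw_squared(q, c, r):
    """Squared DTW with a Sakoe-Chiba band of half-width r."""
    n = len(q)
    prev = [math.inf] * (n + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (n + 1)
        for j in range(max(1, i - r), min(n, i + r) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            curr[j] = cost + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return prev[n]


def cascade_search(q, candidates, r):
    """Return (best index, best squared DTW, counts per cascade stage)."""
    lower, upper = envelope(q, r)
    best, best_idx = math.inf, -1
    stats = {"pruned_kim": 0, "pruned_keogh": 0, "full_dtw": 0}
    for idx, c in enumerate(candidates):
        if lb_kim_fl(q, c) > best:
            stats["pruned_kim"] += 1
            continue
        if lb_keogh(lower, upper, c, best) > best:
            stats["pruned_keogh"] += 1
            continue
        stats["full_dtw"] += 1
        d = dtw_squared(q, c, r)
        if d < best:
            best, best_idx = d, idx
    return best_idx, best, stats


q = [0, 1, 2, 1, 0]
cands = [[0, 0, 1, 2, 1], [3, 1, 2, 1, 0], [0, 4, 4, 4, 0], [0, 1, 2, 1, 0]]
print(cascade_search(q, cands, r=1))
```

In this toy run the second candidate is rejected by LB_KimFL and the third by LB_Keogh, so full DTW is computed for only two of the four candidates.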

Results

Comparison between:

Naïve

- Z-normalization from start

- full ED (or DTW) calculation

State-of-the-art (SOTA)

- Z-normalization from start

- early abandoning

- LB_Keogh bounding for DTW

UCR Suite

Results: Baseline Tests on Random Walk

[Chart: UCR-ED, SOTA-ED, UCR-DTW, and SOTA-DTW; x-axis: Million, Billion, Trillion; y-axis: minutes; |Q| = 128]

[Chart: UCR-ED, SOTA-ED, UCR-DTW, and SOTA-DTW; x-axis: Million, Billion; y-axis: seconds; |Q| = 128, |T| = 2×10^6]

Results: EEG

[Chart: UCR-ED vs. SOTA-ED]

Conclusion

- The approach is very simple yet very effective.

- These optimizations can be applied to most measures but may not work for some, such as Hamming distance.
