Time Series I
Syllabus
Nov 4 Introduction to data mining
Nov 5 Association Rules
Nov 10, 14 Clustering and Data Representation
Nov 17 Exercise session 1 (Homework 1 due)
Nov 19 Classification
Nov 24, 26 Similarity Matching and Model Evaluation
Dec 1 Exercise session 2 (Homework 2 due)
Dec 3 Combining Models
Dec 8, 10 Time Series Analysis
Dec 15 Exercise session 3 (Homework 3 due)
Dec 17 Ranking
Jan 13 Review
Jan 14 EXAM
Feb 23 Re-EXAM
Why deal with sequential data?
• Because all data is sequential
• All data items arrive in the data store in some order
• Examples
– transaction data
– documents and words
• In some (or many) cases the order does not matter
• In many cases the order is of interest
Time-series data: example
Financial time series
Questions
• What is a time series?
• How do we compare time series data?
• What is the structure of sequential data?
• Can we represent this structure compactly and accurately?
Time Series
• A sequence of observations:
– X = (x1, x2, x3, x4, …, xn)
• Each xi is a real number
– e.g., (2.0, 2.4, 4.8, 5.6, 6.3, 5.6, 4.4, 4.5, 5.8, 7.5)
[Figure: the example series plotted with time on the x-axis and value on the y-axis]
Time Series Databases
• A time series is an ordered set of real numbers, representing the measurements of a real variable at equal time intervals
– Stock prices
– Volume of sales over time
– Daily temperature readings
– ECG data
• A time series database is a large collection of time series
Time Series Problems
• The Similarity Problem
– X = x1, x2, …, xn and Y = y1, y2, …, yn
• Define and compute Sim(X, Y) or Dist(X, Y)
– e.g., do stocks X and Y have similar movements?
• Retrieve similar time series efficiently
– Indexing for Similarity Queries
Types of queries
• whole match vs subsequence match
• range query vs nearest neighbor query
Examples
• Find companies with similar stock prices over a
time interval
• Find products with similar sell cycles
• Cluster users with similar credit card utilization
• Find similar subsequences in DNA sequences
• Find scenes in video streams
[Figure: three stock-price series, $price vs. day, over days 1–365]
distance function: defined by an expert (e.g., Euclidean distance)
Problems
• Define the similarity (or distance) function
• Find an efficient algorithm to retrieve similar time series from a database
– (faster than a sequential scan)
The similarity function depends on the application
Metric Distances
• What properties should a similarity distance have to allow (easy) indexing?
– D(A,B) = D(B,A) Symmetry
– D(A,A) = 0 Constancy of Self-Similarity
– D(A,B) >= 0 Positivity
– D(A,B) <= D(A,C) + D(B,C) Triangle Inequality
• Sometimes the distance function that best fits an application is not a metric
• Then indexing becomes interesting and challenging
Euclidean Distance
• Each time series: a point in the n-dimensional space
• Euclidean distance: pair-wise point distance
– X = x1, x2, …, xn
– Y = y1, y2, …, yn
– D(X, Y) = sqrt( Σi (xi − yi)² )
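The pair-wise point distance described above can be written down directly; this is a minimal Python sketch (the function name is illustrative, not from the slides):

```python
import math

# Euclidean distance between two equal-length time series:
# each series is treated as a point in n-dimensional space.
def euclidean_distance(x, y):
    if len(x) != len(y):
        raise ValueError("Euclidean distance requires equal-length series")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

Note that the length check reflects the limitation discussed below: the plain Euclidean model cannot compare series of different lengths.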
Euclidean model
• Query Q: n datapoints
• Database: a collection of time series, each with n datapoints
• Compute the Euclidean distance between the query Q = {q1, q2, …, qn} and every series X = {x1, x2, …, xn} in the database, then rank:

Distance  Rank
0.98      4
0.07      1
0.21      2
0.43      3
Advantages
• Easy to compute: O(n)
• Allows scalable solutions to other problems, such as
– indexing
– clustering
– etc.
Disadvantages
• Query and target lengths must be equal!
• Cannot tolerate noise:
– time shifts
– sequences out of phase
– scaling in the y-axis
Limitations of Euclidean Distance
• Euclidean distance: sequences are aligned "one to one".
• "Warped" time axis: nonlinear alignments are possible.
[Figure: query Q aligned to candidate C under both alignment schemes]
Dynamic Time Warping [Berndt, Clifford, 1994]
• DTW allows sequences to be stretched along the time axis
– Insert 'stutters' (repeated values) into a sequence
– THEN compute the (Euclidean) distance
Computation
• DTW is computed by dynamic programming
• Given two sequences
– P = {p1, p2, …, pM}
– Q = {q1, q2, …, qN}
• Recurrence:
f(i, j) = (pi − qj)² + min{ f(i − 1, j)      (p-stutter)
                            f(i, j − 1)      (q-stutter)
                            f(i − 1, j − 1)  (no stutter) }
D_dtw(P, Q) = f(M, N)
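The DTW recurrence above translates directly into a dynamic program; a minimal sketch (names are illustrative):

```python
# DTW by dynamic programming: f[i][j] holds the best cost of warping
# the first i points of p against the first j points of q.
def dtw_distance(p, q):
    m, n = len(p), len(q)
    INF = float("inf")
    f = [[INF] * (n + 1) for _ in range(m + 1)]
    f[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = (p[i - 1] - q[j - 1]) ** 2
            f[i][j] = cost + min(f[i - 1][j],       # p-stutter
                                 f[i][j - 1],       # q-stutter
                                 f[i - 1][j - 1])   # no stutter
    return f[m][n]
```

Repeating a point (a 'stutter') adds only the squared difference of the repeated match, so e.g. (1, 2, 2, 3) warps onto (1, 2, 3) with distance 0, which Euclidean distance cannot do.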
DTW: Dynamic time warping (1/2)
• Each cell c = (i, j) is a pair of indices whose corresponding values will be compared, (xi − yj)², and included in the sum for the distance.
• Euclidean path:
– i = j always.
– Ignores off-diagonal cells.
[Figure: the alignment matrix of X against Y; the Euclidean path runs along the diagonal, accumulating (x1 − y1)² + (x2 − y2)² + …]
DTW: Dynamic time warping (2/2)
• DTW allows any path.
• Examine all paths: cell (i, j) can be reached from (i−1, j) (shrink x / stretch y), (i, j−1) (stretch x / shrink y), or (i−1, j−1).
• Standard dynamic programming to fill in the table.
• The top-right cell contains the final result.
• Warping path W:
– a set of grid cells in the time warping matrix
• DTW finds the optimum warping path W (the best alignment):
– the path with the smallest matching score
Properties of a DTW legal path
I. Boundary conditions: W1 = (1, 1) and WK = (n, m)
II. Continuity: given Wk = (a, b), then Wk−1 = (c, d), where a − c ≤ 1 and b − d ≤ 1
III. Monotonicity: given Wk = (a, b), then Wk−1 = (c, d), where a − c ≥ 0 and b − d ≥ 0
C. S. Myers and L. R. Rabiner. A comparative study of several dynamic time-warping algorithms for connected word recognition. The Bell System Technical Journal, 60(7):1389–1409, Sept. 1981.
H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 26(1):43–49, 1978.
Advantages
• Query and target lengths need not be equal
• Can tolerate noise:
– time shifts
– sequences out of phase
– scaling in the y-axis
Disadvantages
• Computational complexity: O(nm)
• May not be able to handle some types of noise
• It is not a metric (the triangle inequality does not hold)
Global Constraints
• Slightly speed up the calculations and prevent pathological warpings
• A global constraint limits the indices of the warping path: wk = (i, j)k such that j − r ≤ i ≤ j + r, where r defines the allowed range of warping for a given point in a sequence
• Common constraints: the Sakoe-Chiba Band (constant width r) and the Itakura Parallelogram
Complexity of DTW
• Basic implementation: O(n²), where n is the length of the sequences
– we have to solve the subproblem for each (i, j) pair
• If a warping window is specified, then O(nr)
– only solve for the (i, j) pairs where |i − j| <= r
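The windowed O(nr) variant only fills cells within the band |i − j| <= r. A sketch, assuming equal-length sequences (a full implementation would also allow different lengths):

```python
# DTW restricted to a Sakoe-Chiba band of half-width r:
# only cells with |i - j| <= r are filled, giving O(n*r) work.
def dtw_band(p, q, r):
    n = len(p)
    if len(q) != n:
        raise ValueError("this sketch assumes equal-length sequences")
    INF = float("inf")
    f = [[INF] * (n + 1) for _ in range(n + 1)]
    f[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(n, i + r) + 1):
            cost = (p[i - 1] - q[j - 1]) ** 2
            # cells outside the band stay at infinity and never win the min
            f[i][j] = cost + min(f[i - 1][j], f[i][j - 1], f[i - 1][j - 1])
    return f[n][n]
```

With r = 0 the band collapses to the diagonal and the result is exactly the squared Euclidean distance.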
Longest Common Subsequence Measures
(Allowing for Gaps in Sequences)
[Figure: two series matched with a gap skipped]
Longest Common Subsequence (LCSS)
• Matches similar points and ignores the majority of the noise
Advantages of LCSS:
A. Outlying values are not matched
B. Distance/similarity is distorted less
Disadvantages of DTW:
A. All points are matched
B. Outliers can distort the distance
C. One-to-many mapping
LCSS is more resilient to noise than DTW.
Longest Common Subsequence
• Similar dynamic programming solution as DTW, but now we measure similarity, not distance.
• Can also be expressed as a distance
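A minimal LCSS sketch for real-valued series. The slides leave the matching criterion implicit; here we assume two points match when they are within a threshold eps (that parameter is an assumption of this sketch):

```python
# LCSS for real-valued series: points x_i, y_j match if |x_i - y_j| <= eps.
# Unmatched points (e.g. outliers) are simply skipped, not forced to align.
def lcss_length(x, y, eps):
    m, n = len(x), len(y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if abs(x[i - 1] - y[j - 1]) <= eps:
                L[i][j] = L[i - 1][j - 1] + 1   # extend the common subsequence
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])  # skip a point
    return L[m][n]

def lcss_distance(x, y, eps):
    # the similarity expressed as a distance in [0, 1]
    return 1.0 - lcss_length(x, y, eps) / min(len(x), len(y))
```

Note how the outlier 100 in the first test below is skipped entirely instead of distorting the score, which is the advantage over DTW listed above.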
Similarity Retrieval
• Range Query
– Find all time series X where D(Q, X) <= ε
• Nearest Neighbor query
– Find the k most similar time series to Q
• A method to answer the above queries:
– Linear scan
• A better approach:
– GEMINI [next time]
Lower Bounding – NN search
• Intuition
– Try to use a cheap lower bounding calculation as often as possible
– Do the expensive, full calculations only when absolutely necessary
• We can speed up similarity search by using a lower bounding function
– D: distance measure
– LB: lower bounding function s.t. LB(Q, X) <= D(Q, X)
• We assume a database of time series: DB = {X1, X2, …, XN}
1-NN Search Using LB:
Set best = ∞
For each Xi:
    if LB(Xi, Q) < best:
        if D(Xi, Q) < best:
            best = D(Xi, Q)
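The 1-NN pseudocode above translates directly; a sketch where the distance and lower-bound functions are passed in as parameters:

```python
# 1-NN search with lower-bound pruning: the cheap LB check filters out
# candidates whose true distance cannot beat the best so far.
def nn_search(query, database, dist, lb):
    best_dist, best_idx = float("inf"), None
    for idx, x in enumerate(database):
        if lb(query, x) < best_dist:      # cheap test first
            d = dist(query, x)            # expensive true distance
            if d < best_dist:
                best_dist, best_idx = d, idx
    return best_idx, best_dist
```

Correctness only requires LB(Q, X) <= D(Q, X); any valid lower bound gives the exact answer, and tighter bounds prune more candidates.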
Lower Bounding – Range search
Range Query Using LB (same database DB = {X1, X2, …, XN}):
For each Xi:
    if LB(Xi, Q) <= ε:
        if D(Xi, Q) < ε:
            report Xi
Problems
• How to define lower bounds for different distance measures?
• How to extract the features? How to define the feature space?
– Fourier transform
– Wavelet transform
– Averages of segments (Histograms or APCA)
– Chebyshev polynomials
– … your favorite curve approximation
Some Lower Bounds on DTW
LB_Kim
• Each sequence is represented by 4 features: <First, Last, Min, Max>
• LB_Kim = the maximum squared difference of the corresponding features
LB_Yi
• Uses max(Q) and min(Q): the points of the candidate that lie above max(Q) or below min(Q) contribute their distance to these extrema
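The LB_Kim bound above is simple enough to state in a few lines (a sketch; the helper for extracting the 4 features is illustrative):

```python
# LB_Kim: represent each sequence by <First, Last, Min, Max> and take
# the maximum squared difference of the corresponding features.
def lb_kim(q, c):
    def feats(s):
        return (s[0], s[-1], min(s), max(s))
    return max((a - b) ** 2 for a, b in zip(feats(q), feats(c)))
```

The bound is O(n) to compute once per candidate and never exceeds the true DTW distance, since DTW must pay at least for the worst of these four matched features.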
LB_Keogh [Keogh 2004]
• Build an envelope (U, L) around the query Q, following the chosen global constraint (Sakoe-Chiba Band or Itakura Parallelogram):
Ui = max(qi−r : qi+r)
Li = min(qi−r : qi+r)
LB_Keogh compares the candidate C against the envelope of Q:

LB_Keogh(Q, C) = sqrt( Σi=1..n  (ci − Ui)²  if ci > Ui
                                (ci − Li)²  if ci < Li
                                0           otherwise )

LB_Keogh(Q, C) <= DTW(Q, C)
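The bound can be sketched as follows, building the envelope on the fly from Q (assumes equal-length series; names are illustrative):

```python
import math

# LB_Keogh: build the band envelope U_i = max(q_{i-r..i+r}),
# L_i = min(q_{i-r..i+r}), and charge only the parts of the
# candidate C that fall outside the envelope.
def lb_keogh(q, c, r):
    n = len(q)
    total = 0.0
    for i in range(n):
        lo, hi = max(0, i - r), min(n - 1, i + r)
        U = max(q[lo:hi + 1])
        L = min(q[lo:hi + 1])
        if c[i] > U:
            total += (c[i] - U) ** 2
        elif c[i] < L:
            total += (c[i] - L) ** 2
    return math.sqrt(total)
```

Each window max/min here is recomputed in O(r); a production version would maintain the envelope with a monotonic deque in O(n) total.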
Tightness of LB
• LB_Keogh (Sakoe-Chiba), LB_Keogh (Itakura), LB_Yi, LB_Kim
• …the tightness is proportional to the length of the gray lines used in the illustrations
T = (Lower Bound Estimate of the Dynamic Time Warp Distance) / (True Dynamic Time Warp Distance)
0 <= T <= 1; the larger, the better
Lower Bounding: a 1-NN search walkthrough
• We want to find the 1-NN to our query data series, Q.
• We compute the distance to the first data series in our dataset, D(S1, Q); this becomes the best so far (BSF).
• We compute LB(S2, Q) and it is greater than the BSF; we can safely prune S2, since D(S2, Q) >= LB(S2, Q).
• We compute LB(S3, Q) and it is smaller than the BSF; we have to compute D(S3, Q), which is >= LB(S3, Q), since it may still be smaller than the BSF.
• It turns out that D(S3, Q) >= BSF, so we can safely prune S3.
• We compute LB(S4, Q) and it is smaller than the BSF; we have to compute D(S4, Q), which is >= LB(S4, Q), since it may still be smaller than the BSF.
• It turns out that D(S4, Q) < BSF, so S4 becomes the new BSF.
• S1 cannot be the 1-NN, because S4 is closer to Q.
How about subsequence matching?
• DTW is defined for full-sequence matching:
– All points of the query sequence are matched to all points of the target sequence
• Subsequence matching:
– The query is matched to a part (subsequence) of the target sequence
Subsequence Matching
• X: long sequence; Q: short sequence
• What subsequence of X is the best match for Q?
J-Position Subsequence Match
• What subsequence of X is the best match for Q, such that the match ends at position j?
Naïve Solution: DTW
• Examine all possible subsequences: for every ending position j, run a full DTW from every possible starting position
• Too costly!
Why not 'naïve'?
• Compute the time warping matrices starting from every database frame
– Need O(n) matrices, O(nm) time per frame
– Each matrix captures the optimal subsequence starting from t = tstart
Key Idea: Star-padding
• Use only a single matrix (the naïve solution uses n matrices)
• Prefix Q with '*', which always gives zero distance
• Instead of Q = (q1, q2, …, qm), compute distances with Q' = (q0, q1, q2, …, qm), where q0 = '*'
• O(m) time and space (the naïve solution requires O(nm))
SPRING: dynamic programming
• Initialization
– Insert a "dummy" state '*' at the beginning of the query
– '*' matches every value in X with score 0, so the bottom row of the matrix (query Q vs. database sequence X) is all zeros
• Computation
– Perform the dynamic programming computation in a similar manner as standard DTW
SPRING: dynamic programming
• As in standard DTW, cell (i, j) can be reached from (i−1, j), (i, j−1), or (i−1, j−1)
• For each (i, j): compute the j-position subsequence match of the first i items of Q to X[s:j], i.e., Q[1:i] is matched with X[s:j]
• Top row: the j-position subsequence match of Q, for all j's
• Final answer: the best among the j-position matches; look at the answers stored in the top row of the table
Subsequence vs. full matching
• Assume that the database is one very long sequence: concatenate all sequences into one sequence
Computational complexity
• O(|Q| × |X|) time overall
• But it can be computed efficiently by keeping only two adjacent columns of the table
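The star-padded computation, with the starting position carried along through the recurrence, can be sketched as follows. This is an illustrative reading of the SPRING idea, not the paper's exact pseudocode; the real algorithm also reports matches during the scan under a threshold, while this sketch only returns the single best match:

```python
# SPRING-style subsequence matching sketch: star-padded DTW over a long
# sequence x, keeping only two columns (O(len(q)) state) plus the start
# position of each partial match.
def spring_best_match(x, q):
    INF = float("inf")
    m = len(q)
    # prev[i] = (cost, start) for matching q[:i] ending at the previous
    # column; row 0 is the '*' dummy state with cost 0.
    prev = [(0.0, 0)] + [(INF, None)] * m
    best = (INF, None, None)  # (distance, start, end)
    for j, xj in enumerate(x):
        cur = [(0.0, j + 1)] + [(INF, None)] * m
        for i in range(1, m + 1):
            cost = (q[i - 1] - xj) ** 2
            # cheapest of the three DTW predecessors; inherit its start
            c, s = min(prev[i], prev[i - 1], cur[i - 1], key=lambda t: t[0])
            cur[i] = (cost + c, s)
        if cur[m][0] < best[0]:           # full query matched, ending at j
            best = (cur[m][0], cur[m][1], j)
        prev = cur
    return best  # x[start:end+1] is the best-matching subsequence
```

One scan over x suffices, and no column of the table needs to be recomputed for different starting positions: the '*' row lets a match begin at any position for free.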
STWM (Subsequence Time Warping Matrix)
• Problem of star-padding: we lose the information about the starting frame of the match
• After the scan: "which is the optimal subsequence?"
• Elements of the STWM
– The distance value of each subsequence
– The starting position
• Combination of star-padding and the STWM
– Efficiently identify the optimal subsequence in a streaming fashion
Up next…
• Time series summarizations
– Discrete Fourier Transform (DFT)
– Discrete Wavelet Transform (DWT)
– Piecewise Aggregate Approximation (PAA)
– Symbolic ApproXimation (SAX)
• Time series classification
– Lazy learners
– Shapelets