
A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data

David S. Matteson
Department of Statistical Science

Cornell University

matteson@cornell.edu
www.stat.cornell.edu/~matteson

Joint work with: Nicholas A. James, ORIE, Cornell University

Sponsorship: National Science Foundation

2014 October

David S. Matteson (matteson@cornell.edu) Change Point Analysis 2014 October 1 / 40

Introduction

Change Point Analysis

The process of detecting distributional changes within time ordered data

Framework:

- Retrospective, offline analysis
- Multivariate observations
- Estimation: number of change points and their positions
- Hierarchical algorithms

Applications:

- Genetics
- Finance
- Emergency Medical Services


Introduction

Change Point Analysis

Given independent, time-ordered observations X1, X2, ..., Xn ∈ R^d

Partition into k homogeneous, temporally contiguous subsets

- k is unknown
- Size of each subset is unknown


Cluster Analysis

Change point analysis is similar to cluster analysis

In cluster analysis we also wish to partition the observations into homogeneous subsets

- Subsets may not be contiguous in time without some constraints


Hierarchical Estimation

Apply methods from clustering to find change points

Exhaustive search is not practical: O(n^k), in general

May consider dynamic programming

We use a hierarchical, sequential approach: O(kn^2)

- Divisive: clusters are divided until each observation is its own cluster
- Agglomerative: clusters are merged until all observations belong to a single cluster


Hierarchical Estimation: Divisive Progression


Hierarchical Estimation: Agglomerative Progression


Multivariate Homogeneity

Measuring Multivariate Homogeneity

Suppose X, Y ∈ R^d with X ~ F_X independent of Y ~ F_Y

Let φ_x(t) = E(e^{i⟨t,X⟩}) and φ_y(t) = E(e^{i⟨t,Y⟩}) denote their characteristic functions

Define a divergence between F_X and F_Y as

E(X, Y; w) = ∫_{R^d} |φ_x(t) − φ_y(t)|² w(t) dt,

in which w(t) denotes an arbitrary positive weight function for which E exists


A Weight Function

A convenient choice for w(t) > 0 (Székely and Rizzo, 2005):

w(t; α) = ( [2 π^{d/2} Γ(1 − α/2)] / [α 2^α Γ((d + α)/2)] · |t|^{d+α} )^{−1}

in which Γ(·) is the gamma function

Note: for any fixed (d, α), w(t; α) ∝ |t|^{−(d+α)}


Equivalent Divergence Measures

Let X and Y be independent, and (X′, Y′) be an iid copy of (X, Y)

Theorem

Suppose that E(|X|^α + |Y|^α) < ∞ for some α ∈ (0, 2]. Then

E(X, Y; α) = ∫_{R^d} |φ_x(t) − φ_y(t)|² ( [2 π^{d/2} Γ(1 − α/2)] / [α 2^α Γ((d + α)/2)] · |t|^{d+α} )^{−1} dt
           = 2 E|X − Y|^α − E|X − X′|^α − E|Y − Y′|^α
           < ∞

- If 0 < α < 2, then E(X, Y; α) = 0 if and only if X and Y are identically distributed
- If α = 2, then E(X, Y; α) = 0 if and only if EX = EY


An Empirical Measure (U-statistics)

Let Xn = {Xi : i = 1, ..., n} and Ym = {Yj : j = 1, ..., m} be independent iid samples from the distributions of X, Y ∈ R^d, respectively, such that E|X|^α, E|Y|^α < ∞ for some α ∈ (0, 2)

Define

E(Xn, Ym; α) = (2 / mn) Σ_{i=1}^{n} Σ_{j=1}^{m} |Xi − Yj|^α
             − C(n, 2)^{−1} Σ_{1≤i<k≤n} |Xi − Xk|^α
             − C(m, 2)^{−1} Σ_{1≤j<k≤m} |Yj − Yk|^α

and

Q(Xn, Ym; α) = (mn / (m + n)) E(Xn, Ym; α)
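The empirical divergence is straightforward to compute directly from pairwise distances. Below is a minimal, illustrative Python sketch (not the ecp implementation); the names energy_divergence and q_statistic are our own.

```python
import math
from itertools import combinations

def _mean_pair_dist(S, alpha):
    # Average of |a - b|^alpha over all within-sample pairs; needs len(S) >= 2.
    return sum(math.dist(a, b) ** alpha for a, b in combinations(S, 2)) / math.comb(len(S), 2)

def energy_divergence(X, Y, alpha=1.0):
    """Empirical E(X_n, Y_m; alpha): twice the mean between-sample distance
    minus the mean within-sample distances. Points are coordinate tuples."""
    n, m = len(X), len(Y)
    between = 2.0 / (m * n) * sum(math.dist(x, y) ** alpha for x in X for y in Y)
    return between - _mean_pair_dist(X, alpha) - _mean_pair_dist(Y, alpha)

def q_statistic(X, Y, alpha=1.0):
    """Scaled statistic Q = (mn / (m + n)) * E(X_n, Y_m; alpha)."""
    n, m = len(X), len(Y)
    return (m * n) / (m + n) * energy_divergence(X, Y, alpha)

# Two well-separated univariate samples give a large divergence:
print(q_statistic([(0.0,), (1.0,)], [(10.0,), (11.0,)]))  # 18.0
```

Note that the U-statistic form uses only distinct within-sample pairs, so each sample needs at least two observations.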


Known Location: Two-Sample Homogeneity Test

By the strong law of large numbers for U-statistics (Hoeffding, 1961),

E(Xn, Ym; α) → E(X, Y; α)

almost surely, as min(m, n) → ∞.

Under the null hypothesis of equal distributions, i.e. E(X, Y; α) = 0,

Q(Xn, Ym; α) → Q(X, Y; α) = Σ_{i=1}^{∞} λ_i Q_i

in distribution, as min(m, n) → ∞. Here, the λ_i > 0 are constants that depend on α and the distributions of X and Y, and the Q_i are iid χ²₁ random variables; see Rizzo and Székely (2010).

Under the alternative hypothesis of unequal distributions, i.e. E(X, Y; α) > 0,

Q(Xn, Ym; α) → ∞ almost surely, as min(m, n) → ∞.


Single Change Point

Single Change Point: Unknown Location

Let Z1, ..., ZT ∈ R^d be an independent sequence.

Suppose the sample is heterogeneous, with observations drawn from two distributions.

Let γ ∈ (0, 1) denote the division of observations, such that Z1, ..., Z⌊γT⌋ ~ F_X and Z⌊γT⌋+1, ..., ZT ~ F_Y for every sample of size T.

Define Xτ = {Z1, Z2, ..., Zτ} and Yτ = {Zτ+1, Zτ+2, ..., ZT}.

A change point location τ̂_T is then estimated as

τ̂_T = argmax_τ Q(Xτ, Yτ; α).

Theorem

If E(X, Y; α) < ∞ and γ ∈ (0, 1), then τ̂_T / T → γ almost surely, as T → ∞.
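The argmax estimator amounts to a scan over candidate split points. A minimal Python sketch follows (illustrative only: here the right endpoint κ is fixed at T, whereas the full procedure also searches over κ, and the function names are our own):

```python
import math
from itertools import combinations

def q_stat(X, Y, alpha=1.0):
    # Q = (mn / (m + n)) * E(X, Y; alpha), computed from pairwise distances.
    n, m = len(X), len(Y)
    between = 2.0 / (m * n) * sum(math.dist(x, y) ** alpha for x in X for y in Y)
    wx = sum(math.dist(a, b) ** alpha for a, b in combinations(X, 2)) / math.comb(n, 2)
    wy = sum(math.dist(a, b) ** alpha for a, b in combinations(Y, 2)) / math.comb(m, 2)
    return (m * n) / (m + n) * (between - wx - wy)

def single_change_point(Z, alpha=1.0, min_size=2):
    """Return (tau_hat, q_hat): the split maximizing Q(Z[:tau], Z[tau:]; alpha)."""
    best_tau, best_q = None, -math.inf
    for tau in range(min_size, len(Z) - min_size + 1):
        q = q_stat(Z[:tau], Z[tau:], alpha)
        if q > best_q:
            best_tau, best_q = tau, q
    return best_tau, best_q

# A clear mean shift at position 3 is recovered exactly:
Z = [(0.0,)] * 3 + [(10.0,)] * 3
print(single_change_point(Z))  # (3, 30.0)
```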


Multiple Change Points

Multiple Change Points: Unknown Locations

A generalized bisection approach for sequential estimation

For 1 ≤ τ < κ ≤ T, define:

Xτ = {Z1, Z2, ..., Zτ} and Yτ(κ) = {Zτ+1, Zτ+2, ..., Zκ}

A change point location τ̂ is then estimated as

(τ̂, κ̂) = argmax_{(τ,κ)} Q(Xτ, Yτ(κ); α).


Sequentially Estimating Multiple Change Points

Suppose k − 1 change points have been estimated: τ̂1 < · · · < τ̂k−1

This partitions the observations into k clusters C1, C2, ..., Ck

Given these clusters, we then apply the single change point procedure within each of the k clusters.

For the ith cluster Ci, denote the proposed change point location τ̂(i) and the associated constant κ̂(i)

Now let

i* = argmax_{i ∈ {1,...,k}} Q[Xτ̂(i), Yτ̂(i)(κ̂(i)); α],

in which Xτ̂(i) and Yτ̂(i)(κ̂(i)) are defined with respect to Ci

Denote the test statistic as

q̂k = Q(Xτ̂k, Yτ̂k(κ̂k); α),

where τ̂k = τ̂(i*) is the kth estimated change point, located within cluster Ci*
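The sequential procedure can be sketched as a greedy bisection loop: at each step, scan every current cluster for its best internal split and accept the split with the largest Q. This illustrative Python sketch stops after a fixed number k of change points and fixes κ at each cluster's right endpoint; the actual E-Divisive algorithm also searches over κ and stops via a permutation test. Function names are our own.

```python
import math
from itertools import combinations

def q_stat(X, Y, alpha=1.0):
    # Q = (mn / (m + n)) * E(X, Y; alpha), computed from pairwise distances.
    n, m = len(X), len(Y)
    between = 2.0 / (m * n) * sum(math.dist(x, y) ** alpha for x in X for y in Y)
    wx = sum(math.dist(a, b) ** alpha for a, b in combinations(X, 2)) / math.comb(n, 2)
    wy = sum(math.dist(a, b) ** alpha for a, b in combinations(Y, 2)) / math.comb(m, 2)
    return (m * n) / (m + n) * (between - wx - wy)

def greedy_change_points(Z, k, alpha=1.0, min_size=2):
    """Greedy bisection: repeatedly split the cluster whose best internal
    split has the largest Q, until k change points have been found."""
    cps = []
    while len(cps) < k:
        bounds = [0] + sorted(cps) + [len(Z)]
        best = None  # (q, absolute split index)
        for lo, hi in zip(bounds, bounds[1:]):
            seg = Z[lo:hi]
            if len(seg) < 2 * min_size:
                continue  # cluster too small to split
            for tau in range(min_size, len(seg) - min_size + 1):
                q = q_stat(seg[:tau], seg[tau:], alpha)
                if best is None or q > best[0]:
                    best = (q, lo + tau)
        if best is None:
            break
        cps.append(best[1])
    return sorted(cps)

# Two mean shifts, at positions 4 and 8, are both recovered:
Z = [(0.0,)] * 4 + [(10.0,)] * 4 + [(20.0,)] * 4
print(greedy_change_points(Z, k=2))  # [4, 8]
```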


The E-Divisive Algorithm: Estimating Location

Aτ = {Z1, Z2, ..., Zτ} and Bτ(κ) = {Zτ+1, Zτ+2, ..., Zκ}

Recall, a change point location τ̂ is estimated as

(τ̂, κ̂) = argmax_{(τ,κ)} Q(Aτ, Bτ(κ); α)

Thus, we maximize (mn / (n + m)) E(A, B; α) over all such subsets A and B:


The E-Divisive Algorithm: Inference via Permutation Test

The distribution of the test statistic q̂ = Q(Aτ̂, Bτ̂(κ̂); α) is unknown

The significance of a proposed change point is measured via a permutation test

Randomly permute the series, maximize (mn / (n + m)) E(A, B; α), record the result, and repeat:
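The permutation step can be sketched directly: the observed statistic is compared against the same maximization applied to random reorderings of the series. A minimal, illustrative Python version (function names our own; the ecp package's test is more general, e.g. it also searches over κ):

```python
import math
import random
from itertools import combinations

def q_stat(X, Y, alpha=1.0):
    # Q = (mn / (m + n)) * E(X, Y; alpha), computed from pairwise distances.
    n, m = len(X), len(Y)
    between = 2.0 / (m * n) * sum(math.dist(x, y) ** alpha for x in X for y in Y)
    wx = sum(math.dist(a, b) ** alpha for a, b in combinations(X, 2)) / math.comb(n, 2)
    wy = sum(math.dist(a, b) ** alpha for a, b in combinations(Y, 2)) / math.comb(m, 2)
    return (m * n) / (m + n) * (between - wx - wy)

def max_q(Z, alpha=1.0, min_size=2):
    # Best achievable Q over all admissible split points of Z.
    return max(q_stat(Z[:t], Z[t:], alpha)
               for t in range(min_size, len(Z) - min_size + 1))

def permutation_pvalue(Z, R=199, alpha=1.0, min_size=2, seed=0):
    """Approximate p-value: fraction of permuted series whose best split
    statistic is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = max_q(Z, alpha, min_size)
    perm = list(Z)
    exceed = 0
    for _ in range(R):
        rng.shuffle(perm)
        if max_q(perm, alpha, min_size) >= observed:
            exceed += 1
    return (exceed + 1) / (R + 1)

# An obvious change yields a small p-value; a homogeneous series does not:
shifted = [(0.0,)] * 4 + [(10.0,)] * 4
print(permutation_pvalue(shifted, R=99) < 0.3)  # True
print(permutation_pvalue([(0.0,)] * 8, R=99))   # 1.0
```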


The E-Divisive Algorithm: Multiple Change Points

If q̂ = Q(Aτ̂, Bτ̂(κ̂); α) is insignificant: STOP

If significant, condition on the estimated location, and repeat within clusters:


The E-Divisive Algorithm: Multiple Change Points

Once again, perform the permutation test

However, permute only within each cluster:


The E-Divisive Algorithm: ecp Package

The ‘ecp’ R package (CRAN)

Signature:

e.divisive(X, sig.lvl=0.05, R=199, k=NULL, min.size=30, alpha=1)

Arguments:

- X: A T × d matrix representation of a length-T time series, with d-dimensional observations.
- sig.lvl: The significance level used for the permutation test.
- R: The maximum number of permutations to perform in the permutation test.
- k: The number of change points to return. If NULL, only the statistically significant estimated change points are returned.
- min.size: The minimum number of observations between change points.
- alpha: The index α for the test statistic.


The ‘ecp’ R package (CRAN)

Returned list:

- k.hat: Number of clusters created by the estimated change points.
- order.found: The order in which the change points were estimated.
- estimates: Locations of the statistically significant change points.
- considered.last: Location of the last change point that was not found to be statistically significant at the given significance level.
- permutations: The number of permutations performed by each sequential permutation test.
- cluster: The estimated cluster membership vector.
- p.values: Approximate p-values estimated from each permutation test.

Complexity is O(kT^2)

Simulation

Simulation Study: Rand Index

Compare E-Divisive with a generalized Wilcoxon/Mann–Whitney approach: the MultiRank procedure of Lung-Yut-Fong et al. (2011)

For two partitions U and V, the Rand index considers all pairs of observations.

Define

- {A}: Pairs in the same cluster under U and in the same cluster under V
- {B}: Pairs in different clusters under U and in different clusters under V

Rand index = (#A + #B) / C(T, 2)

An equivalent definition of the Rand index can be found in Hubert and Arabie (1985).

Adjusted Rand = (Index − Expected Index) / (Max Index − Expected Index)
              = (Rand − Expected Rand) / (1 − Expected Rand)
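Both indices can be computed directly from a pair of cluster-label vectors. A small illustrative Python sketch (function names our own; the adjusted version uses the Hubert–Arabie expected-index form via the contingency table):

```python
from collections import Counter
from itertools import combinations
from math import comb

def rand_index(u, v):
    """Fraction of observation pairs on which partitions u and v agree
    (same cluster in both, or different clusters in both)."""
    idx_pairs = list(combinations(range(len(u)), 2))
    agree_same = sum(1 for i, j in idx_pairs if u[i] == u[j] and v[i] == v[j])
    agree_diff = sum(1 for i, j in idx_pairs if u[i] != u[j] and v[i] != v[j])
    return (agree_same + agree_diff) / comb(len(u), 2)

def adjusted_rand_index(u, v):
    """Hubert-Arabie adjustment: (Index - E[Index]) / (Max Index - E[Index]),
    computed from the contingency table of the two partitions."""
    n = len(u)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(u, v)).values())
    sum_a = sum(comb(c, 2) for c in Counter(u).values())
    sum_b = sum(comb(c, 2) for c in Counter(v).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

print(rand_index([0, 0, 1, 1], [0, 0, 1, 1]))           # 1.0
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # approximately -0.5
```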


A change in variance for univariate normal data

Method Correct k Average Adjusted Rand

MultiRank 22/100 0.504

E-Divisive 95/100 0.909


A change in correlation for bivariate normal data

Method Correct k Average Adjusted Rand

MultiRank 72/100 0.166

E-Divisive 92/100 0.997


1,000 simulations, 2 CP: N(0,1), N(µ,1), N(0,1)

                  Average Rand             Average Adj. Rand
T     µ     MultiRank   E-Divisive   MultiRank   E-Divisive
150   1     0.940       0.948        0.867       0.885
150   2     0.977       0.991        0.949       0.981
150   4     0.981       1.000        0.958       1.000
300   1     0.970       0.972        0.933       0.937
300   2     0.989       0.996        0.975       0.991
300   4     0.991       1.000        0.979       1.000
600   1     0.986       0.986        0.968       0.969
600   2     0.994       0.998        0.987       0.996
600   4     0.995       1.000        0.990       1.000


1,000 simulations, 2 CP: N(0,1), N(0, σ2), N(0,1)

                  Average Rand             Average Adj. Rand
T     σ²    MultiRank   E-Divisive   MultiRank   E-Divisive
150   2     0.731       0.902        0.471       0.785
150   5     0.764       0.976        0.521       0.948
150   10    0.764       0.989        0.519       0.975
300   2     0.744       0.924        0.490       0.834
300   5     0.759       0.990        0.511       0.978
300   10    0.759       0.995        0.512       0.989
600   2     0.742       0.970        0.488       0.933
600   5     0.753       0.996        0.500       0.990
600   10    0.753       0.998        0.501       0.995


1,000 simulations, 2 CP: N(0,1), tν(0, 1), N(0,1)

                  Average Rand             Average Adj. Rand
T     ν     MultiRank   E-Divisive   MultiRank   E-Divisive
150   16    0.632       0.798        0.327       0.564
150   8     0.651       0.830        0.353       0.631
150   2     0.679       0.846        0.395       0.666
300   16    0.640       0.755        0.341       0.492
300   8     0.639       0.769        0.338       0.522
300   2     0.680       0.809        0.396       0.596
600   16    0.655       0.735        0.365       0.469
600   8     0.653       0.727        0.359       0.458
600   2     0.697       0.813        0.420       0.608


1,000 simulations, 2 CP: N2(0, I ),N2(µ, I ),N2(0, I )

           Average Rand             Average Adj. Rand
T     µ    MultiRank   E-Divisive   MultiRank   E-Divisive
300   1    0.656       0.698        0.363       0.406
      2    0.713       0.732        0.446       0.468
      3    0.743       0.778        0.489       0.549
600   1    0.991       0.994        0.981       0.987
      2    0.995       1.000        0.989       0.999
      3    0.996       1.000        0.990       1.000
900   1    0.994       0.996        0.987       0.991
      2    0.997       1.000        0.993       0.999
      3    0.997       1.000        0.993       1.000


1,000 simulations, 2 CP: N2(0,Σ),N2(0, I ),N2(0,Σ)

Σ = ( 1  ρ )
    ( ρ  1 )

            Average Rand             Average Adj. Rand
T     ρ     MultiRank   E-Divisive   MultiRank   E-Divisive
300   0.5   0.663       0.729        0.373       0.455
      0.7   0.712       0.728        0.444       0.462
      0.9   0.745       0.743        0.491       0.488
600   0.5   0.674       0.676        0.391       0.386
      0.7   0.724       0.672        0.462       0.370
      0.9   0.745       0.834        0.492       0.673
900   0.5   0.692       0.635        0.415       0.322
      0.7   0.724       0.678        0.464       0.398
      0.9   0.747       0.966        0.494       0.928


1,000 simulations, 2 CP: Nd(0,Σ),Nd(0, I ),Nd(0,Σ)

Σ (without noise) =          Σ (with noise) =
( 1  ρ  ρ  ...  ρ )          ( 1  ρ  0  ...  0 )
( ρ  1  ρ  ...  ρ )          ( ρ  1  0  ...  0 )
( ρ  ρ  1  ...  ρ )          ( 0  0  1  ...  0 )
(         ...     )          (         ...     )
( ρ  ρ  ρ  ...  1 )          ( 0  0  0  ...  1 )

           Without Noise                With Noise
T     d    Avg. Rand   Avg. Adj. Rand   Avg. Rand   Avg. Adj. Rand
300   2    0.767       0.522            0.774       0.543
      5    0.912       0.816            0.736       0.463
      9    0.970       0.935            0.736       0.459
600   2    0.817       0.648            0.836       0.816
      5    0.993       0.984            0.631       0.626
      9    0.998       0.995            0.666       0.648
900   2    0.970       0.937            0.968       0.933
      5    0.998       0.996            0.644       0.342
      9    0.999       0.999            0.612       0.284


Applications Genetics

Genetics Data
We applied E-Divisive to the aCGH micro-array dataset of 43 individuals with a bladder tumor (Bleakley and Vert, 2011); shown is the relative hybridization intensity profile for one individual.
MultiRank (Lung-Yut-Fong et al., 2011): k = 17, adjRand = 0.677
KCPA (Arlot et al., 2012): k = 41, adjRand = 0.658
PELT (Killick et al., 2012): k = 47, adjRand = 0.853

[Figure: Relative hybridization intensity (Signal vs. Index, 0–2000) with estimated change points; panels, top to bottom: MultiRank, KCPA, PELT, E-Divisive.]

Applications Finance

Financial Data: Cisco Systems

The E-Divisive procedure was applied to the monthly log returns of the Dow 30.

Marginal analysis of Cisco Systems Inc. from April 1990 to January 2010: the procedure found change points at April 2000 and October 2002.


Applications Finance

Financial Data: S&P 500 Index

[Figure: Log returns of the S&P 500 index, May 20, 1999 – April 25, 2011.]

Agglomerative Algorithm

An Agglomerative Algorithm

Given a partition of k clusters C = {C1, C2, . . . , Ck}; clusters may or may not be single observations

Consider combining a pair of adjacent clusters

The partition that maximizes the goodness-of-fit statistic determines change point locations


An Agglomerative Algorithm: Goodness-of-Fit

Goodness-of-fit statistic S(k): sum of the E-distances between adjacent clusters.

Given clusters C = {C1, C2, . . . , Ck} with ni = #Ci, define

S(k) = Σ_{i=1}^{k−1} [ ni n_{i+1} / (ni + n_{i+1}) ] E^α_{ni, n_{i+1}}(Ci, C_{i+1}),


An Agglomerative Algorithm

The partition that maximizes S(k) is then used to estimate change point locations.

Figure: Progression of the goodness-of-fit statistic, and where it is maximized.
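The statistic S(k) and the adjacent-merge search can be sketched together. This is a minimal illustration assuming the empirical E-distance of Matteson and James (2013) with Euclidean norms; the greedy search and all function names are ours, not the ecp implementation:

```python
import numpy as np

def e_distance(x, y, alpha=1.0):
    """Empirical divergence E^alpha_{n,m}(X, Y): twice the mean between-sample
    distance^alpha minus each sample's mean within-sample distance^alpha."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)

    def pow_dists(a, b):
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2) ** alpha

    def within(a):
        if len(a) < 2:
            return 0.0
        return pow_dists(a, a)[np.triu_indices(len(a), k=1)].mean()

    return 2.0 * pow_dists(x, y).mean() - within(x) - within(y)

def goodness_of_fit(clusters, alpha=1.0):
    """S(k): E-distances of temporally adjacent clusters, each weighted by
    n_i * n_{i+1} / (n_i + n_{i+1})."""
    return sum(
        (len(a) * len(b) / (len(a) + len(b))) * e_distance(a, b, alpha)
        for a, b in zip(clusters, clusters[1:])
    )

def agglomerate(clusters, alpha=1.0):
    """Greedily merge adjacent clusters; return the partition maximizing S(k)."""
    clusters = [np.asarray(c, dtype=float).reshape(len(c), -1) for c in clusters]
    best_fit, best = goodness_of_fit(clusters, alpha), list(clusters)
    while len(clusters) > 2:
        # evaluate every adjacent merge, keep the one with the largest S(k)
        merges = [
            clusters[:i] + [np.vstack(clusters[i:i + 2])] + clusters[i + 2:]
            for i in range(len(clusters) - 1)
        ]
        fits = [goodness_of_fit(m, alpha) for m in merges]
        i = int(np.argmax(fits))
        clusters = merges[i]
        if fits[i] > best_fit:
            best_fit, best = fits[i], list(clusters)
    return best  # estimated change points are the boundaries between clusters
```

For example, agglomerate([[0.0, 0.1], [0.2, 0.1], [5.0, 5.1], [5.2, 4.9]]) merges the four initial clusters into two, placing the single boundary at the large mean shift.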


Agglomerative Algorithm Application: EMS

EMS Priority One Response for Toronto 2007


Bibliography

http://www.stat.cornell.edu/~matteson/

Bleakley, K., and Vert, J.-P. (2011), "The Group Fused Lasso for Multiple Change-Point Detection," Technical Report HAL-00602121, Bioinformatics Center (CBIO).

Hoeffding, W. (1961), "The Strong Law of Large Numbers for U-Statistics," Technical Report 302, North Carolina State University, Dept. of Statistics.

Hubert, L., and Arabie, P. (1985), "Comparing Partitions," Journal of Classification, 2(1), 193–218.

James, N. A., and Matteson, D. S. (2013), "ecp: An R Package for Nonparametric Multiple Change Point Analysis of Multivariate Data," arXiv:1309.3295.

Lung-Yut-Fong, A., Lévy-Leduc, C., and Cappé, O. (2011), "Homogeneity and Change-Point Detection Tests for Multivariate Data Using Rank Statistics."

Matteson, D. S., and James, N. A. (2013), "A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data," Journal of the American Statistical Association, to appear.

Rizzo, M. L., and Székely, G. J. (2010), "DISCO Analysis: A Nonparametric Extension of Analysis of Variance," The Annals of Applied Statistics, 4(2), 1034–1055.

Székely, G. J., and Rizzo, M. L. (2005), "Hierarchical Clustering via Joint Between-Within Distances: Extending Ward's Minimum Variance Method," Journal of Classification, 22(2), 151–183.
