Detecting Change in Multivariate Data Streams Using Minimum Subgraphs
Robert Koyak
Operations Research Dept.
Naval Postgraduate School
Collaborative work with Dave Ruth, Emily Craparo, and Kevin Wood
Basic Setup
Have observations $Y_1, Y_2, \dots, Y_N$ assumed to be sampled independently from unknown, multivariate distributions, where $F_j$ is the distribution of observation $Y_j$.
Homogeneity Hypothesis
$H_0: F_1 = F_2 = \cdots = F_N$
Heterogeneity Hypothesis
$H_1:$ There exists some $k \in \{2, \dots, N\}$ such that $F_1 = F_2 = \cdots = F_{k-1}$, $F_{k-1} \ne F_k$, and
$\Delta(F_{k-1}, F_j) = \max_{k \le r \le j} \Delta(F_{k-1}, F_r)$
is strictly positive and nondecreasing for $j \in \{k+1, \dots, N\}$, where $\Delta$ measures the discrepancy between two distributions.
Heterogeneity includes:
• A single change in distribution at a known change point (“two-sample problem”)
• A single change in distribution at an unknown change point
• Directional drift (in mean or other features) that begins at an unknown point in the observation sequence
Distance Matrix
$D = (d_{ij})$ is the $N \times N$ distance matrix (Euclidean, Manhattan, etc.), with $d_{ij} = d(y_i, y_j) = \lVert y_i - y_j \rVert$
Maa, Pearl, and Bartoszynski (1996):
$Y_1, Y_2, Y_3$ independent, $\sim F$
$Z_1, Z_2, Z_3$ independent, $\sim G$
$F = G$ if and only if
$$d(Y_1, Y_2) \overset{\mathcal{L}}{=} d(Z_1, Z_2) \overset{\mathcal{L}}{=} d(Y_3, Z_3)$$
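As a minimal sketch of how the distance matrix might be built in practice, the following Python computes $D$ for a list of observations under either metric named above. The function name `distance_matrix`, the `metric` argument, and the three sample points are illustrative assumptions, not part of the original slides.

```python
import math

def distance_matrix(points, metric="euclidean"):
    """Return the N x N matrix D = (d_ij) for a list of equal-length tuples."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            diffs = [abs(a - b) for a, b in zip(points[i], points[j])]
            if metric == "euclidean":
                d = math.sqrt(sum(x * x for x in diffs))
            elif metric == "manhattan":
                d = float(sum(diffs))
            else:
                raise ValueError(f"unknown metric: {metric}")
            dist[i][j] = dist[j][i] = d   # D is symmetric with a zero diagonal
    return dist

# Hypothetical bivariate observations, listed in time order.
obs = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = distance_matrix(obs)
D_manhattan = distance_matrix(obs, metric="manhattan")
```

Every test discussed later uses the observations only through such a matrix, so the choice of metric is the one place where the raw data enter.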
The distance matrix has the information needed to express departure from the homogeneity hypothesis. For the types of departure we want to detect, this information should be expressed in particular ways. How can we unlock it?
The strategy we will explore is to fit a minimum subgraph (of some type) to the data treated as vertices in a complete, undirected graph. From the subgraph a statistic is derived that is sensitive to the departures from homogeneity that we wish to detect.
A Graph-Theoretic Approach
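Complete undirected graph: $G_N = (V_N, E_N)$, $|V_N| = N$, $|E_N| = N(N-1)/2$
Subgraph family $\mathcal{G}$ (e.g. spanning trees, k-factors, Hamiltonian paths or circuits), with members $G = (V_N, E)$, $E \subseteq E_N$
Minimum subgraph $\hat{G} = (V_N, \hat{E})$ is defined by
$$\hat{G} = \arg\min_{G \in \mathcal{G}} \sum_{(i,j) \in E} d_{ij}$$
The test statistic is a function $\hat{T}(\hat{G})$ of the minimum subgraph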
Minimum Spanning Trees (MSTs)
• Friedman and Rafsky (1979) used MSTs to define a multivariate extension of the runs test in the context of the two-sample problem
• The test statistic is the number of edges in the MST that join vertices belonging to different samples
• Small values of the statistic are evidence against homogeneity
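The MST cross-count described above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the authors' code: the MST is found with a dense-matrix version of Prim's algorithm, the p-value comes from permuting sample labels over the fixed tree, and the data, function names, and the choice of 2000 permutations are all hypothetical.

```python
import math
import random

def mst_edges(dist):
    """Prim's algorithm on a dense distance matrix; returns N-1 edges (i, j)."""
    n = len(dist)
    in_tree = [False] * n
    best = [math.inf] * n        # cheapest known connection of each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges

def cross_count(edges, labels):
    """Number of edges joining vertices that carry different sample labels."""
    return sum(labels[i] != labels[j] for i, j in edges)

def permutation_p_value(edges, labels, n_perm=2000, seed=0):
    """Lower-tail p-value: small cross counts are evidence against homogeneity."""
    rng = random.Random(seed)
    observed = cross_count(edges, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        hits += cross_count(edges, shuffled) <= observed
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical univariate data: two well-separated clusters.
xs = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
dist = [[abs(a - b) for b in xs] for a in xs]
labels = [1, 1, 1, 1, 2, 2, 2, 2]
edges = mst_edges(dist)
stat, p_value = permutation_p_value(edges, labels)
```

In this toy example the tree is a chain and only one edge crosses between samples, so the statistic is as small as it can be and the permutation p-value is small.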
[Figure: breast cancer mortality rates, Schuylkill (vertical axis) versus Philadelphia (horizontal axis), one point per year labeled 69 to 88, with the MST overlaid]
MST for breast cancer mortality rates, 1969 to 1988 (N = 20), relative to 1968 base. Next, treat Sample 1 as the years 1969–1978 and Sample 2 as the years 1979–1988
[Figure: the MST on the Philadelphia versus Schuylkill plot, years 69 to 88]
There are $\hat{T}_{\mathrm{MST}} = 11$ edges that join vertices in different samples. The p-value, obtained by a permutation test, is about 0.41.
Is anything really happening?
Spearman rank correlations vs. time, p-values: Philadelphia .0004 Schuylkill .01
Minimum Non-bipartite Matching (MNBM)
• Also known as unipartite matching, 1-factor
• Rosenbaum (2005) defined a “cross-match” test using MNBM analogous to that of Friedman and Rafsky
• The test statistic is the number of edges in the MNBM that join vertices belonging to different samples
• Small values of the statistic are evidence against homogeneity
Cross-match test (Rosenbaum)
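$n = N/2$ (number of matching edges)
Group 1 has $k$ observations
Group 2 has $N - k$ observations
$M_C$ = number of cross-matches
$M_1$ = number of matches within Group 1, $M_1 = (k - M_C)/2$
$$P(M_C = r) = \frac{2^r \, n!}{\binom{N}{k} \, r! \, \left(\frac{k-r}{2}\right)! \, \left(n - \frac{k+r}{2}\right)!}$$
for $r \ge 0$ with $k - r$ even and nonnegative and $(k + r)/2 \le n$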
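The exact null distribution can be evaluated directly. The sketch below follows Rosenbaum's (2005) formula for the probability of $r$ cross-matches and uses exact rational arithmetic. The function names are illustrative. The inputs $N = 20$, $k = 10$ and an observed count of 6 are the values used for the breast cancer mortality example in these slides.

```python
from fractions import Fraction
from math import comb, factorial

def crossmatch_pmf(N, k):
    """Exact null pmf of the number of cross-matches M_C.

    N = total sample size (even), k = size of Group 1, n = N/2 matched pairs.
    """
    if N % 2:
        raise ValueError("N must be even")
    n = N // 2
    pmf = {}
    for r in range(k % 2, min(k, N - k) + 1, 2):   # k - r must be even
        within1 = (k - r) // 2          # pairs entirely inside Group 1
        within2 = n - (k + r) // 2      # pairs entirely inside Group 2
        ways = (2**r * factorial(n)
                // (factorial(r) * factorial(within1) * factorial(within2)))
        pmf[r] = Fraction(ways, comb(N, k))
    return pmf

def lower_tail_p_value(N, k, observed):
    """P(M_C <= observed) under homogeneity; small counts are the evidence."""
    return sum(p for r, p in crossmatch_pmf(N, k).items() if r <= observed)

pmf = crossmatch_pmf(20, 10)
p_value = lower_tail_p_value(20, 10, 6)
print(round(float(p_value), 2))   # 0.87
```

No simulation is needed here: the distribution depends only on $N$ and $k$, which is what makes the cross-match test exact.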
[Figure: Philadelphia versus Schuylkill plot, years 69 to 88, with the MNBM overlaid]
MNBM fit to the breast cancer mortality data. Count the number of edges that join vertices in different groups
[Figure: the MNBM on the Philadelphia versus Schuylkill plot, years 69 to 88]
There are $\hat{T}_{\mathrm{CM}} = 6$ edges that join vertices in different samples. The p-value, obtained from the exact null distribution, is about 0.87.
Extensions of the Cross-Match Test
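Ruth (2009) and Ruth & Koyak (2011) introduce two extensions of the cross-match test to detect departures from homogeneity in the direction of $H_1$:
(1) An exact, simultaneous cross-match (SCM) test for an unspecified change-point. The statistic $\hat{T}_{\mathrm{SCM}}$ is a minimum, taken over candidate change-points $k$, of a function of the cross-match statistics $\hat{T}_{\mathrm{CM}}(k)$.
(2) A sum of (vertex) pair maxima (SPM) test:
$$\hat{T}_{\mathrm{SPM}} = \sum_{(i,j) \in \hat{E}} \max(i, j) = \frac{1}{2} \sum_{(i,j) \in \hat{E}} |i - j| + \frac{1}{4} N(N+1)$$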
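To make the sum of pair maxima concrete, here is a small sketch that fits an exact MNBM and sums the larger time index of each matched pair. It assumes the statistic is that sum, taken over the matching with observations indexed 1 to $N$ in time order. The matching is found by dynamic programming over vertex subsets, a brute-force stand-in for a blossom algorithm that is only practical for small even $N$. All names and data are hypothetical.

```python
from functools import lru_cache

def min_weight_perfect_matching(dist):
    """Exact MNBM by dynamic programming over vertex subsets (small even N only)."""
    n = len(dist)
    if n % 2:
        raise ValueError("need an even number of vertices")

    @lru_cache(maxsize=None)
    def solve(mask):
        # mask has a 1 bit for every vertex that is still unmatched
        if mask == 0:
            return 0.0, ()
        i = (mask & -mask).bit_length() - 1       # lowest unmatched vertex
        rest = mask & ~(1 << i)
        best = (float("inf"), ())
        for j in range(i + 1, n):
            if (rest >> j) & 1:
                cost, pairs = solve(rest & ~(1 << j))
                if cost + dist[i][j] < best[0]:
                    best = (cost + dist[i][j], ((i, j),) + pairs)
        return best

    return solve((1 << n) - 1)

def spm_statistic(pairs):
    """Sum of pair maxima, converting 0-based vertices to 1-based time indices."""
    return sum(max(i, j) + 1 for i, j in pairs)

# Hypothetical series with a jump after the fourth observation.
xs = [0, 1, 2, 3, 50, 51, 52, 53]
dist = [[abs(a - b) for b in xs] for a in xs]
cost, pairs = min_weight_perfect_matching(dist)
N = len(xs)
spm = spm_statistic(pairs)
index_gaps = sum(abs(i - j) for i, j in pairs)
```

Here every observation is matched to a neighbor in time, so the statistic (20) falls below its null mean $N(N+1)/3 = 24$. The equivalent form $\frac{1}{2}\sum |i - j| + \frac{1}{4}N(N+1)$ gives the same value.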
[Figure: Philadelphia versus Schuylkill plot, years 69 to 88]
SCM test has exact p-value of 0.59 for testing against an unspecified change-point.
SPM test has approximate p-value of 0.41.
Some Theory
• Friedman & Rafsky's $\hat{T}_{\mathrm{MST}}$
– Asymptotic normality under H0
– Universal consistency under H1 for the two-sample problem (Henze & Penrose, 1999)
• Rosenbaum's $\hat{T}_{\mathrm{CM}}$
– Asymptotic normality under H0
– Consistency under restrictive assumptions
• Ruth's SPM test $\hat{T}_{\mathrm{SPM}}$
– Asymptotic normality under H0
– Consistency remains to be proven
Ensemble Tests
Problem with graph-theoretic tests: a single minimum subgraph contains very limited information about $D$ and as such these tests are not very powerful
Tukey suggested fitting multiple "orthogonal" MSTs in Friedman & Rafsky's test and combining them (in a manner that was not specified)
Two subgraphs are orthogonal if they share no common edges
For MSTs this is problematic: existence of a fixed number of orthogonal MSTs (even two) is not assured!
For MNBMs we are assured at least $N/2$ orthogonal subgraphs (Anderson, 1971) constructed sequentially
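The sequential construction can be sketched as follows: fit an MNBM, forbid its edges, refit, and repeat. This is an illustration under assumptions rather than the authors' procedure. The matching solver is an exact subset dynamic program suitable only for small even $N$, and the names `orthogonal_matchings` and `forbidden` and the data are hypothetical.

```python
from functools import lru_cache

def min_weight_perfect_matching(dist, forbidden=frozenset()):
    """Exact minimum-weight perfect matching that avoids the forbidden edges."""
    n = len(dist)

    @lru_cache(maxsize=None)
    def solve(mask):
        if mask == 0:
            return 0.0, ()
        i = (mask & -mask).bit_length() - 1       # lowest unmatched vertex
        rest = mask & ~(1 << i)
        best = (float("inf"), ())
        for j in range(i + 1, n):
            if (rest >> j) & 1 and (i, j) not in forbidden:
                cost, pairs = solve(rest & ~(1 << j))
                if cost + dist[i][j] < best[0]:
                    best = (cost + dist[i][j], ((i, j),) + pairs)
        return best

    return solve((1 << n) - 1)

def orthogonal_matchings(dist, count):
    """Fit up to `count` matchings sequentially, each sharing no edge with earlier ones."""
    used = set()
    ensemble = []
    for _ in range(count):
        cost, pairs = min_weight_perfect_matching(dist, frozenset(used))
        if cost == float("inf"):
            break                 # no perfect matching remains on the unused edges
        ensemble.append(pairs)
        used.update(pairs)
    return ensemble

# Hypothetical univariate observations in time order.
xs = [0.0, 1.1, 2.3, 3.6, 5.0, 6.5, 8.1, 9.8]
dist = [[abs(a - b) for b in xs] for a in xs]
ensemble = orthogonal_matchings(dist, len(xs) // 2)
all_edges = [edge for matching in ensemble for edge in matching]
```

With $N = 8$ the loop returns $N/2 = 4$ edge-disjoint matchings, as the bound above guarantees. Each later matching is optimal only among the edges left unused by the earlier ones.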
[Figure: First MNBM Fit to the Breast Cancer Mortality Data]
[Figure: First Two MNBMs Fit to the Breast Cancer Mortality Data]
[Figure: First Three MNBMs Fit to the Breast Cancer Mortality Data]
Structure of Ensembles
• Ensemble pairs decompose into Hamiltonian cycles each having an even number of vertices
– Under H0 all 1-factors are equally likely but it is not true that all ensemble 2-factors are equally likely!
– However, conditional on the cyclic structure uniformity is true
– Second-order properties do not depend on the cyclic structure
• Ensemble 3-factors have more complex cyclic behavior and also exhibit triangles
– Prevalence of triangles depends on the dimensionality of the data: lower dimension = more triangles
Ensemble Tests
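Ruth (2009) proposed an Ensemble Sum of Pair Maxima (ESPM) test based on fitting a sequence of $n = N/2$ orthogonal MNBMs and taking the cumulative sums of the SPM statistics. The test takes the following form:
$$\hat{T}_{\mathrm{ESPM}} = \max_{k \in \{1, \dots, n\}} c_N^{-1} \sum_{j=1}^{k} \left( \mu_N - \hat{T}_{\mathrm{SPM}}(j) \right)$$
$$c_N^2 = N(N+1)(N-1)^2 / 180, \qquad \mu_N = N(N+1)/3$$
where $\hat{T}_{\mathrm{SPM}}(j)$ is the SPM statistic of the $j$th matching.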
Ensemble Tests
(1) Under $H_0$ the process
$$B_N(t_k) = c_N^{-1} \sum_{j=1}^{k} \left( \mu_N - \hat{T}_{\mathrm{SPM}}(j) \right), \qquad t_k = k / (N-1),$$
has the same first two moments as a Brownian bridge
(2) Although the summands individually are asymptotically normal, the same is not true of the process itself!
(3) Unless the dimensionality of the observations is very large, classical Brownian bridge theory (Shorack & Wellner, 1987) produces critical values that violate the nominal level
(4) Ruth (2009) produced critical values for different values of $N$ and dimensionality $d$ using extensive simulations
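The moment claim in (1) can be checked for a single matching by brute force. The sketch below enumerates every perfect matching of $\{1, \dots, N\}$ for small $N$ and compares the exact mean and variance of the sum of pair maxima with $N(N+1)/3$ and $N(N+1)(N-2)/180$. The variance target is an assumption about the normalization: it is what a Brownian bridge variance $t(1-t)$ at $t = 1/(N-1)$ gives when scaled by $c_N^2 = N(N+1)(N-1)^2/180$. Function names are illustrative.

```python
from fractions import Fraction

def perfect_matchings(vertices):
    """Yield every perfect matching of an even-length list of vertices."""
    if not vertices:
        yield []
        return
    first, rest = vertices[0], vertices[1:]
    for idx, partner in enumerate(rest):
        remaining = rest[:idx] + rest[idx + 1:]
        for tail in perfect_matchings(remaining):
            yield [(first, partner)] + tail

def spm_null_moments(N):
    """Exact mean and variance of the SPM statistic when all 1-factors are equally likely."""
    values = [sum(max(i, j) for i, j in matching)
              for matching in perfect_matchings(list(range(1, N + 1)))]
    mean = Fraction(sum(values), len(values))
    var = Fraction(sum(v * v for v in values), len(values)) - mean**2
    return mean, var

results = {}
for N in (4, 6, 8):
    mean, var = spm_null_moments(N)
    mu_N = Fraction(N * (N + 1), 3)
    bridge_var = Fraction(N * (N + 1) * (N - 2), 180)   # c_N^2 * t(1 - t) at k = 1
    results[N] = (mean == mu_N, var == bridge_var)
```

The check covers only the first matching. The cumulative sums over several orthogonal matchings are not uniformly distributed in the same way, which is consistent with the slide's point that critical values had to be simulated.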
Simulated critical values for N = 200
100 Simulated $B_N(t_k)$, Bivar. Normal, Homogeneous
Critical (.05) = 1.19
100 Simulated $B_N(t_k)$, Bivar. Normal, Mean Jump
Critical (.05) = 1.19
[Figure: normalized process $B_N(t_k)$ versus number of orthogonal matchings ($k$, from 2 to 10), with the .05 and .01 critical values marked]
$\hat{T}_{\mathrm{ESPM}} = 2.24$ has p-value less than .01
Heterogeneity is signaled when six or more matchings are used
Power simulations, N = 200, jump at observation 101, $\delta$ = norm of mean vector after the jump, nominal .05-level tests

(a) Multivariate normal, mean, $p = 5$
              Jump                        Drift
        SCM   SPM   ESPM  JJS       SCM   SPM   ESPM  JJS
0       .05   .06   .04   .05       .05   .04   .06   .07
.5      .09   .10   .60   .52       .05   .07   .27   .22
1.0     .33   .41   1.00  1.00      .16   .20   .84   .85

(b) Multivariate normal, mean, $p = 20$
              Jump                        Drift
        SCM   SPM   ESPM  JJS       SCM   SPM   ESPM  JJS
0       .05   .05   .05   .03       .05   .05   .05   .04
.5      .07   .09   .33   .20       .05   .07   .13   .09
1.0     .16   .22   .95   .95       .09   .11   .56   .49

(c) Multivariate normal, covariance matrix, $p = 5$
              Jump                        Drift
        SCM   SPM   ESPM  JJS       SCM   SPM   ESPM  JJS
0       .05   .06   .05   .04       .05   .05   .05   .05
.5      .42   .51   .97   .15       .20   .27   .52   .27
1.0     .99   .99   1.00  .24       .77   .79   1.00  .54

(d) Multivariate normal mixture, mean, $p = 5$
              Jump                        Drift
        SCM   SPM   ESPM  JJS       SCM   SPM   ESPM  JJS
0       .05   .05   .04   .27       .04   .04   .06   .28
.5      .08   .09   .56   .38       .07   .07   .21   .33
1.0     .25   .36   .99   .85       .12   .15   .76   .55
1+ mult.
norm
Graph-theoretic Tests: Some Challenges and Possible Directions
1. Computational
2. Theoretical
3. Alternate graph-theoretic approaches
4. Adaptation to real-world problems
Computational Challenges
Finding a MNBM requires $O(Nm \log(N))$ computation time using the Blossom V algorithm (Kolmogorov, 2009). For the complete graph, $m = O(N^2)$. For ensemble tests the order of computation is about $O(N^4 \log(N))$, which is prohibitive with large sample sizes (e.g. $N \ge 1000$).
Possible strategies:
(1) Use a greedy algorithm
(2) Restrict the edge set ($m = O(N)$)
(3) Try something else
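Strategy (1) might look like the following sketch, which repeatedly takes the shortest remaining edge between two unmatched vertices. It is one common greedy heuristic, not necessarily the one the authors have in mind, and the function name and four-point example are hypothetical. On a complete graph with even $N$ it always returns a perfect matching, at a cost dominated by sorting the $N(N-1)/2$ edges.

```python
def greedy_matching(dist):
    """Greedy heuristic: repeatedly take the shortest edge between unmatched vertices."""
    n = len(dist)
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    matched = [False] * n
    pairs = []
    for _, i, j in edges:
        if not matched[i] and not matched[j]:
            matched[i] = matched[j] = True
            pairs.append((i, j))
    return pairs

def matching_cost(dist, pairs):
    return sum(dist[i][j] for i, j in pairs)

# Hypothetical example where the greedy choice is not optimal.
xs = [0, 3, 4, 7]
dist = [[abs(a - b) for b in xs] for a in xs]
greedy_pairs = greedy_matching(dist)
greedy_cost = matching_cost(dist, greedy_pairs)          # pairs (1,2) and (0,3)
optimal_cost = matching_cost(dist, [(0, 1), (2, 3)])     # the true minimum here
```

The example shows the trade-off: taking the closest pair first forces a long edge later, so the greedy matching costs 8 against an optimum of 6.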
Faster Matchings?
Simple greedy heuristics are difficult to extend to multiple matchings
Edge restriction heuristics. Sufficient conditions for a perfect matching to exist ($N$ even) include
-- A regular graph of degree $\ge N/2$
-- A connected, claw-free graph
-- A Delaunay triangulation
Necessary and sufficient conditions: Tutte's Theorem
$$\mathrm{odd}(V - S) \le |S| \quad \text{for all } S \subseteq V,$$
where $\mathrm{odd}(\cdot)$ counts the connected components with an odd number of vertices
Are MNBM tests universally consistent?
Asymptotic theory for MNBM is not straightforward even for a single matching, let alone ensembles.
Aldous & Steele (1992) theory for MSTs exploits perturbation localizability of MSTs (not applicable to matchings).
Interesting recent work: "Poisson Matching" (Holroyd et al. 2008)
MNBM is a solution to the integer linear program
Minimize: $f(x) = \sum_{i<j} d_{ij} x_{ij}$
Subject to: $x_{ij} \in \{0, 1\}$, $\sum_{i=1}^{N} x_{ij} = 1$, $j = 1, \dots, N$ (with the convention $x_{ij} = x_{ji}$, $x_{ii} = 0$)
By replacing the integrality constraints with the interval constraints $0 \le x_{ij} \le 1$ a solution can be obtained using LP. A "relaxed" SPM statistic can be defined by
$$\hat{T}_{\mathrm{RSPM}} = \sum_{i<j} \max(i, j)\, \hat{x}_{ij} = \frac{1}{2} \sum_{i<j} |i - j|\, \hat{x}_{ij} + \frac{1}{4} N(N+1)$$
Solutions to RNBM satisfy $\hat{x}_{ij} \in \{0, \tfrac{1}{2}, 1\}$
To fit ensembles, enforce constraints on $\hat{x}_{ij}$ over a sequence of problems indexed by $k = 1, \dots, n$. There is no assurance that solutions will be "nested", however, which complicates theory
Performance of relaxed MNBM statistics compares favorably with that of regular MNBM
What about nearest neighbors?
Possible Applications
• Process control (off-line, on-line)
• Mechanical prognostics
• Threat detection
• Syndromic surveillance
In high-dimensional problems, it may be useful to couple graph-theoretic methods with methods to reduce dimensionality
Dimension reduction
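Consider the optimization problem
$$\min_{w} \sum_{(i,j) \in E} |i - j|\, x_{ij}(w)$$
$$\text{s.t. } x(w) = \arg\min_{x \in X} \sum_{(i,j) \in E} d_{ij}(w)\, x_{ij}, \qquad \sum_{r} w_r = p', \qquad w \in \{0, 1\}^p$$
Vector $w$ projects into a low ($p'$)-dimensional space to minimize the sum of pair index differences in the resulting minimum-weight matching. Here $d_{ij}(w)$ denotes the distance between $y_i$ and $y_j$ after projection by $w$.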
• Simplification 1: use Manhattan distance: $d_{ij} = \sum_{r} d_{ijr} w_r$, where $d_{ijr} = |y_{ir} - y_{jr}|$
• Simplification 2: use relaxed matching instead of exact matching; enforce minimum-weight matching using strong duality.
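Simplification 1 makes every pairwise distance linear in the selection vector $w$, which a short sketch can illustrate. The code precomputes the coordinate-wise terms $|y_{ir} - y_{jr}|$ once and then forms the weighted Manhattan distance for any binary $w$. The function names and data are hypothetical.

```python
def componentwise_distances(points):
    """Map each pair (i, j), i < j, to the list of |y_ir - y_jr| over coordinates r."""
    n, p = len(points), len(points[0])
    return {(i, j): [abs(points[i][r] - points[j][r]) for r in range(p)]
            for i in range(n) for j in range(i + 1, n)}

def projected_distances(d_ijr, w):
    """Weighted Manhattan distance d_ij = sum_r d_ijr * w_r for a 0/1 vector w."""
    return {edge: sum(d * w_r for d, w_r in zip(parts, w))
            for edge, parts in d_ijr.items()}

# Hypothetical observations with p = 3 coordinates; w keeps p' = 2 of them.
points = [(0, 0, 5), (1, 2, 5), (3, 1, 9)]
d_ijr = componentwise_distances(points)
w = (1, 1, 0)
d_ij = projected_distances(d_ijr, w)
```

Because the coordinate terms do not depend on $w$, changing the selected coordinates only re-weights fixed numbers, and this is what allows the distance to appear linearly in the optimization.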
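$$\min_{w, x, \pi} \sum_{(i,j) \in E} |i - j|\, x_{ij}$$
s.t.
$$Ax = \mathbf{1}$$
$$\sum_{(i,j) \in E} \sum_{r} d_{ijr} w_r x_{ij} = \sum_{v \in V} \pi_v$$
$$\pi_i + \pi_j \le \sum_{r} d_{ijr} w_r \quad \forall (i,j) \in E$$
$$\sum_{r} w_r = p'$$
$$w \in \{0, 1\}^p, \quad x \ge 0$$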