A central limit theorem for an omnibus embedding of random dot product graphs

Keith Levin (1)

with Avanti Athreya (2), Minh Tang (2), Vince Lyzinski (3) and Carey E. Priebe (2)

(1) University of Michigan, (2) Johns Hopkins University, (3) University of Massachusetts Amherst

November 18, 2017

Classical two-sample hypothesis testing

Well-studied in statistics (indeed, the only thing we teach undergrads?)

K. Levin (U.Michigan,JHU,UMass) A CLT for omnibus embeddings November 18, 2017 2 / 20

Graph Hypothesis Testing

Q: how to tell if two (or more) graphs are from the same distribution?

Random Dot Product Graph (RDPG; Young and Scheinerman, 2007)

Extends stochastic block model (SBM)

Vertices assigned latent positions drawn i.i.d. from d-dimensional distribution F

F constrained so that $0 \le x^T y \le 1$ whenever $x, y \in \operatorname{supp} F$

Denote i-th latent position by $X_i \in \mathbb{R}^d$

Edges {i, j} present or absent independently with probability $X_i^T X_j$.

Collect latent positions in rows of $X \in \mathbb{R}^{n \times d}$.

Warning: Non-identifiability. Model specified only up to orthogonal rotation of latent positions.
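A minimal NumPy sketch of sampling from the model described above. The helper name `sample_rdpg` and the two-point choice of F (point masses at (0.8, 0.2) and (0.2, 0.8)) are assumptions made for illustration, not something taken from the talk. The last lines check the non-identifiability numerically: rotating every latent position leaves the edge probabilities unchanged.

```python
import numpy as np

def sample_rdpg(X, rng):
    """Sample a symmetric, hollow adjacency matrix with P[A_ij = 1] = X_i^T X_j."""
    P = X @ X.T
    if P.min() < 0 or P.max() > 1:
        raise ValueError("latent positions must satisfy 0 <= x^T y <= 1")
    n = X.shape[0]
    # Draw independent edges for i < j, then mirror to the lower triangle.
    upper = np.triu(rng.random((n, n)) < P, k=1)
    return (upper | upper.T).astype(float)

rng = np.random.default_rng(0)

# Hypothetical F: equal-weight point masses, so inner products are 0.68 or 0.32.
centers = np.array([[0.8, 0.2], [0.2, 0.8]])
X = centers[rng.integers(0, 2, size=200)]
A = sample_rdpg(X, rng)

# Non-identifiability: an orthogonal rotation W of all positions gives the same X X^T.
theta = 0.3
W = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
same_P = np.allclose(X @ X.T, (X @ W) @ (X @ W).T)
```

With a two-point F the model reduces to a two-block SBM, which is the sense in which the RDPG extends the SBM.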

Estimating latent positions: adjacency spectral embedding (Sussman et al, 2012)

Definition (Adjacency Spectral Embedding (ASE))

Given adjacency matrix $A$ with eigendecomposition $A = U S U^T$, embed the vertices of $A$ into $\mathbb{R}^d$ as the rows of

$$\hat{X} = U_d S_d^{1/2} \in \mathbb{R}^{n \times d},$$

where $U_d$ denotes the first $d$ columns of $U$ and $S_d$ denotes the truncation of $S$ to the top $d$ eigenvalues.

Under RDPG, $\exists W$ : $\max_{1 \le i \le n} \|\hat{X}_i - W X_i\| = O_P(n^{-1/2} \log n)$.

Lyzinski, et al (2014): ASE yields a.a.s. perfect recovery of block memberships in SBM
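The ASE definition above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function name `ase`, the two-point latent distribution, and the choice n = 400 are assumptions. The orthogonal alignment at the end is only there to measure the estimation error against the known positions.

```python
import numpy as np

def ase(A, d):
    """Adjacency spectral embedding: rows of U_d S_d^{1/2}."""
    evals, evecs = np.linalg.eigh(A)           # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:d]          # indices of the d largest eigenvalues
    S_d = np.clip(evals[top], 0.0, None)       # guard against negatives before the square root
    return evecs[:, top] * np.sqrt(S_d)        # scales column k by sqrt of eigenvalue k

rng = np.random.default_rng(1)
n, d = 400, 2

# Hypothetical latent positions: two-point F, i.e. a two-block SBM.
centers = np.array([[0.8, 0.2], [0.2, 0.8]])
X = centers[rng.integers(0, 2, size=n)]
upper = np.triu(rng.random((n, n)) < X @ X.T, k=1)
A = (upper | upper.T).astype(float)

X_hat = ase(A, d)

# Best orthogonal W minimizing ||X_hat W - X||_F (orthogonal Procrustes via SVD).
U, _, Vt = np.linalg.svd(X_hat.T @ X)
W = U @ Vt
row_err = np.linalg.norm(X_hat @ W - X, axis=1)
max_row_err = row_err.max()
rms_row_err = np.sqrt(np.mean(row_err ** 2))
```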

RDPG: what do we mean by same distribution?

Option 1: Test if latent positions are drawn from same distribution.

G1 positions drawn i.i.d. F1, G2 positions drawn i.i.d. F2

Test if F1 = F2

“Nonparametric” testing

Tang, Athreya, Sussman, Lyzinski and Priebe (2017): Estimate latent positions of G1 and G2 via ASE, apply maximum mean discrepancy (Gretton et al, 2012) to ASE estimates.
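To make the maximum mean discrepancy step concrete, here is a rough sketch of a biased (V-statistic) estimate of squared MMD with a Gaussian kernel. The kernel, the bandwidth of 0.5, and the function names are assumptions for illustration. The point clouds are synthetic stand-ins for ASE estimates, and the sketch ignores the orthogonal non-identifiability that an actual test on embeddings has to handle.

```python
import numpy as np

def gaussian_gram(X, Y, bandwidth):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 bandwidth^2))."""
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Y ** 2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth ** 2))

def mmd2_biased(X, Y, bandwidth=0.5):
    """Biased estimate of squared MMD between the samples X and Y."""
    return (gaussian_gram(X, X, bandwidth).mean()
            + gaussian_gram(Y, Y, bandwidth).mean()
            - 2.0 * gaussian_gram(X, Y, bandwidth).mean())

rng = np.random.default_rng(2)

# Synthetic stand-ins for estimated latent positions (hypothetical distributions).
same_a = rng.normal([0.5, 0.5], 0.1, size=(300, 2))
same_b = rng.normal([0.5, 0.5], 0.1, size=(300, 2))
shifted = rng.normal([0.8, 0.2], 0.1, size=(300, 2))

mmd_null = mmd2_biased(same_a, same_b)   # samples from the same distribution
mmd_alt = mmd2_biased(same_a, shifted)   # samples from different distributions
```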

Option 2: Test if latent positions are the same

G1 latent positions $X \in \mathbb{R}^{n \times d}$, G2 latent positions $Y \in \mathbb{R}^{n \times d}$

Test if X = YW for some unitary W .

“Semiparametric” testing

Tang, Athreya, Sussman, Lyzinski and Priebe (2015): Embed both graphs via ASE, align estimated positions via Procrustes analysis (Gower, 1975). Reject H0 if alignment is poor, i.e., if $T_{\mathrm{Proc}} = \min_{W \in \mathcal{U}_d} \|\hat{X} - \hat{Y} W\|_F$ is large.
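A hedged sketch of the Procrustes-based statistic described above: embed each graph separately, solve the orthogonal Procrustes problem by SVD, and report the residual Frobenius norm. The helper names, the two-point latent distribution, and the alternative in which 50 of 300 vertices switch position are all illustrative assumptions. Choosing a critical value is not covered here.

```python
import numpy as np

def ase(A, d):
    """Adjacency spectral embedding: rows of U_d S_d^{1/2}."""
    evals, evecs = np.linalg.eigh(A)
    top = np.argsort(evals)[::-1][:d]
    return evecs[:, top] * np.sqrt(np.clip(evals[top], 0.0, None))

def sample_rdpg(X, rng):
    """Symmetric, hollow adjacency matrix with edge probabilities X X^T."""
    n = X.shape[0]
    upper = np.triu(rng.random((n, n)) < X @ X.T, k=1)
    return (upper | upper.T).astype(float)

def procrustes_stat(X_hat, Y_hat):
    """min over orthogonal W of ||X_hat - Y_hat W||_F, solved via SVD of Y_hat^T X_hat."""
    U, _, Vt = np.linalg.svd(Y_hat.T @ X_hat)
    return np.linalg.norm(X_hat - Y_hat @ (U @ Vt))

rng = np.random.default_rng(3)
n, d = 300, 2
centers = np.array([[0.8, 0.2], [0.2, 0.8]])    # hypothetical two-point F
labels = rng.integers(0, 2, size=n)
X = centers[labels]

# Hypothetical alternative: the first 50 vertices switch to the other position.
Y_alt = X.copy()
Y_alt[:50] = centers[1 - labels[:50]]

t_null = procrustes_stat(ase(sample_rdpg(X, rng), d), ase(sample_rdpg(X, rng), d))
t_alt = procrustes_stat(ase(sample_rdpg(X, rng), d), ase(sample_rdpg(Y_alt, rng), d))
```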

Challenges in semiparametric graph testing

Problem 1: Procrustes alignment introduces variance

More variance ⇒ less power.

Problem 2: How to generalize to multiple-graph hypothesis testing?

Ultimately, we want something like ANOVA for graphs.

Goal: develop a technique that...

1. Avoids Procrustes alignment

2. Generalizes naturally to 3 or more graphs

Omnibus matrix: motivation

Definition (Omnibus matrix)

Let graphs G1 and G2 be d-dimensional RDPGs with adjacency matrices $A^{(1)}$ and $A^{(2)}$. We construct an omnibus matrix for the graphs as

$$M = \begin{bmatrix} A^{(1)} & \frac{A^{(1)} + A^{(2)}}{2} \\ \frac{A^{(1)} + A^{(2)}}{2} & A^{(2)} \end{bmatrix} \in \mathbb{R}^{2n \times 2n}$$

Note: generalizes naturally to $m$ graphs, with $(i, j)$-block $(A^{(i)} + A^{(j)})/2$.
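The block structure in the definition can be written out directly for any number of graphs. A small sketch, where the function name `omnibus_matrix` and the tiny 2-vertex example graphs are assumptions for illustration:

```python
import numpy as np

def omnibus_matrix(graphs):
    """Build the mn-by-mn omnibus matrix whose (i, j) block is (A_i + A_j) / 2."""
    m = len(graphs)
    n = graphs[0].shape[0]
    M = np.empty((m * n, m * n))
    for i, A_i in enumerate(graphs):
        for j, A_j in enumerate(graphs):
            M[i * n:(i + 1) * n, j * n:(j + 1) * n] = (A_i + A_j) / 2.0
    return M

# Tiny example: one graph with a single edge, one empty graph.
A1 = np.array([[0.0, 1.0],
               [1.0, 0.0]])
A2 = np.zeros((2, 2))
M = omnibus_matrix([A1, A2])
```

The diagonal blocks are the adjacency matrices themselves, since $(A^{(i)} + A^{(i)})/2 = A^{(i)}$.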

Omnibus embedding

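Reminder:

$$M = \begin{bmatrix} A^{(1)} & \frac{A^{(1)} + A^{(2)}}{2} \\ \frac{A^{(1)} + A^{(2)}}{2} & A^{(2)} \end{bmatrix} \in \mathbb{R}^{2n \times 2n}$$

Under H0, we have $\mathbb{E} A^{(1)} = \mathbb{E} A^{(2)} = X X^T = P = U_P S_P U_P^T$

$S_P \in \mathbb{R}^{d \times d}$ diagonal, $U_P \in \mathbb{R}^{n \times d}$ with orthonormal columns

$$\mathbb{E} M = \tilde{P} = \begin{bmatrix} P & P \\ P & P \end{bmatrix} = \begin{bmatrix} U_P \\ U_P \end{bmatrix} S_P \begin{bmatrix} U_P^T & U_P^T \end{bmatrix} = \begin{bmatrix} X \\ X \end{bmatrix} \begin{bmatrix} X^T & X^T \end{bmatrix} = U_{\tilde{P}} S_{\tilde{P}} U_{\tilde{P}}^T$$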

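Key point: Applying ASE to $M$, we get a 2n-by-d matrix

$$\hat{Z} = \begin{bmatrix} \hat{X} \\ \hat{Y} \end{bmatrix}.$$

$\hat{X}, \hat{Y} \in \mathbb{R}^{n \times d}$ provide estimates of the latent positions of G1, G2 in the same d-dimensional space, without an additional alignment step. Natural test statistic given by $T_{\mathrm{Omni}} = \|\hat{X} - \hat{Y}\|_F$.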
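Putting the pieces together, the omnibus statistic can be sketched as follows: build M, embed it once, split the rows into the two blocks, and take the Frobenius norm of the difference. Helper names, the two-point latent distribution, and the alternative with 50 switched vertices are assumptions for illustration. Note that no Procrustes step appears, because both blocks come from the same eigendecomposition.

```python
import numpy as np

def ase(A, d):
    """Adjacency spectral embedding: rows of U_d S_d^{1/2}."""
    evals, evecs = np.linalg.eigh(A)
    top = np.argsort(evals)[::-1][:d]
    return evecs[:, top] * np.sqrt(np.clip(evals[top], 0.0, None))

def sample_rdpg(X, rng):
    """Symmetric, hollow adjacency matrix with edge probabilities X X^T."""
    n = X.shape[0]
    upper = np.triu(rng.random((n, n)) < X @ X.T, k=1)
    return (upper | upper.T).astype(float)

def omnibus_embedding(A1, A2, d):
    """Embed the 2n-by-2n omnibus matrix and split rows into the two graphs' estimates."""
    n = A1.shape[0]
    avg = (A1 + A2) / 2.0
    M = np.block([[A1, avg], [avg, A2]])
    Z_hat = ase(M, d)
    return Z_hat[:n], Z_hat[n:]

def omnibus_stat(A1, A2, d):
    X_hat, Y_hat = omnibus_embedding(A1, A2, d)
    return np.linalg.norm(X_hat - Y_hat)    # Frobenius norm, no alignment step

rng = np.random.default_rng(4)
n, d = 300, 2
centers = np.array([[0.8, 0.2], [0.2, 0.8]])    # hypothetical two-point F
labels = rng.integers(0, 2, size=n)
X = centers[labels]
Y_alt = X.copy()
Y_alt[:50] = centers[1 - labels[:50]]           # hypothetical alternative

X_hat, Y_hat = omnibus_embedding(sample_rdpg(X, rng), sample_rdpg(X, rng), d)
t_null = omnibus_stat(sample_rdpg(X, rng), sample_rdpg(X, rng), d)
t_alt = omnibus_stat(sample_rdpg(X, rng), sample_rdpg(Y_alt, rng), d)
```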

Main results: Notational preliminaries

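In what follows, we assume the null hypothesis, so G1 and G2 have shared latent positions $X \in \mathbb{R}^{n \times d}$.

$$\mathbb{E} A^{(1)} = \mathbb{E} A^{(2)} = P = U_P S_P U_P^T = X X^T \in \mathbb{R}^{n \times n}$$

We denote the “true latent positions” of $M$ by

$$Z = U_{\tilde{P}} S_{\tilde{P}}^{1/2} \in \mathbb{R}^{2n \times d}, \qquad \tilde{P} = \mathbb{E} M,$$

and their estimates by

$$\hat{Z} = U_M S_M^{1/2} = \begin{bmatrix} \hat{X} \\ \hat{Y} \end{bmatrix} \in \mathbb{R}^{2n \times d},$$

where $S_M \in \mathbb{R}^{d \times d}$ is the diagonal matrix of the top $d$ eigenvalues of $M$, with the corresponding eigenvectors in the columns of $U_M \in \mathbb{R}^{2n \times d}$.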

Main results: Concentration inequality

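Lemma (Uniform concentration of estimates)

Let $\{A^{(i)}\}_{i=1}^m$ be adjacency matrices of $m$ independent RDPGs with shared latent positions $X = U_P S_P^{1/2} \in \mathbb{R}^{n \times d}$, and let $M \in \mathbb{R}^{mn \times mn}$ be their omnibus matrix, with top eigenvalues collected in the diagonal matrix $S_M \in \mathbb{R}^{d \times d}$ and corresponding eigenvectors in the columns of $U_M \in \mathbb{R}^{mn \times d}$. There exists a constant $C > 0$ such that with high probability, there exists an orthogonal matrix $W \in \mathbb{R}^{d \times d}$ such that

$$\max_{1 \le h \le mn} \left\| \left( U_M S_M^{1/2} - U_{\tilde{P}} S_{\tilde{P}}^{1/2} W \right)_{h,\cdot} \right\| \le \frac{C m^{1/2} \log mn}{\sqrt{n}}, \qquad \tilde{P} = \mathbb{E} M.$$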

Main results: CLT

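Theorem (CLT: informally)

Let $\{A^{(i)}\}_{i=1}^m$ be adjacency matrices of $m$ independent RDPGs with shared latent positions $X = U_P S_P^{1/2} \in \mathbb{R}^{n \times d}$, whose rows are drawn i.i.d. from a $d$-dimensional distribution $F$. Let $M \in \mathbb{R}^{mn \times mn}$ be their omnibus matrix, with top eigenvalues collected in the diagonal matrix $S_M \in \mathbb{R}^{d \times d}$ and corresponding eigenvectors in the columns of $U_M \in \mathbb{R}^{mn \times d}$. Fix $h = n(s - 1) + i$ for $i \in [n]$ and $s \in [m]$. Then the error between the $h$-th position estimate and the (properly rotated) true $h$-th position is asymptotically a continuous mixture of normals, with mixing determined by $F$:

$$n^{1/2} \left( U_M S_M^{1/2} - U_{\tilde{P}} S_{\tilde{P}}^{1/2} W_n \right)_{h,\cdot} \to \int N(0, \tilde{\Sigma}(y)) \, dF(y), \qquad \tilde{P} = \mathbb{E} M.$$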

Main results: CLT

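Theorem (CLT: More formally)

Let $\{A^{(i)}\}_{i=1}^m$ be adjacency matrices of $m$ independent RDPGs with shared latent positions $X = U_P S_P^{1/2} \in \mathbb{R}^{n \times d}$, whose rows are drawn i.i.d. from a $d$-dimensional distribution $F$. Let $M \in \mathbb{R}^{mn \times mn}$ be their omnibus matrix, with top eigenvalues collected in the diagonal matrix $S_M \in \mathbb{R}^{d \times d}$ and corresponding eigenvectors in the columns of $U_M \in \mathbb{R}^{mn \times d}$. Let $\Phi(x, \Sigma)$ denote the cdf of a multivariate Gaussian with mean 0 and covariance matrix $\Sigma$. Fix $h = n(s - 1) + i$ for $i \in [n]$ and $s \in [m]$. There exists a sequence of d-by-d orthogonal matrices $(W_n)_{n=1}^\infty$ such that for all $x \in \mathbb{R}^d$,

$$\lim_{n \to \infty} \Pr\left[ n^{1/2} \left( U_M S_M^{1/2} - U_{\tilde{P}} S_{\tilde{P}}^{1/2} W_n \right)_{h,\cdot} \le x \right] = \int \Phi(x, \tilde{\Sigma}(y)) \, dF(y),$$

where $\tilde{P} = \mathbb{E} M$, $\tilde{\Sigma}(y) = (m + 3) \Delta^{-1} \Sigma(y) \Delta^{-1} / (4m)$ and

$$\Delta = \mathbb{E}_F\left[ X_1 X_1^T \right], \qquad \Sigma(y) = \mathbb{E}_F\left[ \left( y^T X_1 - (y^T X_1)^2 \right) X_1 X_1^T \right].$$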
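As a worked instance of the covariance formula, the sketch below evaluates $\Delta$, $\Sigma(y)$ and $\tilde{\Sigma}(y)$ exactly when F is a finite mixture of point masses, so the expectations become weighted sums. The function name and the two-point F are assumptions for illustration. By plain arithmetic, the factor $(m + 3)/(4m)$ equals 1 at $m = 1$ and 5/8 at $m = 2$.

```python
import numpy as np

def omni_clt_covariance(y, support, weights, m):
    """(m + 3) / (4m) * Delta^{-1} Sigma(y) Delta^{-1} for a discrete F.

    support: (K, d) array of atoms of F; weights: (K,) probabilities summing to 1.
    """
    # Delta = E[X1 X1^T] as a weighted sum of outer products.
    Delta = np.einsum("k,ki,kj->ij", weights, support, support)
    # Sigma(y) = E[(y^T X1 - (y^T X1)^2) X1 X1^T].
    p = support @ y
    Sigma = np.einsum("k,ki,kj->ij", weights * (p - p ** 2), support, support)
    Delta_inv = np.linalg.inv(Delta)
    return (m + 3) / (4 * m) * Delta_inv @ Sigma @ Delta_inv

# Hypothetical F: equal-weight point masses at (0.8, 0.2) and (0.2, 0.8).
support = np.array([[0.8, 0.2], [0.2, 0.8]])
weights = np.array([0.5, 0.5])
y = support[0]

cov_m1 = omni_clt_covariance(y, support, weights, m=1)   # factor (1 + 3) / 4 = 1
cov_m2 = omni_clt_covariance(y, support, weights, m=2)   # factor (2 + 3) / 8 = 5/8
```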

Experiments: hypothesis testing

[Figure: three panels (a), (b), (c); x-axis: number of vertices (log scale); legend: Method (Omnibus, Procrustes).]

Figure: Power of the Procrustes-based (blue) and omnibus-based (green) tests to detect when the two graphs being tested differ in (a) one, (b) five, and (c) ten of their latent positions. Each point is the proportion of 1000 trials for which the given technique correctly rejected the null hypothesis. Error bars denote two standard errors of this empirical mean.

Experiments: estimating latent positions

[Figure: x-axis: number of vertices (log scale), 20 to 1000; legend: Method (OMNIbar, PROCbar).]

Figure: Mean squared error (MSE) in recovery of latent positions (up to rotation) in a 2-graph RDPG model as a function of the number of vertices for different estimation procedures.

Future Work

Develop graph analogues of ANOVA and other multiple hypothesis testing procedures

Improve techniques for choosing critical value in omnibus test

Improve understanding of power under HA

Thanks!

Full paper: https://arxiv.org/abs/1705.09355
