

Community Detection by SCORE
with applications to Statisticians’ Networks

Jiashun Jin

Statistics Department
Carnegie Mellon University

Collaborators: Pengsheng Ji (Univ. of Georgia), Zheng Tracy Ke (Univ. of Chicago)

April 6, 2015

Jiashun Jin Community Detection by SCORE

Network community detection


Political web blogs (Adamic and Glance, 2005)

- n = 1222 web blogs (nodes)
- 16714 hyperlinks (edges)
- #edges ≪ n²: the adjacency matrix A is very sparse
- Two perceivable communities
- Goal. Find the (unknown) community labels
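The sparsity claim can be checked with a few lines of arithmetic. This is only a sketch: the node and edge counts come from the slide, and treating the hyperlinks as undirected edges is an assumption made here for illustration.

```python
# Worked arithmetic for the weblog network (counts taken from the slide).
n = 1222          # nodes
m = 16714         # edges, treated here as undirected (an assumption)

possible_pairs = n * (n - 1) // 2      # number of unordered node pairs
density = m / possible_pairs           # fraction of pairs that are linked
avg_degree = 2 * m / n                 # each edge contributes to two degrees

print(f"possible pairs: {possible_pairs}")
print(f"edge density:   {density:.4f}")
print(f"average degree: {avg_degree:.1f}")
```

Under that assumption only about 2% of the possible pairs are connected, which is what "very sparse" refers to.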


Abstraction (undirected)

Data: adjacency matrix $A$ of a network $N = (V, E)$

- $V = \{1, 2, \ldots, n\}$: nodes

$$A(i,j) = \begin{cases} 1, & \text{an edge between nodes } i \text{ and } j, \\ 0, & \text{otherwise} \end{cases}$$

- $K$ perceivable “communities”

$$V = V^{(1)} \cup V^{(2)} \cup \cdots \cup V^{(K)}$$

Goal. For each node, predict the community label.

Diagonals of $A$ are 0 for convenience.
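As a minimal sketch of this abstraction, the snippet below builds a symmetric 0/1 adjacency matrix with zero diagonal from an edge list. The function name `adjacency_matrix` and the toy edge list are hypothetical, and nodes are numbered from 0 rather than 1.

```python
def adjacency_matrix(n, edges):
    """Build a symmetric 0/1 adjacency matrix with zero diagonal.

    Nodes are numbered 0..n-1 here (the slide uses 1..n).
    """
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        if i == j:
            continue          # drop self-loops so the diagonal stays 0
        A[i][j] = 1
        A[j][i] = 1           # undirected: A is symmetric
    return A

# Hypothetical toy network: two triangles joined by one edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
A = adjacency_matrix(6, edges)
degrees = [sum(row) for row in A]
print(degrees)
```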


Signal and noise decomposition

Adjacency matrix: $A = E[A] + W$, where $W \equiv A - E[A]$ (“signal” + “noise”)

- $W = A - E[A]$: generalized Wigner matrix
- upper triangle: independent centered-Bernoulli entries
- Question: how to model $\Omega$ if we write

$$E[A] = \Omega - \mathrm{diag}(\Omega)$$
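The decomposition can be illustrated by sampling. The sketch below assumes a given matrix of edge probabilities `Omega` (here a hypothetical constant 0.3), draws the upper triangle of A as independent Bernoulli variables, and forms the noise matrix W. The helper names are illustrative only.

```python
import random

def sample_adjacency(Omega, seed=0):
    """Draw A with independent Bernoulli(Omega[i][j]) entries for i < j."""
    rng = random.Random(seed)
    n = len(Omega)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            A[i][j] = A[j][i] = int(rng.random() < Omega[i][j])
    return A

def noise(A, Omega):
    """W = A - E[A], where E[A] = Omega - diag(Omega)."""
    n = len(A)
    return [[A[i][j] - (Omega[i][j] if i != j else 0.0) for j in range(n)]
            for i in range(n)]

# Hypothetical 4-node example with a constant edge probability of 0.3.
Omega = [[0.3] * 4 for _ in range(4)]
A = sample_adjacency(Omega)
W = noise(A, Omega)
```

Each off-diagonal entry of W is a centered Bernoulli variable: it equals 1 − Ω(i, j) when the edge is present and −Ω(i, j) otherwise.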


Box’s wisdom

George E.P. Box (1919–2013)

“All models are wrong, but some are useful”


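Degree Corrected Block Model (DCBM)

$$\frac{\Omega(i,j)}{\theta(i)\,\theta(j)} = P(k,\ell) \ \text{ for } i \in V^{(k)},\ j \in V^{(\ell)} \iff \Omega = \Theta L \Theta$$

$$P = \begin{bmatrix} a & b \\ b & c \end{bmatrix}, \qquad \Theta = \begin{bmatrix} \theta(1) & & & \\ & \theta(2) & & \\ & & \ddots & \\ & & & \theta(7) \end{bmatrix}$$

$$L = \begin{bmatrix}
a & b & a & b & a & b & a \\
b & c & b & c & b & c & b \\
a & b & a & b & a & b & a \\
b & c & b & c & b & c & b \\
a & b & a & b & a & b & a \\
b & c & b & c & b & c & b \\
a & b & a & b & a & b & a
\end{bmatrix}
\xrightarrow{\text{permute}}
\begin{bmatrix}
a & a & a & a & b & b & b \\
a & a & a & a & b & b & b \\
a & a & a & a & b & b & b \\
a & a & a & a & b & b & b \\
b & b & b & b & c & c & c \\
b & b & b & b & c & c & c \\
b & b & b & b & c & c & c
\end{bmatrix}$$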

Karrer and Newman (2010)
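A small numerical sketch of the DCBM construction follows, assuming NumPy is available. The values chosen for P and θ are hypothetical; the alternating labels mirror the 7-node example on the slide.

```python
import numpy as np

# Hypothetical parameters for a 7-node, 2-community example.
P = np.array([[1.0, 0.3],
              [0.3, 0.8]])                    # P = [[a, b], [b, c]]
labels = np.array([0, 1, 0, 1, 0, 1, 0])      # alternating communities, as in L
theta = np.array([0.9, 0.5, 0.7, 0.6, 0.8, 0.4, 0.3])  # degree parameters

L = P[np.ix_(labels, labels)]                 # L(i, j) = P(label_i, label_j)
Omega = np.diag(theta) @ L @ np.diag(theta)   # Omega = Theta L Theta

# Sorting nodes by community reveals the block structure of L.
order = np.argsort(labels, kind="stable")
L_sorted = L[np.ix_(order, order)]
print(L_sorted)
```

After the permutation, `L_sorted` has a 4 × 4 block of a, a 3 × 3 block of c, and b elsewhere, matching the right-hand matrix above.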


Tukey’s suggestion

John W. Tukey (1915–2000)

“Which part of the sample contains the information”
Tukey (1965), PNAS


Where is the information?

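$$A = \Omega - \mathrm{diag}(\Omega) + W \approx \Omega$$

SVD: $\Omega = \Theta L \Theta = U_{n,K}\, D_{K,K}\, (U_{n,K})'$

$$U_{n,K} = \Theta\, T_{n,K} =
\begin{bmatrix} \theta(1) & & & \\ & \theta(2) & & \\ & & \ddots & \\ & & & \theta(n) \end{bmatrix}
\begin{bmatrix} s_1 & t_1 \\ s_2 & t_2 \\ s_1 & t_1 \\ \vdots & \vdots \\ s_1 & t_1 \\ s_2 & t_2 \end{bmatrix}$$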
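The factorization U = ΘT can be checked numerically on a noiseless example. The sketch below assumes NumPy and reuses hypothetical values for P, θ, and the labels. It divides each row of the leading eigenvectors of Ω by θ(i) and prints the result, which should take only K distinct row values.

```python
import numpy as np

# Hypothetical 7-node, 2-community DCBM (same style as the slide's example).
P = np.array([[1.0, 0.3], [0.3, 0.8]])
labels = np.array([0, 1, 0, 1, 0, 1, 0])
theta = np.array([0.9, 0.5, 0.7, 0.6, 0.8, 0.4, 0.3])
Omega = np.outer(theta, theta) * P[np.ix_(labels, labels)]

K = 2
vals, vecs = np.linalg.eigh(Omega)            # eigenvalues in ascending order
top = np.argsort(-np.abs(vals))[:K]           # K largest in magnitude
U = vecs[:, top]                              # n x K matrix of eigenvectors

T = U / theta[:, None]                        # T = Theta^{-1} U
print(np.round(T, 4))
```

Because the rows of T depend only on the community, entry-wise ratios of the columns of U cancel θ(i) and carry the label information.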


SCORE: algorithm

SCORE: Spectral Clustering On Ratios-of-Eigenvectors

Input: $A$ and $K$

- Obtain the $K$ leading eigenvectors $\hat\eta_1, \hat\eta_2, \ldots, \hat\eta_K$ of $A$
- Obtain the $n \times (K-1)$ matrix of entry-wise ratios

$$R(i,k) = \frac{\hat\eta_{k+1}(i)}{\hat\eta_1(i)}, \qquad 1 \le i \le n,\ 1 \le k \le K-1$$

- Apply k-means to $R$ (assume $\le K$ clusters)
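The three steps can be sketched in a few lines of NumPy. This is an illustrative sketch, not the author's implementation: the k-means routine is a plain Lloyd iteration with a deterministic farthest-point start (an assumption made so the example is reproducible), "leading" is taken to mean largest in magnitude, and the code assumes the first eigenvector has no zero entries, as is the case for a connected network.

```python
import numpy as np

def kmeans(X, K, n_iter=100):
    """Plain Lloyd's algorithm with a deterministic farthest-point start."""
    centers = [X[0]]
    for _ in range(1, K):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in range(K)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def score(A, K):
    """Cluster nodes from entry-wise ratios of the K leading eigenvectors."""
    vals, vecs = np.linalg.eigh(A)
    top = np.argsort(-np.abs(vals))[:K]       # K largest in magnitude
    eta = vecs[:, top]
    R = eta[:, 1:] / eta[:, [0]]              # n x (K-1) ratio matrix
    return kmeans(R, K)

# Hypothetical toy network: two 5-cliques joined by a single edge.
n = 10
A = np.zeros((n, n))
for block in (range(0, 5), range(5, 10)):
    for i in block:
        for j in block:
            if i != j:
                A[i, j] = 1.0
A[4, 5] = A[5, 4] = 1.0
labels_hat = score(A, K=2)
print(labels_hat)
```

On this toy graph the ratios are positive on one clique and negative on the other, so the two cliques end up in different clusters.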


Political weblog network (K = 2)

[Figure: scatter plot; x-axis: i = 1, 2, ..., n; y-axis: R(i)]

58 errors (lowest in literature)

Methods   SCORE   PCA   normalized PCA   NSC   BCPL
Errors    58      437   600              69    104.5 (SD: 145.4)

Newman (2006), Bickel and Chen (2009), Zhao et al. (2012)


Regularity conditions

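$$A = \Omega - \mathrm{diag}(\Omega) + W, \qquad \Omega = \Theta L \Theta, \qquad L = \sum_{k,\ell=1}^{K} P(k,\ell)\, \mathbf{1}_k \mathbf{1}_\ell'$$

- (a). Eigen-spacing of $DPD$ is $\ge$ a constant $C$, where

$$D(k,k)^2 = \Big[\sum_{i \in V^{(k)}} \theta(i)^2\Big] \Big/ \|\theta\|^2$$

- (b). $\log(n)\,\theta_{\max}\,\|\theta\|_1 / \|\theta\|^4 \to 0$, so that $\|W\| \ll \|\Omega\|$ with prob. $1 - o(n^{-3})$
- (c). $\log(n)\,\theta_{\max}^2 / \theta_{\min} \le \|\theta\|_3^3$, so the matrix-form Bernstein inequality holds (for the sum of random matrices)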
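To get a feel for conditions (b) and (c), the quantities can be evaluated for a given vector θ. The sketch below is illustrative only: the helper names and the choice of 1000 nodes with θ(i) = 0.5 are hypothetical.

```python
import math

def condition_b(theta):
    """log(n) * theta_max * ||theta||_1 / ||theta||^4 (should be near 0)."""
    n = len(theta)
    l1 = sum(theta)
    l2_sq = sum(t * t for t in theta)
    return math.log(n) * max(theta) * l1 / l2_sq ** 2

def condition_c(theta):
    """True when log(n) * theta_max^2 / theta_min <= ||theta||_3^3."""
    n = len(theta)
    lhs = math.log(n) * max(theta) ** 2 / min(theta)
    rhs = sum(t ** 3 for t in theta)
    return lhs <= rhs

theta = [0.5] * 1000      # hypothetical: n = 1000 nodes, equal degree parameters
print(condition_b(theta))
print(condition_c(theta))
```

When all θ(i) equal a constant c, the quantity in (b) reduces to log(n)/(n c²), which tends to 0 as n grows.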


Consistency of SCORE

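$$\mathrm{Hamm}_p(\hat\ell, \ell) = n^{-1} \min_{\pi} \sum_{i=1}^{n} P\big(\hat\ell_i \ne \pi(\ell_i)\big), \qquad
\mathrm{err}_n = \frac{\|\theta\|_3^3}{\|\theta\|^4} \max\Big\{ \sum_{i=1}^{n} \frac{1}{\theta(i)},\ \frac{1}{\theta_{\min}} \Big(\frac{\|\theta\|_1}{\|\theta\|_2}\Big)^2 \Big\}$$

Theorem. Consider a DCBM where (a)-(c) hold. As $n \to \infty$, if $n_*^{-1} \log(n)\, \mathrm{err}_n \to 0$, where $n_*$ is the minimum community size, then

$$\mathrm{Hamm}_p(\hat\ell^{\mathrm{score}}, \ell) \le C n^{-1} \log^3(n)\, \mathrm{err}_n.$$

Proof. Full analysis of $\Theta^{-1}(\hat\eta_k - \eta_k)$ (eigenvectors of $A$ versus those of $\Omega$)

- Spectral perturbation theory
- Classical large deviations inequalities
- Matrix-form Bernstein inequality (Tropp, 2012)

Remark. If we assume $\theta(i) \overset{iid}{\sim} F$ as in Zhao et al. (2012), then

$$\mathrm{Hamm}_p(\hat\ell^{\mathrm{score}}, \ell) \le C n^{-1} \log^3(n)$$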
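The criterion above is stated in terms of probabilities. For a single realization, its empirical counterpart is the fraction of mislabeled nodes, minimized over permutations of the community labels. A minimal sketch follows; the function name and the toy labels are hypothetical, and the brute-force search over all K! permutations is only practical for small K.

```python
from itertools import permutations

def hamming_error(est, truth, K):
    """Fraction of mislabeled nodes, minimized over relabelings of 0..K-1."""
    n = len(truth)
    best = n
    for perm in permutations(range(K)):
        mismatches = sum(1 for e, t in zip(est, truth) if e != perm[t])
        best = min(best, mismatches)
    return best / n

truth = [0, 0, 0, 1, 1, 1]
est   = [1, 1, 1, 0, 0, 1]     # same split up to label names, one node wrong
err = hamming_error(est, truth, K=2)
print(err)
```

The minimization over permutations matters because community labels are only identifiable up to relabeling.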


Coauthor/Citation Networks (statisticians)

- People most interested: statisticians/friends
- We know “inside information” not available to outsiders

Scientific Problem: Dynamics of US-based statisticians in theory & methods of the HDDA era
HDDA: High-Dimensional Data Analysis

Data: All published research papers in AoS, Biometrika, JASA, and JRSS-B, 2003–2012


Disclaimer

- Data and scope of scientific interests: limited
- It is not our intention to
  - rank one author/paper/area over the others
  - label an author/paper to a certain area
- We have to use real names because the networks are for real people (“us”)


Citation Network, I

Large-Scale Multiple Testing by SCORE (359 nodes; 26 shown)


Aad van der Vaart

Abba M Krieger

Bradley Efron

Christian P Robert

Christopher Genovese

D R Cox

Daniel Yekutieli

David L Donoho

David Siegmund

Donald B Rubin

E L Lehmann

Felix Abramovich

Iain M Johnstone

James O Berger

Jiashun Jin

John D Storey

John Rice

Joseph P Romano

Larry Wasserman

Mark G Low

Paul R Rosenbaum

Peter Muller

Sanat K Sarkar

Subhashis Ghosal

Yoav Benjamini

Zhiyi Chi


Citation Network, II

Spatial stat./nonparametric stat. by SCORE (1010 nodes; 42 shown)

Adrian E Raftery

Alan E Gelfand

Alan H Welsh

Amy H Herring

Andrew O Finley

Anthony OHagan

Athanasios Kottas

Brian S Caffo

Ciprian M Crainiceanu

David Ruppert

Douglas W Nychka

Gareth Roberts

Gary L Rosner

Hao Zhang (Purdue)

Huiyan Sang

Jeffrey S Morris

Jonathan Tawn

Joseph G Ibrahim

Laurens de Haan

Marc G Genton

Mark F J Steel

Martin Schlather

Michael L Stein

Michael Sherman

Ming-Hui Chen

Mohammad Hosseini-Nasab

Montserrat Fuentes

N Reid

Naisyin Wang

Omiros Papaspiliopoulos

Paul Fearnhead

R Todd Ogden

Raymond J Carroll

Robin Henderson

Simon N Wood

Steven N MacEachern

Sudipto Banerjee

Theo Gasser

Tilmann Gneiting

Ulrich Stadtmuller

Yi Li

Yongtao Guan


Citation Network, II (further split, I)

Parametric Spatial Statistics by SCORE (304 nodes; 21 shown)

Adrian E Raftery

Andrew O Finley

Anthony OHagan

Cristiano Varin

Douglas W Nychka

Fadoua Balabdaoui

Haavard Rue

Hao Zhang (Purdue)

Huiyan Sang

Jonathan Tawn

Laurens de Haan

Leah J Welty

Marc G Genton

Martin Schlather

Michael L Stein

Montserrat Fuentes

N Reid

Nicolas Chopin

Paolo Vidoni

Sudipto Banerjee

Tilmann Gneiting


Citation Network, II (further split, II)

Nonparametric Spatial Statistics by SCORE (212 nodes; 21 shown)

Alan E Gelfand

Alexandros Beskos

Athanasios Kottas

David M Blei

Fernando A Quintana

Gareth Roberts

Gary L Rosner

Herbert K H Lee

Ju-Hyun Park

Mark F J Steel

Matthew J Beal

Natesh Pillai

Omiros Papaspiliopoulos

Paul Fearnhead

Pilar L Iglesias

Radford M Neal

Robert B Gramacy

Steven N MacEachern

Trivellore E Raghunathan

Yee Whye Teh

Yi Li


Citation Network, II (further split, III)

Non-parametrics/semi-parametrics by SCORE (392 nodes; 24 shown)

Alan H Welsh

Brian S Caffo

Ciprian M Crainiceanu

D Mikis Stasinopoulos

David Ruppert

Hongtu Zhu

Hua Yun Chen

Jeffrey S Morris

Joseph G Ibrahim

Michael A Benjamin

Ming-Hui Chen

Mohammad Hosseini-Nasab

Naisyin Wang

Nilanjan Chatterjee

Rabi Bhattacharya

Ray Carroll

Robert A Rigby

Robin Henderson

Rui Paulo

Silvia Shimakura

Theo Gasser

Thomas C M Lee

Ulrich Stadtmuller

Vic Patrangenaru


Citation Network, III

Variable Selection by SCORE (1285 nodes; 40 shown)

Alexandre B Tsybakov

Cun-Hui Zhang

Dan Yu Lin

Elizaveta Levina

Emmanuel J Candes

Hans-Georg Muller

Hansheng Wang

Hao Helen Zhang

Heng Peng

Hui Zou

Ji Zhu

Jian Huang

Jianhua Z Huang

Jianqing Fan

Jinchi Lv

Joel L Horowitz

L J Wei

Lixing Zhu

Michael R Kosorok

Ming Yuan

Mohsen Pourahmadi

Nicolai Meinshausen

Peter Buhlmann

Peter Hall

Peter J Bickel

Qiwei Yao

R Dennis Cook

Robert J Tibshirani

Runze Li

Terence Tao

Trevor J Hastie

Xuming He

Yi Lin


Coauthorship Network, I

Objective Bayes by SCORE (64 nodes; 14 shown)


Alan E Gelfand

Athanasios Kottas

Carlos M Carvalho

Daniel Walsh

Fei Liu

Gonzalo Garcia-Donato

J Palomo

James O Berger

Jerry Sacks

John A Cafeo

M J Bayarri

R J Parthasarathy

Rui Paulo

Steven N MacEachern


Coauthorship Network, II

Biostatistics by SCORE (388 nodes; 16 shown)

David Dunson

Debajyoti Sinha

Eric Feuer

Helen Zhang

Heping Zhang

Hongtu Zhu

Steve Marron

Ji Zhu

Joseph Ibrahim

Jun Liu

L J Wei

Louise Ryan

Tapabrata Maiti

Trivellore Raghunathan

Weili Lin

Yimei Li

Zhiliang Ying


Coauthorship Network, III

HDDA by SCORE (1811 nodes, 32 shown)

Alexandre Tsybakov

Andrea Rotnitzky

Bani Mallick

Christian Robert

Ciprian Crainiceanu

Enno Mammen

Gerda Claeskens

Giovanni Parmigiani

Hans-Georg Muller

Holger Dette

Hua Liang

James R Robins

Jane-Ling Wang

Jianqing Fan

Larry Wasserman

Larry Brown

Lixing Zhu

Malay Ghosh

Marc G Genton

Nilanjan Chatterjee

Peter Hall

Peter Muller

Ray Carroll

Robert J Tibshirani

Runze Li

Song Xi Chen

T Tony Cai

Trevor Hastie

Wolfgang Hardle

Xihong Lin

Xuming He

Yanyuan Ma


Comparisons, I

Undirected networks:

- Newman’s Spectral Clustering (NSC)
- Bickel and Chen’s Profile Likelihood (BCPL)
- Amini et al.’s Pseudo Likelihood (APL)

Directed networks: Leicht & Newman’s Spectral Clustering


Comparisons, II

Adjusted Rand Index (ARI); larger means more similar

        SCORE   NSC    APL    BCPL
SCORE   1.00    .55    .19    .00
NSC             1.00   .41    .00
APL                    1.00   .00
BCPL                          1.00

Sizes of the 3 communities identified by SCORE, NSC, and APL

                    Objective Bayes   Biostat-Coau   HDDA-Coau
SCORE               64                388            1811
NSC                 69                163            2031
APL                 20                50             2193
SCORE ∩ NSC         55                162            1807
SCORE ∩ APL         20                50             1811
NSC ∩ APL           20                50             2032
SCORE ∩ NSC ∩ APL   20                50             1807
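For reference, the Adjusted Rand Index between two labelings can be computed from pair counts. The sketch below uses the standard pair-counting formula with only the Python standard library. The function name and the toy labelings are hypothetical, and this is not the code used to produce the table.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(a, b):
    """ARI between two labelings of the same nodes (pair-counting form)."""
    n = len(a)
    cells = Counter(zip(a, b))                 # contingency table cells
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)      # expected index under chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0                             # degenerate: trivial partitions
    return (sum_cells - expected) / (max_index - expected)

same = adjusted_rand_index([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, crossed)
```

The index equals 1 for identical partitions (up to relabeling), is near 0 for unrelated partitions, and can be negative, as in the second toy example.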


More on Coauthorship Network, I

“Theo. Statist. Learning” (15 nodes) and “Dim. Reduction” (14 nodes)

Alexandre B Tsybakov

Anatoli B Juditsky

Bin Yu

Bing Li

Fadoua Balabdaoui

Florentina Bunea

Francesca Chiaromonte

Guilherme Rocha

Jon A Wellner

Karim Lounici

Lexin Li

Liliana Forzani

Liping Zhu

Liqiang Ni

Liugen Xue

Lixing Zhu

Lukas Meier

Markus Kalisch

Marloes H Maathuis

Marten H Wegkamp

Nicolai Meinshausen

Peter Buhlmann

Philippe Rigollet

Piet Groeneboom

R Dennis Cook

Sara van de Geer

Tao Shi

Winfried Stute

Xia Cui

Xiangrong Yin

Xin Chen

Yuexiao Dong


More on Coauthorship Network, II

“Johns Hopkins”, “Duke”, “Stanford”, “Quant. Reg.”, “Exp. Design”

Johns Hopkins: Barry Rowlingson, Brian S Caffo, Chong-Zhi Di, Ciprian M Crainiceanu, David Ruppert, Dobrin Marchev, Galin L Jones, James P Hobert, John P Buonaccorsi, John Staudenmayer, Naresh M Punjabi, Peter J Diggle, Sheng Luo

Duke: Carlos M Carvalho, Gary L Rosner, Gerard Letac, Helene Massam, James G Scott, Jonathan R Stroud, Maria De Iorio, Mike West, Nicholas G Polson, Peter Muller

Stanford: Armin Schwartzman, Benjamin Yakir, David Siegmund, F Gosselin, John D Storey, Jonathan E Taylor, Keith J Worsley, Nancy Ruonan Zhang, Ryan J Tibshirani

Quant. Reg.: Hengjian Cui, Huixia Judy Wang, Jianhua Hu, Jianhui Zhou, Valen E Johnson, Wing K Fung, Xuming He, Yijun Zuo, Zhongyi Zhu

Exp. Design: Andrey Pepelyshev, Frank Bretz, Holger Dette, Natalie Neumeyer, Stanislav Volgushev, Stefanie Biedermann, Tim Holland-Letz, Viatcheslav B Melas


More on Coauthorship Network, III

David Dunson

Donglin Zeng

Hans-Georg Muller

Hongtu Zhu

Hua Liang

Jianqing Fan

Jing Qin

Joseph G Ibrahim

Peter Hall

Raymond J Carroll

T Tony Cai


More on Coauthorship Network, IV

David Dunson

Donglin Zeng

Hans-Georg Muller

Hongtu Zhu

Hua Liang

Jianqing Fan

Jing Qin

Joseph G Ibrahim

Peter Hall

Raymond J Carroll

T Tony Cai


Take home messages

- Proposed a fast, flexible, easy-to-implement, yet effective method: SCORE
- Successfully applied to Statisticians’ networks and found many meaningful communities
- Data sets: a fertile ground for future research (many results are not reported here)

References:

Jin J (2015) Fast network community detection by SCORE. Ann. Statist. 43(1), 57-89.

Ji P, Jin J (2014) Coauthorship and Citation networks for statisticians. arXiv:1410.2840.
