Models and Algorithms for Event-Driven Networks
PhD Defense
Brian Thompson
Committee: Muthu Muthukrishnan (advisor), Danfeng Yao (Virginia Tech), Rebecca Wright,
Paul Kantor, Hanghang Tong (CUNY City College)
December 19, 2013
Rutgers University
What is an event-driven network?
A set of nodes, together with a set of time-stamped events (labeled t1, t2, t3 in the figure).
Outline
We consider three problems that arise in the study of event-driven networks:
1. Detecting correlated events
2. Discovering functional communities
3. Modeling academic collaboration
Themes
Temporal dynamics
Group behavior
Attribution
Computational feasibility
Detecting Correlated Events in Communication Networks
Joint work with James Abello
Problem Description
Setup: An event-driven network, where events indicate communication between two nodes
Goal: Identify parts of the network with an unexpectedly high concentration of recent activity
Challenges:
  Scalability: data accumulates, so a concise representation is needed
  Efficiency: high data rate, time-sensitive information
  Variability: entities have different temporal dynamics
Network Representation
Given an event-driven communication network:
Node 1    Node 2     Timestamp
Muthu     Rebecca    8:30 AM
Rebecca   Paul       9:00 AM
Muthu     Danfeng    9:15 AM
Paul      Hanghang   2:00 PM
For each pair of nodes (directed or undirected), we extract a time sequence, e.g. for the pair (Muthu, Rebecca): t1, t2, t3, t4, t5
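A minimal sketch of this extraction step, using the four events from the example table. The function names and the choice to store timestamps as minutes since midnight are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict

# Event log from the example table: (node 1, node 2, timestamp).
EVENTS = [
    ("Muthu", "Rebecca", "8:30 AM"),
    ("Rebecca", "Paul", "9:00 AM"),
    ("Muthu", "Danfeng", "9:15 AM"),
    ("Paul", "Hanghang", "2:00 PM"),
]


def minutes_since_midnight(clock):
    """Convert a 12-hour clock string such as '2:00 PM' to minutes since midnight."""
    hhmm, meridiem = clock.split()
    hour, minute = (int(part) for part in hhmm.split(":"))
    hour = hour % 12 + (12 if meridiem.upper() == "PM" else 0)
    return hour * 60 + minute


def time_sequences(events, directed=False):
    """Group event times by node pair; undirected pairs are stored in sorted order."""
    sequences = defaultdict(list)
    for u, v, clock in events:
        key = (u, v) if directed else tuple(sorted((u, v)))
        sequences[key].append(minutes_since_midnight(clock))
    for times in sequences.values():
        times.sort()
    return dict(sequences)


SEQUENCES = time_sequences(EVENTS)
```

With only one event per pair in the example table, each sequence has a single entry; a longer log would yield the t1, t2, ... sequences shown on the slide.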
We can visualize the network like this:
[Figure: the network drawn as a graph on the nodes Muthu, Rebecca, Paul, Danfeng, and Hanghang]
Goal: Identify sets of nodes with an unexpectedly high concentration of recent activity
Temporal Bias
Question: How should “recent” be defined? The most frequent communications will always seem “recent”, overshadowing the behavior of others.
We call this time-scale bias.
[Figure: timelines of router traffic and attack traffic, with a marker at NOW]
Related Work
Time series analysis
Sequence of “summary graphs” (t = 1, t = 2, t = 3, t = 4)
[Figure: timeline with hourly ticks from 12:00 AM to 11:00 PM]
Our Approach
1. Use a streaming stochastic model to concisely represent communication between each node pair
2. Define a notion of “recent” communication that addresses time-scale bias
3. Apply a statistical test to detect correlated recent activity among a set of nodes
The REWARDS Model
REneWal theory Approach for Real-time Data Streams
A renewal process generates a sequence of events with inter-arrival times sampled independently at random from the same positive distribution.
Time sequence: t1, t2, t3, t4, t5
Inter-arrival time: the gap between consecutive events, $t_{i+1} - t_i$
[Figure: inter-arrival time distribution, with support from xmin to xmax]
For each pair of nodes in the network, estimate the parameters of the renewal process that is most likely to have generated the corresponding time sequence.
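A short sketch of the two ingredients above: turning a time sequence into inter-arrival times, and fitting a renewal process by maximum likelihood. The slides do not survive in enough detail to recover the distribution family used by REWARDS, so the exponential distribution below is only a stand-in assumption, and all function names are hypothetical.

```python
import random
from itertools import accumulate


def inter_arrival_times(times):
    """Gaps between consecutive events of one time sequence."""
    ordered = sorted(times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def simulate_renewal(sample_iat, n_events, start=0.0):
    """Generate event times whose gaps are i.i.d. draws from `sample_iat`."""
    gaps = [sample_iat() for _ in range(n_events)]
    if any(gap <= 0 for gap in gaps):
        raise ValueError("inter-arrival times must be positive")
    # Running sums of the gaps give the event times; drop the start value itself.
    return list(accumulate(gaps, initial=start))[1:]


def exponential_rate_mle(iats):
    """Maximum-likelihood rate for exponential gaps (a stand-in family): n / sum."""
    return len(iats) / sum(iats)
```

For example, `exponential_rate_mle(inter_arrival_times(times))` recovers the rate of a simulated process up to sampling error.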
Recency
The age of a renewal process Φ at time t is the amount of time elapsed since the last event:
$\mathrm{Age}_\Phi(t) = t - \max\{t_i : t_i \le t\}$
[Figure: timeline from 0 to t with events t1, ..., t5; the age is the gap between t5 and t]
We define the recency of Φ at time t as a normalization of $\mathrm{Age}_\Phi(t)$ using the probability integral transform, taken with respect to the limit distribution of $\mathrm{Age}_\Phi$.
This eliminates time-scale bias: in the limit, the transformed value is uniformly distributed on [0, 1], whatever the time scale of the process.
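A sketch of how such a normalization could be computed from observed inter-arrival times. It relies on the standard renewal-theory result that the limiting age has CDF $(1/\mu)\int_0^a (1 - F(y))\,dy$; with the empirical distribution of the gaps this reduces to sum(min(x_i, a)) / sum(x_i). Defining recency as one minus that value, so that fresh activity scores near 1, is an assumption made here for illustration, as are the function names.

```python
def age(times, now):
    """Time elapsed at `now` since the most recent event at or before `now`."""
    past = [t for t in times if t <= now]
    if not past:
        raise ValueError("no event has occurred yet")
    return now - max(past)


def limit_age_cdf(iats, a):
    """Empirical limiting-age CDF: sum(min(x_i, a)) / sum(x_i)."""
    if a <= 0:
        return 0.0
    return sum(min(x, a) for x in iats) / sum(iats)


def recency(iats, current_age):
    """Assumed convention: 1 - CDF(age), so recent activity maps close to 1."""
    return 1.0 - limit_age_cdf(iats, current_age)
```

Scaling every gap and the age by the same factor leaves the value unchanged: `recency([1, 2, 3], 1.5)` equals `recency([60, 120, 180], 90)`, which is the time-scale invariance the slide refers to.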
We define the recency of a set of processes at time t using the Kolmogorov-Smirnov test.
The p-value is the probability of getting a max distance at least as large as the observed one under the null hypothesis that the individual recency values are i.i.d. samples from the reference distribution.
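A sketch of the test, assuming the reference distribution is Uniform(0, 1), as the probability integral transform would imply. The p-value uses the asymptotic Kolmogorov series, which is only approximate for small sets, and this two-sided statistic does not by itself distinguish unusually high from unusually low recency. Function names are hypothetical.

```python
import math


def ks_statistic_uniform(values):
    """Max distance between the empirical CDF of `values` and the Uniform(0, 1) CDF."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one value")
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def ks_pvalue_asymptotic(d, n, terms=100):
    """Asymptotic tail: 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 * k^2 * n * d^2)."""
    if d <= 0:
        return 1.0
    z = n * d * d
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * z)
                for k in range(1, terms + 1))
    # Clamp to [0, 1] because the truncated series can overshoot slightly.
    return min(1.0, max(0.0, 2.0 * total))
```

A set whose recency values cluster near 1 gets a small p-value, while evenly spread values do not.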
The L-CORE Algorithm
Local algorithm for detecting CORrelated Events
1. For a given set of node pairs, maintain the IAT distribution of communication between each pair
2. Every time there is communication activity:
  • Update the corresponding IAT distribution
  • Output the recency of the set and the most recent node pairs
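The two steps can be sketched as a small streaming monitor. This is an illustration under several assumptions rather than the L-CORE implementation: it stores raw inter-arrival times instead of a compact summary, uses the empirical limiting-age normalization for per-pair recency, and reports the Kolmogorov-Smirnov distance from Uniform(0, 1) instead of a p-value. Class and method names are hypothetical.

```python
class PairState:
    """Per-pair summary: last event time plus the observed inter-arrival times."""

    def __init__(self):
        self.last_time = None
        self.iats = []  # raw list kept for clarity; a compact summary would replace it

    def observe(self, t):
        if self.last_time is not None:
            self.iats.append(t - self.last_time)
        self.last_time = t

    def recency(self, now):
        total = sum(self.iats)
        if total <= 0:
            return None  # not enough history to normalize yet
        a = now - self.last_time
        return 1.0 - sum(min(x, a) for x in self.iats) / total


def ks_distance_from_uniform(values):
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        return 0.0
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


class LocalMonitor:
    """Tracks a fixed set of node pairs and re-scores them on every event."""

    def __init__(self, pairs):
        self.states = {pair: PairState() for pair in pairs}

    def on_event(self, pair, t, top_k=3):
        state = self.states.get(pair)
        if state is not None:  # events on unmonitored pairs only advance the clock
            state.observe(t)
        scored = []
        for p, s in self.states.items():
            r = s.recency(t)
            if r is not None:
                scored.append((r, p))
        scored.sort(reverse=True)
        distance = ks_distance_from_uniform([r for r, _ in scored])
        return distance, [p for _, p in scored[:top_k]]
```

Each call to `on_event` returns the current set-level distance and the node pairs with the highest individual recency.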
The G-CORE Algorithm
Global algorithm for detecting CORrelated Events
1. Construct a graph on the nodes
2. Initialize a disjoint set data structure on the nodes
3. Run a variant of the Union-Find algorithm, keeping track of the subgraphs with highest recency
[Figure: example on nodes u1 to u5 with edge values between 0.1 and 1.0; the node-set table lists the values 0.900, 0.973, 0.500, and 0.421]
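One way the three steps could fit together is sketched below. The details are assumptions, not the G-CORE procedure itself: edges carry per-pair recency values, they are scanned in decreasing order (as in Kruskal's algorithm), and each merged component is scored with a hypothetical one-sided Kolmogorov-Smirnov score, 1 - exp(-2 n D^2), that rewards values concentrated near 1.

```python
import math


class DisjointSets:
    """Union-Find with path halving; also tracks the members of each set."""

    def __init__(self, items):
        self.parent = {x: x for x in items}
        self.members = {x: [x] for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return rx
        if len(self.members[rx]) < len(self.members[ry]):
            rx, ry = ry, rx  # attach the smaller set under the larger one
        self.parent[ry] = rx
        self.members[rx].extend(self.members.pop(ry))
        return rx


def one_sided_score(values):
    """Hypothetical set score: 1 - exp(-2 n D^2), with D = max(x_(i) - (i-1)/n)."""
    xs = sorted(values)
    n = len(xs)
    d = max(0.0, max(x - i / n for i, x in enumerate(xs)))
    return 1.0 - math.exp(-2.0 * n * d * d)


def scan_components(nodes, edge_recency, score=one_sided_score):
    """Merge along edges in decreasing recency; score each component as it forms."""
    sets = DisjointSets(nodes)
    inside = {}  # component root -> recency values of the edges inside it
    results = []
    for (u, v), r in sorted(edge_recency.items(), key=lambda kv: -kv[1]):
        ru, rv = sets.find(u), sets.find(v)
        values = inside.pop(ru, [])
        if rv != ru:
            values = values + inside.pop(rv, [])
        values = values + [r]
        root = sets.union(u, v)
        inside[root] = values
        results.append((score(values), sorted(sets.members[root])))
    results.sort(key=lambda item: -item[0])
    return results
```

On a toy graph where u1, u2, u3 share high-recency edges and u4, u5 do not, the top-scoring component is {u1, u2, u3}.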
Complexity
Let , and let be the number of node pairs that have ever communicated.
REWARDS model: space, update per event
L-CORE: time per event, where is the set of node pairs of interest
G-CORE: worst-case time
Heuristic G-CORE: time in practice, where is a precision parameter
Robustness to Time Scale
Simulation: star network, 100 trials with normal activity, and 100 trials including a period of correlated activity
Our approach is robust to temporal variability
Detection Latency
Data: Enron corpus, ~1000 nodes and ~5000 events
The algorithms identify similar times of correlated activity, but our approach has a shorter response time
Visualization
Output from G-CORE algorithm on the Bluetooth dataset at 12:00pm on Day 100
Summary of Contributions
REWARDS: a stochastic model for event-driven networks
A formal definition of recency that is time-scale invariant
L-CORE: a streaming local algorithm for detecting correlated recent activity among a given set of node pairs
G-CORE: an efficient global algorithm for detecting correlations throughout the network simultaneously
Discovering Functional Communities
Joint work with Linda Ness, David Shallcross, Devasis Bassu
Problem Description
Setup: An event-driven network, where events correspond to actions by a single node, each with an associated label
Goal: Identify functional communities of individuals who use the same labels
Challenges:
  Scalability: there may be many nodes and many labels
  Mixed membership: each node may be part of more than one community
Network Representation
Given a set of nodes and a collection of labeled events:
[Figures: the nodes Muthu, Rebecca, Paul, Danfeng, and Hanghang with their labeled events, shown first as a network and then as rows of a matrix in which a bicluster is marked]
Co-Clustering
Goal: Given a matrix, cluster the rows and columns simultaneously to reveal hidden structure
Challenges:
  The number and sizes of the clusters are not known a priori
  The number of possible co-clusterings is exponential in the size of the matrix
[Figure: matrix partitioned into row clusters R1, R2 and column clusters C1, C2]
Related Work
Spectral methods use linear algebraic techniques such as SVD to fit a block diagonal structure
They usually require the number of clusters to be pre-specified
They are likely to perform well on the matrix on the left, but not the one on the right:
[Figure: two example matrices]
Our Approach
1. Define a quality metric for co-clusterings that rewards large, dense biclusters
2. Find a co-clustering that maximizes the metric value
This is NP-hard in general, so efficient heuristics are needed
Choosing a Metric
Motivated by two desired properties, P1 and P2, we propose the following class of metrics:
$$\mu_\gamma = \sum_{B \in \Pi} \underbrace{\frac{a(B)^2}{s(B)}}_{\text{large}} \cdot \underbrace{\left(\frac{w(B)}{a(B)}\right)^{\gamma}}_{\text{dense}}$$
Proposition: every metric $\mu_\gamma$ in this class satisfies P1 and P2.
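A small sketch of evaluating such a metric on a binary matrix. The slide's definitions of a, s, and w are not recoverable here, so the code assumes a(B) is the area of a block (rows times columns), w(B) is the sum of its entries, and s(B) is the number of rows plus the number of columns; these readings, the function name, and the restriction to positive γ are all hypothetical.

```python
def mu_gamma(matrix, row_clusters, col_clusters, gamma):
    """Sum over blocks of (a^2 / s) * (w / a)^gamma, under the assumed definitions.

    Assumes gamma > 0, so that empty blocks contribute zero.
    """
    total = 0.0
    for rows in row_clusters:
        for cols in col_clusters:
            area = len(rows) * len(cols)    # assumed a(B)
            size = len(rows) + len(cols)    # assumed s(B)
            weight = sum(matrix[r][c] for r in rows for c in cols)  # assumed w(B)
            total += (area ** 2 / size) * (weight / area) ** gamma
    return total


# 4 x 4 matrix with two dense 2 x 2 blocks on the diagonal.
BLOCK_DIAGONAL = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
TRUE_CLUSTERS = [[0, 1], [2, 3]]
ONE_CLUSTER = [[0, 1, 2, 3]]
SINGLETONS = [[0], [1], [2], [3]]
```

With γ = 3, the true co-clustering scores 8.0, while merging everything into one block or splitting into single cells both score 4.0. Under these assumed definitions, smaller values of γ shift the balance toward large blocks.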
The CC-MACS Algorithm
Co-Clustering via Maximal Anti-Chain Search
1. Build randomized k-d trees on the rows and columns
2. Initialize maximal anti-chains as the leaves of each tree
3. Traverse the trees simultaneously from the bottom up, greedily merging the rows or columns that result in the greatest increase in the metric value
4. Output the co-clustering with the best metric value
Complexity: time for an matrix, where is the number of non-zero values
Experiments: Synthetic Data
matrix with dense biclusters of size
Compare via -score:
Experiments: Visual Comparison
Matrices with known structure, taken from the NIST Matrix Market repository
[Figure panels: Original Matrix, Randomly Permuted, Cross-Association, CC-MACS]
Experiments: Web Memes
Meme-Tracker dataset of Leskovec et al.
Top biclusters returned by the CC-MACS algorithm:
# of Domains   # of Memes   Density   Topic
21             26           98.2%     St. Jude Children’s Hospital
5              178          96.1%     Brazilian news
6              39           98.7%     Spanish news
6              20           99.2%     Tech news
6              17           100.0%    Politics
Summary of Contributions
A new class of co-clustering metrics that reward large, dense biclusters
The CC-MACS algorithm, which efficiently searches the space of possible co-clusterings for one that maximizes the value of a given metric
Advantages over existing methods:
  No need to specify the number of clusters in advance
  Not limited to matrices with a block diagonal structure
Modeling Collaboration in Academia
Joint work with Graham Cormode, Qiang Ma, Muthu Muthukrishnan
Problem Description
Setup: An event-driven network, where events correspond to joint publications between researchers
Goal: Understand what motivates collaborative behavior
Challenges:
  Model complexity: many factors influence which collaborations form and the product of those collaborations
  Dynamics: collaboration patterns may change over time
Related Work
Model one researcher’s papers and citations over time
Model as a static network: same collaborations and number of papers per year
[Figure: diagram with increments labeled +3, +6, and +9]
Our Approach
Model the system as a repeated game, where the researchers choose collaborators each year in an attempt to maximize their long-term academic success
Determine which sets of collaboration strategies form a game equilibrium, such that no pair of researchers would benefit from changing their strategies in order to collaborate with each other
Game-Theoretic Model
Players: A set of researchers
Utility: Each researcher wants to maximize their h-index
Actions: In each year, each researcher has a given number of units of “research potential” to distribute between individual and collaborative projects
Outcome: Each project produces a paper that will receive citations commensurate with the total research potential invested by the authors
Main Results
A researcher’s h-index grows asymptotically faster when collaborating than when working independently
In the static multi-player game, there is an equilibrium corresponding to each perfect matching on the researchers
In the dynamic multi-player game, however, the perfect matchings are not in equilibrium
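For reference, the h-index used as the measure of success in these results is the largest h such that the researcher has h papers with at least h citations each. A minimal sketch of the computation follows; the function name is illustrative.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h
```

For example, citation counts of 10, 8, 5, 4, and 3 give an h-index of 4: four papers have at least 4 citations, but there are not five papers with at least 5.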
Future Directions
Do there exist equilibria in the dynamic game?
Extend the model to allow mixed strategies
Analyze the game under other metrics of academic success besides the h-index
Models and Algorithms for Event-Driven Networks
1. Detecting correlated events
  New stochastic model to address the issue of time-scale bias
  Efficiently find subgraphs with unusually high recent activity
2. Discovering functional communities
  New class of metrics to reward large, dense biclusters
  CC-MACS algorithm efficiently finds a good co-clustering
3. Modeling academic collaboration
  Game-theoretic model allows formal analysis and simulation of collaborative behavior in a dynamic setting
Other Work
Measuring pairwise influence
  Use the REWARDS model to measure influence between nodes based on the times of their respective activity
Innovation and circulation in information networks
  Determine most likely sources of new content, and measure the importance of each node in the diffusion process
Cascade partitioning
  Infer likely threads of related content from temporal and relational information alone
Thank you!
I owe much gratitude to:
My committee: Muthu Muthukrishnan, Danfeng Yao, Rebecca Wright, Paul Kantor, and Hanghang Tong
Fred Roberts, Tami Carpenter, Tina Eliassi-Rad, and James Abello, for mentoring me over the years
My other collaborators, mentors, and friends at Rutgers, DIMACS/CCICADA, ACS, and elsewhere
The DHS Fellowship which funded me for 3 years
Last but not least, my family and friends