
Models and Algorithms for Event-Driven Networks

PhD Defense

Brian Thompson

Committee: Muthu Muthukrishnan (advisor), Danfeng Yao (Virginia Tech), Rebecca Wright, Paul Kantor, Hanghang Tong (CUNY City College)

December 19, 2013

Rutgers University

What is an event-driven network?

A set of nodes, together with a set of time-stamped events (at times t1, t2, t3 in the illustration).


Outline

We consider three problems that arise in the study of event-driven networks:

1. Detecting correlated events

2. Discovering functional communities

3. Modeling academic collaboration


Themes

Temporal dynamics

Group behavior

Attribution

Computational feasibility

Detecting Correlated Events in Communication Networks

Joint work with James Abello



Problem Description

Setup: An event-driven network, where events indicate communication between two nodes

Goal: Identify parts of the network with an unexpectedly high concentration of recent activity

Challenges:

Scalability – data accumulates, need concise representation

Efficiency – high data rate, time-sensitive information

Variability – entities have different temporal dynamics


Network Representation

Given an event-driven communication network:

Nodes: Muthu, Rebecca, Paul, Danfeng, Hanghang

Node 1    Node 2     Timestamp
Muthu     Rebecca    8:30 AM
Rebecca   Paul       9:00 AM
Muthu     Danfeng    9:15 AM
Paul      Hanghang   2:00 PM



For each pair of nodes (directed or undirected), we extract a time sequence:

t1 t2 t3 t4 t5
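The extraction step can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name `pair_sequences` and the encoding of timestamps as hours since midnight are assumptions made for the example.

```python
from collections import defaultdict

def pair_sequences(events, directed=False):
    """Group event timestamps by node pair and sort each sequence."""
    sequences = defaultdict(list)
    for node1, node2, timestamp in events:
        # For undirected communication, (a, b) and (b, a) share one key.
        key = (node1, node2) if directed else tuple(sorted((node1, node2)))
        sequences[key].append(timestamp)
    return {pair: sorted(times) for pair, times in sequences.items()}

# The four events from the example table, as hours since midnight (assumed encoding).
events = [
    ("Muthu", "Rebecca", 8.5),
    ("Rebecca", "Paul", 9.0),
    ("Muthu", "Danfeng", 9.25),
    ("Paul", "Hanghang", 14.0),
]
sequences = pair_sequences(events)
```

With `directed=False` the pair (Rebecca, Paul) is stored under the sorted key `("Paul", "Rebecca")`; with `directed=True` the sender and receiver order is preserved.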

We can visualize the network like this:

[Figure: network on the nodes Muthu, Rebecca, Paul, Danfeng, and Hanghang]

Goal: Identify sets of nodes with an unexpectedly high concentration of recent activity

Question: How to define “recent”? The most frequent communications will always seem “recent”, overshadowing others’ behavior.

We call this time-scale bias.

Temporal Bias

[Figure: router traffic and attack traffic on timelines ending at NOW]


Related Work

Time series analysis

Sequence of “summary graphs”

[Figure: summary graphs at t = 1, 2, 3, 4 over an hourly timeline from 12:00 AM to 11:00 PM]

Our Approach

1. Use a streaming stochastic model to concisely represent communication between each node pair

2. Define a notion of “recent” communication that addresses time-scale bias

3. Apply a statistical test to detect correlated recent activity among a set of nodes


The REWARDS Model: REneWal theory Approach for Real-time Data Streams

A renewal process generates a sequence of events with inter-arrival times sampled independently at random from the same positive distribution.

Time sequence: t1, t2, t3, t4, t5

Inter-arrival time = $t_{i+1} - t_i$

[Figure: inter-arrival time distribution, supported on [xmin, xmax]]

For each pair of nodes in the network, estimate the parameters of the renewal process that is most likely to have generated the corresponding time sequence
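The slides do not say which parametric family is fitted, so the sketch below swaps in a different, simpler technique: a running empirical inter-arrival time (IAT) distribution per node pair. The class name `EmpiricalIAT` is hypothetical, and storing every inter-arrival time is not the concise representation the talk calls for; it only illustrates the update-per-event pattern.

```python
from bisect import bisect_right, insort

class EmpiricalIAT:
    """Running empirical inter-arrival time (IAT) distribution for one node pair."""

    def __init__(self):
        self.last_time = None   # timestamp of the most recent event
        self.iats = []          # sorted inter-arrival times seen so far

    def update(self, timestamp):
        """Record a new event; events are assumed to arrive in time order."""
        if self.last_time is not None:
            insort(self.iats, timestamp - self.last_time)
        self.last_time = timestamp

    def cdf(self, x):
        """Fraction of observed inter-arrival times that are <= x."""
        if not self.iats:
            return 0.0
        return bisect_right(self.iats, x) / len(self.iats)

model = EmpiricalIAT()
for t in [0.0, 1.0, 3.0, 6.0]:   # hypothetical event times
    model.update(t)
```

A parametric fit would replace the sorted list with a few sufficient statistics, which is what makes constant space per pair plausible.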




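Recency

The age of a renewal process $\Phi$ at time $t$ is the amount of time elapsed since the last event:

$\mathrm{Age}_\Phi(t) = t - \max\{t_i : t_i \le t\}$

[Figure: timeline from 0 to t with events t1, ..., t5]

We define the recency of $\Phi$ at time $t$ as a normalization of $\mathrm{Age}_\Phi(t)$ using the probability integral transform, applied with the limit distribution of $\mathrm{Age}_\Phi$.

This eliminates time-scale bias: under the model, the transformed value is uniformly distributed on [0, 1] whatever the time scale of the process.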
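A small numerical sketch of this normalization follows. Two things are assumptions rather than the talk's definitions: recency is taken to be 1 - F(age), so that a just-observed event scores 1, and the inter-arrival distribution is estimated empirically. The limiting age distribution itself is the standard renewal-theory result $F(x) = \frac{1}{\mu}\int_0^x (1 - G(y))\,dy$, where $G$ is the inter-arrival distribution and $\mu$ its mean.

```python
def age(timestamps, t):
    """Time elapsed at time t since the last event at or before t."""
    past = [s for s in timestamps if s <= t]
    if not past:
        raise ValueError("no event at or before time t")
    return t - max(past)

def limiting_age_cdf(iats, x):
    """CDF of the limiting age distribution for an empirical IAT sample.

    For an empirical G, (1/mean) * integral_0^x (1 - G(y)) dy reduces to
    sum(min(d, x)) / sum(d) over the observed inter-arrival times d.
    """
    return sum(min(d, x) for d in iats) / sum(iats)

def recency(timestamps, t):
    """Assumed convention: 1 - F(age), so a just-seen event scores 1."""
    past = sorted(s for s in timestamps if s <= t)
    if len(past) < 2:
        raise ValueError("need at least two events to estimate the IAT distribution")
    iats = [b - a for a, b in zip(past, past[1:])]
    return 1.0 - limiting_age_cdf(iats, age(past, t))

# Two hypothetical processes on time scales that differ by a factor of ten.
fast = recency([0, 1, 2, 3], t=3.5)
slow = recency([0, 10, 20, 30], t=35)
```

Both processes are halfway through their typical gap, and both receive the same recency, which is the time-scale invariance the normalization is meant to provide.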


We define the recency of a set of processes at time $t$ using the Kolmogorov-Smirnov test:

The p-value is the probability of getting a max distance at least as large as the observed one under the null hypothesis that the recency values are i.i.d. samples from the uniform distribution on [0, 1].
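To make the test concrete, here is a sketch of a Kolmogorov-Smirnov comparison against the uniform distribution on [0, 1]. It assumes that uniform is the reference distribution (which is what the probability integral transform yields) and estimates the p-value by Monte Carlo simulation rather than a closed form; the function names and the sample values are hypothetical.

```python
import random

def ks_statistic(values):
    """Max distance between the empirical CDF of values and the U(0,1) CDF."""
    xs = sorted(values)
    n = len(xs)
    above = max((i + 1) / n - x for i, x in enumerate(xs))
    below = max(x - i / n for i, x in enumerate(xs))
    return max(above, below)

def ks_p_value(values, trials=2000, seed=0):
    """Monte Carlo estimate of P(distance >= observed) under i.i.d. U(0,1)."""
    rng = random.Random(seed)
    observed = ks_statistic(values)
    n = len(values)
    hits = sum(
        ks_statistic([rng.random() for _ in range(n)]) >= observed
        for _ in range(trials)
    )
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)

# Hypothetical recency values for six node pairs that were all active recently.
recencies = [0.90, 0.92, 0.95, 0.96, 0.97, 0.99]
distance = ks_statistic(recencies)
p_value = ks_p_value(recencies)
```

A set of recencies bunched near 1 sits far from the uniform CDF, so the distance is large and the p-value small, flagging correlated recent activity.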

The L-CORE Algorithm: Local algorithm for detecting CORrelated Events

1. For a given set of node pairs, maintain the IAT distribution of communication between each pair

2. Every time there is communication activity:
• Update the corresponding IAT distribution
• Output the current p-value and the most recent node pairs
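The streaming loop behind these two steps might look like the sketch below. It is an illustration under assumptions, not the L-CORE implementation: pairs are undirected, recency is taken as 1 - F(age) with an empirical inter-arrival distribution, and only the k most recent pairs are reported (the set-level p-value is omitted). `PairState` and `l_core_sketch` are hypothetical names.

```python
class PairState:
    """Per-pair state: last event time and observed inter-arrival times."""

    def __init__(self):
        self.last_time = None
        self.iats = []

    def observe(self, t):
        if self.last_time is not None:
            self.iats.append(t - self.last_time)
        self.last_time = t

    def recency(self, now):
        """1 - F(age) under the limiting age distribution; None if untrained."""
        total = sum(self.iats)
        if total <= 0:
            return None
        current_age = now - self.last_time
        return 1.0 - sum(min(d, current_age) for d in self.iats) / total

def l_core_sketch(events, pairs_of_interest, k=1):
    """Yield (time, k most recent pairs) after each event on a tracked pair."""
    state = {pair: PairState() for pair in pairs_of_interest}
    for u, v, t in events:              # events must be in time order
        pair = tuple(sorted((u, v)))
        if pair not in state:
            continue
        state[pair].observe(t)
        scores = {}
        for p, s in state.items():
            r = s.recency(t)
            if r is not None:
                scores[p] = r
        yield t, sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical stream: A-B talks every time unit, C-D every ten.
events = [("A", "B", 0), ("C", "D", 0), ("A", "B", 1), ("A", "B", 2), ("C", "D", 10)]
outputs = list(l_core_sketch(events, [("A", "B"), ("C", "D")]))
```

At time 10 the pair C-D has just communicated while A-B has been silent for eight of its usual gaps, so C-D is reported as the most recent pair.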

[Figure: worked example on nodes u1 to u5, with pairwise recency values on the edges and the recency computed for candidate node sets]

The G-CORE Algorithm: Global algorithm for detecting CORrelated Events

1. Construct a graph on the nodes, with edges weighted by the recency of each node pair

2. Initialize a disjoint set data structure on the nodes

3. Run a variant of the Union-Find algorithm, keeping track of the subgraphs with highest recency
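The Union-Find pass can be sketched as below. This is a simplified stand-in, not the G-CORE algorithm itself: edges are assumed to carry pairwise recency values and are processed from most to least recent, and each component is scored by the mean recency of its edges, a placeholder for the set-level recency used in the talk. The edge list is hypothetical.

```python
def g_core_sketch(edges, top=3):
    """edges: iterable of (u, v, recency). Returns the top-scoring node sets."""
    parent, members, weights = {}, {}, {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    candidates = []
    # Process edges from most to least recent, merging components as we go.
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        for node in (u, v):
            if node not in parent:
                parent[node], members[node], weights[node] = node, {node}, []
        ru, rv = find(u), find(v)
        if ru != rv:
            if len(members[ru]) < len(members[rv]):
                ru, rv = rv, ru             # union by size
            parent[rv] = ru
            members[ru] |= members.pop(rv)
            weights[ru] += weights.pop(rv)
        weights[ru].append(w)
        # Placeholder score: mean edge recency inside the component.
        score = sum(weights[ru]) / len(weights[ru])
        candidates.append((score, frozenset(members[ru])))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:top]

# Hypothetical edge recencies on five nodes.
edges = [("u1", "u2", 0.9), ("u2", "u3", 0.8), ("u4", "u5", 0.1)]
best = g_core_sketch(edges)
```

Sorting the edges once and merging with Union-Find means every candidate subgraph is a connected component at some recency threshold, which keeps the search far smaller than enumerating all node sets.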


Complexity

Let , and let be the number of node pairs that have ever communicated.

REWARDS model: space, update per event

L-CORE: time per event, where is the set of node pairs of interest

G-CORE: worst-case time

Heuristic G-CORE: time in practice, where is a precision parameter

Robustness to Time Scale


Simulation: star network, 100 trials w/ normal activity, and 100 trials including a period of correlated activity

Our approach is robust to temporal variability

Detection Latency


Data: Enron corpus, ~1000 nodes and ~5000 events

The algorithms identify similar times of correlated activity, but our approach has shorter response time

Visualization


Output from G-CORE algorithm on the Bluetooth dataset at 12:00pm on Day 100

Summary of Contributions

REWARDS: a stochastic model for event-driven networks

A formal definition of recency that is time-scale invariant

L-CORE: a streaming local algorithm for detecting correlated recent activity among a given set of node pairs

G-CORE: an efficient global algorithm for detecting correlations throughout the network simultaneously


Discovering Functional Communities

Joint work with Linda Ness, David Shallcross, Devasis Bassu

Problem Description

Setup: An event-driven network, where events correspond to actions by a single node, each with an associated label

Goal: Identify functional communities of individuals who use the same labels

Challenges:

Scalability – there may be many nodes and many labels

Mixed membership – each node may be part of more than one community



Given a set of nodes and a collection of labeled events:

[Figure sequence: the nodes Muthu, Rebecca, Paul, Danfeng, and Hanghang with their labeled events, rearranged over successive slides to reveal a bicluster]

Co-Clustering

Goal: Given a matrix, cluster the rows and columns simultaneously to reveal hidden structure

Challenges:

Don’t know the number or sizes of clusters a priori

Number of possible co-clusterings is exponential in the size of the matrix

[Figure: matrix partitioned into row clusters R1, R2 and column clusters C1, C2]

Spectral methods use linear algebraic techniques such as SVD to fit a block diagonal structure

Usually require number of clusters to be pre-specified

Likely to perform well on the matrix on the left, but not the one on the right:


Related Work

1. Define a quality metric for co-clusterings that rewards large, dense biclusters

2. Find a co-clustering that maximizes the metric value

NP-hard in general, so need efficient heuristics


Our Approach

Motivated by two desired properties:

We propose the following class of metrics:

Proposition: satisfies P1 and P2 for all .

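$$\mu_\gamma = \sum_{B \in \Pi} \frac{a(B)^2}{s(B)} \cdot \left(\frac{w(B)}{a(B)}\right)^{\gamma}$$

The first factor rewards large biclusters; the second rewards dense ones.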

Property P1 Property P2
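Evaluating the metric is a direct sum over the biclusters of a co-clustering. The sketch below reads the formula as a sum of (a(B)^2 / s(B)) · (w(B) / a(B))^γ; because the definitions of a, s, and w do not survive in the slide text, the function takes them as precomputed numbers, and the triples in the example are hypothetical.

```python
def mu_gamma(biclusters, gamma):
    """Sum of (a^2 / s) * (w / a)^gamma over biclusters given as (a, s, w)."""
    total = 0.0
    for a, s, w in biclusters:
        total += (a ** 2 / s) * (w / a) ** gamma
    return total

# Hypothetical (a, s, w) triples: same a and s, but half the weight w.
one_dense_block = [(4, 4, 4)]
one_sparse_block = [(4, 4, 2)]

dense_score = mu_gamma(one_dense_block, gamma=2)
sparse_score = mu_gamma(one_sparse_block, gamma=2)
```

Raising γ increases the penalty on the ratio w/a: halving w costs a factor of 2 at γ = 1 and a factor of 4 at γ = 2.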


Choosing a Metric

The CC-MACS Algorithm: Co-Clustering via Maximal Anti-Chain Search

1. Build randomized k-d trees on the rows and columns

2. Initialize maximal anti-chains as the leaves of each tree

3. Traverse the trees simultaneously from the bottom up, greedily merging the rows or columns that result in the greatest increase in the metric value

4. Output the co-clustering with the best metric value

Complexity: time for an matrix, where is the number of non-zero values










matrix with dense biclusters of size Compare via -score:


Experiments: Synthetic Data

Matrices with known structure, taken from the NIST Matrix Market repository


Experiments: Visual Comparison

Original Matrix

Randomly Permuted

Cross-Association

CC-MACS

Experiments: Web Memes

Meme-Tracker dataset of Leskovec et al.

Top biclusters returned by the CC-MACS algorithm:

# of Domains   # of Memes   Density   Topic
21             26           98.2%     St. Jude Children’s Hospital
5              178          96.1%     Brazilian news
6              39           98.7%     Spanish news
6              20           99.2%     Tech news
6              17           100.0%    Politics

Summary of Contributions

A new class of co-clustering metrics that reward large, dense biclusters

The CC-MACS algorithm, which efficiently searches the space of possible co-clusterings for one that maximizes the value of a given metric

Advantages over existing methods:

Do not need to specify number of clusters in advance

Not limited to matrices with a block diagonal structure

Modeling Collaboration in Academia

Joint work with Graham Cormode, Qiang Ma, Muthu Muthukrishnan

Problem Description

Setup: An event-driven network, where events correspond to joint publications between researchers

Goal: Understand what motivates collaborative behavior

Challenges:

Model complexity – many factors influence which collaborations form and the product of those collaborations

Dynamics – collaboration patterns may change over time

Related Work

Model one researcher’s papers and citations over time

Model as a static network: same collaborations and number of papers per year

Our Approach

Model the system as a repeated game, where the researchers choose collaborators each year in an attempt to maximize their long-term academic success

Determine which sets of collaboration strategies form a game equilibrium, such that no pair of researchers would benefit from changing their strategies in order to collaborate with each other


Game-Theoretic Model

Players: A set of researchers

Utility: Each researcher wants to maximize their h-index

Actions: In each year, each researcher has a budget of “research potential” to distribute between individual and collaborative projects

Outcome: Each project produces a paper that will receive citations commensurate with the total research potential invested by the authors


Main Results

A researcher’s h-index grows asymptotically faster when collaborating than when working independently – versus

In the static multi-player game, there is an equilibrium corresponding to each perfect matching on the researchers

In the dynamic multi-player game, however, the perfect matchings are not in equilibrium
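For reference, the h-index that these results are stated in terms of is the largest h such that a researcher has at least h papers with at least h citations each. A minimal computation, with hypothetical citation counts:

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    h = 0
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical citation counts for one researcher's papers.
example = h_index([10, 8, 5, 4, 3])
```

Because the h-index depends on both the number of papers and the citations each receives, how a researcher splits effort between solo and joint projects changes how fast it can grow.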


Future Directions

Do there exist equilibria in the dynamic game?

Extend the model to allow mixed strategies

Analyze the game under other metrics of academic success besides the h-index



1. Detecting correlated events

New stochastic model to address issue of time-scale bias

Efficiently find subgraphs with unusually high recent activity

2. Discovering functional communities

New class of metrics to reward large, dense biclusters

CC-MACS algorithm efficiently finds a good co-clustering

3. Modeling academic collaboration

Game-theoretic model allows formal analysis and simulation of collaborative behavior in a dynamic setting

Other Work

Measuring pairwise influence: Use the REWARDS model to measure influence between nodes based on the times of their respective activity

Innovation and circulation in information networks: Determine most likely sources of new content, and measure the importance of each node in the diffusion process

Cascade partitioning: Infer likely threads of related content from temporal and relational information alone

Thank you!

I owe much gratitude to:

My committee: Muthu Muthukrishnan, Danfeng Yao, Rebecca Wright, Paul Kantor, and Hanghang Tong

Fred Roberts, Tami Carpenter, Tina Eliassi-Rad, and James Abello, for mentoring me over the years

My other collaborators, mentors, and friends at Rutgers, DIMACS/CCICADA, ACS, and elsewhere

The DHS Fellowship, which funded me for 3 years

Last but not least, my family and friends