Sensor and Graph Mining

Post on 05-Feb-2016


School of Computer Science, Carnegie Mellon

Sensor and Graph Mining

Christos Faloutsos

Carnegie Mellon University & IBM
www.cs.cmu.edu/~christos

USC 04 C. Faloutsos 2

Joint work with

• Anthony Brockwell (CMU/Stat)
• Deepayan Chakrabarti (CMU)
• Spiros Papadimitriou (CMU)
• Chenxi Wang (CMU)
• Yang Wang (CMU)

Outline

• Introduction - motivation

• Problem #1: Stream Mining
  – Motivation
  – Main idea
  – Experimental results

• Problem #2: Graphs & Virus propagation

• Conclusions

Introduction

• Sensor devices
  – Temperature, weather measurements
  – Road traffic data
  – Geological observations
  – Patient physiological data

• Embedded devices
  – Network routers
  – Intelligent (active) disks

Introduction

• Limited resources
  – Memory
  – Bandwidth
  – Power
  – CPU

• Remote environments
  – No human intervention

Introduction – problem dfn

• Given a semi-infinite stream of values (time series) x1, x2, …, xt, …

• Find patterns, forecasts, outliers…

Introduction

• E.g.,

[Figure: example time series, annotated “Periodicity? (daily)”, “Periodicity? (twice daily)”, and “Noise”??]

Introduction

[Figure: example time series, annotated “Periodicity? (daily)”, “Periodicity? (twice daily)”, and “Noise”??]

• Can we capture these patterns
  – automatically
  – with limited resources?

Related work
Statistics: Time series forecasting

• AR(I)MA and seasonal variants
• ARFIMA, GARCH, …

• Main problem:

“[…] The first step in the analysis of any time series is to plot the data [and inspect the graph]” [Brockwell 91]

• Typically:
  – Resource intensive
  – Cannot update online

Related work
Databases: Continuous Queries

• Typically, different focus:
  – “Compression”
  – Not generative models

• Largely orthogonal problem…
  – Gilbert, Guha, Indyk et al. (STOC 2002)
  – Garofalakis, Gibbons (SIGMOD 2002)
  – Chen, Dong, Han et al. (VLDB 2002); Bulut, Singh (ICDE 2003)
  – Gehrke, Korn, et al. (SIGMOD 2001); Dobra, Garofalakis, Gehrke et al. (SIGMOD 2002)
  – Guha, Koudas (ICDE 2003); Datar, Gionis, Indyk et al. (SODA 2002)
  – Madden+ [SIGMOD02], [SIGMOD03]


Goals

• Adapt and handle arbitrary periodic components

• No human intervention/tuning

Also:

• Single pass over the data

• Limited memory (logarithmic)

• Constant-time update

Outline

• Introduction - motivation

• Problem #1: Stream Mining
  – Motivation
  – Main idea
  – Experimental results

• Problem #2: Graphs & Virus propagation

• Conclusions

Wavelets
“Straight” signal

[Figure: signal xt over time t, shown with panels I1 through I8, each plotted against t]

Wavelets
Introduction – Haar

[Figure: signal xt over time t, shown with Haar panels W1,1, W1,2, W1,3, W1,4, W2,1, W2,2, W3,1 and V4,1; axes: time and frequency]

Wavelets

• So?

• Wavelets compress many real signals well…
  – Image compression and processing
  – Vision; Astronomy, seismology, …

• Wavelet coefficients can be updated as new points arrive [Kotidis+]
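A minimal sketch of how Haar coefficients can be maintained as new points arrive. It assumes the orthonormal Haar transform and keeps only one pending scaling coefficient per level, so memory grows as O(lg N). The class name `StreamingHaar` and its layout are hypothetical illustrations, not the scheme of [Kotidis+].

```python
import math


class StreamingHaar:
    """Incremental Haar transform: one pending scaling coefficient per level."""

    def __init__(self):
        # pending[l] holds an unpaired scaling coefficient at level l, or None
        self.pending = []

    def update(self, x):
        """Consume one value; return the (level, detail) pairs it completes."""
        emitted = []
        s, level = float(x), 0
        while True:
            if level == len(self.pending):
                self.pending.append(None)
            if self.pending[level] is None:
                self.pending[level] = s          # wait for the sibling value
                break
            left = self.pending[level]
            self.pending[level] = None
            detail = (left - s) / math.sqrt(2)   # wavelet (detail) coefficient
            s = (left + s) / math.sqrt(2)        # scaling (smooth) coefficient
            emitted.append((level + 1, detail))
            level += 1
        return emitted


haar = StreamingHaar()
outputs = [haar.update(v) for v in [4, 2, 6, 8]]
```

Each arriving value completes one coefficient at level 1 every second step, one at level 2 every fourth step, and so on, so the amortized work per point is constant.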

Wavelets
Correlations

[Figure: signal xt over time t, shown with Haar panels W1,1, W1,2, W1,3, W1,4, W2,1, W2,2, W3,1 and V4,1; axes: time and frequency; an “=” annotation appears between panels]

Main idea
Correlations

• Wavelets are good…

• …we can do even better
  – One number…
  – …and the fact that they are equal/correlated

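Proposed method

[Figure: wavelet coefficients W_{l,t-2}, W_{l,t-1}, W_{l,t} at level l, and W_{l',t'-2}, W_{l',t'-1}, W_{l',t'} at level l']

$W_{l,t} \approx \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \dots$

$W_{l',t'} \approx \beta_{l',1} W_{l',t'-1} + \beta_{l',2} W_{l',t'-2} + \dots$

Small windows suffice… (k~4)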

More details…

• Update of wavelet coefficients (incremental)

• Update of linear models (incremental; RLS)

• Feature selection (single-pass)
  – Not all correlations are significant
  – Throw away the insignificant ones
  – very important!!

[see paper]
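A minimal sketch of a recursive least squares (RLS) update for one linear model, assuming the textbook RLS recursion with no forgetting factor and a weak prior. The class name `RLS` and the parameter `delta` are hypothetical; the exact variant used in the paper may differ.

```python
import numpy as np


class RLS:
    """Recursive least squares for y ~ beta . x, with O(k^2) work per update."""

    def __init__(self, k, delta=1e-6):
        self.beta = np.zeros(k)
        # Large initial P acts as a weak prior (assumption: delta is small)
        self.P = np.eye(k) / delta

    def update(self, x, y):
        x = np.asarray(x, dtype=float)
        Px = self.P @ x
        gain = Px / (1.0 + x @ Px)
        self.beta = self.beta + gain * (y - x @ self.beta)  # correct by the residual
        self.P = self.P - np.outer(gain, Px)                # rank-one downdate
        return self.beta


# Illustration on synthetic data: recover y = 0.5*x0 - 0.25*x1
rng = np.random.default_rng(0)
model = RLS(k=2)
for _ in range(200):
    x = rng.normal(size=2)
    model.update(x, 0.5 * x[0] - 0.25 * x[1])
```

Here `x` would hold the k previous wavelet coefficients and `y` the newest one; no past data needs to be stored, only `beta` and the k-by-k matrix `P`.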

Complexity

• Model update

Space: $O(\lg N + m k^2) \approx O(\lg N)$
Time: $O(k^2) \approx O(1)$

Where
  – N: number of points (so far)
  – k: number of regression coefficients; fixed
  – m: number of linear models; $O(\lg N)$

[see paper]

Outline

• Introduction - motivation

• Problem #1: Stream Mining
  – Motivation
  – Main idea
  – Experimental results

• Problem #2: Graphs & Virus propagation

• Conclusions

Setup

• First half used for model estimation

• Models applied forward to forecast entire second half

• AR, Seasonal AR (SAR): R
  – Simplest possible estimation – no maximum likelihood estimation (MLE), etc.

• … vs. Python scripts
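A minimal sketch of the evaluation protocol described above, using a plain AR(p) model fitted by least squares: estimate on the first half, then forecast the entire second half. The function names, the sine test signal and p = 2 are illustrative assumptions, not the scripts used in the talk.

```python
import numpy as np


def fit_ar(x, p):
    """Least-squares AR(p) fit: x[t] ~ sum_i coef[i] * x[t-1-i]."""
    x = np.asarray(x, dtype=float)
    # Column i holds the lag-(i+1) values aligned with targets x[p:]
    X = np.column_stack([x[p - i - 1 : len(x) - i - 1] for i in range(p)])
    coef, *_ = np.linalg.lstsq(X, x[p:], rcond=None)
    return coef


def forecast_ar(history, coef, steps):
    """Iterate the fitted model forward, feeding forecasts back in."""
    buf = [float(v) for v in history[-len(coef):]]
    out = []
    for _ in range(steps):
        nxt = sum(c * v for c, v in zip(coef, reversed(buf)))
        out.append(nxt)
        buf = buf[1:] + [nxt]
    return np.array(out)


t = np.arange(200)
x = np.sin(2 * np.pi * t / 20)        # noise-free periodic signal
train, test = x[:100], x[100:]        # first half / second half
coef = fit_ar(train, p=2)
pred = forecast_ar(train, coef, steps=len(test))
```

A pure sinusoid satisfies an exact two-term recurrence, so AR(2) forecasts it well; the signals in the results slides (triangle pulse, mixed, bursty traffic) are the cases where this simple setup struggles.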

Results
Synthetic data – Triangle pulse

• Triangle pulse
• AR captures wrong trend (or none)
• Seasonal AR (SAR) estimation fails

Results
Synthetic data – Mix

• Mix (sine + square pulse)
• AR captures wrong trend (or none)
• Seasonal AR estimation fails

Results
Real data – Automobile

• Automobile traffic
  – Daily periodicity with rush-hour peaks
  – Bursty “noise” at smaller time scales

[Figure: automobile traffic time series (filtered)]

Results
Real data – Automobile

• Automobile traffic
  – Daily periodicity with rush-hour peaks
  – Bursty “noise” at smaller time scales

• AR fails to capture any trend (average)
• Seasonal AR estimation fails

Results
Real data – Automobile

• Automobile traffic
  – Daily periodicity with rush-hour peaks
  – Bursty “noise” at smaller time scales

• AWSOM spots periodicities, automatically

Results
Real data – Automobile

• Automobile traffic
  – Daily periodicity with rush-hour peaks
  – Bursty “noise” at smaller time scales

• Generation with identified noise

Results
Real data – Sunspot

• Sunspot intensity
  – Slightly time-varying “period”
• AR captures wrong trend (average)
• Seasonal ARIMA
  – Captures immediate wrong downward trend
  – Requires human to determine seasonal component period (fixed)

Results
Real data – Sunspot

• Sunspot intensity
  – Slightly time-varying “period”

Estimation: 40 minutes (R) vs. 9 seconds (Python)

Variance

• Variance (log-power) vs. scale:
  – “Noise” diagnostic (if decreasing linear…)
  – Can use to estimate noise parameters

[Figure: variance (log-power) vs. scale; annotations: “~ 1 hour”, “~Hurst exponent”]

Running time

[Figure: time (t) vs. stream size (N)]


Space requirements

Equal total number of model parameters

Conclusion

Adapt and handle arbitrary periodic components

No human intervention/tuning

Single pass over the data
Limited memory (logarithmic)
Constant-time update

Conclusion

• No human
  – Adapt and handle arbitrary periodic components
  – No human intervention/tuning

• Limited resources
  – Single pass over the data
  – Limited memory (logarithmic)
  – Constant-time update

Outline

• Introduction - motivation
• Problem #1: Streams
• Problem #2: Graphs & Virus propagation
  – Motivation & problem definition
  – Related work
  – Main idea
  – Experiments

• Conclusions


Introduction

Internet Map [lumeta.com]

Food Web [Martinez ’91]

Protein Interactions [genomebiology.com]

Friendship Network [Moody ’01]

► Graphs are ubiquitous

Introduction

• What can we do with graph analysis?
  – Immunization;
  – Information dissemination
  – Network value of a customer [Domingos+]

[Figure: “Needle exchange” networks of drug users [Weeks et al. 2002], with “bridges” marked]


Problem definition

• Q1: How does a virus spread across an arbitrary network?

• Q2: will it create an epidemic?

• (in a sensor setting, with a ‘gossip’ protocol, will a rumor/query spread?)

Framework

• Susceptible-Infected-Susceptible (SIS) model
  – Cured nodes immediately become susceptible

[Figure: state diagram with two states, “Susceptible/healthy” and “Infected & infectious”; transitions labeled “Infected by neighbor” and “Cured internally”]

The model

• (virus) Birth rate β: probability that an infected neighbor attacks

• (virus) Death rate δ: probability that an infected node heals

[Figure: node N with neighbors N1, N2, N3, marked as infected or healthy; edges into N labeled “Prob. β”, recovery labeled “Prob. δ”]
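A minimal simulation sketch of the SIS model with birth rate β and death rate δ. It assumes synchronous discrete time steps, independent attacks from each infected neighbor, and that every node starts infected; the function names and the dict-of-lists graph format are hypothetical choices for illustration.

```python
import random


def star_graph(spokes):
    """Adjacency lists for a hub (node 0) connected to `spokes` leaf nodes."""
    adj = {0: list(range(1, spokes + 1))}
    for i in range(1, spokes + 1):
        adj[i] = [0]
    return adj


def sis_step(adj, infected, beta, delta, rng):
    """One synchronous step: infected nodes may heal, healthy ones may catch it."""
    nxt = set()
    for node, nbrs in adj.items():
        if node in infected:
            if rng.random() >= delta:          # stays infected w.p. 1 - delta
                nxt.add(node)
        else:
            k = sum(1 for n in nbrs if n in infected)
            # Escapes all k independent attacks w.p. (1 - beta)^k
            if k and rng.random() < 1.0 - (1.0 - beta) ** k:
                nxt.add(node)
    return nxt


def simulate(adj, beta, delta, steps, seed=0):
    """Return the number of infected nodes after each step."""
    rng = random.Random(seed)
    infected = set(adj)                        # assumption: all nodes start infected
    counts = [len(infected)]
    for _ in range(steps):
        infected = sis_step(adj, infected, beta, delta, rng)
        counts.append(len(infected))
    return counts


counts = simulate(star_graph(99), beta=0.05, delta=0.6, steps=50)
```

Plotting `counts` for several β/δ ratios is one way to see whether an infection dies out or persists on a given graph.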

Epidemic threshold

Defined as the value of τ, such that

if β/δ < τ, an epidemic cannot happen

Thus,

• given a graph

• compute its epidemic threshold

Epidemic threshold

What should τ depend on?

• avg. degree? and/or highest degree?

• and/or variance of degree?

• and/or determinant of the adjacency matrix?

Basic Homogeneous Model

Homogeneous graphs [Kephart-White ’91, ’93]

• Epidemic threshold = 1/<k>
• Homogeneous connectivity <k>, i.e., all nodes have ~same degree (unrealistic)

Power-law Networks

• Model for Barabási-Albert networks
  – [Pastor-Satorras & Vespignani, ’01, ’02]
  – Epidemic threshold = <k> / <k²>
  – only for BA-type networks, with γ = 3 (γ = power-law exponent)
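A short sketch of the arithmetic behind the <k> / <k²> threshold, computed from a degree sequence. The function name `pl_threshold` and the example degree lists are illustrative assumptions.

```python
def pl_threshold(degrees):
    """Threshold <k> / <k^2> from a list of node degrees."""
    n = len(degrees)
    mean_k = sum(degrees) / n                     # <k>: average degree
    mean_k2 = sum(d * d for d in degrees) / n     # <k^2>: average squared degree
    return mean_k / mean_k2


# When every node has the same degree k, the ratio reduces to 1/k
regular = pl_threshold([4] * 10)

# One hub of degree 99 plus 99 leaves of degree 1
star = pl_threshold([99] + [1] * 99)
```

High-degree nodes dominate <k²>, so a heavy-tailed degree sequence pushes this threshold toward zero.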

Epidemic threshold

• Homogeneous graphs: 1/<k>
• BA (γ=3): <k> / <k²>

• more complicated graphs ?

• arbitrary, REAL graphs ?

• how many parameters??

Epidemic threshold

• [Theorem] We have no epidemic, if

β/δ < τ = 1/λ1,A

Epidemic threshold

• [Theorem] We have no epidemic, if

β/δ < τ = 1/λ1,A

where
  – β: attack prob.
  – δ: recovery prob.
  – τ: epidemic threshold
  – λ1,A: largest eigenvalue of adj. matrix A

Proof: [Wang+03]
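A minimal sketch of computing τ = 1/λ1,A numerically, assuming an undirected graph so that the adjacency matrix A is symmetric. The function names and the small complete-graph example are illustrative, not taken from the paper.

```python
import numpy as np


def epidemic_threshold(A):
    """tau = 1 / (largest eigenvalue of the symmetric adjacency matrix A)."""
    A = np.asarray(A, dtype=float)
    lam1 = np.linalg.eigvalsh(A)[-1]   # eigvalsh returns eigenvalues in ascending order
    return 1.0 / lam1


def below_threshold(beta, delta, A):
    """True when beta/delta < tau, i.e. the theorem rules out an epidemic."""
    return beta / delta < epidemic_threshold(A)


# Complete graph on 4 nodes: largest eigenvalue is 3, so tau = 1/3
K4 = np.ones((4, 4)) - np.eye(4)
tau = epidemic_threshold(K4)
```

For large sparse graphs a power-iteration or sparse eigensolver would be the more practical choice; the dense call here is only for clarity.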

Epidemic threshold for various networks

• sanity checks / older results:

• Homogeneous networks
  – λ1,A = <k>; τ = 1/<k>
  – where <k> = average degree
  – This is the same result as that of Kephart & White!

Epidemic threshold for various networks

• sanity checks / older results:

• Star networks
  – λ1,A = sqrt(d); τ = 1/sqrt(d)
  – where d = the degree of the central node
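A quick numerical check of the star-network result, λ1,A = sqrt(d). The helper name `star_adjacency` and the choice d = 99 are illustrative assumptions.

```python
import math

import numpy as np


def star_adjacency(d):
    """Adjacency matrix of a star: node 0 is the hub, nodes 1..d are leaves."""
    A = np.zeros((d + 1, d + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    return A


d = 99
lam1 = np.linalg.eigvalsh(star_adjacency(d))[-1]   # largest eigenvalue
tau = 1.0 / lam1
expected = math.sqrt(d)                            # closed form from the slide
```

The computed `lam1` agrees with `expected` up to floating-point error, so the threshold for this star is about 1/9.95.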

Epidemic threshold for various networks

• sanity checks / older results:

• Infinite, power-law networks
  – λ1,A = ∞; τ = 0: *any* virus has a chance! [Barabasi et al]

• Finite power-law networks
  – τ = 1/λ1,A

Outline

• Introduction - motivation
• Problem #1: Streams
• Problem #2: Graphs & Virus propagation
  – Motivation & problem definition
  – Related work
  – Main idea
  – Experiments

• Conclusions

Experiments

• 2 graphs
  – Star network: one “hub” + 99 “spokes”
  – “Oregon” Internet AS graph:
    • 10,900 nodes, 31,180 edges
    • topology.eecs.umich.edu/data.html

• More in our paper: [SRDS ’03]

Experiments (Star)

[Figure: three curves, labeled β/δ > τ (above threshold), β/δ = τ (at the threshold), and β/δ < τ (below threshold)]

Experiments (Oregon)

[Figure: three curves, labeled β/δ > τ (above threshold), β/δ = τ (at the threshold), and β/δ < τ (below threshold)]

Our prediction vs. previous prediction

• our predictions are more accurate

[Figure: two panels, “Oregon” and “Star”; number of infected nodes vs. β/δ; curves labeled “PL3” and “Our”]


Conclusions

We found an epidemic threshold

√ that applies to any network topology

√ and it depends only on one parameter of the graph


Overall conclusions

• Automatic stream mining: AWSOM

• graphs and virus propagation: eigenvalue

Ongoing / related work

• Streams
  – how to find hidden variables on multiple streams [w/ Spiros and Jimeng Sun]
  – ‘network tomography’ [w/ Airoldi +]

• Graphs
  – graph partitioning [w/ Deepay+]
  – important subgraphs [w/ Tomkins + McCurley]
  – graph generators [RMAT, w/ Deepay]

Thank you!

Contact info:
christos @ cs.cmu.edu
spapadim @ cs.cmu.edu
deepay @ cs.cmu.edu


Main References

• Spiros Papadimitriou, Anthony Brockwell and Christos Faloutsos: Adaptive, Hands-Off Stream Mining, VLDB 2003, Berlin, Germany, Sept. 2003.

• [Wang+03] Yang Wang, Deepayan Chakrabarti, Chenxi Wang and Christos Faloutsos: Epidemic Spreading in Real Networks: an Eigenvalue Viewpoint, SRDS 2003, Florence, Italy.


Additional References

• Connection Subgraphs, C. Faloutsos, K. McCurley, A. Tomkins, SIAM-DM 2004 workshop on link analysis

• RMAT: A recursive graph generator, D. Chakrabarti, Y. Zhan, C. Faloutsos, SIAM-DM 2004

• iFilter: Network tomography using particle filters, Edoardo Airoldi, Christos Faloutsos (submitted)