Sensor and Graph Mining

Post on 07-Jan-2016


INTEL 04 C. Faloutsos 1

School of Computer Science, Carnegie Mellon

Sensor and Graph Mining

Christos Faloutsos

Carnegie Mellon University & IBM

www.cs.cmu.edu/~christos


Joint work with

• Anthony Brockwell (CMU/Stat)

• Deepayan Chakrabarti (CMU)

• Spiros Papadimitriou (CMU)

• Chenxi Wang (CMU)

• Yang Wang (CMU)


Outline

• Introduction - motivation

• Problem #1: Stream Mining
  – Motivation
  – Main idea
  – Experimental results

• Problem #2: Graphs & Virus propagation

• Conclusions

Introduction

• Sensor devices
  – Temperature, weather measurements
  – Road traffic data
  – Geological observations
  – Patient physiological data

• Embedded devices
  – Network routers
  – Intelligent (active) disks

Introduction

• Limited resources
  – Memory
  – Bandwidth
  – Power
  – CPU

• Remote environments
  – No human intervention

Introduction – problem definition

• Given a semi-infinite stream of values (time series) x1, x2, …, xt, …

• Find patterns, forecasts, outliers…

Introduction

• E.g., [figure: example time series with three annotations]
  – Periodicity? (daily)
  – Periodicity? (twice daily)
  – “Noise”??

• Can we capture these patterns
  – automatically
  – with limited resources?

Related work

Statistics: Time series forecasting

• Main problem:
  “[…] The first step in the analysis of any time series is to plot the data [and inspect the graph]” [Brockwell 91]

• Typically:
  – Resource intensive
  – Cannot update online

• AR(I)MA and seasonal variants

• ARFIMA, GARCH, …

Related work

Databases: Continuous Queries

• Typically, different focus:
  – “Compression”
  – Not generative models

• Largely orthogonal problem…
  – Gilbert, Guha, Indyk et al. (STOC 2002)
  – Garofalakis, Gibbons (SIGMOD 2002)
  – Chen, Dong, Han et al. (VLDB 2002); Bulut, Singh (ICDE 2003)
  – Gehrke, Korn, et al. (SIGMOD 2001); Dobra, Garofalakis, Gehrke et al. (SIGMOD 2002)
  – Guha, Koudas (ICDE 2003); Datar, Gionis, Indyk et al. (SODA 2002)
  – Madden+ [SIGMOD02], [SIGMOD03]


Goals

• Adapt and handle arbitrary periodic components

• No human intervention/tuning

Also:

• Single pass over the data

• Limited memory (logarithmic)

• Constant-time update

Outline

• Introduction - motivation

• Problem #1: Stream Mining
  – Motivation
  – Main idea
  – Experimental results

• Problem #2: Graphs & Virus propagation

• Conclusions

Wavelets: “straight” signal

[figure: signal xt vs. time t, with panels I1 through I8, each plotted against t]

Wavelets: introduction – Haar

[figure: Haar decomposition of the signal xt on a time/frequency grid; panels W1,1, W1,2, W1,3, W1,4, W2,1, W2,2, W3,1 and V4,1, each plotted against t]

Wavelets

• So?

• Wavelets compress many real signals well…
  – Image compression and processing
  – Vision; Astronomy, seismology, …

• Wavelet coefficients can be updated as new points arrive [Kotidis+]
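As a rough illustration of the incremental update mentioned above, here is a minimal Python sketch of a streaming orthonormal Haar transform. It is not the scheme from [Kotidis+] or from the paper; the class name `StreamingHaar`, the `window` parameter, and the sign convention (left minus right) are assumptions made for the example. It keeps one unpaired scaling coefficient per level, so the state grows logarithmically with the stream length, and each new point costs amortized constant time.

```python
import math
from collections import deque


class StreamingHaar:
    """Hypothetical helper: incremental orthonormal Haar transform of a stream."""

    def __init__(self, window=4):
        self.window = window
        self.pending = []  # pending[l]: scaling coefficient waiting for its sibling, or None
        self.details = []  # details[l]: the last `window` detail coefficients at level l

    def update(self, x):
        """Consume one new stream value and update every affected level."""
        carry = float(x)
        level = 0
        while True:
            if level == len(self.pending):
                # First time this level is reached: allocate its state.
                self.pending.append(None)
                self.details.append(deque(maxlen=self.window))
            if self.pending[level] is None:
                # No sibling yet: park the value and stop.
                self.pending[level] = carry
                return
            left = self.pending[level]
            self.pending[level] = None
            # Detail = scaled difference, scaling = scaled sum (orthonormal Haar).
            self.details[level].append((left - carry) / math.sqrt(2))
            carry = (left + carry) / math.sqrt(2)
            level += 1


haar = StreamingHaar(window=4)
for value in [4, 2, 6, 8]:
    haar.update(value)
```

After these four points the finest level holds two detail coefficients, the next level holds one, and a single scaling coefficient is parked at the top, waiting for more data.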

Wavelets: correlations

[figure, shown in two steps: the same Haar time/frequency grid for xt (panels W1,1, W1,2, W1,3, W1,4, W2,1, W2,2, W3,1, V4,1), with an “=” annotation in the first step]

Main idea: correlations

• Wavelets are good…

• …we can do even better
  – One number…
  – …and the fact that they are equal/correlated

Proposed method

[figure: coefficients W_{l,t-2}, W_{l,t-1}, W_{l,t} within one level l, and W_{l',t'-2}, W_{l',t'-1}, W_{l',t'} within another level l']

W_{l,t} = β_{l,1} W_{l,t-1} + β_{l,2} W_{l,t-2} + …

W_{l',t'} = β_{l',1} W_{l',t'-1} + β_{l',2} W_{l',t'-2} + …

Small windows suffice… (k~4)

More details…

• Update of wavelet coefficients (incremental)

• Update of linear models (incremental; RLS)

• Feature selection (single-pass)
  – Not all correlations are significant
  – Throw away the insignificant ones
  – very important!!

[see paper]
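To make the “incremental; RLS” step concrete, the following is a minimal sketch of textbook recursive least squares in plain Python. It is not taken from the paper: the class name, the `delta` initialisation constant, and the absence of a forgetting factor are all assumptions. Each update touches a k × k matrix once, so the cost per new observation is on the order of k², which is constant when k is fixed.

```python
class RecursiveLeastSquares:
    """Hypothetical helper: textbook RLS for y ~ w . x, updated one sample at a time."""

    def __init__(self, k, delta=1e4):
        self.k = k
        self.w = [0.0] * k
        # P tracks the inverse of (X^T X + I/delta); a large delta means a weak prior.
        self.P = [[delta if i == j else 0.0 for j in range(k)] for i in range(k)]

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, y):
        k = self.k
        # Px = P x, computed in O(k^2).
        Px = [sum(self.P[i][j] * x[j] for j in range(k)) for i in range(k)]
        denom = 1.0 + sum(x[i] * Px[i] for i in range(k))
        gain = [v / denom for v in Px]
        error = y - self.predict(x)
        self.w = [self.w[i] + gain[i] * error for i in range(k)]
        # Rank-one downdate: P <- P - gain (P x)^T, valid because P is symmetric.
        self.P = [[self.P[i][j] - gain[i] * Px[j] for j in range(k)] for i in range(k)]


# Usage: recover the coefficients of y = 2*x1 - 3*x2 from a stream of samples.
rls = RecursiveLeastSquares(k=2)
for i in range(20):
    x = [float(i), float((i * i) % 7)]
    rls.update(x, 2.0 * x[0] - 3.0 * x[1])
```

In the setting of the slides, `x` would hold a few earlier wavelet coefficients and `y` the coefficient being modelled; that pairing is an interpretation of the slide, not code from the authors.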

Complexity

• Model update
  – Space: O(lg N + m k²) = O(lg N)
  – Time: O(k²) = O(1)

Where
  – N: number of points (so far)
  – k: number of regression coefficients; fixed
  – m: number of linear models; O(lg N)

[see paper]

Outline

• Introduction - motivation

• Problem #1: Stream Mining
  – Motivation
  – Main idea
  – Experimental results

• Problem #2: Graphs & Virus propagation

• Conclusions

Setup

• First half used for model estimation

• Models applied forward to forecast entire second half

• AR, Seasonal AR (SAR): R
  – Simplest possible estimation: no maximum likelihood estimation (MLE), etc.

• … vs. Python scripts

Results: synthetic data – triangle pulse

• Triangle pulse

• AR captures wrong trend (or none)

• Seasonal AR (SAR) estimation fails

Results: synthetic data – mix

• Mix (sine + square pulse)

• AR captures wrong trend (or none)

• Seasonal AR estimation fails

Results: real data – automobile

• Automobile traffic
  – Daily periodicity with rush-hour peaks
  – Bursty “noise” at smaller time scales

[figure: automobile traffic series (filtered)]

• AR fails to capture any trend (average)

• Seasonal AR estimation fails

• AWSOM spots periodicities, automatically

• Generation with identified noise

Results: real data – sunspot

• Sunspot intensity
  – Slightly time-varying “period”

• AR captures wrong trend (average)

• Seasonal ARIMA
  – Immediately captures a wrong, downward trend
  – Requires human to determine seasonal component period (fixed)

• Estimation: 40 minutes (R) vs. 9 seconds (Python)

Variance

• Variance (log-power) vs. scale:
  – “Noise” diagnostic (if decreasing linear…)
  – Can use to estimate noise parameters

[figure: log-power vs. scale; annotations “~ 1 hour” and “~Hurst exponent”]
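One way to picture the log-power vs. scale diagnostic is the sketch below, which computes the mean squared Haar detail coefficient per level (in log2) and fits a straight line through the points. The function names are hypothetical, the batch computation is only for illustration (the slides describe an incremental setting), and the code assumes every level has non-zero power. How the fitted slope maps to a Hurst exponent is left to the paper.

```python
import math


def haar_log_power(x, levels):
    """Return [(level, log2 of mean squared detail)] for an orthonormal Haar transform."""
    approx = [float(v) for v in x]
    points = []
    for level in range(1, levels + 1):
        pairs = len(approx) // 2
        details = [(approx[2 * i] - approx[2 * i + 1]) / math.sqrt(2) for i in range(pairs)]
        approx = [(approx[2 * i] + approx[2 * i + 1]) / math.sqrt(2) for i in range(pairs)]
        power = sum(d * d for d in details) / pairs
        points.append((level, math.log2(power)))  # assumes power > 0
    return points


def fit_slope(points):
    """Ordinary least-squares slope of y against x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in points)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points)
    return sxy / sxx


points = haar_log_power([4, 2, 6, 8], levels=2)
slope = fit_slope(points)
```

For this toy input the two levels have log-power of about 1 and 4, so the fitted slope is about 3; a roughly linear, decreasing trend on real data is what the slide calls a “noise” diagnostic.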

Running time

[figure: time (t) vs. stream size (N)]


Space requirements

Equal total number of model parameters

Conclusion

• No human:
  – Adapt and handle arbitrary periodic components
  – No human intervention/tuning

• Limited resources:
  – Single pass over the data
  – Limited memory (logarithmic)
  – Constant-time update

Outline

• Introduction - motivation

• Problem #1: Streams

• Problem #2: Graphs & Virus propagation
  – Motivation & problem definition
  – Related work
  – Main idea
  – Experiments

• Conclusions

Introduction

[figures: Internet Map [lumeta.com]; Food Web [Martinez ’91]; Protein Interactions [genomebiology.com]; Friendship Network [Moody ’01]]

► Graphs are ubiquitous

Introduction

• What can we do with graph analysis?
  – Immunization
  – Information dissemination
  – Network value of a customer [Domingos+]

[figure: “Needle exchange” networks of drug users [Weeks et al. 2002], with “bridges” marked]


Problem definition

• Q1: How does a virus spread across an arbitrary network?

• Q2: will it create an epidemic?

• (in a sensor setting, with a ‘gossip’ protocol, will a rumor/query spread?)

Framework

• Susceptible-Infected-Susceptible (SIS) model
  – Cured nodes immediately become susceptible

[state diagram: Susceptible/healthy → Infected & infectious (“Infected by neighbor”); Infected & infectious → Susceptible/healthy (“Cured internally”)]

The model

• (virus) Birth rate β : probability that an infected neighbor attacks

• (virus) Death rate δ : probability that an infected node heals

[figure: node N and its neighbors N1, N2, N3, with a legend (Infected / Healthy); two edges labeled “Prob. β” and one label “Prob. δ”]
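The SIS dynamics with rates β and δ can be sketched as a small discrete-time simulation. This is an illustrative reading of the slide, not the authors' simulator: the function names are made up, the graph is an adjacency-list dictionary, and the synchronous update order (attacks and cures both decided from the state at the start of the step) is an assumption.

```python
import random


def sis_step(adj, infected, beta, delta, rng):
    """One synchronous time step of the SIS model on an undirected graph."""
    attacked = set()
    for u in infected:
        for v in adj[u]:
            # Each infected neighbor attacks a healthy node with probability beta.
            if v not in infected and rng.random() < beta:
                attacked.add(v)
    # Each infected node heals with probability delta and becomes susceptible again.
    survivors = {u for u in infected if rng.random() >= delta}
    return survivors | attacked


def simulate(adj, seeds, beta, delta, steps, seed=0):
    """Return the number of infected nodes after each step (index 0 = initial state)."""
    rng = random.Random(seed)
    infected = set(seeds)
    history = [len(infected)]
    for _ in range(steps):
        infected = sis_step(adj, infected, beta, delta, rng)
        history.append(len(infected))
    return history


# Usage: a small star with hub 0 and three spokes.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
always_spreads = simulate(star, [0], beta=1.0, delta=0.0, steps=2)
always_heals = simulate(star, [0], beta=0.0, delta=1.0, steps=2)
```

The two extreme settings are deterministic: with β = 1 and δ = 0 the whole star is infected after one step, and with β = 0 and δ = 1 the infection disappears after one step. Intermediate values give random trajectories that depend on the seed.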

Epidemic threshold

Defined as the value of τ, such that

if β/δ < τ an epidemic cannot happen

Thus,

• given a graph

• compute its epidemic threshold

Epidemic threshold

What should τ depend on?

• avg. degree? and/or highest degree?

• and/or variance of degree?

• and/or determinant of the adjacency matrix?

Basic Homogeneous Model

Homogeneous graphs [Kephart-White ’91, ’93]

• Epidemic threshold = 1/<k>

• Homogeneous connectivity <k>, i.e., all nodes have ~same degree (unrealistic)

Power-law Networks

• Model for Barabási-Albert networks [Pastor-Satorras & Vespignani, ’01, ’02]
  – Epidemic threshold = <k> / <k²>
  – for BA-type networks only, with γ = 3 (γ = exponent, i.e. log-log slope, of the power law)
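Both degree-based formulas quoted so far, 1/<k> and <k>/<k²>, depend only on the degree sequence, so they are easy to evaluate for any graph. The sketch below does that for an adjacency-list dictionary; the function name and the graph representation are assumptions made for the example.

```python
def degree_thresholds(adj):
    """Return (1/<k>, <k>/<k^2>) computed from the degree sequence of `adj`."""
    degrees = [len(neighbors) for neighbors in adj.values()]
    n = len(degrees)
    mean_k = sum(degrees) / n                   # <k>: average degree
    mean_k2 = sum(d * d for d in degrees) / n   # <k^2>: average squared degree
    return 1.0 / mean_k, mean_k / mean_k2


# Usage: a complete graph on 5 nodes (every degree is 4) and a star with 99 spokes.
k5 = {i: [j for j in range(5) if j != i] for i in range(5)}
star100 = {0: list(range(1, 100))}
star100.update({i: [0] for i in range(1, 100)})

homogeneous_k5, power_law_k5 = degree_thresholds(k5)
homogeneous_star, power_law_star = degree_thresholds(star100)
```

On the complete graph the two formulas agree (both give 0.25), as expected when all degrees are equal. On the star they differ widely (about 0.505 vs. 0.02), which shows how sensitive degree-based estimates are to the degree distribution.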

Epidemic threshold

• Homogeneous graphs: 1/<k>

• BA (γ=3): <k> / <k²>

• more complicated graphs ?

• arbitrary, REAL graphs ?

• how many parameters??

Epidemic threshold

• [Theorem] We have no epidemic, if

  β/δ < τ = 1/λ1,A

  where
  – β: attack prob.
  – δ: recovery prob.
  – τ: epidemic threshold
  – λ1,A: largest eigenvalue of adj. matrix A

Proof: [Wang+03]
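The threshold τ = 1/λ1,A can be computed for any undirected graph once the largest eigenvalue of its adjacency matrix is known. The sketch below estimates it with power iteration in plain Python. It is an illustration, not the authors' code: the function names and the iteration count are assumptions, and the iteration runs on A + I (a shift that does not change the eigenvectors) so that it also converges on bipartite graphs such as stars.

```python
import math


def largest_eigenvalue(adj, iters=500):
    """Estimate the largest adjacency eigenvalue of a connected undirected graph."""
    nodes = list(adj)
    v = {u: 1.0 for u in nodes}
    for _ in range(iters):
        # Multiply by (A + I), then renormalise to unit length.
        w = {u: v[u] + sum(v[n] for n in adj[u]) for u in nodes}
        norm = math.sqrt(sum(x * x for x in w.values()))
        v = {u: x / norm for u, x in w.items()}
    # Rayleigh quotient v^T A v for the unit vector v.
    return sum(v[u] * sum(v[n] for n in adj[u]) for u in nodes)


def epidemic_threshold(adj):
    """tau = 1 / lambda_1 for the adjacency matrix of `adj`."""
    return 1.0 / largest_eigenvalue(adj)


# Usage: a star with one hub and 99 spokes, and a complete graph on 5 nodes.
star100 = {0: list(range(1, 100))}
star100.update({i: [0] for i in range(1, 100)})
k5 = {i: [j for j in range(5) if j != i] for i in range(5)}

tau_star = epidemic_threshold(star100)
tau_k5 = epidemic_threshold(k5)
```

For large sparse graphs a library eigensolver would normally replace this loop; the fixed iteration count here is only adequate when the gap between the two largest eigenvalues is not too small.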

Epidemic threshold for various networks

• sanity checks / older results:

• Homogeneous networks
  – λ1,A = <k>; τ = 1/<k>
  – where <k> = average degree
  – This is the same result as that of Kephart & White!

Epidemic threshold for various networks

• sanity checks / older results:

• Star networks
  – λ1,A = sqrt(d); τ = 1/sqrt(d)
  – where d = the degree of the central node
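A short verification of the star-network value, written out as a worked equation. The notation (hub h, candidate vector v) is introduced here for the check and does not come from the slides.

```latex
% Star with hub h and d leaves; A is its adjacency matrix.
% Candidate eigenvector: v_h = \sqrt{d} on the hub, v_{\text{leaf}} = 1 on every leaf.
\begin{align*}
(Av)_h &= \sum_{\text{leaves}} 1 = d = \sqrt{d}\cdot\sqrt{d} = \sqrt{d}\; v_h, \\
(Av)_{\text{leaf}} &= v_h = \sqrt{d} = \sqrt{d}\; v_{\text{leaf}}.
\end{align*}
% Hence Av = \sqrt{d}\, v. All entries of v are positive and the star is connected,
% so by the Perron--Frobenius theorem \sqrt{d} is the largest eigenvalue:
% \lambda_{1,A} = \sqrt{d}, \qquad \tau = 1/\sqrt{d}.
```

For a hub with 99 spokes this gives τ = 1/sqrt(99) ≈ 0.1005.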

Epidemic threshold for various networks

• sanity checks / older results:

• Infinite, power-law networks
  – λ1,A = ∞; τ = 0 : *any* virus has a chance! [Barabasi et al]

• Finite power-law networks
  – τ = 1/λ1,A

Outline

• Introduction - motivation

• Problem #1: Streams

• Problem #2: Graphs & Virus propagation
  – Motivation & problem definition
  – Related work
  – Main idea
  – Experiments

• Conclusions

Experiments

• 2 graphs
  – Star network: one “hub” + 99 “spokes”
  – “Oregon” Internet AS graph:
    • 10,900 nodes, 31,180 edges
    • topology.eecs.umich.edu/data.html

• More in our paper: [SRDS ’03]

Experiments (Star)

[figure: three curves, for β/δ > τ (above threshold), β/δ = τ (at the threshold), and β/δ < τ (below threshold)]

Experiments (Oregon)

[figure: three curves, for β/δ > τ (above threshold), β/δ = τ (at the threshold), and β/δ < τ (below threshold)]

Our prediction vs. previous prediction

• our predictions are more accurate

[figure: two panels, Oregon and Star; number of infected nodes vs. β/δ; markers “PL3” and “Our” in each panel]


Conclusions

We found an epidemic threshold

√ that applies to any network topology

√ and it depends only on one parameter of the graph


Overall conclusions

• Automatic stream mining: AWSOM

• graphs and virus propagation: eigenvalue

Ongoing / related work

• Streams
  – how to find hidden variables on multiple streams [w/ Spiros and Jimeng Sun]
  – ‘network tomography’ [w/ Airoldi +]

• Graphs
  – graph partitioning [w/ Deepay+]
  – important subgraphs [w/ Tomkins + McCurley]
  – graph generators [RMAT, w/ Deepay]

Thank you!

Contact info:

christos @ cs.cmu.edu

spapadim @ cs.cmu.edu

deepay @ cs.cmu.edu

Main References

• Spiros Papadimitriou, Anthony Brockwell and Christos Faloutsos: Adaptive, Hands-Off Stream Mining, VLDB 2003, Berlin, Germany, Sept. 2003.

• [Wang+03] Yang Wang, Deepayan Chakrabarti, Chenxi Wang and Christos Faloutsos: Epidemic Spreading in Real Networks: an Eigenvalue Viewpoint, SRDS 2003, Florence, Italy.


Additional References

• Connection Subgraphs, C. Faloutsos, K. McCurley, A. Tomkins, SIAM-DM 2004 workshop on link analysis

• RMAT: A recursive graph generator, D. Chakrabarti, Y. Zhan, C. Faloutsos, SIAM-DM 2004

• iFilter: Network tomography using particle filters, Edoardo Airoldi, Christos Faloutsos (submitted)