
Post on 14-Apr-2020


DATA SCIENCE FOR NETWORK

LAETITIA GAUVIN

DATA MINING AND RELATIONAL DATA

▹ Big Data not natively in structured format

▸ tweets and blogs are weakly structured pieces of text

▸ images and video are not structured according to their semantic content, which makes search difficult

▹ “The value of data explodes when it can be linked”

▹ “at the end of the 90s a new analytical trend joined data mining and machine learning: the emergence of network science”

Amato, G., Candela, L., Castelli, D., Esuli, A., Falchi, F., Gennaro, C., ... & Pappalardo, L. (2018). How Data Mining and Machine Learning Evolved from Relational Data Base to Data Science. In A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years (pp. 287-306). Springer International Publishing.

INFRASTRUCTURE

DATA

PUBLIC TRANSPORTATION (Ex of multilayer network)

A transportation map is useful but not enough: connections matter

Asgari, F., Sultan, A., Xiong, H., Gauthier, V., & El-Yacoubi, M. A. (2016). CT-Mapper: Mapping sparse multimodal cellular trajectories using a multilayer transportation network. Computer Communications, 95, 69-81.

CO-OCCURRENCE NETWORKS

source: linguistic networks - Monojit Choudhury - Microsoft

WORD CO-OCCURRENCE NETWORK (Ex of weighted network)

Word frequency analysis is useful, but text analysis is enriched by the study of co-occurrences
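A weighted word co-occurrence network can be sketched in a few lines. The snippet below is a minimal illustration, assuming that two words co-occur when they appear in the same sentence and that the edge weight is the number of sentences they share; the toy corpus and the function name `cooccurrence_edges` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(sentences):
    """Map each unordered word pair to the number of sentences containing both words."""
    weights = Counter()
    for sentence in sentences:
        # Distinct words of the sentence, sorted so that each pair has one canonical order
        words = sorted(set(sentence.lower().split()))
        for u, v in combinations(words, 2):
            weights[(u, v)] += 1
    return weights


# Toy corpus (hypothetical sentences, for illustration only)
corpus = [
    "network science studies graphs",
    "data science studies data",
    "network data",
]

edges = cooccurrence_edges(corpus)
for (u, v), w in sorted(edges.items()):
    print(u, v, w)
```

Each key of `edges` is an edge of the network and each value is its weight; a window-based definition of co-occurrence would only change the inner loop.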

OFFLINE INTERACTIONS

F2F INTERACTIONS (RFID DATA, GPS, WIFI, BLUETOOTH)
Ex of undirected network

Individual activities vs interactions
“Interactions are behind the spreading of disease…”

ONLINE INTERACTIONS

Social networks play an important role in information/rumor spreading (e.g. echo chambers)

WEB DATA

DATA: WHERE TO FIND THEM?

➢ SURVEYS
➢ SENSORS (WIFI, bluetooth, GPS…)
➢ API
➢ SCRAPING
➢ ...

ANOMALY DETECTION

slide from http://web.eecs.umich.edu/~dkoutra/courses/F15_598/S

APPLICATIONS

RANKING


COMMUNITY DETECTION

Social network (Zachary’s karate club)

Word association network

Protein interaction network

LINK PREDICTION/MISSING DATA RECOVERY

www.uvm.edu/storylab/2013/02/11/who-will-your-friends-be-next-week-the-link-prediction-problem/

RECOMMENDATION SYSTEMS


NETWORK REPRESENTATION

ADJACENCY MATRIX

[Figure: undirected graph with nodes 1 to 6]

     1  2  3  4  5  6
1    0  1  0  0  0  0
2    1  0  1  0  0  0
3    0  1  0  1  1  0
4    0  0  1  0  0  0
5    0  0  1  0  0  1
6    0  0  0  0  1  0
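The adjacency matrix above can be rebuilt from an edge list. The sketch below assumes the edges read off the matrix (1-2, 2-3, 3-4, 3-5, 5-6) and uses NumPy only; the variable names are illustrative.

```python
import numpy as np

# Undirected edges read off the adjacency matrix of the slide (nodes labelled 1 to 6)
edges = [(1, 2), (2, 3), (3, 4), (3, 5), (5, 6)]
n = 6

A = np.zeros((n, n), dtype=int)
for u, v in edges:
    A[u - 1, v - 1] = 1  # shift labels to 0-based indices
    A[v - 1, u - 1] = 1  # undirected network: the matrix is symmetric

# The degree of each node is the sum of its row
degrees = A.sum(axis=1)

print(A)
print(degrees)
```

For a weighted network the entries would hold the edge weights instead of 1, and for a directed network only the first assignment would be kept.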

TOOLS FOR NETWORK STUDY: A SAMPLE

APPROACHES

▹ Complex systems - network science
▸ Modelling
▸ Network measures (centrality, degree, shortest paths…)

▹ Machine learning:
▸ classification
▸ link prediction
▸ clustering...

PROGRAMMING TOOLS

▹ Network visualization: Gephi, D3, plotly...
▹ Operations on matrices: Matlab - Octave
▹ Python - Jupyter (notebook)
▸ Scikit-learn (classification, regression, clustering…)
▸ Pandas dataframe (plotting distributions)
▸ NetworkX

EXAMPLES

▹ Fraud detection (e.g. with credit card)

▹ Network failure

▹ Malware / spyware detection

ANOMALY DETECTION

OUTLIERS ON CLOUDS VS ANOMALY ON GRAPH

Definition from Akoglu, L. et al.:

Find the nodes and/or edges and/or substructures that are “few and different” or deviate significantly from the patterns observed in the graph.

Approaches:

▹ structure-based patterns
▹ community-based patterns

EX : WEB SPAM (STRUCTURE-BASED)

▹ Collection of web pages from the .uk domain from 2002

Becchetti, L., Castillo, C., Donato, D., Leonardi, S., & Baeza-Yates, R. A. (2006, August). Link-Based Characterization and Detection of Web Spam. In AIRWeb (pp. 1-8).

EX: ANOMALY IN BIPARTITE GRAPHS (COMMUNITY-BASED)

Examples of bipartite networks:

▹ users vs. files in a P2P system
▹ traders vs. stocks in a financial trading system
▹ conferences vs. authors in a scientific publication network

EX: ANOMALY IN BIPARTITE GRAPHS (COMMUNITY-BASED)

Approach:

▹ Scores for nodes based on random walks
▹ Combined with graph partitions

Sun, J., Qu, H., Chakrabarti, D., & Faloutsos, C. (2005, November). Neighborhood formation and anomaly detection in bipartite graphs. In Data Mining, Fifth IEEE International Conference on (pp. 8-pp). IEEE.
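One common way to obtain random-walk scores on a bipartite graph is a random walk with restart from a chosen node. The sketch below is an illustration under stated assumptions, not a reproduction of the cited paper: the biadjacency matrix `B` (users vs. files), the restart probability, and the function name `rwr_scores` are all hypothetical.

```python
import numpy as np


def rwr_scores(B, source_row, restart=0.15, n_iter=200):
    """Random walk with restart started from one row node of a bipartite graph.

    B is the (n_rows x n_cols) biadjacency matrix. The returned vector holds the
    relevance of every node (row nodes first, then column nodes) to the source.
    """
    n_rows, n_cols = B.shape
    n = n_rows + n_cols
    # Symmetric adjacency matrix of the whole bipartite graph
    A = np.zeros((n, n))
    A[:n_rows, n_rows:] = B
    A[n_rows:, :n_rows] = B.T
    # Column-stochastic transition matrix (assumes every node has at least one edge)
    P = A / A.sum(axis=0, keepdims=True)
    e = np.zeros(n)
    e[source_row] = 1.0
    r = e.copy()
    for _ in range(n_iter):
        # Follow an edge with probability 1 - restart, jump back to the source otherwise
        r = (1 - restart) * (P @ r) + restart * e
    return r


# Hypothetical toy data: 4 users (rows) x 4 files (columns)
B = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

scores = rwr_scores(B, source_row=0)
print(scores[:4])  # relevance of each user to user 0
```

Nodes whose neighbours have low relevance to each other are candidates for anomalies; how the scores are combined with graph partitions is left out of this sketch.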

ANOMALY DETECTION: REFERENCES

Akoglu, L., Tong, H., & Koutra, D. (2015). Graph based anomaly detection and description: a survey. Data Mining and Knowledge Discovery, 29(3), 626-688.

Ranshous, S., Shen, S., Koutra, D., Harenberg, S., Faloutsos, C., & Samatova, N. F. (2015). Anomaly detection in dynamic networks: a survey. Wiley Interdisciplinary Reviews: Computational Statistics, 7(3), 223-247.

CLUSTERING : K-MEANS

COMMUNITY DETECTION - CLUSTERING

k-MEANS APPLIED ON A PROTEIN INTERACTION NETWORK

Jamil, K., Jayaraman, A., Rao, R., & Raju, S. (2012). In silico evidence of signaling pathways of notch mediated networks in leukemia. Computational and structural biotechnology journal, 1(2), 1-11.

“The clusters were composed of densely connected protein interactors, mostly sharing either similarity in function or occurrence in the same pathway.”
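For reference, k-means itself can be sketched with NumPy. This is a minimal version of Lloyd's iteration on generic feature vectors; how the cited study turned the protein network into feature vectors is not stated in the slides, so the toy points, the seed, and the function name `kmeans` are assumptions.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate between assigning points and updating centers."""
    rng = np.random.default_rng(seed)
    # Initial centers: k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance of every point to every center, shape (n_points, k)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # New center = mean of the assigned points (an empty cluster keeps its old center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Hypothetical toy data: two well-separated groups of points
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centers = kmeans(X, k=2)
print(labels)
```

For a network, one possible choice of feature vectors is the rows of the adjacency matrix, so that nodes with similar neighbourhoods end up in the same cluster.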


Customer transactions

          Bread  Butter  Beer
Anna        1      1      0
Bob         1      1      1
Charlie     0      1      1

Document-term matrix

          Machine  Matrix  Mining
Book 1       5       0       3
Book 2       0       0       7
Book 3       4       6       5

Incomplete rating matrix (movies: Avatar, The Matrix, Up; each user has rated two of the three)

Alice: 4, 2
Bob: 3, 2
Charlie: 5, 3

Cities and monthly temperatures

              Jan    Jun    Sep
Saarbrücken    1     11     10
Helsinki       6.5   10.9   8.7
Cape Town     15.7   7.8    8.7

Many different kinds of data fit this object-attribute viewpoint.

NETWORK AS MATRICES : LINEAR ALGEBRA

SINGULAR VALUE DECOMPOSITION

Factorization of a rectangular matrix into a product of 3 matrices:

$A = U D V^{\top}$

- $A$ ($n \times d$): rectangular matrix to decompose
- $U$ ($n \times n$): columns orthonormal
- $D$ ($n \times d$): diagonal with non-negative real entries
- $V$ ($d \times d$): columns orthonormal

PROPERTIES OF SVD

Terminology, for $A = U D V^{\top}$:

- columns of $U$ are called left-singular vectors
- columns of $V$ are called right-singular vectors

Properties:

- SVD is defined for all matrices
- Eigenvalue decomposition requires a square matrix, but $AA^{\top}$ and $A^{\top}A$ are always square
- The left-singular vectors of $A$ are eigenvectors of $AA^{\top}$
- The right-singular vectors of $A$ are eigenvectors of $A^{\top}A$
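The two eigenvector properties can be checked numerically. The sketch below uses a random matrix (an arbitrary choice made here) and `numpy.linalg.svd`; the associated eigenvalues are the squared singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))  # arbitrary rectangular matrix

# Reduced SVD: U is 5x3, s holds the 3 singular values, Vt is 3x3
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

# Left-singular vectors: (A A^T) u_i = s_i^2 u_i
left_ok = np.allclose(A @ A.T @ U, U * s**2)

# Right-singular vectors: (A^T A) v_i = s_i^2 v_i
right_ok = np.allclose(A.T @ A @ V, V * s**2)

print(left_ok, right_ok)
```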

STRUCTURE DETECTION - CLUSTERING

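INSIGHTS INTO SVD

Example: 1-dimensional subspace
Rows of an n × 2 matrix as n points in a 2-dimensional space

[U,D,V] = svd(A);       % SVD of the n x 2 data matrix A
D2 = D; D2(2,2) = 0;    % assumption: D2 is D with the second singular value set to zero
B = U*D2*V';            % rank-1 matrix: the points projected onto the first right-singular direction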

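$$
A = \begin{pmatrix} 1 & 0 & 0 & 0 & 2 \\ 0 & 0 & 3 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 4 & 0 & 0 & 0 \end{pmatrix} = U D V^{\top}
$$

$$
U = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & -1 \\ 1 & 0 & 0 & 0 \end{pmatrix}, \quad
D = \begin{pmatrix} 4 & 0 & 0 & 0 & 0 \\ 0 & 3 & 0 & 0 & 0 \\ 0 & 0 & \sqrt{5} & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}, \quad
V^{\top} = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ \sqrt{0.2} & 0 & 0 & 0 & \sqrt{0.8} \\ 0 & 0 & 0 & 1 & 0 \\ -\sqrt{0.8} & 0 & 0 & 0 & \sqrt{0.2} \end{pmatrix}
$$

$U$ = orthogonal matrix

$V^{\top}$ = orthogonal matrix

$D$: diagonal elements = singular values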
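The factorization of this example can be verified with NumPy. The matrices below are typed in from the slide (named `A`, `U`, `D`, `Vt` here for convenience); the check confirms that the product gives back the original matrix and that both outer factors are orthogonal.

```python
import numpy as np

A = np.array([
    [1, 0, 0, 0, 2],
    [0, 0, 3, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 4, 0, 0, 0],
], dtype=float)

U = np.array([
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, -1],
    [1, 0, 0, 0],
], dtype=float)

# 4 x 5 diagonal matrix holding the singular values 4, 3, sqrt(5), 0
D = np.zeros((4, 5))
D[0, 0], D[1, 1], D[2, 2] = 4.0, 3.0, np.sqrt(5)

Vt = np.array([
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [np.sqrt(0.2), 0, 0, 0, np.sqrt(0.8)],
    [0, 0, 0, 1, 0],
    [-np.sqrt(0.8), 0, 0, 0, np.sqrt(0.2)],
])

print(np.allclose(U @ D @ Vt, A))            # the product reconstructs A
print(np.allclose(U @ U.T, np.eye(4)))       # U is orthogonal
print(np.allclose(Vt @ Vt.T, np.eye(5)))     # V^T is orthogonal
print(np.linalg.svd(A, compute_uv=False))    # singular values computed by NumPy
```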


TRUNCATED SVD

The truncated (or thin) SVD only takes the first k columns of U and V and the leading k × k submatrix of the diagonal matrix

The Eckart–Young theorem

Let A_k be the rank-k truncated SVD of A. Then A_k is the closest rank-k matrix to A in the Frobenius norm.
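A rank-k truncation is easy to sketch with NumPy. The example below uses a random matrix (an arbitrary choice) and checks a known consequence of the Eckart–Young theorem: the Frobenius error of the truncation equals the square root of the sum of the squared discarded singular values. The function name `truncated_svd` is illustrative.

```python
import numpy as np


def truncated_svd(A, k):
    """Keep the first k singular triplets of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))
k = 2

Uk, sk, Vtk = truncated_svd(A, k)
A_k = Uk @ np.diag(sk) @ Vtk  # rank-k approximation of A

# Frobenius error of the approximation vs. the discarded singular values
s_all = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - A_k, "fro")
expected = np.sqrt((s_all[k:] ** 2).sum())

print(err, expected)
```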


- Strength of each component through the diagonal matrix

- Composition of each component in terms of attributes and objects through the unitary matrices

- The first layer explains the most

- The 2nd corrects that by adding and removing smaller values

- The 3rd corrects that by adding and removing even smaller values...

Most data mining applications do not use full SVD, but truncated SVD

- To concentrate on “the most important parts”

But how to select the rank k of the truncated SVD?

- What is important, what is unimportant?

- What is structure, what is noise?

Guttman–Kaiser criterion

Select k so that for all i > k, σ_i < 1

Motivation: all components with singular value less than one are uninteresting

Another common method is to select enough singular values such that the sum of their squares is at least 90% of the total sum of the squared singular values

Motivation: The resulting matrix “explains” 90% of the squared Frobenius norm of the matrix (a.k.a. energy)
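Both selection rules reduce to a few lines once the singular values are known. The sketch below assumes the singular values are given in decreasing order; the function names are illustrative, and the example values 12.4, 9.5 and 1.3 are the ones used later in these slides.

```python
import numpy as np


def rank_guttman_kaiser(s):
    """Number of singular values that are at least 1 (s sorted in decreasing order)."""
    return int(np.sum(np.asarray(s, dtype=float) >= 1.0))


def rank_energy(s, fraction=0.9):
    """Smallest k whose first k squared singular values reach `fraction` of the total."""
    energy = np.cumsum(np.asarray(s, dtype=float) ** 2)
    # First position where the cumulative energy reaches the target, converted to a count
    return int(np.searchsorted(energy, fraction * energy[-1]) + 1)


s = [12.4, 9.5, 1.3]
print(rank_guttman_kaiser(s))  # all three values exceed 1
print(rank_energy(s, 0.9))     # two values already carry more than 90% of the energy
```

The two rules can disagree, as here, which is one reason why the choice of the rank remains a judgment call.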


Cattell's Scree test

● The scree plot plots the singular values in decreasing order
● The plot looks like the side of a hill, hence the name
● The scree test is a subjective decision on the rank based on the shape of the scree plot
● The rank should be set to a point where there is a clear drop in the magnitudes of the singular values, or where the singular values start to even out

Entropy-based method

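Consider the relative contribution of each singular value to the overall Frobenius norm

Relative contribution: $r_k = \sigma_k^2 / \sum_i \sigma_i^2$

Entropy, normalized by the logarithm of the number $N$ of singular values: $E = -\frac{1}{\log N} \sum_k r_k \log r_k$

Low entropy (close to 0): the first singular value has almost all mass
High entropy (close to 1): the singular values are almost equal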
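The two regimes can be illustrated numerically. The sketch below assumes the entropy is computed on the relative squared singular values and normalized by the logarithm of their number, so that it lies between 0 and 1; the function name `singular_value_entropy` is hypothetical.

```python
import numpy as np


def singular_value_entropy(s):
    """Normalized entropy of the relative contributions of the squared singular values."""
    s = np.asarray(s, dtype=float)
    r = s**2 / np.sum(s**2)       # relative contribution of each singular value
    nonzero = r[r > 0]            # convention: 0 * log(0) = 0
    return float(-np.sum(nonzero * np.log(nonzero)) / np.log(len(s)))


print(singular_value_entropy([2.0, 2.0, 2.0, 2.0]))   # equal values: maximal entropy
print(singular_value_entropy([10.0, 1e-6, 1e-6]))     # one dominant value: entropy near 0
```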


A very common application of SVD is to remove the noise from the data
This works simply by taking the truncated SVD of the (normalized) data

Original data: looks 1-dimensional with some noise

The right singular vectors show the directions

The first looks like the data direction

The second looks like the noise direction

Latent semantic analysis (LSA) is an information retrieval method that uses SVD

The data: a term–document matrix A

- the values are (weighted) term frequencies
- Pre-processing: typically tf/idf values (the frequency of the term in the document divided by the global frequency of the term)

● Matrix $U_k$ associates documents to topics
● Matrix $V_k$ associates topics to terms

If two rows of $U_k$ are similar, the corresponding documents “talk about the same things”

A query q can be answered by considering its term vector q
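The LSA pipeline can be sketched on a tiny example. Everything below is an assumption made for illustration: the vocabulary, the raw counts (no tf/idf weighting, for brevity), the convention that rows are documents and columns are terms, and the use of cosine similarity in topic space to answer the query.

```python
import numpy as np

terms = ["network", "graph", "node", "movie", "actor"]

# Hypothetical counts: rows = documents, columns = terms
A = np.array([
    [2, 1, 1, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 0, 2, 1],
    [0, 0, 0, 1, 2],
], dtype=float)

k = 2  # number of topics kept
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

# Documents in topic space (equal to A @ V_k)
doc_topics = U_k * s_k

# Query containing the terms "network" and "graph", mapped into the same space
q = np.array([1, 1, 0, 0, 0], dtype=float)
q_topics = q @ V_k


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


scores = [cosine(q_topics, d) for d in doc_topics]
print(np.round(scores, 3))  # the first two documents match the query, the last two do not
```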


rank-2 matrix representing ratings of movies by users

two “concepts” underlying the movies:

- science-fiction

- romance

RECOMMENDATION SYSTEMS - CLUSTERING

Hidden concepts emerge:

- $U$ connects people to concepts
- $V$ relates movies to concepts
- the diagonal matrix gives the strength of each of the concepts

In general, the concepts will not be so clearly delineated


One possible rule: the sum of the squares of the retained singular values should be at least 90% of the total sum

(12.4)² + (9.5)² + (1.3)² = 245.70
(12.4)² + (9.5)² = 244.01 (99% of the energy)
(12.4)² / 245.70 is about 63%

Truncated matrix close to the original


Projection in the space of concepts

Another sort of query we can perform in concept space is to find users similar to the new user. We can use V to map all users into concept space:

Joe maps to [1.74, 0]
Jill maps to [0, 5.68]

New user in the system → need for recommendation
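The mapping into concept space can be sketched as follows. The rating matrix is an illustrative rank-2 example chosen here (first three movies science-fiction, last two romance), not data taken from the slides, and the new user's ratings are hypothetical; the signs of the concept coordinates depend on the SVD routine, so only magnitudes and similarities are meaningful.

```python
import numpy as np

# Illustrative ratings: rows = users, columns = movies (3 sci-fi, then 2 romance)
M = np.array([
    [1, 1, 1, 0, 0],
    [3, 3, 3, 0, 0],
    [4, 4, 4, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 4, 4],
    [0, 0, 0, 5, 5],
    [0, 0, 0, 2, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
V_k = Vt[:k, :].T  # movies x concepts


def to_concept_space(ratings):
    """Map a vector of movie ratings to concept coordinates."""
    return np.asarray(ratings, dtype=float) @ V_k


# Hypothetical new user who has only rated the first movie
new_user = [4, 0, 0, 0, 0]
new_c = to_concept_space(new_user)

# Cosine similarity between the new user and every existing user in concept space
users_c = M @ V_k
sims = users_c @ new_c / (np.linalg.norm(users_c, axis=1) * np.linalg.norm(new_c))

print(np.round(np.abs(to_concept_space(M[0])), 2))  # first user: only the sci-fi concept
print(np.round(sims, 2))
```

The users most similar to the new user in concept space are the ones whose ratings can be used to recommend movies the new user has not rated yet.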

RECOMMENDATION SYSTEMS - MISSING DATA RECOVERY

REFERENCES

David Skillicorn, Understanding Complex Datasets: Data Mining with Matrix Decompositions (Chapter 8) Chapman and Hall, 2007.

Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin. 2012. Learning from Data. AMLBook.

BACKGROUND

▹ PageRank published in April 1998 at WWW by Sergey Brin and Larry Page

▹ Early search engines used content-based ranking algorithms: no relationship between the pages

networks

RANKING

▹ Measure of the importance of web pages based on hyperlinks

▹ In-links & out-links

▹ PageRank attributes a score to a web page depending on the other web pages

▹ A link to a page raises the score of that page

ALGORITHM

▹ Main ideas:

A hyperlink to a page is a conveyance of authority. The more in-links a page has, the more prestige it has.

Pages that point to another page have their own prestige. This gives different weights to their out-links.

PAGERANK ALGORITHM

▹ Network representation

nodes: pages

edges: hyperlinks

PAGERANK ALGORITHM

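▹ $n$: total number of web pages, giving $n$ linear equations (one per page)

▹ $A$: adjacency matrix with weights equal to 1 over the number of out-links, $A_{ij} = 1/O_i$ if page $i$ links to page $j$ ($O_i$ = number of out-links of page $i$) and 0 otherwise

▹ $P$: set of PageRanks as an $n$-dimensional vector, so that the system reads $P = A^{\top} P$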

PAGERANK ALGORITHM

▹ Each page as a state of a Markov chain
▹ Hyperlinks seen as transitions between states

~ web surfing as a stochastic process

The adjacency matrix is not a stochastic matrix (not all rows sum to 1)

Some pages do not point to any page, so their rows contain only zeros

PAGERANK ALGORITHM

▹ To cope with rows of 0, we can replace them by rows of 1/n: we assume a uniform distribution of the out-links of such pages over all pages

▹ The adjacency matrix is not irreducible: there is not a non-zero probability of going from any page to any other page

→ add a link from each page to every other page, followed with a small transition probability 1 - d, where d is the damping factor

PAGERANK ALGORITHM

Damping factor: probability that the user will continue clicking

POWER ITERATION METHOD

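▹ Initialization: $P_0 = e/n$, with $e$ the all-ones vector
▹ At each time step: $P_{k+1} = (1-d)\, e/n + d\, A^{\top} P_k$, with $A$ the row-stochastic transition matrix and $d$ the damping factor
▹ Repeat until convergence: $\lVert P_{k+1} - P_k \rVert_1 < \epsilon$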
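The power iteration can be sketched with NumPy. This is a minimal version under stated assumptions: pages without out-links get a uniform row 1/n, the damping factor defaults to 0.85 (a commonly used value, not stated in the slides), and the 4-page link matrix is hypothetical.

```python
import numpy as np


def pagerank(adj, d=0.85, tol=1e-10, max_iter=1000):
    """PageRank by power iteration; adj[i, j] = 1 if page i links to page j."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    out = adj.sum(axis=1)  # number of out-links of each page

    # Row-stochastic transition matrix; pages without out-links get a uniform row 1/n
    A = np.full((n, n), 1.0 / n)
    has_links = out > 0
    A[has_links] = adj[has_links] / out[has_links][:, None]

    p = np.full(n, 1.0 / n)  # initialization: uniform scores
    for _ in range(max_iter):
        p_new = (1 - d) / n + d * (A.T @ p)
        if np.abs(p_new - p).sum() < tol:  # L1 convergence test
            return p_new
        p = p_new
    return p


# Hypothetical 4-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0; page 3 has no links at all
adj = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
])

p = pagerank(adj)
print(np.round(p, 3))
```

In this toy web, page 2 collects links from pages 0 and 1 and gets the highest score, while the isolated page 3 only receives the uniform share.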


▹ Advantages: robust to spam + global measure

▹ Drawbacks: favors old pages / possibility of buying a link on pages with high PageRank

▹ Applications to other cases : for instance, academic papers

Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: bringing order to the web

http://www.slideshare.net/maimustafa566/page-rank-algorithm-33212250


Urban transportation network properties

▹ spatially embedded
▹ multimodal
▹ time-resolved

APPLICATION ON A MULTILAYER NETWORK

Urban transportation data

General Transit Feed Specification: geospatial information & schedule information

Transportation network representation


Public transportation vs car

Choice criteria:
1. total travel time
2. variability in the total travel time
3. number of transfers

Uncovering fast connections

Choice of a typical day: focus on commuting hours

Multi-edge P-space representation:
1. Weights: time spent in the transportation mode + waiting time
2. Penalties: transfer times

▹ Adaptation of Dijkstra’s algorithm
▹ Computation of the shortest path in time for any origin-destination pair
▹ Number of transfers limited

Uncovering efficient transportation connections

screenshot taken from Offi - Journey Planner

DIJKSTRA’S ALGORITHM

[Figure: illustration of Dijkstra’s algorithm, panels 1 to 4]

DIJKSTRA’S ALGORITHM

[Figure: illustration of Dijkstra’s algorithm, panels 5 to 9]
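For reference, the standard version of Dijkstra's algorithm can be sketched with a priority queue. This is the textbook algorithm, not the adaptation with transfer limits used in the study; the stop names and travel times in the toy graph are hypothetical.

```python
import heapq


def dijkstra(graph, source):
    """Shortest travel time from source to every reachable node.

    graph maps each node to a list of (neighbour, non-negative weight) pairs.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # outdated queue entry: a shorter path to u was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Hypothetical stops with travel times in minutes
graph = {
    "A": [("B", 4), ("C", 2)],
    "B": [("D", 5)],
    "C": [("B", 1), ("D", 8)],
    "D": [],
}

dist = dijkstra(graph, "A")
print(dist)
```

Limiting the number of transfers would require tracking, for each stop, the line currently used and the number of transfers made so far as part of the search state.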


Shortest time paths

For each (origin, destination) pair: commuting time vs geographical distance

Car commuting times

Extracted from the French national survey of transport and mobility 2007-2008

- distance travelled (1 km resolution)
- transportation mode used & trip duration (1 min resolution)

Typical time needed to commute a particular distance by car: median of the distribution of times over the entire sample

Travel time factors

For each distance:

travel time factor = (public transportation commuting time) / (car commuting time)

Alessandretti, L., Karsai, M., & Gauvin, L. (2016). User-based representation of time-resolved multimodal public transportation networks. Royal Society Open Science, 3(7), 160156.

DETECTION OF MESOSCALE STRUCTURES

R mesoscale structures: link participation, temporal activity

Kolda, T. G., & Bader, B. W. (2009). Tensor decompositions and applications. SIAM review, 51(3), 455-500.

APPLICATION ON A TEMPORAL NETWORK

MATRICIZATION

A tensor of size I × J × K can be unfolded into a matrix of size I × JK, J × IK, or K × IJ
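The three unfoldings can be sketched with NumPy. The ordering of the columns below follows one common convention (remaining indices vary with the earliest one fastest, as in the Kolda and Bader review); other conventions only permute the columns. The function name `unfold` and the small example tensor are illustrative.

```python
import numpy as np


def unfold(X, mode):
    """Mode-`mode` matricization: rows indexed by the chosen mode, columns by the others."""
    # Bring the chosen mode to the front, then flatten the remaining modes
    # in column-major order so that the earliest remaining index varies fastest
    return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order="F")


# 3 x 4 x 2 tensor filled with 1..24, first index varying fastest
X = np.arange(1, 25).reshape((3, 4, 2), order="F")

X1 = unfold(X, 0)  # I x JK = 3 x 8
X2 = unfold(X, 1)  # J x IK = 4 x 6
X3 = unfold(X, 2)  # K x IJ = 2 x 12

print(X1.shape, X2.shape, X3.shape)
print(X1[0])
```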


FACTORIZATION OUTPUT

● membership of nodes to the components

● membership of links to the components

● temporal activity of the components

[Figure: factor matrices A, B, C]

Bro, R., & Kiers, H. A. (2003). A new efficient method for determining the number of components in PARAFAC models. Journal of chemometrics

ESTIMATION OF THE NUMBER OF COMPONENTS

▹ Core consistency : based on the comparison of the core with Tucker decomposition

▹ Cophenetic coefficient : based on consensus matrices

Brunet, J. P., Tamayo, P., Golub, T. R., & Mesirov, J. P. (2004). Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the national academy of sciences, 101(12), 4164-4169.


SocioPatterns.org

Gauvin, L., et al. Detecting the community structure and activity patterns of temporal networks: a non-negative tensor factorization approach. PloS one, 2014

Gauvin, L., Panisson, A., Barrat, A., & Cattuto, C. (2015). Revealing latent factors of temporal networks for mesoscale intervention in epidemic spread. arXiv preprint arXiv:1501.02758.

PATTERN DETECTION


APPLICATION (2)

▹ 709 students
▹ 65 teachers
▹ 30 classes
▹ 10 days
▹ 5 min resolution

ANOMALY DETECTION


ANOMALY DETECTION

Sapienza, A., et al. "Anomaly Detection in Temporal Graph Data: An Iterative Tensor Decomposition and Masking Approach." AALTD@ PKDD/ECML. 2015.

http://www.datainterfaces.org/2013/06/twitter-topic-explorer/

OUTLINE

▹ Types of data:
▸ social interactions
▸ infrastructure (public transportation)
▸ text

▹ Types of applications:
▸ missing data recovery
▸ clustering
▸ anomaly detection

▹ Tools
▸ Programming: Python, R, Matlab…
▸ Machine learning: supervised, unsupervised (e.g. for classification), neural network...

▹ Topic not discussed but important in data science
▸ Ethics and Privacy issues

PYTHON NOTEBOOK

Ex: Airport network, Karate club

Analysis: degree, clustering coefficient, community, PageRank

Tools: pandas, NetworkX

REFERENCES

Network Science
ALBERT-LÁSZLÓ BARABÁSI

Networks: An Introduction
MARK NEWMAN

Mining of massive datasets, chapter “Mining Social-Network Graphs”
JURE LESKOVEC, ANAND RAJARAMAN, JEFFREY D. ULLMAN