
DATA SCIENCE FOR NETWORK...David Skillicorn, Understanding Complex Datasets: Data Mining with Matrix...

Date post: 14-Apr-2020
DATA SCIENCE FOR NETWORK LAETITIA GAUVIN
Page 1: DATA SCIENCE FOR NETWORK...David Skillicorn, Understanding Complex Datasets: Data Mining with Matrix Decompositions (Chapter 8) Chapman and Hall, 2007. Yaser S. Abu-Mostafa, Malik

DATA SCIENCE FOR NETWORK

LAETITIA GAUVIN


DATA MINING AND RELATIONAL DATA

▹ Big Data not natively in structured format

▸ tweets and blogs are weakly structured pieces of text

▸ images and video are not structured according to their semantic content to enable search

▹ “The value of data explodes when it can be linked”

▹ “at the end of the 90s a new analytical trend joined data mining and machine learning: the emergence of network science”

Amato, G., Candela, L., Castelli, D., Esuli, A., Falchi, F., Gennaro, C., ... & Pappalardo, L. (2018). How Data Mining and Machine Learning Evolved from Relational Data Base to Data Science. In A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years (pp. 287-306). Springer International Publishing.


INFRASTRUCTURE

DATA

PUBLIC TRANSPORTATION (Ex of multilayer network)

Transportation map useful but not enough, connections matter

Asgari, F., Sultan, A., Xiong, H., Gauthier, V., & El-Yacoubi, M. A. (2016). CT-Mapper: Mapping sparse multimodal cellular trajectories using a multilayer transportation network. Computer Communications, 95, 69-81.


CO-OCCURRENCE NETWORKS

source: linguistic networks - Monojit Choudhury - Microsoft

WORD CO-OCCURRENCE NETWORK (Ex of weighted network)

DATA

Word frequency analysis is useful, but text analysis is enriched by co-occurrence study


OFFLINE INTERACTIONS

DATA

F2F INTERACTIONS (RFID DATA, GPS, WIFI, BLUETOOTH)

Ex of undirected network

Individual activities vs interactions

“Interactions are behind the spreading of disease…”


ONLINE INTERACTIONS

DATA

Social networks play an important role in information/rumor spreading (e.g. echo chambers)



DATA

WEB DATA


DATA: where to find them?

➢ SURVEYS
➢ SENSORS (WIFI, bluetooth, GPS…)
➢ API
➢ SCRAPING
➢ ...


ANOMALY DETECTION

slide from http://web.eecs.umich.edu/~dkoutra/courses/F15_598/S

APPLICATIONS

RANKING


COMMUNITY DETECTION

Social network (Zachary’s karate club)

Word association network

Protein interaction network

LINK PREDICTION/MISSING DATA RECOVERY

www.uvm.edu/storylab/2013/02/11/who-will-your-friends-be-next-week-the-link-prediction-problem/

RECOMMENDATION SYSTEMS


NETWORK REPRESENTATION

ADJACENCY MATRIX

[Figure: undirected graph with nodes 1 to 6]

    1 2 3 4 5 6
1   0 1 0 0 0 0
2   1 0 1 0 0 0
3   0 1 0 1 1 0
4   0 0 1 0 0 0
5   0 0 1 0 0 1
6   0 0 0 0 1 0
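The adjacency matrix above can be rebuilt from a list of edges. The sketch below is a minimal plain-Python illustration; the edge list is read off the matrix (an entry 1 in row i, column j means nodes i and j are linked), and the helper name `adjacency_matrix` is hypothetical.

```python
# Edge list read off the matrix on the slide (undirected graph, nodes 1..6).
edges = [(1, 2), (2, 3), (3, 4), (3, 5), (5, 6)]
n = 6


def adjacency_matrix(edges, n):
    """Build a symmetric 0/1 adjacency matrix from an undirected edge list."""
    a = [[0] * n for _ in range(n)]
    for u, v in edges:
        a[u - 1][v - 1] = 1  # nodes are numbered from 1, lists from 0
        a[v - 1][u - 1] = 1  # undirected: mirror the entry
    return a


A = adjacency_matrix(edges, n)
degrees = [sum(row) for row in A]  # row sums give the node degrees

for row in A:
    print(" ".join(str(x) for x in row))
print("degrees:", degrees)
```

Row sums of the matrix give node degrees, one of the simplest network measures.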

TOOLS FOR NETWORK STUDY: A SAMPLE

APPROACHES

▹ Complex systems - network science
▸ Modelling
▸ Network measures (centrality, degree, shortest paths…)
▹ Machine learning:
▸ classification
▸ link prediction
▸ clustering...


PROGRAMMING TOOLS

▹ Network visualization: Gephi, D3, plotly...
▹ Operations on matrices: Matlab - Octave
▹ Python - Jupyter (notebook)
▸ Scikit-learn (classification, regression, clustering…)
▸ Pandas dataframe (plotting distributions)
▸ NetworkX


EXAMPLES

▹ Fraud detection (e.g. with credit card)

▹ Network failure

▹ Malware / spyware detection

ANOMALY DETECTION


OUTLIERS ON CLOUDS VS ANOMALY ON GRAPH


Definition from Akoglu, L. et al.:

Find the nodes and/or edges and/or substructures that are “few and different” or deviate significantly from the patterns observed in the graph.

Approaches:

▹ structure-based patterns
▹ community-based patterns

EX : WEB SPAM (STRUCTURE-BASED)

▹ Collection of web pages from the .uk domain from 2002

Becchetti, L., Castillo, C., Donato, D., Leonardi, S., & Baeza-Yates, R. A. (2006, August). Link-Based Characterization and Detection of Web Spam. In AIRWeb (pp. 1-8).

EX: ANOMALY IN BIPARTITE GRAPHS (COMMUNITY-BASED)

Ex of bipartite networks:

▹ users vs. files in a P2P system
▹ traders vs. stocks in a financial trading system
▹ conferences vs. authors in a scientific publication network

EX: ANOMALY IN BIPARTITE GRAPHS (COMMUNITY-BASED)

Approach:

▹ Scores for nodes based on random walks
▹ Combined with graph partitions

Sun, J., Qu, H., Chakrabarti, D., & Faloutsos, C. (2005, November). Neighborhood formation and anomaly detection in bipartite graphs. In Data Mining, Fifth IEEE International Conference on (pp. 8-pp). IEEE.

ANOMALY DETECTION: REFERENCES

Akoglu, L., Tong, H., & Koutra, D. (2015). Graph based anomaly detection and description: a survey. Data Mining and Knowledge Discovery, 29(3), 626-688.

Ranshous, S., Shen, S., Koutra, D., Harenberg, S., Faloutsos, C., & Samatova, N. F. (2015). Anomaly detection in dynamic networks: a survey. Wiley Interdisciplinary Reviews: Computational Statistics, 7(3), 223-247.


CLUSTERING : K-MEANS
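As a reminder of what k-means does, here is a minimal sketch of Lloyd's algorithm in plain Python. The toy points are hypothetical, and taking the first k points as initial centroids is a simplification chosen to keep the run deterministic; practical implementations (for instance the one in Scikit-learn) use more careful initialization.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Hypothetical 2-D points forming two well-separated groups.
points = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.1, 0.3), (4.8, 5.2)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

To cluster a network this way, each node first needs a vector representation, for example its row in the adjacency matrix.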

COMMUNITY DETECTION - CLUSTERING

k-MEANS APPLIED ON A PROTEIN INTERACTION NETWORK

Jamil, K., Jayaraman, A., Rao, R., & Raju, S. (2012). In silico evidence of signaling pathways of notch mediated networks in leukemia. Computational and structural biotechnology journal, 1(2), 1-11.

“The clusters were composed of densely connected protein interactors, mostly sharing either similarity in function or occurrence in the same pathway.”


          Bread  Butter  Beer
Anna      1      1       0
Bob       1      1       1
Charlie   0      1       1

Customer transactions

          Machine  Matrix  Mining
Book 1    5        0       3
Book 2    0        0       7
Book 3    4        6       5

Document-term matrix

Columns: Avatar, The Matrix, Up (one rating missing in each row)
Alice: 4, 2
Bob: 3, 2
Charlie: 5, 3

Incomplete rating matrix

              Jan    Jun    Sep
Saarbrücken   1      11     10
Helsinki      6.5    10.9   8.7
Cape Town     15.7   7.8    8.7

Cities and monthly temperatures

Many different kinds of data fit this object-attribute viewpoint.

NETWORK AS MATRICES : LINEAR ALGEBRA


SINGULAR VALUE DECOMPOSITION

Factorization into a product of 3 matrices: $A = U D V^T$

- $A$ ($n \times d$): rectangular matrix to decompose
- $U$ ($n \times n$): columns orthonormal
- $D$ ($n \times d$): diagonal with non-negative real entries
- $V$ ($d \times d$): columns orthonormal

PROPERTIES OF SVD

SVD is defined for all matrices. Eigenvalue decomposition requires a square matrix, but SVD does not.

Terminology:

- columns of $V$ are called right-singular vectors
- columns of $U$ are called left-singular vectors

Properties:

- The left-singular vectors of $A$ are eigenvectors of $A A^T$
- The right-singular vectors of $A$ are eigenvectors of $A^T A$

STRUCTURE DETECTION - CLUSTERING

INSIGHTS INTO SVD

Example: 1-dimensional subspace

Rows of an n × 2 matrix as n points in a 2-dimensional space

[U,D,V] = svd(A);      % SVD of the n x 2 data matrix
D2 = D; D2(2,2) = 0;   % assumption: D2 is D with the second singular value set to zero
B = U*D2*V';           % rank-1 reconstruction: the points projected onto a line

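$$
A = \begin{pmatrix} 1 & 0 & 0 & 0 & 2 \\ 0 & 0 & 3 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 4 & 0 & 0 & 0 \end{pmatrix} = U D V^T
$$

$$
U = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & -1 \\ 1 & 0 & 0 & 0 \end{pmatrix}
\quad \text{(orthogonal matrix)}
$$

$$
D = \begin{pmatrix} 4 & 0 & 0 & 0 & 0 \\ 0 & 3 & 0 & 0 & 0 \\ 0 & 0 & \sqrt{5} & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}
\quad \text{(diagonal elements = singular values)}
$$

$$
V^T = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ \sqrt{0.2} & 0 & 0 & 0 & \sqrt{0.8} \\ 0 & 0 & 0 & 1 & 0 \\ -\sqrt{0.8} & 0 & 0 & 0 & \sqrt{0.2} \end{pmatrix}
\quad \text{(orthogonal matrix)}
$$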
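The factorization in this example can be checked numerically. The sketch below multiplies the three factors back together in plain Python and compares the product with the original matrix; the names `A`, `U`, `D`, `Vt` and the helper `matmul` are chosen here for illustration and are not taken from the slide.

```python
import math


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


A = [[1, 0, 0, 0, 2],
     [0, 0, 3, 0, 0],
     [0, 0, 0, 0, 0],
     [0, 4, 0, 0, 0]]
U = [[0, 0, 1, 0],
     [0, 1, 0, 0],
     [0, 0, 0, -1],
     [1, 0, 0, 0]]
D = [[4, 0, 0, 0, 0],
     [0, 3, 0, 0, 0],
     [0, 0, math.sqrt(5), 0, 0],
     [0, 0, 0, 0, 0]]
Vt = [[0, 1, 0, 0, 0],
      [0, 0, 1, 0, 0],
      [math.sqrt(0.2), 0, 0, 0, math.sqrt(0.8)],
      [0, 0, 0, 1, 0],
      [-math.sqrt(0.8), 0, 0, 0, math.sqrt(0.2)]]

# Reconstruct A from its factors and measure the largest entrywise error.
product = matmul(matmul(U, D), Vt)
max_error = max(abs(p - a) for pr, ar in zip(product, A) for p, a in zip(pr, ar))
print("max |U D V^T - A| =", max_error)

# The rows of V^T are orthonormal, so V^T times its transpose is the identity.
gram = matmul(Vt, [list(col) for col in zip(*Vt)])
```

The reconstruction error is at the level of floating-point rounding, which confirms that the three matrices are a valid SVD of the example.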


TRUNCATED SVD

The truncated (or thin) SVD keeps only the first k columns of U and V and the leading k × k submatrix of the diagonal matrix

The Eckart–Young theorem

Let A_k be the rank-k truncated SVD of A. Then A_k is the closest rank-k matrix to A in the Frobenius norm.
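Written as a formula, the Eckart–Young theorem reads as follows. The notation is an assumption added here: the singular values are sorted so that σ_1 ≥ σ_2 ≥ ..., and u_i, v_i denote the i-th columns of U and V.

```latex
A_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^T,
\qquad
\|A - A_k\|_F
  = \min_{\operatorname{rank}(B) \le k} \|A - B\|_F
  = \sqrt{\sum_{i > k} \sigma_i^2}
```

The error of the best rank-k approximation is therefore determined entirely by the discarded singular values.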


- Strength of each component through the diagonal matrix

- Composition of each component in terms of attributes and objects through the unitary matrices

- The first layer explains the most

- The 2nd corrects that by adding and removing smaller values

- The 3rd corrects that by adding and removing even smaller values...


Most data mining applications do not use full SVD, but truncated SVD

-To concentrate on “the most important parts”

But how to select the rank k of the truncated SVD?

-What is important, what is unimportant?

-What is structure, what is noise?


Guttman–Kaiser criterion

Select k so that for all i > k, σi < 1

Motivation: all components with singular value less than one are uninteresting

Another common method is to select enough singular values such that the sum of their squares is 90% of the total sum of the squared singular values

Motivation: The resulting matrix “explains” 90% of the squared Frobenius norm of the matrix (a.k.a. energy)
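Both rules can be sketched in a few lines of plain Python. The function names are hypothetical; the singular values 12.4, 9.5 and 1.3 are the ones used in the recommendation example later in this deck.

```python
def rank_by_energy(singular_values, threshold=0.9):
    """Smallest k whose leading singular values hold `threshold` of the energy."""
    squares = [s * s for s in sorted(singular_values, reverse=True)]
    total = sum(squares)
    running = 0.0
    for k, sq in enumerate(squares, start=1):
        running += sq
        if running >= threshold * total:
            return k
    return len(squares)


def rank_by_guttman_kaiser(singular_values):
    """Number of singular values that are at least 1."""
    return sum(1 for s in singular_values if s >= 1)


sigmas = [12.4, 9.5, 1.3]
print(rank_by_energy(sigmas))          # 2
print(rank_by_guttman_kaiser(sigmas))  # 3
```

On these values the two criteria disagree: the energy rule keeps two components, while the Guttman–Kaiser criterion keeps all three because 1.3 is still above 1.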


Cattell's Scree test

● The scree plot plots the singular values in decreasing order
● The plot looks like the side of a hill, hence the name
● The scree test is a subjective decision on the rank based on the shape of the scree plot
● The rank should be set to a point where there is a clear drop in the magnitudes of the singular values, or where the singular values start to even out


Entropy-based method

Consider the relative contribution of each singular value to the overall Frobenius norm

Low entropy (close to 0): the first singular value has almost all the mass

High entropy (close to 1): the singular values are almost equal
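The formula itself did not survive in this transcript. A commonly used form, assumed here, takes the relative contribution of each singular value as its square divided by the sum of all squared singular values, and normalizes the Shannon entropy of these contributions by the logarithm of their number so that the result lies between 0 and 1. A minimal sketch under that assumption:

```python
import math


def svd_entropy(singular_values):
    """Normalized entropy of the squared singular values (needs at least two values)."""
    squares = [s * s for s in singular_values]
    total = sum(squares)
    contributions = [sq / total for sq in squares]  # relative contributions
    # 0 * log(0) is taken as 0, so zero contributions are skipped.
    raw = -sum(r * math.log(r) for r in contributions if r > 0)
    return raw / math.log(len(singular_values))


low = svd_entropy([10.0, 0.1, 0.1])   # one dominant singular value
high = svd_entropy([3.0, 3.0, 3.0])   # all singular values equal
print(low, high)
```

The first call returns a value near 0 and the second a value near 1, matching the two regimes described above.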


A very common application of SVD is to remove the noise from the data

This works simply by taking the truncated SVD of the (normalized) data

Original data: looks 1-dimensional with some noise

The right singular vectors show the directions

The first looks like the data direction

The second looks like the noise direction


Latent semantic analysis (LSA) is an information retrieval method that uses SVD

The data: a document–term matrix A (rows: documents, columns: terms)

- the values are (weighted) term frequencies
- Pre-processing: typically tf/idf values (the frequency of the term in the document divided by the global frequency of the term)

● Matrix U_k associates documents to topics
● Matrix V_k associates topics to terms

If two rows of U_k are similar, the corresponding documents “talk about the same things”

A query q can be answered by considering its term vector q
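The pre-processing step can be sketched as follows. This follows the simplified definition given on the slide (term count in the document divided by the term's total count over all documents), not the logarithmic idf used by most libraries. The counts reuse the document-term example from earlier in the deck, and the function name is hypothetical.

```python
# Document-term counts (rows: documents, columns: terms).
terms = ["Machine", "Matrix", "Mining"]
counts = [[5, 0, 3],
          [0, 0, 7],
          [4, 6, 5]]


def weight_by_global_frequency(counts):
    """Divide each term count by the term's total count over all documents."""
    totals = [sum(col) for col in zip(*counts)]
    return [[c / t if t else 0.0 for c, t in zip(row, totals)] for row in counts]


weights = weight_by_global_frequency(counts)
for row in weights:
    print([round(w, 2) for w in row])
```

A term that occurs in a single document, such as "Matrix" here, gets weight 1 in that document, while a term spread over all documents is down-weighted everywhere.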


rank-2 matrix representing ratings of movies by users

two “concepts” underlying the movies:

- science-fiction

- romance

RECOMMENDATION SYSTEMS - CLUSTERING

Hidden concepts emerge:

- $U$ connects people to concepts
- $V$ relates movies to concepts
- the diagonal matrix gives the strength of each of the concepts


In general, the concepts will not be so clearly delineated


One possible rule: the sum of the squares of the retained singular values should be at least 90% of the total sum

(12.4)² + (9.5)² + (1.3)² = 245.70

(12.4)² + (9.5)² = 244.01 (99% of the energy)

(12.4)² / 245.70 is about 63%


Truncated matrix close to the original


projection in the space of concepts

Another sort of query we can perform in concept space is to find users similar to the new user. We can use V to map all users into concept space:

Joe maps to [1.74, 0]

Jill maps to [0, 5.68]

new user in the system → need for recommendation
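The mapping into concept space is a product of a user's rating row with V. The sketch below illustrates this in plain Python. The matrix `V` and all rating rows are assumptions: they are not given in this transcript and were chosen so that Joe and Jill land on the values quoted above. The new user's ratings are hypothetical as well.

```python
import math

# Assumed V (movies x concepts): three science-fiction movies, two romance movies.
V = [[0.58, 0.0],
     [0.58, 0.0],
     [0.58, 0.0],
     [0.0, 0.71],
     [0.0, 0.71]]


def to_concept_space(ratings, V):
    """Project a user's rating row onto the concepts: ratings * V."""
    return [sum(r * v for r, v in zip(ratings, col)) for col in zip(*V)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.hypot(*a) * math.hypot(*b)
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


joe = to_concept_space([1, 1, 1, 0, 0], V)       # assumed ratings for Joe
jill = to_concept_space([0, 0, 0, 4, 4], V)      # assumed ratings for Jill
new_user = to_concept_space([4, 0, 0, 0, 0], V)  # hypothetical new user

print([round(x, 2) for x in joe], [round(x, 2) for x in jill])
print(cosine(new_user, joe), cosine(new_user, jill))
```

In concept space the new user points in the same direction as Joe and is orthogonal to Jill, so Joe's ratings would be the natural basis for a recommendation.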

RECOMMENDATION SYSTEMS - MISSING DATA RECOVERY

David Skillicorn, Understanding Complex Datasets: Data Mining with Matrix Decompositions (Chapter 8) Chapman and Hall, 2007.

Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin. 2012. Learning from Data. AMLBook.


REFERENCES


BACKGROUND

▹ PageRank published in April 1998 at WWW by Sergey Brin and Larry Page

▹ Early search engines used content-based ranking algorithms: no relationship between the pages

networks

RANKING

▹ Measure of the importance of web pages based on hyperlinks

▹ In links & out links

▹ PageRank attributes a score to a web page depending on the other web pages

▹ A link to a page gives that page a higher score.


ALGORITHM

▹ Main ideas:

Hyperlink to a page: conveyance of authority. The more in-links a page has, the more prestige it has.

Pages that point to another page have their own prestige. This will give different weights to their out-links.


PAGERANK ALGORITHM

▹ Network representation

nodes : pages

edges :hyperlinks


PAGERANK ALGORITHM

▹ n: total number of web pages, giving n linear equations

▹ adjacency matrix with weights equal to 1 over the number of out-links

▹ set of PageRanks as an n-dimensional vector
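The equations themselves are missing from this transcript. In a standard formulation, with notation assumed here (P(i) for the PageRank of page i, O_j for the number of out-links of page j, E for the set of hyperlinks), they can be written as:

```latex
P(i) = \sum_{(j,i) \in E} \frac{P(j)}{O_j}, \qquad i = 1, \dots, n
\qquad \Longleftrightarrow \qquad
P = A^T P, \quad
A_{ji} =
\begin{cases}
  1 / O_j & \text{if } (j,i) \in E \\
  0       & \text{otherwise}
\end{cases}
```

In this form the PageRank vector is an eigenvector of the transposed weighted adjacency matrix with eigenvalue 1.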


PAGERANK ALGORITHM

▹ Each page as a state of a Markov chain
▹ Hyperlinks seen as transitions between states

~ web surfing as a stochastic process

The adjacency matrix is not a stochastic matrix (some rows do not sum to 1)

Some pages do not point to any page


PAGERANK ALGORITHM

▹ To cope with rows of 0, we can replace them with rows of 1/n. We assume a uniform distribution of the out-links

▹ The adjacency matrix is not irreducible: there is not always a nonzero probability of going from any page to any other page

add a link from each page to every other page, with a small transition probability controlled by the damping factor


PAGERANK ALGORITHM

Damping factor : probability that the user will continue clicking


POWER ITERATION METHOD

▹ Initialization
▹ At each time step:
▹ Repeat until convergence:
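The update formulas are missing from this transcript. The sketch below is one common way to write the power iteration, combining the fixes described earlier: pages without out-links spread their score uniformly, and a damping factor d (0.85 here, an assumed default) mixes link following with a uniform jump. The tiny web graph is hypothetical, and every link target must itself be a key of the dictionary.

```python
def pagerank(out_links, d=0.85, tol=1e-10, max_iter=200):
    """Power iteration on a dict {page: [pages it links to]}."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # initialization: uniform scores
    for _ in range(max_iter):
        # Pages without out-links spread their score uniformly (row of 1/n).
        dangling = sum(rank[p] for p in pages if not out_links[p])
        new_rank = {p: (1.0 - d) / n + d * dangling / n for p in pages}
        for p in pages:
            for q in out_links[p]:
                new_rank[q] += d * rank[p] / len(out_links[p])
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:  # repeat until convergence
            break
    return rank


# Hypothetical four-page web; page "d" has no out-links.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}
scores = pagerank(web)
print({p: round(s, 3) for p, s in scores.items()})
```

The scores always sum to 1, and page "c", which receives links from both "a" and "b", ends up ranked above "b".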


▹ Advantages: robust to spam + global measure

▹ Drawbacks: favors old pages / possibility of buying a link on pages with high PageRank

▹ Applications to other cases : for instance, academic papers

Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: bringing order to the web

http://www.slideshare.net/maimustafa566/page-rank-algorithm-33212250


Urban transportation network properties

spatially embedded, multimodal, time-resolved

APPLICATION ON A MULTILAYER NETWORK

Urban transportation data

General Transit Feed Specification (GTFS): geospatial information & schedule information


Transportation network representation


Public transportation vs car

Choice criteria:
1. total travel time
2. variability in the total travel time
3. number of transfers


Uncovering fast connections

Choice of a typical day: focus on commuting hours

Multi-edge P-space representation:
1. Weights: time spent in the means of transport + waiting time
2. Penalties: transfer times


Uncovering efficient transportation connections

Adaptation of Dijkstra’s algorithm

Computation of the shortest path in time for any origin-destination pair

Number of transfers limited

screenshot taken from Offi - Journey Planner

DIJKSTRA’S ALGORITHM

[Figure: Dijkstra’s algorithm illustrated on a graph, with labels 1 to 9]
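For reference, here is a minimal sketch of the textbook version of Dijkstra's algorithm using a priority queue. It does not include the transfer limit or the penalties of the adapted version described in this section; the stops and travel times are hypothetical.

```python
import heapq


def dijkstra(graph, source):
    """Shortest travel time from `source` to every reachable node.

    `graph` maps a node to a list of (neighbour, non-negative weight) pairs.
    """
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry: a shorter path was already found
        for neighbour, weight in graph.get(node, []):
            candidate = d + weight
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return dist


# Hypothetical stops with travel times in minutes.
graph = {
    "A": [("B", 7), ("C", 2)],
    "B": [("D", 1)],
    "C": [("B", 3), ("D", 8)],
    "D": [],
}
print(dijkstra(graph, "A"))
```

Here the direct connection from A to B (7 minutes) loses to the detour through C (2 + 3 minutes), which is the kind of non-obvious fast connection the approach is meant to uncover.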


Shortest time paths

For each (origin, destination): commuting time vs geographical distance


Car commuting times

Extracted from the French national survey of transport and mobility 2007-2008

- distance travelled (1 km resolution)
- means of transport used & trip duration (1 min resolution)

Typical time needed to commute a particular distance by car : median of the distribution of times over the entire sample


Travel time factors

For each distance: public transportation commuting time / car commuting time


Alessandretti, L., Karsai, M., & Gauvin, L. (2016). User-based representation of time-resolved multimodal public transportation networks. Royal Society Open Science, 3(7), 160156.


DETECTION OF MESOSCALE STRUCTURES

R mesoscale structures

link participation

temporal activity

Kolda, T. G., & Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3), 455-500.

APPLICATION ON A TEMPORAL NETWORK

MATRICIZATION

A tensor of size I×J×K can be unfolded into a matrix of size I×JK, J×IK or K×IJ
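The three unfoldings can be sketched on a small tensor stored as nested lists. The function name `unfold` is hypothetical, and the column ordering used below is one possible convention; the literature (for example Kolda and Bader) fixes a specific ordering that may differ.

```python
def unfold(tensor, mode):
    """Matricize a 3-way tensor stored as nested lists tensor[i][j][k].

    mode 0 -> I x (J*K), mode 1 -> J x (I*K), mode 2 -> K x (I*J).
    """
    I, J, K = len(tensor), len(tensor[0]), len(tensor[0][0])
    if mode == 0:
        return [[tensor[i][j][k] for j in range(J) for k in range(K)] for i in range(I)]
    if mode == 1:
        return [[tensor[i][j][k] for i in range(I) for k in range(K)] for j in range(J)]
    if mode == 2:
        return [[tensor[i][j][k] for i in range(I) for j in range(J)] for k in range(K)]
    raise ValueError("mode must be 0, 1 or 2")


# Hypothetical 2 x 3 x 4 tensor whose entries encode their own indices.
T = [[[100 * i + 10 * j + k for k in range(4)] for j in range(3)] for i in range(2)]
shapes = [(len(m), len(m[0])) for m in (unfold(T, 0), unfold(T, 1), unfold(T, 2))]
print(shapes)
```

For a temporal network, the three modes would typically be node, node and time, so the third unfolding has one row per time step.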


FACTORIZATION OUTPUT

● membership of nodes to the components

● membership of links to the components

● temporal activity of the components

[Figure: factor matrices A, B, C]

Bro, R., & Kiers, H. A. (2003). A new efficient method for determining the number of components in PARAFAC models. Journal of chemometrics

ESTIMATION OF THE NUMBER OF COMPONENTS

▹ Core consistency : based on the comparison of the core with Tucker decomposition

▹ Cophenetic coefficient : based on consensus matrices

Brunet, J. P., Tamayo, P., Golub, T. R., & Mesirov, J. P. (2004). Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the national academy of sciences, 101(12), 4164-4169.


SocioPatterns.org

Gauvin, L., et al. Detecting the community structure and activity patterns of temporal networks: a non-negative tensor factorization approach. PloS one, 2014

Gauvin, L., Panisson, A., Barrat, A., & Cattuto, C. (2015). Revealing latent factors of temporal networks for mesoscale intervention in epidemic spread. arXiv preprint arXiv:1501.02758.

PATTERN DETECTION


APPLICATION (2)

▹ 709 students
▹ 65 teachers
▹ 30 classes
▹ 10 days
▹ 5 min resolution


ANOMALY DETECTION


ANOMALY DETECTION

Sapienza, A., et al. "Anomaly Detection in Temporal Graph Data: An Iterative Tensor Decomposition and Masking Approach." AALTD@ PKDD/ECML. 2015.

http://www.datainterfaces.org/2013/06/twitter-topic-explorer/

OUTLINE

▹ Types of data:
▸ social interactions
▸ infrastructure (public transportation)
▸ text
▹ Types of applications:
▸ missing data recovery
▸ clustering
▸ anomaly detection
▹ Tools
▸ Programming: Python, R, Matlab…
▸ Machine learning: supervised, unsupervised (e.g. for classification), neural network...
▹ Topic not discussed but important in data science
▸ Ethics and Privacy issues


PYTHON NOTEBOOK

Ex: Airport network, Karate club

Analysis: degree, clustering coefficient, community, PageRank

Tools: pandas, networkX


REFERENCES

Network Science, ALBERT-LÁSZLÓ BARABÁSI

Networks: An Introduction, MARK NEWMAN

Mining of Massive Datasets, chapter “Mining Social-Network Graphs”, JURE LESKOVEC, ANAND RAJARAMAN, JEFFREY D. ULLMAN

