DATA SCIENCE FOR NETWORK
LAETITIA GAUVIN
DATA MINING AND RELATIONAL DATA
▹ Big Data not natively in structured format
▸ tweets and blogs are weakly structured pieces of text
▸ images and videos are not structured according to their semantic content in a way that enables search
▹ “The value of data explodes when it can be linked”
▹ “at the end of the 90s a new analytical trend joined data mining and machine learning: the emergence of network science”
Amato, G., Candela, L., Castelli, D., Esuli, A., Falchi, F., Gennaro, C., ... & Pappalardo, L. (2018). How Data Mining and Machine Learning Evolved from Relational Data Base to Data Science. In A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years (pp. 287-306). Springer International Publishing.
INFRASTRUCTURE
DATA
PUBLIC TRANSPORTATION (Ex of multilayer network)
Transportation map useful but not enough, connections matter
Asgari, F., Sultan, A., Xiong, H., Gauthier, V., & El-Yacoubi, M. A. (2016). CT-Mapper: Mapping sparse multimodal cellular trajectories using a multilayer transportation network. Computer Communications, 95, 69-81.
CO-OCCURRENCE NETWORKS
source: linguistic networks - Monojit Choudhury - Microsoft
WORD CO-OCCURRENCE NETWORK (Ex of weighted network)
Word frequency analysis is useful, but text analysis is enriched by the study of co-occurrences
OFFLINE INTERACTIONS
F2F INTERACTIONS (RFID DATA, GPS, WIFI, BLUETOOTH)
Ex of undirected network
Individual activities vs interactions
“Interactions are behind the spreading of disease…”
ONLINE INTERACTIONS
Social networks play an important role in information/rumor spreading (e.g. echo chambers)
dplb
WEB DATA
DATA: where to find them?
➢ SURVEYS
➢ SENSORS (WIFI, Bluetooth, GPS…)
➢ API
➢ SCRAPING
➢ ...
ANOMALY DETECTION
slide from http://web.eecs.umich.edu/~dkoutra/courses/F15_598/S
APPLICATIONS
RANKING
COMMUNITY DETECTION
Social network (Zachary’s karate club)
Word association network
Protein interaction network
LINK PREDICTION/MISSING DATA RECOVERY
www.uvm.edu/storylab/2013/02/11/who-will-your-friends-be-next-week-the-link-prediction-problem/
NETWORK REPRESENTATION
ADJACENCY MATRIX
[Figure: undirected graph with six nodes, labelled 1 to 6]
0 1 0 0 0 0
1 0 1 0 0 0
0 1 0 1 1 0
0 0 1 0 0 0
0 0 1 0 0 1
0 0 0 0 1 0
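A minimal sketch of how the matrix above can be handled in Python, assuming NumPy is available (the variable names are illustrative). It checks that the matrix is symmetric, as expected for an undirected network, then recovers the edge list and the node degrees.

```python
import numpy as np

# Adjacency matrix of the six-node example (row/column i corresponds to node i + 1).
A = np.array([
    [0, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1, 0],
])

# An undirected network has a symmetric adjacency matrix.
is_undirected = bool((A == A.T).all())

# Edge list read from the upper triangle, with 1-based node labels.
rows, cols = np.nonzero(np.triu(A))
edges = [(int(i) + 1, int(j) + 1) for i, j in zip(rows, cols)]

# The degree of a node is the sum of its row.
degrees = A.sum(axis=1).tolist()

print(is_undirected, edges, degrees)
```

Node 3 has the highest degree (3), since it is linked to nodes 2, 4 and 5.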
TOOLS FOR NETWORK STUDY: A SAMPLE
APPROACHES
▹ Complex systems - network science
▸ Modelling
▸ Network measures (centrality, degree, shortest paths…)
▹ Machine learning:
▸ classification
▸ link prediction
▸ clustering...
PROGRAMMING TOOLS
▹ Network visualization: Gephi, D3, plotly...
▹ Operations on matrices: Matlab - Octave
▹ Python - Jupyter (notebook)
▸ Scikit-learn (classification, regression, clustering…)
▸ Pandas dataframe (plotting distributions)
▸ NetworkX
EXAMPLES
▹ Fraud detection (e.g. with credit card)
▹ Network failure
▹ Malware / spyware detection
ANOMALY DETECTION
OUTLIERS ON CLOUDS VS ANOMALY ON GRAPH
Definition from Akoglu et al.:
Find the nodes and/or edges and/or substructures that are “few and different” or deviate significantly from the patterns observed in the graph.
Approaches:
▹ structure-based patterns
▹ community-based patterns
EX : WEB SPAM (STRUCTURE-BASED)
▹ Collection of web pages from the .uk domain from 2002
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., & Baeza-Yates, R. A. (2006, August). Link-Based Characterization and Detection of Web Spam. In AIRWeb (pp. 1-8).
EX: ANOMALY IN BIPARTITE GRAPHS (COMMUNITY-BASED)
Ex of bipartite networks:
▹ users vs. files in a P2P system
▹ traders vs. stocks in a financial trading system
▹ conferences vs. authors in a scientific publication network
EX: ANOMALY IN BIPARTITE GRAPHS (COMMUNITY-BASED)
Approach:
▹ Scores for nodes based on random walks
▹ Combined with graph partitions
Sun, J., Qu, H., Chakrabarti, D., & Faloutsos, C. (2005, November). Neighborhood formation and anomaly detection in bipartite graphs. In Data Mining, Fifth IEEE International Conference on (pp. 8-pp). IEEE.
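One common way to obtain random-walk-based node scores is a random walk with restart. The sketch below assumes NumPy; the small users-vs-files graph, the function name `random_walk_with_restart` and the restart probability are hypothetical, and no claim is made that this matches the exact formulation of the cited paper.

```python
import numpy as np

# Hypothetical bipartite graph: rows are users, columns are files.
# Users 0-1 share files 0-1, users 2-3 share files 2-3, user 1 also uses file 2.
B = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

n_users, n_files = B.shape
# Adjacency matrix over all nodes: users first, then files.
A = np.block([
    [np.zeros((n_users, n_users)), B],
    [B.T, np.zeros((n_files, n_files))],
])
# Column-stochastic transition matrix: column j spreads the mass of node j
# evenly over its neighbours (every node here has at least one neighbour).
P = A / A.sum(axis=0, keepdims=True)

def random_walk_with_restart(P, start, restart=0.15, n_iter=200):
    """Relevance of every node with respect to `start` (scores sum to 1)."""
    e = np.zeros(P.shape[0])
    e[start] = 1.0
    r = e.copy()
    for _ in range(n_iter):
        # Follow a link with probability 1 - restart, jump back to `start` otherwise.
        r = (1 - restart) * (P @ r) + restart * e
    return r

scores = random_walk_with_restart(P, start=0)
print(scores[:n_users])  # relevance of each user with respect to user 0
```

Nodes in the same neighbourhood as the start node receive higher scores; a node whose neighbours have low relevance to one another is a candidate anomaly.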
ANOMALY DETECTION: REFERENCES
Akoglu, L., Tong, H., & Koutra, D. (2015). Graph based anomaly detection and description: a survey. Data Mining and Knowledge Discovery, 29(3), 626-688.
Ranshous, S., Shen, S., Koutra, D., Harenberg, S., Faloutsos, C., & Samatova, N. F. (2015). Anomaly detection in dynamic networks: a survey. Wiley Interdisciplinary Reviews: Computational Statistics, 7(3), 223-247.
CLUSTERING : K-MEANS
COMMUNITY DETECTION - CLUSTERING
k-MEANS APPLIED ON A PROTEIN INTERACTION NETWORK
Jamil, K., Jayaraman, A., Rao, R., & Raju, S. (2012). In silico evidence of signaling pathways of notch mediated networks in leukemia. Computational and structural biotechnology journal, 1(2), 1-11.
“The clusters were composed of densely connected protein interactors, mostly sharing either similarity in function or occurrence in the same pathway.”
          Bread  Butter  Beer
Anna        1      1      0
Bob         1      1      1
Charlie     0      1      1
Customer transactions
          Machine  Matrix  Mining
Book 1       5       0       3
Book 2       0       0       7
Book 3       4       6       5
Document-term matrix
Movies: Avatar, The Matrix, Up
Alice: 4, 2
Bob: 3, 2
Charlie: 5, 3
Incomplete rating matrix (each user has rated only two of the three movies)
              Jan    Jun    Sep
Saarbrücken    1     11     10
Helsinki      6.5   10.9    8.7
Cape Town    15.7    7.8    8.7
Cities and monthly temperatures
Many different kinds of data fit this object-attribute viewpoint.
NETWORK AS MATRICES : LINEAR ALGEBRA
factorization into a product of 3 matrices: $A = U D V^T$
- $A$ ($n \times d$): rectangular matrix to decompose
- $U$ ($n \times n$): columns orthonormal
- $D$ ($n \times d$): diagonal with non-negative real entries
- $V$ ($d \times d$): columns orthonormal
SINGULAR VALUE DECOMPOSITION
SINGULAR VALUE DECOMPOSITION
SVD is defined for all matrices
Eigenvalue decomposition requires a square matrix, but $AA^T$ and $A^TA$ are always square
Terminology:
- columns of $V$ are called right-singular vectors
- columns of $U$ are called left-singular vectors
Properties:
The left-singular vectors of $A$ are eigenvectors of $AA^T$
The right-singular vectors of $A$ are eigenvectors of $A^TA$
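The two properties can be checked numerically. This is a small sketch assuming NumPy; the random 5 × 3 matrix is only an illustration. The eigenvalue attached to the i-th singular vector is the squared singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))  # arbitrary rectangular matrix, for illustration only

# Thin SVD: U is 5 x 3, s holds the singular values, Vt is V transposed (3 x 3).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# (A A^T) u_i = sigma_i^2 u_i : left-singular vectors are eigenvectors of A A^T.
left_ok = np.allclose(A @ A.T @ U, U * s**2)
# (A^T A) v_i = sigma_i^2 v_i : right-singular vectors are eigenvectors of A^T A.
right_ok = np.allclose(A.T @ A @ Vt.T, Vt.T * s**2)

print(left_ok, right_ok)
```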
PROPERTIES OF SVD
STRUCTURE DETECTION - CLUSTERING
Example: 1-dimensional subspace
rows of an n × 2 matrix as n points in a 2-dimensional space
[U,D,V] = svd(A)
B = U*D2*V'   % D2: presumably D with its second singular value set to 0 (rank-1 approximation)
INSIGHTS INTO SVD
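$$
A = \begin{pmatrix} 1 & 0 & 0 & 0 & 2 \\ 0 & 0 & 3 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 4 & 0 & 0 & 0 \end{pmatrix} = U D V^T
$$
$$
U = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & -1 \\ 1 & 0 & 0 & 0 \end{pmatrix}, \quad
D = \begin{pmatrix} 4 & 0 & 0 & 0 & 0 \\ 0 & 3 & 0 & 0 & 0 \\ 0 & 0 & \sqrt{5} & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}, \quad
V^T = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ \sqrt{0.2} & 0 & 0 & 0 & \sqrt{0.8} \\ 0 & 0 & 0 & 1 & 0 \\ -\sqrt{0.8} & 0 & 0 & 0 & \sqrt{0.2} \end{pmatrix}
$$
$U$ = orthogonal matrix
$V$ = orthogonal matrix
diagonal elements of $D$ = singular values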
The truncated (or thin) SVD only takes the first k columns of U and V and the leading k × k submatrix of D
The Eckart–Young theorem
Let A_k be the rank-k truncated SVD of A. Then A_k is the closest rank-k matrix to A in the Frobenius norm.
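A short sketch of the truncated SVD, assuming NumPy (the helper name `truncated_svd` and the random test matrix are illustrative). It also checks a consequence of the Eckart–Young theorem: the Frobenius error of the rank-k approximation equals the square root of the sum of the discarded squared singular values.

```python
import numpy as np

def truncated_svd(A, k):
    """Keep the first k singular triplets of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))  # illustrative data
k = 2

Uk, sk, Vtk = truncated_svd(A, k)
A_k = Uk @ np.diag(sk) @ Vtk  # rank-k approximation of A

# Frobenius error versus the discarded singular values.
s_all = np.linalg.svd(A, compute_uv=False)
error = np.linalg.norm(A - A_k, "fro")
expected = np.sqrt(np.sum(s_all[k:] ** 2))

print(error, expected)
```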
TRUNCATED SVD
- Strength of each component through the diagonal matrix
- Composition of each component in terms of attributes and object through the unitary matrices
- The first layer explains the most
- The 2nd corrects that by adding and removing smaller values
- The 3rd corrects that by adding and removing even smaller values...
Most data mining applications do not use full SVD, but truncated SVD
-To concentrate on “the most important parts”
But how to select the rank k of the truncated SVD?
-What is important, what is unimportant?
-What is structure, what is noise?
Guttman–Kaiser criterion
Select k so that for all i > k, σ_i < 1
Motivation: all components with a singular value less than one are uninteresting
Another common method is to select enough singular values such that the sum of their squares is 90% of the total sum of the squared singular values
Motivation: The resulting matrix “explains” 90% of the squared Frobenius norm of the matrix (a.k.a. energy)
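Both rules are easy to express in code. The sketch below assumes NumPy and singular values sorted in decreasing order; the function names are illustrative. The singular values 12.4, 9.5 and 1.3 are the ones used in the movie-rating example of this deck.

```python
import numpy as np

def rank_guttman_kaiser(singular_values, threshold=1.0):
    """Number of singular values that are not below the threshold."""
    s = np.asarray(singular_values, dtype=float)
    return int(np.sum(s >= threshold))

def rank_by_energy(singular_values, fraction=0.9):
    """Smallest k whose squared singular values reach `fraction` of the total."""
    s = np.asarray(singular_values, dtype=float)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cumulative, fraction) + 1)

sigma = [12.4, 9.5, 1.3]
print(rank_guttman_kaiser(sigma))  # all three values are at least 1
print(rank_by_energy(sigma))       # two components already carry about 99% of the energy
```

The two criteria can disagree, as here: the choice of k remains a modelling decision.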
Cattell's Scree test
● The scree plot plots the singular values in decreasing order
● The plot looks like the side of a hill, hence the name
● The scree test is a subjective decision on the rank based on the shape of the scree plot
● The rank should be set to a point where there is a clear drop in the magnitudes of the singular values, or where the singular values start to even out
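Entropy-based method
Consider the relative contribution of each singular value to the overall Frobenius norm
Relative contribution: $r_i = \sigma_i^2 / \sum_j \sigma_j^2$
Entropy: $E = -\frac{1}{\log m} \sum_i r_i \log r_i$, with $m$ the number of singular values
Low entropy (close to 0): the first singular value has almost all mass
High entropy (close to 1): the singular values are almost equal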
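A minimal sketch of the entropy computation, assuming NumPy. The relative contribution of each singular value is taken as its squared value divided by the sum of all squared values, and the entropy is divided by the logarithm of the number of singular values so that it lies between 0 and 1; this normalisation is an assumption inferred from the "close to 0 / close to 1" reading above.

```python
import numpy as np

def singular_value_entropy(singular_values):
    """Normalised entropy of the squared singular values (needs at least two values)."""
    s = np.asarray(singular_values, dtype=float)
    r = s ** 2 / np.sum(s ** 2)   # relative contributions, summing to 1
    r = r[r > 0]                  # a zero contribution adds nothing to the entropy
    return float(-np.sum(r * np.log(r)) / np.log(len(s)))

flat = singular_value_entropy([1.0, 1.0, 1.0, 1.0])      # equal singular values
peaked = singular_value_entropy([10.0, 1e-6, 1e-6])      # one dominant value
print(flat, peaked)
```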
A very common application of SVD is to remove the noise from the data
This works simply by taking the truncated SVD of the (normalized) data
Original data: looks 1-dimensional with some noise
The right singular vectors show the directions
The first looks like the data direction
The second looks like the noise direction
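The idea can be reproduced on synthetic data. In this sketch (NumPy assumed; the direction, noise level and sample size are arbitrary choices), points lie along one direction of the plane, noise is added, and a rank-1 truncated SVD is used to remove part of it.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
direction = np.array([2.0, 1.0]) / np.sqrt(5.0)   # true 1-dimensional data direction
t = rng.normal(scale=3.0, size=n)                 # position of each point along it
clean = np.outer(t, direction)                    # n x 2 matrix, exactly rank 1
noisy = clean + rng.normal(scale=0.3, size=(n, 2))

U, s, Vt = np.linalg.svd(noisy, full_matrices=False)

# Keep only the first component: data direction kept, noise direction dropped.
denoised = s[0] * np.outer(U[:, 0], Vt[0])

err_noisy = np.linalg.norm(noisy - clean)
err_denoised = np.linalg.norm(denoised - clean)
alignment = abs(Vt[0] @ direction)  # close to 1 if the first right-singular vector matches

print(err_noisy, err_denoised, alignment)
```

Only the noise orthogonal to the data direction is removed; the part of the noise that lies along the data direction cannot be separated from the signal.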
Latent semantic analysis (LSA) is an information retrieval method that uses SVD
The data: a term–document matrix A
- the values are (weighted) term frequencies
- Pre-processing: typically tf/idf values (the frequency of the term in the document divided by the global frequency of the term)
● Matrix U_k associates documents to topics
● Matrix V_k associates topics to terms
If two rows of U_k are similar, the corresponding documents “talk about the same things”
A query q can be answered by considering its term vector q
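A toy sketch of LSA, assuming NumPy. The vocabulary and the counts are hypothetical, rows are documents and columns are terms (so that rows of U_k describe documents), and raw counts are used instead of tf/idf weights to keep the example short. Projecting the query with V_k is one common choice, stated here as an assumption.

```python
import numpy as np

terms = ["network", "graph", "node", "movie", "film"]
# Hypothetical counts: documents 0-1 are about networks, documents 2-3 about cinema.
A = np.array([
    [2, 1, 1, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 0, 2, 1],
    [0, 0, 0, 1, 2],
], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs_k = U[:, :k]     # each row: a document in topic space
terms_k = Vt[:k].T    # each row: a term in topic space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Query made of the single term "network", mapped into topic space.
q = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
q_k = q @ terms_k

same_topic = cosine(docs_k[0], docs_k[1])
other_topic = cosine(docs_k[0], docs_k[2])
query_match = cosine(q_k, docs_k[0])
print(same_topic, other_topic, query_match)
```

Documents 0 and 1 end up with the same direction in topic space even though their term counts differ, and the query matches both of them.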
rank-2 matrix representing ratings of movies by users
two “concepts” underlying the movies:
- science-fiction
- romance
RECOMMENDATION SYSTEMS - CLUSTERING
Hidden concepts emerge:
- U connects people to concepts
- V relates movies to concepts
- D gives the strength of each of the concepts
In general, the concepts will not be so clearly delineated
One possible rule: the sum of the squares of the retained singular values should be at least 90% of the total sum
(12.4)² + (9.5)² + (1.3)² = 245.70
(12.4)² + (9.5)² = 244.01 (99% of the energy)
(12.4)² / 245.70 is about 63%
Truncated matrix close to the original
projection in the space of concepts
new user in the system: → need for recommendation
Joe maps to [1.74, 0]
Jill maps to [0, 5.68]
Another sort of query we can perform in concept space is to find users similar to the new user. We can use V to map all users into concept space.
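The projection into concept space can be sketched as follows, assuming NumPy. The rating matrix and the new user's ratings are hypothetical (they are not the values behind the Joe and Jill figures above), and the final step, mapping back to movie space to get recommendation scores, is one simple option rather than the only one.

```python
import numpy as np

# Hypothetical rank-2 rating matrix: rows are users, columns are movies.
# The first three movies are science-fiction, the last two are romance.
M = np.array([
    [1, 1, 1, 0, 0],
    [3, 3, 3, 0, 0],
    [4, 4, 4, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 4, 4],
    [0, 0, 0, 5, 5],
    [0, 0, 0, 2, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)
V2 = Vt[:2].T                 # movies -> the two strongest concepts

# New user who has only rated the first (science-fiction) movie.
new_user = np.array([4.0, 0.0, 0.0, 0.0, 0.0])
concept_coords = new_user @ V2        # the user in concept space (sign depends on the SVD)

# Map back to movie space: affinity of the new user for every movie.
scores = concept_coords @ V2.T
print(np.round(concept_coords, 2), np.round(scores, 2))
```

The new user gets a nonzero score for the other science-fiction movies and zero for the romance movies, which is the basis for a recommendation. Existing users can be mapped the same way (`M @ V2`) to look for users similar to the new one.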
RECOMMENDATION SYSTEMS - MISSING DATA RECOVERY
REFERENCES
David Skillicorn, Understanding Complex Datasets: Data Mining with Matrix Decompositions (Chapter 8), Chapman and Hall, 2007.
Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin. 2012. Learning from Data. AMLBook.
BACKGROUND
▹ PageRank published in April 1998 at WWW by Sergey Brin and Larry Page
▹ Early search engines used content-based ranking algorithms: no relationship between the pages
networks
RANKING
▹ Measure of the importance of web pages based on hyperlinks
▹ In links & out links
▹ PageRank attributes a score to a web page depending on the other web pages
▹ A link to a page gives that page a higher score
ALGORITHM
▹ Main ideas:
Hyperlink to a page: conveyance of authority. The more in-links a page has, the more prestige it has.
Pages that point to another page have their own prestige, which gives different weights to their out-links.
PAGERANK ALGORITHM
▹ Network representation
nodes: pages
edges: hyperlinks
PAGERANK ALGORITHM
▹ n: total number of web pages, which gives a system of n linear equations
A: adjacency matrix with weights equal to 1 over the number of out-links
P: set of PageRanks as an n-dimensional vector
In matrix form: $P = A^T P$
PAGERANK ALGORITHM
▹ Each page as a state of a Markov chain
▹ Hyperlinks seen as transitions between states
~ web surfing as a stochastic process
The adjacency matrix is not a stochastic matrix (not all rows sum to 1)
Some pages do not point to any page
PAGERANK ALGORITHM
▹ To cope with rows of 0, we can replace each of them by a row of 1/n, i.e. we assume a uniform distribution over the out-links
▹ The adjacency matrix is not irreducible: it is not always possible to go from any page to any other page
add a link from each page to every other page, followed with a small transition probability set by the damping factor
PAGERANK ALGORITHM
Damping factor : probability that the user will continue clicking
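With the damping factor, the model is commonly written as below. The notation is an assumption made for this note: $A$ is the row-stochastic weighted adjacency matrix, $P$ the vector of PageRanks, $n$ the number of pages, $d$ the damping factor, $e$ the all-ones vector and $O_j$ the number of out-links of page $j$. This is one usual normalisation, in which the PageRanks sum to 1.

```latex
P = (1 - d)\,\frac{e}{n} + d\,A^{T} P
\qquad\Longleftrightarrow\qquad
P(i) = \frac{1 - d}{n} + d \sum_{j \to i} \frac{P(j)}{O_j}
```

With probability $d$ the surfer follows one of the out-links of the current page, and with probability $1 - d$ jumps to a page chosen uniformly at random.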
POWER ITERATION METHOD
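▹ Initialization: $P_0 = e / n$ (e: all-ones vector, n: number of pages)
▹ At each time step: $P_{k+1} = (1 - d)\, e / n + d\, A^T P_k$ (d: damping factor)
▹ Repeat until convergence: $\lVert P_{k+1} - P_k \rVert_1 < \epsilon$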
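A compact sketch of the power iteration, assuming NumPy. The function name, the default damping factor of 0.85, the stopping tolerance and the four-page example are illustrative choices; pages without out-links get a uniform row of 1/n, as described in the slides.

```python
import numpy as np

def pagerank(adjacency, d=0.85, tol=1e-10, max_iter=1000):
    """PageRank by power iteration; adjacency[i, j] = 1 if page i links to page j."""
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    out_degree = A.sum(axis=1)

    # Row-stochastic transition matrix: 1 / out-degree on each link,
    # and a uniform row 1/n for pages without any out-link.
    T = np.full((n, n), 1.0 / n)
    has_links = out_degree > 0
    T[has_links] = A[has_links] / out_degree[has_links, None]

    P = np.full(n, 1.0 / n)                      # initialization: uniform scores
    for _ in range(max_iter):
        P_next = (1 - d) / n + d * (T.T @ P)     # one step of the damped random surfer
        if np.abs(P_next - P).sum() < tol:       # stop when scores no longer change
            return P_next
        P = P_next
    return P

# Hypothetical web graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0; page 3 has no links at all.
adjacency = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
P = pagerank(adjacency)
print(np.round(P, 3))
```

Page 2 collects links from both other linked pages and ends up with the highest score, while the isolated page 3 keeps only the share it receives from random jumps.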
▹ Advantages: robust to spam + global measure
▹ Drawbacks: favors old pages / possibility of buying a link on pages with high PageRank
▹ Applications to other cases : for instance, academic papers
Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: bringing order to the web
http://www.slideshare.net/maimustafa566/page-rank-algorithm-33212250
Urban transportation network properties
spatially embedded, multimodal, time-resolved
APPLICATION ON A MULTILAYER NETWORK
Urban transportation data
General Transit Feed Specification
geospatial information & schedule information
Transportation network representation
Public transportation vs car
Choice criteria:
1. total travel time
2. variability in the total travel time
3. number of transfers
Uncovering fast connections
Choice of a typical day: focus on commuting hours
Multi-edge P-space representation:
1. Weights: time spent in the means of transport + waiting time
2. Penalties: transfer times
Adaptation of Dijkstra’s algorithm
Computation of the shortest path in time for any origin-destination pair
# of transfers limited
Uncovering efficient transportation connections
screenshot taken from Offi - Journey Planner
DIJKSTRA’S ALGORITHM
[Figure: worked example, steps 1 to 4]
DIJKSTRA’S ALGORITHM
[Figure: worked example, steps 5 to 9]
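A sketch of Dijkstra's algorithm with a transfer penalty and a limit on the number of transfers, using only the standard library. It is an illustration of the general idea, not the authors' implementation: the edge format (next stop, line, minutes), the parameter names `transfer_penalty` and `max_transfers`, and the toy network are all assumptions.

```python
import heapq
from itertools import count

def fastest_trip(edges, origin, destination, transfer_penalty=5.0, max_transfers=2):
    """Return (minutes, transfers) of the fastest trip, or None if unreachable.

    edges maps a stop to a list of (next_stop, line, minutes) tuples.
    """
    tie = count()  # tie-breaker so the heap never has to compare lines
    # A search state is (stop, line currently used, transfers made so far).
    queue = [(0.0, next(tie), origin, None, 0)]
    best = {(origin, None, 0): 0.0}
    while queue:
        elapsed, _, stop, line, transfers = heapq.heappop(queue)
        if stop == destination:
            return elapsed, transfers
        if elapsed > best.get((stop, line, transfers), float("inf")):
            continue  # outdated queue entry
        for nxt, nxt_line, minutes in edges.get(stop, []):
            changing = line is not None and nxt_line != line
            new_transfers = transfers + (1 if changing else 0)
            if new_transfers > max_transfers:
                continue  # too many transfers: ignore this connection
            new_elapsed = elapsed + minutes + (transfer_penalty if changing else 0.0)
            state = (nxt, nxt_line, new_transfers)
            if new_elapsed < best.get(state, float("inf")):
                best[state] = new_elapsed
                heapq.heappush(queue, (new_elapsed, next(tie), nxt, nxt_line, new_transfers))
    return None

# Hypothetical network: metro line M1 runs A-B-C-D, bus B7 links B directly to D.
edges = {
    "A": [("B", "M1", 4)],
    "B": [("C", "M1", 4), ("D", "B7", 3)],
    "C": [("D", "M1", 10)],
}
with_transfer = fastest_trip(edges, "A", "D")
direct_only = fastest_trip(edges, "A", "D", max_transfers=0)
print(with_transfer, direct_only)
```

Keeping the current line and the number of transfers in the search state is what allows both the penalty and the limit to be applied while the usual Dijkstra guarantee (non-negative weights) still holds.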
Shortest time paths
For each (origin, destination): commuting time vs geographical distance
Car commuting times
Extracted from the French national survey of transport and mobility 2007-2008
- distance travelled (1 km resolution)
- means of transport used & trip duration (1 min resolution)
Typical time needed to commute a particular distance by car: median of the distribution of times over the entire sample
Travel time factors
For each distance: ratio of public transportation commuting times to car commuting times
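The travel time factor can be sketched with the standard library. All trip records below are hypothetical, and taking the median for public transportation as well is a simplifying assumption (in the slides, public transportation times come from the shortest time paths).

```python
from statistics import median

# Hypothetical records: (distance in km, duration in minutes).
car_trips = [(5, 12), (5, 15), (5, 20), (10, 22), (10, 25), (10, 31)]
pt_trips = [(5, 25), (5, 30), (5, 28), (10, 40), (10, 50), (10, 45)]

def typical_time_by_distance(trips):
    """Median duration for each distance bin."""
    by_distance = {}
    for distance, minutes in trips:
        by_distance.setdefault(distance, []).append(minutes)
    return {distance: median(times) for distance, times in by_distance.items()}

car = typical_time_by_distance(car_trips)
pt = typical_time_by_distance(pt_trips)

# Travel time factor: public transportation time divided by car time, per distance.
factor = {distance: pt[distance] / car[distance] for distance in car if distance in pt}
print(car, pt, factor)
```

A factor above 1 means that public transportation is slower than the car for that distance.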
Alessandretti, L., Karsai, M., & Gauvin, L. (2016). User-based representation of time-resolved multimodal public transportation networks. Royal Society Open Science, 3(7), 160156.
DETECTION OF MESOSCALE STRUCTURES
R mesoscale structures
link participation
temporal activity
Kolda, T. G., & Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3), 455-500.
APPLICATION ON A TEMPORAL NETWORK
MATRICIZATION
I × J × K tensor unfolded into matrices of size:
I × JK
J × IK
K × IJ
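Matricization (unfolding) can be sketched with NumPy as below; the helper name `unfold` is illustrative. The chosen mode becomes the rows and the two remaining modes are flattened into the columns. The order of the columns follows NumPy's default (row-major) layout, which may differ from the column ordering used in the tensor literature.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-`mode` matricization: the chosen axis becomes the rows."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

I, J, K = 3, 4, 5
T = np.arange(I * J * K).reshape(I, J, K)  # illustrative tensor

shapes = [unfold(T, mode).shape for mode in range(3)]
print(shapes)  # I x JK, J x IK, K x IJ
```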
FACTORIZATION OUTPUT
● membership of nodes to the components
● membership of links to the components
● temporal activity of the components
[Figure: factor matrices A, B and C]
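How three factor matrices describe a temporal network can be sketched with NumPy. The assignment used here is an assumption: A and B hold the membership of nodes to the components (one matrix per side of a link) and C holds the temporal activity of the components; the sizes and random values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_times, R = 5, 8, 2      # R: number of components

A = rng.random((n_nodes, R))       # node membership (first end of a link)
B = rng.random((n_nodes, R))       # node membership (second end of a link)
C = rng.random((n_times, R))       # temporal activity of each component

# Tensor rebuilt from the factors: T[i, j, t] = sum_r A[i, r] * B[j, r] * C[t, r]
T = np.einsum("ir,jr,tr->ijt", A, B, C)

# Membership of links to component r: outer product of the two node memberships.
r = 0
link_membership = np.outer(A[:, r], B[:, r])

print(T.shape, link_membership.shape)
```

With non-negative factors, each component can be read as a group of links (`link_membership`) that is active at the times given by the corresponding column of `C`.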
Bro, R., & Kiers, H. A. (2003). A new efficient method for determining the number of components in PARAFAC models. Journal of chemometrics
ESTIMATION OF THE NUMBER OF COMPONENTS
▹ Core consistency : based on the comparison of the core with Tucker decomposition
▹ Cophenetic coefficient : based on consensus matrices
Brunet, J. P., Tamayo, P., Golub, T. R., & Mesirov, J. P. (2004). Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the national academy of sciences, 101(12), 4164-4169.
SocioPatterns.org
Gauvin, L., et al. Detecting the community structure and activity patterns of temporal networks: a non-negative tensor factorization approach. PloS one, 2014
Gauvin, L., Panisson, A., Barrat, A., & Cattuto, C. (2015). Revealing latent factors of temporal networks for mesoscale intervention in epidemic spread. arXiv preprint arXiv:1501.02758.
PATTERN DETECTION
APPLICATION (2)
▹ 709 students
▹ 65 teachers
▹ 30 classes
▹ 10 days
▹ 5 min resolution
ANOMALY DETECTION
ANOMALY DETECTION
Sapienza, A., et al. "Anomaly Detection in Temporal Graph Data: An Iterative Tensor Decomposition and Masking Approach." AALTD@ PKDD/ECML. 2015.
http://www.datainterfaces.org/2013/06/twitter-topic-explorer/
OUTLINE
▹ Types of data:
▸ social interactions
▸ infrastructure (public transportation)
▸ text
▹ Types of applications:
▸ missing data recovery
▸ clustering
▸ anomaly detection
▹ Tools
▸ Programming: Python, R, Matlab…
▸ Machine learning: supervised, unsupervised (e.g. for classification), neural network...
▹ Topic not discussed but important in data science
▸ Ethics and Privacy issues
PYTHON NOTEBOOK
Ex: Airport network, Karate club
Analysis: degree, clustering coefficient, community, PageRank
Tools: pandas, NetworkX
REFERENCES
Network Science, ALBERT-LÁSZLÓ BARABÁSI
Networks: An Introduction, MARK NEWMAN
Mining of massive datasets, chapter “Mining Social-Network Graphs”, JURE LESKOVEC, ANAND RAJARAMAN, JEFFREY D. ULLMAN