A foray into graph mining, Neil Shah, April 15th, 2019
Page 1: A foray into graph miningink-ron.usc.edu/xiangren/ml4know19spring/slides/W14-GM.pdf•Anomaly detection •Takeaways. ... •Modified PageRank is the same as the simple model, with

A foray into graph mining

Neil Shah

April 15th, 2019


(Graph) data is prevalent

• 2.5 exabytes of data produced every day
• 90% generated in the last 2 years
• Data is produced as the product of a highly interconnected world

[Figure: three platform examples: 1.3 billion users and 1 billion daily mobile views; 244 million users and 480 million products; 187 million daily actives and 3.5 billion daily snaps]


(Graph) data shapes perspectives

• Movie recommendation
• Search engine ranking
• Product purchasing
• Social platform interaction


What’s in a graph?

• Graphs consist of nodes, edges and attributes
  • ex: Facebook social network where
    • nodes = individuals
    • edges = friendship
    • attributes = gender (node), # of messages exchanged (edge)

• Graphs can easily model relationships between entities
  • Who-follows-whom on a social network
  • Who-buys-what on an e-commerce platform
  • Who-calls-whom using a certain cellular provider


Roadmap

• Preliminaries

• Notable graph properties

• Cool applications
  • Recommendation and ranking
  • Clustering
  • Anomaly detection

• Takeaways



Graph preliminaries – directionality

[Figure: two users-by-users graphs over nodes u1 to u11, illustrating edge directionality]


Graph preliminaries – degree

• Degree: # of adjacent edges

• Degree(u7) = 2

[Figure: users-by-users graph over nodes u1 to u11]


Graph preliminaries – out- and in-degree

• Degree: # of adjacent edges
• Out-degree: # of outgoing edges
• In-degree: # of incoming edges

• Out-degree(u4) = 1
• In-degree(u6) = 2
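As a minimal sketch of these definitions, degrees can be counted directly from a directed edge list. The edge list below is a made-up example, not the graph shown on the slide.

```python
from collections import Counter

# Hypothetical directed edges (source, target); not the slide's graph.
edges = [("u1", "u2"), ("u2", "u3"), ("u4", "u3"), ("u3", "u1")]

out_degree = Counter(src for src, _ in edges)  # outgoing edges per node
in_degree = Counter(dst for _, dst in edges)   # incoming edges per node

# Total degree ignores direction: every edge touches both of its endpoints.
degree = Counter()
for src, dst in edges:
    degree[src] += 1
    degree[dst] += 1
```

A `Counter` returns 0 for nodes that never appear on the relevant side of an edge, which matches the definition for nodes with no outgoing or incoming edges.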

[Figure: directed users-by-users graph over nodes u1 to u11]


Graph preliminaries – weighted degree

• Weighted degree: total sum of adjacent edge weights
  • i.e. “how many times did two users communicate”

• Weighted-degree(u6) = 7

[Figure: weighted users-by-users graph over nodes u1 to u11, with a numeric weight on each edge]


Graph preliminaries – ego(net)

• Ego: single, central node

• Ego network (egonet): nodes and edges within one “hop” from ego

• Egonet(u7) =
  • Nodes {u7, u3, u5}
  • Edges {u7-u3, u7-u5}

[Figure: users-by-users graph over nodes u1 to u11]


Graph preliminaries – connectivity

• Two nodes are connected if there is a path between them.
• A graph is fully connected if all node pairs are connected.

• u1 and u8 are connected
• u3 and u5 are connected
• u1 and u9 are not connected
• This graph is not fully connected
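One way to check these definitions is a breadth-first search from a node. The sketch below uses a small hypothetical adjacency dict (not the slide's graph); the helper names `reachable` and `is_connected` are illustrative.

```python
from collections import deque

def reachable(adj, start):
    """Return the set of nodes connected to `start` (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adj.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return seen

def is_connected(adj, u, v):
    """Two nodes are connected if a path exists between them."""
    return v in reachable(adj, u)

# Hypothetical undirected graph with two separate components.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": ["e"], "e": ["d"]}

# The graph is fully connected only if one search reaches every node.
fully_connected = len(reachable(adj, "a")) == len(adj)
```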

[Figure: users-by-users graph over nodes u1 to u11]


Graph preliminaries – node and edge types

• A heterogeneous graph has multiple node and/or edge types.

• Users and products
• Who-buys-what and who-rates-what

[Figure: users-by-products graph linking users u1 to u6 with products p1 to p5]


Graph preliminaries – matrix representation

• Graph connectivity can be summarized in an adjacency matrix.
  • $A_{i,j}$ = # (or weight) of edges from node i to node j
  • A is usually very sparse (makes compact representations possible!)
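To illustrate why sparsity allows compact representations, the sketch below stores only the nonzero entries of an adjacency matrix in nested dicts and compares that with the dense form. The edge list and node count are hypothetical.

```python
from collections import defaultdict

# Hypothetical weighted, directed edges (i, j, weight).
edges = [(0, 1, 3.0), (1, 2, 4.0), (2, 0, 1.0), (0, 1, 2.0)]
n = 4  # number of nodes; node 3 has no edges

# Sparse form: keep only the nonzero entries A[i][j].
sparse = defaultdict(dict)
for i, j, w in edges:
    sparse[i][j] = sparse[i].get(j, 0.0) + w  # parallel edges add up

# Dense form for comparison: n * n cells, most of them zero.
dense = [[sparse.get(i, {}).get(j, 0.0) for j in range(n)] for i in range(n)]

nonzeros = sum(len(row) for row in sparse.values())
```

Here the dense matrix holds 16 cells while the sparse form holds 3 entries; on real graphs the gap is far larger.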

[Figure: users-by-users graph over nodes u1 to u11 and its users-by-users adjacency matrix, with a 1 in each cell that corresponds to an edge]


Roadmap

• Preliminaries

• Notable graph properties

• Cool applications
  • Recommendation and ranking
  • Clustering
  • Anomaly detection

• Takeaways


Key question: What does a graph “look like”?

• At first look… large, unwieldy and seemingly random.

• Spoiler: In actuality, most real-world graphs are far from random.

[Figure: trace-route paths on the internet (Lyon ’03)]


A quick detour: “Random” graphs

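• Erdős-Rényi random graph model: graph G(n,p)
  • n = number of nodes
  • p = probability of an edge between two nodes (independent edges)

• Expected # of edges (undirected graph): $p \binom{n}{2} = p \, \frac{n(n-1)}{2}$

• Degree distribution (binomial): $P(\deg = k) = \binom{n-1}{k} p^{k} (1-p)^{n-1-k}$

Babaoglu ’18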
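A minimal sketch of sampling from G(n,p), assuming an undirected graph without self-loops: each of the n(n-1)/2 node pairs becomes an edge independently with probability p. The function name, seed, and parameter values are illustrative.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, seed=0):
    """Sample an undirected G(n, p) graph as a list of edges."""
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

n, p = 200, 0.05
edges = erdos_renyi(n, p)

# Each of the n(n-1)/2 pairs is an edge with probability p.
expected_edges = p * n * (n - 1) / 2
# Average degree of the sample; should be close to (n - 1) * p.
mean_degree = 2 * len(edges) / n
```

Because every node has the same expected degree, the sampled degrees cluster tightly around `(n - 1) * p`, unlike the heavy-tailed distributions seen in real graphs.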


What about real graphs?

• X-axis: degree, Y-axis: frequency/probability
• Degree distributions of real graphs are not “random”
• What exactly are they, then?

[Figure: three log-log plots: log(# posts) vs. log(# users), log(# visitors) vs. log(# sites), log(# peers) vs. log(# routers). Sources: Faloutsos ’99, Viswanath ’09, Adamic ’02]


The “scale-free” property

• Real-world graphs are often scale-free, meaning that their degree distribution obeys a power law: $P(k) \propto k^{-\gamma}$

• Scaling the input by a constant factor simply results in proportional scaling of the whole function: $P(ck) = c^{-\gamma} P(k)$

• Power laws are linear in log-log scales: $\log P(k) = -\gamma \log k + \text{const}$

• Typically $2 \le \gamma \le 3$

[Figure: log(# visitors) vs. log(# sites)]
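Because a power law is a straight line on log-log axes, its exponent can be read off as the slope of that line. The sketch below fits the slope by ordinary least squares on synthetic, noise-free data; the exponent 2.5 and the helper name `loglog_slope` are assumptions for the demo.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

exponent = 2.5                             # assumed exponent for the demo
degrees = list(range(1, 101))
freqs = [k ** -exponent for k in degrees]  # ideal power law, no noise
slope = loglog_slope(degrees, freqs)       # recovers -exponent
```

On noisy real data this fit is only a rough diagnostic, since the sparse tail of the distribution can distort the slope.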


Scale-freeness is evident in many domains

Newman ‘05


Why are many real graphs scale-free?

• Hypothesis: preferential attachment, or a “rich-get-richer” effect

• Generative process to construct a network:
  • Start with $m_0$ nodes, each with at least 1 edge
  • At each timestep, add a new node with $m$ edges connecting it to $m$ already existing nodes
  • Probability of the new node connecting to node $i$ depends on the degree $k_i$ as $\Pi(k_i) = k_i / \sum_j k_j$

• Many real-world variants of this effect: academic citations, recommendation, virality

[Figure: log(# visitors) vs. log(# sites)]
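The generative process above can be sketched as a short simulation. This is an illustrative implementation under stated assumptions: the seed graph is a clique (so every starting node has at least one edge), the new node's targets are distinct, and the function and parameter names are hypothetical.

```python
import random
from collections import Counter

def preferential_attachment(n, m, seed=0):
    """Grow a graph to n nodes; each new node attaches to m existing nodes."""
    rng = random.Random(seed)
    m0 = m + 1  # assumed seed size: a clique on m + 1 nodes
    edges = [(i, j) for i in range(m0) for j in range(i + 1, m0)]
    # Every node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list picks a node with probability proportional to degree.
    endpoints = [v for edge in edges for v in edge]
    for new in range(m0, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in sorted(targets):
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges

edges = preferential_attachment(n=500, m=2)
degree = Counter(v for edge in edges for v in edge)
```

Sorting `degree.values()` typically shows a few early nodes with much higher degree than the rest, which is the "rich-get-richer" effect.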


Real graphs have “small-world” effects

• How “far apart” are nodes in real graphs?
  • Interestingly, not very far! The typical number is 6. You may have heard of the “six degrees of separation”

• Milgram ‘69: avg. # of hops for a letter to travel from Nebraska to Boston was 6.2 (sample size 64)
• Leskovec ‘08: avg. distance between node pairs on MSN messenger has mode 6 (sample size 180M nodes and 1.3B edges)


What causes the small-world effect?

• Hypothesis: The abundance of hubs, or high-degree nodes
  • Even though most nodes aren’t connected to most other nodes, they are connected to hubs, which facilitate paths

[Figure: log(# visitors) vs. log(# sites)]


How do real graphs “grow” over time?

• Consider a time-evolving graph $G$
  • If it has $N(t)$ nodes and $E(t)$ edges at time $t$…
  • Suppose that $N(t+1) = 2N(t)$
  • What is $E(t+1)$?

• Not only is it $> 2E(t)$; the growth is actually superlinear and follows $E(t) \propto N(t)^{a}$ (power law!) with $1 \le a \le 2$, generally
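As a short worked step, writing $N(t)$ for the number of nodes, $E(t)$ for the number of edges, and assuming the power law holds exactly with some constant $C$ and exponent $a$, doubling the nodes multiplies the edges by $2^{a}$:

```latex
E(t) = C \, N(t)^{a}
\quad\Longrightarrow\quad
E(t+1) = C \, \bigl(2 N(t)\bigr)^{a} = 2^{a} \, E(t),
\qquad 2 \le 2^{a} \le 4 \ \text{ for } 1 \le a \le 2 .
```

For example, an illustrative exponent of $a = 1.5$ gives $2^{1.5} \approx 2.83$, so the edge count nearly triples when the node count doubles.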


Real graphs exhibit densification

[Figure: two panels: avg. out-degree increases over time; power law in # edges vs. # nodes (over time)]


Moreover, the graph diameter shrinks

• Graph diameter = max(distance between node pairs)

• Leskovec ‘05 shows that diameter actually shrinks over time, instead of growing. In other words, nodes tend to get closer

• Hypothesis: Once again due to prevalence and growth of hubs


Much more work done on graph behaviors

• Generative graph models (Leskovec ‘05)
• Patterns in sizes of connected components (Kang ‘10)
• Node in-degree (popularity) over time (McGlohon ‘07)
• Duration of calls in phone-call networks (Vaz de Melo ‘10)
• Temporal structure evolution (Shah ‘15)

the list goes on


Roadmap

• Preliminaries

• Notable graph properties

• Cool applications
  • Recommendation and ranking
  • Clustering
  • Anomaly detection

• Takeaways


Key question: how can we leverage graphs for recommendation/ranking tasks?

• Measuring webpage importance

• Link prediction and recommendation
  • Local methods
  • Global methods


PageRank for large-scale search engines

• Key problem: how to prioritize/curate a large (ever-growing) hyperlinked body of pages by importance and relevance?

• Key idea: leverage the hyperlink citation graph (page-links-page) to rank page importance according to connectivity patterns

• 150 million web pages → 1.7 billion links

Backlinks and forward links:
  • A and B are C’s backlinks
  • C is A and B’s forward link

Content adapted from Li ‘09


Simplified PageRank

• $u$: a web page
• $B_u$: the set of $u$’s backlinks
• $N_v$: the number of forward links of page $v$
• $c$: the normalization factor to make $R$ a probability distribution

$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$

• Simplified PageRank is the stationary probability dist. of a random-walk on the graph; a surfer keeps clicking successive pages at random.

Idea: each page equally distributes its own PageRank to its forward-links recursively.

“An important page has many important pages pointing to it”
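The "each page equally distributes its own PageRank to its forward links" idea can be sketched as a simple repeated update. The three-page web below is an assumption for illustration: the slides only state that Amazon gives half of its rank to Yahoo and half to Microsoft, and the other links are made up. The sketch also assumes every page has at least one forward link.

```python
def simplified_pagerank(links, iters=100):
    """links[v] lists the pages that v points to (its forward links)."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}  # uniform starting scores
    for _ in range(iters):
        new = {p: 0.0 for p in pages}
        for v, outs in links.items():
            share = rank[v] / len(outs)          # v splits its rank equally
            for u in outs:
                new[u] += share
        rank = new
    return rank

# Assumed toy web; only Amazon's links are stated in the slides.
links = {"Yahoo": ["Yahoo", "Amzn"], "Amzn": ["Yahoo", "MS"], "MS": ["Amzn"]}
rank = simplified_pagerank(links)
```

Since all rank is redistributed at every step, the scores keep summing to 1 and no extra normalization is needed in this sketch. For this toy web the scores settle near 0.4, 0.4 and 0.2 for Yahoo, Amazon and Microsoft.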


Simplified PageRank

PageRank Calculation: first iteration

Adjacency matrix transposed and column-normalized (accounts for equal neighbor distribution)

Yahoo Amzn MS

Initial PageRank scores

Read as “Amazon gives ½ of its own PageRank to Yahoo and Microsoft each”


Simplified PageRank

PageRank Calculation: second iteration

Adjacency matrix transposed and column-normalized (accounts for equal neighbor distribution)

Yahoo Amzn MS

Initial PageRank scores

Read as “Amazon gives ½ of its own PageRank to Yahoo and Microsoft each”


Simplified PageRank

Adjacency matrix transposed and column-normalized (accounts for equal neighbor distribution)

Yahoo Amzn MS

Initial PageRank scores

Convergence after some iterations

Read as “Amazon gives ½ of its own PageRank to Yahoo and Microsoft each”


Problem with Simplified PageRank

A loop:

During each iteration, the loop accumulates rank but never distributes rank to other pages!


The problem in practice

Adjacency matrix transposed and column-normalized (accounts for equal neighbor distribution)

Yahoo Amzn MS

Initial PageRank scores

Read as “Microsoft gives all of its PageRank to Microsoft”


The problem in practice

Adjacency matrix transposed and column-normalized (accounts for equal neighbor distribution)

Yahoo Amzn MS

Initial PageRank scores

Read as “Microsoft gives all of its PageRank to Microsoft”


The problem in practice

Adjacency matrix transposed and column-normalized (accounts for equal neighbor distribution)

Yahoo Amzn MS

Initial PageRank scores

Read as “Microsoft gives all of its PageRank to Microsoft”

All roads lead to Microsoft


A modified solution: (true) PageRank

• This subtle modification solves the problem of “sinks”
• PageRanks converge to the dominant eigenvector of the appropriately configured/normalized adjacency matrix, due to Markov chain theory! Cool!

• Modified PageRank is the same as the simple model, with the exception of the surfer having a random jump probability.

$E(u)$: a distribution of ranks of web pages that the surfer can jump to when he/she “gets bored” after clicking on successive links.


A modified solution: PageRank

Adjacency matrix transposed and column-normalized (accounts for equal neighbor distribution)

Yahoo Amzn MS

Initial PageRank scores

20% random jump probability
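A minimal sketch of the random-jump variant, assuming the jump lands on a page chosen uniformly at random (the slides allow a general jump distribution). The toy web, in which Microsoft links only to itself, is an assumption built around the caption "Microsoft gives all of its PageRank to Microsoft"; the other links are made up.

```python
def pagerank(links, jump=0.2, iters=200):
    """PageRank with a uniform random jump taken with probability `jump`."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: jump / n for p in pages}  # share received from random jumps
        for v, outs in links.items():
            share = (1.0 - jump) * rank[v] / len(outs)
            for u in outs:
                new[u] += share
        rank = new
    return rank

# Assumed toy web in which Microsoft links only to itself (a sink).
links = {"Yahoo": ["Yahoo", "Amzn"], "Amzn": ["Yahoo", "MS"], "MS": ["MS"]}
no_jump = pagerank(links, jump=0.0)    # the sink absorbs nearly all rank
with_jump = pagerank(links, jump=0.2)  # rank stays spread across pages
```

With `jump=0.0` this reduces to the simplified model and Microsoft ends up with essentially all of the rank. With a 20% jump the scores settle at 7/33, 5/33 and 21/33 for Yahoo, Amazon and Microsoft.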


PageRank converges quickly and produces empirically good results

• PR (322 Million Links): 52 iterations
• PR (161 Million Links): 45 iterations
• Scaling factor is roughly linear in log n


Key question: how can we leverage graphs for recommendation/ranking tasks?

• Measuring webpage importance

• Link prediction and recommendation
  • Local methods
  • Global methods


Exploiting local structure for predicting links

• Key problem: given what we know about interactions in a graph G, which nodes should we recommend to a user u to promote engagement?

• Key idea: measure affiliation between u and other nodes by u’s graph neighborhood!
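As a sketch of neighborhood-based affinity, the snippet below scores node pairs with three widely used local measures: common neighbors, Jaccard coefficient, and Adamic-Adar. The choice of these three and the small friendship graph are assumptions for illustration.

```python
import math

# Hypothetical undirected friendship graph as neighbor sets.
nbrs = {
    "u1": {"u2", "u3"},
    "u2": {"u1", "u3", "u4"},
    "u3": {"u1", "u2", "u4"},
    "u4": {"u2", "u3"},
}

def common_neighbors(u, v):
    return len(nbrs[u] & nbrs[v])

def jaccard(u, v):
    union = nbrs[u] | nbrs[v]
    return len(nbrs[u] & nbrs[v]) / len(union) if union else 0.0

def adamic_adar(u, v):
    # Shared neighbors with few connections count more than shared hubs.
    # Degree-1 nodes are skipped because log(1) = 0.
    return sum(1.0 / math.log(len(nbrs[z]))
               for z in nbrs[u] & nbrs[v] if len(nbrs[z]) > 1)

# Score every non-neighbor of u1 as a candidate recommendation.
candidates = [v for v in nbrs if v != "u1" and v not in nbrs["u1"]]
scores = {v: common_neighbors("u1", v) for v in candidates}
```

Sorting `scores` in descending order gives a simple recommendation list for `u1`; swapping in `jaccard` or `adamic_adar` changes only the scoring function.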


Rich literature in previous measures

[Figure: link prediction measures from Liben-Nowell ‘04, shown with a users-by-users graph]


Key question: how can we leverage graphs for recommendation/ranking tasks?

• Measuring webpage importance

• Link prediction and recommendation
  • Local methods
  • Global methods


Exploiting global structure for predicting links

• Key problem: given what we know about interactions in a graph G, which nodes should we recommend to a user u to promote engagement?

• Key idea: measure affiliation between u and other nodes via a latent factor model/embedding that compactly encodes “interests”


Singular value decomposition

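• Used for low-rank matrix approximation
• Rank k SVD reduces matrix A into k latent factors/dense blocks/communities
• U and V capture “involvement” of nodes
• $\Sigma$ denotes factor “strength”

$A_{n \times m} \approx U_{n \times k} \, \Sigma_{k \times k} \, V^{T}_{k \times m}$, where $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k)$ and $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_k$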


Singular value decomposition

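• Used for low-rank matrix approximation
• Rank k SVD reduces matrix A into k latent factors/dense blocks/communities
• U and V capture “involvement” of nodes
• $\Sigma$ denotes factor “strength”

[Figure: the n users × m videos matrix approximated as $\sigma_1 u_1 v_1^{T} + \sigma_2 u_2 v_2^{T} + \sigma_3 u_3 v_3^{T} + \ldots$, with example factors pairing user groups and video groups: “music lovers” and “artist spotlights”, “adrenaline junkies” and “action movies”, “dabbling cooks” and “baking shows”]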


Recommendation from latent factors

• SVD effectively constructs vector embeddings in a k-dimensional space which “summarize” user/item affinities towards latent factors

• Compute vector similarity between user-user or user-item vectors (depending on application)
  • Cosine similarity/dot product are common choices

Koren ‘09
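Once users and items live in the same k-dimensional space, recommendation reduces to ranking by vector similarity. The sketch below uses cosine similarity on made-up 3-dimensional embeddings; the vectors and item names are hypothetical and are not the output of an actual SVD.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical k = 3 embeddings: one user and two candidate items.
user = [0.9, 0.1, 0.0]
items = {
    "artist spotlight": [0.8, 0.2, 0.1],
    "baking show": [0.0, 0.1, 0.9],
}

# Rank items by similarity to the user's embedding, best first.
ranked = sorted(items, key=lambda name: cosine(user, items[name]), reverse=True)
```

Using a plain dot product instead of `cosine` would additionally reward vectors with large magnitude, which may or may not be desirable depending on the application.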



Roadmap

• Preliminaries

• Notable graph properties

• Cool applications
  • Recommendation and prediction
  • Clustering
  • Anomaly detection

• Takeaways


Graph clustering for knowledge extraction

• Key problem: what can we learn about group dynamics from graph interactions? Are there natural “clusters” of behaviors?

• Key idea: tightly-knit graph interactions form graph clusters, which can indicate community behaviors. These are useful for
  • Behavioral understanding
  • Computational load balancing
  • Graph compression
  • Visualization

[Figure: graph with advertiser and query nodes]


Finding graph clusters

• Given a graph G, we want to find clusters

• Need to:
  • Formalize the notion of a cluster
  • Design an algorithm that will find sets of nodes that are good clusters

Content adapted from Leskovec ‘10


Clustering objective functions

• Essentially all objectives use the intuition that a good cluster S has
  • Many edges internally
  • Few edges pointing outside

• Simplest objective function:
  • Conductance: $\phi(S) = \frac{c}{2m + c}$, where m is the number of edges inside S and c is the number of edges pointing outside S

• Small conductance corresponds to good clusters

• There are many other formalizations of roughly this intuition
• Graph objectives are generally hard to optimize directly. Greedy/approximate algorithms are common


Clustering objective functions

• Single-criterion (considers either internal or external)
  • Modularity: $m - E(m)$
  • Modularity Ratio: $m / E(m)$
  • Volume: $\sum_u d(u) = 2m + c$
  • Edges cut: $c$

• Multi-criterion (considers both)
  • Conductance: $c / (2m + c)$
  • Expansion: $c / n$
  • Density: $1 - m / n^2$
  • CutRatio: $c / (n(N - n))$
  • Normalized Cut: $c / (2m + c) + c / (2(M - m) + c)$
  • Max-ODF: max frac. of edges of a node pointing outside S
  • Average-ODF: avg. frac. of edges of a node pointing outside S
  • Flake-ODF: frac. of nodes with more than ½ of their edges inside S

n: nodes in S
m: edges in S
c: edges pointing outside S
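A small sketch of how n, m and c can be counted for a candidate cluster S, and how conductance c/(2m+c) and expansion c/n follow from them. The graph (two triangles joined by one bridge edge) and the function names are illustrative.

```python
def cluster_stats(edges, S):
    """Return (n, m, c) for node set S in an undirected edge list."""
    S = set(S)
    m = sum(1 for u, v in edges if u in S and v in S)     # edges inside S
    c = sum(1 for u, v in edges if (u in S) != (v in S))  # edges leaving S
    return len(S), m, c

def conductance(edges, S):
    n, m, c = cluster_stats(edges, S)
    return c / (2 * m + c)  # undefined if S touches no edges at all

def expansion(edges, S):
    n, m, c = cluster_stats(edges, S)
    return c / n

# Hypothetical graph: two triangles joined by the bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
good = conductance(edges, {0, 1, 2})  # one triangle: 1 / 7
bad = conductance(edges, {0, 3})      # no internal edges: 1.0
```

The triangle scores low (good) because only the bridge leaves it, while the arbitrary pair `{0, 3}` scores the worst possible value because every one of its edges points outside.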


Multiple types of clustering algorithms

• Global spectral
  • Compute graph Laplacian matrix L = D – A
  • Find the eigenvector for the 2nd smallest eigenvalue of L
  • Split by sign to get a partitioning of nodes (related to graph “cut”)
  • Recurse to get more clusters

• Local spectral
  • Pick random seed node
  • Build local clusters around seed nodes based on random walk/PageRank
  • Prune cluster from graph and repeat
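The global spectral steps can be sketched without a linear algebra library by running power iteration on a shifted Laplacian. This is an illustrative approach, not necessarily how production tools do it: it assumes a connected, undirected graph with sortable node labels, and it approximates the eigenvector of the 2nd smallest eigenvalue of L = D - A by repeatedly removing the constant component.

```python
def fiedler_split(adj, iters=500):
    """Split nodes by the sign of an approximate 2nd-smallest eigenvector of L."""
    nodes = sorted(adj)
    idx = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    deg = [len(adj[v]) for v in nodes]
    shift = 2 * max(deg)  # upper bound on the eigenvalues of L = D - A

    def step(x):
        # y = (shift * I - L) x = (shift - deg) * x + A x
        y = [(shift - deg[i]) * x[i] for i in range(n)]
        for v in nodes:
            for w in adj[v]:
                y[idx[v]] += x[idx[w]]
        return y

    x = [float(i) for i in range(n)]  # arbitrary non-constant start vector
    for _ in range(iters):
        mean = sum(x) / n
        x = [xi - mean for xi in x]   # project out the constant eigenvector
        x = step(x)
        norm = max(abs(xi) for xi in x)
        x = [xi / norm for xi in x]
    mean = sum(x) / n
    x = [xi - mean for xi in x]
    left = {v for v in nodes if x[idx[v]] < 0}
    return left, set(nodes) - left

# Hypothetical graph: two triangles joined by the bridge edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
left, right = fiedler_split(adj)
```

On this toy graph the sign split separates the two triangles, cutting only the bridge edge. Recursing on each side would yield more clusters.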


Flow-based algorithms

• METIS: multi-level graph partitioning
  • If it’s too expensive to partition a big graph… coarsen it into a smaller graph
  • If it’s still too big, keep coarsening

• Compute a partition and uncoarsen the graph

• Improve heuristically
  • Swap vertices
  • Local search


Measuring clustering algorithm performance

• How to quantify performance:
  • What is the score of clusters across a range of sizes?

• Network Community Profile (NCP) (Leskovec ‘08)
  • The score of the best cluster of size k


NCPs for a real graph (LiveJournal)

• 500 node comms. from Local Spectral

• 500 node comms. from METIS

Interestingly, Local Spectral clusters are more compact and tighter, despite having higher (worse) conductance than METIS!


NCPs for various objectives (Local Spectral)

• Multiple objectives can be pretty similar
  • Conductance
  • Expansion
  • Normalized Cut
  • Cut-ratio
  • Avg-ODF

• Max-ODF prefers small clusters, Flake-ODF prefers large clusters
• Internal density not very good (large clusters are very sparse)


You should know…

• Many types of clustering objectives and algorithms -- can use NCP to analyze them
• Not many “good” large clusters – real graphs are complicated!

• Different objectives are biased toward different aspects (cluster size, internal and external connectivity)

• Overemphasis on clustering objectives can actually lead to “bad” looking clusters according to human intuition


Roadmap

• Preliminaries

• Notable graph properties

• Cool applications
  • Recommendation and prediction
  • Clustering
  • Anomaly detection

• Takeaways


Graph-based anomaly detection

• Key problem: what kinds of anomalous behaviors exist in real graphs, and can we find such anomalies automatically?

• Key idea: we can identify various types of “anomalous” behaviors by building null/normal models and penalizing excessive deviation
  • Node-based anomalies
  • Group anomalies (too large, too dense to be a real community)


Anomalies in graphs: important applications

• Email networks
  • Spammers

• Computer networks
  • Hackers/port scanning

• Phone-call networks
  • Telemarketers

• Social networks
  • Fake engagement


Major goal

• How to go from a graph to a quantitative model/pattern?


Local, egonet-based anomaly detection

• What does a typical node look like?
  • Can’t say much about just a node in isolation
  • Let’s consider the egonets!

• For each node,
  • extract egonet (=1-step-away neighbors)
  • extract features (#edges, total weight, etc.)
  • extract patterns (norms)
  • compare with the rest of the population (detect anomalies)

[Figure: users-by-users graph]

Content adapted from Akoglu ‘10


What is anomalous?

• Not obvious!


What is anomalous?

• Near-star: telemarketer, port scanner, people adding friends indiscriminately, etc.
• Near-clique: tightly connected people, terrorist groups?, discussion group, etc.
• Heavy vicinity: too much money wrt number of accounts, high donation wrt number of donors, etc.
• Dominant heavy link: single-minded, tight company


Basic features to study

• $N_i$: number of neighbors (degree) of ego i
• $E_i$: number of edges in egonet i
• $W_i$: total weight of egonet i
• $\lambda_{w,i}$: 1st (principal) eigenvalue of the weighted adjacency matrix of egonet i
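A minimal sketch of extracting the first three features for every node, assuming the egonet is the ego, its neighbors, and all edges among them. The weighted graph is hypothetical, and the eigenvalue feature is left out because it needs an eigen-solver.

```python
# Hypothetical weighted undirected graph: adj[u][v] = weight of edge u-v.
adj = {
    "a": {"b": 2.0, "c": 1.0, "d": 5.0},
    "b": {"a": 2.0, "c": 3.0},
    "c": {"a": 1.0, "b": 3.0},
    "d": {"a": 5.0},
}

def egonet_features(adj, ego):
    """Return (N_i, E_i, W_i) for the 1-hop egonet of `ego`."""
    members = {ego} | set(adj[ego])
    n_i = len(adj[ego])                 # number of neighbors of the ego
    e_i, w_i = 0, 0.0
    for u in members:
        for v, w in adj[u].items():
            if v in members and u < v:  # count each undirected edge once
                e_i += 1
                w_i += w
    return n_i, e_i, w_i

features = {u: egonet_features(adj, u) for u in adj}
```

These per-node tuples are the raw points that the power-law observations are fitted on, one point per egonet.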


Obs. 1: Egonet Density Power Law

$E_i \propto N_i^{\alpha}$, with $1 \le \alpha \le 2$

Differentiates “dense” from “sparse” neighborhoods


Obs. 2: Egonet Weight Power Law

$W_i \propto E_i^{\beta}$, with $\beta \ge 1$

Differentiates “heavy” from “light” neighborhoods


Obs. 3: Egonet $\lambda_w$ Weight Power Law

$\lambda_{w,i} \propto W_i^{\gamma}$, with $0.5 \le \gamma \le 1$

Differentiates “uniform” distribution from “dominant” heavy edges


Scoring node anomalies

Anomaly ≈
• violates our “laws”
• far away from most points

• score_dist = distance to fitting line
• score_outl = outlierness score
• score = func(score_dist, score_outl)
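The slide does not give formulas for these scores, so the following is only one plausible instantiation. It assumes a distance-to-line score that multiplies the ratio between observed and expected values by the log of their absolute gap, and a combining function that sums max-normalized scores. The fitted law `E = 1.0 * N**1.5` and the three egonets are hypothetical.

```python
import math


def score_dist(x, y, c, alpha):
    """Deviation of the point (x, y) from the fitted power law y = c * x**alpha.

    Assumed form: (larger / smaller of observed and expected) * log(gap + 1).
    """
    expected = c * x ** alpha
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1.0)


def combined_score(dist, outl, max_dist, max_outl):
    """One simple choice of func: sum of the two max-normalized scores."""
    return dist / max_dist + outl / max_outl


# Hypothetical egonets given as (N_i, E_i), scored against E = 1.0 * N**1.5.
egonets = {"typical": (16, 64), "near_star": (16, 16), "near_clique": (16, 136)}
scores = {name: score_dist(n, e, 1.0, 1.5) for name, (n, e) in egonets.items()}

# Sort from most to least anomalous.
ranked = sorted(scores, key=scores.get, reverse=True)
```

A point on the fitted line scores zero, while points above or below it score higher the further they deviate, so both near-stars and near-cliques surface at the top of the ranking.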


Triaging anomalies

✓ can interpret the type of anomaly

✓ can sort nodes wrt their outlierness scores


Interesting results: Blog post-to-post graph

• Part of a group of posts that all link to each other
• Post linking to many other posts indiscriminately


Interesting results: Committee-to-candidate donations graph

• $87M - DNC
• $25M - RNC


Interesting results: Author-to-conference publishing graph

Has published 40 papers, but to the same conference (and nowhere else)

Have published hundreds of papers, to almost as many conferences!


Group anomalies on graphs

[Figure labels: Alice; Alice’s, Bob’s, Carol’s]

Content adapted from Shin ‘16


Fraud forms dense blocks

[Figure: adjacency matrix with axes Accounts and Restaurants]


Tensor modeling for attributed graphs

• Natural dense blocks are sparse on the time axis (formed gradually)
• Suspicious dense blocks are also dense on the time axis (due to synchronous behavior)
• Suspicious dense blocks are denser than natural dense blocks in the tensor model

[Figure: adjacency tensor with axes Accounts, Restaurants, and Timestamp; one block is sparse along the time axis, another is dense]

A cell indicates that account i rates restaurant j at time t
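A sparse dictionary keyed by (account, restaurant, timestamp) is one simple way to hold such a tensor. The sketch below uses made-up ratings and hypothetical helper names to show how a gradually formed block spreads thinly over the time axis while a synchronized block piles up in a single time slice.

```python
from collections import Counter


def build_tensor(ratings):
    """Count how often each (account, restaurant, timestamp) cell occurs."""
    return Counter((a, r, t) for a, r, t in ratings)


def time_slice_mass(tensor, accounts, restaurants):
    """Mass per timestamp inside the block accounts x restaurants."""
    mass = Counter()
    for (a, r, t), value in tensor.items():
        if a in accounts and r in restaurants:
            mass[t] += value
    return mass


# Hypothetical ratings as (account, restaurant, day).
ratings = [
    # Organic reviews that accumulate gradually.
    ("u1", "r1", 1), ("u2", "r1", 9), ("u1", "r2", 17), ("u2", "r2", 30),
    # Coordinated reviews that all land on the same day.
    ("f1", "r9", 5), ("f2", "r9", 5), ("f1", "r8", 5), ("f2", "r8", 5),
]

tensor = build_tensor(ratings)
natural = time_slice_mass(tensor, {"u1", "u2"}, {"r1", "r2"})
suspicious = time_slice_mass(tensor, {"f1", "f2"}, {"r8", "r9"})
```

Both blocks look identical in the account-by-restaurant matrix (four ratings in a 2 x 2 block), and only the time mode separates them.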


Applications

• Dense blocks signal anomalies/fraud in many multi-attribute graphs
  • TCP dumps: Src IP × Dst IP × Timestamp
  • Time-evolving social network: Src User × Dst User × Timestamp
  • Wikipedia revision history: User × Page × Timestamp


How to find dense blocks in such tensors?

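• Exact solutions are combinatorial and intractable
• Greedy solutions and heuristics are practical (i.e. greedily optimize a “suspiciousness” metric)
• What metric?

Assume a block (subtensor) $\mathcal{B}$ with side lengths $n_1$, $n_2$, $n_3$ in a 3-way tensor $\mathcal{R}$
• $\mathrm{Size}(\mathcal{B})$: $n_1 + n_2 + n_3$
• $\mathrm{Vol}(\mathcal{B})$: $n_1 \times n_2 \times n_3$
• $\mathrm{Mass}(\mathcal{B})$: sum of entries in $\mathcal{B}$

Some notable choices:

Traditional Density: $\rho_{\mathrm{trad}}(\mathcal{B}, \mathcal{R}) = \mathrm{Mass}(\mathcal{B}) / \mathrm{Vol}(\mathcal{B})$ (maximized by a single entry with max. value)

Arithmetic Avg. Degree: $\rho_{\mathrm{ari}}(\mathcal{B}, \mathcal{R}) = \mathrm{Mass}(\mathcal{B}) / \mathrm{Size}(\mathcal{B})$

Geometric Avg. Degree: $\rho_{\mathrm{geo}}(\mathcal{B}, \mathcal{R}) = \mathrm{Mass}(\mathcal{B}) / \sqrt[3]{\mathrm{Vol}(\mathcal{B})}$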
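The three metrics are easy to compare on a toy example. The sketch below assumes a dense 3-way tensor stored as nested lists and reads the geometric average degree as Mass divided by the cube root of Vol. The function name `densities` and the tensor values are hypothetical.

```python
def densities(tensor, rows, cols, times):
    """Traditional, arithmetic and geometric density of one block.

    `tensor[i][j][k]` is a dense 3-way tensor; `rows`, `cols` and `times`
    list the indices that form the block in each mode.
    """
    mass = sum(tensor[i][j][k] for i in rows for j in cols for k in times)
    size = len(rows) + len(cols) + len(times)
    vol = len(rows) * len(cols) * len(times)
    return {
        "traditional": mass / vol,
        "arithmetic": mass / size,
        "geometric": mass / vol ** (1.0 / 3.0),
    }


# Hypothetical 2 x 2 x 2 tensor.
tensor = [
    [[5, 0], [3, 0]],
    [[4, 0], [6, 1]],
]

whole = densities(tensor, [0, 1], [0, 1], [0, 1])
single = densities(tensor, [1], [1], [0])  # only the largest entry
```

The single largest entry beats the whole tensor under traditional density, which is why that metric is a poor target for block detection, while the arithmetic average degree prefers the larger block here.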


Detecting a single dense block

• Greedy search method
• Starts from the entire tensor (ρ = 2.9 in the running example)
• Repeatedly remove a slice to maximize density ρ (in the example, ρ rises to 3, then 3.3, then 3.6)
• Output: return the densest block so far (ρ = 3.6 in the example)

[Figure: running example on a small tensor whose front slice has rows (5 3 0), (4 6 1), (2 0 0)]
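The greedy procedure can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' code. It uses Mass / Size as the density, removes the globally lightest slice at each step, and breaks ties in favor of the last mode so that the removal order follows the slide sequence. The back slice of the example tensor is only partly visible in the slides, so its values here are an assumption, chosen to be consistent with the densities shown.

```python
def greedy_dense_block(tensor):
    """Greedy slice removal on a dense 3-way tensor (nested lists).

    Returns (best_density, best_block, history), where best_block holds
    the kept indices per mode and history lists the density after each step.
    """
    kept = [
        set(range(len(tensor))),
        set(range(len(tensor[0]))),
        set(range(len(tensor[0][0]))),
    ]

    def slice_mass(mode, index):
        # Mass of one slice, restricted to the indices that are still kept.
        ranges = [sorted(s) for s in kept]
        ranges[mode] = [index]
        return sum(tensor[i][j][k]
                   for i in ranges[0] for j in ranges[1] for k in ranges[2])

    mass = sum(slice_mass(0, i) for i in kept[0])
    size = sum(len(s) for s in kept)
    history = [mass / size]
    best_density = history[0]
    best_block = tuple(sorted(s) for s in kept)

    while True:
        candidates = [(slice_mass(mode, idx), mode, idx)
                      for mode in range(3) for idx in kept[mode]]
        # Every removal shrinks Size by one, so the lightest slice is best.
        # Tie-break (an assumption): prefer the last mode, then low index.
        removed, mode, idx = min(candidates, key=lambda c: (c[0], -c[1], c[2]))
        kept[mode].remove(idx)
        if not kept[mode]:
            break  # the block is empty, nothing left to score
        mass -= removed
        size -= 1
        history.append(mass / size)
        if history[-1] > best_density:
            best_density = history[-1]
            best_block = tuple(sorted(s) for s in kept)
    return best_density, best_block, history


# tensor[i][j][k]: front slice (k = 0) from the slides, back slice assumed.
tensor = [
    [[5, 1], [3, 0], [0, 1]],
    [[4, 0], [6, 0], [1, 0]],
    [[2, 0], [0, 0], [0, 0]],
]

best_density, best_block, history = greedy_dense_block(tensor)
```

The search keeps going after the peak (the density drops once the 2 x 2 core starts losing slices), which is why the best block seen so far has to be remembered rather than the final one.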


Handling multiple blocks

• Remove found blocks before finding others

[Figure: Find & Remove, Find & Remove, Find & Remove, then Restore]
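The find-and-remove loop can be illustrated on a small matrix. The sketch below makes two simplifying assumptions: it works on a 2-way matrix rather than a tensor, and it uses brute-force search over all row and column subsets as a stand-in for the greedy single-block search. All names and values are hypothetical. "Restore" is modeled by reporting each block's mass from the original matrix rather than from the working copy.

```python
from itertools import combinations


def nonempty_subsets(n):
    """Yield every non-empty subset of range(n) as a tuple."""
    for r in range(1, n + 1):
        yield from combinations(range(n), r)


def densest_block(matrix):
    """Brute-force the block maximizing Mass / Size (tiny inputs only)."""
    best = None
    for rows in nonempty_subsets(len(matrix)):
        for cols in nonempty_subsets(len(matrix[0])):
            mass = sum(matrix[i][j] for i in rows for j in cols)
            density = mass / (len(rows) + len(cols))
            if best is None or density > best[0]:
                best = (density, rows, cols)
    return best[1], best[2]


def find_blocks(matrix, k):
    """Find k blocks, zeroing each one out before searching for the next."""
    work = [row[:] for row in matrix]  # working copy, the input is untouched
    blocks = []
    for _ in range(k):
        rows, cols = densest_block(work)
        for i in rows:
            for j in cols:
                work[i][j] = 0  # remove the found block
        # Restore: report the block with its entries from the original data.
        original_mass = sum(matrix[i][j] for i in rows for j in cols)
        blocks.append((rows, cols, original_mass))
    return blocks


matrix = [
    [5, 5, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 3, 3],
]
blocks = find_blocks(matrix, 2)
```

Without the removal step, a second search would simply return the first block again.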


Algorithm details

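• Theorem 1 [Remove Minimum Mass First]: Among slices in the same mode, removing the slice with minimum mass is always best

[Figure: slice masses 12 > 9 > 2]

• Theorem 2 [Approximation Guarantee]: $\rho(\mathcal{B}, \mathcal{R}) \ge \frac{1}{N}\,\rho(\mathcal{B}^{*}, \mathcal{R})$, where $\rho$ is the density metric, $\mathcal{R}$ the input tensor, $N$ its order, and $\mathcal{B}^{*}$ the densest block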
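The intuition behind Theorem 1 can be written out for the Mass / Size metric. This is a short sketch of the reasoning rather than the formal proof. Here $M$ and $S$ denote the mass and size of the current block, and $m_s$ the mass of a slice $s$ (symbols introduced for illustration).

```latex
% Removing any slice s from a fixed mode lowers Size by exactly one,
% so the density after removal depends on s only through its mass:
\[
  \rho_{\text{after}}(s) = \frac{M - m_s}{S - 1}.
\]
% For two slices s and s' in the same mode the denominators agree, hence
\[
  m_s \le m_{s'} \;\Longrightarrow\; \rho_{\text{after}}(s) \ge \rho_{\text{after}}(s').
\]
% With slice masses 12 > 9 > 2 as in the figure:
\[
  \frac{M - 12}{S - 1} \;<\; \frac{M - 9}{S - 1} \;<\; \frac{M - 2}{S - 1}.
\]
```

The same argument carries over to the other two metrics, since Vol after removal also depends only on the mode and not on which slice within it is removed.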


Practical discoveries

TCP connections forming the densest blocks are network attacks

[Figure: first three blocks found in the Src IP × Dst IP × Timestamp tensor]


Practical discoveries

First three blocks found by M-Zoom

Page edit wars: 11 users revised 10 pages 2,305 times within 16 hours

[Figure: User × Page × Timestamp tensor]


Takeaways

• Graphs provide a means of describing interactions between objects

• Almost all real graphs are “non-random” and obey various patterns

• Considerable literature in graph mining focuses on learning to leverage large-scale interaction patterns to
  • Recommend users new content based on what they might like
  • Identify interesting group behaviors and community norms
  • Discover abnormalities that correspond to fraud or “audit-worthy” events

