Graph Summarisation Jilles Vreeken 10 July 2015
  • Graph Summarisation

    Jilles Vreeken

    10 July 2015

  • The Case of The Lost Pen

    -- or –

    The Case of the Found Pen

    Service Announcement #0

  • Next week, a guest lecture

    Mining Data that Changes

    by dr. Pauli Miettinen (MPI-INF)

    Service Announcement #1

  • Exam.

    Oral.

    3rd and 4th of August.

    Timeslots to be decided.

    Mail me if you want to participate, let me know if you have a preferred time/day.

    Service Announcement #2

  • Service Announcement #3

    Introduction

    Patterns

    Correlation and Causation

    Graphs

    Wrap-up +

    (Subjective) Interestingness

    ?

    Yes! Prepare questions on anything* you’ve always wanted to ask me.

    Mail them to me in advance, or have me answer on the spot

    * preferably related to TADA, data mining, machine learning, science, the world, etc.

  • Question of the day

    How can we summarise

    the main structure of a graph in easily understandable terms?

  • Graphs

    Graphs are everywhere

    Everything* can be represented as a graph

    * almost

  • Graphs, formally

    We consider graphs $G = (V, E)$ with $V$ the set of $n$ nodes,

    and $E$ a set of $m$ edges between nodes

    In general, nodes can have labels, and

    edges can have labels, weights and can be directed.

  • Real world graphs

    road networks

    social networks

    biological networks

    cellular networks

    relational databases

  • Real world graphs

    the internet

  • Graphs, formally

    Today we consider unlabeled unweighted undirected graphs.

    The adjacency matrix $A$ is then an $n \times n$ matrix $A \in \{0,1\}^{n \times n}$ where

    a cell $a_{i,j} = 1$ iff $(i, j) \in E$ and 0 otherwise.

    We call the number of edges $d_i$ of a node $i$ its degree.
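
    A minimal sketch of these definitions in Python, assuming nodes are numbered 0 to n-1 and the graph is given as an edge list. The function names and the toy graph are illustrative and not part of the lecture.

```python
def adjacency_matrix(n, edges):
    """Build the n x n 0/1 adjacency matrix of an undirected graph."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
        A[j][i] = 1  # undirected, so the matrix is symmetric
    return A


def degrees(A):
    """The degree of node i is the number of ones in row i."""
    return [sum(row) for row in A]


# Toy graph (an assumption for illustration): a triangle 0-1-2 with a tail 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
A = adjacency_matrix(4, edges)
d = degrees(A)
```

    Storing the full matrix costs n² cells, which is why the later slides worry about space for large graphs.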

  • Why summarisation?

    Visualization

    Guiding attention

  • Staring at an Adjacency Matrix

  • Staring at a Hairball

    Nodes: wiki editors

    Edges: co-edited

    I don’t see anything!

  • Example: Wikipedia Controversy

    Nodes: wiki editors

    Edges: co-edited

    Stars: admins, bots, heavy users

    Bipartite cores: edit wars

    Kiev vs. Kyiv vandals

  • Summary Statistics

    For ‘normal’ data, we can get insight by taking an average.

    What kind of summary statistics do we have for graphs?

    Average degree. Not very insightful.

  • Summary Statistics

    Degree plots

  • Powerlaws

  • Summary Statistics

    Cluster coefficient (global)

    How clustered are the nodes in the graph?

    $$C = \frac{\text{number of closed triplets}}{\text{number of connected triplets of vertices}}$$

    (each triangle contributes three closed triplets)

    Counting triangles requires matrix multiplication, which takes $O(n^{\omega})$ time where $\omega < 2.376$, but takes $O(n^2)$ space.

    (but fast estimators exist)
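
    To make the ratio concrete, here is a small sketch that enumerates connected triplets directly, rather than using the matrix multiplication mentioned above. It assumes the graph is a dict mapping each node to the set of its neighbours, with no self-loops; the names and the toy graph are illustrative.

```python
from itertools import combinations


def global_clustering(adj):
    """Closed triplets divided by all connected triplets."""
    closed = 0    # triplets j - i - k whose end points j and k are also linked
    triplets = 0  # all triplets j - i - k, counted once per centre node i
    for i, neighbours in adj.items():
        for j, k in combinations(sorted(neighbours), 2):
            triplets += 1
            if k in adj[j]:
                closed += 1
    return closed / triplets if triplets else 0.0


# Toy graph (an assumption for illustration): a triangle 0-1-2 with a tail 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
C = global_clustering(adj)
```

    On this toy graph there are 5 connected triplets, 3 of which are closed (the single triangle seen from each of its corners).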

  • Summary Statistics

    Cluster coefficient (local)

    How close is the neighborhood of node $i$ to being a clique?

    $$C_i = \frac{2\,|\{(j, k) \in E : j, k \in N_i\}|}{d_i (d_i - 1)}$$

    which is $O(d_i^2)$ time at $O(n^2)$ space
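
    A minimal sketch of the local coefficient, assuming the graph is a dict from node to neighbour set (which avoids the full adjacency matrix). Returning 0 for nodes with fewer than two neighbours is a convention chosen here, not something the slide specifies.

```python
from itertools import combinations


def local_clustering(adj, i):
    """Fraction of pairs of neighbours of i that are themselves linked."""
    neighbours = adj[i]
    d = len(neighbours)
    if d < 2:
        return 0.0  # convention assumed here: undefined cases count as 0
    links = sum(1 for j, k in combinations(neighbours, 2) if k in adj[j])
    return 2 * links / (d * (d - 1))


# Toy graph (an assumption for illustration): a triangle 0-1-2 with a tail 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
c0 = local_clustering(adj, 0)
c2 = local_clustering(adj, 2)
c3 = local_clustering(adj, 3)
```

    The double loop over neighbour pairs is where the quadratic dependence on the degree comes from.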

  • Summary Statistics

    Diameter

    The longest shortest path between two nodes.

    Requires calculating all shortest paths.

    Calculating shortest paths takes $O(n^2)$.

    So, no.
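
    For small graphs the diameter can still be computed exactly. The sketch below runs one breadth-first search per node, assuming an unweighted, connected graph stored as a dict from node to neighbour set; the helper names are illustrative.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter(adj):
    """Longest shortest path; assumes the graph is connected."""
    return max(max(bfs_distances(adj, s).values()) for s in adj)


# Toy graphs (assumptions for illustration).
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
triangle_with_tail = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

    Every node starts its own search, so the work grows at least quadratically with the number of nodes, which is the cost the slide objects to.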

  • Scalability

    Many real world graphs are big,

    with 𝑛 in the order of millions.

    $O(n^2)$ is very scary for a graph miner.

    Current-day graph mining algorithms

    need to be linear in the number of edges,

    or else your paper will almost surely be rejected.

    What are the implications?

  • Summarising a Graph

    Given: a graph

    Find: a succinct summary with possibly overlapping subgraphs

    ≈ important graph structures.

  • Community Detection

    [Figure: adjacency matrix of the assumed graph]

  • Community Detection

    [Figure: adjacency matrix of the real graph]

  • Summarising a Graph

    Fully Automatic Cross Associations

    is a nice MDL based algorithm to summarise a matrix.

    1) REASSIGN: Given a grid, assign rows and columns

    s.t. entropy within the grid is minimal.

    2) CROSSASSOC: Find cluster with highest entropy, split it, run REASSIGN.

    Stop when no split reduces the MDL score.

    (Chakrabarti et al. 2004)
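
    The quantity that REASSIGN tries to reduce can be sketched as follows: given a grid (a group label per row and per column), each block is encoded with the binary entropy of its density. This is only the per-block data cost under that assumption; the cost of describing the grid itself and the REASSIGN and CROSSASSOC loops are left out, and all names are illustrative.

```python
from math import log2


def binary_entropy(p):
    """Entropy in bits of a 0/1 source with P(1) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def block_code_cost(A, row_group, col_group):
    """Bits to encode all cells of A, block by block, for a given grid."""
    ones, cells = {}, {}
    for r, row in enumerate(A):
        for c, value in enumerate(row):
            block = (row_group[r], col_group[c])
            cells[block] = cells.get(block, 0) + 1
            ones[block] = ones.get(block, 0) + value
    return sum(cells[b] * binary_entropy(ones[b] / cells[b]) for b in cells)


# Toy matrix (an assumption for illustration): two dense 2 x 2 blocks.
A = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
good = block_code_cost(A, [0, 0, 1, 1], [0, 0, 1, 1])
bad = block_code_cost(A, [0, 0, 0, 0], [0, 0, 0, 0])
```

    With the matching grid every block is all ones or all zeros and costs nothing; with a single block, half the cells are ones and each cell costs a full bit.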

  • Beyond Cave-men Communities

    Traditional community detection

    algorithms assume that you interact

    only with people in your ‘cave’.

    You are assumed not to interact

    with others, except if you are one

    of few ‘messengers’ between ‘caves’.

    That is not very realistic.

    (Kang & Faloutsos, ICDM 2011)

  • Slash’n’Burn

    Slash’n’Burn finds the node $i$ with highest $d_i$, removes its edges $N_i$, and recurses.

    SLASHBURN:

    1. Slash top-$k$ hubs, burn edges

    2. Repeat on the remaining GCC (giant connected component)

    [Figure: the graph before and after slashing]

    (Kang & Faloutsos, ICDM 2011)
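
    The two steps can be sketched in a few lines. This is a simplified reading of the slide, not the authors' implementation: it assumes an undirected graph without self-loops stored as a dict from node to neighbour set, k ≥ 1, and ties between equal-degree nodes broken by insertion order. The function names are illustrative.

```python
def connected_components(adj):
    """Connected components of an undirected graph (node -> neighbour set)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component, stack = set(), [start]
        while stack:
            u = stack.pop()
            component.add(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(component)
    return components


def slashburn(adj, k=1):
    """Return (hubs in removal order, pieces split off along the way)."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    hubs, pieces = [], []
    while len(adj) > k:
        # Slash: remove the k highest-degree nodes and burn their edges.
        top = sorted(adj, key=lambda u: len(adj[u]), reverse=True)[:k]
        for hub in top:
            for v in adj.pop(hub):
                adj[v].discard(hub)
            hubs.append(hub)
        # Repeat on the giant connected component; keep the smaller pieces.
        components = sorted(connected_components(adj), key=len, reverse=True)
        pieces.extend(components[1:])
        gcc = components[0]
        adj = {u: vs for u, vs in adj.items() if u in gcc}
    if adj:
        pieces.append(set(adj))
    return hubs, pieces


# Toy graph (an assumption for illustration): hub 0 with spokes 1-4, chain 4-5-6.
graph = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0},
         4: {0, 5}, 5: {4, 6}, 6: {5}}
hubs, pieces = slashburn(graph, k=1)
```

    On the toy graph the first slash removes node 0 and frees its spokes; the second removes node 5 from what is left of the chain.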

  • Beyond Cave-men Communities

    Slash’n’Burn applied on the

    AS-Oregon graphs shows that

    real graphs indeed have structure

    beyond cave-men communities!

    – but also include those!

    A nice side-result is that the

    Slash’n’Burned ordered matrix

    has lots of ‘empty space’ and

    can hence be stored efficiently.

    (Kang & Faloutsos, ICDM 2011)

  • Carnegie Mellon University

    Korea Advanced Institute of Science and Technology

    VoG: Summarizing and Understanding Large Graphs

    Danai Koutra

    Jilles Vreeken

    U Kang

    Christos Faloutsos

    SDM, 25 April 2014, Philadelphia, USA

  • Main Idea

    1) Use a graph vocabulary: [Figure: the structure types in the vocabulary]

    2) Best graph summary = shortest lossless description

    optimal compression (MDL)

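  • Minimum Description Length

    Given a set of models $\mathcal{M}$, the best model $M \in \mathcal{M}$ is

    $$\arg\min_{M \in \mathcal{M}} \; L(M) + L(D \mid M)$$

    where $L(M)$ is the # bits for $M$ and $L(D \mid M)$ is the # bits for the data using $M$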

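  • MDL example

    [Figure: the same data fitted by a line, $a_1 x + a_0$, and by a degree-10 polynomial, $a_{10} x^{10} + a_9 x^9 + \dots + a_0$; $L(M)$ covers the model and $L(D \mid M)$ covers the errors]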

  • Minimum Graph Description

    Given:
    - a graph $G$ with adjacency matrix $A$
    - vocabulary $\Omega$

    Find: model $M$ s.t. $L(G, M) = L(M) + L(E)$ is minimal

    [Figure: adjacency $A$, model $M$, error $E$]

  • VoG: Overview

    [Figure: overview of the VoG pipeline, built up over several slides; surviving labels: “≈?”, “argmin”, “some criterion”, “Summary”]

  • We need candidate structures…

    … How can we get them?

  • Step 1: Graph Decomposition

    We can use:

    Any decomposition method

    We did use/adapt:

    SLASHBURN

  • SnB Graph Decomposition

    Slash top-k hubs, burn edges

    [Figure: the graph before and after slashing; the pieces that split off are the candidate structures]

    Notice that the structures can overlap!

    Repeat on the remaining GCC

  • We got candidate structures.

    Now, how can we ‘label’ them?

  • Step 2: Graph Labeling

    [Figure: each candidate structure is compared (≈?) with the vocabulary types; argmin selects the label]

  • Graph Representations

    [Figure: questions to answer per structure type: which node is the hub? what is the “best” node split? what is the “best” node ordering? which edges are missing?]

  • Graph Representations (details)

    Star structure

    Hub: top-degree node

    Spokes: the rest

    $$L(st) = L_{\mathbb{N}}(|st| - 1) + \log n + \log \binom{n - 1}{|st| - 1} + L(E^+) + L(E^-)$$

    The terms encode, in order: the # of spokes, the hub ID, the spokes IDs, and the errors (extra and missing edges).

    [Figure: example star with $n = 7$ and 6 spokes]
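
    A minimal sketch of the structural part of the star cost, leaving out the error terms $L(E^+)$ and $L(E^-)$. It assumes that $L_{\mathbb{N}}$ is Rissanen's universal code for integers ≥ 1 and that all logarithms are base 2; the function names are illustrative.

```python
from math import comb, log2


def universal_integer_code_length(n):
    """Rissanen's universal code length in bits for an integer n >= 1."""
    bits = log2(2.865064)  # normalising constant of the code
    term = log2(n)
    while term > 0:        # log n + log log n + ..., positive terms only
        bits += term
        term = log2(term)
    return bits


def star_cost(n, star_size):
    """Bits for a star of star_size nodes in a graph of n nodes, no errors."""
    spokes = star_size - 1
    return (universal_integer_code_length(spokes)  # number of spokes
            + log2(n)                              # hub ID
            + log2(comb(n - 1, spokes)))           # spoke IDs


# The slide's example: n = 7 and 6 spokes, so the spoke IDs cost nothing.
cost = star_cost(n=7, star_size=7)
```

    When the star covers the whole graph there is only one possible choice of spokes, so the binomial term is zero; in a larger graph the same star becomes more expensive to point at.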

  • Graph Representations (details)

    Bipartite graph structure

    Max bipartite graph: NP-hard

    Heuristic: Belief Propagation with heterophily for node classification (blue/red)

    Encoded: the # of blue nodes, the # of red nodes, and their IDs

    Errors: $L(E^+) + L(E^-)$ (extra and missing edges)

  • Graph Representations

    Chain structure

    Longest path: NP-hard

    Heuristic: BFS + local search

    Errors: extra and missing edges

    [Figure: adjacency matrix with the nodes reordered along the chain]
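
    One common way to get a long path with BFS is a double sweep: search from any node to a farthest node, then search again from there. The sketch below shows only that BFS part and leaves out the local search; whether it matches the exact variant used in VoG is an assumption, and the names are illustrative.

```python
from collections import deque


def farthest_path(adj, source):
    """BFS from source; return a shortest path to a farthest node."""
    parent = {source: None}
    queue = deque([source])
    last = source
    while queue:
        last = queue.popleft()  # the last node dequeued is a farthest one
        for v in adj[last]:
            if v not in parent:
                parent[v] = last
                queue.append(v)
    path = []
    while last is not None:
        path.append(last)
        last = parent[last]
    return path[::-1]


def initial_chain(adj, start):
    """Double sweep: go to a farthest node, then search again from there."""
    first = farthest_path(adj, start)
    return farthest_path(adj, first[-1])


# Toy graph (an assumption for illustration): a path on five nodes.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
chain = initial_chain(path_graph, start=2)
```

    Starting in the middle of the path, the first sweep reaches one end and the second sweep recovers the whole chain.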

  • Step 2: Graph Labeling

    [Figure: each candidate structure receives the argmin label from the vocabulary]

  • Step 3: Summary Assembly

    [Figure: the labelled structures are assembled into the Summary]

  • Concepts (details)

    Savings (compression gain) = # bits as noise − # bits as structure

    That is, the bits saved by encoding a subgraph as a structure rather than as noise.

  • Concepts

    Summary encoding cost

    $$L(M) = L_{\mathbb{N}}(|M| + 1) + \log \binom{|M| + |\Omega| - 1}{|\Omega| - 1} + \sum_{s \in M} \big( -\log \Pr(x(s) \mid M) + L(s) \big)$$

    The terms encode, in order: the # of structures; the # of structures per type; and, for each structure, its type and its encoding length (its connectivity).
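
    A sketch of the summary encoding cost, under stated assumptions: the formula is taken to be $L(M) = L_{\mathbb{N}}(|M|+1) + \log \binom{|M|+|\Omega|-1}{|\Omega|-1} + \sum_{s \in M} (-\log \Pr(x(s) \mid M) + L(s))$, the probability of a type is estimated as its relative frequency in the summary, and the per-structure costs $L(s)$ are given as plain numbers. All names and the example values are illustrative.

```python
from collections import Counter
from math import comb, log2


def L_N(n):
    """Rissanen's universal code length in bits for an integer n >= 1."""
    bits = log2(2.865064)
    term = log2(n)
    while term > 0:
        bits += term
        term = log2(term)
    return bits


def model_cost(structure_types, structure_costs, vocabulary_size):
    """Bits to describe a summary: counts, types, then each structure."""
    m = len(structure_types)
    counts = Counter(structure_types)
    bits = L_N(m + 1)                                   # number of structures
    bits += log2(comb(m + vocabulary_size - 1,
                      vocabulary_size - 1))             # structures per type
    for kind, cost in zip(structure_types, structure_costs):
        bits += -log2(counts[kind] / m)                 # type of the structure
        bits += cost                                    # its connectivity, L(s)
    return bits


# Hypothetical summary: three structures, one per type, made-up costs in bits.
total = model_cost(["star", "chain", "clique"], [10.0, 12.0, 8.0],
                   vocabulary_size=6)
```

    Adding a structure to the summary always increases this cost, so a structure only pays off if it saves more bits in the error than it adds here.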

  • Step 3: Summary Assembly (details)

    [Figure: total encoded length $L(D, M)$ plotted against the structures in the summary]

  • Quantitative Analysis

    Real graphs have structure! (We can save bits by encoding with structures!)

    [Figure: bar chart of bits needed and unexplained edges, on a 0% to 100% axis, for Plain, Top-10, Top-100 and G&F; annotated with 4292729 bits as noise]

  • Quantitative Analysis

    Main structure types: stars, near- and full-bipartite cores.

    [Figure: number of structures per type (star, near-bipartite, full clique, full bipartite, chain) on a logarithmic axis, for Plain, Top-10, Top-100 and G&F]

  • Qualitative Analysis: Enron

    Top-3 Stars: klay [email protected]

    Top-1 NBC (near-bipartite core): ski excursion [email protected]

  • Runtime

    VoG is near-linear in the number of edges of the input graph.

  • Future Work

    Those of you interested in an MSc or RIL project…

    Our current vocabulary is [Figure: the current structure types]

    But many other structures make sense, for example “jellyfish” (Tauro, 2001)

  • Future Work

    Those of you who might be interested in an MSc or RIL project…

    It would be great if we could mine summaries directly from data

    … without pre-mining all candidate structures

    Real graphs show powerlaw-ish degree distributions,

    … would be great if VoG could take that into account

  • Conclusions

    Graphs need Summaries
    - graphs are powerful – but difficult to interpret
    - far too few (efficient) summary methods available

    Cross-Associations
    - powerful technique to find bi-clusters
    - heuristic, improvements exist

    Slash’n’Burn
    - reorders nodes of a graph
    - finds sub-graphs ‘beyond’ cave-men communities

    VoG
    - summarises graphs with a graph-theoretic vocabulary
    - first of its kind – but a big stack of heuristics
    - fast, good results.

  • Thank you!