Querying Big Graphs:
Theory and Practice
Wenfei Fan
School of Informatics
University of Edinburgh
Social networks modeled as graphs
[Figure: example social graph with nodes B, A1, …, Am, and several W nodes]
Node: person
report
Edge: relationship
supervise
Social graphs: Facebook, Twitter, LinkedIn, …
Find all matches of a pattern in a graph
Pattern matching in social graphs
Identify suspects
in a drug ring
“Understanding the structure of drug trafficking organizations”
[Figure: pattern graph (nodes B, A, S, W; edge labels 3, 3, 1) and data graph (nodes B, A1, …, Am, and several W nodes)]
Graph Pattern Matching
Input: a pattern graph Q and a data graph G
Output: all the matches of Q in G, i.e., all subgraphs of G that
are isomorphic to Q
Good for social network analysis?
Applications
• pattern recognition
• intelligence analysis
• transportation network analysis
• Web site classification,
• social position detection,
• user targeted advertising,
• knowledge base disambiguation …
a bijective function f on nodes:
(u,u’ ) ∈ Q iff (f(u), f(u’)) ∈ G
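The bijection condition above can be illustrated with a brute-force enumerator. This is only a minimal sketch for intuition: the function name and the list/tuple input encoding are assumptions, and the loop over all ordered choices of |VQ| data nodes is exactly the exponential behavior that makes the approach infeasible on big graphs.

```python
from itertools import permutations

def subgraph_isomorphisms(q_nodes, q_edges, g_nodes, g_edges):
    """Enumerate injective maps f from pattern nodes to data nodes such that
    (u, u2) is a pattern edge iff (f(u), f(u2)) is a data edge."""
    q_nodes = list(q_nodes)
    q_edges, g_edges = set(q_edges), set(g_edges)
    matches = []
    # Try every ordered choice of |VQ| distinct data nodes (exponential in |VQ|).
    for image in permutations(g_nodes, len(q_nodes)):
        f = dict(zip(q_nodes, image))
        if all(((u, u2) in q_edges) == ((f[u], f[u2]) in g_edges)
               for u in q_nodes for u2 in q_nodes):
            matches.append(f)
    return matches

# Pattern a -> b against the path 1 -> 2 -> 3: two matches.
print(subgraph_isomorphisms(["a", "b"], [("a", "b")], [1, 2, 3], [(1, 2), (2, 3)]))
```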
New challenges
Real-life social graphs are typically large
Facebook: more than 1 billion nodes,
and over 140 billion links
Is it feasible on big graphs?
Graph pattern matching is costly
• NP-complete to decide whether there exists a match
• possibly exponentially many matches
How long do we have to wait?
The good, the bad and the ugly
Traditional computational complexity theory of almost 50 years:
• The good: polynomial time computable (PTIME)
• The bad: NP-hard (intractable)
• The ugly: PSPACE-hard, EXPTIME-hard, undecidable…
Polynomial time queries become intractable on big data
What happens when it comes to big data?
Assuming SSD of 6GB/s. A linear scan of a data set D would take
• 1.9 days when D is of 1PB (10^15 B)
• 5.28 years when D is of 1EB (10^18 B)
O(n) time is already beyond reach on big data in practice!
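The scan times quoted above can be checked with a few lines of arithmetic. The sketch assumes that the scan rate means 6 × 10^9 bytes per second and that a year has 365.25 days; both are assumptions, not stated in the slides.

```python
SCAN_RATE = 6e9          # bytes per second (assumed meaning of the SSD rate)
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.25   # assumption: Julian year

def scan_seconds(size_bytes, rate=SCAN_RATE):
    """Time for one linear scan of size_bytes at the given rate."""
    return size_bytes / rate

pb_days = scan_seconds(1e15) / SECONDS_PER_DAY
eb_years = scan_seconds(1e18) / SECONDS_PER_DAY / DAYS_PER_YEAR
print(f"1 PB: {pb_days:.1f} days")    # 1 PB: 1.9 days
print(f"1 EB: {eb_years:.2f} years")  # 1 EB: 5.28 years
```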
Tractability revisited for big data
Properly contained in P, unless P = NC
[Figure: complexity classes: BD-tractable and not BD-tractable queries within PTIME, inside NP and beyond; graph pattern matching marked]
Coping with the sheer size of real-life graphs
1. Revising graph pattern matching
1) Bounded simulation
2) Incorporating edge relationships
2. Making big graphs small
1) Distributed graph pattern matching
2) Query preserving graph compression
3) Graph pattern matching using views
4) Incremental graph pattern matching
3. Approximate query answering
1) Relaxing the semantics of queries
2) Resource-bounded query answering
Joint work with Xin Wang and Yinghui Wu
Graph pattern matching for social network analysis
Pattern matching in social graphs
Not allowed by a bijection
A relation instead of a function
Subgraph isomorphism may be too strict for social data analysis
[Figure: pattern graph (nodes B, A, S, W; edge labels 3, 3, 1) and data graph (nodes B, A1, …, Am, and several W nodes)]
Graph simulation
Input: a pattern graph Q and a data graph G
Output: a binary relation S on the nodes of Q and G
Does this suffice?
• each node u in Q is mapped to a node v in G, such that
(u, v)∈ S
• for each (u,v)∈ S, each edge (u,u’) in Q is mapped to an
edge (v, v’ ) in G, such that (u’,v’ )∈ S
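The two conditions above can be turned into a simple fixpoint computation: start from all candidate pairs and repeatedly drop pairs that violate the edge condition. The sketch below is a naive refinement loop, not the quadratic-time algorithm; the function name, the input encoding and the use of label equality to pick initial candidates are assumptions made for illustration.

```python
def graph_simulation(q_nodes, q_edges, g_nodes, g_edges, q_label, g_label):
    """Return the maximum simulation relation as {u: set of v},
    or {} if some pattern node has no match."""
    q_succ = {u: set() for u in q_nodes}
    for u, u2 in q_edges:
        q_succ[u].add(u2)
    g_succ = {v: set() for v in g_nodes}
    for v, v2 in g_edges:
        g_succ[v].add(v2)

    # Start from all label-compatible pairs, then prune to a fixpoint.
    sim = {u: {v for v in g_nodes if g_label[v] == q_label[u]} for u in q_nodes}
    changed = True
    while changed:
        changed = False
        for u in q_nodes:
            for v in list(sim[u]):
                # For each pattern edge (u, u2), v needs a successor in sim[u2].
                if any(not (g_succ[v] & sim[u2]) for u2 in q_succ[u]):
                    sim[u].discard(v)
                    changed = True
    if any(not sim[u] for u in q_nodes):
        return {}
    return sim
```

Unlike subgraph isomorphism, the result is a relation: several data nodes may match the same pattern node, and one data node may match several pattern nodes.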
Pattern matching in social graphs
edges to paths
The quest for a revision of graph simulation
[Figure: pattern graph (nodes B, A, S, W; edge labels 3, 3, 1) and data graph (nodes B, A1, …, Am, and several W nodes)]
Social Graphs
Directed graph G = (V, E, fA)
attributes fA(u): a tuple (A1 = a1, ..., An = an)
Social graphs: attributes for data content
label, keywords, blogs, comments, rating …
[Figure: example social graph with nodes Med, Soc, Eco, AI, Chem, DB, Gen and attribute tuples (‘dept’=CS, ‘field’=AI), (‘dept’=CS, ‘field’=DB), (‘dept’=Bio, ‘field’=Gen), (‘dept’=Bio, ‘field’=Eco)]
Pattern Graphs
Pattern graph: Q = (VQ, EQ, fv, fe)
fv(u): a conjunction of A op a, op in <, ≤, =, ≠, >, ≥
fe(u,u’): a bound, either a constant k (bounded: within k hops) or the symbol ∗ (unbounded)
Search condition, e.g., fv(): ‘dept’=CS
Incorporating search conditions and bounds on the number of hops
[Figure: pattern graph with nodes CS, Bio, Soc, Med and edge bounds ∗, 3, ∗, 2, 2, 3]
Bounded Simulation
G = (V, E, fA) matches Q = (VQ, EQ, fv, fe) via bounded simulation, if
there exists a binary relation S ⊆ VQ × V such that S
• is a total mapping: for each u ∈ VQ, there exists v ∈ V such that (u,v) ∈ S
• satisfies search conditions and bounds on edge-to-path mappings: for each (u,v) ∈ S,
o attributes fA(v) satisfy predicate fv(u)
o each (u,u’) in EQ is mapped to a path in G from v to v’ of length bounded by fe(u,u’), with (u’,v’) ∈ S
Q(G): a unique maximum match relation (empty if G does not match Q)
[Figure: pattern graph with nodes CS, Bio, Soc, Med and edge bounds ∗, 3, ∗, 2, 2, 3, matched via S against a data graph with nodes DB, Soc, Med, Med, Gen, Soc, Eco, AI, Chem]
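The definition of bounded simulation can be sketched as a fixpoint over candidate pairs, using shortest nonempty-path lengths to check the bounds. This is a naive illustration under stated assumptions, not the cubic-time algorithm: the function names, the adjacency-dict encoding of G (every node must appear as a key), the `satisfies` callback standing in for fv, and the string `"*"` for the unbounded symbol are all hypothetical choices.

```python
from collections import deque

def path_lengths(g_succ, source):
    """Shortest nonempty-path length from source to each reachable node (BFS)."""
    dist = {}
    queue = deque((w, 1) for w in g_succ[source])
    while queue:
        node, d = queue.popleft()
        if node in dist:
            continue
        dist[node] = d
        queue.extend((w, d + 1) for w in g_succ[node])
    return dist

def bounded_simulation(q_nodes, q_edges, g_succ, satisfies):
    """q_edges: {(u, u2): k or "*"}; satisfies(u, v) checks fv(u) on fA(v).
    Returns the maximum match relation {u: set of v}, or {} if G does not match Q."""
    dist = {v: path_lengths(g_succ, v) for v in g_succ}
    sim = {u: {v for v in g_succ if satisfies(u, v)} for u in q_nodes}

    def edge_ok(v, u2, bound):
        # Some match v2 of u2 must be reachable from v within the bound.
        return any(v2 in sim[u2] and (bound == "*" or d <= bound)
                   for v2, d in dist[v].items())

    changed = True
    while changed:
        changed = False
        for (u, u2), bound in q_edges.items():
            for v in list(sim[u]):
                if not edge_ok(v, u2, bound):
                    sim[u].discard(v)
                    changed = True
    return sim if all(sim[u] for u in q_nodes) else {}
```

With bound 1 on every pattern edge and label equality as the search condition, this reduces to plain graph simulation.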
Bounded simulation in social graphs
The set of all suspects involved in a drug ring
Edges are mapped to paths; a relation instead of a function
[Figure: pattern graph (nodes B, A, S, W; edge labels 3, 3, 1) and data graph (nodes B, A1, …, Am, and several W nodes)]
Complexity
Input: Pattern Q and data graph G
Output: Q(G)
Bounded simulation: O(|V||E| + |EQ||V|^2 + |VQ||V|), cubic time
Subgraph isomorphism: intractable
Graph simulation: O((|V| + |VQ|)(|E| + |EQ|))
(comparable: Q is small in practice)
Graph simulation is a special case of bounded simulation
o The same bound 1 on all pattern edges (edge-to-edge mapping)
o Unique attributes vs. search conditions: label equality
Capture more sensible matches in social graphs (by 80%)
Homeomorphism and monomorphism
Graph homeomorphism: G = (V, E) matches Q = (VQ, EQ) via
• an injective function from VQ to V (a function rather than a relation)
• edges mapped to pairwise node-disjoint simple paths in G (constraints on paths)
Monomorphism revised: G = (V, E) matches Q = (VQ, EQ) via
• an injective function from VQ to V
• edges mapped to nonempty paths in G
Intractable, even when Q is a tree and G is a DAG
Strike a balance between expressive power and complexity
Incorporating edge relationships
Incorporating edge “colors”
S: supervise
C: co-author
[Figure: data graph with nodes (Ann, CS), (Pat, DB), (John, DB), (Mat, DB), (Bill, Bio), (Don, Gen), (Tom, Bio) and edges colored S or C; pattern with nodes DB, CS, Bio, Bio and edge expressions C, C, S+]
Regular patterns
Pattern: Q = (VQ, EQ, fv, fe)
fv(u): a conjunction of A op a, op in <, ≤, =, ≠, >, ≥
fe(u,u’): a regular expression of the form F ::= c | c^k | c+ | FF
Mapping edges to paths satisfying associated regular expressions
Simple regular expressions: fairly common; optimizing patterns (checking containment in linear time)
[Figure: pattern with nodes DB, CS, Bio, Bio and edge expressions C, C, S+]
Complexity
Input: Pattern Q and data graph G
Output: Q(G)
O(|V||E| + m|EQ||V|^2 + |VQ||V|), where m is the number of distinct colors in Q
bounded simulation: a special case
o single color c (hence m = 1)
o fe(u,u’) = c^k | c+
Adding edge colors does not incur extra complexity
Various notions for graph pattern matching
Which one to use for social network analysis?
matching               complexity       |Q(G)|
subgraph isomorphism   NP-complete      |V|^|VQ|
graph simulation       quadratic time   |V| |VQ|
bounded simulation     cubic time       |V| |VQ|
regular matching       cubic time       |V| |VQ|
Making big graphs small
How to make a query tractable on big data?
Can we effectively query big graphs?
Querying big graphs:
• Input: Query Q, and a big graph G,
• Output: Q(G), the set of matches of Q in G
Making big graphs small
Make the cost of query processing “independent” of |G|!
The cost of query processing: a function of |G| and |Q|
O(|G|) time is already beyond reach in practice!
A number of techniques:
1. Distributed query processing
2. Query preserving data compression
3. Query answering using views
4. Bounded incremental evaluation
5. …
O(n^2) or O(n^3) time
Distributed query processing
The cost of evaluation algorithm: f(|G|, |Q|)
Divide and conquer
partition G into fragments (G1, …, Gn), distributed to various sites
manageable sizes
upon receiving a query Q,
• evaluate Q( Gi ) in parallel
• collect partial answers at a coordinator site, and assemble
them to find the answer Q( G ) in the entire G
evaluate Q on smaller Gi
Network traffic and response time: Independent of |G|
Performance guarantees for evaluating graph pattern queries
It is unlikely that we can lower its complexity, but
can we reduce the size of its parameter |G|?
Partial evaluation and
distributed query answering
Partial evaluation: a promising approach
compute f(x) = f(s, d), where s is the part of known input and d is the yet unavailable input
• conduct the part of computation that depends only on s
• generate a partial answer: a residual function
Partial evaluation in distributed query processing
• at each site, Gi as the known input
• Gj as the yet unavailable input
• evaluate Q(Gi) in parallel, yielding residual functions
• collect partial matches at a coordinator site, and assemble
them to find the answer Q(G) in the entire G
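A reachability query gives a compact illustration of partial evaluation over fragments. In the sketch below, each site explores only its own fragment and reports, for each entry node, which border nodes (nodes stored elsewhere) and whether the target it can reach; the coordinator then searches the small assembled graph. The function names, the fragment encoding and the sequential loop standing in for parallel sites are assumptions; this is not presented as the algorithm of the cited work.

```python
def local_eval(nodes, edges, entries, target):
    """Runs at one site on its fragment (the known input).
    For each entry node, return the target and the border nodes (nodes stored
    at other sites) reachable using this fragment's edges only."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, []).append(b)
    partial = {}
    for e in entries:
        seen, stack = {e}, [e]
        while stack:
            x = stack.pop()
            for y in succ.get(x, []):
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        partial[e] = {y for y in seen
                      if y != e and (y == target or y not in nodes)}
    return partial

def distributed_reach(fragments, source, target):
    """fragments: list of (nodes, edges); each edge is stored with its tail node."""
    if source == target:
        return True
    residual = {}
    for i, (nodes, edges) in enumerate(fragments):  # conceptually in parallel
        # Entry nodes: local nodes pointed to from other fragments, plus the source.
        entries = {b for j, (_, es) in enumerate(fragments) if j != i
                   for _, b in es if b in nodes}
        if source in nodes:
            entries.add(source)
        residual.update(local_eval(nodes, edges, entries, target))
    # Coordinator: assemble the partial answers and search the residual graph.
    seen, stack = {source}, [source]
    while stack:
        x = stack.pop()
        for y in residual.get(x, ()):
            if y == target:
                return True
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False
```

The partial answers mention only entry and border nodes, so what is shipped to the coordinator depends on how the graph is fragmented rather than on the full size of G.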
Query preserving data compression
The cost of query processing: f(|G|, |Q|)
Query preserving compression <R, P> for a class L of queries
For any data collection G, Gc = R(G)
For any Q in L, Q( G ) = P(Q, Gc)
[Figure: diagram relating G, Gc = R(G), the query Q, and the answers Q(G) and Q(Gc) via R and P]
Compress big G into a smaller Gc
reduce the parameter?
What is new about query preserving compression?
18 times faster on average for reachability queries
In contrast to lossless compression, no need to
restore the original graph G
Relative to a class L of queries of users’ choice
Better compression ratio: only information about L queries
Query preserving compression <R, P> for a class L of queries
For any dataset G, Gc = R(G)
For any Q in L, Q( G ) = P(Q, Gc)
For any Q in L, Q(Gc) can be directly computed
Any algorithms and indexing structures for G can be used for Gc
no need to decompress Gc
Gc is computed once for all queries Q in L
Incrementally maintained
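For reachability queries, a simple instance of a pair <R, P> is to merge each strongly connected component into a single node. The sketch below uses that simplification purely for illustration; it is not the compression scheme of the cited work, and the function names and edge-list encoding are assumptions. R is `compress`, and P is `answer`, which rewrites the query and evaluates it on Gc without restoring G.

```python
def reachable(succ, start):
    """Nodes reachable from start (including start itself)."""
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in succ.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen

def compress(nodes, edges):
    """R: merge each strongly connected component into one node of Gc."""
    succ, pred = {}, {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
        pred.setdefault(b, set()).add(a)
    comp = {}
    for v in nodes:
        if v not in comp:
            # SCC of v: nodes that v reaches and that reach v (simple, not linear time).
            for w in reachable(succ, v) & reachable(pred, v):
                comp[w] = v  # v serves as the representative
    gc_succ = {}
    for a, b in edges:
        if comp[a] != comp[b]:
            gc_succ.setdefault(comp[a], set()).add(comp[b])
    return comp, gc_succ

def answer(query, comp, gc_succ):
    """P: rewrite the reachability query (u, v) and evaluate it on Gc only."""
    u, v = query
    return comp[v] in reachable(gc_succ, comp[u])

comp, gc = compress("abcd", [("a", "b"), ("b", "a"), ("b", "c"), ("c", "d")])
print(answer(("a", "d"), comp, gc), answer(("d", "a"), comp, gc))  # True False
```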
Answering queries using views
The complexity is no longer a function of |G|
Can we compute Q(G) without accessing G, i.e.,
independent of |G|?
The cost of query processing: f(|G|, |Q|)
Query answering using views: given a query Q in a language L
and a set V of views, find another query Q’ such that
Q and Q’ are equivalent
Q’ only accesses V(G )
for any G, Q(G) = Q’(G)
Answering graph pattern queries on big social graphs:
Regardless of how big G is – the cost is “independent” of G
V(G) is often much smaller than G (4% to 12% on real-life data)
Improvement: 31 times faster for graph pattern matching
Incremental query answering
Minimizing unnecessary recomputation
Real-life data is dynamic: it constantly changes (∆G)
Re-compute Q(G⊕∆G) starting from scratch?
Changes ∆G are typically small (5%/week in Web graphs)
Compute Q(G) once, and then incrementally maintain it
Incremental query processing:
Input: Q, G, Q(G), ∆G, where Q(G) is the old output and ∆G the changes to the input
Output: ∆M such that Q(G⊕∆G) = Q(G) ⊕ ∆M, where ∆M is the changes to the output and Q(G⊕∆G) the new output
When changes ∆G to the data G are small, typically so are the
changes ∆M to the output Q(G⊕∆G)
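As a small illustration of computing ∆M instead of recomputing from scratch, the sketch below maintains the set of nodes reachable from a fixed source under edge insertions. Single-source reachability is chosen here only because it is easy to follow; the function name and the adjacency-dict encoding are assumptions.

```python
def insert_edge(succ, reached, edge):
    """Maintain `reached` (nodes reachable from a fixed source) after inserting
    `edge` into the graph `succ`. Returns delta_m, the newly reachable nodes;
    only those nodes and their outgoing edges are visited."""
    a, b = edge
    succ.setdefault(a, set()).add(b)
    delta = set()
    if a in reached and b not in reached:
        delta.add(b)
        stack = [b]
        while stack:
            x = stack.pop()
            for y in succ.get(x, ()):
                if y not in reached and y not in delta:
                    delta.add(y)
                    stack.append(y)
    reached |= delta  # Q(G ⊕ ∆G) = Q(G) ⊕ ∆M
    return delta

succ = {"s": {"a"}, "b": {"c"}}
reached = {"s", "a"}                            # Q(G), computed once
print(insert_edge(succ, reached, ("a", "b")))   # the two new nodes b and c
print(insert_edge(succ, reached, ("c", "a")))   # set(): nothing changes
```

The work done is proportional to the newly reached part of the graph, not to |G|.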
Complexity of incremental problems
Bounded: the cost is expressible as f(|CHANGED|, |Q|)?
Optimal: in O(|CHANGED| + |Q|)?
Complexity analysis in terms of the size of changes
Incremental query answering
Input: Q, G, Q(G), ∆G
Output: ∆M such that Q(G⊕∆G) = Q(G) ⊕ ∆M
The cost of query processing: a function of |G| and |Q|
incremental algorithms: |CHANGED|, the size of changes in
• the input: ∆G, and
• the output: ∆M
The updating cost that is
inherent to the incremental
problem itself
The amount of work absolutely
necessary to perform for any
incremental algorithm
graph simulation: bounded
Graph pattern matching on big graphs
Partial evaluation for distributed query processing?
Query preserving compression: convert big data to small data
Query answering using views: make big data small
Bounded incremental query answering: depending on the size of
the changes rather than the size of the original big data
. . .
Combinations of these can do better than MapReduce!
Make big data small
Yes, MapReduce is useful, but it is not the only way!
Approximate query answering
Relaxing the semantics of query answering
Subgraph isomorphism: NP-complete, exponentially many matches
Graph simulation or bounded simulation: quadratic/cubic time, match relation of size at most |VQ||V|
Effectiveness: capture more sensible matches in social graphs
Efficiency: from intractable to low polynomial time
Use “cheaper” queries whenever possible
Top-k query answering
Traditional query answering: compute Q(G)
• It is expensive to compute when G is large
• The result Q(G) is excessively large for the users to inspect (larger than G)
Top-k query answering:
Input: Query Q, dataset G and a positive integer k.
Output: A top-ranked set of k elements in Q(G)
Early termination: return top-k matches without computing Q(G)
Improvement: 1.8 times as fast, graph pattern matching
The approximation theory revisited
Traditional approximation algorithms A: for an NPO
• for each instance x, A(x) computes a feasible solution y
• quality metric f(x, y)
• performance ratio (minimization): for all x, OPT(x) ≤ f(x, y) ≤ ε·OPT(x),
where OPT(x) is the optimal solution and ε ≥ 1
Big graphs?
A revision of approximation for querying big data
Approximation: for even low PTIME problems, not just NPO
Quality metric: the answer to a query is typically a set, not a number
Approach: it does not help much if A(x) conducts computation on
“big” data x directly!
Data-driven approximation
Resource-bounded query answering
Input: A class 𝒬 of queries, and a resource ratio α ∈ [0, 1)
Question: Develop an algorithm that, given any query Q ∈ 𝒬 and
graph G, computes Q(G) by accessing at most α|G| amount of data,
with best performance ratio
Accessing α|G| amount of data in the entire process
Data-driven approximation algorithm A(G)
Dynamic reduction: given Q and G, find GQ such that |GQ| ≤ α|G|
Compute Q(GQ) as approximate query answers to Q(G)
Performance ratio: F-measure of precision and recall
precision = |Q(GQ) ∩ Q(G)| / |Q(GQ)|
recall = |Q(GQ) ∩ Q(G)| / |Q(G)|
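The performance ratio above can be computed directly from the two answer sets. The sketch assumes the balanced F-measure (the harmonic mean of precision and recall), which the slides do not spell out; the function name and the convention of returning 0 for empty sets are also assumptions.

```python
def f_measure(approx, exact):
    """Return (precision, recall, F) for approximate answers Q(GQ)
    against exact answers Q(G)."""
    approx, exact = set(approx), set(exact)
    common = approx & exact
    precision = len(common) / len(approx) if approx else 0.0
    recall = len(common) / len(exact) if exact else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-measure: harmonic mean of precision and recall (assumed).
    return precision, recall, 2 * precision * recall / (precision + recall)

# Q(GQ) = {1, 2, 3}, Q(G) = {2, 3, 4, 5}: precision 2/3, recall 1/2, F = 4/7.
print(f_measure({1, 2, 3}, {2, 3, 4, 5}))
```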
Personalized social search queries
We can make big graphs of PB size fit into our memory!
Graph Search, Facebook
Find me all my friends who live in Edinburgh and like cycling
Find me restaurants in London my friends have been to
Find me photos of my friends in New York
Does Michael connect to Lady Gaga through social links?
We can do personalized social search with α = 0.0015%!
1.5 * 10^-5 * 1PB (10^15 B) = 15 * 10^9 B = 15GB
We are making big graphs of PB size as small as 15GB!
Non-localized reachability
Localized patterns
with 100% accuracy!
Scale independence
Input: A class Q of queries
Question: Can we find, for any query Q ∈ 𝒬 and any (possibly
big) graph G, a fraction GQ of G such that
|GQ| ≤ M, and
Q(G) = Q(GQ)?
Characterizing scale independent queries
Scalable with big graph G, when G grows!
Desirable, but hard
Independent of the size of G
Summing up
Challenges and opportunities
Challenges: querying big graphs is hard!
Introduce new fundamental problems – a departure from our
familiar terrain
any technique alone may not work very well – MapReduce is not
the only way, and may not be the best way
Nonetheless, we can do it!
Exact query answers: making big data small!
• combinations of effective techniques
Approximate query answering:
• relax the semantics of query answering
• data-driven approximation
…
Querying big graphs: A rich source of questions and vitality!
References
Complexity theory for big data
W. Fan, F. Geerts, F. Neven. Making Queries Tractable on Big Data
with Preprocessing, VLDB 2013.
W. Fan, F. Geerts, and L. Libkin. On Scale Independence for
Querying Big Data. PODS 2014
Querying big social data:
W. Fan, X. Wang, and Y. Wu. Diversified Top-k Graph Pattern
Matching, VLDB, 2014.
W. Fan, X. Wang, and Y. Wu. Answering Graph Pattern Queries
using Views, ICDE 2014.
S. Ma, Y. Cao, W. Fan, J. Huai, and T. Wo. Strong Simulation:
Capturing Topology in Graph Pattern Matching, TODS 39(1), 2014
W. Fan, X. Wang, and Y. Wu. Incremental Graph Pattern Matching,
TODS 38(3), 2013
W. Fan. Graph Pattern Matching Revised for Social Network Analysis,
ICDT 2012.
W. Fan, J. Li, X. Wang, and Y. Wu. Query Preserving Graph
Compression, SIGMOD, 2012.
W. Fan, X. Wang, and Y. Wu. Performance Guarantees for
Distributed Reachability Queries, VLDB, 2012.
W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Adding Regular
Expressions to Graph Reachability and Pattern Queries, ICDE 2011.
W. Fan, J. Li, S. Ma, H. Wang, and Y. Wu. Graph Homomorphism
Revisited for Graph Matching, VLDB 2010.
W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Graph Pattern
Matching: From Intractable to Polynomial Time, VLDB, 2010.