
Querying Big Graphs:

Theory and Practice

Wenfei Fan

School of Informatics

University of Edinburgh


Social networks modeled as graphs

[Figure: a social graph with nodes labeled B, A1, ..., Am and W]

Node: person

Edge: relationship (e.g., report, supervise)

Social graphs: Facebook, Twitter, LinkedIn, …

Find all matches of a pattern in a graph

Pattern matching in social graphs

Identify suspects in a drug ring

“Understanding the structure of drug trafficking organizations”

[Figure: a pattern graph with nodes B, A, S and W and edge bounds 3, 3 and 1, shown next to the social graph with nodes B, A1, ..., Am and W]

Graph Pattern Matching

Input: a pattern graph Q and a data graph G

Output: all the matches of Q in G, i.e., all subgraphs of G that

are isomorphic to Q

Good for social network analysis?

Applications

• pattern recognition

• intelligence analysis

• transportation network analysis

• Web site classification

• social position detection

• user targeted advertising

• knowledge base disambiguation …

a bijective function f on nodes:

(u,u’ ) ∈ Q iff (f(u), f(u’)) ∈ G
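The bijection condition above can be checked by brute force on tiny graphs. The following is a minimal sketch, not an algorithm from the talk: the function name `subgraph_isomorphisms` and the toy graphs are hypothetical, node labels are ignored, and the search enumerates every injective assignment, which is exactly why it does not scale.

```python
from itertools import permutations

def subgraph_isomorphisms(q_nodes, q_edges, g_nodes, g_edges):
    """Enumerate injective maps f from pattern nodes to data nodes such that
    (u, u') is a pattern edge iff (f(u), f(u')) is a data edge."""
    q_nodes = list(q_nodes)
    q_edges, g_edges = set(q_edges), set(g_edges)
    matches = []
    # Try every ordered choice of |VQ| distinct data nodes: exponential cost.
    for image in permutations(list(g_nodes), len(q_nodes)):
        f = dict(zip(q_nodes, image))
        if all(((u, w) in q_edges) == ((f[u], f[w]) in g_edges)
               for u in q_nodes for w in q_nodes):
            matches.append(f)
    return matches

# Hypothetical toy input: pattern B -> W, and a three-node data graph.
pattern_edges = [("B", "W")]
data_edges = [("b1", "w1"), ("b1", "w2"), ("w1", "w2")]
found = subgraph_isomorphisms(["B", "W"], pattern_edges,
                              ["b1", "w1", "w2"], data_edges)
```

Each entry of `found` is one bijection between the pattern and a subgraph of the data graph; the number of candidate assignments grows as |V|!/(|V| - |VQ|)!.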


New challenges

Real-life social graphs are typically large

Facebook: more than 1 billion nodes, and over 140 billion links

Is it feasible on big graphs?

Graph pattern matching is costly

• NP-complete to decide whether there exists a match

• possibly exponentially many matches

How long do we have to wait?


The good, the bad and the ugly

Traditional computational complexity theory of almost 50 years:

• The good: polynomial time computable (PTIME)

• The bad: NP-hard (intractable)

• The ugly: PSPACE-hard, EXPTIME-hard, undecidable…

Polynomial time queries become intractable on big data

What happens when it comes to big data?

Assuming an SSD that scans 6GB/s, a linear scan of a data set D would take

• 1.9 days when D is of 1PB (10^15 B)

• 5.28 years when D is of 1EB (10^18 B)

O(n) time is already beyond reach on big data in practice!
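The scan times quoted above follow from simple division. A small sketch of the arithmetic, assuming a scan rate of 6 * 10^9 bytes per second and a 365-day year (the helper name `scan_seconds` is made up for illustration):

```python
def scan_seconds(size_bytes, rate_bytes_per_s=6e9):
    """Time for one linear scan at a fixed sequential read rate."""
    return size_bytes / rate_bytes_per_s

DAY = 86_400          # seconds per day
YEAR = 365 * DAY      # seconds per (non-leap) year

pb_days = scan_seconds(1e15) / DAY     # 1PB: roughly 1.9 days
eb_years = scan_seconds(1e18) / YEAR   # 1EB: roughly 5.28 years
```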


Tractability revisited for big data

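BD-tractable queries: properly contained in P, unless P = NC

[Figure: nested complexity classes, with BD-tractable inside PTIME and PTIME inside “NP and beyond”; the regions are labeled BD-tractable and not BD-tractable, and graph pattern matching is marked in the diagram]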

Coping with the sheer size of real-life graphs

1. Revising graph pattern matching

1) Bounded simulation

2) Incorporating edge relationships

2. Making big graphs small

1) Distributed graph pattern matching

2) Query preserving graph compression

3) Graph pattern matching using views

4) Incremental graph pattern matching

3. Approximate query answering

1) Relaxing the semantics of queries

2) Resource-bounded query answering

Joint work with Xin Wang and Yinghui Wu

Graph pattern matching for social network analysis


Subgraph isomorphism may be too strict for social data analysis

[Figure: the drug-ring pattern (nodes B, A, S, W; edge bounds 3, 3, 1) and the social graph (nodes B, A1, ..., Am, W). Annotations: “not allowed by bijection”; “relation instead of function”]

Graph simulation

Input: a pattern graph Q and a data graph G

Output: a binary relation S on the nodes of Q and G

• each node u in Q is mapped to a node v in G, such that

(u, v) ∈ S

• for each (u,v) ∈ S, each edge (u,u’) in Q is mapped to an

edge (v, v’) in G, such that (u’,v’) ∈ S

Does this suffice?
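The two conditions on S suggest a simple fixpoint computation: start from all label-compatible pairs and discard pairs that violate the edge condition until nothing changes. This is a naive sketch under assumptions not stated on the slide (nodes carry a single label compared by equality, and the names `graph_simulation`, `q_label`, `g_label` are hypothetical); it does not attempt the quadratic bound cited later in the talk.

```python
def graph_simulation(q_nodes, q_edges, g_nodes, g_edges, q_label, g_label):
    """Return the maximum simulation relation S as a set of (u, v) pairs,
    or the empty set if some pattern node has no match."""
    g_succ = {v: set() for v in g_nodes}
    for v, w in g_edges:
        g_succ[v].add(w)
    # Candidates: data nodes carrying the same label as the pattern node.
    sim = {u: {v for v in g_nodes if g_label[v] == q_label[u]} for u in q_nodes}
    changed = True
    while changed:
        changed = False
        for u, u2 in q_edges:
            # v stays a match of u only if some successor of v matches u2.
            keep = {v for v in sim[u] if g_succ[v] & sim[u2]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True
    if any(not sim[u] for u in q_nodes):
        return set()
    return {(u, v) for u in q_nodes for v in sim[u]}

# Hypothetical toy input: pattern B -> W; b2 has no outgoing edge.
S = graph_simulation(
    ["B", "W"], [("B", "W")],
    ["b1", "b2", "w1", "w2"], [("b1", "w1")],
    {"B": "B", "W": "W"},
    {"b1": "B", "b2": "B", "w1": "W", "w2": "W"},
)
```

Note that S is a relation, not a function: the pattern node W is related to both w1 and w2, while b2 is dropped because it has no edge to a W node.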


The quest for a revision of graph simulation

[Figure: the drug-ring pattern (nodes B, A, S, W; edge bounds 3, 3, 1) and the social graph (nodes B, A1, ..., Am, W). Annotation: “edges to paths”]

Social Graphs

Directed graph G = (V, E, fA)

attributes fA(u): a tuple (A1 = a1, ..., An = an)

Social graphs: attributes for data content, e.g., label, keywords, blogs, comments, rating …

[Figure: a social graph with nodes DB, AI, Gen, Eco, Med, Soc and Chem; example attribute tuples: (‘dept’=CS, ‘field’=AI), (‘dept’=CS, ‘field’=DB), (‘dept’=Bio, ‘field’=Gen), (‘dept’=Bio, ‘field’=Eco)]

Pattern Graphs

Pattern graph: Q = (VQ, EQ, fv, fe)

fv(u): a conjunction of A op a, op in <, ≤, =, ≠, >, ≥ (search condition, e.g., ‘dept’=CS)

fe(u,u’): a bound, either a constant k (bounded: within k hops) or the symbol ∗ (unbounded)

Incorporating search conditions and bounds on the number of hops

[Figure: a pattern graph with nodes CS, Bio, Soc and Med and edge bounds ∗, 3, ∗, 2, 2, 3]

Bounded Simulation

G = (V, E, fA) matches Q = (VQ, EQ, fv, fe) via bounded simulation, if

there exists a binary relation S ⊆ VQ × V such that S

• is a total mapping: for each u ∈ VQ, there exists v ∈ V such that (u,v) ∈ S

• satisfies search conditions and bounds on edge-to-path mappings: for each (u,v) ∈ S,

o attributes fA(v) satisfy predicate fv(u)

o each (u,u’) in EQ is mapped to a path in G from v to v’ of length bounded by fe(u,u’), with (u’,v’) ∈ S

Q(G): a unique maximum match relation S, empty if G does not match Q

[Figure: a pattern with nodes CS, Bio, Soc and Med and edge bounds ∗, 3, ∗, 2, 2, 3, matched via S against a data graph with nodes DB, AI, Gen, Eco, Med, Soc and Chem]
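The definition above can be turned into a naive fixpoint computation over precomputed path lengths. The sketch below is an illustration under stated assumptions, not the algorithm behind the bound quoted in the talk: the names `distances_from`, `bounded_simulation`, `attrs` and `predicate` are hypothetical, pattern edges are given as a dict from (u, u') to an integer bound or "*", and paths must be nonempty.

```python
from collections import deque

def distances_from(src, succ):
    """Shortest lengths of nonempty paths from src (BFS over successors)."""
    dist = {}
    queue = deque((w, 1) for w in succ[src])
    while queue:
        node, d = queue.popleft()
        if node in dist:
            continue
        dist[node] = d
        queue.extend((w, d + 1) for w in succ[node])
    return dist

def bounded_simulation(q_nodes, q_edges, g_nodes, g_edges, attrs, predicate):
    """Return the maximum match relation S, or the empty set if G does not match Q."""
    succ = {v: [] for v in g_nodes}
    for v, w in g_edges:
        succ[v].append(w)
    dist = {v: distances_from(v, succ) for v in g_nodes}
    # Candidates: data nodes whose attributes satisfy the search condition.
    sim = {u: {v for v in g_nodes if predicate[u](attrs[v])} for u in q_nodes}

    def within(v, w, bound):
        return w in dist[v] and (bound == "*" or dist[v][w] <= bound)

    changed = True
    while changed:
        changed = False
        for (u, u2), bound in q_edges.items():
            keep = {v for v in sim[u] if any(within(v, w, bound) for w in sim[u2])}
            if keep != sim[u]:
                sim[u], changed = keep, True
    if any(not sim[u] for u in q_nodes):
        return set()
    return {(u, v) for u in q_nodes for v in sim[u]}

# Hypothetical toy input: a CS node must reach a Bio node within 2 hops.
attrs = {"a": {"dept": "CS"}, "x": {"dept": "Soc"}, "b": {"dept": "Bio"}}
predicate = {"CS": lambda t: t["dept"] == "CS", "Bio": lambda t: t["dept"] == "Bio"}
edges = [("a", "x"), ("x", "b")]
S2 = bounded_simulation(["CS", "Bio"], {("CS", "Bio"): 2}, list(attrs), edges, attrs, predicate)
S1 = bounded_simulation(["CS", "Bio"], {("CS", "Bio"): 1}, list(attrs), edges, attrs, predicate)
```

With bound 2 the pattern edge is mapped to the path a, x, b; with bound 1 no match exists, so the result is empty.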

Bounded simulation in social graphs

The set of all suspects involved in a drug ring

[Figure: the drug-ring pattern (nodes B, A, S, W; edge bounds 3, 3, 1) matched against the social graph (nodes B, A1, ..., Am, W). Annotations: “edges to paths”; “relation instead of function”]

Complexity

Input: Pattern Q and data graph G

Output: Q(G), in O(|V| |E| + |EQ| |V|^2 + |VQ| |V|) time (cubic time)

Subgraph isomorphism: intractable

Graph simulation: O((|V| + |VQ|) (|E| + |EQ|)); comparable, since Q is small in practice

Graph simulation is a special case of bounded simulation

o The same bound 1 on all pattern edges (edge-to-edge mapping)

o Unique attributes vs. search conditions: label equality

Capture more sensible matches in social graphs (by 80%)


Homeomorphism and monomorphism

Graph homeomorphism: G = (V, E) matches Q = (VQ, EQ) via

• an injective function from VQ to V (a function rather than a relation)

• a mapping from edges to pairwise node-disjoint simple paths in G (constraints on paths)

Monomorphism revised: G = (V, E) matches Q = (VQ, EQ) via

• an injective function from VQ to V

• a mapping from edges to nonempty paths in G

Intractable, even when Q is a tree and G is a DAG

Strike a balance between expressive power and complexity

Incorporating edge relationships

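Incorporating edge “colors”

Edge colors: S: supervise; C: co-author

[Figure: a data graph with nodes (Ann, CS), (Pat, DB), (John, DB), (Mat, DB), (Bill, Bio), (Don, Gen) and (Tom, Bio), whose edges are colored S or C; and a pattern with nodes DB, CS, Bio and Bio and edges labeled C, C and S+]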

Regular patterns

Pattern: Q = (VQ, EQ, fv, fe)

fv(u): a conjunction of A op a, op in <, ≤, =, ≠, >, ≥

fe(u,u’): a regular expression of the form F ::= c | c^k | c^+ | FF

Mapping edges to paths satisfying associated regular expressions

Simple regular expressions: fairly common; optimizing patterns (checking containment in linear time)

[Figure: a pattern with nodes DB, CS, Bio and Bio and edge expressions C, C and S+]

Complexity

Input: Pattern Q and data graph G

Output: Q(G), in O(|V| |E| + m |EQ| |V|^2 + |VQ| |V|) time, where m is the number of distinct colors in Q

bounded simulation: a special case

o single color c (hence m = 1)

o fe(u,u’) = c^k | c^+

Adding edge colors does not incur extra complexity
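To make the edge-to-path mapping with colors concrete, the sketch below finds the nodes reachable from a start node along a path whose edge colors spell a simple expression F. Everything here is an assumption for illustration: the function name `matches_via`, the encoding of F as a list of (color, k) atoms, the reading of c^k as "between 1 and k edges of color c" (k = None encodes c^+), and the toy edges, which reuse names from the figure but are not taken from it.

```python
def matches_via(start, atoms, colored_succ):
    """Nodes reachable from start by a path matching the concatenated atoms.

    atoms: list of (color, k); the atom consumes 1..k edges of that color,
    or one or more edges when k is None (the c^+ case).
    colored_succ: dict mapping a node to a list of (color, target) pairs.
    """
    frontier = {start}
    for color, k in atoms:
        layer, seen, steps = frontier, set(), 0
        while layer and (k is None or steps < k):
            # Advance one edge of the required color from every node in the layer.
            layer = {w for v in layer
                     for (c, w) in colored_succ.get(v, []) if c == color}
            layer -= seen          # nodes already reached with fewer steps
            seen |= layer
            steps += 1
        frontier = seen            # the next atom starts where this one ended
    return frontier

# Hypothetical colored edges: S = supervise, C = co-author.
g = {"Ann": [("S", "Pat"), ("C", "Bill")], "Pat": [("S", "Mat")]}
by_s_plus = matches_via("Ann", [("S", None)], g)   # expression S+
by_s = matches_via("Ann", [("S", 1)], g)           # expression S
```

A full matcher would plug this reachability test into the simulation fixpoint in place of the plain hop bound.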


Various notions for graph pattern matching

Which one to use for social network analysis?

matching                complexity        |Q(G)|
subgraph isomorphism    NP-complete       |V|^|VQ|
graph simulation        quadratic time    |V| |VQ|
bounded simulation      cubic time        |V| |VQ|
regular matching        cubic time        |V| |VQ|

Making big graphs small


How to make a query tractable on big data?

Can we effectively query big graphs?

Querying big graphs:

• Input: Query Q, and a big graph G,

• Output: Q(G), the set of matches of Q in G

Making big graphs small

Make the cost of query processing “independent” of |G|!

The cost of query processing: a function of |G| and |Q|

O(|G|) time is already beyond reach in practice!

A number of techniques:

1. Distributed query processing

2. Query preserving data compression

3. Query answering using views

4. Bounded incremental evaluation

5. …

O(n^2) or O(n^3) time

Distributed query processing


The cost of evaluation algorithm: f(|G|, |Q|)

Divide and conquer

partition G into fragments (G1, …, Gn) of manageable sizes, distributed to various sites

upon receiving a query Q,

• evaluate Q( Gi ) in parallel, i.e., evaluate Q on smaller Gi

• collect partial answers at a coordinator site, and assemble them to find the answer Q( G ) in the entire G

It is unlikely that we can lower its complexity, but can we reduce the size of its parameter |G|?

Performance guarantees for evaluating graph pattern queries

Network traffic and response time: independent of |G|

Partial evaluation and distributed query answering

Partial evaluation: a promising approach

compute f( x ) ≡ f( s, d ), where s is the part of known input and d is the yet unavailable input

• conduct the part of computation that depends only on s

• generate a partial answer: a residual function

Partial evaluation in distributed query processing

• at each site, Gi as the known input

• Gj as the yet unavailable input

• evaluate Q( Gi ) in parallel

• collect partial matches (residual functions) at a coordinator site, and assemble them to find the answer Q( G ) in the entire G
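As an illustration of partial evaluation at work, the sketch below answers a reachability query over a fragmented graph. It swaps in reachability for pattern matching to keep the example short, and all names (`local_reach`, `partial_answer`, `assemble`, the fragment layout) are hypothetical. Each site treats its own fragment as the known input and produces, for each entry node, either True or the set of nodes owned by other sites on which the answer still depends; the coordinator resolves those dependencies.

```python
def local_reach(start, edges, local_nodes):
    """Nodes reachable from start, expanding only nodes owned by this site."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        if v not in local_nodes:
            continue  # virtual node: owned by another site, cannot expand here
        for w in edges.get(v, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def partial_answer(local_nodes, edges, entries, target):
    """Residual function of one site: entry node -> True, or the set of
    virtual nodes whose (yet unavailable) answers it depends on."""
    answer = {}
    for e in entries:
        reached = local_reach(e, edges, local_nodes)
        if target in reached:
            answer[e] = True
        else:
            answer[e] = {w for w in reached if w not in local_nodes}
    return answer

def assemble(partials, source):
    """Coordinator: follow the dependencies among the partial answers."""
    equations = {}
    for p in partials:
        equations.update(p)
    seen, stack = {source}, [source]
    while stack:
        rhs = equations.get(stack.pop(), set())
        if rhs is True:
            return True
        for w in rhs:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return False

# Hypothetical fragments: site 1 owns a, b; site 2 owns c, d; b -> c crosses sites.
p1 = partial_answer({"a", "b"}, {"a": ["b"], "b": ["c"]}, ["a"], "d")
p2 = partial_answer({"c", "d"}, {"c": ["d"]}, ["c"], "d")
a_reaches_d = assemble([p1, p2], "a")
```

In a real deployment the calls to `partial_answer` would run in parallel at the sites, and only the small residual functions, not the fragments, would be shipped to the coordinator.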

Query preserving data compression

The cost of query processing: f(|G|, |Q|); can we reduce the parameter |G|?

Compress big G into a smaller Gc

Query preserving compression <R, P> for a class L of queries

• For any data collection G, Gc = R(G)

• For any Q in L, Q( G ) = P(Q, Gc)

[Figure: diagram relating G, Gc = R(G), Q( G ) and Q( Gc ) via R and P]

What is new about query preserving compression?

Query preserving compression <R, P> for a class L of queries

• For any dataset G, Gc = R(G)

• For any Q in L, Q( G ) = P(Q, Gc)

Relative to a class L of queries of users’ choice

• Better compression ratio: only information about L queries

In contrast to lossless compression, no need to restore the original graph G

• For any Q in L, Q(Gc) can be directly computed: no need to decompress Gc

• Any algorithms and indexing structures for G can be used for Gc

Gc is computed once for all queries Q in L, and incrementally maintained

18 times faster on average for reachability queries
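A very simple instance of a pair <R, P> for reachability queries is sketched below. It is not the compression scheme from the referenced work: as a stand-in, R merges each strongly connected component into one node, and P rewrites the query to the representatives and evaluates it directly on Gc. The names `reachable`, `compress` and `answer` are hypothetical, and the quadratic computation of components is chosen for brevity only.

```python
def reachable(src, succ):
    """All nodes reachable from src, including src itself."""
    seen, stack = {src}, [src]
    while stack:
        v = stack.pop()
        for w in succ.get(v, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def compress(nodes, edges):
    """R: collapse mutually reachable nodes; return (rep, compressed edges)."""
    succ = {}
    for v, w in edges:
        succ.setdefault(v, set()).add(w)
    reach = {v: reachable(v, succ) for v in nodes}
    # Representative of v: the smallest node in its strongly connected component.
    rep = {v: min(w for w in reach[v] if v in reach[w]) for v in nodes}
    c_edges = {(rep[v], rep[w]) for v, w in edges if rep[v] != rep[w]}
    return rep, c_edges

def answer(s, t, rep, c_edges):
    """P: evaluate 'does s reach t?' on Gc, without restoring G."""
    succ = {}
    for v, w in c_edges:
        succ.setdefault(v, set()).add(w)
    return rep[t] in reachable(rep[s], succ)

# Hypothetical graph: a and b form a cycle, b -> c, d is isolated.
rep, c_edges = compress(["a", "b", "c", "d"], [("a", "b"), ("b", "a"), ("b", "c")])
```

Gc is computed once, and every reachability query is then answered on the smaller graph with an ordinary traversal.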

Answering queries using views

The cost of query processing: f(|G|, |Q|)

can we compute Q(G) without accessing G, i.e., independent of |G|?

Query answering using views: given a query Q in a language L and a set V of views, find another query Q’ such that

• Q and Q’ are equivalent: for any G, Q(G) = Q’(G)

• Q’ only accesses V(G )

The complexity is no longer a function of |G|

Answering graph pattern queries on big social graphs:

• V(G ) is often much smaller than G (4% to 12% on real-life data)

• Improvement: 31 times faster for graph pattern matching

Regardless of how big G is, the cost is “independent” of G

Incremental query answering

Real-life data is dynamic: it constantly changes (∆G), e.g., 5%/week in Web graphs

Re-compute Q(G⊕∆G) starting from scratch?

Changes ∆G are typically small

Compute Q(G) once, and then incrementally maintain it

Incremental query processing:

Input: Q, G, Q(G) (the old output), ∆G (changes to the input)

Output: ∆M (changes to the output) such that Q(G⊕∆G) = Q(G) ⊕ ∆M (the new output)

When changes ∆G to the data G are small, typically so are the changes ∆M to the output Q(G⊕∆G)

Minimizing unnecessary recomputation
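The input/output contract above can be illustrated with a deliberately simple query: Q(G) is the set of nodes reachable from a fixed source, and ∆G is a single edge insertion. This is a stand-in for pattern matching, with hypothetical names (`reach_from`, `insert_edge`); deletions are harder and are not handled. The point is that the work is proportional to the newly reached part of the graph, not to |G|.

```python
def reach_from(source, succ):
    """Q(G): the set of nodes reachable from source (computed from scratch)."""
    seen, stack = {source}, [source]
    while stack:
        v = stack.pop()
        for w in succ.get(v, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def insert_edge(succ, reached, edge):
    """Apply the insertion of edge (v, w) to succ and return the change to the
    output, i.e., the nodes that become reachable because of the new edge."""
    v, w = edge
    succ.setdefault(v, set()).add(w)
    if v not in reached or w in reached:
        return set()                  # the old output is still correct
    delta, stack = {w}, [w]
    while stack:                      # explore only the newly reachable region
        x = stack.pop()
        for y in succ.get(x, ()):
            if y not in reached and y not in delta:
                delta.add(y)
                stack.append(y)
    return delta

# Hypothetical graph: s -> a, and a separate edge b -> c.
succ = {"s": {"a"}, "b": {"c"}}
old_output = reach_from("s", succ)             # Q(G)
delta = insert_edge(succ, old_output, ("a", "b"))
new_output = old_output | delta                # Q(G) combined with the change
```

Here `new_output` coincides with recomputing `reach_from("s", succ)` on the updated graph, while the incremental step touched only b and c.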

Complexity of incremental problems

Incremental query answering

Input: Q, G, Q(G), ∆G

Output: ∆M such that Q(G⊕∆G) = Q(G) ⊕ ∆M

The cost of query processing: a function of |G| and |Q|

incremental algorithms: |CHANGED|, the size of changes in

• the input: ∆G, and

• the output: ∆M

|CHANGED| is the updating cost that is inherent to the incremental problem itself: the amount of work absolutely necessary to perform for any incremental algorithm

Complexity analysis in terms of the size of changes

Bounded: the cost is expressible as f(|CHANGED|, |Q|)?

Optimal: in O(|CHANGED| + |Q|)?

graph simulation: bounded

Graph pattern matching on big graphs

Partial evaluation for distributed query processing?

Query preserving compression: convert big data to small data

Query answering using views: make big data small

Bounded incremental query answering: depending on the size of

the changes rather than the size of the original big data

. . .

Combinations of these can do better than MapReduce!

Make big data small

Yes, MapReduce is useful, but it is not the only way!

Approximate query answering


Relaxing the semantics of query answering

Subgraph isomorphism: NP-complete; exponentially many matches

Graph simulation or bounded simulation: quadratic/cubic time; at most |VQ||V| matches

Effectiveness: capture more sensible matches in social graphs

Efficiency: from intractable to low polynomial time

Use “cheaper” queries whenever possible

Top-k query answering

Traditional query answering: compute Q(G)

• It is expensive to compute when G is large

• The result Q(G) is excessively large for the users to inspect, larger than G

Top-k query answering:

Input: Query Q, dataset G and a positive integer k.

Output: A top-ranked set of k elements in Q(G)

Early termination: return top-k matches without computing Q(G)

Improvement: 1.8 times as fast for graph pattern matching

The approximation theory revisited

Traditional approximation algorithms A: for an NPO

• for each instance x, A(x) computes a feasible solution y

• quality metric f(x, y)

• performance ratio ε ≥ 1 (minimization): for all x, OPT(x) ≤ f(x, y) ≤ ε OPT(x), where OPT(x) is the optimal solution

Big graphs?

A revision of approximation for querying big data

• Approximation: for even low PTIME problems, not just NPO

• Quality metric: the answer to a query is typically a set, not a number

• Approach: it does not help much if A(x) conducts computation on “big” data x directly!

Data-driven approximation

Resource-bounded query answering

Input: A class Q of queries, and a resource ratio α ∈ [0, 1)

Question: Develop an algorithm that, given any query Q ∈ Q and graph G, computes Q(G) with the best performance ratio by accessing at most α|G| amount of data in the entire process

Data-driven approximation algorithm A(G)

• Dynamic reduction: given Q and G, find GQ such that |GQ| ≤ α|G|

• Compute Q(GQ) as approximate query answers to Q(G)

Performance ratio: F-measure of precision and recall

precision = |Q(GQ) ∩ Q(G)| / |Q(GQ)|

recall = |Q(GQ) ∩ Q(G)| / |Q(G)|
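The accuracy measure above is easy to compute once both answer sets are available. A minimal sketch, assuming answers are hashable items and using the usual harmonic mean for the F-measure (the slide names the F-measure without spelling out the formula); the function name `accuracy` and the sample sets are hypothetical:

```python
def accuracy(approx, exact):
    """Return (precision, recall, F-measure) of approx = Q(GQ) against exact = Q(G)."""
    approx, exact = set(approx), set(exact)
    common = len(approx & exact)
    precision = common / len(approx) if approx else 0.0
    recall = common / len(exact) if exact else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical answers: 2 of the 3 returned matches are correct, 2 of 4 are found.
precision, recall, f = accuracy({1, 2, 5}, {1, 2, 3, 4})
```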


Personalized social search queries

Graph Search, Facebook

• Find me all my friends who live in Edinburgh and like cycling (localized pattern)

• Find me restaurants in London my friends have been to (localized pattern)

• Find me photos of my friends in New York (localized pattern)

• Does Michael connect to Lady Gaga through social links? (non-localized reachability)

We can do personalized social search with α = 0.0015%, with 100% accuracy!

1.5 * 10^-5 * 1PB (10^15 B) = 15 * 10^9 B = 15GB

We are making big graphs of PB size as small as 15GB!

We can make big graphs of PB size fit into our memory!

Scale independence

Input: A class Q of queries

Question: Can we find, for any query Q ∈ Q and any (possibly big) graph G, a fraction GQ of G such that

• |GQ| ≤ M, and

• Q(G) = Q(GQ)?

Independent of the size of G

Characterizing scale independent queries: desirable, but hard

Scalable with big graph G, when G grows!

Summing up


Challenges and opportunities

Challenges: querying big graphs is hard!

Introduce new fundamental problems – a departure from our

familiar terrain

any technique alone may not work very well – MapReduce is not

the only way, and may not be the best way

Nonetheless, we can do it!

Exact query answers: making big data small!

• combinations of effective techniques

Approximate query answering:

• relax the semantics of query answering

• data-driven approximation

…

Querying big graphs: A rich source of questions and vitality!

References

Complexity theory for big data

W. Fan, F. Geerts, F. Neven. Making Queries Tractable on Big Data

with Preprocessing, VLDB 2013.

W. Fan, F. Geerts, and L. Libkin. On Scale Independence for

Querying Big Data. PODS 2014

Querying big social data:

W. Fan, X. Wang, and Y. Wu. Diversified Top-k Graph Pattern

Matching, VLDB, 2014.

W. Fan, X. Wang, and Y. Wu. Answering Graph Pattern Queries

using Views, ICDE 2014.

S. Ma, Y. Cao, W. Fan, J. Huai, and T. Wo. Strong Simulation:

Capturing Topology in Graph Pattern Matching, TODS 39(1), 2014


W. Fan, X. Wang, and Y. Wu. Incremental Graph Pattern Matching,

TODS 38(3), 2013

W. Fan. Graph Pattern Matching Revised for Social Network Analysis,

ICDT 2012.

W. Fan, J. Li, X. Wang, and Y. Wu. Query Preserving Graph

Compression, SIGMOD, 2012.

W. Fan, X. Wang, and Y. Wu. Performance Guarantees for

Distributed Reachability Queries, VLDB, 2012.

W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Adding Regular

Expressions to Graph Reachability and Pattern Queries, ICDE 2011.

W. Fan, J. Li, S. Ma, H. Wang, and Y. Wu. Graph Homomorphism

Revisited for Graph Matching, VLDB 2010.

W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Graph pattern

matching: From intractable to polynomial time, VLDB, 2010.
