Ten Thousand SQLs

Post on 08-Feb-2016


Kalmesh Nyamagoudar, 2010MCS3494

October 13, 2011

CONTENTS

CN Generation

• Example
• Definitions
• Algorithm

CN Evaluation

• Sequential Algorithm
• CLP : Naïve
• CLP : New
• OLP
• DLP
• Performance Studies

BANKS Model

Steiner Trees

[Figure: two answer trees linking Author1 and Author2, one through Paper1 and one through Paper2]

DISCOVER Model

Schema :

AUTHOR(TID, NAME)
WRITES(TID, AID, PID)
PAPER(TID, NAME)
CITE(TID, PID1, PID2)

Joining Network Of Tuples :

Author1 ⋈ (Author1: Paper1) ⋈ Paper1 ⋈ (Author2: Paper1) ⋈ Author2
Author1 ⋈ (Author1: Paper2) ⋈ Paper2 ⋈ (Author2: Paper2) ⋈ Author2

Joining Network Of Tuple Sets :

Author^{Author1} ⋈ Writes^{} ⋈ Paper^{} ⋈ Writes^{} ⋈ Author^{Author2}

Background : DISCOVER

Database : n relations R_1, ..., R_n. Each relation R_i has m_i attributes.

Schema Graph G : Directed graph that captures the primary key to foreign key (p-f) relationships in the database schema

• Node R_i : one for each relation
• Edge R_i → R_j : one for each p-f relationship
• Assumption : No self loops/parallel edges
• G_u : Undirected version of G (used in the later definitions)

Background : DISCOVER

[Figure: Schema Graph (TPC-H)]

Background : DISCOVER

Example Data

ORDERS

Label | ORDERKEY | CUSTKEY | TOTALPRICE | CLERK | ...
o1 | 1000105 | 12312 | $5,000 | John Smith
o2 | 1000111 | 12312 | $3,000 | Mike Miller
o3 | 1000125 | 10001 | $7,000 | Mike Miller
o4 | 1000110 | 10002 | $8,000 | Keith Brown

CUSTOMER

Label | CUSTKEY | NAME | NATIONKEY | ...
c1 | 12312 | Brad Lou | 01
c2 | 10001 | George Walters | 01
c3 | 10013 | John Roberts | 01

NATION

Label | NATIONKEY | NAME | REGIONKEY
n1 | 01 | USA | N.America

Source : Discover[3]

Background : DISCOVER

Query : "Smith, Miller"

Results on the example data above (each result is a Joining Network Of Tuples) :

SIZE | RESULT
2 | o1 ⋈ c1 ⋈ o2
4 | o1 ⋈ c1 ⋈ n1 ⋈ c2 ⋈ o3

Source : Discover[3]

Background : DISCOVER

Joining Network Of Tuple Sets

[Figure omitted]

Source : Discover[2]

Background : DISCOVER

Final Answer : Joining Network Of Tuples (j)

• Tree of tuples
• For each pair of adjacent tuples t_i, t_j, where t_i ∈ R_i, t_j ∈ R_j, there is an edge (R_i, R_j) in the schema graph G_u and (t_i ⋈ t_j) ∈ (R_i ⋈ R_j)

Keyword Query

• Given : Set of keywords
• Result : Set of all possible joining networks of tuples that are both :
  Total : Every keyword is contained in at least one tuple of the joining network.
  Minimal : No tuple can be removed while still leaving a total joining network of tuples (TJNT).
• Ordered By : Size of MTJNTs

All such joining networks of tuples are Minimal Total Joining Networks Of Tuples (MTJNTs).
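The Total and Minimal conditions can be checked mechanically. The following is a minimal sketch, not the DISCOVER implementation. It assumes the input is already a valid tree of joinable tuples, that keyword containment is a case-insensitive substring match on the tuple text, and that only leaf tuples need to be tested for removal (removing an inner tuple would disconnect the tree). The names `is_total`, `is_mtjnt` and `TUPLES` are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def is_total(text: Dict[str, str], nodes: Set[str], keywords: Iterable[str]) -> bool:
    """Total: every keyword occurs in at least one tuple of the network."""
    return all(any(k.lower() in text[n].lower() for n in nodes) for k in keywords)


def is_mtjnt(text: Dict[str, str], edges: List[Tuple[str, str]],
             keywords: Iterable[str]) -> bool:
    """Total, and dropping any leaf tuple breaks totality (Minimal)."""
    keywords = list(keywords)
    nodes = set(text)
    if not is_total(text, nodes, keywords):
        return False
    degree = {n: 0 for n in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    leaves = [n for n in nodes if degree[n] <= 1]
    return all(not is_total(text, nodes - {leaf}, keywords) for leaf in leaves)


# Tuple text taken from the example data of the deck.
TUPLES = {
    "o1": "1000105 12312 $5,000 John Smith",
    "o2": "1000111 12312 $3,000 Mike Miller",
    "o3": "1000125 10001 $7,000 Mike Miller",
    "c1": "12312 Brad Lou 01",
    "c2": "10001 George Walters 01",
    "n1": "01 USA N.America",
}


def pick(*ids: str) -> Dict[str, str]:
    return {i: TUPLES[i] for i in ids}


QUERY = ["Smith", "Miller"]
size2 = is_mtjnt(pick("o1", "c1", "o2"), [("o1", "c1"), ("c1", "o2")], QUERY)
size4 = is_mtjnt(pick("o1", "c1", "n1", "c2", "o3"),
                 [("o1", "c1"), ("c1", "n1"), ("n1", "c2"), ("c2", "o3")], QUERY)
# Attaching n1 to the size-2 answer keeps it total but no longer minimal.
not_minimal = is_mtjnt(pick("o1", "c1", "o2", "n1"),
                       [("o1", "c1"), ("c1", "o2"), ("c1", "n1")], QUERY)
print(size2, size4, not_minimal)
```

With the example tuples, the two networks listed for the query "Smith, Miller" pass the check, while the size-2 answer with an extra NATION tuple attached fails the Minimal condition.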

Background : DISCOVER

Joining Network Of Tuple Sets

• Tree of tuple sets
• For each pair of adjacent tuple sets R_i^K, R_j^M, there is an edge (R_i, R_j) in the schema graph G_u

Candidate Network

• Given : Set of keywords
• A Joining Network Of Tuple Sets such that there is an instance I of the database that has an MTJNT belonging to the network, and no tuple of that MTJNT that maps to a free tuple set contains any keyword

2 Steps :

• Candidate networks (CNs) generation
• CNs evaluation

Background : DISCOVER

Candidate Networks Generation

• Complete : Every possible MTJNT is produced by a candidate network output by the algorithm
• Minimal : Does not produce any redundant candidate networks

Example :

ORDERS^{Smith} ⋈ CUSTOMER^{} ⋈ ORDERS^{Miller}

ORDERS^{Smith} ⋈ CUSTOMER^{} ⋈ ORDERS^{Miller} ⋈ CUSTOMER^{}
ORDERS^{Smith} ⋈ CUSTOMER^{} ⋈ ORDERS^{}
ORDERS^{Smith} ⋈ LINEITEM^{} ⋈ ORDERS^{Miller}

Tmax : Maximum number of tuple sets in a CN
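A candidate network can be written as one SQL join query. The sketch below evaluates the first example CN (ORDERS with Smith, a free CUSTOMER tuple set, ORDERS with Miller) on the example data using Python's built-in `sqlite3` module. It is an illustration under assumptions, not the authors' code: keywords are matched with `LIKE` on the CLERK and NAME columns only, and a tuple set for one keyword excludes tuples that also contain the other keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ORDERS (ORDERKEY INTEGER, CUSTKEY INTEGER, TOTALPRICE TEXT, CLERK TEXT);
CREATE TABLE CUSTOMER (CUSTKEY INTEGER, NAME TEXT, NATIONKEY TEXT);
INSERT INTO ORDERS VALUES
  (1000105, 12312, '$5,000', 'John Smith'),
  (1000111, 12312, '$3,000', 'Mike Miller'),
  (1000125, 10001, '$7,000', 'Mike Miller'),
  (1000110, 10002, '$8,000', 'Keith Brown');
INSERT INTO CUSTOMER VALUES
  (12312, 'Brad Lou', '01'),
  (10001, 'George Walters', '01'),
  (10013, 'John Roberts', '01');
""")

# One CN = one join query. The WHERE clause encodes the three tuple sets:
# o1 contains only "Smith", o2 contains only "Miller", c contains no keyword.
CN_SQL = """
SELECT o1.ORDERKEY, c.CUSTKEY, o2.ORDERKEY
FROM ORDERS o1
JOIN CUSTOMER c ON o1.CUSTKEY = c.CUSTKEY
JOIN ORDERS o2 ON o2.CUSTKEY = c.CUSTKEY
WHERE o1.CLERK LIKE '%Smith%' AND o1.CLERK NOT LIKE '%Miller%'
  AND o2.CLERK LIKE '%Miller%' AND o2.CLERK NOT LIKE '%Smith%'
  AND c.NAME NOT LIKE '%Smith%' AND c.NAME NOT LIKE '%Miller%'
"""

rows = conn.execute(CN_SQL).fetchall()
print(rows)
```

On this data the query returns the single joining network made of order 1000105, customer 12312 and order 1000111, which is the size-2 result shown earlier for the query "Smith, Miller".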

CN Generation

[Figures omitted: three slides]

Source : Discover[2]

CN Evaluation

• Large number of CNs to be evaluated
• CNs : usually tightly coupled with each other
• Reuse the common sub-expressions

Example :

• Dataset : DBLP
• No. of tables : 4
• Max no. of tuples/result : 11
• CN join operations without sharing : 539,052
• CN join operations with sharing : 53,008
• Probability for any two CNs to share computational cost : 59.64%

Sequential Algorithm : Example

Dataset : DBLP

AUTHOR(TID, NAME)
WRITE(TID, AID, PID)
PAPER(TID, NAME)
CITE(TID, PID1, PID2)

Source : TTS[1]

Sequential Algorithm : Example

[Figure: keywords entered and the CNs generated for them, on the DBLP schema]

Source : TTS[1]

CN Evaluation : State-of-the-art sequential algorithm

Greedy algorithm :

In each iteration, build the intermediate result (IMR) of size [...] that maximizes a score computed from

• No. of occurrences of the IMR in CNs
• Estimated no. of tuples of the IMR

[...] gives better results

Sequential Algorithm : Execution Graph

Example CN : [Figure: execution graph of the example CNs]

Total Cost : 2199

Source : TTS[1]

Sequential Algorithm : Execution Graph

GE : DAG

• v ∈ V(GE) : An operator (e.g. ⋈ or σ)
• (u, v) ∈ E(GE) iff the output of u is part of the input of v

Levels

• σ : level 1
• A node v is in level l iff
  ∃ u such that (u, v) ∈ E(GE) and u is in level l − 1, and
  ∀ u' such that (u', v) ∈ E(GE), the level of u' is no larger than l − 1
• The maximum level of GE is the largest level of any node

Evaluation

• Evaluated in a bottom-up fashion
• No parallelism involved
• For keyword queries with a large number of keywords or with highly selective keywords, processing is slow
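The level definition amounts to "1 for a node without inputs, otherwise one more than the highest input level". A minimal sketch of that computation follows. The function name `node_levels` and the toy graph are hypothetical, and nodes without inputs are assumed to be the level-1 selections.

```python
from typing import Dict, List


def node_levels(inputs: Dict[str, List[str]]) -> Dict[str, int]:
    """inputs[v] lists the nodes whose output feeds operator v."""
    levels: Dict[str, int] = {}

    def level(v: str) -> int:
        if v not in levels:
            preds = inputs.get(v, [])
            # No inputs: level 1. Otherwise one above the highest input level.
            levels[v] = 1 if not preds else 1 + max(level(u) for u in preds)
        return levels[v]

    for v in inputs:
        level(v)
    return levels


# Toy execution graph: three selections, j1 joins s1 and s2, j2 joins j1 and s3.
graph = {"s1": [], "s2": [], "s3": [], "j1": ["s1", "s2"], "j2": ["j1", "s3"]}
levels = node_levels(graph)
max_level = max(levels.values())
print(levels, max_level)
```

Note that j2 lands in level 3 although one of its inputs (s3) is in level 1: the level is driven by the deepest input, as in the definition.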

New Solution

Use of multi-core architecture

Why not existing parallel multi-query processing?

• Large number of queries
• Large sharing between queries
• Large intermediate results

What do we need on multi-core architectures?

• CNs in the same core share : most computational cost
• CNs in different cores share : least computational cost
• Handle high workload skew
• Handle errors caused by estimation adaptively

CN Level Parallelism : Straightforward Approach

• Largest first rule : take the CNs in decreasing order of cost and add each one to the partition with the least workload

Final Cost : max(cost of each core) = 1949

Source : TTS[1]
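The largest first rule is the classic greedy load-balancing heuristic. The sketch below shows one way to write it with a min-heap of core loads. The function name and the CN costs are made up for illustration and do not reproduce the 1949 figure from the slide.

```python
import heapq
from typing import Dict, List, Tuple


def partition_largest_first(cn_costs: Dict[str, float],
                            n_cores: int) -> List[Tuple[float, int, List[str]]]:
    """Assign CNs, largest cost first, to the currently least loaded core."""
    # Heap entries are (load, core id, CNs on that core); core ids break ties.
    heap = [(0.0, core, []) for core in range(n_cores)]
    heapq.heapify(heap)
    for cn, cost in sorted(cn_costs.items(), key=lambda kv: -kv[1]):
        load, core, members = heapq.heappop(heap)
        members.append(cn)
        heapq.heappush(heap, (load + cost, core, members))
    return sorted(heap, key=lambda entry: entry[1])


# Hypothetical CN costs, three cores.
costs = {"c1": 9, "c2": 7, "c3": 6, "c4": 5, "c5": 4, "c6": 2}
plan = partition_largest_first(costs, 3)
final_cost = max(load for load, _, _ in plan)
print(plan, final_cost)
```

The final cost is the load of the busiest core. Because every CN is costed in isolation, work that two CNs on different cores have in common is counted, and executed, twice.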

CLP : Straightforward Approach

Execution Time = [...]

• Select the core : O(n)
• Add CN to partition : O([...])
• O([...])

Problem : Does not consider sub-expression sharing

Source : TTS[1]

CLP : Sharing-Aware CN Partitioning

Which CN to distribute first?

• The one with the largest not-shared (extra) cost

To which partition?

• The one with maximum sharing, if it does not destroy the workload balancing

Total cost for a partition = cost after sharing sub-expressions for all CNs in that partition
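The two questions above can be turned into a small greedy loop. This is only a sketch under stated assumptions, not the algorithm of TTS[1]: each CN is modelled as a set of operator ids, a partition costs the sum of its distinct operators, the "extra cost" of a CN on a core is the cost of its operators not yet on that core, and workload balancing is approximated by a hypothetical cap (`slack` times the ideal per-core share). All names and numbers are invented.

```python
from typing import Dict, List, Set, Tuple


def cost_of(ops: Set[str], op_cost: Dict[str, float]) -> float:
    return sum(op_cost[o] for o in ops)


def sharing_aware_partition(cns: Dict[str, Set[str]], op_cost: Dict[str, float],
                            n_cores: int, slack: float = 1.2
                            ) -> Tuple[List[List[str]], List[float]]:
    core_ops: List[Set[str]] = [set() for _ in range(n_cores)]
    core_cns: List[List[str]] = [[] for _ in range(n_cores)]
    # Hypothetical balance rule: no core may exceed slack * ideal share.
    cap = slack * cost_of(set().union(*cns.values()), op_cost) / n_cores
    remaining = dict(cns)

    def extra(cn: str, core: int) -> float:
        # Cost of the operators of cn that the core does not already run.
        return cost_of(remaining[cn] - core_ops[core], op_cost)

    def min_cost(cn: str) -> float:
        return min(extra(cn, c) for c in range(n_cores))

    while remaining:
        # Distribute first the CN whose cheapest placement is most expensive.
        cn = max(sorted(remaining), key=min_cost)
        fits = [c for c in range(n_cores)
                if cost_of(core_ops[c] | remaining[cn], op_cost) <= cap]
        if fits:
            core = min(fits, key=lambda c: extra(cn, c))  # maximum sharing
        else:
            core = min(range(n_cores), key=lambda c: cost_of(core_ops[c], op_cost))
        core_ops[core] |= remaining.pop(cn)
        core_cns[core].append(cn)
    return core_cns, [cost_of(ops, op_cost) for ops in core_ops]


op_cost = {"s1": 10, "s2": 10, "j1": 100, "j2": 50, "j3": 50, "j4": 100, "j5": 60}
cns = {"a": {"s1", "s2", "j1", "j2"}, "b": {"s1", "s2", "j1", "j3"},
       "c": {"s1", "j4"}, "d": {"s2", "j5"}}
assignment, core_costs = sharing_aware_partition(cns, op_cost, 2)
print(assignment, core_costs)
```

In this toy run, CNs a and b end up on the same core because they share the expensive join j1, so that join is paid for once rather than twice.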

[Figure sequence: sharing-aware partitioning of CNs c1 to c9 over Core 1, Core 2 and Core 3, using a MaxHeap keyed on MinCost. The execution-graph drawings (σ and ⋈ operators over A, W, P and C with per-operator costs; legend "Non-Exec Graph of Core 3") are omitted. The MinCost of each CN and the core costs shown at each step are kept below.]

CN | Step 1 | Step 2 | Step 3 | Step 4 | Step 5
1 | 720 | 610 | 115 | 115 | 115
2 | 727 | 727 | 727 | 727 | 727
3 | 727 | 115 | 115 | 115 | 115
4 | 715 | 605 | 115 | 115 | 115
5 | 727 | 115 | 115 | 115 | 115
6 | 727 | 727 | 727 | 727 | 727
7 | 715 | 715 | 715 | 715 | 715
8 | 727 | 115 | 115 | 115 | 115
9 | 727 | 115 | 115 | 115 | 115

Core costs shown (Core 1, Core 2, Core 3) :

Step 2 : (727) (727) (715)
Step 3 : (727) (727) (715); then (835)
Step 4 : (727) (727) (715); then (842) (835); then (1442) (1447) (950)
Step 5 : (727) (727) (715); then (842) (835); then (950)

CLP : Sharing-Aware CN Partitioning

Total Cost = 957

43.5% of the sequential cost

Source : TTS[1]

CLP : Sharing-Aware CN Partitioning

Execution Time = [...] [ Assuming [...] ]

• Initialization : [...]
• Select the core : [...]
• Update cost of overlapping CNs : |E(GE)|n overall

• Redundant work is done by multiple cores
• Wrong Estimation : Accumulated Cost

Source : TTS[1]

CLP : Error Accumulation

Source : TTS[1]

Operator Level Parallelism

• Each CN is allowed to be processed in different cores, but each operation must be processed in a single core
• Nodes in the same phase are processed in parallel

Time of partition?

• A node is handled in phase l if it is in level l of GE

Which operation to distribute?

• The one with the largest cost

To which partition?

• The one with minimum cost, if it does not destroy the workload balancing

Sharing between CNs and phases : Shared Memory
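Phase-by-phase distribution of operators can be sketched as follows. The sketch assumes that phase l holds exactly the level-l nodes, that a phase ends when its busiest core finishes, and that the overall cost is therefore the sum of the per-phase maxima. The function name, levels and costs are hypothetical.

```python
from typing import Dict, List, Tuple


def olp_schedule(levels: Dict[str, int], cost: Dict[str, float],
                 n_cores: int) -> Tuple[List[List[List[str]]], float]:
    """Per phase, hand out operators (largest cost first) to the least loaded core."""
    plan: List[List[List[str]]] = []
    total = 0.0
    for phase in range(1, max(levels.values()) + 1):
        ops = sorted((v for v in levels if levels[v] == phase),
                     key=lambda v: (-cost[v], v))
        loads = [0.0] * n_cores
        assigned: List[List[str]] = [[] for _ in range(n_cores)]
        for v in ops:
            core = loads.index(min(loads))  # least loaded core so far
            loads[core] += cost[v]
            assigned[core].append(v)
        plan.append(assigned)
        total += max(loads)  # a phase lasts as long as its busiest core
    return plan, total


levels = {"s1": 1, "s2": 1, "s3": 1, "s4": 1, "j1": 2, "j2": 2, "j3": 2, "j4": 3}
cost = {"s1": 10, "s2": 10, "s3": 10, "s4": 10,
        "j1": 100, "j2": 60, "j3": 50, "j4": 80}
plan, total = olp_schedule(levels, cost, 2)
print(plan, total)
```

Here the sequential cost would be 330 (the sum of all operator costs), while the two-core schedule costs 210. Every operator runs exactly once, so no work is repeated across cores.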

Operator Level Parallelism

Final Cost = 737

33.5% of the sequential cost

Source : TTS[1]
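The percentages quoted in the deck can be re-derived from the reported costs. The short check below uses the sequential total (2199) and the three parallel costs given on the slides; the 88.6% figure for the straightforward approach is derived here and is not stated in the deck.

```python
# Costs as reported on the slides for the running example.
sequential = 2199
costs = {
    "CLP (straightforward)": 1949,
    "CLP (sharing-aware)": 957,
    "OLP": 737,
}

# Cost of each approach as a percentage of the sequential cost.
ratios = {name: round(100 * c / sequential, 1) for name, c in costs.items()}
for name, pct in ratios.items():
    print(f"{name}: {pct}% of the sequential cost")
```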

OLP : Overcoming Error Accumulation

• Before each phase, re-estimate the cost of each operation
• Cost of a select operation
• Cost of a join operation

can be pre-computed and saved beforehand for each edge

OLP : Overcoming Accumulated Cost

[Figure omitted; cost values shown : 643, 685, 685]

Source : TTS[1]

Operator Level Parallelism

Execution Time

A join operation is much more costly than the others, so it becomes the dominant cost when processing

nodes overall

Source : TTS[1]

Data Level Parallelism

• Each operation in GE can be performed on multiple cores
• Uses operator level parallelism if there is no workload skew
• Partitions data adaptively each time before workload skew happens

Which node to partition?

• The most costly node, if it is dominant

When to merge the sub-results?

• At the final phase
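One way to picture the "partition the dominant node" step is shown below. This is a sketch under assumptions, not the procedure of TTS[1]: a node counts as dominant when its cost exceeds the ideal per-core share of the phase, its cost is assumed to split evenly with its data, and the copies are named with a `#k` suffix. The function name and the numbers are hypothetical.

```python
import math
from typing import Dict


def split_dominant_nodes(costs: Dict[str, float], n_cores: int) -> Dict[str, float]:
    """Split every node costlier than the ideal per-core share into equal copies."""
    share = sum(costs.values()) / n_cores
    result: Dict[str, float] = {}
    for node, c in costs.items():
        # Number of copies needed so that each copy fits within one share.
        k = math.ceil(c / share) if c > share else 1
        if k == 1:
            result[node] = c
        else:
            for i in range(k):
                result[f"{node}#{i + 1}"] = c / k
    return result


# One phase with a heavily skewed join, four cores.
phase_costs = {"j1": 300, "j2": 60, "j3": 40}
balanced = split_dominant_nodes(phase_costs, 4)
print(balanced)
```

After the split, the copies can be handed to the cores like ordinary operators, and their sub-results are merged later, at the final phase in the deck's description.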

Data Level Parallelism

[Figure: execution graph distributed over Core 1, Core 2 and Core 3]

Source : TTS[1]

Data Level Parallelism

Lemma :

• In each phase, at most [...] partition operations will be performed
• Each node will have max copies <= [...]

Execution Time = [...]

• Divide the tuples of the child node
• Select the child node to be partitioned
• Make copies of the selected child node and all its father nodes, and add the corresponding edges
• Re-estimate

Source : TTS[1]

Performance Studies

For LINEAR processing :

(processing time for the state-of-the-art sequential algorithm) / (no. of cores)

Implemented In : [...]

System Configuration : [...]

Performance Studies

Default values (IMDB) :

• [...] : 3
• [...] ranges from 2 to 6 with a default value 4
• [...] ranges from 4 to 7 with a default value 5

Source : TTS[1]

Vary [...] (IMDB)

[Figures omitted: three plots, one per varied parameter]

Source : TTS[1]

References

1. Lu Qin, Jeffrey Xu Yu, Lijun Chang. Ten Thousand SQLs: Parallel Keyword Queries Computing. Proceedings of the VLDB Endowment, Volume 3, Issue 1-2, September 2010, Singapore.

2. Vagelis Hristidis, Yannis Papakonstantinou. DISCOVER: Keyword Search in Relational Databases. VLDB '02: Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong.

3. [PPT] DISCOVER: Keyword Search in Relational Databases.

THANK YOU