
Ten Thousand SQLs

Date post: 08-Feb-2016
Page 1: Ten Thousand SQLs

Ten Thousand SQLs

Kalmesh Nyamagoudar 2010MCS3494

Page 2: Ten Thousand SQLs

October 13, 2011

CONTENTS

CN Generation
• Example
• Definitions
• Algorithm

CN Evaluation
• Sequential Algorithm
• CLP : Naïve
• CLP : New
• OLP
• DLP
• Performance Studies

Page 3: Ten Thousand SQLs

BANKS Model

Steiner Trees

[Figure: Author1 and Author2 connected through Paper1, and Author1 and Author2 connected through Paper2]

Page 4: Ten Thousand SQLs

DISCOVER Model

Schema :
• AUTHOR (TID, NAME)
• WRITES (TID, AID, PID)
• PAPER (TID, NAME)
• CITE (TID, PID1, PID2)

Joining Network Of Tuples
• Author1 ⋈ (Author1: Paper1) ⋈ Paper1 ⋈ (Author2: Paper1) ⋈ Author2
• Author1 ⋈ (Author1: Paper2) ⋈ Paper2 ⋈ (Author2: Paper2) ⋈ Author2

Joining Network Of Tuple Sets
• Author^{Author1} ⋈ Writes^{} ⋈ Paper^{} ⋈ Writes^{} ⋈ Author^{Author2}

Page 5: Ten Thousand SQLs

Background : DISCOVER

Database
• n relations
• Each relation has its own attributes

Schema Graph
• Directed graph that captures the primary key to foreign key (p-f) relationships in the database schema
• Nodes : one for each relation
• Edges : one for each p-f relationship
• Assumption : no self loops or parallel edges
• The undirected version of the schema graph is used for future reference

Page 6: Ten Thousand SQLs

Background : DISCOVER

[Figure: schema graph (TPC-H)]

Page 7: Ten Thousand SQLs

Background : DISCOVER
Example Data

ORDERS
      ORDERKEY  CUSTKEY  TOTALPRICE  CLERK        ...
o1    1000105   12312    $5,000      John Smith
o2    1000111   12312    $3,000      Mike Miller
o3    1000125   10001    $7,000      Mike Miller
o4    1000110   10002    $8,000      Keith Brown

CUSTOMER
      CUSTKEY  NAME            NATIONKEY  ...
c1    12312    Brad Lou        01
c2    10001    George Walters  01
c3    10013    John Roberts    01

NATION
      NATIONKEY  NAME  REGIONKEY
n1    01         USA   N.America

Source : Discover[3]

Page 8: Ten Thousand SQLs

Background : DISCOVER

Query: “Smith, Miller”

(Example data as on the previous slide)

Source : Discover[3]

Page 9: Ten Thousand SQLs

Background : DISCOVER

Query: “Smith, Miller”

SIZE  RESULT
2     o1 ⋈ c1 ⋈ o2

Source : Discover[3]

Page 10: Ten Thousand SQLs

Background : DISCOVER

Query: “Smith, Miller”

Joining Network Of Tuples

SIZE  RESULT
2     o1 ⋈ c1 ⋈ o2
4     o1 ⋈ c1 ⋈ n1 ⋈ c2 ⋈ o3

Source : Discover[3]

Page 11: Ten Thousand SQLs

October 5, 2011

Background : DISCOVER

Joining Network Of Tuple Sets

Source : Discover[2]

Page 12: Ten Thousand SQLs

Background : DISCOVER

Final Answer : Joining Network Of Tuples
• Tree of tuples
• For each pair of adjacent tuples t_i, t_j, where t_i ∈ R_i, t_j ∈ R_j, there is an edge (R_i, R_j) in the schema graph and (t_i ⋈ t_j) ∈ (R_i ⋈ R_j)

Keyword Query
• Given : Set of keywords
• Result : Set of all possible joining networks of tuples that are both :
  • Total : every keyword is contained in at least one tuple of the joining network
  • Minimal : no tuple can be removed such that what remains is still a total joining network of tuples (TJNT)
• Ordered By : Size of MTJNTs
• All such joining networks of tuples are Minimal Total Joining Networks Of Tuples (MTJNTs)
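The Total and Minimal conditions can be checked mechanically. The sketch below is an illustration only: the function names and the dictionary representation are assumptions, and it takes for granted that the edges already form a valid tree of joinable tuples. Because removing an inner tuple would disconnect the tree, only leaf tuples are tried for removal.

```python
def is_total(network, keywords):
    """network maps a tuple id to the set of query keywords that tuple contains."""
    covered = set()
    for words in network.values():
        covered |= words
    return keywords <= covered


def is_mtjnt(network, edges, keywords):
    """Check Total and Minimal for a tree of tuples (edges are pairs of tuple ids)."""
    if not is_total(network, keywords):
        return False
    degree = {t: 0 for t in network}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    # Only a leaf can be dropped without breaking the tree into pieces.
    for leaf in [t for t, d in degree.items() if d <= 1]:
        rest = {t: w for t, w in network.items() if t != leaf}
        if is_total(rest, keywords):
            return False  # a smaller total network exists, so not minimal
    return True


query = {"Smith", "Miller"}

# o1 (clerk John Smith), c1, o2 (clerk Mike Miller) from the example data
size2 = {"o1": {"Smith"}, "c1": set(), "o2": {"Miller"}}
size2_edges = [("o1", "c1"), ("c1", "o2")]

# Same network with n1 attached to c1: still total, but n1 is redundant
padded = dict(size2, n1=set())
padded_edges = size2_edges + [("c1", "n1")]

print(is_mtjnt(size2, size2_edges, query))    # True
print(is_mtjnt(padded, padded_edges, query))  # False
```

The second call returns False because the free leaf n1 can be removed while every keyword stays covered.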

Page 13: Ten Thousand SQLs

Background : DISCOVER

Joining Network Of Tuple Sets
• Tree of tuple sets
• For each pair of adjacent tuple sets, there is an edge between their relations in the schema graph

Candidate Network
• Given : Set of keywords
• Is a Joining Network Of Tuple Sets such that there is an instance I of the database that has an MTJNT belonging to it, and no tuple of that MTJNT that maps to a free tuple set contains any keywords

2 Steps :
• Candidate networks (CNs) generation
• CNs evaluation

Page 14: Ten Thousand SQLs

Background : DISCOVER

Candidate Networks Generation
• Complete : Every possible MTJNT is produced by a candidate network output by the algorithm
• Minimal : Does not produce any redundant candidate networks

Example (query “Smith, Miller”) :
• Candidate network :
  • ORDERS^{Smith} ⋈ CUSTOMER^{} ⋈ ORDERS^{Miller}
• Not candidate networks :
  • ORDERS^{Smith} ⋈ CUSTOMER^{} ⋈ ORDERS^{Miller} ⋈ CUSTOMER^{}
  • ORDERS^{Smith} ⋈ CUSTOMER^{} ⋈ ORDERS^{}
  • ORDERS^{Smith} ⋈ LINEITEM^{} ⋈ ORDERS^{Miller}

Tmax : Maximum number of tuple sets in a CN

Page 15: Ten Thousand SQLs

15

CN Generation

October 13, 2011

Source : Discover[2]

Page 16: Ten Thousand SQLs

16

CN Generation

October 13, 2011

Source : Discover[2]

Page 17: Ten Thousand SQLs

17

CN Generation

October 13, 2011

Source : Discover[2]

Page 18: Ten Thousand SQLs

CN Evaluation

• Large number of CNs to be evaluated
• CNs : usually tightly coupled with each other
• Reuse the common sub-expressions

Example :
• Dataset : DBLP
• No. of tables : 4
• Max no. of tuples/result : 11
• CN join operations without sharing : 539,052
• CN join operations with sharing : 53,008
• Probability for any two CNs to share computational cost : 59.64%
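To make the effect of reusing common sub-expressions concrete, here is a minimal sketch that counts join operations with and without sharing. The three plans and the nested-tuple encoding are hypothetical and chosen only for illustration; they are not the CNs from the DBLP experiment.

```python
def join_subexpressions(plan):
    """Yield every join sub-expression of a plan.

    A plan is either a string (a tuple set such as "A{k1}") or a pair
    (left, right) that stands for left ⋈ right.
    """
    if isinstance(plan, tuple):
        left, right = plan
        yield from join_subexpressions(left)
        yield from join_subexpressions(right)
        yield plan


# Hypothetical CNs that all start with the same join A{k1} ⋈ W{}
plans = [
    (("A{k1}", "W{}"), "P{k2}"),
    (("A{k1}", "W{}"), "P{}"),
    ((("A{k1}", "W{}"), "P{}"), "C{k2}"),
]

# Without sharing every CN evaluates all of its own joins.
joins_without_sharing = sum(1 for p in plans for _ in join_subexpressions(p))

# With sharing each distinct sub-expression is evaluated once.
joins_with_sharing = len({e for p in plans for e in join_subexpressions(p)})

print(joins_without_sharing, joins_with_sharing)  # 7 4
```

The same counting idea, applied to the slide's DBLP figures, corresponds to 53,008 shared joins instead of 539,052, roughly a tenfold reduction.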

Page 19: Ten Thousand SQLs

Sequential Algorithm : Example

Dataset : DBLP

Schema :
• AUTHOR (TID, NAME)
• WRITE (TID, AID, PID)
• PAPER (TID, NAME)
• CITE (TID, PID1, PID2)

Source : TTS[1]

Page 20: Ten Thousand SQLs

Sequential Algorithm : Example

Keywords Entered :
CNs generated :

(DBLP schema as on the previous slide)

Source : TTS[1]

Page 21: Ten Thousand SQLs

21

CN Evaluation : state-of-the-art sequential algorithm

Greedy algorithm:

In each iteration build intermediate result of size that

maximizes

No of occurrences of IMR in CNs

Estimated no. of tuples of IMR

gives better results

October 13, 2011

Page 22: Ten Thousand SQLs

Sequential Algorithm : Execution Graph

[Figure: execution graph for the example CNs]

Total Cost : 2199

Source : TTS[1]

Page 23: Ten Thousand SQLs

Sequential Algorithm : Execution Graph

GE : DAG
• v ∈ V(GE) : an operator (e.g. ⋈ or σ)
• (u, v) ∈ E(GE) iff the output of u is part of the input of v

Levels
• σ : level 1
• A node v is in level l iff
  • ∃u such that (u, v) ∈ E(GE) and u is in level l − 1, and
  • ∀u' such that (u', v) ∈ E(GE), the level of u' is no larger than l − 1
• Maximum level of GE : the largest level of any node

Evaluation
• Evaluated in a bottom-up fashion
• No parallelism involved
• For keyword queries with a large number of keywords or with highly selective keywords, processing is slow
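The level definition amounts to "one more than the highest level among a node's inputs". A minimal sketch, assuming the execution graph is given as a dictionary from each operator to the operators whose output it consumes (the node names are made up for illustration):

```python
from functools import lru_cache


def node_levels(inputs):
    """Return the level of every node of an execution graph.

    inputs maps a node to the list of nodes whose output it consumes;
    nodes without inputs (the selections) are in level 1.
    """
    @lru_cache(maxsize=None)
    def level(node):
        children = inputs[node]
        if not children:
            return 1
        return 1 + max(level(child) for child in children)

    return {node: level(node) for node in inputs}


# Hypothetical execution graph: three selections and two joins
graph = {
    "sel_A_k1": [],
    "sel_W": [],
    "sel_P_k2": [],
    "join_AW": ["sel_A_k1", "sel_W"],
    "join_AWP": ["join_AW", "sel_P_k2"],
}

levels = node_levels(graph)
max_level = max(levels.values())
print(levels["join_AW"], levels["join_AWP"], max_level)  # 2 3 3
```

Note that join_AWP lands in level 3 even though one of its inputs is in level 1, because no input may be in a level higher than l − 1 and at least one must be exactly there.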

Page 24: Ten Thousand SQLs

New Solution

Use of multi-core architecture

Why not existing parallel multi-query processing?
• Large number of queries
• Large sharing between queries
• Large intermediate results

What do we need on multi-core architectures?
• CNs in the same core share : most computational cost
• CNs in different cores share : least computational cost
• Handle high workload skew
• Handle errors caused by estimation adaptively

Page 25: Ten Thousand SQLs

CN Level Parallelism : Straightforward Approach

• Largest first rule : distribute the CNs in decreasing order of cost
• Each CN goes to the partition with the least workload

Final Cost : max(cost of each core) = 1949

Source : TTS[1]
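The largest first rule can be sketched with a min-heap of core workloads. The function name and the CN costs below are assumptions for illustration; they are not the costs behind the 1949 figure on the slide.

```python
import heapq


def largest_first_partition(cn_cost, n_cores):
    """Assign CNs to cores: largest CN first, always to the least loaded core."""
    heap = [(0, core) for core in range(n_cores)]  # (workload, core id)
    parts = [[] for _ in range(n_cores)]
    for cn in sorted(cn_cost, key=lambda c: (-cn_cost[c], c)):
        load, core = heapq.heappop(heap)           # least loaded core
        parts[core].append(cn)
        heapq.heappush(heap, (load + cn_cost[cn], core))
    loads = [0] * n_cores
    for load, core in heap:
        loads[core] = load
    return parts, loads


# Hypothetical CN costs, two cores
costs = {"c1": 7, "c2": 5, "c3": 4, "c4": 3, "c5": 1}
parts, loads = largest_first_partition(costs, 2)
final_cost = max(loads)
print(parts, loads, final_cost)  # [['c1', 'c4'], ['c2', 'c3', 'c5']] [10, 10] 10
```

Each CN is costed in isolation here, which is exactly why this approach ignores sub-expressions shared between CNs.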

Page 26: Ten Thousand SQLs

26

CLP : Straightforward Approach

Execution Time =

Problem : does not consider sub-expression sharing

Source : TTS[1]

select the core : O(n) Add CN to partition :O()

O()

October 13, 2011

Page 27: Ten Thousand SQLs

CLP : Sharing-Aware CN Partitioning

Which CN to distribute first?
• The one with the largest not-shared (extra) cost

To which partition?
• The one with maximum sharing, if it does not destroy the workload balancing

Total cost for a partition = cost after sharing sub-expressions for all CNs in that partition
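A rough sketch of these two rules follows. It is a simplification under stated assumptions, not the algorithm of TTS[1]: each CN is modelled as a set of operator ids, the extra cost of a CN on a core is the cost of its operators not yet placed there, and the workload balancing test is reduced to a hypothetical `balance_limit` cap on a core's cost.

```python
def sharing_aware_partition(cns, op_cost, n_cores, balance_limit):
    """cns: CN name -> set of operator ids; op_cost: operator id -> cost."""
    placed = [set() for _ in range(n_cores)]   # operators already on each core
    assign = [[] for _ in range(n_cores)]
    load = [0] * n_cores
    remaining = dict(cns)

    def extra(ops, core):
        # Cost this CN would add to the core after sharing sub-expressions.
        return sum(op_cost[o] for o in ops - placed[core])

    while remaining:
        # Pick the CN whose cheapest placement is the most expensive.
        name = max(
            remaining,
            key=lambda c: (min(extra(remaining[c], k) for k in range(n_cores)), c),
        )
        ops = remaining.pop(name)
        # Prefer the core with maximum sharing (smallest extra cost).
        order = sorted(range(n_cores), key=lambda k: (extra(ops, k), load[k]))
        allowed = [k for k in order if load[k] + extra(ops, k) <= balance_limit]
        # Assumed fallback: the least loaded core when every core would exceed the cap.
        core = allowed[0] if allowed else min(range(n_cores), key=lambda k: load[k])
        load[core] += extra(ops, core)
        placed[core] |= ops
        assign[core].append(name)
    return assign, load


# Hypothetical operators and CNs
op_cost = {"a": 10, "b": 10, "j_ab": 100, "j_big": 500, "j_x": 50, "j_y": 50}
cns = {
    "c1": {"a", "b", "j_ab", "j_big"},
    "c2": {"a", "b", "j_ab", "j_x"},
    "c3": {"a", "b", "j_ab", "j_y"},
}
assign, load = sharing_aware_partition(cns, op_cost, n_cores=2, balance_limit=700)
print(assign, load)  # [['c1', 'c3'], ['c2']] [670, 170]
```

In this run c3 joins c1 on the first core because it adds only 50 there, while c2 is pushed to the second core once the cap would be exceeded, so the shared operators a, b and j_ab are computed on both cores.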

Page 28: Ten Thousand SQLs

[Figure: execution graph over A, P, W and C with selections on k1 and k2; CNs c1 to c9 are kept in a MaxHeap and distributed over Core 1, Core 2 and Core 3. Caption: Non-Exec Graph of Core 3]

CN       1    2    3    4    5    6    7    8    9
MinCost  720  727  727  715  727  727  715  727  727

Page 29: Ten Thousand SQLs

[Figure: execution graph with CNs c1 to c9, MaxHeap, and Core 1, Core 2, Core 3. Core cost labels: (727) (727) (715)]

CN       1    2    3    4    5    6    7    8    9
MinCost  610  727  115  605  115  727  715  115  115

Page 30: Ten Thousand SQLs

[Figure: execution graph with CNs c1 to c9, MaxHeap, and Core 1, Core 2, Core 3. Core cost labels: (727) (727) (715); (835)]

CN       1    2    3    4    5    6    7    8    9
MinCost  115  727  115  115  115  727  715  115  115

Page 31: Ten Thousand SQLs

[Figure: execution graph with CNs c1 to c9, MaxHeap, and Core 1, Core 2, Core 3. Core cost labels: (727) (727) (715); (842) (835); (1442) (1447) (950)]

CN       1    2    3    4    5    6    7    8    9
MinCost  115  727  115  115  115  727  715  115  115

Page 32: Ten Thousand SQLs

[Figure: execution graph with CNs c1 to c9, MaxHeap, and Core 1, Core 2, Core 3. Core cost labels: (727) (727) (715); (842) (835); (950)]

CN       1    2    3    4    5    6    7    8    9
MinCost  115  727  115  115  115  727  715  115  115

Page 33: Ten Thousand SQLs

33

CLP:Sharing-Aware CN Partitioning

Total Cost = 957

43.5% of the sequential cost

Source : TTS[1]

October 13, 2011

Page 34: Ten Thousand SQLs

CLP : Sharing-Aware CN Partitioning

Execution Time = [ Assuming ]
• Initialization
• Select the core :
• Update cost of overlapping CNs : |E(GE)|n overall

Problems :
• Redundant work is done by multiple cores
• Wrong Estimation : Accumulated Cost

Source : TTS[1]

Page 35: Ten Thousand SQLs

35

CLP:Error Accumulation

Source : TTS[1]

October 13, 2011

Page 36: Ten Thousand SQLs

Operator Level Parallelism

• Each CN is allowed to be processed in different cores, but each operation must be processed in a certain core
• Nodes in the same phase are processed in parallel

Time of partition?
• A node is partitioned in phase l if it is in level l of GE

Which operation to distribute?
• The one with the largest cost

To which partition?
• The one with minimum cost, if it does not destroy the workload balancing

Sharing between CNs and phases : Shared Memory
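The phase-by-phase distribution can be sketched as follows. This is an illustration under assumptions: operator costs and levels are given up front (no re-estimation), the balancing test is reduced to "always take the least loaded core", and all names are hypothetical.

```python
import heapq
from collections import defaultdict


def operator_level_schedule(op_cost, op_level, n_cores):
    """Schedule operators phase by phase; phase l holds the level-l operators."""
    by_level = defaultdict(list)
    for op, lvl in op_level.items():
        by_level[lvl].append(op)

    schedule = {}
    total_time = 0
    for lvl in sorted(by_level):
        heap = [(0, core) for core in range(n_cores)]  # (workload, core id)
        placement = defaultdict(list)
        # Largest operation first, to the core with the minimum cost so far.
        for op in sorted(by_level[lvl], key=lambda o: (-op_cost[o], o)):
            load, core = heapq.heappop(heap)
            placement[core].append(op)
            heapq.heappush(heap, (load + op_cost[op], core))
        schedule[lvl] = dict(placement)
        # A phase ends when its slowest core is done.
        total_time += max(load for load, _ in heap)
    return schedule, total_time


# Hypothetical operators: three selections, two level-2 joins, one level-3 join
op_cost = {"s1": 10, "s2": 10, "s3": 10, "j1": 100, "j2": 102, "j3": 500}
op_level = {"s1": 1, "s2": 1, "s3": 1, "j1": 2, "j2": 2, "j3": 3}

schedule, total_time = operator_level_schedule(op_cost, op_level, n_cores=2)
print(total_time)  # 622
```

In the last phase only one core is busy with the 500-cost join, which is the kind of workload skew that data level parallelism is meant to remove.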

Page 37: Ten Thousand SQLs

37

Operator Level Parallelism

Final Cost = 737

33.5% of the sequential cost

Source : TTS[1]

October 13, 2011
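The percentages quoted on the slides follow directly from the costs of the running example (sequential 2199, straightforward CLP 1949, sharing-aware CLP 957, OLP 737). A small check, with variable names chosen here for illustration:

```python
sequential_cost = 2199
parallel_costs = {
    "CLP (largest first)": 1949,
    "CLP (sharing-aware)": 957,
    "OLP": 737,
}

# Cost of each approach as a percentage of the sequential cost
relative = {
    name: round(100 * cost / sequential_cost, 1)
    for name, cost in parallel_costs.items()
}

for name, pct in relative.items():
    print(f"{name}: {pct}% of the sequential cost")
```

The 88.6% for the straightforward approach is not stated on the slides; it is derived here from the 1949 and 2199 figures.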

Page 38: Ten Thousand SQLs

38

OLP : Overcoming Error Accumulation

Before each phase, re-estimate the cost of each operation

Cost of a select operation

Cost of a join operation can be pre-computed and saved beforehand for each edge

October 13, 2011

Page 39: Ten Thousand SQLs

39

OLP : Overcoming Accumulated Cost

Source : TTS[1]

643

685

685

October 13, 2011

Page 40: Ten Thousand SQLs

40

Operator Level Parallelism

Execution Time

A join operation is much more costly than others => becomes the dominant cost when processing

Source : TTS[1] nodes overall

October 13, 2011

Page 41: Ten Thousand SQLs

Data Level Parallelism

• Each operation in GE can be performed on multiple cores
• Uses operator level parallelism if there is no workload skew
• Partitions data adaptively each time before workload skew happens

Which node to partition?
• The most costly node, if it is dominant

When to merge the sub-results?
• At the final phase
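One way to picture "partition the most costly node if it is dominant" is the sketch below. The dominance test (cost above the ideal per-core share) and the assumption that a node's cost splits evenly with its data are simplifications made for this example; they are not taken from TTS[1].

```python
import math


def split_if_dominant(op_cost, n_cores):
    """Split the most costly operation of a phase into data partitions if it dominates."""
    total = sum(op_cost.values())
    ideal = total / n_cores                       # perfectly balanced load per core
    op, cost = max(op_cost.items(), key=lambda kv: (kv[1], kv[0]))
    if cost <= ideal:
        return dict(op_cost)                      # no skew: keep operator level parallelism
    pieces = min(n_cores, math.ceil(cost / ideal))
    result = {k: v for k, v in op_cost.items() if k != op}
    for i in range(pieces):
        # Assumption: cost is proportional to the share of tuples processed.
        result[f"{op}#{i}"] = cost / pieces
    return result


# Hypothetical phase with one dominant join
skewed = {"j1": 100, "j2": 102, "j3": 500}
balanced = split_if_dominant(skewed, n_cores=3)
print(sorted(balanced))  # ['j1', 'j2', 'j3#0', 'j3#1', 'j3#2']
```

The partial results of j3#0, j3#1 and j3#2 would then be merged at the final phase, as stated on the slide.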

Page 42: Ten Thousand SQLs

42

Data Level Parallelism

[Figure: data level parallelism across Core 1, Core 2 and Core 3]

Source : TTS[1]

October 13, 2011

Page 43: Ten Thousand SQLs

43

Data Level Parallelism

Lemma : In each phase, at most partition operations will be performed Each node will have max copies <= Execution Time = with =

=

Source : TTS[1]

Divide the tuples of child node

Select the child node to be partitioned

Make copies of the selected child node and all its father nodes

Add corresponding edges

Re-estimate

October 13, 2011

Page 44: Ten Thousand SQLs

44

Performance Studies

For LINEAR processing

processing time for the state of art sequential algorithm no of cores

Implemented In

System Configuration

October 13, 2011

Page 45: Ten Thousand SQLs

45

Default values(IMDB) : 3 ranges from 2 to 6 with a default value 4 ranges from 4 to 7 with a default value 5

Source : TTS[1]

Performance Studies

October 13, 2011

Page 46: Ten Thousand SQLs

46

Vary (IMDB)

Source : TTS[1]

October 13, 2011

Page 47: Ten Thousand SQLs

47

Vary (IMDB)

Source : TTS[1]

October 13, 2011

Page 48: Ten Thousand SQLs

48

Vary (IMDB)

Source : TTS[1]

October 13, 2011

Page 49: Ten Thousand SQLs

References

1. Lu Qin, Jeffrey Xu Yu, Lijun Chang, Ten Thousand SQLs: Parallel Keyword Queries Computing, Proceedings of the VLDB Endowment, Volume 3, Issue 1-2, September 2010, Singapore

2. Vagelis Hristidis, Yannis Papakonstantinou, DISCOVER: Keyword Search in Relational Databases, VLDB '02: Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong

3. [PPT] DISCOVER: Keyword Search in Relational Databases

Page 50: Ten Thousand SQLs

50

THANK YOU

October 13, 2011

