Scientific Project: Databases for Multi-dimensional Data, Genomics and modern Hardware

David Broneske, Gabriel Campero, Bala Gurumurthy, Marcus Pinnecke, Gunter Saake

October 18, 2019


Overview

• Concepts of this course

• Course of action (milestones, presentations)

• Overview of project topics & forming project teams

• How to perform literature research?

• Further lectures:

Academic writing (2 lectures)


Organization


Scientific Project: Modules

Bachelor

• Module: WPF FIN SMK (Schlüssel- und Methodenkompetenzen, i.e., key and methodological competencies)

• 5 CP = 150h ⇒ 42h presence time (3 SWS) + 108h autonomous work

Master

• Module: Scientific Team Project (Inf, IngInf, WIF, CV)

DKE: Methods 2 or Applications

DE: Interdisciplinary Team Project, Specialization

• 6 CP = 180h ⇒ 42h presence time (3 SWS) + 138h autonomous work

Grade at the end of the course for the whole project team


Scientific Project: Prerequisite

• Successful programming test in C++/Java/Python

• 1h theoretical test in a seminar room (date and place to be discussed)

• At least half of the team members have to pass the test

• Topics:

Some language specifics

General program understanding

Control flow understanding

• You can take all tests and have to pass at least one!


Scientific Project: Semester Plan

Week            Monday      Tuesday     Wednesday   Thursday    Friday
Introduction    14.10.2019  15.10.2019  16.10.2019  17.10.2019  18.10.2019
Team Formation  21.10.2019  22.10.2019  23.10.2019  24.10.2019  25.10.2019
Final Teams     28.10.2019  29.10.2019  30.10.2019  31.10.2019  01.11.2019
-               04.11.2019  05.11.2019  06.11.2019  07.11.2019  08.11.2019
MS I            11.11.2019  12.11.2019  13.11.2019  14.11.2019  15.11.2019
-               18.11.2019  19.11.2019  20.11.2019  21.11.2019  22.11.2019
-               25.11.2019  26.11.2019  27.11.2019  28.11.2019  29.11.2019
MS II           02.12.2019  03.12.2019  04.12.2019  05.12.2019  06.12.2019
-               09.12.2019  10.12.2019  11.12.2019  12.12.2019  13.12.2019
-               16.12.2019  17.12.2019  18.12.2019  19.12.2019  20.12.2019
Winter Break    ...         ...         ...         ...         ...
MS III          06.01.2020  07.01.2020  08.01.2020  09.01.2020  10.01.2020
-               13.01.2020  14.01.2020  15.01.2020  16.01.2020  17.01.2020
-               20.01.2020  21.01.2020  22.01.2020  23.01.2020  24.01.2020
MS Final        27.01.2020  28.01.2020  29.01.2020  30.01.2020  31.01.2020

Also marked in the plan: Aca-I, Aca-II


Scientific Project: Milestones

• Milestone I - Topic, schedule, and team presentation & first results of literature research

• Milestone II - Concept & additional literature research

• Milestone III - Implementation & evaluation setup

• Milestone IV - Final presentation (wrap-up + evaluation results)


Concepts & Content


Lecture, Meetings & Presentation

Lecture & Presentation

• Time/Place: Friday, 13:00-15:00, G22A - 208

• Lectures with content of course → all

• Presentation of main milestones (see timetable) → each project team

Meetings (Exercise)

• Individual for each project team

• Time and room to be agreed in project teams!

• Presentation of all intermediate results/milestones (informal)

• Discussion, discussion, discussion . . .


Progress of Course

Deliveries

• 4 milestone presentations (main milestones)

• Each team member has to present at least once

• Reporting of (sub) milestones in exercises/meetings

• Written paper about literature research (technical report)

• Prototypical implementation


Deliveries and Grading (I)

Technical Report

• Delivery of report at a given time (deadline)

• Quality/Quantity of literature research

• Number of pages

• Quality of paper structure and evaluation

• Own contribution


Deliveries and Grading (II)

Presentation & Discussion

• Quality of scientific presentation (structure, references, time)

• Assessment regarding the content (e.g., results of particular milestones)

• Participation in discussion

Organization

• Strictness

• Communication (just-in-time answers, satisfying time constraints)

• Self-organization (sharing tasks, internal reporting of current state of work, dealing with problems)

• Autonomous working


Deliveries and Grading (III)

• Grade consists of:

Presentations: 30%,

Implementation: 30%,

Paper: 30%,

Soft Skills: 10%

• Binding registration: Second Milestone


Objectives & Qualification (I)

Acquired skills, specific to research

• Performing literature research

• Understanding and structured reviewing of scientific work

• Autonomous, solution-based reasoning on research task (e.g., finding alternative solutions)

• How to ask? How to adapt a task (extend/reduce)?

• Academic writing


Objectives & Qualification (II)

Acquired skills, always needed

• Team management

• Project and time scheduling

• Presentation of results

• Flexibility regarding changing conditions

• Reasoning about solutions ("Why is this the best/not adequate ...")


Task & Time Management

Task Management

• Main milestones have to be finished in time

• (Sub) milestones are less strict (but don’t be sloppy)

• Pre-defined work packages ⇒ each project team

. . . defines sub work packages

. . . determines responsibilities for these packages (divide & conquer)

Time Management

• Planning of periods

• Regarding capacities and resources

• Considering other tasks and activities

• Reporting of delays immediately to project members!


Role Management

• Possible roles: team leader, developer, researcher, . . .

• Working together vs. responsibilities: design, implementation, testing, writing, . . .

• Delegate for important roles/work packages

• Assignment of (sub) tasks to role for each milestone


Topic & Project Teams

• Teams with 4 to 6 students

• Most topics can be chosen only once

• Projects

Theoretical part

• State of the art

• New ideas

Practical part

• Usually in C++, Java, or Python

• Prototypical implementation

• Evaluation part


Topic 1 - Fragment Skipper (Gridformation)

Intro: Data skipping in Hadoop

• A good data fragmentation is necessary for scalable processing in Big Data.

• Aggressive data skipping is the SOTA ML approach, using hierarchical agglomerative clustering with Ward’s method.

• We developed a competitor based on deep reinforcement learning (DRL). How good is our solution? (Can we make it better?)
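
The basic idea behind data skipping can be sketched as follows. This is only a minimal illustration, assuming that every fragment keeps min/max metadata per column; the names `Fragment`, `build_fragments`, and `scan_with_skipping` are hypothetical and belong neither to Hadoop or Spark nor to the approaches named above. The actual research question (how rows are assigned to fragments) is reduced here to sorting on a single column.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    rows: list      # rows stored in this fragment (dicts)
    min_max: dict   # column name -> (smallest value, largest value)


def build_fragments(rows, column, fragment_size):
    """Sort rows on one column and cut them into fixed-size fragments."""
    ordered = sorted(rows, key=lambda r: r[column])
    fragments = []
    for start in range(0, len(ordered), fragment_size):
        chunk = ordered[start:start + fragment_size]
        stats = {c: (min(r[c] for r in chunk), max(r[c] for r in chunk))
                 for c in chunk[0]}
        fragments.append(Fragment(chunk, stats))
    return fragments


def scan_with_skipping(fragments, column, low, high):
    """Return matching rows and the number of fragments actually read."""
    result, fragments_read = [], 0
    for frag in fragments:
        frag_min, frag_max = frag.min_max[column]
        if frag_max < low or frag_min > high:
            continue  # metadata proves that no row can match: skip
        fragments_read += 1
        result.extend(r for r in frag.rows if low <= r[column] <= high)
    return result, fragments_read


# Synthetic example data (an assumption, not a benchmark data set).
rows = [{"id": i, "price": (i * 37) % 100} for i in range(100)]
fragments = build_fragments(rows, "price", fragment_size=10)
matches, fragments_read = scan_with_skipping(fragments, "price", 20, 29)
```

How many fragments can be skipped depends entirely on the fragmentation, which is why clustering or learning-based partitioning is of interest here.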

Your Task

• Literature research: aggressive data skipping, basics of deep reinforcement learning.

• Prototypical implementation of aggressive data skipping using Spark. Experimental evaluation and analysis comparing with our DRL solution.

• Long-term goal: A production-ready open source AI-based partitioning tool & a paper for consideration in SysML.


Topic 2 - Similarity Skipper (Sim-Skip)

Intro: Learning to hash for high dimensional data management

• Top-k search in dense high-dimensional vectors requires specialized solutions. This is a very relevant application.

• When data is large, even parallel scans will be inefficient.

• Learned hashing is the current SOTA for managing this kind of data in image domains. How well does this technique perform on structured relational data?
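
To make the role of hashing in top-k search more concrete, here is a small sketch. It deliberately swaps in random hyperplane hashing (a non-learned technique) for the learned hash function; in a deep hashing setup, `hash_vector` would be replaced by a trained network. All function names and the synthetic data are assumptions for illustration only.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw one random hyperplane (normal vector) per output bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def hash_vector(vector, hyperplanes):
    """Binary code as an int: bit i is set if the vector lies on the
    non-negative side of hyperplane i."""
    code = 0
    for i, plane in enumerate(hyperplanes):
        if sum(p * v for p, v in zip(plane, vector)) >= 0.0:
            code |= 1 << i
    return code


def hamming(a, b):
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")


def top_k(query, codes, hyperplanes, k):
    """Indices of the k codes closest to the query in Hamming distance."""
    q = hash_vector(query, hyperplanes)
    ranked = sorted(range(len(codes)), key=lambda idx: hamming(q, codes[idx]))
    return ranked[:k]


# Synthetic dense vectors (an assumption, not a real data set).
rng = random.Random(42)
data = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(200)]
planes = make_hyperplanes(num_bits=32, dim=16)
codes = [hash_vector(v, planes) for v in data]
nearest = top_k(data[7], codes, planes, k=5)
```

The ranking above still touches every code; the point of compact codes is that they are cheap to compare and can be used as bucket keys so that most of the data never has to be read.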

Your Task

• Literature research: Hadoop file formats, deep hashing, triplet training of neural networks.

• Prototypical implementation of a deep hashing process using TensorFlow and Spark. Experimental evaluation and analysis.

• Long-term goal: A workshop paper in DEEM@SIGMOD.


Topic 3 - Graph processing w/ ML (MLGQE: Malko)

Intro: Applications of deep learning for graph data processing

• Graphs are everywhere & graph technologies are a moving target. To date, they are still very heuristic-driven. ML is barely tested.

• Differentiable neural computers are already able to act like primitive graph query engines.

• How well do models like these fare nowadays for graph processing, and what needs to be improved?

Your Task

• Literature research: differentiable neural computers, graph nets, RDF-3X.

• Prototypical implementation using existing libraries from DeepMind, selected datasets. Experimental evaluation and analysis.

• Long-term goal: A workshop paper in GRADES/AIDM@SIGMOD, or AIDB@VLDB.


Topic 4 - Cross device data parallel query processing

Intro

• Heterogeneous hardware used for improving performance

• Work partitioning is problematic and requires training data for best device selection

• Further, data-parallel execution cannot be decided a priori

We’ve got

• Dispatcher implementation

• Iterative functional parallel executor

Your Task

• Literature Research: data-parallel, iterative and cross-device execution systems

• Understanding of morsel-driven parallelism

• Invention of a clever concept to perform data-parallel as well as functional-parallel execution across devices

• Implementation of your concept for missing database operators - hash join

• Benchmarking the execution system
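
As a rough orientation for the task, the following sketch shows the scheduling structure of a morsel-style, data-parallel probe phase of a hash join: the hash table is built once, the probe input is cut into small morsels, and workers pick up morsels. It is a simplified assumption-laden illustration, not the existing dispatcher or executor; Python threads are used only to show the structure (no speed-up is claimed), device selection is left out, and all names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_hash_table(build_rows, key):
    """Build phase: map each join key to the rows that carry it."""
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    return table


def probe_morsel(morsel, table, key):
    """Probe phase for one morsel: emit (build_row, probe_row) pairs."""
    return [(b, p) for p in morsel for b in table.get(p[key], [])]


def morsel_hash_join(build_rows, probe_rows, key, morsel_size=4, workers=3):
    table = build_hash_table(build_rows, key)
    # Cut the probe side into small, equally sized work units (morsels).
    morsels = [probe_rows[i:i + morsel_size]
               for i in range(0, len(probe_rows), morsel_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda m: probe_morsel(m, table, key), morsels)
        return [pair for part in parts for pair in part]


# Hypothetical example relations.
customers = [{"ck": i, "name": f"c{i}"} for i in range(5)]
orders = [{"ok": j, "ck": j % 7} for j in range(20)]
joined = morsel_hash_join(customers, orders, "ck")
```

In a cross-device setting, the interesting design question is which device processes which morsel and how the shared hash table is made available to all of them.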


Topic 5 - Evaluating GPU-based Execution Models

Intro

• Query processing models for GPU: block-at-a-time & compiled execution

• Block-at-a-time: executes one block of data and one function after the next in the query pipeline

• Compiled: generates execution code at runtime and executes the operators of a pipeline together

• It is unclear which execution model is most suitable for GPUs
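
The difference between the two execution models can be illustrated on the CPU with a tiny pipeline (selection, then scaling, then sum). This is a sketch under simplifying assumptions: it does not use CoGaDB, the "compiled" variant generates Python source instead of GPU code, and all names are made up for the example.

```python
def select_block(block, threshold):
    """Operator 1: keep values greater than the threshold."""
    return [x for x in block if x > threshold]


def scale_block(block, factor):
    """Operator 2: multiply every value by a constant."""
    return [x * factor for x in block]


def block_at_a_time(blocks, threshold, factor):
    """Run one operator after the other on each block; every operator
    materializes an intermediate result for the next one."""
    total = 0
    for block in blocks:
        selected = select_block(block, threshold)
        scaled = scale_block(selected, factor)
        total += sum(scaled)
    return total


def compile_pipeline(threshold, factor):
    """Generate the source of one fused loop at runtime and compile it;
    the constants are baked into the generated code."""
    source = (
        "def pipeline(blocks):\n"
        "    total = 0\n"
        "    for block in blocks:\n"
        "        for x in block:\n"
        f"            if x > {threshold!r}:\n"
        f"                total += x * {factor!r}\n"
        "    return total\n"
    )
    namespace = {}
    exec(source, namespace)
    return namespace["pipeline"]


blocks = [list(range(i, i + 4)) for i in range(0, 12, 4)]
result_blockwise = block_at_a_time(blocks, threshold=5, factor=2)
result_compiled = compile_pipeline(threshold=5, factor=2)(blocks)
```

Both variants compute the same result; they differ in how much intermediate data is materialized and in the one-time cost of code generation, which is what an evaluation suite would have to measure.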

We’ve got

• GPU accelerated DBMS - CoGaDB

• Functionalities for b-a-a-t and compiled execution

• Concepts for evaluating the models on a stand-alone CPU

Your Task

• Literature Research: Execution models for heterogeneous hardware

• Understanding execution of CoGaDB

• Developing evaluation suite for GPUs

• Implementation of missing operations and the evaluation suite constructs

• Critical analysis of the results and investigation of the resulting values


Topic 6 - In-depth Semi-Structured Data Storage

Intro

• Management of semi-structured data (e.g., JSON) has become daily business

• A wide range of formats for physical storage of JSON-like data exists today (e.g., plain text, BSON, CBOR, Avro, Parquet, FlatBuffers, MessagePack, Smile, UBJSON, Ion, Carbon)

• However, a comprehensive comparison of design goals, concrete data model, limits and benefits, as well as operators, on an abstract, theoretical level is missing

We’ve got

• Carbon (Columnar Binary JSON) implementation and specification as starting point

• Ongoing theses on the evaluation of Carbon vs. BSON

Your Task

• Literature Research: Semi-Structured Data Model (math model for JSON-like data), and existing evaluations for formats mentioned above → understanding on an abstract level

• Research for existing comparisons, design goals, and applications of models above

• Creation of common theoretical model and common taxonomy for formats above

• List missing evaluations, top 5 formats, and come up with at least 5 further advanced and reasonable metrics not considered so far

• Implementation of an evaluation framework with new metrics and missing evaluations as proof of concept for at least 5 formats (plain text, BSON, UBJSON, and Carbon excluded)


Topic 7 - Order-By Queries in Elf

Intro

• Elf: multi-dimensional main memory index structure for efficient selections

• Stores data sorted in a multi-dimensional order

• Common data-intensive operator: Sorting

We’ve got

• Elf implementation in C++

Your Task

• Literature Research: Related index structures and sorting algorithms

• Understanding of the Elf and its optimization concepts

• Implementation of a sorting operator for Elf

• Performance evaluation against sequential scans


Topic 7 - Order-By Queries in Elf

[Figure not reproduced: small Elf example over columns C1 to C4 with DimensionLists (1) to (5) and tuple identifiers T1 to T3]

Accelerating multi-column selection predicates in main-memory – the Elf approach

David Broneske, Veit Köppen, Gunter Saake (University of Magdeburg), Martin Schäler (Karlsruhe Institute of Technology)

Abstract: Evaluating selection predicates is a data-intensive task that reduces intermediate results, which are the input for further operations. With analytical queries getting more and more complex, the number of evaluated selection predicates per query and table rises, too. This leads to numerous multi-column selection predicates. Recent approaches to increase the performance of main-memory databases for selection-predicate evaluation aim at optimally exploiting the speed of the CPU by using accelerated scans. However, scanning each column one by one leaves tuning opportunities open that arise if all predicates are considered together. To this end, we introduce Elf, an index structure that is able to exploit the relation between several selection predicates. Elf features cache sensitivity, an optimized storage layout, fixed search paths, and slight data compression. In our evaluation, we compare its query performance to state-of-the-art approaches and a sequential scan using SIMD capabilities. Our results indicate a clear superiority of our approach for multi-column selection predicate queries with a low combined selectivity. For TPC-H queries with multi-column selection predicates, we achieve a speed-up between a factor of five and two orders of magnitude, mainly depending on the selectivity of the predicates.

I. INTRODUCTION

Predicate evaluation is an important task in current OLAP (Online Analytical Processing) scenarios [1]. To extract necessary data for reports, fact and dimension tables are passed through several filter predicates involving several columns. For example, a typical TPC-H query involving several column predicates is Q6, whose WHERE-clause is visualized in Fig. 1(a). We name such a collection of predicates on several columns in the WHERE-clause a multi-column selection predicate. Multi-column selection predicate evaluation is performed as early as possible in the query plan, because it shrinks the intermediate results to a more manageable size. This task has become even more important, when all data fits into main memory, because the I/O bottleneck is eliminated and, hence, a full table scan becomes less expensive.

In case all data sets are available in main memory (e.g., in a main-memory database system [2], [3], [4]), the selectivity threshold for using an index structure instead of an optimized full table scan is even smaller than for disk-based database systems. In a recent study, Das et al. propose to use an index structure for very low selectivities only, such as values under 2 % [5]. Hence, most OLAP queries would never use an index structure to evaluate the selection predicates. To illustrate this, we visualize the selectivity of each selection predicate for the TPC-H Query Q6 in Fig. 1(b). All of its single predicate selectivities are above the threshold of 2 % and, thus, would prefer an accelerated scan per predicate.

(a) WHERE-clause of Q6:
Q6.1: l_shipdate >= [DATE] and l_shipdate < [DATE] + '1 year'
Q6.2: and l_discount between [DISCOUNT] - 0.01 and [DISCOUNT] + 0.01
Q6.3: and l_quantity < [QUANTITY]

[Plots not reproduced: (b) selectivity of Q6 and of Q6.1, Q6.2, Q6.3 (axis marks at 2 %, 20 %, 40 %); (c) response time in ms of SIMD Seq and Elf (axis marks at 50, 100, 150)]

Fig. 1. (a) WHERE-clause, (b) selectivity, and (c) response time of TPC-H query Q6 and its predicates Q6.1 - Q6.3 on Lineitem table scale factor 100

However, an interesting fact neglected by this approach is that the accumulated selectivity of the multi-column selection predicates (1.72 % for Q6) is below the 2 % threshold. Hence, an index structure would be favored if it could exploit the relation between all selection predicates of the query. Consequently, when considering multi-column selection predicates, we achieve the selectivity required to use an index structure instead of an accelerated scan.
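
The effect described here (each predicate alone is not selective, the conjunction is) can be reproduced with a few lines on synthetic data. The sketch below is only an illustration: the column names, value ranges, and uniform distributions are assumptions and do not reproduce the TPC-H data or the percentages reported in the paper.

```python
import random


def selectivity(rows, predicate):
    """Fraction of rows that satisfy the predicate."""
    return sum(1 for r in rows if predicate(r)) / len(rows)


# Synthetic table with independent, uniformly distributed columns (assumed).
rng = random.Random(1)
rows = [
    {
        "ship_year": rng.randint(1992, 1998),
        "discount_pct": rng.randint(0, 10),
        "quantity": rng.randint(1, 50),
    }
    for _ in range(100_000)
]

# Three predicates in the style of Q6.1 to Q6.3.
predicates = {
    "Q6.1": lambda r: r["ship_year"] == 1994,
    "Q6.2": lambda r: 5 <= r["discount_pct"] <= 7,
    "Q6.3": lambda r: r["quantity"] < 24,
}

single = {name: selectivity(rows, p) for name, p in predicates.items()}
combined = selectivity(rows, lambda r: all(p(r) for p in predicates.values()))
```

Because the conjunction can only remove rows, `combined` is never larger than the smallest single selectivity; with independent columns it is roughly the product of the three values.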

In this paper, we examine the question: How can we exploit the combined selectivity of multi-column selection predicates in order to speed up predicate evaluation? As a solution for efficient multi-column selection predicate evaluation, we propose Elf, an index structure that is able to exploit the relation between data of several columns. Using Elf results in performance benefits from several factors up to two orders of magnitude in comparison to accelerated scans, e.g., a scan using single instruction multiple data (SIMD). About factor 18 can be achieved for Q6 on a Lineitem table of scale factor s = 100, as visible in Fig. 1(c).

Elf is a tree structure combining prefix-redundancy elimination with an optimized memory layout explicitly designed for efficient main-memory access. Since the upper levels represent the paths to a lot of data, we use a memory layout that resembles that of a column store. This layout allows to prune the search space efficiently in the upper layers. Following the paths deeper to the leaves of the tree, the node entries are representing lesser and lesser data. Thus, it makes sense to switch to a memory layout that resembles a row store, because a row store is more efficient when accessing several columns of one tuple.

Author Copy of: David Broneske, Veit Köppen, Gunter Saake, Martin Schäler. Accelerating multi-column selection predicates in main-memory – the Elf approach. IEEE 33rd International Conference on Data Engineering (ICDE), 2017, pp. 647-658, DOI 10.1109/ICDE.2017.118

A. Conceptual design

In the following, we explain the basic design with the help of the example data in Table II. The data set shows four columns to be indexed and a tuple identifier (TID) that uniquely identifies each row (e.g., the row id in a column store).

C1  C2  C3  C4  ...  TID
0   1   0   1   ...  T1
0   2   0   0   ...  T2
1   0   1   0   ...  T3

TABLE II. RUNNING EXAMPLE DATA

In Fig. 2, we depict the resulting Elf for the four indexed columns of the example data from Table II. The Elf tree structure maps distinct values of one column to DimensionLists at a specific level in the tree. In the first column, there are two distinct values, 0 and 1. Thus, the first DimensionList, L(1), contains two entries and one pointer for each entry. The respective pointer points to the beginning of the respective DimensionList of the second column, L(2) and L(3). Note, as the first two points share the same value in the first column, we observe a prefix redundancy elimination. In the second column, we cannot eliminate any prefix redundancy, as all attribute combinations in this column are unique. As a result, the third column contains three DimensionLists: L(4), L(5), and L(6). In the final DimensionList, the structure of the entries changes. While in an intermediate DimensionList, an entry consists of a value and a pointer, the pointer in the final dimension is interpreted as a tuple identifier (TID).

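[Figure not reproduced: Elf tree for the example data, with DimensionList (1) in column C1, (2) and (3) in column C2, (4) to (6) in column C3, and (7) to (9) in column C4, whose entries refer to T1, T2, and T3]

Fig. 2. Elf tree structure using prefix-redundancy elimination.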
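
The conceptual structure can be rebuilt for the running example with a short sketch. It is a simplified illustration, not the authors' C++ implementation: DimensionLists are modelled as sorted Python lists of `(value, child)` entries, the last level stores a list of TIDs (an assumption to allow duplicate rows), and the names `build_elf` and `lists_per_level` are hypothetical.

```python
def build_elf(rows):
    """rows: list of (values, tid). Returns the root DimensionList, i.e. a
    sorted list of (value, child) entries; child is the next DimensionList
    or, in the last column, the list of matching TIDs."""
    num_columns = len(rows[0][0])

    def build(items, depth):
        # Group rows by their value in the current column: every distinct
        # value is stored only once per prefix.
        groups = {}
        for values, tid in items:
            groups.setdefault(values[depth], []).append((values, tid))
        entries = []
        for value in sorted(groups):
            if depth == num_columns - 1:
                child = [tid for _, tid in groups[value]]
            else:
                child = build(groups[value], depth + 1)
            entries.append((value, child))
        return entries

    return build(rows, 0)


def lists_per_level(dimension_list, num_columns, depth=0, counts=None):
    """Count the DimensionLists on each level of the tree."""
    counts = [0] * num_columns if counts is None else counts
    counts[depth] += 1
    if depth < num_columns - 1:
        for _, child in dimension_list:
            lists_per_level(child, num_columns, depth + 1, counts)
    return counts


# Running example from Table II.
TABLE_II = [((0, 1, 0, 1), "T1"), ((0, 2, 0, 0), "T2"), ((1, 0, 1, 0), "T3")]
elf = build_elf(TABLE_II)
level_counts = lists_per_level(elf, num_columns=4)
```

For the example data, `level_counts` reflects the structure described above: one list for C1, two for C2, and three each for C3 and C4.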

The conceptual Elf structure is designed from the idea of prefix-redundancy elimination in Section II-B and the properties of multi-column selection predicates. To this end, it features the following properties to accelerate multi-column selection predicates on the conceptual level:

Prefix-redundancy elimination: Attribute values are mainly clustered, appear repeatedly, and share the same prefix. Thus, Elf exploits this redundancy as each distinct value per prefix exists only once in a DimensionList to reduce the amount of stored and queried data.

Ordered node elements: Each DimensionList is an ordered list of entries. This property is beneficial for equality or range predicates, because we can stop the search in a list if the current value is bigger than the searched constant/range.

Fixed depth: Since, a column of a table corresponds to a level in the Elf, for a table with n columns, we have to descend at most n nodes to find the corresponding TID. This sets an upper bound on the search cost that does not depend on the amount of stored tuples, but mostly on the amount of used columns.

In summary, our index structure is a bushy tree structure of a fixed height resulting in stable search paths that allows for efficient multi-column selection predicate evaluation on a conceptual level. To further optimize such queries, we also need to optimize the memory layout of the Elf approach.
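
How the ordered entries and the fixed depth are used during a multi-column range selection can be sketched on the conceptual tree of the running example. This is an illustrative sketch under the same modelling assumptions as before (nested Python lists, TID lists in the last level); the function name `search` and the hand-written `ELF` literal are not taken from the paper.

```python
# Conceptual Elf for the running example (Table II), written out by hand.
# Each DimensionList is a sorted list of (value, child) entries.
ELF = [
    (0, [(1, [(0, [(1, ["T1"])])]),
         (2, [(0, [(0, ["T2"])])])]),
    (1, [(0, [(1, [(0, ["T3"])])])]),
]


def search(dimension_list, ranges, depth=0):
    """Return the TIDs of all tuples whose value in column i lies within
    ranges[i] = (low, high), for every indexed column."""
    low, high = ranges[depth]
    last_column = depth == len(ranges) - 1
    result = []
    for value, child in dimension_list:
        if value > high:
            break          # entries are ordered: nothing further can match
        if value < low:
            continue       # below the range: try the next entry
        if last_column:
            result.extend(child)
        else:
            result.extend(search(child, ranges, depth + 1))
    return result


hits_a = search(ELF, [(0, 0), (0, 2), (0, 0), (0, 1)])
hits_b = search(ELF, [(0, 1), (0, 0), (0, 1), (0, 1)])
```

The recursion never goes deeper than the number of indexed columns, and a whole subtree is pruned as soon as one column value falls outside its range.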

B. Improving Elf’s memory layout

The straight-forward implementation of Elf is similar to data structures used in other tree-based index structures. However, this creates an OLTP-optimized version of the Elf, which we call InsertElf. To enhance OLAP query performance, we use an explicit memory layout, meaning that Elf is linearized into an array of integer values. For simplicity of explanation, we assume that column values and pointers within Elf are 64-bit integer values. However, our approach is not restricted to this data type. Thus, we can also use 64 bits for pointers and 32 bits for values, which is the most common case.

1) Mapping DimensionLists to arrays: To store the node entries – in the following named DimensionElements – of Elf, we use two integers. Since we expect the largest performance impact for scanning these potentially long DimensionLists, our first design principle is adjacency of the DimensionElements of one DimensionList, which leads to a preorder traversal during linearization. To illustrate this, we depict the linearized Elf from Fig. 2 in Fig. 3. The first DimensionList, L(1), starts at position 0 and has two DimensionElements: E(1), with the value 0 and the pointer 04 (depicted with brackets around it), and E(2), with the value 1 and the pointer 16 (the negativity of the value 1 marks the end of the list and is explained in the next subsection). For explanatory reasons, we highlight DimensionLists with alternating colors.

Offset    Content (value, then [pointer] or TID)             DimensionLists
ELF[00]   0 [04]  -1 [16]  1 [08]  -2 [12]  -0 [10]          (1), (2), (4)
ELF[10]   -1 T1   -0 [14]  -0 T2   -0 [18]  -1 [20]          (7), (5), (8), (3), (6)
ELF[20]   -0 T3                                              (9)

Fig. 3. Memory layout as an array of 64-bit integers

The pointers in the first list indicate that the DimensionLists in the second column, L(2) and L(3) (cf. Fig. 2), start at offset 04 and 16, respectively. This mechanism works for any subsequent DimensionList analogously, except for those in the final column (C4). In the final column, the second part of a DimensionElement is not a pointer within the Elf array, but a TID, which we encode as an integer as well. The order of DimensionLists is defined to support a depth-first search with expected low hit rates within the DimensionLists. To this end, we first store a complete DimensionList and then recursively store the remaining lists starting at the first element. We repeat this procedure until we reach the final column.
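
The linearized layout can be tried out with a small lookup routine over the array from Fig. 3. The sketch makes several assumptions for the sake of a runnable example: the end-of-list marker (shown as a negative value in the figure) is modelled as a set most significant bit of a 64-bit value, TIDs T1 to T3 are encoded as the integers 1 to 3, and the function name `lookup` is hypothetical.

```python
END = 1 << 63  # assumed encoding: MSB set marks the last entry of a list

# Linearized Elf of the running example; each entry is (value, pointer/TID).
ELF = [
    0, 4, 1 | END, 16,   # L(1) at offset 0
    1, 8, 2 | END, 12,   # L(2) at offset 4
    0 | END, 10,         # L(4) at offset 8
    1 | END, 1,          # L(7) at offset 10, TID T1
    0 | END, 14,         # L(5) at offset 12
    0 | END, 2,          # L(8) at offset 14, TID T2
    0 | END, 18,         # L(3) at offset 16
    1 | END, 20,         # L(6) at offset 18
    0 | END, 3,          # L(9) at offset 20, TID T3
]


def lookup(elf, point):
    """Exact-match lookup of one value per column; returns the TID or None."""
    offset = 0
    for depth, wanted in enumerate(point):
        while True:
            raw = elf[offset]
            value, is_last = raw & ~END, bool(raw & END)
            if value == wanted:
                break
            if value > wanted or is_last:
                return None   # ordered list: the value cannot follow
            offset += 2       # next DimensionElement of the same list
        target = elf[offset + 1]
        if depth == len(point) - 1:
            return target     # final column: second integer is the TID
        offset = target       # otherwise: jump to the child list
    return None


tid = lookup(ELF, (0, 2, 0, 0))
```

Since all elements of one DimensionList are adjacent, scanning a list is a sequential walk through memory, and only the jump to a child list is a random access.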

2) Implicit length control of arrays: The second design principle is size reduction. To this end, we store only values and pointers, but not the size of the DimensionLists. To indicate the end of such a list, we utilize the most significant bit (MSB) of the value. Thus, whenever we encounter a

Example data

Elf data structure

Performance gains

Reorganization


Topic 8 - Parallel Build of Elf

Intro

• Elf: multi-dimensional main memory index structure for efficient selections

• Stores data sorted in a multi-dimensional order

• Building involves a multi-dimensional sort → expensive
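
One conceivable parallelization strategy for the multi-dimensional sort is sketched below: partition the rows by their value in the first column, sort the partitions independently, and concatenate them in key order. This is only one possible strategy under simplifying assumptions and not the method of the existing Elf implementation; Python threads merely show the structure (no speed-up is claimed), and the function name is hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def parallel_multidim_sort(rows, workers=4):
    """Sort tuples lexicographically by partitioning on the first column."""
    # Step 1: partition by first-column value (sequential in this sketch).
    partitions = {}
    for row in rows:
        partitions.setdefault(row[0], []).append(row)
    keys = sorted(partitions)
    # Step 2: sort every partition independently; tuples compare
    # lexicographically, i.e. in multi-dimensional order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_parts = list(pool.map(sorted, (partitions[k] for k in keys)))
    # Step 3: concatenating the partitions in key order gives the total order.
    return [row for part in sorted_parts for row in part]


# Synthetic three-column data (an assumption for the example).
rng = random.Random(7)
rows = [(rng.randint(0, 3), rng.randint(0, 9), rng.randint(0, 9))
        for _ in range(200)]
ordered = parallel_multidim_sort(rows)
```

A skewed first column leads to unbalanced partitions, which is one of the aspects a comparison of parallelization strategies would need to cover.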

We’ve got

• Elf implementation in C++

Your Task

• Literature Research: Parallelization strategies for building index structures

• Understanding of the Elf and its optimization concepts

• Implementation of different parallelization strategies for Elf

• Performance evaluation against standard implementation

Topic 9 - GPU-Accelerated Selections in Elf

Intro

• Elf: multi-dimensional main memory index structure for efficient selections

• Stores data sorted in a multi-dimensional order

• Multi-core architectures demand a clever parallelization strategy for GPUs

We’ve got

• Elf implementation in C++

• Parallel search and insert for CPU

Your Task

• Literature Research: Related parallelization strategies for index structures

• Understanding of the Elf and its optimization concepts

• Implementation of GPU traversal variants for Elf

• Performance evaluation against serial/CPU-parallel implementation


Finding your Team

Topics:

• Topic 1 - Fragment Skipper (Gridformation)

• Topic 2 - Deep hashing (Sim-Skip)

• Topic 3 - Graph processing in a neural network (Malko)

• Topic 4 - Cross-device data parallel query processing

• Topic 5 - Evaluating GPU-based Execution Models

• Topic 6 - In-depth Semi-Structured Data Storage

• Topic 7 - Order-By Queries in Elf

• Topic 8 - Parallel Build of Elf

• Topic 9 - GPU-Accelerated Selections in Elf

When do we meet for the programming test?


Literature Research


How to Perform Literature Research

• Efficient literature research requires

Knowledge of Where to search

Knowledge of How to search

Finding adequate search terms

Structured review of papers

Knowledge of how to find information in papers


Where to Search (I)

• Several websites provide large literature databases

1. Google Scholar: http://scholar.google.de/

Keyword and concrete paper search

Often, PDFs are provided

2. DBLP: http://www.informatik.uni-trier.de/~ley/db/

Search for keyword, conferences, journals, author(s)

BibTeX and references to other websites

3. Citeseer: http://citeseerx.ist.psu.edu/about/site

Keyword, full-text, author, and title search

BibTeX and (partially) PDFs are provided


Where to Search (II)

• Publisher sites are also a suitable target

• ACM Digital Library: http://portal.acm.org/dl.cfm

Keyword, author, conference/literature (proceedings), and title search

BibTeX, mostly PDFs, and other information are provided

• IEEE Xplore: http://ieeexplore.ieee.org/Xplore/guesthome.jsp?reload=true

Similar to ACM, but only few PDFs

Extended access within university network

• Springer: http://www.springerlink.de/

Similar to previous

Extended access within university network

• Further search possibilities: on author, research group, or university sites


How to Search

Some hints to not get lost in the jungle

• Use distinct keywords (fingerprint vs. fingerprint data)

• Keep keywords simple (at most three words)

• Otherwise, search for the whole title

• Read abstract (and maybe introduction) ⇒ decide on relevance
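
The keyword hints above can be sketched as a small helper. This is only an illustration: the function names `plan_search` and `dblp_url` are hypothetical, the three-word threshold mirrors the hint on this slide, and the URL pattern assumes DBLP's public publication search API (`https://dblp.org/search/publ/api` with the `q`, `format`, and `h` parameters).

```python
from urllib.parse import urlencode

MAX_KEYWORDS = 3  # "at most three words" from the hints above
DBLP_API = "https://dblp.org/search/publ/api"  # assumed search endpoint


def plan_search(text):
    """Classify the input as a keyword query or a whole-title query."""
    words = text.split()
    query = " ".join(words)  # normalizes repeated whitespace
    if len(words) <= MAX_KEYWORDS:
        return ("keywords", query)
    # Longer inputs are assumed to be a paper title.
    return ("title", query)


def dblp_url(query, max_hits=10):
    """Build a search URL; no network request is made here."""
    params = {"q": query, "format": "json", "h": max_hits}
    return DBLP_API + "?" + urlencode(params)


kind, query = plan_search("fingerprint data")
url = dblp_url(query)
```

The resulting URL can be opened in a browser or fetched with any HTTP client; reading the abstract of each hit then decides relevance, as noted above.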

First insights

• Read abstract, introduction and background/related work (coarse-grained) to

. . . get a first idea of the approach

. . . find other relevant papers


Information Retrieval

Finding the required information

• Read the paper carefully

• Omit formal parts/sections

• Try to classify (core idea, main characteristics) ⇒ develop classification/evaluation in mind

• Understand the big picture

• Make notes

• Do NOT translate each sentence


